Search | arXiv e-print repository

arXiv:2405.19035 [pdf, other]

A Good Foundation is Worth Many Labels: Label-Efficient Panoptic Segmentation

Authors: Niclas Vödisch, Kürsat Petek, Markus Käppeler, Abhinav Valada, Wolfram Burgard

Abstract: A key challenge for the widespread application of learning-based models for robotic perception is to significantly reduce the required amount of annotated training data while achieving accurate predictions. This is essential not only to decrease operating costs but also to speed up deployment time. In this work, we address this challenge for PAnoptic SegmenTation with fEw Labels (PASTEL) by exploiting the groundwork paved by visual foundation models. We leverage descriptive image features from such a model to train two lightweight network heads for semantic segmentation and object boundary detection, using very few annotated training samples. We then merge their predictions via a novel fusion module that yields panoptic maps based on normalized cut. To further enhance the performance, we utilize self-training on unlabeled images selected by a feature-driven similarity scheme. We underline the relevance of our approach by applying PASTEL to important robot perception use cases from autonomous driving and agricultural robotics. In extensive experiments, we demonstrate that PASTEL significantly outperforms previous methods for label-efficient segmentation even when using fewer annotations. The code of our work is publicly available at http://pastel.cs.uni-freiburg.de.

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.18852 [pdf, other]

LetsMap: Unsupervised Representation Learning for Semantic BEV Mapping

Authors: Nikhil Gosala, Kürsat Petek, B Ravi Kiran, Senthil Yogamani, Paulo Drews-Jr, Wolfram Burgard, Abhinav Valada

Abstract: Semantic Bird's Eye View (BEV) maps offer a rich representation with strong occlusion reasoning for various decision making tasks in autonomous driving. However, most BEV mapping approaches employ a fully supervised learning paradigm that relies on large amounts of human-annotated BEV ground truth data. In this work, we address this limitation by proposing the first unsupervised representation learning approach to generate semantic BEV maps from a monocular frontal view (FV) image in a label-efficient manner. Our approach pretrains the network to independently reason about scene geometry and scene semantics using two disjoint neural pathways in an unsupervised manner and then finetunes it for the task of semantic BEV mapping using only a small fraction of labels in the BEV. We achieve label-free pretraining by exploiting spatial and temporal consistency of FV images to learn scene geometry while relying on a novel temporal masked autoencoder formulation to encode the scene representation. Extensive evaluations on the KITTI-360 and nuScenes datasets demonstrate that our approach performs on par with the existing state-of-the-art approaches while using only 1% of BEV labels and no additional labeled data.

Submitted 29 May, 2024; originally announced May 2024.

Comments: 23 pages, 5 figures

arXiv:2404.17298 [pdf, other]

Automatic Target-Less Camera-LiDAR Calibration From Motion and Deep Point Correspondences

Authors: Kürsat Petek, Niclas Vödisch, Johannes Meyer, Daniele Cattaneo, Abhinav Valada, Wolfram Burgard

Abstract: Sensor setups of robotic platforms commonly include both camera and LiDAR as they provide complementary information. However, fusing these two modalities typically requires a highly accurate calibration between them. In this paper, we propose MDPCalib, a novel method for camera-LiDAR calibration that requires neither human supervision nor any specific target objects. Instead, we utilize sensor motion estimates from visual and LiDAR odometry as well as deep learning-based 2D-pixel-to-3D-point correspondences that are obtained without in-domain retraining. We represent the camera-LiDAR calibration as a graph optimization problem and minimize the costs induced by constraints from sensor motion and point correspondences. In extensive experiments, we demonstrate that our approach yields highly accurate extrinsic calibration parameters and is robust to random initialization. Additionally, our approach generalizes to a wide range of sensor setups, which we demonstrate by employing it on various robotic platforms including a self-driving perception car, a quadruped robot, and a UAV. To make our calibration method publicly accessible, we release the code on our project website at http://calibration.cs.uni-freiburg.de.

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2403.17846 [pdf, other]

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

Authors: Abdelrhman Werby, Chenguang Huang, Martin Büchner, Abhinav Valada, Wolfram Burgard

Abstract: Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features. While these maps allow for the prediction of point-wise saliency maps when queried for a certain language concept, large-scale environments and abstract queries beyond the object level still pose a considerable hurdle, ultimately limiting language-grounded robotic navigation. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for language-grounded robot navigation. Leveraging open-vocabulary vision foundation models, we first obtain state-of-the-art open-vocabulary segment-level maps in 3D and subsequently construct a 3D scene graph hierarchy consisting of floor, room, and object concepts, each enriched with open-vocabulary features. Our approach is able to represent multi-story buildings and allows robotic traversal of those using a cross-floor Voronoi graph. HOV-SG is evaluated on three distinct datasets and surpasses previous baselines in open-vocabulary semantic accuracy on the object, room, and floor level while producing a 75% reduction in representation size compared to dense open-vocabulary maps. In order to prove the efficacy and generalization capabilities of HOV-SG, we showcase successful long-horizon language-conditioned robot navigation within real-world multi-story environments. We provide code and trial video data at http://hovsg.github.io/.

Submitted 3 June, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

Comments: Code and video are available at http://hovsg.github.io/

arXiv:2403.14305 [pdf, other]

Bayesian Optimization for Sample-Efficient Policy Improvement in Robotic Manipulation

Authors: Adrian Röfer, Iman Nematollahi, Tim Welschehold, Wolfram Burgard, Abhinav Valada

Abstract: Sample-efficient learning of manipulation skills poses a major challenge in robotics. While recent approaches demonstrate impressive advances in the type of task that can be addressed and the sensing modalities that can be incorporated, they still require large amounts of training data. Especially with regard to learning actions on robots in the real world, this poses a major problem due to the high costs associated with both demonstrations and real-world robot interactions. To address this challenge, we introduce BOpt-GMM, a hybrid approach that combines imitation learning with own experience collection. We first learn a skill model as a dynamical system encoded in a Gaussian Mixture Model from a few demonstrations. We then improve this model with Bayesian optimization building on a small number of autonomous skill executions in a sparse reward setting. We demonstrate the sample efficiency of our approach on multiple complex manipulation skills in both simulations and real-world experiments. Furthermore, we make the code and pre-trained models publicly available at http://bopt-gmm.cs.uni-freiburg.de.

Submitted 21 March, 2024; originally announced March 2024.

Comments: 7 pages, 5 figures, 2 tables, submitted to IROS2024

arXiv:2403.11914 [pdf, other]

Single-Agent Actor Critic for Decentralized Cooperative Driving

Authors: Shengchao Yan, Lukas König, Wolfram Burgard

Abstract: Active traffic management incorporating autonomous vehicles (AVs) promises a future with diminished congestion and enhanced traffic flow. However, developing algorithms for real-world application requires addressing the challenges posed by continuous traffic flow and partial observability. To bridge this gap and advance the field of active traffic management towards greater decentralization, we introduce a novel asymmetric actor-critic model aimed at learning decentralized cooperative driving policies for autonomous vehicles using single-agent reinforcement learning. Our approach employs attention neural networks with masking to handle the dynamic nature of real-world traffic flow and partial observability. Through extensive evaluations against baseline controllers across various traffic scenarios, our model shows great potential for improving traffic flow at diverse bottleneck locations within the road system. Additionally, we explore the challenge associated with the conservative driving behaviors of autonomous vehicles that adhere strictly to traffic regulations. The experimental results illustrate that our proposed cooperative policy can mitigate potential traffic slowdowns without compromising safety.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.11761 [pdf, other]

BEVCar: Camera-Radar Fusion for BEV Map and Object Segmentation

Authors: Jonas Schramm, Niclas Vödisch, Kürsat Petek, B Ravi Kiran, Senthil Yogamani, Wolfram Burgard, Abhinav Valada

Abstract: Semantic scene segmentation from a bird's-eye-view (BEV) perspective plays a crucial role in facilitating planning and decision-making for mobile robots. Although recent vision-only methods have demonstrated notable advancements in performance, they often struggle under adverse illumination conditions such as rain or nighttime. While active sensors offer a solution to this challenge, the prohibitively high cost of LiDARs remains a limiting factor. Fusing camera data with automotive radars offers a less expensive alternative but has received less attention in prior research. In this work, we aim to advance this promising avenue by introducing BEVCar, a novel approach for joint BEV object and map segmentation. The core novelty of our approach lies in first learning a point-based encoding of raw radar data, which is then leveraged to efficiently initialize the lifting of image features into the BEV space. We perform extensive experiments on the nuScenes dataset and demonstrate that BEVCar outperforms the current state of the art. Moreover, we show that incorporating radar information significantly enhances robustness in challenging environmental conditions and improves segmentation performance for distant objects. To foster future research, we provide the weather split of the nuScenes dataset used in our experiments, along with our code and trained models at http://bevcar.cs.uni-freiburg.de.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2402.07691 [pdf, other]

Evaluation of a Smart Mobile Robotic System for Industrial Plant Inspection and Supervision

Authors: Georg K. J. Fischer, Max Bergau, D. Adriana Gómez-Rosal, Andreas Wachaja, Johannes Gräter, Matthias Odenweller, Uwe Piechottka, Fabian Hoeflinger, Nikhil Gosala, Niklas Wetzel, Daniel Büscher, Abhinav Valada, Wolfram Burgard

Abstract: Automated and autonomous industrial inspection is a longstanding research field, driven by the necessity to enhance safety and efficiency within industrial settings. In addressing this need, we introduce an autonomously navigating robotic system designed for comprehensive plant inspection. This innovative system comprises a robotic platform equipped with a diverse array of sensors integrated to facilitate the detection of various process and infrastructure parameters. These sensors encompass optical (LiDAR, Stereo, UV/IR/RGB cameras), olfactory (electronic nose), and acoustic (microphone array) capabilities, enabling the identification of factors such as methane leaks, flow rates, and infrastructural anomalies. The proposed system underwent individual evaluation at a wastewater treatment site within a chemical plant, providing a practical and challenging environment for testing. The evaluation process encompassed key aspects such as object detection, 3D localization, and path planning. Furthermore, specific evaluations were conducted for optical methane leak detection and localization, as well as acoustic assessments focusing on pump equipment and gas leak localization.

Submitted 12 February, 2024; originally announced February 2024.

Comments: Submitted for publication in IEEE Sensors Journal

arXiv:2402.05840 [pdf, other]

uPLAM: Robust Panoptic Localization and Mapping Leveraging Perception Uncertainties

Authors: Kshitij Sirohi, Daniel Büscher, Wolfram Burgard

Abstract: The availability of a robust map-based localization system is essential for the operation of many autonomously navigating vehicles. Since uncertainty is an inevitable part of perception, it is beneficial for the robustness of the robot to consider it in typical downstream tasks of navigation stacks. In particular, localization and mapping methods, which in modern systems often employ convolutional neural networks (CNNs) for perception tasks, require proper uncertainty estimates. In this work, we present uncertainty-aware Panoptic Localization and Mapping (uPLAM), which employs pixel-wise uncertainty estimates for panoptic CNNs as a bridge to fuse modern perception with classical probabilistic localization and mapping approaches. Beyond the perception, we introduce an uncertainty-based map aggregation technique to create accurate panoptic maps, containing surface semantics and landmark instances. Moreover, we provide cell-wise map uncertainties, and present a particle filter-based localization method that employs perception uncertainties. Extensive evaluations show that our proposed incorporation of uncertainties leads to more accurate maps with reliable uncertainty estimates and improved localization accuracy. Additionally, we present the Freiburg Panoptic Driving dataset for evaluating panoptic mapping and localization methods. We make our code and dataset available at http://uplam.cs.uni-freiburg.de.

Submitted 20 March, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

arXiv:2312.08240 [pdf, other]

CenterGrasp: Object-Aware Implicit Representation Learning for Simultaneous Shape Reconstruction and 6-DoF Grasp Estimation

Authors: Eugenio Chisari, Nick Heppert, Tim Welschehold, Wolfram Burgard, Abhinav Valada

Abstract: Reliable object grasping is a crucial capability for autonomous robots. However, many existing grasping approaches focus on general clutter removal without explicitly modeling objects and thus rely only on the visible local geometry. We introduce CenterGrasp, a novel framework that combines object awareness and holistic grasping. CenterGrasp learns a general object prior by encoding shapes and valid grasps in a continuous latent space. It consists of an RGB-D image encoder that leverages recent advances to detect objects and infer their pose and latent code, and a decoder to predict shape and grasps for each object in the scene. We perform extensive experiments on simulated as well as real-world cluttered scenes and demonstrate strong scene reconstruction and 6-DoF grasp-pose estimation performance. Compared to the state of the art, CenterGrasp achieves an improvement of 38.5 mm in shape reconstruction and 33 percentage points on average in grasp success. We make the code and trained models publicly available at http://centergrasp.cs.uni-freiburg.de.

Submitted 5 April, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: Accepted at RA-L. Video, code and models available at http://centergrasp.cs.uni-freiburg.de

arXiv:2310.15059 [pdf, other]

Robot Skill Generalization via Keypoint Integrated Soft Actor-Critic Gaussian Mixture Models

Authors: Iman Nematollahi, Kirill Yankov, Wolfram Burgard, Tim Welschehold

Abstract: A long-standing challenge for a robotic manipulation system operating in real-world scenarios is adapting and generalizing its acquired motor skills to unseen environments. We tackle this challenge employing hybrid skill models that integrate imitation and reinforcement paradigms, to explore how the learning and adaptation of a skill, along with its core grounding in the scene through a learned keypoint, can facilitate such generalization. To that end, we develop the Keypoint Integrated Soft Actor-Critic Gaussian Mixture Models (KIS-GMM) approach, which learns to predict the reference of a dynamical system within the scene as a 3D keypoint, leveraging visual observations obtained by the robot's physical interactions during skill learning. Through comprehensive evaluations in both simulated and real-world environments, we show that our method enables a robot to gain a significant zero-shot generalization to novel environments and to refine skills in the target environments faster than learning from scratch. Importantly, this is achieved without the need for new ground truth data. Moreover, our method effectively copes with scene displacements.

Submitted 23 October, 2023; originally announced October 2023.

Comments: Accepted at the International Symposium on Experimental Robotics (ISER) 2023. Videos at http://kis-gmm.cs.uni-freiburg.de/

arXiv:2310.08864 [pdf, other]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, et al. (267 additional authors not shown)

Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.

Submitted 1 June, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: Project website: https://robotics-transformer-x.github.io

arXiv:2310.05600 [pdf, other]

Care3D: An Active 3D Object Detection Dataset of Real Robotic-Care Environments

Authors: Michael G. Adam, Sebastian Eger, Martin Piccolrovazzi, Maged Iskandar, Joern Vogel, Alexander Dietrich, Seongjien Bien, Jon Skerlj, Abdeldjallil Naceri, Eckehard Steinbach, Alin Albu-Schaeffer, Sami Haddadin, Wolfram Burgard

Abstract: As labor shortage increases in the health sector, the demand for assistive robotics grows. However, the needed test data to develop those robots is scarce, especially for the application of active 3D object detection, where no real data exists at all. This short paper counters this by introducing such an annotated dataset of real environments. The captured environments represent areas which are already in use in the field of robotic health care research. We further provide ground truth data within one room, for assessing SLAM algorithms running directly on a health care robot.

Submitted 9 October, 2023; originally announced October 2023.

arXiv:2310.05239 [pdf, other]

LAN-grasp: Using Large Language Models for Semantic Object Grasping

Authors: Reihaneh Mirjalili, Michael Krawez, Simone Silenzi, Yannik Blei, Wolfram Burgard

Abstract: In this paper, we propose LAN-grasp, a novel approach towards more appropriate semantic grasping. We use foundation models to provide the robot with a deeper understanding of the objects, the right place to grasp an object, or even the parts to avoid. This allows our robot to grasp and utilize objects in a more meaningful and safe manner. We leverage the combination of a Large Language Model, a Vision Language Model, and a traditional grasp planner to generate grasps demonstrating a deeper semantic understanding of the objects. We first prompt the Large Language Model about which object part is appropriate for grasping. Next, the Vision Language Model identifies the corresponding part in the object image. Finally, we generate grasp proposals in the region proposed by the Vision Language Model. Building on foundation models provides us with a zero-shot grasp method that can handle a wide range of objects without the need for further training or fine-tuning. We evaluated our method in real-world experiments on a custom object data set. We present the results of a survey that asks the participants to choose an object part appropriate for grasping. The results show that the grasps generated by our method are consistently ranked higher by the participants than those generated by a conventional grasping planner and a recent semantic grasping approach.

Submitted 8 October, 2023; originally announced October 2023.

arXiv:2309.10726 [pdf, other]

Few-Shot Panoptic Segmentation With Foundation Models

Authors: Markus Käppeler, Kürsat Petek, Niclas Vödisch, Wolfram Burgard, Abhinav Valada

Abstract: Current state-of-the-art methods for panoptic segmentation require an immense amount of annotated training data that is both arduous and expensive to obtain, posing a significant challenge for their widespread adoption. Concurrently, recent breakthroughs in visual representation learning have sparked a paradigm shift leading to the advent of large foundation models that can be trained with completely unlabeled images. In this work, we propose to leverage such task-agnostic image features to enable few-shot panoptic segmentation by presenting Segmenting Panoptic Information with Nearly 0 labels (SPINO). In detail, our method combines a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation. We show that our approach, albeit being trained with only ten annotated images, predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method. Notably, we demonstrate that SPINO achieves competitive results compared to fully supervised baselines while using less than 0.3% of the ground truth labels, paving the way for learning complex visual recognition tasks leveraging foundation models. To illustrate its general applicability, we further deploy SPINO on real-world robotic vision systems for both outdoor and indoor environments. To foster future research, we make the code and trained models publicly available at http://spino.cs.uni-freiburg.de.

Submitted 1 March, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

Comments: Accepted for "IEEE International Conference on Robotics and Automation (ICRA) 2024"

arXiv:2309.06635 [pdf, other]

Collaborative Dynamic 3D Scene Graphs for Automated Driving

Authors: Elias Greve, Martin Büchner, Niclas Vödisch, Wolfram Burgard, Abhinav Valada

Abstract: Maps have played an indispensable role in enabling safe and automated driving. Although there have been many advances on different fronts ranging from SLAM to semantics, building an actionable hierarchical semantic representation of urban dynamic scenes and processing information from multiple agents are still challenging problems. In this work, we present Collaborative URBan Scene Graphs (CURB-SG) that enable higher-order reasoning and efficient querying for many functions of automated driving. CURB-SG leverages panoptic LiDAR data from multiple agents to build large-scale maps using an effective graph-based collaborative SLAM approach that detects inter-agent loop closures. To semantically decompose the obtained 3D map, we build a lane graph from the paths of ego agents and their panoptic observations of other vehicles. Based on the connectivity of the lane graph, we segregate the environment into intersecting and non-intersecting road areas. Subsequently, we construct a multi-layered scene graph that includes lane information, the position of static landmarks and their assignment to certain map sections, other vehicles observed by the ego agents, and the pose graph from SLAM including 3D panoptic point clouds. We extensively evaluate CURB-SG in urban scenarios using a photorealistic simulator. We release our code at http://curb.cs.uni-freiburg.de.

Submitted 4 March, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

Comments: Accepted for "IEEE International Conference on Robotics and Automation (ICRA) 2024"

arXiv:2308.05612 [pdf, other]

A Smart Robotic System for Industrial Plant Supervision

Authors: D. Adriana Gómez-Rosal, Max Bergau, Georg K. J. Fischer, Andreas Wachaja, Johannes Gräter, Matthias Odenweller, Uwe Piechottka, Fabian Hoeflinger, Nikhil Gosala, Niklas Wetzel, Daniel Büscher, Abhinav Valada, Wolfram Burgard

Abstract: In today's chemical plants, human field operators perform frequent integrity checks to guarantee high safety standards, and thus are possibly the first to encounter dangerous operating conditions. To alleviate their task, we present a system consisting of an autonomously navigating robot integrated with various sensors and intelligent data processing. It is able to detect methane leaks and estimate their flow rate, detect more general gas anomalies, recognize oil films, localize sound sources and detect failure cases, map the environment in 3D, and navigate autonomously, employing recognition and avoidance of dynamic obstacles. We evaluate our system at a wastewater facility in full working conditions. Our results demonstrate that the system is able to robustly navigate the plant and provide useful information about critical operating conditions.

Submitted 1 September, 2023; v1 submitted 10 August, 2023; originally announced August 2023.

Comments: Final submission for IEEE Sensors 2023

arXiv:2307.00488 [pdf, other]

POV-SLAM: Probabilistic Object-Aware Variational SLAM in Semi-Static Environments

Authors: Jingxing Qian, Veronica Chatrath, James Servos, Aaron Mavrinac, Wolfram Burgard, Steven L. Waslander, Angela P. Schoellig

Abstract: Simultaneous localization and mapping (SLAM) in slowly varying scenes is important for long-term robot task completion. Failing to detect scene changes may lead to inaccurate maps and, ultimately, lost robots. Classical SLAM algorithms assume static scenes, and recent works take dynamics into account, but require scene changes to be observed in consecutive frames. Semi-static scenes, wherein objects appear, disappear, or move slowly over time, are often overlooked, yet are critical for long-term operation. We propose an object-aware, factor-graph SLAM framework that tracks and reconstructs semi-static object-level changes. Our novel variational expectation-maximization strategy is used to optimize factor graphs involving a Gaussian-Uniform bimodal measurement likelihood for potentially-changing objects. We evaluate our approach alongside the state-of-the-art SLAM solutions in simulation and on our novel real-world SLAM dataset captured in a warehouse over four months. Our method improves the robustness of localization in the presence of semi-static changes, providing object-level reasoning about the scene.

Submitted 2 July, 2023; originally announced July 2023.

Comments: Published in Robotics: Science and Systems (RSS) 2023

arXiv:2306.16316 [pdf, other]

Learning Continuous Control with Geometric Regularity from Robot Intrinsic Symmetry

Authors: Shengchao Yan, Baohe Zhang, Yuan Zhang, Joschka Boedecker, Wolfram Burgard

Abstract: Geometric regularity, which leverages data symmetry, has been successfully incorporated into deep learning architectures such as CNNs, RNNs, GNNs, and Transformers. While this concept has been widely applied in robotics to address the curse of dimensionality when learning from high-dimensional data, the inherent reflectional and rotational symmetry of robot structures has not been adequately explored. Drawing inspiration from cooperative multi-agent reinforcement learning, we introduce novel network structures for single-agent control learning that explicitly capture these symmetries. Moreover, we investigate the relationship between the geometric prior and the concept of Parameter Sharing in multi-agent reinforcement learning. Last but not least, we implement the proposed framework in online and offline learning methods to demonstrate its ease of use. Through experiments conducted on various challenging continuous control tasks on simulators and real robots, we highlight the significant potential of the proposed geometric regularity in enhancing robot learning capabilities.

Submitted 18 March, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

Comments: accepted by ICRA 2024

arXiv:2306.15410 [pdf, other]

AutoGraph: Predicting Lane Graphs from Traffic Observations

Authors: Jannik Zürn, Ingmar Posner, Wolfram Burgard

Abstract: Lane graph estimation is a long-standing problem in the context of autonomous driving. Previous works aimed at solving this problem by relying on large-scale, hand-annotated lane graphs, introducing a data bottleneck for training models to solve this task. To overcome this limitation, we propose to use the motion patterns of traffic participants as lane graph annotations. In our AutoGraph approach, we employ a pre-trained object tracker to collect the tracklets of traffic participants such as vehicles and trucks. Based on the location of these tracklets, we predict the successor lane graph from an initial position using overhead RGB images only, not requiring any human supervision. In a subsequent stage, we show how the individual successor predictions can be aggregated into a consistent lane graph. We demonstrate the efficacy of our approach on the UrbanLaneGraph dataset and perform extensive quantitative and qualitative evaluations, indicating that AutoGraph is on par with models trained on hand-annotated graph data. Model and dataset will be made available at redacted-for-review.

Submitted 10 November, 2023; v1 submitted 27 June, 2023; originally announced June 2023.

Comments: 8 pages, 6 figures

arXiv:2306.11346 [pdf, other]

End-to-end 2D-3D Registration between Image and LiDAR Point Cloud for Vehicle Localization

Authors: Guangming Wang, Yu Zheng, Yanfeng Guo, Zhe Liu, Yixiang Zhu, Wolfram Burgard, Hesheng Wang

Abstract: Robot localization using a previously built map is essential for a variety of tasks including highly accurate navigation and mobile manipulation. A popular approach to robot localization is based on image-to-point cloud registration, which combines illumination-invariant LiDAR-based mapping with economical image-based localization. However, the recent works for image-to-point cloud registration either divide the registration into separate modules or project the point cloud to the depth image to register the RGB and depth images. In this paper, we present I2PNet, a novel end-to-end 2D-3D registration network. I2PNet directly registers the raw 3D point cloud with the 2D RGB image using differential modules with a unique target. The 2D-3D cost volume module for differential 2D-3D association is proposed to bridge feature extraction and pose regression. 2D-3D cost volume module implicitly constructs the soft point-to-pixel correspondence on the intrinsic-independent normalized plane of the pinhole camera model. Moreover, we introduce an outlier mask prediction module to filter the outliers in the 2D-3D association before pose regression. Furthermore, we propose the coarse-to-fine 2D-3D registration architecture to increase localization accuracy. We conduct extensive localization experiments on the KITTI Odometry and nuScenes datasets. The results demonstrate that I2PNet outperforms the state-of-the-art by a large margin. In addition, I2PNet has a higher efficiency than the previous works and can perform the localization in real-time. Moreover, we extend the application of I2PNet to the camera-LiDAR online calibration and demonstrate that I2PNet outperforms recent approaches on the online calibration task.

Submitted 20 June, 2023; originally announced June 2023.

Comments: 18 pages, 14 figures, under review

arXiv:2306.06525 [pdf, other]

doi 10.1016/j.ifacol.2023.10.711

Fast yet predictable braking manoeuvers for real-time robot control

Authors: Mazin Hamad, Jesus Gutierrez-Moreno, Hugo T. M. Kussaba, Nico Mansfeld, Saeed Abdolshah, Abdalla Swikir, Wolfram Burgard, Sami Haddadin

Abstract: This paper proposes a framework for generating fast, smooth and predictable braking manoeuvers for a controlled robot. The proposed framework integrates two approaches to obtain feasible modal limits for designing braking trajectories. The first approach is real-time capable but conservative considering the usage of the available feasible actuator control region, resulting in longer braking times. In contrast, the second approach maximizes the used braking control inputs at the cost of requiring more time to evaluate larger, feasible modal limits via optimization. Both approaches allow for predicting the robot's stopping trajectory online. In addition, we also formulated and solved a constrained, nonlinear final-time minimization problem to find optimal torque inputs. The optimal solutions were used as a benchmark to evaluate the performance of the proposed predictable braking framework. A comparative study was compiled in simulation versus a classical optimal controller on a 7-DoF robot arm with only three moving joints. The results verified the effectiveness of our proposed framework and its integrated approaches in achieving fast robot braking manoeuvers with accurate online predictions of the stopping trajectories and distances under various braking settings.

Submitted 10 June, 2023; originally announced June 2023.

Comments: This work has been accepted to the 22nd IFAC World Congress

arXiv:2305.04718 [pdf, other]

doi 10.1109/LRA.2023.3313917

The Treachery of Images: Bayesian Scene Keypoints for Deep Policy Learning in Robotic Manipulation

Authors: Jan Ole von Hartz, Eugenio Chisari, Tim Welschehold, Wolfram Burgard, Joschka Boedecker, Abhinav Valada

Abstract: In policy learning for robotic manipulation, sample efficiency is of paramount importance. Thus, learning and extracting more compact representations from camera observations is a promising avenue. However, current methods often assume full observability of the scene and struggle with scale invariance. In many tasks and settings, this assumption does not hold as objects in the scene are often occluded or lie outside the field of view of the camera, rendering the camera observation ambiguous with regard to their location. To tackle this problem, we present BASK, a Bayesian approach to tracking scale-invariant keypoints over time. Our approach successfully resolves inherent ambiguities in images, enabling keypoint tracking on symmetrical objects and occluded and out-of-view objects. We employ our method to learn challenging multi-object robot manipulation tasks from wrist camera observations and demonstrate superior utility for policy learning compared to other representation learning techniques. Furthermore, we show outstanding robustness towards disturbances such as clutter, occlusions, and noisy depth measurements, as well as generalization to unseen objects both in simulation and real-world robotic experiments.

Submitted 20 September, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

Journal ref: IEEE Robotics and Automation Letters, vol. 8, no. 11, pp. 6931-6938, Nov. 2023

arXiv:2304.07058 [pdf, other]

FM-Loc: Using Foundation Models for Improved Vision-based Localization

Authors: Reihaneh Mirjalili, Michael Krawez, Wolfram Burgard

Abstract: Visual place recognition is essential for vision-based robot localization and SLAM. Despite the tremendous progress made in recent years, place recognition in changing environments remains challenging. A promising approach to cope with appearance variations is to leverage high-level semantic features like objects or place categories. In this paper, we propose FM-Loc which is a novel image-based localization approach based on Foundation Models that uses the Large Language Model GPT-3 in combination with the Visual-Language Model CLIP to construct a semantic image descriptor that is robust to severe changes in scene geometry and camera viewpoint. We deploy CLIP to detect objects in an image, GPT-3 to suggest potential room labels based on the detected objects, and CLIP again to propose the most likely location label. The object labels and the scene label constitute an image descriptor that we use to calculate a similarity score between the query and database images. We validate our approach on real-world data that exhibit significant changes in camera viewpoints and object placement between the database and query trajectories. The experimental results demonstrate that our method is applicable to a wide range of indoor scenarios without the need for training or fine-tuning.

Submitted 14 April, 2023; originally announced April 2023.

arXiv:2303.11756 [pdf, other]

Improving Deep Dynamics Models for Autonomous Vehicles with Multimodal Latent Mapping of Surfaces

Authors: Johan Vertens, Nicolai Dorka, Tim Welschehold, Michael Thompson, Wolfram Burgard

Abstract: The safe deployment of autonomous vehicles relies on their ability to effectively react to environmental changes. This can require maneuvering on varying surfaces which is still a difficult problem, especially for slippery terrains. To address this issue we propose a new approach that learns a surface-aware dynamics model by conditioning it on a latent variable vector storing surface information about the current location. A latent mapper is trained to update these latent variables during inference from multiple modalities on every traversal of the corresponding locations and stores them in a map. By training everything end-to-end with the loss of the dynamics model, we enforce the latent mapper to learn an update rule for the latent map that is useful for the subsequent dynamics model. We implement and evaluate our approach on a real miniature electric car. The results show that the latent map is updated to allow more accurate predictions of the dynamics model compared to a model without this information. We further show that by using this model, the driving performance can be improved on varying and challenging surfaces.

Submitted 21 March, 2023; originally announced March 2023.

arXiv:2303.10149 [pdf, other]

doi 10.1109/CVPRW59228.2023.00245

CoVIO: Online Continual Learning for Visual-Inertial Odometry

Authors: Niclas Vödisch, Daniele Cattaneo, Wolfram Burgard, Abhinav Valada

Abstract: Visual odometry is a fundamental task for many applications on mobile devices and robotic platforms. Since such applications are oftentimes not limited to predefined target domains and learning-based vision systems are known to generalize poorly to unseen environments, methods for continual adaptation during inference time are of significant interest. In this work, we introduce CoVIO for online continual learning of visual-inertial odometry. CoVIO effectively adapts to new domains while mitigating catastrophic forgetting by exploiting experience replay. In particular, we propose a novel sampling strategy to maximize image diversity in a fixed-size replay buffer that targets the limited storage capacity of embedded devices. We further provide an asynchronous version that decouples the odometry estimation from the network weight update step enabling continuous inference in real time. We extensively evaluate CoVIO on various real-world datasets demonstrating that it successfully adapts to new domains while outperforming previous methods. The code of our work is publicly available at http://continual-slam.cs.uni-freiburg.de.

Submitted 17 March, 2023; originally announced March 2023.

Journal ref: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

arXiv:2303.10147 [pdf, other]

CoDEPS: Online Continual Learning for Depth Estimation and Panoptic Segmentation

Authors: Niclas Vödisch, Kürsat Petek, Wolfram Burgard, Abhinav Valada

Abstract: Operating a robot in the open world requires a high level of robustness with respect to previously unseen environments. Optimally, the robot is able to adapt by itself to new conditions without human supervision, e.g., automatically adjusting its perception system to changing lighting conditions. In this work, we address the task of continual learning for deep learning-based monocular depth estimation and panoptic segmentation in new environments in an online manner. We introduce CoDEPS to perform continual learning involving multiple real-world domains while mitigating catastrophic forgetting by leveraging experience replay. In particular, we propose a novel domain-mixing strategy to generate pseudo-labels to adapt panoptic segmentation. Furthermore, we explicitly address the limited storage capacity of robotic systems by leveraging sampling strategies for constructing a fixed-size replay buffer based on rare semantic class sampling and image diversity. We perform extensive evaluations of CoDEPS on various real-world datasets demonstrating that it successfully adapts to unseen environments without sacrificing performance on previous domains while achieving state-of-the-art results. The code of our work is publicly available at http://codeps.cs.uni-freiburg.de.

Submitted 31 May, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

Comments: Accepted for "Robotics: Science and Systems (RSS) 2023"

arXiv:2303.10144 [pdf, other]

Dynamic Update-to-Data Ratio: Minimizing World Model Overfitting

Authors: Nicolai Dorka, Tim Welschehold, Wolfram Burgard

Abstract: Early stopping based on the validation set performance is a popular approach to find the right balance between under- and overfitting in the context of supervised learning. However, in reinforcement learning, even for supervised sub-problems such as world model learning, early stopping is not applicable as the dataset is continually evolving. As a solution, we propose a new general method that dynamically adjusts the update to data (UTD) ratio during training based on under- and overfitting detection on a small subset of the continuously collected experience not used for training. We apply our method to DreamerV2, a state-of-the-art model-based reinforcement learning algorithm, and evaluate it on the DeepMind Control Suite and the Atari 100k benchmark. The results demonstrate that one can better balance under- and overestimation by adjusting the UTD ratio with our approach compared to the default setting in DreamerV2 and that it is competitive with an extensive hyperparameter search which is not feasible for many applications. Our method eliminates the need to set the UTD hyperparameter by hand and even leads to a higher robustness with regard to other learning-related hyperparameters further reducing the amount of necessary tuning.

Submitted 17 March, 2023; originally announced March 2023.

Comments: ICLR 2023

arXiv:2303.07522 [pdf, other]

Audio Visual Language Maps for Robot Navigation

Authors: Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard

Abstract: While interacting in the world is a multi-sensory experience, many robots continue to predominantly rely on visual perception to map and navigate in their environments. In this work, we propose Audio-Visual-Language Maps (AVLMaps), a unified 3D spatial map representation for storing cross-modal information from audio, visual, and language cues. AVLMaps integrate the open-vocabulary capabilities of multimodal foundation models pre-trained on Internet-scale data by fusing their features into a centralized 3D voxel grid. In the context of navigation, we show that AVLMaps enable robot systems to index goals in the map based on multimodal queries, e.g., textual descriptions, images, or audio snippets of landmarks. In particular, the addition of audio information enables robots to more reliably disambiguate goal locations. Extensive experiments in simulation show that AVLMaps enable zero-shot multimodal goal navigation from multimodal prompts and provide 50% better recall in ambiguous scenarios. These capabilities extend to mobile robots in the real world - navigating to landmarks referring to visual, audio, and spatial concepts. Videos and code are available at: https://avlmaps.github.io.

Submitted 27 March, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

Comments: Project page: https://avlmaps.github.io/

arXiv:2303.03037 [pdf, other]

EvCenterNet: Uncertainty Estimation for Object Detection using Evidential Learning

Authors: Monish R. Nallapareddy, Kshitij Sirohi, Paulo L. J. Drews-Jr, Wolfram Burgard, Chih-Hong Cheng, Abhinav Valada

Abstract: Uncertainty estimation is crucial in safety-critical settings such as automated driving as it provides valuable information for several downstream tasks including high-level decision making and path planning. In this work, we propose EvCenterNet, a novel uncertainty-aware 2D object detection framework using evidential learning to directly estimate both classification and regression uncertainties. To employ evidential learning for object detection, we devise a combination of evidential and focal loss functions for the sparse heatmap inputs. We introduce class-balanced weighting for regression and heatmap prediction to tackle the class imbalance encountered by evidential learning. Moreover, we propose a learning scheme to actively utilize the predicted heatmap uncertainties to improve the detection performance by focusing on the most uncertain points. We train our model on the KITTI dataset and evaluate it on challenging out-of-distribution datasets including BDD100K and nuImages. Our experiments demonstrate that our approach improves the precision and minimizes the execution time loss in relation to the base model.

Submitted 28 September, 2023; v1 submitted 6 March, 2023; originally announced March 2023.

arXiv:2302.06175 [pdf, other]

Learning and Aggregating Lane Graphs for Urban Automated Driving

Authors: Martin Büchner, Jannik Zürn, Ion-George Todoran, Abhinav Valada, Wolfram Burgard

Abstract: Lane graph estimation is an essential and highly challenging task in automated driving and HD map learning. Existing methods using either onboard or aerial imagery struggle with complex lane topologies, out-of-distribution scenarios, or significant occlusions in the image space. Moreover, merging overlapping lane graphs to obtain consistent large-scale graphs remains difficult. To overcome these challenges, we propose a novel bottom-up approach to lane graph estimation from aerial imagery that aggregates multiple overlapping graphs into a single consistent graph. Due to its modular design, our method allows us to address two complementary tasks: predicting ego-respective successor lane graphs from arbitrary vehicle positions using a graph neural network and aggregating these predictions into a consistent global lane graph. Extensive experiments on a large-scale lane graph dataset demonstrate that our approach yields highly accurate lane graphs, even in regions with severe occlusions. The presented approach to graph aggregation proves to eliminate inconsistent predictions while increasing the overall graph quality. We make our large-scale urban lane graph dataset and code publicly available at http://urbanlanegraph.cs.uni-freiburg.de.

Submitted 17 March, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

Comments: 22 pages, 17 figures

arXiv:2302.04233 [pdf, other]

SkyEye: Self-Supervised Bird's-Eye-View Semantic Mapping Using Monocular Frontal View Images

Authors: Nikhil Gosala, Kürsat Petek, Paulo L. J. Drews-Jr, Wolfram Burgard, Abhinav Valada

Abstract: Bird's-Eye-View (BEV) semantic maps have become an essential component of automated driving pipelines due to the rich representation they provide for decision-making tasks. However, existing approaches for generating these maps still follow a fully supervised training paradigm and hence rely on large amounts of annotated BEV data. In this work, we address this limitation by proposing the first self-supervised approach for generating a BEV semantic map using a single monocular image from the frontal view (FV). During training, we overcome the need for BEV ground truth annotations by leveraging the more easily available FV semantic annotations of video sequences. Thus, we propose the SkyEye architecture that learns based on two modes of self-supervision, namely, implicit supervision and explicit supervision. Implicit supervision trains the model by enforcing spatial consistency of the scene over time based on FV semantic sequences, while explicit supervision exploits BEV pseudolabels generated from FV semantic annotations and self-supervised depth estimates. Extensive evaluations on the KITTI-360 dataset demonstrate that our self-supervised approach performs on par with the state-of-the-art fully supervised methods and achieves competitive results using only 1% of direct supervision in the BEV compared to fully supervised approaches. Finally, we publicly release both our code and the BEV datasets generated from the KITTI-360 and Waymo datasets.

Submitted 8 February, 2023; originally announced February 2023.

Comments: 14 pages, 7 figures

arXiv:2210.05714 [pdf, other]

Visual Language Maps for Robot Navigation

Authors: Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard

Abstract: Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data (e.g., image captions). While this is useful for matching images to natural language descriptions of object goals, it remains disjoint from the process of mapping the environment, so that it lacks the spatial precision of classic geometric maps. To address this problem, we propose VLMaps, a spatial map representation that directly fuses pretrained visual-language features with a 3D reconstruction of the physical world. VLMaps can be autonomously built from video feed on robots using standard exploration approaches and enables natural language indexing of the map without additional labeled data. Specifically, when combined with large language models (LLMs), VLMaps can be used to (i) translate natural language commands into a sequence of open-vocabulary navigation goals (which, beyond prior work, can be spatial by construction, e.g., "in between the sofa and TV" or "three meters to the right of the chair") directly localized in the map, and (ii) can be shared among multiple robots with different embodiments to generate new obstacle maps on-the-fly (by using a list of obstacle categories). Extensive experiments carried out in simulated and real world environments show that VLMaps enable navigation according to more complex language instructions than existing methods. Videos are available at https://vlmaps.github.io.

Submitted 8 March, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

Comments: Accepted at the 2023 IEEE International Conference on Robotics and Automation (ICRA). Project page: https://vlmaps.github.io

arXiv:2210.04472 [pdf, other]

Uncertainty-aware LiDAR Panoptic Segmentation

Authors: Kshitij Sirohi, Sajad Marvi, Daniel Büscher, Wolfram Burgard

Abstract: Modern autonomous systems often rely on LiDAR scanners, in particular for autonomous driving scenarios. In this context, reliable scene understanding is indispensable. Current learning-based methods typically try to achieve maximum performance for this task, while neglecting a proper estimation of the associated uncertainties. In this work, we introduce a novel approach for solving the task of uncertainty-aware panoptic segmentation using LiDAR point clouds. Our proposed EvLPSNet network is the first to solve this task efficiently in a sampling-free manner. It aims to predict per-point semantic and instance segmentations, together with per-point uncertainty estimates. Moreover, it incorporates methods for improving the performance by employing the predicted uncertainties. We provide several strong baselines combining state-of-the-art panoptic segmentation networks with sampling-free uncertainty estimation techniques. Extensive evaluations show that we achieve the best performance on uncertainty-aware panoptic segmentation quality and calibration compared to these baselines. We make our code available at: https://github.com/kshitij3112/EvLPSNet

Submitted 10 October, 2022; originally announced October 2022.

arXiv:2210.01911 [pdf, other]

Grounding Language with Visual Affordances over Unstructured Data

Authors: Oier Mees, Jessica Borja-Diaz, Wolfram Burgard

Abstract: Recent works have shown that Large Language Models (LLMs) can be applied to ground natural language to a wide variety of robot skills. However, in practice, learning multi-task, language-conditioned robotic skills typically requires large-scale data collection and frequent human intervention to reset the environment or help correcting the current policies. In this work, we propose a novel approach to efficiently learn general-purpose language-conditioned robot skills from unstructured, offline and reset-free data in the real world by exploiting a self-supervised visuo-lingual affordance model, which requires annotating as little as 1% of the total data with language. We evaluate our method in extensive experiments both in simulated and real-world robotic tasks, achieving state-of-the-art performance on the challenging CALVIN benchmark and learning over 25 distinct visuomotor manipulation tasks with a single policy in the real world. We find that when paired with LLMs to break down abstract natural language instructions into subgoals via few-shot prompting, our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches. Code and videos are available at http://hulc2.cs.uni-freiburg.de

Submitted 8 March, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

Comments: Accepted at the 2023 IEEE International Conference on Robotics and Automation (ICRA). Project website: http://hulc2.cs.uni-freiburg.de

arXiv:2209.11693 [pdf, other]

T3VIP: Transformation-based 3D Video Prediction

Authors: Iman Nematollahi, Erick Rosete-Beas, Seyed Mahdi B. Azad, Raghu Rajan, Frank Hutter, Wolfram Burgard

Abstract: For autonomous skill acquisition, robots have to learn about the physical rules governing the 3D world dynamics from their own past experience to predict and reason about plausible future outcomes. To this end, we propose a transformation-based 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts and predicting their corresponding rigid transformations. Our model is fully unsupervised, captures the stochastic nature of the real world, and the observational cues in image and point cloud domains constitute its learning signals. To fully leverage all the 2D and 3D observational signals, we equip our model with automatic hyperparameter optimization (HPO) to interpret the best way of learning from them. To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera. Our extensive evaluation with simulated and real-world datasets demonstrates that our formulation leads to interpretable 3D models that predict future depth videos while achieving on-par performance with 2D models on RGB video prediction. Moreover, we demonstrate that our model outperforms 2D baselines on visuomotor control. Videos, code, dataset, and pre-trained models are available at http://t3vip.cs.uni-freiburg.de.

Submitted 19 September, 2022; originally announced September 2022.

Comments: Accepted at the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2209.09699 [pdf, other]

doi 10.1109/LRA.2023.3239312

PADLoC: LiDAR-Based Deep Loop Closure Detection and Registration Using Panoptic Attention

Authors: José Arce, Niclas Vödisch, Daniele Cattaneo, Wolfram Burgard, Abhinav Valada

Abstract: A key component of graph-based SLAM systems is the ability to detect loop closures in a trajectory to reduce the drift accumulated over time from the odometry. Most LiDAR-based methods achieve this goal by using only the geometric information, disregarding the semantics of the scene. In this work, we introduce PADLoC for joint loop closure detection and registration in LiDAR-based SLAM frameworks. We propose a novel transformer-based head for point cloud matching and registration, and to leverage panoptic information during training time. In particular, we propose a novel loss function that reframes the matching problem as a classification task for the semantic labels and as a graph connectivity assignment for the instance labels. During inference, PADLoC does not require panoptic annotations, making it more versatile than other methods. Additionally, we show that using two shared matching and registration heads with their source and target inputs swapped increases the overall performance by enforcing forward-backward consistency. We perform extensive evaluations of PADLoC on multiple real-world datasets demonstrating that it achieves state-of-the-art results. The code of our work is publicly available at http://padloc.cs.uni-freiburg.de.

Submitted 28 March, 2023; v1 submitted 20 September, 2022; originally announced September 2022.

Journal ref: IEEE Robotics and Automation Letters, vol. 8, no. 3, pp. 1319-1326, March 2023

arXiv:2209.08959 [pdf, other]

Latent Plans for Task-Agnostic Offline Reinforcement Learning

Authors: Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, Wolfram Burgard

Abstract: Everyday tasks of long-horizon and comprising a sequence of multiple implicit subtasks still impose a major challenge in offline robot control. While a number of prior methods aimed to address this setting with variants of imitation and offline reinforcement learning, the learned behavior is typically narrow and often struggles to reach configurable long-horizon goals. As both paradigms have complementary strengths and weaknesses, we propose a novel hierarchical approach that combines the strengths of both methods to learn task-agnostic long-horizon policies from high-dimensional camera observations. Concretely, we combine a low-level policy that learns latent skills via imitation learning and a high-level policy learned from offline reinforcement learning for skill-chaining the latent behavior priors. Experiments in various simulated and real robot control tasks show that our formulation enables producing previously unseen combinations of skills to reach temporally extended goals by "stitching" together latent skills through goal chaining with an order-of-magnitude improvement in performance upon state-of-the-art baselines. We even learn one multi-task visuomotor policy for 25 distinct manipulation tasks in the real world which outperforms both imitation learning and offline reinforcement learning techniques.

Submitted 19 September, 2022; originally announced September 2022.

Comments: CoRL 2022. Project website: http://tacorl.cs.uni-freiburg.de/

arXiv:2209.05247 [pdf, other]

TrackletMapper: Ground Surface Segmentation and Mapping from Traffic Participant Trajectories

Authors: Jannik Zürn, Sebastian Weber, Wolfram Burgard

Abstract: Robustly classifying ground infrastructure such as roads and street crossings is an essential task for mobile robots operating alongside pedestrians. While many semantic segmentation datasets are available for autonomous vehicles, models trained on such datasets exhibit a large domain gap when deployed on robots operating in pedestrian spaces. Manually annotating images recorded from pedestrian viewpoints is both expensive and time-consuming. To overcome this challenge, we propose TrackletMapper, a framework for annotating ground surface types such as sidewalks, roads, and street crossings from object tracklets without requiring human-annotated data. To this end, we project the robot ego-trajectory and the paths of other traffic participants into the ego-view camera images, creating sparse semantic annotations for multiple types of ground surfaces from which a ground segmentation model can be trained. We further show that the model can be self-distilled for additional performance benefits by aggregating a ground surface map and projecting it into the camera images, creating a denser set of training annotations compared to the sparse tracklet annotations. We qualitatively and quantitatively attest our findings on a novel large-scale dataset for mobile robots operating in pedestrian areas. Code and dataset will be made available at http://trackletmapper.cs.uni-freiburg.de.

Submitted 8 January, 2023; v1 submitted 12 September, 2022; originally announced September 2022.

Comments: 19 pages, 14 figures, CoRL 2022 v4 (updated acknowledgements)

arXiv:2207.07469 [pdf, other]

USegScene: Unsupervised Learning of Depth, Optical Flow and Ego-Motion with Semantic Guidance and Coupled Networks

Authors: Johan Vertens, Wolfram Burgard

Abstract: In this paper we propose USegScene, a framework for semantically guided unsupervised learning of depth, optical flow and ego-motion estimation for stereo camera images using convolutional neural networks. Our framework leverages semantic information for improved regularization of depth and optical flow maps, multimodal fusion and occlusion filling considering dynamic rigid object motions as independent SE(3) transformations. Furthermore, complementary to pure photo-metric matching, we propose matching of semantic features, pixel-wise classes and object instance borders between the consecutive images. In contrast to previous methods, we propose a network architecture that jointly predicts all outputs using shared encoders and allows passing information across the task-domains, e.g., the prediction of optical flow can benefit from the prediction of the depth. Furthermore, we explicitly learn the depth and optical flow occlusion maps inside the network, which are leveraged in order to improve the predictions in the respective regions. We present results on the popular KITTI dataset and show that our approach outperforms other methods by a large margin.

Submitted 15 July, 2022; originally announced July 2022.

arXiv:2206.14554 [pdf, other]

Uncertainty-aware Panoptic Segmentation

Authors: Kshitij Sirohi, Sajad Marvi, Daniel Büscher, Wolfram Burgard

Abstract: Reliable scene understanding is indispensable for modern autonomous systems. Current learning-based methods typically try to maximize their performance based on segmentation metrics that only consider the quality of the segmentation. However, for the safe operation of a system in the real world it is crucial to consider the uncertainty in the prediction as well. In this work, we introduce the novel task of uncertainty-aware panoptic segmentation, which aims to predict per-pixel semantic and instance segmentations, together with per-pixel uncertainty estimates. We define two novel metrics to facilitate its quantitative analysis, the uncertainty-aware Panoptic Quality (uPQ) and the panoptic Expected Calibration Error (pECE). We further propose the novel top-down Evidential Panoptic Segmentation Network (EvPSNet) to solve this task. Our architecture employs a simple yet effective panoptic fusion module that leverages the predicted uncertainties. Furthermore, we provide several strong baselines combining state-of-the-art panoptic segmentation networks with sampling-free uncertainty estimation techniques. Extensive evaluations show that our EvPSNet achieves the new state-of-the-art for the standard Panoptic Quality (PQ), as well as for our uncertainty-aware panoptic metrics. We make the code available at: https://github.com/kshitij3112/EvPSNet

Submitted 24 December, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

arXiv:2204.06252 [pdf, other]

What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data

Authors: Oier Mees, Lukas Hermann, Wolfram Burgard

Abstract: A long-standing goal in robotics is to build robots that can perform a wide range of daily tasks from perceptions obtained with their onboard sensors and specified only via natural language. While recently substantial advances have been achieved in language-driven robotics by leveraging end-to-end learning from pixels, there is no clear and well-understood process for making various design choices due to the underlying variation in setups. In this paper, we conduct an extensive study of the most critical challenges in learning language conditioned policies from offline free-form imitation datasets. We further identify architectural and algorithmic techniques that improve performance, such as a hierarchical decomposition of the robot control learning, a multimodal transformer encoder, discrete latent plans and a self-supervised contrastive loss that aligns video and language representations. By combining the results of our investigation with our improved model components, we are able to present a novel approach that significantly outperforms the state of the art on the challenging language conditioned long-horizon robot manipulation CALVIN benchmark. We have open-sourced our implementation to facilitate future research in learning to perform many complex manipulation skills in a row specified with natural language. Codebase and trained models available at http://hulc.cs.uni-freiburg.de

Submitted 30 August, 2022; v1 submitted 13 April, 2022; originally announced April 2022.

Comments: Accepted for publication at IEEE Robotics and Automation Letters (RAL). Codebase and trained models available at http://hulc.cs.uni-freiburg.de

arXiv:2203.01578 [pdf, other]

doi 10.1007/978-3-031-25555-7_3

Continual SLAM: Beyond Lifelong Simultaneous Localization and Mapping Through Continual Learning

Authors: Niclas Vödisch, Daniele Cattaneo, Wolfram Burgard, Abhinav Valada

Abstract: Robots operating in the open world encounter various different environments that can substantially differ from each other. This domain gap also poses a challenge for Simultaneous Localization and Mapping (SLAM) being one of the fundamental tasks for navigation. In particular, learning-based SLAM methods are known to generalize poorly to unseen environments hindering their general adoption. In this work, we introduce the novel task of continual SLAM extending the concept of lifelong SLAM from a single dynamically changing environment to sequential deployments in several drastically differing environments. To address this task, we propose CL-SLAM leveraging a dual-network architecture to both adapt to new environments and retain knowledge with respect to previously visited environments. We compare CL-SLAM to learning-based as well as classical SLAM methods and show the advantages of leveraging online data. We extensively evaluate CL-SLAM on three different datasets and demonstrate that it outperforms several baselines inspired by existing continual learning-based visual odometry methods. We make the code of our work publicly available at http://continual-slam.cs.uni-freiburg.de.

Submitted 13 March, 2023; v1 submitted 3 March, 2022; originally announced March 2022.

Journal ref: Robotics Research. ISRR 2022. Springer Proceedings in Advanced Robotics, vol 27, pp 19-35

arXiv:2203.00403 [pdf, other]

OpenDR: An Open Toolkit for Enabling High Performance, Low Footprint Deep Learning for Robotics

Authors: N. Passalis, S. Pedrazzi, R. Babuska, W. Burgard, D. Dias, F. Ferro, M. Gabbouj, O. Green, A. Iosifidis, E. Kayacan, J. Kober, O. Michel, N. Nikolaidis, P. Nousi, R. Pieters, M. Tzelepi, A. Valada, A. Tefas

Abstract: Existing Deep Learning (DL) frameworks typically do not provide ready-to-use solutions for robotics, where very specific learning, reasoning, and embodiment problems exist. Their relatively steep learning curve and the different methodologies employed by DL compared to traditional approaches, along with the high complexity of DL models, which often leads to the need of employing specialized hardware accelerators, further increase the effort and cost needed to employ DL models in robotics. Also, most of the existing DL methods follow a static inference paradigm, as inherited by the traditional computer vision pipelines, ignoring active perception, which can be employed to actively interact with the environment in order to increase perception accuracy. In this paper, we present the Open Deep Learning Toolkit for Robotics (OpenDR). OpenDR aims at developing an open, non-proprietary, efficient, and modular toolkit that can be easily used by robotics companies and research institutions to efficiently develop and deploy AI and cognition technologies to robotics applications, providing a solid step towards addressing the aforementioned challenges. We also detail the design choices, along with an abstract interface that was created to overcome these challenges. This interface can describe various robotic tasks, spanning beyond traditional DL cognition and inference, as known by existing frameworks, incorporating openness, homogeneity and robotics-oriented perception e.g., through active perception, as its core design principles.

Submitted 1 March, 2022; originally announced March 2022.

arXiv:2203.00352 [pdf, other]

Affordance Learning from Play for Sample-Efficient Policy Learning

Authors: Jessica Borja-Diaz, Oier Mees, Gabriel Kalweit, Lukas Hermann, Joschka Boedecker, Wolfram Burgard

Abstract: Robots operating in human-centered environments should have the ability to understand how objects function: what can be done with each object, where this interaction may occur, and how the object is used to achieve a goal. To this end, we propose a novel approach that extracts a self-supervised visual affordance model from human teleoperated play data and leverages it to enable efficient policy learning and motion planning. We combine model-based planning with model-free deep reinforcement learning (RL) to learn policies that favor the same object regions favored by people, while requiring minimal robot interactions with the environment. We evaluate our algorithm, Visual Affordance-guided Policy Optimization (VAPO), with both diverse simulation manipulation tasks and real world robot tidy-up experiments to demonstrate the effectiveness of our affordance-guided policies. We find that our policies train 4x faster than the baselines and generalize better to novel objects because our visual affordance model can anticipate their affordance regions.

Submitted 1 March, 2022; originally announced March 2022.

Comments: Accepted at the 2022 IEEE International Conference on Robotics and Automation (ICRA). Videos at http://vapo.cs.uni-freiburg.de/

arXiv:2201.12771 [pdf, other]

Self-Supervised Moving Vehicle Detection from Audio-Visual Cues

Authors: Jannik Zürn, Wolfram Burgard

Abstract: Robust detection of moving vehicles is a critical task for any autonomously operating outdoor robot or self-driving vehicle. Most modern approaches for solving this task rely on training image-based detectors using large-scale vehicle detection datasets such as nuScenes or the Waymo Open Dataset. Providing manual annotations is an expensive and laborious exercise that does not scale well in practice. To tackle this problem, we propose a self-supervised approach that leverages audio-visual cues to detect moving vehicles in videos. Our approach employs contrastive learning for localizing vehicles in images from corresponding pairs of images and recorded audio. In extensive experiments carried out with a real-world dataset, we demonstrate that our approach provides accurate detections of moving vehicles and does not require manual annotations. We furthermore show that our model can be used as a teacher to supervise an audio-only detection model. This student model is invariant to illumination changes and thus effectively bridges the domain gap inherent to models leveraging exclusively vision as the predominant modality.

Submitted 13 June, 2022; v1 submitted 30 January, 2022; originally announced January 2022.

Comments: 8 pages, 6 figures

arXiv:2112.03227 [pdf, other]

CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

Authors: Oier Mees, Lukas Hermann, Erick Rosete-Beas, Wolfram Burgard

Abstract: General-purpose robots coexisting with humans in their environment must learn to relate human language to their perceptions and actions to be useful in a range of daily tasks. Moreover, they need to acquire a diverse repertoire of general-purpose skills that allow composing long-horizon tasks by following unconstrained language instructions. In this paper, we present CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark to learn long-horizon language-conditioned tasks. Our aim is to make it possible to develop agents that can solve many robotic manipulation tasks over a long horizon, from onboard sensors, and specified only via human language. CALVIN tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets and supports flexible specification of sensor suites. We evaluate the agents in zero-shot to novel language instructions and to novel environments and objects. We show that a baseline model based on multi-context imitation learning performs poorly on CALVIN, suggesting that there is significant room for developing innovative agents that learn to relate human language to their world models with this benchmark.

Submitted 13 July, 2022; v1 submitted 6 December, 2021; originally announced December 2021.

Comments: Accepted for publication at IEEE Robotics and Automation Letters (RAL). Code, models and dataset available at http://calvin.cs.uni-freiburg.de

arXiv:2111.13129 [pdf, other]

Robot Skill Adaptation via Soft Actor-Critic Gaussian Mixture Models

Authors: Iman Nematollahi, Erick Rosete-Beas, Adrian Röfer, Tim Welschehold, Abhinav Valada, Wolfram Burgard

Abstract: A core challenge for an autonomous agent acting in the real world is to adapt its repertoire of skills to cope with its noisy perception and dynamics. To scale learning of skills to long-horizon tasks, robots should be able to learn and later refine their skills in a structured manner through trajectories rather than making instantaneous decisions individually at each time step. To this end, we propose the Soft Actor-Critic Gaussian Mixture Model (SAC-GMM), a novel hybrid approach that learns robot skills through a dynamical system and adapts the learned skills in their own trajectory distribution space through interactions with the environment. Our approach combines classical robotics techniques of learning from demonstration with the deep reinforcement learning framework and exploits their complementary nature. We show that our method utilizes sensors solely available during the execution of preliminarily learned skills to extract relevant features that lead to faster skill refinement. Extensive evaluations in both simulation and real-world environments demonstrate the effectiveness of our method in refining robot skills by leveraging physical interactions, high-dimensional sensory data, and sparse task completion rewards. Videos, code, and pre-trained models are available at http://sac-gmm.cs.uni-freiburg.de.

Submitted 19 September, 2022; v1 submitted 25 November, 2021; originally announced November 2021.

Comments: Accepted at the 2022 IEEE International Conference on Robotics and Automation (ICRA)

arXiv:2111.12673 [pdf, other]

Adaptively Calibrated Critic Estimates for Deep Reinforcement Learning

Authors: Nicolai Dorka, Tim Welschehold, Joschka Boedecker, Wolfram Burgard

Abstract: Accurate value estimates are important for off-policy reinforcement learning. Algorithms based on temporal difference learning are typically prone to an over- or underestimation bias that builds up over time. In this paper, we propose a general method called Adaptively Calibrated Critics (ACC) that uses the most recent high-variance but unbiased on-policy rollouts to alleviate the bias of the low-variance temporal difference targets. We apply ACC to Truncated Quantile Critics, which is an algorithm for continuous control that allows regulation of the bias with a hyperparameter tuned per environment. The resulting algorithm adaptively adjusts the parameter during training, rendering hyperparameter search unnecessary, and sets a new state of the art on the OpenAI gym continuous control benchmark among all algorithms that do not tune hyperparameters for each environment. ACC further achieves improved results on different tasks from the Meta-World robot benchmark. Additionally, we demonstrate the generality of ACC by applying it to TD3 and showing an improved performance also in this setting.

Submitted 21 October, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

Comments: Submitted to RA-L

arXiv:2110.10563 [pdf, other]

Robust Monocular Localization in Sparse HD Maps Leveraging Multi-Task Uncertainty Estimation

Authors: Kürsat Petek, Kshitij Sirohi, Daniel Büscher, Wolfram Burgard

Abstract: Robust localization in dense urban scenarios using a low-cost sensor setup and sparse HD maps is highly relevant for the current advances in autonomous driving, but remains a challenging topic in research. We present a novel monocular localization approach based on a sliding-window pose graph that leverages predicted uncertainties for increased precision and robustness against challenging scenarios and per-frame failures. To this end, we propose an efficient multi-task uncertainty-aware perception module, which covers semantic segmentation as well as bounding box detection, to enable the localization of vehicles in sparse maps containing only lane borders and traffic lights. Further, we design differentiable cost maps that are directly generated from the estimated uncertainties. This opens up the possibility to minimize the reprojection loss of amorphous map elements in an association-free and uncertainty-aware manner. Extensive evaluation on the Lyft 5 dataset shows that, despite the sparsity of the map, our approach enables robust and accurate 6D localization in challenging urban scenarios.

Submitted 20 October, 2021; originally announced October 2021.
