doi 10.1109/WACVW58289.2023.00078

Explainable Model-Agnostic Similarity and Confidence in Face Verification

Authors: Martin Knoche, Torben Teepe, Stefan Hörmann, Gerhard Rigoll

Abstract: Recently, face recognition systems have demonstrated remarkable performance and thus gained a vital role in our daily life. They already surpass human face verification accountability in many scenarios. However, they lack explanations for their predictions. Compared to human operators, typical face recognition network systems generate only binary decisions without further explanation and insights into those decisions. This work focuses on explanations for face recognition systems, vital for developers and operators. First, we introduce a confidence score for those systems based on facial feature distances between two input images and the distribution of distances across a dataset. Secondly, we establish a novel visualization approach to obtain more meaningful predictions from a face recognition system, which maps the distance deviation based on a systematic occlusion of images. The result is blended with the original images and highlights similar and dissimilar facial regions. Lastly, we calculate confidence scores and explanation maps for several state-of-the-art face verification datasets and release the results on a web platform. We optimize the platform for a user-friendly interaction and hope to further improve the understanding of machine learning decisions. The source code is available on GitHub, and the web platform is publicly available at http://explainable-face-verification.ey.r.appspot.com.

Submitted 16 February, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

arXiv:2208.14167 [pdf, other]

Synthehicle: Multi-Vehicle Multi-Camera Tracking in Virtual Cities

Authors: Fabian Herzog, Junpeng Chen, Torben Teepe, Johannes Gilg, Stefan Hörmann, Gerhard Rigoll

Abstract: Smart City applications such as intelligent traffic routing or accident prevention rely on computer vision methods for exact vehicle localization and tracking. Due to the scarcity of accurately labeled data, detecting and tracking vehicles in 3D from multiple cameras proves challenging to explore. We present a massive synthetic dataset for multiple vehicle tracking and segmentation in multiple overlapping and non-overlapping camera views. Unlike existing datasets, which only provide tracking ground truth for 2D bounding boxes, our dataset additionally contains perfect labels for 3D bounding boxes in camera- and world coordinates, depth estimation, and instance, semantic and panoptic segmentation. The dataset consists of 17 hours of labeled video material, recorded from 340 cameras in 64 diverse day, rain, dawn, and night scenes, making it the most extensive dataset for multi-target multi-camera tracking so far. We provide baselines for detection, vehicle re-identification, and single- and multi-camera tracking. Code and data are publicly available.

Submitted 30 August, 2022; originally announced August 2022.

arXiv:2207.06726 [pdf, other]

doi 10.1109/FG57933.2023.10042669

Octuplet Loss: Make Face Recognition Robust to Image Resolution

Authors: Martin Knoche, Mohamed Elkadeem, Stefan Hörmann, Gerhard Rigoll

Abstract: Image resolution, or in general, image quality, plays an essential role in the performance of today's face recognition systems. To address this problem, we propose a novel combination of the popular triplet loss to improve robustness against image resolution via fine-tuning of existing face recognition models. With octuplet loss, we leverage the relationship between high-resolution images and their synthetically down-sampled variants jointly with their identity labels. Fine-tuning several state-of-the-art approaches with our method proves that we can significantly boost performance for cross-resolution (high-to-low resolution) face verification on various datasets without meaningfully exacerbating the performance on high-to-high resolution images. Our method applied on the FaceTransformer network achieves 95.12% face verification accuracy on the challenging XQLFW dataset while reaching 99.73% on the LFW database. Moreover, the low-to-low face verification accuracy benefits from our method. We release our code to allow seamless integration of the octuplet loss into existing frameworks.

Submitted 21 March, 2023; v1 submitted 14 July, 2022; originally announced July 2022.

arXiv:2205.13796 [pdf, other]

Face Morphing: Fooling a Face Recognition System Is Simple!

Authors: Stefan Hörmann, Tianlin Kong, Torben Teepe, Fabian Herzog, Martin Knoche, Gerhard Rigoll

Abstract: State-of-the-art face recognition (FR) approaches have shown remarkable results in predicting whether two faces belong to the same identity, yielding accuracies between 92% and 100% depending on the difficulty of the protocol. However, the accuracy drops substantially when exposed to morphed faces, specifically generated to look similar to two identities. To generate morphed faces, we integrate a simple pretrained FR model into a generative adversarial network (GAN) and modify several loss functions for face morphing. In contrast to previous works, our approach and analyses are not limited to pairs of frontal faces with the same ethnicity and gender. Our qualitative and quantitative results affirm that our approach achieves a seamless change between two faces even in unconstrained scenarios. Despite using features from a simpler FR model for face morphing, we demonstrate that even recent FR systems struggle to distinguish the morphed face from both identities, obtaining an accuracy of only 55-70%. Besides, we provide further insights into how knowing the FR system makes it particularly vulnerable to face morphing attacks.

Submitted 27 May, 2022; originally announced May 2022.

arXiv:2204.07855 [pdf, other]

Towards a Deeper Understanding of Skeleton-based Gait Recognition

Authors: Torben Teepe, Johannes Gilg, Fabian Herzog, Stefan Hörmann, Gerhard Rigoll

Abstract: Gait recognition is a promising biometric with unique properties for identifying individuals from a long distance by their walking patterns. In recent years, most gait recognition methods used the person's silhouette to extract the gait features. However, silhouette images can lose fine-grained spatial information, suffer from (self) occlusion, and be challenging to obtain in real-world scenarios. Furthermore, these silhouettes also contain other visual clues that are not actual gait features and can be used for identification, but also to fool the system. Model-based methods do not suffer from these problems and are able to represent the temporal motion of body joints, which are actual gait features. The advances in human pose estimation started a new era for model-based gait recognition with skeleton-based gait recognition. In this work, we propose an approach based on Graph Convolutional Networks (GCNs) that combines higher-order inputs and residual networks into an efficient architecture for gait recognition. Extensive experiments on the two popular gait datasets, CASIA-B and OUMVLP-Pose, show a massive improvement (3x) of the state-of-the-art (SotA) on the largest gait dataset OUMVLP-Pose and strong temporal modeling capabilities. Finally, we visualize our method to understand skeleton-based gait recognition better and to show that we model real gait features.

Submitted 16 April, 2022; originally announced April 2022.

Comments: 8 Pages, 5 figures, Accepted at 17th IEEE Computer Society Workshop on Biometrics 2022 (CVPRW'22)

arXiv:2108.10290 [pdf, other]

doi 10.1109/FG52635.2021.9666960

Cross-Quality LFW: A Database for Analyzing Cross-Resolution Image Face Recognition in Unconstrained Environments

Authors: Martin Knoche, Stefan Hörmann, Gerhard Rigoll

Abstract: Real-world face recognition applications often deal with suboptimal image quality or resolution due to different capturing conditions such as various subject-to-camera distances, poor camera settings, or motion blur. This characteristic has an unignorable effect on performance. Recent cross-resolution face recognition approaches used simple, arbitrary, and unrealistic down- and up-scaling techniques to measure robustness against real-world edge-cases in image quality. Thus, we propose a new standardized benchmark dataset and evaluation protocol derived from the famous Labeled Faces in the Wild (LFW). In contrast to previous derivatives, which focus on pose, age, similarity, and adversarial attacks, our Cross-Quality Labeled Faces in the Wild (XQLFW) maximizes the quality difference. It contains only more realistic synthetically degraded images when necessary. Our proposed dataset is then used to further investigate the influence of image quality on several state-of-the-art approaches. With XQLFW, we show that these models perform differently in cross-quality cases, and hence, the generalizing capability is not accurately predicted by their performance on LFW. Additionally, we report baseline accuracy with recent deep learning models explicitly trained for cross-resolution applications and evaluate the susceptibility to image quality. To encourage further research in cross-resolution face recognition and incite the assessment of image quality robustness, we publish the database and code for evaluation.

Submitted 25 November, 2022; v1 submitted 23 August, 2021; originally announced August 2021.

Comments: 9 pages, 4 figures, 2 tables

arXiv:2107.03769 [pdf, other]

doi 10.4230/LITES.8.1.1

Susceptibility to Image Resolution in Face Recognition and Trainings Strategies

Authors: Martin Knoche, Stefan Hörmann, Gerhard Rigoll

Abstract: Face recognition approaches often rely on equal image resolution for verifying faces on two images. However, in practical applications, those image resolutions are usually not in the same range due to different image capture mechanisms or sources. In this work, we first analyze the impact of image resolutions on face verification performance with a state-of-the-art face recognition model. For images synthetically reduced to $5\,\times\,5$ px resolution, the verification performance drops from $99.23\%$ increasingly down to almost $55\%$. Especially for cross-resolution image pairs (one high- and one low-resolution image), the verification accuracy decreases even further. We investigate this behavior more in-depth by looking at the feature distances for every 2-image test pair. To tackle this problem, we propose the following two methods: 1) Train a state-of-the-art face-recognition model straightforwardly with $50\%$ low-resolution images directly within each batch. 2) Train a siamese-network structure and add a cosine distance feature loss between high- and low-resolution features. Both methods show an improvement for cross-resolution scenarios and can increase the accuracy at very low resolution to approximately $70\%$. However, a disadvantage is that a specific model needs to be trained for every resolution pair. Thus, we extend the aforementioned methods by training them with multiple image resolutions at once. The performances for particular testing image resolutions are slightly worse, but the advantage is that this model can be applied to arbitrary resolution images and achieves an overall better performance ($97.72\%$ compared to $96.86\%$). Due to the lack of a benchmark for arbitrary resolution images for the cross-resolution and equal-resolution task, we propose an evaluation protocol for five well-known datasets, focusing on high, mid, and low-resolution images.

Submitted 25 November, 2022; v1 submitted 8 July, 2021; originally announced July 2021.

Comments: 19 pages, 15 figures, 2 tables

arXiv:2106.06415 [pdf, ps, other]

Attention-based Partial Face Recognition

Authors: Stefan Hörmann, Zeyuan Zhang, Martin Knoche, Torben Teepe, Gerhard Rigoll

Abstract: Photos of faces captured in unconstrained environments, such as large crowds, still constitute challenges for current face recognition approaches as often faces are occluded by objects or people in the foreground. However, few studies have addressed the task of recognizing partial faces. In this paper, we propose a novel approach to partial face recognition capable of recognizing faces with different occluded areas. We achieve this by combining attentional pooling of a ResNet's intermediate feature maps with a separate aggregation module. We further adapt common losses to partial faces in order to ensure that the attention maps are diverse and handle occluded parts. Our thorough analysis demonstrates that we outperform all baselines under multiple benchmark protocols, including naturally and synthetically occluded partial faces. This suggests that our method successfully focuses on the relevant parts of the occluded face.

Submitted 14 June, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

Comments: To be published in IEEE ICIP 2021

arXiv:2101.11228 [pdf, ps, other]

doi 10.1109/ICIP42928.2021.9506717

GaitGraph: Graph Convolutional Network for Skeleton-Based Gait Recognition

Authors: Torben Teepe, Ali Khan, Johannes Gilg, Fabian Herzog, Stefan Hörmann, Gerhard Rigoll

Abstract: Gait recognition is a promising video-based biometric for identifying individual walking patterns from a long distance. At present, most gait recognition methods use silhouette images to represent a person in each frame. However, silhouette images can lose fine-grained spatial information, and most papers do not regard how to obtain these silhouettes in complex scenes. Furthermore, silhouette images contain not only gait features but also other visual clues that can be recognized. Hence these approaches cannot be considered as strict gait recognition. We leverage recent advances in human pose estimation to estimate robust skeleton poses directly from RGB images to bring back model-based gait recognition with a cleaner representation of gait. Thus, we propose GaitGraph that combines skeleton poses with Graph Convolutional Network (GCN) to obtain a modern model-based approach for gait recognition. The main advantages are a cleaner, more elegant extraction of the gait features and the ability to incorporate powerful spatio-temporal modeling using GCN. Experiments on the popular CASIA-B gait dataset show that our method achieves state-of-the-art performance in model-based gait recognition. The code and models are publicly available.

Submitted 9 June, 2021; v1 submitted 27 January, 2021; originally announced January 2021.

Comments: 5 pages, 2 figures

arXiv:2101.10774 [pdf, ps, other]

doi 10.1109/ICIP42928.2021.9506733

Lightweight Multi-Branch Network for Person Re-Identification

Authors: Fabian Herzog, Xunbo Ji, Torben Teepe, Stefan Hörmann, Johannes Gilg, Gerhard Rigoll

Abstract: Person Re-Identification aims to retrieve person identities from images captured by multiple cameras or the same cameras in different time instances and locations. Because of its importance in many vision applications from surveillance to human-machine interaction, person re-identification methods need to be reliable and fast. While more and more deep architectures are proposed for increasing performance, those methods also increase overall model complexity. This paper proposes a lightweight network that combines global, part-based, and channel features in a unified multi-branch architecture that builds on the resource-efficient OSNet backbone. Using a well-founded combination of training techniques and design choices, our final model achieves state-of-the-art results on CUHK03 labeled, CUHK03 detected, and Market-1501 with 85.1% mAP / 87.2% rank1, 82.4% mAP / 84.9% rank1, and 91.5% mAP / 96.3% rank1, respectively.

Submitted 26 January, 2021; originally announced January 2021.

Comments: 5 pages, 1 figure

arXiv:2009.14639 [pdf, other]

Dissected 3D CNNs: Temporal Skip Connections for Efficient Online Video Processing

Authors: Okan Köpüklü, Stefan Hörmann, Fabian Herzog, Hakan Cevikalp, Gerhard Rigoll

Abstract: Convolutional Neural Networks with 3D kernels (3D-CNNs) currently achieve state-of-the-art results in video recognition tasks due to their supremacy in extracting spatiotemporal features within video frames. There have been many successful 3D-CNN architectures surpassing the state-of-the-art results successively. However, nearly all of them are designed to operate offline creating several serious handicaps during online operation. Firstly, conventional 3D-CNNs are not dynamic since their output features represent the complete input clip instead of the most recent frame in the clip. Secondly, they are not temporal resolution-preserving due to their inherent temporal downsampling. Lastly, 3D-CNNs are constrained to be used with fixed temporal input size limiting their flexibility. In order to address these drawbacks, we propose dissected 3D-CNNs, where the intermediate volumes of the network are dissected and propagated over depth (time) dimension for future calculations, substantially reducing the number of computations at online operation. For action classification, the dissected version of ResNet models performs 77-90% fewer computations at online operation while achieving ~5% better classification accuracy on the Kinetics-600 dataset than conventional 3D-ResNet models. Moreover, the advantages of dissected 3D-CNNs are demonstrated by deploying our approach onto several vision tasks, which consistently improved the performance.

Submitted 18 October, 2021; v1 submitted 30 September, 2020; originally announced September 2020.

arXiv:2006.01615 [pdf, ps, other]

A Multi-Task Comparator Framework for Kinship Verification

Authors: Stefan Hörmann, Martin Knoche, Gerhard Rigoll

Abstract: Approaches for kinship verification often rely on cosine distances between face identification features. However, due to gender bias inherent in these features, it is hard to reliably predict whether two opposite-gender pairs are related. Instead of fine-tuning the feature extractor network on kinship verification, we propose a comparator network to cope with this bias. After concatenating both features, cascaded local expert networks extract the information most relevant for their corresponding kinship relation. We demonstrate that our framework is robust against this gender bias and achieves comparable results on two tracks of the RFIW Challenge 2020. Moreover, we show how our framework can be further extended to handle partially known or unknown kinship relations.

Submitted 2 June, 2020; originally announced June 2020.

Comments: To be published in IEEE FG 2020 - RFIW Workshop

arXiv:1904.09007 [pdf, other]

DeepLocalization: Landmark-based Self-Localization with Deep Neural Networks

Authors: Nico Engel, Stefan Hoermann, Markus Horn, Vasileios Belagiannis, Klaus Dietmayer

Abstract: We address the problem of vehicle self-localization from multi-modal sensor information and a reference map. The map is generated off-line by extracting landmarks from the vehicle's field of view, while the measurements are collected similarly on the fly. Our goal is to determine the autonomous vehicle's pose from the landmark measurements and map landmarks. To learn this mapping, we propose DeepLocalization, a deep neural network that regresses the vehicle's translation and rotation parameters from unordered and dynamic input landmarks. The proposed network architecture is robust to changes of the dynamic environment and can cope with a small number of extracted landmarks. During the training process we rely on synthetically generated ground-truth. In our experiments, we evaluate two inference approaches in real-world scenarios. We show that DeepLocalization can be combined with regular GPS signals and filtering algorithms such as the extended Kalman filter. Our approach achieves state-of-the-art accuracy and is about ten times faster than the related work.

Submitted 19 July, 2019; v1 submitted 18 April, 2019; originally announced April 2019.

Comments: Accepted for publication by the IEEE Intelligent Transportation Systems Conference (ITSC 2019), Auckland, New Zealand

arXiv:1901.09615 [pdf, other]

Convolutional Neural Networks with Layer Reuse

Authors: Okan Köpüklü, Maryam Babaee, Stefan Hörmann, Gerhard Rigoll

Abstract: A convolutional layer in a Convolutional Neural Network (CNN) consists of many filters which apply a convolution operation to the input, capture some special patterns and pass the result to the next layer. If the same patterns also occur at the deeper layers of the network, why wouldn't the same convolutional filters be used also in those layers? In this paper, we propose a CNN architecture, Layer Reuse Network (LruNet), where the convolutional layers are used repeatedly without the need of introducing new layers to get a better performance. This approach introduces several advantages: (i) a considerable number of parameters is saved since we are reusing the layers instead of introducing new layers, (ii) the Memory Access Cost (MAC) can be reduced since reused layer parameters can be fetched only once, (iii) the number of nonlinearities increases with layer reuse, and (iv) reused layers get gradient updates from multiple parts of the network. The proposed approach is evaluated on CIFAR-10, CIFAR-100 and Fashion-MNIST datasets for image classification task, and layer reuse improves the performance by 5.14%, 5.85% and 2.29%, respectively. The source code and pretrained models are publicly available.

Submitted 1 February, 2019; v1 submitted 28 January, 2019; originally announced January 2019.

Comments: Computer Vision and Pattern Recognition

arXiv:1809.03782 [pdf, other]

Long-Term Occupancy Grid Prediction Using Recurrent Neural Networks

Authors: Marcel Schreiber, Stefan Hoermann, Klaus Dietmayer

Abstract: We tackle the long-term prediction of scene evolution in a complex downtown scenario for automated driving based on Lidar grid fusion and recurrent neural networks (RNNs). A bird's eye view of the scene, including occupancy and velocity, is fed as a sequence to an RNN which is trained to predict future occupancy. The nature of prediction allows generation of multiple hours of training data without the need of manual labeling. Thus, the training strategy and loss function are designed for long sequences of real-world data (unbalanced, continuously changing situations, false labels, etc.). The deep CNN architecture comprises convolutional long short-term memories (ConvLSTMs) to separate static from dynamic regions and to predict dynamic objects in future frames. Novel recurrent skip connections show the ability to predict small occluded objects, i.e. pedestrians, and occluded static regions. Spatio-temporal correlations between grid cells are exploited to predict multimodal future paths and interactions between objects. Experiments also quantify improvements to our previous network, a Monte Carlo approach, and literature.

Submitted 7 June, 2019; v1 submitted 11 September, 2018; originally announced September 2018.

Comments: 8 pages, 10 figures

arXiv:1805.08986 [pdf, other]

Deep Object Tracking on Dynamic Occupancy Grid Maps Using RNNs

Authors: Nico Engel, Stefan Hoermann, Philipp Henzler, Klaus Dietmayer

Abstract: The comprehensive representation and understanding of the driving environment is crucial to improve the safety and reliability of autonomous vehicles. In this paper, we present a new approach to establish an environment model containing a segmentation between static and dynamic background and parametric modeled objects with shape, position and orientation. Multiple laser scanners are fused into a dynamic occupancy grid map resulting in a 360° perception of the environment. A single-stage deep convolutional neural network is combined with a recurrent neural network, which takes a time series of the occupancy grid map as input and tracks cell states and their corresponding object hypotheses. The labels for training are created unsupervised with an automatic label generation algorithm. The proposed methods are evaluated in real-world experiments in complex inner city scenarios using the aforementioned 360° laser perception. The results show a better object detection accuracy in comparison with our old approach as well as an AUC score of 0.946 for the dynamic and static segmentation. Furthermore, we gain an improved detection for occluded objects and a more consistent size estimation due to the usage of time series as input and the memory about previous states introduced by the recurrent neural network.

Submitted 23 May, 2018; originally announced May 2018.

arXiv:1804.03933 [pdf, other]

Offline Object Extraction from Dynamic Occupancy Grid Map Sequences

Authors: Daniel Stumper, Fabian Gies, Stefan Hoermann, Klaus Dietmayer

Abstract: A dynamic occupancy grid map (DOGMa) allows a fast, robust, and complete environment representation for automated vehicles. Dynamic objects in a DOGMa, however, are commonly represented as independent cells while modeled objects with shape and pose are favorable. The evaluation of algorithms for object extraction or the training and validation of learning algorithms rely on labeled ground truth data. Manually annotating objects in a DOGMa to obtain ground truth data is a time-consuming and expensive process. Additionally, the quality of labeled data depends strongly on the variation of filtered input data. The presented work introduces an automatic labeling process, where a full sequence is used to extract the best possible object pose and shape in terms of temporal consistency. A two direction temporal search is executed to trace single objects over a sequence, where the best estimate of its extent and pose is refined in every time step. Furthermore, the presented algorithm only uses statistical constraints of the cell clusters for the object extraction instead of fixed heuristic parameters. Experimental results show a well-performing automatic labeling algorithm with real sensor data even at challenging scenarios.

Submitted 11 April, 2018; originally announced April 2018.

Comments: 8 Pages, 7 Figures, submitted to IEEE IV2018, waiting for acceptance

arXiv:1802.02202 [pdf, other]

Object Detection on Dynamic Occupancy Grid Maps Using Deep Learning and Automatic Label Generation

Authors: Stefan Hoermann, Philipp Henzler, Martin Bach, Klaus Dietmayer

Abstract: We tackle the problem of object detection and pose estimation in a shared space downtown environment. For perception, multiple laser scanners with 360° coverage were fused in a dynamic occupancy grid map (DOGMa). A single-stage deep convolutional neural network is trained to provide object hypotheses comprising shape, position, orientation and an existence score from a single input DOGMa. Furthermore, an algorithm for offline object extraction was developed to automatically label several hours of training data. The algorithm is based on a two-pass trajectory extraction, forward and backward in time. As is typical for engineered algorithms, the automatic label generation suffers from misdetections, which makes hard negative mining impractical. Therefore, we propose a loss function counteracting the high imbalance between mostly static background and extremely rare dynamic grid cells. Experiments indicate that the trained network has good generalization capabilities since it detects objects occasionally lost by the label algorithm. Evaluation reaches an average precision (AP) of 75.9%.

Submitted 30 January, 2018; originally announced February 2018.

arXiv:1705.08781 [pdf, other]

Dynamic Occupancy Grid Prediction for Urban Autonomous Driving: A Deep Learning Approach with Fully Automatic Labeling

Authors: Stefan Hoermann, Martin Bach, Klaus Dietmayer

Abstract: Long-term situation prediction plays a crucial role in the development of intelligent vehicles. A major challenge still to overcome is the prediction of complex downtown scenarios with multiple road users, e.g., pedestrians, bikes, and motor vehicles, interacting with each other. This contribution tackles this challenge by combining a Bayesian filtering technique for environment representation with machine learning as a long-term predictor. More specifically, a dynamic occupancy grid map is utilized as input to a deep convolutional neural network. This yields the advantage of using spatially distributed velocity estimates from a single time step for prediction, rather than a raw data sequence, alleviating common problems in dealing with input time series of multiple sensors. Furthermore, convolutional neural networks have the inherent characteristic of using context information, enabling the implicit modeling of road user interaction. Pixel-wise balancing is applied in the loss function, counteracting the extreme imbalance between static and dynamic cells. One of the major advantages is the unsupervised learning character due to fully automatic label generation. The presented algorithm is trained and evaluated on multiple hours of recorded sensor data and compared to Monte-Carlo simulation.

Submitted 7 November, 2017; v1 submitted 24 May, 2017; originally announced May 2017.

Showing 1–19 of 19 results for author: Hörmann, S