Search | arXiv e-print repository

arXiv:2407.19564 [pdf, other]

Forecast-PEFT: Parameter-Efficient Fine-Tuning for Pre-trained Motion Forecasting Models

Authors: Jifeng Wang, Kaouther Messaoud, Yuejiang Liu, Juergen Gall, Alexandre Alahi

Abstract: Recent progress in motion forecasting has been substantially driven by self-supervised pre-training. However, adapting pre-trained models for specific downstream tasks, especially motion prediction, through extensive fine-tuning is often inefficient. This inefficiency arises because motion prediction closely aligns with the masked pre-training tasks, and traditional full fine-tuning methods fail to fully leverage this alignment. To address this, we introduce Forecast-PEFT, a fine-tuning strategy that freezes the majority of the model's parameters, focusing adjustments on newly introduced prompts and adapters. This approach not only preserves the pre-learned representations but also significantly reduces the number of parameters that need retraining, thereby enhancing efficiency. This tailored strategy, supplemented by our method's capability to efficiently adapt to different datasets, enhances model efficiency and ensures robust performance across datasets without the need for extensive retraining. Our experiments show that Forecast-PEFT outperforms traditional full fine-tuning methods in motion prediction tasks, achieving higher accuracy with only 17% of the trainable parameters typically required. Moreover, our comprehensive adaptation, Forecast-FT, further improves prediction performance, evidencing up to a 9.6% enhancement over conventional baseline methods. Code will be available at https://github.com/csjfwang/Forecast-PEFT.

Submitted 28 July, 2024; originally announced July 2024.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2407.16079 [pdf, ps, other]

Towards the Discovery of New Elements: Production of Livermorium (Z=116) with 50Ti

Authors: J. M. Gates, R. Orford, D. Rudolph, C. Appleton, B. M. Barrios, J. Y. Benitez, M. Bordeau, W. Botha, C. M. Campbell, J. Chadderton, A. T. Chemey, R. M. Clark, H. L. Crawford, J. D. Despotopulos, O. Dorvaux, N. E. Esker, P. Fallon, C. M. Folden III, B. J. P. Gall, F. H. Garcia, P. Golubev, J. A. Gooding, M. Grebo, K. E. Gregorich, M. Guerrero, et al. (29 additional authors not shown)

Abstract: The $^{244}$Pu($^{50}$Ti,$xn$)$^{294-x}$Lv reaction was investigated at Lawrence Berkeley National Laboratory's 88-Inch Cyclotron facility. The experiment was aimed at the production of a superheavy element with $Z\ge 114$ by irradiating an actinide target with a beam heavier than $^{48}$Ca. Produced Lv ions were separated from the unwanted beam and nuclear reaction products using the Berkeley Gas-filled Separator and implanted into a newly commissioned focal plane detector system. Two decay chains were observed and assigned to the decay of $^{290}$Lv. The production cross section was measured to be $\sigma_{\rm prod}=0.44(^{+58}_{-28})$ pb at a center-of-target center-of-mass energy of 220(3) MeV. This represents the first published measurement of the production of a superheavy element near the 'Island-of-Stability' with a beam of $^{50}$Ti and is an essential precursor in the pursuit of searching for new elements beyond $Z=118$.

Submitted 22 July, 2024; originally announced July 2024.

Comments: Submitted to Physical Review Letters

arXiv:2407.13772 [pdf, other]

GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model

Authors: Abdelrahman Shaker, Syed Talal Wasim, Salman Khan, Juergen Gall, Fahad Shahbaz Khan

Abstract: Recent advancements in state-space models (SSMs) have showcased effective performance in modeling long-range dependencies with subquadratic complexity. However, pure SSM-based models still face challenges related to stability and achieving optimal performance on computer vision tasks. Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability and inefficiency of large model sizes. To address this, we introduce a Modulated Group Mamba layer which divides the input channels into four groups and applies our proposed SSM-based efficient Visual Single Selective Scanning (VSSS) block independently to each group, with each VSSS block scanning in one of the four spatial directions. The Modulated Group Mamba layer also wraps the four VSSS blocks into a channel modulation operator to improve cross-channel communication. Furthermore, we introduce a distillation-based training objective to stabilize the training of large models, leading to consistent performance gains. Our comprehensive experiments demonstrate the merits of the proposed contributions, leading to superior performance over existing methods for image classification on ImageNet-1K, object detection and instance segmentation on MS-COCO, and semantic segmentation on ADE20K. Our tiny variant with 23M parameters achieves state-of-the-art performance with a classification top-1 accuracy of 83.3% on ImageNet-1K, while being 26% more efficient in terms of parameters compared to the best existing Mamba design of the same model size. Our code and models are available at: https://github.com/Amshaker/GroupMamba.

Submitted 18 July, 2024; originally announced July 2024.

Comments: Preprint. Our code and models are available at: https://github.com/Amshaker/GroupMamba

arXiv:2407.13544 [pdf, other]

Drilling holes in the Brownian disk: The Brownian annulus

Authors: Jean-François Le Gall, Alexis Metz-Donnadieu

Abstract: We give a new construction of the Brownian annulus based on removing a hull centered at the distinguished point in the free Brownian disk. We use this construction to prove that the Brownian annulus is the scaling limit of Boltzmann triangulations with two boundaries. We also prove that the space obtained by removing hulls centered at the two distinguished points of the Brownian sphere is a Brownian annulus. Our proofs rely on a detailed analysis of the peeling by layers algorithm for Boltzmann triangulations with a boundary.

Submitted 18 July, 2024; originally announced July 2024.

Comments: 47 pages, 4 figures

MSC Class: 60D05; 60F17

arXiv:2407.11954 [pdf, other]

Gated Temporal Diffusion for Stochastic Long-Term Dense Anticipation

Authors: Olga Zatsarynna, Emad Bahrami, Yazan Abu Farha, Gianpiero Francesca, Juergen Gall

Abstract: Long-term action anticipation has become an important task for many applications such as autonomous driving and human-robot interaction. Unlike short-term anticipation, predicting more actions into the future poses a real challenge because uncertainty increases over longer horizons. While there has been significant progress in predicting more actions into the future, most of the proposed methods address the task in a deterministic setup and ignore the underlying uncertainty. In this paper, we propose a novel Gated Temporal Diffusion (GTD) network that models the uncertainty of both the observation and the future predictions. As the generator, we introduce a Gated Anticipation Network (GTAN) to model both observed and unobserved frames of a video in a mutual representation. On the one hand, using a mutual representation for past and future allows us to jointly model ambiguities in the observation and future, while on the other hand GTAN can by design treat the observed and unobserved parts differently and steer the information flow between them. Our model achieves state-of-the-art results on the Breakfast, Assembly101 and 50Salads datasets in both stochastic and deterministic settings. Code: https://github.com/olga-zats/GTDA.

Submitted 16 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024

arXiv:2407.09431 [pdf, other]

Rethinking temporal self-similarity for repetitive action counting

Authors: Yanan Luo, Jinhui Yi, Yazan Abu Farha, Moritz Wolter, Juergen Gall

Abstract: Counting repetitive actions in long untrimmed videos is a challenging task that has many applications such as rehabilitation. State-of-the-art methods predict action counts by first generating a temporal self-similarity matrix (TSM) from the sampled frames and then feeding the matrix to a predictor network. The self-similarity matrix, however, is not an optimal input to a network since it discards too much information from the frame-wise embeddings. We thus rethink how a TSM can be utilized for counting repetitive actions and propose a framework that learns embeddings and predicts action start probabilities at full temporal resolution. The number of repeated actions is then inferred from the action start probabilities. In contrast to current approaches that have the TSM as an intermediate representation, we propose a novel loss based on a generated reference TSM, which enforces that the self-similarity of the learned frame-wise embeddings is consistent with the self-similarity of repeated actions. The proposed framework achieves state-of-the-art results on three datasets, i.e., RepCount, UCFRep, and Countix.

Submitted 12 July, 2024; originally announced July 2024.

Comments: Accepted to ICIP 2024

arXiv:2405.08909 [pdf, other]

ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association

Authors: Shuxiao Ding, Lukas Schneider, Marius Cordts, Juergen Gall

Abstract: Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the tracking-by-attention paradigm, utilizing track queries for identity-consistent detection and object queries for identity-agnostic track spawning. Tracking-by-attention, however, entangles detection and tracking queries in one embedding for both the detection and tracking task, which is sub-optimal. Other approaches resemble the tracking-by-detection paradigm, detecting objects using decoupled track and detection queries followed by a subsequent association. These methods, however, do not leverage synergies between the detection and association task. Combining the strengths of both paradigms, we introduce ADA-Track, a novel end-to-end framework for 3D MOT from multi-view cameras. We introduce a learnable data association module based on edge-augmented cross-attention, leveraging appearance and geometric features. Furthermore, we integrate this association module into the decoder layer of a DETR-based 3D detector, enabling simultaneous DETR-like query-to-image cross-attention for detection and query-to-query cross-attention for data association. By stacking these decoder layers, queries are refined for the detection and association task alternately, effectively harnessing the task dependencies. We evaluate our method on the nuScenes dataset and demonstrate the advantage of our approach compared to the two previous paradigms. Code is available at https://github.com/dsx0511/ADA-Track.

Submitted 14 May, 2024; originally announced May 2024.

Comments: 14 pages, 3 figures, accepted by CVPR 2024

arXiv:2404.18489 [pdf, other]

Peeling the Brownian half-plane

Authors: Jean-François Le Gall, Armand Riera

Abstract: We establish a new spatial Markov property of the Brownian half-plane. According to this property, if one removes a hull centered at a boundary point, the remaining space equipped with an intrinsic metric is still a Brownian half-plane, which is independent of the part that has been removed. This is an analog of the well-known peeling procedure for random planar maps. We also investigate several distributional properties of hulls centered at a boundary point, and we provide a new construction of the Brownian half-plane giving information about distances from a half-boundary.

Submitted 29 April, 2024; originally announced April 2024.

Comments: 28 pages, 2 figures

MSC Class: 60D05

arXiv:2402.18319 [pdf, other]

A Multimodal Handover Failure Detection Dataset and Baselines

Authors: Santosh Thoduka, Nico Hochgeschwender, Juergen Gall, Paul G. Plöger

Abstract: An object handover between a robot and a human is a coordinated action which is prone to failure for reasons such as miscommunication, incorrect actions and unexpected object properties. Existing works on handover failure detection and prevention focus on preventing failures due to object slip or external disturbances. However, there is a lack of datasets and evaluation methods that consider unpreventable failures caused by the human participant. To address this deficit, we present the multimodal Handover Failure Detection dataset, which consists of failures induced by the human participant, such as ignoring the robot or not releasing the object. We also present two baseline methods for handover failure detection: (i) a video classification method using 3D CNNs and (ii) a temporal action segmentation approach which jointly classifies the human action, robot action and overall outcome of the action. The results show that video is an important modality, but using force-torque data and gripper position helps improve failure detection and action segmentation accuracy.

Submitted 28 February, 2024; originally announced February 2024.

Comments: Accepted at ICRA 2024

arXiv:2401.05074 [pdf, other]

doi 10.1016/j.enbuild.2024.113968

Occupancy Prediction for Building Energy Systems with Latent Force Models

Authors: Thore Wietzke, Jan Gall, Knut Graichen

Abstract: This paper presents a new approach to predict the occupancy for building energy systems (BES). A Gaussian Process (GP) is used to model the occupancy and is represented as a state space model that is equivalent to the full GP if Kalman filtering and smoothing are used. The combination of GPs and mechanistic models is called a Latent Force Model (LFM). An LFM-based model predictive control (MPC) concept for BES is presented that benefits from the extrapolation capability of mechanistic models and the learning ability of GPs to predict the occupancy within the building. Simulations with EnergyPlus and a comparison with real-world data from the Bosch Research Campus in Renningen show that reduced energy demand and thermal discomfort can be obtained with the LFM-based MPC scheme by accounting for the predicted stochastic occupancy.

Submitted 6 February, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

Comments: submitted to Energy and Buildings, data and code available at https://github.com/ThoreWietzke/occupancy-benchmark-dataset

Journal ref: Energy and Buildings, Volume 307 (2024) 113968

arXiv:2312.15289 [pdf, other]

Fréchet Wavelet Distance: A Domain-Agnostic Metric for Image Generation

Authors: Lokesh Veeramacheneni, Moritz Wolter, Hildegard Kuehne, Juergen Gall

Abstract: Modern metrics for generative learning like Fréchet Inception Distance (FID) demonstrate impressive performance. However, they suffer from various shortcomings, like a bias towards specific generators and datasets. To address this problem, we propose the Fréchet Wavelet Distance (FWD) as a domain-agnostic metric based on the Wavelet Packet Transform ($W_p$). FWD provides a view across a broad spectrum of frequencies in high-resolution images while preserving both spatial and textural aspects. Specifically, we use $W_p$ to project generated and dataset images to packet coefficient space. Further, we compute the Fréchet distance with the resultant coefficients to evaluate the quality of a generator. This metric is general-purpose and dataset-domain agnostic, as it does not rely on any pre-trained network, while being more interpretable because of frequency band transparency. We conclude with an extensive evaluation of a wide variety of generators across various datasets that the proposed FWD is able to generalize and improve robustness to domain shift and various corruptions compared to other metrics.

Submitted 10 June, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

arXiv:2312.08892 [pdf, other]

VaLID: Variable-Length Input Diffusion for Novel View Synthesis

Authors: Shijie Li, Farhad G. Zanjani, Haitam Ben Yahia, Yuki M. Asano, Juergen Gall, Amirhossein Habibian

Abstract: Novel View Synthesis (NVS), which tries to produce a realistic image at the target view given source view images and their corresponding poses, is a fundamental problem in 3D Vision. As this task is heavily under-constrained, some recent work, like Zero123, tries to solve this problem with generative modeling, specifically using pre-trained diffusion models. Although this strategy generalizes well to new scenes, compared to neural radiance field-based methods, it offers low levels of flexibility. For example, it can only accept a single-view image as input, despite realistic applications often offering multiple input images. This is because the source-view images and corresponding poses are processed separately and injected into the model at different stages. Thus it is not trivial to generalize the model to multi-view source images once they are available. To solve this issue, we try to process each pose-image pair separately and then fuse them as a unified visual representation which will be injected into the model to guide image synthesis at the target views. However, inconsistency and computation costs increase as the number of input source-view images increases. To solve these issues, the Multi-view Cross Former module is proposed, which maps variable-length input data to fixed-size output data. A two-stage training strategy is introduced to further improve the efficiency during training time. Qualitative and quantitative evaluation over multiple datasets demonstrates the effectiveness of the proposed method against previous approaches. The code will be released upon acceptance.

Submitted 14 December, 2023; originally announced December 2023.

Comments: paper and supplementary material

arXiv:2311.15991 [pdf, other]

DiffAnt: Diffusion Models for Action Anticipation

Authors: Zeyun Zhong, Chengzhi Wu, Manuel Martin, Michael Voit, Juergen Gall, Jürgen Beyerer

Abstract: Anticipating future actions is inherently uncertain. Given an observed video segment containing ongoing actions, multiple subsequent actions can plausibly follow. This uncertainty becomes even larger when predicting far into the future. However, the majority of existing action anticipation models adhere to a deterministic approach, neglecting to account for future uncertainties. In this work, we rethink action anticipation from a generative view, employing diffusion models to capture different possible future actions. In this framework, future actions are iteratively generated from standard Gaussian noise in the latent space, conditioned on the observed video, and subsequently transitioned into the action space. Extensive experiments on four benchmark datasets, i.e., Breakfast, 50Salads, EpicKitchens, and EGTEA Gaze+, are performed and the proposed method achieves superior or comparable results to state-of-the-art methods, showing the effectiveness of a generative approach for action anticipation. Our code and trained models will be published on GitHub.

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2309.17257 [pdf, other]

A Survey on Deep Learning Techniques for Action Anticipation

Authors: Zeyun Zhong, Manuel Martin, Michael Voit, Juergen Gall, Jürgen Beyerer

Abstract: The ability to anticipate possible future human actions is essential for a wide range of applications, including autonomous driving and human-robot interaction. Consequently, numerous methods have been introduced for action anticipation in recent years, with deep learning-based approaches being particularly popular. In this work, we review the recent advances of action anticipation algorithms with a particular focus on daily-living scenarios. Additionally, we classify these methods according to their primary contributions and summarize them in tabular form, allowing readers to grasp the details at a glance. Furthermore, we delve into the common evaluation metrics and datasets used for action anticipation and provide future directions with systematic discussions.

Submitted 29 September, 2023; originally announced September 2023.

Comments: Submitted to TPAMI

arXiv:2309.07849 [pdf, other]

TFNet: Exploiting Temporal Cues for Fast and Accurate LiDAR Semantic Segmentation

Authors: Rong Li, ShiJie Li, Xieyuanli Chen, Teli Ma, Juergen Gall, Junwei Liang

Abstract: LiDAR semantic segmentation plays a crucial role in enabling autonomous driving and robots to understand their surroundings accurately and robustly. A multitude of methods exist within this domain, including point-based, range-image-based, polar-coordinate-based, and hybrid strategies. Among these, range-image-based techniques have gained widespread adoption in practical applications due to their efficiency. However, they face a significant challenge known as the "many-to-one" problem caused by the range image's limited horizontal and vertical angular resolution. As a result, around 20% of the 3D points can be occluded. In this paper, we present TFNet, a range-image-based LiDAR semantic segmentation method that utilizes temporal information to address this issue. Specifically, we incorporate a temporal fusion layer to extract useful information from previous scans and integrate it with the current scan. We then design a max-voting-based post-processing technique to correct false predictions, particularly those caused by the "many-to-one" issue. We evaluated the approach on two benchmarks and demonstrated that the plug-in post-processing technique is generic and can be applied to various networks.

Submitted 14 April, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: accepted by CVPR2024 Workshop on Autonomous Driving

arXiv:2309.06899 [pdf, ps, other]

A stochastic differential equation for local times of super-Brownian motion

Authors: Jean-François Le Gall, Edwin Perkins

Abstract: We show that local times of super-Brownian motion, or of Brownian motion indexed by the Brownian tree, satisfy an explicit stochastic differential equation. Our proofs rely on both excursion theory for the Brownian snake and tools from the theory of superprocesses.

Submitted 13 September, 2023; originally announced September 2023.

Comments: 32 pages

MSC Class: 60J55; 60J68; 60H10

arXiv:2308.11358 [pdf, other]

How Much Temporal Long-Term Context is Needed for Action Segmentation?

Authors: Emad Bahrami, Gianpiero Francesca, Juergen Gall

Abstract: Modeling long-term context in videos is crucial for many fine-grained tasks including temporal action segmentation. An interesting question that is still open is how much long-term temporal context is needed for optimal performance. While transformers can model the long-term context of a video, this becomes computationally prohibitive for long videos. Recent works on temporal action segmentation thus combine temporal convolutional networks with self-attentions that are computed only for a local temporal window. While these approaches show good results, their performance is limited by their inability to capture the full context of a video. In this work, we try to answer how much long-term temporal context is required for temporal action segmentation by introducing a transformer-based model that leverages sparse attention to capture the full context of a video. We compare our model with the current state of the art on three datasets for temporal action segmentation, namely 50Salads, Breakfast, and Assembly101. Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.

Submitted 25 September, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

Comments: ICCV 2023

arXiv:2308.11356 [pdf, other]

Semantic RGB-D Image Synthesis

Authors: Shijie Li, Rong Li, Juergen Gall

Abstract: Collecting diverse sets of training images for RGB-D semantic image segmentation is not always possible. In particular, when robots need to operate in privacy-sensitive areas like homes, the collection is often limited to a small set of locations. As a consequence, the annotated images lack diversity in appearance and approaches for RGB-D semantic image segmentation tend to overfit the training data. In this paper, we thus introduce semantic RGB-D image synthesis to address this problem. It requires synthesising a realistic-looking RGB-D image for a given semantic label map. Current approaches, however, are uni-modal and cannot cope with multi-modal data. Indeed, we show that extending uni-modal approaches to multi-modal data does not perform well. In this paper, we therefore propose a generator for multi-modal data that separates modal-independent information of the semantic layout from the modal-dependent information that is needed to generate an RGB and a depth image, respectively. Furthermore, we propose a discriminator that ensures semantic consistency between the label maps and the generated images and perceptual similarity between the real and generated images. Our comprehensive experiments demonstrate that the proposed method outperforms previous uni-modal methods by a large margin and that the accuracy of an approach for RGB-D semantic segmentation can be significantly improved by mixing real and generated images during training.

Submitted 18 September, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

Comments: ICCV Workshop on Representation Learning with Very Limited Images 2023

arXiv:2308.09717 [pdf, other]

Smoothness Similarity Regularization for Few-Shot GAN Adaptation

Authors: Vadim Sushko, Ruyu Wang, Juergen Gall

Abstract: The task of few-shot GAN adaptation aims to adapt a pre-trained GAN model to a small dataset with very few training images. While existing methods perform well when the dataset for pre-training is structurally similar to the target dataset, the approaches suffer from training instabilities or memorization issues when the objects in the two domains have a very different structure. To mitigate this limitation, we propose a new smoothness similarity regularization that transfers the inherently learned smoothness of the pre-trained GAN to the few-shot target domain even if the two domains are very different. We evaluate our approach by adapting an unconditional and a class-conditional GAN to diverse few-shot target domains. Our proposed method significantly outperforms prior few-shot GAN adaptation methods in the challenging case of structurally dissimilar source-target domains, while performing on par with the state of the art for similar source-target domains.

Submitted 18 August, 2023; originally announced August 2023.

Comments: International Conference on Computer Vision (ICCV) 2023

arXiv:2308.06635 [pdf, other]

3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking

Authors: Shuxiao Ding, Eike Rehder, Lukas Schneider, Marius Cordts, Juergen Gall

Abstract: Tracking 3D objects accurately and consistently is crucial for autonomous vehicles, enabling more reliable downstream tasks such as trajectory prediction and motion planning. Based on the substantial progress in object detection in recent years, the tracking-by-detection paradigm has become a popular choice due to its simplicity and efficiency. State-of-the-art 3D multi-object tracking (MOT) approaches typically rely on non-learned model-based algorithms such as Kalman Filter but require many manually tuned parameters. On the other hand, learning-based approaches face the problem of adapting the training to the online setting, leading to inevitable distribution mismatch between training and inference as well as suboptimal performance. In this work, we propose 3DMOTFormer, a learned geometry-based 3D MOT framework building upon the transformer architecture. We use an Edge-Augmented Graph Transformer to reason on the track-detection bipartite graph frame-by-frame and conduct data association via edge classification. To reduce the distribution mismatch between training and inference, we propose a novel online training strategy with an autoregressive and recurrent forward pass as well as sequential batch optimization. Using CenterPoint detections, our approach achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test split, respectively. In addition, a trained 3DMOTFormer model generalizes well across different object detectors. Code is available at: https://github.com/dsx0511/3DMOTFormer.

Submitted 12 August, 2023; originally announced August 2023.

Comments: 17 pages, 8 figures, accepted by ICCV2023

arXiv:2306.15045 [pdf, other]

Action Anticipation with Goal Consistency

Authors: Olga Zatsarynna, Juergen Gall

Abstract: In this paper, we address the problem of short-term action anticipation, i.e., we want to predict an upcoming action one second before it happens. We propose to harness high-level intent information to anticipate actions that will take place in the future. To this end, we incorporate an additional goal prediction branch into our model and propose a consistency loss function that encourages the anticipated actions to conform to the high-level goal pursued in the video. In our experiments, we show the effectiveness of the proposed approach and demonstrate that our method achieves state-of-the-art results on two large-scale datasets: Assembly101 and COIN.

Submitted 26 June, 2023; originally announced June 2023.

Comments: Accepted to ICIP 2023

arXiv:2306.10761 [pdf, other]

PowerBEV: A Powerful Yet Lightweight Framework for Instance Prediction in Bird's-Eye View

Authors: Peizheng Li, Shuxiao Ding, Xieyuanli Chen, Niklas Hanselmann, Marius Cordts, Juergen Gall

Abstract: Accurately perceiving instances and predicting their future motion are key tasks for autonomous vehicles, enabling them to navigate safely in complex urban traffic. While bird's-eye view (BEV) representations are commonplace in perception for autonomous driving, their potential in a motion prediction setting is less explored. Existing approaches for BEV instance prediction from surround cameras rely on a multi-task auto-regressive setup coupled with complex post-processing to predict future instances in a spatio-temporally consistent manner. In this paper, we depart from this paradigm and propose an efficient novel end-to-end framework named POWERBEV, which differs in several design choices aimed at reducing the inherent redundancy in previous methods. First, rather than predicting the future in an auto-regressive fashion, POWERBEV uses a parallel, multi-scale module built from lightweight 2D convolutional networks. Second, we show that segmentation and centripetal backward flow are sufficient for prediction, simplifying previous multi-task objectives by eliminating redundant output modalities. Building on this output representation, we propose a simple, flow warping-based post-processing approach which produces more stable instance associations across time. Through this lightweight yet powerful design, POWERBEV outperforms state-of-the-art baselines on the NuScenes Dataset and poses an alternative paradigm for BEV instance prediction. We made our code publicly available at: https://github.com/EdwardLeeLPZ/PowerBEV.

Submitted 19 June, 2023; originally announced June 2023.

Comments: 12 pages, 8 figures. This paper is accepted by IJCAI2023. Peizheng Li and Shuxiao Ding contributed equally to this work

arXiv:2306.05807 [pdf, other]

A Gated Attention Transformer for Multi-Person Pose Tracking

Authors: Andreas Doering, Juergen Gall

Abstract: Multi-person pose tracking is an important element for many applications and requires estimating the human poses of all persons in a video and tracking them over time. The association of poses across frames remains an open research problem, in particular for online tracking methods, due to motion blur, crowded scenes and occlusions. To tackle the association challenge, we propose a Gated Attention Transformer. The core aspect of our model is the gating mechanism that automatically adapts the impact of appearance embeddings and embeddings based on temporal pose similarity in the attention layers. In order to re-identify persons that have been occluded, we incorporate a pose-conditioned re-identification network that provides initial embeddings and allows persons to be matched even if the number of visible joints differs between frames. We further propose a matching layer based on gated attention for pose-to-track association and duplicate removal. We evaluate our approach on PoseTrack 2018 and PoseTrack21.

Submitted 21 August, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: Accepted to ICCVW23

arXiv:2302.01138 [pdf, other]

Spatial Markov property in Brownian disks

Authors: Jean-François Le Gall, Armand Riera

Abstract: We derive a new representation of the Brownian disk in terms of a forest of labeled trees, where labels correspond to distances from a subset of the boundary. We then use this representation to obtain a spatial Markov property showing that the complement of a hull centered at a boundary point of a Brownian disk is again a Brownian disk, with a random perimeter, and is independent of the hull conditionally on its perimeter. Our proofs rely in part on a study of the peeling process for triangulations with a boundary, which is of independent interest. The results of the present work will be applied to a continuous version of the peeling process for the Brownian half-plane in a companion paper.

Submitted 29 April, 2024; v1 submitted 2 February, 2023; originally announced February 2023.

Comments: 46 pages, 3 figures

MSC Class: 60D05; 05C80

arXiv:2212.08208 [pdf, other]

doi 10.1109/TGRS.2023.3285401

Location-aware Adaptive Normalization: A Deep Learning Approach For Wildfire Danger Forecasting

Authors: Mohamad Hakam Shams Eddin, Ribana Roscher, Juergen Gall

Abstract: Climate change is expected to intensify and increase extreme events in the weather cycle. Since this has a significant impact on various sectors of our life, recent works are concerned with identifying and predicting such extreme events from Earth observations. With respect to wildfire danger forecasting, previous deep learning approaches duplicate static variables along the time dimension and neglect the intrinsic differences between static and dynamic variables. Furthermore, most existing multi-branch architectures lose the interconnections between the branches during the feature learning stage. To address these issues, this paper proposes a 2D/3D two-branch convolutional neural network (CNN) with a Location-aware Adaptive Normalization layer (LOAN). Using LOAN as a building block, we can modulate the dynamic features conditional on their geographical locations. Thus, our approach considers feature properties as a unified yet compound 2D/3D model. Besides, we propose using the sinusoidal-based encoding of the day of the year to provide the model with explicit temporal information about the target day within the year. Our experimental results show a better performance of our approach than other baselines on the challenging FireCube dataset. The results show that location-aware adaptive feature normalization is a promising technique to learn the relation between dynamic variables and their geographic locations, which is highly relevant for areas where remote sensing data builds the basis for analysis. The source code is available at https://github.com/HakamShams/LOAN.

Submitted 7 April, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

Journal ref: in IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1-18, 2023, Art no. 4703018
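
The abstract above mentions a sinusoidal encoding of the day of the year. The listing does not give the exact formula, so the following is only a minimal sketch of one common cyclical encoding, under the assumption that the day is mapped to a sine/cosine pair on the unit circle. The function name `encode_day_of_year` and the default of 365 days are hypothetical choices for illustration, not taken from the paper or its code.

```python
import math
from typing import Tuple


def encode_day_of_year(day: int, days_in_year: int = 365) -> Tuple[float, float]:
    """Map a 1-based day of the year to a (sin, cos) pair on the unit circle.

    Assumption: a single-frequency encoding with a period of one year.
    The paper may use a different or richer formulation.
    """
    if not 1 <= day <= days_in_year:
        raise ValueError("day must lie in [1, days_in_year]")
    # Day 1 maps to angle 0; the last day sits just before a full turn.
    angle = 2.0 * math.pi * (day - 1) / days_in_year
    return math.sin(angle), math.cos(angle)


def encoding_distance(day_a: int, day_b: int) -> float:
    """Euclidean distance between the encodings of two days."""
    sin_a, cos_a = encode_day_of_year(day_a)
    sin_b, cos_b = encode_day_of_year(day_b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)
```

The point of such an encoding is that 31 December and 1 January end up close together, whereas a raw day index would place them at opposite ends of the range.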

arXiv:2211.08041 [pdf, ps, other]

The Markov property of local times of Brownian motion indexed by the Brownian tree

Authors: Jean-François Le Gall

Abstract: We consider the model of Brownian motion indexed by the Brownian tree, which has appeared in a variety of different contexts in probability, statistical physics and combinatorics. For this model, the total occupation measure is known to have a continuously differentiable density. Although the density process indexed by nonnegative reals is not Markov, we prove that the pair consisting of the density and its derivative is a time-homogeneous Markov process. We also establish a similar result for the local times of one-dimensional super-Brownian motion. Our methods rely on the excursion theory for Brownian motion indexed by the Brownian tree.

Submitted 14 June, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: Revised version, taking account of all remarks made by referees. In particular, an error at the end of the proof of Lemma 8 has been corrected. A couple of arguments have also been simplified

MSC Class: 60J55; 60J65; 60J68; 60J80

arXiv:2210.06501 [pdf, other]

Robust Action Segmentation from Timestamp Supervision

Authors: Yaser Souri, Yazan Abu Farha, Emad Bahrami, Gianpiero Francesca, Juergen Gall

Abstract: Action segmentation is the task of predicting an action label for each frame of an untrimmed video. As obtaining annotations to train an approach for action segmentation in a fully supervised way is expensive, various approaches have been proposed to train action segmentation models using different forms of weak supervision, e.g., action transcripts, action sets, or more recently timestamps. Timestamp supervision is a promising type of weak supervision as obtaining one timestamp per action is less expensive than annotating all frames, but it provides more information than other forms of weak supervision. However, previous works assume that every action instance is annotated with a timestamp, which is a restrictive assumption since it assumes that annotators do not miss any action. In this work, we relax this restrictive assumption and take missing annotations for some action instances into account. We show that our approach is more robust to missing annotations compared to other approaches and various baselines.

Submitted 12 October, 2022; originally announced October 2022.

Comments: BMVC 2022

arXiv:2210.04085 [pdf, other]

Dual Pyramid Generative Adversarial Networks for Semantic Image Synthesis

Authors: Shijie Li, Ming-Ming Cheng, Juergen Gall

Abstract: The goal of semantic image synthesis is to generate photo-realistic images from semantic label maps. It is highly relevant for tasks like content generation and image editing. Current state-of-the-art approaches, however, still struggle to generate realistic objects in images at various scales. In particular, small objects tend to fade away and large objects are often generated as collages of patches. In order to address this issue, we propose a Dual Pyramid Generative Adversarial Network (DP-GAN) that learns the conditioning of spatially-adaptive normalization blocks at all scales jointly, such that scale information is bi-directionally used, and it unifies supervision at different scales. Our qualitative and quantitative results show that the proposed approach generates images where small and large objects look more realistic compared to images generated by state-of-the-art methods.

Submitted 8 October, 2022; originally announced October 2022.

Comments: BMVC2022

arXiv:2209.12074 [pdf, other]

Self-supervised Learning for Unintentional Action Prediction

Authors: Olga Zatsarynna, Yazan Abu Farha, Juergen Gall

Abstract: Distinguishing if an action is performed as intended or if an intended action fails is an important skill that not only humans have, but that is also important for intelligent systems that operate in human environments. Recognizing if an action is unintentional or anticipating if an action will fail, however, is not straightforward due to lack of annotated data. While videos of unintentional or failed actions can be found on the Internet in abundance, high annotation costs are a major bottleneck for learning networks for these tasks. In this work, we thus study the problem of self-supervised representation learning for unintentional action prediction. While previous works learn the representation based on a local temporal neighborhood, we show that the global context of a video is needed to learn a good representation for the three downstream tasks: unintentional action classification, localization and anticipation. In the supplementary material, we show that the learned representation can be used for detecting anomalies in videos as well.

Submitted 24 September, 2022; originally announced September 2022.

Comments: Accepted to GCPR 2022

arXiv:2209.07547 [pdf, other]

One-Shot Synthesis of Images and Segmentation Masks

Authors: Vadim Sushko, Dan Zhang, Juergen Gall, Anna Khoreva

Abstract: Joint synthesis of images and segmentation masks with generative adversarial networks (GANs) is promising to reduce the effort needed for collecting image data with pixel-wise annotations. However, to learn high-fidelity image-mask synthesis, existing GAN approaches first need a pre-training phase requiring large amounts of image data, which limits their utilization in restricted image domains. In this work, we take a step to reduce this limitation, introducing the task of one-shot image-mask synthesis. We aim to generate diverse images and their segmentation masks given only a single labelled example, and assuming, contrary to previous models, no access to any pre-training data. To this end, inspired by the recent architectural developments of single-image GANs, we introduce our OSMIS model which enables the synthesis of segmentation masks that are precisely aligned to the generated images in the one-shot regime. Besides achieving the high fidelity of generated masks, OSMIS outperforms state-of-the-art single-image GAN models in image synthesis quality and diversity. In addition, despite not using any additional data, OSMIS demonstrates an impressive ability to serve as a source of useful data augmentation for one-shot segmentation applications, providing performance gains that are complementary to standard data augmentation techniques. Code is available at https://github.com/boschresearch/one-shot-synthesis

Submitted 15 September, 2022; originally announced September 2022.

Comments: Accepted as a conference paper at IEEE Winter Conference on Applications of Computer Vision (WACV) 2023

arXiv:2209.00638 [pdf, other]

Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation

Authors: Nadine Behrmann, S. Alireza Golestaneh, Zico Kolter, Juergen Gall, Mehdi Noroozi

Abstract: This paper introduces a unified framework for video action segmentation via sequence to sequence (seq2seq) translation in a fully and timestamp supervised setup. In contrast to current state-of-the-art frame-level prediction methods, we view action segmentation as a seq2seq translation task, i.e., mapping a sequence of video frames to a sequence of action segments. Our proposed method involves a series of modifications and auxiliary loss functions on the standard Transformer seq2seq translation model to cope with long input sequences as opposed to short output sequences and relatively few videos. We incorporate an auxiliary supervision signal for the encoder via a frame-wise loss and propose a separate alignment decoder for an implicit duration prediction. Finally, we extend our framework to the timestamp supervised setting via our proposed constrained k-medoids algorithm to generate pseudo-segmentations. Our proposed framework performs consistently on both fully and timestamp supervised settings, outperforming or competing with the state of the art on several datasets. Our code is publicly available at https://github.com/boschresearch/UVAST.

Submitted 11 October, 2022; v1 submitted 1 September, 2022; originally announced September 2022.

Comments: ECCV 2022 (Main Conference)

arXiv:2207.11427 [pdf, other]

doi 10.1088/1748-0221/19/05/P05066

The FASER Detector

Authors: FASER Collaboration, Henso Abreu, Elham Amin Mansour, Claire Antel, Akitaka Ariga, Tomoko Ariga, Florian Bernlochner, Tobias Boeckh, Jamie Boyd, Lydia Brenner, Franck Cadoux, David W. Casper, Charlotte Cavanagh, Xin Chen, Andrea Coccaro, Olivier Crespo-Lopez, Stephane Debieux, Monica D'Onofrio, Liam Dougherty, Candan Dozen, Abdallah Ezzat, Yannick Favre, Deion Fellers, Jonathan L. Feng, Didier Ferrere, et al. (72 additional authors not shown)

Abstract: FASER, the ForwArd Search ExpeRiment, is an experiment dedicated to searching for light, extremely weakly-interacting particles at CERN's Large Hadron Collider (LHC). Such particles may be produced in the very forward direction of the LHC's high-energy collisions and then decay to visible particles inside the FASER detector, which is placed 480 m downstream of the ATLAS interaction point, aligned with the beam collisions axis. FASER also includes a sub-detector, FASER$ν$, designed to detect neutrinos produced in the LHC collisions and to study their properties. In this paper, each component of the FASER detector is described in detail, as well as the installation of the experiment system and its commissioning using cosmic-rays collected in September 2021 and during the LHC pilot beam test carried out in October 2021. FASER will start taking LHC collision data in 2022, and will run throughout LHC Run 3.

Submitted 23 July, 2022; originally announced July 2022.

Comments: 92 pages, 72 Figures

Report number: CERN-FASER-2022-001

Journal ref: JINST 19 (2024) P05066

arXiv:2206.08929 [pdf, other]

TAVA: Template-free Animatable Volumetric Actors

Authors: Ruilong Li, Julian Tanke, Minh Vo, Michael Zollhofer, Jurgen Gall, Angjoo Kanazawa, Christoph Lassner

Abstract: Coordinate-based volumetric representations have the potential to generate photo-realistic virtual avatars from images. However, virtual avatars also need to be controllable even to a novel pose that may not have been observed. Traditional techniques, such as LBS, provide such a function; yet it usually requires a hand-designed body template, 3D scan data, and limited appearance models. On the other hand, neural representation has been shown to be powerful in representing visual details, but are under explored on deforming dynamic articulated actors. In this paper, we propose TAVA, a method to create Template-free Animatable Volumetric Actors, based on neural representations. We rely solely on multi-view data and a tracked skeleton to create a volumetric model of an actor, which can be animated at the test time given novel pose. Since TAVA does not require a body template, it is applicable to humans as well as other creatures such as animals. Furthermore, TAVA is designed such that it can recover accurate dense correspondences, making it amenable to content-creation and editing tasks. Through extensive experiments, we demonstrate that the proposed method generalizes well to novel poses as well as unseen views and showcase basic editing capabilities.

Submitted 20 June, 2022; v1 submitted 17 June, 2022; originally announced June 2022.

Comments: Code: https://github.com/facebookresearch/tava; Project Website: https://www.liruilong.cn/projects/tava/

arXiv:2206.06741 [pdf, other]

Recurrent Transformer Variational Autoencoders for Multi-Action Motion Synthesis

Authors: Rania Briq, Chuhang Zou, Leonid Pishchulin, Chris Broaddus, Juergen Gall

Abstract: We consider the problem of synthesizing multi-action human motion sequences of arbitrary lengths. Existing approaches have mastered motion sequence generation in single action scenarios, but fail to generalize to multi-action and arbitrary-length sequences. We fill this gap by proposing a novel efficient approach that leverages expressiveness of Recurrent Transformers and generative richness of conditional Variational Autoencoders. The proposed iterative approach is able to generate smooth and realistic human motion sequences with an arbitrary number of actions and frames while doing so in linear space and time. We train and evaluate the proposed approach on PROX and Charades datasets, where we augment PROX with ground-truth action labels and Charades with human mesh annotations. Experimental evaluation shows significant improvements in FID score and semantic consistency metrics compared to the state-of-the-art.

Submitted 27 June, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: accepted at Transformers for Vision workshop at CVPR 2022

arXiv:2203.08126 [pdf, other]

Recent Progress and Next Steps for the MATHUSLA LLP Detector

Authors: Cristiano Alpigiani, Juan Carlos Arteaga-Velázquez, Austin Ball, Liron Barak, Jared Barron, Brian Batell, James Beacham, Yan Benhammo, Benjamin Brau, Karen Salomé Caballero-Mora, Paolo Camarri, Roberto Cardarelli, John Paul Chou, Wentao Cui, David Curtin, Miriam Diamond, Keith R. Dienes, Liam Andrew Dougherty, William Dougherty, Marco Drewes, Sameer Erramilli, Rouven Essig, Erez Etzion, Jared Evans, Arturo Fernández Téllez, et al. (71 additional authors not shown)

Abstract: We report on recent progress and next steps in the design of the proposed MATHUSLA Long Lived Particle (LLP) detector for the HL-LHC as part of the Snowmass 2021 process. Our understanding of backgrounds has greatly improved, aided by detailed simulation studies, and significant R&D has been performed on designing the scintillator detectors and understanding their performance. The collaboration is on track to complete a Technical Design Report, and there are many opportunities for interested new members to contribute towards the goal of designing and constructing MATHUSLA in time for HL-LHC collisions, which would increase the sensitivity to a large variety of highly motivated LLP signals by orders of magnitude.

Submitted 30 March, 2023; v1 submitted 15 March, 2022; originally announced March 2022.

Comments: Contribution to Snowmass 2021 (EF09, EF10, IF6, IF9), 18 pages, 12 figures. v2: included additional endorsers. v3: updated affiliations. v4: added missing contributors as authors

arXiv:2201.11736 [pdf, other]

Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives

Authors: David T. Hoffmann, Nadine Behrmann, Juergen Gall, Thomas Brox, Mehdi Noroozi

Abstract: This paper introduces Ranking Info Noise Contrastive Estimation (RINCE), a new member in the family of InfoNCE losses that preserves a ranked ordering of positive samples. In contrast to the standard InfoNCE loss, which requires a strict binary separation of the training pairs into similar and dissimilar samples, RINCE can exploit information about a similarity ranking for learning a corresponding embedding space. We show that the proposed loss function learns favorable embeddings compared to the standard InfoNCE whenever at least noisy ranking information can be obtained or when the definition of positives and negatives is blurry. We demonstrate this for a supervised classification task with additional superclass labels and noisy similarity scores. Furthermore, we show that RINCE can also be applied to unsupervised training with experiments on unsupervised representation learning from videos. In particular, the embedding yields higher classification accuracy, retrieval rates and performs better in out-of-distribution detection than the standard InfoNCE loss.

Submitted 27 January, 2022; originally announced January 2022.

Comments: AAAI 2022 (Main Track)

arXiv:2111.15667 [pdf, other]

Adaptive Token Sampling For Efficient Vision Transformers

Authors: Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, Juergen Gall

Abstract: While state-of-the-art vision transformer models achieve promising results in image classification, they are computationally expensive and require many GFLOPs. Although the GFLOPs of a vision transformer can be decreased by reducing the number of tokens in the network, there is no setting that is optimal for all input images. In this work, we therefore introduce a differentiable parameter-free Adaptive Token Sampler (ATS) module, which can be plugged into any existing vision transformer architecture. ATS empowers vision transformers by scoring and adaptively sampling significant tokens. As a result, the number of tokens is not constant anymore and varies for each input image. By integrating ATS as an additional layer within the current transformer blocks, we can convert them into much more efficient vision transformers with an adaptive number of tokens. Since ATS is a parameter-free module, it can be added to the off-the-shelf pre-trained vision transformers as a plug and play module, thus reducing their GFLOPs without any additional training. Moreover, due to its differentiable design, one can also train a vision transformer equipped with ATS. We evaluate the efficiency of our module in both image and video classification tasks by adding it to multiple SOTA vision transformers. Our proposed module improves the SOTA by reducing their computational costs (GFLOPs) by 2X, while preserving their accuracy on the ImageNet, Kinetics-400, and Kinetics-600 datasets.

Submitted 26 July, 2022; v1 submitted 30 November, 2021; originally announced November 2021.

Comments: ECCV 2022

arXiv:2111.08279 [pdf, other]

Keypoint Message Passing for Video-based Person Re-Identification

Authors: Di Chen, Andreas Doering, Shanshan Zhang, Jian Yang, Juergen Gall, Bernt Schiele

Abstract: Video-based person re-identification (re-ID) is an important technique in visual surveillance systems which aims to match video snippets of people captured by different cameras. Existing methods are mostly based on convolutional neural networks (CNNs), whose building blocks either process local neighbor pixels at a time, or, when 3D convolutions are used to model temporal information, suffer from the misalignment problem caused by person movement. In this paper, we propose to overcome the limitations of normal convolutions with a human-oriented graph method. Specifically, features located at person joint keypoints are extracted and connected as a spatial-temporal graph. These keypoint features are then updated by message passing from their connected nodes with a graph convolutional network (GCN). During training, the GCN can be attached to any CNN-based person re-ID model to assist representation learning on feature maps, whilst it can be dropped after training for better inference speed. Our method brings significant improvements over the CNN-based baseline model on the MARS dataset with generated person keypoints and a newly annotated dataset: PoseTrackReID. It also defines a new state-of-the-art method in terms of top-1 accuracy and mean average precision in comparison to prior works.

Submitted 13 December, 2021; v1 submitted 16 November, 2021; originally announced November 2021.

Comments: To appear in AAAI 2022

arXiv:2110.14392 [pdf, other]

TaylorSwiftNet: Taylor Driven Temporal Modeling for Swift Future Frame Prediction

Authors: Saber Pourheydari, Emad Bahrami, Mohsen Fayyaz, Gianpiero Francesca, Mehdi Noroozi, Juergen Gall

Abstract: While recurrent neural networks (RNNs) demonstrate outstanding capabilities for future video frame prediction, they model dynamics in a discrete time space, i.e., they predict the frames sequentially with a fixed temporal step. RNNs are therefore prone to accumulate the error as the number of future frames increases. In contrast, partial differential equations (PDEs) model physical phenomena like dynamics in a continuous time space. However, the estimated PDE for frame forecasting needs to be numerically solved, which is done by discretization of the PDE and diminishes most of the advantages compared to discrete models. In this work, we, therefore, propose to approximate the motion in a video by a continuous function using the Taylor series. To this end, we introduce TaylorSwiftNet, a novel convolutional neural network that learns to estimate the higher order terms of the Taylor series for a given input video. TaylorSwiftNet can swiftly predict future frames in parallel and it allows to change the temporal resolution of the forecast frames on-the-fly. The experimental results on various datasets demonstrate the superiority of our model.

Submitted 12 October, 2022; v1 submitted 27 October, 2021; originally announced October 2021.

Comments: BMVC 2022

arXiv:2109.11593 [pdf, other]

Long Short View Feature Decomposition via Contrastive Video Representation Learning

Authors: Nadine Behrmann, Mohsen Fayyaz, Juergen Gall, Mehdi Noroozi

Abstract: Self-supervised video representation methods typically focus on the representation of temporal attributes in videos. However, the role of stationary versus non-stationary attributes is less explored: Stationary features, which remain similar throughout the video, enable the prediction of video-level action classes. Non-stationary features, which represent temporally varying attributes, are more beneficial for downstream tasks involving more fine-grained temporal understanding, such as action segmentation. We argue that a single representation to capture both types of features is sub-optimal, and propose to decompose the representation space into stationary and non-stationary features via contrastive learning from long and short views, i.e. long video sequences and their shorter sub-sequences. Stationary features are shared between the short and long views, while non-stationary features aggregate the short views to match the corresponding long view. To empirically verify our approach, we demonstrate that our stationary features work particularly well on an action recognition downstream task, while our non-stationary features perform better on action segmentation. Furthermore, we analyse the learned representations and find that stationary features capture more temporally stable, static attributes, while non-stationary features encompass more temporally varying ones.

Submitted 23 September, 2021; originally announced September 2021.

Comments: ICCV 2021 (Main Conference)

arXiv:2109.10905 [pdf, other]

doi 10.1016/j.physrep.2022.04.004

The Forward Physics Facility: Sites, Experiments, and Physics Potential

Authors: Luis A. Anchordoqui, Akitaka Ariga, Tomoko Ariga, Weidong Bai, Kincso Balazs, Brian Batell, Jamie Boyd, Joseph Bramante, Mario Campanelli, Adrian Carmona, Francesco G. Celiberto, Grigorios Chachamis, Matthew Citron, Giovanni De Lellis, Albert De Roeck, Hans Dembinski, Peter B. Denton, Antonia Di Crecsenzo, Milind V. Diwan, Liam Dougherty, Herbi K. Dreiner, Yong Du, Rikard Enberg, Yasaman Farzan, Jonathan L. Feng, et al. (56 additional authors not shown)

Abstract: The Forward Physics Facility (FPF) is a proposal to create a cavern with the space and infrastructure to support a suite of far-forward experiments at the Large Hadron Collider during the High Luminosity era. Located along the beam collision axis and shielded from the interaction point by at least 100 m of concrete and rock, the FPF will house experiments that will detect particles outside the acceptance of the existing large LHC experiments and will observe rare and exotic processes in an extremely low-background environment. In this work, we summarize the current status of plans for the FPF, including recent progress in civil engineering in identifying promising sites for the FPF and the experiments currently envisioned to realize the FPF's physics potential. We then review the many Standard Model and new physics topics that will be advanced by the FPF, including searches for long-lived particles, probes of dark matter and dark sectors, high-statistics studies of TeV neutrinos of all three flavors, aspects of perturbative and non-perturbative QCD, and high-energy astroparticle physics.

Submitted 25 May, 2022; v1 submitted 22 September, 2021; originally announced September 2021.

Comments: revised version, accepted by Physics Reports

Report number: BNL-222142-2021-FORE, CERN-PBC-Notes-2021-025, DESY-21-142, FERMILAB-CONF-21-452-AE-E-ND-PPD-T, KYUSHU-RCAPP-2021-01, LU TP 21-36, PITT-PACC-2118, SMU-HEP-21-10, UCI-TR-2021-22

Journal ref: Phys. Rept. 968 (2022), 1-50

arXiv:2108.03894 [pdf, other]

FIFA: Fast Inference Approximation for Action Segmentation

Authors: Yaser Souri, Yazan Abu Farha, Fabien Despinoy, Gianpiero Francesca, Juergen Gall

Abstract: We introduce FIFA, a fast approximate inference method for action segmentation and alignment. Unlike previous approaches, FIFA does not rely on expensive dynamic programming for inference. Instead, it uses an approximate differentiable energy function that can be minimized using gradient-descent. FIFA is a general approach that can replace exact inference improving its speed by more than 5 times while maintaining its performance. FIFA is an anytime inference algorithm that provides a better speed vs. accuracy trade-off compared to exact inference. We apply FIFA on top of state-of-the-art approaches for weakly supervised action segmentation and alignment as well as fully supervised action segmentation. FIFA achieves state-of-the-art results on most metrics on two action segmentation datasets.

Submitted 9 August, 2021; originally announced August 2021.

arXiv:2107.14206 [pdf, other]

doi 10.1109/IROS51168.2021.9636133

Using Visual Anomaly Detection for Task Execution Monitoring

Authors: Santosh Thoduka, Juergen Gall, Paul G. Plöger

Abstract: Execution monitoring is essential for robots to detect and respond to failures. Since it is impossible to enumerate all failures for a given task, we learn from successful executions of the task to detect visual anomalies during runtime. Our method learns to predict the motions that occur during the nominal execution of a task, including camera and robot body motion. A probabilistic U-Net architecture is used to learn to predict optical flow, and the robot's kinematics and 3D model are used to model camera and body motion. The errors between the observed and predicted motion are used to calculate an anomaly score. We evaluate our method on a dataset of a robot placing a book on a shelf, which includes anomalies such as falling books, camera occlusions, and robot disturbances. We find that modeling camera and body motion, in addition to the learning-based optical flow prediction, results in an improvement of the area under the receiver operating characteristic curve from 0.752 to 0.804, and the area under the precision-recall curve from 0.467 to 0.549.

Submitted 29 July, 2021; originally announced July 2021.

Comments: Accepted for publication at the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2107.09504 [pdf, other]

Multi-Modal Temporal Convolutional Network for Anticipating Actions in Egocentric Videos

Authors: Olga Zatsarynna, Yazan Abu Farha, Juergen Gall

Abstract: Anticipating human actions is an important task that needs to be addressed for the development of reliable intelligent agents, such as self-driving cars or robot assistants. While the ability to make future predictions with high accuracy is crucial for designing the anticipation approaches, the speed at which the inference is performed is not less important. Methods that are accurate but not sufficiently fast would introduce a high latency into the decision process. Thus, this will increase the reaction time of the underlying system. This poses a problem for domains such as autonomous driving, where the reaction time is crucial. In this work, we propose a simple and effective multi-modal architecture based on temporal convolutions. Our approach stacks a hierarchy of temporal convolutional layers and does not rely on recurrent layers to ensure a fast prediction. We further introduce a multi-modal fusion mechanism that captures the pairwise interactions between RGB, flow, and object modalities. Results on two large-scale datasets of egocentric videos, EPIC-Kitchens-55 and EPIC-Kitchens-100, show that our approach achieves comparable performance to the state-of-the-art approaches while being significantly faster.

Submitted 18 July, 2021; originally announced July 2021.

Comments: CVPR Precognition Workshop

arXiv:2107.01869 [pdf, other]

Towards Better Adversarial Synthesis of Human Images from Text

Authors: Rania Briq, Pratika Kochar, Juergen Gall

Abstract: This paper proposes an approach that generates multiple 3D human meshes from text. The human shapes are represented by 3D meshes based on the SMPL model. The model's performance is evaluated on the COCO dataset, which contains challenging human shapes and intricate interactions between individuals. The model is able to capture the dynamics of the scene and the interactions between individuals based on text. We further show how using such a shape as input to image synthesis frameworks helps to constrain the network to synthesize humans with realistic human shapes.

Submitted 5 July, 2021; originally announced July 2021.

arXiv:2105.08971 [pdf, other]

doi 10.1109/LRA.2021.3093567

Moving Object Segmentation in 3D LiDAR Data: A Learning-based Approach Exploiting Sequential Data

Authors: Xieyuanli Chen, Shijie Li, Benedikt Mersch, Louis Wiesmann, Jürgen Gall, Jens Behley, Cyrill Stachniss

Abstract: The ability to detect and segment moving objects in a scene is essential for building consistent maps, making future state predictions, avoiding collisions, and planning. In this paper, we address the problem of moving object segmentation from 3D LiDAR scans. We propose a novel approach that pushes the current state of the art in LiDAR-only moving object segmentation forward to provide relevant information for autonomous robots and other vehicles. Instead of segmenting the point cloud semantically, i.e., predicting the semantic classes such as vehicles, pedestrians, roads, etc., our approach accurately segments the scene into moving and static objects, i.e., also distinguishing between moving cars vs. parked cars. Our proposed approach exploits sequential range images from a rotating 3D LiDAR sensor as an intermediate representation combined with a convolutional neural network and runs faster than the frame rate of the sensor. We compare our approach to several other state-of-the-art methods showing superior segmentation quality in urban environments. Additionally, we created a new benchmark for LiDAR-based moving object segmentation based on SemanticKITTI. We published it to allow other researchers to compare their approaches transparently and we furthermore published our code.

Submitted 13 July, 2021; v1 submitted 19 May, 2021; originally announced May 2021.

Comments: Accepted by RA-L with IROS 2021

arXiv:2105.05847 [pdf, other]

Learning to Generate Novel Scene Compositions from Single Images and Videos

Authors: Vadim Sushko, Juergen Gall, Anna Khoreva

Abstract: Training GANs in low-data regimes remains a challenge, as overfitting often leads to memorization or training divergence. In this work, we introduce One-Shot GAN that can learn to generate samples from a training set as little as one image or one video. We propose a two-branch discriminator, with content and layout branches designed to judge the internal content separately from the scene layout realism. This allows synthesis of visually plausible, novel compositions of a scene, with varying content and layout, while preserving the context of the original sample. Compared to previous single-image GAN models, One-Shot GAN achieves higher diversity and quality of synthesis. It is also not restricted to the single image setting, successfully learning in the introduced setting of a single video.

Submitted 12 May, 2021; originally announced May 2021.

Comments: The AI for Content Creation (AICC) workshop at CVPR 2021. The full 8-page version of this submission is available at arXiv:2103.13389

arXiv:2105.05615 [pdf, ps, other]

The volume measure of the Brownian sphere is a Hausdorff measure

Authors: Jean-François Le Gall

Abstract: We prove that the volume measure of the Brownian sphere is equal to a constant multiple of the Hausdorff measure associated with the gauge function $h(r)=r^4\log\log(1/r)$. This shows in particular that the volume measure of the Brownian sphere is determined by its metric structure. As a key ingredient of our proofs, we derive precise estimates on moments of the volume of balls in the Brownian sphere.

Submitted 29 July, 2022; v1 submitted 12 May, 2021; originally announced May 2021.

Comments: 26 pages - revised version incorporating suggestions of two referees

MSC Class: 60D05

arXiv:2103.13389 [pdf, other]

Generating Novel Scene Compositions from Single Images and Videos

Authors: Vadim Sushko, Dan Zhang, Juergen Gall, Anna Khoreva

Abstract: Given a large dataset for training, generative adversarial networks (GANs) can achieve remarkable performance for the image synthesis task. However, training GANs in extremely low data regimes remains a challenge, as overfitting often occurs, leading to memorization or training divergence. In this work, we introduce SIV-GAN, an unconditional generative model that can generate new scene compositions from a single training image or a single video clip. We propose a two-branch discriminator architecture, with content and layout branches designed to judge internal content and scene layout realism separately from each other. This discriminator design enables synthesis of visually plausible, novel compositions of a scene, with varying content and layout, while preserving the context of the original sample. Compared to previous single image GANs, our model generates more diverse, higher quality images, while not being restricted to a single image setting. We further introduce a new challenging task of learning from a few frames of a single video. In this training setup the training images are highly similar to each other, which makes it difficult for prior GAN models to achieve a synthesis of both high quality and diversity.

Submitted 13 December, 2023; v1 submitted 24 March, 2021; originally announced March 2021.

Comments: Accepted for publication in Computer Vision and Image Understanding: https://www.sciencedirect.com/science/article/pii/S1077314223002680. Code repository: https://github.com/boschresearch/one-shot-synthesis

arXiv:2103.06669 [pdf, other]

Temporal Action Segmentation from Timestamp Supervision

Authors: Zhe Li, Yazan Abu Farha, Juergen Gall

Abstract: Temporal action segmentation approaches have been very successful recently. However, annotating videos with frame-wise labels to train such models is very expensive and time consuming. While weakly supervised methods trained using only ordered action lists require less annotation effort, the performance is still worse than fully supervised approaches. In this paper, we propose to use timestamp supervision for the temporal action segmentation task. Timestamps require a comparable annotation effort to weakly supervised approaches, and yet provide a more supervisory signal. To demonstrate the effectiveness of timestamp supervision, we propose an approach to train a segmentation model using only timestamps annotations. Our approach uses the model output and the annotated timestamps to generate frame-wise labels by detecting the action changes. We further introduce a confidence loss that forces the predicted probabilities to monotonically decrease as the distance to the timestamps increases. This ensures that all and not only the most distinctive frames of an action are learned during training. The evaluation on four datasets shows that models trained with timestamps annotations achieve comparable performance to the fully supervised approaches.

Submitted 26 March, 2021; v1 submitted 11 March, 2021; originally announced March 2021.

Comments: CVPR 2021

Showing 1–50 of 187 results for author: Gall, J