-
MaskInversion: Localized Embeddings via Optimization of Explainability Maps
Authors:
Walid Bousselham,
Sofian Chaybouti,
Christian Rupprecht,
Vittorio Ferrari,
Hilde Kuehne
Abstract:
Vision-language foundation models such as CLIP have achieved tremendous results in global vision-language alignment, but still show some limitations in creating representations for specific image regions. To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for a query image region specified by a mask at test time. MaskInversion starts by initializing an embedding token and compares its explainability map, derived from the foundation model, to the query mask. The embedding token is then refined to approximate the query region by minimizing the discrepancy between its explainability map and the query mask. During this process, only the embedding vector is updated, while the underlying foundation model is kept frozen, allowing MaskInversion to be used with any pre-trained model. As deriving the explainability map involves computing its gradient, which can be expensive, we propose a gradient decomposition strategy that simplifies this computation. The learned region representation can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as localized captioning and image generation. We evaluate the proposed method on all of these tasks across several datasets such as PascalVOC, MSCOCO, RefCOCO, and OpenImagesV7 and show its capabilities compared to other SOTA approaches.
Submitted 29 July, 2024;
originally announced July 2024.
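The test-time optimization described in the MaskInversion abstract above (update only an embedding so that its map matches a query mask, with the model frozen) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the frozen "model" is a fixed list of patch features, and a sigmoid of the embedding-patch dot product stands in for the gradient-based explainability map, which the abstract does not specify in detail. The names `similarity_map`, `mask_loss`, and `invert_mask` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def similarity_map(embedding, patch_features):
    # Stand-in for an explainability map (assumption): one score in (0, 1) per patch.
    return [sigmoid(sum(e * f for e, f in zip(embedding, feat)))
            for feat in patch_features]

def mask_loss(pred, mask):
    # Binary cross-entropy between the predicted map and the query mask.
    eps = 1e-12
    return -sum(m * math.log(p + eps) + (1 - m) * math.log(1 - p + eps)
                for p, m in zip(pred, mask))

def invert_mask(patch_features, mask, steps=200, lr=0.5):
    # Only the embedding is updated; patch_features (the frozen model) never change.
    dim = len(patch_features[0])
    embedding = [0.0] * dim
    for _ in range(steps):
        pred = similarity_map(embedding, patch_features)
        grad = [0.0] * dim
        for p, m, feat in zip(pred, mask, patch_features):
            for k in range(dim):
                # Gradient of sigmoid + cross-entropy w.r.t. the embedding.
                grad[k] += (p - m) * feat[k]
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding

# Toy data: four patches with 2-D features; the mask selects the first two.
PATCH_FEATURES = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
QUERY_MASK = [1, 1, 0, 0]

if __name__ == "__main__":
    emb = invert_mask(PATCH_FEATURES, QUERY_MASK)
    print([round(p, 3) for p in similarity_map(emb, PATCH_FEATURES)])
```

The design point this sketch preserves is that the loss is defined on the map rather than on the embedding directly, so the optimized vector is whatever embedding makes the frozen features highlight the masked region.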
-
SHIC: Shape-Image Correspondences with no Keypoint Supervision
Authors:
Aleksandar Shtedritski,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
Canonical surface mapping generalizes keypoint detection by assigning each pixel of an object to a corresponding point in a 3D template. Popularised by DensePose for the analysis of humans, authors have since attempted to apply the concept to more categories, but with limited success due to the high cost of manual supervision. In this work, we introduce SHIC, a method to learn canonical maps without manual supervision which achieves better results than supervised methods for most categories. Our idea is to leverage foundation computer vision models such as DINO and Stable Diffusion that are open-ended and thus possess excellent priors over natural categories. SHIC reduces the problem of estimating image-to-template correspondences to predicting image-to-image correspondences using features from the foundation models. The reduction works by matching images of the object to non-photorealistic renders of the template, which emulates the process of collecting manual annotations for this task. These correspondences are then used to supervise high-quality canonical maps for any object of interest. We also show that image generators can further improve the realism of the template views, which provide an additional source of supervision for the model.
Submitted 26 July, 2024;
originally announced July 2024.
-
VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads
Authors:
Orest Kupyn,
Eugene Khvedchenia,
Christian Rupprecht
Abstract:
Human head detection, keypoint estimation, and 3D head model fitting are important tasks with many applications. However, traditional real-world datasets often suffer from bias, privacy, and ethical concerns, and they have been recorded in laboratory environments, which makes it difficult for trained models to generalize. Here, we introduce VGGHeads -- a large-scale synthetic dataset generated with diffusion models for human head detection and 3D mesh estimation. Our dataset comprises over 1 million high-resolution images, each annotated with detailed 3D head meshes, facial landmarks, and bounding boxes. Using this dataset, we introduce a new model architecture capable of simultaneous head detection and head mesh reconstruction from a single image in a single step. Through extensive experimental evaluations, we demonstrate that models trained on our synthetic data achieve strong performance on real images. Furthermore, the versatility of our dataset makes it applicable across a broad spectrum of tasks, offering a general and comprehensive representation of human heads. Additionally, we provide detailed information about the synthetic data generation pipeline, enabling it to be re-used for other tasks and domains.
Submitted 25 July, 2024;
originally announced July 2024.
-
Dataset Enhancement with Instance-Level Augmentations
Authors:
Orest Kupyn,
Christian Rupprecht
Abstract:
We present a method for expanding a dataset by incorporating knowledge from the wide distribution of pre-trained latent diffusion models. Data augmentations typically incorporate inductive biases about the image formation process into the training (e.g. translation, scaling, colour changes, etc.). Here, we go beyond simple pixel transformations and introduce the concept of instance-level data augmentation by repainting parts of the image at the level of object instances. The method combines a conditional diffusion model with depth and edge maps control conditioning to seamlessly repaint individual objects inside the scene, being applicable to any segmentation or detection dataset. Used as a data augmentation method, it improves the performance and generalization of the state-of-the-art salient object detection, semantic segmentation and object detection models. By redrawing all privacy-sensitive instances (people, license plates, etc.), the method is also applicable for data anonymization. We also release fully synthetic and anonymized expansions for popular datasets: COCO, Pascal VOC and DUTS.
Submitted 12 June, 2024;
originally announced June 2024.
-
Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image
Authors:
Stanislaw Szymanowicz,
Eldar Insafutdinov,
Chuanxia Zheng,
Dylan Campbell,
João F. Henriques,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
In this paper, we propose Flash3D, a method for scene reconstruction and novel view synthesis from a single image which is both very generalisable and efficient. For generalisability, we start from a "foundation" model for monocular depth estimation and extend it to a full 3D shape and appearance reconstructor. For efficiency, we base this extension on feed-forward Gaussian Splatting. Specifically, we predict a first layer of 3D Gaussians at the predicted depth, and then add additional layers of Gaussians that are offset in space, allowing the model to complete the reconstruction behind occlusions and truncations. Flash3D is very efficient, trainable on a single GPU in a day, and thus accessible to most researchers. It achieves state-of-the-art results when trained and tested on RealEstate10k. When transferred to unseen datasets like NYU it outperforms competitors by a large margin. More impressively, when transferred to KITTI, Flash3D achieves better PSNR than methods trained specifically on that dataset. In some instances, it even outperforms recent methods that use multiple views as input. Code, models, demo, and more results are available at https://www.robots.ox.ac.uk/~vgg/research/flash3d/.
Submitted 6 June, 2024;
originally announced June 2024.
-
Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting
Authors:
Paul Engstler,
Andrea Vedaldi,
Iro Laina,
Christian Rupprecht
Abstract:
3D scene generation has quickly become a challenging new research direction, fueled by consistent improvements of 2D generative diffusion models. Most prior work in this area generates scenes by iteratively stitching newly generated frames with existing geometry. These works often depend on pre-trained monocular depth estimators to lift the generated images into 3D, fusing them with the existing scene representation. These approaches are then often evaluated via a text metric, measuring the similarity between the generated images and a given text prompt. In this work, we make two fundamental contributions to the field of 3D scene generation. First, we note that lifting images to 3D with a monocular depth estimation model is suboptimal as it ignores the geometry of the existing scene. We thus introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process, resulting in improved geometric coherence of the scene. Second, we introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry, and thus measures the quality of the structure of the scene.
Submitted 30 April, 2024;
originally announced April 2024.
-
DragAPart: Learning a Part-Level Motion Prior for Articulated Objects
Authors:
Ruining Li,
Chuanxia Zheng,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
We introduce DragAPart, a method that, given an image and a set of drags as input, generates a new image of the same object that responds to the action of the drags. Differently from prior works that focused on repositioning objects, DragAPart predicts part-level interactions, such as opening and closing a drawer. We study this problem as a proxy for learning a generalist motion model, not restricted to a specific kinematic structure or object category. We start from a pre-trained image generator and fine-tune it on a new synthetic dataset, Drag-a-Move, which we introduce. Combined with a new encoding for the drags and dataset randomization, the model generalizes well to real images and different categories. Compared to prior motion-controlled generators, we demonstrate much better part-level motion understanding.
Submitted 28 July, 2024; v1 submitted 22 March, 2024;
originally announced March 2024.
-
Recent Trends in 3D Reconstruction of General Non-Rigid Scenes
Authors:
Raza Yunus,
Jan Eric Lenssen,
Michael Niemeyer,
Yiyi Liao,
Christian Rupprecht,
Christian Theobalt,
Gerard Pons-Moll,
Jia-Bin Huang,
Vladislav Golyanik,
Eddy Ilg
Abstract:
Reconstructing models of the real world, including 3D geometry, appearance, and motion of real scenes, is essential for computer graphics and computer vision. It enables the synthesizing of photorealistic novel views, useful for the movie industry and AR/VR applications. It also facilitates the content creation necessary in computer games and AR/VR by avoiding laborious manual design processes. Further, such models are fundamental for intelligent computing systems that need to interpret real-world scenes and actions to act and interact safely with the human world. Notably, the world surrounding us is dynamic, and reconstructing models of dynamic, non-rigidly moving scenes is a severely underconstrained and challenging problem. This state-of-the-art report (STAR) offers the reader a comprehensive summary of state-of-the-art techniques with monocular and multi-view inputs such as data from RGB and RGB-D sensors, among others, conveying an understanding of different approaches, their potential applications, and promising further research directions. The report covers 3D reconstruction of general non-rigid scenes and further addresses the techniques for scene decomposition, editing and controlling, and generalizable and generative modeling. More specifically, we first review the common and fundamental concepts necessary to understand and navigate the field and then discuss the state-of-the-art techniques by reviewing recent approaches that use traditional and machine-learning-based neural representations, including a discussion on the newly enabled applications. The STAR is concluded with a discussion of the remaining limitations and open challenges.
Submitted 6 May, 2024; v1 submitted 22 March, 2024;
originally announced March 2024.
-
IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation
Authors:
Luke Melas-Kyriazi,
Iro Laina,
Christian Rupprecht,
Natalia Neverova,
Andrea Vedaldi,
Oran Gafni,
Filippos Kokkinos
Abstract:
Most text-to-3D generators build upon off-the-shelf text-to-image models trained on billions of images. They use variants of Score Distillation Sampling (SDS), which is slow, somewhat unstable, and prone to artifacts. A mitigation is to fine-tune the 2D generator to be multi-view aware, which can help distillation or can be combined with reconstruction networks to output 3D objects directly. In this paper, we further explore the design space of text-to-3D models. We significantly improve multi-view generation by considering video instead of image generators. Combined with a 3D reconstruction algorithm which, by using Gaussian splatting, can optimize a robust image-based loss, we directly produce high-quality 3D outputs from the generated views. Our new method, IM-3D, reduces the number of evaluations of the 2D generator network 10-100x, resulting in a much more efficient pipeline, better quality, fewer geometric inconsistencies, and higher yield of usable 3D assets.
Submitted 13 February, 2024;
originally announced February 2024.
-
Learning the 3D Fauna of the Web
Authors:
Zizhang Li,
Dor Litvak,
Ruining Li,
Yunzhi Zhang,
Tomas Jakab,
Christian Rupprecht,
Shangzhe Wu,
Andrea Vedaldi,
Jiajun Wu
Abstract:
Learning 3D models of all animals on the Earth requires massively scaling up existing solutions. With this ultimate goal in mind, we develop 3D-Fauna, an approach that learns a pan-category deformable 3D animal model for more than 100 animal species jointly. One crucial bottleneck of modeling animals is the limited availability of training data, which we overcome by simply learning from 2D Internet images. We show that prior category-specific attempts fail to generalize to rare species with limited training images. We address this challenge by introducing the Semantic Bank of Skinned Models (SBSM), which automatically discovers a small set of base animal shapes by combining geometric inductive priors with semantic knowledge implicitly captured by an off-the-shelf self-supervised feature extractor. To train such a model, we also contribute a new large-scale dataset of diverse animal species. At inference time, given a single image of any quadruped animal, our model reconstructs an articulated 3D mesh in a feed-forward fashion within seconds.
Submitted 1 April, 2024; v1 submitted 4 January, 2024;
originally announced January 2024.
-
Splatter Image: Ultra-Fast Single-View 3D Reconstruction
Authors:
Stanislaw Szymanowicz,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
We introduce the Splatter Image, an ultra-efficient approach for monocular 3D object reconstruction. Splatter Image is based on Gaussian Splatting, which allows fast and high-quality reconstruction of 3D scenes from multiple images. We apply Gaussian Splatting to monocular reconstruction by learning a neural network that, at test time, performs reconstruction in a feed-forward manner, at 38 FPS. Our main innovation is the surprisingly straightforward design of this network, which, using 2D operators, maps the input image to one 3D Gaussian per pixel. The resulting set of Gaussians thus has the form of an image, the Splatter Image. We further extend the method to take several images as input via cross-view attention. Owing to the speed of the renderer (588 FPS), we use a single GPU for training while generating entire images at each iteration to optimize perceptual metrics like LPIPS. On several synthetic, real, multi-category and large-scale benchmark datasets, we achieve better results in terms of PSNR, LPIPS, and other metrics while training and evaluating much faster than prior works. Code, models, demo and more results are available at https://szymanowiczs.github.io/splatter-image.
Submitted 16 April, 2024; v1 submitted 20 December, 2023;
originally announced December 2023.
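The "one 3D Gaussian per pixel" idea in the Splatter Image abstract above can be pictured with a small sketch that turns a per-pixel depth map into one 3D centre per pixel. This is an illustrative assumption, not the paper's parameterization: the abstract does not say how each Gaussian is positioned, so the code assumes a pinhole camera and places each centre along the pixel's ray at a given depth. The function name `unproject_depth_to_means` and the intrinsics `focal`, `cx`, `cy` are hypothetical, and the remaining Gaussian parameters (opacity, covariance, colour) are omitted.

```python
def unproject_depth_to_means(depth, focal, cx, cy):
    """Return one 3D point (x, y, z) per pixel of a row-major depth map.

    Assumes a pinhole camera with focal length `focal` (in pixels) and
    principal point (cx, cy); this camera model is an assumption of the sketch.
    """
    means = []
    for v, row in enumerate(depth):          # v: pixel row index
        for u, d in enumerate(row):          # u: pixel column index
            x = (u - cx) / focal * d
            y = (v - cy) / focal * d
            means.append((x, y, d))
    return means

if __name__ == "__main__":
    depth_map = [[2.0, 2.0], [2.0, 2.0]]     # toy 2x2 depth map
    for point in unproject_depth_to_means(depth_map, focal=1.0, cx=0.5, cy=0.5):
        print(point)
```

Because the output has exactly one entry per input pixel, the set of Gaussians can be stored as an image-shaped tensor, which is what lets a purely 2D network produce it.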
-
Scene-Conditional 3D Object Stylization and Composition
Authors:
Jinghao Zhou,
Tomas Jakab,
Philip Torr,
Christian Rupprecht
Abstract:
Recently, 3D generative models have made impressive progress, enabling the generation of almost arbitrary 3D assets from text or image inputs. However, these approaches generate objects in isolation without any consideration for the scene where they will eventually be placed. In this paper, we propose a framework that allows for the stylization of an existing 3D asset to fit into a given 2D scene, and additionally produces a photorealistic composition as if the asset were placed within the environment. This not only opens up a new level of control for object stylization (for example, the same assets can be stylized to reflect changes in the environment, such as summer to winter or fantasy versus futuristic settings) but also makes the object-scene composition more controllable. We achieve this by combining modeling and optimizing the object's texture and environmental lighting through differentiable ray tracing with image priors from pre-trained text-to-image diffusion models. We demonstrate that our method is applicable to a wide variety of indoor and outdoor scenes and arbitrary objects.
Submitted 19 December, 2023;
originally announced December 2023.
-
Visual Geometry Grounded Deep Structure From Motion
Authors:
Jianyuan Wang,
Nikita Karaev,
Christian Rupprecht,
David Novotny
Abstract:
Structure-from-motion (SfM) is a long-standing problem in the computer vision community, which aims to reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical frameworks solve this problem in an incremental manner by detecting and matching keypoints, registering images, triangulating 3D points, and conducting bundle adjustment. Recent research efforts have predominantly revolved around harnessing the power of deep learning techniques to enhance specific elements (e.g., keypoint matching), but are still based on the original, non-differentiable pipeline. Instead, we propose a new deep pipeline VGGSfM, where each component is fully differentiable and thus can be trained in an end-to-end manner. To this end, we introduce new mechanisms and simplifications. First, we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks, which eliminates the need for chaining pairwise matches. Furthermore, we recover all cameras simultaneously based on the image and track features instead of gradually registering cameras. Finally, we optimise the cameras and triangulate 3D points via a differentiable bundle adjustment layer. We attain state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism, and ETH3D.
Submitted 7 December, 2023;
originally announced December 2023.
-
Cache Me if You Can: Accelerating Diffusion Models through Block Caching
Authors:
Felix Wimbauer,
Bichen Wu,
Edgar Schoenfeld,
Xiaoliang Dai,
Ji Hou,
Zijian He,
Artsiom Sanakoyeu,
Peizhao Zhang,
Sam Tsai,
Jonas Kohler,
Christian Rupprecht,
Daniel Cremers,
Peter Vajda,
Jialiang Wang
Abstract:
Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly. A large image-to-image network has to be applied many times to iteratively refine an image from random noise. While many recent works propose techniques to reduce the number of required steps, they generally treat the underlying denoising network as a black box. In this work, we investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small. We hypothesize that many layer computations in the denoising network are redundant. Leveraging this, we introduce block caching, in which we reuse outputs from layer blocks of previous steps to speed up inference. Furthermore, we propose a technique to automatically determine caching schedules based on each block's changes over timesteps. In our experiments, we show through FID, human evaluation and qualitative analysis that Block Caching allows generating images with higher visual quality at the same computational cost. We demonstrate this for different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).
Submitted 12 January, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
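The block caching idea in the abstract above (reuse a block's output from an earlier step when it changes little, following a schedule derived from measured changes) can be sketched in a few lines. The scheduling rule below is an assumption made for illustration: it recomputes a block once the accumulated relative change since the last recomputation exceeds a threshold, which may differ from the criterion the paper uses. The names `make_schedule` and `run_with_cache` are hypothetical.

```python
def make_schedule(changes, threshold):
    """Decide at which steps a block is recomputed.

    changes[t] is the measured relative change of the block's output from
    step t-1 to step t (changes[0] is unused). Assumed rule: always compute
    at step 0, then recompute whenever the change accumulated since the last
    recomputation exceeds `threshold`.
    """
    schedule = [True]
    accumulated = 0.0
    for t in range(1, len(changes)):
        accumulated += changes[t]
        if accumulated > threshold:
            schedule.append(True)
            accumulated = 0.0
        else:
            schedule.append(False)
    return schedule

def run_with_cache(block, inputs, schedule):
    """Apply `block` over the steps, reusing the cached output where allowed."""
    outputs, cached, calls = [], None, 0
    for x, recompute in zip(inputs, schedule):
        if recompute or cached is None:
            cached = block(x)     # fresh computation for this step
            calls += 1
        outputs.append(cached)    # otherwise the stale output is reused
    return outputs, calls

if __name__ == "__main__":
    schedule = make_schedule([0.0, 0.02, 0.03, 0.3, 0.01], threshold=0.1)
    outputs, calls = run_with_cache(lambda x: x * 2, [1, 2, 3, 4, 5], schedule)
    print(schedule, outputs, calls)
```

In this toy run the block is evaluated twice instead of five times; the trade-off is that the reused outputs are slightly stale, which is acceptable only when the measured step-to-step change is small.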
-
Understanding Self-Supervised Features for Learning Unsupervised Instance Segmentation
Authors:
Paul Engstler,
Luke Melas-Kyriazi,
Christian Rupprecht,
Iro Laina
Abstract:
Self-supervised learning (SSL) can be used to solve complex visual tasks without human labels. Self-supervised representations encode useful semantic information about images, and as a result, they have already been used for tasks such as unsupervised semantic segmentation. In this paper, we investigate self-supervised representations for instance segmentation without any manual annotations. We find that the features of different SSL methods vary in their level of instance-awareness. In particular, DINO features, which are known to be excellent semantic descriptors, lag behind MAE features in their sensitivity for separating instances.
Submitted 24 November, 2023;
originally announced November 2023.
-
S4C: Self-Supervised Semantic Scene Completion with Neural Fields
Authors:
Adrian Hayler,
Felix Wimbauer,
Dominik Muhle,
Christian Rupprecht,
Daniel Cremers
Abstract:
3D semantic scene understanding is a fundamental challenge in computer vision. It enables mobile agents to autonomously plan and navigate arbitrary environments. Semantic scene completion (SSC) formalizes this challenge as jointly estimating dense geometry and semantic information from sparse observations of a scene. Current methods for SSC are generally trained on 3D ground truth based on aggregated LiDAR scans. This process relies on special sensors and annotation by hand, which are costly and do not scale well. To overcome this issue, our work presents the first self-supervised approach to SSC, called S4C, that does not rely on 3D ground truth data. Our proposed method can reconstruct a scene from a single image and relies only on videos and pseudo segmentation ground truth generated from an off-the-shelf image segmentation network during training. Unlike existing methods, which use discrete voxel grids, we represent scenes as implicit semantic fields. This formulation allows querying any point within the camera frustum for occupancy and semantic class. Our architecture is trained through rendering-based self-supervised losses. Nonetheless, our method achieves performance close to fully supervised state-of-the-art methods. Additionally, our method demonstrates strong generalization capabilities and can synthesize accurate segmentation maps for far away viewpoints.
Submitted 12 October, 2023; v1 submitted 11 October, 2023;
originally announced October 2023.
-
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
Authors:
Nina Shvetsova,
Anna Kukleva,
Xudong Hong,
Christian Rupprecht,
Bernt Schiele,
Hilde Kuehne
Abstract:
Instructional videos are an excellent source for learning multimodal representations by leveraging video-subtitle pairs extracted with automatic speech recognition systems (ASR) from the audio signal in the videos. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision for multimodal learning. As a result, large-scale annotation-free web video training data remains sub-optimal for training text-video models. In this work, we propose to leverage the capability of large language models (LLMs) to obtain fine-grained video descriptions aligned with videos. Specifically, we prompt an LLM to create plausible video descriptions based on ASR narrations of the video for a large-scale instructional video dataset. To this end, we introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture context beyond a single sentence. To align the captions to the video temporally, we prompt the LLM to generate timestamps for each produced caption based on the subtitles. In this way, we obtain human-style video captions at scale without human supervision. We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption. Our evaluation shows that the resulting captions not only significantly improve the performance over many different benchmark datasets for text-video retrieval but also lead to a disentangling of textual narration from the audio, boosting performance in text-video-audio tasks.
Submitted 7 October, 2023;
originally announced October 2023.
-
CoTracker: It is Better to Track Together
Authors:
Nikita Karaev,
Ignacio Rocco,
Benjamin Graham,
Natalia Neverova,
Andrea Vedaldi,
Christian Rupprecht
Abstract:
We introduce CoTracker, a transformer-based model that tracks dense points in a frame jointly across a video sequence. This differs from most existing state-of-the-art approaches that track points independently, ignoring their correlation. We show that joint tracking results in a significantly higher tracking accuracy and robustness. We also provide several technical innovations, including the concept of virtual tracks, which allows CoTracker to track 70k points jointly and simultaneously. Furthermore, CoTracker operates causally on short windows (hence, it is suitable for online tasks), but is trained by unrolling the windows across longer video sequences, which enables and significantly improves long-term tracking. We demonstrate qualitatively impressive tracking results, where points can be tracked for a long time even when they are occluded or leave the field of view. Quantitatively, CoTracker outperforms all recent trackers on standard benchmarks, often by a substantial margin.
Submitted 26 December, 2023; v1 submitted 14 July, 2023;
originally announced July 2023.
-
PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment
Authors:
Jianyuan Wang,
Christian Rupprecht,
David Novotny
Abstract:
Camera pose estimation is a long-standing computer vision problem that to date often relies on classical methods, such as handcrafted keypoint matching, RANSAC and bundle adjustment. In this paper, we propose to formulate the Structure from Motion (SfM) problem inside a probabilistic diffusion framework, modelling the conditional distribution of camera poses given input images. This novel view of an old problem has several advantages. (i) The nature of the diffusion framework mirrors the iterative procedure of bundle adjustment. (ii) The formulation allows a seamless integration of geometric constraints from epipolar geometry. (iii) It excels in typically difficult scenarios such as sparse views with wide baselines. (iv) The method can predict intrinsics and extrinsics for an arbitrary number of images. We demonstrate that our method PoseDiffusion significantly improves over the classic SfM pipelines and the learned approaches on two real-world datasets. Finally, it is observed that our method can generalize across datasets without further training. Project page: https://posediffusion.github.io/
Submitted 24 January, 2024; v1 submitted 27 June, 2023;
originally announced June 2023.
-
Diffusion Models for Zero-Shot Open-Vocabulary Segmentation
Authors:
Laurynas Karazija,
Iro Laina,
Andrea Vedaldi,
Christian Rupprecht
Abstract:
The variety of objects in the real world is nearly unlimited and is thus impossible to capture using models trained on a fixed set of categories. As a result, in recent years, open-vocabulary methods have attracted the interest of the community. This paper proposes a new method for zero-shot open-vocabulary segmentation. Prior work largely relies on contrastive training using image-text pairs, leveraging grouping mechanisms to learn image features that are both aligned with language and well-localised. This however can introduce ambiguity as the visual appearance of images with similar captions often varies. Instead, we leverage the generative properties of large-scale text-to-image diffusion models to sample a set of support images for a given textual category. This provides a distribution of appearances for a given text circumventing the ambiguity problem. We further propose a mechanism that considers the contextual background of the sampled images to better localise objects and segment the background directly. We show that our method can be used to ground several existing pre-trained self-supervised feature extractors in natural language and provide explainable predictions by mapping back to regions in the support set. Our proposal is training-free, relying on pre-trained components only, yet, shows strong performance on a range of open-vocabulary segmentation benchmarks, obtaining a lead of more than 10% on the Pascal VOC benchmark.
Submitted 15 June, 2023;
originally announced June 2023.
-
Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data
Authors:
Stanislaw Szymanowicz,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
We present Viewset Diffusion, a diffusion-based generator that outputs 3D objects while only using multi-view 2D data for supervision. We note that there exists a one-to-one mapping between viewsets, i.e., collections of several 2D views of an object, and 3D models. Hence, we train a diffusion model to generate viewsets, but design the neural network generator to reconstruct internally corresponding 3D models, thus generating those too. We fit a diffusion model to a large number of viewsets for a given category of objects. The resulting generator can be conditioned on zero, one or more input views. Conditioned on a single view, it performs 3D reconstruction accounting for the ambiguity of the task and allowing to sample multiple solutions compatible with the input. The model performs reconstruction efficiently, in a feed-forward manner, and is trained using only rendering losses using as few as three views per viewset. Project page: szymanowiczs.github.io/viewset-diffusion.
Submitted 1 September, 2023; v1 submitted 13 June, 2023;
originally announced June 2023.
-
DynamicStereo: Consistent Dynamic Depth from Stereo Videos
Authors:
Nikita Karaev,
Ignacio Rocco,
Benjamin Graham,
Natalia Neverova,
Andrea Vedaldi,
Christian Rupprecht
Abstract:
We consider the problem of reconstructing a dynamic scene observed from a stereo camera. Most existing methods for depth from stereo treat different stereo frames independently, leading to temporally inconsistent depth predictions. Temporal consistency is especially important for immersive AR or VR scenarios, where flickering greatly diminishes the user experience. We propose DynamicStereo, a novel transformer-based architecture to estimate disparity for stereo videos. The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions. Our architecture is designed to process stereo videos efficiently through divided attention layers. We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments, which provides complementary training and evaluation data for dynamic stereo closer to real applications than existing datasets. Training with this dataset further improves the quality of predictions of our proposed DynamicStereo as well as prior methods. Finally, it acts as a benchmark for consistent stereo methods.
Submitted 3 May, 2023;
originally announced May 2023.
-
Farm3D: Learning Articulated 3D Animals by Distilling 2D Diffusion
Authors:
Tomas Jakab,
Ruining Li,
Shangzhe Wu,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
We present Farm3D, a method for learning category-specific 3D reconstructors for articulated objects, relying solely on "free" virtual supervision from a pre-trained 2D diffusion-based image generator. Recent approaches can learn a monocular network that predicts the 3D shape, albedo, illumination, and viewpoint of any object occurrence, given a collection of single-view images of an object category. However, these approaches heavily rely on manually curated clean training data, which are expensive to obtain. We propose a framework that uses an image generator, such as Stable Diffusion, to generate synthetic training data that are sufficiently clean and do not require further manual curation, enabling the learning of such a reconstruction network from scratch. Additionally, we incorporate the diffusion model as a score to enhance the learning process. The idea involves randomizing certain aspects of the reconstruction, such as viewpoint and illumination, generating virtual views of the reconstructed 3D object, and allowing the 2D network to assess the quality of the resulting image, thus providing feedback to the reconstructor. Unlike work based on distillation, which produces a single 3D asset for each textual prompt, our approach yields a monocular reconstruction network capable of outputting a controllable 3D asset from any given image, whether real or generated, in a single forward pass in a matter of seconds. Our network can be used for analysis, including monocular reconstruction, or for synthesis, generating articulated assets for real-time applications such as video games.
Submitted 14 May, 2024; v1 submitted 20 April, 2023;
originally announced April 2023.
-
What does CLIP know about a red circle? Visual prompt engineering for VLMs
Authors:
Aleksandar Shtedritski,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
Large-scale Vision-Language Models, such as CLIP, learn powerful image-text representations that have found numerous applications, from zero-shot classification to text-to-image generation. Despite that, their capabilities for solving novel discriminative tasks via prompting fall behind those of large language models, such as GPT-3. Here we explore the idea of visual prompt engineering for solving computer vision tasks beyond classification by editing in image space instead of text. In particular, we discover an emergent ability of CLIP, where, by simply drawing a red circle around an object, we can direct the model's attention to that region, while also maintaining global information. We show the power of this simple approach by achieving state-of-the-art in zero-shot referring expressions comprehension and strong performance in keypoint localization tasks. Finally, we draw attention to some potential ethical concerns of large language-vision models.
Submitted 18 August, 2023; v1 submitted 13 April, 2023;
originally announced April 2023.
-
Continual Detection Transformer for Incremental Object Detection
Authors:
Yaoyao Liu,
Bernt Schiele,
Andrea Vedaldi,
Christian Rupprecht
Abstract:
Incremental object detection (IOD) aims to train an object detector in phases, each with annotations for new object categories. As in other incremental settings, IOD is subject to catastrophic forgetting, which is often addressed by techniques such as knowledge distillation (KD) and exemplar replay (ER). However, KD and ER do not work well if applied directly to state-of-the-art transformer-based object detectors such as Deformable DETR and UP-DETR. In this paper, we solve these issues by proposing a ContinuaL DEtection TRansformer (CL-DETR), a new method for transformer-based IOD which enables effective usage of KD and ER in this context. First, we introduce a Detector Knowledge Distillation (DKD) loss, focusing on the most informative and reliable predictions from old versions of the model, ignoring redundant background predictions, and ensuring compatibility with the available ground-truth labels. We also improve ER by proposing a calibration strategy to preserve the label distribution of the training set, therefore better matching training and testing statistics. We conduct extensive experiments on COCO 2017 and demonstrate that CL-DETR achieves state-of-the-art results in the IOD setting.
△ Less
Submitted 6 April, 2023;
originally announced April 2023.
-
Temperature Schedules for Self-Supervised Contrastive Methods on Long-Tail Data
Authors:
Anna Kukleva,
Moritz Böhle,
Bernt Schiele,
Hilde Kuehne,
Christian Rupprecht
Abstract:
Most approaches for self-supervised learning (SSL) are optimised on curated balanced datasets, e.g. ImageNet, despite the fact that natural data usually exhibits long-tail distributions. In this paper, we analyse the behaviour of one of the most popular variants of SSL, i.e. contrastive methods, on long-tail data. In particular, we investigate the role of the temperature parameter $τ$ in the contrastive loss, by analysing the loss through the lens of average distance maximisation, and find that a large $τ$ emphasises group-wise discrimination, whereas a small $τ$ leads to a higher degree of instance discrimination. While $τ$ has thus far been treated exclusively as a constant hyperparameter, in this work, we propose to employ a dynamic $τ$ and show that a simple cosine schedule can yield significant improvements in the learnt representations. Such a schedule results in a constant `task switching' between an emphasis on instance discrimination and group-wise discrimination and thereby ensures that the model learns both group-wise features, as well as instance-specific details. Since frequent classes benefit from the former, while infrequent classes require the latter, we find this method to consistently improve separation between the classes in long-tail data without any additional computational cost.
Submitted 23 March, 2023;
originally announced March 2023.
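The dynamic temperature described in the abstract above can be illustrated with a minimal sketch of a cosine schedule. The abstract does not give the exact formula, so the function name `cosine_temperature`, the bounds `tau_min` and `tau_max`, and the `period` parameter are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_temperature(step, period, tau_min=0.1, tau_max=1.0):
    """Hypothetical cosine schedule: tau oscillates between tau_max and tau_min.

    The bounds and the period are illustrative assumptions, not values
    taken from the abstract.
    """
    phase = 2.0 * math.pi * step / period
    return tau_min + 0.5 * (tau_max - tau_min) * (1.0 + math.cos(phase))


# One full oscillation sampled at 9 points: the schedule starts at tau_max,
# reaches tau_min halfway through the period, then returns to tau_max.
schedule = [cosine_temperature(t, period=8) for t in range(9)]
```

Under this sketch, steps near `tau_max` would correspond to the group-wise emphasis and steps near `tau_min` to the instance-discrimination emphasis that the abstract describes.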
-
$PC^2$: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction
Authors:
Luke Melas-Kyriazi,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
Reconstructing the 3D shape of an object from a single RGB image is a long-standing and highly challenging problem in computer vision. In this paper, we propose a novel method for single-image 3D reconstruction which generates a sparse point cloud via a conditional denoising diffusion process. Our method takes as input a single RGB image along with its camera pose and gradually denoises a set of 3D points, whose positions are initially sampled randomly from a three-dimensional Gaussian distribution, into the shape of an object. The key to our method is a geometrically-consistent conditioning process which we call projection conditioning: at each step in the diffusion process, we project local image features onto the partially-denoised point cloud from the given camera pose. This projection conditioning process enables us to generate high-resolution sparse geometries that are well-aligned with the input image, and can additionally be used to predict point colors after shape reconstruction. Moreover, due to the probabilistic nature of the diffusion process, our method is naturally capable of generating multiple different shapes consistent with a single input image. In contrast to prior work, our approach not only performs well on synthetic benchmarks, but also gives large qualitative improvements on complex real-world data.
Submitted 23 February, 2023; v1 submitted 21 February, 2023;
originally announced February 2023.
-
RealFusion: 360° Reconstruction of Any Object from a Single Image
Authors:
Luke Melas-Kyriazi,
Christian Rupprecht,
Iro Laina,
Andrea Vedaldi
Abstract:
We consider the problem of reconstructing a full 360° photographic model of an object from a single image of it. We do so by fitting a neural radiance field to the image, but find this problem to be severely ill-posed. We thus take an off-the-shelf conditional image generator based on diffusion and engineer a prompt that encourages it to "dream up" novel views of the object. Using an approach inspired by DreamFields and DreamFusion, we fuse the given input view, the conditional prior, and other regularizers in a final, consistent reconstruction. We demonstrate state-of-the-art reconstruction results on benchmark images when compared to prior methods for monocular 3D reconstruction of objects. Qualitatively, our reconstructions provide a faithful match of the input view and a plausible extrapolation of its appearance and 3D shape, including to the side of the object not visible in the image.
Submitted 23 February, 2023; v1 submitted 21 February, 2023;
originally announced February 2023.
-
Behind the Scenes: Density Fields for Single View Reconstruction
Authors:
Felix Wimbauer,
Nan Yang,
Christian Rupprecht,
Daniel Cremers
Abstract:
Inferring a meaningful geometric scene representation from a single image is a fundamental problem in computer vision. Approaches based on traditional depth map prediction can only reason about areas that are visible in the image. Currently, neural radiance fields (NeRFs) can capture true 3D including color, but are too complex to be generated from a single image. As an alternative, we propose to predict implicit density fields. A density field maps every location in the frustum of the input image to volumetric density. By directly sampling color from the available views instead of storing color in the density field, our scene representation becomes significantly less complex compared to NeRFs, and a neural network can predict it in a single forward pass. The prediction network is trained through self-supervision from only video data. Our formulation allows volume rendering to perform both depth prediction and novel view synthesis. Through experiments, we show that our method is able to predict meaningful geometry for regions that are occluded in the input image. Additionally, we demonstrate the potential of our approach on three datasets for depth prediction and novel-view synthesis.
Submitted 19 April, 2023; v1 submitted 18 January, 2023;
originally announced January 2023.
-
MagicPony: Learning Articulated 3D Animals in the Wild
Authors:
Shangzhe Wu,
Ruining Li,
Tomas Jakab,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
We consider the problem of predicting the 3D shape, articulation, viewpoint, texture, and lighting of an articulated animal like a horse given a single test image as input. We present a new method, dubbed MagicPony, that learns this predictor purely from in-the-wild single-view images of the object category, with minimal assumptions about the topology of deformation. At its core is an implicit-explicit representation of articulated shape and appearance, combining the strengths of neural fields and meshes. In order to help the model understand an object's shape and pose, we distil the knowledge captured by an off-the-shelf self-supervised vision transformer and fuse it into the 3D model. To overcome local optima in viewpoint estimation, we further introduce a new viewpoint sampling scheme that comes at no additional training cost. MagicPony outperforms prior work on this challenging task and demonstrates excellent generalisation in reconstructing art, despite the fact that it is only trained on real images.
Submitted 3 April, 2023; v1 submitted 22 November, 2022;
originally announced November 2022.
-
Nondestructive thermographic detection of internal defects using pixel-pattern based laser excitation and photothermal super resolution reconstruction
Authors:
Julien Lecompagnon,
Philipp Daniel Hirsch,
Christian Rupprecht,
Mathias Ziegler
Abstract:
In this work, we present a novel approach to photothermal super resolution based thermographic resolution of internal defects using two-dimensional pixel pattern-based active photothermal laser heating in conjunction with subsequent numerical reconstruction to achieve a high-resolution reconstruction of internal defect structures. With the proposed adoption of pixelated patterns generated using laser coupled high-power DLP projector technology the complexity for achieving true two-dimensional super resolution can be dramatically reduced taking a crucial step forward towards widespread practical viability. Furthermore, based on the latest developments in high-power DLP projectors, we present their first application for structured pulsed thermographic inspection of macroscopic metal samples. In addition, a forward solution to the underlying inverse problem is proposed along with an appropriate heuristic to find the regularization parameters necessary for the numerical inversion in a laboratory setting. This allows the generation of synthetic measurement data, opening the door for the application of machine learning based methods for future improvements towards full automation of the method. Finally, the proposed method is experimentally validated and shown to outperform several established conventional thermographic testing techniques while conservatively improving the required measurement times by a factor of 8 compared to currently available photothermal super resolution techniques.
Submitted 3 January, 2023; v1 submitted 8 November, 2022;
originally announced November 2022.
-
Unsupervised Multi-object Segmentation by Predicting Probable Motion Patterns
Authors:
Laurynas Karazija,
Subhabrata Choudhury,
Iro Laina,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
We propose a new approach to learn to segment multiple image objects without manual supervision. The method can extract objects from still images, but uses videos for supervision. While prior works have considered motion for segmentation, a key insight is that, while motion can be used to identify objects, not all objects are necessarily in motion: the absence of motion does not imply the absence of objects. Hence, our model learns to predict image regions that are likely to contain motion patterns characteristic of objects moving rigidly. It does not predict specific motion, which cannot be done unambiguously from a still image, but a distribution of possible motions, which includes the possibility that an object does not move at all. We demonstrate the advantage of this approach over its deterministic counterpart and show state-of-the-art unsupervised object segmentation performance on simulated and real-world benchmarks, surpassing methods that use motion even at test time. As our approach is applicable to a variety of network architectures that segment the scenes, we also apply it to existing image reconstruction-based models, showing drastic improvement. Project page and code: https://www.robots.ox.ac.uk/~vgg/research/ppmp .
Submitted 21 October, 2022;
originally announced October 2022.
-
VTC: Improving Video-Text Retrieval with User Comments
Authors:
Laura Hanu,
James Thewlis,
Yuki M. Asano,
Christian Rupprecht
Abstract:
Multi-modal retrieval is an important problem for many applications, such as recommendation and search. Current benchmarks and even datasets are often manually constructed and consist of mostly clean samples where all modalities are well-correlated with the content. Thus, current video-text retrieval literature largely focuses on video titles or audio transcripts, while ignoring user comments, since users often tend to discuss topics only vaguely related to the video. Despite the ubiquity of user comments online, there are currently no multi-modal representation learning datasets that include comments. In this paper, we a) introduce a new dataset of videos, titles and comments; b) present an attention-based mechanism that allows the model to learn from sometimes irrelevant data such as comments; c) show that by using comments, our method is able to learn better, more contextualised representations for image, video and audio. Project page: https://unitaryai.github.io/vtc-paper.
Submitted 19 October, 2022;
originally announced October 2022.
-
Guess What Moves: Unsupervised Video and Image Segmentation by Anticipating Motion
Authors:
Subhabrata Choudhury,
Laurynas Karazija,
Iro Laina,
Andrea Vedaldi,
Christian Rupprecht
Abstract:
Motion, measured via optical flow, provides a powerful cue to discover and learn objects in images and videos. However, compared to using appearance, it has some blind spots, such as the fact that objects become invisible if they do not move. In this work, we propose an approach that combines the strengths of motion-based and appearance-based segmentation. We propose to supervise an image segmentation network with the pretext task of predicting regions that are likely to contain simple motion patterns, and thus likely to correspond to objects. As the model only uses a single image as input, we can apply it in two settings: unsupervised video segmentation, and unsupervised image segmentation. We achieve state-of-the-art results for videos, and demonstrate the viability of our approach on still images containing novel objects. Additionally, we experiment with different motion models and optical flow backbones and find the method to be robust to these changes. Project page and code available at https://www.robots.ox.ac.uk/~vgg/research/gwm.
Submitted 13 October, 2022; v1 submitted 16 May, 2022;
originally announced May 2022.
-
Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization
Authors:
Luke Melas-Kyriazi,
Christian Rupprecht,
Iro Laina,
Andrea Vedaldi
Abstract:
Unsupervised localization and segmentation are long-standing computer vision challenges that involve decomposing an image into semantically-meaningful segments without any labeled data. These tasks are particularly interesting in an unsupervised setting due to the difficulty and cost of obtaining dense image annotations, but existing unsupervised approaches struggle with complex scenes containing multiple objects. Differently from existing methods, which are purely based on deep learning, we take inspiration from traditional spectral segmentation methods by reframing image decomposition as a graph partitioning problem. Specifically, we examine the eigenvectors of the Laplacian of a feature affinity matrix from self-supervised networks. We find that these eigenvectors already decompose an image into meaningful segments, and can be readily used to localize objects in a scene. Furthermore, by clustering the features associated with these segments across a dataset, we can obtain well-delineated, nameable regions, i.e. semantic segmentations. Experiments on complex datasets (Pascal VOC, MS-COCO) demonstrate that our simple spectral method outperforms the state-of-the-art in unsupervised localization and segmentation by a significant margin. Furthermore, our method can be readily used for a variety of complex image editing tasks, such as background removal and compositing.
Submitted 16 May, 2022;
originally announced May 2022.
-
Brightening of a dark monolayer semiconductor via strong light-matter coupling in a cavity
Authors:
Hangyong Shan,
Ivan Iorsh,
Bo Han,
Christoph Rupprecht,
Heiko Knopf,
Falk Eilenberger,
Martin Esmann,
Kentaro Yumigeta,
Kenji Watanabe,
Takashi Taniguchi,
Sebastian Klembt,
Sven Höfling,
Sefaattin Tongay,
Carlos Antón-Solanas,
Ivan A. Shelykh,
Christian Schneider
Abstract:
Engineering the properties of quantum materials via strong light-matter coupling is a compelling research direction with a multiplicity of modern applications. These range from modifying charge transport in organic molecules and steering particle correlations and interactions to controlling chemical reactions. Here, we study the modification of the material properties via strong coupling and demonstrate an effective inversion of the excitonic band-ordering in a monolayer of WSe2 with a spin-forbidden, optically dark ground state. In our experiments, we harness the strong light-matter coupling between the cavity photon and the high-energy, spin-allowed bright exciton, thus creating two bright polaritonic modes in the optical bandgap with the lower polariton mode pushed below the WSe2 dark state. We demonstrate that in this regime the commonly observed luminescence quenching stemming from the fast relaxation to the dark ground state is prevented, which results in the brightening of this intrinsically dark material. We probe this effective brightening by temperature-dependent photoluminescence, and we find excellent agreement with a theoretical model accounting for the inversion of the band ordering and phonon-assisted polariton relaxation.
Submitted 26 April, 2022;
originally announced April 2022.
-
Thermographic detection of internal defects using 2D photothermal super resolution reconstruction with sequential laser heating
Authors:
Julien Lecompagnon,
Samim Ahmadi,
Philipp Hirsch,
Christian Rupprecht,
Mathias Ziegler
Abstract:
Thermographic photothermal super resolution reconstruction enables the resolution of internal defects/inhomogeneities below the classical limit which is governed by the diffusion properties of thermal wave propagation. Based on a combination of the application of special sampling strategies and a subsequent numerical optimization step in post-processing, thermographic super resolution has already proven to be superior to standard thermographic methods in the detection of one-dimensional defect/inhomogeneity structures. In our work, we report an extension of the capabilities of the method for efficient detection and resolution of defect cross sections with fully two-dimensional structured laser-based heating. The reconstruction is carried out using one of two different algorithms which are proposed within this work. Both algorithms utilize the combination of several coherent measurements using convex optimization and exploit the sparse nature of defects/inhomogeneities as is typical for most nondestructive testing scenarios. Finally, the performance of each algorithm is rated on reconstruction quality and algorithmic complexity. The presented experimental approach is based on repeated spatially structured heating by a high power laser. As a result, a two-dimensional sparse defect/inhomogeneity map can be obtained. In addition, the obtained results are compared with those of conventional thermographic inspection methods which make use of homogeneous illumination. Due to the sparse nature of the reconstructed defect/inhomogeneity map, this comparison is performed qualitatively.
Submitted 24 April, 2022; v1 submitted 1 March, 2022;
originally announced March 2022.
-
De-rendering 3D Objects in the Wild
Authors:
Felix Wimbauer,
Shangzhe Wu,
Christian Rupprecht
Abstract:
With increasing focus on augmented and virtual reality applications (XR) comes the demand for algorithms that can lift objects from images and videos into representations that are suitable for a wide variety of related 3D tasks. Large-scale deployment of XR devices and applications means that we cannot solely rely on supervised learning, as collecting and annotating data for the unlimited variety of objects in the real world is infeasible. We present a weakly supervised method that is able to decompose a single image of an object into shape (depth and normals), material (albedo, reflectivity and shininess) and global lighting parameters. For training, the method only relies on a rough initial shape estimate of the training objects to bootstrap the learning process. This shape supervision can come for example from a pretrained depth network or - more generically - from a traditional structure-from-motion pipeline. In our experiments, we show that the method can successfully de-render 2D images into a decomposed 3D representation and generalizes to unseen object categories. Since in-the-wild evaluation is difficult due to the lack of ground truth data, we also introduce a photo-realistic synthetic test set that allows for quantitative evaluation.
Submitted 27 September, 2022; v1 submitted 6 January, 2022;
originally announced January 2022.
-
ClevrTex: A Texture-Rich Benchmark for Unsupervised Multi-Object Segmentation
Authors:
Laurynas Karazija,
Iro Laina,
Christian Rupprecht
Abstract:
There has been a recent surge in methods that aim to decompose and segment scenes into multiple objects in an unsupervised manner, i.e., unsupervised multi-object segmentation. Performing such a task is a long-standing goal of computer vision, offering to unlock object-level reasoning without requiring dense annotations to train segmentation models. Despite significant progress, current models are developed and trained on visually simple scenes depicting mono-colored objects on plain backgrounds. The natural world, however, is visually complex with confounding aspects such as diverse textures and complicated lighting effects. In this study, we present a new benchmark called ClevrTex, designed as the next challenge to compare, evaluate and analyze algorithms. ClevrTex features synthetic scenes with diverse shapes, textures and photo-mapped materials, created using physically based rendering techniques. It includes 50k examples depicting 3-10 objects arranged on a background, created using a catalog of 60 materials, and a further test set featuring 10k images created using 25 different materials. We benchmark a large set of recent unsupervised multi-object segmentation models on ClevrTex and find all state-of-the-art approaches fail to learn good representations in the textured setting, despite impressive performance on simpler data. We also create variants of the ClevrTex dataset, controlling for different aspects of scene complexity, and probe current approaches for individual shortcomings. Dataset and code are available at https://www.robots.ox.ac.uk/~vgg/research/clevrtex.
Submitted 19 November, 2021;
originally announced November 2021.
-
Unsupervised Part Discovery from Contrastive Reconstruction
Authors:
Subhabrata Choudhury,
Iro Laina,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
The goal of self-supervised visual representation learning is to learn strong, transferable image representations, with the majority of research focusing on object or scene level. On the other hand, representation learning at part level has received significantly less attention. In this paper, we propose an unsupervised approach to object part discovery and segmentation and make three contributions. First, we construct a proxy task through a set of objectives that encourages the model to learn a meaningful decomposition of the image into its parts. Secondly, prior work argues for reconstructing or clustering pre-computed features as a proxy to parts; we show empirically that this alone is unlikely to find meaningful parts; mainly because of their low resolution and the tendency of classification networks to spatially smear out information. We suggest that image reconstruction at the level of pixels can alleviate this problem, acting as a complementary cue. Lastly, we show that the standard evaluation based on keypoint regression does not correlate well with segmentation quality and thus introduce different metrics, NMI and ARI, that better characterize the decomposition of objects into parts. Our method yields semantic parts which are consistent across fine-grained but visually distinct categories, outperforming the state of the art on three benchmark datasets. Code is available at the project page: https://www.robots.ox.ac.uk/~vgg/research/unsup-parts/.
Submitted 21 March, 2022; v1 submitted 11 November, 2021;
originally announced November 2021.
-
The Curious Layperson: Fine-Grained Image Recognition without Expert Labels
Authors:
Subhabrata Choudhury,
Iro Laina,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
Most of us are not experts in specific fields, such as ornithology. Nonetheless, we do have general image and language understanding capabilities that we use to match what we see to expert resources. This allows us to expand our knowledge and perform novel tasks without ad-hoc external supervision. On the contrary, machines have a much harder time consulting expert-curated knowledge bases unless trained specifically with that knowledge in mind. Thus, in this paper we consider a new problem: fine-grained image recognition without expert annotations, which we address by leveraging the vast knowledge available in web encyclopedias. First, we learn a model to describe the visual appearance of objects using non-expert image descriptions. We then train a fine-grained textual similarity model that matches image descriptions with documents on a sentence-level basis. We evaluate the method on two datasets and compare with several strong baselines and the state of the art in cross-modal retrieval. Code is available at: https://github.com/subhc/clever
Submitted 5 November, 2021;
originally announced November 2021.
-
PASS: An ImageNet replacement for self-supervised pretraining without humans
Authors:
Yuki M. Asano,
Christian Rupprecht,
Andrew Zisserman,
Andrea Vedaldi
Abstract:
Computer vision has long relied on ImageNet and other large datasets of images sampled from the Internet for pretraining models. However, these datasets have ethical and technical shortcomings, such as containing personal information taken without consent, unclear license usage, biases, and, in some cases, even problematic image content. On the other hand, state-of-the-art pretraining is nowadays obtained with unsupervised methods, meaning that labelled datasets such as ImageNet may not be necessary, or perhaps not even optimal, for model pretraining. We thus propose an unlabelled dataset PASS: Pictures without humAns for Self-Supervision. PASS only contains images with CC-BY license and complete attribution metadata, addressing the copyright issue. Most importantly, it contains no images of people at all, and also avoids other types of images that are problematic for data protection or ethics. We show that PASS can be used for pretraining with methods such as MoCo-v2, SwAV and DINO. In the transfer learning setting, it yields similar downstream performances to ImageNet pretraining even on tasks that involve humans, such as human pose estimation. PASS does not make existing datasets obsolete, as for instance it is insufficient for benchmarking. However, it shows that model pretraining is often possible while using safer data, and it also provides the basis for a more robust evaluation of pretraining methods.
Submitted 27 September, 2021;
originally announced September 2021.
-
DOVE: Learning Deformable 3D Objects by Watching Videos
Authors:
Shangzhe Wu,
Tomas Jakab,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
Learning deformable 3D objects from 2D images is often an ill-posed problem. Existing methods rely on explicit supervision to establish multi-view correspondences, such as template shape models and keypoint annotations, which restricts their applicability on objects "in the wild". A more natural way of establishing correspondences is by watching videos of objects moving around. In this paper, we present DOVE, a method that learns textured 3D models of deformable object categories from monocular videos available online, without keypoint, viewpoint or template shape supervision. By resolving symmetry-induced pose ambiguities and leveraging temporal correspondences in videos, the model automatically learns to factor out 3D shape, articulated pose and texture from each individual RGB frame, and is ready for single-image inference at test time. In the experiments, we show that existing methods fail to learn sensible 3D shapes without additional keypoint or template supervision, whereas our method produces temporally consistent 3D models, which can be animated and rendered from arbitrary viewpoints.
Submitted 29 June, 2022; v1 submitted 22 July, 2021;
originally announced July 2021.
-
Finding an Unsupervised Image Segmenter in Each of Your Deep Generative Models
Authors:
Luke Melas-Kyriazi,
Christian Rupprecht,
Iro Laina,
Andrea Vedaldi
Abstract:
Recent research has shown that numerous human-interpretable directions exist in the latent space of GANs. In this paper, we develop an automatic procedure for finding directions that lead to foreground-background image separation, and we use these directions to train an image segmentation model without human supervision. Our method is generator-agnostic, producing strong segmentation results with a wide range of different GAN architectures. Furthermore, by leveraging GANs pretrained on large datasets such as ImageNet, we are able to segment images from a range of domains without further training or finetuning. Evaluating our method on image segmentation benchmarks, we compare favorably to prior work while using neither human supervision nor access to the training data. Broadly, our results demonstrate that automatically extracting foreground-background structure from pretrained deep generative models can serve as a remarkably effective substitute for human supervision.
Submitted 17 May, 2021;
originally announced May 2021.
-
Neural Response Interpretation through the Lens of Critical Pathways
Authors:
Ashkan Khakzar,
Soroosh Baselizadeh,
Saurabh Khanduja,
Christian Rupprecht,
Seong Tae Kim,
Nassir Navab
Abstract:
Is critical input information encoded in specific sparse pathways within the neural network? In this work, we discuss the problem of identifying these critical pathways and subsequently leverage them for interpreting the network's response to an input. The pruning objective -- selecting the smallest group of neurons for which the response remains equivalent to the original network -- has been previously proposed for identifying critical pathways. We demonstrate that sparse pathways derived from pruning do not necessarily encode critical input information. To ensure sparse pathways include critical fragments of the encoded input information, we propose pathway selection via neurons' contribution to the response. We proceed to explain how critical pathways can reveal critical input features. We prove that pathways selected via neuron contribution are locally linear (in an L2-ball), a property that we use for proposing a feature attribution method: "pathway gradient". We validate our interpretation method using mainstream evaluation experiments. The validation of pathway gradient interpretation method further confirms that selected pathways using neuron contributions correspond to critical input features. The code is publicly available.
Submitted 31 March, 2021;
originally announced March 2021.
-
Spatial coherence of room-temperature monolayer WSe$_2$ exciton-polaritons in a trap
Authors:
Hangyong Shan,
Lukas Lackner,
Bo Han,
Evgeny Sedov,
Christoph Rupprecht,
Heiko Knopf,
Falk Eilenberger,
Johannes Beierlein,
Nils Kunte,
Martin Esmann,
Kentaro Yumigeta,
Kenji Watanabe,
Takashi Taniguchi,
Sebastian Klembt,
Sven Höfling,
Alexey V. Kavokin,
Sefaattin Tongay,
Christian Schneider,
Carlos Antón-Solanas
Abstract:
The emergence of spatial and temporal coherence of light emitted from solid-state systems is a fundamental phenomenon, rooted in a plethora of microscopic processes. It is intrinsically aligned with the control of light-matter coupling, and canonical for laser oscillation. However, it also emerges in the superradiance of multiple, phase-locked emitters, and more recently, coherence and long-range order have been investigated in bosonic condensates of thermalized light, as well as in exciton-polaritons driven to a ground state via stimulated scattering. Here, we experimentally show that the interaction between photons in a Fabry-Perot microcavity and excitons in an atomically thin WSe$_2$ layer is sufficient such that the system enters the hybridized regime of strong light-matter coupling at ambient conditions. Via Michelson interferometry, we capture clear evidence of increased spatial and temporal coherence of the emitted light from the spatially confined system ground-state. The coherence build-up is accompanied by a threshold-like behaviour of the emitted light intensity, which is a fingerprint of a polariton laser effect. Valley-physics is manifested in the presence of an external magnetic field, which allows us to manipulate K and K' polaritons via the Valley-Zeeman-effect. Our findings are of high application relevance, as they confirm the possibility of using atomically thin crystals as simple and versatile components of coherent light-sources, and in valleytronic applications at room temperature.
Submitted 9 November, 2021; v1 submitted 18 March, 2021;
originally announced March 2021.
-
Micro-mechanical assembly of high-quality Fabry-Perot microcavities for the integration with two-dimensional materials
Authors:
Christoph Rupprecht,
Nils Lundt,
Sven Höfling,
Christian Schneider
Abstract:
Integrating monolayers of two-dimensional semiconductors in planar, and potentially microstructured, microcavities is challenging because few approaches are available to overgrow the monolayers without damaging them. Some strategies have been developed, but they either rely on complicated experimental settings or expensive technologies, or compromise the available quality factors. As a result, high-quality Fabry-Perot microcavities are not widely available to the community focusing on light-matter coupling with atomically thin materials. Here, we provide details on a recently developed technique to micro-mechanically assemble Fabry-Perot microcavities. Our approach does not rely on difficult or expensive technologies, and yields device characteristics marking the state of the art in cavities with integrated atomically thin semiconductors.
Submitted 17 September, 2020;
originally announced September 2020.
-
Demonstration of a polariton step potential by local variation of light-matter coupling in a van-der-Waals heterostructure
Authors:
C. Rupprecht,
M. Klaas,
H. Knopf,
T. Taniguchi,
K. Watanabe,
Y. Qin,
S. Tongay,
S. Schröder,
F. Eilenberger,
S. Höfling,
C. Schneider
Abstract:
The large oscillator strength of excitons in transition metal dichalcogenide layers facilitates the formation of exciton-polariton resonances for monolayers and van-der-Waals heterostructures embedded in optical microcavities. Here, we show that locally changing the number of layers in a WSe2/hBN/WSe2 van-der-Waals heterostructure embedded in a monolithic, high-quality-factor cavity gives rise to a local variation of the coupling strength. This effect yields a polaritonic staircase potential, which we demonstrate at room temperature. Our result paves the way towards engineering local polaritonic potentials at length scales down to atomically sharp interfaces, based on purely modifying its real part contribution via the coherent light-matter coupling strength g.
Submitted 29 July, 2020;
originally announced July 2020.
-
Manipulation of room-temperature Valley-Coherent Exciton-Polaritons in atomically thin crystals by real and artificial magnetic fields
Authors:
Christoph Rupprecht,
Evgeny Sedov,
Martin Klaas,
Heiko Knopf,
Mark Blei,
Nils Lundt,
Sefaattin Tongay,
Takashi Taniguchi,
Kenji Watanabe,
Ulrike Schulz,
Alexey Kavokin,
Falk Eilenberger,
Sven Höfling,
Christian Schneider
Abstract:
Strong spin-orbit coupling and inversion symmetry breaking in transition metal dichalcogenide monolayers yield the intriguing effects of valley-dependent optical selection rules. As such, it is possible to substantially polarize valley excitons with chiral light and furthermore create coherent superpositions of K and K- polarized states. Yet, at ambient conditions dephasing usually becomes too dominant, and valley coherence typically is not observable. Here, we demonstrate that valley coherence is, however, clearly observable for a single monolayer of WSe2, if it is strongly coupled to the optical mode of a high quality factor microcavity. The azimuthal vector, representing the phase of the valley coherent superposition, can be directly manipulated by applying magnetic fields, and furthermore, it sensibly reacts to the polarization anisotropy of the cavity which represents an artificial magnetic field. Our results are in qualitative and quantitative agreement with our model based on pseudospin rate equations, accounting for both effects of real and pseudo-magnetic fields.
Submitted 23 July, 2020;
originally announced July 2020.
-
Labelling unlabelled videos from scratch with multi-modal self-supervision
Authors:
Yuki M. Asano,
Mandela Patrick,
Christian Rupprecht,
Andrea Vedaldi
Abstract:
A large part of the current success of deep learning lies in the effectiveness of data -- more precisely: labelled data. Yet, labelling a dataset with human annotation continues to carry high costs, especially for videos. While in the image domain, recent methods have made it possible to generate meaningful (pseudo-) labels for unlabelled datasets without supervision, this development is missing for the video domain where learning feature representations is the current focus. In this work, we a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders and b) propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations, by leveraging the natural correspondence between the audio and visual modalities. An extensive analysis shows that the resulting clusters have high semantic overlap with ground truth human labels. We further introduce the first benchmarking results on unsupervised labelling of the common video datasets Kinetics, Kinetics-Sound, VGG-Sound and AVE.
Submitted 28 February, 2021; v1 submitted 24 June, 2020;
originally announced June 2020.