-
ViPer: Visual Personalization of Generative Models via Individual Preference Learning
Authors:
Sogand Salehi,
Mahdi Shafiei,
Teresa Yeo,
Roman Bachmann,
Amir Zamir
Abstract:
Different users find different images generated for the same prompt desirable. This gives rise to personalized image generation, which involves creating images aligned with an individual's visual preference. Current generative models are, however, unpersonalized, as they are tuned to produce outputs that appeal to a broad audience. Using them to generate images aligned with individual users relies on iterative manual prompt engineering by the user, which is inefficient and undesirable. We propose to personalize the image generation process by first capturing the generic preferences of the user in a one-time process by inviting them to comment on a small selection of images, explaining why they like or dislike each. Based on these comments, we infer a user's structured liked and disliked visual attributes, i.e., their visual preference, using a large language model. These attributes are used to guide a text-to-image model toward producing images that are tuned towards the individual user's visual preference. Through a series of user studies and large language model guided evaluations, we demonstrate that the proposed method results in generations that are well aligned with individual users' visual preferences.
Submitted 24 July, 2024;
originally announced July 2024.
-
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Authors:
Roman Bachmann,
Oğuzhan Fatih Kar,
David Mizrahi,
Ali Garjani,
Mingfei Gao,
David Griffiths,
Jiaming Hu,
Afshin Dehghan,
Amir Zamir
Abstract:
Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we expand upon their capabilities by training a single model on tens of highly diverse modalities and by performing co-training on large-scale multimodal datasets and text corpora. This includes training on several semantic and geometric modalities, feature maps from recent state-of-the-art models like DINOv2 and ImageBind, pseudo labels of specialist models like SAM and 4DHumans, and a range of new modalities that allow for novel ways to interact with the model and steer the generation, for example image metadata or color palettes. A crucial step in this process is performing discrete tokenization on various modalities, whether they are image-like, neural network feature maps, vectors, structured data like instance segmentation or human poses, or data that can be represented as text. Through this, we expand on the out-of-the-box capabilities of multimodal models and specifically show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance. This enables more fine-grained and controllable multimodal generation capabilities and allows us to study the distillation of models trained on diverse data and objectives into a unified model. We successfully scale the training to a three billion parameter model using tens of modalities and different datasets. The resulting models and training code are open-sourced at 4m.epfl.ch.
Submitted 14 June, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
EUSO-SPB1 Mission and Science
Authors:
JEM-EUSO Collaboration,
G. Abdellaoui,
S. Abe,
J. H. Adams. Jr.,
D. Allard,
G. Alonso,
L. Anchordoqui,
A. Anzalone,
E. Arnone,
K. Asano,
R. Attallah,
H. Attoui,
M. Ave Pernas,
R. Bachmann,
S. Bacholle,
M. Bagheri,
M. Bakiri,
J. Baláz,
D. Barghini,
S. Bartocci,
M. Battisti,
J. Bayer,
B. Beldjilali,
T. Belenguer
, et al. (271 additional authors not shown)
Abstract:
The Extreme Universe Space Observatory on a Super Pressure Balloon 1 (EUSO-SPB1) was launched in 2017 April from Wanaka, New Zealand. The plan of this mission of opportunity on a NASA super pressure balloon test flight was to circle the southern hemisphere. The primary scientific goal was to make the first observations of ultra-high-energy cosmic-ray extensive air showers (EASs) by looking down on the atmosphere with an ultraviolet (UV) fluorescence telescope from suborbital altitude (33 km). After 12 days and 4 hours aloft, the flight was terminated prematurely in the Pacific Ocean. Before the flight, the instrument was tested extensively in the West Desert of Utah, USA, with UV point sources and lasers. The test results indicated that the instrument had sensitivity to EASs of approximately 3 EeV. Simulations of the telescope system, telescope on time, and realized flight trajectory predicted an observation of about 1 event assuming clear sky conditions. The effects of high clouds were estimated to reduce this value by approximately a factor of 2. A manual search and a machine-learning-based search did not find any EAS signals in these data. Here we review the EUSO-SPB1 instrument and flight and the EAS search.
Submitted 12 January, 2024;
originally announced January 2024.
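Worked estimate (not from the paper): the abstract quotes about 1 expected event under clear skies, reduced by roughly a factor of 2 by high clouds. Assuming, purely for illustration, that the event count is Poisson distributed with that mean, a null result is the most likely single outcome. The symbols N_clear, mu, and the Poisson model are assumptions introduced here.

```latex
\begin{align}
  N_{\text{clear}} &\approx 1, \\
  \mu &\approx \frac{N_{\text{clear}}}{2} \approx 0.5, \\
  P(0 \mid \mu) &= e^{-\mu} \approx e^{-0.5} \approx 0.61 .
\end{align}
```

Under these assumptions, finding no EAS signal has a probability of roughly 60%, so the null search result is compatible with the quoted expectation.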
-
4M: Massively Multimodal Masked Modeling
Authors:
David Mizrahi,
Roman Bachmann,
Oğuzhan Fatih Kar,
Teresa Yeo,
Mingfei Gao,
Afshin Dehghan,
Amir Zamir
Abstract:
Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, hinting at a possibility for similarly versatile models in computer vision. In this paper, we take a step in this direction and propose a multimodal training scheme called 4M. It consists of training a single unified Transformer encoder-decoder using a masked modeling objective across a wide range of input/output modalities - including text, images, geometric, and semantic modalities, as well as neural network feature maps. 4M achieves scalability by unifying the representation space of all modalities through mapping them into discrete tokens and performing multimodal masked modeling on a small randomized subset of tokens.
4M leads to models that exhibit several key capabilities: (1) they can perform a diverse set of vision tasks out of the box, (2) they excel when fine-tuned for unseen downstream tasks or new input modalities, and (3) they can function as a generative model that can be conditioned on arbitrary modalities, enabling a wide variety of expressive multimodal editing capabilities with remarkable flexibility.
Through experimental analyses, we demonstrate the potential of 4M for training versatile and scalable foundation models for vision tasks, setting the stage for further exploration in multimodal learning for vision and other domains.
Submitted 11 December, 2023;
originally announced December 2023.
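Illustrative sketch (not taken from the 4M code): the abstract describes mapping all modalities to discrete tokens and performing masked modeling on a small randomized subset of them. The snippet below shows one simple way such a subset could be drawn, using uniform sampling. The function name sample_masked_subsets, the token ids, and the budgets num_inputs and num_targets are hypothetical; the paper's actual sampling scheme may differ.

```python
import random


def sample_masked_subsets(tokens, num_inputs, num_targets, seed=None):
    """Pick a small visible input subset and a disjoint target subset.

    Uniform sampling is an assumption made for illustration only.
    Returns two lists of (position, token) pairs, each sorted by position.
    """
    if num_inputs + num_targets > len(tokens):
        raise ValueError("token budgets exceed the number of tokens")
    rng = random.Random(seed)
    positions = list(range(len(tokens)))
    rng.shuffle(positions)
    input_pos = sorted(positions[:num_inputs])
    target_pos = sorted(positions[num_inputs:num_inputs + num_targets])
    inputs = [(p, tokens[p]) for p in input_pos]
    targets = [(p, tokens[p]) for p in target_pos]
    return inputs, targets


# Hypothetical token ids for two modalities, tagged with the modality name
# and concatenated into one sequence.
rgb_tokens = [("rgb", t) for t in [12, 7, 99, 3, 41, 8]]
depth_tokens = [("depth", t) for t in [5, 5, 61, 2, 17, 30]]

inputs, targets = sample_masked_subsets(
    rgb_tokens + depth_tokens, num_inputs=4, num_targets=3, seed=0
)
print("encoder sees:", inputs)
print("decoder predicts:", targets)
```

In this sketch the model would only ever process num_inputs + num_targets tokens per sample, regardless of how many modalities are concatenated, which is the property the abstract credits for scalability.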
-
Modality-invariant Visual Odometry for Embodied Vision
Authors:
Marius Memmel,
Roman Bachmann,
Amir Zamir
Abstract:
Effectively localizing an agent in a realistic, noisy setting is crucial for many embodied vision tasks. Visual Odometry (VO) is a practical substitute for unreliable GPS and compass sensors, especially in indoor environments. While SLAM-based methods show solid performance without large data requirements, they are less flexible and robust w.r.t. noise and changes in the sensor suite compared to learning-based approaches. Recent deep VO models, however, limit themselves to a fixed set of input modalities, e.g., RGB and depth, while training on millions of samples. When sensors fail, sensor suites change, or modalities are intentionally looped out due to available resources, e.g., power consumption, the models fail catastrophically. Furthermore, training these models from scratch is even more expensive without simulator access or suitable existing models that can be fine-tuned. While such scenarios get mostly ignored in simulation, they commonly hinder a model's reusability in real-world applications. We propose a Transformer-based modality-invariant VO approach that can deal with diverse or changing sensor suites of navigation agents. Our model outperforms previous methods while training on only a fraction of the data. We hope this method opens the door to a broader range of real-world applications that can benefit from flexible and learned VO models.
Submitted 29 April, 2023;
originally announced May 2023.
-
MultiMAE: Multi-modal Multi-task Masked Autoencoders
Authors:
Roman Bachmann,
David Mizrahi,
Andrei Atanov,
Amir Zamir
Abstract:
We propose a pre-training strategy called Multi-modal Multi-task Masked Autoencoders (MultiMAE). It differs from standard Masked Autoencoding in two key aspects: I) it can optionally accept additional modalities of information in the input besides the RGB image (hence "multi-modal"), and II) its training objective accordingly includes predicting multiple outputs besides the RGB image (hence "multi-task").
We make use of masking (across image patches and input modalities) to make training MultiMAE tractable as well as to ensure cross-modality predictive coding is indeed learned by the network. We show this pre-training strategy leads to a flexible, simple, and efficient framework with improved transfer results to downstream tasks. In particular, the exact same pre-trained network can be flexibly used when additional information besides RGB images is available or when no information other than RGB is available - in all configurations yielding results competitive with or significantly better than the baselines. To avoid needing training datasets with multiple modalities and tasks, we train MultiMAE entirely using pseudo labeling, which makes the framework widely applicable to any RGB dataset.
The experiments are performed on multiple transfer tasks (image classification, semantic segmentation, depth estimation) and datasets (ImageNet, ADE20K, Taskonomy, Hypersim, NYUv2). The results show an intriguingly impressive capability by the model in cross-modal/task predictive coding and transfer.
Submitted 4 April, 2022;
originally announced April 2022.
-
CLIPasso: Semantically-Aware Object Sketching
Authors:
Yael Vinker,
Ehsan Pajouheshgar,
Jessica Y. Bo,
Roman Christian Bachmann,
Amit Haim Bermano,
Daniel Cohen-Or,
Amir Zamir,
Ariel Shamir
Abstract:
Abstraction is at the heart of sketching due to the simple and minimal nature of line drawings. Abstraction entails identifying the essential visual properties of an object or scene, which requires semantic understanding and prior knowledge of high-level concepts. Abstract depictions are therefore challenging for artists, and even more so for machines. We present CLIPasso, an object sketching method that can achieve different levels of abstraction, guided by geometric and semantic simplifications. While sketch generation methods often rely on explicit sketch datasets for training, we utilize the remarkable ability of CLIP (Contrastive-Language-Image-Pretraining) to distill semantic concepts from sketches and images alike. We define a sketch as a set of Bézier curves and use a differentiable rasterizer to optimize the parameters of the curves directly with respect to a CLIP-based perceptual loss. The abstraction degree is controlled by varying the number of strokes. The generated sketches demonstrate multiple levels of abstraction while maintaining recognizability, underlying structure, and essential visual components of the subject drawn.
Submitted 16 May, 2022; v1 submitted 11 February, 2022;
originally announced February 2022.
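Illustrative sketch (not the CLIPasso implementation): the abstract defines a sketch as a set of Bézier curves whose parameters are optimized, with the number of strokes controlling the degree of abstraction. The snippet below only shows how such a parameterization can be evaluated into points. It assumes cubic curves with four 2D control points each; the names cubic_bezier and sample_stroke and the example coordinates are hypothetical. The differentiable rasterizer and the CLIP-based loss are not reproduced here.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    # Bernstein basis weights for the four control points.
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def sample_stroke(control_points, n=16):
    """Sample n evenly spaced parameter values along one stroke (n >= 2)."""
    return [cubic_bezier(*control_points, t=i / (n - 1)) for i in range(n)]


# A sketch is a list of strokes; fewer strokes means a more abstract drawing.
sketch = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)],
    [(0.2, 0.2), (0.4, 0.6), (0.6, 0.6), (0.8, 0.2)],
]
polylines = [sample_stroke(stroke) for stroke in sketch]
midpoint = cubic_bezier(*sketch[0], t=0.5)
print(len(polylines), "strokes; midpoint of first stroke:", midpoint)
```

Because every sampled point is a smooth function of the control points, an optimizer can adjust the control points through a differentiable renderer, which is the role the abstract assigns to the rasterizer.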
-
Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans
Authors:
Ainaz Eftekhar,
Alexander Sax,
Roman Bachmann,
Jitendra Malik,
Amir Zamir
Abstract:
This paper introduces a pipeline to parametrically sample and render multi-task vision datasets from comprehensive 3D scans from the real world. Changing the sampling parameters allows one to "steer" the generated datasets to emphasize specific information. In addition to enabling interesting lines of research, we show the tooling and generated data suffice to train robust vision models.
Common architectures trained on a generated starter dataset reached state-of-the-art performance on multiple common vision tasks and benchmarks, despite having seen no benchmark or non-pipeline data. The depth estimation network outperforms MiDaS and the surface normal estimation network is the first to achieve human-level performance for in-the-wild surface normal estimation -- at least according to one metric on the OASIS benchmark.
The Dockerized pipeline with CLI, the (mostly Python) code, PyTorch dataloaders for the generated data, the generated starter dataset, download scripts and other utilities are available through our project website, https://omnidata.vision.
Submitted 11 October, 2021;
originally announced October 2021.
-
Training Binary Neural Networks using the Bayesian Learning Rule
Authors:
Xiangming Meng,
Roman Bachmann,
Mohammad Emtiyaz Khan
Abstract:
Neural networks with binary weights are computation-efficient and hardware-friendly, but their training is challenging because it involves a discrete optimization problem. Surprisingly, ignoring the discrete nature of the problem and using gradient-based methods, such as the Straight-Through Estimator, still works well in practice. This raises the question: are there principled approaches which justify such methods? In this paper, we propose such an approach using the Bayesian learning rule. The rule, when applied to estimate a Bernoulli distribution over the binary weights, results in an algorithm which justifies some of the algorithmic choices made by the previous approaches. The algorithm not only obtains state-of-the-art performance, but also enables uncertainty estimation for continual learning to avoid catastrophic forgetting. Our work provides a principled approach for training binary neural networks which justifies and extends existing approaches.
Submitted 17 August, 2020; v1 submitted 25 February, 2020;
originally announced February 2020.
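Illustrative sketch (not the paper's algorithm): the abstract contrasts the Straight-Through Estimator with a Bernoulli distribution over binary weights. The snippet below shows both ingredients in their simplest form. It assumes weights in {-1, +1} and a natural parameter lam with P(w = +1) = sigmoid(2 * lam), in which case the mean weight equals tanh(lam). The function names and the learning rate are hypothetical, and the update rule derived in the paper is not reproduced.

```python
import math


def bernoulli_mean(lam):
    """Mean of w in {-1, +1} when P(w = +1) = sigmoid(2 * lam)."""
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * lam))
    return 2.0 * p_plus - 1.0  # algebraically equal to tanh(lam)


def binarize(latent):
    """Forward pass of the straight-through estimator: sign of the latent weight."""
    return 1.0 if latent >= 0 else -1.0


def ste_step(latent, grad_wrt_binary, lr=0.1):
    """Backward pass: apply the gradient w.r.t. the binary weight to the latent
    weight directly, as if the sign function had derivative 1."""
    return latent - lr * grad_wrt_binary


latent = 0.05
w_before = binarize(latent)                   # +1.0
latent = ste_step(latent, grad_wrt_binary=1.0)  # latent moves below zero
w_after = binarize(latent)                    # -1.0
print(w_before, w_after, bernoulli_mean(0.3), math.tanh(0.3))
```

The tanh relationship suggests why a real-valued latent parameter can stand in for a distribution over binary weights, which is the kind of connection the abstract says the Bayesian learning rule makes precise.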
-
Motion Capture from Pan-Tilt Cameras with Unknown Orientation
Authors:
Roman Bachmann,
Jörg Spörri,
Pascal Fua,
Helge Rhodin
Abstract:
In sports, such as alpine skiing, coaches would like to know the speed and various biomechanical variables of their athletes and competitors. Existing methods use either body-worn sensors, which are cumbersome to set up, or manual image annotation, which is time-consuming. We propose a method for estimating an athlete's global 3D position and articulated pose using multiple cameras. In contrast to classical markerless motion capture solutions, we allow cameras to rotate freely so that large capture volumes can be covered. In a first step, tight crops around the skier are predicted and fed to a 2D pose estimator network. The 3D pose is then reconstructed using a bundle adjustment method. Key to our solution is the rotation estimation of Pan-Tilt cameras in a joint optimization with the athlete pose and conditioning on relative background motion computed with feature tracking. Furthermore, we created a new alpine skiing dataset and annotated it with 2D pose labels, to overcome shortcomings of existing ones. Our method estimates accurate global 3D poses from images only and provides coaches with an automatic and fast tool for measuring and improving an athlete's performance.
Submitted 30 August, 2019;
originally announced August 2019.
-
Driving forces for Ag-induced periodic faceting of vicinal Cu(111)
Authors:
A. R. Bachmann,
A. Mugarza,
S. Speller,
J. E. Ortega
Abstract:
Adsorption of submonolayer amounts of Ag on vicinal Cu(111) induces periodic faceting. The equilibrium structure is characterized by Ag-covered facets that alternate with clean Cu stripes. At the atomic scale, the driving force is the matching of Ag(111)-like packed rows with Cu(111) terraces underneath. This determines the preference for the facet orientation and the evolution of different phases as a function of coverage. Both Cu and Ag stripe widths can be varied smoothly in the 3-30 nm range by tuning Ag coverage, allowing theoretical predictions of elastic theories to be tested.
Submitted 2 October, 2002;
originally announced October 2002.