-
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Authors:
Brandon Huang,
Chancharik Mitra,
Assaf Arbelle,
Leonid Karlinsky,
Trevor Darrell,
Roei Herzig
Abstract:
The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)--compact implicit representations of in-context examples compressed in the model's attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.
Submitted 21 June, 2024;
originally announced June 2024.
-
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
Authors:
Irene Huang,
Wei Lin,
M. Jehanzeb Mirza,
Jacob A. Hansen,
Sivan Doveh,
Victor Ion Butoi,
Roei Herzig,
Assaf Arbelle,
Hilde Kuehne,
Trevor Darrell,
Chuang Gan,
Aude Oliva,
Rogerio Feris,
Leonid Karlinsky
Abstract:
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe -- a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce `hard CR Q&A'. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, also subsequently validated manually. Our benchmark provokes a noteworthy, up to 33%, decrease in CR performance compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs.
Submitted 12 June, 2024;
originally announced June 2024.
-
NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning
Authors:
Eli Schwartz,
Leshem Choshen,
Joseph Shtok,
Sivan Doveh,
Leonid Karlinsky,
Assaf Arbelle
Abstract:
Language models struggle with handling numerical data and performing arithmetic operations. We hypothesize that this limitation can be partially attributed to non-intuitive textual numbers representation. When a digit is read or generated by a causal language model it does not know its place value (e.g. thousands vs. hundreds) until the entire number is processed. To address this issue, we propose a simple adjustment to how numbers are represented by including the count of digits before each number. For instance, instead of "42", we suggest using "{2:42}" as the new format. This approach, which we term NumeroLogic, offers an added advantage in number generation by serving as a Chain of Thought (CoT). By requiring the model to consider the number of digits first, it enhances the reasoning process before generating the actual number. We use arithmetic tasks to demonstrate the effectiveness of the NumeroLogic formatting. We further demonstrate NumeroLogic applicability to general natural language modeling, improving language understanding performance in the MMLU benchmark.
Submitted 30 March, 2024;
originally announced April 2024.
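The number format described in this abstract can be illustrated with a minimal sketch. The function names are hypothetical, and the sketch treats every run of digits as a separate number (so "3.14" is encoded as two runs); how the paper handles decimals or signs is not stated in the abstract.

```python
import re

# Hypothetical helper names; only the "{2:42}" format comes from the abstract.
_DIGITS = re.compile(r"\d+")
_ENCODED = re.compile(r"\{(\d+):(\d+)\}")


def numerologic_encode(text: str) -> str:
    """Prefix each run of digits with its digit count: "42" -> "{2:42}"."""
    def wrap(match: re.Match) -> str:
        digits = match.group()
        return "{" + str(len(digits)) + ":" + digits + "}"
    return _DIGITS.sub(wrap, text)


def numerologic_decode(text: str) -> str:
    """Strip the digit-count prefix again: "{2:42}" -> "42"."""
    return _ENCODED.sub(lambda m: m.group(2), text)
```

Encoding should be applied to raw text only, since running it twice would also wrap the digit counts themselves.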
-
Towards Multimodal In-Context Learning for Vision & Language Models
Authors:
Sivan Doveh,
Shaked Perek,
M. Jehanzeb Mirza,
Amit Alfassy,
Assaf Arbelle,
Shimon Ullman,
Leonid Karlinsky
Abstract:
Inspired by the emergence of Large Language Models (LLMs) that can truly understand human language, significant progress has been made in aligning other, non-language, modalities to be `understandable' by an LLM, primarily via converting their samples into a sequence of embedded language-like tokens directly fed into the LLM (decoder) input stream. However, so far limited attention has been given to transferring (and evaluating) one of the core LLM capabilities to the emerging VLMs, namely the In-Context Learning (ICL) ability, or in other words to guide VLMs to desired target downstream tasks or output structure using in-context image+text demonstrations. In this work, we dive deeper into analyzing the capabilities of some of the state-of-the-art VLMs to follow ICL instructions, discovering them to be somewhat lacking. We discover that even models that underwent large-scale mixed modality pre-training and were implicitly guided to make use of interleaved image and text information (intended to consume helpful context from multiple images) under-perform when prompted with few-shot (ICL) demonstrations, likely due to their lack of `direct' ICL instruction tuning. To test this conjecture, we propose a simple, yet surprisingly effective, strategy of extending a common VLM alignment framework with ICL support, methodology, and curriculum. We explore, analyze, and provide insights into effective data mixes, leading up to a significant 21.03% (and 11.3% on average) ICL performance boost over the strongest VLM baselines and a variety of ICL benchmarks. We also contribute new benchmarks for ICL evaluation in VLMs and discuss their advantages over the prior art.
Submitted 19 March, 2024;
originally announced March 2024.
-
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Authors:
Sivan Doveh,
Assaf Arbelle,
Sivan Harary,
Roei Herzig,
Donghyun Kim,
Paola Cascante-Bonilla,
Amit Alfassy,
Rameswar Panda,
Raja Giryes,
Rogerio Feris,
Shimon Ullman,
Leonid Karlinsky
Abstract:
Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text spaces learned by all the popular VL models are still suffering from the so-called `object bias' - their representations behave as `bags of nouns', mostly ignoring or downsizing the attributes, relations, and states of objects described/appearing in texts/images. Although some great attempts at fixing these `compositional reasoning' issues were proposed in the recent literature, the problem is still far from being solved. In this paper, we uncover two factors limiting the VL models' compositional reasoning performance. These two factors are properties of the paired VL dataset used for finetuning and pre-training the VL model: (i) the caption quality, or in other words `image-alignment', of the texts; and (ii) the `density' of the captions in the sense of mentioning all the details appearing on the image. We propose a fine-tuning approach for automatically treating these factors leveraging a standard VL dataset (CC3M). Applied to CLIP, we demonstrate its significant compositional reasoning performance increase of up to $\sim27\%$ over the base model, up to $\sim20\%$ over the strongest baseline, and by $6.7\%$ on average.
Submitted 1 June, 2023; v1 submitted 31 May, 2023;
originally announced May 2023.
-
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
Authors:
Roei Herzig,
Alon Mendelson,
Leonid Karlinsky,
Assaf Arbelle,
Rogerio Feris,
Trevor Darrell,
Amir Globerson
Abstract:
Vision and language models (VLMs) have demonstrated remarkable zero-shot (ZS) performance in a variety of tasks. However, recent works have shown that even the best VLMs struggle to capture aspects of compositional scene understanding, such as object attributes, relations, and action states. In contrast, obtaining structured annotations, such as scene graphs (SGs), that could improve these models is time-consuming and costly, and thus cannot be used on a large scale. Here we ask whether small SG datasets can provide sufficient information for enhancing structured understanding of pretrained VLMs. We show that it is indeed possible to improve VLMs when learning from SGs by integrating components that incorporate structured information into both visual and textual representations. For the visual side, we incorporate a special "SG Component" in the image transformer trained to predict SG information, while for the textual side, we utilize SGs to generate fine-grained captions that highlight different compositional aspects of the scene. Our method improves the performance of several popular VLMs on multiple VL datasets with only a mild degradation in ZS capabilities.
Submitted 24 October, 2023; v1 submitted 10 May, 2023;
originally announced May 2023.
-
PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data
Authors:
Roei Herzig,
Ofir Abramovich,
Elad Ben-Avraham,
Assaf Arbelle,
Leonid Karlinsky,
Ariel Shamir,
Trevor Darrell,
Amir Globerson
Abstract:
Action recognition models have achieved impressive results by incorporating scene-level annotations, such as objects, their relations, 3D structure, and more. However, obtaining annotations of scene structure for videos requires a significant amount of effort to gather and annotate, making these methods expensive to train. In contrast, synthetic datasets generated by graphics engines provide powerful alternatives for generating scene-level annotations across multiple tasks. In this work, we propose an approach to leverage synthetic scene data for improving video understanding. We present a multi-task prompt learning approach for video transformers, where a shared video transformer backbone is enhanced by a small set of specialized parameters for each task. Specifically, we add a set of "task prompts", each corresponding to a different task, and let each prompt predict task-related annotations. This design allows the model to capture information shared among synthetic scene tasks as well as information shared between synthetic scene tasks and a real video downstream task throughout the entire network. We refer to this approach as "Promptonomy", since the prompts model task-related structure. We propose the PromptonomyViT model (PViT), a video transformer that incorporates various types of scene-level information from synthetic data using the "Promptonomy" approach. PViT shows strong performance improvements on multiple video understanding tasks and datasets. Project page: \url{https://ofir1080.github.io/PromptonomyViT}
Submitted 5 December, 2023; v1 submitted 8 December, 2022;
originally announced December 2022.
-
MAEDAY: MAE for few and zero shot AnomalY-Detection
Authors:
Eli Schwartz,
Assaf Arbelle,
Leonid Karlinsky,
Sivan Harary,
Florian Scheidegger,
Sivan Doveh,
Raja Giryes
Abstract:
We propose using Masked Auto-Encoder (MAE), a transformer model trained in a self-supervised manner on image inpainting, for anomaly detection (AD). The method assumes that anomalous regions are harder to reconstruct than normal regions. MAEDAY is the first image-reconstruction-based anomaly detection method that utilizes a pre-trained model, enabling its use for Few-Shot Anomaly Detection (FSAD). We also show the same method works surprisingly well for the novel tasks of Zero-Shot AD (ZSAD) and Zero-Shot Foreign Object Detection (ZSFOD), where no normal samples are available. Code is available at https://github.com/EliSchwartz/MAEDAY .
Submitted 15 February, 2024; v1 submitted 25 November, 2022;
originally announced November 2022.
-
CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning
Authors:
James Seale Smith,
Leonid Karlinsky,
Vyshnavi Gutta,
Paola Cascante-Bonilla,
Donghyun Kim,
Assaf Arbelle,
Rameswar Panda,
Rogerio Feris,
Zsolt Kira
Abstract:
Computer vision models suffer from a phenomenon known as catastrophic forgetting when learning novel concepts from continuously shifting training data. Typical solutions for this continual learning problem require extensive rehearsal of previously seen data, which increases memory costs and may violate data privacy. Recently, the emergence of large-scale pre-trained vision transformer models has enabled prompting approaches as an alternative to data-rehearsal. These approaches rely on a key-query mechanism to generate prompts and have been found to be highly resistant to catastrophic forgetting in the well-established rehearsal-free continual learning setting. However, the key mechanism of these methods is not trained end-to-end with the task sequence. Our experiments show that this leads to a reduction in their plasticity, hence sacrificing new task accuracy, and inability to benefit from expanded parameter capacity. We instead propose to learn a set of prompt components which are assembled with input-conditioned weights to produce input-conditioned prompts, resulting in a novel attention-based end-to-end key-query scheme. Our experiments show that we outperform the current SOTA method DualPrompt on established benchmarks by as much as 4.5% in average final accuracy. We also outperform the state of art by as much as 4.4% accuracy on a continual learning benchmark which contains both class-incremental and domain-incremental task shifts, corresponding to many practical settings. Our code is available at https://github.com/GT-RIPL/CODA-Prompt
Submitted 30 March, 2023; v1 submitted 23 November, 2022;
originally announced November 2022.
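The idea of assembling prompt components with input-conditioned weights can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the weights are taken to be the cosine similarity between an input query feature and one key per component, and all names and shapes are hypothetical. In the real method the keys and components would be learned end-to-end.

```python
import numpy as np


def assemble_prompt(query, keys, components, eps=1e-8):
    """Build one prompt as a weighted sum of prompt components.

    query:      (d,)      feature vector of the current input
    keys:       (m, d)    one key per prompt component
    components: (m, l, p) m components, each with l tokens of width p
    Weighting by cosine similarity is an assumption of this sketch.
    """
    q = query / (np.linalg.norm(query) + eps)
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + eps)
    weights = k @ q  # (m,) input-conditioned weights
    # Contract the component axis: (m,) x (m, l, p) -> (l, p)
    prompt = np.tensordot(weights, components, axes=1)
    return prompt, weights
```

Because the prompt is a smooth function of the keys and components, gradients from the task loss can reach both, which is what makes an end-to-end scheme of this kind trainable.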
-
Teaching Structured Vision&Language Concepts to Vision&Language Models
Authors:
Sivan Doveh,
Assaf Arbelle,
Sivan Harary,
Rameswar Panda,
Roei Herzig,
Eli Schwartz,
Donghyun Kim,
Raja Giryes,
Rogerio Feris,
Shimon Ullman,
Leonid Karlinsky
Abstract:
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision&Language Concepts (SVLC) which includes object attributes, relations, and states which are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A possible way of fixing this issue is by collecting dedicated datasets for teaching each SVLC type, yet this might be expensive and time-consuming. Instead, we propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs that makes more effective use of existing VL pre-training datasets and does not require any additional data. While automatic understanding of image structure still remains largely unsolved, language structure is much better modeled and understood, allowing for its effective utilization in teaching VL models. In this paper, we propose various techniques based on language structure understanding that can be used to manipulate the textual part of off-the-shelf paired VL datasets. VL models trained with the updated data exhibit a significant improvement of up to 15% in their SVLC understanding with only a mild degradation in their zero-shot capabilities both when training from scratch or fine-tuning a pre-trained model.
Submitted 30 May, 2023; v1 submitted 21 November, 2022;
originally announced November 2022.
-
ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
Authors:
James Seale Smith,
Paola Cascante-Bonilla,
Assaf Arbelle,
Donghyun Kim,
Rameswar Panda,
David Cox,
Diyi Yang,
Zsolt Kira,
Rogerio Feris,
Leonid Karlinsky
Abstract:
Recently, large-scale pre-trained Vision-and-Language (VL) foundation models have demonstrated remarkable capabilities in many zero-shot downstream tasks, achieving competitive results for recognizing objects defined by as little as short text prompts. However, it has also been shown that VL models are still brittle in Structured VL Concept (SVLC) reasoning, such as the ability to recognize object attributes, states, and inter-object relations. This leads to reasoning mistakes, which need to be corrected as they occur by teaching VL models the missing SVLC skills; often this must be done using private data where the issue was found, which naturally leads to a data-free continual (no task-id) VL learning setting. In this work, we introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark and show it is challenging for many existing data-free CL strategies. We, therefore, propose a data-free method comprised of a new approach of Adversarial Pseudo-Replay (APR) which generates adversarial reminders of past tasks from past task models. To use this method efficiently, we also propose a continual parameter-efficient Layered-LoRA (LaLo) neural architecture allowing no-memory-cost access to all past models at train time. We show this approach outperforms all data-free methods by as much as ~7% while even matching some levels of experience-replay (prohibitive for applications where data-privacy must be preserved). Our code is publicly available at https://github.com/jamessealesmith/ConStruct-VL
Submitted 30 March, 2023; v1 submitted 17 November, 2022;
originally announced November 2022.
-
FETA: Towards Specializing Foundation Models for Expert Task Applications
Authors:
Amit Alfassy,
Assaf Arbelle,
Oshri Halimi,
Sivan Harary,
Roei Herzig,
Eli Schwartz,
Rameswar Panda,
Michele Dolfi,
Christoph Auer,
Kate Saenko,
Peter W. J. Staar,
Rogerio Feris,
Leonid Karlinsky
Abstract:
Foundation Models (FMs) have demonstrated unprecedented capabilities including zero-shot learning, high fidelity data synthesis, and out of domain generalization. However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g. retrieval of car manuals technical illustrations from language queries), data for which is either unseen or belonging to a long-tail part of the data distribution of the huge datasets used for FM pre-training. This underlines the necessity to explicitly evaluate and finetune FMs on such expert tasks, arguably ones that appear the most in practical real-world applications. In this paper, we propose a first of its kind FETA benchmark built around the task of teaching FMs to understand technical documentation, via learning to match their graphical illustrations to corresponding language descriptions. Our FETA benchmark focuses on text-to-image and image-to-text retrieval in public car manuals and sales catalogue brochures. FETA is equipped with a procedure for completely automatic annotation extraction (code would be released upon acceptance), allowing easy extension of FETA to more documentation types and application domains in the future. Our automatic annotation leads to an automated performance metric shown to be consistent with metrics computed on human-curated annotations (also released). We provide multiple baselines and analysis of popular FMs on FETA leading to several interesting findings that we believe would be very valuable to the FM community, paving the way towards real-world application of FMs for practical expert tasks currently 'overlooked' by standard benchmarks focusing on common objects.
Submitted 19 December, 2022; v1 submitted 8 September, 2022;
originally announced September 2022.
-
Unsupervised Domain Generalization by Learning a Bridge Across Domains
Authors:
Sivan Harary,
Eli Schwartz,
Assaf Arbelle,
Peter Staar,
Shady Abu-Hussein,
Elad Amrani,
Roei Herzig,
Amit Alfassy,
Raja Giryes,
Hilde Kuehne,
Dina Katabi,
Kate Saenko,
Rogerio Feris,
Leonid Karlinsky
Abstract:
The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system. In this paper, different from most cross-domain works that utilize some (or full) source domain supervision, we approach a relatively new and very practical Unsupervised Domain Generalization (UDG) setup of having no training supervision in either the source or the target domains. Our approach is based on self-supervised learning of a Bridge Across Domains (BrAD) - an auxiliary bridge domain accompanied by a set of semantics-preserving visual (image-to-image) mappings to BrAD from each of the training domains. The BrAD and mappings to it are learned jointly (end-to-end) with a contrastive self-supervised representation model that semantically aligns each of the domains to its BrAD-projection, and hence implicitly drives all the domains (seen or unseen) to semantically align to each other. In this work, we show how, using an edge-regularized BrAD, our approach achieves significant gains across multiple benchmarks and a range of tasks, including UDG, Few-shot UDA, and unsupervised generalization across multi-domain datasets (including generalization to unseen domains and classes).
Submitted 17 May, 2022; v1 submitted 4 December, 2021;
originally announced December 2021.
-
CHARTER: heatmap-based multi-type chart data extraction
Authors:
Joseph Shtok,
Sivan Harary,
Ophir Azulai,
Adi Raz Goldfarb,
Assaf Arbelle,
Leonid Karlinsky
Abstract:
The digital conversion of information stored in documents is a great source of knowledge. In contrast to the documents' text, the conversion of embedded document graphics, such as charts and plots, has been much less explored. We present a method and a system for end-to-end conversion of document charts into a machine-readable tabular data format, which can be easily stored and analyzed in the digital domain. Our approach extracts and analyses charts along with their graphical elements and supporting structures such as legends, axes, titles, and captions. Our detection system is based on neural networks, trained solely on synthetic data, eliminating the limiting factor of data collection. As opposed to previous methods, which detect graphical elements using bounding-boxes, our networks feature auxiliary domain-specific heatmap prediction, enabling the precise detection of pie charts, line and scatter plots which do not fit the rectangular bounding-box presumption. Qualitative and quantitative results show high robustness and precision, improving upon previous works on popular benchmarks.
Submitted 28 November, 2021;
originally announced November 2021.
-
Detector-Free Weakly Supervised Grounding by Separation
Authors:
Assaf Arbelle,
Sivan Doveh,
Amit Alfassy,
Joseph Shtok,
Guy Lev,
Eli Schwartz,
Hilde Kuehne,
Hila Barak Levi,
Prasanna Sattigeri,
Rameswar Panda,
Chun-Fu Chen,
Alex Bronstein,
Kate Saenko,
Shimon Ullman,
Raja Giryes,
Rogerio Feris,
Leonid Karlinsky
Abstract:
Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object detector, relying on it to produce the ROIs for localization. In this work, we focus on the task of Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector. We directly learn everything from the images and associated free-form text pairs, thus potentially gaining an advantage on the categories unsupported by the detector. The key idea behind our proposed Grounding by Separation (GbS) method is synthesizing `text to image-regions' associations by random alpha-blending of arbitrary image pairs and using the corresponding texts of the pair as conditions to recover the alpha map from the blended image via a segmentation network. At test time, this allows using the query phrase as a condition for a non-blended query image, thus interpreting the test image as a composition of a region corresponding to the phrase and the complement region. Using this approach we demonstrate a significant accuracy improvement, of up to $8.5\%$ over previous DF-WSG SotA, for a range of benchmarks including Flickr30K, Visual Genome, and ReferIt, as well as a significant complementary improvement (above $7\%$) over the detector-based approaches for WSG.
Submitted 20 April, 2021;
originally announced April 2021.
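The data synthesis step described in this abstract, random alpha-blending of an image pair, can be sketched as below. The abstract does not say how the alpha map is sampled, so the blocky random map here is purely an assumption, as are the function names; the segmentation network that recovers the alpha map from the blend is not shown.

```python
import numpy as np


def random_alpha_map(height, width, cells, rng):
    """Hypothetical alpha map: a coarse random grid upsampled to image size.

    Requires height and width to be divisible by `cells`.
    """
    coarse = rng.random((cells, cells))  # values in [0, 1)
    block = np.ones((height // cells, width // cells))
    return np.kron(coarse, block)  # (height, width)


def alpha_blend(image_a, image_b, alpha):
    """Per-pixel blend: alpha * A + (1 - alpha) * B for (H, W, C) images."""
    a = alpha[..., None]  # broadcast over the channel axis
    return a * image_a + (1.0 - a) * image_b
```

In a training setup of this kind, the blended image would be the network input, the text of one image the condition, and `alpha` (or its complement for the other image's text) the target to recover.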
-
DeepHist: Differentiable Joint and Color Histogram Layers for Image-to-Image Translation
Authors:
Mor Avi-Aharon,
Assaf Arbelle,
Tammy Riklin Raviv
Abstract:
We present DeepHist - a novel Deep Learning framework for augmenting a network with histogram layers - and demonstrate its strength by addressing image-to-image translation problems. Specifically, given an input image and a reference color distribution, we aim to generate an output image with the structural appearance (content) of the input (source) yet with the colors of the reference. The key idea is a new technique for a differentiable construction of joint and color histograms of the output images. We further define a color distribution loss based on the Earth Mover's Distance between the output's and the reference's color histograms and a Mutual Information loss based on the joint histograms of the source and the output images. Promising results are shown for the tasks of color transfer, image colorization and edges $\rightarrow$ photo, where the color distribution of the output image is controlled. Comparisons to Pix2Pix and CycleGAN are shown.
Submitted 6 May, 2020;
originally announced May 2020.
-
Hue-Net: Intensity-based Image-to-Image Translation with Differentiable Histogram Loss Functions
Authors:
Mor Avi-Aharon,
Assaf Arbelle,
Tammy Riklin Raviv
Abstract:
We present the Hue-Net - a novel Deep Learning framework for Intensity-based Image-to-Image Translation. The key idea is a new technique termed network augmentation which allows a differentiable construction of intensity histograms from images. We further introduce differentiable representations of (1D) cyclic and joint (2D) histograms and use them for defining loss functions based on cyclic Earth Mover's Distance (EMD) and Mutual Information (MI). While the Hue-Net can be applied to several image-to-image translation tasks, we choose to demonstrate its strength on color transfer problems, where the aim is to paint a source image with the colors of a different target image. Note that the desired output image does not exist and therefore cannot be used for supervised pixel-to-pixel learning. This is accomplished by using the HSV color-space and defining an intensity-based loss that is built on the EMD between the cyclic hue histograms of the output and the target images. To enforce color-free similarity between the source and the output images, we define a semantic-based loss by a differentiable approximation of the MI of these images. The incorporation of histogram loss functions in addition to an adversarial loss enables the construction of semantically meaningful and realistic images. Promising results are presented for different datasets.
Submitted 12 December, 2019;
originally announced December 2019.
-
QANet -- Quality Assurance Network for Image Segmentation
Authors:
Assaf Arbelle,
Eliav Elul,
Tammy Riklin Raviv
Abstract:
We introduce a novel Deep Learning framework, which quantitatively estimates image segmentation quality without the need for human inspection or labeling. We refer to this method as a Quality Assurance Network -- QANet. Specifically, given an image and a `proposed' corresponding segmentation, obtained by any method including manual annotation, the QANet solves a regression problem in order to estimate a predefined quality measure with respect to the unknown ground truth. The QANet is by no means yet another segmentation method. Instead, it performs a multi-level, multi-feature comparison of an image-segmentation pair based on a unique network architecture, called the RibCage.
To demonstrate the strength of the QANet, we addressed the evaluation of instance segmentation using two different datasets from different domains, namely, high throughput live cell microscopy images from the Cell Segmentation Benchmark and natural images of plants from the Leaf Segmentation Challenge. While synthesized segmentations were used to train the QANet, it was tested on segmentations obtained by publicly available methods that participated in the different challenges. We show that the QANet accurately estimates the scores of the evaluated segmentations with respect to the hidden ground truth, as published by the challenges' organizers.
The code is available at: TBD.
Submitted 5 November, 2019; v1 submitted 9 April, 2019;
originally announced April 2019.
-
Microscopy Cell Segmentation via Convolutional LSTM Networks
Authors:
Assaf Arbelle,
Tammy Riklin Raviv
Abstract:
Live cell microscopy sequences exhibit complex spatial structures and complicated temporal behaviour, making their analysis a challenging task. For the cell segmentation problem, which plays a significant role in the analysis, the spatial properties of the data can be captured using Convolutional Neural Networks (CNNs). Recent approaches show promising segmentation results using convolutional encoder-decoders such as the U-Net. Nevertheless, these methods are limited by their inability to incorporate temporal information, which can facilitate segmentation of individual touching cells or of cells that are partially visible. In order to exploit cell dynamics, we propose a novel segmentation architecture which integrates Convolutional Long Short Term Memory (C-LSTM) with the U-Net. The network's unique architecture allows it to capture multi-scale, compact, spatio-temporal encoding in the C-LSTM memory units. The method was evaluated on the Cell Tracking Challenge and achieved state-of-the-art results (1st on the Fluo-N2DH-SIM+ and 2nd on the DIC-C2DL-HeLa datasets). The code is freely available at: https://github.com/arbellea/LSTM-UNet.git
Submitted 6 January, 2019; v1 submitted 29 May, 2018;
originally announced May 2018.
-
Microscopy Cell Segmentation via Adversarial Neural Networks
Authors:
Assaf Arbelle,
Tammy Riklin Raviv
Abstract:
We present a novel method for cell segmentation in microscopy images which is inspired by the Generative Adversarial Neural Network (GAN) approach. Our framework is built on a pair of competing artificial neural networks with a unique architecture, termed Rib Cage, which are trained simultaneously and together define a min-max game resulting in an accurate segmentation of a given image. Our approach has two main strengths: similar to the GAN, the method does not require a formulation of a loss function for the optimization process, and this in turn allows training on a limited amount of annotated data in a weakly supervised manner. Promising segmentation results on real fluorescent microscopy data are presented. The code is freely available at: https://github.com/arbellea/DeepCellSeg.git
Submitted 13 September, 2018; v1 submitted 18 September, 2017;
originally announced September 2017.