Search | arXiv e-print repository

Towards Multimodal In-Context Learning for Vision & Language Models

Authors: Sivan Doveh, Shaked Perek, M. Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, Leonid Karlinsky

Abstract: State-of-the-art Vision-Language Models (VLMs) ground the vision and the language modality primarily via projecting the vision tokens from the encoder to language-like tokens, which are directly fed to the Large Language Model (LLM) decoder. While these models have shown unprecedented performance in many downstream zero-shot tasks (eg image captioning, question answers, etc), still little emphasis… ▽ More State-of-the-art Vision-Language Models (VLMs) ground the vision and the language modality primarily via projecting the vision tokens from the encoder to language-like tokens, which are directly fed to the Large Language Model (LLM) decoder. While these models have shown unprecedented performance in many downstream zero-shot tasks (eg image captioning, question answers, etc), still little emphasis has been put on transferring one of the core LLM capability of In-Context Learning (ICL). ICL is the ability of a model to reason about a downstream task with a few examples demonstrations embedded in the prompt. In this work, through extensive evaluations, we find that the state-of-the-art VLMs somewhat lack the ability to follow ICL instructions. In particular, we discover that even models that underwent large-scale mixed modality pre-training and were implicitly guided to make use of interleaved image and text information (intended to consume helpful context from multiple images) under-perform when prompted with few-shot demonstrations (in an ICL way), likely due to their lack of direct ICL instruction tuning. To enhance the ICL abilities of the present VLM, we propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes, leading up to a significant 21.03% (and 11.3% on average) ICL performance boost over the strongest VLM baselines and a variety of ICL benchmarks. Furthermore, we also contribute new benchmarks for ICL evaluation in VLMs and discuss their advantages over the prior art. △ Less

Submitted 17 July, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

arXiv:2311.15276 [pdf, other]

Efficient Rehearsal Free Zero Forgetting Continual Learning using Adaptive Weight Modulation

Authors: Yonatan Sverdlov, Shimon Ullman

Abstract: Artificial neural networks encounter a notable challenge known as continual learning, which involves acquiring knowledge of multiple tasks over an extended period. This challenge arises due to the tendency of previously learned weights to be adjusted to suit the objectives of new tasks, resulting in a phenomenon called catastrophic forgetting. Most approaches to this problem seek a balance between… ▽ More Artificial neural networks encounter a notable challenge known as continual learning, which involves acquiring knowledge of multiple tasks over an extended period. This challenge arises due to the tendency of previously learned weights to be adjusted to suit the objectives of new tasks, resulting in a phenomenon called catastrophic forgetting. Most approaches to this problem seek a balance between maximizing performance on the new tasks and minimizing the forgetting of previous tasks. In contrast, our approach attempts to maximize the performance of the new task, while ensuring zero forgetting. This is accomplished by creating a task-specific modulation parameters for each task. Only these would be learnable parameters during learning of consecutive tasks. Through comprehensive experimental evaluations, our model demonstrates superior performance in acquiring and retaining novel tasks that pose difficulties for other multi-task models. This emphasizes the efficacy of our approach in preventing catastrophic forgetting while accommodating the acquisition of new tasks △ Less

Submitted 26 November, 2023; originally announced November 2023.

arXiv:2306.02415 [pdf, other]

Biologically-Motivated Learning Model for Instructed Visual Processing

Authors: Roy Abel, Shimon Ullman

Abstract: As part of understanding how the brain learns, ongoing work seeks to combine biological knowledge and current artificial intelligence (AI) modeling in an attempt to find an efficient biologically plausible learning scheme. Current models of biologically plausible learning often use a cortical-like combination of bottom-up (BU) and top-down (TD) processing, where the TD part carries feedback signal… ▽ More As part of understanding how the brain learns, ongoing work seeks to combine biological knowledge and current artificial intelligence (AI) modeling in an attempt to find an efficient biologically plausible learning scheme. Current models of biologically plausible learning often use a cortical-like combination of bottom-up (BU) and top-down (TD) processing, where the TD part carries feedback signals used for learning. However, in the visual cortex, the TD pathway plays a second major role of visual attention, by guiding the visual process to locations and tasks of interest. A biological model should therefore combine the two tasks, and learn to guide the visual process. We introduce a model that uses a cortical-like combination of BU and TD processing that naturally integrates the two major functions of the TD stream. The integrated model is obtained by an appropriate connectivity pattern between the BU and TD streams, a novel processing cycle that uses the TD part twice, and the use of 'Counter-Hebb' learning that operates across the streams. We show that the 'Counter-Hebb' mechanism can provide an exact backpropagation synaptic modification. We further demonstrate the model's ability to guide the visual stream to perform a task of interest, achieving competitive performance compared with AI models on standard multi-task learning benchmarks. The successful combination of learning and visual guidance could provide a new view on combining BU and TD processing in human vision, and suggests possible directions for both biologically plausible models and artificial instructed models, such as vision-language models (VLMs). △ Less

Submitted 16 June, 2024; v1 submitted 4 June, 2023; originally announced June 2023.

arXiv:2305.19595 [pdf, other]

Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

Authors: Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

Abstract: Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text spaces learned by all the popular VL models are still suffering from the so-called `object bias' - their representations behave as `bags of no… ▽ More Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text spaces learned by all the popular VL models are still suffering from the so-called `object bias' - their representations behave as `bags of nouns', mostly ignoring or downsizing the attributes, relations, and states of objects described/appearing in texts/images. Although some great attempts at fixing these `compositional reasoning' issues were proposed in the recent literature, the problem is still far from being solved. In this paper, we uncover two factors limiting the VL models' compositional reasoning performance. These two factors are properties of the paired VL dataset used for finetuning and pre-training the VL model: (i) the caption quality, or in other words `image-alignment', of the texts; and (ii) the `density' of the captions in the sense of mentioning all the details appearing on the image. We propose a fine-tuning approach for automatically treating these factors leveraging a standard VL dataset (CC3M). Applied to CLIP, we demonstrate its significant compositional reasoning performance increase of up to $\sim27\%$ over the base model, up to $\sim20\%$ over the strongest baseline, and by $6.7\%$ on average. △ Less

Submitted 1 June, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

arXiv:2211.11733 [pdf, other]

Teaching Structured Vision&Language Concepts to Vision&Language Models

Authors: Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

Abstract: Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision&Language Concepts (SVLC) which includes object attributes, relations, and states which are present in the text and visible in the image. Recent studies have… ▽ More Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision&Language Concepts (SVLC) which includes object attributes, relations, and states which are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A possible way of fixing this issue is by collecting dedicated datasets for teaching each SVLC type, yet this might be expensive and time-consuming. Instead, we propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs that makes more effective use of existing VL pre-training datasets and does not require any additional data. While automatic understanding of image structure still remains largely unsolved, language structure is much better modeled and understood, allowing for its effective utilization in teaching VL models. In this paper, we propose various techniques based on language structure understanding that can be used to manipulate the textual part of off-the-shelf paired VL datasets. VL models trained with the updated data exhibit a significant improvement of up to 15% in their SVLC understanding with only a mild degradation in their zero-shot capabilities both when training from scratch or fine-tuning a pre-trained model. △ Less

Submitted 30 May, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

Journal ref: CVPR 2023

arXiv:2110.08744 [pdf]

A model for full local image interpretation

Authors: Guy Ben-Yosef, Liav Assif, Daniel Harari, Shimon Ullman

Abstract: We describe a computational model of humans' ability to provide a detailed interpretation of components in a scene. Humans can identify in an image meaningful components almost everywhere, and identifying these components is an essential part of the visual process, and of understanding the surrounding scene and its potential meaning to the viewer. Detailed interpretation is beyond the scope of cur… ▽ More We describe a computational model of humans' ability to provide a detailed interpretation of components in a scene. Humans can identify in an image meaningful components almost everywhere, and identifying these components is an essential part of the visual process, and of understanding the surrounding scene and its potential meaning to the viewer. Detailed interpretation is beyond the scope of current models of visual recognition. Our model suggests that this is a fundamental limitation, related to the fact that existing models rely on feed-forward but limited top-down processing. In our model, a first recognition stage leads to the initial activation of class candidates, which is incomplete and with limited accuracy. This stage then triggers the application of class-specific interpretation and validation processes, which recover richer and more accurate interpretation of the visible scene. We discuss implications of the model for visual interpretation by humans and by computer vision models. △ Less

Submitted 17 October, 2021; originally announced October 2021.

Comments: Published in the Proceedings of the 37th Annual Meeting of the Cognitive Science Society (CogSci), 2015

Journal ref: https://cogsci.mindmodeling.org/2015/papers/0048/

arXiv:2108.01696 [pdf]

Linking Common Vulnerabilities and Exposures to the MITRE ATT&CK Framework: A Self-Distillation Approach

Authors: Benjamin Ampel, Sagar Samtani, Steven Ullman, Hsinchun Chen

Abstract: Due to the ever-increasing threat of cyber-attacks to critical cyber infrastructure, organizations are focusing on building their cybersecurity knowledge base. A salient list of cybersecurity knowledge is the Common Vulnerabilities and Exposures (CVE) list, which details vulnerabilities found in a wide range of software and hardware. However, these vulnerabilities often do not have a mitigation st… ▽ More Due to the ever-increasing threat of cyber-attacks to critical cyber infrastructure, organizations are focusing on building their cybersecurity knowledge base. A salient list of cybersecurity knowledge is the Common Vulnerabilities and Exposures (CVE) list, which details vulnerabilities found in a wide range of software and hardware. However, these vulnerabilities often do not have a mitigation strategy to prevent an attacker from exploiting them. A well-known cybersecurity risk management framework, MITRE ATT&CK, offers mitigation techniques for many malicious tactics. Despite the tremendous benefits that both CVEs and the ATT&CK framework can provide for key cybersecurity stakeholders (e.g., analysts, educators, and managers), the two entities are currently separate. We propose a model, named the CVE Transformer (CVET), to label CVEs with one of ten MITRE ATT&CK tactics. The CVET model contains a fine-tuning and self-knowledge distillation design applied to the state-of-the-art pre-trained language model RoBERTa. Empirical results on a gold-standard dataset suggest that our proposed novelties can increase model performance in F1-score. The results of this research can allow cybersecurity stakeholders to add preliminary MITRE ATT&CK information to their collected CVEs. △ Less

Submitted 3 August, 2021; originally announced August 2021.

Comments: Proceedings of 2021 ACM Conference Knowledge Discovery and Data Mining (KDD' 21) Workshop on AI-enabled Cybersecurity Analytics

arXiv:2105.05592 [pdf]

Image interpretation by iterative bottom-up top-down processing

Authors: Shimon Ullman, Liav Assif, Alona Strugatski, Ben-Zion Vatashsky, Hila Levy, Aviv Netanyahu, Adam Yaari

Abstract: Scene understanding requires the extraction and representation of scene components together with their properties and inter-relations. We describe a model in which meaningful scene structures are extracted from the image by an iterative process, combining bottom-up (BU) and top-down (TD) networks, interacting through a symmetric bi-directional communication between them (counter-streams structure)… ▽ More Scene understanding requires the extraction and representation of scene components together with their properties and inter-relations. We describe a model in which meaningful scene structures are extracted from the image by an iterative process, combining bottom-up (BU) and top-down (TD) networks, interacting through a symmetric bi-directional communication between them (counter-streams structure). The model constructs a scene representation by the iterative use of three components. The first model component is a BU stream that extracts selected scene elements, properties and relations. The second component (cognitive augmentation) augments the extracted visual representation based on relevant non-visual stored representations. It also provides input to the third component, the TD stream, in the form of a TD instruction, instructing the model what task to perform next. The TD stream then guides the BU visual stream to perform the selected task in the next cycle. During this process, the visual representations extracted from the image can be combined with relevant non-visual representations, so that the final scene representation is based on both visual information extracted from the scene and relevant stored knowledge of the world. We describe how a sequence of TD-instructions is used to extract from the scene structures of interest, including an algorithm to automatically select the next TD-instruction in the sequence. The extraction process is shown to have favorable properties in terms of combinatorial generalization, generalizing well to novel scene structures and new combinations of objects, properties and relations not seen during training. Finally, we compare the model with relevant aspects of the human vision, and suggest directions for using the BU-TD scheme for integrating visual and cognitive components in the process of scene understanding. △ Less

Submitted 12 May, 2021; originally announced May 2021.

arXiv:2104.09829 [pdf, other]

Detector-Free Weakly Supervised Grounding by Separation

Authors: Assaf Arbelle, Sivan Doveh, Amit Alfassy, Joseph Shtok, Guy Lev, Eli Schwartz, Hilde Kuehne, Hila Barak Levi, Prasanna Sattigeri, Rameswar Panda, Chun-Fu Chen, Alex Bronstein, Kate Saenko, Shimon Ullman, Raja Giryes, Rogerio Feris, Leonid Karlinsky

Abstract: Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object de… ▽ More Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object detector, relying on it to produce the ROIs for localization. In this work, we focus on the task of Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector. We directly learn everything from the images and associated free-form text pairs, thus potentially gaining an advantage on the categories unsupported by the detector. The key idea behind our proposed Grounding by Separation (GbS) method is synthesizing `text to image-regions' associations by random alpha-blending of arbitrary image pairs and using the corresponding texts of the pair as conditions to recover the alpha map from the blended image via a segmentation network. At test time, this allows using the query phrase as a condition for a non-blended query image, thus interpreting the test image as a composition of a region corresponding to the phrase and the complement region. Using this approach we demonstrate a significant accuracy improvement, of up to $8.5\%$ over previous DF-WSG SotA, for a range of benchmarks including Flickr30K, Visual Genome, and ReferIt, as well as a significant complementary improvement (above $7\%$) over the detector-based approaches for WSG. △ Less

Submitted 20 April, 2021; originally announced April 2021.

arXiv:2104.09447 [pdf, other]

doi 10.1016/j.cognition.2020.104263

What can human minimal videos tell us about dynamic recognition models?

Authors: Guy Ben-Yosef, Gabriel Kreiman, Shimon Ullman

Abstract: In human vision objects and their parts can be visually recognized from purely spatial or purely temporal information but the mechanisms integrating space and time are poorly understood. Here we show that human visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. Th… ▽ More In human vision objects and their parts can be visually recognized from purely spatial or purely temporal information but the mechanisms integrating space and time are poorly understood. Here we show that human visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. This analysis is obtained by identifying minimal videos: these are short and tiny video clips in which objects, parts, and actions can be reliably recognized, but any reduction in either space or time makes them unrecognizable. State-of-the-art deep networks for dynamic visual recognition cannot replicate human behavior in these configurations. This gap between humans and machines points to critical mechanisms in human dynamic vision that are lacking in current models. △ Less

Submitted 19 April, 2021; originally announced April 2021.

Comments: Published as a workshop paper at Bridging AI and Cognitive Science (ICLR 2020). Extended paper was published at Cognition

arXiv:2006.05249 [pdf]

What takes the brain so long: Object recognition at the level of minimal images develops for up to seconds of presentation time

Authors: Hanna Benoni, Daniel Harari, Shimon Ullman

Abstract: Rich empirical evidence has shown that visual object recognition in the brain is fast and effortless, with relevant brain signals reported to start as early as 80 ms. Here we study the time trajectory of the recognition process at the level of minimal recognizable images (termed MIRC). These are images that can be recognized reliably, but in which a minute change of the image (reduction by either… ▽ More Rich empirical evidence has shown that visual object recognition in the brain is fast and effortless, with relevant brain signals reported to start as early as 80 ms. Here we study the time trajectory of the recognition process at the level of minimal recognizable images (termed MIRC). These are images that can be recognized reliably, but in which a minute change of the image (reduction by either size or resolution) has a drastic effect on recognition. Subjects were assigned to one of nine exposure conditions: 200, 500, 1000, 2000 ms with or without masking, as well as unlimited time. The subjects were not limited in time to respond after presentation. The results show that in the masked conditions, recognition rates develop gradually over an extended period, e.g. average of 18% for 200 ms exposure and 45% for 500 ms, increasing significantly with longer exposure even above 2 secs. When presented for unlimited time (until response), MIRC recognition rates were equivalent to the rates of full-object images presented for 50 ms followed by masking. What takes the brain so long to recognize such images? We discuss why processes involving eye-movements, perceptual decision-making and pattern completion are unlikely explanations. Alternatively, we hypothesize that MIRC recognition requires an extended top-down process complementing the feed-forward phase. △ Less

Submitted 9 June, 2020; originally announced June 2020.

Comments: 7 pages, 2 figures, 1 table

arXiv:2002.03335 [pdf, other]

Multi-Task Learning by a Top-Down Control Network

Authors: Hila Levi, Shimon Ullman

Abstract: As the range of tasks performed by a general vision system expands, executing multiple tasks accurately and efficiently in a single network has become an important and still open problem. Recent computer vision approaches address this problem by branching networks, or by a channel-wise modulation of the network feature-maps with task specific vectors. We present a novel architecture that uses a de… ▽ More As the range of tasks performed by a general vision system expands, executing multiple tasks accurately and efficiently in a single network has become an important and still open problem. Recent computer vision approaches address this problem by branching networks, or by a channel-wise modulation of the network feature-maps with task specific vectors. We present a novel architecture that uses a dedicated top-down control network to modify the activation of all the units in the main recognition network in a manner that depends on the selected task, image content, and spatial location. We show the effectiveness of our scheme by achieving significantly better results than alternative state-of-the-art approaches on four datasets. We further demonstrate our advantages in terms of task selectivity, scaling the number of tasks and interpretability. △ Less

Submitted 29 October, 2020; v1 submitted 9 February, 2020; originally announced February 2020.

arXiv:1811.12152 [pdf, other]

Efficient Coarse-to-Fine Non-Local Module for the Detection of Small Objects

Authors: Hila Levi, Shimon Ullman

Abstract: An image is not just a collection of objects, but rather a graph where each object is related to other objects through spatial and semantic relations. Using relational reasoning modules, such as the non-local module \cite{wang2017non}, can therefore improve object detection. Current schemes apply such dedicated modules either to a specific layer of the bottom-up stream, or between already-detected… ▽ More An image is not just a collection of objects, but rather a graph where each object is related to other objects through spatial and semantic relations. Using relational reasoning modules, such as the non-local module \cite{wang2017non}, can therefore improve object detection. Current schemes apply such dedicated modules either to a specific layer of the bottom-up stream, or between already-detected objects. We show that the relational process can be better modeled in a coarse-to-fine manner and present a novel framework, applying a non-local module sequentially to increasing resolution feature maps along the top-down stream. In this way, information can naturally passed from larger objects to smaller related ones. Applying the module to fine feature maps further allows the information to pass between the small objects themselves, exploiting repetitions of instances of the same class. In practice, due to the expensive memory utilization of the non-local module, it is infeasible to apply the module as currently used to high-resolution feature maps. We redesigned the non local module, improved it in terms of memory and number of operations, allowing it to be placed anywhere along the network. We further incorporated relative spatial information into the module, in a manner that can be incorporated into our efficient implementation. We show the effectiveness of our scheme by improving the results of detecting small objects on COCO by 1-2 AP points over Faster and Mask RCNN and by 1 AP over using non-local module on the bottom-up stream. △ Less

Submitted 20 May, 2019; v1 submitted 29 November, 2018; originally announced November 2018.

arXiv:1811.08481 [pdf, other]

VQA with no questions-answers training

Authors: Ben-Zion Vatashsky, Shimon Ullman

Abstract: Methods for teaching machines to answer visual questions have made significant progress in recent years, but current methods still lack important human capabilities, including integrating new visual classes and concepts in a modular manner, providing explanations for the answers and handling new domains without explicit examples. We propose a novel method that consists of two main parts: generatin… ▽ More Methods for teaching machines to answer visual questions have made significant progress in recent years, but current methods still lack important human capabilities, including integrating new visual classes and concepts in a modular manner, providing explanations for the answers and handling new domains without explicit examples. We propose a novel method that consists of two main parts: generating a question graph representation, and an answering procedure, guided by the abstract structure of the question graph to invoke an extendable set of visual estimators. Training is performed for the language part and the visual part on their own, but unlike existing schemes, the method does not require any training using images with associated questions and answers. This approach is able to handle novel domains (extended question types and new object classes, properties and relations) as long as corresponding visual estimators are available. In addition, it can provide explanations to its answers and suggest alternatives when questions are not grounded in the image. We demonstrate that this approach achieves both high performance and domain extensibility without any questions-answers training. △ Less

Submitted 26 May, 2020; v1 submitted 20 November, 2018; originally announced November 2018.

Comments: Accepted to CVPR 2020

arXiv:1810.10656 [pdf, other]

Understand, Compose and Respond - Answering Visual Questions by a Composition of Abstract Procedures

Authors: Ben Zion Vatashsky, Shimon Ullman

Abstract: An image related question defines a specific visual task that is required in order to produce an appropriate answer. The answer may depend on a minor detail in the image and require complex reasoning and use of prior knowledge. When humans perform this task, they are able to do it in a flexible and robust manner, integrating modularly any novel visual capability with diverse options for various el… ▽ More An image related question defines a specific visual task that is required in order to produce an appropriate answer. The answer may depend on a minor detail in the image and require complex reasoning and use of prior knowledge. When humans perform this task, they are able to do it in a flexible and robust manner, integrating modularly any novel visual capability with diverse options for various elaborations of the task. In contrast, current approaches to solve this problem by a machine are based on casting the problem as an end-to-end learning problem, which lacks such abilities. We present a different approach, inspired by the aforementioned human capabilities. The approach is based on the compositional structure of the question. The underlying idea is that a question has an abstract representation based on its structure, which is compositional in nature. The question can consequently be answered by a composition of procedures corresponding to its substructures. The basic elements of the representation are logical patterns, which are put together to represent the question. These patterns include a parametric representation for object classes, properties and relations. Each basic pattern is mapped into a basic procedure that includes meaningful visual tasks, and the patterns are composed to produce the overall answering procedure. The UnCoRd (Understand Compose and Respond) system, based on this approach, integrates existing detection and classification schemes for a set of object classes, properties and relations. These schemes are incorporated in a modular manner, providing elaborated answers and corrections for negative answers. In addition, an external knowledge base is queried for required common-knowledge. We performed a qualitative analysis of the system, which demonstrates its representation capabilities and provide suggestions for future developments. △ Less

Submitted 24 October, 2018; originally announced October 2018.

arXiv:1804.04604 [pdf, other]

Discovery and usage of joint attention in images

Authors: Daniel Harari, Joshua B. Tenenbaum, Shimon Ullman

Abstract: Joint visual attention is characterized by two or more individuals looking at a common target at the same time. The ability to identify joint attention in scenes, the people involved, and their common target, is fundamental to the understanding of social interactions, including others' intentions and goals. In this work we deal with the extraction of joint attention events, and the use of such eve… ▽ More Joint visual attention is characterized by two or more individuals looking at a common target at the same time. The ability to identify joint attention in scenes, the people involved, and their common target, is fundamental to the understanding of social interactions, including others' intentions and goals. In this work we deal with the extraction of joint attention events, and the use of such events for image descriptions. The work makes two novel contributions. First, our extraction algorithm is the first which identifies joint visual attention in single static images. It computes 3D gaze direction, identifies the gaze target by combining gaze direction with a 3D depth map computed for the image, and identifies the common gaze target. Second, we use a human study to demonstrate the sensitivity of humans to joint attention, suggesting that the detection of such a configuration in an image can be useful for understanding the image, including the goals of the agents and their joint activity, and therefore can contribute to image captioning and related tasks. △ Less

Submitted 10 April, 2018; originally announced April 2018.

Comments: 6 pages, 3 figures

arXiv:1804.03576 [pdf, other]

Large Field and High Resolution: Detecting Needle in Haystack

Authors: Hadar Gorodissky, Daniel Harari, Shimon Ullman

Abstract: The growing use of convolutional neural networks (CNN) for a broad range of visual tasks, including tasks involving fine details, raises the problem of applying such networks to a large field of view, since the amount of computations increases significantly with the number of pixels. To deal effectively with this difficulty, we develop and compare methods of using CNNs for the task of small target… ▽ More The growing use of convolutional neural networks (CNN) for a broad range of visual tasks, including tasks involving fine details, raises the problem of applying such networks to a large field of view, since the amount of computations increases significantly with the number of pixels. To deal effectively with this difficulty, we develop and compare methods of using CNNs for the task of small target localization in natural images, given a limited "budget" of samples to form an image. Inspired in part by human vision, we develop and compare variable sampling schemes, with peak resolution at the center and decreasing resolution with eccentricity, applied iteratively by re-centering the image at the previous predicted target location. The results indicate that variable resolution models significantly outperform constant resolution models. Surprisingly, variable resolution models and in particular multi-channel models, outperform the optimal, "budget-free" full-resolution model, using only 5\% of the samples. △ Less

Submitted 10 April, 2018; originally announced April 2018.

Comments: 15 pages, 7 figures

arXiv:1802.09030 [pdf, ps, other]

Cakewalk Sampling

Authors: Uri Patish, Shimon Ullman

Abstract: We study the task of finding good local optima in combinatorial optimization problems. Although combinatorial optimization is NP-hard in general, locally optimal solutions are frequently used in practice. Local search methods however typically converge to a limited set of optima that depend on their initialization. Sampling methods on the other hand can access any valid solution, and thus can be u… ▽ More We study the task of finding good local optima in combinatorial optimization problems. Although combinatorial optimization is NP-hard in general, locally optimal solutions are frequently used in practice. Local search methods however typically converge to a limited set of optima that depend on their initialization. Sampling methods on the other hand can access any valid solution, and thus can be used either directly or alongside methods of the former type as a way for finding good local optima. Since the effectiveness of this strategy depends on the sampling distribution, we derive a robust learning algorithm that adapts sampling distributions towards good local optima of arbitrary objective functions. As a first use case, we empirically study the efficiency in which sampling methods can recover locally maximal cliques in undirected graphs. Not only do we show how our adaptive sampler outperforms related methods, we also show how it can even approach the performance of established clique algorithms. As a second use case, we consider how greedy algorithms can be combined with our adaptive sampler, and we demonstrate how this leads to superior performance in k-medoid clustering. Together, these findings suggest that our adaptive sampler can provide an effective strategy to combinatorial optimization problems that arise in practice. △ Less

Submitted 1 January, 2020; v1 submitted 25 February, 2018; originally announced February 2018.

Comments: Accepted as a conference paper by AAAI-2020 (oral presentation)

arXiv:1712.09299 [pdf]

A model for interpreting social interactions in local image regions

Authors: Guy Ben-Yosef, Alon Yachin, Shimon Ullman

Abstract: Understanding social interactions (such as 'hug' or 'fight') is a basic and important capacity of the human visual system, but a challenging and still open problem for modeling. In this work we study visual recognition of social interactions, based on small but recognizable local regions. The approach is based on two novel key components: (i) A given social interaction can be recognized reliably f… ▽ More Understanding social interactions (such as 'hug' or 'fight') is a basic and important capacity of the human visual system, but a challenging and still open problem for modeling. In this work we study visual recognition of social interactions, based on small but recognizable local regions. The approach is based on two novel key components: (i) A given social interaction can be recognized reliably from reduced images (called 'minimal images'). (ii) The recognition of a social interaction depends on identifying components and relations within the minimal image (termed 'interpretation'). We show psychophysics data for minimal images and modeling results for their interpretation. We discuss the integration of minimal configurations in recognizing social interactions in a detailed, high-resolution image. △ Less

Submitted 26 December, 2017; originally announced December 2017.

Comments: In AAAI spring symposium on Science of Intelligence: Computational Principles of Natural and Artificial Intelligence, Palo Alto, 2017

arXiv:1711.11151 [pdf]

Structured learning and detailed interpretation of minimal object images

Authors: Guy Ben-Yosef, Liav Assif, Shimon Ullman

Abstract: We model the process of human full interpretation of object images, namely the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object to the interpretation of multiple reduced but interpretable local regions. We model interpretation by a structured learning framework, in w… ▽ More We model the process of human full interpretation of object images, namely the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object to the interpretation of multiple reduced but interpretable local regions. We model interpretation by a structured learning framework, in which there are primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of minimal configurations, namely reduced local regions that are minimal in the sense that further reduction will turn them unrecognizable and uninterpretable. We show experimental results of our model, and results of predicting and testing relations that were useful to the model via transformed minimal images. △ Less

Submitted 29 November, 2017; originally announced November 2017.

Comments: Accepted to Workshop on Mutual Benefits of Cognitive and Computer Vision, at the International Conference on Computer Vision. Venice, Italy, 2017

arXiv:1611.09819 [pdf]

Measuring and modeling the perception of natural and unconstrained gaze in humans and machines

Authors: Daniel Harari, Tao Gao, Nancy Kanwisher, Joshua Tenenbaum, Shimon Ullman

Abstract: Humans are remarkably adept at interpreting the gaze direction of other individuals in their surroundings. This skill is at the core of the ability to engage in joint visual attention, which is essential for establishing social interactions. How accurate are humans in determining the gaze direction of others in lifelike scenes, when they can move their heads and eyes freely, and what are the sourc… ▽ More Humans are remarkably adept at interpreting the gaze direction of other individuals in their surroundings. This skill is at the core of the ability to engage in joint visual attention, which is essential for establishing social interactions. How accurate are humans in determining the gaze direction of others in lifelike scenes, when they can move their heads and eyes freely, and what are the sources of information for the underlying perceptual processes? These questions pose a challenge from both empirical and computational perspectives, due to the complexity of the visual input in real-life situations. Here we measure empirically human accuracy in perceiving the gaze direction of others in lifelike scenes, and study computationally the sources of information and representations underlying this cognitive capacity. We show that humans perform better in face-to-face conditions compared with recorded conditions, and that this advantage is not due to the availability of input dynamics. We further show that humans are still performing well when only the eyes-region is visible, rather than the whole face. We develop a computational model, which replicates the pattern of human performance, including the finding that the eyes-region contains on its own, the required information for estimating both head orientation and direction of gaze. Consistent with neurophysiological findings on task-specific face regions in the brain, the learned computational representations reproduce perceptual effects such as the Wollaston illusion, when trained to estimate direction of gaze, but not when trained to recognize objects or faces. △ Less

Submitted 29 November, 2016; originally announced November 2016.

Comments: Daniel Harari and Tao Gao contributed equally to this work

Report number: Center for Brains, Minds and Machines Memo No. 059

arXiv:1610.09625 [pdf]

doi 10.1016/j.cognition.2018.11.001

Discovering containment: from infants to machines

Authors: Shimon Ullman, Nimrod Dorfman, Daniel Harari

Abstract: Current artificial learning systems can recognize thousands of visual categories, or play Go at a champion"s level, but cannot explain infants learning, in particular the ability to learn complex concepts without guidance, in a specific order. A notable example is the category of 'containers' and the notion of containment, one of the earliest spatial relations to be learned, starting already at 2.… ▽ More Current artificial learning systems can recognize thousands of visual categories, or play Go at a champion"s level, but cannot explain infants learning, in particular the ability to learn complex concepts without guidance, in a specific order. A notable example is the category of 'containers' and the notion of containment, one of the earliest spatial relations to be learned, starting already at 2.5 months, and preceding other common relations (e.g., support). Such spontaneous unsupervised learning stands in contrast with current highly successful computational models, which learn in a supervised manner, that is, by using large data sets of labeled examples. How can meaningful concepts be learned without guidance, and what determines the trajectory of infant learning, making some notions appear consistently earlier than others? △ Less

Submitted 30 October, 2016; originally announced October 2016.

Journal ref: Cognition 183 (2019) 67-81

arXiv:1605.07824 [pdf, other]

Action Classification via Concepts and Attributes

Authors: Amir Rosenfeld, Shimon Ullman

Abstract: Classes in natural images tend to follow long tail distributions. This is problematic when there are insufficient training examples for rare classes. This effect is emphasized in compound classes, involving the conjunction of several concepts, such as those appearing in action-recognition datasets. In this paper, we propose to address this issue by learning how to utilize common visual concepts wh… ▽ More Classes in natural images tend to follow long tail distributions. This is problematic when there are insufficient training examples for rare classes. This effect is emphasized in compound classes, involving the conjunction of several concepts, such as those appearing in action-recognition datasets. In this paper, we propose to address this issue by learning how to utilize common visual concepts which are readily available. We detect the presence of prominent concepts in images and use them to infer the target labels instead of using visual features directly, combining tools from vision and natural-language processing. We validate our method on the recently introduced HICO dataset reaching a mAP of 31.54\% and on the Stanford-40 Actions dataset, where the proposed method outperforms that obtained by direct visual features, obtaining an accuracy 83.12\%. Moreover, the method provides for each class a semantically meaningful list of keywords and relevant image regions relating it to its constituent concepts. △ Less

Submitted 6 March, 2018; v1 submitted 25 May, 2016; originally announced May 2016.

arXiv:1603.08212 [pdf, other]

Human Pose Estimation using Deep Consensus Voting

Authors: Ita Lifshitz, Ethan Fetaya, Shimon Ullman

Abstract: In this paper we consider the problem of human pose estimation from a single still image. We propose a novel approach where each location in the image votes for the position of each keypoint using a convolutional neural net. The voting scheme allows us to utilize information from the whole image, rather than rely on a sparse set of keypoint locations. Using dense, multi-target votes, not only prod… ▽ More In this paper we consider the problem of human pose estimation from a single still image. We propose a novel approach where each location in the image votes for the position of each keypoint using a convolutional neural net. The voting scheme allows us to utilize information from the whole image, rather than rely on a sparse set of keypoint locations. Using dense, multi-target votes, not only produces good keypoint predictions, but also enables us to compute image-dependent joint keypoint probabilities by looking at consensus voting. This differs from most previous methods where joint probabilities are learned from relative keypoint locations and are independent of the image. We finally combine the keypoints votes and joint probabilities in order to identify the optimal pose configuration. We show our competitive performance on the MPII Human Pose and Leeds Sports Pose datasets. △ Less

Submitted 27 March, 2016; originally announced March 2016.

arXiv:1603.08079 [pdf, other]

Do You See What I Mean? Visual Resolution of Linguistic Ambiguities

Authors: Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz, Shimon Ullman

Abstract: Understanding language goes hand in hand with the ability to integrate complex contextual information obtained via perception. In this work, we present a novel task for grounded language understanding: disambiguating a sentence given a visual scene which depicts one of the possible interpretations of that sentence. To this end, we introduce a new multimodal corpus containing ambiguous sentences, r… ▽ More Understanding language goes hand in hand with the ability to integrate complex contextual information obtained via perception. In this work, we present a novel task for grounded language understanding: disambiguating a sentence given a visual scene which depicts one of the possible interpretations of that sentence. To this end, we introduce a new multimodal corpus containing ambiguous sentences, representing a wide range of syntactic, semantic and discourse ambiguities, coupled with videos that visualize the different interpretations for each sentence. We address this task by extending a vision model which determines if a sentence is depicted by a video. We demonstrate how such a model can be adjusted to recognize different interpretations of the same underlying sentence, allowing to disambiguate sentences in a unified fashion across the different ambiguity types. △ Less

Submitted 26 March, 2016; originally announced March 2016.

Comments: EMNLP 2015

Journal ref: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015, pages 1477--1487

arXiv:1603.04186 [pdf, other]

Visual Concept Recognition and Localization via Iterative Introspection

Authors: Amir Rosenfeld, Shimon Ullman

Abstract: Convolutional neural networks have been shown to develop internal representations, which correspond closely to semantically meaningful objects and parts, although trained solely on class labels. Class Activation Mapping (CAM) is a recent method that makes it possible to easily highlight the image regions contributing to a network's classification decision. We build upon these two developments to e… ▽ More Convolutional neural networks have been shown to develop internal representations, which correspond closely to semantically meaningful objects and parts, although trained solely on class labels. Class Activation Mapping (CAM) is a recent method that makes it possible to easily highlight the image regions contributing to a network's classification decision. We build upon these two developments to enable a network to re-examine informative image regions, which we term introspection. We propose a weakly-supervised iterative scheme, which shifts its center of attention to increasingly discriminative regions as it progresses, by alternating stages of classification and introspection. We evaluate our method and show its effectiveness over a range of several datasets, where we obtain competitive or state-of-the-art results: on Stanford-40 Actions, we set a new state-of the art of 81.74%. On FGVC-Aircraft and the Stanford Dogs dataset, we show consistent improvements over baselines, some of which include significantly more supervision. △ Less

Submitted 25 May, 2016; v1 submitted 14 March, 2016; originally announced March 2016.

arXiv:1601.04293 [pdf, other]

Face-space Action Recognition by Face-Object Interactions

Authors: Amir Rosenfeld, Shimon Ullman

Abstract: Action recognition in still images has seen major improvement in recent years due to advances in human pose estimation, object recognition and stronger feature representations. However, there are still many cases in which performance remains far from that of humans. In this paper, we approach the problem by learning explicitly, and then integrating three components of transitive actions: (1) the h… ▽ More Action recognition in still images has seen major improvement in recent years due to advances in human pose estimation, object recognition and stronger feature representations. However, there are still many cases in which performance remains far from that of humans. In this paper, we approach the problem by learning explicitly, and then integrating three components of transitive actions: (1) the human body part relevant to the action (2) the object being acted upon and (3) the specific form of interaction between the person and the object. The process uses class-specific features and relations not used in the past for action recognition and which use inherently two cycles in the process unlike most standard approaches. We focus on face-related actions (FRA), a subset of actions that includes several currently challenging categories. We present an average relative improvement of 52% over state-of-the art. We also make a new benchmark publicly available. △ Less

Submitted 17 January, 2016; originally announced January 2016.

Comments: our more recent work on a related topic is described in a separate paper : http://arxiv.org/abs/1511.03814

arXiv:1511.03814 [pdf, other]

Hand-Object Interaction and Precise Localization in Transitive Action Recognition

Authors: Amir Rosenfeld, Shimon Ullman

Abstract: Action recognition in still images has seen major improvement in recent years due to advances in human pose estimation, object recognition and stronger feature representations produced by deep neural networks. However, there are still many cases in which performance remains far from that of humans. A major difficulty arises in distinguishing between transitive actions in which the overall actor po… ▽ More Action recognition in still images has seen major improvement in recent years due to advances in human pose estimation, object recognition and stronger feature representations produced by deep neural networks. However, there are still many cases in which performance remains far from that of humans. A major difficulty arises in distinguishing between transitive actions in which the overall actor pose is similar, and recognition therefore depends on details of the grasp and the object, which may be largely occluded. In this paper we demonstrate how recognition is improved by obtaining precise localization of the action-object and consequently extracting details of the object shape together with the actor-object interaction. To obtain exact localization of the action object and its interaction with the actor, we employ a coarse-to-fine approach which combines semantic segmentation and contextual features, in successive stages. We focus on (but are not limited) to face-related actions, a set of actions that includes several currently challenging categories. We present an average relative improvement of 35% over state-of-the art and validate through experimentation the effectiveness of our approach. △ Less

Submitted 24 February, 2016; v1 submitted 12 November, 2015; originally announced November 2015.

Comments: Minor changes: title and abstract

arXiv:1502.01176 [pdf, other]

Learning Local Invariant Mahalanobis Distances

Authors: Ethan Fetaya, Shimon Ullman

Abstract: For many tasks and data types, there are natural transformations to which the data should be invariant or insensitive. For instance, in visual recognition, natural images should be insensitive to rotation and translation. This requirement and its implications have been important in many machine learning applications, and tolerance for image transformations was primarily achieved by using robust fe… ▽ More For many tasks and data types, there are natural transformations to which the data should be invariant or insensitive. For instance, in visual recognition, natural images should be insensitive to rotation and translation. This requirement and its implications have been important in many machine learning applications, and tolerance for image transformations was primarily achieved by using robust feature vectors. In this paper we propose a novel and computationally efficient way to learn a local Mahalanobis metric per datum, and show how we can learn a local invariant metric to any transformation in order to improve performance. △ Less

Submitted 4 February, 2015; originally announced February 2015.

arXiv:1412.2672 [pdf]

When Computer Vision Gazes at Cognition

Authors: Tao Gao, Daniel Harari, Joshua Tenenbaum, Shimon Ullman

Abstract: Joint attention is a core, early-developing form of social interaction. It is based on our ability to discriminate the third party objects that other people are looking at. While it has been shown that people can accurately determine whether another person is looking directly at them versus away, little is known about human ability to discriminate a third person gaze directed towards objects that… ▽ More Joint attention is a core, early-developing form of social interaction. It is based on our ability to discriminate the third party objects that other people are looking at. While it has been shown that people can accurately determine whether another person is looking directly at them versus away, little is known about human ability to discriminate a third person gaze directed towards objects that are further away, especially in unconstraint cases where the looker can move her head and eyes freely. In this paper we address this question by jointly exploring human psychophysics and a cognitively motivated computer vision model, which can detect the 3D direction of gaze from 2D face images. The synthesis of behavioral study and computer vision yields several interesting discoveries. (1) Human accuracy of discriminating targets 8°-10° of visual angle apart is around 40% in a free looking gaze task; (2) The ability to interpret gaze of different lookers vary dramatically; (3) This variance can be captured by the computational model; (4) Human outperforms the current model significantly. These results collectively show that the acuity of human joint attention is indeed highly impressive, given the computational challenge of the natural looking task. Moreover, the gap between human and model performance, as well as the variability of gaze interpretation across different lookers, require further understanding of the underlying mechanisms utilized by humans for this challenging task. △ Less

Submitted 8 December, 2014; originally announced December 2014.

Comments: Tao Gao and Daniel Harari contributed equally to this work

Report number: CBMM Memo No. 025, MIT

arXiv:1406.2602 [pdf, other]

Graph Approximation and Clustering on a Budget

Authors: Ethan Fetaya, Ohad Shamir, Shimon Ullman

Abstract: We consider the problem of learning from a similarity matrix (such as spectral clustering and lowd imensional embedding), when computing pairwise similarities are costly, and only a limited number of entries can be observed. We provide a theoretical analysis using standard notions of graph approximation, significantly generalizing previous results (which focused on spectral clustering with two clu… ▽ More We consider the problem of learning from a similarity matrix (such as spectral clustering and lowd imensional embedding), when computing pairwise similarities are costly, and only a limited number of entries can be observed. We provide a theoretical analysis using standard notions of graph approximation, significantly generalizing previous results (which focused on spectral clustering with two clusters). We also propose a new algorithmic approach based on adaptive sampling, which experimentally matches or improves on previous methods, while being considerably more general and computationally cheaper. △ Less

Submitted 10 June, 2014; originally announced June 2014.

Showing 1–31 of 31 results for author: Ullman, S