
Showing 1–28 of 28 results for author: Herzig, R

Searching in archive cs.
  1. arXiv:2406.15334  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

    Authors: Brandon Huang, Chancharik Mitra, Assaf Arbelle, Leonid Karlinsky, Trevor Darrell, Roei Herzig

    Abstract: The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, wh…

    Submitted 21 June, 2024; originally announced June 2024.

  2. arXiv:2406.12172  [pdf, other]

    cs.AI

    Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems

    Authors: Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

    Abstract: Recently, Large Language Models (LLMs) attained impressive performance in math and reasoning benchmarks. However, they still often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, containing 11 unique search problem types, each equipped with automated pipelines to generate an arbitrary number of in…

    Submitted 17 June, 2024; originally announced June 2024.

  3. arXiv:2406.11815  [pdf, other]

    cs.RO cs.CV cs.LG

    LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning

    Authors: Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, Roei Herzig

    Abstract: In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet leveraging these models remains an open question for robotics. Prior LMMs for robotics applications have been extensively trained on language and action data, but their ability to generalize in different settings has often been less…

    Submitted 17 June, 2024; originally announced June 2024.

  4. arXiv:2406.08164  [pdf, other]

    cs.CV

    ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

    Authors: Irene Huang, Wei Lin, M. Jehanzeb Mirza, Jacob A. Hansen, Sivan Doveh, Victor Ion Butoi, Roei Herzig, Assaf Arbelle, Hilde Kuehne, Trevor Darrell, Chuang Gan, Aude Oliva, Rogerio Feris, Leonid Karlinsky

    Abstract: Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmark…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: The first three authors contributed equally

  5. arXiv:2404.01476  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    TraveLER: A Multi-LMM Agent Framework for Video Question-Answering

    Authors: Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, Roei Herzig

    Abstract: Recently, Large Multimodal Models (LMMs) have made significant progress in video question-answering using a frame-wise approach by leveraging large-scale, image-based pretraining in a zero-shot manner. While image-based methods for videos have shown impressive performance, a current limitation is that they often overlook how key timestamps are selected and cannot adjust when incorrect timestamps a…

    Submitted 1 April, 2024; originally announced April 2024.

  6. arXiv:2312.17243  [pdf, other]

    cs.CV

    Unsupervised Universal Image Segmentation

    Authors: Dantong Niu, Xudong Wang, Xinyang Han, Long Lian, Roei Herzig, Trevor Darrell

    Abstract: Several unsupervised image segmentation approaches have been proposed which eliminate the need for dense manually-annotated segmentation masks; current models separately handle either semantic segmentation (e.g., STEGO) or class-agnostic instance segmentation (e.g., CutLER), but not both (i.e., panoptic segmentation). We propose an Unsupervised Universal Segmentation model (U2Seg) adept at perform…

    Submitted 28 December, 2023; originally announced December 2023.

  7. arXiv:2312.02249  [pdf, other]

    cs.CV cs.CL

    Recursive Visual Programming

    Authors: Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, Trevor Darrell

    Abstract: Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in few-shot and zero-shot scenarios. However, existing VP methods generate all code in a single function, resulting in code that is suboptimal in terms o…

    Submitted 4 December, 2023; originally announced December 2023.

  8. arXiv:2311.17942  [pdf, other]

    cs.CV

    Object-based (yet Class-agnostic) Video Domain Adaptation

    Authors: Dantong Niu, Amir Bar, Roei Herzig, Trevor Darrell, Anna Rohrbach

    Abstract: Existing video-based action recognition systems typically require dense annotation and struggle in environments when there is significant distribution shift relative to the training data. Current methods for video domain adaptation typically fine-tune the model using fully annotated data on a subset of target domain data or align the representation of the two domains using bootstrapping or adversa…

    Submitted 28 November, 2023; originally announced November 2023.

  9. arXiv:2311.17076  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Compositional Chain-of-Thought Prompting for Large Multimodal Models

    Authors: Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig

    Abstract: The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. O…

    Submitted 31 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

  10. arXiv:2305.19595  [pdf, other]

    cs.CV

    Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

    Authors: Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

    Abstract: Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text spaces learned by all the popular VL models are still suffering from the so-called 'object bias' - their representations behave as 'bags of no…

    Submitted 1 June, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

  11. arXiv:2305.06343  [pdf, other]

    cs.CV

    Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs

    Authors: Roei Herzig, Alon Mendelson, Leonid Karlinsky, Assaf Arbelle, Rogerio Feris, Trevor Darrell, Amir Globerson

    Abstract: Vision and language models (VLMs) have demonstrated remarkable zero-shot (ZS) performance in a variety of tasks. However, recent works have shown that even the best VLMs struggle to capture aspects of compositional scene understanding, such as object attributes, relations, and action states. In contrast, obtaining structured annotations, such as scene graphs (SGs), that could improve these models…

    Submitted 24 October, 2023; v1 submitted 10 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023

  12. arXiv:2212.04821  [pdf, other]

    cs.CV

    PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data

    Authors: Roei Herzig, Ofir Abramovich, Elad Ben-Avraham, Assaf Arbelle, Leonid Karlinsky, Ariel Shamir, Trevor Darrell, Amir Globerson

    Abstract: Action recognition models have achieved impressive results by incorporating scene-level annotations, such as objects, their relations, 3D structure, and more. However, obtaining annotations of scene structure for videos requires a significant amount of effort to gather and annotate, making these methods expensive to train. In contrast, synthetic datasets generated by graphics engines provide power…

    Submitted 5 December, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

    Comments: WACV 2024

  13. arXiv:2211.11733  [pdf, other]

    cs.CV

    Teaching Structured Vision&Language Concepts to Vision&Language Models

    Authors: Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

    Abstract: Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision&Language Concepts (SVLC) which includes object attributes, relations, and states which are present in the text and visible in the image. Recent studies have…

    Submitted 30 May, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

    Journal ref: CVPR 2023

  14. arXiv:2209.03648  [pdf, other]

    cs.CV

    FETA: Towards Specializing Foundation Models for Expert Task Applications

    Authors: Amit Alfassy, Assaf Arbelle, Oshri Halimi, Sivan Harary, Roei Herzig, Eli Schwartz, Rameswar Panda, Michele Dolfi, Christoph Auer, Kate Saenko, Peter W. J. Staar, Rogerio Feris, Leonid Karlinsky

    Abstract: Foundation Models (FMs) have demonstrated unprecedented capabilities including zero-shot learning, high fidelity data synthesis, and out of domain generalization. However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g. retrieval of car manuals technical illustrations from language queries), data for which is either unseen or belonging to a long-tail…

    Submitted 19 December, 2022; v1 submitted 8 September, 2022; originally announced September 2022.

  15. arXiv:2206.07689  [pdf, other]

    cs.CV

    Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022

    Authors: Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

    Abstract: This technical report describes the SViT approach for the Ego4D Point of No Return (PNR) Temporal Localization Challenge. We propose a learning framework StructureViT (SViT for short), which demonstrates how utilizing the structure of a small number of images only available during training can improve a video model. SViT relies on two key insights. First, as both images and videos contain structur…

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: Ego4D CVPR22 Object State Localization challenge. arXiv admin note: substantial text overlap with arXiv:2206.06346

  16. arXiv:2206.06346  [pdf]

    cs.CV

    Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

    Authors: Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

    Abstract: Recent action recognition models have achieved impressive results by integrating objects, their locations and interactions. However, obtaining dense structured annotations for each frame is tedious and time-consuming, making these methods expensive to train and less scalable. At the same time, if a small set of annotated images is available, either within or outside the domain of interest, how cou…

    Submitted 29 November, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

    Comments: Tech report

  17. arXiv:2112.02300  [pdf, other]

    cs.CV

    Unsupervised Domain Generalization by Learning a Bridge Across Domains

    Authors: Sivan Harary, Eli Schwartz, Assaf Arbelle, Peter Staar, Shady Abu-Hussein, Elad Amrani, Roei Herzig, Amit Alfassy, Raja Giryes, Hilde Kuehne, Dina Katabi, Kate Saenko, Rogerio Feris, Leonid Karlinsky

    Abstract: The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system. In this paper, different from most cross-domain works that utilize some (or full) source domain supervision, we approach a relatively new and very practical Unsupervised Domain Generaliz…

    Submitted 17 May, 2022; v1 submitted 4 December, 2021; originally announced December 2021.

  18. arXiv:2110.06915  [pdf, other]

    cs.CV

    Object-Region Video Transformers

    Authors: Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

    Abstract: Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly…

    Submitted 9 June, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: CVPR 2022

  19. arXiv:2106.04550  [pdf, other]

    cs.CV

    DETReg: Unsupervised Pretraining with Region Priors for Object Detection

    Authors: Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

    Abstract: Recent self-supervised pretraining methods for object detection largely focus on pretraining the backbone of the object detector, neglecting key parts of detection architecture. Instead, we introduce DETReg, a new self-supervised method that pretrains the entire object detection network, including the object localization and embedding components. During pretraining, DETReg predicts object localiza…

    Submitted 19 July, 2023; v1 submitted 8 June, 2021; originally announced June 2021.

    Comments: Project page: https://www.amirbar.net/detreg/

  20. arXiv:2009.14558  [pdf, other]

    cs.CV

    Learning Object Detection from Captions via Textual Scene Attributes

    Authors: Achiya Jerbi, Roei Herzig, Jonathan Berant, Gal Chechik, Amir Globerson

    Abstract: Object detection is a fundamental task in computer vision, requiring large annotated datasets that are difficult to collect, as annotators need to label objects and their bounding boxes. Thus, it is a significant challenge to use cheaper forms of supervision effectively. Recent work has begun to explore image captions as a source for weak supervision, but to date, in the context of object detectio…

    Submitted 30 September, 2020; originally announced September 2020.

  21. arXiv:2006.15327  [pdf, other]

    cs.CV cs.LG

    Compositional Video Synthesis with Action Graphs

    Authors: Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, Amir Globerson

    Abstract: Videos of actions are complex signals containing rich compositional structure in space and time. Current video generation methods lack the ability to condition the generation on multiple coordinated and potentially simultaneous timed actions. To address this challenge, we propose to represent the actions in a graph structure called Action Graph and present the new "Action Graph To Video" synthes…

    Submitted 10 June, 2021; v1 submitted 27 June, 2020; originally announced June 2020.

    Comments: ICML 2021 Camera Ready

  22. arXiv:1912.09930  [pdf, other]

    cs.CV

    Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks

    Authors: Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, Trevor Darrell

    Abstract: Human action is naturally compositional: humans can easily recognize and perform actions with objects that are different from those used in training demonstrations. In this paper, we study the compositionality of action by looking into the dynamics of subject-object interactions. We propose a novel model which can explicitly reason about the geometric relations between constituent objects and an a…

    Submitted 12 September, 2020; v1 submitted 20 December, 2019; originally announced December 2019.

  23. arXiv:1912.07414  [pdf, other]

    cs.CV

    Learning Canonical Representations for Scene Graph to Image Generation

    Authors: Roei Herzig, Amir Bar, Huijuan Xu, Gal Chechik, Trevor Darrell, Amir Globerson

    Abstract: Generating realistic images of complex visual scenes becomes challenging when one wishes to control the structure of the generated images. Previous approaches showed that scenes with few entities can be controlled using scene graphs, but this approach struggles as the complexity of the graph (the number of objects and edges) increases. In this work, we show that one limitation of current methods i…

    Submitted 24 August, 2020; v1 submitted 16 December, 2019; originally announced December 2019.

    Comments: ECCV 2020

  24. arXiv:1905.03706  [pdf, other]

    cs.CV cs.AI

    Accurate Visual Localization for Automotive Applications

    Authors: Eli Brosh, Matan Friedmann, Ilan Kadar, Lev Yitzhak Lavy, Elad Levi, Shmuel Rippa, Yair Lempert, Bruno Fernandez-Ruiz, Roei Herzig, Trevor Darrell

    Abstract: Accurate vehicle localization is a crucial step towards building effective Vehicle-to-Vehicle networks and automotive applications. Yet standard grade GPS data, such as that provided by mobile phones, is often noisy and exhibits significant localization errors in many urban areas. Approaches for accurate localization from imagery often rely on structure-based techniques, and thus are limited in sc…

    Submitted 1 May, 2019; originally announced May 2019.

  25. arXiv:1904.00853  [pdf, other]

    cs.CV

    Precise Detection in Densely Packed Scenes

    Authors: Eran Goldman, Roei Herzig, Aviv Eisenschtat, Oria Ratzon, Itsik Levi, Jacob Goldberger, Tal Hassner

    Abstract: Man-made scenes can be densely packed, containing numerous objects, often identical, positioned in close proximity. We show that precise object detection in such scenes remains a challenging frontier even for state-of-the-art object detectors. We propose a novel, deep-learning based method for precise object detection, designed for such challenging settings. Our contributions include: (1) A layer…

    Submitted 30 April, 2019; v1 submitted 1 April, 2019; originally announced April 2019.

    Comments: CVPR 2019

    Journal ref: IEEE Conference on Computer Vision and Pattern Recognition, 2019

  26. arXiv:1902.10200  [pdf, other]

    cs.CV

    Differentiable Scene Graphs

    Authors: Moshiko Raboh, Roei Herzig, Gal Chechik, Jonathan Berant, Amir Globerson

    Abstract: Reasoning about complex visual scenes involves perception of entities and their relations. Scene graphs provide a natural representation for reasoning tasks, by assigning labels to both entities (nodes) and relations (edges). Unfortunately, reasoning systems based on SGs are typically trained in a two-step procedure: First, training a model to predict SGs from images; Then, a separate model is cre…

    Submitted 14 March, 2020; v1 submitted 26 February, 2019; originally announced February 2019.

    Comments: Winter Conference on Applications of Computer Vision (WACV), 2020

  27. arXiv:1812.01233  [pdf, other]

    cs.CV

    Spatio-Temporal Action Graph Networks

    Authors: Roei Herzig, Elad Levi, Huijuan Xu, Hang Gao, Eli Brosh, Xiaolong Wang, Amir Globerson, Trevor Darrell

    Abstract: Events defined by the interaction of objects in a scene are often of critical importance; yet important events may have insufficient labeled examples to train a conventional deep model to generalize to future object appearance. Activity recognition models that represent object interactions explicitly have the potential to learn in a more efficient manner than those that represent scenes with globa…

    Submitted 29 September, 2019; v1 submitted 4 December, 2018; originally announced December 2018.

    Comments: IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019

  28. arXiv:1802.05451  [pdf, other]

    stat.ML cs.CV cs.LG

    Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction

    Authors: Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, Amir Globerson

    Abstract: Machine understanding of complex images is a key goal of artificial intelligence. One challenge underlying this task is that visual scenes contain multiple inter-related objects, and that global context plays an important role in interpreting the scene. A natural modeling framework for capturing such effects is structured prediction, which optimizes over complex labels, while modeling within-label…

    Submitted 1 November, 2018; v1 submitted 15 February, 2018; originally announced February 2018.

    Comments: Paper accepted at the NIPS 2018 conference