Showing 1–50 of 104 results for author: Feris, R

Searching in archive cs. Search in all archives.
  1. arXiv:2406.12172  [pdf, other]

    cs.AI

    Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems

    Authors: Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

    Abstract: Recently, Large Language Models (LLMs) attained impressive performance in math and reasoning benchmarks. However, they still often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, containing 11 unique search problem types, each equipped with automated pipelines to generate an arbitrary number of in…

    Submitted 17 June, 2024; originally announced June 2024.

  2. arXiv:2406.12034  [pdf, other]

    cs.CL cs.LG

    Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts

    Authors: Junmo Kang, Leonid Karlinsky, Hongyin Luo, Zhen Wang, Jacob Hansen, James Glass, David Cox, Rameswar Panda, Rogerio Feris, Alan Ritter

    Abstract: We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constructs expert modules using self-generated synthetic data, each equipped with a shared base LLM and incorporating self-optimized routing. This allows for dynamic a…

    Submitted 17 June, 2024; originally announced June 2024.

  3. arXiv:2406.10082  [pdf, other]

    eess.AS cs.CV cs.SD

    Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

    Authors: Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

    Abstract: Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models such as Whisper are trained with hundreds of thousands of hours of data, and thus learn a better speech-to-text decoder. The huge training data differe…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024. Code https://github.com/roudimit/whisper-flamingo

  4. arXiv:2406.09240  [pdf, other]

    cs.CV

    Comparison Visual Instruction Tuning

    Authors: Wei Lin, Muhammad Jehanzeb Mirza, Sivan Doveh, Rogerio Feris, Raja Giryes, Sepp Hochreiter, Leonid Karlinsky

    Abstract: Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for the generation of detailed and contextually relevant descriptions, performing comparative analysis, novelty detection, and making informed decisions based on visual data. However, surprisingly, little attent…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Project page: https://wlin-at.github.io/cad_vi ; Huggingface dataset repo: https://huggingface.co/datasets/wlin21at/CaD-Inst

  5. arXiv:2406.08164  [pdf, other]

    cs.CV

    ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

    Authors: Irene Huang, Wei Lin, M. Jehanzeb Mirza, Jacob A. Hansen, Sivan Doveh, Victor Ion Butoi, Roei Herzig, Assaf Arbelle, Hilde Kuehne, Trevor Darrell, Chuang Gan, Aude Oliva, Rogerio Feris, Leonid Karlinsky

    Abstract: Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmark…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: The first three authors contributed equally

  6. arXiv:2405.17258  [pdf, other]

    cs.LG cs.AI

    $\textit{Trans-LoRA}$: towards data-free Transferable Parameter Efficient Finetuning

    Authors: Runqian Wang, Soumya Ghosh, David Cox, Diego Antognini, Aude Oliva, Rogerio Feris, Leonid Karlinsky

    Abstract: Low-rank adapters (LoRA) and their variants are popular parameter-efficient fine-tuning (PEFT) techniques that closely match full model fine-tune performance while requiring only a small number of additional parameters. These additional LoRA parameters are specific to the base model being adapted. When the base model needs to be deprecated and replaced with a new one, all the associated LoRA modul…

    Submitted 27 May, 2024; originally announced May 2024.

  7. arXiv:2404.12526  [pdf, other]

    cs.LG cs.CL cs.CV

    Adaptive Memory Replay for Continual Learning

    Authors: James Seale Smith, Lazar Valkov, Shaunak Halbe, Vyshnavi Gutta, Rogerio Feris, Zsolt Kira, Leonid Karlinsky

    Abstract: Foundation Models (FMs) have become the hallmark of modern AI; however, these models are trained on massive data, leading to financially expensive training. Updating FMs as new data becomes available is important but can lead to 'catastrophic forgetting', where models underperform on tasks related to data sub-populations observed too long ago. This continual learning (CL) phenomenon has been…

    Submitted 18 April, 2024; originally announced April 2024.

    Comments: CVPR-W 2024 (Spotlight)

  8. arXiv:2402.15514  [pdf]

    cs.CL cs.AI

    Large Scale Generative AI Text Applied to Sports and Music

    Authors: Aaron Baughman, Stephen Hammer, Rahul Agarwal, Gozde Akay, Eduardo Morales, Tony Johnson, Leonid Karlinsky, Rogerio Feris

    Abstract: We address the problem of scaling up the production of media content, including commentary and personalized news stories, for large-scale sports and music events worldwide. Our approach relies on generative AI models to transform a large volume of multimodal data (e.g., videos, articles, real-time scoring feeds, statistics, and fact sheets) into coherent and fluent text. Based on this approach, we…

    Submitted 27 February, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

    Comments: 9 pages, 8 figures, 5 tables

  9. arXiv:2402.13449  [pdf, other]

    cs.CL

    CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory

    Authors: Zexue He, Leonid Karlinsky, Donghyun Kim, Julian McAuley, Dmitry Krotov, Rogerio Feris

    Abstract: Large Language Models (LLMs) struggle to handle long input sequences due to high memory and runtime costs. Memory-augmented models have emerged as a promising solution to this problem, but current methods are hindered by limited memory capacity and require costly re-training to integrate with a new LLM. In this work, we introduce an associative memory module which can be coupled to any pre-trained…

    Submitted 20 February, 2024; originally announced February 2024.

  10. arXiv:2311.06231  [pdf, other]

    cs.CV

    Learning Human Action Recognition Representations Without Real Humans

    Authors: Howard Zhong, Samarth Mishra, Donghyun Kim, SouYoung Jin, Rameswar Panda, Hilde Kuehne, Leonid Karlinsky, Venkatesh Saligrama, Aude Oliva, Rogerio Feris

    Abstract: Pre-training on massive video datasets has become essential to achieve high action recognition performance on smaller downstream datasets. However, most large-scale video datasets contain images of people and hence are accompanied with issues related to privacy, ethics, and data protection, often preventing them from being publicly shared for reproducible research. Existing work has attempted to a…

    Submitted 10 November, 2023; originally announced November 2023.

    Comments: 19 pages, 7 figures, 2023 NeurIPS Datasets and Benchmarks Track

  11. arXiv:2310.07889  [pdf, other]

    cs.CV cs.AI cs.CL cs.RO

    LangNav: Language as a Perceptual Representation for Navigation

    Authors: Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, Yoon Kim

    Abstract: We explore the use of language as a perceptual representation for vision-and-language navigation (VLN), with a focus on low-data settings. Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions. We then finetune a pretrained language model to select an action, base…

    Submitted 30 March, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

  12. arXiv:2310.00160  [pdf, other]

    cs.CL cs.AI

    Self-Specialization: Uncovering Latent Expertise within Large Language Models

    Authors: Junmo Kang, Hongyin Luo, Yada Zhu, Jacob Hansen, James Glass, David Cox, Alan Ritter, Rogerio Feris, Leonid Karlinsky

    Abstract: Recent works have demonstrated the effectiveness of self-alignment in which a large language model is aligned to follow general instructions using instructional data generated from the model itself starting from a handful of human-written seeds. Instead of general alignment, in this work, we focus on self-alignment for expert domain specialization (e.g., biomedicine, finance). As a preliminary, we…

    Submitted 5 June, 2024; v1 submitted 29 September, 2023; originally announced October 2023.

    Comments: ACL 2024 (Findings; Long Paper)

  13. arXiv:2309.06809  [pdf, other]

    cs.CV

    TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification

    Authors: M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Horst Possegger, Rogerio Feris, Horst Bischof

    Abstract: Vision and Language Models (VLMs), such as CLIP, have enabled visual recognition of a potentially unlimited set of categories described by text prompts. However, for the best visual recognition performance, these models still require tuning to better fit the data distributions of the downstream tasks, in order to overcome the domain shift from the web-based pre-training data. Recently, it has been…

    Submitted 13 September, 2023; originally announced September 2023.

    Comments: Code is available at: https://github.com/jmiemirza/TAP

  14. arXiv:2305.19595  [pdf, other]

    cs.CV

    Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

    Authors: Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

    Abstract: Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text spaces learned by all the popular VL models are still suffering from the so-called 'object bias' - their representations behave as 'bags of no…

    Submitted 1 June, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

  15. arXiv:2305.18287  [pdf, other]

    cs.CV cs.CL

    LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections

    Authors: M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Mateusz Kozinski, Horst Possegger, Rogerio Feris, Horst Bischof

    Abstract: Recently, large-scale pre-trained Vision and Language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification enabling open-vocabulary recognition of a potentially unlimited set of categories defined as simple language prompts. However, despite these great advances, the performance of these zero-shot classifiers still falls short of the results of dedicated (closed categ…

    Submitted 23 October, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023 (Camera Ready) - Project Page: https://jmiemirza.github.io/LaFTer/

  16. arXiv:2305.12606  [pdf, other]

    cs.CL cs.SD eess.AS

    Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

    Authors: Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

    Abstract: Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each. However, there are thousands of spoken languages worldwide, and adapting to new languages is an important problem. In this work, we aim to understand which model adapts better to languages unseen during pre-training. We fine-tune both mo…

    Submitted 30 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  17. arXiv:2305.06343  [pdf, other]

    cs.CV

    Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs

    Authors: Roei Herzig, Alon Mendelson, Leonid Karlinsky, Assaf Arbelle, Rogerio Feris, Trevor Darrell, Amir Globerson

    Abstract: Vision and language models (VLMs) have demonstrated remarkable zero-shot (ZS) performance in a variety of tasks. However, recent works have shown that even the best VLMs struggle to capture aspects of compositional scene understanding, such as object attributes, relations, and action states. In contrast, obtaining structured annotations, such as scene graphs (SGs), that could improve these models…

    Submitted 24 October, 2023; v1 submitted 10 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023

  18. arXiv:2303.17590  [pdf, other]

    cs.CV cs.CL

    Going Beyond Nouns With Vision & Language Models Using Synthetic Data

    Authors: Paola Cascante-Bonilla, Khaled Shehada, James Seale Smith, Sivan Doveh, Donghyun Kim, Rameswar Panda, Gül Varol, Aude Oliva, Vicente Ordonez, Rogerio Feris, Leonid Karlinsky

    Abstract: Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling replacing a fixed set of supported classes with zero-shot open vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models. For example, their difficulty to understand Visual Language Concepts (…

    Submitted 30 August, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: Accepted to ICCV 2023. Project page: https://synthetic-vic.github.io/

  19. arXiv:2303.16990  [pdf, other]

    cs.CV

    What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

    Authors: Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne

    Abstract: Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained on loose video an…

    Submitted 28 May, 2024; v1 submitted 29 March, 2023; originally announced March 2023.

    Comments: To be presented at CVPR 2024. Project page: https://brian7685.github.io/STG/

  20. arXiv:2303.14744  [pdf, other]

    cs.CV

    Mind the Backbone: Minimizing Backbone Distortion for Robust Object Detection

    Authors: Kuniaki Saito, Donghyun Kim, Piotr Teterwak, Rogerio Feris, Kate Saenko

    Abstract: Building object detectors that are robust to domain shifts is critical for real-world applications. Prior approaches fine-tune a pre-trained backbone and risk overfitting it to in-distribution (ID) data and distorting features useful for out-of-distribution (OOD) generalization. We propose to use Relative Gradient Norm (RGN) as a way to measure the vulnerability of a backbone to feature distortion…

    Submitted 15 May, 2023; v1 submitted 26 March, 2023; originally announced March 2023.

    Comments: Project page: http://ai.bu.edu/mind_back/

  21. arXiv:2303.08914  [pdf, other]

    cs.CV

    MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

    Authors: Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, Horst Bischof

    Abstract: Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best ze…

    Submitted 22 July, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted at ICCV 2023

  22. arXiv:2303.02861  [pdf, other]

    cs.CL

    Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning

    Authors: Zhen Wang, Rameswar Panda, Leonid Karlinsky, Rogerio Feris, Huan Sun, Yoon Kim

    Abstract: Prompt tuning, in which a base pretrained model is adapted to each task via conditioning on learned prompt vectors, has emerged as a promising approach for efficiently adapting large language models to multiple downstream tasks. However, existing methods typically learn soft prompt vectors from scratch, and it has not been clear how to exploit the rich cross-task knowledge with prompt vectors in a…

    Submitted 5 March, 2023; originally announced March 2023.

    Comments: ICLR 2023. Project page: https://zhenwang9102.github.io/mpt.html

  23. arXiv:2303.00980  [pdf, other]

    cs.LG

    Learning to Grow Pretrained Models for Efficient Transformer Training

    Authors: Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, Yoon Kim

    Abstract: Scaling transformers has led to significant breakthroughs in many domains, leading to a paradigm in which larger versions of existing models are trained and released on a periodic basis. New instances of such models are typically trained completely from scratch, despite the fact that they are often just scaled-up versions of their smaller counterparts. How can we use the implicit knowledge in the…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: International Conference on Learning Representations (ICLR), 2023

  24. arXiv:2212.09864  [pdf, other]

    cs.CL cs.AI

    Synthetic Pre-Training Tasks for Neural Machine Translation

    Authors: Zexue He, Graeme Blackwood, Rameswar Panda, Julian McAuley, Rogerio Feris

    Abstract: Pre-training models with large crawled corpora can lead to issues such as toxicity and bias, as well as copyright and privacy concerns. A promising way of alleviating such concerns is to conduct pre-training with synthetic tasks and data, since no real-world information is ingested by the model. Our goal in this paper is to understand the factors that contribute to the effectiveness of pre-trainin…

    Submitted 30 May, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: Accepted to ACL2023-Findings. New added Phrase-cat for synthetic pre-training. 17 pages including 5-page appendix

  25. arXiv:2211.16412  [pdf, other]

    cs.CV cs.LG

    Procedural Image Programs for Representation Learning

    Authors: Manel Baradad, Chun-Fu Chen, Jonas Wulff, Tongzhou Wang, Rogerio Feris, Antonio Torralba, Phillip Isola

    Abstract: Learning image representations using synthetic data allows training neural networks without some of the concerns associated with real images, such as privacy and bias. Existing work focuses on a handful of curated generative processes which require expert knowledge to design, making it hard to scale up. To overcome this, we propose training with a large dataset of twenty-one thousand programs, eac…

    Submitted 6 November, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

    Comments: 29 pages, Accepted in the Conference on Neural Information Processing Systems 2022 (NeurIPS 2022)

    Journal ref: NeurIPS 2022

  26. arXiv:2211.14703  [pdf, other]

    cs.CV

    Exploring Consistency in Cross-Domain Transformer for Domain Adaptive Semantic Segmentation

    Authors: Kaihong Wang, Donghyun Kim, Rogerio Feris, Kate Saenko, Margrit Betke

    Abstract: While transformers have greatly boosted performance in semantic segmentation, domain adaptive transformers are not yet well explored. We identify that the domain gap can cause discrepancies in self-attention. Due to this gap, the transformer attends to spurious regions or pixels, which deteriorates accuracy on the target domain. We propose to perform adaptation on attention maps with cross-domain…

    Submitted 20 December, 2022; v1 submitted 26 November, 2022; originally announced November 2022.

  27. arXiv:2211.13218  [pdf, other]

    cs.CV cs.AI cs.LG

    CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning

    Authors: James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, Zsolt Kira

    Abstract: Computer vision models suffer from a phenomenon known as catastrophic forgetting when learning novel concepts from continuously shifting training data. Typical solutions for this continual learning problem require extensive rehearsal of previously seen data, which increases memory costs and may violate data privacy. Recently, the emergence of large-scale pre-trained vision transformer models has e…

    Submitted 30 March, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: Accepted by the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)

  28. arXiv:2211.11733  [pdf, other]

    cs.CV

    Teaching Structured Vision&Language Concepts to Vision&Language Models

    Authors: Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

    Abstract: Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision&Language Concepts (SVLC) which includes object attributes, relations, and states which are present in the text and visible in the image. Recent studies have…

    Submitted 30 May, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

    Journal ref: CVPR 2023

  29. arXiv:2211.09790  [pdf, other]

    cs.LG cs.AI cs.CV

    ConStruct-VL: Data-Free Continual Structured VL Concepts Learning

    Authors: James Seale Smith, Paola Cascante-Bonilla, Assaf Arbelle, Donghyun Kim, Rameswar Panda, David Cox, Diyi Yang, Zsolt Kira, Rogerio Feris, Leonid Karlinsky

    Abstract: Recently, large-scale pre-trained Vision-and-Language (VL) foundation models have demonstrated remarkable capabilities in many zero-shot downstream tasks, achieving competitive results for recognizing objects defined by as little as short text prompts. However, it has also been shown that VL models are still brittle in Structured VL Concept (SVLC) reasoning, such as the ability to recognize object…

    Submitted 30 March, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

    Comments: Accepted by the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)

  30. arXiv:2210.03625  [pdf, other]

    cs.CL cs.CV cs.MM

    C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

    Authors: Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

    Abstract: Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in differen…

    Submitted 9 May, 2023; v1 submitted 7 October, 2022; originally announced October 2022.

    Comments: Accepted at ICASSP 2023. The code, models, and dataset are available at https://github.com/roudimit/c2kd

  31. arXiv:2209.03648  [pdf, other]

    cs.CV

    FETA: Towards Specializing Foundation Models for Expert Task Applications

    Authors: Amit Alfassy, Assaf Arbelle, Oshri Halimi, Sivan Harary, Roei Herzig, Eli Schwartz, Rameswar Panda, Michele Dolfi, Christoph Auer, Kate Saenko, Peter W. J. Staar, Rogerio Feris, Leonid Karlinsky

    Abstract: Foundation Models (FMs) have demonstrated unprecedented capabilities including zero-shot learning, high fidelity data synthesis, and out of domain generalization. However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g. retrieval of car manuals technical illustrations from language queries), data for which is either unseen or belonging to a long-tail…

    Submitted 19 December, 2022; v1 submitted 8 September, 2022; originally announced September 2022.

  32. arXiv:2206.00100  [pdf, other]

    cs.CV cs.CL

    VALHALLA: Visual Hallucination for Machine Translation

    Authors: Yi Li, Rameswar Panda, Yoon Kim, Chun-Fu Chen, Rogerio Feris, David Cox, Nuno Vasconcelos

    Abstract: Designing better machine translation systems by considering auxiliary inputs such as images has attracted much attention in recent years. While existing methods show promising performance over the conventional text-only translation systems, they typically require paired text and image as input during inference, which limits their applicability to real-world scenarios. In this paper, we introduce a…

    Submitted 31 May, 2022; originally announced June 2022.

    Comments: CVPR 2022

  33. arXiv:2203.17219  [pdf, other]

    cs.CV

    SimVQA: Exploring Simulated Environments for Visual Question Answering

    Authors: Paola Cascante-Bonilla, Hui Wu, Letao Wang, Rogerio Feris, Vicente Ordonez

    Abstract: Existing work on VQA explores data augmentation to achieve better generalization by perturbing the images in the dataset or modifying the existing questions and answers. While these methods exhibit good performance, the diversity of the questions and answers is constrained by the available image set. In this work we explore using synthetic computer-generated data to fully control the visual and l…

    Submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted to CVPR 2022. Camera-Ready version. Project page: https://simvqa.github.io/

  34. arXiv:2112.04446  [pdf, other]

    cs.CV cs.CL cs.SD eess.AS

    Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

    Authors: Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

    Abstract: Multi-modal learning from video data has seen increased attention recently as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text,…

    Submitted 18 August, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

    Comments: CVPR2022. The final published version of the proceedings will be available on IEEE Xplore

  35. arXiv:2112.02300  [pdf, other]

    cs.CV

    Unsupervised Domain Generalization by Learning a Bridge Across Domains

    Authors: Sivan Harary, Eli Schwartz, Assaf Arbelle, Peter Staar, Shady Abu-Hussein, Elad Amrani, Roei Herzig, Amit Alfassy, Raja Giryes, Hilde Kuehne, Dina Katabi, Kate Saenko, Rogerio Feris, Leonid Karlinsky

    Abstract: The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system. In this paper, different from most cross-domain works that utilize some (or full) source domain supervision, we approach a relatively new and very practical Unsupervised Domain Generaliz…

    Submitted 17 May, 2022; v1 submitted 4 December, 2021; originally announced December 2021.

  36. arXiv:2112.00054  [pdf, other]

    cs.CV cs.LG

    Task2Sim: Towards Effective Pre-training and Transfer from Synthetic Data

    Authors: Samarth Mishra, Rameswar Panda, Cheng Perng Phoo, Chun-Fu Chen, Leonid Karlinsky, Kate Saenko, Venkatesh Saligrama, Rogerio S. Feris

    Abstract: Pre-training models on Imagenet or other massive datasets of real images has led to major advances in computer vision, albeit accompanied with shortcomings related to curation cost, privacy, usage rights, and ethical issues. In this paper, for the first time, we study the transferability of pre-trained models based on synthetic data generated by graphics simulators to downstream tasks from very di…

    Submitted 28 March, 2022; v1 submitted 30 November, 2021; originally announced December 2021.

    Comments: Accepted to CVPR'22

  37. arXiv:2111.13998  [pdf, other]

    cs.CV

    Targeted Supervised Contrastive Learning for Long-Tailed Recognition

    Authors: Tianhong Li, Peng Cao, Yuan Yuan, Lijie Fan, Yuzhe Yang, Rogerio Feris, Piotr Indyk, Dina Katabi

    Abstract: Real-world data often exhibits long tail distributions with heavy class imbalance, where the majority classes can dominate the training process and alter the decision boundaries of the minority classes. Recently, researchers have investigated the potential of supervised contrastive learning for long-tailed recognition, and demonstrated that it provides a strong performance gain. In this paper, we…

    Submitted 2 May, 2022; v1 submitted 27 November, 2021; originally announced November 2021.

    Comments: The first two authors contributed equally to this paper

  38. arXiv:2111.04823  [pdf, other]

    cs.CL cs.CV cs.MM cs.SD eess.AS eess.IV

    Cascaded Multilingual Audio-Visual Learning from Videos

    Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass

    Abstract: In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that levera…

    Submitted 8 November, 2021; originally announced November 2021.

    Comments: Presented at Interspeech 2021. This version contains updated results using the YouCook-Japanese dataset

  39. arXiv:2108.10394  [pdf, other]

    cs.CV

    Dynamic Network Quantization for Efficient Video Inference

    Authors: Ximeng Sun, Rameswar Panda, Chun-Fu Chen, Aude Oliva, Rogerio Feris, Kate Saenko

    Abstract: Deep convolutional networks have recently achieved great success in video recognition, yet their practical realization remains a challenge due to the large amount of computational resources required to achieve robust recognition. Motivated by the effectiveness of quantization for boosting efficiency, in this paper, we propose a dynamic network quantization framework, that selects optimal precision…

    Submitted 23 August, 2021; originally announced August 2021.

    Comments: ICCV 2021 Camera Ready Version

  40. arXiv:2107.09106  [pdf, other]

    cs.CV cs.CL cs.LG

    Separating Skills and Concepts for Novel Visual Question Answering

    Authors: Spencer Whitehead, Hui Wu, Heng Ji, Rogerio Feris, Kate Saenko

    Abstract: Generalization to out-of-distribution data has been a problem for Visual Question Answering (VQA) models. To measure generalization to novel questions, we propose to separate them into "skills" and "concepts". "Skills" are visual tasks, such as counting or attribute recognition, and are applied to "concepts" mentioned in the question, such as objects and people. VQA methods should be able to compo…

    Submitted 19 July, 2021; originally announced July 2021.

    Comments: Paper at CVPR 2021. 14 pages, 7 figures

  41. arXiv:2106.12620  [pdf, other]

    cs.CV

    IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

    Authors: Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, Aude Oliva

    Abstract: The self-attention-based model, transformer, is recently becoming the leading backbone in the field of computer vision. In spite of the impressive success made by transformers in a variety of vision tasks, it still suffers from heavy computation and intensive memory costs. To address this limitation, this paper presents an Interpretability-Aware REDundancy REDuction framework (IA-RED$^2$). We star…

    Submitted 26 October, 2021; v1 submitted 23 June, 2021; originally announced June 2021.

    Comments: Accepted in NeurIPS 2021

  42. arXiv:2106.07807  [pdf, other]

    cs.CV

    Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data

    Authors: Ashraful Islam, Chun-Fu Chen, Rameswar Panda, Leonid Karlinsky, Rogerio Feris, Richard J. Radke

    Abstract: Most existing works in few-shot learning rely on meta-learning the network on a large base dataset which is typically from the same domain as the target dataset. We tackle the problem of cross-domain few-shot learning where there is a large shift between the base and target domain. The problem of cross-domain few-shot recognition with unlabeled target data is largely unaddressed in the literature.…

    Submitted 1 November, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

    Comments: Accepted to NeurIPS 2021

  43. arXiv:2105.05165  [pdf, other]

    cs.CV cs.AI cs.LG

    AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition

    Authors: Rameswar Panda, Chun-Fu Chen, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris

    Abstract: Multi-modal learning, which focuses on utilizing various modalities to improve the performance of a model, is widely used in video recognition. While traditional multi-modal learning offers excellent recognition results, its computational expense limits its impact for many real-world applications. In this paper, we propose an adaptive multi-modal learning framework, called AdaMML, that selects on-…

    Submitted 12 May, 2021; v1 submitted 11 May, 2021; originally announced May 2021.

  44. arXiv:2105.04489  [pdf, other]

    cs.CV cs.CL cs.LG cs.SD eess.AS

    Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

    Authors: Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva

    Abstract: When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who and how) of the observed event and exclude background information that is deemed unimportant to the observer. With this in mind, the descriptions people gener…

    Submitted 10 May, 2021; originally announced May 2021.

    Comments: To appear at CVPR 2021

  45. arXiv:2104.14082  [pdf, other]

    cs.CV

    Pseudo-IoU: Improving Label Assignment in Anchor-Free Object Detection

    Authors: Jiachen Li, Bowen Cheng, Rogerio Feris, Jinjun Xiong, Thomas S. Huang, Wen-Mei Hwu, Humphrey Shi

    Abstract: Current anchor-free object detectors are quite simple and effective yet lack accurate label assignment methods, which limits their potential in competing with classic anchor-based models that are supported by well-designed assignment methods based on the Intersection-over-Union (IoU) metric. In this paper, we present Pseudo-Intersection-over-Union (Pseudo-IoU): a simple metric that brings…

    Submitted 28 April, 2021; originally announced April 2021.

    Comments: CVPR 2021 Workshop

  46. arXiv:2104.12671  [pdf, other]

    cs.CV

    Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

    Authors: Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang

    Abstract: Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalitie…

    Submitted 3 September, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

    Comments: To be presented at ICCV 2021

    Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8012-8021

  47. arXiv:2104.09829  [pdf, other]

    cs.CV

    Detector-Free Weakly Supervised Grounding by Separation

    Authors: Assaf Arbelle, Sivan Doveh, Amit Alfassy, Joseph Shtok, Guy Lev, Eli Schwartz, Hilde Kuehne, Hila Barak Levi, Prasanna Sattigeri, Rameswar Panda, Chun-Fu Chen, Alex Bronstein, Kate Saenko, Shimon Ullman, Raja Giryes, Rogerio Feris, Leonid Karlinsky

    Abstract: Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object de…

    Submitted 20 April, 2021; originally announced April 2021.

  48. arXiv:2103.13517  [pdf, other]

    cs.CV

    A Broad Study on the Transferability of Visual Representations with Contrastive Learning

    Authors: Ashraful Islam, Chun-Fu Chen, Rameswar Panda, Leonid Karlinsky, Richard Radke, Rogerio Feris

    Abstract: Tremendous progress has been made in visual representation learning, notably with the recent success of self-supervised contrastive learning methods. Supervised contrastive learning has also been shown to outperform its cross-entropy counterparts by leveraging labels for choosing where to contrast. However, there has been little work to explore the transfer capability of contrastive learning to a…

    Submitted 15 August, 2021; v1 submitted 24 March, 2021; originally announced March 2021.

    Comments: Accepted to ICCV 2021

  49. arXiv:2103.01435  [pdf, other]

    cs.CV

    Improved Techniques for Quantizing Deep Networks with Adaptive Bit-Widths

    Authors: Ximeng Sun, Rameswar Panda, Chun-Fu Chen, Naigang Wang, Bowen Pan, Kailash Gopalakrishnan, Aude Oliva, Rogerio Feris, Kate Saenko

    Abstract: Quantizing deep networks with adaptive bit-widths is a promising technique for efficient inference across many devices and resource constraints. In contrast to static methods that repeat the quantization process and train different models for different constraints, adaptive quantization enables us to flexibly adjust the bit-widths of a single deep network during inference for instant adaptation in…

    Submitted 16 September, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

  50. arXiv:2102.07887  [pdf, other]

    cs.CV

    VA-RED$^2$: Video Adaptive Redundancy Reduction

    Authors: Bowen Pan, Rameswar Panda, Camilo Fosco, Chung-Ching Lin, Alex Andonian, Yue Meng, Kate Saenko, Aude Oliva, Rogerio Feris

    Abstract: Performing inference on deep learning models for videos remains a challenge due to the large amount of computational resources required to achieve robust recognition. An inherent property of real-world videos is the high correlation of information across frames which can translate into redundancy in either temporal or spatial feature maps of the models, or both. The type of redundant features depe…

    Submitted 4 October, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

    Comments: Accepted in ICLR 2021