Showing 1–50 of 73 results for author: Kembhavi, A

Searching in archive cs.
  1. arXiv:2406.20083  [pdf, other]

    cs.RO cs.CV

    PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

    Authors: Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Rose Hendrix, Jordi Salvador, Alvaro Herrasti, Ross Girshick, Aniruddha Kembhavi, Luca Weihs

    Abstract: We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real-world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder enabling long-term memory and reasoning. It is trained for hundreds of mil…

    Submitted 28 June, 2024; originally announced June 2024.

  2. arXiv:2406.12276  [pdf, other]

    cs.AI cs.CL cs.SE

    CodeNav: Beyond tool-use to using real-world codebases with LLM agents

    Authors: Tanmay Gupta, Luca Weihs, Aniruddha Kembhavi

    Abstract: We present CodeNav, an LLM agent that navigates and leverages previously unseen code repositories to solve user queries. In contrast to tool-use LLM agents that require "registration" of all relevant tools via manual descriptions within the LLM context, CodeNav automatically indexes and searches over code blocks in the target codebase, finds relevant code snippets, imports them, and uses them to…

    Submitted 18 June, 2024; originally announced June 2024.

  3. arXiv:2406.11775  [pdf, other]

    cs.CV cs.AI

    Task Me Anything

    Authors: Jieyu Zhang, Weikai Huang, Zixian Ma, Oscar Michel, Dong He, Tanmay Gupta, Wei-Chiu Ma, Ali Farhadi, Aniruddha Kembhavi, Ranjay Krishna

    Abstract: Benchmarks for large multimodal language models (MLMs) now serve to simultaneously assess the general capabilities of models instead of evaluating for a specific capability. As a result, when a developer wants to identify which models to use for their application, they are overwhelmed by the number of benchmarks and remain uncertain about which benchmark's results are most reflective of their spec…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: website: https://www.task-me-anything.org

  4. arXiv:2406.08953  [pdf, other]

    cs.CV cs.LG

    Preserving Identity with Variational Score for General-purpose 3D Editing

    Authors: Duong H. Le, Tuan Pham, Aniruddha Kembhavi, Stephan Mandt, Wei-Chiu Ma, Jiasen Lu

    Abstract: We present Piva (Preserving Identity with Variational Score Distillation), a novel optimization-based method for editing images and 3D models based on diffusion models. Specifically, our approach is inspired by the recently proposed method for 2D image editing - Delta Denoising Score (DDS). We pinpoint the limitations in DDS for 2D and 3D editing, which cause detail loss and over-saturation. To a…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 22 pages, 14 figures

  5. arXiv:2404.02145  [pdf, other]

    cs.CV

    Iterated Learning Improves Compositionality in Large Vision-Language Models

    Authors: Chenhao Zheng, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna

    Abstract: A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, recent investigations find that most, if not all, of our state-of-the-art vision-language models struggle at compositionality. They are unable to distinguish between images of "a girl in white facing a man…

    Submitted 16 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  6. arXiv:2403.12120  [pdf, other]

    astro-ph.IM astro-ph.SR cs.LG

    Light Curve Classification with DistClassiPy: a new distance-based classifier

    Authors: Siddharth Chaini, Ashish Mahabal, Ajit Kembhavi, Federica B. Bianco

    Abstract: The rise of synoptic sky surveys has ushered in an era of big data in time-domain astronomy, making data science and machine learning essential tools for studying celestial objects. While tree-based models (e.g. Random Forests) and deep learning models dominate the field, we explore the use of different distance metrics to aid in the classification of astrophysical objects. We developed DistClassi…

    Submitted 25 July, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: Accepted for publication in Astronomy and Computing (2024). 24 pages, 19 figures

  7. arXiv:2401.07770  [pdf, other]

    cs.CV

    Seeing the Unseen: Visual Common Sense for Semantic Placement

    Authors: Ram Ramrakhya, Aniruddha Kembhavi, Dhruv Batra, Zsolt Kira, Kuo-Hao Zeng, Luca Weihs

    Abstract: Computer vision tasks typically involve describing what is present in an image (e.g. classification, detection, segmentation, and captioning). We study a visual common sense task that requires understanding what is not present. Specifically, given an image (e.g. of a living room) and name of an object ("cushion"), a vision system is asked to predict semantically-meaningful regions (masks or boundi…

    Submitted 15 January, 2024; originally announced January 2024.

  8. arXiv:2312.17172  [pdf, other]

    cs.CV cs.AI cs.CL

    Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

    Authors: Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi

    Abstract: We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc., into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse moda…

    Submitted 28 December, 2023; originally announced December 2023.

    Comments: 38 pages, 20 figures

  9. arXiv:2312.09337  [pdf, other]

    cs.CV cs.AI cs.RO

    Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences

    Authors: Minyoung Hwang, Luca Weihs, Chanwoo Park, Kimin Lee, Aniruddha Kembhavi, Kiana Ehsani

    Abstract: Customizing robotic behaviors to be aligned with diverse human preferences is an underexplored challenge in the field of embodied AI. In this paper, we present Promptable Behaviors, a novel framework that facilitates efficient personalization of robotic agents to diverse human preferences in complex environments. We use multi-objective reinforcement learning to train a single policy adaptable to a…

    Submitted 14 December, 2023; originally announced December 2023.

  10. arXiv:2312.09067  [pdf, other]

    cs.CV cs.AI cs.CL cs.RO

    Holodeck: Language Guided Generation of 3D Embodied AI Environments

    Authors: Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, Christopher Clark

    Abstract: 3D simulated environments play a critical role in Embodied AI, but their creation requires expertise and extensive manual effort, restricting their diversity and scope. To mitigate this limitation, we present Holodeck, a system that generates 3D environments to match a user-supplied prompt fully automatedly. Holodeck can generate diverse scenes, e.g., arcades, spas, and museums, adjust the designs…

    Submitted 22 April, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

    Comments: Published in CVPR 2024, 21 pages, 27 figures, 2 tables

  11. arXiv:2312.06639  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Harmonic Mobile Manipulation

    Authors: Ruihan Yang, Yejin Kim, Aniruddha Kembhavi, Xiaolong Wang, Kiana Ehsani

    Abstract: Recent advancements in robotics have enabled robots to navigate complex scenes or manipulate diverse objects independently. However, robots are still impotent in many household tasks requiring coordinated behaviors such as opening doors. The factorization of navigation and manipulation, while effective for some tasks, fails in scenarios requiring coordinated actions. To address this challenge, we…

    Submitted 11 December, 2023; originally announced December 2023.

    Comments: More results are on our project site: https://rchalyang.github.io/HarmonicMM/

  12. arXiv:2312.02976  [pdf, other]

    cs.RO cs.AI cs.CV

    SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World

    Authors: Kiana Ehsani, Tanmay Gupta, Rose Hendrix, Jordi Salvador, Luca Weihs, Kuo-Hao Zeng, Kunal Pratap Singh, Yejin Kim, Winson Han, Alvaro Herrasti, Ranjay Krishna, Dustin Schwenk, Eli VanderBilt, Aniruddha Kembhavi

    Abstract: Reinforcement learning (RL) with dense rewards and imitation learning (IL) with human-generated trajectories are the most widely used approaches for training modern embodied agents. RL requires extensive reward shaping and auxiliary losses and is often too slow and ineffective for long-horizon tasks. While IL with human supervision is effective, collecting human trajectories at scale is extremely…

    Submitted 7 August, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: First six authors contributed equally. Project page: https://spoc-robot.github.io/

  13. arXiv:2311.18082  [pdf, other]

    cs.CV

    Zooming Out on Zooming In: Advancing Super-Resolution for Remote Sensing

    Authors: Piper Wolters, Favyen Bastani, Aniruddha Kembhavi

    Abstract: Super-Resolution for remote sensing has the potential for huge impact on planet monitoring by producing accurate and realistic high resolution imagery on a frequent basis and a global scale. Despite a lot of attention, several inconsistencies and challenges have prevented it from being deployed in practice. These include the lack of effective metrics, fragmented and relatively small-scale datasets…

    Submitted 29 November, 2023; originally announced November 2023.

  14. arXiv:2311.04193  [pdf, other]

    cs.CV cs.AI

    Selective Visual Representations Improve Convergence and Generalization for Embodied AI

    Authors: Ainaz Eftekhar, Kuo-Hao Zeng, Jiafei Duan, Ali Farhadi, Ani Kembhavi, Ranjay Krishna

    Abstract: Embodied AI models often employ off the shelf vision backbones like CLIP to encode their visual observations. Although such general purpose representations encode rich syntactic and semantic information about the scene, much of this information is often irrelevant to the specific task at hand. This introduces noise within the learning process and distracts the agent's focus from task-relevant visu…

    Submitted 9 March, 2024; v1 submitted 7 November, 2023; originally announced November 2023.

    Comments: See project website: https://embodied-codebook.github.io

  15. arXiv:2310.08864  [pdf, other]

    cs.RO

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, et al. (267 additional authors not shown)

    Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning method…

    Submitted 1 June, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

    Comments: Project website: https://robotics-transformer-x.github.io

  16. arXiv:2307.11073  [pdf, other]

    cs.CV cs.AI cs.GR

    OBJECT 3DIT: Language-guided 3D-aware Image Editing

    Authors: Oscar Michel, Anand Bhattad, Eli VanderBilt, Ranjay Krishna, Aniruddha Kembhavi, Tanmay Gupta

    Abstract: Existing image editing tools, while powerful, typically disregard the underlying 3D geometry from which the image is projected. As a result, edits made using these tools may become detached from the geometry and lighting conditions that are at the foundation of the image formation process. In this work, we formulate the new task of language-guided 3D-aware editing, where objects in an image should…

    Submitted 20 July, 2023; originally announced July 2023.

  17. arXiv:2307.05663  [pdf, other]

    cs.CV cs.AI

    Objaverse-XL: A Universe of 10M+ 3D Objects

    Authors: Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, Ali Farhadi

    Abstract: Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data. In this work, we present Objaverse-XL, a dataset of over 10 million 3D objects. Our dataset comprises deduplicated 3D objects…

    Submitted 11 July, 2023; originally announced July 2023.

  18. arXiv:2306.15128  [pdf, other]

    cs.CV cs.AI cs.LG

    MIMIC: Masked Image Modeling with Image Correspondences

    Authors: Kalyani Marathe, Mahtab Bigverdi, Nishat Khan, Tuhin Kundu, Patrick Howe, Sharan Ranjit S, Anand Bhattad, Aniruddha Kembhavi, Linda G. Shapiro, Ranjay Krishna

    Abstract: Dense pixel-specific representation learning at scale has been bottlenecked due to the unavailability of large-scale multi-view datasets. Current methods for building effective pretraining datasets heavily rely on annotated 3D meshes, point clouds, and camera parameters from simulated environments, preventing them from building datasets from real-world data sources where such metadata is lacking.…

    Submitted 15 May, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

  19. arXiv:2306.14610  [pdf, other]

    cs.CV cs.CL cs.LG

    SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality

    Authors: Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, Ranjay Krishna

    Abstract: In the last year alone, a surge of new benchmarks to measure compositional understanding of vision-language models have permeated the machine learning ecosystem. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. Surprisingly, we find significant biases in all these benchmarks rendering them hackable. This hackabi…

    Submitted 26 June, 2023; originally announced June 2023.

  20. arXiv:2306.10191  [pdf, other]

    cs.LG cs.AI cs.CV

    Neural Priming for Sample-Efficient Adaptation

    Authors: Matthew Wallingford, Vivek Ramanujan, Alex Fang, Aditya Kusupati, Roozbeh Mottaghi, Aniruddha Kembhavi, Ludwig Schmidt, Ali Farhadi

    Abstract: We propose Neural Priming, a technique for adapting large pretrained models to distribution shifts and downstream tasks given few or no labeled examples. Presented with class names or unlabeled test samples, Neural Priming enables the model to recall and condition its parameters on relevant data seen throughout pretraining, thereby priming it for the test distribution. Neural Priming can be perfo…

    Submitted 4 December, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

    Comments: 18 pages, 7 figures, 9 tables

  21. arXiv:2303.16133  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models

    Authors: Adyasha Maharana, Amita Kamath, Christopher Clark, Mohit Bansal, Aniruddha Kembhavi

    Abstract: As general purpose vision models get increasingly effective at a wide set of tasks, it is imperative that they be consistent across the tasks they support. Inconsistent AI models are considered brittle and untrustworthy by human users and are more challenging to incorporate into larger systems that take dependencies on their outputs. Measuring consistency between very heterogeneous tasks that migh…

    Submitted 21 February, 2024; v1 submitted 28 March, 2023; originally announced March 2023.

    Comments: TMLR 2024; Project Website: https://adymaharana.github.io/cococon/

  22. arXiv:2301.04101  [pdf, other]

    cs.CV cs.LG

    Neural Radiance Field Codebooks

    Authors: Matthew Wallingford, Aditya Kusupati, Alex Fang, Vivek Ramanujan, Aniruddha Kembhavi, Roozbeh Mottaghi, Ali Farhadi

    Abstract: Compositional representations of the world are a promising step towards enabling high-level scene understanding and efficient transfer to downstream tasks. Learning such representations for complex scenes and tasks remains an open challenge. Towards this goal, we introduce Neural Radiance Field Codebooks (NRC), a scalable method for learning object-centric representations through novel view recons…

    Submitted 30 April, 2023; v1 submitted 10 January, 2023; originally announced January 2023.

    Comments: 19 pages, 8 figures, 9 tables

    Journal ref: International Conference on Learning Representations 2023

  23. arXiv:2212.08051  [pdf, other]

    cs.CV cs.AI cs.GR cs.RO

    Objaverse: A Universe of Annotated 3D Objects

    Authors: Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, Ali Farhadi

    Abstract: Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today's benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets…

    Submitted 15 December, 2022; originally announced December 2022.

    Comments: Website: objaverse.allenai.org

  24. arXiv:2212.04819  [pdf, other]

    cs.RO cs.AI cs.CV

    Phone2Proc: Bringing Robust Robots Into Our Chaotic World

    Authors: Matt Deitke, Rose Hendrix, Luca Weihs, Ali Farhadi, Kiana Ehsani, Aniruddha Kembhavi

    Abstract: Training embodied agents in simulation has become mainstream for the embodied AI community. However, these agents often struggle when deployed in the physical world due to their inability to generalize to real-world environments. In this paper, we present Phone2Proc, a method that uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes that are…

    Submitted 8 December, 2022; originally announced December 2022.

    Comments: https://allenai.org/project/phone2proc

  25. arXiv:2212.01186  [pdf, other]

    cs.CV cs.AI

    A General Purpose Supervisory Signal for Embodied Agents

    Authors: Kunal Pratap Singh, Jordi Salvador, Luca Weihs, Aniruddha Kembhavi

    Abstract: Training effective embodied AI agents often involves manual reward engineering, expert imitation, specialized components such as maps, or leveraging additional sensors for depth and localization. Another approach is to use neural architectures alongside self-supervised objectives which encourage better representation learning. In practice, there are few guarantees that these self-supervised object…

    Submitted 1 December, 2022; originally announced December 2022.

  26. arXiv:2211.15660  [pdf, other]

    cs.CV

    SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding

    Authors: Favyen Bastani, Piper Wolters, Ritwik Gupta, Joe Ferdinando, Aniruddha Kembhavi

    Abstract: Remote sensing images are useful for a wide variety of planet monitoring applications, from tracking deforestation to tackling illegal fishing. The Earth is extremely diverse -- the amount of potential tasks in remote sensing images is massive, and the sizes of features range from several kilometers to just tens of centimeters. However, creating generalizable computer vision methods is a challenge…

    Submitted 21 August, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

    Comments: ICCV 2023

  27. arXiv:2211.11559  [pdf, other]

    cs.CV cs.AI cs.CL

    Visual Programming: Compositional visual reasoning without training

    Authors: Tanmay Gupta, Aniruddha Kembhavi

    Abstract: We present VISPROG, a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions. VISPROG avoids the need for any task-specific training. Instead, it uses the in-context learning ability of large language models to generate Python-like modular programs, which are then executed to get both the solution and a comprehensive and interpretable rational…

    Submitted 18 November, 2022; originally announced November 2022.

  28. arXiv:2211.09778  [pdf, other]

    cs.CV cs.CL

    I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision

    Authors: Sophia Gu, Christopher Clark, Aniruddha Kembhavi

    Abstract: Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether it is possible to learn those skills from text data and then transfer them to vision tasks without ever training on visual training data. Ke…

    Submitted 18 August, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

    Comments: website (https://prior.allenai.org/projects/close), code (https://github.com/allenai/close)

  29. arXiv:2211.08388  [pdf, other]

    astro-ph.GA astro-ph.IM cs.LG

    Photometric identification of compact galaxies, stars and quasars using multiple neural networks

    Authors: Siddharth Chaini, Atharva Bagul, Anish Deshpande, Rishi Gondkar, Kaushal Sharma, M. Vivek, Ajit Kembhavi

    Abstract: We present MargNet, a deep learning-based classifier for identifying stars, quasars and compact galaxies using photometric parameters and images from the Sloan Digital Sky Survey (SDSS) Data Release 16 (DR16) catalogue. MargNet consists of a combination of Convolutional Neural Network (CNN) and Artificial Neural Network (ANN) architectures. Using a carefully curated dataset consisting of 240,000 c…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: 14 pages, 10 figures, Accepted for publication in MNRAS

  30. arXiv:2210.06849  [pdf, other]

    cs.CV

    Retrospectives on the Embodied AI Workshop

    Authors: Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, et al. (14 additional authors not shown)

    Abstract: We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of…

    Submitted 4 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

  31. arXiv:2206.08916  [pdf, other]

    cs.CV

    Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

    Authors: Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi

    Abstract: We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for suc…

    Submitted 4 October, 2022; v1 submitted 17 June, 2022; originally announced June 2022.

  32. arXiv:2206.08500  [pdf, other]

    cs.CV cs.LG cs.RO

    What do navigation agents learn about their environment?

    Authors: Kshitij Dwivedi, Gemma Roig, Aniruddha Kembhavi, Roozbeh Mottaghi

    Abstract: Today's state of the art visual navigation agents typically consist of large deep learning models trained end to end. Such models offer little to no interpretability about the learned skills or the actions of the agent taken in response to its environment. While past works have explored interpreting deep learning models, little attention has been devoted to interpreting embodied AI systems, which…

    Submitted 16 June, 2022; originally announced June 2022.

    Comments: CVPR 2022

  33. arXiv:2206.06994  [pdf, other]

    cs.AI cs.CV cs.RO

    ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

    Authors: Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Jordi Salvador, Kiana Ehsani, Winson Han, Eric Kolve, Ali Farhadi, Aniruddha Kembhavi, Roozbeh Mottaghi

    Abstract: Massive datasets and high-capacity models have driven many recent advancements in computer vision and natural language understanding. This work presents a platform to enable similar success stories in Embodied AI. We propose ProcTHOR, a framework for procedural generation of Embodied AI environments. ProcTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, an…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: ProcTHOR website: https://procthor.allenai.org

  34. arXiv:2204.13653  [pdf, other]

    cs.CV

    GRIT: General Robust Image Task Benchmark

    Authors: Tanmay Gupta, Ryan Marten, Aniruddha Kembhavi, Derek Hoiem

    Abstract: Computer vision models excel at making predictions when the test distribution closely resembles the training distribution. Such models have yet to match the ability of biological vision to learn from multiple sources and generalize to new data sources and tasks. To facilitate the development and evaluation of more general vision systems, we introduce the General Robust Image Task (GRIT) benchmark.…

    Submitted 2 May, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

  35. arXiv:2203.08141  [pdf, other]

    cs.CV cs.LG cs.RO

    Object Manipulation via Visual Target Localization

    Authors: Kiana Ehsani, Ali Farhadi, Aniruddha Kembhavi, Roozbeh Mottaghi

    Abstract: Object manipulation is a critical skill required for Embodied AI agents interacting with the world around them. Training agents to manipulate objects poses many challenges. These include occlusion of the target object by the agent's arm, noisy object detection and localization, and the target frequently going out of view as the agent moves around in the scene. We propose Manipulation via Visual O…

    Submitted 15 March, 2022; originally announced March 2022.

  36. arXiv:2202.06987  [pdf, other]

    cs.CV cs.AI

    ASC me to Do Anything: Multi-task Training for Embodied AI

    Authors: Jiasen Lu, Jordi Salvador, Roozbeh Mottaghi, Aniruddha Kembhavi

    Abstract: Embodied AI has seen steady progress across a diverse set of independent tasks. While these varied tasks have different end goals, the basic skills required to complete them successfully overlap significantly. In this paper, our goal is to leverage these shared skills to learn to perform multiple tasks jointly. We propose Atomic Skill Completion (ASC), an approach for multi-task training for Embod…

    Submitted 14 February, 2022; originally announced February 2022.

    Comments: 22 pages, 11 figures

  37. arXiv:2202.02317  [pdf, other]

    cs.CV cs.CL

    Webly Supervised Concept Expansion for General Purpose Vision Models

    Authors: Amita Kamath, Christopher Clark, Tanmay Gupta, Eric Kolve, Derek Hoiem, Aniruddha Kembhavi

    Abstract: General Purpose Vision (GPV) systems are models that are designed to solve a wide array of visual tasks without requiring architectural changes. Today, GPVs primarily learn both skills and concepts from large fully supervised datasets. Scaling GPVs to tens of thousands of concepts by acquiring data to learn each concept for every skill quickly becomes prohibitive. This work presents an effective a…

    Submitted 20 July, 2022; v1 submitted 4 February, 2022; originally announced February 2022.

    Comments: ECCV 2022

  38. arXiv:2112.00800  [pdf, other]

    cs.CL cs.AI

    Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text

    Authors: Christopher Clark, Jordi Salvador, Dustin Schwenk, Derrick Bonafilia, Mark Yatskar, Eric Kolve, Alvaro Herrasti, Jonghyun Choi, Sachin Mehta, Sam Skjonsberg, Carissa Schoenick, Aaron Sarnat, Hannaneh Hajishirzi, Aniruddha Kembhavi, Oren Etzioni, Ali Farhadi

    Abstract: Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics (e.g., metaphors or analogies), and at times multi-modal gestures (e.g., pointing with a finger, or an arrow in a diagram). We investigate these challenges in the context of Iconary, a collaborative game of drawing and guessing based on Pictionary, that poses a novel challeng…

    Submitted 1 December, 2021; originally announced December 2021.

    Comments: In EMNLP 2021

  39. arXiv:2111.09888  [pdf, other]

    cs.CV cs.LG

    Simple but Effective: CLIP Embeddings for Embodied AI

    Authors: Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, Aniruddha Kembhavi

    Abstract: Contrastive language image pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks from classification and detection to captioning and image manipulation. We investigate the effectiveness of CLIP visual backbones for Embodied AI tasks. We build incredibly simple baselines, named EmbCLIP, with no task specific architectures, inductive biases (such as the use of sema…

    Submitted 14 April, 2022; v1 submitted 18 November, 2021; originally announced November 2021.

    Comments: Published in CVPR 2022

  40. arXiv:2106.04531  [pdf, other]

    cs.CV cs.RO

    RobustNav: Towards Benchmarking Robustness in Embodied Navigation

    Authors: Prithvijit Chattopadhyay, Judy Hoffman, Roozbeh Mottaghi, Aniruddha Kembhavi

    Abstract: As an attempt towards assessing the robustness of embodied navigation agents, we propose RobustNav, a framework to quantify the performance of embodied navigation agents when exposed to a wide variety of visual - affecting RGB inputs - and dynamics - affecting transition dynamics - corruptions. Most recent efforts in visual navigation have typically focused on generalizing to novel target environm…

    Submitted 8 June, 2021; originally announced June 2021.

    Comments: 18 pages, 8 figures, Code: https://github.com/allenai/robustnav

  41. arXiv:2106.01401  [pdf, other]

    cs.CV

    Container: Context Aggregation Network

    Authors: Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi

    Abstract: Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers -- originally introduced in natural language processing -- have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising…

    Submitted 18 October, 2021; v1 submitted 2 June, 2021; originally announced June 2021.

    Comments: NeurIPS 2021

  42. arXiv:2106.00188  [pdf, other]

    cs.CL cs.AI

    PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World

    Authors: Rowan Zellers, Ari Holtzman, Matthew Peters, Roozbeh Mottaghi, Aniruddha Kembhavi, Ali Farhadi, Yejin Choi

    Abstract: We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model, and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown, plastic ones don't. We then use it as the interface to our language mode…

    Submitted 30 January, 2022; v1 submitted 31 May, 2021; originally announced June 2021.

    Comments: ACL 2021 camera ready, project page at https://rowanzellers.com/piglet/

  43. arXiv:2105.00931  [pdf, other]

    cs.CV cs.AI cs.LG cs.MA

    GridToPix: Training Embodied Agents with Minimal Supervision

    Authors: Unnat Jain, Iou-Jen Liu, Svetlana Lazebnik, Aniruddha Kembhavi, Luca Weihs, Alexander Schwing

    Abstract: While deep reinforcement learning (RL) promises freedom from hand-labeled data, great successes, especially for Embodied AI, require significant work to create supervision via carefully shaped rewards. Indeed, without shaped rewards, i.e., with only terminal rewards, present-day Embodied AI results degrade significantly across Embodied AI problems from single-agent Habitat-based PointGoal Navigati…

    Submitted 13 October, 2021; v1 submitted 14 April, 2021; originally announced May 2021.

    Comments: Project page: https://unnat.github.io/gridtopix/ ; last two authors contributed equally

  44. arXiv:2104.11213  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    ManipulaTHOR: A Framework for Visual Object Manipulation

    Authors: Kiana Ehsani, Winson Han, Alvaro Herrasti, Eli VanderBilt, Luca Weihs, Eric Kolve, Aniruddha Kembhavi, Roozbeh Mottaghi

    Abstract: The domain of Embodied AI has recently witnessed substantial progress, particularly in navigating agents within their environments. These early successes have laid the building blocks for the community to tackle tasks that require agents to actively interact with objects in their environment. Object manipulation is an established research domain within the robotics community and poses several chal…

    Submitted 22 April, 2021; originally announced April 2021.

    Comments: CVPR 2021 -- (Oral presentation)

  45. arXiv:2104.00990  [pdf, other]

    cs.CV cs.CL

    Visual Semantic Role Labeling for Video Understanding

    Authors: Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, Aniruddha Kembhavi

    Abstract: We propose a new framework for understanding and representing related salient events in a video using visual semantic role labeling. We represent videos as a set of related events, wherein each event consists of a verb and multiple entities that fulfill various roles relevant to that event. To study the challenging task of semantic role labeling in videos or VidSRL, we introduce the VidSitu benchm…

    Submitted 2 April, 2021; originally announced April 2021.

    Comments: CVPR21 camera-ready including appendix. Project Page at https://vidsitu.org/

  46. arXiv:2104.00743  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Towards General Purpose Vision Systems

    Authors: Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, Derek Hoiem

    Abstract: Computer vision systems today are primarily N-purpose systems, designed and trained for a predefined set of tasks. Adapting such systems to new tasks is challenging and often requires non-trivial modifications to the network architecture (e.g. adding new output heads) or training process (e.g. adding new losses). To reduce the time and expertise required to develop new applications, we would like…

    Submitted 19 April, 2022; v1 submitted 1 April, 2021; originally announced April 2021.

    Comments: CVPR 2022 Oral; Project page: https://prior.allenai.org/projects/gpv

  47. arXiv:2103.16544  [pdf, other]

    cs.CV cs.RO

    Visual Room Rearrangement

    Authors: Luca Weihs, Matt Deitke, Aniruddha Kembhavi, Roozbeh Mottaghi

    Abstract: There has been significant recent progress in the field of Embodied AI with researchers developing models and algorithms enabling embodied agents to navigate and interact within completely unseen environments. In this paper, we propose a new dataset and baseline models for the task of Rearrangement. We particularly focus on the task of Room Rearrangement: an agent begins by exploring a room and…

    Submitted 30 March, 2021; originally announced March 2021.

    Comments: CVPR 2021 - Oral Presentation

  48. arXiv:2009.11278  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

    Authors: Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi

    Abstract: Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state of the art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This begs the question: Can these model…

    Submitted 23 September, 2020; originally announced September 2020.

    Comments: EMNLP 2020

  49. arXiv:2008.12760  [pdf, other]

    cs.CV cs.AI cs.LG cs.MA cs.RO

    AllenAct: A Framework for Embodied AI Research

    Authors: Luca Weihs, Jordi Salvador, Klemen Kotar, Unnat Jain, Kuo-Hao Zeng, Roozbeh Mottaghi, Aniruddha Kembhavi

    Abstract: The domain of Embodied AI, in which agents learn to complete tasks through interaction with their environment from egocentric observations, has experienced substantial growth with the advent of deep reinforcement learning and increased interest from the computer vision, NLP, and robotics communities. This growth has been facilitated by the creation of a large number of simulated environments (such…

    Submitted 28 August, 2020; originally announced August 2020.

  50. arXiv:2007.12173  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Bridging the Imitation Gap by Adaptive Insubordination

    Authors: Luca Weihs, Unnat Jain, Iou-Jen Liu, Jordi Salvador, Svetlana Lazebnik, Aniruddha Kembhavi, Alexander Schwing

    Abstract: In practice, imitation learning is preferred over pure reinforcement learning whenever it is possible to design a teaching agent to provide expert supervision. However, we show that when the teaching agent makes decisions with access to privileged information that is unavailable to the student, this information is marginalized during imitation learning, resulting in an "imitation gap" and, potenti…

    Submitted 3 December, 2021; v1 submitted 23 July, 2020; originally announced July 2020.

    Comments: NeurIPS'21 version. The first two authors contributed equally. Project page: https://unnat.github.io/advisor/