
Showing 1–50 of 172 results for author: Savarese, S

  1. arXiv:2406.18518  [pdf, other]

    cs.CL cs.AI cs.LG cs.SE

    APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

    Authors: Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong

    Abstract: The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scal…

    Submitted 26 June, 2024; originally announced June 2024.

  2. arXiv:2406.11271  [pdf, other]

    cs.CV cs.LG

    MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

    Authors: Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt

    Abstract: Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, diverse open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimo…

    Submitted 17 June, 2024; originally announced June 2024.

  3. arXiv:2406.10290  [pdf, other]

    cs.CL cs.AI cs.LG

    MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases

    Authors: Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese

    Abstract: The deployment of Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices has gained significant attention due to the benefits of enhanced privacy, stability, and personalization. However, the hardware constraints of mobile devices necessitate the use of models with fewer parameters and model compression techniques like quantization. Currently, there is limited understand…

    Submitted 12 June, 2024; originally announced June 2024.

  4. arXiv:2404.07972  [pdf, other]

    cs.AI cs.CL

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Authors: Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu

    Abstract: Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature…

    Submitted 30 May, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

    Comments: 51 pages, 21 figures

  5. arXiv:2403.09227  [pdf, other]

    cs.RO cs.AI

    BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

    Authors: Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Wensi Ai, Benjamin Martinez, Hang Yin, Michael Lingelbach, Minjune Hwang, Ayano Hiranaka, Sujay Garlanka, Arman Aydin, Sharon Lee, Jiankai Sun, Mona Anvari, Manasi Sharma, Dhruva Bansal, Samuel Hunter, Kyu-Young Kim, Alan Lou, Caleb R Matthews, et al. (10 additional authors not shown)

    Abstract: We present BEHAVIOR-1K, a comprehensive simulation benchmark for human-centered robotics. BEHAVIOR-1K includes two components, guided and motivated by the results of an extensive survey on "what do you want robots to do for you?". The first is the definition of 1,000 everyday activities, grounded in 50 scenes (houses, gardens, restaurants, offices, etc.) with more than 9,000 objects annotated with…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: A preliminary version was published at 6th Conference on Robot Learning (CoRL 2022)

  6. arXiv:2403.04857  [pdf, other]

    hep-ph astro-ph.HE hep-ex

    Dark Matter Line Searches with the Cherenkov Telescope Array

    Authors: S. Abe, J. Abhir, A. Abhishek, F. Acero, A. Acharyya, R. Adam, A. Aguasca-Cabot, I. Agudo, A. Aguirre-Santaella, J. Alfaro, R. Alfaro, N. Alvarez-Crespo, R. Alves Batista, J. -P. Amans, E. Amato, G. Ambrosi, L. Angel, C. Aramo, C. Arcaro, T. T. H. Arnesen, L. Arrabito, K. Asano, Y. Ascasibar, J. Aschersleben, H. Ashkar, et al. (540 additional authors not shown)

    Abstract: Monochromatic gamma-ray signals constitute a potential smoking gun signature for annihilating or decaying dark matter particles that could relatively easily be distinguished from astrophysical or instrumental backgrounds. We provide an updated assessment of the sensitivity of the Cherenkov Telescope Array (CTA) to such signals, based on observations of the Galactic centre region as well as of sele…

    Submitted 7 March, 2024; originally announced March 2024.

    Comments: 43 pages JCAP style (excluding author list and references), 19 figures

  7. arXiv:2402.15538  [pdf, other]

    cs.MA cs.AI

    AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System

    Authors: Zhiwei Liu, Weiran Yao, Jianguo Zhang, Liangwei Yang, Zuxin Liu, Juntao Tan, Prafulla K. Choubey, Tian Lan, Jason Wu, Huan Wang, Shelby Heinecke, Caiming Xiong, Silvio Savarese

    Abstract: The booming success of LLMs initiates rapid development in LLM agents. Though the foundation of an LLM agent is the generative model, it is critical to devise the optimal reasoning strategies and agent architectures. Accordingly, LLM agent research advances from the simple chain-of-thought prompting to more complex ReAct and Reflection reasoning strategy; agent architecture also evolves from singl…

    Submitted 23 February, 2024; originally announced February 2024.

    Comments: preprint. Library is available at https://github.com/SalesforceAIResearch/AgentLite

  8. arXiv:2402.15506  [pdf, other]

    cs.AI cs.CL cs.LG

    AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning

    Authors: Jianguo Zhang, Tian Lan, Rithesh Murthy, Zhiwei Liu, Weiran Yao, Juntao Tan, Thai Hoang, Liangwei Yang, Yihao Feng, Zuxin Liu, Tulika Awalgaonkar, Juan Carlos Niebles, Silvio Savarese, Shelby Heinecke, Huan Wang, Caiming Xiong

    Abstract: Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories. In this paper, we introduce AgentOhana as a comprehensive solution to address these challenges. …

    Submitted 20 March, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

    Comments: Add GitHub repo link at https://github.com/SalesforceAIResearch/xLAM and HuggingFace model link at https://huggingface.co/Salesforce/xLAM-v0.1-r

  9. arXiv:2402.10941  [pdf, other]

    cs.CL cs.AI cs.LG

    Text2Data: Low-Resource Data Generation with Textual Control

    Authors: Shiyu Wang, Yihao Feng, Tian Lan, Ning Yu, Yu Bai, Ran Xu, Huan Wang, Caiming Xiong, Silvio Savarese

    Abstract: Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines. Recognizing the importance of this interface, the machine learning community is investing considerable effort in generating data that is semantically coherent with textual instructions. While strides have been made in text-to-data generation spanning image editing, audio synthesi…

    Submitted 7 February, 2024; originally announced February 2024.

    Comments: We propose a method that can achieve text-to-data generation under low-resource situation

  10. arXiv:2402.02592  [pdf, other]

    cs.LG cs.AI

    Unified Training of Universal Time Series Forecasting Transformers

    Authors: Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, Doyen Sahoo

    Abstract: Deep learning for time series forecasting has traditionally operated within a one-model-per-dataset framework, limiting its potential to leverage the game-changing impact of large pre-trained models. The concept of universal forecasting, emerging from pre-training on a vast collection of time series datasets, envisions a single Large Time Series Model capable of addressing diverse downstream forec…

    Submitted 22 May, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

  11. arXiv:2401.10495  [pdf, ps, other]

    cs.LG cs.AI stat.ME

    Causal Layering via Conditional Entropy

    Authors: Itai Feigenbaum, Devansh Arpit, Huan Wang, Shelby Heinecke, Juan Carlos Niebles, Weiran Yao, Caiming Xiong, Silvio Savarese

    Abstract: Causal discovery aims to recover information about an unobserved causal graph from the observable data it generates. Layerings are orderings of the variables which place causes before effects. In this paper, we provide ways to recover layerings of a graph by accessing the data via a conditional entropy oracle, when distributions are discrete. Our algorithms work by repeatedly removing sources or s…

    Submitted 19 January, 2024; originally announced January 2024.

  12. arXiv:2401.07526  [pdf, other]

    cs.CL cs.AI cs.LG

    Editing Arbitrary Propositions in LLMs without Subject Labels

    Authors: Itai Feigenbaum, Devansh Arpit, Huan Wang, Shelby Heinecke, Juan Carlos Niebles, Weiran Yao, Caiming Xiong, Silvio Savarese

    Abstract: Large Language Model (LLM) editing modifies factual information in LLMs. Locate-and-Edit (L&E) methods accomplish this by finding where relevant information is stored within the neural network, and editing the weights at that location. The goal of editing is to modify the response of an LLM to a proposition independently of its phrasing, while not modifying its response to other related propositi…

    Submitted 15 January, 2024; originally announced January 2024.

  13. arXiv:2311.18799  [pdf, other]

    cs.CV cs.CL

    X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning

    Authors: Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles

    Abstract: Vision-language pre-training and instruction tuning have demonstrated general-purpose capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art large language models (LLMs). In this paper, we introduce a simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities without extensive modality-specific custo…

    Submitted 30 November, 2023; originally announced November 2023.

  14. arXiv:2311.09346  [pdf, other]

    cs.CV cs.LG cs.RO

    Nothing Stands Still: A Spatiotemporal Benchmark on 3D Point Cloud Registration Under Large Geometric and Temporal Change

    Authors: Tao Sun, Yan Hao, Shengyu Huang, Silvio Savarese, Konrad Schindler, Marc Pollefeys, Iro Armeni

    Abstract: Building 3D geometric maps of man-made spaces is a well-established and active field that is fundamental to computer vision and robotics. However, considering the evolving nature of built environments, it is essential to question the capabilities of current mapping efforts in handling temporal changes. In addition, spatiotemporal mapping holds significant potential for achieving sustainability and…

    Submitted 15 November, 2023; originally announced November 2023.

    Comments: 27 pages, 29 figures. For the project page, see http://nothing-stands-still.com

  15. arXiv:2310.10616  [pdf, other]

    cs.LG

    How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations

    Authors: Tianyu Guo, Wei Hu, Song Mei, Huan Wang, Caiming Xiong, Silvio Savarese, Yu Bai

    Abstract: While large language models based on the transformer architecture have demonstrated remarkable in-context learning (ICL) capabilities, understandings of such capabilities are still in an early stage, where existing theory and mechanistic understanding focus mostly on simple scenarios such as learning simple function classes. This paper takes initial steps on understanding ICL in more complex scena…

    Submitted 16 October, 2023; originally announced October 2023.

  16. MORFEO enters final design phase

    Authors: Lorenzo Busoni, Guido Agapito, Alessandro Ballone, Alfio Puglisi, Alexander Goncharov, Amedeo Petrella, Amico Di Cianno, Andrea Balestra, Andrea Baruffolo, Andrea Bianco, Andrea Di Dato, Angelo Valentini, Benedetta Di Francesco, Benoit Sassolas, Bernardo Salasnich, Carmelo Arcidiacono, Cedric Plantet, Christian Eredia, Daniela Fantinel, Danilo Selvestrel, Deborah Malone, Demetrio Magrin, Domenico D'Auria, Edoardo Redaelli, Elena Carolo, et al. (59 additional authors not shown)

    Abstract: MORFEO (Multi-conjugate adaptive Optics Relay For ELT Observations, formerly MAORY), the MCAO system for the ELT, will provide diffraction-limited optical quality to the large field camera MICADO. MORFEO has officially passed the Preliminary Design Review and it is entering the final design phase. We present the current status of the project, with a focus on the adaptive optics system aspects and…

    Submitted 13 October, 2023; originally announced October 2023.

  17. arXiv:2309.03450  [pdf, other]

    cs.CL cs.AI cs.LG

    XGen-7B Technical Report

    Authors: Erik Nijkamp, Tian Xie, Hiroaki Hayashi, Bo Pang, Congying Xia, Chen Xing, Jesse Vig, Semih Yavuz, Philippe Laban, Ben Krause, Senthil Purushwalkam, Tong Niu, Wojciech Kryściński, Lidiya Murakhovs'ka, Prafulla Kumar Choubey, Alex Fabbri, Ye Liu, Rui Meng, Lifu Tu, Meghana Bhat, Chien-Sheng Wu, Silvio Savarese, Yingbo Zhou, Shafiq Joty, Caiming Xiong

    Abstract: Large Language Models (LLMs) have become ubiquitous across various domains, transforming the way we interact with information and conduct research. However, most high-performing LLMs remain confined behind proprietary walls, hindering scientific progress. Most open-source LLMs, on the other hand, are limited in their ability to support longer sequence lengths, which is a key requirement for many t…

    Submitted 6 September, 2023; originally announced September 2023.

  18. arXiv:2308.08169  [pdf, other]

    cs.CL cs.AI

    Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System

    Authors: Jianguo Zhang, Stephen Roller, Kun Qian, Zhiwei Liu, Rui Meng, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong

    Abstract: End-to-end task-oriented dialogue (TOD) systems have achieved promising performance by leveraging sophisticated natural language understanding and natural language generation capabilities of pre-trained models. This work enables the TOD systems with more flexibility through a simple cache. The cache provides the flexibility to dynamically update the TOD systems and handle both existing and unseen…

    Submitted 16 August, 2023; originally announced August 2023.

    Comments: Accepted by SIGDIAL 2023 as a long paper

  19. arXiv:2308.05960  [pdf, other]

    cs.AI

    BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents

    Authors: Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese

    Abstract: The massive successes of large language models (LLMs) encourage the emerging exploration of LLM-augmented Autonomous Agents (LAAs). An LAA is able to generate actions with its core LLM and interact with environments, which facilitates the ability to resolve complex tasks by conditioning on past interactions such as observations and actions. Since the investigation of LAA is still very recent, limi…

    Submitted 11 August, 2023; originally announced August 2023.

    Comments: Preprint

  20. arXiv:2308.02151  [pdf, other]

    cs.CL cs.AI

    Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization

    Authors: Weiran Yao, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Yihao Feng, Le Xue, Rithesh Murthy, Zeyuan Chen, Jianguo Zhang, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese

    Abstract: Recent months have seen the emergence of a powerful new trend in which large language models (LLMs) are augmented to become autonomous language agents capable of performing objective oriented multi-step tasks on their own, rather than merely responding to queries from human users. Most existing language agents, however, are not optimized using environment-specific rewards. Although some agents ena…

    Submitted 5 May, 2024; v1 submitted 4 August, 2023; originally announced August 2023.

  21. arXiv:2307.10172  [pdf, other]

    cs.CL cs.AI

    DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI

    Authors: Jianguo Zhang, Kun Qian, Zhiwei Liu, Shelby Heinecke, Rui Meng, Ye Liu, Zhou Yu, Huan Wang, Silvio Savarese, Caiming Xiong

    Abstract: Despite advancements in conversational AI, language models encounter challenges to handle diverse conversational tasks, and existing dialogue dataset collections often lack diversity and comprehensiveness. To tackle these issues, we introduce DialogStudio: the largest and most diverse collection of dialogue datasets, unified under a consistent format while preserving their original information. Ou…

    Submitted 5 February, 2024; v1 submitted 19 July, 2023; originally announced July 2023.

    Comments: 17 pages, accepted by EACL 2024 Findings as a long paper. All datasets, licenses, codes, and models are available at https://github.com/salesforce/DialogStudio

  22. arXiv:2307.08962  [pdf, other]

    cs.AI cs.LG

    REX: Rapid Exploration and eXploitation for AI Agents

    Authors: Rithesh Murthy, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Le Xue, Weiran Yao, Yihao Feng, Zeyuan Chen, Akash Gokul, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese

    Abstract: In this paper, we propose an enhanced approach for Rapid Exploration and eXploitation for AI Agents called REX. Existing AutoGPT-style techniques have inherent limitations, such as a heavy reliance on precise descriptions for decision-making, and the lack of a systematic approach to leverage try-and-fail procedures akin to traditional Reinforcement Learning (RL). REX introduces an additional layer…

    Submitted 26 January, 2024; v1 submitted 18 July, 2023; originally announced July 2023.

  23. arXiv:2306.00923  [pdf, other]

    cs.RO cs.CV cs.HC

    Sonicverse: A Multisensory Simulation Platform for Embodied Household Agents that See and Hear

    Authors: Ruohan Gao, Hao Li, Gokul Dharan, Zhuzhu Wang, Chengshu Li, Fei Xia, Silvio Savarese, Li Fei-Fei, Jiajun Wu

    Abstract: Developing embodied agents in simulation has been a key research topic in recent years. Exciting new tasks, algorithms, and benchmarks have been developed in various simulators. However, most of them assume deaf agents in silent environments, while we humans perceive the world with multiple senses. We introduce Sonicverse, a multisensory simulation platform with integrated audio-visual simulation…

    Submitted 16 September, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: In ICRA 2023. Project page: https://ai.stanford.edu/~rhgao/sonicverse/. Code: https://github.com/StanfordVL/sonicverse. Gao and Li contributed equally to this work and are in alphabetical order

  24. arXiv:2305.17537  [pdf, other]

    cs.LG cs.AI

    Modeling Dynamic Environments with Scene Graph Memory

    Authors: Andrey Kurenkov, Michael Lingelbach, Tanmay Agarwal, Emily Jin, Chengshu Li, Ruohan Zhang, Li Fei-Fei, Jiajun Wu, Silvio Savarese, Roberto Martín-Martín

    Abstract: Embodied AI agents that search for objects in large environments such as households often need to make efficient decisions by predicting object locations based on partial information. We pose this as a new type of link prediction problem: link prediction on partially observable dynamic graphs. Our graph is a representation of a scene in which rooms and objects are nodes, and their relationships ar…

    Submitted 12 June, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

  25. arXiv:2305.14352  [pdf, other]

    cs.CV cs.LG

    An Extensible Multimodal Multi-task Object Dataset with Materials

    Authors: Trevor Standley, Ruohan Gao, Dawn Chen, Jiajun Wu, Silvio Savarese

    Abstract: We present EMMa, an Extensible, Multimodal dataset of Amazon product listings that contains rich Material annotations. It contains more than 2.8 million objects, each with image(s), listing text, mass, price, product ratings, and position in Amazon's product-category taxonomy. We also design a comprehensive taxonomy of 182 physical materials (e.g., Plastic → Thermoplastic…

    Submitted 29 April, 2023; originally announced May 2023.

    Comments: ICLR 2023

  26. arXiv:2305.11147  [pdf, other]

    cs.CV cs.AI

    UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

    Authors: Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, Ran Xu

    Abstract: Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such…

    Submitted 2 November, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  27. arXiv:2305.08275  [pdf, other]

    cs.CV

    ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding

    Authors: Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, Silvio Savarese

    Abstract: Recent advancements in multimodal pre-training have shown promising efficacy in 3D representation learning by aligning multimodal features across 3D shapes, their 2D counterparts, and language descriptions. However, the methods used by existing frameworks to curate such multimodal data, in particular language descriptions for 3D shapes, are not scalable, and the collected language descriptions are…

    Submitted 25 April, 2024; v1 submitted 14 May, 2023; originally announced May 2023.

    Comments: CVPR2024

    Journal ref: CVPR2024

  28. arXiv:2305.02309  [pdf, other]

    cs.LG

    CodeGen2: Lessons for Training LLMs on Programming and Natural Languages

    Authors: Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, Yingbo Zhou

    Abstract: Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by the neural scaling laws as a function of the number of model parameters and observations, while imposing upper bounds on the model performance by the amount of available data and compute, w…

    Submitted 11 July, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

  29. arXiv:2303.18230  [pdf, other]

    cs.CV

    Procedure-Aware Pretraining for Instructional Video Understanding

    Authors: Honglu Zhou, Roberto Martín-Martín, Mubbasir Kapadia, Silvio Savarese, Juan Carlos Niebles

    Abstract: Our goal is to learn a video representation that is useful for downstream procedure understanding tasks in instructional videos. Due to the small amount of available annotations, a key challenge in procedure understanding is to be able to extract from unlabeled videos the procedural knowledge such as the identity of the task (e.g., 'make latte'), its steps (e.g., 'pour milk'), or the potential nex…

    Submitted 31 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  30. arXiv:2303.09618  [pdf, other]

    cs.CV cs.AI cs.CL cs.HC cs.LG

    HIVE: Harnessing Human Feedback for Instructional Visual Editing

    Authors: Shu Zhang, Xinyi Yang, Yihao Feng, Can Qin, Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan Wang, Silvio Savarese, Stefano Ermon, Caiming Xiong, Ran Xu

    Abstract: Incorporating human feedback has been shown to be crucial to align text generated by large language models to human preferences. We hypothesize that state-of-the-art instructional image editing models, where outputs are generated based on an input image and an editing instruction, could similarly benefit from human feedback, as their outputs may not adhere to the correct instructions and preferenc…

    Submitted 26 March, 2024; v1 submitted 16 March, 2023; originally announced March 2023.

    Comments: In CVPR, 2024

  31. arXiv:2301.12597  [pdf, other]

    cs.CV

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Authors: Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi

    Abstract: The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Tra…

    Submitted 15 June, 2023; v1 submitted 29 January, 2023; originally announced January 2023.

  32. arXiv:2212.05171  [pdf, other]

    cs.CV

    ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding

    Authors: Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, Silvio Savarese

    Abstract: The recognition capabilities of current state-of-the-art 3D models are limited by datasets with a small number of annotated data and a pre-defined set of categories. In its 2D counterpart, recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language. Inspired by this, leveraging multimodal information for 3D modalit…

    Submitted 12 June, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: Accepted by CVPR 2023

  33. arXiv:2211.11924  [pdf, other]

    cs.CL

    Best-$k$ Search Algorithm for Neural Text Generation

    Authors: Jiacheng Xu, Caiming Xiong, Silvio Savarese, Yingbo Zhou

    Abstract: Modern natural language generation paradigms require a good decoding strategy to obtain quality sequences out of the model. Beam search yields high-quality but low diversity outputs; stochastic approaches suffer from high variance and sometimes low quality, but the outputs tend to be more natural and creative. In this work, we propose a deterministic search algorithm balancing both quality and div…

    Submitted 21 November, 2022; originally announced November 2022.

    Comments: 17 pages

  34. arXiv:2211.09916  [pdf, other]

    cs.RO cs.LG

    Online Distribution Shift Detection via Recency Prediction

    Authors: Rachel Luo, Rohan Sinha, Yixiao Sun, Ali Hindy, Shengjia Zhao, Silvio Savarese, Edward Schmerling, Marco Pavone

    Abstract: When deploying modern machine learning-enabled robotic systems in high-stakes applications, detecting distribution shift is critical. However, most existing methods for detecting distribution shift are not well-suited to robotics settings, where data often arrives in a streaming fashion and may be very high-dimensional. In this work, we present an online method for detecting distribution shift wit…

    Submitted 17 May, 2024; v1 submitted 17 November, 2022; originally announced November 2022.

  35. arXiv:2210.08773  [pdf, other]

    cs.CV

    Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training

    Authors: Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, Steven C. H. Hoi

    Abstract: Visual question answering (VQA) is a hallmark of vision and language reasoning and a challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA. In contrast to most existing works, which require substantial adaptation of pretrained language models (PLMs) for the vision modality, PNP-VQA requires no additional training of the PLMs. In…

    Submitted 19 March, 2023; v1 submitted 17 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022 (Findings); correct typos in Equation 2 on page 4

  36. arXiv:2210.06849  [pdf, other]

    cs.CV

    Retrospectives on the Embodied AI Workshop

    Authors: Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, et al. (14 additional authors not shown)

    Abstract: We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of…

    Submitted 4 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

  37. arXiv:2209.09019  [pdf, other]

    cs.CV cs.CL cs.LG

    LAVIS: A Library for Language-Vision Intelligence

    Authors: Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, Steven C. H. Hoi

    Abstract: We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications. LAVIS aims to serve as a one-stop comprehensive library that brings recent advancements in the language-vision field accessible for researchers and practitioners, as well as fertilizing future research and development. It features a unified interface to easily access state-of-the-art image-langu…

    Submitted 15 September, 2022; originally announced September 2022.

    Comments: Preprint of LAVIS technical report

  38. Progress on the SOXS transients chaser for the ESO-NTT

    Authors: P. Schipani, S. Campana, R. Claudi, M. Aliverti, A. Baruffolo, S. Ben-Ami, G. Capasso, R. Cosentino, F. D'Alessio, P. D'Avanzo, O. Hershko, H. Kuncarayakti, M. Landoni, M. Munari, G. Pignata, K. Radhakrishnan, A. Rubin, S. Scuderi, F. Vitali, D. Young, J. Achrén, J. A. Araiza-Durán, I. Arcavi, F. Battaini, A. Brucalassi, et al. (31 additional authors not shown)

    Abstract: SOXS (Son Of X-Shooter) is a single object spectrograph offering a simultaneous spectral coverage from U- to H-band, built by an international consortium for the 3.58-m ESO New Technology Telescope at the La Silla Observatory. It is designed to observe all kind of transients and variable sources discovered by different surveys with a highly flexible schedule maintained by the consortium, based on…

    Submitted 15 September, 2022; originally announced September 2022.

    Comments: Proc. SPIE 12184, Ground-based and Airborne Instrumentation for Astronomy IX, 121840O (2022)

  39. arXiv:2208.10056  [pdf, other]

    cs.CV

    Minkowski Tracker: A Sparse Spatio-Temporal R-CNN for Joint Object Detection and Tracking

    Authors: JunYoung Gwak, Silvio Savarese, Jeannette Bohg

    Abstract: Recent research in multi-task learning reveals the benefit of solving related problems in a single neural network. 3D object detection and multi-object tracking (MOT) are two heavily intertwined problems predicting and associating an object instance location across time. However, most previous works in 3D MOT treat the detector as a preceding separated pipeline, disjointly taking the output of the…

    Submitted 26 August, 2022; v1 submitted 22 August, 2022; originally announced August 2022.

  40. arXiv:2207.01780  [pdf, other]

    cs.LG cs.CL cs.PL

    CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning

    Authors: Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C. H. Hoi

    Abstract: Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure to train a code generation model only from the pairs of natural-language proble…

    Submitted 3 November, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

    Comments: An earlier version of the work was accepted to NeurIPS 2022

  41. arXiv:2206.02967  [pdf, other]

    cs.CV cs.AI

    Masked Unsupervised Self-training for Label-free Image Classification

    Authors: Junnan Li, Silvio Savarese, Steven C. H. Hoi

    Abstract: State-of-the-art computer vision models are mostly trained with supervised learning using human-labeled images, which limits their scalability due to the expensive annotation cost. While self-supervised representation learning has achieved impressive progress, it still requires a second stage of finetuning on labeled data. On the other hand, models pre-trained with large-scale text-image supervisi…

    Submitted 9 March, 2023; v1 submitted 6 June, 2022; originally announced June 2022.

  42. arXiv:2206.01612  [pdf, other]

    cs.LG cs.AI cs.CV

    OmniXAI: A Library for Explainable AI

    Authors: Wenzhuo Yang, Hung Le, Tanmay Laud, Silvio Savarese, Steven C. H. Hoi

    Abstract: We introduce OmniXAI (short for Omni eXplainable AI), an open-source Python library of eXplainable AI (XAI), which offers omni-way explainable AI capabilities and various interpretable machine learning techniques to address the pain points of understanding and interpreting the decisions made by machine learning (ML) in practice. OmniXAI aims to be a one-stop comprehensive library that makes explai…

    Submitted 12 December, 2022; v1 submitted 1 June, 2022; originally announced June 2022.

    Comments: Github repo: https://github.com/salesforce/OmniXAI

    MSC Class: 68T09; 68T20; 68T01 ACM Class: I.2.6; I.2.5

  43. arXiv:2203.13474  [pdf, other]

    cs.LG cs.CL cs.PL

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    Authors: Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong

    Abstract: Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of…

    Submitted 27 February, 2023; v1 submitted 25 March, 2022; originally announced March 2022.

  44. arXiv:2203.07586  [pdf, other]

    cs.CL

    Long Document Summarization with Top-down and Bottom-up Inference

    Authors: Bo Pang, Erik Nijkamp, Wojciech Kryściński, Silvio Savarese, Yingbo Zhou, Caiming Xiong

    Abstract: Text summarization aims to condense long documents and retain key information. Critical to the success of a summarization model is the faithful inference of latent representations of words or tokens in the source documents. Most recent models infer the latent representations with a transformer encoder, which is purely bottom-up. Also, self-attention-based inference models face the challenge of qua…

    Submitted 14 March, 2022; originally announced March 2022.

    Comments: 21 pages

  45. arXiv:2203.06856  [pdf, other]

    cs.CV cs.AI cs.RO

    ACID: Action-Conditional Implicit Visual Dynamics for Deformable Object Manipulation

    Authors: Bokui Shen, Zhenyu Jiang, Christopher Choy, Leonidas J. Guibas, Silvio Savarese, Anima Anandkumar, Yuke Zhu

    Abstract: Manipulating volumetric deformable objects in the real world, like plush toys and pizza dough, bring substantial challenges due to infinite shape variations, non-rigid motions, and partial observability. We introduce ACID, an action-conditional visual dynamics model for volumetric deformable objects based on structured implicit neural representations. ACID integrates two new techniques: implicit r…

    Submitted 5 August, 2022; v1 submitted 14 March, 2022; originally announced March 2022.

    Comments: RSS 2022 Best Student Paper Award Finalist. Please check out more details at https://b0ku1.github.io/acid/

    Journal ref: Robotics: Science and Systems (RSS), 2022

  46. arXiv:2112.06857  [pdf, other]

    astro-ph.IM

    Software solutions for numerical modeling of wide-field telescopes

    Authors: Salvatore Savarese, Pietro Schipani, Giulio Capasso, Mirko Colapietro, Sergio D'Orsi, Marcella Iuzzolino, Laurent Marty, Francesco Perrotta, Giacomo Basile

    Abstract: This paper presents an integrated modeling software to analyze the PSF of wide-field telescopes affected by misalignments. Even relatively small misalignments in the optical system of a telescope can significantly deteriorate the image quality by introducing large aberrations. In particular, wide-field telescopes are critically affected by these errors, insomuch that usually a closed-loop active o…

    Submitted 13 December, 2021; originally announced December 2021.

    Comments: 4 pages, 3 figures, ADASS 2021 Conference

  47. arXiv:2112.05251  [pdf, other]

    cs.RO cs.AI cs.LG

    Error-Aware Imitation Learning from Teleoperation Data for Mobile Manipulation

    Authors: Josiah Wong, Albert Tung, Andrey Kurenkov, Ajay Mandlekar, Li Fei-Fei, Silvio Savarese, Roberto Martín-Martín

    Abstract: In mobile manipulation (MM), robots can both navigate within and interact with their environment and are thus able to complete many more tasks than robots only capable of navigation or manipulation. In this work, we explore how to apply imitation learning (IL) to learn continuous visuo-motor policies for MM tasks. Much prior work has shown that IL can train visuo-motor policies for either manipula…

    Submitted 9 December, 2021; originally announced December 2021.

    Comments: CoRL 2021

  48. Sample-Efficient Safety Assurances using Conformal Prediction

    Authors: Rachel Luo, Shengjia Zhao, Jonathan Kuck, Boris Ivanovic, Silvio Savarese, Edward Schmerling, Marco Pavone

    Abstract: When deploying machine learning models in high-stakes robotics applications, the ability to detect unsafe situations is crucial. Early warning systems can provide alerts when an unsafe situation is imminent (in the absence of corrective action). To reliably improve safety, these warning systems should have a provable false negative rate; i.e. of the situations that are unsafe, fewer than $ε$ will…

    Submitted 2 January, 2024; v1 submitted 28 September, 2021; originally announced September 2021.

    Comments: International Journal of Robotics Research, 2023

  49. arXiv:2109.09265  [pdf, other]

    cs.LG cs.MS stat.ML

    Merlion: A Machine Learning Library for Time Series

    Authors: Aadyot Bhatnagar, Paul Kassianik, Chenghao Liu, Tian Lan, Wenzhuo Yang, Rowan Cassius, Doyen Sahoo, Devansh Arpit, Sri Subramanian, Gerald Woo, Amrita Saha, Arun Kumar Jagota, Gokulakrishnan Gopalakrishnan, Manpreet Singh, K C Krithika, Sukumar Maddineni, Daeki Cho, Bo Zong, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Steven Hoi, Huan Wang

    Abstract: We introduce Merlion, an open-source machine learning library for time series. It features a unified interface for many commonly used models and datasets for anomaly detection and forecasting on both univariate and multivariate time series, along with standard pre/post-processing layers. It has several modules to improve ease-of-use, including visualization, anomaly score calibration to improve in…

    Submitted 19 September, 2021; originally announced September 2021.

    Comments: 22 pages, 1 figure, 14 tables

  50. arXiv:2109.01115  [pdf, other]

    cs.RO cs.AI cs.LG

    Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation

    Authors: Suraj Nair, Eric Mitchell, Kevin Chen, Brian Ichter, Silvio Savarese, Chelsea Finn

    Abstract: We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction. In order to accomplish this, humans need easy and effective ways of specifying tasks to the robot. Goal images are one popular form of task specification, as they are already grounded in the robot's observation space. However, goal images also have a number of drawbacks: t…

    Submitted 31 October, 2021; v1 submitted 2 September, 2021; originally announced September 2021.

    Comments: Conference on Robot Learning (CoRL) 2021. 24 Pages, 18 Figures