
Showing 1–50 of 124 results for author: Stoica, I

Searching in archive cs.
  1. arXiv:2408.07092  [pdf, other]

    cs.LG cs.AI cs.CL

    Post-Training Sparse Attention with Double Sparsity

    Authors: Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng

    Abstract: The inference process for large language models is slow and memory-intensive, with one of the most critical bottlenecks being excessive Key-Value (KV) cache accesses. This paper introduces "Double Sparsity," a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access. Double Sparsity combines token sparsity, which focuses on utilizing only the…

    Submitted 11 August, 2024; originally announced August 2024.

  2. arXiv:2408.03561  [pdf, other]

    cs.CR cs.AI cs.LG

    MPC-Minimized Secure LLM Inference

    Authors: Deevashwer Rathee, Dacheng Li, Ion Stoica, Hao Zhang, Raluca Popa

    Abstract: Many inference services based on large language models (LLMs) pose a privacy concern, either revealing user prompts to the service or the proprietary weights to the user. Secure inference offers a solution to this problem through secure multi-party computation (MPC); however, it is still impractical for modern LLM workloads due to the large overhead imposed by MPC. To address this overhead, we prop…

    Submitted 7 August, 2024; originally announced August 2024.

  3. arXiv:2407.16831  [pdf, ps, other]

    cs.AI

    Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design

    Authors: Jared Quincy Davis, Boris Hanin, Lingjiao Chen, Peter Bailis, Ion Stoica, Matei Zaharia

    Abstract: As practitioners seek to surpass the current reliability and quality frontier of monolithic models, Compound AI Systems consisting of many language model inference calls are increasingly employed. In this work, we construct systems, which we call Networks of Networks (NoNs), organized around the distinction between generating a proposed answer and verifying its correctness, a fundamental concept in…

    Submitted 23 July, 2024; originally announced July 2024.

  4. arXiv:2406.18665  [pdf, other]

    cs.LG cs.AI cs.CL

    RouteLLM: Learning to Route LLMs with Preference Data

    Authors: Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica

    Abstract: Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select betwe…

    Submitted 21 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

  5. arXiv:2406.14066  [pdf, other]

    cs.AI cs.PF

    Optimizing Speculative Decoding for Serving Large Language Models Using Goodput

    Authors: Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang

    Abstract: Reducing the inference latency of large language models (LLMs) is crucial, and speculative decoding (SD) stands out as one of the most effective techniques. Rather than letting the LLM generate all tokens directly, speculative decoding employs effective proxies to predict potential outputs, which are then verified by the LLM without compromising the generation quality. Yet, deploying SD in real on…

    Submitted 25 June, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

  6. arXiv:2406.11939  [pdf, other]

    cs.LG cs.AI cs.CL

    From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

    Authors: Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica

    Abstract: The rapid evolution of language models has necessitated the development of more challenging benchmarks. Current static benchmarks often struggle to consistently distinguish between the capabilities of different models and fail to align with real-world user preferences. On the other hand, live crowd-sourced platforms like the Chatbot Arena collect a wide range of natural prompts and user feedback.…

    Submitted 17 June, 2024; originally announced June 2024.

  7. arXiv:2405.20947  [pdf, other]

    cs.CL cs.AI

    OR-Bench: An Over-Refusal Benchmark for Large Language Models

    Authors: Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh

    Abstract: Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often comes with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is cha…

    Submitted 20 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

    Comments: version 2, 10 pages main, 22 pages total

  8. arXiv:2405.16714  [pdf, other]

    cs.CL cs.AI cs.LG q-bio.NC

    Crafting Interpretable Embeddings by Asking LLMs Questions

    Authors: Vinamra Benara, Chandan Singh, John X. Morris, Richard Antonello, Ion Stoica, Alexander G. Huth, Jianfeng Gao

    Abstract: Large language models (LLMs) have rapidly improved text embeddings for a growing array of natural-language processing tasks. However, their opaqueness and proliferation into scientific domains such as neuroscience have created a growing need for interpretability. Here, we ask whether we can obtain interpretable embeddings through LLM prompting. We introduce question-answering embeddings (QA-Emb),…

    Submitted 26 May, 2024; originally announced May 2024.

  9. arXiv:2404.18928  [pdf, other]

    cs.CV cs.AI cs.CL cs.GR cs.LG

    Stylus: Automatic Adapter Selection for Diffusion Models

    Authors: Michael Luo, Justin Wong, Brandon Trabucco, Yanping Huang, Joseph E. Gonzalez, Zhifeng Chen, Ruslan Salakhutdinov, Ion Stoica

    Abstract: Beyond scaling base models with more data or parameters, fine-tuned adapters provide an alternative way to generate high fidelity, custom images at reduced costs. As such, adapters have been widely adopted by open-source communities, accumulating a database of over 100K adapters, most of which are highly customized with insufficient descriptions. This paper explores the problem of matching the prom…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Project Website: https://stylus-diffusion.github.io

  10. arXiv:2404.14527  [pdf, other]

    cs.DC cs.LG

    Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

    Authors: Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica

    Abstract: Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances. Prior work has addressed the high cost of LLM serving by improving the inference engine, but less attention has been given to selecting the most cost-efficient GPU type(s) for a specific LLM service. There is a large and g…

    Submitted 22 July, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

  11. arXiv:2404.06921  [pdf, other]

    cs.CL cs.AI

    GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications

    Authors: Shishir G. Patil, Tianjun Zhang, Vivian Fang, Noppapon C., Roy Huang, Aaron Hao, Martin Casado, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica

    Abstract: Large Language Models (LLMs) are evolving beyond their classical role of providing information within dialogue systems to actively engaging with tools and performing actions on real-world applications and services. Today, humans verify the correctness and appropriateness of the LLM-generated outputs (e.g., code, functions, or actions) before putting them into real-world execution. This poses signi…

    Submitted 10 April, 2024; originally announced April 2024.

  12. arXiv:2404.04500  [pdf, other]

    cs.CR cs.AI cs.CY cs.LG

    Trustless Audits without Revealing Data or Models

    Authors: Suppakit Waiwitlikhit, Ion Stoica, Yi Sun, Tatsunori Hashimoto, Daniel Kang

    Abstract: There is an increasing conflict between business incentives to hide models and data as trade secrets, and the societal need for algorithmic transparency. For example, a rightsholder wishing to know whether their copyrighted works have been used during training must convince the model provider to allow a third party to audit the model and data. Finding a mutually agreeable third party is difficult,…

    Submitted 6 April, 2024; originally announced April 2024.

  13. arXiv:2404.02015  [pdf, other]

    cs.DC

    MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving

    Authors: Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang

    Abstract: Large language models (LLMs) have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use-cases like chat, programming and search. However, efficiently serving multiple LLMs poses significant challenges for existing approaches due to varying popularity of LLMs. In this paper, we present MuxServe, a flexible spatial-temporal multiplexing…

    Submitted 12 June, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

  14. arXiv:2403.13839  [pdf, other]

    cs.LG cs.AI cs.PL

    depyf: Open the Opaque Box of PyTorch Compiler for Machine Learning Researchers

    Authors: Kaichao You, Runsheng Bai, Meng Cao, Jianmin Wang, Ion Stoica, Mingsheng Long

    Abstract: PyTorch 2.x introduces a compiler designed to accelerate deep learning programs. However, for machine learning researchers, adapting to the PyTorch compiler to full potential can be challenging. The compiler operates at the Python bytecode level, making it appear as an opaque box. To address this, we introduce depyf, a tool designed to demystify the inner workings of the PyTorch…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: 16 pages, 2 figures

  15. arXiv:2403.10131  [pdf, other]

    cs.CL cs.AI

    RAFT: Adapting Language Model to Domain Specific RAG

    Authors: Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, Joseph E. Gonzalez

    Abstract: Pretraining Large Language Models (LLMs) on large corpora of textual data is now a standard paradigm. When using these LLMs for many downstream applications, it is common to additionally bake in new knowledge (e.g., time-critical news, or private domain knowledge) into the pretrained model either through RAG-based-prompting, or fine-tuning. However, the optimal methodology for the model to gain su…

    Submitted 5 June, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

  16. arXiv:2403.07974  [pdf, other]

    cs.SE cs.CL cs.LG

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Authors: Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica

    Abstract: Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contaminati…

    Submitted 6 June, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

    Comments: Website - https://livecodebench.github.io/

  17. arXiv:2403.05821  [pdf, other]

    cs.LG cs.DB

    Optimizing LLM Queries in Relational Workloads

    Authors: Shu Liu, Asim Biswal, Audrey Cheng, Xiangxi Mo, Shiyi Cao, Joseph E. Gonzalez, Ion Stoica, Matei Zaharia

    Abstract: Analytical database providers (e.g., Redshift, Databricks, BigQuery) have rapidly added support for invoking Large Language Models (LLMs) through native user-defined functions (UDFs) to help users perform natural language tasks, such as classification, entity extraction, and translation, inside analytical workloads. For instance, an analyst might want to extract customer sentiments on millions of…

    Submitted 9 March, 2024; originally announced March 2024.

  18. arXiv:2403.04132  [pdf, other]

    cs.AI cs.CL

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Authors: Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, Ion Stoica

    Abstract: Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowd…

    Submitted 6 March, 2024; originally announced March 2024.

  19. arXiv:2403.02419  [pdf, other]

    cs.LG cs.AI cs.CL eess.SY

    Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

    Authors: Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, James Zou

    Abstract: Many recent state-of-the-art results in language tasks were achieved using compound systems that perform multiple Language Model (LM) calls and aggregate their responses. However, there is little understanding of how the number of LM calls - e.g., when asking the LM to answer each question multiple times and taking a majority vote - affects such a compound system's performance. In this paper, we i…

    Submitted 4 June, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  20. arXiv:2402.02057  [pdf, other]

    cs.LG cs.CL

    Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

    Authors: Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang

    Abstract: Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound, resulting in high latency and significant waste of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding often require a draft model (e.g., speculative decoding), which is nontrivial to obtain and unable to generalize. In this paper, we introduce Lookahead decoding,…

    Submitted 3 February, 2024; originally announced February 2024.

  21. arXiv:2401.00588  [pdf, other]

    cs.AI cs.LG cs.PF

    Fairness in Serving Large Language Models

    Authors: Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica

    Abstract: High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests from short chat conversations to long document reading. To ensure that all client requests are processed fairly, most major LLM inference services have request rate limits, to ensure that no client can dominate the request queue. However, this rudimentary notion of fairness also results in under-utilizatio…

    Submitted 5 June, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

  22. arXiv:2312.16733  [pdf, other]

    cs.DC cs.LG

    SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads

    Authors: Alind Khare, Dhruv Garg, Sukrit Kalra, Snigdha Grandhi, Ion Stoica, Alexey Tumanov

    Abstract: The increasing deployment of ML models on the critical path of production applications in both the datacenter and the edge requires ML inference serving systems to serve these models under unpredictable and bursty request arrival rates. Serving models under such conditions requires these systems to strike a careful balance between the latency and accuracy requirements of the application and the overal…

    Submitted 27 December, 2023; originally announced December 2023.

  23. arXiv:2312.15157  [pdf, other]

    cs.SE cs.LG cs.PL

    CodeScholar: Growing Idiomatic Code Examples

    Authors: Manish Shetty, Koushik Sen, Ion Stoica

    Abstract: Programmers often search for usage examples for API methods. A tool that could generate realistic, idiomatic, and contextual usage examples for one or more APIs would be immensely beneficial to developers. Such a tool would relieve the need for a deep understanding of the API landscape, augment existing documentation, and help discover interactions among APIs. We present CodeScholar, a tool that g…

    Submitted 22 December, 2023; originally announced December 2023.

  24. arXiv:2312.07104  [pdf, other]

    cs.AI cs.PL

    SGLang: Efficient Execution of Structured Language Model Programs

    Authors: Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng

    Abstract: Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend langua…

    Submitted 5 June, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

  25. arXiv:2311.14904  [pdf, other]

    cs.LG cs.SE

    LLM-Assisted Code Cleaning For Training Accurate Code Generators

    Authors: Naman Jain, Tianjun Zhang, Wei-Lin Chiang, Joseph E. Gonzalez, Koushik Sen, Ion Stoica

    Abstract: Natural language to code generation is an important application area of LLMs and has received wide attention from the community. The majority of relevant studies have exclusively concentrated on increasing the quantity and functional correctness of training sets while disregarding other stylistic elements of programs. More recently, data quality has garnered a lot of interest and multiple works ha…

    Submitted 24 November, 2023; originally announced November 2023.

  26. arXiv:2311.04850  [pdf, other]

    cs.CL cs.AI

    Rethinking Benchmark and Contamination for Language Models with Rephrased Samples

    Authors: Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica

    Abstract: Large language models are increasingly trained on all the data ever produced by humans. Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets. While most data decontamination efforts apply string matching (e.g., n-gram overlap) to remove benchmark data, we show that these methods are insufficient, and simple…

    Submitted 11 November, 2023; v1 submitted 8 November, 2023; originally announced November 2023.

  27. arXiv:2311.03285  [pdf, other]

    cs.LG cs.AI cs.DC

    S-LoRA: Serving Thousands of Concurrent LoRA Adapters

    Authors: Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica

    Abstract: The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched in…

    Submitted 5 June, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

  28. arXiv:2310.08560  [pdf, other]

    cs.AI

    MemGPT: Towards LLMs as Operating Systems

    Authors: Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph E. Gonzalez

    Abstract: Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appea…

    Submitted 12 February, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

    Comments: Code and data available at https://research.memgpt.ai

  29. arXiv:2310.07177  [pdf, other]

    cs.AI cs.CL cs.LG

    Online Speculative Decoding

    Authors: Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang

    Abstract: Speculative decoding is a pivotal technique to accelerate the inference of large language models (LLMs) by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited due to the low predictive accuracy of the draft model, particularly when faced with diverse text inputs and a significant capability gap between the draft and target models. We introduc…

    Submitted 9 June, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

  30. arXiv:2310.03294  [pdf, other]

    cs.LG cs.AI cs.DC

    DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training

    Authors: Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Xuezhe Ma, Ion Stoica, Joseph E. Gonzalez, Hao Zhang

    Abstract: FlashAttention (Dao, 2023) effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU. In this paper, we introduce DISTFLASHATTN, a distributed memory-efficient attention mechanism optimized for long-context LLMs training. We propose three key techniques: token-level workload balancing, overlapping key-value communicatio…

    Submitted 31 March, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

  31. arXiv:2309.11998  [pdf, other]

    cs.CL cs.AI

    LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

    Authors: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Hao Zhang

    Abstract: Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. This dataset is collected from 210K unique IP addresses in the wild on our Vicuna demo and…

    Submitted 10 March, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

  32. arXiv:2309.06180  [pdf, other]

    cs.LG cs.DC

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Authors: Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica

    Abstract: High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address…

    Submitted 12 September, 2023; originally announced September 2023.

    Comments: SOSP 2023

  33. arXiv:2308.03204  [pdf, other]

    cs.RO

    Leveraging Cloud Computing to Make Autonomous Vehicles Safer

    Authors: Peter Schafhalter, Sukrit Kalra, Le Xu, Joseph E. Gonzalez, Ion Stoica

    Abstract: The safety of autonomous vehicles (AVs) depends on their ability to perform complex computations on high-volume sensor data in a timely manner. Their ability to run these computations with state-of-the-art models is limited by the processing power and slow update cycles of their onboard hardware. In contrast, cloud computing offers the ability to burst computation to vast amounts of the latest gen…

    Submitted 6 August, 2023; originally announced August 2023.

    Comments: IROS 2023 (to appear); 8 pages, 7 figures, 2 tables

  34. arXiv:2306.05685  [pdf, other]

    cs.CL cs.AI

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Authors: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica

    Abstract: Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement…

    Submitted 23 December, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023 Datasets and Benchmarks Track

  35. arXiv:2303.06865  [pdf, other]

    cs.LG cs.AI cs.PF

    FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

    Authors: Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang

    Abstract: The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generat…

    Submitted 12 June, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

  36. arXiv:2302.11665  [pdf, other]

    cs.LG cs.DC cs.NI

    AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving

    Authors: Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica

    Abstract: Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off…

    Submitted 19 July, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: OSDI 2023

  37. arXiv:2302.05733  [pdf, other]

    cs.CR cs.LG

    Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks

    Authors: Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto

    Abstract: Recent advances in instruction-following large language models (LLMs) have led to dramatic improvements in a range of NLP tasks. Unfortunately, we find that the same improved capabilities amplify the dual-use risks for malicious purposes of these models. Dual-use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security. The capabilities of th…

    Submitted 11 February, 2023; originally announced February 2023.

  38. arXiv:2301.03734  [pdf, other]

    cs.DC cs.OS

    Exoshuffle-CloudSort

    Authors: Frank Sifei Luan, Stephanie Wang, Samyukta Yagati, Sean Kim, Kenneth Lien, Isaac Ong, Tony Hong, SangBin Cho, Eric Liang, Ion Stoica

    Abstract: We present Exoshuffle-CloudSort, a sorting application running on top of Ray using the Exoshuffle architecture. Exoshuffle-CloudSort runs on Amazon EC2, with input and output data stored on Amazon S3. Using 40 i4i.4xlarge workers, Exoshuffle-CloudSort completes the 100 TB CloudSort Benchmark (Indy category) in 5378 seconds, with an average total cost of $97.

    Submitted 9 January, 2023; originally announced January 2023.

  39. arXiv:2211.05322  [pdf, other]

    cs.LG cs.DC

    On Optimizing the Communication of Model Parallelism

    Authors: Yonghao Zhuang, Hexu Zhao, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, Hao Zhang

    Abstract: We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism - intra-operator and inter-operator parallelism - are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to…

    Submitted 9 November, 2022; originally announced November 2022.

  40. arXiv:2211.04775  [pdf, other]

    cs.CR

    ZK-IMG: Attested Images via Zero-Knowledge Proofs to Fight Disinformation

    Authors: Daniel Kang, Tatsunori Hashimoto, Ion Stoica, Yi Sun

    Abstract: Over the past few years, AI methods of generating images have been increasing in capabilities, with recent breakthroughs enabling high-resolution, photorealistic "deepfakes" (artificially generated images with the purpose of misinformation or harm). The rise of deepfakes has potential for social disruption. Recent work has proposed using ZK-SNARKs (zero-knowledge succinct non-interactive argument…

    Submitted 10 November, 2022; v1 submitted 9 November, 2022; originally announced November 2022.

  41. CLUTR: Curriculum Learning via Unsupervised Task Representation Learning

    Authors: Abdus Salam Azad, Izzeddin Gur, Jasper Emhoff, Nathaniel Alexis, Aleksandra Faust, Pieter Abbeel, Ion Stoica

    Abstract: Reinforcement Learning (RL) algorithms are often known for sample inefficiency and difficult generalization. Recently, Unsupervised Environment Design (UED) emerged as a new paradigm for zero-shot generalization by simultaneously learning a task distribution and agent policies on the generated tasks. This is a non-stationary process where the task distribution evolves along with agent policies; cr…

    Submitted 7 March, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: Preprint, Currently Under Review

  42. arXiv:2210.08674  [pdf, ps, other]

    cs.CR cs.LG

    Scaling up Trustless DNN Inference with Zero-Knowledge Proofs

    Authors: Daniel Kang, Tatsunori Hashimoto, Ion Stoica, Yi Sun

    Abstract: As ML models have increased in capabilities and accuracy, so has the complexity of their deployments. Increasingly, ML model consumers are turning to service providers to serve the ML models in the ML-as-a-service (MLaaS) paradigm. As MLaaS proliferates, a critical requirement emerges: how can model consumers verify that the correct predictions were served, in the face of malicious, lazy, or buggy…

    Submitted 16 October, 2022; originally announced October 2022.

  43. arXiv:2210.07259  [pdf, other]

    cs.NI cs.DC

    Skyplane: Optimizing Transfer Cost and Throughput Using Cloud-Aware Overlays

    Authors: Paras Jain, Sam Kumar, Sarah Wooders, Shishir G. Patil, Joseph E. Gonzalez, Ion Stoica

    Abstract: Cloud applications are increasingly distributing data across multiple regions and cloud providers. Unfortunately, wide-area bulk data transfers are often slow, bottlenecking applications. We demonstrate that it is possible to significantly improve inter-region cloud bulk transfer throughput by adapting network overlays to the cloud setting -- that is, by routing data through indirect paths at the…

    Submitted 13 October, 2022; originally announced October 2022.

    Comments: To appear at NSDI 2023

  44. arXiv:2208.07479  [pdf, other]

    cs.CV

    Context-Aware Streaming Perception in Dynamic Environments

    Authors: Gur-Eyal Sela, Ionel Gog, Justin Wong, Kumar Krishna Agrawal, Xiangxi Mo, Sukrit Kalra, Peter Schafhalter, Eric Leong, Xin Wang, Bharathan Balaji, Joseph Gonzalez, Ion Stoica

    Abstract: Efficient vision works maximize accuracy under a latency budget. These works evaluate accuracy offline, one image at a time. However, real-time vision applications like autonomous driving operate in streaming settings, where ground truth changes between inference start and finish. This results in a significant accuracy drop. Therefore, a recent work proposed to maximize accuracy in streaming setti…

    Submitted 15 August, 2022; originally announced August 2022.

    Comments: 26 pages, 10 figures, to be published in ECCV 2022

  45. arXiv:2207.07697  [pdf, other]

    cs.LG cs.CV cs.DC stat.ML

    POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging

    Authors: Shishir G. Patil, Paras Jain, Prabal Dutta, Ion Stoica, Joseph E. Gonzalez

    Abstract: Fine-tuning models on edge devices like mobile phones would enable privacy-preserving personalization over sensitive data. However, edge training has historically been limited to relatively small models with simple architectures because training is both memory and energy intensive. We present POET, an algorithm to enable training large neural networks on memory-scarce battery-operated edge devices…

    Submitted 15 July, 2022; originally announced July 2022.

    Comments: Proceedings of the 39th International Conference on Machine Learning 2022 (ICML 2022)

  46. arXiv:2206.14276  [pdf, other]

    cs.DC cs.LG cs.MS stat.AP

    NumS: Scalable Array Programming for the Cloud

    Authors: Melih Elibol, Vinamra Benara, Samyu Yagati, Lianmin Zheng, Alvin Cheung, Michael I. Jordan, Ion Stoica

    Abstract: Scientists increasingly rely on Python tools to perform scalable distributed memory array operations using rich, NumPy-like expressions. However, many of these tools rely on dynamic schedulers optimized for abstract task graphs, which often encounter memory and network bandwidth-related bottlenecks due to sub-optimal data and operator placement decisions. Tools built on the message passing interfa…

    Submitted 12 July, 2022; v1 submitted 28 June, 2022; originally announced June 2022.

  47. arXiv:2205.09778  [pdf, other]

    cs.RO

    FogROS2: An Adaptive Platform for Cloud and Fog Robotics Using ROS 2

    Authors: Jeffrey Ichnowski, Kaiyuan Chen, Karthik Dharmarajan, Simeon Adebola, Michael Danielczuk, Víctor Mayoral-Vilches, Nikhil Jha, Hugo Zhan, Edith LLontop, Derek Xu, Camilo Buscaron, John Kubiatowicz, Ion Stoica, Joseph Gonzalez, Ken Goldberg

    Abstract: Mobility, power, and price points often dictate that robots do not have sufficient computing power on board to run contemporary robot algorithms at desired rates. Cloud computing providers such as AWS, GCP, and Azure offer immense computing power and increasingly low latency on demand, but tapping into that power from a robot is non-trivial. We present FogROS2, an open-source platform to facilitat…

    Submitted 24 April, 2023; v1 submitted 19 May, 2022; originally announced May 2022.

  48. arXiv:2205.07147  [pdf]

    cs.DC

    The Sky Above The Clouds

    Authors: Sarah Chasins, Alvin Cheung, Natacha Crooks, Ali Ghodsi, Ken Goldberg, Joseph E. Gonzalez, Joseph M. Hellerstein, Michael I. Jordan, Anthony D. Joseph, Michael W. Mahoney, Aditya Parameswaran, David Patterson, Raluca Ada Popa, Koushik Sen, Scott Shenker, Dawn Song, Ion Stoica

    Abstract: Technology ecosystems often undergo significant transformations as they mature. For example, telephony, the Internet, and PCs all started with a single provider, but in the United States each is now served by a competitive market that uses comprehensive and universal technology standards to provide compatibility. This white paper presents our view on how the cloud ecosystem, barely over fifteen ye…

    Submitted 14 May, 2022; originally announced May 2022.

    Comments: 35 pages

  49. arXiv:2203.05072  [pdf, other]

    cs.DC

    Exoshuffle: An Extensible Shuffle Architecture

    Authors: Frank Sifei Luan, Stephanie Wang, Samyukta Yagati, Sean Kim, Kenneth Lien, Isaac Ong, Tony Hong, SangBin Cho, Eric Liang, Ion Stoica

    Abstract: Shuffle is one of the most expensive communication primitives in distributed data processing and is difficult to scale. Prior work addresses the scalability challenges of shuffle by building monolithic shuffle systems. These systems are costly to develop, and they are tightly integrated with batch processing frameworks that offer only high-level APIs such as SQL. New applications, such as ML train…

    Submitted 17 August, 2023; v1 submitted 9 March, 2022; originally announced March 2022.

  50. arXiv:2201.12023  [pdf, other]

    cs.LG cs.DC cs.PL

    Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

    Authors: Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica

    Abstract: Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models…

    Submitted 28 June, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: OSDI 2022