
Showing 1–25 of 25 results for author: Gibbons, P B

  1. arXiv:2406.10181  [pdf, other]

    cs.DC cs.AI

    Practical offloading for fine-tuning LLM on commodity GPU via learned subspace projectors

    Authors: Siyuan Chen, Zelong Guan, Yudong Liu, Phillip B. Gibbons

    Abstract: Fine-tuning large language models (LLMs) requires significant memory, often exceeding the capacity of a single GPU. A common solution to this memory challenge is offloading compute and data from the GPU to the CPU. However, this approach is hampered by the limited bandwidth of commodity hardware, which constrains communication between the CPU and GPU. In this paper, we present an offloading fram…

    Submitted 14 June, 2024; originally announced June 2024.

  2. arXiv:2309.09212  [pdf, other]

    cs.RO

    RobotPerf: An Open-Source, Vendor-Agnostic, Benchmarking Suite for Evaluating Robotics Computing System Performance

    Authors: Víctor Mayoral-Vilches, Jason Jabbour, Yu-Shun Hsiao, Zishen Wan, Martiño Crespo-Álvarez, Matthew Stewart, Juan Manuel Reina-Muñoz, Prateek Nagras, Gaurav Vikhe, Mohammad Bakhshalipour, Martin Pinzger, Stefan Rass, Smruti Panigrahi, Giulio Corradi, Niladri Roy, Phillip B. Gibbons, Sabrina M. Neuman, Brian Plancher, Vijay Janapa Reddi

    Abstract: We introduce RobotPerf, a vendor-agnostic benchmarking suite designed to evaluate robotics computing performance across a diverse range of hardware platforms using ROS 2 as its common baseline. The suite encompasses ROS 2 packages covering the full robotics pipeline and integrates two distinct benchmarking approaches: black-box testing, which measures performance by eliminating upper layers and re…

    Submitted 29 January, 2024; v1 submitted 17 September, 2023; originally announced September 2023.

  3. arXiv:2305.10611  [pdf, other]

    cs.LG

    ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time

    Authors: Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry

    Abstract: Dynamic control flow is an important technique often used to design expressive and efficient deep learning computations for applications such as text parsing, machine translation, exiting early out of deep models and so on. The control flow divergence resulting from dynamic control flow makes batching, an important optimization enabling high throughput and hardware utilization, difficult to perfor…

    Submitted 16 May, 2024; v1 submitted 17 May, 2023; originally announced May 2023.

  4. arXiv:2302.03851  [pdf, other]

    cs.LG cs.SE

    ED-Batch: Efficient Automatic Batching of Dynamic Neural Networks via Learned Finite State Machines

    Authors: Siyuan Chen, Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry

    Abstract: Batching has a fundamental influence on the efficiency of deep neural network (DNN) execution. However, for dynamic DNNs, efficient batching is particularly challenging as the dataflow graph varies per input instance. As a result, state-of-the-art frameworks use heuristics that result in suboptimal batching decisions. Further, batching puts strict restrictions on memory adjacency and can lead to h…

    Submitted 7 February, 2023; originally announced February 2023.

  5. arXiv:2211.10516  [pdf, other]

    cs.DB cs.DC cs.DS cs.PF

    PIM-tree: A Skew-resistant Index for Processing-in-Memory

    Authors: Hongbo Kang, Yiwei Zhao, Guy E. Blelloch, Laxman Dhulipala, Yan Gu, Charles McGuffey, Phillip B. Gibbons

    Abstract: The performance of today's in-memory indexes is bottlenecked by the memory latency/bandwidth wall. Processing-in-memory (PIM) is an emerging approach that potentially mitigates this bottleneck, by enabling low-latency memory access whose aggregate memory bandwidth scales with the number of PIM nodes. There is an inherent tension, however, between minimizing inter-node communication and achieving l…

    Submitted 18 November, 2022; originally announced November 2022.

    MSC Class: 68P05 ACM Class: H.2.4

  6. arXiv:2206.00799  [pdf, other]

    cs.LG

    Federated Learning under Distributed Concept Drift

    Authors: Ellango Jothimurugesan, Kevin Hsieh, Jianyu Wang, Gauri Joshi, Phillip B. Gibbons

    Abstract: Federated Learning (FL) under distributed concept drift is a largely unexplored area. Although concept drift is itself a well-studied phenomenon, it poses particular challenges for FL, because drifts arise staggered in time and space (across clients). To the best of our knowledge, this work is the first to explicitly study data heterogeneity in both dimensions. We first demonstrate that prior solu…

    Submitted 27 February, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

    Comments: 20 pages. Published in AISTATS 2023

    ACM Class: I.2.6

  7. arXiv:2205.14543  [pdf, other]

    cs.DS

    Spatial Locality and Granularity Change in Caching

    Authors: Nathan Beckmann, Phillip B Gibbons, Charles McGuffey

    Abstract: Caches exploit temporal and spatial locality to allow a small memory to provide fast access to data stored in large, slow memory. The temporal aspect of locality is extremely well studied and understood, but the spatial aspect much less so. We seek to gain an increased understanding of spatial locality by defining and studying the Granularity-Change Caching Problem. This problem modifies the tradi…

    Submitted 28 May, 2022; originally announced May 2022.

    Comments: 13 pages (including references), 6 figures, and 2 tables

  8. arXiv:2110.10221  [pdf, other]

    cs.LG

    The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding

    Authors: Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry

    Abstract: There is often variation in the shape and size of input data used for deep learning. In many cases, such data can be represented using tensors with non-uniform shapes, or ragged tensors. Due to limited and non-portable support for efficient execution on ragged tensors, current deep learning frameworks generally use techniques such as padding and masking to make the data shapes uniform and then off…

    Submitted 21 March, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: 23 pages, 25 figures and 10 tables

  9. arXiv:2105.08123  [pdf, other]

    cs.AR

    MetaSys: A Practical Open-Source Metadata Management System to Implement and Evaluate Cross-Layer Optimizations

    Authors: Nandita Vijaykumar, Ataberk Olgun, Konstantinos Kanellopoulos, Nisa Bostancı, Hasan Hassan, Mehrshad Lotfi, Phillip B. Gibbons, Onur Mutlu

    Abstract: This paper introduces the first open-source FPGA-based infrastructure, MetaSys, with a prototype in a RISC-V core, to enable the rapid implementation and evaluation of a wide range of cross-layer techniques in real hardware. Hardware-software cooperative techniques are powerful approaches to improve the performance, quality of service, and security of general-purpose processors. They are however t…

    Submitted 21 January, 2023; v1 submitted 17 May, 2021; originally announced May 2021.

    Comments: A shorter version of this work is to appear at the ACM Transactions on Architecture and Code Optimization (TACO). 27 pages, 15 figures

  10. arXiv:2011.01383  [pdf, other]

    cs.LG cs.DC

    Cortex: A Compiler for Recursive Deep Learning Models

    Authors: Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry

    Abstract: Optimizing deep learning models is generally performed in two steps: (i) high-level graph optimizations such as kernel fusion and (ii) low level kernel optimizations such as those found in vendor libraries. This approach often leaves significant performance on the table, especially for the case of recursive deep learning models. In this paper, we present Cortex, a compiler-based approach to genera…

    Submitted 5 March, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: 11 pages, 12 figures and 6 tables

    MSC Class: 68N20 ACM Class: D.3.4

  11. arXiv:2003.06508  [pdf, other]

    cs.LG stat.ML

    DriftSurf: A Risk-competitive Learning Algorithm under Concept Drift

    Authors: Ashraf Tahmasbi, Ellango Jothimurugesan, Srikanta Tirthapura, Phillip B. Gibbons

    Abstract: When learning from streaming data, a change in the data distribution, also known as concept drift, can render a previously-learned model inaccurate and require training a new model. We present an adaptive learning algorithm that extends previous drift-detection-based methods by incorporating drift detection into a broader stable-state/reactive-state process. The advantage of our approach is that w…

    Submitted 2 August, 2020; v1 submitted 13 March, 2020; originally announced March 2020.

    Comments: 32 pages, 12 figures. Submitted to NeurIPS 2020. Replaced to include revision of Lemma 2 and additional experimental results

    ACM Class: I.2.6

  12. arXiv:1912.04977  [pdf, other]

    cs.LG cs.CR stat.ML

    Advances and Open Problems in Federated Learning

    Authors: Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, et al. (34 additional authors not shown)

    Abstract: Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs re…

    Submitted 8 March, 2021; v1 submitted 10 December, 2019; originally announced December 2019.

    Comments: Published in Foundations and Trends in Machine Learning Vol 4 Issue 1. See: https://www.nowpublishers.com/article/Details/MAL-083

  13. arXiv:1910.12310  [pdf, other]

    cs.DC cs.DS

    Sage: Parallel Semi-Asymmetric Graph Algorithms for NVRAMs

    Authors: Laxman Dhulipala, Charlie McGuffey, Hongbo Kang, Yan Gu, Guy E. Blelloch, Phillip B. Gibbons, Julian Shun

    Abstract: Non-volatile main memory (NVRAM) technologies provide an attractive set of features for large-scale graph analytics, including byte-addressability, low idle power, and improved memory-density. NVRAM systems today have an order of magnitude more NVRAM than traditional memory (DRAM). NVRAM systems could therefore potentially allow very large graph problems to be solved on a single machine, at a mode…

    Submitted 28 May, 2020; v1 submitted 27 October, 2019; originally announced October 2019.

    Comments: This is an extended version of a paper in PVLDB (to be presented at VLDB'20)

  14. arXiv:1910.00189  [pdf, other]

    cs.LG stat.ML

    The Non-IID Data Quagmire of Decentralized Machine Learning

    Authors: Kevin Hsieh, Amar Phanishayee, Onur Mutlu, Phillip B. Gibbons

    Abstract: Many large-scale machine learning (ML) applications need to perform decentralized learning over datasets generated at different devices and locations. Such datasets pose a significant challenge to decentralized learning because their different contexts result in significant data distribution skew across devices/locations. In this paper, we take a step toward better understanding this challenge by…

    Submitted 18 August, 2020; v1 submitted 30 September, 2019; originally announced October 2019.

    Journal ref: International Conference on Machine Learning (ICML), 2020

  15. arXiv:1904.03257  [pdf, ps, other]

    cs.LG cs.DB cs.DC cs.SE stat.ML

    MLSys: The New Frontier of Machine Learning Systems

    Authors: Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, et al. (44 additional authors not shown)

    Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a ne…

    Submitted 1 December, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

  16. The Parallel Persistent Memory Model

    Authors: Guy E. Blelloch, Phillip B. Gibbons, Yan Gu, Charles McGuffey, Julian Shun

    Abstract: We consider a parallel computational model that consists of $P$ processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows for each processor to fault with bounded probability, and possibly restart. On faulting all processor state and local ephemeral memory are lost, but the persistent memory remains. This model is motivated by upco…

    Submitted 13 June, 2018; v1 submitted 15 May, 2018; originally announced May 2018.

    Comments: This paper is the full version of a paper at SPAA 2018 with the same name

  17. arXiv:1805.03502  [pdf, other]

    cs.AR

    RowClone: Accelerating Data Movement and Initialization Using DRAM

    Authors: Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry

    Abstract: In existing systems, to perform any bulk data movement operation (copy or initialization), the data has to first be read into the on-chip processor, all the way into the L1 cache, and the result of the operation must be written back to main memory. This is despite the fact that these operations do not involve any actual computation. RowClone exploits the organization and operation of commodity DRA…

    Submitted 7 May, 2018; originally announced May 2018.

    Comments: arXiv admin note: text overlap with arXiv:1605.06483

  18. arXiv:1805.02498  [pdf, other]

    cs.DC

    Decoupling GPU Programming Models from Resource Management for Enhanced Programming Ease, Portability, and Performance

    Authors: Nandita Vijaykumar, Kevin Hsieh, Gennady Pekhimenko, Samira Khan, Ashish Shrestha, Saugata Ghose, Adwait Jog, Phillip B. Gibbons, Onur Mutlu

    Abstract: The application resource specification--a static specification of several parameters such as the number of threads and the scratchpad memory usage per thread block--forms a critical component of modern GPU programming models. This specification determines the parallelism, and hence performance, of the application during execution because the corresponding on-chip hardware resources are allocated a…

    Submitted 2 May, 2018; originally announced May 2018.

    Comments: arXiv admin note: substantial text overlap with arXiv:1802.02573

  19. arXiv:1803.07445  [pdf, other]

    cs.LG stat.ML

    MLtuner: System Support for Automatic Machine Learning Tuning

    Authors: Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons

    Abstract: MLtuner automatically tunes settings for training tunables (such as the learning rate, the momentum, the mini-batch size, and the data staleness bound) that have a significant impact on large-scale machine learning (ML) performance. Traditionally, these tunables are set manually, which is unsurprisingly error-prone and difficult to do without extensive domain knowledge. MLtuner uses efficient snap…

    Submitted 20 March, 2018; originally announced March 2018.

  20. arXiv:1802.02573  [pdf, other]

    cs.DC cs.AR

    Zorua: Enhancing Programming Ease, Portability, and Performance in GPUs by Decoupling Programming Models from Resource Management

    Authors: Nandita Vijaykumar, Kevin Hsieh, Gennady Pekhimenko, Samira Khan, Ashish Shrestha, Saugata Ghose, Phillip B. Gibbons, Onur Mutlu

    Abstract: The application resource specification--a static specification of several parameters such as the number of threads and the scratchpad memory usage per thread block--forms a critical component of the existing GPU programming models. This specification determines the performance of the application during execution because the corresponding on-chip hardware resources are allocated and managed purely…

    Submitted 7 February, 2018; originally announced February 2018.

    Report number: SAFARI Technical Report 2016-005

  21. arXiv:1801.03493  [pdf, other]

    cs.DB cs.CV cs.DC

    Focus: Querying Large Video Datasets with Low Latency and Low Cost

    Authors: Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Paramvir Bahl, Matthai Philipose, Phillip B. Gibbons, Onur Mutlu

    Abstract: Large volumes of videos are continuously recorded from cameras deployed for traffic control and surveillance with the goal of answering "after the fact" queries: identify video frames with objects of certain classes (cars, bags) from many days of recorded video. While advancements in convolutional neural networks (CNNs) have enabled answering such queries with high accuracy, they are too expensive…

    Submitted 10 January, 2018; originally announced January 2018.

  22. arXiv:1710.02637  [pdf, other]

    cs.DS

    Implicit Decomposition for Write-Efficient Connectivity Algorithms

    Authors: Naama Ben-David, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Yan Gu, Charles McGuffey, Julian Shun

    Abstract: The future of main memory appears to lie in the direction of new technologies that provide strong capacity-to-performance ratios, but have write operations that are much more expensive than reads in terms of latency, bandwidth, and energy. Motivated by this trend, we propose sequential and parallel algorithms to solve graph connectivity problems using significantly fewer writes than conventional a…

    Submitted 7 October, 2017; originally announced October 2017.

  23. arXiv:1611.09988  [pdf, other]

    cs.AR

    Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM

    Authors: Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, Todd C. Mowry

    Abstract: Bitwise operations are an important component of modern day programming. Many widely-used data structures (e.g., bitmap indices in databases) rely on fast bitwise operations on large bit vectors to achieve high performance. Unfortunately, in existing systems, regardless of the underlying architecture (e.g., CPU, GPU, FPGA), the throughput of such bulk bitwise operations is limited by the available…

    Submitted 29 November, 2016; originally announced November 2016.

    Comments: arXiv admin note: text overlap with arXiv:1605.06483

  24. Sorting with Asymmetric Read and Write Costs

    Authors: Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Yan Gu, Julian Shun

    Abstract: Emerging memory technologies have a significant gap between the cost, both in time and in energy, of writing to memory versus reading from memory. In this paper we present models and algorithms that account for this difference, with a focus on write-efficient sorting algorithms. First, we consider the PRAM model with asymmetric write cost, and show that sorting can be performed in…

    Submitted 10 March, 2016; originally announced March 2016.

  25. arXiv:1511.01038  [pdf, other]

    cs.DS

    Efficient Algorithms with Asymmetric Read and Write Costs

    Authors: Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Yan Gu, Julian Shun

    Abstract: In several emerging technologies for computer memory (main memory), the cost of reading is significantly cheaper than the cost of writing. Such asymmetry in memory costs poses a fundamentally different model from the RAM for algorithm design. In this paper we study lower and upper bounds for various problems under such asymmetric read and write costs. We consider both the case in which all but…

    Submitted 28 August, 2016; v1 submitted 3 November, 2015; originally announced November 2015.