Showing 1–8 of 8 results for author: Ganger, G R

Searching in archive cs.
  1. arXiv:2406.17145  [pdf, other]

    cs.DC cs.AI cs.LG

    GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

    Authors: Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, Zhihao Jia

    Abstract: Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device. Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into multiple stages, which concurrently perform DNN training for different micro-batches in a pipeline fashion. However, existing pipeline-parallel approaches only c…

    Submitted 24 June, 2024; originally announced June 2024.

  2. arXiv:2103.08191  [pdf, other]

    cs.DC

    PACEMAKER: Avoiding HeART attacks in storage clusters with disk-adaptive redundancy

    Authors: Saurabh Kadekodi, Francisco Maturana, Suhas Jayaram Subramanya, Juncheng Yang, K. V. Rashmi, Gregory R. Ganger

    Abstract: Data redundancy provides resilience in large-scale storage clusters, but imposes significant cost overhead. Substantial space-savings can be realized by tuning redundancy schemes to observed disk failure rates. However, prior design proposals for such tuning are unusable in real-world clusters, because the IO load of transitions between schemes overwhelms the storage infrastructure (termed transit…

    Submitted 15 March, 2021; originally announced March 2021.

    Comments: Published in USENIX Symposium on Operating Systems Design and Implementation (OSDI) 2020

    ACM Class: B.8.1; C.4; D.4.2; D.4.5

    Journal ref: 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020, (pp. 369-385)

  3. arXiv:2008.12260  [pdf, other]

    cs.DC cs.LG

    Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning

    Authors: Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, Eric P. Xing

    Abstract: Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers expect users to specify the number of resources for each job, often leading to inefficient resource use. Some recent schedulers choose job resources for users, but do so without awareness of how D…

    Submitted 26 May, 2021; v1 submitted 27 August, 2020; originally announced August 2020.

  4. arXiv:2004.09619  [pdf, other]

    cs.OS

    Vilamb: Low Overhead Asynchronous Redundancy for Direct Access NVM

    Authors: Rajat Kateja, Andy Pavlo, Gregory R. Ganger

    Abstract: Vilamb provides efficient asynchronous system-redundancy for direct access (DAX) non-volatile memory (NVM) storage. Production storage deployments often use system-redundancy in the form of page checksums and cross-page parity. State-of-the-art solutions for maintaining system-redundancy for DAX NVM either incur a high performance overhead or require specialized hardware. The Vilamb user-space library…

    Submitted 20 April, 2020; originally announced April 2020.

    Report number: CMU-PDL-20-101

  5. arXiv:1910.00762  [pdf, other]

    cs.LG stat.ML

    Accelerating Deep Learning by Focusing on the Biggest Losers

    Authors: Angela H. Jiang, Daniel L.-K. Wong, Giulio Zhou, David G. Andersen, Jeffrey Dean, Gregory R. Ganger, Gauri Joshi, Michael Kaminsky, Michael Kozuch, Zachary C. Lipton, Padmanabhan Pillai

    Abstract: This paper introduces Selective-Backprop, a technique that accelerates the training of deep neural networks (DNNs) by prioritizing examples with high loss at each iteration. Selective-Backprop uses the output of a training example's forward pass to decide whether to use that example to compute gradients and update parameters, or to skip immediately to the next example. By reducing the number of co…

    Submitted 1 October, 2019; originally announced October 2019.

  6. arXiv:1908.09922  [pdf, other]

    cs.AR cs.OS

    Tvarak: Software-managed hardware offload for DAX NVM storage redundancy

    Authors: Rajat Kateja, Nathan Beckmann, Gregory R. Ganger

    Abstract: Tvarak efficiently implements system-level redundancy for direct-access (DAX) NVM storage. Production storage systems complement device-level ECC (which covers media errors) with system-checksums and cross-device parity. This system-level redundancy enables detection of and recovery from data corruption due to device firmware bugs (e.g., reading data from the wrong physical location). Direct acces…

    Submitted 26 August, 2019; originally announced August 2019.

  7. arXiv:1904.03257  [pdf, ps, other]

    cs.LG cs.DB cs.DC cs.SE stat.ML

    MLSys: The New Frontier of Machine Learning Systems

    Authors: Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, et al. (44 additional authors not shown)

    Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a ne…

    Submitted 1 December, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

  8. arXiv:1803.07445  [pdf, other]

    cs.LG stat.ML

    MLtuner: System Support for Automatic Machine Learning Tuning

    Authors: Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons

    Abstract: MLtuner automatically tunes settings for training tunables (such as the learning rate, the momentum, the mini-batch size, and the data staleness bound) that have a significant impact on large-scale machine learning (ML) performance. Traditionally, these tunables are set manually, which is unsurprisingly error-prone and difficult to do without extensive domain knowledge. MLtuner uses efficient snap…

    Submitted 20 March, 2018; originally announced March 2018.