DOI: 10.1145/3458817.3480853
Research article · Open access

MAPA: multi-accelerator pattern allocation policy for multi-tenant GPU servers

Published: 13 November 2021

Abstract

Multi-accelerator servers are increasingly deployed in shared multi-tenant environments, such as cloud data centers, to meet the demands of large-scale compute-intensive workloads. These accelerators are also increasingly inter-connected in complex topologies, and workloads exhibit a widening variety of inter-accelerator communication patterns. However, existing allocation policies are ill-suited to these emerging use cases. Specifically, this work identifies that multi-accelerator workloads are commonly fragmented across the topology, leading to reduced bandwidth and increased latency for inter-accelerator communication.
We propose Multi-Accelerator Pattern Allocation (MAPA), a graph-pattern-mining approach to generalized allocation of multi-accelerator workloads on multi-accelerator servers. We demonstrate that MAPA improves the execution time of multi-accelerator workloads and that its benefits generalize across various accelerator topologies. Finally, we demonstrate a 12.4% speedup for the 75th percentile of jobs, with worst-case execution time reduced by up to 35% relative to the baseline policy.
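The core idea of pattern-based allocation can be sketched as subgraph matching: treat the server's accelerator interconnect as a graph and a job's communication pattern as a smaller graph to be embedded in the currently free portion of it. The following is a minimal sketch using networkx; the 4-GPU ring topology, the `allocate` helper, and the pattern construction are illustrative assumptions, not the paper's actual policy or topology.

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Hypothetical 4-GPU server: a ring of accelerator-to-accelerator links
# (a stand-in for a real NVLink topology).
topology = nx.Graph()
topology.add_edges_from([(0, 1), (1, 2), (2, 3), (3, 0)])

def allocate(pattern, busy):
    """Map a job's communication pattern onto free GPUs, or return None.

    `pattern` is a graph whose nodes are the job's logical accelerators and
    whose edges are required direct links; `busy` is the set of occupied GPUs.
    """
    # Restrict matching to the un-allocated portion of the topology.
    free = topology.subgraph(n for n in topology if n not in busy)
    matcher = isomorphism.GraphMatcher(free, pattern)
    for mapping in matcher.subgraph_isomorphisms_iter():
        # `mapping` goes topology node -> pattern node; invert it so the
        # result reads "logical accelerator -> physical GPU".
        return {v: k for k, v in mapping.items()}
    return None  # job's pattern cannot be placed without fragmentation

# A job requesting two directly linked GPUs.
pair = nx.Graph([("a", "b")])
print(allocate(pair, busy={0}))  # e.g. {'a': 1, 'b': 2}, depending on match order
```

A real policy would additionally rank the candidate mappings (e.g. by link bandwidth or by how much future allocation flexibility each one preserves) rather than taking the first match.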

Supplementary Material

MP4 File (MAPA Multi-Accelerator Pattern Allocation Policy for Multi-Tenant GPU Servers 232 Afternoon 4.mp4)
Presentation video




Published In

SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2021, 1493 pages
ISBN: 9781450384421
DOI: 10.1145/3458817
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

In-Cooperation: IEEE CS

Publisher: Association for Computing Machinery, New York, NY, United States


      Funding Sources

      • University of Sydney
      • NSF
      • Australia Research Council (ARC)
      • U.S. Dept. of Energy's Office of Science Center for Advanced Technology Evaluation (CENATE)

Conference: SC '21

Acceptance Rates: Overall acceptance rate of 1,516 of 6,373 submissions, 24%

Article Metrics

• Downloads (last 12 months): 282
• Downloads (last 6 weeks): 37

Reflects downloads up to 23 Dec 2024

Cited By

• GrOUT: Transparent Scale-Out to Overcome UVM's Oversubscription Slowdowns. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 696-705. DOI: 10.1109/IPDPSW63119.2024.00132 (27 May 2024)
• Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies. Proceedings of the Eighteenth European Conference on Computer Systems, 867-882. DOI: 10.1145/3552326.3567505 (8 May 2023)
• Enabling Efficient Random Access to Hierarchically Compressed Text Data on Diverse GPU Platforms. IEEE Transactions on Parallel and Distributed Systems 34(10), 2699-2717. DOI: 10.1109/TPDS.2023.3294341 (1 Oct 2023)
• Privacy-preserving Job Scheduler for GPU Sharing. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW), 337-339. DOI: 10.1109/CCGridW59191.2023.00077 (May 2023)
• MISO. Proceedings of the 13th Symposium on Cloud Computing, 173-189. DOI: 10.1145/3542929.3563510 (7 Nov 2022)
• PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service. ACM Transactions on Architecture and Code Optimization 19(3), 1-27. DOI: 10.1145/3524129 (22 Aug 2022)
• Optimizing Random Access to Hierarchically-Compressed Data on GPU. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1109/SC41404.2022.00023 (Nov 2022)
• LC-MEMENTO: A Memory Model for Accelerated Architectures. Languages and Compilers for Parallel Computing, 67-82. DOI: 10.1007/978-3-030-99372-6_5 (13 Oct 2021)