DOI: 10.1145/3458817.3480853
Research article · Open access

MAPA: multi-accelerator pattern allocation policy for multi-tenant GPU servers

Published: 13 November 2021

Abstract

Multi-accelerator servers are increasingly deployed in shared multi-tenant environments, such as cloud data centers, to meet the demands of large-scale compute-intensive workloads. These accelerators are also increasingly inter-connected in complex topologies, and workloads exhibit a widening variety of inter-accelerator communication patterns. However, existing allocation policies are ill-suited to these emerging use cases. Specifically, this work identifies that multi-accelerator workloads are commonly fragmented across the topology, leading to reduced bandwidth and increased latency for inter-accelerator communication.
We propose Multi-Accelerator Pattern Allocation (MAPA), a graph-pattern-mining approach to generalized allocation of multi-accelerator workloads on multi-accelerator servers. We demonstrate that MAPA improves the execution time of multi-accelerator workloads and that its benefits generalize across various accelerator topologies. Finally, we demonstrate a 12.4% speedup for the 75th percentile of jobs, with worst-case execution time reduced by up to 35% relative to the baseline policy.
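The core idea of pattern-based allocation can be sketched as subgraph matching: treat the server's accelerator interconnect as a graph and a job's communication pattern as a smaller graph to be embedded in the currently free portion of it. The following is a minimal sketch using networkx; the 4-GPU ring topology, the `allocate` helper, and the pattern construction are illustrative assumptions, not the paper's actual policy or topology.

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Hypothetical 4-GPU server: a ring of accelerator-to-accelerator links
# (a stand-in for a real NVLink topology).
topology = nx.Graph()
topology.add_edges_from([(0, 1), (1, 2), (2, 3), (3, 0)])

def allocate(pattern, busy):
    """Map a job's communication pattern onto free GPUs, or return None.

    `pattern` is a graph whose nodes are the job's logical accelerators and
    whose edges are required direct links; `busy` is the set of occupied GPUs.
    """
    # Restrict matching to the un-allocated portion of the topology.
    free = topology.subgraph(n for n in topology if n not in busy)
    matcher = isomorphism.GraphMatcher(free, pattern)
    for mapping in matcher.subgraph_isomorphisms_iter():
        # `mapping` goes topology node -> pattern node; invert it so the
        # result reads "logical accelerator -> physical GPU".
        return {v: k for k, v in mapping.items()}
    return None  # job's pattern cannot be placed without fragmentation

# A job requesting two directly linked GPUs.
pair = nx.Graph([("a", "b")])
print(allocate(pair, busy={0}))  # e.g. {'a': 1, 'b': 2}, depending on match order
```

A real policy would additionally rank the candidate mappings (e.g. by link bandwidth or by how much future allocation flexibility each one preserves) rather than taking the first match.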

Supplementary Material

MP4 File (MAPA Multi-Accelerator Pattern Allocation Policy for Multi-Tenant GPU Servers 232 Afternoon 4.mp4)
Presentation video




Published In

SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2021, 1493 pages
ISBN: 9781450384421
DOI: 10.1145/3458817
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

In-Cooperation: IEEE CS

Publisher: Association for Computing Machinery, New York, NY, United States


      Funding Sources

      • University of Sydney
      • NSF
      • Australia Research Council (ARC)
      • U.S. Dept. of Energy's Office of Science Center for Advanced Technology Evaluation (CENATE)

Conference: SC '21

Acceptance Rates: Overall acceptance rate of 1,516 of 6,373 submissions, 24%

Article Metrics

• Downloads (last 12 months): 282
• Downloads (last 6 weeks): 37

Reflects downloads up to 23 Dec 2024

Cited By

• GrOUT: Transparent Scale-Out to Overcome UVM's Oversubscription Slowdowns. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 696-705. DOI: 10.1109/IPDPSW63119.2024.00132 (27 May 2024)
• Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies. Proceedings of the Eighteenth European Conference on Computer Systems, 867-882. DOI: 10.1145/3552326.3567505 (8 May 2023)
• Enabling Efficient Random Access to Hierarchically Compressed Text Data on Diverse GPU Platforms. IEEE Transactions on Parallel and Distributed Systems 34(10), 2699-2717. DOI: 10.1109/TPDS.2023.3294341 (1 Oct 2023)
• Privacy-preserving Job Scheduler for GPU Sharing. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW), 337-339. DOI: 10.1109/CCGridW59191.2023.00077 (May 2023)
• MISO. Proceedings of the 13th Symposium on Cloud Computing, 173-189. DOI: 10.1145/3542929.3563510 (7 Nov 2022)
• PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service. ACM Transactions on Architecture and Code Optimization 19(3), 1-27. DOI: 10.1145/3524129 (22 Aug 2022)
• Optimizing Random Access to Hierarchically-Compressed Data on GPU. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1109/SC41404.2022.00023 (Nov 2022)
• LC-MEMENTO: A Memory Model for Accelerated Architectures. Languages and Compilers for Parallel Computing, 67-82. DOI: 10.1007/978-3-030-99372-6_5 (13 Oct 2021)