- research-article, November 2023
Democratizing HPC Access and Use with Knowledge Graphs
SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, November 2023, Pages 243–251, https://doi.org/10.1145/3624062.3624094
The field of High-Performance Computing (HPC) is undergoing rapid evolution, with an expanding and diverse user base harnessing its unparalleled computational capabilities. As the range of HPC applications grows, newcomers to the field are faced with the ...
- Article, May 2023
SAI: AI-Enabled Speech Assistant Interface for Science Gateways in HPC
- Pouya Kousha,
- Arpan Jain,
- Ayyappa Kolli,
- Matthew Lieber,
- Mingzhe Han,
- Nicholas Contini,
- Hari Subramoni,
- Dhabaleswar K. Panda
Abstract: High-Performance Computing (HPC) is increasingly being used in traditional scientific domains as well as emerging areas like Deep Learning (DL). This has led to a diverse set of professionals who interact with state-of-the-art HPC systems. The ...
- Article, May 2022
Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters
- Qinghua Zhou,
- Pouya Kousha,
- Quentin Anthony,
- Kawthar Shafie Khorassani,
- Aamir Shafi,
- Hari Subramoni,
- Dhabaleswar K. Panda
Abstract: As more High-Performance Computing (HPC) and Deep Learning (DL) applications are adapting to scale using GPUs, the communication of GPU-resident data is becoming vital to end-to-end application performance. Among the available MPI operations in ...
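The idea of compressing per-destination buffers on the fly before an all-to-all exchange can be sketched host-side in a few lines. The zlib codec, payloads, and helper names below are illustrative stand-ins, not the paper's GPU-side design:

```python
import zlib

def alltoall_compressed(send_bufs):
    """Simulate an all-to-all where each rank's per-destination buffer is
    compressed just before the (here: in-memory) exchange and decompressed
    on receipt. send_bufs[i][j] is the bytes rank i sends to rank j."""
    n = len(send_bufs)
    # "Online" compression: compress each outgoing buffer at send time.
    wire = [[zlib.compress(send_bufs[i][j]) for j in range(n)] for i in range(n)]
    # Exchange: rank j receives column j, decompressing each message.
    recv_bufs = [[zlib.decompress(wire[i][j]) for i in range(n)] for j in range(n)]
    return recv_bufs, wire

# Highly compressible payloads, as often seen in gradient-style DL traffic.
bufs = [[bytes([i * 4 + j]) * 4096 for j in range(4)] for i in range(4)]
recv, wire = alltoall_compressed(bufs)
assert recv[2][1] == bufs[1][2]           # rank 2 got what rank 1 sent it
assert len(wire[0][0]) < len(bufs[0][0])  # less data crossed the "wire"
```

Whether the compression pays off depends on payload entropy and codec speed, which is exactly the trade-off an online scheme must win on.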
- research-article, July 2020
Frontera: The Evolution of Leadership Computing at the National Science Foundation
PEARC '20: Practice and Experience in Advanced Research Computing 2020: Catch the Wave, July 2020, Pages 106–111, https://doi.org/10.1145/3311790.3396656
As part of the NSF's cyberinfrastructure vision for a robust mix of high capability and capacity HPC systems, Frontera represents the most recent evolution of trans-petascale resources available to all open science research projects in the U.S. Debuting ...
- research-article, June 2020
NV-group: link-efficient reduction for distributed deep learning on modern dense GPU systems
- Ching-Hsiang Chu,
- Pouya Kousha,
- Ammar Ahmad Awan,
- Kawthar Shafie Khorassani,
- Hari Subramoni,
- Dhabaleswar K. (D K) Panda
ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing, June 2020, Article No.: 6, Pages 1–12, https://doi.org/10.1145/3392717.3392771
The advanced fabrics like NVIDIA NVLink are enabling the deployment of dense Graphics Processing Unit (GPU) systems such as DGX-2 and Summit. With the wide adoption of large-scale GPU-enabled systems for distributed deep learning (DL) training, it is ...
Cooperative rendezvous protocols for improved performance and overlap
SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, November 2018, Article No.: 28, Pages 1–13, https://doi.org/10.1109/SC.2018.00031
With the emergence of larger multi-/many-core clusters and new areas of HPC applications, performance of large message communication is becoming more important. MPI libraries use different rendezvous protocols to perform large message communication. ...
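For readers new to the term: a rendezvous protocol defers the payload transfer of a large message until the receiver has agreed to accept it. The sketch below shows the generic eager-vs-rendezvous split with an RTS/CTS handshake; the threshold and class names are illustrative, not the cooperative designs this paper proposes:

```python
EAGER_THRESHOLD = 1024  # bytes; illustrative cutoff between the two protocols

class Receiver:
    def __init__(self):
        self.buffers = []

    def on_rts(self, size):
        # On ready-to-send, post a buffer of the announced size, then grant.
        self.buffers.append(bytearray(size))
        return "cts"

    def on_data(self, payload):
        self.buffers[-1][:] = payload

def send(message, receiver):
    """Toy two-sided send: small messages go eagerly; large ones do an
    RTS/CTS handshake so the receiver can prepare before the transfer."""
    if len(message) <= EAGER_THRESHOLD:
        receiver.buffers.append(bytearray(message))
        return "eager"
    assert receiver.on_rts(len(message)) == "cts"  # handshake round trip
    receiver.on_data(message)
    return "rendezvous"

r = Receiver()
assert send(b"x" * 100, r) == "eager"
assert send(b"y" * 4096, r) == "rendezvous"
assert bytes(r.buffers[-1]) == b"y" * 4096
```

The handshake costs a round trip but avoids unexpected-message buffering, which is why it only kicks in above a threshold.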
- research-article, July 2019
Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2?
Parallel Computing (PACO), Volume 85, Issue C, July 2019, Pages 141–152, https://doi.org/10.1016/j.parco.2019.03.005
Highlights: Propose and design new MPI_Bcast algorithms and mechanisms that provide efficient GPU-based communication across all message sizes for emerging Deep Learning ...
Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and GPU clusters with a relatively smaller number of nodes, efficient communication schemes ...
- research-article, April 2019
Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures
GPGPU '19: Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, April 2019, Pages 43–52, https://doi.org/10.1145/3300053.3319419
The CUDA Unified Memory (UM) interface enables a significantly simpler programming paradigm and has the potential to fundamentally change the way programmers write CUDA applications in the future. Although UM leads to high productivity in programming ...
- tutorial, February 2019
High performance distributed deep learning: a beginner's guide
PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, February 2019, Pages 452–454, https://doi.org/10.1145/3293883.3302260
The current wave of advances in Deep Learning (DL) has led to many exciting challenges and opportunities for Computer Science and Artificial Intelligence researchers alike. Modern DL frameworks like Caffe2, TensorFlow, Cognitive Toolkit (CNTK), PyTorch, ...
- research-article, June 2013
A 1 PB/s file system to checkpoint three million MPI tasks
HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, June 2013, Pages 143–154, https://doi.org/10.1145/2462902.2462908
With the massive scale of high-performance computing systems, long-running scientific parallel applications periodically save the state of their execution to files called checkpoints to recover from system failures. Checkpoints are stored on external ...
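The checkpoint/restart pattern the abstract describes, stripped of any parallel file system, looks like this in miniature. The JSON state, interval, and atomic-rename trick are toy choices for illustration, not this paper's 1 PB/s design:

```python
import json
import os
import tempfile

def checkpoint(state, path):
    """Atomically persist application state: write to a temp file, then
    rename over the target, so a crash mid-write can never corrupt the
    last good checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic on POSIX and Windows

def restore(path):
    with open(path) as f:
        return json.load(f)

ckpt = os.path.join(tempfile.mkdtemp(), "app.ckpt")
for step in range(5):
    state = {"step": step, "x": step * step}
    if step % 2 == 0:  # checkpoint every 2nd iteration
        checkpoint(state, ckpt)
# Simulated failure after step 4: recover from the last checkpoint taken.
assert restore(ckpt) == {"step": 4, "x": 16}
```

At supercomputer scale the hard part is not this logic but the aggregate write bandwidth it demands, which is the problem the paper attacks.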
- research-article, September 2018
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
EuroMPI '18: Proceedings of the 25th European MPI Users' Group Meeting, September 2018, Article No.: 2, Pages 1–9, https://doi.org/10.1145/3236367.3236381
Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and dense multi-GPU systems, it has become important to design efficient communication schemes. This coupled with ...
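For context, the classic software broadcast that such MPI designs build on is a tree algorithm; in a binomial tree, the set of ranks holding the data doubles each round. This schedule generator is a generic textbook sketch (root fixed at rank 0), not the GPU-aware designs compared in the paper:

```python
def binomial_bcast_schedule(nranks):
    """Return per-round (src, dst) send pairs for a binomial-tree broadcast
    rooted at rank 0. The number of ranks holding the message doubles every
    round, so the broadcast finishes in ceil(log2(nranks)) rounds."""
    have = {0}      # ranks that currently hold the data
    rounds = []
    dist = 1        # partner distance doubles each round
    while len(have) < nranks:
        sends = []
        for src in sorted(have):
            dst = src ^ dist  # XOR pairing, valid for a rank-0 root
            if dst < nranks and dst not in have:
                sends.append((src, dst))
        for _, dst in sends:
            have.add(dst)
        rounds.append(sends)
        dist *= 2
    return rounds

sched = binomial_bcast_schedule(8)
assert len(sched) == 3                             # log2(8) rounds
assert sched[0] == [(0, 1)]                        # round 1: root seeds rank 1
assert {d for r in sched for _, d in r} == {1, 2, 3, 4, 5, 6, 7}
```

Hierarchical and NCCL-style ring designs trade this logarithmic depth for better link utilization on dense GPU nodes, which is the comparison the paper makes.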
- research-article, September 2018
Efficient Asynchronous Communication Progress for MPI without Dedicated Resources
- Amit Ruhela,
- Hari Subramoni,
- Sourav Chakraborty,
- Mohammadreza Bayatpour,
- Pouya Kousha,
- Dhabaleswar K. Panda
EuroMPI '18: Proceedings of the 25th European MPI Users' Group Meeting, September 2018, Article No.: 14, Pages 1–11, https://doi.org/10.1145/3236367.3236376
The overlap of computation and communication is critical for good performance of many HPC applications. State-of-the-art designs for the asynchronous progress require specially designed hardware resources (advanced switches or network interface cards), ...
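The general idea of asynchronous progress, independent of this paper's contribution of achieving it without dedicated resources, can be mimicked with a helper thread that completes queued transfers while the main thread computes. The queue-based "network" here is purely illustrative:

```python
import queue
import threading

pending = queue.Queue()   # stands in for outstanding nonblocking operations
completed = []
stop = threading.Event()

def progress_loop():
    # Helper thread: completes queued "sends" even while the main
    # thread is busy computing, then drains the queue before exiting.
    while not stop.is_set() or not pending.empty():
        try:
            completed.append(pending.get(timeout=0.01))
        except queue.Empty:
            pass

t = threading.Thread(target=progress_loop)
t.start()
for i in range(10):
    pending.put(f"msg-{i}")              # "nonblocking send": hand off and return
    _ = sum(j * j for j in range(1000))  # overlapping "computation"
stop.set()
t.join()
assert completed == [f"msg-{i}" for i in range(10)]
```

Dedicating a core to such a thread is exactly the resource cost the paper's design avoids.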
- research-article, September 2018
Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures
EuroMPI '18: Proceedings of the 25th European MPI Users' Group Meeting, September 2018, Article No.: 4, Pages 1–10, https://doi.org/10.1145/3236367.3236371
Intel Knights Landing (KNL) and IBM POWER architectures are becoming widely deployed on modern supercomputing systems due to their powerful components. The MPI Remote Memory Access (RMA) model, which provides one-sided communication semantics, has been seen as ...
- research-article, May 2013
SR-IOV support for virtualization on infiniband clusters: early experience
- Jithin Jose,
- Mingzhe Li,
- Xiaoyi Lu,
- Krishna Chaitanya Kandalla,
- Mark Daniel Arnold,
- Dhabaleswar K. (DK) Panda
CCGRID '13: Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, May 2013, Pages 385–392, https://doi.org/10.1109/CCGrid.2013.76
High Performance Computing (HPC) systems are becoming increasingly complex and are also associated with very high operational costs. The cloud computing paradigm, coupled with modern Virtual Machine (VM) technology offers attractive techniques to easily ...