- research-article, February 2025
Integrating ORNL's HPC and Neutron Facilities with a Performance-Portable CPU/GPU Ecosystem
- Steven E. Hahn,
- Philip W. Fackler,
- William F. Godoy,
- Ketan Maheshwari,
- Zachary Morgan,
- Andrei T. Savici,
- Christina M. Hoffmann,
- Pedro Valero-Lara,
- Jeffrey S. Vetter,
- Rafael Ferreira da Silva
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 2107–2117
https://doi.org/10.1109/SCW63240.2024.00264
We explore the development of a performance-portable CPU/GPU ecosystem to integrate two of the US Department of Energy's (DOE's) largest scientific instruments, the Oak Ridge Leadership Computing Facility and the Spallation Neutron Source (SNS), both of ...
- research-article, February 2025
Productive, Vendor-Neutral GPU Programming Using Chapel
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 1914–1922
https://doi.org/10.1109/SCW63240.2024.00241
The HPC programming ecosystem is mostly based on the sequential C/C++/Fortran languages. These fundamental languages are then augmented with frameworks such as MPI and OpenMP to enable different types of parallelism. The increased prevalence of GPUs in HPC and ...
Portability of Fortran's 'do concurrent' on GPUs
- Ronald M. Caplan,
- Miko M. Stulajter,
- Jon A. Linker,
- Jeff Larkin,
- Henry A. Gabb,
- Shiquan Su,
- Ivan Rodriguez,
- Zachary Tschirhart,
- Nicholas Malaya
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 1904–1913
https://doi.org/10.1109/SCW63240.2024.00240
There is a continuing interest in using standard language constructs for accelerated computing in order to avoid (sometimes vendor-specific) external APIs. For Fortran codes, the do concurrent (DC) loop has been successfully demonstrated on the NVIDIA ...
- research-article, February 2025
Copper: Cooperative Caching Layer for Scalable Data Loading in Exascale Supercomputers
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 1320–1329
https://doi.org/10.1109/SCW63240.2024.00173
Job initialization time of dynamic executables increases as HPC jobs launch on a larger number of nodes and processes. This is due to the processes flooding the storage system with a tremendous number of I/O requests for the same files, leading to ...
Mitigating synchronization bottlenecks in high-performance actor-model-based software
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 1274–1287
https://doi.org/10.1109/SCW63240.2024.00168
Bulk synchronous programming (in distributed-memory systems) and the fork-join pattern (in shared-memory systems) are often used for problems where independent processes must periodically synchronize. Frequent synchronization can greatly undermine the ...
- research-article, February 2025
Accelerating Multi-GPU Embedding Retrieval with PGAS-Style Communication for Deep Learning Recommendation Systems
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 1262–1273
https://doi.org/10.1109/SCW63240.2024.00167
In this paper, we propose using Partitioned Global Address Space (PGAS) GPU one-sided asynchronous small messages to replace the widely used collective communication calls for sparse input multi-GPU embedding retrieval in deep learning recommendation ...
- research-article, February 2025
Speeding-Up LULESH on HPX: Useful Tricks and Lessons Learned using a Many-Task-Based Approach
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 1223–1235
https://doi.org/10.1109/SCW63240.2024.00164
Current programming models face challenges in dealing with modern supercomputers' growing parallelism and heterogeneity. Emerging programming models, like the task-based programming model found in the asynchronous many-task HPX programming framework, ...
- research-article, February 2025
Performance Portable Optimizations of an Ice-sheet Modeling Code on GPU-supercomputers
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 1141–1151
https://doi.org/10.1109/SCW63240.2024.00156
In this paper, we present GPU optimizations for an ice-sheet modeling code known as MPAS-Albany Land Ice (MALI). MALI is a C++ template code that leverages the Kokkos programming model for portability and the Trilinos library for data structures, ...
- research-article, February 2025
Optimizing MILC-Dslash Performance on NVIDIA A100 GPU: Parallel Strategies using SYCL
- Amanda S. Dufek,
- Steven A. Gottlieb,
- Muaaz Gul Awan,
- Douglas Adriano Augusto,
- Jack Deslippe,
- Brandon Cook
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 1106–1116
https://doi.org/10.1109/SCW63240.2024.00151
MILC-Dslash is a benchmark derived from the MILC code, which simulates lattice gauge theory on a four-dimensional hypercube. This paper outlines a gradual progression in increasing the granularity of parallelism in the MILC-Dslash kernel using the ...
- research-article, February 2025
Sum Reduction with OpenMP Offload on NVIDIA Grace-Hopper System
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 1006–1013
https://doi.org/10.1109/SCW63240.2024.00140
Sum reduction is a primitive operation in parallel computing. Using OpenMP directives that offload data and computation to a graphics processing unit (GPU), we annotate a serial sum reduction and evaluate the ...
- research-article, February 2025
ACID Support for Compute eXpress Link Memory Transactions
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 982–995
https://doi.org/10.1109/SCW63240.2024.00138
With the recent explosive growth in worldwide data and data processing demands, the need to support a large volume of transactions on shared data is increasing in both high performance computing and datacenter processing. A recent innovation in server ...
- research-article, February 2025
Parallel Runtime Interface for Fortran (PRIF): A Multi-Image Solution for LLVM Flang
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 950–960
https://doi.org/10.1109/SCW63240.2024.00134
Fortran compilers that provide support for Fortran's native parallel features often do so with a runtime library that depends on details of both the compiler implementation and the communication library, while others provide limited or no support at all. ...
- research-article, February 2025
Pragma driven shared memory parallelism in Zig by supporting OpenMP loop directives
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 930–938
https://doi.org/10.1109/SCW63240.2024.00132
The Zig programming language, which is designed to provide performance and safety as first-class concerns, has become popular in recent years. Given that Zig is built upon LLVM, and so enjoys many of the benefits provided by the ecosystem, including ...
- research-article, February 2025
Shared Memory-Aware Latency-Sensitive Message Aggregation for Fine-Grained Communication
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 682–687
https://doi.org/10.1109/SCW63240.2024.00095
Message aggregation is widely used to reduce communication cost in HPC applications. The order-of-magnitude gap between the fixed overhead of sending a message and the cost per byte transferred motivates message aggregation, for several ...
- research-article, February 2025
Offloaded MPI message matching: an optimistic approach
- Jerónimo S. García,
- Salvatore Di Girolamo,
- Sokol Kosta,
- J. J. Vegas Olmos,
- Rami Nudelman,
- Torsten Hoefler,
- Gil Bloch
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 457–469
https://doi.org/10.1109/SCW63240.2024.00067
Message matching is a critical process ensuring the correct delivery of messages in distributed and HPC environments. The advent of SmartNICs presents an opportunity to develop offloaded message-matching approaches that leverage this on-NIC programmable ...
- research-article, February 2025
Modes, Persistence and Orthogonality: Blowing MPI Up
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
Pages 404–413
https://doi.org/10.1109/SCW63240.2024.00061
The Message-Passing Interface (MPI) specification provides a restricted form of persistence in point-to-point and collective communication operations that purportedly enables libraries to amortize precomputation and setup costs over longer sequences of ...
- research-article, February 2025
Introduction to Parallel and Distributed Programming using N-Body Simulations
SC-W '24: Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and AnalysisPages 347–354https://doi.org/10.1109/SCW63240.2024.00052This paper describes how we use n-body simulations as an interesting and visually compelling way to teach efficient, parallel, and distributed programming. Our first course focuses on bachelor students introducing them to algorithmic complexities and ...