- research-article, June 2024
Distributed Ranges: A Model for Distributed Data Structures, Algorithms, and Views
- Benjamin Brock,
- Robert Cohn,
- Suyash Bakshi,
- Tuomas Karna,
- Jeongnim Kim,
- Mateusz Nowak,
- Łukasz Ślusarczyk,
- Kacper Stefanski,
- Timothy G. Mattson
ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, Pages 236–246, https://doi.org/10.1145/3650200.3656632
Data structures and algorithms are essential building blocks for programs, and distributed data structures, which automatically partition data across multiple memory locales, are essential to writing high-level parallel programs. While many projects ...
- research-article, June 2024
Fasor: A Fast Tensor Program Optimization Framework for Efficient DNN Deployment
ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, Pages 498–510, https://doi.org/10.1145/3650200.3656631
With the growing importance of deploying deep neural networks (DNNs), there are increasing demands to improve both the efficiency and quality of tensor program optimization (TPO). TPO involves searching for possible program transformations for a given ...
- research-article, June 2024
RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs
ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, Pages 225–235, https://doi.org/10.1145/3650200.3656623
Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement various asynchronous, RDMA-based sparse times dense (SpMM) and sparse times sparse (SpGEMM) ...
- research-article, June 2022
Efficient, out-of-memory sparse MTTKRP on massively parallel architectures
- Andy Nguyen,
- Ahmed E. Helal,
- Fabio Checconi,
- Jan Laukemann,
- Jesmin Jahan Tithi,
- Yongseok Soh,
- Teresa Ranadive,
- Fabrizio Petrini,
- Jee W. Choi
ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing, Article No.: 26, Pages 1–13, https://doi.org/10.1145/3524059.3532363
Tensor decomposition (TD) is an important method for extracting latent information from high-dimensional (multi-modal) sparse data. This study presents a novel framework for accelerating fundamental TD operations on massively parallel GPU architectures. ...
- research-article, June 2021
ALTO: adaptive linearized storage of sparse tensors
- Ahmed E. Helal,
- Jan Laukemann,
- Fabio Checconi,
- Jesmin Jahan Tithi,
- Teresa Ranadive,
- Fabrizio Petrini,
- Jeewhan Choi
ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing, Pages 404–416, https://doi.org/10.1145/3447818.3461703
The analysis of high-dimensional sparse data is becoming increasingly popular in many important domains. However, real-world sparse tensors are challenging to process due to their irregular shapes and data distributions. We propose the Adaptive ...
- research-article, June 2019
Software combining to mitigate multithreaded MPI contention
- Abdelhalim Amer,
- Charles Archer,
- Michael Blocksome,
- Chongxiao Cao,
- Michael Chuvelev,
- Hajime Fujita,
- Maria Garzaran,
- Yanfei Guo,
- Jeff R. Hammond,
- Shintaro Iwasaki,
- Kenneth J. Raffenetti,
- Mikhail Shiryaev,
- Min Si,
- Kenjiro Taura,
- Sagar Thapaliya,
- Pavan Balaji
ICS '19: Proceedings of the ACM International Conference on Supercomputing, Pages 367–379, https://doi.org/10.1145/3330345.3330378
Efforts to mitigate lock contention from concurrent threaded accesses to MPI have reduced contention through fine-grained locking, avoided locking altogether by offloading communication to dedicated threads, or alleviated negative side effects from ...
- research-article, June 2019
Diligent TLBs: a mechanism for exploiting heterogeneity in TLB miss behavior
ICS '19: Proceedings of the ACM International Conference on Supercomputing, Pages 195–205, https://doi.org/10.1145/3330345.3330363
Modern workloads such as graph analytics, sparse matrix multiplication, and in-memory key-value stores use very large datasets and typically have non-uniform memory access patterns which defy traditional concepts of locality. Moreover, many of these ...
- research-article, June 2017
Toward Full Specialization of the HPC Software Stack: Reconciling Application Containers and Lightweight Multi-kernels
ROSS '17: Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2017, Article No.: 7, Pages 1–8, https://doi.org/10.1145/3095770.3095777
Application containers enable users to have greater control of their user-space execution environment by bundling application code with all the necessary libraries in a single software package. Lightweight multi-kernels leverage multi-core CPUs to run ...
- research-article, June 2017
HPAT: high performance analytics with scripting ease-of-use
ICS '17: Proceedings of the International Conference on Supercomputing, Article No.: 9, Pages 1–10, https://doi.org/10.1145/3079079.3079099
Big data analytics requires high programmer productivity and high performance simultaneously on large-scale clusters. However, current big data analytics frameworks (e.g. Apache Spark) have prohibitive runtime overheads since they are library-based. We ...
- research-article, June 2016
A Multi-Kernel Survey for High-Performance Computing
ROSS '16: Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers, Article No.: 5, Pages 1–8, https://doi.org/10.1145/2931088.2931092
In HPC, two trends have led to the emergence and popularity of an operating-system approach in which multiple kernels are run simultaneously on each compute node. The first trend has been the increase in complexity of the HPC software environment, which ...
- research-article, June 2016
Simulation and Analysis Engine for Scale-Out Workloads
ICS '16: Proceedings of the 2016 International Conference on Supercomputing, Article No.: 22, Pages 1–13, https://doi.org/10.1145/2925426.2926293
We introduce a system-level Simulation and Analysis Engine (SAE) framework based on dynamic binary instrumentation for fine-grained and customizable instruction-level introspection of everything that executes on the processor. SAE can instrument the ...
- research-article, June 2015
Analyzing System Calls in Multi-OS Hierarchical Environments
ROSS '15: Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers, Article No.: 6, Pages 1–8, https://doi.org/10.1145/2768405.2768411
As supercomputers progress to exascale computing and beyond, the number of nodes in a supercomputer is continuing to increase, and the number of cores within a node is dramatically increasing. The amount of parallelism argues for hierarchical approaches ...
- research-article, June 2015
Exploring the Design Space of Combining Linux with Lightweight Kernels for Extreme Scale Computing
ROSS '15: Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers, Article No.: 5, Pages 1–8, https://doi.org/10.1145/2768405.2768410
As systems sizes increase to exascale and beyond, there is a need to enhance the system software to meet the needs and challenges of applications. The evolutionary versus revolutionary debate can be set aside by providing system software that ...
- research-article, June 2015
History-Assisted Adaptive-Granularity Caches (HAAG$) for High Performance 3D DRAM Architectures
- Ke Chen,
- Sheng Li,
- Jung Ho Ahn,
- Naveen Muralimanohar,
- Jishen Zhao,
- Cong Xu,
- Seongil O,
- Yuan Xie,
- Jay B. Brockman,
- Norman P. Jouppi
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, Pages 251–261, https://doi.org/10.1145/2751205.2751227
3D-stacked DRAM has the potential to provide high performance and large capacity memory for future high performance computing systems and datacenters, and the integration of a dedicated logic die opens up opportunities for architectural enhancements ...
- research-article, June 2015
Optimizing Overlapped Memory Accesses in User-directed Vectorization
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, Pages 393–404, https://doi.org/10.1145/2751205.2751224
Current processors incorporate wide and powerful vector units whose optimal exploitation is crucial to reach peak performance. However, present autovectorizing compilers fall short of that goal. Exploiting some vector instructions requires aggressive ...
- research-article, June 2014
Automatic SMT threading for OpenMP applications on the Intel Xeon Phi co-processor
ROSS '14: Proceedings of the 4th International Workshop on Runtime and Operating Systems for Supercomputers, Article No.: 7, Pages 1–7, https://doi.org/10.1145/2612262.2612268
Simultaneous multithreading is a technique that can improve performance when running parallel applications on the Intel Xeon Phi co-processor. Selecting the most efficient thread count is however non-trivial, as the potential increase in efficiency has ...
- research-article, June 2014
Hybrid MPI: a case study on the Xeon Phi platform
ROSS '14: Proceedings of the 4th International Workshop on Runtime and Operating Systems for Supercomputers, Article No.: 6, Pages 1–8, https://doi.org/10.1145/2612262.2612267
New many-core architectures such as Intel Xeon Phi offer applications significantly higher power efficiency than conventional multi-core processors. However, while this processor's compute and communication performance is an excellent match for MPI ...
- research-article, June 2014
mOS: an architecture for extreme-scale operating systems
ROSS '14: Proceedings of the 4th International Workshop on Runtime and Operating Systems for Supercomputers, Article No.: 2, Pages 1–8, https://doi.org/10.1145/2612262.2612263
Linux®, or more specifically, the Linux API, plays a key role in HPC computing. Even for extreme-scale computing, a known and familiar API is required for production machines. However, an off-the-shelf Linux distribution faces challenges at extreme ...
- research-article, June 2014
Author's retrospective for biomedical image analysis on a cooperative cluster of gpus and multicores
ACM International Conference on Supercomputing 25th Anniversary Volume, Pages 82–84, https://doi.org/10.1145/2591635.2591670
The last six years has seen Moore's Law continue to produce incredible gains in computational power. Indeed, the November, 2007 list of the top ten fastest supercomputers in the world contained no machines with acceleration of any kind. The same list ...
- research-article, June 2014
Author retrospective for bloom filtering cache misses for accurate data speculation and prefetching
ACM International Conference on Supercomputing 25th Anniversary Volume, Pages 65–67, https://doi.org/10.1145/2591635.2591664
In this paper, we provide the authors' retrospective analysis of the paper "Bloom Filtering Cache Misses for Accurate Data Speculative and Prefetching" which was published in the proceedings of 2002 International Conference on Supercomputing.
DOI: http:/...