Search

Searched The ACM Guide to Computing Literature (3,764,881 records)|Limit your search to The ACM Full-Text Collection (758,134 records)

Showing 1 - 20 of 61 Results

research-article
Open Access
June 2024
Distributed Ranges: A Model for Distributed Data Structures, Algorithms, and Views
ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, Pages 236–246, https://doi.org/10.1145/3650200.3656632

Data structures and algorithms are essential building blocks for programs, and distributed data structures, which automatically partition data across multiple memory locales, are essential to writing high-level parallel programs. While many projects ...
Metrics
Total Citations: 0
Total Downloads: 172
Last 12 Months: 172
Last 6 weeks: 31

research-article
Open Access
June 2024
Fasor: A Fast Tensor Program Optimization Framework for Efficient DNN Deployment
ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, Pages 498–510, https://doi.org/10.1145/3650200.3656631

With the growing importance of deploying deep neural networks (DNNs), there are increasing demands to improve both the efficiency and quality of tensor program optimization (TPO). TPO involves searching for possible program transformations for a given ...
Metrics
Total Citations: 0
Total Downloads: 246
Last 12 Months: 246
Last 6 weeks: 71

research-article
Open Access
June 2024
RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs
ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, Pages 225–235, https://doi.org/10.1145/3650200.3656623

Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement various asynchronous, RDMA-based sparse times dense (SpMM) and sparse times sparse (SpGEMM) ...
Metrics
Total Citations: 0
Total Downloads: 236
Last 12 Months: 236
Last 6 weeks: 59

research-article
June 2022
Efficient, out-of-memory sparse MTTKRP on massively parallel architectures
ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing, Article No.: 26, Pages 1–13, https://doi.org/10.1145/3524059.3532363

Tensor decomposition (TD) is an important method for extracting latent information from high-dimensional (multi-modal) sparse data. This study presents a novel framework for accelerating fundamental TD operations on massively parallel GPU architectures. ...
Metrics
Total Citations: 6
Total Downloads: 147
Last 12 Months: 52
Last 6 weeks: 7

research-article
June 2021
ALTO: adaptive linearized storage of sparse tensors
ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing, Pages 404–416, https://doi.org/10.1145/3447818.3461703

The analysis of high-dimensional sparse data is becoming increasingly popular in many important domains. However, real-world sparse tensors are challenging to process due to their irregular shapes and data distributions. We propose the Adaptive ...
Metrics
Total Citations: 14
Total Downloads: 374
Last 12 Months: 53
Last 6 weeks: 5

research-article
June 2019
Software combining to mitigate multithreaded MPI contention
ICS '19: Proceedings of the ACM International Conference on Supercomputing, Pages 367–379, https://doi.org/10.1145/3330345.3330378

Efforts to mitigate lock contention from concurrent threaded accesses to MPI have reduced contention through fine-grained locking, avoided locking altogether by offloading communication to dedicated threads, or alleviated negative side effects from ...
Metrics
Total Citations: 5
Total Downloads: 238
Last 12 Months: 12
Last 6 weeks: 0

research-article
June 2019
Diligent TLBs: a mechanism for exploiting heterogeneity in TLB miss behavior
ICS '19: Proceedings of the ACM International Conference on Supercomputing, Pages 195–205, https://doi.org/10.1145/3330345.3330363

Modern workloads such as graph analytics, sparse matrix multiplication, and in-memory key-value stores use very large datasets and typically have non-uniform memory access patterns which defy traditional concepts of locality. Moreover, many of these ...
Metrics
Total Citations: 3
Total Downloads: 243
Last 12 Months: 23
Last 6 weeks: 9

research-article
June 2017
Toward Full Specialization of the HPC Software Stack: Reconciling Application Containers and Lightweight Multi-kernels
ROSS '17: Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2017, Article No.: 7, Pages 1–8, https://doi.org/10.1145/3095770.3095777

Application containers enable users to have greater control of their user-space execution environment by bundling application code with all the necessary libraries in a single software package. Lightweight multi-kernels leverage multi-core CPUs to run ...
Metrics
Total Citations: 3
Total Downloads: 228
Last 12 Months: 12
Last 6 weeks: 3

research-article
June 2017
HPAT: high performance analytics with scripting ease-of-use
ICS '17: Proceedings of the International Conference on Supercomputing, Article No.: 9, Pages 1–10, https://doi.org/10.1145/3079079.3079099

Big data analytics requires high programmer productivity and high performance simultaneously on large-scale clusters. However, current big data analytics frameworks (e.g. Apache Spark) have prohibitive runtime overheads since they are library-based. We ...
Metrics
Total Citations: 9
Total Downloads: 450
Last 12 Months: 5
Last 6 weeks: 0

research-article
June 2016
A Multi-Kernel Survey for High-Performance Computing
ROSS '16: Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers, Article No.: 5, Pages 1–8, https://doi.org/10.1145/2931088.2931092

In HPC, two trends have led to the emergence and popularity of an operating-system approach in which multiple kernels are run simultaneously on each compute node. The first trend has been the increase in complexity of the HPC software environment, which ...
Metrics
Total Citations: 5
Total Downloads: 432
Last 12 Months: 21
Last 6 weeks: 3

research-article
June 2016
Simulation and Analysis Engine for Scale-Out Workloads
ICS '16: Proceedings of the 2016 International Conference on Supercomputing, Article No.: 22, Pages 1–13, https://doi.org/10.1145/2925426.2926293

We introduce a system-level Simulation and Analysis Engine (SAE) framework based on dynamic binary instrumentation for fine-grained and customizable instruction-level introspection of everything that executes on the processor. SAE can instrument the ...
Metrics
Total Citations: 11
Total Downloads: 174
Last 12 Months: 7
Last 6 weeks: 1

research-article
June 2015
Analyzing System Calls in Multi-OS Hierarchical Environments
ROSS '15: Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers, Article No.: 6, Pages 1–8, https://doi.org/10.1145/2768405.2768411

As supercomputers progress to exascale computing and beyond, the number of nodes in a supercomputer is continuing to increase, and the number of cores within a node is dramatically increasing. The amount of parallelism argues for hierarchical approaches ...
Metrics
Total Citations: 4
Total Downloads: 135
Last 12 Months: 5
Last 6 weeks: 1

research-article
June 2015
Exploring the Design Space of Combining Linux with Lightweight Kernels for Extreme Scale Computing
ROSS '15: Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers, Article No.: 5, Pages 1–8, https://doi.org/10.1145/2768405.2768410

As systems sizes increase to exascale and beyond, there is a need to enhance the system software to meet the needs and challenges of applications. The evolutionary versus revolutionary debate can be set aside by providing system software that ...
Metrics
Total Citations: 11
Total Downloads: 277
Last 12 Months: 16
Last 6 weeks: 0

research-article
June 2015
History-Assisted Adaptive-Granularity Caches (HAAG$) for High Performance 3D DRAM Architectures
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, Pages 251–261, https://doi.org/10.1145/2751205.2751227

3D-stacked DRAM has the potential to provide high performance and large capacity memory for future high performance computing systems and datacenters, and the integration of a dedicated logic die opens up opportunities for architectural enhancements ...
Metrics
Total Citations: 2
Total Downloads: 240
Last 12 Months: 8
Last 6 weeks: 3

research-article
June 2015
Optimizing Overlapped Memory Accesses in User-directed Vectorization
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, Pages 393–404, https://doi.org/10.1145/2751205.2751224

Current processors incorporate wide and powerful vector units whose optimal exploitation is crucial to reach peak performance. However, present autovectorizing compilers fall short of that goal. Exploiting some vector instructions requires aggressive ...
Metrics
Total Citations: 7
Total Downloads: 251
Last 12 Months: 6
Last 6 weeks: 4

research-article
June 2014
Automatic SMT threading for OpenMP applications on the Intel Xeon Phi co-processor
ROSS '14: Proceedings of the 4th International Workshop on Runtime and Operating Systems for Supercomputers, Article No.: 7, Pages 1–7, https://doi.org/10.1145/2612262.2612268

Simultaneous multithreading is a technique that can improve performance when running parallel applications on the Intel Xeon Phi co-processor. Selecting the most efficient thread count is however non-trivial, as the potential increase in efficiency has ...
Metrics
Total Citations: 9
Total Downloads: 176
Last 12 Months: 14
Last 6 weeks: 0

research-article
June 2014
Hybrid MPI: a case study on the Xeon Phi platform
ROSS '14: Proceedings of the 4th International Workshop on Runtime and Operating Systems for Supercomputers, Article No.: 6, Pages 1–8, https://doi.org/10.1145/2612262.2612267

New many-core architectures such as Intel Xeon Phi offer applications significantly higher power efficiency than conventional multi-core processors. However, while this processor's compute and communication performance is an excellent match for MPI ...
Metrics
Total Citations: 3
Total Downloads: 200
Last 12 Months: 7
Last 6 weeks: 0

research-article
June 2014
mOS: an architecture for extreme-scale operating systems
ROSS '14: Proceedings of the 4th International Workshop on Runtime and Operating Systems for Supercomputers, Article No.: 2, Pages 1–8, https://doi.org/10.1145/2612262.2612263

Linux®, or more specifically, the Linux API, plays a key role in HPC computing. Even for extreme-scale computing, a known and familiar API is required for production machines. However, an off-the-shelf Linux distribution faces challenges at extreme ...
Metrics
Total Citations: 45
Total Downloads: 572
Last 12 Months: 17
Last 6 weeks: 5

research-article
June 2014
Author's retrospective for biomedical image analysis on a cooperative cluster of gpus and multicores
ACM International Conference on Supercomputing 25th Anniversary Volume, Pages 82–84, https://doi.org/10.1145/2591635.2591670

The last six years have seen Moore's Law continue to produce incredible gains in computational power. Indeed, the November 2007 list of the top ten fastest supercomputers in the world contained no machines with acceleration of any kind. The same list ...
Metrics
Total Citations: 2
Total Downloads: 98
Last 12 Months: 0
Last 6 weeks: 0

research-article
June 2014
Author retrospective for bloom filtering cache misses for accurate data speculation and prefetching
ACM International Conference on Supercomputing 25th Anniversary Volume, Pages 65–67, https://doi.org/10.1145/2591635.2591664

In this paper, we provide the authors' retrospective analysis of the paper "Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching", which was published in the proceedings of the 2002 International Conference on Supercomputing.

DOI: http:/...
Metrics
Total Citations: 0
Total Downloads: 81
Last 12 Months: 0
Last 6 weeks: 0
