research-article
Open Access
QuMan: Profile-based Improvement of Cluster Utilization
Article No.: 27, Pages 1–25, https://doi.org/10.1145/3210560

Modern data centers consolidate workloads to increase server utilization, reduce total cost of ownership, and cope with scaling limitations. However, server resource sharing introduces performance interference across applications and, consequently, ...

research-article
Open Access
LAPPS: Locality-Aware Productive Prefetching Support for PGAS
Article No.: 28, Pages 1–26, https://doi.org/10.1145/3233299

Prefetching is a well-known technique to mitigate scalability challenges in the Partitioned Global Address Space (PGAS) model. It has been studied as either an automated compiler optimization or a manual programmer optimization. Using the PGAS locality ...

research-article
Open Access
BestSF: A Sparse Meta-Format for Optimizing SpMV on GPU
Article No.: 29, Pages 1–27, https://doi.org/10.1145/3226228

The Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats were proposed to improve this kernel on the recent GPU architectures. However, ...

research-article
Open Access
An Alternative TAGE-like Conditional Branch Predictor
Article No.: 30, Pages 1–23, https://doi.org/10.1145/3226098

TAGE is one of the most accurate conditional branch predictors known today. However, TAGE does not exploit its input information perfectly, as it is possible to obtain significant prediction accuracy improvements by complementing TAGE with a statistical ...

research-article
Open Access
Low Complexity Multiply-Accumulate Units for Convolutional Neural Networks with Weight-Sharing
Article No.: 31, Pages 1–24, https://doi.org/10.1145/3233300

Convolutional neural networks (CNNs) are one of the most successful machine-learning techniques for image, voice, and video processing. CNNs require large amounts of processing capacity and memory bandwidth. Hardware accelerators have been proposed for ...

research-article
Open Access
CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems
Article No.: 32, Pages 1–23, https://doi.org/10.1145/3232521

To exploit parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques that have been used to hide memory latency and improve thread-level parallelism (TLP), memory ...

research-article
Open Access
Global Dead-Block Management for Task-Parallel Programs
Article No.: 33, Pages 1–25, https://doi.org/10.1145/3234337

Task-parallel programs inefficiently utilize the cache hierarchy due to the presence of dead blocks in caches. Dead blocks may occupy cache space in multiple cache levels for a long time without providing any utility until they are finally evicted. ...

research-article
Open Access
High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach
Article No.: 34, Pages 1–27, https://doi.org/10.1145/3235029

The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized ...

research-article
Open Access
Cluster Programming using the OpenMP Accelerator Model
Article No.: 35, Pages 1–23, https://doi.org/10.1145/3226112

Computation offloading is a programming model in which program fragments (e.g., hot loops) are annotated so that their execution is performed in dedicated hardware or accelerator devices. Although offloading has been extensively used to move computation ...

research-article
Open Access
Block Cooperation: Advancing Lifetime of Resistive Memories by Increasing Utilization of Error Correcting Codes
Article No.: 36, Pages 1–26, https://doi.org/10.1145/3243906

Block-level cooperation is an endurance management technique that operates on top of error correction mechanisms to extend memory lifetimes. Once an error recovery scheme fails to recover from faults in a data block, the entire physical page associated ...

research-article
Open Access
Layer-Centric Memory Reuse and Data Migration for Extreme-Scale Deep Learning on Many-Core Architectures
Article No.: 37, Pages 1–26, https://doi.org/10.1145/3243904

Due to the popularity of Deep Neural Network (DNN) models, we have witnessed extreme-scale DNN models with the continued increase of the scale in terms of depth and width. However, the extremely high memory requirements for them make it difficult to run ...

research-article
Open Access
Software-Directed Techniques for Improved GPU Register File Utilization
Article No.: 38, Pages 1–23, https://doi.org/10.1145/3243905

Throughput architectures such as GPUs require substantial hardware resources to hold the state of a massive number of simultaneously executing threads. While GPU register files are already enormous, reaching capacities of 256KB per streaming ...

research-article
Open Access
On-GPU Thread-Data Remapping for Branch Divergence Reduction
Article No.: 39, Pages 1–24, https://doi.org/10.1145/3242089

General Purpose GPU computing (GPGPU) plays an increasingly vital role in high performance computing and other areas like deep learning. However, arising from the SIMD execution model, the branch divergence issue lowers efficiency of conditional ...
