QuMan: Profile-based Improvement of Cluster Utilization
Modern data centers consolidate workloads to increase server utilization, reduce total cost of ownership, and cope with scaling limitations. However, server resource sharing introduces performance interference across applications and, consequently, ...
LAPPS: Locality-Aware Productive Prefetching Support for PGAS
Prefetching is a well-known technique to mitigate scalability challenges in the Partitioned Global Address Space (PGAS) model. It has been studied as either an automated compiler optimization or a manual programmer optimization. Using the PGAS locality ...
BestSF: A Sparse Meta-Format for Optimizing SpMV on GPU
The Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats have been proposed to improve this kernel on recent GPU architectures. However, ...
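For context on what a "sparse format" means for the SpMV kernel, here is a minimal sketch of SpMV over CSR (Compressed Sparse Row), one common format among those a meta-format like BestSF would choose between; this is illustrative only, not the paper's implementation.

```python
# Illustrative SpMV kernel over the CSR format: values, column indices,
# and per-row offsets. Other formats (ELL, COO, HYB, ...) trade off
# storage and memory-access regularity differently on GPUs.

def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 3x3 matrix [[1,0,2],[0,3,0],[4,0,5]] in CSR form:
values  = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_spmv(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```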
An Alternative TAGE-like Conditional Branch Predictor
TAGE is one of the most accurate conditional branch predictors known today. However, TAGE does not exploit its input information perfectly, as it is possible to obtain significant prediction accuracy improvements by complementing TAGE with a statistical ...
Low Complexity Multiply-Accumulate Units for Convolutional Neural Networks with Weight-Sharing
Convolutional neural networks (CNNs) are one of the most successful machine-learning techniques for image, voice, and video processing. CNNs require large amounts of processing capacity and memory bandwidth. Hardware accelerators have been proposed for ...
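To illustrate why weight-sharing can lower multiply-accumulate complexity: when weights are quantized to a small codebook, inputs that share a weight value can be summed first, leaving only one multiply per codebook entry. This is a hedged, hypothetical sketch of that general idea, not the hardware design proposed in the paper.

```python
# Accumulate-then-multiply dot product under weight-sharing.
# Function and variable names here are illustrative, not from the paper.

def shared_weight_mac(inputs, weight_ids, codebook):
    """Dot product of `inputs` with weights given by codebook[weight_ids[i]]."""
    sums = [0.0] * len(codebook)      # one accumulator per distinct weight
    for x, wid in zip(inputs, weight_ids):
        sums[wid] += x                # additions only, no multiply yet
    # One multiply per codebook entry instead of one per input.
    return sum(s * w for s, w in zip(sums, codebook))

codebook   = [-0.5, 0.25, 1.0]        # 3 shared weight values
weight_ids = [2, 0, 1, 2]             # each input's codebook index
inputs     = [1.0, 2.0, 4.0, 3.0]
# Reference: 1*1.0 + 2*(-0.5) + 4*0.25 + 3*1.0 = 4.0
print(shared_weight_mac(inputs, weight_ids, codebook))  # 4.0
```

With n inputs and a k-entry codebook, this needs n additions but only k multiplies, which is the arithmetic saving a weight-sharing MAC unit exploits in hardware.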
CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems
Hyojong Kim, Ramyad Hadidi, Lifeng Nai, Hyesoon Kim, Nuwan Jayasena, Yasuko Eckert, Onur Kayiran, Gabriel Loh
To exploit parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques that have been used to hide memory latency and improve thread-level parallelism (TLP), memory ...
Global Dead-Block Management for Task-Parallel Programs
Task-parallel programs inefficiently utilize the cache hierarchy due to the presence of dead blocks in caches. Dead blocks may occupy cache space in multiple cache levels for a long time without providing any utility until they are finally evicted. ...
High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach
The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized ...
Cluster Programming using the OpenMP Accelerator Model
Computation offloading is a programming model in which program fragments (e.g., hot loops) are annotated so that their execution is performed in dedicated hardware or accelerator devices. Although offloading has been extensively used to move computation ...
Block Cooperation: Advancing Lifetime of Resistive Memories by Increasing Utilization of Error Correcting Codes
Block-level cooperation is an endurance management technique that operates on top of error correction mechanisms to extend memory lifetimes. Once an error recovery scheme fails to recover from faults in a data block, the entire physical page associated ...
Layer-Centric Memory Reuse and Data Migration for Extreme-Scale Deep Learning on Many-Core Architectures
Due to the popularity of Deep Neural Network (DNN) models, we have witnessed extreme-scale DNN models whose depth and width continue to grow. However, their extremely high memory requirements make it difficult to run ...
Software-Directed Techniques for Improved GPU Register File Utilization
Throughput architectures such as GPUs require substantial hardware resources to hold the state of a massive number of simultaneously executing threads. While GPU register files are already enormous, reaching capacities of 256KB per streaming ...
On-GPU Thread-Data Remapping for Branch Divergence Reduction
General Purpose GPU computing (GPGPU) plays an increasingly vital role in high performance computing and other areas such as deep learning. However, the branch divergence issue, which arises from the SIMD execution model, lowers the efficiency of conditional ...
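The core idea behind thread-data remapping can be illustrated with a toy model: within each fixed-width "warp", a branch is divergent when lanes disagree on the predicate, and regrouping work items by branch outcome removes that divergence. This simulation is a sketch under assumed names (`WARP`, `divergent_warps`), not the paper's on-GPU mechanism.

```python
# Toy model: count warps whose lanes take different branch paths,
# before and after remapping (sorting) work items by branch outcome.

WARP = 4  # assumed warp width for the illustration

def divergent_warps(predicates):
    """Number of WARP-sized groups whose predicates are not uniform."""
    count = 0
    for i in range(0, len(predicates), WARP):
        warp = predicates[i:i + WARP]
        if len(set(warp)) > 1:        # lanes disagree -> both paths execute
            count += 1
    return count

work = [7, 1, 8, 2, 9, 3, 10, 4]      # predicate: item > 5
before = divergent_warps([x > 5 for x in work])
remapped = sorted(work, key=lambda x: x > 5)   # group same-outcome items
after = divergent_warps([x > 5 for x in remapped])
print(before, after)  # 2 0
```

In the unsorted order every warp mixes both outcomes, so all lanes serialize through both branch paths; after remapping, each warp is uniform and executes only one path.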