PERI: A Configurable Posit Enabled RISC-V Core
Owing to the breakdown of Dennard scaling, the past decade has seen rapid growth in new paradigms that exploit opportunities in computer architecture. Two technologies of interest are Posit and RISC-V. Posit was introduced in mid-2017 as a ...
MC-DeF: Creating Customized CGRAs for Dataflow Applications
Executing complex scientific applications on Coarse-Grain Reconfigurable Arrays (CGRAs) promises improvements in execution time and/or energy consumption compared to optimized software implementations or even fully customized hardware solutions. Typical ...
Acceleration of Parallel-Blocked QR Decomposition of Tall-and-Skinny Matrices on FPGAs
QR decomposition is one of the most useful factorization kernels in modern numerical linear algebra algorithms. In particular, the decomposition of tall-and-skinny matrices (TSMs) has major applications in areas including scientific computing, machine ...
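As a quick illustration of the factorization discussed above (a generic NumPy sketch, not the paper's FPGA kernel or its parallel-blocked algorithm), a tall-and-skinny matrix has far more rows than columns, and its reduced QR factorization yields a small square triangular factor:

```python
import numpy as np

# Tall-and-skinny matrix: 1000 rows, only 8 columns.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 8))

# Reduced QR factorization: Q is 1000x8 with orthonormal columns,
# R is a small 8x8 upper-triangular matrix.
Q, R = np.linalg.qr(A)

assert np.allclose(Q @ R, A)                        # A = QR
assert np.allclose(Q.T @ Q, np.eye(8), atol=1e-10)  # orthonormal columns
assert np.allclose(R, np.triu(R))                   # R is upper triangular
```

Because R is tiny relative to A, TSM-oriented schemes can factor row blocks independently and combine their R factors, which is what makes the kernel attractive for parallel hardware.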
Decreasing the Miss Rate and Eliminating the Performance Penalty of a Data Filter Cache
While data filter caches (DFCs) have been shown to be effective at reducing data access energy, they have not been adopted in processors due to the associated performance penalty caused by high DFC miss rates. In this article, we present a design that ...
Performance Evaluation of Intel Optane Memory for Managed Workloads
Intel Optane memory offers non-volatility, byte addressability, and high capacity. It suits managed workloads that prefer large main memory heaps. We investigate Optane as the main memory for managed (Java) workloads, focusing on performance ...
GraphPEG: Accelerating Graph Processing on GPUs
Due to massive thread-level parallelism, GPUs have become an attractive platform for accelerating large-scale data parallel computations, such as graph processing. However, achieving high performance for graph processing with GPUs is non-trivial. ...
PRISM: Strong Hardware Isolation-based Soft-Error Resilient Multicore Architecture with High Performance and Availability at Low Hardware Overheads
Multicores increasingly deploy safety-critical parallel applications that demand resilience against soft errors to satisfy safety standards. However, protection against these errors is challenging due to complex communication and data access ...
PAVER: Locality Graph-Based Thread Block Scheduling for GPUs
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache sizes per thread, leading to serious cache contention problems such as thrashing. Hence, the data access locality of an application should be considered during thread ...
Automatic Sublining for Efficient Sparse Memory Accesses
Sparse memory accesses, which are scattered accesses to single elements of a large data structure, are a challenge for current processor architectures. Their lack of spatial and temporal locality and their irregularity make caches and traditional ...
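The access pattern described above can be pictured as a gather: single elements read from a large array at irregular positions. This is a minimal NumPy sketch of the pattern (illustrative only; the array sizes and indices are made up, and this is not the paper's sublining technique):

```python
import numpy as np

# A large data structure...
data = np.arange(1_000_000, dtype=np.int64)

# ...accessed one element at a time at irregular positions.
# Neighbouring indices are far apart, so each access likely touches
# a different cache line (no spatial locality to exploit).
idx = np.array([3, 901_234, 17, 512_000])

gathered = data[idx]  # a gather: one scattered read per index
assert gathered.tolist() == [3, 901234, 17, 512000]
```

Each gathered element pulls in a full cache line but uses only a few bytes of it, which is why such accesses waste bandwidth on conventional cache hierarchies.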
Fast Key-Value Lookups with Node Tracker
Lookup operations for in-memory databases are heavily memory bound, because they often rely on pointer-chasing traversals of linked data structures. They also have many branches that are hard to predict due to random key lookups. In this study, we show that ...
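The pointer-chasing traversal mentioned above can be sketched with a chained hash map: every lookup hops through a linked list of nodes, and each hop is a dependent load that cannot begin until the previous node arrives. This is a generic illustration of why such lookups are memory bound, not the paper's Node Tracker design:

```python
class Node:
    """One bucket entry; `next` is the pointer the lookup must chase."""
    __slots__ = ("key", "value", "next")

    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt


class ChainedHashMap:
    def __init__(self, nbuckets=8):
        self.buckets = [None] * nbuckets

    def put(self, key, value):
        i = hash(key) % len(self.buckets)
        self.buckets[i] = Node(key, value, self.buckets[i])

    def get(self, key):
        node = self.buckets[hash(key) % len(self.buckets)]
        while node is not None:      # pointer chase: each iteration is a
            if node.key == key:      # dependent load plus a data-dependent
                return node.value    # branch, both hard to predict
            node = node.next
        return None


m = ChainedHashMap()
m.put("a", 1)
m.put("b", 2)
assert m.get("a") == 1 and m.get("b") == 2
assert m.get("missing") is None
```

In hardware terms, each `node.next` dereference stalls until the previous cache miss resolves, which is the latency that node-tracking or prefetching schemes try to hide.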
CacheInspector: Reverse Engineering Cache Resources in Public Clouds
- Weijia Song,
- Christina Delimitrou,
- Zhiming Shen,
- Robbert Van Renesse,
- Hakim Weatherspoon,
- Lotfi Benmohamed,
- Frederic De Vaulx,
- Charif Mahmoudi
Infrastructure-as-a-Service cloud providers sell virtual machines that are specified only in terms of the number of CPU cores, the amount of memory, and I/O throughput. Performance-critical aspects such as cache sizes and memory latency are missing or reported ...
Understanding Cache Compression
Hardware cache compression derives from software-compression research; yet, its implementation is not a straightforward translation, since it must abide by multiple restrictions to comply with area, power, and latency constraints. This study sheds light ...
Flynn’s Reconciliation: Automating the Register Cache Idiom for Cross-accelerator Programming
A large portion of the recent performance increase in the High Performance Computing (HPC) and Machine Learning (ML) domains is fueled by accelerator cards. Many popular ML frameworks support accelerators by organizing computations as a computational ...
KernelFaRer: Replacing Native-Code Idioms with High-Performance Library Calls
- João P. L. De Carvalho,
- Braedy Kuzma,
- Ivan Korostelev,
- José Nelson Amaral,
- Christopher Barton,
- José Moreira,
- Guido Araujo
Well-crafted libraries deliver much higher performance than code generated by sophisticated application programmers using advanced optimizing compilers. When a code pattern for which a well-tuned library implementation exists is found in the source code ...
Early Address Prediction: Efficient Pipeline Prefetch and Reuse
Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register ...