Towards Enhanced System Efficiency while Mitigating Row Hammer
In recent years, DRAM-based main memories have become susceptible to the Row Hammer (RH) problem, which causes bits to flip in a row without accessing them directly. Frequent activation of a row, called an aggressor row, causes its adjacent rows’ (...
All-gather Algorithms Resilient to Imbalanced Process Arrival Patterns
Two novel algorithms for the all-gather operation resilient to imbalanced process arrival patterns (PATs) are presented. The first one, Background Disseminated Ring (BDR), is based on the regular parallel ring algorithm often supplied in MPI ...
Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks
The systolic array architecture is one of the most popular choices for convolutional neural network hardware accelerators. The biggest advantage of the systolic array architecture is its simple and efficient design principle. Without complicated control ...
SLO-Aware Inference Scheduler for Heterogeneous Processors in Edge Platforms
With the proliferation of applications with machine learning (ML), the importance of edge platforms has been growing to process streaming sensor data locally without resorting to remote servers. Such edge platforms are commonly equipped with ...
Gem5-X: A Many-core Heterogeneous Simulation Platform for Architectural Exploration and Optimization
The increasing adoption of smart systems in our daily life has led to the development of new applications with varying performance and energy constraints, and suitable computing architectures need to be developed for these new applications. In this ...
PICO: A Presburger In-bounds Check Optimization for Compiler-based Memory Safety Instrumentations
Memory safety violations such as buffer overflows are a threat to security to this day. A common solution to ensure memory safety for C is code instrumentation. However, this often causes high execution-time overhead and is therefore rarely used in ...
Low I/O Intensity-aware Partial GC Scheduling to Reduce Long-tail Latency in SSDs
- Zhibing Sha,
- Jun Li,
- Lihao Song,
- Jiewen Tang,
- Min Huang,
- Zhigang Cai,
- Lianju Qian,
- Jianwei Liao,
- Zhiming Liu
This article proposes a low I/O intensity-aware scheduling scheme on garbage collection (GC) in SSDs for minimizing the I/O long-tail latency to ensure I/O responsiveness. The basic idea is to assemble partial GC operations by referring to several ...
Low-precision Logarithmic Number Systems: Beyond Base-2
Logarithmic number systems (LNS) are used to represent real numbers in many applications using a constant base raised to a fixed-point exponent making its distribution exponential. This greatly simplifies hardware multiply, divide, and square root. LNS ...
Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache
- Candace Walden,
- Devesh Singh,
- Meenatchi Jagasivamani,
- Shang Li,
- Luyi Kang,
- Mehdi Asnaashari,
- Sylvain Dubois,
- Bruce Jacob,
- Donald Yeung
Many emerging non-volatile memories are compatible with CMOS logic, potentially enabling their integration into a CPU’s die. This article investigates such monolithically integrated CPU–main memory chips. We exploit non-volatile memories employing 3D ...
Byte-Select Compression
- Matthew Tomei,
- Shomit Das,
- Mohammad Seyedzadeh,
- Philip Bedoukian,
- Bradford Beckmann,
- Rakesh Kumar,
- David Wood
Cache-block compression is a highly effective technique for both reducing accesses to lower levels in the memory hierarchy (cache compression) and minimizing data transfers (link compression). While many effective cache-block compression algorithms have ...
CIB-HIER: Centralized Input Buffer Design in Hierarchical High-radix Routers
Hierarchical organization is widely used in high-radix routers to enable efficient scaling to higher switch port count. A general-purpose hierarchical router must be symmetrically designed with the same input buffer depth, resulting in a large amount of ...
Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation
- Tobias Gysi,
- Christoph Müller,
- Oleksandr Zinenko,
- Stephan Herhut,
- Eddie Davis,
- Tobias Wicky,
- Oliver Fuhrer,
- Torsten Hoefler,
- Tobias Grosser
Most compilers have a single core intermediate representation (IR) (e.g., LLVM) sometimes complemented with vaguely defined IR-like data structures. This IR is commonly low-level and close to machine instructions. As a result, optimizations relying on ...
System-level Early-stage Modeling and Evaluation of IVR-assisted Processor Power Delivery System
Despite being employed in numerous efforts to improve power delivery efficiency, the integrated voltage regulator (IVR) approach has yet to be evaluated rigorously and quantitatively in a full power delivery system (PDS) setting. To fulfill this need, ...
GraphAttack: Optimizing Data Supply for Graph Applications on In-Order Multicore Architectures
Graph structures are a natural representation of important and pervasive data. While graph applications have significant parallelism, their characteristic pointer indirect loads to neighbor data hinder scalability to large datasets on multicore systems. ...
Scenario-Aware Program Specialization for Timing Predictability
The successful application of static program analysis strongly depends on flow facts of a program such as loop bounds, control-flow constraints, and operating modes. This problem heavily affects the design of real-time systems, since static program ...
WaFFLe: Gated Cache-Ways with Per-Core Fine-Grained DVFS for Reduced On-Chip Temperature and Leakage Consumption
Managing thermal imbalance in contemporary chip multi-processors (CMPs) is crucial in assuring functional correctness of modern mobile as well as server systems. Localized regions with high activity, e.g., register files, ALUs, FPUs, and so on, ...
SortCache: Intelligent Cache Management for Accelerating Sparse Data Workloads
Sparse data applications have irregular access patterns that stymie modern memory architectures. Although hyper-sparse workloads have received considerable attention in the past, moderately-sparse workloads prevalent in machine learning applications, ...
Device Hopping: Transparent Mid-Kernel Runtime Switching for Heterogeneous Systems
Existing OS techniques for homogeneous many-core systems make it simple for single and multithreaded applications to migrate between cores. Heterogeneous systems do not benefit so fully from this flexibility, and applications that cannot migrate in mid-...
LargeGraph: An Efficient Dependency-Aware GPU-Accelerated Large-Scale Graph Processing
Many out-of-GPU-memory systems have recently been designed to support iterative processing of large-scale graphs. However, these systems still take a long time to converge because of inefficient propagation of active vertices’ new states along graph ...
Spiking Neural Networks in Spintronic Computational RAM
- Hüsrev Cılasun,
- Salonik Resch,
- Zamshed I. Chowdhury,
- Erin Olson,
- Masoud Zabihi,
- Zhengyang Zhao,
- Thomas Peterson,
- Keshab K. Parhi,
- Jian-Ping Wang,
- Sachin S. Sapatnekar,
- Ulya R. Karpuzcu
Spiking Neural Networks (SNNs) represent a biologically inspired computation model capable of emulating neural computation in the human brain and brain-like structures. The main promise is very low energy consumption. Classic Von Neumann architecture based ...