ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache
Hardware data prefetch is a very well known technique for hiding memory latencies. However, in a multicore system fitted with a shared Last-Level Cache (LLC), prefetch induced by a core consumes common resources such as shared cache space and main ...
An architecture-independent instruction shuffler to protect against side-channel attacks
Embedded cryptographic systems, such as smart cards, require secure implementations that are robust to a variety of low-level attacks. Side-Channel Attacks (SCA) exploit information such as power consumption, electromagnetic radiation and acoustic ...
Approximate graph clustering for program characterization
An important aspect of system optimization research is the discovery of program traits or behaviors. In this paper, we present an automated method of program characterization which is able to examine and cluster program graphs, i.e., dynamic data graphs ...
Bahurupi: A polymorphic heterogeneous multi-core architecture
Computing systems have made an irreversible transition towards parallel architectures with the emergence of multi-cores. Moreover, power and thermal limits in embedded systems mandate the deployment of many simpler cores rather than a few complex cores ...
Compiler mitigations for time attacks on modern x86 processors
This paper studies and evaluates the extent to which automated compiler techniques can defend against timing-based side channel attacks on modern x86 processors. We study how modern x86 processors can leak timing information through side channels that ...
Compiler techniques to improve dynamic branch prediction for indirect jump and call instructions
Indirect jump instructions are used to implement multiway branch statements and virtual function calls in object-oriented languages. Branch behavior can have significant impact on program performance, but fortunately hardware predictors can alleviate ...
DAPSCO: Distance-aware partially shared cache organization
Many-core tiled CMP proposals often assume a partially shared last level cache (LLC) since this provides a good compromise between access latency and cache utilization. In this paper, we propose a novel way to map memory addresses to LLC banks that ...
On-the-fly structure splitting for heap objects
With the advent of multicore systems, the gap between processor speed and memory latency has grown worse because of their complex interconnect. Sophisticated techniques are needed more than ever to improve an application's spatial and temporal locality. ...
Efficient liveness computation using merge sets and DJ-graphs
In this work we devise an efficient algorithm that computes the liveness information of program variables. The algorithm employs SSA form and DJ-graphs as representation to build Merge sets. The Merge set of node n, M(n), is based on the structure of the ...
Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era
Extracting high memory-level parallelism (MLP) is essential for speeding up single-threaded applications which are memory bound. At the same time, the projected amount of dark silicon (the fraction of the chip powered off) on a chip is growing. Hence, ...
Exploring the limits of GPGPU scheduling in control flow bound applications
GPGPUs are optimized for graphics; for that reason, the hardware is optimized for massively data parallel applications characterized by predictable memory access patterns and little control flow. For such applications, e.g., matrix multiplication, GPGPU ...
FlexSig: Implementing flexible hardware signatures
With the advent of chip multiprocessors, new techniques have been developed to make parallel programming easier and more reliable. New parallel programming paradigms and new methods of making the execution of programs more efficient and more reliable have ...
Hardware transactional memory with software-defined conflicts
- Ruben Titos-Gil,
- Manuel E. Acacio,
- Jose M. Garcia,
- Tim Harris,
- Adrian Cristal,
- Osman Unsal,
- Ibrahim Hur,
- Mateo Valero
In this paper we investigate the benefits of turning the concept of transactional conflict from its traditionally fixed definition into a variable one that can be dynamically controlled in software. We propose the extension of the atomic language ...
Improving performance of nested loops on reconfigurable array processors
Pipelining algorithms are typically concerned with improving only the steady-state performance, or the kernel time. The pipeline setup time happens only once and therefore can be negligible compared to the kernel time. However, for Coarse-Grained ...
Making wide-issue VLIW processors viable on FPGAs
Soft and highly-customized processors are emerging as a common way to efficiently control the large amount of computing resources available on FPGAs. Yet, some processor architectures of choice for DSP and media applications, such as wide-issue VLIW ...
On the evaluation of the impact of shared resources in multithreaded COTS processors in time-critical environments
Commercial Off-The-Shelf (COTS) processors are now commonly used in real-time embedded systems. The characteristics of these processors fulfill system requirements in terms of time-to-market, low cost, and high performance-per-watt ratio. However, ...
Non-monopolizable caches: Low-complexity mitigation of cache side channel attacks
We propose a flexibly-partitioned cache design that either drastically weakens or completely eliminates cache-based side channel attacks. The proposed Non-Monopolizable (NoMo) cache dynamically reserves cache lines for active threads and prevents other ...
On the simulation of large-scale architectures using multiple application abstraction levels
- Alejandro Rico,
- Felipe Cabarcas,
- Carlos Villavieja,
- Milan Pavlovic,
- Augusto Vega,
- Yoav Etsion,
- Alex Ramirez,
- Mateo Valero
Simulation is a key tool for computer architecture research. In particular, cycle-accurate simulators are extremely important for microarchitecture exploration and detailed design decisions, but they are slow and, so, not suitable for simulating large-...
Optimizing explicit data transfers for data parallel applications on the cell architecture
In this paper we investigate a general approach to automate some deployment decisions for a certain class of applications on multi-core computers. We consider data-parallelizable programs that use the well-known double buffering technique to bring the ...
PLDS: Partitioning linked data structures for parallelism
Recently, parallelization of computations in the presence of dynamic data structures has shown promising potential. In this paper, we present PLDS, a system for easily expressing and efficiently exploiting parallelism in computations that are based on ...
Polyhedral parallelization of binary code
Many automatic software parallelization systems have been proposed in the past decades, but most of them are dedicated to source-to-source transformations. This paper shows that parallelizing executable programs is feasible, even if they require complex ...
ReNIC: Architectural extension to SR-IOV I/O virtualization for efficient replication
Virtualization is gaining popularity in cloud computing and has become the key enabling technology in cloud infrastructure. By replicating the virtual server state to multiple independent platforms, virtualization improves the reliability and ...
Sabrewing: A lightweight architecture for combined floating-point and integer arithmetic
In spite of the fact that floating-point arithmetic is costly in terms of silicon area, the joint design of hardware for floating-point and integer arithmetic is seldom considered. While components like multipliers and adders can potentially be shared, ...
Seamlessly portable applications: Managing the diversity of modern heterogeneous systems
Nowadays, many possible configurations of heterogeneous systems exist, posing several new challenges to application development: different types of processing units usually require individual programming models with dedicated runtime systems and ...
SYRANT: SYmmetric resource allocation on not-taken and taken paths
In the multicore era, achieving ultimate single process performance is still an issue, e.g., for single process workloads or for sequential sections in parallel applications. Unfortunately, despite tremendous research effort on branch prediction, ...
The gradient-based cache partitioning algorithm
This paper addresses the problem of partitioning a cache between multiple concurrent threads and in the presence of hardware prefetching. Cache replacement designed to preserve temporal locality (e.g., LRU) will allocate cache resources proportional to ...
The migration prefetcher: Anticipating data promotion in dynamic NUCA caches
The exponential increase in multicore processor (CMP) cache sizes accompanied by growing on-chip wire delays make it difficult to implement traditional caches with a single, uniform access latency. Non-Uniform Cache Architecture (NUCA) designs have been ...
Thread Tranquilizer: Dynamically reducing performance variation
To realize the performance potential of multicore systems, we must effectively manage the interactions between memory reference behavior and the operating system policies for thread scheduling and migration decisions. We observe that these interactions ...
TL-plane-based multi-core energy-efficient real-time scheduling algorithm for sporadic tasks
As the energy consumption of multi-core systems becomes increasingly prominent, it is a challenge to design an energy-efficient real-time scheduling algorithm for multi-core systems that reduces the system energy consumption while guaranteeing the ...