TACO: Vol 8, No 4

Volume 8, Issue 4, January 2012: Special Issue on High-Performance Embedded Architectures and Compilers


Publisher: Association for Computing Machinery, New York, NY, United States

ISSN: 1544-3566

EISSN: 1544-3973


introduction

Open Access

Introduction to the special issue on high-performance and embedded architectures and compilers

Article No.: 18, Pages 1–2, https://doi.org/10.1145/2086696.2086697

research-article

Open Access

ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache

Article No.: 19, Pages 1–20, https://doi.org/10.1145/2086696.2086698

Hardware data prefetch is a well-known technique for hiding memory latencies. However, in a multicore system fitted with a shared Last-Level Cache (LLC), prefetch induced by a core consumes common resources such as shared cache space and main ...

research-article

Open Access

An architecture-independent instruction shuffler to protect against side-channel attacks

Article No.: 20, Pages 1–19, https://doi.org/10.1145/2086696.2086699

Embedded cryptographic systems, such as smart cards, require secure implementations that are robust to a variety of low-level attacks. Side-Channel Attacks (SCA) exploit information such as power consumption, electromagnetic radiation and acoustic ...

research-article

Open Access

Approximate graph clustering for program characterization

Article No.: 21, Pages 1–21, https://doi.org/10.1145/2086696.2086700

An important aspect of system optimization research is the discovery of program traits or behaviors. In this paper, we present an automated method of program characterization which is able to examine and cluster program graphs, i.e., dynamic data graphs ...

research-article

Open Access

Bahurupi: A polymorphic heterogeneous multi-core architecture

Article No.: 22, Pages 1–21, https://doi.org/10.1145/2086696.2086701

Computing systems have made an irreversible transition towards parallel architectures with the emergence of multi-cores. Moreover, power and thermal limits in embedded systems mandate the deployment of many simpler cores rather than a few complex cores ...

research-article

Open Access

Compiler mitigations for time attacks on modern x86 processors

Article No.: 23, Pages 1–20, https://doi.org/10.1145/2086696.2086702

This paper studies and evaluates the extent to which automated compiler techniques can defend against timing-based side channel attacks on modern x86 processors. We study how modern x86 processors can leak timing information through side channels that ...

research-article

Open Access

Compiler techniques to improve dynamic branch prediction for indirect jump and call instructions

Article No.: 24, Pages 1–20, https://doi.org/10.1145/2086696.2086703

Indirect jump instructions are used to implement multiway branch statements and virtual function calls in object-oriented languages. Branch behavior can have significant impact on program performance, but fortunately hardware predictors can alleviate ...

research-article

Open Access

DAPSCO: Distance-aware partially shared cache organization

Article No.: 25, Pages 1–19, https://doi.org/10.1145/2086696.2086704

Many-core tiled CMP proposals often assume a partially shared last level cache (LLC) since this provides a good compromise between access latency and cache utilization. In this paper, we propose a novel way to map memory addresses to LLC banks that ...

research-article

Open Access

On-the-fly structure splitting for heap objects

Article No.: 26, Pages 1–20, https://doi.org/10.1145/2086696.2086705

With the advent of multicore systems, the gap between processor speed and memory latency has grown worse because of their complex interconnect. Sophisticated techniques are needed more than ever to improve an application's spatial and temporal locality. ...

research-article

Open Access

Efficient liveness computation using merge sets and DJ-graphs

Article No.: 27, Pages 1–18, https://doi.org/10.1145/2086696.2086706

In this work we devise an efficient algorithm that computes the liveness information of program variables. The algorithm employs SSA form and DJ-graphs as representation to build Merge sets. The Merge set of node n, M(n), is based on the structure of the ...

research-article

Open Access

Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era

Article No.: 28, Pages 1–21, https://doi.org/10.1145/2086696.2086707

Extracting high memory-level parallelism (MLP) is essential for speeding up single-threaded applications which are memory bound. At the same time, the projected amount of dark silicon (the fraction of the chip powered off) on a chip is growing. Hence, ...

research-article

Open Access

Exploring the limits of GPGPU scheduling in control flow bound applications

Article No.: 29, Pages 1–22, https://doi.org/10.1145/2086696.2086708

GPGPUs are optimized for graphics; for that reason, the hardware is optimized for massively data parallel applications characterized by predictable memory access patterns and little control flow. For such applications, e.g., matrix multiplication, GPGPU ...

research-article

Open Access

FlexSig: Implementing flexible hardware signatures

Article No.: 30, Pages 1–20, https://doi.org/10.1145/2086696.2086709

With the advent of chip multiprocessors, new techniques have been developed to make parallel programming easier and more reliable. New parallel programming paradigms and new methods of making the execution of programs more efficient and more reliable have ...

research-article

Open Access

Hardware transactional memory with software-defined conflicts

Article No.: 31, Pages 1–20, https://doi.org/10.1145/2086696.2086710

In this paper we investigate the benefits of turning the concept of transactional conflict from its traditionally fixed definition into a variable one that can be dynamically controlled in software. We propose the extension of the atomic language ...

research-article

Open Access

Improving performance of nested loops on reconfigurable array processors

Article No.: 32, Pages 1–23, https://doi.org/10.1145/2086696.2086711

Pipelining algorithms are typically concerned with improving only the steady-state performance, or the kernel time. The pipeline setup time happens only once and therefore can be negligible compared to the kernel time. However, for Coarse-Grained ...

research-article

Open Access

Making wide-issue VLIW processors viable on FPGAs

Article No.: 33, Pages 1–16, https://doi.org/10.1145/2086696.2086712

Soft and highly-customized processors are emerging as a common way to efficiently control the large amount of computing resources available on FPGAs. Yet, some processor architectures of choice for DSP and media applications, such as wide-issue VLIW ...

research-article

Open Access

On the evaluation of the impact of shared resources in multithreaded COTS processors in time-critical environments

Article No.: 34, Pages 1–25, https://doi.org/10.1145/2086696.2086713

Commercial Off-The-Shelf (COTS) processors are now commonly used in real-time embedded systems. The characteristics of these processors fulfill system requirements in terms of time-to-market, low cost, and high performance-per-watt ratio. However, ...

research-article

Open Access

Non-monopolizable caches: Low-complexity mitigation of cache side channel attacks

Article No.: 35, Pages 1–21, https://doi.org/10.1145/2086696.2086714

We propose a flexibly-partitioned cache design that either drastically weakens or completely eliminates cache-based side channel attacks. The proposed Non-Monopolizable (NoMo) cache dynamically reserves cache lines for active threads and prevents other ...

research-article

Open Access

On the simulation of large-scale architectures using multiple application abstraction levels

Article No.: 36, Pages 1–20, https://doi.org/10.1145/2086696.2086715

Simulation is a key tool for computer architecture research. In particular, cycle-accurate simulators are extremely important for microarchitecture exploration and detailed design decisions, but they are slow and, so, not suitable for simulating large-...

research-article

Open Access

Optimizing explicit data transfers for data parallel applications on the cell architecture

Article No.: 37, Pages 1–20, https://doi.org/10.1145/2086696.2086716

In this paper we investigate a general approach to automate some deployment decisions for a certain class of applications on multi-core computers. We consider data-parallelizable programs that use the well-known double buffering technique to bring the ...

research-article

Open Access

PLDS: Partitioning linked data structures for parallelism

Article No.: 38, Pages 1–21, https://doi.org/10.1145/2086696.2086717

Recently, parallelization of computations in the presence of dynamic data structures has shown promising potential. In this paper, we present PLDS, a system for easily expressing and efficiently exploiting parallelism in computations that are based on ...

research-article

Open Access

Polyhedral parallelization of binary code

Article No.: 39, Pages 1–21, https://doi.org/10.1145/2086696.2086718

Many automatic software parallelization systems have been proposed in the past decades, but most of them are dedicated to source-to-source transformations. This paper shows that parallelizing executable programs is feasible, even if they require complex ...

research-article

Open Access

ReNIC: Architectural extension to SR-IOV I/O virtualization for efficient replication

Article No.: 40, Pages 1–22, https://doi.org/10.1145/2086696.2086719

Virtualization is gaining popularity in cloud computing and has become the key enabling technology in cloud infrastructure. By replicating the virtual server state to multiple independent platforms, virtualization improves the reliability and ...

research-article

Open Access

Sabrewing: A lightweight architecture for combined floating-point and integer arithmetic

Article No.: 41, Pages 1–22, https://doi.org/10.1145/2086696.2086720

In spite of the fact that floating-point arithmetic is costly in terms of silicon area, the joint design of hardware for floating-point and integer arithmetic is seldom considered. While components like multipliers and adders can potentially be shared, ...

research-article

Open Access

Seamlessly portable applications: Managing the diversity of modern heterogeneous systems

Article No.: 42, Pages 1–20, https://doi.org/10.1145/2086696.2086721

Nowadays, many possible configurations of heterogeneous systems exist, posing several new challenges to application development: different types of processing units usually require individual programming models with dedicated runtime systems and ...

research-article

Open Access

SYRANT: SYmmetric resource allocation on not-taken and taken paths

Article No.: 43, Pages 1–20, https://doi.org/10.1145/2086696.2086722

In the multicore era, achieving ultimate single process performance is still an issue, e.g., for single process workloads or for sequential sections in parallel applications. Unfortunately, despite tremendous research effort on branch prediction, ...

research-article

Open Access

The gradient-based cache partitioning algorithm

Article No.: 44, Pages 1–21, https://doi.org/10.1145/2086696.2086723

This paper addresses the problem of partitioning a cache between multiple concurrent threads and in the presence of hardware prefetching. Cache replacement designed to preserve temporal locality (e.g., LRU) will allocate cache resources proportional to ...

research-article

Open Access

The migration prefetcher: Anticipating data promotion in dynamic NUCA caches

Article No.: 45, Pages 1–20, https://doi.org/10.1145/2086696.2086724

The exponential increase in multicore processor (CMP) cache sizes, accompanied by growing on-chip wire delays, makes it difficult to implement traditional caches with a single, uniform access latency. Non-Uniform Cache Architecture (NUCA) designs have been ...

research-article

Open Access

Thread Tranquilizer: Dynamically reducing performance variation

Article No.: 46, Pages 1–21, https://doi.org/10.1145/2086696.2086725

To realize the performance potential of multicore systems, we must effectively manage the interactions between memory reference behavior and the operating system policies for thread scheduling and migration decisions. We observe that these interactions ...

research-article

Open Access

TL-plane-based multi-core energy-efficient real-time scheduling algorithm for sporadic tasks

Article No.: 47, Pages 1–20, https://doi.org/10.1145/2086696.2086726

As the energy consumption of multi-core systems becomes increasingly prominent, it's a challenge to design an energy-efficient real-time scheduling algorithm in multi-core systems for reducing the system energy consumption while guaranteeing the ...

ACM Transactions on Architecture and Code Optimization
