- Sponsor: SIGARCH
Adaptive Locks: Combining Transactions and Locks for Efficient Concurrency
Transactional memory is being advanced as an alternative to traditional lock-based synchronization for concurrent programming. Transactional memory simplifies the programming model and maximizes concurrency. At the same time, transactions can suffer ...
Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading
- Carlos Madriles,
- Pedro Lopez,
- Josep Maria Codina,
- Enric Gibert,
- Fernando Latorre,
- Alejandro Martinez,
- Raul Martinez,
- Antonio Gonzalez
Industry is moving towards multi-core designs as we have hit the memory and power walls. Multi-core designs are very effective to exploit thread-level parallelism (TLP) but do not provide benefits when executing serial code (applications with low TLP, ...
Characterizing the TLB Behavior of Emerging Parallel Workloads on Chip Multiprocessors
Translation Lookaside Buffers (TLBs) are a staple in modern computer systems and have a significant impact on overall system performance. Numerous prior studies have addressed TLB designs to lower access times and miss rates; these, however, have been ...
Interprocedural Load Elimination for Dynamic Optimization of Parallel Programs
Load elimination is a classical compiler transformation that is increasing in importance for multi-core and many-core architectures. The effect of the transformation is to replace a memory access, such as a read of an object field or an array element, ...
Quantifying the Potential of Program Analysis Peripherals
Tools such as multi-threaded data race detectors, memory bounds checkers, dynamic type analyzers, data flight recorders, and various performance profilers are becoming increasingly vital aids to software developers. Rather than performing all the ...
Algorithmic Skeletons within an Embedded Domain Specific Language for the CELL Processor
Efficiently using the hardware capabilities of the Cell processor, a heterogeneous chip multiprocessor that uses several levels of parallelism to deliver high performance, and being able to reuse legacy code are real challenges for application ...
A Task-Centric Memory Model for Scalable Accelerator Architectures
This paper presents a task-centric memory model for 1000-core compute accelerators. Visual computing applications are emerging as an important class of workloads that can exploit 1000-core processors. In these workloads, we observe data sharing and ...
SHIP: Scalable Hierarchical Power Control for Large-Scale Data Centers
In today's data centers, precisely controlling server power consumption is an essential way to avoid system failures caused by power capacity overload or overheating due to increasingly high server density. While various power control strategies have ...
Exploring Phase Change Memory and 3D Die-Stacking for Power/Thermal Friendly, Fast and Durable Memory Architectures
Emerging three-dimensional (3D) integration technology allows for the direct placement of DRAM on top of a microprocessor, significantly reducing the wire-delay between the two and thereby alleviating memory latency and bandwidth constraints. However, ...
Core-Selectability in Chip Multiprocessors
The centralized structures necessary for the extraction of instruction-level parallelism (ILP) are consuming progressively smaller portions of the total die area of chip multiprocessors (CMP). The reason for this is that scaling these structures does ...
Chainsaw: Using Binary Matching for Relative Instruction Mix Comparison
With advances in hardware, instruction set architectures are undergoing continual evolution. As a result, compilers are under constant pressure to adapt and take full advantage of available features. However, current techniques for evaluating relative ...
tm_db: A Generic Debugging Library for Transactional Programs
Transactional Memory (TM) has received a lot of attention as a programming API for concurrent programs on emerging multicore architectures. If the transactional programming model is to realize its promise of simplifying the problem of writing correct and ...
StealthTest: Low Overhead Online Software Testing Using Transactional Memory
Software testing is hard. The emergence of multicore architectures and the proliferation of bug-prone multithreaded software make testing even harder. To this end, researchers have proposed methods to continue testing software after deployment, e.g., in ...
CPROB: Checkpoint Processing with Opportunistic Minimal Recovery
CPR (Checkpoint Processing and Recovery) is a physical register management scheme that supports a larger instruction window and higher average IPC than conventional ROB-style register management. It does so by restricting mis-speculation recovery to ...
Architecture Support for Improving Bulk Memory Copying and Initialization Performance
Bulk memory copying and initialization is one of the most ubiquitous operations performed in current computer systems by both user applications and Operating Systems. While many current systems rely on a loop of loads and stores, there are proposals to ...
Oblivious Routing in On-Chip Bandwidth-Adaptive Networks
Oblivious routing can be implemented on simple router hardware, but network performance suffers when routes become congested. Adaptive routing attempts to avoid hot spots by re-routing flows, but requires more complex hardware to determine and configure ...
Exploiting Parallelism with Dependence-Aware Scheduling
It is well known that a large fraction of applications cannot be parallelized at compile time due to unpredictable data dependences such as indirect memory accesses and/or memory accesses guarded by data-dependent conditional statements. A significant ...
ITCA: Inter-task Conflict-Aware CPU Accounting for CMPs
Chip-MultiProcessor (CMP) architectures are becoming more and more popular as an alternative to the traditional processors that only extract instruction-level parallelism from an application. CMPs introduce complexities when accounting CPU utilization. ...
Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures
Increasing demand for performance and efficiency has driven the computer industry toward multicore systems. These systems have become the industry standard in almost all segments of the computer market from high-end servers to handheld devices. In order ...
DDCache: Decoupled and Delegable Cache Data and Metadata
In order to harness the full compute power of many-core processors, future designs must focus on effective utilization of on-chip cache and bandwidth resources. In this paper, we address the dual goals of (1) reducing on-chip communication overheads and ...
Zero-Value Caches: Cancelling Loads that Return Zero
The speed gap between processor and memory continues to limit performance. To address this problem, we explore the potential of eliminating Zero Loads — loads accessing memory locations that contain the value “zero” — to improve performance and energy ...
Soft-OLP: Improving Hardware Cache Performance through Software-Controlled Object-Level Partitioning
Performance degradation of memory-intensive programs caused by the LRU policy's inability to handle weak-locality data accesses in the last level cache is increasingly serious for two reasons. First, the last-level cache remains in the CPU's critical ...
Index Terms
- Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques