- research-article, January 2024
Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine Relations
- Jie Zhao,
- Jinchen Xu,
- Peng Di,
- Wang Nie,
- Jiahui Hu,
- Yanzhi Yi,
- Sijia Yang,
- Zhen Geng,
- Renwei Zhang,
- Bojie Li,
- Zhiliang Gan,
- Xuefeng Jin
ACM Transactions on Computer Systems (TOCS), Volume 41, Issue 1-4, Article No.: 5, Pages 1–45
https://doi.org/10.1145/3635305
Loop tiling and fusion are two essential transformations in optimizing compilers to enhance the data locality of programs. Existing heuristics either perform loop tiling and fusion in a particular order, missing some of their profitable compositions, or ...
- research-article, May 2020
A Retargetable System-level DBT Hypervisor
ACM Transactions on Computer Systems (TOCS), Volume 36, Issue 4, Article No.: 14, Pages 1–24
https://doi.org/10.1145/3386161
System-level Dynamic Binary Translation (DBT) provides the capability to boot an Operating System (OS) and execute programs compiled for an Instruction Set Architecture (ISA) different from that of the host machine. Due to their performance-critical ...
- research-article, June 2019
Software Prefetching for Indirect Memory Accesses: A Microarchitectural Perspective
ACM Transactions on Computer Systems (TOCS), Volume 36, Issue 3, Article No.: 8, Pages 1–34
https://doi.org/10.1145/3319393
Many modern data processing and HPC workloads are heavily memory-latency bound. A tempting proposition to solve this is software prefetching, where special non-blocking loads are used to bring data into the cache hierarchy just before being required. ...
- research-article, January 2016
Assisting Static Compiler Vectorization with a Speculative Dynamic Vectorizer in an HW/SW Codesigned Environment
ACM Transactions on Computer Systems (TOCS), Volume 33, Issue 4, Article No.: 12, Pages 1–33
https://doi.org/10.1145/2807694
Compiler-based static vectorization is used widely to extract data-level parallelism from computation-intensive applications. Static vectorization is very effective in vectorizing traditional array-based applications. However, compilers’ inability to do ...
- research-article, August 2015
SKMD: Single Kernel on Multiple Devices for Transparent CPU-GPU Collaboration
ACM Transactions on Computer Systems (TOCS), Volume 33, Issue 3, Article No.: 9, Pages 1–27
https://doi.org/10.1145/2798725
Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data parallel work by taking advantage of its massive number of cores while the CPU handles non data-parallel work, such as the sequential code ...
- research-article, August 2014
Scaling Performance via Self-Tuning Approximation for Graphics Engines
ACM Transactions on Computer Systems (TOCS), Volume 32, Issue 3, Article No.: 7, Pages 1–29
https://doi.org/10.1145/2631913
Approximate computing, where computation accuracy is traded off for better performance or higher data throughput, is one solution that can help data processing keep pace with the current and growing abundance of information. For particular domains, such ...
- research-article, March 2008
Incrementally parallelizing database transactions with thread-level speculation
ACM Transactions on Computer Systems (TOCS), Volume 26, Issue 1, Article No.: 2, Pages 1–50
https://doi.org/10.1145/1328671.1328673
With the advent of chip multiprocessors, exploiting intratransaction parallelism in database systems is an attractive way of improving transaction performance. However, exploiting intratransaction parallelism is difficult for two reasons: first, ...
- article, August 2005
The STAMPede approach to thread-level speculation
ACM Transactions on Computer Systems (TOCS), Volume 23, Issue 3, Pages 253–300
https://doi.org/10.1145/1082469.1082471
Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to ...
- article, August 2004
A study of source-level compiler algorithms for automatic construction of pre-execution code
ACM Transactions on Computer Systems (TOCS), Volume 22, Issue 3, Pages 326–379
https://doi.org/10.1145/1012268.1012270
Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of ...
- article, May 2004
A general framework for prefetch scheduling in linked data structures and its application to multi-chain prefetching
ACM Transactions on Computer Systems (TOCS), Volume 22, Issue 2, Pages 214–280
https://doi.org/10.1145/986533.986536
Pointer-chasing applications tend to traverse composite data structures consisting of multiple independent pointer chains. While the traversal of any single pointer chain leads to the serialization of memory operations, the traversal of independent ...
- article, February 2003
Run-time support for distributed sharing in safe languages
ACM Transactions on Computer Systems (TOCS), Volume 21, Issue 1, Pages 1–35
https://doi.org/10.1145/592637.592638
We present a new run-time system that supports object sharing in a distributed system. The key insight in this system is that a handle-based implementation of such a system enables efficient and transparent sharing of data with both fine- and coarse-...
- article, August 2002
Secure program partitioning
ACM Transactions on Computer Systems (TOCS), Volume 20, Issue 3, Pages 283–328
https://doi.org/10.1145/566340.566343
This paper presents secure program partitioning, a language-based technique for protecting confidential data during computation in distributed systems containing mutually untrusted hosts. Confidentiality and integrity policies can be expressed by ...
- article, May 2001
Compiler-based I/O prefetching for out-of-core applications
ACM Transactions on Computer Systems (TOCS), Volume 19, Issue 2, Pages 111–170
https://doi.org/10.1145/377769.377774
Current operating systems offer poor performance when a numeric application's working set does not fit in main memory. As a result, programmers who wish to solve “out-of-core” problems efficiently are typically faced with the onerous task of rewriting ...
- article, February 2001
Architectural and compiler support for effective instruction prefetching: a cooperative approach
ACM Transactions on Computer Systems (TOCS), Volume 19, Issue 1, Pages 71–109
https://doi.org/10.1145/367742.367786
Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing ...
- article, November 1999
Effective fine-grain synchronization for automatically parallelized programs using optimistic synchronization primitives
ACM Transactions on Computer Systems (TOCS), Volume 17, Issue 4, Pages 337–371
https://doi.org/10.1145/329466.329486
This article presents our experience using optimistic synchronization to implement fine-grain atomic operations in the context of a parallelizing compiler for irregular, object-based computations. Our experience shows that the synchronization ...
- article, August 1999
Ace: a language for parallel programming with customizable protocols
ACM Transactions on Computer Systems (TOCS), Volume 17, Issue 3, Pages 202–248
https://doi.org/10.1145/320656.320657
Customizing the protocols that manage accesses to different data structures within an application can improve the performance of software shared-memory programs substantially. Existing systems for using customizable protocols are hard to use directly ...
- article, May 1999
Eliminating synchronization overhead in automatically parallelized programs using dynamic feedback
ACM Transactions on Computer Systems (TOCS), Volume 17, Issue 2, Pages 89–132
https://doi.org/10.1145/312203.312210
This article presents dynamic feedback, a technique that enables computations to adapt dynamically to different execution environments. A compiler that uses dynamic feedback produces several different versions of the same source code; each version uses ...
- article, May 1998
Informing memory operations: memory performance feedback mechanisms and their applications
ACM Transactions on Computer Systems (TOCS), Volume 16, Issue 2, Pages 170–205
https://doi.org/10.1145/279227.279230
Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the ...
- article, February 1998
Tolerating latency in multiprocessors through compiler-inserted prefetching
ACM Transactions on Computer Systems (TOCS), Volume 16, Issue 1, Pages 55–92
https://doi.org/10.1145/273011.273021
The large latency of memory accesses in large-scale shared-memory multiprocessors is a key obstacle to achieving high processor utilization. Software-controlled prefetching is a technique for tolerating memory latency by explicitly executing ...
- article, February 1998
Performance evaluation of the Orca shared-object system
ACM Transactions on Computer Systems (TOCS), Volume 16, Issue 1, Pages 1–40
https://doi.org/10.1145/273011.273014
Orca is a portable, object-based distributed shared memory (DSM) system. This article studies and evaluates the design choices made in the Orca system and compares Orca with other DSMs. The article gives a quantitative analysis of Orca's coherence ...