Cheetah: detecting false sharing efficiently and effectively
False sharing is a notorious performance problem that may occur in multithreaded programs when they are running on ubiquitous multicore hardware. It can dramatically degrade the performance by up to an order of magnitude, significantly hurting the ...
AutoFDO: automatic feedback-directed optimization for warehouse-scale applications
AutoFDO is a system to simplify real-world deployment of feedback-directed optimization (FDO). The system works by sampling hardware performance monitors on production machines and using those profiles to guide optimization. Profile data is stale by ...
Portable performance on asymmetric multicore processors
Static and dynamic power constraints are steering chip manufacturers to build single-ISA Asymmetric Multicore Processors (AMPs) with big and small cores. To deliver on their energy efficiency potential, schedulers must consider core sensitivity, load ...
StructSlim: a lightweight profiler to guide structure splitting
Memory access latency continues to be a dominant bottleneck in a large class of applications on modern architectures. To optimize memory performance, it is important to utilize the locality in the memory hierarchy. Structure splitting can significantly ...
Exploiting recent SIMD architectural advances for irregular applications
A broad class of applications involve indirect or data-dependent memory accesses and are referred to as irregular applications. Recent developments in SIMD architectures – specifically, the emergence of wider SIMD lanes, combination of SIMD parallelism ...
Exploiting mixed SIMD parallelism by reducing data reorganization overhead
Existing loop vectorization techniques can exploit either intra- or inter-iteration SIMD parallelism alone in a code region if one part of the region vectorized for one type of parallelism has data dependences (called mixed-parallelism-inhibiting ...
A black-box approach to energy-aware scheduling on integrated CPU-GPU systems
Energy efficiency is now a top design goal for all computing systems, from fitness trackers and tablets, where it affects battery life, to cloud computing centers, where it directly impacts operational cost, maintainability, and environmental impact. ...
Portable and transparent software managed scheduling on accelerators for fair resource sharing
Accelerators, such as Graphics Processing Units (GPUs), are popular components of modern parallel systems. Their energy-efficient performance makes them attractive components for modern data center nodes. However, they lack control for fair resource ...
Communication-aware mapping of stream graphs for multi-GPU platforms
Stream graphs can provide a natural way to represent many applications in multimedia and DSP domains. Though the exposed parallelism of stream graphs makes it relatively easy to map them to GP (General Purpose)-GPUs, very large stream graphs as well as ...
gpucc: an open-source GPGPU compiler
- Jingyue Wu,
- Artem Belevich,
- Eli Bendersky,
- Mark Heffernan,
- Chris Leary,
- Jacques Pienaar,
- Bjarke Roune,
- Rob Springer,
- Xuetian Weng,
- Robert Hundt
Graphics Processing Units have emerged as powerful accelerators for massively parallel, numerically intensive workloads. The two dominant software models for these devices are NVIDIA’s CUDA and the cross-platform OpenCL standard. Until now, there has ...
A basic linear algebra compiler for structured matrices
Many problems in science and engineering are in practice modeled and solved through matrix computations. Often, the matrices involved have structure such as symmetric or triangular, which reduces the operations count needed to perform the computation. ...
Opening polyhedral compiler's black box
While compilers offer a fair trade-off between productivity and executable performance in single-threaded execution, their optimizations remain fragile when addressing compute-intensive code for parallel architectures with deep memory hierarchies. ...
Trace-based affine reconstruction of codes
Complete comprehension of loop codes is desirable for a variety of program optimizations. Compilers perform static code analyses and transformations, such as loop tiling or memory partitioning, by constructing and manipulating formal representations of ...
Inference of peak density of indirect branches to detect ROP attacks
A program subject to a Return-Oriented Programming (ROP) attack usually presents an execution trace with a high frequency of indirect branches. From this observation, several researchers have proposed to monitor the density of these instructions to ...
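As a rough illustration of what "monitoring the density of indirect branches" can mean, the sketch below slides a fixed-size window over an instruction trace and reports the highest fraction of indirect branches seen in any window. It is not the paper's method: the function name `peak_indirect_density`, the boolean trace encoding, and the window size are all assumptions made for this example.

```python
from collections import deque


def peak_indirect_density(trace, window):
    """Return the highest fraction of indirect branches found in any
    run of `window` consecutive instructions of `trace`.

    `trace` is an iterable of booleans (hypothetical encoding):
    True marks an indirect branch, False any other instruction.
    Traces shorter than `window` yield 0.0.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque()   # the last `window` entries of the trace
    count = 0          # number of indirect branches currently in `recent`
    peak = 0.0
    for is_indirect in trace:
        recent.append(bool(is_indirect))
        count += bool(is_indirect)
        if len(recent) > window:
            count -= recent.popleft()
        if len(recent) == window:
            peak = max(peak, count / window)
    return peak
```

A monitor built on this idea would compare the observed density against a threshold; choosing that threshold soundly is the kind of question the abstract raises.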
Sparse flow-sensitive pointer analysis for multithreaded programs
For C programs, flow-sensitivity is important to enable pointer analysis to achieve highly usable precision. Despite significant recent advances in scaling flow-sensitive pointer analysis sparsely for sequential C programs, relatively little progress ...
Symbolic range analysis of pointers
Alias analysis is one of the most fundamental techniques that compilers use to optimize languages with pointers. However, in spite of all the attention that this topic has received, the current state-of-the-art approaches inside compilers still face ...
Towards automatic significance analysis for approximate computing
- Vassilis Vassiliadis,
- Jan Riehme,
- Jens Deussen,
- Konstantinos Parasyris,
- Christos D. Antonopoulos,
- Nikolaos Bellas,
- Spyros Lalis,
- Uwe Naumann
Several applications may trade off output quality for energy efficiency by computing only an approximation of their output. Current approaches to software-based approximate computing often require the programmer to specify parts of the code or data ...
Have abstraction and eat performance, too: optimized heterogeneous computing with parallel patterns
- Kevin J. Brown,
- HyoukJoong Lee,
- Tiark Rompf,
- Arvind K. Sujeeth,
- Christopher De Sa,
- Christopher Aberger,
- Kunle Olukotun
High performance in modern computing platforms requires programs to be parallel, distributed, and run on heterogeneous hardware. However, programming such architectures is extremely difficult due to the need to implement the application using multiple ...
NRG-loops: adjusting power from within applications
NRG-Loops are source-level abstractions that allow an application to dynamically manage its power and energy through adjustments to functionality, performance, and accuracy. The adjustments, which come in the form of truncated, adapted, or perforated ...
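For readers unfamiliar with the term, loop perforation in general means skipping some iterations of a loop and accepting a less accurate result in exchange for less work. The sketch below shows only that generic idea and is not the NRG-Loops API: the function `perforated_mean` and its `stride` parameter are hypothetical names chosen for this example.

```python
def perforated_mean(values, stride=1):
    """Approximate the mean of `values` by visiting only every
    `stride`-th element (stride=1 visits all of them and is exact).

    `stride` is a hypothetical knob: a larger value does less work
    and generally gives a less accurate answer.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    sampled = values[::stride]  # the iterations that are actually executed
    if not sampled:
        raise ValueError("values must not be empty")
    return sum(sampled) / len(sampled)
```

With `stride=2` the loop does roughly half the work; how much accuracy that costs depends entirely on the data.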
Validating optimizations of concurrent C/C++ programs
We present a validator for checking the correctness of LLVM compiler optimizations on C11 programs as far as concurrency is concerned. Our validator checks that optimizations do not change memory accesses in ways disallowed by the C11 and/or LLVM ...
IPAS: intelligent protection against silent output corruption in scientific applications
This paper presents IPAS, an instruction duplication technique that protects scientific applications from silent data corruption (SDC) in their output. The motivation for IPAS is that, due to natural error masking, only a subset of SDC errors actually ...
Atomicity violation checker for task parallel programs
Task-based programming models (e.g., Cilk, Intel TBB, X10, Java Fork-Join tasks) simplify multicore programming in contrast to programming with threads. In a task-based model, the programmer specifies parallel tasks and the runtime maps these tasks to ...
Flexible on-stack replacement in LLVM
On-Stack Replacement (OSR) is a technique for dynamically transferring execution between different versions of a function at run time. OSR is typically used in virtual machines to interrupt a long-running function and recompile it at a higher ...
BlackBox: lightweight security monitoring for COTS binaries
After a software system is compromised, it can be difficult to understand what vulnerabilities attackers exploited. Any information residing on that machine cannot be trusted as attackers may have tampered with it to cover their tracks. Moreover, even ...
Re-constructing high-level information for language-specific binary re-optimization
In this paper, we show a binary optimizer can achieve competitive performance relative to a state-of-the-art source code compiler by re-constructing high-level information (HLI) from binaries. Recent advances in compiler technologies have resulted in a ...
Index Terms
- Proceedings of the 2016 International Symposium on Code Generation and Optimization