- Article, September 2024
Annotation of Compiler Attributes for MPI Functions
Recent Advances in the Message Passing Interface, Pages 21–35, https://doi.org/10.1007/978-3-031-73370-3_2
This paper explores the use of LLVM IR function and parameter attributes to enhance compiler optimizations for code that uses MPI. As MPI is usually used as a dynamically linked library, the compiler is not able to automatically infer certain ...
- research-article, March 2022
APT-GET: profile-guided timely software prefetching
EuroSys '22: Proceedings of the Seventeenth European Conference on Computer Systems, Pages 747–764, https://doi.org/10.1145/3492321.3519583
Prefetching, which predicts future memory accesses and preloads them from main memory, is a widely-adopted technique to overcome the processor-memory performance gap. Unfortunately, hardware prefetchers implemented in today's processors cannot identify ...
Compiler assisted hybrid implicit and explicit GPU memory management under unified address space
SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 51, Pages 1–16, https://doi.org/10.1145/3295500.3356141
To improve programmability and productivity, recent GPUs adopt a virtual memory address space shared with CPUs (e.g., NVIDIA's unified memory). Unified memory migrates the data management burden from programmers to system software and hardware, and ...
- research-article, May 2019
Compiler-Assisted and Profiling-Based Analysis for Fast and Efficient STT-MRAM On-Chip Cache Design
ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 24, Issue 4, Article No.: 41, Pages 1–25, https://doi.org/10.1145/3321693
Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) is a promising candidate for large on-chip memories as a zero-leakage, high-density and non-volatile alternative to the present SRAM technology. Since memories are the dominating component of ...
- research-article, February 2018
vSensor: leveraging fixed-workload snippets of programs for performance variance detection
PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Pages 124–136, https://doi.org/10.1145/3178487.3178497
Performance variance becomes increasingly challenging on current large-scale HPC systems. Even using a fixed number of computing nodes, the execution time of several runs can vary significantly. Many parallel programs executing on supercomputers suffer ...
Also Published in:
ACM SIGPLAN Notices: Volume 53 Issue 1
- Article, February 2017
Discovery and exploitation of general reductions: a constraint based approach
CGO '17: Proceedings of the 2017 International Symposium on Code Generation and Optimization, Pages 269–280
Discovering and exploiting scalar reductions in programs has been studied for many years. The discovery of more complex reduction operations has, however, received less attention. Such reductions contain compile-time unknown parameters, indirect memory ...
- research-article, February 2016
IPAS: intelligent protection against silent output corruption in scientific applications
CGO '16: Proceedings of the 2016 International Symposium on Code Generation and Optimization, Pages 227–238, https://doi.org/10.1145/2854038.2854059
This paper presents IPAS, an instruction duplication technique that protects scientific applications from silent data corruption (SDC) in their output. The motivation for IPAS is that, due to natural error masking, only a subset of SDC errors actually ...
- research-article, May 2015
ExaSAT: An exascale co-design tool for performance modeling
International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 29, Issue 2, Pages 209–232, https://doi.org/10.1177/1094342014568690
One of the emerging challenges to designing HPC systems is understanding and projecting the requirements of exascale applications. In order to determine the performance consequences of different hardware designs, analytic models are essential because ...
- Article, September 2014
PWCET: Power-Aware Worst Case Execution Time Analysis
ICPPW '14: Proceedings of the 2014 43rd International Conference on Parallel Processing Workshops, Pages 439–447, https://doi.org/10.1109/ICPPW.2014.64
Worst case execution time (WCET) analysis is used to verify that real-time tasks on systems can be executed without violating any timing constraints. Power consumption is not considered in most of the WCET research work. However, real-time embedded ...
- research-article, May 2014
LORAIN: a step closer to the PDES 'holy grail'
SIGSIM PADS '14: Proceedings of the 2nd ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, Pages 3–14, https://doi.org/10.1145/2601381.2601397
Automatic parallelization of models has been the "Holy Grail" of the PDES community for the last 20 years. In this paper we present LORAIN -- Low Overhead Runtime Assisted Instruction Negation -- a tool capable of automatic emission of a reverse event ...
- Article, September 2012
Exact dependence analysis for increased communication overlap
EuroMPI'12: Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface, Pages 89–99, https://doi.org/10.1007/978-3-642-33518-1_14
MPI programs are often challenged to scale up to several million cores. In doing so, the programmer tunes every aspect of the application code. However, for large applications, this is often not practical and expensive tracing tools and post-mortem ...
- Article, August 2012
From serial loops to parallel execution on distributed systems
Euro-Par'12: Proceedings of the 18th international conference on Parallel Processing, Pages 246–257, https://doi.org/10.1007/978-3-642-32820-6_25
Programmability and performance portability are two major challenges in today's dynamic environment. Algorithm designers targeting efficient algorithms should focus on designing high-level algorithms exhibiting maximum parallelism, while relying on ...
- research-article, February 2011
Energy-efficient hardware data prefetching
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (ITVL), Volume 19, Issue 2, Pages 250–263, https://doi.org/10.1109/TVLSI.2009.2032916
Extensive research has been done in prefetching techniques that hide memory latency in microprocessors leading to performance improvements. However, the energy aspect of prefetching is relatively unknown. While aggressive prefetching techniques often ...
- research-article, March 2010
Shoestring: probabilistic soft error reliability on the cheap
ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems, Pages 385–396, https://doi.org/10.1145/1736020.1736063
Aggressive technology scaling provides designers with an ever increasing budget of cheaper and faster transistors. Unfortunately, this trend is accompanied by a decline in individual device reliability as transistors become increasingly susceptible to ...
Also Published in:
ACM SIGARCH Computer Architecture News: Volume 38 Issue 1
ACM SIGPLAN Notices: Volume 45 Issue 3
- Article, March 2009
Communication-Sensitive Static Dataflow for Parallel Message Passing Applications
CGO '09: Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization, Pages 1–12, https://doi.org/10.1109/CGO.2009.32
Message passing is a very popular style of parallel programming, used in a wide variety of applications and supported by many APIs, such as BSD sockets, MPI and PVM. Its importance has motivated significant amounts of research on optimization and ...
- poster, February 2009
Exploiting global optimizations for OpenMP programs in the OpenUH compiler
PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, Pages 289–290, https://doi.org/10.1145/1504176.1504219
The advent of new parallel architectures has increased the need for parallel optimizing compilers to assist developers in creating efficient code. OpenUH is a state-of-the-art optimizing compiler, but it only performs a limited set of optimizations for ...
Also Published in:
ACM SIGPLAN Notices: Volume 44 Issue 4
- research-article, June 2008
Optimizing irregular shared-memory applications for clusters
ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, Pages 256–265, https://doi.org/10.1145/1375527.1375566
Irregular applications pose challenges in optimizing communication, due to the difficulty of analyzing irregular data accesses accurately and efficiently. This challenge is especially big when translating irregular shared-memory applications to message-...
- article, April 2007
DRDU: A data reuse analysis technique for efficient scratch-pad memory management
ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 12, Issue 2, Pages 15–es, https://doi.org/10.1145/1230800.1230807
In multimedia and other streaming applications, a significant portion of energy is spent on data transfers. Exploiting data reuse opportunities in the application, we can reduce this energy by making copies of frequently used data in a small local ...
- article, February 2005
Runtime characterisation of irregular accesses applied to parallelisation of irregular reductions
International Journal of Computational Science and Engineering (IJCSE), Volume 1, Issue 1, Pages 1–14, https://doi.org/10.1504/IJCSE.2005.008906
Irregular reduction operations are the core of many large scientific and engineering applications. There are, in the literature, different methods to solve these operations in parallel. In this paper we discuss a new technique which improves performance ...
- Article, September 2004
Analytical computation of Ehrhart polynomials: enabling more compiler analyses and optimizations
CASES '04: Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems, Pages 248–258, https://doi.org/10.1145/1023833.1023868
Many optimization techniques, including several targeted specifically at embedded systems, depend on the ability to calculate the number of elements that satisfy certain conditions. If these conditions can be represented by linear constraints, then such ...