- research-article, April 2024
A shared compilation stack for distributed-memory parallelism in stencil DSLs
- George Bisbas,
- Anton Lydike,
- Emilien Bauer,
- Nick Brown,
- Mathieu Fehr,
- Lawrence Mitchell,
- Gabriel Rodriguez-Canal,
- Maurice Jamieson,
- Paul H. J. Kelly,
- Michel Steuwer,
- Tobias Grosser
ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, April 2024, Pages 38–56
https://doi.org/10.1145/3620666.3651344
Domain Specific Languages (DSLs) increase programmer productivity and provide high performance. Their targeted abstractions allow scientists to express problems at a high level, providing rich details that optimizing compilers can exploit to target ...
- research-article, September 2021
Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation
- Tobias Gysi,
- Christoph Müller,
- Oleksandr Zinenko,
- Stephan Herhut,
- Eddie Davis,
- Tobias Wicky,
- Oliver Fuhrer,
- Torsten Hoefler,
- Tobias Grosser
ACM Transactions on Architecture and Code Optimization (TACO), Volume 18, Issue 4, Article No.: 51, Pages 1–23
https://doi.org/10.1145/3469030
Most compilers have a single core intermediate representation (IR) (e.g., LLVM) sometimes complemented with vaguely defined IR-like data structures. This IR is commonly low-level and close to machine instructions. As a result, optimizations relying on ...
- research-article, August 2020
FPDetect: Efficient Reasoning About Stencil Programs Using Selective Direct Evaluation
ACM Transactions on Architecture and Code Optimization (TACO), Volume 17, Issue 3, Article No.: 19, Pages 1–27
https://doi.org/10.1145/3402451
We present FPDetect, a low-overhead approach for detecting logical errors and soft errors affecting stencil computations without generating false positives. We develop an offline analysis that tightly estimates the number of floating-point bits ...
- research-article, December 2019
Flextended Tiles: A Flexible Extension of Overlapped Tiles for Polyhedral Compilation
ACM Transactions on Architecture and Code Optimization (TACO), Volume 16, Issue 4, Article No.: 47, Pages 1–25
https://doi.org/10.1145/3369382
Loop tiling to exploit data locality and parallelism plays an essential role in a variety of general-purpose and domain-specific compilers. Affine transformations in polyhedral frameworks implement classical forms of rectangular and parallelogram tiling,...
- research-article, November 2018
Stencil codes on a vector length agnostic architecture
- Adrià Armejach,
- Helena Caminal,
- Juan M. Cebrian,
- Rekai González-Alberquilla,
- Chris Adeniyi-Jones,
- Mateo Valero,
- Marc Casas,
- Miquel Moretó
PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, November 2018, Article No.: 13, Pages 1–12
https://doi.org/10.1145/3243176.3243192
Data-level parallelism is frequently ignored or underutilized. Achieved through vector/SIMD capabilities, it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual ...
- short-paper, June 2017
Stencil Autotuning with Ordinal Regression: Extended Abstract
SCOPES '17: Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems, June 2017, Pages 72–75
https://doi.org/10.1145/3078659.3078664
The increasing performance of today's computer architecture comes with an unprecedented augment of hardware complexity. Unfortunately this results in difficult-to-tune software and consequentially in a gap between the potential peak performance and the ...
- research-article, January 2017
Trade-Offs Between Synchronization, Communication, and Computation in Parallel Linear Algebra Computations
ACM Transactions on Parallel Computing (TOPC), Volume 3, Issue 1, Article No.: 3, Pages 1–47
https://doi.org/10.1145/2897188
This article derives trade-offs between three basic costs of a parallel algorithm: synchronization, data movement, and computational cost. These trade-offs are lower bounds on the execution time of the algorithm that are independent of the number of ...
- research-article, September 2016
Resource Conscious Reuse-Driven Tiling for GPUs
- Prashant Singh Rawat,
- Changwan Hong,
- Mahesh Ravishankar,
- Vinod Grover,
- Louis-Noel Pouchet,
- Atanas Rountev,
- P. Sadayappan
PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, September 2016, Pages 99–111
https://doi.org/10.1145/2967938.2967967
Computations involving successive application of 3D stencil operators are widely used in many application domains, such as image processing, computational electromagnetics, seismic processing, and climate modeling. Enhancement of temporal and spatial ...
- research-article, March 2016
Effective resource management for enhancing performance of 2D and 3D stencils on GPUs
GPGPU '16: Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit, March 2016, Pages 92–102
https://doi.org/10.1145/2884045.2884047
GPUs are an attractive target for data parallel stencil computations prevalent in scientific computing and image processing applications. Many tiling schemes, such as overlapped tiling and split tiling, have been proposed in past to improve the ...
- research-article, June 2015
Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications
HPDC '15: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, June 2015, Pages 259–270
https://doi.org/10.1145/2749246.2749255
This paper proposes an end-to-end framework for automatically transforming stencil-based CUDA programs to exploit inter-kernel data locality. The CUDA-to-CUDA transformation collectively replaces the user-written kernels by auto-generated kernels ...
- research-article, June 2015
Parameterized Diamond Tiling for Stencil Computations with Chapel parallel iterators
- Ian J. Bertolacci,
- Catherine Olschanowsky,
- Ben Harshbarger,
- Bradford L. Chamberlain,
- David G. Wonnacott,
- Michelle Mills Strout
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, June 2015, Pages 197–206
https://doi.org/10.1145/2751205.2751226
Stencil computations figure prominently in the core kernels of many scientific computations, such as partial differential equation solvers. Parallel scaling of stencil computations can be significantly improved on multicore processors using advanced ...
- Article, May 2015
Energy Modeling and Optimization for Tiled Nested-Loop Codes
IPDPSW '15: Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, May 2015, Pages 888–895
https://doi.org/10.1109/IPDPSW.2015.94
We develop a methodology for modeling the energy efficiency of tiled nested-loop codes running on a graphics processing unit (GPU) and use it for energy efficiency optimization. We assume that a highly optimized and ...
- article, April 2015
Optimizing the computation of a parallel 3D finite difference algorithm for graphics processing units
Concurrency and Computation: Practice & Experience (CCOMP), Volume 27, Issue 6, April 2015, Pages 1591–1602
https://doi.org/10.1002/cpe.3351
This paper explores the possibilities of using a graphics processing unit for complex 3D finite difference computation via MUSTA-FORCE and WENO algorithms. We propose a novel algorithm based on the new properties of CUDA surface memory optimized for 2D ...
- research-article, January 2015
PLUTO+: near-complete modeling of affine transformations for parallelism and locality
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2015, Pages 54–64
https://doi.org/10.1145/2688500.2688512
Affine transformations have proven to be very powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multi-dimensional affine function can represent a long and complex sequence of simpler ...
Also Published in:
ACM SIGPLAN Notices: Volume 50, Issue 8, August 2015
- research-article, January 2015
Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates
SIAM Journal on Scientific Computing (SISC), Volume 37, Issue 4, 2015, Pages C439–C464
https://doi.org/10.1137/140991133
The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to ...
- tutorial, October 2014
WOSC 2014: second workshop on optimizing stencil computations
SPLASH '14: Proceedings of the companion publication of the 2014 ACM SIGPLAN conference on Systems, Programming, and Applications: Software for Humanity, October 2014, Pages 89–90
https://doi.org/10.1145/2660252.2662138
The second Workshop on Optimizing Stencil Computations is held in Portland, Oregon, USA on October 20, 2014, as part of the 2014 ACM SIGPLAN conference on Systems, Programming Languages, and Applications: Software for Humanity (SPLASH). The workshop's ...
- Article, May 2012
Automatic Resource Scheduling with Latency Hiding for Parallel Stencil Applications on GPGPU Clusters
IPDPS '12: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, May 2012, Pages 544–556
https://doi.org/10.1109/IPDPS.2012.57
Overlapping computations and communication is a key to accelerating stencil applications on parallel computers, especially for GPU clusters. However, such programming is a time-consuming part of the stencil application development. To address this ...
- Article, July 2009
Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization
COMPSAC '09: Proceedings of the 2009 33rd Annual IEEE International Computer Software and Applications Conference - Volume 01, July 2009, Pages 579–586
https://doi.org/10.1109/COMPSAC.2009.82
We present a pipelined wavefront parallelization approach for stencil-based computations. Within a fixed spatial domain successive wavefronts are executed by threads scheduled to a multicore processor chip with a shared outer level cache. By re-using ...
- article, February 2009
Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors
SIAM Review (SIREV), Volume 51, Issue 1, February 2009, Pages 129–159
https://doi.org/10.1137/070693199
Stencil-based kernels constitute the core of many important scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory ...
- article, January 2009
Writing productive stencil codes with overlapped tiling
Concurrency and Computation: Practice & Experience (CCOMP), Volume 21, Issue 1, January 2009, Pages 25–39
Stencil computations constitute the kernel of many scientific applications. Tiling is often used to improve the performance of stencil codes for data locality and parallelism. However, tiled stencil codes typically require shadow regions, whose management ...