- Research article, August 2024
Substitution of kernel functions based on pattern matching on schedule trees
ICPP Workshops '24: Workshop Proceedings of the 53rd International Conference on Parallel Processing, Pages 48–57. https://doi.org/10.1145/3677333.3678152
With the rise of AI, computing hardware with varying architectures has emerged. For some frequently used AI kernels, this hardware provides special accelerators and related instructions. For example, since the Volta architecture, Nvidia GPUs have ...
- Research article, June 2024 (Just Accepted)
A Survey of General-purpose Polyhedral Compilers
ACM Transactions on Architecture and Code Optimization (TACO), Just Accepted. https://doi.org/10.1145/3674735
Since the 1990s, many implementations of polyhedral compilers have been written and distributed, either as source-to-source translating compilers or integrated into wider-purpose compilers. This paper provides a survey of those various available ...
Mat2Stencil: A Modular Matrix-Based DSL for Explicit and Implicit Matrix-Free PDE Solvers on Structured Grid
Proceedings of the ACM on Programming Languages (PACMPL), Volume 7, Issue OOPSLA2, Article No.: 246, Pages 686–715. https://doi.org/10.1145/3622822
Partial differential equation (PDE) solvers are extensively utilized across numerous scientific and engineering fields. However, achieving high performance and scalability often necessitates intricate and low-level programming, particularly when ...
- Research article, December 2022
Polyhedral Specification and Code Generation of Sparse Tensor Contraction with Co-iteration
ACM Transactions on Architecture and Code Optimization (TACO), Volume 20, Issue 1, Article No.: 16, Pages 1–26. https://doi.org/10.1145/3566054
This article presents a code generator for sparse tensor contraction computations. It leverages a mathematical representation of loop nest computations in the sparse polyhedral framework (SPF), which extends the polyhedral model to support non-affine ...
- Research article, January 2023
Parallelizing Neural Network Models Effectively on GPU by Implementing Reductions Atomically
- Jie Zhao,
- Cédric Bastoul,
- Yanzhi Yi,
- Jiahui Hu,
- Wang Nie,
- Renwei Zhang,
- Zhen Geng,
- Chong Li,
- Thibaut Tachon,
- Zhiliang Gan
PACT '22: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 451–466. https://doi.org/10.1145/3559009.3569656
Lacking a good orchestration of loop transformations, existing optimizing compilers for deploying neural networks on GPU either parallelize reductions ineffectively or miss fusion opportunities with other operators. Neural network ...
- Research article, January 2023
Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway Processor
ICPP '22: Proceedings of the 51st International Conference on Parallel Processing, Article No.: 52, Pages 1–12. https://doi.org/10.1145/3545008.3545031
We present an approach to the automatic generation of efficient matrix multiplication code on the latest Sunway processor, which will be employed by the next-generation machine of Sunway TaihuLight, one of the fastest supercomputers on earth. The ...
- Research article, June 2021
A practical tile size selection model for affine loop nests
ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing, Pages 27–39. https://doi.org/10.1145/3447818.3462213
Loop tiling for locality is an important transformation for general-purpose and domain-specific compilation as it allows programs to exploit the benefits of deep memory hierarchies. Most code generation tools with the infrastructure to perform automatic ...
- Research article, December 2020
LLOV: A Fast Static Data-Race Checker for OpenMP Programs
ACM Transactions on Architecture and Code Optimization (TACO), Volume 17, Issue 4, Article No.: 35, Pages 1–26. https://doi.org/10.1145/3418597
In the era of Exascale computing, writing efficient parallel programs is indispensable, and, at the same time, writing sound parallel programs is very difficult. Specifying parallelism with frameworks such as OpenMP is relatively easy, but data races in ...
- Research article, October 2019
The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels, Automatically
- Nicolas Vasilache,
- Oleksandr Zinenko,
- Theodoros Theodoridis,
- Priya Goyal,
- Zachary DeVito,
- William S. Moses,
- Sven Verdoolaege,
- Andrew Adams,
- Albert Cohen
ACM Transactions on Architecture and Code Optimization (TACO), Volume 16, Issue 4, Article No.: 38, Pages 1–26. https://doi.org/10.1145/3355606
Deep learning frameworks automate the deployment, distribution, synchronization, memory allocation, and hardware acceleration of models represented as graphs of computational operators. These operators wrap high-performance libraries such as cuDNN or ...
- Research article, June 2019
Efficient hierarchical online-autotuning: a case study on polyhedral accelerator mapping
ICS '19: Proceedings of the ACM International Conference on Supercomputing, Pages 354–366. https://doi.org/10.1145/3330345.3330377
Identifying the (near) optimal program variants an optimizing and parallelizing compiler should generate is known to be difficult. Autotuning is the best solution to navigate the often high-dimensional space of possible options. However, to be practical ...
- Research article, November 2016
PIPES: a language and compiler for task-based programming on distributed-memory clusters
SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 39, Pages 1–12
Applications running on clusters of shared-memory computers are often implemented using OpenMP+MPI. Productivity can be vastly improved using task-based programming, a paradigm where the user expresses the data and control-flow relations between tasks, ...
- Research article, September 2016
Reduction Drawing: Language Constructs and Polyhedral Compilation for Reductions on GPU
PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, Pages 87–97. https://doi.org/10.1145/2967938.2967950
Reductions are common in scientific and data-crunching codes, and a typical source of bottlenecks on massively parallel architectures such as GPUs. Reductions are memory-bound, and achieving peak performance involves sophisticated optimizations. There ...
- Research article, February 2015
Characterizing and enhancing global memory data coalescing on GPUs
Effective parallel programming for GPUs requires careful attention to several factors, including ensuring coalesced access of data from global memory. There is a need for tools that can provide feedback to users about statements in a GPU kernel where ...
- Research article, September 2013
Improving polyhedral code generation for high-level synthesis
CODES+ISSS '13: Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, Article No.: 15, Pages 1–10
High-level synthesis (HLS) tools are now capable of generating high-quality RTL codes for a number of programs. Nevertheless, for best performance aggressive program transformations are still required to exploit data reuse and enable communication/...