- Research article, August 2024
Substitution of kernel functions based on pattern matching on schedule trees
ICPP Workshops '24: Workshop Proceedings of the 53rd International Conference on Parallel Processing, Pages 48–57. https://doi.org/10.1145/3677333.3678152
With the rise of AI, computing hardware with varying architectures has emerged. For some frequently used AI kernels, this hardware provides special accelerators and related instructions. For example, since the Volta architecture, Nvidia GPUs have ...
- Research article, June 2024 (Just Accepted)
A Survey of General-purpose Polyhedral Compilers
ACM Transactions on Architecture and Code Optimization (TACO), Just Accepted. https://doi.org/10.1145/3674735
Since the 1990s, many implementations of polyhedral compilers have been written and distributed, either as source-to-source translating compilers or integrated into wider-purpose compilers. This paper provides a survey of those various available ...
Mat2Stencil: A Modular Matrix-Based DSL for Explicit and Implicit Matrix-Free PDE Solvers on Structured Grid
Proceedings of the ACM on Programming Languages (PACMPL), Volume 7, Issue OOPSLA2, Article No.: 246, Pages 686–715. https://doi.org/10.1145/3622822
Partial differential equation (PDE) solvers are extensively utilized across numerous scientific and engineering fields. However, achieving high performance and scalability often necessitates intricate and low-level programming, particularly when ...
- Research article, December 2022
Polyhedral Specification and Code Generation of Sparse Tensor Contraction with Co-iteration
ACM Transactions on Architecture and Code Optimization (TACO), Volume 20, Issue 1, Article No.: 16, Pages 1–26. https://doi.org/10.1145/3566054
This article presents a code generator for sparse tensor contraction computations. It leverages a mathematical representation of loop nest computations in the sparse polyhedral framework (SPF), which extends the polyhedral model to support non-affine ...
- Research article, January 2023
Parallelizing Neural Network Models Effectively on GPU by Implementing Reductions Atomically
- Jie Zhao,
- Cédric Bastoul,
- Yanzhi Yi,
- Jiahui Hu,
- Wang Nie,
- Renwei Zhang,
- Zhen Geng,
- Chong Li,
- Thibaut Tachon,
- Zhiliang Gan
PACT '22: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 451–466. https://doi.org/10.1145/3559009.3569656
Lacking a good orchestration of loop transformations, existing optimizing compilers for deploying neural networks on GPU either parallelize reductions ineffectively or miss fusion opportunities with other operators. Neural network ...
- Research article, January 2023
Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway Processor
ICPP '22: Proceedings of the 51st International Conference on Parallel Processing, Article No.: 52, Pages 1–12. https://doi.org/10.1145/3545008.3545031
We present an approach to the automatic generation of efficient matrix multiplication code on the latest Sunway processor, which will be employed by the next-generation machine of Sunway TaihuLight, one of the fastest supercomputers on earth. The ...
- Research article, June 2021
A practical tile size selection model for affine loop nests
ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing, Pages 27–39. https://doi.org/10.1145/3447818.3462213
Loop tiling for locality is an important transformation for general-purpose and domain-specific compilation as it allows programs to exploit the benefits of deep memory hierarchies. Most code generation tools with the infrastructure to perform automatic ...
- Research article, December 2020
LLOV: A Fast Static Data-Race Checker for OpenMP Programs
ACM Transactions on Architecture and Code Optimization (TACO), Volume 17, Issue 4, Article No.: 35, Pages 1–26. https://doi.org/10.1145/3418597
In the era of Exascale computing, writing efficient parallel programs is indispensable, and, at the same time, writing sound parallel programs is very difficult. Specifying parallelism with frameworks such as OpenMP is relatively easy, but data races in ...
- Research article, October 2019
The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels, Automatically
- Nicolas Vasilache,
- Oleksandr Zinenko,
- Theodoros Theodoridis,
- Priya Goyal,
- Zachary DeVito,
- William S. Moses,
- Sven Verdoolaege,
- Andrew Adams,
- Albert Cohen
ACM Transactions on Architecture and Code Optimization (TACO), Volume 16, Issue 4, Article No.: 38, Pages 1–26. https://doi.org/10.1145/3355606
Deep learning frameworks automate the deployment, distribution, synchronization, memory allocation, and hardware acceleration of models represented as graphs of computational operators. These operators wrap high-performance libraries such as cuDNN or ...
- Research article, June 2019
Efficient hierarchical online-autotuning: a case study on polyhedral accelerator mapping
ICS '19: Proceedings of the ACM International Conference on Supercomputing, Pages 354–366. https://doi.org/10.1145/3330345.3330377
Identifying the (near) optimal program variants an optimizing and parallelizing compiler should generate is known to be difficult. Autotuning is the best solution to navigate the often high-dimensional space of possible options. However, to be practical ...
- Research article, November 2016
PIPES: a language and compiler for task-based programming on distributed-memory clusters
SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 39, Pages 1–12
Applications running on clusters of shared-memory computers are often implemented using OpenMP+MPI. Productivity can be vastly improved using task-based programming, a paradigm where the user expresses the data and control-flow relations between tasks, ...
- Research article, September 2016
Reduction Drawing: Language Constructs and Polyhedral Compilation for Reductions on GPU
PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, Pages 87–97. https://doi.org/10.1145/2967938.2967950
Reductions are common in scientific and data-crunching codes, and a typical source of bottlenecks on massively parallel architectures such as GPUs. Reductions are memory-bound, and achieving peak performance involves sophisticated optimizations. There ...
- Research article, February 2015
Characterizing and enhancing global memory data coalescing on GPUs
Effective parallel programming for GPUs requires careful attention to several factors, including ensuring coalesced access of data from global memory. There is a need for tools that can provide feedback to users about statements in a GPU kernel where ...
- Research article, September 2013
Improving polyhedral code generation for high-level synthesis
CODES+ISSS '13: Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, Article No.: 15, Pages 1–10
High-level synthesis (HLS) tools are now capable of generating high-quality RTL codes for a number of programs. Nevertheless, for best performance aggressive program transformations are still required to exploit data reuse and enable communication/...