DOI: 10.1145/3578360.3580269
Research article · Open access

(De/Re)-Compositions Expressed Systematically via MDH-Based Schedules

Published: 17 February 2023

Abstract

We introduce a new scheduling language based on the formalism of Multi-Dimensional Homomorphisms (MDH). In contrast to existing scheduling languages, our MDH-based language is designed to systematically "de-compose" computations for the memory and core hierarchies of architectures and "re-compose" the computed intermediate results back to the final result -- we say "(de/re)-composition" for short. We argue that our scheduling language is easy to use and yet expressive enough to express well-performing (de/re)-compositions of popular related approaches, e.g., the TVM compiler, for MDH-supported computations (such as linear algebra routines and stencil computations). Moreover, our language is designed to be auto-tunable, i.e., any optimization decision can optionally be left to the auto-tuning engine of our system, and our system can automatically recommend schedules to the user based on its auto-tuning capabilities. Also, by relying on the MDH approach, we can formally guarantee the correctness of optimizations expressed in our language, thereby further enhancing the user experience. Our experiments on GPU and CPU confirm that we can express optimizations that cannot be expressed straightforwardly (or at all) in TVM's scheduling language, thereby achieving higher performance than TVM, as well as the vendor libraries provided by NVIDIA and Intel, for time-intensive computations used in real-world deep-learning neural networks.
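
For readers unfamiliar with scheduling languages, the sketch below shows how a (de/re)-composition is typically expressed in TVM, the main point of comparison in the abstract: a matrix multiplication is de-composed by tiling its data-parallel dimensions and splitting its reduction dimension, and the resulting loop structure is mapped to cores and SIMD units. This is a minimal illustration using TVM's te scheduling API (as available in TVM releases contemporary with this paper); the matrix size, tile factors, and target are arbitrary example values, and the MDH-based schedule notation introduced in the paper itself is not reproduced here.

```python
# Illustrative only: a classic TVM te schedule that "de-composes" a matrix
# multiplication for the cache/core hierarchy of a CPU. Sizes and tile
# factors are arbitrary assumptions for the example.
import tvm
from tvm import te

N = 1024   # example matrix size (assumption)
bn = 32    # example tile size (assumption)

# Computation: C = A * B as a tensor expression.
A = te.placeholder((N, N), name="A")
B = te.placeholder((N, N), name="B")
k = te.reduce_axis((0, N), name="k")
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)

# De-compose the iteration space: tile the two data-parallel dimensions
# for the cache, and split the reduction dimension k.
io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
(kk,) = s[C].op.reduce_axis
ko, ki = s[C].split(kk, factor=4)

# Re-order the de-composed loop nest and map it to cores / SIMD units.
s[C].reorder(io, jo, ko, ki, ii, ji)
s[C].vectorize(ji)   # innermost tile dimension -> SIMD lanes
s[C].parallel(io)    # outermost tile dimension -> CPU threads

func = tvm.build(s, [A, B, C], target="llvm")
```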


Cited By

  • (2024) (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms. ACM Transactions on Programming Languages and Systems 46(3), 1–74. https://doi.org/10.1145/3665643. Online publication date: 10 October 2024.

Published In

CC 2023: Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction
February 2023, 249 pages
ISBN: 9798400700880
DOI: 10.1145/3578360

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. CPU
  2. GPU
  3. deep learning
  4. scheduling languages

Funding Sources

  • DFG, German Research Foundation

Conference

CC '23



