DOI: 10.1145/3578360.3580269
Research article · Open access

(De/Re)-Compositions Expressed Systematically via MDH-Based Schedules

Published: 17 February 2023

Abstract

We introduce a new scheduling language based on the formalism of Multi-Dimensional Homomorphisms (MDH). In contrast to existing scheduling languages, our MDH-based language is designed to systematically "de-compose" computations for the memory and core hierarchies of architectures and "re-compose" the computed intermediate results back to the final result -- we say "(de/re)-composition" for short. We argue that our scheduling language is easy to use and yet expressive enough to express well-performing (de/re)-compositions of popular related approaches, e.g., the TVM compiler, for MDH-supported computations (such as linear algebra routines and stencil computations). Moreover, our language is designed to be auto-tunable, i.e., any optimization decision can optionally be left to the auto-tuning engine of our system, and our system can automatically recommend schedules to the user based on its auto-tuning capabilities. Also, by relying on the MDH approach, we can formally guarantee the correctness of optimizations expressed in our language, thereby further enhancing the user experience. Our experiments on GPU and CPU confirm that we can express optimizations that cannot be expressed straightforwardly (or at all) in TVM's scheduling language, thereby achieving higher performance than TVM, as well as the vendor libraries provided by NVIDIA and Intel, for time-intensive computations used in real-world deep-learning neural networks.
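
For readers unfamiliar with scheduling languages, the sketch below shows how a (de/re)-composition is typically expressed in TVM, the main point of comparison in the abstract: a matrix multiplication is de-composed by tiling its data-parallel dimensions and splitting its reduction dimension, and the resulting loop structure is mapped to cores and SIMD units. This is a minimal illustration using TVM's te scheduling API (as available in TVM releases contemporary with this paper); the matrix size, tile factors, and target are arbitrary example values, and the MDH-based schedule notation introduced in the paper itself is not reproduced here.

```python
# Illustrative only: a classic TVM te schedule that "de-composes" a matrix
# multiplication for the cache/core hierarchy of a CPU. Sizes and tile
# factors are arbitrary assumptions for the example.
import tvm
from tvm import te

N = 1024   # example matrix size (assumption)
bn = 32    # example tile size (assumption)

# Computation: C = A * B as a tensor expression.
A = te.placeholder((N, N), name="A")
B = te.placeholder((N, N), name="B")
k = te.reduce_axis((0, N), name="k")
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)

# De-compose the iteration space: tile the two data-parallel dimensions
# for the cache, and split the reduction dimension k.
io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
(kk,) = s[C].op.reduce_axis
ko, ki = s[C].split(kk, factor=4)

# Re-order the de-composed loop nest and map it to cores / SIMD units.
s[C].reorder(io, jo, ko, ki, ii, ji)
s[C].vectorize(ji)   # innermost tile dimension -> SIMD lanes
s[C].parallel(io)    # outermost tile dimension -> CPU threads

func = tvm.build(s, [A, B, C], target="llvm")
```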


Cited By

  • (2024) (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms. ACM Transactions on Programming Languages and Systems 46(3), 1–74. https://doi.org/10.1145/3665643. Online publication date: 10 October 2024.

Published In

CC 2023: Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction
February 2023, 249 pages
ISBN: 9798400700880
DOI: 10.1145/3578360

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. CPU
  2. GPU
  3. deep learning
  4. scheduling languages

Funding Sources

  • DFG, German Research Foundation

Conference

CC '23



