research-article
Open access

Mat2Stencil: A Modular Matrix-Based DSL for Explicit and Implicit Matrix-Free PDE Solvers on Structured Grid

Published: 16 October 2023

Abstract

Partial differential equation (PDE) solvers are widely used across scientific and engineering fields. However, achieving high performance and scalability often requires intricate, low-level programming, particularly when exploiting the deterministic sparsity patterns of structured grids.
In this paper, we propose Mat2Stencil, a domain-specific language (DSL) with its compiler, for PDE solvers on structured grids. Mat2Stencil introduces a structured sparse matrix abstraction that supports modular, flexible, and easy-to-use expression of a broad spectrum of solvers, including components such as Jacobi or Gauss-Seidel preconditioners, incomplete LU or Cholesky decompositions, and multigrid methods built upon them. Our DSL compiler then generates matrix-free code consisting of generalized stencils through multi-stage programming. In addition to Jacobi-style stencils, which are embarrassingly parallel along the spatial dimensions, the generated code allows spatial loop-carried dependences in the form of quasi-affine loops. We further propose a novel automatic parallelization technique for these spatially dependent loops, which provides compile-time deterministic task partitioning for threading, automatically calculates the necessary inter-thread synchronization, and generates an efficient multi-threaded implementation with fine-grained synchronization.
We implement 4 benchmark programs: 3 pseudo-applications from the NAS Parallel Benchmarks, using 6.3% of the lines of code, and matrix-free High Performance Conjugate Gradients, using 16.4% of the lines of code. We achieve up to 1.67× and on average 1.03× the performance of manual implementations.
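To make the terminology in the abstract concrete, the following is a minimal sketch in plain Python. It is not Mat2Stencil syntax and not the compiler's generated code. It assumes a 5-point Laplacian on an n × n grid with zero boundary values, and all names (OFFSETS, dense_matrix, matvec_stencil, gauss_seidel_wavefront, and so on) are hypothetical. It shows two things: a structured sparse matrix is fully described by a few stencil offsets, so the matrix-vector product can be computed matrix-free; and an in-place Gauss-Seidel sweep, which carries a dependence across the spatial loops, can be reordered along anti-diagonal wavefronts without changing the result. Wavefront reordering is a textbook technique used here for illustration only; the paper's own parallelization scheme may differ.

```python
# Illustrative sketch only; every name below is hypothetical.
# Each (di, dj) offset corresponds to one diagonal of the structured sparse matrix.
OFFSETS = {(0, 0): 4.0, (-1, 0): -1.0, (1, 0): -1.0, (0, -1): -1.0, (0, 1): -1.0}


def dense_matrix(n):
    """Assemble the explicit (n*n) x (n*n) matrix from the stencil offsets."""
    a = [[0.0] * (n * n) for _ in range(n * n)]
    for i in range(n):
        for j in range(n):
            for (di, dj), c in OFFSETS.items():
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    a[i * n + j][ii * n + jj] = c
    return a


def matvec_dense(a, x):
    """Reference product y = A x using the stored matrix."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def matvec_stencil(n, x):
    """Matrix-free product: same operator, but no matrix is stored."""
    y = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for (di, dj), c in OFFSETS.items():
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    acc += c * x[ii * n + jj]
            y[i * n + j] = acc
    return y


def _gs_update(n, x, b, i, j):
    """In-place Gauss-Seidel update of one grid point."""
    s = b[i * n + j]
    for (di, dj), c in OFFSETS.items():
        if (di, dj) == (0, 0):
            continue
        ii, jj = i + di, j + dj
        if 0 <= ii < n and 0 <= jj < n:
            s -= c * x[ii * n + jj]
    x[i * n + j] = s / OFFSETS[(0, 0)]


def gauss_seidel_lex(n, x, b):
    """Lexicographic sweep: point (i, j) reads already-updated (i-1, j) and (i, j-1)."""
    for i in range(n):
        for j in range(n):
            _gs_update(n, x, b, i, j)


def gauss_seidel_wavefront(n, x, b):
    """Same sweep ordered by anti-diagonals d = i + j.

    Points on one anti-diagonal only read neighbors on diagonals d - 1 and
    d + 1, so the inner loop has no dependence between its iterations.
    """
    for d in range(2 * n - 1):
        for i in range(max(0, d - n + 1), min(n, d + 1)):
            _gs_update(n, x, b, i, d - i)
```

In this sketch, the inner loop of gauss_seidel_wavefront is the part that could be distributed across threads, with a synchronization point between successive values of d. A Jacobi sweep, by contrast, would read only the previous iterate and need no such ordering.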

Cited By

  • (2024) MxPL: A Programming Language for Matrix-Related Operations. Symmetry, 16:2 (181). https://doi.org/10.3390/sym16020181. Online publication date: 2-Feb-2024
  • (2024) FreeStencil: A Fine-Grained Solver Compiler with Graph and Kernel Optimizations on Structured Meshes for Modern GPUs. Proceedings of the 53rd International Conference on Parallel Processing (1022-1031). https://doi.org/10.1145/3673038.3673076. Online publication date: 12-Aug-2024

Published In

Proceedings of the ACM on Programming Languages  Volume 7, Issue OOPSLA2
October 2023
2250 pages
EISSN:2475-1421
DOI:10.1145/3554312
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 October 2023
Published in PACMPL Volume 7, Issue OOPSLA2

Author Tags

  1. compiler
  2. domain-specific language
  3. finite difference method
  4. multi-stage programming
  5. performance optimization
  6. polyhedral compilation
  7. stencil
  8. structured grid

Qualifiers

  • Research-article

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 305
  • Downloads (Last 6 weeks): 36
Reflects downloads up to 01 Nov 2024
