Porting and Evaluation of a Distributed Task-driven Stencil-based Application

Authors: Jonathon Anderson, Mauricio Araya-Polo, Jie Meng

PMAM'21: Proceedings of the 12th International Workshop on Programming Models and Applications for Multicores and Manycores

Pages 21 - 30

https://doi.org/10.1145/3448290.3448559

Published: 24 July 2021

Abstract

Alternative programming models and runtimes are increasing in popularity and maturity. This makes it possible to port applications to emerging parallel approaches and compare them, on competitive grounds, against the traditional MPI+X paradigm. In this work, an implementation of a distributed task-based stencil computation is compared with a traditional MPI+X implementation of the same application. The Legion task-based parallel programming system is used as an alternative to MPI, while the underlying OpenMP approach is kept at the subdomain level. Overall, the results are promising for making this alternative method competitive with the traditional MPI approach. In future work, extensions to other applications will be explored, as well as the use of GPUs.

Cited By

Burford A, Calder A, Carlson D, Chapman B, Coskun F, Curtis T, Feldman C, Harrison R, Kang Y, Michalowicz B, Raut E, Siegmann E, Wood D, DeLeon R, Jones M, Simakov N, White J, Oryspayev D. (2021). Ookami: Deployment and Initial Experiences. Practice and Experience in Advanced Research Computing 2021: Evolution Across All Dimensions, 1-8. Online publication date: 17-Jul-2021. https://dl.acm.org/doi/10.1145/3437359.3465578

Raut E, Anderson J, Araya-Polo M, Meng J. (2021). Evaluation of Distributed Tasks in Stencil-based Application on GPUs. 2021 IEEE/ACM 6th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), 45-52. Online publication date: Nov-2021. https://doi.org/10.1109/ESPM254806.2021.00011

Index Terms

1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages

Published In

PMAM'21: Proceedings of the 12th International Workshop on Programming Models and Applications for Multicores and Manycores

February 2021

34 pages

ISBN:9781450383486

DOI:10.1145/3448290

Editors:
Quan Chen, Shanghai Jiao Tong University, China
Zhiyi Huang, University of Otago, New Zealand
Min Si, Argonne National Laboratory, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 22, 2021

Virtual Event, Republic of Korea

Acceptance Rates

Overall Acceptance Rate 53 of 97 submissions, 55%
