Abstract
Achieving performance portability for high-performance computing (HPC) applications in scientific fields has become increasingly important due to the large differences among emerging supercomputer architectures. Here we test key kernels from molecular dynamics (MD) to determine whether the OpenACC directive-based programming model can deliver performance within an acceptable range for these types of programs in the HPC setting. We find that for easily parallelizable kernels, performance on the GPU remains within this range. On the CPU, OpenACC-parallelized pairwise distance kernels did not meet the required performance standards on AMD Opteron “Interlagos” processors, but on IBM POWER9 processors performance remains within an acceptable range for small batch sizes. These kernels provide a test of performance portability with compiler directives for problems with memory-intensive components, as are often found in scientific applications.
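To make the tested approach concrete, the following is a minimal sketch of directive-based parallelization applied to a pairwise distance kernel of the kind evaluated here; the array names, memory layout, and clause choices are illustrative assumptions, not the authors' exact code.

```cpp
// Minimal OpenACC sketch of a pairwise distance kernel; names, layout,
// and data clauses are illustrative assumptions, not the paper's code.
#include <math.h>

void pairwise_distances(int n, const float *x, const float *y,
                        const float *z, float *d /* n*n, row-major */) {
    #pragma acc parallel loop gang \
        copyin(x[0:n], y[0:n], z[0:n]) copyout(d[0:n*n])
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < n; ++j) {
            float dx = x[i] - x[j];
            float dy = y[i] - y[j];
            float dz = z[i] - z[j];
            d[i * n + j] = sqrtf(dx * dx + dy * dy + dz * dz);
        }
    }
}
```

With a directive-capable compiler, the same source can be retargeted to a GPU or a multicore CPU by changing only compiler flags, which is the portability property the abstract evaluates.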
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
A Artifact Description Appendix: Using Compiler Directives for Performance Portability in Scientific Computing: Kernels from Molecular Simulation
A.1 Abstract
This appendix details the run environments, compilers used, and compile line arguments for the four methods detailed in the main text. Note that hardware access is limited to OLCF users.
A.2 Description
Check-list (artifact meta information)
- Algorithm: Select kernels used in molecular dynamics
- Compilation: See compilers and commands below
- Binary: C++/CUDA or C++/OpenACC
- Run-time environment: Modules displayed below
- Hardware: OLCF Titan and Summit, as described in the main text
- Run-time state: Summit used SMT = 1 for CPU threading; run commands below
- Execution: Run commands below; BLAS routines were called using standard calls to the cuBLAS library (see the sketch after this list)
- Publicly available?: All kernels are provided in the text and appendix
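Since the Execution item refers to standard cuBLAS calls, the following is a minimal hedged sketch of such a call (a single single-precision GEMM on device pointers); the matrix size, operands, and omitted error checking are placeholders, not the actual workload from the paper.

```cpp
// Illustrative standard cuBLAS call: C = alpha*A*B + beta*C on the device.
// Sizes and data are placeholders, not the paper's actual workload.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemm_example(int n, const float *dA, const float *dB, float *dC) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major n x n GEMM, as cuBLAS expects.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();  // GEMM launches asynchronously
    cublasDestroy(handle);
}
```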
All kernels used are listed in the main text except for the CUDA kernel, which is provided below:
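The original kernel listing did not survive extraction. In its place, here is a minimal illustrative CUDA pairwise distance kernel of the kind described in the main text; it is a sketch under assumed names and layout, not the authors' exact code (which appears in the published version).

```cuda
// Illustrative CUDA pairwise distance kernel, one thread per (i, j) pair;
// a sketch only -- the authors' original kernel is in the published paper.
__global__ void pairwise_distances(int n, const float *x, const float *y,
                                   const float *z, float *d /* n*n */) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && j < n) {
        float dx = x[i] - x[j];
        float dy = y[i] - y[j];
        float dz = z[i] - z[j];
        d[i * n + j] = sqrtf(dx * dx + dy * dy + dz * dz);
    }
}

// Example launch configuration:
//   dim3 block(16, 16);
//   dim3 grid((n + 15) / 16, (n + 15) / 16);
//   pairwise_distances<<<grid, block>>>(n, x, y, z, d);
```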

Software Dependencies. Below are the modules, compilers, and run commands used.
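The original module and command listing did not survive extraction. As a hedged placeholder, representative commands for building and running the OpenACC variant on Summit with the PGI compiler might look like the following; the module names, flags, and launcher options are assumptions, not the authors' verbatim configuration.

```sh
# Assumed, representative commands only; not the paper's verbatim listing.
module load pgi cuda
pgc++ -fast -acc -ta=tesla:cc70 -Minfo=accel pairdist.cpp -o pairdist
jsrun -n 1 -a 1 -c 1 -g 1 ./pairdist
```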
