
Using Compiler Directives for Performance Portability in Scientific Computing: Kernels from Molecular Simulation

  • Conference paper
  • Part of the proceedings: Accelerator Programming Using Directives (WACCPD 2018)

Abstract

Achieving performance portability for high-performance computing (HPC) applications in scientific fields has become an increasingly important initiative due to large differences in emerging supercomputer architectures. Here we test some key kernels from molecular dynamics (MD) to determine whether the OpenACC directive-based programming model, when applied to these kernels, can deliver performance within an acceptable range for these types of programs in the HPC setting. We find that for easily parallelizable kernels, performance on the GPU remains within this range. On the CPU, OpenACC-parallelized pairwise distance kernels would not meet the required performance standards when using AMD Opteron “Interlagos” processors, but with IBM POWER9 processors performance remains within an acceptable range for small batch sizes. These kernels provide a test for achieving performance portability with compiler directives for problems with memory-intensive components, as are often found in scientific applications.

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
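The following is not code from the paper; it is a minimal illustrative sketch of how a pairwise-distance kernel of the kind evaluated in the abstract might be expressed with OpenACC directives, assuming coordinates packed in a flat array x of length 3*n and an n-by-n output matrix d.

    // Illustrative sketch only (not the authors' kernel). Coordinates are
    // assumed packed as x[3*i + {0,1,2}]; d receives the full n x n matrix.
    #include <cmath>

    void pairwise_distances(const double *x, double *d, int n) {
        // Offload both loops; copy coordinates in and the distance matrix out.
        #pragma acc parallel loop collapse(2) copyin(x[0:3*n]) copyout(d[0:n*n])
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                double dx = x[3*i]     - x[3*j];
                double dy = x[3*i + 1] - x[3*j + 1];
                double dz = x[3*i + 2] - x[3*j + 2];
                d[i*n + j] = std::sqrt(dx*dx + dy*dy + dz*dz);
            }
        }
    }

Because the parallelism is expressed only through the directive, the same source can be built for multicore CPUs or for GPU offload by changing the compiler target, which is the portability property the abstract evaluates.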




Author information

Corresponding author

Correspondence to Ada Sedova.


A Artifact Description Appendix: Using Compiler Directives for Performance Portability in Scientific Computing: Kernels from Molecular Simulation

A.1 Abstract

This appendix details the run environments, compilers used, and compile-line arguments for the four tested methods described in the text. Note that hardware access is limited to OLCF users.

A.2 Description

Check-list (artifact meta information)

  • Algorithm: Select kernels used in molecular dynamics

  • Compilation: See compilers and commands below

  • Binary: C++/CUDA or C++/OpenACC

  • Run-time environment: Modules displayed below

  • Hardware: OLCF Titan and Summit as described in main text

  • Run-time state: Summit used SMT = 1 for CPU threading. Run commands below

  • Execution: Run commands below; BLAS routines were called using standard calls to the cuBLAS library (an illustrative call is sketched after this list)

  • Publicly available?: All kernels are provided in the text and appendix
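The checklist entry on execution mentions standard cuBLAS calls but does not name the routine in this extract; as a hedged illustration only, the cross terms for a batch of distance matrices could be obtained with a strided-batched GEMM such as the following, where the 3-by-m coordinate layout and the batch count are assumptions.

    // Hedged illustration, not taken from the paper: C_i = A_i^T * A_i for a
    // batch of 3 x m coordinate matrices (column-major), one m x m Gram matrix each.
    #include <cublas_v2.h>

    void batched_gram(const double *dA, double *dC, int m, int batch) {
        cublasHandle_t handle;
        cublasCreate(&handle);
        const double alpha = 1.0, beta = 0.0;
        cublasDgemmStridedBatched(handle,
                                  CUBLAS_OP_T, CUBLAS_OP_N,
                                  m, m, 3,                    // C_i is m x m, inner dimension 3
                                  &alpha,
                                  dA, 3, (long long)3 * m,    // A_i and its batch stride
                                  dA, 3, (long long)3 * m,
                                  &beta,
                                  dC, m, (long long)m * m,    // C_i and its batch stride
                                  batch);
        cublasDestroy(handle);
    }

Squared pairwise distances then follow from each Gram matrix as d^2(a,b) = G(a,a) + G(b,b) - 2 G(a,b); whether the paper used this particular decomposition is not stated here.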

All kernels used are listed in the main text except the CUDA kernel, which is provided below:

[CUDA kernel listing from the original paper; not reproduced in this extract.]
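Since the original listing is unavailable in this extract, the following is only a placeholder sketch of a CUDA pairwise-distance kernel, not the authors' code; the thread-block shape and packed coordinate layout are assumptions.

    // Placeholder sketch only; see the original figure for the paper's actual kernel.
    __global__ void pairwise_distances(const double *x, double *d, int n) {
        int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of the distance matrix
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of the distance matrix
        if (i < n && j < n) {
            double dx = x[3*i]     - x[3*j];
            double dy = x[3*i + 1] - x[3*j + 1];
            double dz = x[3*i + 2] - x[3*j + 2];
            d[(size_t)i * n + j] = sqrt(dx*dx + dy*dy + dz*dz);
        }
    }

    // Example launch covering an n x n matrix with 16 x 16 thread blocks:
    //   dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    //   pairwise_distances<<<grid, block>>>(d_x, d_d, n);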

Software Dependencies. Below are the modules, compilers, and run commands used.

[Listing of modules, compilers, and run commands from the original paper; not reproduced in this extract.]


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Sedova, A., Tillack, A.F., Tharrington, A. (2019). Using Compiler Directives for Performance Portability in Scientific Computing: Kernels from Molecular Simulation. In: Chandrasekaran, S., Juckeland, G., Wienke, S. (eds) Accelerator Programming Using Directives. WACCPD 2018. Lecture Notes in Computer Science, vol 11381. Springer, Cham. https://doi.org/10.1007/978-3-030-12274-4_2


  • DOI: https://doi.org/10.1007/978-3-030-12274-4_2


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-12273-7

  • Online ISBN: 978-3-030-12274-4

  • eBook Packages: Computer Science, Computer Science (R0)
