Abstract
Cosmological N-body simulations demand very large-scale, high-resolution computation, and their run time grows rapidly with problem size, which makes computational cost the central concern in N-body problems. To meet the growing demand, high-performance computing systems and efficient parallel algorithms have been applied to the N-body problem. PHotoNs-2, a parallel N-body code designed for Lambda cold dark matter simulations, was developed using a hybrid of the fast multipole method (FMM) and the particle-mesh (PM) method. In this study, PHotoNs-2 is ported to a parallel heterogeneous CPU+accelerator platform; the ported code is referred to as PHotoNs-MA. Its performance is challenged by massive data transfers, memory access, and complex mathematical functions. The main optimizations of the short-range force kernels on the SIMT architecture are: using page-locked memory for large data transfers and a structure-of-arrays layout to improve memory access efficiency; transferring index lists instead of particle interaction lists to reduce transfer overhead; and replacing the modified interaction-force formula with interpolation. Compared with PHotoNs-2 running on 4 CPU cores, the optimized PHotoNs-MA on 4 accelerators speeds up the P2P operator by a factor of 1000. Compared with Gadget-2 running on 64 CPU cores, overall performance on 4 accelerators improves by a factor of 6. In large-scale simulations, near-linear scalability is observed for P2P, and the parallel efficiency reaches 89.28%.
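The interpolation optimization mentioned above can be illustrated with a minimal host-side sketch. The abstract does not state the modified force formula, so the sketch assumes a Gadget-2-style short-range truncation factor built from erfc and exp; the function names, the cutoff of 4.5 split scales, and the table size are all hypothetical and chosen only for illustration. It shows the idea of tabulating the expensive factor once and replacing per-pair evaluation with a linear lookup, and it is not the PHotoNs-MA kernel.

```python
import math


def short_range_factor(r, r_s):
    """Assumed truncation factor (Gadget-2 style), not taken from the paper.

    Multiplies the Newtonian pair force so that only the short-range part
    remains; r_s is the force-split scale.
    """
    x = r / (2.0 * r_s)
    return math.erfc(x) + (2.0 * x / math.sqrt(math.pi)) * math.exp(-x * x)


def build_table(r_cut, r_s, n_bins):
    """Tabulate the factor on a uniform grid over [0, r_cut]."""
    step = r_cut / n_bins
    table = [short_range_factor(i * step, r_s) for i in range(n_bins + 1)]
    return table, step


def interp_factor(r, table, step):
    """Linear interpolation in the table; pairs beyond the cutoff get zero."""
    last = len(table) - 1
    if r >= step * last:
        return 0.0
    u = r / step
    i = min(int(u), last - 1)  # clamp guards against rounding at the edge
    t = u - i
    return table[i] * (1.0 - t) + table[i + 1] * t


# Hypothetical parameters: split scale 1.0, cutoff at 4.5 split scales.
R_S = 1.0
R_CUT = 4.5 * R_S
TABLE, STEP = build_table(R_CUT, R_S, 512)

# Largest deviation of the lookup from the direct formula inside the cutoff.
MAX_ERR = max(
    abs(interp_factor(k * R_CUT * 0.999 / 1000, TABLE, STEP)
        - short_range_factor(k * R_CUT * 0.999 / 1000, R_S))
    for k in range(1000)
)
```

On an accelerator the same table would typically live in fast on-chip or cached memory, so each pair interaction costs one lookup and a fused multiply-add instead of an erfc and an exp evaluation; the table size trades memory for interpolation error.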
Acknowledgements
This work was supported by the National Key R&D Program for Developing Basic Sciences (Grant No. 2020YFB0204802), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDC01000000), and GHFUND A No. 20210701. The numerical calculations in this paper were carried out on the CAS Xiandao-1 computing environment.
Cite this article
Zhao, WL., Wang, W. & Wang, Q. Optimization of cosmological N-body simulation with FMM-PM on SIMT accelerators. J Supercomput 78, 7186–7205 (2022). https://doi.org/10.1007/s11227-021-04153-0
DOI: https://doi.org/10.1007/s11227-021-04153-0