¹¹institutetext: ¹ KTH Royal Institute of Technology, Stockholm, Sweden
² Max Planck Institute for Plasma Physics, Garching and Greifswald, Germany
³ Institute of Plasma Physics of the CAS, Prague, Czech Republic
⁴ LeCAD, University of Ljubljana, Ljubljana, Slovenia
⁵ See the author list of [Nuclear Fusion, Doi: 10.1088/1741-4326/ad5a21]
⁶ Max Planck Computing and Data Facility, Garching and Greifswald, Germany
jjwil@kth.se

Understanding Large-Scale Plasma Simulation Challenges for Fusion Energy on Supercomputers

J. J. Williams¹ A. Bhole¹ D. Kierans¹ M. Hoelzl² I. Holod² W. Tang² D. Tskhakaya³
S. Costea⁴ L. Kos⁴ A. Podolnik³ J. Hromadka³ JOREK Team⁵ E. Laure⁶ & S. Markidis¹

1 Introduction

Understanding plasma instabilities is crucial for sustainable fusion energy. Large-scale plasma simulations are essential for designing next-generation fusion energy devices and modeling industrial plasmas. Accurate modeling and prediction of plasma behavior under extreme conditions require sophisticated simulation codes to capture the complex interactions between plasma dynamics, magnetic fields, and material surfaces. This work conducts an HPC analysis of two prominent plasma simulation codes, BIT1 and JOREK, to advance the understanding of plasma behavior in fusion energy applications. The focus is on evaluating JOREK’s computational efficiency and scalability for simulating non-linear MHD phenomena in tokamak fusion devices [1]. Previous studies examined BIT1 [2, 3], a massively parallel PIC code for fusion device plasma-material interactions. Investigations into BIT1’s computational requirements and scalability on supercomputers provided crucial insights. Detailed profiling pinpointed bottlenecks, guiding optimization efforts that notably boosted parallel performance. Key contributions include addressing challenges in BIT1’s neutral particle ionization and non-linear JOREK shattered pellet injection (SPI) simulations. Profiling techniques also highlighted computational hotspots during strong scaling tests, especially with activated diagnostics.

2 PIC MC BIT1 Code

Modeling plasma-loaded divertors in fusion devices like ITER is challenging. The divertor manages heat and particle fluxes, protects the first wall from high-energy neutron damage, and removes impurities by diverting plasma flow to a specific region at the bottom of the toroidal chamber (see Fig. 1). The Particle-in-Cell (PIC) method models plasma behavior by simulating

Refer to caption — Figure 1: BIT1 simulates plasma behavior in the tokamak divertor (blue arrows), such as in the ITER fusion device [3, 4, 5].

particle dynamics in 1-3D spatial dimensions and 3D velocity space, integrating Monte Carlo (MC) routines for particle collisions and interactions with chamber walls. The PIC cycle involves interpolating particle data for plasma density computation, smoothing densities to remove spurious frequencies, solving linear systems for electric and magnetic fields, using MC techniques for collisions and wall interactions, and advancing particle positions and velocities. BIT1, a 1D3V PIC code optimized for HPC systems, operates in one-dimensional space with three velocity components per particle, facilitating large-scale plasma simulations.

3 Non-linear Extended MHD JOREK Code

JOREK, an HPC-capable code for fusion reactor modeling [1], employs hybrid MPI+OpenMP parallelization and various plasma physics models, including fluid and kinetic ones. It uses

advanced finite element methods and fully implicit time-integration techniques with GMRES for solving linearized equations. Interfaces to libraries like PaStiX and STRUMPACK enhance GMRES efficiency. Given its extensive equations involved, simulations of fusion reactors are computationally demanding on supercomputers due to their scale in length and time. As an example, in the Fig. 2, a snapshot of the toroidal current is plotted at 0.72 ms in a 3D SPI simulation performed using two-temperature reduced magentohydrodynamics (MHD) model [1]. The simulation assumes a frozen pellet of Neon gas being shattered into 53 fragments that are injected into ASDEX-upgrade plasma. It uses 12,460 finite elements in the poloidal plane and up to 7 toroidal harmonics resolved using 64 equidistance planes.

4 Methodology & Experimental Setup

In this work, we use perf for profiling hardware performance counters, focusing on execution time, cache, and memory. Additionally, we employ TAU (Tuning and Analysis Utilities) for

automated and manual instrumentation to analyze performance data, identifying bottlenecks and inefficient code paths [6]. Our experiments compare BIT1 and JOREK on two systems: Marconi A3 (Skylake), featuring 2982 nodes with Intel Xeon 8160 24-core processors and Dardel, with 1270 nodes equipped with AMD EPYC Zen2 64-core processors. Further details about these systems are provided in the poster.

5 Performance Results and Analysis

We begin by identifying the challenges discovered in the BIT1 neutral particle ionization and non-linear JOREK SPI simulations to pinpoint the most computationally intensive parts of these codes during strong scaling tests, particularly with diagnostics activated. As previously discovered and improved in [2, 3], using perf, the BIT1 performance considerably depends on the problem size and effective LLC (L3) usage. When BIT1 simulates up to 50 nodes on Dardel in [2], the Ionization simulation scales well initially up to 10 nodes but then faces diminishing returns with reduced efficiency as more nodes are added beyond 20. This highlights the challenges in maintaining linear speedup with increasing parallelism for small size problems. Contrary to this, large size BIT1 simulations indicate hyper-scaling [2]. For JOREK, we analyze a 3D SPI simulation from a checkpoint to study JOREK’s HPC performance. In Fig 3, we

compile JOREK with STRUMPACK v7.2.0 and compare 8 to 64 nodes on Marconi. Significant speed-ups are observed using perf: for 10 time steps, speed-up increases from 1 (8 nodes) to 1.93 (64 nodes), a 93% improvement; for 5 time steps, it goes from 1 to 1.89, an 89% gain. Fig 4 shows 10 time steps with varying OpenMP threads on Marconi, revealing decreasing execution time with more threads, but not at an ideal linear rate. Using 2 threads as a baseline (3532.7 seconds), 4 threads take 2235.1 seconds (26.6% slower than ideal). Increasing to 8, 12, and 24 threads results in 1543.7, 1389.6, and 1377.3 seconds, respectively, versus ideal times of 883.2, 588.8, and 294.4 seconds. This shows diminishing returns due to overhead from thread management and synchronization.

Using TAU profiling on JOREK SPI simulations for 10 time steps on Marconi reveal detailed performance insights. Simulating with 2 OpenMP threads, thread 0 primarily focuses on

$construct\_matrix$ (58.1%) and $solve\_sparse\_system$ (32.6%), while worker threads are fully occupied by executing the $OpenMP\_Thread\_Type\_ompt\_thread\_worker$ routine (contributing 39.1% each to elementary_matrix_build and $element\_matrix\_fft$ ). Increasing to 24 threads shifts thread 0’s focus to $solve\_sparse\_system$ (53.2%) and $construct\_matrix$ (35%), optimizing sparse system solving with enhanced parallel capabilities. As seen in Fig 5, worker threads maintains focus on the $OpenMP\_Thread\_Type\_ompt\_thread\_worker$ routine, however it contributions to $elementary\_matrix\_build$ and $element\_matrix\_fft$ reduction by 10.6% each, highlighting improved workload distribution across threads for efficient performance in complex computations.

6 Conclusions and Future Work

In this work, we identified the $solve\_sparse\_system$ and $construct\_matrix$ routines as the most computationally intensive parts of the JOREK SPI simulation, consuming up to 88.2% of execution time with 24 threads. This analysis was conducted for a moderate-size problem, and better scalability is expected for the largest production cases. Future efforts will extend our analysis to other platforms and incorporate hybrid fluids+kinetic models to enhance overall efficiency.

References

[1] M. Hoelzl, et al., NF 61, 065001 (2021)
[2] J. J. Williams, et al., LNCS 14351, Springer Nature, arXiv:2306.16512 (2023)
[3] J. J. Williams, et al., LNCS 14832, Springer Nature, arXiv:2404.10270 (2024)
[4] D. Tskhakaya, et al., Contrib. to Plasma Phys., volume 44, 5-6, 564-570 (2004)
[5] D. Tskhakaya, et al., Pisa Italy 2010, Computer Society, 476-481, IEEE (2010)
[6] Shende, Sameer S., et al., volume 20.2, 287-311 Sage Pub. Sage CA (2006)
[7] Snyder, Shane, et al., 5th Workshop on ESPT, 9-17, IEEE (2016)

7 Acknowledgements

This work has received funding from the European High Performance Computing Joint Undertaking (JU) and Sweden, Finland, Germany, Greece, France, Slovenia, Spain, and the Czech Republic under grant agreement No 101093261. The computations/data handling were/was enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725. Parts of this work was carried out using the Marconi-Fusion Supercomputer.