>[!INFO] Work in Progress
> This note is a work in progress! It's only published so I can more easily share it with colleagues who are helping with the benchmarking.

In this note, I will discuss how to optimize VASP's parallelism. I will mainly focus on how to properly set the $\texttt{KPAR}$ and $\texttt{NCORE}$ tags in pure MPI versions of VASP (i.e., VASP 5, or VASP 6 with OpenMP disabled). I will assume you've already picked a ballpark for the number of nodes or MPI ranks and focus on how to get the best performance with just $\texttt{NCORE}$ (or, equivalently, $\texttt{NPAR}$) and $\texttt{KPAR}$. Picking the optimal number of MPI ranks depends on how many atoms/electrons your system has, the number of bands you're computing, whether or not you're using SOC, how many node hours you're willing to burn, and many other factors. I will write a separate note about that later.

>[!WARNING] Disclaimer
> I am not an expert at high-performance computing (HPC), and I haven't done an extensive amount of benchmarking. The point of this note is to give you some rough guidelines for setting those two tags.

# Resources

Below are some useful resources on parallelism in general and on VASP specifically, mostly courtesy of NERSC. If you're running large calculations, it is worth having an understanding of these concepts.

1. [NERSC documentation on process and thread affinity](https://docs.nersc.gov/jobs/affinity/)
2. [VASP Wiki entry on parallelization](https://www.vasp.at/wiki/index.php/Category:Parallelization)
3. [VASP Wiki entry on optimizing parallelization](https://www.vasp.at/wiki/index.php/Optimizing_the_parallelization)
4. [NERSC slides on parallelizing VASP](https://www.nersc.gov/assets/Uploads/Using-VASP-at-NERSC-20180629.pdf)
5. [NERSC paper benchmarking hybrid VASP on Cori](https://cug.org/proceedings/cug2017_proceedings/includes/files/pap134s2-file1.pdf)
6. [NERSC on KNL's MCDRAM, NUMA modes, etc.](https://docs.nersc.gov/systems/cori/knl_modes/)

# Glossary

Below are some terms I will refer to in the rest of this note.

| Syntax | Description | Formula/Notes |
| --- | --- | --- |
| $N_{\text{nodes}}$ | Number of nodes | Calculation dependent |
| $N_\text{cores/node}$ | Number of *physical* cores per node | Hardware specification |
| $N_{\text{MPI/node}}$ | Number of MPI ranks per node | Usually set to a factor of $N_\text{cores/node}$ |
| $N_{\text{MPI total}}$ | Total number of MPI ranks | $=N_{\text{nodes}} \times N_{\text{MPI/node}}$ |
| $N_k$ | Number of k-points | $\texttt{NKPTS}$ in `OUTCAR` |
| $N_{\text{k-groups}}$ | Number of k-groups | $\texttt{KPAR}$ in `INCAR`, 1 by default |
| $N_\text{k-points/k-group}$ | Number of k-points in each k-group | $=N_k/N_\text{k-groups}$ |
| $N_{\text{MPI/k-group}}$ | Number of MPI ranks per k-group | $=N_{\text{MPI total}}/N_{\text{k-groups}}$ |
| $N_{\text{bands}}$ | Number of bands | $\texttt{NBANDS}$. Usually set in `INCAR`, but there's a reasonable default. |
| $N_{\text{b-groups}}$ | Number of band groups | $=N_\text{MPI/k-group}/N_\text{MPI/b-group} = \texttt{NPAR}$. Can be set in `INCAR` (see below), defaults to $N_\text{MPI/k-group}$ |
| $N_{\text{bands/b-group}}$ | Number of bands in each band group | $=N_\text{bands}/N_\text{b-groups} = \texttt{NBANDS}/\texttt{NPAR}$ |
| $N_{\text{MPI/b-group}}$ | Number of MPI ranks per band group | $=\texttt{NCORE}$, usually set in `INCAR` to a factor of $N_{\text{MPI/node}}$. 1 by default. |

>[!NOTE] $\texttt{NPAR}$ and $\texttt{NCORE}$
> $\texttt{NPAR}$ and $\texttt{NCORE}$ are mutually exclusive, and $\texttt{NPAR}$ takes precedence if both are set. I will only discuss setting $\texttt{NCORE}$ in what follows, as it is easier to understand and is generally more common in calculations done with recent versions of VASP.

>[!HINT]
> Sometimes it's beneficial to run VASP with fewer $N_\text{MPI/node}$ than $N_\text{cores/node}$, e.g., to deal with memory bottlenecks in large supercells, or when using OpenMP on architectures like Intel Xeon Phi (e.g., Cori's KNL).
>
> I will not discuss OpenMP in what follows (see the [NERSC whitepaper](https://cug.org/proceedings/cug2017_proceedings/includes/files/pap134s2-file1.pdf) for more details). However, note that when `OMP_NUM_THREADS != 1`, VASP will automatically set `NCORE=1` regardless of what you have in `INCAR`.

# General notes on `NCORE` and `KPAR`

You can see more details about these two tags on the VASP wiki ([NCORE](https://www.vasp.at/wiki/index.php/NCORE) and [KPAR](https://www.vasp.at/wiki/index.php/KPAR)), but here's a summary.

>[!SUMMARY] Setting $\texttt{NCORE}$
> 1. $\texttt{NCORE}$ distributes both the computational load and the data.
> 2. Generally, you want $\texttt{NCORE}$ to be a factor of the number of physical cores per node $N_\text{cores/node}$. For architectures like Knights Landing with 68 cores per node, most people use 64 MPI ranks per node and pretend the last 4 cores don't exist.
> 3. For very small unit cells and very simple calculations, $\texttt{NCORE}=1$ is not a bad choice. You should also use this value on small clusters with poor communication bandwidth and/or a small number of total cores.
> 4. For modern multi-core, many-core, and massively parallel machines, $\texttt{NCORE}$ should be set in the range $[2, N_\text{cores/node}]$ or $[2, N_\text{cores/socket}]$. Cori Haswell, for example, has 32 cores/node and 16 cores/socket.
> 5. For bandwidth-limited machines, or those with load-balancing issues, it is recommended to set $\texttt{NCORE}$ only up to $\sqrt{N_\text{cores/node}}$ instead. I haven't experimented with this too much; I think (hope?) Cori and Perlmutter are fine in that respect. (I think the VASP wiki has a typo here where they set $\texttt{NPAR}$ instead.)
> 6. Increasing $\texttt{NCORE}$ reduces the memory used per core to store nonlocal projectors. I will personally sometimes increase it to $N_{\text{cores/node}}$ for large supercells or slabs/nanowires.
> 7. In all cases, finding the optimal value requires some benchmarking. If you're running a large number of similar calculations, spend some time benchmarking one of them with various parallelism settings.

>[!SUMMARY] Setting $\texttt{KPAR}$
> 1. $\texttt{KPAR}$ only distributes the computational load, *not* the data, so it can lead to large memory usage.
> 2. The VASP wiki recommends setting both $\texttt{KPAR}$ and $\texttt{NCORE}$ on HPC clusters, especially for large workloads.

# Setting `NCORE` and `KPAR` simultaneously

>[!NOTE] OpenMP
> As mentioned above, I am generally assuming we're working with a pure MPI version of VASP, i.e., a version below 6, or version 6 with OpenMP disabled (`export OMP_NUM_THREADS=1`). The results here were tested using VASP v6.3.2, compiled for Cori Haswell (`module load vasp/6.3.2-hsw`).
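Before going through the conditions, a quick illustration of guidelines 2 and 4 from the $\texttt{NCORE}$ summary above: the candidate values are simply the divisors of the physical core count. Below is a tiny Python sketch (my own helper, not anything shipped with VASP) that lists them for the Cori Haswell nodes used in the examples later in this note.

```python
# Minimal sketch: candidate NCORE values are the divisors of the physical
# core count per node (or per socket). The numbers below are Cori Haswell's.

def divisors(n: int):
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]

cores_per_node, cores_per_socket = 32, 16
print("NCORE candidates (per node):  ", divisors(cores_per_node))
print("NCORE candidates (per socket):", divisors(cores_per_socket))
# NCORE candidates (per node):   [1, 2, 4, 8, 16, 32]
# NCORE candidates (per socket): [1, 2, 4, 8, 16]
```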
## Condition 1

The most important condition for using $\texttt{NCORE}$ and $\texttt{KPAR}$ simultaneously is that the number of k-groups $N_{\text{k-groups}} \equiv \texttt{KPAR}$ needs to be a factor of the total number of MPI ranks, i.e., each k-group needs an integer number of MPI ranks:
$
\begin{align}
N_\text{MPI/k-group} = \frac{N_\text{MPI total}}{N_\text{k-groups}} \in \mathbb{Z}^+ &\implies \frac{N_\text{MPI total}}{\texttt{KPAR}} \in \mathbb{Z}^+,
\end{align}
$
where $\mathbb{Z}^+$ is the set of positive integers. If you violate this condition, your calculation will crash with an error along the lines of:
```
M_divide: can not subdivide 320 nodes by 6
```
For this calculation, I used 320 MPI ranks and $\texttt{KPAR}=6$. VASP happens to call MPI ranks "nodes" in this particular error message.

## Condition 2

Second, since each k-group is split into $N_\text{b-groups} \equiv \texttt{NPAR}$ band groups, each with $N_\text{MPI/b-group} \equiv \texttt{NCORE}$ MPI ranks, $\texttt{NCORE}$ needs to be a factor of $N_\text{MPI/k-group}$. More concretely:
$
\begin{align}
N_\text{b-groups} = \frac{N_\text{MPI/k-group}}{N_\text{MPI/b-group}} \in \mathbb{Z}^+ &\implies \frac{N_\text{MPI total}}{\texttt{KPAR} \times \texttt{NCORE}} \in \mathbb{Z}^+.
\end{align}
$
Violating this condition won't make VASP crash, but it will severely affect performance. The first thing you'll notice is this line near the top of `stdout` or `OUTCAR`:
```
distr: one band on NCORE= 1 cores, 104 groups
```
No matter what you set $\texttt{NCORE}$ to in `INCAR`, violating this condition will force it to 1, significantly degrading performance. No warning or "advice" is issued by VASP, either.

To make matters worse, VASP will likely increase the number of bands, sometimes significantly. This is because when $\texttt{NCORE}$ is forced to 1, we end up with $\texttt{NPAR} = N_\text{MPI total}/\texttt{KPAR}$, which is usually a large number for large calculations. Since $\texttt{NBANDS}$ needs to be an integer multiple of $\texttt{NPAR}$, VASP will automatically change the number of bands to the closest multiple of $\texttt{NPAR}$ larger than the $\texttt{NBANDS}$ you specified in `INCAR`. This slows down the calculation even further.

## Condition 3

Third, you ideally want the number of k-groups $N_\text{k-groups} \equiv \texttt{KPAR}$ to be a factor of the number of k-points $N_k$, so that each k-group works on an equal number of k-points:
$
\begin{align}
N_\text{k-points/k-group} = \frac{N_k}{N_\text{k-groups}} \in \mathbb{Z}^+ &\implies \frac{N_k}{\texttt{KPAR}} \in \mathbb{Z}^+.
\end{align}
$
Violating this condition will not lead to any errors, warnings, or even "advice" messages in the `OUTCAR` or `stdout`. For example, if you have 144 k-points and set $\texttt{KPAR}=5$, you'll end up with 4 k-groups with 29 k-points each and 1 k-group with 28 k-points. This is not a huge issue: the last k-group has 96.5% as many k-points as the others, so it won't be idle for too long. However, if instead you have $N_k = 13$ and set $\texttt{KPAR}=5$, you'll end up with 3 k-groups with 3 k-points each and 2 k-groups with 2 k-points each. These last 2 groups will be idle ~1/3 of the time, which is quite inefficient. Another example: if you have $N_k = 120$ and set $\texttt{KPAR}=7$, you'll end up with 6 k-groups with 17 k-points each and a seventh group with 18 k-points. 6 of your 7 k-groups (i.e., ~85% of your MPI ranks) will sit idle while the seventh group works on its one extra k-point.
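To put some numbers on how lopsided a bad k-point split can get, here's a short Python sketch (my own illustration; it assumes each k-point costs roughly the same and that the k-points are dealt out as evenly as possible) reproducing the three examples above.

```python
# Minimal sketch of k-point load balance across k-groups. Assumption: equal
# cost per k-point and an as-even-as-possible split over KPAR groups.

def kpoint_balance(nk: int, kpar: int) -> None:
    big, small = -(-nk // kpar), nk // kpar   # ceil and floor of nk/kpar
    n_big = nk % kpar                         # groups that get one extra k-point
    n_small = kpar - n_big
    if n_big == 0:                            # perfectly balanced split
        print(f"N_k={nk}, KPAR={kpar}: all {kpar} groups get {small} k-points")
        return
    idle = 1 - small / big                    # idle fraction of the smaller groups
    print(f"N_k={nk}, KPAR={kpar}: {n_big} groups x {big} k-points, "
          f"{n_small} groups x {small} k-points; smaller groups idle ~{idle:.0%}")

kpoint_balance(144, 5)   # 4 groups x 29, 1 group x 28 -> ~3% idle
kpoint_balance(13, 5)    # 3 groups x 3,  2 groups x 2 -> ~33% idle
kpoint_balance(120, 7)   # 1 group x 18,  6 groups x 17 -> ~6% idle
```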
## Condition 4

Finally, you also ideally want the number of band groups $N_{\text{b-groups}} \equiv \texttt{NPAR}$ to be a factor of the number of bands $\texttt{NBANDS}$:
$
\begin{align}
N_\text{bands/b-group} = \frac{N_{\text{bands}}}{N_\text{b-groups}} \in \mathbb{Z}^+ &\implies \frac{\texttt{NBANDS}}{\texttt{NPAR}} \in \mathbb{Z}^+.
\end{align}
$
If you violate this condition, you'll end up with something like this:
```
 -----------------------------------------------------------------------------
|                                                                             |
|           W    W    AA    RRRRR   N    N  II  N    N   GGGG   !!!           |
|           W    W   A  A   R    R  NN   N  II  NN   N  G    G  !!!           |
|           W    W  A    A  R    R  N N  N  II  N N  N  G       !!!           |
|           W WW W  AAAAAA  RRRRR   N  N N  II  N  N N  G  GGG   !            |
|           WW  WW  A    A  R   R   N   NN  II  N   NN  G    G                |
|           W    W  A    A  R    R  N    N  II  N    N   GGGG   !!!           |
|                                                                             |
|     The number of bands has been changed from the values supplied in        |
|     the INCAR file. This is a result of running the parallel version.       |
|     The orbitals not found in the WAVECAR file will be initialized with     |
|     random numbers, which is usually adequate. For correlated               |
|     calculations, however, you should redo the groundstate calculation.     |
|     I found NBANDS = 420. Now, NBANDS = 448.                                |
|                                                                             |
 -----------------------------------------------------------------------------
```

# Examples

The examples below all have $N_k=144$ k-points. The system is an insulator with 400 electrons, and I had SOC enabled (i.e., we need $\texttt{NBANDS}$ to be at least 401, or VASP will throw a warning). The examples were run on Cori Haswell with 32 cores per node, using VASP 6.3.2.

> [!HINT] `OUTCAR` header
> The first few lines of `OUTCAR` (or `stdout`) are helpful for a quick glance at your parallelization options. They usually look like this:
> ```
> running ${N_mpi_tot} mpi-ranks, with ${OMP_NUM_THREADS} threads/rank
> distrk: each k-point on ${KCORE} cores, ${KPAR} groups
> distr: one band on NCORE= ${NCORE} cores, ${NPAR} groups
> ```
> where `${N_mpi_tot}` is $N_\text{MPI total}$ and `${KCORE}` is $N_\text{MPI/k-group}$.

## Example 0: a working calculation

### Input

* $\texttt{NKPTS} = 144$
* $\texttt{NBANDS} = 420$
* $\texttt{NCORE} = 32$
* $\texttt{KPAR}=6$
* $N_\text{MPI total} = 384$

### Conditions

* **Condition 1:** $N_\text{MPI/k-group} = N_\text{MPI total}/\texttt{KPAR} = 384/6=64$ ✅
* **Condition 2:** $\texttt{NPAR} = N_\text{MPI total}/(\texttt{KPAR} \times \texttt{NCORE}) = 384/(6\times32) = 2$ ✅
* **Condition 3:** $N_{\text{k-points/k-group}} = \texttt{NKPTS}/\texttt{KPAR} = 144/6=24$ ✅
* **Condition 4:** $N_{\text{bands/b-group}} = \texttt{NBANDS}/\texttt{NPAR} = 420/2=210$ ✅

### `OUTCAR` header

```
running 384 mpi-ranks, with 1 threads/rank
distrk: each k-point on 64 cores, 6 groups
distr: one band on NCORE= 32 cores, 2 groups
```

### Conclusion

Everything is running as expected: the number of bands wasn't changed, and no warnings or other messages were issued.
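The condition checks in Example 0 are just divisibility tests, so they're easy to script. Here's a small Python sketch that runs all four for a given set of inputs; the function and argument names are my own, not VASP tags or output.

```python
# Minimal sketch of the four conditions above; all names here are my own.

def check_parallel_settings(n_mpi_total: int, kpar: int, ncore: int,
                            nkpts: int, nbands: int) -> None:
    cond1 = n_mpi_total % kpar == 0                        # ranks split evenly over k-groups
    cond2 = cond1 and (n_mpi_total // kpar) % ncore == 0   # NCORE divides ranks per k-group
    cond3 = nkpts % kpar == 0                              # k-points split evenly over k-groups
    npar = n_mpi_total // (kpar * ncore) if cond2 else None
    cond4 = cond2 and nbands % npar == 0                   # NPAR divides NBANDS
    for name, ok in zip(("1", "2", "3", "4"), (cond1, cond2, cond3, cond4)):
        print(f"Condition {name}: {'OK' if ok else 'violated'}")

# Example 0 from above: everything passes.
check_parallel_settings(n_mpi_total=384, kpar=6, ncore=32, nkpts=144, nbands=420)
```

Feeding in the inputs from Examples 1 and 2 below flags conditions 1 and 2, respectively, matching what VASP reports there.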
## Example 1: violating condition 1

### Input

* $\texttt{NKPTS} = 144$
* $\texttt{NBANDS} = 420$
* $\texttt{NCORE} = 8$
* $\texttt{KPAR}=6$
* $N_\text{MPI total} = 320$

### Conditions

* **Condition 1:** $N_\text{MPI/k-group} = N_\text{MPI total}/\texttt{KPAR} = 320/6=160/3$ ❌
* **Condition 2:** $\texttt{NPAR} = N_\text{MPI total}/(\texttt{KPAR} \times \texttt{NCORE}) = 320/(6\times8) = 20/3$ ❌
* **Condition 3:** $N_{\text{k-points/k-group}} = \texttt{NKPTS}/\texttt{KPAR} = 144/6=24$ ✅
* **Condition 4:** $N_{\text{bands/b-group}} = \texttt{NBANDS}/\texttt{NPAR} = 420/(20/3)=63$ ✅

### `OUTCAR` header

```
running 320 mpi-ranks, with 1 threads/rank
```

But the calculation immediately crashes with the following error:

```
M_divide: can not subdivide 320 nodes by 6
```

### Conclusion

Violating condition 1 makes the calculation crash immediately. Note that violating condition 1 automatically implies a violation of condition 2.

## Example 2: violating condition 2

### Input

* $\texttt{NKPTS} = 144$
* $\texttt{NBANDS} = 416$
* $\texttt{NCORE} = 16$
* $\texttt{KPAR}=4$
* $N_\text{MPI total} = 416$

### Conditions

* **Condition 1:** $N_\text{MPI/k-group} = N_\text{MPI total}/\texttt{KPAR} = 416/4=104$ ✅
* **Condition 2:** $\texttt{NPAR} = N_\text{MPI total}/(\texttt{KPAR} \times \texttt{NCORE}) = 416/(4\times16) = 13/2$ ❌
* **Condition 3:** $N_{\text{k-points/k-group}} = \texttt{NKPTS}/\texttt{KPAR} = 144/4=36$ ✅
* **Condition 4:** $N_{\text{bands/b-group}} = \texttt{NBANDS}/\texttt{NPAR} = 416/(13/2)=64$ ✅

### `OUTCAR` header

```
running 416 mpi-ranks, with 1 threads/rank
distrk: each k-point on 104 cores, 4 groups
distr: one band on NCORE= 1 cores, 104 groups
```

$\texttt{NCORE}$ is clearly forced to 1 even though it's set to 16 in the `INCAR`, which gives $\texttt{NPAR} = 416/4 = 104$. VASP then issued a warning that $\texttt{NBANDS}$ was changed from 416 to 520 ($= 5 \times 104$).

### Conclusion

The calculation ends up with no band-level parallelization and a large number of unnecessary bands. I haven't benchmarked this calculation yet, but it's clear it will be quite inefficient. Even though the calculation does not crash or give any other indication that it's running inefficiently, violations of condition 2 are fairly easy to spot from the `OUTCAR` header.

## Example 3: violating condition 3

### Input

* $\texttt{NKPTS} = 144$
* $\texttt{NBANDS} = 424$
* $\texttt{NCORE} = 8$
* $\texttt{KPAR}=5$
* $N_\text{MPI total} = 320$

### Conditions

* **Condition 1:** $N_\text{MPI/k-group} = N_\text{MPI total}/\texttt{KPAR} = 320/5=64$ ✅
* **Condition 2:** $\texttt{NPAR} = N_\text{MPI total}/(\texttt{KPAR} \times \texttt{NCORE}) = 320/(5\times8) = 8$ ✅
* **Condition 3:** $N_{\text{k-points/k-group}} = \texttt{NKPTS}/\texttt{KPAR} = 144/5$ ❌
* **Condition 4:** $N_{\text{bands/b-group}} = \texttt{NBANDS}/\texttt{NPAR} = 424/8=53$ ✅

### `OUTCAR` header

```
running 320 mpi-ranks, with 1 threads/rank
distrk: each k-point on 64 cores, 5 groups
distr: one band on NCORE= 8 cores, 8 groups
```

There is no warning, error, or any other indication that this condition is violated.

### Conclusion

In this case, we end up with 4 k-groups of 29 k-points each and 1 k-group of 28 k-points, so it's not awfully inefficient. However, this will not always be the case. Violate condition 3 with caution.
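An aside before the last example: a forced `NCORE= 1` (Example 2) only shows up in that one header line, which is easy to miss when you're running many jobs. Below is a short Python sketch that pulls the value VASP actually used out of the header; the regular expression is based on the header format quoted in these examples and may need adjusting for other VASP versions.

```python
import re
from typing import Optional

# Sketch: extract the NCORE value VASP actually used from the header lines
# shown in the examples above, and compare it with what was asked for.

HEADER = """\
running 416 mpi-ranks, with 1 threads/rank
distrk: each k-point on 104 cores, 4 groups
distr: one band on NCORE= 1 cores, 104 groups
"""

def ncore_used(text: str) -> Optional[int]:
    match = re.search(r"one band on NCORE=\s*(\d+)\s+cores", text)
    return int(match.group(1)) if match else None

requested = 16                      # what Example 2 had in INCAR
used = ncore_used(HEADER)
if used is not None and used != requested:
    print(f"NCORE was forced from {requested} to {used} -- check condition 2!")
```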
## Example 4: violating condition 4

### Input

* $\texttt{NKPTS} = 144$
* $\texttt{NBANDS} = 420$
* $\texttt{NCORE} = 8$
* $\texttt{KPAR}=4$
* $N_\text{MPI total} = 416$

### Conditions

* **Condition 1:** $N_\text{MPI/k-group} = N_\text{MPI total}/\texttt{KPAR} = 416/4=104$ ✅
* **Condition 2:** $\texttt{NPAR} = N_\text{MPI total}/(\texttt{KPAR} \times \texttt{NCORE}) = 416/(4\times8) = 13$ ✅
* **Condition 3:** $N_{\text{k-points/k-group}} = \texttt{NKPTS}/\texttt{KPAR} = 144/4=36$ ✅
* **Condition 4:** $N_{\text{bands/b-group}} = \texttt{NBANDS}/\texttt{NPAR} = 420/13$ ❌

### `OUTCAR` header

```
running 416 mpi-ranks, with 1 threads/rank
distrk: each k-point on 104 cores, 4 groups
distr: one band on NCORE= 8 cores, 13 groups
```

The number of bands is changed from 420 to 429.

### Conclusion

Overall, this wasn't a huge problem: 9 extra bands aren't much more expensive when you already need 420. Think of condition 4 as more of a guideline than anything. It's always worth setting $\texttt{NBANDS}$ in `INCAR` so VASP behaves the way you expect it to.

# Limitations

>[!WARNING] $\texttt{NCORE}$ and $\texttt{KPAR}$ sometimes cause issues
> I have had some issues setting $\texttt{NCORE}>1$ in some DFPT calculations, especially those with `LOPTICS=.TRUE.` or `LEPSILON=.TRUE.`, usually with errors like:
> ```
> VASP internal routines have requested a change of the k-point set. Unfortunately, this is only possible if NPAR=number of nodes. Please remove the tag NPAR from the INCAR file and restart the calculation.
> ```
> Setting $\texttt{NCORE}=1$ fixes this problem; I can only assume the linear-response routines in VASP do not support band-level parallelism. As a result, these calculations run quite slowly.
>
> Additionally, for some calculations, I've seen warnings along the lines of:
> ```
> WARNING: Sub-Space-Matrix is not hermitian in DAV
> ```
> If you've eliminated all other sources of this warning (bad LAPACK, bad input geometry, tried a different `ALGO`, etc.), then try setting $\texttt{NCORE}$ to smaller values.
>
> The most common error you'll get from using $\texttt{KPAR}$ is an out-of-memory error (see [this page](https://www.vasp.at/wiki/index.php/Not_enough_memory) on the VASP wiki). Sometimes, if there's something wrong with your VASP binary, $\texttt{KPAR}>1$ will lead to cryptic MPI-related crashes.

# Summary

> [!SUMMARY] Conditions for $\texttt{KPAR}$ and $\texttt{NCORE}$
> To get the most out of your VASP calculations, you should set $\texttt{KPAR}$ and $\texttt{NCORE}$ according to the following conditions:
> $
> \begin{align}
> \frac{\texttt{NKPTS}}{\texttt{KPAR}} &\in \mathbb{Z}^+, \\
> \frac{N_\text{MPI total}}{\texttt{KPAR}} &\in \mathbb{Z}^+, \\
> \frac{N_\text{MPI total}}{\texttt{KPAR} \times \texttt{NCORE}} &\in \mathbb{Z}^+, \\
> \frac{\texttt{NBANDS}\times\texttt{KPAR}\times\texttt{NCORE}}{N_\text{MPI total}} &\in \mathbb{Z}^+.
> \end{align}
> $
> On top of these, you generally also want $\texttt{NCORE}$ to be a factor of the number of physical cores per node, i.e., $N_\text{cores/node}/\texttt{NCORE} \in \mathbb{Z}^+$.
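If you'd rather not check these by hand, here's a closing Python sketch (my own helper; the function name and arguments are made up) that enumerates the $(\texttt{KPAR}, \texttt{NCORE})$ pairs satisfying all of the above for a given rank count, k-point count, and band count, additionally requiring $\texttt{NCORE}$ to divide the cores per node.

```python
# Sketch: enumerate (KPAR, NCORE) pairs satisfying the summary conditions.
# Assumes NBANDS is kept fixed; names are my own, not VASP input tags.

def valid_settings(n_mpi_total: int, nkpts: int, nbands: int, cores_per_node: int):
    pairs = []
    for kpar in range(1, nkpts + 1):
        if nkpts % kpar or n_mpi_total % kpar:
            continue                                   # conditions 3 and 1
        mpi_per_kgroup = n_mpi_total // kpar
        for ncore in range(1, cores_per_node + 1):
            if cores_per_node % ncore or mpi_per_kgroup % ncore:
                continue                               # NCORE | cores/node, condition 2
            npar = mpi_per_kgroup // ncore
            if nbands % npar == 0:                     # condition 4
                pairs.append((kpar, ncore))
    return pairs

# Example 0's setup: 384 ranks, 144 k-points, 420 bands, 32 cores/node.
print(valid_settings(384, 144, 420, 32))
```

For Example 0's setup, the output includes (among others) the $(\texttt{KPAR}, \texttt{NCORE}) = (6, 32)$ combination used there; which of the valid pairs is actually fastest still has to come from benchmarking.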