VASP:	Some	Accumulated	Wisdom
J.	M.	Skelton
WMD	Group	Meeting
21st September	2015
WMD	Group	Meeting,	September	2015	|	Slide	2
Convergence:	Parameters
• Four	key	technical	parameters	in	a	VASP	calculation:
o Basis	set:	ENCUT and	PREC (or,	alternatively,	NGX,	NGY,	NGZ)
o k-point	sampling:	KPOINTS file	and	SIGMA
o [For	certain	types	of	pseudopotential.]	Augmentation	grid:	ENAUG and	PREC (or,	
alternatively,	NGXF,	NGYF,	NGZF)
o Which	space	the	projection	operators	are	applied	in	(LREAL)
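• As a minimal sketch of how these four parameters map onto the input files (the numerical values below are purely illustrative, not recommendations):

INCAR:
PREC = Accurate     ! controls the default FFT grids (NGX/Y/Z and NGXF/YF/ZF)
ENCUT = 550         ! plane-wave cutoff (eV)
ENAUG = 650         ! augmentation-grid cutoff (eV), where relevant
SIGMA = 0.05        ! smearing width (eV), converged alongside the k-point mesh
LREAL = Auto        ! real-space projection for large cells; .FALSE. = reciprocal space

KPOINTS (Gamma-centred mesh):
Regular mesh
0
Gamma
6 6 6
0 0 0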
WMD	Group	Meeting,	September	2015	|	Slide	3
Convergence:	Augmentation	grid
• A	second,	finer	mesh	is	used	to	represent	the	charge	density	near	the	ion	cores:	
controlled	by	ENAUG (or	PREC +	EAUG in	the	POTCAR files),	which	determines	NG*F
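• A sketch of the different ways of controlling the fine grid (values illustrative):

PREC = Accurate                       ! NG*F set from PREC and the EAUG values in the POTCARs
ENAUG = 800                           ! or: set the augmentation cutoff explicitly (eV)
!NGXF = 180; NGYF = 180; NGZF = 180   ! or: fix the fine-grid dimensions directly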
WMD	Group	Meeting,	September	2015	|	Slide	4
Convergence:	ZnS revisited
• For calculations on ZnS with TPSS, ENAUG needs to be increased from the default (but
ENCUT = 550 eV is fine) - equivalent to increasing NG*F [but without also increasing
NG* as in the QHA-ExC paper, which is evidently unnecessary (!)]
WMD	Group	Meeting,	September	2015	|	Slide	5
Convergence:	ZnS revisited
• For	calculations	on	ZnS with	TPSS,	ENAUG needs	to	be	increased	from	the	default	(but	
ENCUT = 550 eV	is	fine)	- equivalent	to	increasing	NG*F,	but	without	also	increasing	
NG*,	which	is	wasteful
ENCUT / eV   ENAUG / eV   NG*   NG*F   Noise?   t / min
550          575.892      120   160    Yes      -
650          575.892      128   160    Yes      -
750          575.892      140   160    Yes      -
850          575.892      150   160    Yes      -
550          675.892      120   180    No       116
550          775.892      120   192    No       108
550          875.892      120   200    No       113
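• As an INCAR fragment, the cheapest noise-free combination in the table above corresponds to something like:

ENCUT = 550       ! sufficient for ZnS/TPSS
ENAUG = 675.892   ! raises NG*F from 160 to 180 while leaving NG* at 120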
WMD	Group	Meeting,	September	2015	|	Slide	6
The	VASP	SCF	cycle
• The	SCF	cycle	proceeds	in	two	phases:
o The	plane-wave	coefficients	are	initialised	randomly	and	“pre-optimised”	within	a	
fixed	potential	given	by	the	superposition	of	atomic	densities	(INIWAV,	NELMDL)
o The	wavefunctions and	density	are	then	optimised	self-consistently	to	convergence	
(EDIFF,	NELMIN,	NELM)
o If	an	initial	charge	density	exists	(e.g.	from	a	previous	SCF	or	converged	
CHGCAR/WAVECAR),	the	first	step	can	be	skipped	(ISTART,	ICHARG)
• To	accelerate	convergence,	the	output	density	from	a	step	N is	not	fed	directly	into	the	
next	step	N+1,	but	is	mixed with	the	input	density	(IMIX,	INIMIX,	MIXPRE,	MAXMIX,	
AMIX,	AMIN,	AMIX_MAG,	BMIX,	BMIX_MAG,	WC)
• For	the	mathematically-minded:	http://th.fhi-berlin.mpg.de/th/Meetings/DFT-workshop-
Berlin2011/presentations/2011-07-14_Marsman_Martijn.pdf
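• A rough sketch of the corresponding INCAR tags (values illustrative; the mixing tags can usually be left at their defaults):

ISTART = 0       ! start from scratch (1 = read an existing WAVECAR)
ICHARG = 2       ! superposition of atomic densities (1 = read CHGCAR)
INIWAV = 1       ! random initial plane-wave coefficients
NELMDL = -5      ! 5 initial steps in the fixed potential (negative: first ionic step only)
NELMIN = 4       ! minimum number of SCF steps
NELM = 60        ! maximum number of SCF steps
EDIFF = 1E-6     ! SCF convergence criterion (eV)
!AMIX = 0.4; BMIX = 1.0   ! density-mixing parameters, only if the defaults struggle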
WMD	Group	Meeting,	September	2015	|	Slide	7
The	VASP	SCF	cycle
N E dE d eps ncg rms rms(c)
DAV: 1 0.425437171796E+04 0.42544E+04 -0.38613E+05 920 0.178E+03
DAV: 2 -0.114846409831E+04 -0.54028E+04 -0.51653E+04 1130 0.323E+02
DAV: 3 -0.169662738043E+04 -0.54816E+03 -0.53994E+03 1130 0.100E+02
DAV: 4 -0.171494085624E+04 -0.18313E+02 -0.18206E+02 1160 0.198E+01
DAV: 5 -0.171553585547E+04 -0.59500E+00 -0.59387E+00 1220 0.331E+00 0.706E+01
RMM: 6 -0.159733114612E+04 0.11820E+03 -0.21124E+02 920 0.147E+01 0.352E+01
RMM: 7 -0.157358217358E+04 0.23749E+02 -0.82778E+01 920 0.937E+00 0.173E+01
RMM: 8 -0.157195752202E+04 0.16247E+01 -0.10028E+01 922 0.344E+00 0.736E+00
RMM: 9 -0.157170732229E+04 0.25020E+00 -0.24051E+00 920 0.173E+00 0.186E+00
RMM: 10 -0.157170709721E+04 0.22508E-03 -0.17654E-01 932 0.561E-01 0.965E-01
RMM: 11 -0.157173130475E+04 -0.24208E-01 -0.10240E-01 920 0.332E-01 0.466E-01
RMM: 12 -0.157174953342E+04 -0.18229E-01 -0.23004E-02 920 0.198E-01 0.213E-01
RMM: 13 -0.157175624413E+04 -0.67107E-02 -0.12470E-02 920 0.134E-01 0.938E-02
RMM: 14 -0.157175705572E+04 -0.81159E-03 -0.49641E-03 922 0.781E-02 0.577E-02
RMM: 15 -0.157175711576E+04 -0.60039E-04 -0.62130E-04 922 0.302E-02 0.211E-02
RMM: 16 -0.157175714692E+04 -0.31162E-04 -0.18825E-04 932 0.152E-02 0.146E-02
RMM: 17 -0.157175715237E+04 -0.54516E-05 -0.37827E-05 935 0.701E-03 0.564E-03
RMM: 18 -0.157175715526E+04 -0.28845E-05 -0.88070E-06 824 0.340E-03 0.361E-03
RMM: 19 -0.157175715551E+04 -0.24851E-06 -0.27408E-06 657 0.209E-03
1 F= -.15717572E+04 E0= -.15717572E+04 d E =-.291254-147
Key to the output above:
o DAV / RMM: the minimisation algorithm used for each step
o NELMDL steps are performed in a fixed potential; between NELMIN and NELM steps in total
o E: total free energy; dE / d eps: change in the total energy and in the eigenvalues
o ncg: number of evaluations of H|Ψ⟩
o rms(c): difference between the input and output density; oscillations probably indicate convergence problems
o Final line (F, E0): total free energy and the zero-broadening (σ → 0) energy
WMD	Group	Meeting,	September	2015	|	Slide	8
The	ALGO tag
• ALGO is	the	“recommended”	tag	for	selecting	the	electronic-minimisation	algorithm
• Most	of	the	algorithms	have	“subswitches”,	which	can	be	selected	using	IALGO
• I	tend	to	use	one	of	four	ALGOs:
• RMM-DIIS	(ALGO = VeryFast):	fastest	per	SCF	step,	best	parallelised,	and	
converges	quickly	close	to	a	minimum,	but	can	struggle	with	difficult	systems
• Blocked	Davidson	(ALGO = Normal):	slower than	RMM-DIIS,	but	usually	stable,	
although	can	still	struggle	with	difficult	problems (e.g.	magnetism,	meta-GGAs	and	
hybrids)
• Davidson/RMM-DIIS	(ALGO = Fast):	Uses	ALGO = Normal for	the	“pre-
optimisation”,	then	switches	to	ALGO = VeryFast;	a	good	default	choice
• All-band	conjugate	gradient	(ALGO = All):	Slow,	but	very	stable;	use	as	a	fallback
when	ALGO = Normal struggles,	and	for	hybrids
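• In INCAR terms, a common starting point and the fallbacks might look like this (a sketch, not a prescription):

ALGO = Fast      ! Davidson for the initial steps, then RMM-DIIS
!ALGO = Normal   ! fall back to blocked Davidson if RMM-DIIS misbehaves
!ALGO = All      ! all-band CG as a last resort, and for hybrids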
WMD	Group	Meeting,	September	2015	|	Slide	9
Taming	TPSS	(and	other	meta-GGAs)
• In my experience, meta-GGAs can sometimes be more difficult to converge than standard GGA functionals (or even hybrids)

!ALGO = Normal | All               ! RMM-DIIS (ALGO = Fast | VeryFast) sometimes struggles
!GGA = PS                          ! don't forget - (rev)TPSS are based on PBE
METAGGA = TPSS | revTPSS | M06L
LASPH = .TRUE.                     ! aspherical gradient corrections inside PAW spheres
LMIXTAU = .TRUE.                   ! pass kinetic-energy density to the charge-density mixer
!ENAUG = MAX(EAUG) * 1.5           ! may need to increase ENAUG/NG*F if very accurate
!NGXF = <>; NGYF = <>; NGZF = <>   ! forces are needed (e.g. phonons)
WMD	Group	Meeting,	September	2015	|	Slide	10
Parallelisation
• The	newest	versions	of	VASP	implement	four	levels	of	parallelism:
o k-point	parallelism:	KPAR
o Band	parallelism	and	data	distribution:	NCORE and	NPAR
o Parallelisation	and	data	distribution	over	plane-wave	coefficients	(=	FFTs;	done	over	
planes	along	NGZ):	LPLANE
o Parallelisation	of	some	linear-algebra	operations	using	ScaLAPACK (notionally	set	at	
compile	time,	but	can	be	controlled	using	LSCALAPACK)
• Effective	parallelisation	will…:
o …	minimise	(relatively	slow)	communication	between	MPI	processes,	…
o …	distribute	data	to	reduce	memory	requirements…
o …	and	make	sure	the	MPI	processes	have	enough	work	to	keep	them	busy
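• As a sketch, the four levels map onto INCAR tags along these lines (values illustrative; ScaLAPACK availability depends on how the binary was built):

KPAR = 2               ! number of k-point groups
NCORE = 12             ! cores per band group (alternatively, set NPAR)
LPLANE = .TRUE.        ! plane-wise FFT distribution along NGZ (the default)
!LSCALAPACK = .TRUE.   ! toggle ScaLAPACK at run time, if compiled in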
WMD	Group	Meeting,	September	2015	|	Slide	11
Parallelisation:	Workload	distribution
[Diagram: Cores divided into KPAR k-point groups → NPAR band groups → NGZ FFT groups (?)]
• Workload	distribution	over	KPAR k-point	groups,	NBANDS band	groups	and	NGZ plane-
wave	coefficient	(FFT)	groups	[not	100	%	sure	how	this	works…]
WMD	Group	Meeting,	September	2015	|	Slide	12
Parallelisation:	Data	distribution
[Diagram: Data distributed over KPAR k-point groups → NPAR band groups → NGZ FFT groups (?)]
• Data	distribution	over	NBANDS band	groups	and	NGZ plane-wave	coefficient	(FFT)	
groups	[also	not	100	%	sure	how	this	works…]
WMD	Group	Meeting,	September	2015	|	Slide	13
Parallelisation:	KPAR
• During	a	standard	DFT	calculation,	k-points	are	independent	->	k-point	parallelism	should
be	linearly	scaling,	although	perhaps	not	in	practice:	
https://www.nsc.liu.se/~pla/blog/2015/01/12/vasp-how-many-cores/
• <#cores> must	be	divisible	by	KPAR,	but	the	parallelisation	is	via a	“round-robin”	
algorithm,	so	<#k-points> does	not	need	to	be	divisible	by	KPAR ->	check	how	many	
irreducible k-points	you	have	(head IBZKPT)	and	set	KPAR	accordingly
[Diagram: round-robin distribution of 3 irreducible k-points over KPAR k-point groups; R1, R2, R3 label the rounds]
o KPAR = 1: k1, k2, k3 handled in rounds R1-R3; t = 3 [OK]
o KPAR = 2: R1 handles k1 and k2, R2 handles k3 with one group idle; t = 2 [Bad]
o KPAR = 3: R1 handles k1, k2 and k3 together; t = 1 [Good]
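• A hedged worked example (hypothetical core and k-point counts):

! hypothetical: 48 cores and 12 irreducible k-points (check with head IBZKPT)
KPAR = 4     ! 48 is divisible by 4; 12 k-points -> 3 round-robin rounds, no idle groups
!KPAR = 5    ! not allowed here: 48 is not divisible by 5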
NCORE :			number	of	cores	in	band	groups
NPAR :			number	of	bands	treated	simultaneously
WMD	Group	Meeting,	September	2015	|	Slide	14
Parallelisation:	NCORE and	NPAR
NCORE = <#cores> / NPAR
• Why not the default (NCORE = 1, i.e. NPAR = <#cores>)? More band groups (probably)
increase memory pressure and incur a substantial communication overhead
[Figure: benchmark speed-ups of 7.08x, 6.41x and 6.32x]
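• A concrete (hypothetical) illustration of the relation above, for 48 cores with KPAR = 1:

NPAR = 4      ! 4 band groups -> NCORE = 48/4 = 12 cores per band group
!NCORE = 12   ! the equivalent way of requesting the same decomposition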
WMD	Group	Meeting,	September	2015	|	Slide	15
Parallelisation:	NCORE and	NPAR
• WARNING:	VASP	will	increase	the	default	NBANDS to	the	nearest	multiple	of	the	number	
of	groups
• Since	the	electronic	minimisation	scales	as	a	power	of	NBANDS, this	can	backfire	in	
calculations	with	a	large	NPAR (e.g.	those	requiring	NPAR = <#cores>)
Cores   NBANDS (default)   NBANDS (adjusted)
96      455                480
128     455                512
192     455                576
256     455                512
384     455                768
512     455                512
NBANDS = NELECT/2 + NIONS/2 (non-spin-polarised default)
NBANDS = (3/5) NELECT + NMAG (spin-polarised default)

Example system:
• 238 atoms w/ 272 electrons
• Default NBANDS = 455
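• A hypothetical illustration of the rounding (numbers chosen for clarity, not taken from the example system above):

! NELECT = 400, NIONS = 100 -> default NBANDS = 400/2 + 100/2 = 250
! with 96 band groups (NPAR = 96), VASP rounds this up to 288 (= 3 x 96)
NBANDS = 288   ! set explicitly (to a multiple of the group count) to keep the adjustment in view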
WMD	Group	Meeting,	September	2015	|	Slide	16
Parallelisation:	Memory
• KPAR:	current	implementation	does	not	distribute	data	over	k-point	groups	->	KPAR =
N will	use	N x	more	memory	than	KPAR = 1
• NPAR/NCORE:	data	is	distributed	over	band	groups	->	decreasing	NPAR/increasing	
NCORE by	a	factor	of	N will	reduce	memory	requirements	by	N x
• NPAR takes	precedence	over	NCORE - if	you	use	“master”	INCAR files,	make	sure	you	
don’t	define	both
• The defaults for NPAR/NCORE (NPAR = <#cores>, NCORE = 1) are usually a poor
choice for both memory and performance
• Band	parallelism	for	hybrid	functionals has	been	supported	since	VASP	5.3.5;	for	
memory-intensive	calculations,	it	is	a	good	alternative	to	underpopulating nodes
• LPLANE:	distributes	data	over	plane-wave	coefficients,	and	speeds	things	up	by	
reducing	communication	during	FFTs	- the	default	is	LPLANE = .TRUE.,	and	should	
only	need	to	be	changed	for	massively-parallel	architectures	(e.g.	BG/Q)
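• For example, in a "master" INCAR it is safer to keep only one of the two tags (a sketch):

NCORE = 24    ! cores per band group
!NPAR = 4     ! leave commented out - if both are set, NPAR takes precedence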
WMD	Group	Meeting,	September	2015	|	Slide	17
Parallelisation:	ScaLAPACK
• RMM-DIIS	(ALGO = VeryFast | Fast)	involves	three	steps:
EDDIAG :			subspace	diagonalisation
RMM-DIIS :			electronic	minimisation
ORTHCH :			wavefunction orthogonalisation
Routine 312	atoms 624 atoms 1,248	atoms 1,872 atoms
EDDIAG 2.90	(18.64	%) 12.97	(22.24	%) 75.26	(26.38	%) 208.29	(31.31	%)
RMM-DIIS 12.39	(79.63	%) 42.73	(73.27	%) 187.62	(65.78	%) 379.80	(57.10	%)
ORTHCH 0.27	(1.74 %) 2.62	(4.49	%) 22.36	(7.84	%) 77.11	(11.59	%)
• EDDIAG and	ORTHCH formally	scale	as	N3,	and	rapidly	begin	to	dominate	the	SCF	cycle	
time	for	large	calculations
• A good ScaLAPACK library can improve the performance of these routines in massively-
parallel	calculations
See	also:	https://www.nsc.liu.se/~pla/blog/2014/01/30/vasp9k/
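• As noted earlier, ScaLAPACK use is notionally fixed at compile time but can be toggled at run time if the binary was built with it; a minimal sketch:

LSCALAPACK = .TRUE.    ! use ScaLAPACK for the cubic-scaling routines in large parallel runs
!LSCALAPACK = .FALSE.  ! switch off if the available library performs poorly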
WMD	Group	Meeting,	September	2015	|	Slide	18
Parallelisation:	My	“rules	of	thumb”
• For	x86_64	IB	systems	(Archer,	Balena,	Neon…):
o Use	KPAR in	preference	to	NPAR
o Set	NPAR = (<#nodes>/KPAR) or	NCORE = <#cores/node>
o 1	node/band	group	per	50	atoms;	may	want	to	use	2	nodes/50	atoms	for	hybrids,	
or	decrease	to	½	node	per	band	group	for	<	10	atoms
o ALGO = Fast is usually a good choice, except for badly-behaved systems
o Leave	LPLANE at	the	default	(.TRUE.)
• For the IBM BG/Q (STFC Hartree):
o The	Hartree machine	currently	uses	VASP	5.2.x	->	no	KPAR
o Try	to	choose	a	square	number	of	cores,	and	set	NPAR = sqrt(<#cores>)
o Consider	setting	LPLANE = .FALSE. if	<#cores> ≥	NGZ
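• A hedged worked example of the x86_64 rules (hypothetical job: ~100-atom cell, 2 irreducible k-points, 4 nodes x 24 cores):

KPAR = 2          ! one k-point group per irreducible k-point
NCORE = 24        ! = <#cores/node> (or, per the rule above, NPAR = <#nodes>/KPAR = 2)
ALGO = Fast       ! usually a good choice (see above)
LPLANE = .TRUE.   ! the default, left unchanged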
