Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Very-Large-Scale GPU-Accelerated Nuclear Gradient of Time-Dependent Density Functional Theory with Tamm–Dancoff Approximation and Range-Separated Hybrid Functionals

Inkoo Kim Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea. Department of Chemistry, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, U.S.A.    Daun Jeong Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea.    Leah Weisburn Department of Chemistry, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, U.S.A.    Alexandra Alexiu Department of Chemistry, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, U.S.A.    Troy Van Voorhis tvan@mit.edu Department of Chemistry, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, U.S.A.    Young Min Rhee ymrhee@kaist.ac.kr Department of Chemistry, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea.    Won-Joon Son wonjoon.son@samsung.com Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea.    Hyung-Jin Kim hj.windy.kim@samsung.com Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea.    Jinkyu Yim Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea.    Sungmin Kim Samsung Advanced Institute of Technology, Samsung Electronics, Suwon 16678, Republic of Korea.    Yeonchoo Cho Samsung Advanced Institute of Technology, Samsung Electronics, Suwon 16678, Republic of Korea.    Inkook Jang Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea.    Seungmin Lee Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea.    Dae Sin Kim Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea.
Abstract

Modern graphics processing units (GPUs) provide an unprecedented level of computing power. In this study, we present a high-performance, multi-GPU implementation of the analytical nuclear gradient for Kohn–Sham time-dependent density functional theory (TDDFT), employing the Tamm–Dancoff approximation (TDA) and Gaussian-type atomic orbitals as basis functions. We discuss GPU-efficient algorithms for the derivatives of electron repulsion integrals and exchange–correlation functionals within the range-separated scheme. As an illustrative example, we calculated the TDA-TDDFT gradient of the S1 state of a full-scale green fluorescent protein with explicit water solvent molecules, totaling 4353 atoms, at the ω𝜔\omegaitalic_ωB97X/def2-SVP level of theory. Our algorithm demonstrates favorable parallel efficiencies on a high-speed distributed system equipped with 256 Nvidia A100 GPUs, achieving >>>70% with up to 64 GPUs and 31% with 256 GPUs, effectively leveraging the capabilities of modern high-performance computing systems.

1 Introduction

High-performance computing is undergoing a rapid evolution supported by high-speed distributed systems that incorporate heterogenous accelerators such as graphics processing units (GPUs) for high-throughput data-parallel computation.Bourzac (2017); McInnes et al. (2021); Heldens et al. (2021) This heterogeneous architecture poses a challenge in efficiently adapting existing algorithms to fully exploit the computational potential of these accelerators. Within general-purpose GPU programming models,Herten (2023) fine-grained computational workloads represented by threads are organized into thread-blocks and are offloaded to GPU, which consists of a large number of lightweight processors for massively parallel execution. Therefore, a key consideration in achieving high performance with GPU acceleration is maintaining a high degree of concurrency by simultaneously handling a large number of similar tasks to utilize the single instruction multiple-thread (SIMT)-based GPU architecture effectively.

To harness the high-throughput potential of GPUs in quantum chemistry, significant efforts have been made to redesign conventional algorithms and parallelization tactics. Kohn–Sham density functional theory (DFT),Kohn and Sham (1965) arguably the most widely adopted quantum chemistry method for its balance of accuracy and cost-efficiency,Burke (2012); Becke (2014); Bursch et al. (2022) has been a prime target for GPU acceleration.Gordon et al. (2020); Felice et al. (2023); Vallejo et al. (2023) Focusing on DFT in the linear combination of atomic orbitals (LCAO) framework, using atom-centered Gaussian basis sets is a natural choice for molecular systems.Nagy and Jensen (2017) In this approach, it is well known that owing to the evaluation of four-index electron repulsion integrals (ERIs), the integral-direct Fock build is the most computationally intensive section.Almlöf et al. (1982)

Early pioneering works by YasudaYasuda (2008) and by Ufimtsev and MartínezUfimtsev and Martínez (2008) focused on accelerating the computation of ERIs for the Hartree–Fock (HF) method on GPUs. These were followed by numerous reports on improvements in both performance and scalability to date.Ufimtsev and Martínez (2009); Asadchev et al. (2010); Titov et al. (2013); Miao and Merz (2013, 2015); Kussmann and Ochsenfeld (2017); Rák and Cserey (2015); Tornai et al. (2019); Barca et al. (2020); Qi et al. (2023); Vallejo et al. (2023); Asadchev and Valeev (2023) Recent advancements included efficient ground-state DFT calculations on multi-GPUs, utilizing the GPU implementations for the numerical integration of exchange–correlation (XC) functionals,Seritan et al. (2020); Manathunga et al. (2021); Barca et al. (2021); Manathunga et al. (2020); Williams-Young et al. (2020) which allowed massively parallel simulations of very large molecules.Kowalski et al. (2021); Johnson et al. (2022) Notably, scaling-reducing schemes such as the resolution-of-identityKussmann et al. (2021); Asadchev and Valeev (2023); Williams-Young et al. (2023) and fragmentation-based methodsVallejo et al. (2023) have also been explored on GPUs. The electronic structures of excited states play crucial roles from both chemical and materials perspectives and can be modeled using the time-dependent variant of DFT (TDDFT).Runge and Gross (1984); Dreuw and Head-Gordon (2005) Despite the widespread adoption of TDDFT across various fields,Herbert (2023) few studies have focused on its implementation on GPUs.Isborn et al. (2011); Kim et al. (2023)

Inspired by the GPU acceleration of TDDFT by Isborn et al.,Isborn et al. (2011) some of present authors have provided in ref 42 a modernized implementation utilizing the state-of-the-art Nvidia A100 GPUs over high-speed Infiniband HDR networks for very large-scale calculations. In this process, the global hybrid functionals showed limitations in excited-state calculations of full-scale biological proteins, as the occurrence of spurious low-lying charge-transfer states compromised their practical application. As a simple, pragmatic remedy, we employed range-separated hybrid (RSH) functionalsLeininger et al. (1997); Vydrov and Scuseria (2006) to address certain fundamental issues with global hybrid functionals. Further, even with well-established parallel CPU implementations of the TDDFT nuclear gradient in many quantum chemistry programs,Epifanovsky et al. (2021); Franzke et al. (2023); Zahariev et al. (2023); Mejia-Rodriguez et al. (2023); Neese (2022); Cao et al. (2021); Rinkevicius et al. (2020); Frisch et al. (2016) to our knowledge, the GPU counterpart in conjunction with RSH functionals, which is critical for geometry optimizations of large systems to find stationary points on the excited-state potential energy surface and for elucidating various response properties, has not been reported thus far.

In this work, as a necessary advance towards reliable quantum chemistry calculations of excited states on an unprecedented scale, we present a massively parallel GPU implementation of the nuclear gradient of the TDDFT method with RSH functionals, employing the Gaussian-type atomic orbitals for exploring the excited states of very large molecular systems. We discuss efficient multi-GPU algorithms for computing the derivatives of ERIs and XC functionals, and demonstrate strong scalability in the nuclear gradient calculations of the singlet excited state of a 4353-atom protein system. This implementation achieves reasonably high parallel efficiencies up to 256 Nvidia A100 GPUs.

2 Theory and Algorithm

2.1 TDA-TDDFT Nuclear Gradient with RSH Functionals

Within the linear-response formalism, the total excited-state energy, E𝐸Eitalic_E, of TDDFT is calculated as the sum of the ground-state energy, E0subscript𝐸0E_{0}italic_E start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, and the excitation energy, EΔsubscript𝐸ΔE_{\Delta}italic_E start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT. The Tamm–Dancoff approximation (TDA)Hirata and Head-Gordon (1999) simplifies the calculation of EΔsubscript𝐸ΔE_{\Delta}italic_E start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT by neglecting the deexcitation term in the Casida equation.Casida (1995) This approach allows EΔsubscript𝐸ΔE_{\Delta}italic_E start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT to be determined from the following eigenvalue equation:

𝐇𝐗=EΔ𝐗𝐇𝐗subscript𝐸Δ𝐗\displaystyle\mathbf{H}\mathbf{X}=E_{\Delta}\mathbf{X}bold_HX = italic_E start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT bold_X (1)

where 𝐇𝐇\mathbf{H}bold_H is the Hamiltonian represented within the space of (occupied–virtual) singly-excited Slater determinants and 𝐗𝐗\mathbf{X}bold_X is the excitation amplitude arranged in an No×Nvsubscript𝑁osubscript𝑁vN_{\mathrm{o}}\times N_{\mathrm{v}}italic_N start_POSTSUBSCRIPT roman_o end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT roman_v end_POSTSUBSCRIPT matrix, where Nosubscript𝑁oN_{\mathrm{o}}italic_N start_POSTSUBSCRIPT roman_o end_POSTSUBSCRIPT and Nvsubscript𝑁vN_{\mathrm{v}}italic_N start_POSTSUBSCRIPT roman_v end_POSTSUBSCRIPT refer to the numbers of occupied and virtual orbitals, respectively. For simplicity, we limit our discussion to closed-shell orbitals with restricted spins, using i𝑖iitalic_i, j𝑗jitalic_j for the occupied orbitals, and a𝑎aitalic_a, b𝑏bitalic_b for the virtual orbitals. Throughout this work, the XC functionals considered are within the generalized gradient approximation (GGA). These functionals depend on the density and its gradient, and are described by f(ρα,ρβ,γαα,γαβ,γββ)𝑓subscript𝜌𝛼subscript𝜌𝛽subscript𝛾𝛼𝛼subscript𝛾𝛼𝛽subscript𝛾𝛽𝛽f(\rho_{\alpha},\rho_{\beta},\gamma_{\alpha\alpha},\gamma_{\alpha\beta},\gamma% _{\beta\beta})italic_f ( italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT , italic_ρ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT , italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT , italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT , italic_γ start_POSTSUBSCRIPT italic_β italic_β end_POSTSUBSCRIPT ) where ραsubscript𝜌𝛼\rho_{\alpha}italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT and ρβsubscript𝜌𝛽\rho_{\beta}italic_ρ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT are the spin densities, and γααραραsubscript𝛾𝛼𝛼subscript𝜌𝛼subscript𝜌𝛼\gamma_{\alpha\alpha}\equiv\nabla\rho_{\alpha}{\cdot}\nabla\rho_{\alpha}italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT ≡ ∇ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ⋅ ∇ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT, γαβραρβsubscript𝛾𝛼𝛽subscript𝜌𝛼subscript𝜌𝛽\gamma_{\alpha\beta}\equiv\nabla\rho_{\alpha}{\cdot}\nabla\rho_{\beta}italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT ≡ ∇ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ⋅ ∇ italic_ρ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT and γββρβρβsubscript𝛾𝛽𝛽subscript𝜌𝛽subscript𝜌𝛽\gamma_{\beta\beta}\equiv\nabla\rho_{\beta}{\cdot}\nabla\rho_{\beta}italic_γ start_POSTSUBSCRIPT italic_β italic_β end_POSTSUBSCRIPT ≡ ∇ italic_ρ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ⋅ ∇ italic_ρ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT represent the gradient invariants.Perdew et al. (1996) Additionally, we focus on the spin-adapted formulation for singlet excited states, noting that the triplet counterpart can be easily derived as described in ref 56.

By expanding all the terms in eq 1, the nuclear derivative of E𝐸Eitalic_E with respect to an atomic center 𝐀{Ax,Ay,Az}𝐀subscript𝐴𝑥subscript𝐴𝑦subscript𝐴𝑧\mathbf{A}\equiv\{A_{x},A_{y},A_{z}\}bold_A ≡ { italic_A start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT } of the A𝐴Aitalic_A-th atom can be expressed as

𝐀Esubscript𝐀𝐸\displaystyle\nabla_{\mathbf{A}}E∇ start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT italic_E =𝐀E0+iajbXiaXjb𝐀[εaδabεiδij\displaystyle=\nabla_{\mathbf{A}}E_{0}+\sum_{ia}\sum_{jb}X_{ia}X_{jb}\nabla_{% \mathbf{A}}[\varepsilon_{a}\delta_{ab}-\varepsilon_{i}\delta_{ij}= ∇ start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT italic_E start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i italic_a end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_j italic_b end_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_i italic_a end_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_j italic_b end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT [ italic_ε start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT italic_δ start_POSTSUBSCRIPT italic_a italic_b end_POSTSUBSCRIPT - italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_δ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT (2)
+2(ia|jb)α(ij|ab)β(ij|ab)lr+(ia|f^XC|jb)]\displaystyle+2(ia|jb)-\alpha(ij|ab)-\beta(ij|ab)_{\mathrm{lr}}+(ia|\hat{f}_{% \mathrm{XC}}|jb)]+ 2 ( italic_i italic_a | italic_j italic_b ) - italic_α ( italic_i italic_j | italic_a italic_b ) - italic_β ( italic_i italic_j | italic_a italic_b ) start_POSTSUBSCRIPT roman_lr end_POSTSUBSCRIPT + ( italic_i italic_a | over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT roman_XC end_POSTSUBSCRIPT | italic_j italic_b ) ]

where εisubscript𝜀𝑖\varepsilon_{i}italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and εasubscript𝜀𝑎\varepsilon_{a}italic_ε start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT represent the orbital energies of the occupied and virtual molecular orbitals, ψi(𝐫)subscript𝜓𝑖𝐫\psi_{i}(\mathbf{r})italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_r ) and ψa(𝐫)subscript𝜓𝑎𝐫\psi_{a}(\mathbf{r})italic_ψ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( bold_r ), respectively, and (ia|jb)conditional𝑖𝑎𝑗𝑏(ia|jb)( italic_i italic_a | italic_j italic_b ), (ij|ab)conditional𝑖𝑗𝑎𝑏(ij|ab)( italic_i italic_j | italic_a italic_b ) and (ij|ab)lrsubscriptconditional𝑖𝑗𝑎𝑏lr(ij|ab)_{\mathrm{lr}}( italic_i italic_j | italic_a italic_b ) start_POSTSUBSCRIPT roman_lr end_POSTSUBSCRIPT denote the Coulomb, HF exchange and long-range HF exchange integrals, respectively, defined by

(ia|jb)conditional𝑖𝑎𝑗𝑏\displaystyle(ia|jb)( italic_i italic_a | italic_j italic_b ) =ψi(𝐫)ψa(𝐫)1rψj(𝐫)ψb(𝐫)d𝐫d𝐫absentdouble-integralsubscript𝜓𝑖𝐫subscript𝜓𝑎𝐫1𝑟subscript𝜓𝑗superscript𝐫subscript𝜓𝑏superscript𝐫differential-d𝐫differential-dsuperscript𝐫\displaystyle=\iint\!\psi_{i}(\mathbf{r})\psi_{a}(\mathbf{r})\frac{1}{r}\psi_{% j}(\mathbf{r}^{\prime})\psi_{b}(\mathbf{r}^{\prime})\,\mathrm{d}\mathbf{r}% \mathrm{d}\mathbf{r}^{\prime}= ∬ italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_r ) italic_ψ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( bold_r ) divide start_ARG 1 end_ARG start_ARG italic_r end_ARG italic_ψ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) italic_ψ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) roman_d bold_r roman_d bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT (3)
(ij|ab)conditional𝑖𝑗𝑎𝑏\displaystyle(ij|ab)( italic_i italic_j | italic_a italic_b ) =ψi(𝐫)ψj(𝐫)1rψa(𝐫)ψb(𝐫)d𝐫d𝐫absentdouble-integralsubscript𝜓𝑖𝐫subscript𝜓𝑗𝐫1𝑟subscript𝜓𝑎superscript𝐫subscript𝜓𝑏superscript𝐫differential-d𝐫differential-dsuperscript𝐫\displaystyle=\iint\!\psi_{i}(\mathbf{r})\psi_{j}(\mathbf{r})\frac{1}{r}\psi_{% a}(\mathbf{r}^{\prime})\psi_{b}(\mathbf{r}^{\prime})\,\mathrm{d}\mathbf{r}% \mathrm{d}\mathbf{r}^{\prime}= ∬ italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_r ) italic_ψ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_r ) divide start_ARG 1 end_ARG start_ARG italic_r end_ARG italic_ψ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) italic_ψ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) roman_d bold_r roman_d bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT (4)
(ij|ab)lrsubscriptconditional𝑖𝑗𝑎𝑏lr\displaystyle(ij|ab)_{\mathrm{lr}}( italic_i italic_j | italic_a italic_b ) start_POSTSUBSCRIPT roman_lr end_POSTSUBSCRIPT =ψi(𝐫)ψj(𝐫)erf(ωr)rψa(𝐫)ψb(𝐫)d𝐫d𝐫absentdouble-integralsubscript𝜓𝑖𝐫subscript𝜓𝑗𝐫erf𝜔𝑟𝑟subscript𝜓𝑎superscript𝐫subscript𝜓𝑏superscript𝐫differential-d𝐫differential-dsuperscript𝐫\displaystyle=\iint\!\psi_{i}(\mathbf{r})\psi_{j}(\mathbf{r})\frac{% \operatorname{erf}(\omega r)}{r}\psi_{a}(\mathbf{r}^{\prime})\psi_{b}(\mathbf{% r}^{\prime})\,\mathrm{d}\mathbf{r}\mathrm{d}\mathbf{r}^{\prime}= ∬ italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_r ) italic_ψ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_r ) divide start_ARG roman_erf ( italic_ω italic_r ) end_ARG start_ARG italic_r end_ARG italic_ψ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) italic_ψ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) roman_d bold_r roman_d bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT (5)

with r|𝐫𝐫|𝑟𝐫superscript𝐫r\equiv|\mathbf{r}-\mathbf{r}^{\prime}|italic_r ≡ | bold_r - bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT |. Furthermore, (ia|f^XC|jb)𝑖𝑎subscript^𝑓XC𝑗𝑏(ia|\hat{f}_{\mathrm{XC}}|jb)( italic_i italic_a | over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT roman_XC end_POSTSUBSCRIPT | italic_j italic_b ) represents the singlet XC kernel defined byBauernschmitt and Ahlrichs (1996)

(ia|f^XC|jb)=𝑖𝑎subscript^𝑓XC𝑗𝑏absent\displaystyle(ia|\hat{f}_{\mathrm{XC}}|jb)=( italic_i italic_a | over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT roman_XC end_POSTSUBSCRIPT | italic_j italic_b ) =
[(2fρα2+2fραρβ)ψi(𝐫)ψa(𝐫)ψj(𝐫)ψb(𝐫)\displaystyle\iint\Bigg{[}\left(\frac{\partial^{2}f}{\partial\rho_{\alpha}^{2}% }+\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\rho_{\beta}}\right)\psi_{% i}(\mathbf{r})\psi_{a}(\mathbf{r})\psi_{j}(\mathbf{r}^{\prime})\psi_{b}(% \mathbf{r}^{\prime})∬ [ ( divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_ρ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT end_ARG ) italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_r ) italic_ψ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( bold_r ) italic_ψ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) italic_ψ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT )
+(2fγαα+fγαβ)(ψi(𝐫)ψa(𝐫))(ψj(𝐫)ψb(𝐫))2𝑓subscript𝛾𝛼𝛼𝑓subscript𝛾𝛼𝛽subscript𝜓𝑖𝐫subscript𝜓𝑎𝐫subscript𝜓𝑗superscript𝐫subscript𝜓𝑏superscript𝐫\displaystyle+\left(2\frac{\partial f}{\partial\gamma_{\alpha\alpha}}+\frac{% \partial f}{\partial\gamma_{\alpha\beta}}\right)\nabla\big{(}\psi_{i}(\mathbf{% r})\psi_{a}(\mathbf{r})\big{)}\cdot\nabla\big{(}\psi_{j}(\mathbf{r}^{\prime})% \psi_{b}(\mathbf{r}^{\prime})\big{)}+ ( 2 divide start_ARG ∂ italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT end_ARG + divide start_ARG ∂ italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT end_ARG ) ∇ ( italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_r ) italic_ψ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( bold_r ) ) ⋅ ∇ ( italic_ψ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) italic_ψ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) )
+(2fραγαα+2fραγββ+2fραγαβ)(ψi(𝐫)ψa(𝐫)\displaystyle+\left(\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\gamma_{% \alpha\alpha}}+\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\gamma_{\beta% \beta}}+\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\gamma_{\alpha\beta}% }\right)\Big{(}\psi_{i}(\mathbf{r})\psi_{a}(\mathbf{r})+ ( divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT end_ARG + divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_β italic_β end_POSTSUBSCRIPT end_ARG + divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT end_ARG ) ( italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_r ) italic_ψ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( bold_r )
×ρ(ψj(𝐫)ψb(𝐫))+ρ(ψi(𝐫)ψa(𝐫))ψj(𝐫)ψb(𝐫))\displaystyle\times\nabla\rho\cdot\nabla\big{(}\psi_{j}(\mathbf{r}^{\prime})% \psi_{b}(\mathbf{r}^{\prime})\big{)}+\nabla\rho\cdot\nabla\big{(}\psi_{i}(% \mathbf{r})\psi_{a}(\mathbf{r})\big{)}\psi_{j}(\mathbf{r}^{\prime})\psi_{b}(% \mathbf{r}^{\prime})\Big{)}× ∇ italic_ρ ⋅ ∇ ( italic_ψ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) italic_ψ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) + ∇ italic_ρ ⋅ ∇ ( italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_r ) italic_ψ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( bold_r ) ) italic_ψ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) italic_ψ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) )
+(2fγαα2+122fγαβ2+22fγααγαβ+2fγααγββ)superscript2𝑓superscriptsubscript𝛾𝛼𝛼212superscript2𝑓superscriptsubscript𝛾𝛼𝛽22superscript2𝑓subscript𝛾𝛼𝛼subscript𝛾𝛼𝛽superscript2𝑓subscript𝛾𝛼𝛼subscript𝛾𝛽𝛽\displaystyle+\left(\frac{\partial^{2}f}{\partial\gamma_{\alpha\alpha}^{2}}+% \frac{1}{2}\frac{\partial^{2}f}{\partial\gamma_{\alpha\beta}^{2}}+2\frac{% \partial^{2}f}{\partial\gamma_{\alpha\alpha}\gamma_{\alpha\beta}}+\frac{% \partial^{2}f}{\partial\gamma_{\alpha\alpha}\gamma_{\beta\beta}}\right)+ ( divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + divide start_ARG 1 end_ARG start_ARG 2 end_ARG divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + 2 divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT end_ARG + divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT italic_γ start_POSTSUBSCRIPT italic_β italic_β end_POSTSUBSCRIPT end_ARG )
×ρ(ψi(𝐫)ψa(𝐫))ρ(ψj(𝐫)ψb(𝐫))]d𝐫d𝐫\displaystyle\times\nabla\rho\nabla\big{(}\psi_{i}(\mathbf{r})\psi_{a}(\mathbf% {r})\big{)}\cdot\nabla\rho\nabla\big{(}\psi_{j}(\mathbf{r}^{\prime})\psi_{b}(% \mathbf{r}^{\prime})\big{)}\Bigg{]}\,\mathrm{d}\mathbf{r}\mathrm{d}\mathbf{r}^% {\prime}× ∇ italic_ρ ∇ ( italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_r ) italic_ψ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( bold_r ) ) ⋅ ∇ italic_ρ ∇ ( italic_ψ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) italic_ψ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) ] roman_d bold_r roman_d bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT (6)

with the total density ρρα+ρβ𝜌subscript𝜌𝛼subscript𝜌𝛽\rho\equiv\rho_{\alpha}+\rho_{\beta}italic_ρ ≡ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT + italic_ρ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT. The range-separated scheme blends the short-range exchange DFT functional and the long-range HF exchange using the error function (erferf\operatorname{erf}roman_erf), with the parameter ω𝜔\omegaitalic_ω setting the cross-fading point. In the limits of short-range (r0𝑟0r\rightarrow 0italic_r → 0) and long-range (r𝑟r\rightarrow\inftyitalic_r → ∞), the portions of HF exchange are determined by α𝛼\alphaitalic_α and α+β𝛼𝛽\alpha+\betaitalic_α + italic_β, respectively. For global hybrid functionals, the HF exchange is kept at constant α𝛼\alphaitalic_α by setting β=0𝛽0\beta=0italic_β = 0.

In the following discussion, to derive the atomic orbital formulation of eq 2 for computational efficiency, we utilize the direct differentiation approach.Foresman et al. (1992); Caillie and Amos (1999, 2000); Shroll and Edwards (2004); Liu et al. (2010) Alternatively, one can achieve the same formulation through the Lagrangian approach.Furche and Ahlrichs (2002); Chiba et al. (2006) Despite rather involving algebraic derivation, the resulting expression is relatively concise and can be written in matrix notation, facilitating efficient computational processing.

We rewrite eq 2 in the atomic orbital basis {ϕμ}subscriptitalic-ϕ𝜇\{\phi_{\mu}\}{ italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT } as

𝐀Esubscript𝐀𝐸\displaystyle\nabla_{\mathbf{A}}E∇ start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT italic_E =μνPμν𝐀Hμνcore+μνWμν𝐀Sμν+𝐀Vnucabsentsubscript𝜇𝜈subscript𝑃𝜇𝜈subscript𝐀subscriptsuperscript𝐻core𝜇𝜈subscript𝜇𝜈subscript𝑊𝜇𝜈subscript𝐀subscript𝑆𝜇𝜈subscript𝐀subscript𝑉nuc\displaystyle=\sum_{\mu\nu}P_{\mu\nu}\nabla_{\mathbf{A}}H^{\mathrm{core}}_{\mu% \nu}+\sum_{\mu\nu}W_{\mu\nu}\nabla_{\mathbf{A}}S_{\mu\nu}+\nabla_{\mathbf{A}}V% _{\mathrm{nuc}}= ∑ start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT italic_H start_POSTSUPERSCRIPT roman_core end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT italic_S start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT + ∇ start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT roman_nuc end_POSTSUBSCRIPT
+μνλσΓμνλσ𝐀(μν|λσ)+μνλσΓμνλσlr𝐀(μν|λσ)lr\displaystyle+\sum_{\mu\nu\lambda\sigma}\Gamma_{\mu\nu\lambda\sigma}\nabla_{% \mathbf{A}}(\mu\nu|\lambda\sigma)+\sum_{\mu\nu\lambda\sigma}\Gamma_{\mu\nu% \lambda\sigma}^{\,\mathrm{lr}}\nabla_{\mathbf{A}}(\mu\nu|\lambda\sigma)_{% \mathrm{lr}}+ ∑ start_POSTSUBSCRIPT italic_μ italic_ν italic_λ italic_σ end_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_μ italic_ν italic_λ italic_σ end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT ( italic_μ italic_ν | italic_λ italic_σ ) + ∑ start_POSTSUBSCRIPT italic_μ italic_ν italic_λ italic_σ end_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT italic_μ italic_ν italic_λ italic_σ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_lr end_POSTSUPERSCRIPT ∇ start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT ( italic_μ italic_ν | italic_λ italic_σ ) start_POSTSUBSCRIPT roman_lr end_POSTSUBSCRIPT
+μνPμν𝐀FμνXC+μνλσRμνRλσ𝐀(μν|f^XC|λσ)subscript𝜇𝜈subscript𝑃𝜇𝜈subscript𝐀subscriptsuperscript𝐹XC𝜇𝜈subscript𝜇𝜈𝜆𝜎subscript𝑅𝜇𝜈subscript𝑅𝜆𝜎subscript𝐀𝜇𝜈subscript^𝑓XC𝜆𝜎\displaystyle+\sum_{\mu\nu}P_{\mu\nu}\nabla_{\mathbf{A}}F^{\mathrm{XC}}_{\mu% \nu}+\sum_{\mu\nu\lambda\sigma}R_{\mu\nu}R_{\lambda\sigma}\nabla_{\mathbf{A}}(% \mu\nu|\hat{f}_{\mathrm{XC}}|\lambda\sigma)+ ∑ start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT italic_F start_POSTSUPERSCRIPT roman_XC end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_μ italic_ν italic_λ italic_σ end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_λ italic_σ end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT ( italic_μ italic_ν | over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT roman_XC end_POSTSUBSCRIPT | italic_λ italic_σ ) (7)

where 𝐇coresubscript𝐇core\mathbf{H}_{\mathrm{core}}bold_H start_POSTSUBSCRIPT roman_core end_POSTSUBSCRIPT represents the core Hamiltonian matrix, 𝐒𝐒\mathbf{S}bold_S represents the overlap matrix, 𝐅XCsubscript𝐅XC\mathbf{F}_{\mathrm{XC}}bold_F start_POSTSUBSCRIPT roman_XC end_POSTSUBSCRIPT represents the Kohn-Sham Fock matrix,Pople et al. (1992) (μν|λσ)conditional𝜇𝜈𝜆𝜎(\mu\nu|\lambda\sigma)( italic_μ italic_ν | italic_λ italic_σ ) and (μν|λσ)lrsubscriptconditional𝜇𝜈𝜆𝜎lr(\mu\nu|\lambda\sigma)_{\mathrm{lr}}( italic_μ italic_ν | italic_λ italic_σ ) start_POSTSUBSCRIPT roman_lr end_POSTSUBSCRIPT denote the ERIs from eqs 3 and 4, respectively, and Vnucsubscript𝑉nucV_{\text{nuc}}italic_V start_POSTSUBSCRIPT nuc end_POSTSUBSCRIPT denotes the nuclear repulsion energy. The total density matrix is defined by

𝐏𝐏\displaystyle\mathbf{P}bold_P =𝐏0+𝐏Δabsentsubscript𝐏0subscript𝐏Δ\displaystyle=\mathbf{P}_{0}+\mathbf{P}_{\Delta}= bold_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + bold_P start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT (8)

with

𝐏0subscript𝐏0\displaystyle\mathbf{P}_{0}bold_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT =2𝐂o𝐂oTabsent2subscript𝐂osuperscriptsubscript𝐂o𝑇\displaystyle=2\mathbf{C}_{\text{o}}\mathbf{C}_{\text{o}}^{T}= 2 bold_C start_POSTSUBSCRIPT o end_POSTSUBSCRIPT bold_C start_POSTSUBSCRIPT o end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT (9)
𝐏Δsubscript𝐏Δ\displaystyle\mathbf{P}_{\Delta}bold_P start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT =𝐏~Δ+2𝐂v𝐙𝐂oTabsentsubscript~𝐏Δ2subscript𝐂vsuperscriptsubscript𝐙𝐂o𝑇\displaystyle=\widetilde{\mathbf{P}}_{\Delta}+2\mathbf{C}_{\text{v}}\mathbf{Z}% \mathbf{C}_{\text{o}}^{T}= over~ start_ARG bold_P end_ARG start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT + 2 bold_C start_POSTSUBSCRIPT v end_POSTSUBSCRIPT bold_ZC start_POSTSUBSCRIPT o end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT (10)
𝐏~Δsubscript~𝐏Δ\displaystyle\widetilde{\mathbf{P}}_{\Delta}over~ start_ARG bold_P end_ARG start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT =𝐂v𝐗T𝐗𝐂vT𝐂o𝐗𝐗T𝐂oTabsentsubscript𝐂vsuperscript𝐗𝑇superscriptsubscript𝐗𝐂v𝑇subscript𝐂osuperscript𝐗𝐗𝑇superscriptsubscript𝐂o𝑇\displaystyle=\mathbf{C}_{\text{v}}\mathbf{X}^{T}\mathbf{X}\mathbf{C}_{\text{v% }}^{T}-\mathbf{C}_{\text{o}}\mathbf{X}\mathbf{X}^{T}\mathbf{C}_{\text{o}}^{T}= bold_C start_POSTSUBSCRIPT v end_POSTSUBSCRIPT bold_X start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_XC start_POSTSUBSCRIPT v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - bold_C start_POSTSUBSCRIPT o end_POSTSUBSCRIPT bold_XX start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_C start_POSTSUBSCRIPT o end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT (11)

where 𝐏0subscript𝐏0\mathbf{P}_{0}bold_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT denotes the ground-state density matrix, 𝐏Δsubscript𝐏Δ\mathbf{P}_{\Delta}bold_P start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT and 𝐏~Δsubscript~𝐏Δ\widetilde{\mathbf{P}}_{\Delta}over~ start_ARG bold_P end_ARG start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT denote the relaxed and unrelaxed difference density matrices, respectively, and 𝐙𝐙\mathbf{Z}bold_Z describes the virtual–occupied orbital relaxation that can be obtained by solving a set of coupled-perturbed equations described in section 2.2. The transition density matrix is expressed as

𝐑=𝐂o𝐗𝐂vT𝐑subscript𝐂osuperscriptsubscript𝐗𝐂v𝑇\displaystyle\mathbf{R}=\mathbf{C}_{\text{o}}\mathbf{X}\mathbf{C}_{\text{v}}^{T}bold_R = bold_C start_POSTSUBSCRIPT o end_POSTSUBSCRIPT bold_XC start_POSTSUBSCRIPT v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT (12)

with the molecular orbital coefficient matrix 𝐂[𝐂o,𝐂v]𝐂subscript𝐂osubscript𝐂v\mathbf{C}\equiv\left[\mathbf{C}_{\mathrm{o}},\mathbf{C}_{\mathrm{v}}\right]bold_C ≡ [ bold_C start_POSTSUBSCRIPT roman_o end_POSTSUBSCRIPT , bold_C start_POSTSUBSCRIPT roman_v end_POSTSUBSCRIPT ], and we write the product of densities as

ΓμνλσsubscriptΓ𝜇𝜈𝜆𝜎\displaystyle\Gamma_{\mu\nu\lambda\sigma}roman_Γ start_POSTSUBSCRIPT italic_μ italic_ν italic_λ italic_σ end_POSTSUBSCRIPT =2RνμRλσαRσμRλνabsent2subscript𝑅𝜈𝜇subscript𝑅𝜆𝜎𝛼subscript𝑅𝜎𝜇subscript𝑅𝜆𝜈\displaystyle=2R_{\nu\mu}R_{\lambda\sigma}-\alpha R_{\sigma\mu}R_{\lambda\nu}= 2 italic_R start_POSTSUBSCRIPT italic_ν italic_μ end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_λ italic_σ end_POSTSUBSCRIPT - italic_α italic_R start_POSTSUBSCRIPT italic_σ italic_μ end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_λ italic_ν end_POSTSUBSCRIPT
+12((Pμν+2PμνΔ)Pλσα2(Pμσ+2PμσΔ)Pλν)12subscript𝑃𝜇𝜈2subscriptsuperscript𝑃Δ𝜇𝜈subscript𝑃𝜆𝜎𝛼2subscript𝑃𝜇𝜎2subscriptsuperscript𝑃Δ𝜇𝜎subscript𝑃𝜆𝜈\displaystyle+\frac{1}{2}\left((P_{\mu\nu}+2P^{\,\Delta}_{\mu\nu})P_{\lambda% \sigma}-\frac{\alpha}{2}(P_{\mu\sigma}+2P^{\,\Delta}_{\mu\sigma})P_{\lambda\nu% }\right)+ divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( ( italic_P start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT + 2 italic_P start_POSTSUPERSCRIPT roman_Δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT ) italic_P start_POSTSUBSCRIPT italic_λ italic_σ end_POSTSUBSCRIPT - divide start_ARG italic_α end_ARG start_ARG 2 end_ARG ( italic_P start_POSTSUBSCRIPT italic_μ italic_σ end_POSTSUBSCRIPT + 2 italic_P start_POSTSUPERSCRIPT roman_Δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_μ italic_σ end_POSTSUBSCRIPT ) italic_P start_POSTSUBSCRIPT italic_λ italic_ν end_POSTSUBSCRIPT ) (13)
ΓμνλσlrsuperscriptsubscriptΓ𝜇𝜈𝜆𝜎lr\displaystyle\Gamma_{\mu\nu\lambda\sigma}^{\,\mathrm{lr}}roman_Γ start_POSTSUBSCRIPT italic_μ italic_ν italic_λ italic_σ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_lr end_POSTSUPERSCRIPT =βRσμRλνβ4(Pμσ+2PμσΔ)Pλνabsent𝛽subscript𝑅𝜎𝜇subscript𝑅𝜆𝜈𝛽4subscript𝑃𝜇𝜎2subscriptsuperscript𝑃Δ𝜇𝜎subscript𝑃𝜆𝜈\displaystyle=-\beta R_{\sigma\mu}R_{\lambda\nu}-\frac{\beta}{4}(P_{\mu\sigma}% +2P^{\,\Delta}_{\mu\sigma})P_{\lambda\nu}= - italic_β italic_R start_POSTSUBSCRIPT italic_σ italic_μ end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_λ italic_ν end_POSTSUBSCRIPT - divide start_ARG italic_β end_ARG start_ARG 4 end_ARG ( italic_P start_POSTSUBSCRIPT italic_μ italic_σ end_POSTSUBSCRIPT + 2 italic_P start_POSTSUPERSCRIPT roman_Δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_μ italic_σ end_POSTSUBSCRIPT ) italic_P start_POSTSUBSCRIPT italic_λ italic_ν end_POSTSUBSCRIPT (14)

The atomic orbital formulation of TDDFT nuclear gradient in eq 2.1 facilitates the integral-direct implementation, enabling the calculation of the four-index ERIs as needed during computation to avoid the heavy 𝒪(N4)𝒪superscript𝑁4\mathcal{O}(N^{4})caligraphic_O ( italic_N start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT ) scaling in memory. We provide the working expressions for the energy-weighted density matrix, 𝐖𝐖\mathbf{W}bold_W, ERI derivatives (fourth and fifth terms in eq 2.1) and XC derivatives (sixth and seventh terms in eq 2.1) in sections 2.2, 2.3 and 2.4, respectively.

2.2 Coupled-Perturbed Kohn–Sham Equations

The virtual–occupied orbital relaxation, 𝐙𝐙\mathbf{Z}bold_Z, representing the first-order derivative of molecular orbital coefficients, is obtained by solving the coupled-perturbed equationGerratt and Mills (1968) using the Z-vector method:Handy and Schaefer (1984)

(𝐝𝐀)𝐙=𝐋𝐝𝐀𝐙𝐋\displaystyle(\mathbf{d}-\mathbf{A})\mathbf{Z}=\mathbf{L}( bold_d - bold_A ) bold_Z = bold_L (15)

where the matrix elements on the left-hand side are given as

daisubscript𝑑𝑎𝑖\displaystyle d_{ai}italic_d start_POSTSUBSCRIPT italic_a italic_i end_POSTSUBSCRIPT =εiεaabsentsubscript𝜀𝑖subscript𝜀𝑎\displaystyle=\varepsilon_{i}-\varepsilon_{a}= italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_ε start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT (16)
Aai,bjsubscript𝐴𝑎𝑖𝑏𝑗\displaystyle A_{ai,bj}italic_A start_POSTSUBSCRIPT italic_a italic_i , italic_b italic_j end_POSTSUBSCRIPT =4(ia|jb)α[(ij|ab)+(ib|aj)]absent4conditional𝑖𝑎𝑗𝑏𝛼delimited-[]conditional𝑖𝑗𝑎𝑏conditional𝑖𝑏𝑎𝑗\displaystyle=4(ia|jb)-\alpha\Big{[}(ij|ab)+(ib|aj)\Big{]}= 4 ( italic_i italic_a | italic_j italic_b ) - italic_α [ ( italic_i italic_j | italic_a italic_b ) + ( italic_i italic_b | italic_a italic_j ) ]
β[(ij|ab)lr+(ib|aj)lr]+2(ia|f^XC|jb)𝛽delimited-[]subscriptconditional𝑖𝑗𝑎𝑏lrsubscriptconditional𝑖𝑏𝑎𝑗lr2𝑖𝑎subscript^𝑓XC𝑗𝑏\displaystyle-\beta\Big{[}(ij|ab)_{\mathrm{lr}}+(ib|aj)_{\mathrm{lr}}\Big{]}+2% (ia|\hat{f}_{\mathrm{XC}}|jb)- italic_β [ ( italic_i italic_j | italic_a italic_b ) start_POSTSUBSCRIPT roman_lr end_POSTSUBSCRIPT + ( italic_i italic_b | italic_a italic_j ) start_POSTSUBSCRIPT roman_lr end_POSTSUBSCRIPT ] + 2 ( italic_i italic_a | over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT roman_XC end_POSTSUBSCRIPT | italic_j italic_b ) (17)

and the Lagrangian vector on the right-hand side is given as

𝐋𝐋\displaystyle\mathbf{L}bold_L =𝐂vT𝐅𝐏~Δ𝐂o+𝐂vT𝐅𝐑𝐂v𝐗T𝐗T𝐂oT𝐅𝐑𝐂o+𝐂vT𝐊𝐂oabsentsuperscriptsubscript𝐂v𝑇subscript𝐅subscript~𝐏Δsubscript𝐂osuperscriptsubscript𝐂v𝑇subscript𝐅𝐑subscript𝐂vsuperscript𝐗𝑇superscript𝐗𝑇superscriptsubscript𝐂o𝑇subscript𝐅𝐑subscript𝐂osuperscriptsubscript𝐂v𝑇subscript𝐊𝐂o\displaystyle=\mathbf{C}_{\text{v}}^{T}\mathbf{F}_{\widetilde{\mathbf{P}}_{% \Delta}}\mathbf{C}_{\text{o}}+\mathbf{C}_{\text{v}}^{T}\mathbf{F}_{\mathbf{R}}% \mathbf{C}_{\text{v}}\mathbf{X}^{T}-\mathbf{X}^{T}\mathbf{C}_{\text{o}}^{T}% \mathbf{F}_{\mathbf{R}}\mathbf{C}_{\text{o}}+\mathbf{C}_{\text{v}}^{T}\mathbf{% K}\mathbf{C}_{\text{o}}= bold_C start_POSTSUBSCRIPT v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_F start_POSTSUBSCRIPT over~ start_ARG bold_P end_ARG start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_C start_POSTSUBSCRIPT o end_POSTSUBSCRIPT + bold_C start_POSTSUBSCRIPT v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_F start_POSTSUBSCRIPT bold_R end_POSTSUBSCRIPT bold_C start_POSTSUBSCRIPT v end_POSTSUBSCRIPT bold_X start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - bold_X start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_C start_POSTSUBSCRIPT o end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_F start_POSTSUBSCRIPT bold_R end_POSTSUBSCRIPT bold_C start_POSTSUBSCRIPT o end_POSTSUBSCRIPT + bold_C start_POSTSUBSCRIPT v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_KC start_POSTSUBSCRIPT o end_POSTSUBSCRIPT (18)

where the Fock-type matrix 𝐅𝐏~Δsubscript𝐅subscript~𝐏Δ\mathbf{F}_{\widetilde{\mathbf{P}}_{\Delta}}bold_F start_POSTSUBSCRIPT over~ start_ARG bold_P end_ARG start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT end_POSTSUBSCRIPT is defined by

Fμν𝐏~Δsuperscriptsubscript𝐹𝜇𝜈subscript~𝐏Δ\displaystyle F_{\mu\nu}^{\,\widetilde{\mathbf{P}}_{\Delta}}italic_F start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT over~ start_ARG bold_P end_ARG start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT end_POSTSUPERSCRIPT =λσP~σλΔ[2(μν|λσ)α(μλ|νσ)β(μλ|νσ)lr\displaystyle=\sum_{\lambda\sigma}\widetilde{P}^{\,\Delta}_{\sigma\lambda}\Big% {[}2(\mu\nu|\lambda\sigma)-\alpha(\mu\lambda|\nu\sigma)-\beta(\mu\lambda|\nu% \sigma)_{\mathrm{lr}}= ∑ start_POSTSUBSCRIPT italic_λ italic_σ end_POSTSUBSCRIPT over~ start_ARG italic_P end_ARG start_POSTSUPERSCRIPT roman_Δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_σ italic_λ end_POSTSUBSCRIPT [ 2 ( italic_μ italic_ν | italic_λ italic_σ ) - italic_α ( italic_μ italic_λ | italic_ν italic_σ ) - italic_β ( italic_μ italic_λ | italic_ν italic_σ ) start_POSTSUBSCRIPT roman_lr end_POSTSUBSCRIPT
+(μν|f^XC|λσ)]\displaystyle+(\mu\nu|\hat{f}_{\mathrm{XC}}|\lambda\sigma)\Big{]}+ ( italic_μ italic_ν | over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT roman_XC end_POSTSUBSCRIPT | italic_λ italic_σ ) ] (19)

through the contraction of 𝐏~Δsubscript~𝐏Δ\widetilde{\mathbf{P}}_{\Delta}over~ start_ARG bold_P end_ARG start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT, and 𝐅𝐑subscript𝐅𝐑\mathbf{F}_{\mathbf{R}}bold_F start_POSTSUBSCRIPT bold_R end_POSTSUBSCRIPT can be defined similarly.

The XC hyperkernel involving higher-order functional derivatives is calculated as

Kμνsubscript𝐾𝜇𝜈\displaystyle K_{\mu\nu}italic_K start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT =ϕμ(𝐫)ϕν(𝐫)(2𝐡2f2(2)+g2f1(3)+2g(ρ𝐡)f2(3)\displaystyle=\iiint\phi_{\mu}(\mathbf{r})\phi_{\nu}(\mathbf{r})\Big{(}2{% \mathbf{h}}^{2}f^{\,(2)}_{2}+{g}^{2}f^{\,(3)}_{1}+2{g}(\nabla\rho\cdot{\mathbf% {h}})f^{\,(3)}_{2}= ∭ italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ( bold_r ) italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT ( bold_r ) ( 2 bold_h start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_g start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + 2 italic_g ( ∇ italic_ρ ⋅ bold_h ) italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT
+(ρ𝐡)2f3(3))+ρ(ϕμ(𝐫)ϕν(𝐫))(2𝐡2f3(2)+g2f2(3)\displaystyle+(\nabla\rho\cdot{\mathbf{h}})^{2}f^{\,(3)}_{3}\Big{)}+\nabla\rho% \cdot\nabla\big{(}\phi_{\mu}(\mathbf{r})\phi_{\nu}(\mathbf{r})\big{)}\Big{(}2{% \mathbf{h}}^{2}f^{\,(2)}_{3}+{g}^{2}f^{\,(3)}_{2}+ ( ∇ italic_ρ ⋅ bold_h ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) + ∇ italic_ρ ⋅ ∇ ( italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ( bold_r ) italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT ( bold_r ) ) ( 2 bold_h start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + italic_g start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT
+2g(ρ𝐡)f3(3)+(ρ𝐡)2f4(3))d𝐫d𝐫d𝐫′′\displaystyle+2{g}(\nabla\rho\cdot{\mathbf{h}})f^{\,(3)}_{3}+(\nabla\rho\cdot{% \mathbf{h}})^{2}f^{\,(3)}_{4}\Big{)}\,\mathrm{d}\mathbf{r}\mathrm{d}\mathbf{r}% ^{\prime}\mathrm{d}\mathbf{r}^{\prime\prime}+ 2 italic_g ( ∇ italic_ρ ⋅ bold_h ) italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + ( ∇ italic_ρ ⋅ bold_h ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ) roman_d bold_r roman_d bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT roman_d bold_r start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT (20)

with the intermediates defined by

g𝑔\displaystyle{g}italic_g =μνRμνϕμ(𝐫)ϕν(𝐫′′)absentsubscript𝜇𝜈subscript𝑅𝜇𝜈subscriptitalic-ϕ𝜇superscript𝐫subscriptitalic-ϕ𝜈superscript𝐫′′\displaystyle=\sum_{\mu\nu}{R}_{\mu\nu}\phi_{\mu}(\mathbf{r}^{\prime})\phi_{% \nu}(\mathbf{r}^{\prime\prime})= ∑ start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT ) (21)
𝐡𝐡\displaystyle{\mathbf{h}}bold_h =μνRμν(ϕμ(𝐫)ϕν(𝐫′′))absentsubscript𝜇𝜈subscript𝑅𝜇𝜈subscriptitalic-ϕ𝜇superscript𝐫subscriptitalic-ϕ𝜈superscript𝐫′′\displaystyle=\sum_{\mu\nu}{R}_{\mu\nu}\nabla\big{(}\phi_{\mu}(\mathbf{r}^{% \prime})\phi_{\nu}(\mathbf{r}^{\prime\prime})\big{)}= ∑ start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT ∇ ( italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT ( bold_r start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT ) ) (22)

and the second- and third-order functional derivatives organized as

f2(2)subscriptsuperscript𝑓22\displaystyle f^{\,(2)}_{2}italic_f start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT =2fραγαα+2fραγαβ+2fραγββabsentsuperscript2𝑓subscript𝜌𝛼subscript𝛾𝛼𝛼superscript2𝑓subscript𝜌𝛼subscript𝛾𝛼𝛽superscript2𝑓subscript𝜌𝛼subscript𝛾𝛽𝛽\displaystyle=\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\gamma_{\alpha% \alpha}}+\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\gamma_{\alpha\beta% }}+\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\gamma_{\beta\beta}}= divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT end_ARG + divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT end_ARG + divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_β italic_β end_POSTSUBSCRIPT end_ARG (23)
f3(2)subscriptsuperscript𝑓23\displaystyle f^{\,(2)}_{3}italic_f start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT =2fγαα2+122fγαβ2+22fγααγαβ+2fγααγββabsentsuperscript2𝑓superscriptsubscript𝛾𝛼𝛼212superscript2𝑓superscriptsubscript𝛾𝛼𝛽22superscript2𝑓subscript𝛾𝛼𝛼subscript𝛾𝛼𝛽superscript2𝑓subscript𝛾𝛼𝛼subscript𝛾𝛽𝛽\displaystyle=\frac{\partial^{2}f}{\partial\gamma_{\alpha\alpha}^{2}}+\frac{1}% {2}\frac{\partial^{2}f}{\partial\gamma_{\alpha\beta}^{2}}+2\frac{\partial^{2}f% }{\partial\gamma_{\alpha\alpha}\partial\gamma_{\alpha\beta}}+\frac{\partial^{2% }f}{\partial\gamma_{\alpha\alpha}\partial\gamma_{\beta\beta}}= divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + divide start_ARG 1 end_ARG start_ARG 2 end_ARG divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + 2 divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT end_ARG + divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_β italic_β end_POSTSUBSCRIPT end_ARG (24)
f1(3)subscriptsuperscript𝑓31\displaystyle f^{\,(3)}_{1}italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT =3fρα3+33fρα2ρβabsentsuperscript3𝑓superscriptsubscript𝜌𝛼33superscript3𝑓superscriptsubscript𝜌𝛼2subscript𝜌𝛽\displaystyle=\frac{\partial^{3}f}{\partial\rho_{\alpha}^{3}}+3\frac{\partial^% {3}f}{\partial\rho_{\alpha}^{2}\partial\rho_{\beta}}= divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT end_ARG + 3 divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∂ italic_ρ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT end_ARG (25)
f2(3)subscriptsuperscript𝑓32\displaystyle f^{\,(3)}_{2}italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT =3fρα2γαα+3fρα2γαβ+3fρα2γββ+23fραρβγααabsentsuperscript3𝑓superscriptsubscript𝜌𝛼2subscript𝛾𝛼𝛼superscript3𝑓superscriptsubscript𝜌𝛼2subscript𝛾𝛼𝛽superscript3𝑓superscriptsubscript𝜌𝛼2subscript𝛾𝛽𝛽2superscript3𝑓subscript𝜌𝛼subscript𝜌𝛽subscript𝛾𝛼𝛼\displaystyle=\frac{\partial^{3}f}{\partial\rho_{\alpha}^{2}\partial\gamma_{% \alpha\alpha}}+\frac{\partial^{3}f}{\partial\rho_{\alpha}^{2}\partial\gamma_{% \alpha\beta}}+\frac{\partial^{3}f}{\partial\rho_{\alpha}^{2}\partial\gamma_{% \beta\beta}}+2\frac{\partial^{3}f}{\partial\rho_{\alpha}\partial\rho_{\beta}% \partial\gamma_{\alpha\alpha}}= divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT end_ARG + divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT end_ARG + divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_β italic_β end_POSTSUBSCRIPT end_ARG + 2 divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_ρ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT end_ARG
+3fραρβγαβsuperscript3𝑓subscript𝜌𝛼subscript𝜌𝛽subscript𝛾𝛼𝛽\displaystyle+\frac{\partial^{3}f}{\partial\rho_{\alpha}\partial\rho_{\beta}% \partial\gamma_{\alpha\beta}}+ divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_ρ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT end_ARG (26)
f3(3)subscriptsuperscript𝑓33\displaystyle f^{\,(3)}_{3}italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT =3fραγαα2+3fρβγαα2+3fραγαβ2+23fραγααγαβabsentsuperscript3𝑓subscript𝜌𝛼superscriptsubscript𝛾𝛼𝛼2superscript3𝑓subscript𝜌𝛽superscriptsubscript𝛾𝛼𝛼2superscript3𝑓subscript𝜌𝛼superscriptsubscript𝛾𝛼𝛽22superscript3𝑓subscript𝜌𝛼subscript𝛾𝛼𝛼subscript𝛾𝛼𝛽\displaystyle=\frac{\partial^{3}f}{\partial\rho_{\alpha}\partial\gamma_{\alpha% \alpha}^{2}}+\frac{\partial^{3}f}{\partial\rho_{\beta}\partial\gamma_{\alpha% \alpha}^{2}}+\frac{\partial^{3}f}{\partial\rho_{\alpha}\partial\gamma_{\alpha% \beta}^{2}}+2\frac{\partial^{3}f}{\partial\rho_{\alpha}\partial\gamma_{\alpha% \alpha}\partial\gamma_{\alpha\beta}}= divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + 2 divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT end_ARG
+23fρβγααγαβ+23fραγααγββ2superscript3𝑓subscript𝜌𝛽subscript𝛾𝛼𝛼subscript𝛾𝛼𝛽2superscript3𝑓subscript𝜌𝛼subscript𝛾𝛼𝛼subscript𝛾𝛽𝛽\displaystyle+2\frac{\partial^{3}f}{\partial\rho_{\beta}\partial\gamma_{\alpha% \alpha}\partial\gamma_{\alpha\beta}}+2\frac{\partial^{3}f}{\partial\rho_{% \alpha}\partial\gamma_{\alpha\alpha}\partial\gamma_{\beta\beta}}+ 2 divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT end_ARG + 2 divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_β italic_β end_POSTSUBSCRIPT end_ARG (27)
f4(3)subscriptsuperscript𝑓34\displaystyle f^{\,(3)}_{4}italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT =3fγαα3+123fγαβ3+33fγαα2γαβ+33fγαα2γββabsentsuperscript3𝑓superscriptsubscript𝛾𝛼𝛼312superscript3𝑓superscriptsubscript𝛾𝛼𝛽33superscript3𝑓superscriptsubscript𝛾𝛼𝛼2subscript𝛾𝛼𝛽3superscript3𝑓superscriptsubscript𝛾𝛼𝛼2subscript𝛾𝛽𝛽\displaystyle=\frac{\partial^{3}f}{\partial\gamma_{\alpha\alpha}^{3}}+\frac{1}% {2}\frac{\partial^{3}f}{\partial\gamma_{\alpha\beta}^{3}}+3\frac{\partial^{3}f% }{\partial\gamma_{\alpha\alpha}^{2}\partial\gamma_{\alpha\beta}}+3\frac{% \partial^{3}f}{\partial\gamma_{\alpha\alpha}^{2}\partial\gamma_{\beta\beta}}= divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT end_ARG + divide start_ARG 1 end_ARG start_ARG 2 end_ARG divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT end_ARG + 3 divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT end_ARG + 3 divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_β italic_β end_POSTSUBSCRIPT end_ARG
+33fγαβ2γαα+33fγααγαβγββ3superscript3𝑓superscriptsubscript𝛾𝛼𝛽2subscript𝛾𝛼𝛼3superscript3𝑓subscript𝛾𝛼𝛼subscript𝛾𝛼𝛽subscript𝛾𝛽𝛽\displaystyle+3\frac{\partial^{3}f}{\partial\gamma_{\alpha\beta}^{2}\partial% \gamma_{\alpha\alpha}}+3\frac{\partial^{3}f}{\partial\gamma_{\alpha\alpha}% \partial\gamma_{\alpha\beta}\partial\gamma_{\beta\beta}}+ 3 divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT end_ARG + 3 divide start_ARG ∂ start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT ∂ italic_γ start_POSTSUBSCRIPT italic_β italic_β end_POSTSUBSCRIPT end_ARG (28)

This can be implemented straightforwardly by repurposing the existing routine for the XC kernel used in the TDDFT energy computation.

To solve eq 15 without the need to store 𝐀𝐀\mathbf{A}bold_A of large dimension NoNv×NoNvsubscript𝑁osubscript𝑁vsubscript𝑁osubscript𝑁vN_{\mathrm{o}}N_{\mathrm{v}}\times N_{\mathrm{o}}N_{\mathrm{v}}italic_N start_POSTSUBSCRIPT roman_o end_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT roman_v end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT roman_o end_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT roman_v end_POSTSUBSCRIPT, we employ the Krylov subspace method. In this approach, the orthogonal subspace 𝐛{𝐛0,𝐛1,}𝐛subscript𝐛0subscript𝐛1\mathbf{b}\equiv\{\mathbf{b}_{0},\mathbf{b}_{1},\cdots\}bold_b ≡ { bold_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_b start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ } is iteratively augmented to span the basis space for 𝐙𝐙\mathbf{Z}bold_Z until the residual is acceptably small.Pople et al. (1979) We rewrite eq 15 to adapt it into the Krylov method as

(𝐈𝐀)𝐙=𝐛0𝐈superscript𝐀𝐙subscript𝐛0\displaystyle(\mathbf{I}-\mathbf{A}^{\prime})\mathbf{Z}=\mathbf{b}_{0}( bold_I - bold_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) bold_Z = bold_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT (29)

where 𝐀=𝐝1𝐀superscript𝐀superscript𝐝1𝐀\mathbf{A}^{\prime}=\mathbf{d}^{-1}\mathbf{A}bold_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = bold_d start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_A and 𝐛0=𝐝1𝐋subscript𝐛0superscript𝐝1𝐋\mathbf{b}_{0}=\mathbf{d}^{-1}\mathbf{L}bold_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = bold_d start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_L. At the k𝑘kitalic_k-th iteration, the (k+1)𝑘1(k{+}1)( italic_k + 1 )-dimensional subspace linear equation is solved to obtain 𝐙ksubscript𝐙𝑘\mathbf{Z}_{k}bold_Z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT:

𝐀~k𝐚k=𝐱ksubscript~𝐀𝑘subscript𝐚𝑘subscript𝐱𝑘\displaystyle\widetilde{\mathbf{A}}_{k}\mathbf{a}_{k}=\mathbf{x}_{k}over~ start_ARG bold_A end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_a start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT (30)

where the elements are defined as

A~i,j(k)superscriptsubscript~𝐴𝑖𝑗𝑘\displaystyle\widetilde{A}_{i,j}^{\,(k)}over~ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT =𝐛iT𝐛j𝐛iT𝝈~jabsentsuperscriptsubscript𝐛𝑖𝑇subscript𝐛𝑗superscriptsubscript𝐛𝑖𝑇subscript~𝝈𝑗\displaystyle=\mathbf{b}_{i}^{T}\mathbf{b}_{j}-\mathbf{b}_{i}^{T}\widetilde{% \boldsymbol{\sigma}}_{j}= bold_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - bold_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT over~ start_ARG bold_italic_σ end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT (31)
xi(k)superscriptsubscript𝑥𝑖𝑘\displaystyle x_{i}^{\,(k)}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT =δ0i𝐛iT𝐛iabsentsubscript𝛿0𝑖superscriptsubscript𝐛𝑖𝑇subscript𝐛𝑖\displaystyle=\delta_{0i}\mathbf{b}_{i}^{T}\mathbf{b}_{i}= italic_δ start_POSTSUBSCRIPT 0 italic_i end_POSTSUBSCRIPT bold_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (32)

The scaled matrix-vector product 𝝈~k𝐝1𝐀𝐛ksubscript~𝝈𝑘superscript𝐝1subscript𝐀𝐛𝑘\widetilde{\boldsymbol{\sigma}}_{k}\equiv\mathbf{d}^{-1}\mathbf{A}\mathbf{b}_{k}over~ start_ARG bold_italic_σ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ≡ bold_d start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_Ab start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is calculated directly as in the Fock build in TDDFT:

𝝈ksubscript𝝈𝑘\displaystyle\boldsymbol{\sigma}_{k}bold_italic_σ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT =𝐀𝐛kabsentsubscript𝐀𝐛𝑘\displaystyle=\mathbf{A}\mathbf{b}_{k}= bold_Ab start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT (33)
=𝐂vT𝐅~k𝐂oabsentsuperscriptsubscript𝐂v𝑇subscript~𝐅𝑘subscript𝐂o\displaystyle=\mathbf{C}_{\text{v}}^{T}\widetilde{\mathbf{F}}_{k}\mathbf{C}_{% \text{o}}= bold_C start_POSTSUBSCRIPT v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT over~ start_ARG bold_F end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_C start_POSTSUBSCRIPT o end_POSTSUBSCRIPT

Here, we use a slightly modified expression for the Fock-type matrix, derived from eq 2.2, written as

F~μν(k)superscriptsubscript~𝐹𝜇𝜈𝑘\displaystyle\widetilde{F}_{\mu\nu}^{\,(k)}over~ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT =λσBσλ(k)[4(μν|λσ)α((μλ|σν)+(μσ|λν))\displaystyle=\sum_{\lambda\sigma}B_{\sigma\lambda}^{\,(k)}\Big{[}4(\mu\nu|% \lambda\sigma)-\alpha\,\Big{(}(\mu\lambda|\sigma\nu)+(\mu\sigma|\lambda\nu)% \Big{)}= ∑ start_POSTSUBSCRIPT italic_λ italic_σ end_POSTSUBSCRIPT italic_B start_POSTSUBSCRIPT italic_σ italic_λ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT [ 4 ( italic_μ italic_ν | italic_λ italic_σ ) - italic_α ( ( italic_μ italic_λ | italic_σ italic_ν ) + ( italic_μ italic_σ | italic_λ italic_ν ) )
β((μλ|σν)lr+(μσ|λν)lr)+2(μν|f^XC|λσ)]\displaystyle-\beta\,\Big{(}(\mu\lambda|\sigma\nu)_{\mathrm{lr}}+(\mu\sigma|% \lambda\nu)_{\mathrm{lr}}\big{)}+2(\mu\nu|\hat{f}_{\mathrm{XC}}|\lambda\sigma)% \Big{]}- italic_β ( ( italic_μ italic_λ | italic_σ italic_ν ) start_POSTSUBSCRIPT roman_lr end_POSTSUBSCRIPT + ( italic_μ italic_σ | italic_λ italic_ν ) start_POSTSUBSCRIPT roman_lr end_POSTSUBSCRIPT ) + 2 ( italic_μ italic_ν | over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT roman_XC end_POSTSUBSCRIPT | italic_λ italic_σ ) ] (34)

where 𝐁k𝐂v𝐛k𝐂oTsubscript𝐁𝑘subscript𝐂vsubscript𝐛𝑘superscriptsubscript𝐂o𝑇\mathbf{B}_{k}\equiv\mathbf{C}_{\text{v}}\mathbf{b}_{k}\mathbf{C}_{\text{o}}^{T}bold_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ≡ bold_C start_POSTSUBSCRIPT v end_POSTSUBSCRIPT bold_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_C start_POSTSUBSCRIPT o end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT.

Then, 𝐙𝐙\mathbf{Z}bold_Z can be approximated as

𝐙𝐙\displaystyle\mathbf{Z}bold_Z 𝐙k=i=0kai𝐛isimilar-to-or-equalsabsentsubscript𝐙𝑘superscriptsubscript𝑖0𝑘subscript𝑎𝑖subscript𝐛𝑖\displaystyle\simeq\mathbf{Z}_{k}=\sum_{i=0}^{k}a_{i}\mathbf{b}_{i}≃ bold_Z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (35)

If the residual is not sufficiently small, an additional subspace vector is constructed,

𝐛k+1subscript𝐛𝑘1\displaystyle\mathbf{b}_{k+1}bold_b start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT =𝝈~ki=0k𝐛iT𝝈~k𝐛iT𝐛i𝐛iabsentsubscript~𝝈𝑘superscriptsubscript𝑖0𝑘superscriptsubscript𝐛𝑖𝑇subscript~𝝈𝑘superscriptsubscript𝐛𝑖𝑇subscript𝐛𝑖subscript𝐛𝑖\displaystyle=\widetilde{\boldsymbol{\sigma}}_{k}-\sum_{i=0}^{k}\frac{\mathbf{% b}_{i}^{T}\widetilde{\boldsymbol{\sigma}}_{k}}{\mathbf{b}_{i}^{T}\mathbf{b}_{i% }}\mathbf{b}_{i}= over~ start_ARG bold_italic_σ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT divide start_ARG bold_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT over~ start_ARG bold_italic_σ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG start_ARG bold_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG bold_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (36)

and the iteration continues.

Finally, the energy-weighted density matrix can be obtained through matrix multiplications as

𝐖𝐖\displaystyle\mathbf{W}bold_W =𝐂odiag(𝜺o)(2𝐈𝐗𝐗T)𝐂oT+𝐂vdiag(𝜺v)𝐗T𝐗𝐂vTabsentsubscript𝐂odiagsubscript𝜺o2𝐈superscript𝐗𝐗𝑇superscriptsubscript𝐂o𝑇subscript𝐂vdiagsubscript𝜺vsuperscript𝐗𝑇superscriptsubscript𝐗𝐂v𝑇\displaystyle=\mathbf{C}_{\mathrm{o}}\operatorname{diag}(\boldsymbol{% \varepsilon}_{\mathrm{o}})\,(2\mathbf{I}-\mathbf{X}\mathbf{X}^{T})\mathbf{C}_{% \mathrm{o}}^{T}+\mathbf{C}_{\mathrm{v}}\operatorname{diag}(\boldsymbol{% \varepsilon}_{\mathrm{v}})\mathbf{X}^{T}\mathbf{X}\mathbf{C}_{\mathrm{v}}^{T}= bold_C start_POSTSUBSCRIPT roman_o end_POSTSUBSCRIPT roman_diag ( bold_italic_ε start_POSTSUBSCRIPT roman_o end_POSTSUBSCRIPT ) ( 2 bold_I - bold_XX start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) bold_C start_POSTSUBSCRIPT roman_o end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + bold_C start_POSTSUBSCRIPT roman_v end_POSTSUBSCRIPT roman_diag ( bold_italic_ε start_POSTSUBSCRIPT roman_v end_POSTSUBSCRIPT ) bold_X start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_XC start_POSTSUBSCRIPT roman_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT
+2𝐂v𝐙diag(𝜺o)𝐂oT+14𝐏0T(𝐅𝐏Δ+𝐊)𝐏0+12𝐏0𝐅𝐑T𝐑T2subscript𝐂v𝐙diagsubscript𝜺osuperscriptsubscript𝐂o𝑇14superscriptsubscript𝐏0𝑇subscript𝐅subscript𝐏Δ𝐊subscript𝐏012subscript𝐏0superscriptsubscript𝐅𝐑𝑇superscript𝐑𝑇\displaystyle+2\mathbf{C}_{\mathrm{v}}\mathbf{Z}\operatorname{diag}(% \boldsymbol{\varepsilon}_{\mathrm{o}})\mathbf{C}_{\mathrm{o}}^{T}+\frac{1}{4}% \mathbf{P}_{0}^{T}(\mathbf{F}_{\mathbf{P}_{\Delta}}+\mathbf{K})\mathbf{P}_{0}+% \frac{1}{2}\mathbf{P}_{0}\mathbf{F}_{\mathbf{R}}^{T}\mathbf{R}^{T}+ 2 bold_C start_POSTSUBSCRIPT roman_v end_POSTSUBSCRIPT bold_Z roman_diag ( bold_italic_ε start_POSTSUBSCRIPT roman_o end_POSTSUBSCRIPT ) bold_C start_POSTSUBSCRIPT roman_o end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG 4 end_ARG bold_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( bold_F start_POSTSUBSCRIPT bold_P start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT end_POSTSUBSCRIPT + bold_K ) bold_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + divide start_ARG 1 end_ARG start_ARG 2 end_ARG bold_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT bold_F start_POSTSUBSCRIPT bold_R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_R start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT
+𝐂v𝐂vT𝐅𝐑T𝐑+𝐑T𝐅𝐑𝐏0subscript𝐂vsuperscriptsubscript𝐂v𝑇superscriptsubscript𝐅𝐑𝑇𝐑superscript𝐑𝑇subscript𝐅𝐑subscript𝐏0\displaystyle+\mathbf{C}_{\mathrm{v}}\mathbf{C}_{\mathrm{v}}^{T}\mathbf{F}_{% \mathbf{R}}^{T}\mathbf{R}+\mathbf{R}^{T}\mathbf{F}_{\mathbf{R}}\mathbf{P}_{0}+ bold_C start_POSTSUBSCRIPT roman_v end_POSTSUBSCRIPT bold_C start_POSTSUBSCRIPT roman_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_F start_POSTSUBSCRIPT bold_R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_R + bold_R start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_F start_POSTSUBSCRIPT bold_R end_POSTSUBSCRIPT bold_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT (37)

where 𝜺osubscript𝜺o\boldsymbol{\varepsilon}_{\mathrm{o}}bold_italic_ε start_POSTSUBSCRIPT roman_o end_POSTSUBSCRIPT and 𝜺vsubscript𝜺v\boldsymbol{\varepsilon}_{\mathrm{v}}bold_italic_ε start_POSTSUBSCRIPT roman_v end_POSTSUBSCRIPT denote the orbital energies of the occupied and virtual molecular orbitals, respectively.

2.3 ERI Derivatives and Their GPU-Acceleration

The Cartesian Gaussian function centered on the A𝐴Aitalic_A-th atom is given by

ϕμ(𝐫;ζ,𝐥,𝐀)subscriptitalic-ϕ𝜇𝐫𝜁𝐥𝐀\displaystyle\phi_{\mu}(\mathbf{r};\zeta,\mathbf{l},\mathbf{A})italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ( bold_r ; italic_ζ , bold_l , bold_A ) =N(xAx)lx(yAy)ly(zAz)lzabsent𝑁superscript𝑥subscript𝐴𝑥subscript𝑙𝑥superscript𝑦subscript𝐴𝑦subscript𝑙𝑦superscript𝑧subscript𝐴𝑧subscript𝑙𝑧\displaystyle=N(x-A_{x})^{l_{x}}(y-A_{y})^{l_{y}}(z-A_{z})^{l_{z}}= italic_N ( italic_x - italic_A start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_y - italic_A start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_l start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_z - italic_A start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_l start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_POSTSUPERSCRIPT (38)
×exp(ζ|𝐫𝐀|2)absent𝜁superscript𝐫𝐀2\displaystyle\times\exp(-\zeta|\mathbf{r}-\mathbf{A}|^{2})× roman_exp ( - italic_ζ | bold_r - bold_A | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT )

where N𝑁Nitalic_N denotes the normalization constant, 𝐥(lx,ly,lz)𝐥subscript𝑙𝑥subscript𝑙𝑦subscript𝑙𝑧\mathbf{l}\equiv(l_{x},l_{y},l_{z})bold_l ≡ ( italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) represents the Cartesian quantum numbers, and ζ𝜁\zetaitalic_ζ denotes the exponent. For the given angular momentum, llx+ly+lz𝑙subscript𝑙𝑥subscript𝑙𝑦subscript𝑙𝑧l\equiv l_{x}+l_{y}+l_{z}italic_l ≡ italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT + italic_l start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT + italic_l start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT, there are ncart=(l+1)(l+2)/2subscript𝑛cart𝑙1𝑙22n_{\mathrm{cart}}=(l+1)(l+2)/2italic_n start_POSTSUBSCRIPT roman_cart end_POSTSUBSCRIPT = ( italic_l + 1 ) ( italic_l + 2 ) / 2 Cartesian functions or nsph=2l+1subscript𝑛sph2𝑙1n_{\mathrm{sph}}=2l+1italic_n start_POSTSUBSCRIPT roman_sph end_POSTSUBSCRIPT = 2 italic_l + 1 spherical functions, the latter of which being easily be obtained via a spherical transformation of the former. The derivative of a Cartesian Gaussian function can be expressed as a linear combination of functions with higher and lower angular momentum. For instance, the x𝑥xitalic_x-derivative can be written as

xϕμlx,ly,lz=lxϕμlx1,ly,lz2ζϕμlx+1,ly,lzsubscript𝑥superscriptsubscriptitalic-ϕ𝜇subscript𝑙𝑥subscript𝑙𝑦subscript𝑙𝑧subscript𝑙𝑥superscriptsubscriptitalic-ϕ𝜇subscript𝑙𝑥1subscript𝑙𝑦subscript𝑙𝑧2𝜁superscriptsubscriptitalic-ϕ𝜇subscript𝑙𝑥1subscript𝑙𝑦subscript𝑙𝑧\displaystyle\nabla_{x}\phi_{\mu}^{l_{x},l_{y},l_{z}}=l_{x}\phi_{\mu}^{l_{x}-1% ,l_{y},l_{z}}-2\zeta\phi_{\mu}^{l_{x}+1,l_{y},l_{z}}∇ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_POSTSUPERSCRIPT = italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - 1 , italic_l start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_POSTSUPERSCRIPT - 2 italic_ζ italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT + 1 , italic_l start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_POSTSUPERSCRIPT (39)

Because the Gaussian functions are atom-centered, the ERI gradient with respect to the E𝐸Eitalic_E-th atom can be expressed as

𝐄(μAνB|λCσD)subscript𝐄conditionalsubscript𝜇𝐴subscript𝜈𝐵subscript𝜆𝐶subscript𝜎𝐷\displaystyle\nabla_{\mathbf{E}}(\mu_{A}\nu_{B}|\lambda_{C}\sigma_{D})∇ start_POSTSUBSCRIPT bold_E end_POSTSUBSCRIPT ( italic_μ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) =δEA(μAνB|λCσD)δEB(μAνB|λCσD)absentsubscript𝛿𝐸𝐴conditionalsubscriptsuperscript𝜇𝐴subscript𝜈𝐵subscript𝜆𝐶subscript𝜎𝐷subscript𝛿𝐸𝐵conditionalsubscript𝜇𝐴subscriptsuperscript𝜈𝐵subscript𝜆𝐶subscript𝜎𝐷\displaystyle=-\delta_{EA}(\mu^{\prime}_{A}\nu_{B}|\lambda_{C}\sigma_{D})-% \delta_{EB}(\mu_{A}\nu^{\prime}_{B}|\lambda_{C}\sigma_{D})= - italic_δ start_POSTSUBSCRIPT italic_E italic_A end_POSTSUBSCRIPT ( italic_μ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) - italic_δ start_POSTSUBSCRIPT italic_E italic_B end_POSTSUBSCRIPT ( italic_μ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT italic_ν start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT )
δEC(μAνB|λCσD)δED(μAνB|λCσD)subscript𝛿𝐸𝐶conditionalsubscript𝜇𝐴subscript𝜈𝐵subscriptsuperscript𝜆𝐶subscript𝜎𝐷subscript𝛿𝐸𝐷conditionalsubscript𝜇𝐴subscript𝜈𝐵subscript𝜆𝐶subscriptsuperscript𝜎𝐷\displaystyle-\delta_{EC}(\mu_{A}\nu_{B}|\lambda^{\prime}_{C}\sigma_{D})-% \delta_{ED}(\mu_{A}\nu_{B}|\lambda_{C}\sigma^{\prime}_{D})- italic_δ start_POSTSUBSCRIPT italic_E italic_C end_POSTSUBSCRIPT ( italic_μ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) - italic_δ start_POSTSUBSCRIPT italic_E italic_D end_POSTSUBSCRIPT ( italic_μ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) (40)

where the subscript denotes the atomic center, and the prime denotes the Cartesian derivative (i.e., μAϕμsubscriptsuperscript𝜇𝐴subscriptitalic-ϕ𝜇\mu^{\prime}_{A}\equiv\nabla\phi_{\mu}italic_μ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ≡ ∇ italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT). The derivatives of long-range ERIs can be defined similarly. Moreover, one of the derivative terms in eq 2.3 can be obtained at no cost by using translational invariance, E𝐄(μAνB|λCσD)=0subscript𝐸subscript𝐄conditionalsubscript𝜇𝐴subscript𝜈𝐵subscript𝜆𝐶subscript𝜎𝐷0\sum_{E}\nabla_{\mathbf{E}}(\mu_{A}\nu_{B}|\lambda_{C}\sigma_{D})=0∑ start_POSTSUBSCRIPT italic_E end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT bold_E end_POSTSUBSCRIPT ( italic_μ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) = 0;Komornicki et al. (1977) for instance, when all the centers are different, we can write

(μAνB|λCσD)conditionalsubscriptsuperscript𝜇𝐴subscript𝜈𝐵subscript𝜆𝐶subscript𝜎𝐷\displaystyle(\mu^{\prime}_{A}\nu_{B}|\lambda_{C}\sigma_{D})( italic_μ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) =(μAνB|λCσD)(μAνB|λCσD)absentconditionalsubscript𝜇𝐴subscriptsuperscript𝜈𝐵subscript𝜆𝐶subscript𝜎𝐷conditionalsubscript𝜇𝐴subscript𝜈𝐵subscriptsuperscript𝜆𝐶subscript𝜎𝐷\displaystyle=-(\mu_{A}\nu^{\prime}_{B}|\lambda_{C}\sigma_{D})-(\mu_{A}\nu_{B}% |\lambda^{\prime}_{C}\sigma_{D})= - ( italic_μ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT italic_ν start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) - ( italic_μ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT )
(μAνB|λCσD)conditionalsubscript𝜇𝐴subscript𝜈𝐵subscript𝜆𝐶subscriptsuperscript𝜎𝐷\displaystyle-(\mu_{A}\nu_{B}|\lambda_{C}\sigma^{\prime}_{D})- ( italic_μ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) (41)

This reduces the computational cost for ERI derivatives by at least one fourth if ϕμsubscriptitalic-ϕ𝜇\phi_{\mu}italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT was chosen for the highest angular momentum. With the aid of permutation symmetry in ERI, we placed the atomic orbital of the highest angular momentum at the first index, μ𝜇\muitalic_μ, in our code to directly use the above equation.

In the integral-direct algorithm, the contributions to nuclear gradients up to the four centers comprising an ERI are calculated. Focusing on the case where all the four centers are different for simplicity, the update of 𝐀Esubscript𝐀𝐸\nabla_{\mathbf{A}}E∇ start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT italic_E in eq 2.1 by the ERI derivatives with the given ϕμsubscriptitalic-ϕ𝜇\phi_{\mu}italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT, ϕνsubscriptitalic-ϕ𝜈\phi_{\nu}italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT, ϕλsubscriptitalic-ϕ𝜆\phi_{\lambda}italic_ϕ start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT and ϕσsubscriptitalic-ϕ𝜎\phi_{\sigma}italic_ϕ start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT can be expressed as

AxEsubscriptsubscript𝐴𝑥𝐸\displaystyle\nabla_{A_{x}}E∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_E +=Γμνλσ[2ζ(μA+νB|λCσD)lx(μAνB|λCσD)]italic-+=absentsubscriptΓ𝜇𝜈𝜆𝜎delimited-[]2𝜁conditionalsuperscriptsubscript𝜇𝐴subscript𝜈𝐵subscript𝜆𝐶subscript𝜎𝐷subscript𝑙𝑥conditionalsuperscriptsubscript𝜇𝐴subscript𝜈𝐵subscript𝜆𝐶subscript𝜎𝐷\displaystyle\mathrel{{+}{=}}\Gamma_{\mu\nu\lambda\sigma}\left[2\zeta(\mu_{A}^% {+}\nu_{B}|\lambda_{C}\sigma_{D})-l_{x}(\mu_{A}^{-}\nu_{B}|\lambda_{C}\sigma_{% D})\right]italic_+= roman_Γ start_POSTSUBSCRIPT italic_μ italic_ν italic_λ italic_σ end_POSTSUBSCRIPT [ 2 italic_ζ ( italic_μ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT italic_ν start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) - italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ( italic_μ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT italic_ν start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) ]
+Γμνλσlr[2ζ(μA+νB|λCσD)lrlx(μAνB|λCσD)lr]superscriptsubscriptΓ𝜇𝜈𝜆𝜎lrdelimited-[]2𝜁subscriptconditionalsuperscriptsubscript𝜇𝐴subscript𝜈𝐵subscript𝜆𝐶subscript𝜎𝐷lrsubscript𝑙𝑥subscriptconditionalsuperscriptsubscript𝜇𝐴subscript𝜈𝐵subscript𝜆𝐶subscript𝜎𝐷lr\displaystyle\,\,\,\,+\Gamma_{\mu\nu\lambda\sigma}^{\,\mathrm{lr}}\left[2\zeta% (\mu_{A}^{+}\nu_{B}|\lambda_{C}\sigma_{D})_{\mathrm{lr}}-l_{x}(\mu_{A}^{-}\nu_% {B}|\lambda_{C}\sigma_{D})_{\mathrm{lr}}\right]+ roman_Γ start_POSTSUBSCRIPT italic_μ italic_ν italic_λ italic_σ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_lr end_POSTSUPERSCRIPT [ 2 italic_ζ ( italic_μ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT italic_ν start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT roman_lr end_POSTSUBSCRIPT - italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ( italic_μ start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT italic_ν start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT | italic_λ start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT roman_lr end_POSTSUBSCRIPT ] (42)

where the superscripts +++ and -- denote the raising and lowering of Cartesian quantum numbers, i.e., (lx±1,ly,lz)plus-or-minussubscript𝑙𝑥1subscript𝑙𝑦subscript𝑙𝑧(l_{x}\pm 1,l_{y},l_{z})( italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ± 1 , italic_l start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ). AyEsubscriptsubscript𝐴𝑦𝐸\nabla_{A_{y}}E∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_E and AzEsubscriptsubscript𝐴𝑧𝐸\nabla_{A_{z}}E∇ start_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_E can be expressed in an identical manner, and similar expressions can be readily obtained for the other three centers. Generally, ϕμsubscriptitalic-ϕ𝜇\phi_{\mu}italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT is described by a set of Gaussian functions as

ϕμ(𝐫)=kKμdkφk(𝐫)subscriptitalic-ϕ𝜇𝐫superscriptsubscript𝑘subscript𝐾𝜇subscript𝑑𝑘subscript𝜑𝑘𝐫\displaystyle\phi_{\mu}(\mathbf{r})=\sum_{k}^{K_{\mu}}d_{k}\varphi_{k}(\mathbf% {r})italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ( bold_r ) = ∑ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_φ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_r ) (43)

where Kμsubscript𝐾𝜇K_{\mu}italic_K start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT denotes the degree of contraction, and dksubscript𝑑𝑘d_{k}italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT denotes the coefficient. Therefore, for a given ERI (μν|λσ)conditional𝜇𝜈𝜆𝜎(\mu\nu|\lambda\sigma)( italic_μ italic_ν | italic_λ italic_σ ), nested loops over each contraction occur with a total loop-count of KμKνKλKσsubscript𝐾𝜇subscript𝐾𝜈subscript𝐾𝜆subscript𝐾𝜎K_{\mu}{\cdot}K_{\nu}{\cdot}K_{\lambda}{\cdot}K_{\sigma}italic_K start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ⋅ italic_K start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT ⋅ italic_K start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT ⋅ italic_K start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT, each evaluating a batch of ERI derivatives with a dimension of 3×nμ×nν×nλ×nσ3subscript𝑛𝜇subscript𝑛𝜈subscript𝑛𝜆subscript𝑛𝜎3\times n_{\mu}\times n_{\nu}\times n_{\lambda}\times n_{\sigma}3 × italic_n start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT × italic_n start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT × italic_n start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT × italic_n start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT in the code where n𝑛nitalic_n denotes ncartsubscript𝑛cartn_{\mathrm{cart}}italic_n start_POSTSUBSCRIPT roman_cart end_POSTSUBSCRIPT or nsphsubscript𝑛sphn_{\mathrm{sph}}italic_n start_POSTSUBSCRIPT roman_sph end_POSTSUBSCRIPT.

To simultaneously achieve concurrency of GPU kernels and load-balancing within the distributed GPUs while minimizing the thread divergence, we have extended the multi-GPU ERI algorithm described in our previous report.Kim et al. (2023) The list of non-zero shell-pairs is presorted according to the degree of contractions, K𝐾Kitalic_K, and the Schwarz upper-bound such that when two lists are combined to form an ERI supermatrix, each thread-blocks of optimal dimension will likely exhibit identical loop-counts for maximum thread concurrency within the GPU kernel. Block-cyclic distributions of the thread-blocks are used to achieve the load-balance between distributed GPUs. The non-redundant ERI types according to the sequence of the angular momenta in indices are computed sequentially to mitigate the thread divergence. For instance, there are a total of 21 GPU kernels each representing a specific ERI type for the atomic orbitals with l2𝑙2l\leq 2italic_l ≤ 2.

The GPU implementation of eq 2.3 is shown in Figure 1. For all ERI calculations, we employed the Rys quadrature methodDupuis et al. (1976) since it requires the least memory footprint compared to other algorithms based on the recursion formula. The Rys roots and weights that correspond to the higher angular momentum ERI are reused for the ERI with lower angular momenta. Thus, the number of Rys roots is given by nrys=(lμ+lν+lλ+lσ+1)/2+1subscript𝑛ryssubscript𝑙𝜇subscript𝑙𝜈subscript𝑙𝜆subscript𝑙𝜎121n_{\mathrm{rys}}=(l_{\mu}+l_{\nu}+l_{\lambda}+l_{\sigma}+1)/2+1italic_n start_POSTSUBSCRIPT roman_rys end_POSTSUBSCRIPT = ( italic_l start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT + italic_l start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT + italic_l start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT + italic_l start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT + 1 ) / 2 + 1, which is used for (μν+|λσ)conditional𝜇superscript𝜈𝜆𝜎(\mu\nu^{+}|\lambda\sigma)( italic_μ italic_ν start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT | italic_λ italic_σ ), (μν|λ+σ)conditional𝜇𝜈superscript𝜆𝜎(\mu\nu|\lambda^{+}\sigma)( italic_μ italic_ν | italic_λ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT italic_σ ), (μν|λσ+)conditional𝜇𝜈𝜆superscript𝜎(\mu\nu|\lambda\sigma^{+})( italic_μ italic_ν | italic_λ italic_σ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ), (μν|λσ)conditional𝜇superscript𝜈𝜆𝜎(\mu\nu^{-}|\lambda\sigma)( italic_μ italic_ν start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT | italic_λ italic_σ ), (μν|λσ)conditional𝜇𝜈superscript𝜆𝜎(\mu\nu|\lambda^{-}\sigma)( italic_μ italic_ν | italic_λ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT italic_σ ) and (μν|λσ)conditional𝜇𝜈𝜆superscript𝜎(\mu\nu|\lambda\sigma^{-})( italic_μ italic_ν | italic_λ italic_σ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ).

Refer to caption
Figure 1: Sketch of the algorithm used to compute the ERI derivative on GPU.

While it is technically possible to reuse the ERI routines for higher angular momenta to calculate the ERI derivatives, such an implementation would introduce complexity in exploiting the 8-fold permutation symmetry as well as utilizing the translational invariance, which can effectively remove the most computationally intensive derivatives. Therefore, we have explicitly constructed independent GPU kernels for the ERI derivatives.

2.4 Numerical Integration of XC Derivatives on GPUs

The XC functionals and their n𝑛nitalic_n-th order derivatives can be calculated by numerical integration over well-defined quadrature grids as

I=qwqf(n)(𝐫q)𝐼subscript𝑞subscript𝑤𝑞superscript𝑓𝑛subscript𝐫𝑞\displaystyle I=\sum_{q}w_{q}f^{\,(n)}(\mathbf{r}_{q})italic_I = ∑ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT italic_f start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ( bold_r start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) (44)

where wqsubscript𝑤𝑞w_{q}italic_w start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT denotes the grid weight at point 𝐫qsubscript𝐫𝑞\mathbf{r}_{q}bold_r start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT. In the case of polyatomic systems, atom-centered spherical grids can be constructed by combining a radial grid and an angular grid. We employed the Euler–Maclaurin gridMurray et al. (1993) for the radial partitioning and the Lebedev gridLebedev (1977) for the angular partitioning. The numerical stability of discrete summation can be maintained by Becke’s space partitioning method,Becke (1988) in which weights are obtained via the fuzzy cell construction.

For the given quadrature point at 𝐫qsubscript𝐫𝑞\mathbf{r}_{q}bold_r start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT, the updating expression for the nuclear gradient by the XC contribution in eq 2.1, can be written as

𝐀Esubscript𝐀𝐸\displaystyle\nabla_{\mathbf{A}}E∇ start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT italic_E +=2wqμν[Pμν(f1(1)𝐱+12f2(1)𝐲)+2Rμν(f2(1)𝐳\displaystyle\mathrel{{+}{=}}-2w_{q}\sideset{}{{}^{\prime}}{\sum}_{\mu}\sum_{% \nu}\bigg{[}P_{\mu\nu}\Big{(}f^{\,(1)}_{1}\mathbf{x}+\frac{1}{2}f^{\,(1)}_{2}% \mathbf{y}\Big{)}+2R_{\mu\nu}\Big{(}f^{\,(1)}_{2}\mathbf{z}italic_+= - 2 italic_w start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT SUPERSCRIPTOP start_ARG ∑ end_ARG ′ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT [ italic_P start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT ( italic_f start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT bold_x + divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_f start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT bold_y ) + 2 italic_R start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT ( italic_f start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT bold_z
+f1(2)g𝐱+f2(2)(g𝐲+(𝐡ρ)𝐱)+f3(2)(𝐡ρ)𝐲)\displaystyle+f^{\,(2)}_{1}{g}\mathbf{x}+f^{\,(2)}_{2}\big{(}g\mathbf{y}+({% \mathbf{h}}\cdot\nabla\rho)\mathbf{x}\big{)}+f^{\,(2)}_{3}({\mathbf{h}}\cdot% \nabla\rho)\mathbf{y}\Big{)}+ italic_f start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_g bold_x + italic_f start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_g bold_y + ( bold_h ⋅ ∇ italic_ρ ) bold_x ) + italic_f start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ( bold_h ⋅ ∇ italic_ρ ) bold_y )
+12Pμν 0(f2(1)𝐳Δ+f1(2)gΔ𝐱+f2(2)(gΔ𝐲+(𝐡Δρ)𝐱\displaystyle+\frac{1}{2}P^{\,0}_{\mu\nu}\Big{(}f^{\,(1)}_{2}\mathbf{z}_{% \Delta}+f^{\,(2)}_{1}g_{\Delta}\mathbf{x}+f^{\,(2)}_{2}\big{(}g_{\Delta}% \mathbf{y}+(\mathbf{h}_{\Delta}\cdot\nabla\rho)\mathbf{x}+ divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_P start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT ( italic_f start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT bold_z start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT + italic_f start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT bold_x + italic_f start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_g start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT bold_y + ( bold_h start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT ⋅ ∇ italic_ρ ) bold_x
+2𝐡2𝐱+4g𝐳)+f(2)3(2𝐡2𝐲+(𝐡Δ)ρ𝐲+4(𝐡ρ)𝐳)\displaystyle+2\mathbf{h}^{2}\mathbf{x}+4g\mathbf{z}\big{)}+f^{\,(2)}_{3}\big{% (}2\mathbf{h}^{2}\mathbf{y}+(\mathbf{h}_{\Delta}\cdot\nabla)\rho\mathbf{y}+4(% \mathbf{h}\cdot\nabla\rho)\mathbf{z}\big{)}+ 2 bold_h start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_x + 4 italic_g bold_z ) + italic_f start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ( 2 bold_h start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_y + ( bold_h start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT ⋅ ∇ ) italic_ρ bold_y + 4 ( bold_h ⋅ ∇ italic_ρ ) bold_z )
+f1(3)g2𝐱+f2(3)(g2𝐲+2g(𝐡ρ)𝐱)+f3(3)((𝐡ρ)2𝐱\displaystyle+f^{\,(3)}_{1}g^{2}\mathbf{x}+f^{\,(3)}_{2}\big{(}g^{2}\mathbf{y}% +2g(\mathbf{h}\cdot\nabla\rho)\mathbf{x}\big{)}+f^{\,(3)}_{3}\big{(}(\mathbf{h% }\cdot\nabla\rho)^{2}\mathbf{x}+ italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_g start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_x + italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_g start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_y + 2 italic_g ( bold_h ⋅ ∇ italic_ρ ) bold_x ) + italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ( ( bold_h ⋅ ∇ italic_ρ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_x
+2g(𝐡ρ)𝐲)+f4(3)(𝐡ρ)2𝐲)]\displaystyle+2g(\mathbf{h}\cdot\nabla\rho)\mathbf{y}\big{)}+f^{\,(3)}_{4}(% \mathbf{h}\cdot\nabla\rho)^{2}\mathbf{y}\Big{)}\bigg{]}+ 2 italic_g ( bold_h ⋅ ∇ italic_ρ ) bold_y ) + italic_f start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ( bold_h ⋅ ∇ italic_ρ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_y ) ] (45)

where the primed sum (ΣsuperscriptΣ\Sigma^{\prime}roman_Σ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT) indicates that the sum runs over the atomic orbitals centered on A𝐴Aitalic_A. The functional derivatives are defined as

f1(1)subscriptsuperscript𝑓11\displaystyle f^{\,(1)}_{1}italic_f start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT =fραabsent𝑓subscript𝜌𝛼\displaystyle=\frac{\partial f}{\partial\rho_{\alpha}}= divide start_ARG ∂ italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT end_ARG (46)
f2(1)subscriptsuperscript𝑓12\displaystyle f^{\,(1)}_{2}italic_f start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT =2fγαα+fγαβabsent2𝑓subscript𝛾𝛼𝛼𝑓subscript𝛾𝛼𝛽\displaystyle=2\frac{\partial f}{\partial\gamma_{\alpha\alpha}}+\frac{\partial f% }{\partial\gamma_{\alpha\beta}}= 2 divide start_ARG ∂ italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_α end_POSTSUBSCRIPT end_ARG + divide start_ARG ∂ italic_f end_ARG start_ARG ∂ italic_γ start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT end_ARG (47)
f1(2)subscriptsuperscript𝑓21\displaystyle f^{\,(2)}_{1}italic_f start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT =2fρα2+2fραρβabsentsuperscript2𝑓superscriptsubscript𝜌𝛼2superscript2𝑓subscript𝜌𝛼subscript𝜌𝛽\displaystyle=\frac{\partial^{2}f}{\partial\rho_{\alpha}^{2}}+\frac{\partial^{% 2}f}{\partial\rho_{\alpha}\partial\rho_{\beta}}= divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f end_ARG start_ARG ∂ italic_ρ start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ∂ italic_ρ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT end_ARG (48)

The intermediates contracting the transition density matrix are defined as

gΔsubscript𝑔Δ\displaystyle g_{\Delta}italic_g start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT =μνPμνΔϕμϕνabsentsubscript𝜇𝜈subscriptsuperscript𝑃Δ𝜇𝜈subscriptitalic-ϕ𝜇subscriptitalic-ϕ𝜈\displaystyle=\sum_{\mu\nu}{P}^{\,\Delta}_{\mu\nu}\phi_{\mu}\phi_{\nu}= ∑ start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT italic_P start_POSTSUPERSCRIPT roman_Δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT (49)
𝐡Δsubscript𝐡Δ\displaystyle\mathbf{h}_{\Delta}bold_h start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT =μνPμνΔ(ϕμϕν)absentsubscript𝜇𝜈subscriptsuperscript𝑃Δ𝜇𝜈subscriptitalic-ϕ𝜇subscriptitalic-ϕ𝜈\displaystyle=\sum_{\mu\nu}{P}^{\,\Delta}_{\mu\nu}\nabla(\phi_{\mu}\phi_{\nu})= ∑ start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT italic_P start_POSTSUPERSCRIPT roman_Δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_μ italic_ν end_POSTSUBSCRIPT ∇ ( italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT ) (50)

The recurring vectors are defined as

𝐱𝐱\displaystyle\mathbf{x}bold_x =ϕμϕνabsentsubscriptitalic-ϕ𝜇subscriptitalic-ϕ𝜈\displaystyle=\nabla\phi_{\mu}\cdot\phi_{\nu}= ∇ italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ⋅ italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT (51)
𝐲𝐲\displaystyle\mathbf{y}bold_y =[ϕν(ϕμ)T+ϕμ(ϕν)T]ρ\displaystyle=[\phi_{\nu}\nabla(\nabla\phi_{\mu})^{T}+\nabla\phi_{\mu}(\nabla% \phi_{\nu})^{T}]\nabla\rho= [ italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT ∇ ( ∇ italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + ∇ italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ( ∇ italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] ∇ italic_ρ (52)
𝐳𝐳\displaystyle\mathbf{z}bold_z =[ϕν(ϕμ)T+ϕμ(ϕν)T]𝐡\displaystyle=[\phi_{\nu}\nabla(\nabla\phi_{\mu})^{T}+\nabla\phi_{\mu}(\nabla% \phi_{\nu})^{T}]{\mathbf{h}}= [ italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT ∇ ( ∇ italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + ∇ italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ( ∇ italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] bold_h (53)
𝐳Δsubscript𝐳Δ\displaystyle\mathbf{z}_{\Delta}bold_z start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT =[ϕν(ϕμ)T+ϕμ(ϕν)T]𝐡Δ\displaystyle=[\phi_{\nu}\nabla(\nabla\phi_{\mu})^{T}+\nabla\phi_{\mu}(\nabla% \phi_{\nu})^{T}]\mathbf{h}_{\Delta}= [ italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT ∇ ( ∇ italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + ∇ italic_ϕ start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ( ∇ italic_ϕ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] bold_h start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT (54)

We note that the weight derivative has been omitted as it is often negligible in the case of nuclear gradients when sufficiently fine grids are used.

Refer to caption
Figure 2: Sketch of the algorithm used to compute the XC derivative on GPU

Figure 2 shows the GPU implementation of the XC nuclear gradients. Practically, owing to the locality of Gaussian functions only the atomic orbitals in the vicinity of a quadrature point, which can be predetermined by a suitable gauging, are relevant in computing eq 2.4. In other words, a sufficiently small cluster of quadrature points will share most of these atomic orbitals. Therefore, within GPU kernel, concurrency can be easily achieved by progressively dividing the entire grid into octants until all the points inside any octant can be assigned to a one-dimensional thread-block. By predetermining the atomic orbitals that contribute to the the thread-block (equivalently, a box containing the grid points in the vicinity), we not only eliminate running the entire list of atomic orbitals, but more importantly the threads can loop over the same list of significant atomic orbitals, maximizing the concurrency.

Computational cost can be reduced by precomputing atomic orbital-independent intermediates for each 𝐫qsubscript𝐫𝑞\mathbf{r}_{q}bold_r start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT, g𝑔{g}italic_g (eq 21) and 𝐡𝐡{\mathbf{h}}bold_h (eq 22) contracting 𝐑𝐑\mathbf{R}bold_R, and gΔsubscript𝑔Δg_{\Delta}italic_g start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT (eq 49) and 𝐡Δsubscript𝐡Δ\mathbf{h}_{\Delta}bold_h start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT (eq 50) contracting 𝐏Δsubscript𝐏Δ{\mathbf{P}}_{\Delta}bold_P start_POSTSUBSCRIPT roman_Δ end_POSTSUBSCRIPT. From the TDDFT energy calculations, ρ𝜌\rhoitalic_ρ and ρ𝜌\nabla\rho∇ italic_ρ for all 𝐫qsubscript𝐫𝑞\mathbf{r}_{q}bold_r start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT can be stored and reused. Therefore, the precomputed quantities require a total of 12nqsubscript𝑛𝑞n_{q}italic_n start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT storage, where nqnatomnradnangsubscript𝑛𝑞subscript𝑛atomsubscript𝑛radsubscript𝑛angn_{q}\equiv n_{\mathrm{atom}}{\cdot}n_{\mathrm{rad}}{\cdot}n_{\mathrm{ang}}italic_n start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ≡ italic_n start_POSTSUBSCRIPT roman_atom end_POSTSUBSCRIPT ⋅ italic_n start_POSTSUBSCRIPT roman_rad end_POSTSUBSCRIPT ⋅ italic_n start_POSTSUBSCRIPT roman_ang end_POSTSUBSCRIPT is the number of quadrature points without grid pruning. We provide selected cases in Table S2, affirming that the current GPU memory size is sufficient even for a very large molecular system.

2.5 Implementation

The GPU algorithm for the TDDFT nuclear gradient, as presented in the previous sections, was implemented in an in-house code. Our computational framework is based on core parts of the KPACK code,Kim and Lee (2013) a two-component relativistic quantum chemistry program, and has been completely rewritten in Fortran 2003/2008 to accommodate the modern object-oriented features. The GPU code has been written in CUDA FortranCUDA Fortran Programming Guide, Nvidia using the extensions to the Fortran language with several keywords and APIs to offload computations to GPUs. To ensure efficient GPU utilization with XC functionals, the C code of selected functionals from LibXC 5.2.3Lehtola et al. (2018) has been converted into CUDA Fortran and integrated with our code. We validated our implementation by comparing analytic gradients with numerical gradients obtained from the finite difference method.

To design an efficient parallel model for our distributed, high-performance GPU systems, as outlined in our previous report,Kim et al. (2023) we adopted a hybrid MPI/OpenMP model. We leveraged the message-passing interface (MPI) using OpenMPI 4.1.5 with the multithreaded Unified communication-X (UCX) 1.12.1 for both point-to-point and one-sided communications between the nodes. On each node, CPU-to-GPU bindings were achieved using OpenMP parallel interfaces. Additionally, we utilized the single-node multi-GPU linear algebra libraries of Nvidia’s cuBlasXt 11.11 and cuSolverMg 11.4 for matrix multiplications and diagonalizations, respectively. The Nvidia HPC SDK 22.11 compiler suite with the “-fast” optimization flag was used for code compilation.

Our code employs a symmetric orthonormalization procedure with a threshold of 106superscript10610^{-6}10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT au for defining orthonormal atomic orbitals. The SCF convergence threshold was set as the root-mean-squared (rms) difference of 106superscript10610^{-6}10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT au between the density matrices of two consecutive SCF iterations. A density-multiplied Schwarz upper bound of 1010superscript101010^{-10}10 start_POSTSUPERSCRIPT - 10 end_POSTSUPERSCRIPT au was used as the threshold for screening ERIs and their derivatives. In numerical integration, a density threshold of 1010superscript101010^{-10}10 start_POSTSUPERSCRIPT - 10 end_POSTSUPERSCRIPT au for the grid point was applied. Throughout this work, we used a (50, 194)-grid, combining a 50-point radial grid with a 194-point angular grid per atom. A residual threshold of 105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT au was used in both Davidson diagonalization and Z-vector iteration. Currently, all one-electron integrals and their derivatives are computed using CPUs over the hybrid OpenMP/MPI model.

3 Application

\begin{overpic}[height=170.71652pt,trim=170.71652pt 85.35826pt 142.26378pt 71.% 13188pt]{gfp.pov} \put(74.0,32.0){ \includegraphics[height=68.28644pt,trim=128.0374pt 0.0pt 0.0pt 0.0pt,clip]{gfp% } } \put(101.0,34.0){\normalsize HBI} \end{overpic}
Figure 3: Green fluorescent protein (GFP) in the presence of water solvent molecules. The HBI chromophore is shown inside the protein cage, which is visually aided with ribbon representation.

In this section, we showcase the capability and performance of our TDDFT implementation by calculating the excited-state nuclear gradient of a very large molecular system, green fluoresecent protein (GFP). GFP is a naturally occurring biological protein extracted from Aequorea victoria, and has become a commonly used tool in molecular and cell biology for labeling and visualizing proteins in living cells.Zimmer (2002)

As shown in Figure 3, within the rigid 11-stranded β𝛽\betaitalic_β-barrel comprising 238 amino acids, the p-hydroxybenzylidene-imidazolinone (HBI) chromophore converts blue light to green light. The neutral and anionic forms of the chromphore are responsible for the light absorption in the ultraviolet and blue regions, respectively. The HBI chromophore consists of an imidazolone and a phenol moiety connected via a methine bridge.

In previous theoretical studies of GFP, to characterize the relevant excited-state, researchers utilized quantum chemical methods on the chromophore independently or with electrostatic/polarizable embedding of the surrounding environment via molecular mechanics by the so-called QM/MM model.Laino et al. (2004); Epifanovsky et al. (2009); Nåbo et al. (2017) Of course, quantum chemical methods should outperform the empirical force field when compared on an equal footing,Kulik et al. (2012) and their usage could potentially benefit as distant residues can also affect the excitation energies and properties of the chromophore.Schwabe et al. (2015) However, increasing the QM region is nontrivial, requiring chemical intuitions of domain experts, and tends to approach slowly to the accuracy of the large QM asymptotic limit.Kulik et al. (2016)

Moreover, when applied to large systems, QM methods can experience challenges associated with the specific model chemistry. TDDFT calculations with the pure or global hybrid functionals often encounter low-lying spurious charge-transfer states stemming from an unphysically small band gap, an artifact of the underestimated exchange energy at long distance,Rudberg (2012) and partially from unequilibrated electrostatic potential on surfaces.Lever et al. (2013) Using RSH functionals with correct long-distance behavior described by 100% HF exchange is suitable for large systems, as well as including solvent effect via implicit solvation models or explicit solvents, providing an effective solution for reliable results from large-scale TDDFT calculation.

We employed the geometry of GFP prepared in ref 85 from the X-ray diffraction structure (PDB ID: 1W7S)Van Thor et al. (2005) with minor refinements to a few misplaced atoms.Kim et al. (2023) This system represents a dense three-dimensional structure (Figure 3). For calculations, the ω𝜔\omegaitalic_ωB97X RSH GGA functionalChai and Head-Gordon (2008) was chosen, as it has demonstrated to provide reliable results over a wide range of interactions.Mardirossian and Head-Gordon (2017) The default parameters of ω𝜔\omegaitalic_ωB97X were used: α=0.158𝛼0.158\alpha=0.158italic_α = 0.158, β=0.842𝛽0.842\beta=0.842italic_β = 0.842, ω=0.3𝜔0.3\omega=0.3italic_ω = 0.3 bohr-1. Explicit inclusion of water solvent molecules and utilization of the double-zeta plus polarization basis set, def2-SVP,Weigend and Ahlrichs (2005) afforded a more realistic model. The total numbers of atoms and basis functions were 4353 and 40 518, respectively, representing the largest attempt to compute the nuclear gradient of an excited state at the TDDFT level to our knowledge.

3.1 Excited State of Green Fluorescent Protein

Table 1 presents the calculated results of the full-scale GFP system at the TDA-TDDFT/ω𝜔\omegaitalic_ωB97X level with def2-SVP basis sets, alongside QM/MM results using the ONIOM model,Svensson et al. (1996) aiming to unveil the long-distance quantum mechanical effect in the excited state.

Table 1: Calculated results of GFP at the TDA-TDDFT level
HOMO / LUMO Band gap S1 TDM
method (eV) (eV) (eV)a (debye)b
ONIOMc --3.21 / 4.20 7.41 3.80 7.74
ω𝜔\omegaitalic_ωB97X --3.80 / 2.78 6.58 3.62 7.07

a Vertical excitation energy. b Norm of transition dipole moment vector. c The QM region containing the HBI chromophore (43 atoms) was treated by the ω𝜔\omegaitalic_ωB97X level, while the Amber force fields (ref 91) were used to describe the MM region. The QM/MM calculations were performed using the Gaussian 16.C.01 program (ref 52).

The orbital energies of the two approaches exhibit significant differences, which can be attributed to the full-scale calculation representing the extended system. What is more important is that the band gap of ω𝜔\omegaitalic_ωB97X remains sufficiently large, rectifying the band gap underestimation observed with global hybrid functionals. In a previous study, we obtained the band gap of 1.88 eV with B3LYP in the full-scale calculation, with computationally accessible low-lying singlet states identified as spurious charge-transfer states between the chromophore and a distant residue.Kim et al. (2023) The S1 state from TDA-TDDFT yielded a vertial excitation at 3.62 eV consistent with experimental observations, where the broadband absorption by the neutral chromophore ranges from 330 to 450 nm (2.76–3.75 eV).Heim et al. (1994) The QM/MM calculation predicted the same excitation to be 0.18 eV higher.

Refer to caption
Figure 4: Hole and particle wave functions as represented by the natural transition orbitals with the largest weight, shown in parentheses, from the ω𝜔\omegaitalic_ωB97X calculation.

The origin of the difference in the S1 energies of the full-scale and QM/MM calculations in this particular case can be inferred from the natural transition orbitals (NTOs).Martin (2003) Figure 4 shows the NTOs upon S1 excitation, and for visual clarity, the chromophore and a neighboring residue are depicted. Hole and particle wave functions of the S1 state clearly indicate that the S1 state can be largely described as a local π𝜋\piitalic_ππsuperscript𝜋\pi^{\ast}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT excitation within the chromophore. Interestingly, a small participation of charge-transfer from the π𝜋\piitalic_π-density of a nearby residue can be seen, and presumably this has stabilized the S1 state, leading to the transition dipole moment calculated at the ω𝜔\omegaitalic_ωB97X level that is noticeably smaller than that of the QM/MM calculation (Table 1). It is unclear how such excited-state details would affect the actual dynamics, and the full implications remains to be determined, but key long-range interactions that necessitate large-scale calculations can be captured in this approach.

Refer to caption
Figure 5: S1–S0 difference gradient (ω𝜔\omegaitalic_ωB97X, magenta; ONIOM, cyan) with the arrows pointing to the negative sides. Vectors with norm greater than 102superscript10210^{-2}10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT au are shown.

As excited-state dynamics to the first-order is governed by the nuclear forces on the potential energy surface, we visualized in Figure 5 the extracted S1 nuclear gradient by computing the difference between excited- and ground-state gradients. Since the employed geometry has not been optimized, after subtracting the ground-state portion from the total gradient (eq 2), the resulting gradient reflects a change in the potential energy surface upon the electronic excitation.

A noticeable difference between the nuclear gradients from ω𝜔\omegaitalic_ωB97X and ONIOM calculations is observed on the carbon atom on the imidazolone connected to the methine bridge, suggesting the distant residue effect. Indeed, similar visualizations with greater details showed non-negligible participation of distant residues, supporting the notion that excited-state dynamics can encompass the entire protein (Figure S1). Not unexpectedly, the electronic embedding in the ONIOM calculation also captured the nonvanishing S1–S0 difference gradient outside the chromophore, but differing significantly from the ω𝜔\omegaitalic_ωB97X result (Figure S2). Focusing on relatively larger forces (i.e., the negative gradient), centered on the chromophore, the directions closely follow the qualitative interpretation of the NTOs, wherein the bonding-to-anti-bonding character change is reflected (Figure 4). For instance, the π𝜋\piitalic_π-bonding orbital between the imidazolone and the methine bridge becomes an anti-bonding orbital with depleted electron density, suggesting a decrease in bond order and a bond elongation. Fundamentally, the excited-state dynamics is governed by the total nuclear gradient encompassing the ground-state component, which has been handily ignored in our interpretation.

3.2 Multi-GPU Parallel Performance

Refer to caption
Figure 6: Speed-up (black) and parallel efficiency (gray) of TDDFT gradient. The diagonal line indicates the ideal speed-up. Parallel efficiency of TDDFT energy calculation, reproduced from ref 42, is also plotted.

Figure 6 shows the speed-up and parallel efficiency as a function of the number of GPUs for the TDDFT gradient calculation of the GFP system. The parallel efficiency was measured by the ratio with respect to the 1-GPU wall time while doubling the number of GPUs. Excellent parallel efficiency of >>>95% is observed within a single node comprising 8 GPUs. Parallel efficiencies gradually decrease to 31% at 256 GPUs, and these numbers are very close to the parallel efficiency of the TDDFT energy calculation in our previous report,Kim et al. (2023) suggesting that the same number of GPUs can be used for both energy and gradient calculations without incurring computational imbalance. The relative timings of our implementation of TDDFT gradient with the a priori DFT and TDDFT energy calculations were 1:1.3:3.1:11.3:3.11:1.3:3.11 : 1.3 : 3.1, respectively, further assuring that all the calculations yield comparable timings in the same order of magnitude (Table S3). In our largest computation with 256 A100 GPUs, the TDDFT gradient calculation took a mere 1.2 h of time, empowering practical applications of TDDFT for very-large systems.

Table 2: Timings for the S1 gradient calculation of GFPa
nGPUb total (s)c Z-vector (s) \nablaERI (s) \nablaXC (s) ERIXCERIXC\frac{\nabla{\text{ERI}}}{\nabla{\text{XC}}}divide start_ARG ∇ ERI end_ARG start_ARG ∇ XC end_ARG
1 343 643 (100.0%) 167 503 112 421 1 453 77.4
2 172 857 (99.4%) 84 239 56 430 739 76.4
4 87 064 (98.7%) 42 474 28 243 384 73.6
8 44 457 (96.6%) 21 667 14 207 225 63.1
16 23 580 (91.1%) 11 719 7 191 133 54.2
32 12 894 (83.3%) 6 571 3 684 80 46.2
64 7 484 (71.7%) 3 895 1 935 59 33.0
128 4 903 (54.8%) 2 716 1 065 41 26.0
256 4 263 (31.5%) 2 360 696 49 14.1

a Measured at the TDA-TDDFT level with ω𝜔\omegaitalic_ωB97X functional and def2-SVP basis set. Calculations were restarted with the converged molecular orbitals and excitation amplitudes. b Nvidia A100-80GB GPUs were used. c Parallel efficiency with respect to 1-GPU utilization is given in parentheses.

Table 2 provides a detailed timing analysis of wall time by decomposing it into individual components. The ERI gradient exhibited an excellent performance, while the performance of the XC gradient reached a saturation point at approximately 128 GPUs. Furthermore, at 1 GPU, the ratio of workloads between ERI gradient to XC gradient is 77.4, and it decreases to 14.1 at 256 GPUs. In fact, with 256 GPUs, out of overall 49 s, the XC gradient kernel merely takes 8.6 s (2.3 s for intermediates, 6.3 s for the gradient), implying that most of the computational cost is allotted to the GPU preparation. In other words, the XC gradient quickly reaches a plateau due to the relatively short execution time of the GPU kernel, and almost all times are spent on data management between devices, leading to strong scaling stagnation almost immediately. Better data management is thus important for further code optimization. On the other hand, the ERI gradient has the most workloads in the entire gradient calculation, and our load balancing scheme showed better scalability. Clearly, the Z-vector portion remains approximately 50% for all number of GPUs considered in this work, rendering it the most time-consuming process. The Z-vector iteration can be considered as the direct matrix-vector product formation (eq 33) in the Davidson diagonalization for TDDFT energies.Bauernschmitt and Ahlrichs (1996); Leininger et al. (2001)

Refer to caption
Figure 7: Convergence profile of Z-vector iteration. Horizontal dashed line indicates the convergence threshold.

Figure 7 shows the convergence profile of the Z-vector iterations for solving the coupled-perturbed equations during the gradient evaluation. Convergence was attained in 10 iterations, during which the residual norm decreased exponentially. The norm of the additional subspace vector 𝐛ksubscript𝐛𝑘\mathbf{b}_{k}bold_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT also becomes exponentially smaller as the iterations proceeded. As an undesirable side effect, the product of the density matrix 𝐁ksubscript𝐁𝑘\mathbf{B}_{k}bold_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT (in eq 2.2) and the Schwarz upper bound of ERI, which is used as the screening parameter, unintentionally neglect important ERI calculations in practice, potentially hampering convergence. To mitigate such numerical instability at the expense of losing the benefit of efficent density screening, we normalized 𝐛ksubscript𝐛𝑘\mathbf{b}_{k}bold_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT before the Fock build in eq 33, after which it was back-converted to the original norm. We suggest that a dynamical scheme to adjust the norm by comparing with previous iterations can be helpful in reinstating density screening to accelerate the Z-vector iteration.

Refer to caption
Figure 8: Wall-times of GPU kernels for ERI derivatives measured with 1 GPU. ERIs are sorted in order of increasing the angular momentum. The cumulative fraction of time on each kernel is shown in gray.

Figure 8 shows the time taken by all GPU kernels for ERI derivatives grouped into three clusters based on the highest angular momentum before the differentiation. Although approximately 45% of the cost is attributed to the computation of certain derivatives, (pp|ps)conditional𝑝𝑝𝑝𝑠(pp|ps)( italic_p italic_p | italic_p italic_s ), (dd|ps)conditional𝑑𝑑𝑝𝑠(dd|ps)( italic_d italic_d | italic_p italic_s ) and (dp|pp)conditional𝑑𝑝𝑝𝑝(dp|pp)( italic_d italic_p | italic_p italic_p ) in this particular case of the excited-state of GFP, this dependence is not only determined by how the basis functions are defined in terms of the number of Gaussian functions in the given angular momentum and their contraction details, which determine the loop-counts in the code, but also by the nature of the excited state. The latter determines the pattern of nonzero elements in the density matrix used to screen negligble ERIs prior to computation, which is also an important factor. Interestingly, adding a single d𝑑ditalic_d-polarization function to heavy atoms as has been done in the employed def2-SVP basis set, increased the total wall time for ERI derivatives by 2.4-fold. This represents, in some respect, the necessary cost towards more realistic calculations.

Refer to caption
Figure 9: Roofline analysis of ERI derivatives. The peak performance was controlled by the employed profiler, Nsight Compute. Performance of ERI kernels, reproduced from ref 42, are also plotted with gray circles.

To provide deeper insights into the fundamental performance of these GPU kernels, we performed a roofline analysis,Williams et al. (2009) as depicted in Figure 9. The GPU kernels for ERI derivatives are distributed across the memory- and compute-bound regions, which are bounded by the memory bandwidth and the peak performance, Rpeaksubscript𝑅peakR_{\mathrm{peak}}italic_R start_POSTSUBSCRIPT roman_peak end_POSTSUBSCRIPT. As the involved angular momenta increase, denoted by their sum in the figure, the kernels are shifted from the compute-bound to the memory-bound regions.

As seen in the compute-bound region in Figure 9, the GPU kernels of ERI derivatives utilize approximately 15% of Rpeaksubscript𝑅peakR_{\mathrm{peak}}italic_R start_POSTSUBSCRIPT roman_peak end_POSTSUBSCRIPT, in double precision (FP64). It is worth nothing that Rpeaksubscript𝑅peakR_{\mathrm{peak}}italic_R start_POSTSUBSCRIPT roman_peak end_POSTSUBSCRIPT assumes full utilization of fused-multiply-add operation, where a multiplication and an addition are performed simultaneously, and hence, we anticipate that further speed gains could be achieved with more sophisticated code optimizations. When compared to the ERI kernels, represented by gray circles, all the kernels for ERI derivatives clearly shift towards the memory-bound region because the operational intensity is dominated by building larger intermediate integrals for the ERIs with a higher angular momentum, which involves increased memory access. The ERIs with higher angular momenta are mostly memory-bound for the same reason.

4 Conclusions

We presented a multi-GPU implementation of analytical nuclear gradient of TDA-TDDFT using Gaussian-type basis sets with RSH functionals, with a focus on computing large-scale molecular systems. We revisited the underlying theory and algorithms to transform the integral direct procedure into an efficiently computable form in a massively parallel GPU computing environment. We demonstrated full-scale calculations of the excited-state nuclear gradient of a realistic protein system consisting of 4353 atoms at the ω𝜔\omegaitalic_ωB97X/def2-SVP level of theory. Achieving the parallel efficiency of >>>70% with up to 64 GPUs and 31% with 256 GPUs, the parallel efficiency of the nuclear gradient calculations was comparable to the excited-state energy counterpart, showcasing a well-balanced usage of a distributed resource up to 256 GPUs with 2.5 peta-FLOPS of raw computing power.

Heterogeneous computing is becoming increasingly prevalent as GPU-accelerated systems dominate the supercomputing landscape.Atchley et al. (2023); Chang et al. (2024) Quantum chemistry codes that fully exploit the state-of-the-art hardwares, as discussed in this work, will be extremely helpful in efficiently exploring the electronic structure of excited states for very large molecular systems. This will offer an exciting simulation platform for advancements in chemical and materials science.

Acknowledgements

Part of this work was carried out during the stay of I.K. at MIT as a visiting researcher. T.V.V. and Y.M.R. acknowledge Samsung Electronics for financial support (IO221227-04385-01 and IO211126-09175-01). The authors thank Dr. Xin Li and Prof. Patrick Norman at KTH for helpful discussions. I.K. thanks Mandy S. Kim for assistance with data collection. Computational resources were provided by the Supercomputing Center of Samsung Electronics.

Author contributions

I.K., T.V.V., Y.M.R., W.-J.S. and H.-J.K. conceived and designed the project and wrote the manuscript. I.K. developed GPU algorithms and wrote the code. D.J, L.W. and A.A. performed quantum chemistry calculations. J.Y. assisted with parallel computing set-up. S.K. and Y.C. aided in interpreting the results. I.J., S.L. and D.S.K. organized the project and supervised the computational study. All authors analyzed the data, discussed the results, and contributed to various portions of the manuscript.

References

  • Bourzac (2017) Bourzac, K. Supercomputing poised for a massive speed boost. Nature 2017, 551, 554–556.
  • McInnes et al. (2021) McInnes, L. C.; Heroux, M. A.; Draeger, E. W.; Siegel, A.; Coghlan, S.; Antypas, K. How community software ecosystems can unlock the potential of exascale computing. Nat. Comput. Sci. 2021, 1, 92–94.
  • Heldens et al. (2021) Heldens, S.; Hijma, P.; Werkhoven, B. V.; Maassen, J.; Belloum, A. S. Z.; Van Nieuwpoort, R. V. The Landscape of Exascale Research: A Data-Driven Literature Analysis. ACM Comput. Surv. 2021, 53, 1–43.
  • Herten (2023) Herten, A. Many Cores, Many Models: GPU Programming Model vs. Vendor Compatibility Overview. SC-W ’23: Proceedings of the SC ’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. New York, NY, USA, 2023; p 1019–1026.
  • Kohn and Sham (1965) Kohn, W.; Sham, L. J. Self-Consistent Equations Including Exchange and Correlation Effects. Phys. Rev. 1965, 140, A1133–A1138.
  • Burke (2012) Burke, K. Perspective on density functional theory. J. Chem. Phys. 2012, 136, 150901.
  • Becke (2014) Becke, A. D. Perspective: Fifty years of density-functional theory in chemical physics. J. Chem. Phys. 2014, 140, 18A301.
  • Bursch et al. (2022) Bursch, M.; Mewes, J.; Hansen, A.; Grimme, S. Best practice DFT protocols for basic molecular computational chemistry. Angew. Chem. Int. Ed. 2022, 61, e202205735.
  • Gordon et al. (2020) Gordon, M. S.; Barca, G.; Leang, S. S.; Poole, D.; Rendell, A. P.; Vallejo, J. L. G.; Westheimer, B. Novel Computer Architectures and Quantum Chemistry. J. Phys. Chem. A 2020, 124, 4557–4582.
  • Felice et al. (2023) Felice, R. D. et al. A Perspective on Sustainable Computational Chemistry Software Development and Integration. J. Chem. Theory Comput. 2023, 19, 7056–7076.
  • Vallejo et al. (2023) Vallejo, J. L. G. et al. Toward an extreme-scale electronic structure system. J. Chem. Phys. 2023, 159, 044112.
  • Nagy and Jensen (2017) Nagy, B.; Jensen, F. Reviews in Computational Chemistry; Wiley, 2017; Chapter 3, pp 93–149.
  • Almlöf et al. (1982) Almlöf, J.; Faegri, K.; Korsell, K. Principles for a direct SCF approach to LCAO–MO ab-initio calculations. J. Comput. Chem. 1982, 3, 385–399.
  • Yasuda (2008) Yasuda, K. Two-electron integral evaluation on the graphics processor unit. J. Comput. Chem. 2008, 29, 334–342.
  • Ufimtsev and Martínez (2008) Ufimtsev, I. S.; Martínez, T. J. Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation. J. Chem. Theory Comput. 2008, 4, 222–231.
  • Ufimtsev and Martínez (2009) Ufimtsev, I. S.; Martínez, T. J. Quantum Chemistry on Graphical Processing Units. 2. Direct Self-Consistent-Field Implementation. J. Chem. Theory Comput. 2009, 5, 1004–1015.
  • Asadchev et al. (2010) Asadchev, A.; Allada, V.; Felder, J.; Bode, B. M.; Gordon, M. S.; Windus, T. L. Uncontracted Rys Quadrature Implementation of up to G Functions on Graphical Processing Units. J. Chem. Theory Comput. 2010, 6, 696–704.
  • Titov et al. (2013) Titov, A. V.; Ufimtsev, I. S.; Luehr, N.; Martínez, T. J. Generating Efficient Quantum Chemistry Codes for Novel Architectures. J. Chem. Theory Comput. 2013, 9, 213–221.
  • Miao and Merz (2013) Miao, Y.; Merz, K. M. Acceleration of Electron Repulsion Integral Evaluation on Graphics Processing Units via Use of Recurrence Relations. J. Chem. Theory Comput. 2013, 9, 965–976.
  • Miao and Merz (2015) Miao, Y.; Merz, K. M. Acceleration of High Angular Momentum Electron Repulsion Integrals and Integral Derivatives on Graphics Processing Units. J. Chem. Theory Comput. 2015, 11, 1449–1462.
  • Kussmann and Ochsenfeld (2017) Kussmann, J.; Ochsenfeld, C. Hybrid CPU/GPU Integral Engine for Strong-Scaling Ab Initio Methods. J. Chem. Theory Comput. 2017, 13, 3153–3159.
  • Rák and Cserey (2015) Rák, A.; Cserey, G. The BRUSH algorithm for two-electron integrals on GPU. Chem. Phys. Lett. 2015, 622, 92–98.
  • Tornai et al. (2019) Tornai, G. J.; Ladjánszki, I.; Ádám Rák; Kis, G.; Cserey, G. Calculation of Quantum Chemical Two-Electron Integrals by Applying Compiler Technology on GPU. J. Chem. Theory Comput. 2019, 15, 5319–5331.
  • Barca et al. (2020) Barca, G. M. J. et al. Recent developments in the general atomic and molecular electronic structure system. J. Chem. Phys. 2020, 152, 154102.
  • Qi et al. (2023) Qi, J.; Zhang, Y.; Yang, M. A hybrid CPU/GPU method for Hartree–Fock self-consistent-field calculation. J. Chem. Phys. 2023, 159, 104101.
  • Vallejo et al. (2023) Vallejo, J. L. G.; Barca, G. M.; Gordon, M. S. High-performance GPU-accelerated evaluation of electron repulsion integrals. Mol. Phys. 2023, 121, e2112987.
  • Asadchev and Valeev (2023) Asadchev, A.; Valeev, E. F. High-Performance Evaluation of High Angular Momentum 4‑Center Gaussian Integrals on Modern Accelerated Processors. J. Phys. Chem. A 2023, 127, 10889–10895.
  • Seritan et al. (2020) Seritan, S.; Bannwarth, C.; Fales, B. S.; Hohenstein, E. G.; Kokkila-Schumacher, S. I. L.; Luehr, N.; Snyder, J. W.; Song, C.; Titov, A. V.; Ufimtsev, I. S.; Martínez, T. J. TeraChem: Accelerating electronic structure and ab initio molecular dynamics with graphical processing units. J. Chem. Phys. 2020, 152, 224110.
  • Manathunga et al. (2021) Manathunga, M.; Jin, C.; Cruzeiro, V. W. D.; Miao, Y.; Mu, D.; Arumugam, K.; Keipert, K.; Aktulga, H. M.; Merz, K. M.; Götz, A. W. Harnessing the Power of Multi-GPU Acceleration into the Quantum Interaction Computational Kernel Program. J. Chem. Theory Comput. 2021, 17, 3955–3966.
  • Barca et al. (2021) Barca, G. M. J.; Alkan, M.; Galvez-Vallejo, J. L.; Poole, D. L.; Rendell, A. P.; Gordon, M. S. Faster Self-Consistent Field (SCF) Calculations on GPU Clusters. J. Chem. Theory Comput. 2021, 17, 7486–7503.
  • Manathunga et al. (2020) Manathunga, M.; Miao, Y.; Mu, D.; Götz, A. W.; Merz, K. M. Parallel Implementation of Density Functional Theory Methods in the Quantum Interaction Computational Kernel Program. J. Chem. Theory Comput. 2020, 16, 4315–4326.
  • Williams-Young et al. (2020) Williams-Young, D. B.; de Jong, W. A.; van Dam, H. J. J.; Yang, C. On the Efficient Evaluation of the Exchange Correlation Potential on Graphics Processing Unit Clusters. Front. Chem. 2020, 8.
  • Kowalski et al. (2021) Kowalski, K. et al. From NWChem to NWChemEx: Evolving with the Computational Chemistry Landscape. Chem. Rev. 2021, 121, 4962–4998.
  • Johnson et al. (2022) Johnson, K. G.; Mirchandaney, S.; Hoag, E.; Heirich, A.; Aiken, A.; Martínez, T. J. Multinode Multi-GPU Two-Electron Integrals: Code Generation Using the Regent Language. J. Chem. Theory Comput. 2022, 18, 6522–6536.
  • Kussmann et al. (2021) Kussmann, J.; Laqua, H.; Ochsenfeld, C. Highly Efficient Resolution-of-Identity Density Functional Theory Calculations on Central and Graphics Processing Units. J. Chem. Theory Comput. 2021, 17, 1512–1521.
  • Asadchev and Valeev (2023) Asadchev, A.; Valeev, E. F. Memory-Efficient Recursive Evaluation of 3‑Center Gaussian Integrals. J. Chem. Theory Comput. 2023, 19, 1698–1710.
  • Williams-Young et al. (2023) Williams-Young, D. B.; Asadchev, A.; Popovici, D. T.; Clark, D.; Waldrop, J.; Windus, T. L.; Valeev, E. F.; Jong, W. A. d. Distributed memory, GPU accelerated Fock construction for hybrid, Gaussian basis density functional theory. J. Chem. Phys. 2023, 158, 234104.
  • Runge and Gross (1984) Runge, E.; Gross, E. K. U. Density-Functional Theory for Time-Dependent Systems. Phys. Rev. Lett. 1984, 52, 997–1000.
  • Dreuw and Head-Gordon (2005) Dreuw, A.; Head-Gordon, M. Single-Reference ab Initio Methods for the Calculation of Excited States of Large Molecules. Chem. Rev. 2005, 105, 4009–4037.
  • Herbert (2023) Herbert, J. M. Density Functional Theory for Electronic Excited States. Theoretical and Computational Photochemistry. 2023; pp 69–118.
  • Isborn et al. (2011) Isborn, C. M.; Luehr, N.; Ufimtsev, I. S.; Martínez, T. J. Excited-State Electronic Structure with Configuration Interaction Singles and Tamm–Dancoff Time-Dependent Density Functional Theory on Graphical Processing Units. J. Chem. Theory Comput. 2011, 7, 1814–1823.
  • Kim et al. (2023) Kim, I.; Jeong, D.; Son, W.-J.; Kim, H.-J.; Rhee, Y. M.; Jung, Y.; Choi, H.; Yim, J.; Jang, I.; Kim, D. S. Kohn–Sham time-dependent density functional theory with Tamm–Dancoff approximation on massively parallel GPUs. npj Comput. Mater. 2023, 9, 81.
  • Leininger et al. (1997) Leininger, T.; Stoll, H.; Werner, H.-J.; Savin, A. Combining long-range configuration interaction with short-range density functionals. Chem. Phys. Lett. 1997, 275, 151–160.
  • Vydrov and Scuseria (2006) Vydrov, O. A.; Scuseria, G. E. Assessment of a long-range corrected hybrid functional. J. Chem. Phys. 2006, 125, 234109.
  • Epifanovsky et al. (2021) Epifanovsky, E. et al. Software for the frontiers of quantum chemistry: An overview of developments in the Q-Chem 5 package. J. Chem. Phys. 2021, 155, 084801.
  • Franzke et al. (2023) Franzke, Y. J. et al. TURBOMOLE: Today and Tomorrow. J. Chem. Theory Comput. 2023, 19, 6859–6890.
  • Zahariev et al. (2023) Zahariev, F. et al. The General Atomic and Molecular Electronic Structure System (GAMESS): Novel Methods on Novel Architectures. J. Chem. Theory Comput. 2023, 19, 7031–7055.
  • Mejia-Rodriguez et al. (2023) Mejia-Rodriguez, D. et al. NWChem: Recent and Ongoing Developments. J. Chem. Theory Comput. 2023, 19, 7077–7096.
  • Neese (2022) Neese, F. Software update: ORCA program system—Version 5.0. WIREs Comput. Mol. Sci. 2022, 12, e1606.
  • Cao et al. (2021) Cao, Y.; Halls, M. D.; Friesner, R. A. Highly efficient implementation of the analytical gradients of pseudospectral time-dependent density functional theory. J. Chem. Phys. 2021, 155, 024115.
  • Rinkevicius et al. (2020) Rinkevicius, Z.; Li, X.; Vahtras, O.; Ahmadzadeh, K.; Brand, M.; Ringholm, M.; List, N. H.; Scheurer, M.; Scott, M.; Dreuw, A.; Norman, P. VeloxChem: A Python‐driven density‐functional theory program for spectroscopy simulations in high‐performance computing environments. WIREs Comput. Mol. Sci. 2020, 10, e1457.
  • Frisch et al. (2016) Frisch, M. J. et al. Gaussian 16 Revision C.01. 2016; Gaussian Inc. Wallingford CT.
  • Hirata and Head-Gordon (1999) Hirata, S.; Head-Gordon, M. Time-dependent density functional theory within the Tamm–Dancoff approximation. Chem. Phys. Lett. 1999, 314, 291–299.
  • Casida (1995) Casida, M. E. In Recent Advances in Computational Chemistry; Chong, D. P., Ed.; WORLD SCIENTIFIC, 1995; Vol. 1; pp 155–192.
  • Perdew et al. (1996) Perdew, J. P.; Burke, K.; Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 1996, 77, 3865–3868.
  • Bauernschmitt and Ahlrichs (1996) Bauernschmitt, R.; Ahlrichs, R. Treatment of electronic excitations within the adiabatic approximation of time dependent density functional theory. Chem. Phys. Lett. 1996, 256, 454–464.
  • Foresman et al. (1992) Foresman, J. B.; Head-Gordon, M.; Pople, J. A.; Frisch, M. J. Toward a systematic molecular orbital theory for excited states. J. Phys. Chem. 1992, 96, 135–149.
  • Caillie and Amos (1999) Caillie, C. V.; Amos, R. D. Geometric derivatives of excitation energies using SCF and DFT. Chem. Phys. Lett. 1999, 308, 249–255, Excited state property.
  • Caillie and Amos (2000) Caillie, C. V.; Amos, R. D. Geometric derivatives of density functional theory excitation energies using gradient-corrected functionals. Chem. Phys. Lett. 2000, 317, 159–164.
  • Shroll and Edwards (2004) Shroll, R. M.; Edwards, W. D. Excited‐state gradients via CPHF equations. Int. J. Quantum Chem. 2004, 56, 395–410.
  • Liu et al. (2010) Liu, F.; Gan, Z.; Shao, Y.; Hsu, C.-P.; Dreuw, A.; Head-Gordon, M.; Miller, B. T.; Brooks, B. R.; Yu, J.-G.; Furlani, T. R.; Kong, J. A parallel implementation of the analytic nuclear gradient for time-dependent density functional theory within the Tamm–Dancoff approximation. Mol. Phys. 2010, 108, 2791–2800.
  • Furche and Ahlrichs (2002) Furche, F.; Ahlrichs, R. Adiabatic time-dependent density functional methods for excited state properties. J. Chem. Phys. 2002, 117, 7433–7447.
  • Chiba et al. (2006) Chiba, M.; Tsuneda, T.; Hirao, K. Excited state geometry optimizations by analytical energy gradient of long-range corrected time-dependent density functional theory. J. Chem. Phys. 2006, 124, 144106.
  • Pople et al. (1992) Pople, J. A.; Gill, P. M.; Johnson, B. G. Kohn–Sham density-functional theory within a finite basis set. Chem. Phys. Lett. 1992, 199, 557–560.
  • Gerratt and Mills (1968) Gerratt, J.; Mills, I. M. Force Constants and Dipole-Moment Derivatives of Molecules from Perturbed Hartree–Fock Calculations. I. J. Chem. Phys. 1968, 49, 1719–1729.
  • Handy and Schaefer (1984) Handy, N. C.; Schaefer, H. F. On the evaluation of analytic energy derivatives for correlated wave functions. J. Chem. Phys. 1984, 81, 5031–5033.
  • Pople et al. (1979) Pople, J. A.; Krishnan, R.; Schlegel, H. B.; Binkley, J. S. Derivative studies in hartree‐fock and møller‐plesset theories. Int. J. Quantum Chem. 1979, 16, 225–241.
  • Komornicki et al. (1977) Komornicki, A.; Ishida, K.; Morokuma, K.; Ditchfield, R.; Conrad, M. Efficient determination and characterization of transition states using ab-initio methods. Chem. Phys. Lett. 1977, 45, 595–602.
  • Dupuis et al. (1976) Dupuis, M.; Rys, J.; King, H. F. Evaluation of molecular integrals over Gaussian basis functions. J. Chem. Phys. 1976, 65, 111–116.
  • Murray et al. (1993) Murray, C. W.; Handy, N. C.; Laming, G. J. Quadrature schemes for integrals of density functional theory. Mol. Phys. 1993, 78, 997–1014.
  • Lebedev (1977) Lebedev, V. I. Spherical quadrature formulas exact to orders 25–29. Sib. Math. J. 1977, 18, 99–107.
  • Becke (1988) Becke, A. D. A multicenter numerical integration scheme for polyatomic molecules. J. Chem. Phys. 1988, 88, 2547–2553.
  • Kim and Lee (2013) Kim, I.; Lee, Y. S. KPACK: Relativistic two-component ab initio electronic structure program package. Bull. Korean Chem. Soc. 2013, 34, 179–187.
  • (74) CUDA Fortran Programming Guide, Nvidia https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide/, Accessed: 2022-07-06.
  • Lehtola et al. (2018) Lehtola, S.; Steigemann, C.; Oliveira, M. J.; Marques, M. A. Recent developments in LibXC – A comprehensive library of functionals for density functional theory. SoftwareX 2018, 7, 1–5.
  • Zimmer (2002) Zimmer, M. Green fluorescent protein (GFP): Applications, structure, and related photophysical behavior. Chem. Rev. 2002, 102, 759–782.
  • Laino et al. (2004) Laino, T.; Nifosì, R.; Tozzini, V. Relationship between structure and optical properties in green fluorescent proteins: a quantum mechanical study of the chromophore environment. Chem. Phys. 2004, 298, 17–28.
  • Epifanovsky et al. (2009) Epifanovsky, E.; Polyakov, I.; Grigorenko, B.; Nemukhin, A.; Krylov, A. I. Quantum Chemical Benchmark Studies of the Electronic Properties of the Green Fluorescent Protein Chromophore. 1. Electronically Excited and Ionized States of the Anionic Chromophore in the Gas Phase. J. Chem. Theory Comput. 2009, 5, 1895–1906.
  • Nåbo et al. (2017) Nåbo, L. J.; Olsen, J. M. H.; Martínez, T. J.; Kongsted, J. The Quality of the Embedding Potential Is Decisive for Minimal Quantum Region Size in Embedding Calculations: The Case of the Green Fluorescent Protein. J. Chem. Theory. Comput. 2017, 13, 6230–6236.
  • Kulik et al. (2012) Kulik, H. J.; Luehr, N.; Ufimtsev, I. S.; Martínez, T. J. Ab Initio Quantum Chemistry for Protein Structures. J. Phys. Chem. B 2012, 116, 12501–12509.
  • Schwabe et al. (2015) Schwabe, T.; Beerepoot, M. T. P.; Olsen, J. M. H.; Kongsted, J. Analysis of computational models for an accurate study of electronic excitations in GFP. Phys. Chem. Chem. Phys. 2015, 17, 2582–2588.
  • Kulik et al. (2016) Kulik, H. J.; Zhang, J.; Klinman, J. P.; Martínez, T. J. How Large Should the QM Region Be in QM/MM Calculations? The Case of Catechol O -Methyltransferase. J. Phys. Chem. B 2016, 120, 11381–11394.
  • Rudberg (2012) Rudberg, E. Difficulties in applying pure Kohn–Sham density functional theory electronic structure methods to protein molecules. J. Phys. Condens. Matter 2012, 24, 072202.
  • Lever et al. (2013) Lever, G.; Cole, D. J.; Hine, N. D. M.; Haynes, P. D.; Payne, M. C. Electrostatic considerations affecting the calculated HOMO–LUMO gap in protein molecules. J. Phys. Condens. Mat. 2013, 25, 152101.
  • Foresman and Frisch (2015) Foresman, J.; Frisch, Æ. Exploring Chemistry with Electronic Structure Methods, 3rd ed.; Gaussian, Inc.: Wallingford, CT, 2015.
  • Van Thor et al. (2005) Van Thor, J. J.; Georgiev, G. Y.; Towrie, M.; Sage, J. T. Ultrafast and Low Barrier Motions in the Photoreactions of the Green Fluorescent Protein. J. Biol. Chem. 2005, 280, 33652–33659.
  • Chai and Head-Gordon (2008) Chai, J.-D.; Head-Gordon, M. Systematic optimization of long-range corrected hybrid density functionals. J. Chem. Phys. 2008, 128, 084106.
  • Mardirossian and Head-Gordon (2017) Mardirossian, N.; Head-Gordon, M. Thirty years of density functional theory in computational chemistry: An overview and extensive assessment of 200 density functionals. Mol. Phys. 2017, 115, 2315–2372.
  • Weigend and Ahlrichs (2005) Weigend, F.; Ahlrichs, R. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy. Phys. Chem. Chem. Phys. 2005, 7, 3297.
  • Svensson et al. (1996) Svensson, M.; Humbel, S.; Froese, R. D. J.; Matsubara, T.; Sieber, S.; Morokuma, K. ONIOM: A multilayered integrated MO + MM method for geometry optimizations and single point energy predictions. A test for Diels-Alder reactions and Pt(P(t-Bu)3)2 + H2 oxidative addition. J. Phys. Chem. 1996, 100, 19357–19363.
  • Cornell et al. (1995) Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W.; Kollman, P. A. A Second Generation Force Field for the Simulation of Proteins, Nucleic Acids, and Organic Molecules. J. Am. Chem. Soc. 1995, 117, 5179–5197.
  • Heim et al. (1994) Heim, R.; Prasher, D. C.; Tsien, R. Y. Wavelength mutations and posttranslational autoxidation of green fluorescent protein. Proc. Natl. Acad. Sci. USA 1994, 91, 12501–12504.
  • Martin (2003) Martin, R. L. Natural transition orbitals. J. Chem. Phys. 2003, 118, 4775.
  • Leininger et al. (2001) Leininger, M. L.; Sherrill, C. D.; Allen, W. D.; Schaefer, H. F. Systematic Study of Selected Diagonalization Methods for Configuration Interaction Matrices. J. Comput. Chem. 2001, 22, 1574–1589.
  • Williams et al. (2009) Williams, S.; Waterman, A.; Patterson, D. Roofline. Commun. ACM 2009, 52, 65–76.
  • Atchley et al. (2023) Atchley, S. et al. Frontier: Exploring Exascale. SC ’23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Denver CO USA, 2023; pp 1–16.
  • Chang et al. (2024) Chang, J.; Lu, K.; Guo, Y.; Wang, Y.; Zhao, Z.; Huang, L.; Zhou, H.; Wang, Y.; Lei, F.; Zhang, B. A survey of compute nodes with 100 TFLOPS and beyond for supercomputers. CCF Trans. High Perform. Comput. 2024, 6, 243–262.