Very-Large-Scale GPU-Accelerated Nuclear Gradient of Time-Dependent Density Functional Theory with Tamm–Dancoff Approximation and Range-Separated Hybrid Functionals

Inkoo Kim Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea. Department of Chemistry, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, U.S.A. Daun Jeong Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea. Leah Weisburn Department of Chemistry, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, U.S.A. Alexandra Alexiu Department of Chemistry, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, U.S.A. Troy Van Voorhis tvan@mit.edu Department of Chemistry, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, U.S.A. Young Min Rhee ymrhee@kaist.ac.kr Department of Chemistry, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea. Won-Joon Son wonjoon.son@samsung.com Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea. Hyung-Jin Kim hj.windy.kim@samsung.com Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea. Jinkyu Yim Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea. Sungmin Kim Samsung Advanced Institute of Technology, Samsung Electronics, Suwon 16678, Republic of Korea. Yeonchoo Cho Samsung Advanced Institute of Technology, Samsung Electronics, Suwon 16678, Republic of Korea. Inkook Jang Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea. Seungmin Lee Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea. Dae Sin Kim Innovation Center, Samsung Electronics, Hwaseong 18448, Republic of Korea.

Abstract

Modern graphics processing units (GPUs) provide an unprecedented level of computing power. In this study, we present a high-performance, multi-GPU implementation of the analytical nuclear gradient for Kohn–Sham time-dependent density functional theory (TDDFT), employing the Tamm–Dancoff approximation (TDA) and Gaussian-type atomic orbitals as basis functions. We discuss GPU-efficient algorithms for the derivatives of electron repulsion integrals and exchange–correlation functionals within the range-separated scheme. As an illustrative example, we calculated the TDA-TDDFT gradient of the S₁ state of a full-scale green fluorescent protein with explicit water solvent molecules, totaling 4353 atoms, at the $\omega$ B97X/def2-SVP level of theory. Our algorithm demonstrates favorable parallel efficiencies on a high-speed distributed system equipped with 256 Nvidia A100 GPUs, achieving $>$ 70% with up to 64 GPUs and 31% with 256 GPUs, effectively leveraging the capabilities of modern high-performance computing systems.

1 Introduction

High-performance computing is undergoing a rapid evolution supported by high-speed distributed systems that incorporate heterogenous accelerators such as graphics processing units (GPUs) for high-throughput data-parallel computation.Bourzac (2017); McInnes et al. (2021); Heldens et al. (2021) This heterogeneous architecture poses a challenge in efficiently adapting existing algorithms to fully exploit the computational potential of these accelerators. Within general-purpose GPU programming models,Herten (2023) fine-grained computational workloads represented by threads are organized into thread-blocks and are offloaded to GPU, which consists of a large number of lightweight processors for massively parallel execution. Therefore, a key consideration in achieving high performance with GPU acceleration is maintaining a high degree of concurrency by simultaneously handling a large number of similar tasks to utilize the single instruction multiple-thread (SIMT)-based GPU architecture effectively.

To harness the high-throughput potential of GPUs in quantum chemistry, significant efforts have been made to redesign conventional algorithms and parallelization tactics. Kohn–Sham density functional theory (DFT),Kohn and Sham (1965) arguably the most widely adopted quantum chemistry method for its balance of accuracy and cost-efficiency,Burke (2012); Becke (2014); Bursch et al. (2022) has been a prime target for GPU acceleration.Gordon et al. (2020); Felice et al. (2023); Vallejo et al. (2023) Focusing on DFT in the linear combination of atomic orbitals (LCAO) framework, using atom-centered Gaussian basis sets is a natural choice for molecular systems.Nagy and Jensen (2017) In this approach, it is well known that owing to the evaluation of four-index electron repulsion integrals (ERIs), the integral-direct Fock build is the most computationally intensive section.Almlöf et al. (1982)

Early pioneering works by YasudaYasuda (2008) and by Ufimtsev and MartínezUfimtsev and Martínez (2008) focused on accelerating the computation of ERIs for the Hartree–Fock (HF) method on GPUs. These were followed by numerous reports on improvements in both performance and scalability to date.Ufimtsev and Martínez (2009); Asadchev et al. (2010); Titov et al. (2013); Miao and Merz (2013, 2015); Kussmann and Ochsenfeld (2017); Rák and Cserey (2015); Tornai et al. (2019); Barca et al. (2020); Qi et al. (2023); Vallejo et al. (2023); Asadchev and Valeev (2023) Recent advancements included efficient ground-state DFT calculations on multi-GPUs, utilizing the GPU implementations for the numerical integration of exchange–correlation (XC) functionals,Seritan et al. (2020); Manathunga et al. (2021); Barca et al. (2021); Manathunga et al. (2020); Williams-Young et al. (2020) which allowed massively parallel simulations of very large molecules.Kowalski et al. (2021); Johnson et al. (2022) Notably, scaling-reducing schemes such as the resolution-of-identityKussmann et al. (2021); Asadchev and Valeev (2023); Williams-Young et al. (2023) and fragmentation-based methodsVallejo et al. (2023) have also been explored on GPUs. The electronic structures of excited states play crucial roles from both chemical and materials perspectives and can be modeled using the time-dependent variant of DFT (TDDFT).Runge and Gross (1984); Dreuw and Head-Gordon (2005) Despite the widespread adoption of TDDFT across various fields,Herbert (2023) few studies have focused on its implementation on GPUs.Isborn et al. (2011); Kim et al. (2023)

Inspired by the GPU acceleration of TDDFT by Isborn et al.,Isborn et al. (2011) some of present authors have provided in ref 42 a modernized implementation utilizing the state-of-the-art Nvidia A100 GPUs over high-speed Infiniband HDR networks for very large-scale calculations. In this process, the global hybrid functionals showed limitations in excited-state calculations of full-scale biological proteins, as the occurrence of spurious low-lying charge-transfer states compromised their practical application. As a simple, pragmatic remedy, we employed range-separated hybrid (RSH) functionalsLeininger et al. (1997); Vydrov and Scuseria (2006) to address certain fundamental issues with global hybrid functionals. Further, even with well-established parallel CPU implementations of the TDDFT nuclear gradient in many quantum chemistry programs,Epifanovsky et al. (2021); Franzke et al. (2023); Zahariev et al. (2023); Mejia-Rodriguez et al. (2023); Neese (2022); Cao et al. (2021); Rinkevicius et al. (2020); Frisch et al. (2016) to our knowledge, the GPU counterpart in conjunction with RSH functionals, which is critical for geometry optimizations of large systems to find stationary points on the excited-state potential energy surface and for elucidating various response properties, has not been reported thus far.

In this work, as a necessary advance towards reliable quantum chemistry calculations of excited states on an unprecedented scale, we present a massively parallel GPU implementation of the nuclear gradient of the TDDFT method with RSH functionals, employing the Gaussian-type atomic orbitals for exploring the excited states of very large molecular systems. We discuss efficient multi-GPU algorithms for computing the derivatives of ERIs and XC functionals, and demonstrate strong scalability in the nuclear gradient calculations of the singlet excited state of a 4353-atom protein system. This implementation achieves reasonably high parallel efficiencies up to 256 Nvidia A100 GPUs.

2 Theory and Algorithm

2.1 TDA-TDDFT Nuclear Gradient with RSH Functionals

Within the linear-response formalism, the total excited-state energy, $E$ , of TDDFT is calculated as the sum of the ground-state energy, $E_{0}$ , and the excitation energy, $E_{\Delta}$ . The Tamm–Dancoff approximation (TDA)Hirata and Head-Gordon (1999) simplifies the calculation of $E_{\Delta}$ by neglecting the deexcitation term in the Casida equation.Casida (1995) This approach allows $E_{\Delta}$ to be determined from the following eigenvalue equation:

\displaystyle\mathbf{H}\mathbf{X}=E_{\Delta}\mathbf{X}

(1)

where $\mathbf{H}$ is the Hamiltonian represented within the space of (occupied–virtual) singly-excited Slater determinants and $\mathbf{X}$ is the excitation amplitude arranged in an $N_{\mathrm{o}}\times N_{\mathrm{v}}$ matrix, where $N_{\mathrm{o}}$ and $N_{\mathrm{v}}$ refer to the numbers of occupied and virtual orbitals, respectively. For simplicity, we limit our discussion to closed-shell orbitals with restricted spins, using $i$ , $j$ for the occupied orbitals, and $a$ , $b$ for the virtual orbitals. Throughout this work, the XC functionals considered are within the generalized gradient approximation (GGA). These functionals depend on the density and its gradient, and are described by $f(\rho_{\alpha},\rho_{\beta},\gamma_{\alpha\alpha},\gamma_{\alpha\beta},\gamma% _{\beta\beta})$ where $\rho_{\alpha}$ and $\rho_{\beta}$ are the spin densities, and $\gamma_{\alpha\alpha}\equiv\nabla\rho_{\alpha}{\cdot}\nabla\rho_{\alpha}$ , $\gamma_{\alpha\beta}\equiv\nabla\rho_{\alpha}{\cdot}\nabla\rho_{\beta}$ and $\gamma_{\beta\beta}\equiv\nabla\rho_{\beta}{\cdot}\nabla\rho_{\beta}$ represent the gradient invariants.Perdew et al. (1996) Additionally, we focus on the spin-adapted formulation for singlet excited states, noting that the triplet counterpart can be easily derived as described in ref 56.

By expanding all the terms in eq 1, the nuclear derivative of $E$ with respect to an atomic center $\mathbf{A}\equiv\{A_{x},A_{y},A_{z}\}$ of the $A$ -th atom can be expressed as

	$\displaystyle\nabla_{\mathbf{A}}E$	$\displaystyle=\nabla_{\mathbf{A}}E_{0}+\sum_{ia}\sum_{jb}X_{ia}X_{jb}\nabla_{% \mathbf{A}}[\varepsilon_{a}\delta_{ab}-\varepsilon_{i}\delta_{ij}$		(2)
		$\displaystyle+2(ia\|jb)-\alpha(ij\|ab)-\beta(ij\|ab)_{\mathrm{lr}}+(ia\|\hat{f}_{% \mathrm{XC}}\|jb)]$

where $\varepsilon_{i}$ and $\varepsilon_{a}$ represent the orbital energies of the occupied and virtual molecular orbitals, $\psi_{i}(\mathbf{r})$ and $\psi_{a}(\mathbf{r})$ , respectively, and $(ia|jb)$ , $(ij|ab)$ and $(ij|ab)_{\mathrm{lr}}$ denote the Coulomb, HF exchange and long-range HF exchange integrals, respectively, defined by

$\displaystyle(ia\|jb)$	$\displaystyle=\iint\!\psi_{i}(\mathbf{r})\psi_{a}(\mathbf{r})\frac{1}{r}\psi_{% j}(\mathbf{r}^{\prime})\psi_{b}(\mathbf{r}^{\prime})\,\mathrm{d}\mathbf{r}% \mathrm{d}\mathbf{r}^{\prime}$	(3)
$\displaystyle(ij\|ab)$	$\displaystyle=\iint\!\psi_{i}(\mathbf{r})\psi_{j}(\mathbf{r})\frac{1}{r}\psi_{% a}(\mathbf{r}^{\prime})\psi_{b}(\mathbf{r}^{\prime})\,\mathrm{d}\mathbf{r}% \mathrm{d}\mathbf{r}^{\prime}$	(4)
$\displaystyle(ij\|ab)_{\mathrm{lr}}$	$\displaystyle=\iint\!\psi_{i}(\mathbf{r})\psi_{j}(\mathbf{r})\frac{% \operatorname{erf}(\omega r)}{r}\psi_{a}(\mathbf{r}^{\prime})\psi_{b}(\mathbf{% r}^{\prime})\,\mathrm{d}\mathbf{r}\mathrm{d}\mathbf{r}^{\prime}$	(5)

with $r\equiv|\mathbf{r}-\mathbf{r}^{\prime}|$ . Furthermore, $(ia|\hat{f}_{\mathrm{XC}}|jb)$ represents the singlet XC kernel defined byBauernschmitt and Ahlrichs (1996)

		$\displaystyle(ia\|\hat{f}_{\mathrm{XC}}\|jb)=$
		$\displaystyle\iint\Bigg{[}\left(\frac{\partial^{2}f}{\partial\rho_{\alpha}^{2}% }+\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\rho_{\beta}}\right)\psi_{% i}(\mathbf{r})\psi_{a}(\mathbf{r})\psi_{j}(\mathbf{r}^{\prime})\psi_{b}(% \mathbf{r}^{\prime})$
		$\displaystyle+\left(2\frac{\partial f}{\partial\gamma_{\alpha\alpha}}+\frac{% \partial f}{\partial\gamma_{\alpha\beta}}\right)\nabla\big{(}\psi_{i}(\mathbf{% r})\psi_{a}(\mathbf{r})\big{)}\cdot\nabla\big{(}\psi_{j}(\mathbf{r}^{\prime})% \psi_{b}(\mathbf{r}^{\prime})\big{)}$
		$\displaystyle+\left(\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\gamma_{% \alpha\alpha}}+\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\gamma_{\beta% \beta}}+\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\gamma_{\alpha\beta}% }\right)\Big{(}\psi_{i}(\mathbf{r})\psi_{a}(\mathbf{r})$
		$\displaystyle\times\nabla\rho\cdot\nabla\big{(}\psi_{j}(\mathbf{r}^{\prime})% \psi_{b}(\mathbf{r}^{\prime})\big{)}+\nabla\rho\cdot\nabla\big{(}\psi_{i}(% \mathbf{r})\psi_{a}(\mathbf{r})\big{)}\psi_{j}(\mathbf{r}^{\prime})\psi_{b}(% \mathbf{r}^{\prime})\Big{)}$
		$\displaystyle+\left(\frac{\partial^{2}f}{\partial\gamma_{\alpha\alpha}^{2}}+% \frac{1}{2}\frac{\partial^{2}f}{\partial\gamma_{\alpha\beta}^{2}}+2\frac{% \partial^{2}f}{\partial\gamma_{\alpha\alpha}\gamma_{\alpha\beta}}+\frac{% \partial^{2}f}{\partial\gamma_{\alpha\alpha}\gamma_{\beta\beta}}\right)$
		$\displaystyle\times\nabla\rho\nabla\big{(}\psi_{i}(\mathbf{r})\psi_{a}(\mathbf% {r})\big{)}\cdot\nabla\rho\nabla\big{(}\psi_{j}(\mathbf{r}^{\prime})\psi_{b}(% \mathbf{r}^{\prime})\big{)}\Bigg{]}\,\mathrm{d}\mathbf{r}\mathrm{d}\mathbf{r}^% {\prime}$		(6)

with the total density $\rho\equiv\rho_{\alpha}+\rho_{\beta}$ . The range-separated scheme blends the short-range exchange DFT functional and the long-range HF exchange using the error function ( $\operatorname{erf}$ ), with the parameter $\omega$ setting the cross-fading point. In the limits of short-range ( $r\rightarrow 0$ ) and long-range ( $r\rightarrow\infty$ ), the portions of HF exchange are determined by $\alpha$ and $\alpha+\beta$ , respectively. For global hybrid functionals, the HF exchange is kept at constant $\alpha$ by setting $\beta=0$ .

In the following discussion, to derive the atomic orbital formulation of eq 2 for computational efficiency, we utilize the direct differentiation approach.Foresman et al. (1992); Caillie and Amos (1999, 2000); Shroll and Edwards (2004); Liu et al. (2010) Alternatively, one can achieve the same formulation through the Lagrangian approach.Furche and Ahlrichs (2002); Chiba et al. (2006) Despite rather involving algebraic derivation, the resulting expression is relatively concise and can be written in matrix notation, facilitating efficient computational processing.

We rewrite eq 2 in the atomic orbital basis $\{\phi_{\mu}\}$ as

$\displaystyle\nabla_{\mathbf{A}}E$	$\displaystyle=\sum_{\mu\nu}P_{\mu\nu}\nabla_{\mathbf{A}}H^{\mathrm{core}}_{\mu% \nu}+\sum_{\mu\nu}W_{\mu\nu}\nabla_{\mathbf{A}}S_{\mu\nu}+\nabla_{\mathbf{A}}V% _{\mathrm{nuc}}$
	$\displaystyle+\sum_{\mu\nu\lambda\sigma}\Gamma_{\mu\nu\lambda\sigma}\nabla_{% \mathbf{A}}(\mu\nu\|\lambda\sigma)+\sum_{\mu\nu\lambda\sigma}\Gamma_{\mu\nu% \lambda\sigma}^{\,\mathrm{lr}}\nabla_{\mathbf{A}}(\mu\nu\|\lambda\sigma)_{% \mathrm{lr}}$
	$\displaystyle+\sum_{\mu\nu}P_{\mu\nu}\nabla_{\mathbf{A}}F^{\mathrm{XC}}_{\mu% \nu}+\sum_{\mu\nu\lambda\sigma}R_{\mu\nu}R_{\lambda\sigma}\nabla_{\mathbf{A}}(% \mu\nu\|\hat{f}_{\mathrm{XC}}\|\lambda\sigma)$	(7)

where $\mathbf{H}_{\mathrm{core}}$ represents the core Hamiltonian matrix, $\mathbf{S}$ represents the overlap matrix, $\mathbf{F}_{\mathrm{XC}}$ represents the Kohn-Sham Fock matrix,Pople et al. (1992) $(\mu\nu|\lambda\sigma)$ and $(\mu\nu|\lambda\sigma)_{\mathrm{lr}}$ denote the ERIs from eqs 3 and 4, respectively, and $V_{\text{nuc}}$ denotes the nuclear repulsion energy. The total density matrix is defined by

\displaystyle\mathbf{P}

\displaystyle=\mathbf{P}_{0}+\mathbf{P}_{\Delta}

(8)

with

$\displaystyle\mathbf{P}_{0}$	$\displaystyle=2\mathbf{C}_{\text{o}}\mathbf{C}_{\text{o}}^{T}$	(9)
$\displaystyle\mathbf{P}_{\Delta}$	$\displaystyle=\widetilde{\mathbf{P}}_{\Delta}+2\mathbf{C}_{\text{v}}\mathbf{Z}% \mathbf{C}_{\text{o}}^{T}$	(10)
$\displaystyle\widetilde{\mathbf{P}}_{\Delta}$	$\displaystyle=\mathbf{C}_{\text{v}}\mathbf{X}^{T}\mathbf{X}\mathbf{C}_{\text{v% }}^{T}-\mathbf{C}_{\text{o}}\mathbf{X}\mathbf{X}^{T}\mathbf{C}_{\text{o}}^{T}$	(11)

where $\mathbf{P}_{0}$ denotes the ground-state density matrix, $\mathbf{P}_{\Delta}$ and $\widetilde{\mathbf{P}}_{\Delta}$ denote the relaxed and unrelaxed difference density matrices, respectively, and $\mathbf{Z}$ describes the virtual–occupied orbital relaxation that can be obtained by solving a set of coupled-perturbed equations described in section 2.2. The transition density matrix is expressed as

\displaystyle\mathbf{R}=\mathbf{C}_{\text{o}}\mathbf{X}\mathbf{C}_{\text{v}}^{T}

(12)

with the molecular orbital coefficient matrix $\mathbf{C}\equiv\left[\mathbf{C}_{\mathrm{o}},\mathbf{C}_{\mathrm{v}}\right]$ , and we write the product of densities as

$\displaystyle\Gamma_{\mu\nu\lambda\sigma}$	$\displaystyle=2R_{\nu\mu}R_{\lambda\sigma}-\alpha R_{\sigma\mu}R_{\lambda\nu}$
	$\displaystyle+\frac{1}{2}\left((P_{\mu\nu}+2P^{\,\Delta}_{\mu\nu})P_{\lambda% \sigma}-\frac{\alpha}{2}(P_{\mu\sigma}+2P^{\,\Delta}_{\mu\sigma})P_{\lambda\nu% }\right)$	(13)
$\displaystyle\Gamma_{\mu\nu\lambda\sigma}^{\,\mathrm{lr}}$	$\displaystyle=-\beta R_{\sigma\mu}R_{\lambda\nu}-\frac{\beta}{4}(P_{\mu\sigma}% +2P^{\,\Delta}_{\mu\sigma})P_{\lambda\nu}$	(14)

The atomic orbital formulation of TDDFT nuclear gradient in eq 2.1 facilitates the integral-direct implementation, enabling the calculation of the four-index ERIs as needed during computation to avoid the heavy $\mathcal{O}(N^{4})$ scaling in memory. We provide the working expressions for the energy-weighted density matrix, $\mathbf{W}$ , ERI derivatives (fourth and fifth terms in eq 2.1) and XC derivatives (sixth and seventh terms in eq 2.1) in sections 2.2, 2.3 and 2.4, respectively.

2.2 Coupled-Perturbed Kohn–Sham Equations

The virtual–occupied orbital relaxation, $\mathbf{Z}$ , representing the first-order derivative of molecular orbital coefficients, is obtained by solving the coupled-perturbed equationGerratt and Mills (1968) using the Z-vector method:Handy and Schaefer (1984)

\displaystyle(\mathbf{d}-\mathbf{A})\mathbf{Z}=\mathbf{L}

(15)

where the matrix elements on the left-hand side are given as

$\displaystyle d_{ai}$	$\displaystyle=\varepsilon_{i}-\varepsilon_{a}$	(16)
$\displaystyle A_{ai,bj}$	$\displaystyle=4(ia\|jb)-\alpha\Big{[}(ij\|ab)+(ib\|aj)\Big{]}$
	$\displaystyle-\beta\Big{[}(ij\|ab)_{\mathrm{lr}}+(ib\|aj)_{\mathrm{lr}}\Big{]}+2% (ia\|\hat{f}_{\mathrm{XC}}\|jb)$	(17)

and the Lagrangian vector on the right-hand side is given as

\displaystyle\mathbf{L}

\displaystyle=\mathbf{C}_{\text{v}}^{T}\mathbf{F}_{\widetilde{\mathbf{P}}_{% \Delta}}\mathbf{C}_{\text{o}}+\mathbf{C}_{\text{v}}^{T}\mathbf{F}_{\mathbf{R}}% \mathbf{C}_{\text{v}}\mathbf{X}^{T}-\mathbf{X}^{T}\mathbf{C}_{\text{o}}^{T}% \mathbf{F}_{\mathbf{R}}\mathbf{C}_{\text{o}}+\mathbf{C}_{\text{v}}^{T}\mathbf{% K}\mathbf{C}_{\text{o}}

(18)

where the Fock-type matrix $\mathbf{F}_{\widetilde{\mathbf{P}}_{\Delta}}$ is defined by

	$\displaystyle F_{\mu\nu}^{\,\widetilde{\mathbf{P}}_{\Delta}}$	$\displaystyle=\sum_{\lambda\sigma}\widetilde{P}^{\,\Delta}_{\sigma\lambda}\Big% {[}2(\mu\nu\|\lambda\sigma)-\alpha(\mu\lambda\|\nu\sigma)-\beta(\mu\lambda\|\nu% \sigma)_{\mathrm{lr}}$
		$\displaystyle+(\mu\nu\|\hat{f}_{\mathrm{XC}}\|\lambda\sigma)\Big{]}$		(19)

through the contraction of $\widetilde{\mathbf{P}}_{\Delta}$ , and $\mathbf{F}_{\mathbf{R}}$ can be defined similarly.

The XC hyperkernel involving higher-order functional derivatives is calculated as

$\displaystyle K_{\mu\nu}$	$\displaystyle=\iiint\phi_{\mu}(\mathbf{r})\phi_{\nu}(\mathbf{r})\Big{(}2{% \mathbf{h}}^{2}f^{\,(2)}_{2}+{g}^{2}f^{\,(3)}_{1}+2{g}(\nabla\rho\cdot{\mathbf% {h}})f^{\,(3)}_{2}$
	$\displaystyle+(\nabla\rho\cdot{\mathbf{h}})^{2}f^{\,(3)}_{3}\Big{)}+\nabla\rho% \cdot\nabla\big{(}\phi_{\mu}(\mathbf{r})\phi_{\nu}(\mathbf{r})\big{)}\Big{(}2{% \mathbf{h}}^{2}f^{\,(2)}_{3}+{g}^{2}f^{\,(3)}_{2}$
	$\displaystyle+2{g}(\nabla\rho\cdot{\mathbf{h}})f^{\,(3)}_{3}+(\nabla\rho\cdot{% \mathbf{h}})^{2}f^{\,(3)}_{4}\Big{)}\,\mathrm{d}\mathbf{r}\mathrm{d}\mathbf{r}% ^{\prime}\mathrm{d}\mathbf{r}^{\prime\prime}$	(20)

with the intermediates defined by

	$\displaystyle{g}$	$\displaystyle=\sum_{\mu\nu}{R}_{\mu\nu}\phi_{\mu}(\mathbf{r}^{\prime})\phi_{% \nu}(\mathbf{r}^{\prime\prime})$		(21)
	$\displaystyle{\mathbf{h}}$	$\displaystyle=\sum_{\mu\nu}{R}_{\mu\nu}\nabla\big{(}\phi_{\mu}(\mathbf{r}^{% \prime})\phi_{\nu}(\mathbf{r}^{\prime\prime})\big{)}$		(22)

and the second- and third-order functional derivatives organized as

$\displaystyle f^{\,(2)}_{2}$	$\displaystyle=\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\gamma_{\alpha% \alpha}}+\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\gamma_{\alpha\beta% }}+\frac{\partial^{2}f}{\partial\rho_{\alpha}\partial\gamma_{\beta\beta}}$	(23)
$\displaystyle f^{\,(2)}_{3}$	$\displaystyle=\frac{\partial^{2}f}{\partial\gamma_{\alpha\alpha}^{2}}+\frac{1}% {2}\frac{\partial^{2}f}{\partial\gamma_{\alpha\beta}^{2}}+2\frac{\partial^{2}f% }{\partial\gamma_{\alpha\alpha}\partial\gamma_{\alpha\beta}}+\frac{\partial^{2% }f}{\partial\gamma_{\alpha\alpha}\partial\gamma_{\beta\beta}}$	(24)
$\displaystyle f^{\,(3)}_{1}$	$\displaystyle=\frac{\partial^{3}f}{\partial\rho_{\alpha}^{3}}+3\frac{\partial^% {3}f}{\partial\rho_{\alpha}^{2}\partial\rho_{\beta}}$	(25)
$\displaystyle f^{\,(3)}_{2}$	$\displaystyle=\frac{\partial^{3}f}{\partial\rho_{\alpha}^{2}\partial\gamma_{% \alpha\alpha}}+\frac{\partial^{3}f}{\partial\rho_{\alpha}^{2}\partial\gamma_{% \alpha\beta}}+\frac{\partial^{3}f}{\partial\rho_{\alpha}^{2}\partial\gamma_{% \beta\beta}}+2\frac{\partial^{3}f}{\partial\rho_{\alpha}\partial\rho_{\beta}% \partial\gamma_{\alpha\alpha}}$
	$\displaystyle+\frac{\partial^{3}f}{\partial\rho_{\alpha}\partial\rho_{\beta}% \partial\gamma_{\alpha\beta}}$	(26)
$\displaystyle f^{\,(3)}_{3}$	$\displaystyle=\frac{\partial^{3}f}{\partial\rho_{\alpha}\partial\gamma_{\alpha% \alpha}^{2}}+\frac{\partial^{3}f}{\partial\rho_{\beta}\partial\gamma_{\alpha% \alpha}^{2}}+\frac{\partial^{3}f}{\partial\rho_{\alpha}\partial\gamma_{\alpha% \beta}^{2}}+2\frac{\partial^{3}f}{\partial\rho_{\alpha}\partial\gamma_{\alpha% \alpha}\partial\gamma_{\alpha\beta}}$
	$\displaystyle+2\frac{\partial^{3}f}{\partial\rho_{\beta}\partial\gamma_{\alpha% \alpha}\partial\gamma_{\alpha\beta}}+2\frac{\partial^{3}f}{\partial\rho_{% \alpha}\partial\gamma_{\alpha\alpha}\partial\gamma_{\beta\beta}}$	(27)
$\displaystyle f^{\,(3)}_{4}$	$\displaystyle=\frac{\partial^{3}f}{\partial\gamma_{\alpha\alpha}^{3}}+\frac{1}% {2}\frac{\partial^{3}f}{\partial\gamma_{\alpha\beta}^{3}}+3\frac{\partial^{3}f% }{\partial\gamma_{\alpha\alpha}^{2}\partial\gamma_{\alpha\beta}}+3\frac{% \partial^{3}f}{\partial\gamma_{\alpha\alpha}^{2}\partial\gamma_{\beta\beta}}$
	$\displaystyle+3\frac{\partial^{3}f}{\partial\gamma_{\alpha\beta}^{2}\partial% \gamma_{\alpha\alpha}}+3\frac{\partial^{3}f}{\partial\gamma_{\alpha\alpha}% \partial\gamma_{\alpha\beta}\partial\gamma_{\beta\beta}}$	(28)

This can be implemented straightforwardly by repurposing the existing routine for the XC kernel used in the TDDFT energy computation.

To solve eq 15 without the need to store $\mathbf{A}$ of large dimension $N_{\mathrm{o}}N_{\mathrm{v}}\times N_{\mathrm{o}}N_{\mathrm{v}}$ , we employ the Krylov subspace method. In this approach, the orthogonal subspace $\mathbf{b}\equiv\{\mathbf{b}_{0},\mathbf{b}_{1},\cdots\}$ is iteratively augmented to span the basis space for $\mathbf{Z}$ until the residual is acceptably small.Pople et al. (1979) We rewrite eq 15 to adapt it into the Krylov method as

\displaystyle(\mathbf{I}-\mathbf{A}^{\prime})\mathbf{Z}=\mathbf{b}_{0}

(29)

where $\mathbf{A}^{\prime}=\mathbf{d}^{-1}\mathbf{A}$ and $\mathbf{b}_{0}=\mathbf{d}^{-1}\mathbf{L}$ . At the $k$ -th iteration, the $(k{+}1)$ -dimensional subspace linear equation is solved to obtain $\mathbf{Z}_{k}$ :

\displaystyle\widetilde{\mathbf{A}}_{k}\mathbf{a}_{k}=\mathbf{x}_{k}

(30)

where the elements are defined as

	$\displaystyle\widetilde{A}_{i,j}^{\,(k)}$	$\displaystyle=\mathbf{b}_{i}^{T}\mathbf{b}_{j}-\mathbf{b}_{i}^{T}\widetilde{% \boldsymbol{\sigma}}_{j}$		(31)
	$\displaystyle x_{i}^{\,(k)}$	$\displaystyle=\delta_{0i}\mathbf{b}_{i}^{T}\mathbf{b}_{i}$		(32)

The scaled matrix-vector product $\widetilde{\boldsymbol{\sigma}}_{k}\equiv\mathbf{d}^{-1}\mathbf{A}\mathbf{b}_{k}$ is calculated directly as in the Fock build in TDDFT:

	$\displaystyle\boldsymbol{\sigma}_{k}$	$\displaystyle=\mathbf{A}\mathbf{b}_{k}$		(33)
		$\displaystyle=\mathbf{C}_{\text{v}}^{T}\widetilde{\mathbf{F}}_{k}\mathbf{C}_{% \text{o}}$

Here, we use a slightly modified expression for the Fock-type matrix, derived from eq 2.2, written as

	$\displaystyle\widetilde{F}_{\mu\nu}^{\,(k)}$	$\displaystyle=\sum_{\lambda\sigma}B_{\sigma\lambda}^{\,(k)}\Big{[}4(\mu\nu\|% \lambda\sigma)-\alpha\,\Big{(}(\mu\lambda\|\sigma\nu)+(\mu\sigma\|\lambda\nu)% \Big{)}$
		$\displaystyle-\beta\,\Big{(}(\mu\lambda\|\sigma\nu)_{\mathrm{lr}}+(\mu\sigma\|% \lambda\nu)_{\mathrm{lr}}\big{)}+2(\mu\nu\|\hat{f}_{\mathrm{XC}}\|\lambda\sigma)% \Big{]}$		(34)

where $\mathbf{B}_{k}\equiv\mathbf{C}_{\text{v}}\mathbf{b}_{k}\mathbf{C}_{\text{o}}^{T}$ .

Then, $\mathbf{Z}$ can be approximated as

\displaystyle\mathbf{Z}

\displaystyle\simeq\mathbf{Z}_{k}=\sum_{i=0}^{k}a_{i}\mathbf{b}_{i}

(35)

If the residual is not sufficiently small, an additional subspace vector is constructed,

\displaystyle\mathbf{b}_{k+1}

\displaystyle=\widetilde{\boldsymbol{\sigma}}_{k}-\sum_{i=0}^{k}\frac{\mathbf{% b}_{i}^{T}\widetilde{\boldsymbol{\sigma}}_{k}}{\mathbf{b}_{i}^{T}\mathbf{b}_{i% }}\mathbf{b}_{i}

(36)

and the iteration continues.

Finally, the energy-weighted density matrix can be obtained through matrix multiplications as

$\displaystyle\mathbf{W}$	$\displaystyle=\mathbf{C}_{\mathrm{o}}\operatorname{diag}(\boldsymbol{% \varepsilon}_{\mathrm{o}})\,(2\mathbf{I}-\mathbf{X}\mathbf{X}^{T})\mathbf{C}_{% \mathrm{o}}^{T}+\mathbf{C}_{\mathrm{v}}\operatorname{diag}(\boldsymbol{% \varepsilon}_{\mathrm{v}})\mathbf{X}^{T}\mathbf{X}\mathbf{C}_{\mathrm{v}}^{T}$
	$\displaystyle+2\mathbf{C}_{\mathrm{v}}\mathbf{Z}\operatorname{diag}(% \boldsymbol{\varepsilon}_{\mathrm{o}})\mathbf{C}_{\mathrm{o}}^{T}+\frac{1}{4}% \mathbf{P}_{0}^{T}(\mathbf{F}_{\mathbf{P}_{\Delta}}+\mathbf{K})\mathbf{P}_{0}+% \frac{1}{2}\mathbf{P}_{0}\mathbf{F}_{\mathbf{R}}^{T}\mathbf{R}^{T}$
	$\displaystyle+\mathbf{C}_{\mathrm{v}}\mathbf{C}_{\mathrm{v}}^{T}\mathbf{F}_{% \mathbf{R}}^{T}\mathbf{R}+\mathbf{R}^{T}\mathbf{F}_{\mathbf{R}}\mathbf{P}_{0}$	(37)

where $\boldsymbol{\varepsilon}_{\mathrm{o}}$ and $\boldsymbol{\varepsilon}_{\mathrm{v}}$ denote the orbital energies of the occupied and virtual molecular orbitals, respectively.

2.3 ERI Derivatives and Their GPU-Acceleration

The Cartesian Gaussian function centered on the $A$ -th atom is given by

	$\displaystyle\phi_{\mu}(\mathbf{r};\zeta,\mathbf{l},\mathbf{A})$	$\displaystyle=N(x-A_{x})^{l_{x}}(y-A_{y})^{l_{y}}(z-A_{z})^{l_{z}}$		(38)
		$\displaystyle\times\exp(-\zeta\|\mathbf{r}-\mathbf{A}\|^{2})$

where $N$ denotes the normalization constant, $\mathbf{l}\equiv(l_{x},l_{y},l_{z})$ represents the Cartesian quantum numbers, and $\zeta$ denotes the exponent. For the given angular momentum, $l\equiv l_{x}+l_{y}+l_{z}$ , there are $n_{\mathrm{cart}}=(l+1)(l+2)/2$ Cartesian functions or $n_{\mathrm{sph}}=2l+1$ spherical functions, the latter of which being easily be obtained via a spherical transformation of the former. The derivative of a Cartesian Gaussian function can be expressed as a linear combination of functions with higher and lower angular momentum. For instance, the $x$ -derivative can be written as

\displaystyle\nabla_{x}\phi_{\mu}^{l_{x},l_{y},l_{z}}=l_{x}\phi_{\mu}^{l_{x}-1% ,l_{y},l_{z}}-2\zeta\phi_{\mu}^{l_{x}+1,l_{y},l_{z}}

(39)

Because the Gaussian functions are atom-centered, the ERI gradient with respect to the $E$ -th atom can be expressed as

	$\displaystyle\nabla_{\mathbf{E}}(\mu_{A}\nu_{B}\|\lambda_{C}\sigma_{D})$	$\displaystyle=-\delta_{EA}(\mu^{\prime}_{A}\nu_{B}\|\lambda_{C}\sigma_{D})-% \delta_{EB}(\mu_{A}\nu^{\prime}_{B}\|\lambda_{C}\sigma_{D})$
		$\displaystyle-\delta_{EC}(\mu_{A}\nu_{B}\|\lambda^{\prime}_{C}\sigma_{D})-% \delta_{ED}(\mu_{A}\nu_{B}\|\lambda_{C}\sigma^{\prime}_{D})$		(40)

where the subscript denotes the atomic center, and the prime denotes the Cartesian derivative (i.e., $\mu^{\prime}_{A}\equiv\nabla\phi_{\mu}$ ). The derivatives of long-range ERIs can be defined similarly. Moreover, one of the derivative terms in eq 2.3 can be obtained at no cost by using translational invariance, $\sum_{E}\nabla_{\mathbf{E}}(\mu_{A}\nu_{B}|\lambda_{C}\sigma_{D})=0$ ;Komornicki et al. (1977) for instance, when all the centers are different, we can write

	$\displaystyle(\mu^{\prime}_{A}\nu_{B}\|\lambda_{C}\sigma_{D})$	$\displaystyle=-(\mu_{A}\nu^{\prime}_{B}\|\lambda_{C}\sigma_{D})-(\mu_{A}\nu_{B}% \|\lambda^{\prime}_{C}\sigma_{D})$
		$\displaystyle-(\mu_{A}\nu_{B}\|\lambda_{C}\sigma^{\prime}_{D})$		(41)

This reduces the computational cost for ERI derivatives by at least one fourth if $\phi_{\mu}$ was chosen for the highest angular momentum. With the aid of permutation symmetry in ERI, we placed the atomic orbital of the highest angular momentum at the first index, $\mu$ , in our code to directly use the above equation.

In the integral-direct algorithm, the contributions to nuclear gradients up to the four centers comprising an ERI are calculated. Focusing on the case where all the four centers are different for simplicity, the update of $\nabla_{\mathbf{A}}E$ in eq 2.1 by the ERI derivatives with the given $\phi_{\mu}$ , $\phi_{\nu}$ , $\phi_{\lambda}$ and $\phi_{\sigma}$ can be expressed as

	$\displaystyle\nabla_{A_{x}}E$	$\displaystyle\mathrel{{+}{=}}\Gamma_{\mu\nu\lambda\sigma}\left[2\zeta(\mu_{A}^% {+}\nu_{B}\|\lambda_{C}\sigma_{D})-l_{x}(\mu_{A}^{-}\nu_{B}\|\lambda_{C}\sigma_{% D})\right]$
		$\displaystyle\,\,\,\,+\Gamma_{\mu\nu\lambda\sigma}^{\,\mathrm{lr}}\left[2\zeta% (\mu_{A}^{+}\nu_{B}\|\lambda_{C}\sigma_{D})_{\mathrm{lr}}-l_{x}(\mu_{A}^{-}\nu_% {B}\|\lambda_{C}\sigma_{D})_{\mathrm{lr}}\right]$		(42)

where the superscripts $+$ and $-$ denote the raising and lowering of Cartesian quantum numbers, i.e., $(l_{x}\pm 1,l_{y},l_{z})$ . $\nabla_{A_{y}}E$ and $\nabla_{A_{z}}E$ can be expressed in an identical manner, and similar expressions can be readily obtained for the other three centers. Generally, $\phi_{\mu}$ is described by a set of Gaussian functions as

\displaystyle\phi_{\mu}(\mathbf{r})=\sum_{k}^{K_{\mu}}d_{k}\varphi_{k}(\mathbf% {r})

(43)

where $K_{\mu}$ denotes the degree of contraction, and $d_{k}$ denotes the coefficient. Therefore, for a given ERI $(\mu\nu|\lambda\sigma)$ , nested loops over each contraction occur with a total loop-count of $K_{\mu}{\cdot}K_{\nu}{\cdot}K_{\lambda}{\cdot}K_{\sigma}$ , each evaluating a batch of ERI derivatives with a dimension of $3\times n_{\mu}\times n_{\nu}\times n_{\lambda}\times n_{\sigma}$ in the code where $n$ denotes $n_{\mathrm{cart}}$ or $n_{\mathrm{sph}}$ .

To simultaneously achieve concurrency of GPU kernels and load-balancing within the distributed GPUs while minimizing the thread divergence, we have extended the multi-GPU ERI algorithm described in our previous report.Kim et al. (2023) The list of non-zero shell-pairs is presorted according to the degree of contractions, $K$ , and the Schwarz upper-bound such that when two lists are combined to form an ERI supermatrix, each thread-blocks of optimal dimension will likely exhibit identical loop-counts for maximum thread concurrency within the GPU kernel. Block-cyclic distributions of the thread-blocks are used to achieve the load-balance between distributed GPUs. The non-redundant ERI types according to the sequence of the angular momenta in indices are computed sequentially to mitigate the thread divergence. For instance, there are a total of 21 GPU kernels each representing a specific ERI type for the atomic orbitals with $l\leq 2$ .

The GPU implementation of eq 2.3 is shown in Figure 1. For all ERI calculations, we employed the Rys quadrature methodDupuis et al. (1976) since it requires the least memory footprint compared to other algorithms based on the recursion formula. The Rys roots and weights that correspond to the higher angular momentum ERI are reused for the ERI with lower angular momenta. Thus, the number of Rys roots is given by $n_{\mathrm{rys}}=(l_{\mu}+l_{\nu}+l_{\lambda}+l_{\sigma}+1)/2+1$ , which is used for $(\mu\nu^{+}|\lambda\sigma)$ , $(\mu\nu|\lambda^{+}\sigma)$ , $(\mu\nu|\lambda\sigma^{+})$ , $(\mu\nu^{-}|\lambda\sigma)$ , $(\mu\nu|\lambda^{-}\sigma)$ and $(\mu\nu|\lambda\sigma^{-})$ .

Refer to caption — Figure 1: Sketch of the algorithm used to compute the ERI derivative on GPU.

While it is technically possible to reuse the ERI routines for higher angular momenta to calculate the ERI derivatives, such an implementation would introduce complexity in exploiting the 8-fold permutation symmetry as well as utilizing the translational invariance, which can effectively remove the most computationally intensive derivatives. Therefore, we have explicitly constructed independent GPU kernels for the ERI derivatives.

2.4 Numerical Integration of XC Derivatives on GPUs

The XC functionals and their $n$ -th order derivatives can be calculated by numerical integration over well-defined quadrature grids as

\displaystyle I=\sum_{q}w_{q}f^{\,(n)}(\mathbf{r}_{q})

(44)

where $w_{q}$ denotes the grid weight at point $\mathbf{r}_{q}$ . In the case of polyatomic systems, atom-centered spherical grids can be constructed by combining a radial grid and an angular grid. We employed the Euler–Maclaurin gridMurray et al. (1993) for the radial partitioning and the Lebedev gridLebedev (1977) for the angular partitioning. The numerical stability of discrete summation can be maintained by Becke’s space partitioning method,Becke (1988) in which weights are obtained via the fuzzy cell construction.

For the given quadrature point at $\mathbf{r}_{q}$ , the updating expression for the nuclear gradient by the XC contribution in eq 2.1, can be written as

$\displaystyle\nabla_{\mathbf{A}}E$	$\displaystyle\mathrel{{+}{=}}-2w_{q}\sideset{}{{}^{\prime}}{\sum}_{\mu}\sum_{% \nu}\bigg{[}P_{\mu\nu}\Big{(}f^{\,(1)}_{1}\mathbf{x}+\frac{1}{2}f^{\,(1)}_{2}% \mathbf{y}\Big{)}+2R_{\mu\nu}\Big{(}f^{\,(1)}_{2}\mathbf{z}$
	$\displaystyle+f^{\,(2)}_{1}{g}\mathbf{x}+f^{\,(2)}_{2}\big{(}g\mathbf{y}+({% \mathbf{h}}\cdot\nabla\rho)\mathbf{x}\big{)}+f^{\,(2)}_{3}({\mathbf{h}}\cdot% \nabla\rho)\mathbf{y}\Big{)}$
	$\displaystyle+\frac{1}{2}P^{\,0}_{\mu\nu}\Big{(}f^{\,(1)}_{2}\mathbf{z}_{% \Delta}+f^{\,(2)}_{1}g_{\Delta}\mathbf{x}+f^{\,(2)}_{2}\big{(}g_{\Delta}% \mathbf{y}+(\mathbf{h}_{\Delta}\cdot\nabla\rho)\mathbf{x}$
	$\displaystyle+2\mathbf{h}^{2}\mathbf{x}+4g\mathbf{z}\big{)}+f^{\,(2)}_{3}\big{% (}2\mathbf{h}^{2}\mathbf{y}+(\mathbf{h}_{\Delta}\cdot\nabla)\rho\mathbf{y}+4(% \mathbf{h}\cdot\nabla\rho)\mathbf{z}\big{)}$
	$\displaystyle+f^{\,(3)}_{1}g^{2}\mathbf{x}+f^{\,(3)}_{2}\big{(}g^{2}\mathbf{y}% +2g(\mathbf{h}\cdot\nabla\rho)\mathbf{x}\big{)}+f^{\,(3)}_{3}\big{(}(\mathbf{h% }\cdot\nabla\rho)^{2}\mathbf{x}$
	$\displaystyle+2g(\mathbf{h}\cdot\nabla\rho)\mathbf{y}\big{)}+f^{\,(3)}_{4}(% \mathbf{h}\cdot\nabla\rho)^{2}\mathbf{y}\Big{)}\bigg{]}$	(45)

where the primed sum ( $\Sigma^{\prime}$ ) indicates that the sum runs over the atomic orbitals centered on $A$ . The functional derivatives are defined as

$\displaystyle f^{\,(1)}_{1}$	$\displaystyle=\frac{\partial f}{\partial\rho_{\alpha}}$	(46)
$\displaystyle f^{\,(1)}_{2}$	$\displaystyle=2\frac{\partial f}{\partial\gamma_{\alpha\alpha}}+\frac{\partial f% }{\partial\gamma_{\alpha\beta}}$	(47)
$\displaystyle f^{\,(2)}_{1}$	$\displaystyle=\frac{\partial^{2}f}{\partial\rho_{\alpha}^{2}}+\frac{\partial^{% 2}f}{\partial\rho_{\alpha}\partial\rho_{\beta}}$	(48)

The intermediates contracting the transition density matrix are defined as

	$\displaystyle g_{\Delta}$	$\displaystyle=\sum_{\mu\nu}{P}^{\,\Delta}_{\mu\nu}\phi_{\mu}\phi_{\nu}$		(49)
	$\displaystyle\mathbf{h}_{\Delta}$	$\displaystyle=\sum_{\mu\nu}{P}^{\,\Delta}_{\mu\nu}\nabla(\phi_{\mu}\phi_{\nu})$		(50)

The recurring vectors are defined as

$\displaystyle\mathbf{x}$	$\displaystyle=\nabla\phi_{\mu}\cdot\phi_{\nu}$	(51)
$\displaystyle\mathbf{y}$	$\displaystyle=[\phi_{\nu}\nabla(\nabla\phi_{\mu})^{T}+\nabla\phi_{\mu}(\nabla% \phi_{\nu})^{T}]\nabla\rho$	(52)
$\displaystyle\mathbf{z}$	$\displaystyle=[\phi_{\nu}\nabla(\nabla\phi_{\mu})^{T}+\nabla\phi_{\mu}(\nabla% \phi_{\nu})^{T}]{\mathbf{h}}$	(53)
$\displaystyle\mathbf{z}_{\Delta}$	$\displaystyle=[\phi_{\nu}\nabla(\nabla\phi_{\mu})^{T}+\nabla\phi_{\mu}(\nabla% \phi_{\nu})^{T}]\mathbf{h}_{\Delta}$	(54)

We note that the weight derivative has been omitted as it is often negligible in the case of nuclear gradients when sufficiently fine grids are used.

Figure 2 shows the GPU implementation of the XC nuclear gradients. Practically, owing to the locality of Gaussian functions only the atomic orbitals in the vicinity of a quadrature point, which can be predetermined by a suitable gauging, are relevant in computing eq 2.4. In other words, a sufficiently small cluster of quadrature points will share most of these atomic orbitals. Therefore, within GPU kernel, concurrency can be easily achieved by progressively dividing the entire grid into octants until all the points inside any octant can be assigned to a one-dimensional thread-block. By predetermining the atomic orbitals that contribute to the the thread-block (equivalently, a box containing the grid points in the vicinity), we not only eliminate running the entire list of atomic orbitals, but more importantly the threads can loop over the same list of significant atomic orbitals, maximizing the concurrency.

Computational cost can be reduced by precomputing atomic orbital-independent intermediates for each $\mathbf{r}_{q}$ , ${g}$ (eq 21) and ${\mathbf{h}}$ (eq 22) contracting $\mathbf{R}$ , and $g_{\Delta}$ (eq 49) and $\mathbf{h}_{\Delta}$ (eq 50) contracting ${\mathbf{P}}_{\Delta}$ . From the TDDFT energy calculations, $\rho$ and $\nabla\rho$ for all $\mathbf{r}_{q}$ can be stored and reused. Therefore, the precomputed quantities require a total of 12 $n_{q}$ storage, where $n_{q}\equiv n_{\mathrm{atom}}{\cdot}n_{\mathrm{rad}}{\cdot}n_{\mathrm{ang}}$ is the number of quadrature points without grid pruning. We provide selected cases in Table S2, affirming that the current GPU memory size is sufficient even for a very large molecular system.

2.5 Implementation

The GPU algorithm for the TDDFT nuclear gradient, as presented in the previous sections, was implemented in an in-house code. Our computational framework is based on core parts of the KPACK code,Kim and Lee (2013) a two-component relativistic quantum chemistry program, and has been completely rewritten in Fortran 2003/2008 to accommodate the modern object-oriented features. The GPU code has been written in CUDA FortranCUDA Fortran Programming Guide, Nvidia using the extensions to the Fortran language with several keywords and APIs to offload computations to GPUs. To ensure efficient GPU utilization with XC functionals, the C code of selected functionals from LibXC 5.2.3Lehtola et al. (2018) has been converted into CUDA Fortran and integrated with our code. We validated our implementation by comparing analytic gradients with numerical gradients obtained from the finite difference method.

To design an efficient parallel model for our distributed, high-performance GPU systems, as outlined in our previous report,Kim et al. (2023) we adopted a hybrid MPI/OpenMP model. We leveraged the message-passing interface (MPI) using OpenMPI 4.1.5 with the multithreaded Unified communication-X (UCX) 1.12.1 for both point-to-point and one-sided communications between the nodes. On each node, CPU-to-GPU bindings were achieved using OpenMP parallel interfaces. Additionally, we utilized the single-node multi-GPU linear algebra libraries of Nvidia’s cuBlasXt 11.11 and cuSolverMg 11.4 for matrix multiplications and diagonalizations, respectively. The Nvidia HPC SDK 22.11 compiler suite with the “-fast” optimization flag was used for code compilation.

Our code employs a symmetric orthonormalization procedure with a threshold of $10^{-6}$ au for defining orthonormal atomic orbitals. The SCF convergence threshold was set as the root-mean-squared (rms) difference of $10^{-6}$ au between the density matrices of two consecutive SCF iterations. A density-multiplied Schwarz upper bound of $10^{-10}$ au was used as the threshold for screening ERIs and their derivatives. In numerical integration, a density threshold of $10^{-10}$ au for the grid point was applied. Throughout this work, we used a (50, 194)-grid, combining a 50-point radial grid with a 194-point angular grid per atom. A residual threshold of $10^{-5}$ au was used in both Davidson diagonalization and Z-vector iteration. Currently, all one-electron integrals and their derivatives are computed using CPUs over the hybrid OpenMP/MPI model.

3 Application

\begin{overpic}[height=170.71652pt,trim=170.71652pt 85.35826pt 142.26378pt 71.% 13188pt]{gfp.pov} \put(74.0,32.0){ \includegraphics[height=68.28644pt,trim=128.0374pt 0.0pt 0.0pt 0.0pt,clip]{gfp% } } \put(101.0,34.0){\normalsize HBI} \end{overpic}

Figure 3: Green fluorescent protein (GFP) in the presence of water solvent molecules. The HBI chromophore is shown inside the protein cage, which is visually aided with ribbon representation.

In this section, we showcase the capability and performance of our TDDFT implementation by calculating the excited-state nuclear gradient of a very large molecular system, green fluoresecent protein (GFP). GFP is a naturally occurring biological protein extracted from Aequorea victoria, and has become a commonly used tool in molecular and cell biology for labeling and visualizing proteins in living cells.Zimmer (2002)

As shown in Figure 3, within the rigid 11-stranded $\beta$ -barrel comprising 238 amino acids, the p-hydroxybenzylidene-imidazolinone (HBI) chromophore converts blue light to green light. The neutral and anionic forms of the chromphore are responsible for the light absorption in the ultraviolet and blue regions, respectively. The HBI chromophore consists of an imidazolone and a phenol moiety connected via a methine bridge.

In previous theoretical studies of GFP, to characterize the relevant excited-state, researchers utilized quantum chemical methods on the chromophore independently or with electrostatic/polarizable embedding of the surrounding environment via molecular mechanics by the so-called QM/MM model.Laino et al. (2004); Epifanovsky et al. (2009); Nåbo et al. (2017) Of course, quantum chemical methods should outperform the empirical force field when compared on an equal footing,Kulik et al. (2012) and their usage could potentially benefit as distant residues can also affect the excitation energies and properties of the chromophore.Schwabe et al. (2015) However, increasing the QM region is nontrivial, requiring chemical intuitions of domain experts, and tends to approach slowly to the accuracy of the large QM asymptotic limit.Kulik et al. (2016)

Moreover, when applied to large systems, QM methods can experience challenges associated with the specific model chemistry. TDDFT calculations with the pure or global hybrid functionals often encounter low-lying spurious charge-transfer states stemming from an unphysically small band gap, an artifact of the underestimated exchange energy at long distance,Rudberg (2012) and partially from unequilibrated electrostatic potential on surfaces.Lever et al. (2013) Using RSH functionals with correct long-distance behavior described by 100% HF exchange is suitable for large systems, as well as including solvent effect via implicit solvation models or explicit solvents, providing an effective solution for reliable results from large-scale TDDFT calculation.

We employed the geometry of GFP prepared in ref 85 from the X-ray diffraction structure (PDB ID: 1W7S)Van Thor et al. (2005) with minor refinements to a few misplaced atoms.Kim et al. (2023) This system represents a dense three-dimensional structure (Figure 3). For calculations, the $\omega$ B97X RSH GGA functionalChai and Head-Gordon (2008) was chosen, as it has demonstrated to provide reliable results over a wide range of interactions.Mardirossian and Head-Gordon (2017) The default parameters of $\omega$ B97X were used: $\alpha=0.158$ , $\beta=0.842$ , $\omega=0.3$ bohr^-1. Explicit inclusion of water solvent molecules and utilization of the double-zeta plus polarization basis set, def2-SVP,Weigend and Ahlrichs (2005) afforded a more realistic model. The total numbers of atoms and basis functions were 4353 and 40 518, respectively, representing the largest attempt to compute the nuclear gradient of an excited state at the TDDFT level to our knowledge.

3.1 Excited State of Green Fluorescent Protein

Table 1 presents the calculated results of the full-scale GFP system at the TDA-TDDFT/ $\omega$ B97X level with def2-SVP basis sets, alongside QM/MM results using the ONIOM model,Svensson et al. (1996) aiming to unveil the long-distance quantum mechanical effect in the excited state.

Table 1: Calculated results of GFP at the TDA-TDDFT level

	HOMO / LUMO	Band gap	S₁	TDM
method	(eV)	(eV)	(eV)^a	(debye)^b
ONIOM^c	$-$ 3.21 / 4.20	7.41	3.80	7.74
$\omega$ B97X	$-$ 3.80 / 2.78	6.58	3.62	7.07

^a Vertical excitation energy. ^b Norm of transition dipole moment vector. ^c The QM region containing the HBI chromophore (43 atoms) was treated by the $\omega$ B97X level, while the Amber force fields (ref 91) were used to describe the MM region. The QM/MM calculations were performed using the Gaussian 16.C.01 program (ref 52).

The orbital energies of the two approaches exhibit significant differences, which can be attributed to the full-scale calculation representing the extended system. What is more important is that the band gap of $\omega$ B97X remains sufficiently large, rectifying the band gap underestimation observed with global hybrid functionals. In a previous study, we obtained the band gap of 1.88 eV with B3LYP in the full-scale calculation, with computationally accessible low-lying singlet states identified as spurious charge-transfer states between the chromophore and a distant residue.Kim et al. (2023) The S₁ state from TDA-TDDFT yielded a vertial excitation at 3.62 eV consistent with experimental observations, where the broadband absorption by the neutral chromophore ranges from 330 to 450 nm (2.76–3.75 eV).Heim et al. (1994) The QM/MM calculation predicted the same excitation to be 0.18 eV higher.

The origin of the difference in the S₁ energies of the full-scale and QM/MM calculations in this particular case can be inferred from the natural transition orbitals (NTOs).Martin (2003) Figure 4 shows the NTOs upon S₁ excitation, and for visual clarity, the chromophore and a neighboring residue are depicted. Hole and particle wave functions of the S₁ state clearly indicate that the S₁ state can be largely described as a local $\pi$ – $\pi^{\ast}$ excitation within the chromophore. Interestingly, a small participation of charge-transfer from the $\pi$ -density of a nearby residue can be seen, and presumably this has stabilized the S₁ state, leading to the transition dipole moment calculated at the $\omega$ B97X level that is noticeably smaller than that of the QM/MM calculation (Table 1). It is unclear how such excited-state details would affect the actual dynamics, and the full implications remains to be determined, but key long-range interactions that necessitate large-scale calculations can be captured in this approach.

As excited-state dynamics to the first-order is governed by the nuclear forces on the potential energy surface, we visualized in Figure 5 the extracted S₁ nuclear gradient by computing the difference between excited- and ground-state gradients. Since the employed geometry has not been optimized, after subtracting the ground-state portion from the total gradient (eq 2), the resulting gradient reflects a change in the potential energy surface upon the electronic excitation.

A noticeable difference between the nuclear gradients from $\omega$ B97X and ONIOM calculations is observed on the carbon atom on the imidazolone connected to the methine bridge, suggesting the distant residue effect. Indeed, similar visualizations with greater details showed non-negligible participation of distant residues, supporting the notion that excited-state dynamics can encompass the entire protein (Figure S1). Not unexpectedly, the electronic embedding in the ONIOM calculation also captured the nonvanishing S₁–S₀ difference gradient outside the chromophore, but differing significantly from the $\omega$ B97X result (Figure S2). Focusing on relatively larger forces (i.e., the negative gradient), centered on the chromophore, the directions closely follow the qualitative interpretation of the NTOs, wherein the bonding-to-anti-bonding character change is reflected (Figure 4). For instance, the $\pi$ -bonding orbital between the imidazolone and the methine bridge becomes an anti-bonding orbital with depleted electron density, suggesting a decrease in bond order and a bond elongation. Fundamentally, the excited-state dynamics is governed by the total nuclear gradient encompassing the ground-state component, which has been handily ignored in our interpretation.

3.2 Multi-GPU Parallel Performance

Figure 6 shows the speed-up and parallel efficiency as a function of the number of GPUs for the TDDFT gradient calculation of the GFP system. The parallel efficiency was measured by the ratio with respect to the 1-GPU wall time while doubling the number of GPUs. Excellent parallel efficiency of $>$ 95% is observed within a single node comprising 8 GPUs. Parallel efficiencies gradually decrease to 31% at 256 GPUs, and these numbers are very close to the parallel efficiency of the TDDFT energy calculation in our previous report,Kim et al. (2023) suggesting that the same number of GPUs can be used for both energy and gradient calculations without incurring computational imbalance. The relative timings of our implementation of TDDFT gradient with the a priori DFT and TDDFT energy calculations were $1:1.3:3.1$ , respectively, further assuring that all the calculations yield comparable timings in the same order of magnitude (Table S3). In our largest computation with 256 A100 GPUs, the TDDFT gradient calculation took a mere 1.2 h of time, empowering practical applications of TDDFT for very-large systems.

Table 2: Timings for the S₁ gradient calculation of GFP^a

nGPU^b	total (s)^c	Z-vector (s)	$\nabla$ ERI (s)	$\nabla$ XC (s)	$\frac{\nabla{\text{ERI}}}{\nabla{\text{XC}}}$
1	343 643 (100.0%)	167 503	112 421	1 453	77.4
2	172 857 (99.4%)	84 239	56 430	739	76.4
4	87 064 (98.7%)	42 474	28 243	384	73.6
8	44 457 (96.6%)	21 667	14 207	225	63.1
16	23 580 (91.1%)	11 719	7 191	133	54.2
32	12 894 (83.3%)	6 571	3 684	80	46.2
64	7 484 (71.7%)	3 895	1 935	59	33.0
128	4 903 (54.8%)	2 716	1 065	41	26.0
256	4 263 (31.5%)	2 360	696	49	14.1

^a Measured at the TDA-TDDFT level with $\omega$ B97X functional and def2-SVP basis set. Calculations were restarted with the converged molecular orbitals and excitation amplitudes. ^b Nvidia A100-80GB GPUs were used. ^c Parallel efficiency with respect to 1-GPU utilization is given in parentheses.

Table 2 provides a detailed timing analysis of wall time by decomposing it into individual components. The ERI gradient exhibited an excellent performance, while the performance of the XC gradient reached a saturation point at approximately 128 GPUs. Furthermore, at 1 GPU, the ratio of workloads between ERI gradient to XC gradient is 77.4, and it decreases to 14.1 at 256 GPUs. In fact, with 256 GPUs, out of overall 49 s, the XC gradient kernel merely takes 8.6 s (2.3 s for intermediates, 6.3 s for the gradient), implying that most of the computational cost is allotted to the GPU preparation. In other words, the XC gradient quickly reaches a plateau due to the relatively short execution time of the GPU kernel, and almost all times are spent on data management between devices, leading to strong scaling stagnation almost immediately. Better data management is thus important for further code optimization. On the other hand, the ERI gradient has the most workloads in the entire gradient calculation, and our load balancing scheme showed better scalability. Clearly, the Z-vector portion remains approximately 50% for all number of GPUs considered in this work, rendering it the most time-consuming process. The Z-vector iteration can be considered as the direct matrix-vector product formation (eq 33) in the Davidson diagonalization for TDDFT energies.Bauernschmitt and Ahlrichs (1996); Leininger et al. (2001)

Figure 7 shows the convergence profile of the Z-vector iterations for solving the coupled-perturbed equations during the gradient evaluation. Convergence was attained in 10 iterations, during which the residual norm decreased exponentially. The norm of the additional subspace vector $\mathbf{b}_{k}$ also becomes exponentially smaller as the iterations proceeded. As an undesirable side effect, the product of the density matrix $\mathbf{B}_{k}$ (in eq 2.2) and the Schwarz upper bound of ERI, which is used as the screening parameter, unintentionally neglect important ERI calculations in practice, potentially hampering convergence. To mitigate such numerical instability at the expense of losing the benefit of efficent density screening, we normalized $\mathbf{b}_{k}$ before the Fock build in eq 33, after which it was back-converted to the original norm. We suggest that a dynamical scheme to adjust the norm by comparing with previous iterations can be helpful in reinstating density screening to accelerate the Z-vector iteration.

Figure 8 shows the time taken by all GPU kernels for ERI derivatives grouped into three clusters based on the highest angular momentum before the differentiation. Although approximately 45% of the cost is attributed to the computation of certain derivatives, $(pp|ps)$ , $(dd|ps)$ and $(dp|pp)$ in this particular case of the excited-state of GFP, this dependence is not only determined by how the basis functions are defined in terms of the number of Gaussian functions in the given angular momentum and their contraction details, which determine the loop-counts in the code, but also by the nature of the excited state. The latter determines the pattern of nonzero elements in the density matrix used to screen negligble ERIs prior to computation, which is also an important factor. Interestingly, adding a single $d$ -polarization function to heavy atoms as has been done in the employed def2-SVP basis set, increased the total wall time for ERI derivatives by 2.4-fold. This represents, in some respect, the necessary cost towards more realistic calculations.

To provide deeper insights into the fundamental performance of these GPU kernels, we performed a roofline analysis,Williams et al. (2009) as depicted in Figure 9. The GPU kernels for ERI derivatives are distributed across the memory- and compute-bound regions, which are bounded by the memory bandwidth and the peak performance, $R_{\mathrm{peak}}$ . As the involved angular momenta increase, denoted by their sum in the figure, the kernels are shifted from the compute-bound to the memory-bound regions.

As seen in the compute-bound region in Figure 9, the GPU kernels of ERI derivatives utilize approximately 15% of $R_{\mathrm{peak}}$ , in double precision (FP64). It is worth nothing that $R_{\mathrm{peak}}$ assumes full utilization of fused-multiply-add operation, where a multiplication and an addition are performed simultaneously, and hence, we anticipate that further speed gains could be achieved with more sophisticated code optimizations. When compared to the ERI kernels, represented by gray circles, all the kernels for ERI derivatives clearly shift towards the memory-bound region because the operational intensity is dominated by building larger intermediate integrals for the ERIs with a higher angular momentum, which involves increased memory access. The ERIs with higher angular momenta are mostly memory-bound for the same reason.

4 Conclusions

We presented a multi-GPU implementation of analytical nuclear gradient of TDA-TDDFT using Gaussian-type basis sets with RSH functionals, with a focus on computing large-scale molecular systems. We revisited the underlying theory and algorithms to transform the integral direct procedure into an efficiently computable form in a massively parallel GPU computing environment. We demonstrated full-scale calculations of the excited-state nuclear gradient of a realistic protein system consisting of 4353 atoms at the $\omega$ B97X/def2-SVP level of theory. Achieving the parallel efficiency of $>$ 70% with up to 64 GPUs and 31% with 256 GPUs, the parallel efficiency of the nuclear gradient calculations was comparable to the excited-state energy counterpart, showcasing a well-balanced usage of a distributed resource up to 256 GPUs with 2.5 peta-FLOPS of raw computing power.

Heterogeneous computing is becoming increasingly prevalent as GPU-accelerated systems dominate the supercomputing landscape.Atchley et al. (2023); Chang et al. (2024) Quantum chemistry codes that fully exploit the state-of-the-art hardwares, as discussed in this work, will be extremely helpful in efficiently exploring the electronic structure of excited states for very large molecular systems. This will offer an exciting simulation platform for advancements in chemical and materials science.

Acknowledgements

Part of this work was carried out during the stay of I.K. at MIT as a visiting researcher. T.V.V. and Y.M.R. acknowledge Samsung Electronics for financial support (IO221227-04385-01 and IO211126-09175-01). The authors thank Dr. Xin Li and Prof. Patrick Norman at KTH for helpful discussions. I.K. thanks Mandy S. Kim for assistance with data collection. Computational resources were provided by the Supercomputing Center of Samsung Electronics.

Author contributions

I.K., T.V.V., Y.M.R., W.-J.S. and H.-J.K. conceived and designed the project and wrote the manuscript. I.K. developed GPU algorithms and wrote the code. D.J, L.W. and A.A. performed quantum chemistry calculations. J.Y. assisted with parallel computing set-up. S.K. and Y.C. aided in interpreting the results. I.J., S.L. and D.S.K. organized the project and supervised the computational study. All authors analyzed the data, discussed the results, and contributed to various portions of the manuscript.

References

Bourzac (2017) Bourzac, K. Supercomputing poised for a massive speed boost. Nature 2017, 551, 554–556.
McInnes et al. (2021) McInnes, L. C.; Heroux, M. A.; Draeger, E. W.; Siegel, A.; Coghlan, S.; Antypas, K. How community software ecosystems can unlock the potential of exascale computing. Nat. Comput. Sci. 2021, 1, 92–94.
Heldens et al. (2021) Heldens, S.; Hijma, P.; Werkhoven, B. V.; Maassen, J.; Belloum, A. S. Z.; Van Nieuwpoort, R. V. The Landscape of Exascale Research: A Data-Driven Literature Analysis. ACM Comput. Surv. 2021, 53, 1–43.
Herten (2023) Herten, A. Many Cores, Many Models: GPU Programming Model vs. Vendor Compatibility Overview. SC-W ’23: Proceedings of the SC ’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. New York, NY, USA, 2023; p 1019–1026.
Kohn and Sham (1965) Kohn, W.; Sham, L. J. Self-Consistent Equations Including Exchange and Correlation Effects. Phys. Rev. 1965, 140, A1133–A1138.
Burke (2012) Burke, K. Perspective on density functional theory. J. Chem. Phys. 2012, 136, 150901.
Becke (2014) Becke, A. D. Perspective: Fifty years of density-functional theory in chemical physics. J. Chem. Phys. 2014, 140, 18A301.
Bursch et al. (2022) Bursch, M.; Mewes, J.; Hansen, A.; Grimme, S. Best practice DFT protocols for basic molecular computational chemistry. Angew. Chem. Int. Ed. 2022, 61, e202205735.
Gordon et al. (2020) Gordon, M. S.; Barca, G.; Leang, S. S.; Poole, D.; Rendell, A. P.; Vallejo, J. L. G.; Westheimer, B. Novel Computer Architectures and Quantum Chemistry. J. Phys. Chem. A 2020, 124, 4557–4582.
Felice et al. (2023) Felice, R. D. et al. A Perspective on Sustainable Computational Chemistry Software Development and Integration. J. Chem. Theory Comput. 2023, 19, 7056–7076.
Vallejo et al. (2023) Vallejo, J. L. G. et al. Toward an extreme-scale electronic structure system. J. Chem. Phys. 2023, 159, 044112.
Nagy and Jensen (2017) Nagy, B.; Jensen, F. Reviews in Computational Chemistry; Wiley, 2017; Chapter 3, pp 93–149.
Almlöf et al. (1982) Almlöf, J.; Faegri, K.; Korsell, K. Principles for a direct SCF approach to LCAO–MO ab-initio calculations. J. Comput. Chem. 1982, 3, 385–399.
Yasuda (2008) Yasuda, K. Two-electron integral evaluation on the graphics processor unit. J. Comput. Chem. 2008, 29, 334–342.
Ufimtsev and Martínez (2008) Ufimtsev, I. S.; Martínez, T. J. Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation. J. Chem. Theory Comput. 2008, 4, 222–231.
Ufimtsev and Martínez (2009) Ufimtsev, I. S.; Martínez, T. J. Quantum Chemistry on Graphical Processing Units. 2. Direct Self-Consistent-Field Implementation. J. Chem. Theory Comput. 2009, 5, 1004–1015.
Asadchev et al. (2010) Asadchev, A.; Allada, V.; Felder, J.; Bode, B. M.; Gordon, M. S.; Windus, T. L. Uncontracted Rys Quadrature Implementation of up to G Functions on Graphical Processing Units. J. Chem. Theory Comput. 2010, 6, 696–704.
Titov et al. (2013) Titov, A. V.; Ufimtsev, I. S.; Luehr, N.; Martínez, T. J. Generating Efficient Quantum Chemistry Codes for Novel Architectures. J. Chem. Theory Comput. 2013, 9, 213–221.
Miao and Merz (2013) Miao, Y.; Merz, K. M. Acceleration of Electron Repulsion Integral Evaluation on Graphics Processing Units via Use of Recurrence Relations. J. Chem. Theory Comput. 2013, 9, 965–976.
Miao and Merz (2015) Miao, Y.; Merz, K. M. Acceleration of High Angular Momentum Electron Repulsion Integrals and Integral Derivatives on Graphics Processing Units. J. Chem. Theory Comput. 2015, 11, 1449–1462.
Kussmann and Ochsenfeld (2017) Kussmann, J.; Ochsenfeld, C. Hybrid CPU/GPU Integral Engine for Strong-Scaling Ab Initio Methods. J. Chem. Theory Comput. 2017, 13, 3153–3159.
Rák and Cserey (2015) Rák, A.; Cserey, G. The BRUSH algorithm for two-electron integrals on GPU. Chem. Phys. Lett. 2015, 622, 92–98.
Tornai et al. (2019) Tornai, G. J.; Ladjánszki, I.; Ádám Rák; Kis, G.; Cserey, G. Calculation of Quantum Chemical Two-Electron Integrals by Applying Compiler Technology on GPU. J. Chem. Theory Comput. 2019, 15, 5319–5331.
Barca et al. (2020) Barca, G. M. J. et al. Recent developments in the general atomic and molecular electronic structure system. J. Chem. Phys. 2020, 152, 154102.
Qi et al. (2023) Qi, J.; Zhang, Y.; Yang, M. A hybrid CPU/GPU method for Hartree–Fock self-consistent-field calculation. J. Chem. Phys. 2023, 159, 104101.
Vallejo et al. (2023) Vallejo, J. L. G.; Barca, G. M.; Gordon, M. S. High-performance GPU-accelerated evaluation of electron repulsion integrals. Mol. Phys. 2023, 121, e2112987.
Asadchev and Valeev (2023) Asadchev, A.; Valeev, E. F. High-Performance Evaluation of High Angular Momentum 4‑Center Gaussian Integrals on Modern Accelerated Processors. J. Phys. Chem. A 2023, 127, 10889–10895.
Seritan et al. (2020) Seritan, S.; Bannwarth, C.; Fales, B. S.; Hohenstein, E. G.; Kokkila-Schumacher, S. I. L.; Luehr, N.; Snyder, J. W.; Song, C.; Titov, A. V.; Ufimtsev, I. S.; Martínez, T. J. TeraChem: Accelerating electronic structure and ab initio molecular dynamics with graphical processing units. J. Chem. Phys. 2020, 152, 224110.
Manathunga et al. (2021) Manathunga, M.; Jin, C.; Cruzeiro, V. W. D.; Miao, Y.; Mu, D.; Arumugam, K.; Keipert, K.; Aktulga, H. M.; Merz, K. M.; Götz, A. W. Harnessing the Power of Multi-GPU Acceleration into the Quantum Interaction Computational Kernel Program. J. Chem. Theory Comput. 2021, 17, 3955–3966.
Barca et al. (2021) Barca, G. M. J.; Alkan, M.; Galvez-Vallejo, J. L.; Poole, D. L.; Rendell, A. P.; Gordon, M. S. Faster Self-Consistent Field (SCF) Calculations on GPU Clusters. J. Chem. Theory Comput. 2021, 17, 7486–7503.
Manathunga et al. (2020) Manathunga, M.; Miao, Y.; Mu, D.; Götz, A. W.; Merz, K. M. Parallel Implementation of Density Functional Theory Methods in the Quantum Interaction Computational Kernel Program. J. Chem. Theory Comput. 2020, 16, 4315–4326.
Williams-Young et al. (2020) Williams-Young, D. B.; de Jong, W. A.; van Dam, H. J. J.; Yang, C. On the Efficient Evaluation of the Exchange Correlation Potential on Graphics Processing Unit Clusters. Front. Chem. 2020, 8.
Kowalski et al. (2021) Kowalski, K. et al. From NWChem to NWChemEx: Evolving with the Computational Chemistry Landscape. Chem. Rev. 2021, 121, 4962–4998.
Johnson et al. (2022) Johnson, K. G.; Mirchandaney, S.; Hoag, E.; Heirich, A.; Aiken, A.; Martínez, T. J. Multinode Multi-GPU Two-Electron Integrals: Code Generation Using the Regent Language. J. Chem. Theory Comput. 2022, 18, 6522–6536.
Kussmann et al. (2021) Kussmann, J.; Laqua, H.; Ochsenfeld, C. Highly Efficient Resolution-of-Identity Density Functional Theory Calculations on Central and Graphics Processing Units. J. Chem. Theory Comput. 2021, 17, 1512–1521.
Asadchev and Valeev (2023) Asadchev, A.; Valeev, E. F. Memory-Efficient Recursive Evaluation of 3‑Center Gaussian Integrals. J. Chem. Theory Comput. 2023, 19, 1698–1710.
Williams-Young et al. (2023) Williams-Young, D. B.; Asadchev, A.; Popovici, D. T.; Clark, D.; Waldrop, J.; Windus, T. L.; Valeev, E. F.; Jong, W. A. d. Distributed memory, GPU accelerated Fock construction for hybrid, Gaussian basis density functional theory. J. Chem. Phys. 2023, 158, 234104.
Runge and Gross (1984) Runge, E.; Gross, E. K. U. Density-Functional Theory for Time-Dependent Systems. Phys. Rev. Lett. 1984, 52, 997–1000.
Dreuw and Head-Gordon (2005) Dreuw, A.; Head-Gordon, M. Single-Reference ab Initio Methods for the Calculation of Excited States of Large Molecules. Chem. Rev. 2005, 105, 4009–4037.
Herbert (2023) Herbert, J. M. Density Functional Theory for Electronic Excited States. Theoretical and Computational Photochemistry. 2023; pp 69–118.
Isborn et al. (2011) Isborn, C. M.; Luehr, N.; Ufimtsev, I. S.; Martínez, T. J. Excited-State Electronic Structure with Configuration Interaction Singles and Tamm–Dancoff Time-Dependent Density Functional Theory on Graphical Processing Units. J. Chem. Theory Comput. 2011, 7, 1814–1823.
Kim et al. (2023) Kim, I.; Jeong, D.; Son, W.-J.; Kim, H.-J.; Rhee, Y. M.; Jung, Y.; Choi, H.; Yim, J.; Jang, I.; Kim, D. S. Kohn–Sham time-dependent density functional theory with Tamm–Dancoff approximation on massively parallel GPUs. npj Comput. Mater. 2023, 9, 81.
Leininger et al. (1997) Leininger, T.; Stoll, H.; Werner, H.-J.; Savin, A. Combining long-range configuration interaction with short-range density functionals. Chem. Phys. Lett. 1997, 275, 151–160.
Vydrov and Scuseria (2006) Vydrov, O. A.; Scuseria, G. E. Assessment of a long-range corrected hybrid functional. J. Chem. Phys. 2006, 125, 234109.
Epifanovsky et al. (2021) Epifanovsky, E. et al. Software for the frontiers of quantum chemistry: An overview of developments in the Q-Chem 5 package. J. Chem. Phys. 2021, 155, 084801.
Franzke et al. (2023) Franzke, Y. J. et al. TURBOMOLE: Today and Tomorrow. J. Chem. Theory Comput. 2023, 19, 6859–6890.
Zahariev et al. (2023) Zahariev, F. et al. The General Atomic and Molecular Electronic Structure System (GAMESS): Novel Methods on Novel Architectures. J. Chem. Theory Comput. 2023, 19, 7031–7055.
Mejia-Rodriguez et al. (2023) Mejia-Rodriguez, D. et al. NWChem: Recent and Ongoing Developments. J. Chem. Theory Comput. 2023, 19, 7077–7096.
Neese (2022) Neese, F. Software update: ORCA program system—Version 5.0. WIREs Comput. Mol. Sci. 2022, 12, e1606.
Cao et al. (2021) Cao, Y.; Halls, M. D.; Friesner, R. A. Highly efficient implementation of the analytical gradients of pseudospectral time-dependent density functional theory. J. Chem. Phys. 2021, 155, 024115.
Rinkevicius et al. (2020) Rinkevicius, Z.; Li, X.; Vahtras, O.; Ahmadzadeh, K.; Brand, M.; Ringholm, M.; List, N. H.; Scheurer, M.; Scott, M.; Dreuw, A.; Norman, P. VeloxChem: A Python‐driven density‐functional theory program for spectroscopy simulations in high‐performance computing environments. WIREs Comput. Mol. Sci. 2020, 10, e1457.
Frisch et al. (2016) Frisch, M. J. et al. Gaussian 16 Revision C.01. 2016; Gaussian Inc. Wallingford CT.
Hirata and Head-Gordon (1999) Hirata, S.; Head-Gordon, M. Time-dependent density functional theory within the Tamm–Dancoff approximation. Chem. Phys. Lett. 1999, 314, 291–299.
Casida (1995) Casida, M. E. In Recent Advances in Computational Chemistry; Chong, D. P., Ed.; WORLD SCIENTIFIC, 1995; Vol. 1; pp 155–192.
Perdew et al. (1996) Perdew, J. P.; Burke, K.; Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 1996, 77, 3865–3868.
Bauernschmitt and Ahlrichs (1996) Bauernschmitt, R.; Ahlrichs, R. Treatment of electronic excitations within the adiabatic approximation of time dependent density functional theory. Chem. Phys. Lett. 1996, 256, 454–464.
Foresman et al. (1992) Foresman, J. B.; Head-Gordon, M.; Pople, J. A.; Frisch, M. J. Toward a systematic molecular orbital theory for excited states. J. Phys. Chem. 1992, 96, 135–149.
Caillie and Amos (1999) Caillie, C. V.; Amos, R. D. Geometric derivatives of excitation energies using SCF and DFT. Chem. Phys. Lett. 1999, 308, 249–255, Excited state property.
Caillie and Amos (2000) Caillie, C. V.; Amos, R. D. Geometric derivatives of density functional theory excitation energies using gradient-corrected functionals. Chem. Phys. Lett. 2000, 317, 159–164.
Shroll and Edwards (2004) Shroll, R. M.; Edwards, W. D. Excited‐state gradients via CPHF equations. Int. J. Quantum Chem. 2004, 56, 395–410.
Liu et al. (2010) Liu, F.; Gan, Z.; Shao, Y.; Hsu, C.-P.; Dreuw, A.; Head-Gordon, M.; Miller, B. T.; Brooks, B. R.; Yu, J.-G.; Furlani, T. R.; Kong, J. A parallel implementation of the analytic nuclear gradient for time-dependent density functional theory within the Tamm–Dancoff approximation. Mol. Phys. 2010, 108, 2791–2800.
Furche and Ahlrichs (2002) Furche, F.; Ahlrichs, R. Adiabatic time-dependent density functional methods for excited state properties. J. Chem. Phys. 2002, 117, 7433–7447.
Chiba et al. (2006) Chiba, M.; Tsuneda, T.; Hirao, K. Excited state geometry optimizations by analytical energy gradient of long-range corrected time-dependent density functional theory. J. Chem. Phys. 2006, 124, 144106.
Pople et al. (1992) Pople, J. A.; Gill, P. M.; Johnson, B. G. Kohn–Sham density-functional theory within a finite basis set. Chem. Phys. Lett. 1992, 199, 557–560.
Gerratt and Mills (1968) Gerratt, J.; Mills, I. M. Force Constants and Dipole-Moment Derivatives of Molecules from Perturbed Hartree–Fock Calculations. I. J. Chem. Phys. 1968, 49, 1719–1729.
Handy and Schaefer (1984) Handy, N. C.; Schaefer, H. F. On the evaluation of analytic energy derivatives for correlated wave functions. J. Chem. Phys. 1984, 81, 5031–5033.
Pople et al. (1979) Pople, J. A.; Krishnan, R.; Schlegel, H. B.; Binkley, J. S. Derivative studies in hartree‐fock and møller‐plesset theories. Int. J. Quantum Chem. 1979, 16, 225–241.
Komornicki et al. (1977) Komornicki, A.; Ishida, K.; Morokuma, K.; Ditchfield, R.; Conrad, M. Efficient determination and characterization of transition states using ab-initio methods. Chem. Phys. Lett. 1977, 45, 595–602.
Dupuis et al. (1976) Dupuis, M.; Rys, J.; King, H. F. Evaluation of molecular integrals over Gaussian basis functions. J. Chem. Phys. 1976, 65, 111–116.
Murray et al. (1993) Murray, C. W.; Handy, N. C.; Laming, G. J. Quadrature schemes for integrals of density functional theory. Mol. Phys. 1993, 78, 997–1014.
Lebedev (1977) Lebedev, V. I. Spherical quadrature formulas exact to orders 25–29. Sib. Math. J. 1977, 18, 99–107.
Becke (1988) Becke, A. D. A multicenter numerical integration scheme for polyatomic molecules. J. Chem. Phys. 1988, 88, 2547–2553.
Kim and Lee (2013) Kim, I.; Lee, Y. S. KPACK: Relativistic two-component ab initio electronic structure program package. Bull. Korean Chem. Soc. 2013, 34, 179–187.
(74) CUDA Fortran Programming Guide, Nvidia https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide/, Accessed: 2022-07-06.
Lehtola et al. (2018) Lehtola, S.; Steigemann, C.; Oliveira, M. J.; Marques, M. A. Recent developments in LibXC – A comprehensive library of functionals for density functional theory. SoftwareX 2018, 7, 1–5.
Zimmer (2002) Zimmer, M. Green fluorescent protein (GFP): Applications, structure, and related photophysical behavior. Chem. Rev. 2002, 102, 759–782.
Laino et al. (2004) Laino, T.; Nifosì, R.; Tozzini, V. Relationship between structure and optical properties in green fluorescent proteins: a quantum mechanical study of the chromophore environment. Chem. Phys. 2004, 298, 17–28.
Epifanovsky et al. (2009) Epifanovsky, E.; Polyakov, I.; Grigorenko, B.; Nemukhin, A.; Krylov, A. I. Quantum Chemical Benchmark Studies of the Electronic Properties of the Green Fluorescent Protein Chromophore. 1. Electronically Excited and Ionized States of the Anionic Chromophore in the Gas Phase. J. Chem. Theory Comput. 2009, 5, 1895–1906.
Nåbo et al. (2017) Nåbo, L. J.; Olsen, J. M. H.; Martínez, T. J.; Kongsted, J. The Quality of the Embedding Potential Is Decisive for Minimal Quantum Region Size in Embedding Calculations: The Case of the Green Fluorescent Protein. J. Chem. Theory. Comput. 2017, 13, 6230–6236.
Kulik et al. (2012) Kulik, H. J.; Luehr, N.; Ufimtsev, I. S.; Martínez, T. J. Ab Initio Quantum Chemistry for Protein Structures. J. Phys. Chem. B 2012, 116, 12501–12509.
Schwabe et al. (2015) Schwabe, T.; Beerepoot, M. T. P.; Olsen, J. M. H.; Kongsted, J. Analysis of computational models for an accurate study of electronic excitations in GFP. Phys. Chem. Chem. Phys. 2015, 17, 2582–2588.
Kulik et al. (2016) Kulik, H. J.; Zhang, J.; Klinman, J. P.; Martínez, T. J. How Large Should the QM Region Be in QM/MM Calculations? The Case of Catechol O -Methyltransferase. J. Phys. Chem. B 2016, 120, 11381–11394.
Rudberg (2012) Rudberg, E. Difficulties in applying pure Kohn–Sham density functional theory electronic structure methods to protein molecules. J. Phys. Condens. Matter 2012, 24, 072202.
Lever et al. (2013) Lever, G.; Cole, D. J.; Hine, N. D. M.; Haynes, P. D.; Payne, M. C. Electrostatic considerations affecting the calculated HOMO–LUMO gap in protein molecules. J. Phys. Condens. Mat. 2013, 25, 152101.
Foresman and Frisch (2015) Foresman, J.; Frisch, Æ. Exploring Chemistry with Electronic Structure Methods, 3rd ed.; Gaussian, Inc.: Wallingford, CT, 2015.
Van Thor et al. (2005) Van Thor, J. J.; Georgiev, G. Y.; Towrie, M.; Sage, J. T. Ultrafast and Low Barrier Motions in the Photoreactions of the Green Fluorescent Protein. J. Biol. Chem. 2005, 280, 33652–33659.
Chai and Head-Gordon (2008) Chai, J.-D.; Head-Gordon, M. Systematic optimization of long-range corrected hybrid density functionals. J. Chem. Phys. 2008, 128, 084106.
Mardirossian and Head-Gordon (2017) Mardirossian, N.; Head-Gordon, M. Thirty years of density functional theory in computational chemistry: An overview and extensive assessment of 200 density functionals. Mol. Phys. 2017, 115, 2315–2372.
Weigend and Ahlrichs (2005) Weigend, F.; Ahlrichs, R. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy. Phys. Chem. Chem. Phys. 2005, 7, 3297.
Svensson et al. (1996) Svensson, M.; Humbel, S.; Froese, R. D. J.; Matsubara, T.; Sieber, S.; Morokuma, K. ONIOM: A multilayered integrated MO + MM method for geometry optimizations and single point energy predictions. A test for Diels-Alder reactions and Pt(P(t-Bu)₃)₂ + H₂ oxidative addition. J. Phys. Chem. 1996, 100, 19357–19363.
Cornell et al. (1995) Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W.; Kollman, P. A. A Second Generation Force Field for the Simulation of Proteins, Nucleic Acids, and Organic Molecules. J. Am. Chem. Soc. 1995, 117, 5179–5197.
Heim et al. (1994) Heim, R.; Prasher, D. C.; Tsien, R. Y. Wavelength mutations and posttranslational autoxidation of green fluorescent protein. Proc. Natl. Acad. Sci. USA 1994, 91, 12501–12504.
Martin (2003) Martin, R. L. Natural transition orbitals. J. Chem. Phys. 2003, 118, 4775.
Leininger et al. (2001) Leininger, M. L.; Sherrill, C. D.; Allen, W. D.; Schaefer, H. F. Systematic Study of Selected Diagonalization Methods for Configuration Interaction Matrices. J. Comput. Chem. 2001, 22, 1574–1589.
Williams et al. (2009) Williams, S.; Waterman, A.; Patterson, D. Roofline. Commun. ACM 2009, 52, 65–76.
Atchley et al. (2023) Atchley, S. et al. Frontier: Exploring Exascale. SC ’23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Denver CO USA, 2023; pp 1–16.
Chang et al. (2024) Chang, J.; Lu, K.; Guo, Y.; Wang, Y.; Zhao, Z.; Huang, L.; Zhou, H.; Wang, Y.; Lei, F.; Zhang, B. A survey of compute nodes with 100 TFLOPS and beyond for supercomputers. CCF Trans. High Perform. Comput. 2024, 6, 243–262.

	$\displaystyle\nabla_{\mathbf{E}}(\mu_{A}\nu_{B}\|\lambda_{C}\sigma_{D})$	$\displaystyle=-\delta_{EA}(\mu^{\prime}_{A}\nu_{B}\|\lambda_{C}\sigma_{D})-% \delta_{EB}(\mu_{A}\nu^{\prime}_{B}\|\lambda_{C}\sigma_{D})$
		$\displaystyle-\delta_{EC}(\mu_{A}\nu_{B}\|\lambda^{\prime}_{C}\sigma_{D})-% \delta_{ED}(\mu_{A}\nu_{B}\|\lambda_{C}\sigma^{\prime}_{D})$		(40)

	$\displaystyle(\mu^{\prime}_{A}\nu_{B}\|\lambda_{C}\sigma_{D})$	$\displaystyle=-(\mu_{A}\nu^{\prime}_{B}\|\lambda_{C}\sigma_{D})-(\mu_{A}\nu_{B}% \|\lambda^{\prime}_{C}\sigma_{D})$
		$\displaystyle-(\mu_{A}\nu_{B}\|\lambda_{C}\sigma^{\prime}_{D})$		(41)