Machine learning surrogates for efficient hydrologic modeling: Insights from stochastic simulations of managed aquifer recharge

Timothy Dai
Department of Computer Science
Stanford University
Stanford, CA, USA
&

Kate Maher
Department of Earth System Science
Stanford University
Stanford, CA, USA
\AND

Zach Perzan
Department of Geoscience
University of Nevada, Las Vegas
Las Vegas, NV, USA
Corresponding author: zach.perzan@unlv.edu.

Abstract

Process-based hydrologic models are invaluable tools for understanding the terrestrial water cycle and addressing modern water resources problems. However, many hydrologic models are computationally expensive and, depending on the resolution and scale, simulations can take on the order of hours to days to complete. While techniques such as uncertainty quantification and optimization have become valuable tools for supporting management decisions, these analyses typically require hundreds of model simulations, which are too computationally expensive to perform with a process-based hydrologic model. To address this gap, we propose a hybrid modeling workflow in which a process-based model is used to generate an initial set of simulations and a machine learning (ML) surrogate model is then trained to perform the remaining simulations required for downstream analysis. As a case study, we apply this workflow to simulations of variably saturated groundwater flow at a prospective managed aquifer recharge (MAR) site. We compare the accuracy and computational efficiency of several ML architectures, including deep convolutional networks, recurrent neural networks, vision transformers, and networks with Fourier transforms. Our results demonstrate that ML surrogate models can achieve under 10% mean absolute percentage error and yield order-of-magnitude runtime savings over processed-based models. We also offer practical recommendations for training hydrologic surrogate models, including implementing data normalization to improve accuracy, using a normalized loss function to improve training stability and downsampling input features to decrease memory requirements.

Keywords hydrology $\cdot$ machine learning $\cdot$ surrogate modeling $\cdot$ managed aquifer recharge $\cdot$ emulator $\cdot$ groundwater

1 Introduction

Future water security hinges on effective water resources management. Process-based hydrologic and reactive transport models, which simulate water flow and solute transport (Xu and Singh, 2004; Devia et al., 2015; Steefel et al., 2015), are frequently used to forecast future changes in the hydrologic cycle and to optimize water management strategies. When paired with uncertainty quantification and parameter estimation methods, these models can support decision-making in diverse scenarios such as flood forecasting, contaminant transport, nutrient availability and managed aquifer recharge (MAR). However, these techniques often require hundreds of model simulations in which uncertain input parameters are varied randomly from one simulation to the next. While each individual simulation is deterministic – producing one fixed output for a given set of input parameters – the ensemble of simulations is stochastic because of the random variations in model input. When performed at the scale and resolution required for locally relevant policy decisions (Wood et al., 2011), these stochastic approaches result in large computational loads, even on state-of-the-art supercomputers.

To bridge this gap, recent studies have proposed a hybrid modeling workflow that capitalizes on both the high accuracy of process-based hydrologic models and the computational efficiency of modern machine learning (ML) models (Asher et al., 2015; Tran et al., 2021; Maxwell et al., 2021). In this workflow, an initial batch of simulations is generated using a process-based hydrologic model. An ML surrogate is then trained on this initial set of simulations and used to perform the remaining model runs. For example, Maxwell et al. (2021) evaluated the accuracy of two- and three-dimensional convolutional surrogate models for simulating overland flow in a simple tilted V catchment. This benchmark test problem (Di Giammarco et al., 1996) simulates 2D surface water flow through a simplified domain that consists of two inclined planes that converge in a central channel. Across simulations, the authors varied six uncertain input parameters related to rainfall and surface runoff, all of which were assumed to be spatially homogeneous across the domain. Of the six architectures tested, those that were provided with explicit temporal information during training exhibited the best performance, though accuracy decreased when the simulations in the test set contained different parameter distributions than those in the training set (Maxwell et al., 2021). Building off of this work, Tran et al. (2021) trained recurrent and highway networks (Wang et al., 2018) to emulate surface-subsurface flows within two watersheds spanning $600\text{\,}\mathrm{k}\mathrm{m}^{2}1236\text{\,}\mathrm{k}\mathrm{m}^{2}$ . As compared to a process-based hydrologic model, the emulator exhibits low ( $<$ $0.1$ ) relative bias for streamflow and total groundwater storage and can produce simulations 42 times faster.

While these early studies highlight the potential for ML surrogate models to accelerate hydrologic simulations, many questions remain for practitioners interested in applying ML emulators to water resources problems. For example, how many process-based simulations are required to train a high-fidelity surrogate model? Given the high dimensionality associated with spatial uncertainty, does the minimum training set size change when varying spatially heterogeneous input parameters across stochastic simulations? How does the choice of loss function impact performance? How important is data normalization? And, given the computational expense associated with training many ML architectures, when is it more computationally efficient to perform simulations exclusively with a process-based hydrologic code instead of an ML surrogate?

In this paper, we investigate these questions and present practical considerations for using ML surrogates to perform stochastic hydrologic simulations. As a case study, we focus on 3D simulations of variably saturated groundwater flow during managed aquifer recharge (MAR), the intentional replenishment of depleted aquifers through recharge basins and injection wells. We investigate seven different surrogate model architectures and evaluate the accuracy with which each one can reproduce output from a parallel, integrated hydrologic code (ParFlow-CLM). Importantly, we explore the impacts of spatial uncertainty on surrogate performance by varying parameters both spatially across the domain and between simulations. We then evaluate the memory and runtime required for each surrogate architecture as compared to the process-based hydrologic code. We also identify a range of implementation techniques that improve model performance, including data normalization, loss normalization and downsampling of high-dimensional inputs using an autoencoder.

2 Material and Methods

In this section, we formalize the hybrid modeling workflow (§2.1) and introduce the case study (§2.2). Next, we describe the process-based hydrologic code (§2.3) and the surrogate model architectures (§2.4). We then describe data preprocessing (§2.5) and detail the experimental setup for training and evaluating each ML surrogate (§2.6). We conclude with a summary of experiments, enumerating the ML models trained to assess our approach (§2.7).

2.1 Proposed hybrid workflow

The goal of this hybrid workflow is to harness the fast output rate of ML models to efficiently generate a large volume of simulations for downstream hydrologic analysis. Our task is to execute $N$ simulations, generating $N$ simulation outputs from $N$ distinct sets of simulation inputs. As illustrated in Figure 1, we employ a traditional processed-based hydrologic model (ParFlow-CLM) to generate a training set $\mathcal{D}_{\text{train}}$ comprising simulation input–output pairs. Subsequently, an ML surrogate model is trained on these examples and used to generate simulation outputs for the remaining $N-|\mathcal{D}_{\text{train}}|$ simulations, where $|\mathcal{D}_{\text{train}}|$ is the number of examples in $\mathcal{D}_{\text{train}}$ . If the ML surrogate provides comparable accuracy to the process-based model, the combined group of $N$ simulations can then be used for further analyses, such as uncertainty quantification, parameter estimation and resource optimization.

Refer to caption — Figure 1: Diagram of the proposed hybrid framework. The process-based hydrologic model generates a training set ( $\mathcal{D}_{\text{train}}$ ), which is then used to train and test an ML model. Subsequently, the trained ML model is used to generate the remaining hydrologic simulations.

2.2 The case study: Managed aquifer recharge site

We apply this hybrid workflow to a single case study in which hydrologic simulations are used to evaluate a recharge efficiency at a prospective managed aquifer recharge (MAR) site. MAR is a water management strategy whereby land managers deliberately flood parcels of land or inject water into wells to replenish underlying aquifers. Recent work has shown that MAR is a valuable tool for mitigating groundwater depletion in many regions of the globe (Dillon et al., 2019; Sprenger et al., 2017), but evaluating prospective recharge sites remains challenging as subsurface heterogeneity can significantly impact recharge rates (Knight et al., 2022; Perzan et al., 2023). Hydrologic and reactive transport simulations are effective tools for quantifying outcomes of interest at prospective MAR sites, such as changes in groundwater storage (e.g., Ganot et al., 2018) or the fate of subsurface contaminants (e.g., Perzan and Maher, 2024). However, generating these simulations comes with a notable computational expense, as the time required for a single simulation can range from hours to days depending on model scale and resolution. Thus, MAR site selection is a logical case study that could benefit from a computationally efficient hybrid modeling workflow.

In this case study, we focus on a prospective MAR site in Tulare County, California, that measures $800\text{\,}\mathrm{m}$\times$400\text{\,}\mathrm{m}$ . The site is underlain by a $45\text{\,}\mathrm{m}$ thick vadose zone that consists of interbedded layers of sand, silt and clay. The unconfined aquifer extends to $150\text{\,}\mathrm{m}$ depth, below which the Corcoran Clay, a regional aquitard, limits vertical flow (Mid-Kaweah GSA, 2019). The goal of the simulations at this site is to quantify the increase in saturated zone storage that will occur within two years of inundating the site with a $0.8\text{\,}\mathrm{m}$ recharge event.

To evaluate recharge efficiency at this site, Perzan et al. (2023) performed several hundred stochastic simulations of MAR using ParFlow-CLM (§2.3). The model domain is discretized into a $120\times 80\times 25$ rectilinear grid ( $1200\text{\,}\mathrm{m}$\times$800\text{\,}\mathrm{m}$\times$150\text{\,}% \mathrm{m}$ in the x, y and z directions, respectively), utilizing a uniform horizontal cell size of $10\text{\,}\mathrm{m}$ and variable cell thicknesses in the $z$ direction ( $1.02\text{\,}\mathrm{m}$ at the surface and $9\text{\,}\mathrm{m}$ at $90\text{\,}\mathrm{m}$ depth). Specified head boundaries corresponding to the depth to the water table ( $45\text{\,}\mathrm{m}$ ) are applied on the sides of the domain. A no-flow boundary is implemented on the bottom of the domain and variable forcing is applied to the top of the domain based on the modeling stage, as explained at the end of this section.

The domain is parameterized through a stochastic geophysical-geostatistical workflow (Perzan et al., 2023). Because subsurface heterogeneity can exert a dominant control on recharge efficiency, we parameterize the domain using a subsurface electrical resistivity model derived from a towed transient electromagnetic (tTEM) survey of the site. The tTEM system is a hydrogeophysical tool that can map meter-scale variations in electrical resistivity, a property directly related to sediment lithology. We first generate 600 gridded realizations of the tTEM-derived resistivity model using sequential Gaussian simulation, a stochastic technique for populating a rectilinear grid with a random variable. We then compare vertical profiles of electrical resistivity to collocated profiles of sediment type collected through cone penetration tests. Using the bootstrapping procedure of Knight et al. (2018), we transform each resistivity realization to a realization of coarse fraction, a metric that describes the fraction of coarse-grained (sand and gravel) sediment within each grid cell. The remainder of sediment within the cell is assumed to be fine-grained (silt and clay). For example, a cell with a coarse fraction of 0.5 is 50% coarse-grained material (either 100% sand, 100% gravel, or some combination of the two) and 50% fine-grained material (some combination of silt and clay). To assign hydrogeologic parameters (hydraulic conductivity, porosity, etc.) to each coarse fraction realization, we assume that only two types of sediment exist within a single realization — coarse-grained and fine-grained — and that each cell is a mix of the two types. We then sample from broad distributions to assign parameter values to each end member and use a weighted averaging procedure to calculate values for cells that are a mix of the two end members. The input parameter distributions are derived from a mix of literature and site-specific data. For more information about these parameter values, we refer the reader to the more complete description provided by Perzan et al. (2023).

Each MAR simulation consists of three consecutive stages. In Stage 1, a constant recharge forcing equal to the long-term natural recharge rate is applied to the top of the domain for 160,000 days ( $\sim$ 440 years), which is enough time for each simulation to reach steady state. In Stage 2, we apply one year of meteorological and irrigation forcing using hourly measurements from the southern Central Valley. These two spin-up stages initialize water content in the vadose zone and establish realistic soil moisture profiles prior to implementing recharge. Finally, Stage 3 simulates recharge by inundating the orchard with $0.8\text{\,}\mathrm{m}$ of water in a single winter season (applied in either 1, 4, 8 or 16 individual recharge events), alternating between periods of flooding and no flooding. In between inundation events, we apply meteorologic and irrigation forcing to the top of the domain.

Once each simulation has completed all three stages, we quantify the increase in saturated zone storage induced by MAR. Each simulation produces spatiotemporally variable output (transient pressure fields, represented as a time series of $120\times 80\times 25$ rectilinear grids) from a combination of temporally static, spatially variable input (e.g., 3D fields of porosity or saturated hydraulic conductivity, represented as $120\times 80\times 25$ rectilinear grids) and temporally variable, spatially homogeneous input (e.g., ambient air temperature). The number of output time steps varies depending on the modeling stage, with a finer temporal resolution in later modeling stages. We output pressure fields every 1000 days in Stage 1 (160 output time steps), every 2 days in Stage 2 (183 output time steps) and every 2 days in Stage 3 (366 output time steps). Note that the number of time step outputs here simply reflects the frequency at which the model outputs 3D pressure fields and not the temporal resolution of the underlying solver. From these 3D transient pressure fields, we calculate other system states (e.g., water content in each grid cell) and quantify the change in saturated zone storage.

2.3 The high-fidelity process-based model: ParFlow-CLM

In the first phase of our proposed framework (§2.1), a process-based hydrologic model is used to generate an initial set of simulations, which are then used to train an ML surrogate model. In this case study, we perform these process-based simulations using ParFlow-CLM. ParFlow-CLM couples the integrated hydrologic code ParFlow with the Common Land Model (CLM) to solve partial differential equations that describe water and energy fluxes over and beneath the land surface. ParFlow solves the Richards’ equation for variably saturated subsurface flow using a Newton-Krylov approach and the shallow-wave equations for overland flow using a kinematic wave approximation. CLM, meanwhile, uses a mass transfer approach to compute water and energy fluxes at the land surface (Maxwell and Miller, 2005; Kollet and Maxwell, 2008), including irrigation, evapotranspiration and precipitation. To accurately simulate the impacts of diurnal temperature and radiation fluctuations on water fluxes across the land surface, we run ParFlow-CLM using an hourly time step in Stages 2 and 3. For a detailed description of the governing equations, we refer readers to previous work describing ParFlow-CLM (Jones and Woodward, 2001; Ashby and Falgout, 1996; Kollet and Maxwell, 2006; Maxwell, 2013; Maxwell and Miller, 2005; Kollet and Maxwell, 2008).

2.4 The machine learning models

In the second phase of this hybrid framework (§2.1), we train an ML surrogate on the initial set of simulations generated by the process-based model. In total, we evaluate the performance of seven different machine learning surrogate model architectures. The architectures include recurrent and highway networks developed for spatiotemporal forecasting (Wang et al., 2018), vision transformers (Dosovitskiy et al., 2021), convolutional nerual networks and U-shaped networks paired with Fourier transforms (Wen et al., 2022; Zhou et al., 2022; Kang et al., 2023). We selected these architectures because some of them have been successfully applied in previous hydrologic surrogate modeling studies (Tran et al., 2021; Maxwell et al., 2021; Wen et al., 2022), while others are some of the most common architectures used for spatiotemporal forecasting in related fields like computer vision.

The ML surrogate architectures can be categorized into three types based on their method of prediction (Table 1, Figure 2). First, the Non-sequential models emulate the entire simulation in a single forward pass (i.e., the emulator produces 3D pressure fields for all output time steps in one run of the model). By contrast, the One-step models emulate one output time step (producing a single 3D pressure field) per forward pass. These models then use the output from the previous time step to perform another forward pass, repeating until all 3D pressure fields have been generated. Finally, the Recurrent models are similar to the one-step models in that they emulates one output time step per forward pass. However, they use memory cells to retain hidden states between forward passes, so that information can be stored between output time steps. As discussed above, the output frequency varies between model stages (see 2.2), which results in a different shape input tensor for each model stage. Because most ML architectures require a consistent shape input tensor across all training examples, we train a separate surrogate for each model stage. The ML models are summarized below, though complete details are provided in the Supplementary material (Section S1.1).

Table 1: Overview of the principal method, number of parameters, and source for each surrogate model.

Model Type	Model Name	Principal Method	Parameter Count	Source
	CNN4d	Convolution	12,801,281	N/A
(a) Non-sequential	ViT4d	Attention	18,228,032	Dosovitskiy et al. (2021)
	U-FNO4d	Fourier	251,446,273	Wen et al. (2022)
	CNN3d	Convolution	2,561,601	N/A
(b) One-step	ViT3d	Attention	10,238,848	Dosovitskiy et al. (2021)
	U-FNO3d	Fourier	26,152,993	Wen et al. (2022)
(c) Recurrent	PredRNN++	LSTM	11,275,425	Wang et al. (2018)

Non-sequential models

We experiment with three non-sequential models. Since non-sequential models produce 3D pressure fields for all output time steps in a single forward past, they are four-dimensional; i.e., they receive four-dimensional inputs — x, y, z and time — and generate four-dimensional outputs (3D transient pressure fields). The non-sequential models are:

1.

CNN4d. The CNN4d is a three-layer convolutional neural network that consists of four-dimensional convolutional layers, each followed by a four-dimensional batch normalization layer (Ioffe and Szegedy, 2015) to allow for fast and stable training, a leaky ReLU activation (Maas et al., 2013) to introduce nonlinearities, and a dropout layer (Hinton et al., 2012) to prevent overfitting.
2.

ViT4d. The ViT4d is a vision transformer (Dosovitskiy et al., 2021) that uses a four-dimensional convolutional layer to transforms the input tensor into a sequence of four-dimensional patches. This sequence of patches is then fed into a self-attention module (Vaswani et al., 2017) that calculates the importance of each patch relative to every other patch, allowing the model to learn both spatial and temporal dependencies regardless of their distance from each other.
3.

U-FNO4d. The U-FNO4d is based on the three-dimensional U-FNO implementation originally proposed by Wen et al. (2022), which consists of a U-shaped network paired with Fourier neural operators (FNOs; Li et al., 2020). While a full description of the U-FNO architecture is beyond the scope of this paper, we provide a brief description below. The U-FNO architecture (Figure S1) consists of three Fourier layers followed by three U-Fourier layers, bookended by linear projection layers. Each Fourier layer contains two information pathways: one through a Fourier and inverse Fourier transform separated by a linear transformation and another through a simple linear layer. These two information pathways are merged with a residual connection. The U-Fourier layer is essentially identical to the Fourier layer except there is a third information pathway: a U-Net consisting of a sequence of five convolutional layers followed by three transposed convolutional layers. Again, the information pathways are merged with a residual connection. We expand the original, three-dimensional U-FNO architecture to four dimensions by replacing three-dimensional convolutional layers in the U-Fourier layers with four-dimensional convolutional layers and by adding a fourth dimension with an identical maximum number of nodes to the Fourier and inverse Fourier transforms.

One-step models

One-step models, unlike non-sequential models, receive three-dimensional inputs (x, y and z) and predict one time step at a time. The output of the previous time step is reused as input for the next time step and this rolling forward process continues until the final time step. Note that these models receive no explicit temporal information during training, such as day of year or time step size. These one-step models include:

4.

CNN3d. The CNN3d model, like the CNN4d model, acts as our baseline one-step model. Constructed almost identically to the CNN4d model, the CNN3d model mainly differs in that it employs three-dimensional convolutions instead of four-dimensional ones.
5.

ViT3d. The ViT3d model mirrors the ViT4d model but employs a three-dimensional convolutional layer to generate three-dimensional patches.
6.

U-FNO3d. Our U-FNO3d model replicates the U-FNO architecture presented by Wen et al. (2022) and is the same as the U-FNO4d described above, except that it uses three-dimensional convolutional layers instead of four-dimensional layers.

Recurrent models

Like one-step models, one forward pass of a recurrent model produces three-dimensional output for a single time step. The key distinction is that the recurrent model retains hidden states between forward passes. We experiment with one such model:

7.

PredRNN++. This recurrent model is an improved predictive recurrent neural network (PredRNN++), which was originally developed by Wang et al. (2018). The PredRNN++ architecture (Figure S2) pairs causal long short-term memory networks (causal LSTMs) with a gradient highway units (GHUs). Causal LSTMs contain a spatiotemporal memory mechanism designed to improve short-term spatial correlations, while GHUs connect future outputs with distant inputs for improved long-term modeling capabilities. Each causal LSTM contains distinct cells designed to store spatial information ( $\mathcal{M}$ ) and temporal information ( $\mathcal{C}$ ) from one time step to the next, as well as hidden states ( $\mathcal{H}$ ), which are used to make predictions (Figure S2). Within the PredRNN++, which contains multiple causal LSTMs layers, each layer has its own unique temporal memory ( $\mathcal{C}$ ) and hidden states ( $\mathcal{H}$ ). The spatial memory ( $\mathcal{M}$ ) is shared across all layers passed sequentially between them within a single time step. To allow the model to learn long-term dependencies, the GHU layer has its own memory cell ( $\mathcal{Z}$ ), which bypasses the deeper causal LSTM layers and provides an alternate route for information to flow through the network. Given that the original PredRNN++ architecture was designed to handle two-dimensional inputs, we modify the implementation to handle three-dimensional inputs by replacing all two-dimensional convolutional layers with three-dimensional convolutional layers. Detailed descriptions of causal LSTMs and gradient highway unit can be found elsewhere (Wang et al., 2018).

2.5 Data preprocessing

To train each ML surrogate model, we designate the inputs of the ParFlow-CLM simulations (e.g., 3D porosity fields, transient meteorological forcing, etc.) as our training features and the outputs of the ParFlow-CLM simulations (transient, 3D pressure fields) as our training targets. However, prior to training, we first perform several steps of preprocessing. In particular, we perform feature selection to curate a list of salient input features, we broadcast input features across the $x$ , $y$ , $z$ and $time$ dimensions to standardize input and output formats, and we downsample certain inputs to reduce memory requirements for training.

Summary of input and target fields.

					Varies between	Spatially	Temporally	Appears in Stage
	Field name	Units	$\mu$	$\sigma$	simulations	variable	variable	1	2	3
	Inputs
1.	Horizontal Hydraulic Conductivity	$\unit{m/hr}$	$0.0936$	$0.270$	✓	✓	✗	✓	✓	✓
2.	Vertical Hydraulic Conductivity	$\unit{m/hr}$	$0.0151$	$0.0465$	✓	✓	✗	✓	✓	✓
3.	Porosity	$1$	$0.338$	$0.0585$	✓	✓	✗	✓	✓	✓
4.	Residual Saturation	$1$	$3.417$	$1.15$	✓	✓	✗	✓	✓	✓
5.	van Genuchten $\alpha$	$\unit{1/m}$	$3.41$	$1.15$	✓	✓	✗	✓	✓	✓
6.	van Genuchten $n$	$1$	$0.365$	$0.201$	✓	✓	✗	✓	✓	✓
7.	Incoming Short-Wave Radiation	$\unit{W/m^{2}}$	$202.$	$86.5$	✗	✗	✓	✗	✓	✓
8.	Incoming Long-Wave Radiation	$\unit{W/m^{2}}$	$323.$	$22.3$	✗	✗	✓	✗	✓	✓
9.	Precipitation Rate	$\unit{mm/s}$	$2.85$ e-05	$1.74$ e-05	✗	✗	✓	✗	✓	✓
10.	Air Temperature	$\unit{K}$	$290.$	$6.14$	✗	✗	✓	✗	✓	✓
11.	West-to-East Wind Speed	$\unit{m/s}$	$0.525$	$0.844$	✗	✗	✓	✗	✓	✓
12.	South-to-North Wind Speed	$\unit{m/s}$	$-0.371$	$0.902$	✗	✗	✓	✗	✓	✓
13.	Barometric Pressure	$\unit{pa}$	100,000	$246.$	✗	✗	✓	✗	✓	✓
14.	Specific Humidity	$\unit{kg/kg}$	$6.40$ e-03	$5.00$ e-04	✗	✗	✓	✗	✓	✓
15.	Cell Volume	$\unit{m}^{3}$	$600.$	$1130$	✗	✓	✗	✓	✓	✓
16.	Initial Pressure Head Field¹	$\unit{m}$	$6.40$	$18.4$	✗/✓	✓	✗	✓	✓	✓
17.	Inundation Forcing	$\unit{m/s}$	$0.232$	$0.294$	✓	✓	✓	✗	✗	✓
	Targets
1.	Pressure Head	$\unit{m}$	$6.20$	$21.2$	✓	✓	✓	✓	✓	✓

1

At the start of model Stage 1 (model spin up to steady state), pressure across the domain is in hydrostatic equilibrium with the water table at 45 m depth. For Stages 2 and 3, the initial pressure head field is initialized as the final pressure field output from the previous stage.

2.5.1 Feature selection

We select 17 input fields as the input features and one output field as our supervised learning target (Table 2.5). These input fields include hydrogeologic parameters (fields 1-6 in Table 2.5), meteorologic forcing (fields 7-14), information about the rectilinear model grid (field 15), initial conditions (field 16) and the recharge forcing applied at the surface (field 17).

The process-based hydrologic model receives additional input information not included in Table 2.5, such as cell dimensions and time step sizes. However, we omit many of these fields from our input features because they are temporally invariant and constant across all training examples. Thus, each surrogate architecture likely learns this information implicitly during training. Some fields, such as cell volume and the initial pressure head for Stage 1, are constant across all simulations, but are included nonetheless to support future studies seeking to modify these parameters.

2.5.2 Broadcasting

Broadcasting streamlines manipulation of inputs and targets through the use of large, combined tensors. The input features (Table 2.5) include fields that remain constant in time, fields that constant in space and fields that are constant in both space and time. Broadcasting simply repeats the values of each field across dimensions where that field remains constant, which ensures that the tensor for each input feature is four-dimensional (x, y, z and time). For example, we broadcast air temperature at each time step — which does not vary spatially across the domain — along the x, y and z dimensions. Similarly, given that saturated hydraulic conductivity does not change from one time step to the next, we repeat the 3D hydraulic conductivity field from the first time step across all time steps. This approach standardizes the dimensions of all input fields and allows them to be combined into one large tensor for input into each architecture.

2.5.3 Downsampling

While broadcasting standardizes input dimensions, it increases the total size of the input features and, by extension, the memory allocated to each training example. The memory required to train an ML model is an important component of computational efficiency; if the input and output for a single training example exceeds the memory capacity of a single graphics processing unit (GPU), then the model will have to be trained across multiple GPUs, which will increase the total number of processor-hours used during training. Modern hydrologic model domains can include anywhere from $10^{5}$ to $>$ $10^{8}$ total grid cells (Wood et al., 2011; Bierkens et al., 2015). At this scale, after broadcasting each input feature across the x, y, z and time dimensions and adding in the size of the training targets (i.e., the transient 3D pressure fields from ParFlow-CLM), the memory allocated to a single training example can quickly exceed 100 GB, which is larger than the memory capacity of most modern GPUs. To address this challenge, we limit the size of a single training example through temporal and spatial downsampling.

For model stages 1 and 2, we apply temporal downsampling. These model stages are only used to initialize water content throughout the vadose zone. Because intermediate pressure fields are not used to calculate recharge efficiency, decreasing the temporal resolution does not limit downstream analyses. For each stage, we sample 16 output time steps from the spin up period using a geometric sampling scheme:

\left\{\left\lfloor\sum_{k=1}^{t}r^{k-1}\right\rfloor:\sum_{k=1}^{16}r^{k-1}=1% 60\right\}_{t=1}^{16}.

(1)

The length of 16 is chosen so that a single training example fits on our training hardware, while the geometric sampling scheme prioritizes time step outputs toward the beginning of the model stage, when changes in pressure head are most pronounced. Though the final steady-state pressure field at the end of each stage is the only output used in subsequent modeling stages, we calculate loss across multiple time steps from this stage to ensure that the model simulates physically meaningful processes (i.e., gradual changes in pressure over time). This also forces the model to output intermediate pressure fields; visually examining these intermediate outputs also allows the user to more easily diagnose factors contributing to poor model performance. While this geometric sampling scheme results in a variable time step size within an individual simulation, the output frequency is consistent across all training examples. Thus, the data-driven ML models should implicitly learn these temporal patterns during training.

To accurately quantify recharge efficiency, the output from the Stage 3 surrogate models must be temporally high-resolution. Thus, instead of applying temporal downsampling, we apply spatial downsampling to the Stage 3 input features and training targets. For the input features, we reduce dimensionality using an average pooling layer with a $20\times 20\times 1$ window and a stride of $4\times 4\times 1$ . This layer simply calculates the mean of each input feature over the $20\times 20\times 1$ window, reducing the size of the input tensor in the $x$ and $y$ directions. The $z$ dimension remains unchanged, however, in order to preserve vertical resolution given that vertical flow is important for this case study.

To downsample the Stage 3 training targets, we use an autoencoder. The autoencoder is a separate neural network that is trained to reproduce the spatially variable pressure field from a single time step, but it is forced to store this information in an intermediate, lower-dimensional space (Figure 3(a)). Each surrogate model is then trained on the lower-dimensional representation of the 3D pressure fields, known as the encoded targets (Figure 3(b)). We train a single autoencoder that is used for all surrogate models. To calculate surrogate performance on the test set, we then decode the surrogate model’s predictions using the autoencoder and calculate error metrics relative to the high-resolution pressure fields output by ParFlow-CLM. This workflow allows us to train our surrogate model in the reduced dimensional space, minimizing memory requirements, while retaining the high spatial resolution of our predictions for use in downstream tasks, such as calculating recharge efficiency. For a detailed overview of the autoencoder architecture, please refer to Section S1.3 in the Supplementary material.

2.6 Model training and evaluation

In this section, we discuss the training setup used for all surrogate models, the hardware used to calculate computational efficiency and the evaluation criteria used to quantify model performance.

2.6.1 Training configuration

Each model undergoes training for $150$ epochs with a learning rate of $0.0001$ , using Adam optimization (Kingma and Ba, 2015). Due to GPU memory constraints, we restrict the batch size to $1$ . We employ a normalized L2 loss function:

L({y}_{i},h({x}_{i}))=\frac{||w\cdot({y}_{i}-h({x}_{i}))||_{2}}{||{y}_{i}||_{2}}

(2)

where $y_{i}$ is the ground-truth output for the $i$ th training example, $x_{i}$ is the input for the $i$ th training example, $h(x_{i})$ is the predicted output, $w$ defines a depth-wise weighting scheme and $||\cdot||_{2}$ is the L2 norm. We choose this normalized loss function because it is particularly effective when there is large variance in the data and because previous hydrologic surrogate modeling work suggests that it provides faster convergence than the L1 loss (Wen et al., 2022). Given that managed aquifer recharge efficiency is largely governed by unsaturated flow processes (Perzan et al., 2023), accurately simulating pressure changes within the vadose zone is more important than simulating pressure changes deep within the saturated zone. Thus, we define the weights $w$ such that they more severely penalize pressure errors closer to the surface than at depth (Figure S3).

To investigate the impact of our choice of loss function on model performance, we train additional CNN3d surrogates on Stage 1 simulations using the mean square error loss (i.e., the square of the numerator of Equation 2). Due to computational constraints, we could not train duplicates of each surrogate architecture using the mean square error loss, so we only perform this comparison with the CNN3d because it has the fastest training time.

We perform a random 80%-10%-10% (217-28-28) training–validation–test split and all performance metrics are reported based on the test set unless otherwise noted. Features and labels are normalized using their respective minimum and maximum values from the training set. We normalize all splits by the training set’s minimums and maximums to prevent bias in validation and test results introduced by distribution information from one set to another. To assess the effects of data normalization on ML model accuracy, we train two additional Stage 1 CNN3d models: one using Z-score normalization and one without any data normalization. As discussed above, we only perform these tests using the CNN3d because of its short training time.

2.6.2 Hardware environment

We train all ML models on a single NVIDIA A100 Tensor Core graphics processing unit (GPU) with 80 GB memory. All ParFlow-CLM simulations were performed on AMD Epyc 7543 central processing units (CPUs). Depending on the model stage, ParFlow-CLM simulations were run either on 4 MPI ranks with 1 CPU core per rank (Stage 1 and Stage 3) or 6 MPI ranks with 1 CPU core per rank (Stage 2).

2.6.3 Model accuracy evaluation

We evaluate each surrogate model using the mean absolute percentage error (MAPE) between the predicted pressure head time-series and the ground-truth (ParFlow-CLM) pressure head time-series in a cell-by-cell manner. Because we train a different surrogate for each modeling stage, we report different MAPE values for each stage. As with the loss function, we weight MAPE by depth to ensure that each model’s performance accurately reflects its ability to simulate vadose zone flow (Figure S3). We also evaluate model performance when each surrogate is run sequentially across all three stages, such that the final pressure head output by the surrogate from Stage 1 is the initial pressure head (field 16 in Table 2.5) used for the Stage 2 surrogate model and the final pressure head output of Stage 2 is the initial pressure head for Stage 3. We call this technique end-to-end (E2E) evaluation. E2E provides a more realistic evaluation of surrogate performance in a real-world application.

Training an accurate ML surrogate requires a sufficiently large training set, but at a certain point, the cost of increasing the training set size may outweigh any benefit in improved prediction performance. To evaluate this relationship and estimate the minimum number of process-based simulations required to train an accurate surrogate, we train multiple versions of each Stage 1 surrogate model while using smaller and smaller training sets. In addition to the baseline training set size of 245 examples used for training (with a 217-28 training-validation split), we use training set sizes of 196 (174-22 split), 147 (130-17), 98 (87-11), 49 (43-6), 25 (22-3), 13 (11-2), 8 (7-1), 5 (4-1), and 3 (2-1) training examples while maintaining the test set constant at 28 examples. Because ML models typically exhibit a power-law relationship between model error and training set size (Hestness et al., 2017), we then fit a power function to these data ( $y=am^{b}$ , where $m$ is the number of training examples, $y$ is the test set MAPE and $a$ and $b$ are regression coefficients). Lastly, we differentiate $am^{b}$ with respect to $m$ , which yields $abm^{b-1}$ . This expression represents the instantaneous change in MAPE given an increase in training set size, which we use to compare each model’s sensitivity to training set size.

2.6.4 Model runtime evaluation

In addition to evaluating ML model accuracy, we quantify the total compute time needed to generate simulations with this hybrid modeling workflow. Since model runtime is largely dependent on the hardware on which it is executed, we strictly adhere to the hardware environment described in §2.6.2 when evaluating model runtime. To simplify analysis and minimize additional training, we only consider models trained on Stage 1 simulations in this section.

First, we benchmark runtime by defining the time needed to generate $N$ Stage 1 simulations using the processed-based hydrologic model, ParFlow-CLM, as

T_{\text{PF}}=Nt_{\text{PF}},

(3)

where $t_{\text{PF}}$ denotes the CPU-hours taken by ParFlow-CLM to generate a single Stage 1 simulation (i.e., the wall clock time multiplied by the number of CPU cores used for the parallel simulation). Though runtime varies across stochastic ParFlow-CLM simulations (1.07–19.37 CPU-hours for Stage 1 simulations), we calculate $t_{\text{PF}}$ here as the mean runtime across all Stage 1 simulations (4.17 CPU-hours).

We also establish the number of GPU-hours required to output $N$ Stage 1 simulations using an ML model as

T_{\text{ML}}=T_{\text{train}}+(N-\mathcal{D}_{\text{train}})t_{\text{ML}},

(4)

where $T_{\text{train}}$ denotes the training time needed to reach the minimum MAPE on the validation set, $t_{\text{ML}}$ denotes the time taken by the trained ML model to generate one Stage 1 simulation, and $\mathcal{D}_{\text{train}}$ represents the minimum training set size necessary for the ML model to achieve an MAPE of less than $10\%$ — the threshold for “highly accurate forecasting” (Lewis, 1982). Due to differences in each surrogate model architecture, the minimum training set size ( $\mathcal{D}_{\text{train}}$ ) differs for each model. To find this value, we train multiple versions of each surrogate model while varying training set sizes using a binary search approach. The test set (28 simulations) is held constant during this procedure and is used to evaluate models, ensuring they fall below the $10\%$ MAPE threshold. We then calculate $\mathcal{D}_{\text{train}}$ as the smallest training set that still allows for a test set MAPE below 10%. Note that $\mathcal{D}_{\text{train}}$ includes both training and validation examples with an 80%-10% train–validation split, so each model trains on $\frac{8}{9}\cdot\mathcal{D}_{\text{train}}$ examples and validates using $\frac{1}{9}\cdot\mathcal{D}_{\text{train}}$ examples.

Given that the training set is generated by a process-based model, the total compute time used to generate $N$ simulations in this hybrid workflow includes both Equation 4 and the time to generate the training set:

T_{\text{training set}}=\mathcal{D}_{\text{train}}\cdot t_{\text{PF}}.

(5)

Equations 3, 4, and 5 have different units (CPU-, GPU-, and CPU-hours, respectively) because the ML model and the process-based hydrologic model run on different types of processors. Depending on the computing resources available to an individual modeler, these units might not be equivalent. Nonetheless, it is valuable to compare these compute times in order to evaluate the efficiency of a given surrogate model against a traditional process-based model. By assuming that $1\text{GPU-hour}=1\text{CPU-hour}$ , we quantify the compute time saved using this hybrid modeling workflow as

	$\displaystyle T_{\text{saved}}$	$\displaystyle=Nt_{\text{PF}}-\Big{(}T_{\text{training set}}+T_{\text{train}}+(% N-\|\mathcal{D}_{\text{train}}\|)t_{\text{ML}}\Big{)}$
		$\displaystyle=\underbrace{(t_{\text{PF}}-t_{\text{ML}})}_{\text{slope}}N-% \underbrace{(\|\mathcal{D}_{\text{train}}\|(t_{\text{PF}}-t_{\text{ML}})+T_{% \text{train}})}_{\text{intercept}},$		(6)

where $T_{\text{saved}}$ is in generic processor-hours. The slope term in Equation 6 indicates that the time saved by employing an ML surrogate model becomes more pronounced when $t_{\text{ML}}$ is significantly lower than $t_{\text{PF}}$ . Furthermore, the intercept term indicates that time savings begin sooner, i.e., at smaller values of $N$ , if the required training set is small and the training time is low, as expected.

2.7 Summary of experiments

In summary, we trained six different groups of surrogate models, each for a different numerical experiment:

1.

We train all seven surrogate architectures — the CNN4d, ViT4d, U-FNO4d, CNN3d, ViT3d, U-FNO3d, and PredRNN++ (§2.4) — on Stage 1, 2, and 3 ParFlow-CLM simulations, amounting to 21 trained models. We also train an autoencoder for all Stage 3 models. This totals 22 trained models.
2.

We train two additional Stage 1 CNN3d models using alternative data normalization techniques (§2.6.1). One of the additional models does not use any data normalization, and the other uses Z-score normalization. The Stage 1 CNN3d model is chosen over other stages and models for its efficiency in training and evaluation.
3.

We train six additional Stage 1 CNN3d models using different combinations of loss function and data normalization (§2.6.1). Three CNN3d models use the normalized L2 loss with min-max, Z-score, and no data normalization, and three CNN3d models use the mean squared error loss with min-max, Z-score and no data normalization.
4.

To evaluate the effect of training set size on model accuracy, we train all seven ML models with nine Stage 1 training sets of decreasing sizes, as described in §2.6.3. This experiment includes 63 trained models in total.
5.

To evaluate the ML models’ runtime efficiency, we re-train all seven ML models with variable Stage 1 training set sizes, using a binary search technique to find the minimum training set size that achieves $<$ 10% test MAPE. The number of models trained at this stage varies according to the efficiency of the search algorithm and the test MAPE results from previous experiments but ranged from 2 to 7 across the seven architectures. We then compare the runtimes of the seven models trained on their respective minimum Stage 1 training sets with the procedure described in §2.6.4.

3 Results and Discussion

Balancing speed and accuracy, the hybrid modeling framework achieves runtime benefits with minimal accuracy trade-offs. In §3.1, we first present the predictive accuracy of each ML model when trained on the full training set before examining the impacts of training set size and normalization on model accuracy and training stability. In §3.2, we compare the computational efficiency of each ML surrogate with a baseline approach that relies solely on the process-based hydrologic model. Lastly, we discuss the importance and challenges of model interpretability (§3.3) and potential limitations of this framework (§3.4).

3.1 Model accuracy

Each ML surrogate model produces pressure fields that are visually similar to those produced by the process-based hydrologic model. For example, a comparison of ParFlow and the PredRNN++ shows that both models produce similarly-shaped heterogeneous wetting fronts during Stage 1 (Figure 4), with monotonically increasing pressure values below the front. Plotting the cell-by-cell absolute error between the two models reveals that differences are greatest near the wetting front (Figure 4c). However, absolute pressure differences are generally less than a few meters, which is of a reasonable range given the sharp hydraulic gradients that arise near infiltration wetting fronts.

All seven ML models achieve $<$ 10% MAPE on the test set (Figure 5 and Table S8), which is the threshold for “highly accurate forecasting” (Lewis, 1982). The PredRNN++ displays the lowest error for Stage 1 and end-to-end (E2E) predictions, while the CNN3d and ViT3d perform the best on Stage 2 and Stage 3, respectively. Even though the non-sequential models have more parameters than their one-step counterparts (Table 1), they do not exhibit an appreciable increase in predictive performance. The lone recurrent model (the PredRNN++), however, outperforms the others and is better able to learn the temporal dependencies that develop as each hydrologic simulation progresses through time. The large memory cells retained between time steps likely provide ample degrees of freedom for the model to store spatially heterogeneous information over time.

Overall, prediction error is the highest for Stage 1, with a mean MAPE of 5.6% across all ML models, as opposed to 0.13% and 0.42% for Stages 2 and 3, respectively. Previous modeling work at this site has shown that managed aquifer recharge efficiency is most strongly controlled by vadose zone water content prior to inundation (Perzan et al., 2023). The low model error for Stages 2 and 3 makes sense, then, given that Stage 2 and 3 are provided the ground-truth pressure field output by ParFlow-CLM at the end of Stage 1 and 2, respectively, as initial pressure head fields during training.

Alternatively, the higher Stage 1 MAPE could result from its large time scale (§2.2 and 2.5.3). While Stage 2 and 3 span 1 and 2 years, respectively, Stage 1 spans over 400 years, with more dramatic changes in pressure between outputs. Pressure head changes across large time scales may be more difficult to learn than smaller, more incremental changes, potentially contributing to the poorer performance of Stage 1 models.

End-to-end (E2E) predictions, on the other hand, do not receive ground-truth pressure fields between stages; the predicted pressure at the end of one stage is used as input for the ML surrogate in the next stage. Overall, the MAPE for E2E predictions is typically lower than that of Stage 1 and similar in magnitude to that of Stage 3 (Figure 5). The similarity between Stage 3 and E2E error likely arises because E2E error is averaged equally over the entire time series and Stage 3 has the most time steps. However, not all models follow this trend; the CNN3d and CNN4d both exhibit E2E MAPE higher than 1%, with a $3.1\times$ and $6.6\times$ respective increases compared to their Stage 3 MAPEs. Plotting the E2E MAPE per time step in Figure 6 reveals that the CNN3d and CNN4d exhibit the highest MAPE per time step among other models in Stage 3. This heightened difference between Stage 3 MAPE and Stage 3 MAPE in E2E evaluation could indicate that the CNN3d and CNN4d both exhibit a strong dependence on the initial pressure head at the start of a model stage. The curves for all other models follow the same general shape, suggesting that these models may have learned similar patterns in the Stage 3 data that have resulted in similar accuracy.

Collectively, the results in this section suggest that ML surrogate models can simulate variably saturated groundwater flow with a similar level of accuracy as process-based hydrologic models, supporting the feasibility of applying a hybrid modeling framework to this case study.

3.1.1 Impact of training set size on accuracy

While the PredRNN++ exhibits the lowest E2E error, it also demonstrates the second lowest sensitivity to training set size. Figure 7 reveals that the ML models, in general, exhibit a power-law relationship between training set size and MAPE, which is typical for most deep learning architectures (Hestness et al., 2017). This power-law relationship demonstrates diminishing returns when increasing training set size beyond a certain point. To explore these diminishing returns more closely, we plot the derivative of the best-fit curve for this power-law relationship ( $\frac{d\text{MAPE}}{dm}(m)$ ) in Figure 7. The $\frac{d\text{MAPE}}{dm}(m)$ values for the PredRNN++ remain consistently close to 0 and the model is able to achieve $\sim$ 5% MAPE with as few as 25 training examples (Figure 7). In fact, almost all models exhibit nearly consistent MAPE values after 50–100 training examples. While the models presented in Figure 5 were trained on 245 examples (with a 217-28 training-validation split), these results suggests that future practitioners could achieve similar performance by training on fewer examples. However, the relationship between training set size and MAPE is highly dependant on the specific problem of interest, so future practitioners should use caution when extending these results to other sites or other scenarios.

3.1.2 Impact of normalization on accuracy

Both the min-max data normalization and the normalized loss function (§2.6.1) improve model performance, resulting in lower MAPE and a more stable training process. When trained on a Stage 1 dataset normalized with a min-max procedure, the CNN3d achieves 4.8% MAPE, which is an order of magnitude lower than the CNN3d trained without any normalization (48.9% MAPE) and three times lower than the model trained with a Z-score normalized dataset (15.3% MAPE; Figure 8). We attribute this improved performance to the variability in the input dataset (Table 2.5), which contains several features that span multiple orders of magnitude. In such cases, normalization improves accuracy by constraining features to a consistent range. In addition, some of the input features are not necessarily normally distributed. Because min-max normalization preserves distribution information — as opposed to Z-score normalization, which standardizes each input feature to have a mean of 0 and a standard deviation of 1 — it provides this distribution information to each ML model, which may improve model accuracy.

While normalizing input data improves model accuracy, loss normalization increases training stability. Comparing MAPE on the validation set across all training epochs, we see that using a normalized loss generally reduces the number of sharp spikes in MAPE, indicating that a normalized loss allows for more stable descent and reliable training convergence. One reason loss normalization may have such a large impact in this case is our use of small batch sizes. To limit memory requirements, we train each ML model using a batch size of 1. However, at small batch sizes, the gradient is calculated over fewer examples (or in this case, a single example), which leads to noisy loss curves and increases the likelihood that gradient descent can get stuck in a local minimum. Loss normalization helps offset these drawbacks by ensuring that the gradients are of consistent magnitude across the training set. This effect is most noticeable in models that use either no data normalization or Z-score data normalization, suggesting that loss normalization most benefits models with inadequate data normalization.

Due to computational constraints, we only perform these analyses with the CNN3d. However, we expect that results would be similar or even more pronounced for other architectures; the CNN3d has built-in batch normalization layers that increase training stability, while most other architectures do not. Overall, these results underscore the importance of both data normalization and loss normalization in enhancing model performance and training stability for hydrologic surrogate models.

3.2 Computational efficiency

While each ML surrogate model achieves $<$ $10\%$ MAPE, the models vary widely in the amount of time and memory they require to produce simulations. One attribute they share, however, is that all ML surrogate models offer significant runtime advantages over the process-based hydrologic model (Figure 9). Figure 9 reveals positive slopes across all lines. Recall that the slope for these plots are calculated as the difference in time taken by ParFlow-CLM and the ML model in generating a single Stage 1 simulation (Equation 6). All lines appear to exhibit identical slopes because each ML surrogate outputs simulations orders of magnitude faster than the process-based hydrologic model ( $t_{\text{PF}}=$ 15,017 CPU-seconds/simulation), so the difference in $t_{\text{PF}}-t_{\text{ML}}$ across the ML models is miniscule. Nonetheless, positive slopes for all lines signify that all seven ML models produce simulations at a faster rate than ParFlow-CLM. The ViT4d technically exhibits the steepest slope, producing simulations the fastest ( $t_{\text{ML}}=0.115$ GPU-seconds/simulation), but, due to the large training set size required for the ViT4d, time savings only appear for the ViT4d after 246 simulations. Closer examination of the $x$ -intercepts reveals that 6 out of 7 of the ML models achieve a runtime advantage over ParFlow-CLM when the desired simulation output count is as low as 17. We provide a detailed overview of all time measurements in Table 2, including distinct values of $\mathcal{D}_{\text{train}}$ (the minimum size training set required to achieve $<10\%$ MAPE, which strongly impacts the computational efficiency of each model.

When accounting for both training time and forward simulation time, the U-FNO4d is the most efficient ML model. To perform 500 simulations, the U-FNO4d requires a total of 13.1 processor-hours (the sum of Equations 4 and 5), compared to the second-most efficient ML model, the PredRNN++, which requires 13.3 processor-hours. These computational times are orders of magnitude smaller than the time it would take to generate 500 simulations exclusively with ParFlow-CLM (2090 processor-hours). These findings, expressed in Figure 9, affirm the tangible benefits of this hybrid framework, which outpaces a naive approach of solely relying on ParFlow-CLM. We believe this computational speedup arises from two factors. First, ParFlow-CLM uses an iterative Newton-Krylov approach to solve the coupled system of nonlinear partial differential equations at each time step. This iterative scheme requires many Jacobian evaluations and floating point operations per time step, while the ML surrogates only require forward passes, consisting of fixed calculations determined by the model architecture and model weights. Second, the ML surrogate models can take arbitrarily large time steps without compromising convergence due to their data-driven nature, while ParFlow-CLM must use an hourly time step to accurately represent meteorology-driven water fluxes across the land surface. Other process-based models are subject to similar constraints — such as the Courant–Friedrichs–Lewy condition — which limit their computational efficiency relative to ML surrogates. Since ML surrogates can learn temporal patterns in training data, they are unbound by time step size and can thereby complete simulations at faster rates.

Table 2: Runtime measurements and GPU memory allocation for all seven ML surrogate models.

Model	$t_{\text{ML}}$ (GPU-sec)¹	$T_{\text{train}}$ (GPU-hr)²	$\|\mathcal{D}_{\text{train}}\|$ ³	$T_{\text{training set}}$ (CPU-hr)⁴	Memory allocation (GB)⁵
CNN4d	$2.28$	$4.47$	$15$ (13-2)	62.6	$21.8$
ViT4d	$0.0934$	$1.74$	$245$ (217-28)	1020	$4.2$
U-FNO4d	$0.856$	$0.454$	$3$ (2-1)	12.5	$22.7$
CNN3d	$0.991$	$0.145$	$5$ (4-1)	20.9	$10.5$
ViT3d	$0.115$	$0.122$	$13$ (11-2)	54.2	$2.1$
U-FNO3d	$0.288$	$0.187$	$10$ (8-2)	41.7	$14.7$
PredRNN++	$2.38$	$0.473$	$3$ (2-1)	12.5	$31.7$

1

Computational time to perform a single simulation with the trained ML model.
2

Time to train each ML model, once a training set has been generated.
3

The minimum number of training examples required to achieve $<$ $10\%$ MAPE. We show the training-validation split in parentheses.
4

Time to generate $|\mathcal{D}_{\text{train}}|$ training examples using the process-based hydrologic model.
5

Maximum memory allocated to each ML model during training, using a batch size of 1.

In addition to runtime, the memory required to train an ML model is an important component of computational efficiency. While the memory cells in the PredRNN++ allow the model to retain information from one time step to the next, these large tensors require significant memory allocation (Table 2). Using a batch size of 1, the PredRNN++ requires 15.1 $\times$ more GPU memory to train (31.7 GB) than the ViT3d (2.1 GB), the least memory-intensive architecture. At this allocation size, the PredRNN++ may be too memory-intensive to train on some modern GPUs. Similarly, all three non-sequential models (CNN4d, ViT4d and U-FNO4d) require more memory than their one-step counterparts (CNN3d, ViT3d and U-FNO3d), though this increase in memory allocation does not necessarily translate to an increase in accuracy (Figure 5). Ultimately, even though some architectures, such as the PredRNN++, can achieve low MAPE with only a few training examples, they may not be flexible enough to adapt to all compute environments due to their high memory requirements.

3.3 Model interpretability

Another crucial factor to consider when selecting an ML architecture within a hybrid modeling framework is model interpretability. Error recognition is more straightforward with an interpretable surrogate model and can enhance trust among practitioners. The PredRNN++ architecture, for example, contains $\mathcal{C}$ and $\mathcal{M}$ memory cells (Figure S2) that are intended to serve distinct roles, with the former facilitating the transfer of temporal information and the latter facilitating the transfer of spatial information (Wang et al., 2018). However, in practice, deciphering the values stored in these cells and validating their role in each model has proven challenging (Figure 10). Each memory cell is the same size as the model domain ( $120\times 80\times 25$ ), with two memory cells for each of the four causal LSTM layers in the PredRNN++. Given that these memory cells change between each time step and between each evaluation example, manually inspecting and interpreting $\mathcal{C}$ and $\mathcal{M}$ quickly becomes cumbersome. In other words, the hidden states of the PredRNN++ model and, consequently, the internal mechanisms of the model, prove challenging to interpret.

In addition to the PredRNN++, other ML architectures have been shown to be interpretable when applied in different domains and may prove useful in hydrologic surrogate modeling. For example, previous studies have indicated that hierarchical convolutional layers, as employed in CNN3d, can learn hierarchical image information (Zhou et al., 2014; Yang et al., 2019). Furthermore, research on transformers like those employed by the ViT3d has demonstrated that attention mechanisms within these models may also learn hierarchical information (Jain and Wallace, 2019; Jawahar et al., 2019; Wiegreffe and Pinter, 2019). When combined with conventional ML interpretability methods such as saliency maps, deconvolutional networks, guided backpropagation, and feature occlusion, each of these mechanisms could serve as a check to ensure that a surrogate model is behaving as expected. Future work is needed to explore the interpretablility of each of these architectures when applied to hydrologic simulations.

3.4 Limitations

While each ML model can reliably reproduce transient pressure fields generated by a process-based model, future researchers should carefully tune models to limit spatial and temporal variability in model performance. For example, when loss is calculated evenly across the entirety of the domain, ML models may not accurately simulate changes in pressure head in the most important portions of the domain. In Figure 4, which visualizes one time step from a PredRNN++ emulation of Stage 1, model errors are predominantly concentrated at and above the wetting front. Even though the loss function is weighted to more severely penalize pressure changes within the vadose zone (Figure S3), the high MAPE in this area suggests that the ML model still cannot fully resolve sharp hydraulic gradients near the wetting front. Future applications could employ more complex weighted loss functions to ensure that each surrogate most accurately simulates the processes of interest.

The full results that display ML model performance on all stages in Figure 5 reveal a similar challenge in capturing temporal changes in pressure head. The performance in Stage 1 for all models consistently lags behind that of Stage 2 and Stage 3. Because these later stages simulate shorter time periods (one year in Stage 2 and two years in Stage 3, as opposed to 440 years in Stage 1), pressure head changes over the model stage are less pronounced and their temporal relationships are thereby easier to learn. Thus, while this case study achieves low MAPE, future applications should carefully adapt the implementation strategies in this study (for example, the choice of loss function, data normalization technique and input features) to their problem of interest, especially when simulating model domains characterized by abrupt variations in pressure head across either time or space.

4 Conclusion

Machine learning surrogate models offer an efficient approach for generating large sets of hydrologic simulations. All seven of the ML surrogate models produce transient pressure fields that are comparable (below 10% mean absolute percentage error, MAPE) to those produced by the process-based hydrologic model. These transient pressure fields can then be used in downstream tasks, such as quantifying changes in groundwater storage. The improved predictive recurrent neural network (PredRNN++) exhibits the highest accuracy when simulating all three model stages (0.36% MAPE), though other architectures also perform well (0.52–2.4% MAPE). Interestingly, he non-sequential models, which are deeper, have more parameters and compute pressure at all time steps simultaneously, do not exhibit improved accuracy over the one-step models, which compute pressure for each time step sequentially and with no memory between steps.

This hybrid modeling framework also yields substantial runtime benefits (0.115–2.38 processor-seconds/simulation, as opposed to 3,860–69,700 processor-seconds/simulation with a process-based model for Stage 1 simulations) with errors consistently below 10%. Even when accounting for the time it takes to train each model, 6 out of 7 ML surrogate models exhibit runtime advantages when generating as few as 17 hydrologic simulations. Computational savings increase when performing larger numbers of simulations, with 159 $\times$ speed-up when generating 500 simulations with the U-FNO4d, for example.

Additionally, we identify key practices for efficiently training accurate hydrologic surrogate models, showing that min-max data normalization and loss normalization increase surrogate model accuracy and training stability. Temporal and spatial downsampling, meanwhile, improve computational efficiency by decreasing the memory required to train each model.

Overall, this study offers encouraging results for applying machine learning surrogate models to real-world hydrologic problems, supporting future advancements in hydrologic modeling and simulation techniques.

Acknowledgements

We thank the Stanford Research Computing Center for providing computational resources and support that contributed to these research results. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1656518. Additional research funding was provided by the Realizing Environmental Innovation Program (REIP) at the Stanford Woods Institute for the Environment.

Overview

This supplementary file contains additional details on the methods used to develop each ML surrogate model, including layer-by-layer information for each surrogate model (§S1.1), diagrams of recent model architecture innovations (§S1.2), and details on the autoencoder used to compress 3D pressure fields (§S1.3). We also present additional results referenced in the main text, including complete MAPE results across all models and stages (§S2.1) and a discussion of any error introduced by the autoencoder (§S2.2).

S1 Materials and methods

S1.1 Model layer tables

Table S1: CNN3d layer details given

(-1,120,80,25,8)

input.

	Layer	Output shape	Parameter count
1.	Linear	$(-1,120,80,25,64)$	576
2.	Conv3d/BatchNorm3d/ReLU/Dropout	$(-1,64,120,80,25)$	512,192
3.	Conv3d/BatchNorm3d/ReLU/Dropout	$(-1,128,120,80,25)$	1,024,384
4.	Conv3d/BatchNorm3d/ReLU/Dropout	$(-1,64,120,80,25)$	1,024,384
5.	Linear	$(-1,120,80,25,1)$	65

Table S2: CNN4d layer details given

(-1,16,120,80,25,8)

input.

	Layer	Output shape	Parameter count
1.	Linear	$(-1,1,120,80,25,64)$	576
2.	Conv4d/BatchNorm4d/ReLU/Dropout	$(-1,64,1,120,80,25)$	2,560,128
3.	Conv4d/BatchNorm4d/ReLU/Dropout	$(-1,128,16,120,80,25)$	5,120,256
4.	Conv4d/BatchNorm4d/ReLU/Dropout	$(-1,64,16,120,80,25)$	5,120,256
5.	Linear	$(-1,16,120,80,25,1)$	65

Table S3: UFNO3d layer details given

(-1,120,80,25,8)

inputs.

	Layer	Output shape	Parameter count
1.	Padding	$(-1,120,80,32,8)$	0
2.	Linear	$(-1,120,80,32,32)$	256
3.	Linear	$(-1,120,80,32,32)$	1056
4.	Fourier3d/Conv1d/Add/ReLU	$(-1,32,120,80,32)$	4,097,056
5.	Fourier3d/Conv1d/Add/ReLU	$(-1,32,120,80,32)$	4,097,056
6.	Fourier3d/Conv1d/Add/ReLU	$(-1,32,120,80,32)$	4,097,056
7.	Fourier3d/Conv1d/U-Net3d/Add/ReLU	$(-1,32,120,80,32)$	4,618,720
8.	Fourier3d/Conv1d/U-Net3d/Add/ReLU	$(-1,32,120,80,32)$	4,618,720
9.	Fourier3d/Conv1d/U-Net3d/Add/ReLU	$(-1,32,120,80,32)$	4,618,720
10.	Linear	$(-1,120,80,32,128)$	4224
11.	Linear	$(-1,120,80,32,1)$	129
12.	Depadding	$(-1,120,80,25,1)$	0

Table S4: UFNO4d layer details given

(-1,16,120,80,25,8)

inputs.

	Layer	Output shape	Parameter count
1.	Padding	$(-1,16,120,80,32,8)$	0
2.	Linear	$(-1,16,120,80,32,32)$	256
3.	Fourier4d/Conv1d/Add/ReLU	$(-1,32,16,120,80,32)$	40,961,056
4.	Fourier4d/Conv1d/Add/ReLU	$(-1,32,16,120,80,32)$	40,961,056
5.	Fourier4d/Conv1d/Add/ReLU	$(-1,32,16,120,80,32)$	40,961,056
6.	Fourier4d/Conv1d/U-Net4d/Add/ReLU	$(-1,32,16,120,80,32)$	42,852,832
7.	Fourier4d/Conv1d/U-Net4d/Add/ReLU	$(-1,32,16,120,80,32)$	42,852,832
8.	Fourier4d/Conv1d/U-Net4d/Add/ReLU	$(-1,32,16,120,80,32)$	42,852,832
9.	Linear	$(-1,16,120,80,32,128)$	4224
10.	Linear	$(-1,16,120,80,32,1)$	129
11.	Depadding	$(-1,16,120,80,25,1)$	0

Table S5: ViT3d layer details given

(-1,120,80,25,8)

inputs.

	Layer	Output shape	Parameter count
1.	Linear	$(-1,120,80,25,64)$	512
2.	Conv3d	$(-1,64,5,5,5)$	7,864,320
3.	PositionalEncoding	$(125,-1,64)$	0
4.	TransformerEncoder	$(125,-1,64)$	2,249,216
5.	Linear	$(-1,125,1920)$	124,800
6.	Reshape	$(-1,120,80,25,1)$	0

Table S6: ViT4d layer details given

(-1,16,120,80,25,8)

inputs.

	Layer	Output shape	Parameter count
1.	Linear	$(-1,16,120,80,25,64)$	512
2.	Conv4d	$(-1,64,8,5,5,5)$	15,728,704
3.	PositionalEncoding	$(1000,-1,64)$	0
4.	TransformerEncoder	$(1000,-1,64)$	2,249,216
5.	Linear	$(-1,1000,3840)$	249,600
6.	Reshape	$(-1,16,120,80,25,1)$	0

Table S7: PredRNN++ layer details given

(-1,120,80,25,8)

inputs.

	Layer	Output shape	Parameter count
1.	Linear	$(-1,120,80,25,32)$	256
2.	CausalLSTM3d	$(-1,32,120,80,25)$	2,690,752
3.	GHU	$(-1,32,120,80,25)$	512,128
4.	CausalLSTM3d	$(-1,32,120,80,25)$	2,690,752
5.	CausalLSTM3d	$(-1,32,120,80,25)$	2,690,752
6.	CausalLSTM3d	$(-1,32,120,80,25)$	2,690,752
7.	Linear	$(-1,120,80,25,1)$	33

S1.2 Model architecture diagrams

S1.2.1 U-FNO

S1.2.2 PredRNN++

S1.3 Autoencoder details

The autoencoder is a convolutional neural network containing four layers and around 676,000 parameters, which makes the autoencoder much smaller than each surrogate model. Two of the four layers are encoding convolutional layers, which squeezes the input into compact spatial dimensions, and the remaining two are decoding transposed convolutional layers, which recovers the original spatial resolution. All convolutional layers are 2D, using the $z$ dimension as the channel dimension. They employ a (7, 7)-sized convolutional kernel, a stride of (2, 2), and a hidden size of 128. The convolutional layers are separated by ReLU activation functions and batch normalization layers.

S1.4 Implicit depth-wise loss and MAPE weighting

S2 Results

S2.1 Mean absolute percentage error across all model stages

MAPE Model Stage 1 Stage 2 Stage 3 E2E a. CNN4d $5.4$ $0.089$ $0.75$ $2.4$ b. ViT4d $9.9$ $0.14$ $0.28$ $0.70$ c. U-FNO4d $4.3$ $0.14$ $0.28$ $0.52$ d. CNN3d $4.8$ $\mathbf{0.050}$ $0.29$ $1.9$ e. ViT3d $6.3$ $0.12$ $\mathbf{0.20}$ $0.57$ f. U-FNO3d $5.1$ $0.22$ $0.22$ $0.62$ g. PredRNN++ $\mathbf{2.9}$ $0.054$ $0.46$ $\mathbf{0.36}$

Table S8: Test MAPEs for all ML models. Bold indicates best performance for each stage.

S2.2 Error introduced by the autoencoder

The autoencoder, though a vital component in training Stage 3 models, may introduce inaccuracies. We attempt to disentangle the errors from the surrogate model and the errors from the autoencoder in Figure S4, focusing on the CNN3d for its efficiency in training. Computing MAPE between the predictions $\hat{y}$ taken directly from the autoencoder, i.e., $\hat{y}\leftarrow h_{\text{dec}}(h_{\text{enc}}(y))$ , and the simulation targets reveals that the autoencoder can reproduce its own input with 0.28% error on the held-out test set (Figure S4, Config A). We expect that this autoencoder error contributes to Stage 3 error, since we use the autoencoder in training Stage 3 models; indeed, when computing the MAPE between decoded Stage 3 model predictions and simulation targets, this error, at 0.29%, is greater than the autoencoder error alone. It is surprising, then, when we use the autoencoder to encode and decode the targets and use these encoded and decoded targets as our new evaluation references, our percent error decreases to 0.11%. This shift indicates that the autoencoder introduces error by compressing the space of the target encodings. In other words, our autoencoder encodes targets in a way that makes it more difficult to distinguish between unique targets once they have been encoded. Consequently, during the decoding step, essential information distinguishing unique targets is not fully recovered. Exploring improved autoencoder architectures may help to mitigate these challenges and enhance the overall performance our hybrid framework in future applications.

References

Xu and Singh [2004] C.-Y. Xu and V. P. Singh. Review on Regional Water Resources Assessment Models under Stationary and Changing Climate. Water Resources Management, 18(6):591–612, December 2004. ISSN 0920-4741, 1573-1650. doi:10.1007/s11269-004-9130-0. URL http://link.springer.com/10.1007/s11269-004-9130-0.
Devia et al. [2015] Gayathri K. Devia, B.P. Ganasri, and G.S. Dwarakish. A Review on Hydrological Models. Aquatic Procedia, 4:1001–1007, 2015. ISSN 2214241X. doi:10.1016/j.aqpro.2015.02.126. URL https://linkinghub.elsevier.com/retrieve/pii/S2214241X15001273.
Steefel et al. [2015] C. I. Steefel, C. A. J. Appelo, B. Arora, D. Jacques, T. Kalbacher, O. Kolditz, V. Lagneau, P. C. Lichtner, K. U. Mayer, J. C. L. Meeussen, S. Molins, D. Moulton, H. Shao, J. Šimůnek, N. Spycher, S. B. Yabusaki, and G. T. Yeh. Reactive transport codes for subsurface environmental simulation. Computational Geosciences, 19(3):445–478, June 2015. ISSN 1573-1499. doi:10.1007/s10596-014-9443-x. URL https://doi.org/10.1007/s10596-014-9443-x.
Wood et al. [2011] Eric F. Wood, Joshua K. Roundy, Tara J. Troy, L. P. H. van Beek, Marc F. P. Bierkens, Eleanor Blyth, Ad de Roo, Petra Döll, Mike Ek, James Famiglietti, David Gochis, Nick van de Giesen, Paul Houser, Peter R. Jaffé, Stefan Kollet, Bernhard Lehner, Dennis P. Lettenmaier, Christa Peters-Lidard, Murugesu Sivapalan, Justin Sheffield, Andrew Wade, and Paul Whitehead. Hyperresolution global land surface modeling: Meeting a grand challenge for monitoring Earth’s terrestrial water: OPINION. Water Resources Research, 47(5), May 2011. ISSN 00431397. doi:10.1029/2010WR010090. URL http://doi.wiley.com/10.1029/2010WR010090. 604 citations (Crossref) [2024-02-08].
Asher et al. [2015] M. J. Asher, B. F. W. Croke, A. J. Jakeman, and L. J. M. Peeters. A review of surrogate models and their application to groundwater modeling. Water Resources Research, 51(8):5957–5973, 2015. ISSN 1944-7973. doi:10.1002/2015WR016967. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/2015WR016967. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/2015WR016967.
Tran et al. [2021] Hoang Tran, Elena Leonarduzzi, Luis De la Fuente, Robert Bruce Hull, Vineet Bansal, Calla Chennault, Pierre Gentine, Peter Melchior, Laura E. Condon, and Reed M. Maxwell. Development of a Deep Learning Emulator for a Distributed Groundwater–Surface Water Model: ParFlow-ML. Water, 13(23):3393, December 2021. ISSN 2073-4441. doi:10.3390/w13233393. URL https://www.mdpi.com/2073-4441/13/23/3393.
Maxwell et al. [2021] Reed M. Maxwell, Laura E. Condon, and Peter Melchior. A Physics-Informed, Machine Learning Emulator of a 2D Surface Water Model: What Temporal Networks and Simulation-Based Inference Can Help Us Learn about Hydrologic Processes. Water, 13(24):3633, January 2021. ISSN 2073-4441. doi:10.3390/w13243633. URL https://www.mdpi.com/2073-4441/13/24/3633. Number: 24 Publisher: Multidisciplinary Digital Publishing Institute.
Di Giammarco et al. [1996] P. Di Giammarco, E. Todini, and P. Lamberti. A conservative finite elements approach to overland flow: the control volume finite element formulation. Journal of Hydrology, 175(1):267–291, February 1996. ISSN 0022-1694. doi:10.1016/S0022-1694(96)80014-X.
Wang et al. [2018] Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and Philip S. Yu. PredRNN++: Towards A Resolution of the Deep-in-Time Dilemma in Spatiotemporal Predictive Learning. In Proceedings of the 35 th International Conference on Machine Learning. arXiv, November 2018. URL http://arxiv.org/abs/1804.06300. arXiv:1804.06300 [cs, stat].
Dillon et al. [2019] P. Dillon, P. Stuyfzand, T. Grischek, M. Lluria, R. D. G. Pyne, R. C. Jain, J. Bear, J. Schwarz, W. Wang, E. Fernandez, C. Stefan, M. Pettenati, J. van der Gun, C. Sprenger, G. Massmann, B. R. Scanlon, J. Xanke, P. Jokela, Y. Zheng, R. Rossetto, M. Shamrukh, P. Pavelic, E. Murray, A. Ross, J. P. Bonilla Valverde, A. Palma Nava, N. Ansems, K. Posavec, K. Ha, R. Martin, and M. Sapiano. Sixty years of global progress in managed aquifer recharge. Hydrogeology Journal, 27(1):1–30, February 2019. ISSN 1431-2174, 1435-0157. doi:10.1007/s10040-018-1841-z. URL http://link.springer.com/10.1007/s10040-018-1841-z. 282 citations (Crossref) [2024-02-08].
Sprenger et al. [2017] C. Sprenger, N. Hartog, M. Hernández, E. Vilanova, G. Grützmacher, F. Scheibler, and S. Hannappel. Inventory of managed aquifer recharge sites in Europe: historical development, current situation and perspectives. Hydrogeology Journal, 25(6):1909–1922, September 2017. ISSN 1435-0157. doi:10.1007/s10040-017-1554-8. URL https://doi.org/10.1007/s10040-017-1554-8. 111 citations (Crossref) [2024-02-08].
Knight et al. [2022] Rosemary Knight, Klara Steklova, Alex Miltenberger, Seogi Kang, Meredith Goebel, and Graham Fogg. Airborne geophysical method images fast paths for managed recharge of California’s groundwater. Environmental Research Letters, 17(12):124021, December 2022. ISSN 1748-9326. doi:10.1088/1748-9326/aca344. URL https://iopscience.iop.org/article/10.1088/1748-9326/aca344.
Perzan et al. [2023] Zach Perzan, Gordon Osterman, and Kate Maher. Controls on flood managed aquifer recharge through a heterogeneous vadose zone: hydrologic modeling at a site characterized with surface geophysics. Hydrology and Earth System Sciences, 27(5):969–990, March 2023. ISSN 1607-7938. doi:10.5194/hess-27-969-2023. URL https://hess.copernicus.org/articles/27/969/2023/.
Ganot et al. [2018] Yonatan Ganot, Ran Holtzman, Noam Weisbrod, Anat Bernstein, Hagar Siebner, Yoram Katz, and Daniel Kurtzman. Managed aquifer recharge with reverse-osmosis desalinated seawater: modeling the spreading in groundwater using stable water isotopes. Hydrology and Earth System Sciences, 22(12):6323–6333, December 2018. ISSN 1607-7938. doi:10.5194/hess-22-6323-2018. 12 citations (Crossref) [2024-02-08].
Perzan and Maher [2024] Zach Perzan and Kate Maher. Transport, dispersion, and degradation of nonpoint source contaminants during flood-managed aquifer recharge. Vadose Zone Journal, 23(2):1–14, 2024. doi:10.1002/vzj2.20307.
Mid-Kaweah GSA [2019] Mid-Kaweah GSA. Groundwater Sustainability Plan. Groundwater Sustainability Plan, Mid-Kaweah GSA, Tulare, CA, 2019.
Knight et al. [2018] Rosemary Knight, Ryan Smith, Ted Asch, Jared Abraham, Jim Cannia, Andrea Viezzoli, and Graham Fogg. Mapping Aquifer Systems with Airborne Electromagnetics in the Central Valley of California. Ground Water, 56(6):893–908, November 2018. ISSN 1745-6584. doi:10.1111/gwat.12656.
Maxwell and Miller [2005] Reed M. Maxwell and Norman L. Miller. Development of a Coupled Land Surface and Groundwater Model. Journal of Hydrometeorology, 6(3):233–247, June 2005. ISSN 1525-7541, 1525-755X. doi:10.1175/JHM422.1. URL http://journals.ametsoc.org/doi/10.1175/JHM422.1.
Kollet and Maxwell [2008] Stefan J. Kollet and Reed M. Maxwell. Capturing the influence of groundwater dynamics on land surface processes using an integrated, distributed watershed model: INFLUENCE OF GROUNDWATER DYNAMICS ON LAND. Water Resources Research, 44(2), February 2008. ISSN 00431397. doi:10.1029/2007WR006004. URL http://doi.wiley.com/10.1029/2007WR006004.
Jones and Woodward [2001] Jim E Jones and Carol S Woodward. Newton-Krylov-multigrid solvers for large-scale, highly heterogeneous, variably saturated ¯ow problems q. Advances in Water Resources, 2001.
Ashby and Falgout [1996] Steven F. Ashby and Robert D. Falgout. A Parallel Multigrid Preconditioned Conjugate Gradient Algorithm for Groundwater Flow Simulations. Nuclear Science and Engineering, 124(1):145–159, September 1996. ISSN 0029-5639, 1943-748X. doi:10.13182/NSE96-A24230. URL https://www.tandfonline.com/doi/full/10.13182/NSE96-A24230.
Kollet and Maxwell [2006] Stefan J. Kollet and Reed M. Maxwell. Integrated surface–groundwater flow modeling: A free-surface overland flow boundary condition in a parallel groundwater flow model. Advances in Water Resources, 29(7):945–958, July 2006. ISSN 03091708. doi:10.1016/j.advwatres.2005.08.006. URL https://linkinghub.elsevier.com/retrieve/pii/S0309170805002101.
Maxwell [2013] Reed M. Maxwell. A terrain-following grid transform and preconditioner for parallel, large-scale, integrated hydrologic modeling. Advances in Water Resources, 53:109–117, March 2013. ISSN 03091708. doi:10.1016/j.advwatres.2012.10.001. URL https://linkinghub.elsevier.com/retrieve/pii/S0309170812002564.
Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), 2021.
Wen et al. [2022] Gege Wen, Zongyi Li, Kamyar Azizzadenesheli, Anima Anandkumar, and Sally M. Benson. U-FNO—An enhanced Fourier neural operator-based deep-learning model for multiphase flow. Advances in Water Resources, 163:104180, May 2022. ISSN 03091708. doi:10.1016/j.advwatres.2022.104180. URL https://linkinghub.elsevier.com/retrieve/pii/S0309170822000562.
Zhou et al. [2022] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting, June 2022. URL http://arxiv.org/abs/2201.12740. arXiv:2201.12740 [cs, stat].
Kang et al. [2023] Hyoeun Kang, Yongsu Kim, Thi-Thu-Huong Le, Changwoo Choi, Yoonyoung Hong, Seungdo Hong, Sim Won Chin, and Howon Kim. A new fluid flow approximation method using a vision transformer and a U-shaped convolutional neural network. AIP Advances, 13(2):025233, February 2023. doi:10.1063/5.0138515. URL https://aip.scitation.org/doi/full/10.1063/5.0138515. Publisher: American Institute of Physics.
Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning. arXiv, March 2015. URL http://arxiv.org/abs/1502.03167. arXiv:1502.03167 [cs].
Maas et al. [2013] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning, 2013.
Hinton et al. [2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors, 2012. URL https://arxiv.org/abs/1207.0580. Publisher: arXiv Version Number: 1.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), 2017.
Li et al. [2020] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier Neural Operator for Parametric Partial Differential Equations, 2020. URL https://arxiv.org/abs/2010.08895. Publisher: arXiv Version Number: 3.
Bierkens et al. [2015] Marc F. P. Bierkens, Victoria A. Bell, Peter Burek, Nathaniel Chaney, Laura E. Condon, Cédric H. David, Ad de Roo, Petra Döll, Niels Drost, James S. Famiglietti, Martina Flörke, David J. Gochis, Paul Houser, Rolf Hut, Jessica Keune, Stefan Kollet, Reed M. Maxwell, John T. Reager, Luis Samaniego, Edward Sudicky, Edwin H. Sutanudjaja, Nick van de Giesen, Hessel Winsemius, and Eric F. Wood. Hyper-resolution global hydrological modelling: what is next?: “Everywhere and locally relevant”. Hydrological Processes, 29(2):310–320, January 2015. ISSN 08856087. doi:10.1002/hyp.10391. URL https://onlinelibrary.wiley.com/doi/10.1002/hyp.10391. 267 citations (Crossref) [2024-02-08].
Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], 2015.
Hestness et al. [2017] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. preprint, December 2017. URL http://arxiv.org/abs/1712.00409. arXiv:1712.00409 [cs, stat].
Lewis [1982] C. D. Lewis. Industrial and business forecasting methods: a practical guide to exponential smoothing and curve fitting. Butterworth Scientific, London ; Boston, 1982. ISBN 978-0-408-00559-3.
Zhou et al. [2014] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object Detectors Emerge in Deep Scene CNNs, December 2014. URL https://arxiv.org/abs/1412.6856v2.
Yang et al. [2019] Ceyuan Yang, Yujun Shen, and Bolei Zhou. Semantic Hierarchy Emerges in Deep Generative Representations for Scene Synthesis, November 2019. URL https://arxiv.org/abs/1911.09267v3.
Jain and Wallace [2019] Sarthak Jain and Byron C. Wallace. Attention is not Explanation, May 2019. URL http://arxiv.org/abs/1902.10186. arXiv:1902.10186 [cs].
Jawahar et al. [2019] Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. What Does BERT Learn about the Structure of Language? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy, July 2019. Association for Computational Linguistics. doi:10.18653/v1/P19-1356. URL https://aclanthology.org/P19-1356.
Wiegreffe and Pinter [2019] Sarah Wiegreffe and Yuval Pinter. Attention is not not Explanation, September 2019. URL http://arxiv.org/abs/1908.04626. arXiv:1908.04626 [cs].