1 Introduction

Streamflow and river flow predictions are widely employed in flood control, wetland management, and reservoir operation, and they inform drought prevention policies and water allocation. In the last decades, hydrologists and water managers have utilized distinct methodologies for streamflow prediction, such as numerical computation [45], stochastic models [36], physical rainfall–runoff approaches [34], and stage–discharge approaches [32]. Although distributed physically based models (e.g., rainfall–runoff models) that explicitly describe hydrologic processes still dominate streamflow modeling, data-driven models can be beneficial when information on physical watershed processes is limited but long precipitation and streamflow time series are available. Indeed, over the last two decades, substantial research into new approaches for data-driven streamflow prediction has been spurred by the development of increasingly powerful machine learning (ML) algorithms and the tremendous rise in computational performance.

ML deals with studying and analyzing various methods (viz. artificial intelligence algorithms) that can automatically improve via training on the data provided (herein streamflow, Q, precipitation, P, and temperature, T, data). In the field of hydrological modeling, several well-known ML approaches have been successfully employed, including artificial neural networks (ANNs) [14], neuro-fuzzy systems [22], support vector regression (SVR) [20], tree-based models [26], KNN-based models [33], and genetic programming (GP) [16].

Among the most commonly used ML models, neurocomputing approaches, and ANNs in particular, have been demonstrated to be effective tools for simulation, prediction, and exploratory data analysis, particularly in complicated, nonlinear systems [55]. Several researchers have employed different types of ANN models in the field of streamflow prediction, including multi-layer perceptron neural networks (MLPNN) [47], generalized regression neural networks (GRNN) [24], radial basis function neural networks (RBFNN) [13], the group method of data handling (GMDH) [23], extreme learning machines [48], recurrent neural networks (RNN) [7, 9], wavelet neural networks (WNN) [27], multivariate nonlinear autoregressive eXogenous (NARX) models [17], long short-term memory (LSTM) networks [43], and deep neural networks (DNN) [8].

Wu et al. [47] estimated stream and watershed runoff using MLPNN for multi-step flow prediction. According to the study, one-step predictions were more accurate than two-, three-, and four-step predictions, and the authors concluded that MLPNNs are useful for simulating watersheds and streams. Yaseen et al. [50] investigated the potential of MLPNN and RBFNN models for daily streamflow forecasting, demonstrating that the RBFNN model outperformed the MLPNN model. Alizadeh et al. [7] assessed several ML models, such as ANNs, SVR, RNN, and KNN, for predicting monthly flow in two case studies. Regarding the correlation coefficient, the findings demonstrated that KNN outperformed the other configurations in the first case study, while the RBFNN model exceeded the others in the second. Shoaib et al. [43] conducted comprehensive research evaluating the performance of different wavelet-based neural network (WNN) models, including MLPNN, GRNN, RBFNN, MNN, and NFNN; according to the results, the WNN models outperformed their traditional equivalents. In another study, Hussain and Khan [20] investigated the performance of the MLP, SVR, and RF machine learning models for forecasting one-month-ahead river flow; overall, RF performed better than the other models, followed by SVR and MLPNN. Ganesan et al. [14] applied DNNs to monitor river flow characteristics using multispectral satellite images; the outcomes were used to develop digital elevation maps of river drainage features and to guide disaster preparedness.

Excessive parameterization of ANNs (particularly MLPNNs and DNNs) may result in overfitted models that generalize poorly to new data. Moreover, some reports demonstrate that traditional ANN models can become trapped in local minima [12]. In addition, since streamflow comprises both base flow and peak flows, many traditional models fail to provide acceptable results for the peak flows. Owing to these shortcomings of stand-alone ANNs, novel and upgraded types of ANNs have been introduced and applied to model such sophisticated problems, many of them designed specifically to cope with the over-training problem in traditional ANNs dealing with highly nonlinear data. One of the main novel categories is integrative (hybrid) metaheuristic ANN models [42]. Briefly, metaheuristic algorithms are computational intelligence paradigms employed explicitly for complex optimization problems. Several studies have demonstrated the potency and efficacy of embedding metaheuristic algorithms in ML models to produce more robust and accurate results [53].

In this regard, some researchers have demonstrated the benefits of metaheuristic algorithms embedded in ANNs for streamflow and river flow modeling [5, 10, 15, 21, 37]. Zounemat-Kermani et al. [54] conducted thorough research evaluating several types of integrative and stand-alone ANNs for forecasting daily flow in the Thames River in the United Kingdom; the results indicated that all neural network models forecasted the daily flow rate acceptably. It has been widely reported that integrative ML and ANN models embedded with optimization algorithms such as particle swarm optimization (PSO) and the genetic algorithm (GA) (e.g., MLPNN-PSO and MLPNN-GA) provide more accurate results than stand-alone ANN models such as MLPNN, RBFNN, GRNN, GMDH, neuro-fuzzy, WNN, and DNN [2, 4, 6, 29, 30, 39, 41, 44, 52].

Feng and Niu [12] developed an integrative cooperation search algorithm (CSA) with ANN, namely CSA-ANN, to predict river flow time series. The CSA-ANN performed better than the ANN; for instance, it achieved 11.10% and 5.42% improvements in the Nash–Sutcliffe efficiency and correlation coefficient values compared to the stand-alone ANN. Wee et al. [46] employed the bat algorithm (BA), a metaheuristic technique, to improve the weights and biases of the ANN model. The suggested hybrid model was verified in five distinct study areas, and the statistical tests demonstrated that the hybrid BA-ANN was superior in forecasting the streamflow in all five regions. Zanial et al. [51] used a stand-alone ANN model and an integrative ANN model based on the Cuckoo search algorithm (CS-ANN) to forecast river flow behavior in Malaysia under changing rainfall patterns. The statistical indices revealed that the proposed hybrid CS-ANN model outperformed the stand-alone ANN model regarding R2 and RMSE values, estimating river flow more accurately.

Several metaheuristic search methods inspired by biological evolution and animal behavior have been widely and successfully applied to optimize the computational parameters of ANNs in hydrological applications. In the present study, several integrative ANN models are developed and appraised. The stand-alone ANN model, which serves as the study’s core model, was first constructed to forecast river flow using several input strategies (Q, T, and P) and a single output (Q one time step ahead). In the next step, four integrative ANN models are developed: ANN-AO (Aquila optimizer), ANN-GA (genetic algorithm), ANN-PSO (particle swarm optimization), and ANN-GWO (gray wolf optimizer). Comparing the results of these models provides a valuable contribution to the application of conventional (ANN-GA, ANN-PSO, and ANN-GWO) and new (ANN-AO) integrative ML models in hydrology. In addition, an advanced integrative ML model, viz., an ANN coupled with the Runge–Kutta optimizer improved by the Aquila optimizer (ANN-RUNAO), was developed to challenge the robustness and accuracy of the new hybrid model, which highlights the novelty of this research.

2 Case Study

The Jialing River Basin (JRB) of China is selected as the case study, as shown in Fig. 1. The JRB drains the Jialing River, the second largest tributary of the Yangtze River, and is situated between 29° 18′ N–34° 33′ N and 102° 39′ E–109° 01′ E. The total catchment area of the JRB is 144,122 km2 (accounting for 9% of the Yangtze basin area), with a river length of 1345 km. The JRB runs across three provinces and one municipality, i.e., Shanxi, Gansu, Sichuan, and Chongqing. The region has a subtropical humid monsoon climate with an uneven rainfall distribution throughout the year: the average annual rainfall is 873 mm (60–70% of which occurs from July to September), and the average annual temperature is 16–18 °C, with a maximum of 26.1 °C and a minimum of 4.4 °C [15, 40]. The upstream area of the JRB is mountainous and steep and, therefore, often faces floods, so robust, accurate estimation of river flows is crucial for this basin. Due to the narrow channel and fast-moving flows, many key reservoirs have been constructed in this basin, and precise river flow estimates are also essential for adequately scheduling these reservoirs. For this purpose, the main outlet of the JRB, i.e., the Beibei hydraulic station, is selected in this study.

Fig. 1 Study area

For JRB streamflow forecasting, daily time series of streamflow, precipitation, and temperature were obtained from the Hydrological Yearbooks of the People’s Republic of China for the period 2007 to 2015. To examine the effect of training data length on each model’s performance, three splitting schemes are adopted, i.e., M1 (50% training and 50% testing), M2 (60% training and 40% testing), and M3 (75% training and 25% testing), as sketched below. A brief statistical summary of the utilized streamflow data is reported in Table 1.
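The following minimal Python sketch illustrates the three chronological train-test splits; the series length and variable names are illustrative assumptions, not the authors' code.

```python
# A minimal sketch of the M1/M2/M3 train-test splitting schemes; the array
# shape and placeholder data are assumptions for illustration only.
import numpy as np

def split_series(data: np.ndarray, train_frac: float):
    """Chronological split: the first train_frac of the series for training,
    the remainder for testing (no shuffling, to preserve time order)."""
    n_train = int(len(data) * train_frac)
    return data[:n_train], data[n_train:]

series = np.random.rand(3287, 3)  # ~9 years of daily Q, P, T records (placeholder)
for name, frac in [("M1", 0.50), ("M2", 0.60), ("M3", 0.75)]:
    train, test = split_series(series, frac)
    print(name, train.shape, test.shape)
```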

Table 1 The statistical parameters of the applied data

3 Artificial Neural Network (ANN)

The structure of an artificial neural network (ANN) with the standard feed-forward architecture is represented in Fig. 2. The ANN is composed of an ensemble of elements working in parallel, i.e., the neurons, analogous to the biological neurons in the human nervous system. Generally speaking, any ANN comprises successive layers, starting with an input layer followed by an ensemble of hidden layers; finally, the processed information reaches the single neuron in the output layer. The information available for solving a nonlinear problem is collected in the input layer and used to build a nonlinear mapping, referred to as a regression problem. The collected information is sent to the hidden layer, where each neuron computes the weighted sum of its inputs and adds a bias value; a sigmoidal transformation function is then applied before the result is sent to the output layer. In brief, the connections between input and hidden neurons are called the weights “Wij”, and those between hidden and output neurons “Wjk” (Fig. 2). These parameters are conventionally determined during the training process using the backpropagation algorithm.

Fig. 2 The artificial neural network (ANN) architecture
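A minimal numpy sketch of the feed-forward pass just described may clarify the roles of Wij, Wjk, and the biases; the layer sizes (6 inputs, 10 hidden neurons) are illustrative assumptions matching the example later used in Sect. 3.3.

```python
# Feed-forward pass: weighted sum of inputs plus bias per hidden neuron,
# sigmoid transformation, then a single linear output neuron.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ann_forward(x, W_ij, b_j, W_jk, b_k):
    """x: (n_inputs,), W_ij: (n_inputs, n_hidden), W_jk: (n_hidden, 1)."""
    hidden = sigmoid(x @ W_ij + b_j)   # weighted sum + bias, then sigmoid
    return hidden @ W_jk + b_k         # output neuron

rng = np.random.default_rng(0)
x = rng.random(6)                      # e.g., lagged Q, P and T values
W_ij, b_j = rng.standard_normal((6, 10)), rng.standard_normal(10)
W_jk, b_k = rng.standard_normal((10, 1)), rng.standard_normal(1)
print(ann_forward(x, W_ij, b_j, W_jk, b_k))
```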

3.1 Artificial Neural Network Optimized Using the Genetic Algorithm (ANN-GA)

The genetic algorithm (GA) was proposed by Holland [18, 19] and works by simulating biological genetic evolution. The GA comprises selection, reproduction, crossover, and mutation operators; it searches for the best solution by evolving a randomly generated initial population through the problem space. A fitness function determines which individuals in the population reproduce and survive. In the present study, the GA is employed to search for the optimal values of the weights of the ANN model. This includes generating an initial population, coding the chromosomes, determining the chromosome length, defining the fitness function, and finally applying the selection, crossover, and mutation operations [11]. Therefore, in ANN-GA, the model is optimized with the GA instead of the well-known backpropagation algorithm; thus, the weights and biases are optimized rather than randomly generated. A sketch of this procedure is given below.
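The following compact sketch assumes real-valued chromosomes (one gene per ANN weight or bias) and MSE as the fitness to minimize; the specific operators (tournament selection, one-point crossover, Gaussian mutation) are illustrative choices, not the authors' implementation.

```python
# GA over a flat 81-parameter weight vector; fitness is a stand-in for the
# ANN's training MSE (a real run would decode the chromosome into weights).
import numpy as np
rng = np.random.default_rng(1)

def fitness(chromosome):                    # placeholder for the ANN MSE
    return np.mean(chromosome ** 2)

pop = rng.uniform(-1, 1, size=(30, 81))     # 30 chromosomes, 81 parameters
for generation in range(100):
    # tournament selection: keep the better of two random individuals
    i, j = rng.integers(0, 30, (2, 30))
    parents = np.where((np.apply_along_axis(fitness, 1, pop[i]) <
                        np.apply_along_axis(fitness, 1, pop[j]))[:, None],
                       pop[i], pop[j])
    # one-point crossover between consecutive parents
    cut = rng.integers(1, 81)
    children = parents.copy()
    children[::2, cut:], children[1::2, cut:] = parents[1::2, cut:], parents[::2, cut:]
    # Gaussian mutation with small probability
    mask = rng.random(children.shape) < 0.02
    children[mask] += rng.standard_normal(mask.sum()) * 0.1
    pop = children
best = min(pop, key=fitness)
print("best MSE:", fitness(best))
```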

3.2 Artificial Neural Network Optimized Using Particle Swarm Optimization (ANN-PSO)

Particle swarm optimization (PSO) is an evolutionary algorithm built on a social–psychological metaphor, proposed by Kennedy and Eberhart [28] mainly for global optimization problems. The PSO uses an ensemble of particles that form a population, and the algorithm can be summarized as follows. Each particle “i” has three components: the position (xi), the velocity (vi), and the best position it has visited (pi). A global best position (pg) also exists; each individual moves toward its own best solution and simultaneously in the direction of the global best position, so (pi) and (pg) are both used for updating the position of each particle (Fig. 3). Consequently, there is competition among the individuals of the population, and if one member discovers a good position, all other members move closer to it [29, 30, 49]:

$$v_{i}^{t + 1} = \delta v_{i}^{t} + L_{1} M_{1} \left( {p_{i} \left( t \right) - x_{i} \left( t \right)} \right) + L_{2} M_{2} \left( {p_{g} \left( t \right) - x_{i} \left( t \right)} \right)$$
(1)
$$x_{i}^{t + 1} = x_{i}^{t} + v_{i}^{t + 1}$$
(2)
Fig. 3 The PSO algorithm

In the above equations, L1 and L2 are the acceleration constants, M1 and M2 are uniform random values in [0, 1], δ is the inertia weight factor, and \(v_{i}^{t + 1}\) and \(x_{i}^{t + 1}\) are the velocity and position of particle “i” at iteration (t + 1). The overall PSO algorithm can be summarized as follows: (i) the positions and velocities of the population are randomly initialized; (ii) the fitness is evaluated; (iii) the velocity of each particle is updated; (iv) in the construction phase, each particle updates its position; and (v) the algorithm stops once the stopping criterion is satisfied. The present study uses the PSO to train the ANN model and optimize the weights, i.e., “Wij” and “Wjk” in Fig. 2. The PSO algorithm starts by randomly initializing the weight values, after which training begins. During training, the algorithm compares each fitness value, i.e., the mean squared error (MSE), with the global best solution, which is regarded as the optimum fitness value. We conclude by highlighting that, in the PSO algorithm, each weight is represented by two components, a position and a velocity: the position encodes the weight’s value, while the velocity updates it during training [29, 30, 49]. A minimal sketch follows.
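The sketch below implements Eqs. (1)-(2) directly, with δ the inertia weight and L1/L2 the acceleration constants; the fitness function is a placeholder assumption standing in for the ANN's training MSE.

```python
# PSO over 81-dimensional particle positions (candidate ANN weight vectors).
import numpy as np
rng = np.random.default_rng(2)

def fitness(x):                               # placeholder for the ANN training MSE
    return np.sum(x ** 2)

n, dim = 25, 81
delta, L1, L2 = 0.7, 2.0, 2.0                 # inertia weight, acceleration constants
x = rng.uniform(-1, 1, (n, dim))
v = np.zeros((n, dim))
p = x.copy()                                  # personal best positions p_i
pg = min(p, key=fitness).copy()               # global best position p_g
for t in range(200):
    M1, M2 = rng.random((n, 1)), rng.random((n, 1))
    v = delta * v + L1 * M1 * (p - x) + L2 * M2 * (pg - x)   # Eq. (1)
    x = x + v                                                # Eq. (2)
    better = np.apply_along_axis(fitness, 1, x) < np.apply_along_axis(fitness, 1, p)
    p[better] = x[better]
    pg = min(p, key=fitness).copy()
print("best MSE:", fitness(pg))
```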

3.3 ANN Optimized Using Gray Wolf Optimization Algorithm (ANN-GWO)

Mirjalili et al. [35] introduced the gray wolf optimization (GWO) algorithm. The idea of the GWO is to simultaneously simulate the social hierarchy and the hunting activity of gray wolves in nature. According to Mirjalili et al. [35], there are four types of wolves, alpha (α), beta (β), delta (δ), and omega (ω), and the algorithm is composed of four steps, namely the “hierarchy”, “tracking”, “encircling”, and “attacking” processes. The top of the hierarchy is occupied by the alpha, which makes the decisions regarding the course of the hunting operation, as its success is key to the future survival of the wolf pack. The second place is occupied by the beta, which receives commands from the alpha and transmits orders to the lower-ranked wolves (i.e., the delta and omega) for execution; thus, the beta sometimes helps the alpha make critical decisions. The third place is occupied by the delta, which dominates the omega and reports to the alpha and beta. Finally, the omega is ranked lowest among all wolves and plays the role of scapegoat. The GWO can be summarized as follows. The hunting process is triggered by encircling the prey, which can be formulated as follows [35]:

$$\vec{D} = \left| {\vec{C} \cdot \overrightarrow {{X_{P} }} \left( t \right) - \vec{X} \left( t \right)} \right|$$
(3)
$$\vec{X}\left( {t + 1} \right) = \overrightarrow {{X_{P} }} \left( t \right) - \vec{A} \cdot \vec{D}$$
(4)

In the above equations, \(\overrightarrow {{X_{P} }}\) and \(\vec{X}\) are the current positions of the prey and the gray wolves, respectively, \(\vec{A}\) and \(\vec{C}\) are the coefficient vectors. The hunting can be formulated as follows:

$$\overrightarrow {{D_{\alpha } }} = \left| {\overrightarrow {{C_{1} }} \cdot \overrightarrow {{X_{\alpha } }} - \vec{X}} \right|,\; \overrightarrow {{D_{\beta } }} = \left| {\overrightarrow {{C_{2} }} \cdot \overrightarrow {{X_{\beta } }} - \vec{X}} \right|,\;\overrightarrow {{D_{\delta } }} = \left| {\overrightarrow {{C_{3} }} \cdot \overrightarrow {{X_{\delta } }} - \vec{X}} \right|$$
(5)
$$\overrightarrow {{X_{1} }} = \overrightarrow {{X_{\alpha } }} - \overrightarrow {{A_{1} }} \cdot \left( {\overrightarrow {{D_{\alpha } }} } \right),\;\overrightarrow {{X_{2} }} = \overrightarrow {{X_{\beta } }} - \overrightarrow {{A_{2} }} \cdot \left( {\overrightarrow {{D_{\beta } }} } \right),\;\overrightarrow {{X_{3} }} = \overrightarrow {{X_{\delta } }} - \overrightarrow {{A_{3} }} \cdot \left( {\overrightarrow {{D_{\delta } }} } \right)$$
(6)
$$\vec{X}\left( {t + 1} \right) = \frac{{\overrightarrow {{X_{1} }} + \overrightarrow {{X_{2} }} + \overrightarrow {{X_{3} }} }}{3}$$
(7)

As reported above, the performance of the ANN model depends on two kinds of parameters, i.e., the weights “Wij” and “Wjk”, in addition to the bias values. A random initialization of these parameters can cause poor generalization performance. Therefore, the GWO is used to optimize the model parameters using the mean squared error as the cost function. For example, for an ANN model with six input variables, ten hidden neurons, and one output neuron, the total number of parameters to be optimized is calculated as follows:

$$N_{P} = \left( {6 \times 10} \right) + \left( {10} \right) + \left( {1 \times 10} \right) + \left( 1 \right) = 81$$
(8)

Therefore, 81 parameters are involved in the GWO optimization problem. The ANN training starts and the fitness value is calculated; after each iteration, the fitness value is compared with the previous one, and if it is better, the candidates’ positions are updated. This optimization process continues until the maximum number of iterations is reached, and the optimized parameters constitute the final model. A condensed sketch is given below.
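The following condensed sketch implements the position update of Eqs. (5)-(7). The coefficient vectors A = 2a·r1 − a and C = 2·r2, with a decreasing linearly from 2 to 0, follow the original GWO formulation (they are not restated in the text above); the fitness is again a placeholder for the ANN's MSE over the 81 parameters of Eq. (8).

```python
# GWO update over wolf positions = candidate ANN weight vectors.
import numpy as np
rng = np.random.default_rng(3)

def fitness(x):                               # placeholder for the ANN training MSE
    return np.sum(x ** 2)

n, dim, max_iter = 25, 81, 200
X = rng.uniform(-1, 1, (n, dim))
for t in range(max_iter):
    order = np.argsort([fitness(w) for w in X])
    X_alpha, X_beta, X_delta = X[order[0]], X[order[1]], X[order[2]]
    a = 2 * (1 - t / max_iter)                # control parameter, decreases 2 -> 0

    def move(leader):
        A = 2 * a * rng.random((n, dim)) - a  # coefficient vector A
        C = 2 * rng.random((n, dim))          # coefficient vector C
        D = np.abs(C * leader - X)            # Eq. (5)
        return leader - A * D                 # Eq. (6)

    X = (move(X_alpha) + move(X_beta) + move(X_delta)) / 3.0   # Eq. (7)
best = min(X, key=fitness)
print("best MSE:", fitness(best))
```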

3.4 ANN Optimized Using the Runge–Kutta Algorithm (ANN-RUN)

The Runge–Kutta optimizer (RUN) is a relatively new optimization algorithm proposed by Ahmadianfar et al. [3], inspired by the Runge–Kutta scheme for solving ordinary differential equations [31, 38]. According to Ahmadianfar et al. [3], RUN is composed of four parts: the “initialization step”, the “root of the search mechanism”, the “updating solutions” step, and the “enhanced solution quality” step. During the first step, i.e., initialization, and similar to most optimization algorithms, an initial population of “N” positions is created and evolved over a fixed number of iterations. The initial positions are created according to the following equation:

$$x_{n,l} = L_{l} + rand \cdot \left( {U_{l} - L_{l} } \right)$$
(9)

where rand is a random number in the range [0, 1], and \(U_{l}\) and \(L_{l}\) are the upper and lower bounds. During the second step, i.e., the “root of the search mechanism”, the fourth-order Runge–Kutta scheme (RK4) is adopted as the search mechanism. In this step, the four Runge–Kutta coefficients, designated k1, k2, k3, and k4, are calculated using the following equations:

$$k_{1} = \frac{1}{{2\left( {2 \times rand \times \left| {Stp} \right|} \right)}}\left[ {rand \times x_{w} - \left( {round\left( {1 + rand} \right) \times \left( {1 - rand} \right)} \right) \times x_{b} } \right]$$
(10)
$$Stp = rand \times \left( {\left( {x_{b} - rand \times x_{avg} } \right) + \gamma } \right)$$
(11)
$$\gamma = rand\left( {x_{n} - rand\left( {u - l} \right)} \right) \times exp\left( { - 4 \times \frac{i}{Maxi}} \right)$$
(12)

In the above equations, rand is a random number in the range [0, 1], xb is the best position around xn, xw is the worst position around xn, “u” is a random number introduced to promote xb, Stp is the step size determined by the difference between xb and xavg (the average of all solutions at each iteration), and γ is a scale factor. By introducing the quantity Δx, k2, k3, and k4 can be expressed as follows:

$$\Delta x = 2 \times rand \times \left| {Stp} \right|$$
(13)
$$k_{2} = \frac{1}{2\Delta x}\left[ {rand \times \left( {x_{w} + rand_{1} \times k_{1} \times \Delta x} \right) - \left( {u \cdot x_{b} + rand_{2} \times k_{1} \times \Delta x} \right)} \right]$$
(14)
$$k_{3} = \frac{1}{2\Delta x}\left[ {rand \times \left( {x_{w} + rand_{1} \times \left( {\frac{1}{2}k_{2} } \right) \times \Delta x} \right) - \left( {u \cdot x_{b} + rand_{2} \times \left( {\frac{1}{2}k_{2} } \right) \times \Delta x} \right)} \right]$$
(15)
$$k_{4} = \frac{1}{2\Delta x}\left[ {rand \times \left( {x_{w} + rand_{1} \times k_{3} \times \Delta x} \right) - \left( {u \cdot x_{b} + rand_{2} \times k_{3} \times \Delta x} \right)} \right]$$
(16)

where rand1 and rand2 are two random numbers in the range of [0, 1]. Finally, the leading search mechanism in RUN can be formulated as follows:

$$SM = \frac{1}{6}\left( {k_{1} + 2 \times k_{2} + 2 \times k_{3} + k_{4} } \right)\Delta x$$
(17)
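As a worked illustration of Eqs. (10)-(17), the sketch below computes SM for a single solution vector. The independent rand draws, the scalar bounds, and the function name are assumptions for illustration; note that Eq. (10) shares its prefactor with Δx of Eq. (13), which the code exploits.

```python
# RK4-style search mechanism of RUN for one candidate solution.
import numpy as np
rng = np.random.default_rng(4)

def run_search_mechanism(xn, xb, xw, x_avg, u_bound, l_bound, i, max_i):
    r = rng.random
    gamma = r() * (xn - r() * (u_bound - l_bound)) * np.exp(-4 * i / max_i)  # Eq. (12)
    stp = r() * ((xb - r() * x_avg) + gamma)                                 # Eq. (11)
    dx = 2 * r() * np.abs(stp)                                               # Eq. (13)
    u = r()                                     # the "u" factor promoting xb
    k1 = (r() * xw - np.round(1 + r()) * (1 - r()) * xb) / (2 * dx)          # Eq. (10)
    k2 = (r() * (xw + r() * k1 * dx) - (u * xb + r() * k1 * dx)) / (2 * dx)  # Eq. (14)
    k3 = (r() * (xw + r() * 0.5 * k2 * dx)
          - (u * xb + r() * 0.5 * k2 * dx)) / (2 * dx)                       # Eq. (15)
    k4 = (r() * (xw + r() * k3 * dx) - (u * xb + r() * k3 * dx)) / (2 * dx)  # Eq. (16)
    return (k1 + 2 * k2 + 2 * k3 + k4) * dx / 6.0                            # Eq. (17)

xn = rng.uniform(-1, 1, 81)
print(run_search_mechanism(xn, 0.9 * xn, 1.1 * xn, xn, 1.0, -1.0, i=10, max_i=200)[:3])
```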

The third step is “updating solutions”. During this step, and at each iteration, there is a continuous update of the positions of individuals using the RK. The global “exploration” and local “exploitation” search can be formulated as follows:

$$if\; rand < 0.5 \Rightarrow exploration\;phase, \quad x_{n + 1} = x_{c} + SF \times SM + \mu \times x_{s}$$
(18)
$$else \Rightarrow exploitation\; phase,\quad x_{n + 1} = x_{m} + SF \times SM + \mu \times x_{{s^{\prime}}}$$
(19)

The following equations can be formulated:

$$\mu = 0.5 + 0.1 \times randn$$
(20)
$$x_{s} = randn \times \left( {x_{m} - x_{c} } \right)\quad and\quad x_{{s^{\prime}}} = randn \times \left( {x_{r1} - x_{r2} } \right)$$
(21)
$$x_{c} = \varphi \times x_{n} + \left( {1 - \varphi } \right) \times x_{r1} \quad and\quad x_{m} = \varphi \times x_{best} + \left( {1 - \varphi } \right) \times x_{lbest}$$
(22)

where randn is a normally distributed random number (from which μ is computed in Eq. 20), φ is a random number in the range [0, 1], xbest is the “best so far” solution, xlbest is the best position calculated at each iteration, and xr1, xr2, and xr3 are three random solutions. The scale factor SF can be formulated as follows:

$$SF = 2 \times \left( {0.5 - rand} \right) \times a \times exp\left( {\left( { - b \times rand \times \left( \frac{i}{Maxi} \right)} \right)} \right)$$
(23)

where a and b are two constants, i is the iteration number, and Maxi is the maximum number of iterations. Eqs. 18 and 19 can then be reformulated as follows:

$$if\;rand < 0.5 \Rightarrow exploration\;phase,\; x_{n + 1} = \left( {x_{c} + r \times SF \times g \times x_{c} } \right) + SF \times SM + \mu \times x_{s}$$
(24)
$$else \Rightarrow exploitation\;phase,\; x_{n + 1} = \left( {x_{m} + r \times SF \times g \times x_{m} } \right) + SF \times SM + \mu \times x_{{s^{\prime}}}$$
(25)

where r is an integer equal to 1 or −1, and “g” is a random number in the range [0, 2].

The “enhanced solution quality” (ESQ) is the last step of the optimization algorithm and is introduced to improve the quality of the solutions. Using the ESQ, the mean of three random solutions “xavg” and the best position are combined to calculate a new solution “xnew1”, from which a further solution “xnew2” is derived as follows:

$$if\;rand < 0.5, \;if\; w < 1 \Rightarrow x_{new2} = x_{new1} + r \cdot w \cdot \left| {\left( {x_{new1} - x_{avg} } \right) + randn} \right|$$
(26)
$$else\; x_{new2} = \left( {x_{new1} - x_{avg} } \right) + r \cdot w \cdot \left| {\left( {\mu .x_{new1} - x_{avg} } \right) + randn} \right|,\; end, \;end$$
(27)

where “w”, “xavg” and “xnew1” can be calculated as follows:

$$w = rand\left( {0,2} \right) \cdot exp\left( { - c\left( \frac{i}{Maxi} \right)} \right)$$
(28)
$$x_{avg} = \frac{{x_{r1} + x_{r2} + x_{r3} }}{3}$$
(29)
$$x_{new1} = \beta \times x_{avg} + \left( {1 - \beta } \right) \times x_{best}$$
(30)

Another solution can be proposed, i.e., the “xnew3” and formulated as follows:

$$if\;rand < w \Rightarrow x_{new3} = \left( { x_{new2} - rand \cdot x_{new2} } \right) + SF \cdot \left( {rand \cdot x_{RK} + \left( {\rho \cdot x_{b} - x_{new2} } \right)} \right),\; end$$
(31)

where \(\rho\) is a random number with a value of 2 × rand. The flowchart of the RUN is presented in Fig. 4.

Fig. 4 Flowchart of the RUN algorithm; f refers to the fitness value [3]

In the present study, the weights and biases of the ANN were optimized using RUN with the MSE as the cost function. The steps can be summarized as follows: selecting the number of hidden neurons and the number of input variables, selecting the activation function, and creating the initial population according to the total number of parameters and the total number of iterations. A sketch of the weight decoding and cost evaluation is given below.
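The sketch below shows how any of the optimizers in this section can score a flat solution vector: it is decoded into the ANN weights and evaluated with the MSE cost. The layout (6 inputs, 10 hidden neurons, 1 output; 81 parameters as in Eq. 8) and the helper names are illustrative assumptions.

```python
# Mapping a flat 81-element solution vector onto ANN weights and scoring it.
import numpy as np

def decode(theta):
    """Unpack an 81-element vector into W_ij, b_j, W_jk, b_k."""
    W_ij = theta[:60].reshape(6, 10)
    b_j = theta[60:70]
    W_jk = theta[70:80].reshape(10, 1)
    b_k = theta[80:]
    return W_ij, b_j, W_jk, b_k

def mse_cost(theta, X_train, y_train):
    W_ij, b_j, W_jk, b_k = decode(theta)
    hidden = 1.0 / (1.0 + np.exp(-(X_train @ W_ij + b_j)))   # sigmoid hidden layer
    pred = (hidden @ W_jk + b_k).ravel()
    return np.mean((y_train - pred) ** 2)

# any metaheuristic above can now minimize mse_cost over theta
rng = np.random.default_rng(5)
X_train, y_train = rng.random((100, 6)), rng.random(100)     # placeholder data
print(mse_cost(rng.uniform(-1, 1, 81), X_train, y_train))
```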

3.5 ANN Optimized Using the Runge–Kutta Algorithm Improved by the Aquila Optimizer (ANN-RUNAO)

Abualigah et al. [1] proposed the Aquila optimizer (AO) algorithm, formulated by simulating the Aquila’s hunting behavior through four different hunting methods, denoted hereafter X1, X2, X3, and X4. X1 corresponds to the “high soar with a vertical stoop”, X2 to the “contour flight with short glide attack”, X3 to the “low flight with a slow descent attack”, and X4 to “walking and grabbing prey”. Like other well-known optimization algorithms, the AO works with a population of candidate solutions (X) bounded by an upper limit “UB” and a lower limit “LB”. The population X can be expressed as follows:

$$X=\left[\begin{array}{ccccc} x_{{1,1}} & x_{{1,2}} & \dots & x_{1,Dim-1} & x_{1,Dim}\\ x_{{2,1}} & x_{{2,2}} & \dots & x_{2,Dim-1} & x_{2,Dim}\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ x_{N-{1,1}} & x_{N-1,2} & \dots & x_{N-1,Dim-1} & x_{N-1,Dim}\\ x_{N,1} & x_{N,2} & \dots & x_{N,Dim-1} & x_{N,Dim}\\ \end{array}\right]$$
(32)

In the above equation, X corresponds to an ensemble of probable solutions randomly generated using the equation expressed hereafter:

$$X_{ij} = rand \times \left( {UB_{j} - LB_{j} } \right) + LB_{j} ,\;i = 1,2, \ldots \ldots ,N \; j = 1,2, \ldots \ldots ,Dim$$
(33)

where “Xi” is the position of the ith candidate solution, “N” is the total number of individuals in the population, “Dim” denotes the dimension of the problem to be solved, and rand is a random number in [0, 1].

During the first step, i.e., the “expanded exploration” or the “high soar with a vertical stoop”, the solution generated by the first search method can be expressed as follows:

$$X_{1} \left( {t + 1} \right) = X_{best} \left( t \right) \times \left( {1 - \frac{t}{T}} \right) + \left( {X_{M} \left( t \right) - X_{best} \left( t \right)*rand} \right)$$
(34)

where t and T are, respectively, the current and the maximum number of iterations, X1(t + 1) corresponds to the solution produced at iteration “t”, Xbest(t) corresponds to the “best” solution found so far, rand ranges from 0 to 1, and XM(t) is the mean position of the current solutions:

$$X_{M} \left( t \right) = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} X_{i} \left( t \right),\;\forall \,j = 1,2, \ldots ,Dim$$
(35)

The second possible hunting method is called “narrowed exploration” or the “contour flight with short glide attack”, calculated as follows:

$$X_{2} \left( {t + 1} \right) = X_{best} \left( t \right) \times Levy\left( D \right) + X_{R} \left( t \right) + \left( {y - x} \right)*rand$$
(36)

where X2(t + 1) corresponds to the solution of the second method at iteration “t”, D is the dimension space, Levy(D) denotes the “Levy flight distribution function”, and XR(t) is a random solution drawn from the range [1, N] at the ith iteration. Levy(D) can be calculated as follows:

$$Levy\left( D \right) = s \times \frac{\mu \times \sigma }{{\left| \nu \right|^{{\frac{1}{\beta }}} }}$$
(37)
$$\sigma = \left( {\frac{{{\Gamma }\left( {1 + \beta } \right) \times sine\left( {\frac{\pi \beta }{2}} \right)}}{{{\Gamma }\left( {\frac{1 + \beta }{2}} \right) \times \beta \times 2^{{\left( {\frac{\beta - 1}{2}} \right)}} }}} \right)$$
(38)
$$y = r \times \cos \left( \theta \right)$$
(39)
$$x = r \times \sin \left( \theta \right)$$
(40)
$$r = r1 + U \times D_{1}$$
(41)
$$\theta = - \omega \times D_{1} + \theta 1$$
(42)
$$\theta 1 = \frac{3 \times \pi }{2}$$
(43)

In the above equations, β is equal to 1.5, U is equal to 0.00565, ω is equal to 0.005, r1 ranges from 1 to 20, D1 ranges from 1 to Dim.
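A short sketch of the Levy flight step (Eqs. 37-38) follows, with β = 1.5 as stated above. The scale s = 0.01 and the uniform draws for μ and ν follow the original AO formulation and are assumptions here, since the text leaves them undefined.

```python
# Levy flight step used by the AO's narrowed exploration.
import math
import numpy as np
rng = np.random.default_rng(6)

def levy(dim, beta=1.5, s=0.01):
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2) /
             (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)))  # Eq. (38)
    mu, nu = rng.random(dim), rng.random(dim)
    return s * mu * sigma / np.abs(nu) ** (1 / beta)                       # Eq. (37)

print(levy(5))
```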

During the third step, i.e., the “expanded exploitation” or the “low flight with a slow descent attack”, the X3 can be formulated as follows:

$$X_{3} \left( {t + 1} \right) = \left( {X_{best} \left( t \right) - X_{M} \left( t \right)} \right) \times \alpha - rand + \left( {\left( {UB - LB} \right) \times rand + LB} \right) \times \delta$$
(44)

In the above equation, α and δ were fixed to 0.1.

The fourth and last method is the “narrowed exploitation” or “walking and grabbing prey”; X4 can be expressed as follows:

$$X_{4} \left( {t + 1} \right) = QF \times X_{best} \left( t \right) - \left( {G_{1} \times X\left( t \right) \times rand} \right) - G_{2} \times Levy\left( D \right) + rand \times G_{1}$$
(45)

where QF is a quality function, calculated as follows:

$$QF\left( t \right) = t^{{\frac{2 \times rand - 1}{{\left( {1 - T} \right)^{2} }}}}$$
(46)
$$G_{1} = 2 \times rand - 1$$
(47)
$$G_{2} = 2 \times \left( {1 - \frac{t}{T}} \right)$$
(48)

The flowchart of the Aquila optimizer (AO) algorithm is presented in Fig. 5. The present study combines the AO algorithm with RUN to optimize the ANN parameters. The proposed RUNAO is presented in Fig. 6: it starts by generating an ensemble of “N” individuals “X”, each corresponding to a set of ANN parameters (i.e., the weights and biases). As the data are divided into training and testing sets, the training part is used to evaluate and update the quality of each parameter vector, i.e., each weight, with the mean squared error “MSE” between the measured and predicted data as the fitness function. The value corresponding to the best weights, “Xbest”, is updated at each iteration, and consequently all other candidates are updated. The optimization procedure is repeated over many iterations, progressively adjusting the positions of the particles until the smallest fitness value is obtained. A high-level sketch of this loop follows.
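The exact interleaving of AO and RUN moves is a design choice of the original algorithm's authors (Figs. 5-6), so the following self-contained sketch simply alternates an AO-style exploration step (after Eq. 34) with a simplified RUN-style refinement; both the alternation scheme and the simplified refinement step are assumptions for illustration only.

```python
# A hedged, high-level sketch of a RUNAO-like loop over candidate ANN weights.
import numpy as np
rng = np.random.default_rng(7)

def fitness(x):                               # placeholder for the ANN training MSE
    return np.sum(x ** 2)

N, dim, max_iter = 30, 81, 200
X = rng.uniform(-1, 1, (N, dim))              # candidate ANN weight vectors
scores = np.array([fitness(x) for x in X])
best = X[scores.argmin()].copy()
for t in range(max_iter):
    for i in range(N):
        if rng.random() < 0.5:                # AO-style expanded exploration (Eq. 34)
            trial = best * (1 - t / max_iter) + (X.mean(axis=0) - best) * rng.random()
        else:                                 # simplified RUN-style local refinement
            step = rng.standard_normal(dim) * np.abs(best - X[i])
            trial = X[i] + 0.5 * step
        f = fitness(trial)
        if f < scores[i]:                     # greedy replacement keeps the search stable
            X[i], scores[i] = trial, f
    best = X[scores.argmin()].copy()
print("final fitness:", scores.min())
```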

Fig. 5 Flowchart of the Aquila optimizer (AO) algorithm

Fig. 6 Flowchart of the RUNAO algorithm

4 Models’ Assessments

The performance of the standard ANN and the hybrid ANN-PSO, ANN-GA, ANN-GWO, ANN-AO, ANN-RUN, and ANN-RUNAO methods in streamflow prediction was evaluated using four statistical metrics and several graphical approaches, such as violin and Taylor diagrams. The formulas for the metrics are as follows:

$$RMSE:\;Root\;Mean\;Square\;Error = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} [(Q_{o} )_{i} - (Q_{c} )_{i} ]^{2} }$$
(49)
$$MAE:\;Mean\;Absolute\; Error = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left| {(Q_{o} )_{i} - (Q_{c} )_{i} } \right|$$
(50)
$$R^{2} :\;Determination\;coefficient = \left[ {\frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {(Q_{o} )_{i} - \overline{{Q_{o} }} } \right)\left( {(Q_{c} )_{i} - \overline{{Q_{c} }} } \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{N} ((Q_{o} )_{i} - \overline{{Q_{o} }} )^{2} \mathop \sum \nolimits_{i = 1}^{N} ((Q_{c} )_{i} - \overline{{Q_{c} }} )^{2} } }}} \right]^{2}$$
(51)
$$NSE:\;Nash{-}Sutcliffe\;efficiency = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{N} [(Q_{o} )_{i} - (Q_{c} )_{i} ]^{2} }}{{\mathop \sum \nolimits_{i = 1}^{N} [(Q_{o} )_{i} - \overline{Q}_{o} ]^{2} }},\; - \infty < NSE \le 1$$
(52)

where \(Q_{o}\) and \(Q_{c}\) refer to the observed and computed streamflow, \(\overline{Q}_{o}\) and \(\overline{Q}_{c}\) are their means, and \(N\) is the number of data points. Straightforward implementations are sketched below.
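The following functions implement Eqs. (49)-(52) directly; the array names and toy values are illustrative.

```python
# Performance metrics for observed (q_obs) and computed (q_sim) streamflow.
import numpy as np

def rmse(q_obs, q_sim):
    return np.sqrt(np.mean((q_obs - q_sim) ** 2))        # Eq. (49)

def mae(q_obs, q_sim):
    return np.mean(np.abs(q_obs - q_sim))                # Eq. (50)

def r2(q_obs, q_sim):                                    # Eq. (51)
    num = np.sum((q_obs - q_obs.mean()) * (q_sim - q_sim.mean()))
    den = np.sqrt(np.sum((q_obs - q_obs.mean()) ** 2) *
                  np.sum((q_sim - q_sim.mean()) ** 2))
    return (num / den) ** 2

def nse(q_obs, q_sim):                                   # Eq. (52)
    return 1 - np.sum((q_obs - q_sim) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)

q_obs = np.array([850.0, 1200.0, 950.0])                 # toy values (m3/s)
q_sim = np.array([800.0, 1150.0, 1000.0])
print(rmse(q_obs, q_sim), mae(q_obs, q_sim), r2(q_obs, q_sim), nse(q_obs, q_sim))
```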

5 Results and Discussion

5.1 Results

The viability of the hybrid ANN-RUNAO method was compared with the hybrid ANN-PSO, ANN-GA, ANN-GWO, ANN-AO, and ANN-RUN methods and the standard ANN method in streamflow prediction using different input scenarios of streamflow, precipitation, and temperature. Three distinct data split cases (50–50%, 60–40%, and 75–25% train-test) were employed to observe the effect of training data length on each method’s accuracy. Table 2 reports the train-test statistics of the standard ANN method in streamflow prediction. First, previous discharge values were used as inputs, and then previous precipitation and temperature data were added. It is clear from Table 2 that the best accuracies of the ANN models belong to the input scenarios involving all three variables, Q, P, and T, which indicates that all these data are needed for better streamflow prediction. Among the different split scenarios, 75–25% provides the best accuracy in streamflow prediction, with the lowest RMSE and MAE and the highest R2 and NSE in both the training and testing stages. Increasing the training data from 50 to 75% improves the model accuracy by 23.7%, 20.6%, 12.3%, and 17.5% with respect to the mean RMSE, MAE, R2, and NSE in the testing stage, respectively. The best input combination consists of two lagged streamflow, three lagged precipitation, and two lagged temperature values in all three data split cases. Even though the ANN model of the second split scenario (M2, 60–40%) has a lower RMSE than the model of the third scenario, the MAE, R2, and NSE of the latter are superior.

Table 2 Training and test statistics of the models for streamflow prediction—ANN

The accuracies of the hybrid ANN methods in streamflow prediction are summarized in Tables 3, 4, 5, 6, 7, and 8. Similar to the standard ANN method, the hybrid methods also provide the best accuracies in the 75–25% train-test split case, confirming that increasing the training data length improves the ANN-based models in streamflow prediction. Increasing the training data from 50 to 75% reduces the RMSE and MAE from 1517.7 m3/s and 642.9 m3/s to 1145.6 m3/s and 521.9 m3/s for ANN-PSO (Table 3), from 1503.8 m3/s and 654.1 m3/s to 1123.2 m3/s and 511.1 m3/s for ANN-GA (Table 4), from 1483.1 m3/s and 640.4 m3/s to 1069 m3/s and 503.7 m3/s for ANN-GWO (Table 5), from 1470.1 m3/s and 626.4 m3/s to 1056.4 m3/s and 501.4 m3/s for ANN-AO (Table 6), from 1460.1 m3/s and 622.4 m3/s to 1047.1 m3/s and 496.1 m3/s for ANN-RUN (Table 7), and from 1451.7 m3/s and 614.5 m3/s to 1034.7 m3/s and 489.6 m3/s for ANN-RUNAO (Table 8) in the testing stage.

Table 3 Training and test statistics of the models for streamflow prediction—ANN-PSO
Table 4 Training and test statistics of the models for streamflow prediction—ANN-GA
Table 5 Training and test statistics of the models for streamflow prediction—ANN-GWO
Table 6 Training and test statistics of the models for streamflow prediction—ANN-AO
Table 7 Training and test statistics of the models for streamflow prediction—ANN-RUN
Table 8 Training and test statistics of the models for streamflow prediction—ANN-RUNAO

According to the average statistics provided in Tables 2, 3, 4, 5, 6, 7, and 8, the ANN-PSO model improves the prediction accuracy of the single ANN by 1, 4.1, 0.4, and 4.3% with respect to RMSE, MAE, R2, and NSE in the testing stage of the 50–50% split case, while the corresponding improvements are 1.2, 4.2, 0.4, and 0.9% for the 60–40% case and 2.2, 1.9, 1.4, and 1.6% for the 75–25% case, respectively. Using GA as the training algorithm yields larger improvements: the gains in RMSE, MAE, R2, and NSE are 1.9, 2.4, 2.4, and 5.9% for the 50–50% split case, 2.6, 7.8, 1.9, and 1% for the 60–40% case, and 4.1, 3.9, 1.8, and 0.7% for the 75–25% case in the testing stage, respectively. In the 50–50% split scenario, the improvements in RMSE, MAE, R2, and NSE of the standard ANN model in predicting monthly streamflow are 3.3, 4.4, 3.2, and 6.3% by applying ANN-GWO; 4.1, 6.5, 5, and 8.4% using ANN-AO; 4.8, 7.1, 5.9, and 10.8% by employing ANN-RUN; and 5.3, 8.3, 7.8, and 12.4% by utilizing ANN-RUNAO, respectively. The corresponding improvements in the test stage of the 60–40% split scenario are 3.3, 8.5, 2.8, and 3.3% for ANN-GWO; 4.0, 9.4, 3.4, and 3.9% for ANN-AO; 4.6, 10, 4.1, and 4.6% for ANN-RUN; and 6, 11.2, 4.4, and 4.9% for ANN-RUNAO, whereas in the 75–25% case the RMSE, MAE, R2, and NSE of the standard ANN model were improved by 8.7, 5.3, 2.6, and 2.8%; 9.8, 5.7, 3.1, and 3.1%; 10.6, 6.7, 3.5, and 3.6%; and 11.6, 7.9, 5.5, and 5.5% employing the ANN-GWO, ANN-AO, ANN-RUN, and ANN-RUNAO models, respectively.

It is clear from Tables 2, 3, 4, 5, 6, 7, and 8 that adding precipitation and temperature inputs improves the prediction accuracy of all models in all train-test split cases. Here, we evaluate the best split case of 75–25%. By adding precipitation input, the accuracy of the ANN is improved by 3.0, 3.3, 2.1, and 2.4% with respect to RMSE, MAE, R2, and NSE in the test stage, respectively, while the corresponding improvements are 3.8, 3, 1.5, and 1.6% for the ANN-PSO; 1.1, 3.5, 1.1, and 0.1% for the ANN-GA; 2.6, 2.4, 0.9, and 1% for the ANN-GWO; 1.1, 2.2, 0.9, and 0.8% for the ANN-AO; 0.9, 1.3, 1.0, and 0.9% for the ANN-RUN; and 1.9, 2.2, 3.6, and 3.3% for the ANN-RUNAO models. The outcomes also indicate that temperature input is even more effective for streamflow prediction. By considering temperature input, the RMSE, MAE, R2, and NSE of the ANN are improved in the test stage by 5.5, 7.1, 3.2, and 3.6%, whereas the corresponding improvements are 5.6, 5.5, 2.1, and 2.2% for the ANN-PSO; 2.3, 5.6, 2.1, and 2.6% for the ANN-GA; 4.7, 3.8, 2.0, and 1.9% for the ANN-GWO; 3.9, 3.8, 2.4, and 2.1% for the ANN-AO; 4.7, 3.1, 2.7, and 2.6% for the ANN-RUN; and 5.4, 4.4, 5.2, and 5.1% for the ANN-RUNAO models, respectively.

The results of the best ANN-based models are compared using scatterplots in Fig. 7. The use of metaheuristic algorithms for ANN tuning clearly improves accuracy, and RUNAO is superior to the other alternatives in predicting monthly streamflow. The outcomes of the best models are further assessed in Figs. 8 and 9 with Taylor and violin charts, which jointly display several statistics and the data distributions. The figures show that ANN-RUNAO has the lowest RMSE and the highest correlation, and that its standard deviation is closer to the observed one than that of the other models.

Fig. 7 Scatterplots of the observed and predicted streamflow by different ANN-based models in the test period using the best input combination

Fig. 8 Taylor diagrams of the predicted streamflow by different ANN-based models in the test period using the best input combination

Fig. 9 Violin charts of the predicted streamflow by different ANN-based models in the test period using the best input combination

5.2 Discussion

The ANN-RUNAO model exhibited the highest accuracy, particularly in the 75–25% train-test split scenario, reducing RMSE by 28.7% compared to standard ANN models. This improvement is likely due to the enhanced ability of the Aquila optimizer (AO) and Runge–Kutta method to overcome local minima traps, a common limitation in conventional ANN models [12]. Previous studies have also highlighted the benefit of hybrid metaheuristic approaches in optimizing ANN performance, particularly in hydrological modeling [55]. Our findings are consistent with this trend, further supporting the use of advanced algorithms in improving predictive accuracy.

Our results align with prior studies that have demonstrated the superiority of hybrid ANN models in hydrological predictions. For example, Zounemat-Kermani et al. [54] found that hybrid ANN models integrating PSO and GA significantly improved streamflow prediction. Similarly, our ANN-RUNAO model outperformed traditional ANN and other hybrid models, further validating the efficacy of combining AO and Runge–Kutta for this prediction task. This is a notable improvement, as previous research has often cited the difficulty of predicting peak flows accurately [7].

Integrating metaheuristic algorithms such as PSO, GA, GWO, and AO into ANN models significantly enhances predictive accuracy because these algorithms are particularly effective at navigating complex, high-dimensional search spaces [2]. Including the Runge–Kutta method further stabilizes the learning process, ensuring that the model is less prone to overfitting on the training data while maintaining high performance in the testing phase. This combination is especially critical in hydrological applications, where the data often exhibit nonlinear patterns that are difficult to capture with traditional methods.

The 75–25% train-test data split consistently yielded the best prediction accuracy in our study. This outcome can be attributed to a larger training dataset, allowing the model to better learn the underlying patterns in streamflow data, particularly for more complex, nonlinear systems. This result is in agreement with the findings of Kisi [25], who emphasized that increasing the size of the training dataset improves the robustness and generalizability of ANN-based models. In addition, a larger training dataset helps mitigate the impact of noise and outliers, leading to a more reliable model calibration.

While the ANN-RUNAO model demonstrated high accuracy in streamflow prediction, it should be noted that it was only tested in the Jialing River Basin. As such, its applicability to other basins with different hydrological characteristics still needs to be tested. Future research should explore the transferability of this model across various geographical regions and climatic conditions to ensure its robustness in diverse environments. In addition, while the metaheuristic algorithms used in this study significantly improved prediction accuracy, their computational cost remains relatively high. Optimizing these algorithms for real-time applications should be a focus of future work.

The results of this study have important practical implications for water resource management, particularly in regions prone to flooding. The superior accuracy of the ANN-RUNAO model in predicting streamflow, especially during peak flow events, makes it a valuable tool for flood forecasting and reservoir management. By improving the accuracy of flow predictions, decision-makers can better allocate resources and mitigate the impacts of extreme weather events, thereby enhancing the resilience of water infrastructure.

6 Conclusions and Recommendations

Integrating metaheuristic algorithms, including PSO, GA, GWO, AO, and RUN, significantly enhances streamflow prediction accuracy compared to the standard ANN method. Among the hybrid models, ANN-RUNAO demonstrated the highest accuracy, exhibiting superior performance in RMSE, MAE, R2, and NSE. The study confirms a positive relationship between training dataset size and model accuracy: models consistently benefited from larger training sets, enhancing predictive efficiency across different scenarios. The 75–25% train-test data splitting scenario consistently yielded the best accuracy in streamflow prediction, emphasizing the importance of a substantial dataset for accurate calibration in data-driven models. Despite the promising results, the study has certain limitations. First, the analysis was conducted using data from a single basin (the Jialing River Basin), which may limit the generalizability of the findings to other basins or regions with different hydrological conditions. In addition, while the hybrid models showed superior performance, the computational cost associated with these metaheuristic algorithms can be substantial, particularly for larger datasets or real-time applications. Furthermore, the selection of input variables, though carefully considered, might still affect the accuracy of the models if they are applied to more complex or dynamic systems where additional hydrological variables are relevant.

Future research should focus on testing the developed hybrid models in a broader range of geographical areas and under different climatic conditions to evaluate their robustness and transferability. Further studies could also explore optimizing computational efficiency through parallel computing or simplified versions of metaheuristic algorithms to reduce processing time while maintaining model accuracy. In addition, investigating the impact of including more diverse input variables (e.g., soil moisture and evaporation rates) could enhance the performance of these models in more complex hydrological environments. Finally, future work should consider real-time data assimilation and model updating to further improve these models’ applicability for practical water resource management.