1. Introduction
Soil resistivity is a critical parameter influencing many geotechnical, hydrological, and agricultural processes. It serves as an indirect measure of various soil properties, including moisture content, matric suction, and temperature [1,2], among others. The electrical resistivity of soil adheres to Ohm’s law, with potential differences being measured through the transfer of artificially generated currents into the soil [3]. The electrical resistivity of soil reflects its ability to conduct electrical current, which is influenced by factors such as sub-surface soil moisture content and temperature [4], the degree of saturation [5], organic content [6], pore water composition [7], geologic formation [8], ion concentration in pore water, soil texture, and structure [9,10]. Therefore, understanding the data-driven interplay between soil resistivity and other soil hydrologic properties is crucial for applications like soil water management, landfill cover hydrology, groundwater monitoring, slope stability analysis, and precision agriculture [11,12].
Among the factors influencing soil resistivity, soil moisture and matric suction are particularly significant due to their direct impact on the soil’s electrical conductivity and pore water distribution. Soil moisture significantly affects soil resistivity, as water in the soil enhances ionic mobility, reducing resistivity [13,14]. Studies have demonstrated that resistivity decreases exponentially with increasing moisture content, particularly in fine-grained soils where pore water dominates the conduction process [15,16]. Soil resistivity measurements provide a non-invasive means to estimate soil moisture, making them invaluable in hydrological modeling for geotechnical infrastructures and irrigation scheduling [17,18]. Suction determines the distribution of water in soil pores, affecting the connectivity of water films that facilitate electrical conductivity [2,3]. High-suction conditions, indicative of dry soils, correspond to higher resistivity values due to the limited presence of conductive pore water [19,20]. Thus, integrating resistivity data with suction measurements offers deeper insights into soil–water dynamics and unsaturated soil behavior [21,22]. Soil temperature introduces additional variability in soil resistivity by altering the ionic mobility in pore water. Increased temperatures generally decrease resistivity as higher thermal energy enhances ion diffusion [23].
Earth infrastructures are exposed to the environment and, accordingly, the hydraulic properties of their soils change substantially due to natural processes such as wetting–drying cycles and freeze–thaw effects [24]. Recent studies have demonstrated the potential of sensor-based soil moisture and suction measurements of various infrastructures for the real-time monitoring of soil hydrologic conditions [4,8]. For instance, time domain reflectometry (TDR) [25], thermal dissipation sensors [26], dielectric sensors, and tensiometers have enabled the continuous monitoring of these parameters in the field, enhancing the resolution and accuracy of resistivity estimations [12,13]. The integration of such field data with advanced machine learning (ML) models further presents a transformative approach to soil resistivity characterization [27,28]. ML models have emerged as powerful tools for analyzing these datasets, offering a means to capture non-linear and multi-variable relationships [29,30]. By integrating resistivity data with soil moisture, suction, and temperature measurements, ML models can provide reliable predictions of soil behavior under varying environmental conditions [31,32]. ML models, including artificial neural networks (ANNs), support vector machines (SVMs), random forests (RFs), and gradient boosting, have shown promise in capturing complex relationships between soil properties and resistivity [16,31]. The usefulness of ML models is particularly evident in heterogeneous field conditions, where conventional methods often struggle to generalize [27].
Understanding the relationships between soil resistivity, moisture content, and matric suction has broad implications for fundamental soil science and applied engineering. Furthermore, integrating ML with field sensor data enables a scalable and adaptive framework for resistivity monitoring, providing insights at unprecedented temporal and spatial resolutions [33,34]; eventually, large-scale site characterization can be facilitated by resistivity. While significant progress has been made in applying ML models for geotechnical engineering purposes, challenges remain in optimizing these models for field-scale applications. Critical issues include the lack of adequate data and variability in data quality, sensor installation methods, and environmental factors such as temperature fluctuations and vegetation [35]. The lack of standardized datasets for training and validating ML models in soil resistivity characterization further complicates their application. Most existing studies rely on limited datasets, often collected under controlled laboratory conditions [36,37], which may not fully represent the complexities of natural soils. Thus, there is a growing need for field-based datasets that incorporate continuous moisture variabilities, suction ranges, and soil temperature conditions, along with robust methodologies for data preprocessing and model evaluation. In this study, we aimed to address these gaps by exploring different ML models for predicting the field resistivity of clayey soil using sensor-based moisture, suction, and temperature data. Specifically, we evaluated the performance of linear regression (LR), decision tree regressors (DTRs), RFs, SVMs, and ANNs in characterizing soil resistivity under variable field conditions. We also investigated the role of feature selection, data normalization, and model tuning in enhancing prediction accuracy and generalizability.
The study was conducted on evapotranspiration (ET) covers located in the City of Denton Landfill, Texas, USA, where six large-scale (1219 cm by 1219 cm) prototype ET covers were constructed using fine-grained soil (CH). Field hydrologic data were collected using state-of-the-art moisture and temperature sensors and tensiometers and calibrated to ensure reliability and consistency. Field electrical resistivity tests were conducted periodically using an advanced resistivity meter to produce electrical resistivity tomography (ERT) profiles, from which resistivity data were extracted. The datasets were then integrated into a structured ML pipeline, encompassing data preprocessing, model training, and performance evaluation using cross-validation.
2. Electrical Resistivity Tomography (ERT)
Electrical resistivity tomography (ERT) is a geophysical imaging technique used to characterize the subsurface by mapping variations in electrical resistivity. It has emerged as a powerful method for characterizing subsurface structures and processes due to its non-destructive nature and ability to provide spatially distributed data. This method provides valuable insights into geological, hydrogeological, and environmental systems and has applications in a wide range of disciplines, including geotechnical engineering, archeology, and environmental monitoring.
ERT involves injecting a controlled current into the ground using electrodes and measuring the resulting voltage differences. The basic measurement principle relies on Ohm’s Law [38]:
\[ V = IR \]
where V is the voltage, I is the current, and R is the resistance. The configuration of electrodes influences the depth of investigation and resolution. Common electrode arrays include Wenner, Schlumberger, and dipole-dipole configurations, each with its unique advantages in spatial resolution and depth sensitivity [3]. By systematically varying the positions of current and potential electrodes, ERT collects apparent resistivity data over a grid, which reflects the subsurface resistivity distribution [39].
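As an illustration of how a single voltage and current reading becomes an apparent resistivity, the sketch below assumes a Wenner array with equal electrode spacing a, for which the standard geometric factor is 2πa. The array actually used in this study is not stated in this section, and the function name and sample readings are hypothetical.

```python
import math


def wenner_apparent_resistivity(voltage_v, current_a, spacing_m):
    """Apparent resistivity (Ohm-m) for a Wenner array with equal spacing a.

    Assumption: Wenner geometry, so the geometric factor is K = 2 * pi * a.
    Other arrays (Schlumberger, dipole-dipole) use different factors.
    """
    if current_a == 0:
        raise ValueError("injected current must be non-zero")
    resistance_ohm = voltage_v / current_a        # Ohm's law: R = V / I
    geometric_factor = 2.0 * math.pi * spacing_m  # K for the Wenner array
    return geometric_factor * resistance_ohm


# Hypothetical reading: 0.5 V measured for 0.1 A injected at 1 m spacing.
rho_a = wenner_apparent_resistivity(0.5, 0.1, 1.0)
print(f"{rho_a:.2f} Ohm-m")  # prints 31.42 Ohm-m
```

The value returned is an apparent resistivity for one electrode position; a full ERT survey repeats this over many positions and spacings before inversion.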
The collected apparent resistivity data are processed to estimate the true resistivity distribution. This requires solving an inverse problem, where numerical algorithms convert surface measurements into a resistivity model of the subsurface. Techniques like least-squares inversion, with regularization to stabilize the solution, are commonly employed. However, the inversion process is inherently non-unique, meaning multiple subsurface resistivity distributions can explain the same dataset.
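One common way to write the regularized least-squares inversion described above is as the minimization of a data-misfit term plus a smoothness penalty. The notation here is an assumption for illustration and is not taken from the study: d is the vector of measured apparent resistivities, m is the model of subsurface resistivities, F is the forward operator, W_d is a data-weighting matrix, L is a roughness operator, and λ is the regularization weight.

```latex
\[
\Phi(\mathbf{m}) =
\left\| \mathbf{W}_d \left( \mathbf{d} - F(\mathbf{m}) \right) \right\|_2^2
+ \lambda \left\| \mathbf{L}\,\mathbf{m} \right\|_2^2
\]
```

A larger λ favors smoother models at the cost of data fit. Because different choices of λ and L yield different acceptable models, this form also makes the non-uniqueness noted above explicit.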
3. Related Work
Machine learning techniques for field resistivity characterization have gained significant attention due to their ability to integrate various geophysical measurements, such as resistivity, moisture content, and soil suction, thereby enhancing the accuracy and efficiency of characterizing soil behavior. Several studies focused on the relationship between these parameters and their implications in geotechnical and geo-environmental engineering. Recent methodologies have employed advanced geophysical techniques, including ERT, which can provide high-resolution spatial data on soil properties [
40]. Combined with ML algorithms, these methods allow for more robust resistivity predictions based on soil moisture and suction data.
The application of various ML algorithms in soil resistivity prediction has been a focal point in recent studies. Algorithms such as SVMs, RFs, and ANNs have been particularly effective in modeling complex non-linear relationships among the input variables: soil moisture, suction, and soil temperature [41,42]. For example, Driba et al., 2024 [43] demonstrated the potential of machine learning models to enhance the prediction of spatial variations in wetland soil properties, such as soil moisture content and soil organic matter (SOM). Because field data were limited, their research used synthetic data to train an eXtreme Gradient Boosting (XGBoost) algorithm for predicting soil property distributions based on geophysical measurements and soil samples. The study, conducted in Ohio, USA, analyzed correlations between electrical conductivity, soil moisture, and soil organic matter from 22 core samples, with the primary objective of geospatially mapping wetland soil properties. Ozcep et al., 2009 [44], utilized an artificial neural network approach to examine the relationship between soil moisture content and electrical resistivity. Laboratory testing generated 148 datasets for the study, achieving an R² value of 0.88, demonstrating the method’s effectiveness. However, the study was limited by a small dataset, minimal input variables, and constraints imposed by the controlled laboratory environment, which were not accounted for. In a more extensive study, Zamanian et al., 2024 [
45], employed large-scale observations in Texas, USA, to predict geotechnical properties using soil resistivity values. This research utilized a deep learning model with three hidden layers trained on 842 observations to investigate the association between electrical resistivity and geotechnical properties. Key geotechnical properties, such as moisture content and unit weight, were identified through Spearman’s correlation and feature importance analyses. However, the study excluded soil suction and noted that resistivity values were predominantly clustered around 100 Ohm-m. Moreover, the SVM has been highlighted for its capability to handle high-dimensional spaces and its effectiveness in achieving high accuracy rates in classification tasks related to soil types based on resistivity data [
41]. Studies have reported a model comparison, where RF outperformed both an SVM and ANN regarding robustness and accuracy when predicting soil resistivity using field data [
42,
46]. Applying hyperparameter tuning and cross-validation techniques within these ML models further enhances their predictive capabilities and generalization performance in varied soil conditions [
46].
5. Results
5.1. Scatter Plots of Different Variables
To gain a general understanding of how resistivity relates to the other variables, scatter plots were examined.
Figure 10a shows the scatter plot of resistivity and soil moisture content. The strong relationship between moisture content and soil resistivity is well documented and is also evident from the scatter plot. The graph demonstrates a clear inverse correlation, indicating that resistivity decreases as moisture content increases. At a higher moisture content (e.g., 0.3–0.4), resistivity is low, clustering between 10 and 30 Ohm-m. As moisture content decreases (e.g., <0.2), resistivity increases significantly, reaching values up to 90 Ohm-m. The points follow a distinct downward trend, forming a non-linear inverse relationship.
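The non-linear inverse trend described above can be summarized with a simple curve fit. The sketch below assumes an exponential form, rho = a * exp(-b * theta), in line with the exponential decrease mentioned in the Introduction, and fits it by ordinary least squares on ln(rho). The function name and the synthetic, noise-free data are hypothetical and are not the field measurements.

```python
import math


def fit_exponential(moisture, resistivity):
    """Least-squares fit of ln(rho) = ln(a) - b * theta; returns (a, b)."""
    n = len(moisture)
    if n < 2 or n != len(resistivity):
        raise ValueError("need two or more paired observations")
    log_rho = [math.log(r) for r in resistivity]
    mean_x = sum(moisture) / n
    mean_y = sum(log_rho) / n
    sxx = sum((x - mean_x) ** 2 for x in moisture)
    if sxx == 0:
        raise ValueError("moisture values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(moisture, log_rho))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data generated from a = 150 Ohm-m and b = 6 (illustrative values only).
theta = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
rho = [150.0 * math.exp(-6.0 * t) for t in theta]
a_hat, b_hat = fit_exponential(theta, rho)
print(round(a_hat, 3), round(b_hat, 3))
```

With real field data the fit would not be exact, and a power-law form could be compared in the same way by regressing ln(rho) on ln(theta).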
Figure 10b shows the scatter plot of resistivity and soil matric suction. The relationship is more scattered and shows no clear trend, indicating a weak or indirect correlation. At low resistivity values (<20 Ohm-m), suction varies widely, ranging from 0 to over 1500 kPa. At higher resistivity values (e.g., >60 or 70 Ohm-m), suction values are also widely scattered, with fewer points reaching higher suction levels. It is well understood that soil matric suction increases as moisture content decreases, which correlates with higher resistivity values. However, the scatter plot suggests that suction alone may not be a straightforward predictor of resistivity, as it is influenced by soil texture, compaction, and pore size distribution.
In
Figure 10c, the scatter plot of resistivity versus soil temperature is depicted from the field monitoring data. This graph shows a positive correlation between resistivity and soil temperature (°C), though the trend is less pronounced compared to moisture content. At lower temperatures (e.g., <10 °C), resistivity is generally lower, clustering below 40 Ohm-m. As temperature increases (e.g., >20 °C), resistivity values increase, with several points exceeding 80 Ohm-m. The relationship appears non-linear, with resistivity stabilizing at high temperatures. At higher temperatures, the mobility of ions in the pore water increases, typically leading to higher conductivity (lower resistivity). However, in drier soils, resistivity increases with temperature due to the evaporation of water and reduced ionic content, as seen in this plot (
Figure 10c). Therefore, the influence of temperature on resistivity is dependent on moisture availability. In soils with low moisture content, temperature effects may dominate, while in moist soils, the impact of temperature is less significant compared to the effects of moisture.
From the field-measured data, the clearest relationship with a well-defined inverse correlation was observed in resistivity versus soil moisture content. Resistivity versus suction exhibited significant variability, indicating that suction alone cannot explain resistivity changes. It can be best interpreted alongside moisture content and other variables. Resistivity versus soil temperature shows a positive trend but with more variability than moisture content. The observed trend reflects the combined influence of temperature and moisture availability. The variability in the scatter plots highlights the need for multi-parameter models to interpret resistivity to increase accuracy.
5.2. Correlation Among Variables
Correlation analysis, which may be performed using either the Pearson correlation or the Spearman rank-order correlation, evaluates the connection between two variables. Quantifying the strength of a linear connection between variables is a popular application of the Pearson technique, which is often used in the domains of science and engineering studies. In comparison, Spearman rank-order correlation is a nonparametric measure of the strength and direction of the association between two ranked variables. It is very effective, particularly when the variables do not meet the assumptions of linearity or normal distribution required by the Pearson correlation coefficient.
The two measures differ primarily in the kind of relationship they evaluate. The Pearson correlation measures the degree of linearity in a relationship, while the Spearman correlation measures whether a monotonic relationship exists. A linear relationship exists when changes in one variable are proportional to changes in the other. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Pearson correlation is calculated directly from the raw data, whereas the Spearman correlation requires the data to be converted into ranks before analysis.
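The difference between the two coefficients can be seen on a small example. The sketch below uses only the Python standard library; the helper names and the synthetic data (a monotonic but non-linear decay) are hypothetical and chosen purely to show that Spearman reaches −1 where Pearson does not.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient computed from raw values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Synthetic monotonic, non-linear data (not the field measurements).
moisture = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
resistivity = [150.0 * math.exp(-6.0 * m) for m in moisture]
print(round(pearson(moisture, resistivity), 3))
print(round(spearman(moisture, resistivity), 3))
```

Because the synthetic relationship is strictly decreasing, the rank correlation is exactly −1, while the Pearson value is weaker because the curve is not a straight line.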
Since the nature of the association was unclear before the investigation, both forms of correlation were studied. When there is a small number of data pairs, the correlation values need to be near 1 or −1 to reach statistical significance. On the other hand, when there are many data pairs, correlations that are close to zero may be deemed very significant. The correlation matrix provides insights into the relationships between different features in the dataset. In the heatmap or correlation matrix shown in
Figure 11, the correlation between moisture content, suction, temperature, and resistivity is visualized. Moisture content and resistivity have a strong negative correlation of −0.88, indicating that resistivity tends to decrease as moisture content increases. The plot suggests that moisture content is crucial in determining resistivity, which could be vital for predictive modeling.
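The dependence of statistical significance on sample size noted above can be made explicit. Under the usual assumptions for testing a Pearson coefficient r against zero (paired observations, approximately bivariate normal), the test statistic for n data pairs is commonly written as follows; this formula is offered for illustration and is not stated in the study.

```latex
\[
t = r \sqrt{\frac{n - 2}{1 - r^{2}}}, \qquad \text{with } n - 2 \text{ degrees of freedom}
\]
```

For a fixed r, the statistic grows roughly with the square root of n, which is why a weak correlation can still be statistically significant in a large dataset.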
The scatter plot between suction and resistivity indicates a weaker relationship. The data points are widely scattered (Figure 10b) with no clear trend, reflecting a weaker correlation of 0.34. This suggests that while suction may influence resistivity, it is less significant compared to other variables like moisture content. The scatter plot of temperature versus resistivity shows a positive trend, where higher temperatures are associated with increased resistivity values. This observation is consistent with a moderate correlation of 0.41 between the two variables. The relationship between temperature and resistivity indicates that temperature is a factor that can affect resistivity, though not as strongly as moisture content.
5.3. Evaluation of Machine Learning (ML) Models
The performance of each ML model was evaluated, and the results were compared based on R², RMSE, bias, and ubRMSE. The models demonstrated varying levels of performance, depending on the complexity of the relationships in the data. During the training process, no overfitting or underfitting was observed. This was achieved through proper hyperparameter tuning and the use of cross-validation, ensuring that each model generalized well to unseen data. The details of each model are described in the following sections. As the data were normalized before modeling, the actual and predicted values range from 0 to 1.
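A minimal sketch of the two preprocessing ideas mentioned above, min-max normalization to the 0 to 1 range and k-fold splitting for cross-validation, is given below. The function names and sample values are hypothetical. Fitting the scaler on the training folds only is an assumption made here to avoid information leakage, and the study may instead have normalized the full dataset once.

```python
def min_max_fit(values):
    """Return the (min, max) pair used for min-max scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant feature cannot be min-max scaled")
    return lo, hi


def min_max_apply(values, lo, hi):
    """Scale values so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds of near-equal size."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical resistivity readings in Ohm-m (not the field data).
resistivity = [12.0, 18.5, 25.0, 40.0, 62.0, 90.0]
for train_idx, test_idx in k_fold_indices(len(resistivity), 3):
    lo, hi = min_max_fit([resistivity[i] for i in train_idx])
    scaled_test = min_max_apply([resistivity[i] for i in test_idx], lo, hi)
    print(test_idx, [round(v, 3) for v in scaled_test])
```

With a training-only scaler, held-out values can fall slightly outside 0 to 1. The folds here are contiguous for readability; in practice the samples would usually be shuffled first, or split by date for time-ordered field data.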
5.3.1. Linear Regression
Linear regression provided a baseline model, capturing linear relationships, but it was limited in handling complex patterns in the data. The LR performed reasonably well, with an R² value of 0.7397 and an RMSE of 0.14913. However, it struggled with non-linear relationships, resulting in moderate errors. The plot (Figure 12) shows a considerable number of data points deviating from the 45° diagonal (dashed line), indicating that while this method works well for simple relationships, it fails to capture more complex patterns. This can further be understood from the estimated bias. The LR model yielded the largest bias magnitude (−0.0287) among the ML models, indicating that the LR model assumed linearity even where the data were non-linear. The scatter plots of resistivity against temperature, suction, and moisture show that only moisture has an interpretable relationship with resistivity. In contrast, no clear trend was observed for suction and temperature; LR is unable to capture this complex non-linear behavior, which explains the lower R² value.
5.3.2. Decision Tree Regressor
After the linear regression, this study applied a decision tree regressor. While it captured non-linear interactions, the performance was limited due to overfitting and a lower generalization capability than ensemble models. The DTR struggled to capture the complexity of the data, with an R² value of 0.5634 and a higher RMSE of 0.19356. In addition, the DTR model had the largest ubRMSE (0.1935), suggesting that it struggled with random errors. The scatter plot in Figure 13 shows significant variance around the 45° diagonal line, indicating that the model’s predictions deviate substantially from the actual values. The low accuracy can be attributed to overfitting and the small dataset. As there were fewer than 300 data points, the DTR tended to over-split the data, creating a model that did not generalize well. Overfitting occurs because the tree creates highly specific splits that align closely with the noise in the training data rather than capturing the true underlying patterns. The data used in the study were from all seasons (spring, summer, and winter). As the behavior of soil also depends on seasonal variation and vegetation, sub-grouping of the data by season is required. If the dataset has an uneven distribution of target values or outliers, a decision tree may become biased toward the densely sampled range or overreact to the outliers. A larger dataset with seasonal sub-groups might increase accuracy under this modeling technique.
5.3.3. Random Forest
The random forest technique was adopted after the decision tree. Because the decision tree overfitted the data, RF has the potential to improve accuracy, as it combines multiple decision trees. It achieved stronger predictive performance by averaging multiple decision trees, reducing overfitting, and increasing generalization power. The RF demonstrated relatively robust performance, with an R² of 0.7351 and an RMSE of 0.15007. Regarding the ubRMSE and bias, the RF model performed similarly to the LR model but with a slightly lower random error (ubRMSE = 0.146347). A closer look at the plot presented in Figure 14 reveals that the values are closer to the 45° line at the lower and upper range, while in the middle, they are more scattered. This can be attributed to the fact that RF combines predictions from many decision trees, and each tree captures different aspects of the data through random sampling of the data and features. In addition to reducing the likelihood of overfitting, this averaging may also smooth out complex patterns, resulting in improved predictions at the extreme ends of the range (low and high values), where the data points are either more distinct or fewer.
As mentioned before, the dataset was not sub-grouped under seasonal variation. Based on previous studies [
47], resistivity values capture more variation in the drier months compared to the wetter months. During the wetter months, resistivity values tend to cluster despite variations in temperature and suction levels. In other words, in the summer months, slight temporal variation can bring changes in the resistivity values, whereas in the winter months, it takes considerable variation in the ambiance to alter the resistivity of soil. As such, in the middle range, where there is often more data density, the RF model might struggle to capture subtle variations because of averaging effects.
5.3.4. Support Vector Machine
The support vector machine technique was adopted after RF modeling. SVMs are known for robust performance, especially in classification tasks and with smaller datasets, where they manage intricate decision boundaries effectively; here, the SVM was applied as a regressor. Figure 15 shows the actual and model-predicted values. The SVM model delivered one of the best performances, with an R² of 0.7698 and a lower RMSE (0.14023) than the LR, DTR, and RF models. In addition, the SVM showed a balance of low bias (−0.00135) and relatively low ubRMSE (0.140226), making it one of the better-performing models.
The SVM aims to find the optimal boundary that separates the data into different regions, maximizing the margin between different classes or between predicted and actual values in regression tasks. In addition, unlike the RF, which averages predictions and may struggle in densely populated regions, the SVM can effectively balance the influence of different regions by focusing on maximizing the margin. However, the R² below 0.8 can be attributed to the absence of a clear trend of resistivity with temperature and suction. Even though resistivity varied with these two variables, there is no clear trend, unlike with moisture. As such, even this more advanced modeling technique struggles to improve model accuracy considerably.
5.3.5. Artificial Neural Network
Finally, the artificial neural network model was applied to soil resistivity prediction. It delivered a predictive accuracy similar to that of the SVM. The ANN model demonstrated strong predictive performance, with an R² value of 0.7875 (about 2.3% higher than the SVM) and a low RMSE of 0.13505. The ANN also had the lowest ubRMSE (0.134976) and a small bias (−0.00449), indicating minimal systematic and random errors. The predicted values closely follow the 45° diagonal line, as seen in Figure 16, indicating a good fit between actual and predicted values. The ANN effectively captured non-linear relationships in the data, making it one of the top-performing models along with the SVM. The SVM tends to perform exceptionally well with smaller datasets, as it does not require as much data to generalize. In addition, the dataset has a small number of features (inputs), so the SVM and ANN may perform comparably because the need for extensive feature extraction (an ANN strength) is reduced.
6. Discussion
6.1. Relationship Between Soil Resistivity and Hydrologic Variables
The scatter plots in Figure 10 show a strong inverse correlation between moisture content and resistivity. Soil moisture content plays a critical role in electrical conductivity. Water contains dissolved ions such as salts that contribute to the soil’s ability to conduct electricity. Higher moisture content increases the connectivity of water-filled pores, forming continuous conductive pathways. When soil moisture increases, ionic concentration in the pore water becomes a dominant factor in reducing resistivity. Conversely, when the soil is drier, the resistivity is higher due to limited ion mobility in disconnected water films. The amount of water in the soil therefore directly controls how many conductive pathways are available, which makes the relationship between soil moisture and resistivity more deterministic and less affected by other factors, as shown in Figure 10a.
On the other hand, there is a scattered correlation between matric suction and resistivity (
Figure 10b), showing no clear trend. Matric suction is indirectly related to moisture content. As suction increases, moisture content decreases. Different soil types such as clay, silt, and sand have different moisture retention characteristics. Matric suction is influenced by the soil texture and pore structure, which adds complexity and variability. In this study, the test sections were constructed with CH-type soil, with almost 40% clay, 47% silt, and 13% sand. The variability in pore structure may have been influenced by these soil constituents, which suggests that matric suction might not accurately reflect the changes in soil resistivity. As a result, the relationship between suction and resistivity was not as consistent as that of moisture content. We anticipate that a different soil matrix in the field conditions may display a different scatter plot of resistivity versus suction.
Figure 10c shows a scattered distribution, with no strong correlation between resistivity and soil temperature like the suction scatter plot. In porous media, resistivity is more sensitive to changes in temperature and relative humidity compared to nonporous materials, which show stronger resistivity changes with temperature. This phenomenon is particularly appropriate for sandstone and marble, where temperature alone has a stronger correlation with resistivity. However, in field conditions, temperature changes are often accompanied by changes in moisture content induced by evapotranspiration, which tends to overshadow the effect of temperature. In addition, the field conditions may have high variability in ionic concentration and soil composition. This coupling complicates isolating the direct effect of temperature on resistivity in the field conditions, as reflected in
Figure 10c.
6.2. Evaluation of Model Performance Metrics
To evaluate the efficiency of the five machine learning models (LR, DTR, RF, SVM, and ANN), four performance metrics were used: R², RMSE, bias, and ubRMSE. The R² indicates the goodness-of-fit of the model and identifies how well the model captures variability in the data. An R² value of one indicates a perfect prediction. The RMSE captures the overall magnitude of prediction errors. Lower RMSE values indicate better model performance. Bias reveals systematic over- or underprediction. The ubRMSE is a variant of the RMSE that isolates the random errors in predictions by removing the systematic bias (the mean difference between predictions and actual values) from the calculation. A lower ubRMSE indicates that, once the bias is removed, the predictions scatter less around the actual values. Separating the bias and ubRMSE helps distinguish between systematic errors (captured by bias) and random errors (captured by ubRMSE). Using these four metrics together helps assess both the accuracy and reliability of the models. The following table (Table 2) lists four measurement metrics of the five different ML models:
The R² values indicated that the ANN (R² = 0.787) performed the best, followed closely by the SVM (R² = 0.770) and RF (R² = 0.736). The LR also demonstrated reasonable accuracy (R² = 0.740), especially given its simplicity. The DTR significantly underperformed, with the lowest R² value (0.561).
The RMSE and ubRMSE followed a trend consistent with R². The ANN model achieved the lowest RMSE (0.135), followed by the SVM (0.140) and LR (0.149). The DTR had the highest RMSE (0.194), demonstrating its poorer predictive accuracy. The ubRMSE values were also close to the RMSE for all the models, indicating that random errors dominated the overall error, with little systematic bias in the predictions. For instance, the ANN model’s ubRMSE (0.135) and RMSE (0.135) are almost identical (Table 2), suggesting minimal systematic errors in the prediction.
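The four metrics can be computed as in the sketch below, which assumes the common definitions: bias as the mean of (predicted − actual), R² as the coefficient of determination, and ubRMSE² = RMSE² − bias². The study does not spell out its exact formulas, so these definitions and the function name are assumptions. The last lines check the assumed ubRMSE relation against the ANN values quoted in the text.

```python
import math


def regression_metrics(actual, predicted):
    """Return R2, RMSE, bias, and ubRMSE for paired actual/predicted values."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    bias = sum(errors) / n                        # systematic error
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)                  # total error magnitude
    # Random error left after removing the bias; clipped at 0 for rounding.
    ubrmse = math.sqrt(max(rmse ** 2 - bias ** 2, 0.0))
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot                    # coefficient of determination
    return {"r2": r2, "rmse": rmse, "bias": bias, "ubrmse": ubrmse}


# Consistency check using the ANN values quoted in the text
# (RMSE = 0.13505, bias = -0.00449, reported ubRMSE = 0.134976).
ann_ubrmse = math.sqrt(0.13505 ** 2 - 0.00449 ** 2)
print(round(ann_ubrmse, 5))
```

Under this definition the quoted ANN RMSE and bias reproduce the reported ubRMSE to within rounding of the inputs, which supports the observation that random error dominates the total error.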
Bias values for all models were relatively low, demonstrating minimal systematic deviation from the true values in all the models. The DTR, SVM, and ANN exhibited small bias magnitudes (−0.00306, −0.00135, and −0.00449, respectively), while the LR model showed the largest bias (−0.02866). The negative bias values in all models indicate that, on average, the predictions slightly underestimate the actual values. This systematic underestimation could result from the characteristics of the dataset, such as an uneven distribution of target values. However, these bias values are small enough to have a limited impact compared to the random error component.
Despite the ANN’s superior performance metrics, the difference between its performance and that of LR is small. This is likely due to the limited size of the dataset. Smaller datasets reduce the opportunity for complex models like the ANN to demonstrate their full potential. The LR, being simpler, is less prone to overfitting and can perform competitively when the data do not strongly demand non-linear modeling. Additionally, small datasets can limit the ability of the ANN to adequately learn intricate patterns in the data, narrowing the performance gap with simpler models. This is reflected in the R² and RMSE values, where the ANN and LR are quite similar.
The ANN model performed best, achieving the highest R² and lowest RMSE, with minimal bias and random errors. However, the relatively small dataset constrained its advantage over simpler models like LR, which also achieved competitive results. The RF performed similarly to LR but had a slightly higher RMSE and bias. The SVM model emerged as another strong contender, with good predictive power and minimal error, balancing simplicity and accuracy. However, for large datasets, the ANN may be a more advantageous choice than the SVM, even if both models yield similar R² values. ANNs are designed to capture complex non-linear relationships within data and, with more data points, tend to improve their performance by refining the weights and biases in their layered structure through backpropagation. ANNs scale well with larger datasets because their multi-layer architecture allows them to approximate almost any function given sufficient training data. In contrast, the SVM often faces computational limitations with large datasets, as the algorithm involves quadratic optimization that becomes increasingly complex with more support vectors, leading to higher training times and memory requirements. Therefore, ANNs may offer greater scalability and adaptability in scenarios with abundant data, making them a potentially better choice as dataset size increases. The DTR, on the other hand, lagged significantly behind the other models, reflecting its limitations in this scenario. Given the limited dataset size, future work could explore augmenting the data or applying cross-validation to evaluate model performance better. Additionally, ensemble methods like boosting could be tested to improve the performance of tree-based models.