Introduction

Accurate mapping of soil plays a crucial role in environmental management, serving as a key element in understanding and addressing environmental challenges. This process provides valuable information for soil biodiversity preservation by identifying areas vulnerable to soil degradation processes and planning for sustainable management (Pereira et al. 2017; Rutgers et al. 2019). Additionally, it proves beneficial in agriculture by enabling precision practices, reducing fertilizer applications, and thereby minimizing pollution (Huang and Hartemink 2020). In contaminated site detection, soil mapping facilitates rapid responses, while in natural risk mitigation, it offers essential data to ensure community safety (Shi et al. 2018). Its versatility is evident in identifying environmental changes and planning infrastructure to minimize impacts. Therefore, this tool significantly contributes to well-informed decisions, driving effective and sustainable strategies in environmental management.

In this context, detailed soil mapping emerges as an essential need in the environmental management in regions like Extremadura, Spain. Where a study by Lavado Contador et al. (2009) revealed that 67% of the region's surface is at critical risk, with 29% being fragile areas prone to degradation processes. These risks are often associated with intensive agricultural practices and livestock activities, highlighting significant environmental challenges. Addressing these challenges necessitates precise soil mapping as a fundamental resource. In agriculture, notable soil erosion rates have been linked to activities such as viticulture and chestnut cultivation (Barrena-González et al. 2020; Rodrigo-Comino et al. 2019). On the other hand, numerous studies have emphasized significant soil degradation processes linked to livestock activity (Alfonso-Torreño et al. 2021; Pulido et al. 2017; Rubio-Delgado et al. 2017; Schnabel et al. 2009).

On this basis, precise soil mapping becomes an indispensable tool, especially considering its crucial importance for agriculture, environmental management, and land use planning. Traditional soil mapping practices, involving the analysis of manually collected soil samples and their representation through conventional interpolation techniques, whether deterministic or geostatistical, have proven to be slow, labor-intensive, and often subjective (Ghorbani et al. 2018; Omran 2016). However, in recent years, a relatively new approach involving the use of machine learning algorithms (MLAs) has emerged, often providing more accurate results and increasing labour efficiency (Behrens et al. 2018; Khaledian and Miller 2020).

MLAs are currently used to map soil properties or types similarly to other unrelated scientific fields (Wadoux et al. 2020). They have proven effective, widely used for mapping soil properties, capable of analysing large datasets and detecting patterns beyond human discovery or explanation by other techniques (Padarian et al. 2019; Singh et al. 2016). The use of MLA statistical models allows establishing relationships between soil properties and various environmental and topographic factors, such as climate, topography, land use, or vegetation (Khanal et al. 2018; Lamichhane et al. 2019). These relatively novel approaches enable a more comprehensive analysis of soil properties and a deeper understanding of their spatial behaviour.

Moreover, other studies have demonstrated that integrating environmental variables into MLAs enhances the accuracy of soil mapping (John et al. 2020; Kang et al. 2020). Various algorithms have been developed, many used to predict soil properties. For instance, Wang et al. (2019) compared five MLAs for predicting soil salinity, and Khaledian and Miller (2020) investigated the application of popular MLAs, including RF, SVM, Artificial Neural Networks, and K-Nearest Neighbors, for digital soil mapping. Wadoux (2019) described a convolutional neural network (CNN) for predicting various soil properties with quantified uncertainty. Other studies have also employed MLAs for mapping soil phosphorus concentration (Matos‐Moreira et al. 2017), soil thickness (Li et al. 2020), or crop yield prediction (Van Klompenburg et al. 2020).

In the quest for the most accurate and tailored method for soil properties using MLA, the current definition of the optimal technique has not been universally established. This choice inherently depends on the specific property under study, the quality of available data, and its suitability for the purpose (Wadoux et al. 2020). In this context, the recommendation emphasizes the meticulous prioritization of the most precise MLAs based on the peculiarities of each case study and area when searching and selecting methods.

In regions such as Extremadura, Spain, some studies have explored the spatial distribution of soil properties. However, some of these employ classical interpolation methods, which could result in overlooking nonlinear relationships with environmental variables (Barrena-González et al. 2022). On the other hand, those applying machine learning algorithms often focus on a reduced working scale, lacking a comprehensive representation of the spatial behaviour of soil properties at the regional level (Barrena-González et al. 2023).

Despite 30.6% of Extremadura's territory being under some form of natural protection, the region remains vulnerable. Nearly 48% of its total area (Fig. 1b) is subject to intensive grazing by more than 5.1 million animals (pigs, sheep, cows, and goats), leading, in many cases, to soil degradation processes and consequently, a decrease in pasture production (Pulido et al. 2018a, b). Therefore, the creation of accurate maps of soil properties would not only help identify areas with high potential for providing ecosystem services but also facilitate prioritizing these areas for restoration or conservation activities. Additionally, strategically using the revealed spatial patterns of soil properties could help identify areas suitable for the application of optimized management practices in terms of soil nutrients, tillage, or cover crops, thereby contributing to improving overall soil health.

Fig. 1
figure 1

Location map of Extremadura region (a), Clay and nitrogen soil samples and graze surface (b), Dystric Leptosol (c), Leptic Cambisol (d)

Given the lack of information on precise methods tailored to soil properties for many regions, the objectives of this study are: (1) to predict 9 soil properties at 3 depth intervals using 6 machine learning algorithms, (2) to assess the accuracy of each algorithm in predicting each soil property and depth interval, and (3) to identify the most important covariates in predicting each soil property.

Materials and Methods

Study Area and Data Collection

Extremadura is a region in southwestern Spain (Fig. 1b) with a diversity of ecosystems and natural resources (Lozano-Parra et al. 2023). The region covers 41,635 km2, which is 8.4% of the area of Spain. According to the Köppen–Geiger (Peel et al. 2007) climate classification the predominant climate in Extremadura is Mediterranean continental (Csa), tempered by Atlantic maritime air masses. It is characterized by mild winters and very hot summers when temperatures can reach over 40 °C. However, the climatic characteristics of the region vary from north to south and are influenced by the presence of mountain ranges. Average annual rainfall is 600 mm, exceeding 1500 mm in the mountainous north and barely reaching 400 mm in the central and southern areas. Average temperatures range between 16–17 °C in most of the region, being lower in the north (13 °C) and higher in the south (18 °C). The average altitude is 600 m and varies greatly from north to south and from west to east. The highest altitude of 2404 m is in the northern mountains of the region, and the lowest (150 m) in the valley of the Guadiana River. In terms of lithological characteristics, the parent material for the soils consists of slates and granite, which do not allow the development of deep soils, with Cambisols and Leptosols (Fig. 1c, d) being the most common soil types (IUSS Working Group WRB 2022).

The soil samples used in this study were obtained from the Extremadura Soil Catalogue (https://www.eweb.unex.es/eweb/edafo/CatSuelos.html), from the database of edaphological properties of Spanish soils belonging to Centre for Energy, Environment and Technology Research (CIEMAT) and from several research projects developed at the University of Extremadura by the GeoEnvironmental Research Group (GIGA). The information was obtained from soil profiles described by these three sources of information. To ensure the representativeness of the data, samples were selected that correspond to the three depths of interest for the study, namely 0–5, 5–10 and > 10 cm. It is considered important to distinguish between soil depths when assessing soil properties through modelling techniques in order to properly assess the impact of management practices on soils (Thomas et al. 2017). At the upper depth of 0–5 cm, the most important changes in soil properties occur due to tillage, N fertility or the influence of crop rotation (McVay et al. 2006), while the depth interval of 5–10 cm is generally used to assess the effects of grazing and other land use practices on soil properties that are not influenced by tillage or management practices that mainly affect the top soil layer (Pulido et al. 2018a, b). The depth interval beyond 10 cm is used to assess the spatial patterns of soil properties in a layer where different processes, including carbon allocation, water storage, nutrient availability, and vegetation rooting, exhibit different behaviors compared to the upper soil layers. Due to the spatial variability of soil depth and the number of properties being modelled, the number of samples available varies by property and depth interval (e.g., 433 soil samples for clay at 0–5 cm and 200 soil samples for nitrogen at 5–10 cm).

Environmental Covariates

Topography is one of the most important factors shaping soil (McBratney et al. 2003). This is particularly important in Mediterranean environments where water-related soil chemical processes are not as influential as in tropical environments due to low rainfall. To account for the influence of topography on the models developed in this study, various topographic features were obtained from a digital elevation model (DEM) downloaded from the Spanish National Geographic Information Centre, with a spatial resolution of 30 × 30 m. Fifteen geomorphological indices were therefore calculated using the software SAGA (Gerstoft 1997) (Table 1). Vegetation is another important factor in soil formation, so three vegetation indices were calculated from Landsat imagery and integrated into the models. The Google Earth Engine platform was used to calculate the mean value of the Normalized Difference Vegetation Index (NDVI), the Soil Adjusted Vegetation Index (SAVI) and the Enhanced Vegetation Index (EVI) for the last 7 years (2016/2023). In addition, climate variables such as rainfall (mm), mean temperature (°C) and mean solar radiation (kJ/m2) were used according to the maps of Ninyerola et al. (2005).

Table 1 Environmental variables used for machine learning modelling

Machine Learning Algorithms

From the large number of available algorithms, six MLAs—Random Forest (RF), Support Vector Machine (SVM), Multiple Adaptive Regression Splines (MARS), Cubist, Gradient Boosting Machine (GBM), and Neural Networks (NN)—were chosen for this study based on the analysis conducted by Khaledian and Miller (2020). Their study identifies some of the most commonly employed machine learning algorithms for soil property mapping. These algorithms were applied, and the models were run using the “caret” package (Kuhn 2008) installed in the software Rstudio (RStudio Team 2020).

Random forest is a popular MLA commonly used in environmental studies, both for classification and regression problems (Grimm et al. 2008; Vaysse and Lagacherie 2015; Zeraatpisheh et al. 2020). In this case, the Ranger method was used, which consists of an ensemble learning method that generates multiple decision trees and combines their predictions by majority voting. The algorithm randomly selects a subset of features and samples to build each decision tree, which helps to reduce overfitting and improve generalization performance (Kulkarni and Sinha 2012).

SVM is a powerful MLA that is also used for classification and regression analysis (Roy and Chakraborty 2023) and has been used in soil mapping studies (Lamichhane et al. 2019; Taghizadeh-Mehrjardi et al. 2019; Wang et al. 2021). It works by finding a hyperplane or line that separates the data into different classes or groups. The goal of SVM is to find the hyperplane that maximizes the distance between the different classes or groups. SVM is considered a promising algorithm for soil mapping with remote sensing data, especially in cases where there are nonlinear relationships between soil properties and remote sensing variables (Asgari et al. 2020; Bachri et al. 2019).

MARS is mainly used for regression analysis in various soil mapping studies (Jeihouni et al. 2020; Mohamed et al. 2018). It is a non-parametric algorithm that works by building a piecewise linear model that best fits the training data. By dividing the data into segments based on the values of the features, it fits a linear regression model to each segment so that MARS can handle datasets with many features and missing values. The algorithm can then adjust the number and location of segments to minimize the prediction error (Friedman 1991).

Cubist is an MLA that has also been successfully applied in various fields, including digital soil mapping (Pouladi et al. 2019; Suleymanov et al. 2023). Cubist is an advanced decision tree-based algorithm that combines the power of decision trees and rule-based modelling to make accurate predictions. In soil mapping, Cubist uses a set of rules generated by recursively partitioning the feature space into smaller subspaces. Each subspace is then modelled using a decision tree, and the final model is an ensemble of these decision trees (Quinlan 1992, 1993).

GBM is an MLA commonly used in digital soil mapping (Estévez et al. 2022; Meier et al. 2018). Like RF or Cubist, GBM is a type of ensemble learning method that combines multiple decision trees to make accurate predictions. It works by training a series of decision trees, each designed to correct the errors of the previous tree (Ayyadevara 2018). This iterative process allows the algorithm to learn complex patterns and relationships in the data that may not be apparent in simpler models.

NNs are a type of MLA increasingly used in soil mapping (Arruda et al. 2016; Behrens et al. 2005; Coelho et al. 2021). These networks simulate the structure and function of the human brain and are capable of learning complex patterns and relationships in data. In this study, the Multilayer Perceptron Network with Dropout was used, which is one of the most common in soil mapping research (Bodaghabadi et al. 2015; Xu et al. 2019). The MLP with dropout consists of several layers of neurons, including an input layer, one or more hidden layers and an output layer. The dropout layer is inserted between the hidden layers and works by randomly taking out a certain percentage of neurons during the training phase. This means that a different set of neurons is used each time the network is trained, preventing overfitting (Wei et al. 2021).

The parameter tuning process in the models was carried out through a random selection of combinations for each machine learning algorithm used in this study (Table 2). The choice of this random strategy is based on the exhaustive exploration of the hyperparameter space, allowing for the evaluation of a wide range of possible configurations. Subsequently, the parameter combination that yielded the most optimal results in terms of accuracy and predictive performance was selected. This random selection methodology was implemented with the aim of capturing the diversity of the search space and ensuring that various configurations are explored to achieve the best results in predicting soil properties.

Table 2 Methods and hyperparameters configuration overview used in each model

Model Evaluation and Statistical Analysis

In this study, the dataset underwent preprocessing steps to ensure its quality and suitability for modeling. Outlier analysis was conducted using spatial analysis techniques such as the Moran's I or Local Moran's I to identify and address any anomalous data points that could potentially distort the results. Additionally, the dataset was randomly divided into 80% of cases for training and 20% for testing the models. Tenfold validation was employed to assess the performance of the machine learning algorithms (MLAs). This validation method involves iteratively splitting the dataset into ten equal parts, using nine parts for training the model and one part for testing, and then averaging the results across iterations to obtain a final performance metric. Tenfold validation is preferred due to its ability to provide a robust estimate of the model's performance while mitigating the variance of the results (Wadoux et al. 2021). Furthermore, it helps guard against overfitting by ensuring that the model's performance is evaluated on unseen data, thus promoting better generalization to Cui et al. (2008) and Kohavi (1995). It also ensures that the model is not overfitted to the training data, which can lead to poor generalization with new data.

To determine the performance of the six models used to represent soil properties in each depth interval, the root mean square error (RMSE) and coefficient of determination (R2) were considered. The model with the lowest RMSE and the highest R2 values for external validation of the models was determined to be the most accurate in each case.

$$RMSE = \sqrt {{\sum {\frac{e_i i^2 }{n},} }}$$
(1)

where ei is the difference between the predicted and observed values, and n is the number of observations.

$$R2 = \frac{{\Sigma_{\text{i}} \left( {y_{\text{i}} - {\overline{\text{y}}} - {\hat{\text{y}}}_{\text{i}} } \right)^2 }}{{\Sigma_{\text{i}} \left( {y_{\text{i}} - {\overline{\text{y}}}} \right)^2 }},$$
(2)

where Σi denotes the sum over all i observations in the dataset, yi is the ith observed value of the dependent variable, ȳ is the mean value of the dependent variable and ŷi is the predicted value of the dependent variable for the ith observation.

The analysis of relative RMSE (i.e. RMSE%) was used to examine how the most accurate methods for each property vary depending on the number of available samples. This analysis allowed for an equitable comparison between different soil properties, as RMSE% provides a relative measure of model accuracy relative to the scale of variation of each property. Thus, a more comprehensive and fair evaluation of model effectiveness in predicting various soil characteristics based on sample data availability could be conducted. Therefore, the RMSE% was calculated as:

$$RMSE\% = \frac{RMSE}{{\overline{x}}} \times 100,$$
(3)

where \({\overline{\text{x}}}\) is the mean of the observed value.

Model Deployment and Mapping Soil Properties

Once the most accurate model has been identified, it is used to map the soil properties, considering the importance of the variables generated by that model. These variables, which are identified when the model is trained, capture the most important factors that influence soil properties, such as topography, vegetation, and climate. The "predict" function from the Caret package is used to apply these variables to the trained model and produce continuous maps of the various soil properties throughout the study area. These maps provide a detailed visual representation of the spatial distribution of soil properties and allow a deeper understanding of soil variability.

Results

Descriptive Statistics

The results from Table 3 reveal distinct patterns in soil properties concerning depth. Clay content, soil pH, and CEC show a tendency to increase in deeper layers, while the opposite trend is observed for the other properties. For instance, sand content and soil pH exhibit coefficient of variation (CV) values ranging from 25.18 to 32.84%, and from 14.99 to 16.91%, respectively. However, the remaining properties display higher variation, with CV values around 50% and above. Notably, properties like phosphorus (P) and potassium (K) present CV values exceeding 100%, indicating substantial variability in their concentrations. Additionally, it's observed that physical properties generally show less variation than chemical properties, except for nitrogen (N) content. Moreover, properties with fewer samples demonstrate higher coefficients of variation, while the opposite is observed for properties with more samples.

Table 3 Descriptive statistics of soil properties selected for analysis

Accuracy of the Models

Table 4 shows the performance metrics for the different models used in predicting various soil properties across the 3 depth intervals. For clay, it is observed that the Random Forest (RF) method has a higher coefficient of determination (R2) and a lower root mean square error (RMSE) across all depths, particularly standing out in the 0–5 cm layer with an R2 of 0.45 and an RMSE of 5.17. Similarly, for silt content, RF also exhibits superior performance in most cases. For sand content, RF and Cubist show the lowest RMSE values for the shallowest layer (0–5 cm), while RF performs best in the other depth intervals. Regarding pH, RF and Cubist exhibit the lowest RMSE values in the 0–5 cm interval, while RF proves to be the most accurate method in the other depth intervals. For cation exchange capacity (CEC), Cubist shows the most solid performance in terms of RMSE for the 0–5 cm and 5–10 cm intervals, while RF is the best-performing method for the deepest interval (> 10 cm). For nutrient content (NPK), RF and SVM generally show good performance. However, Gradient Boosting Machine (GBM) offered the best results for nitrogen content, particularly in the deeper intervals. Finally, for soil organic matter (SOM) content, RF and GBM demonstrate lower RMSE values across various depths. Overall, the results indicate that the Random Forest method tends to be a reliable option for predicting various soil properties, particularly for the deeper intervals.

Table 4 Evaluation metrics for model calibration and validation

Sample Representativeness and Sensitivity of the Models

The RMSE% compared to the total number of samples was analyzed to assess the models' sensitivity (Fig. 2). Results indicate a pattern of an inverse relationship between the number of samples and the predicted error. As the number of samples used in model development increases, the RMSE% value tends to decrease relative to the observed mean value in each property, indicating that a larger dataset leads to greater accuracy in predicting soil properties.

Fig. 2
figure 2

Assessment of RMSE% compared to the mean observed value in each property

In this context, alongside evaluating the models' sensitivity to the number of samples used, an analysis was also conducted by varying the percentage of data used for model calibration and validation (Fig. 3). The analysis revealed that as the calibration data percentage decreases and the validation data proportion increases, the models' performance tends to deteriorate. Moreover, Fisher LSD post-hoc test was conducted to assess the significant differences among the different percentages of data used as input in the models. The results demonstrate significant differences between the various groups, except for between the datasets considering 30% and 40% of the data for calibration, and 30%, 40%, and 50% of the dataset for validation.

Fig. 3
figure 3

Variation of RMSE trend across calibration and validation sets from 90 to 10%. Box plot indicates the average and standard deviation, and the whiskers represent the range of the data. Significance letters denote differences between groups

Maps of Soil Properties and Model Covariates Importance

After examining the performances of the developed models, the most accurate ones were selected to produce prediction maps for each of the studied properties by depth interval. In addition, the significance of the environmental covariates in each of the cases studied was determined and analysed.

Soil Particle Size Distribution

Figure 4 displays the maps generated using the most accurate method to represent the spatial distribution of soil particle size. The predictive map of clay content in Extremadura exhibits higher values in the central and southern parts of the region, aligning with the “Tierra de Barros” area, known for its abundant clay content and extensive vineyards. In the case of clay content, vegetation indices such as EVI, SAVI, and NDVI demonstrated greater relevance in the topsoil depth interval, as depicted in Fig. 3. Also, the results showed that as the depth increases, the importance of precipitation surpasses that of vegetation indices in predicting clay content in the deeper soil layers.

Fig. 4
figure 4

Spatial distribution of clay, silt, and sand content (from top to bottom) in the three depth intervals (0–5 cm, 5–10 cm and > 10 cm from left to right) using the most accurate method

The distribution of silt content (Fig. 4) shows a similar pattern to the areas occupied by pasture and grassland, with concentrated higher values corresponding to soils of loamy-silt texture. Conversely, the lowest values for silt content are observed in river valleys. In these regions, the distribution map of sand content (Fig. 4) shows the highest values. In addition, the northern part of the region, characterized by a granite lithological domain, shows increased values for sand content. Looking at the importance of the variables, the morphometric index of valley depth proves to be the most influential factor in predicting silt and sand content, followed by precipitation and profile curvature in the case of sand content (Fig. 5).

Fig. 5
figure 5

Significance of environmental covariates for the spatial distribution of clay, silt, and sand content (from top to bottom) in three depth intervals (0–5 cm, 5–10 cm and > 10 cm from left to right)

Soil pH and Cation Exchange Capacity

Cubist and RF proved to be the most accurate methods for mapping the spatial distribution of soil pH and cation exchange capacity (CEC) (Fig. 6). The southern half of the region, characterised by Calcisols and intensive agricultural land, has the highest soil pH values. Conversely, high altitude regions dominated by forest formations have lower pH values, indicating the acidity of their soils. Furthermore, a decreasing trend of pH values with depth is observed, ranging from 7.10 in the uppermost interval to 2.64 in the deepest layer.

Fig. 6
figure 6

Spatial distribution of soil pH (top) and Cation Exchange Capacity (bottom) across the three depth intervals (from left to right)

Regarding the spatial distribution of the CEC, the highest values are concentrated near the "Tierra de Barros" region, which coincides with an increased clay content that facilitates nutrient exchange. On the other hand, higher areas and soils with lower nutrient content have lower CEC values.

Regarding the significance of the explanatory variables, precipitation, and vegetation indices (NDVI and EVI) showed a significant influence on the models and maps of pH and CEC in all depth intervals, as shown in Fig. 7.

Fig. 7
figure 7

Importance of environmental covariates for soil pH (top) and CEC (bottom) at three depth intervals (from left to right)

Nutrients and Soil Organic Matter

The maps depicting the spatial distribution of nutrients (NPK) and soil organic matter (SOM), generated by the most accurate model for each parameter, are presented in Fig. 8. Elevated regions characterized by forest presence exhibit high nitrogen content, with a gradual decrease observed from the central to the southern part of the region. Phosphorus content varies across different soil depth intervals, with the highest values observed in the south-eastern region in the shallowest interval, while in deeper intervals, concentrations peak in the northern part and areas dominated by forests. Discrepancies are notable in the potassium maps for the depth intervals 0–5 cm and > 10 cm compared to the deepest interval. Regarding soil organic matter, higher concentrations are found in moisture-rich areas with dense vegetation, resembling the nitrogen distribution pattern.

Fig. 8
figure 8

Spatial distribution maps of the content of nitrogen, phosphorus, potassium, and organic matter in the soil in Extremadura (each from top to bottom) over three depth intervals (from left to right)

The prediction of soil nutrient content and organic matter involves several influential covariates, including precipitation, altitude, solar radiation, and the morphometric index of valley depth, as depicted in Fig. 9. Precipitation emerges as the most critical variable for predicting nitrogen content, particularly in the first and 5–10 cm depth intervals, while solar radiation and RSP show significant influence in the deepest interval. Elevation proves crucial for mapping available phosphorus across all depth intervals, followed by solar radiation and vegetation indices. For soil potassium content, the morphometric index of valley depth stands out as the most influential variable, followed by altitude. Regarding soil organic matter content, precipitation and solar radiation dominate the models, with temperature and orientation also playing roles.

Fig. 9
figure 9

Significance of environmental covariates for soil nitrogen, phosphorus, potassium, and organic matter content (each from top to bottom) at three depth intervals (from left to right)

Discussion

This study is the first to employ machine learning algorithms to map different soil properties at various depth intervals in the Extremadura region, Spain. Available soil data were used to create maps of the spatial distribution of multiple soil properties using various machine learning methods, aiming to identify the most accurate approach in each case. The results of descriptive statistics (Table 2) shed light on the soil reality of the region. The findings reveal the prevalence of loamy or sandy loam textures associated with igneous and siliceous parent material, which cover approximately 90% of the region. Conversely, the dominance of clays is evident in the natural region known as "Tierra de Barros," where clay texture prevails (Martín et al. 2022). Additionally, it is evident how clay content, soil pH, and cation exchange capacity (CEC) tend to increase with depth, while other properties decrease. This discrepancy could partly be explained by the interaction between clay particle size and its relationship with CEC. This suggests that the nutrient retention capacity, as reflected by CEC values, may be higher in deeper intervals. However, it is important to note that soil organic matter decreases with depth, indicating additional complexity in soil dynamics. Although the quantity of soil organic matter decreases, the soil's ability to retain nutrients may increase due to the higher presence of clays in deeper layers.

The results revealed that data variability tends to be more pronounced in the deeper soil intervals. This trend may be attributed to the spatial heterogeneity of soil types, which vary in depth and therefore undergo different formation processes (Mulla and McBratney 2001). However, it is important to consider that the lower data availability in these intervals could also be contributing to this perception of increased variability. This data limitation could result in an incomplete representation of the true diversity of conditions in the deeper soil layers, highlighting the need for more comprehensive sampling and careful interpretation of the results. Additionally, it is observed that variability in the data is greater in chemical properties than in physical properties, such as soil texture. This difference suggests that, although there are no major disparities in the textural characteristics of the region's soils except in specific areas, other factors such as land use and biogeochemical processes could significantly influence the variability of available nutrient content.

Despite the lack of significant differences in the evaluation metrics of the various methods used, it could be suggested that RF followed by Cubist are recommended for mapping different soil properties in the Extremadura region with the available data. However, Barrena-González et al. (2022), although without significant differences, obtained better validation metrics for mapping the same soil properties using classical interpolation methods. This raises the question of the reliability of maps develop by machine learning (ML) methods. Although classical methods offer superior validation metrics, ML methods could potentially generate more reliable regional maps by considering dominant environmental characteristics.

Previous studies have shown that RF and Cubist are methods that perform well in mapping various soil properties (Table 5) (Fathololoumi et al. 2020; Kaya et al. 2022; Parsaie et al. 2021; Saidi et al. 2022; Suleymanov et al. 2023; Zeraatpisheh et al. 2019). This fact justifies why these methods are two of the most used in soil property mapping (Khaledian and Miller 2020). Additionally, RF demonstrated its good performance in mapping the deepest soil interval, highlighting its ability to manage nonlinear relationships in the deeper soil layers.

Table 5 Relation of most accurate prediction methods for each soil property in agreement with previous works

However, it is important to highlight that the generalizability of mapping models cannot be taken for granted. Each study context is unique, and factors such as experimental design and environmental characteristics can significantly influence model performance. Although RF and Cubist often deliver good performances overall, it is not possible to generalize a specific model with certain hyperparameters for all situations. Therefore, it is essential to test different models or adjust hyperparameters specifically for each case study if reliable soil management mapping is searched.

In this context, the results of relative RMSE concerning the actual average value revealed the dependency of model performance on the number of available soil samples, as noted by Tajik et al. (2020) in a previous study. Furthermore, this behavior is reinforced by the outcomes obtained when varying the percentages of data in the model's training and validation sets, as depicted in Fig. 3. As the size of the training dataset increases, a reduction in RMSE value is observed. Nguyen et al. (2021) demonstrated in a study that the 70/30 ratio for training and validation, respectively, was the most effective option. Decreasing the sample size below 20% for validation could lead to unreliable validation results, while reducing the training data could result in overfitting issues. However, it is also important to consider that a specific data split may influence the validation results (Wadoux et al. 2019). Therefore, it is recommended to employ different evaluation techniques depending on the presence of clustering in the data, as indicated by Wadoux et al. (2021) in their study.

Likewise, the ability of RF to maintain a strong performance was observed even when the number of samples was reduced, as evidenced in the deeper soil intervals for properties such as pH, CEC, or SOM. The performance of support vector machine (SVM) also stood out when the number of available samples decreased. However, it is crucial to interpret the improvement in the performance of this method with caution, as it may experience overfitting issues. Khaledian and Miller (2020) have demonstrated how the performance of SVM can be sensitive to the reduction in the number of samples, which could be reflected in validation metrics showing better results. Therefore, although SVM excels in terms of precision in smaller datasets, it is essential to consider the potential risk of overfitting when interpreting these results.

To enhance the performance of predictive models, this study employed various environmental variables. Previous research has shown how the inclusion of different climatic, topographic, or vegetation attributes enhances the performance of various methods (Fathololoumi et al. 2020; Lamichhane et al. 2019; Mahmoudabadi et al. 2017). On the other hand, employing techniques to identify the most influential variables in model prediction, thereby reducing data dimensionality and generating simpler models, can be advantageous. Brungard et al. (2015) demonstrated more accurate results by employing recursive feature elimination compared to manual variable selection by soil researchers. Similarly, Barrena-González et al. (2023) showed that selecting covariates using the Boruta algorithm can be an effective approach to data dimensionality reduction, resulting in accurate models.

However, little is known about the ideal ratio between the number of environmental covariates and sample size. Including a large number of covariates relative to the available observations can lead to overfitting issues. Poggio et al. (2013) indicated that having fewer than 15 observations per variable used in the model could be insufficient for generating accurate models. Therefore, further study analyzing this ratio is necessary to establish a general rule.

In predicting the spatial distribution of soil clay content, vegetation indices such as EVI, SAVI, and NDVI were found to be most significant. This finding differs from those of Barrena-González et al. (2023) in a study conducted in Extremadura, or other studies elsewhere (Zeraatpisheh et al. 2019), where topographic attributes were of greater importance in explaining the spatial distribution of clay. However, in this study, it could be inferred that these vegetation indices have identified areas with lower vegetation cover associated with bare soils, particularly notable in areas with high agricultural activity. This assertion could be supported by studies such as that of Gasmi et al. (2021), which found negative correlations between clay content and reflectance values, suggesting a plausible interpretation of the relationship between low vegetation index values and higher clay accumulation, and contrary.

Topographic attributes such as valley depth index and altitude stand out for their relevance in mapping the spatial distribution of silt and sand, providing a deeper understanding of sediment transport dynamics in the landscape (Gallant and Dowling 2003). The valley depth index offers valuable insights into terrain shape and morphology, critical aspects influencing the accumulation and redistribution of sedimentary materials. On the other hand, altitude can influence sediment deposition and erosion, as higher areas may experience different patterns of precipitation and erosion. Mello et al. (2022) corroborated the importance of topographic attributes linked to hydrological behavior in mapping sand and silt content, while Qu et al. (2024) demonstrated the relevance of other topographic indices for obtaining more accurate maps of sand content spatial distribution. These findings underscore the need to consider a variety of topographic attributes for precise and detailed prediction of sediment distribution in the soil.

The importance of variables such as precipitation, solar radiation, altitude, and vegetation indices in predicting soil chemical properties and nutrients at a regional scale in this study can be explained by their direct influence on the biogeochemical processes that regulate soil and vegetation dynamics on a large scale (Zepp et al. 2011; Zhao et al. 2019). These environmental factors, operating at regional scales, have a significant impact on the distribution and availability of nutrients in the soil. For instance, precipitation can vary widely across a region, affecting the quantity and rate of nutrient leaching, as well as soil erosion (Nielsen and Ball 2015; Qiu et al. 2016). Similarly, solar radiation and altitude influence soil temperature and moisture, thereby affecting biological activity, organic matter decomposition, and nutrient mineralization (Kumar et al. 2020; Sultanova et al. 2023; Yan et al. 2019).

Furthermore, vegetation indices, by reflecting vegetation health and biomass, can indicate primary productivity and carbon uptake, which has direct implications for soil quality and fertility (Kunkel et al. 2022). In comparison, topographic attributes, while important at the local level, may have a relatively minor influence on the variability of soil chemical properties at a regional scale (Liu et al. 2022; Mosleh et al. 2016). However, it is important to note that the spatial resolution of data, including pixel size, can also play a crucial role in the models' ability to capture and explain the spatial distribution processes of these properties (Hengl 2006). An appropriate pixel size can allow for a more accurate representation of landscape heterogeneities and better identification of relationships between environmental variables and soil properties at different spatial scales (Behrens et al. 2014; Brus et al. 2011).

The implications of the findings of this study result in an initial understanding of how soil properties it's distributed spatially at regional scale in Extremadura, Spain. This work also demonstrates that soils in the region are generally poor in key nutrients, with higher concentrations of organic matter and nitrogen in areas with dense forest cover, such as the northern part of the region or higher elevations. Overall, the soils tend to be acidic due to their geological nature, except in the basin of the Guadiana River and Tierra de Barros, coinciding with areas of higher agricultural intensity, which could result in higher pH and CEC values due to the incorporation of salts associated with fertilizer use.

With environmental covariates, maps have been generated providing a reliable approximation of these properties, which could be highly useful for regional-scale management activities. Previous studies have indicated that around 70% of the region faces a critical risk of soil degradation, stemming from either high livestock stocking rate arising from EU accession or processes associated with intensive agriculture (Contador et al. 2009). Therefore, the results of this study could serve as a monitoring tool to identify vulnerable areas requiring special attention, enabling a temporal analysis of the effectiveness of soil management practices.

The findings of this study could serve as useful tools to support decision-making across various domains. Similar studies have demonstrated how soil mapping and analysis of its properties can provide valuable insights for soil management and precision agriculture (López-Castañeda et al. 2022; Pereira et al. 2022). Additionally, it has been evidenced that monitoring the spatial distribution of soil properties following events such as forest fires or land use changes can be crucial for designing land reforestation and restoration strategies (Dindaroglu et al. 2021; Mousavinezhad et al. 2023). However, it is important to note that this study relied on regionally available data, which may be outdated and not reflect the most recent reality. Therefore, while our findings may offer valuable insights, caution is needed when interpreting and applying them in decision-making.

Therefore, it is suggested that future work ensures a balanced density of sampling points throughout the region, considering all land uses and harmonizing a database that addresses both spatial variability and depth variation. This would allow for more detailed analyses of how soil properties respond to different soil management practices. Additionally, leveraging information from environmental covariates could enable more comprehensive monitoring of changes in land use and evaluation of the effectiveness of implemented management strategies.

Regarding this, an accurate maps of soil properties spatial distribution can play a pivotal role for policymakers and managers in making informed decisions concerning land planning, environmental management, and agriculture. The findings derived from this study could be instrumental in formulating policies aimed at safeguarding strategic soil resources. In other words, considering the soil strategy set by the European Union for 2030, there are concerns in regions like Extremadura, where a large area of productive soils occupied for agriculture and livestock farming is at risk of being reassigned for renewable energy production. Despite the limited contribution of such installations to job creation and economic development in the region, photovoltaic energy has seen a 367% growth in Extremadura over recent years (2013–2020), impacting approximately 5000 ha (Díaz and Berrocal 2022). However, projections for future photovoltaic installations, as estimated by Barriga Bravo et al. (2021) for 2030, foresee the necessity of over 46,000 ha for the installation of 20,000 MW, raising further concerns. This scenario entails the abandonment of agricultural activity across thousands of hectares, a primary economic activity in the region. Hence, the identification, classification, and protection of strategic soils intended for agricultural activities warrant immediate attention to ensure food provisions for both inhabitants and livestock, and to prevent the utilization of fertile soils for activities that do not contribute to territorial development.

Conclusions

The study addressed the spatial distribution of various soil properties in the Extremadura region, Spain, employing six different machine learning models and diverse environmental variables. Overall, the RF model exhibited notable performance, particularly in predicting soil particle size (clay, silt, and sand), as well as estimating soil organic matter and other properties in deeper intervals. Additionally, the Cubist approach also showed promising results in soil property mapping. On the other hand, SVM proved to be the most accurate model when the available data was reduced, although its performance should be interpreted cautiously due to its susceptibility to overfitting.

It was observed that model performance decreased as the number of samples decreased, especially when the percentage of data for training was below 60%. Furthermore, climatic variables such as precipitation and solar radiation, followed by altitude, were found to predominate in mapping the spatial distribution of soil chemical properties and essential nutrients. In contrast, vegetation indices and other topographic indices were more relevant for mapping soil physical properties.

These findings highlight the importance of considering a variety of environmental variables when developing soil mapping models and underscore the need for careful interpretation of model results, especially under conditions of limited data availability. Additionally, the need for updated sampling, with an adequate number of sampling points, is emphasized to create more reliable and current maps reflecting the soil property distribution in the Extremadura region.

Therefore, future work is suggested to ensure a balanced density of sampling points across the region, considering all land uses and harmonizing a database addressing both spatial variability and variation in depth. This would enable more detailed analysis of how soil properties respond to different soil management practices. Furthermore, leveraging environmental covariate information could facilitate a more comprehensive monitoring of changes in land use and evaluation of implemented management strategies.