Looking for Optimal Maps of Soil Properties at the Regional Scale

Barrena-González, Jesús; Lavado Contador, Francisco; Repe, Blâz; Pulido Fernández, Manuel

doi:10.1007/s41742-024-00611-8

Research paper
Open access
Published: 27 May 2024

Volume 18, article number 60, (2024)
International Journal of Environmental Research

Abstract

Around 70% of the land surface in Extremadura, Spain, faces a critical risk of degradation processes, highlighting the necessity for regional-scale soil property mapping to monitor degradation trends. This study aimed to generate the most reliable soil property maps, employing the most accurate method for each case. To achieve this, six different machine learning (ML) techniques were tested to map nine soil properties across three depth intervals (0–5, 5–10 and > 10 cm), using 22 environmental covariates as model inputs. Results revealed that the Random Forest (RF) model exhibited the highest precision, followed by Cubist, while Support Vector Machine showed effectiveness with limited data availability. Moreover, the study highlighted the influence of sample size on model performance. Concerning environmental covariates, vegetation indices along with selected topographic indices proved optimal for explaining the spatial distribution of soil physical properties, whereas climatic variables emerged as crucial for mapping the spatial distribution of chemical properties and key nutrients at a regional scale. Despite providing an initial insight into the regional soil property distribution using ML, future work is warranted to ensure a robust, up-to-date, and equitable database for accurate monitoring of soil degradation processes arising from various land uses.

Highlights

Overall, the Random Forest algorithm was the most accurate in mapping soil properties in Extremadura.
Chemical properties and key nutrients exhibit more variability than soil physical properties.
The number of soil samples determines the performance of the methods used for soil property mapping.
Vegetation indices and topographic attributes emerge as the most relevant variables for mapping soil physical properties.
Climatic variables are more important in mapping chemical properties and key soil nutrients.

Introduction

Accurate mapping of soil plays a crucial role in environmental management, serving as a key element in understanding and addressing environmental challenges. This process provides valuable information for soil biodiversity preservation by identifying areas vulnerable to soil degradation processes and planning for sustainable management (Pereira et al. 2017; Rutgers et al. 2019). Additionally, it proves beneficial in agriculture by enabling precision practices, reducing fertilizer applications, and thereby minimizing pollution (Huang and Hartemink 2020). In contaminated site detection, soil mapping facilitates rapid responses, while in natural risk mitigation, it offers essential data to ensure community safety (Shi et al. 2018). Its versatility is evident in identifying environmental changes and planning infrastructure to minimize impacts. Therefore, this tool significantly contributes to well-informed decisions, driving effective and sustainable strategies in environmental management.

In this context, detailed soil mapping emerges as an essential need for environmental management in regions like Extremadura, Spain, where a study by Lavado Contador et al. (2009) revealed that 67% of the region's surface is at critical risk, with 29% being fragile areas prone to degradation processes. These risks are often associated with intensive agricultural practices and livestock activities, highlighting significant environmental challenges. Addressing these challenges necessitates precise soil mapping as a fundamental resource. In agriculture, notable soil erosion rates have been linked to activities such as viticulture and chestnut cultivation (Barrena-González et al. 2020; Rodrigo-Comino et al. 2019). On the other hand, numerous studies have emphasized significant soil degradation processes linked to livestock activity (Alfonso-Torreño et al. 2021; Pulido et al. 2017; Rubio-Delgado et al. 2017; Schnabel et al. 2009).

On this basis, precise soil mapping becomes an indispensable tool, especially considering its crucial importance for agriculture, environmental management, and land use planning. Traditional soil mapping practices, involving the analysis of manually collected soil samples and their representation through conventional interpolation techniques, whether deterministic or geostatistical, have proven to be slow, labor-intensive, and often subjective (Ghorbani et al. 2018; Omran 2016). However, in recent years, a relatively new approach involving the use of machine learning algorithms (MLAs) has emerged, often providing more accurate results and increasing labour efficiency (Behrens et al. 2018; Khaledian and Miller 2020).

MLAs are now used to map soil properties and soil types, as they are in many other scientific fields (Wadoux et al. 2020). They have proven effective and are widely used for mapping soil properties because they can analyse large datasets and detect patterns that humans or other techniques would miss (Padarian et al. 2019; Singh et al. 2016). MLA-based statistical models make it possible to establish relationships between soil properties and environmental factors such as climate, topography, land use, or vegetation (Khanal et al. 2018; Lamichhane et al. 2019). These relatively novel approaches enable a more comprehensive analysis of soil properties and a deeper understanding of their spatial behaviour.

Moreover, other studies have demonstrated that integrating environmental variables into MLAs enhances the accuracy of soil mapping (John et al. 2020; Kang et al. 2020). Various algorithms have been developed, many used to predict soil properties. For instance, Wang et al. (2019) compared five MLAs for predicting soil salinity, and Khaledian and Miller (2020) investigated the application of popular MLAs, including RF, SVM, Artificial Neural Networks, and K-Nearest Neighbors, for digital soil mapping. Wadoux (2019) described a convolutional neural network (CNN) for predicting various soil properties with quantified uncertainty. Other studies have also employed MLAs for mapping soil phosphorus concentration (Matos‐Moreira et al. 2017), soil thickness (Li et al. 2020), or crop yield prediction (Van Klompenburg et al. 2020).

No single MLA has yet been established as universally optimal for mapping soil properties. The choice inherently depends on the specific property under study, the quality of the available data, and its suitability for the purpose (Wadoux et al. 2020). It is therefore recommended to select the most accurate MLAs according to the particularities of each case study and area.

In regions such as Extremadura, Spain, some studies have explored the spatial distribution of soil properties. However, some of these employ classical interpolation methods, which could result in overlooking nonlinear relationships with environmental variables (Barrena-González et al. 2022). On the other hand, those applying machine learning algorithms often focus on a reduced working scale, lacking a comprehensive representation of the spatial behaviour of soil properties at the regional level (Barrena-González et al. 2023).

Despite 30.6% of Extremadura's territory being under some form of natural protection, the region remains vulnerable. Nearly 48% of its total area (Fig. 1b) is subject to intensive grazing by more than 5.1 million animals (pigs, sheep, cows, and goats), leading, in many cases, to soil degradation processes and consequently, a decrease in pasture production (Pulido et al. 2018a, b). Therefore, the creation of accurate maps of soil properties would not only help identify areas with high potential for providing ecosystem services but also facilitate prioritizing these areas for restoration or conservation activities. Additionally, strategically using the revealed spatial patterns of soil properties could help identify areas suitable for the application of optimized management practices in terms of soil nutrients, tillage, or cover crops, thereby contributing to improving overall soil health.

Given the lack of information on precise methods tailored to soil properties for many regions, the objectives of this study are: (1) to predict 9 soil properties at 3 depth intervals using 6 machine learning algorithms, (2) to assess the accuracy of each algorithm in predicting each soil property and depth interval, and (3) to identify the most important covariates in predicting each soil property.

Materials and Methods

Study Area and Data Collection

Extremadura is a region in southwestern Spain (Fig. 1b) with a diversity of ecosystems and natural resources (Lozano-Parra et al. 2023). The region covers 41,635 km², which is 8.4% of the area of Spain. According to the Köppen–Geiger (Peel et al. 2007) climate classification the predominant climate in Extremadura is Mediterranean continental (Csa), tempered by Atlantic maritime air masses. It is characterized by mild winters and very hot summers when temperatures can reach over 40 °C. However, the climatic characteristics of the region vary from north to south and are influenced by the presence of mountain ranges. Average annual rainfall is 600 mm, exceeding 1500 mm in the mountainous north and barely reaching 400 mm in the central and southern areas. Average temperatures range between 16–17 °C in most of the region, being lower in the north (13 °C) and higher in the south (18 °C). The average altitude is 600 m and varies greatly from north to south and from west to east. The highest altitude of 2404 m is in the northern mountains of the region, and the lowest (150 m) in the valley of the Guadiana River. In terms of lithological characteristics, the parent material for the soils consists of slates and granite, which do not allow the development of deep soils, with Cambisols and Leptosols (Fig. 1c, d) being the most common soil types (IUSS Working Group WRB 2022).

The soil samples used in this study were obtained from the Extremadura Soil Catalogue (https://www.eweb.unex.es/eweb/edafo/CatSuelos.html), from the database of edaphological properties of Spanish soils belonging to Centre for Energy, Environment and Technology Research (CIEMAT) and from several research projects developed at the University of Extremadura by the GeoEnvironmental Research Group (GIGA). The information was obtained from soil profiles described by these three sources of information. To ensure the representativeness of the data, samples were selected that correspond to the three depths of interest for the study, namely 0–5, 5–10 and > 10 cm. It is considered important to distinguish between soil depths when assessing soil properties through modelling techniques in order to properly assess the impact of management practices on soils (Thomas et al. 2017). At the upper depth of 0–5 cm, the most important changes in soil properties occur due to tillage, N fertility or the influence of crop rotation (McVay et al. 2006), while the depth interval of 5–10 cm is generally used to assess the effects of grazing and other land use practices on soil properties that are not influenced by tillage or management practices that mainly affect the top soil layer (Pulido et al. 2018a, b). The depth interval beyond 10 cm is used to assess the spatial patterns of soil properties in a layer where different processes, including carbon allocation, water storage, nutrient availability, and vegetation rooting, exhibit different behaviors compared to the upper soil layers. Due to the spatial variability of soil depth and the number of properties being modelled, the number of samples available varies by property and depth interval (e.g., 433 soil samples for clay at 0–5 cm and 200 soil samples for nitrogen at 5–10 cm).

Environmental Covariates

Topography is one of the most important factors shaping soil (McBratney et al. 2003). This is particularly important in Mediterranean environments where water-related soil chemical processes are not as influential as in tropical environments due to low rainfall. To account for the influence of topography on the models developed in this study, various topographic features were obtained from a digital elevation model (DEM) downloaded from the Spanish National Geographic Information Centre, with a spatial resolution of 30 × 30 m. Fifteen geomorphological indices were calculated from this DEM using the software SAGA (Gerstoft 1997) (Table 1). Vegetation is another important factor in soil formation, so three vegetation indices were calculated from Landsat imagery and integrated into the models. The Google Earth Engine platform was used to calculate the mean value of the Normalized Difference Vegetation Index (NDVI), the Soil Adjusted Vegetation Index (SAVI) and the Enhanced Vegetation Index (EVI) for the last 7 years (2016–2023). In addition, climate variables such as rainfall (mm), mean temperature (°C) and mean solar radiation (kJ/m²) were taken from the maps of Ninyerola et al. (2005).
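The three vegetation indices named above can be illustrated with a minimal sketch. The formulas and coefficients below are the commonly published ones (a soil adjustment factor of 0.5 for SAVI and the usual EVI coefficients); the article does not state which variants were used, so they are assumptions, and the reflectance values are hypothetical.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def savi(nir: float, red: float, soil_factor: float = 0.5) -> float:
    """Soil Adjusted Vegetation Index; soil_factor is the assumed adjustment L."""
    return (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor)


def evi(nir: float, red: float, blue: float) -> float:
    """Enhanced Vegetation Index with the commonly used coefficients (assumed)."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


# Hypothetical surface reflectances (0-1 scale) for a single vegetated pixel.
nir, red, blue = 0.40, 0.10, 0.05
print("NDVI:", round(ndvi(nir, red), 3))
print("SAVI:", round(savi(nir, red), 3))
print("EVI: ", round(evi(nir, red, blue), 3))
```

In practice these functions would be applied per pixel and averaged over the image collection for the study period; the per-pixel form is shown only to make the arithmetic explicit.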

Table 1 Environmental variables used for machine learning modelling

Machine Learning Algorithms

From the large number of available algorithms, six MLAs were chosen for this study based on the analysis conducted by Khaledian and Miller (2020): Random Forest (RF), Support Vector Machine (SVM), Multivariate Adaptive Regression Splines (MARS), Cubist, Gradient Boosting Machine (GBM), and Neural Networks (NN). Their study identifies some of the most commonly employed machine learning algorithms for soil property mapping. These algorithms were applied, and the models were run, using the “caret” package (Kuhn 2008) in the software RStudio (RStudio Team 2020).

Random forest is a popular MLA commonly used in environmental studies, both for classification and regression problems (Grimm et al. 2008; Vaysse and Lagacherie 2015; Zeraatpisheh et al. 2020). In this case, the ranger method was used, an ensemble learning method that generates multiple decision trees and combines their predictions, by averaging in regression problems such as those of this study and by majority voting in classification. The algorithm randomly selects a subset of features and samples to build each decision tree, which helps to reduce overfitting and improve generalization performance (Kulkarni and Sinha 2012).

SVM is a powerful MLA that is also used for classification and regression analysis (Roy and Chakraborty 2023) and has been used in soil mapping studies (Lamichhane et al. 2019; Taghizadeh-Mehrjardi et al. 2019; Wang et al. 2021). It works by finding a hyperplane or line that separates the data into different classes or groups. The goal of SVM is to find the hyperplane that maximizes the distance between the different classes or groups. SVM is considered a promising algorithm for soil mapping with remote sensing data, especially in cases where there are nonlinear relationships between soil properties and remote sensing variables (Asgari et al. 2020; Bachri et al. 2019).

MARS is mainly used for regression analysis in various soil mapping studies (Jeihouni et al. 2020; Mohamed et al. 2018). It is a non-parametric algorithm that works by building a piecewise linear model that best fits the training data. By dividing the data into segments based on the values of the features, it fits a linear regression model to each segment so that MARS can handle datasets with many features and missing values. The algorithm can then adjust the number and location of segments to minimize the prediction error (Friedman 1991).
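The piecewise linear structure described above can be sketched with hinge basis functions, the building blocks of a MARS model. The knot, coefficients and intercept below are hypothetical and set by hand; a real MARS fit would select them from the data, which this sketch does not attempt.

```python
def hinge(x: float, knot: float, direction: int = 1) -> float:
    """Hinge basis function: max(0, x - knot) or, with direction=-1, max(0, knot - x)."""
    return max(0.0, direction * (x - knot))


def mars_predict(x: float, intercept: float, terms) -> float:
    """Evaluate a one-variable MARS-style model.

    terms is a list of (coefficient, knot, direction) tuples.
    """
    return intercept + sum(coef * hinge(x, knot, d) for coef, knot, d in terms)


# Hypothetical model: slope +2.0 above the knot at x = 10, slope +0.5 below it.
intercept = 5.0
terms = [(2.0, 10.0, 1), (-0.5, 10.0, -1)]

for x in (6.0, 10.0, 12.0):
    print(x, mars_predict(x, intercept, terms))
```

Each pair of mirrored hinges lets the fitted line change slope at the knot, which is how the model captures nonlinear responses while remaining linear within each segment.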

Cubist is an MLA that has also been successfully applied in various fields, including digital soil mapping (Pouladi et al. 2019; Suleymanov et al. 2023). Cubist is an advanced decision tree-based algorithm that combines the power of decision trees and rule-based modelling to make accurate predictions. In soil mapping, Cubist uses a set of rules generated by recursively partitioning the feature space into smaller subspaces. Each subspace is then modelled using a decision tree, and the final model is an ensemble of these decision trees (Quinlan 1992, 1993).

GBM is an MLA commonly used in digital soil mapping (Estévez et al. 2022; Meier et al. 2018). Like RF or Cubist, GBM is a type of ensemble learning method that combines multiple decision trees to make accurate predictions. It works by training a series of decision trees, each designed to correct the errors of the previous tree (Ayyadevara 2018). This iterative process allows the algorithm to learn complex patterns and relationships in the data that may not be apparent in simpler models.
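The idea of each tree correcting the errors of the previous ones can be sketched as follows. This is a simplified illustration with one-split trees (stumps), a single predictor and squared-error loss, not the GBM implementation used in the study; the data, the number of trees and the learning rate are all hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single split on x that minimizes the squared error of the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # exclude the maximum so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_gbm(x, y, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, xi)
                       for p, xi in zip(predictions, x)]
    return base, learning_rate, stumps


def gbm_predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, xi) for s in stumps)


# Hypothetical predictor and response values.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 2.1, 1.9, 3.5, 3.6, 5.0, 5.2, 5.1]
model = fit_gbm(x, y)
print([round(gbm_predict(model, xi), 2) for xi in x])
```

The learning rate shrinks each tree's contribution, so many small corrections accumulate instead of a few large ones.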

NNs are a type of MLA increasingly used in soil mapping (Arruda et al. 2016; Behrens et al. 2005; Coelho et al. 2021). These networks simulate the structure and function of the human brain and are capable of learning complex patterns and relationships in data. In this study, the Multilayer Perceptron Network with Dropout was used, which is one of the most common in soil mapping research (Bodaghabadi et al. 2015; Xu et al. 2019). The MLP with dropout consists of several layers of neurons, including an input layer, one or more hidden layers and an output layer. The dropout layer is inserted between the hidden layers and works by randomly taking out a certain percentage of neurons during the training phase. This means that a different set of neurons is used each time the network is trained, preventing overfitting (Wei et al. 2021).
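The dropout step described above can be sketched for a single layer of activations. The version below is "inverted" dropout, which rescales the surviving activations so that their expected value is unchanged; this variant, the dropout rate and the activation values are assumptions made for illustration.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of activations during training.

    Surviving activations are divided by (1 - rate); at prediction time
    (training=False) the activations pass through unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
hidden = [0.2, 1.5, 0.7, 0.9]  # hypothetical hidden-layer activations
print(dropout(hidden, rate=0.5, rng=rng))
print(dropout(hidden, rate=0.5, rng=rng, training=False))
```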

The hyperparameters of the models were tuned through a random selection of combinations for each machine learning algorithm used in this study (Table 2). This random strategy was chosen because it explores the hyperparameter space broadly, allowing a wide and diverse range of possible configurations to be evaluated. The parameter combination that yielded the best accuracy and predictive performance was then selected for predicting each soil property.
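A random search of this kind can be sketched as below. The search space, the parameter names and the scoring function are hypothetical stand-ins: the actual values are those listed in Table 2, and in the study the score would be a cross-validated error from a fitted model rather than the toy function used here.

```python
import random


def random_search(space, score_fn, n_iter=20, seed=0):
    """Draw n_iter random combinations from `space` and keep the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space for a random-forest-like model.
space = {"mtry": [2, 4, 6, 8, 10], "min_node_size": [1, 5, 10, 20]}


def toy_rmse(params):
    """Stand-in for a cross-validated RMSE; NOT a real model evaluation."""
    return abs(params["mtry"] - 6) + 0.1 * abs(params["min_node_size"] - 5)


best_params, best_score = random_search(space, toy_rmse, n_iter=30)
print(best_params, best_score)
```

Unlike a full grid search, the number of evaluated combinations is fixed by `n_iter`, so the cost does not grow with the size of the search space.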

Table 2 Methods and hyperparameters configuration overview used in each model

Model Evaluation and Statistical Analysis

In this study, the dataset underwent preprocessing steps to ensure its quality and suitability for modeling. Outlier analysis was conducted using spatial analysis techniques such as Moran's I and Local Moran's I to identify and address anomalous data points that could distort the results. The dataset was then randomly divided into 80% of cases for training and 20% for testing the models. Tenfold cross-validation was employed to assess the performance of the machine learning algorithms (MLAs). This method splits the dataset into ten equal parts, iteratively uses nine parts for training the model and one part for testing, and then averages the results across iterations to obtain a final performance metric. Tenfold cross-validation is preferred because it provides a robust estimate of the model's performance while reducing the variance of the results (Wadoux et al. 2021). It also helps guard against overfitting by ensuring that the model's performance is evaluated on unseen data, which promotes better generalization to new data (Cui et al. 2008; Kohavi 1995).
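The splitting scheme can be sketched with index lists. The sample count of 433 is the number reported in this article for clay at 0–5 cm; the random seeds and the assumption that the ten folds are drawn from the 80% training portion are illustrative choices, since the text does not specify them.

```python
import random


def train_test_split(indices, test_fraction=0.2, seed=42):
    """Shuffle the sample indices and hold out a fraction for external testing."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(indices, k=10):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


train, test = train_test_split(range(433))
print(len(train), len(test))
for fold_number, (training, validation) in enumerate(k_folds(train), start=1):
    print(fold_number, len(training), len(validation))
```

Each sample of the training portion appears in exactly one validation fold, so every observation contributes once to the averaged performance metric.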

To determine the performance of the six models used to represent soil properties in each depth interval, the root mean square error (RMSE) and coefficient of determination (R2) were considered. The model with the lowest RMSE and the highest R2 values for external validation of the models was determined to be the most accurate in each case.

$$RMSE = \sqrt {\frac{\sum_{i = 1}^{n} e_i^2 }{n}} ,$$

(1)

where e_i is the difference between the predicted and observed values for the ith observation, and n is the number of observations.

$$R^2 = 1 - \frac{\sum_i \left( y_i - \hat{y}_i \right)^2 }{\sum_i \left( y_i - \overline{y} \right)^2 },$$

(2)

where Σ_i denotes the sum over all i observations in the dataset, y_i is the ith observed value of the dependent variable, ȳ is the mean value of the dependent variable and ŷ_i is the predicted value of the dependent variable for the ith observation.

The analysis of relative RMSE (i.e. RMSE%) was used to examine how the most accurate methods for each property vary depending on the number of available samples. This analysis allowed for an equitable comparison between different soil properties, as RMSE% provides a relative measure of model accuracy relative to the scale of variation of each property. Thus, a more comprehensive and fair evaluation of model effectiveness in predicting various soil characteristics based on sample data availability could be conducted. Therefore, the RMSE% was calculated as:

$$RMSE\% = \frac{RMSE}{{\overline{x}}} \times 100,$$

(3)

where ${\overline{\text{x}}}$ is the mean of the observed value.
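The three metrics can be computed as in the sketch below. RMSE and RMSE% follow Eqs. 1 and 3; R2 is computed here as one minus the ratio of the residual to the total sum of squares, a common definition that is assumed rather than confirmed by the article. The observed and predicted values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error of predictions against observations."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse_percent(observed, predicted):
    """RMSE expressed as a percentage of the observed mean."""
    mean_obs = sum(observed) / len(observed)
    return rmse(observed, predicted) / mean_obs * 100.0


# Hypothetical clay contents (%) at four validation points.
observed = [20.0, 25.0, 30.0, 35.0]
predicted = [22.0, 24.0, 29.0, 37.0]
print(round(rmse(observed, predicted), 3))
print(round(r_squared(observed, predicted), 3))
print(round(rmse_percent(observed, predicted), 2))
```

Because RMSE% is divided by the observed mean, it allows properties measured in different units, such as pH and nutrient concentrations, to be compared on one scale.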

Model Deployment and Mapping Soil Properties

Once the most accurate model has been identified, it is used to map the soil properties, considering the importance of the variables generated by that model. These variables, which are identified when the model is trained, capture the most important factors that influence soil properties, such as topography, vegetation, and climate. The "predict" function from the caret package is used to apply these variables to the trained model and produce continuous maps of the various soil properties throughout the study area. These maps provide a detailed visual representation of the spatial distribution of soil properties and allow a deeper understanding of soil variability.

Results

Descriptive Statistics

The results from Table 3 reveal distinct patterns in soil properties concerning depth. Clay content, soil pH, and CEC show a tendency to increase in deeper layers, while the opposite trend is observed for the other properties. Sand content and soil pH are the least variable properties, with coefficient of variation (CV) values ranging from 25.18 to 32.84% and from 14.99 to 16.91%, respectively. The remaining properties display higher variation, with CV values of around 50% and above. Notably, phosphorus (P) and potassium (K) present CV values exceeding 100%, indicating substantial variability in their concentrations. Additionally, physical properties generally show less variation than chemical properties, except for nitrogen (N) content. Moreover, properties with fewer samples demonstrate higher coefficients of variation, while the opposite is observed for properties with more samples.
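The coefficient of variation used in this comparison is the standard deviation expressed as a percentage of the mean. A minimal sketch follows; it uses the sample standard deviation, which is an assumption because the article does not say whether the sample or population form was used, and the values are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """CV (%) = sample standard deviation / mean * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


# Hypothetical measurements of one soil property at eight sampling points.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(round(coefficient_of_variation(samples), 2))
```

A CV above 100%, as reported here for phosphorus and potassium, means that the standard deviation exceeds the mean.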

Table 3 Descriptive statistics of soil properties selected for analysis

Accuracy of the Models

Table 4 shows the performance metrics for the different models used in predicting various soil properties across the 3 depth intervals. For clay, it is observed that the Random Forest (RF) method has a higher coefficient of determination (R2) and a lower root mean square error (RMSE) across all depths, particularly standing out in the 0–5 cm layer with an R2 of 0.45 and an RMSE of 5.17. Similarly, for silt content, RF also exhibits superior performance in most cases. For sand content, RF and Cubist show the lowest RMSE values for the shallowest layer (0–5 cm), while RF performs best in the other depth intervals. Regarding pH, RF and Cubist exhibit the lowest RMSE values in the 0–5 cm interval, while RF proves to be the most accurate method in the other depth intervals. For cation exchange capacity (CEC), Cubist shows the most solid performance in terms of RMSE for the 0–5 cm and 5–10 cm intervals, while RF is the best-performing method for the deepest interval (> 10 cm). For nutrient content (NPK), RF and SVM generally show good performance. However, Gradient Boosting Machine (GBM) offered the best results for nitrogen content, particularly in the deeper intervals. Finally, for soil organic matter (SOM) content, RF and GBM demonstrate lower RMSE values across various depths. Overall, the results indicate that the Random Forest method tends to be a reliable option for predicting various soil properties, particularly for the deeper intervals.

Table 4 Evaluation metrics for model calibration and validation

Sample Representativeness and Sensitivity of the Models

The RMSE% compared to the total number of samples was analyzed to assess the models' sensitivity (Fig. 2). Results indicate a pattern of an inverse relationship between the number of samples and the predicted error. As the number of samples used in model development increases, the RMSE% value tends to decrease relative to the observed mean value in each property, indicating that a larger dataset leads to greater accuracy in predicting soil properties.

In this context, alongside evaluating the models' sensitivity to the number of samples used, an analysis was also conducted by varying the percentage of data used for model calibration and validation (Fig. 3). The analysis revealed that as the calibration data percentage decreases and the validation data proportion increases, the models' performance tends to deteriorate. Moreover, a Fisher LSD post-hoc test was conducted to assess the significant differences among the different percentages of data used as input in the models. The results demonstrate significant differences between the various groups, except between the datasets considering 30% and 40% of the data for calibration, and 30%, 40%, and 50% of the dataset for validation.

Maps of Soil Properties and Model Covariates Importance

After examining the performances of the developed models, the most accurate ones were selected to produce prediction maps for each of the studied properties by depth interval. In addition, the significance of the environmental covariates in each of the cases studied was determined and analysed.

Soil Particle Size Distribution

Figure 4 displays the maps generated using the most accurate method to represent the spatial distribution of soil particle size. The predictive map of clay content in Extremadura exhibits higher values in the central and southern parts of the region, aligning with the “Tierra de Barros” area, known for its abundant clay content and extensive vineyards. In the case of clay content, vegetation indices such as EVI, SAVI, and NDVI demonstrated greater relevance in the topsoil depth interval, as depicted in Fig. 5. The results also showed that as depth increases, the importance of precipitation surpasses that of vegetation indices in predicting clay content in the deeper soil layers.

The distribution of silt content (Fig. 4) shows a similar pattern to the areas occupied by pasture and grassland, with concentrated higher values corresponding to soils of loamy-silt texture. Conversely, the lowest values for silt content are observed in river valleys. In these regions, the distribution map of sand content (Fig. 4) shows the highest values. In addition, the northern part of the region, characterized by a granite lithological domain, shows increased values for sand content. Looking at the importance of the variables, the morphometric index of valley depth proves to be the most influential factor in predicting silt and sand content, followed by precipitation and profile curvature in the case of sand content (Fig. 5).

Soil pH and Cation Exchange Capacity

Cubist and RF proved to be the most accurate methods for mapping the spatial distribution of soil pH and cation exchange capacity (CEC) (Fig. 6). The southern half of the region, characterised by Calcisols and intensive agricultural land, has the highest soil pH values. Conversely, high altitude regions dominated by forest formations have lower pH values, indicating the acidity of their soils. Furthermore, a decreasing trend of pH values with depth is observed, ranging from 7.10 in the uppermost interval to 2.64 in the deepest layer.

Regarding the spatial distribution of the CEC, the highest values are concentrated near the "Tierra de Barros" region, which coincides with an increased clay content that facilitates nutrient exchange. On the other hand, higher areas and soils with lower nutrient content have lower CEC values.

Regarding the significance of the explanatory variables, precipitation, and vegetation indices (NDVI and EVI) showed a significant influence on the models and maps of pH and CEC in all depth intervals, as shown in Fig. 7.

Nutrients and Soil Organic Matter

The maps depicting the spatial distribution of nutrients (NPK) and soil organic matter (SOM), generated by the most accurate model for each parameter, are presented in Fig. 8. Elevated regions characterized by forest presence exhibit high nitrogen content, with a gradual decrease observed from the central to the southern part of the region. Phosphorus content varies across different soil depth intervals, with the highest values observed in the south-eastern region in the shallowest interval, while in deeper intervals, concentrations peak in the northern part and areas dominated by forests. Discrepancies are notable in the potassium maps for the depth intervals 0–5 cm and 5–10 cm compared to the deepest interval. Regarding soil organic matter, higher concentrations are found in moisture-rich areas with dense vegetation, resembling the nitrogen distribution pattern.

The prediction of soil nutrient content and organic matter involves several influential covariates, including precipitation, altitude, solar radiation, and the morphometric index of valley depth, as depicted in Fig. 9. Precipitation emerges as the most critical variable for predicting nitrogen content, particularly in the 0–5 and 5–10 cm depth intervals, while solar radiation and RSP show significant influence in the deepest interval. Elevation proves crucial for mapping available phosphorus across all depth intervals, followed by solar radiation and vegetation indices. For soil potassium content, the morphometric index of valley depth stands out as the most influential variable, followed by altitude. Regarding soil organic matter content, precipitation and solar radiation dominate the models, with temperature and orientation also playing roles.

Discussion

This study is the first to employ machine learning algorithms to map different soil properties at various depth intervals in the Extremadura region, Spain. Available soil data were used to create maps of the spatial distribution of multiple soil properties using various machine learning methods, aiming to identify the most accurate approach in each case. The results of descriptive statistics (Table 3) shed light on the soil reality of the region. The findings reveal the prevalence of loamy or sandy loam textures associated with igneous and siliceous parent material, which cover approximately 90% of the region. Conversely, the dominance of clays is evident in the natural region known as "Tierra de Barros," where clay texture prevails (Martín et al. 2022). Additionally, clay content, soil pH, and cation exchange capacity (CEC) tend to increase with depth, while other properties decrease. This contrast could partly be explained by the relationship between clay content and CEC, which suggests that the nutrient retention capacity, as reflected by CEC values, may be higher in deeper intervals. However, it is important to note that soil organic matter decreases with depth, indicating additional complexity in soil dynamics. Although the quantity of soil organic matter decreases, the soil's ability to retain nutrients may increase due to the higher presence of clays in deeper layers.

The results revealed that data variability tends to be more pronounced in the deeper soil intervals. This trend may be attributed to the spatial heterogeneity of soil types, which vary in depth and therefore undergo different formation processes (Mulla and McBratney 2001). However, it is important to consider that the lower data availability in these intervals could also be contributing to this perception of increased variability. This data limitation could result in an incomplete representation of the true diversity of conditions in the deeper soil layers, highlighting the need for more comprehensive sampling and careful interpretation of the results. Additionally, it is observed that variability in the data is greater in chemical properties than in physical properties, such as soil texture. This difference suggests that, although there are no major disparities in the textural characteristics of the region's soils except in specific areas, other factors such as land use and biogeochemical processes could significantly influence the variability of available nutrient content.
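One common way to compare variability between properties measured in different units is the coefficient of variation (CV). The sketch below assumes that statistic; the text does not state which variability measure was used, and the input values are hypothetical, not data from this study.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the coefficient of variation (sample SD / mean) as a percentage."""
    if len(values) < 2:
        raise ValueError("at least two observations are needed")
    avg = mean(values)
    if avg == 0:
        raise ValueError("mean is zero; the CV is undefined")
    return 100.0 * stdev(values) / avg


# Hypothetical values for illustration only (not measurements from the study).
clay_percent = [18.0, 20.0, 22.0, 19.0, 21.0]       # a physical property
available_p_mg_kg = [4.0, 15.0, 9.0, 30.0, 2.0]     # a chemical property

cv_clay = coefficient_of_variation(clay_percent)
cv_p = coefficient_of_variation(available_p_mg_kg)
print(f"CV clay: {cv_clay:.1f}%  CV available P: {cv_p:.1f}%")
```

Because the CV is unitless, it allows a texture fraction in percent to be compared with a nutrient content in mg/kg. It becomes unstable with few observations, which is relevant for the sparsely sampled deeper intervals.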

Despite the lack of significant differences in the evaluation metrics of the various methods used, RF followed by Cubist can be recommended for mapping different soil properties in the Extremadura region with the available data. However, Barrena-González et al. (2022) obtained better validation metrics, although without significant differences, when mapping the same soil properties with classical interpolation methods. This raises the question of the reliability of maps developed by machine learning (ML) methods. Although classical methods offered superior validation metrics, ML methods could potentially generate more reliable regional maps by considering dominant environmental characteristics.

Previous studies have shown that RF and Cubist are methods that perform well in mapping various soil properties (Table 5) (Fathololoumi et al. 2020; Kaya et al. 2022; Parsaie et al. 2021; Saidi et al. 2022; Suleymanov et al. 2023; Zeraatpisheh et al. 2019). This fact justifies why these methods are two of the most used in soil property mapping (Khaledian and Miller 2020). Additionally, RF demonstrated its good performance in mapping the deepest soil interval, highlighting its ability to manage nonlinear relationships in the deeper soil layers.

Table 5 Relation of most accurate prediction methods for each soil property in agreement with previous works

However, it is important to highlight that the generalizability of mapping models cannot be taken for granted. Each study context is unique, and factors such as experimental design and environmental characteristics can significantly influence model performance. Although RF and Cubist often perform well overall, no single model with a fixed set of hyperparameters can be assumed to suit all situations. Therefore, it is essential to test different models or adjust hyperparameters specifically for each case study if reliable maps for soil management are sought.

In this context, the relative RMSE, expressed with respect to the actual average value, revealed the dependency of model performance on the number of available soil samples, as previously noted by Tajik et al. (2020). This behavior is reinforced by the outcomes obtained when varying the percentages of data in the model's training and validation sets, as depicted in Fig. 3: as the size of the training dataset increases, RMSE decreases. Nguyen et al. (2021) found that a 70/30 ratio for training and validation, respectively, was the most effective option. Reducing the validation share below 20% could lead to unreliable validation results, while reducing the training data could result in overfitting issues. However, a specific data split may itself influence the validation results (Wadoux et al. 2019). Therefore, it is recommended to employ different evaluation techniques depending on the presence of clustering in the data, as indicated by Wadoux et al. (2021).
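The effect of the training/validation ratio and of the particular split can be illustrated with a minimal sketch. It assumes that relative RMSE means RMSE divided by the mean observed value, expressed as a percentage. A simple least-squares line stands in for the ML models, and the data are synthetic; none of the names or numbers below come from the study.

```python
import math
import random


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def relative_rmse(observed, predicted):
    """RMSE as a percentage of the mean observed value (assumed definition)."""
    return 100.0 * rmse(observed, predicted) / (sum(observed) / len(observed))


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; a stand-in for any model."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def split_rrmse(x, y, train_fraction, repeats=50, seed=0):
    """Mean and SD of validation relative RMSE over repeated random splits."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    n_train = int(round(train_fraction * len(x)))
    scores = []
    for _ in range(repeats):
        rng.shuffle(idx)
        train, valid = idx[:n_train], idx[n_train:]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        obs = [y[i] for i in valid]
        pred = [a + b * x[i] for i in valid]
        scores.append(relative_rmse(obs, pred))
    m = sum(scores) / len(scores)
    sd = math.sqrt(sum((s - m) ** 2 for s in scores) / (len(scores) - 1))
    return m, sd


# Synthetic data: a hypothetical covariate and a hypothetical soil property.
data_rng = random.Random(42)
x = [data_rng.uniform(300, 1200) for _ in range(200)]
y = [1.0 + 0.002 * xi + data_rng.gauss(0, 0.3) for xi in x]

for frac in (0.5, 0.6, 0.7, 0.8, 0.9):
    m, sd = split_rrmse(x, y, frac)
    print(f"{frac:.0%} training: mean rRMSE {m:.1f}% (SD {sd:.1f})")
```

Repeating the split many times separates the two effects discussed above. The mean shows how the error responds to the amount of training data, while the SD shows how much a single arbitrary split can shift the validation result.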

Likewise, the ability of RF to maintain a strong performance was observed even when the number of samples was reduced, as evidenced in the deeper soil intervals for properties such as pH, CEC, or SOM. The performance of support vector machine (SVM) also stood out when the number of available samples decreased. However, it is crucial to interpret the improvement in the performance of this method with caution, as it may experience overfitting issues. Khaledian and Miller (2020) have demonstrated how the performance of SVM can be sensitive to the reduction in the number of samples, which could be reflected in validation metrics showing better results. Therefore, although SVM excels in terms of precision in smaller datasets, it is essential to consider the potential risk of overfitting when interpreting these results.

To enhance the performance of predictive models, this study employed various environmental variables. Previous research has shown how the inclusion of different climatic, topographic, or vegetation attributes enhances the performance of various methods (Fathololoumi et al. 2020; Lamichhane et al. 2019; Mahmoudabadi et al. 2017). On the other hand, employing techniques to identify the most influential variables in model prediction, thereby reducing data dimensionality and generating simpler models, can be advantageous. Brungard et al. (2015) demonstrated more accurate results by employing recursive feature elimination compared to manual variable selection by soil researchers. Similarly, Barrena-González et al. (2023) showed that selecting covariates using the Boruta algorithm can be an effective approach to data dimensionality reduction, resulting in accurate models.

However, little is known about the ideal ratio between the number of environmental covariates and sample size. Including a large number of covariates relative to the available observations can lead to overfitting issues. Poggio et al. (2013) indicated that having fewer than 15 observations per variable used in the model could be insufficient for generating accurate models. Therefore, further study analyzing this ratio is necessary to establish a general rule.
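As a rough illustration of the threshold attributed to Poggio et al. (2013), the following sketch converts a figure of 15 observations per covariate into a minimum sample size and a maximum covariate count. The function names and the 100-sample example are hypothetical; the 22 covariates correspond to the number used in this study.

```python
def min_observations(n_covariates, obs_per_covariate=15):
    """Smallest sample size meeting the observations-per-covariate threshold."""
    return n_covariates * obs_per_covariate


def max_covariates(n_observations, obs_per_covariate=15):
    """Largest number of covariates a given sample size supports."""
    return n_observations // obs_per_covariate


print(min_observations(22))  # 22 covariates -> 330 observations
print(max_covariates(100))   # hypothetical 100 samples -> 6 covariates
```

Under this rule of thumb, a depth interval with fewer than 330 samples would call for covariate selection before fitting a model with all 22 inputs.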

In predicting the spatial distribution of soil clay content, vegetation indices such as EVI, SAVI, and NDVI were found to be most significant. This finding differs from those of Barrena-González et al. (2023) in a study conducted in Extremadura, or other studies elsewhere (Zeraatpisheh et al. 2019), where topographic attributes were of greater importance in explaining the spatial distribution of clay. However, in this study, it could be inferred that these vegetation indices have identified areas with lower vegetation cover associated with bare soils, particularly notable in areas with high agricultural activity. This assertion could be supported by studies such as that of Gasmi et al. (2021), which found negative correlations between clay content and reflectance values, suggesting a plausible relationship between low vegetation index values and higher clay accumulation, and vice versa.
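For reference, the three indices can be computed from surface reflectance with their standard formulations. The coefficients below (soil adjustment factor of 0.5 for SAVI; 2.5, 6, 7.5 and 1 for EVI) are the commonly used defaults and are assumed here, since the text does not state which were applied. The reflectance values are hypothetical.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)


def savi(nir, red, soil_factor=0.5):
    """Soil Adjusted Vegetation Index with the usual soil factor L = 0.5."""
    return (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor)


def evi(nir, red, blue):
    """Enhanced Vegetation Index with the standard coefficients."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


# Hypothetical surface reflectances (0-1), for illustration only.
pixels = {
    "dense vegetation": {"nir": 0.45, "red": 0.05, "blue": 0.03},
    "bare soil": {"nir": 0.30, "red": 0.25, "blue": 0.15},
}

for name, band in pixels.items():
    print(
        f"{name}: NDVI={ndvi(band['nir'], band['red']):.2f} "
        f"SAVI={savi(band['nir'], band['red']):.2f} "
        f"EVI={evi(band['nir'], band['red'], band['blue']):.2f}"
    )
```

All three indices fall towards zero over sparsely vegetated pixels, which is the behavior the interpretation above relies on when low index values are read as bare soil.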

Topographic attributes such as valley depth index and altitude stand out for their relevance in mapping the spatial distribution of silt and sand, providing a deeper understanding of sediment transport dynamics in the landscape (Gallant and Dowling 2003). The valley depth index offers valuable insights into terrain shape and morphology, critical aspects influencing the accumulation and redistribution of sedimentary materials. On the other hand, altitude can influence sediment deposition and erosion, as higher areas may experience different patterns of precipitation and erosion. Mello et al. (2022) corroborated the importance of topographic attributes linked to hydrological behavior in mapping sand and silt content, while Qu et al. (2024) demonstrated the relevance of other topographic indices for obtaining more accurate maps of sand content spatial distribution. These findings underscore the need to consider a variety of topographic attributes for precise and detailed prediction of sediment distribution in the soil.

The importance of variables such as precipitation, solar radiation, altitude, and vegetation indices in predicting soil chemical properties and nutrients at a regional scale in this study can be explained by their direct influence on the biogeochemical processes that regulate soil and vegetation dynamics on a large scale (Zepp et al. 2011; Zhao et al. 2019). These environmental factors, operating at regional scales, have a significant impact on the distribution and availability of nutrients in the soil. For instance, precipitation can vary widely across a region, affecting the quantity and rate of nutrient leaching, as well as soil erosion (Nielsen and Ball 2015; Qiu et al. 2016). Similarly, solar radiation and altitude influence soil temperature and moisture, thereby affecting biological activity, organic matter decomposition, and nutrient mineralization (Kumar et al. 2020; Sultanova et al. 2023; Yan et al. 2019).

Furthermore, vegetation indices, by reflecting vegetation health and biomass, can indicate primary productivity and carbon uptake, which has direct implications for soil quality and fertility (Kunkel et al. 2022). In comparison, topographic attributes, while important at the local level, may have a relatively minor influence on the variability of soil chemical properties at a regional scale (Liu et al. 2022; Mosleh et al. 2016). However, it is important to note that the spatial resolution of data, including pixel size, can also play a crucial role in the models' ability to capture and explain the spatial distribution processes of these properties (Hengl 2006). An appropriate pixel size can allow for a more accurate representation of landscape heterogeneities and better identification of relationships between environmental variables and soil properties at different spatial scales (Behrens et al. 2014; Brus et al. 2011).

The findings of this study provide an initial understanding of how soil properties are distributed spatially at the regional scale in Extremadura, Spain. This work also demonstrates that soils in the region are generally poor in key nutrients, with higher concentrations of organic matter and nitrogen in areas with dense forest cover, such as the northern part of the region or higher elevations. Overall, the soils tend to be acidic due to their geological nature, except in the basin of the Guadiana River and Tierra de Barros, coinciding with areas of higher agricultural intensity, which could result in higher pH and CEC values due to the incorporation of salts associated with fertilizer use.

With environmental covariates, maps have been generated providing a reliable approximation of these properties, which could be highly useful for regional-scale management activities. Previous studies have indicated that around 70% of the region faces a critical risk of soil degradation, stemming from either high livestock stocking rate arising from EU accession or processes associated with intensive agriculture (Contador et al. 2009). Therefore, the results of this study could serve as a monitoring tool to identify vulnerable areas requiring special attention, enabling a temporal analysis of the effectiveness of soil management practices.

The findings of this study could serve as useful tools to support decision-making across various domains. Similar studies have demonstrated how soil mapping and analysis of its properties can provide valuable insights for soil management and precision agriculture (López-Castañeda et al. 2022; Pereira et al. 2022). Additionally, it has been evidenced that monitoring the spatial distribution of soil properties following events such as forest fires or land use changes can be crucial for designing land reforestation and restoration strategies (Dindaroglu et al. 2021; Mousavinezhad et al. 2023). However, it is important to note that this study relied on regionally available data, which may be outdated and not reflect the most recent reality. Therefore, while our findings may offer valuable insights, caution is needed when interpreting and applying them in decision-making.

Therefore, it is suggested that future work ensures a balanced density of sampling points throughout the region, considering all land uses and harmonizing a database that addresses both spatial variability and depth variation. This would allow for more detailed analyses of how soil properties respond to different soil management practices. Additionally, leveraging information from environmental covariates could enable more comprehensive monitoring of changes in land use and evaluation of the effectiveness of implemented management strategies.

In this regard, accurate maps of the spatial distribution of soil properties can play a pivotal role for policymakers and managers in making informed decisions concerning land planning, environmental management, and agriculture. The findings derived from this study could be instrumental in formulating policies aimed at safeguarding strategic soil resources. In particular, considering the soil strategy set by the European Union for 2030, there are concerns in regions like Extremadura, where a large area of productive soils used for agriculture and livestock farming is at risk of being reassigned to renewable energy production. Despite the limited contribution of such installations to job creation and economic development in the region, photovoltaic energy grew by 367% in Extremadura between 2013 and 2020, impacting approximately 5000 ha (Díaz and Berrocal 2022). Moreover, projections for future photovoltaic installations, as estimated by Barriga Bravo et al. (2021) for 2030, foresee the need for over 46,000 ha to install 20,000 MW, raising further concerns. This scenario entails the abandonment of agricultural activity, a primary economic activity in the region, across thousands of hectares. Hence, the identification, classification, and protection of strategic soils intended for agricultural activities warrant immediate attention to ensure food provisions for both inhabitants and livestock, and to prevent the utilization of fertile soils for activities that do not contribute to territorial development.
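The scale of the projected land demand can be made explicit with simple arithmetic on the figures quoted above. This is a rough sketch: the 46,000 ha figure is a lower bound ("over 46,000 ha") and the 5000 ha figure is approximate, so both ratios are indicative only.

```python
# Figures quoted in the text (Díaz and Berrocal 2022; Barriga Bravo et al. 2021).
projected_area_ha = 46_000      # lower bound of land needed by 2030
projected_capacity_mw = 20_000  # capacity to be installed
current_area_ha = 5_000         # approximate area already affected

# Implied land requirement per unit of installed capacity.
land_per_mw = projected_area_ha / projected_capacity_mw

# Projected area relative to the area already affected.
expansion_factor = projected_area_ha / current_area_ha

print(f"{land_per_mw:.1f} ha per MW, {expansion_factor:.1f} times the current area")
```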

Conclusions

The study addressed the spatial distribution of various soil properties in the Extremadura region, Spain, employing six different machine learning models and diverse environmental variables. Overall, the RF model exhibited notable performance, particularly in predicting soil particle size (clay, silt, and sand), as well as estimating soil organic matter and other properties in deeper intervals. Additionally, the Cubist approach also showed promising results in soil property mapping. On the other hand, SVM proved to be the most accurate model when the available data was reduced, although its performance should be interpreted cautiously due to its susceptibility to overfitting.

It was observed that model performance decreased as the number of samples decreased, especially when the percentage of data for training was below 60%. Furthermore, climatic variables such as precipitation and solar radiation, followed by altitude, were found to predominate in mapping the spatial distribution of soil chemical properties and essential nutrients. In contrast, vegetation indices and other topographic indices were more relevant for mapping soil physical properties.

These findings highlight the importance of considering a variety of environmental variables when developing soil mapping models and underscore the need for careful interpretation of model results, especially under conditions of limited data availability. Additionally, the need for updated sampling, with an adequate number of sampling points, is emphasized to create more reliable and current maps reflecting the soil property distribution in the Extremadura region.

Therefore, future work is suggested to ensure a balanced density of sampling points across the region, considering all land uses and harmonizing a database addressing both spatial variability and variation in depth. This would enable more detailed analysis of how soil properties respond to different soil management practices. Furthermore, leveraging environmental covariate information could facilitate a more comprehensive monitoring of changes in land use and evaluation of implemented management strategies.

Data availability

The data of this study are available upon request.

References

Alfonso-Torreño A, Gómez-Gutiérrez Á, Schnabel S (2021) Dynamics of erosion and deposition in a partially restored valley-bottom gully. Land 10(1):62
Arruda GPD, Demattê JA, Chagas CDS, Fiorio PR, Fongaro CT (2016) Digital soil mapping using reference area and artificial neural networks. Sci Agric 73:266–273
Asgari N, Ayoubi S, Jafari A, Demattê JA (2020) Incorporating environmental variables, remote and proximal sensing data for digital soil mapping of USDA soil great groups. Int J Remote Sens 41(19):7624–7648
Ayyadevara VK (2018) Gradient boosting machine. In: Pro machine learning algorithms. Apress, Berkeley, CA, pp 117–134. https://doi.org/10.1007/978-1-4842-3564-5_6
Bachri I, Hakdaoui M, Raji M, Teodoro AC, Benbouziane A (2019) Machine learning algorithms for automatic lithological mapping using remote sensing data: a case study from Souk Arbaa Sahel, Sidi Ifni Inlier, Western Anti-Atlas, Morocco. ISPRS Int J Geoinf 8(6):248
Barrena-González J, Rodrigo-Comino J, Gyasi-Agyei Y, Pulido Fernandez M, Cerdà A (2020) Applying the RUSLE and ISUM in the Tierra de Barros Vineyards (Extremadura, Spain) to estimate soil mobilisation rates. Land 9(3):93
Barrena-González J, Lavado Contador JF, Pulido Fernández M (2022) Mapping soil properties at a regional scale: assessing deterministic vs. geostatistical interpolation methods at different soil depths. Sustainability 14(16):10049
Barrena-González J, Gabourel-Landaverde VA, Mora J, Contador JFL, Fernández MP (2023) Exploring soil property spatial patterns in a small grazed catchment using machine learning. Earth Sci Inform 2023:1–28
Barriga Bravo JJ, Muriel Fernández M, González Zurrón F, Reinoso González F, Sánchez Sánchez-Mora JI, Gallardo García JA, Venegas Fito C (2021) Cómo evitar la tercera colonización energética de la región/el sector de las energías y su compromiso con el desarrollo de Extremadura
Behrens T, Förster H, Scholten T, Steinrücken U, Spies ED, Goldschmitt M (2005) Digital soil mapping using artificial neural networks. J Plant Nutr Soil Sci 168(1):21–33
Behrens T, Schmidt K, Ramirez-Lopez L, Gallant J, Zhu A-X, Scholten T (2014) Hyper-scale digital soil mapping and soil formation analysis. Geoderma 213:578–588
Behrens T, Schmidt K, MacMillan RA, Viscarra Rossel R (2018) Multi-scale digital soil mapping with deep learning. Sci Rep 8(1):15244. https://doi.org/10.1038/s41598-018-33516-6
Bodaghabadi MB, Martínez-Casasnovas J, Salehi MH, Mohammadi J, Borujeni IE, Toomanian N, Gandomkar A (2015) Digital soil mapping using artificial neural networks and terrain-related attributes. Pedosphere 25(4):580–591
Brungard CW, Boettinger JL, Duniway MC, Wills SA, Edwards T Jr (2015) Machine learning for predicting soil classes in three semi-arid landscapes. Geoderma 239:68–83
Brus D, Kempen B, Heuvelink G (2011) Sampling for validation of digital soil maps. Eur J Soil Sci 62(3):394–407
Coelho FF, Giasson E, Campos AR, Costa JJF (2021) Geographic object-based image analysis and artificial neural networks for digital soil mapping. CATENA 206:105568
Contador JFL, Schnabel S, Gómez Gutiérrez Á, Pulido Fernández M (2009) Mapping sensitivity to land degradation in Extremadura, SW Spain. Land Degrad Develop 20(2):129–144
Cui G, Leung Wong M, Zhang G, Li L (2008) Model selection for direct marketing: performance criteria and validation methods. Mark Intell Plan 26(3):275–292
Díaz AP, Berrocal FL (2022) Energías renovables y desarrollo local en Extremadura. Estudios Geográficos 83(292):e102–e102
Dindaroglu T, Babur E, Yakupoglu T, Rodrigo-Comino J, Cerda A (2021) Evaluation of geomorphometric characteristics and soil properties after a wildfire using Sentinel-2 MSI imagery for future fire-safe forest. Fire Saf J 122:103318
Estévez V, Beucher A, Mattbäck S, Boman A, Auri J, Björk K-M, Österholm P (2022) Machine learning techniques for acid sulfate soil mapping in southeastern Finland. Geoderma 406:115446
Fathololoumi S, Vaezi AR, Alavipanah SK, Ghorbani A, Saurette D, Biswas A (2020) Improved digital soil mapping with multitemporal remotely sensed satellite data fusion: a case study in Iran. Sci Total Environ 721:137703
Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19(1):1–67
Gallant JC, Dowling TI (2003) A multiresolution index of valley bottom flatness for mapping depositional areas. Water Resourc Res 39(12):1
Gasmi A, Gomez C, Lagacherie P, Zouari H, Laamrani A, Chehbouni A (2021) Mean spectral reflectance from bare soil pixels along a Landsat-TM time series to increase both the prediction accuracy of soil clay content and mapping coverage. Geoderma 388:114864
Gerstoft P (1997) SAGA user manual 2.0: an inversion software package
Ghorbani A, Moghaddam SM, Majd KH, Dadgar N (2018) Spatial variation analysis of soil properties using spatial statistics: a case study in the region of Sabalan Mountain, Iran. J Prot Mount Areas Res Manag 10:70–80
Grimm R, Behrens T, Märker M, Elsenbeer H (2008) Soil organic carbon concentrations and stocks on Barro Colorado Island—Digital soil mapping using Random Forests analysis. Geoderma 146(1–2):102–113
Hengl T (2006) Finding the right pixel size. Comput Geosci 32:1283–1298
Huang J, Hartemink AE (2020) Soil and environmental issues in sandy soils. Earth Sci Rev 208:103295
IUSS Working Group WRB (2022) World reference base for soil resources. International soil classification system for naming soils and creating legends for soil maps, 4th edn. International Union of Soil Sciences
Jeihouni M, Alavipanah SK, Toomanian A, Jafarzadeh AA (2020) Digital mapping of soil moisture retention properties using solely satellite-based data and data mining techniques. J Hydrol 585:124786
John K, Abraham Isong I, Michael Kebonye N, Okon Ayito E, Chapman Agyeman P, Marcus Afu S (2020) Using machine learning algorithms to estimate soil organic carbon variability with environmental variables and soil nutrient indicators in an alluvial soil. Land 9(12):487
Kang Y, Ozdogan M, Zhu X, Ye Z, Hain C, Anderson M (2020) Comparative assessment of environmental variables and machine learning algorithms for maize yield prediction in the US Midwest. Environ Res Lett 15(6):064005
Kaya F, Keshavarzi A, Francaviglia R, Kaplan G, Başayiğit L, Dedeoğlu M (2022) Assessing machine learning-based prediction under different agricultural practices for digital mapping of soil organic carbon and available phosphorus. Agric Agric Sci Proc 12(7):1062
Khaledian Y, Miller B (2020) Selecting appropriate machine learning methods for digital soil mapping. Appl Math Model 81:401–418. https://doi.org/10.1016/j.apm.2019.12.016
Khanal S, Fulton J, Klopfenstein A, Douridas N, Shearer S (2018) Integration of high resolution remotely sensed data and machine learning techniques for spatial prediction of soil properties and corn yield. Comput Electron Agric 153:213–225
Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI
Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28:1–26
Kulkarni VY, Sinha PK (2012) Pruning of random forest classifiers: a survey and future directions. In: 2012 international conference on data science & engineering (ICDSE)
Kumar N, Kumar A, Jeena N, Singh R, Singh H (2020) Factors influencing soil ecosystem and agricultural productivity at higher altitudes. Microbiol Adv Higher Altitude Agroecosyst Sustain 2020:55–70
Kunkel V, Wells T, Hancock G (2022) Modelling soil organic carbon using vegetation indices across large catchments in eastern Australia. Sci Total Environ 817:152690
Lamichhane S, Kumar L, Wilson B (2019) Digital soil mapping algorithms and covariates for soil organic carbon mapping and their implications: a review. Geoderma 352:395–413. https://doi.org/10.1016/j.geoderma.2019.05.031
Lavado Contador JF, Schnabel S, Gómez Gutiérrez Á, Pulido Fernández M (2009) Mapping sensitivity to land degradation in Extremadura, SW Spain. Land Degrad Develop 20(2):129–144. https://doi.org/10.1002/ldr.884
Li X, Luo J, Jin X, He Q, Niu Y (2020) Improving soil thickness estimations based on multiple environmental variables with stacking ensemble methods. Remote Sens 12(21):3609
Liu F, Wu H, Zhao Y, Li D, Yang J-L, Song X, Zhang G-L (2022) Mapping high resolution national soil information grids of China. Sci Bull 67(3):328–340
López-Castañeda A, Zavala-Cruz J, Palma-López DJ, Rincón-Ramírez JA, Bautista F (2022) Digital mapping of soil profile properties for precision agriculture in developing countries. Agronomy 12(2):353
Lozano-Parra J, Velarde JG, Torreño AA, Barrena-González J (2023) Impact of climate variations on water resources and their availability for the vegetation of extremadura. In: Handbook of research on current advances and challenges of borderlands, migration, and geopolitics. IGI Global, pp 167–178
Mahmoudabadi E, Karimi A, Haghnia GH, Sepehr A (2017) Digital soil mapping using remote sensing indices, terrain attributes, and vegetation features in the rangelands of northeastern Iran. Environ Monit Assess 189:1–20
Martín L, García-García B, Alguacil MDM (2022) Interactions of the fungal community in the complex patho-system of Esca, a grapevine trunk disease. Int J Mol Sci 23(23):14726
Matos-Moreira M, Lemercier B, Dupas R, Michot D, Viaud V, Akkal-Corfini N, Gascuel-Odoux C (2017) High-resolution mapping of soil phosphorus concentration in agricultural landscapes with readily available or detailed survey data. Eur J Soil Sci 68(3):281–294
McBratney AB, Santos MM, Minasny B (2003) On digital soil mapping. Geoderma 117(1–2):3–52
McVay K, Budde J, Fabrizzi K, Mikha M, Rice C, Schlegel AJ, Thompson C (2006) Management effects on soil physical properties in long-term tillage studies in Kansas. Soil Sci Soc Am J 70(2):434–438
Meier M, Souza ED, Francelino MR, Fernandes Filho EI, Schaefer CEGR (2018) Digital soil mapping using machine learning algorithms in a tropical mountainous area. Rev Brasil Ciência Solo 42:1
Mello FA, Demattê JA, Rizzo R, de Mello DC, Poppiel RR, Silvero NE, Gomez AM (2022) Complex hydrological knowledge to support digital soil mapping. Geoderma 409:115638
Mohamed E, Saleh A, Belal A, Gad AA (2018) Application of near-infrared reflectance for quantitative assessment of soil properties. Egypt J Remote Sens Space Sci 21(1):1–14
Mosleh Z, Salehi MH, Jafari A, Borujeni IE, Mehnatkesh A (2016) The effectiveness of digital soil mapping to predict soil properties over low-relief areas. Environ Monit Assess 188:1–13. https://doi.org/10.1007/s10661-016-5204-8
Mousavinezhad M, Feizi A, Aalipour M (2023) Performance evaluation of machine learning algorithms in change detection and change prediction of a watershed’s land use and land cover. Int J Environ Res 17(2):29
Mulla D, McBratney AB (2001) Soil spatial variability. Soil physics companion. CRC Press, Boca Raton
Nguyen XC, Ly QV, Li J, Bae H, Bui X-T, Nguyen TTH, Nghiem LD (2021) Nitrogen removal in subsurface constructed wetland: assessment of the influence and prediction by data mining and machine learning. Environ Technol Innov 23:101712
Nielsen UN, Ball BA (2015) Impacts of altered precipitation regimes on soil communities and biogeochemistry in arid and semi-arid ecosystems. Glob Change Biol 21(4):1407–1421
Ninyerola M, Pons X, Roure JM (2005) Atlas Climático Digital de la Península Ibérica. Metodología y aplicaciones en bioclimatología y geobotánica. Universidad Autónoma de Barcelona
Omran E-SE (2016) A simple model for rapid gypsum determination in arid soils. Model Earth Syst Environ 2(4):1–12
Padarian J, Minasny B, McBratney AB (2019) Machine learning and soil sciences: a review aided by machine learning tools. SOIL 6:35–52. https://doi.org/10.5194/soil-6-35-2020
Parsaie F, Farrokhian Firouzi A, Mousavi SR, Rahmani A, Sedri MH, Homaee M (2021) Large-scale digital mapping of topsoil total nitrogen using machine learning models and associated uncertainty map. Environ Monit Assess 193:1–15
Peel MC, Finlayson BL, McMahon TA (2007) Updated world map of the Köppen–Geiger climate classification. Hydrol Earth Syst Sci Discuss 4(2):439–473
Pereira P, Brevik E, Munoz-Rojas M, Miller B (2017) Soil mapping and process modeling for sustainable land use management. Elsevier, London
Pereira GW, Valente DSM, de Queiroz DM, Santos NT, Fernandes-Filho EI (2022) Soil mapping for precision agriculture using support vector machines combined with inverse distance weighting. Precis Agric 23(4):1189–1204
Poggio L, Gimona A, Brewer MJ (2013) Regional scale mapping of soil properties and their uncertainty with a large number of satellite-derived covariates. Geoderma 209:1–14
Pouladi N, Møller AB, Tabatabai S, Greve MH (2019) Mapping soil organic matter contents at field level with Cubist, random forest and kriging. Geoderma 342:85–92
Pulido M, Schnabel S, Contador JFL, Lozano-Parra J, Gómez-Gutiérrez Á (2017) Selecting indicators for assessing soil quality and degradation in rangelands of Extremadura (SW Spain). Ecol Indic 74:49–61
Pulido M, Schnabel S, Lavado Contador JF, Lozano-Parra J, González F (2018a) The impact of heavy grazing on soil quality and pasture production in rangelands of SW Spain. Land Degrad Develop 29(2):219–230. https://doi.org/10.1002/ldr.2501
Pulido M, Schnabel S, Lavado Contador JF, Lozano-Parra J, Gonzalez F (2018b) The impact of heavy grazing on soil quality and pasture production in rangelands of SW Spain. Land Degrad Dev 29(2):219–230
Qiu J, Gao Q, Wang S, Su Z (2016) Comparison of temporal trends from multiple soil moisture data sets and precipitation: the implication of irrigation on regional soil moisture trend. Int J Appl Earth Observ Geoinform 48:17–27
Qu L, Lu H, Tian Z, Schoorl J, Huang B, Liang Y, Liang Y (2024) Spatial prediction of soil sand content at various sampling density based on geostatistical and machine learning algorithms in plain areas. CATENA 234:107572
Article Google Scholar
Quinlan JR (1992) Learning with continuous classes. In: 5th Australian joint conference on artificial intelligence
Quinlan JR (1993) Combining instance-based and model-based learning. In: Proceedings of the 10th international conference on machine learning
Rodrigo-Comino J, Barrena-González J, Pulido-Fernández M, Cerdá A (2019) Estimating non-sustainable soil erosion rates in the Tierra de Barros Vineyards (Extremadura, Spain) Using an ISUM Update. Appl Sci 9(16):3317. https://doi.org/10.3390/app9163317
Article Google Scholar
Roy A, Chakraborty S (2023) Support vector machine in structural reliability analysis: a review. Reliab Eng Syst Saf 2023:109126
Article Google Scholar
RStudio Team (2020) RStudio: integrated development for R. In: RStudio, PBC. http://www.rstudio.com/
Rubio-Delgado J, Guillén J, Corbacho JA, Gómez-Gutiérrez Á, Baeza A, Schnabel S (2017) Comparison of two methodologies used to estimate erosion rates in Mediterranean ecosystems: 137Cs and exposed tree roots. Sci Total Environ 605–606:541–550. https://doi.org/10.1016/j.scitotenv.2017.06.248
Article CAS Google Scholar
Rutgers M, van Leeuwen JP, Vrebos D, van Wijnen HJ, Schouten T, de Goede RG (2019) Mapping soil biodiversity in Europe and the Netherlands. Soil Systems 3(2):39
Article Google Scholar
Saidi S, Ayoubi S, Shirvani M, Azizi K, Zeraatpisheh M (2022) Comparison of different machine learning methods for predicting cation exchange capacity using environmental and remote sensing data. Sensors 22(18):6890
Article CAS Google Scholar
Schnabel S, Lavado Contador JF, Gómez Gutiérrez Á (2009) Soil degradation in wooded rangelands of southwest Spain. Geophys Res Abstr 11:EGU2009-11193
Google Scholar
Shi T, Guo L, Chen Y, Wang W, Shi Z, Li Q, Wu G (2018) Proximal and remote sensing techniques for mapping of soil contamination with heavy metals. Appl Spectrosc Rev 53(10):783–805
Article Google Scholar
Singh A, Ganapathysubramanian B, Singh AK, Sarkar S (2016) Machine learning for high-throughput stress phenotyping in plants. Trends Plant Sci 21(2):110–124. https://doi.org/10.1016/j.tplants.2015.10.015
Article CAS Google Scholar
Suleymanov A, Tuktarova I, Belan L, Suleymanov R, Gabbasova I, Araslanova L (2023) Spatial prediction of soil properties using random forest, k-nearest neighbors and cubist approaches in the foothills of the Ural Mountains, Russia. Model Earth Syst Environ Behav 2023:1–11
Google Scholar
Sultanova R, Odintsov G, Martynova M, Mustafin R (2023) Assessment of carbon reserves and biomass of forest ecosystems in the southern Urals. Int J Environ Res 17(2):26
Article CAS Google Scholar
Taghizadeh-Mehrjardi R, Minasny B, Toomanian N, Zeraatpisheh M, Amirian-Chakan A, Triantafilis J (2019) Digital mapping of soil classes using ensemble of models in Isfahan region, Iran. Soil Systems 3(2):37
Article Google Scholar
Tajik S, Ayoubi S, Zeraatpisheh M (2020) Digital mapping of soil organic carbon using ensemble learning model in Mollisols of Hyrcanian forests, northern Iran. Geoderma Reg 20:e00256
Article Google Scholar
Thomas N, Schilling K, Amado AA, Streeter M, Weber L (2017) Inverse modeling of soil hydraulic properties in a two-layer system and comparisons with measured soil conditions. Vadose Zone J 16(2):1–14
Article Google Scholar
Van Klompenburg T, Kassahun A, Catal C (2020) Crop yield prediction using machine learning: a systematic literature review. Comput Electron Agric 177:105709
Article Google Scholar
Vaysse K, Lagacherie P (2015) Evaluating digital soil mapping approaches for mapping GlobalSoilMap soil properties from legacy data in Languedoc-Roussillon (France). Geoderma Reg 4:20–30
Article Google Scholar
Wadoux AM-C (2019) Using deep learning for multivariate mapping of soil with quantified uncertainty. Geoderma 351:59–70
Article Google Scholar
Wadoux AM-C, Brus DJ, Heuvelink GB (2019) Sampling design optimization for soil mapping with random forest. Geoderma 355:113913
Article Google Scholar
Wadoux AM-C, Minasny B, McBratney AB (2020) Machine learning for digital soil mapping: applications, challenges and suggested solutions. Earth Sci Rev 210:103359
Article Google Scholar
Wadoux AM-C, Heuvelink GB, De Bruin S, Brus DJ (2021) Spatial cross-validation is not the right way to evaluate map accuracy. Ecol Model 457:109692
Article Google Scholar
Wang F, Yang S, Yang W, Yang X, Jianli D (2019) Comparison of machine learning algorithms for soil salinity predictions in three dryland oases located in Xinjiang Uyghur Autonomous Region (XJUAR) of China. Eur J Remote Sens 52(1):256–276
Article Google Scholar
Wang J, Peng J, Li H, Yin C, Liu W, Wang T, Zhang H (2021) Soil salinity mapping using machine learning algorithms with the Sentinel-2 MSI in arid areas, China. Remote Sens 13(2):305
Article Google Scholar
Wei X, Zhang L, Yang H-Q, Zhang L, Yao Y-P (2021) Machine learning for pore-water pressure time-series prediction: application of recurrent neural networks. Geosci Front 12(1):453–467
Article Google Scholar
Xu Z, Zhao X, Guo X, Guo J (2019) Deep learning application for predicting soil organic matter content by VIS-NIR spectroscopy. Comput Intell Neurosci 2019:1–11
Google Scholar
Yan M, Li Z, Tian X, Zhang L, Zhou Y (2019) Improved simulation of carbon and water fluxes by assimilating multi-layer soil temperature and moisture into process-based biogeochemical model. Forest Ecosyst 6:1–15
Article Google Scholar
Zepp R, Erickson Iii D, Paul N, Sulzberger B (2011) Effects of solar UV radiation and climate change on biogeochemical cycling: interactions and feedbacks. Photochem Photobiol Sci 10(2):261–279
Article CAS Google Scholar
Zeraatpisheh M, Ayoubi S, Jafari A, Tajik S, Finke P (2019) Digital mapping of soil properties using multiple machine learning in a semi-arid region, central Iran. Geoderma 338:445–452. https://doi.org/10.1016/j.geoderma.2018.09.006
Article CAS Google Scholar
Zeraatpisheh M, Jafari A, Bodaghabadi MB, Ayoubi S, Taghizadeh-Mehrjardi R, Toomanian N, Xu M (2020) Conventional and digital soil mapping in Iran: past, present, and future. CATENA 188:104424
Article Google Scholar
Zhao X, Yang Y, Shen H, Geng X, Fang J (2019) Global soil–climate–biome diagram: linking surface soil properties to climate and biota. Biogeosciences 16(14):2857–2871
Article CAS Google Scholar

Download references

Acknowledgements

This research was made possible by funding from the Consejería de Economía, Ciencia y Agenda Digital de la Junta de Extremadura and from the European Regional Development Fund of the European Union (grant reference IB16052). The European Social Fund and the Junta de Extremadura also funded Jesús Barrena González (PD18016).

Funding

Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

Author information

Authors and Affiliations

Instituto Universitario de Investigación para el Desarrollo Territorial Sostenible (INTERRA), Grupo de Investigación GeoAmbiental, Universidad de Extremadura, 10071, Cáceres, Spain
Jesús Barrena-González, Francisco Lavado Contador & Manuel Pulido Fernández
Geography Department, Faculty of Arts, University of Ljubljana, Aškerceva 2, Ljubljana, Slovenia
Blâz Repe

Corresponding author

Correspondence to Jesús Barrena-González.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Barrena-González, J., Lavado Contador, F., Repe, B. et al. Looking for Optimal Maps of Soil Properties at the Regional Scale. Int J Environ Res 18, 60 (2024). https://doi.org/10.1007/s41742-024-00611-8

Received: 21 August 2023
Revised: 01 May 2024
Accepted: 15 May 2024
Published: 27 May 2024
DOI: https://doi.org/10.1007/s41742-024-00611-8
