GeoRF: a geospatial random forest

Data Mining and Knowledge Discovery

Abstract

The geospatial domain increasingly relies on data-driven methodologies to extract actionable insights from the growing volume of available data. Although tree-based models are effective at capturing complex relationships between features and targets, they fall short in accounting for spatial factors. This limitation arises from their reliance on univariate, axis-parallel splits, which result in rectangular areas on a map. To address this issue and enhance both performance and interpretability, we propose two novel bivariate splits designed specifically for geographic coordinates: an oblique split and a Gaussian split. Our method, called Geospatial Random Forest (geoRF), builds upon Geospatial Regression Trees (GeoTrees) to effectively incorporate geographic features and extract spatial insights. Through an extensive benchmark, we show that geoRF outperforms traditional spatial statistical models, other spatial RF variants, and machine learning and deep learning methods across a range of geospatial tasks. Furthermore, we contextualize our method’s computational time complexity relative to baseline approaches. Our prediction maps illustrate that geoRF produces more robust and intuitive decision boundaries than conventional tree-based models. Using impurity-based feature importance measures, we validate geoRF’s effectiveness in highlighting the significance of geographic coordinates, especially in data sets exhibiting pronounced spatial patterns.
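
To make the contrast between split types concrete, the following is a minimal sketch of how a univariate axis-parallel split differs from a bivariate split on the coordinates. It is not taken from the geoRF code base. The function names, the parametrization of the oblique split as a line a*x + b*y = c, and in particular the form of the Gaussian split (an isotropic Gaussian compared against a level, which yields a circular boundary) are assumptions made for illustration only.

```python
import math

def axis_parallel_split(x, y, feature, threshold):
    """Univariate split: compare a single coordinate with a threshold."""
    value = x if feature == "x" else y
    return value <= threshold

def oblique_split(x, y, a, b, c):
    """Bivariate linear split: the boundary is the line a*x + b*y = c."""
    return a * x + b * y <= c

def gaussian_split(x, y, cx, cy, sigma, level):
    """Bivariate radial split (assumed form, not the paper's definition):
    go left when an isotropic Gaussian centred at (cx, cy) is at least
    `level` at the point, which gives a circular boundary on the map."""
    sq_dist = (x - cx) ** 2 + (y - cy) ** 2
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) >= level

# Four hypothetical locations, routed by each split type.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (3.0, 3.0)]
ap = [axis_parallel_split(x, y, "x", 1.5) for x, y in points]
ob = [oblique_split(x, y, 1.0, 1.0, 2.5) for x, y in points]
ga = [gaussian_split(x, y, 1.0, 1.0, 1.0, 0.5) for x, y in points]
```

The axis-parallel split can only cut the map along a vertical or horizontal line, whereas the two bivariate forms use both coordinates at once and can therefore follow diagonal or rounded boundaries.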

Notes

  1. The code is available via https://github.com/margotgeerts/geoRF

  2. Accessed via https://geographicdata.science/book/data/airbnb/regression_cleaning.html.

  3. Accessed via https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market.

  4. Accessed via https://www.kaggle.com/datasets/astronautelvis/kc-house-data.

Acknowledgements

This research was supported by the EC H2020 MSCA RISE NeEDS Project [Grant agreement ID: 822214].

Author information

Corresponding author

Correspondence to Margot Geerts.

Additional information

Responsible editor: Michelangelo Ceci.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Algorithms

Algorithm 2 presents the functions FindBestAPSplit and FindBestOBSplit used in the GeoTree algorithm. Algorithm 3 outlines the Dual Annealing algorithm used for finding geospatial splits. Similarly, Algorithm 4 shows the Global Function Search algorithm.

Algorithm 2 FindBestSplit

Algorithm 3 DualAnnealing(f, init, bounds, maxiter)

Algorithm 4 GlobalFunctionSearch(f, init, bounds, maxiter)
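
As a rough illustration of how a stochastic optimizer can search for an oblique split, the sketch below minimizes the weighted child variance of a split over two parameters. It uses plain simulated annealing as a simpler stand-in and is not the Dual Annealing procedure of Algorithm 3. The argument names f, init, bounds and maxiter follow the algorithm captions. Everything else is an assumption made for illustration: the angle/offset parametrization, variance as the impurity measure, the cooling schedule, and the toy data.

```python
import math
import random

def variance(values):
    """Population variance; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def oblique_impurity(params, xs, ys, targets):
    """Weighted child variance for the split cos(t)*x + sin(t)*y <= c."""
    theta, c = params
    left, right = [], []
    for x, y, t in zip(xs, ys, targets):
        goes_left = math.cos(theta) * x + math.sin(theta) * y <= c
        (left if goes_left else right).append(t)
    n = len(targets)
    return (len(left) * variance(left) + len(right) * variance(right)) / n

def anneal(f, init, bounds, maxiter, seed=0):
    """Plain simulated annealing over a box; returns the best point visited."""
    rng = random.Random(seed)
    current, current_val = list(init), f(init)
    best, best_val = list(current), current_val
    for i in range(maxiter):
        temperature = 1.0 - i / maxiter  # linear cooling from 1 towards 0
        candidate = []
        for value, (lo, hi) in zip(current, bounds):
            # Step size shrinks with the temperature; clip to the bounds.
            step = rng.gauss(0.0, 0.25 * (hi - lo) * temperature + 1e-9)
            candidate.append(min(hi, max(lo, value + step)))
        cand_val = f(candidate)
        delta = cand_val - current_val
        # Always accept improvements; accept worse moves with a
        # probability that decreases as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / max(temperature, 1e-9)):
            current, current_val = candidate, cand_val
        if cand_val < best_val:
            best, best_val = list(candidate), cand_val
    return best, best_val

# Toy data on a 4x4 grid whose target changes across the diagonal x + y = 3.
grid = [(float(i), float(j)) for i in range(4) for j in range(4)]
xs = [p[0] for p in grid]
ys = [p[1] for p in grid]
targets = [10.0 if x + y > 3.0 else 0.0 for x, y in grid]

def objective(params):
    return oblique_impurity(params, xs, ys, targets)

init = [0.0, 1.5]  # theta = 0 is the axis-parallel split x <= 1.5
bounds = [(0.0, math.pi / 2), (0.0, 4.5)]
start_val = objective(init)
best, best_val = anneal(objective, init, bounds, maxiter=500)
```

Because the search keeps track of the best point visited, the returned impurity is never worse than that of the axis-parallel starting split. Whether it reaches the diagonal boundary depends on the random seed and the iteration budget.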

Appendix B: Feature importances for the Melbourne and King County data sets

The impurity-based feature importances for the Melbourne data set are presented in Fig. 9. The feature importances for the King County data set are shown in Fig. 10.

Fig. 9 Melbourne feature importances

Fig. 10 King County feature importances

Appendix C: Data set descriptions

Four real estate data sets and four data sets from other geospatial domains are used in the experimental phase:

  • San Diego: Table 7 contains the detailed description. In the experiments, the price is predicted based on the other variables in Table 7 and the X- and Y-coordinates.

  • Melbourne: The Melbourne data set is described in Table 8. The described variables and the X- and Y-coordinates are used to predict the price.

  • King County: Refer to Table 9 for a detailed description of the variables used for predicting the price in conjunction with the geographic coordinates of the King County data.

  • Belgium: The Belgium data set is proprietary, but detailed information can be found in Table 10. The described variables are used as explanatory variables along with the X- and Y-coordinates to model prices.

  • Election: The Election data set is described in Table 12. The election outcome ‘gop_2016’ is regressed on the other variables and the locations (X-Y).

  • Elevation: Table 11 describes the Elevation data set used for the spatial interpolation task, in which the elevation, ‘z’, is interpolated from other locations. A 10% random sample of the original data set is taken, resulting in 39,798 observations after removing duplicate locations.

  • Air Temperature: Air temperature and precipitation variables of this data set are described in Table 13. The multivariate geospatial task consists in predicting ‘meanT’ based on ‘meanP’ and X- and Y-coordinates.

  • Clay: The Clay data set contains the target variable (‘CLYPPT’), which indicates the percentage of clay in the soil and is regressed on the measurement depth and other soil properties. More information on the distribution of these variables can be found in Table 14.
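
The preprocessing described for the Elevation data set (a 10% random sample followed by removal of duplicate locations) could be sketched as follows. This is an illustrative reading rather than the authors' code: the source does not state whether sampling or deduplication comes first, the record keys "x" and "y" are hypothetical, and the synthetic rows merely stand in for the real data.

```python
import random

def sample_and_deduplicate(rows, fraction, seed=0):
    """Draw a random sample of `rows`, then keep the first row per (x, y).

    Assumed order: sample first, deduplicate second.
    """
    rng = random.Random(seed)
    k = round(len(rows) * fraction)
    sampled = rng.sample(rows, k)
    seen = set()
    unique = []
    for row in sampled:
        location = (row["x"], row["y"])
        if location not in seen:
            seen.add(location)
            unique.append(row)
    return unique

# Synthetic stand-in data containing repeated locations.
rows = [{"x": float(i % 50), "y": float(i % 7), "z": float(i)} for i in range(1000)]
subset = sample_and_deduplicate(rows, 0.10)
locations = [(r["x"], r["y"]) for r in subset]
```

Fixing the seed makes the sample reproducible. The number of rows that survive depends on how many sampled rows share a location.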

Table 7 Descriptive statistics of the San Diego data set
Table 8 Descriptive statistics of the Melbourne data set
Table 9 Descriptive statistics of the King County data set
Table 10 Descriptive statistics of the Belgium data set
Table 11 Descriptive statistics of the elevation data set
Table 12 Descriptive statistics of the election data set
Table 13 Descriptive statistics of the temperature data set
Table 14 Descriptive statistics of the clay data set

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Geerts, M., vanden Broucke, S. & De Weerdt, J. GeoRF: a geospatial random forest. Data Min Knowl Disc 38, 3414–3448 (2024). https://doi.org/10.1007/s10618-024-01046-7

  • DOI: https://doi.org/10.1007/s10618-024-01046-7
