GeoRF: a geospatial random forest

Geerts, Margot; vanden Broucke, Seppe; De Weerdt, Jochen

doi:10.1007/s10618-024-01046-7

Published: 19 June 2024

Data Mining and Knowledge Discovery, Volume 38, pages 3414–3448 (2024)

Abstract

The geospatial domain increasingly relies on data-driven methodologies to extract actionable insights from the growing volume of available data. Despite the effectiveness of tree-based models in capturing complex relationships between features and targets, they fall short when it comes to considering spatial factors. This limitation arises from their reliance on univariate, axis-parallel splits that result in rectangular areas on a map. To address this issue and enhance both performance and interpretability, we propose a solution that introduces two novel bivariate splits, an oblique split and a Gaussian split, designed specifically for geographic coordinates. Our innovation, called Geospatial Random Forest (geoRF), builds upon Geospatial Regression Trees (GeoTrees) to effectively incorporate geographic features and extract maximum spatial insights. Through an extensive benchmark, we show that our geoRF model outperforms traditional spatial statistical models, other spatial RF variations, and machine learning and deep learning methods across a range of geospatial tasks. Furthermore, we contextualize our method’s computational time complexity relative to baseline approaches. Our prediction maps illustrate that geoRF produces more robust and intuitive decision boundaries than conventional tree-based models. Using impurity-based feature importance measures, we validate geoRF’s effectiveness in highlighting the significance of geographic coordinates, especially in data sets exhibiting pronounced spatial patterns.
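
The contrast between the split types named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper or the geoRF repository: the function names and the parameterizations (a line a*x + b*y = c for the oblique split, and an isotropic Gaussian with centre (cx, cy), width sigma and a cut-off threshold for the Gaussian split) are assumptions chosen only to show the shape of the region that each split carves out on a map.

```python
import math

def axis_parallel_split(x, y, threshold, axis=0):
    """Univariate split: a vertical (axis=0) or horizontal (axis=1) line."""
    return (x if axis == 0 else y) <= threshold

def oblique_split(x, y, a, b, c):
    """Bivariate linear split: one side of the line a*x + b*y = c.
    (Hypothetical parameterization, for illustration only.)"""
    return a * x + b * y <= c

def gaussian_split(x, y, cx, cy, sigma, threshold):
    """Bivariate radial split: points whose Gaussian weight around (cx, cy)
    is at least `threshold`, i.e. points inside a circular region.
    (Hypothetical parameterization, for illustration only.)"""
    d2 = (x - cx) ** 2 + (y - cy) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2)) >= threshold

# Four toy locations (x, y) in arbitrary map units.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (0.5, 2.0)]

ap = [axis_parallel_split(x, y, threshold=1.0) for x, y in points]
ob = [oblique_split(x, y, a=1.0, b=1.0, c=2.0) for x, y in points]
ga = [gaussian_split(x, y, cx=1.0, cy=1.0, sigma=1.0, threshold=0.5)
      for x, y in points]

print(ap)  # membership in the "left" child under each split type
print(ob)
print(ga)
```

The axis-parallel rule can only cut the map along a coordinate axis, whereas the oblique rule cuts along an arbitrary line and the Gaussian rule isolates a neighbourhood around a centre point. That difference is what the abstract refers to when it contrasts rectangular areas with more intuitive decision boundaries.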

Notes

The code is available via https://github.com/margotgeerts/geoRF
Accessed via https://geographicdata.science/book/data/airbnb/regression_cleaning.html.
Accessed via https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market.
Accessed via https://www.kaggle.com/datasets/astronautelvis/kc-house-data.

Acknowledgements

This research was supported by the EC H2020 MSCA RISE NeEDS Project [Grant agreement ID: 822214].

Author information

Authors and Affiliations

Research centre for Information Systems Engineering, KU Leuven, Naamsestraat 69, 3000, Leuven, Belgium
Margot Geerts, Seppe vanden Broucke & Jochen De Weerdt
Department of Business Informatics and Operations Management, Ghent University, Tweekerkenstraat 2, 9000, Gent, Belgium
Seppe vanden Broucke

Corresponding author

Correspondence to Margot Geerts.

Additional information

Responsible editor: Michelangelo Ceci.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Algorithms

Algorithm 2 presents the functions FindBestAPSplit and FindBestOBSplit used in the GeoTree algorithm. Algorithm 3 outlines the Dual Annealing algorithm used for finding geospatial splits. Similarly, Algorithm 4 shows the Global Function Search algorithm.
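
The algorithm listings themselves are not reproduced here, so the following Python sketch only illustrates the general idea of treating split finding as a search over split parameters. Several things in it are assumptions rather than the paper's method: the oblique split is parameterized by an angle theta and an offset c, the cost is the summed squared error of the two child nodes, and a coarse exhaustive grid search is swapped in as a dependency-free stand-in for a global optimizer such as Dual Annealing or Global Function Search. All function names are hypothetical.

```python
import math

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_cost(params, coords, targets):
    """Total SSE of both children for the oblique split
    cos(theta)*x + sin(theta)*y <= c. Lower is better."""
    theta, c = params
    left, right = [], []
    for (x, y), t in zip(coords, targets):
        goes_left = math.cos(theta) * x + math.sin(theta) * y <= c
        (left if goes_left else right).append(t)
    return sse(left) + sse(right)

def grid_search(coords, targets, thetas, offsets):
    """Exhaustive stand-in for a global optimizer over (theta, c).
    Returns (cost, theta, c) of the cheapest candidate split."""
    candidates = (
        (split_cost((th, c), coords, targets), th, c)
        for th in thetas
        for c in offsets
    )
    return min(candidates, key=lambda r: r[0])

# Toy data: a 10x10 grid whose target changes across the diagonal x + y = 1.
coords = [(i / 9, j / 9) for i in range(10) for j in range(10)]
targets = [1.0 if i + j > 9 else 0.0 for i in range(10) for j in range(10)]

offsets = [k * 0.05 for k in range(-20, 31)]

# Axis-parallel splits are the special cases theta = 0 and theta = pi/2.
axis_cost, _, _ = grid_search(coords, targets, [0.0, math.pi / 2], offsets)

# Oblique splits may use any of 16 evenly spaced angles in [0, pi).
oblique_cost, theta, c = grid_search(
    coords, targets, [k * math.pi / 16 for k in range(16)], offsets
)

print(axis_cost, oblique_cost, theta, c)
```

On this toy grid no axis-parallel cut separates the two target values, while the diagonal line does so exactly, so the oblique search reaches a cost of zero. In practice the `split_cost` function could be handed to a stochastic global optimizer (for example `scipy.optimize.dual_annealing`, which takes an objective and a list of bounds) instead of being evaluated on a fixed grid.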

Appendix B: Feature importances for the Melbourne and King County data sets

The impurity-based feature importances for the Melbourne data set are presented in Fig. 9. The feature importances for the King County data set are shown in Fig. 10.

Appendix C: Data set descriptions

Four real estate data sets and four data sets from other geospatial domains are used in the experimental phase:

San Diego: Table 7 contains the detailed description. In the experiments, the price is predicted based on the other variables in Table 7 and the X- and Y-coordinates.
Melbourne: The Melbourne data set is described in Table 8. The described variables and the X- and Y-coordinates are used to predict the price.
King County: Refer to Table 9 for a detailed description of the variables used for predicting the price in conjunction with the geographic coordinates of the King County data.
Belgium: The Belgium data set is proprietary, but detailed information can be found in Table 10. The described variables are used as explanatory variables along with the X- and Y-coordinates to model prices.
Election: The Election data set is described in Table 12. The election outcome ‘gop_2016’ is regressed on the other variables and the locations (X-Y).
Elevation: Table 11 describes the Elevation data set used for the spatial interpolation task where the elevation, ‘z’, is interpolated from other locations. A 10% random sample is taken of the original data set resulting in 39,798 observations after removing duplicate locations.
Air Temperature: Air temperature and precipitation variables of this data set are described in Table 13. The multivariate geospatial task consists in predicting ‘meanT’ based on ‘meanP’ and X- and Y-coordinates.
Clay: In the Clay data set, the target variable (‘CLYPPT’), which indicates the percentage of clay in the soil, is regressed on the measurement depth and other soil properties. More information on the distribution of these variables can be found in Table 14.

Table 7 Descriptive statistics of the San Diego data set
Table 8 Descriptive statistics of the Melbourne data set
Table 9 Descriptive statistics of the King County data set
Table 10 Descriptive statistics of the Belgium data set
Table 11 Descriptive statistics of the elevation data set
Table 12 Descriptive statistics of the election data set
Table 13 Descriptive statistics of the temperature data set
Table 14 Descriptive statistics of the clay data set

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Geerts, M., vanden Broucke, S. & De Weerdt, J. GeoRF: a geospatial random forest. Data Min Knowl Disc 38, 3414–3448 (2024). https://doi.org/10.1007/s10618-024-01046-7

Received: 10 October 2023
Accepted: 27 May 2024
Published: 19 June 2024
Issue Date: November 2024
DOI: https://doi.org/10.1007/s10618-024-01046-7
