Abstract
The geospatial domain increasingly relies on data-driven methodologies to extract actionable insights from the growing volume of available data. Although tree-based models are effective at capturing complex relationships between features and targets, they fall short when it comes to spatial factors. This limitation arises from their reliance on univariate, axis-parallel splits, which partition a map into rectangular areas. To address this issue and enhance both performance and interpretability, we propose two novel bivariate splits designed specifically for geographic coordinates: an oblique split and a Gaussian split. Our method, called Geospatial Random Forest (geoRF), builds upon Geospatial Regression Trees (GeoTrees) to incorporate geographic features effectively and extract spatial insights. Through an extensive benchmark, we show that geoRF outperforms traditional spatial statistical models, other spatial RF variations, and machine learning and deep learning methods across a range of geospatial tasks. Furthermore, we contextualize our method’s computational time complexity relative to baseline approaches. Our prediction maps illustrate that geoRF produces more robust and intuitive decision boundaries than conventional tree-based models. Using impurity-based feature importance measures, we validate geoRF’s effectiveness in highlighting the significance of geographic coordinates, especially in data sets exhibiting pronounced spatial patterns.
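To make the contrast between the split types concrete, the following is a minimal sketch of how each one could assign a location to the left or right child of a tree node. The function names, the parameterization of the oblique split as a line a*x + b*y = c, and the isotropic Gaussian form are all assumptions made for illustration; the abstract does not specify the exact functional forms used in geoRF.

```python
import math

def axis_parallel_split(x, y, feature, threshold):
    """Univariate split: compares a single coordinate to a threshold."""
    value = x if feature == "x" else y
    return value <= threshold

def oblique_split(x, y, a, b, c):
    """Bivariate linear split: which side of the line a*x + b*y = c."""
    return a * x + b * y <= c

def gaussian_split(x, y, cx, cy, sigma, threshold):
    """Bivariate radial split (assumed isotropic form): a Gaussian bump
    centred at (cx, cy) is compared to a threshold, giving a round region."""
    sq_dist = (x - cx) ** 2 + (y - cy) ** 2
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) >= threshold

# Four hypothetical locations, partitioned three different ways.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
ap = [axis_parallel_split(x, y, "x", 0.5) for x, y in points]
ob = [oblique_split(x, y, 1.0, 1.0, 1.5) for x, y in points]
ga = [gaussian_split(x, y, 0.0, 0.0, 1.0, 0.7) for x, y in points]
print(ap)  # vertical boundary at x = 0.5
print(ob)  # diagonal boundary x + y = 1.5
print(ga)  # circular region around the origin
```

The axis-parallel rule can only cut along one coordinate at a time, whereas the two bivariate rules use both coordinates in a single test, which is what allows diagonal or round boundaries on a map.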









Notes
The code is available via https://github.com/margotgeerts/geoRF
Acknowledgements
This research was supported by the EC H2020 MSCA RISE NeEDS Project [Grant agreement ID: 822214].
Additional information
Responsible editor: Michelangelo Ceci.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Algorithms
Algorithm 2 presents the functions FindBestAPSplit and FindBestOBSplit used in the GeoTree algorithm. Algorithm 3 outlines the Dual Annealing algorithm used for finding geospatial splits. Similarly, Algorithm 4 shows the Global Function Search algorithm.
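The algorithms themselves are not reproduced in this text. As a rough illustration of annealing-style search for a geospatial split, the sketch below uses a plain simulated annealing loop to tune an oblique split so that the summed squared error of the two child nodes is minimized. It is a simplified stand-in and not the Dual Annealing procedure of Algorithm 3; the angle/offset parameterization, the cooling schedule, the step size, and the toy grid data are all assumptions.

```python
import math
import random

def sse(values):
    """Sum of squared errors around the mean (0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_cost(params, data):
    """Total SSE of both children for the split cos(t)*x + sin(t)*y <= c."""
    theta, c = params
    left, right = [], []
    for x, y, target in data:
        goes_left = math.cos(theta) * x + math.sin(theta) * y <= c
        (left if goes_left else right).append(target)
    return sse(left) + sse(right)

def anneal_oblique_split(data, start, n_iter=2000, temp0=1.0, seed=0):
    """Plain simulated annealing over (theta, c); returns the best split seen."""
    rng = random.Random(seed)
    current, current_cost = start, split_cost(start, data)
    best, best_cost = current, current_cost
    for i in range(n_iter):
        temp = temp0 * (1.0 - i / n_iter) + 1e-9  # linear cooling (assumed)
        candidate = (current[0] + rng.gauss(0.0, 0.3),
                     current[1] + rng.gauss(0.0, 0.3))
        cost = split_cost(candidate, data)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cost <= current_cost or rng.random() < math.exp((current_cost - cost) / temp):
            current, current_cost = candidate, cost
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost

# Toy 5x5 grid whose target changes across the diagonal x + y = 1.
data = [(i * 0.25, j * 0.25, 1.0 if i + j > 4 else 0.0)
        for i in range(5) for j in range(5)]
parent_cost = sse([t for _, _, t in data])
start = (0.0, 0.5)  # starts as the axis-parallel split x <= 0.5
start_cost = split_cost(start, data)
best, best_cost = anneal_oblique_split(data, start)
print(parent_cost, start_cost, best_cost)
```

Because the loop keeps the best split it has visited, the returned cost can never exceed that of the axis-parallel starting point. In practice a library routine such as `scipy.optimize.dual_annealing` could be passed the same `split_cost` objective with bounds on the angle and offset.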
Appendix B: Feature importances for the Melbourne and King County data sets
The impurity-based feature importances for the Melbourne data set are presented in Fig. 9. The feature importances for the King County data set are shown in Fig. 10.
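For readers unfamiliar with the measure, the sketch below shows how an impurity-based importance is commonly computed for a single tree: each split contributes its sample-weighted impurity decrease to the feature it uses, and the totals are normalized to sum to one. The node statistics are hypothetical, and crediting a bivariate split to one combined "coordinates" entry is an assumption for illustration, not necessarily how geoRF attributes it.

```python
from collections import defaultdict

def impurity_importances(nodes, n_total):
    """Sum each split's weighted impurity decrease per feature, then normalize.

    Each node is a dict with the split feature, its sample count and impurity,
    and (sample count, impurity) pairs for its left and right children.
    """
    raw = defaultdict(float)
    for node in nodes:
        n_left, imp_left = node["left"]
        n_right, imp_right = node["right"]
        decrease = (node["n"] * node["impurity"]
                    - n_left * imp_left
                    - n_right * imp_right) / n_total
        raw[node["feature"]] += decrease
    total = sum(raw.values())
    return {feature: value / total for feature, value in raw.items()}

# Hypothetical three-split tree fitted on 100 observations.
nodes = [
    {"feature": "coordinates", "n": 100, "impurity": 10.0,
     "left": (60, 4.0), "right": (40, 5.0)},
    {"feature": "area", "n": 60, "impurity": 4.0,
     "left": (30, 2.0), "right": (30, 3.0)},
    {"feature": "coordinates", "n": 40, "impurity": 5.0,
     "left": (20, 3.0), "right": (20, 3.5)},
]
importances = impurity_importances(nodes, n_total=100)
print(importances)
```

For a forest, the per-tree importances would typically be averaged over all trees.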
Appendix C: Data set descriptions
Four real estate data sets and four data sets from other geospatial domains are used in the experimental phase:
- San Diego: Table 7 contains the detailed description. In the experiments, the price is predicted based on the other variables in Table 7 and the X- and Y-coordinates.
- Melbourne: The Melbourne data set is described in Table 8. The described variables and the X- and Y-coordinates are used to predict the price.
- King County: Refer to Table 9 for a detailed description of the variables used for predicting the price in conjunction with the geographic coordinates of the King County data.
- Belgium: The Belgium data set is proprietary, but detailed information can be found in Table 10. The described variables are used as explanatory variables along with the X- and Y-coordinates to model prices.
- Election: The Election data set is described in Table 12. The election outcome ‘gop_2016’ is regressed on the other variables and the locations (X-Y).
- Elevation: Table 11 describes the Elevation data set used for the spatial interpolation task, where the elevation, ‘z’, is interpolated from other locations. A 10% random sample is taken of the original data set, resulting in 39,798 observations after removing duplicate locations.
- Air Temperature: Air temperature and precipitation variables of this data set are described in Table 13. The multivariate geospatial task consists of predicting ‘meanT’ based on ‘meanP’ and the X- and Y-coordinates.
- Clay: The Clay data set contains the target variable (‘CLYPPT’), which indicates the percentage of clay in the soil and is regressed on the measurement depth and other soil properties. More information on the distribution of these variables can be found in Table 14.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Geerts, M., vanden Broucke, S. & De Weerdt, J. GeoRF: a geospatial random forest. Data Min Knowl Disc 38, 3414–3448 (2024). https://doi.org/10.1007/s10618-024-01046-7
DOI: https://doi.org/10.1007/s10618-024-01046-7