Population spatial distribution data are critical geographic information that help governments optimize social resource allocation, environmental management, and urban development [1,2,3,4,5]. In addition, integrating population information with other data provides a scientific basis for risk assessment, emergency disaster response, and post-disaster reconstruction [6,7,8,9,10,11,12]. Traditional population data are obtained primarily through national censuses and sample surveys, typically with census divisions as the statistical units. However, these administrative boundaries do not always reflect the actual distribution of the population. In practice, such data suffer from low temporal and spatial resolution, limited support for spatial operations and analysis, and poor intuitiveness. With the support of geographic information technology, geographers have improved the study of the spatial distribution of socio-economic data by attaching explicit, detailed geographic references and quantifying the data on spatial grids. This approach, the spatialization of statistical data, stores population data in a grid format, which improves computational and storage efficiency. It also makes population statistics easier to integrate with other environmental data, supporting more accurate prediction and analysis of population-related issues. At the same time, it addresses the limitations of traditional population data noted above: it relaxes the constraints that census data impose on Earth science applications, raises the spatial resolution of population data, and represents spatial population distribution patterns more intuitively.
Traditional population spatialization methods fall into two main categories: spatial interpolation and statistical modeling. Spatial interpolation converts population data from large spatial units to smaller ones, for example through point interpolation [
13] and areal interpolation [14]. These methods make scale conversion of population data convenient but often overlook scale and boundary effects [15], which degrades their performance near region boundaries. Moreover, because they rest on simplifying assumptions, they struggle to account for spatial heterogeneity [16] and produce inaccurate estimates in heterogeneous areas. Outliers or extreme values can also distort the interpolation results. Statistical modeling methods build population spatialization models from the weight relationships between auxiliary data and the spatial distribution of the population, allowing population counts or densities to be estimated for small spatial units. Compared with spatial interpolation, statistical models can account more comprehensively for the intricate relationships between various factors and population density, but they face the challenge of fusing multi-source heterogeneous data [17]. Representative methods include multiple linear regression (MLR) [18], geographically weighted regression (GWR) [19,20], the spatial lag regression model [21], and the kriging regression model [22]. However, many statistical regression models assume linear relationships, whereas population distribution and its related factors may be non-linearly associated, so the true relationships can be modeled inaccurately. Statistical regression models also often fail to capture the complexity of geographic spatial structure, such as the differences between urban centers and suburbs, and may therefore produce results that are overly smooth in space and overlook local variation.
With the development of machine learning, scholars have applied ensemble learning and neural networks to population spatialization in order to further explore the complex relationships between multi-source geographic information and demographic statistics. Ensemble learning combines multiple learners through a given strategy so that they complete a task jointly and improve accuracy through collective decision-making; its main families are boosting and bagging algorithms. The boosting algorithm follows the principle of gradient boosting [
23]: it updates the model by feeding the information from each training round into the next, so that each new model is trained on the residuals of the previous round. On this basis, Extreme Gradient Boosting (XGBoost) produces its final output as a weighted combination of the results of all trees, effectively improving model accuracy [24]. Zhao Xin et al. [25] estimated the population distribution of Shenzhen in 2019 using five ensemble learning models, of which XGBoost achieved the best results. Bagging algorithms combine the results of multiple learners through averaging or voting to obtain a prediction [26]. As a typical bagging algorithm, random forest (RF) is widely used in population spatialization and has several advantages over other algorithms. First, its framework is flexible and stable: RF integrates the predictions of multiple decision trees, each of which learns from the data in a different way, which reduces the risk of overfitting, improves the generalization ability of the overall model, and gives higher tolerance to outliers and noise [27]. Second, population spatialization studies often involve many types of data and a large number of features, and RF handles such high-dimensional feature spaces effectively, allowing the model to take the influence of the different data into account. Third, because each decision tree is trained independently, RF is naturally parallelizable, which accelerates training and helps in processing the extensive geographic, social, and economic data involved in this study. Stevens et al. [
28] used the random forest model in regions such as Vietnam, Cambodia, and Kenya to generate high-precision gridded population data at 100 m resolution. Li et al. [29] proposed a population spatialization approach that applies the random forest method to 25 m nighttime light (NTL) data captured from the International Space Station (ISS) together with point of interest (POI) data in order to construct high-resolution urban population distribution data. Ye et al. [30] used the random forest model, integrating POI data and multi-source remote sensing data, to map China’s 2010 population data to 100 m grids. Liu et al. [31] integrated POI data with other multi-source data and used the random forest algorithm to map the population of Zhengzhou City at three scales (50 m, 300 m, and 500 m), achieving excellent spatialization results. Taking population spatialization in Beijing as an example, He et al. [32] compared RF, MLR, XGBoost, support vector machine (SVM), back propagation neural network (BPNN), and least absolute shrinkage and selection operator (LASSO) methods, and found RF superior to the others. A neural network is a mathematical model that emulates the structure and function of biological neural networks and is often used to model complex relationships between inputs and outputs. Although neural networks have been tried in population spatialization with some success, their interpretability is limited, and their generalization requires further testing and evaluation [33,34]. In addition, neural network methods need end-to-end mapping training, and population data at a fine grid scale are difficult to obtain; this paper therefore does not explore such methods in detail.
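To make the residual-based boosting idea described earlier in this section concrete, the following is a minimal sketch of gradient boosting with squared loss, not XGBoost itself and not the implementation used in any of the cited studies. It assumes a single covariate, one-split regression stumps as the weak learners, and a fixed learning rate; the function names and the toy data are hypothetical.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, value if x <= threshold, value otherwise)


def mse(y: List[float], pred: List[float]) -> float:
    """Mean squared error between targets and predictions."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


def fit_stump(x: List[float], target: List[float]) -> Stump:
    """Least-squares fit of a single-split regression stump to `target`."""
    mean = sum(target) / len(target)
    best_sse, best = float("inf"), (x[0], mean, mean)  # fallback if no split exists
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [t for xi, t in zip(x, target) if xi <= threshold]
        right = [t for xi, t in zip(x, target) if xi > threshold]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if sse < best_sse:
            best_sse, best = sse, (threshold, lv, rv)
    return best


def predict_stump(stump: Stump, xi: float) -> float:
    threshold, lv, rv = stump
    return lv if xi <= threshold else rv


def boost(x: List[float], y: List[float], n_rounds: int = 30, learning_rate: float = 0.5):
    """Each round fits a stump to the residuals left by the previous rounds."""
    base = sum(y) / len(y)          # initial model: the mean of the targets
    pred = [base] * len(y)
    stumps: List[Stump] = []
    history = [mse(y, pred)]        # training error after each round
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * predict_stump(stump, xi) for xi, pi in zip(x, pred)]
        history.append(mse(y, pred))
    return base, stumps, history


# Hypothetical toy data: one covariate and a non-linear response.
x = [float(i) for i in range(20)]
y = [(xi - 10.0) ** 2 / 10.0 for xi in x]
base, stumps, history = boost(x, y)
print(f"training MSE: {history[0]:.3f} -> {history[-1]:.3f}")
```

The final prediction is the initial mean plus the learning-rate-weighted sum of all stump outputs, which is the additive combination that boosting relies on; bagging, by contrast, would train the learners independently on resampled data and average them.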
Several gridded population datasets are publicly available, such as the Gridded Population of the World (GPW) [35], the Global Rural-Urban Mapping Project (GRUMP) [36], LandScan [37], and WorldPop [38]. These datasets are produced mainly with areal weighting, intelligent interpolation, and the random forest algorithm. Because they model population distribution at the global scale, modeling conditions vary considerably between regions, which makes it difficult to ensure accuracy in areas with complex environments [39]. Gunasekera et al. [40] found that LandScan performs well in modeling urban population distribution but is less reliable in rural areas. Sabesan et al. [41] compared the LandScan and GPWv3 datasets across many regions and found that LandScan better represents the heterogeneity of population distribution. Bai et al. [42] compared the errors of GPWv3, GRUMPv1, WorldPop, and the China Specific Population Grid (CnPop) at the township scale. Their results show that WorldPop has the highest accuracy but still has large errors in mountainous areas such as the Hengduan Mountains.
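As a point of reference for the areal weighting method named above, the sketch below shows its simplest form: the population of each source zone is distributed to the grid cells it overlaps in proportion to the overlap area. This is an illustrative assumption about the basic technique, not the production procedure of GPW or any other dataset; the function name, zone and cell identifiers, and numbers are all hypothetical.

```python
from typing import Dict, Tuple


def areal_weighting(
    zone_population: Dict[str, float],
    overlap_area: Dict[Tuple[str, str], float],
) -> Dict[str, float]:
    """Distribute each zone's population to cells in proportion to overlap area.

    `overlap_area[(zone, cell)]` is the area of the intersection of a census
    zone and a grid cell. Density is assumed uniform within each zone.
    """
    # Total overlapped area per zone, used to turn areas into shares.
    zone_area: Dict[str, float] = {}
    for (zone, _cell), area in overlap_area.items():
        zone_area[zone] = zone_area.get(zone, 0.0) + area

    # Each cell accumulates its share from every zone it intersects.
    cell_population: Dict[str, float] = {}
    for (zone, cell), area in overlap_area.items():
        share = area / zone_area[zone]
        cell_population[cell] = cell_population.get(cell, 0.0) + zone_population[zone] * share
    return cell_population


# Hypothetical example: two census zones and four grid cells; cell c3 straddles both zones.
zone_population = {"A": 1000.0, "B": 300.0}
overlap_area = {
    ("A", "c1"): 2.0,
    ("A", "c2"): 1.0,
    ("A", "c3"): 1.0,
    ("B", "c3"): 1.0,
    ("B", "c4"): 2.0,
}
cells = areal_weighting(zone_population, overlap_area)
print(cells)
```

The method preserves zone totals, but the uniform-density assumption inside each zone is exactly what limits it in heterogeneous areas, which is one motivation for weighting cells by model predictions instead of by area.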
In summary, the random forest algorithm can integrate multi-source geographic information data, effectively model the complex relationships between population data and spatial distribution indicators, and predict gridded population at various resolutions. However, existing work by Chinese and international scholars focuses primarily on multi-source data fusion in random forest modeling, remote sensing data processing, and the generation of population datasets at different resolutions. The impact of different parameter optimization methods within random forest modeling on population spatialization has received limited attention. Furthermore, the accuracy of the publicly available datasets still needs improvement in certain regions.
Therefore, this paper further enriches the data sources, improves data quality, and models population spatialization by combining multi-source remote sensing and geographic information data with the widely used random forest algorithm. The primary focus is on methods for tuning the parameters of the random forest model, with particular attention to how different parameter optimization techniques affect model accuracy. The optimal model parameters are then selected with cross-validation to improve the model’s structure and predictive accuracy. Finally, the population spatialization model is built with the optimal parameters, and a spatial population distribution dataset of Sichuan Province at 1 km resolution is generated. The resulting dataset is compared with public datasets such as GPW, LandScan, and WorldPop for verification.
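The idea of selecting parameters through cross-validation can be illustrated with a small sketch. It assumes exhaustive grid search as the optimization technique, which is only one of several possibilities and is not confirmed here as the method the paper adopts. To stay self-contained, the learner is a deliberately simple k-nearest-neighbour regressor standing in for a random forest; with a real random forest the candidates would be settings such as the number of trees or the maximum tree depth. All function names and data are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def knn_predict(train_x: List[float], train_y: List[float], x: float, n_neighbors: int) -> float:
    """Stand-in learner (not a random forest): mean target of the nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)


def k_fold_indices(n: int, n_folds: int, seed: int = 0) -> List[List[int]]:
    """Shuffle sample indices once and split them into disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def cv_rmse(x: List[float], y: List[float], n_neighbors: int, n_folds: int = 5) -> float:
    """Root mean squared error over held-out folds for one parameter value."""
    sq_err, count = 0.0, 0
    for held_out in k_fold_indices(len(x), n_folds):
        held = set(held_out)
        train_x = [x[i] for i in range(len(x)) if i not in held]
        train_y = [y[i] for i in range(len(y)) if i not in held]
        for i in held_out:
            sq_err += (knn_predict(train_x, train_y, x[i], n_neighbors) - y[i]) ** 2
            count += 1
    return math.sqrt(sq_err / count)


def grid_search(
    x: List[float], y: List[float], candidates: Sequence[int], n_folds: int = 5
) -> Tuple[int, Dict[int, float]]:
    """Evaluate every candidate with cross-validation and keep the lowest error."""
    scores = {c: cv_rmse(x, y, c, n_folds) for c in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical toy data: a noisy non-linear relationship.
rng = random.Random(42)
x = [i / 4.0 for i in range(60)]
y = [math.sin(xi) + rng.gauss(0.0, 0.2) for xi in x]

candidates = [1, 3, 5, 9, 15]
best, scores = grid_search(x, y, candidates)
print("cross-validated RMSE per candidate:", scores)
print("selected parameter:", best)
```

Because every candidate is scored on data it was not trained on, the selected value reflects generalization rather than fit to the training sample, which is the property that matters when the model is later applied to grid cells without census counts.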