3.1. Study Area
The study region of WV (
Figure 3), with an area of 62,259 km
is completely within the defined extent of the Appalachian Mountains [
23]. The terrain and topographic characteristics, and rich natural assets create a unique context for the anthropogenic activities in this region. WV has abundant physical and environmental assets. Energy extraction industries play a crucial role in the economy of WV; coal was one of the primary natural resources of the state. Since early 2000s, shale gas extraction sites have expanded in WV [
24]. The Monongahela National Forest with a land area of over 3719.06 km
is located in the south-east of WV [
25]. This state is the major water source of large rivers such as the Potomac and Ohio rivers. This state has a northern and an eastern panhandle. Jefferson county, one of the 55 counties of this state in the eastern panhandle, is in the Washington DC Metropolitan Statistical Area (Metropolitan Statistical Areas (MSA) are defined by the U.S. Office of Management and Budget (OMB) and used by the Census Bureau and other federal government agencies in the United States (US) for statistical purposes. An MSA consists of one or more counties that contain a city of 50,000 or more inhabitants, or contain a Census Bureau-defined urbanized area (UA) and have a total population of at least 100,000 (75,000 in New England) [
26].). The north central WV borders the Pittsburgh, PA, MSA. Both Pittsburgh and DC MSAs are populated regions with substantial number of organizations in the industrial and administrative sectors. Energy extraction industries play a crucial role in the economy of WV, the coal industry was one of the primary natural resources of the state. Since the early 2000s, a considerable number of shale gas extraction sites were developed in WV [
24]. WV has a unique landscape that, benefits from rich natural and cultural resources. This led to a cycle of economic boom and bust as the value and production through these resources grow and shrink [
27].
In the variables section, we describe the studied variables and the patterns of historical land development within the study area.
3.2. Variables
To study the impact of the generic and site-specific factors [
6,
29] on the process of land development we integrated globally admitted variables with the place-based variables. Place-based factors are the variables that are merely specific to a region [
6], global factors refer to the ones that are demonstrated to play an important role in inducing specific landscape transformations in any geography [
11,
30]. Using different sources of data, the feature class can be constructed. Integrating features from different local, national, and global data sources is accompanied by data formatting and management difficulties. Different geographic reference systems, resolutions, data types, standards and definitions cause such issues. Data fusion methods accommodate dealing with the complexity and heterogeneity of such data [
31].
Multi-source spatial data were applied to study the historical trends of land development. Local and federal web-based datasets were used for data acquisition (See
Table 1). The main rationale for studying the importance of these variables in the complex and dynamic process of land development was to explore the role of local transitions, along with previously examined variables such as population and economic status [
11,
30]. On the other hand, studying the multiple hypothetically interacting variables allowed testing the multicollinearity of the variables. Upon identifying the linear relationship between these variables we could examine the application of other models to study the drivers of land development. Within the context of this study, socio-economic, spatial and policy related variables of urban and rural land development create a dynamic and complex system of changes [
29,
32]. We also used distance to an existing land development as a human made physical variable. A spatially explicit model was applied, meaning that we used distance, density, and data interpolation to construct the input data. We used census population data at the census tract level in each decade. The economic status was represented using the County Economic Status Index proposed by the Appalachian Research Commission (ARC) [
23]. This index is a linear factor of household poverty rate, per capita income, and unemployment rate. Economic index is defined and applied by ARC to indicate the economic status of this region. We used ARC’s method to compute the economic index (E.I.). In
Table 1 the explanatory variables that we used, the input data, and the data source are listed. We verified that the data was geo-referenced and co-registered.
In
Table 2 the summary statistics of the variables are included, this summary statistics is based on the variables’ values before pre-processing. The variables in this work are integrated so the information is scaled, geo-referenced, and co-registered, and the data layers are temporally and spatially synchronized. The pre-processing steps of variables included scaling and interpolation. Scaling the variables helps in making an explicit analysis and improving the stability and performance of the regression models. We scaled the variables by conducting a min-max normalization, where the minimum value of each feature is subtracted and the result is divided by the range. We used the inverse distance values; so the closer the features are to active mining sites, shale gas wells, metropolitan areas, and other forms of development the larger the value of that variable is.
The data of the historical land development in WV is acquired from [
35]. Thirty meters resolution Landsat images were utilized to obtain these images. We used six spectral bands from red, green, blue, near-infrared, and short wave near infra-red sensors for both TM5 and TM8 satellite images.These images are collected in the ±1-year interval of the target year. A pairwise analysis from each scene is conducted on the normalized dataset, any pair is analyzed independently. A total number of 10 scenes cover the entire state of WV, a hybrid algorithm was used in which data transformation and band differencing were deployed. Using this algorithm, a new feature class of the Landsat satellite images was constructed, and an unsupervised machine learning model was applied to group the cells in the reconstructed feature class. [
35] labeled the grouped data of land cover using k-means algorithm. Historical data of google earth was utilized as the ground truth to validate the results [
35]. The output of this study provides the data of historical trends of land development for each decade of 1985−1995, 1995−2005, and 2005−2015 (
Figure 4).
The data of
Figure 4 is applied for hot spot analysis of land development [
36]. Hot spot analysis indicates the dynamics and patterns of land development. This method is used to identify statistically significant spatial clusters of new land developments [
36]. Through this analysis the patterns of the new land developments have been identified. The hot spots (
Figure 5) are the regions wherein there are high clusters-high density new developments. Cold spots are the region in which the clusters of low density new developments were formed.
Spatial auto-correlation is used to test for statistically significant spatial auto-correlation in the geographic events [
8]. To find the distance at which the clusters of new land developments were more intense a spatial auto-correlation analysis was conducted. Within this distance, the density of events was compared to a complete random pattern of new land developments and the
statistic was computed for each new development event (Equation (
4)). These values represent a
z score per feature in the study area [
8].
where,
is the number of collected events in one
,
is the spatial weight between feature
j and
i,
n is the total number of collected event points. We used Inverse Distance Weighting (IDW) to acquire the
values, IDW computes the weights based on the assumption that the near features are more related than the distance ones (Tobler’s First Law of Geography [
37]).
and
S are formulated in Equations (
5) and (
6):
IDW was applied to interpolate the
z score of the
statistic of each event point to the cells in the region. This technique provides a good understanding of the new development patterns by assessing both density and the extent of interaction between the events [
38]. The interpolated values were used as the dependent variables in this study.
Figure 5 illustrates the interpolated
z score of the
statistic, positive
z score values show clustered high-density new land developments and negative
z scores show clusters of low-density new land developments. High
z scores mainly show up around the major cities and low clusters represent scattered developments in the rural areas.
3.3. Assumption Test and Modeling
As a requirement for the modeling, we performed an assumption test (discussed in
Section 2.2). Through this test we studied the
(Equation (
1)) and
values to find out if there is a multicollinearity among the variables. As suggested by [
21] a
of greater than 10 represents a moderate to severe degree of collinearity. They also imply that researchers have variety of criteria for a high
, however, the most common threshold is a
of 0.50 or greater for two or more variables associated with a high
[
21]. Hence, to detect the multicollinearity of the variables, tolerance value of 10 for
and 0.5 for the
were assigned. The decomposition values which are above the threshold and have the
of more than 10 represent the multicollinearity.
Figure 6 shows the presence of global multicollinearity in all three studied time-steps. In this figure we can observe that in any of the global regression models there are multiple variables that present a multicollinearity with other ones.
We used the same method for multicollinearity diagnostics in the local OLS regressions (
Figure 7). Considering that local regression analysis encompasses multiple OLS models, in
Figure 7 maximum value of
for each regression model is represented and the
values that are above the threshold indicate presence of multicollinearity in the model. This analysis confirms presence of local multicollinearity between the variables in all time-steps (
Figure 7). To avoid the drawbacks of local and global multicollinearity in regression models (discussed in
Section 2.3), ridge regression was utilized. We used a 10-fold cross validation to compute the value of model hyperparameter in both ridge regression and GWRR models.
The optimum bandwidth value for the GWRR model was computed using an exponential distance kernel (See
Table 3). Our inferences are based on the GWRR coefficients of each variable and we did not compare the results for each variable with another. The range of coefficients among the variables is considerably different, this variance is because of significance of one variable over another, high dynamics of variables across the region and over time, and the model fitting process in the GWRR.
3.4. Model Evaluation
Coefficient of determination, denoted as
(Equation (
7)) is widely used for evaluating regression model performance [
15].
where,
is the actual value of
z score per event,
is the mean value of the
z scores and
is the predicted value of the
z score for new developments. Coefficient of Determination is one of the measures of model accuracy that we used in this study. We used other statistics to obtain insights on the model performance.
t and p values of each variable are utilized to evaluate the significance of coefficients in the ridge regression models. In addition, the scaled estimates of each parameter, which points to its proportional significance, is also computed. In our local analysis of the variables, each event has its own summary statistics. Therefore, we found pictorializing an innovative method to analyze and evaluate the local regression models. The difference of estimated and actual z score values of each event point is depicted to show how the error of local regression models are distributed in the study area. Moreover, the parameters of the local regressions are represented in the geographic format. The coefficient values of each event point were interpolated using IDW operation to generate such raster graphics.
IDW method is used in studying the spatial patterns of land development (see
Section 3.2)), making visual inference on GWRR coefficient results (see
Section 4.2), and on visualizing GWRR coefficient results (see
Section 4.2). To validate the IDW results, we used a 10% random sample and excluded them form the IDW analysis and conducted the interpolation. Then the interpolated values at the validation subset were compared to the actual values. The residuals of the IDW were computed using Mean Squared Error (MSE), we used this method to identify optimum search radius and power of inverse distance weighting. These values are computed based on the minimum MSE.
In addition, we investigated the geography of the model residuals (Equation (
8)). Through this study the residuals of GWRR models are depicted in the study area.
where,
is the Squared Error value,
is the estimated value for the
event and
is its actual
z score value.
Moreover, providing zonal references for an expressive discussion is extremely helpful to discussing the model output. Therefore, we deployed a visual representation of the zones of development based on four zones of development of lower rural, rural, transitional and populated/urbanized zones. We discuss the process of identifying the regions of development in
Section 3.4.1.
3.4.1. Zones of Development
Zones of development are identified according to the density of development and population in each census tract. The classes of this data are: populated/urbanized, transitional area, rural, and lower rural. Clustering of the tracts is performed using the k-means method. The urbanized/populated areas point to the regions with higher density of development and populated places. The areas denoted as the transition areas, are either the immediate areas surrounding the urban areas in which the population density was higher than the rural area, or some areas that are initial cores for dense population settlements in the future development. Rural areas and lower rural areas are mainly regions with scattered developed lands and dispersed population, the level of sparsity varies between rural areas and lower rural regions. After clustering, we applied Multivariate Analysis of Variance (MANOVA) to test the significance level of the difference between the groups (See
Table 4). The homogeneity of the variables across the grouped data was then tested, the results of this test imply that all the clusters are significantly different from each other.
Figure 8 illustrates zones of development, according to this analysis, the area of urban and transitional zones in the three target time periods of the study was less than 15% of the state. By overlaying the zones of development on top of the hot-spot analysis of new developments, we can observe that the hot spots are formed in the urban and transitional regions. We basically use the zones of development to make visual inferences on the results of geographically weighted regression models.