The Effect of Aggregation On Univariate Statistics
Summary
The resistance of the Modifiable Area Unit Problem to analytical solution requires that it be
investigated by numerical and empirical studies that have the potential to lay the foundations for
analytical approaches. The use of synthetic spatial datasets, in which the spatial autocorrelation,
mean, and variance of individual variables and the Pearson correlation between variables can be
controlled, greatly enhances the ability of the analyst to study the MAUP in this manner. This
chapter explores the effects of spatial aggregation on the variance and three univariate spatial
autocorrelation statistics using a synthetic 400-region dataset. The relationship between the
relative change in variance and a modified version of the G statistic that was first proposed by
Amrhein and Reynolds (1996, 1997) is explored in more detail. These results compare favourably with
results generated from the Lancashire dataset of Amrhein and Reynolds (1996).
Introduction
The Modifiable Area Unit Problem (MAUP) has been the focus of research interest for many years,
with the current resurgence in interest being initiated by Openshaw and Taylor (1979) and fueled by the
rapidly increasing computing power available to analysts. It is well known that the application of
statistical results derived from one level of spatial resolution to a higher resolution (such as census
tract data being used to predict individual household information) can result
in serious errors; this all too common error has been named the ecological fallacy. An ancillary effect of
the enhanced computing power is the proliferation of Geographical Information Systems (GIS) and
other spatial analysis tools. As the MAUP has been either ignored or written off as intractable in
many research results, it can be expected to get short shrift from users of this software who are unaware
of the subtleties of spatial data analysis. The importance of gaining an understanding of the MAUP
and how it can be taken into account in GIS software to reduce the number of flawed analyses and
their possibly expensive repercussions cannot be overstated.
Theoretical work, such as that by Arbia (1989), has shown that an analytical solution is possible, but
under restrictive conditions that would seldom be found in real life situations. As a result, research into
the MAUP has been primarily empirical, focusing on the effects of aggregation on various statistics
computed from a specific dataset. For example, Openshaw and Taylor (1979) examine correlation
coefficients using an Iowa electoral dataset; Fotheringham and Wong (1991) study multiple regression
parameters using Buffalo census data; and Amrhein and Reynolds (1996), one of the papers in the special
issue of Geographical Systems that focuses on the MAUP, and Amrhein and Reynolds (1997) study the
effects of data aggregation on univariate statistics and make a tentative link between a spatial statistic
and the relative change in variance. Recognition of spatial patterns is a fundamental requirement for
landscape ecology, and various spatial autocorrelation statistics, such as the Moran Coefficient, are
often employed as a tool for this task (Jelinski and Wu, 1996; Qi and Wu, 1996); hence it is important
to know how spatial statistics are affected by aggregation as well.
The use of synthetic spatial datasets overcomes the difficulties inherent in publicly available sets, with
census data being the prime example. Possible errors in the data notwithstanding, the greatest
frustration for researchers into the MAUP is that one has no control over the values of spatial
autocorrelation, means, variances, or Pearson correlations between variables; one must work with the
data at hand. Amrhein (1995) is the first to use synthetic datasets in the study of the MAUP by locating
points randomly within a unit square, assigning them random values, imposing various sized square
grids, and aggregating the points within each square. This chapter extends this approach by employing
more sophisticated synthetic datasets to explore the effects of spatial data aggregation on the weighted
variance and on three commonly used spatial autocorrelation statistics: the Moran Coefficient, the
Geary Ratio, and the Getis (G) statistic. The following sections discuss the method of analysis, the
results, and the conclusions.
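The point-based approach attributed to Amrhein (1995) above can be sketched in a few lines. This is only an illustration under stated assumptions, not a reproduction of that study: the function name aggregate_points, the number of points, the Gaussian values, the grid sizes, and the choice of the cell mean as the aggregate are all hypothetical.

```python
import random
from statistics import fmean, pvariance


def aggregate_points(points, k):
    """Average point values within each cell of a k-by-k grid on the unit square.

    `points` is a list of (x, y, value) tuples with 0 <= x, y <= 1.
    Returns a dict mapping (row, col) to the mean value of the points in that
    cell; cells that contain no points are omitted.
    """
    cells = {}
    for x, y, value in points:
        # min() keeps a coordinate of exactly 1.0 inside the last row or column.
        col = min(int(x * k), k - 1)
        row = min(int(y * k), k - 1)
        cells.setdefault((row, col), []).append(value)
    return {cell: fmean(values) for cell, values in cells.items()}


# Assumed setup: 2000 random locations with independent standard normal values.
rng = random.Random(0)  # fixed seed so repeated runs give the same points
points = [(rng.random(), rng.random(), rng.gauss(0.0, 1.0)) for _ in range(2000)]

# Impose square grids of several sizes and report the variance of the cell means.
for k in (2, 5, 10):
    means = aggregate_points(points, k)
    print(k, len(means), round(pvariance(list(means.values())), 4))
```

Because the values in this sketch are spatially random, it does not offer the control over spatial autocorrelation that the synthetic datasets used in this chapter provide.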
Results
The effects of aggregation on the variance
Aggregated variance that is weighted by the number of regions ni in the M aggregated cells. A value of
RCV near one (as in the first group of lines in Figure 4.2a) means that the aggregated weighted
variance is much closer to zero than the original variance, while a value near zero (as in the last group
of lines in Figure 4.2a) means that the new variance is very similar to the original. It can be shown that
the variance of a spatially located variable can be partitioned into the sum of variances within various
sub-regions and the variance of the average values of all the subregions (see Section 5.3 and
Moellering and Tobler, 1973). The process of aggregation removes the former, so the more spatially
homogeneous (i.e. positively autocorrelated) a variable is, the smaller the variance within each cell will
be (on average) and hence the less variance is lost. As the number of aggregate cells decreases (i.e.
fewer, larger cells), the loss in variance increases, since a greater number of values are
being lost. Both of these patterns are well demonstrated in Figure 4.2a. As the number of aggregate
cells decreases, the number of regions per cell increases on average, since the aggregation algorithm
attempts to have similar numbers of regions per cell, but does not strictly enforce this ideal. When
significantly positively autocorrelated variables are aggregated, increasing the number of regions per
cell increases the likelihood that more widely differing values will be included in each cell, so one
would expect the variability of possible aggregate variance values to increase with a decrease in the
numbers of cells. With negatively or near-randomly autocorrelated variables, however, the tendency
towards the juxtaposition of widely differing values means that as the number of regions per cell
increases, the opportunity for variation in the aggregate variance values will tend to remain the same or
decrease. Both of these patterns are demonstrated in Figure 4.2a. When variables of the same MC but
different variances were aggregated, it was found that the variance of the original variable had no
discernible impact upon the distributions of the RCV (not shown). Only the spatial organization of
the variable plays a major role in the new variance.
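The variance partition described above, and its link to the RCV, can be illustrated with a small sketch. The function names are hypothetical, population (divide-by-N) variances are assumed, and the RCV is assumed to take the form (original variance minus weighted aggregated variance) divided by the original variance. That form is consistent with the interpretation of values near one and near zero given above, but it is not spelled out in this section.

```python
from statistics import fmean


def partition_variance(cells):
    """Split the population variance of all values into within- and between-cell parts.

    `cells` is a list of lists; each inner list holds the values of the regions
    merged into one aggregate cell. Returns (total, within, between).
    """
    values = [v for cell in cells for v in cell]
    n = len(values)
    grand_mean = fmean(values)
    total = sum((v - grand_mean) ** 2 for v in values) / n
    # Between-cell part: squared deviations of cell means, weighted by region count n_i.
    between = sum(len(c) * (fmean(c) - grand_mean) ** 2 for c in cells) / n
    # Within-cell part: the variation discarded when each cell is replaced by its mean.
    within = sum((v - fmean(c)) ** 2 for c in cells for v in c) / n
    return total, within, between


def relative_change_in_variance(cells):
    """Assumed form of the RCV: (original - weighted aggregated variance) / original."""
    total, _, between = partition_variance(cells)
    return (total - between) / total


# The same eight values under two zonings of two cells each.
smooth = [[1, 1, 2, 2], [9, 9, 10, 10]]  # similar values share a cell
mixed = [[1, 9, 2, 10], [1, 9, 2, 10]]   # dissimilar values share a cell

for name, cells in (("smooth", smooth), ("mixed", mixed)):
    total, within, between = partition_variance(cells)
    print(name, total, within, between, relative_change_in_variance(cells))
```

In this toy case the zoning that groups similar values loses very little variance (RCV close to zero), while the zoning that mixes dissimilar values loses all of it (RCV of one), mirroring the contrast between positively and negatively autocorrelated variables.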