Machine Learning Models For Geospatial Data
ABSTRACT
This chapter presents an introduction to machine learning models/algorithms and their potential applications to geospatial data. The main attention is paid to widely used models based on artificial neural networks (multilayer perceptron, general regression neural networks, self-organizing maps) and on statistical learning theory (support vector machines). The main ideas are illustrated with real-data case studies: spatial classification and spatial prediction/mapping (including automatic algorithms), nonlinear dimensionality reduction and visualization of high-dimensional multivariate socio-economic data, and the treatment and classification of remote sensing images by machine learning.
KEYWORDS
Machine learning algorithms, Geospatial data, Mapping and classification,
Dimensionality reduction, Remote sensing
INTRODUCTION
Machine learning (ML), in a general framework, can be considered as a
subfield of artificial intelligence that is concerned with the design,
development, and application of algorithms and techniques that allow
computers to learn from data. Machine learning has a close connection
with statistics (especially nonparametric and computational statistics) and
theoretical computer science. Since the middle of the twentieth century machine learning has evolved from the imitation of a simple neuron and artificial neural networks into a solid interdisciplinary field of basic and applied research with an important influence on many topics: pattern
recognition, bio-computing, speech recognition, financial applications,
analysis and modeling of high dimensional and multivariate geo- and
environmental spatio-temporal data, etc. (Agarwal & Skupin, 2008;
Cherkassky & Mulier, 2007; Hastie et al., 2009; Izenman 2008; Kanevski,
2008; Openshaw & Openshaw, 1997; Vapnik, 1998).
In recent years there has been an explosive growth in the development of
adaptive and data-driven approaches. Among the successful and widely used ML models, artificial neural networks (ANN) of different architectures and support vector machines (SVM) have attracted great attention. Both have demonstrated important and successful applications to geospatial data
modeling tasks: spatial predictions (classification and mapping); natural
hazards and environmental risk assessments; renewable resources
estimates; analysis, modeling and visualization of multivariate socio-
economic data; environmental time series predictions; hydroinformatics;
treatment and classification of remote sensing images, assimilation of data
and science based models; etc. (see references below).
The key feature of ML models/algorithms is that they learn from data and can be used in cases when the modeled phenomenon is not well described, which is the case in many applications involving geospatial data.
Machine learning models are adaptive tools, which at present are widely
used to solve prediction, characterization, optimization and many other
problems.
There exist many kinds of ANN to be used for different problems and
cases. Among the most common in geo- and environmental sciences let
us mention multilayer perceptron (MLP), radial basis function (RBF)
networks, general regression neural networks (GRNN), probabilistic neural
networks, Kohonen networks (self-organizing maps, SOM) (Agarwal &
Skupin, 2008; Cherkassky & Mulier, 2007; Hastie et al., 2009; Izenman
2008; Openshaw & Openshaw, 1997; Haykin, 2009).
SVMs build robust, nonlinear data models with excellent generalization abilities, which is very important both for monitoring and forecasting. SVMs use only the support vectors (a subset of the measurement data points) to derive decision boundaries. They open a way to sampling optimization, estimation of noise in data, quantification of data redundancy, etc. A more detailed presentation of SVM applications to spatially distributed environmental data is given in Kanevski & Maignan (2004), Kanevski (2008) and Kanevski et al. (2009).
For example, geostatistics widely uses variography, the analysis of anisotropic spatial correlations, in order to detect and characterize spatial patterns/structures.
Figure: Induction, deduction and transduction: a dependency F(x,y) is induced from the training samples (xi, yi) and then used deductively to predict new cases (xnew, ynew); transduction predicts the new values directly from the training samples.
In almost all real-life case studies the introduction of a statistical model for the data is non-trivial, because usually only one realization of the phenomenon is available: spatial data on pollution, time series of monitoring data, soil and land-use data, etc. Statistical treatment of data can still be introduced in this case, but under some hypotheses and assumptions. Therefore there are important hypotheses and assumptions that have to be checked (usually not a trivial task!) and accepted in order to make statistical (machine learning) inference based on one realization of the phenomenon under study: the i.i.d. (independent and identically distributed) character of the data; ergodicity (loosely speaking, the convergence of averaging over space to averaging over realizations); and spatial or temporal stationarity, i.e. the absence of trends, when the important parameters of the model do not change in space/time. In the case of geospatial data, spatial clustering is an important topic which complicates both the treatment of data (representativity of the data) and the interpretation of the results. The problem of non-stationarity (spatial or temporal) can be partly overcome by using locally adaptive models. As in statistics, these topics are also important in machine learning data treatment and modeling.
Figure: Supervised learning scheme: a supervisor evaluates the response of the machine learning algorithm and the result is used to introduce modifications to the ML model.
Figure: Learning scheme without a supervisor: training examples are presented to the machine learning algorithm, and its response is used to introduce modifications to the ML model.
Risk minimization
An important fundamental question is how to describe the quality of
learning, i.e. the description of the similarity/dissimilarity between data and
a ML model. This quantification is important both during learning and
prediction phases.
Consider the expected value of the loss, given by the risk functional

$$R(\alpha) = \int L\big(y, f(x,\alpha)\big)\, p(x, y)\, dx\, dy \qquad (1)$$

where L(y, f(x,α)) is the loss function and p(x, y) is the joint input-output probability density.
The goal is to find the function f(x,α0) which minimizes the risk in the
situation where the joint probability distribution function (pdf) is unknown
and the only available information is contained in the training set.
For classification problems the loss is the misclassification indicator

$$L\big(y, f(x,\alpha)\big) = \begin{cases} 0, & \text{if } y = f(x,\alpha) \\ 1, & \text{if } y \neq f(x,\alpha) \end{cases} \qquad (2)$$

for regression it is the squared error

$$L\big(y, f(x,\alpha)\big) = \big(y - f(x,\alpha)\big)^2 \qquad (3)$$

and for density estimation it is the negative log-likelihood

$$L\big(p(x,\alpha)\big) = -\log p(x,\alpha) \qquad (4)$$
The criteria presented above are very general. Unfortunately the joint
input-output distribution function is not known. Moreover, only a finite
number of data measurements (N training data) is available. Therefore
most training algorithms for learning machines implement Empirical Risk
Minimisation (ERM), i.e. they minimize the empirical error
$$R_{emp}(\alpha) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i, \alpha)\big) \qquad (5)$$
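As a brief illustration of the empirical risk (5), the following Python sketch (our own example, not part of the original chapter) computes the empirical error of a candidate model on a small hypothetical training set with the losses (2) and (3):

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """Empirical risk (5): average loss of model f over the training set."""
    predictions = np.array([f(x) for x in X])
    return np.mean(loss(y, predictions))

# Loss functions from Eqs. (2) and (3)
zero_one_loss = lambda y, p: (y != p).astype(float)      # classification
squared_loss  = lambda y, p: (y - p) ** 2                 # regression

# Hypothetical 1D regression example: f is a simple linear candidate model f(x, alpha)
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.8])
f = lambda x: 1.0 * x

print(empirical_risk(f, X, y, squared_loss))
```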
The structural risk minimization (SRM) principle (Cherkassky & Mulier, 2007) is illustrated in Figure 5. The X-axis corresponds to the complexity of the model and the Y-axis to the error. According to the SRM principle the prediction error is a sum of the training error (empirical risk) and a complexity term which penalizes overly complex models. In this way a bound on the prediction error can be derived which gives an upper limit. This limit does not depend on the distribution of the data and is therefore rather pessimistic (too high). In reality, for particular data the limits are lower and can be estimated by splitting the data or by using cross-validation techniques.
Figure 5: The SRM principle: the bound on the prediction error is the sum of the training error and the complexity term, plotted as a function of model complexity.
A good model should also perform well on data not used for modeling purposes. This means that the developed ML model has learned only the structured information and has ignored the noise present in the data; in this case over-fitting is avoided.
Bias-Variance Dilemma
Let us consider a regression/mapping problem as an example. In general, we can assume that the data can be decomposed into an unknown function and noise:

$$Y(X) = f(X, \alpha) + \varepsilon, \quad \text{where} \quad E(\varepsilon) = 0, \quad \mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2 \qquad (7)$$
One of the most serious problems that arises in connection with learning by neural networks is over-fitting of the provided training examples. This means that the learned function fits the training data very closely and yet does not generalize well, that is, it cannot model sufficiently well unseen data from the same phenomenon.
Solution: Balance the statistical bias and statistical variance when doing
neural network learning in order to achieve the smallest average
generalization error.
The following two processes are important in making a decision about the
quality of the model and its generalization ability (Hastie et al., 2009):
A. Model Selection is the process of "estimating the performance of different models in order to choose the (approximate) best one".
B. Model Assessment is the process of estimating the prediction (generalization) error of the finally chosen model on new, unseen data.
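A minimal sketch of model selection by a train/validation split is given below; the synthetic one-dimensional data and the family of polynomial models are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1D data: smooth function plus noise (assumption for the example)
x = rng.uniform(-1, 1, 200)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(200)

# Split into training and validation subsets
x_tr, y_tr = x[:140], y[:140]
x_va, y_va = x[140:], y[140:]

# Model selection: compare polynomial models of increasing complexity
for degree in range(1, 12):
    coeffs = np.polyfit(x_tr, y_tr, degree)            # fit on training data
    mse_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    mse_va = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    print(f"degree={degree:2d}  train MSE={mse_tr:.3f}  validation MSE={mse_va:.3f}")

# The training error keeps decreasing with complexity, while the validation
# error passes through a minimum: the degree with the lowest validation error
# is selected (model selection); its error on a separate test set would then
# estimate the generalization error (model assessment).
```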
UNSUPERVISED LEARNING
Unsupervised techniques are aimed at the exploratory analysis of data. They provide insight into the structures and dependencies hidden in the datasets. This is achieved by finding a simplified representation of the dataset that is useful for visualization, feature extraction, or descriptive analysis purposes. The main problems encountered here are clustering and dimensionality reduction (also called embedding).
CLUSTERING
Clustering can be defined as partitioning the dataset into subsets of typical entries such that the samples in each subset share some common characteristics. The commonness is implied by pre-defined similarities between data samples, and usually requires one to define a problem-specific distance measure in the input space of features. The general types of similarity measures used to compare data samples distinguish the typical groups of unsupervised approaches.
K-means algorithm
Probably the most popular clustering algorithm is known as k-means.
Given the dataset {x1, ..., xN}, it operates as follows:
1. Initialize the k cluster centers (for example, as randomly chosen data samples).
2. Assign every sample to the nearest center.
3. Update every center as the mean of the samples assigned to it.
The last two steps are iterated until convergence. As the centers are updated using all the data at once, this version of the algorithm is known as batch k-means, as opposed to online or stochastic k-means, where the update is done by randomly iterating through the samples of the dataset.
The stochastic k-means can be faster and less sensitive to the initialization of the centers. K-means methods aim at minimizing the intra-cluster variance and perform well if the data form distinctive "clouds" and there is no significant correlation between the input features (Figure 7).
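A minimal sketch of the batch k-means iteration described above is given below (our own illustration on hypothetical two-dimensional data):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal batch k-means: alternate assignment and center update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1: initialization
    for _ in range(n_iter):
        # step 2: assign every sample to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move every center to the mean of its assigned samples
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical 2D example with three "clouds"
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in ([0, 0], [3, 0], [0, 3])])
labels, centers = kmeans(X, k=3)
```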
Self-organizing maps
Self-organizing maps (also known as Kohonen maps (Kohonen, 2000))
are a popular method extending the functionality of k-means. SOM places the centers (also called units or neurons) in the data space so as to "cover" the data points, fit the topology of the dataset and present it on a two-dimensional map. That is, the centers are not drawn and fitted independently as in the case of k-means, but are organized in a two-dimensional map. There exist two common designs of this map: rectangular (every unit has four neighbors) and hexagonal (every unit has six neighbors), shown in Figure 8.
Figure 8: Two different SOM structures with cells in the map space: rectangular (a), and
hexagonal (b). The nearest cells (4 for rectangular and 6 for hexagonal, except borders
and corners) are connected by the edges
189
At each training step the best-matching unit (the center closest to the presented data sample) is updated together with the centers in its surroundings on the map. The closeness on the map is defined by the neighborhood function, which is responsible for the "cooperation" of the centers and their self-organization. This neighborhood influence is gradually decreased through the iteration epochs. The last epochs of SOM adaptation, with no cooperation between the centers, are the same as simple k-means.
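The following sketch (an illustrative implementation under our own assumptions, not the chapter's code) trains a small rectangular SOM online: for every presented sample the best-matching unit is found and all units are moved towards the sample with a strength given by a Gaussian neighborhood function whose radius, together with the learning rate, decays over the epochs; with a vanishing radius the update reduces to stochastic k-means:

```python
import numpy as np

def train_som(X, rows=5, cols=5, n_epochs=20, lr0=0.5, radius0=2.0, seed=0):
    """Minimal online SOM on a rectangular map (rows x cols units)."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    weights = rng.normal(size=(rows * cols, dim))               # unit codebook vectors
    # Fixed positions of the units on the 2D map grid
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for epoch in range(n_epochs):
        lr = lr0 * (1 - epoch / n_epochs)                       # decaying learning rate
        radius = max(radius0 * (1 - epoch / n_epochs), 1e-3)    # decaying neighborhood radius
        for x in X[rng.permutation(len(X))]:
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))   # best-matching unit
            d_map = np.linalg.norm(grid - grid[bmu], axis=1)       # distance on the map
            h = np.exp(-d_map ** 2 / (2 * radius ** 2))            # neighborhood function
            weights += lr * h[:, None] * (x - weights)             # cooperative update
    return weights, grid

# Hypothetical usage on 2D data
X = np.random.default_rng(2).normal(size=(300, 2))
weights, grid = train_som(X)
```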
Spectral clustering
The method of spectral clustering originates from a graph-based
perspective of the clustering problem (see e.g. Hagen & Kahng, 1992; Ng
et al., 2001; Shi & Malik, 2000). Spectral clustering considers the data
samples as the nodes of a weighted graph and applies the methods of
spectral graph theory to study its inner structure (Figure 9). To describe the affinity between the nodes of the graph, the edges connecting the i-th and j-th nodes (the data samples xi and xj) are assigned weights wij which form the matrix W. To separate the graph into clusters one has to find the "cut" that minimizes the sum of the weights that have to be removed in order to split the graph. If the weights are attributed according to a distance measure between the data samples xi and xj, this leads to a clustering of
data in the input space. Two common approaches are the n-nearest
neighbor one, where wij=1 iff the i-th and j-th samples are amongst the n-
nearest neighbors of each other, and the one which uses the Gaussian
RBF function of the distance between samples as the value for the weight.
The last case implies a parameter to be defined by a user, that is, the
width of the Gaussian. The way one attributes the weights can also
account for some problem-specific knowledge.
To approach the problem of finding the cut of the graph one needs to analyze the matrix known as the graph Laplacian. It is defined as L = D - W, where D is the diagonal matrix whose elements are the column-wise sums of the elements of W, dii = Σj wij. The normalized graph Laplacian D^{-1/2} W D^{-1/2} is often used as well. The foundations of spectral clustering lie in
the fact that the eigenspace of the (normalized) graph Laplacian has a
particular well-defined structure related to the number of the connected
components of the graph. Or, intuitively, if the Laplacian matrix is
essentially block-diagonal, it can be easily detected from its eigenspace.
While there are many possible approaches that implement this idea, the most popular formulation of spectral clustering is as follows (Ng et al., 2001):
• Form the affinity matrix W and compute the (normalized) graph Laplacian L;
• Solve the eigenvalue problem of finding the {λ, f} such that Lf = λf;
• Build a low-dimensional representation of the samples from the leading eigenvectors (with normalized rows);
• Cluster this representation with a conventional k-means.
This pipeline is sketched in the code below.
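The sketch is an illustration under our own assumptions (a Gaussian RBF affinity, the normalized matrix D^{-1/2} W D^{-1/2} described above, and scikit-learn's KMeans for the final clustering step), not a reference implementation:

```python
import numpy as np
from sklearn.cluster import KMeans   # used only for the final clustering step

def spectral_clustering(X, k, sigma=1.0):
    """Sketch of spectral clustering with a Gaussian RBF affinity."""
    # Affinity matrix W (RBF of pairwise distances, zero diagonal)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Normalized matrix D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = D_inv_sqrt @ W @ D_inv_sqrt
    # Leading k eigenvectors form the new representation of the samples
    eigvals, eigvecs = np.linalg.eigh(L_norm)
    U = eigvecs[:, -k:]
    # Normalize the rows and cluster them with conventional k-means
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Hypothetical usage: a ring and a blob that plain k-means cannot separate
rng = np.random.default_rng(3)
t = rng.uniform(0, 2 * np.pi, 200)
ring = np.c_[np.cos(t), np.sin(t)] * 3 + rng.normal(scale=0.1, size=(200, 2))
blob = rng.normal(scale=0.3, size=(200, 2))
labels = spectral_clustering(np.vstack([ring, blob]), k=2, sigma=0.5)
```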
CLUSTERS IN SOCIO-ECONOMIC DATA
To illustrate the use of clustering methods, let us explore the socio-
economic data on the region of Lausanne, Switzerland. The dataset was
obtained from the population census and includes about 250 different
entries defining the social, cultural (mother tongue, nationality, etc.) and
economic (employment rate, household type, etc.) characteristics of the
population. The data are spatially aggregated from the regular grid cells of
100x100 meters covering the populated area of the region. There are a
total of 3359 samples. Population density, which is one of the most
important input features, is presented in Figure 10.
Figure 10: Population density in the region of Lausanne. The values are normalized to the
maximum value of density in the region
Figure 13: Clusters obtained with spectral clustering
DIMENSIONALITY REDUCTION
Dimensionality reduction usually serves two goals: first, to produce a low-dimensional representation of the data in order to visualize them, and, second, to extract a small number of features for further analysis. Here one distinguishes between feature selection and feature extraction, where the latter, rather than selecting already existing variables, constructs a linear or nonlinear combination of the input variables which best suits the problem at hand. An example of a linear method is the well-known principal component analysis (PCA). There is also a popular and rapidly growing domain of modern nonlinear dimensionality reduction methods known as manifold learning (Lee & Verleysen, 2007).
PCA makes it possible to reduce the dimensionality of data that include linearly correlated inputs. For visualization, the first two principal components span the projection plane providing the most informative (in the sense of variability) viewpoint on the original dataset. The amount of explained variance can be computed to assist in the choice of the number of components.
Figure 14: Principal components form a new orthogonal coordinate system, with the first components spanning the directions of maximum variance of the data
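A minimal PCA sketch via the singular value decomposition is shown below (our own illustration; the random data matrix stands in for a standardized table of socio-economic variables):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD: returns the projections and the explained variance ratio."""
    Xc = X - X.mean(axis=0)                      # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # directions of maximum variance
    scores = Xc @ components.T                   # coordinates in the new system
    explained = (s ** 2) / np.sum(s ** 2)        # variance accounted for by each axis
    return scores, components, explained[:n_components]

# Hypothetical usage on a standardized data matrix (samples x variables)
X = np.random.default_rng(4).normal(size=(500, 10))
scores, components, explained = pca(X, n_components=2)
print("explained variance ratio:", explained)
```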
Laplacian Eigenmaps
The large variance along a straight line is not necessarily the main characteristic of interest when analyzing complex structures in the data. A linear projection cannot unfold non-linear structures such as the "Swiss roll", or even simpler ones as shown in Figure 15. It is the local relationships between data samples that may help
discover these complex structures. Laplacian eigenmaps (Belkin & Niyogi,
2003) are one particular approach of manifold learning aimed at
preserving the local neighborhood relations between the data samples.
Here we briefly name the other methods of descriptive manifold learning:
locally linear embedding (Roweis & Saul, 2000), ISOMAP (Tenenbaum &
De Silva, 2000), maximum variance unfolding (Weinberger & Saul, 2005).
The algorithm proceeds as follows:
• Form the affinity matrix W and compute the graph Laplacian L = D - W;
• Solve the generalized eigenvalue problem Lv = λDv;
• Present the data in projections on the eigenvectors v, starting with the smallest (non-trivial) eigenvalues.
This representation keeps data samples that are proximate in the input space (according to the affinity matrix) close in the embedded coordinates. Hence the step of constructing the affinity matrix is very important, as it should encode the similarities that one desires to preserve on the low-dimensional map. Spectral clustering (described above) is essentially a related method that only includes one additional step of clustering the obtained representation using a conventional k-means.
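The following sketch (an illustration under our own assumptions: a binary k-nearest-neighbor affinity and the generalized eigenproblem Lv = λDv solved through a symmetric rescaling) computes a Laplacian eigenmaps embedding:

```python
import numpy as np

def laplacian_eigenmaps(X, n_components=2, n_neighbors=10):
    """Sketch of Laplacian eigenmaps with a k-nearest-neighbor affinity."""
    n = len(X)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    # Symmetric k-NN affinity matrix W (binary weights)
    W = np.zeros((n, n))
    order = np.argsort(sq_dists, axis=1)
    for i in range(n):
        W[i, order[i, 1:n_neighbors + 1]] = 1.0   # skip the point itself
    W = np.maximum(W, W.T)
    # Graph Laplacian L = D - W and generalized eigenproblem L v = lambda D v
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Solve the equivalent symmetric problem via D^{-1/2} scaling
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    eigvals, eigvecs = np.linalg.eigh(d_inv_sqrt @ L @ d_inv_sqrt)
    v = d_inv_sqrt @ eigvecs                       # back-transform to generalized eigenvectors
    # Discard the trivial constant eigenvector and keep the next smallest ones
    return v[:, 1:n_components + 1]

# Hypothetical usage: embed 3D samples into 2 coordinates
X = np.random.default_rng(5).normal(size=(300, 3))
embedding = laplacian_eigenmaps(X, n_components=2)
```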
Figure 16. First and second PCA components. The first one clearly follows population
density
Figure 17. Third and fourth PCA components. The interpretation of these is not
straightforward, though one can notice that the two neighbouring towns (Renens and
Lausanne) differ
Figure 18. Spatial representation of the first and second components obtained with
Laplacian eigenmaps
Figure 19. Spatial representation of the third and fourth components obtained with
Laplacian eigenmaps
SUPERVISED LEARNING
In supervised learning, the machine takes advantage of knowledge about
the outputs to develop a predictive model. Contrary to the unsupervised
methods discussed in the previous section, the algorithm is trained over a
set of samples x = {x1, x2, …, xn} with associated known outputs y = {y1,
y2, .., yn}.
c) The aim of the supervised model is to find the best separation between the classes. In this example, most of the labeled pixels are separated correctly; only three pixels are misclassified. Most of the time, misclassification is a necessary trade-off to maintain a low complexity. For instance, if the classification is performed on an aerial photograph using RGB values only, green roofs will have a spectral signature identical to that of meadows. Therefore, some confusion between the classes has to be accepted.
d) Once the separation between the classes has been defined, all the
unlabeled pixels can be classified according to these boundaries by
computing their position in the feature space. If they fall in one of
the green areas of Figure 20.c, they will be classified as 'green' as
well, and so on. Once all the pixels of the image have been
classified, a classification map is provided.
Figure 20: Principle of supervised learning for remote sensing data. (a) Each pixel is associated with features (e.g. the spectral bands) and known labels Y. (b) Knowledge about class membership is used in the feature space to train a learner and (c) define a decision boundary. (d) Finally, the unknown pixels are classified with respect to the decision boundary found and a classification map is provided
There are many algorithms, including machine learning models, that can solve classification tasks: k-nearest neighbors (k-NN, which can be considered as a benchmark model), decision trees, probabilistic neural networks, multilayer perceptrons, radial basis function networks, and support vector machines (Duda et al., 2001; Bishop 2006; Vapnik 1998). SVM have demonstrated excellent efficiency on classification tasks in different fields, from remote sensing images to biocomputing and finance.
Since only the points lying on the margin are necessary to define the
separating hyper-plane, all the other labeled points are not considered by
the model.
Figure 21: Linear classifiers for a two-class problem. (a) the problem; (b) several linear
classifiers separating the two classes; (c) the SVM
Mathematically, this results in the following classifier for the prediction of an unlabeled pixel q:

$$f(q) = \mathrm{sign}\Big(\sum_i \alpha_i\, y_i\, \langle x_i, q\rangle + b\Big) \qquad (8)$$

where the αi are coefficients that are nonzero only when the labeled sample xi lies on the margin, ⟨xi, q⟩ is a dot product defining the similarity between the unlabeled pixel and the labeled samples, and b is the bias. Since αi = 0 for each sample not lying on the margin, the class membership of an unseen point is assessed only by its similarity (~ distance) to the samples that lie on the margin. These samples are called the support vectors.
Recalling the example of Figure 21, since circles have a negative label (yi = -1) and squares a positive one (yi = +1), if the new point q is globally closer to the support vectors of the class "circles", the solution of Eq. (8) will give a negative value and q will be labeled as a circle.
The SVM presented so far can only solve linearly separable problems. Slack variables can be introduced to allow small errors (see, for instance, Cristianini, 2000), but the algorithm will fail if the data are not linearly separable (Figure 22.a). In order to handle linearly non-separable problems, we can use the so-called kernel trick. If a problem is not linearly separable in the input space, it may be linearly separable in a higher dimensional space H (Figure 22.b). If such a space exists, we can map the labeled samples into the new space and then apply a linear classifier there. A linear classification in the higher dimensional feature space (Figure 22.c) corresponds to a nonlinear classification in the input space (Figure 22.d).
Figure 22: The kernel trick. (a) a linearly non-separable problem in the input space; (b) mapping into a higher dimensional space H; (c) the decision function in H is linear; (d) in the input space it is not
The mapping (computation of the new coordinates) of all the labeled samples into the new space can be written analytically. For instance, a two-dimensional sample x = {x1, x2} mapped into a 3-dimensional space by a quadratic transform takes the coordinates φ(x) = {x1², √2·x1x2, x2²}. But by looking at Eq. (8) again, we can notice that the explicit mapping of xi is not required, only the similarity between xi and q. Therefore, there is no need to compute the entire mapping of x into H, but only the dot products between the mapped x and q. Such dot products can be represented by kernel functions. For instance, a polynomial kernel of degree 2 encodes a nonlinear similarity K(xi, q) = (⟨xi, q⟩)² = x1²q1² + 2·x1x2·q1q2 + x2²q2² = ⟨φ(xi), φ(q)⟩. Therefore, K returns the value of the dot product between the mapped samples! Using the kernel trick, the nonlinear SVM solution becomes:
$$f(q) = \mathrm{sign}\Big(\sum_i \alpha_i\, y_i\, K(x_i, q) + b\Big) \qquad (9)$$
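The kernel identity above can be checked numerically; the short sketch below (with arbitrary example vectors) verifies that the degree-2 polynomial kernel equals the dot product of the explicitly mapped samples:

```python
import numpy as np

def phi(v):
    """Explicit quadratic mapping of a 2D sample into the 3D feature space."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])     # a labeled sample (arbitrary values)
q = np.array([0.5, -1.5])    # an unlabeled pixel (arbitrary values)

k_implicit = np.dot(x, q) ** 2          # K(x, q) = <x, q>^2, no mapping needed
k_explicit = np.dot(phi(x), phi(q))     # <phi(x), phi(q)>, mapping computed explicitly

print(k_implicit, k_explicit)           # both give the same value (6.25)
```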
In this section, we train a SVM for the classification of land use in a neighborhood of the city of Lausanne, Switzerland (Figure 23.a). In particular, we would like to use an aerial photograph to discriminate different types of habitat, in particular individual versus collective habitat. This is a very challenging problem, because the spectral information is rather poor (each pixel is described only by its color coordinates in RGB space) and roof colors are mixed for the same type of habitat. Moreover, asphalt objects such as roads or parking lots can easily be confused with roofs.
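In practice such a classifier can be built with any standard SVM implementation. The sketch below is our own illustration with hypothetical arrays of RGB pixel values and labels (not the chapter's actual data, features or tuned parameters); it trains a kernel SVM and predicts the class of every pixel of an image:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: RGB values of labeled pixels and their class ids
# (e.g. 0=trees, 1=meadows, 2=individual habitat, 3=collective habitat, 4=roads, 5=shadows)
X_train = np.random.default_rng(6).integers(0, 256, size=(1000, 3)).astype(float)
y_train = np.random.default_rng(7).integers(0, 6, size=1000)

# Kernel SVM (RBF kernel); C and gamma would normally be tuned by cross-validation
clf = SVC(kernel="rbf", C=10.0, gamma=0.01)
clf.fit(X_train / 255.0, y_train)                 # scale features to [0, 1]

# Classify every pixel of a (hypothetical) image of shape (rows, cols, 3)
image = np.random.default_rng(8).integers(0, 256, size=(100, 100, 3)).astype(float)
pixels = image.reshape(-1, 3) / 255.0
classification_map = clf.predict(pixels).reshape(image.shape[:2])
```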
Figure 23: Data considered. (a) aerial photograph of the NW of Lausanne; (b) feature xO15; (c) feature xC15
In order to compare model performance on new data, the confusion matrices of the predictions of the 142723 test pixels were analyzed for both the MS (Table 1) and the MM (Table 2) experiments. Confusion matrices show the results in terms of predicted pixels (columns) versus ground truth pixels (rows); pixels on the diagonal are correctly classified. The percentage of pixels correctly classified by the SVM is given in the last column, while the last row shows the percentage of pixels correctly classified with respect to the total number of pixels predicted for that class. A visual inspection of the classified images was also carried out to detect improvements in the classification of specific objects. Recall that the dimension of the input vector is 3 for the MS model (spectral bands only) and 17 for the MM model (spectral bands plus morphological features).
Model output (columns) vs. reference (rows):

MS            Trees   Meadows   I. habitat   C. habitat   Roads   Shadows   Accuracy
Trees         22019   2364      5            105          279     0         89%
Meadows       1037    33984     126          410          169     0         95%

Table 1: confusion matrix for MS
Model output (columns) vs. reference (rows):

MM            Trees   Grass   I. habitat   C. habitat   Roads   Shadows   Accuracy
Trees         23715   532     12           82           188     243       96%
Grass         436     34281   149          440          420     0         96%
I. habitat    50      233     19688        985          5384    4         75%
C. habitat    264     394     1354         8278         1151    21        72%
Roads         244     789     5223         764          33413   11        83%
Shadows       472     0       0            2            4       3503      88%
Accuracy      94%     95%     75%          78%          82%     93%       86.1%

Table 2: confusion matrix for MM
Figure 24 shows the classification maps. With the MS model, two buildings with roofs of the same color are often misclassified (see marker 1a: the collective building is classified as an individual house). This problem is largely solved by the MM SVM (marker 1b). As mentioned above, the MM model takes into account information about the structure of the objects and is therefore able to detect their size and shape. Marker 2a highlights that the MS model misses the class Shadows, which is correctly handled by the MM model (2b). The roofs of collective habitat are sometimes made of concrete and are therefore confused with roads by MS (3a); the MM model can better discriminate these objects, even if some confusion is still visible (3b). Green roofs, which are classified as Meadows by MS (4a), are better handled by MM (4b). Finally, the MM model produces a classification which is less contaminated by the high spatial frequencies of the initial image; therefore, homogeneous surfaces such as Meadows and Trees (the forest) appear more homogeneous (markers 5a and 5b).
The classification results reported here are of great value when analyzing the urban structure of a city (in our case the spatial distribution of collective and individual habitat). The problem of urban sprawl is strictly related to the question of urban density. Remote sensing images can be used to discriminate automatically between different types of habitat. Machine learning algorithms, used with features that are discriminative for the problem at hand, achieve reliable results and provide effective maps for the visualization of urban density.
Figure 24: Classified image with RGB (top); classified image with RGB and MM (bottom);
Trees = dark green, Meadows = light green, Individual habitat = orange, Collective habitat
= red, Roads = black and Shadows = yellow
Figure 25: Simple model of an artificial neuron
An artificial neuron (Figure 25) takes the input features xi (the components of an input vector x), computes their weighted sum with weights wi, adds a bias b and passes the result through a transfer function f(·):

$$o = f\Big(\sum_i w_i x_i + b\Big) \qquad (10)$$

Two widely used transfer functions are the logistic (sigmoid) function and the hyperbolic tangent:
$$f(x) = \frac{1}{1 + e^{-x}} \qquad (11)$$

$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (12)$$
Figure 26: Transfer functions: logistic (left) and hyperbolic tangent (right)
The power and capabilities of multilayer perceptron stem from the non-
linearity used within nodes. An MLP can learn with a supervised learning
rule using the backpropagation algorithm. The backward error propagation
algorithm (backpropagation) for ANN learning/training caused a
breakthrough in the application of multilayer perceptron (Haykin 2009).
The backpropagation algorithm gave rise to the iterative gradient
algorithms designed to minimize the quadratic error cost function between
the actual output of the neural network and the desired output. The error is
computed during the forward pass of information flow through the network.
Figure 27: Feed-forward neural network: Multilayer perceptron with 3 input neurons, 7
hidden neurons in the first hidden layer, 7 hidden neurons in the second hidden layer and
2 output neurons (symbolic definition of the net 3-7-7-2). Blue circles are bias neurons
(with constant value 1)
Backpropagation algorithm
The weights of an MLP can be optimized with any general-purpose optimization algorithm, either of the first or second order, online or batch. As usual, for the regression problem the error to be minimized is the mean squared error (MSE). This error is easily computed, has proved itself in practice and, as shown later, its partial derivatives with respect to the individual weights can be computed explicitly. The outputs of an MLP trained with an MSE error function can be interpreted as the conditional mean of the target data, i.e. the regression of the dependent variable (output) conditioned on the independent variables (inputs) (Bishop, 1995; 2006). To simplify the notation, we consider below a model with a single output t; it can easily be extended to several outputs by considering the mean squared error averaged over them. For an input-output pair (x, t) the error is simply:
$$E_{MSE}(\mathbf{w}) = \frac{1}{2}\big[t - F(x, \mathbf{w})\big]^2 \qquad (13)$$
1) A forward pass is performed: the training input x is propagated through the network, and the outputs of all the neurons together with the model response F(x, w) are computed and stored.
2) The error (13) between the model response and the target t is computed.
3) The derivatives of EMSE for the single pair (x, t) are computed with respect to the weights in each layer, starting at the output layer and moving backward to the inputs. The derivatives provide information on how much the error depends on a particular weight in the vicinity of the current model, and are used to optimize its value in order to reduce the error, at least locally. This completes a backward pass.
The key point here is to compute the derivatives of the transfer functions of the neuron nodes with respect to their arguments. Here the smart choice of the activation function comes into play: for the logistic function and the hyperbolic tangent the derivatives can be computed from the values of the functions themselves, and the latter are computed and stored during the forward pass, so the algorithm simplifies and speeds up significantly.
$$f = \frac{1}{1 + e^{-x}}, \qquad \frac{df}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1 + e^{-x} - 1}{(1 + e^{-x})^2} = \Big(\frac{1}{f} - 1\Big) f^2 = f(1 - f) \qquad (14)$$
To minimize the MSE error over the training set, let us construct an iterative gradient descent procedure. The weights w are updated iteratively by the gradient rule, with n denoting the iteration number:

$$w_{ij}^{m}(n+1) = w_{ij}^{m}(n) - \eta(n)\, \frac{\partial E_{MSE}}{\partial w_{ij}^{m}} \qquad (15)$$

where η(n) is the learning rate. A momentum term with coefficient μ is often added, so that the current weight change also depends on the previous one:

$$\Delta w_{ij}^{m}(n+1) = -\eta(n)\, \frac{\partial E_{MSE}}{\partial w_{ij}^{m}} + \mu\, \Delta w_{ij}^{m}(n)$$
The effect of the momentum term is to magnify the learning rate in flat regions of the error surface, where the gradients are more or less constant (or, strictly speaking, were constant at the last iteration). In steep regions of weight space, momentum focuses the movement in a downhill direction by damping the oscillations caused by the alternating sign of the gradient.
One can either update the weights after the presentation of each training sample (the online mode) or present the whole training set (an epoch) and update the weights once with the averaged gradient (the batch mode). Both approaches are widely used, and different recommendations on their efficiency can be found in the literature. In the online case, the order in which the training patterns are presented may affect the direction of the search on the error surface. Some authors (Masters, 1993) prefer using the entire training set for each epoch, because this favors stability in the convergence to the optimal weights.
First, all the training samples are presented to the network and the average gradient is computed, that is, the vector containing all the derivatives of the MSE, whose dimensionality is equal to the number of weights in the network:

$$\nabla E_{MSE}(\mathbf{w}) = \left\{ \frac{\partial E_{MSE}}{\partial w_{ij}^{m}} \right\} \qquad (18)$$
The optimization step to modify the vector of weights w in the batch mode then becomes:

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\, \nabla E_{MSE}\big(\mathbf{w}(n)\big) \qquad (19)$$
Some recent research trends, motivated by the huge size of data sets to
be processed, are coming back to the on-line learning scheme, bringing
some stochastic elements into the learning process (Bottou, 2003).
Interestingly, this not only allows the processing of large datasets but sometimes also helps to avoid over-fitting.
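To make the update rules concrete, the sketch below (a didactic example on synthetic data, not the training procedure used in the case study that follows) trains a small one-hidden-layer MLP by batch gradient descent with a momentum term, using the logistic transfer function and the derivative identity (14):

```python
import numpy as np

rng = np.random.default_rng(9)

# Synthetic regression data (assumption for the example): t = f(x1, x2) + noise
X = rng.uniform(-1, 1, size=(200, 2))
t = np.sin(np.pi * X[:, 0]) * X[:, 1] + 0.1 * rng.standard_normal(200)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 10 neurons, linear output; small random initial weights
W1 = rng.normal(scale=0.5, size=(2, 10)); b1 = np.zeros(10)
W2 = rng.normal(scale=0.5, size=(10, 1)); b2 = np.zeros(1)
params = [W1, b1, W2, b2]
velocity = [np.zeros_like(p) for p in params]          # momentum memory
eta, mu = 0.05, 0.9                                    # learning rate and momentum

for epoch in range(2000):
    # Forward pass
    h = logistic(X @ W1 + b1)                          # hidden activations
    y = (h @ W2 + b2).ravel()                          # network output F(x, w)
    err = y - t
    # Backward pass: derivatives of the averaged MSE w.r.t. each group of weights
    grad_y = err[:, None] / len(X)
    gW2 = h.T @ grad_y;            gb2 = grad_y.sum(axis=0)
    grad_h = grad_y @ W2.T * h * (1 - h)               # uses f' = f(1 - f), Eq. (14)
    gW1 = X.T @ grad_h;            gb1 = grad_h.sum(axis=0)
    # Gradient descent with momentum: dw(n+1) = -eta*grad + mu*dw(n)
    for p, v, g in zip(params, velocity, [gW1, gb1, gW2, gb2]):
        v *= mu
        v -= eta * g
        p += v
    if epoch % 500 == 0:
        print(f"epoch {epoch:4d}  MSE = {np.mean(err ** 2):.4f}")
```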
Multiple local minima
The data were divided into two subsets: training (168 points) and testing (32 points). The structure of the MLP was 2-10-10-1 (two inputs, the X and Y coordinates; two hidden layers with ten neurons each; and one output, the level of contamination). In the first step of the training procedure, simulated annealing was used to initialize the weights. Then the second-order Levenberg-Marquardt algorithm was used during the main step of the training. Training was stopped when the testing error started to increase, and the model with the minimum testing error was selected. This procedure was repeated 5 times and the model with the lowest testing error was adopted as the final result.
Figure 28: Sediment contamination of lac Léman, Zn: measured (training) data (top), MLP mapping (bottom)
Figure 29: Sediment contamination of lac Léman, Ti: measured (training) data (top), MLP mapping (bottom)
Different kinds of kernels can be selected from the kernel library (Hardle, 1989; Fan & Gijbels, 1997). The Gaussian kernel is the most widely used:

$$K\!\left(\frac{x - x_i}{\sigma}\right) = \frac{1}{(2\pi\sigma^2)^{p/2}} \exp\!\left(-\frac{\lVert x - x_i \rVert^2}{2\sigma^2}\right), \qquad i = 1, 2, \ldots, N \qquad (21)$$

where p is the dimension of the input space. With this kernel the GRNN estimate is

$$Z(x) = \frac{\displaystyle\sum_{i=1}^{N} Z_i \exp\!\left(-\frac{\lVert x - x_i \rVert^2}{2\sigma^2}\right)}{\displaystyle\sum_{i=1}^{N} \exp\!\left(-\frac{\lVert x - x_i \rVert^2}{2\sigma^2}\right)} \qquad (22)$$
The model described above is the simplest GRNN algorithm. One useful improvement is to use multidimensional kernels instead of one-dimensional ones. In a more general setting the parameter σ may be replaced by a covariance matrix: a square symmetric matrix of dimension p by p, with p(p+1)/2 parameters.
The final result (the optimal σ value) corresponds to the model with the smallest cross-validation error. The scanned interval of σ values and the number of steps have to be chosen so as to capture the expected optimum (the minimum of the error). Reliable outer limits are the minimum distance between points and the size of the area under study. In practice, the effective interval is much smaller and can be defined according to the structure of the monitoring network and/or prior expert knowledge about the studied phenomenon.
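A compact GRNN sketch following Eq. (22) is given below, together with a leave-one-out cross-validation scan over a range of σ values; the coordinates and measured values are hypothetical stand-ins for a real monitoring dataset:

```python
import numpy as np

def grnn_predict(X_train, z_train, X_query, sigma):
    """GRNN / Nadaraya-Watson estimate, Eq. (22), with a Gaussian kernel."""
    sq_dists = np.sum((X_query[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
    weights = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return (weights @ z_train) / weights.sum(axis=1)

def loo_error(X, z, sigma):
    """Leave-one-out cross-validation error for a given kernel width sigma."""
    errors = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        pred = grnn_predict(X[mask], z[mask], X[i:i + 1], sigma)[0]
        errors.append((pred - z[i]) ** 2)
    return np.mean(errors)

# Hypothetical spatial dataset: coordinates (x, y) and a measured value z
rng = np.random.default_rng(10)
coords = rng.uniform(0, 10, size=(150, 2))
values = np.sin(coords[:, 0]) + 0.5 * np.cos(coords[:, 1]) + 0.1 * rng.standard_normal(150)

# Scan a range of sigma values and keep the one with the smallest LOO error
sigmas = np.linspace(0.2, 3.0, 15)
cv_errors = [loo_error(coords, values, s) for s in sigmas]
best_sigma = sigmas[int(np.argmin(cv_errors))]
print("optimal sigma:", best_sigma)

# Map the values on a regular prediction grid with the optimal sigma
gx, gy = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
grid = np.c_[gx.ravel(), gy.ravel()]
z_map = grnn_predict(coords, values, grid, best_sigma).reshape(gx.shape)
```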
Figure 30: Number of inhabitants per commune using a motorbike or scooter to commute to work (normalized by population): initial data (left), GRNN mapping (right)
Figure 31: Number of inhabitants per commune using a bus or tram to commute to work (normalized by population): initial data (left), GRNN mapping (right)
CONCLUSIONS
At present, machine learning models/algorithms play an important role in many fields dealing with data analysis, modeling and visualization. They are flexible, adaptive, nonlinear and universal modeling tools based on a solid mathematical and statistical background. Despite some external simplicity, especially considering the availability of easy-to-use software tools, their correct use and the interpretation of the results obtained require deep expert knowledge in the corresponding fields. They seem to be indispensable tools when multivariate data are embedded in high dimensional geo-feature spaces and the corresponding phenomena are nonlinear, multi-scale and contaminated by noise, which is quite a typical situation in real-life applications.
ACKNOWLEDGEMENTS
The research was supported in part by the Swiss National Science Foundation projects "GeoKernels. Phase 2" (200020-121835) and "ClusterVille" (100012-113506). The authors thank the CIPEL organization for providing the data on lac Léman (Lake Geneva).
REFERENCES
Almeida, C.; Gleriani, J.; Castejon, E.; Soares-Filho, B., 2008, Using neural networks and cellular automata for modeling intra-urban land-use dynamics. In: International Journal of Geographical Information Science, 22: 943-963
Belkin, M.; Niyogi, P., 2003, Laplacian eigenmaps for dimensionality reduction
and data representation. In: Neural Computation, 15, 6: 1373–1396
Bishop, C.M., 1995, Neural Networks for Pattern Recognition. Oxford University
Press: New York
Boser, B.; Guyon, I.; Vapnik, V., 1992, A training algorithm for optimal margin
classifiers. In: 5th ACM Workshop on Computational Learning Theory
Bottou, L., 2003, Stochastic Learning, Advanced Lectures on Machine Learning.
In: Bousquet, O.; von Luxburg, U. (Eds), Lecture Notes in Artificial Intelligence.
Berlin: Springer: 146-168
Cherkassky, V.; Hsieh, W.; Krasnopolsky, V.; Solomatine, D.; Valdes, J.,
2007, Special Issue: Computational intelligence in earth and environmental
sciences. In: Neural Networks, 20, 4: 433-558
Cherkassky, V.; Mulier, F., 2007, Learning from Data. Concepts, Theory, and Methods. Second Edition. Wiley-Interscience: New York
Collobert, R.; Bengio, S.; Mariéthoz, J., 2002, Torch: a modular machine
learning software library. Tech Report IDIAP: Martigny
Dubois, G., 2005, Automatic mapping algorithms for routine and emergency
data. European Commission, JRC Ispra, EUR 21595
Fan, J.; Gijbels, I., 1997, Applied Local Polynomial Modeling and Its
Applications. In: Monographs on Statistics and Applied Probability 66. London:
Chapman and Hall
Hagen, L.; Kahng, A., 1992, New spectral methods for ratio cut partitioning and
clustering. In: IEEE Trans. on Computer Aided-Design, 11, 9:1074-1085
Grandvalet, Y.; Canu, S.; Boucheron, S., 1997, Noise injection: theoretical
prospects. In: Neural computation, 9: 1093-1108
Guyon, I.; Gunn, S.; Nikravesh, N.; Zadeh, L., 2006, Feature Extraction:
Foundations and Applications. Springer: New York
Hastie, T.; Tibshirani, R.; Friedman, J., 2009, The Elements of Statistical
Learning; Data Mining, Inference, and Prediction. Second edition. Springer
Verlag: New York
Haykin, S., 2009, Neural Networks and Learning Machines. Third Edition.
Prentice-Hall, Inc.: New York
Hewitson, B.; Crane, R., 1994, Neural Nets: Applications in Geography. Kindle
Edition
Hornik, K.; Stinchcombe, M.; White, H., 1989, Multilayer feedforward networks are universal approximators. In: Neural Networks, 2: 359-366
Jain, A.K.; Murty, M.N.; Flynn, P.J., 1999, Data clustering: a review. In: ACM
Computing Surveys, 31, 3: 264-323
Jones, A., 2004, New tools in non-linear modeling and prediction. In: Comput.
Managm. Sci., 1: 109-149
Kanevski, M.; Arutyunyan, R.; Bolshov, L.; Demyanov, V.; Maignan, M.,
1996, Artificial neural networks and spatial estimations of Chernobyl fallout. In:
Geoinformatics, 7, 1-2: 5-11
Kanevski, M.; Pozdnoukhov, A.; Timonin, V., 2009, Machine Learning for
Spatial Environmental Data. Theory, Applications and Software. EPFL Press:
Lausanne
Kohonen, T., 2000, Self-organising maps. 3rd Edition. Springer: New York
Lee, J.; Verleysen, M., 2007, Nonlinear Dimensionality Reduction. Springer:
New York
Masters, T., 1993, Practical Neural Network Recipes in C++. Academic Press:
New York
Nadaraya, E.A., 1964, On estimating regression. In: Theory of Probability and its
Applications, 9: 141-142
Ng, A.Y.; Jordan, M.; Weiss, Y., 2001, On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems, 14: 849-856
Parzen, E., 1962, On estimation of a probability density function and mode. In:
Annals of Mathematical Statistics, 33: 1065-1076
Pearson, K., 1901, On Lines and Planes of Closest Fit to Systems of Points in
Space. In: Philosophical Magazine 2, 6: 559–572
Pi, H.; Peterson, C., 1994, Finding embedding dimension and variable
dependencies in time series. In: Neural computation, 6: 509-520
Pesaresi, M.; Benediktsson, J.A., 2001, A new approach for the morphological segmentation of high-resolution satellite images. In: IEEE Transactions on Geoscience and Remote Sensing, 39, 2: 309-320
Pijanowski, B.; Brown, D.; Shellito, B.; Manik G., 2002, Using neural networks
and GIS to forecast land use changes: a Land Transformation Model. In:
Computers, Environment and Urban Systems, 26: 553-575
Shi, J.; Malik, J., 2000, Normalized cuts and image segmentation. In: IEEE
Transactions on Pattern Analysis and Machine Intelligence, 22, 8: 888-905
Watson, G.S., 1964, Smooth regression analysis. In: Sankhya: The Indian
Journal of Statistics, Series A, 26: 359-372
AUTHORS INFORMATION

Mikhail KANEVSKI, Mikhail.Kanevski@unil.ch, IGAR, University of Lausanne, Amphipole, 1015 Lausanne, Switzerland

Loris FORESTI, Loris.Foresti@unil.ch, IGAR, University of Lausanne, Amphipole, 1015 Lausanne, Switzerland

Christian KAISER, Christian.Kaiser@unil.ch, IGUL, University of Lausanne, Anthropole, 1015 Lausanne, Switzerland

Alexei POZDNOUKHOV, Alexei.Pozdnoukhov@unil.ch, IGAR, University of Lausanne, Amphipole, 1015 Lausanne, Switzerland

Vadim TIMONIN, Vadim.Timonin@unil.ch, IGAR, University of Lausanne, Amphipole, 1015 Lausanne, Switzerland

Devis TUIA, Devis.Tuia@unil.ch, IGAR, University of Lausanne, Amphipole, 1015 Lausanne, Switzerland