Machine Learning Models For Geospatial Data
ABSTRACT
This chapter presents an introduction to machine learning models/algorithms and their potential applications to geospatial data. The main attention is paid to widely used models based on artificial neural networks (multilayer perceptron, general regression neural networks, self-organizing maps) and on statistical learning theory (support vector machines). The main ideas are illustrated with real-data case studies: spatial classification and spatial prediction/mapping (including automatic algorithms), nonlinear dimensionality reduction and visualization of high-dimensional multivariate socio-economic data, and the treatment and classification of remote sensing images by machine learning.
KEYWORDS
Machine learning algorithms, Geospatial data, Mapping and classification,
Dimensionality reduction, Remote sensing
INTRODUCTION
Machine learning (ML), in a general framework, can be considered as a
subfield of artificial intelligence that is concerned with the design,
development, and application of algorithms and techniques that allow
computers to learn from data. Machine learning has a close connection
with statistics (especially nonparametric and computational statistics) and
theoretical computer science. Since the middle of the twentieth century machine learning has evolved from the imitation of a simple neuron and artificial neural networks into a solid interdisciplinary field of basic and applied research with an important influence on many topics: pattern
recognition, bio-computing, speech recognition, financial applications,
analysis and modeling of high dimensional and multivariate geo- and
environmental spatio-temporal data, etc. (Agarwal & Skupin, 2008;
Cherkassky & Mulier, 2007; Hastie et al., 2009; Izenman 2008; Kanevski,
2008; Openshaw & Openshaw, 1997; Vapnik, 1998).
In recent years there has been an explosive growth in the development of
adaptive and data-driven approaches. Among the successful and widely used ML models, artificial neural networks (ANN) of different architectures and support vector machines (SVM) have attracted great attention. Both have demonstrated important and successful applications to geospatial data
modeling tasks: spatial predictions (classification and mapping); natural
hazards and environmental risk assessments; renewable resources
estimates; analysis, modeling and visualization of multivariate socio-
economic data; environmental time series predictions; hydroinformatics;
treatment and classification of remote sensing images, assimilation of data
and science based models; etc. (see references below).
The key feature of ML models/algorithms is that they learn from data and can be used in cases when the modeled phenomenon is not well described, which is the case in many applications involving geospatial data.
Machine learning models are adaptive tools, which at present are widely
used to solve prediction, characterization, optimization and many other
problems.
There exist many kinds of ANN to be used for different problems and
cases. Among the most common in geo- and environmental sciences let
us mention multilayer perceptron (MLP), radial basis function (RBF)
networks, general regression neural networks (GRNN), probabilistic neural
networks, Kohonen networks (self-organizing maps, SOM) (Agarwal &
Skupin, 2008; Cherkassky & Mulier, 2007; Hastie et al., 2009; Izenman
2008; Openshaw & Openshaw, 1997; Haykin, 2009).
SVMs build robust, nonlinear data models with excellent generalization abilities, which is very important both for monitoring and forecasting. SVMs use only the support vectors (a subset of the measurement data points) to derive decision boundaries. They open a way to sampling optimization, estimation of noise in data, quantification of data redundancy, etc. A more detailed presentation of SVM applications to spatially distributed environmental data is given in Kanevski & Maignan (2004), Kanevski (2008) and Kanevski et al. (2009).
For example, geostatistics widely uses variography, the analysis of anisotropic spatial correlations, in order to detect and characterize spatial patterns/structures.
Figure: Induction, deduction and transduction: a dependency F(x,y) is induced from the training samples (xi, yi) and then used deductively to predict new cases (xnew, ynew); transduction predicts the new values directly from the training samples.
In almost all real-life case studies the introduction of a statistical model for the data is non-trivial, because usually only one realization of the phenomenon is available: spatial data on pollution, time series of monitoring data, soil and land-use data, etc. Statistical treatment of data can still be introduced in this case, but under some hypotheses and assumptions. Therefore there are important hypotheses and assumptions that have to be checked (usually not a trivial task!) and accepted in order to make statistical (machine learning) inference based on one realization of the phenomenon under study: the i.i.d. (independent and identically distributed) character of the data; ergodicity (loosely speaking, the convergence of averaging over space to averaging over realizations); and spatial or temporal stationarity, i.e. the absence of trends, when the important parameters of the model do not change in space/time. In the case of geospatial data, spatial clustering is an important topic which complicates both the treatment of data (representativity of the data) and the interpretation of the results. The problem of non-stationarity (spatial or temporal) can be partly overcome by using locally adaptive models. As in statistics, these topics are also important in machine learning data treatment and modeling.
Figure: Supervised learning scheme: a supervisor evaluates the response of the machine learning algorithm and the result is used to introduce modifications to the ML model.
Figure: Learning scheme without a supervisor: training examples are presented to the machine learning algorithm, and its response is used to introduce modifications to the ML model.
Risk minimization
An important fundamental question is how to describe the quality of
learning, i.e. the description of the similarity/dissimilarity between data and
a ML model. This quantification is important both during learning and
prediction phases.
Consider the expected value of the loss, given by the risk functional

$$R(\alpha) = \int L\big(y, f(x,\alpha)\big)\, p(x, y)\, dx\, dy \qquad (1)$$

where L(y, f(x,α)) is the loss function and p(x, y) is the joint input-output probability density.
The goal is to find the function f(x,α0) which minimizes the risk in the
situation where the joint probability distribution function (pdf) is unknown
and the only available information is contained in the training set.
For classification problems the loss is the misclassification indicator

$$L\big(y, f(x,\alpha)\big) = \begin{cases} 0, & \text{if } y = f(x,\alpha) \\ 1, & \text{if } y \neq f(x,\alpha) \end{cases} \qquad (2)$$

for regression it is the squared error

$$L\big(y, f(x,\alpha)\big) = \big(y - f(x,\alpha)\big)^2 \qquad (3)$$

and for density estimation it is the negative log-likelihood

$$L\big(p(x,\alpha)\big) = -\log p(x,\alpha) \qquad (4)$$
The criteria presented above are very general. Unfortunately the joint
input-output distribution function is not known. Moreover, only a finite
number of data measurements (N training data) is available. Therefore
most training algorithms for learning machines implement Empirical Risk
Minimisation (ERM), i.e. they minimize the empirical error
$$R_{emp}(\alpha) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i, \alpha)\big) \qquad (5)$$
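As a brief illustration of the empirical risk (5), the following Python sketch (our own example, not part of the original chapter) computes the empirical error of a candidate model on a small hypothetical training set with the losses (2) and (3):

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """Empirical risk (5): average loss of model f over the training set."""
    predictions = np.array([f(x) for x in X])
    return np.mean(loss(y, predictions))

# Loss functions from Eqs. (2) and (3)
zero_one_loss = lambda y, p: (y != p).astype(float)      # classification
squared_loss  = lambda y, p: (y - p) ** 2                 # regression

# Hypothetical 1D regression example: f is a simple linear candidate model f(x, alpha)
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.8])
f = lambda x: 1.0 * x

print(empirical_risk(f, X, y, squared_loss))
```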
The structural risk minimization (SRM) principle (Cherkassky & Mulier, 2007) is illustrated in Figure 5. The X-axis corresponds to the complexity of the model and the Y-axis to the error. According to the SRM principle the prediction error is a sum of the training error (empirical risk) and a complexity term which penalizes overly complex models. In this way a bound on the prediction error can be derived which gives an upper limit. This limit does not depend on the distribution of the data and is therefore rather pessimistic (too high). In reality, for particular data the limits are lower and can be estimated by splitting the data or by using cross-validation techniques.
Figure 5: The SRM principle: the bound on the prediction error is the sum of the training error and the complexity term, plotted as a function of model complexity.
A good model should also perform well on data not used for modeling purposes. This means that the developed ML model has learned only the structured information and has ignored the noise present in the data; in this case over-fitting is avoided.
Bias-Variance Dilemma
Let us consider a regression/mapping problem as an example. In general, we can assume that the data can be decomposed into an unknown function and noise:

$$Y(X) = f(X, \alpha) + \varepsilon, \quad \text{where} \quad E(\varepsilon) = 0, \quad \mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2 \qquad (7)$$
One of the most serious problems that arises in connection with learning by neural networks is over-fitting of the provided training examples. This means that the learned function fits the training data very closely and yet does not generalize well, that is, it cannot model sufficiently well unseen data from the same phenomenon.
Solution: Balance the statistical bias and statistical variance when doing
neural network learning in order to achieve the smallest average
generalization error.
The following two processes are important in making a decision about the
quality of the model and its generalization ability (Hastie et al., 2009):
A. Model Selection is the process of "estimating the performance of different models in order to choose the (approximate) best one".
B. Model Assessment is the process of estimating the prediction (generalization) error of the finally chosen model on new, unseen data.
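A minimal sketch of model selection by a train/validation split is given below; the synthetic one-dimensional data and the family of polynomial models are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1D data: smooth function plus noise (assumption for the example)
x = rng.uniform(-1, 1, 200)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(200)

# Split into training and validation subsets
x_tr, y_tr = x[:140], y[:140]
x_va, y_va = x[140:], y[140:]

# Model selection: compare polynomial models of increasing complexity
for degree in range(1, 12):
    coeffs = np.polyfit(x_tr, y_tr, degree)            # fit on training data
    mse_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    mse_va = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    print(f"degree={degree:2d}  train MSE={mse_tr:.3f}  validation MSE={mse_va:.3f}")

# The training error keeps decreasing with complexity, while the validation
# error passes through a minimum: the degree with the lowest validation error
# is selected (model selection); its error on a separate test set would then
# estimate the generalization error (model assessment).
```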
UNSUPERVISED LEARNING
Unsupervised techniques are aimed at the exploratory analysis of data. They provide insight into the structures and dependencies hidden in the datasets. This is achieved by finding a simplified representation of the dataset that is useful for visualization, feature extraction, or descriptive analysis purposes. The main problems encountered here are clustering and dimensionality reduction (also called embedding).
CLUSTERING
Clustering can be defined as partitioning the dataset into subsets of typical entries such that the samples in each subset share some common characteristics. The commonness is implied by pre-defined similarities between data samples, and usually requires one to define a problem-specific distance measure in the input space of features. The general types of similarity measures used to compare data samples distinguish the typical groups of unsupervised approaches.
K-means algorithm
Probably the most popular clustering algorithm is known as k-means.
Given the dataset {x1, ..., xN}, it operates as follows:
1. Initialize the k cluster centers (for example, as randomly chosen data samples).
2. Assign every sample to the nearest center.
3. Update every center as the mean of the samples assigned to it.
The last two steps are iterated until convergence. As the centers are updated using all the data at once, this version of the algorithm is known as batch k-means, as opposed to online or stochastic k-means, where the update is done by randomly iterating through the samples of the dataset.
The stochastic k-means can be faster and less sensitive to the initialization of the centers. K-means methods aim at minimizing the intra-cluster variance and perform well if the data form distinctive "clouds" and there is no significant correlation between the input features (Figure 7).
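A minimal sketch of the batch k-means iteration described above is given below (our own illustration on hypothetical two-dimensional data):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal batch k-means: alternate assignment and center update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1: initialization
    for _ in range(n_iter):
        # step 2: assign every sample to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move every center to the mean of its assigned samples
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical 2D example with three "clouds"
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in ([0, 0], [3, 0], [0, 3])])
labels, centers = kmeans(X, k=3)
```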
Self-organizing maps
Self-organizing maps (also known as Kohonen maps (Kohonen, 2000))
are a popular method extending the functionality of k-means. SOM places the centers (also called units or neurons) in the data space so as to "cover" the data points, fit the topology of the dataset and present it on a two-dimensional map. That is, the centers are not drawn and fitted independently as in the case of k-means, but are organized in a two-dimensional map. There exist two common designs of this map: rectangular (every unit has four neighbors) and hexagonal (every unit has six neighbors), shown in Figure 8.
Figure 8: Two different SOM structures with cells in the map space: rectangular (a), and
hexagonal (b). The nearest cells (4 for rectangular and 6 for hexagonal, except borders
and corners) are connected by the edges
189
At each training step the best-matching unit (the center closest to the presented data sample) is updated together with the centers in its surroundings on the map. The closeness on the map is defined by the neighborhood function, which is responsible for the "cooperation" of the centers and their self-organization. This neighborhood influence is gradually decreased through the iteration epochs. The last epochs of SOM adaptation, with no cooperation between the centers, are the same as simple k-means.
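The following sketch (an illustrative implementation under our own assumptions, not the chapter's code) trains a small rectangular SOM online: for every presented sample the best-matching unit is found and all units are moved towards the sample with a strength given by a Gaussian neighborhood function whose radius, together with the learning rate, decays over the epochs; with a vanishing radius the update reduces to stochastic k-means:

```python
import numpy as np

def train_som(X, rows=5, cols=5, n_epochs=20, lr0=0.5, radius0=2.0, seed=0):
    """Minimal online SOM on a rectangular map (rows x cols units)."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    weights = rng.normal(size=(rows * cols, dim))               # unit codebook vectors
    # Fixed positions of the units on the 2D map grid
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for epoch in range(n_epochs):
        lr = lr0 * (1 - epoch / n_epochs)                       # decaying learning rate
        radius = max(radius0 * (1 - epoch / n_epochs), 1e-3)    # decaying neighborhood radius
        for x in X[rng.permutation(len(X))]:
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))   # best-matching unit
            d_map = np.linalg.norm(grid - grid[bmu], axis=1)       # distance on the map
            h = np.exp(-d_map ** 2 / (2 * radius ** 2))            # neighborhood function
            weights += lr * h[:, None] * (x - weights)             # cooperative update
    return weights, grid

# Hypothetical usage on 2D data
X = np.random.default_rng(2).normal(size=(300, 2))
weights, grid = train_som(X)
```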
Spectral clustering
The method of spectral clustering originates from a graph-based
perspective of the clustering problem (see e.g. Hagen & Kahng, 1992; Ng
et al., 2001; Shi & Malik, 2000). Spectral clustering considers the data
samples as the nodes of a weighted graph and applies the methods of
spectral graph theory to study its inner structure (Figure 9). To describe the affinity between the nodes of the graph, the edges connecting the i-th and j-th nodes (the data samples xi and xj) are assigned weights wij which form the matrix W. To separate the graph into clusters one has to find the "cut" that minimizes the sum of the weights that have to be removed in order to split the graph. If the weights are attributed according to a distance measure between the data samples xi and xj, this leads to a clustering of
data in the input space. Two common approaches are the n-nearest
neighbor one, where wij=1 iff the i-th and j-th samples are amongst the n-
nearest neighbors of each other, and the one which uses the Gaussian
RBF function of the distance between samples as the value for the weight.
The last case implies a parameter to be defined by a user, that is, the
width of the Gaussian. The way one attributes the weights can also
account for some problem-specific knowledge.
To approach the problem of finding the cut of the graph one needs to analyze the matrix known as the graph Laplacian. It is defined as L = D - W, where D is the diagonal matrix whose elements are the column-wise sums of the elements of W, dii = Σj wij. The normalized graph Laplacian D^{-1/2} W D^{-1/2} is often used as well. The foundations of spectral clustering lie in
the fact that the eigenspace of the (normalized) graph Laplacian has a
particular well-defined structure related to the number of the connected
components of the graph. Or, intuitively, if the Laplacian matrix is
essentially block-diagonal, it can be easily detected from its eigenspace.
While there are many possible approaches that implement this idea, the most popular formulation of spectral clustering is as follows (Ng et al., 2001):
• Form the affinity matrix W and compute the (normalized) graph Laplacian L;
• Solve the eigenvalue problem of finding the {λ, f} such that Lf = λf;
• Build a low-dimensional representation of the samples from the leading eigenvectors (with normalized rows);
• Cluster this representation with a conventional k-means.
This pipeline is sketched in the code below.
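The sketch is an illustration under our own assumptions (a Gaussian RBF affinity, the normalized matrix D^{-1/2} W D^{-1/2} described above, and scikit-learn's KMeans for the final clustering step), not a reference implementation:

```python
import numpy as np
from sklearn.cluster import KMeans   # used only for the final clustering step

def spectral_clustering(X, k, sigma=1.0):
    """Sketch of spectral clustering with a Gaussian RBF affinity."""
    # Affinity matrix W (RBF of pairwise distances, zero diagonal)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Normalized matrix D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = D_inv_sqrt @ W @ D_inv_sqrt
    # Leading k eigenvectors form the new representation of the samples
    eigvals, eigvecs = np.linalg.eigh(L_norm)
    U = eigvecs[:, -k:]
    # Normalize the rows and cluster them with conventional k-means
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Hypothetical usage: a ring and a blob that plain k-means cannot separate
rng = np.random.default_rng(3)
t = rng.uniform(0, 2 * np.pi, 200)
ring = np.c_[np.cos(t), np.sin(t)] * 3 + rng.normal(scale=0.1, size=(200, 2))
blob = rng.normal(scale=0.3, size=(200, 2))
labels = spectral_clustering(np.vstack([ring, blob]), k=2, sigma=0.5)
```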
CLUSTERS IN SOCIO-ECONOMIC DATA
To illustrate the use of clustering methods, let us explore the socio-
economic data on the region of Lausanne, Switzerland. The dataset was
obtained from the population census and includes about 250 different
entries defining the social, cultural (mother tongue, nationality, etc.) and
economic (employment rate, household type, etc.) characteristics of the
population. The data are spatially aggregated from the regular grid cells of
100x100 meters covering the populated area of the region. There are a
total of 3359 samples. Population density, which is one of the most
important input features, is presented in Figure 10.
Figure 10: Population density in the region of Lausanne. The values are normalized to the
maximum value of density in the region
Figure 13: Clusters obtained with spectral clustering
DIMENSIONALITY REDUCTION
Dimensionality reduction usually serves two goals: first, to produce a low-dimensional representation of the data in order to visualize them, and, second, to extract a small number of features for further analysis. Here one distinguishes between feature selection and feature extraction, where the latter, rather than selecting already existing variables, constructs a linear or nonlinear combination of the input variables which best suits the problem at hand. An example of a linear method is the well-known principal component analysis (PCA). There is also a popular and rapidly growing domain of modern nonlinear dimensionality reduction methods known as manifold learning (Lee & Verleysen, 2007).
PCA makes it possible to reduce the dimensionality of data that include linearly correlated inputs. For visualization, the first two principal components span the projection plane providing the most informative (in the sense of variability) viewpoint on the original dataset. The amount of explained variance can be computed to assist in the choice of the number of components.
Figure 14: Principal components form a new orthogonal coordinate system, with the first components spanning the directions of maximum variance of the data
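A minimal PCA sketch via the singular value decomposition is shown below (our own illustration; the random data matrix stands in for a standardized table of socio-economic variables):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD: returns the projections and the explained variance ratio."""
    Xc = X - X.mean(axis=0)                      # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # directions of maximum variance
    scores = Xc @ components.T                   # coordinates in the new system
    explained = (s ** 2) / np.sum(s ** 2)        # variance accounted for by each axis
    return scores, components, explained[:n_components]

# Hypothetical usage on a standardized data matrix (samples x variables)
X = np.random.default_rng(4).normal(size=(500, 10))
scores, components, explained = pca(X, n_components=2)
print("explained variance ratio:", explained)
```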
Laplacian Eigenmaps
The large variance along a straight line is not necessarily the main characteristic of interest when analyzing complex structures in the data. A linear projection cannot unfold non-linear structures such as the "Swiss roll", or even simpler ones as shown in Figure 15. It is the local relationships between data samples that may help
discover these complex structures. Laplacian eigenmaps (Belkin & Niyogi,
2003) are one particular approach of manifold learning aimed at
preserving the local neighborhood relations between the data samples.
Here we briefly name the other methods of descriptive manifold learning:
locally linear embedding (Roweis & Saul, 2000), ISOMAP (Tenenbaum &
De Silva, 2000), maximum variance unfolding (Weinberger & Saul, 2005).
The algorithm proceeds as follows:
• Form the affinity matrix W and compute the graph Laplacian L = D - W;
• Solve the generalized eigenvalue problem Lv = λDv;
• Present the data in projections on the eigenvectors v, starting with the smallest (non-trivial) eigenvalues.
This representation keeps data samples that are proximate in the input space (according to the affinity matrix) close in the embedded coordinates. Hence the step of constructing the affinity matrix is very important, as it should encode the similarities that one desires to preserve on the low-dimensional map. Spectral clustering (described above) is essentially a related method that only includes one additional step of clustering the obtained representation using a conventional k-means.
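The following sketch (an illustration under our own assumptions: a binary k-nearest-neighbor affinity and the generalized eigenproblem Lv = λDv solved through a symmetric rescaling) computes a Laplacian eigenmaps embedding:

```python
import numpy as np

def laplacian_eigenmaps(X, n_components=2, n_neighbors=10):
    """Sketch of Laplacian eigenmaps with a k-nearest-neighbor affinity."""
    n = len(X)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    # Symmetric k-NN affinity matrix W (binary weights)
    W = np.zeros((n, n))
    order = np.argsort(sq_dists, axis=1)
    for i in range(n):
        W[i, order[i, 1:n_neighbors + 1]] = 1.0   # skip the point itself
    W = np.maximum(W, W.T)
    # Graph Laplacian L = D - W and generalized eigenproblem L v = lambda D v
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Solve the equivalent symmetric problem via D^{-1/2} scaling
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    eigvals, eigvecs = np.linalg.eigh(d_inv_sqrt @ L @ d_inv_sqrt)
    v = d_inv_sqrt @ eigvecs                       # back-transform to generalized eigenvectors
    # Discard the trivial constant eigenvector and keep the next smallest ones
    return v[:, 1:n_components + 1]

# Hypothetical usage: embed 3D samples into 2 coordinates
X = np.random.default_rng(5).normal(size=(300, 3))
embedding = laplacian_eigenmaps(X, n_components=2)
```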
Figure 16. First and second PCA components. The first one clearly follows population
density
Figure 17. Third and fourth PCA components. The interpretation of these is not
straightforward, though one can notice that the two neighbouring towns (Renens and
Lausanne) differ
Figure 18. Spatial representation of the first and second components obtained with
Laplacian eigenmaps
Figure 19. Spatial representation of the third and fourth components obtained with
Laplacian eigenmaps
SUPERVISED LEARNING
In supervised learning, the machine takes advantage of knowledge about
the outputs to develop a predictive model. Contrary to the unsupervised
methods discussed in the previous section, the algorithm is trained over a
set of samples x = {x1, x2, …, xn} with associated known outputs y = {y1,
y2, .., yn}.
c) The aim of the supervised model is to find the best separation between the classes. In this example, most of the labeled pixels are separated correctly; only three pixels are misclassified. Most of the time, misclassification is a necessary trade-off to maintain a low complexity. For instance, if the classification is performed on an aerial photograph using RGB values only, green roofs will have a spectral signature identical to that of meadows. Therefore, some confusion between the classes has to be accepted.
d) Once the separation between the classes has been defined, all the
unlabeled pixels can be classified according to these boundaries by
computing their position in the feature space. If they fall in one of
the green areas of Figure 20.c, they will be classified as 'green' as
well, and so on. Once all the pixels of the image have been
classified, a classification map is provided.
Figure 20: Principle of supervised learning for remote sensing data. (a) Each pixel is associated with features (e.g. the spectral bands) and known labels Y. (b) Knowledge about class membership is used in the feature space to train a learner and (c) define a decision boundary. (d) Finally, the unknown pixels are classified with respect to the decision boundary found and a classification map is provided
There are many algorithms, including machine learning models, that can solve classification tasks: k-nearest neighbors (k-NN, which can be considered as a benchmark model), decision trees, probabilistic neural networks, multilayer perceptrons, radial basis function networks, and support vector machines (Duda et al., 2001; Bishop 2006; Vapnik 1998). SVM have demonstrated excellent efficiency on classification tasks in different fields, from remote sensing images to biocomputing and finance.
Since only the points lying on the margin are necessary to define the
separating hyper-plane, all the other labeled points are not considered by
the model.
Figure 21: Linear classifiers for a two-class problem. (a) the problem; (b) several linear
classifiers separating the two classes; (c) the SVM
Mathematically, this results in the following classifier for the prediction of an unlabeled pixel q:

$$f(q) = \mathrm{sign}\Big(\sum_i \alpha_i\, y_i\, \langle x_i, q\rangle + b\Big) \qquad (8)$$

where the αi are coefficients that are nonzero only when the labeled sample xi lies on the margin, ⟨xi, q⟩ is a dot product defining the similarity between the unlabeled pixel and the labeled samples, and b is the bias. Since αi = 0 for each sample not lying on the margin, the class membership of an unseen point is assessed only by its similarity (~ distance) to the samples that lie on the margin. These samples are called the support vectors.
Recalling the example of Figure 21, since circles have a negative label (yi = -1) and squares a positive one (yi = +1), if the new point q is globally closer to the support vectors of the class "circles", the solution of Eq. (8) will give a negative value and q will be labeled as a circle.
The SVM presented so far can only solve linearly separable problems. Slack variables can be introduced to allow small errors (see, for instance, Cristianini, 2000), but the algorithm will fail if the data are not linearly separable (Figure 22.a). In order to handle linearly non-separable problems, we can use the so-called kernel trick. If a problem is not linearly separable in the input space, it may be linearly separable in a higher dimensional space H (Figure 22.b). If such a space exists, we can map the labeled samples into the new space and then apply a linear classifier there. A linear classification in the higher dimensional feature space (Figure 22.c) corresponds to a nonlinear classification in the input space (Figure 22.d).
Figure 22: The kernel trick. (a) a linearly non-separable problem in the input space; (b) mapping into a higher dimensional space H; (c) the decision function in H is linear; (d) in the input space it is not
The mapping (computation of the new coordinates) of all the labeled samples into the new space can be written analytically. For instance, a two-dimensional sample x = {x1, x2} mapped into a 3-dimensional space by a quadratic transform takes the coordinates φ(x) = {x1², √2·x1x2, x2²}. But by looking at Eq. (8) again, we can notice that the explicit mapping of xi is not required, only the similarity between xi and q. Therefore, there is no need to compute the entire mapping of x into H, but only the dot products between the mapped x and q. Such dot products can be represented by kernel functions. For instance, a polynomial kernel of degree 2 encodes a nonlinear similarity K(xi, q) = (⟨xi, q⟩)² = x1²q1² + 2·x1x2·q1q2 + x2²q2² = ⟨φ(xi), φ(q)⟩. Therefore, K returns the value of the dot product between the mapped samples! Using the kernel trick, the nonlinear SVM solution becomes:
$$f(q) = \mathrm{sign}\Big(\sum_i \alpha_i\, y_i\, K(x_i, q) + b\Big) \qquad (9)$$
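The kernel identity above can be checked numerically; the short sketch below (with arbitrary example vectors) verifies that the degree-2 polynomial kernel equals the dot product of the explicitly mapped samples:

```python
import numpy as np

def phi(v):
    """Explicit quadratic mapping of a 2D sample into the 3D feature space."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])     # a labeled sample (arbitrary values)
q = np.array([0.5, -1.5])    # an unlabeled pixel (arbitrary values)

k_implicit = np.dot(x, q) ** 2          # K(x, q) = <x, q>^2, no mapping needed
k_explicit = np.dot(phi(x), phi(q))     # <phi(x), phi(q)>, mapping computed explicitly

print(k_implicit, k_explicit)           # both give the same value (6.25)
```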
In this section, we train a SVM for the classification of land use in a neighborhood of the city of Lausanne, Switzerland (Figure 23.a). In particular, we would like to use an aerial photograph to discriminate different types of habitat, in particular individual versus collective habitat. This is a very challenging problem, because the spectral information is rather poor (each pixel is described only by its color coordinates in RGB space) and roof colors are mixed for the same type of habitat. Moreover, asphalt objects such as roads or parking lots can easily be confused with roofs.
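In practice such a classifier can be built with any standard SVM implementation. The sketch below is our own illustration with hypothetical arrays of RGB pixel values and labels (not the chapter's actual data, features or tuned parameters); it trains a kernel SVM and predicts the class of every pixel of an image:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: RGB values of labeled pixels and their class ids
# (e.g. 0=trees, 1=meadows, 2=individual habitat, 3=collective habitat, 4=roads, 5=shadows)
X_train = np.random.default_rng(6).integers(0, 256, size=(1000, 3)).astype(float)
y_train = np.random.default_rng(7).integers(0, 6, size=1000)

# Kernel SVM (RBF kernel); C and gamma would normally be tuned by cross-validation
clf = SVC(kernel="rbf", C=10.0, gamma=0.01)
clf.fit(X_train / 255.0, y_train)                 # scale features to [0, 1]

# Classify every pixel of a (hypothetical) image of shape (rows, cols, 3)
image = np.random.default_rng(8).integers(0, 256, size=(100, 100, 3)).astype(float)
pixels = image.reshape(-1, 3) / 255.0
classification_map = clf.predict(pixels).reshape(image.shape[:2])
```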
Figure 23: Data considered. (a) aerial photograph of the NW of Lausanne; (b) feature xO15; (c) feature xC15
In order to compare model performance on new data, the confusion matrices of the predictions of the 142723 test pixels were analyzed for both the MS (Table 1) and the MM (Table 2) experiments. Confusion matrices show the results in terms of predicted pixels (columns) versus ground truth pixels (rows); pixels on the diagonal are correctly classified. The percentage of pixels correctly classified by the SVM is given in the last column, while the last row shows the percentage of pixels correctly classified with respect to the total number of pixels predicted for that class. A visual inspection of the classified images was also carried out to detect improvements in the classification of specific objects. Recall that the dimension of the input vector is 3 for the MS model (spectral bands only) and 17 for the MM model (spectral bands plus morphological features).
Model output (columns) vs. reference (rows):

MS            Trees   Meadows   I. habitat   C. habitat   Roads   Shadows   Accuracy
Trees         22019   2364      5            105          279     0         89%
Meadows       1037    33984     126          410          169     0         95%

Table 1: confusion matrix for MS
Model output (columns) vs. reference (rows):

MM            Trees   Grass   I. habitat   C. habitat   Roads   Shadows   Accuracy
Trees         23715   532     12           82           188     243       96%
Grass         436     34281   149          440          420     0         96%
I. habitat    50      233     19688        985          5384    4         75%
C. habitat    264     394     1354         8278         1151    21        72%
Roads         244     789     5223         764          33413   11        83%
Shadows       472     0       0            2            4       3503      88%
Accuracy      94%     95%     75%          78%          82%     93%       86.1%

Table 2: confusion matrix for MM
Figure 24 shows the classification maps. With the MS model, two buildings with roofs of the same color are often misclassified (see marker 1a: the collective building is classified as an individual house). This problem is largely solved by the MM SVM (marker 1b). As mentioned above, the MM model takes into account information about the structure of the objects and is therefore able to detect their size and shape. Marker 2a highlights that the MS model misses the class Shadows, which is correctly handled by the MM model (2b). The roofs of collective habitat are sometimes made of concrete and are therefore confused with roads by MS (3a); the MM model can better discriminate these objects, even if some confusion is still visible (3b). Green roofs, which are classified as Meadows by MS (4a), are better handled by MM (4b). Finally, the MM model produces a classification which is less contaminated by the high spatial frequencies of the initial image; therefore, homogeneous surfaces such as Meadows and Trees (the forest) appear more homogeneous (markers 5a and 5b).
The classification results reported here are of great value when analyzing the urban structure of a city (in our case the spatial distribution of collective and individual habitat). The problem of urban sprawl is strictly related to the question of urban density. Remote sensing images can be used to discriminate automatically between different types of habitat. Machine learning algorithms, used with features that are discriminative for the problem at hand, achieve reliable results and provide effective maps for the visualization of urban density.
Figure 24: Classified image with RGB (top); classified image with RGB and MM (bottom);
Trees = dark green, Meadows = light green, Individual habitat = orange, Collective habitat
= red, Roads = black and Shadows = yellow
Figure 25: Simple model of an artificial neuron
An artificial neuron (Figure 25) takes the input features xi (the components of an input vector x), computes their weighted sum with weights wi, adds a bias b and passes the result through a transfer function f(·):

$$o = f\Big(\sum_i w_i x_i + b\Big) \qquad (10)$$

Two widely used transfer functions are the logistic (sigmoid) function and the hyperbolic tangent:
$$f(x) = \frac{1}{1 + e^{-x}} \qquad (11)$$

$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (12)$$
Figure 26: Transfer functions: logistic (left) and hyperbolic tangent (right)
The power and capabilities of multilayer perceptron stem from the non-
linearity used within nodes. An MLP can learn with a supervised learning
rule using the backpropagation algorithm. The backward error propagation
algorithm (backpropagation) for ANN learning/training caused a
breakthrough in the application of multilayer perceptron (Haykin 2009).
The backpropagation algorithm gave rise to the iterative gradient
algorithms designed to minimize the quadratic error cost function between
the actual output of the neural network and the desired output. The error is
computed during the forward pass of information flow through the network.
Figure 27: Feed-forward neural network: Multilayer perceptron with 3 input neurons, 7
hidden neurons in the first hidden layer, 7 hidden neurons in the second hidden layer and
2 output neurons (symbolic definition of the net 3-7-7-2). Blue circles are bias neurons
(with constant value 1)
Backpropagation algorithm
The weights of an MLP can be optimized with any general-purpose optimization algorithm, either of the first or second order, online or batch. As usual, for the regression problem the error to be minimized is the mean squared error (MSE). This error is easily computed, has proved itself in practice and, as shown later, its partial derivatives with respect to the individual weights can be computed explicitly. The outputs of an MLP trained with an MSE error function can be interpreted as the conditional mean of the target data, i.e. the regression of the dependent variable (output) conditioned on the independent variables (inputs) (Bishop, 1995; 2006). To simplify the notation, we consider below a model with a single output t; it can easily be extended to several outputs by considering the mean squared error averaged over them. For an input-output pair (x, t) the error is simply:
$$E_{MSE}(\mathbf{w}) = \frac{1}{2}\big[t - F(x, \mathbf{w})\big]^2 \qquad (13)$$
1) A forward pass is performed: the training input x is propagated through the network, and the outputs of all the neurons together with the model response F(x, w) are computed and stored.
2) The error (13) between the model response and the target t is computed.
3) The derivatives of EMSE for the single pair (x, t) are computed with respect to the weights in each layer, starting at the output layer and moving backward to the inputs. The derivatives provide information on how much the error depends on a particular weight in the vicinity of the current model, and are used to optimize its value in order to reduce the error, at least locally. This completes a backward pass.
The key point here is to compute the derivatives of the transfer functions of the neuron nodes with respect to their arguments. Here the smart choice of the activation function comes into play: for the logistic function and the hyperbolic tangent the derivatives can be computed from the values of the functions themselves, and the latter are computed and stored during the forward pass, so the algorithm simplifies and speeds up significantly.
$$f = \frac{1}{1 + e^{-x}}, \qquad \frac{df}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1 + e^{-x} - 1}{(1 + e^{-x})^2} = \Big(\frac{1}{f} - 1\Big) f^2 = f(1 - f) \qquad (14)$$
To minimize the MSE error over the training set, let us construct an iterative gradient descent procedure. The weights w are updated iteratively by the gradient rule, with n denoting the iteration number:

$$w_{ij}^{m}(n+1) = w_{ij}^{m}(n) - \eta(n)\, \frac{\partial E_{MSE}}{\partial w_{ij}^{m}} \qquad (15)$$

where η(n) is the learning rate. A momentum term with coefficient μ is often added, so that the current weight change also depends on the previous one:

$$\Delta w_{ij}^{m}(n+1) = -\eta(n)\, \frac{\partial E_{MSE}}{\partial w_{ij}^{m}} + \mu\, \Delta w_{ij}^{m}(n)$$
The effect of the momentum term is to magnify the learning rate in flat regions of the error surface, where the gradients are more or less constant (or, strictly speaking, were constant at the last iteration). In steep regions of weight space, momentum focuses the movement in a downhill direction by damping the oscillations caused by the alternating sign of the gradient.
One can either update the weights after the presentation of each training sample (the online mode) or present the whole training set (an epoch) and update the weights once with the averaged gradient (the batch mode). Both approaches are widely used, and different recommendations on their efficiency can be found in the literature. In the online case, the order in which the training patterns are presented may affect the direction of the search on the error surface. Some authors (Masters, 1993) prefer using the entire training set for each epoch, because this favors stability in the convergence to the optimal weights.
First, all the training samples are presented to the network and the average gradient is computed, that is, the vector containing all the derivatives of the MSE, whose dimensionality is equal to the number of weights in the network:

$$\nabla E_{MSE}(\mathbf{w}) = \left\{ \frac{\partial E_{MSE}}{\partial w_{ij}^{m}} \right\} \qquad (18)$$
The optimization step to modify the vector of weights w in the batch mode then becomes:

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\, \nabla E_{MSE}\big(\mathbf{w}(n)\big) \qquad (19)$$
Some recent research trends, motivated by the huge size of data sets to
be processed, are coming back to the on-line learning scheme, bringing
some stochastic elements into the learning process (Bottou, 2003).
Interestingly, this not only allows the processing of large datasets but sometimes also helps to avoid over-fitting.
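To make the update rules concrete, the sketch below (a didactic example on synthetic data, not the training procedure used in the case study that follows) trains a small one-hidden-layer MLP by batch gradient descent with a momentum term, using the logistic transfer function and the derivative identity (14):

```python
import numpy as np

rng = np.random.default_rng(9)

# Synthetic regression data (assumption for the example): t = f(x1, x2) + noise
X = rng.uniform(-1, 1, size=(200, 2))
t = np.sin(np.pi * X[:, 0]) * X[:, 1] + 0.1 * rng.standard_normal(200)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 10 neurons, linear output; small random initial weights
W1 = rng.normal(scale=0.5, size=(2, 10)); b1 = np.zeros(10)
W2 = rng.normal(scale=0.5, size=(10, 1)); b2 = np.zeros(1)
params = [W1, b1, W2, b2]
velocity = [np.zeros_like(p) for p in params]          # momentum memory
eta, mu = 0.05, 0.9                                    # learning rate and momentum

for epoch in range(2000):
    # Forward pass
    h = logistic(X @ W1 + b1)                          # hidden activations
    y = (h @ W2 + b2).ravel()                          # network output F(x, w)
    err = y - t
    # Backward pass: derivatives of the averaged MSE w.r.t. each group of weights
    grad_y = err[:, None] / len(X)
    gW2 = h.T @ grad_y;            gb2 = grad_y.sum(axis=0)
    grad_h = grad_y @ W2.T * h * (1 - h)               # uses f' = f(1 - f), Eq. (14)
    gW1 = X.T @ grad_h;            gb1 = grad_h.sum(axis=0)
    # Gradient descent with momentum: dw(n+1) = -eta*grad + mu*dw(n)
    for p, v, g in zip(params, velocity, [gW1, gb1, gW2, gb2]):
        v *= mu
        v -= eta * g
        p += v
    if epoch % 500 == 0:
        print(f"epoch {epoch:4d}  MSE = {np.mean(err ** 2):.4f}")
```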
Multiple local minima
The data were divided into two subsets: training (168 points) and testing (32 points). The structure of the MLP was 2-10-10-1 (two inputs, the X and Y coordinates; two hidden layers with ten neurons each; and one output, the level of contamination). In the first step of the training procedure, simulated annealing was used to initialize the weights. Then the second-order Levenberg-Marquardt algorithm was used during the main step of the training. Training was stopped when the testing error started to increase, and the model with the minimum testing error was selected. This procedure was repeated 5 times and the model with the lowest testing error was adopted as the final result.
Figure 28: Sediment contamination of lac Léman, Zn: measured (training) data (top), MLP mapping (bottom)
Figure 29: Sediment contamination of lac Léman, Ti: measured (training) data (top), MLP mapping (bottom)
Different kinds of kernels can be selected from the kernel library (Hardle, 1989; Fan & Gijbels, 1997). The Gaussian kernel is the most widely used:

$$K\!\left(\frac{x - x_i}{\sigma}\right) = \frac{1}{(2\pi\sigma^2)^{p/2}} \exp\!\left(-\frac{\lVert x - x_i \rVert^2}{2\sigma^2}\right), \qquad i = 1, 2, \ldots, N \qquad (21)$$

where p is the dimension of the input space. With this kernel the GRNN estimate is

$$Z(x) = \frac{\displaystyle\sum_{i=1}^{N} Z_i \exp\!\left(-\frac{\lVert x - x_i \rVert^2}{2\sigma^2}\right)}{\displaystyle\sum_{i=1}^{N} \exp\!\left(-\frac{\lVert x - x_i \rVert^2}{2\sigma^2}\right)} \qquad (22)$$
The model described above is the simplest GRNN algorithm. One useful improvement is to use multidimensional kernels instead of one-dimensional ones. In a more general setting the parameter σ may be replaced by a covariance matrix: a square symmetric matrix of dimension p by p, with p(p+1)/2 parameters.
The final result (the optimal σ value) corresponds to the model with the smallest cross-validation error. The scanned interval of σ values and the number of steps have to be chosen so as to capture the expected optimum (the minimum of the error). Reliable outer limits are the minimum distance between points and the size of the area under study. In practice, the effective interval is much smaller and can be defined according to the structure of the monitoring network and/or prior expert knowledge about the studied phenomenon.
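A compact GRNN sketch following Eq. (22) is given below, together with a leave-one-out cross-validation scan over a range of σ values; the coordinates and measured values are hypothetical stand-ins for a real monitoring dataset:

```python
import numpy as np

def grnn_predict(X_train, z_train, X_query, sigma):
    """GRNN / Nadaraya-Watson estimate, Eq. (22), with a Gaussian kernel."""
    sq_dists = np.sum((X_query[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
    weights = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return (weights @ z_train) / weights.sum(axis=1)

def loo_error(X, z, sigma):
    """Leave-one-out cross-validation error for a given kernel width sigma."""
    errors = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        pred = grnn_predict(X[mask], z[mask], X[i:i + 1], sigma)[0]
        errors.append((pred - z[i]) ** 2)
    return np.mean(errors)

# Hypothetical spatial dataset: coordinates (x, y) and a measured value z
rng = np.random.default_rng(10)
coords = rng.uniform(0, 10, size=(150, 2))
values = np.sin(coords[:, 0]) + 0.5 * np.cos(coords[:, 1]) + 0.1 * rng.standard_normal(150)

# Scan a range of sigma values and keep the one with the smallest LOO error
sigmas = np.linspace(0.2, 3.0, 15)
cv_errors = [loo_error(coords, values, s) for s in sigmas]
best_sigma = sigmas[int(np.argmin(cv_errors))]
print("optimal sigma:", best_sigma)

# Map the values on a regular prediction grid with the optimal sigma
gx, gy = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
grid = np.c_[gx.ravel(), gy.ravel()]
z_map = grnn_predict(coords, values, grid, best_sigma).reshape(gx.shape)
```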
Figure 30: Number of inhabitants per commune using a motorbike or scooter to commute to work (normalized by population): initial data (left), GRNN mapping (right)
Figure 31: Number of inhabitants per commune using a bus or tram to commute to work (normalized by population): initial data (left), GRNN mapping (right)
CONCLUSIONS
At present, machine learning models/algorithms play an important role in many fields dealing with data analysis, modeling and visualization. They are flexible, adaptive, nonlinear and universal modeling tools based on a solid mathematical and statistical background. Despite some external simplicity, especially considering the availability of easy-to-use software tools, their correct use and the interpretation of the results obtained require deep expert knowledge in the corresponding fields. They seem to be indispensable tools when multivariate data are embedded in high dimensional geo-feature spaces and the corresponding phenomena are nonlinear, multi-scale and contaminated by noise, which is quite a typical situation in real-life applications.
ACKNOWLEDGEMENTS
The research was supported in part by the Swiss National Science Foundation projects "GeoKernels. Phase 2" (200020-121835) and "ClusterVille" (100012-113506). The authors thank the CIPEL organization for providing the data on lac Léman (Lake Geneva).
REFERENCES
Almeida, C.; Gleriani, J.; Castejon, E.; Soares-Filho, B., 2008, Using neural networks and cellular automata for modeling intra-urban land-use dynamics. In: International Journal of Geographical Information Science, 22: 943-963
Belkin, M.; Niyogi, P., 2003, Laplacian eigenmaps for dimensionality reduction
and data representation. In: Neural Computation, 15, 6: 1373–1396
Bishop, C.M., 1995, Neural Networks for Pattern Recognition. Oxford University
Press: New York
Boser, B.; Guyon, I.; Vapnik, V., 1992, A training algorithm for optimal margin
classifiers. In: 5th ACM Workshop on Computational Learning Theory
Bottou, L., 2003, Stochastic Learning, Advanced Lectures on Machine Learning.
In: Bousquet, O.; von Luxburg, U. (Eds), Lecture Notes in Artificial Intelligence.
Berlin: Springer: 146-168
Cherkassky, V.; Hsieh, W.; Krasnopolsky, V.; Solomatine, D.; Valdes, J.,
2007, Special Issue: Computational intelligence in earth and environmental
sciences. In: Neural Networks, 20, 4: 433-558
Cherkassky, V.; Mulier, F., 2007, Learning from Data. Concepts, Theory, and Methods. Second Edition. Wiley-Interscience: New York
Collobert, R.; Bengio, S.; Mariéthoz, J., 2002, Torch: a modular machine
learning software library. Tech Report IDIAP: Martigny
Dubois, G., 2005, Automatic mapping algorithms for routine and emergency
data. European Commission, JRC Ispra, EUR 21595
Fan, J.; Gijbels, I., 1997, Applied Local Polynomial Modeling and Its
Applications. In: Monographs on Statistics and Applied Probability 66. London:
Chapman and Hall
Hagen, L.; Kahng, A., 1992, New spectral methods for ratio cut partitioning and
clustering. In: IEEE Trans. on Computer Aided-Design, 11, 9:1074-1085
Grandvalet, Y.; Canu, S.; Boucheron, S., 1997, Noise injection: theoretical
prospects. In: Neural computation, 9: 1093-1108
Guyon, I.; Gunn, S.; Nikravesh, N.; Zadeh, L., 2006, Feature Extraction:
Foundations and Applications. Springer: New York
Hastie, T.; Tibshirani, R.; Friedman, J., 2009, The Elements of Statistical
Learning; Data Mining, Inference, and Prediction. Second edition. Springer
Verlag: New York
Haykin, S., 2009, Neural Networks and Learning Machines. Third Edition.
Prentice-Hall, Inc.: New York
Hewitson, B.; Crane, R., 1994, Neural Nets: Applications in Geography. Kindle
Edition
Hornik, K.; Stinchcombe, M.; White, H., 1989, Multilayer feedforward networks are universal approximators. In: Neural Networks, 2: 359-366
Jain, A.K.; Murty, M.N.; Flynn, P.J., 1999, Data clustering: a review. In: ACM
Computing Surveys, 31, 3: 264-323
Jones, A., 2004, New tools in non-linear modeling and prediction. In: Comput.
Managm. Sci., 1: 109-149
Kanevski, M.; Arutyunyan, R.; Bolshov, L.; Demyanov, V.; Maignan, M.,
1996, Artificial neural networks and spatial estimations of Chernobyl fallout. In:
Geoinformatics, 7, 1-2: 5-11
Kanevski, M.; Pozdnoukhov, A.; Timonin, V., 2009, Machine Learning for
Spatial Environmental Data. Theory, Applications and Software. EPFL Press:
Lausanne
Kohonen, T., 2000, Self-organising maps. 3rd Edition. Springer: New York
Lee, J.; Verleysen, M., 2007, Nonlinear Dimensionality Reduction. Springer:
New York
Masters, T., 1993, Practical Neural Network Recipes in C++. Academic Press:
New York
Nadaraya, E.A., 1964, On estimating regression. In: Theory of Probability and its
Applications, 9: 141-142
Ng, A.Y.; Jordan, M.; Weiss, Y., 2001, On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems, 14: 849-856
Parzen, E., 1962, On estimation of a probability density function and mode. In:
Annals of Mathematical Statistics, 33: 1065-1076
Pearson, K., 1901, On Lines and Planes of Closest Fit to Systems of Points in
Space. In: Philosophical Magazine 2, 6: 559–572
Pi, H.; Peterson, C., 1994, Finding embedding dimension and variable
dependencies in time series. In: Neural computation, 6: 509-520
Pesaresi, M.; Benediktsson, J.A., 2001, A new approach for the morphological segmentation of high-resolution satellite images. In: IEEE Transactions on Geoscience and Remote Sensing, 39, 2: 309-320
Pijanowski, B.; Brown, D.; Shellito, B.; Manik G., 2002, Using neural networks
and GIS to forecast land use changes: a Land Transformation Model. In:
Computers, Environment and Urban Systems, 26: 553-575
Shi, J.; Malik, J., 2000, Normalized cuts and image segmentation. In: IEEE
Transactions on Pattern Analysis and Machine Intelligence, 22, 8: 888-905
Watson, G.S., 1964, Smooth regression analysis. In: Sankhya: The Indian
Journal of Statistics, Series A, 26: 359-372
AUTHORS INFORMATION

Mikhail KANEVSKI, Mikhail.Kanevski@unil.ch, IGAR, University of Lausanne, Amphipole, 1015 Lausanne, Switzerland

Loris FORESTI, Loris.Foresti@unil.ch, IGAR, University of Lausanne, Amphipole, 1015 Lausanne, Switzerland

Christian KAISER, Christian.Kaiser@unil.ch, IGUL, University of Lausanne, Anthropole, 1015 Lausanne, Switzerland

Alexei POZDNOUKHOV, Alexei.Pozdnoukhov@unil.ch, IGAR, University of Lausanne, Amphipole, 1015 Lausanne, Switzerland

Vadim TIMONIN, Vadim.Timonin@unil.ch, IGAR, University of Lausanne, Amphipole, 1015 Lausanne, Switzerland

Devis TUIA, Devis.Tuia@unil.ch, IGAR, University of Lausanne, Amphipole, 1015 Lausanne, Switzerland