Information Science and Statistics

Series Editors:
M. Jordan
J. Kleinberg
B. Schölkopf
Akaike and Kitagawa: The Practice of Time Series Analysis.
Bishop: Pattern Recognition and Machine Learning.
Cowell, Dawid, Lauritzen, and Spiegelhalter: Probabilistic Networks and Expert Systems.
Doucet, de Freitas, and Gordon: Sequential Monte Carlo Methods in Practice.
Fine: Feedforward Neural Network Methodology.
Hawkins and Olwell: Cumulative Sum Charts and Charting for Quality Improvement.
Jensen and Nielsen: Bayesian Networks and Decision Graphs, Second Edition.
Lee and Verleysen: Nonlinear Dimensionality Reduction.
Marchette: Computer Intrusion Detection and Network Monitoring: A Statistical Viewpoint.
Rissanen: Information and Complexity in Statistical Modeling.
Rubinstein and Kroese: The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte Carlo Simulation, and Machine Learning.
Studený: Probabilistic Conditional Independence Structures.
Vapnik: The Nature of Statistical Learning Theory, Second Edition.
Wallace: Statistical and Inductive Inference by Minimum Message Length.
John A. Lee · Michel Verleysen

Nonlinear Dimensionality Reduction
John Lee
Molecular Imaging and Experimental Radiotherapy
Université catholique de Louvain
Avenue Hippocrate 54/69
B-1200 Bruxelles, Belgium
john.lee@uclouvain.be

Michel Verleysen
Machine Learning Group – DICE
Université catholique de Louvain
Place du Levant 3
B-1348 Louvain-la-Neuve, Belgium
michel.verleysen@uclouvain.be
Series Editors:

Michael Jordan
Division of Computer Science and Department of Statistics
University of California, Berkeley
Berkeley, CA 94720, USA

Jon Kleinberg
Department of Computer Science
Cornell University
Ithaca, NY 14853, USA

Bernhard Schölkopf
Max Planck Institute for Biological Cybernetics
Spemannstrasse 38
72076 Tübingen, Germany
To our families
Contents

Preface
Notations
Acronyms
1 High-Dimensional Data
1.1 Practical motivations
1.1.1 Fields of application
1.1.2 The goals to be reached
1.2 Theoretical motivations
1.2.1 How can we visualize high-dimensional spaces?
1.2.2 Curse of dimensionality and empty space phenomenon
1.3 Some directions to be explored
1.3.1 Relevance of the variables
1.3.2 Dependencies between the variables
1.4 About topology, spaces, and manifolds
1.5 Two benchmark manifolds
1.6 Overview of the next chapters
4 Distance Preservation
4.1 State-of-the-art
4.2 Spatial distances
4.2.1 Metric space, distances, norms and scalar product
4.2.2 Multidimensional scaling
4.2.3 Sammon’s nonlinear mapping
4.2.4 Curvilinear component analysis
4.3 Graph distances
4.3.1 Geodesic distance and graph distance
7 Conclusions
7.1 Summary of the book
7.1.1 The problem
7.1.2 A basic solution
7.1.3 Dimensionality reduction
7.1.4 Latent variable separation
7.1.5 Intrinsic dimensionality estimation
7.2 Data flow
7.2.1 Variable selection
7.2.2 Calibration
7.2.3 Linear dimensionality reduction
7.2.4 Nonlinear dimensionality reduction
7.2.5 Latent variable separation
7.2.6 Further processing
7.3 Model complexity
7.4 Taxonomy
7.4.1 Distance preservation
7.4.2 Topology preservation
7.5 Spectral methods
7.6 Nonspectral methods
C Optimization
C.1 Newton’s method
C.1.1 Finding extrema
C.1.2 Multivariate version
C.2 Gradient ascent/descent
C.2.1 Stochastic gradient descent

References
Index
Notations
DR Dimensionality reduction
LDR Linear dimensionality reduction
NLDR Nonlinear dimensionality reduction
partially redundant. Units that fail can be replaced with others that achieve the same or a similar task.
Redundancy means that the parameters or features that characterize the set of various units are not independent from each other. Consequently, the efficient management or understanding of all units requires taking this redundancy into account. The large set of parameters or features must be summarized into a smaller set, with less or no redundancy. This is the goal of dimensionality reduction (DR), which is one of the key tools for analyzing high-dimensional data.
These terms encompass all applications using a set of several identical sensors. Arrays of antennas (e.g., in radio telescopes) are the best example. But numerous biomedical applications also belong to this class, such as electrocardiogram or electroencephalogram acquisition, where several electrodes record time signals at different places on the chest or the scalp. The same configuration is found in seismography and weather forecasting, for which several stations or satellites deliver data. The problem of geographic positioning using satellites (as in the GPS or Galileo system) may be cast within the same framework too.
Image processing

Let us consider a picture as the output of a digital camera; its processing then reduces to the processing of a sensor array, namely the well-known photosensitive CCD or CMOS sensors used in digital photography. However, image processing is often seen as a standalone domain, mainly because vision is a very specific task that holds a privileged place in information science.
In contrast with sensor arrays or pixel arrays, multivariate data analysis rather focuses on the analysis of measurements that are related to each other but come from different types of sensors. An obvious example is a car, wherein the gearbox connecting the engine to the wheels has to take into account information from rotation sensors (wheels and engine shaft), force sensors (brake and gas pedals), position sensors (gearbox stick, steering wheel), temperature sensors (to prevent engine overheating or to detect ice), and so forth. Such a situation can also occur in psychosociology: a poll often gathers questions for which the answers are of different types (true/false, percentage, weight, age, etc.).
Data mining
At first sight, data mining seems to be very close to multivariate data analy-
sis. However, the former has a broader scope of applications than the latter,
which is a classical subdomain of statistics. Data mining can deal with more
exotic data structures than arrays of numbers. For example, data mining en-
compasses text mining. The analysis of large sets of text documents aims,
for instance, at detecting similarities between texts, like common vocabulary,
same topic, etc. If these texts are Internet pages, hyperlinks can be encoded
in graph structures and analyzed using tools like graph embedding. Cross
references in databases can be analyzed in the same way.
Visualization is a task that regards mainly two classes of data: spatial and tem-
poral. In the latter case, the analysis may resort to the additional information
given by the location in time.
Spatial data
Temporal data
When it is known that data are observed in the course of time, an additional
piece of information is available. As a consequence, the above-mentioned ge-
ometrical representation is no longer unique. Instead of visualizing all di-
mensions simultaneously in the same coordinate system, one can draw the
evolution of each variable as a function of time. For example, in Fig. 1.2, the
same data set is displayed “spatially” in the first plot, and “temporally” in
the second one: the time structure of data is revealed by the temporal rep-
resentation only. In contrast with the spatial representation, the temporal
representation easily generalizes to more than three dimensions. Nevertheless,
Fig. 1.2. Two plots of the same temporal data. In the first representation, data
are displayed in a single coordinate system (spatial representation). In the second
representation, each variable is plotted in its own coordinate system, with time as
the abscissa (time representation).
V_{\mathrm{sphere}}(r) = \frac{\pi^{D/2} r^D}{\Gamma(1 + D/2)} ,   (1.1)

V_{\mathrm{cube}}(r) = (2r)^D ,   (1.2)

where r is the radius of the sphere. Surprisingly, the ratio V_{\mathrm{sphere}}/V_{\mathrm{cube}} tends to zero when D increases:

\lim_{D \to \infty} \frac{V_{\mathrm{sphere}}(r)}{V_{\mathrm{cube}}(r)} = 0 .   (1.3)
Intuitively, this means that as dimensionality increases, a cube becomes more and more spiky, like a sea urchin: the spherical body gets smaller and smaller, while the number of spikes increases and they occupy almost all the available volume. Now, assigning the value 1/2 to r, V_{\mathrm{cube}}(1/2) equals 1, leading to

\lim_{D \to \infty} V_{\mathrm{sphere}}(1/2) = 0 .

This indicates that the volume of a sphere vanishes when dimensionality increases!
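Numerically, the collapse described by Eqs. (1.1)-(1.3) is easy to observe. The short sketch below is an illustrative addition (not part of the original text) that evaluates the ratio for a few dimensionalities.

```python
from math import gamma, pi

def sphere_volume(r, D):
    """Volume of a D-dimensional ball of radius r, Eq. (1.1)."""
    return pi ** (D / 2) * r ** D / gamma(1 + D / 2)

def cube_volume(r, D):
    """Volume of the circumscribed hypercube with half-edge r, Eq. (1.2)."""
    return (2 * r) ** D

for D in (1, 2, 3, 5, 10, 20, 50):
    ratio = sphere_volume(1.0, D) / cube_volume(1.0, D)
    print(f"D = {D:3d}   V_sphere / V_cube = {ratio:.2e}")
# The ratio collapses toward zero as D grows, as stated by Eq. (1.3).
```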
f_{\mathbf{y}}(\mathbf{y}) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\!\left(-\frac{1}{2}\,\frac{\|\mathbf{y} - \boldsymbol{\mu}_{\mathbf{y}}\|^2}{\sigma^2}\right) ,   (1.6)
S_{\mathrm{sphere}}(r) = \frac{2\pi^{D/2} r^{D-1}}{\Gamma(D/2)} .   (1.9)
The radius r0.95 grows as the dimensionality D increases, as illustrated in the
following table:
D        1       2       3       4       5       6
r_0.95   1.96σ   2.45σ   2.80σ   3.08σ   3.33σ   3.54σ
This shows the weird behavior of a Gaussian distribution in high-dimensional
spaces.
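The table can be reproduced numerically if r_0.95 is read as the radius of the sphere containing 95% of the probability mass of the isotropic Gaussian above (an interpretation inferred from the context, since the defining text is not reproduced here): because ||y − μ_y||²/σ² follows a χ² distribution with D degrees of freedom, r_0.95 = σ √(F⁻¹_{χ²_D}(0.95)). A minimal sketch:

```python
from math import sqrt
from scipy.stats import chi2

# If y ~ N(mu, sigma^2 I_D), then ||y - mu||^2 / sigma^2 ~ chi^2 with D dof,
# so the radius enclosing 95% of the mass is sigma * sqrt(chi2.ppf(0.95, D)).
for D in range(1, 7):
    r95 = sqrt(chi2.ppf(0.95, D))
    print(f"D = {D}   r_0.95 = {r95:.2f} * sigma")
# Prints 1.96, 2.45, 2.80, 3.08, 3.33, 3.54 times sigma, matching the table.
```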
where a and b are parameters depending only on the central moments of order 1, 2, 3, and 4 of the x_i:

a = \mu^2 + \mu_2 ,   (1.12)

b = \frac{4\mu^2\mu_2 - \mu_2^2 + 4\mu\mu_3 + \mu_4}{4(\mu^2 + \mu_2)} ,   (1.13)
Diagonal of a hypercube
Considering the hypercube [−1, +1]D , any segment from its center to one of
its 2D corners, i.e., a half-diagonal, can be written as v = [±1, . . . , ±1]T . The
angle between a half-diagonal v and the dth coordinate axis
ed = [0, . . . , 0, 1, 0, . . . , 0]T
is computed as
\cos\theta_D = \frac{\mathbf{v}^T \mathbf{e}_d}{\|\mathbf{v}\|\,\|\mathbf{e}_d\|} = \frac{\pm 1}{\sqrt{D}} .   (1.15)
When the dimensionality D grows, the cosine tends to zero, meaning that half-diagonals are nearly orthogonal to all coordinate axes [169]. Hence, the visualization of high-dimensional data by plotting a subset of two coordinates on a plane can be misleading. Indeed, a cluster of points lying near a diagonal line of the space will surprisingly be plotted near the origin, whereas a cluster lying near a coordinate axis is plotted as intuitively expected.
When analyzing multivariate data, not necessarily all variables are related to
the underlying information the user wishes to catch. Irrelevant variables may
be eliminated from the data set.
Most often, techniques to distinguish relevant variables from irrelevant ones are supervised: the “interest” of a variable is given by an “oracle” or “teacher”. For example, in a system with many inputs and outputs, the relevance of an input can be measured by computing the correlations between known pairs of inputs and outputs. Input variables that are not correlated with the outputs may then be eliminated.
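As a purely illustrative sketch (the data, threshold, and helper name below are arbitrary and not taken from the book), such a correlation-based filter can be written in a few lines:

```python
import numpy as np

def relevant_inputs(X, t, threshold=0.1):
    """Return the indices of input variables whose absolute correlation
    with the target t exceeds the threshold.
    X: (N, D) array of inputs, t: (N,) array of outputs."""
    corr = np.array([np.corrcoef(X[:, d], t)[0, 1] for d in range(X.shape[1])])
    return np.flatnonzero(np.abs(corr) > threshold)

# Toy data: only the first two inputs actually influence the output.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
t = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=500)
print(relevant_inputs(X, t))  # typically [0 1]
```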
Techniques to determine whether variables are (ir)relevant are not studied further in this book, which focuses mainly on unsupervised methods. For the interested reader, some introductory references include [2, 96, 139].
Even when assuming that all variables are relevant, the dimensionality of the
observed data may still be larger than necessary. For example, two variables
may be highly correlated: knowing one of them brings information about the
other. In that case, instead of arbitrarily removing one variable in the pair,
another way to reduce the number of variables would be to find a new set
of transformed variables. This is motivated by the facts that dependencies
between variables may be very complex and that keeping one of them might
not suffice to catch all the information content they both convey.
The new set should obviously contain a smaller number of variables but
should also preserve the interesting characteristics of the initial set. In other
words, one seeks a transformation of the variables with some well-defined
properties. These properties must ensure that the transformation does not
alter the information content conveyed by the initial data set, but only rep-
resents it in a different form. In the remainder of this book, linear as well
as nonlinear transformations of observed variables will often be called projec-
tions, mainly because many transformations are designed for the preservation
of characteristics that are geometrical or interpreted as such.
The type of projection must be chosen according to the model that un-
derlies the data set. For example, if the given variables are assumed to be
mixtures of a few unobserved ones, then a projection that inverts the mixing
process is very useful. In other words, this projection tracks and eliminates
dependencies between the observed variables. These dependencies often result
from a lack of knowledge or other imperfections in the observation process: the
interesting variables are not directly accessible and are thus measured in sev-
eral different but largely redundant ways. The determination of a projection
may also follow two different goals.
The first and simplest one aims to just detect and eliminate the depen-
dencies. For this purpose, the projection is determined in order to reduce the
tangled or knotted circles. In other words, topology is used to abstract the in-
trinsic connectivity of objects while ignoring their detailed form. If two objects
have the same topological properties, they are said to be homeomorphic.
The “objects” of topology are formally defined as topological spaces. A
topological space is a set for which a topology is specified [140]. For a set Y, a
topology T is defined as a collection of subsets of Y that obey the following
properties:
• Trivially, ∅ ∈ T and Y ∈ T .
• Whenever two sets are in T , then so is their intersection.
• Whenever two or more sets are in T , then so is their union.
This definition of a topology holds for a Cartesian space (R^D) as well as for graphs. For example, the natural topology associated with R, the set of real numbers, is formed by all unions of open intervals.
From a more geometrical point of view, a topological space can also be defined using neighborhoods and Hausdorff's axioms. The neighborhood of a point y ∈ R^D, also called an ε-neighborhood or infinitesimal open set, is often defined as the open ε-ball B_ε(y), i.e., the set of points inside a D-dimensional hollow sphere of radius ε > 0 centered on y. A set containing an open neighborhood is also called a neighborhood. Then, a topological space is such that
• To each point y there corresponds at least one neighborhood U(y), and
U(y) contains y.
• If U(y) and V(y) are neighborhoods of the same point y, then a neighborhood W(y) exists such that W(y) ⊂ U(y) ∩ V(y).
• If z ∈ U(y), then a neighborhood V(z) of z exists such that V(z) ⊂ U(y).
• For two distinct points, two disjoint neighborhoods of these points exist.
Within this framework, a (topological) manifold M is a topological space that is locally Euclidean, meaning that around every point of M there is a neighborhood that is topologically the same as the open unit ball in R^D. In general, any object that is nearly “flat” on small scales is a manifold. For example, the Earth is spherical but looks flat on the human scale.
As a topological space, a manifold can be compact or noncompact, con-
nected or disconnected. Commonly, the unqualified term “manifold” means
“manifold without boundary”. Open manifolds are noncompact manifolds
without boundary, whereas closed manifolds are compact manifolds without
boundary. If a manifold contains its own boundary, it is called, not surpris-
ingly, a “manifold with boundary”. The closed unit ball B̄1 (0) in RD is a
manifold with boundary, and its boundary is the unit hollow sphere. By defi-
nition, every point on a manifold has a neighborhood together with a home-
omorphism of that neighborhood with an open ball in RD .
An embedding is a representation of a topological object (a manifold, a
graph, etc.) in a certain space, usually RD for some D, in such a way that its
reduction techniques must work with partial and limited data. Second, as-
suming the existence of an underlying manifold allows us to take into account
the support of the data distribution but not its other properties, such as its
density. This may be problematic for latent variable separation, for which a
model of the data density is of prime importance.
Finally, the manifold model does not account for the noise that may corrupt data. In that case, data points no longer lie exactly on the manifold but fly nearby. Hence, regarding terminology, it is correct to write that dimensionality reduction re-embeds a manifold, but, on the other hand, it can also be said that noisy data points are (nonlinearly) projected onto the re-embedded manifold.
Fig. 1.3. Two benchmark manifolds: the “Swiss roll” and the “open box”.
The first manifold, on the left in Fig. 1.3, is called the Swiss roll, after the name of a Swiss-made cake: it is composed of a layer of airy pastry, which is spread with jam and then rolled up. The manifold shown in the figure
represents the thin layer of jam in a slice of Swiss roll. The challenge of the
Swiss roll consists of finding a two-dimensional embedding that “unrolls” it,
in order to avoid superpositions of the successive turns of the spiral and to
obtain a bijective mapping between the initial and final embeddings of the
manifold. The Swiss roll is a noncompact, smooth, and connected manifold.
The second two-manifold of Fig. 1.3 is naturally called the “open box”.
As for the Swiss roll, the goal is to reduce the embedding dimensionality from
three to two. As can be seen, the open box is connected but neither compact
(in contrast with a cube or closed box) nor smooth (there are sharp edges and
corners). Intuitively, it is not so obvious to guess what an embedding of the
open box should look like. Would the lateral faces be stretched? Or torn? Or
would the bottom face be shrunk? Actually, the open box helps to show the
way each particular method behaves.
In practice, all DR methods work with a discrete representation of the
manifold to be embedded. In other words, the methods are fed with a finite
subset of points drawn from the manifold. In the case of the Swiss roll and
open box manifolds, 350 and 316 points are selected, respectively, as shown in
Fig. 1.4. The 350 and 316 available points are regularly spaced, in order to be
Fig. 1.4. A subset of points drawn from the “Swiss roll” and “open box” manifolds
displayed in Fig. 1.3. These points are used as data sets for DR methods in order
to assess their particular behavior. Corners and points on the edges of the box are
shown with squares, whereas points inside the faces are shown as smaller circles. The
color indicates the height of the points in the box or the radius in the Swiss roll. A
lattice connects the points in order to highlight their neighborhood relationships.
2.1 Purpose
This chapter aims at gathering all features or properties that characterize a method of analyzing high-dimensional data. The first section lists some functionalities that the user usually expects. The next sections present more technical characteristics such as the mathematical or statistical model that underlies the method, the type of algorithm that identifies the model parameters, and, last but not least, the criterion optimized by the method. Although the criterion ends the list, it often has a great influence on the other characteristics. Indeed, depending on the criterion, some functionalities are available or not; similarly, the optimization of a given criterion is achieved more easily with some types of algorithm and may be more difficult with others.
[Fig. 2.1: a three-dimensional data set used as a toy example (axes y1, y2, and y3).]
The knowledge of the intrinsic dimension P indicates that the data have some topological structure and do not completely fill the embedding space. Quite naturally, the following step would consist of re-embedding the data in a lower-dimensional space that would be better filled. The aims are both to get the most compact representation and to make any subsequent processing of the data easier. Typical applications include data compression and visualization.
More precisely, if the estimate of the intrinsic dimensionality P is reli-
able, then two assumptions can be made. First, data most probably hide a
P -dimensional manifold.2 Second, it is possible to re-embed the underlying
P -dimensional manifold in a space having dimensionality between P and D,
hopefully closer to P than D.
Intuitively, dimensionality reduction aims at re-embedding data in such
way that the manifold structure is preserved. If this constraint is relaxed,
then dimensionality reduction no longer makes sense. The main problem is,
² Of course, this is not necessarily true, as P is a global estimator and data may be a combination of several manifolds with various local dimensionalities.
Fig. 2.2. Possible two-dimensional embedding for the object in Fig. 2.1. The di-
mensionality of the data set has been reduced from three to two.
Fig. 2.3. Particular two-dimensional embedding for the object in Fig. 2.1. The latent
variables, corresponding to the axes of the coordinate system, are independent from
each other.
latent variable separation, whose result is illustrated in Fig. 2.3, has not been
obtained directly from the three-dimensional data set in Fig. 2.1. Instead,
it has been determined by modifying the low-dimensional representation of
Fig. 2.2. And actually, most methods of latent variable separation are not
able to reduce the dimensionality by themselves: they need another method or
some kind of preprocessing to achieve it. Moreover, the additional constraints
imposed on the desired representation, like statistical independence, mean
that the methods are restricted to very simple data models. For example,
observed variables are most often modeled as linear combinations of the latent
ones, in order to preserve some of their statistical properties.
All methods of analysis rely on the assumption that the data sets they are fed
with have been generated according to a well-defined model. In colorful terms,
the food must be compatible with the stomach: no vegetarian eats meat!
For example, principal component analysis (see Section 2.4) assumes that the dependencies between the variables are linear. Of course, the user should be aware of such a hypothesis, since the type of model determines the power and/or limitations of the method. As a consequence of this model choice, PCA often delivers poor results when trying to project data lying on a nonlinear subspace. This is illustrated in Fig. 2.4, where PCA has been applied to the data set displayed in Fig. 2.1.
Fig. 2.4. Dimensionality reduction by PCA from 3 to 2 for the data set of Fig. 2.1.
Obviously, data do not fit the model of PCA, and the initial rectangular distribution
cannot be retrieved.
Hence, even for the relatively simple toy example in Fig. 2.1, methods
based on a nonlinear data model seem to be preferable. The embedding of
Fig. 2.2 is obtained by such a nonlinear method: the result is visually much
more convincing.
The distinction between linear and nonlinear models is not the only one.
For example, methods may have a continuous model or a discrete one. In the
2.3 Internal characteristics 23
first case, the model parameters completely define a continuous function (or
mapping) between the high- and low-dimensional spaces. In the second case,
the model parameters determine only a few values of such a function.
2.3.2 Algorithm
For the same model, several algorithms can implement the desired method of
analysis. For example, in the case of PCA, the model parameters are com-
puted in closed form by using general-purpose algebraic procedures. Most
often, these procedures work quickly, without any external hyperparameter
to tune, and are guaranteed to find the best possible solution (depending on
the criterion, see ahead). Nevertheless, in spite of many advantages, one of
their major drawbacks lies in the fact that they are so-called batch methods:
they cannot start working until the whole set of data is available.
When data samples arrive one by one, other types of algorithms exist. For example, PCA can also be implemented by so-called online or adaptive algorithms (see Subsection 2.4.4). Each time a new datum is available, online algorithms handle it independently from the previous ones and then ‘forget’ it. Unfortunately, such algorithms do not show the same desirable properties as algebraic procedures:
• By construction, they work iteratively (with a stochastic gradient descent, for example).
• They can fall into a local optimum of the criterion, i.e., find a solution that is not exactly the best one, but only an approximation.
• They often require a careful adjustment of several hyperparameters (e.g., learning rates) to speed up the convergence and avoid the above-mentioned local optima.
Although PCA can be implemented by several types of algorithms, such
versatility does not hold for all methods. Actually, the more complex a model
is, the more difficult it is to compute its parameters in closed form. Along
with the data model, the criterion to be optimized also strongly influences
the algorithm.
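To make the contrast between batch and online algorithms concrete, the following sketch compares a batch SVD solution with Oja's rule, a classical stochastic-gradient estimator of the first principal direction. It is an illustrative addition with arbitrary data and learning rate, not an algorithm prescribed by this chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(3, 2000)) * np.array([[3.0], [1.0], [0.3]])  # D x N, zero-mean

# Batch solution: first left singular vector of the whole data matrix.
v_batch = np.linalg.svd(Y, full_matrices=False)[0][:, 0]

# Online solution: Oja's rule, updating after each observation and then "forgetting" it.
w = rng.normal(size=3)
w /= np.linalg.norm(w)
eta = 0.001  # learning rate (a hyperparameter that must be tuned)
for n in range(Y.shape[1]):
    y = Y[:, n]
    proj = w @ y
    w += eta * proj * (y - proj * w)  # Oja's update keeps ||w|| close to 1

print(abs(v_batch @ w / np.linalg.norm(w)))  # close to 1: same direction up to sign
```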
2.3.3 Criterion
Although the criterion is the last item in this list of method characteristics, it probably plays the most important role. The choice of the criterion often determines which functionalities the method will offer, intervenes in the data model, and always orients the implementation toward a particular type of algorithm.
Typically, the criterion to be optimized is written as a mathematical for-
mula. For example, a well-known criterion for dimensionality reduction is the
mean square error. In order to compute this criterion, the dimensionality is
first reduced and then expanded back, provided that the data model could
As explained in the next section, PCA can be derived from the reconstruction
error. Of course, other criteria exist. For example, statisticians may wish to
get a projection that preserves the variance initially observable in the raw
data. From a more geometrical or topological point of view, the projection
of the object should preserve its structure, for example, by preserving the
pairwise distances measured between the observations in the data set.
If the aim is latent variable separation, then the criterion can be decorre-
lation. This criterion can be further enriched by making the estimated latent
variables as independent as possible. The latter idea points toward indepen-
dent component analysis (ICA), which is out of the scope of this book. The
interested reader can find more details in [95, 34] and references therein.
As shown in the next section, several of the criteria described above, like minimizing the reconstruction error, maximizing the variance preservation, maximizing the distance preservation, or even decorrelating the observed variables, lead to PCA when one considers a simple linear model.
The model of PCA essentially assumes that the D observed variables, gathered
in the random vector y = [y1 , . . . , yd , . . . , yD ]T , result from a linear transfor-
mation W of P unknown latent variables, written as x = [x1 , . . . , xp , . . . , xP ]T :
y = Wx . (2.4)
All latent variables are assumed to have a Gaussian distribution (see Ap-
pendix B). Additionally, transformation W is constrained to be an axis
change, meaning that the columns wd of W are orthogonal to each other
and of unit norm. In other words, the D-by-P matrix W is a matrix such
that WT W = IP (but the permuted product WWT may differ from ID ).
A last important but not too restrictive hypothesis of PCA is that both the
observed variables y and the latent ones x are centered, i.e., Ey {y} = 0D and
Ex {x} = 0P .
Starting from this model, how can the dimension P and the linear transformation W be identified from a finite sample of the observed variables? Usually, the sample is an unordered set of N observations (or realizations) of the random vector y:
Preprocessing
where the left arrow means that the variable on the left-hand side is assigned
a new value indicated in the right-hand side. Of course, the exact expectation
of y is often unknown and must be approximated by the sample mean:
E_{\mathbf{y}}\{\mathbf{y}\} \approx \frac{1}{N} \sum_{n=1}^{N} \mathbf{y}(n) = \frac{1}{N} \mathbf{Y} \mathbf{1}_N .   (2.8)
With the last expression of the sample mean in matrix form, the centering
can be rewritten for the entire data set as
\mathbf{Y} \leftarrow \mathbf{Y} - \frac{1}{N} \mathbf{Y} \mathbf{1}_N \mathbf{1}_N^T .   (2.9)
Once data are centered, P and W can be identified by PCA.
Nevertheless, the data set may need to be further preprocessed. Indeed, the
components yd of the observed vector y may come from very different origins.
For example, in multivariate data analysis, one variable could be a weight
expressed in kilograms and another variable a length expressed in millimeters.
26 2 Characteristics of an Analysis Method
But the same variables could as well be written in other units, like grams
and meters. In both situations, it is expected that PCA detects the same
dependencies between the variables in order to yield the same results. A simple
way to solve this indeterminacy consists of standardizing the variables, i.e.,
dividing each yd by its standard deviation after centering. Does this mean that
the observed variables should always be standardized? The answer is negative,
and actually the standardization could even be dangerous when some variable
has a low standard deviation. Two cases should be distinguished from the
others:
• When a variable is zero, its standard deviation is also zero. Trivially, the
division by zero must be avoided, and the variable should be discarded.
Alternatively, PCA can detect and remove such a useless zero-variable in
a natural way.
• When noise pollutes an observed variable having a small standard devia-
tion, the contribution of the noise to the standard deviation may be pro-
portionally large. This means that discovering the dependency between
that variable and the other ones can be difficult. Therefore, that variable
should intuitively be processed exactly as in the previous case, that is, ei-
ther by discarding it or by avoiding the standardization. The latter could
only amplify the noise. By definition, noise is independent from all other
variables and, consequently, PCA will regard the standardized variable as
an important one, while the same variable would have been a minor one
without standardization.
These two simple cases demonstrate that standardization can be useful but
may not be achieved blindly. Some knowledge about the data set is necessary.
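A minimal sketch of this preprocessing step, given here purely as an illustration (the tolerance below is an arbitrary choice, not a value from the book): variables are centered as in Eq. (2.9), and standardization is applied only to variables whose standard deviation is not negligible.

```python
import numpy as np

def center_and_scale(Y, standardize=True, tol=1e-8):
    """Y is a D x N matrix whose columns are the N observations y(n)."""
    Y = Y - Y.mean(axis=1, keepdims=True)        # centering, Eq. (2.9)
    if standardize:
        std = Y.std(axis=1, keepdims=True)
        keep = std.ravel() > tol                 # discard (near-)constant variables
        Y = Y[keep] / std[keep]
    return Y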
After centering (and standardization if appropriate), the parameters P
and W can be identified by PCA. Of course, the exact values of P and W
depend on the criterion optimized by PCA.
and the reconstruction error is zero. Unfortunately, in almost all real situa-
tions, the observed variables in y are polluted by some noise, or do not fully
respect the linear PCA model, yielding a nonzero reconstruction error. As a
direct consequence, W cannot be identified perfectly, and only an approxima-
tion can be computed.
The best approximation is determined by developing and minimizing the
reconstruction error. According to the definition of the Euclidean norm (see
Subsection 4.2.1), Ecodec successively becomes
where the first term is constant. Hence, minimizing E_codec amounts to maximizing the term E_y{y^T W W^T y}. As only a few observations y(n) are available, the latter expression is approximated by the sample mean:

E_{\mathbf{y}}\{\mathbf{y}^T \mathbf{W} \mathbf{W}^T \mathbf{y}\} \approx \frac{1}{N} \sum_{n=1}^{N} \mathbf{y}(n)^T \mathbf{W} \mathbf{W}^T \mathbf{y}(n)   (2.14)

= \frac{1}{N} \operatorname{tr}(\mathbf{Y}^T \mathbf{W} \mathbf{W}^T \mathbf{Y}) ,   (2.15)
where tr(M) denotes the trace of some matrix M. To maximize this last
expression, Y has to be factored by singular value decomposition (SVD; see
Appendix A.1):
Y = VΣUT , (2.16)
where V, U are unitary matrices and where Σ is a matrix with the same
size as Y but with at most D nonzero entries σd , called singular values and
located on the first diagonal of Σ. The D singular values are usually sorted in
descending order. Substituting in the approximation of the expectation leads
to
E_{\mathbf{y}}\{\mathbf{y}^T \mathbf{W} \mathbf{W}^T \mathbf{y}\} \approx \frac{1}{N} \operatorname{tr}(\mathbf{U} \boldsymbol{\Sigma}^T \mathbf{V}^T \mathbf{W} \mathbf{W}^T \mathbf{V} \boldsymbol{\Sigma} \mathbf{U}^T) .   (2.17)
Since the columns of V and U are orthonormal vectors by construction, it is
easy to see that
for a given P (ID×P is a matrix made of the first P columns of the identity
matrix ID ). Indeed, the above expression reaches its maximum when the P
columns of W are collinear with the columns of V that are associated with the
P largest singular values in Σ. Additionally, it can be trivially proved that
E_codec = 0 for W = V. In the same way, the contribution of a principal component v_d to E_codec equals σ_d², i.e., the squared singular value associated with v_d.
Finally, P -dimensional latent variables are approximated by computing
the product
x̂ = IP ×D VT y . (2.19)
After the estimation of the intrinsic dimensionality, PCA reduces the dimen-
sion by projecting the observed variables onto the estimated latent subspace in
a linear way. Equation (2.19) shows how to obtain P -dimensional coordinates
from D-dimensional ones. In that equation, the dimensionality reduction is
achieved by the factor IP ×D , which discards the eigenvectors of V associ-
ated with the D − P smallest eigenvalues. On the other hand, the factor VT
ensures that the dimensionality reduction minimizes the loss of information.
Intuitively, this is done by canceling the linear dependencies between the ob-
served variables.
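The whole procedure of Eqs. (2.16)-(2.19) fits in a few lines of code. The sketch below is an illustration on arbitrary data; it includes the centering of Eq. (2.9) and multiplies by the truncated basis directly instead of forming the factor I_{P×D} explicitly.

```python
import numpy as np

def pca_svd(Y, P):
    """PCA by SVD. Y is D x N (columns are observations); returns the
    P x N matrix of estimated latent variables and the D x P basis."""
    Y = Y - Y.mean(axis=1, keepdims=True)   # centering, Eq. (2.9)
    V, sigma, Ut = np.linalg.svd(Y, full_matrices=False)  # Y = V Sigma U^T, Eq. (2.16)
    W = V[:, :P]                            # columns with the P largest singular values
    X_hat = W.T @ Y                         # Eq. (2.19), without the explicit I_{PxD} factor
    return X_hat, W

# Example: a random 5-dimensional sample reduced to 2 dimensions.
Y = np.random.default_rng(0).normal(size=(5, 100))
X_hat, W = pca_svd(Y, 2)
print(X_hat.shape, W.T @ W)  # (2, 100) and approximately the 2 x 2 identity
```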
Beyond reduction of data dimensionality, PCA can also separate latent vari-
ables under certain conditions. In Eq. (2.19), the separation is achieved by
the factor VT . As clearly stated in the PCA model, the observed variables
can only be a rotation of the latent ones, which have a Gaussian distribution.
These are, of course, very restrictive conditions that can be somewhat relaxed.
For example, in Eq. (2.4), the columns of W can be solely orthogonal
instead of orthonormal. In this case, the latent variables will be retrieved up
to a permutation and a scaling factor.
Additionally, if all latent variables have a Gaussian distribution but W is
any matrix, then PCA can still retrieve a set of variables along orthogonal
directions. The explanation is that a set of any linear combinations of Gaus-
sian distributions is always equivalent to a set of orthogonal combinations of
Gaussian distributions (see Appendix B).
From a statistical point of view, PCA decorrelates the observed variables
y by diagonalizing the (sample) covariance matrix. Therefore, without consid-
eration of the true latent variables, PCA finds a reduced set of uncorrelated
variables from the observed ones. Actually, PCA cancels the second-order
cross-cumulants, i.e., the off-diagonal entries of the covariance matrix.
Knowing that higher-order cumulants, like the skewness (third order) and
the kurtosis (fourth order), are null for Gaussian variables, it is not difficult
to see that decorrelating the observed variables suffices to obtain fully in-
dependent latent variables. If latent variables are no longer Gaussian, then
higher-order cumulants must be taken into account. This is what is done in
independent component analysis (ICA, [95, 34]), for which more complex algo-
rithms than PCA are able to cancel higher-order cross-cumulants. This leads
to latent variables that are statistically independent.
2.4.4 Algorithms
The different criteria described in that subsection show that PCA can work in two
different ways:
• by SVD (singular value decomposition; see Appendix A.1) of the matrix
Y, containing the available sample.
• by EVD (eigenvalue decomposition; see Appendix A.2) of the sample co-
variance Ĉyy .
Obviously, both techniques are equivalent, at least if the singular values and
the eigenvalues are sorted in the same way:
\hat{\mathbf{C}}_{\mathbf{yy}} = \frac{1}{N} \mathbf{Y} \mathbf{Y}^T   (2.34)

= \frac{1}{N} (\mathbf{V} \boldsymbol{\Sigma} \mathbf{U}^T)(\mathbf{U} \boldsymbol{\Sigma}^T \mathbf{V}^T)   (2.35)

= \mathbf{V} \left(\frac{1}{N} \boldsymbol{\Sigma} \boldsymbol{\Sigma}^T\right) \mathbf{V}^T   (2.36)

= \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^T .   (2.37)
By the way, the last equality shows the relationship between the eigenvalues
and the singular values: λd = σd2 /N . From a numerical point of view, the SVD
of the sample is more robust because it works on the whole data set, whereas
EVD works only on the summarized information contained in the covariance
matrix. As a counterpart, from the computational point of view, SVD is more
expensive and may be very slow for samples containing many observations.
The use of algebraic procedures makes PCA a batch algorithm: all ob-
servations have to be known before PCA starts. However, online or adaptive
versions of PCA exist; several are described in [95, 34]. These implementations
do not offer the same strong guarantees as the algebraic versions, but may
be very useful in real-time applications, where computation time and memory
space are limited.
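The equivalence of the two routes, and the relation λ_d = σ_d²/N, can be checked numerically; the snippet below is only an illustrative verification on random data.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(4, 300))
Y -= Y.mean(axis=1, keepdims=True)
N = Y.shape[1]

sigma = np.linalg.svd(Y, compute_uv=False)      # singular values of Y
lam = np.linalg.eigvalsh(Y @ Y.T / N)[::-1]     # eigenvalues of the sample covariance

print(np.allclose(sigma**2 / N, lam))           # True: lambda_d = sigma_d^2 / N
```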
When estimating the latent variables, it must be pointed out that Eq. (2.19)
is not very efficient. Instead, it is much better to directly remove the unneces-
sary columns in V and to multiply by y afterwards, without the factor IP ×D .
And if PCA works by SVD of the sample, it is noteworthy that
X̂ = IP ×D VT Y (2.38)
= IP ×D VT VΣUT (2.39)
= IP ×D ΣUT . (2.40)
The standardization prevents PCA from treating observed variables with small variances as noise and from discarding them in the dimensionality reduction. On the other hand, the standardization can sometimes amplify variables that are really negligible. User knowledge is very useful to decide whether a scaling is necessary.
In order to illustrate the capabilities as well as the limitations of PCA, toy examples may be artificially generated. For visualization's sake, only two latent variables are created; they are embedded in a three-dimensional space, i.e., three variables are observed. Three simple cases are studied here.
In this first case, the two latent variables, shown in the first plot of Fig. 2.5,
have Gaussian distributions, with variances 1 and 4. The observed variables,
displayed in the second plot of Fig. 2.5, are obtained by multiplying the latent
ones by

\mathbf{W} = \begin{bmatrix} 0.2 & 0.8 \\ 0.4 & 0.5 \\ 0.7 & 0.3 \end{bmatrix} .   (2.41)
As the mixing process (i.e., the matrix W) is linear, PCA can perfectly reduce
the dimensionality. The eigenvalues of the sample covariance matrix are 0.89,
0.11, and 0.00. The number of latent variables is then clearly two, and PCA
reduces the dimensionality without any loss: the observed variables could be
perfectly reconstructed from the estimated latent variables shown in Fig. 2.6.
However, the columns of the mixing matrix W are neither orthogonal nor
normed. Consequently, PCA cannot retrieve exactly the true latent variables.
Yet as the latter have Gaussian distributions, PCA finds a still satisfying re-
sult: in Fig. 2.6, the estimated latent variables have Gaussian distributions
but are scaled and rotated. This is visible by looking at the schematic rep-
resentations of the distributions, displayed as solid and dashed ellipses. The
ellipses are almost identical, but the axes indicating the directions of the true
and estimated latent variables are different.
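This first case is easy to reproduce. The sketch below assumes that the latent variable with variance 4 is x1 (the text does not state which variable has which variance; this assignment reproduces the quoted eigenvalues of roughly 0.89, 0.11, and 0.00).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = np.vstack([2.0 * rng.normal(size=N),      # x1, variance 4 (assumed)
               1.0 * rng.normal(size=N)])     # x2, variance 1
W = np.array([[0.2, 0.8],
              [0.4, 0.5],
              [0.7, 0.3]])                    # mixing matrix of Eq. (2.41)
Y = W @ X                                     # three observed variables

lam = np.linalg.eigvalsh(np.cov(Y))[::-1]
print(lam / lam.sum())                        # approximately [0.89, 0.11, 0.00]
```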
Nonlinear embedding
In this second case, the two latent variables are the same as in the previous
case, but this time the mixing process is nonlinear:
\mathbf{y} = \begin{bmatrix} 4 \cos(\tfrac{1}{4} x_1) \\ 4 \sin(\tfrac{1}{4} x_1) \\ x_1 + x_2 \end{bmatrix} .   (2.42)
Fig. 2.5. Two Gaussian latent variables (1000 observations, displayed in the first
plot) are embedded in a three-dimensional space by a linear mixing process (second
plot). The ellipse schematically represents the joint distribution in both spaces.
The observed variables are shown in Fig. 2.7. Faced with nonlinear dependencies between the observed variables, PCA does not detect that only two latent variables have generated them: the normalized eigenvalues of the sample covariance matrix are 0.90, 0.05, and 0.05. The projection onto the first two principal components is given in Fig. 2.8. PCA is unable to completely reconstruct the curved object displayed in Fig. 2.7 with these two principal components (the reconstruction would be strictly planar!).
Unfortunately, the result of the dimensionality reduction is not the only
disappointing aspect. Indeed, the estimated latent variables are completely
different from the true ones. The schematic representation of the distribution
is totally deformed.
Fig. 2.6. Projection of the three-dimensional observations (second plot of Fig. 2.5)
onto the two first principal components found by PCA. The solid line shows a
schematic representation of the true latent distribution, whereas the dashed one
corresponds to the estimated latent variables.
Non-Gaussian distributions
In this third and last case, the two latent variables are no longer Gaussian. As
shown in the first plot of Fig. 2.9, their distribution is uniform. On the other
hand, the mixing process is exactly the same as in the first case. Therefore, the
eigenvalues of the sample covariance matrix are also identical (actually, very
close, depending on the sample). The dimensionality reduction is performed
without loss. However, the latent variable separation becomes really problem-
atic. As in the first case, the estimated latent variables shown in Fig. 2.10 are
rotated and scaled because the columns of W are not orthonormal. But be-
cause the latent variables have uniform distributions and not Gaussian ones,
the scale factors and rotations make the estimated latent variables no longer
uniformly distributed!
In the ideal situation, when its model is fully respected, PCA appears as a very versatile method to analyze data. It determines the data dimensionality, builds an embedding accordingly, and retrieves the latent variables. In practice, however, the PCA model relies on assumptions that are much too restrictive, especially when it comes to latent variable separation. When only dimensionality reduction is sought, the sole remaining but still annoying assumption imposes that the dependencies between the observed variables are (not far from being) linear.
The three toy examples detailed earlier clearly demonstrate that PCA is not powerful enough to deal with complex data sets. This suggests designing other methods, maybe at the expense of PCA's simplicity and versatility.
Fig. 2.7. Two Gaussian latent variables (1000 observations, displayed in the first
plot) are embedded in a three-dimensional space by a nonlinear mixing process
(second plot). The ellipse schematically represents the joint distribution in both
spaces.
Fig. 2.8. Projection of the three-dimensional observations (second plot of Fig. 2.7)
onto the first two principal components found by PCA. The solid line shows a
schematic representation of the true latent distribution, whereas the dashed one
corresponds to the estimated latent variables.
Fig. 2.9. Two uniform latent variables (1000 observations, displayed in the first
plot) are embedded in a three-dimensional space by a linear mixing process (second
plot). The rectangle schematically represents the joint distribution in both spaces.
Fig. 2.10. Projection of the three-dimensional observations (second plot of Fig. 2.9)
onto the first two principal components found by PCA. The solid line shows a
schematic representation of the true latent distribution, whereas the dashed one
corresponds to the estimated latent variables.
space (or at different instants over time). To this class belong classification
and pattern recognition problems involving images or speech. Most often,
the already difficult situation resulting from the huge number of variables is
complicated by a low number of available samples. Methods with a simple
model and few parameters like PCA are very effective for hard dimensionality
reduction.
Soft dimensionality reduction is suited for problems in which the data are
not too high-dimensional (less than a few tens of variables). Then no drastic
dimensionality reduction is needed. Usually, the components are observed or
measured values of different variables, which have a straightforward interpre-
tation. Many statistical studies and opinion polls in domains like social sci-
ences and psychology fall in this category. By comparison with hard problems
described above, these applications usually deal with sufficiently large sample
sizes. Typical methods include all the usual tools for multivariate analysis.
Finally, visualization problems lie somewhere in-between the two previ-
ous classes. The initial data dimensionality may equal any value. The sole
constraint is to reduce it to one, two, or three dimensions.
The model associated with a method actually refers to the way the method
connects the latent variables with the observed ones. Almost each method
makes different assumptions about this connection. It is noteworthy that this
connection can go in both directions: from the latent to the observed variables
40 2 Characteristics of an Analysis Method
or from the observed to the latent variables. Most methods use the second solution, which is the simplest one, since it goes in the same direction as the analysis itself: the goal is to obtain an estimate of the latent variables starting from the observed ones. More principled methods prefer the
first solution: they model the observed variables as a function of the unknown
latent variables. This more complex solution better corresponds to the real
way data are generated but often implies that those methods must go back
and forth between the latent and observed variables in order to determine the
model parameters. Such generative models are seldom encountered in the field
of dimensionality reduction.
A third distinction about the data model regards its continuity. For example,
the model of PCA given in Section 2.4, is continuous: it is a linear transform of
the variables. On the other hand, the model of an SOM is discrete: it consists
of a finite set of interconnected points.
The continuity is a very desirable property when the dimensionality re-
duction must be generalized to other points than those used to determine the
model parameters. When the model is continuous, the dimensionality reduc-
tion is often achieved by using a parameterized function or mapping between
the initial and final spaces. In this case, applying the mapping to new points
yields their coordinates in the embedding. With only a discrete model, new
points cannot be so easily re-embedded: an interpolation procedure is indeed
necessary to embed in-between points.
A fourth distinction about the model, which is closely related to the previous
one, regards the way a method maps the high- and low-dimensional spaces.
Mainly two classes of mappings exist: explicit and implicit.
An explicit mapping consists of directly associating a low-dimensional rep-
resentation with each data point. Hence, using an explicit mapping clearly
means that the data model is discrete and that the generalization to new
points may be difficult. Sammon’s nonlinear mapping (see Subsection 4.2.3)
is a typical example of explicit mapping. Typically, the parameters of such a
mapping are coordinates, and their number is proportional to the number of
observations in the data set.
On the other hand, an implicit mapping is defined as a parameterized function. For example, the parameters in the model of PCA define a hyperplane. Clearly, there is no direct connection between those parameters and the coordinates of the observations stored in the data set. Implicit mappings often originate from continuous models, and generalization to new points is usually straightforward.
A third, intermediate class of mappings also exists. In this class may be gathered all models that define a mapping by associating a low-dimensional representation not with each data point, but with a subset of data points. In this case, the number of low-dimensional representations does not depend strictly on the number of data points, and the low-dimensional coordinates may be considered as generic parameters, although they have a straightforward geometric meaning. All DR methods, like SOMs, that involve some form of vector quantization (see Subsection 2.5.9 ahead) belong to this class.
manifold, what would be the justification to cut it up into small pieces? One of
the goals of dimensionality reduction is precisely to discover how the different
parts of a manifold connect to each other. When using disconnected pieces,
part of that information is lost, and furthermore it becomes impossible to
visualize or to process the data as a whole.
The most widely known method using several coordinate systems is certainly the local PCA introduced by Kambhatla and Leen [101]. They perform a simple vector quantization (see ahead or Appendix D for more details) on the data in order to obtain a tessellation of the manifold. More recent papers follow a similar approach but also propose very promising techniques to patch together the manifold pieces in the low-dimensional embedding space. For example, a nonparametric technique is given in [158, 166], whereas probabilistic ones are studied in [159, 189, 178, 29].
When the amount of available data is very large, the user may decide to work
with a smaller set of representative observations. This operation can be done
automatically by applying a method of vector quantization to the data set
(see App. D for more details). Briefly, vector quantization replaces the origi-
nal observations in the data set with a smaller set of so-called prototypes or
centroids. The goal of vector quantization consists of reproducing as well as
possible the shape of the initial (discrete) data distribution with the proto-
types.
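As an illustration (not part of the original text), the classical k-means algorithm is one possible way to perform such a vector quantization; Appendix D discusses vector quantization in more detail. The sketch below replaces a data set by a small number of prototypes.

```python
import numpy as np

def kmeans(Y, n_prototypes, n_iter=50, seed=0):
    """Simple Lloyd iteration. Y is N x D; returns the prototype matrix."""
    rng = np.random.default_rng(seed)
    C = Y[rng.choice(len(Y), n_prototypes, replace=False)]   # initial prototypes
    for _ in range(n_iter):
        # assign each observation to its nearest prototype
        labels = np.argmin(((Y[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        # move each prototype to the centroid of its assigned observations
        for k in range(n_prototypes):
            if np.any(labels == k):
                C[k] = Y[labels == k].mean(axis=0)
    return C

Y = np.random.default_rng(1).normal(size=(1000, 3))
prototypes = kmeans(Y, 20)     # 20 prototypes summarize 1000 observations
print(prototypes.shape)        # (20, 3)
```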
Unfortunately, the ideal case where data are overabundant seldom happens
in real applications. Therefore, the user often skips the vector quantization
and keeps the initial data set. However, some methods are designed in such
a way that vector quantization is mandatory. For example, SOMs belong to
this class (see Chapter 5).
Last but not least, the criterion that guides the dimensionality reduction is
probably the most important characteristic of a DR method, even before the
model specification. Actually, the data model and the algorithm are often
fitted in order to satisfy the constraints imposed by the chosen criterion.
³ An infinite number, according to the theory; see [156].
Fig. 3.1. A space-filling curve. This curve, invented by Hilbert in 1891 [86], is a one-dimensional object that evolves iteratively and progressively fills a square, which is a two-dimensional object! The first six iteration steps that are displayed show how the curve is successively refined, folded on itself much like a cabbage leaf.
the smallest such integer. For example, the Lebesgue covering dimension of the usual Euclidean space R^D is D.
Technically, the topological dimension is very difficult to estimate if only a finite set of points is available. Hence, practical methods use various other definitions of the intrinsic dimension. The most usual ones are related to the fractal dimension, whose estimators are studied in Section 3.2. Other definitions are based on DR methods and are summarized in Section 3.3.
Before going into further details, it is noteworthy that the estimation of
the intrinsic dimension should remain coherent with the DR method: an es-
timation of the dimension with a nonlinear model, like the fractal dimension,
makes no sense if the dimensionality reduction uses a linear model, like PCA.
by Mandelbrot in his pioneering work [131, 132, 133], the length of such a
coastline is different depending on the length ruler used to measure it. This
paradox is known as the coastline paradox: the shorter the ruler, the longer
the length measured.
The term fractal dimension [200] sometimes refers to what is more com-
monly called the capacity dimension (see Subsection 3.2.2). However, the term
can also refer to any of the dimensions commonly used to characterize frac-
tals, like the capacity dimension, the correlation dimension, or the information
dimension. The q-dimension unifies these three dimensions.
where B̄_ε(y) is the closed ball of radius ε centered on y. Then, according to Pesin's definition [151, 152], for q ≥ 0, q ≠ 1, the lower and upper q-dimensions of μ are

D_q^{-}(\mu) = \liminf_{\epsilon \to 0} \frac{\log C_q(\mu, \epsilon)}{(q - 1) \log \epsilon} ,   (3.2)

D_q^{+}(\mu) = \limsup_{\epsilon \to 0} \frac{\log C_q(\mu, \epsilon)}{(q - 1) \log \epsilon} .   (3.3)
If Dq− (μ) = Dq+ (μ), their common value is denoted Dq (μ) and is called the
q-dimension of μ. It is expected that Dq (μ) exists for sufficiently regular frac-
tal measures (smooth manifolds trivially fulfill this condition). For such a
measure, the function q → Dq (μ) is called the dimension spectrum of μ.
An alternative definition for D_q^{-}(μ) and D_q^{+}(μ) originates from the physics literature [83]. For ε > 0, instead of using closed balls, the support of μ is covered with a (multidimensional) grid of cubes with edge length ε. Let N(ε) be the number of cubes that intersect the support of μ, and let the natural measures of these cubes be p_1, p_2, ..., p_{N(ε)}. Since the p_i may be seen as the probability that these cubes are populated, they are normalized:

\sum_{i=1}^{N(\epsilon)} p_i = 1 .   (3.4)
Then
D_q^-(\mu) = \liminf_{\epsilon \to 0} \frac{\log \sum_{i=1}^{N(\epsilon)} p_i^q}{(q-1)\log\epsilon} ,   (3.5)
D_q^+(\mu) = \limsup_{\epsilon \to 0} \frac{\log \sum_{i=1}^{N(\epsilon)} p_i^q}{(q-1)\log\epsilon} .   (3.6)
Fig. 3.2. Koch’s island (or snowflake) [200]. This classical fractal object was first
described by Helge von Koch in 1904. As shown in the bottom of the figure, it
is built by starting with an equilateral triangle, removing the inner third of each
side, replacing it with two edges of a three-times-smaller equilateral triangle, and
then repeating the process indefinitely. This recursive process can be encoded as a
Lindenmayer system (a kind of grammar) with initial string S(0) = ‘F−−F−−F’
and string-rewriting rule ‘F’ → ‘F+F−−F+F’. In each string S(i), ‘F’ means “Go
forward and draw a line segment of given length”, ‘+’ means “Turn on the left
with angle 13 π” and ‘−’ means “Turn on the right with angle 13 π”. The drawings
corresponding to strings S(0) to S(3) are shown in the bottom of the figure, whereas
the main representation is the superposition of S(0) to S(4). Koch’s island is a typical
illustration of the coastline paradox: the length of the island boundary depends on
the ruler used to measure it; the shorter the ruler, the longer the coastline. Actually,
it is easy to see that for the representation of S(i), the length of the line segment is
i
L(i) = L∇ 13 , where L∇ = L(0) is the side length of the initial triangle. Similarly,
the number of corners is N (i) = 3 · 4i . Then the true perimeter associated with S(i)
i
is l(i) = L(i)N (i) = 3L∇ 43 .
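The recursive construction in the caption is easy to reproduce. The following minimal
turtle-style sketch (Python/NumPy; the function name koch_island and the plain ASCII
'+'/'-' symbols are choices made here) rewrites the string S(i) and converts it into the
3 · 4^i corners of the island; corners of this kind are used as the data set of Fig. 3.3 below.

import numpy as np

def koch_island(iterations=4, side=1.0):
    # S(0) = 'F--F--F'; string-rewriting rule: 'F' -> 'F+F--F+F'; turn angle pi/3
    s = "F--F--F"
    for _ in range(iterations):
        s = s.replace("F", "F+F--F+F")
    step = side / 3 ** iterations                # L(i) = L(0) * (1/3)^i
    pos, angle = np.zeros(2), 0.0
    corners = [pos.copy()]
    for c in s:
        if c == "F":                             # go forward and record a corner
            pos = pos + step * np.array([np.cos(angle), np.sin(angle)])
            corners.append(pos.copy())
        elif c == "+":                           # turn to the left by pi/3
            angle += np.pi / 3
        elif c == "-":                           # turn to the right by pi/3
            angle -= np.pi / 3
    return np.array(corners)

corners = koch_island(4)   # 3 * 4**4 segments; the starting corner is repeated at the end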
When setting q equal to zero in the second definition (Eqs. (3.5) and (3.6))
and assuming that the equality D_q^-(μ) = D_q^+(μ) holds, one gets the capacity
dimension [200, 152]:
d_{cap} = D_0(\mu) = \lim_{\epsilon \to 0} \frac{\log \sum_{i=1}^{N(\epsilon)} p_i^0}{(0-1)\log\epsilon}
        = -\lim_{\epsilon \to 0} \frac{\log \sum_{i=1}^{N(\epsilon)} 1}{\log\epsilon}
        = -\lim_{\epsilon \to 0} \frac{\log N(\epsilon)}{\log\epsilon} .   (3.7)
In this definition, d_cap does not depend on the natural measures p_i. In practice,
d_cap is also known as the 'box-counting' dimension [200]. When the manifold is
not known analytically and only a few data points are available, the capacity
dimension is quite easy to estimate:
1. Determine the hypercube that circumscribes all the data points.
2. Decompose the obtained hypercube into a grid of smaller hypercubes with
edge length ε (these "boxes" explain the name of the method).
3. Determine N(ε), the number of hypercubes that are occupied by one or
several data points.
4. Apply the log function, and divide by log ε.
5. Compute the limit when ε tends to zero; this is d_cap.
Unfortunately, the limit to be computed in the last step is the sole obstacle to
the overall simplicity of the technique. Subsection 3.2.6 below gives some hints
to circumvent this obstacle; the sketch below follows one of them and replaces
the limit with a slope fitted over a range of ε values.
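As a concrete illustration, here is a minimal NumPy sketch of steps 1 to 4; in line with
the remark above, the limit of step 5 is replaced by the slope of log N(ε) versus log ε
fitted over a range of grid sizes. The function name, the dyadic choice of ε, and the
number of scales are assumptions of this sketch, not prescriptions of the method.

import numpy as np

def capacity_dimension(Y, n_scales=6):
    """Box-counting estimate of d_cap for data Y (one observation per row).
    Steps 1-4 of the procedure; the limit of step 5 is replaced by the slope
    of log N(eps) versus log eps over a range of grid sizes."""
    Y = np.asarray(Y, dtype=float)
    mins = Y.min(axis=0)
    span = (Y.max(axis=0) - mins).max()          # circumscribing hypercube (step 1)
    log_eps, log_N = [], []
    for k in range(1, n_scales + 1):
        eps = span / 2 ** k                      # edge length of the boxes (step 2)
        boxes = np.floor((Y - mins) / eps).astype(int)
        N = len({tuple(b) for b in boxes})       # number of occupied boxes (step 3)
        log_eps.append(np.log(eps))
        log_N.append(np.log(N))                  # step 4
    slope = np.polyfit(log_eps, log_N, 1)[0]     # log N(eps) ~ -d_cap * log eps
    return -slope

rng = np.random.default_rng(0)
square = rng.uniform(-1, 1, size=(10000, 2))
print(capacity_dimension(square))                # should be close to 2 for a filled square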
The intuitive interpretation of the capacity dimension is the following. As-
suming a three-dimensional space divided in small cubic boxes with a fixed
edge length , the box-counting dimension is closely related to the propor-
tion of occupied boxes. For a growing one-dimensional object placed in this
compartmentalized space, the number of occupied boxes grows proportion-
ally to the object length. Similarly, for a growing two-dimensional object, the
number of occupied boxes grows proportionally to the object surface. Finally,
for a growing three-dimensional object, the number of occupied boxes grows
proportionally to the object volume. Generalizing to a P -dimensional object
like a P -manifold embedded in RD , one gets
N(\epsilon) \propto \epsilon^{-P} .   (3.8)
And, trivially,
P \propto -\frac{\log N(\epsilon)}{\log\epsilon} .   (3.9)
To complete the analogy, the hypothesis of a growing object has to be replaced
with the reciprocal one: the size of the object remains unchanged but the edge
length of the boxes decreases, yielding the precise estimate of the dimension
at the limit.
As an illustration, the capacity dimension can be computed analytically
for the coastline of Koch’s island (Fig. 3.2). In this particular case, the devel-
opment is made easier by choosing triangular boxes. As shown in the caption
to Fig. 3.2, the length of the line segment for the ith iteration of the Lin-
denmayer system is L(i) = L_∇ 3^{-i}, where L_∇ = L(0) is the edge length of
the initial triangle. The number of corners is N(i) = 3 · 4^i. It is easy to see
that if the grid length equals L(i), the number of occupied boxes is N (i).
Consequently,
d_{cap} = -\lim_{i \to \infty} \frac{\log N(i)}{\log L(i)}
        = -\lim_{i \to \infty} \frac{\log(3 \cdot 4^i)}{\log(L_\nabla 3^{-i})}
        = -\lim_{i \to \infty} \frac{\log 3 + i \log 4}{\log L_\nabla - i \log 3}
        = \frac{\log 4}{\log 3} = 1.261859507 .   (3.10)
Since the p_i are normalized (Eq. (3.4)), the numerator of the right factor
trivially tends to zero:
\lim_{q \to 1} \log \sum_{i=1}^{N(\epsilon)} p_i^q = \log \sum_{i=1}^{N(\epsilon)} p_i = \log 1 = 0 ,   (3.13)
\lim_{q \to 1} (q - 1) = 0 .   (3.14)
Hence, using l’Hospital’s rule, the numerator and denominator can be replaced
with their respective derivatives:
d_{inf} = \lim_{q \to 1} D_q(\mu)
        = \lim_{\epsilon \to 0} \frac{1}{\log\epsilon} \lim_{q \to 1} \frac{\sum_{i=1}^{N(\epsilon)} p_i^q \log p_i}{1}
        = \lim_{\epsilon \to 0} \frac{\sum_{i=1}^{N(\epsilon)} p_i \log p_i}{\log\epsilon} .   (3.15)
It is noteworthy that the numerator in the last expression resembles Shannon’s
entropy in information theory [40], justifying the name of D1 (μ).
The information dimension is mentioned here just for the sake of complete-
ness. Because the pi are seldom known when dealing with a finite number of
samples, its evaluation remains difficult, except when the pi are assumed to
be equal, meaning that all occupied boxes have the same probability to be
visited:
1
∀i, pi = . (3.16)
N ()
In this case, it turns out that the information dimension reduces to the ca-
pacity dimension:
d_{inf} = \lim_{\epsilon \to 0} \frac{\sum_{i=1}^{N(\epsilon)} N(\epsilon)^{-1} \log N(\epsilon)^{-1}}{\log\epsilon}
        = -\lim_{\epsilon \to 0} \frac{\log N(\epsilon)}{\log\epsilon}
        = d_{cap} .   (3.17)
C_2(\epsilon) = \lim_{N \to \infty} \frac{2}{N(N-1)} \sum_{\substack{i=1 \\ i<j}}^{N} H(\epsilon - \|y(i) - y(j)\|_2)   (3.18)
            = P(\|y(i) - y(j)\|_2 \leq \epsilon) ,   (3.19)
d_{cor} = D_2 = \lim_{\epsilon \to 0} \frac{\log C_2(\epsilon)}{\log\epsilon} .   (3.21)
Like the capacity dimension, this discrete formulation of the correlation di-
mension no longer depends on the natural measures p_i of the support of μ. Given
a set of points Y, the correlation dimension is easily estimated by the following
procedure (a short sketch of the first two steps is given below):
1. Compute the distances for all possible pairs of points {y(i), y(j)}.
2. Determine the proportion of distances that are less than or equal to ε.
3. Apply the log function, and divide by log ε.
4. Compute the limit when ε tends to zero; this is d_cor.
The second step yields only an approximation Ĉ_2(ε) of C_2(ε) computed with
the available N points. But again, the difficult step is the last one. Subsection 3.2.6
brings some useful hints.
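A minimal sketch of the first two steps could look as follows; the function name is a
choice made here, and SciPy's pdist is used only to gather all pairwise Euclidean
distances.

import numpy as np
from scipy.spatial.distance import pdist

def correlation_sum(Y, eps_values):
    """Estimated correlation sum C2_hat(eps): the proportion of point pairs
    whose Euclidean distance is less than or equal to eps (steps 1 and 2)."""
    d = np.sort(pdist(np.asarray(Y, dtype=float)))   # all pairwise distances (step 1)
    # fraction of pairwise distances <= eps, for each requested threshold (step 2)
    return np.searchsorted(d, np.asarray(eps_values), side="right") / d.size

# Typical usage: eps = np.exp(np.linspace(-6, 0, 30)); C2_hat = correlation_sum(data, eps)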
Intuitively, the interpretation of the correlation dimension is very similar
to the one associated with the capacity dimension. Instead of adopting a global
point of view (the number of boxes that an object or manifold occupies), a
closer view is necessary. When looking at the data set on the scale of a single
point, C_2(ε) is the number of neighboring points lying closer than a certain
threshold ε. This number grows as a length for a 1D object, as a surface for
a 2D object, as a volume for a 3D object, and so forth. Generalizing for P
dimensions gives
C_2(\epsilon) \propto \epsilon^P .   (3.22)
And again,
P \propto \frac{\log C_2(\epsilon)}{\log\epsilon} .   (3.23)
(N(7) = 3 · 4^7 = 49,152), which is represented in the first plot of Fig. 3.3.
(Only corners can be taken into account since sides are indefinitely refined.)
The log-log plot of the estimated correlation sum Ĉ_2(ε) is displayed in the
Fig. 3.3. Correlation dimension of Koch's island. The first plot shows the coastline,
whose corners are the data set for the estimation of the correlation dimension. The
log-log plots of the estimated correlation sum Ĉ_2(ε) and its numerical derivative are
displayed below.
second plot of Fig. 3.3. Obviously, as the data set is generated artificially,
the result is nearly perfect: the slope of the curve is almost constant between
ε_1 ≈ exp(−6) ≈ 0.0025 and ε_2 ≈ exp(0) = 1. However, the manual adjustment
of a line onto the curve is a tedious task for the user.
Alternatively, the correlation dimension can be estimated by computing
the numerical derivative of log Ĉ_2(exp υ), with υ = log ε:
\hat{d}_{cor} = \frac{d}{d\upsilon} \log \hat{C}_2(\exp\upsilon) .   (3.30)
This amounts to computing the slope of Ĉ_2(ε) in a log-log plot.
For any function f(x) known at regularly spaced values of x, the numerical
derivative can be computed as a second-order estimate, written as
f'(x) = \frac{f(x + \Delta x) - f(x - \Delta x)}{2\Delta x} + O(\Delta x^2)   (3.31)
and based on Taylor’s polynomial expansion of an infinitely differentiable
function f(x). The numerical derivative of log Ĉ_2(exp υ) directly yields the
dimension for any value of υ = log ε; the result is displayed in the third plot
of Fig. 3.3 for the coastline of Koch’s island. As expected, the estimated corre-
lation dimension is very close to the capacity dimension computed analytically
at the end of Subsection 3.2.2.
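Under the same assumptions as the previous sketch, the central difference of Eq. (3.31)
can be applied to log Ĉ_2(exp υ) on a regular grid of υ = log ε; the function below (its
name and the default number of scales are choices made here) returns the kind of
scale-dependent estimate shown in the third plots of Figs. 3.3 and 3.4.

import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension_curve(Y, n_scales=50):
    """Scale-dependent estimate of d_cor: central difference (Eq. (3.31)) of
    log C2_hat(exp(upsilon)) on a regular grid of upsilon = log eps."""
    d = np.sort(pdist(np.asarray(Y, dtype=float)))
    d = d[d > 0]                                     # ignore coincident points
    upsilon = np.linspace(np.log(d[0]), np.log(d[-1]), n_scales)
    C2 = np.searchsorted(d, np.exp(upsilon), side="right") / d.size
    logC2 = np.log(np.maximum(C2, 1.0 / d.size))     # guard against log(0)
    h = upsilon[1] - upsilon[0]
    slope = (logC2[2:] - logC2[:-2]) / (2.0 * h)     # f'(x) ~ (f(x+h) - f(x-h)) / (2h)
    return upsilon[1:-1], slope                      # d_cor estimate at each scale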
The use of a numerical derivative is usually criticized because it yields a
chopping and changing estimate. Nevertheless, in normal situations, it works
rather well, is visually more satisfying, and, last but not least, provides a
result that proves a bit less user-dependent. Indeed, the manual adjustment
of a line onto a curve is essentially a matter of personal perception.
Finally, the following example illustrates the fact that the estimated cor-
relation dimension depends on the observation scale. The manifold is a spiral,
written as
y = \sqrt{x} \begin{bmatrix} \cos(10\pi\sqrt{x}) \\ \sin(10\pi\sqrt{x}) \end{bmatrix} + n ,   (3.32)
where the unique parameter x goes from 0 to 1 and n is white Gaussian noise
with standard deviation 0.005. By construction, this spiral is a 1-manifold
embedded in R2 , and a quick look at the first plot of Fig. 3.4 confirms it
visually. However, the correlation dimension gives a more nuanced result
(second and third plots of Fig. 3.4). In the second plot, from left to right, the
correlation sum grows steadily, then seems to "slow down", and finally
grows again until it reaches its maximal value. The explanation of this
behavior can be found in the third plot, by considering the derivative:
1. For extremely small values of ε, the correlation sum remains on the scale
of isolated points. This interval is shown as the black box, the same color
as the points of the spiral in the first plot. These points are 0-manifolds,
and the estimated dimension is indeed low.
2. For barely larger values of ε, the correlation sum measures the dimension
of the noise. This interval is shown as a dark gray box, which corresponds
to the small square box of the same color in the first plot. As noise occupies
all dimensions of space, the dimension is two.
3. For still larger values of ε, the correlation sum begins to take into account
entire pieces of the spiral curve. Such a piece is shown in the light gray
rectangular box in the first plot. On this scale, the spiral is a 1-manifold,
as intuitively expected and as confirmed by the estimated correlation di-
mension.
4. For values of ε close to the maximal diameter of the spiral, the correlation
dimension encompasses distances across the whole spiral. These values
are shown as a white box in the third plot, corresponding to the entire
white box surrounding the spiral in the first plot. On this scale the spiral
Fig. 3.4. Correlation dimension of a noisy spiral. The first plot shows the data
set (10,000 points). The log-log plots of the estimated correlation sum Ĉ_2(ε) and
its numerical derivative are displayed below. The black, dark gray, light gray, and
white boxes in the third plot illustrate that the correlation dimension depends on the
observation scale. They correspond, respectively, to the scale of the isolated points,
the noise, pieces of the spiral curve, and the whole spiral.
appears as a plane with some missing points, and indeed the dimension
equals two.
5. For values of ε far beyond the diameter, the correlation dimension sees
the spiral as a smaller and smaller fuzzy spot. Intuitively, this amounts
to zooming out in the first plot of Fig. 3.4. This explains why the dimension
vanishes for very large values of ε (no box is drawn).
All those variations of the estimated correlation dimension are usually called
microscopic effects (1 and 2), lacunarity effects (3), and macroscopic effects
(4 and 5) [174]. Other macroscopic effects that are not illustrated here are, for
example, side and corner effects. When computing the correlation
dimension of a square, for instance, the number of points inside a ball of radius ε is
always proportional to ε². However, a multiplicative coefficient should be taken into
account. Assuming that inside the square this coefficient equals 1, then near
a side it is only 1/2, and near a corner it further decreases toward 1/4.
3.3 Other dimension estimators 59
Therefore, the estimated dimension not only depends on scale but also on the
“location” in space where it is estimated!
The idea behind local methods consists of decomposing the space into small
patches, or “space windows”, and to consider each of them separately. To
some extent, this idea is closely related to the use of boxes and balls in the
capacity and correlation dimensions.
The most widely known local method is based on the nonlinear general-
ization of PCA already sketched in Subsection 2.5.8. Briefly put, the space
windows are determined by clustering the data. Usually, this is achieved by
vector quantization (see Appendix D). In a few words, vector quantization
processes a set of points by replacing it with a smaller set of “representative”
points. Usually, the probability distribution function of these points resembles
that of the initial data set, but their actual distribution is, of course, much
sparser. If each point is mapped to the closest representative point, then the
space windows are defined as the subsets of points that are mapped to the
same representative point. Next, PCA is carried out locally, on each space
window, assuming that the manifold is approximately linear on the scale of a
window. Finally, the dimensionality of the manifold is obtained as the average
estimate yielded by all local PCAs. Usually, each window is weighted by the
number of points it contains before computing the mean.
Moreover, it is noteworthy that not just the mean can be computed: other
statistics, like standard deviations or minimal and maximal values, may help
to check that the dimensionality remains (nearly) identical over all space win-
dows. Hence, local PCA can detect spatial variations of the intrinsic dimen-
sionality. This is a major difference from other methods, like the fractal dimensions,
which usually assume that dimensionality is a global property of the data.
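A rough sketch of this estimator is given below. A few k-means iterations stand in for
the vector quantization of Appendix D, and the function name, the number of
quantization steps, and the variance threshold are assumptions of the sketch (the
threshold mirrors the 0.97 to 0.99 fractions used in Fig. 3.5).

import numpy as np
from scipy.spatial.distance import cdist

def local_pca_dimension(Y, n_windows=70, var_threshold=0.98, n_vq_iter=20, seed=0):
    """Local PCA estimate of the intrinsic dimension: quantize the data into
    space windows, run PCA inside each window, and average the local
    dimensions, weighting each window by the number of points it contains."""
    Y = np.asarray(Y, dtype=float)
    rng = np.random.default_rng(seed)
    # crude vector quantization (a few k-means steps, standing in for App. D)
    proto = Y[rng.choice(len(Y), n_windows, replace=False)].copy()
    for _ in range(n_vq_iter):
        labels = cdist(Y, proto).argmin(axis=1)
        for c in range(n_windows):
            if np.any(labels == c):
                proto[c] = Y[labels == c].mean(axis=0)
    labels = cdist(Y, proto).argmin(axis=1)              # final window membership
    dims, weights = [], []
    for c in range(n_windows):
        W = Y[labels == c]
        if len(W) < 3:
            continue                                     # too few points for a local PCA
        lam = np.maximum(np.linalg.eigvalsh(np.cov(W.T))[::-1], 0.0)  # eigenvalues, decreasing
        ratio = np.cumsum(lam) / lam.sum()               # cumulative variance fraction
        dims.append(int(np.searchsorted(ratio, var_threshold)) + 1)
        weights.append(len(W))
    return np.average(dims, weights=weights)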
For the noisy spiral of Fig. 3.4, the local PCA approach yields the result
shown in Fig. 3.5. The first plot is a copy of the spiral data set, but the bound-
aries of the space windows are added in gray (70 windows have been built).
The second plot shows the fraction of variance spanned by the first principal
component of each space window, as a function of the number of space win-
dows. In the third plot, the three curves indicate the dimensionality for three
variance thresholds (0.97, 0.98, and 0.99). As can be seen, the dimension given
by local PCA is scale-dependent, like the correlation dimension. Actually, the
scale is implicitly determined by the number of space windows. If this num-
ber is too low, the windows are too large and PCA “sees” the macroscopic
structure of the spiral, which is two-dimensional. At nearly 70, the value that
corresponds to the number of space windows in the first plot, the size of the
windows is close to the optimum and PCA “sees” small pieces of the spiral
curve: the dimension is one. If the number of windows further increases, the
windows become too small: the noise scale is attained and PCA needs two
components to explain the variance.
By comparison with the fractal dimensions like the correlation dimension,
the local PCA requires more data samples to yield an accurate estimate. This
is because local PCA works by dividing the manifold into nonoverlapping
patches. On the contrary, the correlation dimension places a ball on each
point of the data set. As a counterpart, local PCA is faster (O(N )) than the
correlation dimension (O(N 2 )), at least for a single run. Otherwise, if local
PCA is repeated for many different numbers of space windows, as in Fig. 3.5,
then the computation time grows.
The local PCA approach was proposed by Kambhatla and Leen [101]
as a DR method. Because this method does not naturally provide an embedding in a
single coordinate system, it has not met with much success,
except in data compression. Fukunaga and Olsen [72], on the other hand,
followed the same approach more than two decades before Kambhatla and
Leen in order to estimate the intrinsic dimensionality of data.
Fig. 3.5. Intrinsic dimensionality of the noisy spiral shown in Fig. 3.4, estimated
by local PCA. The first plot shows the spiral again, but the boundaries of the space
windows are added in gray (70 windows). The second plot shows the fraction of
the total variance spanned by the first principal component of each cluster or space
window. This fraction is actually computed as an average for different numbers
of windows (in abscissa). The third plot shows the corresponding dimensionality
(computed by piecewise linear interpolation) for three variance fractions (0.97, 0.98,
and 0.99).
any change. If P = 0, then the error reaches its maximal value, equal to the
global variance (tr(Cyy )). For 0 < P < D, the error varies between these
two extrema but cannot be predicted exactly. However, one may expect that
the error will remain low if P is greater than the intrinsic dimensionality of
the manifold to be embedded. On the contrary, if P goes below the intrinsic
dimensionality, the dimensionality reduction may cause a sudden increase in
Ecodec .
With these ideas in mind, the following procedure can yield an estimate of
the intrinsic dimensionality:
1. For a manifold embedded in a D-dimensional space, reduce dimensionality
successively to P = 1, 2, . . . , D; of course, if some additional information
about the manifold is available, the search interval may be smaller.
2. Plot Ecodec as a function of P .
3. Choose a threshold, and determine the lowest value of P such that Ecodec
goes below it: this is the estimate of the intrinsic dimensionality of the
manifold.
The choice of the threshold in the last step is critical since the user determines
it arbitrarily. But most of the time the curve Ecodec versus P is very explicit:
an elbow is usually clearly visible when P equals the intrinsic dimensionality.
An example is given in the following section.
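As a minimal illustration of the procedure, the sketch below lets PCA stand in for the
DR method, in which case E_codec for a given P is simply the sum of the discarded
eigenvalues of the covariance matrix; with a genuine NLDR method the embedding
would have to be recomputed for every value of P. The function name and the
(arbitrary) threshold are assumptions of this sketch.

import numpy as np

def trial_and_error_dimension(Y, threshold=0.05):
    """Trial-and-error estimate of the intrinsic dimension with PCA as the
    (here, linear) DR method: the estimate is the smallest P whose relative
    reconstruction error E_codec / tr(C_yy) drops below the chosen threshold."""
    Y = np.asarray(Y, dtype=float)
    lam = np.linalg.eigvalsh(np.cov(Y.T))[::-1]   # covariance eigenvalues, decreasing
    total = lam.sum()                             # tr(C_yy): the error for P = 0
    errors = total - np.cumsum(lam)               # E_codec for P = 1, ..., D (PCA case)
    rel = errors / total
    P = int(np.argmax(rel < threshold)) + 1       # first P below the threshold
    return P, rel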
An additional refinement of the procedure consists of using statistical esti-
mation methods like cross validation or bootstrapping. Instead of computing
an embedding for a certain dimensionality only once, those methods repeat
the dimensionality reduction on several subsets that are randomly drawn from
the available data. This results in a better estimation of the reconstruction
errors, and therefore in a more faithful estimation of the dimensionality at the
elbow.
The main disadvantage of the procedure, especially with cross validation
or bootstrapping, lies in its huge computational requirements. This is partic-
ularly annoying when the user is not interested in the embedding, but merely
in the value of the intrinsic dimensionality. Moreover, if the DR method is
not incremental (i.e., does not produce incremental embeddings, see Subsec-
tion 2.5.7), the computation time dramatically increases.
3.4 Comparisons
This section attempts to compare the above-mentioned methods in the case
of an artificially generated manifold whose dimensionality is known before-
hand. The first and simplest method is PCA; the others are the correlation
dimension, the local PCA, and finally the “trial and error” method.
x1 x2 x3
+0.026 +0.241 +0.026
+0.236 +0.193 −0.913
−0.653 +0.969 −0.700
+0.310 +0.094 +0.876
+0.507 +0.756 +0.216
−0.270 −0.978 −0.739
−0.466 −0.574 +0.556
−0.140 −0.502 −0.155
+0.353 −0.281 +0.431
−0.473 +0.993 +0.411
Three data sets are made available for the dimensionality estimation: they con-
tain, respectively, 100, 1000, and 10,000 observations. The three-dimensional
points that generated them are uniformly distributed in the three-dimensional
cube [−1, +1]3. Once the 10 distances are computed, white Gaussian noise is
added, with standard deviation equal to 0.01.
Figure 3.6 shows the results of PCA applied globally on the three data sets. As
can be seen, the number of observations does not greatly influence the results.
For the three data sets, the normalized variances vanish starting from the fifth
principal component. Clearly, this is not a good result. But this overestimation
of the intrinsic dimension is not unexpected: PCA works with a linear model,
which is unable to cope with the nonlinear dependences hidden in the data
sets.
The results of the correlation dimension are given in Fig. 3.7. This method is
much more sensitive to the number of available observations. For 100 obser-
vations, the numerical derivative is chopping and changing, although the right
dimensionality can already be guessed. For 1000 and 10,000 observations, the
Fig. 3.6. Estimation of the intrinsic dimensionality for the three “sensor” data sets
(100, 1000, and 10,000 observations), by using PCA applied globally.
Fig. 3.7. Estimation of the intrinsic dimensionality for the three "sensor" data sets
(100, 1000, and 10,000 observations), by using the correlation dimension.
The results of local PCA are displayed in Fig. 3.8. Actually, only the results for
Fig. 3.8. Estimation of the intrinsic dimensionality for the two largest “sensor”
data sets (1000 and 10,000 observations) by local PCA. The normalized eigenvalues
are shown according to the number of space windows.
the two largest data sets are shown. For 100 observations, the space windows
do not contain enough observations to yield a trustworthy estimate. Like the
correlation dimension, local PCA yields the right dimensionality. This can
be seen in both plots: the largest three normalized eigenvalues remain high for
any number of windows, while the fourth and subsequent ones are negligible.
It is noteworthy that for a single window the result of local PCA is trivially the
same as for PCA applied globally. But as the number of windows is increasing,
the fourth normalized eigenvalue is decreasing slowly. This indicates that the
Fig. 3.9. Estimation of the intrinsic dimensionality for the two smallest “sensor”
data sets (100 and 1000 observations) by trial and error with Sammon’s nonlinear
mapping.
As can be seen, the number of points does not play an important role. At first sight,
the estimated dimension is four, since the error is almost zero starting from
this number. Actually, the DR method slightly overestimates the dimension-
ality, like PCA applied globally. Although the method relies on a nonlinear
model, the manifold may still be too curved to achieve a perfect embedding
in a space having the same dimension as the exact manifold dimensionality.
One or two extra dimensions are welcome to embed the manifold with some
more freedom. This explains why the overestimation observed for PCA does
not disappear but is only attenuated when switching to an NLDR method.
Among the four compared methods, PCA applied globally on the whole data
set undoubtedly remains the simplest and fastest one. Unfortunately, its results
are not very convincing: the dimension is almost always overestimated if the data
do not perfectly fit the PCA model. Since the PCA criterion can be seen as
a reconstruction error, PCA is actually a particular case of the trial-and-error
method. And because PCA is an incremental method, embeddings for all
dimensions (and the corresponding errors) are computed at once.
When replacing PCA with a method relying on a nonlinear model, these
nice properties are often lost. In the case of Sammon’s nonlinear mapping, not
only is the method much slower due to its more complex model, but addition-
ally the method is no longer incremental! Put together, these two drawbacks
make the trial-and-error method very slow. Furthermore, the overestimation
that was observed with PCA does not disappear totally.
The use of local PCA, on the other hand, seems to be a good tradeoff. The
method keeps all advantages of PCA and combines them with the ability to
handle nonlinear manifolds. Local PCA runs fast if the number of windows
does not sweep a wide interval. But more importantly, local PCA has given
the right dimensionality for the studied data sets, along with the correlation
dimension.
In the end, the correlation dimension clearly appears to be the best method for
estimating the intrinsic dimensionality. It is not the fastest of the four methods,
but its results are the best and most detailed ones, giving the dimension on
all scales.
4
Distance Preservation
Overview. This chapter deals with methods that reduce the dimen-
sionality of data by using distance preservation as the criterion. In
the ideal case, the preservation of the pairwise distances measured in
a data set ensures that the low-dimensional embedding inherits the
main geometric properties of data, like the global shape or the lo-
cal neighborhood relationships. Unfortunately, in the nonlinear case,
distances cannot be perfectly preserved. The chapter reviews various
methods that attempt to overcome this difficulty. These methods use
different kinds of distances (mainly spatial or graph distances); they
also rely on different algorithms or optimization procedures to deter-
mine the embedding.
4.1 State-of-the-art
Historically, distance preservation has been the first criterion used to achieve
dimensionality reduction in a nonlinear way. In the linear case, simple criteria
like maximizing the variance preservation or minimizing the reconstruction
error, combined with a basic linear model, lead to robust methods like PCA.
In the nonlinear case however, the use of the same simple criteria requires
the definition of more complex data models. Unfortunately, the definition of
a generative model in the nonlinear case proves very difficult: there are many
different ways to model nonlinear manifolds, whereas there are only a few
(equivalent) ways to define a hyperplane.
In this context, distance preservation appears as a nongenerative way to
perform dimensionality reduction. The criterion does not need any explicit
model: no assumption is made about the mapping from the latent variables
to the observed ones. Intuitively, the motivation behind distance preservation
is that any manifold can be fully described by pairwise distances. Hence, if
a low-dimensional representation can be built in such a way that the initial
distances are reproduced, then the dimensionality reduction is successful: the
Spatial distances, like the Euclidean distance, are the most intuitive and natu-
ral way to measure distances in the real (Euclidean) world. The adjective spa-
tial indicates that these metrics compute the distance separating two points of
the space, without regard to any other information like the presence of a sub-
manifold: only the coordinates of the two points matter. Although these met-
rics are probably not the most appropriate for dimensionality reduction (see
Section 4.3.1), their simplicity makes them very appealing. Subsection 4.2.1
introduces some facts about and definitions of distances, norms, and scalar
products; then it goes on to describe the methods that reduce dimensionality
by using spatial distances.
0 ≤ d(c, a) . (4.2)
The conjunction of both inequalities forces the equality d(u, v) = d(v, u).
In the usual Cartesian vector space R^D, the most-used distance functions are
derived from the Minkowski norm. Actually, the pth-order Minkowski norm
of point a = [a_1, ..., a_k, ..., a_D]^T, also called the L_p norm and noted ‖a‖_p, is
a simple function of the coordinates of a:
\|a\|_p = \left( \sum_{k=1}^{D} |a_k|^p \right)^{1/p} ,   (4.5)
When using the Minkowski distance, some values of p are chosen preferentially
because they lead to nice geometrical or mathematical properties:
• The maximum distance (p = ∞):
\|a - b\|_1 = \sum_{k=1}^{D} |a_k - b_k| ,   (4.8)
proves to be the most natural and intuitive distance measure in the real
world. The Euclidean distance also has particularly appealing mathemat-
ical properties (invariance with respect to rotations, etc.).
Among the three above-mentioned possibilities, the Euclidean distance is the
most widely used one, not only because of its natural interpretation in the
physical world, but also because of its simplicity. For example, the partial
derivative along a component ak of a is simply
∂d(a, b) ak − b k ∂d(a, b)
= =− , (4.10)
∂ak d(a, b) ∂bk
or, written directly in a vector form, is
∂d(a, b) a−b ∂d(a, b)
= =− . (4.11)
∂a d(a, b) ∂b
Another advantage of the Euclidean distance comes from the alternative def-
inition of the Euclidean norm by means of the scalar product:
\|a\|_2 = \sqrt{a \cdot a} ,   (4.12)
where the notation a · b indicates the scalar product between vectors a and
b. Formally, the scalar or dot product is defined as
a \cdot b = a^T b = \sum_{k=1}^{D} a_k b_k .   (4.13)
a · (b + c) = a · b + a · c , (4.15)
(a + b) · c = a · c + b · c . (4.16)
Finally, the overview of the classical distance functions would not be com-
plete without mentioning the Mahalanobis distance, a straight generalization
of the Euclidean distance. The Mahalanobis norm is defined as
is available, then the distance between the two points y(i) and y(j), normally
written as d(y(i), y(j)), can be shortened and noted as dy (i, j).
a short-hand notation may be given for the scalar product between vectors
y(i) and y(j):
sy (i, j) = s(y(i), y(j)) = y(i) · y(j) , (4.22)
as has been done for distances. Then it can be written that
Usually, both Y and X are unknown; only the matrix of pairwise scalar prod-
ucts S, called the Gram matrix, is given. As can be seen, the values of the
S = U \Lambda U^T   (4.27)
  = (U \Lambda^{1/2})(\Lambda^{1/2} U^T)   (4.28)
  = (\Lambda^{1/2} U^T)^T (\Lambda^{1/2} U^T) ,   (4.29)
\hat{X} = I_{P \times N} \Lambda^{1/2} U^T .   (4.30)
Starting from this solution, the equivalence between metric MDS and PCA
can easily be demonstrated.
Actually, metric MDS and PCA give the same solution. To demonstrate
it, the data coordinates Y are assumed to be known— this is mandatory for
PCA, but not for metric MDS —and centered. Moreover, the singular value
decomposition of Y is written as Y = VΣUT (see Appendix A.1). On one
hand, PCA decomposes the covariance matrix, which is proportional to YYT ,
into eigenvectors and eigenvalues:
By the way, this proves that PCA and metric MDS minimize the same crite-
rion. In the case of metric MDS, it can be rewritten as
E_{MDS} = \sum_{i,j=1}^{N} (s_y(i,j) - \hat{x}(i) \cdot \hat{x}(j))^2 .   (4.37)
where z is the translation vector. According to Eq. (4.42), the scalar products
can be computed as
s_y(i,j) = -\frac{1}{2} \left( d_y^2(i,j) - y(i) \cdot y(i) - y(j) \cdot y(j) \right) .   (4.44)
Unfortunately, the coordinates y are usually unknown. Nevertheless, the two
subtractions in the right-hand side of Eq. (4.44) can be achieved in an implicit
way by an operation called “double centering” of D. It simply consists of
subtracting from each entry of D the mean of the corresponding row and the
mean of the corresponding column, and adding back the mean of all entries.
In matrix form, this can be written as
S = -\frac{1}{2} \left( D - \frac{1}{N} D 1_N 1_N^T - \frac{1}{N} 1_N 1_N^T D + \frac{1}{N^2} 1_N 1_N^T D 1_N 1_N^T \right) .   (4.45)
Using the properties of the scalar product, knowing that data are centered,
and denoting by μ the mean operator, we find that the mean of the ith row
of D is
Clearly, the two last unknown terms in Eq. (4.44) are equal to the sum of
Eq. (4.46) and Eq. (4.47), minus Eq. (4.48):
s_y(i,j) = -\frac{1}{2} \left( d_y^2(i,j) - \mu_j(d_y^2(i,j)) - \mu_i(d_y^2(i,j)) + \mu_{i,j}(d_y^2(i,j)) \right) .   (4.49)
The algorithm that achieves MDS is summarized in Fig. 4.1. In this algo-
rithm, it is noteworthy that, due to symmetry, the third term in the right-
hand side of Eq. (4.45) is the transpose of the second one, which is in turn a
subfactor of the fourth one. Similarly, the product in the last step of the
algorithm can be computed more efficiently by directly removing the un-
necessary rows/columns of U and Λ. The algorithm in Fig. 4.1 can easily
be implemented in less than 10 lines in MATLAB®. A C++ implementa-
tion can also be downloaded from http://www.ucl.ac.be/mlg/. The only
parameter of metric MDS is the embedding dimension P . Computing the
pairwise distances requires O(N 2 ) memory entries and O(N 2 D) operations.
Actually, time and space complexities of metric MDS are directly related to
those of an EVD. Computing all eigenvalues and eigenvectors of an N -by-N
nonsparse matrix typically demands at most O(N 3 ) operations, depending on
the implementation.
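For illustration, here is a NumPy sketch along the same lines; it assumes that the
input is the matrix of plain (not squared) pairwise Euclidean distances, squares it
before the double centering of Eq. (4.45), and returns the transpose of the matrix X̂
of Eq. (4.30). The function name is a choice made here.

import numpy as np

def classical_mds(D, P):
    """Classical metric MDS from a matrix D of pairwise Euclidean distances:
    double centering of the squared distances, EVD of the Gram matrix, and
    reconstruction of the P-dimensional coordinates."""
    D2 = np.asarray(D, dtype=float) ** 2
    N = D2.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N        # centering matrix
    S = -0.5 * J @ D2 @ J                      # double centering -> Gram matrix
    lam, U = np.linalg.eigh(S)                 # EVD (ascending eigenvalues)
    idx = np.argsort(lam)[::-1][:P]            # keep the P largest eigenvalues
    lam = np.maximum(lam[idx], 0.0)            # clip tiny negative eigenvalues
    return U[:, idx] * np.sqrt(lam)            # rows = embedded points (X_hat transposed)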
If the data set is available as coordinates, then the equivalence between metric
MDS and PCA can be used to easily embed a test set. In practice, this means
that the principal components vd are explicitly known, after either the SVD
Y = VΣUT or the EVD of the estimated covariance matrix Ĉyy = VΛVT .
A point y of the test set is then embedded by computing the product:
\hat{x} = I_{P \times D} V^T y ,   (4.50)
s = U \Sigma^T V^T y .   (4.53)
Assuming that the test point y has been generated according to y = V I_{D \times P} x,
then
and, eventually,
\hat{x} = I_{P \times N} \Lambda^{-1/2} U^T s ,   (4.57)
which gives the desired P -dimensional coordinates. This corresponds to the
Nyström formula [6, 16].
If the test set is given as distances, then a test point y is written as the
column vector
Example
Figure 4.2 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. Knowing that metric MDS projects data
Fig. 4.2. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by metric MDS.
in a linear way, the results in Fig. 4.2 are not very surprising. Intuitively, they
look like pictures of the two manifolds shot from the side (Swiss roll) and from
above (open box), respectively.
In the case of the Swiss roll, such a result is not very useful, since all turns
of the manifold are superposed. Similarly, for the open box, the presence of
lateral faces can be guessed visually, but with difficulty: only the bottom face
and the lid are clearly visible. These disappointing results can be explained
theoretically by looking at the first couple of normalized eigenvalues
for the open box. In accordance with theory, only three eigenvalues are non-
zero. Unfortunately, none of these three eigenvalues can be neglected compared
to the others. This clearly means that metric MDS fails to detect that the two
benchmark manifolds are two-manifolds. Any embedding with dimensionality
less than three would cause an important information loss.
Classification
Exactly like PCA, metric MDS is an offline or batch method. The optimiza-
tion method is exact and purely algebraical: the optimal solution is obtained
in closed form. Metric MDS is also said to be a spectral method, since the core
operation in its procedure is an EVD of a Gram matrix. The model is continu-
ous and strictly linear. The mapping is implicit. Other characteristics of PCA
are also kept: eigenvalues of the Gram matrix can be used to estimate the
intrinsic dimensionality, and several embeddings can be built incrementally
by adding or removing eigenvectors in the solution.
Classical metric MDS possesses all the advantages and drawbacks of PCA: it
is simple, robust, but strictly linear. By comparison with PCA, metric MDS
is more flexible: it accepts coordinates as well as scalar products or Euclidean
distances. On the other hand, running metric MDS requires more memory
than PCA, in order to store the N -by-N Gram matrix (instead of the D-by-
D covariance matrix). Another limitation is the generalization to new data
points, which involves an approximate formula for the double-centering step.
Variants
Classical metric MDS has been generalized into metric MDS, for which pair-
wise distances, instead of scalar products, are explicitly preserved. The term
stress function has been coined to denote the objective function of metric
MDS, which can be written as
E_{mMDS} = \frac{1}{2} \sum_{i,j=1}^{N} w_{ij} (d_y(i,j) - d_x(i,j))^2 ,   (4.64)
where dy (i, j) and dx (i, j) are the Euclidean distances in the high- and low-
dimensional space, respectively. In practice, the nonnegative weights wij are
often equal to one, except for missing data (wij = 0) or when one desires to
focus on more reliably measured dissimilarities dy (i, j). Although numerous
variants of metric MDS exist, only one particular version, namely Sammon’s
nonlinear mapping, is studied in the next section. Several reasons explain this
choice: Sammon’s nonlinear mapping meets a wide success beyond the usual
fields of application of MDS. Moreover, Sammon provided his NLDR method
with a well-explained and efficient optimization technique.
In human science, the assumption that collected proximity values are dis-
tance measures might be too strong. Shepard [171] and Kruskal [108] ad-
dressed this issue and developed a method known as nonmetric multidimen-
sional scaling. In nonmetric MDS, only the ordinal information (i.e., proximity
ranks) is used for determining the spatial representation. A monotonic trans-
formation of the proximities is calculated, yielding scaled proximities. Opti-
mally scaled proximities are sometimes referred to as disparities. The problem
of nonmetric MDS then consists of finding a spatial representation that mini-
mizes the squared differences between the optimally scaled proximities and the
distances between the points. In contrast to metric MDS, nonmetric MDS does
not attempt to reproduce scalar products but explicitly optimizes a quantita-
tive criterion that measures the preservation of the pairwise distances. Most
variants of nonmetric MDS optimize the following stress function:
N 2
i,j=1 wij |f (δ(i, j)) − dx (i, j)|
EnMDS = , (4.65)
c
where
• δ(i, j) are the collected proximities;
• f is a monotonic transformation of the proximities, such that the assump-
tion f (δ(i, j)) ≈ dy (i, j) holds, where dy (i, j) is the Euclidean distance
between the unknown data points y(i) and y(j);
• dx (i, j) is the Euclidean distance between the low-dimensional representa-
tions x(i) and x(j) of y(i) and y(j);
• c is a scale factor, usually equal to \sum_{i,j=1}^{N} w_{ij} d_y(i,j);
• wij are nonnegative weights, with the same meaning and usage as for
metric MDS.
More details can be found in the vast literature dedicated to nonmetric MDS;
see, for instance, [41, 25] and references therein.
where
• dy (i, j) is a distance measure between the ith and jth points in the D-
dimensional data space,
• dx (i, j) is the Euclidean distance between the ith and jth points in the
P -dimensional latent space.
The normalizing constant c is defined as
c = \sum_{\substack{i=1 \\ i<j}}^{N} d_y(i,j) .   (4.67)
Sammon’s stress can be cast as an instance of metric MDS (see (4.64)), for
which wij = 1/dy (i, j). The intuitive meaning of the factor 1/dy (i, j), which is
weighting the summed terms, is clear: it gives less importance to errors made
on large distances. During the dimensionality reduction, a manifold should be
unfolded in order to be mapped to a Cartesian vector space, which is flat, in
contrast with the initial manifold, which can be curved. This means that long
distances, between faraway points, cannot be preserved perfectly if the curva-
ture of the manifold is high: they have to be stretched in order to “flatten” the
manifold. On the other hand, small distances can be better preserved since on
a local scale the curvature is negligible, or at least less important than on the
global scale. Moreover, the preservation of short distances allows us to keep
the local cohesion of the manifold. In summary, the weighting factor simply
adjusts the importance to be given to each distance in Sammon’s stress, ac-
cording to its value: the preservation of long distances is less important than
the preservation of shorter ones, and therefore the weighting factor is chosen
to be inversely proportional to the distance.
Obviously, Sammon’s stress ENLM is never negative and vanishes in the
ideal case where dy (i, j) = dx (i, j) for all pairs {i, j}. The minimization
of ENLM is performed by determining appropriate coordinates for the low-
dimensional representations x(i) of each observation y(i). Although ENLM is a
relatively simple continuous function, its minimization cannot be performed in
closed form, in contrast with the error functions of PCA and classical metric
MDS. Nevertheless, standard optimization techniques can be applied in order
to find a solution in an iterative manner. Sammon’s idea relies on a variant
of Newton’s method, called quasi-Newton optimization (see Appendix C.1).
This method is a good tradeoff between the exact Newton method, which
involves the Hessian matrix, and a gradient descent, which is less efficient. As
Sammon’s stress ENLM depends on N P parameters, the Hessian would have
been much too big! According to Eq. (C.12), the quasi-Newton update rule
that iteratively determines the parameters xk (i) of ENLM can be written as
x_k(i) \leftarrow x_k(i) - \alpha \, \frac{\partial E_{NLM} / \partial x_k(i)}{\left| \partial^2 E_{NLM} / \partial x_k(i)^2 \right|} ,   (4.68)
where the absolute value is used to distinguish the minima from the max-
ima. Sammon [165] recommends setting α (called magic factor in his paper)
between 0.3 and 0.4.
As d_x(i,j) is the Euclidean distance between vectors x(i) and x(j), it
follows from Eq. (4.9) that
d_x(i,j) = \|x(i) - x(j)\|_2 = \sqrt{\sum_{k=1}^{P} (x_k(i) - x_k(j))^2} .   (4.69)
\frac{\partial E_{NLM}}{\partial x_k(i)} = \frac{-2}{c} \sum_{\substack{j=1 \\ j \neq i}}^{N} \frac{d_y(i,j) - d_x(i,j)}{d_y(i,j) \, d_x(i,j)} \, (x_k(i) - x_k(j)) .   (4.73)
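A simplified sketch of the resulting iteration is given below. For brevity it takes a
plain gradient step based on Eq. (4.73) instead of Sammon's quasi-Newton rule (4.68),
so the step size and the number of iterations (both chosen arbitrarily here) may
require tuning.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def sammon_nlm(Y, P=2, n_iter=500, alpha=0.3, seed=0):
    """Sketch of Sammon's NLM: the embedding X is moved along the gradient
    of E_NLM (Eq. (4.73)). A plain gradient step replaces the quasi-Newton
    rule of Eq. (4.68), so alpha here is only a step size."""
    rng = np.random.default_rng(seed)
    pd = pdist(np.asarray(Y, dtype=float))
    dy = squareform(pd)
    np.fill_diagonal(dy, 1.0)                      # dummy value; i = j terms are discarded
    c = pd.sum()                                   # normalizing constant of Eq. (4.67)
    X = 0.01 * rng.standard_normal((len(dy), P))   # random initial embedding
    for _ in range(n_iter):
        dx = np.maximum(squareform(pdist(X)), 1e-9)
        W = (dy - dx) / (dy * dx)                  # pairwise weights (dy - dx)/(dy dx)
        np.fill_diagonal(W, 0.0)
        grad = (-2.0 / c) * (W.sum(axis=1, keepdims=True) * X - W @ X)
        X -= alpha * grad                          # gradient descent on E_NLM
    return X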
Example
Figure 4.4 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. These results are obtained with the step
Fig. 4.4. Two-dimensional embeddings of the ‘Swiss roll’ and ‘open box’ data sets
(Fig. 1.4), found by Sammon’s NLM.
size α set to 0.3. By comparison with metric MDS, NLM can embed in a
nonlinear way. It is clearly visible that the two manifolds look “distorted” in
86 4 Distance Preservation
their two-dimensional embedding: some parts are stretched while others are
compressed.
Unfortunately, the results of NLM remain disappointing for the two bench-
mark manifolds, particularly for the Swiss roll. Turns of the spiral are super-
posed, meaning that the mapping between the initial manifold and its two-
dimensional embedding is not bijective. For the open box, NLM leads to a
better result than metric MDS, but two faces are still superposed.
The shape of the embedded manifolds visually shows how NLM trades off
the preservation of short against longer distances.
Classification
By comparison with classical metric MDS, NLM can efficiently handle non-
linear manifolds, at least if they are not too heavily folded. Among other
nonlinear variants of metric MDS, NLM remains relatively simple and ele-
gant.
As a main drawback, NLM lacks the ability to generalize the mapping to
new points.
In the same way as many other distance-preserving methods, NLM in its
original version works with a complete matrix of distances, hence containing
O(N 2 ) entries. This may be an obstacle when embedding very large data sets.
Another shortcoming of NLM is its optimization procedure, which may be
slow and/or inefficient for some data sets. In particular, Sammon's stress func-
tion is not guaranteed to be convex; consequently, the optimization process
can get stuck in a local minimum.
Variants of the original NLM, briefly presented below, show various at-
tempts to address the above-mentioned issues.
Variants
The SAMANN [134, 44], standing for Sammon artificial neural network, is
a variant of the original NLM that optimizes the same stress function but
E_{CCA} = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (d_y(i,j) - d_x(i,j))^2 F_\lambda(d_x(i,j))   (4.75)
and closely resembles Sammon’s stress function. As for the latter, no gen-
erative model of data is assumed. Just as usual, dy (i, j) and dx (i, j) are,
respectively, the distances in the data space and the Euclidean distance in the
latent space. There are two differences, however:
• No scaling factor stands in front of the sum; this factor is not very impor-
tant, except for quantitative comparisons.
• The weighting 1/dy (i, j) is replaced with a more generic factor Fλ (dx (i, j)).
Of course, Fλ may not be any function. Like for the weighting of Sammon’s
NLM, the choice of Fλ is guided by the necessity to preserve short distances
prior to longer ones. Because the global shape of the manifold has to be
unfolded, long distances often have to be stretched, and their contribution in
the stress should be low. On the other hand, the good preservation of short
distances is easier (since the curvature of the manifold is often low on a local
scale) but is also more important, in order to preserve the structure of the
manifold. Consequently, F_λ is typically chosen as a monotonically decreasing
function of its argument. As CCA works on finite data sets, F_λ is also chosen
bounded, in order to prevent an abnormally short or null distance from dominating
the other contributions in the stress function. This is especially critical because
Fλ depends on the distances in the embedding space, which are varying and
could temporarily be very small. Indeed, more important than the function Fλ
in itself is the argument of Fλ . In contrast with Sammon’s stress, the weighting
does not depend on the constant distances measured in the data space but on
the distances being optimized in the embedding space. When distances can
be preserved, the hypothesis dy (i, j) ≈ dx (i, j) holds and CCA behaves more
or less in the same way as NLM. If d_x(i,j) ≪ d_y(i,j) for some i and j, then
the manifold is highly folded up and the contribution of this pair will increase
in order to correct the flaw. But what happens if d_x(i,j) ≫ d_y(i,j)? Then
the contribution of this pair will decrease, meaning intuitively that CCA will
allow some stretching not only for long distances but also for shorter ones.
Demartines and Hérault designed an optimization procedure specifically
tailored for CCA. Like most optimization techniques, it is based on the deriva-
tive of ECCA . Using the short-hand notations dy = dy (i, j) and dx = dx (i, j),
the derivative can be written as
\frac{\partial E_{CCA}}{\partial x_k(i)} = \frac{\partial E_{CCA}}{\partial d_x} \frac{\partial d_x}{\partial x_k(i)}
= \sum_{j=1}^{N} (d_y - d_x) \left[ 2 F_\lambda(d_x) - (d_y - d_x) F_\lambda'(d_x) \right] \frac{x_k(j) - x_k(i)}{d_x} .
(See Eqs. (4.10) and (4.11) for the derivative of d_x(i,j).) Alternatively, in
vector form, this gives
\nabla_{x(i)} E_{CCA} = \sum_{j=1}^{N} (d_y - d_x) \left[ 2 F_\lambda(d_x) - (d_y - d_x) F_\lambda'(d_x) \right] \frac{x(j) - x(i)}{d_x} ,   (4.76)
where ∇x(i) ECCA represents the gradient of ECCA with respect to vector x(i).
The minimization of ECCA by a gradient descent gives the following update
rule:
x(i) ← x(i) − α∇x(i) ECCA , (4.77)
where α is a positive learning rate scheduled according to the Robbins–Monro
conditions [156]. Clearly, the modification brought to vector x(i) proves to be
a sum of contributions coming from all other vectors x(j) (j = i). Each con-
tribution may be seen as the result of a repulsive or attractive force between
vectors x(i) and x(j). Each contribution can also be written as a scalar co-
efficient β(i, j) multiplying the unit vector (x(j) − x(i))/dx (i, j). The scalar
coefficient is
β(i, j) = (dy − dx ) 2Fλ (dx ) − (dy − dx )Fλ (dx ) = β(j, i) . (4.78)
E_{CCA} = \sum_{i=1}^{N} E_{CCA}^i ,   (4.80)
where
E_{CCA}^i = \frac{1}{2} \sum_{j=1}^{N} (d_y(i,j) - d_x(i,j))^2 F_\lambda(d_x(i,j)) .   (4.81)
The separate optimization of each E_{CCA}^i leads to the following update rule:
x(j) \leftarrow x(j) - \alpha \, \beta(i,j) \, \frac{x(i) - x(j)}{d_x} .   (4.82)
The modified procedure works by interlacing the gradient descent of each
term E_{CCA}^i: it updates all x(j) for E_{CCA}^{i=1}, then all x(j) for E_{CCA}^{i=2}, and so on.
Geometrically, the new procedure pins up a vector x(i) and moves all other
x(j) radially, without regard to the cross contributions computed by the
Fig. 4.5. Visual example of local minimum that can occur with distance preserva-
tion. The expected solution is shown on the left: the global minimum is attained
when x(1) is exactly in the middle of the triangle formed by x(2), x(3), and x(4).
A local minimum is shown on the right: dx (1, 2) and dx (1, 3) are too short, whereas
dx (1, 4) is too long. In order to reach the global minimum, x(1) must be moved
down, but this requires shortening dx (1, 2) and dx (1, 3) even further, at least tem-
porarily. This is impossible for a classical gradient descent: the three contributions
from x(2), x(3), and x(4) vanish after adding them, although they are not zero when
taken individually. Actually, the points x(2) and x(3) tend to move x(1) away, while
x(4) pulls x(1) toward the center of the triangle. With a stochastic approach, the
contributions are considered one after the other, in random order: for example, if
the contribution of x(4) is taken into account first, then x(1) can escape from the
local minimum.
first rule (classical gradient, Eq. (4.77)). Computationally, the new procedure
performs many more updates than the first rule for (almost) the same number
of operations: while the application of the first rule updates one vector, the
new rule updates N − 1 vectors. Moreover, the norm of the updates is larger
with the new rule than with the first one. A more detailed study in [45] shows
that the new rule minimizes the global error function E_CCA, not strictly as
a normal gradient descent would, but on average.
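The per-point rule can be sketched as follows. Here F_λ is taken as a simple step
function (equal to one when d_x ≤ λ and zero otherwise), so that its derivative vanishes
and β(i,j) reduces to 2(d_y − d_x)F_λ(d_x); the Robbins–Monro schedule is replaced by a
plain linear decay of the learning rate. The function name, the schedules, and the
default parameters are assumptions of this sketch.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def cca(Y, P=2, n_epochs=50, alpha=0.5, lam=1.0, seed=0):
    """Sketch of the CCA per-point update of Eq. (4.82): pin x(i), then move
    every other x(j) radially by -alpha * beta(i, j) * (x(i) - x(j)) / dx."""
    rng = np.random.default_rng(seed)
    dy = squareform(pdist(np.asarray(Y, dtype=float)))   # distances in the data space
    X = 0.01 * rng.standard_normal((len(dy), P))         # random initial embedding
    for epoch in range(n_epochs):
        lr = alpha * (1.0 - epoch / n_epochs)            # slowly decreasing learning rate
        for i in rng.permutation(len(dy)):               # pick the pinned point at random
            diff = X - X[i]                              # x(j) - x(i) for all j
            dx = np.maximum(np.linalg.norm(diff, axis=1), 1e-9)
            beta = 2.0 * (dy[i] - dx) * (dx <= lam)      # beta(i, j) with step-function F
            beta[i] = 0.0                                # no self-contribution
            X += lr * (beta / dx)[:, None] * diff        # move all x(j); x(i) stays pinned
        # (the neighborhood width lam could also be decreased across epochs)
    return X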
Unless the embedding dimensionality is the same as the initial dimension-
ality, in which case the embedding is trivial, the presence and the choice of
Fλ are very important. As already mentioned, the embedding of highly folded
manifolds requires focusing on short distances. Longer distances have to be
stretched in order to achieve the unfolding, and their contribution must be
lowered in ECCA . Therefore, Fλ is usefully chosen as a positive and decreasing
function. For example, Fλ could be defined as the following:
F_\lambda(d_x) = \exp\left( -\frac{d_x}{\lambda} \right) ,   (4.83)
where λ controls the decrease. However, the condition (4.79) must hold, so
that the rule behaves as expected, i.e., points that are too far away are brought
closer to each other and points that are too close are moved away. As F_λ is positive
Because it stems from the field of artificial neural networks, CCA has been
provided with an interpolation procedure that can embed test points. Com-
pared to the learning phase, the interpolation considers the prototypes (or the
data points if the vector quantization was skipped) as fixed points. For each
test point, the update rule (4.82) is applied, as in the learning phase, in order
to move the embedded test point to the right position.
For simple manifolds, the interpolation works well and can even perform
some basic extrapolation tasks [45, 48]. Unfortunately, data dimensionality
that is too high or the presence of noise in the data set dramatically reduces
the performance. Moreover, when the manifold is heavily crumpled, the tuning
of the neighborhood width proves still more difficult than during the learning
phase.
Example
Figure 4.7 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. As can be seen in the figure, CCA succeeds
in embedding the two benchmark manifolds in a much more satisfying way
than Sammon’s NLM. The two-dimensional embedding of the Swiss roll is al-
most free of superpositions. To achieve this result, CCA tears the manifold in
several pieces. However, this causes some neighborhood relationships between
data points to be both arbitrarily broken and created. From the viewpoint of
the underlying manifold, this also means that the mapping between the initial
and final embeddings is bijective but discontinuous.
Fig. 4.7. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by CCA.
Regarding the open box, the tearing strategy of CCA appears to be more
convincing. All faces of the box are visible, and the mapping is both bijective
and continuous. In this case, the weighted preservation of distances allows
CCA to leave almost all neighborhood relationships unchanged, except where
the two lateral faces are torn. Some short distances are shrunk near the bottom
corners of the box.
Classification
Although VQP, the first version of CCA, was intended to be an online method
that, like usual versions of the SOM, performs vector quantization and dimen-
sionality reduction simultaneously, most current implementations of CCA
are now simple batch procedures. Several reasons explain this choice. A
clear separation between vector quantization and dimensionality reduction
increases efficiency as well as modularity. Ready-to-use methods of vector
quantization can advantageously preprocess the data. Moreover, these proce-
dures may be skipped to ease the comparison with other DR methods, like
Sammon’s NLM, that are not provided with vector quantization.
In summary, CCA is an offline (batch) method with an optional vector
quantization as preprocessing. The model is nonlinear and discrete; the map-
ping is explicit. The method works with an approximate optimization pro-
cedure. CCA does not include an estimator of the data dimensionality and
cannot build embeddings incrementally. In other words, the intrinsic dimen-
sionality has to be given as an external parameter and cannot be modified
without running the method again.
Variants
E_{CCA-u}^i = \frac{1}{2} \sum_{j=1}^{N} (d_y - d_x)^2 F_\lambda(d_x) .   (4.88)
On the other hand, for the local projection, the error function becomes
E_{CCA-p}^i = \frac{1}{2} \sum_{j=1}^{N} (d_y^2 - d_x^2)^2 F_\lambda(d_x) ,   (4.89)
where the quadratic term resembles the one used in Torgerson’s classical met-
ric MDS (see Subsection 4.2.2). The local projection thus behaves like a local
PCA (or MDS) and projects the noisy points onto the underlying manifold
(more or less) orthogonally.
[Figure: a two-dimensional manifold embedded in one dimension, with some distances
to stretch (unfolding) and others to shrink (projection).]
Assuming F_λ has a null derivative, the gradients of E_{CCA-u}^i and E_{CCA-p}^i are
\nabla_{x(j)} E_{CCA-u}^i = 2 (d_y - d_x) F_\lambda(d_x) \frac{x(j) - x(i)}{d_x} ,   (4.90)
\nabla_{x(j)} E_{CCA-p}^i = -4 (d_y^2 - d_x^2) F_\lambda(d_x) (x(j) - x(i)) .   (4.91)
In order to obtain the same error function in both cases around d_x = d_y,
Hérault equates the second derivatives of E_{CCA-u}^i and E_{CCA-p}^i and scales E_{CCA-p}^i
accordingly:
E_{CCA} = \sum_{i=1}^{N} \begin{cases} E_{CCA-u}^i & \text{if } d_x > d_y \\ \frac{1}{4 d_y(i,j)^2} E_{CCA-p}^i & \text{if } d_x < d_y \end{cases} .   (4.92)
Another variant of CCA is also introduced in [186, 188, 187]. In that ver-
sion, the main change occurs in the weighting function F , which is a balanced
compound of two functions. The first one depends on the distance in the em-
bedding space, as for regular CCA, whereas the second has for argument the
distance in the data space, as for Sammon’s NLM:
The additional parameter ρ allows one to tune the balance between both
terms. With ρ close to one, local neighborhoods are generally well preserved,
but discontinuities can possibly appear (manifold tears). In contrast, with ρ
close to zero, the global manifold shape is better preserved, often at the price
of some errors in local neighborhoods.
Finally, a method similar to CCA is described in [59]. Actually, this method
is more closely related to VQP [46, 45] than to CCA as it is described in [48].
This method relies on the neural gas [135] for the vector quantization and is
thus able to work online (i.e., with a time-varying data set).
[Figure: a two-dimensional manifold mapped to one dimension, comparing the geodesic distance measured along the manifold with the Euclidean distance measured through the embedding space.]
and may help to compute the manifold distance as an arc length. Indeed, the
arc length l from point y(i) = m(x(i)) to point y(j) = m(x(j)) is computed
as the integral
l = ∫_{y(i)}^{y(j)} dl = ∫_{y(i)}^{y(j)} √( Σ_{k=1}^{D} dm_k² ) = ∫_{x(i)}^{x(j)} ‖J_x m(x)‖ dx ,    (4.95)
where Jx m(x) designates the Jacobian matrix of m with respect to the pa-
rameter x. As x is scalar, the Jacobian matrix reduces to a column vector.
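As a quick numerical illustration of Eq. (4.95), the sketch below (Python/NumPy; the spiral manifold and all names are chosen for illustration only) integrates the norm of the Jacobian of a one-parameter mapping and checks the result against the length of a fine polyline drawn on the manifold.

```python
import numpy as np

# Hypothetical 1-D manifold m(x) = [x*cos(x), x*sin(x)]: a planar spiral.
def m(x):
    return np.stack([x * np.cos(x), x * np.sin(x)], axis=-1)

def jacobian_norm(x):
    # d/dx m(x) = [cos(x) - x*sin(x), sin(x) + x*cos(x)]
    return np.sqrt((np.cos(x) - x * np.sin(x))**2 + (np.sin(x) + x * np.cos(x))**2)

x = np.linspace(0.0, 3.0 * np.pi, 10001)

# Arc length as in Eq. (4.95): integral of ||J_x m(x)|| dx (trapezoidal rule).
arc_length = np.trapz(jacobian_norm(x), x)

# Sanity check: sum of segment lengths of a fine polyline on the manifold.
pts = m(x)
polyline = np.linalg.norm(np.diff(pts, axis=0), axis=1).sum()

print(arc_length, polyline)   # the two values should nearly coincide
```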
Unfortunately, the situation gets worse for multidimensional manifolds,
which involve more than one parameter:
The integral then has to be minimized over all possible paths that connect
the starting and ending points:
l = min_{p(z)} ∫_{z(i)}^{z(j)} ‖J_z m(p(z))‖ dz ,    (4.98)
where π = [vi , . . . , vj ].
Dijkstra’s algorithm works as follows. The set of vertices V is divided into
two subsets such that VN = Vdone ∪ Vsort and ∅ = Vdone ∩ Vsort where
• Vdone contains the vertices for which the shortest path is known;
• Vsort contains the vertices for which the shortest path is either not known
at all or not completely known.
If vi is the source vertex, then the initial state of the algorithm is Vdone = ∅,
Vsort = VN , δ(vi , vi ) = 0, and δ(vi , vj ) = ∞ for all j ≠ i. After the initializa-
tion, the algorithm iteratively looks for the vertex vj with the shortest δ(vi , vj )
in Vsort , removes it from Vsort , and puts it in Vdone . The distance δ(vi , vj ) is
then definitely known. Next, all vertices vk connected to vj , i.e., such that
(vj , vk ) ∈ E, are analyzed: if δ(vi , vj ) + label((vj , vk )) < δ(vi , vk ), then the
value of δ(vi , vk ) is changed to δ(vi , vj ) + label((vj , vk )) and the candidate
shortest path from vi to vk becomes [vi , . . . , vj , vk ], where [vi , . . . , vj ] is the
shortest path from vi to vj . The algorithm stops when Vsort = ∅. If the graph
is not connected, i.e., if there are one or several pairs of vertices that cannot
be connected by any path, then some δ(vi , vj ) keep an infinite value.
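A minimal single-source implementation of the procedure just described might look as follows (Python sketch; a binary heap replaces the explicit scan of Vsort, and the adjacency-list format is an assumption made for this illustration).

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest paths, as in the procedure above.

    adj    : list of lists; adj[v] holds (w, length) pairs for edges (v, w)
    source : index of the source vertex v_i
    Returns delta, a list with delta[v] = delta(source, v); unreachable
    vertices keep the value float('inf') (disconnected graph).
    """
    n = len(adj)
    delta = [float('inf')] * n
    delta[source] = 0.0
    done = [False] * n                       # V_done membership
    heap = [(0.0, source)]                   # replaces the scan of V_sort
    while heap:
        d, v = heapq.heappop(heap)
        if done[v]:
            continue
        done[v] = True                       # delta(source, v) is now final
        for w, length in adj[v]:
            if d + length < delta[w]:        # relax edge (v, w)
                delta[w] = d + length
                heapq.heappush(heap, (delta[w], w))
    return delta
```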
Intuitively, the correctness of the algorithm is demonstrated by proving
that vertex vj in Vsort with shortest path π1 = [vi , . . . , vj ] of length δ(vi , vj )
can be admitted in Vdone . This is trivially true for vi just after the initialization.
In the general case, it may be assumed that an unexplored path going from the
source vi to vertex vj is shorter than the path found by Dijkstra’s algorithm.
Then this hypothetical “shorter-than-shortest” path may be written as π2 =
[vi , . . . , vt , vu , . . . , vj ], wherein π3 = [vi , . . . , vt ] is the maximal (always non-
empty) subpath belonging to Vdone and π4 = [vu , . . . , vj ] is the unexplored
part of π2 . Consequently, vu still lies in Vsort and δ(vi , vu ) has been given its
value when vertex vt was removed from Vsort . Thus, the inequalities
hold but contradict the fact that the first path π1 was chosen as the one with
the shortest δ(vi , vj ) in Vsort .
At this point, it remains to prove that the graph distance approximates
the true geodesic distance in an appropriate way. Visually, it seems to be
true for the C curve, as illustrated in Fig. 4.10. Formal demonstrations are
provided in [19], but they are rather complicated and not reproduced here.
The intuitive idea consists of relating the natural Riemannian structure of a
smooth manifold M to the graph distance computed between points of M.
Bounds are more easily computed for graphs constructed with ε-balls than
with K-ary neighborhoods. Unfortunately, these bounds rely on assumptions
that are hard to meet with real data sets, especially for K-ary neighborhoods.
Moreover, the bounds are computed in the ideal case where no noise pollutes
the data.
4.3.2 Isomap
Isomap [179, 180] is the simplest NLDR method that uses the graph distance
as an approximation of the geodesic distance. Actually, the version of Isomap
described in [180] is closely related to Torgerson’s classical metric MDS (see
Subsection 4.2.2). The only difference between the two methods is the metric
used to measure the pairwise distances: Isomap uses graph distances instead
of Euclidean ones in the algebraical procedure of metric MDS. Just by intro-
ducing the graph distance, the purely linear metric MDS becomes a nonlinear
method. Nevertheless, it is important to remind that the nonlinear capabili-
ties of Isomap are exclusively brought by the graph distance. By comparison,
methods like Sammon’s NLM (Subsection 4.2.3) and Hérault’s CCA (Subsec-
tion 4.2.4) are built on inherently nonlinear models of data, independently
of the chosen metric. However, Isomap keeps the advantage of reducing the
dimensionality with a simple, fast, and direct algebraical manipulation. On
Fig. 4.10. The same C curve as in Fig. 4.9. In this case, the manifold is not available:
only some points are known. In order to approximate the geodesic distance, vertices
are associated with the points and a graph is built. The graph distance can be
measured by summing the edges of the graph along the shortest path between both
ends of the curve. That shortest path can be computed by Dijkstra’s algorithm. If
the number of points is large enough, the graph distance gives a good approximation
of the true geodesic distance.
the other hand, Sammon’s NLM and Hérault’s CCA, which use specific opti-
mization procedures like gradient descent, are much slower.
Although Isomap greatly improves metric MDS, it inherits one of its ma-
jor shortcomings: a very rigid model. Indeed, the model of metric MDS is
restricted to the projection onto a hyperplane. Moreover, the analytical de-
velopments behind metric MDS rely on the particular form of the Euclidean
distance. In other words, this means that the matrix of pairwise distances D
handled by metric MDS must ideally contain Euclidean distances measured
between points lying on a hyperplane. Hence, if the distances in D are not
Euclidean, it is implicitly assumed that the replacement metric yields dis-
tances that are equal to Euclidean distances measured in some transformed
hyperplane. Otherwise, the conditions stated in the metric MDS model are
no longer fulfilled.
In the case of Isomap, the Euclidean distances are replaced with the graph
distances when computing the matrix D. For a theoretical analysis, however,
= Σ_{p=1}^{P} ‖ x(i) − [x_1(i), . . . , x_p(j), . . . , x_P(i)]^T ‖²_2
= Σ_{p=1}^{P} δ²( m(x(i)), m([x_1(i), . . . , x_p(j), . . . , x_P(i)]^T) ) .
Therefore, the equality ‖J_z p(z)‖_2 = ‖J_{p(z)} m(p(z)) J_z p(z)‖_2 holds. This
means that the Jacobian of a developable manifold must be a D-by-P matrix
whose columns are orthogonal vectors with unit norm (a similar reasoning is
developed in [209]). Otherwise, the norm of Jz p(z) cannot be preserved. More
precisely, the Jacobian matrix can be written in a generic way as
Isomap follows the same procedure as metric MDS (Subsection 4.2.2); the only
difference is the metric in the data space, which is the graph distance. In order
to compute the latter, data must be available as coordinates stored in matrix
Y as usual. Figure 4.11 shows a simple procedure that implements Isomap. It
is noteworthy that matrix S is not guaranteed to be positive semidefinite after
double centering [78], whereas this property holds in the case of classical metric
MDS. This comes from the fact that graph distances merely approximate the
true geodesic distances. Nevertheless, if the approximation is good (see more
details in [19]), none or only a few eigenvalues of low magnitude should be
negative after double centering. Notice, however, that care must be taken if
the eigenvalues are used for estimating the intrinsic dimensionality, especially
where r²_{ij} denotes the correlation coefficient computed over indices i and j. When plotting the evolution of σ²_P w.r.t. P, the ideal dimensionality is the abscissa of the “curve elbow”, i.e., the lowest value of P such that σ²_P is close enough to zero and does not significantly decrease anymore. Another way to estimate
the right embedding dimensionality consists in computing the MDS objective
function w.r.t. P :
E_MDS = Σ_{i,j=1}^{N} ( s_y(i, j) − x̂(i) · x̂(j) )² ,    (4.110)
where sy (i, j) is computed from the matrix of graph distances after double
centering and x̂(i) is the ith column of X̂ = IP ×N Λ1/2 UT . Here also an elbow
indicates the right dimensionality.
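A short sketch of this second criterion is given below (Python/NumPy), under the assumption that the double-centered matrix of graph distances used in the MDS-like step is available as an array; E_MDS(P) of Eq. (4.110) then reduces to the squared Frobenius norm of the rank-P reconstruction error.

```python
import numpy as np

def mds_error_curve(S, Pmax):
    """E_MDS(P) of Eq. (4.110) for P = 1..Pmax.

    S : (N, N) double-centered matrix built from the graph distances,
        i.e. the Gram-like matrix decomposed in the MDS-like step.
    """
    lam, U = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]                 # sort eigenvalues decreasingly
    lam, U = lam[order], U[:, order]
    errors = []
    for P in range(1, Pmax + 1):
        S_P = (U[:, :P] * lam[:P]) @ U[:, :P].T   # rank-P reconstruction
        errors.append(np.sum((S - S_P) ** 2))     # Frobenius-norm error
    return errors   # plot against P and look for the elbow
```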
A MATLAB® package containing the above procedure is available at http://isomap.stanford.edu/. A C++ implementation can also be downloaded from http://www.ucl.ac.be/mlg/. The parameters of Isomap are the embedding dimensionality P and either the number of neighbors K or the radius ε, depending on the rule chosen to build the data graph. Space complexity of Isomap is O(N²), reflecting the amount of memory required to store the pairwise geodesic distances. Time complexity is mainly determined by the computation of the graph distances; using Dijkstra's algorithm, this leads to O(N² log N). The EVD in the MDS-like step is generally fast when using dedicated libraries or MATLAB®.
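For readers who prefer code to pseudocode, here is a minimal Isomap sketch in Python (NumPy/SciPy). The K-rule, the use of scipy.sparse.csgraph.shortest_path for the Dijkstra step, and all names are choices made for this illustration, not part of the original package; the graph is assumed to be connected.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import shortest_path

def isomap(Y, P=2, K=7):
    """Minimal Isomap sketch: K-NN graph -> graph distances -> metric MDS.

    Y : (N, D) data coordinates;  P : target dimension;  K : neighbours.
    """
    N = Y.shape[0]
    D = squareform(pdist(Y))                      # Euclidean distances
    # keep only the K nearest neighbours of each point (zero = no edge)
    W = np.zeros_like(D)
    for i in range(N):
        nn = np.argsort(D[i])[1:K + 1]
        W[i, nn] = D[i, nn]
    W = np.maximum(W, W.T)                        # symmetrize the graph
    G = shortest_path(W, method='D', directed=False)   # Dijkstra from each vertex
    # classical metric MDS on the graph distances: double centering + EVD
    J = np.eye(N) - np.ones((N, N)) / N
    S = -0.5 * J @ (G ** 2) @ J
    lam, U = np.linalg.eigh(S)
    idx = np.argsort(lam)[::-1][:P]
    return U[:, idx] * np.sqrt(np.maximum(lam[idx], 0.0))   # (N, P) embedding
```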
Example
Figure 4.12 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. The use of geodesic distances instead of
Fig. 4.12. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by Isomap.
Euclidean ones allows Isomap to perform much better than metric MDS (see
Fig. 4.2). In the case of the two benchmark manifolds, the graphs used for
approximating the geodesic distances are built with the ε-rule; the value of ε is somewhat larger than the distance between the three-dimensional points
of the open box data set shown in Fig. 1.4. This yields the regular graphs
displayed in the figure. As a direct consequence, the graph distances are com-
puted in the same way as city-block distances (see Subsection 4.2.1), i.e., by
summing the lengths of perpendicular segments.
With this method, the Swiss roll is almost perfectly unrolled (corners of the
rolled-up rectangle seem to be stretched outward). This result is not surprising
since the Swiss roll is a developable manifold. The first six eigenvalues confirm
that two dimensions suffice to embed the Swiss roll with Isomap:
The first two eigenvalues clearly dominate the others. By the way, it must
be remarked that in contrast to classical metric MDS, the eigenvalues λn
with n > D do not vanish completely (the last ones can even be negative, as
mentioned above). This phenomenon is due to the fact that graph distances
only approximate the true geodesic ones.
On the other hand, the open box is not a developable manifold and Isomap
does not embed it in a satisfying way. The first six eigenvalues found by Isomap
are
[λn ]1≤n≤N = [1099.6, 613.1, 413.0, 95.8, 71.4, 66.8, . . .] . (4.113)
As can be seen, the first three eigenvalues dominate the others, just as they
do with metric MDS. Hence, like MDS, Isomap does not succeed in detecting
that the intrinsic dimensionality of the box is two. Visually, two faces of the
open box are still superposed: neighborhood relationships between data points
on these faces are not correctly rendered.
Classification
By comparison with PCA and metric MDS, Isomap is much more powerful.
Whereas the generative model of PCA and metric MDS is designed for linear
submanifolds only, Isomap can handle a much wider class of manifolds: the
developable manifolds, which may be nonlinear. For such manifolds, Isomap
reduces the dimensionality without any loss.
Unfortunately, the class of developable manifolds is far from including all
possible manifolds. When Isomap is used with a nondevelopable manifold, it
suffers from the same limitations as PCA or metric MDS applied to a nonlinear
manifold.
From a computational point of view, Isomap shares the same advantages
and drawbacks as PCA and metric MDS. It is simple, relies on standard
algebraic operations, and is guaranteed to find the global optimum of its error
function in closed form.
In summary, Isomap extends metric MDS in a very elegant way. However,
the data model of Isomap, which relies on developable manifolds, still remains
too rigid. Indeed, when the manifold to be embedded is not developable,
Isomap yields disappointing results. In this case, the guarantee of determining
a global optimum does not really matter, since actually the model and its
associated error function are not appropriate anymore.
Another problem encountered when running Isomap is the practical com-
putation of the geodesic distances. The approximations given by the graph
distances may be very rough, and their quality depends on both the data
(number of points, noise) and the method parameters (K or ε in the graph-
building rules). Badly chosen values for the latter parameters may totally
jeopardize the quality of the dimensionality reduction, as will be illustrated
in Section 6.1.
Variants
When the data set becomes too large, the Isomap authors [180] advise running
the method on a subset of the available data points. Instead of performing a
vector quantization like for CCA, they simply select points at random in the
available data. Doing this presents some drawbacks that are particularly critical because Isomap uses the graph distance (see examples in
Section 6.1).
A slightly different version of Isomap also exists (see [180]) and can be seen
as a compromise between the normal version of Isomap and the economical
version described just above. Instead of performing Isomap on a randomly
chosen subset of data, Isomap is run with all points but only on a subset
of all distances. For this purpose, a subset of the data points is chosen and
only the geodesic distances from these points to all other ones are computed.
These particular points are called anchors or landmarks. In summary, the
normal or “full” version of Isomap uses the N available points and works with
an N -by-N distance matrix. The economical or “light” version of Isomap
uses a subset of M < N points, yielding an M-by-M distance matrix. The
intermediate version uses M < N anchors and works with a rectangular M -
by-N distance matrix. Obviously, an adapted MDS procedure is required to
find the embedding (see [180] for more details) in the last version.
An online version of Isomap is detailed in [113]. Procedures to update the
neighborhood graph and the corresponding graph distances when points are
removed from or added to the data set are given, along with an online (but
approximate) update rule of the embedding.
Finally, it is noteworthy that the first version of Isomap [179] is quite
different from the current one. It relies on data resampling (graph vertices
are a subset of the whole data set) and the graph is built with a rule inspired
from topology-representing networks [136] (see also Appendix E). Next, graph
distances are computed with Floyd's algorithm (instead of Dijkstra's, which
is more efficient), and the embedding is obtained with nonmetric MDS instead
of classical metric MDS. Hence this previous version is much closer to geodesic
NLM than the current one is.
Sammon’s nonlinear mapping (NLM) is mostly used with the Euclidean dis-
tance, in the data space as well as in the embedding space. Nevertheless, as
mentioned in Subsection 4.2.3, nothing forbids the user from choosing another
metric, at least in the data space. Indeed, in the embedding space, the simple
and differentiable formula of the Euclidean distance helps to deduce a not-too-
complicated update rule for the optimization of the stress function. So, why
not create a variant of NLM that uses the graph distance in the data space?
Isomap (see Subsection 4.3.2) and CDA (see Subsection 4.3.4) follow the
same idea by modifying, respectively, the metric MDS (see Subsection 4.2.2)
and CCA (see Subsection 4.2.4). Strangely enough, very few references to
such a variant of NLM can be found in the literature (see, however, [117, 150,
58]). NLM using geodesic distances is here named GNLM (geodesic NLM),
according to [117, 121, 58], although the method described in [58] is more
related to CDA.
The embedding of the data set simply follows the procedure indicated in
Subsection 4.2.3. The only difference regards the distance in the data space,
which is the graph distance introduced in Subsection 4.3.1. Hence, Sammon’s
stress can be rewritten as
where
• δy (i, j) is the graph distance between the ith and jth points in the D-
dimensional data space,
• dx (i, j) is the Euclidean distance between the ith and jth points in the
P -dimensional latent space.
• the normalizing constant c is defined as c = Σ_{i<j} δ_y(i, j) .    (4.115)
where the absolute value is used for distinguishing the minima from the max-
ima. As for the classical NLM, the step size α is usually set between 0.3
and 0.4.
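The following sketch (Python/NumPy, hypothetical names) shows one synchronous iteration of Sammon's update rule driven by a precomputed matrix of graph distances; the diagonal second-derivative term is the standard one from Sammon's NLM and is taken in absolute value, as mentioned above. It is a sketch under those assumptions, not the original algorithm.

```python
import numpy as np

def gnlm_iteration(X, DG, alpha=0.3, eps=1e-9):
    """One synchronous iteration of Sammon's rule with graph distances (GNLM sketch).

    X  : (N, P) current embedding;  DG : (N, N) graph distances delta_y(i, j).
    Returns the updated embedding."""
    N, P = X.shape
    c = DG[np.triu_indices(N, 1)].sum()              # normalizing constant, Eq. (4.115)
    DX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(DX, 1.0)                        # dummy diagonal values;
    G = DG.copy(); np.fill_diagonal(G, 1.0)          # the i = j terms cancel out below
    Xnew = X.copy()
    for i in range(N):
        diff = X[i] - X                              # x(i) - x(j), shape (N, P)
        d, g = DX[i], G[i]
        w = (g - d) / (g * d + eps)
        grad = -(2.0 / c) * (w[:, None] * diff).sum(axis=0)
        hess = -(2.0 / c) * ((1.0 / (g * d + eps))[:, None] *
                             ((g - d)[:, None]
                              - (diff ** 2 / d[:, None])
                              * (1.0 + (g - d)[:, None] / d[:, None]))).sum(axis=0)
        Xnew[i] = X[i] - alpha * grad / (np.abs(hess) + eps)   # |d2E/dx2|, as in the text
    return Xnew
```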
No MATLAB® package is available for the geodesic NLM. However, it is possible to build one quite easily using part of the Isomap archive (http://isomap.stanford.edu/), combined with Sammon's NLM provided in the SOM toolbox (http://www.cis.hut.fi/projects/somtoolbox/). Functions and libraries taken from Isomap compute the pairwise graph distances, whereas the mapping is achieved by the NLM function from the SOM toolbox.
Example
Figure 4.13 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. These results are obtained with step size
α set to 0.3. Geodesic NLM performs much better than its Euclidean version
(see Fig. 4.4). The Swiss roll is perfectly unrolled, and only two faces of the
open box are superposed. The fact that geodesic distances are longer than the
corresponding Euclidean ones explains this improvement: their weight in Sam-
mon’s stress is lower and the method focuses on preserving short distances.
GNLM also provides better results than Isomap (see Subsection 4.3.2), al-
though the graph distances have been computed in exactly the same way. This
is due to the fact that GNLM can embed data in a nonlinear way, independently of the distance used. In practice, this means that GNLM is expected to
perform better than Isomap when the manifold to embed is not developable.
Classification
Fig. 4.13. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by GNLM.
For a large part, the advantages and drawbacks of GNLM stay more or less
the same as for NLM. However, the use of the geodesic distance gives GNLM
a better ability to deal with heavily curved manifolds. If GNLM is given a
developable P -manifold, then it can reduce the dimensionality from D to P
with EGNLM ≈ 0 (the stress never vanishes completely since the graph distances
are only approximations). In the same situation, the original NLM could yield
a much larger value of ENLM and provide disappointing results.
Like other methods using the geodesic distance, GNLM may suffer from
badly approximated graph distances. Additionally, the graph construction
necessary to get them requires adjusting one or several additional parame-
ters.
Variants
with the graph distance may be seen as the addition of lateral connections
between the neurons. Visually, these lateral synapses are the weighted edges of
the graph, very similar to the lattice of an SOM, except that it is not regular.
Even if this may seem counterintuitive, the use of the graph distance actu-
ally removes lateral connections. In the original CCA, the use of the Euclidean
distance may be seen as a degenerate graph distance where all connections
are allowed by default. From this point of view, the graph distance, by cap-
turing information about the manifold shape, removes misleading shortcuts
that the Euclidean distance could follow. Thus, the graph distance is less dependent on the particular embedding of the manifold than the Euclidean distance is.
Because the graph distance is a sum of Euclidean distances, and by virtue
of the triangle inequality, graph distances are always longer than (or equal
to) the corresponding Euclidean distances. This property elegantly addresses
the main issue encountered with distance-preserving DR methods: because
manifolds embedded in high-dimensional spaces may be folded on themselves,
spatial distances like the Euclidean one are shorter than the corresponding
distances measured along the manifold. This issue is usually circumvented
by giving an increasing weight to very short distances. In this context, the
graph distance appears as a less blind solution: instead of “forgetting” the
badly estimated distance, the graph distance enhances the estimation. To
some extent, the graph distance “guesses” the value of the distance in the
embedding space before embedding. When the manifold to be embedded is
exactly Euclidean (see Subsection 4.3.2), the guess is perfect and the error
made on long distances remains negligible. In this case, the function Fλ that
is weighting the distances in ECCA becomes almost useless.
Nevertheless, the last statement is valid only for developable manifolds.
This is a bit too restrictive, and for other manifolds the weighting of distances remains fully relevant. It is noteworthy, however, that the use of the graph
distance makes Fλ much easier to parameterize.
For a heavily crumpled nondevelopable manifold, the choice of the neigh-
borhood width λ in CCA appears very critical. Too large, it forces CCA to
take into account long distances that have to be distorted anyway; the error
criterion ECCA no longer corresponds to what the user expects and its mini-
mization no longer makes sense. Too small, the neighborhood width prevents CCA from accessing sufficient knowledge about the manifold, and the convergence
becomes very slow.
On the other hand, in CDA, the use of the graph distance yields a better
estimate of the distance after embedding. Distances are longer, so λ can be larger, too.
exactly as for CCA, except for δy , which is here the graph distance instead of
the Euclidean one.
Assuming F_λ has a null derivative, the gradients of E^i_CCA-u and E^i_CCA-p are

∇_x(j) E^i_CCA-u = 2 (δ_y − d_x) F_λ(d_x) (x(j) − x(i)) / d_x ,    (4.118)

∇_x(j) E^i_CCA-p = −4 (δ_y² − d_x²) F_λ(d_x) (x(j) − x(i)) ,    (4.119)
leading to the update rule

x(j) ← x(j) + α { ∇_x(j) E^i_CCA-u                           if d_x(i, j) > δ_y
                  (1/(4 δ_y²(i, j))) ∇_x(j) E^i_CCA-p         if d_x(i, j) < δ_y } .    (4.120)
Besides the use of the graph distance, another difference distinguishes CCA
from CDA. The latter is indeed implemented with a slightly different han-
dling of the weighting function Fλ (dx (i, j)). In the original publications about
CCA [46, 45, 48], the authors advise using a step function (see Eqs. (4.86)
and (4.87)), which is written as
F_λ(d_x(i, j)) = { 0 if λ < d_x(i, j) ;  1 if λ ≥ d_x(i, j) } .    (4.121)
From a geometrical point of view, this function centers an open λ-ball around
each point x(i): the function equals one for all points x(j) inside the ball,
and zero otherwise. During the convergence of CCA, the so-called neighbor-
hood width λ, namely the radius of the ball, usually decreases according to a
schedule established by the user. But the important point to remark is that
the neighborhood width has a unique and common value for all points x(i).
This means that depending on the local distribution of x, the balls will include
different numbers of points. In sparse regions, even moderately large values of λ
could yield empty balls for some points, which are then no longer updated.
This problematic situation motivates the replacement of the neighborhood
width λ with a neighborhood proportion π, with 0 ≤ π ≤ 1. The idea consists
of giving each point x(i) an individual neighborhood width λ(i) such that the
corresponding ball centered on x(i) contains exactly πN points. This can be
achieved easily and exactly by computing the πN closest neighbors of x(i).
However, as mentioned in Appendix F.2, this procedure is computationally
demanding and would considerably slow down CCA.
Instead of computing exactly the πN closest neighbors of x(i), it is thus
cheaper to approximate the radius λ(i) of the corresponding ball. Assuming
that π ≈ 1 when CDA starts, all the λ(i) could be initialized as follows:
λ(i) ← max_j d_x(i, j) .    (4.122)
Next, when CDA is running, each time the point x(i) is selected, all other
points x(j) lying inside the λ(i)-ball are updated radially. The number N (i)
of updated points gives the real proportion of neighbors, defined as π(i) =
N (i)/N . The real proportion π(i), once compared with the desired proportion,
helps to adjust λ(i). For example, this can be done with the simple update
rule
λ(i) ← λ(i) (π / π(i))^(1/P) ,    (4.123)
which gives λ(i) its new value when point x(i) will be selected again by CDA.
In practice, the behavior of the update rule for λ(i) may be assessed when
CDA is running, by displaying the desired proportion π versus the effective
average one μi (π(i)). Typically, as π is continually decreasing, μi (π(i)) is
always a bit higher than desired. Experimentally, it has been shown that the
handling of Fλ (dx (i, j)) in CDA deals with outliers in a rather robust way
and avoids some useless tearings that are sometimes observed when using the
original CCA with a neighborhood width that is too small.
It is noteworthy that the use of an individual neighborhood width does
not complicate the parameter setting of CDA, since all neighborhood widths
are guided by a single proportion π.
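A minimal sketch of this per-point adaptation is given below (Python/NumPy, hypothetical names). The P-th root in the rescaling step reflects the assumption that the number of neighbors grows locally like λ(i)^P, in the spirit of Eq. (4.123); it is an illustrative choice, not a verbatim transcription of CDA.

```python
import numpy as np

def update_radius(X, i, lam_i, pi_target, P):
    """Adapt the per-point neighbourhood width lambda(i) of CDA
    (sketch of Eqs. (4.122)-(4.123))."""
    N = X.shape[0]
    dx = np.linalg.norm(X - X[i], axis=1)
    inside = (dx > 0.0) & (dx <= lam_i)          # points updated at this step
    pi_eff = inside.sum() / N                    # effective proportion pi(i)
    if pi_eff > 0.0:
        # rescale the radius so that the ball should contain a proportion
        # pi_target of the points; locally, the number of neighbours is
        # assumed to grow like lambda(i)**P, hence the P-th root
        lam_i *= (pi_target / pi_eff) ** (1.0 / P)
    return lam_i, inside

# initialization, as in Eq. (4.122): lambda(i) <- max_j d_x(i, j)
# lam = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2).max(axis=1)
```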
Gathering all above-mentioned ideas leads to the procedure given in
Fig. 4.14. No MATLAB® package is available for CDA. However, it is possible to build one quite easily using part of the Isomap archive (http://isomap.stanford.edu/), combined with CCA provided in the SOM toolbox (http://www.cis.hut.fi/projects/somtoolbox/). Functions and libraries taken from Isomap compute the pairwise graph distances, whereas the mapping is achieved by the CCA function from the SOM toolbox. A C++ im-
plementation of CDA can be downloaded from http://www.ucl.ac.be/mlg/.
Like the geodesic version of Sammon’s NLM, CDA involves additional param-
eters related to the construction of a data graph before applying Dijkstra’s
algorithm to compute graph distances. These parameters are, for instance,
the number of neighbors K or the neighborhood radius ε. Space complexity of
CDA remains unchanged compared to CCA, whereas time complexity must
take into account the application of Dijkstra’s algorithm for each graph vertex
(O(N 2 log N )) before starting the iterative core procedure of CCA/CDA.
A piecewise linear interpolation can work efficiently if the data set is not too
noisy. The interpolation procedure described in [45] for CCA also works for
CDA, at least if the neighborhood width is set to a small value; this ensures
that Euclidean and graph distances hardly differ on that local scale.
Example
Figure 4.15 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. The graph used to approximate the geodesic
Fig. 4.15. Two-dimensional embeddings of the “Swiss roll” and “open box” data sets
(Fig. 1.4), found by CDA.
distance and shown in the figure is the same as for Isomap and GNLM (ε-
rule). The main difference between the results of CCA (see Fig. 4.7) and CDA
regards the Swiss roll. The use of geodesic distances enables CDA to unroll
the Swiss roll and to embed it without superpositions and unnecessary tears.
Classification
Exactly like CCA, CDA is an offline (batch) method, with an optional vector
quantization as preprocessing. The data model is nonlinear and discrete; the
mapping is explicit. The method works with an approximate optimization
procedure. CDA does not include any estimator of the intrinsic dimensionality
of data and cannot build embeddings incrementally.
Variants
A variant of CDA is described in [58]. Although this method is named geodesic
nonlinear mapping, it is more closely related to CCA/CDA than to Sammon’s
NLM. Actually, this method uses the neural gas [135], a topology-representing
network (TRN [136]), instead of a more classical vector quantizer like K-
means. This choice enables the method to build the data graph in parallel
with the quantization. Next, graph distances are computed in the TRN. The
subsequent embedding procedure is driven with an objective function similar
to the one of CCA/CDA, except that F is an exponentially decaying function
of the neighborhood rank, exactly as in the neural gas. The resulting method
closely resembles Demartines and Hérault’s VQP [46, 45], though it remains
a batch procedure.
This section introduces NLDR methods that rely on less intuitive
ideas. The first one is kernel PCA, which is closely related to metric MDS and
other spectral methods. In this case, the methods directly stem from mathe-
matical considerations about kernel functions. These functions can transform
a matrix of pairwise distances in such a way that the result can still be inter-
preted as distances and processed using a spectral decomposition as in metric
MDS. Unfortunately, in spite of its elegance, kernel PCA performs rather
poorly in practice.
The second method, semi-definite embedding, relies on the same theory
but succeeds in combining it with a clear geometrical intuition. This different
point of view leads to a much more efficient method.
The name of kernel PCA [167] (KPCA) is quite misleading, since its approach
relates it more closely to classical metric MDS than to PCA [203]. Beyond
the equivalence between PCA and classical metric MDS, maybe this choice
can be justified by the fact that PCA is more widely known in the field where
KPCA has been developed.
Whereas the changes brought by Isomap to metric MDS were motivated by
geometrical considerations, KPCA extends the algebraical properties of MDS to nonlinear manifolds, without regard to their geometrical meaning. The
inclusion of KPCA in this chapter about dimensionality reduction by distance-
preserving methods is only justified by its resemblance to Isomap: they both
generalize metric MDS to nonlinear manifolds in similar ways, although the
underlying ideas completely differ.
The first idea of KPCA consists of reformulating the PCA into its metric
MDS equivalent, or dual form. If, as usual, the centered data points y(i) are
stored in matrix Y, then PCA works with the sample covariance matrix Ĉyy ,
proportional to YYT . On the contrary, KPCA works as metric MDS, i.e.,
with the matrix of pairwise scalar products S = YT Y.
The second idea of KPCA is to “linearize” the underlying manifold M.
For this purpose, KPCA uses a mapping φ : M ⊂ RD → RQ , y → z = φ(y),
where Q may be any dimension, possibly higher than D or even infinite.
Actually, the exact analytical expression of the mapping φ is useless, as will
become clear below. As a unique hypothesis, KPCA assumes that the mapping
φ is such that the mapped data span a linear subspace of the Q-dimensional
space, with Q > D. Interestingly, KPCA thus starts by increasing the data
dimensionality!
Once the mapping φ has been chosen, pairwise scalar products are com-
puted for the mapped data and stored in the N -by-N matrix Φ:
where the shortened notation sz (i, j) stands for the scalar product between
the mapped points z(i) = φ(y(i)) and z(j) = φ(y(j)).
Next, according to the metric MDS procedure, the symmetric matrix Φ
has to be decomposed in eigenvalues and eigenvectors. However, this operation
will not yield the expected result unless Φ is positive semidefinite, i.e., when
the mapped data z(i) are centered. Of course, it is difficult to center z because
the mapping φ is unknown. Fortunately, however, centering can be achieved
in an implicit way by performing the double centering on Φ.
Assuming that z̄ is already centered but z is not, one can write in the general case z(i) = z̄(i) + c, where c is some unknown constant, for instance the mean that has previously been subtracted from z to get z̄. Then, denoting μ_i the mean operator with respect to index i, the mean of the jth column of Φ is

μ_i(s_z(i, j)) = μ_i( z̄(i) · z̄(j) + z̄(i) · c + c · z̄(j) + c · c ) = c · z̄(j) + c · c ,    (4.128)

the mean of the ith row is

μ_j(s_z(i, j)) = z̄(i) · c + c · c ,    (4.129)

and the grand mean is

μ_{i,j}(s_z(i, j)) = μ_{i,j}( z̄(i) · z̄(j) + z̄(i) · c + c · z̄(j) + c · c )
                   = μ_i(z̄(i)) · μ_j(z̄(j)) + μ_i(z̄(i)) · c + c · μ_j(z̄(j)) + c · c
                   = 0 · 0 + 0 · c + c · 0 + c · c
                   = c · c .    (4.130)
It is easily seen that unknown terms in the right-hand side of Eq. (4.127) can
be obtained as the sum of Eqs. (4.128) and (4.129) minus Eq. (4.130). Hence,
Once the double centering has been performed, Φ can be decomposed into its
eigenvectors and eigenvalues:
Φ = UΛUT . (4.133)
X̂ = IP ×N Λ1/2 UT . (4.134)
K : L2 → L2 , f → Kf , (4.136)
with
(Kf)(v) = ∫ κ(u, v) f(u) du ,    (4.137)
is a mapping function into a space where κ acts as the Euclidean scalar prod-
uct, i.e.,
φ(u) · φ(v) = κ(u, v) . (4.142)
In practice, simple kernels that fulfill Mercer’s conditions are, for example,
• polynomial kernels [27]: κ(u, v) = (u · v + 1)^p, where p is some integer;
• radial basis functions like Gaussian kernels: κ(u, v) = exp( −‖u − v‖² / (2σ²) );
• kernels looking like the MLP activation function: κ(u, v) = tanh(u · v + b).
The choice of a specific kernel is quite arbitrary and mainly motivated by the
hope that the induced mapping φ linearizes the manifold to be embedded.
If this goal is reached, then PCA applied to the mapped data set should
efficiently reveal the nonlinear principal components of the data set.
The “kernel trick” described above plays a key role in a large family of
methods called support vector machines [27, 37, 42] (SVMs). This family
gathers methods dedicated to numerous applications like regression, function
approximation, classification, etc.
Finally, Fig. 4.16 shows how to implement KPCA. No general-purpose MATLAB® function is available on the Internet. However, a simple toy example can be downloaded from http://www.kernel-machines.org/code/kpca_toy.m; straightforward adaptations of this script can transform it into a more generic function. Another implementation is available in the SPIDER software package (http://www.kyb.mpg.de/bs/people/spider/main.
The particular version of the double centering described in Subsection 4.2.2 for
MDS also works for KPCA. It is easily adapted to the use of kernel functions.
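The whole KPCA pipeline fits in a few lines. The sketch below (Python/NumPy) uses a Gaussian kernel and the double centering discussed above; the choice of kernel, the function name, and the parameter values are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def kpca(Y, P=2, sigma=1.0):
    """Minimal KPCA sketch with a Gaussian kernel: kernel matrix,
    double centering, EVD, and coordinates as in Eq. (4.134).

    Y : (N, D) data;  P : embedding dimension;  sigma : kernel width.
    """
    N = Y.shape[0]
    sq = np.sum(Y ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T       # squared Euclidean distances
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))                # Gaussian kernel matrix
    J = np.eye(N) - np.ones((N, N)) / N
    Phi = J @ Phi @ J                                     # double centering
    lam, U = np.linalg.eigh(Phi)
    idx = np.argsort(lam)[::-1][:P]                       # leading eigenpairs
    return U[:, idx] * np.sqrt(np.maximum(lam[idx], 0.0))
```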
Example
Figure 4.17 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. Visually, it is clear that KPCA has trans-
Fig. 4.17. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by KPCA.
for the open box with a kernel width of 2.5. In other methods using an EVD,
like metric MDS and Isomap, the variance remains concentrated within the
first three eigenvalues, whereas KPCA spreads it out in most cases. In order to
concentrate the variance within a minimal number of eigenvalues, the width of
the Gaussian kernel may be increased, but then the benefit of using a kernel
is lost and KPCA tends to yield the same result as metric MDS: a linear
projection.
Classification
Like metric MDS and Isomap, KPCA is a batch method working offline with
simple algebraical operations. As with other spectral methods, KPCA can
build projections incrementally, by discarding or keeping eigenvectors. The
model of the method is nonlinear thanks to the kernel function that maps the
data in an implicit way. The model is also continuous, as for PCA, and the
mapping is implicit.
As shown in the previous section, generalizing PCA using the kernel trick as
in KPCA proves to be an appealing idea. However, and although theoretical
conditions determine the domain of admissible kernel functions, what the
theorems do not say is how to choose the best-suited kernel function in some
particular case. This shortcoming could disappear if one could learn the best
kernel from the data. Semidefinite embedding [196, 198, 195] (SDE), also
known as maximum variance unfolding [197] (MVU), follows this approach.
Like metric MDS, Isomap and KPCA, SDE implements distance preservation
by means of a spectral decomposition. Given a set of N observations, all these
methods build the low-dimensional embedding by juxtaposing the dominat-
ing eigenvectors of an N -by-N matrix. In metric MDS, pairwise distances are
used as they are, and then converted into scalar products by double center-
ing. In Isomap, traditional Euclidean distances are simply replaced with graph
distances. In KPCA, Euclidean distances are nonlinearly transformed using
a kernel function. In the case of SDE, the transformation of the distances is
a bit more complicated than for the former methods. Results of those meth-
ods depend on the specific transformation applied to the pairwise distances.
Actually, this transformation is arbitrarily chosen by the user: the type of dis-
tance (Euclidean/graph) or kernel (Gaussian, etc.) is fixed beforehand. The
idea behind SDE is to determine this transformation in a purely data-driven
way. For this purpose, distances are constrained to be preserved locally only.
Nonlocal distances are free to change and are optimized in such a way that
a suitable embedding can be found. The only remaining constraint is that
the properties of the corresponding Gram matrix of scalar products are kept
(symmetry, positive semidefiniteness), so that metric MDS remains applica-
ble. This relaxation of strict distance preservation into a milder condition of
local isometry enables SDE to embed manifolds in a nonlinear way.
In practice, the constraint of local isometry can be applied to smooth
manifolds only. However, in the case of a finite data set, a similar constraint
can be stated. To this end, SDE first determines the K nearest neighbors of
each data point, builds the corresponding graph, and imposes the preservation
of angles and distances for all K-ary neighborhoods:
As in the case of metric MDS, it can be shown that such an isometry constraint
determines the embedding only up to a translation. If the embedding is also
constrained to be centered:
Σ_{i=1}^{N} x_i = 0 ,    (4.147)

then this indeterminacy is avoided.
Subject to the constraint of local isometry, SDE tries to find an embedding
of the data set that unfolds the underlying manifold. To illustrate this idea,
the Swiss roll (see Section 1.5) is once again very useful. The Swiss roll can be
obtained by rolling up a flat rectangle in a three-dimensional space, subject
to the same constraint of local isometry. This flat rectangle is also the best
two-dimensional embedding of the Swiss roll. As a matter of fact, pairwise
Euclidean distances between faraway points (e.g., the corners of the rolled-up
rectangle) depend on the embedding. In the three-dimensional space, these
distances are shorter than their counterparts in the two-dimensional embed-
ding. Therefore, maximizing long distances while maintaining the shortest
ones (i.e., those between neighbors) should be a way to flatten or unfold the
Swiss roll. This idea translates into the following objective function:
φ = (1/(2N)) Σ_{i=1}^{N} Σ_{j=1}^{N} d_x²(i, j) ,    (4.148)

which is to be maximized under the constraint of local isometry. Moreover, the bound

φ ≤ (1/(2N)) Σ_{i=1}^{N} Σ_{j=1}^{N} δ_y²(i, j)    (4.149)
holds (see Subsection 4.2.2). If K = [sx (i, j)]1≤i,j≤N denotes the symmetric
matrix of dot products xi · xj in the low-dimensional embedding space, then
the centering constraint in Eq. (4.147) successively becomes

0 = Σ_{i=1}^{N} x_i ,    (4.152)

0 = 0 · x_j = Σ_{i=1}^{N} x_i · x_j ,    (4.153)

0 = Σ_{j=1}^{N} 0 · x_j = Σ_{i=1}^{N} Σ_{j=1}^{N} x_i · x_j = Σ_{i=1}^{N} Σ_{j=1}^{N} s_x(i, j) .    (4.154)
Finally, the objective function can also be expressed in terms of dot products
using Eqs. (4.150) and (4.154):
φ = (1/(2N)) Σ_{i=1}^{N} Σ_{j=1}^{N} d_x²(i, j)    (4.155)
  = (1/(2N)) Σ_{i=1}^{N} Σ_{j=1}^{N} ( s_x(i, i) − 2 s_x(i, j) + s_x(j, j) )    (4.156)
  = Σ_{i=1}^{N} s_x(i, i)    (4.157)
  = tr(K) ,    (4.158)
where tr(K) denotes the trace of K. At this stage, all constraints are linear
with respect to the entries of K and the optimization problem can be refor-
mulated. The goal of SDE consists of maximizing the trace of some N -by-N
matrix K subject to the following constraints:
• The matrix K is symmetric and positive semidefinite.
• The sum of all entries of K is zero (Eq. (4.154)).
• For nonzero entries of the adjacency matrix, the equality sx (i, j) = sy (i, j)
must hold.
The first two constraints allow us to cast SDE within the framework of clas-
sical metric MDS. Compared to the latter, SDE enforces only the preservation
of dot products between neighbors in the graph; all other dot products are
free to change.
In practice, the optimization over the set of symmetric and positive
semidefinite matrices is an instance of semidefinite programming (SDP; see,
e.g., [184, 112] and references therein): the domain is the cone of positive
semidefinite matrices intersected with hyperplanes (representing the equal-
ity constraints) and the objective function is a linear function of the matrix
entries. The optimization problem has some useful properties:
• Its objective function is bounded above by Eq. (4.149).
• It is also convex, thus preventing the existence of spurious local maxima.
• The problem is feasible, because S is a trivial solution that satisfies all
constraints.
Details on SDP are beyond the scope of this book and can be found in the
literature. Several SDP toolboxes in C++ or MATLAB® can be found on the
Internet. Once the optimal matrix K is determined, low-dimensional embed-
ding is obtained by decomposing K into eigenvalues and eigenvectors, exactly
as in classical metric MDS. If the EVD is written as K = UΛUT , then the
low-dimensional sample coordinates are computed as
X̂ = IP ×N Λ1/2 UT . (4.159)
The Nyström formula that is referred to in [6, 16] (see also Subsection 4.2.2)
cannot be used in this case, because the kernel function applied to the Gram
matrix and learned from data in the SDP stage remains unknown in closed
form.
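As an illustration only, the semidefinite program described above can be written compactly with a generic modeling package such as cvxpy (an arbitrary choice, not the toolbox used by the SDE authors). The sketch enforces strict local isometry, without slack variables, and the subsequent EVD step mirrors Eq. (4.159).

```python
import numpy as np
import cvxpy as cp

def sde_gram(D2, edges):
    """Solve the SDE/MVU semidefinite program sketched above.

    D2    : (N, N) squared Euclidean distances in the data space
    edges : list of (i, j) pairs forming the K-ary neighbourhood graph
    Returns the optimal Gram matrix K (then fed to the MDS-like EVD step).
    """
    N = D2.shape[0]
    K = cp.Variable((N, N), PSD=True)               # symmetric positive semidefinite
    constraints = [cp.sum(K) == 0]                  # centering, Eq. (4.154)
    for i, j in edges:                              # local isometry on graph edges
        constraints.append(K[i, i] + K[j, j] - 2 * K[i, j] == D2[i, j])
    prob = cp.Problem(cp.Maximize(cp.trace(K)), constraints)
    prob.solve()                                    # hands the SDP to an installed solver
    return K.value

# embedding from the learned Gram matrix, as in Eq. (4.159):
# lam, U = np.linalg.eigh(sde_gram(D2, edges))
# X = U[:, ::-1][:, :P] * np.sqrt(np.maximum(lam[::-1][:P], 0.0))
```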
Example
Figure 4.19 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. These results are obtained with number of
Fig. 4.19. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by SDE.
neighbors K equal to five (the graph shown in the figure does not correspond
to the actual graph built by SDE). Slack variables are off for the Swiss roll
(isometry is required) and on for the open box (distances may shrink). As can
be seen, the Swiss roll is well unfolded: local distances from one point to its
neighbors are almost perfectly preserved. For the open box, the embedding
is not so satisfying: although distances are allowed to shrink, SDE fails to
find a satisfying embedding and faces are superposed. For this manifold, the
SDP solver displays a failure message: no embedding with fewer than three
dimensions can be built without violating the constraints.
Classification
SDE is a batch method that processes data offline. It belongs to the family
of spectral methods, like metric MDS, Isomap, and KPCA. Exactly as those
methods, SDE is provided with an estimator of the data dimensionality and
can build embeddings in an incremental way.
In [196, 198, 195], SDE is introduced without vector quantization. As SDE
is quite computationally demanding, this preprocessing proves to be very use-
ful in practice.
The mapping resulting from SDE is explicit, and the associated data model
is discrete.
The constraint of local isometry proposed in SDE is milder than strict isom-
etry, as required in metric MDS and derived methods. As a result, SDE com-
pares favorably to methods based on weighted distance preservation.
SDE can also be seen as a kind of super-Isomap that remedies some short-
comings of the graph distance. Results of Isomap tightly depend on the esti-
mation quality of the geodesic distances: if the data set is sparse, the latter
will be poorly approximated by graph distances. Leaving aside any graph con-
struction problem (shortcuts), graph distances are likely to be overestimated
in this case, because graph paths are zigzagging. Similarly, Isomap also fails to
embed correctly nonconvex manifolds (e.g., manifolds with holes): in this case,
some graph paths are longer than necessary because they need to go round
the holes. (Actually, both problems are closely related, since a sparse sample
from a convex manifold and a dense sample from a manifold with many holes
may look quite similar.)
On the down side, SDE proves to be dramatically slow. This comes from
the complexity of the SDP step. Finally, SDE cannot process test points.
Variants
A variant of SDE that uses landmark points has been developed and is re-
ferred to in [195]. Instead of using all pairwise distances between data points,
a rectangular matrix is used instead, which consists of distances from data
points to a few landmarks (typically a subset of data). The main goal is to
reduce the computational cost of SDE while approximating reasonably well
the normal version. A similar principle is used to develop a landmark-based
version of Isomap.
Other manifold or graph embedding methods based on semidefinite pro-
gramming are also developed in [125, 30].
5 Topology Preservation
even the number of points (vertices) in the lattice (graph) can be adjusted by
data.
Additional examples and comparisons of the described methods can be
found in Chapter 6.
Along with the multi-layer perceptron (MLP), the self-organizing map is per-
haps the most widely known method in the field of artificial neural networks.
The story began with von der Malsburg’s pioneering work [191] in 1973.
His project aimed at modeling the stochastic patterns of eye dominance and
the orientation preference in the visual cortex. After this rather biologically
motivated introduction of self-organization, little interest was devoted to the
continuation of von der Malsburg’s work until the 1980s. At that time, the field
of artificial neural networks was booming again: it was a second birth for this interdisciplinary field, after a long and quiet period due to Minsky and Papert's
flaming book [138]. This boom already announced the future discovery of the
back-propagation technique for the multi-layer perceptron [201, 161, 88, 160].
But in the early 1980s, all the attention was focused on Kohonen’s work.
He simplified von der Malsburg’s ideas, implemented them with a clear and
relatively fast algorithm, and introduced them in the field of artificial neural
networks: the so-called (Kohonen’s) self-organizing map (SOM or KSOM) was
born.
Huge success of the SOMs in numerous applications of data analysis
quickly followed and probably stems from the appealing elegance of the SOMs.
The task they perform is indeed very intuitive and easy to understand, al-
though the mathematical translation of these ideas into a compactly written
error function appears quite difficult or even impossible in the general case.
More precisely, SOMs simultaneously perform the combination of two con-
current subtasks: vector quantization (see Appendix D) and topographic rep-
resentation (i.e., dimensionality reduction). Thus, this “magic mix” is used
not only for pure vector quantization, but also in other domains where self-
organization plays a key part. For example, SOMs can also be used to some
extent for nonlinear blind source separation [147, 85] as for nonlinear dimen-
sionality reduction. This versatility explains the ubiquity of SOMs in nu-
merous applications in several fields of data analysis like data visualization,
time series prediction [123, 122], and so forth.
Articulated around the points, the fitting inside the data cloud becomes easy,
at least intuitively. Roughly, it is like covering an object with an elastic fish-
ing net. This intuitive idea underlies the way an SOM works. Unfortunately,
things become difficult from an algorithmic point of view. How do we encode
the fishing net? And how do we fit it inside the data cloud?
Considering an SOM as a special case of a vector quantization method may
help to answer the question. As explained in Appendix D, vector quantization
where d is a distance function in the data space, usually the Euclidean dis-
tance.
The coordinates c(r) can be determined iteratively, by following more or
less the scheme of a Robbins–Monro procedure. Briefly put, the SOM runs
through the data set Y several times; each pass through the data set is called
an epoch. During each epoch, the following operations are achieved for each
datum y(i):
1. Determine the index r of the closest prototype of y(i), i.e.
where the learning rate α, obeying 0 ≤ α ≤ 1, plays the same role as the
step size in a Robbins–Monro procedure. Usually, α slowly decreases as
epochs go by.
In the update rule (5.3), νλ (r, s) is called the neighborhood function and can
be defined in several ways.
In the early publications about SOMs [191, 104], νλ (r, s) was defined to
be the so-called ‘Bubble’ function:
ν_λ(r, s) = { 0 if d_g(r, s) > λ ;  1 if d_g(r, s) ≤ λ } ,    (5.4)
where λ is the neighborhood (or Bubble) width. As the step size α does, the
neighborhood width usually decreases slowly after each epoch.
In [155, 154], νλ (r, s) is defined as
ν_λ(r, s) = exp( −d_g²(r, s) / (2λ²) ) ,    (5.5)
which looks like a Gaussian function (see Appendix B) where the neighbor-
hood width λ replaces the standard deviation.
It is noteworthy that if νλ (r, s) is defined as
ν_λ(r, s) = { 0 if r ≠ s ;  1 if r = s } ,    (5.6)
then the SOM does not take the lattice into account and becomes equivalent
to a simple competitive learning procedure (see Appendix D).
In all above definitions, if dg (r, s) is implicitly given by dg (g(r), g(s)), then
dg may be any distance function introduced in Section 4.2. In Eq. (5.5), the
Euclidean distance (L2 ) is used most of the time. In Eq. (5.4), on the other
hand, L2 as well as L1 or L∞ are often used.
Moreover, in most implementations, the points g(r) are regularly spaced
on a plane. This forces the embedding space to be two-dimensional. Two
neighborhood shapes are then possible: square (eight neighbors) or hexagonal
(six neighbors) as in Fig. 5.1. In the first (resp., second) case, all neighbors are
equidistant for L∞ (resp., L2 ). For higher-dimensional lattices, (hyper-)cubic
neighborhoods are the most widely used. The global shape of the lattice is
often a rectangle or a hexagon (or a parallelepiped in higher dimensions).
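A minimal SOM training loop with the Gaussian neighborhood of Eq. (5.5) might look as follows (Python/NumPy sketch; the linear schedules for α and λ and the random initialization are simplifying assumptions, not the recommended settings of the SOM toolbox).

```python
import numpy as np

def som_train(Y, grid, n_epochs=50, alpha0=0.5, lam0=3.0):
    """Minimal SOM sketch with a Gaussian neighbourhood (Eq. (5.5)).

    Y    : (N, D) data
    grid : (C, 2) fixed lattice coordinates g(r) of the C units
    Returns the prototypes c(r), shape (C, D).
    """
    rng = np.random.default_rng(0)
    C = grid.shape[0]
    proto = Y[rng.choice(len(Y), C, replace=False)].astype(float)   # random init
    # squared lattice distances d_g(r, s)**2, fixed once and for all
    dg2 = np.sum((grid[:, None, :] - grid[None, :, :]) ** 2, axis=2)
    for epoch in range(n_epochs):
        alpha = alpha0 * (1.0 - epoch / n_epochs)        # decreasing learning rate
        lam = lam0 * (1.0 - epoch / n_epochs) + 0.5      # decreasing neighbourhood width
        for y in Y[rng.permutation(len(Y))]:
            r = np.argmin(np.sum((proto - y) ** 2, axis=1))    # best-matching unit
            nu = np.exp(-dg2[r] / (2.0 * lam ** 2))            # neighbourhood function
            proto += alpha * nu[:, None] * (y - proto)         # move prototypes toward y
    return proto
```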
Figure 5.2 shows a typical implementation of an SOM. The reference software package for SOMs is the SOM toolbox, available at http://www.cis.hut.fi/projects/somtoolbox/. This is a complete MATLAB® toolbox, including
not only the SOM algorithm but also other mapping functions (NLM, CCA)
and visualization functions. A C++ implementation can also be downloaded
from http://www.ucl.ac.be/mlg/. The main parameters of an SOM are the
lattice shape (width, height, additional dimensions if useful), neighborhood
shape (square or hexagon), the neighborhood function νλ , and the learning
schedules (for learning rate α and neighborhood width λ). Space complexity
is negligible and varies according to implementation choices. Time complexity
is O(N D|C|) per iteration.
Eventually, it is noteworthy that SOMs are motivated by biological and
empirical arguments. Neither a generative model of data nor an objective
function is defined, except in very particular cases [38]. More information
about mathematical aspects of SOMs can be found in [62] and references
therein.
Example
The two benchmark data sets introduced in Section 1.5 can easily be pro-
cessed using an SOM. In order to quantize the 350 and 316 three-dimensional
points contained in the data sets (see Fig. 1.4), the same 30-by-10 rectangular
lattice as in Fig. 5.1 is defined. The neighborhood shape is hexagonal. With
the neighborhood function defined as in Eq. (5.5), the SOM computes the
embeddings shown in Fig. 5.3. By construction, the embedding computed by
Fig. 5.3. Two-dimensional embeddings of the “Swiss roll” and “open box” data sets
(Fig. 1.4), found by an SOM. The shape of the embedding is identical to the prede-
fined lattice shown in Fig. 5.1. Only the color patches differ: the specific color and
shape of each data point in Fig. 1.4 have been assigned to their closest prototypes.
Fig. 5.4. Three-dimensional views showing how the lattice of an SOM curves in
order to fit in the two data sets of Fig. 1.3. Colors indicate the left-right position of
each point in the lattice, as in Fig. 5.1.
the box. Cracks, however, are visible on two lateral faces. They explain why
two lateral faces are torn in Fig. 5.3 and cause the loss of some topological
relationships. In addition, some points of the lattice have no color spot. This
means that although the lattice includes fewer points than the data set, some
points of the lattice never play the role of closest prototype.
Finally, it must be remarked that the axis ranges differ totally in Figs. 5.3
and 5.4. This intuitively demonstrates that only the topology has been pre-
served: the SOM has not taken distances into account.
Classification
The wide success of SOMs can be explained by the following advantages. The
method is very simple from an algorithmic point of view, and its underlying
idea, once understood, is intuitively appealing. SOMs are quite robust and
perform very well in many situations, such as the visualization of labeled
data.
Nevertheless, SOMs have some well-known drawbacks, especially when
they are used for dimensionality reduction. Most implementations handle one-
or two-dimensional lattices only. Vector quantization is mandatory, meaning
that an SOM does not really embed the data points: low-dimensional coor-
dinates are computed for the prototypes only. Moreover, the shape of the
embedding is identical to the lattice, which is, in turn, defined in advance,
arbitrarily. This means that an SOM cannot capture the shape of the data
cloud in the low-dimensional embedding.
From a computational point of view, it is very difficult to assess the con-
vergence of an SOM, since no explicit objective function or error criterion is
optimized. Actually, it has been proved that such a criterion cannot be de-
fined, except in some very particular cases [57]. In addition, the parameter
setting of an SOM appears as a tedious task, especially for the neighborhood
width λ.
Variants
The generative topographic mapping (GTM) has been put forward by Bishop,
Svensén, and Williams [23, 24, 176] as a principled alternative to the SOM.
Actually, GTM is a specific density network based on generative modeling, as
indicated by its name. Although the term “generative model” has already been
used, for example in Subsection 2.4.1 where the model of PCA is described,
here it has a stronger meaning.
In generative modeling, all variables in the problem are assigned a prob-
ability distribution to which the Bayesian machinery is applied. For instance,
density networks [129] are a form of Bayesian learning that try to model data
in terms of latent variables [60]. Bayesian neural networks learn differently
from other, more traditional, neural networks like an SOM. Actually, Bayesian
learning defines a more general framework than traditional (frequentist) learn-
ing and encompasses it. Assuming that the data set Y = [y(i)]1≤i≤N has to
be modeled using parameters stored in vector w, the likelihood function L(w)
is defined as the probability of the data set given the parameters
with a prior p(w) = exp R(w). For the quadratic regularizer (Eq. (5.9)), the
prior would be proportional to a Gaussian density with variance 1/α.
Compared to frequentist learning, the Bayesian approach has the advan-
tage of finding a distribution for the parameters in w, instead of a single value.
Unfortunately, this is earned at the expense of introducing the prior, whose
selection is often criticized as being arbitrary.
Within the framework of Bayesian learning, density networks like GTM
are intended to model a certain distribution p(y) in the data space RD by a
small number P of latent variables. Given a data set (in matrix form) Y =
[y(i)]1≤i≤N drawn independently from the distribution p(y), the likelihood
and log-likelihood become
\[ L(w) = p(Y|w) = \prod_{i=1}^{N} p\big(y(i)\,\big|\,w\big) , \qquad (5.15) \]
\[ l(w) = \ln p(Y|w) = \sum_{i=1}^{N} \ln p\big(y(i)\,\big|\,w\big) . \qquad (5.16) \]
The latent variables x ∈ R^P are related to the data space through a parameterized mapping
\[ m : \mathbb{R}^P \rightarrow \mathcal{Y} \subset \mathbb{R}^D , \quad x \mapsto y = m(x, w) , \qquad (5.17) \]
and marginalizing over the latent variables relates the data density to the prior p(x) in latent space:
\[ p\big(y(i)\,\big|\,w\big) = \int_{\mathbb{R}^P} p\big(y(i)\,\big|\,x, w\big)\, p(x)\, dx . \qquad (5.19) \]
Using Bayes' rule once again, as in Eq. (5.10), gives the posterior in the
parameter space:
\[ p(w|Y) = \frac{p(Y|w)\, p(w)}{p(Y)} = \frac{L(w)\, p(w)}{p(Y)} . \qquad (5.20) \]
• An optimization algorithm in order to find the parameter w that maximizes
the posterior in the parameter space p(w|Y). In practice, this is often
achieved by maximizing the log-likelihood, for example by gradient ascent
on Eq. (5.19), when this is computationally feasible.
In GTM, the conditional density of an observation given a latent point is chosen to be (up to the exact normalization given in [176])
\[ p\big(y\,\big|\,x, W, \beta\big) \propto \exp\!\Big( -\frac{\beta}{2}\, \| y - m(x, W) \|^2 \Big) , \]
which behaves as an isotropic Gaussian noise model for y(x, W) that extends
the manifold Y to R^D: a given data vector y(i) could have been
generated by any point x with probability p(y(i)|x, W, β). It can also be
remarked that the error function G_i(x, W, β) trivially depends on the
squared distance between the observed data point y(i) and the generating
point x.
The prior distribution in latent space is chosen as a finite sum of delta functions,
\[ p(x) = \frac{1}{C} \sum_{r=1}^{C} \delta\big(x - g(r)\big) , \]
where the C points g(r) stand on a regular grid in latent space, in the
same way as the prototypes of an SOM. This discrete choice of the prior
distribution directly simplifies the integral in Eq. (5.19) into a sum. Oth-
erwise, for an arbitrary p(x), the integral must be explicitly discretized
(Monte Carlo approximation). Finally, in the case of GTM, Eq. (5.19) becomes
\[ p\big(y(i)\,\big|\,W, \beta\big) = \frac{1}{C} \sum_{r=1}^{C} p\big(y(i)\,\big|\,g(r), W, \beta\big) . \qquad (5.23) \]
• The mapping from latent space to data space is a generalized linear model
m(x, W) = Wφ(x), where W is a D-by-B matrix and φ a B-by-1 vector
consisting of B (nonlinear) basis functions. Typically, these B basis func-
tions are Gaussian kernels with explicitly set parameters: their centers are
drawn from the grid in the latent space, and their common variance is pro-
portional to the mean distance between the centers. In other words, the
mapping used by GTM roughly corresponds to an RBF network with con-
strained centers: by comparison with the MLP or usual density networks,
the RBFN remains a universal approximator but yields considerable sim-
plifications in the subsequent computations (see ahead). The exact posi-
tioning of the centers and the tuning of the width σ are not discussed here;
for more details, see [176]. Nevertheless, in order to get a smooth manifold
and avoid overfitting, it is noteworthy that (i) the number of kernels in the
constrained RBF must be lower than the number of grid points and (ii)
the width σ must be larger than the mean distance between neighboring
centers. As in other RBF networks, additional linear terms and biases may
complement the basis functions in order to easily take into account linear
trends in the mapping.
• The optimization algorithm is the expectation-maximization (EM) proce-
dure [50, 21]. This choice is typical when maximizing the likelihood and
working with mixtures of Gaussian kernels. By design, the resulting density
in the data space is a mixture of Gaussian kernels (see Eq. (5.23)), which fortunately
makes EM applicable. In other density networks, a more complex choice
of the prior does not allow simplifying the integral of Eq. (5.19) into a sum.
In the case of GTM, the objective function is the log-likelihood function.
Without going into technical details [176], the log-likelihood function
\[ l(W, \beta) = \sum_{i=1}^{N} \ln p\big(y(i)\,\big|\,W, \beta\big) \qquad (5.25) \]
is maximized by alternating the two following steps:
E step. Compute the responsibilities ρ_{i,r}(W, β), i.e., the posterior probabilities of the grid points g(r) given the data points y(i), for the current values of W and β.
M step. Update the parameters:
– A re-estimation formula for W: cancelling the derivatives of the expected log-likelihood leads to a linear system that is
solvable for W_new with standard matrix inversion techniques. This sim-
ple update rule results from the adoption of an RBFN-like approxima-
tor instead of, for example, an MLP, which would have required a
gradient ascent as optimization procedure.
– A re-estimation formula for β:
\[ \frac{1}{\beta} \leftarrow \frac{1}{ND} \sum_{r=1}^{C} \sum_{i=1}^{N} \rho_{i,r}(W_{\mathrm{new}}, \beta)\, \big\| y(i) - m\big(g(r), W_{\mathrm{new}}\big) \big\|^2 , \qquad (5.27) \]
where
– Φ = [φ(g(r))]_{1≤r≤C} is a B-by-C constant matrix,
– Y = [y(i)]_{1≤i≤N} is the data set (constant D-by-N matrix),
– P = [ρ_{i,r}(W, β)] is a varying N-by-C matrix of posterior probabilities
or responsibilities, obtained by Bayes' rule from Eq. (5.23):
\[ \rho_{i,r}(W, \beta) = \frac{p\big(y(i)\,\big|\,g(r), W, \beta\big)}{\sum_{s=1}^{C} p\big(y(i)\,\big|\,g(s), W, \beta\big)} , \]
– G is a diagonal C-by-C matrix whose entries accumulate the responsibilities:
\[ g_{r,r}(W, \beta) = \sum_{i=1}^{N} \rho_{i,r}(W, \beta) . \qquad (5.29) \]
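To make these quantities concrete, here is a minimal Python/NumPy sketch (not the reference GTM implementation) of the E step and of the β re-estimation of Eq. (5.27), under the isotropic Gaussian noise model described above; the variable names Y, Phi, W, and beta mirror the symbols of the text and are otherwise arbitrary.

```python
import numpy as np

def gtm_e_step(Y, Phi, W, beta):
    """E step: responsibilities rho[i, r] of grid point g(r) for data point y(i).

    Y   : D-by-N data matrix, Phi : B-by-C matrix of basis function values,
    W   : D-by-B weight matrix, beta : inverse noise variance.
    """
    M = W @ Phi                                   # D-by-C images m(g(r), W) of the grid points
    # squared distances ||y(i) - m(g(r), W)||^2, arranged in an N-by-C matrix
    d2 = ((Y[:, :, None] - M[:, None, :]) ** 2).sum(axis=0)
    log_p = -0.5 * beta * d2                      # unnormalized log p(y(i)|g(r), W, beta)
    log_p -= log_p.max(axis=1, keepdims=True)     # for numerical stability
    R = np.exp(log_p)
    R /= R.sum(axis=1, keepdims=True)             # normalize over the C grid points
    return R, d2

def gtm_update_beta(R, d2, D):
    """Re-estimation of beta as in Eq. (5.27): 1/beta is the responsibility-weighted
    mean squared reconstruction error per dimension."""
    N = R.shape[0]
    return 1.0 / (np.sum(R * d2) / (N * D))
```

The M step for W, which reduces to a linear system built from Φ, G, P, and Y, is omitted here for brevity.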
Once the parameters are fitted, each data point y(i) can be embedded in the latent space either by the grid point with the highest responsibility (the mode of the posterior) or by the posterior mean
\[ \hat{x}(i) = \sum_{r=1}^{C} g(r)\, p\big(g(r)\,\big|\,y(i)\big) . \qquad (5.34) \]
The second possibility is clearly the best except when the posterior distribu-
tion is multimodal. Putting together all the above-mentioned ideas leads to
the procedure presented in Fig. 5.5. A GTM MATLAB package is avail-
able at http://www.ncrg.aston.ac.uk/GTM/. The parameter list of GTM is
quite long. As with an SOM, the shape of the grid or lattice may be changed
(rectangular or hexagonal). The basis function of the RBF-like layer can be
tuned, too. Other parameters are related to the EM procedure (initialization,
number of iterations, etc.).
Example
Figure 5.6 illustrates how GTM embeds the “Swiss roll” and “open box”
manifolds introduced in Section 1.5. The points of the data sets (see Fig. 1.4)
are embedded in the latent space (a square 10-by-10 grid) using Eq. (5.34).
The mapping m works with a grid of 4-by-4 basis functions.
As can be seen, GTM fails to embed the Swiss roll correctly: all turns of
the spiral are superposed. On the other hand, the square latent space perfectly
suits the cubic shape of the open box: the bottom face is in the middle of the
square, surrounded by the four lateral faces. The upper corners of the box
correspond to the corners of the square latent space. By comparison with an
SOM (Fig. 5.3), GTM yields a much more regular embedding. The only visible
shortcoming is that the lateral faces are a bit shrunk near the borders of the
latent space.
Fig. 5.6. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by the GTM.
Classification
The essential difference between GTM and almost all other methods described
in this book is that GTM relies on the principle of Bayesian learning. This
probabilistic approach leads to a different optimization technique: an EM
algorithm is used instead of a (stochastic) gradient descent or a spectral de-
composition. As described above, GTM is a batch method, but a version that
works with a stochastic EM procedure also exists.
Because GTM determines the parameters of a generative model of data,
the dimensionality reduction easily generalizes to new points. Therefore, it
can be said that GTM defines an implicit mapping although the latent space
is discrete.
If the implementation does not impose a two-dimensional grid, an external
procedure is needed to estimate the intrinsic dimensionality of data, in order
to determine the right dimension for the latent space.
If several embeddings with various dimensionalities are desired, GTM must
be run again.
Variants
Some extensions to GTM are described in [176]. For example, the probabilistic
framework of GTM can be adapted to data sets with missing entries. Another
variant uses a different noise model: the Gaussian kernels include a full co-
variance matrix instead of being isotropic. The mapping m itself can also be
adapted easily; see [176] for details.
By comparison with an SOM and GTM, locally linear embedding [158, 166]
(LLE) considers topology preservation from a slightly different point of view.
Usual methods like an SOM and GTM try to preserve topology by keeping
neighboring points close to each other (neighbors in the lattice are maintained
close in the data space). In other words, for these methods, the qualitative
notion of topology is concretely translated into relative proximities: points
are close to or far from each other. LLE proposes another approach based on
conformal mappings. A conformal mapping (or conformal map or biholomor-
phic map) is a transformation that preserves local angles. To some extent, the
preservation of local angles and that of local distances are related and may
be interpreted as two different ways to preserve local scalar products.
As LLE builds a conformal mapping, the first task of LLE consists of de-
termining which angles to take into account. For this purpose, LLE se-
lects a couple of neighbors for each data point y(i) in the data set Y =
[. . . , y(i), . . . , y(j), . . .]1≤i,j≤N . Like other methods already studied, LLE can
perform this task with several techniques (see Appendix E). The most often
used ones associate with each point y(i) either the K closest other points or
all points lying inside an ε-ball centered on y(i).
If the data set is sufficiently large and not too noisy, i.e., if the underlying
manifold is well sampled, then one can assume that a value for K (or ε) exists
such that the manifold is approximately linear on the local scale of the K-ary
neighborhoods (or ε-balls). The idea of LLE is then to replace each point y(i)
with a linear combination of its neighbors. Hence, the local geometry of each
neighborhood can be summarized by the reconstruction error
\[ E(W) = \sum_{i=1}^{N} \Big\| y(i) - \sum_{j \in N(i)} w_{i,j}\, y(j) \Big\|^2 , \qquad (5.35) \]
where N (i) is the set containing all neighbors of point y(i) and wi,j , the
entries of the N -by-N matrix W, weight the neighbors in the reconstruction
of y(i). Briefly put, E(W) sums all the squared distances between a point and
its locally linear reconstruction. In order to compute the coefficients wi,j , the
cost function is minimized under two constraints:
• Points are reconstructed solely by their neighbors, i.e., the coefficients w_{i,j}
for points outside the neighborhood of y(i) are equal to zero: w_{i,j} = 0 ∀ j ∉ N(i);
• The rows of the coefficient matrix sum to one: \( \sum_{j=1}^{N} w_{i,j} = 1 \).
The constrained weights that minimize the reconstruction error obey an
important property: for any particular data point y(i), they are invariant to
rotations, rescalings, and translations of that data point and its neighbors. The
invariance to rotations and rescalings follows immediately from the particular
form of Eq. (5.35); the invariance to translations is enforced by the second
constraint on the rows of matrix W. A consequence of this symmetry is that
the reconstruction weights characterize intrinsic geometric properties of each
neighborhood, as opposed to properties that depend on a particular frame of
reference. The key idea of LLE then consists of assuming that these geometric
properties would also be valid for a low-dimensional representation of the
data.
More precisely, as stated in [158], LLE assumes that the data lie on or
near a smooth, nonlinear manifold of low intrinsic dimensionality. And then,
to a good approximation, there exists a linear mapping, consisting of a trans-
lation, rotation, and rescaling, that maps the high-dimensional coordinates
of each neighborhood to global intrinsic coordinates on the manifold. By de-
sign, the reconstruction weights wi,j reflect intrinsic geometric properties of
the data that are invariant to exactly these transformations. Therefore, it is
expected that their characterization of local geometry in the original data
space be equally valid for local patches on the manifold. In particular, the
same weights wi,j that reconstruct the data point y(i) in the D-dimensional
data space should also reconstruct its manifold coordinates in a P -dimensional
embedding space.
LLE constructs a neighborhood-preserving embedding based on the above
assumption. In the final step of LLE indeed, each high-dimensional data point
is mapped to a low-dimensional vector representing global intrinsic coordi-
nates on the manifold. This is done by choosing P -dimensional coordinates to
minimize the embedding cost function:
154 5 Topology Preservation
$ $2
N $
$
$ $
$ wi,j x̂(j)$
Φ(X̂) = $x̂(i) − $ . (5.36)
i=1 $ j∈N (i) $
This cost function, very similar to the previous one in Eq. (5.35), sums the
reconstruction errors caused by locally linear reconstruction. In this case, how-
ever, the errors are computed in the embedding space and the coefficients wi,j
are fixed. The minimization of Φ(X̂) gives the low-dimensional coordinates
X̂ = [. . . , x̂(i), . . . , x̂(j), . . .]1≤i,j≤N that best reconstruct y(i) given W.
In practice, the minimization of the two cost functions E(W ) and Φ(X̂) is
achieved in two successive steps.
First, the constrained coefficients wi,j can be computed in closed form, for
each data point separately. Considering a particular data point y(i) with K
nearest neighbors, its contribution to E(W) is
\[ E_i(W) = \Big\| y(i) - \sum_{j \in N(i)} w_{i,j}\, y(j) \Big\|^2 \qquad (5.37) \]
\[ \phantom{E_i(W)} = \Big\| \sum_{r=1}^{K} \omega_r(i)\, \big( y(i) - \nu(r) \big) \Big\|^2 \]
\[ \phantom{E_i(W)} = \sum_{r,s=1}^{K} \omega_r(i)\, \omega_s(i)\, g_{r,s}(i) , \qquad (5.40) \]
where ω(i) is a vector that contains the nonzero entries of the ith (sparse)
row of W and ν(r) denotes the rth neighbor of y(i), corresponding to y(j) in the
notation of Eq. (5.37). The second equality holds thanks to the (reformulated)
constraint \( \sum_{r=1}^{K} \omega_r(i) = 1 \), and the third one uses the K-by-K local Gram
matrix G(i), whose entries are defined as
\[ g_{r,s}(i) = \big( y(i) - \nu(r) \big)^{T} \big( y(i) - \nu(s) \big) . \]
Minimizing E_i(W) under the sum-to-one constraint amounts to solving a small
linear system involving G(i). When G(i) is singular or nearly so (for instance,
when K > D), it can be conditioned by adding a small multiple of the identity matrix:
\[ G \leftarrow G + \frac{\Delta^2 \operatorname{tr}(G)}{K}\, I , \qquad (5.43) \]
where Δ is a small regularization factor. This amounts to penalizing large
weights that exploit correlations beyond some level of precision in the data
sampling process. Actually, Δ is somehow a “hidden” parameter of LLE.
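As an illustration of Eqs. (5.37)-(5.43), the following Python/NumPy sketch computes the regularized reconstruction weights of a single data point from its already-selected neighbors; it is a minimal rendering of the computation described above, not the reference implementation, and the default value of Δ is arbitrary.

```python
import numpy as np

def lle_weights(y_i, neighbors, delta=1e-2):
    """Reconstruction weights of one point from its K neighbors.

    y_i       : (D,) data point,
    neighbors : (K, D) array containing its K nearest neighbors,
    delta     : regularization factor (Delta in the text; Eq. (5.43) uses Delta**2).
    Returns the K weights, which sum to one.
    """
    Z = y_i - neighbors                  # (K, D) differences y(i) - nu(r)
    G = Z @ Z.T                          # K-by-K local Gram matrix G(i)
    K = G.shape[0]
    G = G + (delta**2 * np.trace(G) / K) * np.eye(K)   # regularization, Eq. (5.43)
    w = np.linalg.solve(G, np.ones(K))   # solve G w = 1
    return w / w.sum()                   # enforce the sum-to-one constraint
```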
The minimization of the second cost function Φ(X̂) can be done at once
by solving an eigenproblem. For this purpose, Φ(X̂) is developed as follows:
\[ \Phi(\hat{X}) = \sum_{i=1}^{N} \Big\| \hat{x}(i) - \sum_{j \in N(i)} w_{i,j}\, \hat{x}(j) \Big\|^2 \qquad (5.44) \]
\[ \phantom{\Phi(\hat{X})} = \sum_{i=1}^{N} \Big\| \sum_{j \in N(i)} w_{i,j} \big( \hat{x}(i) - \hat{x}(j) \big) \Big\|^2 \qquad (5.45) \]
\[ \phantom{\Phi(\hat{X})} = \sum_{i,j=1}^{N} m_{i,j}\, \big( \hat{x}^{T}(i)\, \hat{x}(j) \big) , \qquad (5.46) \]
where the coefficients m_{i,j} are the entries of the N-by-N matrix
\[ M = (I - W)^{T} (I - W) . \qquad (5.47) \]
Minimizing Φ(X̂) under the constraint that the embedding has zero mean and unit
covariance reduces to computing the eigenvectors of M associated with its P+1
smallest eigenvalues. Because the rows of W sum to one, the smallest eigenvalue
is zero and its eigenvector is constant; it is discarded, and the entries of the other retained
eigenvectors must sum to zero by virtue of orthogonality with the last one.
The remaining P eigenvectors give the estimated P-dimensional coordinates
of the points x̂(i) in the latent space. Figure 5.7 summarizes all previous ideas
in a short procedure. A MATLAB function implementing LLE is available online.
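The final step can be sketched as follows: given the complete N-by-N weight matrix W, the embedding is read off the eigenvectors of M associated with its smallest eigenvalues, discarding the constant one. The dense eigensolver below is only for illustration; as discussed later in the text, a real implementation would exploit the sparsity of M.

```python
import numpy as np

def lle_embedding(W, P):
    """Embed the data given the N-by-N reconstruction weight matrix W.

    The bottom eigenvector of M is constant and is discarded; the next P
    eigenvectors give the P-dimensional coordinates (one row per data point).
    """
    N = W.shape[0]
    M = (np.eye(N) - W).T @ (np.eye(N) - W)   # Eq. (5.47)
    eigval, eigvec = np.linalg.eigh(M)        # eigenvalues in ascending order
    return eigvec[:, 1:P + 1]                 # skip the constant eigenvector
```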
Example
Figure 5.8 shows how LLE embeds the two benchmark manifolds introduced
in Section 1.5. The dimensionality of the data sets (see Fig. 1.4) is reduced
from three to two using the following parameter values: K = 7 and Δ² = 10⁻²
for the Swiss roll, and K = 4 and Δ² = 10 for the open box. Perhaps because
of the low number of data points, both parameters require careful tuning.
It is noteworthy that in the case of the open box, Δ differs widely from the
proposed all-purpose value (Δ² = 10⁻⁴).
Fig. 5.8. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by LLE.
Once the parameters are correctly set, the embedding looks rather good:
there are no tears and the box is deformed smoothly, without superpositions.
The only problem for the open box is that at least one lateral face is completely
crushed.
Classification
Like MDS, Isomap, and KPCA, LLE is a batch (offline) method working with
simple algebraic operations. Like many other spectral methods that rely on an
EVD, LLE is able to build embeddings incrementally just by appending or
removing eigenvectors. The mapping provided by LLE is explicit and discrete.
In contrast with classical metric MDS, LLE assumes that data are linear
locally, not globally. Hence, the model of LLE allows one to unfold nonlinear
manifolds, as expected. More precisely, LLE assumes that the manifold can
be mapped to a plane using a conformal mapping. Although both MDS and
LLE use an EVD, which is purely linear, the nonlinear capabilities of LLE
actually come from its first step: the computation of the nearest neighbors.
The greatest advantage of LLE lies in its sound theoretical foundation. The
principle of the method is elegant and simple. Like Isomap, LLE can embed
a manifold in a nonlinear way while sticking to an eigensolver. Importantly,
even if the reconstruction coefficients for each data point are computed from its
local neighborhood only, independently of the other neighborhoods, the embed-
ding coordinates involve the solution of an N -by-N eigenproblem. This means
that although LLE primarily relies on local information about the manifold,
a global operation couples all data points in the connected components of the
graph underlying matrix W.
From the computational viewpoint, LLE needs to compute the EVD of an
N -by-N matrix, where N is the number of points in the data set. Hence, it
may be feared that too large a data set may rapidly become computationally
intractable. Fortunately, the matrix to be decomposed is sparse, enabling
specialized EVD procedures to keep the computational load low. Furthermore,
it is noteworthy that in contrast with most other methods using an EVD or
SVD described in this book, LLE looks for eigenvectors associated with the
smallest eigenvalues. In practice, this specificity, combined with the fact that
the matrix is large, reinforces the need for high-performance EVD procedures.
In contrast to what is claimed in [158], finding good parameters for LLE is
not so easy, as reported in [166], for instance. Actually, two parameters must
be tuned carefully: K, the number of neighbors (or ε when the neighborhoods
are determined by the ε-rule), and Δ, the regularization factor. Depending on
these two parameters, LLE can yield completely different embeddings.
Variants
The method called “Laplacian eigenmaps” [12, 13] (LE in short) belongs to
the now largely developed family of NLDR techniques based on spectral de-
composition. The method was intended to remedy some shortcomings of other
spectral methods like Isomap (Subsection 4.3.2) and LLE (Subsection 5.3.1).
In contrast with Isomap, LE develops a local approach to the problem of non-
linear dimensionality reduction. In that sense, LE is closely related to LLE,
although it tackles the problem in a different way: instead of reproducing
small linear patches around each datum, LE relies on graph-theoretic con-
cepts like the Laplacian operator on a graph. LE is based on the minimization
of local distances, i.e., distances between neighboring data points. In order to
avoid the trivial solution where all points are mapped onto a single point (all
distances are then zero!), the minimization is constrained.
More precisely, LE first connects neighboring data points into a graph (using, e.g.,
the K-rule or the ε-rule) with adjacency matrix A = [a_{i,j}], and then looks for low-dimensional coordinates
that keep the same neighborhood relationships. For this purpose, the following
criterion is defined:
\[ E_{\mathrm{LE}} = \frac{1}{2} \sum_{i,j=1}^{N} \| x(i) - x(j) \|_2^2 \; w_{i,j} , \qquad (5.50) \]
where entries wi,j of the symmetric matrix W are related to those of the
adjacency matrix in the following way: wi,j = 0 if ai,j = 0; otherwise, wi,j ≥ 0.
Several choices are possible for the nonzero entries. In [12] it is recommended
to use a Gaussian bell-shaped kernel (the "heat kernel")
\[ w_{i,j} = \exp\!\left( - \frac{\| y(i) - y(j) \|_2^2}{2 T^2} \right) ; \qquad (5.51) \]
letting T → +∞ yields the degenerate choice w_{i,j} = 1 for every connected pair.
Defining the diagonal degree matrix D, with entries \( d_{i,i} = \sum_{j=1}^{N} w_{i,j} \), and the
graph Laplacian L = D − W, the criterion can be developed as follows:
\[ E_{\mathrm{LE}} = \frac{1}{2} \sum_{i,j=1}^{N} \| x(i) - x(j) \|_2^2 \, w_{i,j} \qquad (5.54) \]
\[ \phantom{E_{\mathrm{LE}}} = \frac{1}{2} \sum_{p=1}^{P} \sum_{i,j=1}^{N} \big( x_p(i) - x_p(j) \big)^2 w_{i,j} \qquad (5.55) \]
\[ \phantom{E_{\mathrm{LE}}} = \frac{1}{2} \sum_{p=1}^{P} \sum_{i,j=1}^{N} \big( x_p^2(i) + x_p^2(j) - 2\, x_p(i)\, x_p(j) \big) w_{i,j} \qquad (5.56) \]
\[ \phantom{E_{\mathrm{LE}}} = \frac{1}{2} \sum_{p=1}^{P} \left( \sum_{i=1}^{N} x_p^2(i)\, d_{i,i} + \sum_{j=1}^{N} x_p^2(j)\, d_{j,j} - 2 \sum_{i,j=1}^{N} x_p(i)\, x_p(j)\, w_{i,j} \right) \qquad (5.57) \]
\[ \phantom{E_{\mathrm{LE}}} = \frac{1}{2} \sum_{p=1}^{P} \Big( 2\, f_p^T(y)\, D\, f_p(y) - 2\, f_p^T(y)\, W\, f_p(y) \Big) \qquad (5.58) \]
\[ \phantom{E_{\mathrm{LE}}} = \sum_{p=1}^{P} f_p^T(y)\, L\, f_p(y) = \operatorname{tr}\big(X L X^T\big) , \qquad (5.59) \]
where f_p(y) is the N-dimensional vector giving the pth coordinate of each
embedded point, that is, the transpose of the pth row of X. By the
way, it is noteworthy that the above calculation also shows that L is positive
semidefinite.
Minimizing E_LE with respect to X under the constraint XDX^T = I_{P×P}
reduces to solving the generalized eigenvalue problem λDf = Lf and looking
for the P eigenvectors of L associated with the smallest eigenvalues. As L is
symmetric and positive semidefinite, all eigenvalues are real and not smaller
than zero. This can be seen by solving the problem incrementally, i.e., by
computing first a one-dimensional embedding, then a two-dimensional one,
and so on. At this point, it must be noticed that λDf = Lf possesses a trivial
solution. Indeed, for f = 1_N, where 1_N = [1, ..., 1]^T, it comes out that W 1_N =
D 1_N and thus that L 1_N = 0_N. Hence λ_N = 0 is the smallest eigenvalue of L,
and f_N(y) = 1_N.
An equivalent approach [16] to obtain the low-dimensional embedding (up
to a componentwise scaling) consists of normalizing the Laplacian matrix:
\[ \mathcal{L} = D^{-1/2} L D^{-1/2} = \left[ \frac{l_{i,j}}{\sqrt{d_{i,i}\, d_{j,j}}} \right]_{1 \le i,j \le N} , \qquad (5.60) \]
and computing its eigenvalue decomposition
\[ \mathcal{L} = U \Lambda U^T . \qquad (5.61) \]
The eigenvectors associated with the P smallest eigenvalues (except the last
one, which is zero) form a P -dimensional embedding of the data set. The
eigenvalues are the same as for the generalized eigenvalue problem, and the
following relationship holds for the eigenvectors: \( u_i = D^{1/2} f_i \).
The Laplacian matrix as computed above is an operator on the neigh-
borhood graph, which is a discrete representation of the underlying manifold.
Actually, the Laplacian matrix stems from a similar operator on smooth mani-
folds, the Laplace-Beltrami operator. Cast in this framework, the eigenvectors
ui are discrete approximations of the eigenfunctions of the Laplace-Beltrami
operator applied on the manifold. More details can be found in [13, 17].
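A minimal dense sketch of the whole procedure is given below in Python/SciPy; it assumes the neighborhood graph is provided as a symmetric boolean adjacency matrix, applies the heat kernel of Eq. (5.51) (with T = ∞ giving the degenerate 0/1 weights), and solves the generalized eigenvalue problem λDf = Lf.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(Y, adjacency, P, T=np.inf):
    """Y : N-by-D data, adjacency : N-by-N symmetric boolean matrix (no self-loops),
    P : target dimensionality, T : heat kernel width of Eq. (5.51)."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)   # squared Euclidean distances
    if np.isinf(T):
        W = adjacency.astype(float)                 # degenerate heat kernel (0/1 weights)
    else:
        W = np.where(adjacency, np.exp(-d2 / (2.0 * T**2)), 0.0)
    D = np.diag(W.sum(axis=1))                      # degree matrix
    L = D - W                                       # unnormalized graph Laplacian
    # generalized eigenproblem L f = lambda D f; eigenvalues returned in ascending order
    eigval, eigvec = eigh(L, D)
    return eigvec[:, 1:P + 1]                       # drop the constant (zero-eigenvalue) vector
```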
Laplacian eigenmaps can be implemented with the procedure shown in
Fig. 5.9. A software package in the MATLAB language is available online.
Example
Figure 5.10 shows how LE embeds the two benchmark manifolds introduced in
Section 1.5. The dimensionality of the data sets (see Fig. 1.4) is reduced from
three to two using the following parameter values: K = 7 for the Swiss roll and
K = 8 for the open box. These values lead to graphs with more edges than the
lattices shown in the figure. Moreover, the parameter that controls the graph
building (K or ε) requires careful tuning to obtain satisfying embeddings.
Matrix W is computed with the degenerate heat kernel (T = ∞). As can be
Fig. 5.10. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by LE.
seen, the Swiss roll is only partially unfolded; moreover, the third dimension
of the spiral is crushed. Results largely depend on parameter K: changing it
can yield embeddings of a completely different shape. In the case of the open
box, the result is more satisfying although at least one face of the open box
is crushed.
Classification
LE is almost parameter-free when the heat kernel is used with T = +∞: only
K or ε remains; this is an advantage over LLE in this respect. Nevertheless,
the choice of these two last parameters may have a dramatic influence on the
results of LE. Moreover, it is also possible to change the kernel function.
Like KPCA, LE usually yields poor embeddings. Connections between LE
and spectral clustering (see ahead) indicate that the method performs better
for data clustering than for dimensionality reduction.
Another explanation of the poor performance in dimensionality reduction
can be found in its objective function. Minimizing distances between neigh-
boring points seems to be an appealing idea at first sight. (Note, by the way,
that semi-definite embedding follows exactly the opposite approach: distances
between nonneighboring points are maximized; see Subsection 4.4.2 for more
details.) But as shown above, this can lead to degenerate solutions, such as
an embedding having identical coordinates for all data points. Although this trivial
solution can easily be spotted by looking at the equations, it is likely that other
configurations also minimize the objective function of LE while failing to provide a
suitable embedding. Intuitively, it is not difficult to imagine what such config-
urations should look like. For instance, assume that data points are regularly
distributed on a plane, just as in an SOM grid, and that parameter K is high
enough so that three aligned points on the grid are all direct neighbors of each
other. Applying LE to that data set can lead to a curved embedding. Indeed,
curving the manifold allows LE to minimize the distance between the first
and third points of the alignment. This phenomenon can easily be verified ex-
perimentally and seriously questions the applicability of LE to dimensionality
reduction.
From the computational viewpoint, LE works in a similar way as LLE, by
computing a Gram-like matrix and extracting eigenvectors associated with the
smallest eigenvalues. Therefore, LE requires robust EVD procedures. As with
LLE, specific procedures can exploit the intrinsic sparsity of the Laplacian
matrix.
Variants
5.3.3 Isotop
Isotop reduces the dimensionality of a data set by breaking down the problem
into three successive steps:
1. vector quantization (optional),
2. graph building,
3. low-dimensional embedding.
These three steps are further detailed ahead. Isotop relies on a single and
simple hypothesis: the data set
Y = [. . . , y(i), . . . , y(j), . . .]1≤i,j≤N (5.62)
contains a sufficiently large number N of points lying on (or near) a smooth
P -manifold.
If the data set contains too many points, the first step of Isotop consists of
performing a vector quantization in order to reduce the number of points (see
Appendix D). This optional step can easily be achieved by various algorithms,
like Lloyd’s, or a competitive learning procedure. In contrast with an SOM,
no neighborhood relationships between the prototypes are taken into account
at this stage in Isotop. For the sake of simplicity, it is assumed that the first
step is skipped, meaning the subsequent steps work directly with the raw data
set Y.
Second, Isotop connects neighboring data points (or prototypes), by using
the graph-building rules proposed in Appendix E, for example. Typically,
each point y(i) of the data set is associated with a graph vertex vi and then
connected with its K closest neighbors or with all other points lying inside an
ε-ball. The obtained graph G = (V_N, E) is intended to capture the topology
of the manifold underlying the data points, in the same way as it is done
in other graph-based methods like Isomap, GNLM, CDA, LLE, LE, etc. In
contrast with an SOM, G may be completely different from a rectangular
lattice, since it is not predefined by the user. Instead, G is “data-driven”,
i.e., completely determined by the available data. Moreover, until this point,
no low-dimensional representation is associated with the graph, whereas it
is precisely such a representation that predetermines the lattice in an SOM.
Eventually, the second step of Isotop ends by computing the graph distances
δ(vi , vj ) for all pairs of vertices in the graph (see Subsection 4.3.1). These
distances will be used in the third step; in order to compute them, each edge
(vi , vj ) of the graph is given a weight, which is equal to the Euclidean distance
y(i) − y(j) separating the corresponding points in the data space. The
graph distances approximate the geodesic distances in the manifold, i.e., the
distances between points along the manifold.
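For illustration, the K-rule and the graph distances can be computed in a few lines of Python/SciPy (a sketch, not Isotop's actual code); Dijkstra's algorithm applied to the Euclidean edge lengths yields the graph distances δ(v_i, v_j).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def knn_graph_distances(Y, K=5):
    """Y : N-by-D data matrix. Returns the N-by-N matrix of graph distances
    computed on the symmetrized K-nearest-neighbor graph (K-rule)."""
    d = squareform(pdist(Y))                 # Euclidean distance matrix
    N = d.shape[0]
    A = np.zeros_like(d)
    for i in range(N):
        nn = np.argsort(d[i])[1:K + 1]       # K closest points (skip the point itself)
        A[i, nn] = d[i, nn]
    A = np.maximum(A, A.T)                   # symmetrize: keep an edge if either end selects it
    return shortest_path(csr_matrix(A), method="D", directed=False)
```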
The third step of Isotop is the core of the method. While the first and sec-
ond steps aim at converting the data, given as D-dimensional coordinates, into
graph G, the third step achieves the inverse transformation. More precisely,
the goal consists of translating graph G into P -dimensional coordinates. For
this purpose, the D-dimensional coordinates y(i) associated with the vertices
of the graph are replaced with P -dimensional coordinates x(i), which are ini-
tialized to zero. At this stage, the low-dimensional representation X of Y is
built but, obviously, X does not truly preserve the topology of Y yet: it must
be modified or updated in some way. For this purpose, Y may be forgotten
henceforth: Isotop will use only the information conveyed by G in order to
update X.
The low-dimensional coordinates are updated according to rule (5.64), which attracts
the vertices toward a randomly drawn point r; in this rule, the learning rate α, which
satisfies 0 ≤ α ≤ 1, plays the same role as the step size in a Robbins–Monro procedure.
In the update rule (5.64), ν_λ(i, j) is called the neighborhood function and is
defined as follows:
\[ \nu_\lambda(i, j) = \exp\!\left( - \frac{1}{2} \, \frac{\delta_y^2(i, j)}{\lambda^2\, \mu^2_{(v_h, v_j) \in E}\big(\delta_y(h, j)\big)} \right) , \qquad (5.65) \]
where \( \mu_{(v_h, v_j) \in E}(\cdot) \) denotes the mean computed over the graph edges \((v_h, v_j) \in E\), so that
\[ \mu_{(v_h, v_j) \in E}\big(\delta_y(h, j)\big) = \mu_{(v_h, v_j) \in E}\big(d_y(h, j)\big) = \mu_{(v_h, v_j) \in E}\big(\| y(h) - y(j) \|\big) . \qquad (5.66) \]
The second factor of the denominator aims at normalizing the numerator
δy2 (i, j), in order to roughly approximate the relative distances from y(j) to
y(i) inside the manifold. Without this factor, Isotop would depend too much
on the local density of the points on the manifold: smaller (resp., larger)
distances are measured in denser (resp., sparser) regions.
Typically, the three steps above are repeated N times with the same values
for the parameters α and λ; such a cycle may be called an epoch, as for other
adaptive algorithms following the scheme of a Robbins–Monro procedure [156].
Moreover, instead of drawing r from the normalized sum of Gaussian kernels,
each kernel can be visited in turn and r drawn from the single kernel centered
on the visited vertex, as in the procedure below.
1. Perform a vector quantization of the data set; this step is optional and
can be done with any standard quantization method.
2. Build a graph structure with an appropriate rule (see Appendix E), and
compute all pairwise graph distances.
3. Initialize low-dimensional coordinates x(i) of all vertices vi to zero.
4. Initialize the learning rate α and the neighborhood width λ with their
scheduled values for epoch number q.
5. For each vertex vi in the graph,
• Generate a point r randomly drawn from a Gaussian distribution cen-
tered on the associated coordinates x(i).
• Compute the closest vertex from r according to Eq. (5.63).
• Update the coordinates of all vertices according to rule (5.64).
6. Increase q and return to step 4 if convergence is not reached.
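The inner loop of this procedure can be sketched as follows in Python/NumPy. Since Eqs. (5.63) and (5.64) are not reproduced above, the winner selection and the attractive update below are assumptions made by analogy with an SOM (closest vertex as winner, coordinates pulled toward r); only the neighborhood function follows Eq. (5.65).

```python
import numpy as np

def isotop_epoch(X, Gdist, mu, alpha, lam, rng=np.random.default_rng(0)):
    """One epoch of Isotop's third step (sketch; Eqs. (5.63)-(5.64) are assumed forms).

    X     : N-by-P low-dimensional coordinates (updated in place),
    Gdist : N-by-N matrix of graph distances delta_y(i, j),
    mu    : normalization of Eq. (5.66) (scalar or length-N array),
    alpha : learning rate, lam : neighborhood width lambda.
    """
    N = X.shape[0]
    for i in rng.permutation(N):
        r = rng.normal(loc=X[i], scale=1.0)              # random point around x(i)
        winner = np.argmin(((X - r) ** 2).sum(axis=1))   # assumed form of Eq. (5.63)
        # neighborhood function of Eq. (5.65), centered on the winning vertex
        nu = np.exp(-0.5 * Gdist[winner] ** 2 / (lam**2 * mu**2))
        X += alpha * nu[:, None] * (r - X)               # assumed form of Eq. (5.64): attraction toward r
    return X
```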
Example
Figure 5.12 shows how Isotop embeds the two benchmark manifolds intro-
duced in Section 1.5. The graph is built using ε-balls, in order to obtain the
graphs shown in the figure. No superpositions occur in the obtained embed-
dings. In the case of the open box, the bottom face is shrunk, allowing the
lateral faces to be embedded without too many distortions; similarly, the box
lid is stretched but remains square. The two examples clearly illustrate the
ability of Isotop to either preserve the manifold shape (in the case of the Swiss
roll) or deform some regions of the manifold if necessary (for the box).
Fig. 5.12. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by Isotop.
Classification
Isotop shares many characteristics with an SOM. Both methods rely on a non-
linear model and use vector quantization (optional in the case of Isotop). They
also both belong to the world of artificial neural networks and use approxi-
mate optimization techniques. Because Isotop is divided into three successive
steps, it cannot easily be implemented as an online algorithm, unlike an SOM.
The mapping Isotop produces between the high- and low-dimensional
spaces is discrete and explicit. Hence, the generalization to new points is not
easy.
The comparison between Isotop and an SOM is unavoidable since both meth-
ods are closely related. Actually, Isotop can be interpreted as an SOM working
in “reverse gear”. Indeed, the following procedure describes how an SOM works:
1. Determine the shape of the low-dimensional embedding (usually a two-
dimensional grid of regularly spaced points).
2. Build a lattice between the grid points (this second step is often implicitly
merged with the first one).
3. Perform a vector quantization in the high-dimensional space in order to
place and deform the lattice in the data cloud; this step defines the map-
ping between the high- and low-dimensional spaces.
As can be seen, data are involved only (and lately) in the third step. In other
words, an SOM weirdly starts by defining the shape of the low-dimensional
embedding, without taking the data into account! In the framework of dimen-
sionality reduction, this approach goes in the opposite direction compared to
the other usual methods. Isotop puts things back in their natural order and
the three above steps occur as follows:
1. Perform a vector quantization in the high-dimensional space (this step is optional).
2. Build a lattice, i.e., a data-driven graph, between the points or prototypes.
3. Determine the shape of the low-dimensional embedding by embedding this graph;
this last step defines the mapping between the high- and low-dimensional spaces.
Variants
To some extent, Isotop can be related to spring-based layouts and other graph-
embedding techniques. See, for example, [52].
Some “historical” aspects regarding the development of Isotop can be
found in [119]. In earlier versions, only one Gaussian kernel was used in order
to unfold the manifold and counterbalance the attractive force induced by
Eq. (5.64).
Like GTM, which was proposed as a principled variant of Kohonen’s SOM,
stochastic neighbor embedding (SNE) [87] can be seen as a principled version
of Isotop. SNE follows a probabilistic approach to the task of embedding and,
like Isotop, associates a Gaussian kernel with each point to be embedded.
The set of all these kernels allows SNE to model the probability of one point
to be the neighbor of the others. This probability distribution can be mea-
sured for each point in the high-dimensional data space and, given a (random)
embedding, corresponding probability distributions can also be computed in
the low-dimensional space. The goal of SNE is then to update the embed-
ding in order to match the distributions in both high- and low-dimensional
spaces. In practice, the objective function involves a sum of Kullback-Leibler
divergences, which measure the “distances” between pairs of corresponding
distributions in their respective space. Because the minimization of the ob-
jective function is difficult and can get stuck in local minima, SNE requires
complex optimization techniques to achieve good results. A version of SNE
using graph distances instead of Euclidean ones in the data space is mentioned
in [186, 187] and compared to other NLDR methods.
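As a rough illustration of this idea (not Hinton and Roweis's exact formulation, which tunes a per-point kernel width), the following Python/NumPy sketch builds the neighborhood probabilities in both spaces with fixed-width Gaussian kernels and evaluates the sum of Kullback-Leibler divergences that SNE minimizes.

```python
import numpy as np

def neighbor_probabilities(Z, sigma=1.0):
    """Row-stochastic matrix p[i, j]: probability that point j is picked as a
    neighbor of point i, using a Gaussian kernel of fixed width sigma."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)                  # a point is not its own neighbor
    P = np.exp(-d2 / (2.0 * sigma**2))
    return P / P.sum(axis=1, keepdims=True)

def sne_cost(Y, X, sigma_data=1.0, sigma_embed=1.0, eps=1e-12):
    """Sum of Kullback-Leibler divergences between the neighborhood distributions
    measured in the data space (Y) and in the embedding space (X)."""
    P = neighbor_probabilities(Y, sigma_data)
    Q = neighbor_probabilities(X, sigma_embed)
    return np.sum(P * np.log((P + eps) / (Q + eps)))
```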
6 Method comparisons
Overview. This chapter illustrates all NLDR methods that are de-
scribed in the previous two chapters with both toy examples and real
data. It also aims to compare the results of the different methods in
order to shed some light on their respective strengths and weaknesses.
The Swiss roll is an illustrative manifold that has been used in [179, 180]
as an example to demonstrate the capabilities of Isomap. Briefly put, it is a
spiral with a third dimension, as shown in Fig. 6.1. The name of the manifold
originates from a delicious kind of cake made in Switzerland: jam is spread on
a one-centimeter-thick layer of airy pastry, which is then rolled up on itself.
With some imagination, the manifold in Fig. 6.1 can then be interpreted as a
very thick slice of Swiss roll, where only the jam is visible.
The parametric equations that generate the Swiss roll are
\[ y = \begin{bmatrix} \sqrt{2 + 2 x_1}\, \cos\!\big( 2 \pi \sqrt{2 + 2 x_1} \big) \\ \sqrt{2 + 2 x_1}\, \sin\!\big( 2 \pi \sqrt{2 + 2 x_1} \big) \\ 2 x_2 \end{bmatrix} , \qquad (6.1) \]
where x1 and x2 are the two latent variables that parameterize the manifold.
[Fig. 6.1: three-dimensional view of the "Swiss roll" manifold (axes y1, y2, y3).]
As can be easily seen, each coordinate of the manifold
depends on a single latent variable. Hence, the Swiss roll is a developable
two-manifold embedded in R3 .
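For reference, the Swiss roll of Eq. (6.1) can be sampled with the following Python/NumPy sketch; the latent variables are drawn uniformly in [-1, 1], which is an assumed range, not one taken from the text.

```python
import numpy as np

def swiss_roll(n=5000, rng=np.random.default_rng(0)):
    """Sample n points from the Swiss roll of Eq. (6.1).

    The latent variables x1, x2 are drawn uniformly in [-1, 1] (assumed range).
    Returns the n-by-2 latent coordinates and the n-by-3 data points."""
    x1 = rng.uniform(-1.0, 1.0, n)
    x2 = rng.uniform(-1.0, 1.0, n)
    r = np.sqrt(2.0 + 2.0 * x1)                  # radius, monotonic in x1
    Y = np.column_stack([r * np.cos(2.0 * np.pi * r),
                         r * np.sin(2.0 * np.pi * r),
                         2.0 * x2])
    return np.column_stack([x1, x2]), Y
```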
The Swiss roll is the ideal manifold to demonstrate the benefits of using
graph distances. Because it is developable, all methods using graph distances
can easily unfold it and reduce its dimensionality to two. On the contrary,
methods working with Euclidean distances embed it with difficulty, because
it is heavily crumpled on itself. For example, it has been impossible to get con-
vincing results with NLM or CCA. Figure 6.2 shows the best CCA embedding
obtained using Euclidean distances. This result has required careful tuning
of the method parameters. In light of this result, Euclidean distance-preserving
methods (MDS, NLM, CCA) are discarded in the experiments ahead.
Concretely, 5000 points or observations of y are made available in a data
set. These points are generated according to Eq. 6.1. The corresponding points
in the latent space are drawn randomly; they are not placed on a regular grid,
as was the case for the Swiss roll described in Section 1.5 to illustrate the
methods described in Chapters 4 and 5. As the number of points is relatively
high, a subset of fewer than 1000 points is already representative of the man-
ifold and allows the computation time to be dramatically decreased. As all
methods work with N 2 distances or an N -by-N Gram-like matrix, they work
25 times faster with 1000 points than with 5000. In [180] the authors suggest
choosing the subset of points randomly among the available ones. Figure 6.3
shows the results of Isomap, GNLM, and CDA with a random subset of 800
points. The graph distances are computed in the same way for the three methods,
i.e., with the K-rule and K = 5. Other parameters are left to their default
“all-purpose” values.
Fig. 6.2. Two-dimensional embedding of the “Swiss roll” manifold by CCA, using
1800 points.
As expected, all three methods succeed in unfolding the Swiss roll. More
importantly, however, the random subset used for the embeddings is not really
representative of the initial manifold. According to Eq. (6.1), the distribution
of data points is uniform on the manifold. At first sight, points seem indeed
to be more or less equally distributed in all regions of the embeddings. Nev-
ertheless, a careful inspection reveals the presence of holes and bubbles in the
distribution (Brand speaks of manifold “foaming” [30]). It looks like jam in
the Swiss roll has been replaced with a slice of ... Swiss cheese [117, 120]!
This phenomenon is partly due to the fact that the effective data set is re-
sampled, i.e., drawn from a larger but finite-size set of points. As can be seen,
Isomap amplifies the Swiss-cheese effect: holes are larger than for the two
other methods. The embeddings found by GNLM and CDA look better.
Instead of performing a random subset selection, vector quantization (see
Appendix D) could be applied to the Swiss roll data set. If the 800 randomly
chosen points are replaced with only 600 prototypes, obtained with a simple
competitive learning procedure, results shown in Fig. 6.4 can be obtained.
As can be seen, the embeddings are much more visually pleasing. Beyond
the visual feeling, the prototypes also represent better the initial manifold
[Fig. 6.3: two-dimensional embeddings of the Swiss roll (800 randomly chosen points) found by Isomap, GNLM, and CDA.]
than randomly chosen points. This results, for example, in embeddings with
neater corners and almost perfectly rectangular shapes, thus reproducing more
faithfully the latent space. Such results definitely plead in favor of the use of
vector quantization when the size of the data set can be or has to be reduced.
Accordingly, vector quantization is always used in the subsequent experiments
unless otherwise specified.
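A simple competitive learning quantizer of the kind referred to here can be sketched as follows in Python/NumPy; the initialization and the learning-rate schedule are arbitrary choices made for the sake of the example.

```python
import numpy as np

def competitive_learning_vq(Y, n_prototypes=600, n_epochs=20,
                            lr0=0.5, rng=np.random.default_rng(0)):
    """Winner-take-all vector quantization of the N-by-D data set Y."""
    C = Y[rng.choice(len(Y), n_prototypes, replace=False)].copy()  # initialize on data points
    for epoch in range(n_epochs):
        lr = lr0 * (1.0 - epoch / n_epochs)          # linearly decreasing learning rate
        for y in Y[rng.permutation(len(Y))]:
            winner = np.argmin(((C - y) ** 2).sum(axis=1))
            C[winner] += lr * (y - C[winner])        # move only the winning prototype
    return C
```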
Figure 6.5 shows the embeddings obtained by topology-preserving methods
(an SOM, GTM, LLE, LE, and Isotop) with the same set of prototypes, except
for the SOM, since it is a vector quantization method by nature. By doing so,
it is hoped that the comparison between all methods is as fair as possible. The
SOM (15-by-40 map) and GTM (40-by-40 latent grid, 10-by-10 kernels, 200 EM
iterations) suffer from jumps and shortcuts joining the successive whorls. LLE
(K = 5, Δ = 0.02) unfolds the Swiss roll and perfectly preserves the topology.
As often happens with LLE, the embedding has a triangular or cuneiform
shape (see [30] for a technical explanation). Finally, Isotop yields a visually
pleasant result: the topology is perfectly preserved and even the rectangular
shape of the Swiss roll remains visible.
Some flaws of spectral methods like Isomap, SDE, LLE, and LE can be ex-
plained by inspecting 3D embeddings instead of 2D ones, such as those shown
in Fig. 6.6. Such 3D embeddings can be obtained easily starting from the
2D one, knowing that spectral methods can build embeddings incrementally
just by taking into account an additional eigenvector of the Gram-like matrix.
While the embeddings of SDE appear almost perfect (the third dimension is
negligible), the result of Isomap shows that the manifold is not totally made
flat. The “thickness” of the third dimension can be related to the importance
of discrepancies between graph distances and true geodesic ones. These dis-
crepancies can also explain the Swiss-cheese effect: long graph distances follow
a zigzagging path and are thus longer than the corresponding geodesic dis-
tances. On the contrary, short graph distances, involving a single graph edge,
can be slightly shorter than the true geodesic length, which can be curved.
For spectral topology-preserving methods, the picture is even worse in 3D.
As can be seen, the LLE embedding looks twisted; this can explain why LLE
often yields a triangular or cuneiform embedding although the latent space is
known to be square or rectangular. So LLE unrolls the manifold but intro-
duces other distortions that make it not perfectly flat. The LE embedding is
the worst: this method does not really succeed in unrolling the manifold.
Obviously, until here, all methods have worked in nice conditions, namely
with an easy-to-embed developable manifold and with standard values for
their parameters. Yet even in this ideal setting, the dimensionality reduction
may be not so easy for some methods.
[Fig. 6.5: two-dimensional embeddings of the Swiss roll (600 prototypes) found by topology-preserving methods: SOM, GTM, LLE, Isotop, and Laplacian eigenmaps.]
[Fig. 6.6: three-dimensional embeddings of the Swiss roll found by the spectral methods (Isomap, SDE, LLE, LE).]
Fig. 6.7. On the left: latent space for the “Japanese flag” manifold. On the right:
the fact that geodesic (or graph) distances (blue curve) are replaced with Euclidean
ones (red line) in the embedding space explains the stretched-hole phenomenon that
appears when the Japanese flag is embedded. The phenomenon is visible in Fig. 6.8.
Fig. 6.8. Two-dimensional embeddings of the rolled Japanese flag obtained with
distance-preserving NLDR methods. All methods embed 600 prototypes obtained
beforehand with vector quantization.
Fig. 6.9. Two-dimensional embeddings of the rolled Japanese flag obtained with
topology-preserving NLDR methods. All methods embed 600 prototypes obtained
beforehand with vector quantization. The 5000 points are given as such for the SOM.
Moreover, in the case of the thin Swiss roll, these badly approximated
geodesics are more or less oriented in the same way, along the main axis
of the rectangular manifold. Isomap tries to find a global tradeoff between
overestimated long distances and well-approximated short ones by stretching
vertically the manifold. GNLM and CDA act differently: as they favor the
preservation of small distances, they cannot stretch the manifold; instead,
they twist it.
Topology-preserving methods provide poor results, except GTM, which
embeds the thin Swiss roll more successfully than it does the normal one.
Still, the embedding is twisted and oddly stretched; this is because the Swiss
roll must be cast inside the square latent space assumed by GTM. Changing
the shape of the latent space did not lead to a better embedding. Clearly,
the SOM does not yield the expected result: the color does not vary smoothly
along the rectangular lattice as it does along the Swiss roll. As for the standard
Swiss roll, the SOM lattice “jumps” between the successive whorls of the man-
ifold, as illustrated in Fig. 6.13. Actually, the SOM tries to occupy the same
regions as the thin Swiss roll, but those jumps break the perfect preservation
Fig. 6.13. Three-dimensional view showing how an 8-by-75 SOM typically unfurls
in a thin slice of a Swiss roll.
Fig. 6.14. After vector quantization, the K-rule weaves a graph that connects the
600 prototypes. Because K equals 8 instead of 5, undesired or “parasitic” edges
appear in the graph. More precisely, a link connects the outer corner of the Swiss
roll to a point lying in another whorl.
Such a parasitic edge can completely mislead the NLDR methods, since they take it into account
just as they would any other normal edge. For instance, such an edge can
jeopardize the approximation of the geodesic distances. In that context, it can
be compared to a shortcut in an electrical circuit: nothing works as desired.
With K = 8, NLDR methods embed the Swiss roll, as shown in Figs. 6.15
and 6.16. As can be seen, the parasitic link completely misleads Isomap and
GNLM. Obviously, CDA yields a result very similar to the one displayed in
Fig. 6.4, i.e., in the ideal case, without undesired links. The good performance
of CDA is due to the parameterized weighting function Fλ , which allows CDA
to tear some parts of the manifold. In this case, the parasitic link has been
torn. However, this nice result can require some tuning of the neighborhood
proportion π in CDA. Depending on the parameter value, the tear does not
always occur at the right place. A slight modification in the parameter setting
may cause small imperfections: for example, the point on the corner may be
torn off and pulled toward the other end of the parasitic link.
Results provided by SDE vary largely with respect to the parameter val-
ues. Specifically, imposing a strict preservation of local distances or allowing
them to shrink leads to completely different embeddings. With strict equal-
ities required, the presence of the parasitic edge prevents the semidefinite
programming procedure included in SDE from unrolling the manifold. Hence, the
result looks like a slightly deformed PCA projection. With inequalities, SDE
nearly succeeds in unfolding the Swiss roll; unfortunately, some regions of the
manifold are superposed.
[Fig. 6.15: two-dimensional embeddings of the Swiss roll with a parasitic edge (K = 8), found by the distance-preserving methods (Isomap, GNLM, CDA, SDE).]
[Fig. 6.16: the corresponding embeddings found by the topology-preserving methods (SOM, GTM, LLE, Isotop, Laplacian eigenmaps).]
As before, 5000 data points are generated, but in this case at least 800 prototypes are needed, instead of 600,
because consecutive whorls are closer to each other. The K-rule is used with
K = 5. The results of NLDR methods are shown in Figs. 6.18 and 6.19. As
expected, Isomap performs poorly with this nondevelopable manifold: points
in the right part of the embedding are congregated. GNLM yields a better
result than Isomap; unfortunately, the embedding is twisted. CDA does even
better but needs some parameter tuning in order to balance the respective
influences of short and long distances during the convergence. The neighbor-
hood proportion must be set to a high value for CDA to behave as GNLM.
More precisely, the standard schedule (hyperbolic decrease between 0.75 and
0.05) is replaced with a slower one (between 0.75 and 0.50) in order to avoid
undesired tears. Results of SDE also depend on the parameter setting. With
strict preservation of the distances, the embedding is twisted, whereas allowing
the distances to shrink produces an eye-shaped embedding.
Topology-preserving methods are expected to behave better with this non-
developable manifold, since no constraint is imposed on distances. An SOM
and GTM do not succeed in unfolding the heated Swiss roll. Spectral methods
like LLE and LE fail, too. Only Isotop yields a nice embedding.
In the case of spectral methods, some flaws in the embeddings can be
explained by looking at what happens in a third dimension, as was done for
the standard Swiss roll. As spectral methods build embeddings incrementally,
a third dimension is obtained by keeping an additional eigenvector of the
Gram-like matrix. Figure 6.20 shows those three-dimensional embeddings. As
can be seen, most embeddings are far from resembling the genuine latent
space, namely a flat rectangle. Except for SDE with inequalities, the “span”
Fig. 6.18. Two-dimensional embeddings of the “heated” Swiss roll manifold com-
puted by distance-preserving methods. All methods embed 800 prototypes obtained
beforehand with vector quantization.
(or the variance) along the third dimension is approximately equal to the
one along the second dimension. In the case of Isomap, the manifold remains
folded lengthways. SDE with equalities leads to a twisted manifold, as does
LLE. LE produces a helical embedding.
The bottom line of the above experiments consists of two conclusions. First,
spectral methods offer nice theoretical properties (exact optimization, sim-
plicity, possibility to build embeddings in an incremental way, etc.). However,
they do not prove to be very robust against departures from their underly-
ing model. For instance, Isomap does not produce satisfying embeddings for
non-developable manifolds, which are not uncommon in real-life data. Sec-
ond, iterative methods based on gradient descent, for example, can deal with
more complex objective functions, whereas spectral methods are restricted to
functions having “nice” algebraic properties. This makes the former methods
more widely applicable. The price to pay, of course, is a heavier computational
load and a larger number of parameters to be adjusted by the user.
Fig. 6.19. Two-dimensional embeddings of the “heated” Swiss roll manifold com-
puted by topology-preserving methods. All methods embed 800 prototypes obtained
beforehand with vector quantization. The 5000 points are given as such for the SOM.
[Fig. 6.20: three-dimensional embeddings of the heated Swiss roll found by the spectral methods (Isomap, SDE with equalities and inequalities, LLE, LE).]
A poor approximation of the geodesic distances can be due to a data set that is too small or too noisy, or to an inappropriate parameter value.
In the worst case, the graph used to compute the graph distances may fail to
represent correctly the underlying manifold. This often comes from a wrong
parameter value in the rule used to build the graph. Therefore, it is not pru-
dent to rely solely on the graph distance when designing an NLDR method.
Other techniques should be integrated in order to compensate for the flaws
of the graph distance and make the method more flexible, i.e., more tolerant
to data that do not fulfill all theoretical requirements. This is exactly what
has been done in GNLM, CDA and other iterative methods. They use all-
purpose optimization techniques that allow more freedom in the definition of
the objective function. This makes them more robust and more versatile. As
a counterpart, these methods are less elegant from the theoretical viewpoint
and need some parameter tuning in order to fully exploit their capabilities.
Regarding topology preservation, methods using a data-driven lattice
(LLE and Isotop) clearly outperform those relying on a predefined lattice. The
advantage of the former methods lies in their ability to extract more informa-
tion from data (essentially, the neighborhoods in the data manifold). Another
explanation is that LLE and Isotop work in the low-dimensional embedding
space, whereas an SOM and GTM iterate in the high-dimensional data space.
In other words, LLE and Isotop attempt to embed the graph associated with
the manifold, whereas an SOM and GTM try to deform and fit a lattice in
the data space. The second solution offers too much freedom: because the
lattice has more “Lebensraum” at its disposal in the high-dimensional space,
it can jump from one part of a folded manifold to another. Working in the
embedding space like LLE and Isotop do is more constraining but avoids these
shortcuts and discontinuities.
This subsection briefly describes how CCA/CDA can tear manifolds with es-
sential loops (circles, knots, cylinders, tori, etc.) or essential spheres (spheres, ellipsoids,
etc.). Three examples are given:
• The trefoil knot (Fig. 6.21) is a compact 1-manifold embedded in a three-
dimensional space. The parametric equations are
\[ y = \begin{bmatrix} 41 \cos x - 18 \sin x - 83 \cos(2x) - 83 \sin(2x) - 11 \cos(3x) + 27 \sin(3x) \\ 36 \cos x + 27 \sin x - 113 \cos(2x) + 30 \sin(2x) + 11 \cos(3x) - 27 \sin(3x) \\ 45 \sin x - 30 \cos(2x) + 113 \sin(2x) - 11 \cos(3x) + 27 \sin(3x) \end{bmatrix} , \]
where 0 ≤ x < 2π. The data set consists of 300 prototypes, which are
obtained by vector quantization on 20,000 points randomly drawn in the
knot. Neighboring prototypes are connected using the K-rule with K = 2.
[Fig. 6.21: three-dimensional view of the trefoil knot.]
• The sphere (Fig. 6.22) is a compact two-manifold embedded in a three-dimensional space; its parametric equations are
\[ y = \begin{bmatrix} \cos(x_1) \cos(x_2) \\ \sin(x_1) \cos(x_2) \\ \sin(x_2) \end{bmatrix} , \]
where x1 and x2 play the role of longitude and latitude angles.
• The torus (Fig. 6.23), a compact two-manifold with essential loops, is the third example.
[Figs. 6.22 and 6.23: three-dimensional views of the sphere and of the torus.]
For these three manifolds, it is interesting to see whether or not the graph
distance helps the dimensionality reduction, knowing that the trefoil knot is
the only developable manifold. Another question is: does the use of the graph
distance change the way the manifolds are torn?
In the case of the trefoil knot, CCA and CDA can reduce the dimension-
ality from three to one, as illustrated in Fig. 6.24. In both one-dimensional
Fig. 6.24. One-dimensional embeddings of the trefoil knot by CCA and CDA.
Graph edges between prototypes are displayed with blue corners (∨ or ∧). The color
scale is copied from Fig. 6.21.
plots, the graph edges are represented by blue corners (∨ or ∧). As can be
seen, CCA tears the knot several times, although it is absolutely not needed.
In contrast, CDA succeeds in unfolding the knot with a single tear only. The
behavior of CCA and CDA is also different when the dimensionality is not re-
duced to its smallest possible value. Figure 6.25 illustrates the results of both
methods when the knot is embedded in a two-dimensional space. This figure
clearly explains why CDA yields a different embedding. Because the knot is
highly folded and because CCA must preserve Euclidean distances, the em-
bedding CCA computes attempts to reproduce the global shape of the knot
(the three loops are still visible in the embedding). On the contrary, CDA
gets no information about the shape of the knot since the graph distance
is measured along the knot. As a result, the knot is topologically equiva-
lent to a circle, which precisely corresponds to the embedding computed by
CDA. Hence, in the case of the trefoil knot, the graph distance allows one to
avoid unnecessary tears. This is confirmed by looking at Fig. 6.26, which shows
the corresponding two-dimensional embeddings computed by NLM and GNLM.
Fig. 6.25. Two-dimensional embeddings of the trefoil knot by CCA and CDA. The
color scale is copied from Fig. 6.21.
Fig. 6.26. Two-dimensional embeddings of the trefoil knot by NLM and GNLM.
The color scale is copied from Fig. 6.21.
For the sphere and the torus, the graph distance seems to play a less im-
portant role, as illustrated by Figs. 6.27 and 6.28. Euclidean as well as graph
distances cannot be perfectly preserved for these nondevelopable manifolds.
The embeddings computed by both methods are similar. The sphere remains
Fig. 6.27. Two-dimensional embeddings of the sphere by CCA and CDA. Colors
remain the same as in Fig. 6.22.
Fig. 6.28. Two-dimensional embeddings of the torus by CCA and CDA. The color
scale is copied from Fig. 6.23.
rather easy to embed, but the torus requires some tuning of the parameters
for both methods.
As already mentioned, CCA and CDA are the only methods having the
intrinsic capabilities to tear manifolds. An SOM can also break some neigh-
borhoods, when opposite edges of the map join each other on the manifold
in the data space, for instance. It is noteworthy, however, that most recent
NLDR methods working with local neighborhoods or a graph can be extended
in order to tear manifolds with essential loops. This can be done by breaking
some neighborhoods or edges in the graph before embedding the latter. A
complete algorithm is described in [118, 121]. Unfortunately, this technique
does not work for P -manifolds with essential spheres when P > 1 (loops on a
sphere are always contractible and thus never essential).
Fig. 6.30. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by Sammon’s NLM.
Fig. 6.31. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by CCA.
Fig. 6.32. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by GNLM.
Fig. 6.33. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by CDA.
of Isomap, it must also be remarked that the cortical surface is clearly not a developable manifold.
As can be seen, the embeddings computed by methods preserving Euclidean distances and by the corresponding ones using graph distances are not the same. Although the manifold is not developable, the graph distance greatly helps to unfold it. This is confirmed by looking at the final value of Sammon's stress: using the Euclidean distance, NLM converges to \(E_{\mathrm{NLM}} = 0.0162\), whereas GNLM reaches a much lower and better value, \(E_{\mathrm{GNLM}} = 0.0038\).
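For reference, Sammon's stress can be computed in a few lines; the sketch below (not the authors' code) takes the pairwise distances measured in the data space, Euclidean for NLM or graph distances for GNLM, together with the low-dimensional embedding.

import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(d_data, X):
    # d_data: condensed vector of pairwise distances in the data space
    # (assumed strictly positive); X: low-dimensional embedding, one row per point.
    d_emb = pdist(X)  # pairwise Euclidean distances in the embedding
    return np.sum((d_data - d_emb) ** 2 / d_data) / np.sum(d_data)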
Leaving aside the medical applications of these cortex maps, it is notewor-
thy that dimensionality reduction was also used to obtain a colored surface-like
representation of the cortex data. The cortical surface is indeed not available
directly as a surface, but rather as a set of points sampled from this surface.
However, the only way to render a surface in a three-dimensional view consists of approximating it with a set of connected triangles. Unfortunately, most usual triangulation techniques that can convert a set of points into triangles work in two dimensions only. (The graph-building rules mentioned in Section 4.3 and Appendix E are not able to provide an exact triangulation.) Consequently, dimensionality reduction provides an elegant solution to this problem: the points are embedded in a two-dimensional space, the triangulation is achieved using a standard technique in 2D, such as the Delaunay triangulation, and the obtained triangles are
finally sent back to the three-dimensional space. This allows us to display the
three-dimensional views of the cortical surface in Fig. 6.29.
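A hedged sketch of this trick, with the NLDR step abstracted into a generic embed_2d function (an assumed placeholder, not a specific method), could be:

import numpy as np
from scipy.spatial import Delaunay

def surface_triangles(points_3d, embed_2d):
    # points_3d: (N, 3) array sampled from the surface;
    # embed_2d: any dimensionality reduction routine returning an (N, 2) array.
    points_2d = embed_2d(points_3d)
    tri = Delaunay(points_2d)  # standard 2D Delaunay triangulation
    # tri.simplices contains triplets of point indices; since the indices are the
    # same in both spaces, the triangles can be rendered on the 3D coordinates.
    return tri.simplices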
How do topology-preserving methods compare to them? Only results for an SOM (40-by-40 map), GTM (20-by-20 latent grid, 3 × 3 kernels, 100 EM iterations), and Isotop (without vector quantization, ε-rule, ε = 1.1) are given.
Due to its huge memory requirements, LLE failed to embed the 4961 points de-
scribing the cortical surface. The result of the SOM is illustrated by Fig. 6.34.
The embedding computed by GTM is shown in Fig. 6.35. Figure 6.36 displays
Isotop's result. Again, it must be stressed that methods using a predefined lattice, even if they manage to preserve the topology, often distort the shape
of data. This is particularly true for the SOM and questionable for GTM
(the latent space can be divided into several regions separated by empty or
sparsely populated frontiers). Isotop yields the most satisfying result; indeed,
the topology is perfectly preserved and, in addition, so is the shape of data:
the embedding looks very close to the results of distance-preserving methods
(see, e.g., Fig. 6.33). The main difference between Figs. 6.33 and 6.36 lies in
the axes: for Isotop, the size of the embedding is not related to the pairwise
distances measured in the data set.
Fig. 6.34. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by an SOM. The first plot shows how the SOM unfurls in the three-
dimensional space, whereas the second represents the two-dimensional SOM lattice.
Fig. 6.35. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by GTM.
Fig. 6.36. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by Isotop.
A set of 698 face images is proposed in [180]. The images represent an arti-
ficially generated face rendered with different poses and lighting directions.
Figure 6.37 shows several faces drawn at random in the set. Each image con-
Fig. 6.37. Several face pictures drawn at random from the set of 698 images pro-
posed in [180].
Fig. 6.38. Three-dimensional embedding of the 698 face images computed by metric
MDS. The embedding is sliced into six layers, which in turn are divided into 6 by
6 cells. Each cell is represented by displaying the image corresponding to one of the
points it contains. See text for details.
Fig. 6.39. Three-dimensional embedding of the 698 face images computed by Sam-
mon’s NLM. The embedding is sliced into six layers, which in turn are divided into
6 by 6 cells. Each cell is represented by displaying the image corresponding to one
of the points it contains. See text for details.
Fig. 6.40. Three-dimensional embedding of the 698 face images computed by CCA.
The embedding is sliced into six layers, which in turn are divided into 6 by 6 cells.
Each cell is represented by displaying the image corresponding to one of the points
it contains. See text for details.
poses vary smoothly across the layers. The lighting direction can be perceived
too: the light source is on the left (resp., right) of the face in the first (resp.,
last) layers. Can this outstanding performance be explained by the use of the
graph distance?
When looking at Fig. 6.42, the answer seems to be yes, since GNLM per-
forms much better than NLM, and does as well as Isomap. The final confir-
mation comes with the good result of CDA (Fig. 6.43). All layers are quite
densely populated, including the first and last ones. As for Isomap, the head
smoothly moves from one picture to another. The changes of lighting direction
are clearly visible, too.
In addition, two versions of SDE (with equality constraints or with local
distances allowed to shrink) work very well. Layers are more densely populated
with strict equality constraints, however.
Results of topology-preserving methods are given in Figs. 6.46–6.48. LLE
provides a disappointing embedding, though several values for its parameters
were tried. Layers are sparsely populated, and transitions between pictures
are not so smooth. The result of LE is better, but still far from most distance-
preserving methods. Finally, Isotop succeeds in providing a good embedding,
though some discontinuities can be observed, for example, in layer 2.
In order to verify the visual impression left by Figs. 6.38–6.48, a quantita-
tive criterion can be used to assess the embeddings computed by the NLDR
methods. Based on ideas developed in [9, 74, 10, 190, 106, 103], a simple criterion can be defined using the proximity rank. The latter can be denoted as
the function r = rank(X, i, j) and is computed as follows:
Fig. 6.43. Three-dimensional embedding of the 698 face images computed by CDA.
The embedding is sliced into six layers, which in turn are divided into 6 by 6 cells.
Each cell is represented by displaying the image corresponding to one of the points
it contains. See text for details.
Fig. 6.44. Three-dimensional embedding of the 698 face images computed by SDE
(with equality constraints). The embedding is sliced into six layers, which in turn
are divided into 6 by 6 cells. Each cell is represented by displaying the image corre-
sponding to one of the points it contains. See text for details.
Fig. 6.45. Three-dimensional embedding of the 698 face images computed by SDE
(with inequality constraints). The embedding is sliced into six layers, which in turn
are divided into 6 by 6 cells. Each cell is represented by displaying the image corre-
sponding to one of the points it contains. See text for details.
Fig. 6.46. Three-dimensional embedding of the 698 face images computed by LLE.
The embedding is sliced into six layers, which in turn are divided into 6 by 6 cells.
Each cell is represented by displaying the image corresponding to one of the points
it contains. See text for details.
Fig. 6.47. Three-dimensional embedding of the 698 face images computed by LE.
The embedding is sliced into six layers, which in turn are divided into 6 by 6 cells.
Each cell is represented by displaying the image corresponding to one of the points
it contains. See text for details.
Fig. 6.48. Three-dimensional embedding of the 698 face images computed by Isotop.
The embedding is sliced into six layers, which in turn are divided into 6 by 6 cells.
Each cell is represented by displaying the image corresponding to one of the points
it contains. See text for details.
• Using the vector set X and taking the ith vector as reference, compute all Euclidean distances \(\|\mathbf{x}(k) - \mathbf{x}(i)\|\), for 1 ≤ k ≤ N.
• Sort the obtained distances in ascending order, and let output r be the rank of x(j) according to the sorted distances.
In the same way as in [185, 186], this allows us to write two different measures, called mean relative rank errors:
\[
\mathrm{MRRE}_{Y \to X}(K) = \frac{1}{C} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_K(\mathbf{y}(i))} \frac{|\operatorname{rank}(X,i,j) - \operatorname{rank}(Y,i,j)|}{\operatorname{rank}(Y,i,j)} \tag{6.3}
\]
\[
\mathrm{MRRE}_{X \to Y}(K) = \frac{1}{C} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_K(\mathbf{x}(i))} \frac{|\operatorname{rank}(X,i,j) - \operatorname{rank}(Y,i,j)|}{\operatorname{rank}(X,i,j)} , \tag{6.4}
\]
where \(\mathcal{N}_K(\cdot)\) denotes the K-ary neighborhood of its argument and C is a normalization constant.
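As an illustration, the two errors could be computed as in the following sketch (not the authors' code); the normalization constant C, whose exact definition is not reproduced here, is simply set to NK in this example, which is an assumption.

import numpy as np
from scipy.spatial.distance import cdist

def rank_matrix(V):
    # R[i, j] = rank of v(j) among the points sorted by distance from v(i);
    # the point itself gets rank 0.
    D = cdist(V, V)
    order = np.argsort(D, axis=1)
    R = np.empty_like(order)
    R[np.arange(len(V))[:, None], order] = np.arange(len(V))
    return R

def mrre(X, Y, K):
    # X: data-space coordinates, Y: embedding coordinates (one row per point).
    RX, RY = rank_matrix(X), rank_matrix(Y)
    N = len(X)
    C = N * K  # assumed normalization constant
    err_yx = err_xy = 0.0
    for i in range(N):
        nbr_Y = np.argsort(RY[i])[1:K + 1]  # K nearest neighbors in the embedding
        nbr_X = np.argsort(RX[i])[1:K + 1]  # K nearest neighbors in the data space
        err_yx += np.sum(np.abs(RX[i, nbr_Y] - RY[i, nbr_Y]) / RY[i, nbr_Y])
        err_xy += np.sum(np.abs(RX[i, nbr_X] - RY[i, nbr_X]) / RX[i, nbr_X])
    return err_yx / C, err_xy / C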
Fig. 6.49. Mean relative rank errors (MRREY→X (K) on the left and
MRREX→Y (K) on the right) of all NLDR methods for the artificial faces.
A similar problem is now studied, but with real images. As a direct consequence, the intrinsic
dimensionality of the images is not known in advance. Instead of computing it
with the methods described in Chapter 3, it is proposed to embed the data set
in a two-dimensional plane by means of NLDR methods. In other words, even
if the intrinsic dimensionality is probably higher than two, a two-dimensional
representation is “forced”.
In practice, the data set [158] comprises 1965 grayscale images;
they are 28 pixels high and 20 pixels wide. Figure 6.50 shows a randomly
drawn subset of the images. The data are rearranged as a set of 1965 560-
dimensional vectors, which are given as they are, without any preprocessing.
(In the previous section, images were larger and were thus preprocessed with
PCA.) Six distance-preserving methods are used: metric MDS, Isomap, NLM,
GNLM, CCA, and CDA. SDE could not be used due to its huge computa-
tional requirements. Four topology-preserving methods are compared too: a
44-by-44 SOM, LLE, LE, and Isotop. GTM is discarded because of the high
dimensionality of the data space. All methods involving K-ary neighborhoods
Fig. 6.50. Some faces randomly drawn from the set of real faces available on the
LLE website.
Fig. 6.51. Two-dimensional embedding of the 1965 real faces by metric MDS.
The embedding resulting from Isomap, that is, metric MDS with graph distances,
is shown in Fig. 6.52. More details appear, and fewer cells are blank, but
a Swiss-cheese effect appears: holes in the graph look stretched. Moreover,
images corresponding to some regions of the embeddings look blurred or fuzzy.
This is due to the fact that regions in between or on the border of the holes
are shrunk: too many different images are concentrated in a single cell and
are averaged together.
The embedding provided by NLM (Fig. 6.53) confirms that the data set is
roughly composed of two dense, weakly connected clusters. No Swiss-cheese
effect can be observed. Unfortunately, there are still blurred regions and some
discontinuities are visible in the embedding. The use of graph distances in
Sammon’s nonlinear mapping (Fig. 6.54) leads to a sparser embedding. Nev-
ertheless, many different facial expressions appear. Holes in the graph are not
stretched as they are for Isomap.
CCA provides a dense embedding, as shown in Fig. 6.55. Few discontinu-
ities can be found. Replacing the Euclidean distance with the graph distance
leads to a sparser embedding, just like for NLM and GNLM. This time, sev-
eral smaller clusters can be distinguished. Due to sparsity, some parts of the
main cluster on the right are somewhat blurred.
Figure 6.57 shows the embedding computed by LLE. As usual, LLE yields a
cuneiform embedding. Although the way to display the embedding is different
in [158], the global shape looks very similar. For the considered data set, the
assumption of an underlying manifold does not really hold: data points are
distributed in different clusters, which look stretched in LLE embedding. As
for other embeddings, similar faces are quite well grouped, but the triangular
shape of the embedding is probably not related to the true shape of the data
cloud. Just like other methods relying on a graph or K-ary neighborhoods,
LLE produces a very sparse embedding.
initial shape of the data cloud. Although the SOM yields a visually satisfying
embedding and reveals many details, some shortcomings must be remarked.
First, similar faces may be distributed in two different places (different re-
gions of the map can be folded close to each other in the data space). Second,
the SOM is the only method that involves a mandatory vector quantization.
Consequently, the points that are displayed as thumbnails are not in the data
set: they are points (or prototypes) of the SOM grid.
Fig. 6.60. Two-dimensional embedding of the 1965 real faces by a 44-by-44 SOM.
Until now, only the visual aspect of the embeddings has been assessed. In
the same way as in Subsection 6.3.1, the mean relative rank errors (Eqs. (6.3)
and (6.4)) can be computed in order to have a quantitative measure of the
neighborhood preservation. The values reached by all reviewed methods are
given in Fig. 6.61 for a number of neighbors ranging between 0 and 15. Metric
MDS clearly performs the worst; this is also the only linear method. Per-
formances of LLE are not really good either. The best results are achieved
by Isotop and by methods working with graph distances (Isomap, GNLM,
and CDA). GNLM reaches the best tradeoff between MRREY→X (K) and
MRREX→Y (K), followed by Isotop. On the other hand, CDA performs very
well when looking at MRREX→Y (K) only. Finally, the SOM cannot be com-
pared directly to the other methods, since it is the only method involving a
predefined lattice and a mandatory vector quantization (errors are computed
on the prototypes coordinates). The first error MRREX→Y (K) starts at low
values but grows much faster than for other methods when K increases. This
can be explained by the fact that the SOM can be folded on itself in the data
space. On the other hand, MRREY→X (K) remains low because neighbors in
the predefined lattice are usually close in the data space, too.
Fig. 6.61. Mean relative rank errors (MRREY→X (K) on the left and
MRREX→Y (K) on the right) of all NLDR methods for the real faces.
7 Conclusions
cases. Other examples are the Euclidean norm, which is nearly useless in high-
dimensional spaces, and the intrinsic sparsity of high-dimensional spaces (the
“empty space phenomenon”). All those issues are usually called the “curse
of dimensionality” and must be taken into account when processing high-
dimensional data.
Historically, one of the first methods intended for the analysis of high-
dimensional data was principal component analysis (PCA), introduced in
Chapter 2. Starting from a data set in matrix form, and under some con-
ditions, this method is able to perform three essential tasks:
• Intrinsic dimensionality estimation. This consists in estimating the
(small) number of hidden parameters, called latent variables, that gener-
ated data.
• Dimensionality reduction. This consists in building a low-dimensional
representation of data (a projection), according to the estimated dimen-
sionality.
• Latent variable separation. This consists of a further transformation of
the low-dimensional representation, such that the latent variables appear
as mutually “independent” as possible.
Obviously, these are very desirable functionalities. Unfortunately, PCA re-
mains a rather basic method and suffers from many shortcomings. For ex-
ample, PCA assumes that observed variables are linear combinations of the
latent ones. According to this data model, PCA just yields a linear projec-
tion of the observed variables. Additionally, the latent variable separation is
achieved by simple decorrelation, explaining the quotes around the adjective
“independent” in the above list.
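As a brief illustration of these three tasks (a sketch, not the book's algorithm), a plain SVD-based PCA can be written as follows; the 95% variance threshold used to estimate the dimensionality is an arbitrary choice.

import numpy as np

def pca_summary(Y, variance_kept=0.95):
    Yc = Y - Y.mean(axis=0)  # center the observed variables
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    var = s ** 2 / (len(Y) - 1)  # variance captured by each principal axis
    ratio = np.cumsum(var) / np.sum(var)
    p = int(np.searchsorted(ratio, variance_kept)) + 1  # crude dimensionality estimate
    X = Yc @ Vt[:p].T  # dimensionality reduction: linear projection onto p axes
    X_white = X / np.sqrt(var[:p])  # decorrelated, rescaled ("independent") components
    return p, X, X_white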
For more than seven decades, the limitations of PCA have motivated the
development of more powerful methods. Mainly two directions have been ex-
plored: namely, dimensionality reduction and latent variable separation.
Much work has been devoted to designing methods that are able to reduce the
data dimensionality in a nonlinear way, instead of merely projecting data with
a linear transformation. The first step in that direction was made by refor-
mulating the PCA as a distance-preserving method. This yielded the classical
metric multidimensional scaling (MDS) in the late 1930s (see Table 7.1). Al-
though this method remains linear, like PCA, it is the basis of numerous
nonlinear variants described in Chapter 4. The most widely known ones are
undoubtedly nonmetric MDS and Sammon’s nonlinear mapping (published in
the late 1960s). Further optimizations are possible, by using stochastic tech-
niques, for example, as in curvilinear component analysis (CCA), published
in the early 1990s. Besides this evolution toward more and more complex
algorithms, recent progress has been accomplished in the family of distance-
preserving methods by replacing the usual Euclidean distance with another
metric: the geodesic distance, introduced in the late 1990s. This particular
distance measure is especially well suited for dimensionality reduction. The
unfolding of nonlinear manifolds is made much easier with geodesic distances
than with Euclidean ones.
Geodesic distances, however, cannot be used as such, because they hide a
complex mathematical machinery that would create a heavy computational
burden in practical cases. Fortunately, geodesic distances may be approxi-
mated in a very elegant way by graph distances. To this end, it suffices to
connect neighboring points in the data set, in order to obtain a graph, and
then to compute the graph distances with Dijkstra’s algorithm [53], for in-
stance.
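In practice this amounts to a few lines of code; the sketch below (an assumption, not the book's implementation) builds a K-ary neighborhood graph with scikit-learn and computes all-pairs shortest paths with SciPy's implementation of Dijkstra's algorithm.

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import dijkstra

def graph_distances(Y, K=8):
    A = kneighbors_graph(Y, n_neighbors=K, mode="distance")  # edges weighted by Euclidean length
    A = A.maximum(A.T)  # symmetrize: undirected graph
    return dijkstra(A, directed=False)  # D[i, j] approximates the geodesic distance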
A simple change allows us to use graph distances instead of Euclidean ones in classical distance-preserving methods. Doing so transforms metric MDS, Sammon's NLM, and CCA into Isomap (1998), geodesic NLM (2002), and curvilinear distance analysis (2000), respectively. Comparisons on various examples
in Chapter 6 clearly show that the graph distance outperforms the traditional
Euclidean metric. Yet, in many cases and in spite of all its advantages, the
graph distance is not the panacea: it broadens the set of manifolds that can
easily be projected by distance preservation, but it does not help in all cases.
For that reason, the algorithm that manages the preservation of distances fully
keeps its importance in the dimensionality reduction process. This explains
why the flexibility of GNLM and CDA is welcome in difficult cases where
Isomap can fail.
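As a hedged sketch of the spectral case, feeding such graph distances to classical metric MDS (double centering of the squared distances followed by an eigendecomposition) yields an Isomap-like embedding; the function below is only illustrative.

import numpy as np

def mds_from_distances(D, dim=2):
    N = len(D)
    J = np.eye(N) - np.ones((N, N)) / N  # centering matrix
    B = -0.5 * J @ (D ** 2) @ J  # Gram-like matrix obtained by double centering
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]  # keep the leading eigenvalues/eigenvectors
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# With D = graph_distances(Y) this approximates Isomap; with Euclidean distances
# it reduces to classical metric MDS.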
Distance preservation is not the sole paradigm used for dimensionality
reduction. Topology preservation, introduced in Chapter 5, is certainly more
powerful and appealing but also more difficult to implement. Actually, in order
to be usable, the concept of “topology” must be clearly defined; its translation
from theory to practice does not prove as straightforward as measuring a dis-
tance. Because of that difficulty, topology-preserving methods like Kohonen’s
self-organizing maps appeared later (in the early 1980s) than distance-based
methods. Other methods, like the generative topographic mapping (1995),
may be viewed as principled reformulations of the SOM, within a probabilis-
tic framework. More recent methods, like locally linear embedding (2000) and
Isotop (2002), attempt to overcome some limitations of the SOM.
In Chapter 5, methods are classified according to the way they model the
topology of the data set. Typically, this topology is encoded as neighborhood
relations between points, using a graph that connects the points, for instance.
The simplest solution consists of predefining those relations, without regard to the available data, as is done in an SOM or GTM. If data are taken into account, the topology is said to be data-driven, as with LLE and Isotop. While data-driven methods generally outperform SOMs for dimensionality reduction purposes, the latter remain a reference tool for 2D visualization.
ANN DR Method Author(s) & reference(s)
1901 PCA Pearson [149]
1933 PCA Hotelling [92]
1938 classical metric MDS Young & Householder [208]
1943 formal neuron McCulloch & Pitts [137]
1946 PCA Karhunen [102]
1948 PCA Loève [128]
1952 MDS Torgerson [182]
1958 Perceptron Rosenblatt [157]
1959 Shortest paths in a graph Dijkstra [53]
1962 nonmetric MDS Shepard [171]
1964 nonmetric MDS Kruskal [108]
1965 K-means (VQ) Forgy [61]
1967 K-means (VQ) MacQueen [61]
ISODATA (VQ) Ball & Hall [8]
1969 PP Kruskal [109]
NLM (nonlinear MDS) Sammon [109]
1969 Perceptron Minsky & Papert’s paper [138]
1972 PP Kruskal [110]
1973 SOM von der Malsburg [191]
1974 PP Friedman & Tukey [67]
1974 Back-propagation Werbos [201]
1980 LBG (VQ) Linde, Buzo & Gray [124]
1982 SOM (VQ & NLDR) Kohonen [104]
1982 Hopfield network Hopfield [91]
Lloyd (VQ) Lloyd [127]
1984 Principal curves Hastie & Stuetzle [79, 80]
1985 Competitive learning (VQ) Rumelhart & Zipser [162, 163]
1986 Back-propagation & MLP Rumelhart, Hinton & Williams [161, 160]
BSS/ICA Jutten [99, 98, 100]
1991 Autoassociative MLP Kramer [107, 144, 183]
1992 “Neural” PCA Oja [145]
1993 VQP (NLM) Demartines & Hérault [46]
Autoassociative ANN DeMers & Cottrell [49]
1994 Local PCA Kambhatla & Leen [101]
1995 CCA (VQP) Demartines & Hérault [47, 48]
NLM with ANN Mao & Jain [134]
1996 KPCA Schölkopf, Smola & Müller [167]
GTM Bishop, Svensén & Williams [22, 23, 24]
1997 Normalized cut (spectral clustering) Shi & Malik [172, 199]
1998 Isomap Tenenbaum [179, 180]
2000 CDA (CCA) Lee & Verleysen [116, 120]
LLE Roweis & Saul [158]
2002 Isotop (MDS) Lee [119, 114]
LE Belkin & Niyogi [12, 13]
Spectral clustering Ng, Jordan & Weiss [143]
Coordination of local linear models Roweis, Saul & Hinton [159]
2003 HLLE Donoho & Grimes [56, 55]
2004 LPP He & Niyogi [81]
SDE (MDS) Weinberger & Saul [196]
2005 LMDS (CCA) Venna & Kaski [186, 187]
2006 Autoassociative ANN Hinton & Salakhutdinov [89]
Table 7.1. Timeline of DR methods. Major steps in ANN history are given as
milestones. Spectral clustering has been added because of its tight relationship with
spectral DR methods.
Starting from PCA, the other direction that can be explored is latent variable
separation. The first step in that direction was made with projection pursuit
(PP; see Table 7.1) [109, 110, 67]. This technique, which is widely used in
exploratory data analysis, aims at finding “interesting” (linear) one- or two-
dimensional projections of a data set. Axes of these projections can then be
interpreted as latent variables. A more recent approach, initiated in the late
1980s by Jutten and Hérault [99, 98, 100], led to the flourishing development
of blind source separation (BSS) and independent component analysis (ICA).
These fields propose more recent but also more principled ways to tackle the
problem of latent variable separation. In contrast with PCA, BSS and ICA can
The aim of this first step is to make sure that all variables or signals in the
data set convey useful information about the phenomenon of interest. Hence,
if some variables or signals are zero or are related to another phenomenon, a
variable selection must be achieved beforehand, in order to discard them. To
some extent, this selection is a “binary” dimensionality reduction: each ob-
served variable is kept or thrown away. Variable selection methods are beyond
the scope of this book; this topic is covered in, e.g., [2, 96, 139].
7.2.2 Calibration
This second step aims at “standardizing” the variables. When this is required,
the average of each variable is subtracted. Variables can also be scaled if
needed. The division by the standard deviation is useful when the variables
come from various origins. For example, meters do not compare with kilo-
grams, nor kilometers with grams. Scaling the variables helps to make
them more comparable.
Sometimes, however, the standardization can make things worse. For ex-
ample, an almost-silent signal becomes pure noise after standardization. Obvi-
ously, the knowledge that it was silent is important and should not be lost. In
the ideal case, silent signals and other useless variables are eliminated by the
above-mentioned variable selection. Otherwise, if no standardization has been
performed, further processing methods can still remove almost-zero variables.
(See Subsection 2.4.1 for a more thorough discussion.)
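A small sketch of this calibration step (not from the book) could look as follows; the threshold used to detect almost-silent variables is an arbitrary choice.

import numpy as np

def calibrate(Y, scale=True, silent_threshold=1e-12):
    Yc = Y - Y.mean(axis=0)  # subtract the average of each variable
    if scale:
        std = Y.std(axis=0)
        active = std > silent_threshold  # do not blow up nearly constant variables
        Yc[:, active] /= std[active]
    return Yc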
Nonlinear methods of dimensionality reduction may take over from PCA once
the dimensionality is no longer too high, between a few tens and a few hun-
dreds, depending on the chosen method. The use of PCA as preprocessing is
justified by the fact that most nonlinear methods remain more sensitive to the
curse of dimensionality than PCA due to their more complex model, which
involves many parameters to identify.
7.4 Taxonomy
Figure 7.1 presents a nonexhaustive hierarchy tree of some unsupervised data
analysis methods, according to their purpose (latent variable separation or
dimensionality reduction). This figure also gives an overview of all methods
described in this book, which focuses on nonlinear dimensionality reduction
based mainly on “geometrical” concepts (distances, topology, neighborhoods,
manifolds, etc.).
Two classes of NLDR methods are distinguished in this book: those trying
to preserve pairwise distances measured in the data set and those attempting
to reproduce the data set topology. This distinction may seem quite arbi-
trary, and other ways to classify the methods exist. For instance, methods
can be distinguished according to their algorithmic structure. In the latter
case, spectral methods can be separated from those relying on iterative opti-
mization schemes like (stochastic) gradient ascent/descent. Nevertheless, this
last distinction seems to be less fundamental.
Actually, it can be observed that all distance-preserving methods involve
pairwise distances either directly (metric MDS, Isomap) or with some kind of
weighting (NLM, GNLM, CCA, CDA, SDE). In (G)NLM, this weighting is
proportional to the inverse of the Euclidean (or geodesic) distances measured
in the data space, whereas a decreasing function of the Euclidean distances
in the embedding space is used in CCA and CDA. For SDE, only Euclidean
distances to the K nearest neighbors are taken into account, while others are
simply forgotten and replaced by those determined during the semidefinite
programming step.
Fig. 7.1. Methods for latent variable separation and dimensionality reduction: a
nonexhaustive hierarchy tree. Acronyms: PCA, principal component analysis; BSS,
blind source separation; PP, projection pursuit; NLDR, nonlinear dimensionality
reduction; ICA, independent component analysis; AA NN, auto-associative neural
network; PDL, predefined lattice; DDL, data-driven lattice. Methods are shown as
tree leaves.
Table 7.2 relies on an alternative naming convention. The first letter indicates the distance or kernel type (E for Euclidean, G for geodesic/graph, C for commute-time distance, K for fixed kernel, and O for optimized kernel), whereas the three next letters refer to the algorithm (MDS for spectral decomposition, NLM for quasi-Newton optimization, and CCA for CCA-like stochastic gradient descent).
It is noteworthy that most methods have an intricate name that often
gives few or no clues about their principle. For instance, KPCA is by far
closer to metric MDS than to PCA. While SDE stands for semidefinite em-
bedding, it should be remarked that all spectral methods compute an em-
bedding from a positive semidefinite Gram-like matrix. The author of SDE
renamed his method MVU, standing for maximum variance unfolding [197];
this new name does not shed any new light on the method: all MDS-based
methods (like Isomap and KPCA, for instance) yield an embedding having
maximal variance. The series can be continued, for instance, with PCA (principal component analysis), CCA (curvilinear component analysis), and CDA (curvilinear distance analysis), whose names seem to be designed to ensure
a kind of filiation while remaining rather unclear about their principle. The
name Isomap, which stands for Isometric feature mapping [179], is quite un-
clear too, since all distance-preserving NLDR methods attempt to yield an
isometric embedding. Unfortunately, in most practical cases, perfect isometry
is not reached.
Looking back at Table 7.2, the third and fourth rows contain methods that
were initially not designed as distance-preserving methods. Regarding KPCA,
the kernels listed in [167] are given for their theoretical properties, without any
geometrical justification. However, the application of the kernel is equivalent
to mapping data to a feature space in which a distance-preserving embedding
is found by metric MDS. In the case of LE, the duality described in [204] and
the connection with commute-time distances detailed in [164, 78] allow it to
occupy a table entry.
Finally, it should be remarked that the bottom right corner of Table 7.2
contains many empty cells that could give rise to new methods with potentially
good performances.
If PCA is viewed as a method that fits a plane in the data space in order to capture as much variance as possible after projection on this plane, then to some extent running an SOM can be seen as a way to fit a nonrigid (or articulated) piece of plane within the data cloud. Isotop, on the contrary, follows a similar strategy in the opposite direction: an SOM-like update rule is used for embedding a graph deduced from the data set in a low-dimensional space.
By essence, GTM solves the problem in a similar way as an SOM would.
The approach is more principled, however, and the resulting algorithm works
in a totally different way. A generative model is used, in which the latent space
is fixed a priori and whose mapping parameters are identified by statistical
inference.
Finally, topology-preserving spectral methods, like LLE and LE, develop
a third approach to the problem. They build what can be called an “affinity”
matrix [31], which is generally sparse; after double-centering or application of
the Laplacian operator, some of its bottom eigenvectors form the embedding.
A relationship between LLE and LE is established in [13], while the duality
described in [204] allows us to relate both methods to distance-preserving
spectral methods.
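For illustration, a minimal sketch in the spirit of Laplacian eigenmaps (an assumption, not the reference implementation of LE) could be:

import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_embedding(Y, dim=2, K=8):
    W = kneighbors_graph(Y, n_neighbors=K, mode="connectivity")
    W = W.maximum(W.T).toarray()  # symmetric binary affinity matrix (densified for simplicity)
    D = np.diag(W.sum(axis=1))  # degree matrix
    L = D - W  # unnormalized graph Laplacian
    w, V = eigh(L, D)  # generalized eigenproblem L v = lambda D v
    return V[:, 1:dim + 1]  # bottom eigenvectors, skipping the constant one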
In most spectral methods, the transformation of data is not optimized, except in SDE; in the latter case, the kernel optimization unfortunately induces a heavy computational burden.
Spectral methods first transform data (nonlinearly when building the Gram-like matrix, and then linearly when solving the eigenproblem) before pruning the unnecessary dimensions (eigenvectors are discarded). In contrast, most nonspectral methods start by initializing mapped data vectors in a low-dimensional space and then rearrange them to optimize some objective function.
Guidelines can also be given according to the data set’s size. In the case
of ...
• Large data set. If several thousands of data points are available (N >
2000), most NLDR methods will generate a heavy computational burden
because of their time and space complexities, which are generally propor-
tional to N 2 (or even higher for the computation time). It is then useful
to reduce the data set’s size, at least to perform some preliminary steps.
The easiest way to obtain a smaller data set consists of resampling the
available one, i.e., drawing a subset of points at random. Obviously, this
is not an optimal way, since it is possible, as ill luck would have it, for
the drawn subsample not to be representative of the whole data set. Some
examples throughout this book have shown that a representative subset
can be determined using vector quantization techniques, like K-means and
similar methods.
• Medium-sized set. If several hundreds of data points are available (200 <
N ≤ 2000), most NLDR methods can be applied directly to the data set,
without any size reduction.
• Small data set. When fewer than 200 data points are available, the use of
most NLDR methods becomes questionable, as the limited amount of data
could be insufficient to identify the large number of parameters involved in
many of these methods. Using PCA or classical metric MDS often proves
to be a better option.
The dimensionality of data, along with the target dimension, can also be
taken into account. In case of a ...
• very high data dimensionality. For more than 50 dimensions (D > 50),
NLDR methods can suffer from the curse of dimensionality, get confused,
and provide meaningless results. It can then be wise first to apply PCA or
metric MDS in order to perform a hard dimensionality reduction. These
two methods can considerably decrease the data dimensionality without
losing much information (in terms of measured variance, for instance). De-
pending on the data set’s characteristics, PCA or metric MDS can also help
attenuate statistical noise in data. After PCA/MDS, a nonlinear method
can be used with more confidence (see the two next cases) in order to
further reduce the dimensionality.
• high data dimensionality. For a few tens of dimensions (5 < D ≤ 50),
NLDR methods should be used with care. The curse of dimensionality is
already no longer negligible.
• low data dimensionality. For up to five dimensions, any NLDR method
can be applied with full confidence.
Obviously, the choice of the target dimensionality should take into account
the intrinsic dimensionality of data if it is known or can be estimated.
• If the target dimensionality is (much) higher than the intrinsic one, PCA or
MDS performs very well. These two methods have numerous advantages:
they are simple, fast, do not fall in local optima, and involve no parameters.
In this case, even the fact that they transform data in a linear way can be
considered an advantage in many respects.
• If the target dimensionality is equal to or hardly higher than the intrinsic
one, NLDR methods can yield very good results. Most spectral or non-
spectral methods work quite well in this case. For highly curved manifolds,
one or two supernumerary dimensions can improve the embedding quality.
Most NLDR methods (and especially those based on distance preservation)
have limited abilities to deform/distort manifolds. Some extra dimensions
can then compensate for this lack of “flexibility.” The same strategy can
be followed to embed manifolds with essential loops or spheres.
• If the target dimensionality is lower than the intrinsic one, such as for vi-
sualization purposes, use NLDR methods at your own risk. It is likely that
results will be meaningless since the embedding dimensionality is “forced.”
In this case, most nonspectral NLDR methods should be avoided. They
simply fail to converge in an embedding space of insufficient dimensional-
ity. On the other hand, spectral methods do not share this drawback since
they solve an eigenproblem independently from the target dimensionality.
This last parameter is involved only in the final selection of eigenvectors.
Obviously, although an embedding dimensionality that is deliberately too
low does not jeopardize the method convergence, this option does not guar-
antee that the obtained embedding is meaningful either. Its interpretation
and/or subsequent use must be questioned.
Here is a list of additional advice related to the application's purpose and
other considerations.
• Collect information about your data set prior to NLDR: estimate the in-
trinsic dimensionality and compute an adjacency graph in order to deduce
the manifold connectivity.
• Never use any NLDR method without knowing the role and influence of
all its parameters (true for any method, with a special emphasis on non-
spectral methods).
• For 2D visualization and exploratory data analysis, Kohonen’s SOM re-
mains a reference tool.
• Never use KPCA for embedding purposes. The theoretical framework hid-
den behind KPCA is elegant and appealing; it paved the way toward a
unified view of all spectral methods. However, in practice, the method
lacks a geometrical interpretation that could help the user choose use-
ful kernel functions. Use SDE instead; this method resembles KPCA in
many respects, and the SDP step implicitly determines the optimal kernel
function for distance preservation.
• Never use SDE with large data sets; this method generates a heavy com-
putational burden and needs to run on much more powerful computers
than alternative methods do.
• Avoid using GTM as much as possible; the method involves too many
parameters and is restricted to 1D or 2D rectangular latent spaces; the
mapping model proves to be not flexible enough to deal with highly curved
manifolds.
• LLE is very sensitive to its parameter values (K or ε, and the regularization
parameter Δ). Use it carefully, and do not hesitate to try different values,
as is done in the literature [166].
• Most nonspectral methods can get stuck in local optima: depending on the
initialization, different embeddings can be obtained.
• Finally, do not forget to assess the embedding quality using appropriate
criteria [186, 185, 9, 74, 10, 190, 103] (see an example in Subsection 6.3.1).
The above recommendations leave the following question unanswered: given a data set, how does one choose between distance and topology preservation? If the data set is small, the methods with the simplest models often suit
the best (e.g., PCA, MDS, or NLM). With mid-sized data sets, more complex
distance-preserving methods like Isomap or CCA/CDA often provide more
meaningful results. Topology-preserving methods like LLE, LE, and Isotop
should be applied to large data sets only. Actually, the final decision between
distance and topology preservation should then be guided by the shape of the
underlying manifold. Heavily crumpled manifolds are more easily embedded
using topology preservation rather than distance preservation. The key point
to know is that both strategies extract neither the same kind nor the same
amount of information from data. Topology-preserving methods focus on local
information (neighborhood relationships), whereas distance-preserving ones
exploit both the local and global manifold structure.
7.8 Perspectives
During the twentieth century, dimensionality reduction went through several eras. The
first era mainly relied on spectral methods like PCA and then classical metric
MDS. Next, the second era consisted of the generalization of MDS into non-
linear variants, many of them being based on distance or rank preservation
and among which Sammon’s NLM is probably the most emblematic represen-
tative. At the end of the century, the field of NLDR was deeply influenced by
“neural” approaches; the autoassociative MLP and Kohonen’s SOM are the
most prominent examples of this stream. The beginning of the new century
witnessed the rebirth of spectral approaches, starting with the discovery of
KPCA.
So in which directions will the researchers orient their investigations in
the coming years? The paradigm of distance preservation can be counted
among the classical NLDR tools, whereas no real breakthrough has happened
in topology preservation since the SOM invention. It seems that the vein of
spectral methods has now been largely exploited. Many recent papers deal-
ing with that topic do not present new methods but are instead surveys that
summarize the domain and explore fundamental aspects of the methods, like
their connections or duality within a unifying framework. A recent publica-
tion in Science [89] describing a new training technique for auto-associative
MLP could reorient the NLDR research toward artificial neural networks once
again, in the same way as the publication of Isomap and LLE in the same jour-
nal in 2000 led to the rapid development of many spectral methods. This
renewed interest in ANNs could focus on issues that were barely addressed
by spectral methods and distance preservation: large-scale NLDR problems
(training samples with several thousands of items), “out-of-sample” general-
ization, bidirectional mapping, etc.
A last open question regards the curse of dimensionality. An important
motivation behind (NL)DR aims at avoiding its harmful effects. Paradoxi-
cally, however, many NLDR methods do not bring a complete solution to the
problem, but only dodge it. Many NLDR methods give poor results when
the intrinsic dimensionality of the underlying manifold exceeds four or five.
In such cases, the dimension of the embedding space becomes high enough to
observe undesired effects related to the curse of dimensionality, such as the
empty space phenomenon. The future will tell whether new techniques will
be able to take up this ultimate challenge.
A Matrix Calculus
In the general case, even if A contains only real entries, V and Λ can be
complex. If A is symmetric (A = AT ), then V is orthonormal (the eigen-
vectors are orthogonal in addition to being normed); the EVD can then be
rewritten as
\[
\mathbf{A} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{T} , \tag{A.5}
\]
and the eigenvalues are all real numbers. Moreover, if A is positive definite
(resp., negative definite), then all eigenvalues are positive (resp., negative).
If A is positive semidefinite (resp., negative semidefinite), then all eigenval-
ues are nonnegative (resp., nonpositive). For instance, a covariance matrix is
positive semidefinite.
Again, the eigenvalue decomposition leads to the solution. The square root is then written as
\[
\mathbf{A}^{1/2} = \mathbf{V} \boldsymbol{\Lambda}^{1/2} \mathbf{V}^{-1} , \tag{A.9}
\]
and it is easy to check that
\[
\mathbf{A}^{1/2} \mathbf{A}^{1/2} = \mathbf{V} \boldsymbol{\Lambda}^{1/2} \mathbf{V}^{-1} \mathbf{V} \boldsymbol{\Lambda}^{1/2} \mathbf{V}^{-1} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{-1} = \mathbf{A} .
\]
This is valid in the general case, i.e., A can be complex and/or nonsym-
metric, yielding complex eigenvalues and eigenvectors. If A is symmetric, the
last equation can be further simplified since the eigenvectors are real and
orthonormal (V−1 = VT ).
It is noteworthy that the second definition of the matrix square root can be generalized to compute matrix powers:
\[
\mathbf{A}^{p} = \mathbf{V} \boldsymbol{\Lambda}^{p} \mathbf{V}^{-1} .
\]
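As a small numerical check (not from the book), the eigenvalue-based square root and powers can be verified with NumPy for a symmetric positive semidefinite matrix.

import numpy as np

def matrix_power_evd(A, p):
    w, V = np.linalg.eigh(A)  # A symmetric: real eigenvalues, orthonormal eigenvectors
    return V @ np.diag(w ** p) @ V.T  # A^p = V Lambda^p V^T

A = np.array([[2.0, 1.0], [1.0, 2.0]])
S = matrix_power_evd(A, 0.5)
print(np.allclose(S @ S, A))  # True: the square root squares back to A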
This appendix briefly introduces Gaussian random variables and some of their
basic properties.
The probability density function (pdf) of a Gaussian random variable x is
\[
f_x(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) , \tag{B.1}
\]
where μ and σ² are the mean and the variance, respectively, and correspond to the first-order moment and second-order central moment. Figure B.1 shows a plot of a Gaussian probability density function. Visually, the mean μ indicates
[Fig. B.1: plot of a univariate Gaussian probability density function fx(x).]
the abscissa where the bump reaches its highest point, whereas σ is related to
the spreading of the bump. For this reason, the standard deviation σ is often
called the width in a geometrical context (see ahead).
Since only the mean and variance suffice to characterize a Gaussian vari-
able, it is often denoted as N (μ, σ). The letter N recalls the alternative name
of a Gaussian variable: a normal variable or normally distributed variable.
This name simply reflects the fact that for real-valued variables, the Gaussian
distribution is the most widely observed one in a great variety of phenomena.
Moreover, the central limit theorem states that a variable obtained as
the sum of several independent identically distributed variables, regardless
of their distribution, tends to be Gaussian if the number of terms in the
sum tends to infinity. Thus, to some extent, the Gaussian distribution can
be considered the “child” of all other distributions. On the other hand, the
Gaussian distribution can also be interpreted as the “mother” of all other
distributions. This is intuitively confirmed by the fact that any zero-mean
unit variance pdf fy (y) can be modeled starting from a zero-mean and unit-
variance Gaussian variable with pdf fx (x), by means of the Gram–Charlier or
Edgeworth development:
1 1
fy (y) = fx (y) 1 + μ3 (y)H3 (y) + (μ4 (y) − 3)H4 (y) + . . . , (B.2)
6 24
Visually, in the last development, a nonzero skewness κ3 (y) makes the pdf
fy (y) asymmetric, whereas a nonzero kurtosis excess κ4 (y) makes its bump
flatter or sharper. Even if the development does not go beyond the fourth
order, it is easy to guess that the Gaussian distribution is the only one having
zero cumulants for orders higher than two. This partly explains why Gaussian
variables are said to be the “least interesting” ones in some contexts [95].
Actually, a Gaussian distribution has absolutely no salient characteristic:
• The support is unbounded, in contrast to a uniform distribution, for in-
stance.
• The pdf is smooth, symmetric, and unimodal, without a sharp peak like
the pdf of a Laplacian distribution.
• The distribution maximizes the differential entropy.
The function defined in Eq. (B.1) and plotted in Fig. B.1 is the sole func-
tion that both shows the above properties and respects the necessary con-
ditions to be a probability density function. These conditions are set on the
cumulative density function Fx (x) of the random variable, defined as
\[
F_x(x) = \int_{-\infty}^{x} f_x(u)\, du , \tag{B.4}
\]
In the multidimensional case, the joint pdf of a Gaussian random vector x is
\[
f_{\mathbf{x}}(\mathbf{x}) = \frac{1}{(2\pi)^{P/2} |\mathbf{C}_{\mathbf{xx}}|^{1/2}} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_{\mathbf{x}})^{T} \mathbf{C}_{\mathbf{xx}}^{-1} (\mathbf{x} - \boldsymbol{\mu}_{\mathbf{x}}) \right) ,
\]
where μx and Cxx are, respectively, the mean vector and the covariance matrix. As the covariance matrix is symmetric and positive semidefinite, its determinant is nonnegative. The joint pdf of a two-dimensional Gaussian is drawn in Fig. B.2.
[Fig. B.2: joint pdf of a two-dimensional Gaussian distribution.]
It can be seen that, in the argument of the exponential function, the factor \((\mathbf{x} - \boldsymbol{\mu}_{\mathbf{x}})^{T} \mathbf{C}_{\mathbf{xx}}^{-1} (\mathbf{x} - \boldsymbol{\mu}_{\mathbf{x}})\) is related to the square of the Mahalanobis distance between \(\mathbf{x}\) and \(\boldsymbol{\mu}_{\mathbf{x}}\) (see Subsection 4.2.1).
When the covariance matrix is diagonal, i.e., when the variables are uncorrelated and \(c_{p,p} = \sigma_{x_p}^2\), the joint pdf becomes
\[
f_{\mathbf{x}}(\mathbf{x}) = \prod_{p=1}^{P} \frac{1}{\sqrt{2\pi c_{p,p}}} \exp\left( -\frac{1}{2} \frac{(x_p - \mu_{x_p})^2}{c_{p,p}} \right) \tag{B.7}
\]
\[
= \prod_{p=1}^{P} \frac{1}{\sqrt{2\pi}\,\sigma_{x_p}} \exp\left( -\frac{1}{2} \frac{(x_p - \mu_{x_p})^2}{\sigma_{x_p}^2} \right) \tag{B.8}
\]
\[
= \prod_{p=1}^{P} f_{x_p}(x_p) , \tag{B.9}
\]
showing that the joint pdf of uncorrelated Gaussian variables factors into the
product of the marginal probability density functions. In other words, uncor-
related Gaussian variables are also statistically independent. Again, fx (x) is
the sole and unique probability density function that can satisfy this property.
For other multivariate distributions, the fact of being uncorrelated does not
imply the independence of the marginal densities. Nevertheless, the reverse
implication is always true.
Geometrically, a multidimensional Gaussian distribution looks like a fuzzy
ellipsoid, as shown in Fig. B.3. The axes of the ellipsoid correspond to coor-
dinate axes.
Fig. B.3. Sample joint distribution (10,000 realizations, in blue) of two uncorrelated
Gaussian variables. The variances are proportional to the axis lengths (in red).
The function fx (x) is often used outside the statistical framework, possibly
without its normalization factor. In this case, fx (x) is usually called a radial
basis function or Gaussian kernel. In addition to being isotropic, such a function
has very nice properties:
• It produces a single localized bump.
• Very few parameters have to be set (P means and one single variance,
compared to P (P + 1)/2 for a complete covariance matrix).
• It depends on the well-known and widely used Euclidean distance.
Gaussian kernels are omnipresent in applications like radial basis function
networks [93] (RBFNs) and support vector machines (SVM) [27, 37, 42].
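For illustration (not from the book), such a kernel can be written in a couple of lines; the width sigma is an arbitrary choice.

import numpy as np

def gaussian_kernel(x, c, sigma=1.0):
    # Isotropic Gaussian bump centered on c, depending only on the Euclidean distance.
    x, c = np.asarray(x, float), np.asarray(c, float)
    return np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))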
Fig. B.5. Sample joint distribution (10,000 realizations) of two isotropic Gaussian
variables. The original coordinate system, in green, has been deformed by the mixing
matrix. Any attempt to retrieve it leads to the orthogonal system shown in red.
Recalling that function extrema are such that \(f'(x) = 0\), a straightforward extension of Newton's procedure can be applied to find a local extremum of a twice-differentiable function f:
\[
x \leftarrow x - \frac{f'(x)}{f''(x)} , \tag{C.6}
\]
where the first and second derivatives are assumed to be continuous. The last update rule, unfortunately, does not distinguish between a minimum and a maximum and yields either one of them. An extremum is a minimum only if the second derivative is positive, i.e., the function is convex in the neighborhood of the extremum. In order to avoid the convergence toward a maximum, a simple trick consists of forcing the second derivative to be positive:
\[
x \leftarrow x - \frac{f'(x)}{|f''(x)|} . \tag{C.7}
\]
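As an illustration (not from the book), the safeguarded rule of Eq. (C.7) can be sketched for a univariate function with known derivatives; the example function is arbitrary.

def newton_extremum(f_prime, f_second, x0, iterations=50):
    x = x0
    for _ in range(iterations):
        x -= f_prime(x) / abs(f_second(x))  # forcing the second derivative to be positive
    return x

# Example: f(x) = (x - 3)**2 + 1, so f'(x) = 2(x - 3) and f''(x) = 2.
print(newton_extremum(lambda x: 2 * (x - 3), lambda x: 2.0, x0=-10.0))  # converges to 3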
One step further, \(\mathbf{H}^{-1}\) is the inverse of the Hessian matrix, defined as
\[
\mathbf{H} \triangleq \nabla_{\mathbf{x}} \nabla_{\mathbf{x}}^{T} f(\mathbf{x}) = \left[ \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j} \right]_{ij} . \tag{C.11}
\]
A solution to these two issues consists of assuming that the Hessian matrix is diagonal, although it is often a very crude hypothesis. This approximation, usually called quasi-Newton or diagonal Newton, can be written componentwise:
\[
x_p \leftarrow x_p - \alpha \, \frac{\partial f(\mathbf{x}) / \partial x_p}{\partial^2 f(\mathbf{x}) / \partial x_p^2} , \tag{C.12}
\]
where the coefficient α (0 < α ≤ 1) slows down the update rule in order to avoid unstable behaviors due to the crude approximation of the Hessian matrix.
Within the framework of data analysis, it often happens that the function to be optimized is of the form
\[
f(\mathbf{x}) = E_{\mathbf{y}}\{ g(\mathbf{y}, \mathbf{x}) \} \quad \text{or} \quad f(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} g(\mathbf{y}(i), \mathbf{x}) . \tag{C.14}
\]
In that case, the update rule of the gradient descent becomes
\[
\mathbf{x} \leftarrow \mathbf{x} - \alpha \, \frac{1}{N} \sum_{i=1}^{N} \nabla_{\mathbf{x}}\, g(\mathbf{y}(i), \mathbf{x}) . \tag{C.15}
\]
This is the usual update rule for the classical gradient descent. In the frame-
work of neural networks and other adaptive methods, the classical gradient
descent is often replaced with the stochastic gradient descent. In the latter
method, the update rule can be written in the same way as in the classical
method, except that the mean (or expectation) operator disappears:
\[
\mathbf{x} \leftarrow \mathbf{x} - \alpha \, \nabla_{\mathbf{x}}\, g(\mathbf{y}(i), \mathbf{x}) . \tag{C.16}
\]
Because of the dangling index i, the update rule must be repeated N times,
over all available observations y(i). From an algorithmic point of view, this
means that two loops are needed. The first one corresponds to the iterations
that are already performed in the classical gradient descent, whereas an inner
loop traverses all vectors of the data set. A traversal of the data set is usually
called an epoch.
Moreover, from a theoretical point of view, as the partial updates are no
longer weighted and averaged, additional conditions must be fulfilled in order
to attain convergence. Actually, the learning rate α must decrease as epochs
go by and, assuming t is an index over the epochs, the following (in)equalities
must hold [156]:
\[
\sum_{t=1}^{\infty} \alpha(t) = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} (\alpha(t))^2 < \infty . \tag{C.17}
\]
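For illustration, here is a minimal sketch (not from the book) of such a stochastic gradient descent with a decreasing learning rate α(t) = α0/t, which satisfies the conditions of Eq. (C.17); the example objective and step size are arbitrary choices.

import numpy as np

def sgd(grad_g, data, x0, epochs=100, alpha0=0.5, seed=0):
    x = np.asarray(x0, dtype=float)
    rng = np.random.default_rng(seed)
    for t in range(1, epochs + 1):  # one epoch = one traversal of the data set
        alpha = alpha0 / t  # decreasing learning rate
        for i in rng.permutation(len(data)):
            x = x - alpha * grad_g(data[i], x)  # partial update, no averaging
    return x

# Example: minimize the mean squared distance to the data points (the optimum is their mean).
data = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
print(sgd(lambda y, x: 2 * (x - y), data, x0=[5.0, -5.0]))  # approaches [1, 1]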
The quantization distortion can be written as
\[
E_{\mathrm{VQ}} = \frac{1}{N} \sum_{i=1}^{N} \|\mathbf{y}(i) - \operatorname{dec}(\operatorname{cod}(\mathbf{y}(i)))\|^2 , \tag{D.1}
\]
where the coding and decoding functions cod and dec are respectively defined as
\[
\operatorname{cod} : \mathbb{R}^{D} \to \{1, \ldots, M\} : \mathbf{y} \mapsto \arg \min_{1 \le j \le M} \|\mathbf{y} - \mathbf{c}(j)\| \tag{D.2}
\]
and
\[
\operatorname{dec} : \{1, \ldots, M\} \to C : j \mapsto \mathbf{c}(j) . \tag{D.3}
\]
Fig. D.1. Principle of vector quantization. The first plot shows a data set (2000
points). As illustrated by the second plot, a vector quantization method can reduce
the number of points by replacing the initial data set with a smaller set of repre-
sentative points: the prototypes, centroids, or code vectors, which are stored in the
codebook. The third plot shows simultaneously the initial data, the prototypes, and
the boundaries of the corresponding Voronoı̈ regions.
The application of the coding function to some vector y(i) of the data set
gives the index j of the best-matching unit of y(i) (BMU in short), i.e., the
closest prototype from y(i). Appendix F.2 explains how to compute the BMU
efficiently. The application of the decoding function to j simply gives the
coordinates c(j) of the corresponding prototype. The coding function induces
a partition of RD : the open sets of all points in RD that share the same BMU
c(j) are called the Voronoı̈ regions (see Fig. D.1). A discrete approximation
of the Voronoı̈ regions can be obtained by constituting the sets Vj of all data
points y(i) having the same BMU c(j). Formally, these sets are written as
Vj = {y(i)|cod(y(i)) = j} (D.4)
The K-means algorithm alternates two steps: in step 1, each data point is (re-)encoded, i.e., assigned to its current BMU; in step 2, each prototype is moved to the barycenter of the points it encodes:
\[
\mathbf{c}(j) \leftarrow \frac{1}{|V_j|} \sum_{\mathbf{y}(i) \in V_j} \mathbf{y}(i) . \tag{D.5}
\]
The total distortion can be decomposed as \(E_{\mathrm{VQ}} = \frac{1}{N} \sum_{j=1}^{M} E_{\mathrm{VQ}}^{j}\), where
\[
E_{\mathrm{VQ}}^{j} = \sum_{\mathbf{y}(i) \in V_j} \|\mathbf{y}(i) - \mathbf{c}(j)\|^2 . \tag{D.7}
\]
Trivially, the barycenter of some \(V_j\) minimizes the corresponding \(E_{\mathrm{VQ}}^{j}\). Therefore, step 2 decreases \(E_{\mathrm{VQ}}\). But, as a side effect, the update of the prototypes
also modifies the results of the encoding function. So it must be shown that the
re-encoding occurring in step 1 also decreases the distortion. The only terms
that change in the quantization distortion defined in Eq. (D.1) are those cor-
responding to data points that change their BMU. By definition of the coding
function, the distance \(\|\mathbf{y}(i) - \mathbf{c}(j)\|\) is smaller for the new BMU than for the
old one. Therefore, the error is lowered after re-encoding, which concludes the
correctness proof of the K-means.
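As an illustration (a sketch, not the book's implementation), the two-step loop can be written as follows:

import numpy as np

def kmeans(Y, M, iterations=50, seed=0):
    rng = np.random.default_rng(seed)
    C = Y[rng.choice(len(Y), size=M, replace=False)].astype(float)  # initial codebook
    for _ in range(iterations):
        # step 1: encode each point with its best-matching unit (BMU)
        bmu = np.argmin(((Y[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
        # step 2: move each prototype to the barycenter of its region V_j
        for j in range(M):
            if np.any(bmu == j):  # leave empty ("dead") units where they are
                C[j] = Y[bmu == j].mean(axis=0)
    bmu = np.argmin(((Y[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
    distortion = np.mean(((Y - C[bmu]) ** 2).sum(axis=1))  # E_VQ of Eq. (D.1)
    return C, distortion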
Competitive learning performs a stochastic gradient descent on this distortion. The gradient of \(E_{\mathrm{VQ}}\) with respect to a prototype \(\mathbf{c}(j)\) can be written as
\[
\nabla_{\mathbf{c}(j)} E_{\mathrm{VQ}} = \frac{1}{N} \sum_{i=1}^{N} 2\,\|\mathbf{y}(i) - \mathbf{c}(j)\| \, \frac{\partial \|\mathbf{y}(i) - \mathbf{c}(j)\|}{\partial \mathbf{c}(j)} \tag{D.8}
\]
\[
= \frac{1}{N} \sum_{i=1}^{N} 2\,\|\mathbf{y}(i) - \mathbf{c}(j)\| \, \frac{\mathbf{c}(j) - \mathbf{y}(i)}{2\,\|\mathbf{y}(i) - \mathbf{c}(j)\|} \tag{D.9}
\]
\[
= \frac{1}{N} \sum_{i=1}^{N} (\mathbf{c}(j) - \mathbf{y}(i)) , \tag{D.10}
\]
so that following the opposite direction of this gradient, either on average or stochastically for each presented datum, moves the prototypes toward the data points they encode.
D.3 Taxonomy
Quantization methods may be divided into static, incremental, and dynamic
ones. This distinction refers to their capacity to increase or decrease the num-
ber of prototypes they update. Most methods, like competitive learning and
LBG, are static and manage a number of prototypes fixed in advance. Incre-
mental methods (see, for instance, [68, 71, 11, 70]) are able to increase this
predetermined number by inserting supplemental units when this is necessary
(various criteria exist). Fully dynamic methods (see, for instance, [69]) can
add new units and remove unnecessary ones.
In addition to the distinction between classical quantization and compet-
itive learning, the latter can further be divided into two subcategories:
• Winner take all (WTA). Similarly to the stochastic method sketched
just above, WTA methods update only one prototype (the BMU) at each
presentation of a datum. WTA methods are the simplest ones and include
the classical competitive learning [162, 163] and the frequency-sensitive
competitive learning [51].
• Winner take most (WTM). WTM methods are more complex than
WTA ones, because the prototypes interact at each presentation of a da-
tum. In practice, several prototypes are updated at each presentation of a
datum. In addition to the BMU, some other prototypes related to the BMU
are also updated. Depending on the specific quantization method, these
prototypes may be the second, third, and so forth closest prototypes in the
data space, as in the neural gas [135] (NG). Otherwise, the neighborhood
relationships with the BMU may also be predefined and data-independent,
as in a self-organizing map [105, 154] (SOM; see also Subsection 5.2.1).
and
\[
\mathbf{y} = \begin{bmatrix} 2x \cos(6\pi x) \\ 2x \sin(6\pi x) \end{bmatrix} \quad \text{with} \quad x \in [0, 1] . \tag{E.2}
\]
In both cases, Gaussian noise is added (for the sine, on y2 only, with
standard deviation equal to 0.10; for the spiral, on both y1 and y2 , with
standard deviation equal to 0.05). Afterwards, the 3000 points of each data set
are quantized with 120 and 25 prototypes respectively. Figure E.1 illustrates
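For reference, the spiral data set of Eq. (E.2) could be regenerated with a script like the following (an assumed sketch, not the authors'; the random seed and the use of scikit-learn's K-means are arbitrary choices).

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=3000)
spiral = np.column_stack((2 * x * np.cos(6 * np.pi * x),
                          2 * x * np.sin(6 * np.pi * x)))
spiral += rng.normal(scale=0.05, size=spiral.shape)  # Gaussian noise on both coordinates
prototypes = KMeans(n_clusters=25, n_init=10).fit(spiral).cluster_centers_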
both data sets. Both manifolds have the property of being one-dimensional
Fig. E.1. Two data sets used to test graph-building rules: the sine with exponen-
tially increasing frequency and the linear spiral. Data points are displayed as points
whereas prototypes are circles.
only in a local scale. For example, going from left to right along the sine
wave makes it appear “more and more two-dimensional” (see Chapter 3). For
the rules that build a graph without assuming that the available points are
prototypes resulting from a vector quantization, the prototypes are given as
such. On the other hand, the rules relying on the fact that some data set has
been quantized are given both the prototypes and the original data sets.
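For concreteness, the spiral of Eq. (E.2) can be generated and quantized along the lines described above. This is only a sketch of the experimental setup (the exponential-frequency sine is omitted), and scipy's kmeans2 is used as a generic stand-in for the vector quantizer of Appendix D.

import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
N = 3000
x = rng.uniform(0.0, 1.0, size=N)
spiral = np.column_stack([2 * x * np.cos(6 * np.pi * x),       # Eq. (E.2)
                          2 * x * np.sin(6 * np.pi * x)])
spiral += rng.normal(scale=0.05, size=spiral.shape)            # Gaussian noise on y1 and y2

# Quantize the 3000 points with 25 prototypes.
prototypes, labels = kmeans2(spiral, 25, minit='points')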
E.1.1 K-rule
Also known as the rule of K-ary neighborhoods, this rule is actually very
simple: each point y(i) is connected with the K closest other points. As a
direct consequence, if the graph is undirected (see Section 4.3), then each point
is connected with at least K other points. Indeed, each point elects exactly K
neighbors but can also be elected by points that do not belong to this set of
K neighbors. This phenomenon typically happens with an isolated point: it
elects as neighbors faraway points while those points find their K neighbors
within a much smaller distance. Another consequence of this rule is that no
(nontrivial) upper bound can be easily given for the longest distance between
a point and its neighbors. The practical determination of the K closest points
is detailed in Section F.2.
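A minimal sketch of the K-rule, under the assumption that a plain Euclidean distance matrix fits in memory, could look as follows; the undirected treatment of the elected neighbors follows the description above.

import numpy as np

def k_rule(points, K=2):
    """Connect each point to its K closest other points; edges are stored as
    unordered pairs, so the resulting graph is undirected (see Section 4.3)."""
    P = np.asarray(points, dtype=float)
    d = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)                    # a point does not elect itself
    edges = set()
    for i in range(len(P)):
        for j in np.argsort(d[i])[:K]:             # the K closest other points
            edges.add((min(i, int(j)), max(i, int(j))))
    return edges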
Assuming the data set {y(i)}1≤i≤N is actually the set of 120 or 25 proto-
types mentioned above, the K-rule gives the results displayed in the first row
of Fig. E.2. Knowing that the manifolds are one-dimensional, the value of K
is set to 2. For the sine wave, the result is good but gets progressively worse
as the frequency increases. For the spiral, the obtained graph is totally wrong,
mainly because the available points do not sample the spiral correctly (i.e.,
the distances between points on different whorls may be smaller than those
of points lying on the same whorl).
E.1.2 ε-rule
By comparison with the K-rule, the ε-rule works almost conversely: each point
y(i) is connected with all other points lying inside an ε-ball centered on y(i).
Consequently, ε is by construction the upper bound for the longest distance
between a point and its neighbors. But as a counterpart, no (nontrivial) lower
bound can easily be given for the smallest number of neighbors that are con-
nected with each point. Consequently, it may happen that isolated points have
no neighbors. As extensively demonstrated in [19], the ε-rule shows better
properties for the approximation of geodesic distances with graph distances.
However, the choice of ε appears more difficult in practice than that of K.
The practical determination of the points lying closer than a fixed distance
from another point is detailed in Section F.2.
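The ε-rule admits an equally short sketch under the same assumptions; points farther than ε from every other point simply end up with no incident edge, as noted above.

import numpy as np

def epsilon_rule(points, eps):
    """Connect every pair of points whose Euclidean distance does not exceed eps."""
    P = np.asarray(points, dtype=float)
    d = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=2))
    return {(i, j) for i in range(len(P))
                   for j in range(i + 1, len(P)) if d[i, j] <= eps}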
Assuming the data set {y(i)}1≤i≤N is actually the set of 120 or 25 proto-
types mentioned above, the ε-rule gives the results displayed in the second row
of Fig. E.2. The parameter ε is given the values 0.3 and 0.8 for, respectively,
the sine and the spiral. As can be seen, the ε-rule gives good results only if the
distribution of points remains approximately uniform; this assumption is not
exactly true for the two proposed data sets. Consequently, the graph includes
too many edges in dense regions like the first minimum of the sine wave and
the center of the spiral. On the other hand, points lying in sparsely populated
regions often remain disconnected.
E.1.3 τ-rule
This more complex rule connects two points y(i) and y(j) if they satisfy two
conditions. The first condition involves the distances $d_i = \min_{k} \|\mathbf{y}(i) - \mathbf{y}(k)\|$
and $d_j = \min_{k} \|\mathbf{y}(j) - \mathbf{y}(k)\|$ from the two points to their respective closest
neighbors.
This rule behaves almost like the ε-rule, except that the radius ε is implicit,
hidden in τ, and adapts to the local data distribution. The mean radius
increases in sparse regions and decreases in dense regions.
Assuming the data set {y(i)}1≤i≤N is actually the set of 120 or 25 pro-
totypes mentioned above, the τ -rule gives the results displayed in the third
row of Fig. E.2. For the sine as well as for the spiral, the parameter τ equals
1.5. As expected, the τ-rule behaves a little better than the ε-rule when the
density of points varies. This is especially visible in dense regions: the nu-
merous unnecessary edges produced by the ε-rule disappear. Moreover, it is
noteworthy that the τ-rule extracts the shape of the spiral better than the
two previous rules do.
E.2 With vector quantization
For each data point y(i), this rule computes the set containing the K
closest prototypes, written as {c(j_1), . . . , c(j_K)}. Then each possible pair
{c(j_s), c(j_t)} in this set is analyzed. If the point fulfills the following two
conditions, then the prototypes of the considered pair are connected, and a
graph edge is created between their associated vertices. The first one is the
condition of the ellipse, written as
$$ d(\mathbf{y}(i), \mathbf{c}(j_s)) < C_2 \, d(\mathbf{y}(i), \mathbf{c}(j_t)) \quad \text{and} \quad d(\mathbf{y}(i), \mathbf{c}(j_t)) < C_2 \, d(\mathbf{y}(i), \mathbf{c}(j_s)) \; , \qquad (E.6) $$
Fig. E.2. Results of the K-rule, ε-rule, and τ-rule on the data sets proposed in
Fig. E.1.
Fig. E.3. The two conditions to fulfill in order to create a graph edge between
the vertices associated with two prototypes (black crosses). The points that create the
edge must be inside the ellipse and outside both circles. The ellipse and circles are
shown for different values of S.
Assuming the data set {y(i)}1≤i≤N is actually the set of 120 or 25 pro-
totypes mentioned in the introductory section, the data rule gives the results
displayed in the first row of Fig. E.4. The parameter K equals 2, and S is
assigned the values 0.5 and 0.3 for the sine and the spiral, respectively. As can
be seen, exploiting the information provided by the data set before vector
quantization allows the shape of both data sets to be extracted much better
than with the three previous rules.
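The overall structure of the data rule can be sketched as follows. Because the geometric conditions are only partially reproduced above, they are represented here by a pluggable predicate pair_test supplied by the user; only the loop over data points and over pairs of their K closest prototypes follows the description given in this section.

import numpy as np
from itertools import combinations

def data_rule(data, prototypes, K=2, pair_test=lambda y, cs, ct: True):
    """Skeleton of the data rule: for each data point y(i), examine every pair
    among its K closest prototypes and connect the pair whenever y(i) passes
    the geometric test (ellipse/circle conditions, abstracted as pair_test)."""
    Y, C = np.asarray(data, dtype=float), np.asarray(prototypes, dtype=float)
    edges = set()
    for y in Y:
        d = np.sqrt(((C - y) ** 2).sum(axis=1))
        closest = np.argsort(d)[:K]          # indices of the K closest prototypes
        for s, t in combinations(closest, 2):
            if pair_test(y, C[s], C[t]):     # placeholder for (E.6) and the circle test
                edges.add((min(int(s), int(t)), max(int(s), int(t))))
    return edges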
Fig. E.4. Results of the data rule and histogram rule on the data sets proposed in
Fig. E.1.
F Implementation Issues
This appendix gathers some hints for efficiently implementing the methods
and algorithms described in the main chapters.
applied to the bin heights, yielding the necessary log-log curves. Finally, the
numerical derivative is computed as in Eqs. (3.30) and (3.31). The second-
order estimate is replaced with a first-order one for the first and last bins.
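Purely as an illustration, and assuming that Eqs. (3.30) and (3.31) are the usual second-order central and first-order one-sided difference formulas, the slope of such a log-log curve can be estimated as follows.

import numpy as np

def loglog_slope(scales, heights):
    """Numerical derivative of log(heights) with respect to log(scales):
    second-order central differences inside, first-order at both ends."""
    lx, ly = np.log(scales), np.log(heights)
    slope = np.empty_like(ly)
    slope[1:-1] = (ly[2:] - ly[:-2]) / (lx[2:] - lx[:-2])   # central, second order
    slope[0] = (ly[1] - ly[0]) / (lx[1] - lx[0])            # forward, first order
    slope[-1] = (ly[-1] - ly[-2]) / (lx[-1] - lx[-2])       # backward, first order
    return slope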
As the pth root is a monotonic function, it does not change the ordering of
the distances and may be omitted.
For the sake of simplicity, it is assumed that a sorted list of K candidate
closest points is available. This list can be initialized by randomly choosing K
points in the data set, computing their distances (without applying the pth
root) to the source point y, and sorting the distances. Let ε be assigned the
value of the last and longest distance in the list. From this state, PDS has to
traverse the N − K remaining points in order to update the list. For each of
these points y(i), PDS starts to compute the distance from the source y by
accumulating the terms |y_d(i) − y_d|^p for increasing values of the index d.
While the sum of those terms remains lower than ε, it is worth carrying on the
additions, because the point remains a possible candidate to enter the list.
Otherwise, if the sum grows beyond ε, it can definitely be deduced that the
point is not a good candidate, and PDS may simply stop the distance compu-
tation between y(i) and y. If the point finally enters the list, the largest
distance in the list is discarded and the new point is inserted at the right
position in the list.
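A minimal sketch of PDS for finding the K closest points (Minkowski distances of order p, pth root omitted) might look as follows; for simplicity the candidate list is initialized with the first K points rather than a random subset, and all names are illustrative.

import numpy as np

def pds_k_closest(Y, y, K, p=2):
    """Partial distance search: keep a sorted list of the K best candidates and
    abort a distance computation as soon as the partial sum exceeds eps, the
    longest distance currently retained."""
    Y, y = np.asarray(Y, dtype=float), np.asarray(y, dtype=float)
    N, D = Y.shape
    candidates = sorted((np.sum(np.abs(Y[i] - y) ** p), i) for i in range(K))
    eps = candidates[-1][0]                   # longest distance in the list
    for i in range(K, N):
        s, d = 0.0, 0
        while d < D and s < eps:              # accumulate |y_d(i) - y_d|^p term by term
            s += abs(Y[i, d] - y[d]) ** p
            d += 1
        if s < eps:                           # the point enters the list
            candidates[-1] = (s, i)
            candidates.sort()
            eps = candidates[-1][0]
    return [i for _, i in candidates]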
The denomination ε is not given arbitrarily to the longest distance in the
list. A small change in the PDS algorithm can perform another task very effi-
ciently: the detection of the points y(i) lying in a hyperball of radius ε. In this
variant, ε is fixed and the list has a dynamic size; the sorting of the list is un-
necessary. The original and modified PDS algorithms find direct applications
in the K-rule and ε-rule for graph building (see Appendix E).
Other techniques for an even faster determination of the closest points exist
in the literature, but they either need special data structures built around
the set of data points [207] or provide only an approximation of the exact
result [26].
The problem of computing the shortest paths from one source vertex to all
other vertices is usually solved by Dijkstra’s [53] algorithm. Dijkstra’s algo-
rithm has already been sketched in Subsection 4.3.1; its time complexity is
O(D|E| + N log N ), where |E| is the number of edges in the graph. The main
idea of the algorithm consists of computing the graph distances in ascend-
ing order. This ordering implicitly ensures that each distance is computed along
the shortest path. This also means that distances have to be sorted in some
way. Actually, at the beginning of the algorithm, all distances are initialized
to +∞, except for the distance to the source, which is trivially zero. The other
distances are updated and output one by one as the algorithm runs and
discovers the shortest paths. Hence, distances are not really sorted, but
their intermediate values are stored in a priority queue, which allows a set of
values to be stored and the smallest one to be extracted. An essential property
of a priority queue is the possibility of decreasing the stored values.
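A minimal sketch of Dijkstra's algorithm with a binary-heap priority queue is given below; instead of an explicit decrease-key operation, it simply pushes the decreased value and discards stale entries when they are popped. The adjacency-list representation of the graph is an assumption of the example.

import heapq

def dijkstra(adj, source):
    """Single-source shortest graph distances.

    adj : dict mapping each vertex to a list of (neighbor, edge_length) pairs.
    Returns a dict of distances; unreachable vertices keep the value +infinity.
    """
    dist = {v: float("inf") for v in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]                    # priority queue of tentative distances
    while heap:
        d, u = heapq.heappop(heap)            # extract the smallest tentative distance
        if d > dist[u]:
            continue                          # stale entry, already improved
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w               # decrease the stored value
                heapq.heappush(heap, (dist[v], v))
    return dist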
References
27. B.E. Boser, I.M. Guyon, and V. Vapnik. A training algorithm for optimal mar-
gin classifiers. In Fifth Annual Workshop on Computational Learning Theory.
ACM, Pittsburgh, PA, 1992.
28. C. Bouveyron. Dépliage du ruban cortical à partir d’images obtenues en IRMf,
mémoire de DEA de mathématiques appliquées. Master’s thesis, Unité Mixte
Inserm – UJF 594, Université Grenoble 1, France, June 2003.
29. M. Brand. Charting a manifold. In S. Becker, S. Thrun, and K. Obermayer,
editors, Advances in Neural Information Processing Systems (NIPS 2002), vol-
ume 15. MIT Press, Cambridge, MA, 2003.
30. M. Brand. Minimax embeddings. In S. Thrun, L. Saul, and B. Schölkopf,
editors, Advances in Neural Information Processing Systems (NIPS 2003), vol-
ume 16. MIT Press, Cambridge, MA, 2004.
31. M. Brand and K. Huang. A unifying theorem for spectral embedding and
clustering. In C.M. Bishop and B.J. Frey, editors, Proceedings of International
Workshop on Artificial Intelligence and Statistics (AISTATS’03). Key West,
FL, January 2003. Also presented at NIPS 2002 workshop on spectral methods
and available as Technical Report TR2002-042.
32. J. Bruske and G. Sommer. Intrinsic dimensionality estimation with optimally
topology preserving maps. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 20(5):572–575, 1998.
33. M.A. Carreira-Perpiñán. A review of dimension reduction techniques. Techni-
cal report, University of Sheffield, Sheffield, January 1997.
34. A. Cichocki and S. Amari. Adaptive Blind Signal and Image Processing. John
Wiley & Sons, New York, 2002.
35. R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and
S. Zucker. Geometric diffusion as a tool for harmonic analysis and structure
definition of data, part I: Diffusion maps. Proceedings of the National Academy
of Sciences, 102(21):7426–7431, 2005.
36. R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and
S. Zucker. Geometric diffusion as a tool for harmonic analysis and structure
definition of data, part II: Multiscale methods. Proceedings of the National
Academy of Sciences, 102(21):7432–7437, 2005.
37. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–
297, 1995.
38. M. Cottrell, J.-C. Fort, and G. Pagès. Two or three things that we know about
the Kohonen algorithm. In M. Verleysen, editor, Proceedings ESANN’94, 2nd
European Symposium on Artificial Neural Networks, pages 235–244. D-Facto
conference services, Brussels, Belgium, 1994.
39. R. Courant and D. Hilbert. Methods of Mathematical Physics, volume 1. In-
terscience Publishers, Inc., New York, 1953.
40. T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley,
New York, 1991.
41. T.F. Cox and M.A.A. Cox. Multidimensional Scaling. Chapman & Hall, Lon-
don, 1995.
42. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Ma-
chines (and Other Kernel-Based Learning Methods). Cambridge University
Press, 2000.
43. J. de Leeuw and W. Heiser. Theory of multidimensional scaling. In Handbook
of Statistics, chapter 13, pages 285–316. North-Holland Publishing Company,
Amsterdam, 1982.
44. D. de Ridder and R.P.W. Duin. Sammon’s mapping using neural networks: A
comparison. Pattern Recognition Letters, 18(11–13):1307–1316, 1997.
45. P. Demartines. Analyse de données par réseaux de neurones auto-organisés.
PhD thesis, Institut National Polytechnique de Grenoble (INPG), Grenoble,
France, 1994.
46. P. Demartines and J. Hérault. Vector quantization and projection neural
network. volume 686 of Lecture Notes in Computer Science, pages 328–333.
Springer-Verlag, New York, 1993.
47. P. Demartines and J. Hérault. CCA: Curvilinear component analysis. In 15th
Workshop GRETSI, Juan-les-Pins (France), September 1995.
48. P. Demartines and J. Hérault. Curvilinear component analysis: A self-
organizing neural network for nonlinear mapping of data sets. IEEE Transac-
tions on Neural Networks, 8(1):148–154, January 1997.
49. D. DeMers and G.W. Cottrell. Nonlinear dimensionality reduction. In D. Han-
son, J. Cowan, and L. Giles, editors, Advances in Neural Information Process-
ing Systems (NIPS 1992), volume 5, pages 580–587. Morgan Kaufmann, San
Mateo, CA, 1993.
50. A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from in-
complete data via the EM algorithm. Journal of Royal Statistical Society, B,
39(1):1–38, 1977.
51. D. DeSieno. Adding a conscience to competitive learning. In Proceedings
of ICNN’88 (International Conference on Neural Networks), pages 117–124.
IEEE Service Center, Piscataway, NJ, 1988.
52. G. Di Battista, P. Eades, R. Tamassia, and I.G. Tollis. Algorithms for drawing
graphs: An annotated bibliography. Technical report, Brown University, June
1994.
53. E.W. Dijkstra. A note on two problems in connection with graphs. Numerische
Mathematik, 1:269–271, 1959.
54. D. Donoho and C. Grimes. When does geodesic distance recover the true
parametrization of families of articulated images? In M. Verleysen, editor,
Proceedings of ESANN 2002, 10th European Symposium on Artificial Neural
Networks, pages 199–204, Bruges, Belgium, April 2002. d-side.
55. D.L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding
techniques for high-dimensional data. In Proceedings of the National Academy
of Sciences, volume 100, pages 5591–5596, 2003.
56. D.L. Donoho and C. Grimes. Hessian eigenmaps: New locally linear techniques
for high-dimensional data. Technical Report TR03-08, Department of Statis-
tics, Stanford University, Palo Alto, CA, 2003.
57. E. Erwin, K. Obermayer, and K. Schulten. Self-organizing maps: ordering,
convergence properties and energy functions. Biological Cybernetics, 67:47–55,
1992.
58. P.A. Estévez and A.M. Chong. Geodesic nonlinear mapping using the neural
gas network. In Proceedings of IJCNN 2006. 2006. In press.
59. P.A. Estévez and C.J. Figueroa. Online data visualization using the neural gas
network. Neural Networks, 19:923–934, 2006.
60. B.S. Everitt. An Introduction to Latent Variable Models. Monographs on
Statistics and Applied Probability. Chapman & Hall, London, New York, 1984.
61. E. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability
of classifications. Biometrics, 21:768, 1965.
80. T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical
Association, 84(406):502–516, 1989.
81. X.F. He and P. Niyogi. Locality preserving projections. In S. Thrun, L. Saul,
and B. Schölkopf, editors, Advances in Neural Information Processing Systems
(NIPS 2003), volume 16. MIT Press, Cambridge, MA, 2004.
82. X.F. He, S.C. Yan, Y.X. Hu, H.G. Niyogi, and H.J. Zhang. Face recognition
using Laplacianfaces. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 27(3):328–340, 2005.
83. H.G.E. Hentschel and I. Procaccia. The infinite number of generalized dimen-
sions of fractals and strange attractors. Physica, D8:435–444, 1983.
84. J. Hérault, C. Jaussions-Picaud, and A. Guérin-Dugué. Curvilinear component
analysis for high dimensional data representation: I. Theoretical aspects and
practical use in the presence of noise. In J. Mira and J.V. Sánchez, editors,
Proceedings of IWANN’99, volume II, pages 635–644. Springer, Alicante, Spain,
June 1999.
85. M. Herrmann and H.H. Yang. Perspectives and limitations of self-organizing
maps in blind separation of source signals. In S. Amari, L. Xu, L.-W. Chan,
I. King, and K.-S. Leung, editors, Progress in Neural Information Processing,
Proceedings of ICONIP’96, volume 2, pages 1211–1216. Springer-Verlag, 1996.
86. D. Hilbert. Über die stetige Abbildung einer Linie auf ein Flächenstück. Math.
Ann., 38:459–460, 1891.
87. G. Hinton and S.T. Roweis. Stochastic neighbor embedding. In S. Becker,
S. Thrun, and K. Obermayer, editors, Advances in Neural Information Process-
ing Systems (NIPS 2002), volume 15, pages 833–840. MIT Press, Cambridge,
MA, 2003.
88. G.E. Hinton. Learning distributed representations of concepts. In Proceedings
of the Eighth Annual Conference of the Cognitive Science Society, Amherst,
MA, 1986. Reprinted in R.G.M. Morris, editor, Parallel Distributed Processing:
Implications for Psychology and Neurobiology, Oxford University Press, USA,
1990.
89. G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data
with neural networks. Science, 313(5786):504–507, July 2006.
90. G.E. Hinton and R.R. Salakhutdinov. Supporting online mate-
rial for “reducing the dimensionality of data with neural net-
works”. Science, 313(5786):504–507, July 2006. Available at
www.sciencemag.org/cgi/content/full/313/5786/502/DC1.
91. J.J. Hopfield. Neural networks and physical systems with emergent collective
computational abilities. In Proc. Natl. Acad. Sci. USA 79, pages 2554–2558.
1982.
92. H. Hotelling. Analysis of a complex of statistical variables into principal com-
ponents. Journal of Educational Psychology, 24:417–441, 1933.
93. J.R. Howlett and L.C. Jain. Radial Basis Function Networks 1: Recent Devel-
opments in Theory and Applications. Physica Verlag, Heidelberg, 2001.
94. P.J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, 1985.
95. A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis.
Wiley-Interscience, 2001.
96. A.K. Jain and D. Zongker. Feature selection: Evaluation, application and small
sample performance. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 19(2):153–158, 1997.
97. M.C. Jones and R. Sibson. What is projection pursuit? Journal of the Royal
Statistical Society, Series A, 150:1–36, 1987.
98. C. Jutten. Calcul neuromimétique et traitement du signal, analyse en com-
posantes indépendantes. PhD thesis, Institut National Polytechnique de Greno-
ble, 1987.
99. C. Jutten and J. Hérault. Space or time adaptive signal processing by neural
network models. In Neural Networks for Computing, AIP Conference Proceed-
ings, volume 151, pages 206–211. Snowbird, UT, 1986.
100. C. Jutten and J. Hérault. Blind separation of sources, part I: An adaptive
algorithm based on neuromimetic architecture. Signal Processing, 24:1–10,
1991.
101. N. Kambhatla and T.K. Leen. Dimension reduction by local principal compo-
nent analysis. Neural Computation, 9(7):1493–1516, October 1994.
102. K. Karhunen. Zur Spektraltheorie stochastischer Prozesse. Ann. Acad. Sci.
Fennicae, 34, 1946.
103. K. Kiviluoto. Topology preservation in self-organizing maps. In IEEE Neu-
ral Networks Council, editor, Proc. Int. Conf. on Neural Networks, ICNN’96,
volume 1, pages 294–299, Piscataway, NJ, 1996. Also available as technical
report A29 of the Helsinki University of Technology.
104. T. Kohonen. Self-organization of topologically correct feature maps. Biological
Cybernetics, 43:59–69, 1982.
105. T. Kohonen. Self-Organizing Maps. Springer, Heidelberg, 2nd edition, 1995.
106. A. König. Interactive visualization and analysis of hierarchical neural projec-
tions for data mining. IEEE Transactions on Neural Networks, 11(3):615–624,
2000.
107. M. Kramer. Nonlinear principal component analysis using autoassociative neu-
ral networks. AIChE Journal, 37:233, 1991.
108. J.B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a
nonmetric hypothesis. Psychometrika, 29:1–28, 1964.
109. J.B. Kruskal. Toward a practical method which helps uncover the structure of
a set of multivariate observations by finding the linear transformation which
optimizes a new index of condensation. In R.C. Milton and J.A. Nelder, editors,
Statistical Computation. Academic Press, New York, 1969.
110. J.B. Kruskal. Linear transformations of multivariate data to reveal cluster-
ing. In Multidimensional Scaling: Theory and Application in the Behavioural
Sciences, I, Theory. Seminar Press, New York and London, 1972.
111. W. Kühnel. Differential Geometry Curves – Surfaces – Manifolds. Amer. Math.
Soc., Providence, RI, 2002.
112. M. Laurent and F. Rendl. Semidefinite programming and integer programming.
Technical Report PNA-R0210, CWI, Amsterdam, April 2002.
113. M.H. C. Law, N. Zhang, and A.K. Jain. Nonlinear manifold learning for data
stream. In Proceedings of SIAM Data Mining, pages 33–44. Orlando, FL, 2004.
114. J.A. Lee, C. Archambeau, and M. Verleysen. Locally linear embedding versus
Isotop. In M. Verleysen, editor, Proceedings of ESANN 2003, 11th European
Symposium on Artificial Neural Networks, pages 527–534. d-side, Bruges, Bel-
gium, April 2003.
115. J.A. Lee, C. Jutten, and M. Verleysen. Non-linear ICA by using isometric
dimensionality reduction. In C.G. Puntonet and A. Prieto, editors, Independent
Component Analysis and Blind Signal Separation, Lecture Notes in Computer
Science, pages 710–717, Granada, Spain, 2004. Springer-Verlag.
133. B.B. Mandelbrot. Les objets fractals: forme, hasard et dimension. Flammarion,
Paris, 1984.
134. J. Mao and A.K. Jain. Artificial neural networks for feature extraction and mul-
tivariate data projection. IEEE Transactions on Neural Networks, 6(2):296–
317, March 1995.
135. T. Martinetz and K. Schulten. A “neural-gas” network learns topologies. In
T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural
Networks, volume 1, pages 397–402. Elsevier, Amsterdam, 1991.
136. T. Martinetz and K. Schulten. Topology representing networks. Neural Net-
works, 7(3):507–522, 1994.
137. W. McCulloch and W. Pitts. A logical calculus of ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
138. M. Minsky and S. Papert. Perceptrons: An Introduction to Computational
Geometry. MIT Press, Cambridge, MA, 1969.
139. L.C. Molina, L. Belanche, and À. Nebot. Feature selection algorithms: A sur-
vey and experimental evaluation. In Proceedings of 2002 IEEE International
Conference on Data Mining (ICDM’02), pages 306–313. December 2002. Also
available as technical report LSI-02-62-R at the Departament de Lleguatges i
Sistemes Informàtics of the Universitat Politècnica de Catalunya, Spain.
140. J.R. Munkres. Topology: A First Course. Prentice-Hall, Englewood Cliffs, NJ,
1975.
141. B. Nadler, S. Lafon, R.R. Coifman, and I.G. Kevrekidis. Diffusion maps,
spectral clustering and eigenfunctions of Fokker-Planck operators. In Y. Weiss,
B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing
Systems (NIPS 2005), volume 18. MIT Press, Cambridge, MA, 2006.
142. R.M. Neal. Bayesian Learning for Neural Networks. Springer Series in Statis-
tics. Springer-Verlag, Berlin, 1996.
143. A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: analysis and an
algorithm. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances
in Neural Information Processing Systems (NIPS 2001), volume 14. MIT Press,
Cambridge, MA, 2002.
144. E. Oja. Data compression, feature extraction, and autoassociation in feedfor-
ward neural networks. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas,
editors, Artificial Neural Networks, volume 1, pages 737–745. Elsevier Science
Publishers, B.V., North-Holland, 1991.
145. E. Oja. Principal components, minor components, and linear neural networks.
Neural Networks, 5:927–935, 1992.
146. E. Ott. Measure and spectrum of dq dimensions. In Chaos in Dynamical
Systems, pages 78–81. Cambridge University Press, New York, 1993.
147. P. Pajunen. Nonlinear independent component analysis by self-organizing
maps. In C. von der Malsburg, W. von Seelen, J.C. Vorbruggen, and B. Send-
hoff, editors, Artificial Neural Networks, Proceedings of ICANN’96, pages 815–
820. Springer-Verlag, Bochum, Germany, 1996.
148. E. Pȩkalska, D. de Ridder, R.P.W. Duin, and M.A. Kraaijveld. A new method
of generalizing Sammon mapping with application to algorithm speed-up. In
M. Boasson, J.A. Kaandorp, J.F.M. Tonino, and M.G. Vosselman, editors,
Proceedings of ASCI’99, 5th Annual Conference of the Advanced School for
Computing and Imaging, pages 221–228. ASCI, Delft, The Netherlands, June
1999.
Index
spectral, 107, 120, 126, 150, 159, 164, hard vs. soft, 37, 38
236, 248, see also Eigenvalue linear, 40, see also LDR, 231, 232
Decorrelation, 24, 28, 226, 231 nonlinear, 40, see also NLDR, 88,
Definition, 12, 13, 47–49, 51, 53, 69, 70, 133, 136, 165, 231–233, 268
72, 82, 95, 97, 104, 138, 160, 171, Disparity, 81
191, 193, 247–249, 266 Dissimilarity, 73, 81
Delaunay triangulation, 269, 275 Distance, 54, 55, 63, 79, 127, 233, 266,
Dependencies, 10, 22, 26, 34 271, 279, 280
Derivative, 53, 260 city-block, 71
numerical, 56, 57, 279 commute-time, 164, see also CTD,
partial, 72, 83, 147, 260 181, 235, 237
second, 84 computation, 280
Descent Euclidean, 70, 72, 76, 87, 98, 99, 111,
gradient, 83, 87, 145, 147, 233, 261 113, 126, 127, 138, 227, 230, 233,
classical, 262, 266 234, 239, 240, 255, 280, 282
stochastic, 23, 44, 150, 233, 236, all-pairs, 281
238, 241, 261, 262, 266 single-pair, 281
steepest, 261 single-source, 281
Diagonalization, 31 function, 70, 71, 133
Diffeomorphism, 13, 104 geodesic, 99, 102–104, 106, 110, 111,
Dijkstra algorithm, 101–103, 107, 108, 113, 227, 242, 269, 280
113, 227, 281, 282 graph, 101–103, 106, 107, 109,
Dimension 111–114, 126, 127, 131, 167, 168,
q-, 49 197, 207, 222, 227, 230, 234, 235,
box-counting, 51, 277 239, 240, 242, 269, 280–282
capacity, 49, 51, 53–55, 59, 277 interpoint, 73
correlation, 49, 53–55, 57, 59, 60, 63, Mahalanobis, 72, 231
66, 67 Manhattan, 71
fractal, 48, 49, 229 matrix, 87, 111, 239
inequalities, 54 maximum, 71
information, 49, 52, 53, 55 Minkowski, 71
intrinsic, 242 pairwise, 24, 45, 55, 69, 73, 80, 81, 87,
Lebesgue covering, 47 97, 126, 127, 129, 168, 231, 233,
spectrum, 49 234, 277, 280, 282
target, 243, 244 preservation, 16, 24, 45, 69, 70, 81,
topological, 47, 48 83, 86, 88, 91, 95–99, 113, 126,
Dimensionality, 15, 35, 239, 263, 279 133, 227, 233, 235, 236, 244, 245
curse of, 3, 6, 226, 243, 246 local, 186
data, 243 strict, 126, 189
embedding, 107 weighted, 94, 131
estimation, 18, 30, 37, 41, 60, 106, spatial, 70
150, 226, 229, 277 Distortion, 235
intrinsic, 18, 19, 30, 47, 60, 62, 67, Distribution, 7, 59, 142, 167, 175, 194
106, 109, 150, 229, 230, 243, 244, Gaussian, 7, 8, 25, 31, 33, 168, 252
246, 277 isotropic, 254
reduction, 2, 11, 13, 18, 20, 69, 94, multidimensional, 253
97, 125, 191, 225–227, 229, 233, joint, 11
234, 245, 263, 280 Laplacian, 252
hard, 231, 243 multivariate, 254
Gram, see Matrix ICA, 11, see also Analysis, 24, 228, 232,
Graph, 12, 100, 103, 112, 126, 130, 134, 234, 257
152, 160, 162, 191, 207, 216, 217, Image, 215
227, 228, 238, 239 face, 206
adjacency, 244 number of, 207
building, 111, 165, 168, 269, 272, 280 processing, 2
connected, 102, 282 Implementation, 77, 280
directed, 100 Independence, 2, 26, 226, 232, 252
distance, see also Distance statistical, 8, 21, 24, 36, 74, 229, 232,
edge, 100, 103, 107, 134, 152, 159, 254
239, 272, 274, 281 Inequality, 71, 133
parasitic, 185 triangular, 70, 101, 115
edge-labeled, 100 Inference, 238
edge-weighted, 100 Information, 3, 10, 24, 70, 80, 88, 225,
Euclidean, 100 229, 243–245, 272, 274
Laplacian, 159, 163, 164 dimension, see also Dimension
partition, 165 local, 158
path, 100, 102 ordinal, 81
theory, 101, 159, 238 theory, 53
undirected, 100, 126, 129, 159, 270, Initialization, 84, 93, 101, 102, 116, 117,
281, 282 139, 148, 149, 166–168, 242, 265,
vertex, 100–103, 111, 118, 134, 135, 267
152, 159, 166–169, 272, 274, 281, Implementation, 277
282 Invariance, 153
closest, 168 ISODATA, 228, 265
vertex-labeled, 100 Isomap, 97, 102–104, 106–114, 118, 120,
weighted, 101, 107, 134, 281, 282 125, 126, 131, 157–159, 163, 165,
Grid, 6, 49, 52, 136, 137, 146, 148–151, 166, 173–176, 178, 182, 186, 189,
170, 174, 176, 203, 222, 234, 277 190, 199, 206, 207, 209, 214, 215,
ε-, 51
239, 242
growing, 142
Isometry, 104, 130, 131, 191, 237
hexagonal, 135, 216
local, 126, 127, 131
rectangular, 142
Isotop, 142, 165–171, 176, 181, 187, 189,
shape, 143
193, 203, 206, 209, 214, 215, 220,
GTM, 142, see also Mapping, 143, 146,
222, 227, 228, 241, 245
148, 152, 228, 229, 237, 238, 245
Iteration, 48, 52, 55, 83, 84, 92, 139,
Guideline, 242, 243
233, 241, 259, 262
number of, 84, 92, 113, 130, 148, 151,
Heap, 282 168, 176, 203, 241
Hexagon, 135, 138, 139
Hierarchy, 233 Japanese flag, 178, 180–182
Histogram, 278
bin, 275, 278 K-means, 119, 228, 243, 265–267
rule, 274, 275 K-rule, 100, 101, 107, 194, 207, 270,
History, 69, 73, 88, 171, 228 271, 280
HLLE, 159, see also LLE, 228 Kernel, 70, 122, 151, 156, 160, 162, 171,
Hole, 106, 131, 175, 178, 180, 217 176, 236, 237, 239
Homeomorphism, 12 function, 125, 126, 130, 239, 240, 244
Gaussian, 123–125, 167, 169, 171, Literature, 49, 55, 81, 109, 111, 129,
234, 256 142, 220, 229, 236, 239, 240, 242,
heat, 160, 162–164, 234 245
learning, 126 LLE, 152, see also Embedding, 157,
matrix, 239 158, 163, 166, 176, 190, 193, 214,
optimization, 241 216, 228, 238, 239, 242, 245
PCA, 120, see also KPCA, 228 Hessian, 159, 228
polynomial, 123 LMDS, 228, see also MDS
trick, 123, 125 Locality preserving projection, 165
type, 236 Locally linear embedding, 227
width, 124 log-log plot, 55, 56, 58, 279
Knot, 12, 193 Loop, 196, 199, 262
trefoil, 13, 193, 196 essential, 193, 199, 242, 244
Koch’s island, 48, 55, 57 LPCA, 158, see also PCA
KPCA, 120, see also Kernel, 122–126, LPP, 165, see also Projection, 228
131, 157, 165, 199, 228, 236, 237,
239, 244, 245 Magic factor, 83
Kurtosis, 31 Magnetic resonance imaging, 199
excess, 252 Manifold, 12, 14, 19, 48, 54, 57, 79, 103,
124, 133, 233, 242, 244, 269, 271,
Lagrange multiplier, 154 280
Landmark, 109, 111, 131 assumption, 242
Laplace-Beltrami operator, 162–164 boundary, 12
Laplacian eigenmaps, 159, see also LE, convex, 131, 242
162, 165, 228 curved, 245
Laplacianfaces, 165 developable, 104–106, 109, 110, 114,
Lattice, 115, 134–138, 152, 237 115, 119, 174, 177, 189, 191, 196,
data-driven, 134, 142, 152, 187, 193, 240, 242
220, 234, 237 dimension, 13
predefined, 134, 135, 152, 187, 203, disconnected, 242
222, 234 nonconvex, 131, 180
LBG, 228, 265, 267 nondevelopable, 106, 110, 113, 115,
LDR, 40, see also Dimensionality 188–190, 197, 203, 240
LE, 159, see also Laplacian eigenmaps, nonlinear, 42, 60, 67, 69, 86, 109, 110,
163, 165, 166, 176, 228, 238, 239, 119, 120, 125, 153, 157, 180
242, 245 shape, 169
Learning, 3, 168 smooth, 13, 49, 102, 126, 162, 163,
Bayesian, 144 166
competitive, 138, 166, 175, 228, 266, underlying, 14, 19, 66, 93, 95, 100,
267 119, 120, 127, 152, 159, 162, 166,
frequentist, 143, 144 185, 193, 220, 229, 237, 240, 242,
machine, 238 245, 246
rate, 23, 44, 92, 95, 138, 139, 168, MAP, 144
241, 261, 262, 266 Mapping, 14, 20, 23, 41, 82, 120, 122,
supervised, 10, 232 123, 171
unsupervised, 87, 232, 233 conformal, 152
Likelihood, 143–148, 150 continuous, 82
maximum, 237 discrete, 87
Lindenmayer system, 50, 55 explicit vs. implicit, 37, 41