Information Science and Statistics

Series Editors:
M. Jordan
J. Kleinberg
B. Schölkopf
Information Science and Statistics
Akaike and Kitagawa: The Practice of Time Series Analysis.
Bishop: Pattern Recognition and Machine Learning.
Cowell, Dawid, Lauritzen, and Spiegelhalter: Probabilistic Networks and Expert
Systems.
Doucet, de Freitas, and Gordon: Sequential Monte Carlo Methods in Practice.
Fine: Feedforward Neural Network Methodology.
Hawkins and Olwell: Cumulative Sum Charts and Charting for Quality
Improvement.
Jensen and Nielsen: Bayesian Networks and Decision Graphs, Second Edition.
Lee and Verleysen: Nonlinear Dimensionality Reduction.
Marchette: Computer Intrusion Detection and Network Monitoring: A Statistical
Viewpoint.
Rissanen: Information and Complexity in Statistical Modeling.
Rubinstein and Kroese: The Cross-Entropy Method: A Unified Approach to
Combinatorial Optimization, Monte Carlo Simulation, and Machine Learning.
Studený: Probabilistic Conditional Independence Structures.
Vapnik: The Nature of Statistical Learning Theory, Second Edition.
Wallace: Statistical and Inductive Inference by Minimum Message Length.
John A. Lee Michel Verleysen

Nonlinear Dimensionality
Reduction
John Lee
Molecular Imaging and Experimental Radiotherapy
Université catholique de Louvain
Avenue Hippocrate 54/69
B-1200 Bruxelles
Belgium
john.lee@uclouvain.be

Michel Verleysen
Machine Learning Group – DICE
Université catholique de Louvain
Place du Levant 3
B-1348 Louvain-la-Neuve
Belgium
michel.verleysen@uclouvain.be

Series Editors:

Michael Jordan
Division of Computer Science and Department of Statistics
University of California, Berkeley
Berkeley, CA 94720
USA

Jon Kleinberg
Department of Computer Science
Cornell University
Ithaca, NY 14853
USA

Bernhard Schölkopf
Max Planck Institute for Biological Cybernetics
Spemannstrasse 38
72076 Tübingen
Germany

Library of Congress Control Number: 2006939149

ISBN-13: 978-0-387-39350-6 e-ISBN-13: 978-0-387-39351-3

Printed on acid-free paper.

© 2007 Springer Science+Business Media, LLC


All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer Science + Business Media, LLC, 233 Spring Street,
New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and retrieval, electronic
adaptation, computer software, or by similar or dissimilar methodology now known or hereafter
developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of opinion as to whether or
not they are subject to proprietary rights.

9 8 7 6 5 4 3 2 1

springer.com
To our families
Preface

Methods of dimensionality reduction are innovative and important tools in the


fields of data analysis, data mining, and machine learning. They provide a way
to understand and visualize the structure of complex data sets. Traditional
methods like principal component analysis and classical metric multidimen-
sional scaling suffer from being based on linear models. Until recently, very
few methods were able to reduce the data dimensionality in a nonlinear way.
However, since the late 1990s, many new methods have been developed and
nonlinear dimensionality reduction, also called manifold learning, has become
a hot topic. New advances that account for this rapid growth are, for ex-
ample, the use of graphs to represent the manifold topology, and the use of
new metrics like the geodesic distance. In addition, new optimization schemes,
based on kernel techniques and spectral decomposition, have led to spectral
embedding, which encompasses many of the recently developed methods.
This book describes existing and advanced methods to reduce the dimen-
sionality of numerical databases. For each method, the description starts from
intuitive ideas, develops the necessary mathematical details, and ends by out-
lining the algorithmic implementation. Methods are compared with each other
with the help of different illustrative examples.
The purpose of the book is to summarize clear facts and ideas about
well-known methods as well as recent developments in the topic of nonlinear
dimensionality reduction. With this goal in mind, methods are all described
from a unifying point of view, in order to highlight their respective strengths
and shortcomings.
The book is primarily intended for statisticians, computer scientists, and
data analysts. It is also accessible to other practitioners having a basic back-
ground in statistics and/or computational learning, such as psychologists (in
psychometry) and economists.

Louvain-la-Neuve, Belgium John A. Lee


October 2006 Michel Verleysen
Contents

Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XV

Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .XVII

1 High-Dimensional Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Practical motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Fields of application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 The goals to be reached . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Theoretical motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 How can we visualize high-dimensional spaces? . . . . . . . . 4
1.2.2 Curse of dimensionality and empty space phenomenon . 6
1.3 Some directions to be explored . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 Relevance of the variables . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.2 Dependencies between the variables . . . . . . . . . . . . . . . . . . 10
1.4 About topology, spaces, and manifolds . . . . . . . . . . . . . . . . . . . . . 11
1.5 Two benchmark manifolds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6 Overview of the next chapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Characteristics of an Analysis Method . . . . . . . . . . . . . . . . . . . . . 17


2.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Expected functionalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Estimation of the number of latent variables . . . . . . . . . . 18
2.2.2 Embedding for dimensionality reduction . . . . . . . . . . . . . . 19
2.2.3 Embedding for latent variable separation . . . . . . . . . . . . . 20
2.3 Internal characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Underlying model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.3 Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Example: Principal component analysis . . . . . . . . . . . . . . . . . . . . 24
2.4.1 Data model of PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.2 Criteria leading to PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.4.3 Functionalities of PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29


2.4.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.5 Examples and limitations of PCA . . . . . . . . . . . . . . . . . . . 33
2.5 Toward a categorization of DR methods . . . . . . . . . . . . . . . . . . . . 37
2.5.1 Hard vs. soft dimensionality reduction . . . . . . . . . . . . . . . 38
2.5.2 Traditional vs. generative model . . . . . . . . . . . . . . . . . . . . . 39
2.5.3 Linear vs. nonlinear model . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.4 Continuous vs. discrete model . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.5 Implicit vs. explicit mapping . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.6 Integrated vs. external estimation of the dimensionality 41
2.5.7 Layered vs. standalone embeddings . . . . . . . . . . . . . . . . . . 42
2.5.8 Single vs. multiple coordinate systems . . . . . . . . . . . . . . . . 42
2.5.9 Optional vs. mandatory vector quantization . . . . . . . . . . 43
2.5.10 Batch vs. online algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.11 Exact vs. approximate optimization . . . . . . . . . . . . . . . . . . 44
2.5.12 The type of criterion to be optimized . . . . . . . . . . . . . . . . 44

3 Estimation of the Intrinsic Dimension . . . . . . . . . . . . . . . . . . . . . 47


3.1 Definition of the intrinsic dimension . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Fractal dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.1 The q-dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.2 Capacity dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.3 Information dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2.4 Correlation dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.5 Some inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2.6 Practical estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 Other dimension estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3.1 Local methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3.2 Trial and error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4.2 PCA estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4.3 Correlation dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4.4 Local PCA estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4.5 Trial and error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4.6 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4 Distance Preservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.1 State-of-the-art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Spatial distances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.1 Metric space, distances, norms and scalar product . . . . . 70
4.2.2 Multidimensional scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2.3 Sammon’s nonlinear mapping . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.4 Curvilinear component analysis . . . . . . . . . . . . . . . . . . . . . 88
4.3 Graph distances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3.1 Geodesic distance and graph distance . . . . . . . . . . . . . . . . 97

4.3.2 Isomap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102


4.3.3 Geodesic NLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.3.4 Curvilinear distance analysis . . . . . . . . . . . . . . . . . . . . . . . . 114
4.4 Other distances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.4.1 Kernel PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.4.2 Semidefinite embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

5 Topology Preservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133


5.1 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.2 Predefined lattice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.2.1 Self-Organizing Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.2.2 Generative Topographic Mapping . . . . . . . . . . . . . . . . . . . 143
5.3 Data-driven lattice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.3.1 Locally linear embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.3.2 Laplacian eigenmaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.3.3 Isotop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

6 Method comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173


6.1 Toy examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.1.1 The Swiss roll . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.1.2 Manifolds having essential loops or spheres . . . . . . . . . . . 193
6.2 Cortex unfolding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
6.3 Image processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
6.3.1 Artificial faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.3.2 Real faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
7.1 Summary of the book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
7.1.1 The problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
7.1.2 A basic solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
7.1.3 Dimensionality reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
7.1.4 Latent variable separation . . . . . . . . . . . . . . . . . . . . . . . . . . 228
7.1.5 Intrinsic dimensionality estimation . . . . . . . . . . . . . . . . . . . 229
7.2 Data flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
7.2.1 Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
7.2.2 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
7.2.3 Linear dimensionality reduction . . . . . . . . . . . . . . . . . . . . . 231
7.2.4 Nonlinear dimensionality reduction . . . . . . . . . . . . . . . . . . 231
7.2.5 Latent variable separation . . . . . . . . . . . . . . . . . . . . . . . . . . 232
7.2.6 Further processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
7.3 Model complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
7.4 Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
7.4.1 Distance preservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
7.4.2 Topology preservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
7.5 Spectral methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
7.6 Nonspectral methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

7.7 Tentative methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242


7.8 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245

A Matrix Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247


A.1 Singular value decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
A.2 Eigenvalue decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
A.3 Square root of a square matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

B Gaussian Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251


B.1 One-dimensional Gaussian distribution . . . . . . . . . . . . . . . . . . . . . 251
B.2 Multidimensional Gaussian distribution . . . . . . . . . . . . . . . . . . . . 253
B.2.1 Uncorrelated Gaussian variables . . . . . . . . . . . . . . . . . . . . . 254
B.2.2 Isotropic multivariate Gaussian distribution . . . . . . . . . . . 254
B.2.3 Linearly mixed Gaussian variables . . . . . . . . . . . . . . . . . . . 256

C Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
C.1 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
C.1.1 Finding extrema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
C.1.2 Multivariate version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
C.2 Gradient ascent/descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
C.2.1 Stochastic gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . 261

D Vector quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263


D.1 Classical techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
D.2 Competitive learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
D.3 Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
D.4 Initialization and “dead units” . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267

E Graph Building . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269


E.1 Without vector quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
E.1.1 K-rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
E.1.2 ε-rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
E.1.3 τ -rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
E.2 With vector quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
E.2.1 Data rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
E.2.2 Histogram rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274

F Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277


F.1 Dimension estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
F.1.1 Capacity dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
F.1.2 Correlation dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
F.2 Computation of the closest point(s) . . . . . . . . . . . . . . . . . . . . . . . . 279
F.3 Graph distances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Notations

N The set of positive natural numbers: {0, 1, 2, 3, . . .}


R The set of real numbers
y, x Known or unknown random variables taking their values in R
A A matrix
ai,j An entry of the matrix A
(located at the crossing of the ith row and the jth column)
N Number of points in the data set
M Number of prototypes in the codebook C
D Dimensionality of the data space (which is usually RD )
P Dimensionality of the latent space (which is usually RP )
(or its estimation as the intrinsic dimension of the data)
ID D-dimensional identity matrix
IP ×D Rectangular matrix containing the first P rows of ID
1N N -dimensional column vector containing ones everywhere
y Random vector in the known data space: y = [y1 , . . . , yd , . . . , yD ]T
x Random vector in the unknown latent space: x = [x1 , . . . , xp , . . . , xP ]T
y(i) The ith vector of the data set
x(i) (Unknown) latent vector that generated y(i)
x̂(i) The estimate of x(i)
Y The data set Y = {. . . , y(i), . . .}1≤i≤N
X The (unknown) set of latent vectors that generated Y
X̂ Estimation of X
Y The data set in matrix notation: Y = [. . . , y(i), . . .]1≤i≤N
X The (unknown) ordered set of latent vectors that generated Y
X̂ Estimation of X

M A manifold (noted as a set)


m The functional notation of M: y = m(x)
Ex {x} The expectation of the random variable x
μx (x) The mean value of the random variable x
(computed with its known values x(i), i = 1, . . . , N )
μi The ith-order centered moment
μ′i The ith-order raw moment
Cxy The covariance matrix between the random vectors x and y
Ĉxy The estimate of the covariance matrix
f (x), f (x) Uni- or multivariate function of the random vector x
∂f(x)/∂xp Partial derivative of f with respect to xp
∇x f (x) Gradient vector of f with respect to x
Hx f (x) Hessian matrix of f with respect to x
Jx f (x) Jacobian matrix of f with respect to x
y(i) · y(j) Scalar product between the two vectors y(i) and y(j)
d(y(i), y(j)) Distance function between the two vectors y(i) and y(j)
(often a spatial distance, like the Euclidean one)
shortened as dy (i, j) or dy when the context is clear
δ(y(i), y(j)) Geodesic or graph distance between y(i) and y(j)
C, G Codebook (noted as a set) in the data and latent spaces
C, G Codebook (noted as a matrix) in the data and latent spaces
c(r), g(r) Coordinates of the rth prototypes in the codebook
(respectively, in the data and latent spaces)
Acronyms

DR Dimensionality reduction
LDR Linear dimensionality reduction
NLDR Nonlinear dimensionality reduction

ANN Artificial neural networks


EVD Eigenvalue decomposition
SVD Singular value decomposition
SVM Support vector machines
VQ Vector quantization

CCA Curvilinear component analysis NLDR method


CDA Curvilinear distance analysis NLDR method
EM Expectation-maximization optimization technique
GTM Generative topographic mapping NLDR method
HLLE Hessian LLE (see LLE) NLDR method
KPCA Kernel PCA (see PCA) NLDR method
LE Laplacian eigenmaps NLDR method
LLE Locally linear embedding NLDR method
MDS Multidimensional scaling LDR/NLDR method
MLP Multilayer perceptron ANN for function approx.
MVU Maximum variance unfolding (see SDE) NLDR method
NLM (Sammon’s) nonlinear mapping NLDR method
PCA Principal component analysis LDR method
RBFN Radial basis function network ANN for function approx.
SDE Semidefinite embedding NLDR method
SDP Semidefinite programming optimization technique
SNE Stochastic neighbor embedding NLDR method
SOM (Kohonen’s) self-organizing map NLDR method
TRN Topology-representing network ANN
1 High-Dimensional Data

Overview. This chapter introduces the difficulties raised by the anal-


ysis of high-dimensional data and motivates the use of appropriate
methods. Both practical and theoretical motivations are given. The
former mainly reflect the need to solve real-life problems,
which naturally involve high-dimensional feature vectors. Image pro-
cessing is a typical example. On the other hand, theoretical motiva-
tions relate to the study of high-dimensional spaces and distributions.
Their properties prove unexpected and completely differ from what
is usually observed in low-dimensional spaces. The empty space phe-
nomenon and other strange behaviors are typical examples of the so-
called curse of dimensionality. Similarly, the problem of data visual-
ization is briefly dealt with. Regarding dimensionality reduction, the
chapter gives two directions to explore: the relevance of variables and
the dependencies that bind them. This chapter also introduces the
theoretical concepts and definitions (topology, manifolds, etc.) that
are typically used in the field of nonlinear dimensionality reduction.
Next, a brief section presents two simple manifolds that will be used
to illustrate how the different methods work. Finally, the chapter ends
with an overview of the following chapters.

1.1 Practical motivations


By essence, the world is multidimensional. To persuade yourself, just look
at human beings, bees, ants, neurons, or, in the field of technology, computer
networks, sensor arrays, etc. In most cases, combining a large number of simple
and existing units allows us to perform a great variety of complex tasks. This
solution is cheaper than creating or designing a specific device and is also
more robust: the loss or malfunction of a few units does not impair the whole
system. This nice property can be explained by the fact that units are often

partially redundant. Units that come to failure can be replaced with others
that achieve the same or a similar task.
Redundancy means that parameters or features that could characterize
the set of various units are not independent from each other. Consequently,
the efficient management or understanding of all units requires taking the
redundancy into account. The large set of parameters or features must be
summarized into a smaller set, with no or less redundancy. This is the goal
of dimensionality reduction (DR), which is one of the key tools for analyzing
high-dimensional data.

1.1.1 Fields of application

The following paragraphs present some fields of technology or science where


high-dimensional data are typically encountered.

Processing of sensor arrays

These terms encompass all applications using a set of several identical sen-
sors. Arrays of antennas (e.g., in radiotelescopes) are the best example. But
to this class also belong numerous biomedical applications, such as electrocar-
diogram or electroencephalogram acquisition, where several electrodes record
time signals at different places on the chest or the scalp. The same configura-
tion is found again in seismography and weather forecasting, for which several
stations or satellites deliver data. The problem of geographic positioning us-
ing satellites (as in the GPS or Galileo system) may be cast within the same
framework too.

Image processing

Let’s consider a picture as the output of a digital camera; then its processing
reduces to the processing of a sensor array, like the well-known photosensitive
CCD or CMOS sensors used in digital photography. However, image process-
ing is often seen as a standalone domain, mainly because vision is a very
specific task that holds a privileged place in information science.

Multivariate data analysis

In contrast with sensor arrays or pixel arrays, multivariate data analysis rather
focuses on the analysis of measures that are related to each other but come
from different types of sensors. An obvious example is a car, wherein the gear-
box connecting the engine to the wheels has to take into account information
from rotation sensors (wheels and engine shaft), force sensors (brake and gas
pedals), position sensors (gearbox stick, steering wheel), temperature sensors
(to prevent engine overheating or to detect glaze), and so forth. Such a sit-
uation can also occur in psychosociology: a poll often gathers questions for
which the answers are from different types (true/false, percentage, weight,
age, etc.).

Data mining

At first sight, data mining seems to be very close to multivariate data analy-
sis. However, the former has a broader scope of applications than the latter,
which is a classical subdomain of statistics. Data mining can deal with more
exotic data structures than arrays of numbers. For example, data mining en-
compasses text mining. The analysis of large sets of text documents aims,
for instance, at detecting similarities between texts, like common vocabulary,
same topic, etc. If these texts are Internet pages, hyperlinks can be encoded
in graph structures and analyzed using tools like graph embedding. Cross
references in databases can be analyzed in the same way.

1.1.2 The goals to be reached

Understanding large amounts of multidimensional data requires extracting


information out of them. Otherwise, data are useless. For example, in elec-
troencephalography, neurologists are interested in finding among numerous
electrodes the signals coming from well-specified regions of the brain. When
automatically processing images, computers should be able to detect and es-
timate the movement of objects in the scene. In a car with an automatic
gearbox, the on-board computer must be able to select the most appropriate
gear ratio according to data from the car sensors.
In all these examples, computers have to help the user to discover and
extract information that lies hidden in the huge quantity of data. Information
discovery amounts to detecting which variables are relevant and how variables
interact with each other. Information extraction then consists of reformulating
data, using less variables. Doing so may considerably simplify any further
processing of data, whether it is manual, visual, or even automated. In other
words, information discovery and extraction help to
• Understand and classify the existing data (by using a “data set” or “learn-
ing set”), i.e., assign a class, a color, a rank, or a number to each data
sample.
• Infer and generalize to new data (by using a “test set” or “validation set”),
i.e., get a continuous representation of the data, so that the unknown class,
color, rank, or number of new data items can be determined, too.

1.2 Theoretical motivations


From a theoretical point of view, all difficulties that occur when dealing with
high-dimensional data are often referred to as the “curse of dimensionality”.
When the data dimensionality grows, the good and well-known properties of
the usual 2D or 3D Euclidean spaces make way for strange and annoying
phenomena. The following two subsections highlight two of these phenomena.

1.2.1 How can we visualize high-dimensional spaces?

Visualization is a task that regards mainly two classes of data: spatial and tem-
poral. In the latter case, the analysis may resort to the additional information
given by the location in time.

Spatial data

Quite obviously, a high dimensionality makes the visualization of objects


rather uneasy. Drawing one- or two-dimensional objects on a sheet of paper
seems very straightforward, even for children. Things become harder when
three-dimensional objects have to be represented. The knowledge of perspective,
and its correct mastering, are still recent discoveries (paintings before the
Renaissance are not very different from Egyptian papyri!). Even with to-
day’s technology, a smooth, dynamic, and realistic representation of our three-
dimensional world on a computer screen requires highly specialized chips. On
the other hand, three-dimensional objects can also be sculptured or carved.
To replace the chisel and the hammer, computer representations of 3D ob-
jects can be materialized in a polymer bath: a laser beam solidifies the object
at its surface, layer by layer.
But what happens when more than three dimensions must be taken into
account? In this case, the computer screen and the sheet of paper, with only
two dimensions, become very limited. Nevertheless, several techniques exist:
they use colors or multiple linear projections. Unfortunately, none of these
techniques is very intuitive, and they are often suited only for 4D objects. As an
example, Fig. 1.1 shows a 4D cube that has been projected onto a plane in a
linear way; the color indicates the depth. Regardless of the projection method,
it is important to remark that the human eye attempts
to understand high-dimensional objects in the same way as 3D objects: it
seeks distances from one point to another, tries to distinguish what is far and
what is close, and follows discontinuities like edges, corners, and so on. Ob-
viously, objects are understood by identifying the relationships between their
constituting parts.
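As an illustration of the kind of linear projection used to produce a figure like Fig. 1.1, the short sketch below projects the 2^4 = 16 corners of a four-dimensional cube onto a plane and keeps the fourth coordinate aside as a "depth" value. It is only a minimal sketch: the projection matrix is an arbitrary choice made here for illustration, not the one actually used for the figure.

```python
import numpy as np
from itertools import product

# The 16 corners of the hypercube [-1, +1]^4.
vertices = np.array(list(product([-1.0, 1.0], repeat=4)))

# An arbitrary linear projection onto a plane: two orthonormal
# directions in R^4 obtained from a QR decomposition.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 2)))
xy = vertices @ Q            # planar coordinates of the corners

# The fourth coordinate can be rendered as a color to suggest "depth".
depth = vertices[:, 3]
for (u, v), d in zip(xy, depth):
    print(f"({u:+.2f}, {v:+.2f})  depth = {d:+.0f}")
```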

Temporal data

When it is known that data are observed in the course of time, an additional
piece of information is available. As a consequence, the above-mentioned ge-
ometrical representation is no longer unique. Instead of visualizing all di-
mensions simultaneously in the same coordinate system, one can draw the
evolution of each variable as a function of time. For example, in Fig. 1.2, the
same data set is displayed “spatially” in the first plot, and “temporally” in
the second one: the time structure of data is revealed by the temporal rep-
resentation only. In contrast with the spatial representation, the temporal
representation easily generalizes to more than three dimensions. Nevertheless,

Fig. 1.1. Two-dimensional representation of a four-dimensional cube. In addition


to perspective, the color indicates the depth in the fourth dimension.


Fig. 1.2. Two plots of the same temporal data. In the first representation, data
are displayed in a single coordinate system (spatial representation). In the second
representation, each variable is plotted in its own coordinate system, with time as
the abscissa (time representation).

when dimensionality increases, it becomes harder and harder to perceive the


similarities and dissimilarities between the different variables: the eye is con-
tinually jumping from one variable to another, and finally gets lost! In such a
case, a representation of data using a smaller set of variables is welcome, as
for spatial data. This makes the user’s perception easier, especially if each of
these variables concentrates on a particular aspect of data. A compact repre-
sentation that avoids redundancy while remaining trustworthy proves to be
the most appealing.

1.2.2 Curse of dimensionality and empty space phenomenon

The colorful term “curse of dimensionality” was apparently first coined by


Bellman [14] in connection with the difficulty of optimization by exhaustive
enumeration on product spaces. Bellman underlines the fact that considering
a Cartesian grid of spacing 1/10 on the unit cube in 10 dimensions, the num-
ber of points equals 10^10; for a 20-dimensional cube, the number of points
further increases to 10^20. Accordingly, Bellman's interpretation is the follow-
ing: if the goal consists of optimizing a function over a continuous domain of
a few dozen variables by exhaustively searching a discrete search space de-
fined by a crude discretization, one could easily be faced with the problem
of making tens of trillions of evaluations of the function. In other words, the
curse of dimensionality also refers to the fact that in the absence of simplify-
ing assumptions, the number of data samples required to estimate a function
of several variables to a given accuracy (i.e., to get a reasonably low-variance
estimate) on a given domain grows exponentially with the number of dimen-
sions. This fact, responsible for the curse of dimensionality, is often called
the “empty space phenomenon” [170]. Because the amount of available data
is generally restricted to a few observations, high-dimensional spaces are in-
herently sparse. More concretely, the curse of dimensionality and the empty
space phenomenon give unexpected properties to high-dimensional spaces, as
illustrated by the following subsections, which are largely inspired by Chapter
1 of [169].

Hypervolume of cubes and spheres

In a D-dimensional space, a sphere and the corresponding circumscribed cube


(all edges equal the sphere diameter) lead to the following volume formulas:

V_{\text{sphere}}(r) = \frac{\pi^{D/2} r^D}{\Gamma(1 + D/2)} ,    (1.1)

V_{\text{cube}}(r) = (2r)^D ,    (1.2)

where r is the radius of the sphere. Surprisingly, the ratio Vsphere /Vcube tends
to zero when D increases:

\lim_{D \to \infty} \frac{V_{\text{sphere}}(r)}{V_{\text{cube}}(r)} = 0 .    (1.3)
Intuitively, this means that as dimensionality increases, a cube becomes more
and more spiky, like a sea urchin: the spherical body gets smaller and smaller
while the number of spikes increases, the latter occupying almost all the avail-
able volume. Now, assigning the value 1/2 to r, Vcube equals 1, leading to

\lim_{D \to \infty} V_{\text{sphere}}(r) = 0 .    (1.4)

This indicates that the volume of a sphere vanishes when dimensionality in-
creases!
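A few lines of code make the collapse of the ratio in Eq. (1.3) tangible. The sketch below works in log-space to avoid overflowing Γ(1 + D/2); the chosen dimensions are arbitrary.

```python
import numpy as np
from scipy.special import gammaln

def sphere_to_cube_ratio(D, r=0.5):
    """Ratio V_sphere(r) / V_cube(r) of Eqs. (1.1)-(1.2), computed in log-space."""
    log_sphere = (D / 2.0) * np.log(np.pi) + D * np.log(r) - gammaln(1.0 + D / 2.0)
    log_cube = D * np.log(2.0 * r)
    return np.exp(log_sphere - log_cube)

for D in (2, 3, 5, 10, 20, 50, 100):
    print(f"D = {D:3d}   V_sphere / V_cube = {sphere_to_cube_ratio(D):.3e}")
```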

Hypervolume of a thin spherical shell

By virtue of Eq. (1.1), the relative hypervolume of a thin spherical shell is

\frac{V_{\text{sphere}}(r) - V_{\text{sphere}}(r(1-\epsilon))}{V_{\text{sphere}}(r)} = \frac{1^D - (1-\epsilon)^D}{1^D} ,    (1.5)

where ε is the thickness of the shell (ε ≪ 1). When D increases, the ratio
tends to 1, meaning that the shell contains almost all the volume [194].
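The ratio in Eq. (1.5) is easy to evaluate directly; the short sketch below uses an illustrative relative thickness of 5%.

```python
# Fraction of the ball volume lying in a shell of relative thickness eps,
# i.e., (1^D - (1 - eps)^D) / 1^D from Eq. (1.5).
eps = 0.05
for D in (1, 2, 10, 50, 100, 500):
    shell_fraction = 1.0 - (1.0 - eps) ** D
    print(f"D = {D:3d}   shell fraction = {shell_fraction:.4f}")
```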

Tail probability of isotropic Gaussian distributions

For any dimension D, the probability density function (pdf) of an isotropic


Gaussian distribution (see Appendix B) is written as

f_y(y) = \frac{1}{\sqrt{(2\pi\sigma^2)^D}} \exp\left( -\frac{1}{2} \frac{\|y - \mu_y\|^2}{\sigma^2} \right) ,    (1.6)

where y is a D-dimensional vector, μ_y its D-dimensional mean, and σ² the


isotropic (scalar) variance. Assuming the random vector y has zero mean and
unit variance, the formula simplifies into
f_y(y) = K(r) = \frac{1}{\sqrt{(2\pi)^D}} \exp\left( -\frac{r^2}{2} \right) ,    (1.7)

where r = ‖y‖ can be interpreted as a radius. Indeed, because the distribution


is isotropic, the equiprobable contours are spherical. With the previous exam-
ples in mind, it can thus be expected that the distribution behaves strangely
in high dimensions.
This is confirmed by computing r0.95 defined as the radius of a hypersphere
that contains 95% of the distribution [45]. The value of r_{0.95} is such that

\frac{\int_0^{r_{0.95}} S_{\text{sphere}}(r) K(r) \, dr}{\int_0^{\infty} S_{\text{sphere}}(r) K(r) \, dr} = 0.95 ,    (1.8)

where Ssphere (r) is the surface of a D-dimensional hypersphere of radius r:

S_{\text{sphere}}(r) = \frac{2\pi^{D/2} r^{D-1}}{\Gamma(D/2)} .    (1.9)
The radius r0.95 grows as the dimensionality D increases, as illustrated in the
following table:
D        1       2       3       4       5       6
r_0.95   1.96σ   2.45σ   2.80σ   3.08σ   3.33σ   3.54σ
This shows the weird behavior of a Gaussian distribution in high-dimensional
spaces.
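The values in the table can be reproduced numerically. For an isotropic Gaussian with variance σ², the quantity ‖y‖²/σ² follows a chi-square distribution with D degrees of freedom, so r_0.95 is σ times the square root of its 95% quantile; the sketch below relies on SciPy for that quantile.

```python
import numpy as np
from scipy.stats import chi2

# r_0.95 (in units of sigma) for an isotropic D-dimensional Gaussian:
# ||y||^2 / sigma^2 follows a chi-square law with D degrees of freedom.
for D in (1, 2, 3, 4, 5, 6, 10, 100):
    r95 = np.sqrt(chi2.ppf(0.95, df=D))
    print(f"D = {D:3d}   r_0.95 = {r95:.2f} sigma")
```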

Concentration of norms and distances

Another problem encountered in high-dimensional spaces regards the weak


discrimination power of a metric. As dimensionality grows, the contrast pro-
vided by usual metrics decreases, i.e., the distribution of norms in a given
distribution of points tends to concentrate. This is known as the concentra-
tion phenomenon [20, 64].
For example, the Euclidean norm of vectors consisting of several variables
that are i.i.d. (independent and identically distributed) behaves in a totally
unexpected way. The explanation can be found in the following theorem (taken
from [45], where the proof can be found as well):
Theorem 1.1. Let y be a D-dimensional vector [y_1, . . . , y_d, . . . , y_D]^T; all
components y_d of the vector are independent and identically distributed, with
a finite eighth-order moment. Then the mean μ_{‖y‖} and the variance σ²_{‖y‖} of
the Euclidean norm (see Subsection 4.2.1) are

\mu_{\|y\|} = \sqrt{aD - b} + O(D^{-1})    (1.10)
\sigma^2_{\|y\|} = b + O(D^{-1/2}) ,    (1.11)

where a and b are parameters depending only on the central moments of order
1, 2, 3, and 4 of the y_d:

a = \mu_2 + \mu^2    (1.12)
b = \frac{4\mu^2 \mu_2 - \mu_2^2 + 4\mu\mu_3 + \mu_4}{4(\mu_2 + \mu^2)} ,    (1.13)

where μ = E{y_d} is the common mean of all components y_d and μ_k their
common central kth-order moment (μ_k = E{(y_d − μ)^k}).

In other words, the norm of random vectors grows proportionally to √D,
as naturally expected, but the variance remains more or less constant for a
sufficiently large D. This also means that the vector y seems to be normalized
in high dimensions. More precisely, thanks to Chebychev’s inequality, one has
P\left( \left| \|y\| - \mu_{\|y\|} \right| \geq \varepsilon \right) \leq \frac{\sigma^2_{\|y\|}}{\varepsilon^2} ,    (1.14)
i.e., the probability that the norm of y falls outside an interval of fixed width
centered on μ_{‖y‖} becomes approximately constant when D grows. As μ_{‖y‖}
also grows, the relative error made by taking μ_{‖y‖} instead of ‖y‖ becomes
negligible. Therefore, high-dimensional random i.i.d. vectors seem to be dis-
tributed close to the surface of a hypersphere of radius μ_{‖y‖}. This means not
only that successive drawings of such random vectors yield almost the same
norm, but also that the Euclidean distance between any two vectors is approx-
imately constant. The Euclidean distance is indeed the Euclidean norm of the
difference of two random vectors (see Subsection 4.2.1), and this difference is
also a random vector.
In practice, the concentration phenomenon makes the nearest-neighbor
search problem difficult to solve in high-dimensional spaces [20, 26]. Other
results about the surprising behavior of norms and distances measured in high-
dimensional spaces are given, for instance, in [1, 64] and references therein.
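The concentration described by Theorem 1.1 is easy to observe empirically. The sketch below draws vectors with i.i.d. uniform components (an arbitrary choice; any distribution with the required moments behaves similarly): the mean norm grows like √D while its standard deviation stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000   # number of random vectors per dimension

for D in (1, 10, 100, 1000):
    y = rng.uniform(-1.0, 1.0, size=(n, D))   # i.i.d. components
    norms = np.linalg.norm(y, axis=1)
    print(f"D = {D:5d}   mean(||y||) = {norms.mean():8.3f}   std(||y||) = {norms.std():.3f}")
```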

Diagonal of a hypercube

Considering the hypercube [−1, +1]^D, any segment from its center to one of
its 2^D corners, i.e., a half-diagonal, can be written as v = [±1, . . . , ±1]^T. The
angle between a half-diagonal v and the dth coordinate axis

ed = [0, . . . , 0, 1, 0, . . . , 0]T

is computed as
\cos \theta_D = \frac{v^T e_d}{\|v\| \, \|e_d\|} = \frac{\pm 1}{\sqrt{D}} .    (1.15)
When the dimensionality D grows, the cosine tends to zero, meaning that
half-diagonals are nearly orthogonal to all coordinate axes [169]. Hence, the
visualization of high-dimensional data by plotting a subset of two coordinates
on a plane can be misleading. Indeed, a cluster of points lying near a diagonal
line of the space will be surprisingly plotted near the origin, whereas a cluster
lying near a coordinate axis is plotted as intuitively expected.
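Equation (1.15) can be checked in a few lines; the sketch below also reports the angle in degrees to show how quickly the half-diagonal becomes almost orthogonal to the axes.

```python
import numpy as np

# Angle between the half-diagonal v = [1, ..., 1]^T of [-1, +1]^D
# and the first coordinate axis, as in Eq. (1.15).
for D in (2, 3, 10, 100, 1000):
    v = np.ones(D)
    e1 = np.zeros(D)
    e1[0] = 1.0
    cos_theta = (v @ e1) / (np.linalg.norm(v) * np.linalg.norm(e1))
    theta_deg = np.degrees(np.arccos(cos_theta))
    print(f"D = {D:5d}   cos(theta) = {cos_theta:.4f}   theta = {theta_deg:.1f} deg")
```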

1.3 Some directions to be explored


In the presence of high-dimensional data, two possibilities exist to avoid or
at least attenuate the effects of the above-mentioned phenomena. The first
one focuses on the separation between relevant and irrelevant variables. The
second one concentrates on the dependencies between the (relevant) variables.

1.3.1 Relevance of the variables

When analyzing multivariate data, not necessarily all variables are related to
the underlying information the user wishes to catch. Irrelevant variables may
be eliminated from the data set.
Most often, techniques to distinguish relevant variables from irrelevant
ones are supervised: the “interest” of a variable is given by an “oracle” or
“professor”. For example, in a system with many inputs and outputs, the
relevance of an input can be measured by computing the correlations between
known pairs of input/output. Input variables that are not correlated with the
outputs may then be eliminated.
Techniques to determine whether variables are (ir)relevant are not further
studied in this book, which focuses mainly on non-supervised methods. For
the interested reader, some introductory references include [2, 96, 139].

1.3.2 Dependencies between the variables

Even when assuming that all variables are relevant, the dimensionality of the
observed data may still be larger than necessary. For example, two variables
may be highly correlated: knowing one of them brings information about the
other. In that case, instead of arbitrarily removing one variable in the pair,
another way to reduce the number of variables would be to find a new set
of transformed variables. This is motivated by the facts that dependencies
between variables may be very complex and that keeping one of them might
not suffice to catch all the information content they both convey.
The new set should obviously contain a smaller number of variables but
should also preserve the interesting characteristics of the initial set. In other
words, one seeks a transformation of the variables with some well-defined
properties. These properties must ensure that the transformation does not
alter the information content conveyed by the initial data set, but only rep-
resents it in a different form. In the remainder of this book, linear as well
as nonlinear transformations of observed variables will often be called projec-
tions, mainly because many transformations are designed for the preservation
of characteristics that are geometrical or interpreted as such.
The type of projection must be chosen according to the model that un-
derlies the data set. For example, if the given variables are assumed to be
mixtures of a few unobserved ones, then a projection that inverts the mixing
process is very useful. In other words, this projection tracks and eliminates
dependencies between the observed variables. These dependencies often result
from a lack of knowledge or other imperfections in the observation process: the
interesting variables are not directly accessible and are thus measured in sev-
eral different but largely redundant ways. The determination of a projection
may also follow two different goals.
The first and simplest one aims to just detect and eliminate the depen-
dencies. For this purpose, the projection is determined in order to reduce the

number of variables. This task is traditionally known as dimensionality re-


duction and attempts to eliminate any redundancy in the initial variables.
Principal component analysis (PCA) is most probably the best-known tech-
nique for dimensionality reduction.
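As a preview of what Chapter 2 develops in detail, the following sketch performs PCA through a singular value decomposition of the centered data. It is only a minimal illustration under its own conventions (rows are observations here, and the toy data are invented), not the book's full treatment.

```python
import numpy as np

def pca_reduce(Y, P):
    """Project the N x D data matrix Y onto its first P principal axes."""
    Y_centered = Y - Y.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes.
    _, _, Vt = np.linalg.svd(Y_centered, full_matrices=False)
    return Y_centered @ Vt[:P].T          # N x P reduced coordinates

# Toy example: three observed variables driven by a single latent one.
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
Y = np.column_stack([x, 2.0 * x, -0.5 * x]) + 0.01 * rng.standard_normal((200, 3))
X_hat = pca_reduce(Y, P=1)
print(X_hat.shape)   # (200, 1)
```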
The second and more complex goal of a projection is not only to reduce
the dimensionality, but also to retrieve the so-called latent variables, i.e., those
that are at the origin of the observed ones but cannot be measured directly.
This task, in its most generic sense, is often called latent variable sep-
aration. Blind source separation (BSS), in signal processing, or Independent
component analysis (ICA), in multivariate data analysis, are particular cases
of latent variable separation.
As can be deduced, dimensionality reduction only focuses on the number
of latent variables and attempts to give a low-dimensional representation of
data according to this number. For this reason, dimensionality reduction does
not care for the latent variables themselves: any equivalent representation will
do. By comparison, latent variable separation is more difficult since it aims,
beyond dimensionality reduction, at recovering the unknown latent variables
as well as possible.

1.4 About topology, spaces, and manifolds


From a geometrical point of view, when two or more variables depend on
each other, their joint distribution — or, more accurately, the support of their
joint distribution — does not span the whole space. Actually, the dependence
induces some structure in the distribution, in the form of a geometrical locus
that can be seen as a kind of object in the space. The hypercube illustrated in
Fig. 1.1 is an example of such a structure or object. And as mentioned above,
dimensionality reduction aims at giving a new representation of these objects
while preserving their structure.
In mathematics, topology studies the properties of objects that are pre-
served through deformations, twistings, and stretchings. Tearing is the only
prohibited operation, thereby guaranteeing that the intrinsic “structure” or
connectivity of objects is not altered. For example, a circle is topologically
equivalent to an ellipse, and a sphere is equivalent to an ellipsoid.1 However,
subsequent chapters of this book will show that tearing still remains a very
interesting operation when used carefully.
One of the central ideas of topology is that spatial objects like circles and
spheres can be treated as objects in their own right: the knowledge of objects
does not depend on how they are represented, or embedded, in space. For ex-
ample, the statement, “If you remove a point from a circle, you get a (curved)
line segment” holds just as well for a circle as for an ellipse, and even for
1
Of course, this does not mean that soccer is equivalent to rugby!

tangled or knotted circles. In other words, topology is used to abstract the in-
trinsic connectivity of objects while ignoring their detailed form. If two objects
have the same topological properties, they are said to be homeomorphic.
The “objects” of topology are formally defined as topological spaces. A
topological space is a set for which a topology is specified [140]. For a set Y, a
topology T is defined as a collection of subsets of Y that obey the following
properties:
• Trivially, ∅ ∈ T and Y ∈ T .
• Whenever two sets are in T , then so is their intersection.
• Whenever two or more sets are in T , then so is their union.
This definition of a topology holds as well for a Cartesian space (RD ) as for
graphs. For example, the natural topology associated with R, the set of real
numbers, is the union of all open intervals.
From a more geometrical point of view, a topological space can also be
defined using neighborhoods and Hausdorff's axioms. The neighborhood of a
point y ∈ R^D, also called an ε-neighborhood or infinitesimal open set, is often
defined as the open ε-ball B_ε(y), i.e., the set of points inside a D-dimensional
hollow sphere of radius ε > 0 and centered on y. A set containing an open
neighborhood is also called a neighborhood. Then, a topological space is such
that
• To each point y there corresponds at least one neighborhood U(y), and
U(y) contains y.
• If U(y) and V(y) are neighborhoods of the same point y, then a neighbor-
hood W(y) exists such that W(y) ⊂ U(y) ∩ V(y).
• If z ∈ U(y), then a neighborhood V(z) of z exists such that V(z) ⊂ U(y).
• For two distinct points, two disjoint neighborhoods of these points exist.
Within this framework, a (topological) manifold M is a topological space
that is locally Euclidean, meaning that around every point of M is a neigh-
borhood that is topologically the same as the open unit ball in RD . In general,
any object that is nearly “flat” on small scales is a manifold. For example, the
Earth is spherical but looks flat on the human scale.
As a topological space, a manifold can be compact or noncompact, con-
nected or disconnected. Commonly, the unqualified term “manifold” means
“manifold without boundary”. Open manifolds are noncompact manifolds
without boundary, whereas closed manifolds are compact manifolds without
boundary. If a manifold contains its own boundary, it is called, not surpris-
ingly, a “manifold with boundary”. The closed unit ball B̄1 (0) in RD is a
manifold with boundary, and its boundary is the unit hollow sphere. By defi-
nition, every point on a manifold has a neighborhood together with a home-
omorphism of that neighborhood with an open ball in RD .
An embedding is a representation of a topological object (a manifold, a
graph, etc.) in a certain space, usually RD for some D, in such a way that its

topological properties are preserved. For example, the embedding of a man-


ifold preserves open sets. More generally, a space X is embedded in another
space Y when the properties of Y restricted to X are the same as the properties
of X .
A smooth manifold, also called an (infinitely) differentiable manifold, is
a manifold together with its “functional structure” (e.g., parametric equa-
tions). Hence, a smooth manifold differs from a simple topological manifold,
as defined above, because the notion of differentiability exists on it. Every
smooth manifold is a topological manifold, but the reverse statement is not
always true. Moreover, the availability of parametric equations allows us to
relate the manifold to its latent variables, namely its parameters or degrees
of freedom.
A smooth manifold M without boundary is said to be a submanifold of
another smooth manifold N if M ⊂ N and the identity map of M into N is
an embedding. However, it is noteworthy that, while a submanifold M is just
a subset of another manifold N , M can have a dimension from a geometrical
point of view, and the dimension of M may be lower than the dimension of N .
With this idea in mind, and according to [175], a P -manifold or P -dimensional
manifold M is defined as a submanifold of N ⊂ RD if the following condition
holds for all points y ∈ M: there exist two open sets U, V ⊂ M, with y ∈ U,
and a diffeomorphism h : U → V, y ↦ x = h(y), such that

h(U ∩ M) = V ∩ (R^P × {0}) = {x ∈ V : x_{P+1} = · · · = x_D = 0} .

As can be seen, x can trivially be reduced to P -dimensional coordinates. If


N = RD in the previous definition, then
• A point y ∈ RD is a manifold.
• A P -dimensional vector subspace (a P -dimensional hyperplane) is a P -
manifold.
• The hollow D-dimensional hypersphere is a (D − 1)-manifold.
• Any open subset is a D-manifold.
Whitney [202] showed in the 1930s that any P -manifold can be embedded
in R2P +1 , meaning that 2P + 1 dimensions at most are necessary to embed a
P -manifold. For example, an open line segment is an (open) 1-manifold that
can already be embedded in R1 . On the other hand, a circle is a (compact) 1-
manifold that can be embedded in R2 but not in R1 . And a knotted circle, like
a trefoil knot, reaches the bound of Whitney’s theorem: it can be embedded
only in RD , with D ≥ 2P + 1 = 3.
In the remainder of this book, the word manifold used alone typically des-
ignates a P -manifold embedded in RD . In the light of topology, dimension-
ality reduction amounts to re-embedding a manifold from a high-dimensional
space to a lower-dimensional one. In practice, however, a manifold is noth-
ing more than the underlying support of a data distribution, which is known
only through a finite sample. This raises two problems. First, dimensionality

reduction techniques must work with partial and limited data. Second, as-
suming the existence of an underlying manifold allows us to take into account
the support of the data distribution but not its other properties, such as its
density. This may be problematic for latent variable separation, for which a
model of the data density is of prime importance.
Finally, the manifold model does not account for the noise that may cor-
rupt data. In that case, data points no longer lie on the manifold: instead, they
fly nearby. Hence, regarding terminology, it is correct to write that dimension-
ality reduction re-embeds a manifold, but, on the other hand, it can also be
said that noisy data points are (nonlinearly) projected on the re-embedded
manifold.

1.5 Two benchmark manifolds


In order to illustrate the advantages and drawbacks of the various methods
of dimensionality reduction to be studied in Chapters 4 and 5, the manifolds
shown in Fig. 1.3 will be used repeatedly as running examples. The first

1
0 3
3
y

0
y3

1 2
−1 −1 1
−1 0
−1 0
0 0
−1 y2
1 −1 y 1
2
y y
1 1

Fig. 1.3. Two benchmark manifolds: the ‘Swiss roll’ and the ‘open box’.

manifold, on the left in Fig. 1.3, is called the Swiss roll, according to the
name of a Swiss-made cake: it is composed of a layer of airy pastry, which
is spread with jam and then rolled up. The manifold shown in the figure
represents the thin layer of jam in a slice of Swiss roll. The challenge of the
Swiss roll consists of finding a two-dimensional embedding that “unrolls” it,
in order to avoid superpositions of the successive turns of the spiral and to
obtain a bijective mapping between the initial and final embeddings of the
manifold. The Swiss roll is a noncompact, smooth, and connected manifold.

The second two-manifold of Fig. 1.3 is naturally called the “open box”.
As for the Swiss roll, the goal is to reduce the embedding dimensionality from
three to two. As can be seen, the open box is connected but neither compact
(in contrast with a cube or closed box) nor smooth (there are sharp edges and
corners). Intuitively, it is not so obvious to guess what an embedding of the
open box should look like. Would the lateral faces be stretched? Or torn? Or
would the bottom face be shrunk? Actually, the open box helps to show the
way each particular method behaves.
In practice, all DR methods work with a discrete representation of the
manifold to be embedded. In other words, the methods are fed with a finite
subset of points drawn from the manifold. In the case of the Swiss roll and
open box manifolds, 350 and 316 points are selected, respectively, as shown in
Fig. 1.4. The 350 and 316 available points are regularly spaced, in order to be

Fig. 1.4. A subset of points drawn from the “Swiss roll” and “open box” manifolds
displayed in Fig. 1.3. These points are used as data sets for DR methods in order
to assess their particular behavior. Corners and points on the edges of the box are
shown with squares, whereas points inside the faces are shown as smaller circles. The
color indicates the height of the points in the box or the radius in the Swiss roll. A
lattice connects the points in order to highlight their neighborhood relationships.

as representative of the manifold as possible. Moreover, points are connected


and displayed with different colors (indicating the height in the box or the
radius in the Swiss roll). In the case of the box, points also have different
shapes (small circles inside the faces, larger squares on the edges). All these
features are intended to improve the readability once the manifold is mapped
onto a plane, although the three-dimensional representation of Fig. 1.4 looks
a bit overloaded.
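
For readers who wish to reproduce such a sample, the short NumPy sketch below draws a regular 350-point subset of a comparable Swiss-roll surface. The exact parametrization (angular extent, radius growth) used for Figs. 1.3 and 1.4 is not given here, so the one below is only an illustrative assumption.

import numpy as np

def swiss_roll_sample(n_angles=25, n_heights=14):
    # Regular grid on the two latent parameters: angle along the spiral and height.
    t = np.linspace(0.0, 3.0 * np.pi, n_angles)
    h = np.linspace(-1.0, 1.0, n_heights)
    T, H = np.meshgrid(t, h)
    r = 0.3 + 0.2 * T / np.pi          # radius grows slowly with the angle
    # Embed the 2-manifold in R^3 as (r cos t, r sin t, h).
    Y = np.column_stack([(r * np.cos(T)).ravel(),
                         (r * np.sin(T)).ravel(),
                         H.ravel()])
    return Y

Y = swiss_roll_sample()
print(Y.shape)                         # (350, 3): the same number of points as above
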

1.6 Overview of the next chapters


This chapter has quickly reviewed some of the practical and theoretical reasons
that motivate the interest in methods of analyzing high-dimensional data. Next,
Chapter 2 details the most common characteristics of such a method:
• Which functionalities are expected by the user?
• How is the underlying data model defined?
• Which criterion is to be optimized?
In order to illustrate the answers to these questions, Chapter 2 contains a
description of principal component analysis (PCA), which is probably the
best-known and most widely used method of analyzing high-dimensional data. The chap-
ter ends by listing several properties that allow us to categorize methods of
nonlinear dimensionality reduction.
Because numerous DR methods do not integrate an estimator of the intrin-
sic dimensionality of the data, Chapter 3 describes some usual estimators of
the intrinsic dimensionality. A good estimation of the intrinsic dimensionality
spares a lot of time when the method takes it as an external hyperparameter.
This chapter is necessary for completeness, but the reader familiar with the
subject may easily skip it.
The next two chapters are dedicated to the study of two main families of
DR techniques. Those techniques can be viewed as replacements, evolutions,
or specializations of PCA. On one side, Chapter 4 details methods based on
distance preservation. On the other side, Chapter 5 concentrates on the more
elegant but more difficult principle of topology preservation. Each of these
chapters covers a wide range of classical and more recent methods and describes
them extensively. Next, Chapter 6 gives some examples and compares the
results of the various methods.
Finally, Chapter 7 draws the conclusions. It summarizes the main points of
the book and outlines a unifying view of the data flow for a typical method of
analyzing high-dimensional data. Chapter 7 is followed by several appendices
that deal with mathematical or technical details.
2 Characteristics of an Analysis Method

Overview. This chapter lists and describes the functionalities that


the user usually expects from a method of analyzing high-dimensional
data. Next, more mathematical details are given, such as the way
data are modeled, the criterion to optimize, and the algorithm that
implements them. Finally, the chapter ends by introducing a typical
method, namely the principal component analysis (PCA). The method
is briefly assessed by applying it to simple examples. Afterwards, and
because of the limitations of linear methods like principal component
analysis, this chapter promotes the use of methods that can reduce
the dimensionality in a nonlinear way. In contrast with PCA, these
more complex methods assume that data are generated according to a
nonlinear model. The last section of this chapter attempts to list some
important characteristics that allow us to classify the methods in var-
ious categories. These characteristics are, among others, the way data
are modeled, the criterion to be optimized, the way it is optimized,
and the kind of algorithm that implements the method.

2.1 Purpose
This chapter aims at gathering all features or properties that characterize
a method of analyzing high-dimensional data. The first section lists some
functionalities that the user usually expects. The next sections present more
technical characteristics like the mathematical or statistical model that under-
lies the method, the type of algorithm that identifies the model parameters,
and, last but not least, the criterion optimized by the method. Although the
criterion ends the list, it often has a great influence on other characteristics.
Depending on the criterion, indeed, some functionalities are available or not;
similarly, the optimization of a given criterion is achieved more easily with
some type of algorithm and may be more difficult with another one.

2.2 Expected functionalities


As mentioned in the previous chapter, the analysis of high-dimensional data
amounts to identifying and eliminating the redundancies among the observed
variables. This requires three main functionalities: an ideal method should
indeed be able to
1. Estimate the number of latent variables.
2. Embed data in order to reduce their dimensionality.
3. Embed data in order to recover the latent variables.
Before detailing them, it is noteworthy that these three functionalities are not
always available together in most methods. Very often, methods are able to
either reduce the dimensionality or separate latent variables, but can rarely
perform both. Furthermore, only a few methods include an estimator of the
intrinsic dimensionality: most of them take the dimensionality as an external
hyperparameter and need an additional algorithm to evaluate it.
In some cases, two or even three methods can be combined in order to
achieve the three tasks: a first method gives the number of latent variables, a
second one yields a low-dimensional representation of data, and, if necessary,
a third one further transforms data in order to retrieve the latent variables.

2.2.1 Estimation of the number of latent variables

The first necessary step to extract information from high-dimensional data


consists of computing the number of latent variables. Sometimes latent vari-
ables are also called degrees of freedom. But how many are there? How do we
estimate their number given only a few observations in a data set?
Detailed answers to those questions are given in Chapter 3. At this point,
it is just useful to know that the number of latent variables is often computed
from a topological point of view, by estimating the intrinsic dimension(ality)
of data. In contrast with a number of variables, which is necessarily an integer
value, the intrinsic dimension hides a more generic concept and may take on
real values. As the intrinsic dimension is the most common way to estimate
the number of latent variables, the term “intrinsic dimension(ality)” is often
used instead of “number of latent variables” further in this book.
The intrinsic dimension reveals the presence of a topological structure in
data. When the intrinsic dimension P of data equals the dimension D of the
embedding space, there is no structure: there are enough degrees of freedom
so that an ε-ball centered on any data point can virtually be completely filled
by other data points. On the contrary, when P < D, data points are often¹
constrained to lie in a well-delimited subspace. Consequently, a low intrinsic
dimension indicates that a topological object or structure underlies the data
set. Figure 2.1 gives an example: a two-dimensional object (a surface or
2-manifold) is embedded in a three-dimensional Euclidean space. Intuitively,
two parameters (or degrees of freedom or latent variables) suffice to fully
describe the manifold. The intrinsic dimension estimated on a few points
drawn from the surface confirms that intuition.

¹ Not always, due to the existence of fractal objects; see Chapter 3.

Fig. 2.1. A two-dimensional manifold embedded in a three-dimensional space. The
data set contains only a finite number of points (or observations).

Without a good estimate of the intrinsic dimension, dimensionality reduc-
tion is no more than a risky bet since one does not know to what extent the
dimensionality can be reduced.

2.2.2 Embedding for dimensionality reduction

The knowledge of the intrinsic dimension P indicates that data have some
topological structure and do not completely fill the embedding space. Quite
naturally, the following step would consist of re-embedding the data in a lower-
dimensional space that would be better filled. The aims are both to get the
most compact representation and to make any subsequent processing of data
easier. Typical applications include data compression and visualization.
More precisely, if the estimate of the intrinsic dimensionality P is reli-
able, then two assumptions can be made. First, data most probably hide a
P -dimensional manifold.² Second, it is possible to re-embed the underlying
P -dimensional manifold in a space having dimensionality between P and D,
hopefully closer to P than D.
Intuitively, dimensionality reduction aims at re-embedding data in such
way that the manifold structure is preserved. If this constraint is relaxed,
then dimensionality reduction no longer makes sense. The main problem is,
of course, how to measure or characterize the structure of a manifold in order
to preserve it.

² Of course, this is not necessarily true, as P is a global estimator and data may be
a combination of several manifolds with various local dimensionalities.
Figure 2.2 shows a two-dimensional embedding of the manifold that was
initially shown in a three-dimensional space in Fig. 2.1. After dimensional-
ity reduction, the structure of the manifold is now completely unveiled: it
is a rectangle. Obviously, for this toy example, that statement could have
already been deduced by looking at the three-dimensional representation in
Fig. 2.1. In this example, such a visual clue simply confirms that the dimen-
sionality reduction worked properly and preserved the structure of the object.
Actually, from the viewpoint of topology, the curvature of the rectangle in
its three-dimensional embedding does not really matter, in contrast with the
connectivity and local relationships between data points. More importantly,
the dimensionality reduction establishes a one-to-one mapping between the
three-dimensional points and the two-dimensional ones. This mapping allows
us to go back to the initial embedding if necessary.


Fig. 2.2. Possible two-dimensional embedding for the object in Fig. 2.1. The di-
mensionality of the data set has been reduced from three to two.

2.2.3 Embedding for latent variable separation

Dimensionality reduction aims at decreasing the number of variables that de-


scribe data. In fact, most DR methods can only achieve that precise task.
By comparison, the recovery of latent variables goes a step further than di-
mensionality reduction. Additional constraints are imposed on the desired
low-dimensional representation.
These constraints are generally not related to topology. For example, it
is often assumed that the latent variables that generated the data set are
(statistically) independent from each other. In this case, the low-dimensional
representation must also satisfy this property in order to state that the latent
variables have been retrieved.
The example of Fig. 2.1, which has been re-embedded in Fig. 2.2, can
be further processed, as in Fig. 2.3. In the latter representation, the two
parameters of the representation, corresponding to the axes of the coordinate
system, have been made independent. Intuitively, it can be seen that knowing
the abscissa of a point gives no clue about the ordinate of the same point,
and vice versa. This was not the case in Fig. 2.2: each abscissa determines
a different interval where the ordinate may lie. It is noteworthy that the


Fig. 2.3. Particular two-dimensional embedding for the object in Fig. 2.1. The latent
variables, corresponding to the axes of the coordinate system, are independent from
each other.

latent variable separation, whose result is illustrated in Fig. 2.3, has not been
obtained directly from the three-dimensional data set in Fig. 2.1. Instead,
it has been determined by modifying the low-dimensional representation of
Fig. 2.2. And actually, most methods of latent variable separation are not
able to reduce the dimensionality by themselves: they need another method or
some kind of preprocessing to achieve it. Moreover, the additional constraints
imposed on the desired representation, like statistical independence, mean
that the methods are restricted to very simple data models. For example,
observed variables are most often modeled as linear combinations of the latent
ones, in order to preserve some of their statistical properties.

2.3 Internal characteristics


Behind the expected functionalities of an analysis method, less visible char-
acteristics are hidden, though they play a key role. These characteristics are
• The model that data are assumed to follow.
• The type of algorithm that identifies the model parameters.
• The criterion to be optimized, which guides the algorithm.

2.3.1 Underlying model

All methods of analysis rely on the assumption that the data sets they are fed
with have been generated according to a well-defined model. In colorful terms,
the food must be compatible with the stomach: no vegetarian eats meat!
For example, principal component analysis (see Section 2.4) assumes that
the dependencies between the variables are linear. Of course, the user should
be aware of such a hypothesis, since the type of model determines the power
and/or limitations of the method. As a consequence of this model choice, PCA
often delivers poor results when trying to project data lying on a nonlinear
subspace. This is illustrated in Fig. 2.4, where PCA has been applied to the
data set displayed in Fig. 2.1.


Fig. 2.4. Dimensionality reduction by PCA from 3 to 2 for the data set of Fig. 2.1.
Obviously, data do not fit the model of PCA, and the initial rectangular distribution
cannot be retrieved.

Hence, even for the relatively simple toy example in Fig. 2.1, methods
based on a nonlinear data model seem to be preferable. The embedding of
Fig. 2.2 is obtained by such a nonlinear method: the result is visually much
more convincing.
The distinction between linear and nonlinear models is not the only one.
For example, methods may have a continuous model or a discrete one. In the
first case, the model parameters completely define a continuous function (or
mapping) between the high- and low-dimensional spaces. In the second case,
the model parameters determine only a few values of such a function.

2.3.2 Algorithm

For the same model, several algorithms can implement the desired method of
analysis. For example, in the case of PCA, the model parameters are com-
puted in closed form by using general-purpose algebraic procedures. Most
often, these procedures work quickly, without any external hyperparameter
to tune, and are guaranteed to find the best possible solution (depending on
the criterion, see ahead). Nevertheless, in spite of many advantages, one of
their major drawbacks lies in the fact that they are so-called batch methods:
they cannot start working until the whole set of data is available.
When data samples arrive one by one, other types of algorithms exist.
For example, PCA can also be implemented by so-called online or adaptive
algorithms (see Subsection 2.4.4). Each time a new datum is available, online
algorithms handle it independently from the previous ones and then ‘forget’
it. Unfortunately, such algorithms do not show the same desirable properties
as algebraic procedures:
• By construction, they work iteratively (with a stochastic gradient descent
for example).
• They can fall in a local optimum of the criterion, i.e., find a solution that
is not exactly the best, but only an approximation.
• They often require a careful adjustment of several hyperparameters (e.g.,
learning rates) to speed up the convergence and avoid the above-mentioned
local minima.
Although PCA can be implemented by several types of algorithms, such
versatility does not hold for all methods. Actually, the more complex a model
is, the more difficult it is to compute its parameters in closed form. Along
with the data model, the criterion to be optimized also strongly influences
the algorithm.

2.3.3 Criterion

Although the criterion is the last item in this list of the method characteris-
tics, it probably plays the most important role. The choice of the criterion
often determines which functionalities the method will offer, intervenes in the
data model, and always orients the implementation to a particular type of
algorithm.
Typically, the criterion to be optimized is written as a mathematical for-
mula. For example, a well-known criterion for dimensionality reduction is the
mean square error. In order to compute this criterion, the dimensionality is
first reduced and then expanded back, provided that the data model could
be reversed. Most often the loss of information or deterioration of the data
structure occurs solely in the first step, but the second is necessary in order
to have a comparison reference. Mathematically, this reconstruction error can
be written as
Ecodec = Ey {||y − dec(cod(y))||_2^2} ,    (2.1)
where E{ } is the expectation operator; the dimensionality reduction and ex-
pansion are respectively denoted by the coding and decoding functions cod
and dec:

cod : R^D → R^P ,   y → x = cod(y) ,    (2.2)
dec : R^P → R^D ,   x → y = dec(x) .    (2.3)
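
As an illustration, the following sketch evaluates a sample estimate of Ecodec for a linear coding/decoding pair; the random orthonormal matrix W and the Gaussian toy data are assumptions made only to have something concrete to run, not part of the definition above.

import numpy as np

rng = np.random.default_rng(0)
D, P, N = 3, 2, 1000
W, _ = np.linalg.qr(rng.normal(size=(D, P)))   # D-by-P matrix with W^T W = I_P

def cod(Y):                                    # R^D -> R^P (rows of Y are observations)
    return Y @ W

def dec(X):                                    # R^P -> R^D
    return X @ W.T

Y = rng.normal(size=(N, D))
E_codec = np.mean(np.sum((Y - dec(cod(Y)))**2, axis=1))   # sample estimate of Eq. (2.1)
print(E_codec)
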

As explained in the next section, PCA can be derived from the reconstruction
error. Of course, other criteria exist. For example, statisticians may wish to
get a projection that preserves the variance initially observable in the raw
data. From a more geometrical or topological point of view, the projection
of the object should preserve its structure, for example, by preserving the
pairwise distances measured between the observations in the data set.
If the aim is latent variable separation, then the criterion can be decorre-
lation. This criterion can be further enriched by making the estimated latent
variables as independent as possible. The latter idea points toward indepen-
dent component analysis (ICA), which is out of the scope of this book. The
interested reader can find more details in [95, 34] and references therein.
As shown in the next section, several criteria described above, like min-
imizing the reconstruction error, maximizing the variance preservation, maxi-
mizing the distance preservation, or even decorrelating the observed variables,
lead to PCA when one considers a simple linear model.

2.4 Example: Principal component analysis


Principal component analysis (PCA in short) is perhaps one of the oldest
and best-known methods in multivariate analysis and data mining. PCA was
introduced by Pearson [149], who used it in a biological framework. Next,
PCA was further developed by Hotelling [92] in the field of psychometry. In
the framework of stochastic processes, PCA was also discovered independently
by Karhunen [102] and was subsequently generalized by Loève [128]. This
explains why PCA is also known as the “Karhunen-Loève transform” (or
expansion) in this field.

2.4.1 Data model of PCA

The model of PCA essentially assumes that the D observed variables, gathered
in the random vector y = [y1 , . . . , yd , . . . , yD ]T , result from a linear transfor-
mation W of P unknown latent variables, written as x = [x1 , . . . , xp , . . . , xP ]T :
y = Wx . (2.4)

All latent variables are assumed to have a Gaussian distribution (see Ap-
pendix B). Additionally, transformation W is constrained to be an axis
change, meaning that the columns wd of W are orthogonal to each other
and of unit norm. In other words, the D-by-P matrix W is a matrix such
that WT W = IP (but the permuted product WWT may differ from ID ).
A last important but not too restrictive hypothesis of PCA is that both the
observed variables y and the latent ones x are centered, i.e., Ey {y} = 0D and
Ex {x} = 0P .
Starting from this model, how can the dimension P and the linear transfor-
mation W be identified starting from a finite sample of the observed variables?
Usually, the sample is an unordered set of N observations (or realizations) of
the random vector y:

Y = {y(1), . . . , y(n), . . . , y(N )} , (2.5)

but it is often more convenient to write it in matrix form:

Y = [y(1), . . . , y(n), . . . , y(N )] . (2.6)

Preprocessing

Before determining P and W, it must be checked that the observations are


centered. If this is not the case, they can be centered by removing the expec-
tation of y from each observation y(n):

y(n) ← y(n) − Ey {y} , (2.7)

where the left arrow means that the variable on the left-hand side is assigned
a new value indicated in the right-hand side. Of course, the exact expectation
of y is often unknown and must be approximated by the sample mean:

Ey {y} ≈ (1/N) Σ_{n=1}^{N} y(n) = (1/N) Y 1_N .    (2.8)

With the last expression of the sample mean in matrix form, the centering
can be rewritten for the entire data set as
Y ← Y − (1/N) Y 1_N 1_N^T .    (2.9)
Once data are centered, P and W can be identified by PCA.
Nevertheless, the data set may need to be further preprocessed. Indeed, the
components yd of the observed vector y may come from very different origins.
For example, in multivariate data analysis, one variable could be a weight
expressed in kilograms and another variable a length expressed in millimeters.
26 2 Characteristics of an Analysis Method

But the same variables could as well be written in other units, like grams
and meters. In both situations, it is expected that PCA detects the same
dependencies between the variables in order to yield the same results. A simple
way to solve this indeterminacy consists of standardizing the variables, i.e.,
dividing each yd by its standard deviation after centering. Does this mean that
the observed variables should always be standardized? The answer is negative,
and actually the standardization could even be dangerous when some variable
has a low standard deviation. Two cases should be distinguished from the
others:
• When a variable is zero, its standard deviation is also zero. Trivially, the
division by zero must be avoided, and the variable should be discarded.
Alternatively, PCA can detect and remove such a useless zero-variable in
a natural way.
• When noise pollutes an observed variable having a small standard devia-
tion, the contribution of the noise to the standard deviation may be pro-
portionally large. This means that discovering the dependency between
that variable and the other ones can be difficult. Therefore, that variable
should intuitively be processed exactly as in the previous case, that is, ei-
ther by discarding it or by avoiding the standardization. The latter could
only amplify the noise. By definition, noise is independent from all other
variables and, consequently, PCA will regard the standardized variable as
an important one, while the same variable would have been a minor one
without standardization.
These two simple cases demonstrate that standardization can be useful but
may not be achieved blindly. Some knowledge about the data set is necessary.
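
The sketch below implements the centering of Eq. (2.9) together with a cautious standardization; the columns of Y are the observations, as in Eq. (2.6), and the small threshold used to leave (near-)zero-variance variables unscaled is an arbitrary choice made for illustration.

import numpy as np

def center(Y):
    # Subtract the sample mean of each observed variable (rows of Y), Eq. (2.9).
    return Y - Y.mean(axis=1, keepdims=True)

def standardize(Y, eps=1e-12):
    # Divide each centered variable by its standard deviation, except when the
    # deviation is (nearly) zero, in which case the variable is left unscaled.
    Yc = center(Y)
    s = Yc.std(axis=1, keepdims=True)
    return Yc / np.where(s < eps, 1.0, s)

rng = np.random.default_rng(0)
Y = np.diag([1000.0, 0.001, 1.0]) @ rng.normal(size=(3, 500))   # very different scales
print(standardize(Y).std(axis=1))                               # all close to 1
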
After centering (and standardization if appropriate), the parameters P
and W can be identified by PCA. Of course, the exact values of P and W
depend on the criterion optimized by PCA.

2.4.2 Criteria leading to PCA


PCA can be derived from several criteria, all leading to the same method
and/or results. Two criteria — minimal reconstruction error and maximal
preserved variance — are developed in the following two subsections. A third
criterion — distance preservation — is detailed further in Subsection 4.2.2,
which describes metric multidimensional scaling.

Minimal reconstruction error


This approach is due to Pearson [149]. Starting from the linear model of PCA
(Eq. (2.4)), the coding and decoding functions (resp., Eq. (2.2) and Eq. (2.3))
can be rewritten as
cod : R^D → R^P ,   y → x = cod(y) = W^+ y ,    (2.10)
dec : R^P → R^D ,   x → y = dec(x) = W x ,      (2.11)
where W+ = (WT W)−1 WT = WT is the (left) pseudo-inverse of W. Then


the reconstruction mean square error (Eq. (2.1)) becomes:

Ecodec = Ey {||y − W W^T y||_2^2} .    (2.12)

As already remarked above, the equality WT W = IP holds, but WWT = ID


is not necessarily true. Therefore, no simplification may occur in the recon-
struction error. However, in a perfect world, the observed vector y has been
generated precisely according to the PCA model (Eq. (2.4)). In this case only,
y can be perfectly retrieved. Indeed, if y = Wx, then

WWT y = WWT Wx = WIP x = y ,

and the reconstruction error is zero. Unfortunately, in almost all real situa-
tions, the observed variables in y are polluted by some noise, or do not fully
respect the linear PCA model, yielding a nonzero reconstruction error. As a
direct consequence, W cannot be identified perfectly, and only an approxima-
tion can be computed.
The best approximation is determined by developing and minimizing the
reconstruction error. According to the definition of the Euclidean norm (see
Subsection 4.2.1), Ecodec successively becomes

Ecodec = Ey {||y − W W^T y||_2^2}
       = Ey {(y − W W^T y)^T (y − W W^T y)}
       = Ey {y^T y − 2 y^T W W^T y + y^T W W^T W W^T y}
       = Ey {y^T y − 2 y^T W W^T y + y^T W W^T y}
       = Ey {y^T y − y^T W W^T y}
       = Ey {y^T y} − Ey {y^T W W^T y} ,    (2.13)

where the first term is constant. Hence, minimizing Ecodec turns out to maxi-
mize the term Ey {yT WWT y}. As only a few observations y(n) are available,
the latter expression is approximated by the sample mean:

Ey {y^T W W^T y} ≈ (1/N) Σ_{n=1}^{N} (y(n))^T W W^T (y(n))    (2.14)
                 ≈ (1/N) tr(Y^T W W^T Y) ,                    (2.15)
where tr(M) denotes the trace of some matrix M. To maximize this last
expression, Y has to be factored by singular value decomposition (SVD; see
Appendix A.1):
Y = VΣUT , (2.16)
where V, U are unitary matrices and where Σ is a matrix with the same
size as Y but with at most D nonzero entries σd , called singular values and
28 2 Characteristics of an Analysis Method

located on the first diagonal of Σ. The D singular values are usually sorted in
descending order. Substituting in the approximation of the expectation leads
to
Ey {y^T W W^T y} ≈ (1/N) tr(U Σ^T V^T W W^T V Σ U^T) .    (2.17)
Since the columns of V and U are orthonormal vectors by construction, it is
easy to see that

arg max_W tr(U Σ^T V^T W W^T V Σ U^T) = V I_{D×P} ,    (2.18)

for a given P (ID×P is a matrix made of the first P columns of the identity
matrix ID ). Indeed, the above expression reaches its maximum when the P
columns of W are colinear with the columns of V that are associated with the
P largest singular values in Σ. Additionally, it can be trivially proved that
Ecodec = 0 for W = V. In the same way, the contribution of a principal com-
ponent vd to Ecodec equals σ_d^2 , i.e., the squared singular value associated
with vd .
Finally, P -dimensional latent variables are approximated by computing
the product
x̂ = I_{P×D} V^T y .    (2.19)
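
In practice, Eqs. (2.16)–(2.19) translate into a few lines of linear algebra, as in the sketch below. It assumes a centered data matrix Y whose columns are the observations and an externally chosen dimension P; the toy data are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
D, P, N = 3, 2, 1000
Y = rng.normal(size=(D, N))
Y -= Y.mean(axis=1, keepdims=True)                    # PCA assumes centered observations

V, sigma, Ut = np.linalg.svd(Y, full_matrices=False)  # Y = V diag(sigma) U^T, Eq. (2.16)
X_hat = V[:, :P].T @ Y                                # Eq. (2.19), keeping the P leading columns of V
print(X_hat.shape)                                    # (P, N): estimated latent coordinates
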

Maximal preserved variance and decorrelation

This approach is due to Hotelling [92]. From a statistical point of view, it


can be assumed that the latent variables in x are uncorrelated (no linear
dependencies bind them). In practice, this means that the covariance matrix
of x, defined as
Cxx = E{xxT } , (2.20)
provided x is centered, is diagonal. However, after the axis change induced by
W, it is very likely that the observed variables in y are correlated, i.e., Cyy
is no longer diagonal. The goal of PCA is then to get back the P uncorrelated
latent variables in x. Assuming that the PCA model holds and the covariance
of y is known, we find that

Cyy = E{y y^T}             (2.21)
    = E{W x x^T W^T}       (2.22)
    = W E{x x^T} W^T       (2.23)
    = W Cxx W^T .          (2.24)

Since WT W = I, left and right multiplications by, respectively, WT and W


lead to
Cxx = WT Cyy W . (2.25)
Next, the covariance matrix Cyy can be factored by eigenvalue decomposition
(EVD; see Appendix A.2):
Cyy = VΛVT , (2.26)


where V is a matrix of normed eigenvectors vd and Λ a diagonal matrix
containing their associated eigenvalues λd , in descending order. Because the
covariance matrix is symmetric and semipositive definite, the eigenvectors are
orthogonal and the eigenvalues are nonnegative real numbers. Substituting in
Eq. (2.25) finally gives
Cxx = WT VΛVT W . (2.27)
This equality holds only when the P columns of W are taken colinear with P
columns of V, among D ones. If the PCA model is fully respected, then only
the first P eigenvalues in Λ are strictly larger than zero; the other ones are
zero. The eigenvectors associated with these P nonzero eigenvalues must be
kept:
W = VID×P , (2.28)
yielding
Cxx = IP ×D ΛID×P . (2.29)
This shows that the eigenvalues in Λ correspond to the variances of the latent
variables (the diagonal entries of Cxx ).
In real situations, some noise may corrupt the observed variables in y. As
a consequence, all eigenvalues of Cyy are larger than zero, and the choice of P
columns in V becomes more difficult. Assuming that the latent variables have
larger variances than the noise, it suffices to choose the eigenvectors associated
with the largest eigenvalues. Hence, the same solution as in Eq. (2.28) remains
valid, and the latent variables are estimated exactly as in Eq. (2.19).
If the global variance of y is defined as

σ_y^2 = tr(Cyy ) = Σ_{d=1}^{D} c_{d,d} = Σ_{d=1}^{D} λ_d ,    (2.30)

then the proposed solution is guaranteed to preserve a maximal fraction of the


global variance. From a geometrical point of view, the columns of V indicate
the directions in RD that span the subspace of the latent variables. If these
columns are called components, the choice of the columns associated with the
largest variances justifies the name “principal component analysis”.
To conclude, it must be emphasized that in real situations the true covari-
ance of y is not known but can be approximated by the sample covariance:
Ĉyy = (1/N) Y Y^T .    (2.31)
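
The variance-preservation view leads to the following sketch, which identifies W by the eigenvalue decomposition of the sample covariance (Eqs. (2.26)–(2.31)); the toy data and the value of P are again illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
D, P, N = 3, 2, 1000
Y = rng.normal(size=(D, N))
Y -= Y.mean(axis=1, keepdims=True)

C_yy = Y @ Y.T / N                      # sample covariance, Eq. (2.31)
lam, V = np.linalg.eigh(C_yy)           # eigh returns eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]          # re-sort in descending order, as in Eq. (2.26)
W = V[:, :P]                            # Eq. (2.28): keep the P leading eigenvectors
X_hat = W.T @ Y                         # estimated latent variables
print(lam)                              # variances of the principal components
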

2.4.3 Functionalities of PCA


The success of PCA finds an explanation not only in its simplicity but also in
the broad applicability of the method. Indeed, PCA easily makes available the
three main functionalities that a user may expect in a method that analyzes
high-dimensional data.

Intrinsic dimension estimation


If the model of PCA (Eq. (2.4)) is fully respected, then only the P largest
eigenvalues of Cyy will depart from zero. Hence, the rank of the covariance
matrix (the number of nonzero eigenvalues) indicates trivially the number of
latent variables. However, when having only a finite sample, the covariance can
only be approximated. Moreover, the data probably do not entirely respect the
PCA model (presence of noise, etc.). For all those reasons, all D eigenvalues
are often different from zero, making the estimation of the intrinsic dimension
more difficult.
A first way to determine it consists of looking at the eigenvalues as such.
Normally, if the model holds reasonably well, large (significant) eigenvalues
correspond to the variances of the latent variables, while smaller (negligible)
ones are due to noise and other imperfections. Ideally, a rather visible gap
should separate the two kinds of eigenvalues. The gap can be visualized by
plotting the eigenvalues in descending order: a sudden fall should appear right
after the P th eigenvalue. If the gap is not visible, plotting minus the logarithm
of the normalized eigenvalues may help:
0 ≤ − log(λ_d / λ_1) .    (2.32)
In this plot, the intrinsic dimension is indicated by a sudden ascent.
Unfortunately, when the data dimensionality D is high, there may also
be numerous latent variables showing a wide spectrum of variances. In the
extreme, the variances of latent variables can no longer be distinguished from
the variances related to noise. In this case, the intrinsic dimension P is cho-
sen so as to preserve at least an arbitrarily chosen fraction of the global
variance (Eq. (2.30)). For example, if it is assumed that the latent variables
bear 95% of the global variance, then P is the smallest integer such that the
inequality
0.95 ≤ ( Σ_{d=1}^{P} λ_d ) / ( Σ_{d=1}^{D} λ_d ) = tr(I_{P×D} Λ I_{D×P}) / σ_y^2    (2.33)
holds. Sometimes the threshold is set on individual variances instead of cu-
mulated ones. For example, all components having a variance lower than 1%
of σy2 are discarded. The best way to set the threshold consists of finding
a threshold that separates the significant variances from the negligible ones.
This turns out to be equivalent to the visual methods proposed in the previous
paragraph.
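
A minimal sketch of this selection rule is given below; the 95% threshold and the toy eigenvalue spectrum are arbitrary illustrative choices.

import numpy as np

def intrinsic_dim_from_variance(eigenvalues, fraction=0.95):
    # Smallest P such that the P largest eigenvalues preserve the given
    # fraction of the global variance, as in Eq. (2.33).
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulated = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cumulated, fraction) + 1)

lam = [4.0, 1.0, 0.05, 0.03, 0.02]         # two significant variances plus noise
print(intrinsic_dim_from_variance(lam))    # -> 2
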
More complex methods exist to set the frontier between the latent and
noise subspaces, such as Akaike’s information criterion [4, 126] (AIC), the
Bayesian information criterion [168] (BIC), and the minimum description
length [153, 126] (MDL). These methods determine the value of P on the
basis of information-theoretic considerations. Their particularization to PCA,
with a noisy model, is well described in [34].

Projection for dimensionality reduction

After the estimation of the intrinsic dimensionality, PCA reduces the dimen-
sion by projecting the observed variables onto the estimated latent subspace in
a linear way. Equation (2.19) shows how to obtain P -dimensional coordinates
from D-dimensional ones. In that equation, the dimensionality reduction is
achieved by the factor IP ×D , which discards the eigenvectors of V associ-
ated with the D − P smallest eigenvalues. On the other hand, the factor VT
ensures that the dimensionality reduction minimizes the loss of information.
Intuitively, this is done by canceling the linear dependencies between the ob-
served variables.

Projection for latent variable separation

Beyond reduction of data dimensionality, PCA can also separate latent vari-
ables under certain conditions. In Eq. (2.19), the separation is achieved by
the factor VT . As clearly stated in the PCA model, the observed variables
can only be a rotation of the latent ones, which have a Gaussian distribution.
These are, of course, very restrictive conditions that can be somewhat relaxed.
For example, in Eq. (2.4), the columns of W can be solely orthogonal
instead of orthonormal. In this case, the latent variables will be retrieved up
to a permutation and a scaling factor.
Additionally, if all latent variables have a Gaussian distribution but W is
any matrix, then PCA can still retrieve a set of variables along orthogonal
directions. The explanation is that a set of any linear combinations of Gaus-
sian distributions is always equivalent to a set of orthogonal combinations of
Gaussian distributions (see Appendix B).
From a statistical point of view, PCA decorrelates the observed variables
y by diagonalizing the (sample) covariance matrix. Therefore, without consid-
eration of the true latent variables, PCA finds a reduced set of uncorrelated
variables from the observed ones. Actually, PCA cancels the second-order
cross-cumulants, i.e., the off-diagonal entries of the covariance matrix.
Knowing that higher-order cumulants, like the skewness (third order) and
the kurtosis (fourth order), are null for Gaussian variables, it is not difficult
to see that decorrelating the observed variables suffices to obtain fully in-
dependent latent variables. If latent variables are no longer Gaussian, then
higher-order cumulants must be taken into account. This is what is done in
independent component analysis (ICA, [95, 34]), for which more complex algo-
rithms than PCA are able to cancel higher-order cross-cumulants. This leads
to latent variables that are statistically independent.
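
The decorrelation effect can be checked numerically, as in the sketch below: after projection on the principal components, the sample covariance of the estimated latent variables is diagonal. The mixing matrix and the latent variances are arbitrary assumptions, unrelated to the examples of the next sections.

import numpy as np

rng = np.random.default_rng(0)
X = np.diag([2.0, 1.0]) @ rng.normal(size=(2, 2000))   # Gaussian latent variables
A = np.array([[0.9, -0.3],
              [0.5,  0.7],
              [0.1,  0.4]])
Y = A @ X                                              # correlated observed variables
Y -= Y.mean(axis=1, keepdims=True)

lam, V = np.linalg.eigh(Y @ Y.T / Y.shape[1])
X_hat = V[:, ::-1][:, :2].T @ Y                        # two leading principal components
print(np.round(X_hat @ X_hat.T / X_hat.shape[1], 3))   # off-diagonal entries are (close to) zero
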

2.4.4 Algorithms

As already unveiled in Subsection 2.4.2, PCA is often implemented by


general-purpose algebraic procedures. The developments of PCA from the two
different criteria described in that subsection show that PCA can work in two
different ways:
• by SVD (singular value decomposition; see Appendix A.1) of the matrix
Y, containing the available sample.
• by EVD (eigenvalue decomposition; see Appendix A.2) of the sample co-
variance Ĉyy .
Obviously, both techniques are equivalent, at least if the singular values and
the eigenvalues are sorted in the same way:
Ĉyy = (1/N) Y Y^T                        (2.34)
     = (1/N) (V Σ U^T)(U Σ^T V^T)        (2.35)
     = V ((1/N) Σ Σ^T) V^T               (2.36)
     = V Λ V^T .                         (2.37)
By the way, the last equality shows the relationship between the eigenvalues
and the singular values: λ_d = σ_d^2 / N . From a numerical point of view, the SVD
of the sample is more robust because it works on the whole data set, whereas
EVD works only on the summarized information contained in the covariance
matrix. As a counterpart, from the computational point of view, SVD is more
expensive and may be very slow for samples containing many observations.
The use of algebraic procedures makes PCA a batch algorithm: all ob-
servations have to be known before PCA starts. However, online or adaptive
versions of PCA exist; several are described in [95, 34]. These implementations
do not offer the same strong guarantees as the algebraic versions, but may
be very useful in real-time applications, where computation time and memory
space are limited.
When estimating the latent variables, it must be pointed out that Eq. (2.19)
is not very efficient. Instead, it is much better to directly remove the unneces-
sary columns in V and to multiply by y afterwards, without the factor IP ×D .
And if PCA works by SVD of the sample, it is noteworthy that
X̂ = I_{P×D} V^T Y             (2.38)
   = I_{P×D} V^T V Σ U^T       (2.39)
   = I_{P×D} Σ U^T .           (2.40)

As Σ is diagonal, the cheapest way to compute X̂ consists of copying the first
P columns of U, multiplying them by the corresponding diagonal entry of Σ,
and, finally, transposing the result.
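
This shortcut can be checked numerically with the sketch below (the toy data are an arbitrary assumption): the truncated product Σ U^T of Eq. (2.40) coincides with the projection V^T Y computed explicitly.

import numpy as np

rng = np.random.default_rng(0)
D, P, N = 5, 2, 1000
Y = rng.normal(size=(D, N))
Y -= Y.mean(axis=1, keepdims=True)

V, sigma, Ut = np.linalg.svd(Y, full_matrices=False)
X_cheap = sigma[:P, None] * Ut[:P, :]     # Eq. (2.40): scale the first P rows of U^T
X_full = V[:, :P].T @ Y                   # Eq. (2.38), for comparison
print(np.allclose(X_cheap, X_full))       # True, up to numerical precision
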
As already mentioned, PCA assumes that the observed variables are cen-
tered. Sometimes data are also standardized, meaning that each variable yd
is scaled in order to have unit variance. This is usually done when the ob-
served variables come from various origins and have very different variances.
The standardization allows PCA not to consider observed variables with small
variances as being noise and not to discard them in the dimensionality reduc-
tion. On the other hand, the standardization can sometimes amplify variables
that are really negligible. User knowledge is very useful to decide if a scaling
is necessary.

2.4.5 Examples and limitations of PCA

In order to illustrate the capabilities as well as the limitations of PCA, toy ex-
amples may be artificially generated. For visualization’s sake, only two latent
variables are created; they are embedded in a three-dimensional space, i.e.,
three variables are observed. Three simple cases are studied here.

Gaussian variables and linear embedding

In this first case, the two latent variables, shown in the first plot of Fig. 2.5,
have Gaussian distributions, with variances 1 and 4. The observed variables,
displayed in the second plot of Fig. 2.5, are obtained by multiplying the latent
ones by
        ⎡ 0.2  0.8 ⎤
    W = ⎢ 0.4  0.5 ⎥ .    (2.41)
        ⎣ 0.7  0.3 ⎦
As the mixing process (i.e., the matrix W) is linear, PCA can perfectly reduce
the dimensionality. The eigenvalues of the sample covariance matrix are 0.89,
0.11, and 0.00. The number of latent variables is then clearly two, and PCA
reduces the dimensionality without any loss: the observed variables could be
perfectly reconstructed from the estimated latent variables shown in Fig. 2.6.
However, the columns of the mixing matrix W are neither orthogonal nor
normed. Consequently, PCA cannot retrieve exactly the true latent variables.
Yet as the latter have Gaussian distributions, PCA finds a still satisfying re-
sult: in Fig. 2.6, the estimated latent variables have Gaussian distributions
but are scaled and rotated. This is visible by looking at the schematic rep-
resentations of the distributions, displayed as solid and dashed ellipses. The
ellipses are almost identical, but the axes indicating the directions of the true
and estimated latent variables are different.
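
The sketch below reproduces the spirit of this first example. The random seed and the assignment of the variances 4 and 1 to x1 and x2 are assumptions, so the printed eigenvalues only approximate the values quoted above.

import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = np.diag([2.0, 1.0]) @ rng.normal(size=(2, N))   # Gaussian latent variables with variances 4 and 1
W = np.array([[0.2, 0.8],
              [0.4, 0.5],
              [0.7, 0.3]])
Y = W @ X                                           # linear mixing, Eqs. (2.4) and (2.41)

Yc = Y - Y.mean(axis=1, keepdims=True)
lam = np.linalg.eigvalsh(Yc @ Yc.T / N)[::-1]
print(np.round(lam / lam.sum(), 2))                 # close to [0.89, 0.11, 0.00]
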

Nonlinear embedding

In this second case, the two latent variables are the same as in the previous
case, but this time the mixing process is nonlinear:
        ⎡ 4 cos(x1/4) ⎤
    y = ⎢ 4 sin(x1/4) ⎥ .    (2.42)
        ⎣   x1 + x2   ⎦

Fig. 2.5. Two Gaussian latent variables (1000 observations, displayed in the first
plot) are embedded in a three-dimensional space by a linear mixing process (second
plot). The ellipse schematically represents the joint distribution in both spaces.

The observed variables are shown in Fig. 2.7. Faced with nonlinear dependen-
cies between observed variables, PCA does not detect that only two latent
variables have generated them: the normalized eigenvalues of the sample co-
variance matrix are 0.90, 0.05 and 0.05 again. The projection onto the first
two principal components is given in Fig. 2.8. PCA is unable to completely
reconstruct the curved object displayed in Fig. 2.7 with these two principal
components (the reconstruction would be strictly planar!).
Unfortunately, the result of the dimensionality reduction is not the only
disappointing aspect. Indeed, the estimated latent variables are completely
different from the true ones. The schematic representation of the distribution
is totally deformed.
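
For comparison, the sketch below applies the nonlinear mixing of Eq. (2.42) to the same kind of latent variables (with the same assumption on their variances as in the previous sketch); the three eigenvalues found by PCA now all depart from zero.

import numpy as np

rng = np.random.default_rng(0)
N = 1000
x1 = 2.0 * rng.normal(size=N)                        # latent variable with variance 4
x2 = 1.0 * rng.normal(size=N)                        # latent variable with variance 1
Y = np.vstack([4.0 * np.cos(x1 / 4.0),
               4.0 * np.sin(x1 / 4.0),
               x1 + x2])                             # nonlinear mixing, Eq. (2.42)

Yc = Y - Y.mean(axis=1, keepdims=True)
lam = np.linalg.eigvalsh(Yc @ Yc.T / N)[::-1]
print(np.round(lam / lam.sum(), 2))                  # all three normalized eigenvalues are nonzero
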

Fig. 2.6. Projection of the three-dimensional observations (second plot of Fig. 2.5)
onto the two first principal components found by PCA. The solid line shows a
schematic representation of the true latent distribution, whereas the dashed one
corresponds to the estimated latent variables.

Non-Gaussian distributions

In this third and last case, the two latent variables are no longer Gaussian. As
shown in the first plot of Fig. 2.9, their distribution is uniform. On the other
hand, the mixing process is exactly the same as in the first case. Therefore, the
eigenvalues of the sample covariance matrix are also identical (actually, very
close, depending on the sample). The dimensionality reduction is performed
without loss. However, the latent variable separation becomes really problem-
atic. As in the first case, the estimated latent variables shown in Fig. 2.10 are
rotated and scaled because the columns of W are not orthonormal. But be-
cause the latent variables have uniform distributions and not Gaussian ones,
the scale factors and rotations make the estimated latent variables no longer
uniformly distributed!

Concluding remarks about PCA

In the ideal situation, when its model is fully respected, PCA appears as
a very versatile method to analyze data. It determines data dimensional-
ity, builds an embedding accordingly, and retrieves the latent variables. In
practice, however, the PCA model relies on assumptions that are much too
restrictive, especially when it comes to latent variable separation. When only
dimensionality reduction is sought, the sole remaining but still annoying as-
sumption imposes that the dependencies between the observed variables are
(not far from being) linear.
The three toy examples detailed earlier clearly demonstrate that PCA is
not powerful enough to deal with complex data sets. This suggests design-
ing other methods, maybe at the expense of PCA's simplicity and versatility.

Fig. 2.7. Two Gaussian latent variables (1000 observations, displayed in the first
plot) are embedded in a three-dimensional space by a nonlinear mixing process
(second plot). The ellipse schematically represents the joint distribution in both
spaces.

Two directions may be explored: true latent variable separation or merely


dimensionality reduction. In the former case, requirements are high since an
embedding has to be found that not only reduces the data dimensionality
but also recovers the latent variables. If a linear model is kept, PCA can be
refined in order to process non-Gaussian distributions; this leads to ICA for
example. If the model cannot be assumed to be linear then latent variable
separation becomes theoretically very difficult, at least when relying on sta-
tistical concepts like independence. In cases where a nonlinear model must be
considered, the data analysis most often cannot go further than dimensional-
ity reduction. Extending PCA to nonlinear models still remains an appealing
challenge. Chapters 4 and 5 deal with pure dimensionality reduction. With-
out the necessity of retrieving exactly the latent variables, more freedom is
left and numerous models become possible. Nevertheless, most of them rely

Fig. 2.8. Projection of the three-dimensional observations (second plot of Fig. 2.7)
onto the first two principal components found by PCA. The solid line shows a
schematic representation of the true latent distribution, whereas the dashed one
corresponds to the estimated latent variables.

on geometrical considerations, ranging from simple distance measurements
(Chapter 4) to more complex ideas from topology (Chapter 5).

2.5 Toward a categorization of DR methods

Before a detailed description of several DR methods, this section proposes


a (nonexhaustive) list of several qualifications that will help to characterize
and categorize the various methods. These qualifications mainly regard the
purpose of the method, its underlying model, the mathematical criterion to be
optimized and the algorithm that performs the optimization. Twelve possible
qualifications are for instance
• hard vs. soft dimensionality reduction,
• traditional vs. generative model,
• linear vs. nonlinear model,
• continuous vs. discrete model,
• implicit vs. explicit mapping,
• integrated vs. external estimation of the dimensionality,
• layered vs. standalone embeddings,
• single vs. multiple coordinate systems,
• optional vs. mandatory vector quantization,
• batch vs. online algorithm,
• exact vs. approximate optimization,
• the type of criterion to be optimized.
The remainder of this section describes these features in detail.

Fig. 2.9. Two uniform latent variables (1000 observations, displayed in the first
plot) are embedded in a three-dimensional space by a linear mixing process (second
plot). The rectangle schematically represents the joint distribution in both spaces.

2.5.1 Hard vs. soft dimensionality reduction

The problem of dimensionality reduction arises in many different applications.


Depending on them, a first distinction between methods regards the ratio
between the initial dimension of data and the desired dimension after re-
embedding [33].
Hard dimensionality reduction is suited for problems in which the data
have a dimension ranging from hundreds to maybe hundreds of thousands of
variables. In such a case, a drastic dimensionality reduction is usually sought,
possibly one of several orders of magnitude. The variables are often massively
repeated measures of a certain phenomenon of interest in different points of

Fig. 2.10. Projection of the three-dimensional observations (second plot of Fig. 2.9)
onto the first two principal components found by PCA. The solid line shows a
schematic representation of the true latent distribution, whereas the dashed one
corresponds to the estimated latent variables.

space (or at different instants over time). To this class belong classification
and pattern recognition problems involving images or speech. Most often,
the already difficult situation resulting from the huge number of variables is
complicated by a low number of available samples. Methods with a simple
model and few parameters like PCA are very effective for hard dimensionality
reduction.
Soft dimensionality reduction is suited for problems in which the data are
not too high-dimensional (less than a few tens of variables). Then no drastic
dimensionality reduction is needed. Usually, the components are observed or
measured values of different variables, which have a straightforward interpre-
tation. Many statistical studies and opinion polls in domains like social sci-
ences and psychology fall in this category. By comparison with hard problems
described above, these applications usually deal with sufficiently large sample
sizes. Typical methods include all the usual tools for multivariate analysis.
Finally, visualization problems lie somewhere in-between the two previ-
ous classes. The initial data dimensionality may equal any value. The sole
constraint is to reduce it to one, two, or three dimensions.

2.5.2 Traditional vs. generative model

The model associated with a method actually refers to the way the method
connects the latent variables with the observed ones. Almost each method
makes different assumptions about this connection. It is noteworthy that this
connection can go in both directions: from the latent to the observed variables
or from the observed to the latent variables. Most methods use the second
solution, which is the simplest and most used one, since the method basically
goes in the same direction: the goal is to obtain an estimation of the latent
variables starting from the observed ones. More principled methods prefer the
first solution: they model the observed variables as a function of the unknown
latent variables. This more complex solution better corresponds to the real
way data are generated but often implies that those methods must go back
and forth between the latent and observed variables in order to determine the
model parameters. Such generative models are seldom encountered in the field
of dimensionality reduction.

2.5.3 Linear vs. nonlinear model

The distinction between methods based on a linear or a nonlinear model is


probably the straightest way to classify them. This explains why methods us-
ing a linear (resp., nonlinear) model are simply called linear (resp., nonlinear)
methods. Both linear and nonlinear dimensionality reduction are denoted by
the acronyms LDR and NLDR, respectively, in the remainder of this book.
Nonlinear methods are often more powerful than linear ones, because the
connection between the latent variables and the observed ones may be much
richer than a simple matrix multiplication. On the other hand, their model
often comprises many parameters, whose identification requires large amounts
of data.
For example, if PCA projects D-dimensional vectors onto a P -dimensional
plane, the model comprises O(P D) parameters, and already P + 1 points
suffice to entirely determine them. For a nonlinear method like an SOM (see
Subsection 5.2.1), the number of parameters grows much more quickly: 3P D
parameters hardly suffice to describe a basic nonlinear model.

2.5.4 Continuous vs. discrete model

A third distinction about the data model regards its continuity. For example,
the model of PCA given in Section 2.4, is continuous: it is a linear transform of
the variables. On the other hand, the model of an SOM is discrete: it consists
of a finite set of interconnected points.
The continuity is a very desirable property when the dimensionality re-
duction must be generalized to other points than those used to determine the
model parameters. When the model is continuous, the dimensionality reduc-
tion is often achieved by using a parameterized function or mapping between
the initial and final spaces. In this case, applying the mapping to new points
yields their coordinates in the embedding. With only a discrete model, new
points cannot be so easily re-embedded: an interpolation procedure is indeed
necessary to embed in-between points.

2.5.5 Implicit vs. explicit mapping

A fourth distinction about the model, which is closely related to the previous
one, regards the way a method maps the high- and low-dimensional spaces.
Mainly two classes of mappings exist: explicit and implicit.
An explicit mapping consists of directly associating a low-dimensional rep-
resentation with each data point. Hence, using an explicit mapping clearly
means that the data model is discrete and that the generalization to new
points may be difficult. Sammon’s nonlinear mapping (see Subsection 4.2.3)
is a typical example of explicit mapping. Typically, the parameters of such a
mapping are coordinates, and their number is proportional to the number of
observations in the data set.
On the other hand, an implicit mapping is defined as a parameterized
function. For example, the parameters in the model of PCA define a hyper-
plane. Clearly, there is no direct connection between those parameters and
the coordinates of the observations stored in the data set. Implicit mappings
often originate from continuous models, and generalization to new points is
usually straightforward.
A third intermediate class of mappings also exists. In this class may be
gathered all models that define a mapping by associating a low-dimensional
representation not to each data point, but to a subset of data points. In this
case, the number of low-dimensional representations does not depend strictly
on the number of data points, and the low-dimensional coordinates may be
considered as generic parameters, although they have a straightforward geo-
metric meaning. All DR methods, like SOMs, that involve some form of vector
quantization (see Subsection 2.5.9 ahead) belong to this class.

2.5.6 Integrated vs. external estimation of the dimensionality

When considering dimensionality reduction, a key point to discuss is the pres-


ence of an estimator of the intrinsic dimensionality. The case of PCA appears
as an exception, as most other methods are deprived of an integrated es-
timator. Actually, they take the intrinsic dimensionality as an external hy-
perparameter to be given by the user. In that sense, this hyper-parameter
is preferably called the embedding dimension(ality) rather than the intrinsic
dimensionality of data. This is justified by the fact that the user may wish
to visualize the data on a plane and thus force a two-dimensional embedding
even if the intrinsic dimensionality is actually higher. On the contrary, meth-
ods that are not powerful enough to embed highly curved P -manifolds may
need more than P dimensions to work properly.
Anyway, if a dimensionality estimator is not integrated in the dimension-
ality reduction itself, this functionality must be performed by an external
procedure. Some typical techniques to estimate the intrinsic dimensionality of
manifolds are described in Chapter 3.

2.5.7 Layered vs. standalone embeddings

When performing PCA on data, all embeddings of dimensionality ranging


between 1 and D are computed at once. The different embeddings are obtained
by removing the coordinates along one or several of the D eigenvectors. Since
the eigenvectors are orthogonal by construction, the removal of an eigenvector
obviously does not change the coordinates along the kept ones. In other words,
PCA proves to be an incremental method that produces layered embeddings:
adding or removing a dimension does not require any change of the coordinates
along the other dimensions. For instance, computing a 2D embedding can be
done by taking the leading eigenvector, which specifies coordinates along a first
dimension, and then the second eigenvector, in decreasing order of eigenvalue
magnitude. If a 3D embedding is needed, it suffices to retrieve the 2D one,
take the third eigenvector, and append the corresponding coordinates.
All methods that translate the dimensionality reduction into an eigenprob-
lem and assemble eigenvectors to form an embedding share this capability and
are called spectral methods. To some extent, the embeddings of different di-
mensionality provided by spectral methods are not independent since coordi-
nates are added or removed but never changed when the target dimensionality
increases or decreases.
In contrast, methods relying on other principles (nonspectral methods)
do not offer such comfort. When they compute an embedding of a given
dimensionality, only that specific embedding is determined. If the target di-
mensionality changes, then all coordinates of the new embedding must be
computed again. This means that the method builds standalone embeddings
for each specified dimensionality. Although such methods seem to waste com-
putational power, the independence of the embeddings can also be an advan-
tage: for each dimensionality, the embedding is specifically optimized.

2.5.8 Single vs. multiple coordinate systems

Strictly speaking, dimensionality reduction does not imply establishing a low-dimensional representation in a single system of coordinates. For example, a
natural extension of PCA to a nonlinear model consists in dividing a nonlinear
manifold into small pieces, like a (curved) jigsaw. If these pieces are small
enough, they may be considered as linear and PCA may be applied to each
of them. Obviously, each PCA is independent from the others, raising the
difficulty of patching together the projections of each piece of the manifold in
the low-dimensional space.
One of the rare applications for which multiple systems of coordinates raise
no difficulties is data compression. But in this domain more specific methods
exist that reach a better compression rate by considering the data as a binary
flow instead of coordinates in a Cartesian space.
Most other applications of dimensionality reduction, like visualization,
usually need a single system of coordinates. Indeed, if the data lie on a smooth

manifold, what would be the justification to cut it into small pieces? One of
the goals of dimensionality reduction is precisely to discover how the different
parts of a manifold connect to each other. When using disconnected pieces,
part of that information is lost, and furthermore it becomes impossible to
visualize or to process the data as a whole.
The most widely known method using several coordinates systems is cer-
tainly the local PCA introduced by Kambhatla and Leen [101]. They perform a simple vector quantization (see ahead or Appendix D for more details) on the data in order to obtain the tessellation of the manifold. More recent papers
follow a similar approach but also propose very promising techniques to patch
together the manifold pieces in the low-dimensional embedding space. For ex-
ample, a nonparametric technique is given in [158, 166], whereas probabilistic
ones are studied in [159, 189, 178, 29].

2.5.9 Optional vs. mandatory vector quantization

When the amount of available data is very large, the user may decide to work
with a smaller set of representative observations. This operation can be done
automatically by applying a method of vector quantization to the data set
(see App. D for more details). Briefly, vector quantization replaces the origi-
nal observations in the data set with a smaller set of so-called prototypes or
centroids. The goal of vector quantization consists of reproducing as well as
possible the shape of the initial (discrete) data distribution with the proto-
types.
Unfortunately, the ideal case where data are overabundant seldom happens
in real applications. Therefore, the user often skips the vector quantization
and keeps the initial data set. However, some methods are designed in such
a way that vector quantization is mandatory. For example, SOMs belong to
this class (see Chapter 5).

2.5.10 Batch vs. online algorithm

Depending on the application, data observations may arrive consecutively or, alternatively, the whole data set may be available at once. In the first case, an
online algorithm is welcome; in the second case, an offline algorithm suffices.
More precisely, offline or “batch” algorithms cannot work until the whole set
of observations is known. On the contrary, online algorithms typically work
with no more than a single observation at a time.
For most methods, the choice of the model largely orients the implemen-
tation toward one or the other type of algorithm. Generally, the simpler the
model is, the more freedom is left in the implementation. For example, as
already seen in Section 2.4, the simple model of PCA naturally leads to a
simple batch procedure, although online variants are possible as well. On the
other hand, the more complex model of a SOM favors an online algorithm (a
batch version also exists, but it is seldom used).

Actually, the behavior of true online algorithms is rather complex, especially when the data sequence does not fulfill certain conditions (like sta-
tionarity or ergodicity). For this reason, batch algorithms are usually pre-
ferred. Fortunately, most online algorithms can be made batch ones using
the Robbins–Monro procedure [156]. The latter simulates a possibly infinite
data sequence by repeating a finite set of observations; by construction, this
method ensures some kind of stationarity. Each repetition is called an epoch; if
the order of the available observations does not matter, then they are usually
randomly permuted before each epoch, in order to avoid any influence on the
algorithm convergence. The Robbins–Monro procedure guarantees the con-
vergence on a solution if the algorithm satisfies certain conditions as epochs
go by. These conditions typically regard parameters like step sizes, learning
rates, and other time-varying parameters of online algorithms.
In this book, most algorithms that are said to be online are, in fact, im-
plemented with the Robbins–Monro procedure for the various examples and
comparisons. More details can be found in Appendix C.2.

2.5.11 Exact vs. approximate optimization

The kind of optimization that an algorithm can achieve is closely related to its offline or online character. Most often, batch algorithms result from some
analytical or algebraic developments that give the solution in closed form, like
PCA. Given a finite data set, which is known in advance, a batch algorithm
like PCA can be proved to compute the optimal solution.
In contrast, online or adaptive algorithms are often associated with generic optimization procedures like stochastic gradient descent (see App. C.2). Such procedures do not offer strong guarantees about the result: the convergence may fail. Nonetheless, online algorithms transformed into batch ones
by the Robbins–Monro procedure are guaranteed to converge on a solution,
after a certain number of epochs.3 Unfortunately, this solution may be a local
optimum.
Although they are slow and not optimal, generic optimization procedures
have a major advantage: they are able to optimize rather complicated objec-
tive functions. On the other hand, solutions in closed form are only available
for very simple objective functions; the latter ones are typically required to
be not only differentiable but also concave (or convex).

2.5.12 The type of criterion to be optimized

Last but not least, the criterion that guides the dimensionality reduction is
probably the most important characteristic of a DR method, even before the
model specification. Actually, the data model and the algorithm are often
fitted in order to satisfy the constraints imposed by the chosen criterion.
3 An infinite number, according to the theory; see [156].

As already mentioned, the domain of dimensionality reduction is mainly motivated by geometrical considerations. From this point of view, data are
interpreted as a cloud of points in a geometrical space; these points are often
assumed to lie on a smooth manifold. A way to characterize a manifold in-
dependently from any coordinate system consists of discovering and making
explicit the relationships between the points of the manifold. Human percep-
tion naturally focuses on proximity relations: when looking at a point in the
manifold, the eyes immediately distinguish neighboring points from those lying
farther away. Therefore, a good embedding of the manifold should reproduce
those proximity relationships as well as possible.
Formally, this means that a criterion for a good DR method should effi-
ciently measure the proximities observed in data and quantify their preser-
vation in the embedding. There are two main ways to measure proximities
between points, as detailed below.
The most straightforward one consists of computing pairwise distances between the
points. As an advantage, distances are scalar values that can be easily com-
pared to each other. Thus, a possible criterion for dimensionality reduction is
distance preservation: the pairwise distances measured between the embedded
points should be as close as possible to the ones measured between the initial
data points. The next chapter explores this direction and shows how distance
preservation can be translated into various objective functions.
On the other hand, a less intuitive but more satisfying way to measure
proximities would be qualitative only. In this case, the exact value of the
distances does not matter: for example, one just knows that, “From point a,
point b is closer than point c.” The translation of such qualitative concepts
into an objective function proves more difficult than for distances. Chapter 5
presents solutions to this problem.
3 Estimation of the Intrinsic Dimension

Overview. This chapter introduces the concept of intrinsic dimension along with several techniques that can estimate it. Briefly put, the in-
trinsic dimension can be interpreted as the number of latent variables,
which is often smaller than the number of observed variables. The es-
timation of the number of latent variables is an essential step in the
process of dimensionality reduction, because most DR methods need
that number as an external and user-defined parameter. Many esti-
mators of the intrinsic dimension come from fractal geometry. Other
estimators described in this chapter are related to PCA or based on
a trial-and-error approach. The chapter ends by applying the various
estimators to some simple toy examples.

3.1 Definition of the intrinsic dimension


In a few words, the intrinsic dimension(ality) of a random vector y is usually
defined as the minimal number of parameters or latent variables needed to
describe y. Although this informal definition seems to be clear in practice, it
is formally ambiguous due to the existence of strange geometrical objects like
space-filling curves (see an example in Fig. 3.1).
A better definition uses the classical concept of topological dimension [200]:
the intrinsic dimension of y equals the topological dimension of the support
Y of the distribution of y. The definition requires some additional notions.
Given a topological space Y, a covering of a subset S is a collection C of open subsets in Y whose union contains S. A refinement of a covering C of S is another covering C′ such that each set in C′ is contained in some set in C. Because a D-dimensional set can be covered by open balls such that each point belongs to at most (D + 1) open balls, the following statement holds: a subset S of a topological space Y has topological dimension Dtop (a.k.a. Lebesgue covering dimension) if every covering C of S has a refinement C′ in which every point of S belongs to at most (Dtop + 1) sets in C′, and Dtop is

Fig. 3.1. A space-filling curve. This curve, invented by Hilbert in 1891 [86], is a one-dimensional object that evolves iteratively and progressively fills a square (a two-dimensional object!). The first six iteration steps displayed show how the curve is successively refined, folded on itself much like a cabbage leaf.

the smallest such integer. For example, the Lebesgue covering dimension of
the usual Euclidean space RD is D.
Technically, the topological dimension is very difficult to estimate if only
a finite set of points is available. Hence, practical methods use various other
definitions of the intrinsic dimension. The most usual ones are related to
the fractal dimension, whose estimators are studied in Section 3.2. Other
definitions are based on DR methods and are summarized in Section 3.3.
Before going into further details, it is noteworthy that the estimation of
the intrinsic dimension should remain coherent with the DR method: an es-
timation of the dimension with a nonlinear model, like the fractal dimension,
makes no sense if the dimensionality reduction uses a linear model, like PCA.

3.2 Fractal dimensions


A usual generalization of the topological dimension is the fractal dimension.
While the topological dimension defined above regards topological subsets like
manifolds and yields an integer value, the fractal dimension relates to so-called
fractal objects and is a real number.
Actually, the adjective fractal designates objects or quantities that display
self-similarity, in a somewhat technical sense, on all scales. The object does
not need to exhibit exactly the same structure on all scales, but the same type
of structures must appear [200]. A classical example of a fractal object is a
coastline. Figure 3.2 illustrates the coastline of Koch’s island. As pointed out

by Mandelbrot in his pioneering work [131, 132, 133], the length of such a
coastline is different depending on the length ruler used to measure it. This
paradox is known as the coastline paradox: the shorter the ruler, the longer
the length measured.
The term fractal dimension [200] sometimes refers to what is more com-
monly called the capacity dimension (see Subsection 3.2.2). However, the term
can also refer to any of the dimensions commonly used to characterize frac-
tals, like the capacity dimension, the correlation dimension, or the information
dimension. The q-dimension unifies these three dimensions.

3.2.1 The q-dimension


Let μ be a Borel probability measure on a metric space Y (a space provided with a metric or distance function; see Section 4.2 for more details). For q ≥ 0 and ε > 0, one defines

    C_q(\mu, \epsilon) = \int \left[ \mu(\bar{B}_\epsilon(\mathbf{y})) \right]^{q-1} d\mu(\mathbf{y}) ,    (3.1)

where B̄_ε(y) is the closed ball of radius ε centered on y. Then, according to Pesin's definition [151, 152], for q ≥ 0, q ≠ 1, the lower and upper q-dimensions of μ are

    D_q^-(\mu) = \liminf_{\epsilon \to 0} \frac{\log C_q(\mu, \epsilon)}{(q - 1) \log \epsilon} ,    (3.2)

    D_q^+(\mu) = \limsup_{\epsilon \to 0} \frac{\log C_q(\mu, \epsilon)}{(q - 1) \log \epsilon} .    (3.3)
If Dq− (μ) = Dq+ (μ), their common value is denoted Dq (μ) and is called the
q-dimension of μ. It is expected that Dq (μ) exists for sufficiently regular frac-
tal measures (smooth manifolds trivially fulfill this condition). For such a
measure, the function q → Dq (μ) is called the dimension spectrum of μ.
An alternative definition for D_q^-(μ) and D_q^+(μ) originates from the physics literature [83]. For ε > 0, instead of using closed balls, the support of μ is covered with a (multidimensional) grid of cubes with edge length ε. Let N(ε) be the number of cubes that intersect the support of μ, and let the natural measures of these cubes be p_1, p_2, . . . , p_{N(ε)}. Since the p_i may be seen as the probability that these cubes are populated, they are normalized:

    \sum_{i=1}^{N(\epsilon)} p_i = 1 .    (3.4)

Then

    D_q^-(\mu) = \liminf_{\epsilon \to 0} \frac{\log \sum_{i=1}^{N(\epsilon)} p_i^q}{(q - 1) \log \epsilon} ,    (3.5)

    D_q^+(\mu) = \limsup_{\epsilon \to 0} \frac{\log \sum_{i=1}^{N(\epsilon)} p_i^q}{(q - 1) \log \epsilon} .    (3.6)

Fig. 3.2. Koch's island (or snowflake) [200]. This classical fractal object was first described by Helge von Koch in 1904. As shown in the bottom of the figure, it is built by starting with an equilateral triangle, removing the inner third of each side, replacing it with two edges of a three-times-smaller equilateral triangle, and then repeating the process indefinitely. This recursive process can be encoded as a Lindenmayer system (a kind of grammar) with initial string S(0) = 'F−−F−−F' and string-rewriting rule 'F' → 'F+F−−F+F'. In each string S(i), 'F' means "Go forward and draw a line segment of given length", '+' means "Turn to the left with angle (1/3)π" and '−' means "Turn to the right with angle (1/3)π". The drawings corresponding to strings S(0) to S(3) are shown in the bottom of the figure, whereas the main representation is the superposition of S(0) to S(4). Koch's island is a typical illustration of the coastline paradox: the length of the island boundary depends on the ruler used to measure it; the shorter the ruler, the longer the coastline. Actually, it is easy to see that for the representation of S(i), the length of the line segment is L(i) = L_∇ (1/3)^i, where L_∇ = L(0) is the side length of the initial triangle. Similarly, the number of corners is N(i) = 3 · 4^i. Then the true perimeter associated with S(i) is l(i) = L(i)N(i) = 3 L_∇ (4/3)^i.
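As a purely illustrative aside (a sketch written for this text, not part of the book), the string-rewriting construction of the caption takes only a few lines of Python; it expands the Lindenmayer string and checks the corner count N(i) = 3 · 4^i and the perimeter 3 L_∇ (4/3)^i quoted above, with L_∇ set to 1 by assumption.

    # Illustrative sketch: expand the Lindenmayer system of Koch's island and
    # verify the corner count and perimeter given in the caption of Fig. 3.2.
    def koch_string(iterations):
        s = "F--F--F"                        # initial string S(0)
        for _ in range(iterations):
            s = s.replace("F", "F+F--F+F")   # string-rewriting rule 'F' -> 'F+F--F+F'
        return s

    L_init = 1.0                             # side length of the initial triangle (assumed)
    for i in range(5):
        n_corners = koch_string(i).count("F")        # one corner per drawn segment
        segment = L_init * (1.0 / 3.0) ** i          # L(i) = L_init * (1/3)^i
        print(i, n_corners == 3 * 4 ** i, n_corners * segment)  # perimeter 3*(4/3)^i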

For q ≥ 0, q ≠ 1, these limits do not depend on the choice of the ε-grid, and give the same values as Eqs. (3.2) and (3.3). The main advantage of that second definition is that its associated formulas are more tractable in practice due to the use of sums and grids instead of integrals and balls.

3.2.2 Capacity dimension

When setting q equal to zero in the second definition (Eqs. (3.5) and (3.6)) and assuming that the equality D_q^-(μ) = D_q^+(μ) holds, one gets the capacity dimension [200, 152]:

    d_{\mathrm{cap}} = D_0(\mu) = \lim_{\epsilon \to 0} \frac{\log \sum_{i=1}^{N(\epsilon)} p_i^0}{(0 - 1) \log \epsilon}
                     = -\lim_{\epsilon \to 0} \frac{\log \sum_{i=1}^{N(\epsilon)} 1}{\log \epsilon}
                     = -\lim_{\epsilon \to 0} \frac{\log N(\epsilon)}{\log \epsilon} .    (3.7)
In this definition, d_cap does not depend on the natural measures p_i. In practice, d_cap is also known as the "box-counting" dimension [200]. When the manifold is not known analytically and only a few data points are available, the capacity dimension is quite easy to estimate:
1. Determine the hypercube that circumscribes all the data points.
2. Decompose the obtained hypercube into a grid of smaller hypercubes with edge length ε (these "boxes" explain the name of the method).
3. Determine N(ε), the number of hypercubes that are occupied by one or several data points.
4. Apply the log function, and divide by log ε.
5. Compute the limit when ε tends to zero; this is d_cap.
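As an illustration only, a minimal Python sketch of this recipe is given below (Y is assumed to hold the N×D data points); as explained in Subsection 3.2.6, the limit of step 5 is replaced in practice by a slope in a log-log plot.

    import numpy as np

    def box_counts(Y, epsilons):
        """Steps 1-3: number of occupied boxes N(eps) for several edge lengths eps."""
        Y = np.asarray(Y, dtype=float)
        origin = Y.min(axis=0)                 # corner of the circumscribing hypercube
        counts = []
        for eps in epsilons:
            boxes = np.floor((Y - origin) / eps).astype(int)  # box index of each point
            counts.append(len(np.unique(boxes, axis=0)))      # occupied boxes only
        return np.array(counts)

    # Steps 4-5: instead of letting eps tend to zero, fit the slope of log N(eps)
    # versus log eps over an intermediate range of scales; minus the slope then
    # approximates the capacity dimension d_cap.
    # epsilons = np.logspace(-2, 0, 20)
    # d_cap = -np.polyfit(np.log(epsilons), np.log(box_counts(Y, epsilons)), 1)[0]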
Unfortunately, the limit to be computed in the last step appears to be the
sole obstacle to the overall simplicity of the technique. Subsection 3.2.6 below
gives some hints to circumvent this obstacle.
The intuitive interpretation of the capacity dimension is the following. As-
suming a three-dimensional space divided into small cubic boxes with a fixed edge length ε, the box-counting dimension is closely related to the propor-
tion of occupied boxes. For a growing one-dimensional object placed in this
compartmentalized space, the number of occupied boxes grows proportion-
ally to the object length. Similarly, for a growing two-dimensional object, the
number of occupied boxes grows proportionally to the object surface. Finally,
for a growing three-dimensional object, the number of occupied boxes grows
proportionally to the object volume. Generalizing to a P -dimensional object
like a P-manifold embedded in R^D, one gets

    N(\epsilon) \propto \epsilon^{-P} .    (3.8)

And, trivially,

    P \propto -\frac{\log N(\epsilon)}{\log \epsilon} .    (3.9)
To complete the analogy, the hypothesis of a growing object has to be replaced
with the reciprocal one: the size of the object remains unchanged but the edge
length ε of the boxes decreases, yielding the precise estimate of the dimension
at the limit.
As an illustration, the capacity dimension can be computed analytically
for the coastline of Koch’s island (Fig. 3.2). In this particular case, the devel-
opment is made easier by choosing triangular boxes. As shown in the caption
to Fig. 3.2, the length of the line segment for the ith iteration of the Lin-
denmayer system is L(i) = L_∇ 3^{-i}, where L_∇ = L(0) is the edge length of the initial triangle. The number of corners is N(i) = 3 · 4^i. It is easy to see that if the grid length ε equals L(i), the number of occupied boxes is N(i).
Consequently,

    d_{\mathrm{cap}} = -\lim_{i \to \infty} \frac{\log N(i)}{\log L(i)}
                     = -\lim_{i \to \infty} \frac{\log(3 \cdot 4^i)}{\log(L_\nabla 3^{-i})}
                     = -\lim_{i \to \infty} \frac{\log 3 + i \log 4}{\log L_\nabla - i \log 3}
                     = \frac{\log 4}{\log 3} = 1.261859507 .    (3.10)

3.2.3 Information dimension

The information dimension corresponds to the case where q = 1. Care must be taken in order to avoid a denominator equal to zero in Eqs. (3.5) and (3.6).
Assuming again that equality D_q^-(μ) = D_q^+(μ) holds, it follows that

    d_{\mathrm{inf}} = \lim_{q \to 1} D_q(\mu) = \lim_{q \to 1} \lim_{\epsilon \to 0} \frac{\log \sum_{i=1}^{N(\epsilon)} p_i^q}{(q - 1) \log \epsilon}    (3.11)
                     = \lim_{\epsilon \to 0} \frac{1}{\log \epsilon} \lim_{q \to 1} \frac{\log \sum_{i=1}^{N(\epsilon)} p_i^q}{q - 1} .    (3.12)

Since the p_i are normalized (Eq. (3.4)), the numerator of the right factor trivially tends to zero:

    \lim_{q \to 1} \log \sum_{i=1}^{N(\epsilon)} p_i^q = \log \sum_{i=1}^{N(\epsilon)} p_i = \log 1 = 0 ,    (3.13)

and so does the denominator:

    \lim_{q \to 1} (q - 1) = 0 .    (3.14)

Hence, using l’Hospital’s rule, the numerator and denominator can be replaced
with their respective derivatives:
    d_{\mathrm{inf}} = \lim_{q \to 1} D_q(\mu) = \lim_{\epsilon \to 0} \frac{1}{\log \epsilon} \lim_{q \to 1} \frac{\sum_{i=1}^{N(\epsilon)} p_i^q \log p_i}{1}
                     = \lim_{\epsilon \to 0} \frac{\sum_{i=1}^{N(\epsilon)} p_i \log p_i}{\log \epsilon} .    (3.15)
It is noteworthy that the numerator in the last expression resembles Shannon’s
entropy in information theory [40], justifying the name of D1 (μ).
The information dimension is mentioned here just for the sake of complete-
ness. Because the pi are seldom known when dealing with a finite number of
samples, its evaluation remains difficult, except when the pi are assumed to
be equal, meaning that all occupied boxes have the same probability to be
visited:

    \forall i, \quad p_i = \frac{1}{N(\epsilon)} .    (3.16)
In this case, it turns out that the information dimension reduces to the ca-
pacity dimension:
    d_{\mathrm{inf}} = \lim_{\epsilon \to 0} \frac{\sum_{i=1}^{N(\epsilon)} N(\epsilon)^{-1} \log N(\epsilon)^{-1}}{\log \epsilon}
                     = -\lim_{\epsilon \to 0} \frac{\log N(\epsilon)}{\log \epsilon}
                     = d_{\mathrm{cap}} .    (3.17)

3.2.4 Correlation dimension

The correlation dimension, introduced by Grassberger and Procaccia [76], corresponds to the case where q = 2. The term correlation refers to the fact that the probabilities or natural measures p_i are squared. In contrast with both the capacity and information dimensions, the derivation of the correlation dimension is easier starting from the first definition of the q-dimension (Eqs. (3.2) and (3.3)). When the manifold or fractal object is only known by a countable set of points Y = {y(1), . . . , y(n), . . . , y(N)}, the correlation integral C2(μ, ε) (Eq. (3.1) with q = 2) can be discretized and replaced with the limit of the correlation sum:

    C_2(\epsilon) = \lim_{N \to \infty} \frac{2}{N(N-1)} \sum_{i<j}^{N} H(\epsilon - \|\mathbf{y}(i) - \mathbf{y}(j)\|_2)    (3.18)
                  = P(\|\mathbf{y}(i) - \mathbf{y}(j)\|_2 \leq \epsilon) ,    (3.19)

where H(u) is a step function, defined as

    H(u) = \begin{cases} 0 & \text{if } u < 0 \\ +1 & \text{if } u \geq 0 \end{cases} .    (3.20)

In Eq. (3.18), H simulates a closed ball of radius ε centered on each available point. Then, assuming that the upper and lower limits coincide in Eqs. (3.2) and (3.3), the correlation dimension is written as

    d_{\mathrm{cor}} = D_2 = \lim_{\epsilon \to 0} \frac{\log C_2(\epsilon)}{\log \epsilon} .    (3.21)
Like the capacity dimension, this discrete formulation of the correlation di-
mension no longer depends on the natural measures pi of the support μ. Given
a set of points Y, the correlation dimension is easily estimated by the following
procedure:
1. Compute the distances for all possible pairs of points {y(i), y(j)}.
2. Determine the proportion of distances that are less than or equal to ε.
3. Apply the log function, and divide by log ε.
4. Compute the limit when ε tends to zero; this is d_cor.
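For illustration only, the first two steps can be sketched in Python as follows (Y is again assumed to hold the N×D data points; the quadratic memory cost of the full distance matrix restricts this naive version to moderate N).

    import numpy as np

    def correlation_sum(Y, epsilons):
        """Estimate C_2(eps): the proportion of point pairs (i < j) closer than eps."""
        Y = np.asarray(Y, dtype=float)
        diff = Y[:, None, :] - Y[None, :, :]           # step 1: all pairwise differences
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        pairwise = dist[np.triu_indices(len(Y), k=1)]  # keep each pair once (i < j)
        return np.array([(pairwise <= eps).mean() for eps in epsilons])  # step 2

    # Steps 3-4 are again handled through a log-log plot (see Subsection 3.2.6):
    # the correlation dimension is read off as the slope of log C_2(eps) vs log eps.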
The second step yields only an approximation Ĉ2(ε) of C2(ε) computed with
the available N points. But again, the difficult step is the last one. Section 3.2.6
brings some useful hints.
Intuitively, the interpretation of the correlation dimension is very similar
to the one associated with the capacity dimension. Instead of adopting a global
point of view (the number of boxes that an object or manifold occupies), a
closer view is necessary. When looking at the data set on the scale of a single
point, C2 () is the number of neighboring points lying closer than a certain
threshold . This number grows as a length for a 1D object, as a surface for
a 2D object, as a volume for a 3D object, and so forth. Generalizing for P
dimensions gives
    C_2(\epsilon) \propto \epsilon^{P} .    (3.22)

And again,

    P \propto \frac{\log C_2(\epsilon)}{\log \epsilon} .    (3.23)

3.2.5 Some inequalities

Considering the q-dimension D_q, if q_1 < q_2, then the inequality D_{q_2} ≤ D_{q_1} holds [146]. As a consequence, it follows that

    D_2 \leq D_1 \leq D_0 , \quad \text{i.e.,} \quad d_{\mathrm{cor}} \leq d_{\mathrm{inf}} \leq d_{\mathrm{cap}} .    (3.24)



3.2.6 Practical estimation

When the knowledge of a manifold or fractal object is limited to a finite number of points, the capacity and correlation dimensions are much more easily computed than the information dimension. However, the theoretical estimation of their respective formulas includes a limit toward zero, for either the radius of the balls (correlation dimension) or the edge of the boxes (capacity dimension). This is clearly impossible in practice.
Concretely, only values of N(ε) or Ĉ2(ε) are available for ε ranging between the smallest and largest distances measured in the data set. A trick to
circumvent this obstacle consists of applying l’Hospital’s rule, knowing that
both the numerator and the denominator tend to −∞. For example, in the
case of the correlation dimension

    d_{\mathrm{cor}} = \lim_{\epsilon \to 0} \frac{\log \hat{C}_2(\epsilon)}{\log \epsilon}    (3.25)
                     = \lim_{\epsilon \to 0} \frac{\partial \log \hat{C}_2(\epsilon)}{\partial \epsilon} \Big/ \frac{\partial \log \epsilon}{\partial \epsilon}    (3.26)
                     = \lim_{\epsilon \to 0} \frac{\partial \log \hat{C}_2(\epsilon)}{\partial \log \epsilon}    (3.27)
                     = \lim_{\epsilon_1, \epsilon_2 \to 0} \frac{\log \hat{C}_2(\epsilon_2) - \log \hat{C}_2(\epsilon_1)}{\log \epsilon_2 - \log \epsilon_1} .    (3.28)
According to the literature, the differentiation brings a better estimate when
the limit is omitted. Consequently, one defines the scale-dependent correlation
dimension as

    \hat{d}_{\mathrm{cor}}(\epsilon_1, \epsilon_2) = \frac{\log \hat{C}_2(\epsilon_2) - \log \hat{C}_2(\epsilon_1)}{\log \epsilon_2 - \log \epsilon_1} ,    (3.29)

which is practically computed as the average slope of the curve in a log-log plot of Ĉ2(ε) versus ε. The values of ε_1 and ε_2 are set between the minimal and maximal pairwise distances measured in the available data set.
As d̂_cor depends on scale, what are the best values for ε_1 and ε_2? Since the number of points is finite, choosing small values is hopeless: the estimation gives a dimension near zero; zero is indeed the dimension of isolated points. Similarly, high values for ε_1 and ε_2 do not make any sense either, since Eq. (3.21) contains a limit toward zero. In this case, the dimension estimate also vanishes since the set of points, seen from a remote point of view, also looks like a single (fuzzy) isolated point. Between these two extreme choices lies the adequate solution. Usually, the best estimate of d̂_cor is obtained in the largest region where the slope of Ĉ2(ε) is almost constant in the log-log plot.
This region is often called a “plateau”.
For example, the scale-dependent correlation dimension d̂_cor can be computed for the coastline of Koch's island (see Fig. 3.2). The data set contains all corners of the seventh iteration of the associated Lindenmayer system (N(7) = 3 · 4^7 = 49,152), which is represented in the first plot of Fig. 3.3. (Only corners can be taken into account since sides are indefinitely refined.) The log-log plot of the estimated correlation sum Ĉ2(ε) is displayed in the

Fig. 3.3. Correlation dimension of Koch’s island. The first plot shows the coastline,
whose corners are the data set for the estimation of the correlation dimension. The
log-log plots of the estimated correlation sum Ĉ2(ε) and its numerical derivative are
displayed below.

second plot of Fig. 3.3. Obviously, as the data set is generated artificially,
the result is nearly perfect: the slope of the curve is almost constant between
ε_1 ≈ exp(−6) ≈ 0.0025 and ε_2 ≈ exp(0) = 1. However, the manual adjustment of a line onto the curve is a tedious task for the user.
Alternatively, the correlation dimension can be estimated by computing the numerical derivative of log Ĉ2(exp υ), with υ = log ε:

    \hat{d}_{\mathrm{cor}} = \frac{d}{d\upsilon} \log \hat{C}_2(\exp \upsilon) .    (3.30)

This turns out to compute the slope of Ĉ2(ε) in a log-log plot.
For any function f (x) known at regularly spaced values of x, the numerical
derivative can be computed as a second-order estimate, written as
    f'(x) = \frac{f(x + \Delta x) - f(x - \Delta x)}{2 \Delta x} + O(\Delta x^2)    (3.31)
and based on Taylor’s polynomial expansion of an infinitely differentiable
function f(x). The numerical derivative of log Ĉ2(exp υ) directly yields the dimension for any value of υ = log ε; the result is displayed in the third plot
of Fig. 3.3 for the coastline of Koch’s island. As expected, the estimated corre-
lation dimension is very close to the capacity dimension computed analytically
at the end of Subsection 3.2.2.
The use of a numerical derivative is usually criticized because it yields a
chopping and changing estimate. Nevertheless, in normal situations, it works
rather well, is visually more satisfying, and, last but not least, provides a
result that proves a bit less user-dependent. Indeed, the manual adjustment
of a line onto a curve is essentially a matter of personal perception.
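The derivative-based estimate of Eqs. (3.29) and (3.30) can be sketched as follows (illustrative Python; correlation_sum is the helper sketched earlier in this chapter, not a function of the book).

    import numpy as np

    def scale_dependent_dcor(Y, epsilons):
        """Slope of log C_2 versus log eps at each scale, as in Eq. (3.30)."""
        C2 = correlation_sum(Y, epsilons)             # helper sketched above (assumed)
        log_C2 = np.log(np.maximum(C2, 1e-300))       # guard against log(0) at tiny eps
        return np.gradient(log_C2, np.log(epsilons))  # d log C_2 / d log eps

    # The intrinsic dimension is then read off on the plateau, for instance as the
    # median of the returned slopes over a user-chosen intermediate range of scales.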
Finally, the following example illustrates the fact that the estimated cor-
relation dimension depends on the observation scale. The manifold is a spiral,
written as

    \mathbf{y} = \sqrt{x} \begin{bmatrix} \cos(10 \pi \sqrt{x}) \\ \sin(10 \pi \sqrt{x}) \end{bmatrix} + \mathbf{n} ,    (3.32)
where the unique parameter x goes from 0 to 1 and n is white Gaussian noise
with standard deviation 0.005. By construction, this spiral is a 1-manifold
embedded in R^2, and a quick look at the first plot of Fig. 3.4 confirms it
visually. However, the correlation dimension gives a more contrasted result
(second and third plots of Fig. 3.4). In the second plot, from left to right, the
correlation is growing constantly, then seems to “slow down” and is finally
growing again until it reaches its maximal value. The explanation of this
behavior can be found in the third plot, by considering the derivative:
1. For extremely small values of ε, the correlation sum remains on the scale
of isolated points. This interval is shown as the black box, the same color
as the points of the spiral in the first plot. These points are 0-manifolds, and the estimated dimension is low indeed.
2. For barely larger values of ε, the correlation sum measures the dimension
of the noise. This interval is shown as a dark gray box, which corresponds
to the small square box of the same color in the first plot. As noise occupies
all dimensions of space, the dimension is two.
3. For still larger values of ε, the correlation sum begins to take into account
entire pieces of the spiral curve. Such a piece is shown in the light gray
rectangular box in the first plot. On this scale, the spiral is a 1-manifold,
as intuitively expected and as confirmed by the estimated correlation di-
mension.
4. For values of ε close to the maximal diameter of the spiral, the correlation
dimension encompasses distances across the whole spiral. These values
are shown as a white box in the third plot, corresponding to the entire
white box surrounding the spiral in the first plot. On this scale the spiral

Fig. 3.4. Correlation dimension of a noisy spiral. The first plot shows the data
set (10,000 points). The log-log plots of the estimated correlation sum Ĉ2(ε) and
its numerical derivative are displayed below. The black, dark gray, light gray, and
white boxes in the third plot illustrate that the correlation dimension depends on the
observation scale. They correspond, respectively, to the scale of the isolated points,
the noise, pieces of the spiral curve, and the whole spiral.

appears as a plane with some missing points, and indeed the dimension
equals two.
5. For values of ε far beyond the diameter, the correlation dimension sees the spiral as a smaller and smaller fuzzy spot. Intuitively, this amounts to zooming out in the first plot of Fig. 3.4. This explains why the dimension vanishes for very large values of ε (no box is drawn).
All those variations of the estimated correlation dimension are usually called
microscopic effects (1 and 2), lacunarity effects (3), and macroscopic effects
(4 and 5) [174]. Other macroscopic effects that are not illustrated here are side and corner effects. For example, when computing the correlation dimension of a square, the number of points inside a ball of radius ε is always proportional to ε^2. However, a multiplicative coefficient should be taken into account. Assuming that inside the square this coefficient equals 1, then near a side, it is only 1/2; and near a corner, it further decreases toward 1/4.

Therefore, the estimated dimension not only depends on scale but also on the
“location” in space where it is estimated!

3.3 Other dimension estimators


Other methods that are not primarily intended to compute the fractal dimension can nevertheless be used to evaluate the dimensionality of a manifold. Principal component analysis, studied in Section 2.4, is the best-known ex-
ample: this DR method integrates an estimator of the intrinsic dimensionality
that is based on the same model. Despite this nice coherence, the model of
PCA is linear (see Eq. (2.4)), meaning that the estimator works only for mani-
folds containing linear dependencies (i.e., linear subspaces). For more complex
manifolds, PCA gives at best an estimate of the global dimensionality of an
object.
For example, PCA estimates the dimension of the spiral in the first plot of
Fig. 3.4 as two. In other words, PCA suffers from the first macroscopic effect
mentioned at the end of the previous subsection. On the other hand, the
correlation dimension succeeds in giving the dimension on all scales. Hence, a
promising way to explore is the use of PCA on a local scale.

3.3.1 Local methods

The idea behind local methods consists of decomposing the space into small
patches, or “space windows”, and to consider each of them separately. To
some extent, this idea is closely related to the use of boxes and balls in the
capacity and correlation dimensions.
The most widely known local method is based on the nonlinear general-
ization of PCA already sketched in Subsection 2.5.8. Briefly put, the space
windows are determined by clustering the data. Usually, this is achieved by
vector quantization (see Appendix D). In a few words, vector quantization
processes a set of points by replacing it with a smaller set of “representative”
points. Usually, the probability distribution function of these points resembles
that of the initial data set, but their actual distribution is, of course, much
sparser. If each point is mapped to the closest representative point, then the
space windows are defined as the subsets of points that are mapped to the
same representative point. Next, PCA is carried out locally, on each space
window, assuming that the manifold is approximately linear on the scale of a
window. Finally, the dimensionality of the manifold is obtained as the average
estimate yielded by all local PCAs. Usually, each window is weighted by the
number of points it contains before computing the mean.
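The whole procedure can be sketched as follows (illustrative Python, not the authors' code): a plain k-means pass builds the space windows, PCA is run on each window, and the window estimates are averaged with weights proportional to the window sizes; the number of windows and the variance threshold are free parameters.

    import numpy as np

    def local_pca_dimension(Y, n_windows=70, var_threshold=0.98, seed=0):
        """Sketch of the local PCA estimator: k-means windows + PCA per window."""
        rng = np.random.default_rng(seed)
        Y = np.asarray(Y, dtype=float)
        centers = Y[rng.choice(len(Y), size=n_windows, replace=False)]
        for _ in range(50):                      # plain Lloyd iterations (vector quantization)
            labels = np.argmin(((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
            for k in range(n_windows):
                if np.any(labels == k):
                    centers[k] = Y[labels == k].mean(axis=0)
        dims, weights = [], []
        for k in range(n_windows):
            W = Y[labels == k]
            if len(W) < 3:                       # too few points for a local PCA
                continue
            eigvals = np.linalg.eigvalsh(np.cov(W.T))[::-1]          # local PCA spectrum
            cumulated = np.cumsum(eigvals) / eigvals.sum()
            dims.append(int(np.searchsorted(cumulated, var_threshold)) + 1)
            weights.append(len(W))
        return np.average(dims, weights=weights)  # windows weighted by their size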
Moreover, it is noteworthy that not just the mean can be computed: other
statistics, like standard deviations or minimal and maximal values, may help

to check that the dimensionality remains (nearly) identical over all space win-
dows. Hence, local PCA can detect spatial variations of the intrinsic dimen-
sionality. This is a great difference with other methods, like fractal dimension,
that usually assume that dimensionality is a global property of data.
For the noisy spiral of Fig. 3.4, the local PCA approach yields the result
shown in Fig. 3.5. The first plot is a copy of the spiral data set, but the bound-
aries of the space windows are added in gray (70 windows have been built).
The second plot shows the fraction of variance spanned by the first principal
component of each space window, as a function of the number of space win-
dows. In the third plot, the three curves indicate the dimensionality for three
variance thresholds (0.97, 0.98, and 0.99). As can be seen, the dimension given
by local PCA is scale-dependent, like the correlation dimension. Actually, the
scale is implicitly determined by the number of space windows. If this num-
ber is too low, the windows are too large and PCA “sees” the macroscopic
structure of the spiral, which is two-dimensional. At nearly 70, the value that
corresponds to the number of space windows in the first plot, the size of the
windows is close to the optimum and PCA “sees” small pieces of the spiral
curve: the dimension is one. If the number of windows further increases, the
windows become too small: the noise scale is attained and PCA needs two
components to explain the variance.
By comparison with the fractal dimensions like the correlation dimension,
the local PCA requires more data samples to yield an accurate estimate. This
is because local PCA works by dividing the manifold into nonoverlapping
patches. On the contrary, the correlation dimension places a ball on each
point of the data set. As a counterpart, local PCA is faster (O(N )) than the
correlation dimension (O(N 2 )), at least for a single run. Otherwise, if local
PCA is repeated for many different numbers of space windows, as in Fig. 3.5,
then the computation time grows.
The local PCA approach has been proposed by Kambhatla and Leen [101] as a DR method. Because this method does not provide an embedding in a single coordinate system in a natural way, it has not met with much success, except in data compression. Fukunaga and Olsen [72], on the other hand, followed the same approach more than two decades before Kambhatla and Leen in order to estimate the intrinsic dimensionality of data.

3.3.2 Trial and error

Instead of generalizing the use of PCA to nonlinear manifolds by dividing the space into small patches, PCA could be replaced with other DR methods that inherently rely on a nonlinear model. As already mentioned, most DR methods do not integrate a dimensionality estimator as PCA does. But on the other hand,
some of these methods minimize a reconstruction error Ecodec (Eq. (2.1)),
exactly as PCA does (see Subsection 2.4.2). Actually, the reconstruction error
depends on the embedding dimensionality P . If P = D, then trivially Ecodec is
exactly zero, since the manifold to be embedded can simply be copied without


Fig. 3.5. Intrinsic dimensionality of the noisy spiral shown in Fig. 3.4, estimated
by local PCA. The first plot shows the spiral again, but the boundaries of the space
windows are added in gray (70 windows). The second plot shows the fraction of
the total variance spanned by the first principal component of each cluster or space
window. This fraction is actually computed as an average for different numbers
of windows (in abscissa). The third plot shows the corresponding dimensionality
(computed by piecewise linear interpolation) for three variance fractions (0.97, 0.98,
and 0.99).

any change. If P = 0, then the error reaches its maximal value, equal to the
global variance (tr(Cyy )). For 0 < P < D, the error varies between these
two extrema but cannot be predicted exactly. However, one may expect that
the error will remain low if P is greater than the intrinsic dimensionality of
the manifold to be embedded. On the contrary, if P goes below the intrinsic
dimensionality, the dimensionality reduction may cause a sudden increase in
Ecodec .
With these ideas in mind, the following procedure can yield an estimate of
the intrinsic dimensionality:
1. For a manifold embedded in a D-dimensional space, reduce dimensionality
successively to P = 1, 2, . . . , D; of course, if some additional information
about the manifold is available, the search interval may be smaller.
2. Plot Ecodec as a function of P .
3. Choose a threshold, and determine the lowest value of P such that Ecodec
goes below it: this is the estimate of the intrinsic dimensionality of the
manifold.
The choice of the threshold in the last step is critical since the user determines
it arbitrarily. But most of the time the curve Ecodec versus P is very explicit:
an elbow is usually clearly visible when P equals the intrinsic dimensionality.
An example is given in the following section.
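A minimal sketch of this loop is given below (illustrative Python). The metric MDS implementation of scikit-learn, whose stress is closely related to Sammon's stress and to a reconstruction-type error, is used here only as a convenient stand-in for the DR method; the threshold remains a user choice.

    import numpy as np
    from sklearn.manifold import MDS  # stand-in DR method for this sketch

    def error_curve(Y, max_dim):
        """Embed the data into P = 1, ..., max_dim and record the residual stress."""
        errors = []
        for P in range(1, max_dim + 1):
            mds = MDS(n_components=P, random_state=0)
            mds.fit(np.asarray(Y, dtype=float))
            errors.append(mds.stress_)        # distance-preservation error for this P
        return np.array(errors)

    # The estimated intrinsic dimension is the smallest P whose error drops below
    # a user-chosen threshold, i.e., the elbow of the curve:
    # errors = error_curve(Y, 10)
    # P_hat = 1 + int(np.argmax(errors < threshold))   # assumes the threshold is reached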
An additional refinement of the procedure consists of using statistical esti-
mation methods like cross validation or bootstrapping. Instead of computing
an embedding for a certain dimensionality only once, those methods repeat
the dimensionality reduction on several subsets that are randomly drawn from
the available data. This results in a better estimation of the reconstruction
errors, and therefore in a more faithful estimation of the dimensionality at the
elbow.
The main disadvantage of the procedure, especially with cross validation
or bootstrapping, lies in its huge computational requirements. This is partic-
ularly annoying when the user is not interested in the embedding, but merely
in the value of the intrinsic dimensionality. Moreover, if the DR method is
not incremental (i.e., does not produce incremental embeddings, see Subsec-
tion 2.5.7), the computation time dramatically increases.

3.4 Comparisons
This section attempts to compare the above-mentioned methods in the case
of an artificially generated manifold whose dimensionality is known before-
hand. The first and simplest method is PCA; the others are the correlation
dimension, the local PCA, and finally the “trial and error” method.

3.4.1 Data sets

The proposed manifold has already been used in [45]. In a three-dimensional cube ([−1, +1]^3), 10 distance sensors are placed at random locations. With these sensors, each position in the cube can be encoded as the vector containing the distances toward the 10 sensors. Obviously, these distances are not independent: nonlinear relationships bind them. And, of course, the number of free parameters is actually three, i.e., the dimensionality of the cube where points are picked out.
For the experiments below, the positions of the 10 sensors are

x1 x2 x3
+0.026 +0.241 +0.026
+0.236 +0.193 −0.913
−0.653 +0.969 −0.700
+0.310 +0.094 +0.876
+0.507 +0.756 +0.216
−0.270 −0.978 −0.739
−0.466 −0.574 +0.556
−0.140 −0.502 −0.155
+0.353 −0.281 +0.431
−0.473 +0.993 +0.411

Three data sets are made available for the dimensionality estimation: they con-
tain, respectively, 100, 1000, and 10,000 observations. The three-dimensional
points that generated them are uniformly distributed in the three-dimensional
cube [−1, +1]^3. Once the 10 distances are computed, white Gaussian noise is
added, with standard deviation equal to 0.01.
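For reference, the generation of these data sets can be sketched as follows (illustrative Python; the sensor coordinates are those of the table above, and the random draws obviously differ from the original ones).

    import numpy as np

    # Sensor positions listed in the table above
    sensors = np.array([
        [+0.026, +0.241, +0.026], [+0.236, +0.193, -0.913],
        [-0.653, +0.969, -0.700], [+0.310, +0.094, +0.876],
        [+0.507, +0.756, +0.216], [-0.270, -0.978, -0.739],
        [-0.466, -0.574, +0.556], [-0.140, -0.502, -0.155],
        [+0.353, -0.281, +0.431], [-0.473, +0.993, +0.411]])

    def sensor_data(n_points, noise_std=0.01, seed=0):
        rng = np.random.default_rng(seed)
        x = rng.uniform(-1.0, 1.0, size=(n_points, 3))   # latent positions in the cube
        dist = np.linalg.norm(x[:, None, :] - sensors[None, :, :], axis=-1)  # 10 distances
        return dist + rng.normal(scale=noise_std, size=dist.shape)           # add noise

    # The three data sets used in the experiments:
    # Y100, Y1000, Y10000 = sensor_data(100), sensor_data(1000), sensor_data(10000)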

3.4.2 PCA estimator

Figure 3.6 shows the results of PCA applied globally on the three data sets. As
can be seen, the number of observations does not greatly influence the results.
For the three data sets, the normalized variances vanish starting from the fifth
principal component. Clearly, this is not a good result. But this overestimation
of the intrinsic dimension is not unexpected: PCA works with a linear model,
which is unable to cope with the nonlinear dependences hidden in the data
sets.

3.4.3 Correlation dimension

The results of the correlation dimension are given in Fig. 3.7. This method is
much more sensitive to the number of available observations. For 100 observations, the numerical derivative is chopping and changing, although the right
dimensionality can already be guessed. For 1000 and 10,000 observations, the


Fig. 3.6. Estimation of the intrinsic dimensionality for the three “sensor” data sets
(100, 1000, and 10,000 observations), by using PCA applied globally.


Fig. 3.7. Estimation of the intrinsic dimensionality for the three “sensor” data sets (100, 1000, and 10,000 observations), by using the correlation dimension.

results are nearly perfect: the estimated dimensionality is three, as expected.


It can be remarked that the noise dimensionality appears more clearly as the
number of observations grows. Nevertheless, even for 10,000 observations, the
noise dimensionality remains underestimated. Moreover, edge effects appear
for −1 ≤ log ε ≤ 0: the dimensionality is slightly underestimated.
From the computational point of view, the correlation dimension is much
slower than PCA but yields higher quality results.

3.4.4 Local PCA estimator

The results of local PCA are displayed in Fig. 3.8. Actually, only the results for


Fig. 3.8. Estimation of the intrinsic dimensionality for the two largest “sensor”
data sets (1000 and 10,000 observations) by local PCA. The normalized eigenvalues
are shown according to the number of space windows.

the two largest data sets are shown. For 100 observations, the space windows
do not contain enough observations to yield a trustworthy estimate. Like the correlation dimension, local PCA yields the right dimensionality. This can
be seen in both plots: the largest three normalized eigenvalues remain high for
any number of windows, while the fourth and subsequent ones are negligible.
It is noteworthy that for a single window the result of local PCA is trivially the
same as for PCA applied globally. But as the number of windows is increasing,
the fourth normalized eigenvalue is decreasing slowly. This indicates that the

division into an increasing number of space windows allows us to capture the nonlinear shape of the underlying manifold. The smaller the windows are, the better the assumption of local linearity of the manifold holds. Nevertheless,
if the windows become too numerous and too small, PCA is no longer reliable,
because the windows do not contain enough points. This is especially visible in
the first plot of Fig. 3.8: the second and third eigenvalues are slowly vanishing,
whereas the first one is becoming more and more dominant.
Local PCA is obviously much slower than global PCA, but still faster than
the correlation dimension, at least if the number of windows does not sweep
an interval that is too wide.

3.4.5 Trial and error

For the trial-and-error technique, the chosen DR method is Sammon's nonlinear mapping (see Subsection 4.2.3), using the Euclidean distance. This choice
is justified by the fact that the method minimizes an explicit error criterion,
called “Sammon’s stress”. This criterion is not exactly a reconstruction error
as defined in Eq. (2.1), but it is closely related to it.
The results are shown only for the two smallest data sets in Fig. 3.9, be-
cause the computation time proves too long for 10,000 observations. As can


Fig. 3.9. Estimation of the intrinsic dimensionality for the two smallest “sensor”
data sets (100 and 1000 observations) by trial and error with Sammon’s nonlinear
mapping.

be seen, the number of points does not play an important role. At first sight,
the estimated dimension is four, since the error is almost zero starting from
this number. Actually, the DR method slightly overestimates the dimension-
ality, like PCA applied globally. Although the method relies on a nonlinear
model, the manifold may still be too curved to achieve a perfect embedding
in a space having the same dimension as the exact manifold dimensionality.
One or two extra dimensions are welcome to embed the manifold with some
more freedom. This explains why the overestimation observed for PCA does
not disappear but is only attenuated when switching to an NLDR method.

3.4.6 Concluding remarks

Among the four compared methods, PCA applied globally on the whole data set undoubtedly remains the simplest and fastest one. Unfortunately, its results are not very convincing: the dimension is almost always overestimated if data do not perfectly fit the PCA model. Since the PCA criterion can be seen as a reconstruction error, PCA is actually a particular case of the ‘trial and error’ method. And because PCA is an incremental method, embeddings for all dimensions (and the corresponding errors) are computed at once.
When replacing PCA with a method relying on a nonlinear model, these
nice properties are often lost. In the case of Sammon’s nonlinear mapping, not
only is the method much slower due to its more complex model, but addition-
ally the method is no longer incremental! Put together, these two drawbacks
make the trial-and-error method very slow. Furthermore, the overestimation
that was observed with PCA does not disappear totally.
The use of local PCA, on the other hand, seems to be a good tradeoff. The
method keeps all advantages of PCA and combines them with the ability to
handle nonlinear manifolds. Local PCA runs fast if the number of windows
does not sweep a wide interval. But more importantly, local PCA has given
the right dimensionality for the studied data sets, along with the correlation
dimension.
Eventually, the correlation dimension clearly appears as the best method to
estimate the intrinsic dimensionality. It is not the fastest of the four methods,
but its results are the best and most detailed ones, giving the dimension on
all scales.
4 Distance Preservation

Overview. This chapter deals with methods that reduce the dimen-
sionality of data by using distance preservation as the criterion. In
the ideal case, the preservation of the pairwise distances measured in
a data set ensures that the low-dimensional embedding inherits the
main geometric properties of data, like the global shape or the lo-
cal neighborhood relationships. Unfortunately, in the nonlinear case,
distances cannot be perfectly preserved. The chapter reviews various
methods that attempt to overcome this difficulty. These methods use
different kinds of distances (mainly spatial or graph distances); they
also rely on different algorithms or optimization procedures to deter-
mine the embedding.

4.1 State-of-the-art
Historically, distance preservation was the first criterion used to achieve
dimensionality reduction in a nonlinear way. In the linear case, simple criteria
like maximizing the variance preservation or minimizing the reconstruction
error, combined with a basic linear model, lead to robust methods like PCA.
In the nonlinear case however, the use of the same simple criteria requires
the definition of more complex data models. Unfortunately, the definition of
a generative model in the nonlinear case proves very difficult: there are many
different ways to model nonlinear manifolds, whereas there are only a few
(equivalent) ways to define a hyperplane.
In this context, distance preservation appears as a nongenerative way to
perform dimensionality reduction. The criterion does not need any explicit
model: no assumption is made about the mapping from the latent variables
to the observed ones. Intuitively, the motivation behind distance preservation
is that any manifold can be fully described by pairwise distances. Hence, if
a low-dimensional representation can be built in such a way that the initial
distances are reproduced, then the dimensionality reduction is successful: the

information content conveyed by the manifold, its geometrical structure, is preserved. It is clear that if close points are kept close, and if far points remain
far, then the initial manifold and its low-dimensional embedding share the
same shape.
The next three sections of this chapter review some of the best-known
DR methods that use the principle of distance preservation; they are called
distance-preserving methods in short. Each of the three sections focuses on a
particular type of distance. Section 4.2 introduces the most common distance
measures, like the Euclidean one, and methods that are based on it. Next,
Section 4.3 describes geodesic and graph distances, which have attracted much
interest in the last few years. Finally, Section 4.4 deals with even more exotic
distance measures that are related to kernel functions and kernel learning.
Additional examples and comparisons between the described methods can
be found in Chapter 6.

4.2 Spatial distances

Spatial distances, like the Euclidean distance, are the most intuitive and natu-
ral way to measure distances in the real (Euclidean) world. The adjective spa-
tial indicates that these metrics compute the distance separating two points of
the space, without regards to any other information like the presence of a sub-
manifold: only the coordinates of the two points matter. Although these met-
rics are probably not the most appropriate for dimensionality reduction (see
Section 4.3.1), their simplicity makes them very appealing. Subsection 4.2.1
introduces some facts about and definitions of distances, norms, and scalar
products; then it goes on to describe the methods that reduce dimensionality
by using spatial distances.

4.2.1 Metric space, distances, norms and scalar product

A space Y with a distance function d(a, b) between two points a, b ∈ Y is said to be a metric space if the distance function respects the following axioms:
• Nondegeneracy. For any points a and b in the space d(a, b) = 0 if and
only if a = b.
• Triangular inequality. For any points a, b and c in the space d(a, b) ≤
d(c, a) + d(c, b).
Other usual and desired properties for the distance function, like the symmetry
and the nonnegativity, trivially follow from these two axioms. This comes from
the specific formulation of the triangular inequality. If the latter is defined as
d(a, b) ≤ d(a, c) + d(c, b), then the symmetry and nonnegativity must be
added as axioms. But with the first definition, they can be derived as follows:

• Nonnegativity. If a = b, the triangular inequality becomes

d(a, a) ≤ d(c, a) + d(c, a) = 2d(c, a) . (4.1)

Simplifying with the help of nondegeneracy results in:

0 ≤ d(c, a) . (4.2)

• Symmetry. If a = u and b = c = v, the triangular inequality becomes

d(u, v) ≤ d(v, u) + d(v, v) = d(v, u) , (4.3)

by the use of nondegeneracy. Similarly, if a = v and b = c = u, then

d(v, u) ≤ d(u, v) + d(u, u) = d(u, v) . (4.4)

The conjunction of both inequalities forces the equality d(u, v) = d(v, u).
In the usual Cartesian vector space R^D, the most-used distance functions are derived from the Minkowski norm. Actually, the pth-order Minkowski norm of point a = [a_1, . . . , a_k, . . . , a_D]^T, also called the L_p norm and noted \|a\|_p, is a simple function of the coordinates of a:

    \|\mathbf{a}\|_p = \sqrt[p]{\sum_{k=1}^{D} |a_k|^p} ,    (4.5)

where p ∈ N_0. A distance function that respects the above-mentioned axioms is obtained by measuring the norm of the difference between two points:

    d(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_p = \|\mathbf{b} - \mathbf{a}\|_p .    (4.6)

When using the Minkowski distance, some values of p are chosen preferentially
because they lead to nice geometrical or mathematical properties:
• The maximum distance (p = ∞):

    \|\mathbf{a} - \mathbf{b}\|_\infty = \max_{1 \leq k \leq D} |a_k - b_k| ,    (4.7)

  also called the dominance distance, because when p → ∞, all summed terms in Eq. (4.5) become negligible, except the largest one.
• The city-block distance (p = 1):

    \|\mathbf{a} - \mathbf{b}\|_1 = \sum_{k=1}^{D} |a_k - b_k| ,    (4.8)

  also called the Manhattan distance because, from a geometrical point of view, the measurement of the distance resembles driving a taxi in an American city divided into regular rectangular blocks.

• The Euclidean distance (p = 2):

  \|a - b\|_2 = \sqrt{\sum_{k=1}^{D} (a_k - b_k)^2} , \qquad (4.9)

proves to be the most natural and intuitive distance measure in the real
world. The Euclidean distance also has particularly appealing mathemat-
ical properties (invariance with respect to rotations, etc.).
Among the three above-mentioned possibilities, the Euclidean distance is the
most widely used one, not only because of its natural interpretation in the
physical world, but also because of its simplicity. For example, the partial
derivative along a component a_k of a is simply

\frac{\partial d(a, b)}{\partial a_k} = \frac{a_k - b_k}{d(a, b)} = -\frac{\partial d(a, b)}{\partial b_k} , \qquad (4.10)

or, written directly in vector form,

\frac{\partial d(a, b)}{\partial a} = \frac{a - b}{d(a, b)} = -\frac{\partial d(a, b)}{\partial b} . \qquad (4.11)
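To make these formulas concrete, here is a minimal sketch in Python with NumPy (an illustrative choice for this note, not a tool discussed in this chapter) that evaluates the three Minkowski distances and checks the gradient of Eq. (4.11) by finite differences; the function name is hypothetical.

import numpy as np

def minkowski_distance(a, b, p):
    # Minkowski (L_p) distance of Eqs. (4.5)-(4.6); p = np.inf gives the maximum distance of Eq. (4.7).
    diff = np.abs(a - b)
    if np.isinf(p):
        return diff.max()                     # dominance distance
    return (diff ** p).sum() ** (1.0 / p)     # city-block for p = 1, Euclidean for p = 2

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.5])
print(minkowski_distance(a, b, 1), minkowski_distance(a, b, 2), minkowski_distance(a, b, np.inf))

# Finite-difference check of Eq. (4.11): the gradient of d(a, b) with respect to a is (a - b) / d(a, b).
d = minkowski_distance(a, b, 2)
analytic = (a - b) / d
eps = 1e-6
numeric = np.array([(minkowski_distance(a + eps * e, b, 2) - d) / eps for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-4))   # True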
Another advantage of the Euclidean distance comes from the alternative def-
inition of the Euclidean norm by means of the scalar product:

\|a\|_2 = \sqrt{a \cdot a} , \qquad (4.12)

where the notation a · b indicates the scalar product between vectors a and
b. Formally, the scalar or dot product is defined as

a \cdot b = a^T b = \sum_{k=1}^{D} a_k b_k . \qquad (4.13)

Two important properties of the scalar product are:


• commutativity:
a · b = b · a . (4.14)
• left and right distributivity:

a · (b + c) = a · b + a · c , (4.15)
(a + b) · c = a · c + b · c . (4.16)

Finally, the overview of the classical distance functions would not be com-
plete without mentioning the Mahalanobis distance, a straight generalization
of the Euclidean distance. The Mahalanobis norm is defined as

\|a\|_{\mathrm{Mahalanobis}} = \sqrt{a^T M^{-1} a} , \qquad (4.17)



where M is often chosen as the covariance matrix Caa = E{aaT }. Obviously,


the Euclidean distance corresponds to the particular case where M is the iden-
tity matrix. Intuitively, the equicontours are circles for the Euclidean distance
and ellipses for the Mahalanobis distance.
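As a short illustration, a sketch in Python/NumPy of the distance derived from the Mahalanobis norm of Eq. (4.17); the toy data and the function name are assumptions made for this example only.

import numpy as np

def mahalanobis_distance(a, b, M):
    # Distance derived from Eq. (4.17); M is typically the covariance matrix of the data.
    diff = a - b
    return np.sqrt(diff @ np.linalg.solve(M, diff))

# Toy data with correlated coordinates.
rng = np.random.default_rng(0)
Y = rng.normal(size=(1000, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
M = np.cov(Y, rowvar=False)

a, b = Y[0], Y[1]
print(mahalanobis_distance(a, b, M))           # elliptic equicontours
print(mahalanobis_distance(a, b, np.eye(2)))   # M = I recovers the Euclidean distance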
Most distance-preserving NLDR algorithms that are described in the forth-
coming sections involve pairwise distances. Assuming that the finite set of
indexed points, denoted as

Y = {y(1), . . . , y(i), . . . , y(j), . . . , y(N )} , (4.18)

is available, then the distance between the two points y(i) and y(j), normally
written as d(y(i), y(j)), can be shortened and noted as dy (i, j).

4.2.2 Multidimensional scaling

The term multidimensional scaling (MDS) actually hides a family of methods


rather than a single well-defined procedure. Scaling refers to methods that
construct a configuration of points in a target metric space from informa-
tion about interpoint distances, and MDS is scaling when the target space is
Euclidean [43]. Historically, the first major steps were made by Young and
Householder in 1938 [208] and then by Torgerson [182] in 1952, who pro-
posed the purely Euclidean model of metric MDS. The next breakthrough
was accomplished a few years later by Shepard in 1962 [171] and by Kruskal
in 1964 [108], who elaborated methods for nonmetric MDS, focusing on rank
information instead of interpoint distances.
Actually, MDS has been widely used and developed in human sciences like
sociology, anthropology, economy, and also particularly in a subfield of psy-
chology called psychometrics. In the latter domain, MDS is used as a tool for
the geometrical representation of concepts, objects, or opinions. Classically,
people are asked to give a quantitative separation between the concepts: for
each concept, they have either to place all the others as points in a continuous
interval or to rank them by order of similarity. The first approach is typical
of metric MDS and the second of nonmetric MDS. In both cases, each object
is characterized by either distances between or similarities to the others. In-
tuitively, the notion of distance, or dissimilarity, is very easy to understand:
the dissimilarity is zero for identical objects and grows as they become in-
creasingly different from each other. Conversely, similarity is high for nearly
identical objects and decreases as differences appear. More formally, in the
Euclidean case, a distance between two points y(i) and y(j) is related to the
norm of their difference:

d(y(i), y(j)) = \sqrt{(y(i) - y(j)) \cdot (y(i) - y(j))} , \qquad (4.19)

whereas an example of a similarity measure can be written using the inverse of


the distance. As expected, the similarity then vanishes as the distance grows,
and conversely the similarity tends to infinity for small distances.

When used in psychometrics, two different kinds of MDS can be distin-


guished. Indeed, when MDS has to process the results of a public opinion
poll or survey, each individual surveyed provides similarities for all pairs of
concepts. Consequently, the data are not stored in a matrix but in a three-
dimensional tensor. The so-called three-way MDS is designed to analyze such
a data set. However, it will be not studied hereafter, and the emphasis is put
on two-way MDS, for which only one similarity value is attributed for each
pair of objects.
The remainder of this section describes the original method, the oldest
one, often called classical metric multidimensional scaling, which gave rise to
many variants.

Embedding of the data set

Actually, classical metric MDS is not a true distance-preserving method. In


its classical version, metric MDS preserves pairwise scalar products instead
of pairwise distances (both are closely related, however, as will become clear
farther ahead). Moreover, classical metric MDS cannot achieve dimensionality
reduction in a nonlinear way either. Nevertheless, as it can be considered to
be the antecedent of all nonlinear distance-preserving methods, its place in
this chapter is fully justified.
Like PCA, metric MDS relies on a simple generative model. More precisely,
only an orthogonal axis change separates the observed variables in y and the
latent ones, stored in x:
y = Wx , (4.20)
where the components of x are independent or at least uncorrelated and W
is a D-by-P matrix such that WT W = IP . Both the observed and latent
variables are assumed to be centered.
For a finite set of N points, written in matrix form as

Y = [. . . , y(i), . . . , y(j), . . .] , (4.21)

a short-hand notation may be given for the scalar product between vectors
y(i) and y(j):
sy (i, j) = s(y(i), y(j)) = y(i) · y(j) , (4.22)
as has been done for distances. Then it can be written that

S = [s_y(i,j)]_{1 \leq i,j \leq N} = Y^T Y \qquad (4.23)
  = (WX)^T (WX) \qquad (4.24)
  = X^T W^T W X \qquad (4.25)
  = X^T X . \qquad (4.26)

Usually, both Y and X are unknown; only the matrix of pairwise scalar prod-
ucts S, called the Gram matrix, is given. As can be seen, the values of the

latent variables can be found trivially by computing the eigenvalue decompo-


sition (see Appendix A.2) of the Gram matrix S:

S = U Λ U^T \qquad (4.27)
  = (U Λ^{1/2})(Λ^{1/2} U^T) \qquad (4.28)
  = (Λ^{1/2} U^T)^T (Λ^{1/2} U^T) , \qquad (4.29)

where U is an N -by-N orthonormal matrix and Λ is an N -by-N diagonal


matrix containing the eigenvalues. (The reason why U is used instead of V like
in Appendix A.2 will become clear farther ahead. Moreover, it is noteworthy
that as S is the Gram matrix of the centered data, at most D eigenvalues are
strictly positive while others are zero in Λ.) If the eigenvalues are sorted in
descending order, then the estimated P -dimensional latent variables can be
computed as the product

\hat{X} = I_{P \times N} Λ^{1/2} U^T . \qquad (4.30)

Starting from this solution, the equivalence between metric MDS and PCA
can easily be demonstrated.
Actually, metric MDS and PCA give the same solution. To demonstrate
it, the data coordinates Y are assumed to be known (this is mandatory for
PCA, but not for metric MDS) and centered. Moreover, the singular value
decomposition of Y is written as Y = VΣUT (see Appendix A.1). On one
hand, PCA decomposes the covariance matrix, which is proportional to YYT ,
into eigenvectors and eigenvalues:

\hat{C}_{yy} ∝ Y Y^T = V Σ U^T U Σ^T V^T = V Σ Σ^T V^T = V Λ_{PCA} V^T , \qquad (4.31)

where the division by N is intentionally omitted in the covariance, and
Λ_{PCA} = Σ Σ^T . The solution is \hat{X}_{PCA} = I_{P \times D} V^T Y (see Eq. (2.19) in Sub-
section 2.4.2). On the other hand, metric MDS decomposes the Gram matrix
into eigenvectors and eigenvalues:

S = Y^T Y = U Σ^T V^T V Σ U^T = U Σ^T Σ U^T = U Λ_{MDS} U^T , \qquad (4.32)

where Λ_{MDS} = Σ^T Σ . The solution is \hat{X}_{MDS} = I_{P \times N} Λ_{MDS}^{1/2} U^T . By equating
both solutions and by again using the singular value decomposition of Y:

\hat{X}_{PCA} = \hat{X}_{MDS} \qquad (4.33)
I_{P \times D} V^T Y = I_{P \times N} Λ_{MDS}^{1/2} U^T \qquad (4.34)
I_{P \times D} V^T V Σ U^T = I_{P \times N} (Σ^T Σ)^{1/2} U^T \qquad (4.35)
I_{P \times D} Σ U^T = I_{P \times D} Σ U^T . \qquad (4.36)

By the way, this proves that PCA and metric MDS minimize the same crite-
rion. In the case of metric MDS, it can be rewritten as


E_{MDS} = \sum_{i,j=1}^{N} (s_y(i,j) - \hat{x}(i) \cdot \hat{x}(j))^2 . \qquad (4.37)

The equivalence between the two methods may be an advantage in some


situations. For example, when data consist of distances or similarities, the
absence of the coordinates does not prevent us from applying PCA: it suffices
to replace PCA with metric MDS. On the other hand, when the coordinates
are known, the equivalence is also very useful when the size of the data matrix
Y becomes problematic. If data are not too high-dimensional but the number
of points is huge, PCA spends fewer memory resources than MDS since the
product YYT has a smaller size than YT Y. By contrast, MDS is better
when the dimensionality is very high but the number of points rather low. An
intermediate solution consists of using the SVD of Y in both cases.
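The equivalence can also be verified numerically. The following sketch in Python/NumPy (an illustration added here, not taken from the book's implementations) computes the PCA solution I_{P×D} V^T Y and the MDS solution I_{P×N} Λ^{1/2} U^T on random centered data and checks that they coincide up to the arbitrary sign of each axis.

import numpy as np

rng = np.random.default_rng(1)
D, N, P = 5, 100, 2
Y = rng.normal(size=(D, N))
Y = Y - Y.mean(axis=1, keepdims=True)                 # center the data

# PCA: EVD of Y Y^T (division by N omitted), as in Eq. (4.31)
lam_pca, V = np.linalg.eigh(Y @ Y.T)
V = V[:, np.argsort(lam_pca)[::-1]]
X_pca = V[:, :P].T @ Y

# Metric MDS: EVD of the Gram matrix Y^T Y, as in Eq. (4.32)
lam_mds, U = np.linalg.eigh(Y.T @ Y)
order = np.argsort(lam_mds)[::-1]
lam_mds, U = lam_mds[order], U[:, order]
X_mds = np.sqrt(lam_mds[:P])[:, None] * U[:, :P].T

# Same embedding, up to the sign of each eigenvector
for k in range(P):
    s = np.sign(X_pca[k] @ X_mds[k])
    print(np.allclose(X_pca[k], s * X_mds[k]))        # True, True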
Until now, it has been assumed that data are known by either coordinates
(stored in Y) or scalar products (stored in S). What can be done when Eu-
clidean distances are given instead of scalar products, which often happens?
In that case, distances have to be converted into scalar products before
applying metric MDS. For this purpose, it is assumed that pairwise distances
are squared and stored in an N -by-N matrix D:

D = [d_y^2(i,j)]_{1 \leq i,j \leq N} . \qquad (4.38)

According to Section 4.2, the Euclidean distance can be defined as a scalar


product:

d_y^2(i,j) = \|y(i) - y(j)\|_2^2 \qquad (4.39)
           = (y(i) - y(j)) \cdot (y(i) - y(j)) \qquad (4.40)
           = y(i) \cdot y(i) - 2 y(i) \cdot y(j) + y(j) \cdot y(j) \qquad (4.41)
           = s_y(i,i) - 2 s_y(i,j) + s_y(j,j) , \qquad (4.42)

where y is assumed to be centered. Of course, this assumption does not influ-


ence the distances, which are invariant to translation:

\|y(i) - y(j)\|_2^2 = \|(y(i) - z) - (y(j) - z)\|_2^2 , \qquad (4.43)

where z is the translation vector. According to Eq. (4.42), the scalar products
can be computed as
s_y(i,j) = -\frac{1}{2} \left( d_y^2(i,j) - y(i) \cdot y(i) - y(j) \cdot y(j) \right) . \qquad (4.44)
Unfortunately, the coordinates y are usually unknown. Nevertheless, the two
subtractions in the right-hand side of Eq. (4.44) can be achieved in an implicit
way by an operation called “double centering” of D. It simply consists of
subtracting from each entry of D the mean of the corresponding row and the
mean of the corresponding column, and adding back the mean of all entries.
In matrix form, this can be written as

S = -\frac{1}{2} \left( D - \frac{1}{N} D 1_N 1_N^T - \frac{1}{N} 1_N 1_N^T D + \frac{1}{N^2} 1_N 1_N^T D 1_N 1_N^T \right) . \qquad (4.45)
Using the properties of the scalar product, knowing that data are centered,
and denoting by μ the mean operator, we find that the mean of the ith row
of D is

μ_j(d_y^2(i,j)) = μ_j((y(i) - y(j)) \cdot (y(i) - y(j)))
                = μ_j(y(i) \cdot y(i) - 2 y(i) \cdot y(j) + y(j) \cdot y(j))
                = y(i) \cdot y(i) - 2 y(i) \cdot μ_j(y(j)) + μ_j(y(j) \cdot y(j))
                = y(i) \cdot y(i) - 2 y(i) \cdot 0 + μ_j(y(j) \cdot y(j))
                = y(i) \cdot y(i) + μ_j(y(j) \cdot y(j)) . \qquad (4.46)

Similarly, by symmetry of D, the mean of the jth column of D is

μ_i(d_y^2(i,j)) = μ_i(y(i) \cdot y(i)) + y(j) \cdot y(j) . \qquad (4.47)

The mean of all entries of D is

μ_{i,j}(d_y^2(i,j)) = μ_{i,j}((y(i) - y(j)) \cdot (y(i) - y(j)))
                    = μ_{i,j}(y(i) \cdot y(i)) - 2 μ_{i,j}(y(i) \cdot y(j)) + μ_{i,j}(y(j) \cdot y(j))
                    = μ_i(y(i) \cdot y(i)) - 2 μ_i(y(i) \cdot μ_j(y(j))) + μ_j(y(j) \cdot y(j))
                    = μ_i(y(i) \cdot y(i)) - 2 μ_i(y(i) \cdot 0) + μ_j(y(j) \cdot y(j))
                    = μ_i(y(i) \cdot y(i)) + μ_j(y(j) \cdot y(j)) . \qquad (4.48)

Clearly, the two last unknown terms in Eq. (4.44) are equal to the sum of
Eq. (4.46) and Eq. (4.47), minus Eq. (4.48):
s_y(i,j) = -\frac{1}{2} \left( d_y^2(i,j) - μ_j(d_y^2(i,j)) - μ_i(d_y^2(i,j)) + μ_{i,j}(d_y^2(i,j)) \right) . \qquad (4.49)
The algorithm that achieves MDS is summarized in Fig. 4.1. In this algo-
rithm, it is noteworthy that, due to symmetry, the third term in the right-
hand side of Eq. (4.45) is the transpose of the second one, which is in turn a
subfactor of the fourth one. Similarly, the product in the last step of the
algorithm can be computed more efficiently by directly removing the un-
necessary rows/columns of U and Λ. The algorithm in Fig. 4.1 can easily
be implemented in less than 10 lines in MATLAB®. A C++ implementa-
tion can also be downloaded from http://www.ucl.ac.be/mlg/. The only
parameter of metric MDS is the embedding dimension P . Computing the
pairwise distances requires O(N 2 ) memory entries and O(N 2 D) operations.
Actually, time and space complexities of metric MDS are directly related to
those of an EVD. Computing all eigenvalues and eigenvectors of an N -by-N
nonsparse matrix typically demands at most O(N 3 ) operations, depending on
the implementation.

1. If available data consist of vectors gathered in Y, then center them, com-


pute the pairwise scalar products S = YT Y, and go to step 3.
2. If available data consist of pairwise Euclidean distances, transform them
into scalar products:
• Square the distances and build D.
• Perform the double centering of D, according to Eq. (4.45); this yields
S.
3. Compute the eigenvalue decomposition S = U Λ U^T .
4. A P -dimensional representation is obtained by computing the product
   \hat{X} = I_{P \times N} Λ^{1/2} U^T .

Fig. 4.1. Algorithm for classical metric multidimensional scaling.
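For illustration, a compact sketch of the algorithm of Fig. 4.1 in Python with NumPy; it is not the MATLAB or C++ implementation mentioned above, and the function name and toy data are assumptions of this example.

import numpy as np

def classical_mds(dist, P):
    # Classical metric MDS (Fig. 4.1): pairwise Euclidean distances -> P-dimensional coordinates.
    N = dist.shape[0]
    D2 = dist ** 2                                   # squared distances, Eq. (4.38)
    J = np.eye(N) - np.ones((N, N)) / N              # centering matrix
    S = -0.5 * J @ D2 @ J                            # double centering, Eq. (4.45)
    lam, U = np.linalg.eigh(S)                       # EVD of the Gram matrix
    order = np.argsort(lam)[::-1]                    # eigenvalues in descending order
    lam, U = lam[order], U[:, order]
    return np.sqrt(np.maximum(lam[:P], 0.0))[:, None] * U[:, :P].T   # Eq. (4.30)

# Usage: reduce a random 3-dimensional data set to 2 dimensions.
rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 3))
dist = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
X = classical_mds(dist, P=2)
print(X.shape)   # (2, 50)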

Embedding of test set

If the data set is available as coordinates, then the equivalence between metric
MDS and PCA can be used to easily embed a test set. In practice, this means
that the principal components vd are explicitly known, after either the SVD
Y = VΣUT or the EVD of the estimated covariance matrix Ĉyy = VΛVT .
A point y of the test set is then embedded by computing the product:

\hat{x} = I_{P \times D} V^T y , \qquad (4.50)

as for any point y(i) of the data set.


If the test set is given as scalar products, or if the scalar products are
computed from the coordinates, then the knowledge of V is useless. Indeed,
in that case, a test point y is written as the column vector

s = [y(i) \cdot y]_{1 \leq i \leq N} \qquad (4.51)
  = Y^T y . \qquad (4.52)

Knowing that the SVD of Y is written as Y = V Σ U^T , it follows that

s = U Σ^T V^T y . \qquad (4.53)

Assuming that the test point y has been generated according to y = V I_{D \times P} x,
then

s = U Σ^T V^T V I_{D \times P} x \qquad (4.54)
  = U Σ^T I_{D \times P} x , \qquad (4.55)

where V disappears. Knowing the EVD Y^T Y = U Λ U^T , it further follows
that

s = U Λ^{1/2} I_{N \times P} x , \qquad (4.56)

and, eventually,

\hat{x} = I_{P \times N} Λ^{-1/2} U^T s , \qquad (4.57)
which gives the desired P -dimensional coordinates. This corresponds to the
Nyström formula [6, 16].
If the test set is given as distances, then a test point y is written as the
column vector

d = [(y(i) - y) \cdot (y(i) - y)]_{1 \leq i \leq N} \qquad (4.58)
  = [y(i) \cdot y(i) - 2 y(i) \cdot y + y \cdot y]_{1 \leq i \leq N} \qquad (4.59)
  = -2 s + [y(i) \cdot y(i) + y \cdot y]_{1 \leq i \leq N} . \qquad (4.60)

Actually, a slightly modified version of the double centering can be applied in


order to determine s:
s = -\frac{1}{2} \left( d - \frac{1}{N} 1_N 1_N^T d - \frac{1}{N} D 1_N + \frac{1}{N^2} 1_N 1_N^T D 1_N \right) , \qquad (4.61)
where the second term is a column vector repeating the mean of d, the third
term is a column vector containing the mean of the rows of D, and the fourth
term is a column vector repeating the grand mean of D. The above equation
is only a discrete approximation and stems from a continuous formulation of
double centering where N → ∞ (see [16] for more details).
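A minimal sketch in Python/NumPy of this out-of-sample step, combining the approximate double centering of Eq. (4.61) with the Nyström formula of Eq. (4.57); the function is hypothetical and assumes that lam and U come from the EVD of the Gram matrix computed on the training set.

import numpy as np

def mds_out_of_sample(D2_train, d2_test, lam, U, P):
    # D2_train: N x N squared training distances; d2_test: squared distances from the test point
    # to the N training points; lam, U: eigenvalues and eigenvectors of the training Gram matrix.
    s = -0.5 * (d2_test
                - d2_test.mean()              # column vector repeating the mean of d
                - D2_train.mean(axis=1)       # row means of D
                + D2_train.mean())            # grand mean of D
    return (U[:, :P].T @ s) / np.sqrt(lam[:P])    # Eq. (4.57)

# Hypothetical usage, reusing quantities from a previous metric MDS run:
# x_test = mds_out_of_sample(D2, d2_new, lam, U, P=2)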

Example

Figure 4.2 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. Knowing that metric MDS projects data


Fig. 4.2. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by metric MDS.

in a linear way, the results in Fig. 4.2 are not very surprising. Intuitively, they
look like pictures of the two manifolds shot from aside (Swiss roll) and from
above (open box), respectively.

In the case of the Swiss roll, such a result is not very useful, since all turns
of the manifold are superposed. Similarly, for the open box, the presence of
lateral faces can be guessed visually, but with difficulty: only the bottom face
and the lid are clearly visible. These disappointing results can be explained
theoretically by looking at the first few normalized eigenvalues

[λn ]1≤n≤N = [0.46, 0.31, 0.22, 0.00, 0.00, 0.00, . . .] (4.62)

for the Swiss roll and

[λn ]1≤n≤N = [0.58, 0.24, 0.18, 0.00, 0.00, 0.00, . . .] (4.63)

for the open box. In accordance with theory, only three eigenvalues are non-
zero. Unfortunately, none of these three eigenvalues can be neglected compared
to the others. This clearly means that metric MDS fails to detect that the two
benchmark manifolds are two-manifolds. Any embedding with dimensionality
less than three would cause an important information loss.

Classification

Exactly like PCA, metric MDS is an offline or batch method. The optimiza-
tion method is exact and purely algebraical: the optimal solution is obtained
in closed form. Metric MDS is also said to be a spectral method, since the core
operation in its procedure is an EVD of a Gram matrix. The model is continu-
ous and strictly linear. The mapping is implicit. Other characteristics of PCA
are also kept: eigenvalues of the Gram matrix can be used to estimate the
intrinsic dimensionality, and several embeddings can be built incrementally
by adding or removing eigenvectors in the solution.

Advantages and drawbacks

Classical metric MDS possesses all the advantages and drawbacks of PCA: it
is simple, robust, but strictly linear. By comparison with PCA, metric MDS
is more flexible: it accepts coordinates as well as scalar products or Euclidean
distances. On the other hand, running metric MDS requires more memory
than PCA, in order to store the N -by-N Gram matrix (instead of the D-by-
D covariance matrix). Another limitation is the generalization to new data
points, which involves an approximate formula for the double-centering step.

Variants

Classical metric MDS has been generalized into metric MDS, for which pair-
wise distances, instead of scalar products, are explicitly preserved. The term
stress function has been coined to denote the objective function of metric
MDS, which can be written as

E_{mMDS} = \frac{1}{2} \sum_{i,j=1}^{N} w_{ij} (d_y(i,j) - d_x(i,j))^2 , \qquad (4.64)

where dy (i, j) and dx (i, j) are the Euclidean distances in the high- and low-
dimensional space, respectively. In practice, the nonnegative weights wij are
often equal to one, except for missing data (wij = 0) or when one desires to
focus on more reliably measured dissimilarities dy (i, j). Although numerous
variants of metric MDS exist, only one particular version, namely Sammon’s
nonlinear mapping, is studied in the next section. Several reasons explain this
choice: Sammon’s nonlinear mapping meets a wide success beyond the usual
fields of application of MDS. Moreover, Sammon provided his NLDR method
with a well-explained and efficient optimization technique.
In human science, the assumption that collected proximity values are dis-
tance measures might be too strong. Shepard [171] and Kruskal [108] ad-
dressed this issue and developed a method known as nonmetric multidimen-
sional scaling. In nonmetric MDS, only the ordinal information (i.e., proximity
ranks) is used for determining the spatial representation. A monotonic trans-
formation of the proximities is calculated, yielding scaled proximities. Opti-
mally scaled proximities are sometimes referred to as disparities. The problem
of nonmetric MDS then consists of finding a spatial representation that mini-
mizes the squared differences between the optimally scaled proximities and the
distances between the points. In contrast to metric MDS, nonmetric MDS does
not attempt to reproduce scalar products but explicitly optimizes a quantita-
tive criterion that measures the preservation of the pairwise distances. Most
variants of nonmetric MDS optimize the following stress function:

E_{nMDS} = \frac{\sum_{i,j=1}^{N} w_{ij} |f(δ(i,j)) - d_x(i,j)|^2}{c} , \qquad (4.65)
where
• δ(i, j) are the collected proximities;
• f is a monotonic transformation of the proximities, such that the assump-
tion f (δ(i, j)) ≈ dy (i, j) holds, where dy (i, j) is the Euclidean distance
between the unknown data points y(i) and y(j);
• dx (i, j) is the Euclidean distance between the low-dimensional representa-
tions x(i) and x(j) of y(i) and y(j);
• c is a scale factor usually equal to \sum_{i,j=1}^{N} w_{ij} d_y(i,j);
• wij are nonnegative weights, with the same meaning and usage as for
metric MDS.
More details can be found in the vast literature dedicated to nonmetric MDS;
see, for instance, [41, 25] and references therein.
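As a small numerical illustration (Python/NumPy, with hypothetical names), the stress of Eq. (4.65) can be evaluated for a candidate embedding once the disparities f(δ(i,j)) are available; the monotonic regression that produces them is not implemented in this sketch, and the scale factor uses the disparities in place of the unknown d_y(i,j).

import numpy as np

def nonmetric_stress(disparities, X, W=None):
    # disparities: N x N matrix of optimally scaled proximities f(delta(i,j));
    # X: candidate embedding of shape (N, P); W: optional nonnegative weights.
    N = X.shape[0]
    if W is None:
        W = np.ones((N, N))
    dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    c = (W * disparities).sum()          # scale factor (d_y approximated by the disparities)
    return (W * (disparities - dx) ** 2).sum() / c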

4.2.3 Sammon’s nonlinear mapping

In 1969 Sammon [165] proposed a method to establish a mapping between


a high-dimensional space and a lower-dimensional one. Actually, the word
“mapping” may seem rather misleading. Indeed, Sammon’s method does not
exactly determine a continuous mapping between two Cartesian spaces: its
real purpose is just to reduce the dimensionality of a finite set of data points.
Sammon’s method has met a considerable success, and still today much work is
devoted to either its enhancement [148] or its applications. Due to its ubiquity
in many fields of data analysis, several names or acronyms refer to Sammon’s
method: Sammon’s nonlinear mapping, Sammon mapping, NLM (standing for
nonlinear mapping), etc. The acronym NLM is preferred in this book.

Embedding of data set

Actually, NLM is closely related to metric MDS (see Subsection 4.2.2). No


generative model of data is assumed: only a stress function is defined. Conse-
quently, the low-dimensional representation can be totally different from the
distribution of the true latent variables. More precisely, NLM minimizes the
following stress function:

E_{NLM} = \frac{1}{c} \sum_{1 \leq i < j \leq N} \frac{(d_y(i,j) - d_x(i,j))^2}{d_y(i,j)} , \qquad (4.66)

where
• dy (i, j) is a distance measure between the ith and jth points in the D-
dimensional data space,
• dx (i, j) is the Euclidean distance between the ith and jth points in the
P -dimensional latent space.
The normalizing constant c is defined as


c = \sum_{1 \leq i < j \leq N} d_y(i,j) . \qquad (4.67)

In the definition of ENLM , no assumption is made about the distance function


dy (i, j) in the high-dimensional space. Classically, the Euclidean distance is
chosen by default. Moreover, as the algorithm usually works in batch mode, all
distances dy (i, j) must be known in advance. On the contrary, the distances
in the low-dimensional space are imposed to be Euclidean for the sake of
simplicity in the following developments. Hence, d_x(i,j) = \|x(i) - x(j)\|_2 ,
where the points x(i) and x(j) are the low-dimensional representations of the
data points y(i) and y(j).

Sammon’s stress can be cast as an instance of metric MDS (see (4.64)), for
which wij = 1/dy (i, j). The intuitive meaning of the factor 1/dy (i, j), which is
weighting the summed terms, is clear: it gives less importance to errors made
on large distances. During the dimensionality reduction, a manifold should be
unfolded in order to be mapped to a Cartesian vector space, which is flat, in
contrast with the initial manifold, which can be curved. This means that long
distances, between faraway points, cannot be preserved perfectly if the curva-
ture of the manifold is high: they have to be stretched in order to “flatten” the
manifold. On the other hand, small distances can be better preserved since on
a local scale the curvature is negligible, or at least less important than on the
global scale. Moreover, the preservation of short distances allows us to keep
the local cohesion of the manifold. In summary, the weighting factor simply
adjusts the importance to be given to each distance in Sammon’s stress, ac-
cording to its value: the preservation of long distances is less important than
the preservation of shorter ones, and therefore the weighting factor is chosen
to be inversely proportional to the distance.
Obviously, Sammon’s stress ENLM is never negative and vanishes in the
ideal case where dy (i, j) = dx (i, j) for all pairs {i, j}. The minimization
of ENLM is performed by determining appropriate coordinates for the low-
dimensional representations x(i) of each observation y(i). Although ENLM is a
relatively simple continuous function, its minimization cannot be performed in
closed form, in contrast with the error functions of PCA and classical metric
MDS. Nevertheless, standard optimization techniques can be applied in order
to find a solution in an iterative manner. Sammon’s idea relies on a variant
of Newton’s method, called quasi-Newton optimization (see Appendix C.1).
This method is a good tradeoff between the exact Newton method, which
involves the Hessian matrix, and a gradient descent, which is less efficient. As
Sammon’s stress ENLM depends on N P parameters, the Hessian would have
been much too big! According to Eq. (C.12), the quasi-Newton update rule
that iteratively determines the parameters xk (i) of ENLM can be written as
∂ENLM
∂x (i)
xk (i) ← xk (i) − α  2 k  , (4.68)
 ∂xk (i)2 
∂ ENLM

where the absolute value is used to distinguish the minima from the max-
ima. Sammon [165] recommends setting α (called magic factor in his paper)
between 0.3 and 0.4.
As d_x(i,j) is the Euclidean distance between vectors x(i) and x(j), it follows
from Eq. (4.9) that

d_x(i,j) = \|x(i) - x(j)\|_2 = \sqrt{\sum_{k=1}^{P} (x_k(i) - x_k(j))^2} . \qquad (4.69)

Therefore, the first partial derivative of ENLM is



\frac{\partial E_{NLM}}{\partial x_k(i)} = \frac{\partial E_{NLM}}{\partial d_x(i,j)} \frac{\partial d_x(i,j)}{\partial x_k(i)} \qquad (4.70)
= \frac{-2}{c} \sum_{j=1, j \neq i}^{N} \frac{d_y(i,j) - d_x(i,j)}{d_y(i,j)} \frac{\partial d_x(i,j)}{\partial x_k(i)} \qquad (4.71)
= \frac{-2}{c} \sum_{j=1, j \neq i}^{N} \frac{d_y(i,j) - d_x(i,j)}{d_y(i,j)} \frac{x_k(i) - x_k(j)}{d_x(i,j)} \qquad (4.72)
= \frac{-2}{c} \sum_{j=1, j \neq i}^{N} \frac{d_y(i,j) - d_x(i,j)}{d_y(i,j) \, d_x(i,j)} \, (x_k(i) - x_k(j)) . \qquad (4.73)

After simplification, the second derivative of ENLM is


 
\frac{\partial^2 E_{NLM}}{\partial x_k(i)^2} = \frac{-2}{c} \sum_{j=1, j \neq i}^{N} \left( \frac{d_y(i,j) - d_x(i,j)}{d_y(i,j) \, d_x(i,j)} - \frac{(x_k(i) - x_k(j))^2}{d_x^3(i,j)} \right) . \qquad (4.74)

The procedure detailed in Fig. 4.3 implements the above-mentioned ideas.


A MATLAB® function that performs Sammon's nonlinear mapping can be

1. Compute all pairwise distances dy (i, j) in the D-dimensional data space.


2. Initialize the P -dimensional coordinates of all points x(i), either randomly
or on the hyperplane spanned by the first P principal components of the
data set (after PCA or MDS).
3. Compute the right-hand side of Eq. (4.68) for the coordinates of all points
x(i).
4. Update the coordinates of all points x(i).
5. Return to step 3 until the value of the stress function no longer decreases.

Fig. 4.3. Algorithm implementing Sammon’s nonlinear mapping.

found in the SOM toolbox, which is available at http://www.cis.hut.fi/projects/somtoolbox/. A C++ implementation can also be downloaded
from http://www.ucl.ac.be/mlg/. In addition to the embedding dimension-
ality P , Sammon’s NLM involves several parameters, essentially due to its
iterative optimization scheme. These are the number of iterations and the
magic factor α; also, it is noteworthy that initialization may play a part in
the final result. Space complexity is O(N 2 ), corresponding to the amount of
memory required to store the pairwise distances. Time complexity is O(N 2 P )
per iteration.
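Below is a simplified sketch of the procedure of Fig. 4.3 in Python/NumPy. It applies the update of Eq. (4.68) with the derivatives of Eqs. (4.73) and (4.74), but runs for a fixed number of iterations instead of testing the stress value; the random initialization, the numerical safeguards, and the function name are assumptions of this sketch rather than details of Sammon's original method.

import numpy as np

def sammon_nlm(dy, P=2, n_iter=200, alpha=0.35, seed=0):
    # dy: N x N matrix of pairwise distances in the data space.
    N = dy.shape[0]
    eps = 1e-9
    c = dy[np.triu_indices(N, k=1)].sum()            # normalizing constant, Eq. (4.67)
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=1e-2, size=(N, P))          # random initialization (PCA is an alternative)
    for _ in range(n_iter):
        diff = X[:, None, :] - X[None, :, :]         # x_k(i) - x_k(j)
        dx = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dx, 1.0)                    # dummy diagonal, excluded below
        ratio = (dy - dx) / (dy * dx + eps)
        np.fill_diagonal(ratio, 0.0)                 # exclude j = i
        grad = (-2.0 / c) * (ratio[:, :, None] * diff).sum(axis=1)                               # Eq. (4.73)
        hess = (-2.0 / c) * (ratio[:, :, None] - diff ** 2 / (dx ** 3)[:, :, None]).sum(axis=1)  # Eq. (4.74)
        X = X - alpha * grad / np.maximum(np.abs(hess), 1e-4)                                    # Eq. (4.68)
    return X

# Usage on a toy data set:
Y = np.random.default_rng(1).normal(size=(100, 3))
dy = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(sammon_nlm(dy, P=2).shape)   # (100, 2)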

Embedding of test set

Sammon’s original method was published without regards to the embedding


of test points. Test data available as distances or coordinates can be embedded
by means of two techniques:
• The easiest solution consists of running a modified NLM for each test point,
where all data points are taken into account but only the test point to be
embedded is updated. Unfortunately, this procedure generally gives poor
results, because NLM tries to find a global minimum of the stress function
for each test point. Indeed, it is intuitively clear that a local projection
would perform better: only a few data points around a given test point are
needed to place it correctly in the low-dimensional space.
• The interpolation procedure of CCA (see Subsection 4.2.4) could also be
used, precisely because it can perform a local projection thanks to its
adjustable neighborhood width. But it does not behave much better than
the preceding solution.
When points are available as coordinates only, another possibility exists. It
involves using neural variants of NLM, like the SAMANN (discussed ahead),
which naturally provides the ability to generalize the mapping to test points,
at the expense of a more complex and perhaps less precise algorithm for the
embedding of the data set.

Example

Figure 4.4 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. These results are obtained with the step


Fig. 4.4. Two-dimensional embeddings of the ‘Swiss roll’ and ‘open box’ data sets
(Fig. 1.4), found by Sammon’s NLM.

size α set to 0.3. By comparison with metric MDS, NLM can embed in a
nonlinear way. It is clearly visible that the two manifolds look “distorted” in

their two-dimensional embedding: some parts are stretched while others are
compressed.
Unfortunately, the results of NLM remain disappointing for the two bench-
mark manifolds, particularly for the Swiss roll. Turns of the spiral are super-
posed, meaning that the mapping between the initial manifold and its two-
dimensional embedding is not bijective. For the open box, NLM leads to a
better result than metric MDS, but two faces are still superposed.
The shape of the embedded manifolds visually shows how NLM trades off
the preservation of short against longer distances.

Classification

Sammon’s NLM is a batch method. The model is nonlinear and discrete,


producing an explicit mapping; no generative model of data is provided. No
vector quantization was included in the original method described in Sub-
section 4.2.3. However, this useful preprocessing can easily be grafted on the
method.
Sammon’s NLM uses an approximate optimization procedure, which can
possibly get stuck in a local minimum. The method does not include an es-
timator of the intrinsic dimensionality; the embedding dimension is actually
fixed by the user. Incremental or layered embeddings are not possible: the
method must be run separately for each specified dimensionality.

Advantages and drawbacks

By comparison with classical metric MDS, NLM can efficiently handle non-
linear manifolds, at least if they are not too heavily folded. Among other
nonlinear variants of metric MDS, NLM remains relatively simple and ele-
gant.
As a main drawback, NLM lacks the ability to generalize the mapping to
new points.
In the same way as many other distance-preserving methods, NLM in its
original version works with a complete matrix of distances, hence containing
O(N 2 ) entries. This may be an obstacle when embedding very large data sets.
Another shortcoming of NLM is its optimization procedure, which may be
slow and/or inefficient for some data sets. In particular, Sammon’s stress func-
tion is not guaranteed to be concave; consequently the optimization process
can get stuck in a local minimum.
Variants of the original NLM, briefly presented ahead, show various at-
tempts to address the above-mentioned issues.

Variants

The SAMANN [134, 44], standing for Sammon artificial neural network, is
a variant of the original NLM that optimizes the same stress function but

achieves it in a completely different way. Instead of minimizing the stress


function as such with a standard optimization technique, which leads to a
discrete mapping of the data set, Mao and Jain propose establishing an im-
plicit mapping between the initial space and the final space. What they propose is
a three-layer MLP with D inputs and P outputs, working in an unsupervised
way with a modified back-propagation rule. More precisely, that rule takes
into account the pairwise distances between the data points in a rather origi-
nal way. Indeed, the SAMANN does not update the network parameters after
the presentation of each input/output pair, as a usual supervised MLP does.
Instead, the SAMANN updates its parameters after the presentation of each
pair of inputs. The distance between the two corresponding outputs, com-
pared to the distance measured between the two inputs, allows one to derive
the necessary corrections for the network parameters in a totally unsupervised
way. The main drawback of the SAMANN is that it requires scaling the data
in order to cast the pairwise distances within the range of the (sigmoidal)
outputs of an MLP. Moreover, despite its appealing elegance, the SAMANN
is outperformed by another extension of NLM with an MLP. In that case [44],
the original NLM is not replaced with an MLP: NLM is run as usual and if
generalization to new points is needed afterwards, a classical supervised MLP
learns the mapping previously obtained by NLM.
Fast versions of NLM, or versions using less memory space, are studied,
for example, in [148]. These variants avoid using the whole distance matrix
by using, for instance, triangulation methods. Another way of reducing the
problem size consists of using vector quantization before running NLM.
From an algorithmic point of view, even when the original architecture
of NLM is kept, some freedom remains in the way the stress function is op-
timized. It has often been reported that the second-order gradient method
Sammon proposed has several weaknesses. The quasi-Newton optimization
neglects off-diagonal entries of the Hessian matrix. Moreover, in order to dis-
tinguish minima from maxima, the remaining diagonal elements are taken in
absolute values; this simple trick just changes the direction of the update in
Eq. (4.68) while a modification of its norm would be required, too. Briefly
put, the step size in the gradient descent of the original NLM is computed by
an often ill-suited heuristics. Various other optimization techniques, showing
better properties, may be used.
Finally, until now, very few variants of NLM have considered a change of
metric to increase the ability of NLM to deal with heavily curved manifolds.
Actually, as already mentioned, the Euclidean distance plays an essential role
only in the embedding space, in order to derive a simple update rule from the
stress function. In the data space, any distance function may suit, at least
if it behaves more or less in the same way as the Euclidean distance, seeing
that the stress function tries to equate them. A version of NLM using graph
distances is described in Subsection 4.3.3.

4.2.4 Curvilinear component analysis

Demartines and Hérault proposed curvilinear component analysis (CCA) in


1995 [47], as an enhancement of Kohonen’s self-organizing maps (see Sub-
section 5.2.1) when the latter method is used for nonlinear dimensionality
reduction [48]. Nevertheless, and also despite its denomination being very
close to that of PCA, CCA does not derive or even resemble either PCA or
a SOM. Actually, CCA belongs to the class of distance-preserving methods
and is more closely related to Sammon’s NLM [48]. Common points between
CCA and SOM must be sought elsewhere:
First, like an SOM, the CCA procedure includes a vector quantization
step (see Appendix D). CCA was actually the first method to combine vector
quantization with a nonlinear dimensionality reduction achieved by distance
preservation. Historically, indeed, CCA was called VQP in [46, 45], which
stands for vector quantization and projection.
Second, CCA and an SOM work with the same kind of optimization tech-
niques, borrowed from the field of artificial neural networks. In the original
VQP method described by Demartines in his PhD thesis, the algorithm per-
forms simultaneously the vector quantization and the nonlinear dimensionality
reduction, exactly like an SOM.
In the following, however, the vector quantization is considered an optional
preprocessing of the data. Actually, vector quantization can be applied to
reduce the number of vectors in large databases before running the DR method. For small
databases or sparsely sampled manifolds, however, it is often better to skip
vector quantization, in order to fully exploit the available information.

Embedding of data set

Like most distance-preserving methods, CCA minimizes a stress or error func-


tion, which is written as

E_{CCA} = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (d_y(i,j) - d_x(i,j))^2 F_λ(d_x(i,j)) \qquad (4.75)

and closely resembles Sammon’s stress function. As for the latter, no gen-
erative model of data is assumed. Just as usual, dy (i, j) and dx (i, j) are,
respectively, the distances in the data space and the Euclidean distance in the
latent space. There are two differences, however:
• No scaling factor stands in front of the sum; this factor is not very impor-
tant, except for quantitative comparisons.
• The weighting 1/dy (i, j) is replaced with a more generic factor Fλ (dx (i, j)).
Of course, Fλ may not be any function. Like for the weighting of Sammon’s
NLM, the choice of Fλ is guided by the necessity to preserve short distances

prior to longer ones. Because the global shape of the manifold has to be
unfolded, long distances often have to be stretched, and their contribution in
the stress should be low. On the other hand, the good preservation of short
distances is easier (since the curvature of the manifold is often low on a local
scale) but is also more important, in order to preserve the structure of the
manifold. Consequently, Fλ is typically chosen as a monotonically decreasing
function of its argument. As CCA works on finite data sets, Fλ is also chosen
bounded in order to prevent an abnormally short or null distance to dominate
the other contributions in the stress function. This is especially critical because
Fλ depends on the distances in the embedding space, which are varying and
could temporarily be very small. Indeed, more important than the function Fλ
in itself is the argument of Fλ . In contrast with Sammon’s stress, the weighting
does not depend on the constant distances measured in the data space but on
the distances being optimized in the embedding space. When distances can
be preserved, the hypothesis dy (i, j) ≈ dx (i, j) holds and CCA behaves more
or less in the same way as NLM. If d_x(i,j) ≪ d_y(i,j) for some i and j, then
the manifold is highly folded up and the contribution of this pair will increase
in order to correct the flaw. But what happens if d_x(i,j) ≫ d_y(i,j)? Then
the contribution of this pair will decrease, meaning intuitively that CCA will
allow some stretching not only for long distances but also for shorter ones.
Demartines and Hérault designed an optimization procedure specifically
tailored for CCA. Like most optimization techniques, it is based on the deriva-
tive of ECCA . Using the short-hand notations dy = dy (i, j) and dx = dx (i, j),
the derivative can be written as
\frac{\partial E_{CCA}}{\partial x_k(i)} = \frac{\partial E_{CCA}}{\partial d_x} \frac{\partial d_x}{\partial x_k(i)}
= \sum_{j=1}^{N} (d_y - d_x) \left[ 2 F_λ(d_x) - (d_y - d_x) F'_λ(d_x) \right] \frac{x_k(j) - x_k(i)}{d_x} .

(See Eqs. (4.10) and (4.11) for the derivative of dx (i, j).) Alternatively, in
vector form, this gives


∇_{x(i)} E_{CCA} = \sum_{j=1}^{N} (d_y - d_x) \left[ 2 F_λ(d_x) - (d_y - d_x) F'_λ(d_x) \right] \frac{x(j) - x(i)}{d_x} , \qquad (4.76)

where ∇x(i) ECCA represents the gradient of ECCA with respect to vector x(i).
The minimization of ECCA by a gradient descent gives the following update
rule:
x(i) ← x(i) − α∇x(i) ECCA , (4.77)
where α is a positive learning rate scheduled according to the Robbins–Monro
conditions [156]. Clearly, the modification brought to vector x(i) proves to be
a sum of contributions coming from all other vectors x(j) (j = i). Each con-
tribution may be seen as the result of a repulsive or attractive force between

vectors x(i) and x(j). Each contribution can also be written as a scalar co-
efficient β(i, j) multiplying the unit vector (x(j) − x(i))/dx (i, j). The scalar
coefficient is
 
β(i,j) = (d_y - d_x) \left[ 2 F_λ(d_x) - (d_y - d_x) F'_λ(d_x) \right] = β(j,i) . \qquad (4.78)

Provided that condition

2 F_λ(d_x) > (d_y - d_x) F'_λ(d_x) \qquad (4.79)

holds, the coefficient is negative when the low-dimensional distance dx (i, j) is


shorter than the original high-dimensional distance dy (i, j), and positive in
the converse case. This means that when vector x(j) is too far from vector
x(i), it is moved closer, whereas it is moved farther away in the converse
case. Hence, the update rule just behaves as intuitively expected. However, it
has a major drawback; the sum of all radial contributions to the update of a
given vector x(i) leads to an averaging; as a consequence, the update is small
and the convergence slows down. Intuitively, the update rule behaves like a
very careful diplomat: when listening to numerous contradictory opinions at
the same time, he makes no clear decision and the negotiations stagnate. As a
consequence, the update rule easily gets stuck in a local minimum of ECCA : the
sum of all contributions vanishes, whereas the same terms taken individually
are nonzero. This phenomenon is illustrated in Fig. 4.5.
A simple modification of the update rule allows the diplomat to show more
dynamic behavior. The idea just consists of reorganizing the negotiations in
such a way that every participant gives its opinion sequentially while the
others stay quiet, and the diplomat immediately decides. For this purpose,
E_{CCA} is decomposed into a sum of indexed terms:

E_{CCA} = \sum_{i=1}^{N} E^i_{CCA} , \qquad (4.80)

where
E^i_{CCA} = \frac{1}{2} \sum_{j=1}^{N} (d_y(i,j) - d_x(i,j))^2 F_λ(d_x(i,j)) . \qquad (4.81)

The separate optimization of each E^i_{CCA} leads to the following update rule:

x(j) ← x(j) - α ∇_{x(j)} E^i_{CCA}
     ← x(j) - α β(i,j) \frac{x(i) - x(j)}{d_x} . \qquad (4.82)
The modified procedure works by interlacing the gradient descent of each
term E^i_{CCA}: it updates all x(j) for E^1_{CCA}, then all x(j) for E^2_{CCA}, and so on.
Geometrically, the new procedure pins up a vector x(i) and moves all other
x(j) radially, without regards to the cross contributions computed by the

[Figure 4.5: two configurations of the points x(1), x(2), x(3), and x(4), labeled "Global Minimum" (left) and "Local Minimum" (right).]

Fig. 4.5. Visual example of local minimum that can occur with distance preserva-
tion. The expected solution is shown on the left: the global minimum is attained
when x(1) is exactly in the middle of the triangle formed by x(2), x(3), and x(4).
A local minimum is shown on the right: dx (1, 2) and dx (1, 3) are too short, whereas
dx (1, 4) is too long. In order to reach the global minimum, x(1) must be moved
down, but this requires shortening dx (1, 2) and dx (1, 3) even further, at least tem-
porarily. This is impossible for a classical gradient descent: the three contributions
from x(2), x(3), and x(4) vanish after adding them, although they are not zero when
taken individually. Actually, the points x(2), x(3) tend to move x(1) away, while
x(4) pulls x(1) toward the center of the triangle. With a stochastic approach, the
contributions are considered one after the other, in random order: for example, if
the contribution of x(4) is taken into account first, then x(1) can escape from the
local minimum.

first rule (classical gradient, Eq. (4.77)). Computationally, the new procedure
performs many more updates than the first rule for (almost) the same number
of operations: while the application of the first rule updates one vector, the
new rule updates N − 1 vectors. Moreover, the norm of the updates is larger
with the new rule than with the first one. A more detailed study in [45] shows
that the new rule minimizes the global error function ECCA , not strictly like
a normal gradient descent, but well on average.
Unless the embedding dimensionality is the same as the initial dimension-
ality, in which case the embedding is trivial, the presence and the choice of
Fλ are very important. As already mentioned, the embedding of highly folded
manifolds requires focusing on short distances. Longer distances have to be
stretched in order to achieve the unfolding, and their contribution must be
lowered in ECCA . Therefore, Fλ is usefully chosen as a positive and decreasing
function. For example, Fλ could be defined as the following:
 
F_λ(d_x) = \exp\left( -\frac{d_x}{λ} \right) , \qquad (4.83)
where λ controls the decrease. However, the condition (4.79) must hold, so
that the rule behaves as expected, i.e., points that are too far away are brought
closer to each other and points that are too close are moved away. As Fλ is positive

and decreasing, the condition (4.79) becomes


 
\left| \frac{F_λ(d_x)}{F'_λ(d_x)} \right| > \frac{1}{2} (d_x - d_y) , \qquad (4.84)

and in the particular case of Eq. (4.83):


λ > \frac{1}{2} (d_x - d_y) , \qquad (4.85)
which is difficult to ensure practically, because the distances in the embedding
space may grow during the unfolding. Consequently, Fλ is preferably defined
as
Fλ (dx ) = H(λ − dx ) , (4.86)
where H(u) is a step function defined as:

H(u) = \begin{cases} 0 & \text{if } u \leq 0 \\ 1 & \text{if } u > 0 \end{cases} . \qquad (4.87)
Of course, the main interest of this simple function is its null derivative making
the condition (4.79) trivially fulfilled. Unfortunately, as a counterpart, the
behavior of the function becomes a little bit abrupt: distances larger than the
value of λ are not taken into account at all. As a consequence, the choice of
λ becomes critical.
A first way to circumvent the problem consists of using different values for
λ during the convergence. Like in an SOM (see Subsection 5.2.1), the user can
schedule the value of λ: high values, close to the maximal distance measured
in the data, can be used during the first updates, whereas lower values finely
tune the embedding when convergence is almost reached.
A second way to circumvent the critical choice of λ consists in neglecting
the derivative of Fλ in the update rule. This can be seen as using a staircase
function with infinitely small stairwidths.
In both cases, the hyperparameter λ may be interpreted as a neighborhood
width, as in an SOM.
Figure 4.6 summarizes the procedure achieving CCA. A function written in
MATLAB® that implements CCA can be found in the SOM toolbox, which
is available at http://www.cis.hut.fi/projects/somtoolbox/. A C++
implementation can also be downloaded from http://www.ucl.ac.be/mlg/.
The main parameter of CCA is the embedding dimensionality. Like Sammon’s
mapping, CCA relies on an iterative optimization procedure that involves sev-
eral parameters. The latter are the number of iterations and for each iteration
the scheduled values of the learning rate α and neighborhood width λ. The
choice of the weighting function can be considered, too. Finally, vector quan-
tization also involves some parameters (number of prototypes, number of iter-
ations, convergence parameters, etc.). Space complexity of CCA is O(M 2 ) as
with most distance-preserving methods, whereas time complexity is O(M 2 P ),
where M ≤ N is the number of prototypes obtained after vector quantization.

1. Perform a vector quantization (see Appendix D) to reduce the size of the


data set, if needed.
2. Compute all pairwise Euclidean distances dy (i, j) in the D-dimensional
data space.
3. Initialize the P -dimensional coordinates of all points x(i), either randomly
or on the hyperplane spanned by the first principal components (after a
PCA). Let q be equal to 1.
4. Give the learning rate α and the neighborhood width λ their scheduled
value for epoch number q.
5. Select a point x(i), and update all other ones according to Eq. (4.82).
6. Return to step 5 until all points x(i) have been selected exactly once
during the current epoch.
7. Increase q, and if convergence is not reached return to step 4.

Fig. 4.6. Algorithm implementing curvilinear component analysis.
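For illustration, a simplified sketch of the procedure of Fig. 4.6 in Python/NumPy, using the step-function weighting of Eqs. (4.86)-(4.87), for which F'_λ = 0 and β(i,j) of Eq. (4.78) reduces to 2(d_y − d_x)F_λ(d_x). The learning-rate and neighborhood-width schedules below are hypothetical choices, not the schedules recommended by the authors, and any vector quantization is assumed to have been performed beforehand.

import numpy as np

def cca(dy, P=2, n_epochs=50, alpha0=0.5, seed=0):
    # dy: M x M matrix of pairwise distances between prototypes (or data points).
    N = dy.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=1e-2, size=(N, P))
    lam_max, lam_min = dy.max(), dy.max() / 10.0        # assumed schedule endpoints
    for q in range(n_epochs):
        t = q / max(n_epochs - 1, 1)
        alpha = alpha0 * (1.0 - 0.9 * t)                # decreasing learning rate
        lam = lam_max * (lam_min / lam_max) ** t        # decreasing neighborhood width
        for i in rng.permutation(N):                    # select each point once per epoch
            diff = X[i] - X                             # x(i) - x(j) for all j
            dx = np.linalg.norm(diff, axis=1) + 1e-9
            beta = 2.0 * (dy[i] - dx) * (dx < lam)      # Eq. (4.78) with a null derivative of F_lambda
            beta[i] = 0.0                               # x(i) is pinned
            X = X - alpha * (beta / dx)[:, None] * diff # Eq. (4.82): move every other x(j) radially
    return X

# Usage: X = cca(dy, P=2), with dy computed as in the NLM sketch above.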

Embedding of test set

Because it stems from the field of artificial neural networks, CCA has been
provided with an interpolation procedure that can embed test points. Com-
pared to the learning phase, the interpolation considers the prototypes (or the
data points if the vector quantization was skipped) as fixed points. For each
test point, the update rule (4.82) is applied, as in the learning phase, in order
to move the embedded test point to the right position.
For simple manifolds, the interpolation works well and can even perform
some basic extrapolation tasks [45, 48]. Unfortunately, data dimensionality
that is too high or the presence of noise in the data set dramatically reduces
the performance. Moreover, when the manifold is heavily crumpled, the tuning
of the neighborhood width proves still more difficult than during the learning
phase.

Example

Figure 4.7 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. As can be seen in the figure, CCA succeeds
in embedding the two benchmark manifolds in a much more satisfying way
than Sammon’s NLM. The two-dimensional embedding of the Swiss roll is al-
most free of superpositions. To achieve this result, CCA tears the manifold in
several pieces. However, this causes some neighborhood relationships between
data points to be both arbitrarily broken and created. From the viewpoint of
the underlying manifold, this also means that the mapping between the initial
and final embeddings is bijective but discontinuous.

Fig. 4.7. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by CCA.

Regarding the open box, the tearing strategy of CCA appears to be more
convincing. All faces of the box are visible, and the mapping is both bijective
and continuous. In this case, the weighted preservation of distances allows
CCA to leave almost all neighborhood relationships unchanged, except where
the two lateral faces are torn. Some short distances are shrunk near the bottom
corners of the box.

Classification

Although VQP, the first version of CCA, was intended to be an online method,
like usual versions of the SOM, that performed simultaneously vector quan-
tization and dimensionality reduction, most current implementations of CCA
are now simple batch procedures. Several reasons explain this choice. A
clear separation between vector quantization and dimensionality reduction
increases efficiency as well as modularity. Ready-to-use methods of vector
quantization can advantageously preprocess the data. Moreover, these proce-
dures may be skipped to ease the comparison with other DR methods, like
Sammon’s NLM, that are not provided with vector quantization.
In summary, CCA is an offline (batch) method with an optional vector
quantization as preprocessing. The model is nonlinear and discrete; the map-
ping is explicit. The method works with an approximate optimization pro-
cedure. CCA does not include an estimator of the data dimensionality and
cannot build embeddings incrementally. In other words, the intrinsic dimen-
sionality has to be given as an external parameter and cannot be modified
without running the method again.

Advantages and drawbacks

By comparison with Sammon’s NLM, Demartines’ CCA proves much more


flexible, mainly because the user can choose and parameterize the weighting
function Fλ in ECCA at will. This allows one to limit the range of considered

distances and focus on the preservation of distances on a given scale only.


Moreover, the weighting function Fλ depends on the distances measured in
the embedding space (and not in the data space like NLM); this allows tearing
some regions of the manifold, as shown in the example. This is a better solution
than crushing the manifold, like NLM does.
From a computational point of view, the stochastic optimization procedure
of CCA works much better than the quasi-Newton rule of NLM. Convergence
is generally faster.
On the other hand, the interpretation of CCA error criterion is difficult,
since Fλ is changing when CCA is running, according to the schedule of λ.
CCA can also get stuck in local minima, and sometimes convergence may
fail when the learning rate α or the neighborhood width λ is not adequately
scheduled. These are the unavoidable counterparts of the method’s flexibility.

Variants

In 1998 Hérault [84] proposed an enhanced version of CCA including a model


suited for noisy data. In the presence of noise, distances between the given
noisy data points can be longer than the corresponding distances measured
on the underlying manifold. This phenomenon, especially annoying for short
distances, is illustrated in Fig. 4.8. Actually, the embedding of a distance-
preserving method should not take into account the contribution of noise that
is hidden in the distances. For that purpose, Hérault [84] decomposes the task
of dimensionality reduction into two simultaneous subtasks. The first one is
the global unfolding of the manifold, in such a way that it can be embedded
in a lower-dimensional space. The second subtask aims at “projecting” locally
the noisy data points onto the underlying manifold.
As unfolding generally stretches the distances, the condition to perform
unfolding could be dy ≤ dx . And since noise elimination tends to shrink the
polluted distances, the corresponding condition is dy ≥ dx .
For the global unfolding, Hérault keeps the former definition of E^i_{CCA},
which is written as

E^i_{CCA-u} = \frac{1}{2} \sum_{j=1}^{N} (d_y - d_x)^2 F_λ(d_x) . \qquad (4.88)

On the other hand, for the local projection, the error function becomes

E^i_{CCA-p} = \frac{1}{2} \sum_{j=1}^{N} (d_y^2 - d_x^2)^2 F_λ(d_x) , \qquad (4.89)

where the quadratic term resembles the one used in Torgerson’s classical met-
ric MDS (see Subsection 4.2.2). The local projection thus behaves like a local
PCA (or MDS) and projects the noisy points onto the underlying manifold
(more or less) orthogonally.

[Figure 4.8: the C-shaped manifold shown in 2D and in 1D, with a legend distinguishing distances to stretch (unfolding) from distances to shrink (projection).]

Fig. 4.8. The C curve is a one-manifold embedded in a two-dimensional space.


Assuming, in the theoretical case, that infinitely small distances are perfectly pre-
served, the dimensionality reduction then consists of unrolling the C curve in order
to make it straight. In practical cases, only a discrete set of distances is available.
Some distances between pairs of noisy points (bold black circles) are shown. Long
distances often need to be stretched, because the curvature of the manifold gives
them a smaller value than expected. On the other hand, shorter distances are cor-
rupted by noise (noise is negligible for long distances); this makes them longer than
expected.

Assuming F_λ has a null derivative, the gradients of E^i_{CCA-u} and E^i_{CCA-p} are:

∇_{x(j)} E^i_{CCA-u} = 2 (d_y - d_x) F_λ(d_x) \frac{x(j) - x(i)}{d_x} , \qquad (4.90)
∇_{x(j)} E^i_{CCA-p} = -4 (d_y^2 - d_x^2) F_λ(d_x) (x(j) - x(i)) . \qquad (4.91)
In order to obtain the same error function in both cases around dx = dy ,
Hérault equates the second derivatives of E^i_{CCA-u} and E^i_{CCA-p} and scales E^i_{CCA-p}
accordingly:

E_{CCA} = \sum_{i=1}^{N} \begin{cases} E^i_{CCA-u} & \text{if } d_x > d_y \\ \frac{1}{4 d_y^2(i,j)} E^i_{CCA-p} & \text{if } d_x < d_y \end{cases} . \qquad (4.92)
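A small sketch in Python/NumPy showing how the combined error of Eq. (4.92) can be evaluated from two matrices of pairwise distances; the step weighting of Eq. (4.86) is used and the function name is a hypothetical choice made for this example.

import numpy as np

def herault_cca_error(dy, dx, lam):
    # dy, dx: N x N pairwise distances in the data space and in the embedding space.
    F = (dx < lam).astype(float)                        # F_lambda(dx) = H(lambda - dx)
    e_unfold = 0.5 * (dy - dx) ** 2 * F                 # unfolding term, Eq. (4.88)
    e_project = 0.5 * (dy ** 2 - dx ** 2) ** 2 * F      # projection term, Eq. (4.89)
    scale = 1.0 / (4.0 * dy ** 2 + 1e-12)               # scaling of Eq. (4.92)
    return np.where(dx > dy, e_unfold, scale * e_project).sum()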
Another variant of CCA is also introduced in [186, 188, 187]. In that ver-
sion, the main change occurs in the weighting function F , which is a balanced

compound of two functions. The first one depends on the distance in the em-
bedding space, as for regular CCA, whereas the second has for argument the
distance in the data space, as for Sammon’s NLM:

Fλ (dx , dy ) = (1 − ρ)H(λ − dx ) + ρH(λ − dy ) . (4.93)

The additional parameter ρ allows one to tune the balance between both
terms. With ρ close to one, local neighborhoods are generally well preserved,
but discontinuities can possibly appear (manifold tears). In contrast, with ρ
close to zero, the global manifold shape is better preserved, often at the price
of some errors in local neighborhoods.
Finally, a method similar to CCA is described in [59]. Actually, this method
is more closely related to VQP [46, 45] than to CCA as it is described in [48].
This method relies on the neural gas [135] for the vector quantization and is
thus able to work online (i.e., with a time-varying data set).

4.3 Graph distances


Briefly put, graph distances attempt to overcome some shortcomings of spatial
metrics like the Euclidean distance. The next subsection introduces both the
geodesic and graph distances, explains how they relate to each other, and
motivates their use in the context of dimensionality reduction. The subsequent
subsections describe three methods of nonlinear dimensionality reduction that
use graph distances.

4.3.1 Geodesic distance and graph distance

In order to reduce the dimensionality of highly folded manifolds, algorithms


like NLM and CCA extend the purely linear MDS mainly by changing the
optimization procedure. Instead of an exact solution computed algebraically,
these algorithms use more sophisticated optimization procedures. This gives
more freedom in the definition of the error function, for example, by allowing
the user to differently weight short and long distances.
Fundamentally, however, the problem of unfolding may be solved from
the opposite side. The goal of NLM and CCA is formalized by complex error
functions that preserve short distances and allow the stretching of longer ones.
But is this not an awkward remedy to the fact that traditional spatial metrics
like the Euclidean one are not adapted to distance preservation? Would it be
possible, just by changing the metric used to measure the pairwise distances,
either to keep the simplicity of metric MDS or to attain better performances?
These two directions are explored respectively by Isomap (Subsection 4.3.2)
and curvilinear distance analysis (Subsection 4.3.4).
The idea of these two methods results from simple geometrical consid-
erations, as illustrated in Fig. 4.9. That figure shows a curved line, i.e., a

Fig. 4.9. The C curve is a one-manifold embedded in a two-dimensional space (the panels show the two-dimensional data space and the one-dimensional embedding; the legend distinguishes the manifold, the geodesic distance, and the Euclidean distance). Intuitively, it is expected that the reduction to one dimension unrolls the curve and makes it straight. With this idea in mind, the Euclidean distance cannot be preserved easily, except for very small distances, on so small a scale that the manifold is nearly linear. For larger distances, difficulties arise. The best example consists of measuring the Euclidean distance between the two endpoints of the curve. In the two-dimensional space, this distance is short because the manifold is folded on itself. In the one-dimensional embedding space, the same length proves much smaller than the new distance measured between the endpoints of the unrolled curve. In contrast with the Euclidean distance, the geodesic distance is measured along the manifold. As a consequence, it does not depend as much as the Euclidean metric on a particular embedding of the manifold. In the case of the C curve, the geodesic distance remains the same in both two- and one-dimensional spaces.

one-dimensional manifold embedded in a two-dimensional space. In order to


reduce the dimensionality to 1, it is intuitively expected that the curve has
to be unrolled. Assuming that very short Euclidean distances are preserved,
this means, as a counterpart, that longer Euclidean distances are consider-
ably stretched. For example, the distance between the two endpoints of the
C curve in Fig. 4.9 would be multiplied by more than three! Intuitively, such
an issue could be addressed by measuring the distance along the manifold
and not through the embedding space, like the Euclidean distance does. With
such a metric, the distance depends less on the particular embedding of the
manifold. In simpler words, the curvature of the manifold does not modify
(or hardly modifies) the value of the distance. The distance along a manifold
is usually called the geodesic distance, by analogy with curves drawn on the
surface of Earth. The geodesic distance can also be interpreted as a railroad
distance: trains are forced to follow the track (the manifold). On the other
hand, the Euclidean distances may follow straight shortcuts, as a plane flies
regardless of roads and tracks.
Formally, the geodesic distance is rather complicated to compute starting
with the analytical expression of a manifold. For example, in the case of a
one-dimensional manifold M, which depends on a single latent variable x,
like the C curve, parametric equations can be written as

R → M ⊂ RD : x → m(x) = [m1 (x), . . . , mD (x)]T (4.94)

and may help to compute the manifold distance as an arc length. Indeed, the
arc length l from point y(i) = m(x(i)) to point y(j) = m(x(j)) is computed
as the integral

$$l = \int_{y(i)}^{y(j)} dl = \int_{y(i)}^{y(j)} \sqrt{\sum_{k=1}^{D} dm_k^2} = \int_{x(i)}^{x(j)} \left\| J_x m(x) \right\| dx \; , \qquad (4.95)$$

where Jx m(x) designates the Jacobian matrix of m with respect to the pa-
rameter x. As x is scalar, the Jacobian matrix reduces to a column vector.
Unfortunately, the situation gets worse for multidimensional manifolds,
which involve more than one parameter:

$$m : \mathbb{R}^P \to \mathcal{M} \subset \mathbb{R}^D , \quad x = [x_1, \ldots, x_P]^T \mapsto y = m(x) = [m_1(x), \ldots, m_D(x)]^T . \qquad (4.96)$$
In this case, several different paths may go from point y(i) to point y(j).
Each of these paths is in fact a one-dimensional submanifold P of M with
parametric equations

p : R → P ⊂ M ⊂ RP , z → x = p(z) = [p1 (z), . . . , pP (z)]T . (4.97)

The integral then has to be minimized over all possible paths that connect
the starting and ending points:
$$l = \min_{p(z)} \int_{z(i)}^{z(j)} \left\| J_z m(p(z)) \right\| dz \; . \qquad (4.98)$$

In practice, such a minimization is intractable since it is a functional mini-


mization. Anyway, the parametric equations of M (and P) are unknown; only
some (noisy) points of M are available.
Although this lack of analytical information is usually considered as a
drawback, this situation simplifies the problem of the arc length minimization
because it becomes discrete. Consequently, the problem has to be reformu-
lated. Instead of minimizing an arc length between two points on a manifold,
the task becomes the following: minimize the length of a path (i.e., a broken
line) from point y(i) to point y(j) and passing through a certain number of
other points y(k), y(l), . . . of the manifold M. Obviously, the Euclidean dis-
tance is a trivial solution to this problem: the path directly connects y(i) to
y(j). However, the idea is that the path should be constrained to follow the
underlying manifold, at least approximately. Therefore, a path cannot jump
from one point of the manifold to any other one. Restrictions have to be set.
Intuitively, they are quite simple to establish. In order to obtain a good ap-
proximation of the true arc length, a fine discretization of the manifold is
needed. Thus, only the smallest jumps will be permitted. In practice, several
simple rules can achieve this goal. A first example is the K-rule, which allows
jumping from one point to the K closest other ones, K being some constant.
A second example is the ε-rule, which allows jumping from one point to all
other ones lying inside a ball with a predetermined radius ε. See Appendix E
for more details about these rules.
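For illustration purposes, both rules can be sketched in a few lines of Python/NumPy. The function names, the use of np.inf to mark absent edges, and the symmetrization step of the K-rule (which anticipates the undirectedness requirement discussed below) are implementation choices of ours, not part of any package cited in this chapter.

```python
import numpy as np

def knn_graph(Y, K):
    """K-rule: connect each point to its K nearest neighbors (Euclidean).

    Y is an N-by-D array of data points; the returned N-by-N array contains
    edge lengths, with np.inf marking absent edges."""
    N = Y.shape[0]
    dist = np.sqrt(np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1))
    W = np.full((N, N), np.inf)
    for i in range(N):
        # indices of the K closest points, excluding the point itself
        order = np.argsort(dist[i])
        neighbors = order[1:K + 1]
        W[i, neighbors] = dist[i, neighbors]
    # make the graph undirected: keep an edge if it exists in either direction
    W = np.minimum(W, W.T)
    np.fill_diagonal(W, 0.0)
    return W

def eps_graph(Y, eps):
    """epsilon-rule: connect each point to all points within radius eps."""
    dist = np.sqrt(np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1))
    W = np.where(dist <= eps, dist, np.inf)
    np.fill_diagonal(W, 0.0)
    return W
```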
Formally, the set of data points associated with the set of allowed jumps
constitutes a graph in the mathematical sense of the term. More precisely,
a graph is a pair G = (VN , E), where VN is a finite set of N vertices vi
(intuitively: the data points) and E is a set of edges (intuitively: the allowed
jumps). The edges are encoded as pairs of vertices. If edges are ordered pairs,
then the graph is said to be directed; otherwise it is undirected. In the present
case, the graph is vertex-labeled, i.e., there is a one-to-one correspondence
between the N vertices and the N data points y(i): label(vi ) = y(i). In order
to compute the length of paths, the graph has to be vertex-labeled as well
as edge-labeled, meaning in the present case that edges are given a length. As
the length is a numerical attribute, the graph is said to be (edge-)weighted. If
the edge length is its Euclidean length, then the graph is said to be Euclidean
and

$$\operatorname{label}((v_i, v_j)) = \left\| \operatorname{label}(v_i) - \operatorname{label}(v_j) \right\|_2 = \left\| y(i) - y(j) \right\|_2 \; . \qquad (4.99)$$

A path π in a graph G is an ordered subset of vertices [vi , vj , vk , . . .] such


that the edges (vi , vj ), (vj , vk ), . . . belong to E. In a Euclidean graph, it is
sometimes easier to write the same path π by the short-hand notation π =
[y(i), y(j), y(k), . . .]. The length of π is defined as the sum of the lengths of
its constituting edges:

length(π) = label ((vi , vj )) + label ((vj , vk )) + . . . (4.100)

In order to be considered as a distance, the path length must show the


properties of nonnegativity, symmetry, and triangular inequality (see Subsec-
tion 4.2.1). Non-negativity is trivially ensured since the path length is a sum
of Euclidean distances that already show this property. Symmetry is gained
when the graph is undirected. In the case of the K-rule, this means in practice
that if point y(j) belongs to the K closest ones from point y(i), then the edge
from y(i) to y(j) as well as the edge from y(j) to y(i) are added to E, even
if point y(i) does not belong to the K closest ones from y(j). The triangu-
lar inequality requires that, given points y(i) and y(j), the length computed
between them must follow the shortest path. If this condition is satisfied, then

$$\mathrm{length}([y(i), \ldots, y(j)]) \le \mathrm{length}([y(i), \ldots, y(k)]) + \mathrm{length}([y(k), \ldots, y(j)]) \qquad (4.101)$$
holds. Otherwise, the concatenated path $[y(i), \ldots, y(k), \ldots, y(j)]$ would be
shorter than the shortest path $[y(i), \ldots, y(j)]$, which is impossible by construction.
At this point, the only remaining problem is how to compute the short-
est paths in a weighted graph? Dijkstra [53] designed an algorithm for this
purpose. Actually, Dijkstra’s procedure solves the single-source shortest path
problem (SSSP). In other words, it computes the shortest paths between one
predetermined point (the source) and all other ones. The all-pairs shortest
path problem (APSP) is easily solved by repeatedly running Dijkstra’s proce-
dure for each possible source [210]. The only requirement of the procedure is
the nonnegativity of the edge labels. The result of the procedure is the length
of the shortest paths from the source to all other vertices. A representation of
the shortest paths can be computed as a direct byproduct. In graph theory,
the lengths of the shortest paths are traditionally called graph distances and
denoted
$$\delta(y(i), y(j)) = \delta(v_i, v_j) = \min_{\pi} \mathrm{length}(\pi) \; , \qquad (4.102)$$
where $\pi = [v_i, \ldots, v_j]$.
Dijkstra’s algorithm works as follows. The set of vertices V is divided into
two subsets such that VN = Vdone ∪ Vsort and ∅ = Vdone ∩ Vsort where
• Vdone contains the vertices for which the shortest path is known;
• Vsort contains the vertices for which the shortest path is either not known
at all or not completely known.
If vi is the source vertex, then the initial state of the algorithm is Vdone = ∅,
Vsort = VN , δ(vi , vi ) = 0, and δ(vi , vj ) = ∞ for all j ≠ i. After the initializa-
tion, the algorithm iteratively looks for the vertex vj with the shortest δ(vi , vj )
in Vsort , removes it from Vsort , and puts it in Vdone . The distance δ(vi , vj ) is
then definitely known. Next, all vertices vk connected to vj , i.e., such that
(vj , vk ) ∈ E, are analyzed: if δ(vi , vj ) + label((vj , vk )) < δ(vi , vk ), then the
value of δ(vi , vk ) is changed to δ(vi , vj ) + label((vj , vk )) and the candidate
shortest path from vi to vk becomes [vi , . . . , vj , vk ], where [vi , . . . , vj ] is the

shortest path from vi to vj . The algorithm stops when Vsort = ∅. If the graph
is not connected, i.e., if there are one or several pairs of vertices that cannot
be connected by any path, then some δ(vi , vj ) keep an infinite value.
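A minimal Python sketch of the procedure described above is given below. It uses a binary heap (the standard heapq module) as a priority queue for Vsort instead of a linear search, and assumes the graph is stored as an N-by-N array of edge lengths with np.inf marking absent edges, as in the earlier sketch; these choices are ours.

```python
import heapq
import numpy as np

def dijkstra(W, source):
    """Single-source shortest paths (SSSP) in a weighted graph.

    W is an N-by-N array of edge lengths with np.inf for absent edges;
    the function returns the graph distances delta(source, j) for all j."""
    N = W.shape[0]
    delta = np.full(N, np.inf)
    delta[source] = 0.0
    done = np.zeros(N, dtype=bool)        # V_done membership
    heap = [(0.0, source)]                # V_sort, ordered by tentative distance
    while heap:
        d_j, j = heapq.heappop(heap)
        if done[j]:
            continue                      # stale entry: j already moved to V_done
        done[j] = True                    # move v_j from V_sort to V_done
        for k in range(N):
            w = W[j, k]
            if np.isfinite(w) and d_j + w < delta[k]:
                delta[k] = d_j + w        # relax edge (v_j, v_k)
                heapq.heappush(heap, (delta[k], k))
    return delta

# the all-pairs problem (APSP) is solved by repeated calls:
# Delta = np.vstack([dijkstra(W, i) for i in range(W.shape[0])])
```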
Intuitively, the correctness of the algorithm is demonstrated by proving
that vertex vj in Vsort with shortest path π1 = [vi , . . . , vj ] of length δ(vi , vj )
can be admitted in Vdone . This is trivially true for vi just after the initialization.
In the general case, it may be assumed that an unexplored path going from the
source vi to vertex vj is shorter than the path found by Dijkstra’s algorithm.
Then this hypothetical “shorter-than-shortest” path may be written as π2 =
[vi , . . . , vt , vu , . . . , vj ], wherein π3 = [vi , . . . , vt ] is the maximal (always non-
empty) subpath belonging to Vdone and π4 = [vu , . . . , vj ] is the unexplored
part of π2 . Consequently, vu still lies in Vsort and δ(vi , vu ) has been given its
value when vertex vt was removed from Vsort . Thus, the inequalities

length(π3 ) < length(π2 ) < length(π1 ) (4.103)

hold but contradict the fact that the first path π1 was chosen as the one with
the shortest δ(vi , vj ) in Vsort .
At this point, it remains to prove that the graph distance approximates
the true geodesic distance in an appropriate way. Visually, it seems to be
true for the C curve, as illustrated in Fig. 4.10. Formal demonstrations are
provided in [19], but they are rather complicated and not reproduced here.
The intuitive idea consists of relating the natural Riemannian structure of a
smooth manifold M to the graph distance computed between points of M.
Bounds are more easily computed for graphs constructed with -balls than
with K-ary neighborhoods. Unfortunately, these bounds rely on assumptions
that are hard to meet with real data sets, especially for K-ary neighborhoods.
Moreover, the bounds are computed in the ideal case where no noise pollutes
the data.

4.3.2 Isomap

Isomap [179, 180] is the simplest NLDR method that uses the graph distance
as an approximation of the geodesic distance. Actually, the version of Isomap
described in [180] is closely related to Torgerson’s classical metric MDS (see
Subsection 4.2.2). The only difference between the two methods is the metric
used to measure the pairwise distances: Isomap uses graph distances instead
of Euclidean ones in the algebraical procedure of metric MDS. Just by intro-
ducing the graph distance, the purely linear metric MDS becomes a nonlinear
method. Nevertheless, it is important to recall that the nonlinear capabili-
ties of Isomap are exclusively brought by the graph distance. By comparison,
methods like Sammon’s NLM (Subsection 4.2.3) and Hérault’s CCA (Subsec-
tion 4.2.4) are built on inherently nonlinear models of data, independently
of the chosen metric. However, Isomap keeps the advantage of reducing the
dimensionality with a simple, fast, and direct algebraical manipulation. On

Fig. 4.10. The same C curve as in Fig. 4.9 (the panels show the two-dimensional data space and the one-dimensional embedding; the legend distinguishes graph edges, manifold points, and the graph, geodesic, and Euclidean distances). In this case, the manifold is not available: only some points are known. In order to approximate the geodesic distance, vertices are associated with the points and a graph is built. The graph distance can be measured by summing the edges of the graph along the shortest path between both ends of the curve. That shortest path can be computed by Dijkstra's algorithm. If the number of points is large enough, the graph distance gives a good approximation of the true geodesic distance.

the other hand, Sammon’s NLM and Hérault’s CCA, which use specific opti-
mization procedures like gradient descent, are much slower.
Although Isomap greatly improves metric MDS, it inherits one of its ma-
jor shortcomings: a very rigid model. Indeed, the model of metric MDS is
restricted to the projection onto a hyperplane. Moreover, the analytical de-
velopments behind metric MDS rely on the particular form of the Euclidean
distance. In other words, this means that the matrix of pairwise distances D
handled by metric MDS must ideally contain Euclidean distances measured
between points lying on a hyperplane. Hence, if the distances in D are not
Euclidean, it is implicitly assumed that the replacement metric yields dis-
tances that are equal to Euclidean distances measured in some transformed
hyperplane. Otherwise, the conditions stated in the metric MDS model are
no longer fulfilled.
In the case of Isomap, the Euclidean distances are replaced with the graph
distances when computing the matrix D. For a theoretical analysis, however,

it is easier to assume that the graph distances perfectly approximate the


true geodesic distances. Therefore, data agree with the model of Isomap if
the pairwise geodesic distances computed between points of the P -manifold
to be embedded can be mapped to pairwise Euclidean distances measured
in a P -dimensional Euclidean space. A manifold that fulfills that condition
is called a developable P -manifold [111, 31]. (The terminology “Euclidean
manifold” in [120, 115] seems to be misleading and encompasses a broader
class of manifolds.) In this book, a more restrictive definition is proposed:
a P -manifold is developable if and only if a diffeomorphism between the P -
manifold and a convex subset of the P -dimensional Euclidean space exists
in such a way that geodesic distances are mapped to Euclidean distances by
the identity mapping. For the sake of simplicity, it is useful to assume that a
developable manifold has parametric equations of the form y = m(x) and that
the isometry $\delta(y(i), y(j)) = \| x(i) - x(j) \|_2$ holds. This assumption is justified
by the fact that a developable manifold can always be parameterized in that
form and, additionally, that the purpose of Isomap is precisely to retrieve the
latent vector x starting from the knowledge of y.
Developable manifolds have nice properties. For example, a straightfor-
ward consequence of the isometry between geodesic distances in a developable
P -manifold and Euclidean distances in RP is the following: in a developable
manifold, the Pythagorean theorem is valid for geodesic distances. In practice
this means that the geodesic distance between points y(i) = m(x(i)) and
y(j) = m(x(j)) can be computed as follows:

$$\begin{aligned}
\delta^2(y(i), y(j)) &= \delta^2(m(x(i)), m(x(j))) \\
&= \left\| x(i) - x(j) \right\|_2^2 \\
&= \sum_{p=1}^{P} \left( x_p(i) - x_p(j) \right)^2 \\
&= \sum_{p=1}^{P} \left\| x(i) - [x_1(i), \ldots, x_p(j), \ldots, x_P(i)]^T \right\|_2^2 \\
&= \sum_{p=1}^{P} \delta^2\!\left( m(x(i)), m([x_1(i), \ldots, x_p(j), \ldots, x_P(i)]^T) \right) \; .
\end{aligned}$$

Therefore, in a developable manifold, geodesic distances can be computed


componentwise, either in the latent Euclidean space of the manifold space or
in the D-dimensional embedding space.
Another very nice property of developable manifolds is that the computa-
tion of geodesic distances does not require a minimization as for other more
generic manifolds. Indeed, it is easy to see that

$$\begin{aligned}
\delta(y(i), y(j)) &= \left\| x(i) - x(j) \right\|_2 \\
&= \left\| x(i) - (x(i) + \alpha(x(j) - x(i))) \right\|_2 + \left\| (x(i) + \alpha(x(j) - x(i))) - x(j) \right\|_2 \\
&= \delta\!\left( m(x(i)), m(x(i) + \alpha(x(j) - x(i))) \right) \\
&\quad + \delta\!\left( m(x(i) + \alpha(x(j) - x(i))), m(x(j)) \right) \; , \qquad (4.104)
\end{aligned}$$

where α is a real number between 0 and 1. These equalities simply demonstrate


that the shortest geodesic between y(i) and y(j) is the image by m of the
line segment going from x(i) to x(j): all points m(x(i) + α(x(j) − x(i))) on
that segment must also lie on the shortest path. Therefore, in the case of a
developable manifold, the path p(z) in Eq. (4.97) can be written as

p : [0, 1] ⊂ R → RP , z → x = p(z) = x(i) + z(x(j) − x(i)) , (4.105)

and the minimization in Eq. (4.98) may be dropped.


With the last property in mind (Eq. (4.104)) and knowing that p(z) =
x(i) + z(x(j) − x(i)), we find that

$$\begin{aligned}
\left\| x(i) - x(j) \right\|_2 &= \delta(y(i), y(j)) \\
\int_0^1 \left\| J_z p(z) \right\|_2 dz &= \int_0^1 \left\| J_z m(p(z)) \right\|_2 dz \\
&= \int_0^1 \left\| J_{p(z)} m(p(z)) \, J_z p(z) \right\|_2 dz \; . \qquad (4.106)
\end{aligned}$$

Therefore, the equality $\left\| J_z p(z) \right\|_2 = \left\| J_{p(z)} m(p(z)) \, J_z p(z) \right\|_2$ holds. This
means that the Jacobian of a developable manifold must be a D-by-P matrix
whose columns are orthogonal vectors with unit norm (a similar reasoning is
developed in [209]). Otherwise, the norm of Jz p(z) cannot be preserved. More
precisely, the Jacobian matrix can be written in a generic way as

Jx m(x) = QV(x) , (4.107)

where Q is a constant orthonormal matrix (a rotation matrix in the D-


dimensional space) and V(x) a D-by-P matrix with unit-norm columns and
only one nonzero entry per row. The last requirement ensures that the columns
of V(x) are always orthogonal, independently of the value of x.
Because of the particular form of its Jacobian matrix, a developable P -
manifold embedded in a D-dimensional space can always be written with the
following “canonical” parametric equations:
$$y = Q\, f(x) = Q \begin{bmatrix} f_1(x_{1 \le p \le P}) \\ \vdots \\ f_d(x_{1 \le p \le P}) \\ \vdots \\ f_D(x_{1 \le p \le P}) \end{bmatrix} \; , \qquad (4.108)$$

where Q is the same as above, Jx f (x) = V(x), and f1 , . . . , fD are constant,


linear, or nonlinear continuous functions from R to R, with x1≤p≤P standing
for one of the latent variables [120]. Hence, if Q is omitted, a manifold is devel-
opable when the parametric equation of each coordinate in the D-dimensional
space depends on at most a single latent variable xk .
Of course, the above development states the conditions that are neces-
sary from a generative point of view. This means that (i) a manifold whose
parametric equations fulfill those conditions is developable and (ii) the la-
tent variable can be retrieved. However, within the framework of nonlinear
dimensionality reduction, generative models are seldom used. In practice, the
main question consists of determining when a given manifold is developable,
without regards to the actual parametric equations that generate it. In other
words, the actual latent variables do not matter, but it must be checked that
parametric equations fulfilling the above conditions could exist. If the answer
is yes, then data respect the model of Isomap and the “canonical” latent
variables could be retrieved. From this non-generative point of view, some
conditions may be relaxed in Eq. (4.108): the columns of Q no longer need to
be orthonormal and the sum

$$\sum_{p,d=1}^{P,D} \left( J_x f(x) \right)_{d,p}^2 = \sum_{p,d=1}^{P,D} v_{d,p}^2(x) = \sum_{p,d=1}^{P,D} \left( \frac{\partial f_d(x_{1 \le p \le P})}{\partial x_p} \right)^2$$

may be different from P . Only the conditions on the functions fd (x1≤p≤P )


remain important, in order to obtain a Jacobian matrix with orthogonal
columns. This milder set of conditions allows one to recover the latent vari-
ables up to a scaling.
Visually, in a three-dimensional space, a manifold is developable if it looks
like a curved sheet of paper. Yet simple manifolds like (a piece of) a hollow
sphere are not developable. Moreover, no holes may exist on the sheet (due
to the convexity of the latent space); an example that does not fulfill this
condition is studied in Subsection 6.1.1.

Embedding of data set

Isomap follows the same procedure as metric MDS (Subsection 4.2.2); the only
difference is the metric in the data space, which is the graph distance. In order
to compute the latter, data must be available as coordinates stored in matrix
Y as usual. Figure 4.11 shows a simple procedure that implements Isomap. It
is noteworthy that matrix S is not guaranteed to be positive semidefinite after
double centering [78], whereas this property holds in the case of classical metric
MDS. This comes from the fact that graph distances merely approximate the
true geodesic distances. Nevertheless, if the approximation is good (see more
details in [19]), none or only a few eigenvalues of low magnitude should be
negative after double centering. Notice, however, that care must be taken if
the eigenvalues are used for estimating the intrinsic dimensionality, especially

1. Build a graph with either the K-rule or the ε-rule.
2. Weight the graph by labeling each edge with its Euclidean length.
3. Compute all pairwise graph distances with Dijkstra's algorithm, square them, and store them in matrix D.
4. Convert the matrix of distances D into a Gram matrix S by double centering.
5. Once the Gram matrix is known, compute its spectral decomposition S = UΛU^T.
6. A P-dimensional representation of Y is obtained by computing the product X̂ = I_{P×N} Λ^{1/2} U^T.

Fig. 4.11. Isomap algorithm.

if the goal is to normalize them in the same way as in Subsection 2.4.3. In


that case, it is better to compute “residual variances”, as advised in [180]:

$$\sigma_P^2 = 1 - r_{ij}^2\!\left( d_{\hat{x}}(i,j), \delta_y(i,j) \right) \; , \qquad (4.109)$$

where $r_{ij}$ denotes the correlation coefficient computed over the indices $i$ and $j$. When plot-
ting the evolution of σP2 w.r.t. P , the ideal dimensionality is the abscissa of
the “curve elbow”, i.e., the lowest value of P such that σP2 is close enough to
zero and does not significantly decrease anymore. Another way to estimate
the right embedding dimensionality consists in computing the MDS objective
function w.r.t. P :

$$E_{\mathrm{MDS}} = \sum_{i,j=1}^{N} \left( s_y(i,j) - \hat{x}(i) \cdot \hat{x}(j) \right)^2 \; , \qquad (4.110)$$

where sy (i, j) is computed from the matrix of graph distances after double
centering and x̂(i) is the ith column of X̂ = IP ×N Λ1/2 UT . Here also an elbow
indicates the right dimensionality.
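A possible implementation of the residual variance of Eq. (4.109) is sketched below in Python/NumPy; the function name and the data layout (graph distances and embedding stored as arrays) are assumptions of ours.

```python
import numpy as np

def residual_variance(Delta, X_hat):
    """Residual variance of Eq. (4.109): one minus the squared correlation
    between the graph distances Delta (N-by-N) and the pairwise Euclidean
    distances in the P-dimensional embedding X_hat (N-by-P)."""
    N = Delta.shape[0]
    D_emb = np.sqrt(np.sum((X_hat[:, None, :] - X_hat[None, :, :]) ** 2, axis=-1))
    iu = np.triu_indices(N, k=1)          # use each pair (i, j) only once
    r = np.corrcoef(Delta[iu], D_emb[iu])[0, 1]
    return 1.0 - r ** 2

# plotting residual_variance for P = 1, 2, 3, ... and looking for the elbow
# gives an estimate of the intrinsic dimensionality, as advised in [180]
```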
A MATLAB® package containing the above procedure is available at
http://isomap.stanford.edu/. A C++ implementation can also be down-
loaded from http://www.ucl.ac.be/mlg/. The parameters of Isomap are the
embedding dimensionality P and either the number of neighbors K or the ra-
dius ε, depending on the rule chosen to build the data graph. Space complexity
of Isomap is O(N²), reflecting the amount of memory required to store the
pairwise geodesic distances. Time complexity is mainly determined by the
computation of the graph distances; using Dijkstra's algorithm, this leads to
O(N² log N). The EVD in the MDS-like step is generally fast when using
dedicated libraries or MATLAB®.

Embedding of test set

Like Sammon’s NLM, Isomap is not provided with an interpolation procedure


for test points. From the point of view of statistical data analysis, the gen-
eralization to new points is generally useless: data are gathered at once and
processed offline.
Nevertheless, it is not difficult to adapt the interpolation procedure of
metric MDS described in Subsection 4.2.2. For this purpose, each test point
y is temporarily “grafted” on the graph, by connecting it to the K closest
data points or to all data points lying in an -ball centered on the test point.
Then the geodesic distances from the test point to all data points can be
computed by running Dijkstra’s algorithm with the test point as source vertex;
the obtained distances are stored in the column vector δ = [δ(y, y(i))]1≤i≤N .
A more efficient way to obtain δ consists of taking advantage of the graph
distances that have already been computed in the learning phase between the
neighbors of the test point y and all other data points [113]. More precisely,
it can be written that

$$\delta(y, y(i)) = \min_{j \in \mathcal{N}(y)} \left( d(y, y(j)) + \delta(y(j), y(i)) \right) \; , \qquad (4.111)$$

where the set N (y) contains the indices of the neighbors of y.


Next, the distances in δ are transformed into scalar products in the column
vector s by double centering. Finally, the low-dimensional representation of y
is computed as the product x̂ = IP ×N Λ−1/2 UT s. More details can be found
in [16].
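The interpolation just outlined can be sketched as follows; the Gower-style double centering of the new column of squared distances is our reconstruction of the double-centering step mentioned above, and the function name is ours.

```python
import numpy as np

def isomap_out_of_sample(delta_new, Delta, X_hat, eigval, P):
    """Embed one test point from its graph distances delta_new (length N) to
    the training points, given the training graph distances Delta and the
    training embedding X_hat with its eigenvalues (see the previous sketch)."""
    d2 = delta_new ** 2
    D2 = Delta ** 2
    # double centering of the new column of squared distances, reproducing
    # the centering applied to the training Gram matrix
    s = -0.5 * (d2 - D2.mean(axis=0) - d2.mean() + D2.mean())
    lam = eigval[:P]                  # assumed strictly positive for the retained axes
    # x_hat = Lambda^{-1/2} U^T s, rewritten with X_hat = U Lambda^{1/2}
    return (X_hat.T @ s) / lam
```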

Example

Figure 4.12 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. The use of geodesic distances instead of


Fig. 4.12. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by Isomap.

Euclidean ones allows Isomap to perform much better than metric MDS (see
Fig. 4.2). In the case of the two benchmark manifolds, the graphs used for
approximating the geodesic distances are built with the ε-rule; the value of
ε is somewhat larger than the distance between the three-dimensional points
of the open box data set shown in Fig. 1.4. This yields the regular graphs
displayed in the figure. As a direct consequence, the graph distances are com-
puted in the same way as city-block distances (see Subsection 4.2.1), i.e., by
summing the lengths of perpendicular segments.
With this method, the Swiss roll is almost perfectly unrolled (corners of the
rolled-up rectangle seem to be stretched outward). This result is not surprising
since the Swiss roll is a developable manifold. The first six eigenvalues confirm
that two dimensions suffice to embed the Swiss roll with Isomap:

[λn ]1≤n≤N = [1560.1, 494.7, 112.2, 70.1, 43.5, 42.4, . . .] . (4.112)

The first two eigenvalues clearly dominate the others. By the way, it must
be remarked that in contrast to classical metric MDS, the eigenvalues λn
with n > D do not vanish completely (the last ones can even be negative, as
mentioned above). This phenomenon is due to the fact that graph distances
only approximate the true geodesic ones.
On the other hand, the open box is not a developable manifold and Isomap
does not embed it in a satisfying way. The first six eigenvalues found by Isomap
are
[λn ]1≤n≤N = [1099.6, 613.1, 413.0, 95.8, 71.4, 66.8, . . .] . (4.113)
As can be seen, the first three eigenvalues dominate the others, just as they
do with metric MDS. Hence, like MDS, Isomap does not succeed in detecting
that the intrinsic dimensionality of the box is two. Visually, two faces of the
open box are still superposed: neighborhood relationships between data points
on these faces are not correctly rendered.

Classification

Isomap is an offline batch method, working with an exact algebraical op-


timization. As Isomap operates like metric MDS, by decomposing a Gram
matrix into eigenvalues and eigenvectors, it is often qualified to be a spectral
method [31, 12]. In the literature, Isomap is described without any prepro-
cessing of data such as vector quantization (see, however, [120]). Instead, a
variant of Isomap (see Subsection 4.3.2 ahead) works by selecting a random
subset of the available data points, which are called anchors or landmarks.
Isomap relies on a nonlinear model. Actually, if the Euclidean distance
can be considered as a “linear” metric, then the ability of Isomap to embed
nonlinear manifolds results only from the use of the graph distance; other parts
of the method, such as the underlying model of the optimization procedure,
stem from classical metric MDS and remain purely linear. Hence, the data
model of Isomap is hybrid: the approximation of geodesic distances by graph

distances is discrete whereas the subsequent MDS-like step can be considered


to be continuous.
Since Isomap shares the same model type and optimization procedure as
PCA and metric MDS, it also inherits its integrated estimation of the intrinsic
dimensionality. The latter can be evaluated just by looking at the eigenvalues
of D. (In [179, 180] and in the Stanford implementation of Isomap, eigenvalues
are transformed into so-called residual variances, which are defined as one
minus the correlation coefficient between the geodesic distances in the initial
space and the Euclidean distances in a P -dimensional final space.)
Just as for PCA and MDS, embeddings in spaces of increasing dimension-
ality can be obtained at once, in an incremental way.

Advantages and drawbacks

By comparison with PCA and metric MDS, Isomap is much more powerful.
Whereas the generative model of PCA and metric MDS is designed for linear
submanifolds only, Isomap can handle a much wider class of manifolds: the
developable manifolds, which may be nonlinear. For such manifolds, Isomap
reduces the dimensionality without any loss.
Unfortunately, the class of developable manifolds is far from including all
possible manifolds. When Isomap is used with a nondevelopable manifold, it
suffers from the same limitations as PCA or metric MDS applied to a nonlinear
manifold.
From a computational point of view, Isomap shares the same advantages
and drawbacks as PCA and metric MDS. It is simple, works with simple
algebraic operations, and is guaranteed to find the global optimum of its error
function in closed form.
In summary, Isomap extends metric MDS in a very elegant way. However,
the data model of Isomap, which relies on developable manifolds, still remains
too rigid. Indeed, when the manifold to be embedded is not developable,
Isomap yields disappointing results. In this case, the guarantee of determining
a global optimum does not really matter, since actually the model and its
associated error function are not appropriate anymore.
Another problem encountered when running Isomap is the practical com-
putation of the geodesic distances. The approximations given by the graph
distances may be very rough, and their quality depends on both the data
(number of points, noise) and the method parameters (K or  in the graph-
building rules). Badly chosen values for the latter parameters may totally
jeopardize the quality of the dimensionality reduction, as will be illustrated
in Section 6.1.

Variants

When the data set becomes too large, Isomap authors [180] advise running
the method on a subset of the available data points. Instead of performing a

vector quantization like for CCA, they simply select points at random in the
available data. Doing this presents some drawbacks that appear particularly
critical, especially because Isomap uses the graph distance (see examples in
Section 6.1).
A slightly different version of Isomap also exists (see [180]) and can be seen
as a compromise between the normal version of Isomap and the economical
version described just above. Instead of performing Isomap on a randomly
chosen subset of data, Isomap is run with all points but only on a subset
of all distances. For this purpose, a subset of the data points is chosen and
only the geodesic distances from these points to all other ones are computed.
These particular points are called anchors or landmarks. In summary, the
normal or “full” version of Isomap uses the N available points and works with
an N -by-N distance matrix. The economical or “light” version of Isomap
uses a subset of M < N points, yielding an M-by-M distance matrix. The
intermediate version uses M < N anchors and works with a rectangular M -
by-N distance matrix. Obviously, an adapted MDS procedure is required to
find the embedding (see [180] for more details) in the last version.
An online version of Isomap is detailed in [113]. Procedures to update the
neighborhood graph and the corresponding graph distances when points are
removed from or added to the data set are given, along with an online (but
approximate) update rule of the embedding.
Finally, it is noteworthy that the first version of Isomap [179] is quite
different from the current one. It relies on data resampling (graph vertices
are a subset of the whole data set) and the graph is built with a rule inspired
from topology-representing networks [136] (see also Appendix E). Next, graph
distances are computed with Floyd's algorithm (instead of Dijkstra's one, which
is more efficient), and the embedding is obtained with nonmetric MDS instead
of classical metric MDS. Hence this previous version is much closer to geodesic
NLM than the current one is.

4.3.3 Geodesic NLM

Sammon’s nonlinear mapping (NLM) is mostly used with the Euclidean dis-
tance, in the data space as well as in the embedding space. Nevertheless, as
mentioned in Subsection 4.2.3, nothing forbids the user to choose another
metric, at least in the data space. Indeed, in the embedding space, the simple
and differentiable formula of the Euclidean distance helps to deduce a not-too-
complicated update rule for the optimization of the stress function. So, why
not create a variant of NLM that uses the graph distance in the data space?
Isomap (see Subsection 4.3.2) and CDA (see Subsection 4.3.4) follow the
same idea by modifying, respectively, the metric MDS (see Subsection 4.2.2)
and CCA (see Subsection 4.2.4). Strangely enough, very few references to
such a variant of NLM can be found in the literature (see, however, [117, 150,
58]). NLM using geodesic distances is here named GNLM (geodesic NLM),

according to [117, 121, 58], although the method described in [58] is more
related to CDA.

Embedding of data set

The embedding of the data set simply follows the procedure indicated in
Subsection 4.2.3. The only difference regards the distance in the data space,
which is the graph distance introduced in Subsection 4.3.1. Hence, Sammon’s
stress can be rewritten as

$$E_{\mathrm{GNLM}} = \frac{1}{c} \sum_{\substack{i=1 \\ i<j}}^{N} \frac{\left( \delta_y(i,j) - d_x(i,j) \right)^2}{\delta_y(i,j)} \; , \qquad (4.114)$$

where
• δy (i, j) is the graph distance between the ith and jth points in the D-
dimensional data space,
• dx (i, j) is the Euclidean distance between the ith and jth points in the
P -dimensional latent space.
• the normalizing constant c is defined as


$$c = \sum_{\substack{i=1 \\ i<j}}^{N} \delta_y(i,j) \; . \qquad (4.115)$$

In order to compute the graph distance, a graph structure must be attached


to the initial data set. This can be done by using the same simple rules as
in Isomap: each data point y(i) may be connected either to its K closest
neighbors or to all points lying inside an -ball centered on y(i). Other ways
to build the graph are described in Appendix E.
According to Eq. (C.12), the quasi-Newton update rule that iteratively
determines the parameters $x_k(i)$ of $E_{\mathrm{GNLM}}$ can be written as

$$x_k(i) \leftarrow x_k(i) - \alpha \, \frac{\dfrac{\partial E_{\mathrm{GNLM}}}{\partial x_k(i)}}{\left| \dfrac{\partial^2 E_{\mathrm{GNLM}}}{\partial x_k(i)^2} \right|} \; , \qquad (4.116)$$

where the absolute value is used for distinguishing the minima from the maxima.
As for the classical NLM, the step size α is usually set between 0.3 and 0.4.
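For illustration, one iteration of this quasi-Newton scheme can be sketched as follows, using the standard Sammon expressions for the first derivative and the diagonal second derivative, with the graph distances δy(i, j) in place of the Euclidean data-space distances. This is a sketch of ours, not the implementation referred to below.

```python
import numpy as np

def gnlm_iteration(X, Delta, alpha=0.3, eps=1e-9):
    """One quasi-Newton (Sammon) iteration of Eq. (4.116), with the graph
    distances Delta (N-by-N) in place of the data-space distances.
    X is the current N-by-P embedding; a new embedding is returned."""
    N, P = X.shape
    c = np.sum(np.triu(Delta, k=1))                 # normalizing constant, Eq. (4.115)
    diff = X[:, None, :] - X[None, :, :]            # pairwise differences x(i) - x(j)
    d = np.sqrt(np.sum(diff ** 2, axis=-1))         # embedding-space distances d_x(i, j)
    X_new = X.copy()
    for i in range(N):
        for k in range(P):
            num, den = 0.0, 0.0
            for j in range(N):
                if j == i:
                    continue
                dij = d[i, j] + eps
                deltaij = Delta[i, j] + eps
                diffk = diff[i, j, k]
                num += (deltaij - dij) / (dij * deltaij) * diffk
                den += (1.0 / (dij * deltaij)) * (
                    (deltaij - dij)
                    - (diffk ** 2 / dij) * (1.0 + (deltaij - dij) / dij)
                )
            grad = (-2.0 / c) * num                 # first derivative of the stress
            hess = (-2.0 / c) * den                 # diagonal second derivative
            X_new[i, k] -= alpha * grad / (abs(hess) + eps)   # Eq. (4.116)
    return X_new
```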
No MATLAB® package is available for the geodesic NLM. However, it
is possible to build one quite easily using part of the Isomap archive (http:
//isomap.stanford.edu/), combined with Sammon's NLM provided in the
SOM toolbox (http://www.cis.hut.fi/projects/somtoolbox/). Functions
and libraries taken from Isomap compute the pairwise graph distances, whereas
the mapping is achieved by the NLM function from the SOM toolbox. A

C++ implementation of geodesic NLM can also be downloaded from
http://www.ucl.ac.be/mlg/. Using graph distances in Sammon's NLM requires
tuning additional parameters, namely those related to the construction of the
data graph before applying Dijkstra’s algorithm. These parameters are, for
instance, the number of neighbors K (for K-ary neighborhoods) or the radius
ε (for ε-ball neighborhoods). Space complexity of GNLM remains the same
as for Euclidean NLM. On the other hand, the time complexity must take
into account the computation of all pairwise graph distances (O(N² log N)).
Depending on the data size and the number of iterations, this additional step
may eventually become the most time-consuming one.

Embedding of test set

As with the original NLM using Euclidean distance, no generalization of the


mapping is possible with GNLM as such. Propositions mentioned in Subsec-
tion 4.2.3 may work. However, the use of the geodesic distance brings an
additional difficulty, as it does for Isomap. Indeed, the test points must be
“grafted” on the graph connecting the data points, in order to compute their
geodesic distances to all data points. Moreover, it must be recalled that graph
distances only approximate the true geodesic distances. For first-order neigh-
bors, the graph distances equal the Euclidean distances: on a small scale, this
is a good approximation. On the other hand, the graph distances to second-
order neighbors are computed as a sum of two Euclidean distances; depending
on the angle formed by the two segments, the graph distance can overestimate
the true geodesic distance. First- and second-order neighbors are exactly the
points that are essential to get a good embedding of the test points.

Example

Figure 4.13 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. These results are obtained with step size
α set to 0.3. Geodesic NLM performs much better than its Euclidean version
(see Fig. 4.4). The Swiss roll is perfectly unrolled, and only two faces of the
open box are superposed. The fact that geodesic distances are longer than the
corresponding Euclidean ones explains this improvement: their weight in Sam-
mon’s stress is lower and the method focuses on preserving short distances.
GNLM also provides better results than Isomap (see Subsection 4.3.2), al-
though the graph distances have been computed in exactly the same way. This
is due to the fact that GNLM can embed in a nonlinear way independently
from the distance used. In practice, this means that GNLM is expected to
perform better than Isomap when the manifold to embed is not developable.

Classification

The classification of GNLM exactly follows that of NLM in Subsection 4.2.3.




Fig. 4.13. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by GNLM.

Advantages and drawbacks

For a large part, the advantages and drawbacks of GNLM stay more or less
the same as for NLM. However, the use of the geodesic distance gives GNLM
a better ability to deal with heavily curved manifolds. If GNLM is given a
developable P -manifold, then it can reduce the dimensionality from D to P
with EGNLM ≈ 0 (the stress never vanishes completely since the graph distances
are only approximations). In the same situation, the original NLM could yield
a much larger value of ENLM and provide disappointing results.
As other methods using the geodesic distance, GNLM may suffer from
badly approximated graph distances. Additionally, the graph construction
necessary to get them requires adjusting one or several additional parame-
ters.

Variants

Enhancements regarding the optimization procedure of NLM are still appli-


cable to GNLM. See Subsection 4.2.3 for more details.

4.3.4 Curvilinear distance analysis

Curvilinear distance analysis [116, 117] (CDA in short) is a version of CCA


(see Subsection 4.2.4) that makes use, among other differences, of the graph
distance, in the same way as Isomap extends Torgerson’s classical metric MDS.
Despite their common use of the graph distance and their almost simul-
taneous publication, CDA and Isomap have been developed independently, in
very different contexts. Whereas Isomap aims at extending the application of
metric MDS to developable manifolds, keeping its algebraical simplicity, CDA
generalizes CCA and provides it with a metric that reinforces its resemblance
to an SOM. Considering CDA as a neural network, the prototypes after vec-
tor quantization being the neurons, the replacement of the Euclidean distance

with the graph distance may be seen as the addition of lateral connections
between the neurons. Visually, these lateral synapses are the weighted edges of
the graph, very similar to the lattice of an SOM, except that it is not regular.
Even if this may seem counterintuitive, the use of the graph distance actu-
ally removes lateral connections. In the original CCA, the use of the Euclidean
distance may be seen as a degenerate graph distance where all connections
are allowed by default. From this point of view, the graph distance, by cap-
turing information about the manifold shape, removes misleading shortcuts
that the Euclidean distance could follow. Thus, the graph is less dependent
on the manifold embedding than the Euclidean distance.
Because the graph distance is a sum of Euclidean distances, and by virtue
of the triangular inequality, graph distances are always longer than (or equal
to) the corresponding Euclidean distances. This property elegantly addresses
the main issue encountered with distance-preserving DR methods: because
manifolds embedded in high-dimensional spaces may be folded on themselves,
spatial distances like the Euclidean one are shorter than the corresponding
distances measured along the manifold. This issue is usually circumvented
by giving an increasing weight to very short distances. In this context, the
graph distance appears as a less blind solution: instead of “forgetting” the
badly estimated distance, the graph distance enhances the estimation. To
some extent, the graph distance “guesses” the value of the distance in the
embedding space before embedding. When the manifold to be embedded is
exactly Euclidean (see Subsection 4.3.2), the guess is perfect and the error
made on long distances remains negligible. In this case, the function Fλ that
is weighting the distances in ECCA becomes almost useless.
Nevertheless, the last statement is valid only for developable manifolds.
This is a bit too restrictive and for other manifolds, the weighting of distances
remains totally useful. It is noteworthy, however, that the use of the graph
distance makes Fλ much easier to parameterize.
For a heavily crumpled nondevelopable manifold, the choice of the neigh-
borhood width λ in CCA appears very critical. Too large, it forces CCA to
take into account long distances that have to be distorted anyway; the error
criterion ECCA no longer corresponds to what the user expects and its mini-
mization no longer makes sense. Too small, the neighborhood width impeaches
CCA to access sufficient knowledge about the manifold and the convergence
becomes very slow.
On the other hand, in CDA, the use of the graph distance yields a better
estimate of the distance after embedding. Distances are longer, and so may λ
be longer, too.

Embedding of data set

Formally, the error criterion of CDA is



$$E_{\mathrm{CDA}} = \sum_{i=1}^{N} \begin{cases} E^i_{\mathrm{CCA\text{-}u}} & \text{if } d_x > \delta_y \\[4pt] \dfrac{1}{4\, \delta_{i,j}^2}\, E^i_{\mathrm{CCA\text{-}p}} & \text{if } d_x < \delta_y \end{cases} \; , \qquad (4.117)$$

exactly as for CCA, except for δy , which is here the graph distance instead of
the Euclidean one.
Assuming $F_\lambda$ has a null derivative, the gradients of $E^i_{\mathrm{CCA\text{-}u}}$ and $E^i_{\mathrm{CCA\text{-}p}}$ are

$$\nabla_{x(j)} E^i_{\mathrm{CCA\text{-}u}} = 2 (\delta_y - d_x) F_\lambda(d_x) \frac{x(j) - x(i)}{d_x} \; , \qquad (4.118)$$
$$\nabla_{x(j)} E^i_{\mathrm{CCA\text{-}p}} = -4 (\delta_y^2 - d_x^2) F_\lambda(d_x) \left( x(j) - x(i) \right) \; , \qquad (4.119)$$
leading to the update rule:

$$x(j) \leftarrow x(j) + \alpha \begin{cases} \nabla_{x(j)} E^i_{\mathrm{CCA\text{-}u}} & \text{if } d_x(i,j) > \delta_y \\[4pt] \dfrac{1}{4\, \delta_{i,j}^2}\, \nabla_{x(j)} E^i_{\mathrm{CCA\text{-}p}} & \text{if } d_x(i,j) < \delta_y \end{cases} \; . \qquad (4.120)$$
Besides the use of the graph distance, another difference distinguishes CCA
from CDA. The latter is indeed implemented with a slightly different han-
dling of the weighting function Fλ (dx (i, j)). In the original publications about
CCA [46, 45, 48], the authors advise using a step function (see Eqs. (4.86)
and (4.87)), which is written as

$$F_\lambda(d_x(i,j)) = \begin{cases} 0 & \text{if } \lambda < d_x(i,j) \\ 1 & \text{if } \lambda \ge d_x(i,j) \end{cases} \qquad (4.121)$$
From a geometrical point of view, this function centers an open λ-ball around
each point x(i): the function equals one for all points x(j) inside the ball,
and zero otherwise. During the convergence of CCA, the so-called neighbor-
hood width λ, namely the radius of the ball, is usually decreasing according to a
schedule established by the user. But the important point to remark is that
the neighborhood width has a unique and common value for all points x(i).
This means that depending on the local distribution of x, the balls will include
different numbers of points. In sparse regions, even not so small values of λ
could yield empty balls for some points, which are then no longer updated.
This problematic situation motivates the replacement of the neighborhood
width λ with a neighborhood proportion π, with 0 ≤ π ≤ 1. The idea consists
of giving each point x(i) an individual neighborhood width λ(i) such that the
corresponding ball centered on x(i) contains exactly ⌈πN⌉ points. This can be
achieved easily and exactly by computing the ⌈πN⌉ closest neighbors of x(i).
However, as mentioned in Appendix F.2, this procedure is computationally
demanding and would considerably slow down CCA.
Instead of computing exactly the ⌈πN⌉ closest neighbors of x(i), it is thus
cheaper to approximate the radius λ(i) of the corresponding ball. Assuming
that π ≈ 1 when CDA starts, all the λ(i) could be initialized as follows:
$$\lambda(i) \leftarrow \max_j d_x(i,j) \; . \qquad (4.122)$$

Next, when CDA is running, each time the point x(i) is selected, all other
points x(j) lying inside the λ(i)-ball are updated radially. The number N (i)
of updated points gives the real proportion of neighbors, defined as π(i) =
N (i)/N . The real proportion π(i), once compared with the desired proportion,
helps to adjust λ(i). For example, this can be done with the simple update
rule
$$\lambda(i) \leftarrow \lambda(i) \sqrt[P]{\frac{\pi}{\pi(i)}} \; , \qquad (4.123)$$
which gives λ(i) its new value when point x(i) will be selected again by CDA.
In practice, the behavior of the update rule for λ(i) may be assessed when
CDA is running, by displaying the desired proportion π versus the effective
average one μi (π(i)). Typically, as π is continually decreasing, μi (π(i)) is
always a bit higher than desired. Experimentally, it has been shown that the
handling of Fλ (dx (i, j)) in CDA deals with outliers in a rather robust way
and avoids some useless tearings that are sometimes observed when using the
original CCA with a neighborhood width that is too small.
It is noteworthy that the use of an individual neighborhood width does
not complicate the parameter setting of CDA, since all neighborhood widths
are guided by a single proportion π.
Gathering all above-mentioned ideas leads to the procedure given in
Fig. 4.14. No MATLAB® package is available for CDA. However, it is pos-

1. Perform a vector quantization (see Appendix D) to reduce the size of the


data set, if needed.
2. Build a graph with a suitable rule (see Appendix E).
3. Compute all pairwise graph distances δ(y(i), y(j)) in the D-dimensional
data space.
4. Initialize the P -dimensional coordinates x(i) of all points, either randomly
or on the hyperplane spanned by the first principal components (after a
PCA).
5. Initialize the individual neighborhood widths λ(i) according to
Eq. (4.122). Let q be equal to one.
6. Give the learning rate α and the neighborhood proportion π their sched-
uled value for epoch number q.
7. Select a point x(i) in the data set and update all other ones according to
Eq. (4.120).
8. Update λ(i) according to Eq. (4.123).
9. Return to step 7 until all points have been selected exactly once during
the current epoch.
10. Increase q, and return to step 6 if convergence is not reached.

Fig. 4.14. Algorithm implementing curvilinear distance analysis.



sible to build one quite easily using part of the Isomap archive (http:
//isomap.stanford.edu/), combined with CCA provided in the SOM tool-
box (http://www.cis.hut.fi/projects/somtoolbox/). Functions and li-
braries taken from Isomap compute the pairwise graph distances, whereas the
mapping is achieved by the CCA function from the SOM toolbox. A C++ im-
plementation of CDA can be downloaded from http://www.ucl.ac.be/mlg/.
Like the geodesic version of Sammon’s NLM, CDA involves additional param-
eters related to the construction of a data graph before applying Dijkstra’s
algorithm to compute graph distances. These parameters are, for instance,
the number of neighbors K or the neighborhood radius . Space complexity of
CDA remains unchanged compared to CCA, whereas time complexity must
take into account the application of Dijkstra’s algorithm for each graph vertex
(O(N 2 log N )) before starting the iterative core procedure of CCA/CDA.

Embedding of test set

A linear piecewise interpolation can work efficiently if the data set is not too
noisy. The interpolation procedure described in [45] for CCA also works for
CDA, at least if the neighborhood width is set to a small value; this ensures
that Euclidean and graph distances hardly differ on that local scale.

Example

Figure 4.15 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5. The graph used to approximate the geodesic


Fig. 4.15. Two-dimensional embeddings of the ‘Swiss roll’ and ‘open box’ data sets
(Fig. 1.4), found by CDA.

distance and shown in the figure is the same as for Isomap and GNLM (ε-rule).
The main difference between the results of CCA (see Fig. 4.7) and CDA
regards the Swiss roll. The use of geodesic distances enables CDA to unroll
the Swiss roll and to embed it without superpositions and unnecessary tears.

Neighborhood relationships between nearby data points are respected ev-


erywhere, except along the tears in the open box. Considering the underlying
manifolds, CDA establishes a bijective mapping between their initial and final
embeddings.
In the case of the open box, the ability of CCA to deal with a nonlinear
manifold is already so good that the use of the graph distance does not improve
the result. More difficult examples can be found in Chapter 6.

Classification
Exactly like CCA, CDA is an offline (batch) method, with an optional vector
quantization as preprocessing. The data model is nonlinear and discrete; the
mapping is explicit. The method works with an approximate optimization
procedure. CDA does not include any estimator of the intrinsic dimensionality
of data and cannot build embeddings incrementally.

Advantages and drawbacks


CDA keeps all the advantages of CCA. In addition, the parameter tuning is
easier for CDA than for CCA. As already mentioned, this is due to the use
of the graph distance and to the different handling of the weighting function.
Consequently, fewer trials and errors are needed before getting good results.
Moreover, CDA converges faster than CCA for heavily crumpled manifolds
that are (nearly) developable. In that case, the minimization of ECDA becomes
trivial, whereas CCA may encounter difficulties as for any other nonlinear
manifold.

Variants
A variant of CDA is described in [58]. Although this method is named geodesic
nonlinear mapping, it is more closely related to CCA/CDA than to Sammon’s
NLM. Actually, this method uses the neural gas [135], a topology-representing
network (TRN [136]), instead of a more classical vector quantizer like K-
means. This choice enables the method to build the data graph in parallel
with the quantization. Next, graph distances are computed in the TRN. The
subsequent embedding procedure is driven with an objective function similar
to the one of CCA/CDA, except that F is an exponentially decaying function
of the neighborhood rank, exactly as in the neural gas. The resulting method
closely resembles Demartines and Hérault’s VQP [46, 45], though it remains
a batch procedure.

4.4 Other distances


The previous two sections deal with natural distance measures, in the sense
that mainly geometrical arguments motivate the use of both spatial and graph

distances. This section introduces NLDR methods that rely on less intuitive
ideas. The first one is kernel PCA, which is closely related to metric MDS and
other spectral methods. In this case, the methods directly stem from mathe-
matical considerations about kernel functions. These functions can transform
a matrix of pairwise distances in such a way that the result can still be inter-
preted as distances and processed using a spectral decomposition as in metric
MDS. Unfortunately, in spite of its elegance, kernel PCA performs rather
poorly in practice.
The second method, semi-definite embedding, relies on the same theory
but succeeds in combining it with a clear geometrical intuition. This different
point of view leads to a much more efficient method.

4.4.1 Kernel PCA

The name of kernel PCA [167] (KPCA) is quite misleading, since its approach
relates it more closely to classical metric MDS than to PCA [203]. Beyond
the equivalence between PCA and classical metric MDS, maybe this choice
can be justified by the fact that PCA is more widely known in the field where
KPCA has been developed.
Whereas the changes brought by Isomap to metric MDS were motivated by
geometrical consideration, KPCA extends the algebraical properties of MDS
to nonlinear manifolds, without regards to their geometrical meaning. The
inclusion of KPCA in this chapter about dimensionality reduction by distance-
preserving methods is only justified by its resemblance to Isomap: they both
generalize metric MDS to nonlinear manifolds in similar ways, although the
underlying ideas completely differ.

Embedding of data set

The first idea of KPCA consists of reformulating the PCA into its metric
MDS equivalent, or dual form. If, as usual, the centered data points y(i) are
stored in matrix Y, then PCA works with the sample covariance matrix Ĉyy ,
proportional to YYT . On the contrary, KPCA works as metric MDS, i.e.,
with the matrix of pairwise scalar products S = YT Y.
The second idea of KPCA is to “linearize” the underlying manifold M.
For this purpose, KPCA uses a mapping φ : M ⊂ RD → RQ , y → z = φ(y),
where Q may be any dimension, possibly higher than D or even infinite.
Actually, the exact analytical expression of the mapping φ is useless, as will
become clear below. As a unique hypothesis, KPCA assumes that the mapping
φ is such that the mapped data span a linear subspace of the Q-dimensional
space, with Q > D. Interestingly, KPCA thus starts by increasing the data
dimensionality!
Once the mapping φ has been chosen, pairwise scalar products are com-
puted for the mapped data and stored in the N -by-N matrix Φ:

Φ = [φ(y(i)) · φ(y(j))]1≤i,j≤N (4.124)


= [z(i) · z(j)]1≤i,j≤N (4.125)
= [sz (i, j)]1≤i,j≤N , (4.126)

where the shortened notation sz (i, j) stands for the scalar product between
the mapped points y(i) and y(j).
Next, according to the metric MDS procedure, the symmetric matrix Φ
has to be decomposed into eigenvalues and eigenvectors. However, this operation
will not yield the expected result unless Φ is positive semidefinite, i.e., when
the mapped data z(i) are centered. Of course, it is difficult to center z because
the mapping φ is unknown. Fortunately, however, centering can be achieved
in an implicit way by performing the double centering on Φ.
Denote by z′(i) = z(i) − c the centered counterparts of the mapped points z(i),
where c is some unknown constant, for instance the mean that has to be
subtracted from z to obtain z′. Assuming that z′ is centered but z is not, each
entry of Φ can be written in the general case as

sz(i, j) = (z′(i) + c) · (z′(j) + c)
         = z′(i) · z′(j) + z′(i) · c + c · z′(j) + c · c
         = sz′(i, j) + z′(i) · c + c · z′(j) + c · c ,   (4.127)

where sz′(i, j) = z′(i) · z′(j) denotes the centered scalar products. Then, denoting
μi the mean operator with respect to index i, the mean of the jth column of Φ is

μi(sz(i, j)) = μi(z′(i) · z′(j) + z′(i) · c + c · z′(j) + c · c)
             = μi(z′(i)) · z′(j) + μi(z′(i)) · c + c · z′(j) + c · c
             = 0 · z′(j) + 0 · c + c · z′(j) + c · c
             = c · z′(j) + c · c .   (4.128)

Similarly, by symmetry, the mean of the ith row of Φ is

μj(sz(i, j)) = z′(i) · c + c · c .   (4.129)

The mean of all entries of Φ is

μi,j(sz(i, j)) = μi,j(z′(i) · z′(j) + z′(i) · c + c · z′(j) + c · c)
              = μi(z′(i)) · μj(z′(j)) + μi(z′(i)) · c + c · μj(z′(j)) + c · c
              = 0 · 0 + 0 · c + c · 0 + c · c
              = c · c .   (4.130)

It is easily seen that the unknown terms in the right-hand side of Eq. (4.127) can
be obtained as the sum of Eqs. (4.128) and (4.129) minus Eq. (4.130). Hence,

sz′(i, j) = sz(i, j) − z′(i) · c − c · z′(j) − c · c   (4.131)
          = sz(i, j) − μi(sz(i, j)) − μj(sz(i, j)) + μi,j(sz(i, j)) .   (4.132)

Once the double centering has been performed, Φ can be decomposed into its
eigenvectors and eigenvalues:

Φ = UΛUT . (4.133)

As for metric MDS, the P eigenvectors (columns of U ) associated with the P


largest eigenvalues give the coordinates of all data points along the P principal
components in the Q-dimensional space:

X̂ = IP ×N Λ1/2 UT . (4.134)

The final result is thus a set of coordinates on a hyperplane in RQ . It is noteworthy
that, in contrast with metric MDS, the maximal number of strictly positive
eigenvalues is not bounded by min(N, D). Instead, the bound for KPCA is
min(N, Q), which often equals N since Q may be very high, depending
on the mapping φ.
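To make these operations concrete, here is a minimal NumPy sketch of the double centering (Eq. (4.132)) and of the subsequent spectral embedding (Eqs. (4.133)–(4.134)). The function names and the clipping of slightly negative eigenvalues are our own choices, not part of the original formulation.

```python
import numpy as np

def double_center(Phi):
    # Eq. (4.132): subtract row and column means, add back the grand mean
    row_mean = Phi.mean(axis=1, keepdims=True)   # mean of the ith row, i.e., mu_j(sz(i, j))
    col_mean = Phi.mean(axis=0, keepdims=True)   # mean of the jth column, i.e., mu_i(sz(i, j))
    grand_mean = Phi.mean()                      # mean of all entries, mu_{i,j}(sz(i, j))
    return Phi - row_mean - col_mean + grand_mean

def spectral_coordinates(Phi, P):
    # Eqs. (4.133)-(4.134): EVD of the centered matrix, keep the P leading axes
    K = double_center(Phi)
    eigval, eigvec = np.linalg.eigh(K)           # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:P]         # indices of the P largest eigenvalues
    lam = np.clip(eigval[order], 0.0, None)      # guard against tiny negative values
    return (eigvec[:, order] * np.sqrt(lam)).T   # P-by-N coordinate matrix X_hat
```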
At this point, it must be remarked that the mapping φ is used solely in
scalar products. Therefore, φ may stay unknown if a kernel function κ directly
gives the value of the scalar product starting from y(i) and y(j):

κ : RD ×RD → R, (y(i), y(j)) → κ(y(i), y(j)) = φ(y(i))·φ(y(j)) . (4.135)

Obviously, κ may not be any function: the reformulation as a scalar product


implies satisfying some conditions that have been extensively studied in [167].
More precisely, Mercer’s theorem of functional analysis (see, e.g., [39]) states
that
• if κ is a continuous kernel of a positive integral operator K, written as

K : L2 → L2 , f → Kf , (4.136)

with
(Kf)(v) = ∫ κ(u, v) f(u) du ,   (4.137)

• if K is positive definite, i.e.,



∫∫ f(u) κ(u, v) f(v) du dv > 0   if f ≠ 0 ,   (4.138)

then κ can be expanded into a series




κ(u, v) = ∑_{q=1}^{∞} λq φq(u) φq(v)   (4.139)

with positive coefficients λq (the eigenvalues) and orthogonal functions (the


eigenfunctions, instead of eigenvectors)

φq1 · φq2 = 0 if q1 ≠ q2,  and 1 if q1 = q2 .   (4.140)

Using Eq. (4.139), it is easy to see that

φ(y) = [√λq φq(y)]_{q=1,2,...}   (4.141)

is a mapping function into a space where κ acts as the Euclidean scalar prod-
uct, i.e.,
φ(u) · φ(v) = κ(u, v) . (4.142)
In practice, simple kernels that fulfill Mercer’s conditions are, for example (a
few of them are sketched in code below):
• polynomial kernels [27]: κ(u, v) = (u · v + 1)^p, where p is some integer;
• radial basis functions like the Gaussian kernel: κ(u, v) = exp(−‖u − v‖² / (2σ²));
• kernels looking like the MLP activation function: κ(u, v) = tanh(u · v + b).
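For illustration, the three kernels listed above can be written as follows in NumPy. The function names are ours, and the parameters p, sigma, and b are left to the user, since the text gives no recommended values.

```python
import numpy as np

def polynomial_kernel(U, V, p=2):
    # kappa(u, v) = (u . v + 1)^p, computed for all pairs of rows of U and V
    return (U @ V.T + 1.0) ** p

def gaussian_kernel(U, V, sigma=1.0):
    # kappa(u, v) = exp(-||u - v||^2 / (2 sigma^2))
    sq_dists = (np.sum(U**2, axis=1)[:, None]
                + np.sum(V**2, axis=1)[None, :]
                - 2.0 * U @ V.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def tanh_kernel(U, V, b=1.0):
    # kappa(u, v) = tanh(u . v + b); note that this kernel is not positive
    # semidefinite for all parameter values
    return np.tanh(U @ V.T + b)
```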
The choice of a specific kernel is quite arbitrary and mainly motivated by the
hope that the induced mapping φ linearizes the manifold to be embedded.
If this goal is reached, then PCA applied to the mapped data set should
efficiently reveal the nonlinear principal components of the data set.
The “kernel trick” described above plays a key role in a large family of
methods called support vector machines [27, 37, 42] (SVMs). This family
gathers methods dedicated to numerous applications like regression, function
approximation, classification, etc.
Finally, Fig. 4.16 shows how to implement KPCA.
1. Compute either the matrix S (scalar products) or the matrix D (squared
Euclidean distances), depending on the chosen kernel.
2. Compute the matrix of kernel values Φ.
3. Double-center Φ:
• Compute the mean of the rows, the mean of the columns, and the
grand mean.
• Subtract from each entry the mean of the corresponding row and the
mean of the corresponding column, and add back the grand mean.
4. Decompose the double-centered Φ into eigenvalues and eigenvectors.
5. A P -dimensional representation of Y is obtained by computing the prod-
uct X̂ = IP ×N Λ1/2 UT .

Fig. 4.16. Kernel PCA algorithm.

No general-purpose MATLAB® function is available on the Internet. However,
a simple toy example can be downloaded from http://www.kernel-machines.org/code/
kpca_toy.m; straightforward adaptations of this script can transform it into
a more generic function. Another implementation is available in the SPIDER
software package (http://www.kyb.mpg.de/bs/people/spider/main.html).
Otherwise, it is straightforward to write a simple function from scratch
or to derive it from MDS using the above pseudo-code. Parameters of KPCA
are the embedding dimensionality along with all kernel-related parameters.
Last but not least, the choice of a precise kernel proves to be a tedious prob-
lem. Assuming that the kernel is an easy-to-compute function, time complexity
of KPCA is almost the same as for metric MDS (i.e., O(N 3 ) or less, depend-
ing on the efficiency of the EVD implementation). Space complexity remains
unchanged too.
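Putting the previous pieces together, a bare-bones KPCA routine following the steps of Fig. 4.16 might look as follows. This is only a sketch: a Gaussian kernel is assumed, the function name is ours, and no attention is paid to numerical or memory efficiency.

```python
import numpy as np

def kpca(Y, P=2, sigma=1.0):
    """Y: N-by-D data matrix (one observation per row); returns an N-by-P embedding."""
    # Steps 1-2 of Fig. 4.16: squared Euclidean distances and Gaussian kernel values
    sq = np.sum(Y**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T
    Phi = np.exp(-D2 / (2.0 * sigma**2))
    # Step 3: double centering
    Phi = Phi - Phi.mean(axis=0) - Phi.mean(axis=1)[:, None] + Phi.mean()
    # Steps 4-5: EVD and P-dimensional coordinates
    eigval, eigvec = np.linalg.eigh(Phi)
    order = np.argsort(eigval)[::-1][:P]
    lam = np.clip(eigval[order], 0.0, None)
    return eigvec[:, order] * np.sqrt(lam)      # rows are the embedded points
```

Tuning sigma is precisely the delicate point illustrated in the example below.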

Embedding of test set

The particular version of the double centering described in Subsection 4.2.2 for
MDS also works for KPCA. It is easily adapted to the use of kernel functions.
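As a sketch of what this adaptation can look like, a test point t can be embedded from its kernel values against the training set. This is the usual out-of-sample projection for kernel PCA, written with our own function names and reusing the quantities of the earlier sketches (the uncentered kernel matrix Phi and the EVD of its double-centered version); it is not necessarily the exact variant of Subsection 4.2.2.

```python
import numpy as np

def kpca_project(t, Y, Phi, eigvec, eigval, P, kernel):
    """Embed a test point t using the training data Y, the uncentered kernel matrix Phi,
    and the EVD (eigvec, eigval) of the double-centered Phi."""
    k_t = np.array([kernel(t, y) for y in Y])          # kernel values against training points
    # Center k_t consistently with the double centering applied to Phi
    k_t = k_t - Phi.mean(axis=0) - k_t.mean() + Phi.mean()
    order = np.argsort(eigval)[::-1][:P]               # assumes P strictly positive eigenvalues
    lam = eigval[order]
    return (eigvec[:, order].T @ k_t) / np.sqrt(lam)   # P-dimensional coordinates
```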

Example

Figure 4.17 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5.

Fig. 4.17. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by KPCA.

Visually, it is clear that KPCA has transformed the two manifolds in a nonlinear way. Unfortunately, the embeddings
are rather disappointing: the Swiss roll remains rolled up, and faces of the
open box are superposed. Once again, it must be emphasized that KPCA is
not motivated by geometrical arguments. It aims at embedding the manifold
into a space where an MDS-based projection would be more successful than
in the initial space. However, no guarantee is provided that this goal can be
reached; one can only hope that the kernel is well chosen and well param-
eterized. For example, the results in Fig. 4.17 are computed with Gaussian
kernels: tuning the width of this kernel proves to be tedious. In case of a bad
choice of the kernel, KPCA may increase the embedding dimensionality of
the manifold instead of reducing it! This is exactly what happens in Fig. 4.17:
with a kernel width of 0.75, the first six eigenvalues are

[λn ]1≤n≤N = [0.1517, 0.1193, 0.0948, 0.0435, 0.0416, 0.0345, . . .] , (4.143)

for the Swiss roll, and

[λn ]1≤n≤N = [0.1694, 0.1084, 0.0890, 0.0544, 0.0280, 0.0183, . . .] , (4.144)

for the open box with a kernel width of 2.5. In other methods using an EVD,
like metric MDS and Isomap, the variance remains concentrated within the
first three eigenvalues, whereas KPCA spreads it out in most cases. In order to
concentrate the variance within a minimal number of eigenvalues, the width of
the Gaussian kernel may be increased, but then the benefit of using a kernel
is lost and KPCA tends to yield the same result as metric MDS: a linear
projection.

Classification

Like metric MDS and Isomap, KPCA is a batch method working offline with
simple algebraic operations. As with other spectral methods, KPCA can
build projections incrementally, by discarding or keeping eigenvectors. The
model of the method is nonlinear thanks to the kernel function that maps the
data in an implicit way. The model is also continuous, as for PCA, and the
mapping is implicit.

Advantages and drawbacks

By construction, KPCA shares many advantages and drawbacks with PCA


and metric MDS. In contrast with these methods, however, KPCA can deal
with nonlinear manifolds. And actually, the theory hidden behind KPCA is a
beautiful and powerful work of art.
However, KPCA is not used much in dimensionality reduction. The reasons
are that the method is not motivated by geometrical arguments and the ge-
ometrical interpretation of the various kernels remains difficult. For example,
in the case of Gaussian kernels, which are the most widely used, the Euclidean
distance is transformed in such a way that long distances yield smaller values
than short distances. This is exactly the inverse of what is intuitively expected:
in Isomap and CDA, the use of the graph distance was precisely intended to
stretch distances!
The main difficulty in KPCA, as highlighted in the example, is the choice
of an appropriate kernel along with the right values for its parameters.

4.4.2 Semidefinite embedding

As shown in the previous section, generalizing PCA using the kernel trick as
in KPCA proves to be an appealing idea. However, and although theoretical
conditions determine the domain of admissible kernel functions, what the

theorems do not say is how to choose the best-suited kernel function in some
particular case. This shortcoming could disappear if one could learn the best
kernel from the data. Semidefinite embedding [196, 198, 195] (SDE), also
known as maximum variance unfolding [197] (MVU), follows this approach.

Embedding of data set

Like metric MDS, Isomap and KPCA, SDE implements distance preservation
by means of a spectral decomposition. Given a set of N observations, all these
methods build the low-dimensional embedding by juxtaposing the dominat-
ing eigenvectors of an N -by-N matrix. In metric MDS, pairwise distances are
used as they are, and then converted into scalar products by double center-
ing. In Isomap, traditional Euclidean distances are simply replaced with graph
distances. In KPCA, Euclidean distances are nonlinearly transformed using
a kernel function. In the case of SDE, the transformation of the distances is
a bit more complicated than for the former methods. Results of those meth-
ods depend on the specific transformation applied to the pairwise distances.
Actually, this transformation is arbitrarily chosen by the user: the type of dis-
tance (Euclidean/graph) or kernel (Gaussian, etc.) is fixed beforehand. The
idea behind SDE is to determine this transformation in a purely data-driven
way. For this purpose, distances are constrained to be preserved locally only.
Nonlocal distances are free to change and are optimized in such a way that
a suitable embedding can be found. The only remaining constraint is that
the properties of the corresponding Gram matrix of scalar products are kept
(symmetry, positive semidefiniteness), so that metric MDS remains applica-
ble. This relaxation of strict distance preservation into a milder condition of
local isometry enables SDE to embed manifolds in a nonlinear way.
In practice, the constraint of local isometry can be applied to smooth
manifolds only. However, in the case of a finite data set, a similar constraint
can be stated. To this end, SDE first determines the K nearest neighbors of
each data point, builds the corresponding graph, and imposes the preservation
of angles and distances for all K-ary neighborhoods:

(xi − xj ) · (xi − xk ) = (yi − yj ) · (yi − yk ) . (4.145)

This constraint of local isometry reduces to the preservation of pairwise distances
within an “enriched” graph only. The latter is obtained by creating a fully
connected clique of size K + 1 out of every data point yi and its K near-
est neighbors. In other words, after building the undirected graph of K-ary
neighborhoods, additional edges are created between the neighbors of each
datum, if they do not already exist. If A = [a(i, j)]1≤i,j≤N denotes the N -by-
N adjacency matrix of this graph, then the local isometry constraint can be
expressed as

‖xi − xj‖² = ‖yi − yj‖²   if a(i, j) = 1 .   (4.146)

As in the case of metric MDS, it can be shown that such an isometry constraint
determines the embedding only up to a translation. If the embedding is also
constrained to be centered,

∑_{i=1}^{N} xi = 0 ,   (4.147)

then this indeterminacy is avoided.
Subject to the constraint of local isometry, SDE tries to find an embedding
of the data set that unfolds the underlying manifold. To illustrate this idea,
the Swiss roll (see Section 1.5) is once again very useful. The Swiss roll can be
obtained by rolling up a flat rectangle in a three-dimensional space, subject
to the same constraint of local isometry. This flat rectangle is also the best
two-dimensional embedding of the Swiss roll. As a matter of fact, pairwise
Euclidean distances between faraway points (e.g., the corners of the rolled-up
rectangle) depend on the embedding. In the three-dimensional space, these
distances are shorter than their counterparts in the two-dimensional embed-
ding. Therefore, maximizing long distances while maintaining the shortest
ones (i.e., those between neighbors) should be a way to flatten or unfold the
Swiss roll. This idea translates into the following objective function:

φ = (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} d²x(i, j) ,   (4.148)

which should be maximized subject to the above-mentioned constraint of local
isometry, where d²x(i, j) = ‖xi − xj‖². It is not very difficult to prove
that the objective function is bounded, meaning that distances cannot grow
infinitely. To this end, graph distances (see Section 4.3) can be used. By
construction, graph distances measure the length of the path connecting two
data points as the sum of edge lengths. As all edges are subject to the local
isometry constraint, it follows that

φ ≤ (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} δ²y(i, j) ,   (4.149)

where δy (i, j) is the graph distance between data points yi and yj .


The optimization as stated above is not convex, as it involves maximiz-
ing a quadratic form (Eq. (4.148)) subject to quadratic equality constraints
(Eq. (4.146)). Fortunately, the formulation of the problem can be simplified by
using dot products instead of squared distances. Doing so not only makes the
optimization convex but also casts the problem within the framework of clas-
sical metric MDS (see Subsection 4.2.2). If D = [d2y (i, j)]1≤i,j≤N denotes the
square matrix of squared Euclidean distances and if S = [sy (i, j)]1≤i,j≤N with
entries sy(i, j) = yi · yj denotes the corresponding matrix of dot products,
then the relation
d2y (i, j) = sy (i, i) − 2sy (i, j) + sy (j, j) (4.150)

holds (see Subsection 4.2.2). If K = [sx (i, j)]1≤i,j≤N denotes the symmetric
matrix of dot products xi · xj in the low-dimensional embedding space, then
the constraint in Eq. (4.146) becomes

sx (i, j) = sy (i, j) if a(i, j) = 1 . (4.151)

Likewise, the centering constraint can be transformed as follows:


0 = ∑_{i=1}^{N} xi   (4.152)

0 = 0 · xj = ∑_{i=1}^{N} xi · xj   (4.153)

0 = ∑_{j=1}^{N} 0 · xj = ∑_{i=1}^{N} ∑_{j=1}^{N} xi · xj = ∑_{i=1}^{N} ∑_{j=1}^{N} sx(i, j) .   (4.154)

Finally, the objective function can also be expressed in terms of dot products
using Eqs. (4.150) and (4.154):

φ = (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} d²x(i, j)   (4.155)
  = (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} ( sx(i, i) − 2 sx(i, j) + sx(j, j) )   (4.156)
  = N ∑_{i=1}^{N} sx(i, i)   (4.157)
  = N tr(K) ,   (4.158)

where tr(K) denotes the trace of K; since the factor N is a constant, maximizing φ amounts to maximizing tr(K). At this stage, all constraints are linear
with respect to the entries of K and the optimization problem can be refor-
mulated. The goal of SDE consists of maximizing the trace of some N -by-N
matrix K subject to the following constraints:
• The matrix K is symmetric and positive semidefinite.
• The sum of all entries of K is zero (Eq. (4.154)).
• For nonzero entries of the adjacency matrix, the equality sx(i, j) = sy(i, j)
must hold.
The first two constraints allow us to cast SDE within the framework of clas-
sical metric MDS. Compared to the latter, SDE enforces only the preservation
of dot products between neighbors in the graph; all other dot products are
free to change.
In practice, the optimization over the set of symmetric and positive
semidefinite matrices is an instance of semidefinite programming (SDP; see,

e.g., [184, 112] and references therein): the domain is the cone of positive
semidefinite matrices intersected with hyperplanes (representing the equal-
ity constraints) and the objective function is a linear function of the matrix
entries. The optimization problem has some useful properties:
• Its objective function is bounded above by Eq. (4.149).
• It is also convex, thus preventing the existence of spurious local maxima.
• The problem is feasible, because S is a trivial solution that satisfies all
constraints.
Details on SDP are beyond the scope of this book and can be found in the
literature. Several SDP toolboxes in C++ or MATLAB R
can be found on the
Internet. Once the optimal matrix K is determined, low-dimensional embed-
ding is obtained by decomposing K into eigenvalues and eigenvectors, exactly
as in classical metric MDS. If the EVD is written as K = UΛUT , then the
low-dimensional sample coordinates are computed as

X̂ = IP ×N Λ1/2 UT . (4.159)

The SDE algorithm is summarized in Fig. 4.18.

1. If data consist of pairwise distances, then skip step 2 and go directly to


step 3.
2. If data consist of vectors, then compute all squared pairwise distances in
matrix D.
3. Determine the K nearest neighbors of each datum.
4. Build an undirected graph that comprises the complete clique of each
K-ary neighborhood.
5. Perform the double centering of D in order to obtain S (see MDS).
6. Maximize the trace of some N -by-N matrix K subject to the following
constraints:
• The matrix K is symmetric and positive semidefinite.
• The sum of all entries of K is zero (Eq. (4.154)).
• For nonzero entries of the graph adjacency matrix, the equality
sx(i, j) = sy(i, j) must hold.
7. Perform classical metric MDS on the optimized matrix K to obtain low-
dimensional coordinates:
• Compute the EVD of K: K = UΛUT .
• The P -dimensional embedding is then given by X̂ = IP ×N Λ1/2 UT .

Fig. 4.18. Procedure achieving semidefinite embedding.

A MATLAB® version can be found at http://www.seas.upenn.edu/~kilianw/sde/.
The use of this package requires installing at least one additional package, which performs the

semidefinite programming step of SDE. This step consumes a large amount
of computational resources, in terms of memory space as well as running time,
so it is advisable to run it on a high-end machine. Parameters of SDE are the
embedding dimensionality, the number of neighbors K, the type of constraints
(equality or inequality), along with all parameters and options required by the
semidefinite programming solver (number of iterations, etc.). See the help of
the MATLAB® package for a detailed list of all parameters and options.
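As an illustration of how the SDP formulation can be prototyped, the sketch below uses the CVXPY modeling package (an assumption of ours; it is not one of the toolboxes mentioned above) and writes the local isometry constraint in the distance form of Eq. (4.146) rather than in terms of dot products. It is a toy implementation: the SDP quickly becomes intractable for more than a few hundred points.

```python
import numpy as np
import cvxpy as cp

def sde_embedding(Y, n_neighbors=5, P=2):
    """Rough sketch of SDE/MVU; Y is an N-by-D data matrix, returns an N-by-P embedding."""
    N = Y.shape[0]
    sq = np.sum(Y**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T       # squared Euclidean distances

    # Undirected K-ary neighborhood graph, enriched into cliques of size K + 1
    # (assumes all data points are distinct)
    A = np.zeros((N, N), dtype=bool)
    for i in range(N):
        group = np.concatenate(([i], np.argsort(D2[i])[1:n_neighbors + 1]))
        for a in group:
            for b in group:
                if a != b:
                    A[a, b] = A[b, a] = True

    # SDP: maximize tr(K) subject to centering and local isometry (Eq. (4.146))
    K = cp.Variable((N, N), PSD=True)
    constraints = [cp.sum(K) == 0]
    for i in range(N):
        for j in range(i + 1, N):
            if A[i, j]:
                constraints.append(K[i, i] + K[j, j] - 2 * K[i, j] == D2[i, j])
    cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve()

    # Classical metric MDS on the optimized Gram matrix
    eigval, eigvec = np.linalg.eigh(K.value)
    order = np.argsort(eigval)[::-1][:P]
    lam = np.clip(eigval[order], 0.0, None)
    return eigvec[:, order] * np.sqrt(lam)
```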

Embedding of test set

The Nyström formula that is referred to in [6, 16] (see also Subsection 4.2.2)
cannot be used in this case, because the kernel function applied to the Gram
matrix and learned from data in the SDP stage remains unknown in closed
form.

Example

Figure 4.19 shows the two-dimensional embeddings of the two data sets
(Fig. 1.4) described in Section 1.5.

Fig. 4.19. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by SDE.

These results are obtained with the number of neighbors K equal to five (the
graph shown in the figure does not correspond to the actual graph built by
SDE). Slack variables are off for the Swiss roll (isometry is required) and on
for the open box (distances may shrink). As can
be seen, the Swiss roll is well unfolded: local distances from one point to its
neighbors are almost perfectly preserved. For the open box, the embedding
is not so satisfying: although distances are allowed to shrink, SDE fails to
find a satisfying embedding and faces are superposed. For this manifold, the
SDP solver displays a failure message: no embedding with fewer than three
dimensions can be built without violating the constraints.

Classification

SDE is a batch method that processes data offline. It belongs to the family
of spectral methods, like metric MDS, Isomap, and KPCA. Exactly as those
methods, SDE is provided with an estimator of the data dimensionality and
can build embeddings in an incremental way.
In [196, 198, 195], SDE is introduced without vector quantization. As SDE
is quite computationally demanding, this preprocessing proves to be very use-
ful in practice.
The mapping resulting from SDE is explicit, and the associated data model
is discrete.

Advantages and drawbacks

The constraint of local isometry proposed in SDE is milder than strict isom-
etry, as required in metric MDS and derived methods. As a result, SDE com-
pares favorably to methods based on weighted distance preservation.
SDE can also be seen as a kind of super-Isomap that remedies some short-
comings of the graph distance. Results of Isomap tightly depend on the esti-
mation quality of the geodesic distances: if the data set is sparse, the latter
will be poorly approximated by graph distances. Leaving aside any graph con-
struction problem (shortcuts), graph distances are likely to be overestimated
in this case, because graph paths are zigzagging. Similarly, Isomap also fails to
embed correctly nonconvex manifolds (e.g., manifolds with holes): in this case,
some graph paths are longer than necessary because they need to go round
the holes. (Actually, both problems are closely related, since a sparse sample
from a convex manifold and a dense sample from a manifold with many holes
may look quite similar.)
On the down side, SDE proves to be dramatically slow. This comes from
the complexity of the SDP step. Finally, SDE cannot process test points.

Variants

A variant of SDE that uses landmark points has been developed and is re-
ferred to in [195]. Instead of using all pairwise distances between data points,
a rectangular matrix is used instead, which consists of distances from data
points to a few landmarks (typically a subset of data). The main goal is to
reduce the computational cost of SDE while approximating reasonably well
the normal version. A similar principle is used to develop a landmark-based
version of Isomap.
Other manifold or graph embedding methods based on semidefinite pro-
gramming are also developed in [125, 30].
5
Topology Preservation

Overview. This chapter reviews methods that reduce the dimen-


sionality by preserving the topology of data rather than their pairwise
distances. Topology preservation appears more powerful, but also
more complex to implement, than distance preservation. The described
methods are separated into two classes according to the kind of topol-
ogy they use. The simplest methods rely on a predefined topology
whereas more recent methods prefer a topology built according to the
data set to be re-embedded.

5.1 State of the art


As demonstrated in the previous chapter, nonlinear dimensionality reduction
can be achieved by distance preservation. Numerous methods use distances,
which are intuitively simple to understand and very easy to compute. Unfor-
tunately, the principle of distance preservation also has a major drawback.
Indeed, the appealing quantitative nature of a distance function also makes
it very constraining. Characterizing a manifold with distances amounts to
supporting and bolting it with rigid steel beams. In many cases, though, the embed-
ding of a manifold requires some flexibility: some subregions must be locally
stretched or shrunk in order to embed them in a lower-dimensional space.
As stated in Section 1.4, the important point about a manifold is its topol-
ogy, i.e., the neighborhood relationships between subregions of the manifold.
More precisely, a manifold can be entirely characterized by giving relative or
comparative proximities: a first region is close to a second one but far from
a third one. To some extent, distances give too much information: the exact
measure of a distance depends not exclusively on the manifold itself but also,
for much too large a part, on a given embedding of the manifold. Actually,
comparative information between distances, like inequalities or ranks, suffices
to characterize a manifold, for any embedding.

Another argument pleads for neighborhood relations: they are considered


exclusively inside the manifold. On the other hand, most distance functions
make no distinction between the manifold and the surrounding empty space.
Most often the distance between two points is computed along a straight
line and does not take the manifold into account. A first attempt to avoid
those annoying “shortcuts” caused by spatial distances comes with the graph
distance, studied in Section 4.3.
From a practical point of view, the use of topology raises a serious difficulty.
Indeed, how do we characterize the topology of a manifold, knowing that
only a few points are available? Without a good description of the topology,
its preservation in a different space is impossible! This chapter attempts to
answer the above question by studying several existing methods. Briefly put,
all these methods translate the qualitative concept of topology into a set of
numerical descriptors. Obviously, care must be taken so that those
descriptors remain as independent as possible from the initial embedding
of the manifold. Otherwise, the same drawback as in the case of distance
reappears.
The next two sections review some of the best-known methods that re-
duce the dimensionality using the principle of topology preservation; they are
called topology-preserving methods in short and are classified according to
the type of topology they use in the embedding space. Actually, as most of
these methods work with a discrete mapping model (see Subsection 2.5.5),
the topology is often defined in a discrete way, too. More precisely, such a dis-
crete representation of the topology is usually called a lattice. It can be a set of
points regularly spaced on a plane or, more formally, a graph (see Section 4.3).
The latter is the most generic and flexible mathematical object to discretize
a topology: points are associated with graph vertices and their proximity is
symbolized by a (weighted) graph edge. Relationships between faraway points
may be explicit (an edge with a heavy weight in a weighted graph) or implicit
(absence of an edge). In the latter case, the relationships may still be deduced,
for example, by graph distances (see Subsection 4.3). A large family of neural
networks, called topology-representing networks [136] (TRNs), is specifically
devoted to the discrete representation of topologies. Unfortunately, most of
those networks do not easily provide a low-dimensional representation of the
topology. Hence, other methods must be sought, or TRNs must be adapted.
Section 5.2 deals with methods relying on a predefined lattice, i.e., the
lattice or graph is fixed in advance and cannot change after the dimensionality
reduction has begun. Kohonen self-organizing map is the best-known example
in this category.
Section 5.3 introduces methods working with a data-driven lattice, mean-
ing that the shape of the lattice can be modified or is entirely built while the
methods are running. If a method reduces the dimensionality without vector
quantization (see Subsection 2.5.9), the shape is the sole property of the lat-
tice that can be tuned. On the other hand, when vector quantization is used,

even the number of points (vertices) in the lattice (graph) can be adjusted by
data.
Additional examples and comparisons of the described methods can be
found in Chapter 6.

5.2 Predefined lattice


This section describes two methods, namely the self-organizing map and the
generative topographic mapping, that reduce the dimensionality using a pre-
defined lattice. Because the lattice is imposed in advance, the applicability
of these methods seems very limited. Indeed, the lattice proposed by most
implementations seldom differs from a rectangular or hexagonal grid made of
regularly spaced points. As can easily be guessed, very few manifolds fit such
a simple shape in practice. Hence, the lattice must often be heavily deformed
in order to fit the data cloud; sometimes, it is even folded on itself. However,
considering the lattice as the domain of the manifold parameters may help to
accept such a restrictive hypothesis.
It is noteworthy, however, that methods working with a predefined lattice
allow the easy visualization of neighborhood relationships between labeled
data. This is usually realized by displaying the labels in a standardized rect-
angle or hexagon: each point of the lattice is then represented by its own
average or dominant label.

5.2.1 Self-Organizing Maps

Along with the multi-layer perceptron (MLP), the self-organizing map is per-
haps the most widely known method in the field of artificial neural networks.
The story began with von der Malsburg’s pioneering work [191] in 1973.
His project aimed at modeling the stochastic patterns of eye dominance and
the orientation preference in the visual cortex. After this rather biologically
motivated introduction of self-organization, little interest was devoted to the
continuation of von der Malsburg’s work until the 1980s. At that time, the field
of artificial neural networks was booming again: it was a second birth for this
interdisciplinary field, after a long and quiet period due to Minsky and Papert’s
flaming book [138]. This boom already announced the future discovery of the
back-propagation technique for the multi-layer perceptron [201, 161, 88, 160].
But in the early 1980s, all the attention was focused on Kohonen’s work.
He simplified von der Malsburg’s ideas, implemented them with a clear and
relatively fast algorithm, and introduced them in the field of artificial neural
networks: the so-called (Kohonen’s) self-organizing map (SOM or KSOM) was
born.
Huge success of the SOMs in numerous applications of data analysis
quickly followed and probably stems from the appealing elegance of the SOMs.

The task they perform is indeed very intuitive and easy to understand, al-
though the mathematical translation of these ideas into a compactly written
error function appears quite difficult or even impossible in the general case.
More precisely, SOMs simultaneously perform the combination of two con-
current subtasks: vector quantization (see Appendix D) and topographic rep-
resentation (i.e., dimensionality reduction). Thus, this “magic mix” is used
not only for pure vector quantization, but also in other domains where self-
organization plays a key part. For example, SOMs can also be used to some
extent for nonlinear blind source separation [147, 85] as well as for nonlinear
dimensionality reduction. This versatility explains the ubiquity of SOMs in nu-
merous applications in several fields of data analysis like data visualization,
time series prediction [123, 122], and so forth.

Embedding of data set

Within the framework of dimensionality reduction, an SOM can be interpreted


intuitively as a kind of nonlinear but discrete PCA. In the latter method, a
hyperplane is fitted at best inside the data cloud, and points are encoded as
coordinates on that hyperplane. If the data cloud is curved, the hyperplane
should be curved as well. One way to put this idea in practice consists of
replacing the hyperplane with a discrete (and bounded) representation. For
example, a grid or lattice defined by some points can perfectly play this role
(see Fig. 5.1).

Fig. 5.1. Two-dimensional grid or lattice that can be used by a SOM.

If the segments connecting the grid points are elastic and articulated around the points, the fitting inside the data cloud becomes easy,
at least intuitively. Roughly, it is like covering an object with an elastic fish-
ing net. This intuitive idea underlies the way an SOM works. Unfortunately,
things become difficult from an algorithmic point of view. How do we encode
the fishing net? And how do we fit it inside the data cloud?
Considering an SOM as a special case of a vector quantization method may
help to answer the question. As explained in Appendix D, vector quantization

aims at replacing a set of points with a smaller set of representative points


called prototypes. Visually, “representative” means that the cloud of proto-
types resembles the initial data cloud. In other words, the prototypes are fitted
inside the data cloud. Unfortunately, classical vector quantization methods do
not take into account a grid or lattice: prototypes move independently from
each other.
An SOM circumvents this obstacle by moving several prototypes together,
according to their location in the lattice. Assuming the prototypes are fitted
iteratively in the data cloud, each time a prototype is moved, its neighbors in
the lattice are moved too, in the same direction. Doing so allows us to keep
the cohesion of the grid: neighboring prototypes in the grid will be located
close to each other in the data cloud.
More formally, an SOM consists of
• A set C containing the prototypes for the vector quantization; these pro-
totypes are D-dimensional points c(r), where D is the dimensionality of
the data space.
• A function dg (r, s) giving the distance between a pair of prototypes in
the lattice; this distance function implicitly determines the neighborhood
relationships between the prototypes.
In the context of dimensionality reduction, it is clear that the lattice plays the
role of the embedding space. Hence, dg (r, s) cannot be just any function. Very
often dg (r, s) is defined as d(g(r), g(s)), where g(r), g(s) ∈ G ⊂ RP , and P is
the dimensionality of the embedding space. In other words, prototypes have
coordinates in the initial space as well as in the final space. Weirdly enough,
coordinates g(r) in the embedding or latent space are known before running
the SOM since they are fixed in advance, either explicitly or implicitly by
defining dg (r, s). On the other hand, the corresponding coordinates c(r) in
the data space are unknown; it is the SOM’s duty to determine them. Once
the coordinates c(r) are computed, the embedding of a data point y(i) is
given as the P -dimensional coordinates associated with the nearest prototype
in the data space, i.e.,

x̂(i) = g(r)   with   r = arg min_s d(y(i), c(s)) ,   (5.1)

where d is a distance function in the data space, usually the Euclidean dis-
tance.
The coordinates c(r) can be determined iteratively, by following more or
less the scheme of a Robbins–Monro procedure. Briefly put, the SOM runs
through the data set Y several times; each pass through the data set is called
an epoch. During each epoch, the following operations are achieved for each
datum y(i):
1. Determine the index r of the closest prototype of y(i), i.e.

r = arg min_s d(y(i), c(s)) ,   (5.2)

where d is typically the Euclidean distance.


2. Update the D-dimensional coordinates c(s) of all prototypes according to

c(s) ← c(s) + ανλ (r, s)(y(i) − c(s)) , (5.3)

where the learning rate α, obeying 0 ≤ α ≤ 1, plays the same role as the
step size in a Robbins–Monro procedure. Usually, α slowly decreases as
epochs go by.
In the update rule (5.3), νλ (r, s) is called the neighborhood function and can
be defined in several ways.
In the early publications about SOMs [191, 104], νλ (r, s) was defined to
be the so-called ‘Bubble’ function:

νλ(r, s) = 0 if dg(r, s) > λ,  and 1 if dg(r, s) ≤ λ ,   (5.4)

where λ is the neighborhood (or Bubble) width. As the step size α does, the
neighborhood width usually decreases slowly after each epoch.
In [155, 154], νλ (r, s) is defined as
νλ(r, s) = exp( −d²g(r, s) / (2λ²) ) ,   (5.5)

which looks like a Gaussian function (see Appendix B) where the neighbor-
hood width λ replaces the standard deviation.
It is noteworthy that if νλ (r, s) is defined as

νλ(r, s) = 1 if r = s,  and 0 if r ≠ s ,   (5.6)

then the SOM does not take the lattice into account and becomes equivalent
to a simple competitive learning procedure (see Appendix D).
In all above definitions, if dg (r, s) is implicitly given by d(g(r), g(s)), then
d may be any distance function introduced in Section 4.2. In Eq. (5.5), the
Euclidean distance (L2 ) is used most of the time. In Eq. (5.4), on the other
hand, L2 as well as L1 or L∞ are often used.
Moreover, in most implementations, the points g(r) are regularly spaced
on a plane. This forces the embedding space to be two-dimensional. Two
neighborhood shapes are then possible: square (eight neighbors) or hexagonal
(six neighbors) as in Fig. 5.1. In the first (resp., second) case, all neighbors are
equidistant for L∞ (resp., L2 ). For higher-dimensional lattices, (hyper-)cubic
neighborhoods are the most widely used. The global shape of the lattice is
often a rectangle or a hexagon (or a parallelepiped in higher dimensions).
Figure 5.2 shows a typical implementation of a SOM. The reference software
package for SOMs is the SOM toolbox, available at http://www.cis.hut.fi/projects/somtoolbox/.

1. Define the lattice by assigning the low-dimensional coordinates g(r) of the


prototypes in the embedding space.
2. Initialize the coordinates c(r) of the prototypes in the data space.
3. Give α and λ their scheduled values for epoch q.
4. For all points y(i) in the data set, compute r as in Eq. (5.2) and update
all prototypes according to Eq. (5.3).
5. Return to step 3 until convergence is reached (i.e., updates of the proto-
types become negligible).

Fig. 5.2. Batch version of Kohonen’s self-organizing map.

This is a complete MATLAB® toolbox, including not only the SOM algorithm but also other mapping functions (NLM, CCA)
and visualization functions. A C++ implementation can also be downloaded
from http://www.ucl.ac.be/mlg/. The main parameters of an SOM are the
lattice shape (width, height, additional dimensions if useful), neighborhood
shape (square or hexagon), the neighborhood function νλ , and the learning
schedules (for learning rate α and neighborhood width λ). Space complexity
is negligible and varies according to implementation choices. Time complexity
is O(N D|C|) per iteration.
Finally, it is noteworthy that SOMs are motivated by biological and
empirical arguments. Neither a generative model of data nor an objective
function is defined, except in very particular cases [38]. More information
about mathematical aspects of SOMs can be found in [62] and references
therein.
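For concreteness, the update rules (5.2), (5.3), and (5.5) translate almost literally into the following NumPy sketch. The rectangular grid, the linearly decaying schedules for α and λ, and the random initialization of the prototypes are simplifications of our own, not the settings of the SOM toolbox.

```python
import numpy as np

def train_som(Y, grid_h=10, grid_w=30, n_epochs=50, alpha0=0.5, lam0=5.0, seed=0):
    """Y: N-by-D data matrix; returns (G, C) with grid coordinates g(r) and prototypes c(r)."""
    rng = np.random.default_rng(seed)
    # Low-dimensional coordinates g(r) of the prototypes (fixed, regular rectangular grid)
    G = np.array([[i, j] for i in range(grid_h) for j in range(grid_w)], dtype=float)
    # D-dimensional coordinates c(r), initialized on randomly chosen data points
    C = Y[rng.choice(len(Y), size=len(G), replace=True)].astype(float).copy()
    for epoch in range(n_epochs):
        # Learning rate and neighborhood width decrease slowly with the epochs
        alpha = alpha0 * (1.0 - epoch / n_epochs)
        lam = lam0 * (1.0 - epoch / n_epochs) + 0.5
        for y in Y[rng.permutation(len(Y))]:
            r = np.argmin(np.sum((C - y) ** 2, axis=1))       # Eq. (5.2): closest prototype
            d2g = np.sum((G - G[r]) ** 2, axis=1)             # squared lattice distances
            nu = np.exp(-d2g / (2.0 * lam ** 2))              # Eq. (5.5): neighborhood function
            C += alpha * nu[:, None] * (y - C)                # Eq. (5.3): update all c(s)
    return G, C

def som_embed(y, G, C):
    # Eq. (5.1): a point is embedded on the grid coordinates of its closest prototype
    return G[np.argmin(np.sum((C - y) ** 2, axis=1))]
```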

Embedding of test set

Within the framework of vector quantization, any point y is encoded (not


projected; see Appendix D) by giving the index r of the closest prototype
c(r), which is computed in the same way as in Eq. (5.2). In this context,
the embedding of y can be defined as x̂ = g(r). Obviously, this gives only
a very rough estimate of the low-dimensional coordinates, since at most |C|
values (the number of prototypes) can be output. More precise interpolation
procedures are described in [75].

Example

The two benchmark data sets introduced in Section 1.5 can easily be pro-
cessed using an SOM. In order to quantize the 350 and 316 three-dimensional
points contained in the data sets (see Fig. 1.4), the same 30-by-10 rectangular
lattice as in Fig. 5.1 is defined. The neighborhood shape is hexagonal. With
the neighborhood function defined as in Eq. (5.5), the SOM computes the
embeddings shown in Fig. 5.3.

Fig. 5.3. Two-dimensional embeddings of the “Swiss roll” and “open box” data sets
(Fig. 1.4), found by an SOM. The shape of the embedding is identical to the prede-
fined lattice shown in Fig. 5.1. Only the color patches differ: the specific color and
shape of each data point in Fig. 1.4 have been assigned to their closest prototypes.

By construction, the embedding computed by an SOM is the predefined lattice, regardless of what the manifold to embed
looks like. Indeed, the 30-by-10 rectangle of Fig. 5.1 is left unchanged.
Does it mean that the embedding conveys no information? Actually, as
already mentioned, an SOM is often used to visualize labeled data. In the
case of the open box, each point in the data set has been given a color (dark
blue for the bottom, dark red for the top) and a shape (small circles for the
faces and larger square for the edges and corners). Prototypes of the SOM can
inherit these visual labels: each point gives its label to its closest prototype.
As can be seen, the bottom face occupies the center of the rectangular lattice.
This interpretation is confirmed by looking at Fig. 5.4.

Fig. 5.4. Three-dimensional views showing how the lattice of an SOM curves in
order to fit in the two data sets of Fig. 1.3. Colors indicate the left-right position of
each point in the lattice, as in Fig. 5.1.

The lattice “packs” the box. Cracks, however, are visible on two lateral faces. They explain why
two lateral faces are torn in Fig. 5.3 and cause the loss of some topological
relationships. In addition, some points of the lattice have no color spot. This
means that although the lattice includes fewer points than the data set, some
points of the lattice never play the role of closest prototype.
Finally, it must be remarked that the axis ranges differ totally in Figs. 5.3
and 5.4. This intuitively demonstrates that only the topology has been pre-
served: the SOM has not taken distances into account.

Classification

An SOM is mainly a method of vector quantization. This means that vector


quantization is mandatory in an SOM.
Regarding dimensionality reduction, an SOM models data in a nonlinear
and discrete way, by representing it with a deformed lattice. The mapping is
explicit (and defined only for the prototypes).
Most of the time, SOMs are implemented by offline algorithms, looking
like a Robbins–Monro procedure. Online versions can easily be derived. A
so-called ‘batch’ version of the SOM also exists [105]: instead of updating
prototypes one by one, they are all moved simultaneously at the end of each
epoch, as in a standard gradient descent.

Advantages and drawbacks

The wide success of SOMs can be explained by the following advantages. The
method is very simple from an algorithmic point of view, and its underlying
idea, once understood, is intuitively appealing. SOMs are quite robust and
perform very well in many situations, such as the visualization of labeled
data.
Nevertheless, SOMs have some well-known drawbacks, especially when
they are used for dimensionality reduction. Most implementations handle one-
or two-dimensional lattices only. Vector quantization is mandatory, meaning
that a SOM does not really embed the data points: low-dimensional coor-
dinates are computed for the prototypes only. Moreover, the shape of the
embedding is identical to the lattice, which is, in turn, defined in advance,
arbitrarily. This means that an SOM cannot capture the shape of the data
cloud in the low-dimensional embedding.
From a computational point of view, it is very difficult to assess the con-
vergence of an SOM, since no explicit objective function or error criterion is
optimized. Actually, it has been proved that such a criterion cannot be de-
fined, except in some very particular cases [57]. In addition, the parameter
setting of an SOM appears as a tedious task, especially for the neighborhood
width λ.

Variants

In the literature, many variants have been adapted to specific applications


and tried with good success. Further details can be found in [105, 154] and
references therein. To some extent, SOMs can also be related to principal
curves [79, 80], or at least to their variants working with polygonal curve
approximations [7].
It is noteworthy that several attempts have been made to give SOMs a
data-driven lattice (see Section 5.3 although the methods described therein are
quite different from the SOMs). In particular, Fritzke developed the growing
cell structure [69] (GCS) and the growing grid [70] (GG), whereas Bauer and
Villmann designed the growing SOM [11] (GSOM). Actually, these methods
are incremental vector quantization techniques, i.e., the number of prototypes
may increase automatically depending on the problem’s complexity. Some
variants of principal curves, using polygonal curve approximations, develop
the same strategy in the 1D case [7].
The GCS is actually a TRN (see Section 5.1) able to build any two-
dimensional lattice (neither the number nor the position of the lattice points is
fixed in advance), but as most other TRNs, the GCS does not really perform a
dimensionality reduction: an additional procedure [69] (embedding method) is
needed in order to embed the lattice in a plane. Isotop (see Subsection 5.3.3),
and particularly its third step, may be seen as an efficient way to embed the
result of a TRN like the GCS in low-dimensional space.
The GSOM and GG, on the contrary, depart less from the original SOM
and conserve a rectangular grid, making a two- or higher-dimensional repre-
sentation straightforward, as for a traditional SOM. The difference with an
SOM holds here in the possibility to add rows and columns in the lattice.
Thus, visibly, many variants of the SOMs stay very close to the origi-
nal method. Specifically, very few models use anything other than the two-
dimensional rectangular grid, with regularly placed points. However, it is pos-
sible to define lattices with any number of points, any shape, and even any
distribution. A lattice could indeed be defined by randomly drawing points
in any distribution. The predefined neighborhood relationships between these
points can be made explicit by computing distances between them (either spa-
tial or graph distance; see Sections 4.2 and 4.3). Thus, if any information about
the manifold shape or distribution is known in advance, it can be exploited by
fitting the lattice shape or distribution accordingly. Additionally, with today’s
computing power, the lattice can contain a huge number of points, in order
to refine the discretization or to better approximate a specific distribution.
As the only shortcoming, the use of such a randomly drawn lattice makes the
visualization less straightforward.
The idea of giving a prior distribution to the lattice is also followed by
the generative topographic mapping (GTM), which is introduced in Subsec-
tion 5.2.2. In the case of GTM, however, the choice of a prior distribution is

intended to exploit statistical concepts and stochastic optimization methods


rather than to broaden the choice of the grid shape.

5.2.2 Generative Topographic Mapping

The generative topographic mapping (GTM) has been put forward by Bishop,
Svensén, and Williams [23, 24, 176] as a principled alternative to the SOM.
Actually, GTM is a specific density network based on generative modeling, as
indicated by its name. Although the term “generative model” has already been
used, for example in Subsection 2.4.1 where the model of PCA is described,
here it has a stronger meaning.
In generative modeling1 , all variables in the problem are assigned a prob-
ability distribution to which the Bayesian machinery is applied. For instance,
density networks [129] are a form of Bayesian learning that try to model data
in terms of latent variables [60]. Bayesian neural networks learn differently
from other, more traditional, neural networks like an SOM. Actually, Bayesian
learning defines a more general framework than traditional (frequentist) learn-
ing and encompasses it. Assuming that the data set Y = [y(i)]1≤i≤N has to
be modeled using parameters stored in vector w, the likelihood function L(w)
is defined as the probability of the data set given the parameters

L(w) = p(Y|w) . (5.7)

Then traditional and Bayesian learning can be understood as follows [142]:


• Traditional (frequentist) learning. In an SOM, or in any other classical
neural network like an MLP, no distribution over the model parameters w
is assumed. The aim is then to determine the optimal value wopt , which is
often found as a maximum likelihood estimator:

wopt = arg max_w (ln L(w) + R(w)) ,   (5.8)

where R(w) is an optional regularization term, such as the usual quadratic
regularizer

R(w) = −(α/2) ∑_k w²k ,   (5.9)

where the hyperparameter α allows us to tune the relative importance of


the regularization with respect to the primary objective function. Intu-
itively, regularization penalizes a neural network whose structure is too
complex, in order to avoid overfitting of the model. Test data Ytest can
be processed by computing p(Ytest |wopt ) when this is feasible or computa-
tionally tractable.
1 This introduction to generative modeling is inspired by [33].

• Bayesian learning. For density networks, like GTM, a probability dis-


tribution over the model parameters w is obtained before considering any
datum. Actually, such a distribution is based on a prior distribution p(w)
that expresses an initial belief about the value of w. Afterwards, given
data Y, the prior distribution is updated to a posterior distribution using
Bayes’s rule:
p(w|Y) = p(Y|w) p(w) / p(Y) ∝ L(w) p(w) .   (5.10)
Test data Ytest can be processed by computing

p(Ytest|Y) = ∫ p(Ytest|w) p(w|Y) dw   (5.11)

when this is feasible or computationally tractable.


According to Eq. (5.8), traditional learning may be viewed as a maximum
a posteriori probability (MAP) estimate of Bayesian learning:

wopt = arg max_w p(w|Y)   (5.12)
     = arg max_w ln(L(w) p(w))   (5.13)
     = arg max_w (ln L(w) + ln p(w)) ,   (5.14)

with a prior p(w) ∝ exp R(w). For the quadratic regularizer (Eq. (5.9)), the
prior would be proportional to a Gaussian density with variance 1/α.
Compared to frequentist learning, the Bayesian approach has the advan-
tage of finding a distribution for the parameters in w, instead of a single value.
Unfortunately, this comes at the expense of introducing the prior, whose
selection is often criticized as being arbitrary.
Within the framework of Bayesian learning, density networks like GTM
are intended to model a certain distribution p(y) in the data space RD by a
small number P of latent variables. Given a data set (in matrix form) Y =
[y(i)]1≤i≤N drawn independently from the distribution p(y), the likelihood
and log-likelihood become

L(w) = p(Y|w) = ∏_{i=1}^{N} p(y(i)|w) ,   (5.15)

l(w) = ln p(Y|w) = ∑_{i=1}^{N} ln p(y(i)|w) ,   (5.16)

respectively. Starting from these equations, density networks are completely


determined by choosing the following parameters or elements:
• The dimension P of the latent space.
• The prior distribution in the latent P -dimensional space: p(x), x being a
random vector of RP .

• A smooth mapping m from the latent space onto a P -manifold Y in the


D-dimensional data space, with parameter vector w (for example, if m is
an MLP, w would be the weights and biases):

m : RP → Y ⊂ RD , x → y = m(x, w) . (5.17)

The possibility to reduce the dimensionality clearly holds in this jump


from P to D dimensions.
• An error function Gi (x, w) = ln p(y(i)|x, w) = ln p(y(i)|y). Applying
Bayes’s rule, the posterior can then be computed in the latent space from
the prior and the error function:
p(x|y(i), w) = p(y(i)|x, w) p(x) / p(y(i)|w) = exp(Gi(x, w)) p(x) / p(y(i)|w) ,   (5.18)

with the normalization constant

p(y(i)|w) = ∫ p(y(i)|x, w) p(x) dx .   (5.19)

Using Bayes’s rule once again, as in Eq. (5.10), gives the posterior in the
parameter space:
p(w|Y) = p(Y|w) p(w) / p(Y) = L(w) p(w) / p(Y) ,   (5.20)
• An optimization algorithm in order to find the parameter w that max-
imizes the posterior in the parameter space p(w|Y). In practice, this is
achieved by maximizing the log-likelihood, for example by gradient de-
scent on Eq. (5.19), when this is computationally feasible.

Embedding of data set

In GTM, the various elements of a density network are set as follows:


• The error functions Gi (x, w): the probabilities p(y(i)|x, w) are spherical
Gaussian kernels N (m(x, W), β −1 I), with parameters w = {W, β}. The
kernels are centered on m(x, W) and have a variance equal to β −1 :

p(y(i)|x, W, β) = (β/(2π))^{D/2} exp( −(β/2) ‖y(i) − m(x, W)‖² ) ,   (5.21)

which behaves as an isotropic Gaussian noise model for y(x, W) that ex-
tends the manifold Y to RD : a given data vector y(i) could have been
generated by any point x with probability p(y(i)|x, W, β). It can be also
remarked that the error function Gi (x, W, β) trivially depends on the
squared distance between the observed data point y(i) and the generating
point x.

• The prior distribution in latent space:



p(x) = (1/C) ∑_{r=1}^{C} δ(x − g(r)) ,   (5.22)

where the C points g(r) stand on a regular grid in latent space, in the
same way as the prototypes of an SOM; the prior thus assigns a probability
mass 1/C to each grid point and vanishes elsewhere. This discrete choice of the prior
distribution directly simplifies the integral in Eq. (5.19) into a sum. Oth-
erwise, for an arbitrary p(x), the integral must be explicitly discretized
(Monte Carlo approximation). Then, finally, Eq. (5.19) becomes in the
case of GTM:

p(y(i)|W, β) = (1/C) ∑_{r=1}^{C} p(y(i)|g(r), W, β) ,   (5.23)

which is a constrained mixture of Gaussian kernels (because the kernels


lie on the P -dimensional manifold Y in the data space). Similarly, the
log-likelihood becomes
l(W, β) = ∑_{i=1}^{N} ln( (1/C) ∑_{r=1}^{C} p(y(i)|g(r), W, β) ) .   (5.24)

• The mapping from latent space to data space is a generalized linear model
m(x, W) = Wφ(x), where W is a D-by-B matrix and φ a B-by-1 vector
consisting of B (nonlinear) basis functions. Typically, these B basis func-
tions are Gaussian kernels with explicitly set parameters: their centers are
drawn from the grid in the latent space, and their common variance is pro-
portional to the mean distances between the centers. In other words, the
mapping used by GTM roughly corresponds to an RBF network with con-
strained centers: by comparison with the MLP or usual density networks,
the RBFN remains an universal approximator but yields considerable sim-
plifications in the subsequent computations (see ahead). The exact posi-
tioning of the centers and the tuning of the width σ are not discussed here;
for more details, see [176]. Nevertheless, in order to get a smooth manifold
and avoid overfitting, it is noteworthy that (i) the number of kernels in the
constrained RBF must be lower than the number of grid points and (ii)
the width σ must be larger than the mean distance between neighboring
centers. As in other RBF networks, additional linear terms and biases may
complement the basis functions in order to easily take into account linear
trends in the mapping.
• The optimization algorithm is the expectation-maximization (EM) proce-
dure [50, 21]. This choice is typical when maximizing the likelihood and
working with mixtures of Gaussian kernels. By design, the prior in latent
space is a mixture of Gaussian kernels (see Eq. (5.23)) that fortunately
makes EM applicable. In other density networks, a more complex choice

of the prior does not allow simplifying the integral of Eq. (5.19) into a sum.
In the case of GTM, the objective function is the log-likelihood function.
Without going into technical details [176], the log-likelihood function


l(W, β) = ∑_{i=1}^{N} ln p(y(i)|W, β)   (5.25)

leads to the following two steps:

E step.

Computation of the responsibilities ρi,r (W, β) (see Eq. (5.28) ahead).

M step.

The partial derivatives of L(W, β) with respect to parameters W and β


give:
– A matrix equation for W:

Wnew ΦGold ΦT = YPold ΦT , (5.26)

solvable for Wnew with standard matrix inversion techniques. This sim-
ple update rule results from the adoption of an RBFN-like approxima-
tor instead of, for example, an MLP, which would have required a
gradient ascent as optimization procedure.
– A re-estimation formula for β:
\frac{1}{\beta} \leftarrow \frac{1}{ND} \sum_{r=1}^{C} \sum_{i=1}^{N} \rho_{i,r}(W_{new}, β)\, \| y(i) - m(g(r), W_{new}) \|^2 ,   (5.27)
where
– Φ = [φ(g(r))]1≤r≤C is a B-by-C constant matrix,
– Y = [y(i)]1≤i≤N is the data set (constant D-by-N matrix),
– P = [ρi,r (W, β)] is a varying N -by-C matrix of posterior probabilities
or responsibilities:
\rho_{i,r}(W, β) = p(g(r)|y(i), W, β)
= \frac{p(y(i)|g(r), W, β)\, p(g(r))}{\sum_{s=1}^{C} p(y(i)|g(s), W, β)\, p(g(s))}
= \frac{p(y(i)|g(r), W, β)}{\sum_{s=1}^{C} p(y(i)|g(s), W, β)} ,   (5.28)
– G is a diagonal C-by-C matrix with entries
g_{r,r}(W, β) = \sum_{i=1}^{N} \rho_{i,r}(W, β) .   (5.29)
Because the EM algorithm increases the log-likelihood monotonically [50],
the convergence of GTM is guaranteed. According to [24], convergence is
generally attained after a few tens of iterations. As initial weights, one can
take the first P principal components of the data set (see [176] for more
details). After convergence, a small value of the variance 1/βopt generally
indicates a good approximation of the data.
Once the optimal values for the parameters W and β are known, the
embedding x(i) in the latent space of the data points y(i) can easily be com-
puted. By construction, the optimal parameters allow defining p(y(i)|g(r)),
i.e., the probability distribution in the data space, conditioned by the latent
variables. Knowing the prior distribution p(x) over the latent variables (see
Eq. (5.22)) and using Bayes’s theorem, the posterior probability distribution
can be written as
p(g(r)|y(i)) = \frac{p(y(i)|g(r), W_{opt}, β_{opt})\, p(g(r))}{\sum_{s=1}^{C} p(y(i)|g(s), W_{opt}, β_{opt})\, p(g(s))}   (5.30)
= \frac{p(y(i)|g(r), W_{opt}, β_{opt})}{\sum_{s=1}^{C} p(y(i)|g(s), W_{opt}, β_{opt})}   (5.31)
= \rho_{i,r}(W_{opt}, β_{opt})   (5.32)
and closely resembles the computation of the responsibilities in Eq. (5.28),
where again the p(g(r)) disappear. Until here, only the probabilities of the
different points of the grid in the latent space are known. In order to compute
an estimate x(i) in vector form, two possibilities exist:
• the mode of the posterior distribution in the latent space:

x̂(i) = arg max p(g(r)|y(i)) , (5.33)


g(r)

• the mean of the posterior distribution in the latent space:


C
x̂(i) = g(r)p(g(r)|y(i)) . (5.34)
r=1

The second possibility is clearly the best except when the posterior distribu-
tion is multimodal. Putting together all the above-mentioned ideas leads to
the procedure presented in Fig. 5.5. A GTM MATLAB®
package is avail-
able at http://www.ncrg.aston.ac.uk/GTM/. The parameter list of GTM is
quite long. As with an SOM, the shape of the grid or lattice may be changed
(rectangular or hexagonal). The basis function of the RBF-like layer can be
tuned, too. Other parameters are related to the EM procedure (initialization,
number of iterations, etc.).
1. Generate the grid of latent points {g(r)}1≤r≤C .
2. Generate the grid of the centers used in the basis functions.
3. Select the width σ of the basis functions.
4. Evaluate the basis functions at the latent points g(r), and store the results
in Φ.
5. Initialize W either randomly or using PCA.
6. Initialize the inverse of the noise variance β.
7. Compute the matrix of squared distances D = [‖y(i) − Wφ(g(r))‖²].
8. E step:
• Compute P from Eq. (5.28) using D and β.
• Compute G from Eq. (5.29) using P.
9. M step:
• Update W with Eq. (5.26): W = YPΦ^T (ΦGΦ^T)^{-1}.
• Compute D as above, with the updated value of W.
• Update β according to Eq. (5.27), using P and D.
10. Return to step 8 if convergence is not yet reached.
11. Compute the embedding of the data points using P and then Eq. (5.33)
or (5.34).

Fig. 5.5. Algorithm implementing the generative topographic mapping.
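To complement Fig. 5.5, the sketch below implements the batch EM loop in NumPy. It is only an illustration of Eqs. (5.26)–(5.28): the function name, the random initialization of W, and the crude initialization of β are assumptions made here, whereas the official GTM package initializes W with PCA and handles the parameter schedules more carefully.

```python
import numpy as np

def gtm_em(Y, grid, centers, sigma, n_iter=50, seed=0):
    """Minimal batch-EM sketch for GTM, following Fig. 5.5 (illustrative, not the reference code).

    Y       : D-by-N data matrix
    grid    : P-by-C latent grid points g(r)
    centers : P-by-B centers of the Gaussian basis functions
    sigma   : common width of the basis functions
    """
    D, N = Y.shape
    # Basis-function matrix Phi (B-by-C): Gaussian kernels evaluated at the latent points
    d2 = ((centers[:, :, None] - grid[:, None, :]) ** 2).sum(axis=0)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(D, Phi.shape[0]))      # D-by-B weights (random init, assumed)
    beta = 1.0 / Y.var()                                    # inverse noise variance (crude init)
    for _ in range(n_iter):
        M = W @ Phi                                         # D-by-C manifold points m(g(r), W)
        dist2 = ((Y[:, :, None] - M[:, None, :]) ** 2).sum(axis=0)   # N-by-C squared distances
        # E step: responsibilities rho_{i,r}, Eq. (5.28) with the uniform prior
        logp = -0.5 * beta * dist2
        P = np.exp(logp - logp.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        # M step: solve (Phi G Phi^T) W^T = Phi P^T Y^T, i.e. Eq. (5.26)
        G = np.diag(P.sum(axis=0))
        W = np.linalg.solve(Phi @ G @ Phi.T, Phi @ P.T @ Y.T).T
        # Re-estimation of beta, Eq. (5.27)
        dist2 = ((Y[:, :, None] - (W @ Phi)[:, None, :]) ** 2).sum(axis=0)
        beta = N * D / (P * dist2).sum()
    return W, beta, Phi
```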

Embedding of test set

By construction, GTM aims at providing an easy way to generalize the di-
mensionality reduction to new points. The embedding of a test point y can
be computed in the same way as for the data points y(i), using Eq. (5.32)
and then Eq. (5.33) or (5.34).
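Under the array conventions of the previous sketch (W is D-by-B, Φ is B-by-C, and the latent grid is stored as a P-by-C matrix), the mode and mean embeddings of a new point could be obtained as follows; this is again a hedged illustration, not the interface of the GTM package.

```python
import numpy as np

def gtm_embed(y, W, beta, Phi, grid):
    """Posterior mode and mean of a new point y (Eqs. (5.32)-(5.34)); illustrative sketch."""
    M = W @ Phi                                    # D-by-C manifold points m(g(r), W)
    dist2 = ((y[:, None] - M) ** 2).sum(axis=0)    # squared distances to all grid images
    logp = -0.5 * beta * dist2
    rho = np.exp(logp - logp.max())
    rho /= rho.sum()                               # responsibilities rho_r, Eq. (5.32)
    x_mode = grid[:, np.argmax(rho)]               # mode of the posterior, Eq. (5.33)
    x_mean = grid @ rho                            # mean of the posterior, Eq. (5.34)
    return x_mode, x_mean
```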

Example

Figure 5.6 illustrates how GTM embeds the “Swiss roll” and “open box”
manifolds introduced in Section 1.5. The points of the data sets (see Fig. 1.4)
are embedded in the latent space (a square 10-by-10 grid) using Eq. (5.34).
The mapping m works with a grid of 4-by-4 basis functions.
As can be seen, GTM fails to embed the Swiss roll correctly: all turns of
the spiral are superposed. On the other hand, the square latent space perfectly
suits the cubic shape of the open box: the bottom face is in the middle of the
square, surrounded by the four lateral faces. The upper corners of the box
correspond to the corners of the square latent space. By comparison with an
SOM (Fig. 5.3), GTM yields a much more regular embedding. The only visible
shortcoming is that the lateral faces are a bit shrunk near the borders of the
latent space.
Fig. 5.6. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by the GTM.

Classification

The essential difference between GTM and almost all other methods described
in this book is that GTM relies on the principle of Bayesian learning. This
probabilistic approach leads to a different optimization technique: an EM
algorithm is used instead of a (stochastic) gradient descent or a spectral de-
composition. As described above, GTM is a batch method, but a version that
works with a stochastic EM procedure also exists.
Because GTM determines the parameters of a generative model of data,
the dimensionality reduction easily generalizes to new points. Therefore, it
can be said that GTM defines an implicit mapping although the latent space
is discrete.
If the implementation does not impose a two-dimensional grid, an external
procedure is needed to estimate the intrinsic dimensionality of data, in order
to determine the right dimension for the latent space.
If several embeddings with various dimensionalities are desired, GTM must
be run again.

Advantages and drawbacks

By comparison with an SOM, GTM provides a generative model for data.
Moreover, the probabilistic approach that comes with it has several advantages.
First, in addition to finding the latent coordinates x̂ of a point y, GTM can
also approximate p̂(x|y), i.e., the probability of the embedding to be located
at coordinates x in the latent space. This allows us to detect problems in the
dimensionality reduction when, for instance, the probability distribution is
not unimodal.
Second, from an algorithmic point of view, GTM optimizes a well-defined
objective function, namely the log-likelihood, whereas no such function exists
for the SOM [57]. While most methods described in this book work with
(stochastic) gradient descents or related techniques, GTM optimizes the log-
likelihood using an EM algorithm. Compared with these classical optimization
techniques, the EM is guaranteed to maximize the likelihood monotonically and
converges to a maximum after a few tens of iterations.
Like an SOM, GTM is limited to low-dimensional latent spaces: typically,
only one- or two-dimensional grids are used. Actually, the mapping m does
not cope well with high-dimensional latent spaces for the following two
reasons. First, the number of grid points must grow exponentially, like in an
SOM. But in the case of GTM, the number of kernels defining m must grow as
well. Second, as remarked in Subsection 1.2.2, the Gaussian kernels used in m
behave surprisingly in high-dimensional spaces. Therefore, the dimensionality
of the latent space must remain low, as for an SOM.
By the way, regarding the mapping m, it is important to remark that
although the universal approximation property holds for RBF networks in
theory, the RBF-like mapping of GTM is utterly simple and constrained:
• The number of kernels is limited by the number of latent points in the
grid.
• The width σ is isotropic in the latent space and shared by all kernels.
• The kernel centers and variances are the same for all D dimensions of the
data space.
• The kernel centers and the variance are set in advance and are constant;
only the weights are adjusted.
By comparison, RBF networks used for function approximation [93, 18] are
far more flexible:
• The number of kernels is only limited by the number of available data
points.
• The variance is often individualized for each kernel and is sometimes not
isotropic (complete covariance matrix).
• The kernel centers and variances are optimized separately for each output
dimension.
• The kernel centers and variances are optimized on the same footing as the
weights.
Unfortunately, the integration of such a generic RBF network appears difficult
in GTM, since many simplifications become impossible.
A table summarizing the differences between GTM and an SOM is given
in [33].

Variants

Some extensions to GTM are described in [176]. For example, the probabilistic
framework of GTM can be adapted to data sets with missing entries. Another
variant uses a different noise model: the Gaussian kernels include a full co-
variance matrix instead of being isotropic. The mapping m can also be easily
modified in order to cope with high-dimensional latent spaces. Mixtures of
several GTMs are also considered.

5.3 Data-driven lattice


In contrast with methods using a predefined lattice studied in the previous
section, the methods described here make no assumption about the shape and
topology of the embedding. Instead, they use information contained in data
in order to establish the topology of the data set and compute the shape of
the embedding accordingly. Thus, the embedding is not constrained in any
way and can adapt itself in order to capture the manifold shape.
In all methods detailed in the coming sections, the data-driven lattice is
formalized by a graph whose vertices are the data points and whose edges
represent neighborhood relationships.

5.3.1 Locally linear embedding

By comparison with an SOM and GTM, locally linear embedding [158, 166]
(LLE) considers topology preservation from a slightly different point of view.
Usual methods like an SOM and GTM try to preserve topology by keeping
neighboring points close to each other (neighbors in the lattice are maintained
close in the data space). In other words, for these methods, the qualitative
notion of topology is concretely translated into relative proximities: points
are close to or far from each other. LLE proposes another approach based on
conformal mappings. A conformal mapping (or conformal map or biholomor-
phic map) is a transformation that preserves local angles. To some extent, the
preservation of local angles and that of local distances are related and may
be interpreted as two different ways to preserve local scalar products.

Embedding of data set

As LLE builds a conformal mapping, the first task of LLE consists of de-
termining which angles to take into account. For this purpose, LLE se-
lects a set of neighbors for each data point y(i) in the data set Y =
[. . . , y(i), . . . , y(j), . . .]1≤i,j≤N . Like other methods already studied, LLE can
perform this task with several techniques (see Appendix E). The most often
used ones associate with each point y(i) either the K closest other points or
all points lying inside an ε-ball centered on y(i).
If the data set is sufficiently large and not too noisy, i.e., if the underlying
manifold is well sampled, then one can assume that a value for K (or ε) exists
such that the manifold is approximately linear, on the local scale of the K-ary
neighborhoods (or ε-balls). The idea of LLE is then to replace each point y(i)
with a linear combination of its neighbors. Hence, the local geometry of the
manifold can be characterized by linear coefficients that reconstruct each data
point from its neighbors. The total reconstruction error can be measured by
the simple quadratic cost function
E(W) = \sum_{i=1}^{N} \left\| y(i) - \sum_{j \in \mathcal{N}(i)} w_{i,j}\, y(j) \right\|^2 ,   (5.35)

where N (i) is the set containing all neighbors of point y(i) and wi,j , the
entries of the N -by-N matrix W, weight the neighbors in the reconstruction
of y(i). Briefly put, E(W) sums all the squared distances between a point and
its locally linear reconstruction. In order to compute the coefficients wi,j , the
cost function is minimized under two constraints:
• Points are reconstructed solely by their neighbors, i.e., the coefficients wi,j
for points outside the neighborhood of y(i) are equal to zero: w_{i,j} = 0 ∀j ∉ N (i);
• The rows of the coefficient matrix sum to one: \sum_{j=1}^{N} w_{i,j} = 1.
The constrained weights that minimize the reconstruction error obey an
important property: for any particular data point y(i), they are invariant to
rotations, rescalings, and translations of that data point and its neighbors. The
invariance to rotations and rescalings follows immediately from the particular
form of Eq. (5.35); the invariance to translations is enforced by the second
constraint on the rows of matrix W. A consequence of this symmetry is that
the reconstruction weights characterize intrinsic geometric properties of each
neighborhood, as opposed to properties that depend on a particular frame of
reference. The key idea of LLE then consists of assuming that these geometric
properties would also be valid for a low-dimensional representation of the
data.
More precisely, as stated in [158], LLE assumes that the data lie on or
near a smooth, nonlinear manifold of low intrinsic dimensionality. And then,
to a good approximation, there exists a linear mapping, consisting of a trans-
lation, rotation, and rescaling, that maps the high-dimensional coordinates
of each neighborhood to global intrinsic coordinates on the manifold. By de-
sign, the reconstruction weights wi,j reflect intrinsic geometric properties of
the data that are invariant to exactly these transformations. Therefore, it is
expected that their characterization of local geometry in the original data
space be equally valid for local patches on the manifold. In particular, the
same weights wi,j that reconstruct the data point y(i) in the D-dimensional
data space should also reconstruct its manifold coordinates in a P -dimensional
embedding space.
LLE constructs a neighborhood-preserving embedding based on the above
assumption. In the final step of LLE indeed, each high-dimensional data point
is mapped to a low-dimensional vector representing global intrinsic coordi-
nates on the manifold. This is done by choosing P -dimensional coordinates to
minimize the embedding cost function:
\Phi(\hat{X}) = \sum_{i=1}^{N} \left\| \hat{x}(i) - \sum_{j \in \mathcal{N}(i)} w_{i,j}\, \hat{x}(j) \right\|^2 .   (5.36)

This cost function, very similar to the previous one in Eq. (5.35), sums the
reconstruction errors caused by locally linear reconstruction. In this case, how-
ever, the errors are computed in the embedding space and the coefficients wi,j
are fixed. The minimization of Φ(X̂) gives the low-dimensional coordinates
X̂ = [. . . , x̂(i), . . . , x̂(j), . . .]1≤i,j≤N that best reconstruct y(i) given W.
In practice, the minimization of the two cost functions E(W ) and Φ(X̂) is
achieved in two successive steps.
First, the constrained coefficients wi,j can be computed in closed form, for
each data point separately. Considering a particular data point y(i) with K
nearest neighbors, its contribution to E(W) is
E_i(W) = \left\| y(i) - \sum_{j \in \mathcal{N}(i)} w_{i,j}\, y(j) \right\|^2 ,   (5.37)

which can be reformulated as
E_i(\omega(i)) = \left\| y(i) - \sum_{r=1}^{K} \omega_r(i)\, \nu(r) \right\|^2   (5.38)
= \left\| \sum_{r=1}^{K} \omega_r(i)\, (y(i) - \nu(r)) \right\|^2   (5.39)
= \sum_{r,s=1}^{K} \omega_r(i)\, \omega_s(i)\, g_{r,s}(i) ,   (5.40)

where ω(i) is a vector that contains the nonzero entries of the ith (sparse)
row of W and ν(r) the rth neighbor of y(i), corresponding to y(j) in the
notation of Eq. (5.37). The second equality holds thanks to the (reformulated)
constraint \sum_{r=1}^{K} \omega_r(i) = 1, and the third one uses the K-by-K local Gram
matrix G(i) whose entries are defined as

g_{r,s}(i) = (y(i) - \nu(r))^T (y(i) - \nu(s)) .   (5.41)

The matrices G(i) can be interpreted as a kind of local covariance matrix
around y(i). The reconstruction error can be minimized in closed form, using
a Lagrange multiplier to enforce the constraint \sum_{r=1}^{K} \omega_r(i) = 1. In terms of
the inverse of G(i), the optimal weights are given by
\omega_r(i) = \frac{\sum_{s=1}^{K} (G^{-1}(i))_{r,s}}{\sum_{r,s=1}^{K} (G^{-1}(i))_{r,s}} .   (5.42)
This solution requires an explicit inversion of the local covariance matrix. In
practice, a more efficient way to minimize the error is simply to solve the
linear system of equations \sum_{r=1}^{K} g_{r,s}(i)\, \omega_r(i) = 1 and then to rescale the coefficients
so that they sum to one, yielding the same result. By construction, the matrix
G(i) is symmetric and positive semidefinite. Unfortunately, it can be singu-
lar or nearly singular, for example, when there are more neighbors than the
dimensions in the data space (K > D). In this case, G can be conditioned,
before solving the system, by adding a small multiple of the identity matrix:
G \leftarrow G + \frac{\Delta^2\, \mathrm{tr}(G)}{K}\, I ,   (5.43)
where Δ² is small compared to the trace of G. This amounts to penalizing large
weights that exploit correlations beyond some level of precision in the data
sampling process. Actually, Δ is somehow a “hidden” parameter of LLE.
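The following NumPy sketch summarizes the computation of the constrained weights by solving the regularized local system of Eqs. (5.41)–(5.43) and rescaling the solution; the function name and the neighbor-list convention are assumptions made for the example.

```python
import numpy as np

def lle_weights(Y, neighbors, delta2=1e-4):
    """Reconstruction weights of LLE (Eqs. (5.41)-(5.43)); a sketch, not the reference code.

    Y         : D-by-N data matrix
    neighbors : list where neighbors[i] contains the indices of the K neighbors of y(i)
    delta2    : regularization factor Delta^2
    """
    N = Y.shape[1]
    W = np.zeros((N, N))
    for i in range(N):
        idx = np.asarray(neighbors[i])
        Z = Y[:, idx] - Y[:, [i]]                     # neighbors centered on y(i)
        G = Z.T @ Z                                   # local Gram matrix G(i), Eq. (5.41)
        G += (delta2 * np.trace(G) / len(idx)) * np.eye(len(idx))   # conditioning, Eq. (5.43)
        w = np.linalg.solve(G, np.ones(len(idx)))     # solve the local linear system ...
        W[i, idx] = w / w.sum()                       # ... and rescale so the weights sum to one
    return W
```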
The minimization of the second cost function Φ(X̂) can be done at once
by solving an eigenproblem. For this purpose, Φ(X̂) is developed as follows:
\Phi(\hat{X}) = \sum_{i=1}^{N} \left\| \hat{x}(i) - \sum_{j \in \mathcal{N}(i)} w_{i,j}\, \hat{x}(j) \right\|^2   (5.44)
= \sum_{i=1}^{N} \left\| \sum_{j \in \mathcal{N}(i)} w_{i,j}\, (\hat{x}(i) - \hat{x}(j)) \right\|^2   (5.45)
= \sum_{i,j=1}^{N} m_{i,j}\, (\hat{x}(i)^T \hat{x}(j)) ,   (5.46)

where mi,j are the entries of an N -by-N matrix M, defined as

M = (I − W)T (I − W) , (5.47)

which is sparse, symmetric, and positive semidefinite. The optimization is
then performed subject to constraints that make the problem well posed. It is
clear that the coordinates x̂(i) can be translated by a constant displacement
without affecting the cost. This degree of freedom disappears if the coordi-
nates are required to be centered on the origin (\sum_{i=1}^{N} \hat{x}(i) = 0). Moreover, in
order to avoid degenerate solutions, the latent coordinates are constrained to
have unit covariance (\hat{C}_{\hat{X}\hat{X}} = \frac{1}{N} \hat{X}\hat{X}^T = I). Such a constraint simply exploits
the invariance of the cost function to rotations and homogeneous rescalings.
The optimal embedding, up to a global rotation of the embedding space, is
found by computing the bottom P + 1 eigenvectors of the matrix M. The
last eigenvector of M, which is actually discarded by LLE, is a scaled unit
vector with all components equal; it represents a free translation mode and is
associated with a zero eigenvalue. Discarding this eigenvector enforces the con-
straint that the embeddings have a zero mean, since the components of other
eigenvectors must sum to zero by virtue of orthogonality with the last one.
The remaining P eigenvectors give the estimated P -dimensional coordinates
of the points x̂(i) in the latent space. Figure 5.7 summarizes all previous ideas
in a short procedure: A MATLAB R
function implementing LLE is available

1. For each datum y(i), compute
• the K nearest neighbors of y(i),
• the regularized matrix G(i) according to Eq. (5.41) and (5.43),
• the weights ω(i) (Eq. (5.42)).
2. Knowing the vectors ω(i), build the sparse matrices W and M
(Eq. (5.47)).
3. Compute the EVD of M; the estimated coordinates are given by the
eigenvectors associated with the second-to-(1 + P )th smallest eigenvalues.

Fig. 5.7. Algorithm for locally linear embedding.

at http://www.cs.toronto.edu/~roweis/lle/ along with related publica-
tions. The main parameters of this MATLAB®
function are the embedding
dimensionality P and the number of neighbors K. An additional parameter,
called “tolerance” and denoted Δ above, is used for regularizing matrix G(i)
computed for each datum. This parameter can play an important part in the
result of LLE. Space and time complexities largely depend on the way LLE is
implemented. As LLE involves mainly N -by-N sparse matrices and knowing
that only a few eigenvectors need to be computed, LLE can take advantage
of optimized software libraries.
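For illustration, the final embedding step can be sketched with a dense eigendecomposition as below; as discussed above, a real implementation would rather call a sparse eigensolver that targets only the bottom eigenvectors of M.

```python
import numpy as np

def lle_embed(W, P):
    """Embedding from the bottom eigenvectors of M = (I - W)^T (I - W); dense EVD for clarity."""
    N = W.shape[0]
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    # Discard the eigenvector with (near-)zero eigenvalue; keep the next P ones
    return vecs[:, 1:P + 1].T                 # P-by-N estimated coordinates
```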

Embedding of test set

In the original version of LLE [158], no interpolation procedure is provided.
However, two different ideas are proposed in [166].
The first idea consists of a local linear interpolation, following the same
intuition that motivated LLE. The point y to embed is compared with the
known data points in order to determine its K nearest neighbors. Next, re-
construction weights are computed. Finally, the embedding x̂ is built using
the reconstruction weights.
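A bare-bones sketch of this first idea is given below, reusing the regularization of Eq. (5.43); X_hat denotes the P-by-N embedding already computed by LLE, and the function name and parameter defaults are illustrative assumptions.

```python
import numpy as np

def lle_embed_new(y, Y, X_hat, K=7, delta2=1e-4):
    """Embed a test point y by local linear interpolation; a sketch of the first idea of [166]."""
    d2 = ((Y - y[:, None]) ** 2).sum(axis=0)
    idx = np.argsort(d2)[:K]                          # K nearest neighbors of y in the data set
    Z = Y[:, idx] - y[:, None]
    G = Z.T @ Z
    G += (delta2 * np.trace(G) / K) * np.eye(K)       # same conditioning as in Eq. (5.43)
    w = np.linalg.solve(G, np.ones(K))
    w /= w.sum()                                      # reconstruction weights
    return X_hat[:, idx] @ w                          # weighted combination of known embeddings
```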
The second idea relies on a parametric model, for instance a neural network
like an MLP or an RBFN, which learns the mapping between the data set Y
and its embedding X̂ computed by LLE. Similar ideas have been developed for
Sammon’s NLM. This transforms the discrete and explicit mapping between
Y and X̂ into an implicit and continuous one. In the case of LLE, a mixture
of Gaussian kernels whose parameters are optimized by an EM algorithm is
proposed.
A third possibility is proposed in [16]; it is based on kernel theory and
Nyström’s formula.

Example

Figure 5.8 shows how LLE embeds the two benchmark manifolds introduced
in Section 1.5. The dimensionality of the data sets (see Fig. 1.4) is reduced
from three to two using the following parameter values: K = 7 and Δ² = 10⁻²
for the Swiss roll and K = 4 and Δ² = 10 for the open box. Perhaps because
of the low number of data points, both parameters require careful tuning.
It is noteworthy that in the case of the open box, Δ differs greatly from the
proposed all-purpose value (Δ² = 10⁻⁴).

Fig. 5.8. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by LLE.

Once the parameters are correctly set, the embedding looks rather good:
there are no tears and the box is deformed smoothly, without superpositions.
The only problem for the open box is that at least one lateral face is completely
crushed.

Classification

Like MDS, Isomap, and KPCA, LLE is a batch (offline) method working with
simple algebraic operations. As many other spectral methods that rely on an
EVD, LLE is able to build embeddings incrementally just by appending or
removing eigenvectors. The mapping provided by LLE is explicit and discrete.
In contrast with classical metric MDS, LLE assumes that data are linear
locally, not globally. Hence, the model of LLE allows one to unfold nonlinear
manifolds, as expected. More precisely, LLE assumes that the manifold can
be mapped to a plane using a conformal mapping. Although both MDS and
LLE use an EVD, which is purely linear, the nonlinear capabilities of LLE
actually come from its first step: the computation of the nearest neighbors.
This operation can be interpreted as a kind of thresholding, which is nonlinear
by nature.
Another essential difference between these two methods lies in the se-
lection of the eigenvectors involved in the coordinates of the embedding. In
the case of MDS, the eigenvectors associated with the largest eigenvalues are
kept, whereas in LLE those with the smallest eigenvalues are taken.

Advantages and drawbacks

The greatest advantage of LLE lies in its sound theoretical foundation. The
principle of the method is elegant and simple. Like Isomap, LLE can embed
a manifold in a nonlinear way while sticking to an eigensolver. Importantly,
even if the reconstruction coefficients for each data point are computed from its
local neighborhood only, independently from other neighborhoods, the embed-
ding coordinates involve the solution of an N -by-N eigenproblem. This means
that although LLE primarily relies on local information about the manifold,
a global operation couples all data points in the connected components of the
graph underlying matrix W.
From the computational viewpoint, LLE needs to compute the EVD of an
N -by-N matrix, where N is the number of points in the data set. Hence, it
may be feared that too large a data set may rapidly become computationally
intractable. Fortunately, the matrix to be decomposed is sparse, enabling
specialized EVD procedures to keep the computational load low. Furthermore,
it is noteworthy that in contrast with most other methods using an EVD or
SVD described in this book, LLE looks for eigenvectors associated with the
smallest eigenvalues. In practice, this specificity, combined with the fact that
the matrix is large, reinforces the need for high-performance EVD procedures.
In contrast to what is claimed in [158], finding good parameters for LLE is
not so easy, as reported in [166], for instance. Actually, two parameters must
be tuned carefully: K, the number of neighbors (or ε when the neighborhoods
are determined by the ε-rule) and Δ, the regularization factor. Depending on
these two parameters, LLE can yield completely different embeddings.

Variants

Historically, LLE itself can be considered a variant or an evolution of Local
PCA (LPCA [101, 32]; see also Subsection 3.3.1). The development of LLE
was indeed partly motivated by the need to overcome the main shortcoming
of LPCA, i.e., the fact that LPCA cannot yield an embedding in a unique
coordinate system (see Subsection 2.5.8). This shortcoming prevents the use
of LPCA in many applications, although the method has many advantages:
it is simple, is fast, relies on well-known linear algebraic procedures, and can
yet perform a nonlinear dimensionality reduction. To some extent, LLE can
be seen as an LPCA without vector quantization, along with a procedure to
patch together the small local linear pieces of manifold in a unique coordinate
system. The coordination or alignment of locally linear models seems to be a
very promising approach that is developed in [159, 189, 178, 29].
Another variant of LLE is Hessian LLE [56, 55], which overcomes some
shortcomings of LLE. This variant computes locally an estimate of the mani-
fold Hessian, for each data point and its K neighbors.

5.3.2 Laplacian eigenmaps

The method called “Laplacian eigenmaps” [12, 13] (LE in short) belongs to
the now largely developed family of NLDR techniques based on spectral de-
composition. The method was intended to remedy some shortcomings of other
spectral methods like Isomap (Subsection 4.3.2) and LLE (Subsection 5.3.1).
In contrast with Isomap, LE develops a local approach to the problem of non-
linear dimensionality reduction. In that sense, LE is closely related to LLE,
although it tackles the problem in a different way: instead of reproducing
small linear patches around each datum, LE relies on graph-theoretic con-
cepts like the Laplacian operator on a graph. LE is based on the minimization
of local distances, i.e., distances between neighboring data points. In order to
avoid the trivial solution where all points are mapped to a single point (all
distances are then zero!), the minimization is constrained.

Embedding of data set

LE relies on a single and simple hypothesis: the data set

Y = [. . . , y(i), . . . , y(j), . . .]1≤i,j≤N (5.48)

contains a sufficiently large number N of points lying on (or near) a smooth
P -manifold. As only the data points are given, the manifold itself remains
unknown. However, if N is large enough, the underlying manifold can be rep-
resented with good accuracy by a graph G = (VN , E). In this representation,
a vertex vi of the graph is associated with each datum y(i), and an edge
connects vertices vi and vj if the corresponding data points are neighbors.
The neighborhood relationships can be determined using either K-ary neigh-
borhoods or ε-ball neighborhoods, as for other graph-based methods (see also
Appendix E). Neighborhood relationships can be encoded in a specific data
structure or more simply in a (sparse) adjacency matrix A. The binary entries
ai,j ∈ {0, 1} indicate whether data points y(i) and y(j) are neighbors or not.
For reasons that will be given ahead, A must be symmetric, meaning that the
graph G must be undirected.
The aim of LE is to map Y to a set of low-dimensional points

X = [. . . , x(i), . . . , x(j), . . .]1≤i,j≤N (5.49)

that keep the same neighborhood relationships. For this purpose, the following
criterion is defined:
E_{LE} = \frac{1}{2} \sum_{i,j=1}^{N} \| x(i) - x(j) \|_2^2\, w_{i,j} ,   (5.50)

where entries wi,j of the symmetric matrix W are related to those of the
adjacency matrix in the following way: wi,j = 0 if ai,j = 0; otherwise, wi,j ≥ 0.
Several choices are possible for the nonzero entries. In [12] it is recommended
to use a Gaussian bell-shaped kernel
w_{i,j} = \exp\left( - \frac{\| y(i) - y(j) \|_2^2}{2 T^2} \right) ,   (5.51)

where parameter T can be thought of as a temperature in a heat kernel in-
volved in diffusion equations. A simpler option consists of taking wi,j = 1 if
ai,j = 1. This amounts to setting T = ∞ in the heat kernel.
According to the definition of W, minimizing ELE under appropriate con-
straints is an attempt to ensure that if y(i) and y(j) are close to each other,
then x(i) and x(j) should be close as well. In other words, the topological prop-
erties (i.e., the neighborhood relationships) are preserved and the weights wi,j
act as penalties that are heavier (resp., small or null) for close (resp., faraway)
data points.
Knowing that W is symmetric, the criterion ELE can be written in matrix
form as follows:
ELE = tr(XLXT ) . (5.52)
In this equation, L is the weighted Laplacian matrix of the graph G, defined
as
L = D − W ,   (5.53)
where D is a diagonal matrix with entries d_{i,i} = \sum_{j=1}^{N} w_{i,j} . To prove the
equality, it suffices to notice that for a P -dimensional embedding:
E_{LE} = \frac{1}{2} \sum_{i,j=1}^{N} \| x(i) - x(j) \|_2^2\, w_{i,j}   (5.54)
= \frac{1}{2} \sum_{p=1}^{P} \sum_{i,j=1}^{N} (x_p(i) - x_p(j))^2\, w_{i,j}   (5.55)
= \frac{1}{2} \sum_{p=1}^{P} \sum_{i,j=1}^{N} (x_p^2(i) + x_p^2(j) - 2 x_p(i) x_p(j))\, w_{i,j}   (5.56)
= \frac{1}{2} \sum_{p=1}^{P} \left( \sum_{i=1}^{N} x_p^2(i)\, d_{i,i} + \sum_{j=1}^{N} x_p^2(j)\, d_{j,j} - 2 \sum_{i,j=1}^{N} x_p(i) x_p(j)\, w_{i,j} \right)   (5.57)
= \frac{1}{2} \sum_{p=1}^{P} \left( 2 f_p^T(y) D f_p(y) - 2 f_p^T(y) W f_p(y) \right)   (5.58)
= \sum_{p=1}^{P} f_p^T(y) L f_p(y) = \mathrm{tr}(X L X^T) ,   (5.59)

where f_p(y) is an N -dimensional vector giving the pth coordinate of each
embedded point, i.e., the transpose of the pth row of X. By the
way, it is noteworthy that the above calculation also shows that L is positive
semidefinite.
Minimizing ELE with respect to X under the constraint XDXT = IP ×P
reduces to solving the generalized eigenvalue problem λDf = Lf and looking
for the P eigenvectors of L associated with the smallest eigenvalues. As L is
symmetric and positive semidefinite, all eigenvalues are real and not smaller
than zero. This can be seen by solving the problem incrementally, i.e., by
computing first a one-dimensional embedding, then a two-dimensional one,
and so on. At this point, it must be noticed that λDf = Lf possesses a trivial
solution. Indeed, for f = 1N where 1N = [1, . . . , 1]T , it comes out that W1N =
D1N and thus that L1N = 0N . Hence λN = 0 is the smallest eigenvalue of L
and fN (y) = 1N .
An equivalent approach [16] to obtain the low-dimensional embedding (up
to a componentwise scaling) consists of normalizing the Laplacian matrix:
\mathcal{L} = D^{-1/2} L D^{-1/2} = \left[ \frac{l_{i,j}}{\sqrt{d_{i,i}\, d_{j,j}}} \right]_{1 \leq i,j \leq N} ,   (5.60)

and finding directly its eigenvectors:

\mathcal{L} = U \Lambda U^T .   (5.61)

The eigenvectors associated with the P smallest nonzero eigenvalues (the zero
eigenvalue and its trivial eigenvector are discarded) form a P -dimensional embedding of the data set. The
eigenvalues are the same as for the generalized eigenvalue problem, and the
following relationship holds for the eigenvectors: u_i = D^{1/2} f_i .
The Laplacian matrix as computed above is an operator on the neigh-
borhood graph, which is a discrete representation of the underlying manifold.
Actually, the Laplacian matrix stems from a similar operator on smooth mani-
folds, the Laplace-Beltrami operator. Cast in this framework, the eigenvectors
ui are discrete approximations of the eigenfunctions of the Laplace-Beltrami
operator applied on the manifold. More details can be found in [13, 17].
Laplacian eigenmaps can be implemented with the procedure shown in
Fig. 5.9. A software package in the MATLAB® language is available at http:

1. If data consist of pairwise distances, then skip step 2 and go directly to
step 3.
2. If data consist of vectors, then compute all pairwise distances.
3. Determine either K-ary neighborhoods or ε-ball neighborhoods.
4. Build the corresponding graph and its adjacency matrix A.
5. Apply the heat kernel (or another one) to adjacent data points, and build
matrix W as in Eq. (5.51).
6. Sum all columns of W in order to build the diagonal matrix D, which
consists of the rowwise sums of W.
7. Compute L, the Laplacian of matrix W: L = D − W.
8. Normalize the Laplacian matrix: \mathcal{L} = D^{-1/2} L D^{-1/2}.
9. Compute the EVD of the normalized Laplacian: \mathcal{L} = U Λ U^T.
10. A low-dimensional embedding is finally obtained by multiplying the eigen-
vectors by D^{-1/2}, transposing them, and keeping those associated with the P
smallest eigenvalues, discarding the trivial eigenvector associated with the zero eigenvalue.

Fig. 5.9. Algorithm of Laplacian eigenmaps.

//people.cs.uchicago.edu/~misha/ManifoldLearning/. The parameters
of Laplacian eigenmaps are the embedding dimensionality, the neighborhood
type, and the corresponding parameter (K-ary neighborhoods or ε-balls). De-
pending on the kernel that yields matrix W, additional parameters may need
to be tuned, such as the temperature T in the heat kernel. Laplacian eigen-
maps involve an EVD of an N -by-N nonsparse matrix. However, only a few
eigenvectors need to be calculated; this reduces the time complexity when
efficient linear algebra libraries are available.
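The sketch below follows the steps of Fig. 5.9 with K-ary neighborhoods and the heat kernel, using a dense EVD and the relation u_i = D^{1/2} f_i to recover the embedding; its interface and the choice of a dense solver are assumptions made for readability, not features of the package mentioned above, and it assumes the graph is connected.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def laplacian_eigenmaps(Y, P, K=7, T=np.inf):
    """Minimal Laplacian eigenmaps sketch (Fig. 5.9); Y is D-by-N, returns a P-by-N embedding."""
    N = Y.shape[1]
    dist = squareform(pdist(Y.T))                     # pairwise Euclidean distances
    A = np.zeros((N, N), dtype=bool)                  # symmetric K-ary adjacency matrix
    for i in range(N):
        A[i, np.argsort(dist[i])[1:K + 1]] = True
    A = A | A.T
    W = np.where(A, np.exp(-dist ** 2 / (2.0 * T ** 2)), 0.0)   # heat kernel, Eq. (5.51); T=inf -> binary
    d = W.sum(axis=1)
    L = np.diag(d) - W                                # graph Laplacian
    Ln = L / np.sqrt(np.outer(d, d))                  # normalized Laplacian, Eq. (5.60)
    vals, U = np.linalg.eigh(Ln)                      # ascending eigenvalues
    F = U / np.sqrt(d)[:, None]                       # generalized eigenvectors f = D^{-1/2} u
    return F[:, 1:P + 1].T                            # drop the trivial eigenvector
```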

Embedding of test set

An extension of LE for new points is very briefly introduced in [16], based on
the Nyström formula.
Example

Figure 5.10 shows how LE embeds the two benchmark manifolds introduced in
Section 1.5. The dimensionality of the data sets (see Fig. 1.4) is reduced from
three to two using the following parameter values: K = 7 for the Swiss roll and
K = 8 for the open box. These values lead to graphs with more edges than the
lattices shown in the figure. Moreover, the parameter that controls the graph
building (K or ε) requires careful tuning to obtain satisfying embeddings.
Matrix W is computed with the degenerate heat kernel (T = ∞). As can be

Fig. 5.10. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by LE.

seen, the Swiss roll is only partially unfolded; moreover, the third dimension
of the spiral is crushed. Results largely depend on parameter K: changing it
can yield embeddings of a completely different shape. In the case of the open
box, the result is more satisfying although at least one face of the open box
is crushed.

Classification

LE is a batch method that processes data offline. No vector quantization is
integrated in it.
Like metric MDS, Isomap and LLE, LE is a spectral method. This enables
LE to build embeddings in an incremental way, by adding or discarding eigen-
vectors. In contrast with metric MDS and Isomap, LE embeds data using the
eigenvectors associated with the smallest eigenvalues, like LLE. Unfortunately,
it is quite difficult to estimate the data dimensionality from those eigenvalues.
The mapping provided by LE is discrete. A continuous generalization can
be obtained using the Nyström formula. The data model is discrete as well:
it involves the Laplacian matrix of a graph. Replacing it with the Laplace-
Beltrami operator enables LE to work on smooth manifolds.
Advantages and drawbacks

LE is almost parameter-free when the heat kernel is used with T = +∞: only
K or ε remains; this is an advantage over LLE in this respect. Nevertheless,
the choice of these two last parameters may have a dramatic influence on the
results of LE. Moreover, it is also possible to change the kernel function.
Like KPCA, LE usually yields poor embeddings. Connections between LE
and spectral clustering (see ahead) indicate that the method performs better
for data clustering than for dimensionality reduction.
Another explanation of the poor performance in dimensionality reduction
can be found in its objective function. Minimizing distances between neigh-
boring points seems to be an appealing idea at first sight. (Note, by the way,
that semi-definite embedding follows exactly the opposite approach: distances
between nonneighboring points are maximized; see Subsection 4.4.2 for more
details.) But as shown above, this can lead to degenerate solutions, such as
an embedding having identical coordinates for all data points. While this trivial
solution can easily be found by looking at the equations, it is likely that other
configurations minimize the objective function of LE but do not provide a
suitable embedding. Intuitively, it is not difficult to imagine what such config-
urations should look like. For instance, assume that data points are regularly
distributed on a plane, just as in an SOM grid, and that parameter K is high
enough so that three aligned points on the grid are all direct neighbors of each
other. Applying LE to that data set can lead to a curved embedding. Indeed,
curving the manifold allows LE to minimize the distance between the first
and third points of the alignment. This phenomenon can easily be verified ex-
perimentally and seriously questions the applicability of LE to dimensionality
reduction.
From the computational viewpoint, LE works in a similar way as LLE, by
computing a Gram-like matrix and extracting eigenvectors associated with the
smallest eigenvalues. Therefore, LE requires robust EVD procedures. As with
LLE, specific procedures can exploit the intrinsic sparsity of the Laplacian
matrix.

Variants

In [13], a connection between LE and LLE is established. LE can thus be seen
as a variant of LLE under some conditions. If the Laplacian matrix involved in
LE approximates the Laplace-Beltrami operator on manifolds, then the result
of applying twice the operator can be related to the Gram matrix built in
LLE.
Diffusion maps [35, 36, 141] involve a heat kernel as in LE and follow the
same approach (spectral decomposition and focus on the bottom eigenvec-
tors). Approaches developed in [164, 206] are also closely related and connect
the pseudo-inverse of the normalized graph Laplacian with so-called commute-
time distances in a Markov random field. Whereas bottom eigenvectors of the
normalized Laplacian L are involved in LE, top eigenvectors of the pseudo-
inverse of L are used in [164, 206]. These papers also refer to works about
electrical networks (composed of resistances only): the underlying theory is
identical, except that the kernel function is inversely proportional to the dis-
tance (w_{i,j} = ‖y(i) − y(j)‖^{-1}). This connection between Laplacian-based em-
beddings and commute-time distances or network resistances is important and
means that the local approach developed in LE (K-ary neighborhoods) can
be considered from a global point of view. This allows us to relate a topology-
based spectral method (LE) to distance-based spectral methods (KPCA and
Isomap).
Similarly, LE is also closely related to spectral clustering [143], as indicated
in [13, 164, 206, 141], and to graph partitioning [172, 199, 173]. Intuitively,
graph partitioning aims at dividing a graph into two parts while minimizing
the “damage” (the normalized sum of edge weights). Within this framework,
the eigenvector associated with the smallest nonzero eigenvalue of the normal-
ized Laplacian matrix is a “continuous” indicator of a binary graph partition
(flags are only required to be in R instead of being strictly binary, i.e., in
{−1, +1}).
Finally, it is noteworthy that locality preserving projections (LPP [81],
a.k.a. Laplacianfaces [82]) is a linear variant of Laplacian eigenmaps. This
method works in the same way as PCA, by building a P -by-D transformation
matrix that can be applied on any vector in RD . This method keeps many
advantages of PCA although the objective function is different: while PCA
tries to preserve the global structure of the data set, LPP attempts to preserve
the local structure. This method involves K-ary neighborhoods like LE.
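Under the notations used in this section, a rough sketch of LPP could reuse the graph weight matrix W of LE and solve a generalized eigenproblem on the projected Laplacian, as below; the function name is hypothetical and the sketch assumes that Y D Y^T is well conditioned.

```python
import numpy as np
from scipy.linalg import eigh

def lpp(Y, W, P):
    """Locality preserving projections [81]: linear sketch reusing the graph weights W of LE.

    Y : D-by-N data, W : N-by-N symmetric weight matrix; returns a P-by-D projection matrix.
    """
    D = np.diag(W.sum(axis=1))
    L = D - W                                          # graph Laplacian of the neighborhood graph
    # Generalized EVD: (Y L Y^T) a = lambda (Y D Y^T) a, eigenvalues in ascending order
    vals, vecs = eigh(Y @ L @ Y.T, Y @ D @ Y.T)
    A = vecs[:, :P].T                                  # directions with the smallest eigenvalues
    return A                                           # embedding of any point y of R^D is A @ y
```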

5.3.3 Isotop

The aim of Isotop [114] consists of overcoming some of the limitations of
the SOMs when they are used for nonlinear dimensionality reduction. In this
case, indeed, the vector quantization achieved by an SOM may be useless
or even undesired. Moreover, an SOM usually imposes a rectangular latent
space, which seldom suits the shape of the manifold to be embedded. Isotop
addresses these two issues by separating the vector quantization (that becomes
optional) and the dimensionality reduction. These two steps are indeed tightly
interlaced in an SOM and Isotop clearly separates them in order to optimize
them independently.

Embedding of data set

Isotop reduces the dimensionality of a data set by breaking down the problem
into three successive steps:
1. vector quantization (optional),
2. graph building,
166 5 Topology Preservation

3. low-dimensional embedding.
These three steps are further detailed ahead. Isotop relies on a single and
simple hypothesis: the data set
Y = [. . . , y(i), . . . , y(j), . . .]1≤i,j≤N (5.62)
contains a sufficiently large number N of points lying on (or near) a smooth
P -manifold.
If the data set contains too many points, the first step of Isotop consists of
performing a vector quantization in order to reduce the number of points (see
Appendix D). This optional step can easily be achieved by various algorithms,
like Lloyd’s, or a competitive learning procedure. In contrast with an SOM,
no neighborhood relationships between the prototypes are taken into account
at this stage in Isotop. For the sake of simplicity, it is assumed that the first
step is skipped, meaning the subsequent steps work directly with the raw data
set Y.
Second, Isotop connects neighboring data points (or prototypes), by using
the graph-building rules proposed in Appendix E, for example. Typically,
each point y(i) of the data set is associated with a graph vertex vi and then
connected with its K closest neighbors or with all other points lying inside an
ε-ball. The obtained graph G = (VN , E) is intended to capture the topology
of the manifold underlying the data points, in the same way as it is done
in other graph-based methods like Isomap, GNLM, CDA, LLE, LE, etc. In
contrast with an SOM, G may be completely different from a rectangular
lattice, since it is not predefined by the user. Instead, G is “data-driven”,
i.e., completely determined by the available data. Moreover, until this point,
no low-dimensional representation is associated with the graph, whereas it
is precisely such a representation that predetermines the lattice in an SOM.
Eventually, the second step of Isotop ends by computing the graph distances
δ(vi , vj ) for all pairs of vertices in the graph (see Subsection 4.3.1). These
distances will be used in the third step; in order to compute them, each edge
(vi , vj ) of the graph is given a weight, which is equal to the Euclidean distance
y(i) − y(j) separating the corresponding points in the data space. The
graph distances approximate the geodesic distances in the manifold, i.e., the
distances between points along the manifold.
The third step of Isotop is the core of the method. While the first and sec-
ond steps aim at converting the data, given as D-dimensional coordinates, into
graph G, the third step achieves the inverse transformation. More precisely,
the goal consists of translating graph G into P -dimensional coordinates. For
this purpose, the D-dimensional coordinates y(i) associated with the vertices
of the graph are replaced with P -dimensional coordinates x(i), which are ini-
tialized to zero. At this stage, the low-dimensional representation X of Y is
built but, obviously, X does not truly preserve the topology of Y yet: it must
be modified or updated in some way. For this purpose, Y may be forgotten
henceforth: Isotop will use only the information conveyed by G in order to
update X.
In order to unfold the low-dimensional representation X associated with
G, a Gaussian kernel N (x(i), I) of unit variance is centered on each point
x(i). The normalized sum of the N Gaussian kernels gives a distribution that
itself is a Gaussian kernel just after the initialization, since x(i) = 0 for all i
in {1, . . . , N }. The main idea of Isotop then consists of performing a vector
quantization of that distribution, in which X plays the role of the codebook
and is updated using a similar learning rule as in an SOM. More precisely,
the three following operations are carried out:
1. Randomly draw a point r from the distribution.
2. Determine the index j of the nearest point from r in the codebook, i.e.,
j = \arg\min_i d(r, x(i)) ,   (5.63)

where d is typically the Euclidean distance.
3. Update all points x(i) according to the rule

x(i) ← x(i) + ανλ (i, j)(r − x(i)) , (5.64)

where the learning rate α, which satisfies 0 ≤ α ≤ 1, plays the same role
as the step size in a Robbins–Monro procedure.
In the update rule (5.64), νλ (i, j) is called the neighborhood function and is
defined as follows:
\nu_\lambda(i, j) = \exp\left( - \frac{1}{2}\, \frac{\delta_y^2(i, j)}{\lambda^2\, \mu^2_{(v_h, v_j) \in E}(\delta_y(h, j))} \right) ,   (5.65)

where the first factor λ of the denominator is a neighborhood width, acting


and scheduled exactly as in an SOM. The second factor of the denominator is
simply the square of the mean graph distance between the vertices vh and vj
(or equivalently, between the points y(h) and y(j)) if the edge (vh , vj ) belongs
to E. In other words, this factor is the square distance between vj and its
direct neighbors in the graph. Because the neighbors are direct and the edges
are labeled with the Euclidean distance:
\mu_{(v_h, v_j) \in E}(\delta_y(h, j)) = \mu_{(v_h, v_j) \in E}(d_y(h, j)) = \mu_{(v_h, v_j) \in E}(\| y(h) - y(j) \|) .   (5.66)
The second factor of the denominator aims at normalizing the numerator
δy2 (i, j), in order to roughly approximate the relative distances from y(j) to
y(i) inside the manifold. Without this factor, Isotop would depend too much
on the local density of the points on the manifold: smaller (resp., larger)
distances are measured in denser (resp., sparser) regions.
Typically, the three steps above are repeated N times with the same values
for the parameters α and λ; such a cycle may be called an epoch, as for other
adaptive algorithms following the scheme of a Robbins–Monro procedure [156].
Moreover, instead of drawing r from the normalized sum of Gaussian kernels
N times, it is computationally easier and statistically more interesting to draw
it successively from each of the N kernels, in random order. Doing so allows us
to generate r by adding white Gaussian noise successively to each point x(i).
This also allows us to visit more or less equally each region of the distribution
during a single epoch.
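For illustration, one epoch of the three operations above (Eqs. (5.63)–(5.65)) can be sketched as follows; the array conventions, the precomputed vector of mean neighbor distances, and the function name are assumptions made for the example.

```python
import numpy as np

def isotop_epoch(X, delta, mean_d, alpha, lam, rng):
    """One Isotop epoch (the three operations above, Eqs. (5.63)-(5.65)); an illustrative sketch.

    X      : P-by-N low-dimensional coordinates, updated in place and returned
    delta  : N-by-N matrix of pairwise graph distances delta_y(i, j)
    mean_d : length-N vector, mean graph distance from each vertex to its direct neighbors
    alpha  : learning rate, lam : neighborhood width, rng : NumPy random Generator
    """
    P, N = X.shape
    for i in rng.permutation(N):
        r = X[:, i] + rng.standard_normal(P)                      # draw from the kernel N(x(i), I)
        j = int(np.argmin(((X - r[:, None]) ** 2).sum(axis=0)))   # winner, Eq. (5.63)
        nu = np.exp(-0.5 * delta[:, j] ** 2 / (lam ** 2 * mean_d[j] ** 2))   # Eq. (5.65)
        X += alpha * nu[None, :] * (r[:, None] - X)               # update rule, Eq. (5.64)
    return X
```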
Intuitively, the above learning rule unfolds the connected structure in a
low-dimensional space, trying to preserve the neighborhoods. As a side effect,
the mixture of Gaussian distributions evolves concurrently in order to capture
the shape of the manifold.
Figure 5.11 shows a procedure that implements Isotop. A C++ implemen-

1. Perform a vector quantization of the data set; this step is optional and
can be done with any standard quantization method.
2. Build a graph structure with an appropriate rule (see Appendix E), and
compute all pairwise graph distances.
3. Initialize low-dimensional coordinates x(i) of all vertices vi to zero.
4. Initialize the learning rate α and the neighborhood width λ with their
scheduled values for epoch number q.
5. For each vertex vi in the graph,
• Generate a point r randomly drawn from a Gaussian distribution cen-
tered on the associated coordinates x(i).
• Compute the closest vertex from r according to Eq. (5.63).
• Update the coordinates of all vertices according to rule (5.64).
6. Increase q and return to step 4 if convergence is not reached.

Fig. 5.11. Isotop algorithm.

tation of Isotop can also be downloaded from http://www.ucl.ac.be/mlg/.


Isotop inherits parameters from the vector quantization performed in its first
step. Other parameters are related to the graph-building step (K for K-ary
neighborhood or ε for ε-balls). Like an SOM, Isotop also involves parame-
ters in its learning phase (number of iterations, learning rate, neighborhood
width, and neighborhood function). Regarding space and time complexities,
Isotop requires computing pairwise geodesic distances for all prototypes; this
demands O(M 2 ) memory entries and O(M 2 log M ) operations, where M ≤ N
is the number of prototypes. Once geodesic distances are known, the learning
phase of Isotop is relatively fast: the time complexity of each epoch is O(M 2 ).
Finally, some intuitive arguments that explain why the procedure detailed in
Fig. 5.11 works are gathered hereafter.
The vertices do not collapse on each other.


As the update rule acts as an attractive force between the prototypes, it may
be feared that the third step of Isotop will yield a trivial solution: all low-
dimensional coordinates converge on the same point. Fortunately, however,
this does not happen, because Isotop involves Gaussian distributions centered
on each vertex in the low-dimensional embedding space. The update rule takes
into account points drawn from the distributions in order to move the ver-
tices. This means that near the “boundary” of the graph some random points
can move vertices outward. Actually, the entire “boundary” of the embed-
ded graph contributes to stretching it; this effect balances the attractive force
induced by the update rule. If Isotop’s third step is compared to an SOM run-
ning in a low-dimensional space, then to some extent Isotop performs a vector
quantization on a (dynamically varying) mixture of Gaussian distributions. In
the worst case, all Gaussian kernels are superposed; but still in this situation,
the prototypes will disperse and try to reproduce the mixture of Gaussian
kernels (only an infinite neighborhood width impeaches this process). Since
the dispersion of the prototypes is ensured, they may all be initialized to zero.
The vertices do not infinitely expand.
On the other hand, the vertices do not diverge toward infinitely faraway posi-
tions. This could only happen if the Gaussian kernels centered on the proto-
types were infinitely large. The probability to draw a point from a Gaussian
distribution lying very far away from its center is very low. Eventually, as long
as the neighborhood width is kept larger than zero, the update rule generates
an attractive force between the vertices that limits their expansion.

Embedding of test set


Isotop does not provide a generalization procedure. Actually, the generaliza-
tion to new points is difficult, as it is for an SOM, because Isotop can heavily
deform the graph structure, even on a local scale. Fortunately, the generaliza-
tion problem appears less critical for Isotop than for an SOM. Because vector
quantization is a mandatory step in the latter, even the embedding of the
data points is not known.

Example
Figure 5.12 shows how Isotop embeds the two benchmark manifolds intro-
duced in Section 1.5. The graph is built using ε-balls, in order to obtain the
graphs shown in the figure. No superpositions occur in the obtained embed-
dings. In the case of the open box, the bottom face is shrunk, allowing the
lateral faces to be embedded without too many distortions; similarly, the box
lid is stretched but remains square. The two examples clearly illustrate the
ability of Isotop to either preserve the manifold shape (in the case of the Swiss
roll) or deform some regions of the manifold if necessary (for the box).
Fig. 5.12. Two-dimensional embeddings of the “Swiss roll” and “open box” data
sets (Fig. 1.4), found by Isotop.

Classification

Isotop shares many characteristics with an SOM. Both methods rely on a non-
linear model and use vector quantization (optional in the case of Isotop). They
also both belong to the world of artificial neural networks and use approxi-
mate optimization techniques. Because Isotop is divided into three successive
steps, it cannot easily be implemented as an online algorithm, unlike an SOM.
The mapping Isotop produces between the high- and low-dimensional
spaces is discrete and explicit. Hence, the generalization to new points is not
easy.

Advantages and drawbacks

The comparison between Isotop and an SOM is unavoidable since both meth-
ods are closely related. Actually, Isotop can be interpreted as an SOM working
in “reverse gear”. Indeed, the following procedure describes how an SOM works:
1. Determine the shape of the low-dimensional embedding (usually a two-
dimensional grid of regularly spaced points).
2. Build a lattice between the grid points (this second step is often implicitly
merged with the first one).
3. Perform a vector quantization in the high-dimensional space in order to
place and deform the lattice in the data cloud; this step defines the map-
ping between the high- and low-dimensional spaces.
As can be seen, data are involved only (and lately) in the third step. In other
words, an SOM weirdly starts by defining the shape of the low-dimensional
embedding, without taking the data into account! In the framework of dimen-
sionality reduction, this approach goes in the opposite direction compared to
the other usual methods. Isotop puts things back in their natural order and
the three above steps occur as follows:
5.3 Data-driven lattice 171

1. Perform a vector quantization in the high-dimensional space (no lattice or


neighborhoods intervene in this first step, which is performed by standard
quantization methods); this step is optional.
2. Build a lattice (or graph) between the prototypes obtained after the quan-
tization according to their coordinates in the high-dimensional space.
3. Determine the shape of the low-dimensional embedding; this step defines
the mapping between the high- and low-dimensional spaces.
As in an SOM, the mapping is determined in the third step. However, data
intervene earlier, already in the first step. Moreover, it is noteworthy that
Isotop uses the SOM update rule in the low-dimensional space. Besides the
processing order, this is a second and very important difference between Isotop
and an SOM. The advantage of working in the low-dimensional space, as do
most other DR methods, is clearly demonstrated in Section 6.1.
Another advantage of Isotop over the SOM is that the quantization is
optional and not interlaced with the computation of the embedding. This
clear separation between both tasks allows a better control of each of them
(parameters, convergence, etc.).
In the current development state of Isotop, it is not known whether the
method optimizes a well-defined objective function. The same problem has
been remarked in [57] for the SOMs. However, it is hoped that an objective
function could be found, because the way Isotop works allows the introduc-
tion of many simplifications. For example, the quantization of the Gaussian
mixtures does not matter in the third step of Isotop, since the kernels follow
the embedded points. Hence, the definition of an objective function may focus
on topology preservation only. In addition, the third step of Isotop takes place
in a space whose dimensionality is a priori equal to the dimensionality of the
manifold to be embedded.

Variants

To some extent, Isotop can be related to spring-based layouts and other graph-
embedding techniques. See, for example, [52].
Some “historical” aspects regarding the development of Isotop can be
found in [119]. In earlier versions, only one Gaussian kernel was used in order
to unfold the manifold and counterbalance the attractive force induced by
Eq. (5.64).
Like GTM, which was proposed as a principled variant of Kohonen’s SOM,
stochastic neighbor embedding (SNE) [87] can be seen as a principled version
of Isotop. SNE follows a probabilistic approach to the task of embedding and,
like Isotop, associates a Gaussian kernel with each point to be embedded.
The set of all these kernels allows SNE to model the probability of one point
to be the neighbor of the others. This probability distribution can be mea-
sured for each point in the high-dimensional data space and, given a (random)
embedding, corresponding probability distributions can also be computed in
172 5 Topology Preservation

the low-dimensional space. The goal of SNE is then to update the embed-
ding in order to match the distributions in both high- and low-dimensional
spaces. In practice, the objective function involves a sum of Kullback-Leibler
divergences, which measure the “distances” between pairs of corresponding
distributions in their respective space. Because the minimization of the ob-
jective function is difficult and can get stuck in local minima, SNE requires
complex optimization techniques to achieve good results. A version of SNE
using graph distances instead of Euclidean ones in the data space is mentioned
in [186, 187] and compared to other NLDR methods.
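To make this concrete, here is a minimal NumPy sketch of the SNE objective (not the authors' implementation): it assumes a single, shared kernel width sigma, whereas SNE as described in [87] tunes one width per point via a perplexity parameter and minimizes the cost by gradient descent.

```python
import numpy as np

def sne_kl_cost(D2_high, D2_low, sigma=1.0):
    """Kullback-Leibler cost of stochastic neighbor embedding (a sketch).

    D2_high, D2_low: (N, N) arrays of squared pairwise distances in the data
    space and in the current embedding. Row i of P (resp. Q) models the
    probability that point i picks each other point as its neighbor, using a
    Gaussian kernel centered on point i. The cost sums KL(P_i || Q_i) over i.
    """
    def neighbor_probs(D2):
        P = np.exp(-D2 / (2.0 * sigma ** 2))
        np.fill_diagonal(P, 0.0)                 # a point is not its own neighbor
        return P / P.sum(axis=1, keepdims=True)

    P, Q = neighbor_probs(D2_high), neighbor_probs(D2_low)
    eps = 1e-12                                  # avoid log(0) on the diagonal
    return np.sum(P * np.log((P + eps) / (Q + eps)))
```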
6 Method comparisons

Overview. This chapter illustrates all NLDR methods that are described
in the previous two chapters with both toy examples and real
data. It also aims to compare the results of the different methods in
order to shed some light on their respective strengths and weaknesses.

6.1 Toy examples


Most examples in the next subsections are two-manifolds embedded in a three-
dimensional space. The most popular of them is undoubtedly the Swiss roll.
Those simple manifolds help to clearly visualize how the different methods
behave in order to reduce the dimensionality.

6.1.1 The Swiss roll

The standard Swiss roll

The Swiss roll is an illustrative manifold that has been used in [179, 180]
as an example to demonstrate the capabilities of Isomap. Briefly put, it is a
spiral with a third dimension, as shown in Fig. 6.1. The name of the manifold
originates from a delicious kind of cake made in Switzerland: jam is spread on
a one-centimeter-thick layer of airy pastry, which is then rolled up on itself.
With some imagination, the manifold in Fig. 6.1 can then be interpreted as a
very thick slice of Swiss roll, where only the jam is visible.
The parametric equations that generate the Swiss roll are
\[
  \mathbf{y} =
  \begin{bmatrix}
    \sqrt{2 + 2x_1}\,\cos\bigl(2\pi\sqrt{2 + 2x_1}\bigr) \\
    \sqrt{2 + 2x_1}\,\sin\bigl(2\pi\sqrt{2 + 2x_1}\bigr) \\
    2x_2
  \end{bmatrix} ,
  \qquad (6.1)
\]

where x = [x1, x2]^T is uniformly distributed in the interval [−1, +1]^2. The
purpose of the square root in Eq. (6.1) is to obtain a uniform distribution on
Fig. 6.1. The “Swiss roll” manifold.

the manifold as well. As can be easily seen, each coordinate of the manifold
depends on a single latent variable. Hence, the Swiss roll is a developable
two-manifold embedded in R3 .
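For illustration, such a data set is easy to generate with NumPy; the sketch below simply implements Eq. (6.1) (the function name and the random seed are ours).

```python
import numpy as np

def swiss_roll(n_points=5000, seed=0):
    """Sample points on the Swiss roll of Eq. (6.1).

    The latent variables x1, x2 are drawn uniformly in [-1, +1]^2; thanks to
    the square root, the resulting distribution is uniform on the manifold.
    Returns the latent points x and the three-dimensional points y.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(n_points, 2))
    r = np.sqrt(2.0 + 2.0 * x[:, 0])             # "radius" of the spiral
    y = np.column_stack((r * np.cos(2.0 * np.pi * r),
                         r * np.sin(2.0 * np.pi * r),
                         2.0 * x[:, 1]))
    return x, y
```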
The Swiss roll is the ideal manifold to demonstrate the benefits of using
graph distances. Because it is developable, all methods using graph distances
can easily unfold it and reduce its dimensionality to two. On the contrary,
methods working with Euclidean distances embed it with difficulty, because
it is heavily crumpled on itself. For example, it has been impossible to get con-
vincing results with NLM or CCA. Figure 6.2 shows the best CCA embedding
obtained using Euclidean distances; obtaining it required careful tuning of
the method parameters. In light of this result, Euclidean distance-preserving
methods (MDS, NLM, CCA) are discarded in the experiments ahead.
Concretely, 5000 points or observations of y are made available in a data
set. These points are generated according to Eq. 6.1. The corresponding points
in the latent space are drawn randomly; they are not placed on a regular grid,
as was the case for the Swiss roll described in Section 1.5 to illustrate the
methods described in Chapters 4 and 5. As the number of points is relatively
high, a subset of fewer than 1000 points is already representative of the man-
ifold and allows the computation time to be dramatically decreased. As all
methods work with N^2 distances or an N-by-N Gram-like matrix, they work
25 times faster with 1000 points than with 5000. In [180] the authors suggest
choosing the subset of points randomly among the available ones. Figure 6.3
shows the results of Isomap, GNLM, and CDA with a random subset of 800
points. The graph distances are computed in the same way for the three methods,
i.e., with the K-rule and K = 5. Other parameters are left to their default
“all-purpose” values.
Fig. 6.2. Two-dimensional embedding of the “Swiss roll” manifold by CCA, using
1800 points.

As expected, all three methods succeed in unfolding the Swiss roll. More
importantly, however, the random subset used for the embeddings is not really
representative of the initial manifold. According to Eq. (6.1), the distribution
of data points is uniform on the manifold. At first sight, points seem indeed
to be more or less equally distributed in all regions of the embeddings. Nev-
ertheless, a careful inspection reveals the presence of holes and bubbles in the
distribution (Brand speaks of manifold “foaming” [30]). It looks like jam in
the Swiss roll has been replaced with a slice of ... Swiss cheese [117, 120]!
This phenomenon is partly due to the fact that the effective data set is re-
sampled, i.e., drawn from a larger but finite-size set of points. As can be seen,
Isomap amplifies the Swiss-cheese effect: holes are larger than for the two
other methods. The embeddings found by GNLM and CDA look better.
Instead of performing a random subset selection, vector quantization (see
Appendix D) could be applied to the Swiss roll data set. If the 800 randomly
chosen points are replaced with only 600 prototypes, obtained with a simple
competitive learning procedure, results shown in Fig. 6.4 can be obtained.
As can be seen, the embeddings are much more visually pleasing. Beyond
the visual impression, the prototypes also represent the initial manifold better
Fig. 6.3. Two-dimensional embedding of the “Swiss roll” manifold by Isomap,


GNLM, and CDA. A subset of 800 points is randomly chosen among the 5000
available points.

than randomly chosen points. This results, for example, in embeddings with
neater corners and almost perfectly rectangular shapes, thus reproducing more
faithfully the latent space. Such results argue strongly in favor of using
vector quantization when the size of the data set can be or has to be reduced.
Accordingly, vector quantization is always used in the subsequent experiments
unless otherwise specified.
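As an indication of how such prototypes can be obtained, the following sketch implements plain winner-take-all competitive learning in NumPy; the learning-rate schedule and the number of epochs are illustrative assumptions, not the exact settings used for the figures.

```python
import numpy as np

def competitive_learning_vq(data, n_prototypes=600, n_epochs=20,
                            lr_start=0.5, lr_end=0.01, seed=0):
    """Simple competitive learning: for each datum, move the closest
    prototype a little toward it; the learning rate decays geometrically."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    prototypes = data[rng.choice(len(data), n_prototypes, replace=False)].copy()
    for epoch in range(n_epochs):
        lr = lr_start * (lr_end / lr_start) ** (epoch / max(1, n_epochs - 1))
        for y in data[rng.permutation(len(data))]:
            winner = np.argmin(np.sum((prototypes - y) ** 2, axis=1))
            prototypes[winner] += lr * (y - prototypes[winner])
    return prototypes
```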
Figure 6.5 shows the embeddings obtained by topology-preserving methods
(an SOM, GTM, LLE, LE, and Isotop) with the same set of prototypes, except
for the SOM, since it is a vector quantization method by nature. By doing so,
it is hoped that the comparison between all methods is as fair as possible. The
SOM (15-by-40 map) and GTM (40-by-40 latent grid, 10× 10 kernels, 200 EM
iterations) suffer from jumps and shortcuts joining the successive whorls. LLE
(K = 5, Δ = 0.02) unfolds the Swiss roll and perfectly preserves the topology.
As often happens with LLE, the embedding has a triangular or cuneiform
shape (see [30] for a technical explanation). Finally, Isotop yields a visually
pleasant result: the topology is perfectly preserved and even the rectangular
shape of the Swiss roll remains visible.
Some flaws of spectral methods like Isomap, SDE, LLE, and LE can be ex-
plained by inspecting 3D embeddings instead of 2D ones, such as those shown
in Fig. 6.6. Such 3D embeddings can be obtained easily starting from the
2D one, knowing that spectral methods can build embeddings incrementally
just by taking into account an additional eigenvector of the Gram-like matrix.
While the embeddings of SDE appear almost perfect (the third dimension is
negligible), the result of Isomap shows that the manifold is not totally made
Fig. 6.4. Two-dimensional embeddings of the “Swiss roll” manifold by Isomap,


GNLM, CDA, and SDE (with both equality and inequality constraints). The three
methods embed 600 prototypes obtained beforehand with vector quantization.

flat. The “thickness” of the third dimension can be related to the importance
of discrepancies between graph distances and true geodesic ones. These dis-
crepancies can also explain the Swiss-cheese effect: long graph distances follow
a zigzagging path and are thus longer than the corresponding geodesic dis-
tances. On the contrary, short graph distances, involving a single graph edge,
can be slightly shorter than the true geodesic length, which can be curved.
For spectral topology-preserving methods, the picture is even worse in 3D.
As can be seen, the LLE embedding looks twisted; this can explain why LLE
often yields a triangular or cuneiform embedding although the latent space is
known to be square or rectangular. So LLE unrolls the manifold but intro-
duces other distortions that make it not perfectly flat. The LE embedding is
the worst: this method does not really succeed in unrolling the manifold.
Obviously, up to this point, all methods have worked under favorable conditions,
namely with an easy-to-embed developable manifold and with standard values for
their parameters. Yet even in this ideal setting, the dimensionality reduction
may not be so easy for some methods.
Fig. 6.5. Two-dimensional embeddings of the “Swiss roll” manifold by a SOM,


GTM, LLE, LE, and Isotop. The data set consists of 600 prototypes resulting from
a vector quantization on 5000 points of the manifold. The 5000 points are given as
such for the SOM.

The rolled Japanese flag

Another more difficult example consists of removing a well-chosen piece of
the Swiss roll. More precisely, a disk centered on [0, 0]T , with radius 0.5, is re-
moved from the latent space, which then looks like the Japanese flag, as shown
in the first plot of Fig. 6.7. This flag can be rolled using Eq. (6.1); vector quan-
tization is then applied on the obtained three-dimensional points. Next, all
methods reduce the dimensionality back to two, as shown in Figs. 6.8 and 6.9,
and the embedding can be compared with the Japanese flag. Unfortunately,
the initial aspect ratio of the latent space is lost due to the particular form
of the parametric equations. Such a scaling cannot, of course, be retrieved.
What appears more surprising is that the region where the hole lies looks
vertically stretched. This phenomenon is especially visible for Isomap, a bit
less for GNLM, and nearly absent for CDA. For SDE, depending on the use
of equalities or inequalities, the second dimension can be shrunk.
Fig. 6.6. Three-dimensional embeddings of the “Swiss roll” manifold by Isomap,


SDE, LLE, and LE.
Fig. 6.7. On the left: latent space for the “Japanese flag” manifold. On the right:
the fact that geodesic (or graph) distances (blue curve) are replaced with Euclidean
ones (red line) in the embedding space explains the stretched-hole phenomenon that
appears when the Japanese flag is embedded. The phenomenon is visible in Fig. 6.8.

For methods relying on graph distances, the explanation is rather sim-
ple: as mentioned in Section 4.3, they measure graph distances in the data
space but use Euclidean distances in the embedding space, because the latter
metric offers nice properties. For nonconvex manifolds like the Japanese flag,
Euclidean distances cannot be made equal to the measured graph distances
for some pairs of points; this is illustrated by the second plot of Fig. 6.7. Al-
though the manifold is already unrolled, the geodesic distance remains longer
than the corresponding Euclidean one because it circumvents the hole. As a
consequence, NLDR methods try to preserve these long distances by stretch-
ing the hole. In particular, a spectral method like Isomap, whose ability to
embed curved manifolds depends exclusively on the way distances are mea-
sured or deformed, is easily misled. More complex methods, which can already
deal with nonlinear manifolds without graph distances, like GNLM and CDA,
behave better. They favor the preservation of short distances, which are less
affected by the presence of holes. The SDE case is more complex: with strict
equalities, the method behaves perfectly. When local distances are allowed to
decrease, the whole manifold gets shrunk along the second dimension.
It is noteworthy that the hole in the Japanese flag manifold in Fig. 6.7 also
explains the poor results of Isomap when it uses resampling instead of vector
Fig. 6.8. Two-dimensional embeddings of the rolled Japanese flag obtained with
distance-preserving NLDR methods. All methods embed 600 prototypes obtained
beforehand with vector quantization.

quantization, as in Fig. 6.3. Small holes appear in the manifold distribution,
and Isomap tends to stretch them; as a direct consequence, denser regions
become even denser.
Regarding topology-preserving methods illustrated in Fig. 6.9, the results
of the SOM and GTM do not require particular comments: discontinuities
in the embeddings can clearly be seen. LLE produces a nice result, but it
can easily be guessed visually that the manifold is not perfectly unrolled.
Isotop also yields a good embedding: the topology is well preserved though
some distortions are visible. Finally, the embedding provided by LE looks
like Isomap’s one. Considering LE as a method that preserves commute-time
distances, this kind of result is not surprising. Since both ends of the Japanese
flag are nearly disconnected, the commute time from one to the other is larger
than required to embed correctly the manifold.

Fig. 6.9. Two-dimensional embeddings of the rolled Japanese flag obtained with
topology-preserving NLDR methods. All methods embed 600 prototypes obtained
beforehand with vector quantization. The 5000 points are given as such for the SOM.

The thin Swiss roll slice

Light can be shed on other flaws of NLDR methods by considering a very
thin slice of a Swiss roll, as illustrated in Fig. 6.10. Actually, this manifold is
identical to the one displayed in Fig. 6.1 and generated by Eq. (6.1), except
that its height is divided by four, i.e., y3 = x2 /2. NLDR methods embed the
thin slice as shown in Figs. 6.11 and 6.12. The best result is undoubtedly
given by SDE with strict equalities (with distances allowed to shrink, the
embedding almost loses its second dimension and looks like a sine wave).
GNLM and CDA produce twisted embeddings. On the other hand, Isomap
suffers from a weird “bone effect”: both ends of the rectangular manifold are
wider than its body. Again, these two phenomena can be explained by the fact
that graph distances only approximate true geodesic ones. More precisely, the
length of a long, smooth curve is replaced with the length of a broken line,
which is longer since it zigzags between the few available points in the data
Fig. 6.10. A very thin slice of Swiss roll.

Fig. 6.11. Two-dimensional embedding of a thin slice of Swiss roll by distance-


preserving methods. All methods embed 600 prototypes obtained beforehand with
vector quantization.

Fig. 6.12. Two-dimensional embedding of a thin slice of Swiss roll by topology-


preserving methods. All methods embed 600 prototypes obtained beforehand with
vector quantization. The 5000 points are given as such for the SOM.

set. Moreover, in the case of the thin Swiss roll, these badly approximated
geodesics are more or less oriented in the same way, along the main axis
of the rectangular manifold. Isomap tries to find a global tradeoff between
overestimated long distances and well-approximated short ones by stretching
vertically the manifold. GNLM and CDA act differently: as they favor the
preservation of small distances, they cannot stretch the manifold; instead,
they twist it.
Topology-preserving methods provide poor results, except GTM, which
embeds the thin Swiss roll more successfully than it does the normal one.
Still, the embedding is twisted and oddly stretched; this is because the Swiss
roll must be cast inside the square latent space assumed by GTM. Changing
the shape of the latent space did not lead to a better embedding. Clearly,
the SOM does not yield the expected result: the color does not vary smoothly
along the rectangular lattice as it does along the Swiss roll. As for the standard
Swiss roll, the SOM lattice “jumps” between the successive whorls of the man-
ifold, as illustrated in Fig. 6.13. Actually, the SOM tries to occupy the same
regions as the thin Swiss roll, but those jumps break the perfect preservation
Fig. 6.13. Three-dimensional view showing how an 8-by-75 SOM typically unfurls
in a thin slice of a Swiss roll.

of neighborhoods. These “shortcuts” are possible because the SOM works in
the high-dimensional space of data. Other methods do not unroll the manifold
completely. This can easily be understood, as distances are not taken directly
into account. Constraints on the shape of the manifold are thus almost com-
pletely relaxed, and such a longer-than-wide manifold can easily be embedded
in two dimensions without making it perfectly straight.

Inadequate parameter values

Still another flaw of many NLDR methods lies in a wrong or inappropriate
setting of the parameters. Most methods define local K-ary or ε-ball neigh-
borhoods. Until now, the value of K in the K-rule that builds the graph has
been given an appropriate value (according to the number of available data
points, the manifold curvature, the data noise, etc.). Briefly put, the graph
induced by the neighborhoods is representative of the underlying manifold.
For example, what happens if K is set to 8 instead of 5? The resulting graph
is shown in Fig. 6.14. An undesired or “parasitic” edge appears in the graph;
it connects an outer corner of the Swiss roll to a point lying on the next
whorl. At first sight, one can think that this edge is negligible, but actually
Fig. 6.14. After vector quantization, the K-rule weaves a graph that connects the
600 prototypes. Because K equals 8 instead of 5, undesired or “parasitic” edges
appear in the graph. More precisely, a link connects the outer corner of the Swiss
roll to a point lying in another whorl.

it can completely mislead the NLDR methods since they take it into account,
just as they would with any other normal edge. For instance, such an edge can
jeopardize the approximation of the geodesic distances. In that context, it can
be compared to a shortcut in an electrical circuit: nothing works as desired.
With K = 8, NLDR methods embed the Swiss roll, as shown in Figs. 6.15
and 6.16. As can be seen, the parasitic link completely misleads Isomap and
GNLM. Obviously, CDA yields a result very similar to the one displayed in
Fig. 6.4, i.e., in the ideal case, without undesired links. The good performance
of CDA is due to the parameterized weighting function Fλ , which allows CDA
to tear some parts of the manifold. In this case, the parasitic link has been
torn. However, this nice result can require some tuning of the neighborhood
proportion π in CDA. Depending on the parameter value, the tear does not
always occur at the right place. A slight modification in the parameter setting
may cause small imperfections: for example, the point on the corner may be
torn off and pulled toward the other end of the parasitic link.
Results provided by SDE largely vary with respect to the parameter val-
ues. Specifically, imposing a strict preservation of local distances or allowing
them to shrink leads to completely different embeddings. With strict equal-
ities required, the presence of the parasitic edge prevents the semidefinite
programming procedure included in SDE from unrolling the manifold. Hence, the
result looks like a slightly deformed PCA projection. With inequalities, SDE
nearly succeeds in unfolding the Swiss roll; unfortunately, some regions of the
manifold are superposed.
Fig. 6.15. Two-dimensional embeddings of the “Swiss roll” manifold by distance-


preserving methods. All methods embed 600 prototypes obtained beforehand with
vector quantization. In contrast with Fig. 6.4, the value of K in the K-rule is too
high and an undesired edge appears in the graph connecting the prototypes, as
shown in Fig. 6.14. As a consequence, graph distances fail to approximate the true
geodesic distances.

Until now, topology-preserving methods using a data-driven lattice (LLE,
LE, Isotop) seemed to outperform those working with a predefined lattice
(SOM, GTM). Obviously, the former methods depend on the quality of the
lattice. These methods cannot break the parasitic edge and fail to unfold
the Swiss roll nicely. However, because they are not constrained to preserve
distances strictly, even locally, those methods manage to distort the manifold
in order to yield embeddings without superpositions. The SOM and GTM, of
course, produce the same results as for the standard Swiss roll.
Fig. 6.16. Two-dimensional embeddings of the “Swiss roll” manifold by topology-


preserving methods. All methods embed 600 prototypes obtained beforehand with
vector quantization. The 5000 points are given as such for the SOM. In contrast with
Fig. 6.4, the value of K in the K-rule is too high and an undesired edge appears in
the graph connecting the prototypes, as shown in Fig. 6.14. As a consequence, the
embedding quality decreases.

Nondevelopable Swiss roll

What happens when we attempt to embed a nondevelopable manifold? The
answer may be given by the “heated” Swiss roll, whose parametric equations
are
\[
  \mathbf{y} =
  \begin{bmatrix}
    (1 + x_2^2)\sqrt{1 + x_1}\,\cos\bigl(2\pi\sqrt{1 + x_1}\bigr) \\
    (1 + x_2^2)\sqrt{1 + x_1}\,\sin\bigl(2\pi\sqrt{1 + x_1}\bigr) \\
    2x_2
  \end{bmatrix} ,
  \qquad (6.2)
\]
where x = [x1, x2]^T is uniformly distributed in the interval [−1, +1]^2. The
square root ensures that the distribution remains uniform in the 3D embed-
ding. The first two coordinates, y1 and y2 , depend on both latent variables x1
and x2 in a nonlinear way. Hence, conditions to have a developable manifold
are broken. Figure 6.17 shows the resulting manifold, which looks like a Swiss
roll that has melted in an oven. As for the normal Swiss roll, 5000 points are

Fig. 6.17. The “heated” Swiss roll manifold.

generated, but in this case at least 800 prototypes are needed, instead of 600,
because consecutive whorls are closer to each other. The K-rule is used with
K = 5. The results of NLDR methods are shown in Figs. 6.18 and 6.19. As
expected, Isomap performs poorly with this nondevelopable manifold: points
in the right part of the embedding are congregated. GNLM yields a better
result than Isomap; unfortunately, the embedding is twisted. CDA does even
better but needs some parameter tuning in order to balance the respective
influences of short and long distances during the convergence. The neighbor-
hood proportion must be set to a high value for CDA to behave as GNLM.
More precisely, the standard schedule (hyperbolic decrease between 0.75 and
0.05) is replaced with a slower one (between 0.75 and 0.50) in order to avoid
undesired tears. Results of SDE also depend on the parameter setting. With
strict preservation of the distances, the embedding is twisted, whereas allowing
the distances to shrink produces an eye-shaped embedding.
Topology-preserving methods are expected to behave better with this non-
developable manifold, since no constraint is imposed on distances. An SOM
and GTM do not succeed in unfolding the heated Swiss roll. Spectral methods
like LLE and LE fail, too. Only Isotop yields a nice embedding.
In the case of spectral methods, some flaws in the embeddings can be
explained by looking at what happens in a third dimension, as was done for
the standard Swiss roll. As spectral methods build embeddings incrementally,
a third dimension is obtained by keeping an additional eigenvector of the
Gram-like matrix. Figure 6.20 shows those three-dimensional embeddings. As
can be seen, most embeddings are far from resembling the genuine latent
space, namely a flat rectangle. Except for SDE with inequalities, the “span”
Fig. 6.18. Two-dimensional embeddings of the “heated” Swiss roll manifold com-
puted by distance-preserving methods. All methods embed 800 prototypes obtained
beforehand with vector quantization.

(or the variance) along the third dimension is approximately equal to the
one along the second dimension. In the case of Isomap, the manifold remains
folded lengthways. SDE with equalities leads to a twisted manifold, as does
LLE. LE produces a helical embedding.

A last word about the Swiss roll

The bottom line of the above experiments consists of two conclusions. First,
spectral methods offer nice theoretical properties (exact optimization, sim-
plicity, possibility to build embeddings in an incremental way, etc.). However,
they do not prove to be very robust against departures from their underly-
ing model. For instance, Isomap does not produce satisfying embeddings for
non-developable manifolds, which are not uncommon in real-life data. Sec-
ond, iterative methods based on gradient descent, for example, can deal with
more complex objective functions, whereas spectral methods are restricted to
functions having “nice” algebraic properties. This makes the former methods
more widely applicable. The price to pay, of course, is a heavier computational
load and a larger number of parameters to be adjusted by the user.
Fig. 6.19. Two-dimensional embeddings of the “heated” Swiss roll manifold com-
puted by topology-preserving methods. All methods embed 800 prototypes obtained
beforehand with vector quantization. The 5000 points are given as such for the SOM.

Most recent NLDR methods are based on the definition of neighborhoods,
which induce a graph connecting the data points. This graph can be used for
approximating either the geodesic distances or the manifold topology. Obvi-
ously, building such a graph requires the user to choose a neighborhood type
(K-ary neighborhoods, ε-balls, or still another rule). Of course, none of these
rules is perfect and each of them involves parameters to adjust, thereby creat-
ing the possibility for the algorithm to fail in case of bad values. The presence
of unwanted edges in the graph, for instance, can jeopardize the dimensionality
reduction.
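To fix ideas, the sketch below builds a K-rule graph and approximates geodesic distances by graph distances with Dijkstra's algorithm, using SciPy; it only illustrates the principle and is not the exact procedure used in the experiments.

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import shortest_path

def graph_distances(points, k=5):
    """Connect each point to its k nearest neighbors (edges weighted by their
    Euclidean length, then symmetrized) and return all-pairs shortest-path
    distances in this graph, as done by Isomap, GNLM, or CDA."""
    D = distance_matrix(points, points)
    W = np.zeros_like(D)                          # zero entries mean "no edge"
    for i in range(len(points)):
        neighbors = np.argsort(D[i])[1:k + 1]     # skip the point itself
        W[i, neighbors] = D[i, neighbors]
    W = np.maximum(W, W.T)                        # make the graph undirected
    return shortest_path(W, method="D", directed=False)
```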
In recent distance-preserving NLDR methods, geodesic distances play a
key part. This metric offers an elegant way to embed developable manifolds.
Unfortunately, in practice true geodesic distances cannot be computed; they
are approximated by graph distances. In many cases, however, the approxima-
tion is far from perfect or can even fail. So, even if a manifold is developable
in theory, it may happen that it cannot be considered as such in practice: a
perfect isometry cannot be found. This can be due among other reasons to a
Fig. 6.20. Three-dimensional embeddings of the “heated” Swiss roll manifold by


Isomap, SDE, LLE, and LE.
data set that is too small or too noisy, or to an inappropriate parameter value.
In the worst case, the graph used to compute the graph distances may fail to
represent correctly the underlying manifold. This often comes from a wrong
parameter value in the rule used to build the graph. Therefore, it is not pru-
dent to rely solely on the graph distance when designing an NLDR method.
Other techniques should be integrated in order to compensate for the flaws
of the graph distance and make the method more flexible, i.e., more tolerant
to data that do not fulfill all theoretical requirements. This is exactly what
has been done in GNLM, CDA and other iterative methods. They use all-
purpose optimization techniques that allow more freedom in the definition of
the objective function. This makes them more robust and more versatile. In
return, these methods are less elegant from the theoretical viewpoint
and need some parameter tuning in order to fully exploit their capabilities.
Regarding topology preservation, methods using a data-driven lattice
(LLE and Isotop) clearly outperform those relying on a predefined lattice. The
advantage of the former methods lies in their ability to extract more informa-
tion from data (essentially, the neighborhoods in the data manifold). Another
explanation is that LLE and Isotop work in the low-dimensional embedding
space, whereas an SOM and GTM iterate in the high-dimensional data space.
In other words, LLE and Isotop attempt to embed the graph associated with
the manifold, whereas an SOM and GTM try to deform and fit a lattice in
the data space. The second solution offers too much freedom: because the
lattice has more “Lebensraum” at its disposal in the high-dimensional space,
it can jump from one part of a folded manifold to another. Working in the
embedding space like LLE and Isotop do is more constraining but avoids these
shortcuts and discontinuities.

6.1.2 Manifolds having essential loops or spheres

This subsection briefly describes how CCA/CDA can tear manifolds with es-
sential loops (circles, knots, cylinders, tori, etc.) or spheres (spheres, ellipsoids,
etc.). Three examples are given:
• The trefoil knot (Fig. 6.21) is a compact 1-manifold embedded in a three-
dimensional space. The parametric equations are
\[
  \mathbf{y} =
  \begin{bmatrix}
    41\cos x - 18\sin x - 83\cos(2x) - 83\sin(2x) - 11\cos(3x) + 27\sin(3x) \\
    36\cos x + 27\sin x - 113\cos(2x) + 30\sin(2x) + 11\cos(3x) - 27\sin(3x) \\
    45\sin x - 30\cos(2x) + 113\sin(2x) - 11\cos(3x) + 27\sin(3x)
  \end{bmatrix} ,
\]

where 0 ≤ x < 2π. The data set consists of 300 prototypes, which are
obtained by vector quantization on 20,000 points randomly drawn in the
knot. Neighboring prototypes are connected using the K-rule with K = 2.

• The sphere (Fig. 6.22) is a compact 2-manifold embedded in a three-dimen-
sional space. For a unit radius, the parametric equations could be
Fig. 6.21. The trefoil knot: a compact 1-manifold embedded in a three-dimensional


space. The data set consists of 300 prototypes obtained after a vector quantization on
20,000 points of the knot. The color varies according to y3 . Prototypes are connected
using the K-rule with K = 2.

\[
  \mathbf{y} =
  \begin{bmatrix}
    \cos(x_1)\cos(x_2) \\
    \sin(x_1)\cos(x_2) \\
    \sin(x_2)
  \end{bmatrix} ,
\]

where 0 ≤ x1 < π and 0 ≤ x2 < 2π. But unfortunately, with these
parametric equations, a uniform distribution in the latent space yields a
nonuniform and nonisotropic distribution on the sphere surface. Actually,
points of the sphere are obtained by normalizing points randomly drawn
in a three-dimensional Gaussian distribution (see Appendix B and the sketch after this list). The data
set consists of 500 prototypes that are obtained by vector quantization on
20,000 points drawn randomly in the sphere. Neighboring prototypes are
connected using the K-rule with K = 5.
• The torus (Fig. 6.23) is a compact 2-manifold embedded in a three-
dimensional space. The parametric equations are
\[
  \mathbf{y} =
  \begin{bmatrix}
    (2 + \cos(x_1))\cos(x_2) \\
    (2 + \cos(x_1))\sin(x_2) \\
    \sin(x_1)
  \end{bmatrix} ,
\]

where 0 ≤ x1 < 2π and 0 ≤ x2 < 2π. With these parametric equations,
the distribution on the torus surface is not uniform, but at least it is
isotropic in the plane spanned by the coordinates y1 and y2 . The data
set consists of 1000 prototypes obtained by vector quantization on 20,000
points randomly drawn in the torus. Neighboring prototypes are connected
using the K-rule with K = 5.
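The Gaussian trick mentioned for the sphere is easy to reproduce; the short sketch below draws points uniformly on the unit sphere (the function name and the seed are ours).

```python
import numpy as np

def uniform_sphere(n_points=20000, seed=0):
    """Draw uniform points on the unit sphere by normalizing Gaussian samples.

    An isotropic 3D Gaussian has no preferred direction, so projecting its
    samples onto the unit sphere yields a uniform, isotropic distribution,
    which the parametric equations above do not.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_points, 3))
    return g / np.linalg.norm(g, axis=1, keepdims=True)
```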
Fig. 6.22. The sphere: a compact 2-manifold embedded in a three-dimensional


space. The manifold itself is displayed in the first plot, whereas the second plot
shows the 500 prototypes used as the data set. They are obtained by a vector
quantization on 20,000 points of the sphere. The color varies according to y3 in both
plots. Prototypes are connected using the K-rule with K = 5.

Fig. 6.23. The torus: a compact 2-manifold embedded in a three-dimensional space.


The manifold itself is displayed in the first plot, whereas the second plot shows the
1000 prototypes used as the data set. They are obtained by a vector quantization on
20,000 points of the torus. The color varies according to y3 in both plots. Prototypes
are connected using the K-rule with K = 5.
For these three manifolds, it is interesting to see whether or not the graph
distance helps the dimensionality reduction, knowing that the trefoil knot is
the only developable manifold. Another question is: does the use of the graph
distance change the way the manifolds are torn?
In the case of the trefoil knot, CCA and CDA can reduce the dimension-
ality from three to one, as illustrated in Fig. 6.24. In both one-dimensional

Fig. 6.24. One-dimensional embeddings of the trefoil knot by CCA and CDA.
Graph edges between prototypes are displayed with blue corners (∨ or ∧). The color
scale is copied from Fig. 6.21.

plots, the graph edges are represented by blue corners (∨ or ∧). As can be
seen, CCA tears the knot several times, although it is absolutely not needed.
In contrast, CDA succeeds in unfolding the knot with a single tear only. The
behavior of CCA and CDA is also different when the dimensionality is not re-
duced to its smallest possible value. Figure 6.25 illustrates the results of both
methods when the knot is embedded in a two-dimensional space. This figure
clearly explains why CDA yields a different embedding. Because the knot is
highly folded and because CCA must preserve Euclidean distances, the em-
bedding CCA computes attempts to reproduce the global shape of the knot
(the three loops are still visible in the embedding). On the contrary, CDA
gets no information about the shape of the knot since the graph distance
is measured along the knot. As a result, the knot is topologically equiva-
lent to a circle, which precisely corresponds to the embedding computed by
CDA. Hence, in the case of the trefoil knot, the graph distance allows one to
avoid unnecessary tears. This is confirmed by looking at Fig. 6.26, showing
Fig. 6.25. Two-dimensional embeddings of the trefoil knot by CCA and CDA. The
color scale is copied from Fig. 6.21.

two-dimensional embeddings computed by Sammon’s NLM and its variant
using the graph distance.

Fig. 6.26. Two-dimensional embeddings of the trefoil knot by NLM and GNLM.
The color scale is copied from Fig. 6.21.

For the sphere and the torus, the graph distance seems to play a less im-
portant role, as illustrated by Figs. 6.27 and 6.28. Euclidean as well as graph
distances cannot be perfectly preserved for these nondevelopable manifolds.
The embeddings computed by both methods are similar. The sphere remains
Fig. 6.27. Two-dimensional embeddings of the sphere by CCA and CDA. Colors
remain the same as in Fig. 6.22.

Fig. 6.28. Two-dimensional embeddings of the torus by CCA and CDA. The color
scale is copied from Fig. 6.23.
rather easy to embed, but the torus requires some tuning of the parameters
for both methods.
As already mentioned, CCA and CDA are the only methods having the
intrinsic capabilities to tear manifolds. An SOM can also break some neigh-
borhoods, when opposite edges of the map join each other on the manifold
in the data space, for instance. It is noteworthy, however, that most recent
NLDR methods working with local neighborhoods or a graph can be extended
in order to tear manifolds with essential loops. This can be done by breaking
some neighborhoods or edges in the graph before embedding the latter. A
complete algorithm is described in [118, 121]. Unfortunately, this technique
does not work for P -manifolds with essential spheres when P > 1 (loops on a
sphere are always contractible and thus never essential).

6.2 Cortex unfolding


The study of the human brain has revealed that this complex organ is roughly
composed of two tissues: the white matter and the gray matter. The
former occupies a large volume in the center of the brain and contains mainly
the axons of the neurons. On the other hand, the latter is only a thin (2–
4-mm) shell, called the cortex and containing mainly the cell nuclei of the
neurons [193]. In order to maximize the cortical surface without swelling the
skull, evolution has folded it up, as illustrated by Fig. 6.29. Data in the lat-
ter figure come from the Unité mixte INSERM-UJF 594 (Université Joseph
Fourier de Grenoble) and show the cortical surface of a living patient [28].
How can it be obtained?
Actually, the shape of the cortical surface can be extracted noninvasively
by techniques like magnetic resonance imaging (MRI) [192]. The principle of
MRI is based on the specific way different tissues react to a varying magnetic
field. According to the intensity of the reaction, a three-dimensional image
of the tissues can be deduced. In the case of the brain, that image can be
further processed in order to keep only the cortical surface. This can be done
by segmenting the image, i.e., by delineating the subvolumes occupied by
the different tissues [181]. Next, only voxels (3D pixels) corresponding to the
cortical surface are kept and are encoded with three-dimensional coordinates.
Obviously, the analysis of the cortical surface as displayed in Fig. 6.29
appears difficult because of the many folds. It would be easier to have a flat,
two-dimensional map of the cortex, in the same way as there are maps of
the Earth’s globe. DR techniques may help build such a representation. The
two-dimensional embeddings computed by Sammon’s NLM, CCA, GNLM,
and CDA are shown in Figs. 6.30 to 6.33, respectively. Because the data set
contains more than 5000 points, spectral methods using an EVD, like Isomap,
SDE, and KPCA, are not considered. Usual software like MATLAB® failed
to manage the huge amount of memory required by those methods. With
Fig. 6.29. Three-dimensional representations of a small piece of a human cortical


surface, obtained by magnetic resonance imaging. Two different views of the same
surface are shown, and the color varies according to the height.
Fig. 6.30. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by Sammon’s NLM.

Fig. 6.31. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by CCA.
Fig. 6.32. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by GNLM.

Fig. 6.33. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by CDA.
the absence of Isomap, we must also remark that the cortical surface is clearly not a
developable manifold.
As can be seen, the embedding computed by methods preserving Euclidean
distances and the corresponding ones using graph distance do not yield the
same results. Although the manifold is not developable, the graph distance
greatly helps to unfold it. This is confirmed intuitively by looking at the
final value of Sammon’s stress. Using the Euclidean distance, NLM converges
on ENLM = 0.0162, whereas GNLM reaches a much lower and better value:
EGNLM = 0.0038.
Leaving aside the medical applications of these cortex maps, it is notewor-
thy that dimensionality reduction was also used to obtain a colored surface-like
representation of the cortex data. The cortical surface is indeed not available
directly as a surface, but rather as a set of points sampled from this surface.
However, the only way to render a surface in a three-dimensional view con-
sists of approximating it with a set of connected triangles. Unfortunately, most
usual triangulation techniques that can convert points into triangles work
for two dimensions only. (Graph-building rules mentioned in Sections 4.3 and
Appendix E are not able to provide an exact triangulation.) Consequently,
dimensionality reduction provides an elegant solution to this problem: points
are embedded in a two-dimensional space, triangulation is achieved using a
standard technique in 2D, like Delaunay's, and the obtained triangles are
finally sent back to the three-dimensional space. This allows us to display the
three-dimensional views of the cortical surface in Fig. 6.29.
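A possible implementation of this detour through two dimensions, assuming SciPy's Delaunay triangulation and an embedding computed beforehand by any NLDR method, could look as follows.

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulate_via_embedding(points_3d, embedding_2d):
    """Triangulate a point cloud sampled from a surface.

    The triangulation is computed on the 2D embedding, where Delaunay's
    technique applies, and the triangles are then "sent back" to 3D simply by
    reusing the original coordinates of their vertices.
    """
    points_3d = np.asarray(points_3d)
    tri = Delaunay(embedding_2d)           # tri.simplices: (n_triangles, 3) indices
    return points_3d[tri.simplices]        # (n_triangles, 3, 3) triangle corners
```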
How do topology-preserving methods compare to them? Only results for
an SOM (40-by-40 map), GTM (20-by-20 latent grid, 3 × 3 kernels, 100 EM
iterations), and Isotop (without vector quantization, ε-rule, ε = 1.1) are given.
Due to its huge memory requirements, LLE failed to embed the 4961 points de-
scribing the cortical surface. The result of the SOM is illustrated by Fig. 6.34.
The embedding computed by GTM is shown in Fig. 6.35. Figure 6.36 displays
Isotop’s result. Again, it must be stressed that methods using a predefined lat-
tice, even if they manage to preserve the topology, often distort the shape
of data. This is particularly true for the SOM and questionable for GTM
(the latent space can be divided into several regions separated by empty or
sparsely populated frontiers). Isotop yields the most satisfying result; indeed,
the topology is perfectly preserved and, in addition, so is the shape of data:
the embedding looks very close to the results of distance-preserving methods
(see, e.g., Fig. 6.33). The main difference between Figs. 6.33 and 6.36 lies in
the axes: for Isotop, the size of the embedding is not related to the pairwise
distances measured in the data set.

6.3 Image processing


As already mentioned in Subsection 1.1.1, dimensionality reduction can be
applied to image processing. Within this framework, each pixel of an image
Fig. 6.34. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by an SOM. The first plot shows how the SOM unfurls in the three-
dimensional space, whereas the second represents the two-dimensional SOM lattice.
Fig. 6.35. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by GTM.

Fig. 6.36. Two-dimensional embedding of the cortical surface shown in Fig. 6.29,
achieved by Isotop.
can be considered to be a dimension or observed variable in a very high-
dimensional space, namely the whole image. As an example, a small picture,
of size 64 by 64 pixels, corresponds to a vector in a 4096-dimensional space.
However, in a typical image, it is very likely that neighboring pixels depend on
each other; an image is often split into regions having almost the same color.
Obviously, the relationships between pixels depend on the particular type of
image analyzed. For a given set of images, it can be reasonably stated that the
intrinsic dimensionality largely depends on the depicted content, and not on
the number of pixels. For instance, if the set contains similar images (several
landscapes or several portraits, etc.) or if the same object is depicted in several
positions, orientations, or illuminations, then the intrinsic dimensionality can
be quite low. In this case, dimensionality reduction can be very useful to
“sort” the image set, i.e., to obtain a representation in which similar images
are placed close to each other.

6.3.1 Artificial faces

A set of 698 face images is proposed in [180]. The images represent an arti-
ficially generated face rendered with different poses and lighting directions.
Figure 6.37 shows several faces drawn at random in the set. Each image con-

Fig. 6.37. Several face pictures drawn at random from the set of 698 images pro-
posed in [180].

sists of an array of 64 by 64 pixels that are associated with a single brightness
value. Before dimensionality reduction, each image is converted into a 4096-
dimensional vector. As the latter number is very high, an initial dimensionality
reduction is performed by PCA, in a purely linear way. The first 240 princi-
pal components are kept; they bear more than 99% of the global variance. In
other words, PCA achieves an almost lossless dimensionality reduction in this
case.
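Such a linear preprocessing step can be written compactly with a singular value decomposition; the sketch below is a generic PCA projection, not necessarily the exact implementation used here.

```python
import numpy as np

def pca_reduce(images, n_components=240):
    """Project flattened images onto their first principal components.

    images: (n_samples, 4096) array of flattened 64-by-64 pictures. Returns
    the (n_samples, n_components) reduced vectors together with the fraction
    of the global variance they retain.
    """
    X = images - images.mean(axis=0)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    reduced = X @ Vt[:n_components].T
    retained = np.sum(s[:n_components] ** 2) / np.sum(s ** 2)
    return reduced, retained
```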
Next, the 240-dimensional vectors are processed by the following distance-
preserving methods: metric MDS, Sammon’s NLM, CCA, Isomap, GNLM,
CDA, and SDE (with equality and inequality constraints). Three topology-
preserving methods are also used (LLE, LE, and Isotop); methods using a
predefined lattice like an SOM or GTM are often restricted to two-dimensional
latent spaces.
Metric MDS is the only linear method. All methods involving neighbor-
hoods, a graph, or graph distances use the K-rule, with K = 4 (this value is
lower than the one proposed in [180] but still produces a connected graph).
No vector quantization is used since the number of images (N = 698) is low.
At this point, it remains to decide the dimensionality of the embedding.
Actually, it is known that the images have been generated using three degrees
of freedom. The first latent variable is the left-right pose, the second one is the
up-down pose, and the third one is the lighting direction (left or right). Hence,
from a theoretical point of view, the 240-dimensional vectors lie on a three-
manifold. From a practical point of view, are the eight distance-preserving
methods able to reduce the dimensionality to three?
Figures 6.38–6.48 answer this question in a visual way. Each of these figures
represents the three-dimensional embedding computed by one of the
methods. In order to ease the visualization, the embeddings are processed as
follows:
1. Each embedding is rotated in such a way that its three coordinates actu-
ally correspond to its three principal components.
2. A rectangular parallelepiped is centered around the embedding.
3. The parallelepiped is divided into 6-by-6-by-6 cells.
4. The six layers along the third coordinate are shown separately for each
embedding.
Each layer consists of 36 cells containing some points. Instead of displaying
them as simple dots— which does not convey much information —the image
associated with one of the points in each cell is displayed. This point is chosen
to be the closest one from the average of all points in the cell. Empty cells are
left blank.
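The cell-based display can be sketched as follows; the initial rotation onto the principal components is omitted and the function name is ours.

```python
import numpy as np

def cell_representatives(embedding_3d, n_cells=6):
    """For a 3D embedding, assign each point to a cell of an n_cells^3 grid and
    pick, in every nonempty cell, the point closest to the cell average.

    Returns a dict mapping (i, j, k) cell indices to the index of the chosen
    point, whose associated image can then be displayed in that cell.
    """
    embedding_3d = np.asarray(embedding_3d, dtype=float)
    mins, maxs = embedding_3d.min(axis=0), embedding_3d.max(axis=0)
    scaled = (embedding_3d - mins) / (maxs - mins + 1e-12)
    cells = np.minimum((scaled * n_cells).astype(int), n_cells - 1)
    reps = {}
    for key in {tuple(c) for c in cells}:
        idx = np.where((cells == key).all(axis=1))[0]
        center = embedding_3d[idx].mean(axis=0)
        reps[key] = idx[np.argmin(np.linalg.norm(embedding_3d[idx] - center, axis=1))]
    return reps
```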
With such a representation, the embedding quality can be assessed visually
by looking at neighboring cells. If images smoothly change from one cell to
another, then local distances have been correctly reproduced.
As can be seen in Fig. 6.38, metric MDS does not perform very well.
This demonstrates that a linear method does not suffice to embed data. This is
confirmed by looking at the eigenvalues: coordinates along the three principal
components carry barely 60% of the global variance measured on the 240-
dimensional vectors.
The result of Sammon’s NLM, illustrated by Fig. 6.39, seems visually more
pleasing. Layers are more regularly populated, and clusters of similar images
can be perceived.
By comparison, the embedding computed by CCA in Fig. 6.40 is disap-
pointing. There are discontinuities in layers 3, 4, and 5: neighboring faces are
sometimes completely different from each other.
The embedding provided by Isomap is good, as visually confirmed by
Fig. 6.41. However, layers are sparsely populated; the left-right and up-down
Fig. 6.38. Three-dimensional embedding of the 698 face images computed by metric
MDS. The embedding is sliced into six layers, which in turn are divided into 6 by
6 cells. Each cell is represented by displaying the image corresponding to one of the
points it contains. See text for details.

Fig. 6.39. Three-dimensional embedding of the 698 face images computed by Sam-
mon’s NLM. The embedding is sliced into six layers, which in turn are divided into
6 by 6 cells. Each cell is represented by displaying the image corresponding to one
of the points it contains. See text for details.
Fig. 6.40. Three-dimensional embedding of the 698 face images computed by CCA.
The embedding is sliced into six layers, which in turn are divided into 6 by 6 cells.
Each cell is represented by displaying the image corresponding to one of the points
it contains. See text for details.

poses vary smoothly across the layers. The lighting direction can be perceived
too: the light source is on the left (resp., right) of the face in the first (resp.,
last) layers. Can this outstanding performance be explained by the use of the
graph distance?
When looking at Fig. 6.42, the answer seems to be yes, since GNLM per-
forms much better than NLM, and does as well as Isomap. The final confir-
mation comes with the good result of CDA (Fig. 6.43). All layers are quite
densely populated, including the first and last ones. As for Isomap, the head
smoothly moves from one picture to another. The changes of lighting direction
are clearly visible, too.
In addition, two versions of SDE (with equality constraints or with local
distances allowed to shrink) work very well. Layers are more densely populated
with strict equality constraints, however.
Results of topology-preserving methods are given in Figs. 6.46–6.48. LLE
provides a disappointing embedding, though several values for its parameters
were tried. Layers are sparsely populated, and transitions between pictures
are not so smooth. The result of LE is better, but still far from most distance-
preserving methods. Finally, Isotop succeeds in providing a good embedding,
though some discontinuities can be observed, for example, in layer 2.
In order to verify the visual impression left by Figs. 6.38–6.48, a quantita-
tive criterion can be used to assess the embeddings computed by the NLDR
methods. Based on ideas developed in [9, 74, 10, 190, 106, 103], a simple
criterion can be based on the proximity rank. The latter can be denoted as
the function r = rank(X, i, j) and is computed as follows:
Fig. 6.41. Three-dimensional embedding of the 698 face images computed by


Isomap. The embedding is sliced into six layers, which in turn are divided into
6 by 6 cells. Each cell is represented by displaying the image corresponding to one
of the points it contains. See text for details.

Fig. 6.42. Three-dimensional embedding of the 698 face images computed by


GNLM. The embedding is sliced into six layers, which in turn are divided into
6 by 6 cells. Each cell is represented by displaying the image corresponding to one
of the points it contains. See text for details.
Fig. 6.43. Three-dimensional embedding of the 698 face images computed by CDA.
The embedding is sliced into six layers, which in turn are divided into 6 by 6 cells.
Each cell is represented by displaying the image corresponding to one of the points
it contains. See text for details.

Fig. 6.44. Three-dimensional embedding of the 698 face images computed by SDE
(with equality constraints). The embedding is sliced into six layers, which in turn
are divided into 6 by 6 cells. Each cell is represented by displaying the image corre-
sponding to one of the points it contains. See text for details.
Fig. 6.45. Three-dimensional embedding of the 698 face images computed by SDE
(with inequality constraints). The embedding is sliced into six layers, which in turn
are divided into 6 by 6 cells. Each cell is represented by displaying the image corre-
sponding to one of the points it contains. See text for details.

Fig. 6.46. Three-dimensional embedding of the 698 face images computed by LLE.
The embedding is sliced into six layers, which in turn are divided into 6 by 6 cells.
Each cell is represented by displaying the image corresponding to one of the points
it contains. See text for details.
Fig. 6.47. Three-dimensional embedding of the 698 face images computed by LE.
The embedding is sliced into six layers, which in turn are divided into 6 by 6 cells.
Each cell is represented by displaying the image corresponding to one of the points
it contains. See text for details.

Fig. 6.48. Three-dimensional embedding of the 698 face images computed by Isotop.
The embedding is sliced into six layers, which in turn are divided into 6 by 6 cells.
Each cell is represented by displaying the image corresponding to one of the points
it contains. See text for details.
• Using the vector set X and taking the ith vector as reference, compute all
Euclidean distances ‖x(k) − x(i)‖, for 1 ≤ k ≤ N.
• Sort the obtained distances in ascending order, and let output r be the
rank of x(j) according to the sorted distances.
In the same way as in [185, 186], this allows us to write two different measures,
called mean relative rank errors:
\[
  \mathrm{MRRE}_{Y \to X}(K) \triangleq \frac{1}{C} \sum_{i=1}^{N}
  \sum_{j \in \mathcal{N}_K(\mathbf{y}(i))}
  \frac{|\mathrm{rank}(X,i,j) - \mathrm{rank}(Y,i,j)|}{\mathrm{rank}(Y,i,j)}
  \qquad (6.3)
\]
\[
  \mathrm{MRRE}_{X \to Y}(K) \triangleq \frac{1}{C} \sum_{i=1}^{N}
  \sum_{j \in \mathcal{N}_K(\mathbf{x}(i))}
  \frac{|\mathrm{rank}(X,i,j) - \mathrm{rank}(Y,i,j)|}{\mathrm{rank}(X,i,j)} ,
  \qquad (6.4)
\]

where NK (x(i)) denotes the K-ary neighborhood of x(i). The normalization


factor is given by
K
|2k − N − 1|
C =N (6.5)
k
k=1
and scales the error between 0 and 1. Quite obviously, MRREY→X (K) and
MRREX→Y (K) both vanish if the K closest neighbors of each datum appear
in the same order in both spaces. The first error, MRREX→Y (K), can be
compared to the continuity measure, whereas MRREY→X (K), is similar to
the trustworthiness [185, 186].
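For readers who want to reproduce these measures, a minimal NumPy sketch of both errors is given below. It assumes that ranks start at 1 for the nearest neighbor of each point (the point itself being excluded) and that ties are broken arbitrarily; the function names are ours, not taken from any toolbox.

```python
import numpy as np

def rank_matrix(Z):
    """rank_matrix(Z)[i, j] = rank of point j among the points sorted by
    increasing Euclidean distance to point i (the point itself gets rank 0)."""
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    order = np.argsort(D, axis=1)
    R = np.empty_like(order)
    rows = np.arange(Z.shape[0])[:, None]
    R[rows, order] = np.arange(Z.shape[0])[None, :]
    return R

def mrre(X, Y, K):
    """Return (MRRE_{Y->X}(K), MRRE_{X->Y}(K)) for data X and embedding Y
    (one point per row), following Eqs. (6.3)-(6.5)."""
    N = X.shape[0]
    RX, RY = rank_matrix(X), rank_matrix(Y)
    C = N * sum(abs(2 * k - N - 1) / k for k in range(1, K + 1))   # Eq. (6.5)
    err_yx = err_xy = 0.0
    for i in range(N):
        nbrs_y = np.argsort(RY[i])[1:K + 1]     # K nearest neighbors of y(i)
        nbrs_x = np.argsort(RX[i])[1:K + 1]     # K nearest neighbors of x(i)
        err_yx += np.sum(np.abs(RX[i, nbrs_y] - RY[i, nbrs_y]) / RY[i, nbrs_y])
        err_xy += np.sum(np.abs(RX[i, nbrs_x] - RY[i, nbrs_x]) / RX[i, nbrs_x])
    return err_yx / C, err_xy / C
```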
Figure 6.49 shows the evolution of the two errors as a function of K for
all NLDR methods. Looking at the curves of MRRE_{Y→X}(K) on the left, it
can be seen that LLE yields the worst results, though its parameters were
carefully tuned (K = 6 and Δ = 0.01). Metric MDS performs rather poorly;
this is not unexpected since it is the only purely linear method. Better results
were obtained by SDE (with strict equality constraints), CDA, and GNLM.
Isomap and SDE with inequalities achieve comparable results. Methods pre-
serving geodesic or graph distances outperform those working with Euclidean
distances. Isotop achieves very good results, too. It is also noteworthy that
there is a discontinuity in the slope of MRRE_{Y→X}(K) at K = 4 for methods
using a graph or K-ary neighborhoods; this precisely corresponds to the value
of K in the K-rule (except for LLE).
Looking at the curves of MRRE_{X→Y}(K), on the right, LLE is again the
worst method. On the other hand, CDA performs the best, followed by GNLM
and SDE with strict equality constraints. Again, all geodesic methods outper-
form their corresponding Euclidean equivalents. Regarding topology preserva-
tion, the best embedding is provided by Isotop.

Fig. 6.49. Mean relative rank errors (MRRE_{Y→X}(K) on the left and
MRRE_{X→Y}(K) on the right) of all NLDR methods for the artificial faces.

6.3.2 Real faces

The previous section shows an application of NLDR methods to the visual-
ization of a set of artificially generated face pictures. In this section, the same

problem is studied, but with real images. As a direct consequence, the intrinsic
dimensionality of the images is not known in advance. Instead of computing it
with the methods described in Chapter 3, it is proposed to embed the data set
in a two-dimensional plane by means of NLDR methods. In other words, even
if the intrinsic dimensionality is probably higher than two, a two-dimensional
representation is “forced”.
In practice, the data set [158] comprises 1965 grayscale images;
they are 28 pixels high and 20 pixels wide. Figure 6.50 shows a randomly
drawn subset of the images. The data are rearranged as a set of 1965 560-
dimensional vectors, which are given as they are, without any preprocessing.
(In the previous section, images were larger and were thus preprocessed with
PCA.) Six distance-preserving methods are used: metric MDS, Isomap, NLM,
GNLM, CCA, and CDA. SDE could not be used due to its huge computa-
tional requirements. Four topology-preserving methods are compared too: a
44-by-44 SOM, LLE, LE, and Isotop. GTM is discarded because of the high
dimensionality of the data space. All methods involving K-ary neighborhoods

Fig. 6.50. Some faces randomly drawn from the set of real faces available on the
LLE website.

or a graph work with K = 4, except LLE (K = 12, Δ = 0.0001 as advised


in [158]). The embedding computed by each method is displayed by means of
thumbnail pictures, in a similar way as in the previous section. Actually, the
region occupied by the embedding is divided into 22-by-22 cells. Each of them
is represented by a “mean” image, which is the average of all images corre-
sponding to the two-dimensional points lying in the considered cell. A graph
connecting each data point with its four closest neighbors is shown alongside
(except for the SOM, for which the hexagonal grid is represented).
The two-dimensional embedding computed by metric MDS is shown in
Fig. 6.51. As can be seen, the embedding has a C shape. In the lower (resp.,
upper) branch, the boy looks to his left (resp., right). In the left part of
the embedding, the boy seems unhappy; he is smiling in the right part. The

Fig. 6.51. Two-dimensional embedding of the 1965 real faces by metric MDS.

embedding resulting from Isomap, that is, metric MDS with graph distances,
is shown in Fig. 6.52. More details appear, and fewer cells are blank, but
a Swiss-cheese effect emerges: holes in the graph look stretched. Moreover,
images corresponding to some regions of the embeddings look blurred or fuzzy.
This is due to the fact that regions in between or on the border of the holes
are shrunk: too many different images are concentrated in a single cell and
are averaged together.

Fig. 6.52. Two-dimensional embedding of the 1965 real faces by Isomap.

The embedding provided by NLM (Fig. 6.53) confirms that the data set is
roughly composed of two dense, weakly connected clusters. No Swiss-cheese
effect can be observed. Unfortunately, there are still blurred regions and some
discontinuities are visible in the embedding. The use of graph distances in
Sammon’s nonlinear mapping (Fig. 6.54) leads to a sparser embedding. Nev-
ertheless, many different facial expressions appear. Holes in the graph are not
stretched as they are for Isomap.
CCA provides a dense embedding, as shown in Fig. 6.55. Few discontinu-
ities can be found. Replacing the Euclidean distance with the graph distance
(CDA, Fig. 6.56) leads to a sparser embedding, just as for NLM and GNLM. This time, sev-
eral smaller clusters can be distinguished. Due to sparsity, some parts of the
main cluster on the right are somewhat blurred.
Figure 6.57 shows the embedding computed by LLE. As usual, LLE yields a
cuneiform embedding. Although the way to display the embedding is different
in [158], the global shape looks very similar. For the considered data set, the

Fig. 6.53. Two-dimensional embedding of the 1965 real faces by NLM.

Fig. 6.54. Two-dimensional embedding of the 1965 real faces by GNLM.



Fig. 6.55. Two-dimensional embedding of the 1965 real faces by CCA.

Fig. 6.56. Two-dimensional embedding of the 1965 real faces by CDA.



assumption of an underlying manifold does not really hold: data points are
distributed in different clusters, which look stretched in the LLE embedding. As
for other embeddings, similar faces are quite well grouped, but the triangular
shape of the embedding is probably not related to the true shape of the data
cloud. Just like other methods relying on a graph or K-ary neighborhoods,
LLE produces a very sparse embedding.

Fig. 6.57. Two-dimensional embedding of the 1965 real faces by LLE.

The embedding computed by LE (Fig. 6.58) emphasizes the separations


between the main clusters of the data set, which are shrunk. This confirms
the relationship between LE and spectral clustering, which has already been
pointed out in the literature. The resulting embedding is sparse and cuneiform,
as is the one provided by LLE. The different clusters look like spikes radiating
in all directions; computing an embedding of higher dimensionality
would probably allow us to separate the different clusters quite well.
The embedding computed by Isotop is displayed in Fig. 6.59. In contrast
with LLE, the two-dimensional representation given by Isotop is denser and
reveals more facial expressions. The result of Isotop looks very similar to the
embedding computed by CDA: several small clusters can be distinguished.
Transitions between them are usually smooth.
The SOM gives the result illustrated by Fig. 6.60. By construction, the
lattice is not data-driven, and thus the embedding is rectangular. As an ad-
vantage, this allows the SOM to unfold the data set on the largest available
surface. On the other hand, this completely removes any information about the

Fig. 6.58. Two-dimensional embedding of the 1965 real faces by LE.

Fig. 6.59. Two-dimensional embedding of the 1965 real faces by Isotop.



initial shape of the data cloud. Although the SOM yields a visually satisfying
embedding and reveals many details, some shortcomings must be pointed out.
First, similar faces may be distributed in two different places (different re-
gions of the map can be folded close to each other in the data space). Second,
the SOM is the only method that involves a mandatory vector quantization.
Consequently, the points that are displayed as thumbnails are not in the data
set: they are points (or prototypes) of the SOM grid.

Fig. 6.60. Two-dimensional embedding of the 1965 real faces by a 44-by-44 SOM.

Until now, only the visual aspect of the embeddings has been assessed. In
the same way as in Subsection 6.3.1, the mean relative rank errors (Eqs. (6.3)
and (6.4)) can be computed in order to have a quantitative measure of the
neighborhood preservation. The values reached by all reviewed methods are
given in Fig. 6.61 for a number of neighbors ranging between 0 and 15. Metric
MDS clearly performs the worst; this is also the only linear method. The
performance of LLE is not really good either. The best results are achieved
by Isotop and by methods working with graph distances (Isomap, GNLM,
and CDA). GNLM reaches the best tradeoff between MRRE_{Y→X}(K) and
MRRE_{X→Y}(K), followed by Isotop. On the other hand, CDA performs very
well when looking at MRRE_{X→Y}(K) only. Finally, the SOM cannot be
compared directly to the other methods, since it is the only method involving a
predefined lattice and a mandatory vector quantization (errors are computed
on the prototype coordinates). The first error, MRRE_{X→Y}(K), starts at low
values but grows much faster than for other methods when K increases. This
can be explained by the fact that the SOM can be folded on itself in the data
space. On the other hand, MRRE_{Y→X}(K) remains low because neighbors in
the predefined lattice are usually close in the data space, too.

Fig. 6.61. Mean relative rank errors (MRRE_{Y→X}(K) on the left and
MRRE_{X→Y}(K) on the right) of all NLDR methods for the real faces.
7 Conclusions

Overview. In addition to summarizing the key points of the book,
this chapter attempts to establish a generic procedure (or “data flow”)
for the analysis of high-dimensional data. All methods used in the
previous chapters are also classified from several points of view. Next,
some guidelines are given for using the various methods. The chapter
ends by presenting some perspectives for future developments in the
field of nonlinear dimensionality reduction.

7.1 Summary of the book


The main motivations of this book are the analysis and comparison of various
DR methods, with particular attention paid to nonlinear ones. Dimensionality
reduction often plays an important role in the analysis, interpretation, and un-
derstanding of numerical data. In practice, dimensionality reduction can help
one to extract some information from arrays of numbers that would otherwise
remain useless because of their large size. To some extent, the goal consists of
enhancing the readability of data. This can be achieved by visualizing data in
charts, diagrams, plots, and other graphical representations.

7.1.1 The problem

As illustrated in Chapter 1, visualization becomes problematic once the
dimensionality — the number of coordinates or simultaneous observations —
goes beyond three or four. Usual projections and perspective techniques already
reach their limits for only three dimensions! This suggests using other
methods to build low-dimensional representations of data. Beyond visualiza-
tion, dimensionality reduction is also justified from a theoretical point of
view by unexpected properties of high-dimensional spaces. In high dimen-
sions, usual mathematical objects like spheres and cubes behave strangely
and do not share the same nice properties as in the two- or three-dimensional
cases. Other examples are the Euclidean norm, which is nearly useless in high-
dimensional spaces, and the intrinsic sparsity of high-dimensional spaces (the
“empty space phenomenon”). All those issues are usually called the “curse
of dimensionality” and must be taken into account when processing high-
dimensional data.

7.1.2 A basic solution

Historically, one of the first methods intended for the analysis of high-
dimensional data was principal component analysis (PCA), introduced in
Chapter 2. Starting from a data set in matrix form, and under some con-
ditions, this method is able to perform three essential tasks:
• Intrinsic dimensionality estimation. This consists in estimating the
(small) number of hidden parameters, called latent variables, that gener-
ated data.
• Dimensionality reduction. This consists in building a low-dimensional
representation of data (a projection), according to the estimated dimen-
sionality.
• Latent variable separation. This consists of a further transformation of
the low-dimensional representation, such that the latent variables appear
as mutually “independent” as possible.
Obviously, these are very desirable functionalities. Unfortunately, PCA re-
mains a rather basic method and suffers from many shortcomings. For ex-
ample, PCA assumes that observed variables are linear combinations of the
latent ones. According to this data model, PCA just yields a linear projec-
tion of the observed variables. Additionally, the latent variable separation is
achieved by simple decorrelation, explaining the quotes around the adjective
“independent” in the above list.
For more than seven decades, the limitations of PCA have motivated the
development of more powerful methods. Mainly two directions have been ex-
plored: namely, dimensionality reduction and latent variable separation.

7.1.3 Dimensionality reduction

Much work has been devoted to designing methods that are able to reduce the
data dimensionality in a nonlinear way, instead of merely projecting data with
a linear transformation. The first step in that direction was made by refor-
mulating the PCA as a distance-preserving method. This yielded the classical
metric multidimensional scaling (MDS) in the late 1930s (see Table 7.1). Al-
though this method remains linear, like PCA, it is the basis of numerous
nonlinear variants described in Chapter 4. The most widely known ones are
undoubtedly nonmetric MDS and Sammon’s nonlinear mapping (published in
the late 1960s). Further optimizations are possible, by using stochastic tech-
niques, for example, as in curvilinear component analysis (CCA), published
in the early 1990s. Besides this evolution toward more and more complex
algorithms, recent progress has been accomplished in the family of distance-
preserving methods by replacing the usual Euclidean distance with another
metric: the geodesic distance, introduced in the late 1990s. This particular
distance measure is especially well suited for dimensionality reduction. The
unfolding of nonlinear manifolds is made much easier with geodesic distances
than with Euclidean ones.
Geodesic distances, however, cannot be used as such, because they hide a
complex mathematical machinery that would create a heavy computational
burden in practical cases. Fortunately, geodesic distances may be approxi-
mated in a very elegant way by graph distances. To this end, it suffices to
connect neighboring points in the data set, in order to obtain a graph, and
then to compute the graph distances with Dijkstra’s algorithm [53], for in-
stance.
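As an illustration, a possible implementation of this approximation with scikit-learn and SciPy is sketched below; the K-rule value and the way disconnected graphs are handled are choices of ours, not prescriptions from the text.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def graph_distances(X, K=4):
    """Approximate geodesic distances by graph distances: connect each point to
    its K nearest neighbors (edges weighted by Euclidean length) and run
    Dijkstra's algorithm between all pairs of vertices."""
    W = kneighbors_graph(X, n_neighbors=K, mode='distance')   # sparse adjacency
    D = shortest_path(W, method='D', directed=False)          # all-pairs Dijkstra
    if np.isinf(D).any():
        raise ValueError("Disconnected neighborhood graph; increase K.")
    return D
```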
A simple change allows us to use graph distances instead of Euclidean ones
in classical distance-preserving methods. Doing so transforms metric MDS,
Sammon’s NLM, and CCA into Isomap (1998), geodesic NLM (2002), and curvi-
linear distance analysis (2000), respectively. Comparisons on various examples
in Chapter 6 clearly show that the graph distance outperforms the traditional
Euclidean metric. Yet, in many cases and in spite of all its advantages, the
graph distance is not the panacea: it broadens the set of manifolds that can
easily be projected by distance preservation, but it does not help in all cases.
For that reason, the algorithm that manages the preservation of distances retains
its full importance in the dimensionality reduction process. This explains
why the flexibility of GNLM and CDA is welcome in difficult cases where
Isomap can fail.
Distance preservation is not the sole paradigm used for dimensionality
reduction. Topology preservation, introduced in Chapter 5, is certainly more
powerful and appealing but also more difficult to implement. Actually, in order
to be usable, the concept of “topology” must be clearly defined; its translation
from theory to practice does not prove as straightforward as measuring a dis-
tance. Because of that difficulty, topology-preserving methods like Kohonen’s
self-organizing maps appeared later (in the early 1980s) than distance-based
methods. Other methods, like the generative topographic mapping (1995),
may be viewed as principled reformulations of the SOM, within a probabilis-
tic framework. More recent methods, like locally linear embedding (2000) and
Isotop (2002), attempt to overcome some limitations of the SOM.
In Chapter 5, methods are classified according to the way they model the
topology of the data set. Typically, this topology is encoded as neighborhood
relations between points, using a graph that connects the points, for instance.
The simplest solution consists of predefining those relations, without regards
to the available data, as it is done in an SOM and GTM. If data are taken
into account, the topology is said to be data-driven, like with LLE and Isotop.
While data-driven methods generally outperform SOMs for dimensionality
reduction purposes, the latter remains a reference tool for 2D visualization.
Year Method Author(s) & reference(s)
1901 PCA Pearson [149]
1933 PCA Hotelling [92]
1938 classical metric MDS Young & Householder [208]
1943 formal neuron McCulloch & Pitts [137]
1946 PCA Karhunen [102]
1948 PCA Loève [128]
1952 MDS Torgerson [182]
1958 Perceptron Rosenblatt [157]
1959 Shortest paths in a graph Dijkstra [53]
1962 nonmetric MDS Shepard [171]
1964 nonmetric MDS Kruskal [108]
1965 K-means (VQ) Forgy [61]
1967 K-means (VQ) MacQueen [61]
ISODATA (VQ) Ball & Hall [8]
1969 PP Kruskal [109]
NLM (nonlinear MDS) Sammon [109]
1969 Perceptron Minsky & Papert’s paper [138]
1972 PP Kruskal [110]
1973 SOM von der Malsburg [191]
1974 PP Friedman & Tukey [67]
1974 Back-propagation Werbos [201]
1980 LBG (VQ) Linde, Buzo & Gray [124]
1982 SOM (VQ & NLDR) Kohonen [104]
1982 Hopfield network Hopfield [91]
Lloyd (VQ) Lloyd [127]
1984 Principal curves Hastie & Stuetzle [79, 80]
1985 Competitive learning (VQ) Rumelhart & Zipser [162, 163]
1986 Back-propagation & MLP Rumelhart, Hinton & Williams [161, 160]
BSS/ICA Jutten [99, 98, 100]
1991 Autoassociative MLP Kramer [107, 144, 183]
1992 “Neural” PCA Oja [145]
1993 VQP (NLM) Demartines & Hérault [46]
Autoassociative ANN DeMers & Cottrell [49]
1994 Local PCA Kambhatla & Leen [101]
1995 CCA (VQP) Demartines & Hérault [47, 48]
NLM with ANN Mao & Jain [134]
1996 KPCA Schölkopf, Smola & Müller [167]
GTM Bishop, Svensén & Williams [22, 23, 24]
1997 Normalized cut (spectral clustering) Shi & Malik [172, 199]
1998 Isomap Tenenbaum [179, 180]
2000 CDA (CCA) Lee & Verleysen [116, 120]
LLE Roweis & Saul [158]
2002 Isotop (MDS) Lee [119, 114]
LE Belkin & Niyogi [12, 13]
Spectral clustering Ng, Jordan & Weiss [143]
Coordination of local linear models Roweis, Saul & Hinton [159]
2003 HLLE Donoho & Grimes [56, 55]
2004 LPP He & Niyogi [81]
SDE (MDS) Weinberger & Saul [196]
2005 LMDS (CCA) Venna & Kaski [186, 187]
2006 Autoassociative ANN Hinton & Salakhutdinov [89]

Table 7.1. Timeline of DR methods. Major steps in ANN history are given as
milestones. Spectral clustering has been added because of its tight relationship with
spectral DR methods.

7.1.4 Latent variable separation

Starting from PCA, the other direction that can be explored is latent variable
separation. The first step in that direction was made with projection pursuit
(PP; see Table 7.1) [109, 110, 67]. This technique, which is widely used in
exploratory data analysis, aims at finding “interesting” (linear) one- or two-
dimensional projections of a data set. Axes of these projections can then be
interpreted as latent variables. A more recent approach, initiated in the late
1980s by Jutten and Hérault [99, 98, 100], led to the flourishing development
of blind source separation (BSS) and independent component analysis (ICA).
These fields propose more recent but also more principled ways to tackle the
problem of latent variable separation. In contrast with PCA, BSS and ICA can
go beyond variable decorrelation: most methods involve an objective function
that can be related to statistical independence.
In spite of its appealing elegance, latent variable separation does not fit
in the scope of this book. The reason is that most methods remain limited to
linear data models. Only GTM can be cast within that framework: it is one
of the rare NLDR methods that propose a latent variable model, i.e. one that
considers the observed variables to be functions of the latent ones. Most other
methods follow a more pragmatic strategy and work in the opposite direction,
by finding any set of variables that give a suitable low-dimensional represen-
tation of the observed variables, regardless of the true latent variables. It is
noteworthy, however, that GTM involves a nonlinear mapping and therefore
offers no guarantee of recovering the true latent variables either, despite its
more complex data model.
More information on projection pursuit can be found in [94, 66, 97]. For
BSS and ICA, many details and references can be found in the excellent book
by Hyvärinen, Karhunen, and Oja [95].

7.1.5 Intrinsic dimensionality estimation

Finally, an important key to the success of both dimensionality reduction and
latent variable separation resides in the right estimation of the intrinsic di-
mensionality of data. This dimensionality indicates the minimal number of
variables or free parameters that are needed to describe the data set with-
out losing the information it conveys. The word “information” can be un-
derstood in many ways: it can be the variance in the context of PCA, for
instance. Within the framework of manifold learning, it can also be the mani-
fold “structure” or topology; finding the intrinsic dimensionality then amounts
to determining the underlying manifold dimensionality. Chapter 3 reviews a
couple of classical methods that can estimate the intrinsic dimensionality of a
data set. A widely used approach consists of measuring the fractal dimension
of the data set. Several fractal dimensions exist: the best-known ones are
the correlation dimension and the box-counting dimension. These measures
come from subdomains of physics, where they are used to study dynamical
systems. Although they are often criticized in the physics literature, their
claimed shortcomings do not really matter within the framework of dimen-
sionality reduction. It is just useful to know that fractal dimensions tend to
underestimate the true dimensionality [174] and that noise may pollute the
estimation. But if the measure of the fractal dimension fails, then it is very
likely that the data set is insufficient or too noisy and that any attempt to
reduce the dimensionality will fail, too.
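As a rough illustration of how such a fractal estimate can be obtained in practice, one can regress log C(r) on log r, where C(r) is the fraction of point pairs closer than r; the sketch below is a simplified version of this classical procedure, not the exact one used in Chapter 3, and the choice of radii is left to the user.

```python
import numpy as np

def correlation_dimension(X, radii):
    """Estimate the correlation dimension as the slope of log C(r) versus log r,
    where C(r) is the fraction of point pairs closer than r.
    `radii` should lie in the scaling region so that C(r) > 0 everywhere."""
    radii = np.asarray(radii, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d = D[np.triu_indices(X.shape[0], k=1)]        # each pair counted once
    C = np.array([(d < r).mean() for r in radii])
    mask = C > 0                                   # avoid log(0) outside the scaling region
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(C[mask]), 1)
    return slope
```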
Other methods to estimate the intrinsic dimensionality are also reviewed
in Chapter 3. For example, some DR methods can also be used to estimate
the intrinsic dimensionality: they are run iteratively, with a decreasing target
dimension, until they fail. The intrinsic dimensionality may then be assumed
to be equal to the smallest target dimension before failure. This is a “trial-
and-error” approach. Obviously, this way of estimating the dimensionality
largely depends on the method used to reduce the dimensionality. The result
may vary significantly just by changing the type of dimensionality reducer
(distance- or topology-preserving method, Euclidean or graph distance, etc.).
Moreover, the computational cost of repeating the dimensionality reduction
to obtain merely the dimensionality may rapidly become prohibitive. From
this point of view, methods that can build projections in an incremental way
(see Subsection 2.5.7), such as PCA, Local PCA, Isomap, or SDE, appear as
the best compromise because a single run suffices to determine the projections
of all possible dimensionalities at once. In contrast with fractal dimensions,
the trial-and-error technique tends to overestimate the true intrinsic dimen-
sionality.
Section 3.4 compares various methods for the estimation of the intrinsic
dimensionality. The correlation dimension (or another fractal dimension) and
local PCA give the best results on the proposed data sets. Indeed, these meth-
ods are able to estimate the dimensionality on different scales (or resolutions)
and thus yield more informative results.

7.2 Data flow


This section proposes a generic data flow for the analysis of high-dimensional
data. Of course, the accent is put on the word “generic”: the proposed pattern
must obviously be particularized to each application. However, it provides a
basis that has been proven effective in many cases.
As a starting point, it is assumed that data consist of an unordered set of
vectors. All vector entries are real numbers; there are no missing data.

7.2.1 Variable selection

The aim of this first step is to make sure that all variables or signals in the
data set convey useful information about the phenomenon of interest. Hence,
if some variables or signals are zero or are related to another phenomenon, a
variable selection must be achieved beforehand, in order to discard them. To
some extent, this selection is a “binary” dimensionality reduction: each ob-
served variable is kept or thrown away. Variable selection methods are beyond
the scope of this book; this topic is covered in, e.g., [2, 96, 139].

7.2.2 Calibration

This second step aims at “standardizing” the variables. When this is required,
the average of each variable is subtracted. Variables can also be scaled if
needed. The division by the standard deviation is useful when the variables
come from various origins. For example, meters do not compare with kilo-
grams, and kilometers do not with grams. Scaling the variables helps to make
them more comparable.
Sometimes, however, the standardization can make things worse. For ex-
ample, an almost-silent signal becomes pure noise after standardization. Obvi-
ously, the knowledge that it was silent is important and should not be lost. In
the ideal case, silent signals and other useless variables are eliminated by the
above-mentioned variable selection. Otherwise, if no standardization has been
performed, further processing methods can still remove almost-zero variables.
(See Subsection 2.4.1 for a more thorough discussion.)
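A minimal NumPy sketch of this calibration step follows; the tolerance used to detect almost-silent variables is an arbitrary choice made for illustration.

```python
import numpy as np

def calibrate(X, scale=True, tol=1e-12):
    """Subtract the mean of each variable (column) and optionally divide by its
    standard deviation; variables with (near-)zero variance are left unscaled,
    so that almost-silent signals are not blown up into pure noise."""
    Xc = X - X.mean(axis=0)
    if scale:
        s = Xc.std(axis=0)
        s[s < tol] = 1.0
        Xc = Xc / s
    return Xc
```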

7.2.3 Linear dimensionality reduction

When data dimensionality is very high, linear dimensionality reduction by
PCA may be very useful to suppress a large number of useless dimensions.
Indeed, PCA clearly remains one of the best techniques for “hard” dimension-
ality reduction. For this step, the strategy consists in eliminating the largest
number of variables while maintaining the reconstruction error very close to
zero. This is achieved in order to make the operation as “transparent” as
possible, i.e., nearly reversible. This also eases the work to be achieved by
subsequent nonlinear methods (e.g., for a further dimensionality reduction).
If the dimensionality is not too high, or if linear dimensionality reduction
causes a large reconstruction error, then PCA may be skipped.
In some cases, whitening can also be used [95]. Whitening, also known as
sphering, is closely related to PCA. In the latter, the data space is merely
rotated, using an orthogonal matrix, and the decorrelated variables having
a variance close to zero are discarded. In whitening an additional step is
used for scaling the decorrelated variables, in order to end up with unit-
variance variables. This amounts to performing a standardization, just as
described above, after PCA instead of before. Whereas the rotation involved
in PCA does not change pairwise distances in the data set, the additional
transformation achieved by whitening does, like the standardization. For zero-
mean variables, Euclidean distances measured after whitening are equivalent
to Mahalanobis distances measured in the raw data set (with the Mahalanobis
matrix being the inverse of the data set covariance matrix).
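The following SVD-based sketch illustrates this preprocessing; the retained fraction of variance (99.9%) is only an illustrative threshold for keeping the reconstruction error close to zero, and the function name is ours.

```python
import numpy as np

def pca_reduce(X, var_kept=0.999, whiten=False):
    """Linear 'hard' dimensionality reduction: rotate the centered data onto its
    principal axes and keep just enough of them to retain `var_kept` of the
    variance (i.e., a reconstruction error close to zero). With whiten=True the
    retained components are additionally rescaled to unit variance (sphering)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (Xc.shape[0] - 1)
    p = int(np.searchsorted(np.cumsum(var) / var.sum(), var_kept)) + 1
    Y = Xc @ Vt[:p].T                  # decorrelated coordinates
    if whiten:
        Y = Y / np.sqrt(var[:p])       # unit-variance components
    return Y
```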

7.2.4 Nonlinear dimensionality reduction

Nonlinear methods of dimensionality reduction may take over from PCA once
the dimensionality is no longer too high, between a few tens and a few hun-
dreds, depending on the chosen method. The use of PCA as preprocessing is
justified by the fact that most nonlinear methods remain more sensitive to the
curse of dimensionality than PCA due to their more complex model, which
involves many parameters to identify.
Typically, nonlinear dimensionality reduction is the last step in the data
flow. Indeed, the use of nonlinear methods transforms the data set in such a
way that latent variable separation becomes difficult or impossible.

7.2.5 Latent variable separation


In the current state of the art, most methods for latent variable separation are
incompatible with nonlinear dimensionality reduction. Hence, latent variable
separation appears more as an alternative step to nonlinear dimensionality
reduction than a subsequent one. The explanation is that most methods for
latent variable separation like ICA assume in fact that the observed variables
are linear combinations of the latent ones [95]. Without that assumption, these
methods can no longer rely on statistical independence as a criterion to separate
the variables. Only a few methods can cope with restricted forms of nonlinear
mixtures, such as postnonlinear mixtures [205, 177].

7.2.6 Further processing


Once dimensionality reduction or latent variable separation is achieved, the
transformed data may be further processed, depending on the targeted appli-
cation. This can range from simple visualization to automated classification
or function approximation. In the two last cases, unsupervised learning is
followed by supervised learning.
In summary, the proposed generic data flow for the analysis of high-
dimensional data goes through the following steps:
1. (Variable selection.) This step allows the suppression of useless vari-
ables.
2. Calibration. This step gathers all preprocessings that must or may be
applied to data (mean subtraction, scaling, or standardization, etc.).
3. Linear dimensionality reduction. This step usually consists of per-
forming PCA (data may be whitened at the same time, if necessary).
4. Nonlinear dimensionality reduction and/or latent variable sep-
aration. These (often incompatible) steps are the main ones; they allow
us to find “interesting” representations of data, by optimizing either the
number of required variables (nonlinear dimensionality reduction) or their
independence (latent variable separation).
5. (Further processing.) Visualization, classification, function approxima-
tion, etc.
Steps between parentheses are topics that are not covered in this book.

7.3 Model complexity


It is noteworthy that in the above-mentioned data flow, the model complexity
grows at each step. For example, if N observations of D variables or signals are
available, the calibration determines D means and D standard deviations; the
time complexity to compute them is then O(DN). Next, for PCA, the covariance
matrix contains D(D+1)/2 independent entries and the time complexity
to compute them is O(D²N). Obtaining a P-dimensional projection requires
O(PDN) additional operations.
Things become worse for nonlinear dimensionality reduction. For example,
a typical distance-preserving method requires N(N−1)/2 memory entries to
store all pairwise distances. The time complexity to compute them is O(DN²),
at least for Euclidean distances. For graph distances, the time complexity
grows further, to O(N² log N). In order to obtain a P-dimensional
embedding, an NLDR method relying on a gradient descent, such as Sammon’s
NLM, requires O(PN²) operations for a single iteration. On the other
hand, a spectral method requires the same amount of operations per iteration,
but the eigensolver has the advantage of converging much faster.
To some extent, progress of NLDR models and methods seems to be related
not only to science breakthroughs but also to the continually increasing power
of computers, which allows us to investigate directions that were previously
out of reach from a practical point of view.

7.4 Taxonomy
Figure 7.1 presents a nonexhaustive hierarchy tree of some unsupervised data
analysis methods, according to their purpose (latent variable separation or
dimensionality reduction). This figure also gives an overview of all methods
described in this book, which focuses on nonlinear dimensionality reduction
based mainly on “geometrical” concepts (distances, topology, neighborhoods,
manifolds, etc.).
Two classes of NLDR methods are distinguished in this book: those trying
to preserve pairwise distances measured in the data set and those attempting
to reproduce the data set topology. This distinction may seem quite arbi-
trary, and other ways to classify the methods exist. For instance, methods
can be distinguished according to their algorithmic structure. In the latter
case, spectral methods can be separated from those relying on iterative opti-
mization schemes like (stochastic) gradient ascent/descent. Nevertheless, this
last distinction seems to be less fundamental.
Actually, it can be observed that all distance-preserving methods involve
pairwise distances either directly (metric MDS, Isomap) or with some kind of
weighting (NLM, GNLM, CCA, CDA, SDE). In (G)NLM, this weighting is
proportional to the inverse of the Euclidean (or geodesic) distances measured
in the data space, whereas a decreasing function of the Euclidean distances
in the embedding space is used in CCA and CDA. For SDE, only Euclidean
distances to the K nearest neighbors are taken into account, while others are
simply forgotten and replaced by those determined during the semidefinite
programming step.
Fig. 7.1. Methods for latent variable separation and dimensionality reduction: a
nonexhaustive hierarchy tree. Acronyms: PCA, principal component analysis; BSS,
blind source separation; PP, projection pursuit; NLDR, nonlinear dimensionality
reduction; ICA, independent component analysis; AA NN, auto-associative neural
network; PDL, predefined lattice; DDL, data-driven lattice. Methods are shown as
tree leaves.

On the other hand, in topology-preserving methods pairwise distances are
never used directly. Instead they are replaced with some kind of similarity
measure, which most of the time is a decreasing function of the pairwise
distances. For instance, in LE only distances to the K nearest neighbors are
involved; next the heat kernel is applied to them (possibly with an infinite
temperature) and the Laplacian matrix is computed. This matrix is such that
off-diagonal entries in a row or column are always lower than the corresponding
entry on the diagonal. A similar reasoning leads to the same conclusion for
LLE. In an SOM or Isotop, either grid distances (in the embedding space) or
graph distances (in the data space) are used as the argument of a Gaussian
kernel.
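To make this concrete, a small sketch of the affinity and Laplacian construction is given below; it assumes a precomputed matrix of pairwise Euclidean distances, and the parameter names are ours.

```python
import numpy as np

def heat_kernel_laplacian(D, K=4, t=1.0):
    """Affinity matrix W from K-ary neighborhoods weighted by the heat kernel
    exp(-d^2 / t) (t -> infinity gives a plain 0/1 adjacency), and the
    unnormalized graph Laplacian L = diag(W 1) - W."""
    N = D.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D[i])[1:K + 1]            # K nearest neighbors of point i
        W[i, nbrs] = np.exp(-D[i, nbrs] ** 2 / t)
    W = np.maximum(W, W.T)                          # symmetrize the neighborhood graph
    L = np.diag(W.sum(axis=1)) - W
    return W, L
```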
In the case of spectral methods, this distinction between distance and
topology preservation has practical consequences. In all distance-preserving
methods, the eigensolver is applied to a dense matrix, whose entries are either
Euclidean distances (metric MDS), graph distances (Isomap), or distances
optimized by means of semidefinite programming (SDE). In all these methods,
the top eigenvectors are sought (those associated with the largest eigenvalues).
To some extent, these eigenvectors form the solution of a maximization prob-
lem; in the considered case, the problem primarily consists of maximizing the
variance in the embedding space, which is directly related to the associated
eigenvalues.¹
The situation gets reversed for topology-preserving methods. Most of the
time, in this case, the eigensolver is applied to a sparse matrix and the bot-
tom eigenvectors are sought, those associated with the eigenvalues of lowest
(but nonzero) magnitude. These eigenvectors identify the solution of a min-
imization problem. The objective function generally corresponds to a local
reconstruction error or distortion measure (see Subsections 5.3.1 and 5.3.2
about LLE and LE, respectively).
A duality relationship can be established between those maximizations
and minimizations involving, respectively, dense and sparse Gram-like ma-
trices, as sketched in [164] and more clearly stated in [204]. Hence, to some
extent, distance and topology preservation are different aspects or ways to
formulate the same problem. Obviously, it would be erroneous to conclude
from the unifying theoretical framework described in [204] that all spectral
methods are equivalent in practice! This is not the case; experimental results
in Chapter 6 show the large variety that can be observed among their results.
For instance, following the reasoning in [164, 78], it can easily be demon-
strated that under some conditions, the bottom eigenvectors of a Laplacian
matrix (except the last one associated with a null eigenvalue) correspond to
the leading eigenvectors obtained from a double-centered matrix of pairwise
commute-time distances (CTDs). It can be shown quite easily that CTDs
respect all axioms of a distance measure (Subsection 4.2.1). But unlike Eu-
clidean distances, CTDs cannot be computed simply by knowing coordinates
of two points. Instead, CTDs are more closely related to graph distances,
in the sense that other known points are involved in the computation.
Actually, it can be shown that the easiest way to obtain the matrix of pair-
wise CTDs consists of computing the pseudo-inverse of a Laplacian matrix.
As a direct consequence, the leading eigenvectors of the CTD matrix precisely
correspond to the bottom eigenvectors of the Laplacian matrix, just as stated
above. Similarly, eigenvalues of the CTD matrix are inversely proportional
to those of the Laplacian matrix. Therefore, even if distance- and topology-
preserving spectral methods are equivalent from a theoretical viewpoint, it
remains useful to keep both frameworks in practice, as the formulation of a
particular method can be made easier or more natural in the one or the other.
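As an illustration of this last remark, commute-time distances can be obtained directly from the pseudo-inverse of the Laplacian; the sketch below follows a common convention that takes the square root of the commute time scaled by the graph volume, which is one of several equivalent scalings found in the literature.

```python
import numpy as np

def commute_time_distances(W):
    """Pairwise commute-time distances from a dense, symmetric affinity matrix W,
    via the Moore-Penrose pseudo-inverse of the Laplacian L = diag(W 1) - W.
    Returns sqrt(vol * (L+_ii + L+_jj - 2 L+_ij))."""
    L = np.diag(W.sum(axis=1)) - W
    Lp = np.linalg.pinv(L)                          # pseudo-inverse of the Laplacian
    d = np.diag(Lp)
    vol = W.sum()                                   # graph volume (sum of edge weights)
    sq = np.maximum(d[:, None] + d[None, :] - 2.0 * Lp, 0.0)   # guard tiny negatives
    return np.sqrt(vol * sq)
```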
¹ Of course, this maximization of the variance can easily be reformulated into a minimization of the reconstruction error, as observed in Subsection 2.4.2.

7.4.1 Distance preservation

Distance-preserving methods can be classified according to the distance and
the algorithm they use. Several kinds of distances and kernels can be used.
Similarly, the embedding can be obtained by three different types of algo-
rithms. The first one is the spectral decomposition found in metric MDS. The
second one is the quasi-Newton optimization procedure implemented in Sam-
mon’s NLM. The third one is the stochastic optimization procedure proposed
in curvilinear component analysis. Table 7.2 shows the different combinations
that have been described in the literature. The table also proposes an alternative
naming convention. The first letter indicates the distance or kernel type
(E for Euclidean, G for geodesic/graph, C for commute-time distance, K for
fixed kernel, and O for optimized kernel), whereas the three next letters refer
to the algorithm (MDS for spectral decomposition, NLM for quasi-Newton
optimization, and CCA for CCA-like stochastic gradient descent).

Table 7.2. Classification of distance-preserving NLDR methods, according to the
distance/kernel function and the algorithm. A unifying naming convention is pro-
posed (the real name of each method is given between parentheses).

                         MDS algorithm       NLM algorithm   CCA algorithm
Euclidean                EMDS (metric MDS)   ENLM (NLM)      ECCA (CCA)
Geodesic (Dijkstra)      GMDS (Isomap)       GNLM            GCCA (CDA)
Commute time             CMDS (LE)
Fixed kernel             KMDS (KPCA)
Optimized kernel (SDP)   OMDS (SDE/MVU)
It is noteworthy that most methods have an intricate name that often
gives few or no clues about their principle. For instance, KPCA is by far
closer to metric MDS than to PCA. While SDE stands for semidefinite em-
bedding, it should be remarked that all spectral methods compute an em-
bedding from a positive semidefinite Gram-like matrix. The author of SDE
renamed his method MVU, standing for maximum variance unfolding [197];
this new name does not shed any new light on the method: all MDS-based
methods (like Isomap and KPCA, for instance) yield an embedding having
maximal variance. The series can be continued, for instance, with PCA (pric-
ipal component analysis), CCA (curvilinear component analysis), and CDA
(curvilinear distances analysis), whose names seem to be designed to ensure
a kind of filiation while remaining rather unclear about their principle. The
name Isomap, which stands for Isometric feature mapping [179], is quite un-
clear too, since all distance-preserving NLDR methods attempt to yield an
isometric embedding. Unfortunately, in most practical cases, perfect isometry
is not reached.
Looking back at Table 7.2, the third and fourth rows contain methods that
were initially not designed as distance-preserving methods. Regarding KPCA,
the kernels listed in [167] are given for their theoretical properties, without any
geometrical justification. However, the application of the kernel is equivalent
to mapping data to a feature space in which a distance-preserving embedding
is found by metric MDS. In the case of LE, the duality described in [204] and
the connection with commute-time distances detailed in [164, 78] allow it to
occupy a table entry.
Finally, it should be remarked that the bottom right corner of Table 7.2
contains many empty cells that could give rise to new methods with potentially
good performances.

7.4.2 Topology preservation

As proposed in Chapter 5, topology-preserving methods fall into two classes.
On one hand, methods like Kohonen’s self-organizing maps and Svensén’s
GTM map data to a discrete lattice that is predefined by the user. On the
other hand, more recent techniques like LLE and Isotop automatically build
a data-driven lattice, meaning that the shape of the lattice depends on the
data and is entirely determined by them. In most cases this lattice is a graph
induced by K-ary or ε-ball neighborhoods, which provides a good discrete
approximation of the underlying manifold topology. From that point of view,
the corresponding methods can be described as graph- or manifold-driven.
Table 7.3 is an attempt to classify topology-preserving methods as it was
done in Table 7.2 for distance-preserving methods. If PCA is interpreted as
a method that fits a plane in the data space in order to capture as much
variance as possible after projection on this plane, then to some extent running
an SOM can be seen as a way to fit a nonrigid (or articulated) piece of
plane within the data cloud. Isotop, on the contrary, follows a similar strategy
in the opposite direction: an SOM-like update rule is used for embedding a
graph deduced from the data set in a low-dimensional space.

Table 7.3. Classification of topology-preserving NLDR methods, according to the
kind of lattice and algorithm. The acronyms ANN, MLE, and EM in the first row
stand for artificial neural network, maximum likelihood estimation, and expectation-
maximization, respectively.

                        ANN-like   MLE by EM   Spectral
Predefined lattice      SOM        GTM
Data-driven lattice     Isotop                 LLE, LE
In essence, GTM solves the problem in a similar way as an SOM would.
The approach is more principled, however, and the resulting algorithm works
in a totally different way. A generative model is used, in which the latent space
is fixed a priori and whose mapping parameters are identified by statistical
inference.
Finally, topology-preserving spectral methods, like LLE and LE, develop
a third approach to the problem. They build what can be called an “affinity”
matrix [31], which is generally sparse; after double-centering or application of
the Laplacian operator, some of its bottom eigenvectors form the embedding.
A relationship between LLE and LE is established in [13], while the duality
described in [204] allows us to relate both methods to distance-preserving
spectral methods.

7.5 Spectral methods


Since 2000 most recent NLDR methods have been spectral, whereas methods
based on sophisticated (stochastic) gradient ascent/descent were very popular
during the previous decades. Most of these older methods were designed and
developed in the community of artificial neural networks (ANNs). Methods
like Kohonen’s SOM and Demartines’ VQP [46] (the precursor of CCA) share
the same distributed structure and algorithmic scheme as other well-known
ANNs like the multilayer perceptron.
Since the late 1990s, the ANN community has evolved and split into two
groups. The first one has joined the fast-growing and descriptive field of neu-
rosciences, whereas the second has been included in the wide field of machine
learning, which can be seen as a fundamental and theory-oriented emana-
tion of data mining. The way to tackle problems is indeed more formal in
machine learning than it was in the ANN community and often resorts to
concepts coming from statistics, optimization and graph theory, for example.
The growing interest in spectral methods stems at least partly from this differ-
ent perspective. Spectral algebra provides an appealing and elegant framework
where NLDR but also clustering or other optimization problems encountered
in data analysis can be cast within. From the theoretical viewpoint, the cen-
tral problem of spectral algebra, i.e., finding eigenvalues and eigenvectors of
a given matrix, can be translated into a convex objective function to be op-
timized. This guarantees the existence of a unique and global maximum; in
addition, the solutions to the problem are also “orthogonal” in many cases,
giving the opportunity to divide the problem into subproblems that can be
solved successively, yielding the eigenvectors one by one. Another advantage
of spectral algebra is to associate practice with theory: rather robust and

efficient eigensolvers, with well-studied properties, are widely available.
This pretty picture hides some drawbacks, however. The first to be men-
tioned is the “rigidity” of the framework provided by spectral algebra. In
the case of NLDR, all spectral methods can be interpreted as performing
metric MDS on a double-centered kernel matrix instead of a Gram ma-
trix [31, 167, 203, 78, 15, 198, 17]. This kernel matrix is generally built in
one of the following two ways:
• Apply a kernel function to the matrix of squared Euclidean pairwise dis-
tances or directly to the Gram matrix of dot products.
• Replace the Euclidean distance with some other distance function in the
matrix of squared pairwise distances.
The first approach is followed in KPCA and SDE, and the second one in
Isomap, for which graph distances are used instead of Euclidean ones. Even-
tually, the above-mentioned duality (Section 7.4) allows us to relate methods
working with bottom eigenvectors of sparse matrices, like LLE, LE, and their
variants, to the same scheme. Knowing also that metric MDS applied to a
matrix of pairwise Euclidean distances yields a linear projection of the data
set, it appears that the ability of all recent spectral methods to provide a
nonlinear embedding actually relies merely on the chosen distance or kernel
function. In other words, the first step of a spectral NLDR method consists of
building a kernel matrix (under some conditions) or a distance matrix (with
a non-Euclidean distance), which amounts to implicitly mapping data to a
feature space in a nonlinear way. In the second step, metric MDS enables us
to compute a linear projection from the feature space to an embedding space
having the desired dimensionality. Hence, although the second step is purely
linear, the succession of the two steps yields a nonlinear embedding.
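To make the two-step structure explicit, the shared second step can be sketched as follows; the method-specific first step is whatever produces the matrix of squared "distances" (Euclidean for metric MDS, graph distances for Isomap, and so on), and the function name is ours.

```python
import numpy as np

def mds_from_squared_distances(D2, dim=2):
    """Classical metric MDS step shared by spectral NLDR methods:
    double-center a matrix of squared pairwise 'distances' and embed the points
    with the eigenvectors associated with the largest eigenvalues."""
    N = D2.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N             # centering matrix
    B = -0.5 * J @ D2 @ J                           # Gram-like matrix
    w, V = np.linalg.eigh(B)                        # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]                 # keep the top ones
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```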
A consequence of this two-step data processing is that for most methods
the optimization (performed by the eigensolver) occurs in the second step
only. This means that the nonlinear mapping involved in the first step (or
equivalently the corresponding kernel) results from a more or less “arbitrary”
user’s choice. This can explain the large variety of spectral methods described
in the literature: each of them describes a particular way to build a Gram-like
matrix before computing its eigenvectors. A gradation of the “arbitrariness”
of the data transformation can be established as follows.
The most arbitrary transformation is undoubtedly the one involved in
KPCA. Based on kernel theory, KPCA requires the data transformation to
be induced by a kernel function. Several kernel functions in agreement with
the theory are proposed in [167], but there is no indication about which one to
choose. Moreover, most of these kernels involve one or several metaparameters;
again, no way to determine their optimal value is given.
In LLE, LE, Isomap, and their variants, the data transformation is based
on geometric information extracted from data. In all cases, K-ary neighbor-
hoods or ε-balls are used to induce a graph whose edges connect the data
points. This graph, in turn, brings a discrete approximation of the underlying
manifold topology/structure. In that perspective, it can be said that LLE, LE
and Isomap are graph- or manifold-driven. They induce a “data-dependent”
kernel function [15]. In practice, this means that the kernel value stored at
row i and column j in the Gram-like matrix does not depend only on the ith
and jth data points, but also on other points in the data set [78]. Since the
Gram-like matrix built by those methods is an attempt to take into account
the underlying manifold topology, it can be said that the induced data mapping
is less “arbitrary” than in the case of KPCA. Nevertheless, the Gram-like
matrix still depends on K or ε; changing the value of these parameters modifies
the induced data mapping, which can lead to a completely different
embedding. LLE seems to be particularly sensitive to these parameters, as
witnessed in the literature [166, 90]. In this respect, the numerical stability of
the eigenvectors computed by LLE can also be questioned [30] (the tail of the
eigenspectrum is flat, i.e., the ratio of successive lowest-amplitude eigenvalues
is close to one).
The highest level in the proposed gradation is eventually reached by SDE.
In contrast with all other spectral methods, SDE is (currently) the only one
that optimizes the Gram-like matrix before the metric MDS step. In Isomap,
Euclidean distances within the K-ary neighborhoods are kept (by construc-
tion, graph distances in that case reduce to Euclidean distances), whereas
distances between non-neighbors are replaced with graph distances, assum-
ing that these distances subsequently lead to a better embedding. In SDE,
distances between nonneighbors can be seen as parameters whose value is op-
timized by semidefinite programming. This optimization scheme is specifically
designed to work in matrix spaces, accepts equality or inequality constraints
on the matrix entries, and ensures that their properties are maintained (e.g.,
positive or negative semidefiniteness). An additional advantage of semidefinite
programming is that the objective function is convex, ensuring the existence
of a unique global maximum. In the case of a convex developable manifold, if
the embedding provided by SDE is qualified as optimal, then Isomap yields
a good approximation of it, although it is generally suboptimal [204] (see an
example in Fig. 6.6). In the case of a nonconvex or nondevelopable manifold,
the two methods behave very differently.
In summary, the advantages of spectral NLDR methods are as follows: they
benefit from a strong and sound theoretical framework; eigensolvers are also
efficient, leading to methods that are generally fast. Once the NLDR prob-
lem is cast within the framework, a global optimum of the objective function
can be reached without implementing any “exotic” optimization procedure: it
suffices to call an eigensolver, which can be found in many software toolboxes
or libraries. However, it has been reported that Gram-like matrices built from
sparse graphs (e.g., K-ary neighborhoods) can lead to ill-conditioned and/or
ill-posed eigenproblems [30]. Moreover, the claim that spectral methods pro-
vide a global optimum is true but hides the fact that the actual nonlinear
transformation of data is not optimized, except in SDE. In the latter case, the
kernel optimization unfortunately induces a heavy computational burden.

7.6 Nonspectral methods


Experimentally, nonspectral methods reach a better tradeoff between flexibil-
ity and computation time than spectral methods. The construction of meth-
ods like NLM or CCA/CDA consists of defining an objective function and
then choosing an adequate optimization technique. This goes in the opposite
direction of the thought process behind spectral methods, for which the ob-
jective function is chosen with the a priori idea to translate it easily into an
eigenproblem.
Hence, more freedom is left in the design of nonspectral methods than
in that of spectral ones. The price to pay is that although they are efficient
in practice, they often give few theoretical guarantees. For instance, meth-
ods relying on a (stochastic) gradient ascent/descent can fall into local optima,
unlike spectral methods. The optimization techniques can also involve meta-
parameters like step sizes, learning rates, stopping criteria, tolerances, and
numbers of iterations. Although most of these parameters can be left to their
default values, some of them can have a nonnegligible influence on the final
embedding.
This is the case of the so-called neighborhood width involved in CCA/CDA,
SOMs and Isotop. Coming from the field of artificial neural networks, these
methods all rely on stochastic gradient ascent/descent or resort to update
rules that are applied in the same way. As usual in stochastic update rules,
the step size or learning rate is scheduled to slowly decrease from one iter-
ation to the other, in order to reach convergence. In the above-mentioned
algorithms, the neighborhood width is not kept constant and follows a similar
schedule. This makes their dynamic behavior quite difficult to analyze, since
the optimization process is applied to a varying objective function that de-
pends on the current value of the neighborhood width. Things are even worse
in the cases of SOMs and Isotop, since the update rules are empirically built:
there exists no objective function from which the update rules can be derived.
Since the objective function, when it exists, is nonconstant and depends on
the neighborhood width, it can be expected that embeddings obtained with these methods will depend on that parameter, independently of the fact that the method can get stuck in local optima.
The presence of metaparameters like the neighborhood width can also be
considered an advantage, to some extent. It does provide additional flexibility
to nonspectral methods in the sense that for a given method, the experienced
user can obtain different behaviors just by changing the values of the meta-
parameters.
Finally, it is noteworthy that the respective strategies of spectral and non-
spectral NLDR methods completely differ. Most spectral methods usually
transform data (nonlinearly when building the Gram-like matrix, and then
linearly when solving the eigenproblem) before pruning the unnecessary di-
mensions (eigenvectors are discarded). In contrast, most nonspectral methods
start by initializing mapped data vectors in a low-dimensional space and then
rearrange them to optimize some objective function.

7.7 Tentative methodology


Throughout this book, some examples and applications have demonstrated
that the proposed analysis methods efficiently tackle the problems raised by
high-dimensional data. This section is an attempt to guide the user through
the large variety of NLDR methods described in the literature, according to
characteristics of the available data.
A first list of guidelines can be given according to the shape, nature, or
properties of the manifold to embed. In the case of ...
• slightly curved manifolds. Use a linear method like PCA or metric
MDS; alternatively, NLM offers a good tradeoff between robustness and
reproducibility and gives the ability to provide a nonlinear embedding.
• convex developable manifolds. Use methods relying on geodesic/graph
distances (Isomap, GNLM, CDA) or SDE. Conditions to observe convex
and developable manifolds in computer vision are discussed in [54].
• nonconvex developable manifolds. Do not use Isomap; use GNLM or
CDA instead; SDE works well, too.
• nearly developable manifolds. Do not use Isomap or SDE; it is better to use GNLM or CDA instead.
• other manifolds. Use GNLM or preferably CDA. Topology-preserving
methods can be used too (LLE, LE, Isotop).
• manifolds with essential loops. Use CCA or CDA; these methods are
able to tear the manifold, i.e., break the loop. The tearing procedure pro-
posed in [121] can also break essential loops and make data easier to embed
with graph-based methods (Isomap, SDE, GNLM, CDA, LLE, LE, Isotop).
• manifolds with essential spheres. Use CCA or CDA. The abovemen-
tioned tearing procedure is not able to “open” essential spheres.
• disconnected manifolds. This remains an open question. Most meth-
ods do not explicitly handle this case. The easiest solution is to build an
adjacency graph, detect the disconnected components, and embed them
separately. Of course, “neighborhood relationships” between the compo-
nents are lost in the process.
• clustered data. In this case the existence of one or several underlying
manifolds must be questioned. If the clusters do not have a low intrin-
sic dimension, the manifold assumption is probably wrong (or useless).
Then use clustering algorithms, like spectral clustering or preferably other
techniques like hierarchical clustering.

Guidelines can also be given according to the data set’s size. In the case
of ...
• Large data set. If several thousands of data points are available (N >
2000), most NLDR methods will generate a heavy computational burden
because of their time and space complexities, which are generally propor-
tional to N 2 (or even higher for the computation time). It is then useful
to reduce the data set’s size, at least to perform some preliminary steps.
The easiest way to obtain a smaller data set consists of resampling the available one, i.e., drawing a subset of points at random. Obviously, this is not optimal, since the drawn subsample may, by bad luck, fail to be representative of the whole data set. Some
examples throughout this book have shown that a representative subset
can be determined using vector quantization techniques, like K-means and
similar methods.
• Medium-sized data set. If several hundreds of data points are available (200 <
N ≤ 2000), most NLDR methods can be applied directly to the data set,
without any size reduction.
• Small data set. When fewer than 200 data points are available, the use of
most NLDR methods becomes questionable, as the limited amount of data
could be insufficient to identify the large number of parameters involved in
many of these methods. Using PCA or classical metric MDS often proves
to be a better option.
The dimensionality of data, along with the target dimension, can also be
taken into account. In case of a ...
• very high data dimensionality. For more than 50 dimensions (D > 50),
NLDR methods can suffer from the curse of dimensionality, get confused,
and provide meaningless results. It can then be wise first to apply PCA or
metric MDS in order to perform a hard dimensionality reduction. These
two methods can considerably decrease the data dimensionality without
losing much information (in terms of measured variance, for instance). De-
pending on the data set’s characteristics, PCA or metric MDS can also help
attenuate statistical noise in data. After PCA/MDS, a nonlinear method
can be used with more confidence (see the two next cases) in order to
further reduce the dimensionality.
• high data dimensionality. For a few tens of dimensions (5 < D ≤ 50),
NLDR methods should be used with care. The curse of dimensionality is
already no longer negligible.
• low data dimensionality. For up to five dimensions, any NLDR method
can be applied with full confidence.
Obviously, the choice of the target dimensionality should take into account
the intrinsic dimensionality of data if it is known or can be estimated.
• If the target dimensionality is (much) higher than the intrinsic one, PCA or
MDS performs very well. These two methods have numerous advantages:
they are simple, fast, do not fall into local optima, and involve no parameters.
In this case, even the fact that they transform data in a linear way can be
considered an advantage in many respects.
• If the target dimensionality is equal to or barely higher than the intrinsic one, NLDR methods can yield very good results. Most spectral or non-
spectral methods work quite well in this case. For highly curved manifolds,
one or two supernumerary dimensions can improve the embedding quality.
Most NLDR methods (and especially those based on distance preservation)
have limited abilities to deform/distort manifolds. Some extra dimensions
can then compensate for this lack of “flexibility.” The same strategy can
be followed to embed manifolds with essential loops or spheres.
• If the target dimensionality is lower than the intrinsic one, such as for vi-
sualization purposes, use NLDR methods at your own risk. It is likely that
results will be meaningless since the embedding dimensionality is “forced.”
In this case, most nonspectral NLDR methods should be avoided. They
simply fail to converge in an embedding space of insufficient dimensional-
ity. On the other hand, spectral methods do not share this drawback since
they solve an eigenproblem independently from the target dimensionality.
This last parameter is involved only in the final selection of eigenvectors.
Obviously, although an embedding dimensionality that is deliberately too
low does not jeopardize the method convergence, this option does not guar-
antee that the obtained embedding is meaningful either. Its interpretation
and/or subsequent use must be questioned.
Here is a list of additional advice related to the application's purpose and other considerations.
• Collect information about your data set prior to NLDR: estimate the in-
trinsic dimensionality and compute an adjacency graph in order to deduce
the manifold connectivity.
• Never use any NLDR method without knowing the role and influence of
all its parameters (true for any method, with a special emphasis on non-
spectral methods).
• For 2D visualization and exploratory data analysis, Kohonen’s SOM re-
mains a reference tool.
• Never use KPCA for embedding purposes. The theoretical framework hid-
den behind KPCA is elegant and appealing; it paved the way toward a
unified view of all spectral methods. However, in practice, the method
lacks a geometrical interpretation that could help the user choose use-
ful kernel functions. Use SDE instead; this method resembles KPCA in
many respects, and the SDP step implicitly determines the optimal kernel
function for distance preservation.
• Never use SDE with large data sets; this method generates a heavy com-
putational burden and needs to run on much more powerful computers
than alternative methods do.

• Avoid using GTM as much as possible; the method involves too many
parameters and is restricted to 1D or 2D rectangular latent spaces; the
mapping model proves to be not flexible enough to deal with highly curved
manifolds.
• LLE is very sensitive to its parameter values (K or ε, and the regularization parameter Δ). Use it carefully, and do not hesitate to try different values,
as is done in the literature [166].
• Most nonspectral methods can get stuck in local optima: depending on the
initialization, different embeddings can be obtained.
• Finally, do not forget to assess the embedding quality using appropriate
criteria [186, 185, 9, 74, 10, 190, 103] (see an example in Subsection 6.3.1).
The above recommendations leave the following question unanswered: given a data set, how does one choose between distance and topology preservation? If the data set is small, the methods with the simplest models often suit
the best (e.g., PCA, MDS, or NLM). With mid-sized data sets, more complex
distance-preserving methods like Isomap or CCA/CDA often provide more
meaningful results. Topology-preserving methods like LLE, LE, and Isotop
should be applied to large data sets only. Actually, the final decision between
distance and topology preservation should then be guided by the shape of the
underlying manifold. Heavily crumpled manifolds are more easily embedded
using topology preservation rather than distance preservation. The key point
to know is that both strategies extract neither the same kind nor the same
amount of information from data. Topology-preserving methods focus on local
information (neighborhood relationships), whereas distance-preserving ones
exploit both the local and global manifold structure.

7.8 Perspectives
During the 1900s, dimensionality reduction went through several eras. The
first era mainly relied on spectral methods like PCA and then classical metric
MDS. Next, the second era consisted of the generalization of MDS into non-
linear variants, many of them being based on distance or rank preservation
and among which Sammon’s NLM is probably the most emblematic represen-
tative. At the end of the century, the field of NLDR was deeply influenced by
“neural” approaches; the autoassociative MLP and Kohonen’s SOM are the
most prominent examples of this stream. The beginning of the new century
witnessed the rebirth of spectral approaches, starting with the discovery of
KPCA.
So in which directions will researchers orient their investigations in the coming years? The paradigm of distance preservation can be counted
among the classical NLDR tools, whereas no real breakthrough has happened
in topology preservation since the SOM invention. It seems that the vein of
spectral methods has now been largely exploited. Many recent papers deal-
ing with that topic do not present new methods but are instead surveys that
summarize the domain and explore fundamental aspects of the methods, like their connections or duality within a unifying framework. A recent publica-
tion in Science [89] describing a new training technique for auto-associative
MLP could reorient the NLDR research toward artificial neural networks once
again, in the same way as the publication of Isomap and LLE in the same journal in 2000 led to the rapid development of many spectral methods. This
renewed interest in ANNs could focus on issues that were barely addressed
by spectral methods and distance preservation: large-scale NLDR problems
(training samples with several thousands of items), “out-of-sample” general-
ization, bidirectional mapping, etc.
A last open question regards the curse of dimensionality. An important motivation behind (NL)DR is to avoid its harmful effects. Paradoxi-
cally, however, many NLDR methods do not bring a complete solution to the
problem, but only dodge it. Many NLDR methods give poor results when
the intrinsic dimensionality of the underlying manifold exceeds four or five.
In such cases, the dimension of the embedding space becomes high enough to
observe undesired effects related to the curse of dimensionality, such as the
empty space phenomenon. The future will tell whether new techniques will
be able to take up this ultimate challenge.
A Matrix Calculus

A.1 Singular value decomposition


The singular value decomposition (SVD) of an M -by-N matrix A is written
as
A = VΣUT , (A.1)
where
• V is an orthonormal (or unitary) M -by-M matrix such that VT V =
IM×M .
• Σ is a pseudodiagonal matrix with the same size as A; the M entries σm
on the diagonal are called the singular values of A.
• U is an orthonormal (or unitary) N -by-N matrix such that UT U = IN ×N .
The number of singular values different from zero gives the rank of A. When
the rank is equal to P , the SVD can be used to compute the (pseudo-)inverse
of A:
A+ = UΣ+ VT , (A.2)
where the (pseudo-)inverse of Σ is trivially computed by transposing it and
inverting its diagonal entries σm . The SVD is used in many other contexts
and applications. For example, principal component analysis (see Section 2.4)
can be carried out using an SVD.
By the way, it is noteworthy that PCA uses a slightly modified SVD.
Assuming that M < N, U could become large when M ≪ N, which often happens in PCA, and Σ contains many useless zeros. This motivates an al-
ternative definition of the SVD, called the economy-size SVD, where only the
first P columns of U are computed. Consequently, UT has the same size as
A and Σ becomes a square diagonal matrix. A similar definition is available
when M > N .
For a square and symmetric matrix, the SVD is equivalent to the eigenvalue
decomposition (EVD; see ahead).
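The following minimal NumPy sketch (added here for illustration, not part of the original text; the variable names are chosen arbitrarily) computes an economy-size SVD of a rectangular matrix and uses it to form the pseudo-inverse of Eq. (A.2).

import numpy as np

# Random M-by-N matrix (M < N) to decompose
A = np.random.randn(3, 5)

# Economy-size SVD: V is M-by-M, sigma holds the singular values, Ut is M-by-N
V, sigma, Ut = np.linalg.svd(A, full_matrices=False)

# Rank = number of singular values different from zero (up to a tolerance)
P = np.sum(sigma > 1e-12)

# Pseudo-inverse A+ = U Sigma+ V^T, inverting only the nonzero singular values
sigma_plus = np.zeros_like(sigma)
sigma_plus[:P] = 1.0 / sigma[:P]
A_plus = Ut.T @ np.diag(sigma_plus) @ V.T

# Check against NumPy's built-in pseudo-inverse
assert np.allclose(A_plus, np.linalg.pinv(A))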

A.2 Eigenvalue decomposition


The eigenvalue decomposition (EVD) of a square M -by-M matrix A is written
as
AV = VΛ , (A.3)
where
• V is a square M -by-M matrix whose columns vm are unit norm vectors
called eigenvectors of A.
• Λ is a diagonal M -by-M matrix containing the M eigenvalues λm of A.
The EVD is sometimes called the spectral decomposition of A. Equation (A.3) expresses the fact that the eigenvectors keep their direction after left-multiplication by A: Avm = λm vm . Moreover, the scaling factor is equal to the associated
eigenvalue. The number of eigenvalues different from zero gives the rank of
A, and the product of the eigenvalues is equal to the determinant of A. On
the other hand, the trace of A, denoted tr(A) and defined as the sum of its
diagonal entries, is equal to the sum of its eigenvalues:

\mathrm{tr}(A) \triangleq \sum_{m=1}^{M} a_{m,m} = \sum_{m=1}^{M} \lambda_m .   (A.4)

In the general case, even if A contains only real entries, V and Λ can be
complex. If A is symmetric (A = AT ), then V is orthonormal (the eigen-
vectors are orthogonal in addition to being normed); the EVD can then be
rewritten as
A = VΛVT , (A.5)
and the eigenvalues are all real numbers. Moreover, if A is positive definite
(resp., negative definite), then all eigenvalues are positive (resp., negative).
If A is positive semidefinite (resp., negative semidefinite), then all eigenval-
ues are nonnegative (resp., nonpositive). For instance, a covariance matrix is
positive semidefinite.
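As an illustration of these properties, the short NumPy sketch below (added here, not part of the original text) decomposes a random symmetric matrix and checks Eq. (A.3) together with the trace and determinant identities.

import numpy as np

# Build a random symmetric matrix (real eigenvalues, orthonormal eigenvectors)
B = np.random.randn(4, 4)
A = B + B.T

# EVD of a symmetric matrix
lam, V = np.linalg.eigh(A)

# A V = V Lambda
assert np.allclose(A @ V, V @ np.diag(lam))

# Trace equals the sum of the eigenvalues, determinant equals their product
assert np.isclose(np.trace(A), lam.sum())
assert np.isclose(np.linalg.det(A), lam.prod())

# Orthonormality of the eigenvectors: V^T V = I
assert np.allclose(V.T @ V, np.eye(4))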

A.3 Square root of a square matrix


The square root of a diagonal matrix is easily computed by applying the
square root solely on the diagonal entries. By comparison, the square root of
a nondiagonal square matrix may seem more difficult to compute. First of all,
there are two different ways to define the square root of a square matrix A.
The first definition assumes that
A \triangleq (A^{1/2})^T (A^{1/2}) .   (A.6)
If A is symmetric, then the eigenvalue decomposition (EVD) of the matrix
helps to return to the diagonal case. The eigenvalue decomposition (see Ap-
pendix A.2) of any symmetric matrix A is

A = VΛVT = (VΛ1/2 )(Λ1/2 VT ) = (A1/2 )T (A1/2 ) , (A.7)

where Λ is diagonal. If A is also positive definite, then all eigenvalues are


positive and the diagonal entries of Λ1/2 remain positive real numbers. (If A
is only positive semidefinite, then the square root is no longer unique.)
The second and more general definition of the square root is written as

A \triangleq (A^{1/2})(A^{1/2}) .   (A.8)

Again, the eigenvalue decomposition leads to the solution. The square root is
then written as
A1/2 = VΛ1/2 V−1 , (A.9)
and it is easy to check that

A1/2 A1/2 = VΛ1/2 V−1 VΛ1/2 V−1 (A.10)


= VΛ1/2 Λ1/2 V−1 (A.11)
= VΛV−1 = A . (A.12)

This is valid in the general case, i.e., A can be complex and/or nonsym-
metric, yielding complex eigenvalues and eigenvectors. If A is symmetric, the
last equation can be further simplified since the eigenvectors are real and
orthonormal (V−1 = VT ).
It is noteworthy that the second definition of the matrix square root can be generalized to compute matrix powers:

Ap = VΛp V−1 . (A.13)
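A minimal NumPy sketch of Eq. (A.9) and Eq. (A.13) is given below (added here for illustration; the helper name matrix_power_evd is made up). A symmetric positive-definite matrix is used so that all powers are well defined.

import numpy as np

def matrix_power_evd(A, p):
    """Compute A^p through the eigenvalue decomposition A = V Lambda V^{-1}."""
    lam, V = np.linalg.eig(A)
    return V @ np.diag(lam ** p) @ np.linalg.inv(V)

# Symmetric positive-definite example (a covariance-like matrix)
B = np.random.randn(3, 3)
A = B @ B.T + 3 * np.eye(3)

# Square root in the sense A = A^{1/2} A^{1/2}
A_half = matrix_power_evd(A, 0.5)
assert np.allclose(A_half @ A_half, A)

# General powers, e.g., the inverse as the power -1
assert np.allclose(matrix_power_evd(A, -1.0), np.linalg.inv(A))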


B Gaussian Variables

This appendix briefly introduces Gaussian random variables and some of their
basic properties.

B.1 One-dimensional Gaussian distribution


Considering a single one-dimensional random variable x, it is said to be Gaus-
sian if its probability density function fx (x) can be written as

f_x(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2}\, \frac{(x - \mu)^2}{\sigma^2} \right) ,   (B.1)

where μ and σ 2 are the mean and the variance, respectively, and correspond to
the first-order moment and second-order central moment. Figure B.1 shows a
plot of a Gaussian probability density function. Visually, the mean μ indicates

0.5

0.4

0.3
f (x)
x

0.2

0.1

0
−2 0 2
x

Fig. B.1. Probability density function associated with a one-dimensional Gaussian


distribution.

the abscissa where the bump reaches its highest point, whereas σ is related to
the spreading of the bump. For this reason, the standard deviation σ is often
called the width in a geometrical context (see ahead).
Since only the mean and variance suffice to characterize a Gaussian vari-
able, it is often denoted as N (μ, σ). The letter N recalls the alternative name
of a Gaussian variable: a normal variable or normally distributed variable.
This name simply reflects the fact that for real-valued variables, the Gaussian
distribution is the most widely observed one in a great variety of phenomena.
Moreover, the central limit theorem states that a variable obtained as
the sum of several independent identically distributed variables, regardless
of their distribution, tends to be Gaussian if the number of terms in the
sum tends to infinity. Thus, to some extent, the Gaussian distribution can
be considered the “child” of all other distributions. On the other hand, the
Gaussian distribution can also be interpreted as the “mother” of all other
distributions. This is intuitively confirmed by the fact that any zero-mean
unit variance pdf fy (y) can be modeled starting from a zero-mean and unit-
variance Gaussian variable with pdf fx (x), by means of the Gram–Charlier or
Edgeworth development:

f_y(y) = f_x(y) \left[ 1 + \frac{1}{6}\mu_3(y) H_3(y) + \frac{1}{24}(\mu_4(y) - 3) H_4(y) + \ldots \right] ,   (B.2)

where H_i(y) is the ith-order Hermite–Chebyshev polynomial and μ_i(y) the ith-order central moment. If the last equation is rewritten using the ith-order cumulants κ_i(y) instead of the central moments, one gets

f_y(y) = f_x(y) \left[ 1 + \frac{1}{6}\kappa_3(y) H_3(y) + \frac{1}{24}\kappa_4(y) H_4(y) + \ldots \right] .   (B.3)

Visually, in the last development, a nonzero skewness κ3 (y) makes the pdf
fy (y) asymmetric, whereas a nonzero kurtosis excess κ4 (y) makes its bump
flatter or sharper. Even if the development does not go beyond the fourth
order, it is easy to guess that the Gaussian distribution is the only one having
zero cumulants for orders higher than two. This partly explains why Gaussian
variables are said to be the “least interesting” ones in some contexts [95].
Actually, a Gaussian distribution has absolutely no salient characteristic:
• The support is unbounded, in contrast to a uniform distribution, for in-
stance.
• The pdf is smooth, symmetric, and unimodal, without a sharp peak like
the pdf of a Laplacian distribution.
• The distribution maximizes the differential entropy.
The function defined in Eq. (B.1) and plotted in Fig. B.1 is the sole func-
tion that both shows the above properties and respects the necessary con-
ditions to be a probability density function. These conditions are set on the
cumulative distribution function Fx (x) of the random variable, defined as
F_x(x) = \int_{-\infty}^{x} f_x(u)\, du ,   (B.4)

and requires that


• Fx (−∞) = 0,
• Fx (+∞) = 1,
• Fx (x) is monotonic and nondecreasing,
• Fx (b) − Fx (a) is the probability that a < x ≤ b.

B.2 Multidimensional Gaussian distribution


A P -dimensional random variable x is jointly Gaussian if its joint probability
density function can be written as

f_x(x) = \frac{1}{\sqrt{(2\pi)^P \det C_{xx}}} \exp\left( -\frac{1}{2} (x - \mu_x)^T C_{xx}^{-1} (x - \mu_x) \right) ,   (B.5)

where μx and Cxx are, respectively, the mean vector and the covariance ma-
trix. As the covariance matrix is symmetric and positive semidefinite, its deter-
minant is nonnegative. The joint pdf of a two-dimensional Gaussian is drawn
in Fig. B.2.

0.15

0.1
f (x)
x

0.05

0
2
0 2
0
−2 −2
x2 x1

Fig. B.2. Probability density function associated with a two-dimensional Gaussian


distribution.

It can be seen that, in the argument of the exponential function, the factor (x − μ_x)^T C_{xx}^{-1} (x − μ_x) is related to the square of the Mahalanobis distance between x and μ_x (see Subsection 4.2.1).
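As an illustration (a NumPy sketch added here, not part of the original text; the helper name gaussian_pdf is made up), Eq. (B.5) can be evaluated directly; the argument of the exponential is minus half the squared Mahalanobis distance.

import numpy as np

def gaussian_pdf(x, mu, C):
    """Evaluate the joint Gaussian pdf of Eq. (B.5) at a point x."""
    P = len(mu)
    diff = x - mu
    # Squared Mahalanobis distance between x and the mean
    maha2 = diff @ np.linalg.inv(C) @ diff
    norm = 1.0 / np.sqrt((2 * np.pi) ** P * np.linalg.det(C))
    return norm * np.exp(-0.5 * maha2)

mu = np.array([0.0, 1.0])
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, C))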

B.2.1 Uncorrelated Gaussian variables


If a multidimensional Gaussian distribution is uncorrelated, i.e., if its covari-
ance matrix is diagonal, then the last equation can be rewritten as

f_x(x) = \frac{1}{\sqrt{(2\pi)^P \prod_{p=1}^{P} c_{p,p}}} \exp\left( -\frac{1}{2} \sum_{p=1}^{P} \frac{(x_p - \mu_{x_p})^2}{c_{p,p}} \right)   (B.6)

= \prod_{p=1}^{P} \frac{1}{\sqrt{2\pi c_{p,p}}} \exp\left( -\frac{1}{2}\, \frac{(x_p - \mu_{x_p})^2}{c_{p,p}} \right)   (B.7)

= \prod_{p=1}^{P} \frac{1}{\sqrt{2\pi}\,\sigma_{x_p}} \exp\left( -\frac{1}{2}\, \frac{(x_p - \mu_{x_p})^2}{\sigma_{x_p}^2} \right)   (B.8)

= \prod_{p=1}^{P} f_{x_p}(x_p) ,   (B.9)

showing that the joint pdf of uncorrelated Gaussian variables factors into the
product of the marginal probability density functions. In other words, uncor-
related Gaussian variables are also statistically independent. Again, fx (x) is
the sole and unique probability density function that can satisfy this property.
For other multivariate distributions, the fact of being uncorrelated does not
imply the independence of the marginal densities. Nevertheless, the reverse
implication is always true.
Geometrically, a multidimensional Gaussian distribution looks like a fuzzy
ellipsoid, as shown in Fig. B.3. The axes of the ellipsoid correspond to coor-
dinate axes.

B.2.2 Isotropic multivariate Gaussian distribution


A multidimensional Gaussian distribution is said to be isotropic if its covari-
ance matrix can be written as a function of the single parameter σ:
Cxx = σ 2 I . (B.10)
As Cxx is diagonal, the random variables in x are uncorrelated and inde-
pendent, and σ is the common standard deviation shared by all marginal
probability density functions fxp (xp ). By comparison to the general case, the
pdf of an isotropic Gaussian distribution can be rewritten more simply as

f_x(x) = \frac{1}{\sqrt{(2\pi)^P \det(\sigma^2 I)}} \exp\left( -\frac{1}{2} (x - \mu_x)^T (\sigma^2 I)^{-1} (x - \mu_x) \right)   (B.11)

= \frac{1}{\sqrt{(2\pi\sigma^2)^P}} \exp\left( -\frac{1}{2}\, \frac{(x - \mu_x)^T (x - \mu_x)}{\sigma^2} \right)   (B.12)

= \frac{1}{\sqrt{(2\pi\sigma^2)^P}} \exp\left( -\frac{1}{2}\, \frac{\| x - \mu_x \|_2^2}{\sigma^2} \right) ,   (B.13)

Fig. B.3. Sample joint distribution (10,000 realizations, in blue) of two uncorrelated Gaussian variables. The variances are proportional to the axis lengths (in red).

The appearance of the Euclidean distance (see Subsection 4.2.1) demonstrates


that the value of the pdf depends only on the distance to the mean and not
on the orientation. Consequently, fx (x) is completely isotropic: no direction
is privileged, as illustrated in Fig. B.4. For this reason, the standard deviation

2
x2

−2

−4
−4 −2 0 2 4
x1

Fig. B.4. Sample (10,000 realizations) of a two-dimensional isotropic Gaussian


distribution. The variances are proportional to the axis lengths (in red), which are
equal. The variance is actually the same in all directions.

σ is often called the width or radius of the Gaussian distribution.



The function fx (x) is often used outside the statistical framework, possibly
without its normalization factor. In this case, fx (x) is usually called a radial
basis function or Gaussian kernel. In addition to being isotropic, such a function has very nice properties:
• It produces a single localized bump.
• Very few parameters have to be set (P means and one single variance,
compared to P (P + 1)/2 for a complete covariance matrix).
• It depends on the well-known and widely used Euclidean distance.
Gaussian kernels are omnipresent in applications like radial basis function
networks [93] (RBFNs) and support vector machines (SVM) [27, 37, 42].
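A Gaussian kernel in this geometrical sense takes only a few lines of code (a sketch added here, not part of the original text; the function name and the width parameter are illustrative):

import numpy as np

def gaussian_kernel(x, center, width):
    """Unnormalized isotropic Gaussian bump centered on 'center' with radius 'width'."""
    return np.exp(-0.5 * np.sum((x - center) ** 2) / width ** 2)

# The value depends only on the Euclidean distance to the center
x = np.array([1.0, 2.0])
c = np.array([0.0, 0.0])
print(gaussian_kernel(x, c, width=1.5))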

B.2.3 Linearly mixed Gaussian variables

What happens when several Gaussian variables are linearly mixed?


To answer the question, one assumes that the P -dimensional random vec-
tor x has a joint Gaussian distribution with zero means and identity covari-
ance, without loss of generality. Then P linearly mixed variables can be written
as y = Ax, where A is a square P -by-P mixing matrix. By “mixing matrix”,
it should be understood that A is of full column rank and has at least two
nonzero entries per row. These conditions ensure that the initial random vari-
ables are well mixed and not simply copied or scaled. It is noteworthy that y
has zero means like x.
As matrix A defines a nonorthogonal change of the axes, the joint pdf of
y can be written as

f_y(y) = \frac{1}{\sqrt{(2\pi)^P \det C_{yy}}} \exp\left( -\frac{1}{2}\, y^T I^{-1} y \right)   (B.14)

= \frac{1}{\sqrt{(2\pi)^P \det C_{yy}}} \exp\left( -\frac{1}{2}\, (Ax)^T (Ax) \right)   (B.15)

= \frac{1}{\sqrt{(2\pi)^P \det C_{yy}}} \exp\left( -\frac{1}{2}\, x^T (A^T A)\, x \right) ,   (B.16)

demonstrating that Cyy = (AT A)−1 . Unfortunately, as the covariance matrix


is symmetric, it has only P (P + 1)/2 free parameters, which is less than P 2 ,
the number of entries in the mixing matrix A. Consequently, starting from
the mixed variables, it is impossible to retrieve the mixing matrix. A possible
solution would have been to compute the matrix square root of C−1 yy (see
Subsection A.3). However, this provides only a least-squares estimate of A,
on the basis of the incomplete information available in the covariance matrix.
Geometrically, one sees in Fig. B.5 that the matrix square root always finds
an orthogonal coordinate system, computed as the eigenvectors of C_{yy}^{-1}. This
orthogonal system is shown in red, whereas the original coordinate system
deformed by the mixing matrix is drawn in green. This explains why PCA

Fig. B.5. Sample joint distribution (10,000 realizations) of two isotropic Gaussian variables. The original coordinate system, in green, has been deformed by the mixing matrix. Any attempt to retrieve it leads to the orthogonal system shown in red.

(and ICA) is unable to separate mixtures of Gaussian variables: with two or


more Gaussian variables, indeterminacies appear.
C Optimization

Most nonspectral NLDR methods rely on the optimization of some objective


function. This appendix describes a few classical optimization techniques.

C.1 Newton’s method


The original Newton’s method, also known as Newton–Raphson method, is an
iterative procedure that finds a zero of a C∞ function (infinitely differentiable
function)
f : R → R : x → f (x) . (C.1)
Basically, Newton’s method approximates the function f by its first-order
Taylor’s polynomial expansion:
f (x + ) = f (x) + f  (x) + O(2 ) . (C.2)
Defining xold = x and xnew = x + , omitting O(2 ), and assuming that
f (xnew ) = 0, one gets:
0 ≈ f (xold ) + f  (xold )(xnew − xold ) . (C.3)
Solving for xnew leads to
f (xold )
xnew ≈ xold − , (C.4)
f  (xold
which can be rewritten as an iterative update rule:
f (x)
x←x− . (C.5)
f  (x)
Intuitively, starting from a candidate solution x that is randomly initialized,
the next candidate is the intersection of a tangent to f (x) with the x-axis.
It can be proven that the first-order approximation makes the convergence
of the procedure very fast (quadratic convergence): after a few iterations,
the solution remains almost constant, and the procedure may be stopped.
However, it is easy to see that the method becomes unstable when f  (x) ≈ 0.
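As a small illustration (a sketch added here, not part of the original text; the helper name newton_root is made up), the update rule of Eq. (C.5) can be iterated until the candidate solution stabilizes, here to find the positive root of f(x) = x^2 − 2.

def newton_root(f, fprime, x0, n_iter=20, tol=1e-12):
    """Find a zero of f with the Newton-Raphson update x <- x - f(x)/f'(x)."""
    x = x0
    for _ in range(n_iter):
        step = f(x) / fprime(x)   # unstable if f'(x) is close to zero
        x = x - step
        if abs(step) < tol:       # quadratic convergence: a few iterations suffice
            break
    return x

# Positive root of x^2 - 2, i.e., sqrt(2)
print(newton_root(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0))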

C.1.1 Finding extrema

Recalling that function extrema are such that f'(x) = 0, a straightforward extension of Newton's procedure can be applied to find a local extremum of a twice-differentiable function f:

x \leftarrow x - \frac{f'(x)}{f''(x)} ,   (C.6)
where the first and second derivatives are assumed to be continuous. The last
update rule, unfortunately, does not distinguish between a minimum and a
maximum and yields either one of them. An extremum is a minimum only if
the second derivative is positive, i.e., the function is concave in the neighbor-
hood of the extremum. In order to avoid the convergence toward a maximum,
a simple trick consists of forcing the second derivative to be positive:

f  (x)
x← x− . (C.7)
|f  (x)|

C.1.2 Multivariate version

The generalization of Newton’s optimization procedure (Eq. (C.6)) to multi-


variate functions f (x) : RP → R : x → f (x) leads to

x ← x − H−1 ∇x f (x) , (C.8)

where ∇x is the differential operator with respect to the components of vector


x = [x_1, \ldots, x_P]^T:

\nabla_x \triangleq \left[ \frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_P} \right]^T ,   (C.9)
and ∇x f (x) is the gradient of f , i.e., the vector of all partial derivatives:

\nabla_x f(x) = \frac{\partial f(x)}{\partial x} = \left[ \frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_P} \right]^T .   (C.10)

One step further, H^{-1} is the inverse of the Hessian matrix, defined as

H \triangleq \nabla_x \nabla_x^T f(x) = \left[ \frac{\partial^2 f(x)}{\partial x_i \partial x_j} \right]_{ij} ,   (C.11)

whose entries hi,j are the second-order partial derivatives.


Unfortunately, the application of Eq. (C.8) raises two practical difficulties.
First, the trick of the absolute value is not easily generalized to multivariate
functions, which can have minima, maxima, but also various kinds of saddle
points. Second, the Hessian matrix is rarely available in practice. Moreover,
its size grows proportionally to P 2 , making its computation and storage prob-
lematic.

A solution to these two issues consists of assuming that the Hessian matrix
is diagonal, although it is often a very crude hypothesis. This approximation,
usually called quasi-Newton or diagonal Newton, can be written component-
wise:

x_p \leftarrow x_p - \alpha\, \frac{\partial f(x) / \partial x_p}{\left| \partial^2 f(x) / \partial x_p^2 \right|} ,   (C.12)
where the coefficient α (0 < α ≤ 1) slows down the update rule in order
to avoid unstable behaviors due to the crude approximation of the Hessian
matrix.

C.2 Gradient ascent/descent


The gradient descent (resp., ascent) is a generic minimization (resp., maxi-
mization) technique also known as the steepest descent (ascent) method. As
Newton’s method (see the previous section), the gradient descent is an itera-
tive technique that finds the closest extremum starting from an initial guess.
The method requires knowing only the gradient of the function to be opti-
mized in closed form.
Actually, the gradient descent can be seen as a simplified version of New-
ton’s method, in the case where the Hessian matrix is unknown. Hence, the
gradient descent is a still rougher approximation than the pseudo-Newton
method. More formally, the inverse H−1 of the unknown Hessian is simply
replaced with the product αI, where α is a parameter usually called the step
size or the learning rate. Thus, for a multivariate function f (x), the iterative
update rule can be written as
x ← x − α∇x f (x) . (C.13)
As the Hessian is unknown, the local curvature of the function is unknown,
and the choice of the step size may be critical. A value that is too large
may jeopardize the convergence on an extremum, but, on the other hand, the
convergence becomes very slow for small values.
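The following sketch (added here for illustration, not part of the original text; it assumes NumPy and a hand-picked step size) applies the update rule of Eq. (C.13) to a simple quadratic function whose gradient is known in closed form.

import numpy as np

def gradient_descent(grad, x0, alpha=0.1, n_iter=100):
    """Steepest descent: repeatedly move against the gradient with step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - alpha * grad(x)
    return x

# Minimize f(x) = (x1 - 1)^2 + 4 (x2 + 2)^2, whose gradient is given in closed form
grad_f = lambda x: np.array([2 * (x[0] - 1), 8 * (x[1] + 2)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))   # converges near (1, -2)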

C.2.1 Stochastic gradient descent

Within the framework of data analysis, it often happens that the function to
be optimized is of the form

f(x) = E_y\{ g(y, x) \} \quad \text{or} \quad f(x) = \frac{1}{N} \sum_{i=1}^{N} g(y(i), x) ,   (C.14)

where y is an observed variable and Y = [. . . , y(i), . . .]1≤i≤N is an array of


observations. Then the update rule for the gradient descent with respect to
the unknown variables/parameters x can be written as

x \leftarrow x - \alpha\, \frac{1}{N} \sum_{i=1}^{N} \nabla_x g(y(i), x) .   (C.15)

This is the usual update rule for the classical gradient descent. In the frame-
work of neural networks and other adaptive methods, the classical gradient
descent is often replaced with the stochastic gradient descent. In the latter
method, the update rule can be written in the same way as in the classical
method, except that the mean (or expectation) operator disappears:

x ← x − α∇x g(y(i), x) . (C.16)

Because of the dangling index i, the update rule must be repeated N times,
over all available observations y(i). From an algorithmic point of view, this
means that two loops are needed. The first one corresponds to the iterations
that are already performed in the classical gradient descent, whereas an inner
loop traverses all vectors of the data set. A traversal of the data set is usually
called an epoch.
Moreover, from a theoretical point of view, as the partial updates are no
longer weighted and averaged, additional conditions must be fulfilled in order
to attain convergence. Actually, the learning rate α must decrease as epochs
go by and, assuming t is an index over the epochs, the following (in)equalities
must hold [156]:

\sum_{t=1}^{\infty} \alpha(t) = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} (\alpha(t))^2 < \infty .   (C.17)

Additionally, it is often advised to consider the available observations as an


unordered set, i.e., to randomize the order of the updates at each epoch. This
allows us to avoid any undesired bias due to a repeated order of the updates.
When a stochastic gradient descent is used in a truly adaptive context,
i.e., the sequence of observations y(i) is infinite, then the procedure cannot
be divided into successive epochs: there is only a single, very long epoch. In
that case the learning rate is set to a small constant value (maybe after a
short decrease starting from a larger value in order to initialize the process).
Conditions to attain convergence are difficult to study in that case, but they
are generally fulfilled if the sequence of observations is stationary.
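A minimal sketch of a stochastic gradient descent organized in epochs is shown below (added here, not part of the original text; the helper name sgd_mean and the toy objective are made up for illustration). The schedule α(t) = α0/t satisfies the conditions of Eq. (C.17), and the order of the updates is randomized at each epoch; the toy objective is the mean of g(y, x) = ½‖y − x‖², whose minimizer is the sample mean.

import numpy as np

def sgd_mean(Y, n_epochs=50, alpha0=0.5):
    """Stochastic gradient descent on the mean of g(y, x) = 0.5*||y - x||^2."""
    rng = np.random.default_rng(0)
    x = np.zeros(Y.shape[1])
    for t in range(1, n_epochs + 1):
        alpha = alpha0 / t                   # sum(alpha) = inf and sum(alpha^2) < inf
        for i in rng.permutation(len(Y)):    # randomized order of the updates
            x = x - alpha * (x - Y[i])       # gradient of g(y(i), x) w.r.t. x
    return x

Y = np.random.randn(200, 3) + np.array([1.0, -2.0, 0.5])
print(sgd_mean(Y))   # close to the sample mean of Y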
D Vector quantization

Like dimensionality reduction, vector quantization [77, 73] is a way to re-


duce the size of a data set. However, instead of lowering the dimensionality
of the observations, vector quantization reduces the number of observations
(see Fig. D.1). Therefore, vector quantization and dimensionality reduction
are somewhat complementary techniques. By the way, it is noteworthy that
several DR methods use vector quantization as preprocessing.
In practice, vector quantization is achieved by replacing the original data
points with a smaller set of points called units, centroids, prototypes or code
vectors. The ordered or indexed set of code vectors is sometimes called the
codebook.
Intuitively, a good quantization must show several qualities. Classically,
the user expects that the prototypes are representative of the original data
they replace (see Fig. D.1). More formally, this goal is reached if the prob-
ability density function of the prototypes resembles the probability density
function of the initial data. However, as probability density functions are dif-
ficult to estimate, especially when working with a finite set of data points,
this idea will not work as such. An alternative way to capture approximately
the original density consists of minimizing the quantization distortion. For a
data set Y = {y(i) ∈ RD }1≤i≤N and a codebook C = {c(j) ∈ RD }1≤j≤M , the
distortion is a quadratic error function written as

E_{VQ} = \frac{1}{N} \sum_{i=1}^{N} \| y(i) - \mathrm{dec}(\mathrm{cod}(y(i))) \|^2 ,   (D.1)

where the coding and decoding functions cod and dec are respectively defined
as
\mathrm{cod} : \mathbb{R}^D \to \{1, \ldots, M\} : y \mapsto \arg\min_{1 \leq j \leq M} \| y - c(j) \|   (D.2)

and
dec : {1, . . . , M } → C : j → c(j) . (D.3)

Fig. D.1. Principle of vector quantization. The first plot shows a data set (2000 points). As illustrated by the second plot, a vector quantization method can reduce the number of points by replacing the initial data set with a smaller set of representative points: the prototypes, centroids, or code vectors, which are stored in the codebook. The third plot shows simultaneously the initial data, the prototypes, and the boundaries of the corresponding Voronoï regions.

The application of the coding function to some vector y(i) of the data set
gives the index j of the best-matching unit of y(i) (BMU in short), i.e., the
closest prototype from y(i). Appendix F.2 explains how to compute the BMU
efficiently. The application of the decoding function to j simply gives the
coordinates c(j) of the corresponding prototype. The coding function induces
a partition of RD : the open sets of all points in RD that share the same BMU
c(j) are called the Voronoı̈ regions (see Fig. D.1). A discrete approximation
of the Voronoı̈ regions can be obtained by constituting the sets Vj of all data
points y(i) having the same BMU c(j). Formally, these sets are written as

Vj = {y(i)|cod(y(i)) = j} (D.4)

and yield a partition of the data set.
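These definitions translate almost literally into code; the following NumPy sketch (added here, not an original implementation) computes the BMU of a point, decodes an index, and evaluates the distortion of Eq. (D.1).

import numpy as np

def cod(y, C):
    """Index of the best-matching unit (BMU) of y among the code vectors in C."""
    return np.argmin(np.sum((C - y) ** 2, axis=1))

def dec(j, C):
    """Coordinates of the j-th prototype."""
    return C[j]

def distortion(Y, C):
    """Quantization distortion E_VQ of Eq. (D.1)."""
    return np.mean([np.sum((y - dec(cod(y, C), C)) ** 2) for y in Y])

Y = np.random.randn(500, 2)                           # data set
C = Y[np.random.choice(len(Y), 10, replace=False)]    # codebook drawn from the data
print(distortion(Y, C))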

D.1 Classical techniques


Numerous techniques exist to minimize EVQ . The most prominent one is the
LBG algorithm [124], which is basically an extension of the generalized Lloyd
method [127]. Close relatives coming from the domain of cluster analysis are
the ISODATA [8] and the K-means [61, 130] algorithms. In the case of the K-
means, the codebook is built as follows. After the initialization (see ahead),
the following two steps are repeated until the distortion decreases below a
threshold fixed by the user:
1. Encode each point of the data set, and compute the discrete Voronoı̈
regions Vj ;
2. Update each c(j) by moving it to the barycenter of Vj , i.e.,

c(j) \leftarrow \frac{1}{|V_j|} \sum_{y(i) \in V_j} y(i) .   (D.5)

This procedure monotonically decreases the quantization distortion until it


reaches a local minimum. To prove the correctness of the procedure, one
rewrites EVQ as a sum of contributions coming from each discrete Voronoı̈
region:
E_{VQ} = \frac{1}{N} \sum_{j=1}^{M} E_{VQ}^j ,   (D.6)

where

E_{VQ}^j = \sum_{y(i) \in V_j} \| y(i) - c(j) \|^2 .   (D.7)

Trivially, the barycenter of some V_j minimizes the corresponding E_{VQ}^j. There-
fore, step 2 decreases EVQ . But, as a side effect, the update of the prototypes
also modifies the results of the encoding function. So it must be shown that the
re-encoding occurring in step 1 also decreases the distortion. The only terms
that change in the quantization distortion defined in Eq. (D.1) are those cor-
responding to data points that change their BMU. By definition of the coding
function, the distance ‖y(i) − c(j)‖ is smaller for the new BMU than for the
old one. Therefore, the error is lowered after re-encoding, which concludes the
correctness proof of the K-means.
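A compact implementation of the two alternating steps (a sketch added here, not the authors' code; NumPy assumed, with a fixed number of iterations instead of a distortion threshold) could read as follows.

import numpy as np

def kmeans(Y, M, n_iter=100):
    """Alternate encoding (step 1) and barycenter updates (step 2), Eq. (D.5)."""
    rng = np.random.default_rng(0)
    C = Y[rng.choice(len(Y), M, replace=False)].copy()   # init from random data points
    for _ in range(n_iter):
        # Step 1: encode each point, i.e., find its BMU
        bmu = np.argmin(((Y[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Step 2: move each prototype to the barycenter of its discrete Voronoi region
        for j in range(M):
            members = Y[bmu == j]
            if len(members) > 0:          # skip empty regions ("dead units")
                C[j] = members.mean(axis=0)
    return C

Y = np.random.randn(1000, 2)
print(kmeans(Y, M=5))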

D.2 Competitive learning


Falling into local minima can be avoided to some extent by minimizing the
quantization distortion by stochastic gradient descent (Robbins–Monro pro-
cedure [156]). For this purpose, the gradient of EVQ with respect to prototype
c(j) is written as

1 
N
∂y(i) − c(j)
∇c(j) EVQ = 2y(i) − c(j) (D.8)
N i=1 ∂c(j)

1 
N
(c(j) − y(i))
= 2y(i) − c(j) (D.9)
N i=1 2y(i) − c(j)

1 
N
= (c(j) − y(i)) , (D.10)
N i=1

where the equality j = cod(y(i)) holds implicitly. In a classical gradient de-


scent, all prototypes are updated simultaneously after the computation of
their corresponding gradient. Instead, the Robbins–Monro procedure [156] (or
stochastic gradient descent) separates the terms of the gradient and immediately updates the prototypes according to the simple rule:

c(j) \leftarrow c(j) - \alpha\, (c(j) - y(i)) ,   (D.11)

where α is a learning rate that decreases after each epoch, i.e., each sweep
of the data set. In fact, the decrease of the learning rate α must fulfill the
conditions stated in [156] (see Section C.2).
Unlike the K-means algorithm, the stochastic gradient descent does not
decrease the quantization distortion monotonically. Actually, the decrease oc-
curs only on average. The stochastic gradient descent of EVQ belongs to a
wide class of vector quantization algorithms known as competitive learning
methods [162, 163, 3].
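A minimal winner-take-all version of this procedure (a sketch added here for illustration, not the original code; NumPy assumed) applies the update rule (D.11) to the BMU only, with a learning rate that decreases after each epoch.

import numpy as np

def competitive_learning(Y, M, n_epochs=30, alpha0=0.3):
    """WTA competitive learning: only the BMU is updated, with rule (D.11)."""
    rng = np.random.default_rng(0)
    C = Y[rng.choice(len(Y), M, replace=False)].copy()
    for t in range(1, n_epochs + 1):
        alpha = alpha0 / t                       # decreasing learning rate (Section C.2)
        for i in rng.permutation(len(Y)):        # one epoch = one sweep of the data set
            j = np.argmin(np.sum((C - Y[i]) ** 2, axis=1))   # BMU of y(i)
            C[j] = C[j] - alpha * (C[j] - Y[i])              # stochastic update
    return C

Y = np.random.randn(1000, 2)
print(competitive_learning(Y, M=5))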

D.3 Taxonomy
Quantization methods may be divided into static, incremental, and dynamic
ones. This distinction refers to their capacity to increase or decrease the num-
ber of prototypes they update. Most methods, like competitive learning and
LBG, are static and manage a number of prototypes fixed in advance. Incre-
mental methods (see, for instance, [68, 71, 11, 70]) are able to increase this
predetermined number by inserting supplemental units when necessary (various criteria exist). Fully dynamic methods (see, for instance, [69]) can
add new units and remove unnecessary ones.
In addition to the distinction between classical quantization and compet-
itive learning, the latter can further be divided into two subcategories:
• Winner take all (WTA). Similarly to the stochastic method sketched
just above, WTA methods update only one prototype (the BMU) at each
presentation of a datum. WTA methods are the simplest ones and include
the classical competitive learning [162, 163] and the frequency-sensitive
competitive learning [51].
• Winner take most (WTM). WTM methods are more complex than
WTA ones, because the prototypes interact at each presentation of a da-
tum. In practice, several prototypes are updated at each presentation of a
datum. In addition to the BMU, some other prototypes related to the BMU
are also updated. Depending on the specific quantization method, these
prototypes may be the second, third, and so forth closest prototypes in the
data space, as in the neural gas [135] (NG). Otherwise, the neighborhood
relationships with the BMU may also be predefined and data-independent,
as in a self-organizing map [105, 154] (SOM; see also Subsection 5.2.1).

D.4 Initialization and “dead units”


For K-means as well as for competitive learning, the initialization of the code-
book is very important. The worst case appears when the prototypes are ini-
tialized (far) outside the region of the space occupied by the data set. In this
case, there is no guarantee that the algorithm will ever move these prototypes.
Consequently, not only these prototypes are lost (they are often called dead
units), but they may also mislead the user, since these dead units are obvi-
ously not representative of the original data. A good initialization consists of
copying into the codebook the coordinates of M points chosen randomly in the data set.
The appearance of dead units may be circumvented in several ways. The
most obvious way to deal with them consists of:
1. computing the cardinality of the Voronoı̈ regions after quantization,
2. discarding the units associated with an empty region.
In order to keep the number of prototypes constant, lost units can be detected
during convergence after each epoch of the Robbins–Monro procedure. But
instead of removing them definitely as above, they may be reinitialized and
reinserted, according to what is proposed in the previous paragraph. Such a
recycling trick may be seen as a degenerate dynamic method that adds and
removes units but always keeps the same total number of prototypes.

Outside the scope of its application to nonlinear dimensionality reduction,


vector quantization belongs to classical techniques used, for example, in:
• statistical data analysis, mainly for the clustering of multivariate data; in
that framework, the centroids and/or their discrete Voronoı̈ regions are
called clusters,
• telecommunications, for lossy data compression.
E Graph Building

In the field of data analysis, graph building is an essential step in order to


capture the neighborhood relationships between the points of a given data
set. The resulting graph may also be used to obtain good approximations to
geodesic distances with graph distances (see Subsection 4.3.1). The construc-
tion of graphs is also useful in many other domains, such as finite-element
methods.
Briefly put, graph building in the context of NLDR aims at connecting
neighboring points of the space. The set of available points is often finite and
given as a matrix of coordinates. Obviously, what it means to be a neighbor must be defined in some way, and this definition essentially depends on the user's needs or preferences.
Several rules inspired by intuitive ideas allow building graphs starting from
a set of points in space. Each rule shows specific properties and uses different
types of information, depending mainly on whether the data are quantized
or not (see Appendix D). A well-known method of building a graph between
points is Delaunay triangulation [63]. Unfortunately, efficient implementations
of Delaunay triangulation only exist in the two-dimensional case. Moreover,
although Delaunay triangulation has many desirable properties, it is often not
suited in the case of data analysis. A justification of this can be found in [5],
along with a good discussion about the construction of graphs.
The five rules presented hereafter are illustrated by two examples. The
two data sets consist of 3000 2-dimensional points drawn from two different
1-manifolds: a sine wave with an increasing frequency and a linear spiral.
Their parametric equations are, respectively,

y = \begin{bmatrix} x \\ \sin(\pi \exp(3x)) \end{bmatrix} \quad \text{with } x \in [0, 1]   (E.1)

and

y = \begin{bmatrix} 2x \cos(6\pi x) \\ 2x \sin(6\pi x) \end{bmatrix} \quad \text{with } x \in [0, 1] .   (E.2)

In both cases, Gaussian noise is added (for the sine, on y2 only, with
standard deviation equal to 0.10; for the spiral, on both y1 and y2 , with
standard deviation equal to 0.05). Afterwards, the 3000 points of each data set
are quantized with 120 and 25 prototypes respectively. Figure E.1 illustrates
both data sets. Both manifolds have the property of being one-dimensional

Data set 2
Data set 1 2
1.5
1
1
0.5 0.5
0
y2

y2
0
−0.5
−0.5 −1
−1.5
−1
−2
0 1 2 3
y1 −2.5
−2 −1 0 1 2
y1

Fig. E.1. Two data sets used to test graph-building rules: the sine with exponen-
tially increasing frequency and the linear spiral. Data points are displayed as points
whereas prototypes are circles.

only on a local scale. For example, going from left to right along the sine
wave makes it appear “more and more two-dimensional” (see Chapter 3). For
the rules that build a graph without assuming that the available points are
prototypes resulting from a vector quantization, the prototypes are given as
such. On the other hand, the rules relying on the fact that some data set has
been quantized are given both the prototypes and the original data sets.

E.1 Without vector quantization


All the rules described ahead just assume a set of data points {y(i)}1≤i≤N .

E.1.1 K-rule

Also known as the rule of K-ary neighborhoods, this rule is actually very
simple: each point y(i) is connected with the K closest other points. As a
direct consequence, if the graph is undirected (see Section 4.3), then each point
is connected with at least K other points. Indeed, each point elects exactly K
neighbors but can also be elected by points that do not belong to this set of
K neighbors. This phenomenon typically happens with an isolated point: it
elects as neighbors faraway points while those points find their K neighbors
within a much smaller distance. Another consequence of this rule is that no
(nontrivial) upper bound can be easily given for the longest distance between
a point and its neighbors. The practical determination of the K closest points
is detailed in Section F.2.
Assuming the data set {y(i)}1≤i≤N is actually the set of 120 or 25 proto-
types mentioned above, the K-rule gives the results displayed in the first row
of Fig. E.2. Knowing that the manifolds are one-dimensional, the value of K
is set to 2. For the sine wave, the result is good but gets progressively worse
as the frequency increases. For the spiral, the obtained graph is totally wrong,
mainly because the available points do not sample the spiral correctly (i.e.,
the distances between points on different whorls may be smaller than those
of points lying on the same whorl).

E.1.2 ε-rule

By comparison with the K-rule, the ε-rule works almost conversely: each point y(i) is connected with all other points lying inside an ε-ball centered on y(i). Consequently, ε is by construction the upper bound for the longest distance between a point and its neighbors. But as a counterpart, no (nontrivial) lower bound can easily be given for the smallest number of neighbors that are connected with each point. Consequently, it may happen that isolated points have no neighbors. As extensively demonstrated in [19], the ε-rule shows better properties for the approximation of geodesic distances with graph distances. However, the choice of ε appears more difficult in practice than the one of K. The practical determination of the points lying closer than a fixed distance ε from another point is detailed in Section F.2.
Assuming the data set {y(i)}1≤i≤N is actually the set of 120 or 25 proto-
types mentioned above, the ε-rule gives the results displayed in the second row of Fig. E.2. The parameter ε is given the values 0.3 and 0.8 for, respectively, the sine and the spiral. As can be seen, the ε-rule gives good results only if the
distribution of points remains approximately uniform; this assumption is not
exactly true for the two proposed data sets. Consequently, the graph includes
too many edges in dense regions like the first minimum of the sine wave and
the center of the spiral. On the other hand, points lying in sparsely populated
regions often remain disconnected.
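Both rules can be sketched in a few lines of NumPy (an illustration added here, not the original code; the helper names are made up, edges are returned as unordered pairs of point indices, and the whole distance matrix is computed explicitly, which is only reasonable for small data sets).

import numpy as np

def knn_edges(Y, K):
    """K-rule: connect each point with its K closest other points."""
    D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(D, np.inf)                 # a point is not its own neighbor
    edges = set()
    for i in range(len(Y)):
        for j in np.argsort(D[i])[:K]:
            edges.add((min(i, j), max(i, j)))   # undirected edge
    return edges

def epsilon_edges(Y, eps):
    """epsilon-rule: connect each point with all points inside an eps-ball around it."""
    D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))
    i_idx, j_idx = np.where((D <= eps) & (D > 0))
    return {(min(i, j), max(i, j)) for i, j in zip(i_idx, j_idx)}

Y = np.random.rand(120, 2)
print(len(knn_edges(Y, K=2)), len(epsilon_edges(Y, eps=0.1)))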

E.1.3 τ -rule

This more complex rule connects two points y(i) and y(j) if they satisfy two conditions. The first condition states that the distances d_i = \min_j \| y(i) - y(j) \| and d_j = \min_i \| y(j) - y(i) \| from the two points to their respective closest neighbors satisfy

di ≤ τ dj and dj ≤ τ di (similarity cond.) , (E.3)


where τ is a tolerance greater than 1 (for example, τ = 1.5 or 2.0). The second
condition requires that the two points are neighbors in the sense:

\| y(i) - y(j) \| \leq \tau d_i \quad \text{or} \quad \| y(j) - y(i) \| \leq \tau d_j \quad \text{(neighborhood cond.)} .   (E.4)

This rule behaves almost like the ε-rule, except that the radius is implicit (hidden in τ) and adapts to the local data distribution. The mean radius increases in sparse regions and decreases in dense regions.
Assuming the data set {y(i)}1≤i≤N is actually the set of 120 or 25 pro-
totypes mentioned above, the τ -rule gives the results displayed in the third
row of Fig. E.2. For the sine as well as for the spiral, the parameter τ equals
1.5. As expected, the τ-rule behaves a little better than the ε-rule when the density of points varies. This is especially visible in dense regions: the numerous unnecessary edges produced by the ε-rule disappear. Moreover, it is
noteworthy that the τ -rule better extracts the shape of the spiral than the
two previous rules.

E.2 With vector quantization


When it is known that the points to be connected result from a vector quanti-
zation, the construction of the graph may exploit that additional information.
Such information may be given, for example, by the local distribution of data
points between prototypes that are close to each other. Some algorithms,
called topology representing networks [136] (TRN) perform this task concur-
rently with vector quantization. Typical examples are the neural gas [135]
and its numerous variants (such as [71]). The two simple rules detailed ahead,
applied after vector quantization, yield similar or even better results.

E.2.1 Data rule

For each data point y(i), this rule computes the set containing the K
closest prototypes, written as {c(j_1), . . . , c(j_K)}. Then each possible pair {c(j_r), c(j_s)} in this set is analyzed. If the point fulfills the following two
conditions, then the prototypes of the considered pair are connected, and a
graph edge is created between their associated vertices. The first one is the
condition of the ellipse, written as

d(y(i), c(jr )) + d(y(i), c(js )) < C1 d(c(jr ), c(js )) , (E.5)

and the second one is the condition of the circle

d(y(i), c(j_r)) < C_2\, d(y(i), c(j_s)) \quad \text{and} \quad d(y(i), c(j_s)) < C_2\, d(y(i), c(j_r)) ,   (E.6)

Fig. E.2. Results of the K-rule, ε-rule, and τ-rule on the data sets proposed in Fig. E.1.

where C1 and C2 are constants defined as



C_1 = \sqrt{S^2 + 1} ,   (E.7)

C_2 = \frac{1 + S}{1 - S} .   (E.8)
The value of the constant S is chosen by the user, with values between 0.2 and
0.6. In this range, S can be seen as the edge of an almost cubic hypervolume
(see Fig. E.3). Indeed, the first condition holds if y(i) lies inside a hyper-
ellipsoid with foci at c(jr ) and c(js ), whereas the second condition holds if
y(i) lies outside two hyperballs including the two foci.

Fig. E.3. The two conditions to fulfill in order to create a graph edge between the vertices associated to two prototypes (black crosses). The points that create the edge must be inside the ellipse and outside both circles. The ellipse and circles are shown for different values of S (S = 0.2, 0.4, 0.6).

Assuming the data set {y(i)}1≤i≤N is actually the set of 120 or 25 pro-
totypes mentioned in the introductory section, the data rule gives the results
displayed in the first row of Fig. E.4. The parameter K equals 2, and S is
assigned the values 0.5 and 0.3 for the sine and the spiral, respectively. As can
be seen, the exploitation of the information provided by the data set before
vector quantization allows one to extract the shape of both data sets in a
much better way than the three previous rules did.

E.2.2 Histogram rule


This rule, described in [5], shares some common ideas with the previous one.
In a similar way, the goal consists of connecting prototypes c(jr ) and c(js )
only if data points lie between them, i.e., around the midpoint of a segment
with c(jr ) and c(js ) at both ends. In contrast with the data rule, however,
the histogram rule is probabilistic.
In practice, for each data point y(i), the two closest prototypes c(jr ) and
c(js ) are computed and memorized as possible end vertices for a new edge
of the graph. Obviously, several data points can lead to considering the same
possible edge. These data points may be binned in a histogram according to
the normalized scalar product

h(i) = \frac{(y(i) - c(j_r)) \cdot (c(j_s) - c(j_r))}{(c(j_s) - c(j_r)) \cdot (c(j_s) - c(j_r))} .

The normalized scalar product can also be expressed with distances:

h(i) = \frac{\| y(i) - c(j_r) \|_2^2 - \| y(i) - c(j_s) \|_2^2 + \| c(j_r) - c(j_s) \|_2^2}{2 \| c(j_r) - c(j_s) \|_2^2} .

The histogram may contain only three bins:

• h(i) < (1 − H)/2,
• (1 − H)/2 ≤ h(i) ≤ (1 + H)/2,
• (1 + H)/2 < h(i),

where 0 ≤ H ≤ 1 is a parameter that determines the width of the central


bin. The final decision to establish the edge between c(jr ) and c(js ) is taken
if the central bin is higher than the other two bins. In two dimensions, the
histogram rule produces a subset of the Delaunay triangulation.
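
Under the same conventions and caveats as the previous sketch, the histogram rule may be summarized as follows; the three-bin histogram is kept per candidate edge in a dictionary, and the parameter H as well as all names are purely illustrative:

import numpy as np
from collections import defaultdict

def histogram_rule_edges(Y, C, H=0.5):
    # Y: data points; C: prototypes; one three-bin histogram per candidate edge.
    counts = defaultdict(lambda: np.zeros(3, dtype=int))
    for y in Y:
        d = np.linalg.norm(C - y, axis=1)
        i1, i2 = np.argsort(d)[:2]             # two closest prototypes
        r, s = min(i1, i2), max(i1, i2)        # fixed orientation for the pair
        u = C[s] - C[r]
        h = np.dot(y - C[r], u) / np.dot(u, u)  # normalized scalar product
        if h < 0.5 * (1.0 - H):
            counts[(r, s)][0] += 1
        elif h <= 0.5 * (1.0 + H):
            counts[(r, s)][1] += 1             # central bin: y lies "between"
        else:
            counts[(r, s)][2] += 1
    # keep an edge only if its central bin is higher than the two others
    return [pair for pair, c in counts.items() if c[1] > c[0] and c[1] > c[2]]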
Assuming the data set {y(i)}1≤i≤N is actually the set of 120 or 25 pro-
totypes mentioned in the introductory section, the histogram rule gives the
results displayed in the second row of Fig. E.4. The results are similar to those
of the data rule. It is noteworthy that the histogram rule behaves much better
than the data rule when outliers pollute the data (in the artificial examples
of Fig. E.1, only Gaussian noise is added).
Fig. E.4. Results of the data rule and histogram rule on the data sets proposed in
Fig. E.1.
F Implementation Issues

This appendix gathers some hints for implementing efficiently the methods
and algorithms described in the main chapters.

F.1 Dimension estimation


This first section refers to the methods described in Chapter 3 to estimate the
intrinsic dimensionality of a data set.

F.1.1 Capacity dimension

Some ideas to build efficient implementations of the box-counting dimension
are given in [45]. Briefly put, the idea consists of assuming the presence of a
D-dimensional grid that surrounds the region of the space occupied by the
D-dimensional data. It is not a good idea to list all hypercubes in the grid
and to count the number of points lying in each of them. Indeed, the number
of hypercubes grows exponentially, and many of these “boxes” are likely to
be empty if the intrinsic dimensionality is low, as expected. A better solution
consists in listing only the nonempty hypercubes. This can be done by giving
a label or code number to each box and assigning to each point the label of
the box it is lying in.
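
A minimal sketch of this labeling trick is given below (in Python with NumPy; the function name and the choice of integer tuples as box labels are assumptions of this illustration, not prescribed by [45]):

import numpy as np

def count_nonempty_boxes(Y, eps):
    # label each point with the integer coordinates of the box (edge eps) it lies in
    labels = np.floor((Y - Y.min(axis=0)) / eps).astype(int)
    return len(set(map(tuple, labels)))        # only nonempty boxes are listed

# The capacity dimension is then estimated from the slope of
# log(count_nonempty_boxes(Y, eps)) versus log(1/eps) over several values of eps.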

F.1.2 Correlation dimension

By comparison with the capacity dimension, the correlation dimension is much
easier to implement. In order to compute it and to draw the figures of Chap-
ter 3, several pieces of code are needed.
The core of the algorithm consists of computing the pairwise distances
involved in the correlation sum (Eq. (3.18)). For N points or observations,
N (N − 1)/2 distances have to be computed. If N is large, the number of dis-
tances to store in memory becomes intractable. Fortunately, distances them-
selves are not interesting: only their cumulative distribution matters for the
correlation sum. An idea to approximate that distribution consists of building
a histogram. Hence, once a distance is computed, it is placed in the right bin
and may be forgotten afterwards. This trick avoids storing all distances: only
the histogram stands in the memory.
The histogram itself can be implemented simply as a vector.
Each entry of the vector is a bin. For a given value, the index of the corre-
sponding bin is computed in O(1) time. If the histogram h contains B bins,
and if the first and last bins are located at values h(1) and h(B), then the
index of the bin corresponding to value v is written as

    b = max(1, min(B, rnd((B − 1)(v − h(1))/(h(B) − h(1)) + 1))) ,            (F.1)
where the function rnd(x) rounds its argument x. The min and max prevent
b from getting out of bounds. The choice of the number of bins B depends on
the user’s needs. On the other hand, the bounds of the histogram h(1) and
h(B) are important in order to avoid wasted bins. The first thing to remember
is that the correlation sum is always displayed in log-log plots. Therefore, the
values to be binned in the histogram should be the logarithm of the distances,
in order to yield an equal resolution (i.e., bin width) in the plot. For optimal
efficiency, the choice of the histogram bounds is straightforward: h(1) and
h(B) must equal the logarithm of, respectively, the minimal and maximal
distances measured in the data set. Unfortunately, these distances are not
known for sure until all N (N − 1)/2 distances have been computed. This
means that the distances should be either stored or computed twice. The first
solution is the worst: the time saving is small, but the required amount of
memory dramatically increases. A third solution would be to approximate the
minimal and maximal distances. For example, a rather good approximation
of the maximum:

    max_{i,j} ‖y(i) − y(j)‖2 ≈ max_i ‖y(i) − E{y}‖2 ,                         (F.2)

where the expectation of y can be approximated by the sample mean, com-


puted beforehand. This approximation reduces the computation time from
O(N 2 ) to O(N ). Although the exact value of the maximal distance is not so
important, things are unfortunately much different for the minimal distance.
Indeed, as the bounds of the histogram are the logarithms of these values, an
error on a small value is much more dangerous than on a larger one. By the
way, the knowledge of the exact minimal distance prevents us from having
empty bins and causing a log 0 = −∞. In summary, it is better to compute
the minimal and maximal distances exactly.
After all distances have been computed, the histogram bins may be cu-
mulated in order to obtain the correlation sum Ĉ2(ε). Next, the logarithm is
applied to the bin heights, yielding the necessary log-log curves. Finally, the
numerical derivative is computed as in Eqs. (3.30) and (3.31). The second-
order estimate is replaced with a first-order one for the first and last bins.
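
The whole procedure of this section may be sketched as follows (in Python with NumPy; the number of bins B, the two passes over the pairwise distances, and all names are illustrative choices):

import numpy as np

def correlation_sum_loglog(Y, B=50):
    N = len(Y)
    # first pass: exact minimal and maximal distances, nothing is stored
    dmin, dmax = np.inf, 0.0
    for i in range(N - 1):
        d = np.linalg.norm(Y[i + 1:] - Y[i], axis=1)
        dmin, dmax = min(dmin, d.min()), max(dmax, d.max())
    edges = np.linspace(np.log(dmin), np.log(dmax), B + 1)
    hist = np.zeros(B, dtype=np.int64)
    # second pass: bin the log-distances; distances are forgotten afterwards
    for i in range(N - 1):
        logd = np.log(np.linalg.norm(Y[i + 1:] - Y[i], axis=1))
        hist += np.histogram(logd, bins=edges)[0]
    log_eps = 0.5 * (edges[:-1] + edges[1:])   # bin centers
    log_C2 = np.log(np.cumsum(hist) / (N * (N - 1) / 2))
    return log_eps, log_C2   # the slope of this curve estimates the dimension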

F.2 Computation of the closest point(s)


When searching a set of points {y(i)}1≤i≤N for the K points that are closest
to given coordinates y, one has to distinguish four cases:
1. 0 < K ≪ N/2;
2. 0 ≪ K < N/2;
3. N/2 < K ≪ N;
4. N/2 ≪ K < N.
The four cases are solved by the following procedure:
1. Compute all distances d(y(i), y) and store them in a vector.
2. Sort the vector containing distances.
3. Keep the first K entries of the sorted vector.
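
For instance, with NumPy the three steps reduce to a few lines; np.argpartition already plays the role of the partial sort discussed below, and the function name is illustrative:

import numpy as np

def k_closest(Y, y, K):
    # assumes K < N; returns the indices of the K closest points, ordered
    d = np.linalg.norm(Y - y, axis=1)          # step 1: all N distances
    idx = np.argpartition(d, K)[:K]            # step 2: partial sort, O(N)
    return idx[np.argsort(d[idx])]             # step 3: keep the K closest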
The most time-consuming step is the sort (O(N log(N ))). If K is small (second
case), then a partial sort can dramatically decrease the computation time.
Moreover, the ordering of the K closest points often does not matter at all:
the user just needs to know whether or not a given y(i) belongs to the K
closest ones. Consequently, the last two cases in the above list can be elegantly
solved just by computing the N −K farthest points with a partial sort. At this
point, the use of a partial sort algorithm leads to acceptable solutions for the
last three cases. Unfortunately, if K is very small (first case), then the sorting
time becomes negligible. Even a stupid partial sort like traversing K times
the data set and looking for the minimum can be negligible (the complexity
is only O(KN )). Therefore, in the first case, thanks to the partial sort, the
most time-consuming step becomes the computation of all distances (O(DN )
where D, the dimensionality of the data, is assumed to be larger than K).
Would it be possible to avoid the computation of all distances and get some
additional speed-up?
A positive answer to the above question is given by partial distance search
(PDS in short). This technique exploits the structure of the Minkowski norm
(see Subsection 4.2.1), which is computed as a sum of positive terms, elevated
to some power: 
D

y(i) − yp =  p
|yd (i) − yd |p . (F.3)
d=1

As the pth root is a monotonic function, it does not change the ordering of
the distance and may be omitted.
For the sake of simplicity, it is assumed that a sorted list of K candidate
closest points is available. This list can be initialized by randomly choosing K
points in the data set, computing their distances (without applying the pth
root) to the source point y, and sorting the distances. Let ε be assigned with
the value of the last and longest distance in the list. From this state, PDS has
to traverse the N − K remaining points in order to update the list. For each of
these points y(i), PDS starts to compute the distance from the source y, by
cumulating the terms |yd (i) − yd |p for increasing values of index d. While the
sum of those terms remains lower than ε, it is worth carrying on the additions,
because the point remains a possible candidate to enter the list. Otherwise,
if the sum grows beyond ε, it can be definitely deduced that the point is not
a good candidate, and the PDS may simply stop the distance computation
between y(i) and y. If the point finally enters the list, the largest distance in
the list is thrown away and the new point replaces it at the right position in
the list.
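
A possible sketch of PDS is given below; for simplicity, a max-heap with negated keys replaces the sorted list of the description above, distances are kept as pth powers (no pth root is applied), and all names are illustrative:

import heapq

def pds_k_closest(Y, y, K, p=2):
    # initialize the candidate list with the first K points of the data set
    heap = []                                  # max-heap through negated keys
    for i in range(K):
        dist = sum(abs(a - b) ** p for a, b in zip(Y[i], y))
        heapq.heappush(heap, (-dist, i))
    eps = -heap[0][0]                          # current longest distance in the list
    for i in range(K, len(Y)):
        partial, rejected = 0.0, False
        for a, b in zip(Y[i], y):              # accumulate the terms one by one
            partial += abs(a - b) ** p
            if partial >= eps:                 # the point cannot enter the list
                rejected = True
                break
        if not rejected:
            heapq.heapreplace(heap, (-partial, i))   # replace the worst candidate
            eps = -heap[0][0]
    return sorted((-d, i) for d, i in heap)    # (distance**p, index) pairs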
The denomination ε is not given arbitrarily to the longest distance in the
list. A small change in the PDS algorithm can perform another task very effi-
ciently: the detection of the points y(i) lying in a hyperball of radius ε. In this
variant, ε is fixed and the list has a dynamic size; the sorting of the list is un-
necessary. The original and modified PDS algorithms find direct applications
in the K-rule and -rule for graph building (see Appendix E).
Other techniques for an even faster determination of the closest points exist
in the literature, but they either need special data structures built around
the set of data points [207] or provide only an approximation of the exact
result [26].

F.3 Graph distances


Within the framework of dimensionality reduction, graph distances are used
to approximate geodesic distances on a manifold, i.e., distances that are mea-
sured along a manifold, like a man walking on it, and not as a bird flies, as
does the Euclidean distance. Whereas Section 4.3 justifies and details the use
of graph distances in the context of NLDR, this section focuses on some
implementation issues.
Compared to Euclidean distances, graph distances are much more dif-
ficult to compute efficiently. The explanation of such a difference comes
from the fact that computing the Euclidean distance between two points re-
quires knowing only the coordinates of these two points. More precisely, computing
d(y(i), y(j)) = ‖y(i) − y(j)‖2, where y(i), y(j) ∈ R^D, requires O(D) opera-
tions. (See Section 4.2 for more details about the Euclidean distance and the
L2-norm ‖y‖2.)
Hence, when dealing with a set of N points, computing the Euclidean dis-
tances from one point, called the source, to all others trivially requires O(DN )
operations. Similarly, the measurement of all pairwise distances (from all pos-
sible sources to all other points) demands O(DN 2 ); as the Euclidean distance
is symmetric (see Subsection 4.2.1), only N (N − 1)/2 pairwise distances need
to be effectively computed. For the Euclidean distance, this distinction among
three tasks may seem quite artificial. Indeed, computing the distance for one
pair (SPED task; single-pair Euclidean distance), one source (SSED task;
single-source Euclidean distances), or all sources (APED task; all-pairs Eu-
clidean distances) simply demands repeating 1, N , or N (N − 1)/2 times a
single basic procedure that measures the distance between two points.
On the other hand, when determining graph distances, the distinction
among the three tasks is essential. The first point to remark is that com-
puting the graph distance δ(vi , vj ) between two vertices of a weighted graph
is wasteful. Intuitively, this fact can be justified by remarking that a graph
distance depends on more than the two vertices between which it is com-
puted. Indeed, computing the distance between vi and vj requires computing
paths between vi and vj and finding the shortest one. If this shortest path is
written {vi , . . . , vk , . . . , vj }, then {vi , . . . , vk } is the shortest path between vi
and vk ; but, of course, vk and all other intermediate vertices are not known in
advance. Therefore, computing the shortest path from vi to vj (i.e., graph dis-
tance δ(y(i), y(j))) requires computing all intermediate shortest paths from
vi to any vk (i.e. the distances δ(y(i), y(k))), subject to length({vi , . . . , vk }) ≤
length({vi , . . . , vj }) (resp., δ(y(i), y(k)) ≤ δ(y(i), y(j))). (For an undirected
graph, as is often the case in applications to dimensionality reduction, a
slightly cheaper solution consists of computing simultaneously shortest paths
starting from both vertices vi and vj and stopping when they merge, instead of
computing all shortest paths from vi until vj is reached.) As a consequence of
the previous observations, it appears that the most basic procedure for com-
puting graph distances should be the one that already measures all distances
from one source to all other vertices. In the literature, this task is called the
SSSP problem (single-source shortest paths). Similarly, the computation for
all pairs is known as the APSP problem (all-pairs shortest paths).

SSSP (single-source shortest paths)

The problem of computing the shortest paths from one source vertex to all
other vertices is usually solved by Dijkstra’s [53] algorithm. Dijkstra’s algo-
rithm has already been sketched in Subsection 4.3.1; its time complexity is
O(D|E| + N log N ), where |E| is the number of edges in the graph. The main
idea of the algorithm consists of computing the graph distances in ascend-
ing order. This way implicitly ensures that each distance is computed along
the shortest path. This also means that distances have to be sorted in some
way. Actually, at the beginning of the algorithm, all distances are initialized
to +∞, except for the distance to the source, which is trivially zero. Other
distances are updated and output one by one, as the algorithm is running
and discovering the shortest paths. Hence, distances are not really sorted, but
their intermediate values are stored in a priority queue, which allows us to
store a set of values and extract the smallest one. An essential property of
a priority queue is the possibility to decrease the stored values. In the case
of Dijkstra’s algorithm, this functionality is very important since distances
are not known in advance and may be lowered every time a shorter path is
discovered. Priority queues with efficient operations for extracting the min-
imum and updating the stored values are implemented by data structures
called heaps. In particular, Fibonacci heaps are exceptionally well fitted to
Dijkstra’s algorithm [65].
Dijkstra’s algorithm computes not only the length of the shortest paths
(i.e., the graph distances) but also the shortest path itself. Actually, associated
with each vertex vj is a pointer to the vertex vk , which is its predecessor in
the shortest path that goes from the source vi to vj . If the graph is connected,
the shortest paths computed by Dijkstra’s algorithm form a spanning tree of
the graph.
As a byproduct, Dijkstra’s algorithm outputs the graph distances in as-
cending order. They are implicitly sorted by the algorithm.
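
For reference, a compact sketch of Dijkstra's algorithm follows; it relies on Python's heapq module, i.e., a binary heap rather than the Fibonacci heap mentioned above (giving O(|E| log N) instead of O(|E| + N log N)), and stale queue entries are simply skipped instead of performing a true decrease-key:

import heapq

def dijkstra_sssp(adj, source):
    # adj[i] is a list of (j, w_ij) pairs for a weighted, undirected graph
    N = len(adj)
    dist = [float('inf')] * N
    pred = [None] * N                          # predecessor on the shortest path
    dist[source] = 0.0
    queue = [(0.0, source)]
    while queue:
        d, i = heapq.heappop(queue)            # distances come out in ascending order
        if d > dist[i]:                        # stale entry: a shorter path exists
            continue
        for j, w in adj[i]:
            if d + w < dist[j]:                # a shorter path to j is discovered
                dist[j] = d + w
                pred[j] = i
                heapq.heappush(queue, (dist[j], j))
    return dist, pred                          # graph distances and spanning tree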

APSP (all-pairs shortest paths)

Usually, the problem of computing all pairwise graph distances in a weighted


graph is solved by repeating an SSSP algorithm (like Dijkstra’s one) for each
vertex [210]. This procedure requires O(ND|E| + N² log N) operations, which
is far more expensive than the O(DN²) operations required to achieve the
equivalent task for Euclidean distances. However, it can be seen that repeat-
ing an SSSP procedure generates a lot of redundancy. More precisely, in the
perspective of the APSP, a single execution of an SSSP procedure gives more
than the shortest paths from the specified source vs to all other vertices. In
fact, determining the shortest path {vi , . . . , vk , . . . , vj } from vi to some vj not
only requires computing all intermediate paths {vi , . . . , vk } but also gives the
shortest path {vk , . . . , vj } for free. Moreover, if the graph is undirected, then
the shortest paths {vj , . . . , vk , . . . , vi }, {vj , . . . vk }, and {vk , . . . , vi } are known
in advance, too. Unfortunately, to the authors’ knowledge, no existing algo-
rithm succeeds in exploiting this redundancy. References to improved APSP
algorithms, performing faster in particular situations (e.g., integer weight,
approximated solution) can be found in [210].
References

1. C.C. Aggarwal, A. Hinneburg, and D.A. Keim. On the surprising behav-
ior of distance metrics in high dimensional space. In J. Van den Bussche
and V. Vianu, editors, Proceedings of the Eighth International Conference on
Database Theory, volume 1973 of Lecture Notes in Computer Science, pages
420–434. Springer, London, 2001.
2. D.W. Aha and R.L. Bankert. A comparative evaluation of sequential feature
selection algorithms. In D. Fisher and H.J. Lenz, editors, Learning from Data:
AI and Statistics, pages 199–206. Springer-Verlag, New York, NY, 1996. Also
published in the Proc. 5th Int. Workshop on AI and Statistics.
3. A. Ahalt, A.K. Krishnamurthy, P. Chen, and D.E. Melton. Competitive learn-
ing algorithms for vector quantization. Neural Networks, 3:277–290, 1990.
4. H. Akaike. Information theory and an extension of the maximum likelihood
principle. In B.N. Petrov and F. Csaki, editors, Proceedings of the 2nd Inter-
national Symposium on Information Theory, pages 267–281. Akademia Kiado,
Budapest, 1973.
5. M. Aupetit. Robust topology representing networks. In M. Verleysen, editor,
Proceedings of ESANN 2003, 11th European Symposium on Artificial Neural
Networks, pages 45–50. d-side, Bruges, Belgium, April 2003.
6. C. Baker. The Numerical Treatment of Integral Equations. Clarendon Press,
Oxford, 1977.
7. K. Balász. Principal curves: learning, design, and applications. PhD thesis,
Concordia University, Montréal, Canada, 1999.
8. G.B. Ball and D.J. Hall. A clustering technique for summarizing multivariate
data. Behavioral Science, 12:153–155, 1967.
9. H.-U. Bauer, M. Herrmann, and T. Villmann. Neural maps and topographic
vector quantization. Neural Networks, 12:659–676, 1999.
10. H.-U. Bauer and K.R. Pawelzik. Quantifying the neighborhood preservation
of self-organizing maps. IEEE Transactions on Neural Networks, 3:570–579,
1992.
11. H.-U. Bauer and T. Villmann. Growing a hypercubical output space in a self-
organizing feature map. Technical Report TR-95-030, International Computer
Science Institute, Berkeley, 1995.
12. M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for
embedding and clustering. In T.G. Dietterich, S. Becker, and Z. Ghahramani,
editors, Advances in Neural Information Processing Systems (NIPS 2001), vol-
ume 14. MIT Press, Cambridge, MA, 2002.
13. M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction
and data representation. Neural Computation, 15(6):1373–1396, June 2003.
14. R. Bellman. Adaptative Control Processes: A Guided Tour. Princeton Univer-
sity Press, Princeton, NJ, 1961.
15. Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and
M. Ouimet. Learning eigenfunctions links spectral embedding and kernel PCA.
Neural Computation, 16(10):2197–2219, 2004.
16. Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and
M. Ouimet. Out-of-sample extensions for LLE, isomap, MDS, eigenmaps, and
spectral clustering. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in
Neural Information Processing Systems (NIPS 2003), volume 16. MIT Press,
Cambridge, MA, 2004.
17. Y. Bengio, P. Vincent, J.-F. Paiement, O. Delalleau, M. Ouimet, and
N. Le Roux. Spectral clustering and kernel PCA are learning eigenfunc-
tions. Technical Report 1239, Département d’Informatique et Recherche
Opérationnelle, Université de Montréal, Montréal, July 2003.
18. N. Benoudjit, C. Archambeau, A. Lendasse, J. Lee, and M. Verleysen. Width
optimization of the Gaussian kernels in radial basis function networks. In
M. Verleysen, editor, Proceedings of ESANN 2002, 10th European Symposium
on Artificial Neural Networks, pages 425–432. d-side, Bruges, Belgium, April
2002.
19. M. Bernstein, V. de Silva, J.C. Langford, and J.B. Tenenbaum. Graph ap-
proximations to geodesics on embedded manifolds. Technical report, Stanford
University, Palo Alto, CA, December 2000.
20. K.S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is “near-
est neighbor” meaningful? In Seventh International Conference on Database
Theory, volume 1540 of Lecture Notes in Computer Science, pages 217–235.
Springer-Verlag, Jerusalem, Israel, 1999.
21. C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University
Press, Oxford, 1995.
22. C.M. Bishop, M. Svensén, and C.K.I. Williams. EM optimization of latent-
variables density models. In D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo,
editors, Advances in Neural Information Processing Systems (NIPS 1995), vol-
ume 8, pages 465–471. MIT Press, Cambridge, MA, 1996.
23. C.M. Bishop, M. Svensén, and C.K.I. Williams. GTM: A principled alterna-
tive to the self-organizing map. In M.C. Mozer, M.I. Jordan, and T. Petsche,
editors, Advances in Neural Information Processing Systems (NIPS 1996), vol-
ume 9, pages 354–360. MIT Press, Cambridge, MA, 1997.
24. C.M. Bishop, M. Svensén, and K.I. Williams. GTM: A principled alternative
to the self-organizing map. Neural Computation, 10(1):215–234, 1998.
25. I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Appli-
cations. Springer-Verlag, New York, 1997.
26. A. Borodin, R. Ostrovsky, and Y. Rabani. Lower bounds for high dimensional
nearest neighbor search and related problems. In Proceedings of the thirty-
first annual ACM symposium on Theory of Computing, Atlanta, GA, pages
312–321. ACM Press, New York, 1999.
27. B.E. Boser, I.M. Guyon, and V. Vapnik. A training algorithm for optimal mar-
gin classifiers. In Fifth Annual Workshop on Computational Learning Theory.
ACM, Pittsburg, PA, 1992.
28. C. Bouveyron. Dépliage du ruban cortical à partir d’images obtenues en IRMf,
mémoire de DEA de mathématiques appliquées. Master’s thesis, Unité Mixte
Inserm – UJF 594, Université Grenoble 1, France, June 2003.
29. M. Brand. Charting a manifold. In S. Becker, S. Thrun, and K. Obermayer,
editors, Advances in Neural Information Processing Systems (NIPS 2002), vol-
ume 15. MIT Press, Cambridge, MA, 2003.
30. M. Brand. Minimax embeddings. In S. Thrun, L. Saul, and B. Schölkopf,
editors, Advances in Neural Information Processing Systems (NIPS 2003), vol-
ume 16. MIT Press, Cambridge, MA, 2004.
31. M. Brand and K. Huang. A unifying theorem for spectral embedding and
clustering. In C.M. Bishop and B.J. Frey, editors, Proceedings of International
Workshop on Artificial Intelligence and Statistics (AISTATS’03). Key West,
FL, January 2003. Also presented at NIPS 2002 workshop on spectral methods
and available as Technical Report TR2002-042.
32. J. Bruske and G. Sommer. Intrinsic dimensionality estimation with optimally
topology preserving maps. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 20(5):572–575, 1998.
33. M.A. Carreira-Perpiñán. A review of dimension reduction techniques. Techni-
cal report, University of Sheffield, Sheffield, January 1997.
34. A. Cichocki and S. Amari. Adaptative Blind Signal and Image Processing. John
Wiley & Sons, New York, 2002.
35. R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and
S. Zucker. Geometric diffusion as a tool for harmonic analysis and structure
definition of data, part i: Diffusion maps. Proceedings of the National Academy
of Sciences, 102(21):7426–7431, 2005.
36. R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and
S. Zucker. Geometric diffusion as a tool for harmonic analysis and structure
definition of data, part ii: Multiscale methods. Proceedings of the National
Academy of Sciences, 102(21):7432–7437, 2005.
37. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–
297, 1995.
38. M. Cottrell, J.-C. Fort, and G. Pagès. Two or three things that we know about
the Kohonen algorithm. In M. Verleysen, editor, Proceedings ESANN’94, 2nd
European Symposium on Artificial Neural Networks, pages 235–244. D-Facto
conference services, Brussels, Belgium, 1994.
39. R. Courant and D. Hilbert. Methods of Mathematical Physics, volume 1. In-
terscience Publishers, Inc., New York, 1953.
40. T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley,
New York, 1991.
41. T.F. Cox and M.A.A. Cox. Multidimensional Scaling. Chapman & Hall, Lon-
don, 1995.
42. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Ma-
chines (and Other Kernel-Based Learning Methods). Cambridge University
Press, 2000.
43. J. de Leeuw and W. Heiser. Theory of multidimensional scaling. In Handbook
of Statistics, chapter 13, pages 285–316. North-Holland Publishing Company,
Amsterdam, 1982.
44. D. de Ridder and R.P.W. Duin. Sammon’s mapping using neural networks: A
comparison. Pattern Recognition Letters, 18(11–13):1307–1316, 1997.
45. P. Demartines. Analyse de données par réseaux de neurones auto-organisés.
PhD thesis, Institut National Polytechnique de Grenoble (INPG), Grenoble,
France, 1994.
46. P. Demartines and J. Hérault. Vector quantization and projection neural
network. volume 686 of Lecture Notes in Computer Science, pages 328–333.
Springer-Verlag, New York, 1993.
47. P. Demartines and J. Hérault. CCA: Curvilinear component analysis. In 15th
Workshop GRETSI, Juan-les-Pins (France), September 1995.
48. P. Demartines and J. Hérault. Curvilinear component analysis: A self-
organizing neural network for nonlinear mapping of data sets. IEEE Transac-
tions on Neural Networks, 8(1):148–154, January 1997.
49. D. DeMers and G.W. Cottrell. Nonlinear dimensionality reduction. In D. Han-
son, J. Cowan, and L. Giles, editors, Advances in Neural Information Process-
ing Systems (NIPS 1992), volume 5, pages 580–587. Morgan Kaufmann, San
Mateo, CA, 1993.
50. A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from in-
complete data via the EM algorithm. Journal of Royal Statistical Society, B,
39(1):1–38, 1977.
51. D. DeSieno. Adding a conscience to competitive learning. In Proceedings
of ICNN’88 (International Conference on Neural Networks), pages 117–124.
IEEE Service Center, Piscataway, NJ, 1988.
52. G. Di Battista, P. Eades, R. Tamassia, and I.G. Tollis. Algorithms for drawing
graphs: An annotated bibliography. Technical report, Brown University, June
1994.
53. E.W. Dijkstra. A note on two problems in connection with graphs. Numerical
Mathematics, 1:269–271, 1959.
54. D. Donoho and C. Grimes. When does geodesic distance recover the true
parametrization of families of articulated images? In M. Verleysen, editor,
Proceedings of ESANN 2002, 10th European Symposium on Artificial Neural
Networks, pages 199–204, Bruges, Belgium, April 2002. d-side.
55. D.L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding
techniques for high-dimensional data. In Proceedings of the National Academy
of Sciences, volume 100, pages 5591–5596, 2003.
56. D.L. Donoho and C. Grimes. Hessian eigenmaps: New locally linear techniques
for high-dimensional data. Technical Report TR03-08, Department of Statis-
tics, Stanford University, Palo Alto, CA, 2003.
57. E. Erwin, K. Obermayer, and K. Schulten. Self-organizing maps: ordering,
convergence properties and energy functions. Biological Cybernetics, 67:47–55,
1992.
58. P.A. Estévez and A.M. Chong. Geodesic nonlinear mapping using the neural
gas network. In Proceedings of IJCNN 2006. 2006. In press.
59. P.A. Estévez and C.J. Figueroa. Online data visualization using the neural gas
network. Neural Networks, 19:923–934, 2006.
60. B.S. Everitt. An Introduction to Latent Variable Models. Monographs on
Statistics and Applied Probability. Chapman & Hall, London, New York, 1984.
61. E. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability
of classifications. Biometrics, 21:768, 1965.
62. J.-C. Fort. SOM’s mathematics. Neural Networks, 19:812–816, 2006.
63. S. Fortune. Voronoi diagrams and delaunay triangulations. In D.Z. Du and
F. Hwang, editors, Computing in Euclidean geometry, pages 193–233. World
Scientific, Singapore, 1992.
64. D. François. High-dimensional data analysis: optimal metrics and feature selec-
tion. PhD thesis, Université catholique de Louvain, Département d’Ingénierie
Mathématique, Louvain-la-Neuve, Belgium, September 2006.
65. M.L. Fredman and R.E. Tarjan. Fibonacci heaps and their uses in improved
network optimization algorithms. Journal of the ACM, 34:596–615, 1987.
66. J.H. Friedman. Exploratory projection pursuit. Journal of the American Sta-
tistical Association, 82(397):249–266, March 1987.
67. J.H. Friedman and J.W. Tukey. A projection pursuit algorithm for exploratory
data analysis. IEEE Transactions on Computers, C23(9):881–890, 1974.
68. B. Fritzke. Let it grow – self-organizing feature maps with problem dependent
cell structure. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors,
Artificial Neural Networks, volume 1, pages 403–408. Elsevier, Amsterdam,
1991.
69. B. Fritzke. Growing cell structures – a self-organizing network for unsupervised
and supervised learning. Neural Networks, 7(9):1441–1460, 1994. Also available
as technical report TR-93-026 at the International Computer Science Institute
(Berkeley, CA), May 1993.
70. B. Fritzke. Growing grid – a self-organizing network with constant neighbor-
hood range and adaptation strength. Neural Processing Letters, 2(5):9–13,
1995.
71. B. Fritzke. A growing neural gas network learns topologies. In G. Tesauro, D.S.
Touretzky, and T.K. Leen, editors, Advances in Neural Information Processing
Systems (NIPS 1994), volume 7, pages 625–632. MIT Press, Cambridge, MA,
1995.
72. K. Fukunaga and D.R. Olsen. An algorithm for finding intrinsic dimensionality
of data. IEEE Transactions on Computers, C-20(2):176–193, 1971.
73. A. Gersho and R.M. Gray. Vector Quantization and Signal Processing. Kluwer
Academic Publisher, Boston, 1992.
74. G.J. Goodhill and T.J. Sejnowski. Quantifying neighbourhood preservation in
topographic mappings. In Proceedings of the Third Joint Symposium on Neural
Computation, pages 61–82. California Institute of Technology, University of
California, Pasadena, CA, 1996.
75. J. Göppert. Topology-preserving interpolation in self-organizing maps. In
Proceedings of NeuroNı̂mes 1993, pages 425–434. Nanterre, France, October
1993.
76. P. Grassberger and I. Procaccia. Measuring the strangeness of strange attrac-
tors. Physica, D9:189–208, 1983.
77. R.M. Gray. Vector quantization. IEEE Acoustics, Speech and Signal Processing
Magazine, 1:4–29, April 1984.
78. J. Ham, D.D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimen-
sionality reduction of manifolds. In 21th International Conference on Machine
Learning (ICML-04), pages 369–376, 2004. Also available as technical report
TR-102 at Max Planck Institute for Biological Cybernetics, Tübingen, Ger-
many, 2003.
79. T. Hastie. Principal curves and surfaces. PhD thesis, Stanford University, Palo
Alto, CA, 1984.
80. T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical
Association, 84(406):502–516, 1989.
81. X.F. He and P. Niyogi. Locality preserving projections. In S. Thrun, L. Saul,
and B. Schölkopf, editors, Advances in Neural Information Processing Systems
(NIPS 2003), volume 16. MIT Press, Cambridge, MA, 2004.
82. X.F. He, S.C. Yan, Y.X. Hu, P. Niyogi, and H.J. Zhang. Face recognition
using Laplacianfaces. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 27(3):328–340, 2005.
83. H.G.E. Hentschel and I. Procaccia. The infinite number of generalized dimen-
sions of fractals and strange attractors. Physica, D8:435–444, 1983.
84. J. Hérault, C. Jaussions-Picaud, and A. Guérin-Dugué. Curvilinear component
analysis for high dimensional data representation: I. Theoretical aspects and
practical use in the presence of noise. In J. Mira and J.V. Sánchez, editors,
Proceedings of IWANN’99, volume II, pages 635–644. Springer, Alicante, Spain,
June 1999.
85. M. Herrmann and H.H. Yang. Perspectives and limitations of self-organizing
maps in blind separation of source signals. In S. Amari, L. Xu, L.-W. Chan,
I. King, and K.-S. Leung, editors, Progress in Neural Information Processing,
Proceedings of ICONIP’96, volume 2, pages 1211–1216. Springer-Verlag, 1996.
86. D. Hilbert. Über die stetige Abbildung einer Linie auf ein Flächenstück. Math.
Ann., 38:459–460, 1891.
87. G. Hinton and S.T. Roweis. Stochastic neighbor embedding. In S. Becker,
S. Thrun, and K. Obermayer, editors, Advances in Neural Information Process-
ing Systems (NIPS 2002), volume 15, pages 833–840. MIT Press, Cambridge,
MA, 2003.
88. G.E. Hinton. Learning distributed representations of concepts. In Proceedings
of the Eighth Annual Conference of the Cognitive Science Society, Amherst,
MA, 1986. Reprinted in R.G.M. Morris, editor, Parallel Distributed Processing:
Implications for Psychology and Neurobiology, Oxford University Press, USA,
1990.
89. G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data
with neural networks. Science, 313(5786):504–507, July 2006.
90. G.E. Hinton and R.R. Salakhutdinov. Supporting online mate-
rial for “reducing the dimensionality of data with neural net-
works”. Science, 313(5786):504–507, July 2006. Available at
www.sciencemag.org/cgi/content/full/313/5786/502/DC1.
91. J.J. Hopfield. Neural networks and physical systems with emergent collective
computational abilities. In Proc. Natl. Acad. Sci. USA 79, pages 2554–2558.
1982.
92. H. Hotelling. Analysis of a complex of statistical variables into principal com-
ponents. Journal of Educational Psychology, 24:417–441, 1933.
93. J.R. Howlett and L.C. Jain. Radial Basis Function Networks 1: Recent Devel-
opments in Theory and Applications. Physica Verlag, Heidelberg, 2001.
94. P.J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, 1985.
95. A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis.
Wiley-Interscience, 2001.
96. A.K. Jain and D. Zongker. Feature selection: Evaluation, application and small
sample performance. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 19(2):153–158, 1997.
97. M.C. Jones and R. Sibson. What is projection pursuit? Journal of the Royal
Statistical Society, Series A, 150:1–36, 1987.
98. C. Jutten. Calcul neuromimétique et traitement du signal, analyse en com-
posantes indépendantes. PhD thesis, Institut National Polytechnique de Greno-
ble, 1987.
99. C. Jutten and J. Hérault. Space or time adaptive signal processing by neural
network models. In Neural Networks for Computing, AIP Conference Proceed-
ings, volume 151, pages 206–211. Snowbird, UT, 1986.
100. C. Jutten and J. Hérault. Blind separation of sources, part I: An adaptative
algorithm based on neuromimetic architecture. Signal Processing, 24:1–10,
1991.
101. N. Kambhatla and T.K. Leen. Dimension reduction by local principal compo-
nent analysis. Neural Computation, 9(7):1493–1516, October 1994.
102. K. Karhunen. Zur Spektraltheorie stochastischer Prozesse. Ann. Acad. Sci.
Fennicae, 34, 1946.
103. K. Kiviluoto. Topology preservation in self-organizing maps. In IEEE Neu-
ral Networks Council, editor, Proc. Int. Conf. on Neural Networks, ICNN’96,
volume 1, pages 294–299, Piscataway, NJ, 1996. Also available as technical
report A29 of the Helsinki University of Technology.
104. T. Kohonen. Self-organization of topologically correct feature maps. Biological
Cybernetics, 43:59–69, 1982.
105. T. Kohonen. Self-Organizing Maps. Springer, Heidelberg, 2nd edition, 1995.
106. A. König. Interactive visualization and analysis of hierarchical neural projec-
tions for data mining. IEEE Transactions on Neural Networks, 11(3):615–624,
2000.
107. M. Kramer. Nonlinear principal component analysis using autoassociative neu-
ral networks. AIChE Journal, 37:233, 1991.
108. J.B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a
nonmetric hypothesis. Psychometrika, 29:1–28, 1964.
109. J.B. Kruskal. Toward a practical method which helps uncover the structure of
a set of multivariate observations by finding the linear transformation which
optimizes a new index of condensation. In R.C. Milton and J.A. Nelder, editors,
Statistical Computation. Academic Press, New York, 1969.
110. J.B. Kruskal. Linear transformations of multivariate data to reveal cluster-
ing. In Multidimensional Scaling: Theory and Application in the Behavioural
Sciences, I, Theory. Seminar Press, New York and London, 1972.
111. W. Kühnel. Differential Geometry Curves – Surfaces – Manifolds. Amer. Math.
Soc., Providence, RI, 2002.
112. M. Laurent and F. Rendl. Semidefinite programming and integer programming.
Technical Report PNA-R0210, CWI, Amsterdam, April 2002.
113. M.H. C. Law, N. Zhang, and A.K. Jain. Nonlinear manifold learning for data
stream. In Proceedings of SIAM Data Mining, pages 33–44. Orlando, FL, 2004.
114. J.A. Lee, C. Archambeau, and M. Verleysen. Locally linear embedding versus
Isotop. In M. Verleysen, editor, Proceedings of ESANN 2003, 11th European
Symposium on Artificial Neural Networks, pages 527–534. d-side, Bruges, Bel-
gium, April 2003.
115. J.A. Lee, C. Jutten, and M. Verleysen. Non-linear ICA by using isometric
dimensionality reduction. In C.G. Puntonet and A. Prieto, editors, Independent
Component Analysis and Blind Signal Separation, Lecture Notes in Computer
Science, pages 710–717, Granada, Spain, 2004. Springer-Verlag.
116. J.A. Lee, A. Lendasse, N. Donckers, and M. Verleysen. A robust nonlinear
projection method. In M. Verleysen, editor, Proceedings of ESANN 2000, 8th
European Symposium on Artificial Neural Networks, pages 13–20. D-Facto pub-
lic., Bruges, Belgium, April 2000.
117. J.A. Lee, A. Lendasse, and M. Verleysen. Curvilinear distances analysis versus
isomap. In M. Verleysen, editor, Proceedings of ESANN 2002, 10th European
Symposium on Artificial Neural Networks, pages 185–192. d-side, Bruges, Bel-
gium, April 2002.
118. J.A. Lee and M. Verleysen. How to project “circular” manifolds using geodesic
distances. In M. Verleysen, editor, Proceedings of ESANN 2004, 12th European
Symposium on Artificial Neural Networks, pages 223–230. d-side, April 2004.
119. J.A. Lee and M. Verleysen. Nonlinear projection with the Isotop method. In
J.R. Dorronsoro, editor, LNCS 2415: Artificial Neural Networks, Proceedings
of ICANN 2002, pages 933–938. Springer, Madrid (Spain), August 2002.
120. J.A. Lee and M. Verleysen. Curvilinear distance analysis versus isomap. Neu-
rocomputing, 57:49–76, March 2004.
121. J.A. Lee and M. Verleysen. Nonlinear dimensionality reduction of data mani-
folds with essential loops. Neurocomputing, 67:29–53, 2005.
122. A. Lendasse, J. Lee, and M. Verleysen. Forecasting electricity consumption
using nonlinear projection and self-organizing maps. Neurocomputing, 48:299–
311, October 2002.
123. A. Lendasse, J.A. Lee, V. Wertz, and M. Verleysen. Time series forecasting
using CCA and Kohonen maps – application to electricity consumption. In
M. Verleysen, editor, Proceedings of ESANN 2000, 8th European Symposium on
Artificial Neural Networks, pages 329–334. D-Facto public., Bruges, Belgium,
April 2000.
124. Y. Linde, A. Buzo, and R.M. Gray. An algorithm for vector quantizer design.
IEEE Transactions on Communications, 28:84–95, 1980.
125. N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some
of its algorithmic applications. Combinatorica, 15(2):215–245, 1995.
126. L. Ljung. System Identification: Theory for the User. Prentice Hall Information
and System Sciences Series. Prentice Hall, Englewood Cliffs, NJ, 2nd edition,
1999.
127. S.P. Lloyd. Least squares quantization in PCM. IEEE Transactions on In-
formation Theory, 28:129–137, 1982. Unpublished memorandum, 1957, Bell
Laboratories.
128. M. Loève. Fonctions aléatoire du second ordre. In P. Lévy, editor, Processus
stochastiques et mouvement Brownien, page 299. Gauthier-Villars, Paris, 1948.
129. D.J.C. MacKay. Bayesian neural networks and density networks. Nuclear
Instruments and Methods in Physics Research, Section A, 354(1):73–80, 1995.
130. J.B. MacQueen. Some methods for classification and analysis of multivariate
observations. In L.M. Le Cam and J. Neyman, editors, Proceedings of the
Fifth Berkeley Symposium on Mathematical Statistics and Probability. Volume
I: Statistics, pages 281–297. University of California Press, Berkeley and Los
Angeles, CA, 1967.
131. B.B. Mandelbrot. How long is the coast of Britain? Science, 155:636–638,
1967.
132. B.B. Mandelbrot. How long is the coast of Britain? Freeman, San Francisco,
1982.
133. B.B. Mandelbrot. Les objets fractals: forme, hasard et dimension. Flammarion,
Paris, 1984.
134. J. Mao and A.K. Jain. Artificial neural networks for feature extraction and mul-
tivariate data projection. IEEE Transactions on Neural Networks, 6(2):296–
317, March 1995.
135. T. Martinetz and K. Schulten. A “neural-gas” network learns topologies. In
T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural
Networks, volume 1, pages 397–402. Elsevier, Amsterdam, 1991.
136. T. Martinetz and K. Schulten. Topology representing networks. Neural Net-
works, 7(3):507–522, 1994.
137. W. McCulloch and W. Pitts. A logical calculus of ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
138. M. Minsky and S. Papert. Perceptrons: An Introduction to Computational
Geometry. MIT Press, Cambridge, MA, 1969.
139. L.C. Molina, L. Belanche, and À. Nebot. Feature selection algorithms: A sur-
vey and experimental evaluation. In Proceedings of 2002 IEEE International
Conference on Data Mining (ICDM’02), pages 306–313. December 2002. Also
available as technical report LSI-02-62-R at the Departament de Lleguatges i
Sistemes Informàtics of the Universitat Politècnica de Catalunya, Spain.
140. J.R. Munkres. Topology: A First Course. Prentice-Hall, Englewood Cliffs, NJ,
1975.
141. B. Nadler, S. Lafon, R.R. Coifman, and I.G. Kevrekidis. Diffusion maps,
spectral clustering and eigenfunction of fokker-planck operators. In Y. Weiss,
B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing
Systems (NIPS 2005), volume 18. MIT Press, Cambridge, MA, 2006.
142. R.M. Neal. Bayesian Learning for Neural Networks. Springer Series in Statis-
tics. Springer-Verlag, Berlin, 1996.
143. A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: analysis and an
algorithm. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances
in Neural Information Processing Systems (NIPS 2001), volume 14. MIT Press,
Cambridge, MA, 2002.
144. E. Oja. Data compression, feature extraction, and autoassociation in feedfor-
ward neural networks. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas,
editors, Artificial Neural Networks, volume 1, pages 737–745. Elsevier Science
Publishers, B.V., North-Holland, 1991.
145. E. Oja. Principal components, minor components, and linear neural networks.
Neural Networks, 5:927–935, 1992.
146. E. Ott. Measure and spectrum of dq dimensions. In Chaos in Dynamical
Systems, pages 78–81. Cambridge University Press, New York, 1993.
147. P. Pajunen. Nonlinear independent component analysis by self-organizing
maps. In C. von der Malsburg, W. von Seelen, J.C. Vorbruggen, and B. Send-
hoff, editors, Artificial Neural Networks, Proceedings of ICANN’96, pages 815–
820. Springer-Verlag, Bochum, Germany, 1996.
148. E. Pȩkalska, D. de Ridder, R.P.W. Duin, and M.A. Kraaijveld. A new method
of generalizing Sammon mapping with application to algorithm speed-up. In
M. Boasson, J.A. Kaandorp, J.F.M. Tonino, and M.G. Vosselman, editors,
Proceedings of ASCI’99, 5th Annual Conference of the Advanced School for
Computing and Imaging, pages 221–228. ASCI, Delft, The Netherlands, June
1999.
149. K. Pearson. On lines and planes of closest fit to systems of points in space.
Philosophical Magazine, 2:559–572, 1901.
150. J. Peltonen, A. Klami, and S. Kaski. Learning metrics for information visualisa-
tion. In Proceedings of the 4th Workshop on Self-Organizing Maps (WSOM’03),
pages 213–218. Hibikino, Kitakyushu, Japan, September 2003.
151. Y.B. Pesin. On rigorous mathematical definition of the correlation dimension
and generalized spectrum for dimension. J. Stat. Phys., 71(3/4):529–547, 1993.
152. Y.B. Pesin. Dimension Theory in Dynamical Systems: Contemporary Views
and Applications. The University of Chicago Press, Chicago, 1998.
153. J. Rissanen. Modelling by shortest data description. Automatica, 10:465–471,
1978.
154. H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-
Organizing Maps. Addison-Wesley, Reading, MA, 1992.
155. H. Ritter and K. Schulten. On the stationary state of Kohonen’s self-organizing
sensory mapping. Biological Cybernetics, 54:99–106, 1986.
156. H. Robbins and S. Monro. A stochastic approximation method. Annals of
Mathematical Statistics, 22:400–407, 1951.
157. F. Rosenblatt. The perceptron: A probabilistic model for information storage
and organization in the brain. Psychological Review, 65:386–408, 1958.
158. S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear
embedding. Science, 290(5500):2323–2326, 2000.
159. S.T. Roweis, L.K. Saul, and G.E. Hinton. Global coordination of local linear
models. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances
in Neural Information Processing Systems (NIPS 2001), volume 14. MIT Press,
Cambridge, MA, 2002.
160. D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning internal represen-
tations by error propagation. In D.E. Rumelhart and J.L. McClelland, editors,
Parallel Distributed Processing: Explorations in the Microstructure of Cogni-
tion, volume 1: Foundations. MIT Press, Cambridge, MA, 1986.
161. D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning representations by
back-propagating errors. Nature, 323:533–536, 1986.
162. D.E. Rumelhart and D. Zipser. Feature discovery by competitive learning.
Cognitive Science, 9:75–112, 1985.
163. D.E. Rumelhart and D. Zipser. Feature discovery by competitive learning. In
D.E. Rumelhart and J.L. McClelland, editors, Parallel Distributed Processing:
Explorations in the Microstructure of Cognition, pages 151–193. MIT Press,
Cambridge, MA, 1986.
164. M. Saerens, F. Fouss, L. Yen, and P. Dupont. The principal components analy-
sis of a graph, and its relationships to spectral clustering. In Proceddings of the
15th European Conference on Machine Learning (ECML 2004), volume 3201
of Lecture notes in Artificial Intelligence, pages 371–383, Pisa, Italy, 2004.
165. J.W. Sammon. A nonlinear mapping algorithm for data structure analysis.
IEEE Transactions on Computers, CC-18(5):401–409, 1969.
166. L.K. Saul and S.T. Roweis. Think globally, fit locally: Unsupervised learning of
nonlinear manifolds. Journal of Machine Learning Research, 4:119–155, June
2003.
167. B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as
a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998. Also
available as technical report 44 at the Max Planck Institute for Biological
Cybernetics, Tübingen, Germany, December 1996.
168. G. Schwartz. Estimating the dimension of a model. Annals of Statistics, 6:497–
511, 1978.
169. D.W. Scott. Multivariate Density Estimation: Theory, Practice and Visualiza-
tion. Wiley Series in Probability and Mathematical Statistics. John Wiley &
Sons, New York, 1992.
170. D.W. Scott and J.R. Thompson. Probability density estimation in higher
dimensions. In J.R. Gentle, editor, Proceedings of the Fifteenth Symposium on
the Interface, pages 173–179. Elsevier Science Publishers, B.V., North-Holland,
1983.
171. R.N. Shepard. The analysis of proximities: Multidimensional scaling with an
unknown distance function (parts 1 and 2). Psychometrika, 27:125–140, 219–
249, 1962.
172. J. Shi and J. Malik. Normalized cuts and image segmentation. In Proceedings
IEEE International Conference on Computer Vision and Pattern Recognition,
pages 731–737, 1997.
173. J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
174. A.L. Smith. The maintenance of uncertainty. Manuscript for the Fermi Summer
School, October 1996.
175. M. Spivak. Calculus on Manifolds: A Modern Approach to Classical Theorems
of Advanced Calculus. Addison-Wesley, Reading, MA, 1965.
176. J.F.M. Svensén. GTM: The generative topographic mapping. PhD thesis, Aston
University, Aston, UK, April 1998.
177. A. Taleb and C. Jutten. Source separation in postnonlinear mixtures. IEEE
Transactions on Signal Processing, 47(10):2807–2820, 1999.
178. Y.W. Teh and S.T. Roweis. Automatic alignment of hidden representations.
In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Infor-
mation Processing Systems (NIPS 2002), volume 15. MIT Press, Cambridge,
MA, 2003.
179. J.B. Tenenbaum. Mapping a manifold of perceptual observations. In M. Jor-
dan, M. Kearns, and S. Solla, editors, Advances in Neural Information Process-
ing Systems (NIPS 1997), volume 10, pages 682–688. MIT Press, Cambridge,
MA, 1998.
180. J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric frame-
work for nonlinear dimensionality reduction. Science, 290(5500):2319–2323,
December 2000.
181. P.C. Teo, G. Sapiro, and B.A. Wandell. Creating connected representations of
cortical gray matter for functional MRI visualization. IEEE Transactions on
Medical Imaging, 16(6):852–863, 1997.
182. W.S. Torgerson. Multidimensional scaling, I: Theory and method. Psychome-
trika, 17:401–419, 1952.
183. S. Usui, S. Nakauchi, and M. Nakano. Internal colour representation acquired
by a five-layer neural network. In T. Kohonen, K. Makisara, O. Simula, and
J. Kangas, editors, Artificial Neural Networks. Elsevier Science Publishers,
B.V., North-Holland, 1991.
184. L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review,
38:49–95, March 1996.
185. J. Venna and S. Kaski. Neighborhood preservation in nonlinear projection
methods: An experimental study. In G. Dorffner, H. Bischof, and K. Hornik,
editors, Proceedings of ICANN 2001, International Conference on Artificial
Neural Networks, pages 485–491. Springer, Berlin, 2001.
186. J. Venna and S. Kaski. Local multidimensional scaling with controlled tradeoff
between trustworthiness and continuity. In Proceedings of the 5th Workshop
on Self-Organizing Maps (WSOM’05), pages 695–702. Paris, September 2005.
187. J. Venna and S. Kaski. Local multidimensional scaling. Neural Networks,
19:889–899, 2006.
188. J. Venna and S. Kaski. Visualizing gene interaction graphs with local multidi-
mensional scaling. In M. Verleysen, editor, Proceedings of ESANN 2006, 14th
European Symposium on Artificial Neural Networks, pages 557–562. d-side,
Bruges, Belgium, April 2006.
189. J.J. Verbeek, N. Vlassis, and B. Kröse. Coordinating mixtures of probabilistic
principal component analyzers. Technical Report IAS-UVA-02-01, Computer
Science Institute, University of Amsterdam, Amsterdam, February 2002.
190. T. Villman, R. Der, M. Hermann, and T.M. Martinetz. Topology preservation
in self-organizing maps: exact definition and measurement. IEEE Transactions
on Neural Networks, 8:256–266, 1997.
191. C. von der Malsburg. Self-organization of orientation sensitive cells in the
striate cortex. Kybernetik, 14:85–100, 1973.
192. B.A. Wandell, S. Chial, and B.T. Backus. Visualization and measurement of
the cortical surface. Journal of Cognitive Neuroscience, 12:739–752, 2000.
193. B.A. Wandell and R.F. Dougherty. Computational neuroimaging: Maps and
tracts in the human brain. In B.E. Rogowitz, T.N. Pappas, and S.J. Daly,
editors, Human Vision and Electronic Imaging XI, volume 6057 of Proceedings
of the SPIE, pages 1–12, February 2006.
194. E.J. Wegman. Hyperdimensional data analysis using parallel coordinates.
Journal of the American Statistical Association, 85(411):664–675, September
1990.
195. K.Q. Weinberger, B.D. Packer, and L.K. Saul. Unsupervised learning of image
manifolds by semidefinite programming. In Proceedings of the Tenth Interna-
tional Workshop on Artificial Intelligence and Statistics, Barbados, January
2005.
196. K.Q. Weinberger and L.K. Saul. Unsupervised learning of image manifolds by
semidefinite programming. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR-04), volume 2, pages 988–995,
Washington, DC, 2004.
197. K.Q. Weinberger and L.K. Saul. Unsupervised learning of image manifolds
by semidefinite programming. International Journal of Computer Vision,
70(1):77–90, 2006. In Special Issue: Computer Vision and Pattern Recognition-
CVPR 2004 Guest Editor(s): A. Bobick, R. Chellappa, L. Davis.
198. K.Q. Weinberger, F. Sha, and L.K. Saul. Learning a kernel matrix for nonlinear
dimensionality reduction. In Proceedings of the Twenty-First International
Conference on Machine Learning (ICML-04), pages 839–846, Banff, Canada,
2004.
199. Y. Weiss. Segmentation using eigenvectors: A unifying view. In Proceedings
IEEE International Conference on Computer Vision, pages 975–982, 1999.
200. E.W. Weisstein. Mathworld. A Wolfram web-resource –
http://mathworld.wolfram.com/.
201. P.J. Werbos. Beyond regression: New tools for prediction and analysis in the
behavioral sciences. PhD thesis, Harvard University, 1974.
202. H. Whitney. Differentiable manifolds. Annals of Mathematics, 37(3):645–680,
1936.
203. C.K.I. Williams. On a connection between Kernel PCA and metric multidimen-
sional scaling. In T.K. Leen, T.G. Diettrich, and V. Tresp, editors, Advances in
Neural Information Processing Systems (NIPS 2000), volume 13, pages 675–
681. MIT Press, Cambridge, MA, 2001.
204. L. Xiao, J. Sun, and S. Boyd. A duality view of spectral methods for dimen-
sionality reduction. In Proceedings of the 23rd International Conference on
Machine Learning. Pittsburg, PA, 2006.
205. H.H. Yang, S. Amari, and A. Cichocki. Information-theoretic approach to blind
separation of sources in non-linear mixtures. Signal Processing, 64(3):291–300,
1998.
206. L. Yen, D. Vanvyve, F. Wouters, F. Fouss, M. Verleysen, and M. Saerens.
Clustering using a random-walk based distance measure. In M. Verleysen,
editor, Proceedings of ESANN 2005, 13th European Symposium on Artificial
Neural Networks, pages 317–324, Bruges, Belgium, April 2005. d-side.
207. P.N. Yianilos. Data structures and algorithms for nearest neighbor search in
general metric spaces. In Proceedings of the Fifth Annual ACM-SIAM Sympo-
sium on Discrete Algorithms (SODA), 1993.
208. G. Young and A.S. Householder. Discussion of a set of points in terms of their
mutual distances. Psychometrika, 3:19–22, 1938.
209. H. Zha and Z. Zhang. Isometric embedding and continuum ISOMAP. In
Proceedings of the Twentieth International Conference on Machine Learning
(ICML-2003). Washington DC, August 2003.
210. U. Zwick. Exact and approximate distances in graphs — a survey. In Proceed-
ings of the 9th ESA, pages 33–48. 2001. Updated version available on Internet.
Index

Acronym, XVII, 234 Back-propagation, 87, 135, 228
Adjacency, see also matrix Ball, 12, 51, 55, 59, 60, 117
Affinity, 238 -, 12, 18, 102, 112, 162, 166, 168,
AIC, 30, see also Akaike’s information 191, 239
criterion closed, 49, 54
Akaike’s information criterion, 30 open, 12, 47
Algorithm, 22, 23, 236, 238, 277 radius, 58, 100, 113, 118, 280
batch, 32, 43, 94 Barycenter, 265
batch vs. online, 37, 43 Bayes
online, 43, 94, 97 rule, 144, 145
Analysis theorem, 148
curvilinear component, 88, see also Bayesian information criterion, 30
CCA, 226, 228, 236 Benchmark, 14, 80
curvilinear distance, 97, see also BIC, 30, see also Bayesian information
CDA, 114, 227, 228 criterion
independent component, 11, see also BMU, 265, see also Unit, 266, 267
ICA, 228, 234 Bone effect, 182
multivariate data, 2, 24 Bootstrap, 62
principal component, 11, see also Box, 51, 54, 55, 57, 59, 277
PCA, 22, 24, 59, 226, 228, 234 closed, 15
Anchor, 109, 111 number of, 51, 52
Angle, 9, 50, 113, 152 open, 15, 79, 80, 86, 94, 109, 113, 119,
preservation, 126, 152 124, 125, 130, 140, 149, 157, 163,
ANN, 228, see also Network, 238 169
APED, 281, see also Distance triangular, 52
Application, 19, 38, 42, 43, 82, 123, 214, Brain, 3, 199
230, 232, 242, 244 BSS, 11, see also Separation, 228, 234
field, 2, 81
real, 43 Calibration, 230, 232, 233
real-time, 32 Capacity dimension, see Dimension
APSP, 101, see also Path, 281, 282 CCA, 88, see also Analysis, 96, 97, 111,
Ascent, 261, see also Descent 114, 115, 206, 215, 226–228, 233,
Axiom, 12, 70, 71 236, 238, 241
CDA, 114, see also Analysis, 166, 174, latent, 150
178, 206, 214, 215, 222, 228, 233, low-dimensional, 129, 139, 154, 168
236, 241, 242 matrix, 269
Centering, 25, 26, 28, 74, 76, 230, 232 system, 4, 21, 37, 42, 45, 60, 158, 256
double-, 76, 79, 80, 106, 107, 123, Coordination, 159, 228
124, 126, 129, 235, 238, 239 Corner, 4, 9, 15, 55, 58
Centroid, 43, 263, 264, 268 number of, 52
Circle, 11, 13, 15, 140, 193, 272, 274 Correlation, 10
Classification, 39, 80, 86, 94, 109, 113, dimension, see also Dimension
119, 123, 125, 131, 141, 150, 157, sum, 53, 56, 57, 277, 278
163, 170, 232 Cortex, 199, see also Surface, 203
Cluster, 61, 207, 217, 220, 268 visual, 135
Clustering, 59, 238, 242 Covariance, 30, 32–34, 155
hierarchical, 242 matrix, 28, 29, 31, 35, 73, 75, 80, 155,
spectral, 164, 165, 220, 228, 242 231, 233, 248, 253, 254, 256
Coastline, 48, 55, 57 Covering, 47
paradox, 49 Criterion, 17, see also Optimization, 23,
Codebook, 167, 263–265, 267 26, 209, 245
Color, 4, 140 stopping, 241
scale, 198 Cross validation, 62
Combination CTD, 235, see also Distance
linear, 31, 152 Cube, 9, 11, 15, 49, 51, 63, 225, 277
Commutativity, 72 Cumulant, 252
Comparison, 62, 170, 173, 176, 225, 227 higher-order, 31
Complexity, 77, 107, 124, 156, 168 second-order, 31
time, 233, 281 Cumulative density function, 252
Component Curvature, 20, 83, 261
principal, 28, 34, 35, 60, 63, 78, 84,
Curve, 55–57, 60, 99, 102, 103, 182, 214
93, 117, 123, 148, 206, 207
elbow, 62, 107
Compression, 19, 42, 268
principal, 142, 228
Concentration, 8, 9
space-filling, 47
Cone, 129
Cut
Connectivity, 11, 20, 244
normalized, 165, 228
Constraint, 19–21, 39, 44, 126, 128–130,
Cylinder, 193
153–155, 160, 161, 185, 189
centering, 128, 155
equality, 127, 129, 130, 209, 211, 214 Data
inequality, 130, 186, 206, 212, 214, compression, 19, 42
240 flow, 230, 232
isometry, 126, 127, 131 high-dimensional, 17, 226, 230, 242
Convergence, 23, 44, 92, 139, 148, 168, mining, 3, 24
171, 233, 241, 244, 260–262, 267 missing, 81, 230
quadratic, 259 multivariate, 268
Convexity, 106 readability, 225
Coordinate, 13, 31, 41, 42, 70, 71, 75, rule, 272
76, 78, 84, 93, 117, 139, 153, 166, spatial, 4
168, 194, 222, 235, 279, 280 temporal, 4
axis, 9, 254, 256 visualization, 9
intrinsic, 153 Decomposition
spectral, 107, 120, 126, 150, 159, 164, hard vs. soft, 37, 38
236, 248, see also Eigenvalue linear, 40, see also LDR, 231, 232
Decorrelation, 24, 28, 226, 231 nonlinear, 40, see also NLDR, 88,
Definition, 12, 13, 47–49, 51, 53, 69, 70, 133, 136, 165, 231–233, 268
72, 82, 95, 97, 104, 138, 160, 171, Disparity, 81
191, 193, 247–249, 266 Dissimilarity, 73, 81
Delaunay triangulation, 269, 275 Distance, 54, 55, 63, 79, 127, 233, 266,
Dependencies, 10, 22, 26, 34 271, 279, 280
Derivative, 53, 260 city-block, 71
numerical, 56, 57, 279 commute-time, 164, see also CTD,
partial, 72, 83, 147, 260 181, 235, 237
second, 84 computation, 280
Descent Euclidean, 70, 72, 76, 87, 98, 99, 111,
gradient, 83, 87, 145, 147, 233, 261 113, 126, 127, 138, 227, 230, 233,
classical, 262, 266 234, 239, 240, 255, 280, 282
stochastic, 23, 44, 150, 233, 236, all-pairs, 281
238, 241, 261, 262, 266 single-pair, 281
steepest, 261 single-source, 281
Diagonalization, 31 function, 70, 71, 133
Diffeomorphism, 13, 104 geodesic, 99, 102–104, 106, 110, 111,
Dijkstra algorithm, 101–103, 107, 108, 113, 227, 242, 269, 280
113, 227, 281, 282 graph, 101–103, 106, 107, 109,
Dimension 111–114, 126, 127, 131, 167, 168,
q-, 49 197, 207, 222, 227, 230, 234, 235,
box-counting, 51, 277 239, 240, 242, 269, 280–282
capacity, 49, 51, 53–55, 59, 277 interpoint, 73
correlation, 49, 53–55, 57, 59, 60, 63, Mahalanobis, 72, 231
66, 67 Manhattan, 71
fractal, 48, 49, 229 matrix, 87, 111, 239
inequalities, 54 maximum, 71
information, 49, 52, 53, 55 Minkowski, 71
intrinsic, 242 pairwise, 24, 45, 55, 69, 73, 80, 81, 87,
Lebesgue covering, 47 97, 126, 127, 129, 168, 231, 233,
spectrum, 49 234, 277, 280, 282
target, 243, 244 preservation, 16, 24, 45, 69, 70, 81,
topological, 47, 48 83, 86, 88, 91, 95–99, 113, 126,
Dimensionality, 15, 35, 239, 263, 279 133, 227, 233, 235, 236, 244, 245
curse of, 3, 6, 226, 243, 246 local, 186
data, 243 strict, 126, 189
embedding, 107 weighted, 94, 131
estimation, 18, 30, 37, 41, 60, 106, spatial, 70
150, 226, 229, 277 Distortion, 235
intrinsic, 18, 19, 30, 47, 60, 62, 67, Distribution, 7, 59, 142, 167, 175, 194
106, 109, 150, 229, 230, 243, 244, Gaussian, 7, 8, 25, 31, 33, 168, 252
246, 277 isotropic, 254
reduction, 2, 11, 13, 18, 20, 69, 94, multidimensional, 253
97, 125, 191, 225–227, 229, 233, joint, 11
234, 245, 263, 280 Laplacian, 252
hard, 231, 243 multivariate, 254
non-Gaussian, 35 stochastic neighbor, 171
of norms, 8 theorem, 13
posterior, 144, 148 Empty space phenomenon, 6, 226, 246
prior, 142, 144, 148 Entropy, 53
probability, 144 differential, 252
support, 11, 13, 54 Epoch, 44, 93, 117, 137–139, 141, 167,
uniform, 35, 252 168, 262, 267
Distributivity, 72 Equation
DR, 2, see also Dimensionality, 60, 66, parametric, 13, 99, 100, 104–106, 173,
228 178, 188, 193, 194, 269
Duality, 235, 237–239, 246 system, 155
Equivalence, 75, 76, 78
ε-rule, 107, 118, 271, 280 Edge, 4, 6, 15, 65
Edge, 4, 6, 15, 65 mean square, 23
Eigenfunction, 122, 162 trial and, 60, 62
Eigenproblem, 42, 155, 158, 240–242, EVD, 28, see also Eigenvalue, 32,
244 77, 78, 80, 107, 124, 125, 129,
Eigenspectrum, 240 156–158, 162, 164, 199, 247, 248
Eigenvalue, 29–31, 33–35, 65, 77, 80, Expectation, 25
106, 109, 122, 125, 156, 161, 162, Expectation-maximization, 146, see also
235, 238, 240, 248 EM, 237
decomposition, 28, see also EVD, 32, Extremum, 260, 261
75, 121–123, 129, 247, 248 local, 260
Eigenvector, 29, 31, 42, 75, 77, 80,
109, 121–123, 125, 129, 156–158, Face, 15, 80, 86, 149, 169
161–165, 176, 189, 238, 244, 248, artificial, 206
256 picture, 214
bottom, 155, 164, 235, 238, 239 real, 214
top, 126, 235 Fibonacci heap, 282
Ellipse, 11, 33, 34, 73, 272, 274 Fractal object, 48, 49, 55
Ellipsoid, 11, 193, 254, 274 Freedom
EM, 146, see also Expectation- degree of, 13, 18, 207
maximization, 148, 150, 151, 156, Function
176, 203, 237 (de)coding, 24, 26, 263, 265
Embedding, 12, 14, 133, 171, 186, 233, approximation, 123, 232
240–242 extremum, 260
dimension, 77 multivariate, 260
dimensionality, 162 objective, 44, 127, 128, 139, 143, 229,
incremental, 86 238, 241, 242, 259
isometric, 237 radial basis, 123, 146, 256
layered vs. standalone, 37, 42 step, 54, 92, 116
locally linear, 152, see also LLE, 227, weighting, 94
228 Functionality, 17, 18, 226
low-dimensional, 70, 166
quality, 245 Geodesic, 99, see also Distance
re-, 13, 14, 19 GNLM, 111, see also NLM, 113, 118,
semidefinite, 120, 126, see also SDE, 166, 174, 178, 186, 203, 206, 214,
228, 236 215, 222, 233, 242
space, 18, 19, 43, 239 Gradient, 87, 260, 261, 266
Gram, see Matrix ICA, 11, see also Analysis, 24, 228, 232,
Graph, 12, 100, 103, 112, 126, 130, 134, 234, 257
152, 160, 162, 191, 207, 216, 217, Image, 215
227, 228, 238, 239 face, 206
adjacency, 244 number of, 207
building, 111, 165, 168, 269, 272, 280 processing, 2
connected, 102, 282 Implementation, 77, 280
directed, 100 Independence, 2, 26, 226, 232, 252
distance, see also Distance statistical, 8, 21, 24, 36, 74, 229, 232,
edge, 100, 103, 107, 134, 152, 159, 254
239, 272, 274, 281 Inequality, 71, 133
parasitic, 185 triangular, 70, 101, 115
edge-labeled, 100 Inference, 238
edge-weighted, 100 Information, 3, 10, 24, 70, 80, 88, 225,
Euclidean, 100 229, 243–245, 272, 274
Laplacian, 159, 163, 164 dimension, see also Dimension
partition, 165 local, 158
path, 100, 102 ordinal, 81
theory, 101, 159, 238 theory, 53
undirected, 100, 126, 129, 159, 270, Initialization, 84, 93, 101, 102, 116, 117,
281, 282 139, 148, 149, 166–168, 242, 265,
vertex, 100–103, 111, 118, 134, 135, 267
Implementation, 277 152, 159, 166–169, 272, 274, 281,
282 Invariance, 153
closest, 168 ISODATA, 228, 265
vertex-labeled, 100 Isomap, 97, 102–104, 106–114, 118, 120,
weighted, 101, 107, 134, 281, 282 125, 126, 131, 157–159, 163, 165,
Grid, 6, 49, 52, 136, 137, 146, 148–151, 166, 173–176, 178, 182, 186, 189,
170, 174, 176, 203, 222, 234, 277 190, 199, 206, 207, 209, 214, 215,
ε-, 51
239, 242
growing, 142
Isometry, 104, 130, 131, 191, 237
hexagonal, 135, 216
local, 126, 127, 131
rectangular, 142
Isotop, 142, 165–171, 176, 181, 187, 189,
shape, 143
193, 203, 206, 209, 214, 215, 220,
GTM, 142, see also Mapping, 143, 146,
222, 227, 228, 241, 245
148, 152, 228, 229, 237, 238, 245
Iteration, 48, 52, 55, 83, 84, 92, 139,
Guideline, 242, 243
233, 241, 259, 262
number of, 84, 92, 113, 130, 148, 151,
Heap, 282 168, 176, 203, 241
Hexagon, 135, 138, 139
Hierarchy, 233 Japanese flag, 178, 180–182
Histogram, 278
bin, 275, 278 K-means, 119, 228, 243, 265–267
rule, 274, 275 K-rule, 100, 101, 107, 194, 207, 270,
History, 69, 73, 88, 171, 228 271, 280
HLLE, 159, see also LLE, 228 Kernel, 70, 122, 151, 156, 160, 162, 171,
Hole, 106, 131, 175, 178, 180, 217 176, 236, 237, 239
Homeomorphism, 12 function, 125, 126, 130, 239, 240, 244
Gaussian, 123–125, 167, 169, 171, Literature, 49, 55, 81, 109, 111, 129,
234, 256 142, 220, 229, 236, 239, 240, 242,
heat, 160, 162–164, 234 245
learning, 126 LLE, 152, see also Embedding, 157,
matrix, 239 158, 163, 166, 176, 190, 193, 214,
optimization, 241 216, 228, 238, 239, 242, 245
PCA, 120, see also KPCA, 228 Hessian, 159, 228
polynomial, 123 LMDS, 228, see also MDS
trick, 123, 125 Locality preserving projection, 165
type, 236 Locally linear embedding, 227
width, 124 log-log plot, 55, 56, 58, 279
Knot, 12, 193 Loop, 196, 199, 262
trefoil, 13, 193, 196 essential, 193, 199, 242, 244
Koch’s island, 48, 55, 57 LPCA, 158, see also PCA
KPCA, 120, see also Kernel, 122–126, LPP, 165, see also Projection, 228
131, 157, 165, 199, 228, 236, 237,
239, 244, 245 Magic factor, 83
Kurtosis, 31 Magnetic resonance imaging, 199
excess, 252 Manifold, 12, 14, 19, 48, 54, 57, 79, 103,
124, 133, 233, 242, 244, 269, 271,
Lagrange multiplier, 154 280
Landmark, 109, 111, 131 assumption, 242
Laplace-Beltrami operator, 162–164 boundary, 12
Laplacian eigenmaps, 159, see also LE, convex, 131, 242
162, 165, 228 curved, 245
Laplacianfaces, 165 developable, 104–106, 109, 110, 114,
Lattice, 115, 134–138, 152, 237 115, 119, 174, 177, 189, 191, 196,
data-driven, 134, 142, 152, 187, 193, 240, 242
220, 234, 237 dimension, 13
predefined, 134, 135, 152, 187, 203, disconnected, 242
222, 234 nonconvex, 131, 180
LBG, 228, 265, 267 nondevelopable, 106, 110, 113, 115,
LDR, 40, see also Dimensionality 188–190, 197, 203, 240
LE, 159, see also Laplacian eigenmaps, nonlinear, 42, 60, 67, 69, 86, 109, 110,
163, 165, 166, 176, 228, 238, 239, 119, 120, 125, 153, 157, 180
242, 245 shape, 169
Learning, 3, 168 smooth, 13, 49, 102, 126, 162, 163,
Bayesian, 144 166
competitive, 138, 166, 175, 228, 266, underlying, 14, 19, 66, 93, 95, 100,
267 119, 120, 127, 152, 159, 162, 166,
frequentist, 143, 144 185, 193, 220, 229, 237, 240, 242,
machine, 238 245, 246
rate, 23, 44, 92, 95, 138, 139, 168, MAP, 144
241, 261, 262, 266 Mapping, 14, 20, 23, 41, 82, 120, 122,
supervised, 10, 232 123, 171
unsupervised, 87, 232, 233 conformal, 152
Likelihood, 143–148, 150 continuous, 82
maximum, 237 discrete, 87
Lindenmayer system, 50, 55 explicit vs. implicit, 37, 41
generative topographic, 135, see also three-way, 74
GTM, 142, see also GTM, 143, two-way, 74
149, 227, 228 Mean, 8, 25, 123, 233, 251, 255, 256,
isometric feature, 237 262, 278
linear, 153 subtraction, 76, 121, 123, 230, 232
nonlinear, 82, 229, 239 vector, 253
Sammon’s nonlinear, 82, see also Memory, 32, 76, 77, 87, 130, 168, 278
Sammon, 111, 226 requirement, 80, 84, 107, 199, 203,
smooth, 145 233
Matrix Mercer theorem, 122
adjacency, 126, 129, 159, 160, 162 Method, 228, 277
dense, 234 comparison, 173
determinant, 248, 253 distance-preserving, 70, 73, 74, 86,
diagonal, 248 88, 92, 95, 115, 120, 174, 191, 203,
Gram, 74, 75, 80, 107, 126, 130, 164, 206, 207, 209, 215, 226, 230, 233,
239 234, 236, 237, 245
local, 154 DR, 37, 44, 60, 62, 88, 94, 115, 225,
Gram-like, 164, 189, 235, 236, 240, 228, 263
242 finite-element, 269
Hessian, 83, 87, 260, 261 graph-based, 242
Jacobian, 99, 105, 106 incremental, 42, 62, 67, 80, 110, 125,
Laplacian, 160–165, 234, 235 131, 142, 157, 161, 163, 176, 189,
normalized, 162, 165 190
mixing, 33, 256 iterative, 233
nonsparse, 77, 162 NLDR, 102, 191, 214, 233, 238, 239,
positive semidefinite, 248 241–244, 246
sparse, 156, 158, 235, 238, 239 nonspectral, 42, 241, 242, 244, 259
square, 248 spectral, 42, 80, 109, 120, 125, 131,
square root, 248, 256 157, 159, 163, 165, 176, 177, 180,
symmetric, 128, 247, 248 189, 190, 199, 233–236, 238–241,
trace, 248 244, 246
unitary, 247 topology-preserving, 134, 177, 181,
Maximization, 235, 261 184, 187, 189, 203, 206, 209, 230,
Maximum, 260 234, 235, 237
global, 238 Methodology, 242
local, 129 Metric, 8, 87, 106, 227
Maximum variance unfolding, 126, see spatial, 97
also MVU Minimization, 83, 235, 261
MDL, 30, see also Minimum description Minimum, 260
length local, 23, 265, 266
MDS, 73, see also Scaling, 74, 228, 245 Minimum description length, 30
classical, 74 Minkowski, see also Norm and Distance
local, 228 MLP, 87, 123, 135, 146, 147, 156, 228,
metric, 26, 73–78, 80, 82, 102, 106, 246
111, 120, 127, 128, 207, 226, 228, Model, 10, 22–24, 35, 39, 40
233, 234, 236, 237, 239, 242, 243, complexity, 232
245 continuous vs. discrete, 22, 37, 40
nonmetric, 73, 81, 226, 228 Euclidean, 73
step, 240 generalized linear, 146
generative, 69, 74, 82, 110, 139, 143, Hopfield, 228
150, 238 radial basis function, 146, see also
generative vs. traditional, 37, 39 RBFN, 256
linear, 26, 59, 69, 229 topology-representing, 111, see also
linear vs. nonlinear, 37, 40 TRN, 119, 134
local linear, 228 Neural gas, 97, see also NG, 119, 267,
locally linear, 159 272
mapping, 245 Neuron, 1, 114, 199, 228
nonlinear, 60, 67 Neuroscience, 238
parameters, 23, 144 Newton method, 83, 259–261
Moment, 8 quasi-, 83, 87, 236, 261
central, 8, 252 NG, 267, see also Neural gas
first-order, 251 NLDR, 40, see also Dimensionality, 73,
second-order, 251 229, 234, 238–240, 244, 269, 280
MVU, 126, see also Maximum variance NLM, 82, see also Sammon, 86, 94, 97,
unfolding, 236 111, 156, 215, 227, 228, 233, 236,
241, 242, 245
Neighbor, 108, 113, 126, 128, 137, 138, geodesic, 111, 227
152–154, 159, 164, 167, 171, 269, Noise, 14, 26, 29, 30, 57, 60, 95, 102,
271 231, 243
closest, 112, 116, 166, 214, 216, 271 Gaussian, 57, 270, 275
nearest, 9, 126, 129, 154, 156, 157, white, 57
233, 234 Nondegeneracy, 70
number of, 107, 113, 118, 130, 156, Nonnegativity, 71, 101
158, 222 Norm, 70, 73
Neighborhood, 12, 97, 153, 158, 171, Lp , 71
191, 193, 199, 207, 214, 233, 260 Euclidean, 8, 72
K-ary, 102, 113, 126, 129, 159, 162, Minkowski, 71, 279
165, 168, 185, 191, 214, 215, 220, Nyström formula, 79, 130, 157, 162, 163
239, 240, 270
ε-, 12, 113, 159, 162, 185 Optimization, 6, 42, 44, 80, 81, 83, 97,
function, 138, 139, 167, 168 109, 111, 128, 147, 236, 238–241,
graph, 111 259
open, 12 approximate vs. exact, 37, 44
preservation, 97, 153, 160, 168, 185, convex, 127
222 criterion, 17, 23, 37, 44, 75
proportion, 117, 186, 189 iterative, 259, 260
relationship, 15, 69, 93, 109, 119, 133, scheme, 233
135, 137, 142, 152, 159, 160, 166, Optimum
227, 242, 267, 269 global, 110, 240
shape, 138, 139 local, 23, 241, 244, 245
type, 162, 191 Outlier, 275
width, 85, 92, 93, 95, 115–118, 138,
139, 141, 167–169, 241 Parameter, 8, 13, 22, 23, 41, 83, 94, 99,
Network, 1 107, 124, 125, 130, 144–148, 156,
artificial neural, 86, see also ANN, 158, 162, 167, 171, 185, 240, 241,
88, 114, 134, 135, 228, 237, 238 244, 261, 275
autoassociative, 228, 234 free, 63, 229, 256
density, 144–146 hidden, 226
hyper-, 23, 143 marginal, 254
number of, 190 posterior, 145, 147, 148
time-varying, 44 prior, 144
tuning, 93, 113, 119, 124, 146, 157, Process
163, 174, 186, 189, 193, 199 stochastic, 24
vector, 145 Projection, 4, 10, 14, 22, 225, 226, 228,
Partition, 265 233, 238, 239
Patch, 59, 60, 153, 159 incremental, 230
Path, see also Graph locality preserving, 165, see also LPP,
concatenated, 101 228
length, 100, 101, 127, 282 pursuit, 228, see also PP
shortest, 101, 102, 281, 282 Property, 6, 10, 11, 13, 23, 71, 72, 104,
all-pairs, 101, 281, 282 126, 129, 153, 225, 242, 251, 252,
single-source, 101, 281 254, 256, 269, 270, 281
PCA, 11, see also Analysis, 23, 24, Prototype, 43, 92, 93, 114, 137, 139,
28–31, 60, 75, 76, 78, 80, 88, 123, 141, 146, 166, 168, 169, 171, 175,
125, 143, 165, 226, 228, 232, 234, 176, 188, 222, 263–267, 270, 271,
236, 242, 243, 245, 256 274, 275
global, 63, 67 closest, 137, 139–141, 272, 274
local, 43, 59, 60, 65, 67, 158, 228, 230 second, 267
model, 24, 26, 59 number of, 139, 142, 168, 175, 181,
neural, 228 184, 189, 193, 194, 266, 267
pdf, 7, see also Probability, 252, 254 Proximity, 45, 81, 133, 134, 209
joint, 253, 254, 256 relationship, 45
PDS, see also Search, 280 relative, 152
Perceptron, 228 scaled, 81
multilayer, 135, 238 Pseudo-inverse, 164, 235, 247
Perspective, 4, 225 Psychology, 39, 73
Plane, 9, 15, 134, 194, 237
hyper-, 13, 41, 69, 84, 93, 103, 117, Radius, 15, 255
122, 129, 136 Rank, 3, 73, 81, 119, 133, 209, 214, 245,
Plateau, 55 247, 256
Point, 12, 13, 15, 21, 41, 54, 57, 66, 70, error, 214, 222
71, 73, 103, 134, 136, 158, 191, RBFN, 146, see also Network, 147, 151,
227, 235, 240, 263, 265, 269, 270, 156, 256
274, 277 Reconstruction, 34
closest, 9, 100, 101, 167, 279, 280 error, 24, 26, 27, 60, 66, 69, 153, 154,
indexed, 73 231, 235
isolated, 55 locally linear, 154
number of, 277 weight, 153, 156, 158
representative, 59 Redundancy, 2, 6, 18, 282
source, 280 Refinement, 47
PP, 228, see also Projection, 234 Regularization, 143, 156, 245
Preprocessing, 21, 25, 86, 88, 94, 109, factor, 158
119, 131, 215, 231, 232, 263 term, 143
Probability, 253 Relevance, 10
density function, 7, see also pdf, 251, Representation, 4, 11, 12, 19, 81
263 discrete, 15, 134, 136
joint, 253 geometrical, 4, 73
graphical, 225 Shortcoming, 226
low-dimensional, 11, 18, 20, 21, 41, Shortcut, 99, 131, 134, 176, 185, 186,
42, 69, 225, 226, 229 193
spatial, 4 Similarity, 73
temporal, 4 measure, 234
three-dimensional, 15 Sine, 269–271, 274
Responsibility, 147, 148 Singular value
Robbins–Monro procedure, 44, 137, decomposition, 27, see also SVD, 32,
138, 141, 167, 266, 267 75, 247
Rotation, 2, 31, 35, 153, 155, 231 Skewness, 31, 252
Ruler, 49 Slice, see also Swiss-roll
Slope, 55, 56, 214
Saddle point, 260 average, 55
Sammon, 111 SNE, see also Embedding, 172
nonlinear mapping, 66, 81, 82, see SOM, 40, see also Self-organizing map,
also NLM, 226 41, 43, 88, 94, 115, 138, 165, 167,
stress, 66, 83, 86, 88, 89, 112, 113, 203 169–171, 227, 228, 238, 241, 244,
Scalar product, 70, 72, 74, 76, 80, 122, 245, 267
123, 126–128, 152, 274 grid, 164, 222
pairwise, 74, 120 Sort
Scale, 12, 35, 48, 54, 55, 57, 59, 60, 67, partial, 279
95, 230 Space, 13, 19, 70
local, 59, 270 Cartesian, 6, 12, 42, 71, 82
Scaling, 33, 87, 106, 161, 178, 232, 256 data, 106, 234, 237
factor, 31, 88, 248 embedding, 87, 233–235
method, 73 Euclidean, 3, 12
multidimensional, 26, see also MDS, high-dimensional, 6, 8, 41, 82, 225
73, 226 latent, 106, 112, 145, 146, 148–150,
re-, 153, 155 245
SDE, 126, see also Embedding, 127, low-dimensional, 41–43, 82, 238, 242
129, 131, 176, 199, 206, 214, 228, metric, 49, 70, 73
233, 235, 236, 239–242, 244 topological, 12, 47
SDP, 128, see also Semidefinite SPED, 281, see also Distance
programming, 129, 130, 244 Sphere, 11, 193, 197, 225
Search diameter, 6
partial distance, 279, see also PDS essential, 193, 199, 242, 244
Self-organization, 135 hollow, 106
Self-organizing map, 88, 135, see also hyper-, 7–9, 13
SOM, 227, 237, 267 radius, 6–9, 12
Self-similarity, 48 surface, 8
Semidefinite programming, 128, see also volume, 6, 7
SDP, 131, 233, 235, 240 Spiral, 57, 59, 60, 149, 269–271, 274
Sensor, 3, 63 SSED, see also Distance
array, 1, 2 SSSP, 101, see also Path, 281, 282
Separation, 94, 171, 220, 226, 257 Standard deviation, 26, 57, 138, 230,
blind source, 11, 136, 228, 234 233, 254, 270
latent variable, 11, 228, 229, 232–234 Standardization, 26, 230–232
Shell, 199 Step size, 44, 85, 87, 112, 113, 138, 167,
spherical, 7 241, 261
Stress, 80–82, 84–88, 111, 114 linear, 226
Structure, 4, 11, 13, 18–20, 24, 48, 70, nonlinear, 241
229, 245 Translation, 76, 153
algorithmic, 233 Tree
data, 3, 24, 159, 280, 282 hierarchy, 233, 234
distributed, 238 spanning, 282
local, 165 Trial and error, 60, 66, 67, 230
Riemannian, 102 Triangle, 50, 52, 91, 203
Submanifold, 13, see also Manifold TRN, 134, see also Network, 142
Subspace, 22, 29, 30
Superposition, 50, 80, 86, 93, 109, 113, Unit, 263, 267
118, 124, 130, 157, 169, 186, 187 best-matching, 265
Support, 252 dead, 267
Support vector machine, 123, see also Update, 84, 87, 92, 117, 139, 141, 149,
SVM, 256 168, 172, 262, 265, 267
Surface, 4, 9, 19, 54 rule, 83, 87, 89–93, 111, 112, 116, 117,
cortical, 199, 203 138, 147, 167, 169, 171, 241, 262
SVD, 27, see also Singular value, 32,
76, 78, 158, 247 Validation, 3
economy-size, 247 Variable, 3, 10, 230
SVM, 123, see also support vector decorrelated, 231
machine Gaussian, 251, 252
Swiss roll, 14, 15, 79, 80, 86, 93, 109, uncorrelated, 254
113, 118, 124, 127, 130, 149, 157, latent, 11, 18–20, 24, 25, 28–33, 35,
163, 169, 173–176, 178, 184–187, 36, 69, 75, 106, 148, 226, 228, 232
189, 190 normal, 252
heated, 188–190 number of, 11
nondevelopable, 188 observed, 24, 28, 69, 226, 229, 232
thin, 182, 184 random, 251–253, 256
Swiss-cheese effect, 175, 177, 217 selection, 230–232
Symmetry, 71, 77, 101, 126, 153 standardized, 26, see also Standard-
ization
τ -rule, 271 Variance, 7, 29, 144, 148, 167, 235, 238,
Taxonomy, 233, 266 243, 251, 256
Taylor expansion, 57, 259 global, 29, 62, 206, 207
Tear, 11, 97, 117–119, 157, 186, 189, preservation, 24, 26, 28, 69
193, 196, 242 residual, 107, 110
Threshold, 30, 54, 60, 62 Variant, 74, 80, 81, 83, 85–87, 95, 96,
Time, 4 109–111, 114, 119, 131, 142, 151,
running, 130 158, 159, 164, 165, 171, 197
Timeline, 228 nonlinear, 226, 245
Tolerance, 156 Vector
Topology, 11, 19, 133, 227, 233, 237, 240 code, 263, 264
discrete, 134 random, 7, 8, 24, 47, 144, 256
preservation, 16, 133, 134, 152, 166, Vector quantization, 37, 41, 43, 59, 86,
171, 193, 203, 214, 227, 234, 235, 88, 94, 109, 111, 119, 131, 134,
237, 245 136, 165–168, 170, 175, 182, 203,
Torus, 193, 194, 197, 199 222, 243, 263, 266, 270, 272, 274
Transformation, 10, 126, 152, 153, 231 and projection, see also VQP
distortion, 263, 265, 266 Whitening, 231, 232
Vertex, see also Graph Width, 149, 255, 278
source, 101, 102, 108, 281, 282 Window, 59, 60, 65
Visualization, 4, 19, 39, 42, 135, 142,
Winner
225, 227, 232, 244
take all, 267
Volume, 6, 7, 54
Voronoï region, 265, 267
VQP, 88, see also Vector quantization, WTA, 267, see also Winner
97, 228, 238 WTM, 267, see also Winner
