Advances in Self-Organizing Maps
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
6731
Advances in
Self-Organizing Maps
8th International Workshop, WSOM 2011
Espoo, Finland, June 13-15, 2011
Proceedings
Volume Editors
Jorma Laaksonen
Aalto University School of Science
Department of Information and Computer Science
00076 Aalto, Finland
E-mail: jorma.laaksonen@aalto.fi
Timo Honkela
Aalto University School of Science
Department of Information and Computer Science
00076 Aalto, Finland
E-mail: timo.honkela@aalto.fi
ISSN 0302-9743
e-ISSN 1611-3349
ISBN 978-3-642-21565-0
e-ISBN 978-3-642-21566-7
DOI 10.1007/978-3-642-21566-7
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011928559
CR Subject Classification (1998): F.1, I.2, D.2, J.3, H.2.8, I.4, I.5
LNCS Sublibrary: SL 1 Theoretical Computer Science and General Issues
Preface
The 8th Workshop on Self-Organizing Maps, WSOM 2011, was the eighth event
in a series of biennial international conferences that started with WSOM 1997
at the Helsinki University of Technology.
WSOM 2011 brought together researchers and practitioners in the field of
self-organizing systems, with a particular emphasis on the self-organizing map
(SOM). When academician Teuvo Kohonen was conducting his pioneering work
with a small number of colleagues in the 1970s and 1980s, the prospects of
neural network research were not widely acknowledged. The main focus was
on artificial intelligence research based on symbol manipulation methodologies.
As notable exceptions, Teuvo Kohonen as well as Stephen Grossberg, Shun-ichi
Amari and Christoph von der Malsburg continued their efforts regardless
of criticism that was often based on short-sighted interpretations of the book
Perceptrons published in 1969 by Marvin Minsky and Seymour Papert.
For a long time, and regrettably also often these days, the term "neural
networks" was considered to be synonymous with multilayer perceptrons. However,
multilayer perceptrons have given ground to more advanced forms of supervised
learning including support vector machines. Actually, among the three classic
neural network paradigms (multilayer perceptrons, Hopfield nets and SOMs),
only the last one has remained in a strong position. The persistent interest in
the SOM algorithm can perhaps be explained by its strength as an unsupervised
learning algorithm and by its virtues in analyzing and visualizing complex data
sets.
Presently, research on artificial neural networks is a well-established
scientific discipline and an area of technological development with a large
number of applications. Artificial neural network research can be divided into
three main strands: (1) explicit modeling of biological neural circuits and
systems, (2) neurally inspired computing, and (3) statistical machine learning
research that has mostly abandoned its biologically inspired roots. This
classification cannot be considered clear-cut, but rather a continuum. In his
banquet keynote talk at the IJCNN 2007 conference, Michael Jordan emphasized
the importance of neural network research for its role in facilitating the
path to current statistical machine learning research. Obviously, the
biological inspiration helped in abandoning some restricting assumptions that
were commonly held in classic statistical computing.
There are hundreds of different variants of the basic SOM algorithm, each
typically proposing some advantage by giving up an aspect of the original
formulation, such as computational efficiency, capabilities in visualization,
implementational simplicity or biological relevance. In general, the SOM
has inspired a lot of methodological research and provided a tool for a large
number of real-world cases.
The WSOM 2011 event covered the results of research in theory and methodology
development as well as selected examples of applications. When applications of
the SOM are considered, it is good to keep in mind that the thousands
of uses of the SOM in different fields of science are usually reported in the
specific fora of each discipline. Moreover, the commercial projects based on the
SOM are typically not reported publicly, but there are many indications that
the entrepreneurial use of the SOM and its variants in data analysis, knowledge
management and business intelligence is widespread.
The technical program of WSOM 2011 consisted of 36 oral or poster
presentations by a total of 96 authors that highlighted the key advances in
the area of self-organizing systems and more specifically in SOM research. We
warmly thank all the authors of the contributed papers. We also gratefully
acknowledge the contribution of the plenary speakers. The plenary presentations
were given by Barbara Hammer (University of Bielefeld, Germany) and Teuvo
Kohonen (Academy of Finland and Aalto University, Finland). The event
celebrated the 30th anniversary of the first report in which Kohonen presented the
basic principles of the SOM, and the 10th anniversary of the 3rd edition of his
book Self-Organizing Maps. We also celebrated that the number of SOM-related
scientific papers has reached approximately 10,000.
We warmly thank the highly respected international Steering and Program
Committees whose roles were instrumental for the success of the conference.
The Program Committee members and the reviewers ensured a timely and
thorough evaluation of the papers. We are grateful to the members of the Executive
Committee. In particular, the experience of Olli Simula as the Local Chair and
the skillful efforts of Jaakko Peltonen as the Publicity Chair contributed greatly
toward the success of the event.
WSOM 2011 was co-located with the ICANN 2011 conference. We wish to
thank the organizers of ICANN 2011, especially General Chair Erkki Oja, Local Chair Amaury Lendasse and Finance Chair Francesco Corona. The smooth
collaboration with them facilitated the success of WSOM 2011. Last but not
least, we would like to thank Springer for their co-operation in publishing the
proceedings in the prestigious Lecture Notes in Computer Science series.
The organizers had a chance to welcome the participants to the new but
prestigious Aalto University School of Science. Namely, from the beginning of
2010, the 100-year-old university changed its name and form. Three universities,
Helsinki University of Technology, Helsinki School of Economics, and University
of Art and Design Helsinki, merged into Aalto University which became the
second largest university in Finland.
April 2011
Timo Honkela
Jorma Laaksonen
Organization
WSOM 2011 was held during June 13-15, 2011, organized by the Department
of Computer and Information Science, Aalto University School of Science, and
co-located with the ICANN 2011 conference.
Executive Committee
Honorary Chair
General Chair
Program Chair
Local Chair
Publicity Chair
Steering Committee
Teuvo Kohonen
Marie Cottrell
Pablo Estevez
Timo Honkela
Erkki Oja
José Príncipe
Helge Ritter
Takeshi Yamakawa
Hujun Yin
Program Committee
Guilherme Barreto
Yoonsuck Choe
Jean-Claude Fort
Tetsuo Furukawa
Colin Fyfe
Barbara Hammer
Samuel Kaski
Jorma Laaksonen, Chair
Krista Lagus
Amaury Lendasse
Ping Li
Thomas Martinetz
Risto Miikkulainen
Klaus Obermayer     Technische Universität Berlin, Germany
Jaakko Peltonen     Aalto University, Finland
Marina Resta        University of Genova, Italy
Udo Seiffert        Otto von Guericke University of Magdeburg, Germany
Olli Simula         Aalto University, Finland
Kadim Tasdemir      European Commission Joint Research Centre, Italy
Heizo Tokutaka      SOM Japan Inc., Japan
Carme Torras        Universitat Politècnica de Catalunya, Spain
Alfred Ultsch       Philipps-Universität Marburg, Germany
Marc Van Hulle      Katholieke Universiteit Leuven, Belgium
Michel Verleysen    Université Catholique de Louvain, Belgium
Thomas Villmann     University of Applied Sciences Mittweida, Germany
Lei Xu              The Chinese University of Hong Kong, Hong Kong
Additional Referees
Jaakko Hollmén      Aalto University, Finland
Timo Honkela        Aalto University, Finland
Markus Koskela      Aalto University, Finland
Antonio Neme        Aalto University, Finland
Mats Sjöberg        Aalto University, Finland
Mika Sulkava        Aalto University, Finland
Sami Virpioja       Aalto University, Finland
Table of Contents
Plenaries
Topographic Mapping of Dissimilarity Data
Barbara Hammer, Andrej Gisbrecht, Alexander Hasenfuss,
Bassam Mokbel, Frank-Michael Schleif, and Xibin Zhu
Author Index
Introduction
Electronic data sets are increasing rapidly with respect to size and
dimensionality, such that Kohonen's ingenious self-organizing map (SOM) has lost
none of its attractiveness as an intuitive data inspection tool: it allows humans
to rapidly access large volumes of high dimensional data [20]. Besides its very
simple and intuitive training technique, the SOM offers great flexibility by
providing simultaneous visualization and clustering based on the topographic map
formation. As a consequence, application scenarios range from robotics and
telecommunication up to web- and music-mining; further, the self-organizing map
is a widely used technique in the emerging field of visual analytics because of
its efficient and robust way to deal with large, high-dimensional data sets [19].
The classical SOM and counterparts derived from similar mathematical
objectives such as the generative topographic mapping or neural gas [23,3] have
been proposed to process Euclidean vectors in a fixed feature vector space.
Often, electronic data have a dedicated format which cannot easily be converted
to standard Euclidean feature vectors: biomedical data bases, for example, store
biological sequence data, biological networks, scientific texts, textual experiment
descriptions, functional data such as spectra, data incorporating temporal
dependencies such as EEG, etc. It is not possible to represent such entries by
means of conventional feature vectors without loss of information, many data
being inherently discrete or compositional. Rather, experts access such data by
means of dedicated comparison measures such as BLAST or FASTA for biological
sequences, alignment techniques for biological networks, dynamic time warping
for time series, etc. From an abstract point of view, dissimilarity measures or
kernels which are suited for the pairwise comparison of abstract data types such
as strings, trees, graphs, or functions are used.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 1-15, 2011.
© Springer-Verlag Berlin Heidelberg 2011
B. Hammer et al.
Almost 10 years ago, Kohonen proposed a very intuitive way to extend SOMs to
discrete data characterized by dissimilarities only [21]: instead
of mean prototype positions in a Euclidean vector space, neuron locations are
restricted to data positions. The generalized median serves as a computational
vehicle to adapt such restricted neurons according to given dissimilarity data.
This principle can be extended to alternatives such as neural gas, and it can
be substantiated by a mathematical derivation from a cost function such that
convergence of the technique can be proved [8]. Depending on the characteristics
of the data set, however, the positional restrictions can lead to a much worse
representation of the data as compared to the capabilities of continuous updates
which are possible in a Euclidean vector space.
As an alternative, specific dissimilarity measures can be linked to a nonlinear
kernel mapping. Kernel versions of SOM have been proposed for example in
the contribution [30] for online updates and [4] for batch adaptation; in both
cases, the standard SOM adaptation which takes place in the high-dimensional
feature space is done implicitly based on the kernel. Kernelization of SOM allows
a smooth prototype adaptation in the feature space, but it has the drawback
that it is often not applicable since many classical dissimilarity measures cannot
be linked to a kernel. For such cases, so-called relational approaches offer an
alternative [15]: prototypes are represented implicitly by means of a weighting
scheme, and adaptation takes place based on pairwise dissimilarities of the data
only. This principle has already been used in the context of fuzzy clustering [17];
in the past years, it has been successfully integrated into topographic maps such
as SOM, neural gas, or the generative topographic mapping [15,14].
Both principles, median extensions of SOM and relational versions, have the
drawback of quadratic time complexity due to their dependency on the full
dissimilarity matrix. Since specialized dissimilarities such as alignment for
strings or trees can be quite time consuming to compute, the main computational
bottleneck of the techniques is often the computation of the full
dissimilarity matrix. For this reason, different approximation techniques have
recently been proposed which rely on only a linear subset of the full
dissimilarity matrix and which reduce the computational effort to a linear one.
Two particularly promising techniques are available. On the one hand, the
Nyström approximation can be transferred to dissimilarities as shown in [13].
On the other hand, if a computation of the dissimilarities can be done online,
patch processing offers a very intuitive and easily parallelisable scheme which
can even deal with non i.i.d. data distributions [1]. This way, efficient linear
time processing schemes for topographic mapping of dissimilarity data arise.
In this contribution, we define topographic mapping based on cost functions
first. Afterwards, we introduce two different principles to extend the techniques
to dissimilarity data: median and relational clustering. Both methods can be
substantiated by mathematical counterparts linking them to cost functions and
pseudo-Euclidean space, respectively. We conclude with technologies which allow
speeding up the topographic mapping to linear time complexity.
Topographic Mapping
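Prototype based approaches represent data vectors $x \in \mathbb{R}^n$ by means
of prototypes $w^1, \ldots, w^N \in \mathbb{R}^n$ based on the standard squared
Euclidean distance
$$d(x, w^i) = \|x - w^i\|^2 \qquad (1)$$
$$\frac{1}{2} \sum_{i,j} \chi_i(x^j)\, d(x^j, w^i) \qquad (3)$$
$$\frac{1}{2} \sum_{i,j} \chi_i(x^j) \cdot \sum_k \exp\left(-\mathrm{nd}(i,k)/\sigma^2\right) d(x^j, w^k) \qquad (4)$$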
(5)
(6)
It has been shown in [6] that this procedure converges in a finite number of steps
towards a local optimum of the cost function. The convergence is very fast such
that a good initialization is necessary to avoid topological mismatches as pointed
out in [10]. For this reason, typically, an initialization by means of the two main
principal components takes place, and the neighborhood is annealed carefully
during training.
The GTM. The generative topographic mapping (GTM) can be seen as a
statistical counterpart of SOM which models data by a constrained mixture of
Gaussians [3]. The centers are induced by lattice positions in a low dimensional
latent space and mapped to the feature space by means of a smooth
function, usually a generalized linear regression model. That means, prototypes
are obtained as images of lattice points $v^i$ in a two dimensional space,
$w^i = f(v^i) = \Phi(v^i) \cdot W$, with a matrix $\Phi$ of fixed base functions
such as equally spaced Gaussians in two dimensions and a parameter matrix $W$.
Every prototype induces an isotropic Gaussian probability with variance
$\beta^{-1}$; these are combined in a mixture model using a uniform prior over
the modes. For training, the data log likelihood
$$\sum_j \ln\left(\frac{1}{N} \sum_i \left(\frac{\beta}{2\pi}\right)^{n/2} \exp\left(-\frac{\beta}{2}\, d(x^j, w^i)\right)\right)$$
is optimized by means of an EM approach which leads to linear algebraic
equations to determine the parameters $W$ and $\beta$. Like batch SOM, GTM
requires a good initialization which is typically done by aligning the principal
components of the data with the initial images of the lattice points. The
smoothness of the mapping $f$, i.e. the number of base functions in $\Phi$,
determines the stiffness of the resulting topological mapping. Unlike SOM which
focuses on the quantization error in the limit of small neighborhood size, this
stiffness accounts for a usually better visualization behavior of GTM, see e.g.
Fig. 1. It can clearly be seen that GTM respects the overall shape of the data
manifold while SOM pushes prototypes towards data centers, leading to local
distortions. A better preservation of the manifold shape can also be obtained
using ViSOM instead of SOM [29], although this technique is not substantiated by
a global cost function as GTM is.
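To make the GTM objective concrete, the following is a minimal sketch of how prototypes and the data log likelihood could be evaluated with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian form and `width` of the base functions, and the use of the symbol beta for the inverse variance are choices made here for the example.

```python
import numpy as np

def gtm_prototypes(V, centers, width, W):
    """Map latent lattice points V (N, 2) to prototypes w^i = Phi(v^i) W.

    `centers` (K, 2) and `width` define hypothetical Gaussian base
    functions; W is the (K, n) parameter matrix.
    """
    sq = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-sq / (2.0 * width**2))  # (N, K) matrix of base functions
    return Phi @ W                        # (N, n) prototypes in data space

def gtm_log_likelihood(X, prototypes, beta):
    """Log likelihood of data X (m, n) under a uniform mixture of isotropic
    Gaussians centered at the prototypes (N, n), each with variance 1/beta."""
    n = X.shape[1]
    N = prototypes.shape[0]
    # squared Euclidean distances d(x^j, w^i), shape (m, N)
    d = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    log_comp = 0.5 * n * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * d
    # log-sum-exp over the prototypes for numerical stability
    mx = log_comp.max(axis=1, keepdims=True)
    log_mix = mx[:, 0] + np.log(np.exp(log_comp - mx).sum(axis=1)) - np.log(N)
    return log_mix.sum()
```

An EM procedure would alternate between computing responsibilities from the same per-component terms and re-solving for W and beta; that part is omitted here.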
Median Clustering
Often, data are not given as vectors, rather pairwise dissimilarities
$d_{ij} = d(x^i, x^j)$ of data points $x^i$ and $x^j$ are available. Thereby, the
dissimilarity need not correspond to the Euclidean metric, and it is not clear
whether data $x^i$ can be represented as finite dimensional vectors at all. In
the following, we refer to the dissimilarity matrix with entries $d_{ij}$ as $D$.
We assume that $D$ has zero diagonal and that $D$ is symmetric.
Median SOM. This situation causes problems for classical topographic mapping
since a continuous adaptation of prototypes is no longer possible as in the
Euclidean case. One solution has been proposed in [21]: prototype locations are
restricted to the positions offered by data points, i.e. we enforce
$w^i \in \{x^1, \ldots, x^m\}$. In [21], a very intuitive heuristic for determining
prototype positions in this setting has been proposed based on the generalized
median. As pointed out in [8], it is possible to derive a similar learning rule
from the cost function of SOM (4): as in batch SOM, optimization takes place
iteratively with respect to the assignments of data to prototypes (5) and with
respect to the prototype positions. The latter step does not allow an explicit
algebraic formulation such as (6) because of the restriction of prototype
positions; rather, prototypes are found by exhaustive search optimizing their
contribution to the cost function.
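The exhaustive search over data positions can be sketched as follows. The update equation itself is not reproduced in the text above, so the rule used here is an assumption: it restricts the SOM cost function (4) to data positions and, for each prototype, picks the data point with the smallest neighborhood-weighted sum of dissimilarities. The function name, the array `winners` (index of the winning prototype per data point) and the lattice distance matrix `nd` are illustrative names.

```python
import numpy as np

def median_som_update(D, winners, nd, sigma):
    """One median prototype update (sketch, assumed form).

    D:       (m, m) symmetric dissimilarity matrix of the data
    winners: (m,) index of the winning prototype for each data point
    nd:      (N, N) neighborhood distances between lattice positions
    sigma:   neighborhood range
    Returns an (N,) array: for each prototype, the index of the data
    point that it is placed on.
    """
    # h[j, i] = exp(-nd(I(x^j), i) / sigma^2): influence of x^j on prototype i
    h = np.exp(-nd[winners, :] / sigma**2)   # (m, N)
    # cost[i, l] = sum_j h[j, i] * D[j, l]: cost of placing prototype i on x^l
    cost = h.T @ D                           # (N, m)
    return cost.argmin(axis=1)
```

The matrix product makes the quadratic cost of this step explicit: every prototype is scored against every candidate position using the full dissimilarity matrix.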
Ny RGTM (0.01)   0.88
Ny RGTM (0.1)    0.55
Relational Clustering
As discussed above, the discrete nature of median clustering causes a severe risk
of getting trapped in local optima of the cost function. Hence the question arises
whether a continuous adaptation of prototypes is possible also for general
dissimilarity data. A general approach to extend prototype-based clustering schemes
to general dissimilarities has been proposed in [17] in the context of fuzzy
clustering, and it has recently been extended in [15,14] to batch SOM, batch NG,
and GTM.
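Assume that the dissimilarities $d_{ij}$ stem from unknown data in an unknown
high dimensional feature vector space, i.e. $d_{ij} = \|\Phi(x^i) - \Phi(x^j)\|^2$
for some feature map $\Phi$. Assume that prototypes can be expressed as linear
combinations $w^i = \sum_j \alpha_{ij} \Phi(x^j)$ with $\sum_j \alpha_{ij} = 1$.
Then, distances can be computed implicitly
$$d(w^i, x^j) = [D \alpha_i]_j - \frac{1}{2}\, \alpha_i^t D \alpha_i \qquad (8)$$
where $\alpha_i = (\alpha_{i1}, \ldots, \alpha_{im})$ denotes the coefficient vector
of prototype $w^i$.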
It has been shown in [15] that this equation also holds if an arbitrary symmetric
bilinear form induces dissimilarities in the feature space rather than the squared
Euclidean distance.
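As a quick check of equation (8), the sketch below computes prototype-to-data distances from the dissimilarity matrix and coefficient vectors alone. It is a minimal illustration, assuming the coefficients of each prototype are stored as a row of a matrix `alpha` whose rows sum to one; the function name is hypothetical.

```python
import numpy as np

def relational_distances(D, alpha):
    """Distances d(w^i, x^j) = [D alpha_i]_j - 0.5 * alpha_i^t D alpha_i.

    D:     (m, m) symmetric dissimilarity matrix
    alpha: (N, m) coefficient vectors, one row per prototype, rows sum to 1
    Returns an (N, m) array of prototype-to-data distances.
    """
    cross = alpha @ D                              # row i holds [D alpha_i]_j (D symmetric)
    self_term = 0.5 * (cross * alpha).sum(axis=1)  # 0.5 * alpha_i^t D alpha_i
    return cross - self_term[:, None]
```

For squared Euclidean dissimilarities, the result agrees with the explicit distances between the data and the prototypes formed as convex or affine combinations of the data, which is a convenient sanity test before applying the formula to non-Euclidean dissimilarities.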
Relational SOM. This observation offers a way to directly transfer batch SOM
and batch NG to a general symmetric dissimilarity matrix D. As explained e.g.
in [15], there always exists a vector space together with a symmetric bilinear
form which gives rise to the given dissimilarity matrix. This vector space need
not be Euclidean since some eigenvalues associated with the bilinear form might
be negative or zero. Commonly, this is referred to as pseudo-Euclidean space
where the eigenvectors associated with negative eigenvalues serve as a correction
to the otherwise Euclidean space. Batch NG or SOM can be applied directly in
this vector space, and using (8), it can be applied implicitly
without knowing the embedding, because of two key ingredients:
1. an implicit representation of prototype $w^i$ in terms of coefficient vectors $\alpha_i$,
2. Equation (8) to compute the distance between a data point and a
prototype.
(9)
(10)
(11)
Fig. 1. Visualization of the protein data set incorporating 226 proteins in 5 classes
using RSOM (left) and RGTM (right); labels are determined by posterior labeling
according to majority vote of the receptive fields
Efficient Approximations
Both median and relational clustering suffer from a quadratic time complexity
as compared to linear complexity for their vectorial counterparts. In addition,
information loss. The method turns out to be rather robust with respect to the
choice of the approximation quality k and the patch size. Further, it can deal
with data which are not accessible in an i.i.d. fashion.
Nyström approximation. As an alternative, the Nyström approximation has
been introduced as a standard method to approximate a kernel matrix in [28].
It can be transferred to dissimilarities as presented in [13]. The basic principle
is to pick M representative landmarks in the data set which give rise to the
rectangular sub-matrix $D_{M,m}$ of dissimilarities of data points and landmarks.
This matrix is of linear size, assuming M is fixed. It can be shown (see e.g. [13])
that the full matrix can be approximated in an optimum way in the form
$$D \approx D_{M,m}^t\, D_{M,M}^{-1}\, D_{M,m} \qquad (12)$$
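The approximation (12) can be illustrated with a few lines of NumPy. This is a sketch under assumptions: $D_{M,M}$ is taken to be the landmark-by-landmark block of $D_{M,m}$, and a pseudo-inverse with an explicit cutoff replaces the plain inverse because the landmark block is typically rank deficient. The function and parameter names are hypothetical.

```python
import numpy as np

def nystroem_dissimilarities(D_Mm, landmarks, rcond=1e-10):
    """Nystroem-style reconstruction of a full dissimilarity matrix.

    D_Mm:      (M, m) dissimilarities between the M landmarks and all m points
    landmarks: (M,) indices of the landmarks among the m data points
    rcond:     relative cutoff for small singular values in the pseudo-inverse
    Returns the (m, m) approximation D_Mm^t pinv(D_MM) D_Mm.
    """
    D_MM = D_Mm[:, landmarks]  # (M, M) landmark-by-landmark block
    return D_Mm.T @ np.linalg.pinv(D_MM, rcond=rcond) @ D_Mm
```

The full matrix is formed here only for illustration. In a linear-time scheme one would keep the factors and evaluate products such as $D \alpha_i$ from right to left, so that no m-by-m matrix is ever stored.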
1: g protein receptor
2: protein kinase st
3: globin
4: homeobox
5: ef hand
6: abc transporter
7: cytochrome p450
8: zinc finger c2h2
9: rubisco large
10: ig mhc
11: cytochrome b
12: cytochrome c
13: efactor gtp
14: protein kinase tyr
15: adh short
16: atpase alpha beta
17: actins
18: 4fe4s ferredoxin
19: tubulin
Fig. 2. Around 10000 protein sequences compared by pairwise alignments are depicted
on a RGTM trained with the Nyström approximation and 100 landmarks. Posterior
labeling displays 19 out of the 32 classes defined by Prosite for this data set in a
topology preserving manner.
Conclusions
Acknowledgment
This work was supported by the German Science Foundation (DFG) under
grant number HA-2719/4-1. Further, financial support from the Cluster of
Excellence 277 Cognitive Interaction Technology funded in the framework of the
German Excellence Initiative is gratefully acknowledged.
References
1. Alex, N., Hasenfuss, A., Hammer, B.: Patch clustering for massive data sets. Neurocomputing 72(7-9), 1455-1469 (2009)
2. Bishop, C.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2007)
3. Bishop, C.M., Williams, C.K.I.: GTM: The generative topographic mapping. Neural Computation 10, 215-234 (1998)
4. Boulet, R., Jouve, B., Rossi, F., Villa-Vialaneix, N.: Batch kernel SOM and related Laplacian methods for social network analysis. Neurocomputing 71(7-9), 1257-1273 (2008)
5. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.-C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S., Schneider, M.: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research 31, 365-370 (2003)
6. Bottou, L., Bengio, Y.: Convergence properties of the k-means algorithm. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.) NIPS 1994, pp. 585-592. MIT, Cambridge (1995)
7. Cottrell, M., Fort, J.C., Pagès, G.: Theoretical aspects of the SOM algorithm. Neurocomputing 21, 119-138 (1999)
8. Cottrell, M., Hammer, B., Hasenfuss, A., Villmann, T.: Batch and median neural gas. Neural Networks 19, 762-771 (2006)
9. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley & Sons, New York (2001)
10. Fort, J.-C., Letremy, P., Cottrell, M.: Advantages and drawbacks of the Batch Kohonen algorithm. In: Verleysen, M. (ed.) ESANN 2002, D Facto, pp. 223-230 (2002)
11. Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R.D., Bairoch, A.: ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 31, 3784-3788 (2003)
12. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315, 972-976 (2007)
13. Gisbrecht, A., Mokbel, B., Hammer, B.: The Nyström approximation for relational generative topographic mappings. In: NIPS Workshop on Challenges of Data Visualization (2010)
14. Gisbrecht, A., Mokbel, B., Hammer, B.: Relational generative topographic map. Neurocomputing 74, 1359-1371 (2011)
15. Hammer, B., Hasenfuss, A.: Topographic mapping of large dissimilarity datasets. Neural Computation 22(9), 2229-2284 (2010)
16. Hammer, B., Hasenfuss, A., Rossi, F.: Median topographic maps for biological data sets. In: Biehl, M., Hammer, B., Verleysen, M., Villmann, T. (eds.) Similarity-Based Clustering. LNCS, vol. 5400, pp. 92-117. Springer, Heidelberg (2009)
17. Hathaway, R.J., Bezdek, J.C.: Nerf c-means: Non-Euclidean relational fuzzy clustering. Pattern Recognition 27(3), 429-437 (1994)
18. Heskes, T.: Self-organizing maps, vector quantization, and mixture modeling. IEEE Transactions on Neural Networks 12, 1299-1305 (2001)
19. Keim, D.A., Mansmann, F., Schneidewind, J., Thomas, J., Ziegler, H.: Visual analytics: Scope and challenges. In: Simoff, S., Boehlen, M.H., Mazeika, A. (eds.) Visual Data Mining: Theory, Techniques and Tools for Visual Analytics. LNCS, Springer, Heidelberg (2008)
20. Kohonen, T. (ed.): Self-Organizing Maps, 3rd edn. Springer, New York (2001)
21. Kohonen, T., Somervuo, P.: How to make large self-organizing maps for nonvectorial data. Neural Networks 15, 945-952 (2002)
22. Lundsteen, C., J-Phillip, Granum, E.: Quantitative analysis of 6985 digitized trypsin g-banded human metaphase chromosomes. Clinical Genetics 18, 355-370 (1980)
23. Martinetz, T., Berkovich, S., Schulten, K.: Neural-gas Network for Vector Quantization and its Application to Time-Series Prediction. IEEE Transactions on Neural Networks 4(4), 558-569 (1993)
24. Mevissen, H., Vingron, M.: Quantifying the local reliability of a sequence alignment. Protein Engineering 9, 127-132 (1996)
25. Neuhaus, M., Bunke, H.: Edit distance based kernel functions for structural pattern classification. Pattern Recognition 39(10), 1852-1863 (2006)
26. Ontrup, J., Ritter, H.: Hyperbolic self-organizing maps for semantic navigation. In: Dietterich, T., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems, vol. 14, pp. 1417-1424. MIT Press, Cambridge (2001)
27. Pardalos, P.M., Vavasis, S.A.: Quadratic programming with one negative eigenvalue is NP hard. Journal of Global Optimization 1, 15-22 (1991)
28. Williams, C., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems, vol. 13, pp. 682-688. MIT Press, Cambridge (2001)
29. Yin, H.: ViSOM - A novel method for multivariate data projection and structure visualisation. IEEE Trans. on Neural Networks 13(1), 237-243 (2002)
30. Yin, H.: On the equivalence between kernel self-organising maps and self-organising mixture density networks. Neural Networks 19(6-7), 780-784 (2006)
31. Zhu, X., Hammer, B.: Patch affinity propagation. In: European Symposium on Artificial Neural Networks (to appear, 2011)
The Method
The contextual SOMs [1], [2], [3] are used to represent relationships between
local contexts (groups of contiguous words) in text corpora, believed to reflect
semantic properties of the words. A local context relates to and is labeled by
its central word, called the target word. It has been found earlier that the SOM
can be used to map words linguistically in such a way that the target words of
different word classes are mapped into separate areas on the SOM on the basis
of the local contexts in which they occur.
The local context around a particular target word can be defined in different
ways. In early works it was made to consist of the target word itself, indexed
by its position i in the corpus, and of the preceding and the subsequent word
to it, respectively. In this work, in order to take more contextual information
into account, the contexts were defined to consist of five successive words. In
computation they were represented by the coding vectors $r_{i-2}$, $r_{i-1}$,
$r_i$, $r_{i+1}$, and $r_{i+2}$, respectively.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 16-29, 2011.
In order to minimize the effect of the word forms on the context structures,
and to concentrate on the pure word patterns, i.e., combinations of the words,
without paying attention to the writing forms, one ought to select representations
for the words that are mutually as uncorrelated as possible. To that end, the
coding vectors can be defined, e.g., as high-dimensional Euclidean vectors with
normally distributed random elements. A typical dimensionality of these vectors
is on the order of a few hundred. In this way, the representation vectors of
any two words are approximately orthogonal and can be regarded as almost
uncorrelated.
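The near-orthogonality of random code vectors is easy to check numerically. The following sketch draws normally distributed vectors and measures the mean absolute cosine between distinct vectors; the function names, the seed and the vocabulary size used in any call are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def random_codes(num_words, dim, seed=0):
    """One normally distributed random code vector per word, shape (num_words, dim)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_words, dim))

def mean_abs_cosine(codes):
    """Mean absolute cosine similarity between all pairs of distinct code vectors."""
    unit = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    cos = unit @ unit.T
    off_diagonal = cos[~np.eye(len(codes), dtype=bool)]
    return np.abs(off_diagonal).mean()
```

Comparing, say, `mean_abs_cosine(random_codes(50, 10))` with `mean_abs_cosine(random_codes(50, 200))` shows the typical cosine shrinking as the dimensionality grows, roughly in proportion to one over the square root of the dimension.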
For a statistical analysis, the so-called averaged contextual feature x(w) for
each unique word w in the text corpus can be defined as the vector
$$x(w) = \mathrm{avg}_{i(w)} \left( [\, r_{i-2} \;\; r_{i-1} \;\; r_i \;\; r_{i+1} \;\; r_{i+2} \,] \right), \qquad (1)$$
where $\mathrm{avg}_{i(w)}$ means the average over all positions i in the text
corpus, on the condition that the contextual feature relating to position i
belongs to word w (i.e., on the condition that $r_i$ is the random-vector
representation of word w).
When constructing a contextual SOM, the averaged contextual feature vectors
are used as the training data.
Differing from previous approaches, in which individual test words were
mapped onto the SOM, in this work histograms of various test word classes
or otherwise defined subsets of words will be formed over the SOM array. Thus,
the testing of the SOM, i.e., the mapping of selected subsets of target words
onto the SOM is carried out using similarly defined averaged feature vectors as
input vectors, but averaging the input vectors only over the words w of a
particular subset (such as general adjectives) or words that occur in the corpus
only a specified number of times.
One particular problem encountered in this simple context analysis is that the
words in most languages are inflected, and in languages such as Latin, Japanese,
Hungarian, Finnish, etc., the linguistic roles of the words are also indicated by
many kinds of endings. One simple method is to treat every word form as a
different word. Another method would be to reduce each word to its base form
or word stem, whereby, however, some semantic information is lost.
Nonetheless there also exist languages such as Chinese, where the words are
not inflected at all, and which would then be ideal for the context experiments.
Since nowadays there are available large Chinese text corpora that are provided
with linguistic analysis of the words used in them, it was possible to construct
the contextual SOMs automatically on the basis of this information only [4], [5].
The text corpus used in this work, called the MCRC (Modern Chinese Research
Corpus) [6], is an electronic collection of text material from newspapers,
novels, magazines, TV shows, folktales, and other text material from modern
Chinese media. In our experiment it contained 1,524,121 words provided with
classification of the words into 114 classes (of which 88 were real linguistic classes,
while the rest consisted of punctuation marks and nonlinguistic symbols). This
corpus was prepared by one of the authors (Hongbing Xing).
A further remark is due. In order to utilize the information of the contexts
maximally, only pure contexts (which did not contain any punctuation marks or
nonlinguistic symbols) were accepted for these experiments. In this way, however,
a substantial portion of the text corpus was left out of the experiments. Notice
that if the target word has a distance of less than five words from these specific
symbols, the five-word contexts could not be formed. Nonetheless, the original
corpus was so large that the remaining amount of text (488,878 words) was still
believed to produce statistically significant results.
The size of the original lexicon used in this work was 48,191. The number of
words actually used, restricting to pure contexts only, was 27,090.
The patches of the SOM array were selected as hexagonal, and the size of the
array was relatively small, 40 by 50, in order to save memory but still to be able
to discern the cluster structures on it.
In earlier works the dimensionalities of all of the random-code vectors of the
words were always taken as equal. A new idea in this work was to select the
dimensionality as a function of the relative position of the word within the local
context. The dimensionality of the target vector $r_i$ was selected as 50. The
dimensionalities of $r_{i-1}$ and $r_{i+1}$ were taken equal to 200, and those of $r_{i-2}$ and
$r_{i+2}$ equal to 100, respectively. In this way, the different words within the context
have different statistical weights in the matching of the input vector with the
SOM weight vectors. The above dimensionalities were chosen on the basis of many
experiments to roughly maximize the clustering accuracy, under
the restriction that the total dimensionality of the feature vectors x(w) could
still fit the Matlab programs, especially the SOM Toolbox used in the
computations.
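As a quick check of the figures above, the total length of x(w) follows directly from the five dimensionalities. The relative shares in the second line are only a rough illustration under an assumption not stated in the text, namely that all components of the random codes are independent with equal variance, so that the expected squared norm of each block is proportional to its length.

```latex
\begin{align*}
\dim x(w) &= \underbrace{50}_{r_i} + \underbrace{2 \cdot 200}_{r_{i-1},\, r_{i+1}}
           + \underbrace{2 \cdot 100}_{r_{i-2},\, r_{i+2}} = 650, \\
\frac{200}{650} &\approx 0.31, \qquad \frac{100}{650} \approx 0.15, \qquad \frac{50}{650} \approx 0.08 .
\end{align*}
```

Under that assumption, each immediate neighbour carries roughly four times the weight of the target word itself in the matching.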
In order to write a great number of variable scripts for this rather large experiment, the Matlab SOM Toolbox [7] was used. However, the calibration of the
numerous SOMs was based on dot-product matching, for which both the source
data and the SOM vectors (prototypes) were normalized.
The batch training procedure of the Matlab SOM Toolbox was applied. The
neighborhood function used in it was Gaussian. First, a coarse training phase,
consisting of 20 training cycles, was used. During it, the effective radius parameter
of the neighborhood function decreased linearly from 6 to 0.5. The topological
ordering occurred during this phase. After it, a fine tuning phase was applied.
During it, the effective neighborhood radius stayed constant, equal to 0.5. It has
been shown recently [8] that if this kind of learning with constant neighborhood
is used, the batch training algorithm will usually converge in a finite number of
cycles. In the present application this kind of fine tuning was continued until
no changes in the map vectors took place in further training. This occurred in
about 70 cycles.
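The two-phase schedule described above can be sketched as follows. This is not the SOM Toolbox implementation: the function names, the one-dimensional chain of nodes (the paper uses a two-dimensional hexagonal array), the Euclidean winner search, and the toy data are assumptions made to keep the sketch short.

```python
import math


def coarse_radius(cycle, n_cycles=20, start=6.0, end=0.5):
    """Radius for coarse-phase cycle 0 .. n_cycles-1, decreasing linearly."""
    if n_cycles == 1:
        return end
    return start + (end - start) * cycle / (n_cycles - 1)


def gaussian_neighborhood(grid_distance, radius):
    return math.exp(-grid_distance ** 2 / (2.0 * radius ** 2))


def batch_step(data, weights, radius):
    """One batch cycle: every node becomes the neighborhood-weighted mean of the data."""
    q, n = len(weights), len(weights[0])
    # Winner (best-matching node) of each sample, by squared Euclidean distance.
    winners = [
        min(range(q), key=lambda i: sum((x[k] - weights[i][k]) ** 2 for k in range(n)))
        for x in data
    ]
    updated = []
    for i in range(q):
        numerator, denominator = [0.0] * n, 0.0
        for x, b in zip(data, winners):
            h = gaussian_neighborhood(abs(i - b), radius)
            denominator += h
            for k in range(n):
                numerator[k] += h * x[k]
        # Keep the old vector if the neighborhood weights underflow to zero.
        updated.append([v / denominator for v in numerator] if denominator > 0 else list(weights[i]))
    return updated


def train(data, weights, coarse_cycles=20, max_fine_cycles=1000):
    for c in range(coarse_cycles):
        weights = batch_step(data, weights, coarse_radius(c, coarse_cycles))
    # Fine tuning with constant radius 0.5 until the map vectors stop changing.
    for cycle in range(1, max_fine_cycles + 1):
        updated = batch_step(data, weights, 0.5)
        if updated == weights:
            return updated, cycle
        weights = updated
    return weights, max_fine_cycles


data = [[0.0], [0.1], [0.9], [1.0]]      # toy one-dimensional data (assumed)
initial = [[0.2], [0.5], [0.8]]          # three nodes on a chain
final, fine_cycles = train(data, initial)
```

Because a batch update depends only on the data and on the current winner assignments, the map vectors repeat exactly once the assignments stop changing, which is why the exact-equality stopping test is usable here.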
In Fig. 1 we have the histograms of four main linguistic classes of the words in the
MCRC corpus (restricting to local contexts that do not contain any punctuation
marks or other nonlinguistic symbols).
Some clusters may look rather broad, but one has to realize three facts: 1.
Due to competitive learning, the SOM is always trying to utilize the whole array
optimally, so any cluster that contains a large number of elements will look wide.
2. As will be seen later, the diffuse zones usually consist of much narrower partial
clusters. 3. It will further be shown in this work that the clusters of some word
classes also depend on the frequencies of the words they contain. If one selected
for the vocabulary only words that, e.g., exceed a certain frequency limit,
one would obtain much sharper clusters.
Fig. 1. Histograms of all adjectives, nouns, verbs, and adverbs in the MCRC
In Fig. 2 and Fig. 3 we show two smaller clusters of specific word classes
that are located close to the area of the adjectives. Consider first the cluster of
attributive pronouns. If the words, as believed, are clustered in the contextual
SOM according to their role as sentence constituents, then the adjectives that
coincide with the attributive pronouns obviously represent attributive adjectives.
On the other hand, the adjectives that coincide with the cluster of the adverbial
idioms apparently have an adverbial nature, respectively.
A new effect found in this study is that when the histograms are formed using
words restricted to certain intervals of word frequencies, they will depend on the
frequency and be more compact.
3.1
General Adjectives
3.2
General Nouns
The class of the general nouns, differing from the class of all nouns shown in
Fig. 1, does not contain any names of persons or places, or nouns of time.
The effect of word frequency on the general nouns is even more surprising than
that on the general adjectives. From Fig. 5 we see that the general nouns that
occur with the lowest frequencies (1 to 10), and whose number in the corpus is
also the highest, have a very broad distribution. Compared with Fig. 1, however,
the differences are not very large in this range. On the other hand, in the range
of 10 to 100 of word frequencies, the centroid of the histogram has already moved
to the right and upwards. In the range of 100 to 1000 of word frequencies, most
of the nouns are clustered into three very compact subsets close to the upper
right corner, and one compact cluster at the bottom. In this range of frequencies
the nouns may have only a few definite semantic roles, whereas the roles of the
rarer nouns are vaguer. The fourth histograms in Fig. 5 and in Fig. 4
contain so few words that it is difficult to draw any conclusions from them.
In all of the above partial diagrams of general nouns, there is a salient empty
oval region in the middle, where the verbs, according to Fig. 1, are located.
3.3
Verbs
Verbs without objects. In many languages, this category of verbs is called the intransitive verbs, while in some other languages (like Chinese and French) the term
verbs without objects is used. In Fig. 6, the least frequently (1 to 10 times) used
verbs have a fuzzy cluster on the top-left. This cluster coincides with that of the
predicative idioms (not shown in this publication but in [4]), and so this cluster
of verbs is believed to represent verbs used as the so-called center of predicates.
T. Kohonen and H. Xing
The other verbs without objects are clustered mainly in the middle, where the
nouns have an empty space.
Verbs followed by nouns. This category would be called the transitive verbs in
some other languages. The noun subsequent to the verb forms a very close
context with the former, and so one might expect that this correlation should also
be reflected in the contextual SOM. Indeed, almost independent of the word
frequency, the histograms in Fig. 7 are clustered into the middle of the empty space
in the distribution of the nouns. It seems that the locations of these clusters have
been automatically fitted to the locations of the surrounding nouns.
The correlation coefficient of the histogram of the verbs without objects and
that of the verbs followed by nouns is 0.2514, indicating that their linguistic
roles are different.
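A correlation coefficient between two histograms formed over the same SOM array can be computed by treating each histogram as a vector of node counts. The text does not say which coefficient was used; the sketch below assumes the ordinary Pearson coefficient, and the function name and the 2 by 2 toy histograms are illustrative only.

```python
import math


def histogram_correlation(hist_a, hist_b):
    """Pearson correlation of two histograms given as equally sized 2-D lists."""
    a = [v for row in hist_a for v in row]
    b = [v for row in hist_b for v in row]
    if len(a) != len(b):
        raise ValueError("histograms must cover the same SOM array")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


same_shape = histogram_correlation([[0, 2], [4, 6]], [[1, 2], [3, 4]])
mirrored = histogram_correlation([[0, 2], [4, 6]], [[6, 4], [2, 0]])
```

Values near 1 mean that the two word subsets occupy the same nodes in similar proportions; a low value such as the 0.2514 reported above means that the two histograms overlap only weakly.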
3.4
General Adverbs
The general adverbs are mainly located in the histograms along the border
between the nouns and the verbs (Fig. 8). They have only relatively few clusters
in fixed places, showing that there are only a few main types of adverbs, and their
contexts depend very little on word frequency.
3.5
Other Classes
As mentioned earlier, there were 88 linguistic classes into which the words of the
MCRC were divided. Histograms of some of them can be found in [4] and [5].
The numerals are clustered very compactly around the lower left corner of
the SOM, and this cluster does not depend on word frequency.
The conjunctions, on the other hand, have histograms scattered randomly
over the area of the SOM, showing that they do not correlate with the text.
The pronouns are projected into areas occupied by the other word classes. In
Fig. 2 we saw the mapping of the attributive pronouns. The pronouns used as
subjects or objects have histograms similar to those of the nouns, confirming that
it is the role of the words as sentence constituents rather than their linguistic
class that is reflected in the contextual SOMs.
The verbs, in general, are clustered into a round area in the middle of the
SOM. An exceptional subset of verbs is formed by those 1035 verbs used as
the core of a noun phrase. Independent of their frequency, they are mapped
compactly into the lower right corner of the SOM, indicating that these verbs
occur so tightly with the other words in the noun phrases that the latter words
determine the location of the cluster on the SOM.
Discussion
In addition to being among the first works in which contextual SOMs have
been constructed for the Chinese language, this study contains two new results.
First, it has been found that the target words, on the basis of their local contexts,
are not only clustered according to the main linguistic classes. It seems that the
role of the words as sentence constituents defines their location more closely in
the contextual SOM. Second, the histograms have also been found to depend on
the frequencies of the words selected for testing. In some cases, e.g., for nouns and
adjectives (the histograms of which were the most diffuse ones) this dependence
is strong, whereas for some other word classes it is much weaker.
One simple explanation of the frequency dependence that comes to mind
is that the MCRC corpus used in this work is very heterogeneous. It contains
texts from very different areas written by different people. The vocabularies of
the different parts, especially the sets of nouns and adjectives used in them,
probably have very different word frequencies. Conversely, when the word frequencies
during testing are restricted to certain intervals, these words correlate most closely
with certain parts of the corpus, and thus with the specific topic and writer. It
would be very interesting to compare the present results with those produced by
one author only and dealing with a well-defined topic area, preferably written
in a traditional style.
On the other hand, it is also conceivable that the contexts in which especially
the nouns and the adjectives are used have transformed with time, and frequent
usage accelerates this transformation. One fact that supports this assumption is
that a histogram as a function of word frequency often changes gradually in the
same direction (cf. Figs. 4 and 5).
In the contextual SOM, the selection of the random-vector representations for
the words may have an effect on the exact form of the SOM, due to statistical
variations in the matching of the random vectors. These statistical variations
could be eliminated for the most part if one were able to use representation
vectors with extremely high dimensionalities, for which supercomputers would
be needed.
The main message of the present work is that the word frequencies
probably have an important role in all of the contextual-SOM experiments and
should be taken into account when picking words from the lexica for testing.
The two main conclusions derivable from the present work are thus: 1.
If one wants to produce contextual maps in which the word classes are well
segregated, one may select a vocabulary that contains only the most regular words,
i.e., only words that have their frequency above a certain limit, and discard the
very rare words, as has usually also been done in previous experiments. 2. If
all of the occurring words are taken into account, however, one is able to see
intriguing transformations of the word classes as a function of the frequency of
usage of the words, as demonstrated in this work.
Acknowledgements. This work has been carried out under the auspices of the
Academy of Finland, the Aalto University of Finland, and the Beijing Language
and Culture University of China. Special thanks are due to Drs. Zhirong Yang
and Timo Honkela of Aalto University for their help in the decoding of the
Unicode files used to transfer the Chinese text corpus. Dr. Xiaowei Zhao of
Colgate University, U.S.A., has provided an excellent translation of the linguistic
analysis of the word list of the MCRC corpus. Dr. Ping Li of the Pennsylvania
State University, U.S.A., has created the contacts between Aalto University and
Beijing Language and Culture University, and followed this work with keen
interest. Dr. Jorma Laaksonen has been very helpful at various editing phases of
this article.
References
1. Ritter, H., Kohonen, T.: Self-organizing semantic maps. Biol. Cyb. 61, 241–254
(1989)
2. Honkela, T., Pulkki, V., Kohonen, T.: Contextual relations of words in Grimm
tales, analyzed by self-organizing maps. In: Fogelman-Soulie, F., Gallinari, P. (eds.)
Proc. ICANN 1995, Int. Conf. on Artificial Neural Networks, vol. II, pp. 3–7. EC2,
Nanterre, France (1995)
3. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Heidelberg (2001)
4. Kohonen, T.: Contextually Self-Organized Maps of Chinese Words. TKK Reports
in Information and Computer Science, TKK-ICS-R30. Aalto University School of
Science and Technology, Espoo, Finland (2010) (This report is downloadable from
ics.tkk.fi/en/research/publications)
5. Kohonen, T.: Contextually Self-Organized Maps of Chinese Words. Part II, TKK
Reports in Information and Computer Science, TKK-ICS-R35. Aalto University
School of Science and Technology, Espoo, Finland (2010) (This report is
downloadable from ics.tkk.fi/en/research/publications)
6. Sun, H.L., Sun, D.J., Huang, J.P., Li, D.J., Xing, H.B.: Corpus for modern
Chinese research. In: Luo, Z.S., Yuan, Y.L. (eds.) Studies in the Chinese language and
characters in the era of computers, pp. 283–294. Tsinghua University Press, Beijing,
China (1996)
7. Vesanto, J., Alhoniemi, E., Himberg, J., Kiviluoto, K., Parviainen, J.:
Self-organizing map for data mining in Matlab: the SOM Toolbox. Simulation News
Europe 25, 54 (1999) (The SOM Toolbox software package is downloadable from
ics.tkk.fi/en/research/software)
8. Kohonen, T., Nieminen, I.T., Honkela, T.: On the quantization error in SOM vs.
VQ: A critical and systematic study. In: Príncipe, J.C., Miikkulainen, R. (eds.)
WSOM 2009. LNCS, vol. 5629, pp. 133–144. Springer, Heidelberg (2009)
Abstract. We explored the use of the Self Organizing Map (SOM) to address
the problem of efficiency measurement in the case of health care
providers. To do this, we used as input the data from the balance sheets
of 300 health care providers, as resulting from the Italian Statistics
Institute (ISTAT) database, and we examined their representation obtained
both by running the classical SOM algorithm, and by modifying it through
the replacement of the standard Euclidean distance with the generalized
Minkowski metrics. Finally, we have shown how the results may be
employed to perform graph mining on data. In this way, we were able to
discover intrinsic relationships among health care providers that, in our
opinion, can help stakeholders to improve the quality of the health
care service. Our results seem to contribute to the existing literature in at
least two ways: (a) using SOM to analyze data of health care providers is
completely new; (b) SOM graph mining shows, in turn, elements of
innovation in the way the adjacency matrix is formed, with the connections
among SOM winner nodes used as the starting point of the process.
Keywords: SOM, Network Representation, Efficiency, Health Care
Providers.
Introduction
In an ideal world the health system should be effective, and it should be efficient,
i.e. it should be able to achieve the specified outcomes in a way that maximises
access, outputs and outcomes within the available resources. In the real world,
however, this does not happen. To give an example, looking at the situation
of Italy, health care expenditure has a crucial impact on the financial
resources of the country; nevertheless our health care system is less efficient
than others, and it is not easy to explain why. In particular, the basic difficulty
is to find a common platform to compare the efficiency of health systems, because of
their intrinsic complexity, and of a certain ambiguity about what efficiency itself
consists of and how to measure it.
As far as the complexity of health systems is concerned, a fairly recent study
by the Australian National Health and Hospitals Reform Commission [10] found
little relation between efficiency and the level of health spending, thus suggesting
that, rather than increasing expenditures, regulatory efforts should be directed
at a different allocation of the existing resources.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 30–39, 2011.
© Springer-Verlag Berlin Heidelberg 2011
With respect to the ambiguity of the definition of efficiency and to the way to
measure it, there are at least two issues we can point to. The first one relates to
the method used to generate efficiency scores. The most commonly used techniques
include Data Envelopment Analysis (DEA) and Stochastic Frontier Analysis
(SFA) [9]; there is a huge literature devoted to comparing them, and to evaluating
their statistical properties [5], [13]. The second major area of research uses either
DFA or SFA to examine efficiency in a single area of health care production: [4]
focused on hospitals, [11] on the pharmaceutical industry, and [3] on long term care,
just to cite some.
Within such a framework, the main contributions of this work may be briefly
summarized as follows:
- We focused on the case of Italy and, being aware of the country's need
  to control health care costs, we examined the balance sheet data of 300
  health care providers that receive total or partial public funding.
- We ran our analysis using Self Organizing Maps (SOMs): since, to the best
  of our knowledge, this is a first-time application in the health care sector, we
  mainly directed our efforts to the application of the classic SOM algorithm.
  As the only concession to more sophisticated analysis, due to the high
  dimensionality of the dataset, we explored the convenience of training the SOM with
  similarity measures other than the Euclidean one, such metrics being chosen
  among Minkowski norms [17]:
  $$\|X\|_p = \left( \sum_i |X_i|^p \right)^{1/p}, \quad \text{for } p \in \mathbb{R}^+. \tag{1}$$
  We then analyzed how the clustering capabilities of the SOM are modified
  when both prenorms (0 < p < 1) and ultrametrics (p >> 1) are considered.
- As a final step, we selected the best performing SOM, and we analyzed the
  connections among the winner nodes, thus obtaining an adjacency matrix
  that was the starting point for a network representation of health care
  providers. We used it to retrieve more information about their efficiency,
  and to suggest some economic interpretations of the results.
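The norm family in (1) is straightforward to evaluate for any real p > 0. The sketch below is illustrative only (the function name `minkowski_norm` and the sample vectors are assumptions); it also shows why values 0 < p < 1 give only a prenorm, since the triangle inequality can fail there.

```python
import math


def minkowski_norm(x, p):
    """(sum_i |x_i|^p)^(1/p) for a real p > 0."""
    if p <= 0:
        raise ValueError("p must be strictly positive")
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


euclidean = minkowski_norm([3.0, 4.0], 2)       # ordinary Euclidean length
manhattan = minkowski_norm([3.0, 4.0], 1)       # sum of absolute values
fractional = minkowski_norm([3.0, 4.0], 0.5)    # (sqrt(3) + sqrt(4))^2

# For p = 0.5 the norm of a sum can exceed the sum of the norms.
lhs = minkowski_norm([1.0, 1.0], 0.5)
rhs = minkowski_norm([1.0, 0.0], 0.5) + minkowski_norm([0.0, 1.0], 0.5)
```

Here `lhs` is larger than `rhs`, so the triangle inequality is violated for p = 0.5.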
This paper is therefore organized as follows: Section 2 briefly describes the
problem we focused on, and the data we have considered in the study; Section
3 provides a glimpse of the literature that deals with the alternatives to the standard
Euclidean metric, and then illustrates the changes made to the SOM algorithm
to use it with norms derived from (1). Section 4 illustrates the details of our
simulations, and discusses the results we have obtained. Section 5 concludes.
The Italian health system assumes that health services can be provided both by
public and private structures, the former essentially totally funded.
M. Resta
Here the term public identifies two kinds of structures: Aziende Sanitarie
Locali (ASL) and Aziende Ospedaliere (AO). The main difference between the
two enterprises lies in the fact that while AO are generally single structures
(namely: hospitals), ASL, on the other hand, are more composite, since, by
definition of law, they can include more than one local (regional, district, municipal)
unit that provides health care to citizens.
According to the most recent reform statements of the public health sector,
ASL and AO are required to act like autonomous units to control their financial
flows. This means that:
(i) Each unit of the system should exhibit capabilities concerning the
management of economic and financial flows.
(ii) The efficiency of each unit does not only depend on factors of technical type
(such as quality of the provided health service, innovation, satisfaction of
the final consumer), but also on more strictly financial factors.
(iii) The capability of the whole system to maintain satisfying levels of solvency
and efficiency depends, in turn, on those of every component of the system
(ASL and AO), and on their capability to interact with one another.
The efficiency of the system therefore becomes something that includes, in a
broad sense, the bare management of financial variables: for this reason we have
analyzed the balance sheets of 300 public enterprises (ASL and AO), as resulting
from the most recent Italian Statistics Institute (ISTAT¹) database. The goal was
to retain information that might help to monitor the actual level of efficiency
of the National Health System, and, eventually, to find some suggestions to
improve it.
The data under examination were arranged into two different aggregation
levels: regional, and by single unit. Since Italy is organized into twenty regional
districts (as resulting from Table 1), we managed twenty files, and within each of
them, a variable number of financial statements of public health care providers
operating in the region itself.
Every unit is identified by a string code whose first part is the region ID, and
the second part is a number varying from 101 to 999. For instance, PIEM101
identifies the first ASL of Turin in Piedmont, while VEN112 is associated with
the ASL of Venice, and so on. The records in the balance sheet, on the other
hand, are organized according to the principles of the International Accounting
Standards (IAS²), so that they capture the financial flows of each single unit.
Examples of such flows are funding (from public institutions or from
private organizations), inflows deriving from the provision of health services, and
costs and liabilities, for an overall number of 164 variables.
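The unit codes described above can be split into their two parts with a short parser. This is a sketch only: the exact code format (upper-case region ID immediately followed by three digits) is inferred from the two examples in the text, the region IDs are those of Table 1, and the name `parse_unit_code` is hypothetical.

```python
import re

# Region IDs as listed in Table 1.
REGION_IDS = {
    "ABR", "PGL", "CAL", "EMROM", "LAZ", "LOM", "MOL", "SAR", "TNTBZ", "UMB",
    "VDA", "BAS", "CAM", "FRI", "LIG", "MAR", "PIEM", "SIC", "TOSC", "VEN",
}

# Assumed layout: letters for the region, then a three-digit unit number.
_UNIT_CODE = re.compile(r"^([A-Z]+)(\d{3})$")


def parse_unit_code(code):
    """Split a unit code such as 'PIEM101' into (region ID, unit number)."""
    match = _UNIT_CODE.match(code)
    if match is None:
        raise ValueError(f"malformed unit code: {code!r}")
    region, number = match.group(1), int(match.group(2))
    if region not in REGION_IDS:
        raise ValueError(f"unknown region ID: {region!r}")
    if not 101 <= number <= 999:
        raise ValueError(f"unit number out of range: {number}")
    return region, number


turin = parse_unit_code("PIEM101")
venice = parse_unit_code("VEN112")
```

Grouping the 300 records by the first element of the returned pair reproduces the regional aggregation level mentioned above.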
If we examined the data in the traditional accounting way, we would
set apart from the balance sheet those variables that are generally employed
to calculate financial ratios, but we decided to behave differently, for at least
two reasons. The first one is that, although financial ratios should serve
simplification purposes, the number of ratios that can be built from the balance
sheet does not differ appreciably from the number of records in the balance sheet
itself. A more technical explanation of our choice comes from looking at the
peculiarity of the data we are considering. Both ASL and AO, in fact, are enterprises
almost uniquely devoted to providing health care services, so that the greater part
of the records we can read in their balance sheets pertains to costs and inflows
related to that specific activity; on the other hand, the accounting literature does
not provide proper financial ratios able to capture such specificity.

¹ www.istat.it
² http://www.ifrs.org/Home.htm

Table 1. Names of the Italian regional districts, and the IDs associated with them
throughout the paper

Name                   ID       Name                   ID
Abruzzo                ABR      Aosta Valley           VDA
Apulia                 PGL      Basilicata             BAS
Calabria               CAL      Campania               CAM
Emilia-Romagna         EMROM    Friuli-Venezia Giulia  FRI
Lazio                  LAZ      Liguria                LIG
Lombardy               LOM      Marche                 MAR
Molise                 MOL      Piedmont               PIEM
Sardinia               SAR      Sicily                 SIC
Trentino-Alto Adige    TNTBZ    Tuscany                TOSC
Umbria                 UMB      Veneto                 VEN
As a result, we decided to consider all the available data from the financial
statements of ASL and AO, thus obtaining an input matrix of dimensions 300 ×
164, where each row represents either an ASL or an AO with its 164 normalized
determinants.
$$\|X\|_p = \left( \sum_i |X_i|^p \right)^{1/p}. \tag{2}$$
Here p is a strictly positive real value (fractional norms). Using the family
defined by (2), [2] observed that nearest neighbour search is meaningless in
high-dimensional spaces for integer p values equal to or greater than two (the so-called
ultrametrics). Those results are general, in the sense that they hold also when p
takes positive real values. In addition, [8] outlined that the optimal distance could
depend on the type of noise on the data: fractional norms should be preferable
in the case of colored noise, while in the case of Gaussian noise, the Euclidean
metric should be more robust than fractional ones. More recently, [17] also
proved that, contrary to what was expected, prenorms (0 < p < 1) are not always
less concentrated than higher order norms. Finally, [14] and [15] provided
evidence that the use of both prenorms and ultrametrics can be noteworthy when
dealing with financial data.
We considered this debate of particular interest for our study, since we need
to manage input patterns embedded into a very high-dimensional space: we are
primarily concerned with testing whether the performance of SOM may benefit
from changes in the adopted similarity measure.
To do this, we needed to modify the SOM procedure in a proper way. In
practice, the plain SOM uses a set of q neurons (arranged either on a strip or
into a 2D rectangular or hexagonal grid) to form a discrete topological mapping
of an input space embedded into an n-dimensional space (n >> 2). At the start
of the learning, all the weights are initialised at random. Then the algorithm
repeats the following steps; we refer to the case of a one-dimensional
SOM, but the layout presented can easily be generalized to higher-dimensional
grids.
If $x(t) = \{x_j(t)\}_{j=1,\dots,n} \in \mathbb{R}^n$ is the input item presented at time t to a map
M having q nodes with weights $m_i(t) = \{m_{i,j}(t)\}_{j=1,\dots,n} \in \mathbb{R}^n$ (i = 1, ..., q), $i_t$
will be claimed the winner neuron at step t iff:
$$i_t = \arg\min_{i \in M} \left( \sum_{j=1}^{n} |x_j(t) - m_{i,j}(t)|^p \right)^{1/p}, \quad p \in \mathbb{N}. \tag{3}$$
(4)
(3), and we trained SOM accordingly. In particular, we have relaxed (3), allowing
p to assume real positive values, to include both ultrametrics (p >> 1), and
prenorms (0 < p < 1). Obviously, the changes affected all the procedures involving
the use of the Euclidean metric as similarity measure, including, for instance,
the search for best matching units and the evaluation of quantization error.
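The two procedures named above can be sketched for an arbitrary real p > 0 as follows. The function names are hypothetical, and the quantization error is taken here as the mean distance between each input and its best matching unit, which is a common definition but an assumption with respect to this paper.

```python
def minkowski_distance(x, m, p):
    """Distance induced by the Minkowski norm for a real p > 0."""
    return sum(abs(a - b) ** p for a, b in zip(x, m)) ** (1.0 / p)


def best_matching_unit(x, weights, p):
    """Index of the node whose weight vector is closest to x, as in (3)."""
    return min(range(len(weights)), key=lambda i: minkowski_distance(x, weights[i], p))


def quantization_error(data, weights, p):
    """Mean distance between each input and its best matching unit (assumed definition)."""
    total = 0.0
    for x in data:
        winner = best_matching_unit(x, weights, p)
        total += minkowski_distance(x, weights[winner], p)
    return total / len(data)


# The winner can change with p: node 0 differs from x in one coordinate only,
# node 1 differs a little in both coordinates.
x = [0.0, 0.0]
nodes = [[3.0, 0.0], [2.0, 2.0]]
winner_p1 = best_matching_unit(x, nodes, 1)      # distances 3 and 4
winner_p10 = best_matching_unit(x, nodes, 10)    # distances 3 and about 2.14
qe = quantization_error([[0.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [1.0, 1.0]], 2)
```

Because the winner itself depends on p, quantization errors obtained with different p values are measured in different metrics and are not directly comparable without care.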
We ran simulations considering values of p in the range [0.5, 10] sampled at step
0.5, for an overall number of twenty alternative p values. For each of them we
trained a batch of 100 plain SOMs with rectangular grid topology and
dimensions varying from 5 × 5 to 21 × 21, isolating the SOM with the best performance in
terms of quantization error. For every value of p, this best SOM tends to have
very similar grid dimensions (around 12 × 12). Our next move was then
to choose the most representative one among the best performing SOMs. In this
task we considered both the level of the quantization error and the organization
of SOM nodes. Figure 1 provides a look at the four most significant results.
One can immediately note the concentration effect for p > 2 (Figure 1(c) and
1(d)): the blank parts of the maps are the only ones where winner nodes are
placed. This feature is common to all SOMs trained with p >> 2.
Concentration was less evident for p < 2; in that case, however, the advantage of using p
values other than two was not large enough (with respect to the quantization error)
to justify the replacement of p = 2. We then concluded that, at least in our
case, despite the size of the embedding dimension, the plain SOM trained with
the standard Euclidean norm still remains the best choice.
Fig. 1. Distance matrix with map units size for the four best performing SOMs:
(a) p = 0.5, (b) p = 2, (c) p = 5, (d) p = 10
Fig. 2. From left to right: U-matrix (2(a)) and Best Matching Units (BMUs) (2(b))
for the best SOM trained with p = 2. Note the sparsity of the BMUs
However, focusing on the case of p = 2 (see Figure 2), we noticed that, despite
the overall good performance of SOM in terms of quantization error, the
winner nodes were too sparse for our purposes. We then decided to move
one step further, and we analyzed the connections among winner nodes (Best
Matching Units, BMUs) to build the related adjacency matrix. In practice, we
used SOM to perform graph mining as in [6], but with the difference that
we acted directly on the connections of SOM BMUs. The algorithm we used
is similar to that introduced in [16] to build the Planar Maximally Filtered Graph
(PMFG), with changes involving the way distances among BMUs are evaluated:
where [16] uses correlation, here we used (1), with the p value as selected in the
previous stage of the procedure (in our case: p = 2).
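To illustrate how an adjacency matrix can be derived from distances among BMU weight vectors, the sketch below builds a minimum spanning tree with Kruskal's algorithm. This is a deliberately simpler filtered graph, not the PMFG of [16], which retains more edges subject to a planarity constraint; the function names and the toy prototypes are assumptions.

```python
def minkowski_distance(x, y, p=2.0):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def spanning_tree_adjacency(prototypes, p=2.0):
    """Adjacency matrix of a minimum spanning tree over the BMU weight vectors.

    Simplified stand-in for a filtered graph; it is NOT the PMFG construction.
    """
    n = len(prototypes)
    # All pairwise distances, shortest first.
    edges = sorted(
        (minkowski_distance(prototypes[i], prototypes[j], p), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    adjacency = [[0] * n for _ in range(n)]
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # The edge joins two separate components, so keep it.
            parent[root_i] = root_j
            adjacency[i][j] = adjacency[j][i] = 1
    return adjacency


# Toy BMU prototypes: two tight pairs that are far from each other.
bmu_prototypes = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [5.0, 1.0]]
adjacency = spanning_tree_adjacency(bmu_prototypes, p=2.0)
```

In the toy example the two tight pairs are linked internally and joined by the single shortest bridge, which is the kind of grouping that a cluster reading of the graph relies on.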
As a result, we obtained a representation of the connections among SOM nodes like the
one shown in Figure 3. Although the representation needs to be interpreted with
some care, the graph allowed us to extract some notable information. First
of all, the twelve clusters that now clearly emerge from the SOM exhibit quite
distinct features; we discuss the more significant ones. Clusters 1 and
2 are characterized by lower overall positive revenues and higher specific costs (i.e.
costs related to the provision of health care services); cluster 3 is the one with
both the highest revenues from medical activity and the lowest taxation costs;
clusters 4 and 9 are those that invest more in employee training. On the other
hand, cluster 5 groups enterprises that have received less public funding: it is
not very surprising to discover that this cluster is associated with the lowest level
of the value of production. In the balance sheet this variable generally monitors
the overall inflows of the enterprise: the (quite trivial) lesson we can learn from this
cluster is that its members seem unable to manage financial inflows other
than public funding. Cluster 8 is in the opposite situation to cluster 5: it receives
the highest level of public funding but, despite this, its members were not able
to reach the best financial results. Finally, cluster 11 exhibits the best financial
flows among those not specifically related to health care provision.
Another interesting remark relates to the composition of clusters: clusters are
not territorially homogeneous, i.e. they generally group ASL and AO from different
regions; a partial exception to this rule is provided by cluster 2, which includes
53% of units from the region Emilia-Romagna (EMROM). This could be of
particular importance, because it points to the existence of financial differences
among public health care providers belonging to the same region. This, in turn,
suggests that greater efficiency could be reached by operating on the allocation
of public funding at the regional level.
Finally, the organization of clusters provides information at the technical level
too, suggesting that wraparound grid topologies (either toroidal or cylindrical)
could give more satisfying results.
Conclusion
In this paper we discussed an application of the Self-Organizing Map (SOM) to assess
the efficiency of health care providers. To do this, we examined by means
of SOM the balance sheet data of 300 Italian health care providers that
receive public funding. Since, to the best of our knowledge, this is a first-time
application in the health care sector, we mainly directed our efforts to the
application of the classic SOM algorithm. As the only concession to more sophisticated
analysis, and because of the high dimensionality of the dataset, we explored the
convenience of training SOMs with similarity measures other than the Euclidean one,
the metrics being chosen among Minkowski norms, as defined in [17]. We then
trained 20 blocks of SOMs, each of 100 maps, characterized by various grid topology
sizes and by different distance metrics. SOM performance was checked by
focusing on the quantization error, evaluated according to the metric
in use. We obtained the best results both with SOMs trained with the standard
Euclidean metric and with those trained through prenorms. In the latter case,
however, the gains in terms of quantization error were not significant enough to
justify abandoning the Euclidean metric. In addition, we found that the
information provided by the SOM was too sparse to be significant for our purposes,
and we then moved one step further, using the map best matching units to build
an adjacency matrix that was then the starting point for a graph mining
process. This task was particularly fruitful, since it allowed us to retain a good deal
of information about the efficiency of the health system in Italy. In
particular, we observed that, rather than increasing health care expenditure, a
successful move could be to strengthen the integration among regions and
the allocation of existing funds inside the regions themselves. Moreover, from
the technical point of view, the cluster organization we obtained suggests the
direction for further experiments: we could probably get better and more refined
results using a different grid topology, such as the cylindrical or the toroidal one.
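The evaluation described above can be sketched as follows: the quantization error of a codebook is the mean distance between each sample and its best matching unit, computed under a Minkowski measure of order p. This is only an illustration. The function names and the toy data are hypothetical, and treating orders p < 1 as the prenorm case is an assumption about the metrics of [17], not a restatement of the experimental setup.

```python
from typing import List, Sequence


def minkowski(x: Sequence[float], y: Sequence[float], p: float) -> float:
    """Minkowski distance of order p (p = 2 is the Euclidean case)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def quantization_error(data: List[Sequence[float]],
                       codebook: List[Sequence[float]],
                       p: float) -> float:
    """Mean distance from each sample to its best matching unit under order p."""
    total = 0.0
    for x in data:
        total += min(minkowski(x, m, p) for m in codebook)
    return total / len(data)


# Hypothetical toy data: two samples and a codebook with two map units.
data = [(0.0, 0.0), (3.0, 4.0)]
codebook = [(0.0, 0.0), (0.0, 4.0)]
qe_euclid = quantization_error(data, codebook, p=2.0)
qe_prenorm = quantization_error(data, codebook, p=0.5)  # assumed fractional order
```

Because the error is measured in the same metric that was used for training, values obtained with different orders p are not directly comparable in magnitude, only in relative terms.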
References
1. Aggarwal, C.C., Yu, P.S.: The IGrid Index: Reversing the Dimensionality Curse
For Similarity Indexing in High Dimensional Space. In: Proc. of KDD, pp. 119–129
(2000)
2. Aggarwal, C.C., Hinneburg, A., Keim, D.A.: On the Surprising Behavior of Distance
Metrics in High Dimensional Space. In: Van den Bussche, J., Vianu, V. (eds.)
ICDT 2001. LNCS, vol. 1973, pp. 420–434. Springer, Heidelberg (2000)
3. Bjorkgren, M., Hakkinen, U., Linna, M.: Measuring Efficiency of Long Term Care
Units in Finland. Health Care Management Science 4(3), 193–201 (2001)
4. Braithwaite, J., Westbrook, M., Hindle, D., Ledema, R., Black, D.: Does Restructuring
Hospitals Result in Greater Efficiency? An Empirical Test Using Diachronic
Data. Health Services Management Research 19(1), 1–13 (2006)
5. Banker, R.: Maximum Likelihood, Consistency and Data Envelopment Analysis:
A Statistical Foundation. Management Science 39(10), 1265–1273 (1993)
6. Boulet, R., Jouve, B., Rossi, F., Villa, N.: Batch kernel SOM and related Laplacian
methods for social network analysis. Neurocomputing 71(7-9), 1257–1273 (2008)
7. Demartines, P.: Analyse de Donnees par Reseaux de Neurones Auto-Organises.
PhD dissertation, Institut Natl Polytechnique de Grenoble, Grenoble, France
(1994)
8. Francois, D., Wertz, V., Verleysen, M.: Non-euclidean metrics for similarity search
in noisy datasets. In: Proc. of ESANN 2005, European Symposium on Artificial
Neural Networks (2005)
9. Hollingsworth, B.: Non-Parametric and Parametric Applications Measuring
Efficiency in Health Care. Health Care Management Science 6(4), 203–218 (2003)
10. Hurley, E., McRae, I., Bigg, I., Stackhouse, L., Boxall, A.M., Broadhead, P.: The
Australian health care system: the potential for efficiency gains. In: Working paper,
Australian Government National Health and Hospitals Reform Commission (2009)
11. Key, B., Reed, R., Sclar, D.: First-order Economizing: Organizational Adaptation
and the Elimination of Waste in the U.S. Pharmaceutical Industry. Journal of
Managerial Issues 17(4), 511–528 (2005)
12. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (2002)
13. Murillo Zamorano, L.: Economic Efficiency and Frontier Techniques. Journal of
Economic Surveys 18(1), 33–77 (2004)
14. Resta, M.: Seize the (intra)day: Features selection and rules extraction for tradings
on high-frequency data. Neurocomputing 72(16-18), 3413–3427 (2009)
15. Resta, M.: On the Impact of the Metrics Choice in SOM Learning: Some Empirical
Results from Financial Data. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C.
(eds.) KES 2010. LNCS, vol. 6278, pp. 583–591. Springer, Heidelberg (2010)
16. Tumminello, M., Aste, T., Di Matteo, T., Mantegna, R.N.: A tool for filtering
information in complex systems. PNAS 102(30), 10421–10426 (2005)
17. Verleysen, M., Francois, D.: The Concentration of Fractional Distances. IEEE
Trans. on Knowledge and Data Engineering 19(7), 873–886 (2007)
18. Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J.: SOM Toolbox for
Matlab 5. Helsinki University of Technology Technical Report (2000)
Abstract. The Self-organizing map (SOM) has been widely used in financial
applications, not least for time-series analysis. The SOM has not only been
utilized as a stand-alone clustering technique, its output has also been used as
input for second-stage clustering. However, one ambiguity with the SOM
clustering is that the degree of membership in a particular cluster is not always
easy to judge. To this end, we propose a fuzzy C-means clustering of the units
of two previously presented SOM models for financial time-series analysis:
financial benchmarking of companies and monitoring indicators of currency
crises. It allows each time-series point to have a partial membership in all
identified, but overlapping, clusters, where the cluster centers express the
representative financial states for the companies and countries, while the
fluctuations of the membership degrees represent their variations over time.
Keywords: Self-organizing maps, fuzzy C-means, financial time series.
1 Introduction
The Self-organizing map (SOM), proposed by Kohonen [1], has been widely used in
industrial applications. It is an unsupervised and nonparametric neural network
approach that pursues a simultaneous clustering and projection of high-dimensional
data. While clustering algorithms, in general, attempt to partition data into natural
groups by maximizing inter-cluster distance and minimizing intra-cluster distance, the
SOM performs a clustering of a slightly different nature. The SOM can be thought of
as a spatially constrained form of k-means clustering or as a projection maintaining the
neighborhood relations in the data. In the early days of the SOM, information
extraction was mainly facilitated by visual analysis of a U-matrix, where a color code
between all neighboring nodes indicates their average distance [2]. The SOM has,
however, not only been utilized as a stand-alone clustering technique, its output has
also been used as input for a second stage of two-level clustering. Lampinen and Oja
[3] proposed a two-level clustering by feeding the outputs of the first SOM into a
second SOM. Further, Vesanto and Alhoniemi [4] outperformed stand-alone
techniques using a two-level approach with both hierarchical agglomerative and
partitional k-means clustering algorithms. Minimum distance and variance criteria have
also been proposed for SOM clustering [5–7].
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 40–50, 2011.
© Springer-Verlag Berlin Heidelberg 2011
However, one ambiguity with the SOM clustering is that the degree of membership
in a particular cluster is not always easy to judge. In some cases, it might be beneficial
to judge the degree to which a particular area of a cluster differs from the rest of the
cluster, and what its closest match among the other clusters is. To this end, we apply
fuzzy C-means (FCM) [8] clustering on the units of the SOM grid. The FCM algorithm
allows each unit to have a partial membership in all identified, but overlapping,
clusters. This enables sensible representation of the real world filled with uncertainty
and imprecision. The model is not only expected to provide an adequate clustering, but
also to enable easily interpretable visualizations of the evolution of cluster
memberships over time. As the crispness of the data cannot be known a priori, the
FCM clustering presents information on the overlapping of the clusters, be they crisp
or fuzzy, while still always enabling comparisons between data points. We apply FCM
clustering to two previously presented SOM models for financial time-series analysis:
financial benchmarking of companies [9] and monitoring indicators of currency crises
[10]. In this paper, as in Liu and Lindholm's [11] stand-alone FCM clustering, the
cluster centers express the representative financial states for the companies and
countries, while the varying membership degrees represent their fluctuations over time.
The results indicate that fuzzy clustering of the SOM units is a useful addition to visual
monitoring of financial time-series data.
The paper is structured as follows. Section 2 discusses fuzzy clustering of the
SOM. In Section 3, the two-level clustering is applied on financial time series.
Section 4 concludes by presenting our key findings and future research directions.
$$\| x_j - m_b \| = \min_i \| x_j - m_i \| , \tag{1}$$
such that the distance between the data vector xj and the BMU mb is less than or equal
to the distance between xj and any other reference vector mi. Then the second step adjusts
each reference vector mi with the sequential updating algorithm [12, p. 111]:
$$m_i(t+1) = m_i(t) + h_{ib(j)}(t) \, [x(t) - m_i(t)] , \tag{2}$$
where t is a discrete time coordinate and hib(j) a neighborhood function. The reference
vectors can also be updated using the batch algorithm, which projects all xj to their mb
before each mi is updated [12, p. 138].
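One sequential training step, i.e. the BMU search of Eq. (1) followed by the update of Eq. (2), can be sketched as below. The Gaussian neighborhood function, the learning rate alpha, the width sigma, and all function names are assumptions made for illustration; the text does not specify the neighborhood function that was used.

```python
import math
from typing import List, Tuple

Vector = List[float]
GridPos = Tuple[int, int]


def bmu_index(x: Vector, units: List[Vector]) -> int:
    """Eq. (1): index b of the reference vector closest to x (Euclidean)."""
    return min(range(len(units)), key=lambda i: math.dist(x, units[i]))


def neighborhood(pos_i: GridPos, pos_b: GridPos, alpha: float, sigma: float) -> float:
    """Assumed Gaussian neighborhood h_ib, measured on the map grid."""
    d2 = (pos_i[0] - pos_b[0]) ** 2 + (pos_i[1] - pos_b[1]) ** 2
    return alpha * math.exp(-d2 / (2.0 * sigma ** 2))


def sequential_step(x: Vector, units: List[Vector], grid: List[GridPos],
                    alpha: float, sigma: float) -> int:
    """Eq. (2): move every reference vector toward x, weighted by h_ib."""
    b = bmu_index(x, units)
    for i, m in enumerate(units):
        h = neighborhood(grid[i], grid[b], alpha, sigma)
        units[i] = [mk + h * (xk - mk) for mk, xk in zip(m, x)]
    return b


# Hypothetical 3 x 1 map with two-dimensional reference vectors.
units = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
grid = [(0, 0), (1, 0), (2, 0)]
b = sequential_step([1.2, 0.8], units, grid, alpha=0.5, sigma=1.0)
```

In a full training run, alpha and sigma would normally shrink over time t, which is what the time index of the neighborhood function in Eq. (2) allows for.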
The units of the map can further be divided into clusters of similar units. Instead of
dividing the units into crisp clusters, we employ the FCM algorithm, developed by
[13] and improved by [8], for assigning a degree of membership of each unit in each
of the clusters. The FCM algorithm implements an objective function-based fuzzy
clustering method. The objective function J is defined as the weighted sum of the
Euclidean distances between each unit and each cluster center, where the weights are
the degrees of membership of each unit in each cluster, and constrained by the
requirement that the sum of memberships of each point equals 1:
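$$J = \sum_{i=1}^{M} \sum_{k=1}^{C} u_{ik}^{\theta} \, \| m_i - c_k \|^2 , \qquad \sum_{k=1}^{C} u_{ik} = 1 , \tag{3}$$
where θ ∈ (1, ∞) is the fuzzy exponent, uik is the degree of membership of reference
vector mi (where i=1,2,…,M) in the cluster center ck (where k=1,2,…,C, and 1<C<M),
and ‖mi − ck‖² is the squared Euclidean distance between mi and ck. The function J is
minimized by updating the membership degrees uik:
$$u_{ik} = 1 \Big/ \sum_{s=1}^{C} \left( \frac{\| m_i - c_k \|^2}{\| m_i - c_s \|^2} \right)^{1/(\theta - 1)} , \tag{4}$$
where s indexes the cluster centers in the sum, and by updating the cluster centers ck:
$$c_k = \sum_{i=1}^{M} u_{ik}^{\theta} \, m_i \Big/ \sum_{i=1}^{M} u_{ik}^{\theta} . \tag{5}$$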
The algorithm proceeds as follows. First, the cluster centers are initialized
randomly. Thereafter, each reference vector is assigned a membership grade in each
cluster. Then the so-called Picard iteration through Eq. (4) and Eq. (5) is run to adjust
the cluster centers and the membership values. The iterations stop when the
improvement between two consecutive iterations is less than a
small positive number ε or after a specified number of iterations.
We use ε = 0.0001 and a maximum of 100 iterations. The improvement criterion
is small enough to ensure no possible significant improvements of the possibly local
optima, while we never reached 100 iterations. The extent of overlapping between the
clusters is set by the fuzzy exponent θ. When θ → 1, the fuzzy clustering converges
to a crisp k-means clustering, while when θ → ∞ the cluster centers tend towards the
center of the data set. Several experiments were performed to set the θ- and c-values.
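The iteration described above can be sketched in a few lines of plain Python. This is an illustrative reading of the procedure rather than the software used for the experiments. It assumes squared Euclidean distances, draws the initial centers at random from the reference vectors, and measures improvement as the change in the objective function; theta stands for the fuzzy exponent, and all function names are hypothetical.

```python
import random
from typing import List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def update_memberships(units: List[Vector], centers: List[Vector],
                       theta: float) -> List[List[float]]:
    """Membership update: each row holds one unit's degrees and sums to 1."""
    power = 1.0 / (theta - 1.0)
    memberships = []
    for m in units:
        # A tiny floor avoids division by zero when a unit sits on a center.
        d = [max(sq_dist(m, c), 1e-12) for c in centers]
        memberships.append([1.0 / sum((dk / ds) ** power for ds in d) for dk in d])
    return memberships


def update_centers(units: List[Vector], u: List[List[float]],
                   theta: float) -> List[Vector]:
    """Center update: membership-weighted mean of the reference vectors."""
    centers = []
    for k in range(len(u[0])):
        w = [row[k] ** theta for row in u]
        total = sum(w)
        centers.append([sum(wi * m[j] for wi, m in zip(w, units)) / total
                        for j in range(len(units[0]))])
    return centers


def objective(units: List[Vector], centers: List[Vector],
              u: List[List[float]], theta: float) -> float:
    """Weighted sum of squared distances between units and centers."""
    return sum(u[i][k] ** theta * sq_dist(units[i], centers[k])
               for i in range(len(units)) for k in range(len(centers)))


def fuzzy_c_means(units: List[Vector], n_clusters: int, theta: float = 2.0,
                  eps: float = 1e-4, max_iter: int = 100,
                  seed: int = 0) -> Tuple[List[Vector], List[List[float]]]:
    """Alternate the two updates until the objective improves by less than eps."""
    rng = random.Random(seed)
    centers = [list(m) for m in rng.sample(units, n_clusters)]  # assumed init
    u = update_memberships(units, centers, theta)
    previous = objective(units, centers, u, theta)
    for _ in range(max_iter):
        centers = update_centers(units, u, theta)
        u = update_memberships(units, centers, theta)
        current = objective(units, centers, u, theta)
        if abs(previous - current) < eps:
            break
        previous = current
    return centers, u


# Hypothetical reference vectors forming two well-separated groups.
units = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centers, u = fuzzy_c_means(units, n_clusters=2)
```

Since the result may be a local optimum, it is common to repeat the run with several seeds and keep the solution with the lowest objective value.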
consisted of seven financial ratios for the years 1995–2003, for a total of 78 pulp and
paper companies. The ratios included were operating margin, return on equity, and
return on total assets (profitability ratios), equity to capital and interest coverage
(solvency ratios), quick ratio (liquidity ratio), and receivables turnover (efficiency
ratio). The model is presented in detail in Eklund et al. [9] and validated in Eklund
et al. [14].
[Fig. 1: SOM of the financial benchmarking model, showing the clusters and the yearly positions (1995–2003) of IP, GP, SE, KC, and WH, with the feature planes at the top.]
The model was created in SOM_PAK 3.1, using randomly initialized reference
vectors and sequential training, and visualized in Nenet 1.1. Histogram equalization
[15] was used to preprocess the outlier-rich and heavily non-normally distributed
financial ratios. The map consists of a 9 x 7 lattice, divided into eight clusters
representing different aspects of financial performance, and can be found in Fig. 1. In
the figure, the five largest pulp and paper companies according to net sales in 2003
are displayed. The notations are as follows: International Paper = IP, Georgia-Pacific
= GP, Stora Enso = SE, Kimberly-Clark = KC, and Weyerhaeuser = WH. The feature
planes of the map are displayed at the top of the figure. The map is roughly ordered
into high profitability on the right hand side of the map, high solvency and liquidity in
the middle and upper right hand side of the map, and high efficiency in upper right
hand side, as well as lower and upper left hand sides of the map. Generally speaking,
the best in class companies are in clusters A and B, and poorest in clusters G and H.
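The effect of histogram equalization on an outlier-rich ratio can be illustrated with a rank-based transformation that maps each value to its relative rank in [0, 1]. This is one common reading of the technique and is given only as a sketch. The function name, the tie handling, and the sample values are assumptions, and the preprocessing in the original tool chain may differ in detail.

```python
from typing import List


def histogram_equalize(values: List[float]) -> List[float]:
    """Map each value to its average rank scaled to [0, 1]; ties share a rank."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run over all entries equal to the current value.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank / (n - 1)
        start = end + 1
    return ranks


# Hypothetical financial ratio with one extreme value at each end.
ratios = [0.05, 0.07, 0.06, 12.0, -3.0]
equalized = histogram_equalize(ratios)
```

After the transformation the two extreme observations sit at 0 and 1 instead of dominating the scale, so the remaining values are spread evenly between them.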
The reference vectors from the financial benchmarking model were used as input for a
second-level clustering. Several experiments were performed, varying the θ-value
(between 1.0 and 3.0) and the c-value (between 3 and 9). Based upon these
experiments, a θ-value of 2.0 provided the best visual interpretability of the map,
introducing a fuzziness degree large enough to show relationships between clusters,
but not large enough to completely eliminate cluster borders. The c-value was set to
8, in accordance with the originally identified number of clusters on the map.
However, different c-values were tested, including a three-cluster model that roughly
divided the companies into good, average, and poor performers. Eight clusters were in
this case used in order to be able to assess this clustering in terms of the original
model.
[Fig. 2: cluster membership degrees of the nodes in Clusters 1–8 for two values of the fuzzy exponent, panels (a) and (b).]
[Fig. 3: defuzzification using maximum memberships (left) and cluster membership degrees of the nodes in Clusters 1–8 (right).]
Fig. 2 shows a SOMine 5.1 visualization of the nodes' cluster membership degrees
for (a) a θ-value of 1.8 and (b) 2.2. It clearly shows the higher crispness of the
clusters in (a) vis-à-vis the higher fuzziness of (b). The right map in Fig. 3 shows the
nodes' cluster membership degrees for the chosen θ-value of 2.0, while the left shows
a defuzzification using maximum memberships. It shows that most of the clusters of
the FCM model coincide with the clustering of the original map. For example, the
best-in-class clusters (A and B in Fig. 1) largely coincide with clusters 1 and 2 in Fig. 3,
only partially overlapping each other and not really any other clusters. The poorest
clusters (F, G, and H) are also quite clearly identifiable as clusters 6, 7, and 8 in Fig. 3.
The only cluster not identifiable is cluster D, which forms a part of cluster 3 in Fig. 3.
The nodes in cluster D thus seem to display similarity to nodes in clusters C, E, and to
a degree, cluster F. When using Ward's [16] clustering on the map in Fig. 1, cluster D
does indeed merge with cluster C, indicating a slightly twisted map. This is a
complement to other methods, such as Sammon's mapping, for testing map
twistedness. Further, the FCM clustering shows that cluster E is split into two groups
largely based upon liquidity (quick ratio), clusters 4 and 5.
[Fig. 4 panels, one per company, including Stora Enso (SE), Kimberly-Clark (KC), and Weyerhaeuser (WH): membership degree (0.00–1.00) in Clusters 1–8 by year, 1995–2003.]
Fig. 4. Membership degrees for the five largest P&P companies in 2003
Fig. 4 shows the cluster membership degrees of the top five pulp and paper
companies. The figures show that the cluster memberships of the most stable
companies (KC, best, and IP, poorest) are high (membership of ca. 0.8), while the
companies that shift between clusters show low membership values (ca. 0.6 or less).
Further, Fig. 4 shows that the clusters on the left and the right border of the map
overlap to a lesser degree than the clusters in the middle, where, for instance, the
data point for Weyerhaeuser in 1999 is located. In this particular case, the membership
degree does not exceed 0.2 for any of the clusters, indicating no predominant cluster
over others. This is
indeed informative when judging the certainty with which the financial performance of a
company is assigned to a cluster. To incorporate this type of uncertainty,
defuzzification using a threshold on the above utilized maximum-membership method
might be advantageous.
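A thresholded variant of maximum-membership defuzzification could look like the sketch below: a data point is assigned to its strongest cluster only if that membership reaches a chosen threshold, and is otherwise left unassigned. The function name, the threshold of 0.5, the zero-based cluster indices, and the two membership vectors are hypothetical and serve only to illustrate the idea.

```python
from typing import List, Optional


def defuzzify(memberships: List[float], threshold: float = 0.0) -> Optional[int]:
    """Return the index of the largest membership, or None if it is below the threshold."""
    best = max(range(len(memberships)), key=lambda k: memberships[k])
    return best if memberships[best] >= threshold else None


# Hypothetical membership vectors over eight clusters (each sums to 1).
clear = [0.82, 0.05, 0.04, 0.03, 0.02, 0.02, 0.01, 0.01]
ambiguous = [0.18, 0.16, 0.15, 0.14, 0.12, 0.10, 0.08, 0.07]

assigned = defuzzify(clear, threshold=0.5)        # strongest cluster is accepted
unassigned = defuzzify(ambiguous, threshold=0.5)  # no cluster is predominant
```

With a threshold of zero the function reduces to plain maximum-membership defuzzification, so the same code covers both cases.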
3.3 The Currency Crisis Model
The currency crisis model was created for visual monitoring of currency crisis
indicators, as is done in [17] on general economic and financial variables. The model
consisted of four monthly indicators of currency crises for 23 emerging market
economies from 1971:1–1997:12. The indicators were chosen and transformed based
on a seminal early warning system created by IMF staff [18]. The indicators included
were foreign exchange reserve loss, export loss, real exchange-rate overvaluation
relative to trend and current account deficit to GDP. This model is, however,
conceptually different from the benchmarking model. Each data point has a class
dummy indicating the occurrence of a crisis, pre-crisis or tranquil period. A crisis
period is defined to occur when exchange-rate and reserve volatility exceeds a
specified threshold, while the pre-crisis periods are defined as 24 months preceding a
crisis. The class labels were associated with the model by only affecting the updating
of the reference vectors (batch version of Eq. 2), not the choice of the BMU (Eq. 1).
Thus, the main purpose of the model is to visualize the evolution of financial indicators
to assist the detection of vulnerabilities or threats to financial stability. The model is
presented in detail in Sarlin [10] and a model on the same data set is evaluated in terms
of out-of-sample accuracy in Sarlin and Marghescu [19]. Moreover, a stand-alone
FCM clustering has been applied on a close to similar data set in [20].
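The class labelling described above can be sketched as follows, taking the monthly crisis flags as given. The volatility threshold that defines a crisis month is not specified here, so the flags are treated as input; the function name and the short example series are hypothetical.

```python
from typing import List


def label_periods(crisis_flags: List[bool], horizon: int = 24) -> List[str]:
    """Label each month as 'crisis', 'pre-crisis' (within horizon months before a crisis) or 'tranquil'."""
    labels = ["tranquil"] * len(crisis_flags)
    for t, is_crisis in enumerate(crisis_flags):
        if is_crisis:
            for s in range(max(0, t - horizon), t):
                labels[s] = "pre-crisis"
    # Crisis months are written last so that they are never overwritten.
    for t, is_crisis in enumerate(crisis_flags):
        if is_crisis:
            labels[t] = "crisis"
    return labels


# Hypothetical eight-month series with a crisis in month 5 and a shortened horizon.
flags = [False, False, False, False, False, True, False, False]
labels = label_periods(flags, horizon=3)
```

The default horizon of 24 corresponds to the 24 months preceding a crisis; the shorter horizon in the example only keeps the output small.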
The model was created and visualized with Viscovery SOMine 5.1, using the two
principal components for initializing the reference vectors and the batch updating
algorithm. The contribution of each input is standardized using columnwise
normalization by range. However, the effects of extremities and outliers are not
eliminated, since a crisis episode is per se an extreme event. The map consists of 137
output neurons ordered on a 13 x 11 lattice, divided into four crisp clusters
representing different time periods of the currency crisis cycle. The units were
clustered using Ward's [16] hierarchical clustering on the associated variables. The
map, with a projection of indicators for Argentina from 1971–1997, and its feature
planes are shown in Fig. 5. The map is roughly divided into a tranquil cluster on the
right side of the map (cluster D), a crisis cluster in the upper-left part (cluster C), and a
slight early-warning and a pre-crisis cluster in the lower-left part (clusters B and A).
[Fig. 5 panels: SOM grid with clusters A–D and the yearly positions for Argentina; feature planes for Reserve loss, Export loss, ER Overvaluation, CA Deficit, CRISIS, and PRE CRISIS.]
Fig. 5. The currency crisis model with indicators for Argentina from 1971–97
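Columnwise normalization by range, as mentioned for the crisis model, can be illustrated with the short sketch below. It assumes that each column is rescaled to the interval [0, 1]; the exact scaling applied by the software is not stated in the text, and the function name and sample rows are hypothetical.

```python
from typing import List


def normalize_columns_by_range(rows: List[List[float]]) -> List[List[float]]:
    """Rescale every column to [0, 1] using its own minimum and maximum."""
    scaled_columns = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        span = hi - lo
        # A constant column carries no information and is mapped to zero.
        scaled_columns.append([(v - lo) / span if span > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical indicator rows: the third row holds an extreme first indicator.
rows = [[2.0, 100.0], [4.0, 300.0], [10.0, 200.0]]
scaled = normalize_columns_by_range(rows)
```

The extreme value still occupies the end of the scale after rescaling, which is consistent with the choice not to eliminate outliers.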
3.4 Fuzzy Clustering of the Currency Crisis Model
As for the benchmarking model, the reference vectors from the crisis model
were used as input for a second-level clustering, whereafter the membership degrees
and the defuzzification are visualized in SOMine. The same experiments were
performed, varying the θ-value (between 1.0 and 3.0) and the c-value (between 3
and 9). For those models, θ = {1.8, 2.0, 2.2} give the best results; higher fuzzy
exponents give non-smooth memberships, while lower give roughly crisp
memberships. However, as for the benchmarking model, a θ-value of 2.0 provided
the best visual interpretability. The c-value was first set to 4 (Fig. 6), in accordance
with the originally identified number of clusters on the map, but later adjusted to 3
(Fig. 7). The concern with the 4-cluster model is that the cluster termed Early warning
does not directly contribute to the currency crisis cycle. Although it would, of course,
be informative to have an Early warning cluster, the cluster is quite small and
separates the pre-crisis cluster from both the tranquil cluster (as desired) and the crisis
cluster (not desired). In Fig. 8, where the vertical dotted lines represent crisis
episodes, the fluctuations of indicators for Argentina are shown using both models.
This exercise confirms that the Early warning cluster is a less influential cluster that
does not add real value to the analysis of the currency crisis cycle. Thus, the 3-cluster
model is utilized for assessing the fluctuations in the data.
Argentina experienced three crisis episodes during the analyzed period. As shown in
Fig. 8, the first crisis in 1975 was preceded by high membership values in the pre-crisis
cluster, whereafter the memberships in the crisis and subsequently the tranquil cluster
dominated. The membership values before, during, and after the crisis episode in 1982
similarly characterized a currency crisis cycle. The pre-crisis period for the crisis
episode in 1990 is, on the other hand, characterized by abnormal memberships that
vary between the tranquil and the crisis cluster, and thus does not resemble the
generalization of this model. Further, indications of the out-of-sample crisis episode in
1999 are given already from 1992 onwards.
[Fig. 6 and Fig. 7: defuzzified clusters and membership maps (scale 0.00–1.00) for the Tranquil, Pre crisis, Crisis, and Early warning clusters in the 4-cluster and 3-cluster models.]
[Fig. 8: membership degrees (0.00–1.00) for Argentina in the 4-cluster and 3-cluster models.]
As evaluating the accuracy of the SOM models is not the concern of this paper, and has
been done previously, the focus is on the added value of the membership values. The
fuzzy clustering in this application is rather crisp, as a comparison of the data points
for 1972 and 1982, for example, indicates. The conditions in 1972 and 1982,
respectively, are projected into different sides of the border between the pre-crisis and
the tranquil cluster, while still having high memberships in their respective cluster
centers. The crispness is, however, something that cannot be known a priori.
Although the clustering is to some extent non-overlapping, the differences within
each cluster and between each data point still indicate fluctuations in the conditions.
4 Conclusions
This paper addresses an ambiguity of the SOM clustering: the degree of membership in
a particular cluster. To this end, FCM clustering is applied on the units of the SOM
grid, allowing each data point to have a partial membership in all identified, but
overlapping, clusters. The FCM clustering is applied to two previously presented SOM
models for financial time-series analysis. Using FCM clustering, the cluster centers
express the representative financial states for the companies and countries,
respectively, while the varying membership degrees represent fluctuations of their
states over time. The results indicate that fuzzy clustering of the SOM units is a useful
addition to visual monitoring and representation of financial time series. However, the
clustering still needs to be objectively validated. For this task, there exist cluster
validity measures, such as [21–22]; however, this is left for future work.
Acknowledgments
We acknowledge Academy of Finland (grant no. 127656) and Lars och Ernst Krogius
forskningsfond for financial support. The views in this paper are those of the authors
and do not necessarily reflect those of the European Central Bank.
References
1. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biological
Cybernetics 66, 59–69 (1982)
2. Ultsch, A., Siemon, H.P.: Kohonen's self organizing feature maps for exploratory data
analysis. In: Proceedings of the International Conference on Neural Networks,
pp. 305–308. Kluwer, Dordrecht (1990)
3. Lampinen, J., Oja, E.: Clustering properties of hierarchical self-organizing maps. Journal
of Mathematical Imaging and Vision 2(2–3), 261–272 (1992)
4. Murtagh, F.: Interpreting the Kohonen self-organizing feature map using
contiguity-constrained clustering. Pattern Recognition Letters 16(4), 399–408 (1995)
5. Kiang, M.Y.: Extending the Kohonen self-organizing map networks for clustering
analysis. Computational Statistics and Data Analysis 38, 161–180 (2001)
6. Vesanto, J., Sulkava, M.: Distance Matrix Based Clustering of the Self-Organizing Map.
In: Dorronsoro, J.R. (ed.) ICANN 2002. LNCS, vol. 2415, pp. 951–956. Springer,
Heidelberg (2002)
7. Vesanto, J., Alhoniemi, E.: Clustering of the self-organizing map. IEEE Transactions on
Neural Networks 11(3), 586–600 (2000)
8. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum
Press, New York (1981)
9. Eklund, T., Back, B., Vanharanta, H., Visa, A.: Using the Self-Organizing Map as a
Visualization Tool in Financial Benchmarking. Information Visualization 2, 171–181
(2003)
10. Sarlin, P.: Visual monitoring of financial stability with a self-organizing neural network.
In: Proceedings of the 10th IEEE International Conference on Intelligent Systems Design
and Applications, pp. 248–253. IEEE Press, Los Alamitos (2010)
11. Liu, S., Lindholm, C.: Assessing the Early Warning Signals of Financial Crises: A Fuzzy
Clustering Approach. Intelligent Systems in Accounting, Finance & Management 14,
179–202 (2006)
12. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (2001)
13. Dunn, J.C.: A Fuzzy Relative of the ISODATA Process and its Use in Detecting Compact,
Well-Separated Clusters. Cybernetics and Systems 3, 32–57 (1973)
14. Eklund, T., Back, B., Vanharanta, H., Visa, A.: Evaluating a SOM-Based Financial
Benchmarking Tool. Journal of Emerging Technologies in Accounting 5, 109–127 (2008)
15. Guiver, J.P., Klimasauskas, C.C.: Applying Neural Networks, Part IV: Improving
Performance. PC AI Magazine 5, 34–41 (1991)
16. Ward, J.: Hierarchical grouping to optimize an objective function. Journal of the American
Statistical Association 58, 236–244 (1963)
17. Resta, M.: Early Warning Systems: an approach via Self Organizing Maps with
applications to emergent markets. In: Proceedings of the 18th Italian Workshop on Neural
Networks, pp. 176–184. IOS Press, Amsterdam (2009)
18. Berg, A., Pattillo, C.: What caused the Asian crises: An early warning system approach.
Economic Notes 28, 285–334 (1999)
19. Sarlin, P., Marghescu, D.: Visual Predictions of Currency Crises using Self-Organizing
Maps. Intelligent Systems in Accounting, Finance and Management (forthcoming, 2011)
20. Marghescu, D., Sarlin, P., Liu, S.: Early Warning Analysis for Currency Crises in
Emerging Markets: A Revisit with Fuzzy Clustering. Intelligent Systems in Accounting,
Finance and Management 17(2–3), 143–165 (2010)
21. Bezdek, J.C.: Cluster validity with fuzzy sets. Cybernetics 3, 58–73 (1974)
22. Xie, X.L., Beni, G.: A validity measure for fuzzy clustering. IEEE Transactions on Pattern
Analysis and Machine Intelligence 13(8), 841–847 (1991)
Abstract. We propose the use of self-organizing maps as models of social processes,
in particular, of electoral preferences. In some voting districts patterns of
electoral preferences emerge, such that in nearby areas
citizens tend to vote for the same candidate whereas in geographically
distant areas the most voted candidate is one whose political position is
distant from that of the former. Those patterns are similar to the spatial structure
achieved by self-organizing maps. This model is able to achieve spatial
order from disorder by forming a topographic map of the external field,
identified with advertising from the media. Here individuals are represented in two
spaces: a static geographical location, and a dynamic political position. The
modification of the latter leads to a pattern in which
both spaces are correlated.
Keywords: Self-organizing maps; electoral preferences; social sciences
and computational models.
Introduction
Self-organizing maps (SOM) have been widely applied in several fields, covering
visualization [1], time series processing [2], and many others. Here, we propose
its use not as a data analysis tool, but as a model of social processes. SOM is, in
the end, an algorithm, and in that sense is not different from other models for
social sciences, which are also algorithms. The field of statistical physics has been
very productive in proposing models for a wide variety of social phenomena [3].
In that sense, we propose the use of SOM as a model of a specific phenomenon,
that of electoral preferences.
Electoral preferences of individuals are dynamic. They may be influenced by
the opinion of other voters, the perception they have of candidates, as well as
by factors like propaganda from the media, and several other issues. Although
citizens tend to vote for those candidates or political parties that reflect more
sharply their own ideas, these perceptions may be modified. Electoral preferences
have been extensively studied from different angles [3,4,5]. Here, we present a
model of a special case of electoral preferences based on SOM.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 51–60, 2011.
© Springer-Verlag Berlin Heidelberg 2011
In some voting districts, there is a correlation between geographical space and
the perceived political position of the candidate that voters chose. In some cases,
whole adjacent regions of cities or countries tend to vote for the same political
party or candidate while other, possibly distant regions, tend to vote for other
candidates, with a perceived different political position (see fig. 1-a).
One issue that seems to be a fundamental factor in the dynamics of electoral
preferences is the impact of the media [6]. Specifically, we refer to the propaganda
from the parties and candidates aiming to influence the decision of voters. One
way to model this influence is as follows. A party or candidate is described by
a point in a high dimensional space, the political position space. This space is
defined by several political issues, such as public health, education, foreign affairs,
labor issues, environmental policies, etc. Each party is defined by a vector with
the relevant political issues. Each voter has an opinion over the same issues,
defined by a vector that summarizes his/her political position. Opinion vectors
from voters are susceptible to modification.
Fig. 1. a) Voting in Mexico. Gray level codes voting percentages for two parties, represented by white and black. North Mexico presents darker levels than those in south
Mexico. b) SOM formation for four input vectors and three dimensions. It is observed
that in the final map the most distant vectors are located at opposite locations.
When a political party presents an advertisement in the media, voters may react to
it. The main assumption here is that the voters whose political position is closest
to the position that describes the party will react and modify their
opinion in order to get even closer to the position of the party. At the same time,
these voters will act as active voters or promoters, affecting other voters within
their neighborhood, who in turn modify their opinion to get closer to that of
the political party. The area of influence of these active voters tends to decrease
with time as an effect of habituation to active voters [3]. At the end of the process, it
is observed that certain regions tend to vote for a certain candidate while distant
regions tend to vote for a politically very different candidate (see Fig.
1-a). The model explains the spatial patterns in which electoral preferences form
clusters, known as topographic maps. Although such patterns may be explained
by demographic factors, they may also be explained in the light of both the exposure
to an external field (media) and a self-organizing process. The described process
is similar to map formation in SOM.
Voting is a major feature of democratic regimes. For that reason, the scientific
community has studied its dynamics in depth. It has
been studied from several perspectives, including democratic debates and opinion
forming models [3], neighborhood influence from similar voters in the Sznajd
model, voting through opinion shift [4], and many others. Electoral preferences
are an example of opinion dynamics, widely studied in the social sciences
but also in the mathematically based sciences. Several ideas are common to
all models. First, the political position of voters is susceptible to influence.
Second, the aspects that voters take into account for voting are measurable.
Third, in some models, the dynamics are internal, that is, opinions are driven
only by the actual opinion of some voters. However, in other models, such as in
[7,8], the internal opinion dynamics are subject to external influences. The idea
behind the external influence is that it is possible to affect some of the voters in
order to shift their electoral preference toward some desired option.
SOM has been used as a tool to elucidate patterns in data. A less studied side
of SOM is its capability to model dynamical systems, that is, to
model certain processes. We intend to use the SOM as a model of a social
process, not as a data analysis tool. That is, we propose
the use of SOM as a dynamical system model to study a social phenomenon:
the spatial pattern formation in some electoral processes. These patterns are
associated with patterns observed in SOM.
We are interested in topographic map formation, a particular case of spatial patterns. A topographic map (TM) is a global structure in a low-dimensional
physical medium, which is an approximation of the distribution shown by the input stimuli from the multidimensional input space or structure of the external
field. In a TM, high-dimensional input vectors that are similar are mapped to
close regions in the map, while other, distant vectors are mapped to farther areas. A TM is a map in which the topology of the input vectors is preserved in the
lattice [9]. In a TM, there is a correlation between geographical space and an
abstract space, which in this contribution corresponds to the political position
space. The voting distribution over a city or country may be an approximation of
the distribution of perceived political positions of candidates or parties.
The Model
It is important for social science scholars to discover why some specific patterns in electoral preferences appear. In particular, the mechanisms
and dynamics that allow TM-related maps to appear over voting districts have attracted attention. In this contribution, we study
the relevance of external stimuli (media) and the influence voters receive from
their peers for those patterns to appear. We study those influences with
the self-organizing map as a model of electoral preferences.
$$w_n(t+1) = w_n(t) + \alpha(t)\, h_n(g,t)\, \bigl(x_i - w_n(t)\bigr) \qquad (1)$$
where $\alpha(t)$ is the learning rate at epoch t, $h_n(g, t)$ is the neighborhood function
from BMU g to unit n at epoch t, and $x_i$ is the input vector. The neighborhood
decreases monotonically as a function of distance and time [10,11]. The neighborhood
is equivalent to a dynamic coupling parameter. In this work, we applied the so-called bubble neighborhood, in which units farther than a given distance do
not update their weight vector. The SOM preserves relationships in the input
data by starting with a large neighborhood and reducing it during the course of
training [1].
The SOM algorithm is divided into three stages. 1. Competition: The best
matching unit (BMU) g is the one whose weight vector is the closest to the
input vector x: $\mathrm{BMU} = \arg\min_g \lVert x - w_g \rVert$. 2. Cooperation: The adaptation is
diffused from the BMU g to the rest of the units in the lattice through the
learning equation (1). 3. Annealing: The learning parameter and neighborhood
are updated.
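The three stages can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function name train_som, the linear decay of the learning rate and of the bubble radius, and the use of the Chebyshev lattice distance to delimit the bubble are all assumptions made for the example.

```python
import random

def train_som(inputs, rows, cols, epochs, radius0, alpha0, seed=0):
    """Train a SOM with a bubble neighborhood (illustrative sketch)."""
    rng = random.Random(seed)
    dim = len(inputs[0])
    # Random initial weight vectors in [0, 1], one per lattice unit.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(epochs):
        # Annealing: linear decay of both parameters (assumed schedule).
        frac = 1.0 - t / epochs
        alpha = alpha0 * frac
        radius = radius0 * frac
        for x in inputs:
            # Competition: the BMU g minimizes the Euclidean distance to x.
            g = min(weights, key=lambda n: sum((a - b) ** 2
                                               for a, b in zip(x, weights[n])))
            # Cooperation: every unit inside the bubble moves toward x.
            for n, w in weights.items():
                if max(abs(n[0] - g[0]), abs(n[1] - g[1])) <= radius:
                    weights[n] = [wi + alpha * (xi - wi)
                                  for wi, xi in zip(w, x)]
    return weights
```

Because the bubble neighborhood is binary, every unit inside the radius receives the same update, which matches the description of the bubble neighborhood given above.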
The map formed by the algorithm is a topographic one. The output map
is an approximation of the distribution of vectors in the input space. We are not
interested in mapping multidimensional signals to a low-dimensional space and
studying the topographic relations, as is done in several other applications, for
example the study of parliamentary elections in [13]. We are mainly
interested in the spatial pattern of the units when exposed to the input signals
or stimuli.
In this model, each unit corresponds to a voter or group of voters within a
static geographic area. Units are susceptible to being affected and also influence
other units. Although real individuals move around the city, discussions
about political issues take place mainly with their neighbors, which makes the model
equivalent to a static location. The weight vector associated with each unit is the
position of voters with respect to the relevant political issues, that is, the weight
vector defines the position of voters in the political position space. The variables
that define this space are continuous and voters may occupy any region of this
multidimensional space.
When a BMU affects neighbors, all issues are equally modified according to
eq. 1, that is, the position of the affected units is shifted in all dimensions. Also, as
the neighborhood function is discrete (bubble), all units within its neighborhood
are equally affected. The competition stage is interpreted as the assignment of
resources from parties or candidates to those possible voters who may act
as promoters. The cooperation stage summarizes the electoral campaign as the
influence of promoters or active voters in order to modify the political position
of their neighbors. The annealing stage reflects the habituation or refractoriness
of voters to the influence of promoters.
When an input stimulus (advertising) is presented, it directly affects only a single unit.
The affected unit corresponds to the BMU in the SOM, and this unit will affect
its neighbors in order to attract them in the feature space. The feature space
corresponds to the political position of both individuals and parties. So the
advertising affects a whole area through the most influenced individual (BMU),
regardless of the previous political position of the affected units.
The process of self-organization is iterative and, as the neighborhood that BMUs
affect tends to zero, the weight vectors reach a steady state. This convergence
may represent a TM if some conditions are satisfied (see Fig. 2). First, the
initial neighborhood area should be sufficiently large. Second, the neighborhood
function should decrease in both time and space [11,12]. Third, enough epochs
should occur [1].
Each unit i has its own weight vector $w_i$, which defines the opinion of a group
of neighboring voters on the relevant issues considered in voting. The position of
each party k is defined by the vector $p_k$. Citizens will vote for the candidate or
party to which they are closest in the space of the considered issues. That is,
unit i is said to vote for the party j such that:
$$j = \arg\min_k \lvert w_i - p_k \rvert \qquad (2)$$
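The voting rule of eq. 2 amounts to a nearest-neighbor search over the party positions. The following sketch illustrates it; the function name vote and the example positions are hypothetical, and the Euclidean distance is assumed as the norm.

```python
import math

def vote(w_i, parties):
    """Return the index j of the party whose position is closest to w_i."""
    return min(range(len(parties)),
               key=lambda k: math.dist(w_i, parties[k]))

# Two hypothetical parties at opposite corners of a two-issue space.
parties = [[0.0, 0.0], [1.0, 1.0]]
print(vote([0.2, 0.1], parties))  # prints 0
print(vote([0.9, 0.8], parties))  # prints 1
```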
The position a unit has on each item is continuous and defined in the range
[0, 1]. The items that voters may consider relevant are the positions of the
political parties with regard to a vector of political issues. At the same time,
parties are defined in the political space by their own position on those aspects.
Parties that have a similar position will be represented as close points in the
feature space, while parties with opposite positions will be defined by distant
points. Political parties try to attract as many voters as possible by modifying
voters' positions towards their own position. Parties are coded as input vectors:
the political position of each one is defined as a vector. A party from the rightmost wing could be coded as [0, ..., 0] and a party from the left-most wing as
[1, ..., 1]. In general, parties do not change their position.
In the seminal work of Schelling [14], agents move towards an available location
in which the perceived comfort is better than that in the present location. Agents
do not change their opinions, but their location, which leads to a segregation
pattern through displacements in geographical space. In the cultural model of
Axelrod [15], agents change their opinions (culture) by means of interaction
dictated by homophily and in a stochastic fashion. In that model, cultures or
regions of similar individuals are formed, and segregation is observed.
Culture is defined by discrete variables, and there is a comparison between
an individual's opinion and that of their neighbors. In the model we present,
no comparison between an individual and her neighbors is needed in order for them to
interact.
Table 1 shows the interpretation of SOM parameters and attributes in the
context of electoral preference models. The area BMUs affect decreases as
a function of space and time. Voters have limited presence and resources, which
prevent them from promoting candidates in regions distant from their own geographical position for a long time. This corresponds to the neighborhood of a
BMU. Also, voters decrease their presence as time goes by, which is equivalent
to the decreasing neighborhood as a function of time: voters become refractory
to active voters as a function of time. The neighborhood function summarizes the
mobility of voting promoters, as they may visit other areas, but they will
stop visiting distant ones as time elapses.
Table 1. Control parameters and variables in the electoral preferences model and their
equivalence in SOM
Parameter  Equivalence in SOM                            Description
w          Weight vector                                 Political position of voters
P          Number of input vectors                       Number of political parties or candidates
H          Initial neighborhood area                     Area of influence of voters
E          Number of epochs                              Duration of political campaigns
d          Input space dimension                         Number of political issues considered
-          Avg. distance between input vectors           Average difference of opinion among parties
-          Minimum distance between input vectors        Minimum of political differences
-          Maximum distance between input vectors        Maximum of political differences
-          Avg. initial distance between neighbor units  Avg. initial differences among neighbors
P_i        Input vector i                                Political position of party i
f_i        No. of copies of each input vector i          Number of advertisements of party i per day
-          Learning parameter                            Permeability of voters
V_i        Initial number of units closer to P_i         No. of supporters of P_i at the beginning
B          Number of BMUs for each input vector          No. of active voters for each party
copies are exactly the same: each of the $f_i$ issues is modified by a small random
value. The interpretation is that each advertisement emphasizes some issues while
others are put aside, while some voters may get confused about
the message or misinterpret some aspects. The set of all $f_i$ defines the input or
stimulus space. SOM will form a low-dimensional topographic map of that space.
Although there are several observable parameters, we are interested in only
one of them: the quality of the topographic map, i.e., how well the maps are
formed. The topographic error (TE) is a measure of the quality of the map and
is defined as the fraction of input vectors whose BMU and second-BMU
are not contiguous [9].
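As an illustration, the topographic error can be computed as follows. This is a minimal sketch under stated assumptions: the function name topographic_error is hypothetical, the map is stored as a dictionary from lattice coordinates to weight vectors, and two units are taken to be contiguous when they are direct neighbors in the 8-neighborhood of a rectangular lattice.

```python
import math

def topographic_error(inputs, weights):
    """Fraction of inputs whose BMU and second-BMU are not lattice neighbors.

    weights maps lattice coordinates (row, col) to weight vectors.
    """
    errors = 0
    for x in inputs:
        # Rank all units by distance to x; the first two are BMU and second-BMU.
        ranked = sorted(weights, key=lambda n: math.dist(x, weights[n]))
        (r1, c1), (r2, c2) = ranked[0], ranked[1]
        # Assumed adjacency: 8-neighborhood on a rectangular lattice.
        if max(abs(r1 - r2), abs(c1 - c2)) > 1:
            errors += 1
    return errors / len(inputs)
```

A value of 0 means that, for every input vector, the two closest units are adjacent in the lattice; values near 1 indicate a poorly ordered map.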
In the SOM, only one unit is selected as BMU. However, there are a number of
variations in which the BMU is not unique [16], and all BMUs act simultaneously
to modify their neighbors. Here, we included this variation to give the model
more plausibility, as a given party may have more than a single active promoter.
The order parameter TE should be characterized by the control parameters. It
has been stated that good maps (low TE) are achieved if the initial neighborhood
is large enough and decreases with time and space; otherwise, when the initial configuration is
random, local order may be achieved but global order does not emerge. There are no general analytical results about the conditions to achieve good
maps [17,1], even though some results are known for very specific cases [17,12].
Thus, we ran a set of experiments in order to characterize topographic map
formation as a function of the control parameters.
In the experiments, initial weight vectors are random, which corresponds to
a situation in which supporters are randomly distributed in the city. A number
P of parties try to attract as many voters as possible in order to win the
elections. Voters are exposed to advertising from the media for a period of E
epochs and each party i presents $f_i$ advertisements per day to the voters.
Results
Fig. 2. TE as a function of some control parameters in Table 1 for lattice size 20 × 20.
Mutual information between the parameter and TE is shown. Ranges were discretized
into 100 intervals. The average I(X; Y) for the lattice sizes considered is shown, as well
as the maximum and minimum I(X; Y). X corresponds to TE whereas Y corresponds
to the studied parameter.
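The mutual information I(X; Y) reported in the caption can be estimated from paired samples by discretizing both ranges into equal-width intervals. The sketch below shows one simple histogram estimator; it is an assumption about how such a value could be computed, not the procedure used by the authors, and the function name mutual_information is hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys, bins=100):
    """Histogram estimate of I(X;Y) in bits for paired samples xs, ys."""
    def discretize(vals):
        lo, hi = min(vals), max(vals)
        # Guard against a zero-width range (all values identical).
        width = (hi - lo) / bins or 1.0
        return [min(int((v - lo) / width), bins - 1) for v in vals]

    bx, by = discretize(xs), discretize(ys)
    n = len(xs)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    # Sum of p(i,j) * log2( p(i,j) / (p(i) p(j)) ) over occupied cells.
    return sum((c / n) * math.log2(c * n / (px[i] * py[j]))
               for (i, j), c in pxy.items())
```

With few samples and many bins, such an estimator is biased upwards, so the number of intervals should be chosen with the sample size in mind.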
From Fig. 2-a), it is observed that, as established by analytical results and numerical explorations [17,9], TE decreases as the duration of the process
increases. Also (Fig. 2-b), as established by theory [1], the tendency is that the
larger the initial neighborhood area, the lower the TE. TE is also a decreasing
function of the number of political issues considered by voters (Fig. 2-c). This
is explained by the fact that the more dimensions define the input space, the
easier it is for the units' weight vectors to unfold and approximate the high-dimensional input vectors [1]. These three parameters have been widely studied,
and our simulations for these three parameters agree with the established theory
and numerical findings.
Besides the results that are predicted by the theory, such as the decreasing
TE with time, neighborhood and dimension of input space, other results not
previously identified were obtained. We proceed to explain these findings.
In its original version, SOM works with one BMU per input vector. In our
model, B ≥ 1 simultaneous BMUs are present for each input vector. A novel
finding is that TE decreases as the number of BMUs increases, with the exception
of B = 2. The larger the number of promoters per input vector, the more likely
a TM is to form (Fig. 2-g).
As a function of the initial average political differences among neighbor voters,
TE presents an interesting curve. This quantity is defined as $\frac{1}{N^2}\sum_{i \neq j} d(w_i, w_j)$, where
d(a, b) is the Euclidean distance between weight vectors a and b. When these
Conclusions
We propose here the use of the self-organizing map (SOM) as a model to study voting
processes under the constraints of permeability of voters, influence from parties
through an external field, and a decreasing influence from vote promoters. We
related parameters and variables in SOM to electoral issues and processes. We
have given evidence that the same mechanism that leads to self-organization in
SOM may help to explain the patterns in electoral processes.
In previous models, electoral results over voting districts are not viewed as
topographic maps but, at most, as segregation states. Here, we have proposed
that electoral results may resemble topographic maps if some constraints are
observed. As in every model, the explanatory power is limited by the assumptions supporting it. Thus, we present our model as a possible explanation of
the observed spatial patterns in some electoral districts under the circumstances
detailed here.
In the model, the final result of electoral processes is dictated by the influence
between voters and the external influence of the media. We propose that the
election results are guided by the political position of the contenders and the
number of their advertisements, subject to the initial distribution of political preferences
of voters. The media exerts a non-linear influence on the spatial pattern formation
of voting. For a topographic map to appear, the duration of political campaigns
should be long enough. The higher the number of aspects that are considered
by voters, the more likely a topographic map is to emerge. If at the beginning
voters have more or less the same opinion, or the population is radicalized, then it
is unlikely that topographic maps will emerge.
References
1. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Heidelberg (2000)
2. Barreto, G., Araujo, A.: Identification and control of dynamical systems using the self-organizing map. IEEE Transactions on Neural Networks 15(5), 1244-1259 (2004)
3. Galam, S.: The dynamics of minority opinions in democratic debates. Physica A 336, 46-62 (2004), doi:10.1016/j.physa.2004.01.010
4. Pabjan, B., Pekalski, A.: Model opinion forming and voting. Physica A 387, 6183-6189 (2008), doi:10.1016/j.physa.2008.07.003
5. Costa, R., Almeida, M., Andrade, S., Moreira, M.: Scaling behavior in a proportional voting process. Phys. Rev. E 60, 1067-1068 (1999)
6. Tuncay, C.: Opinion Dynamics Driven by Leaders, Media, Viruses and Worms. Int. J. of Modern Physics C 18(5), 849-859 (2007)
7. González-Avella, J., Cosenza, M., Tucci, K.: Nonequilibrium transition induced by mass media in a model for social influence. Phys. Rev. E 72 (2005)
8. Mazzitello, K., Candia, J., Dossetti, V.: Effects of Mass Media and Cultural Drift in a Model for Social Influence. Int. J. Mod. Phys. C 18, 1475 (2007)
9. Villmann, T., Der, R., Herrmann, M., Martinetz, T.: Topology preservation in self-organizing feature maps. IEEE Tr. on NN. 8(2), 256-266 (1997)
10. Flanagan, J.: Sufficient conditions for self-organization in the SOM with a decreasing neighborhood function of any width. C. of Art. NN. Conf. pub. 470 (1999)
11. Erwin, E., Obermayer, K., Schulten, K.: Self-organizing maps: Ordering, convergence properties and energy functions. Biol. Cyb. 67, 47-55 (1992)
12. Erwin, E., Obermayer, K., Schulten, K.: Self-organizing maps: Stationary states, metastability and convergence rate. Biol. Cyb. 67, 35-45 (1992b)
13. Niemelä, P., Honkela, T.: Analysis of parliamentary election results and socioeconomic situation using self-organizing map. In: Príncipe, J.C., Miikkulainen, R. (eds.) WSOM 2009. LNCS, vol. 5629, pp. 209-218. Springer, Heidelberg (2009)
14. Schelling, T.: Micromotives and Macrobehavior. W. W. Norton (1978)
15. Axelrod, R.: The dissemination of culture. J. of Con. Res. 41, 203-226 (1997)
16. Schulz, R., Reggia, J.: Temporally Asymmetric Learning Supports Sequence Processing in Multi-Winner Self-Organizing Maps. Neural Comp. 16(3), 535-561 (2004)
17. Flanagan, J.: Self-organization in the one-dimensional SOM with a decreasing neighborhood. Neural Networks 14(10), 1405-1417 (2001)
18. Celluci, C., Albano, A., Rap, P.: Statistical validation of mutual information calculations. Phys. Rev. E 71, 66208 (2005)
Grupo de Investigación SUPPRESS, Universidad de León, León, Spain
saloc@unileon.es, manuel.dominguez@unileon.es
Department of Information and Computer Science, Aalto University School of
Science, Espoo, Finland
mika.sulkava@tkk.fi, miguel.prada@tkk.fi, Jaakko.Hollmen@hut.fi
Introduction
Many variants of SOM have appeared in the literature [1]. The aims of these approaches comprise improvements in clustering, visualization, accuracy of the
model, computation time, etc. For instance, it is possible to define different neighborhood functions, change the winner searching process, and introduce some a
priori information about classes or states. An overview of the main ideas which
can be used to modify the standard SOM is presented in [2].
These variants have brought great advantages for data analysis but, so far,
none of them has focused on data analysis conditioned on the environment. It is well known that environmental conditions strongly influence most
real processes and systems. Furthermore, it is generally desirable to compare
data from different processes whose environmental conditions are the same. For
these reasons, a new algorithm, the envSOM, is proposed in this paper. It still
captures the behavior of the processes, but takes into account the model of the
environment.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 61-70, 2011.
© Springer-Verlag Berlin Heidelberg 2011
S. Alonso et al.
This paper is structured as follows: In Section 2, several approaches related
to the envSOM are reviewed briefly. In Section 3, the envSOM algorithm and its
two phases are explained in detail. Two examples used to test the algorithm are
described in Section 4. Also, the results obtained using the envSOM are shown
there. Finally, the conclusions are drawn in Section 5.
Similar Approaches
In the Layering SOM, a SOM is trained for each individual layer to achieve
better results in the field of exploratory analysis. In this sense, a growing hierarchical SOM has been presented in [8]. That work explains a dynamic model
which adapts its architecture in the training process and uses more units where
more input data are projected. The major benets of this approach are the reduction of training time due to the concept of layers, the possibility to discover
a hierarchical structure of the data, the improvement of cluster visualization by
displaying small maps at each layer and the preservation of topological similarities between neighbors. This algorithm allows us to visualize data in detail, but
the models obtained are not conditioned on the environment.
The self-organizing map has also been used for time series processing in the
form of the Temporal SOM. In order to exploit the temporal information, SOM
needs to be enabled with a short-term memory, which can be implemented, e.g.,
through external tapped delay lines or different types of recurrence. Several of
these extensions are reviewed in [9,10]. In our approach, no short-term memory is
explicitly implemented, but it is usually advisable to introduce time information
in the model to analyze the temporal evolution, together with the environment.
The purpose of this work is to develop an algorithm suitable for extracting and
analyzing information from large data sets, but considering the environmental
information such as weather variables, atmospheric deposition, etc. The envSOM
approach consists of two consecutive phases based on the traditional SOM [2].
Some slight variations have been introduced in each phase. The winner searching
process in the first phase and the update process in the second one have been
modied appropriately in order to achieve the desired result. In our experiments
the learning rate decreases in time and the neighborhood function is implemented
as Gaussian in both phases. However, other functions could be used as well.
The proposed envSOM algorithm has the advantageous features of the traditional SOM. Likewise, it produces spatially ordered and topology-preserving maps.
It also provides a good approximation to the input space, similar to vector quantization, and divides the space into a finite collection of Voronoi regions. The main
innovation of this algorithm is that it reflects the probability density function
of the data set, given the environmental conditions. Therefore, it can be useful from
the point of view of environmental pattern recognition and data comparison,
conditioned on these patterns. On the other hand, it should be noted that it is
computationally more expensive than the traditional SOM, since two
learning phases are needed. Furthermore, it requires knowledge of the environmental variables which influence the behavior of the process, characterized by
the remaining variables. The envSOM approach will be explained in detail below.
3.1
In the first phase of the envSOM algorithm, a traditional SOM is trained using all
variables. The initialization can be either linear along the greatest eigenvectors or
$$c = \arg\min_{i} \lVert x(t) - m_i(t) \rVert_{\omega}, \quad i = 1, \ldots, N \qquad (1)$$
where x represents the current input and m denotes the codebook vectors. N and
t are, respectively, the number of map units and the time. The difference
is that a binary mask $\omega$ is always used to indicate which variables are used for
computing the winner. As usual, if the Euclidean norm, $\lVert \cdot \rVert$, is chosen, the winner
will be computed using equation 2, where $\omega$ is the binary mask and k is a
component or variable.
$$\lVert x(t) - m_i(t) \rVert^2 = \lVert x(t) - m_i(t) \rVert_{\omega}^2 = \sum_{k} \omega_k \, [x_k(t) - m_{ik}(t)]^2 \qquad (2)$$
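The masked winner search of the first phase can be illustrated with a short Python sketch. The function name masked_winner and the list-based codebook are assumptions made for the example; the experiments in this paper were carried out with a modified Matlab SOM Toolbox, not with this code.

```python
def masked_winner(x, codebook, mask):
    """Index of the codebook vector closest to x under a binary mask.

    Components with mask value 0 are ignored in the distance.
    """
    def masked_sq_dist(m):
        return sum(w * (xk - mk) ** 2 for w, xk, mk in zip(mask, x, m))

    return min(range(len(codebook)),
               key=lambda i: masked_sq_dist(codebook[i]))

# Only the first two (environmental) components decide the winner.
codebook = [[0, 0, 0, 0], [1, 1, 1, 1]]
print(masked_winner([0, 0, 1, 1], codebook, mask=[1, 1, 0, 0]))  # prints 0
```

Setting every mask entry to 1 recovers the ordinary winner search of the traditional SOM.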
In the second phase of the envSOM algorithm, a new traditional SOM is trained
using all variables. It is initialized using the codebooks from the first-phase
SOM. Thanks to this initialization, the algorithm converges quickly and the accurate model of the environment obtained in the first phase is
reused in the second phase. It should be noted that the environmental components
have already been organized in the first phase of the envSOM. Therefore, the values
from the codebooks of the first SOM are a good starting point for the second phase.
In this case, every component takes part equally in the winner computation
and no mask is applied. Unlike in the first phase, the update process is now
slightly modified. As the environmental variables are already well organized, it is only
required that the remaining variables are updated properly. For this reason, a
new mask is introduced in the update rule and equation 3 is used in this
case. The mask, $\Omega$, is a k-dimensional vector which takes binary values $\Omega_k$, i.e.,
0 if the component corresponds to an environmental variable and 1 otherwise; k is the number
of components or variables.
$$m_i(t + 1) = m_i(t) + \alpha(t)\, h_{ci}(t)\, \Omega\, [x(t) - m_i(t)] \qquad (3)$$
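The effect of the update mask can be seen in a small Python sketch of a single update step. The function name masked_update is hypothetical, and the mask is assumed to act component-wise on the difference between the input and the codebook vector.

```python
def masked_update(m_i, x, alpha, h_ci, mask):
    """One update of codebook vector m_i toward input x.

    Components whose mask value is 0 (environmental variables) stay fixed.
    """
    return [mk + alpha * h_ci * w * (xk - mk)
            for mk, xk, w in zip(m_i, x, mask)]

# The first two components are environmental and therefore unchanged.
print(masked_update([0.0, 0.0, 0.0, 0.0], [1, 1, 1, 1],
                    alpha=0.5, h_ci=1.0, mask=[0, 0, 1, 1]))
# prints [0.0, 0.0, 0.5, 0.5]
```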
At the end of this phase, all variables will be organized properly. The learning
rate, $\alpha(t)$, and the neighborhood function, $h_{ci}(t)$, are not modified, so
a value decreasing in time and a Gaussian function can be used, respectively,
as in the traditional SOM. The purpose of this phase is to reach a good model
of the whole data set, given the environmental information.
Two kinds of experiments have been planned in order to test the envSOM algorithm. First, an artificial data set based on binary patterns is created. It allows
us to check the clustering property of the algorithm. Then, a simulated data set
characterizing climate and carbon flux in several ecosystems is studied. It allows
us to check the usefulness of the algorithm with more realistic data and compare
the behavior of carbon in different ecosystems, given environmental conditions.
Matlab software has been used to run the experiments and the SOM Toolbox
[11] has been modified to implement the necessary changes, such as a new mask
in the update process.
4.1 A Toy Example
An artificial data set with structured data has been used to test the envSOM
algorithm. The data set consists of 16000 samples and 4 variables (X1, X2,
X3, X4). It contains all binary patterns from (0, 0, 0, 0) to (1, 1, 1, 1), i.e.,
the numbers from 0 to 15 in the binary system. A low level of noise (10%) has been
added to the variables. Each binary pattern is equally represented by a set of 1000
samples. There are 16 different patterns, so the envSOM algorithm should find
16 clusters in this data set. The choice of this data set is justified by the simple
structure of the data, which facilitates the visualization and understanding of
the results from the algorithm.
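A data set of this kind can be generated with a few lines of Python. The sketch below is only an illustration: the function name make_binary_patterns is hypothetical, and the noise model (a uniform perturbation of at most 0.1 added to each component) is an assumption, since the text specifies only the noise level of 10%.

```python
import itertools
import random

def make_binary_patterns(samples_per_pattern=1000, noise=0.1, seed=0):
    """Return noisy copies of all 16 four-bit binary patterns."""
    rng = random.Random(seed)
    data = []
    for pattern in itertools.product((0, 1), repeat=4):
        for _ in range(samples_per_pattern):
            # Assumed noise model: uniform perturbation in [-noise, +noise].
            data.append([b + rng.uniform(-noise, noise) for b in pattern])
    return data

data = make_binary_patterns()
print(len(data), len(data[0]))  # prints 16000 4
```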
First, a traditional SOM was trained using this input data set. The number of
epochs in the training should be high enough to guarantee a complete
organization. A number over 500 epochs was chosen. The dimensions of the SOM
were 16 × 20 (320 units). A Gaussian function was selected as the neighborhood
function and a value decreasing exponentially in time as the learning rate. The
SOM should be able to divide the data into 16 clusters and allow us to visualize
them, for instance, by means of the U-matrix representation. The results from
the traditional SOM can be seen in Figure 1. After the training, the U-matrix
yields a clear visualization of the binary patterns. Note that each component has
been organized in a random way, as shown by the component planes. If a
new traditional SOM is trained using another data set Y, also based on binary
patterns, i.e., (Y1, Y2, Y3, Y4), the organization of the four components will
probably be completely different. Thus, it will be very difficult to make a good
comparison between the results from both data sets, X and Y.
When there are environmental conditions in the data set, it can be desirable
that these components define the organization of the map. In this case, it is
supposed that X1 and X2 are the environmental variables and X3 and X4 are
features of the data set to be analyzed and compared. The envSOM algorithm
consists of two consecutive SOMs as mentioned above. The parameters of both
SOMs are the same as in the traditional SOM (500 epochs, 320 units, Gaussian
neighborhood function and learning rate decreasing exponentially).
Fig. 1. Component planes and U-matrix of traditional SOM for binary patterns. Black
color corresponds to values of 0 and white color to 1.
Fig. 2. Component planes and U-matrix of envSOM algorithm after the first phase of
learning
Fig. 3. Component planes and U-matrix of envSOM algorithm after the second phase
of learning
In the first phase, only the variables X1 and X2 are used to compute the winner
neurons and all variables are updated. The results of the first phase can be seen
in Figure 2. As expected, the organization is only performed on variables X1 and
X2 since X3 and X4 do not take part in the winner computation. Therefore, the
U-matrix only represents four patterns corresponding to possible combinations
of variables X1 and X2.
In the second phase, all four variables are used in the winner computation,
but X1 and X2 are kept fixed whereas X3 and X4 are updated. At the end
of this phase, the data set is organized as depicted in Figure 3. In this case,
the 16 patterns can be clearly distinguished in the U-matrix in a similar way
to the traditional SOM. Moreover, the organization of the map conditioned on
X1 and X2, i.e., the environmental variables, is achieved. It can be said that
the envSOM algorithm represents the probability function of data, given the
O-CN Example
A more realistic test of the performance of the envSOM approach
was carried out by analyzing data containing environmental characteristics and
simulated gross primary production (GPP, the amount of carbon sequestered in
photosynthesis) of different ecosystems in Europe. The SOM has previously been
used for the analysis of carbon exchange of ecosystems in, e.g., [12,13]. The GPP
estimates used in this study have been generated by the O-CN model [14,15].
The model is developed from the land surface scheme ORCHIDEE [16], and
has been extended through the representation of key nitrogen cycle processes. O-CN simulates the terrestrial energy, water, carbon, and nitrogen budgets for
discrete tiles (i.e., fractions of the grid cell) occupied by up to 12 plant functional
types (PFTs) from diurnal to decadal timescales. The model can be run on any
regular grid, and is applied here at a spatial resolution of 0.5° × 0.5°. Values of the
model input variables (air temperature, precipitation, shortwave downward flux,
longwave downward flux, specific humidity, and N deposition) and simulated GPP
from 1996 to 2005 were used in this example. These values were analyzed for four
PFTs: temperate needle-leaved evergreen forests (TeNE), temperate broadleaved
seasonal forests (TeBS), temperate grasslands (TeH), and temperate croplands
(TeH crop).
The envSOM algorithm was compared with the traditional SOM in this example. First, four traditional SOMs were trained for the four PFTs using environmental data and GPP estimates. As an example, the component planes and U-matrices
of SOMs of two PFTs, temperate broadleaved seasonal forests and temperate
grasslands, are shown in Figures 4 and 5. The organization of the two maps
characterizing the two PFTs is very different, so comparing them is very
laborious. If one tries to compare the magnitudes of GPP in
S. Alonso et al.
Fig. 4. Component planes and U-matrix of traditional SOM for temperate broadleaved
seasonal forests
Fig. 5. Component planes and U-matrix of traditional SOM for temperate grasslands
Fig. 6. Component planes and U-matrix of envSOM algorithm for four PFTs
usually found in the same regions of the map and are thus connected with similar environmental conditions. This similarity between the PFTs was expected.
However, the absolute values of GPP are different. There are also some differences
visible between the PFTs. For example, the map units with the highest precipitation have
very low GPP values for all PFTs except the temperate croplands. In addition,
the area in the lower left part of the map associated with relatively high temperature, shortwave and longwave downward fluxes, low precipitation and high GPP
in temperate needle-leaved evergreen forests, temperate grasslands, and temperate croplands contains very low values of GPP in temperate broadleaved seasonal
forests. The reason for these differences may be the different spatial distribution of
the PFTs and the effect of some spatially correlated confounding factors on
GPP. A more detailed investigation of the reasons behind the differences might be a
topic of a future study.
Conclusions
References
1. Kangas, J., Kohonen, T., Laaksonen, J.: Variants of self-organizing maps. IEEE Transactions on Neural Networks 1, 93–99 (1990)
2. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1995)
3. Hagenbuchner, M., Tsoi, A.C.: A supervised training algorithm for self-organizing maps for structures. Pattern Recognition Letters 26, 1874–1884 (2005)
4. Melssen, W., Wehrens, R., Buydens, L.: Supervised Kohonen networks for classification problems. Chemometrics and Intelligent Laboratory Systems 83, 99–113 (2006)
5. Koikkalainen, P., Oja, E.: Self-organizing hierarchical feature maps. In: International Joint Conference on Neural Networks, vol. 2, pp. 279–284. IEEE, INNS (1990)
6. Koikkalainen, P.: Progress with the tree-structured self-organizing map. In: Cohn, A.G. (ed.) 11th European Conference on Artificial Intelligence, ECCAI (1994)
7. Laaksonen, J., Koskela, M., Laakso, S., Oja, E.: PicSOM - content-based image retrieval with self-organizing maps. Pattern Recognition Letters 21, 1199–1207 (2000)
8. Rauber, A., Merkl, D., Dittenbach, M.: The growing hierarchical self-organizing map: exploratory analysis of high-dimensional data. IEEE Transactions on Neural Networks 13(6), 1331–1341 (2002)
9. Hammer, B., Micheli, A., Sperduti, A., Strickert, M.: Recursive self-organizing network models. Neural Networks 17, 1061–1085 (2004)
10. Guimarães, G., Sousa-Lobo, V., Moura-Pires, F.: A taxonomy of self-organizing maps for temporal sequence processing. Intelligent Data Analysis (4), 269–290 (2003)
11. Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J.: SOM toolbox for Matlab 5 (2000)
12. Abramowitz, G., Leuning, R., Clark, M., Pitman, A.: Evaluating the performance of land surface models. Journal of Climate 21(21), 5468–5481 (2008)
13. Luyssaert, S., Janssens, I.A., Sulkava, M., Papale, D., Dolman, A.J., Reichstein, M., Hollmen, J., Martin, J.G., Suni, T., Vesala, T., Loustau, D., Law, B.E., Moors, E.J.: Photosynthesis drives anomalies in net carbon-exchange of pine forests at different latitudes. Global Change Biology 13(10), 2110–2127 (2007)
14. Zaehle, S., Friend, A.D.: Carbon and nitrogen cycle dynamics in the O-CN land surface model: 1. Model description, site-scale evaluation, and sensitivity to parameter estimates. Global Biogeochemical Cycles 24 (February 2010)
15. Zaehle, S., Friend, A.D., Friedlingstein, P., Dentener, F., Peylin, P., Schulz, M.: Carbon and nitrogen cycle dynamics in the O-CN land surface model: 2. Role of the nitrogen cycle in the historical terrestrial carbon balance. Global Biogeochemical Cycles 24 (February 2010)
16. Krinner, G., Viovy, N., de Noblet-Ducoudre, N., Ogee, J., Polcher, J., Friedlingstein, P., Ciais, P., Sitch, S., Prentice, I.C.: A dynamic global vegetation model for studies of the coupled atmosphere-biosphere system. Global Biogeochemical Cycles 19 (February 2005)
Abstract. A powerful method in knowledge discovery and cluster extraction is the use of self-organizing maps (SOMs), which provide adaptive quantization of the data together with its topologically ordered
lower-dimensional representation on a rigid lattice. The knowledge extraction from SOMs is often performed interactively from informative
visualizations. Even though interactive cluster extraction is successful,
it is often time consuming and usually not straightforward for inexperienced users. In order to cope with the need of fast and accurate analysis
of increasing amount of data, automated methods for SOM clustering
have been popular. In this study, we use spectral clustering, a graph
partitioning method based on eigenvector decomposition, for automated
clustering of the SOM. Experimental results based on seven real data
sets indicate that spectral clustering can successfully be used as an automated SOM segmentation tool, and it outperforms hierarchical clustering
methods with distance based similarity measures.
Introduction
The self-organizing maps (SOMs) provide topology preserving mapping of high-dimensional data manifolds onto a lower-dimensional rigid lattice. This enables
informative SOM visualization of the manifold, which can be used for interactive cluster extraction. Various SOM visualization schemes (see [1,2] and references therein) have been proposed, two of which stand out: U-matrix [3] (which
shows Euclidean distances between SOM neighbors on the grid) and CONNvis
[2] (which draws detailed local data distribution as a weighted Delaunay graph
on the grid). However, the interactive process for cluster extraction from SOM
visualization often requires practiced knowledge to evaluate visualized SOM information and hence is difficult for inexperienced users, and time consuming
even for the experienced users. Therefore, automated SOM segmentation methods, which are generally hierarchical agglomerative clustering (HAC) schemes
with different distance measures, have been proposed. Centroid linkage is considered in [4] with SOM lattice neighborhood, in [5] with a gap criterion; Ward's
measure is used in [6]; a recent similarity measure based on distance and density
(proposed in [7]) is used in [8]. These approaches accurately extract the clusters when
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 71–78, 2011.
© Springer-Verlag Berlin Heidelberg 2011
K. Tasdemir
they are well separated; however, they may be inefficient for extracting complex
cluster structures. Another approach [9] uses a recursive flooding of a Gaussian
surface (Clusot surface) constructed based on pairwise distances and receptive
field sizes of SOM prototypes. However, the resulting partitionings are similar
to that of k-means clustering. A recent approach [10] also uses HAC method but
with a similarity measure (CONN linkage) based on weighted Delaunay graph of
SOM units, which represents detailed local data distribution. It is shown in [10]
that CONN linkage is very successful, when the SOM units are dense enough.
A clustering method, becoming more popular due to its high performance
and easy implementation, is spectral clustering, which is a graph partitioning
approach based on eigenvector decomposition (see [11] and references therein).
Due to its advantageous properties, such as extraction of irregular-shaped clusters and obtaining globally optimal solutions [12,13], we propose to use spectral
clustering as an automated SOM segmentation method. A preliminary use of
spectral clustering for SOM segmentation was considered in [14], but specifically
for data categorization to obtain semantically meaningful categories. Here, our
experimental results on seven real data sets show that spectral clustering produces better partitionings compared to the ones obtained by other methods in
this study. Section 2 briefly explains the spectral clustering, Section 3 discusses
experimental results on the data sets and Section 4 concludes the paper.
Spectral Clustering
Spectral clustering methods [11,15,16] depend on relaxed optimization of graph-cut problems, using a graph Laplacian matrix, $L$. Let $G = (V, S)$ be a weighted,
undirected graph with nodes $V$ representing $n$ points in $X = \{x_1, x_2, \ldots, x_n\}$ to
be clustered and edges defined by the $n \times n$ similarity matrix $S$, where $s_{ij}$ is often
described using the (Euclidean) distance, $d(x_i, x_j)$, between $x_i$ and $x_j$, as
$$s_{ij} = \exp\left(-\frac{d^2(x_i, x_j)}{2\sigma^2}\right) \qquad (1)$$
(2)
can achieve an approximate solution to the normalized cut. Ng et al. [16] extend the solution to extract $k$ groups using the $k$ eigenvectors of $L_{norm[16]} = D^{-1/2} S D^{-1/2}$ with the $k$ highest eigenvalues, by the following algorithm:
1. Calculate the similarity matrix $S$ using (1), the diagonal degree matrix $D$, and
the normalized Laplacian $L_{norm[16]}$
2. Find the $k$ eigenvectors $\{e_1, e_2, \ldots, e_k\}$ associated with the $k$ highest eigenvalues $\{\lambda_1, \lambda_2, \ldots, \lambda_k\}$
eik
(3)
to automatically set from intrinsic data details, and to reflect local statistics.
Detailed information on spectral clustering can be found in [11,12].
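As an illustration of the algorithm of Ng et al. [16], the following Python sketch computes the similarity matrix of (1), forms $D^{-1/2} S D^{-1/2}$, takes the $k$ eigenvectors with the highest eigenvalues, normalizes the rows of the resulting matrix to unit length, and clusters the rows with k-means. It is a minimal sketch under stated assumptions rather than the code used in this study: a single global $\sigma$ is used, the function names are hypothetical, and the small k-means routine with farthest-point initialization is a simplification chosen only to keep the example deterministic and self-contained. In the setting of this paper, the rows of `X` would be the SOM prototypes.

```python
import numpy as np

def gaussian_similarity(X, sigma):
    """Similarity of Eq. (1) with zero diagonal, as in Ng et al. [16]."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    S = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)
    return S

def simple_kmeans(U, k, n_iter=100):
    """Lloyd's k-means with farthest-point initialization (illustrative)."""
    centers = [U[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[int(np.argmax(d2))])
    centers = np.array(centers)
    labels = np.zeros(len(U), dtype=int)
    for _ in range(n_iter):
        dist = np.sum((U[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(dist, axis=1)
        new_centers = np.array([
            U[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(X, k, sigma):
    S = gaussian_similarity(X, sigma)
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))      # diagonal of D^(-1/2)
    L = d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    E = eigvecs[:, -k:]                            # k highest eigenvalues
    U = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-length rows
    return simple_kmeans(U, k)
```

The choice of `sigma` strongly affects the result, which is why both an optimal and a local setting are compared in the experiments.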
Experimental Results
Figure 1 shows the synthetic data sets (Lsun, Wingnut, Chainlink) used in the
study. Despite the small number of clusters and the clear separation among them, clustering of these data sets is challenging due to their specific properties: Lsun has
two rectangular clusters close to each other and a spherical cluster; Wingnut
has varying within-cluster density distributions; whereas Chainlink has clusters
which are linearly inseparable. For all these data sets, the best partitionings
are achieved by CONN linkage [10], where similarities are defined by detailed
local density distribution, with average accuracies very close to 100% (Table 1).
Spectral clustering with optimal σ is the runner-up for the three data sets, and
produces significantly better partitionings than k-means and hierarchical clustering with distance based linkages, despite the challenges in these data sets. The
use of local σ, however, achieves accuracies similar to the accuracies of k-means.
Fig. 1. Three synthetic data sets in [18]. The cluster labels of the data samples are
shown by different symbols. (a) Lsun (b) Wingnut (c) Chainlink.
Table 1. Accuracies for SOM clustering of synthetic data sets in [18]. SC is spectral
clustering: SC1 is with optimal σ (σ is 0.3, 0.1, 0.2 for Lsun, Wingnut, and Chainlink
respectively), SC2 is with local σ; whereas average, centroid, Ward and CONN represent different (dis)similarity measures used in hierarchical agglomerative clustering.
The best accuracy for each data set is shown in boldface.

Data set    # of samples  # of clusters  SC1    SC2    k-means  average  centroid  Ward   CONN
Lsun        400           -              98.2   84.00  81.22    87.75    91.30     95.20  99.58
Wingnut     1016          -              98.57  95.76  95.94    96.72    98.25     95.08  99.91
Chainlink   1000          -              95.38  67.61  66.73    75.78    78.10     78.48  97.82

3.2
We used six data sets from UCI machine learning repository [19]. Three of
them (Iris, Wine, Breast Cancer Wisconsin) are relatively small. Iris has 150
4-dimensional samples equally distributed into 3 groups; Wine data set has 178
13-dimensional samples (59, 71, 48 samples in 3 classes respectively); and Breast
Cancer Wisconsin has 699 9-dimensional samples grouped into two classes (benign or malignant). The other three data sets (Image Segmentation, Statlog,
Table 2. Accuracies for SOM clustering of UCI data sets with small sizes [19]. SC2
is spectral clustering with local σ. The best accuracy obtained for each SOM size is
shown in boldface and the best accuracy for each data set is underlined.

                                                        SOM clustering method
Data set         # of samples  # of clusters  SOM size  SC2    k-means  average  centroid  Ward   CONN
Iris             150           3              5x5       57.11  85.92    78.40
                                              10x10     88.92  84.57    83.27
Wine             178           3              5x5       61.80  67.43    66.52
                                              10x10     69.54  68.62    61.01
Breast Cancer-W  699           2              5x5       95.50  95.06    94.90    94.90     94.90  96.14
                                              10x10     96.16  95.90    95.82    95.74     95.68  84.39
Table 3. Accuracies for SOM clustering of UCI data sets with medium sizes [19]. SC2
is spectral clustering with local σ. The best accuracy obtained for each SOM size is
shown in boldface and the best accuracy for each data set is underlined.
# of
# of
SOM clustering method
Data set samples clusters SOM size SC2 k-means average centroid Ward CONN
Segmentation
Statlog
2391
6435
10x10
20x20
30x30
40x40
51.94
55.96
54.39
54.06
51.06
52.40
51.99
52.30
45.63
36.71
29.67
28.97
42.18
29.21
29.03
28.96
53.63
53.08
53.13
52.01
61.26
34.26
41.73
21.47
10x10
20x20
30x30
40x40
58.54
73.34
73.94
73.91
62.44
63.52
63.75
63.84
61.49
53.89
52.90
53.72
10
10x10
20x20
30x30
40x40
50.36
66.61
66.76
68.01
64.03
65.25
66.36
66.86
62.96
64.33
62.84
64.70
62.44 67.66
59.81 64.82
59.91 67.73
56.53 68.67
73.70
74.40
70.55
20.08
Pen digits) are of medium sizes. Segmentation has 2310 samples with 19 features
grouped into 7 classes, Statlog has 6435 samples with 4 features divided into 6
classes, and Pen Digits has 10992 samples with 16 features in 10 classes. Further
details on the data sets can be found in the UCI machine learning repository.
Table 2 and Table 3 show the resulting accuracies for SOM clustering of these
data sets. Out of these six data sets, spectral clustering with local σ produces the
best partitioning for three of them (Iris, Breast Cancer Wisconsin and Statlog),
CONN linkage for two data sets (Segmentation, Pen digits) and Ward's measure
for one data set (Wine). Even though it may be possible to achieve a better partitioning by using the optimum σ for each data set, this requires multiple runs to
find the optimum value, which is unique for each data set. However, even the use of local
σ in spectral clustering often outperforms the other methods in this study that use
distance based similarity measures.
3.3
This data set describes a remotely sensed area with 360x600 pixels, resulting
in 216000 samples, where each sample has 41 features. There are eight classes:
beach, ocean, ice, river, road, park, residential, industrial. 29,003 samples, which
are labeled as one of the eight classes, represent the ground truth information to
be used in the accuracy assessment. The performance of spectral clustering with
local σ is quite high (93.83%), outperforming all other methods in the study.
Table 4. Accuracies for SOM clustering of Boston data set used in [20]. SC2 is spectral
clustering with local σ. The best accuracy obtained for each SOM size is shown in
boldface and the best accuracy for each data set is underlined.
# of
# of
Data set samples clusters SOM size SC2 k-means average centroid Ward CONN
Boston 216000
3.4
10x10
53.95
88.81
88.69
20x20
93.82 92.17
91.48
92.97
30x30
93.32
92.35
92.05
40x40
93.39 92.41
93.06
93.38
90.17 87.81
92.28 89.13
Computational Complexity
Conclusions
units, which is another mapping. This method extracted clusters accurately, even
in the case of nonlinear separation boundary. It also outperformed hierarchical
agglomerative clustering and k-means clustering, for SOM clustering of the data
sets used in this study. The success of two consecutive mappings indicates that
SOMs may be used as an initial vector quantization method to make spectral
clustering able to partition large data sets where it is not possible to use spectral
clustering due to its computational complexity.
References
1. Vesanto, J.: SOM-based data visualization methods. Intelligent Data Analysis 3(2), 111–126 (1999)
2. Tasdemir, K., Merenyi, E.: Exploiting data topology in visualization and clustering of Self-Organizing Maps. IEEE Transactions on Neural Networks 20(4), 549–562 (2009)
3. Ultsch, A.: Self-organizing neural networks for visualization and classification. In: Lausen, O.B., Klar, R. (eds.) Information and Classification-Concepts, Methods and Applications, pp. 307–313. Springer, Heidelberg (1993)
4. Murtagh, F.: Interpreting the Kohonen self-organizing map using contiguity-constrained clustering. Pattern Recognition Letters 16, 399–408 (1995)
5. Vesanto, J., Alhoniemi, E.: Clustering of the self-organizing map. IEEE Transactions on Neural Networks 11(3), 586–600 (2000)
6. Cottrell, M., Rousset, P.: The Kohonen algorithm: A powerful tool for analyzing and representing multidimensional quantitative and qualitative data. In: Cabestany, J., Mira, J., Moreno-Díaz, R. (eds.) IWANN 1997. LNCS, vol. 1240, pp. 861–871. Springer, Heidelberg (1997)
7. Halkidi, M., Vazirgiannis, M.: A density-based cluster validity approach using multi-representatives. Pattern Recognition Letters (6), 773–786 (2008)
8. Wu, S., Chow, W.: Clustering of the self-organizing map using a clustering validity index based on inter-cluster and intra-cluster density. Pattern Recognition (37), 175–188 (2004)
9. Brugger, D., Bogdan, M., Rosenstiel, W.: Automatic cluster detection in Kohonen's SOM. IEEE Transactions on Neural Networks 19(3), 442–459 (2008)
10. Tasdemir, K., Milenov, P.: An automated SOM clustering based on data topology. In: Proc. 18th European Symposium on Artificial Neural Networks (ESANN 2010), Bruges, Belgium, D-Facto, April 27-30, 2010, pp. 375–380 (2010)
11. von Luxburg, U.: A tutorial on spectral clustering. Technical Report TR-149, Max Planck Institute for Biological Cybernetics (March 2007)
12. Wang, L., Leckie, C., Ramamohanarao, K., Bezdek, J.: Approximate spectral clustering. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 134–146. Springer, Heidelberg (2009)
13. Zhang, X., Jiao, L., Liu, F., Bo, L., Gong, M.: Spectral clustering ensemble applied to SAR image segmentation. IEEE Transactions on Geoscience and Remote Sensing 46(7) (July 2008)
14. Saalbach, A., Twellmann, T., Nattkemper, T.W.: Spectral clustering for data categorization based on self-organizing maps. In: Nasrabadi, N.M., Rizvi, S.A. (eds.) SPIE Proceedings, Applications of Neural Networks and Machine Learning in Image Processing IX, vol. 5673, pp. 12–18 (2005)
15. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
16. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Dietterich, T., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems 14, MIT Press, Cambridge (2002)
17. Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. In: Advances in Neural Information Processing Systems (2004)
18. Ultsch, A.: Maps for the visualization of high-dimensional data spaces. In: WSOM, vol. 3, pp. 225–230 (2003)
19. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
20. Carpenter, G.A., Martens, S., Ogas, O.J.: Self-organizing information fusion and hierarchical knowledge discovery: a new framework using ARTMAP neural networks. Neural Networks 18(3), 287–295 (2005)
21. Jain, A., Murty, M.N., Flynn, P.: Data clustering: A review. ACM Computing Surveys 31(3), 264–323 (1999)
22. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Abstract. We propose functional relevance learning for learning vector quantization of functional data. The relevance profile is taken as a
superposition of a set of basis functions depending on only a few parameters compared to standard relevance learning. Moreover, sparsity
of the superposition is achieved by an entropy based penalty function.
Keywords: functional vector quantization, relevance learning, information theory.
Introduction
In recent years, prototype based models have become one of the most widely used
paradigms for clustering and classification. Different strategies have been proposed for classification. Whereas support vector machines (SVMs) emphasize the
class borders by the support vectors while maximizing the separation margin,
the family of learning vector quantization (LVQ) algorithms is motivated by
class representative prototypes to achieve high classification accuracy. Based on
the original but heuristically motivated standard LVQ introduced by Kohonen
[3], several more advanced methods were proposed. One key approach is the generalized LVQ (GLVQ) suggested by Sato and Yamada [10], which approximates
the accuracy by a differentiable cost function to be minimized by stochastic gradient descent. This algorithm was extended to deal with metric adaptation to
weight the data dimensions according to their relevance for classification [1]. Usually, this relevance learning is based on weighting the Euclidean distance, and,
hence, the data dimensions are treated independently, leading to a large number
of weighting coefficients, the so-called relevance profile, to be adapted in case of
high-dimensional data. If the data dimension is huge, as is frequently the case
for spectral data or time series, the relevance determination may become critical
and unstable. However, such data have in common that the vectors can be
seen as discrete realizations of functions, i.e. the vectors are so-called functional
data. For those data the index of the vector dimensions is a representative of the
respective independent function variable, i.e. frequency, time or position etc. In
this sense the data dimensions are no longer uncorrelated.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 79–89, 2011.
T. Villmann and M. Kästner
The aim of the relevance learning method here is to make use of this interpretation. Then, the relevance profile can also be regarded as a discrete representation of a relevance function. We suggest approximating these relevance functions
as a superposition of only a few basis functions depending on a drastically decreased number of parameters compared to the huge number of independent
relevance weights. We call this algorithm Generalized Functional Relevance LVQ
(GFRLVQ). Further, we propose the integration of a sparseness criterion for
minimizing the number of needed basis functions based on an entropy criterion
resulting in Sparse GFRLVQ (S-GFRLVQ).
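To make the parameter saving concrete, the following Python sketch builds a relevance profile over D data dimensions from K Gaussian basis functions and uses it in a relevance-weighted squared Euclidean distance. It is an illustration under stated assumptions, not the authors' implementation: the function names and the parameter names `beta` (weighting coefficients), `theta` (centers) and `sigma` (widths) are hypothetical, normalized Gaussians are assumed, and no normalization of the profile is enforced.

```python
import numpy as np

def gaussian_relevance_profile(D, beta, theta, sigma):
    """Relevance profile lambda_j, j = 1..D, as a superposition of K Gaussians.

    Only 3K parameters (beta, theta, sigma) describe all D relevance weights.
    """
    j = np.arange(1, D + 1)[:, None]                       # (D, 1) dimension index
    basis = np.exp(-(j - theta) ** 2 / (2.0 * sigma ** 2))  # (D, K)
    basis /= sigma * np.sqrt(2.0 * np.pi)
    return basis @ beta                                     # (D,)

def weighted_sq_distance(v, w, lam):
    """Relevance-weighted squared Euclidean distance between v and w."""
    return float(np.sum(lam * (v - w) ** 2))
```

For a spectrum with D = 100 bands and K = 3 basis functions, the profile is described by 9 parameters instead of 100 independent relevance weights.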
The paper structure is: After a short review of the GLVQ and GRLVQ schemes
we introduce the GFRLVQ followed by S-GFRLVQ. The experiment section
shows the abilities of the new algorithms for illustrative data sets.
$$d_\lambda(v, w) = \sum_{i=1}^{D} \lambda_i (v_i - w_i)^2 \qquad (2)$$
$$\sum_i \lambda_i = 1. \qquad (3)$$
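$$\frac{\partial_S E(W)}{\partial w^+} = \xi^+ \cdot \frac{\partial d^+}{\partial w^+} \quad \text{and} \quad \frac{\partial_S E(W)}{\partial w^-} = \xi^- \cdot \frac{\partial d^-}{\partial w^-}$$
with $\xi^+ = f' \cdot \frac{2 d^-(v)}{(d^+(v) + d^-(v))^2}$, $\xi^- = -f' \cdot \frac{2 d^+(v)}{(d^+(v) + d^-(v))^2}$, and $f'$ denotes the first
derivative of $f$. Relevance learning in this model can be performed by adaptation
of the relevance weights again by gradient descent learning:
$$\frac{\partial E_S(W)}{\partial \lambda_j} = \xi^+ \frac{\partial d^+}{\partial \lambda_j} + \xi^- \frac{\partial d^-}{\partial \lambda_j}. \qquad (4)$$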
Here, the stochastic gradient operator $\frac{\partial_S E(W)}{\partial w}$ is carried out by taking the gradient $\frac{\partial f(\mu(v))}{\partial w}$ for a stochastically chosen $v$ according to the data distribution $P(V)$.
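Using the convention $t_j = j$ we get in the case of Gaussians for the weighting
coefficient $\beta_l$, the center $\Theta_l$ and the width $\sigma_l$
$$\frac{\partial d_\lambda(v, w)}{\partial \beta_l} = \frac{1}{\sigma_l \sqrt{2\pi}} \sum_{j=1}^{D} \exp\left(-\frac{(j - \Theta_l)^2}{2\sigma_l^2}\right) (v_j - w_j)^2 \qquad (5)$$
$$\frac{\partial d_\lambda(v, w)}{\partial \Theta_l} = \frac{\beta_l}{\sigma_l^3 \sqrt{2\pi}} \sum_{j=1}^{D} (j - \Theta_l) \exp\left(-\frac{(j - \Theta_l)^2}{2\sigma_l^2}\right) (v_j - w_j)^2 \qquad (6)$$
$$\frac{\partial d_\lambda(v, w)}{\partial \sigma_l} = \frac{\beta_l}{\sigma_l^2 \sqrt{2\pi}} \sum_{j=1}^{D} \left(\frac{(j - \Theta_l)^2}{\sigma_l^2} - 1\right) \exp\left(-\frac{(j - \Theta_l)^2}{2\sigma_l^2}\right) (v_j - w_j)^2 \qquad (7)$$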
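whereas for the Lorentzian with width $\eta_l$ we obtain
$$\frac{\partial d_\lambda(v, w)}{\partial \beta_l} = \frac{1}{\pi} \sum_{j=1}^{D} \frac{\eta_l}{\eta_l^2 + (j - \Theta_l)^2} (v_j - w_j)^2 \qquad (8)$$
$$\frac{\partial d_\lambda(v, w)}{\partial \Theta_l} = \frac{\beta_l}{\pi} \sum_{j=1}^{D} \frac{2\eta_l (j - \Theta_l)}{\left(\eta_l^2 + (j - \Theta_l)^2\right)^2} (v_j - w_j)^2 \qquad (9)$$
$$\frac{\partial d_\lambda(v, w)}{\partial \eta_l} = \frac{\beta_l}{\pi} \sum_{j=1}^{D} \frac{(j - \Theta_l)^2 - \eta_l^2}{\left(\eta_l^2 + (j - \Theta_l)^2\right)^2} (v_j - w_j)^2 \qquad (10)$$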
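The derivatives for the Gaussian case, Eqs. (5)-(7), can be checked numerically against central finite differences of the weighted distance. The Python sketch below is a sanity check under stated assumptions, not part of the original method: it assumes normalized Gaussian basis functions and the weighted squared Euclidean distance, and the names `beta`, `theta`, `sigma`, `distance`, `gradients` and `numeric_grad` are hypothetical.

```python
import numpy as np

def basis(j, theta, sigma):
    """Normalized Gaussian basis functions evaluated at dimension indices j."""
    return np.exp(-(j - theta) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

def distance(v, w, beta, theta, sigma):
    """Weighted squared Euclidean distance with a functional relevance profile."""
    j = np.arange(1, len(v) + 1)[:, None]
    lam = basis(j, theta, sigma) @ beta
    return float(np.sum(lam * (v - w) ** 2))

def gradients(v, w, beta, theta, sigma):
    """Analytic derivatives with respect to beta_l, theta_l and sigma_l."""
    j = np.arange(1, len(v) + 1)[:, None]
    g = basis(j, theta, sigma)                 # (D, K)
    sq = ((v - w) ** 2)[:, None]               # (D, 1)
    d_beta = np.sum(g * sq, axis=0)
    d_theta = beta * np.sum((j - theta) / sigma ** 2 * g * sq, axis=0)
    d_sigma = beta * np.sum(((j - theta) ** 2 / sigma ** 2 - 1.0) / sigma * g * sq, axis=0)
    return d_beta, d_theta, d_sigma

def numeric_grad(v, w, params, which, h=1e-6):
    """Central finite differences with respect to params[which]."""
    grad = np.zeros_like(params[which])
    for l in range(len(grad)):
        up = [p.copy() for p in params]
        dn = [p.copy() for p in params]
        up[which][l] += h
        dn[which][l] -= h
        grad[l] = (distance(v, w, *up) - distance(v, w, *dn)) / (2.0 * h)
    return grad
```

Comparing `gradients(...)` with `numeric_grad(...)` for each of the three parameter groups gives a quick test before such expressions are used in a gradient descent update.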
$$P_R = \sum_{l=1}^{K} \sum_{m=1}^{K} \exp\left(-\frac{(\Theta_m - \Theta_l)^2}{2 \xi_l \xi_m}\right)$$
is added to the cost function (1), which now reads as $E_{GFRLVQ} = E(W) + \varepsilon_R P_R$
with a properly chosen penalty weight $\varepsilon_R > 0$. For Gaussian basis functions
we set $\xi_k = \sigma_k$, and for the Lorentzians we take $\xi_k = \eta_k$. The penalty can
be interpreted as a repulsion with an influence range determined by the local
correlations $\xi_l \xi_m$. The resulting additional update term for $\Theta_l$-learning is
$$\frac{\partial P_R}{\partial \Theta_l} = \sum_{m=1}^{K} \frac{(\Theta_m - \Theta_l)}{\xi_l \xi_m} \exp\left(-\frac{(\Theta_m - \Theta_l)^2}{2 \xi_l \xi_m}\right)$$
leading to a minimum spreading of the basis function centers $\Theta_l$. Analogously,
an additional term occurs for the adjustments of the $\xi_l$ according to $\frac{\partial P_R}{\partial \xi_l}$, which
has to be taken into account for the update of $\sigma_k$ and $\eta_k$ for Gaussians and
Lorentzians, respectively.
In the GFRLVQ model the number K of basis functions can be freely chosen so
far. Obviously, if the K-value is too small, an appropriate relevance weighting
is impossible. Conversely, a K-value that is too large complicates the problem more
than necessary. Hence, a good way to choose K is needed. This problem can
be seen as sparseness in functional relevance learning. Information theory offers a common way
to judge sparsity. In particular, the Shannon entropy
$H$ of the weighting coefficients $\beta = (\beta_1, \ldots, \beta_K)$ can be applied. Maximum
sparseness, i.e. minimum entropy, is obtained iff $\beta_l = 1$ for exactly one certain
$l$ whereas the other $\beta_m$ are equal to zero. However, maximum sparseness may
be accompanied by a decreased classification accuracy and/or an increased cost
function value $E_{GFRLVQ}$.
To achieve an optimal balance, we propose the following strategy: The cost
function $E_{GFRLVQ}$ is extended to
$$E_{S\text{-}GFRLVQ} = E_{GFRLVQ} + \gamma(\tau) \cdot H(\beta) \qquad (11)$$
with $\tau$ counting the adaptation steps. Let $\tau_0$ be the final time step of the usual
GFRLVQ-learning. Then $\gamma(\tau) = 0$ for $\tau < \tau_0$ holds. Thereafter, $\gamma(\tau)$ is slowly
increased in an adiabatic manner [2], such that all parameters can immediately
follow the drift of the system. An additional term for $\beta_l$-adaptation occurs for
non-vanishing $\gamma(\tau)$-values according to this new cost function (11):
$$\frac{\partial E_{S\text{-}GFRLVQ}}{\partial \beta_l} = \frac{\partial E_{GFRLVQ}}{\partial \beta_l} + \gamma(\tau) \frac{\partial H}{\partial \beta_l} \qquad (12)$$
with $\frac{\partial H}{\partial \beta_l} = -(\log(\beta_l) + 1)$. This term triggers the $\beta$-vector to become sparse.
The adaptation process is stopped if the $E_{GFRLVQ}$-value or the classification
error shows a significant increase compared to that at time $\tau_0$.
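The entropy term and its derivative can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the clipping constant `eps` that guards the logarithm at zero, and the linear ramp used for the sparsity weight after the end of the usual GFRLVQ learning are all hypothetical choices.

```python
import numpy as np

def shannon_entropy(beta):
    """Shannon entropy of the weighting coefficients, with 0 * log(0) := 0."""
    b = beta[beta > 0]
    return float(-np.sum(b * np.log(b)))

def entropy_gradient(beta, eps=1e-12):
    """Derivative of the entropy with respect to each coefficient."""
    # eps is an illustrative guard against log(0), not part of the original text.
    return -(np.log(np.clip(beta, eps, None)) + 1.0)

def sparsity_weight(tau, tau0, rate):
    """Hypothetical schedule: zero up to tau0, then a slow linear increase."""
    return rate * max(0.0, tau - tau0)
```

A one-hot coefficient vector has zero entropy (maximum sparseness), whereas a uniform vector over K basis functions has entropy log K, which is the largest possible value.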
Experiments
We tested the GFRLVQ for the classification of two well-known real-world spectral data sets obtained from StatLib and UCI: The Tecator data set [11] consists
of 215 spectra obtained for several meat samples. The spectral range is between
850 nm and 1050 nm wavelength (100 spectral bands). The data are labeled according
to the two fat levels (low/high). The Wine data set [8] contains 121 infrared absorption
spectra of wine between wavenumbers 4000 cm^-1 and 400 cm^-1 (256 bands)
split into 91 training and 30 test data. The data are classified according to their
two alcohol levels (low/high) as given in [5].
For the Tecator data we started the simulations with 10 Gaussians to describe
the relevance profile and trained the model using GFRLVQ. The achieved accuracy was above 88%. Thereafter the penalty term γ(τ) in (11) and accordingly
in (12) is continuously increased in an adiabatic manner such that the system is
pushed to become sparse. As depicted in Fig. 1, increasing the penalty term
Fig. 1. Time development of the S-GFRLVQ for the Tecator data. In the beginning
there are 10 Gaussian basis functions with weighting coefficients β_l obtained from usual
GFRLVQ. Increasing the penalty term γ(τ)·H(β) leads to sparsity in the β-vector
(top, β_l-weights in dependence of the increasing sparsity weight γ(τ) through learning
time τ). If the sparsity constraint does not dominate, a high accuracy is still available.
If the sparsity becomes too high, accuracy significantly drops down (bottom).
Fig. 2. top: Tecator data; bottom: Resulting relevance profiles: solid at the end of
usual GFRLVQ, dashed at the maximum sparseness without significant accuracy loss,
dotted maximum sparseness (i.e. only one single remaining basis function).
Fig. 3. The same as in Fig.1 but now for the Wine data set. In this example 15
Lorentzians are taken as basis functions.
Fig. 4. top: Wine data; bottom: Resulting relevance profiles: solid at the end of
usual GFRLVQ, dashed at the maximum sparseness without significant accuracy
loss, dotted maximum sparseness (i.e. only one single remaining basis function).
Conclusion
We propose the Sparse GFRLVQ for optimal model generation with respect to
functional relevance learning. Functional relevance learning supposes that data
are representations of functions such that the relevance profile can be assumed
to be a function, too. This allows the description in terms of a superposition of basis
functions. Sparsity is judged in terms of the entropy of the respective weighting coefficients. The approach is demonstrated for two exemplary data sets with
two different kinds of basis functions, Gaussians and Lorentzians, whereas for
the similarity measure the weighted Euclidean distance was used for simplicity.
Obviously, the Euclidean distance is not based on a functional norm. Yet, the
transfer of the functional relevance approach to real functional norms and distances like Sobolev norms [15], Lee-norm [6,7], kernel based LVQ-approaches [14]
or divergence based similarity measures [13], which carry the functional aspect
inherently, is straightforward and a topic of future investigations.
References
1. Hammer, B., Villmann, T.: Generalized relevance learning vector quantization. Neural Networks 15(8-9), 1059–1068 (2002)
2. Kato, T.: On the adiabatic theorem of quantum mechanics. Journal of the Physical Society of Japan 5(6), 435–439 (1950)
3. Kohonen, T.: Self-Organizing Maps. Springer Series in Information Sciences, vol. 30. Springer, Heidelberg (1995), 2nd extended edn. (1997)
4. Krier, C., Rossi, F., Francois, D., Verleysen, M.: A data-driven functional projection approach for the selection of feature ranges in spectra with ICA or cluster analysis. Chemometrics and Intelligent Laboratory Systems 91, 43–53 (2008)
5. Krier, C., Verleysen, M., Rossi, F., Francois, D.: Supervised variable clustering for classification of NIR spectra. In: Verleysen, M. (ed.) Proceedings of 16th European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgique, pp. 263–268 (2009)
6. Lee, J., Verleysen, M.: Generalization of the lp norm for time series and its application to self-organizing maps. In: Cottrell, M. (ed.) Proc. of Workshop on Self-Organizing Maps (WSOM) 2005, Paris, Sorbonne, pp. 733–740 (2005)
7. Lee, J., Verleysen, M.: Nonlinear Dimensionality Reduction. In: Information Sciences and Statistics. Springer Science+Business Media, New York (2007)
8. Meurens, M.: Wine data set,
http://www.ucl.ac.be/mlg/index.php?page=databases.meurens.bnut.ucl.
ac.be
9. Rossi, F., Lendasse, A., Francois, D., Wertz, V., Verleysen, M.: Mutual information for the selection of relevant variables in spectrometric nonlinear modelling. Chemometrics and Intelligent Laboratory Systems 80, 215–226 (2006)
10. Sato, A., Yamada, K.: Generalized learning vector quantization. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (eds.) Proceedings of the 1995 Conference on Advances in Neural Information Processing Systems 8, pp. 423–429. MIT Press, Cambridge (1996)
11. Thodberg, H.: Tecator meat sample dataset, http://lib.stat.cmu.edu/datasets/tecator
12. Villmann, T., Cichocki, A., Principe, J.: Information theory related learning. In: Verleysen, M. (ed.) Proc. of European Symposium on Artificial Neural Networks (ESANN 2011), Evere, Belgium, d-side publications (2011) (page in press)
13. Villmann, T., Haase, S.: Divergence based vector quantization. Neural Computation 23(5), 1343–1392 (2011)
14. Villmann, T., Hammer, B.: Theoretical aspects of kernel GLVQ with differentiable kernel. IfI Technical Report Series (IfI-09-12), pp. 133–141 (2009)
15. Villmann, T., Schleif, F.-M.: Functional vector quantization by neural maps. In: Chanussot, J. (ed.) Proceedings of First Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS 2009), pp. 1–4. IEEE Press, Los Alamitos (2009); ISBN 978-1-4244-4948-4
Abstract. We propose relevance learning for unsupervised online vector quantization algorithms based on stochastic gradient descent learning
according to the given vector quantization cost function. We consider
several widely used models, including the neural gas algorithm, the Heskes variant of self-organizing maps, and fuzzy c-means. We apply
the relevance learning scheme to divergence-based similarity measures
between prototypes and data vectors in the vector quantization schemes.
Keywords: vector quantization, relevance learning, divergence learning.
Introduction
Machine learning algorithms for unsupervised vector quantization (VQ) comprise a broad variety of models, ranging from statistically motivated approaches to strongly biologically realistic models. The main task is to describe given data faithfully, such that the main properties of the data are preserved as well as possible by a small set of prototypes. These properties could be the probability density [28], the shape of the data in the sense of possibly non-linear principal component analysis (PCA) [23],[25], visualization abilities as in topology preserving mapping [30], or the usual reconstruction error. Several approaches exist for the different goals [35], whereby we further have to differentiate according to the type of adaptation: batch methods use all the data at the same time, whereas online models adapt incrementally. Usually, the Euclidean distance is used in all these models. Yet, modern methods also include non-standard dissimilarity measures like functional norms [17], Sobolev norms [33], kernel based approaches [32], or divergences [31].
Relevance learning, introduced for supervised learning vector quantization [10], and its generalization, the so-called matrix learning [26], were recently extended to unsupervised batch learning in topographic mapping [1]. Relevance or matrix learning weights and correlates the data dimensions to achieve better classification results in supervised learning and the signal to noise ratio in unsupervised
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 90-100, 2011.
© Springer-Verlag Berlin Heidelberg 2011
In general, vector quantization comprises unsupervised methods for data compression of vectorial data $v \in V \subseteq \mathbb{R}^n$ with probability density $P$ by prototypes $w \in W \subseteq \mathbb{R}^n$. The similarity between data vectors $v$ and prototypes $w$ is judged in terms of a dissimilarity measure $d(v, w)$, frequently taken as the Euclidean distance. Yet more advanced dissimilarity measures, not necessarily required to be a mathematical distance, can be applied. If the underlying cost function
$$E_{VQ} = \int P(v)\, L(v, W, d(v, w))\, dv \qquad (1)$$
of the vector quantization algorithm is minimized by stochastic gradient learning with respect to $w$, the dissimilarity measure used has to be differentiable with respect to $w$. $L(v, W, d(v, w))$ are local costs, which differ between the algorithms.
A robust alternative to the classical c-means algorithm is the neural gas (NG) [20] with local costs
$$L_{NG}(v, W, d(v, w)) = \sum_{i \in A} \frac{h_\sigma(i, v, W)}{2C(\sigma, N)}\, d(v, w_i), \qquad h_\sigma(i, v, W) = \exp\left(-\frac{k_i(v, W)}{2\sigma^2}\right) \qquad (2)$$
whereby
M. Kästner et al.
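[Equations (3) and (4) are not recoverable from the source; only the index range $r \in A$ survives.]
where $h_\sigma(r, r') = \exp\left(-\frac{d_A(r, r')}{2\sigma^2}\right)$ and $d_A(r, r')$ denotes some dissimilarity measure in the index space $A$, which is equipped with a topological structure for SOMs. The assignments $\psi_r(v)$ are one iff $r = s(v)$ and zero elsewhere, but here based on the local errors (4).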
The standard fuzzy c-means (FCM) was originally developed by Dunn [9] and improved by Bezdek [3]. Its local costs are given as
$$L_{FCM}(v, W, d(v, w)) = \frac{1}{2} \sum_{i \in A} (\psi_i(v))^m\, d(v, w_i), \qquad (5)$$
where $\psi_i(v)$ are the fuzzy assignments and the exponent $m$ determines the fuzziness, commonly set to $m > 1$. For $m \to 1$ the clustering becomes crisp. Frequently, the fuzziness is chosen as $m = 2$.
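The FCM local costs can be illustrated with a minimal sketch. It assumes the squared Euclidean distance as $d(v, w_i)$ and the standard FCM assignment rule $\psi_i(v) \propto d(v, w_i)^{-1/(m-1)}$, which is not spelled out in the text; the function names are hypothetical.

```python
import numpy as np

def fcm_assignments(v, W, m=2.0):
    """Fuzzy assignments psi_i(v) for one sample (standard FCM rule, assumed here)."""
    d = np.sum((W - v) ** 2, axis=1)       # squared Euclidean d(v, w_i), assumed > 0
    inv = d ** (-1.0 / (m - 1.0))
    return inv / inv.sum()                 # assignments sum to one

def fcm_local_cost(v, W, m=2.0):
    """Local cost (1/2) * sum_i psi_i(v)^m * d(v, w_i)."""
    d = np.sum((W - v) ** 2, axis=1)
    psi = fcm_assignments(v, W, m)
    return 0.5 * np.sum(psi ** m * d)
```

With $m$ close to 1 the assignments returned by this sketch approach a crisp winner-takes-all assignment, in line with the remark above.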
Minimizing each of these cost functions by stochastic gradient descent learning
or batch mode algorithms realizes the respective vector quantization algorithm
distributing the prototypes in the data space.
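As an illustration of the stochastic gradient variant, the following sketch performs one online neural gas step for a single presented data vector. It assumes the squared Euclidean distance, a rank-based neighborhood of the form $\exp(-k_i / 2\sigma^2)$, and absorbs constant factors into the learning rate; `neural_gas_step`, `sigma`, and `lr` are hypothetical names.

```python
import numpy as np

def neural_gas_step(W, v, sigma, lr):
    """One online neural gas update for a single sample v (squared Euclidean d)."""
    d = np.sum((W - v) ** 2, axis=1)          # dissimilarity to every prototype
    ranks = np.argsort(np.argsort(d))         # rank k_i: 0 for the closest prototype
    h = np.exp(-ranks / (2.0 * sigma ** 2))   # rank-based neighborhood (assumed form)
    return W + lr * h[:, None] * (v - W)      # move each prototype toward v
```

In practice the step would be repeated over randomly drawn data vectors while `lr` and `sigma` are decreased.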
So far we have not discussed any specific choice of the dissimilarity measure $d(v, w)$. Frequently the Euclidean distance is used. To improve the performance in SOM and NG, recently the distance measure
$$d_\Lambda(v, w) = (v - w)^T \Lambda\, (v - w) \qquad (6)$$
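[Equations (7) and (8) are not recoverable from the source.]
$$D_\gamma(\lambda \circ v \,\|\, \lambda \circ w) = \log \left[ \frac{\left(\sum_i (\lambda_i v_i)^{\gamma+1}\right)^{\frac{1}{\gamma(\gamma+1)}} \left(\sum_i (\lambda_i w_i)^{\gamma+1}\right)^{\frac{1}{\gamma+1}}}{\left(\sum_i (\lambda_i)^{\gamma+1}\, v_i\, w_i^{\gamma}\right)^{\frac{1}{\gamma}}} \right] \qquad (9)$$
$$D_\beta(\lambda \circ v \,\|\, \lambda \circ w) = \frac{1}{\beta(\beta-1)} \sum_i \left[ (\lambda_i v_i)^\beta + (\beta-1)(\lambda_i w_i)^\beta - \beta\, (\lambda_i)^\beta\, v_i\, w_i^{\beta-1} \right] \qquad (10)$$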
and $\lambda \circ v$ denotes the Hadamard product between the relevance vector $\lambda$ and $v$ [31]. It should be noted that $D_\beta$ becomes the quadratic Euclidean distance for $\beta = 2$. Further, both $D^{GR}_\alpha$ and $D_\gamma$ converge to $D_{GKL}$ in the limit $\alpha \to 0$ and $\gamma \to 0$, respectively. The divergence $D_\gamma$ for $\gamma = 1$ is the Cauchy-Schwarz divergence $D_{CS}$ [24].
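The two special cases mentioned here can be checked numerically. The sketch below assumes the standard forms of the β- and γ-divergences, applied to relevance-weighted vectors $\lambda \circ v$ and $\lambda \circ w$ with positive entries; the function names are hypothetical and the exact normalization used by the authors may differ.

```python
import numpy as np

def beta_divergence(v, w, lam, beta):
    """Relevance-weighted beta-divergence (standard form, beta not in {0, 1})."""
    a, b = lam * v, lam * w                      # Hadamard products with the relevances
    terms = a ** beta + (beta - 1.0) * b ** beta - beta * a * b ** (beta - 1.0)
    return terms.sum() / (beta * (beta - 1.0))

def gamma_divergence(v, w, lam, gamma):
    """Relevance-weighted gamma-divergence (standard form, gamma > 0)."""
    a, b = lam * v, lam * w
    t1 = np.log(np.sum(a ** (gamma + 1.0))) / (gamma * (gamma + 1.0))
    t2 = np.log(np.sum(b ** (gamma + 1.0))) / (gamma + 1.0)
    t3 = np.log(np.sum(a * b ** gamma)) / gamma
    return t1 + t2 - t3
```

Under these assumptions, `beta_divergence(..., beta=2.0)` equals half the relevance-weighted squared Euclidean distance, and `gamma_divergence(..., gamma=1.0)` equals the Cauchy-Schwarz divergence.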
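Relevance learning by stochastic gradient learning in the above mentioned vector quantization algorithms can be performed by taking, for a data vector $v$ presented during learning, the respective derivatives
$$\frac{\partial L(v, W, D(\lambda \circ v \,\|\, \lambda \circ w))}{\partial \lambda_t} = \frac{\partial L(v, W, D(\lambda \circ v \,\|\, \lambda \circ w))}{\partial D(\lambda \circ v \,\|\, \lambda \circ w)} \cdot \frac{\partial D(\lambda \circ v \,\|\, \lambda \circ w)}{\partial \lambda_t}$$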
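For the Euclidean special case the chain rule can be written out explicitly. The sketch below assumes $D = \frac{1}{2}\sum_i \lambda_i^2 (v_i - w_i)^2$, so that $\partial D / \partial \lambda_t = \lambda_t (v_t - w_t)^2$; the clipping to non-negative values and the normalization of $\lambda$ to unit sum are assumptions commonly made in relevance learning, not details taken from the text, and all names are hypothetical.

```python
import numpy as np

def weighted_sq_dist(v, w, lam):
    """D = 1/2 * sum_i lam_i^2 (v_i - w_i)^2 (Euclidean special case)."""
    return 0.5 * np.sum((lam * (v - w)) ** 2)

def grad_lambda(v, w, lam):
    """Analytic derivative dD/dlam_t = lam_t * (v_t - w_t)^2."""
    return lam * (v - w) ** 2

def relevance_step(lam, v, w, lr, dL_dD=1.0):
    """One gradient step on the relevances; dL_dD is the outer factor dL/dD."""
    new = lam - lr * dL_dD * grad_lambda(v, w, lam)
    new = np.clip(new, 0.0, None)          # assumed constraint: lam_t >= 0
    return new / new.sum()                 # assumed constraint: sum_t lam_t = 1
```

The factor `dL_dD` depends on the chosen algorithm, for example the neighborhood weight in NG or the fuzzy assignment in FCM.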
4 Experiments
4.1 Data
Leaf shape data: The leaf shape dataset originates from a study of nine
genotypes and their shape characteristics of Arabidopsis thaliana (L.) Heynh
Fig. 1. (a) Two wild types (col-0, ws-2) and seven mutations of Arabidopsis have been used. (b) For the curvegram, a contour is continuously smoothed with Gaussians of increasing sigma. (c) The log-normalized multi-scale bending energy as the shape descriptor for different genotypes; depicted are mean and standard deviation per scale.
Fig. 2. Visualization of data (top) and the inverse variance (bottom) of the frequencies
for leaf shape data (left) and AVIRIS data (right)
the bending energy has a number of advantages: it is invariant to rotation and translation and, through the normalization with the perimeter, also invariant to shape scaling. The bending energy results from elasticity theory and provides a means for expressing the amount of energy that is needed to transform the contour of a specific shape into its minimum-energy state, namely, a circle with the same perimeter as the original object. For shapes related to real objects, such as biological shapes (e.g. membranes, neurons, organs), the bending energy provides a particularly meaningful physical interpretation in terms of the energy that has to be applied in order to produce or modify specific objects [36,7]. The bending energy profile or curvegram [7] serves as input vector to the vector quantization methods. In Fig. 1c the mean profiles for two exemplary genotypes are depicted. For this paper, 25 different scales with a minimum scale of 0.01 and a maximum scale of 0.5 are used.
Remote sensing data: The remote sensing data set is the publicly available Indian Pines data from the NASA Airborne Visible/Infrared Imaging Spectrometer (AVIRIS), consisting of 145 × 145 pixels [16]. AVIRIS acquires data in 220 bands of 10 nm width from 400 to 2500 nm. For this data set, 20 noisy bands can be identified (104-108, 150-163, 220) due to atmospheric water absorption and can therefore safely be removed [6]. We did not use the knowledge about the 16 identified classes because the learning is unsupervised. The data sets are visualized in Fig. 2.
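The band removal can be sketched as follows, assuming the hyperspectral cube is stored as an array whose last axis holds the 220 bands and that the band numbers quoted in the text are 1-based; `remove_noisy_bands` is a hypothetical helper.

```python
import numpy as np

def remove_noisy_bands(cube):
    """Drop the water-absorption bands 104-108, 150-163 and 220 (1-based numbers)."""
    noisy = np.r_[104:109, 150:164, 220]                   # 20 band numbers in total
    all_bands = np.arange(1, cube.shape[-1] + 1)
    keep = np.setdiff1d(all_bands, noisy)                  # remaining 1-based band numbers
    return cube[..., keep - 1], keep
```

For a cube with 220 bands this leaves 200 spectral dimensions per pixel.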
4.2 Validity Measures
reflecting the fact that clustering is an ill-posed problem and, therefore, different measures emphasize different aspects. Yet, they all have in common that they judge compactness and separability. We selected two measures that are applicable to fuzzy (FCM) as well as to crisp (NG, SOM) vector quantization. The first is the modified SVF index [37]:
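$$SVF(V, W) = \frac{S}{C} = \frac{\sum_{i \in A} \mathrm{avgD}(\tilde{w}_i)}{\sum_{i \in A} \max_{v \in V} \left( (\psi_i(v))^m\, D(\tilde{v} \,\|\, \tilde{w}_i) \right)} \qquad (12)$$
with the relevance-weighted vectors $\tilde{v} = \lambda \circ v$ and $\tilde{w}_i = \lambda \circ w_i$, where $\mathrm{avgD}(\tilde{w}_i) = \langle D(\tilde{w}_j \,\|\, \tilde{w}_i) + D(\tilde{w}_i \,\|\, \tilde{w}_j) \rangle_{j \in A}$ denotes the mean distance value of other prototypes to $\tilde{w}_i$. The value $S$ estimates the separation whereas $C$ appraises the compactness. The second measure is the modified fuzzy index introduced by Xie & Beni [34]:
$$XiB = \frac{\sum_{i \in A} \sum_{v \in V} (\psi_i(v))^m\, D(\tilde{v} \,\|\, \tilde{w}_i)}{\frac{1}{\#A} \sum_{i \in A} \left(\mathrm{avgD}(\tilde{w}_i)\right)^2}. \qquad (13)$$
The higher the SVF value and the lower the XiB value are for a chosen divergence $D$, the better the results.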
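A minimal sketch of both indices follows. It assumes that the data and prototypes passed in are already relevance-weighted, uses the squared Euclidean distance as a stand-in for the divergence $D$, and averages over the other prototypes only ($j \neq i$); the function names and the array layout (`psi` with one row per data vector and one column per prototype) are hypothetical.

```python
import numpy as np

def sq_euclid(a, b):
    """Stand-in divergence: squared Euclidean distance between two vectors."""
    return np.sum((a - b) ** 2, axis=-1)

def validity_indices(V, W, psi, m=2.0, div=sq_euclid):
    """Return (SVF, XiB) for data V (n, d), prototypes W (K, d), assignments psi (n, K)."""
    K = len(W)
    # Symmetrized divergence between all prototype pairs; the diagonal is zero.
    pair = np.array([[div(W[j], W[i]) + div(W[i], W[j]) for j in range(K)]
                     for i in range(K)])
    avg_d = pair.sum(axis=1) / (K - 1)                     # mean over the other prototypes
    dist = np.array([[div(v, w) for w in W] for v in V])   # shape (n, K)
    weighted = psi ** m * dist
    svf = avg_d.sum() / weighted.max(axis=0).sum()         # separation / compactness
    xib = weighted.sum() / np.mean(avg_d ** 2)
    return svf, xib
```

For crisp methods such as NG or SOM, `psi` would contain a single one per row; for FCM it would hold the fuzzy assignments.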
Table 1. SVF and XiB index for leaf shape (top) and the AVIRIS dataset (bottom) with the different methods

Leaf shape data (top):
                                        SVF                        XiB
                                 NG      SOM     FCM       NG      SOM     FCM
Euclid             no rel.       0.538   0.090    9.442    1.575   1.933   0.123
                   with rel.     0.688   0.219   10.424    1.212   1.266   0.119
                   func. rel.            0.235                     1.057
Cauchy-Schwarz     no rel.       1.472   0.330    9.108    0.908   1.274   0.162
div.               with rel.     2.011   0.350    9.870    0.608   1.102   0.147
                   func. rel.            0.312                     1.224
Kullback-Leibler   no rel.       1.499   0.238   11.537    0.460   1.545   0.125
div.               with rel.     1.856   0.255   11.270    0.608   1.497   0.140
                   func. rel.            0.339                     1.082
Rényi div.         no rel.       1.760   0.235    8.997    0.467   1.661   0.149
(α = 2)            with rel.     2.384   0.245   10.116    0.415   1.514   0.149
                   func. rel.            0.348                     1.112

AVIRIS data (bottom): same rows and columns; the values are not recoverable from the source.
4.3 Experimental Results
We applied all three algorithms, NG, SOM, and FCM with $m = 2$, to both data sets with and without relevance learning, using 20 prototypes for each algorithm. We used the quadratic Rényi divergence $D^{GR}_2$, the β-divergence $D_\beta$ for $\beta = 2$ (Euclidean distance), the Cauchy-Schwarz divergence $D_{CS}$, and the Kullback-Leibler divergence $D_{GKL}$. Additionally, we performed functional relevance learning for the SOM scheme. All settings are trained as usual in online vector quantization, i.e. with a decreasing learning rate until convergence, and with adiabatic relevance adaptation [14]. The results are given in Tab. 1, whereby the variances of the values (10-fold cross validation) are at least two orders of magnitude lower.
We observe that for the low-dimensional leaf shape data NG and SOM consistently improve with relevance learning. Further, according to the better XiB and SVF values, NG is in general superior to SOM, as expected from well-known earlier findings [20]. FCM also profits from relevance learning
Fig. 3. Relevance profiles obtained according to the different divergences and algorithms for both data sets, leaf shape (left) and AVIRIS data (right): NG - solid, SOM - dotted (functional relevance), FCM - dashed
Conclusions
In this article we propose relevance learning for unsupervised online vector quantization using weighted divergences as dissimilarity measure. We demonstrate that an improvement of the performance can generally be expected. However, the behavior seems to be sensitive (at least) to the algorithm of choice. Further, difficulties may arise in the case of high-dimensional data, because the number of parameters to be optimized (relevance weights) then becomes huge. In the light of the SOM results, functional relevance learning could offer an alternative here [29].
References
1. Arnonkijpanich, B., Hasenfuss, A., Hammer, B.: Local matrix adaptation in topographic neural maps. Neurocomputing 74(4), 522-539 (2011)
2. Backhaus, A., Kuwabara, A., Bauch, M., Monk, N., Sanguinetti, G., Fleming, A.: LEAFPROCESSOR: a new leaf phenotyping tool using contour bending energy and shape cluster analysis. New Phytologist 187(1), 251-261 (2010)
3. Bezdek, J.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, New York (1981)
4. Bouveyron, C., Girard, S., Schmid, C.: High-dimensional data clustering. Computational Statistics and Data Analysis 57(1), 502-519 (2007)
5. Bunte, K., Hammer, B., Wismüller, A., Biehl, M.: Adaptive local dissimilarity measures for discriminative dimension reduction of labeled data. Neurocomputing 73, 1074-1092 (2010)
6. Camps-Valls, G., Bruzzone, L.: Kernel-based methods for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 43(6), 1351-1362 (2005)
1 A direct comparison of the values is not admissible because of the quadratic fuzzy valued assignments $(\psi_i(v))^2$.
29. Villmann, T., Cichocki, A., Principe, J.: Information theory related learning. In: Verleysen, M. (ed.) Proc. of European Symposium on Artificial Neural Networks (ESANN 2011), Evere, Belgium, d-side publications (2011) (page in press)
30. Villmann, T., Der, R., Herrmann, M., Martinetz, T.: Topology Preservation in Self-Organizing Feature Maps: Exact Definition and Measurement. IEEE Transactions on Neural Networks 8(2), 256-266 (1997)
31. Villmann, T., Haase, S.: Divergence based vector quantization. Neural Computation 23(5), 1343-1392 (2011)
32. Villmann, T., Hammer, B.: Theoretical aspects of kernel GLVQ with differentiable kernel. IfI Technical Report Series (IfI-09-12), pp. 133-141 (2009)
33. Villmann, T., Schleif, F.-M.: Functional vector quantization by neural maps. In: Chanussot, J. (ed.) Proceedings of First Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS 2009), pp. 1-4. IEEE Press, Los Alamitos (2009); ISBN 978-1-4244-4948-4
34. Xie, X., Beni, G.: A validity measure for fuzzy clustering. IEEE Transactions on Pat 13(8), 841-847 (1991)
35. Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Transactions on Neural Networks 16(3), 645-678 (2005)
36. Young, I.T., Walker, J.E., Bowie, J.E.: An analysis technique for biological shape. I. Information and Control 25(4), 357-370 (1974)
37. Zalik, K., Zalik, B.: Validity index for clusters of different sizes and densities. Pattern Recognition Letters 32, 221-234 (2011)
Abstract. The aim of this work is to discover the principles of learning a group
of dynamical systems. The learning mechanism, which is referred to as the Learning Algorithm of Multiple Dynamics (LAMD), is expected to satisfy the following four requirements. (i) Given a set of time-series sequences for training,
estimate the dynamics and their latent variables. (ii) Order the dynamical systems according to the similarities between them. (iii) Interpolate intermediate
dynamics from the given dynamics. (iv) After training, the LAMD should be
able to identify or classify new sequences. For this purpose several algorithms
have been proposed, such as the Recurrent Neural Network with Parametric Bias
and the modular network SOM with recurrent network modules. In this paper,
it is shown that these types of algorithms do not satisfy the above requirements,
but can be improved by normalization of estimated latent variables. This confirms that the estimation process of latent variables plays an important role in the
LAMD. Finally, we show that a fully latent space model is required to satisfy the
requirements, for which purpose a higher-rank SOM, such as a SOM$^2$, is
best suited.
1 Introduction
The focus of this work is to develop learning algorithms that enable us to deal with a
group of dynamical systems. More precisely, the aim of this work is to discover common principles shared by Learning Algorithms of Multiple Dynamics (LAMD), which
have the ability to learn, represent, estimate, quantize, interpolate, and classify a class
of dynamics. The LAMD is a useful tool when the system dynamics changes depending
on the context or the environment. In particular, autonomous intelligent robots need an
LAMD with excellent performance, because such robots are required to act in various
environments, and to deal with various objects in a variety of ways.
For this purpose, modular network architectures have often been adopted. One of the
representative architectures is MOdular Selection And Identification for Control (MOSAIC) proposed by Wolpert & Kawato [1]. Sometimes a SOM is combined with the
modular network. Minamino et al. applied a SOM with recurrent neural network modules (RNN-SOM) to the humanoid QRIO, providing QRIO with the capability of multischema learning [2]. Minatohara & Furukawa developed the Self-Organizing Adaptive
Controller (SOAC) [3,4] based on the mnSOM [5]. SOMAR also belongs to this type
[6]. In some cases, Jordan's plan unit [7] has been adopted as an alternative to the modular network architecture. Tani developed the Parametric Bias (PB) method based on
Jordan's plan unit and applied it to multi-schema acquisition tasks for robots [8].
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 101-110, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Despite the fact that these algorithms seem to have yielded some success, little
attention has been paid to the problems often encountered by these algorithms, such
as learning instability, calculation time, confusion, and the interference of memories.
The reason these problems have not been exposed is that robotic tasks are too complicated for theoretical analysis, while toy problems are often too simple to elucidate
them. Ohkubo & Furukawa showed that these problems occur regularly even in ordinary learning tasks, and naive algorithms are only effective in limited cases, such as
toy problems [9,10]. To make matters worse, the risk of such problems seems to increase when the system possesses latent variables, e.g., internal state variables. These
are common in the case of dynamical systems.
In this paper we report that conventional naive algorithms do not work adequately
in learning tasks of multiple dynamics. We also point out the importance of latent variable estimation, which should be carried out carefully so that the latent variables are
consistent for all dynamics.
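[Fig. 1: diagram with the labels internal signal, observable variable, and latent variable; the caption is not recoverable from the source.]
$$z(t+1) = f(z(t); \theta) \qquad (1)$$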
which is modified by the internal signal $\theta$. Here, $z(t)$ is the state variable consisting of an observable part $x(t)$ and a latent (unobservable) part $y(t)$, i.e., $z(t) = (x(t), y(t))$. (In this paper, the sequence of the latent variable $y(t)$ is called a latent sequence.) It is also assumed that the internal signal $\theta$ alters the dynamics continuously, so that similar values of $\theta$ produce similar dynamics. Using this model, a set of training sequences can be obtained in the following way. First, a set of internal signals $\Theta = \{\theta_1, \ldots, \theta_N\}$ is randomly generated (the prior distribution $p(\theta)$ can be defined, if needed), and $f_n(z) \triangleq f(z; \theta_n)$ is obtained for each $\theta_n$. Then, a time series is generated for each $\theta_n$ using the difference equation $z_n(t + 1) = f_n(z_n(t))$. (External noise $\epsilon(t)$ can be added to $z(t)$ if necessary.) Finally, a set of sequences $X = \{x_n(t)\}$ is observed, while $\{\theta_n\}$, $\{y_n(t)\}$, and the function set $\{f_n(z)\}$ are all hidden from the LAMD.
With this framework, the LAMD is expected to satisfy the following requirements.
Requirement 1: The LAMD learns the dynamics of the training sequences $X = \{x_n(t)\}$ and thus needs to estimate the function set $\{f_n(z)\}$ and the set of latent sequences $\{y_n(t)\}$.
Requirement 2: The LAMD measures the distances or similarities between the given sequences, and orders them. Thus, the LAMD needs to estimate the internal signals $\{\theta_n\}$ and to sort the sequences by $\theta_n$.
Requirement 3: Given an intermediate $\theta$, the LAMD represents the intermediate dynamics between the given sequences.
Requirement 4: If a new sequence $x_{new}(t)$ is given after training has finished, the LAMD classifies it and allocates it to the appropriate position between the training sequences. Thus, the LAMD needs to identify the internal signal $\theta_{new}$ of the new sequence.
We define those algorithms satisfying the above four requirements as the LAMD in this
paper. Note that the LAMD refers to a theoretical concept of multi-dynamics learning
rather than a particular algorithm. There can be various implementations of the LAMD,
and the aim of this paper is to discover the common principles of the LAMD.
3 Conventional Methods
3.1 Representative Architectures
To implement the LAMD using a neural network, there are two major approaches (Fig. 2). One is a modular network structure, such as a SOM with RNN modules or with autoregressive (AR) modules. The modular network SOM with RNNs (RNN-mnSOM) shown in Fig. 2(a) is representative of this approach [11]. In the RNN-mnSOM, every nodal unit of the SOM is replaced by an RNN module, and the location of the winner (i.e., the best matching module of the given sequence) in the feature map space represents the control signal $\theta$. This approach is sometimes referred to as the local representation, because each dynamics is represented separately in an RNN module.
The other approach is to use Jordan's plan unit [7]. A representative architecture is the recurrent network with parametric bias (RNNPB), shown in Fig. 2(b) [8]. RNNPB is a variation of the multi-layer perceptron (MLP) with recurrent connections, and the input-output relation is modified by additional bias units called parametric bias. The parametric bias vector (i.e., the set of parametric bias values) represents the internal signal $\theta$, which is determined by a back-propagation algorithm. This approach is referred to as the distributed representation, because the representation of the dynamics is distributed across the network.
Fig. 2. Conventional learning algorithms of multiple dynamics. (a) RNN-mnSOM. (b) RNNPB.
3.2 Naive Algorithm
Though the appearance of these two architectures is quite different, in essence the two
approaches have much in common.
1. For each training sequence, the least error node (e.g. RNN-module), or the least
error parametric bias vector becomes the winner of the sequence.
2. The winner and its neighbors are trained to represent the dynamics of the training sequence. In the RNN-mnSOM, the neighbors are determined explicitly by the
neighborhood function, whereas in the RNNPB they are determined implicitly by
the sigmoid curve of the neurons.
3. As a result, the network is trained by a set of sequences, which are mixed with the
weights determined by the neighborhood. That is, for a particular $\theta$ (i.e., each RNN module or each parametric bias vector), the network is trained so as to minimize
the weighted sum of the square error of multiple training sequences.
In this paper, algorithms using this strategy are called naive algorithms. In the case of the RNN-mnSOM, this naive algorithm is formulated as follows. Let $\hat{x}_n^k(t)$ be the output of the $k$-th RNN module. Then, the square error between the given sequence $x_n(t)$ and the output of the $k$-th module $\hat{x}_n^k(t)$ is given by
$$E_n^k = \frac{1}{2} \sum_{t=1}^{T} \left\| x_n(t) - \hat{x}_n^k(t) \right\|^2. \qquad (2)$$
The winner module and the estimated internal signal are expressed as
$$k_n^* = \arg\min_k E_n^k \qquad (3)$$
$$\hat{\theta}_n = \xi^{k_n^*}. \qquad (4)$$
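Here, $\xi^k$ denotes the coordinates of the $k$-th module in the feature map space. Using the neighborhood function, the learning coefficients are determined for every module.
$$a_n^k = \frac{\exp\left(-\|\xi^k - \hat{\theta}_n\|^2 / 2\sigma^2\right)}{\sum_{n'} \exp\left(-\|\xi^k - \hat{\theta}_{n'}\|^2 / 2\sigma^2\right)} \qquad (5)$$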
Finally, the connection weight $w^k$ of each RNN module is updated by a back-propagation algorithm (more precisely, the back-propagation through time algorithm) as follows.
$$E^k = \sum_{n=1}^{N} a_n^k E_n^k \qquad (6)$$
$$\Delta w^k = -\eta \frac{\partial E^k}{\partial w^k} = -\eta \sum_{n=1}^{N} a_n^k \frac{\partial E_n^k}{\partial w^k} \qquad (7)$$
$$E = \sum_{k=1}^{K} E^k = \sum_{k=1}^{K} \sum_{n=1}^{N} a_n^k E_n^k. \qquad (8)$$
Note that this objective function is for the update process of the RNN modules, and is not the objective function for self-organization.
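The winner selection and the learning coefficients of the naive algorithm can be sketched as follows, assuming the errors $E_n^k$ have already been computed by running every module on every sequence. The array layout and the name `naive_coefficients` are hypothetical, and the RNN modules themselves are not modeled.

```python
import numpy as np

def naive_coefficients(E, xi, sigma):
    """E: (N, K) errors E_n^k; xi: (K, d) module coordinates in the feature map."""
    winners = E.argmin(axis=1)                      # winner module k_n^* per sequence
    theta_hat = xi[winners]                         # estimated internal signals, (N, d)
    d2 = np.sum((xi[None, :, :] - theta_hat[:, None, :]) ** 2, axis=2)   # (N, K)
    g = np.exp(-d2 / (2.0 * sigma ** 2))            # neighborhood function
    a = g / g.sum(axis=0, keepdims=True)            # normalize over the sequences n
    E_k = (a * E).sum(axis=0)                       # weighted error per module
    return winners, a, E_k
```

Each module would then be trained by gradient descent on its own `E_k`, which mixes all training sequences with the weights `a[:, k]`.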
3.3 What Is the Problem?
In the naive algorithm presented above, users expect the intermediate dynamics to be represented by minimizing the weighted sum of the square errors. Thus, the update process described in (7) determines whether the algorithm works as expected. Unfortunately, this naive algorithm fails to satisfy user expectations. To ascertain what the problem is, let us consider the simplest situation, in which we only have one RNN module and two training sequences. The task of the RNN module is to represent the intermediate dynamics of the two training sequences. In this case, the naive algorithm becomes
$$\Delta w = -\eta \left( \frac{1}{2} \frac{\partial E_1}{\partial w} + \frac{1}{2} \frac{\partial E_2}{\partial w} \right). \qquad (9)$$
For simulation, training sequences were generated by the Hénon map,
$$x_n(t+1) = 1 - a_n x_n^2(t) + y_n(t)$$
$$y_n(t+1) = b\, x_n(t), \qquad (10)$$
[Fig. 3: two 3-D panels, (a) and (b), with axes x(t), x(t+1), and x(t+2); each panel shows the surfaces labeled M2 and S3 (a = 1.2).]
Fig. 3. A single RNN is trained by two time series of a Hénon map with different parameters, a = 1.4 and a = 1.0. The gray surface of each panel represents the dynamics expressed by the RNN, while the black surface represents the desired dynamics when an intermediate parameter is used, i.e., a = 1.2. (a) The result of the naive algorithm. The RNN represents both training dynamics (a = 1.0 and a = 1.4) within the same network. (b) The result of the natural algorithm. The RNN succeeds in representing the intermediate dynamics.
where $x_n(t)$ and $y_n(t)$ are the observable and latent variables, respectively. To generate $x_1(t)$ and $x_2(t)$, $a_1 = 1.4$ and $a_2 = 1.0$ were used. Thus, the task of the RNN is to estimate the dynamics with parameter $a = 1.2$. The result, illustrated in Fig. 3(a), shows that the RNN represents both training dynamics simultaneously, but does not represent the intermediate dynamics.
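The training data for this experiment can be reproduced in spirit with the sketch below. The value $b = 0.3$, the initial state, and the sequence length are assumptions (the text does not state them); only the observable sequence $x_n(t)$ would be handed to the learner.

```python
import numpy as np

def henon_sequence(a, b=0.3, T=200, x0=0.0, y0=0.0):
    """Iterate the Henon map; b, T, x0 and y0 are assumed values."""
    x = np.empty(T)
    y = np.empty(T)
    x[0], y[0] = x0, y0
    for t in range(T - 1):
        x[t + 1] = 1.0 - a * x[t] ** 2 + y[t]   # observable variable
        y[t + 1] = b * x[t]                     # latent variable
    return x, y

# Two training sequences; the latent parts are discarded before training.
x1, _ = henon_sequence(a=1.4)
x2, _ = henon_sequence(a=1.0)
```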
One may wonder why a single RNN can represent two dynamics simultaneously instead of representing the intermediate dynamics. The reason is the arbitrariness of the latent variable estimation. To reproduce a training sequence, $y(t)$ may be transformed in scale and bias. For example, if the latent variable $y(t)$ is transformed as $y'(t) = \alpha y(t) + \beta$, it is easy to modify the difference equations so as to produce the same $x(t)$. It is also possible to transform $y(t)$ by a monotonic nonlinear function. Therefore, the latent sequence estimation is an ill-posed problem. This fact allows the network to represent two dynamics simultaneously. Suppose that $y(t)$ is always positive, i.e., $y(t) > 0$, when the RNN represents $x_1(t)$, whereas $y(t) < 0$ for $x_2(t)$. Then the RNN can represent both dynamics within a single architecture by segregating the latent variable regions. Obviously this case minimizes the square error for both training sequences. It is also worth stressing that no training data for the intermediate dynamics is given to the RNN.
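The arbitrariness argument can be made concrete with a small sketch: a Hénon map rewritten in terms of a rescaled and shifted latent variable produces exactly the same observable sequence. The scale and bias values, $b = 0.3$, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def henon(a, b, T, x0=0.0, y0=0.0):
    """Original Henon map with latent variable y."""
    x, y = np.empty(T), np.empty(T)
    x[0], y[0] = x0, y0
    for t in range(T - 1):
        x[t + 1] = 1.0 - a * x[t] ** 2 + y[t]
        y[t + 1] = b * x[t]
    return x, y

def henon_transformed(a, b, alpha, beta, T, x0=0.0, y0=0.0):
    """Same dynamics expressed with the latent variable y' = alpha * y + beta."""
    x, yp = np.empty(T), np.empty(T)
    x[0], yp[0] = x0, alpha * y0 + beta
    for t in range(T - 1):
        x[t + 1] = 1.0 - a * x[t] ** 2 + (yp[t] - beta) / alpha   # undo the transform
        yp[t + 1] = alpha * b * x[t] + beta                       # transformed latent update
    return x, yp
```

Both functions return the same `x` up to rounding, although their latent sequences occupy different regions, so the observable data alone cannot fix the scale or offset of the latent variable.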
of $\hat{y}_n(t)$ are both 1. The weight connections of the winner module also need to be compensated so as to produce the same output.
Modification 2: The normalized latent sequence of the winner module is shared by all modules. Thus, the estimated latent sequences in the non-winner modules are all rejected, and they are replaced by the winner's latent sequence.
Modification 1 makes the probability density $p(\hat{y}_n^k)$ consistent for all training sequences, i.e., independent of $n$, whereas Modification 2 makes $\hat{y}_n^k$ consistent for all modules, i.e., independent of $k$. This modified algorithm is hereafter referred to as the natural algorithm.
Simulation results for the natural algorithm using a Hénon map are shown in Fig. 3(b). Using the natural algorithm, the network succeeds in representing the intermediate dynamics. Again it is worth noting that there is no training data for the intermediate dynamics.
4.2 RNN-mnSOM and RNNPB with the Natural Algorithm
To compare the naive and natural algorithms, both algorithms were implemented in an RNN-mnSOM and an RNNPB. To investigate the difference, time series of simple harmonic waves were used for simulation. The difference equation is given by
$$\begin{pmatrix} x_n(t+1) \\ y_n(t+1) \end{pmatrix} = \begin{pmatrix} \cos\theta_n & -\sin\theta_n \\ \sin\theta_n & \cos\theta_n \end{pmatrix} \begin{pmatrix} x_n(t) \\ y_n(t) \end{pmatrix}, \qquad (11)$$
where $x_n(t)$ and $y_n(t)$ denote the observable and latent sequences, respectively. The parameters used for the training sequences were $\theta_n = 0.8, 1.0, 1.2, 1.4$, while other values between 0.7 and 1.5 were used for the test.
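A sketch of how such harmonic training sequences could be generated is given below. The placement of the minus sign in the rotation matrix, the initial state, and the sequence length are assumptions; the observable sequence is a cosine wave under either sign convention.

```python
import numpy as np

def rotation_sequence(theta, T=50, z0=(1.0, 0.0)):
    """Iterate a 2-D rotation by angle theta; returns observable x and latent y."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    z = np.empty((T, 2))
    z[0] = z0
    for t in range(T - 1):
        z[t + 1] = R @ z[t]
    return z[:, 0], z[:, 1]

# Four training sequences with the parameter values quoted in the text.
training = [rotation_sequence(theta)[0] for theta in (0.8, 1.0, 1.2, 1.4)]
```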
The tasks for the RNN-mnSOM and RNNPB were set as follows: (i) learn four training sequences, (ii) sort them into a one-dimensional parameter space, and (iii) interpolate between the dynamics of the training sequences. To achieve these, the RNN-mnSOM was given 7 RNN modules with a one-dimensional feature space, while the
RNNPB had 1 parametric bias unit.
Fig. 4 gives the results for the RNN-mnSOM and RNNPB. In both cases, the naive
algorithm failed to order the given sequences, while the natural algorithm succeeded.
[Fig. 4: four panels; axes labeled "Module #" in (a), (b) and "PB" in (c), (d).]
Fig. 4. Results for the RNN-mnSOM (a), (b) and the RNNPB (c), (d) with the naive and natural algorithms, respectively.
Here, $f_1$ and $f_2$ are the difference or differential equations of the two dynamics. It is necessary to emphasize that the aim of the architecture is to deal with a set of dynamics that can be defined by a set of difference or differential equations. Therefore the distance between equations is more essential than the distance between observed time sequences. In (13), $p(z) = p(x, y)$ is the probability density function of the state variable. If the probabilities of the observable and latent variables are assumed to be independent, then (13) becomes
$$L^2(f_1, f_2) = \iint \left\| f_1(x, y) - f_2(x, y) \right\|^2 p(x)\, p(y)\, dx\, dy. \qquad (14)$$
Fig. 5. A map of weather dynamics organized by a SOM$^2$. Weather trajectories during one week
for 153 cities are indicated in the winner node.
It is also worth stressing that the density function $p(x, y)$ should be common for all $f_n(x, y)$; otherwise the distance cannot be defined. As mentioned above, the latent variable estimation suffers from the arbitrariness problem, and the PDF $p(y)$ changes depending on the modules and the sequences. This is why consistent estimation of the latent variable in the natural algorithm is necessary.
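The distance between two dynamics can be estimated by Monte Carlo sampling, as in the following sketch. It assumes Hénon-type maps with $b = 0.3$ as example dynamics and a shared set of samples drawn from $p(x)p(y)$; `henon_map` and `dynamics_distance` are hypothetical names.

```python
import numpy as np

def henon_map(a, b=0.3):
    """Return f(x, y) for a Henon-type map; b = 0.3 is an assumed value."""
    def f(x, y):
        return np.stack([1.0 - a * x ** 2 + y, b * x], axis=-1)
    return f

def dynamics_distance(f1, f2, xs, ys):
    """Monte Carlo estimate of the squared distance between two dynamics,
    using samples (xs, ys) drawn from a common density p(x) p(y)."""
    diff = f1(xs, ys) - f2(xs, ys)
    return np.mean(np.sum(diff ** 2, axis=-1))

rng = np.random.default_rng(0)
xs, ys = rng.normal(size=1000), rng.normal(size=1000)
d_far = dynamics_distance(henon_map(1.0), henon_map(1.4), xs, ys)
d_near = dynamics_distance(henon_map(1.0), henon_map(1.2), xs, ys)
```

The estimate is only meaningful when the same samples, and hence the same density, are used for every pair of dynamics, which is the consistency requirement stated above.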
Though the above heuristic natural algorithm for the RNN-mnSOM and RNNPB performed well in the simulations, the algorithm still has difficulty in more practical cases. One of the reasons is that the probability density of the observable variable $p(x)$ also depends on the internal parameter $\theta$ in many cases. Furthermore, the observations are usually high-dimensional vectors distributed in a nonlinear subspace of a high-dimensional data space. In such cases, a simple normalization of each component of $x$ is not enough, because the components are usually not independent of each other. Therefore, we need a true natural algorithm that can be derived theoretically, rather than heuristically.
Now let us consider the case in which $p(x, y)$ depends on the internal parameter $\theta$. Furthermore, $x(t)$ is supposed to be distributed nonlinearly in a high-dimensional space. In such a case, the dynamics is formulated as follows.
$$\zeta(t+1) = f(\zeta(t)) \qquad (15)$$
$$z = (x, y) = g(\zeta; \theta) \qquad (16)$$
Here, $\zeta$ is the low-dimensional intrinsic state variable that governs the class of dynamics, and the mapping from the intrinsic state to the actual variables is supposed to be modified by $\theta$. In other words, an intrinsic dynamics for $\zeta(t)$ exists, and the observation is altered by the context or the environment. Obviously this formulation includes the previous simple cases. By applying embedding theory, (16) is equivalent to
$$\tilde{z}(t) \triangleq (x(t), x(t-1), \ldots, x(t-L+1)) \qquad (17)$$
$$\tilde{z}(t) = \tilde{g}(\zeta; \theta). \qquad (18)$$
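The delay-coordinate vector in (17) can be built from an observed sequence as in this sketch; the function name `delay_embed` and the row layout are hypothetical choices.

```python
import numpy as np

def delay_embed(x, L):
    """Rows are (x(t), x(t-1), ..., x(t-L+1)) for t = L-1, ..., T-1."""
    x = np.asarray(x)
    T = len(x)
    return np.stack([x[L - 1 - j: T - j] for j in range(L)], axis=1)
```

For a scalar sequence of length T this yields T - L + 1 embedded vectors of dimension L.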
6 Conclusion
In this paper, we discussed what is required for the learning of multiple dynamics, and
pointed out the importance of the natural algorithm. Currently, the SOM$^2$ is the best
solution, but a theoretical derivation has not yet been done. This remains future work.
Acknowledgement. This work is supported by KAKENHI 22120510A10.
References
1. Wolpert, D.M., Kawato, M.: Neural Networks 11, 1317-1329 (1998)
2. Minamino, K.: Developing Intelligence. Intelligence Dynamics Series, vol. 3, pp. 73-111. Springer, Japan (2008) (in Japanese)
3. Minatohara, T., Furukawa, T.: Proc. WSOM 2005, pp. 41-48 (2005)
4. Minatohara, T., Furukawa, T.: IJICIC 7 (in press, 2011)
5. Tokunaga, K., Furukawa, T.: Neural Networks 22, 82-90 (2009)
6. Yin, H., Ni, H.: Generalized Self-Organizing Mixture Autoregressive Model. In: Príncipe, J.C., Miikkulainen, R. (eds.) WSOM 2009. LNCS, vol. 5629, pp. 353-361. Springer, Heidelberg (2009)
7. Jordan, M.I.: ICS Report, 8604 (1986)
8. Tani, J.: IEEE Trans. SMC Part A 33, 481-488 (2003)
9. Ohkubo, T., Tokunaga, K., Furukawa, T.: Int. Congress Series, vol. 1301, pp. 168-171 (2007)
10. Ohkubo, T., Tokunaga, K., Furukawa, T.: IEICE Trans. Inf. & Sys. E92-D, 1388-1396 (2009)
11. Kaneko, S., Furukawa, T.: Proc. WSOM 2005, pp. 537-544 (2005)
12. Furukawa, T.: Neural Networks 22, 463-478 (2009)
Introduction
A growing neural network (GNN), such as the Growing Neural Gas (GNG) [1],
represents the topology of data with a graph network using online-learning data
vectors input sequentially. Finding the topology of input data vectors is important in various applications, such as object recognition, character recognition,
structure recognition, and so on. It is expected that a GNN could be used in the
learning system for a mobile robot, since the GNN can find the topology of data
dynamically using online learning.
In the past, various algorithms for GNNs have been proposed. However, all
these algorithms build the graph network directly from observed data vectors. In
other words, the graph networks built by conventional GNNs do not represent
the generative model of the observed data. Conventional GNNs are prone to
the following problems: sensitivity to noise, generation of redundant nodes, and
having the learning results depend heavily on the learning parameters.
The aim of this work is to develop an algorithm for the GNN from the perspective of a generative model. In this paper, an algorithm for the GNN based on an online Gaussian Mixture Model (GMM) is proposed; henceforth, this
proposed method is referred to as the GGMM: Growing GMM. The GMM represents a probability density function (PDF) combining multiple Gaussian kernels. Thus, the GMM builds the generative model represented by the PDF. The
GGMM is an extension of the GMM that includes the following mechanisms: online
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 111-120, 2011.
© Springer-Verlag Berlin Heidelberg 2011
K. Tokunaga
learning, growing of Gaussian kernels and finding topologies between kernels using
graph paths.
This paper discusses the theory and an algorithm for the GGMM. In addition, results of experiments comparing the GGMM with three typical GNNs
(Growing Neural Gas, Evolving SOM [2], and Self-Organizing Incremental Neural Network [3]) are presented.
The proposed method, the GGMM, performs three processes concurrently: (1)
online parameter estimation of the GMM, (2) controlling the generation of Gaussian kernels using an information criterion, and (3) generation of a path representing the topology between kernels and updating the path's strength. First,
a framework for the proposed method is presented, and then each of the above
processes is explained in turn.
2.1
Framework
The framework of tasks to which the proposed method will be applied is given
below.
- Sequentially observed data vectors cannot be stored in memory.
- The appropriate number of Gaussian kernels is unknown. (In the first stage
of learning, the number of kernels is one.)
- The parameters (mixing parameter, mean, and covariance matrix) of
each kernel are unknown.
- The class information for data vectors is unknown.
- Noise is added to the data vector.
Under the above conditions, processes to find the generative model of the input
data and generate the paths are performed simultaneously in online learning.
2.2
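$$p_K(x_t) = \sum_{k=1}^{K} \pi_k \, N(x_t; \mu_k, \Sigma_k), \quad \text{where} \quad \sum_{k=1}^{K} \pi_k = 1, \tag{1}$$
$$N(x_t; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x_t - \mu_k)^T \Sigma_k^{-1} (x_t - \mu_k) \right). \tag{2}$$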
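As an illustration of Equations (1), (2), and (7), the following minimal Python sketch evaluates the mixture density and the posterior weight of each kernel for a single input vector. It assumes diagonal covariance matrices so that only the standard library is needed; the function names and this simplification are assumptions of the sketch, not part of the GGMM as proposed.

```python
import math


def gaussian_pdf(x, mean, var):
    """Gaussian density of Eq. (2), restricted to a diagonal covariance diag(var)."""
    d = len(x)
    # log of the normalizing constant 1 / ((2*pi)^(d/2) * |Sigma|^(1/2))
    log_norm = -0.5 * (d * math.log(2.0 * math.pi) + sum(math.log(v) for v in var))
    # squared Mahalanobis distance (x - mean)^T Sigma^{-1} (x - mean)
    maha_sq = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return math.exp(log_norm - 0.5 * maha_sq)


def mixture_pdf(x, weights, means, variances):
    """Mixture density of Eq. (1): weighted sum of the kernel densities."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


def responsibilities(x, weights, means, variances):
    """Posterior weight of each kernel for x (the ratio form of Eq. (7))."""
    joint = [w * gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]
```

For example, an input lying midway between two equally weighted unit-variance kernels receives a posterior weight of 0.5 from each of them.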
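$$\gamma_k(x_t) = \frac{\pi_k^{old} \, N(x_t; \mu_k^{old}, \Sigma_k^{old})}{\sum_{k'=1}^{K} \pi_{k'}^{old} \, N(x_t; \mu_{k'}^{old}, \Sigma_{k'}^{old})}. \tag{7}$$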
2.3
(8)
Here, K, L(K), and C(K) are the number of kernels, the maximum likelihood in
K kernels, and the degrees of freedom of the model, respectively. In the GMM,
the number of kernels is chosen to maximize AIC(K). In other words, the number
of kernels is determined so that the maximum likelihood is given by as few kernels
as possible. Generally, in the GMM, the number of kernels is controlled offline
using all the input data. However, the GGMM controls the generation of the
kernel from a single observed data vector in online learning.
Next, the mechanism for the online generation of the kernel using the AIC
is explained. Suppose that data vectors $x_1, x_2, \ldots, x_{T-1}$ have been observed at
time $T-1$. Now, at time $T$, let a new input data vector $x_T$ be observed. At this
time, a new kernel is generated if AIC(K + 1) > AIC(K), that is,
$$\ln l(K+1) - \ln l(K) > C(K+1) - C(K), \tag{9}$$
where
$$l(K) = \prod_{t=1}^{T} p_K(x_t), \tag{10}$$
$$l(K+1) = \prod_{t=1}^{T} p_{K+1}(x_t). \tag{11}$$
(12)
K
'
&
k N (x; k , k ) + K+1 N x; K+1 , K+1 .
(13)
k=1
(14)
(15)
k=1
Here, we assume that, in the second term on the right-hand side, the contribution
of the past input data $x_t$, $t = 1, 2, \ldots, T-1$, can be ignored. Then, Equation (15)
can be approximated as:
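$$l(K+1) \approx \prod_{t=1}^{T-1} \left[ \sum_{k=1}^{K} \pi'_k \, N(x_t; \mu_k, \Sigma_k) \right] p_{K+1}(x_T) \tag{16}$$
$$= \prod_{t=1}^{T-1} \left[ (1-\varepsilon) \sum_{k=1}^{K} \pi_k \, N(x_t; \mu_k, \Sigma_k) \right] p_{K+1}(x_T) \tag{17}$$
$$= (1-\varepsilon)^{T-1} \prod_{t=1}^{T-1} p_K(x_t) \; p_{K+1}(x_T) \tag{18}$$
$$= (1-\varepsilon)^{T-1} \, \frac{l(K)}{p_K(x_T)} \, p_{K+1}(x_T). \tag{19}$$
Therefore,
$$\frac{l(K+1)}{l(K)} \approx (1-\varepsilon)^{T-1} \, \frac{p_{K+1}(x_T)}{p_K(x_T)}. \tag{20}$$
Finally, Equation (12) is derived by taking the logarithm of both sides of Equation (20). Here, when $T \gg 1$,
$$(T-1) \ln(1-\varepsilon) \approx -1. \tag{21}$$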
Therefore, it is possible to control the generation of the kernel using only the
current input data.
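The decision rule can be sketched in Python by combining Equation (9) with the logarithm of Equation (20). This is a hedged illustration, not the paper's exact Equation (12): the symbol `eps` for the factor appearing in Equation (20), the parameter count used for C(K) (a full-covariance GMM in d dimensions), and both function names are assumptions of the sketch.

```python
import math


def gmm_degrees_of_freedom(num_kernels, dim):
    """Free parameters of a full-covariance GMM (an assumed form of C(K)).

    Each kernel contributes a mean (dim), a symmetric covariance
    (dim * (dim + 1) / 2) and a mixing weight; the weights sum to one,
    which removes one degree of freedom overall.
    """
    per_kernel = dim + dim * (dim + 1) // 2 + 1
    return num_kernels * per_kernel - 1


def should_generate_kernel(p_k, p_k_plus_1, num_kernels, dim, t, eps):
    """Return True if adding a kernel is expected to increase the criterion.

    p_k        : density of the current input under the K-kernel model
    p_k_plus_1 : density of the current input under the (K+1)-kernel model
    t          : number of inputs observed so far (T in the text)
    eps        : assumed name for the factor in Eq. (20)
    """
    # log of Eq. (20): ln l(K+1) - ln l(K)
    log_gain = (math.log(p_k_plus_1) - math.log(p_k)
                + (t - 1) * math.log(1.0 - eps))
    # right-hand side of Eq. (9): C(K+1) - C(K)
    penalty = (gmm_degrees_of_freedom(num_kernels + 1, dim)
               - gmm_degrees_of_freedom(num_kernels, dim))
    return log_gain > penalty
```

Only the densities of the current input appear in the test, which is what allows the decision to be made online without storing past data.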
2.4
The GGMM simultaneously generates paths between the existing kernels and the
new kernel. These paths represent the topology between the data distributions
on the Gaussian kernels. In addition, a strength, which can be decreased or
increased with learning, is associated with each path.
The Mahalanobis distance between kernels is used to generate and update the
path. If the Mahalanobis distance between kernels is small, then the probability
that data vectors are distributed between these kernels is high. Thus, the probability that each kernel belongs to the same class is high. By contrast, if the
Mahalanobis distance between kernels is large, the probability that each kernel
is independent is high. Thus, the strength of the path represents the statistical
distance between kernels. Next, the underlying theory for generating and
updating the path is explained.
Generating the Path
A path is generated between the additional kernel and both the first and second
winner kernels, where the first and second winner kernels, k and k , are defined
as:
k = arg min k (xT )
k,
(22)
k
/ k .
(23)
In the first generation of the path, the initial strength of the path is set according
to the following definitions. Here, the notations for the variables are given as
follows. Suppose that s(K + 1, k ) and s(K + 1, k ) are the strengths of the
paths between K + 1 and k , and K + 1 and k , respectively. In addition, the
expressed as dK+1,k
. Then, the following equations describe the initial strength
K+1
of the path.
1
s(K + 1, k ) =
(24)
dK+1,k
/2
+ dK+1,k
K+1
k
1
(25)
s(K + 1, k ) =
K+1,k
dK+1
/2
+ dK+1,k
k
is an arbitrary parameter, such that 0 < < 1.0.
Updating the Path
Necessary and unnecessary paths are distinguished by updating the strength
of the path through learning. All paths connected to the first winner kernel are
updated using the equation below.
1
(26)
s(k , k )new = (1 )s(k , k )old +
k ,k
dk + dkk ,k /2
Thus, the strength of the path represents the expected value of the reciprocal of
the Mahalanobis distance between the kernels.
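The idea of a strength that tracks the expected reciprocal Mahalanobis distance can be sketched as an exponential moving average. The sketch below is an assumption-laden illustration: it uses diagonal covariances, averages the distance measured under each of the two kernels, and places the learning rate (`rate`, a hypothetical name) in front of the new term; the exact form of Equation (26) may differ.

```python
def mahalanobis_diag(a, b, var):
    """Mahalanobis distance between points a and b under a diagonal covariance."""
    return sum((ai - bi) ** 2 / vi for ai, bi, vi in zip(a, b, var)) ** 0.5


def reciprocal_distance(mean_a, var_a, mean_b, var_b):
    """Reciprocal of the mean of the two one-sided Mahalanobis distances.

    Raises ZeroDivisionError if the two kernel centres coincide.
    """
    d_a = mahalanobis_diag(mean_a, mean_b, var_a)  # measured with kernel a
    d_b = mahalanobis_diag(mean_a, mean_b, var_b)  # measured with kernel b
    return 1.0 / ((d_a + d_b) / 2.0)


def update_strength(old, mean_a, var_a, mean_b, var_b, rate):
    """Moving-average update of a path strength (assumed form of Eq. (26))."""
    target = reciprocal_distance(mean_a, var_a, mean_b, var_b)
    return (1.0 - rate) * old + rate * target
```

Repeated updates pull the strength toward the average reciprocal distance, so paths between kernels that stay statistically close remain strong while the others decay.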
Algorithm
The algorithm for the GGMM consists of four processes: (1) evaluation process,
(2-A) generation process, (2-B) update process, and (3) deletion process.
The following processes are repeated during learning.
(0) Initial Process
The initial number of kernels is 0. When the first observed data vector x1 is
given, the first kernel composed of the initial parameter in Equation (14) is
generated.
(1) Evaluation Process
The probability density function pK (xt ) in Equation (1) and the assumed probability density function pK+1 (xt ) in Equation (13) are calculated from the observed data vector xt . Next, whether to generate a new kernel is determined by
Equation (12). If Equation (12) is true, then go to (2-A), the generation process.
If Equation (12) is false, then go to (2-B), the update process.
Simulation
In this simulation, the proposed method is compared with three typical GNNs
(GNG, ESOM, and SOINN).
The Swiss roll data, shown in Fig. 1, were used as artificial data for the
simulation. Data vectors are distributed on two spirals. Two kinds of Swiss roll
data (Type 1, Type 2), differing with respect to the width of noise as shown in
Fig. 1, were used, with a total of 10000 data vectors.
It is difficult to evaluate each method within the same framework, since the
behavior of the learning parameters is different for each method. Therefore, in
this simulation, the parameters for each method were set so that two spiral graph
networks could be generated for Type 1 data. For both Type 1 and Type 2 data,
the learning parameters for each GNN were the same.
The learning results obtained from data vectors of Type 1 for each GNN are
shown in Fig. 2. Similarly, the learning results for Type 2 data are shown in
Fig. 3. According to the results for Type 1 data, all methods can successfully
represent two spiral graph networks. By contrast, in the results for the GNG,
SOINN, and ESOM with Type 2 data, the spiral data distribution cannot be
represented, since some nodes are allocated to noise. It is difficult to obtain the
desired results using those methods, for which the probability density functions
of the data distributions cannot be estimated. Additionally, with the Type 2
data, the GNG, ESOM, and SOINN cannot generate two spiral graph networks
no matter how the parameters are adjusted. By contrast, the GGMM successfully
formed graph networks in which the two spirals were well separated. Moreover,
almost the same results were obtained over several repeated experiments, since the
influence of the parameters of the GGMM on the result is small and
the number of nodes in the GGMM does not increase indefinitely.
[Fig. 2: learning results for Type 1 data; panels: GNG, SOINN, ESOM, GGMM]
[Fig. 3: learning results for Type 2 data; panels: GNG, SOINN, ESOM, GGMM]
These results confirm that the proposed method yields stable learning results
and is more robust to noise than conventional GNNs.
Summary
In this paper, the author proposed a growing neural network based on an online
Gaussian mixture model, which includes mechanisms to grow Gaussian
kernels and find topologies between kernels using graph paths. In the simulation,
the proposed method obtained stable learning results and was more robust to noise
than conventional GNNs. It is expected that the proposed method can be applied
as the fundamental algorithm in a system where a cognitive model is developed
from dynamically input sensor data in mobile robots. In a past work, the Self-Evolving Modular network (SEEM) was proposed as a method for multi-dynamic
learning in mobile robots. The SEEM increases the number of modules and
represents two or more dynamics. It is anticipated that the proposed method
can be applied as the backbone algorithm of the SEEM. In the future, the author
aims to derive more theoretically consolidated GNN algorithms based on Bayesian
estimation.
Acknowledgement
This research was partially supported by the Ministry of Education, Science,
Sports and Culture, Grant-in-Aid for Scientific Research on Innovative Areas,
22120510A10.
References
1. Fritzke, B.: A growing neural gas network learns topologies. Advances in Neural
Information Processing Systems 7, 625-632 (1995)
2. Deng, D., Kasabov, N.: On-line pattern analysis by evolving self-organizing maps.
Neurocomputing 51, 87-103 (2003)
3. Shen, F., Hasegawa, O.: An incremental network for on-line unsupervised classification and topology learning. Neural Networks 19(1), 90-106 (2006)
4. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on
Automatic Control 19(6), 716-723 (1974)
5. Schwarz, G.E.: Estimating the dimension of a model. Annals of Statistics 6(2),
461-464 (1978)
Abstract. An extension of a recently proposed evolutionary self-organizing map is introduced and applied to the tracking of objects in
video sequences. In the proposed approach, a geometric template consisting of a small number of keypoints is used to track an object that moves
smoothly. The coordinates of the keypoints and their neighborhood relations are associated with the coordinates of the nodes of a self-organizing
map that represents the object. Parameters of a local affine transformation associated with each neuron are updated by an evolutionary algorithm and used to map each template's keypoint in the previous frame
to the current one. Computer simulations indicate that the proposed approach presents better results than those obtained by a direct method
approach.
Keywords: Self-organizing neural networks, evolutionary algorithms, object tracking, video sequences.
Introduction
Firstly, we need to define the reference, current and candidate templates. Let
$I = \{I_0, I_1, \ldots, I_i\}$ be a sequence of indexed images and let $T_0, \ldots, T_i$ be the gray-level
intensities of templates defined on these images. The template (or patch) defined
in the first frame, $T_0$, is referred to as the reference template (or the reference
patch). When tracking from frame i to frame i + 1, we refer to frame i as the
current frame, and the template within this frame, $T_i$, as the current template.
The frame i + 1 is referred to as the target frame, and a template within this
frame, $T_{i+1}$, as a candidate template.
The Sum of Squared Differences (SSD) is used as a measure of matching
between templates. Let $x \in T_0$ be a feature point in the corresponding template.
Thus, the problem of finding a transformation parameter vector $p$ between $T_0$
and $T_i$ is formulated using SSD as
$$\hat{p} = \arg\min_{p} \sum_{x \in T_0} [T_i(x') - T_0(x)]^2 = \arg\min_{p} \sum_{x \in T_0} [T_i(w(x, p)) - T_0(x)]^2, \tag{1}$$
where $x' = w(x, p)$ is the projection of the feature point $x \in T_0$ onto the current
frame i. The SSD-based tracking problem can thus be stated as the task whose
goal is to select and track feature points from images $I_0$ to $I_{i+1}$. Assuming that
the transformation $w(x, \hat{p})$ from frame 0 to the current frame i is known, the
problem reduces to finding an increment $\Delta p$ for the transformation parameter
vector between $T_i$ and $T_{i+1}$ through an iterative method that solves
$$\Delta p = \arg\min_{\Delta p} \sum_{x' \in T_i} [T_{i+1}(w(x', \Delta p)) - T_i(x')]^2, \tag{2}$$
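To make the SSD criterion concrete, the following Python sketch scores candidate patches against a template and picks the best integer translation. It is a simplified stand-in, assuming a translation-only warp and an exhaustive local search instead of the iterative solver mentioned in the text; the function names and the list-of-lists image representation are assumptions of the sketch.

```python
def ssd(patch_a, patch_b):
    """Sum of squared gray-level differences between two equally sized patches."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))


def crop(image, top, left, height, width):
    """Extract a height x width patch whose upper-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def best_translation(image, template, top, left, radius):
    """Search integer shifts within +/- radius of (top, left) for the lowest SSD.

    Returns (ssd_value, dy, dx), or None if no shift keeps the patch in bounds.
    """
    height, width = len(template), len(template[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            # skip shifts that would place the patch outside the image
            if y < 0 or x < 0 or y + height > len(image) or x + width > len(image[0]):
                continue
            score = ssd(crop(image, y, x, height, width), template)
            if best is None or score < best[0]:
                best = (score, dy, dx)
    return best
```

A richer warp such as the affine transformation used in the paper would replace the two integer offsets with a parameter vector, but the matching score stays the same.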
[Figure 1 labels: patch model; keypoint; distance between patch centers]
Fig. 1. (a) A kite-shaped template with 5 patches and 8 distance links. (b) Typical
aspects that a kite-shaped template can assume during the tracking problem.
Object Representation
Note that for the standard SOM network [3], the output grid is regular, in the
sense that it has a well-defined geometric structure, e.g. rectangular, cylindrical
or toroidal, and the coordinates of the nodes are located at equally spaced positions, so that the distances between neighboring coordinates are equal. In the
proposed approach, the coordinates of the nodes correspond to the position of
the chosen keypoints, which do not need to be equally spaced from one another.
The only constraint is that, once the coordinates of the keypoints have been
chosen, the neighborhood relations between them should be preserved.
Figure 1b shows typical aspects and positions that the kite-shaped template
shown in Figure 1a can assume when subjected to affine transformations. This
synthetic template is used in one of our experiments to represent the object to be
tracked. In the next section, we summarize the theory of the evolutionary self-organizing neural network model used by the proposed object tracking algorithm.
2.2
(4)
where $W = \{w_1, \ldots, w_N\}$ denotes the whole set of weight vectors, and the parameters $\alpha, \beta \in [0, 10]$ weigh the relative importance of the indices with respect
to each other. The QE index assesses how well the map provides a compact
representation of the original data set. Mathematically, QE is defined as
$$QE(W) = \frac{1}{L} \sum_{l=1}^{L} \| x_l - w_{i(x_l)} \|, \tag{5}$$
where $i(x_l)$ is the winning neuron for pattern vector $x_l$, determined as
$$i(x_l) = \arg\min_{j} \| x_l - w_j \|. \tag{6}$$
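Equations (5) and (6) translate directly into a few lines of Python. The sketch below assumes Euclidean distance and plain tuples for the pattern and weight vectors; the function names are illustrative only.

```python
import math


def winner(x, weights):
    """Index of the weight vector closest to x, as in Eq. (6)."""
    return min(range(len(weights)), key=lambda j: math.dist(x, weights[j]))


def quantization_error(data, weights):
    """Mean distance between each pattern and its winning weight vector, Eq. (5)."""
    total = sum(math.dist(x, weights[winner(x, weights)]) for x in data)
    return total / len(data)
```

With two weight vectors at (0, 0) and (10, 0), the patterns (1, 0), (9, 0) and (0, 2) lie at distances 1, 1 and 2 from their winners, giving a quantization error of 4/3.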
[Figure 2 labels: search region; keypoints frame i; keypoints frame i+1]
Fig. 2. Region search for candidate patches in the vicinity of a model patch
$$PCC(W) = \frac{\sum_{m=1}^{N} \sum_{n=1}^{N} d(r_m, r_n) \, d(w_m, w_n)}{(N-1) \, S_r S_w}, \tag{7}$$
the next frame. The coordinates of the projected keypoints correspond to the new
coordinates of the nodes comprising the output grid of the EvSOM for the next
frame.
At the first stage, the algorithm selects a set of candidate patches at each
keypoint. These candidate patches are randomly searched in the current frame
(i.e. frame i) in the vicinity of the j-th model patch of the frame i (see Figure 2).
Thus, for a template with N keypoints, the result of the search process is N sets
of candidate patches. It is worth pointing out that the first stage of the proposed
algorithm aims at transforming the search space into a discrete set of candidate
solutions. By a candidate solution we mean a set of new node coordinates for
the EvSOM, which is equivalent to new positions for the template keypoints.
The maximum number of candidate patches per keypoint is a prespecified
value. Additionally, each candidate patch and the corresponding model patch
must satisfy a measure of matching whose value must be smaller than a given
threshold th. Just as an example, assuming that the number of candidate
patches per keypoint is M (all of them satisfying the required measure of matching), then for N keypoints there are $M^N$ potential solutions.
At the second stage, the proposed procedure for dealing with the joint task
of locating the object and updating its representation consists in evolving one
EvSOM per frame. For the initial frame (i.e. frame 0), the keypoints of the
manually selected initial template define the coordinates of the nodes of the
EvSOM for frame 0. From frame 1 onwards, we initialize the coordinates of
the nodes of the EvSOM for frame i + 1 with the coordinates of the nodes of
the EvSOM for frame i. At each frame, the complete set of candidate patches
defines a discrete search space within which the best solution is searched for.
By evolving the EvSOM for the frame i we mean finding, using an evolutionary algorithm, the optimum weight vector of the j-th node that encodes the
parameters of the affine transformation that maps the coordinates of that node
from frame i to frame i + 1. It is worth mentioning that learning the mappings
from the keypoints of frame i to the keypoints of frame i + 1 is equivalent to
locating (or tracking) the moving object.
In order to evaluate the degree of similarity between regions in two images,
a measure of matching between the reference patch and a candidate patch is
required by the fitness function. In this paper, we use an SSD-related index,
defined as
$$SSD = \sum_{k} [T_{i+1}(k, j) - T_i(k, j)]^2, \tag{8}$$
where $T_{i+1}(k, j)$ and $T_i(k, j)$ are, respectively, the intensities of gray levels in the
target and current templates. By introducing the SSD index into the EvSOM
fitness function, we get
$$Fitness(W) = \alpha \, PCC(W) - \beta \, SSD(W). \tag{9}$$
jointly optimizing the SSD index and the PCC index. In this paper, the PCC
index is a measure of correlation for the distances among neighborhood interest
points of the two images. The pseudo-code of the proposed EvSOM-based tracking algorithm is given below. In this algorithm the parameters $FIT_{best}$, $FIT_{max}$
and $G_{max}$ denote, respectively, the best fitness value for the current generation,
the maximum fitness value found until the current generation and the maximum
number of generations.
Algorithm 1. EvSOM-based tracking algorithm
1: Set i = 0. Then, manually extract a template with N keypoints and L links. This is the updated template for frame 0.
2: for all frames i + 1 do
3:   Set the number of EvSOM nodes equal to the number of keypoints of the updated template in frame i, with the coordinates of the keypoints assigned as the coordinates $r_j$ of the nodes in the output grid, following the topological constraints established by the distance links. Then, set the EvSOM weight vectors to $w_j = p_j = (0, 0, 0, 1)$, j = 1, 2, ..., N;
4:   Perform a random search in the neighborhood of the j-th node, in order to find a set $C_j$ containing at most $M_j$ candidate patches which must satisfy $SSD_j \le th$, for j = 1, ..., N. The neighborhood of the j-th node is defined as $p_j = (b_x^{(j)}, b_y^{(j)}, \theta^{(j)}, s^{(j)})$. For the k-th candidate patch associated with the j-th node, store its transformation vector $p_k^{(j)}$ and its $SSD_k^{(j)}$ value, for k = 1, ..., $M_j$ and j = 1, ..., N.
5:   Build the initial population of candidate templates by randomly taking a candidate patch from each set $C_j$, j = 1, ..., N, and assessing its fitness.
6:   while $FIT_{best} \le FIT_{max}$ and generation $\le G_{max}$ do
7:     Generate the offspring and compute the fitness values;
8:     Build the next population and assess its fitness;
9:   end while
10:  To avoid template drift, update each $p_j$, j = 1, ..., N, by solving Eq. (1) through the hill-climbing algorithm.
11:  Compute the transformations $w(r_j, p_j) = w(r_j, b_x^{(j)}, b_y^{(j)}, \theta^{(j)}, s^{(j)})$, compute the resulting RMSE and present the solution.
12: end for
e.g. the regular grid of the Kohonen map, since the distances from one node to the
others need not be equal. However, the links between the nodes do define
their neighborhood (i.e. topological) relationships, and must be preserved, i.e.
maintained as the tracking proceeds.
In order to diminish the template drift effect due to the accumulated error in
each stage, we solve Eq. (1), for each $p_j$, through hill-climbing search with a
fixed number of iterations $N_{hc}$. The resulting template is then considered as the
updated template. Finally, since the components of the weight vector $w_j = p_j$,
j = 1, ..., N, are real numbers, whereas the coordinates of the keypoints in the image
can assume only integer values, we have to quantize and interpolate the values of
the coordinates of the projected keypoints to the closest integer values.
For the interested reader, the other two video sequences and corresponding results
are available upon request.
website: http://esm.gforge.inria.fr/ESMdownloads.html
[Figure 3 panels, both rows: frame 000, frame 075, frame 150, frame 200]
Fig. 3. Sequence of 4 frames showing the object of interest being tracked. Upper row:
changes in the associated template. Lower row: estimated object positions.
[Figure 4 plot, Clip 3: RMSE (%) versus frame. EvSOM tracking: av. = 3.9183, sd = 3.1085. Direct tracking: av. = 4.5405, sd = 3.6321.]
Fig. 4. Evolution of the RMSE values between true and estimated keypoints for the
real-world clip used in the object tracking experiment.
Conclusions
An extension of the EvSOM [4] was developed and applied to object tracking.
The main characteristic of the proposed approach is the inclusion of geometric or
topological constraints in the determination of parameters of affine transformations that map template keypoints from one frame to the next one. Simulation
results using a real-world film have shown that the proposed approach consistently outperformed a direct tracking method.
References
1. Dawoud, A., Alam, M., Bal, A., Loo, C.: Target tracking in infrared imagery using
weighted composite reference function-based decision fusion. IEEE Transactions on
Image Processing 15(2), 404-410 (2006)
2. Greiffenhagen, M., Comaniciu, D., Niemann, H., Ramesh, V.: Design, analysis, and
engineering of video monitoring systems: An approach and a case study. Proceedings
of the IEEE 89(10), 1498-1517 (2001)
3. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Heidelberg (2001)
4. Maia, J.E.B., Barreto, G., Coelho, A.: On self-organizing feature map (SOFM) formation by direct optimization through a genetic algorithm. In: Proceedings of the
8th International Conference on Hybrid Intelligent Systems (HIS 2008), pp. 661-666
(2008)
5. Mitchell, M.: An Introduction to Genetic Algorithms, 1st edn. MIT Press, Cambridge (1998)
6. Pentland, A.: Looking at people: Sensing for ubiquitous and wearable computing.
IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1), 107-119
(2000)
7. Silveira, G., Malis, E.: Unified direct visual tracking of rigid and deformable surfaces under generic illumination changes in grayscale and color images. International
Journal of Computer Vision 89(51), 84-105 (2010)
8. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Computing Surveys 38(114), 1344 (2006)
9. Jia, Z., Balasuriya, A., Challa, S.: Fusion-based visual target tracking for autonomous
vehicles with the out-of-sequence measurements solution. Robotics and Autonomous
Systems 56(2), 157-176 (2008)
Introduction
There are two types of clustering measures, namely set-based and pairwise-based
measures. These two types of measures can be extended in different ways.
2.1
First, the way to extend set-based clustering measures [10] such as cluster purity,
class F-measure, and entropy is described in this section.
Fig. 1 shows a learned one-dimensional SOM as an example. The neighbor
data with class labels are assumed to be assigned to each neuron node by the
winner, i.e., the best matching unit. Here, a set of neighbor data corresponds to
a micro cluster. In addition, the topology of these micro clusters is assumed to be
obtained by the SOM learning.
In general, it is preferable that samples of the same class lie close together and that
samples of different classes lie far apart on the map. The measure should evaluate this
property. By considering the topology of micro clusters, the neighbor class distribution
should be taken into account in the degree to which a certain class is contained in a micro
cluster. That is, data points of the same class in the neighbor clusters should receive
high weight, whereas those in the distant clusters should receive low weight.
Based on the above concept, let $N_{t,i}$ be the number of samples with class $t \in T$
in the $i$th cluster $C_i \in C$. $N_i$ denotes the number of samples in cluster $C_i$. Also
$N$ denotes the total number of samples. These basic statistics $N_{t,i}$, $N_i$, and $N$
are weighted by neighbor clusters as follows:
[Figure 1 labels: neighborhood function; SOM nodes; the number of class; neighbor data; weighted summation of marginalized values; neighbor data]
Fig. 1. Extension of a set-based clustering measure on the topology space of the SOM.
The basic statistics are weighted by the neighborhood function.
$$N'_{t,i} = \sum_{j} h_{i,j} N_{t,j}, \tag{1}$$
$$N'_i = \sum_{t} N'_{t,i} = \sum_{t} \sum_{j} h_{i,j} N_{t,j}, \tag{2}$$
$$N' = \sum_{i} N'_i = \sum_{i} \sum_{t} \sum_{j} h_{i,j} N_{t,j}. \tag{3}$$
$\{N'_{t,i} \mid t, i\}$ is a neighbor class distribution that is calculated by a weighted
summation of the original class distribution over the topology space, $\{N'_i \mid i\}$ is
a neighbor data distribution that is given by a summation of the neighbor class
distribution over classes, and $N'$ is a total volume of neighbor data that is given
by a summation of the neighbor data distribution over all the micro clusters.
The neighborhood function $h_{i,j}$ is used as marginalization weights in the same
manner as in the learning phase, but here it introduces topographic connectivity of
the class distribution over the topology space as shown in Fig. 1. Any monotonically
decreasing function is available, e.g., typically the Gaussian function:
$$h_{i,j} = \exp\left( -\frac{\| r_i - r_j \|^2}{\sigma^2} \right), \tag{4}$$
where $r$ is a coordinate of a neuron within the topology space, and $\sigma$ (> 0) is
a marginalization (neighborhood) radius. Note that the marginalization
radius need not be the same as the neighborhood radius used in the SOM learning.
Then, topographic cluster purity, class F-measure, and entropy are defined by
using the marginalized statistics of eqs. (1), (2), and (3) as follows:
topographic Cluster Purity (tCP)
$$tCP(C) = \frac{1}{N'} \sum_{C_i \in C} \max_{t \in T} N'_{t,i}. \tag{5}$$
The original purity is an average of the ratio that a majority class occupies
in each cluster, whereas in the topographic purity a majority class is determined by the neighbor class distribution $\{N'_{t,i}\}$.
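The marginalization of eqs. (1) to (3) and the purity of eq. (5) can be sketched in Python as follows. The Gaussian form used for the neighborhood function (squared distance divided by sigma squared), the table layout `counts[t][j]`, and all function names are assumptions of this sketch.

```python
import math


def gaussian_neighborhood(coords, sigma):
    """Matrix h[i][j] from node coordinates (assumed Gaussian form of eq. (4))."""
    n = len(coords)
    return [[math.exp(-math.dist(coords[i], coords[j]) ** 2 / sigma ** 2)
             for j in range(n)] for i in range(n)]


def marginalize(counts, h):
    """Neighbor class distribution N'[t][i] = sum_j h[i][j] * counts[t][j], eq. (1)."""
    n = len(h)
    return [[sum(h[i][j] * row[j] for j in range(n)) for i in range(n)]
            for row in counts]


def topographic_purity(counts, h):
    """Topographic cluster purity of eq. (5) from a class-by-node count table."""
    weighted = marginalize(counts, h)
    n = len(h)
    total = sum(sum(row) for row in weighted)           # N' of eq. (3)
    majority = sum(max(row[i] for row in weighted)      # max_t N'[t][i] per node
                   for i in range(n))
    return majority / total
```

With an identity matrix for `h` the function returns the ordinary cluster purity, while a very large radius makes every node see the same class totals, so the value falls to the share of the largest class.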
topographic Class F-measure (tCF)
$$tCF(C) = \sum_{t \in T} \frac{N_t}{N} \max_{C_i \in C} F(t, C_i), \tag{6}$$
$$F(t, C_i) = \frac{2 \, Prec(t, C_i) \, Rec(t, C_i)}{Prec(t, C_i) + Rec(t, C_i)}, \tag{7}$$
where $Prec(t, C_i) = N'_{t,i} / N'_i$ and $Rec(t, C_i) = N'_{t,i} / N_t$. The original F-measure is a harmonic average of precision and recall among class sets and
cluster sets. The extended precision indicates a separation degree of different classes in a cluster and its neighbors, and the extended recall indicates
density of the same class over topology space.
(8)
Nt,i
1 Nt,i
log
.
log N
Ni
Ni
(9)
tEP(C) =
Ci C
Entropy(Ci ) =
tT
The original entropy indicates the degree of unevenness of class distribution within a cluster, whereas the extended entropy includes unevenness of
neighbor clusters.
2.2
[Figure 2 labels: c(i) = c(j); c(i) ≠ c(j); winner neuron; x_i, x_j; c_i, c_j; d = ||r_{c_i} - r_{c_j}||; likelihood h_{c_i,c_j}]
Fig. 2. Extension of a pairwise-based clustering measure. A likelihood function is introduced to represent the degree that the data pair belongs to the same micro cluster.
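eq. (4) is available for the likelihood function (Fig. 2(b)). Then, a, b, c and d are
replaced by summation of the likelihoods as follows:
$$a' = \sum_{\{i,j \mid t(i) = t(j)\}} h_{c(i),c(j)}, \tag{10}$$
$$b' = \sum_{\{i,j \mid t(i) \neq t(j)\}} h_{c(i),c(j)}, \tag{11}$$
$$c' = \sum_{\{i,j \mid t(i) = t(j)\}} \left( 1 - h_{c(i),c(j)} \right) = a + c - a', \tag{12}$$
$$d' = \sum_{\{i,j \mid t(i) \neq t(j)\}} \left( 1 - h_{c(i),c(j)} \right) = b + d - b'. \tag{13}$$
topographic Pairwise Accuracy (tPA)
$$tPA(C) = \frac{a' + d'}{a' + b' + c' + d'}. \tag{14}$$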
The original pairwise accuracy is the ratio of the number of pairs in which samples of the
same class belong to the same cluster, or samples of different classes belong to different clusters, to the number of all pairs. By contrast, the topographic PA is the degree
to which samples of the same class belong to neighboring clusters and samples of different classes
belong to distant clusters.
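A compact Python sketch of the pairwise extension accumulates the four likelihood sums over all sample pairs and returns their accuracy ratio. Here `labels[i]` plays the role of t(i), `winners[i]` the role of c(i), and `h` is any neighborhood matrix indexed by node; these names, and the use of unordered pairs, are assumptions of the sketch.

```python
from itertools import combinations


def topographic_pairwise_accuracy(labels, winners, h):
    """Pairwise accuracy with cluster co-membership softened by h."""
    a = b = c = d = 0.0
    for i, j in combinations(range(len(labels)), 2):
        likelihood = h[winners[i]][winners[j]]  # degree of "same micro cluster"
        if labels[i] == labels[j]:
            a += likelihood          # same class, treated as same cluster
            c += 1.0 - likelihood    # same class, treated as different clusters
        else:
            b += likelihood          # different class, treated as same cluster
            d += 1.0 - likelihood    # different class, treated as different clusters
    return (a + d) / (a + b + c + d)
```

Passing the identity matrix for `h` reduces the likelihood to a hard same-cluster indicator, which recovers the original pairwise accuracy.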
topographic Pairwise F-measure (tPF)
$$tPF(C) = \frac{2PR}{P + R}, \tag{15}$$
Neighborhood Function
For the neighborhood function for the marginalization and the likelihood, any
monotonically decreasing function $h_{i,j} \ge 0$ is available, such as a Gaussian or a
rectangular function, the same as in the learning of the SOM. Note that the extended
measures are exactly the same as the original measures when $h_{i,j} = \delta_{i,j}$ ($\delta$ is
the Kronecker delta). As for the shape of the class distribution, our measure makes no
assumption in the original feature space, but makes an assumption in the topology
space through the shape of the neighborhood function.
The neighborhood radius affects the degree of marginalization and likelihood. Fig. 3 illustrates that the extended measure evaluates individual clusters,
that is, the original values, as the marginalization radius approaches zero. On the
contrary, as the radius becomes larger, the finite topology space is smoothed
by almost the same weights, and all micro clusters are treated as one big cluster.
The optimal radius depends on the class distribution and the function of the evaluation
measure. The way to find the optimal radius is described in the next section.
[Figure 3 labels: marginalized value; radius; original value; node index]
Fig. 3. Example of the effect of the marginalization. The larger the radius, the smoother the
values over the connectivity of the nodes.
3
3.1
Fig. 4. Two-dimensional synthetic data. The data points were generated from two
Gaussian distributions, $N(\mu_1, 1)$ and $N(\mu_2, 1)$, where $\mu_1 = (0, 0)$ and $\mu_2 = (3, 0)$.
Well-known open datasets (see footnote 2) were also used as real-world data: Iris data (150
samples, 4 attributes, 3 classes), Wine data (178 samples, 13 attributes, 3 classes),
and Glass Identification data (214 samples, 9 attributes, 6 classes).
3.2
Experimental Condition
We employed the batch-type SOM in which a Gaussian function was used as the
neighborhood function together with a decreasing strategy for the neighborhood radius.
The neurons were arranged on a 10 × 10 regular grid, the most standard setup. Also
the Gaussian function of eq. (4) was employed for the neighborhood function
of the topographic measures. Then, the evaluation values of each measure were
averaged over 100 runs to avoid dependency on the initial random values of the
prototypes.
3.3
Topographic Component
Synthetic Data. Fig. 6 shows the evaluation values of the topographic measures for the synthetic data. Larger values are better, except for entropy.
2
http://archive.ics.uci.edu/ml/
[Figure 5 label: evaluation value]
Fig. 5. The way to calculate the topographic component. The component is calculated
as the difference of the values between (a) topology preservation and (b) random topology.
[Figure 6 panels: (a) tCP, (b) tCF, (c) tEP, (d) tPA, (e) tPF; horizontal axis: radius sigma; legend: SOM topology preservation, SOM random topology, topographic component]
Fig. 6. The evaluation values of the topographic measures for the synthetic data with
changing neighborhood radius
Firstly, the standard SOM with topology preservation always provides a better value
than the SOM with random topology, where topographic connectivity is
destroyed. This means that the proposed topographic measures evaluate both
clustering accuracy and topographic connectivity.
Secondly, as the neighborhood radius approaches zero, the extended
measure evaluates individual micro clusters without topographic connectivity,
whereas, as the radius becomes larger, the extended measure treats the whole data
as one big cluster, as mentioned before. Therefore, the solid and the broken lines
gradually coincide as the radius approaches zero or becomes much
larger.
Thirdly, the topographic component is unimodal with respect to the radius
for all measures, since there exists an appropriate radius for the average class distribution. Since the topographic measure is a composition of clustering accuracy
and topographic connectivity, the radius that gives the maximum value for the SOM
(topology preservation) does not always coincide with the radius that maximizes the topographic component. Therefore, the topographic component should be examined
to find the appropriate radius. The appropriate radius also depends on the function
of the measure, such as purity, F-measure, or entropy. This means that the user
should use a different radius for each measure.
[Figure 7 plots: panels (a) tCP, (b) tCF, (c) tEP, (d) tPA, (e) tPF; horizontal axis: radius (sigma); vertical axis: topographic component; curves: iris, wine, glass, synthetic.]
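Selecting the radius from the topographic component can be sketched as follows. This is only an illustration under stated assumptions: the evaluation values for the topology-preserving SOM and for the random-topology SOM are assumed to be available as two lists measured at the same radii, and larger values are assumed to be better (for entropy the sign would have to be flipped). The function names are hypothetical.

```python
def topographic_component(preserved, randomized):
    """Difference between the measure with topology preservation and with random topology."""
    return [p - r for p, r in zip(preserved, randomized)]

def best_radius(radii, preserved, randomized):
    """Return (radius, component) where the topographic component is largest."""
    component = topographic_component(preserved, randomized)
    idx = max(range(len(radii)), key=lambda i: component[i])
    return radii[idx], component[idx]
```

Note that the selected radius need not be the one at which the topology-preserving SOM alone reaches its maximum.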
Real-World Data. For real-world data, too, there exists an appropriate radius
(Fig. 7). However, the appropriate radii depend not only on the function of a measure
but also on the number of classes and the class distribution. This result
indicates that, depending on the measure and the dataset, a user should use the
radius that gives the maximum topographic component.
Conclusion
We proposed topographic measures using external criteria (class information) for the evaluation of the SOM as a visual data mining tool. Our method
introduces the topographic component to set-based and pairwise-based clustering measures by utilizing the neighborhood function. Since our method extends
the basic statistics, any set-based or pairwise-based clustering measure can also
be extended. The present paper then clarified the properties of the topographic
measures using synthetic and real-world data. The experiments revealed the existence of an appropriate neighborhood radius. A user should use an appropriate
radius depending on the measure and the dataset. As future work, we should
develop an easier way to find an appropriate radius, and should also examine the use
of the proposed measures when comparing different SOM results.
Acknowledgements
This work was supported by KAKENHI (21700165).
References
1. Fukui, K., Saito, K., Kimura, M., Numao, M.: Sequence-based som: Visualizing
transition of dynamic clusters. In: Proc. of IEEE 8th International Conference on
Computer and Information Technology (CIT 2008), pp. 4752 (2008)
2. Fukui, K., Sato, K., Mizusaki, J., Numao, M.: Kullback-leibler divergence based
kernel SOM for visualization of damage process on fuel cells. In: Proc. of 22nd
IEEE International Conference on Tools with Articial Intelligence (ICTAI 2010).,
vol. 1, pp. 233240 (2010)
3. Goodhill, G.J., Sejnowski, T.J.: Quantifying neighbourhood preservation in topographic mappings. In: Proc. of the 3rd Joint Symposium on Neural Computation,
vol. 6, pp. 6182 (1996)
4. Kiviluoto, K.: Topology preservation in self-organizing maps. In: Proc. of International Conference on Neural Networks (ICNN 1996), pp. 294299 (1996)
5. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1995)
6. Koskela, M., Laaksonen, J., Oja, E.: Entropy-based measures for clustering and
SOM topology preservation applied to content-based image indexing and retrieval.
In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR
2004), vol. 2, pp. 10051009 (2004)
7. Lagus, K., Kaski, S., Kohonen, T.: Mining massive document collections by the
websom method. Information Sciences 163, 135156 (2004)
8. Oja, M., Kaski, S., Kohonen, T.: Bibliography of self-organizing map (som) papers:
1998-2001 addendum. Neural Computing Surveys 3, 1156 (2002)
9. Uriarte, E.A., Martn, F.D.: Topology preservation in SOM. International Journal
of Mathematical and Computer Sciences 1(1), 1922 (2005)
10. Veenhuis, C., Koppen, M.: Data Swarm Clustering, ch. 10, pp. 221241. Springer,
Heidelberg (2006)
11. Xu, R., Wunsch, D.: Cluster Validity. Computational Intelligence, ch. 10, pp. 263
278. IEEE Press, Los Alamitos (2008)
1 Introduction
Self-organizing maps (SOMs), a type of artificial neural network, are commonly
used for data clustering and visualization. SOMs are worth studying because
they are applied in various areas such as ecology, the military, medicine, and engineering.
In medicine, for example, a SOM can be used to analyze breast cancer
data sets and thus help physicians make decisions [1]. Self-organizing maps can be
combined with dimensionality reduction methods such as multidimensional scaling [2],
[3], [4], [5], [6]. In that case, the dimensionality of the neuron-winners obtained by the
SOM is reduced to two by multidimensional scaling, and the result is presented on a plane. It is
important to train the SOM so that the neuron-winners correspond to the analyzed data
as faithfully as possible.
The results of a SOM depend on the initialization and learning parameters. In this
article, three neighboring functions (bubble, Gaussian and heuristic [7]) and four
learning rates (linear, inverse-of-time, power series and heuristic [7]) are investigated.
In the training process, the learning rate can be changed in two ways (by epochs and
by iterations). The dependence of the results on the way chosen is investigated, too. The
quality of a SOM is commonly estimated by the quantization and topographic errors. The
main goal of the research is to estimate how the results obtained by the SOM depend on
the learning parameters in the sense of quantization error. Experiments have
been carried out with three data sets: glass, wine, and zoo.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 141-150, 2011.
© Springer-Verlag Berlin Heidelberg 2011
2 Self-Organizing Maps
2.1 Principles of Self-Organizing Maps
An artificial neural network is a mathematical model of the biological neuron system.
A self-organizing map (SOM) is one of the most analyzed unsupervised neural
network models. It was first described by the Finnish scientist Teuvo Kohonen in 1982.
Consequently, SOMs are sometimes called Kohonen maps or Kohonen networks. The model holds
an important place in science and remains one of the most popular research objects to
date. The main target of the SOM is to preserve the topology of multidimensional
data, i.e., to get a new set of data from the input data such that the new set preserves
the structure (clusters, relationships, etc.) of the input data. SOMs are applied to
cluster (quantize) and visualize data. A self-organizing map is a set of nodes
connected to each other via a rectangular or hexagonal topology. A SOM of rectangular
topology is presented in Fig. 1. Here a circle represents a node. The connections
between the inputs and the nodes have weights, so a set of weights corresponds to
each node. The set of weights forms a vector $M_{ij}$, $i = 1, \dots, k_x$, $j = 1, \dots, k_y$, that is
usually called a neuron or a codebook vector, where $k_x$ is the number of rows, and
network. The vector $X_p$ is compared with all neurons $M_{ij}$. Usually the Euclidean
distance between the input vector $X_p$ and each neuron $M_{ij}$ is calculated. The
vector (neuron) $M_c$ with the minimal Euclidean distance to $X_p$ is designated as the
winner. The components of all neurons are adapted according to the learning rule:
$$M_{ij}(t + 1) = M_{ij}(t) + h_{ij}^{c}(t)\,\bigl(X_p - M_{ij}(t)\bigr) \tag{1}$$
$$h_{ij}^{c} = \begin{cases} \alpha(t), & (i, j) \in N_c, \\ 0, & (i, j) \notin N_c, \end{cases} \tag{2}$$
$$h_{ij}^{c} = \alpha(t)\, \exp\!\left(-\frac{\lVert R_c - R_{ij} \rVert^{2}}{2\,\bigl(\eta_{ij}^{c}(t)\bigr)^{2}}\right). \tag{3}$$
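A minimal sketch of learning rule (1) with the bubble and Gaussian neighboring functions is given here for illustration. The Greek symbols did not survive in this copy, so the names `alpha` (learning rate) and `eta` (width term in the Gaussian) are assumptions, as is the representation of the neighborhood $N_c$ by a boolean membership flag; this is not the authors' Matlab code.

```python
import math

def bubble_neighboring(alpha, in_neighborhood):
    """Bubble function: alpha inside the winner's neighborhood N_c, zero outside."""
    return alpha if in_neighborhood else 0.0

def gaussian_neighboring(alpha, r_c, r_ij, eta):
    """Gaussian function of the squared grid distance between winner and neuron."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(r_c, r_ij))
    return alpha * math.exp(-dist_sq / (2.0 * eta ** 2))

def update_neuron(m_ij, x_p, h):
    """Learning rule (1): move neuron M_ij toward input X_p by the factor h."""
    return [m + h * (x - m) for m, x in zip(m_ij, x_p)]
```

For example, `update_neuron([0.0, 0.0], [1.0, 2.0], 0.5)` moves the neuron halfway toward the input.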
$$\alpha(t) = \frac{1}{t} \tag{4}$$
$$\alpha(t, T) = 1 - \frac{t}{T} \tag{5}$$
$$\alpha(t, T) = (0.005)^{t/T} \tag{6}$$
One more heuristic neighboring function [7] is used in the investigation. The
neighboring function (7) is monotonically decreasing:
$$h_{ij}^{c} = \frac{\alpha(t)}{\alpha(t)\,\eta_{ij}^{c} + 1}, \tag{7}$$
$$\text{where } \alpha(t) = \max\!\left(\frac{T + 1 - t}{T},\ 0.01\right). \tag{8}$$
Here $T$ is the number of epochs (iterations), $t$ is the order number of the current epoch
(iteration), and $\eta_{ij}^{c}$ is the neighboring rank between neurons $M_c$ and $M_{ij}$. The neurons
$M_{ij}$ are recalculated in each epoch if inequality (9) holds. This condition is
applied in all the cases analyzed.
(9)
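The learning-rate schedules and the heuristic neighboring function above can be written as small functions. This is a sketch for illustration: the function names are hypothetical, and `t` is assumed to run from 1 to `T`.

```python
def rate_reciprocal(t):
    """Learning rate 1 / t."""
    return 1.0 / t

def rate_linear_decay(t, T):
    """Learning rate 1 - t / T."""
    return 1.0 - t / T

def rate_power(t, T):
    """Learning rate 0.005 ** (t / T)."""
    return 0.005 ** (t / T)

def rate_heuristic(t, T):
    """Heuristic learning rate (8): max((T + 1 - t) / T, 0.01)."""
    return max((T + 1 - t) / T, 0.01)

def heuristic_neighboring(alpha, rank):
    """Heuristic neighboring function (7); rank is the neighboring rank to the winner."""
    return alpha / (alpha * rank + 1.0)
```

For a fixed learning rate, `heuristic_neighboring` equals the learning rate at the winner (rank 0) and decreases monotonically as the rank grows.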
Often two training phases (rough and fine tuning) are used, but in this investigation
the training is not divided into two phases.
Fig. 2. Iris data set in a 6×6 SOM: bubble (left) and Gaussian (right) neighboring functions are
used
After the SOM network has been trained, its quality must be evaluated. Usually
two errors (quantization and topographic) are calculated. The quantization error $E_{QE}$ (10)
shows how well the neurons of the trained network adapt to the input vectors. $E_{QE}$ is the
average distance between the data vectors $X_p$ and their neuron-winners $M_{c(p)}$:
$$E_{QE} = \frac{1}{m} \sum_{p=1}^{m} \bigl\lVert X_p - M_{c(p)} \bigr\rVert \tag{10}$$
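The quantization error (10) can be computed directly from the data vectors and the trained neurons. The sketch below assumes that both are given as sequences of equal-length numeric tuples and that the neuron-winner of a vector is simply its nearest neuron in the Euclidean sense; the function name is hypothetical.

```python
import math

def quantization_error(data, neurons):
    """Average Euclidean distance between each data vector and its nearest neuron."""
    total = 0.0
    for x in data:
        # The neuron-winner is the neuron closest to x.
        total += min(math.dist(x, m) for m in neurons)
    return total / len(data)
```

For instance, with a single neuron at the origin and the two data vectors (0, 0) and (3, 4), the distances are 0 and 5, so the error is 2.5.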
3 Experimental Results
3.1 Data Set Analyzed
Three data sets are used in the experimental investigation. The glass data set was
collected by a scientist who wanted to help criminalists recognize the glass slivers
found [12]. Nine-dimensional vectors $X_1, X_2, \dots, X_{214}$ are formed,
where $X_i = (x_{i1}, x_{i2}, \dots, x_{i9})$, $i = 1, \dots, 214$. Nine features are measured: $x_1$ is the
refractive index, $x_2$ is sodium, $x_3$ is magnesium, $x_4$ is aluminum, $x_5$ is silicon, $x_6$
is potassium, $x_7$ is calcium, $x_8$ is barium and $x_9$ is iron.
The wine data set consists of the results of a chemical analysis of wine grapes
grown in the same region in Italy but derived from three different cultivars. The
analysis determined the quantities of 13 constituents found in each of the three types
of wines. Thirteen-dimensional vectors $X_1, X_2, \dots, X_{178}$ are formed, where
$X_i = (x_{i1}, x_{i2}, \dots, x_{i13})$, $i = 1, \dots, 178$. Thirteen features are measured: $x_1$ is alcohol, $x_2$
is malic acid, $x_3$ is ash, $x_4$ is alkalinity of ash, $x_5$ is magnesium, $x_6$ is total phenols,
$x_7$ is flavanoids, $x_8$ is nonflavanoid phenols, $x_9$ is proanthocyanins, $x_{10}$ is color
intensity, $x_{11}$ is hue, $x_{12}$ is OD280/OD315 of diluted wines, and $x_{13}$ is proline.
The third data set is zoological (zoo). Each data item consists of 16 Boolean values.
Sixteen-dimensional vectors $X_1, X_2, \dots, X_{92}$ are formed, where
$X_i = (x_{i1}, x_{i2}, \dots, x_{i16})$, $i = 1, \dots, 92$. Sixteen features are measured: $x_1$ is hair, $x_2$ is
feathers, $x_3$ is eggs, $x_4$ is milk, $x_5$ is airborne, $x_6$ is aquatic, $x_7$ is predator, $x_8$ is
toothed, $x_9$ is backbone, $x_{10}$ is breathes, $x_{11}$ is venomous, $x_{12}$ is fins, $x_{13}$ is legs,
$x_{14}$ is tail, $x_{15}$ is domestic, and $x_{16}$ is catsize.
3.2 Dependence of Results on the Learning Parameters
The SOM with three neighboring functions (2), (3), (7) and learning rates (4), (5), (6),
(8) is implemented in Matlab. To find out how strongly the different parameters
affect the SOM training results, 72 experiments are carried out (three data sets, three
neighboring functions, four learning rates, and two ways of changing the learning rate:
by epochs and by iterations). The map size is set to [10 × 10]. During the experiments, the
number of epochs ranges from 10 to 100 in steps of 10. Each experiment is repeated 10
times with different initial values of the neurons $M_{ij}$. The averages of the quantization error
(10) and their confidence intervals are calculated. In the training process, the neurons
$M_{ij}$ are recalculated if inequality (9) holds.
The results for the glass data set are presented in Fig. 3 and Table 1. In the figures, only
the numbers of epochs are shown. The number of epochs multiplied by the number
of data items $m$ gives the number of iterations. For the glass data set
($m = 214$), 10 epochs equal 2140 iterations, 20 epochs equal 4280 iterations, etc. The worst
results in the sense of quantization error are obtained if learning rate (4) is used. In
this case, changing the learning rate by epochs gives a smaller quantization error
than changing it by iterations. When
learning rate (5) is used, Gaussian function (3) gives the best results, regardless of
whether the learning rate is changed by epochs or by iterations. Learning rates (6) and
(8) give the best results with the Gaussian function, but heuristic neighboring
function (7) with epochs also gives good results.
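The difference between changing the learning rate by epochs and by iterations can be made concrete with a short sketch. It uses the reciprocal rate 1/t purely as an example and assumes that both counters start at 1; the function names and the `by` parameter are hypothetical.

```python
def reciprocal_rate(t):
    """Example schedule: learning rate 1 / t."""
    return 1.0 / t

def rate_schedule(n_epochs, m, by="epochs"):
    """Learning rate used at every presentation of an input vector.

    by="epochs":     t is the epoch number, so the rate is constant within an epoch.
    by="iterations": t is the running count of presented vectors (1 .. n_epochs * m).
    """
    rates = []
    iteration = 0
    for epoch in range(1, n_epochs + 1):
        for _ in range(m):
            iteration += 1
            t = epoch if by == "epochs" else iteration
            rates.append(reciprocal_rate(t))
    return rates
```

With 10 epochs and m = 214 both variants produce 2140 values, but the by-iterations variant ends at 1/2140 while the by-epochs variant ends at 1/10.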
The confidence intervals of the averages of the quantization errors are computed for
each number of epochs (10, 20, ..., 100). The margins of error of the confidence intervals
are then averaged. Due to the limit on article length, only the averages of the margins of
error are presented in Table 1. We see that the margins of error are small enough for
all the learning rates and neighboring functions, so the averages of the quantization
errors presented in Fig. 3 are informative.
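One common way to obtain such a margin of error for the mean of 10 repeated runs is sketched below. The text does not say which formula was used, so the Student-t interval here is an assumption; the constant 2.262 is the two-sided 95% critical value for 9 degrees of freedom and is valid only for samples of size 10.

```python
import math
import statistics

T_CRIT_95_DF9 = 2.262  # two-sided 95% Student-t critical value, 9 degrees of freedom

def margin_of_error(samples, t_crit=T_CRIT_95_DF9):
    """Half-width of the confidence interval for the mean of the samples."""
    std = statistics.stdev(samples)  # sample standard deviation (n - 1 in the denominator)
    return t_crit * std / math.sqrt(len(samples))
```

The confidence interval is then the sample mean plus or minus this margin.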
Table 1. Averages of the margins of errors (confidence probability 0.95) for the glass data set

Learning   By iterations                      By epochs
rate       Bubble     Gaussian   Heuristic    Bubble     Gaussian   Heuristic
(4)        0.006666   0.003382   0.005402     0.001805   0.002275   0.002394
(5)        0.001688   0.001734   0.001670     0.001457   0.001563   0.001412
(6)        0.001696   0.001531   0.001724     0.001980   0.001788   0.001830
(8)        0.001576   0.001557   0.001644     0.001516   0.001451   0.002090
The same experiments were done with the wine data set. The results are illustrated
in Fig. 4. If learning rate (4) is used, all the neighboring functions give the smallest
quantization error when the learning rate is changed by epochs. The training
with learning rate (5) shows that the smallest quantization error is obtained by the
Gaussian function, regardless of whether the learning rate is changed by epochs or by
iterations. Just as in the case of the glass data set, the results with learning rates (6)
and (8) are similar.
Fig. 3. The averages of quantization errors, obtained using the glass data set (learning rates,
computed by formulas (4), (5), (6) and (8))
Table 2. Averages of the margins of errors (confidence probability 0.95) for the wine data set

Learning   By iterations                      By epochs
rate       Bubble     Gaussian   Heuristic    Bubble     Gaussian   Heuristic
(4)        0.003059   0.007191   0.002645     0.001359   0.001491   0.002169
(5)        0.001658   0.001251   0.001517     0.001736   0.001284   0.001646
(6)        0.001527   0.001333   0.001507     0.001894   0.001521   0.001428
(8)        0.001756   0.001333   0.001499     0.001919   0.001358   0.001875
The averages of the margins of errors of the confidence intervals for the wine data
set are presented in Table 2. The margins of errors are small, too. We can see that the
largest margins are obtained, if the Gaussian function and learning rate (4) are used,
and the learning rate is changed according to iterations. The smallest margins are
obtained when the Gaussian function is also used, but with learning rate (5).
The third data set analyzed is zoo. The results are presented in Fig. 5 and Table 3.
If learning rate (4) is used, the same results as with the other data sets are achieved
(Fig. 5): all neighboring functions yield smaller quantization errors if the learning
rate is changed by epochs. In the case of learning rate (5), the best result
is obtained with the Gaussian function, as before (both by epochs and by iterations).
With learning rate (6), the Gaussian function (by iterations and by epochs) and the
heuristic function (by epochs) also give small quantization errors. In the last
case (learning rate (8)), the best result is obtained if the Gaussian function is used,
regardless of how the learning rate is changed.
Fig. 4. The averages of quantization errors, obtained using the wine data set (learning rates
computed by formulas (4), (5), (6) and (8)).
As we see in Table 3, the margins of error of the confidence intervals of
the averages of the quantization error are small enough, too. This shows that
the averages computed from the observed values differ only slightly from the mathematical
expectation.
Table 3. Averages of the margins of errors (confidence probability 0.95) for the zoo data set

Learning   By iterations                      By epochs
rate       Bubble     Gaussian   Heuristic    Bubble     Gaussian   Heuristic
(4)        0.146629   0.039484   0.019839     0.017873   0.018802   0.021137
(5)        0.015246   0.01074    0.012951     0.018137   0.011479   0.013805
(6)        0.016417   0.013867   0.016969     0.021211   0.012662   0.014427
(8)        0.018691   0.010864   0.014641     0.016278   0.010541   0.014487
Fig. 5. The averages of quantization errors, obtained using the zoo data set (learning rates,
computed by formulas (4), (5), (6) and (8))
4 Conclusions
The experimental results have shown that the smallest quantization error is obtained
if the Gaussian neighboring function and nonlinear learning rates are used. In that case,
changing the learning rate by epochs or by iterations does not influence the results
obtained. The bubble neighboring function yields the worst result for all the data sets
analyzed. The heuristic function can also give good results, but
only if the learning rate is changed by epochs. The smallest quantization
errors are achieved with the inverse-of-time, power series and heuristic learning rates.
Learning rate (4) (linear) gives large quantization errors if the learning rate is changed
by iterations. For this learning rate, it is advisable to change it
by epochs instead.
The research has shown that the neighboring function, the learning rate and the
way the learning rate is changed influence the SOM results in the sense of
quantization error. It is therefore important to choose proper training parameters for
different data sets. If we do not know which parameters to select, the best way is to
choose the Gaussian function and a nonlinear learning rate.
In the future, it would be useful to analyze how the learning parameters and
neighboring functions affect the generalization capability of the SOM, i.e., how well
the trained SOM describes data not used in training.
References
1. Chen, D.R., Chang, R.F., Huang, Y.L.: Breast Cancer Diagnosis Using Self-organizing Map for Sonography. Ultrasound in Med. & Biol. 26(3), 405-411 (2000)
2. Kurasova, O., Molytė, A.: Integration of the Self-organizing Map and Neural Gas with Multidimensional Scaling. Information Technology and Control 40(1), 12-20 (2011)
3. Kurasova, O., Molytė, A.: Combination of Vector Quantization and Visualization. In: Perner, P. (ed.) MLDM 2009. LNCS (LNAI), vol. 5632, pp. 29-43. Springer, Heidelberg (2009)
4. Kurasova, O., Molytė, A.: Quality of Quantization and Visualization of Vectors Obtained by Neural Gas and Self-organizing Map. Informatica 22(1), 115-134 (2011)
5. Dzemyda, G., Kurasova, O.: Heuristic Approach for Minimizing the Projection Error in the Integrated Mapping. European Journal of Operational Research 171(3), 859-878 (2006)
6. Bernatavičienė, J., Dzemyda, G., Kurasova, O., Marcinkevičius, V.: Optimal Decisions in Combining the SOM with Nonlinear Projection Methods. European Journal of Operational Research 173(3), 729-745 (2006)
7. Dzemyda, G.: Visualization of a set of Parameters Characterized by Their Correlation Matrix. Computational Statistics and Data Analysis 36(1), 15-30 (2001)
8. Tan, H.S., George, S.E.: Investigating Learning Parameters in a Standard 2-D SOM Model to Select Good Maps and Avoid Poor Ones. In: Webb, G.I., Yu, X. (eds.) AI 2004. LNCS (LNAI), vol. 3339, pp. 425-437. Springer, Heidelberg (2004)
9. Kohonen, T.: Self-organizing Maps, 3rd edn. Springer Series in Information Sciences. Springer, Berlin (2001)
10. Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J.: SOM Toolbox for Matlab 5 (2005), http://www.cis.hut.fi/somtoolbox/documentation/somalg.shtml
11. Hassinen, P., Elomaa, J., Rönkkö, J., Halme, J., Hodju, P.: Neural Networks Tool NeNet (1999), http://koti.mbnet.fi/~phodju/nenet/Nenet/General.html
12. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository, University of California, School of Information and Computer, Irvine, CA (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
Abstract. In this paper, we introduce the Gamma Growing Neural Gas (γ-GNG)
model for temporal sequence processing. The standard GNG is merged with a
context descriptor based on a short term memory structure called Gamma memory. When using a single stage of the Gamma filter, the Merge GNG model is
recovered. The γ-GNG model is compared to γ-Neural Gas, γ-SOM, and Merge
Neural Gas, using the temporal quantization error as a performance measure.
Simulation results on two data sets are presented: Mackey-Glass time series, and
Bicup 2006 challenge time series.
1 Introduction
Several extensions of self-organizing feature maps (SOMs) [1] for dealing with processing data sequences that are temporally or spatially connected, such as words, DNA
sequences, time series, etc. [2],[3],[4] have been developed. In [5] a review of recursive self-organizing network models, and their application for processing sequential
and tree-structured data is presented. An early attempt to include temporal contexts in
SOMs is the Temporal Kohonen Map (TKM) [6], where a neuron output depends on
the current input and its context of past activities. In Recursive SOM [3],[7], the SOM
algorithm is used recursively on both the current input and a copy of the map at the
previous time step. In addition to a weight vector, each neuron has a context vector
that represents the temporal context as the activation of the entire map in the previous
time step. This kind of context is computationally expensive, since the dimension of the
context vectors is equal to the number of neurons in the network.
In the Merge SOM (MSOM) [2] approach, the context is described by a linear combination of the weight and the context of the last winner neuron. This context representation is more space efficient than the one used for the Recursive SOM model, because
in MSOM the dimensionality of the context is equal to the data dimensionality [4]. As
the MSOM context does not depend on the lattice architecture, it has been combined
with other self-organizing neural networks such as Neural Gas (NG) [8] and Growing
Neural Gas (GNG) [9], yielding the Merge Neural Gas (MNG) [10] and the Merge
Growing Neural Gas (MGNG) [11], respectively.
In our previous work we added Gamma filters [12] to SOM and NG, yielding
the γ-SOM [13] and γ-NG [14] models, respectively. We have shown that the gamma
filter variants of SOM and NG are generalizations that include the MSOM and MNG
models as particular cases, when the filter order is set to one. In this paper, we add
gamma filters to GNG to produce the γ-GNG model. We compare the performance of
γ-GNG with those of the γ-SOM and γ-NG models, using the temporal quantization
error as a metric. Results are shown on two data sets: the Mackey-Glass time series and
the Bicup 2006 time series.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 151-159, 2011.
© Springer-Verlag Berlin Heidelberg 2011
$$y(n) = \sum_{k=0}^{K} w_k\, c_k(n)$$
$$c_k(n) = \beta\, c_k(n-1) + (1-\beta)\, c_{k-1}(n-1) \tag{1}$$
where $c_0(n) \equiv x(n)$ is the input signal, $y(n)$ is the filter output, and $w_0, \dots, w_K, \beta$
are the filter parameters. The parameter $\beta \in (0, 1)$ provides a mechanism to decouple depth ($D$) and resolution ($R$) from filter order. Depth measures how far into the
past the memory stores information; a low memory depth can hold only recent information. Resolution indicates the degree to which information concerning the individual
elements of the input sequence is preserved. The mean memory depth for a Gamma
memory of order $K$ becomes [16],[17],
$$D = \frac{K}{1-\beta} \tag{2}$$
$$G(z) = \frac{1-\beta}{z-\beta}. \tag{3}$$
Eq. (3) can be described as a leaky integrator, where $\beta$ is the gain of the feedback loop.
The recursive rule for the context descriptor of order $K$ can be derived directly from the
transfer function (3), as follows:
$$C_k(z) = G(z)\, C_{k-1}(z) = \frac{1-\beta}{z-\beta}\, C_{k-1}(z) \tag{4}$$
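The Gamma memory recursion of eq. (1) can be sketched for a scalar input sequence as follows. The symbol `beta` is an assumed name for the feedback parameter, which did not survive in this copy, and all taps are assumed to start at zero; the function names are hypothetical.

```python
def gamma_memory(x, K, beta):
    """Run a K-stage Gamma memory over the scalar sequence x.

    Returns, for every time step n, the list of taps [c_0(n), ..., c_K(n)],
    where c_0(n) = x(n) and
    c_k(n) = beta * c_k(n-1) + (1 - beta) * c_{k-1}(n-1).
    """
    previous = [0.0] * (K + 1)  # assumption: taps are zero before the first sample
    history = []
    for x_n in x:
        taps = [x_n] + [0.0] * K
        for k in range(1, K + 1):
            taps[k] = beta * previous[k] + (1.0 - beta) * previous[k - 1]
        history.append(taps)
        previous = taps
    return history

def gamma_output(taps, w):
    """Filter output y(n) as the weighted sum of the taps."""
    return sum(w_k * c_k for w_k, c_k in zip(w, taps))
```

Feeding an impulse through a single stage with `beta = 0.5` shows the leaky-integrator behaviour: the first-stage tap is 0, 0.5, 0.25 over the first three steps.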
Let $N = \{1, \dots, M\}$ be a set of neurons. Each neuron has an associated weight vector
$w^i \in \mathbb{R}^d$, for $i = 1, \dots, M$, obtained from a vector quantization algorithm. The Gamma
context model associates to each neuron a set of contexts $C = \{c_1^i, c_2^i, \dots, c_K^i\}$, $c_k^i \in \mathbb{R}^d$,
$k = 1, \dots, K$, where $K$ is the Gamma filter order. Given a sequence $s$, the context
set $C$ is initialized at zero values. From eq. (2) it can be seen that by increasing the filter
order, $K$, the Gamma context model can achieve an increasing memory depth without
compromising resolution.
Given a sequence entry, $x(n)$, the best matching unit, $I_n$, is the neuron that minimizes
the following distance criterion,
$$d_i(n) = \alpha_w \bigl\lVert x(n) - w^i \bigr\rVert^2 + \sum_{k=1}^{K} \alpha_k \bigl\lVert c_k(n) - c_k^i \bigr\rVert^2 \tag{5}$$
where the parameters $\alpha_w$ and $\alpha_k$, $k \in \{1, 2, \dots, K\}$, control the relevance of the different elements. To compute the recursive distance (5), every context descriptor in the
different filtering stages is required. These contexts are built by using Gamma memories. Formally, the $K$ context descriptors of the current unit are defined as:
$$c_k(n) = \beta\, c_k^{I_{n-1}} + (1-\beta)\, c_{k-1}^{I_{n-1}}, \quad k = 1, \dots, K \tag{6}$$
where $c_0^{I_{n-1}} \equiv w^{I_{n-1}}$ and at $n = 0$ the initial conditions $c_k^{I_0}$, $k = 1, \dots, K$, are set
randomly. When $K = 1$, the context used in the merge models is recovered. Therefore, Merge SOM and Merge NG reduce to particular examples of γ-SOM and γ-NG,
respectively, when only a single Gamma filter stage is used ($K = 1$).
Because the context construction is recursive, it is recommended that $\alpha_w > \alpha_1 >
\alpha_2 > \dots > \alpha_K > 0$; otherwise errors in the early filter stages may propagate through
higher-order contexts.
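The best-matching-unit search with the recursive distance (5) and the context update (6) can be sketched with plain Python lists. The symbol names `alpha_w`, `alphas` and `beta` stand for the relevance and feedback parameters, whose Greek letters did not survive in this copy; the function names are hypothetical and this is not the authors' implementation.

```python
def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def recursive_distance(x, contexts, w_i, c_i, alpha_w, alphas):
    """Distance (5): weighted match of the input and of each of the K context stages."""
    d = alpha_w * squared_distance(x, w_i)
    for alpha_k, c_k, c_ik in zip(alphas, contexts, c_i):
        d += alpha_k * squared_distance(c_k, c_ik)
    return d

def best_matching_unit(x, contexts, weights, neuron_contexts, alpha_w, alphas):
    """Index of the neuron minimizing the recursive distance."""
    return min(
        range(len(weights)),
        key=lambda i: recursive_distance(
            x, contexts, weights[i], neuron_contexts[i], alpha_w, alphas
        ),
    )

def next_contexts(w_prev, c_prev, beta):
    """Context update (6) from the previous winner's weight and contexts.

    Stage 0 is the previous winner's weight vector; stage k mixes the winner's
    k-th context with its (k-1)-th stage.
    """
    stages = [w_prev] + list(c_prev)
    return [
        [beta * a + (1.0 - beta) * b for a, b in zip(stages[k], stages[k - 1])]
        for k in range(1, len(stages))
    ]
```

With a single stage, `next_contexts` reduces to a linear combination of the last winner's weight and context, which is the merge context described above.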
2.1 γ-GNG Algorithm
The γ-GNG algorithm is a merge between GNG and the Gamma context model. For
the GNG model we use the implementation proposed in [11], which incorporates a
node insertion criterion based on entropy maximization. In the following we extend
Andreakis's GNG implementation to incorporate γ filters. The ith neuron has an associated
weight vector, $w^i$, and a set of contexts, $c_k^i$, for $k = 1, \dots, K$.
1. Randomly initialize two weights $w^i$, and set their respective contexts, $c_k^i$, to zero,
for $k = 1, \dots, K$, $i = 1, 2$. Connect them with a zero-age edge and set their
respective winner counters, $wcount_i$, to 0.
2. Present input vector, $x(n)$, to the network.
3. Calculate the context descriptors $c_k(n)$ using eq. (6).
4. Find the best matching unit (BMU), $I_n$, and the second closest neuron, $J_n$, using eq. (5).
5. Update the BMU's local winner count variable: $wcount_{I_n} = wcount_{I_n} + 1$.
6. Update the BMU's weight and contexts using the following rule:
$$\Delta w^i = \epsilon_w(n)\,\bigl(x(n) - w^i\bigr)$$
$$\Delta c_k^i = \epsilon_w(n)\,\bigl(c_k(n) - c_k^i\bigr) \tag{7}$$
7. Update neighboring units (i.e. all nodes connected to the BMU by an edge) using
step-size $\epsilon_c(n)$ instead of $\epsilon_w(n)$ in eq. (7).
8. Increment the age of all edges connecting the BMU and its topological neighbors,
$a_j = a_j + 1$.
9. If the BMU and the second closest neuron are connected by an edge, then set the
age of that edge to 0. Otherwise create an edge between them.
10. If there are any edges with an age larger than $a_{max}$, remove them. If, after this
operation, there are nodes with no edges, remove them.
11. If the current iteration $n$ is an integer multiple of $\lambda$, and the maximum node count
has not been reached, then insert a new node. The parameter $\lambda$ controls the number
of iterations required before inserting a new node. Insertion of a new node, $r$, is
done as follows:
(a) Find the node $u$ with the largest winner count.
(b) Among the neighbors of $u$, find the node $v$ with the largest winner count.
(c) Insert the new node $r$ between $u$ and $v$ as follows,
$$w^r = 0.75\, w^u + 0.25\, w^v$$
$$c_k^r = 0.75\, c_k^u + 0.25\, c_k^v \tag{8}$$
(d) Create edges between $u$ and $r$, and $v$ and $r$, and remove the edge between $u$
and $v$.
(e) Decrease the winner count variables of nodes $u$ and $v$ by a factor $1 - \delta$, and
set the winner count of node $r$ as follows,
$$wcount_u = (1 - \delta)\, wcount_u \tag{9}$$
$$wcount_v = (1 - \delta)\, wcount_v \tag{10}$$
$$wcount_r = wcount_u \tag{11}$$
12. Set $n \leftarrow n + 1$.
13. If $n < L$ go back to step 2, where $L$ is the cardinality of the data set.
Typically,
= 0.5 and = 0.0005.
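The node insertion step, items (a) to (e) of the algorithm, can be sketched with dictionaries keyed by node id. Several details are assumptions made for illustration: the decrease factor is called `delta`, new edges start with age 0, the new node receives the already decreased winner count of `u`, and the container layout (`weights`, `contexts`, `wcount`, `edges`) is hypothetical.

```python
def insert_node(weights, contexts, wcount, edges, delta):
    """Insert a node between the most frequent winner and its busiest neighbor.

    weights:  dict node_id -> weight vector (list of floats)
    contexts: dict node_id -> list of K context vectors
    wcount:   dict node_id -> winner count
    edges:    dict frozenset({a, b}) -> edge age
    """
    # (a) node with the largest winner count
    u = max(wcount, key=wcount.get)
    # (b) neighbor of u with the largest winner count
    neighbors = [next(iter(edge - {u})) for edge in edges if u in edge]
    v = max(neighbors, key=lambda j: wcount[j])
    # (c) new node placed closer to u than to v
    r = max(weights) + 1
    weights[r] = [0.75 * a + 0.25 * b for a, b in zip(weights[u], weights[v])]
    contexts[r] = [
        [0.75 * a + 0.25 * b for a, b in zip(c_u, c_v)]
        for c_u, c_v in zip(contexts[u], contexts[v])
    ]
    # (d) rewire: replace edge u-v by edges u-r and v-r (age 0 is an assumption)
    del edges[frozenset((u, v))]
    edges[frozenset((u, r))] = 0
    edges[frozenset((v, r))] = 0
    # (e) decrease the winner counts and initialize the new node's count
    wcount[u] *= (1.0 - delta)
    wcount[v] *= (1.0 - delta)
    wcount[r] = wcount[u]
    return r
```

The sketch assumes that `u` has at least one neighbor, which holds after nodes without edges have been removed.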
3 Experiments
Experiments were carried out with two data sets: the Mackey-Glass time series and the Bicup
2006 time series. The parameter $\beta$ was varied from 0.1 to 0.9 in steps of 0.1. The
number of filter stages $K$ was varied from 1 to 9. Training in γ-GNG is done in a single
stage, during 1 epoch for the Mackey-Glass time series and 200 epochs for the Bicup time
series. The parameters $\alpha_i$ are fixed, and decay linearly with the context order as follows:
$$\alpha_i = \frac{K + 1 - i}{\sum_{k=0}^{K} (k + 1)}, \quad i = 0, \dots, K \tag{12}$$
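The linearly decaying relevance parameters of eq. (12) are easy to tabulate. The sketch below uses a hypothetical function name; index 0 is taken to be the weight applied to the input match, which is an interpretation of the index range i = 0...K.

```python
def context_weights(K):
    """Relevance parameters of eq. (12): (K + 1 - i) / sum_{k=0}^{K} (k + 1), i = 0..K."""
    denominator = sum(k + 1 for k in range(K + 1))
    return [(K + 1 - i) / denominator for i in range(K + 1)]
```

Because the numerators K+1, K, ..., 1 add up to the denominator, the parameters sum to one and are strictly decreasing, in line with the recommended ordering of the relevance parameters.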
Fig. 2. TQE for Mackey-Glass time series using Merge GNG (K = 1), γ-GNG (K = 9), γ-NG
(K = 9), and γ-SOM (K = 9)
Fig. 3. PCA projection of temporal vector quantization results for Mackey-Glass time series,
using (a) γ-NG (K = 9, β = 0.1), (b) γ-GNG (K = 9, β = 0.3), (c) γ-SOM (K = 9, β = 0.3),
and (d) Merge-GNG (K = 1, β = 0.6)
Fig. 5. TQE for Bicup 2006 time series using Merge GNG, γ-NG, γ-GNG, and γ-SOM
Fig. 6. PCA projection of the temporal vector quantization obtained for Bicup 2006 time series,
using (a) γ-NG (K = 9), (b) γ-GNG (K = 8), (c) γ-SOM (K = 8), and (d) Merge-GNG (K = 1).
4 Conclusion
A γ-filter version of GNG has been implemented, and compared with γ-NG and γ-SOM
models using the temporal quantization error as performance metric. The so-called γ-GNG model is a generalization of the Merge GNG model, where the latter corresponds
to the particular case when a single context is used (gamma filter of order one). It
has been shown empirically using two time series that by adding more contexts the
temporal quantization error tends to diminish. Although the TQE performance of γ-GNG is similar to that of γ-NG, the former has the advantage of being a much faster
algorithm. In addition, in γ-GNG there is no need for two stages of training, as is
usually done in the γ-NG and γ-SOM algorithms.
Acknowledgment
This research was supported by Conicyt-Chile under grants Fondecyt 1080643 and
1101701.
References
1. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1995)
2. Strickert, M., Hammer, B.: Merge SOM for Temporal Data. Neurocomputing 64, 39-72 (2005)
3. Voegtlin, T.: Recursive Self-Organizing Maps. Neural Networks 15, 979-991 (2002)
4. Hammer, B., Micheli, A., Neubauer, N., Sperduti, A., Strickert, M.: Self Organizing Maps for Time Series. In: Proceedings of the Workshop on Self-Organizing Maps (WSOM), Paris, pp. 115-122 (2005)
5. Hammer, B., Micheli, A., Sperduti, A., Strickert, M.: Recursive Self-Organizing Network Models. Neural Networks 17, 1061-1085 (2004)
6. Chappell, G.J., Taylor, J.G.: The Temporal Kohonen Map. Neural Networks 6, 441-445 (1993)
7. Tino, P., Farkas, I., van Mourik, J.: Dynamics and Topographic Organization of Recursive Self-Organizing Maps. Neural Computation 18, 2529-2567 (2006)
8. Martinetz, T.M., Berkovich, S.G., Schulten, K.J.: Neural-gas Network for Vector Quantization and its Application to Time-Series Prediction. IEEE Transactions on Neural Networks 4, 558-569 (1993)
9. Fritzke, B.: A Growing Neural Gas Learns Topologies. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.) NIPS, pp. 625-632. MIT Press, Cambridge (1995)
10. Strickert, M., Hammer, B.: Neural Gas for Sequences. In: Yamakawa, T. (ed.) Proceedings of the Workshop on Self-Organizing Networks (WSOM), Kyushu, Japan, pp. 53-58 (2003)
11. Andreakis, A., Hoyningen-Huene, N.V., Beetz, M.: Incremental Unsupervised Time Series Analysis Using Merge Growing Neural Gas. In: Príncipe, J.C., Miikkulainen, R. (eds.) WSOM 2009. LNCS, vol. 5629, pp. 10-18. Springer, Heidelberg (2009)
12. De Vries, B., Principe, J.C.: The Gamma Model - A New Neural Model for Temporal Processing. Neural Networks 5, 565-576 (1992)
13. Estevez, P.A., Hernandez, R.: Gamma SOM for Temporal Sequence Processing. In: Príncipe, J.C., Miikkulainen, R. (eds.) WSOM 2009. LNCS, vol. 5629, pp. 63-71. Springer, Heidelberg (2009)
14. Estevez, P.A., Hernandez, R., Perez, C.A., Held, C.M.: Gamma-filter Self-organizing Neural Networks for Unsupervised Sequence Processing. Electronics Letters (in press, 2011)
15. Principe, J.C., Giuliano, N.R., Lefebvre, W.C.: Neural and Adaptive Systems. John Wiley & Sons, Inc., New York (1999)
16. Principe, J.C., Kuo, J.M., Celebi, S.: An Analysis of the Gamma Memory in Dynamic Neural Networks. IEEE Trans. on Neural Networks 5, 331-337 (1994)
17. Principe, J.C., De Vries, B., De Oliveira, P.G.: The Gamma Filter - A New Class of Adaptive IIR Filters with Restricted Feedback. IEEE Transactions on Signal Processing 41, 649-656 (1993)
Abstract. In this article, we present an analysis of the impact of nutrition and lifestyle on health at a global level. We have used the Self-Organizing Map (SOM) algorithm as the analysis technique. The SOM enables us to visualize the relative position of each country against a set of variables related to nutrition, lifestyle and health. The positioning of the countries follows the basic understanding of their status with respect to their socioeconomic conditions. We have also studied the relationships between the variables supported by the SOM visualization. This analysis presents many obvious correlations but also some surprising findings that are worth further analysis.
Introduction
The overall relationship between unhealthy diet and deteriorating health is obvious and generally well understood. Establishing this relationship from data may help us identify how severe it is for different specific aspects, and take appropriate corrective actions. Moreover, these types of analyses can be used to create social awareness of the strong impact of certain nutrition on the health of individuals, thus promoting overall health and wellbeing in society. The World Health Organization's Technical Report Series 916 explains that rapid industrialization in the previous decade has affected health and nutrition, especially in developing countries. This has resulted in inappropriate dietary patterns, decreased physical activity and increased tobacco use, and a corresponding increase in diet-related chronic diseases, especially among poor people [1]. The report also states that existing scientific evidence has helped in identifying the role of diet in controlling various diseases. However, the evidence is contradictory at times [1].
A possible way to study correlations of different elements of nutrition and lifestyle with diseases is to focus on the eating and drinking trends in different parts of the world. The patterns of nutrition intake vary to a great extent across regions of the world, as does the prevalence of various diseases. The connections between nutrition intake and health can therefore be examined further (see e.g. [2]). Volkert [2] studies the nutrition and lifestyle of elderly people in Europe and finds that these vary widely even within Europe. Volkert further states that the elderly in the south consume more vegetables, grains, fruit, lean meat and olive oil, whereas relatively more milk products are consumed in the northern European countries. These kinds of findings are particularly interesting because a deeper analysis of world-wide eating trends and prevalence of diseases may enable us to identify the regions where changes in eating habits are essential to promote wellbeing. In our research, we have carried out a similar investigation in which we can clearly identify large differences in nutrition intake at the global level. Thus, we can group different regions of the world according to the nutrition intake profile of each region. We have also explored the links between the prevalence of certain diseases in various countries and citizens' diets in those countries. The dataset that we have used for our analysis contains more than one hundred nutrition, lifestyle and health indicators for 86 countries. The dataset has been obtained from Canibais e Reis [3], with FAO (Food and Agriculture Organization) [4], WHO (World Health Organization) [5] and the British Heart Foundation [6] as the main sources of the data.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 160-167, 2011.
© Springer-Verlag Berlin Heidelberg 2011
In order to inspect the aforementioned aspects of our research, sophisticated analysis techniques are required, since the dimensionality of the data is large owing to the many health, lifestyle and nutrition related features. For this reason, we have used the Self-Organizing Map (SOM) algorithm, a well-known data analysis and visualization technique, to mine interesting correlations. The SOM is a suitable means to create an ordered representation of high-dimensional data. The method reduces the complexity of the data and reveals meaningful relationships by mapping the data into an intuitive two-dimensional space. This also helps in understanding dependencies between variables in the data. In our analysis, we have used the SOM for a countrywide grouping of the data or data subsets. For instance, the SOM enables us to visually perceive the groups of countries based on the spread of different diseases or the intake of certain nutrition elements. The SOM also helps in visualizing the relationships between different food items and diseases.
Section 2 provides a brief introduction to the SOM and the dataset. Section 3 presents the results of experiments on the data, and finally Section 4 concludes the paper.
This section first briefly describes different features of the dataset and then sheds some light on the technical and mathematical details of the SOM.
2.1
Y. Mehmood et al.
consumption of proteins, sugar and milk products, and various other components of nutrition (see a subset shown in Fig. 1). The lifestyle category provides information related to drinking and smoking habits, among others. This categorization has helped us, as shown in Section 3, to group different countries based on the similarity of food consumption and the spread of diseases.
Nutrition                           Argentina    China  Ethiopia  Finland  ...    USA
Protein (g/day)                            94       82        54      102  ...    114
Fats (g/day)                              100       90        20      127  ...    156
Carbohydrates (g/day)                   477.5    450.5       366   399.75  ...    426
Animal Products (kcal/day)                823      644        96     1164  ...   1045
Animal Fats (kcal/day)                     72       46        13      131  ...    116
Bovine Meat (kcal/day)                    342       27        29       90  ...    115
Butter, Ghee (kcal/day)                    28        1         5       78  ...     40
Cheese (kcal/day)                          90        1         0      164  ...    149
Eggs (kcal/day)                            24       74         1       32  ...     55
Fats, Animals, Raw (kcal/day)              42       44         7       17  ...     75
Fish, Seafood (kcal/day)                   10       35         0       59  ...     28
Freshwater Fish (kcal/day)                  1       20         0       14  ...      5
Honey (kcal/day)                            1        1         4        4  ...      4
Meat (kcal/day)                           475      440        44      497  ...    451
Milk - Excluding Butter (kcal/day)        222       30        33      438  ...    390
Milk, Whole (kcal/day)                    127       28        27      218  ...    199
Mutton & Goat Meat (kcal/day)               8       15         6        2  ...      3
Offals, Edible (kcal/day)                  17       10         3        4  ...      3
Pelagic Fish (kcal/day)                     1        0         0       34  ...      7
Pigmeat (kcal/day)                         34      343         0      348  ...    132
Poultry Meat (kcal/day)                    83       51         2       53  ...    197
Vegetal Products (kcal/day)              2135     2296      1761     1978  ...   2708
Alcoholic Beverages (kcal/day)             67      150        12      183  ...    102
...                                       ...      ...       ...      ...  ...    ...
Fig. 1. A selection of nutrition-related data
2.2
Self-Organizing Maps
Our analysis of the data is twofold. In the first phase, we visualize the distribution of countries on the two-dimensional space of the SOM, based on the consumption of different nutrients. This shows how countries group with each other with regard to nutrition intake. The second phase of the analysis shows relationships between the nutrition variables as well as between the nutrition variables and diseases.
3.1
Nutrition Analysis
Fig. 2. A map of nutrition variables with an illustration of division of larger geographical groups
In Fig. 2, the structure of clusters on the map is illuminated by the visualization of distances in the original high-dimensional space. Shades of red indicate large distances in the original space, whereas shades of blue refer to relatively shorter distances. The visualization is interesting since the groups of countries on the SOM appear to closely resemble the region-wise grouping of countries on the globe. This possibly reflects the fact that people living in a given region of the world have common preferences for certain foods when compared with others. We have annotated the SOM with virtual group boundaries. However, this demarcation is not fully correct, since some countries fall into a group of countries that are not their geographical neighbors. Interesting exceptions include Egypt, Iran and Turkey appearing in the group of North African countries. Moreover, Israel appears in the group of South European countries, and Jamaica in South America. These kinds of exceptions arise from resemblance in eating trends.
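As an illustration of the distance visualization used in Fig. 2, the following minimal Python sketch computes, for each map unit, the mean Euclidean distance between its weight vector and those of its immediate grid neighbors. It assumes a rectangular lattice with 4-connected neighbors and a codebook stored as nested lists; the function name `u_matrix` and this data layout are illustrative assumptions and do not reproduce the tool used in the study.

```python
import math


def u_matrix(codebook):
    """Mean distance from each unit's weight vector to its grid neighbours.

    codebook[r][c] is the weight vector (list of floats) of the unit at
    row r, column c. A rectangular, 4-connected lattice is assumed.
    """
    rows, cols = len(codebook), len(codebook[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(codebook[r][c], codebook[nr][nc]))
            # A unit with no neighbours (1x1 map) keeps the value 0.0.
            result[r][c] = sum(dists) / len(dists) if dists else 0.0
    return result
```

Large values mark units whose neighbors are far away in the input space (cluster borders, the red shades of the figure), while small values mark the interior of a cluster.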
We have also performed similar analyses by combining various disease, lifestyle and nutrition variables. The maps of disease and lifestyle variables (the disease map is shown in Section 3.3) show various similar groups. For example, groups of various South Asian and European countries can be identified (see Section 3.3).
3.2
Correlation Analysis
Fig. 5. Relationship between the intake of sugar and sweeteners (kcal/day) and the
mean total cholesterol value (mg/dl) in men and women
on the gray scale where the dark shades represent low values and lighter shades
represent high values of variables.
The correlation between fats and animal products (see Fig. 3) is significant since it is strong at both high and low values. High consumption of fats in European countries is strongly correlated with high consumption of animal products. Similarly, low consumption of fats in some East African and South Asian countries is strongly correlated with low consumption of animal products.
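One possible way to quantify the visual agreement between two component planes, such as those of fats and animal products, is to compute the Pearson correlation of their values over all map units. The sketch below assumes that the trained codebook is available as nested lists indexed by row and column; `component_plane` and `pearson` are hypothetical helper names, and this numeric check is added for illustration only and is not part of the reported analysis.

```python
import math


def component_plane(codebook, k):
    """Flatten the k-th weight component of every unit into one list."""
    return [unit[k] for row in codebook for unit in row]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    spread_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if spread_a == 0.0 or spread_b == 0.0:
        raise ValueError("correlation is undefined for a constant plane")
    return cov / (spread_a * spread_b)
```

A coefficient close to 1 would correspond to planes that are light and dark in the same map regions, and a coefficient close to -1 to planes with inverted shading.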
Another component map of nine commonly used food constituents is shown in Fig. 4. The map clearly shows strong correlations at relatively low values of consumption of coffee, cheese, butter, wine, eggs, potatoes, starch and wheat products in various parts of the world.
Disease-Food correlation. Another important aspect of our research was to
explore the connection between food and health. For this reason, we studied the
Disease Analysis
Conclusion
The results obtained using SOM analysis provide a good understanding of the data by showing the underlying correlations not only among different food components but also between food components and diseases.
The study also shows the significance of machine learning techniques for inferring useful information from different eating and drinking trends in a population. Moreover, deeper analyses performed on richer datasets may bring forth information that can be used for societal and individual wellbeing.
Acknowledgments: We are grateful to EIT-ICT labs for funding the Well-being Innovation Camp, which greatly helped in conceiving the idea of this paper. We are also thankful to the organizers of the Camp as well as the participants.
References
1. WHO: Technical report series 916 (2003), http://whqlibdoc.who.int/trs/
2. Volkert, D.: Nutrition and lifestyle of the elderly in Europe. Journal of Public Health 13, 56-61 (2005)
3. Canibais e Reis: Integrated nutrition, lifestyle and health database: epidemiological information for an improved understanding of diseases of civilization (2009), http://www.canibaisereis.com/2009/03/21/nutrition-and-health-database/
4. FAO: Statistical yearbook 2005-2006 - consumption, http://www.fao.org/economic/ess/en/
5. WHO: Global health atlas, http://apps.who.int/globalatlas/dataQuery/
6. British Heart Foundation statistics website, http://www.heartstats.org/atozindex.asp?id=8
7. Kohonen, T.: Self-organizing maps. Springer Series in Information Sciences (2001)
8. Díaz, I., Domínguez, M., Cuadrado, A.A., Fuertes, J.J.: A new approach to exploratory analysis of system dynamics using SOM. Applications to industrial processes. Expert Systems with Applications 34(4), 2953-2965 (2008)
9. Abonyi, J., Nemeth, S., Vincze, C., Arva, P.: Process analysis and product quality estimation by self-organizing maps with an application to polyethylene production. Computers in Industry 52(3), 221-234 (2003)
10. Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J.: Self-organizing map in Matlab: the SOM toolbox. In: Proceedings of the Matlab DSP Conference 1999, Espoo, Finland, November 1999, pp. 35-40 (1999)
11. Welsh, J.A., Sharma, A., Abramson, J.L., Vaccarino, V., Gillespie, C., Vos, M.B.: Caloric sweetener consumption and dyslipidemia among US adults. Journal of American Medical Association 303, 1490-1497 (2010)
12. Welsh, J.A., Sharma, A., Cunningham, S.A., Vos, M.B.: Consumption of added sugars and indicators of cardiovascular disease risk among US adolescents. Circulation 123, 249-257 (2011)
Abstract. Cities are instances of complex structures. They present several conflicting dynamics, emergence of unexpected patterns of mobility and behavior, as well as some degree of adaptation. To make sense of several aspects of cities, such as traffic flow, mobility, social welfare, social exclusion, and commodities, data mining may be an appropriate technique. Here, we analyze 72 neighborhoods in Mexico City in terms of economic, demographic, mobility, air quality and several other variables for the years 2000 and 2010. The visual information obtained by the self-organizing map shows interesting and previously unseen patterns. For city planners, it is important to know how neighborhoods are distributed according to demographic and economic variables. It is also important to observe how geographically close neighborhoods are distributed in terms of these variables. Self-organizing maps are a suitable tool for planners to seek those correlations, as we show in our results.
Keywords: Self-organizing maps, urbanism, data mining, urban
analysis.
Introduction
Cities are complex structures defined by several processes and variables [1]. Urban planners, as well as policy and decision makers, need to understand those processes and variables in order to find patterns that enable proper planning. Several methodologies have been applied, such as those derived from statistics [2] and others derived from mathematical and computational modeling [3]. A third tool is data mining, in which algorithms are presented with (possibly) high-dimensional data, and structures and patterns are identified.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 168-177, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Mining the City Data: Making Sense of Cities with Self-Organizing Maps
Cities and their constituent regions may be quantified in terms of relevant variables. Each region is associated with a feature vector, and from there, data visualization techniques may be applied. We define in this section the fundamental variables that characterize city areas.
For city planners and policy makers, it is important to distinguish between two major attributes of cities: physical aspects and human-related aspects [7]. The first group refers to variables that take into account the street topology and the distribution of different services as well as of pollution indicators. Examples of these variables are the number of streets, avenues, turnovers, traffic lights, number of schools in a given region, pedestrian bridges, the number of offices and industries in the area, the number of streets connecting to major avenues, and the average concentration of pollutants in the last month.
We define a neighborhood as a region of blocks that share the same administrative instance. This is of course an artificial division, but it is defined as the basic structure within cities. Each neighborhood is described by an attribute vector. The variables defining the high-dimensional space are now specified. The first group of variables describes not only the distribution of public services and facilities, but also the street network topology, which is fundamental for proper traffic flow over the city. Two of the most frequent variables are the number of nodes and edges of a street graph. Edges are defined as sub-sections of streets that connect two nodes, and nodes are defined as squares, intersections between streets and dead ends. These variables characterize the possible traffic flow patterns within the neighborhood studied [8]. Also in this group of variables are the pollution conditions. The main pollutants considered in health analysis are CO, O3, NO, NO2 and SO2, among others. Each pollutant is described by its average concentration during a given period of time and the number of measurements that exceeded the maximum safe level.
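The street graph variables can be made concrete with a short Python sketch. It assumes that the street network of a neighborhood is given as a list of street sub-sections, each a pair of node identifiers; the function name `street_graph_stats`, the input format and the extra outputs (connected components and dead ends) are illustrative assumptions rather than the procedure used in this study.

```python
from collections import defaultdict


def street_graph_stats(segments):
    """Count nodes, edges, connected components and dead ends.

    segments is an iterable of (node_a, node_b) pairs, one per street
    sub-section. Duplicate segments and self-loops are ignored.
    """
    adjacency = defaultdict(set)
    edges = set()
    for a, b in segments:
        if a == b:
            continue
        edges.add(frozenset((a, b)))
        adjacency[a].add(b)
        adjacency[b].add(a)
    nodes = set(adjacency)

    # Depth-first search to count connected components.
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)

    # A dead end is a node touched by exactly one edge.
    dead_ends = sum(1 for n in nodes if len(adjacency[n]) == 1)
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "components": components,
        "dead_ends": dead_ends,
    }
```

A result with more than one component corresponds to a street network that cannot be represented by a single connected graph.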
The second group of variables describes the inhabitants of a given region. Examples of these variables are the average income, educational level, average number of individuals living in the same household, number of trips generated in a traffic district (generally weighted by the population in the neighborhood), number of trips attracted to a given region, the percentage of houses with proper services, and the proportion of individuals with social security (see Fig. 1-b). The age distribution is also relevant, as well as the gender distribution. Hereafter, we refer to the urban space as the space generated by all these variables (groups one and two).
Cities are dynamic in several ways. Citizens get older, new ones are born, pollution may be controlled, new streets are built, new schools may be constructed, or old houses may be demolished to make way for new buildings. All these processes are difficult to observe at once with traditional tools, but are very relevant for urban planners.
Megalopolises, defined as cities with more than 10 million inhabitants, are the result of several growth processes [1,2]. Although there is no formal theory of city growth and development, there are some general patterns more or less accepted by the majority of specialists. In the first process, growth occurs from several isolated urban spots. Small towns and settlements tend to grow by diffusion-related mechanisms, and eventually the free areas are covered. These small towns are mainly ancient settlements that have suffered from demographic stress and migration from other areas. These small areas may also be industrial complexes or administrative facilities that tend to attract construction towards their vicinity.
The second process of city growth is the planning of completely new settlements, more or less near already existing settlements. These settlements are in general planned to be self-contained, that is, to include schools and other facilities so as to minimize the need to travel outside their limits, other than for commuting to work areas. The third process is related to the formation of poor or very poor areas, the so-called satellite cities or slums [9]. These settlements are formed by people attracted to the city who cannot afford to live in already established neighborhoods. These settlements tend to be isolated from the major streets, with poor or non-existent facilities.
In the next section, we describe the project of analyzing urban data from several neighborhoods in a megalopolis. The neighborhoods are described by the variables defined above.
In this project we analyze urban data of Mexico City from three major databases [10,11,12]. We have two major interests in studying these data. First, we seek to identify both neighborhoods with similar urban features and the urban heterogeneity of certain regions. Second, we intend to visualize the evolution of certain neighborhoods from the urban point of view. We compared 72 neighborhoods at two different stages: the years 2000 and 2010. We selected those neighborhoods because they are within a distance of up to 1 km from a major public transport system established in 2004, so we may, as a side effect, study its impact on its surroundings. This public transport system is an instance of the so-called Bus Rapid Transit (BRT), which consists of high-capacity buses on an exclusive lane, running along a major avenue that crosses the city from south to north for more than 24 km. In the year 2000, the BRT was not yet implemented, so we have data on the previous situation in the surrounding neighborhoods.
The neighborhoods we considered in this study vary in size from just a few blocks up to one hundred, and from 950 inhabitants up to 25,000. However, besides the obvious variables related to size, other variables are also relevant, namely the already defined variables that constitute the urban space (see Fig. 1-b). The total population of the 72 neighborhoods is 554,590 in 2000 and 551,390 in 2010.
Based on the seminal work by Kaski and Kohonen [13], in which self-organizing maps were applied to study the world poverty distribution according to several variables, we propose the use of the SOM to study not only the distribution of instances (neighborhoods hereafter), but also their evolution in the high-dimensional space. By evolution, we refer to how each neighborhood has changed its own variables between two different times (2000 and 2010).
The self-organizing map is a non-linear projection capable of showing, in a low-dimensional discrete space, the data distribution found in the high-dimensional feature space [14]. One of the main properties of the SOM is its ability to preserve in the output map the topographical relations present in the input data [14]. This is achieved by transforming an incoming analog signal of arbitrary dimension into a discrete low-dimensional map, and by adaptively transforming data in a topologically ordered fashion [15]. Each input vector is mapped to the unit in the lattice whose weight vector is closest to it, the best matching unit (BMU). The SOM preserves neighborhood relationships during training through the learning equation (1), which establishes the effect that each BMU has on any other neuron. The two-dimensional SOM is a map from $\mathbb{R}^d$ to $\mathbb{N}^2$.
The SOM structure consists of a two-dimensional lattice of units referred to as the map space. Each unit $n$ maintains a dynamic weight vector $w_n$, which is the basic structure that allows the algorithm to form the map. The dimension of the input space is taken into account by giving the weight vectors as many components as there are features in the input space. The variables defining the input space, and thus the weight space, are continuous. Weight vectors are adapted according to:
$$w_n(t+1) = w_n(t) + \alpha(t)\, h_n(g,t)\, (x_i - w_n(t)) \qquad (1)$$
where $\alpha(t)$ is the learning rate at epoch $t$, $h_n(g,t)$ is the neighborhood function from BMU $g$ to unit $n$ at epoch $t$, and $x_i$ is the input vector. The neighborhood function must decrease with time and distance for proper map unfolding.
function is to be decreasing with time and distance for proper map unfolding.
With the aim to visualize how neighborhoods are distributed in the highdimensional space, SOMs were constructed. We made use of software SOMPACK, available at www.cis.hut./research/som_lvq_pak.shtml. Fig. 2 shows
the U-matrix [16]. It is indicated the id of the neighborhood followed by the label
that identies the year (00 or 10 for 2000 and 2010). The gray levels represent
the similitude of the surrounding cells, and are useful to identify the clusters
formed by the SOM.
Fig. 1. (a) The 72 neighborhoods considered in this study, shown at their general geographic location. The area of the polygons representing neighborhoods is proportional to the square of the population in the neighborhood. (b) The variables included in the analysis.
Fig. 2. U-matrix for SOM. The 72 neighborhoods are shown for years 2000 (00) and
2010 (10).
There are several clusters identified by the SOM and highlighted by the U-matrix method. They are labeled A - M. Neighborhoods in cluster A are very poor areas, slums, with poor urban services and poor inhabitants. They are located in the southern part of the studied region. Streets in those neighborhoods are few and mostly insufficient even to be represented by a connected graph. Area B is geographically close to the neighborhoods in area A, but has a different description. It consists of modern neighborhoods, less than 25 years old, with a high concentration of physicians and researchers working in the nearby hospitals and at the university. Area C consists of wealthy neighborhoods with high living standards. The neighborhoods in this region are either original settlements (5, 31) or recent urban developments (14, 16, 22, 29). Areas B and C are formed by neighborhoods in the southern part of the city.
Cluster D is also formed by wealthy neighborhoods, but it includes some from the middle-south and middle-north parts of the city. They are mainly modern neighborhoods with adequate urban planning, which includes wide avenues and streets. Several facilities are found in the neighborhood or its surroundings, along with good public transport, including metro stations. Clusters E - G consist of relatively new neighborhoods, with several apartment buildings and with hospitals, several schools and the university in the surroundings. They are located in the middle-south area of the city.
Clusters H and I include neighborhoods in the central area and in the northern part of the city. They are similar not only in that they are wealthy neighborhoods, but also in that the street topology, traffic conditions and air pollution are more or less equivalent. CO levels are high in the neighborhoods within these clusters. Cluster J includes residential neighborhoods in the vicinity of industrial complexes, with middle and working classes, although with several facilities. Finally, clusters K and L are new urban settlements with good street planning and are, in general, among the neighborhoods with the highest standards in Mexico City.
In order to interpret the results, it is important to keep in mind that two factors are being considered simultaneously. The first one is that the studied neighborhoods are situated along a main avenue that crosses Mexico City from south to north for almost 30 km, served by a major public transport system. The analysis of the SOM of these neighborhoods is important as it sheds light on similar features and patterns across the city. The first important aspect we observe is that neighborhoods geographically close to each other tend to have more or less similar positions in the urban space (see also Fig. 1). However, this is not always the case. Neighborhood 5, an original town from which the city grew in the last century, is a wealthy region. The neighborhoods around it (12, 6, 7) are considered poor regions. Moreover, the differences are not only economic: the street topology of the two regions is also different (see Fig. 3).
The second aspect is that each neighborhood is represented by two instances: its description, according to the relevant urban variables, in the year 2000 and in the year 2010. So, in the same map we find two instances of each neighborhood. The distance between these two instances visualizes how different a neighborhood is in 2010 compared to how it was in 2000. That is, we can visualize the evolution of each neighborhood over the two periods at the same time as we visualize how each neighborhood is placed in the urban space with respect to other neighborhoods. In general, ten years is a short period of time in which to observe significant changes in a city.
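The shift of a neighborhood between its two instances can also be expressed as a number. The following Python sketch assumes that the BMU position of every labeled instance is available as a dictionary keyed by (neighborhood id, year label); the function name `map_displacement` and this data layout are hypothetical, and the grid distance is only one possible measure of how far an instance has moved on the map.

```python
import math


def map_displacement(bmu_by_instance, year_a="00", year_b="10"):
    """Grid distance between the two instances of each neighborhood.

    bmu_by_instance maps (neighborhood_id, year_label) to the (row, col)
    position of the best matching unit of that instance.
    """
    shifts = {}
    for (nid, year), pos in bmu_by_instance.items():
        if year != year_a:
            continue
        later = bmu_by_instance.get((nid, year_b))
        if later is not None:
            shifts[nid] = math.dist(pos, later)
    return shifts
```

A displacement of zero means that both instances fall on the same unit, which would correspond to a neighborhood that has stayed essentially the same over the decade.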
In general, it is observed that neighborhoods tend to stay more or less the same, at least over a ten-year period. There are, however, some interesting counterexamples. Neighborhood 14, a residential area with several apartment buildings, has shifted its position towards a cluster of neighborhoods with higher standards. Also, no neighborhood seems to be attracted to clusters of poor neighborhoods. However, several neighborhoods have not left their condition of poverty (neighborhoods 1, 3, 6, 7, 10, ...).
Fig. 3 shows the planes for some of the considered variables. A plane indicates, in gray levels over each unit, the average value of a certain variable over the vectors mapped to that unit. The first plane (starting at top left) shows the distribution of the percentage of households with at least one car. Plane 6 shows the percentage of the population of each neighborhood that earns 10 or more basic salaries. A similarity is observed between these two planes, which indicates that people with good salaries can afford to buy cars.
Fig. 3. Planes for some of the considered variables. Gray level indicates the corresponding value of the plane. Light tones indicate higher values.
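The definition of a plane given above, the average value of one variable over the vectors mapped to each unit, can be sketched as follows in Python. The function name `variable_plane` and the representation of the mapping as a list of BMU positions parallel to the list of input vectors are assumptions made for illustration.

```python
from collections import defaultdict


def variable_plane(samples, bmus, k):
    """Average of variable k over the input vectors mapped to each unit.

    samples is a list of input vectors; bmus[i] is the (row, col) unit
    to which samples[i] is mapped. Units with no mapped vector are
    absent from the result.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for x, pos in zip(samples, bmus):
        totals[pos] += x[k]
        counts[pos] += 1
    return {pos: totals[pos] / counts[pos] for pos in totals}
```

Rendering these averages as gray levels over the lattice, with light tones for high values, would give a display of the kind shown in Fig. 3.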
Another interesting and previously undetected fact is that the neighborhoods with the highest educational level are not the neighborhoods whose inhabitants earn the highest salaries, which contradicts what is observed in, for example, the United States [17] (see planes 6 and 9 in Fig. 3).
On the side of urban traffic and pollution issues, a few facts are also of interest to urbanists, and were previously undetected. Plane 8 shows the number of streets connecting the neighborhood to a major avenue. Comparing this plane with plane 1 shows that neighborhoods whose inhabitants have more cars are less connected than neighborhoods whose inhabitants have fewer cars. This, of course, is a cause of traffic jams. This finding expresses the fact that neighborhoods with many cars will present greater traffic problems not only because there are more cars but also because there are fewer streets by which to leave (or enter) the neighborhood.
Plane 1 and plane 12 show that neighborhoods with more cars tend to be less connected internally, that is, fewer streets connect the neighborhood with itself. This is confirmed by the fact that many of the neighborhoods whose households have more cars tend to present more traffic problems [18].
Plane 4 shows the number of trips generated from a neighborhood to another one. This variable includes both trips to work and trips for social reasons. It is important for urbanists and traffic officials when proposing new or additional routes. Comparing this plane with plane 9 shows that in neighborhoods with a lower level of education, people tend to travel more outside their own neighborhood. From this, it may be inferred that people with higher levels of education tend to live near their jobs.
In order to make sense of data from different fronts related to urban settlements and their inner processes and phenomena, we applied self-organizing maps. The visualization capabilities of the self-organizing map were very helpful in identifying some patterns and correlations. We analyzed several neighborhoods within the vicinity of a bus rapid transit line, based on several dozen demographic, economic, environmental and topological variables, and we were able to find some clusters. Those clusters group neighborhoods with similar descriptions, so that interesting patterns may be identified.
The ability to visualize clusters and their correlations simultaneously, as in the U-matrix and planes, is important when seeking relevant patterns. In the case of urban data, it may lead to the discovery of relevant information. In this project, we identified some hidden, previously undetected patterns. We have also been able to detect clusters of similar neighborhoods.
Visualization of high-dimensional data is a first step for urban planners to make sense of the city and its inner displacements and processes. Self-organizing maps are a good alternative, at least for data visualization and data inspection tasks.
Acknowledgments
This research is derived from a project supported by Instituto de Ciencia y
Tecnología del Distrito Federal, Mex. (ICyTDF), under contract PICCT08-55.
References
1. Batty, M.: Cities and Complexity. MIT Press, Cambridge (2005)
2. Bettencourt, L., West, J.: A unified theory of urban living. Nature 467, 912–913
(2010)
3. Batty, M.: Rank clocks. Nature 444, 592–596 (2006)
4. Batty, M., Steadman, P., Xie, Y.: Visualization in spatial modeling. In: Working
Paper from Center for Advanced Spatial Analysis (2004)
Mining the City Data: Making Sense of Cities with Self-Organizing Maps
5. Lobo, V., Cabral, P., Bação, F.: Self organizing maps for urban modelling. In: Proc.
9th Int. Conf. on Geocomputation (2007)
6. Castro, A., Gómez, N.: Self-organizing map and cellular automata combined technique for advanced mesh generation in urban and architectural design. Int. J.
Information Technologies and Knowledge 2, 354–360 (2008)
7. Batty, M.: Urban modelling. Cambridge University Press, Cambridge (1976)
8. Buhl, J., Gautrais, J., Reeves, N., Solé, R., Valverde, S., Kuntz, P., Theraulaz, G.:
Topological patterns in street networks of self-organized urban settlements. The
European Physical Journal B 49, 513–522 (2006)
9. Milgram, S.: The experience of living in cities. Science 167, 1461–1468 (1970)
10. Instituto Nacional de Estadística, Geografía e Informática (National Institute for
Statistics, Geography, and Informatics), Mex, http://www.inegi.org.mx
11. Encuesta Origen Destino (Origin destination survey),
http://igecem.edomex.gob.mx/descargas/estadistica/ENCUESTADEORIGEN/
12. Sistema de Información de Desarrollo Social, http://www.sideso.df.gob.mx
13. Kaski, S., Kohonen, T.: Exploratory data analysis by the self-organizing map:
structures of welfare and poverty in the world. In: Apostolos-Paul et al. (eds.) Neural
Networks in Financial Engineering, pp. 498–507 (1996)
14. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Heidelberg (2000)
15. Hujun, Y.: The self-organizing maps: Background, theories, extensions and applications. In: Computational Intelligence: A Compendium, pp. 715–762 (2008)
16. Ultsch, A.: Self organized feature maps for monitoring and knowledge acquisition of
a chemical process. In: Proc. of the Int. Conf. on Artificial Neural Networks, pp.
864–867 (1993)
17. United States Department of Labor. Bureau of labor statistics,
http://www.bls.gov/emp/ep_chart_001.htm
18. Graizbord, B.: Geografía del transporte en el área metropolitana de la Ciudad de
México. Colmex (2008)
Abstract. In recent years, a variety of visualization techniques for visual data exploration based on self-organizing maps (SOMs) have been
developed. To support users in data exploration tasks, a series of software
tools emerged which integrate various visualizations. However, the focus
of most research was the development of visualizations which improve the
support in cluster identification. In order to provide real insight into the
data set it is crucial that users have the possibility of interactively investigating the data set. This work provides an overview of state-of-the-art
software tools for SOM-based visual data exploration. We discuss the
functionality of software for specialized data sets, as well as for arbitrary
data sets with a focus on interactive data exploration.
Introduction
Self-organizing maps (SOMs) [1] have been widely employed for data exploration
tasks in the last decade. Their popularity is especially due to the ability of creating low-dimensional, topology preserving representations of high-dimensional
data. Visualizations of these representations help the user to understand the
distribution of data elements in the feature space. In recent years, a variety of
visualization techniques have been developed which support users in the identification of clusters and correlated data. Most of these visualizations have been
included in software tools which provide different means to analyze and explore
the data set.
The active research area of visual analytics identified early on the need to interact
with data visualizations. Interactive data exploration is relevant in every domain
where users want to gain insight into the data. Interaction techniques allow users
to mentally connect visual elements with actual data. The application of such
techniques may result in a deeper understanding of the visualization and allow
a goal-oriented analysis of the data.
Most current SOM-based visualizations focus on cluster formation, contribution of variables (features) to these clusters, and homogeneity of clusters. However, to gain insight into the data, interaction with the clustered data is crucial.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 178–187, 2011.
© Springer-Verlag Berlin Heidelberg 2011
The selection of feature vectors and investigation of data linked to these feature
vectors can improve the visual data exploration task. Additionally, more information may be conveyed through such interactive visualizations than a single
static visualization is capable of.
This paper presents a short overview of SOM-based visualization techniques
and evaluates existing SOM-based software tools regarding their application to
visual interactive data exploration. We summarize the results in order to provide
a reference for interactive tools. We conclude with the discussion of the current
state-of-the-art software tools.
J. Moehrmann et al.
feature vectors which are projected onto a map unit (as BMU), either utilizing
the size of the unit, color coding, or a three-dimensional plot where the height of
each bar indicates the number of feature vectors. Although these visualizations
focus on the identification of clusters, Vesanto additionally discusses visualization
techniques which focus on data analysis. Response surfaces, for example, display
the relative goodness of all units for the data set (using color coding) and thereby
provide information about the quality of the map. Component planes allow the
investigation of individual variables of codebook vectors by color coding their
influence on the U-Matrix visualization.
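As an illustration of the two simplest visualizations mentioned above, the following minimal Python sketch computes the data behind a hit histogram (number of feature vectors per BMU) and a component plane (one variable across all units). The function names, the flat list layout of the codebook, and the toy values are assumptions made for this example; they are not taken from any of the tools discussed.

```python
import math

def best_matching_unit(codebook, x):
    """Return the index of the codebook vector closest to x (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda j: math.dist(codebook[j], x))

def hit_histogram(codebook, data):
    """Count how many feature vectors are projected onto each map unit."""
    hits = [0] * len(codebook)
    for x in data:
        hits[best_matching_unit(codebook, x)] += 1
    return hits

def component_plane(codebook, component):
    """Values of one variable across all map units, ready for color coding."""
    return [w[component] for w in codebook]

# Toy 2x2 map stored as a flat list of codebook vectors (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
data = [[0.1, 0.1], [0.9, 0.2], [0.8, 0.1], [0.2, 0.9]]
hits = hit_histogram(codebook, data)
plane0 = component_plane(codebook, 0)
```

The resulting lists only need to be mapped to unit size, color, or bar height to obtain the visualizations described in the text.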
Mayer et al. [13] developed visualizations for coloring map units based on
category labels of the feature vectors (class visualization). The category labels are
provided by the ground truth data. Rauber et al. [14] developed a visualization
which also displays labels for clusters; however, no ground truth data is required.
Instead, the feature variable which is primarily responsible for the assignment
of a feature vector to a specific map unit is used as the label. This visualization
aims to provide users with a better understanding of the map topology. We
will summarize such techniques as class histograms in the remainder of this
work. Neumayer et al. [15] introduced the metro map metaphor which extends
component planes by connecting the lowest and highest values in each plane. The
resulting lines are combined in one diagram. This visualization is especially useful
for investigations of high-dimensional data sets where the analysis of individual
component planes becomes unfeasible.
Criteria
Dzemyda et al. [16] conducted a comparative analysis of six software tools for
SOM-based visual data exploration, with a focus on the interpretability of provided visualizations. Their conclusion was that insight might best be gained by
using several such tools. Although this recommendation seems unfeasible, it is
not surprising, since the analysis focused only on different visualizations and
accordingly found advantages and disadvantages for all of them.
In contrast, we evaluate SOM-based software tools for the use in visual interactive data exploration tasks. Our criteria focus on interaction, since it is a
key component and allows users to explore data, as well as their correlations
and properties, in detail. We will base our discussion of tools on Keim's visual
analytics mantra [17]: "analyze first; show the important; zoom, filter and analyze further; details on demand". In other words, visualizations should provide
users with interaction techniques that allow selection of data subsets, zooming,
filtering, and displaying detailed information, as well as appropriate diagnostics
and statistics.
We propose the following criteria for the assessment of the software tools
discussed in Section 4:
Data preprocessing. The common input for data exploration tools consists of feature
vectors in a specific data format. The software should provide methods to process the input data before SOM training. Data preprocessing routines should
at least provide functionality for data centering and data normalization.
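The preprocessing criterion can be made concrete with a short sketch. The text does not state which normalization is meant, so the example below assumes the common choice of centering each variable to zero mean and scaling it to unit variance; the function name and sample values are hypothetical.

```python
import statistics

def center_and_normalize(data):
    """Center each variable to zero mean and scale it to unit variance.

    Assumption: population standard deviation; constant variables are only centered.
    """
    columns = list(zip(*data))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in data]

raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # hypothetical feature vectors
scaled = center_and_normalize(raw)
```

After this step, variables measured on very different scales contribute comparably to the Euclidean distance used during SOM training.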
This section gives an overview of existing software for SOM-based visual data
exploration. To provide a clear structure we divided the software tools into two
categories: tools developed for the investigation of specialized data sets and tools
for the exploration of arbitrary data.
4.1
purpose and functionality. A summary of the visualization techniques provided by each tool and an assessment of their interactivity is given in Table 1.
VALT, Heidemann et al. [18,19]. This system was developed for the purpose of
labeling image data in an augmented reality (AR) scenario. The image data is
clustered with a SOM and the map is displayed to the user with one representative image per unit. Users may select and assign labels to map units or individual
images. The U-Matrix can be displayed in the background of the map.
Moehrmann et al. [20]. Similar to VALT but optimized for use on desktop computers. Image data is clustered with a SOM using arbitrary features. A zoomable
user interface (UI) is used to display the SOM in two levels of detail (map units
and all images for a BMU). Class labels can be assigned to map units or individual images.
Schreck et al. [21]. SOM-based visualization of time-series data. The system is
optimized for the application to time-dependent data and integrates techniques
to automatically identify interesting candidate views. Selection of map units
is possible, as well as displaying the component planes for them. The system
includes a U-Matrix visualization, color coding of quantization errors and nearest
neighbor connections. Users may merge or expand units, as well as edit, create
or delete them. The layout of the map may also be rearranged to better suit the
users' expectations. The time-dependent data is displayed as a two-dimensional
trajectory.
Torkkola et al. [22]. They propose the use of SOM-based visualizations for mining
gene expression data. The map can be color coded according to the value of
individual components (i.e. one component plane). It is, however, not obvious
whether the selection of a component can be performed interactively. Users may
set thresholds to identify clusters in the SOM. A basic degree of interactivity
with the map is given with the possibility of zooming into areas of interest. Users
may thereby investigate the data in detail (given as line plots).
Kanaya et al. [23,24]. Software tool developed for exploratory analysis of gene
expression data. No data preprocessing is included in the software. It provides
data histograms, component planes, as well as a comparison map which highlights differences between two component planes. This visualization is of interest
for this special data set since components refer to individual experiments. The
applicability of this visualization to other data sets is unknown. Feature vectors
can be investigated in detail by selecting a unit in the visualization.
4.2
Table 1 (visualization techniques provided by each tool and an assessment of their interactivity; the individual cell values are not recoverable from this copy)
Tools: Kanaya et al., VisiSOM, Torkkola et al., SOMVis, Schreck et al., Matlab SOMToolbox, Moehrmann et al., Synapse, Heidemann et al., Viscovery SOMine, Java SOMToolbox
Rows: Data preprocessing, Interaction with map, Interaction with data, Interaction with visualization, Label assignment, Data visualization, Data histograms, Class histogram, Cluster connections, Clustering, Component planes, Metro map, Neighborhood graph, U-Matrix, U*-matrix, P-Matrix, Response surfaces, Sky-metaphor visualization, Smoothed data histograms, Vector fields
for codebook vectors only. The whole software is focused on cluster investigation
and therefore does not provide many visualizations. The expert edition, which
we did not evaluate, seems to provide additional functionality which allows the
context-sensitive display of data vectors for selected units.
VisiSOM [30]. A commercial software for SOM-based data exploration which
provides only basic (2D and 3D) visualizations. Codebook vector data (complete, or individual component values) can be displayed on an additional axis
for selected map units. Additionally, feature vector data is displayed in a list
below the visualization and is updated if the selection is modied.
Discussion
As seen in the previous section, the tools can be divided into two classes: special
purpose tools and analytical purpose tools. Special purpose tools like Heidemann
et al., Moehrmann et al., or Schreck et al. provide advanced interaction techniques which are optimized for the exploration of specific data sets. However,
only few visualizations are available. This results from the fact that both image
and time-series data can be displayed well and are easy for users to grasp in
their natural form. Additional advanced cluster visualizations would not provide
much benefit. In contrast, gene expression data is much more generic due to its
abstract nature. The tools for such data apply many advanced visualizations but lack
interactivity, although the investigation of different experiments could probably
be performed more efficiently with appropriate techniques.
Analytical purpose tools, like Java SOMToolbox and the SOMVis extension
for Matlab, provide various visualizations for standardized input data. Additionally, visualizations may be adapted to better suit the user's purpose through the
user interface or via command line. Viscovery SOMine and Peltarion Synapse
provide linked views which allow users to investigate several visualizations at
once, thereby supporting the identification of correlations. Both systems provide
good interaction techniques for the map and the visualizations, but only very basic interaction with the data. Although SOMine allows the assignment of labels
to codebook vectors, it is not possible to identify incorrectly projected feature
vectors. In contrast, Synapse allows the manual editing of class attributes in the
feature vector but does not support the assignment of labels to whole clusters.
It is difficult for analytical purpose tools to display the actual underlying data.
Such visualizations require special models and optimizations and one can hardly
implement visualizations for all possible applications. However, interactive or
embedded data visualizations are essential for extensive data exploration tasks
and special tasks, like image or text labeling.
Most tools allow basic interaction with the maps, but only few provide additional information depending on the level of detail. For example, labels are
displayed without consideration of the current zoom level, which leads to visual clutter. Instead, labels for whole clusters could be displayed at a low zoom
level and, as the zoom is increased, labels could be displayed in more detail, for
codebook vectors or, at a very high zoom level, for individual data vectors. A
major advantage for visual data exploration would be the possibility to augment
the visualization with customizable information. Instead of displaying class labels, individual components could be of interest. It is possible to display such
components in many tools but not depending on the level of detail.
In our opinion the focus of research in the area of visual SOM-based data
exploration has to shift from the development of slightly advanced visualizations
for cluster identication to the development of interactive user interfaces with
the aim to support users in their exploration task. Although it could be argued
that the visual analytics research area is responsible for such developments we
believe that the development has already begun to head for the direction of visual
analytics with the realization of sophisticated visualizations. The evolution of
Conclusion
Recent developments in the area of visual data exploration with SOMs have focused on the improvement of visualizations for cluster detection or detection of
correlations. However, special purpose tools have been developed which provide
sophisticated interaction techniques optimized for specific tasks, like
labeling images. Although various software tools exist which allow visual exploration of arbitrary data sets, the interaction techniques are at a very basic
stage. We believe that the focus of future research has to shift from the development of further cluster visualizations to the development of sophisticated
interaction techniques for arbitrary data. SOM-based visual data exploration
can be performed intuitively and is therefore especially of interest for non-expert
users from other domains.
References
1. Kohonen, T.: The Self-Organizing Map. Proceedings of the IEEE 78(9), 1464–1480
(1990)
2. Ultsch, A., Siemon, H.P.: Kohonen's Self Organizing Feature Maps for Exploratory
Data Analysis. In: International Neural Networks Conference, pp. 305–308. Kluwer
Academic Press, Paris (1990)
3. Ultsch, A.: Maps for the Visualization of High-Dimensional Data Spaces. In: Workshop on Self-Organizing Maps, pp. 225–230 (2003)
4. Ultsch, A.: U*-Matrix: A Tool to Visualize Clusters in High Dimensional Data.
Technical Report 36, Dept. of Mathematics and Computer Science, University of
Marburg, Germany (2003)
5. Merkl, D., Rauber, A.: Alternative Ways for Cluster Visualization in Self-Organizing Maps. In: Workshop on Self-Organizing Maps, pp. 106–111 (1997)
6. Pampalk, E., Rauber, A., Merkl, D.: Using Smoothed Data Histograms for Cluster
Visualization in Self-Organizing Maps. In: Dorronsoro, J.R. (ed.) ICANN 2002.
LNCS, vol. 2415, pp. 871–876. Springer, Heidelberg (2002)
7. Pölzlbauer, G., Rauber, A., Dittenbach, M.: Advanced visualization techniques for
self-organizing maps with graph-based methods. In: Wang, J., Liao, X.-F., Yi, Z.
(eds.) ISNN 2005. LNCS, vol. 3497, pp. 75–80. Springer, Heidelberg (2005)
8. Pölzlbauer, G., Dittenbach, M., Rauber, A.: A Visualization Technique for Self-Organizing Maps with Vector Fields to Obtain the Cluster Structure at Desired
Levels of Detail. IEEE International Joint Conference on Neural Networks 3, 1558–1563 (2005)
9. Pölzlbauer, G., Dittenbach, M., Rauber, A.: Advanced Visualization of Self-Organizing Maps with Vector Fields. Neural Networks 19(6-7), 911–922 (2006)
10. Tasdemir, K., Merenyi, E.: Exploiting Data Topology in Visualization and Clustering of Self-Organizing Maps. IEEE Transactions on Neural Networks 20(4), 549–562
(2009)
11. Latif, K., Mayer, R.: Sky-Metaphor Visualisation for Self-Organising Maps. J. Universal Computer Science (7th International Conference on Knowledge Management), 400–407 (2007)
12. Vesanto, J.: SOM-based Data Visualization Methods. Intelligent Data Analysis 3,
111–126 (1999)
13. Mayer, R., Aziz, T.A., Rauber, A.: Visualising Class Distribution on Self-Organising Maps. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D.P. (eds.)
ICANN 2007. LNCS, vol. 4669, pp. 359–368. Springer, Heidelberg (2007)
14. Rauber, A.: LabelSOM: on the Labeling of Self-Organizing Maps. In: International
Joint Conference on Neural Networks, vol. 5, pp. 3527–3532 (1999)
15. Neumayer, R., Mayer, R., Pölzlbauer, G., Rauber, A.: The Metro Visualisation of
Component Planes for Self-Organising Maps. In: International Joint Conference on
Neural Networks, pp. 2788–2793 (2007)
16. Dzemyda, G., Kurasova, O.: Comparative Analysis of the Graphical Result Presentation in the SOM Software. Informatica 13(3), 275–286 (2002)
17. Keim, D., Mansmann, F., Schneidewind, J., Ziegler, H.: Challenges in Visual Data
Analysis. In: 10th International Conference on Information Visualization, pp. 9–16
(2006)
18. Heidemann, G., Saalbach, A., Ritter, H.: Semi-Automatic Acquisition and Labelling of Image Data Using SOMs. In: European Symposium on Artificial Neural
Networks, pp. 503–508 (2003)
19. Bekel, H., Heidemann, G., Ritter, H.: Interactive image data labeling using self-organizing maps in an augmented reality scenario. Neural Networks 18(5-6), 566–574 (2005)
20. Moehrmann, J., Bernstein, S., Schlegel, T., Werner, G., Heidemann, G.: Optimizing
the Usability of Interfaces for the Interactive Semi-Automatic Labeling of Large
Image Data Sets. In: HCI International. LNCS, Springer, Heidelberg (to appear,
2011)
21. Schreck, T., Bernard, J., von Landesberger, T., Kohlhammer, J.: Visual Cluster
Analysis of Trajectory Data with Interactive Kohonen Maps. Information Visualization 8, 14–29 (2009)
22. Torkkola, K., Gardner, R.M., Kaysser-Kranich, T., Ma, C.: Self-Organizing Maps
in Mining Gene Expression Data. Information Sciences 139(1-2), 79–96 (2001)
23. Kanaya, S., Kinouchi, M., Abe, T., Kudo, Y., Yamada, Y., Nishi, T., Mori, H.,
Ikemura, T.: Analysis of Codon Usage Diversity of Bacterial Genes with a Self-Organizing Map (SOM): Characterization of Horizontally Transferred Genes with
Emphasis on the E. Coli O157 Genome. Gene 276(1-2), 89–99 (2001)
24. Simple-BL SOM Website, http://kanaya.naist.jp/SOM/
25. Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J.: Self-Organizing Map in
Matlab: the SOM toolbox. In: Matlab DSP Conference, pp. 3540 (1999)
26. SOMVis, TU Wien,
http://www.ifs.tuwien.ac.at/dm/somvis-matlab/index.html
27. Java SOMToolbox, TU Wien, http://www.ifs.tuwien.ac.at/dm/somtoolbox/
28. Peltarion Synapse, http://www.peltarion.com/products/synapse/
29. Viscovery SOMine, http://www.viscovery.net/somine/
30. VisiSOM, http://www.visipoint.fi/visisom.php
1 Introduction
Self-Organizing Maps [1] are artificial neural networks that translate high-dimensional
input data into a low-dimensional representation, usually a 2D planar map. The
benefit of using a 2D visualization is that it is easy to see the relationships among
the cells of the map [2]. By employing a technique such as the U-Matrix [3], the clusters
formed can be spotted and inspected visually.
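To make the U-Matrix idea concrete, the sketch below computes, for every unit of a rectangular 2D map, the mean distance between its weight vector and those of its grid neighbors. This is a simplified variant that assumes 4-connected neighborhoods and a row-major flat list of weights; it is not taken from [3], and the names and toy values are illustrative only.

```python
import math

def u_matrix(weights, rows, cols):
    """Mean distance of each unit's weight vector to its 4-connected grid neighbors.

    Assumes a map with at least two units, stored row by row in a flat list.
    """
    def w(r, c):
        return weights[r * cols + c]

    out = []
    for r in range(rows):
        line = []
        for c in range(cols):
            nbrs = [(r + dr, c + dc)
                    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= r + dr < rows and 0 <= c + dc < cols]
            total = sum(math.dist(w(r, c), w(nr, nc)) for nr, nc in nbrs)
            line.append(total / len(nbrs))
        out.append(line)
    return out

# Toy 1 x 4 map with two well-separated groups of units (hypothetical weights).
weights = [[0.0], [0.1], [5.0], [5.1]]
u = u_matrix(weights, rows=1, cols=4)
```

High values mark cluster borders: here the two middle units, which sit on either side of the gap, receive much larger values than the two outer units.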
The 2D SOM has been used in various types of applications, one of which is the
use of the SOM as a music archive. For example, [4] uses a 2D SOM to organize a
music collection. In visualizing the map, a metaphor of a geographic map is employed
wherein islands represent musical genres. Islands signify the presence of songs while
water signifies the absence of songs. To interact with the system, the user would
need to click around the map. Another application using a 2D SOM is described in
[5], where various hardware interfaces were used to navigate the 2D SOM. The
hardware interfaces included an eye-tracker, a Wiimote, an iPhone and desktop
controllers.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 188–197, 2011.
© Springer-Verlag Berlin Heidelberg 2011
by the training of the corner cubes, then again the training of the core cube, then
finally the labeling of the corner and core cubes. The initial supervised training of the
core cube assigns the general positions of the input files in the map, which would be
refined later. The training of corner cubes then positions the input files in their
respective corner cubes, according to their category labels. The second training of the
core cube (third phase) is designed to fine-tune the weights of the node vectors in the
core cube, since the initial training phase was just meant for a rough representation by
the core cube node vectors of the various interrelationships of the items in the input
dataset.
The training parameters used for the three training phases are quite different, with
the learning rate being high during phase 1 (0.75) and much lower during
phase 3 (0.10), while the number of training cycles is only 10,000 in phase 1, but
50,000 in phase 3. The supervised training of core and corner cubes is essentially like
the usual unsupervised training of regular SOM, except that the best-matching-unit
for a given input element is constrained to be from among the pre-assigned nodes that
correspond to the accompanying category label. As for the adaption of the weights of
node vectors, the learning rule and the Gaussian function for the diminishing effect of
the learning rate within the neighborhood of the best matching unit follow the usual
learning mechanism of self-organizing maps.
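The constrained best-matching-unit search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `constrained_bmu` and `train_step`, the fixed `sigma`, and the toy map are hypothetical, and the update is the standard SOM rule with a Gaussian neighborhood.

```python
import math

def constrained_bmu(weights, x, allowed):
    """Best-matching unit restricted to the nodes pre-assigned to the label of x."""
    return min(allowed, key=lambda n: math.dist(weights[n], x))

def train_step(weights, positions, x, allowed, alpha, sigma):
    """One supervised update: constrained BMU search, then the usual SOM rule."""
    bmu = constrained_bmu(weights, x, allowed)
    for n, w in enumerate(weights):
        d = math.dist(positions[n], positions[bmu])      # distance on the map grid
        h = math.exp(-(d * d) / (2.0 * sigma * sigma))   # Gaussian neighborhood
        weights[n] = [wi + alpha * h * (xi - wi) for wi, xi in zip(w, x)]
    return bmu

# Three nodes on a line in 3D grid coordinates; only nodes 1 and 2 carry the label of x.
positions = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
bmu = train_step(weights, positions, x=[0.0, 0.0], allowed=[1, 2], alpha=0.75, sigma=1.0)
```

Node 0 is the closest node overall, but the constraint selects node 1. Because every node is updated with a distance-dependent factor, nodes outside the allowed set are still pulled toward the input, which is the effect the next paragraph relies on.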
Because of the nature of the learning mechanism of self organizing maps where all
other nodes in the map are updated (but at decreasing learning rates as the distance of
the node to the best-matching unit increases), the core cube also gets trained during
supervised training of the corner (exterior) cubes. Consequently, by the collective
influences of all the corner cubes, the core cube gets to represent the interrelationships of the various data elements of the entire data set.
We tested the use of structured SOMs by using an artificial dataset of animal data,
with each animal represented by binary features. In the animal data set, there are 10 animal
categories, namely amphibians, birds, fish, reptiles, cnidaria, crustaceans, mollusks,
and insects. Mammals and special animals (e.g. bats as mammals that fly, platypus
that has characteristics of mammals and birds, ducks as birds that swim, whales as
mammals that swim in the ocean) were used to do experiments on the core cube.
Following the usual implementation of self-organizing maps, the unknown
category of certain animals can be identified by computing the distance of the input
vector to each of the node vectors of the core/corner cube. The label of the nearest
node (best matching unit) would be the identified class/category of the unknown
animal.
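The identification step amounts to a nearest-prototype lookup, which can be sketched in a few lines. The node vectors, labels, and the function name `classify` below are made up for illustration; the actual binary animal features are not given in the text.

```python
import math

def classify(node_vectors, node_labels, x):
    """Return the label of the best matching unit for input vector x."""
    bmu = min(range(len(node_vectors)), key=lambda n: math.dist(node_vectors[n], x))
    return node_labels[bmu]

# Hypothetical prototypes over three binary features.
node_vectors = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
node_labels = ["bird", "fish", "insect"]
label = classify(node_vectors, node_labels, [1, 0, 0.2])
```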
With the structured SOM set-up, there are nodes in the core cube that straddle
along the boundaries between two or more classes or categories, somewhere mid-way
between two or three corners. These nodes would correspond to input items that have
multiple categories, such as animals that have traits that make them look both
mammal and bird, or both fish and mammal, or perhaps songs that are partly hip-hop,
partly reggae, and maybe partly rock.
[Fig. 3, panels (a) and (b): images not recoverable from this copy]
Figure 3a shows a screenshot of the structured SOM that has been trained using an
animal dataset. We had only loaded the trained core cube with the set of special
animals. Figure 3b is a zoomed-in display of the core cube of Figure 3a, with labels
on the spheres to identify the various special animals and how they are positioned vis-à-vis
the different categories assigned to the eight corner cubes. Ducks, rheas and
grebes are pulled towards the fish corner which is why these birds are positioned a
little away from the birds corner. As for the bat, it is positioned near the birds corner.
With regard to the flying fish, it is positioned between fishes and insects. Since most
of the insects in the data set have the ability to fly, the flying fish is being pulled by
the insects corner (less by the bird corner) since the flying fish has more features that
are shared with insects than with birds as far as the features we used were concerned.
vectors of nearby nodes in the core cube) only when a music file from that particular
genre is selected during training. The labeled training set has music files from all the
8 genres and the sequence of music files for training is completely random.
Interactive 3D Music Organizer (i3DMO) is one instance of the structured SOM
that performs content-based organization of a music collection following the training
and labeling scheme of structured SOMs. It provides a 3D visualization of the map,
with zoom-in and zoom out functions, rotations, probes to the interior of the cube, etc.
The application provides a media player functionality that enables playback of
songs and managing of playlists. The music data set used by the application is
composed of songs with extracted music features. The music features were retrieved
by performing a specialized extraction process that relies on applications using digital
signal processing to extract a song's content.
The feature extraction and selection method is a fairly lengthy and elaborate
process, using various statistical techniques to select pertinent features. In particular,
each music file (song) was divided into 10-second segments and the first and last
segments were discarded. The features were computed for each of the segments and
then the mean and standard deviation of each feature was computed. For a given
feature, all segments whose extracted feature value deviates from the mean by more
than 1 standard deviation were discarded. After further segment filtering, the
remaining valid segments were used to compute the average feature value for each
feature and for each song. Once all these were computed, Weka [8] was used to filter
out the redundant features (i.e. features that were highly correlated). From an initial
list of about 692 possible features culled from the literature, specifically MusicMiner
[9] and jAudio [10], the list was finally trimmed down to fewer than 70. This was the
basis for building a corpus of 1,000 songs, with 10 genres, of 10 albums per genre and
10 songs per album.
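The segment filtering step can be illustrated for a single feature of a single song. The sketch below is one reading of the description, with two assumptions that the text does not settle: outlying segments are discarded per feature rather than for all features at once, and the population standard deviation is used. The function name and sample values are hypothetical, and the "further segment filtering" mentioned above is not modeled.

```python
import statistics

def robust_feature_average(segment_values):
    """Average one feature over a song's segments, ignoring outlying segments.

    Expects one value per 10-second segment, in temporal order (at least three).
    """
    inner = segment_values[1:-1]            # discard the first and last segments
    mean = statistics.fmean(inner)
    std = statistics.pstdev(inner)
    kept = [v for v in inner if abs(v - mean) <= std]  # within 1 standard deviation
    return statistics.fmean(kept)

values = [9.0, 1.0, 1.0, 1.0, 5.0, 9.0]    # hypothetical per-segment feature values
avg = robust_feature_average(values)
```

In this example the segment with value 5.0 lies more than one standard deviation from the mean of the inner segments and is dropped, so the song-level value is the average of the remaining three.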
The map is visualized as a 3x3x3 cube, with a light grey border to establish the
boundaries of the sub-cubes. Inside each sub-cube (whether a corner cube or a core
cube), a node is represented by a sphere if there is a song associated to the node. The
color of the sphere depends on the assigned color of the song's genre. If the songs
assigned to a given node belong to different genres, there will be rectangular
horizontal strips of different colors on the surface of the sphere to denote the other
genres. Figure 4 displays a visualization of the music archive as a 3x3x3 cube. Note
the core cube in the middle with all the songs of all genres, and the corner cubes
loaded with just the songs of their assigned genres (as distinguished by the color).
In Figure 4, there are 10 genres in the music collection. Eight (8) out of the 10
genres are assigned to the corner cubes of the structured SOM. The 2 genres which
are not assigned to a corner cube are metal and pop music genres, which were used for
various experiments on the core cube. It must be stressed here that the genres of the
songs included in the dataset were based on the genre tags from well-known,
authoritative websites that classify songs according to genres.
In the i3DMO application, placing the mouse cursor over a sphere displays a pop-up window which contains the song information. In addition, a song from the list of
songs associated to the node/sphere is randomly selected and a short playback is heard
to allow the user to determine the kind of music associated to the sphere. An external
hardware interface was taken into account in the design of the structured SOM. The
hardware interface should allow a user to position his fingers in his personal space to
interact with the SOM. By pointing at a specific location in the 3D space around the
user, a corresponding node or cluster can be selected. Then, commands can be issued
by performing hand gestures. The primary objective of this interface is to enable a
visually impaired individual to interact with the 3D SOM [11]. With this kind of
interface, a visually impaired user need not see the map, but can just interact with
the space around it, guided by audio cues, a huge part of which would be the
samplings or short playbacks that the user hears when pointing at different positions
(corresponding to spheres) in the 3D space around.
(1)
where dist is the Euclidean distance between two points in 3D Cartesian space, bmu(i)
is the x,y,z position of the best matching unit to input i in the map, corner (k) is the
x,y,z position of the corner assigned to category k, and r is the number of nodes in
each of the x, y, or z dimension of the core cube.
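Since Equation (1) itself is not legible in this copy, the following Python sketch shows one form that is consistent with the symbols defined above and with the stated range of the index (1 at the assigned corner, 0 at the diagonally opposite corner). It assumes node coordinates running from 0 to r-1 in each dimension and normalizes by the body diagonal of the core cube; this normalization is an assumption, not a reconstruction of the original formula.

```python
import math

def attraction_index(bmu_pos, corner_pos, r):
    """Assumed form: 1 - dist(bmu, corner) / (sqrt(3) * (r - 1))."""
    max_dist = math.sqrt(3) * (r - 1)      # body diagonal of an r x r x r grid
    return 1.0 - math.dist(bmu_pos, corner_pos) / max_dist

r = 5                                       # hypothetical core cube size
corner = (0, 0, 0)
at_corner = attraction_index((0, 0, 0), corner, r)
at_opposite = attraction_index((4, 4, 4), corner, r)
at_center = attraction_index((2, 2, 2), corner, r)
```

Under this assumed form, an input mapped to the exact center of the core cube has an index of about 0.5 with respect to every corner.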
Figure 5 depicts the attraction index of a specific music file to each of the eight (8)
corners of the core cube. Unless a music file really sounds as if it is a blend of all
types of genres, we expect music files to be positioned relatively nearer to a specific
corner, or perhaps somewhere in between two corners. In figure 5, the corner with a
red oblong depicts the pre-assigned corner for the genre of the music file.
Fig. 5. The attraction index of a music file vis-à-vis the corners of the core cube (corner labels: country, jazz, disco, blues, reggae, hip-hop, classical, rock)
As defined above, the attraction index att-index(i,k) ranges in value from 0 to 1,
with a value of 1 signifying that the input element i is associated with the very corner
node that is assigned to category k, and a value of 0 signifying that the input element is
associated with the corner directly opposite the correct or desired corner. Table 1 lists
the average attraction indices by music genre, based on 10 independent runs. For each run, 50% of the music files
were randomly chosen to train the music archive, while the remaining 50% were used
to compute the attraction indices after loading them into the core cube.
In the ideal scenario, the average attraction index for a given category k should be
highest for the corner associated with the same category k. These are the average
attraction indices along the diagonal of Table 1, and they are what we refer to as
the fidelity measure. Note from Table 1 that
for all music genres, the highest average attraction index per row (that is, for a given
music genre) is precisely the one on the diagonal. We have conducted various
other experiments, and it is clear that this is not always the case; it depends
on the type of training employed, the training parameters used, and the overall
quality or complexity of the input dataset.
Table 1. Average attraction index per music genre relative to specific music genres
Genre      Blues   Classical  Country  Disco   Hip-hop  Jazz    Reggae  Rock
Blues      0.8271  0.5311     0.3576   0.2298  0.5347   0.5270  0.3657  0.3643
Classical  0.5300  0.8138     0.5184   0.3563  0.3564   0.3710  0.2299  0.5257
Country    0.4375  0.5583     0.7494   0.4887  0.2810   0.5184  0.3564  0.3847
Disco      0.2464  0.3645     0.4795   0.7929  0.3922   0.3391  0.5191  0.5400
Hip-hop    0.5188  0.3618     0.2217   0.3632  0.8423   0.3475  0.5306  0.5427
Jazz       0.5067  0.3836     0.5400   0.3866  0.3535   0.7806  0.5086  0.2632
Reggae     0.4128  0.2823     0.3841   0.5234  0.5401   0.5280  0.7371  0.3861
Rock       0.3437  0.5107     0.3724   0.5486  0.5079   0.2370  0.3654  0.8056
The fidelity measure f(k) is the average attraction index of all music files belonging
to category k vis-à-vis the corner associated with category k. Table 2 gives the mean m
and standard deviation s of the attraction indices for all the music files in the
collection. Table 2 further shows the standard scores, or z-values, for each of the
genres. The z-value is computed as the difference between the fidelity measure of a
given genre g and the mean m of all attraction indices to g for all music files, divided
by the standard deviation s. The standard score measures the positive or negative
deviation from the mean in units of standard deviations.
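The standard score described above can be sketched in a few lines of Python. The numbers are the fidelity measure, mean m, and standard deviation s reported in Table 2; the dictionary and function names are illustrative assumptions. Because the published inputs are themselves rounded, the recomputed z-values can differ from the tabulated ones in the last digit.

```python
# (fidelity measure, mean m, standard deviation s) per genre, from Table 2.
TABLE2 = {
    "Blues": (0.8271, 0.4788, 0.2000),
    "Classical": (0.8138, 0.4767, 0.1911),
    "Country": (0.7494, 0.4526, 0.1800),
    "Disco": (0.7929, 0.4604, 0.2012),
    "Hip-hop": (0.8423, 0.4769, 0.1941),
    "Jazz": (0.7806, 0.4561, 0.1936),
    "Reggae": (0.7371, 0.4512, 0.1780),
    "Rock": (0.8056, 0.4762, 0.1887),
}


def z_value(fidelity, mean, std):
    """Standard score: distance of the fidelity from the mean, in units of std."""
    return (fidelity - mean) / std


z_values = {genre: z_value(*row) for genre, row in TABLE2.items()}
best = max(z_values, key=z_values.get)   # genre with the highest z-value
worst = min(z_values, key=z_values.get)  # genre with the lowest z-value
```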
Table 2. Raw and standard fidelity measure of each music genre
Genre      Fidelity Measure  Mean (m)  Std. dev. (s)  z-value
Blues      0.8271            0.4788    0.2000         1.74
Classical  0.8138            0.4767    0.1911         1.76
Country    0.7494            0.4526    0.1800         1.65
Disco      0.7929            0.4604    0.2012         1.65
Hip-hop    0.8423            0.4769    0.1941         1.88
Jazz       0.7806            0.4561    0.1936         1.68
Reggae     0.7371            0.4512    0.1780         1.61
Rock       0.8056            0.4762    0.1887         1.74
Music genres with high z-values for the fidelity measure have music files that are
closer to the corner of the core cube assigned to their genre than are music
files belonging to other genres. The z-value of the fidelity measure is a more accurate
measure of the true fidelity of the music files belonging to a given genre. In
Table 2, we can see that the disco genre has a slightly higher raw fidelity score than
jazz, but the jazz genre in fact has a higher z-value (standardized fidelity
measure). Of all the music genres, hip-hop yielded the highest fidelity measure, while
reggae had the lowest.
5 Conclusion
A SOM is usually a regular 2D rectangular or hexagonal lattice where nodes that are
spatially close in the 2D map are associated with input elements that are similar in the
input environment. It is, however, feasible to design a SOM as a 3D map, and the
learning algorithm remains virtually the same.
We presented a 3D SOM that is used as a music archive. More important than just
extending the SOM from 2D to 3D, we built a structure into the design of the
3D map that distinguishes between eight (8) corner cubes and
the core cube in the center, with each corner cube assigned a music genre. We
altered the learning algorithm to use a three-step learning phase
followed by a labeling and music loading phase. The training phases are supervised,
and target both the corner cubes and the core cube.
Through the embedded structure of the 3D SOM, we also presented a novel way of
measuring the quality of the resulting trained SOM (in this case, the music archive),
as well as the quality of the different categories/genres of music albums, based on
the attraction index and the fidelity measure of music files vis-à-vis their
respective music genres.
References
1. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (2001)
2. Lagus, K., Honkela, T., Kaski, S., Kohonen, T.: Websom for Textual Data Mining.
Artificial Intelligence Review 13(5-6), 345–364 (1999)
3. Ultsch, A., Siemon, H.: Kohonen's self organizing feature maps for exploratory data
analysis. In: Proceedings of the International Neural Network Conference, Dordrecht,
Netherlands, pp. 305–308 (1990)
4. Pampalk, E., Rauber, A., Merkl, D.: Content-based organization and visualization of
music archives. In: Proceedings of the tenth ACM international conference on
Multimedia, Juan-les-Pins, France, pp. 570–579 (2002)
5. Tzanetakis, G., Benning, M., Ness, S., Minifie, D., Livingston, N.: Assistive music
browsing using self-organizing maps. In: Proceedings of the 2nd international conference
on Pervasive Technologies Related to Assistive Environments. Corfu, Greece (2009)
6. Knees, P., Schedl, M., Pohle, T., Widmer, G.: An innovative three-dimensional user interface for exploring music collections enriched. In: Proceedings of the 14th annual
ACM international conference on Multimedia, Santa Barbara, CA, USA, pp. 17–24 (2006)
1 Introduction
One of the most important issues for informatics studies of virus genomes,
particularly of influenza viruses, is the prediction of genome sequence changes that
will be hazardous. We have developed a novel informatics strategy to predict a
portion of sequence changes of influenza A virus genomes, by focusing on the
pandemic H1N1/09 strains [1] as a model case. The phylogenetic analysis based on
sequence homology searches is a well-established and irreplaceably important
method for studying genomic sequences. However, it inevitably depends on
alignments of sequences, which are potentially error-prone and troublesome, especially
for distantly related sequences. This difficulty becomes increasingly evident as the
number of sequences obtained from a wide range of species, including novel species,
increases dramatically because of the remarkable progress of the high-throughput
*
Corresponding author.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 198–206, 2011.
© Springer-Verlag Berlin Heidelberg 2011
DNA sequencing methods. To address this difficulty and complement sequence
homology searches, we here report an alignment-free clustering method that can
analyze far more than 100,000 sequences simultaneously, on the basis of the Self-Organizing Map (SOM) [2,3]. SOM is known to be a powerful clustering method,
which provides an efficient interpretation of complex data using visualization on one
plane. We have previously developed a modified SOM (batch-learning SOM:
BLSOM), which depends on neither the data-input order nor the initial conditions, for
studying oligonucleotide frequencies in vast numbers of genomic sequences [4,5].
When we constructed BLSOMs for oligonucleotide frequencies in fragment
sequences (e.g., 10 kb) from wide varieties of species, sequences were self-organized
according to species; BLSOMs could recognize and visualize species-specific
characteristics of oligonucleotide composition. Here, we have analyzed influenza A
viruses, including those of the pandemic H1N1/09 [1,6,7], and developed a widely
applicable strategy for predicting directional sequence changes of zoonotic virus
genomes.
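The input to such a map is a vector of oligonucleotide frequencies for each sequence fragment. As a minimal Python sketch (not the BLSOM program itself), the function below counts overlapping k-mers in one fragment and converts them to relative frequencies; the function name, the DNA alphabet ACGT, and the choice to skip windows that contain other characters are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def oligo_frequencies(seq, k=4):
    """Relative frequencies of all 4**k oligonucleotides of length k in seq."""
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing ambiguous bases (e.g. N) are not in kmers and are ignored.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


# Dinucleotide example; tetranucleotides (k=4) give a 256-dimensional vector.
example = oligo_frequencies("ACGTACGT", k=2)
```

Each fragment then contributes one fixed-length vector (16, 64, or 256 values for di-, tri-, or tetranucleotides), independent of any alignment.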
2 Methods
A total of 43,831 virus sequences analyzed in Fig. 1A were obtained from the DDBJ
GIB-V web site (http://gib-v.genes.nig.ac.jp/), and a total of 42,800 sequences derived
from 5,350 influenza A virus strains were obtained from the NCBI Influenza Virus
Resource (http://www.ncbi.nlm.nih.gov/genomes/FLU/).
The BLSOM program was obtained from UNTROD Inc. (y_wada@nagahama-i-bio.ac.jp), which developed the program in collaboration with our group.
Y. Iwasaki et al.
Fig. 1. BLSOMs for virus genome sequences. (a) Tetranucleotide BLSOM (Tetra) for 1-kb
sequences (199,067 sequences) from 43,828 virus genomes. Lattice points containing
sequences from a single phylotype were indicated in a color representing the phylotype. A
major portion of the map was colored, showing that a major portion of sequences was self-organized according to phylotypes. (b) Influenza, Tetra. Lattice points containing sequences
from one host were indicated in a color representing the host: avian strains (1948 strains),
human H1N1/09 strains (167 strains), other human strains (2788 strains), swine strains
(249 strains), equine strains (68 strains), and strains from other hosts (130 strains). Human;
human subtype was specified in a color representing the subtype (H1N1, H1N1/09, H3N2,
H5N1, others). Avian; avian subtype was specified in a color representing the subtype (H5N1,
H5N2, H6N1, H7N2, H9N2, others). A minor human territory representing H1N1/09 was
indicated with an arrow. (c) Influenza, Codon; BLSOM was constructed for synonymous codon
usage in 34,376 genes from 4297 strains, and lattice points were specified as described in b.
Occurrence levels of three codons were indicated with different levels of two colors, pink
(high) and green (low); intermediate levels were achromatic. (d) Influenza, Di; BLSOM was constructed
for dinucleotide composition, and lattice points were specified as described in b. (e)
Retrospective time-series changes for human subtype strains on Codon-BLSOM. H1N1 and
H3N2 strains in the specified time period were shown by these colors, and other human strains
were in gray. A zone for H1N1/09 or equine was additionally marked to help to recognize the
position in Codon-BLSOM in c.
For color picture, refer to http://trna.nagahama-i-bio.ac.jp/WSOM2011/WSOM2011_Fig1.pdf
influenza A virus genes (Fig. 1c). Human (green) and avian (red) territories were
clearly separated from each other. Notably, human H1N1/09 strains (dark green,
arrowed) were again separated from the major human territory and surrounded by
avian and swine (sky blue) territories. The codon-choice pattern of newly invading
viruses should be close to that of the original host viruses, at least for a period
immediately after the invasion. Because viruses depend on many cellular factors,
codon choice will most likely shift during infection cycles among humans towards the
pattern of seasonal human viruses. If so, the direction of sequence changes of
H1N1/09 over time especially in the near future is predictable, so far as judged from
codon usage and possibly from oligonucleotide frequency.
3.4 Codons and Oligonucleotides Diagnostic for Host-Specific Separation
BLSOM provides a powerful ability for visualizing diagnostic codons or
oligonucleotides that contribute to self-organization of sequences according to host.
The frequency of each codon at each lattice point was calculated, sorted according to
the frequency, and represented at different levels in colors [17]. Transitions between
the high (pink) and low (dark green) levels often coincided with host territory
borders, and examples of codons diagnostic for host separation are presented in
Fig. 1c. Across all diagnostic cases, which are listed in Table 1, one
tendency was observed: G- or C-ending codons were more favored in the avian
territory than in the human territory. The G+C% effect was most apparent in two-codon boxes,
which are composed of two synonymous codons. This was also observed for many
codons in four- or six-codon boxes, but there were exceptional cases, such as GCA
and UUG (Table 1), indicating the presence of other constraints.
Table 1. Preferred codons and oligonucleotides in avian or human viruses
Notably, for many diagnostic codons, human H1N1/09 had the avian-type
preference (Fig. 1c and Table 1). In Table 1, to specify the codon preference in
H1N1/09, codons preferred in H1N1/09 by comparison with seasonal human viruses
were indicated in red within the column for codons preferred in avian viruses, and
codons not preferred in H1N1/09 were specified in green within the column for
codons preferred in seasonal human viruses. Adaptation of codon choice to cellular
factors and environments (e.g., host body temperature) may be a process for invading
viruses to establish continuous infection cycles among humans and to increase viral
fitness. We hypothesize the codon choice in H1N1/09 will change towards the pattern
commonly found in seasonal human viruses. Removal of unfavorable codons can be
attained by not only synonymous, but also nonsynonymous changes, and the rate of
Fig. 2. Tetranucleotide BLSOMs for eight genome segments. (a) Gene product name was listed
along with the segment number. Lattice points were marked as described in Fig. 1b, and thus a
zone for human H1N1/09 was in dark green. (b) Examples of subtype of avian virus
segments that were in close proximity to human and/or swine territories were marked in green
or blue for specifying the geographical areas where the avian strains were isolated: H7N2 in the
segment 2 and H5N2 in the segment 4 isolated in New York (NY), H5N2 in the segment 4
isolated in Minnesota (MN), H6N1 in the segment 6 isolated in Taiwan and Hong Kong (HK),
and H9N2 in the segment 6 isolated in Wisconsin (WI) and Hong Kong (HK). Other avian
sequences were in gray and sequences from other hosts were achromatic, but a zone for human
H1N1/09 was additionally marked to help to recognize the position in a.
For color picture, refer to http://trna.nagahama-i-bio.ac.jp/WSOM2011/WSOM2011_Fig2.pdf
and this was true also for Di-, Tri- and Codon-BLSOMs (data not shown). Segment 2
of H1N1/09 (dark green) was in close proximity to the human (green) territory, but
some other segments (e.g., segments 1 and 3) were within the avian (red) territory.
This similarity of the oligonucleotide composition of H1N1/09 with that of human,
swine or avian viruses was consistent with that found with conventional phylogenetic
studies [6,7]. Importantly, more than 5000 sequences could be characterized and
visualized on one map, supporting efficient knowledge discovery.
In Fig. 2b, we noted avian strains that were in close proximity to human and/or
swine territories along with the geographical information of places where the strains
were isolated. Identification of avian- or swine-virus segments whose oligonucleotide
and codon compositions were closely related to those of humans should be valuable
for predicting candidate strains that may cause pandemics. By summarizing
potentially hazardous segments, we can specify avian strains that will come to
resemble human or swine strains with reassortment of only a few segments. This type
of information should be valuable for gaining new perspectives on systematic
Acknowledgements
This work was supported by the Integrated Database Project and Grant-in-Aid for
Scientific Research (C) and for Young Scientists (B) from the Ministry of Education,
Culture, Sports, Science and Technology of Japan. The computation was done in part
with the Earth Simulator of Japan Agency for Marine-Earth Science and Technology.
We wish to thank Dr Kimihito Ito (the Research Center for Zoonosis Control,
Hokkaido University) for valuable suggestions and discussions.
References
1. Centers for Disease Control and Prevention: Swine influenza A (H1N1) infection in two
children-Southern California, March–April 2009. Morb Mortal Wkly Rep. 58, 400–402
(2009)
2. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol.
Cybern. 43, 59–69 (1982)
3. Kohonen, T., Oja, E., Simula, O., Visa, A., Kangas, J.: Engineering applications of the
self-organizing map. Proc. IEEE 84, 1358–1384 (1996)
4. Abe, T., et al.: Informatics for unveiling hidden genome signatures. Genome Res. 13,
693–702 (2003)
5. Abe, T., Sugawara, H., Kinouchi, M., Kanaya, S., Ikemura, T.: Novel phylogenetic studies
of genomic sequence fragments derived from uncultured microbe mixtures in
environmental and clinical samples. DNA Res. 12, 281–290 (2005)
6. Smith, G.J., et al.: Origins and evolutionary genomics of the 2009 swine-origin H1N1
influenza A epidemic. Nature 459, 1122–1125 (2009)
7. Garten, R.J., et al.: Antigenic and genetic characteristics of swine-origin 2009 A(H1N1)
influenza viruses circulating in humans. Science 325, 197–201 (2009)
8. Hirahata, M., et al.: Genome Information Broker for Viruses. Nucl. Acids
Res. 35(Database issue), D339–D342 (2006)
9. Bao, Y.: The influenza virus resource at the National Center for Biotechnology
Information. J. Virol. 82, 596–601 (2008)
10. García-Sastre, A.: Inhibition of interferon-mediated antiviral responses by influenza A
viruses and other negative-strand RNA viruses. Virology 279, 375–384 (2001)
11. Nelson, M.I., Holmes, E.C.: The evolution of epidemic influenza. Nat. Rev. Genet. 8,
196–205 (2007)
12. Alexey, A., Moelling, K.: Dicer is involved in protection against influenza A virus
infection. J. Gen. Virol. 88, 2627–2635 (2007)
13. Ikemura, T.: Correlation between the abundance of Escherichia coli transfer RNAs and the
occurrence of the respective codons in its protein genes. J. Mol. Biol. 146, 1–21 (1981)
14. Ikemura, T.: Codon usage and transfer RNA content in unicellular and multicellular
organisms. Mol. Biol. Evol. 2, 13–34 (1985)
15. Sharp, P.M., Matassi, G.: Codon usage and genome evolution. Curr. Opin. Gen. Dev. 4,
851–860 (1994)
16. Sueoka, N.: Intrastrand parity rules of DNA base composition and usage biases of
synonymous codons. J. Mol. Evol. 40, 318–325 (1995)
17. Kanaya, S., et al.: Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis
on the E. coli O157 genome. Gene 276, 89–99 (2001)
18. Domingo, E., Holland, J.J.: RNA virus mutations and fitness for survival. Annu. Rev.
Microbiol. 51, 151–178 (1997)
Introduction
Aphasia is the partial or complete loss of language function due to brain damage,
most commonly following a stroke. In bilinguals, aphasia can affect one or both
languages, and during rehabilitation and recovery, the two languages can interact
in complex ways. Current research on bilingual aphasia has only begun to inform
us about these interactions. At the same time, a better understanding of language
recovery in bilinguals is badly needed to inform treatment strategies. Decisions
like the choice of a target language for treatment affect the outcome in ways
that are currently unpredictable, and the optimal treatment strategy is thought
to depend on many factors, including how late the second language was learned
and the degree of impairment in either language [26].
The problem of choosing the right treatment approach is of considerable practical importance: Over half the world's population today is bi- or multilingual
[1,6], making bilingual aphasia at least as common as its monolingual counterpart. Moreover, treatment is most effective during a limited time window, and
resources available for treatment are often limited. As the proportion of bilinguals in the world increases, so will the potential benefits of more targeted and
effective treatment.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 207–217, 2011.
© Springer-Verlag Berlin Heidelberg 2011
U. Grasemann et al.
Current clinical research faces considerable difficulties in providing the necessary insight. Too many factors contribute to the outcome of rehabilitation,
including which first and second languages (L1 and L2) the patient speaks, the
second-language age of acquisition (AoA), the relative pre-stroke competencies,
and the relative impairments in both languages. The large number of possible
combinations of these factors, and thus of possible treatment scenarios, makes
it impractical to examine treatment effects clinically in a systematic way.
In this situation, computational modeling can be a useful tool to complement and guide clinical research. Neural network-based models of impairment
and recovery can be used systematically to simulate treatment scenarios and to
predict outcomes. These predictions can then inform clinical research, which in
turn provides data to validate the model.
This paper reports on recent progress in work that follows this approach. A
model of the bilingual human lexicon based on self-organizing maps is trained
and then lesioned in order to model lexical access in bilinguals before and after
the onset of aphasia. The model is matched to, and compared with, human subject data collected from a group of aphasic patients. Additionally, a simulation
of language-based treatment is developed, and is used to investigate a range of
treatment scenarios. The treatment simulation makes testable predictions, and
could ultimately be used to simulate treatment for individual patients, and to
predict the most beneficial treatment strategy in each specific case.
The mental lexicon, i.e. the storage of word forms and their associated meanings, is a major component of language processing. Lexical access is frequently
disturbed in aphasia, and naming impairments are especially common, where
patients have trouble recalling words or naming objects (anomic aphasia). The
mental lexicon of bilinguals is considerably more complex than that of monolinguals, and the way in which multiple language representations can develop,
coexist, and interact in the human brain has an important bearing on our understanding of naming impairment in bilingual aphasia.
Current theoretical models of the bilingual lexicon generally agree that bilingual individuals have a shared semantic (or conceptual) system and that there
are separate lexical representations of the two languages. However, the models
differ on how the lexica interact with the semantic system and with each other.
The concept-mediation model [25] (Fig. 1a) proposes that both the first (L1)
and the second-language lexica directly access concepts. In contrast, the word-association model assumes that second-language words (L2) gain access to concepts only through first-language mediation (Fig. 1b). Empirical evidence [18]
suggests that the word-association model is appropriate for low-proficiency bilinguals and the concept-mediation model for high-proficiency bilinguals. As an explanation, De Groot [7] proposed the mixed model (Fig. 1c), where the lexica of
a bilingual individual are directly connected to each other as well as indirectly
(by way of a shared semantic representation). This model was further revised
Bilingual Aphasia
with asymmetry by Kroll & Stewart [19] (Fig. 1d). The associations from L2 to
L1 are assumed to be stronger than those from L1 to L2, and the links between
the semantic system and L1 are assumed to be stronger than those between the
semantic system and L2.
Fig. 1. Theoretical models of the bilingual lexicon. All four theories assume a shared
semantic system with language-specific representations in L1 and L2. The most recent
theory (d) includes connections between all maps, with connections of varying strength
depending on the relative dominance of the two languages. This theory is used as the
starting point for the computational model.
Modeling Approach
Although the physiological structure and location of the lexicon in the brain are
still open to some debate, converging evidence from imaging, psycholinguistic,
Fig. 2. The DISLEX model is structured after the theoretical model in Fig. 1d. Three
SOMs, one each for semantics, L1, and L2, are linked by associations that enable the
model to translate between semantic and phonetic symbols, simulating lexical access
in bilingual humans.
computational, and lesion studies suggests that the lexicon is laid out as one
or several topographic maps, where concepts are organized according to some
measure of similarity [10,3,28].
Self-organizing maps (SOMs; [16,17]) model such topographical structures,
and are therefore a natural tool to build simulations of the lexicon. SOM models
have been developed to understand e.g. how ambiguity is processed by the lexicon
[22], how lexical processing breaks down in dyslexia [23], and how the lexicon is
acquired during development [21].
The foundation for the bilingual model used in the present work is DISLEX,
a computational neural network model initially designed to understand how
naming and word recognition take place in a single language [22,23], and later
extended to study first language learning [21]. For the purpose of this study,
DISLEX was extended to include a second language [24]. The resulting computational model, shown in Fig. 2, reflects the revised model by Kroll and Stewart
(1994; Fig. 1d): Its three main components are SOMs, one for word meanings,
and one each for the corresponding phonetic symbols in L1 and L2. Each pair of
maps is linked by directional associative connections that enable network activation to flow between maps, allowing the model to translate between alternative
semantic and phonetic representations of a word.
The organization of the three maps and the associations between them are
learned simultaneously. Input symbols are presented to two of the maps at the
same time, resulting in activations on both maps. Each individual map adapts
to the new input using standard SOM training with a Gaussian neighborhood.
Additionally, associative connections between the maps are adapted based on
Hebbian learning, i.e. by strengthening those connections that link active units,
and normalizing all connections of each unit:
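\[
a_{ij} = \frac{a_{ij} + \alpha\,\theta_i \eta_i\,\theta_j \eta_j}{\sum_k \left( a_{ik} + \alpha\,\theta_i \eta_i\,\theta_k \eta_k \right)},
\]
where a_{ij} is the weight of the associative connection from unit i in one map to
unit j in the other map, \eta_i is the activation of unit i, and \alpha is the learning rate. The neighborhood
function \theta_i is the same as for SOM training. As a result of this learning process,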
when a word is presented to the semantic map, the resulting activation is propagated via the associative connections to the phonetic maps, and vice versa. In
this way, DISLEX can model both comprehension and production in both L1
and L2.
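The associative learning step described above (strengthen connections between active units, then normalize all connections of each unit) can be sketched in plain Python as follows. This is an illustrative reading of the update rule, not the DISLEX implementation: the function and argument names are hypothetical, the learning rate default of 0.25 is the value reported for this model, and the weights are assumed to be positive so that the normalization is well defined.

```python
def hebbian_update(assoc, act_src, act_dst, nbh_src, nbh_dst, lr=0.25):
    """Return updated, row-normalized association weights.

    assoc[i][j] -- weight from unit i of the source map to unit j of the
                   destination map (assumed positive)
    act_*       -- activation of each unit in the respective map
    nbh_*       -- neighborhood value of each unit in the respective map
    """
    updated = []
    for i, row in enumerate(assoc):
        src = nbh_src[i] * act_src[i]
        # Hebbian term: grows only where both ends of the link are active.
        new_row = [w + lr * src * nbh_dst[j] * act_dst[j]
                   for j, w in enumerate(row)]
        total = sum(new_row)
        # Normalize the outgoing connections of unit i so they sum to 1.
        updated.append([w / total for w in new_row])
    return updated


# Two source units, three destination units, uniform initial weights.
weights = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
result = hebbian_update(weights,
                        act_src=[1.0, 0.0], act_dst=[0.0, 1.0, 0.0],
                        nbh_src=[1.0, 1.0], nbh_dst=[1.0, 1.0, 1.0])
```

In this toy case only the link between the two active units is strengthened; the inactive source unit keeps its uniform weights, and the normalization makes the other links of the active unit weaker in relative terms.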
Note that the L1 and L2 maps have direct connections between them as well,
which creates a possible alternative path for the flow of activation between the
semantic map (S) and either phonetic map. For example, activation may flow
S→L1 directly, but also S→L2→L1.
Importantly, such indirect flow of activation between maps can potentially
simulate and explain how treatment in one language can benefit the other. For
example, if the lexicon is presented with input symbols for S and L1, those maps
and the connections between them can be adapted using the method described
above. However, in addition, the L2 map is activated indirectly, and that activation can be used to train its associative connections as well. How beneficial
this indirect training is for L2 may depend on several factors, including the
strength and quality of the connections between L1 and L2. The computational
experiments reported below will examine this model of cross-language transfer
in detail.
The input data used for training the model is based on a list of 300 English
nouns gathered for previous studies of naming treatment in aphasia (e.g. [9,14]).
The words were translated into Spanish by a native speaker. Semantic representations are vectors of 261 binary semantic features such as "is a container"
or "can be used as a weapon". These features were encoded by hand, and the
resulting numerical representations were then used to train the semantic map.
Phonetic representations are based on phonetic transcriptions of English and
Spanish words, which were split into spoken syllables, and padded such that the
primary stress lined up for all words. The individual phonemes comprising each
syllable were represented as a set of phonetic features like height and front-ness
for vowels, and place, manner, etc. for consonants [20], similar to the method
used in previous work based on DISLEX [23]. Phonetic representations consisted
of 120 real-valued features for English and 168 for Spanish.
The semantic and phonetic maps of all models were grids of 30x40 neurons.
All learning rates, both for maps and associations, were set to 0.25. The variance
of the Gaussian neighborhood was initially 5, and decreased exponentially with
a half-life of 150 training epochs. Training always lasted 1000 epochs; the number
of randomly selected English and Spanish words trained during each epoch was
controlled by two exposure parameters.
Second-language AoA was simulated by starting training for the L2 phonetic
map and its associative connections as soon as the neighborhood size fell below a
specific threshold. For example, a threshold of 0.7 resulted in training beginning
at epoch 425, and a threshold of 5.0 meant it started at epoch 1.
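The relation between the neighborhood schedule and the simulated AoA can be sketched as follows. The sketch assumes that the variance at epoch t is 5 * 0.5^(t / 150) and that L2 training starts at the first epoch whose variance is strictly below the threshold; the exact epoch indexing is not specified in the text, so the computed onset can differ from the quoted epochs by an epoch or so. Function names are hypothetical.

```python
def neighborhood_variance(epoch, initial=5.0, half_life=150.0):
    """Exponentially decaying variance of the Gaussian neighborhood."""
    return initial * 0.5 ** (epoch / half_life)


def l2_onset_epoch(threshold, initial=5.0, half_life=150.0, max_epochs=1000):
    """First epoch at which the variance falls below the threshold (or None)."""
    for epoch in range(1, max_epochs + 1):
        if neighborhood_variance(epoch, initial, half_life) < threshold:
            return epoch
    return None


early_onset = l2_onset_epoch(5.0)  # L2 trained from the very start
late_onset = l2_onset_epoch(0.7)   # L2 training delayed until late in training
```

A lower threshold therefore corresponds to a later AoA, and by that point the neighborhood is already narrow, which leaves the L2 map less opportunity to organize globally.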
The resulting models generally have well-organized semantic and phonetic
maps. Their naming performance, measured as the percentage of semantic symbols that are translated correctly into their phonetic equivalents, is close to 100%
(98% for English, 97% for Spanish) for a wide range of combinations of AoA and
exposure. However, as expected, for very low exposure and/or very late AoA, the
performance decreases. This is consistent with human language learning, where
performance on second-language naming tests tends to be very good, unless the
AoA is very late, or exposure to L2 is very limited [11]. As an example, Fig.
3 shows a DISLEX system that was trained on a subset of the input data to
make the maps easier to visualize. Semantic and L1 maps reflect semantic and
phonetic similarity well. In contrast, the L2 map is poorly organized due to the
effect of very late acquisition.
Computational Experiments
The method to simulate the effects of different levels of exposure and AoAs
outlined above was then used to create a number of DISLEX models that were
individually tailored to match a group of bilingual patients suffering from aphasia
following a stroke. The first step in creating these models was to train DISLEX
to match the patients' premorbid state, including naming performance in both
Spanish and English, AoA, and exposure data.
Eighteen of the patients were native Spanish speakers, with English AoA
varying from birth to 35 years. One was a native English speaker (AoA 20
years). Premorbid levels of naming performance, AoA, and relative exposure to
Spanish vs. English were collected from all patients, and were used to determine
the way in which each patient model was trained. The available patient data
on language exposure only specified relative exposure (e.g. 30% Spanish, 70%
English); the absolute amount of exposure was therefore adjusted (retaining the
correct ratio) such that the resulting model fit naming performance best.
Fig. 4 shows the language performance of the resulting best-fit models for
each patient. Bars show the models' performance; small triangles are the target
data, i.e. the human pre-stroke English and Spanish performance. In most cases
Bilingual Aphasia
(80%), the model is able to match the premorbid language performance (in
addition to AoA and relative exposure) of patients well. Why DISLEX sometimes
did not achieve a good fit is not clear in all cases. Interestingly, however, at least
in one case (#19), the model identified irregular patient data in this way.
Excluding the patients without a matching DISLEX model, the remaining
16 premorbid models were then used to simulate damage to the lexicon leading
to bilingual aphasia. In order to simulate the brain lesion caused by a stroke,
the models were damaged by adding varying levels of Gaussian noise to the
associative connections between the semantic and phonetic maps.
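The lesioning step can be sketched as follows. This is a minimal illustration, not the DISLEX implementation: the function name `lesion`, the toy weight matrix, and the use of the lesion strength as the standard deviation of the noise are assumptions made for the example.

```python
import random

def lesion(weights, strength, rng):
    """Return a copy of an associative weight matrix with zero-mean
    Gaussian noise of standard deviation `strength` added to every entry."""
    return [[w + rng.gauss(0.0, strength) for w in row] for row in weights]

rng = random.Random(0)  # fixed seed so that the sketch is repeatable

# Hypothetical semantic-to-phonetic associative connections (2 x 3 toy example).
semantic_to_phonetic = [[0.9, 0.1, 0.0],
                        [0.2, 0.7, 0.1]]

damaged = lesion(semantic_to_phonetic, strength=0.4, rng=rng)
intact = lesion(semantic_to_phonetic, strength=0.0, rng=rng)  # no damage
```

Applying different strengths to the English and the Spanish connections would correspond to damaging the two languages separately.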
This model of stroke damage was motivated by several known constraints
on the mechanisms by which strokes cause aphasia; for example, word comprehension is often relatively spared in aphasia, which could not be simulated
in DISLEX using damage to semantic or phonetic maps. Additionally, recent
evidence points to white matter damage in anomic aphasia [12,13].
Fig. 5 shows how increasing levels of noise damage affect the naming performance of the patient models. The bars on the left side of each plot show the
same data as in Fig. 4, i.e. the performance of the undamaged model. Moving
from left to right in each plot, the damage increases. Red and green lines show
the resulting naming performance in English and Spanish respectively. The vertical positions of the triangles pointing left show the patient's post-stroke naming
performance in English and Spanish, i.e. the performance the damaged models
need to match in each case.
By adjusting the amount of damage for English and Spanish separately, each
patient's post-stroke naming performance can always be matched, as shown in
the figure. Interestingly, however, in all but three cases (81%), the patients'
post-stroke performance can be simulated by damaging English and Spanish
connections equally. This is consistent with the type of impairment seen in aphasia patients, which usually, but not always, affects both languages equally. An
interesting prediction of the model is that less proficient languages are more
vulnerable to damage than highly proficient ones. This is clearly visible e.g. in
models #1, 3, and 12.
[Fig. 4 plot: naming performance (0-100) of each patient model; legend: English, Spanish, Human data.]
[Fig. 5 plots: naming performance vs. lesion strength (0.2-0.8), one panel per patient model.]
In the future, these individual models will be used to investigate and predict treatment effects in human patients. As a first step towards this goal,
DISLEX simulations for a range of 64 different treatment scenarios were created,
which differed in L1 (English/Spanish), AoA (early/late), exposure to L1 and
L2 (low/high), damage to L1 and L2 (low/high), and treatment language (English/Spanish). Treatment was simulated by retraining a subset of the original
input words in the treatment language. Associative connections of the untreated
language were also trained, using indirect activation in the way described in
Section 3.
Fig. 6 illustrates the clearest prediction of this model of treatment: If one
language is damaged more than the other, training the less damaged language
benefits the more damaged language, but not vice versa. Surprisingly, all other
factors, including relative proficiency and AoA, have little or no effect on cross-language transfer in the model. Moreover, the current model predicts that treating one language should benefit the other in the majority of training scenarios
independent of treatment language. However, damage in the model was only
applied to semantic-phonetic connections, and damage to other connections,
which may be common in humans, may prevent this in many cases. Future
work will investigate such additional damage, which will lead to further testable
predictions.
Fig. 6. Effects of treatment language on outcome in the model. In the scenario shown,
English is L2 (early AoA), exposure to both languages is low, and English is damaged
more than Spanish. The model predicts that treating the less damaged language (in
this case Spanish) benefits the more damaged language, but not vice versa.
If validated in this way, the model could be used to meaningfully predict the benefits of different treatment approaches, and could ultimately contribute to the
development of optimized treatment strategies tailored to individual patients.
References
1. Bhatia, T.K., Ritchie, W.C. (eds.): The Handbook of Bilingualism. Blackwell Publishing, Malden (2005)
2. Boyle, M., Coelho, C.: Application of semantic feature analysis as a treatment for
aphasic dysnomia. American Journal of Speech-Language Pathology, 94-98 (1995)
3. Caramazza, A., Hillis, A., Leek, E., Miozzo, M.: The organization of lexical knowledge in the brain: Evidence from category- and modality-specific deficits. In:
Hirschfeld, L., Gelman, S. (eds.) Mapping the Mind, Cambridge University Press,
Cambridge (1994)
4. Costa, A., Heij, W., Navarette, E.: The dynamics of bilingual lexical access. Bilingualism: Language and Cognition 9, 137-151 (2006)
5. Costa, A., Miozzo, M., Caramazza, A.: Lexical selection in bilinguals: Do words
in the bilingual's two lexicons compete for selection? Journal of Memory and Language 43, 365-397 (1999)
6. Crystal, D.: English as a global language. Cambridge University Press, Cambridge
(1997)
7. de Groot, A.: Determinants of word translation. Journal of Experimental Psychology: Learning, Memory and Cognition 18, 1001-1018 (1992)
8. Edmonds, L., Kiran, S.: Lexical selection in bilinguals: Do words in the bilingual's
two lexicons compete for selection. Aphasiology 18, 567-579 (2004)
9. Edmonds, L.A., Kiran, S.: Effect of semantic naming treatment on crosslinguistic generalization in bilingual aphasia. Journal of Speech Language and Hearing
Research 49(4), 729-748 (2006)
10. Farah, M., Wallace, M.: Semantically bounded anomia: Implications for the neural
implementation of naming. Neuropsychologia 30, 609-621 (1992)
11. Hernandez, A., Li, P.: Age of acquisition: Its neural and computational mechanisms.
Psychological Bulletin 133, 638-650 (2007)
12. Fridriksson, J., et al.: Impaired speech repetition and left parietal lobe damage.
The Journal of Neuroscience 30(33), 11057-11061 (2010)
13. Anderson, J.M., et al.: Conduction aphasia and the arcuate fasciculus: A reexamination of the Wernicke-Geschwind model. Brain and Language 70, 1-12 (1999)
14. Kiran, S.: Typicality of inanimate category exemplars in aphasia treatment: Further evidence for semantic complexity. Journal of Speech Language and Hearing
Research 51(6), 1550-1568 (2008)
15. Kohnert, K.: Cognitive and cognate-based treatments for bilingual aphasia: a case
study. Brain and Language 91(3), 294-302 (2004)
16. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biological Cybernetics 43, 59-69 (1982)
17. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (2001)
18. Kroll, J., Curley, J.: Lexical memory in novice bilinguals: The role of concepts in
retrieving second language words. In: Practical Aspects of Memory, vol. 2, Wiley,
New York (1988)
19. Kroll, J.F., Stewart, E.: Category interference in translation and picture naming:
Evidence for asymmetric connections between bilingual memory representations.
Journal of Memory and Language 33, 149-174 (1994)
20. Ladefoged, P.: Vowels and consonants: An introduction to the sounds of languages.
Blackwells, Oxford (2001)
21. Li, P., Zhao, X., MacWhinney, B.: Dynamic self-organization and early lexical
development in children. Cognitive Science 31, 581-612 (2007)
22. Miikkulainen, R.: Subsymbolic Natural Language Processing: An Integrated Model
of Scripts, Lexicon, and Memory. MIT Press, Cambridge (1993)
23. Miikkulainen, R.: Dyslexic and category-specific impairments in a self-organizing
feature map model of the lexicon. Brain and Language 59, 334-366 (1997)
24. Miikkulainen, R., Kiran, S.: Modeling the bilingual lexicon of an individual subject.
In: Príncipe, J.C., Miikkulainen, R. (eds.) WSOM 2009. LNCS, vol. 5629, pp. 191-199.
Springer, Heidelberg (2009)
25. Potter, M., So, K., von Eckardt, B., Feldman, L.: Lexical and conceptual representation in beginning and proficient bilinguals. Journal of Verbal Learning and
Verbal Behavior 23, 23-38 (1984)
26. Roberts, P., Kiran, S.: Assessment and treatment of bilingual aphasia and bilingual
anomia. In: Ramos, A. (ed.) Speech and Language Disorders in Bilinguals, pp. 109-131.
Nova Science Publishers, New York (2007)
27. Sasanuma, S., Suk Park, H.: Patterns of language deficits in two Korean-Japanese
bilingual aphasic patients - a clinical report. In: Paradis, M. (ed.) Aspects of bilingual aphasia, pp. 111-122. Pergamon, Oxford (1995)
28. Spitzer, M., Kischka, U., Gückel, F., Bellemann, M.E., Kammer, T., Seyyedi, S.,
Weisbrod, M., Schwartz, A., Brix, G.: Functional magnetic resonance imaging of
category-specific cortical activation: Evidence for semantic maps. Cognitive Brain
Research 6, 309-319 (1998)
Abstract. The Self-Organizing Map (SOM) is widely applied for data clustering and visualization. In this paper, it is used to cluster Gaussians within the
Hidden Markov Model (HMM) of the acoustic model for automatic speech recognition. The distance metric, neuron updating and map initialization of the
SOM are adapted for the clustering of Gaussians. The neurons in the resulting
map act as Gaussian clusters, which are used for Gaussian selection in the recognition phase to speed up the recognizer. Experimental results show that the
recognition accuracy is kept while the decoding time can be reduced by 70%.
Keywords: SOM, Speech recognition, Gaussian clustering, Gaussian selection.
1 Introduction
The Self-Organizing Map (SOM) is a widely applied neural model for data analysis
and especially for clustering. Its algorithms are comprehensively formulated in [1].
The SOM has already attracted the interest of speech recognition researchers as a vector quantization method for classifying speech features, e.g. [20] and [21]. Research that
used the SOM to cluster the Hidden Markov Models (HMM) in speech recognition is
described in [15]. The author directly treated the parameters of the HMMs as the input
features to the SOM. In this paper, the SOM is used for clustering the Gaussians of the acoustic model for automatic speech recognition.
In an HMM-based state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) system [6], there are typically over twenty thousand Gaussians. During
the decoding phase, i.e. at recognition time, all the Gaussians need to be evaluated
given a 39-dimensional observation vector, which is renewed every 10 milliseconds.
Hence the evaluation of Gaussians is one of the most time-consuming tasks for the recognizer. However, given the observed feature vector, only a small subset of Gaussians
dominates the likelihood of the states of the HMM, while the rest are unlikely. To
speed up the decoding, vector quantization based Gaussian selection [7,10,11]
was proposed to exclude unlikely Gaussians from evaluation. Here, cluster Gaussians
are computed and assigned likelihoods by the decoder. Only the member Gaussians
belonging to those likely clusters are evaluated. The clustering method in the previous
research is hard K-Means. SOM, as a soft clustering technique, is closely related to
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 218-227, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Fig. 1. Unified distance matrix [4] of a SOM plotted by SOM Toolbox [18]. The input Gaussians and output neurons are expressed in the spectral domain. Sample spectra of five vowels
are plotted on the map: Given a frame of spectral features, the likelihood of every neuron is
calculated and the one with highest likelihood is marked as the best matching unit of the feature
vector. The size of the markers indicates the hit rate of the spectral features on a particular
neuron.
The rest of the paper is organized as follows: section 2 introduces the SOM training procedure and the adaptations to the algorithms to order Gaussians; section 3
describes our scheme of Gaussian selection using the SOM. Section 4 shows the experimental settings and results. In the last section, conclusions are drawn and future
work is proposed.
\[ g(n, m, t) = \exp\left( -\frac{\lVert r_{c(n)} - r_m \rVert^2}{2\sigma^2(t)} \right) \tag{1} \]
where r_m denotes the coordinates of the mth neuron on the SOM and \sigma(t) is the neighborhood radius, which decays with the training episode t.
The algorithm within each step of the above pseudo code has to be adapted to
handle the self-organization of Gaussians. The adaptations for map initialization
in step 1, the distance metric between the input training data (member Gaussians) and
the output neurons (cluster Gaussians) in step 2.a, and the codebook estimation in step
2.c are covered in Sections 2.1 and 2.2.
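As a small illustration of the neighborhood weights used in the update, the sketch below assumes the standard Gaussian form exp(-||r_c(n) - r_m||^2 / (2 sigma(t)^2)), where r_c(n) is the map position of the best matching neuron of the nth input. The function name and the example coordinates are hypothetical.

```python
import math

def neighborhood(r_bmu, r_m, sigma):
    """Gaussian neighborhood weight between the best matching unit at map
    position r_bmu and the neuron at map position r_m, for radius sigma."""
    d2 = sum((a - b) ** 2 for a, b in zip(r_bmu, r_m))
    return math.exp(-d2 / (2.0 * sigma ** 2))

# The weight is 1 for the best matching unit itself and decays with map distance.
w_self = neighborhood((0.0, 0.0), (0.0, 0.0), sigma=1.0)
w_near = neighborhood((0.0, 0.0), (1.0, 0.0), sigma=1.0)
w_far = neighborhood((0.0, 0.0), (3.0, 0.0), sigma=1.0)
```

Shrinking sigma over the training episodes narrows the set of neurons that are noticeably affected by each input.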
2.1 Distance Metric and Neuron Estimation
The Symmetric Kullback-Leibler Divergence (SKLD) is commonly used to measure
the distance between a particular input member Gaussian and every neuron in step
2.a. If p and q are multivariate Gaussians, their SKLD is:
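\[ \mathrm{SKLD}(p, q) = \frac{1}{2}\,\mathrm{trace}\left\{ (\Sigma_p^{-1} + \Sigma_q^{-1})(\mu_p - \mu_q)(\mu_p - \mu_q)' + \Sigma_p \Sigma_q^{-1} + \Sigma_q \Sigma_p^{-1} - 2I \right\} \tag{2} \]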
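For diagonal covariance matrices, as used in the experiments of Section 4, the trace in the SKLD reduces to a sum over dimensions. The following sketch shows that special case; the function name and the list-based representation of means and variances are choices made for this example only.

```python
def skld_diag(mu_p, var_p, mu_q, var_q):
    """Symmetric Kullback-Leibler divergence between two Gaussians with
    diagonal covariances, given as lists of means and variances."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        diff2 = (mp - mq) ** 2
        # Per-dimension contribution of the trace expression.
        total += (1.0 / vp + 1.0 / vq) * diff2 + vp / vq + vq / vp - 2.0
    return 0.5 * total

same = skld_diag([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
shifted = skld_diag([0.0], [1.0], [1.0], [1.0])
```

The divergence is zero for identical Gaussians and symmetric in its two arguments, which is what allows it to serve as a distance between a member Gaussian and a neuron.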
The estimation of neurons is based on the approach in [9], where a method for
finding the centroid of a set of Gaussians is derived. In their work, the centroid is the
Gaussian that minimizes the sum of SKLD to each of the set members. In our work,
we extend the results of [9] by minimizing the weighted within-cluster mean SKLD
for the mth neuron:
\[ \frac{\sum_{n=1}^{N} g(n, m, t)\,\mathrm{SKLD}(n, m)}{\sum_{n=1}^{N} g(n, m, t)} \tag{3} \]
In step 2.b, equation (3) is minimized given the N member Gaussians and their corresponding weights, g(n,m,t), to update one of the M neurons. Hence the formula to
re-estimate the mean of the mth neuron is:
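\[ \mu_m^t = \left[ \sum_{n=1}^{N} g(n, m, t)\,(\Sigma_n^{-1} + \Sigma_m^{-1}) \right]^{-1} \sum_{n=1}^{N} g(n, m, t)\,(\Sigma_n^{-1} + \Sigma_m^{-1})\,\mu_n \tag{4} \]
The covariance of the mth neuron is obtained from the 2d by 2d matrix
\[ B = \begin{pmatrix} 0 & A \\ C & 0 \end{pmatrix} \tag{5} \]
where
\[ A = \sum_{n=1}^{N} g(n, m, t) \left[ (\mu_n - \mu_m^t)(\mu_n - \mu_m^t)' + \Sigma_n \right] \tag{6} \]
and
\[ C = \sum_{n=1}^{N} g(n, m, t)\,\Sigma_n^{-1} \tag{7} \]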
Suppose the member Gaussians are d-dimensional; then B has d positive and d
symmetrically negative eigenvalues. A 2d by d matrix V is then constructed whose
columns are the d eigenvectors corresponding to the positive eigenvalues. V is partitioned into its upper half U and lower half W:
\[ V = \begin{pmatrix} U \\ W \end{pmatrix} \tag{8} \]
The covariance matrix of the neuron is then
\[ \Sigma_m^t = U W^{-1} \tag{9} \]
It can be seen from equations (4) and (6) that the procedure of estimating the neurons given the weights g(n,m,t) is iterative. The calculation of the mean depends on
the previously calculated covariance and vice versa. The exit criterion is the convergence of the mean SKLD defined in equation (3). The choice of the initial values is introduced in section 2.2.
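The alternating mean and covariance updates can be illustrated in the scalar case. The sketch below is a simplification worked out for this example, not the procedure of [9]: for one-dimensional Gaussians the eigenvector construction collapses to the closed form variance = sqrt(A / C), which follows from setting the derivative of the weighted SKLD with respect to the variance to zero. The function name, the initialization by weighted averages, and the fixed iteration count are assumptions.

```python
import math

def weighted_centroid_1d(means, variances, weights, iterations=20):
    """Iteratively estimate the mean and variance of the 1-D Gaussian that
    minimizes the weighted sum of SKLDs to the given member Gaussians."""
    total = sum(weights)
    mu = sum(g * m for g, m in zip(weights, means)) / total
    var = sum(g * v for g, v in zip(weights, variances)) / total
    for _ in range(iterations):
        # Mean update: weights depend on the current neuron variance.
        coef = [g * (1.0 / v + 1.0 / var) for g, v in zip(weights, variances)]
        mu = sum(c * m for c, m in zip(coef, means)) / sum(coef)
        # Variance update: scalar counterparts of the matrices A and C.
        a = sum(g * ((m - mu) ** 2 + v) for g, m, v in zip(weights, means, variances))
        c = sum(g / v for g, v in zip(weights, variances))
        var = math.sqrt(a / c)
    return mu, var

mu_same, var_same = weighted_centroid_1d([1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 1.0])
mu_pair, var_pair = weighted_centroid_1d([-1.0, 1.0], [1.0, 1.0], [1.0, 1.0])
```

For identical members the centroid is the member itself; for two unit-variance members at -1 and 1 the centroid sits at 0 with a variance larger than 1, since it also has to cover the spread of the means.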
2.2 Map Initialization
The purpose of step 1 in the pseudo code is to obtain faster or even better ordering
convergence. A global Gaussian (or a single-neuron map) is calculated by averaging
the entire set of member Gaussians. Then the mean and covariance matrix of the
global Gaussian are updated using equations (2) to (9) for several iterations until the
mean SKLD in equation (3) has converged. Principal Component Analysis (PCA) is
applied to the covariance matrix to find the first two principal eigenvectors e1, e2 and
eigenvalues λ1, λ2. The square roots of the two principal eigenvalues are used to determine the height h and width l, whose product is the size of the code book, M. Then
the map is spanned linearly along the directions of the two principal components, i.e.
the means of the (x,y)th neurons are initialized as (rx·l/λ2)e2 + (ry·h/λ1)e1, where rx
and ry are the hexagonal coordinates on the map. The initial values of the covariance
matrices are simply assigned the average values over all member Gaussians'
covariance matrices.
\[ \sum_{k=1}^{K} p(G_k \mid O) > 0.95 \sum_{m=1}^{M \cdot 0.2} p(G_m \mid O), \quad \text{where } p(G_k \mid O) \ge p(G_{k+1} \mid O) \tag{10} \]
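[Fig. 2 diagram: the feature vector o is evaluated against the M neurons, each with likelihood f_m(o) = (2\pi)^{-d/2} |\Sigma_m|^{-1/2} \exp(-\frac{1}{2}(o - \mu_m)' \Sigma_m^{-1} (o - \mu_m)). The resulting 0/1 neuron selection list is combined with the mapping table between the M neurons and the N member Gaussians (entries marked X) to decide which member Gaussians are evaluated.]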
Fig. 2. Decision on the selection of a member Gaussian. During decoding, whether a member
Gaussian is selected is determined by the neuron-member Gaussian mapping table and the
neuron selection list (1 indicates that the corresponding Gaussian is selected). The mapping
table determines to which neurons the member Gaussian belongs. The recognizer then checks
the mapping table and will evaluate the member Gaussian only if any of those neurons is selected.
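The decision described in the caption of Fig. 2 can be sketched as a table lookup. The data layout below (a list of neuron indices per member Gaussian and a 0/1 list per neuron) and all names are assumptions made for illustration; the recognizer's actual data structures are not described in the text.

```python
def members_to_evaluate(mapping, neuron_selected):
    """Return the indices of the member Gaussians that must be evaluated.

    mapping[n] lists the neurons that member Gaussian n belongs to;
    neuron_selected[m] is 1 if neuron m was selected for the current frame."""
    return [n for n, neurons in enumerate(mapping)
            if any(neuron_selected[m] for m in neurons)]

# Toy example with 3 member Gaussians and 4 neurons.
mapping = [[0, 1], [2], [1, 3]]
neuron_selected = [0, 1, 0, 0]
selected_members = members_to_evaluate(mapping, neuron_selected)
```

In this toy case only neuron 1 is selected, so members 0 and 2 are evaluated and member 1 is skipped.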
4 Experiments
Speech recognition experiments were conducted on the Aurora4 [12] large vocabulary
database, which is derived from the WSJ0 Wall Street Journal 5k-word dictation task.
The test set is the clean-condition subset containing 330 utterances from 8 different
speakers.
4.1 Experiment Settings
The acoustic model is trained using the clean-condition training set, which contains
7138 utterances from 83 speakers, equivalent to 14 hours of speech data. All
recordings are made with the close-talking microphone and no noise is added. The
speech spectra and their first- and second-order derivatives are transformed into 39-dimensional MIDA (Mutual Information based Discriminant Analysis [16]) features and
further decorrelated so that the model can use diagonal covariance Gaussians.
There are 4091 tied states, or senones, and they share 21087 Gaussians. A bigram
language model for a 5k-word closed vocabulary is provided by Lincoln Laboratory.
The decoding is done with a time-synchronous beam search algorithm; details
can be found in [17]. The recognizer was run on a PC with a Dual Core
AMD Opteron Processor 280 clocked at 2.4 GHz. Only one core was
used for testing.
4.2 The SOM
The two-dimensional hexagonal SOM is trained using SOM Toolbox [18] with the
21087 MIDA diagonal covariance Gaussians as input. The map size is 26 by 20, i.e.
there are 520 neuron Gaussians in the map. The covariance matrices of the output neurons
are constrained to be diagonal as well. A rough training phase with only 6 iterations
of ordering is carried out first to protect the map from topological defects [5],
followed by a fine ordering phase with 24 iterations.
Although the convergence of the standard SOM has not been strictly proven for more than one dimension [5], the mean SKLD was observed to decrease monotonically
during the ordering process in the experiment.
4.3 Experiments Using Other Approaches
Two additional approaches, namely the K-Means and a HMM-based method, are
implemented to compare with the SOM.
K-Means uses the following cost function:
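\[ f_{K\text{-}Means} = \sum_{n=1}^{N} \left[ \sum_{m=1}^{M} w_{nm}\,\mathrm{SKLD}(n, m) + \gamma \sum_{m=1}^{M} w_{nm} \log \frac{1}{w_{nm}} \right] \tag{11} \]
subject to
\[ \sum_{m=1}^{M} w_{nm} = 1 \tag{12} \]
Here the softness of the clustering is controlled by \gamma. The number of clusters M is 520.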
An alternative approach to acquiring clusters is to train them using the data. A compact HMM containing 520 Gaussians is trained using the same training data as the full
HMM containing the member Gaussians. It shares the same structure of tied states as
the full model. These 520 Gaussians are used as the cluster Gaussians. An extra
Viterbi segmentation pass after the training is carried out to set up the association
table between the member Gaussians and the cluster Gaussians: each speech frame is
assigned to an HMM state, and the association count between the dominating member Gaussian of that HMM state and the dominating cluster Gaussian is then incremented by 1. This association table is used as the cluster-member Gaussian mapping
table after a proper truncation, i.e. per member Gaussian keeping only the cluster
Gaussians with the highest association count (3.6 cluster Gaussians on average).
4.4 Experimental Results
Table 1 shows the experimental results of Gaussian selection using the different approaches to Gaussian clustering, compared with the baseline system in which
none of the Gaussians are pruned during decoding. The percentage of calculated
Gaussians is the ratio between the number of calculated Gaussians (including both
neurons and selected member Gaussians) and 21087. The baseline is a 2.2× realtime
system. We achieve a 0.67× realtime system using the SOM; thus 70% of the recognition
time is saved while the Word Error Rate (WER) is even lower than that of the baseline. The
K-Means approach yields a lower WER than the SOM, but calculates more Gaussians.
The single mapping SOM, where only one neuron Gaussian is kept per member
Gaussian in the mapping table, loses accuracy while saving 1.2 ms of CPU time per
frame, and is therefore not preferable. The HMM-based method does not improve the performance
and is slower than the SOM. The associated indexing based approach to Gaussian
selection of the recognizer [17], called Fast Removal of Gaussians (FRoG) [14], was
also tested. It calculated only about 5% of the member Gaussians, but required
1.2× realtime.
The Gaussian selection systems based on the SOM, K-Means and the data-driven approach also help to reduce the beam search time because they remove unlikely Gaussians which dominate the unlikely states; hence the confusion among the different
search paths is reduced.
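The realtime factors quoted above follow from the per-frame CPU times in Table 1 and the 10 ms frame period mentioned in the introduction. The short calculation below reproduces them; the function and variable names are chosen for this example.

```python
FRAME_PERIOD_MS = 10.0  # a new observation vector arrives every 10 ms

def realtime_factor(gaussian_ms, beam_ms):
    """CPU time spent per frame divided by the duration of a frame."""
    return (gaussian_ms + beam_ms) / FRAME_PERIOD_MS

som = realtime_factor(1.8, 4.9)        # SOM row of Table 1
baseline = realtime_factor(15.3, 6.7)  # no Gaussian selection
saving = 1.0 - som / baseline          # fraction of decoding time saved
```

With these inputs the SOM system runs at about 0.67 times realtime against 2.2 for the baseline, a saving of roughly 70%.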
Table 1. Word Error Rates and CPU time of SOM Gaussian selection on Aurora 4

                                 CPU Time (ms/frame)
Method                  WER      Gaussian Calculation   Beam Search   %Gaussian calculated
SOM                     6.76%    1.8                    4.9           7%
Single Mapping SOM      7.02%    1.4                    4.1           4%
K-Means                 6.65%    2.4                    5.1           11%
Data driven             6.82%    2.2                    5.2           9%
FRoG                    6.82%    (12 ms per frame in total)           5%
No Gaussian Selection   6.87%    15.3                   6.7           100%
References
1. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Berlin (2001)
2. Rauber, A., Merkl, D., Dittenbach, M.: The Growing Hierarchical Self-Organizing Map:
Exploratory Analysis of High-Dimensional Data. IEEE Transactions on Neural Networks 13, 1331-1341 (2002)
3. Sueli, A.M., Joab, O.L.: Comparing SOM neural network with Fuzzy c-means, K-means
and traditional hierarchical clustering algorithms. European Journal of Operational Research, 1742-1759 (2006)
4. Ultsch, A., Siemon, H.P.: Kohonen's self organizing feature maps for exploratory data
analysis. In: Proceeding of International Neural Network Conference, Dordrecht, Netherlands, pp. 305-308 (1990)
5. Van Hulle, M.M.: Faithful Representations and Topographic Maps: From Distortion- to Information-Based Self-Organization. John Wiley & Sons, New York (2000)
6. Huang, X., Hon, H.W., Reddy, R.: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice-Hall, NJ (2001)
7. Bocchieri, E.: Vector quantization for efficient computation of continuous density likelihoods. In: Proceeding of ICASSP, vol. 2, pp. 692-695 (1993)
8. Bocchieri, E., Mak, B.K.-W.: Subspace Distribution Clustering Hidden Markov Model.
IEEE Transactions on Speech and Audio Processing 9, 264-275 (2001)
9. Myrvoll, T.A., Soong, F.K.: Optimal Clustering of Multivariate Normal Distributions Using
Divergence and Its Application to HMM Adaptation. In: Proceeding of ICASSP, pp. 552-555 (2003)
10. Watanabe, T., Shinoda, K., Takagi, K., Iso, K.-I.: High Speed Speech Recognition Using
Tree-Structured Probability Density Function. In: Proceeding of ICASSP, vol. 1, pp. 556-559 (1995)
11. Shinoda, K., Lee, C.-H.: A structural Bayes approach to speaker adaptation. IEEE Transactions on Speech and Audio Processing 9, 276-287 (2000)
12. Parihar, N., Picone, J.: An Analysis of the Aurora Large Vocabulary Evaluation. In: Proceeding of Eurospeech, pp. 337-340 (2003)
13. Fritsch, J., Rogina, I.: The Bucket Box Intersection (BBI) Algorithm for Fast Approximate
Evaluation of Diagonal Mixture Gaussians. In: Proceeding of ICASSP, Atlanta, vol. 2, pp. 273-276 (1996)
14. Demuynck, K.: Extracting, Modeling and Combining Information in Speech Recognition.
PhD thesis, K.U.Leuven, ESAT (2001)
15. Du, X.-P., He, P.-L.: The clustering solution of speech recognition models with SOM. In:
Wang, J., Yi, Z., Żurada, J.M., Lu, B.-L., Yin, H. (eds.) ISNN 2006. LNCS, vol. 3972, pp. 150-157.
Springer, Heidelberg (2006)
16. Duchateau, J., Demuynck, K., Van Compernolle, D., Wambacq, P.: Class definition in discriminant feature analysis. In: Proceeding of European Conference on Speech Communication and Technology, Aalborg, Denmark, pp. 1621-1624 (2001)
17. SPRAAK: Speech Processing, Recognition and Automatic Annotation Kit,
http://www.spraak.org/
18. SOM Toolbox, http://www.cis.hut.fi/somtoolbox
19. Wang, X., Xue, L., Yang, D., Han, Z.: Speech Visualization based on Robust Self-organizing Map (RSOM) for the Hearing Impaired. In: Proceeding of International Conference on BioMedical Engineering and Informatics, pp. 506-509 (2008)
20. Sarma, M.P., Sarma, K.K.: Speech Recognition of Assamese Numerals Using Combinations of LPC - Features and Heterogenous ANNs. In: Proceeding of International Conference on BioMedical Engineering and Informatics, pp. 506-509 (2008)
21. Souza Júnior, A.H., Barreto, G.A., Varela, A.T.: A speech recognition system for
embedded applications using the SOM and TS-SOM networks. In: Mwasiagi, J.I. (org.)
Self Organizing Maps: Applications and Novel Algorithm Design, pp. 97-108. IN-TECH
publishing, Vienna, Austria (2010)
Introduction
Supervised machine learning tasks require large amounts of training data. Especially in Natural Language Processing (NLP) it is common that people annotate
data to create a training data set. Although optimized annotation tools exist, annotation
is still expensive and time consuming. In this paper we present an approach to
speed up the annotation process by using self-organizing maps (SOMs) [1].
Besides data annotation, computational linguists require tools which support
the understanding of the feature space of the underlying model and data. SOM-based visualizations help the experts to explore the feature space and to understand how the features influence the machine learning methods. The analysis
motivates the design of new features which may be better suited to separate the
linguistic data.
There are many different data sets which require a speed up of the annotation process and many research areas where exploration of the feature space is
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 228-237, 2011.
© Springer-Verlag Berlin Heidelberg 2011
A. Burkovski et al.
2.1 Data
The data we are working with are plain texts. Plain texts contain entities or
objects (e.g. persons, locations, organizations, things, etc.), which are called noun
phrases. Simplifying slightly, a noun phrase is a group of words in a sentence that
can be replaced by a noun, another noun phrase, or a pronoun. For example, "the
house" or "my small house that my father built when I was a kid" can be replaced
by the pronoun "it". Coreference means that the two noun phrases of a pair can
replace one another without changing the meaning of a sentence. The two noun
phrases are disreferent, the opposite of coreferent, if replacing one by the other changes
the meaning of a sentence. In the domain of coreference resolution such a pair
of two noun phrases is called a link. A link has a label that contains information
about its co-/disreference status.
In some cases coreference is simple to determine. It is easy to determine that
the expressions ([Michael] Jordan) and (Jordan) in the example in Figure 1
are coreferent, because one is a substring of the other. Intuitively, two different
names (like Jordan and Wizards) are likely not coreferent. For human readers the
coreference status of (the Washington Wizards) with the previously mentioned
entity (The Wizards) is easy to resolve. However, if the first noun phrase
consisted only of (The Washington team), we (as human readers) would need to
know that the Wizards team is located in Washington. Otherwise, the text would have to
contain that information elsewhere. To resolve a pronoun like (he) we also need
the context to decide that it refers to (Jordan).
(The Wizards) may not want ([Michael] Jordan), but the head of the expansion team says (Jordan) could run operations there, if (he) wants to. [. . . ] After
playing the last two seasons with (the Washington Wizards), Jordan said he
expected to return to the front office.
Fig. 1. A slightly modified example for coreference from the Ontonotes corpus [6]
Features
Computational linguists use feature extraction methods [9] to process each link
and create a high-dimensional feature vector. We extracted many linguistic features inspired by Ng and others [10,11,12]. From the extracted features we construct a feature vector for every link. These feature vectors are used as input
for the training of the SOM. In the following we describe only a subset of the
features we use for training.
[Fig. 2 panels: component planes (a) to (f), the last being (f) Semantic class; color scale: weight value from low to high.]
Fig. 2. Component planes for selected features on the graph-based U-matrix visualization. Component planes help the user to identify clusters of links with desired features,
like the pronoun feature of a noun phrase in a link (2a and 2b). The user also utilizes
component planes to identify strong coreference features (high values in figures 2c and
2d) and strong disreference features (low values in figures 2e and 2f). The size of the
nodes represents the number of feature vectors assigned to that node.
The Head Match feature uses the head word of a noun phrase. The head is
calculated as the last word in the first noun cluster in the noun phrase. Alternatively, it is the last word of the noun phrase if it does not contain a noun cluster
[13]. The feature checks if the head words of both noun phrases are identical.
Wordnet [14] is a database for word senses and semantic relations between
them. The Wordnet Distance feature is the normalized symmetric difference
distance [15] of hypernym sets (semantic relations) for both noun phrases, also
known as the Jaccard coefficient. All hypernyms of the head words of both noun
phrases are retrieved for the calculation. This value represents how the two noun
phrases relate semantically.
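A minimal sketch of the normalized symmetric difference between two hypernym sets follows. The hypernym sets are invented for illustration and are not taken from Wordnet; the function name and the convention of returning 0 for two empty sets are assumptions.

```python
def jaccard_distance(a, b):
    """Normalized symmetric difference of two sets: the number of elements
    found in exactly one set divided by the number found in either set."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a ^ b) / len(union)

# Hypothetical hypernym sets for the head words of two noun phrases.
hypernyms_first = {"entity", "organism", "person"}
hypernyms_second = {"entity", "organism", "animal"}
distance = jaccard_distance(hypernyms_first, hypernyms_second)
```

The value is 0 for identical hypernym sets and 1 for disjoint ones, so small values indicate semantically related head words.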
The Grammatical Number and Semantic Class features check if the two noun
phrases in a link agree in grammatical number and semantic class respectively. A
disagreement is a strong indicator for disreference. The semantic class in our data
consists of seven categories: Person, Organization, Location, Facility, Vehicle,
Weapon and Geo-Political Entity.
The pronoun features return whether one of the noun phrases of a link is
a pronoun and its grammatical number. In Figure 2 we show only two such
features: first noun phrase is a singular pronoun and second noun phrase is a
plural pronoun.
Interactive Visualization
The SOM provides multiple visualizations from which the user is able to interpret
the data distribution. We developed an interactive user interface which utilizes
visualizations based on the well-known unified distance matrix (U-matrix) [16].
The U-matrix is intended as an intuitive representation of distances between
nodes, with high U-matrix values indicating large distances between neighboring
nodes in feature space. A common approach to visualize the U-matrix is to
display cells for both nodes and edges. In contrast, our visualization treats the
U-matrix as a graph. The color of an edge represents the Euclidean distance between
the neighboring nodes according to the map topology. The nodes represent the
actual map units in the topology grid, and their size is used to display the number
of feature vectors assigned to each node.
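The graph view of the U-matrix can be sketched as a set of edges labeled with codebook distances. The example below assumes a rectangular grid with 4-neighborhood for simplicity; the topology of the actual maps is not specified in this section, and the function name and toy codebook are hypothetical.

```python
import math

def umatrix_edges(codebook):
    """Map each pair of grid-adjacent nodes to the Euclidean distance
    between their codebook vectors (rectangular grid, 4-neighborhood)."""
    rows, cols = len(codebook), len(codebook[0])
    edges = {}
    for r in range(rows):
        for c in range(cols):
            # Only look right and down so that every edge is stored once.
            for r2, c2 in ((r, c + 1), (r + 1, c)):
                if r2 < rows and c2 < cols:
                    edges[((r, c), (r2, c2))] = math.dist(codebook[r][c], codebook[r2][c2])
    return edges

# 2 x 2 toy map: the two columns hold very different codebook vectors.
codebook = [[[0.0, 0.0], [3.0, 4.0]],
            [[0.0, 0.0], [3.0, 4.0]]]
edges = umatrix_edges(codebook)
```

An interface could map the edge values to colors and scale each node by its hit count; here the horizontal edges have length 5 and the vertical ones length 0, which would show up as a cluster border between the two columns.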
In our work we focus on the component planes of the U-Matrix [17]. Component planes color nodes based on weight values of nodes in a single codebook
vector component. Component planes allow the visualization of the influence of
a single feature on the cluster formation. Through component planes users get a
fast overview of which features dominate which regions of the SOM. In Figure 2
high influence of a feature for the projection of data vectors is displayed in red,
low influence in blue.
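Extracting a component plane amounts to reading one component from every codebook vector. The sketch below additionally rescales the values to the range 0 to 1 so that they could be passed to a blue-to-red color map; the rescaling, the function name and the toy codebook are assumptions made for illustration.

```python
def component_plane(codebook, k):
    """Return component k of every codebook vector, rescaled to [0, 1]."""
    values = [[node[k] for node in row] for row in codebook]
    flat = [v for row in values for v in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[0.0 for _ in row] for row in values]  # constant component
    return [[(v - low) / (high - low) for v in row] for row in values]

# 2 x 2 toy map with two features per codebook vector.
codebook = [[[0.0, 5.0], [1.0, 5.0]],
            [[0.5, 5.0], [0.25, 5.0]]]
plane_first = component_plane(codebook, 0)
plane_second = component_plane(codebook, 1)
```

A feature whose plane is constant across the map, like the second one here, has no influence on where data vectors are projected.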
In addition to the U-matrix-based SOM visualizations, we also experimented
with a number of alternatives: the P-matrix (a probability density estimation in
a map unit), the U*-matrix (a combination of U-matrix and P-matrix), distribution of weight influence (a bar or pie chart for a map unit's weight values, similar
to component planes), and PCA projection of the map. We did not find any
Fig. 3. Using the component plane for the head match feature (Figure 2c), annotators can select
the influential nodes and inspect or annotate the contents of the node in a text-based
visualization
Applications
There are only a few software tools for coreference annotation. These tools use
only a text-based visualization for coreference information. The best known general purpose tool for computational linguists is the GATE framework [18]. The
coreference annotation module of GATE [19] provides a link visualization based
on plain text. Another tool for coreference annotation is MMAX2 [20]. MMAX2
utilizes plain text and HTML for visualization. MMAX2 visualizes coreference
information with edges between the noun phrases in the text.
Both tools can utilize extensions for coreference resolution ([19] for GATE and
BART [21] for MMAX2). These extensions use a coreference resolution model,
which is based on supervised machine learning methods and supports users in
the annotation process by highlighting confidence values for coreferences.
In contrast to the GATE framework and BART+MMAX2 annotation tools,
we rely on human pattern recognition of coreferences without using supervised
machine learning methods. We use interactive SOMs, which visualize feature
space of the coreference data and allow a systematic annotation of the data. Such
A. Burkovski et al.
Interactive feature space exploration via the component planes enables computational linguists to judge quickly how well the map has clustered the data.
4.2 Feature Engineering
The feature space exploration gives a good insight into how well the features
are suited for the SOM. Computational linguists can identify clusters of nodes
where the separation of the data is not clear. With annotated data, the nodes can
be color-coded according to the proportion of coreferent links they contain, as
mentioned above. The nodes also can display the number of co- and disreferent
data.
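The per-node statistics behind such a color coding could be computed as follows. This is a sketch under assumptions: each labeled link is given as a node index plus a boolean label, and the function name `node_label_stats` is hypothetical.

```python
from collections import defaultdict

def node_label_stats(assignments, labels):
    """Count coreferent and disreferent links per node and their proportion.

    assignments: node index of each annotated link.
    labels: True for a coreferent link, False for a disreferent one.
    """
    counts = defaultdict(lambda: [0, 0])  # node -> [coreferent, disreferent]
    for node, is_coreferent in zip(assignments, labels):
        counts[node][0 if is_coreferent else 1] += 1
    return {
        node: {"coreferent": c, "disreferent": d, "proportion": c / (c + d)}
        for node, (c, d) in counts.items()
    }

stats = node_label_stats([0, 0, 0, 1, 2, 2],
                         [True, True, False, False, True, False])
```

Nodes whose proportion lies near 0.5 are the mixed nodes; values near 0 or 1 indicate clean disreferent or coreferent clusters.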
In some regions expert users may find mixed nodes which contain coreferent
links, but have some disreferent links as well. Using component planes, they
are able to inspect such difficult cases. Mixed nodes indicate that new features should
be developed to better separate coreferent and disreferent links. Computational
linguists can inspect these nodes and view the corresponding noun phrases and
surrounding text. This allows the experts to understand what the noun phrases
have in common and why they were assigned to the same node. E.g. Figures
2a and 2b show clusters where the expert will find links where one of the noun
phrases is a pronoun. Such clusters have high weight values in the corresponding
component plane. Pronouns are often difficult to resolve because of the lack of
advanced features. The data in the nodes can inspire computational linguists to
design new and better features.
Additionally, our tool allows a recalculation of a new SOM with links in
nodes selected by expert users. The computational linguists can also change the
features for the links and use a subset of available features or add new features.
The new SOM calculation sometimes results in a different, better ordering of
the links.
4.3 Annotation
The annotators use the U-Matrix, component planes and additional text-based
visualizations (Figure 3) for links to label the data. The component planes reveal good clusters of only coreferent (Figures 2c and 2d) and only disreferent (Figures
2e and 2f) links. For new, unlabeled data, annotators can apply the same SOM
and inspect the previous clusters with the new data. After learning where to
find good coreferent and good disreferent clusters, annotators are able to annotate the links in these clusters efficiently. The SOM and component planes
allow a systematic approach to annotation. From the component planes, annotators can identify regions of nodes with links which have the same or similar
properties. E.g. the combination of the component planes in Figures 2a and 2b
shows regions where one of the noun phrases is a pronoun. The annotator can
then use these nodes to focus on the coreference resolution of pronouns. The
most valuable links for computational linguists are links which are difficult to
resolve. Annotators may learn from computational linguists and feature engineers (as discussed previously) which regions contain nodes with coreferent and
disreferent links. Annotators are then able to label the links in those regions with
high precision, thus creating high-quality annotations of difficult links.
Conclusion
References
1. Kohonen, T.: The Self-Organizing Map. Proceedings of the IEEE 78(9), 1464-1480
(1990)
2. Kaski, S., Honkela, T., Lagus, K., Kohonen, T.: WEBSOM - Self-Organizing Maps
of Document Collections. Neurocomputing 21, 101-117 (1997)
3. Li, P., Farkas, I., MacWhinney, B.: Early lexical development in a self-organizing
neural network. Neural Networks 17(8-9), 1345-1362 (2004)
4. Heidemann, G., Saalbach, A., Ritter, H.: Semi-Automatic Acquisition and Labelling of Image Data Using SOMs. In: European Symposium on Artificial Neural
Networks, pp. 503-508 (2003)
5. Moehrmann, J., Bernstein, S., Schlegel, T., Werner, G., Heidemann, G.: Optimizing
the Usability of Interfaces for the Interactive Semi-Automatic Labeling of Large
Image Data Sets. In: HCI International, Springer, Heidelberg (to appear, 2011)
6. Pradhan, S.S., Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., Weischedel, R.:
OntoNotes: A Unified Relational Semantic Representation. In: International Conference on Semantic Computing, pp. 517-526 (2007)
7. Elango, P.: Coreference Resolution: A Survey. In: Technical Report, University of
Wisconsin Madison (2005)
8. Clark, J., Gonzàles-Brenes, J.: Coreference: Current Trends and Future Directions.
In: Technical Report, Language and Statistics II Literature Review (2008)
9. Kobdani, H., Schütze, H., Burkovski, A., Kessler, W., Heidemann, G.: Relational
feature engineering of natural language processing. In: Proceedings Association for
Computational Linguistics International Conference on Information and Knowledge Management, pp. 1705-1708 (2010)
10. Ng, V., Cardie, C.: Improving Machine Learning Approaches to Coreference Resolution. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 104-111 (2002)
11. Ng, V.: Unsupervised models for coreference resolution. In: Proceedings of the
Conference on Empirical Methods in Natural Language Processing 2008, pp. 640-649 (2008)
12. Kobdani, H., Schütze, H.: SUCRE: A Modular System for Coreference Resolution.
In: Proceedings of the SemEval 2010, pp. 92-95 (2010)
13. Tjong Kim Sang, E.F.: Memory-Based Shallow Parsing. The Journal of Machine
Learning Research, 559-594 (2002)
14. Fellbaum, C.: WordNet: An Electronic Lexical Database. In: Bradford Books
(1998)
15. Yianilos, P.N.: Normalized Forms for Two Common Metrics. In: Report 91-082-9027-1, NEC Research Institute (1991)
16. Ultsch, A., Siemon, H.P.: Kohonen's Self Organizing Feature Maps for Exploratory
Data Analysis. In: International Neural Networks Conference, pp. 305-308. Kluwer
Academic Press, Paris (1990)
17. Vesanto, J.: SOM-based Data Visualization Methods. Intelligent Data Analysis 3,
111-126 (1999)
18. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: A framework
and graphical development environment for robust NLP tools and applications. In:
Proceedings of the 40th Anniversary Meeting of the Association for Computational
Linguistics (2002)
19. Dimitrov, M.: Light-weight Approach to Coreference Resolution for Named Entities
in Text. In: Master's thesis, University of Sofia (2002)
20. Müller, C., Strube, M.: Multi-level annotation of linguistic data with MMAX2.
In: Corpus Technology and Language Pedagogy: New Resources, New Tools, New
Methods, pp. 197-214 (2006)
21. Versley, Y., Ponzetto, S.P., Poesio, M., Eidelman, V., Jern, A., Smith, J., Yang, X.,
Moschitti, A.: BART: a modular toolkit for coreference resolution. In: Proceedings
of the 46th Annual Meeting of the Association for Computational Linguistics on
Human Language Technologies, pp. 9-12 (2008)
Introduction
SOM Framework
We employ the Java SOMToolbox framework¹, developed at the Vienna University of Technology. Besides the standard SOM learning algorithm, the framework
includes several implementations of and modifications to the basic algorithm, such
as the Growing Grid or the Growing Hierarchical Self-Organising Map (GHSOM). The core of the framework is an application that supports the user in an
interactive, exploratory analysis of the data organised by the map training process. This application allows for zooming, panning and selection of single nodes
and regions on the map.
To facilitate the visual discovery of structures in the data, such as clusters,
approximately 15 visualisations are provided. The visualisations
utilised in the experiments later in this paper are now described briefly.
The unified distance matrix, or U-Matrix [9], is one of the earliest and most
popular visualisations of the SOM. It aims at visualising cluster boundaries in the
SOM grid. It is calculated as the input-space distance between the model
vectors of adjacent map units. These distances are subsequently displayed on the
1 http://www.ifs.tuwien.ac.at/dm/somtoolbox/
As the SOM does not generate a partition of the map into separate clusters,
a clustering of the units is applied to identify the regions in the map computationally. Hierarchical algorithms are advantageous here, as they result in a hierarchy of
clusters the user can browse through, allowing different levels of granularity; the
framework provides Ward's linkage [2] and several other linkage methods.
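To make the clustering step concrete, the following is a minimal sketch of agglomerative clustering with Ward's criterion applied to the model vectors. It is a plain cubic-time illustration and not the SOMToolbox implementation; the name `ward_clusters` and the flat cut at `k` clusters are assumptions. At each step it merges the pair of clusters whose union increases the within-cluster sum of squares the least.

```python
def ward_clusters(vectors, k):
    """Agglomerate vectors with Ward's criterion until k clusters remain.

    Returns one cluster label per input vector.
    """
    clusters = [{"members": [i], "centroid": list(v)} for i, v in enumerate(vectors)]

    def merge_cost(a, b):
        na, nb = len(a["members"]), len(b["members"])
        d2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
        return na * nb / (na + nb) * d2  # increase in within-cluster sum of squares

    while len(clusters) > k:
        pairs = [(merge_cost(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        a, b = clusters[i], clusters[j]
        na, nb = len(a["members"]), len(b["members"])
        merged = {
            "members": a["members"] + b["members"],
            "centroid": [(na * x + nb * y) / (na + nb)
                         for x, y in zip(a["centroid"], b["centroid"])],
        }
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)

    labels = [None] * len(vectors)
    for label, cluster in enumerate(clusters):
        for member in cluster["members"]:
            labels[member] = label
    return labels

labels = ward_clusters([[0.0], [0.1], [5.0], [5.2], [9.0]], 2)
```

Recording the sequence of merges instead of stopping at `k` would give the full hierarchy that a user can browse at different levels of granularity.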
Having clusters or regions identified, the framework also provides labelling of
these entire regions. Making use of the properties of the hierarchical clustering,
we can also display two or more different levels of labels, some being more global,
some being more local.
Even though labelling the map regions assists the user in quickly getting a
coarse overview of the topics, labels can still be ambiguous or may not convey
enough information. Thus, the framework also employs Automatic Text Summarisation methods to provide a summary of the contents of the documents of
interest, allowing the user to get a deeper insight into the content. The summarisation can be applied to single documents, documents from a certain node, a
cluster, or a user-selected set of nodes or documents. Different summarisation algorithms are provided; the user can also specify the desired length of the
summary.
Collection
The WikiLeaks diplomatic cable collection² is composed of United States embassy cables, allegedly the largest set of confidential documents ever to be released into the public domain. The cables date from the 1960s up until February
2010, and contain confidential communications between 274 embassies in countries throughout the world and the State Department in Washington DC.
The cables are released successively; thus, a subset of 3,319 documents is currently available. The subset contains cables originating from 165 different
sources (embassies, consulates and other representations), and covers mostly the
last few years. Details on release year and origin of the dataset are given in
Table 1.
It can be noted that a rather large portion of approximately 12% of the
cables were issued by the embassy in Tripoli. A large number of documents also
originate from Brazil (10.4%, including the cables from the consulates in São
Paulo and Rio de Janeiro), and Iceland (8.6%). Cables from countries where the USA is
involved in military actions, such as Afghanistan or Iraq, have not yet been published
in large quantities, thus distinguishing this collection from the Afghan and
Iraq war diaries published earlier by WikiLeaks.
2 http://wikileaks.ch/cablegate.html
Table 1. Release year and origin of the documents (row labels missing)
Documents per release year: 6, 6, 11, 24, 100, 167, 292, 378, 684, 1270, 434
Documents per origin: 406, 351, 290, 202, 158, 146, 122, 93, 81, 77, 75, 59, 58
To obtain a numeric representation of the document collection for our experiments, we used a bag-of-words indexing approach [8]. From the resulting list of
65,000 tokens, the features for the input vectors were selected according to their
document frequency, skipping stop-words, as well as too frequent (in more than
50% of the documents) and too infrequent (in less than 1% of the documents)
terms. This resulted in a feature vector of approximately 5,500 dimensions for
each document, which formed the basis of the maps subsequently trained. The
values of the vector are computed using a standard tf × idf weighting
scheme [8], which assigns high weights to terms which appear often in a certain
document (high tf value), and infrequently in the rest of the document collection
(high idf value), i.e. words that are specific to that document.
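The indexing steps above can be sketched as follows. This is an illustrative reconstruction, not the code used in the experiments: it assumes already tokenised documents, raw term counts for tf, and idf = log(N / df); the paper only states that a standard tf × idf scheme was used. The name `tfidf_vectors` and its parameters are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs, stop_words=(), min_df=0.01, max_df=0.5):
    """docs: list of token lists. Returns (vocabulary, tf-idf vectors).

    Terms are kept if they are not stop words and occur in at least
    min_df and at most max_df of the documents (as fractions).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(term for term, freq in df.items()
                   if term not in stop_words and min_df <= freq / n <= max_df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log inverse document frequency.
        vectors.append([tf[term] * math.log(n / df[term]) for term in vocab])
    return vocab, vectors

docs = [["oil", "gas", "the"], ["oil", "bank", "the"],
        ["bank", "crisis", "the"], ["visa", "the", "visa"]]
vocab, vectors = tfidf_vectors(docs, stop_words={"the"})
```

In this toy collection "visa" occurs twice in one document and nowhere else, so it receives the highest weight, which matches the intuition of a term that is specific to its document.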
Experimental Analysis
We trained a map of the size of 35 × 26 nodes, i.e. a total of 910 nodes for
the 3,319 text documents. Due to the uncertain legal situation of the WikiLeaks
documents, we have to refrain from publishing any quotes from the cables, or
other details, in this paper.
After inspection of the initial map, it became obvious that the map was dominantly organised along the origin of documents. The reason is that most cables
describe events in the country the embassies are located in, thus the names of
such countries are over-represented. Thus, to obtain a more topic-oriented map, we decided to remove the most frequent country names from the
feature vector. While this step influences the content of cables that might talk
about foreign countries, this side-effect seems acceptable.
The U-Matrix visualisation of this map is depicted in Figure 1. However, on
this data set, only a few local boundaries become apparent. The existence of
smaller, interconnected clusters is also confirmed by the Smoothed Data Histograms visualisation, which shows density in the map, in Figure 2(a). Figure 2(b) shows
the Vector Fields visualisation, where the arrows point towards local cluster
centres. These clusters overlap very well with a clustering of the weight (model)
vectors of the map, with the clustering for 40 target clusters being superimposed
in the same illustration. It can be observed that especially the centre area does
not seem to have a clear cluster centre.
Fig. 2. Smoothed Data Histograms (a) and Vector Fields (b) visualisations
The Thematic Classmap visualisation depicted in Figure 3 shows the distribution of the origin of the cables. It can be observed that the SOM manages to
separate the classes very well, especially on the edges of the map. Overlapping
areas are mainly found in the centre of the map, which has previously been identified as an area without a clear cluster centre, and in the upper-left corner. It is
often those areas, where the external classification scheme contradicts the topical
similarity, which are the most interesting to uncover unexpected relations.
Figure 4 finally shows the Cablegate map with 40 clusters, each of which has two
labels assigned, using the LabelSOM method described in Section 2. The display
of labels on regions helps to quickly get an overview of the contents of the map,
and where to find them. We will describe some of the regions in detail now.
Fig. 4. Clustering of the map, with 40 clusters and two topic labels each
The upper-left corner of the map prominently features diplomatic cables discussing nuclear programs, both of Iran and North Korea, and related issues, such
as sanctions and the role of the International Atomic Energy Agency (IAEA).
As this is a topic which involves international diplomacy on a large scale,
the sources mentioning the topic are manifold, ranging from the Secretary of
State and embassies of countries involved in the UN proceedings to cables from
the UN representation in Vienna, seat of the IAEA. Topics also dealt with in
this area of the map are weapons and the military in general.
The cluster on the central upper edge of the map features reports on the
Russian-Georgian war in 2008, and other topics related to Russia. The neighbouring cluster, holding messages mostly about energy such as oil and gas, also
features Russian politics and Russian companies, as well as cables from other
countries, such as Venezuela, Nigeria, and Libya. The cluster right next to it,
in the top-right corner, then deals with further topics concerning the North African country ("gol" stands for government of Libya). One topic is, for example,
the diplomatic crisis between the country and Switzerland, which resulted in
Switzerland refusing Schengen visas.
To the left of this, towards the centre of the map, are two clusters with reports
on Iceland, one of them identified by the names of the former prime minister
Geir Haarde and the former minister for foreign affairs, Ingibjörg Gísladóttir,
who had to step down from office due to the financial crisis hitting Iceland in
2008. The other cluster deals with reports on the Icelandic banks, which suffered
intensely from the crisis.
In the lower-centre of the map, a large area is dedicated to topics regarding
Brazil. These deal with ethanol and other biofuels, of which Brazil is a major
producer. Other topics include the defense sector (Nelson Jobim serves as the
Minister of Defense). On the left, two clusters deal with other South American
issues, namely Bolivian politics and the crisis between Colombia and Venezuela,
reported by cables from both countries. Another topic in that region is the
Venezuelan president Hugo Chavez.
Towards the left, certain documents talking about Afghanistan are located.
Several of them deal with drugs, while others talk about the involvement of the
UK and Spain in the war. Just above that, cables report on Taliban activity,
and the situation in Pakistan, as well as cables from India about the attacks
in Mumbai, which are linked to terrorists in Pakistan. The region right of that,
towards the centre of the map, generally gathers cables from many different
sources, all talking about terrorism and criminal activities, without a major
topic dominating.
Another interesting arrangement of documents can be found in the left-centre
area, which features the previously mentioned documents from Iran and, close
to them, documents from Sweden. Many of the documents in the cluster about Sweden deal
with the Swedish stance towards the sanctions against Iran due to the nuclear
programme of the latter.
Conclusions
In this paper, we presented a case study for analysing text documents with Self-organising Maps. We employed a framework that provides extensive support for
visualisations that uncover structures in the data, and other methods which help
to quickly communicate the contents of the collection as a whole, and of certain
parts in particular, such as labelling on cluster level. With such an analysis tool,
the user is able to rapidly get an overview of the interesting areas on the map, and gets access to the collection. This approach clearly exceeds the means
available on the WikiLeaks website, which comprise category-based browsing,
but lack means to communicate the topics of the collection and to explore the
collection by its content.
As the collection of cables is growing on a daily basis, an online version of
the map is available at http://www.ifs.tuwien.ac.at/dm/ and is regularly
updated.
References
1. Dittenbach, M., Rauber, A., Merkl, D.: Business, Culture, Politics, and Sports - How
to Find Your Way through a Bulk of News? In: Mayr, H.C., Lazanský, J., Quirchmayr, G., Vogel, P. (eds.) DEXA 2001. LNCS, vol. 2113, pp. 200-220. Springer,
Heidelberg (2001)
2. Ward Jr., J.H.: Hierarchical grouping to optimize an objective function. Journal of
the American Statistical Association 58(301), 236-244 (1963)
3. Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Paatero, V., Saarela, A.: Organization of a massive document collection. IEEE Transactions on Neural Networks,
Special Issue on Neural Networks for Data Mining and Knowledge Discovery 11(3),
574-585 (2000)
4. Mayer, R., Aziz, T.A., Rauber, A.: Visualising Class Distribution on Self-organising
Maps. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D.P. (eds.) ICANN
2007. LNCS, vol. 4669, pp. 359-368. Springer, Heidelberg (2007)
5. Pampalk, E., Rauber, A., Merkl, D.: Using Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps. In: Dorronsoro, J.R. (ed.) ICANN 2002.
LNCS, vol. 2415, pp. 871-876. Springer, Heidelberg (2002)
6. Pölzlbauer, G., Dittenbach, M., Rauber, A.: Advanced visualization of self-organizing maps with vector fields. Neural Networks 19(6-7), 911-922 (2006)
7. Rauber, A., Merkl, D.: Automatic labeling of Self-Organizing Maps for Information
Retrieval. Journal of Systems Research and Inf. Systems (JSRIS) 10(10), 23-45
(2001)
8. Salton, G.: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Longman Publishing Co., Inc.,
Amsterdam (1989)
9. Ultsch, A., Siemon, H.P.: Kohonen's Self-Organizing Feature Maps for Exploratory
Data Analysis. In: Proceedings of the International Neural Network Conference
(INNC 1990), pp. 305-308. Kluwer Academic Publishers, Dordrecht (1990)
Introduction
In this article, we present intermediate results from a project called Media Map
in which the use of the Self-Organizing Map (SOM) as an interface to a multifaceted academic library collection is demonstrated. First, we discuss the background for this work and introduce several related projects. We continue
by presenting the data and methods used and showing the experimental results,
with the emphasis on describing the basic concept and providing information
on the overall system. We do not, however, aim to evaluate each of the subcomponents systematically. This will constitute a future task which includes, for
instance, a quantitative analysis of the performance of applying machine translation in content vector creation (see e.g. [1]) as well as qualitative usability and
quantitative performance evaluations (see e.g. [14,19]).
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 247-256, 2011.
© Springer-Verlag Berlin Heidelberg 2011
1.1
T. Honkela et al.
The basic alternatives for information retrieval are (1) searching using keywords
or key documents, (2) exploration of the document collection supported by organizing the documents in some manner, and (3) filtering. The keyword search
systems can be automated rather easily whereas document collections have traditionally been organized manually. The organization is traditionally based on
some (hierarchical) classification scheme, and each document is usually assigned
manually to one class.
In the WEBSOM method (see e.g. [2,5,8]), the Self-Organizing Map algorithm [6] is used to map documents onto a two-dimensional grid so that related
documents appear close to each other. The WEBSOM automates the process
of organizing a document collection according to the contents. It does not only
classify the documents, but also creates the classification system based on the
overall statistics of the document collection.
The PicSOM content-based visual analysis framework¹ (see e.g. [11,12,17]) is
based on using relevance feedback with multiple parallel SOMs. It has been
developed and used for various types of visual analysis, including image and
video retrieval, video segmentation and summarization. It has also served as the
implementation platform for the experiments described in this paper.
1.2 Our Approach
In this article, we present a method that follows the basic WEBSOM approach
for creating document maps with the following three main novel developments.
First, we present a method for creating maps of multilingual document collections in which the documents with similar semantic contents are mapped close
to each other regardless of their language. In our experiment, we have English
and Finnish documents. Second, the map calculation is conducted with a Tree-Structured Self-Organizing Map (TS-SOM) algorithm [9]. Third, we have developed a design interface for the specific purpose of retrieval and exploration of a
database of three different types of entities (people, projects, and publications)
in the area of design, media and artistic research.
Our objectives have been three-fold:
- to provide a map as an overview of the contents of an academic research
database,
- to design an attractive and informative visualization of the map, and
- to facilitate information retrieval from the database regardless of the language
used in the documents.
1.3 Related Work
The SOM is widely used as a data mining and visualization method for complex
numerical data sets. Application areas include, for instance, process control,
economic analysis, and diagnostics in industry and medicine (see e.g. [18]).
The SOM has also been used to visualize the views of candidates in municipal
elections [4], or the items provided by museum visitors [15]. A variant has been
developed in which the shape of the SOM is modified so that it coincides with
some well-known shape like the country of Austria [16].
1 http://www.cis.hut.fi/picsom
The WEBSOM method for text mining and visualization has been used for
various kinds of document collections including conference articles [13], patent
abstracts [8], and competence descriptions [3].
2.1 Data
ReseDa² is the public web-based research database of the Aalto University School
of Art and Design. It is designed to support the school's research, assist the
administration of research activities and give them wider visibility. In general,
ReseDa provides information on the school's research activities, its expertise,
and artistic activities related to art, design, and media.
Table 1 details what kinds of data fields are contained in each of ReseDa's
three record types relevant for our experiments. In practice, we started by collecting data on a total of 94 projects described by abstracts in either English or
Finnish. From the project data we then extracted the identifiers of all involved
persons, resulting in a set of 101 people. Starting from these people, we finally
collected their publications whose abstracts were available in ReseDa in either
English or Finnish.
The last type of entities involved in our studies are units (such as departments
and institutions) with which the projects and publications are associated. In
the current data, there were seven units that had more than ten projects and
publications. While the data retrieval was one directional, i.e., from projects
to people and from people to publications, and from those to the units, we also
maintained the reverse mappings in the opposite directions as depicted with solid
and dashed lines in Fig. 1. The quantities of the collected data are summarized
in Table 2.
Table 1. ReseDa database record types and their contents used in the experiments
Publications    Projects        People
publ-id         proj-id         person-id
publ-title      proj-title      person-name
publ-abstract   proj-abstract   person-publs
publ-people     proj-people     ...
...             ...
2 http://reseda.taik.fi
[Fig. 1: people, projects, publications, units]
Table 2. Quantities of the collected data
                English   translated
publications    293       9
projects        66        28
people          101       n/a
2.2
In our pilot data set, the smallest number of words in the publications is 14 and
the largest 711. The average number of words is 99.9 with a standard deviation of
90.0. The corresponding numbers for the projects are 27, 867, 117.4 and 156.4.
This indicates that the data is skewed, i.e., for many publications there is only
a short description.
In order to generate a list of relevant terms, a frequency count of all unigrams, bigrams and trigrams was calculated and sorted in decreasing order
of frequency. Altogether 4934 + 23063 + 32919 = 60916 term candidates were
available. Among these, the words and phrases appearing at least 5 times were
considered. Finally, in the manual selection 268 single words such as "adaptive",
"advertising", and "aesthetic", 66 bigrams like "augmented reality", "cultural
heritage", "design process", and 16 trigrams including "digital cultural heritage",
"location based information", and "research based design" were included in the
terminology. These 350 terms were used in the encoding of the 497 text documents into document vectors.
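The automatic part of this term extraction, i.e. counting n-grams and keeping those that occur at least 5 times, could look roughly like the sketch below. The tokenisation, the decision not to let n-grams cross document boundaries, and the name `ngram_candidates` are assumptions; the subsequent manual selection of the 350 terms is not modelled.

```python
from collections import Counter

def ngram_candidates(documents, max_n=3, min_count=5):
    """Count all 1..max_n-grams and return those seen at least min_count times.

    documents: list of token lists. The result is sorted by decreasing
    frequency, with ties broken alphabetically.
    """
    counts = Counter()
    for tokens in documents:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    frequent = [(term, c) for term, c in counts.items() if c >= min_count]
    return sorted(frequent, key=lambda item: (-item[1], item[0]))

docs = [["digital", "cultural", "heritage"]] * 5 + [["design", "process"]] * 4
candidates = ngram_candidates(docs)
terms = [term for term, _ in candidates]
```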
The average number of terms in the project descriptions was 21.6 and for
publications 16.6. The persons were represented as a concatenation of the publications and projects in which they have been involved. Therefore, the average
number of terms for persons, i.e., 44.1, is considerably higher than either of the
other two. Unlike persons, each of whom is represented with one text document
obtained by concatenation, the departments and other units are represented as
collections of their associated projects and publications.
Google Translate³ was used in the translation. The average number of terms found in the translations was 10.6, while the average number of words per translation was 77.5, i.e. lower than
for publications or projects. Only a small number (34) of Finnish words were not
translated. Among them, 4 words were misspelled, and the remaining 30 were typically inflectional word forms of rare or newly invented words or compounds such
as "palvelukehityskin" (even the service development), "julistemaalareiden" (of
the poster painters), "innovaatiokoneistosta" (from the innovation machinery)
or "kunnollisuudesta" (from decency).
3 http://translate.google.com
2.3
In creating the document maps from the content vectors, we used the hierarchical, Tree-Structured Self-Organizing Map algorithm [9] that is extensively used
in the PicSOM content-based image retrieval system [10,11]. The hierarchical
structure of TS-SOM drastically reduces the complexity of training large SOMs,
thus enabling scalability of the approach into much larger document collections.
The computational savings follow from the fact that the algorithm can exploit
the hierarchy in finding the best-matching map unit for an input vector.
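The idea behind the hierarchical search can be illustrated with a simplified one-dimensional sketch. It is not the TS-SOM implementation of [9]: here every unit on level l is assumed to have exactly two children on level l+1 (indices 2i and 2i+1), and on each level only the children of the previous winner and of its immediate neighbors are compared with the input. The names `tree_bmu` and `sq_dist` are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def tree_bmu(levels, x):
    """Search the best-matching unit level by level.

    levels[l] is the codebook of level l (assumed layout: unit i has
    children 2i and 2i+1 on the next level). Returns the winning index
    on the last level and the number of distance evaluations used.
    """
    candidates = list(range(len(levels[0])))
    evaluations = 0
    best = 0
    for depth, codebook in enumerate(levels):
        best = min(candidates, key=lambda i: sq_dist(codebook[i], x))
        evaluations += len(candidates)
        if depth + 1 < len(levels):
            size = len(levels[depth + 1])
            # Restrict the next search to children of the winner and its neighbors.
            parents = [p for p in (best - 1, best, best + 1) if 0 <= p < len(codebook)]
            candidates = [c for p in parents for c in (2 * p, 2 * p + 1) if c < size]
    return best, evaluations

# Toy hierarchy: level l has 2**(l+1) units evenly spaced on [0, 1].
levels = [[[(i + 0.5) / 2 ** (l + 1)] for i in range(2 ** (l + 1))] for l in range(7)]
best, evaluations = tree_bmu(levels, [0.8])
exhaustive = min(range(len(levels[-1])), key=lambda i: sq_dist(levels[-1][i], [0.8]))
```

Because at most six candidates are compared per level, the number of distance evaluations grows with the depth of the tree rather than with the number of units on the bottom level, which is the source of the savings mentioned above. In this toy setting the restricted search finds the same unit as the exhaustive one; in general it is an approximation.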
Document Maps
In the following, we describe different kinds of maps produced in the Media Map
project. We also present the basic interface design and some design questions
when a number of people, projects, publications and organizational units are
projected on a map.
3.1 Term Distributions
Class Distributions
The aim of the Media Map project has been to place the researchers and their
publications and projects on a map in a way that the topology reflects the
similarity of the content of the activities. A secondary aim has been to study
how well such a mapping also maintains the characteristics of the associated
research units. These two questions are addressed in the following.
Figure 3 displays how one researcher's publications (n = 20) and projects
(n = 7) are mapped on the document map. The locations of the documents
have been indicated with impulses that have then been low-pass filtered in order
to amplify the visibility of the spatial topology of the data. The most significant
U-matrix [20] distances are illustrated with horizontal and vertical bars. In this
researcher's case, Fig. 3 shows that his projects and publications occupy separate
but closely situated map areas, and our method maps the researcher himself in
a map location close to both areas.
Fig. 2. Occurrences of the words "design" (top left), "media" (top right), "art" (bottom left)
and "learning" (bottom right) on the document map
Fig. 3. Distributions of one person's projects (top left) and publications (top right) on
the document map. Also the person's own location (bottom left) and the union of the
previous three distributions (bottom right) are shown.
The distributions of publications and projects associated with four research
units are shown in Fig. 4. It can be seen that the activities of the units appear
mostly in non-overlapping map areas, but that the units' distributions are not
unimodal. The Media department has the largest number of publications and
projects and this is reflected in that unit's relatively large area. Comparing
this figure with the two previous ones, some observations can be made. First, as
the names of the research units match quite accurately with the most common
terms in Fig. 2, also the term and activity distributions are pairwise somewhat
similar. Second, the activities of the researcher in Fig. 3 seem to fall inside the
activities of the Media department, and this could be expected as the researcher
actually is a staff member of that unit.
Fig. 4. Distributions of four units' activities on the document map. The units are
Design (top left), Media (top right), Art (bottom left) and Visual culture (bottom
right).
3.3
Figure 5 shows an example of the planned map interface designed specifically for
the Media Map project. Similarly to Figs. 3 and 4, the location of each person is
indicated in this figure as a specific point on the map, whereas the departments
and other research units occupy larger areas on the map. For the
persons, an icon is used, and other icons exist for publications and projects. As
can be seen, the areas of the research units have been planned to be coded with
colors that can be overlaid without losing clarity. The figure shows that there
are a slider and control arrows on the left-hand side of the map for zooming
and panning of the map.
Even though Fig. 5 has still been created by a designer (H.T.), we already
have the necessary mechanisms for creating similar illustrations automatically.
An open issue is still how zooming into a specific area of the document map could
gradually reveal more and more details of the data. In this manner, the highest-level view would show only information on the research unit level, the mid-level
views could show objects and activities on the person level, and only the most
detailed view would display data on particular projects and publications. Also,
different kinds of connections between the entities would be illustrated on the
map on different zoom levels.
Central research and development themes are related to the multilinguality, versatility and interlinked structure of the document collection. There are documents in English and in Finnish concerning projects, publications and people
in the database. We have presented a methodology to create document maps in
this kind of basic setting and a map interface design that is meant to support
information exploration and search.
In future work, we plan to extend the database to cover all schools of
Aalto University, i.e., the schools of Chemical Technology, Economics, Electrical
Engineering, Engineering, and Science, in addition to the School of Art and
Design involved in the current pilot. This will increase the size of the database
considerably because there are more than 300 professors at Aalto University
and the number of people in the academic staff exceeds 2000. Also, we will
implement automatic incorporation of the designed user interface elements and
facilitate on-line use of the created maps with zooming, panning and clickable
links to the original on-line data.
The map interface provides an alternative view of researchers' research areas
and their results, in contrast with the traditional classification systems that only
slowly adapt to the developments in research topics and methods. It is important
to note that creative inventions often include the introduction of new concepts that
do not fit into the existing classification systems. If this aspect is not properly
taken into account, and the semantic processing in information infrastructures
for research is based on some rigid standards, the innovation activities may
even slow down. We believe that the Self-Organizing Map provides a viable
alternative and an efficient solution for organizational information management.
References
1. Fujii, A., Utiyama, M., Yamamoto, M., Utsuro, T.: Evaluating effects of machine
translation accuracy on cross-lingual patent retrieval. In: Proceedings of the 32nd
International ACM SIGIR Conference on Research and Development in Information
Retrieval, pp. 674–675. ACM Press, New York (2009)
2. Honkela, T., Kaski, S., Lagus, K., Kohonen, T.: Newsgroup exploration with WEBSOM method and browsing interface. Tech. Rep. A32, Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo, Finland (1996)
3. Honkela, T., Nordfors, R., Tuuli, R.: Document maps for competence management.
In: Proceedings of the Symposium on Professional Practice in AI. IFIP, pp. 31–39
(2004)
4. Kaipainen, M., Koskenniemi, T., Kerminen, A., Raike, A., Ellonen, A.: Presenting
data as similarity clusters instead of lists - data from local politics as an example.
In: Proceedings of HCI 2001, pp. 675–679 (2001)
5. Kaski, S., Honkela, T., Lagus, K., Kohonen, T.: WEBSOM – self-organizing maps
of document collections. Neurocomputing 21, 101–117 (1998)
6. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (2001)
7. Kohonen, T.: Description of input patterns by linear mixtures of SOM models. In:
Proceedings of WSOM 2007, Workshop on Self-Organizing Maps (2007)
8. Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., Saarela,
A.: Self organization of a massive text document collection. In: Kohonen Maps, pp.
171–182. Elsevier, Amsterdam (1999)
9. Koikkalainen, P., Oja, E.: Self-organizing hierarchical feature maps. In: Proc.
IJCNN 1990, Int. Joint Conf. on Neural Networks, vol. II, pp. 279–285. IEEE
Service Center, Piscataway (1990)
10. Laaksonen, J., Koskela, M., Oja, E.: Application of tree structured self-organizing
maps in content-based image retrieval. In: Ninth International Conference on Artificial Neural Networks (ICANN 1999), Edinburgh, UK, September 1999, pp. 174–179
(1999)
11. Laaksonen, J., Koskela, M., Oja, E.: Class distributions on SOM surfaces for feature
extraction and object retrieval. Neural Networks 17(8-9), 1121–1133 (2004)
12. Laaksonen, J., Koskela, M., Sjöberg, M., Viitaniemi, V., Muurinen, H.:
Video summarization with SOMs. In: Proceedings of the 6th Int. Workshop on Self-Organizing Maps (WSOM 2007), Bielefeld, Germany (2007),
http://dx.doi.org/10.2390/biecoll-wsom2007-143
13. Lagus, K.: Map of WSOM 1997 abstracts – alternative index. In: Proceedings of
WSOM 1997, Workshop on Self-Organizing Maps, June 4-6, 1997, pp. 368–372.
Helsinki University of Technology, Neural Networks Research Centre, Espoo, Finland (1997)
14. Lagus, K.: Text Mining with the WEBSOM. Ph.D. thesis, Helsinki University of
Technology (2000)
15. Legrady, G., Honkela, T.: Pockets full of memories: an interactive museum installation. Visual Communication 1(2), 163–169 (2002)
16. Mayer, R., Merkl, D., Rauber, A.: Mnemonic SOMs: Recognizable shapes for self-organizing maps. In: Proceedings of the Fifth International Workshop on Self-Organizing Maps (WSOM 2005), pp. 131–138 (2005)
17. Oja, E., Laaksonen, J., Koskela, M., Brandt, S.: Self-organizing maps for content-based image retrieval. In: Oja, E., Kaski, S. (eds.) Kohonen Maps, pp. 349–362.
Elsevier, Amsterdam (1999)
18. Pöllä, M., Honkela, T., Kohonen, T.: Bibliography of Self-Organizing Map (SOM)
papers: 2002–2005 addendum. Tech. Rep. TKK-ICS-R23, Aalto University School
of Science and Technology, Department of Information and Computer Science,
Espoo (December 2009)
19. Saarikoski, J., Laurikkala, J., Järvelin, K., Juhola, M.: A study of the use of self-organising maps in information retrieval. Journal of Documentation 65(2), 304–322
(2009)
20. Ultsch, A., Siemon, H.P.: Kohonen's self organizing feature maps for exploratory
data analysis. In: Proceedings of International Neural Network Conference (INNC
1990), Paris, France, pp. 305–308 (July 1990)
Introduction
An inherent problem of cluster quality evaluation persists when we try to compare various clustering algorithms. It has been shown in [5] that the inertia
measures, or their adaptations [1], which are based on cluster profiles, are often
strongly biased and highly dependent on the clustering method. A need thus
arose for quality metrics that validate the intrinsic properties of the numerical clusters. We have thus proposed in [5] unsupervised variations of the
recall and precision measures, which have been extensively used in IR systems,
for evaluating the clusters.
For this purpose, we transform the recall and precision metrics into appropriate
definitions for the clustering of a set of documents with a list of labels, or properties. The set of labels $S_c$ that can be attributed to a cluster $c$ are those which have
maximum value for that cluster, considering an unsupervised Recall-Precision
metric [7]¹. The greater the unsupervised precision, the nearer the intensions of
the data belonging to the same cluster will be with respect to each other. In a
complementary way, the unsupervised recall criterion measures the exhaustiveness of the contents of the obtained clusters, evaluating to what extent single
properties are associated with single clusters.
Global measures of Precision and Recall are obtained in two ways. Macro-Precision and Macro-Recall measures are generated by averaging Recall and
Precision of the cluster labels at the cluster level, in a first step, and by averaging
the obtained values between clusters, in a second step. Micro-Precision and
Micro-Recall measures are generated by directly averaging Recall and Precision
of the cluster labels at a global level. Comparison of Macro measures and Micro
measures makes it possible to identify heterogeneous results of clustering [8].
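The difference between the two averaging schemes can be illustrated with a short sketch. It assumes that the per-label precision (or recall) values of each cluster's label set have already been computed; the function name `macro_micro` and the input layout are hypothetical and are not taken from [8].

```python
def macro_micro(per_cluster_scores):
    """per_cluster_scores maps a cluster id to the list of precision (or
    recall) values of the labels in its label set S_c. Clusters with an
    empty label set are skipped (an assumption of this sketch)."""
    non_empty = {c: v for c, v in per_cluster_scores.items() if v}
    # Macro: average within each cluster first, then across clusters.
    macro = sum(sum(v) / len(v) for v in non_empty.values()) / len(non_empty)
    # Micro: average directly over all (cluster, label) pairs.
    all_scores = [s for v in non_empty.values() for s in v]
    micro = sum(all_scores) / len(all_scores)
    return macro, micro

# A small pure cluster next to a label-rich cluster with low precision.
scores = {"c1": [1.0], "c2": [0.2, 0.2, 0.2, 0.2]}
macro, micro = macro_micro(scores)
```

In this toy case the Macro value (0.6) is noticeably higher than the Micro value (0.36), which is the kind of gap that signals heterogeneous clustering results.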
It is possible to refer not only to the information provided by the indices
Micro-Precision and Micro-Recall, but also to the Micro-Precision computed cumulatively. In the latter case, the idea is to give a major influence to large clusters, which are the most likely to gather heterogeneous
information and therefore, by themselves, to lower the quality of the resulting
clustering. This calculation can be made as follows:
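$$\mathrm{CMP} = \frac{\displaystyle\sum_{i=|c_{\mathrm{inf}}|,\,|c_{\mathrm{sup}}|} \frac{1}{|C_{i+}|^{2}} \sum_{c \in C_{i+},\, p \in S_c} \frac{|c_p|}{|c|}}{\displaystyle\sum_{i=|c_{\mathrm{inf}}|,\,|c_{\mathrm{sup}}|} \frac{1}{|C_{i+}|}} \qquad (1)$$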
The IGNG-F algorithm uses this strategy as a substitute for the classical distance
based measure which provides best results for homogeneous datasets [6].
where $C_{i+}$ represents the subset of clusters of $C$ for which the number of
associated data is greater than $i$, and:
$$\mathrm{inf} = \operatorname{argmin}_{c_i \in C} |c_i|, \qquad \mathrm{sup} = \operatorname{argmax}_{c_i \in C} |c_i| \qquad (2)$$
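A possible reading of the cumulative computation is sketched below. Several points are assumptions of this sketch rather than statements of the paper: $|c_p|$ is taken to be the number of data in cluster $c$ that carry label $p$, the index $i$ runs over every integer between the smallest and the largest cluster size, and values of $i$ for which $C_{i+}$ is empty are skipped to avoid a division by zero. The function name is hypothetical.

```python
def cumulative_micro_precision(clusters):
    """clusters: list of (size, label_counts) pairs, where size is |c| and
    label_counts maps each label p of S_c to |c_p| (assumed meaning: the
    number of data in c carrying p)."""
    sizes = [size for size, _ in clusters]
    numerator = denominator = 0.0
    for i in range(min(sizes), max(sizes) + 1):
        subset = [(s, lc) for s, lc in clusters if s > i]      # C_{i+}
        if not subset:            # assumption: skip empty subsets
            continue
        inner = sum(cp / s for s, lc in subset for cp in lc.values())
        numerator += inner / len(subset) ** 2
        denominator += 1.0 / len(subset)
    return numerator / denominator if denominator else 0.0

# Two clusters: sizes 2 and 4, each with one label in its label set.
cmp_value = cumulative_micro_precision([(2, {"a": 2}), (4, {"b": 2})])
```

With the strict "greater than" condition, only the larger cluster contributes in this toy example, so the result equals its precision of 0.5; this shows how the measure emphasizes large clusters.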
The major problem IGNG-F faces in the case of heterogeneous datasets is that it
can associate a data point with labels completely different from the ones existing in the
prototypes. This leads to a strong decrease in performance in such cases, as labels
belonging to different clusters are lumped together. To cope with this problem,
we propose three important variations of the IGNG-F approach, as described below.
3.1 IGNG-Fv1
We use a distance-based criterion to limit the number of prototypes which are
investigated for a new incoming data point. This makes it possible to set a neighbourhood
threshold and focus for each new data point, which is lacking in the IGNG-F
approach. It is similar to using the sigma parameter of IGNG, the only difference
being that the criterion is user-oriented and can be varied accordingly. Generally, we
see that there is a particular value of this criterion that can be used as a threshold,
beyond which all the prototypes are selected within the neighbourhood.
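The neighbourhood restriction can be sketched as a simple distance filter. This is a minimal illustration, assuming Euclidean distance between the data point and the prototype vectors; the function name `candidate_prototypes` and the example values are hypothetical.

```python
import math

def candidate_prototypes(x, prototypes, threshold):
    """Return the indices of the prototypes whose Euclidean distance to the
    data point x does not exceed the user-chosen threshold."""
    return [k for k, p in enumerate(prototypes) if math.dist(x, p) <= threshold]

prototypes = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
# A tight threshold keeps only the nearby prototypes ...
near = candidate_prototypes((0.2, 0.0), prototypes, threshold=1.0)
# ... while a large enough threshold selects every prototype.
everything = candidate_prototypes((0.2, 0.0), prototypes, threshold=10.0)
```

The second call illustrates the remark above: past a certain threshold value, all prototypes fall within the neighbourhood and the filter no longer restricts the search.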
3.2 IGNG-Fv2
Similarly to IGNG-Fv1, we use here a distance-based criterion to define a neighbourhood threshold for a new data point. An important variation is that the
F-measure which is to be computed with an added data point must not consider the new labels issued from the incoming data. The new labels are the
ones which were not part of the list of labels of a prototype before augmenting this prototype with the incoming data (refer to [6]). This operation
concerns all the prototypes that have been selected as potential winners for the
new data point. This strategy avoids associating with a prototype data whose
labels are completely different from the ones existing in this prototype, and thus
prevents the formation of heterogeneous clusters. We modify the step that calculates the labeling influence of the input data on the existing prototypes as proposed
in the IGNG-F algorithm: in the F-Max, or label maximization, approach of this
algorithm, a label is attached to the prototype which has the maximum F-measure
for that label. However, in our case, if there is more than one maximizer prototype, we temporarily attach the label to all the maximizers. Considering now the
behaviour of the algorithm, when there is an increase of F-measure produced on
some selected prototypes by an incoming data point, we choose the winner prototype
according to the formula:
Kappa = LFx (P1 ) SL(P1 )
EucDist(P1 , x)
criteria att1
(3)
Datasets Description
Results
For each method, we ran many different experiments, varying the number
of clusters in the case of the static methods and the neighbourhood parameters
in the case of the incremental ones (see below). We finally kept the best
clustering results for each method with regard to the value of the Recall-Precision
F-measure and the Cumulative Micro-Precision.
We first conducted our experiments on the Total-Use dataset, which is homogeneous by nature. Figure 1 shows the Macro F-Measure, Micro F-Measure and
the Cumulative Micro-Precision (CMP) for the dataset. The number of clusters represents the actual number of non-empty clusters among the total number
of clusters used for clustering the dataset. We see that the Macro F-Measure
and Micro F-Measure values are nearly similar for the different clustering approaches. However, the CMP value shows some difference. We see that for the
SOM approach more than half of the clusters are empty. The NG and GNG
algorithms have good Macro and Micro F-Measure but lower CMP than the
IGNG approaches (except IGNG-Fv1).
In the case of the IGNG-Fv1 method, the difference between
the Macro F-measure and the Micro F-measure is at its maximum. This difference clearly illustrates the presence of unbalanced or degenerated clustering results including some
noisy clusters of large size. Indeed, big noisy clusters cannot be detected by the
Macro F-measure as soon as they coexist with smaller relevant ones. This phenomenon is also illustrated by the low value of the CMP measure, which we use
as a central index for determining the best clustering quality. For IGNG-Fv1, the
value of CMP is extremely low, signifying that the properties (i.e. the labels) are
unable to distinguish the different clusters efficiently, although we have high CMP
values for IGNG and IGNG-F. The lower average number of labels per document
results in normalized label vectors for the Total-Use dataset that are far apart from
one another in comparison to other datasets. This would signify a higher distance between similar documents, which can be one of the reasons for the poor result
of IGNG-Fv1. We see that the best results are obtained for IGNG-Fv2, which uses
the neighbourhood criterion and the Kappa-based selection procedure for winning neurons. The dataset is homogeneous by nature, so it is possible to reach such high
precision values. Thus small, distinct and non-overlapping clusters are formed.
The standard K-Means approach produces the worst results on the first dataset.
Furthermore, we leave out the neural NG method from our next experiments because
of its excessive computation time.
Even if it embeds stable topics, the Lorraine dataset is a very complex heterogeneous dataset, as we have illustrated earlier. In a first step we restricted
our experiment to 198 clusters since, beyond this number, the GNG approach entered
an infinite loop (see below). A first analysis of the results on this dataset
shows that most of the clustering methods have great difficulty dealing with it,
consequently producing very poor results, even with such a high expected
number of clusters, as illustrated in Figure 2 by the very low CMP values.
This indicates the presence of degenerated results including a few garbage clusters
attracting most of the data, in parallel with many chunk clusters representing either marginal groups or unformed clusters. This is the case for the K-Means, IGNG and
IGNG-Fv1 methods and, to a lesser extent, for the GNG method.
This experiment also highlights the irrelevance of Mean Square Error (MSE)
(or distance-based) quality indexes for estimating the clustering quality in complex cases. Indeed, the K-Means method, which obtained the lowest MSE, in practice
produces the worst results. This behaviour can be confirmed when one looks
more closely at the cluster content and the cluster size distribution for the
said method, or even at the labels that can be extracted from the clusters in
an unsupervised way using the expectation maximization methodology that is
described in Section 2. Cluster label extraction highlights that
the K-Means method mainly produced a very large garbage cluster
that collects more than 80% of the data and attracts (i.e. maximizes) many
kinds of different labels (3234 labels among a total of 3556) relating to multiple
topics. Conversely, the good results of the IGNG-Fv2 method can be confirmed
in the same way. Indeed, label extraction also shows that this latter method
produces different clusters of similar size attracting semantically homogeneous
groups of labels. In addition, those groups clearly delineate the main research topics of the dataset, which might also be identified by looking up the different
PASCAL classification codes which were initially associated with the data by
the analysts.
The CMP value for GNG approach was surprisingly greater for 136 clusters
than for 198 clusters. Thus, increasing the expected number of clusters is not
helpful to the method to discriminate between potential data groups in the
Lorraine dataset context. On the contrary, it even leads the method to increase
its garbage agglomeration effect. For a higher number of clusters, the GNG method
does not provide any results on this dataset because of its inability to escape
from an infinite cycle of creation-destruction of neurons (i.e. clusters).
The only consistent behaviour is shown by the SOM and IGNG-Fv2 methods.
The grid-constrained learning of the SOM method seems to be a good strategy
for avoiding very poor results in such a critical context. Indeed, it
enforces the homogeneity of the results by splitting both data and noise on the
grid. The best results on this complex heterogeneous dataset are obtained with
the IGNG-Fv2 method. This is mainly because of the use of the neighbourhood criterion for
limiting the selection of prototypes and, most importantly, because of the choice
of the winner neuron based on the Kappa measure. It highlights the importance of
taking into account the combination of the maximum number of labels shared
with an incoming data point, the maximum positive increment in F-measure
and the distance between the current prototype and the new data
point.
We ran a test on the Cumulative Micro-Precision for IGNG-Fv2 and IGNG-Fv3, as they are similar in nature and differ only in the measure of similarity
used. We found that the actual peak values for the two methods are reached
when we increase the number of clusters beyond 198. For the other IGNG
algorithms the results were not as good as for IGNG-Fv2 and IGNG-Fv3. For
these two algorithms we allow a label to be associated with more than one winner
neuron (cluster); the same label might thus belong to many different clusters.
When we perform an analysis on the clusters obtained by the two approaches,
we see that there are 82 coherent clusters having more than 10 documents, with
each cluster having fewer than 50 labels associated with it, i.e. on average 5 labels per
document, significantly fewer than 8.99 (the global label average). From Figure 3, we observe
that for IGNG-Fv2 the maximum CMP value (0.25) occurs at 290 clusters,
while for IGNG-Fv3 the maximum CMP value (0.248) occurs at 286 clusters.
They follow very similar trends for large numbers of clusters, though IGNG-Fv3
reaches a high CMP value more consistently than IGNG-Fv2. Thus, they are able
to cluster this highly complex dataset much more appropriately.
Conclusion
Neural clustering algorithms show high performance in the usual context of the
analysis of homogeneous textual datasets. This is especially true for the recent
adaptive versions of these algorithms, like the incremental growing neural gas algorithm
(IGNG). Nevertheless, this paper clearly highlights the drastic decrease in performance of these algorithms, as well as that of more classical non-neural
algorithms, when a very complex heterogeneous textual dataset is considered
as input. Specific quality measures and cluster labeling techniques that are
independent of the clustering method are used for performance evaluation. One
of the main contributions of our paper has been to propose an incremental growing
neural gas algorithm exploiting knowledge issued from the clusters' current labeling
in an incremental way, in combination with the use of distance-based measures.
This solution led to a very significant increase in performance for the
clustering of textual data. Our IGNG-Fv2 approach is the most stable approach
across the different datasets. It produces high-quality clusters for each dataset, unlike the other neural and non-neural approaches, whose results vary widely across
the datasets. In our experiments, the use of a stable evaluation methodology has
certainly been a key point in guaranteeing the success of our approach.
In the near future, we would like to enhance the approach by taking into consideration all the data points for clustering. We also aim at finding an appropriate
strategy to split the largest cluster and would like to adapt our enhanced label
maximization similarity principles to several other clustering algorithms.
References
1. Davies, D., Bouldin, W.: A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1, 224–227 (1979)
2. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood for incomplete data via
the EM algorithm. Journal of the Royal Statistical Society B-39, 1–38 (1977)
3. Fritzke, B.: A growing neural gas network learns topologies. Advances in Neural
Information Processing Systems 7, 625–632 (1995)
4. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biological Cybernetics 43, 59–69 (1982)
5. Lamirel, J.-C., Al-Shehabi, S., Francois, C., Hofmann, M.: New classification quality estimators for analysis of documentary information: application to patent analysis and web mapping. Scientometrics 60 (2004)
6. Lamirel, J.-C., Boulila, Z., Ghribi, M., Cuxac, P.: A new incremental growing
neural gas algorithm based on clusters labeling maximization: Application to clustering of heterogeneous textual data. In: García-Pedrajas, N., Herrera, F., Fyfe,
C., Benítez, J.M., Ali, M. (eds.) IEA/AIE 2010. LNCS, vol. 6098, pp. 139–148.
Springer, Heidelberg (2010)
7. Lamirel, J.-C., Phuong, T.A., Attik, M.: Novel labeling strategies for hierarchical
representation of multidimensional data analysis results. In: IASTED International
Conference on Artificial Intelligence and Applications (AIA), Innsbruck, Austria
(February 2008)
8. Lamirel, J.-C., Ghribi, M., Cuxac, P.: Unsupervised recall and precision measures:
a step towards new efficient clustering quality indexes. In: Proceedings of the 19th
Int. Conference on Computational Statistics (COMPSTAT 2010), Paris, France
(August 2010)
9. MacQueen, J.: Some methods of classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symposium in Mathematics, Statistics and Probability, vol. 1, pp. 281–297. Univ. of California, Berkeley, USA (1967)
10. Martinetz, T., Schulten, K.: A neural gas network learns topologies. Artificial Neural Networks, 397–402 (1991)
11. Prudent, Y., Ennaji, A.: An incremental growing neural gas learns topology.
In: Neural Networks, 13th European Symposium on Artificial Neural Networks,
Bruges, Belgium (April 2005)
Introduction
Connectionist models have been playing an important role in language development in several areas, such as lexical and pronoun acquisition, syntactic systematicity, language disorder modeling and prosodic analysis [16, 17, 20], just
to mention a few. Most of these works are based on feedforward or recurrent supervised neural network architectures [4, 6, 8, 11], such as the MLP and Elman
networks, but self-organizing neural network models have also been used as the
primary linguistic model [9, 10, 13, 14, 18, 19].
For example, Li and co-workers [13, 14] simulated lexical acquisition in
infants using a self-organizing neural network model. The main objective of
the research was to use the topographic preservation properties of the Self-Organizing Map (SOM) [12] to study the emergence of linguistic categories and
their organization throughout the stages of lexical learning. The model captured
a series of important phenomena occurring in children's early lexical acquisition
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 267–276, 2011.
© Springer-Verlag Berlin Heidelberg 2011
2 Methods
2.1 The Self-Organizing Map
$$i^*(t) = \arg\min_{\forall i} \|\mathbf{x}(t) - \mathbf{m}_i(t)\|, \qquad (1)$$
where $t$ denotes the iterations of the algorithm. Then, it is necessary to adjust the
weight vectors of the winning neuron and of those neurons in its neighborhood:
$$\mathbf{m}_i(t+1) = \mathbf{m}_i(t) + \alpha(t)\, h(i^*, i; t)\,[\mathbf{x}(t) - \mathbf{m}_i(t)], \qquad (2)$$
where $0 < \alpha(t) < 1$ is the learning rate and $h(i^*, i; t)$ is a Gaussian weighting
function that limits the neighborhood of the winning neuron:
$$h(i^*, i; t) = \exp\left(-\frac{\|\mathbf{r}_i(t) - \mathbf{r}_{i^*}(t)\|^2}{2\sigma^2(t)}\right), \qquad (3)$$
where $\mathbf{r}_i(t)$ and $\mathbf{r}_{i^*}(t)$ are, respectively, the positions of neurons $i$ and $i^*$ in a predefined output array where the neurons are arranged in the nodes, and $\sigma(t) > 0$
defines the radius of the neighborhood function at time $t$. To guarantee convergence of the algorithm, $\alpha(t)$ and $\sigma(t)$ decay exponentially in time according to
the following expressions:
$$\alpha(t) = \alpha_0 \left(\frac{\alpha_T}{\alpha_0}\right)^{(t/T)} \quad \text{and} \quad \sigma(t) = \sigma_0 \left(\frac{\sigma_T}{\sigma_0}\right)^{(t/T)}, \qquad (4)$$
where $\alpha_0$ ($\sigma_0$) and $\alpha_T$ ($\sigma_T$) are the initial and final values of $\alpha(t)$ ($\sigma(t)$).
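One incremental update of this kind can be sketched as follows. The sketch assumes the usual nearest-weight winner search, uses plain Python lists for the weight vectors, and picks illustrative default values for the initial and final learning rate and radius; the names `decay` and `som_step` and those defaults are hypothetical and are not the settings used in the experiments.

```python
import math

def decay(v0, vT, t, T):
    """Exponential decay from v0 (at t = 0) to vT (at t = T), as in Eq. (4)."""
    return v0 * (vT / v0) ** (t / T)

def som_step(weights, positions, x, t, T, a0=0.5, aT=0.01, s0=2.0, sT=0.5):
    """One incremental SOM update for input x at iteration t.
    weights[i] is the weight vector of neuron i, positions[i] its grid position."""
    alpha, sigma = decay(a0, aT, t, T), decay(s0, sT, t, T)
    # Winner search: the neuron whose weight vector is closest to x (assumed rule).
    win = min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
    for i, m in enumerate(weights):
        d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[win]))
        h = math.exp(-d2 / (2.0 * sigma ** 2))          # Gaussian neighborhood, Eq. (3)
        weights[i] = [mj + alpha * h * (xj - mj) for mj, xj in zip(m, x)]  # Eq. (2)
    return win

weights = [[0.0, 0.0], [1.0, 1.0]]
positions = [(0, 0), (0, 1)]
winner = som_step(weights, positions, x=[0.9, 0.9], t=0, T=100)
```

In this example the second neuron wins and moves halfway towards the input, while the first neuron moves by a smaller amount because the Gaussian weight of its grid distance is below one.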
The incremental learning process defined by Eqs. (1) and (2) can often be
replaced by the following batch computation version, which is usually faster.
1. Initialize the weight vectors $\mathbf{m}_i$, $\forall i$.
2. For each neuron $i$, collect a list of all those input vectors $\mathbf{x}(t)$ whose most
similar weight vector belongs to the neighborhood set $N_i$ of neuron $i$.
3. Take as the new weight vector $\mathbf{m}_i$ the mean over the respective list.
4. Repeat from Step 2 a few times until convergence is reached.
Steps 2 and 3 of the batch SOM algorithm need less memory if at Step 2 one only
makes lists of the input vectors $\mathbf{x}(t)$ at those neurons that have been selected as
winners, and at Step 3 one takes the mean over the union of the lists that belong
to the neighborhood set $N_i$ of neuron $i$.
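The memory-saving form of Steps 2 and 3 can be sketched as a single pass over the data. This is a minimal illustration assuming that each neighborhood set is given explicitly as a list of neuron indices that includes the neuron itself, and that a neuron with no data in its neighborhood keeps its old weight vector; the function name `batch_som_epoch` is hypothetical.

```python
import math

def batch_som_epoch(weights, neighbors, data):
    """One pass of Steps 2-3. neighbors[i] is the neighborhood set N_i of
    neuron i (assumed to contain i itself)."""
    # Step 2 (memory-saving form): list each input only at its winner neuron.
    hits = [[] for _ in weights]
    for x in data:
        win = min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
        hits[win].append(x)
    # Step 3: the new weight vector is the mean over the union of the lists in N_i.
    new_weights = []
    for i, m in enumerate(weights):
        pooled = [x for j in neighbors[i] for x in hits[j]]
        if pooled:
            new_weights.append([sum(col) / len(pooled) for col in zip(*pooled)])
        else:
            new_weights.append(list(m))     # no data nearby: keep the old vector
    return new_weights

data = [[0.0], [0.2], [1.0]]
weights = [[0.1], [0.9]]
updated = batch_som_epoch(weights, neighbors=[[0], [1]], data=data)
```

With neighborhoods reduced to the neuron itself, as in this example, the update coincides with one k-means step; wider neighborhood sets pull adjacent neurons towards the same data and produce the ordering of the map.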
In addition to the usual vector quantization properties, the resulting
ordered map also preserves the topology of the input samples in the sense that
adjacent input patterns are mapped into adjacent neurons on the map. Due to
this topology-preserving property, the SOM is able to cluster input information
and spatial relationships of the data on the map.
2.2
The process of feature extraction of the speech signal is a crucial step in the
connectionist approach to pattern classification and clustering. This step consists
in applying standard signal processing techniques to the original speech signal in
order to convert it to a more suitable compact mathematical representation that
permits the identification of a given utterance by a connectionist model.
Linear predictive coding (LPC) is a signal processing technique widely used for
the parametrization of the speech signal in several applications, such as speech
compression, speech synthesis and speech recognition [7]. Roughly speaking, the
LPC¹ technique represents small segments (or frames) of the speech signal by the
coefficients of autoregressive (AR) linear predictors. For example, if the speech
signal has 500 frames, it will be parameterized by a set of 500 coefficient vectors.
To assure stationarity, each frame usually has a short duration (about 10-30 ms).
The set of LPC coefficient (LPCC) vectors associated with the utterance of
a given word is then organized along the rows of a matrix of coefficients. For
¹ LPC coefficients can capture the intensity and frequency of the speech signal. These
two characteristics are closely associated with the prosodic element accent. In
English, stress is the conjunction of three interrelated perceptual factors: 1) quantity
/ length (measured in ms), related to the size of the syllable, 2) intensity (measured
in dB), related to amplitude, and 3) height (measured in Hz), i.e., the value of the higher
F0 in an utterance.
example, if 500 coefficient vectors are generated, one vector for each frame, the
corresponding matrix of coefficients has 500 rows. The number of columns of
this matrix is equal to the order of the AR predictor used in the LPC analysis. The
matrices of coefficients are then used to train the SOM.
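The construction of such a coefficient matrix can be sketched as follows. This is only an illustration under stated assumptions: non-overlapping frames, no pre-emphasis or windowing, and AR coefficients estimated with the autocorrelation method and the Levinson-Durbin recursion. The function names `lpc` and `lpc_matrix` are hypothetical, and the exact LPC settings of the experiments are not given in this section.

```python
def lpc(frame, order):
    """Predictor coefficients a_1..a_order with x[n] ~ sum_k a_k * x[n-k],
    via the autocorrelation method and the Levinson-Durbin recursion."""
    n = len(frame)
    r = [sum(frame[t] * frame[t + k] for t in range(n - k)) for k in range(order + 1)]
    a, err = [1.0] + [0.0] * order, r[0]
    for k in range(1, order + 1):
        if err <= 0.0:                      # silent or degenerate frame
            break
        acc = r[k] + sum(a[j] * r[k - j] for j in range(1, k))
        refl = -acc / err                   # reflection coefficient
        a = ([1.0] + [a[j] + refl * a[k - j] for j in range(1, k)]
             + [refl] + [0.0] * (order - k))
        err *= 1.0 - refl * refl
    return [-c for c in a[1:]]

def lpc_matrix(signal, frame_len, order):
    """One row of LPC coefficients per non-overlapping frame."""
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    return [lpc(signal[s:s + frame_len], order) for s in starts]

# Toy signal following x[n] = 0.5 * x[n-1], split into two frames.
signal = [0.5 ** n for n in range(40)]
matrix = lpc_matrix(signal, frame_len=20, order=2)
```

The resulting matrix has one row per frame and as many columns as the predictor order, which matches the layout described above; for the toy signal, the first coefficient of each row is close to 0.5 and the second is close to zero.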
2.4
Four simulations were run with parameters that varied according to the need
to adjust to the phenomenon in question. The experiments, which were designed
to verify whether the network could organize (discriminate) learners depending
on the transfer of the stress pattern from L1 to L2, are detailed below. When
applied to the problem of interest, the simulation process of the SOM and the
analysis of the training results involve the following steps:
1. Startup and training (learning) of the network;
2. Evaluation of the quality of the map using the quantization error (QE) and
topological error (TE);
3. Generation of the U -matrix and labeled map after each training run;
4. Validation of clusters through the Davies-Bouldin index (DB);
5. Tabulation of the data for all outcome measures of network performance (the
quantization error and topological error).
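The two map quality measures of Step 2 can be sketched as follows. The sketch assumes that QE is the mean distance between each input and its best-matching unit and that TE is the share of inputs whose first and second best-matching units are not adjacent, with adjacency taken as a grid distance of 1 between unit positions. The function name `map_quality` is hypothetical and this is not the SOM Toolbox implementation.

```python
import math

def map_quality(weights, positions, data):
    """Return (QE, TE). QE is the mean distance between each input and its
    best-matching unit; TE is the share of inputs whose two best-matching
    units are not adjacent on the grid (adjacency assumed: grid distance 1)."""
    qe_sum, te_count = 0.0, 0
    for x in data:
        order = sorted(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
        first, second = order[0], order[1]
        qe_sum += math.dist(x, weights[first])
        if math.dist(positions[first], positions[second]) > 1.0 + 1e-9:
            te_count += 1
    return qe_sum / len(data), te_count / len(data)

# Three units on a line; the unit at position (0, 2) is badly placed in input space.
weights = [[0.0], [1.0], [0.1]]
positions = [(0, 0), (0, 1), (0, 2)]
qe, te = map_quality(weights, positions, data=[[0.0], [1.0]])
```

For a hexagonal lattice, the same test applies if `positions` holds the lattice coordinates of the units with unit spacing between neighbors.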
All simulations were conducted using a two-dimensional hexagonal SOM, with
hexagonal neighborhood structure, Gaussian neighborhood function, random initialization of weights and batch learning. For all the experiments we simulated a
5 × 5 SOM for 250 epochs (50 for rough training, 200 for fine tuning) with initial
and final neighborhood radii of 4 and 1, respectively. The maximum number of clusters used by the DB index was set to 10. These specifications proved adequate
to treat the phenomena in question. The SOM Toolbox [23] was used to run all
the experiments to be described.
As mentioned in Subsection 2.3, every word uttered by a speaker generates
a coefficient matrix. In order to identify this speaker in a subsequent analysis of the
results, it is necessary to label the data (row) vectors in that matrix as belonging
to that particular speaker. For this purpose, an alphanumeric label is appended
to each row vector in an additional column. Finally, the text files containing
labeled data related to the utterance of a specific word for all the speakers are
concatenated into a single file.
It is noteworthy that, in addition to the label that identifies the speaker, other
labels can be associated with a given coefficient matrix of that speaker. For
instance, a second label can identify the linguistic category to which the
pronounced word belongs. This Multi-Label (ML) Analysis is introduced in this
paper with the goal of determining which labeling is more appropriate to the type
of parameterization used. In other words, ML analysis can help infer which
linguistic properties of the speech signal are encoded in the LPC coefficients.
Finally, the U-matrix [22] is used as a tool to visualize the clusters formed
during the learning process.
Fig. 1. U-matrix revealing the formation of two major groups, one probably related to
speakers who transfer the stress pattern and the other to speakers who do not transfer
This simulation aims at investigating whether the SOM is able to organize
the speakers in clusters according to the process of transferring the stress pattern
of Brazilian Portuguese into English. All the 30 speakers were asked to utter 30
different English sentences containing situations where certain words of interest
act sometimes as a verb and sometimes as a noun. In this paper, we report only the results
obtained for the sentence "I object to going to a bar", where the word of interest
is the verb obJECT. The full corpus is available to the interested reader upon
request.
Three types of graphics were generated after SOM training: U-matrix, labeled
map (majority rule) and clustered map. The U-matrix and the clustered map
require no labeled data to be constructed. The labeled map is more useful
for our purposes if labeled data are available, since labels may provide a better
understanding of the speakers' organization as a function of their linguistic abilities.
It is worth pointing out that all the SOM computations are performed using
unlabeled data, i.e. they run in a totally unsupervised way. The labels are used
only in the analysis of the results.
Two criteria were followed for labeling purposes. Under the first, the speakers' labels
carry no information about errors in L2 stress, i.e., the transfer of the L1 pattern to
L2. In this case, the data from a given speaker are labeled by a number indicating
his/her formal education level in L2 studies (i.e. period in an English course)
and his/her order in the interview process. For example, the label 608 denotes
a speaker in the 6th semester ranked 8th in the list of individuals interviewed for
this research. The second labeling criterion adds the characters "er" to the label
when a speaker mispronounces the word, i.e. when he/she transfers the stress pattern
from L1 (Brazilian Portuguese) to L2 (English). For example, the label 203er
denotes a speaker in the 2nd semester, ranked 3rd in the interview sequence,
who mispronounced the word.
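The labeling scheme can be decoded mechanically when the labeled maps are analyzed. The sketch below assumes that a label consists of one digit for the semester and two digits for the interview rank, optionally followed by "er", as in the two examples above; the names `SpeakerLabel` and `parse_label` are hypothetical.

```python
import re
from typing import NamedTuple

class SpeakerLabel(NamedTuple):
    semester: int
    interview_rank: int
    transfers_stress: bool

def parse_label(label: str) -> SpeakerLabel:
    """Decode labels such as '608' or '203er'.

    Assumed layout: 1 digit semester, 2 digits interview rank, optional 'er'.
    """
    match = re.fullmatch(r"(\d)(\d{2})(er)?", label)
    if match is None:
        raise ValueError(f"unexpected label format: {label!r}")
    semester, rank, error_flag = match.groups()
    return SpeakerLabel(int(semester), int(rank), error_flag is not None)
```

For instance, `parse_label("203er")` yields semester 2, rank 3 and a set transfer flag, which makes it easy to count speakers with and without stress transfer per map region.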
Fig. 2. Labeled map associated with the U-matrix shown in Figure 1, confirming the
expectation of two major groups of students, one containing mainly individuals who
transfer the BP stress pattern and one containing individuals who do not
Fig. 3. Clustered map suggesting the existence of two well-defined clusters, according
to the Davies-Bouldin index
Conclusion
The preliminary results presented in this paper can serve as a starting point
to demonstrate that an unsupervised neural network can be useful to visualize
the cluster formation of prosody-related linguistic phenomena, in this case the
transfer of lexical stress. We started from the assumption that the parameterization of the speech signal through the LPC coefficients would be effective
in the categorization of speakers by prosodic features.
The segregation of the map into regions of well-defined clusters suggested that
the learners were grouped by similar phonetic-acoustic features. Over the
rounds of experiments, it was confirmed that the network discriminated speakers
according to prosodic features and organized them according to similarities in
these characteristics. Importantly, within these two large groups (the group that
transfers the BP stress pattern and the group that does not) there can be subgroups (subclusters) which, when closely examined in isolation, might reveal
rich information for the linguistic analysis of learners' utterances and
contribute to understanding the organization of the data set. We are currently
developing experiments to analyze these subgroups.
Further tests are to be made and, with more results, we hope to perfect the
proposed SOM-based methodology and use it in the future as a tool for classifying the language proficiency level in foreign languages.
Acknowledgements. The authors thank FUNCAP and CAPES (Brazilian
agencies for promoting science) for the financial support of this research.
References
1. Albini, A.B.: The influence of the Portuguese language in accentuation of English
words by Brazilian students (in Portuguese). Revista Prolíngua 2(1), 44–56 (2009)
2. Archibald, J.: A formal model of learning L2 prosodic phonology. Second Language
Research 10(3), 215–240 (1994)
3. Baptista, B.O.: An analysis of errors of Brazilians in the placement of English word
stress. Master's thesis, Postgraduate Program on Linguistics, Federal University of
Santa Catarina, Brazil (1981)
4. Blanc, J.M., Dominey, P.F.: Identification of prosodic attitudes by a temporal
recurrent network. Cognitive Brain Research 17, 693–699 (2003)
5. Boersma, P., Weenink, D.: Praat: doing phonetics by computer (2011),
http://www.praat.org, version 5.2.10 (retrieved January 11, 2011)
6. Cummins, F., Gers, F., Schmidhuber, J.: Language identification from prosody
without explicit features. In: Proceedings of the 6th European Conference on
Speech Communication and Technology (EUROSPEECH 1999), pp. 305–308
(1999)
7. Deller, J., Hansen, J.H.L., Proakis, J.: Discrete-Time Processing of Speech Signals.
John Wiley & Sons, Chichester (2000)
8. Elman, J.L.: Finding structure in time. Cognitive Science 14, 179–211 (1990)
9. Farkas, I., Crocker, M.W.: Syntactic systematicity in sentence processing with a
recurrent self-organizing network. Neurocomputing 71, 1172–1179 (2008)
10. Gauthier, B., Shi, R., Xu, Y.: Learning prosodic focus from continuous speech
input: A neural network exploration. Language Learning and Development 5,
94–114 (2009)
11. Kaznatcheev, A.: A connectionist study on the interplay of nouns and pronouns in
personal pronoun acquisition. Cognitive Computation 2, 280–284 (2010)
12. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Heidelberg (2001)
13. Li, P., Farkas, I., MacWhinney, B.: Early lexical development in a self-organizing
neural network. Neural Networks 17, 1345–1362 (2004)
14. Li, P., Zhao, X., MacWhinney, B.: Dynamic self-organization and early lexical
development in children. Cognitive Science 31, 581–612 (2007)
15. Mairs, J.L.: Stress assignment in interlanguage phonology: an analysis of the stress
system of Spanish speakers learning English. In: Gass, M., Schachter, J. (eds.)
Linguistic Perspectives on Second Language Acquisition. Cambridge University
Press, Cambridge, USA (1989)
16. McClelland, J.L.: The place of modeling in cognitive science. Topics in Cognitive
Science 1, 11–38 (2009)
17. McClelland, J.L., Botvinick, M.M., Noelle, D.C., Plaut, D.C., Rogers, T.T., Seidenberg, M.S., Smith, L.B.: Letting structure emerge: connectionist and dynamical
systems approaches to cognition. Trends in Cognitive Sciences 14, 348–356 (2010)
18. Miikkulainen, R.: Dyslexic and category-specific aphasic impairments in a self-organizing feature map model of the lexicon. Brain and Language 59, 334–366 (1997)
19. Miikkulainen, R., Kiran, S.: Modeling the bilingual lexicon of an individual subject.
In: Príncipe, J.C., Miikkulainen, R. (eds.) WSOM 2009. LNCS, vol. 5629,
pp. 191–199. Springer, Heidelberg (2009)
20. Poveda, J., Vellido, A.: Neural network models for language acquisition: A brief
survey. In: Corchado, E., Yin, H., Botti, V., Fyfe, C. (eds.) IDEAL 2006. LNCS,
vol. 4224, pp. 1346–1357. Springer, Heidelberg (2006)
21. Silva, A.C.C.: The production and perception of word stress in minimal pairs of
the English language by Brazilian learners. Master's thesis, Postgraduate Program
on Linguistics, Federal University of Ceará, Brazil (in Portuguese) (2005)
22. Ultsch, A., Siemon, H.P.: Kohonen's self organizing feature maps for exploratory
data analysis. In: Proceedings of the International Neural Network Conference
(ICNN 1990), pp. 305–308. Kluwer Academic Publishers, Dordrecht (1990)
23. Vesanto, J., Alhoniemi, E.: Clustering of the self-organizing map. IEEE Transactions on Neural Networks 11(3), 586–600 (2000)
Introduction
B. Hammer et al.
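$$p_{j|i} = \frac{\exp\left(-d_X(x^i, x^j)^2 / 2\sigma_i\right)}{\sum_{k \neq i} \exp\left(-d_X(x^i, x^k)^2 / 2\sigma_i\right)} \quad \text{and} \quad q_{j|i} = \frac{\exp\left(-d_E(y^i, y^j)^2\right)}{\sum_{k \neq i} \exp\left(-d_E(y^i, y^k)^2\right)}$$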
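$$q_{ij} = \frac{\left(1 + d_E(y^i, y^j)/\varsigma\right)^{-\frac{\varsigma+1}{2}}}{\sum_{k \neq l} \left(1 + d_E(y^k, y^l)/\varsigma\right)^{-\frac{\varsigma+1}{2}}}$$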
$$\lambda \sum_{ij} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} + (1 - \lambda) \sum_{ij} q_{j|i} \log \frac{q_{j|i}}{p_{j|i}}$$
with probabilities as for SNE and a weighting parameter $\lambda \in [0, 1]$ [19]. Similarly, t-NeRV generalizes t-SNE by using, just as t-SNE does, the alternative symmetric pairwise probabilities $p_{ij}$ and $q_{ij}$ in the original and projection space
in the symmetric version of the Kullback-Leibler divergence.
A General View. These methods obey one general principle: characteristics of the data x are computed and projections y are determined such that
the corresponding characteristics of the projections coincide with the characteristics of x as far as possible, possibly fulfilling additional constraints or objectives to achieve uniqueness. That is, we compute characteristics char(x), then
projections y are determined such that error(char(x), char(y)) becomes small.
The methods differ in how the data characteristics are defined and
computed and in how exactly the similarity of the characteristics is defined and
optimized.
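One possible instance of this principle is sketched below, under stated assumptions: char(·) is taken to be the matrix of pairwise Euclidean distances and error(·,·) the summed squared difference, which gives a simple MDS-like method optimized by gradient descent. The function names, step size and iteration count are illustrative choices and not taken from any of the methods above.

```python
import numpy as np

def pairwise_distances(points):
    """Characteristic used here: matrix of Euclidean distances."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def stress(char_x, char_y):
    """Error measure: summed squared difference of the characteristics."""
    return ((char_x - char_y) ** 2).sum()

def embed(data, dim=2, steps=500, lr=0.01, seed=0):
    """Gradient descent on error(char(x), char(y)) over the projections y."""
    rng = np.random.default_rng(seed)
    y = rng.normal(scale=1e-2, size=(len(data), dim))
    char_x = pairwise_distances(data)
    for _ in range(steps):
        diff = y[:, None, :] - y[None, :, :]
        char_y = np.sqrt((diff ** 2).sum(axis=2))
        # d stress / d y_i = 4 * sum_j (d_ij - D_ij) * (y_i - y_j) / d_ij
        ratio = (char_y - char_x) / np.maximum(char_y, 1e-12)
        np.fill_diagonal(ratio, 0.0)
        grad = 4.0 * (ratio[:, :, None] * diff).sum(axis=1)
        y -= lr * grad / len(data)
    return y
```

Exchanging `pairwise_distances` and `stress` for other characteristics and error measures changes the method while the overall scheme stays the same.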
Table 1 summarizes the properties of the optimization methods from this
point of view. Naturally, the methods differ considerably with respect to the way in
which optimization takes place: in some cases, the characteristics can be directly
computed from the data (such as distances), in others, an optimization step is
required (such as local linear weights). In some cases, the optimization of the
error measure can be done in closed form (such as for Laplacian eigenmaps), in
other cases, numerical optimization is necessary (such as for t-SNE).
Many of the above methods depend on pairwise distances, such that their effort
scales quadratically with the number of data points. This makes them infeasible for
large data sets. In addition, even linear techniques (such as those presented in [4])
become infeasible for large data sets, such that sublinear or even constant time
techniques are required. Further, it usually does not make sense to project all
data, since the projection plane would be almost completely filled with points. For this
reason, simple random subsampling is often used and the projections of just a
subsample of the full data set are shown, see e.g. the overviews [18,19].
Table 1. Many dimensionality reduction methods can be put into a general framework: characteristics of the data are extracted.
Projections lead to corresponding characteristics depending on the coefficients. These coefficients are determined such that an error
measure of the characteristics is minimized, possibly fulfilling additional constraints.

method | characteristics of data | characteristics of projections | error measure
MDS | | Euclidean distance $d_E(y^i, y^j)$ |
Isomap | | Euclidean distance $d_E(y^i, y^j)$ |
LLE | | reconstruction weights $\tilde{w}_{ij}$ such that $\sum_i (y^i - \sum_j \tilde{w}_{ij} y^j)^2$ is minimum, with constraints $\sum_i y^i = 0$, $Y^t Y = n$ |
Laplacian eigenmaps | | squared Euclidean distance $d_E(y^i, y^j)^2$ for neighboring points $i, j$, with constraints $Y^t D Y = 1$, $Y^t D 1 = 0$ | maximize correlation
MVU | Euclidean distance $d_X(x^i, x^j)$ for neighboring points $i, j$ | Euclidean distance $d_E(y^i, y^j)$ for neighboring points $i, j$, such that $\sum_{ij} d_E(y^i, y^j)^2$ is maximum and $\sum_i y^i = 0$ | enforce identity (introducing slack variables if necessary)
SNE | probab. $p_{j|i} = \exp(-d_X(x^i, x^j)^2 / 2\sigma_i) / \sum_{k \neq i} \exp(-d_X(x^i, x^k)^2 / 2\sigma_i)$ | probab. $q_{j|i} = \exp(-d_E(y^i, y^j)^2) / \sum_{k \neq i} \exp(-d_E(y^i, y^k)^2)$ |
t-SNE | probab. $p_{ij} = (p_{j|i} + p_{i|j}) / 2n$ | probab. $q_{ij} = (1 + d_E(y^i, y^j)/\varsigma)^{-\frac{\varsigma+1}{2}} / \sum_{k \neq l} (1 + d_E(y^k, y^l)/\varsigma)^{-\frac{\varsigma+1}{2}}$ |
How can these results of a random subsample be used to inspect the full data
set? One possibility is to add additional points on demand by means of out-of-sample extensions of the techniques. The general view as presented above offers
a very easy way to describe the principle of out-of-sample extensions which are
built on top of fixed dimensionality reduction mappings of a subset S of all
data: Assume the projections $y^i$ of data $x^i \in S$ are fixed. A further point x
can be mapped to coefficients y by optimizing error(char(x), char(y)), whereby
the coefficients $y^i$ for elements in S are kept fixed. Depending on the method at
hand, an explicit algebraic solution or numeric optimization is possible.
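A minimal sketch of such an out-of-sample extension is given below, assuming distance preservation as the characteristic and a squared error as the error measure. The function name `map_new_point`, the starting point and the step size are assumptions made for illustration only.

```python
import numpy as np

def map_new_point(x_new, data_s, proj_s, steps=300, lr=0.05):
    """Optimize only the coefficients y of a new point; proj_s stays fixed."""
    # char(x): distances of the new point to the subsample in data space.
    target = np.linalg.norm(data_s - x_new, axis=1)
    # Start next to the projection of the nearest subsample point.
    y = proj_s[target.argmin()] + 1e-3
    for _ in range(steps):
        diff = y - proj_s
        # char(y): distances of the candidate position to the fixed projections.
        dist = np.maximum(np.linalg.norm(diff, axis=1), 1e-12)
        grad = 2.0 * (((dist - target) / dist)[:, None] * diff).sum(axis=0)
        y = y - lr * grad / len(proj_s)
    return y
```

Each new point requires its own optimization run, which is the cost that an explicit mapping avoids.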
Assuming a deterministic optimization method for simplicity, this is essentially a way to determine a function from the full data space to the projection
space, $f : \mathbb{R}^N \to \mathbb{R}^2$, by means of an implicit formula: a data point x is mapped
to the coefficients which minimize the cost function as specified above. Depending on the method at hand, f might have a complex form and its computation
might be time consuming, although properties such as piecewise differentiability
and smoothness follow from the smoothness of the cost function.
Explicit Dimensionality Reduction Mapping
We can avoid the computational complexity and complex form of such an implicit
function f by defining an explicit dimension reduction mapping $f : \mathbb{R}^N \to \mathbb{R}^2$,
$x^i \mapsto y^i = f(x^i)$, with a priori fixed form. The formalization of dimensionality
reduction as cost optimization allows us to immediately extend the techniques to
this setting: function parameters can be optimized according to the objective
as specified by the respective dimensionality reduction method. That is, we
fix a parameterized form $f_W : \mathbb{R}^N \to \mathbb{R}^2$ with parameters W. This function
can be given by a linear function, a locally linear function, a feedforward neural
network, etc. Then, instead of coefficients $y^i$, the images of the map $f_W(x^i)$ are
considered and the map parameters W are optimized such that the costs
error(char(x), char(f_W(x)))
become minimal. This principle leads to a well-defined mathematical objective
for the mapping parameters W for every dimensionality reduction method as
summarized above. The way in which optimization takes place may differ
from the original method: while numerical methods such as
gradient descent can still be used, it is probably no longer possible to find closed
form solutions for spectral methods.
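As a hedged illustration of this principle, the sketch below fixes the simplest parameterized form, a linear map $f_W(x) = xW$, and optimizes W by gradient descent on a distance-preservation cost computed on a subsample. The cost, the name `train_linear_map` and all hyperparameters are assumptions; any other cost from the general framework could be substituted.

```python
import numpy as np

def train_linear_map(data_s, dim=2, steps=1000, lr=0.01, seed=0):
    """Fit W of f_W(x) = x @ W by gradient descent on a distance cost."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=(data_s.shape[1], dim))
    dx = data_s[:, None, :] - data_s[None, :, :]      # pairwise differences in data space
    target = np.sqrt((dx ** 2).sum(axis=2))            # char(x)
    n = len(data_s)
    for _ in range(steps):
        dy = dx @ w                                    # differences of the images f_W(x)
        dist = np.sqrt((dy ** 2).sum(axis=2))          # char(f_W(x))
        ratio = (dist - target) / np.maximum(dist, 1e-12)
        # Chain rule: d cost / dW = 2 * sum_ij ratio_ij * outer(dx_ij, dy_ij)
        grad = 2.0 * np.einsum("ij,ijd,ije->de", ratio, dx, dy)
        w -= lr * grad / n ** 2
    return w
```

Once W has been trained on the subsample, any further point is projected by a single matrix product `x @ w`, without a new optimization run.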
We can train a dimensionality reduction mapping on only a random subsample S of the data, providing an explicit out-of-sample extension for all data
points by means of the explicit mapping. Hence this technique offers constant-time
inference of a dimensionality reduction mapping, provided S has fixed size.
In the literature, a few dimensionality reduction techniques with an explicit
mapping of the data can be seen as instantiations of this principle: Locally
linear coordination (LLC) [14] extends locally linear embedding (LLE) by assuming locally linear dimensionality reduction methods, e.g. local PCAs, and
gluing them together by adding affine transformations. The additional parameters
are optimized using the LLE cost function. Parameterized t-distributed stochastic neighbor embedding (t-SNE) [17] extends t-SNE towards an embedding given
by a multilayer neural network. The network parameters are determined using
back propagation on top of the t-SNE cost function.
Supervised Locally Linear t-SNE Mapping
Here, we include one preliminary example to demonstrate the feasibility of the
approach. We use t-SNE as the dimensionality reduction method and a locally linear
function $f_W$ induced by prototypes. We start with locally linear projections of
the data obtained by means of a supervised prototype-based method, in our case
matrix learning vector quantization with rank-two matrices [13,4]. These give
us locally linear projections $x^l \mapsto p_k(x^l) = \Omega_k x^l - w^k$ with local matrices $\Omega_k$
and prototypes $w^k$. Further, we obtain responsibilities $r_{lk}$ of mapping $p_k$ for $x^l$,
given by the receptive fields. Then a global mapping can be defined as
$$f_W : x^l \mapsto y^l = \sum_k r_{lk} \left( L_k\, p_k(x^l) + l_k \right),$$
using local linear projections $L_k$ and local offsets $l_k$ to align the local pieces. The
parameters $L_k$ and $l_k$ are determined using the t-SNE cost function.
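The structure of this global mapping can be sketched as follows. The sketch makes two assumptions that are not confirmed by the text: the local projection is centered at the prototype, $p_k(x) = \Omega_k (x - w^k)$, so that it yields a two-dimensional vector, and the responsibilities are crisp, i.e. 1 for the closest prototype in Euclidean distance and 0 otherwise. All argument names are hypothetical, and the training of $L_k$ and $l_k$ with the t-SNE cost is not shown.

```python
import numpy as np

def global_map(x, prototypes, omegas, local_maps, offsets):
    """Sketch of y = sum_k r_k(x) * (L_k p_k(x) + l_k)."""
    # Crisp responsibilities: r_k = 1 for the closest prototype, 0 otherwise
    # (assumption: receptive fields defined by Euclidean distance).
    dists = np.linalg.norm(prototypes - x, axis=1)
    resp = np.zeros(len(prototypes))
    resp[dists.argmin()] = 1.0
    y = np.zeros(offsets.shape[1])
    for k in range(len(prototypes)):
        # Assumed local projection p_k(x) = Omega_k (x - w^k), a 2-D vector.
        p_k = omegas[k] @ (x - prototypes[k])
        y += resp[k] * (local_maps[k] @ p_k + offsets[k])
    return y
```

With crisp responsibilities only one summand is active per point, so the mapping is piecewise linear; soft responsibilities would blend neighboring pieces.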
Obviously, since we start from a supervised clustering, the resulting function
is biased towards good discriminative properties. We compare the results of this
technique to several state-of-the-art supervised dimensionality reduction tools as
reported in [19] on three benchmarks from [1,8] (see also [19]). For all settings,
we use only a fraction of about 10% of the data for training, extending to the full data set
by means of the explicit mapping (unlike the results reported in [19], which
are evaluated on a subset of the data only). The classification accuracy obtained by
means of nearest neighbor classification is reported in Fig. 1, showing that the
method leads to excellent results.
[Figure: plot with a vertical axis from 0 to 1 for the data sets Letter, Phoneme and Landsat; methods compared: DiReduct Map, SNeRV λ=0.1, SNeRV λ=0.3, PE, SIsomap, MUHSIC, MRE, NCA]
Fig. 1. Comparison of the 5 nearest neighbor errors for all data sets
Generalization Ability
$$E(P) := \int_X \| x - f^{-1}(f(x)) \|^2 \, P(x)\, dx$$
where P defines the probability measure according to which the data x are
distributed in X and $f^{-1}$ constitutes an approximate inverse mapping of f, since an
exact inverse does not exist in general. Thus, this objective allows us to evaluate
dimensionality reduction mappings. In practice, of course, the full data manifold
is not available, but only a finite sample set. In this case, the empirical error
$$\hat{E}_n(x) := \frac{1}{n} \sum_i \| x^i - f^{-1}(f(x^i)) \|^2$$
can be computed for a given data set $S = \{x^1, \ldots, x^n\}$. Now, a good generalization ability of a
dimensionality reduction method can be formalized as the empirical error $\hat{E}_n(x)$
being representative of the true error $E(P)$ for the dimensionality reduction f.
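The empirical error is straightforward to compute once a mapping and an approximate inverse are available. In the sketch below, the pair `project` and `lift` (projection onto the first two coordinates and zero-padding) is a toy assumption used only to make the example self-contained.

```python
import numpy as np

def empirical_reconstruction_error(data, f, f_inv):
    """E_n = (1/n) * sum_i ||x^i - f_inv(f(x^i))||^2."""
    recon = np.array([f_inv(f(x)) for x in data])
    return float(((data - recon) ** 2).sum(axis=1).mean())

def project(x):
    """Toy mapping f: keep the first two coordinates."""
    return x[:2]

def lift(y):
    """Toy approximate inverse of f: pad the missing coordinate with zero."""
    return np.concatenate([y, np.zeros(1)])
```

For three-dimensional data, `empirical_reconstruction_error(data, project, lift)` is simply the mean squared third coordinate, since this is the only information the toy mapping discards.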
This setting can be captured in the classical framework of computational
learning theory, as specified e.g. in [2]. We can adapt Theorem 8 from [2] to our
setting: We consider a fixed function class
$$\mathcal{F} : X \to E$$
from which the dimensionality reduction mapping is taken. Note that, in every
case as specified above, the form of the embedding function can be fixed: either
it is given explicitly, e.g. as a locally linear function or by means of a clustering, or
it is given implicitly by means of a local optimum of a cost function. We assume
without loss of generality that the norm of the input data and of its reconstructions
under mappings $f^{-1} \circ f$, with $f^{-1}$ denoting the approximate inverse of $f \in \mathcal{F}$, is
restricted (scaling the data beforehand, if necessary), such that the reconstruction
error is induced by the squared error, which is a loss function with limited
codomain
$$L : X \times X \to [0, 1], \quad (x^i, x^j) \mapsto \| x^i - x^j \|^2 .$$
Then, as reported in [2] (Theorem 8), assuming i.i.d. data according to P, for
any confidence $\delta \in (0, 1)$ and every $f \in \mathcal{F}$ the following holds
$$E(P) \le \hat{E}_n(x) + R_n(L_{\mathcal{F}}) + \sqrt{\frac{8 \ln(2/\delta)}{n}}$$
with probability at least $1 - \delta$, where
$$L_{\mathcal{F}} := \{ x \mapsto L(f^{-1}(f(x)), x) \mid f \in \mathcal{F} \}$$
and $R_n$ refers to the so-called Rademacher complexity of the function class.
The Rademacher complexity constitutes a quantity which, similar to the Vapnik-Chervonenkis dimension, estimates the capacity of a given function class. We
do not include its exact definition; rather, we refer to [2]. Note, however, that the
Rademacher complexity of reasonable function classes (such as piecewise constant functions, piecewise linear functions, or polynomials of fixed degree) can be limited
by a term which scales as $n^{-1/2}$, as long as the function class does not have
infinite capacity, e.g. due to an unlimited number of free parameters (e.g. polynomials with unbounded degree). See [2] for structural results and explicit bounds
for e.g. linear functions, and e.g. [13] for explicit bounds on piecewise constant
functions as induced by prototype-based clustering. This result implies that the
generalization ability of dimensionality reduction mappings is usually guaranteed, since the Gaussian complexity of the class $L_{\mathcal{F}}$ can be limited for reasonable
choices of the mapping function class $\mathcal{F}$. It remains a subject of future research to
find explicit and good bounds for concrete $\mathcal{F}$ as they occur in standard methods.
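To give a feeling for the sample sizes involved, the following sketch evaluates only the confidence term $\sqrt{8 \ln(2/\delta)/n}$ of the bound from [2]. The Rademacher complexity term is not computed here, and the function name `confidence_term` is hypothetical.

```python
import math

def confidence_term(n: int, delta: float) -> float:
    """Confidence term sqrt(8 * ln(2 / delta) / n) of the generalization bound."""
    return math.sqrt(8.0 * math.log(2.0 / delta) / n)
```

Since the loss is scaled to [0, 1], the bound is informative only once this term is well below 1. It decays as $n^{-1/2}$, so quadrupling the sample size halves it.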
Conclusion
References
1. Asuncion, A., Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI repository of
machine learning databases (1998), http://archive.ics.uci.edu/ml/ (last visit
June 19, 2009)
2. Bartlett, P.L., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds
and structural results. J. Mach. Learn. Res. 3, 463–482 (2003)
3. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation 15, 1373–1396 (2003)
4. Bunte, K., Hammer, B., Wismüller, A., Biehl, M.: Adaptive local dissimilarity
measures for discriminative dimension reduction of labeled data. Neurocomputing 73(7-9), 1074–1092 (2010)
5. Carreira-Perpiñán, M.Á.: The elastic embedding algorithm for dimensionality reduction. In: 27th Int. Conf. Machine Learning (ICML 2010), pp. 167–174 (2010)
6. Hinton, G., Roweis, S.: Stochastic neighbor embedding. In: Advances in Neural
Information Processing Systems 15, pp. 833–840. MIT Press, Cambridge (2003)
7. Keim, D.A., Mansmann, F., Schneidewind, J., Thomas, J., Ziegler, H.: Visual analytics: Scope and challenges. In: Simoff, S.J., Böhlen, M.H., Mazeika, A. (eds.)
Visual Data Mining. LNCS, vol. 4404, pp. 76–90. Springer, Heidelberg (2008)
8. Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J., Torkkola, K.: LVQ-PAK:
The learning vector quantization program package. Technical Report A30,
Helsinki University of Technology, Laboratory of Computer and Information Science, FIN-02150 Espoo, Finland (1996)
9. Lee, J., Verleysen, M.: Nonlinear dimensionality reduction, 1st edn. Springer, Heidelberg (2007)
10. Lee, J.A., Verleysen, M.: Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomput. 72(7-9), 1431–1443 (2009)
11. Mokbel, B., Gisbrecht, A., Hammer, B.: On the effect of clustering on quality assessment measures for dimensionality reduction. In: NIPS workshop on Challenges
of Data Visualization (2010)
12. Roweis, S.T., Saul, L.K.: Nonlinear Dimensionality Reduction by Locally Linear
Embedding. Science 290(5500), 2323–2326 (2000)
13. Schneider, P., Biehl, M., Hammer, B.: Adaptive relevance matrices in learning
vector quantization. Neural Computation 21(12), 3532–3561 (2009)
14. Teh, Y.W., Roweis, S.: Automatic alignment of local representations. In: Advances
in Neural Information Processing Systems 15, pp. 841–848. MIT Press, Cambridge
(2003)
15. Tenenbaum, J.B., Silva, V.d., Langford, J.C.: A Global Geometric Framework for
Nonlinear Dimensionality Reduction. Science 290(5500), 2319–2323 (2000)
16. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine
Learning Research 9, 2579–2605 (2008)
17. van der Maaten, L.J.P.: Learning a parametric embedding by preserving local structure. In: Proceedings of the 12th International Conference on Artificial Intelligence
and Statistics (AI-STATS), 5, pp. 384–391. JMLR W&CP (2009)
18. van der Maaten, L.J.P., Postma, E.O., van den Herik, H.J.: Dimensionality reduction: A comparative review. Technical Report TiCC-TR 2009-005, Tilburg University (October 2009)
19. Venna, J., Peltonen, J., Nybo, K., Aidos, H., Kaski, S.: Information retrieval perspective to nonlinear dimensionality reduction for data visualization. J. Mach.
Learn. Res. 11, 451–490 (2010)
20. Weinberger, K.Q., Saul, L.K.: An introduction to nonlinear dimensionality reduction by
maximum variance unfolding. In: Proceedings of the 21st National Conference on
Artificial Intelligence (2006)
Abstract. Several techniques have been put forward to describe characteristics of a Self-Organizing Map by depicting them on its output
grid. These techniques form artificial landscapes, which are also called
spatializations. Until now, relatively few methods have existed for displaying
distinct input vectors on the output grid. Those that exist either do not
show patterns in the distribution of the input vectors or do not observe
the spatial metaphors implied by the spatializations. This paper proposes an attempt to fill this gap. An approach is introduced where the
input vector placement can be influenced by two parameters. The placement technique is tested with two data sets and analyzed through visual
inspection. The results show that the approach can both indicate patterns in the input data and observe the spatial metaphors of the
spatializations. It thereby allows for a meaningful combination of these
visualization forms.
Keywords: Self-Organizing Maps, Input Vector Placement, Spatialization, Spatial Metaphor.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 288–297, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Introduction
Self-organizing maps (SOMs) are neural networks which are frequently used for
the clustering and linear quantization of large, high-dimensional data sets [1].
An outstanding characteristic of a SOM is its ability to display its results visually. Since a topological order is defined over the codebook vectors, they can be
depicted as cells of an output grid. Thus, after the computational part of the data
mining, a SOM allows a visual analysis of the results. This
is one of the main reasons why numerous tools and techniques for visualizing
various characteristics of a SOM have been developed. Mostly, they display an
additional color-coded value on the cells of the SOM output grid. For example, a
u-matrix shows the distances between the various codebook vectors [2]. Dark values indicate large average distances between adjacent codebook vectors, bright
values stand for small distances. Such small distances between codebook vectors
indicate clusterings of input vectors. When a u-matrix is displayed, the bright
and dark values often resemble valleys and hills¹. Clusters are then expected to
appear in the valleys, which are separated by hills. Therefore, a u-matrix is also
a spatialization.
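The height values of such a landscape can be computed in a few lines. The sketch below is a simplified variant that assumes a rectangular grid with four neighbors per cell and stores one averaged distance per unit; it is not necessarily the exact construction of [2], and the function name `u_matrix` is an assumption.

```python
import numpy as np

def u_matrix(codebook_grid):
    """Average distance of each unit to its 4-connected grid neighbors.

    codebook_grid has shape (rows, cols, dim) and needs at least two units.
    """
    rows, cols, _ = codebook_grid.shape
    u = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(
                        codebook_grid[r, c] - codebook_grid[rr, cc]))
            u[r, c] = np.mean(dists)
    return u
```

Large entries of the result correspond to hills and small entries to valleys, whichever gray value is then assigned to them.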
A spatialization is a graphic representation of information using spatial metaphors. Such a spatial metaphor is a cognitive relationship between a spatial and
a non-spatial property [3,4]. Spatial metaphors have proven to be intuitively understandable [5]. When a landscape metaphor is used, information is presented in
the form of a 2- or 3-dimensional landscape. A landscape metaphor is composed
of several other spatial metaphors, such as the distance-dissimilarity metaphor.
This metaphor obeys the first law of geography, which states that "everything is related to
everything else, but near things are more related than distant things" [6]. In the
case of 3-dimensional landscapes, such as the u-matrix, a height metaphor is also
employed, since the information is communicated through the visual impression
of a relief. Other SOM visualization techniques, like component planes
[1] or p-matrices [7], also make use of such metaphors in order to communicate their
information. The SOM output grid itself is an example of a 2D landscape, since
it maps codebook vectors which are similar in input space close together on
the output grid [8,9]. These characteristics have led to increased attention
to the SOM in the GIScience community. Several publications deal with the
application of methods for spatial analysis to SOMs [10].
Whilst much effort has been undertaken to develop visualization techniques
to describe the characteristics of a SOM and therefore its underlying data set,
relatively few attempts have been made to depict the input data vectors on the
output grid. However, for several applications it is helpful to provide a link back
to the input data. For example, SOMs can be used to visualize large archives of
data such as scientific papers [11]. In such archives, similar papers are mapped
to similar Best Matching Units (BMUs). Such library SOMs often use u-matrices
to make it easier for the user to find clusters of codebook vectors. This approach
is less useful when the input vectors, i.e., the scientific papers, are not mapped
onto the output grid. Often this limitation has been overcome by instead linking
the codebook vectors back to the input data or to other visualization forms.
In section 2, several prior attempts to visualize input vectors on a SOM output
grid are introduced. In section 3, an approach is presented to position input
vectors in such a way that they conform to the landscape metaphor of a SOM.
This approach is tested on two data sets by a visual inspection of their u-matrices. The tests are presented in section 4 and discussed in section 5. Section
6 concludes the paper.
Related Works
One of the earliest attempts to visualize input vectors on a SOM output grid
was the usage of a SOM with more codebook vectors than input vectors [9].
¹ Please note that this assignment of gray values is reversed in some u-matrices. Such
a color coding is also used for grayscale geographic maps. For reasons of legibility,
in this paper valleys are bright and hills are dark.
T. Fincke
The training of the SOM would cause some of the codebook vectors to adopt
the values of one of the input vectors, whilst some codebook vectors would not
be the BMU for any input vector. These empty grid cells would then indicate
inter-vector distances. This approach has the obvious disadvantage of not being
compatible with other visualization techniques. Also, both the clustering and
linear quantization capabilities of the SOM are lost.
In [11], the input vectors are placed randomly in the vicinity of their BMU.
This placement technique has the advantage that the depiction of the input
vectors complies with the spatial metaphors used by other visualization techniques. When, e.g., a BMU is located in the valley of a u-matrix, the input
vectors are also located in this valley. They are quickly visible as members of the
corresponding cluster. However, since the placement of the input vectors is random, inter-input vector distances are meaningless, so the first law of geography
is violated. The placement might therefore trick an analyst into believing that
relationships exist between input vectors where there are none.
A third input vector placement method is the weighted response placement
[12]. In this technique, not only the BMU but all codebook vectors are considered
in the process of determining a position for a distinct input vector. For an $m \times n$ SOM, the position $p_x$ of a distinct input vector x is
$$p_x = \sum_{i=1}^{m} \sum_{j=1}^{n} R_{i,j}(x)\, o_{i,j} \qquad (1)$$
where $o_{i,j}$ is the position of a codebook vector $w_{i,j}$ on the output grid. This
position is the center of the output grid cell associated with the codebook vector.
$R_{i,j}(x)$ is the response of $w_{i,j}$ to x and is given by
$$R_{i,j}(x) = \frac{g(q_{i,j})\, N_c(i, j)}{\sum_{k=1}^{m} \sum_{l=1}^{n} g(q_{k,l})\, N_c(k, l)}, \quad i \in [1, m],\ j \in [1, n] \qquad (2)$$
where c is the BMU and $N_c(i, j)$ is a neighborhood function around c. The purpose of the neighborhood function is to indicate which codebook vectors should
be considered. $g(q_{i,j})$ is a weighting function decreasing monotonically in the
interval between 0 and 1 and $q_{i,j}$ is the quantization error for a codebook vector
$w_{i,j}$. The weighting function indicates to which degree the codebook vectors should
be included in the calculation. Both the neighborhood and the weighting function decrease with increasing distance from the BMU.
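Equations (1) and (2) amount to a normalized weighted average of grid-cell centers. The sketch below takes the values of the weighting and neighborhood functions as precomputed arrays; the function and argument names are hypothetical.

```python
import numpy as np

def weighted_response_position(positions, weights, neighborhood):
    """p_x = sum_ij R_ij * o_ij with R_ij = g_ij N_ij / sum_kl g_kl N_kl.

    positions:    (m, n, 2) array of grid-cell centers o_ij
    weights:      (m, n) array with the values g(q_ij)
    neighborhood: (m, n) array with the values N_c(i, j)
    """
    response = weights * neighborhood
    response = response / response.sum()      # Eq. (2): normalization
    return (response[:, :, None] * positions).sum(axis=(0, 1))   # Eq. (1)
```

Because the responses sum to one, the resulting position always lies inside the convex hull of the cell centers that receive a non-zero response.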
In [12] the neighborhood around the BMU was kept very broad and included
codebook vectors from all over the grid in the calculation process. Although
the inter-input vector distances were preserved in the output space and
patterns of the input data became visible, the input vectors were often positioned
between several well-matching units. When this approach was combined with a
u-matrix, input vectors were often not placed in the valleys or ridges associated
with their respective BMUs, thus violating the landscape metaphor.
This approach shows similarities to methods from multi-dimensional
scaling like Sammon's mapping [13], where high-dimensional vectors are mapped
onto 2-dimensional output grids.
The previous section shows that there is as yet no technique which positions input
vectors in such a way that their depictions form a coherent spatialization
with other techniques such as u-matrices and at the same time indicate patterns
in the distribution of input vectors. Therefore the aim of this section is to find
a method to position the vectors in such a way that (a) the relational distances
amongst them are mostly preserved but (b) they are not placed far away
from their respective BMUs on the output grid.
Since the approach from [12] allows for the alteration of both its neighborhood
and weighting function, no completely new method was developed. Instead,
various parameter combinations of the weighted response placement approach
were tested with different data sets and the results were analyzed visually.
Before the analysis began, some slight alterations of the approach were performed. The weighting function was changed to
$$ g(x, w_{i,j}) = \exp\left( -\frac{\| x - w_{i,j} \|^2}{r} \right) \tag{3} $$
for an input vector x and a codebook vector $w_{i,j}$. This introduces a parameter r which can be altered by the analyst. Small values of r lead to smaller weights for codebook vectors which are far away from the input vector, while large values cause them to receive larger weights.
Maybe even more crucial is the definition of the neighborhood function. The approach presented here is to either include a codebook vector completely or not at all. This is achieved by including the best matching unit and the units in its surrounding. The neighborhood function is therefore defined as
$$ N_c(i,j) = \begin{cases} 1, & \text{if } w_{i,j} \in Nh_c(k) \\ 0, & \text{else} \end{cases} \tag{4} $$
where $Nh_c(k)$ is the set of codebook vectors around the BMU c to be considered. The parameter k is used to control how many neighbors to include.
For k = 0, only the BMU is considered. For k = 1, the BMU and its immediate neighbors are used. For k = 2, the immediate neighbors of the BMU's immediate neighbors are also included, and so on.
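The interplay of the weighting function, the binary neighborhood and the normalized response can be illustrated with a minimal Python sketch. It assumes that the position of an input vector is the response-weighted mean of the grid coordinates of the included units; the function name `weighted_response_position` and its arguments are hypothetical and not taken from [12].

```python
import numpy as np

def weighted_response_position(x, codebook, grid_xy, included, r=0.2):
    """Place input vector x as a response-weighted mean of grid positions.

    x        : (d,) input vector
    codebook : (U, d) codebook vectors, one row per map unit
    grid_xy  : (U, 2) output-grid coordinates of the units
    included : (U,) boolean mask, True for units inside the neighborhood
    r        : width parameter of the exponential weighting function
    """
    included = np.asarray(included, dtype=bool)
    sq_dist = np.sum((codebook[included] - x) ** 2, axis=1)
    # Shifting by the minimum does not change the normalized weights,
    # but it avoids underflow when all distances are large relative to r.
    weights = np.exp(-(sq_dist - sq_dist.min()) / r)
    response = weights / weights.sum()   # normalized response of included units
    return response @ grid_xy[included]

# Hypothetical toy map with three units on a line
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
grid_xy = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
x = np.array([0.5, 0.0])
print(weighted_response_position(x, codebook, grid_xy, [True, True, False]))
```

With only the BMU included (k = 0) the input vector lands exactly on the BMU; as more units are included, the position moves towards the centroid of the included units.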
Tests
For testing, the input vectors were not merely placed on SOM output grids, but
on u-matrices. There are two reasons for doing this: First, since the u-matrix
employs a height metaphor, the spatialization is an extension of the SOM grid
landscape. Any placement which forms a coherent spatialization with the u-matrix will do so as well with the plain output grid. Second, in this manner the
point clusterings can be compared to the clusterings indicated by u-matrix valleys. The test itself consisted of examining whether (a) the input vector position
distributions formed meaningful patterns and (b) the placement of the input
T. Fincke
Fig. 1. Various placements of input vectors of the Chainlink data set on a u-matrix for
its SOM
Fig. 2. Various placements of input vectors of the WHO data set on a u-matrix for its
SOM. Circles are drawn around the input vectors of Ghana and Sudan.
depiction in the top row almost indicates the positions of the codebook vectors, at least for those which are not too close to the border of the grid. The placements in the bottom row show the effect of a large neighborhood value. In all depictions of that row, some of the input vectors are not only placed on a u-matrix hill, but seem to merge with input vectors from other valleys. Whilst this indicates that the clusters form rings in input space, it does not harmonize with the landscape laid out by the u-matrix. Also, note that in the rightmost depictions the input vectors seem to be drawn to the center of the map. This is because the centroid of the included codebook vectors shifts towards the map center for increased values of k.
Arguably the best placements are given by the four depictions with k ∈ [2, 3]
and r 0.2. In these depictions, the input vectors are clearly placed in the
valleys of the u-matrix and at the same time form patterns.
The second set consists of data from the World Health Organization (WHO) [16]. It comprises 192 states with a total of 10 variables (population growth rate, total population, total urban population, male child mortality, female child mortality, cases of tuberculosis, male adult mortality, female adult mortality, neonatal mortality, and average life expectancy). The SOM for this data set is a 14 × 5 SOM. The results for neighborhoods up to k = 4 and weights up to r = 0.3 are displayed on a u-matrix in Figure 2.
The u-matrix shows two salient valleys, one in the southeast (valley 1) and
one to the west (valley 2). About half of the input vectors have been mapped to
valley 1. It consists mainly of European states, but also of several states from
South America, Eastern Asia, and the Middle East. 23 African states plus Haiti
were placed in valley 2. Whilst not as clearly designated as a valley by the u-matrix, the vector placement shows a third big accumulation to the northwest of valley 1. The input vectors clustered in this valley (valley 3) are a blend of states from Eastern Europe, several small island states, Southern Asia, and Egypt.
Again, for k = 1 the vectors are placed close to the positions of their BMUs. For r = 0.1 the input vector placements do not show patterns; the only remarkable event is the alignment of the input vectors from valley 3 towards valley 1. The several individual cases scattered all over the map do not change their positions. In all depictions, the two eastern clusters meet when k = 3. This also means that the input vectors are not located in their respective valleys, but are placed on top of hills, thereby violating the rules of the landscape metaphor.
In valley 1 the input vectors form sub-clusters. Input vectors placed in the
northern part of valley 1 consist mainly of European states. This north-south
divide is visible best for k = 1 or r 0.2. In some of the depictions (best visible
for k = 3 and r = 0.2) also a divide between the eastern and western part
becomes apparent. In the eastern part, mainly states from Western Europe can
be found, whilst the western part consists of many Eastern European states.
Between valley 2 and valley 3 a large plain is situated which is separated by a
chain of hills from the valleys. Distinct input vectors which are placed on the
hills between the plain and valley 3 for k = 1 are located closer to either the
plain or valley 3 when k is increased. In this context, note the two input vectors
close to the bottom of the map (Ghana and Sudan, emphasized in Figure 2 by
circles) which lie in the middle between the plain and the valley, but show a
tendency towards the plain for k 2 and r 0.2. The input vectors placed
in valley 2 do not show any distinguishable patterns, but clearly form a cluster
within the valley for r 0.2 and k 1.
It is also apparent that the patterns of the input vectors in valley 1 seem to
dissolve for k = 4. For this increased neighborhood, the clusters start to merge
and to move away from the valley. Also, the input vectors wander away from the
borders to the middle of the map.
Arguably the best depiction is delivered by the map with k = 2 and r = 0.2.
Here, input vectors with BMUs within the same valley form sub-clusters, but
are not placed on hills or get mixed up with input vectors with BMUs from
other valleys. Also, most hills are actually void of input vectors. Therefore, the
landscape metaphor of the u-matrix is observed here.
From these results, it can be seen that even small neighborhoods can cause placements of input vectors on hills and therefore violations of the landscape metaphor. A neighborhood of k = 1 will cause the input vectors to be placed around their BMUs. The vectors will only be comparable to vectors with the same BMU, but not to others, even when their BMUs are topological neighbors. One reason for this is that of the seven codebook vectors which are used for the calculation of the position of an input vector, only four are used for input vectors from adjacent BMUs.
When k = 2, 19 codebook vectors are used for the position calculation, and 14 of them are also used for the calculation of an input vector with a neighboring BMU. This ratio converges to 1 with increasing k, but the number of shared codebook vectors for k = 2 is sufficient to show similarities or dissimilarities between input vectors from different BMUs.
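The counts quoted above (7 and 4 for k = 1, 19 and 14 for k = 2) can be reproduced with a short Python sketch. It assumes a hexagonal lattice, which the seven-unit neighborhood for k = 1 suggests, and ignores map borders; the function `hex_neighborhood` and the axial coordinate convention are illustrative choices, not part of the original method description.

```python
def hex_neighborhood(center, k):
    """Units within k steps of `center` on an unbounded hexagonal grid.

    Units are addressed by axial coordinates (q, r); two units are
    immediate neighbors if their hexagonal distance is 1.
    """
    cq, cr = center
    cells = set()
    for dq in range(-k, k + 1):
        # Limits of dr so that the hexagonal distance stays <= k
        for dr in range(max(-k, -dq - k), min(k, -dq + k) + 1):
            cells.add((cq + dq, cr + dr))
    return cells

for k in (1, 2, 3):
    own = hex_neighborhood((0, 0), k)
    adjacent = hex_neighborhood((1, 0), k)   # neighborhood of an adjacent BMU
    shared = own & adjacent
    print(k, len(own), len(shared), round(len(shared) / len(own), 3))
```

The printed ratio of shared to used codebook vectors grows with k, which is the convergence towards 1 described in the text.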
Another issue is the centering effect: when the neighborhood is increased, input vectors tend to be placed at the center of the map. A way to prevent this might be to assign extra weights to codebook vectors at the borders. However, this would also lead to an unjustifiably strong accentuation of those codebook vectors and therefore distort the input vector distribution. Due to these findings, a neighborhood of k ∈ [2, 3] seems preferable.
When r is increased, the calculated positions for input vectors with the same BMU become similar. This effect is alleviated when the neighborhood k is also increased. The distribution of the input vectors then starts to show patterns. However, some of the input vectors with BMUs placed in valleys are then positioned on hills. A striking difference between the results for the two data sets is that for r = 0.1 the SOM for the Chainlink data showed a clear alignment of the input vectors, whilst in the SOM for the WHO data set, except for the input vectors in valley 3, input vectors with different BMUs did not seem to form any pattern. Patterns emerged most clearly for r = 0.2.
For the Chainlink data set, the actual pattern (two intertwined rings) was known and clearly distinguishable. Therefore it could easily be seen that the input vectors aligned in a way that resembled this pattern from input space. Doing so was harder for the WHO data set, because the underlying distribution was not known. However, this data set also showed patterns: differences between distinct input vectors or groups of vectors became apparent and sub-clusters located in valleys were indicated. Also, the placement of input vectors emphasized valley 3, which would not have been so easy to distinguish by merely inspecting the u-matrix. The distances between different input vectors or groups of input vectors have proven here to tell about the underlying distribution in input space.
This means that for certain parameter combinations the spatial metaphor was obeyed and applied successfully. Of course, the parameter combinations which worked best here might not be the optimal solutions for other SOMs. Trying other combinations is therefore encouraged, especially when one is dealing with data sets which are very different from those presented in this work.
Conclusion
References
1. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (2001)
2. Ultsch, A., Siemon, H.P.: Kohonen's Self Organizing Feature Maps for Exploratory Data Analysis. In: Proceedings of International Neural Network Conference (1990)
3. Kuhn, W., Blumenthal, B.: Spatialization: Spatial Metaphors for User Interfaces. Department of Geoinformation, Technical University of Vienna, Vienna, Austria (1996)
4. Skupin, A., Buttenfield, B.P.: Spatial Metaphors for Visualizing Very Large Data Archives. In: Proceedings GIS/LIS 1996, pp. 607-617 (1996)
5. Tory, M., Sprague, D.W., Wu, F., So, W.Y., Munzner, T.: Spatialization Design: Comparing Points and Landscapes. IEEE Transactions on Visualization and Computer Graphics 13(6), 1262-1269 (2007)
6. Tobler, W.R.: A Computer Movie Simulating Urban Growth in the Detroit Region. Economic Geography 46(2), 234-240 (1970)
7. Ultsch, A.: Density Estimation and Visualization for Data containing Clusters of unknown Structure. In: Weihs, C., Gaul, W. (eds.) Classification - the Ubiquitous Challenge, Proceedings 28th Annual Conference of the German Classification Society (GfKl 2004), pp. 232-239. Springer, Heidelberg (2005)
8. Bação, F., Lobo, V., Painho, M.: Geo-SOM and its integration with Geographic Information Systems. In: Proc. Workshop on Self-Organizing Maps, Paris, France (2005)
9. Skupin, A., Hagelman, R.: Attribute Space Visualization of Demographic Change. In: Proceedings of the 11th ACM International Symposium on Advances in Geographic Information Systems (2003)
10. Agarwal, P., Skupin, A. (eds.): Self-Organising Maps: Applications in Geographic Information Systems. Wiley, Chichester (2008)
11. Skupin, A.: A Cartographic Approach to Visualizing Conference Abstracts. IEEE Computer Graphics and Applications 22(1), 50-58 (2002)
12. Liao, G., Shi, T., Liu, S., Xuan, J.: A Novel Technique for Data Visualization Based on SOM. In: Duch, W., Kacprzyk, J., Oja, E., Zadrozny, S. (eds.) ICANN 2005. LNCS, vol. 3696, pp. 421-426. Springer, Heidelberg (2005)
13. Sammon, J.W.: A Nonlinear Mapping for Data Structure Analysis. IEEE Transactions on Computers 18(5), 401-409 (1969)
14. Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J.: SOM Toolbox for Matlab 5. Technical Report A57, Helsinki University of Technology (2000), http://www.cis.hut.fi/projects/somtoolbox/
15. Ultsch, A.: Clustering with SOM: U*C. In: Proc. Workshop on Self-Organizing Maps, Paris, France, pp. 75-82 (2005)
16. World Health Organization - Data and Statistics, January 2011 (2011), http://www.who.int/research/en/
1 IFSTTAR - Bâtiment Descartes 2,
2, Rue de la Butte verte, 93166 Noisy le Grand Cedex, France
etienne.come@ifsttar.fr
2 SAMM - Université Paris 1 Panthéon-Sorbonne
90, rue de Tolbiac, 75013 Paris, France
marie.cottrell@univ-paris1.fr
3 Université Catholique de Louvain, Machine Learning Group
Place du Levant 3, 1348 Louvain-La-Neuve, Belgium
michel.verleysen@uclouvain.be
4 Snecma, Rond-Point René Ravaud-Réau,
77550 Moissy-Cramayel CEDEX, France
jerome.lacaille@snecma.fr
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 298-307, 2011.
© Springer-Verlag Berlin Heidelberg 2011
1 Introduction
During flights, on-board sensors measure many parameters related to the behavior (and therefore the health) of aircraft engines. These parameters are recorded and used in the short term for immediate action and in the long term for alarm generation. In this work, we are interested in the long-term monitoring of aircraft engines and we want to use these measurements to detect any deviations from normal behavior, to anticipate possible faults and to facilitate the maintenance of aircraft engines. This work presents a tool that can help experts, in addition to their traditional tools based on quantitative inspection of some relevant variables, to easily visualize the evolution of the engine health. This evolution will be characterized by a trajectory on a two-dimensional Self-Organizing Map. Abnormal aging and faults will result in deviations with respect to normal conditions. The choice of Self-Organizing Maps is motivated by several points:
- SOMs are useful tools for visualizing high-dimensional data on a low-dimensional grid;
- SOMs have already been applied with success for fault detection and prediction in plants and machines (see [8] for example).
This article follows another WSOM paper [3] but contains the necessary (and possibly redundant) material to be self-contained. It is organized as follows: first, in Section 2, the data and the notations used throughout the paper are presented. The methodology and the global architecture of the proposed procedure are described in Section 3. Each step is defined and results on real data are given in Section 4.
2 Data
3 Methodology
Table 1. Variable names, descriptions and types

Symbol     Name            Description                        Type          Binary
           aid             aircraft id
           eid             engine id
           fdt             flight date
X^1_ij     temp            temperature                        environment
X^2_ij     nacelletemp     nacelle temperature                environment
X^3_ij     altitude        aircraft altitude                  environment
X^4_ij     wingaice        wings anti-ice                     environment
X^5_ij     nacelleaice     nacelle anti-ice                   environment
X^6_ij     bleedvalve      bleed valve position               environment
X^7_ij     isolationleft   valve position                     environment
X^8_ij     vbv             variable bleed valve position      environment
X^9_ij     vsv             variable stator valve position     environment
X^10_ij    hptclear        high pressure turbine setpoint     environment
X^11_ij    lptclear        low pressure turbine setpoint      environment
X^12_ij    rotorclear      rotor setpoint                     environment
X^13_ij    ecs             air cooling system                 environment
X^14_ij    fanspeedi       N1                                 environment
X^15_ij    mach            aircraft speed                     environment
Y^1_ij     corespeed       N2                                 engine
Y^2_ij     fuelflow        fuel consumption                   engine
Y^3_ij     ps3             static pressure                    engine
Y^4_ij     t3              temperature plan 3                 engine
Y^5_ij     egt             exhaust gas temperature            engine

The goal is to build the trajectories of all the engines, that is, to project the successive observations of each engine on a Self-Organizing Map, in order to follow their evolution and to eventually detect abnormal deviations. It is not advisable to use the raw engine measurements: they are inappropriate for direct analysis by Self-Organizing Maps, because they depend strongly on the environmental conditions and also on the characteristics of the engine (its past, its age, ...). The first idea is to use a linear regression for each engine variable: the environmental variables (real-valued variables) and the engine number (categorical variable) are the predictors, and the residuals of these regressions can be used as standardized variables (see [3] for details). For each engine variable r = 1, ..., p, the regression model can be written as:
$$ Y^r_{ij} = \mu^r + \alpha^r_i + \beta^r_1 X^1_{ij} + \dots + \beta^r_q X^q_{ij} + \varepsilon^r_{ij} \tag{1} $$
where $\alpha^r_i$ is the engine effect on the r-th variable, $\beta^r_1, \dots, \beta^r_q$ are the regression coefficients for the r-th variable, $\mu^r$ is the intercept and the error term $\varepsilon^r_{ij}$ is the residual.
Figure 1 presents, as an example, the raw measurements of the corespeed feature as a function of time (for engine 6) and the residuals computed by model (1). The raw measurements seem almost time-independent in this figure, whereas the residuals exhibit an abrupt change which is linked to a specific event in the life of this engine. This simple model is therefore sufficient to bring to light interesting aspects of the evolution of this engine. However, the signals may contain ruptures, making the use of a single regression model hazardous.

Fig. 1. (a) Raw measurements of the corespeed variable as a function of time for engine 6, (b) residuals of the same variable and for the same engine using a simple linear model with the environmental variables and the engine indicator as predictors (see Table 1).

The main idea of this work is to replace model (1) by a new procedure which deals with the temporal behavior of the signals. The goal is therefore to detect the ruptures and to use different models after each rupture. This new procedure is composed of two modules. The first module (Environmental Conditions Normalization, ECN) aims at removing the effects of the environmental variables to provide standardized variables, independent of the flight conditions. It is described in Section 4.1. The second module uses an on-line change detection algorithm to find the above-mentioned abrupt changes, and introduces a piecewise regression model. The detection of the change points is done in a multi-dimensional setting taking as input all the normalized engine variables supplied by the ECN module. The Change Detection (CD) module is presented in Section 4.2. As a result of these first two steps, the cleaned database can be used as input to a Self-Organizing Map with a proper distance for trajectory visualization. The third module (SOM) provides the map on which the trajectories will be drawn. Finally, engine trajectories on the map are gathered in a trajectory database which can be accessed through a SEARCH module, which uses a dedicated Edit Distance to find similar trajectories. This four-step procedure is summarized in Figure 2.
[Figure 2: block diagram of the procedure. Inputs: engine variables, environmental variables. Module 1 (ECN): Environmental Condition Normalisation (Regression). Module 2 (CD): Changes Detection, Adaptive Modelling. Module 3 (SOM). Module 4 (SEARCH): Distance, Request. Intermediate data: standardized variables, final variables, trajectory database.]
4
4.1 Environmental Conditions Normalization - ECN
The first module aims at removing the effects of the environmental variables. For that purpose, one regression model has to be fitted for each of the p engine variables. As the relationship between environmental and engine variables is complex and definitely not linear, the environmental variables can be supplemented by non-linear transformations of themselves, increasing the number of explanatory variables. Interactions (all the possible products between two environmental variables), squares, cubes and fourth powers of the non-binary environmental variables are considered. The number q of predictors in the model is therefore a priori equal to (11 + 4) × (11 + 4 − 1)/2 = 105 for the interaction variables and 11 × 4 + 4 = 48 for the powers of the continuous variables and the binary variables, leading to a total of q = 153 predictors. This number is certainly too large and some of the predictors are clearly irrelevant due to the systematic procedure used to build the non-linear transforms of the environmental variables. A LASSO criterion [4] is therefore used to estimate the regression parameters and to select a subset of significant predictors. This criterion can be written using the notations from Section 2 for one engine variable $Y^r$, $r \in \{1, \dots, p\}$, as:
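$$ \beta^r = \arg\min_{\beta^r \in \mathbb{R}^q} \sum_{i,j=1}^{I,n_i} \Big( Y^r_{ij} - \sum_{l=1}^{q} \beta^r_l X^l_{ij} \Big)^2, \qquad \sum_{l=1}^{q} |\beta^r_l| < C^r \tag{2} $$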
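The predictor count of this section (105 interactions plus 48 power and binary terms) can be checked with a small Python sketch that builds the expanded design matrix. The function name `expand_predictors` and the random toy data are assumptions made for illustration; the actual preprocessing code of the study is not described in the text.

```python
import numpy as np
from itertools import combinations

def expand_predictors(X, is_binary):
    """Build the non-linear predictor set from m environmental variables.

    X         : (n, m) matrix of environmental variables
    is_binary : length-m sequence of booleans marking the binary variables
    Returns an (n, q) matrix with powers 1..4 of each continuous variable,
    each binary variable once, and all pairwise products of distinct variables.
    """
    X = np.asarray(X, dtype=float)
    m = X.shape[1]
    columns = []
    for v in range(m):
        if is_binary[v]:
            columns.append(X[:, v])                 # powers of a 0/1 variable add nothing
        else:
            columns.extend(X[:, v] ** p for p in (1, 2, 3, 4))
    for a, b in combinations(range(m), 2):          # interaction terms
        columns.append(X[:, a] * X[:, b])
    return np.column_stack(columns)

# Toy data: 11 continuous and 4 binary variables, as counted in the text
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(10, 11)), rng.integers(0, 2, size=(10, 4))])
design = expand_predictors(X, [False] * 11 + [True] * 4)
print(design.shape)
```

The resulting matrix has 153 columns, which is the number of candidate predictors passed to the LASSO selection.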
           corespeed   fuelflow   ps3      t3       egt
           25          43         31       30       41
           0.9875      0.9881     0.9773   0.9636   0.8755
[Figure 3: plot for fuelflow; labeled curves: fanspeedi, altitude, fanspeedi*altitude; vertical line: optimal solution]
Fig. 3. Regularization path for the fuelflow variable: evolution of the regression coefficients with respect to $C^r$. The most significant explanatory variables are given and the best solution with respect to cross-validation is depicted by a vertical line.
interesting from the point of view of the experts, because it can be compared with their previous knowledge. Such a curve clearly highlights which predictors are the most relevant, and they appear to be in very good agreement with the physical knowledge of the system.
In summary, the first preprocessing module (ECN) provides p = 5 standardized engine variables denoted by $S^r = [S^r_{ij},\ i \in \{1, \dots, I\},\ j \in \{1, \dots, n_i\}]$, with $r \in \{1, \dots, p\}$, which are the residuals of the selected regressions. They are independent of environmental conditions but still contain some significant aspects such as linear trends and abrupt changes at specific dates. We therefore propose to use an on-line Change Detection algorithm (CD) together with an adaptive linear model to fit the data.
4.2 Change Detection - CD
To take into account the two types of variation (linear trends and abrupt changes), we implement an algorithm based on the ideas from [5] and [7]. The solution is based on the joint use of an on-line change detection algorithm to detect abrupt changes and of a bank of recursive least squares (RLS) algorithms to estimate the slow variations of the signals. The algorithm works on-line in order to allow projecting new measurements on the map as soon as new data are available. The method can be described as follows:
1) One RLS algorithm is used for each of the p standardized engine variables to recursively fit a linear model. For each $r \in \{1, \dots, p\}$, for each engine $i \in \{1, \dots, I\}$ and at each date l, one has to solve the following equation:
$$ (a^r_{il}, b^r_{il}) = \arg\min_{a \in \mathbb{R},\, b \in \mathbb{R}} \sum_{j=1}^{l} \lambda^{(l-j)} \big( S^r_{ij} - (b\, j + a) \big)^2, \tag{3} $$
where $\lambda$ is a forgetting factor. The estimates $a^r_{il}$ and $b^r_{il}$ are respectively the intercept and the slope of the linear relationship. These estimates are then used to define the variables $e^r_{il} = S^r_{il} - (b^r_{il}\, l + a^r_{il})$, which no longer contain the slow variations of the signals.
2) These values are concatenated in a vector $e_l = [e^1_l, \dots, e^p_l]$, which is then used in a multi-dimensional Generalized Likelihood Ratio (GLR) algorithm [1] to detect the abrupt changes of the signals. The GLR algorithm is a sequential test procedure based on the following model:
$$ e_k \sim \mathcal{N}_p(\theta(k), \Sigma), \quad k > 0, $$
where $\mathcal{N}_p(\theta(k), \Sigma)$ is the multivariate normal distribution with variance $\Sigma$ and mean
$$ \theta(k) = \begin{cases} \theta \in \Theta_0 = \{\theta : \|\theta\| < r_0\} & \text{if } k < t_0, \\ \theta \in \Theta_1 = \{\theta : \|\theta\| > r_1\} & \text{if } k \ge t_0, \end{cases} $$
($r_0 < r_1$ are given constants).
3) Finally, when an alarm is sent by the GLR algorithm, all the RLS algorithms are re-initialized. The results supplied by this algorithm are the following:
- the alarm dates supplied by the multi-dimensional GLR algorithm;
- cleaned signals estimated by the RLS algorithm;
- slopes and intercepts estimated by the RLS algorithm.

Fig. 4. Change detection results for two engines, (a) engine 2, (b) engine 41. Panels: egt, corespeed, fuelflow, ps3 and t3 as functions of time. Alarms are depicted by vertical lines, input signals are shown in light gray and signal estimates $F^r$ using RLS are depicted by a black line. One figure (with respect to egt) is bigger than the others to present more clearly the RLS estimate of the signal.
Figure 4 presents the results obtained for two engines. One abrupt change was found for the first engine and three for the second one; all of them seem reasonable, and a comparison between the estimated alarm dates and recorded real events of the engine life has confirmed this. The estimated signals are also shown in these two figures. For more information on this aspect of the analysis process see [2]. From now on, the observations corresponding to each flight are $F_{il} = [F^1_{il}, \dots, F^p_{il}]$, where $F^r_{il} = b^r_{il}\, l + a^r_{il}$ (the RLS slope $b^r_{il}$ times the date l plus the RLS intercept $a^r_{il}$) are the results of the transformations of the raw data performed by the first two modules (ECN and CD).
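The recursive least squares part of the CD module can be sketched in Python as follows. This is a minimal illustration of a standard RLS recursion with a forgetting factor for one signal, assuming a large initial covariance; the class name `ForgettingRLS`, the default values and the reset interface are hypothetical, and the GLR test itself is not reproduced here.

```python
import numpy as np

class ForgettingRLS:
    """Recursive least squares fit of s ~ intercept + slope * t with forgetting."""

    def __init__(self, forgetting=0.98, init_cov=1e6):
        self.forgetting = forgetting
        self.init_cov = init_cov
        self.reset()

    def reset(self):
        """Re-initialize the estimates, e.g. after an alarm of the change detector."""
        self.theta = np.zeros(2)                  # [intercept, slope]
        self.P = self.init_cov * np.eye(2)        # large value = weak prior

    def update(self, t, s):
        """Process observation s at date t and return the detrended residual."""
        phi = np.array([1.0, float(t)])
        P_phi = self.P @ phi
        gain = P_phi / (self.forgetting + phi @ P_phi)
        self.theta = self.theta + gain * (s - phi @ self.theta)
        self.P = (self.P - np.outer(gain, P_phi)) / self.forgetting
        return s - phi @ self.theta               # signal minus fitted trend

rls = ForgettingRLS()
for t in range(1, 51):
    residual = rls.update(t, 2.0 + 0.5 * t)       # noise-free linear trend
print(rls.theta, residual)
print(rls.update(51, 2.0 + 0.5 * 51 + 5.0))       # abrupt jump gives a large residual
```

In a complete pipeline one such filter would run per engine variable, the residuals would be stacked into a vector and passed to the change detector, and `reset` would be called on all filters whenever an alarm is raised.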
Fig. 5. (a) Trajectories of engine 22 on the map. The sizes of the dots are proportional to the measurement date: smallest dots correspond to recent measurements, larger dots to older measurements. (b) Pieces of similar trajectories found using the edit distance (details are given in Section 4.4)
4.3
The cleaned signals $F_{il}$ provided by the previous two modules are then used as input to a SOM for visualization purposes. To project the observations we use a 20 × 20 SOM implemented with the Matlab toolbox [9] and with default settings for the learning rate. Since the variables $F^r_{il}$ are correlated, a Mahalanobis distance is used to whiten the data. A classical learning scheme is used to train the map. Figure 5 (a) presents one example of engine trajectories on the map, which clearly have different shapes. For the studied engine, the available maintenance reports inform us that this engine suffers from a deterioration of its high pressure core. This fault is visible on the map at the end of the trajectory: the engine, which was projected on the middle north of the map during a large part of its trajectory, suddenly moves towards the north-west corner of the map. This area of the map furthermore corresponds to abnormal values of the engine variables.
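The effect of using a Mahalanobis distance can be illustrated with a short Python sketch: after whitening, the ordinary Euclidean distance between two observations equals their Mahalanobis distance in the original space. The Cholesky-based implementation and the function name `whiten` are assumptions for illustration; the text does not specify how the whitening was computed.

```python
import numpy as np

def whiten(F):
    """Whiten the rows of F (n observations, p correlated variables).

    Uses the Cholesky factor L of the sample covariance (cov = L L^T) and
    returns Z with rows L^{-1} (f - mean), so that cov(Z) is the identity.
    """
    F = np.asarray(F, dtype=float)
    centered = F - F.mean(axis=0)
    L = np.linalg.cholesky(np.cov(F, rowvar=False))
    return np.linalg.solve(L, centered.T).T

# Toy correlated data standing in for the cleaned engine variables
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.3, 0.2, 0.5]])
F = rng.normal(size=(200, 3)) @ mixing.T
Z = whiten(F)
print(np.round(np.cov(Z, rowvar=False), 6))
```

A standard SOM trained on the whitened rows with the Euclidean distance then behaves as if it were trained on the original variables with the Mahalanobis distance.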
4.4
One of the final goals of the proposed tool concerns clustering and prediction of engine trajectories or pieces of engine trajectories. To this end, we have to define a proper distance between pieces of trajectories, which can be of different lengths. Before projection on the map, pieces of trajectories were sequences of $\mathbb{R}^p$-vectors, but as soon as measurements are projected, they can be described by sequences of integers corresponding to the units where measurements are projected. Such sequences will be denoted by $T = [k_1, \dots, k_L]$ and, as they take their values in a finite set $\{1, \dots, U\}$ (where U is the number of units), we will call them strings. The difficulty comes from the fact that the strings can have different lengths. Such a problem has already been investigated in other fields and one classical solution is to use the Edit Distance, which is commonly used for approximate string matching [6].
To compare two strings, the Edit Distance uses a cost function. This function gives an individual cost for each unitary operation: suppression, addition or substitution. The cost of a sequence of operations $O = [o_1, o_2, \dots]$ is simply equal to the sum of all the unitary costs, so: $\mathrm{cost}(O) = \sum_t \mathrm{cost}(o_t)$. Then, the Edit Distance between two strings, $d_e(T, T')$, is defined as the minimal cost among all sequences of operations that fulfill the constraint $O(T) = T'$. Such a distance can be tuned to our end by carefully choosing the unitary costs. We may in particular use the map topology to define meaningful substitution costs, by setting the cost of the substitution $k \to k'$ to the distance between unit k and unit k' on the map. With such a choice, we benefit from the fact that close units on the map can be exchanged at a small cost. Suppression and insertion costs are equal to the average of all the pairwise distances between units of the map.
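The Edit Distance with map-based costs can be sketched with the classical dynamic programming recursion. The following Python example is an illustration under stated assumptions: units are numbered from 0 in row-major order on a rectangular grid, the distance between units is the Euclidean distance between their grid positions, and the average pairwise distance is taken over distinct pairs of units. The helper names `make_costs` and `edit_distance` are hypothetical.

```python
import math
from itertools import combinations

def make_costs(n_rows, n_cols):
    """Return (substitution cost function, insertion/suppression cost) for a grid."""
    coords = [(k // n_cols, k % n_cols) for k in range(n_rows * n_cols)]

    def substitution(k1, k2):
        return math.dist(coords[k1], coords[k2])    # distance between units on the map

    pair_dists = [math.dist(a, b) for a, b in combinations(coords, 2)]
    indel = sum(pair_dists) / len(pair_dists)       # average pairwise distance
    return substitution, indel

def edit_distance(s, t, substitution, indel):
    """Minimal total cost of turning string s into string t."""
    D = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        D[i][0] = i * indel
    for j in range(1, len(t) + 1):
        D[0][j] = j * indel
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = min(
                D[i - 1][j] + indel,                               # suppression
                D[i][j - 1] + indel,                               # insertion
                D[i - 1][j - 1] + substitution(s[i - 1], t[j - 1]),  # substitution
            )
    return D[len(s)][len(t)]

substitution, indel = make_costs(20, 20)
print(edit_distance([0, 1, 21], [0, 2, 22, 22], substitution, indel))
```

Because substitutions between neighboring units are cheap, two trajectories that visit nearby regions of the map in the same order obtain a small distance even when their unit sequences are not identical.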
With such a distance one can build classes of pieces of trajectories using hierarchical clustering. But this distance can also be used to supply clues on the possible future evolution of one trajectory. To perform such a task, the following method is proposed. Let T be a piece of engine trajectory:
1. compute the Edit Distance between T and all the pieces of engine trajectories recorded in the fleet database $[T_1, T_2, \dots]$ (all these distances can be computed efficiently using dynamic programming [6]);
2. search for matching pieces $T_x$ such that $d_e(T_x, T)$ is below a given threshold;
Note that these pieces are parts of already observed engine trajectories, which were recorded in the fleet database, so that their evolutions after the matching points are known. The third step uses this property.
3. look at the pieces that come just after the matching pieces. That gives an idea about the possible futures of T and enables the computation of probabilities of different types of maintenance events if the fleet database is connected to the maintenance database which recorded all the failures and maintenance operations performed on the fleet. We hope that it will be a useful tool to anticipate failures.
Figure 5 (b) presents preliminary results obtained using such an approach; T was built using the last 50 points of an engine trajectory. During this time period, this engine stays in the same unit. We show in Figure 5 (b) the pieces of trajectories that occurred after the 3 best matching points found in the fleet database using the proposed Edit Distance. These possible futures for T seem to be reasonable. Future work concerns the connection with the maintenance database to perform a quantitative analysis of the results.
Conclusion
The method proposed in this paper is a useful tool to summarize and represent the temporal evolution of the health of an aircraft engine, flight after flight. The regression approach used to deal with the problem of environmental condition normalization (ECN) seems to be effective, even if other model selection methods such as the BIC criterion could be investigated in further work to reduce the number of selected variables. The joint use of an adaptive algorithm to estimate the signal evolution (RLS) and of a change point detection method (GLR) is also an interesting solution to deal with the non-stationarity of the signals and to clean them (CD module). Finally, Self-Organizing Maps (SOM) can be used to show the evolution of the engine health in a synthetic manner and to provide codes for a synthetic representation of trajectories, which enables the development of predictive analysis tools (SEARCH module).
References
1. Basseville, M., Nikiforov, I.: Detection of Abrupt Changes: Theory and Application. Prentice-Hall, Englewood Cliffs (1993)
2. Côme, E., Cottrell, M., Verleysen, M., Lacaille, J.: Aircraft engine health monitoring using self-organizing maps. In: Springer (ed.) Proceedings of the Industrial Conference on Data-Mining (2010)
3. Cottrell, M., Gaubert, P., Eloy, C., Francois, D., Hallaux, G., Lacaille, J., Verleysen, M.: Fault prediction in aircraft engines using self-organizing maps. In: Príncipe, J.C., Miikkulainen, R. (eds.) WSOM 2009. LNCS, vol. 5629, pp. 37-44. Springer, Heidelberg (2009)
4. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.J.: Least angle regression. Annals of Statistics 32(2), 407-499 (2004)
5. Gustafsson, F.: Adaptive filtering and change detection. John Wiley & Sons, Chichester (2000)
6. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33, 2001 (1999)
7. Ross, G., Tasoulis, D., Adams, N.: Online annotation and prediction for regime switching data streams. In: Proceedings of ACM Symposium on Applied Computing, pp. 1501-1505 (March 2009)
8. Svensson, M., Byttner, S., Rögnvaldsson, T.: Self-organizing maps for automatic fault detection in a vehicle cooling system. In: 4th International IEEE Conference on Intelligent Systems, vol. 3, pp. 8-12 (2008)
9. Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J.: SOM Toolbox for Matlab 5. Tech. Rep. A57, Helsinki University of Technology (April 2000)
Abstract. A new classification method is proposed with which a multidimensional data set can be
visualized. The phase distance on the spherical surface between the labeled data is computed and a
dendrogram is constructed using this distance, so that the data can be easily classified. To this end,
color-coded clusters are drawn on the spherical surface based on the distance between
each node and the labels on the sphere. Thus, each cluster has its own
color. This method can be applied to a variety of data. As a first example, we
considered the iris benchmark data set. The boundary between the clusters was
clearly visible with this coloring method. As a second example, the velocity (first derivative) mode
of a Plethysmogram pulse-wave data set was analyzed using the distance measure on the spherical surface.
Keywords: Spherical Surface SOM, Colored Clustering, Distance Measurement, Boundary Decision.
1 Introduction
There is a clear benefit in utilizing spherical Self-Organizing Maps for classifying
multidimensional data [1], [2] and [3]. The discontinuities at the 4 borders and 4
corners of a 2D planar map affect the results obtained after the learning process [4]. In
spherical Self-Organizing Maps, these discontinuities do not occur. The phase
relationship between the data points is clearest and most precise on a
spherical surface. In spherical Self-Organizing Maps this relationship can be exploited
for constructing the clusters and then the dendrogram. To delineate the clusters,
the boundary needs to be improved by a manual operation. It is important to
draw the boundary on the map precisely and correctly. This is necessary in order to
achieve a maximally correct classification at a later stage, when the clusters are assigned
to classes using class labels. The method of learning vector quantization
(LVQ) [4] has been proposed as a way of determining the cluster boundary automatically.
With this method, it is necessary to choose a learning parameter.
A new classification method was proposed with which multidimensional data
can be visualized [1] and [2]. There, a phase distance on the spherical surface between the
labeled data was computed and the dendrogram was created from the distances
among the labels. Here, groups of nodes forming a data
cluster on the spherical surface are colored using labeled data points, by considering
the distance between each unlabeled node and the labeled data points. The method
can be applied to a variety of data. In this paper, the first example considered is the
iris benchmark data set [5] and [6]. In a second example, the distances among the
labels were used to classify Plethysmogram pulse-wave data.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 308-317, 2011.
© Springer-Verlag Berlin Heidelberg 2011
2 Algorithm
Previously, in [1] and [2], we used the iris benchmark as an example for classification.
It consists of 50 samples of versicolor (abbreviated ver), 50 of virginica
(abbreviated gnc) and 50 of setosa (abbreviated set), together forming the iris data [5]
and [6]. There, the spherical surface was transformed up to the Griff value 1 (see
[1] and [2]) to emphasize the U-matrix [7], which shows the boundaries between the
clusters. With the Griff value 0 an undeformed spherical surface is obtained, whereas with the Griff
value 1 the position of the maximum distance (the darkest part) keeps the radius 1
and that of the minimum distance (the brightest part) has the radius 0, as shown in Fig. 1(a)
(the Griff operation is detailed in [1] and [2]). After transforming the sphere with the Griff value 1,
a dendrogram was drawn using the group average method, as in Fig. 1. In Fig. 1, the
dendrogram near ver_23 and the spherical surface with the Griff value 0 are shown in
(a) and (b), respectively. The dendrogram is shown mainly near ver_23, printed in red,
but the boundary decided from the dendrogram is the dotted line, whereas the
actual one should be the solid line.
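The effect of the Griff value on the node radii can be illustrated with a small sketch. The exact Griff operation is defined in [1] and [2] and is not reproduced here; the linear interpolation below, the function name `griff_radii` and the sample distances are assumptions chosen only to match the two end cases described above (radius 1 everywhere for the value 0; radius 1 at the maximum distance and 0 at the minimum distance for the value 1).

```python
def griff_radii(u_values, griff):
    """Return one radius per node for a Griff value between 0 and 1.

    ASSUMPTION: radii are interpolated linearly between the unit sphere
    (griff = 0) and the min-max normalized U-matrix distances (griff = 1).
    This is an illustration, not the operation defined in refs. [1], [2].
    """
    if not 0.0 <= griff <= 1.0:
        raise ValueError("griff must lie in [0, 1]")
    u_min, u_max = min(u_values), max(u_values)
    span = u_max - u_min
    radii = []
    for u in u_values:
        # Normalized distance: 0 for the brightest node, 1 for the darkest.
        norm = (u - u_min) / span if span > 0 else 1.0
        radii.append(1.0 - griff * (1.0 - norm))
    return radii


if __name__ == "__main__":
    u = [0.2, 0.5, 1.1, 0.8]        # hypothetical U-matrix distances
    print(griff_radii(u, 0.0))      # undeformed sphere
    print(griff_radii(u, 1.0))      # fully transformed surface
```

Intermediate values, such as the value 0.5 used in the appendix, would give a partially deformed surface under this assumption.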
Fig. 1. (a) After the spherical surface was transformed with the Griff value 1, as explained in Fig. 3
of ref. [2], the dendrogram on the left was constructed by the group average method. (b) When
the spherical surface was returned to the full sphere using the Griff value 0, the solid line is the
boundary that should be drawn from the labels, but the boundary obtained is the
dotted line, corresponding to the dendrogram on the right (from Fig. 6 of ref. [2]).
H. Tokutaka et al.
The following strategy was proposed in order to decide on a correct boundary automatically.
Using labeled data, it was examined which of the groups ver, gnc or
set was nearest to each one of the 642 nodes that compose the spherical surface in Fig. 1.
The same procedure was used as for constructing the dendrogram of Fig. 1. In other words,
the spherical surface was transformed with the Griff value 1. In the procedure, the
distances between each unlabeled node and all the labeled ones were calculated and
the node was then assigned to the group of the nearest label. Thus, all 642
nodes were divided into the three groups ver, gnc and set. To verify the procedure on a smaller map,
3 data points were taken from each of the ver, gnc and set data, 9 in total. Then 51 data points
were used: the 42 nodes of the 42 face pieces on the sphere and the 9 iris data points. The polyhedron
was transformed using the same procedure as in Fig. 1, i.e. with the Griff value 1. A
dendrogram was constructed by the group average method and is shown in Fig. 2. The
nodes that are in the neighborhood of ver, gnc or set labels belong to the respective groups.
There are some nodes that are far from the labels, for example nodes
31, 36 and 38. When the dendrogram is traced, these nodes 31, 36 and 38 are grouped with
two gnc and three ver labels. In the same way, nodes 35, 40, 20, 22, 23, 24, 41 and 25 belong to set
when the same dendrogram is traced. Also, all three data points of gnc and all three of
ver are connected through the root of the dendrogram. Here, we examined how the nine
labels can be assigned to each of the 42 nodes by searching for the label data nearest to
each node. The method is named the All label algorithm. As shown in Fig. 2, with this
method the former nodes (31, 36 and 38) belong to set, ver and set, respectively, and the latter nodes
(35, 40, 20, 22, 23, 24, 41 and 25) belong to set, ver, set, gnc, set, set, ver and gnc, respectively.
Thus, nodes 31 and 38 belong to set, which differs from the result of the dendrogram. Also, nodes
40, 22, 41 and 25 belong to ver, gnc, ver and gnc, respectively, which differs from the
set group expected from the dendrogram. The results are indicated at
the bottom of Fig. 2 as ver, gnc or set.
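The nearest-label assignment described above can be sketched as follows. This is only an illustration: the function name `all_label_assign`, the coordinates and the use of the Euclidean distance are assumptions, whereas in the paper the distances are measured on the polyhedron transformed with the Griff value 1.

```python
import math


def all_label_assign(nodes, labeled_points):
    """Assign every node to the group of its nearest labeled point.

    nodes: dict mapping node id -> coordinate tuple
    labeled_points: list of (label_name, group, coordinate tuple)
    Returns a dict mapping node id -> (group, nearest label_name).
    """
    assignment = {}
    for node_id, pos in nodes.items():
        # ASSUMPTION: plain Euclidean distance stands in for the distance
        # measured on the transformed polyhedron.
        name, group, _ = min(labeled_points,
                             key=lambda lp: math.dist(pos, lp[2]))
        assignment[node_id] = (group, name)
    return assignment


if __name__ == "__main__":
    # Hypothetical node positions and labeled data, for illustration only.
    nodes = {0: (1.0, 0.0, 0.0), 1: (0.0, 1.0, 0.0),
             2: (0.0, 0.0, 1.0), 3: (0.6, 0.8, 0.0)}
    labels = [("set_1", "set", (0.9, 0.1, 0.0)),
              ("ver_1", "ver", (0.1, 0.9, 0.1)),
              ("gnc_1", "gnc", (0.0, 0.2, 0.9))]
    for node_id, (group, name) in all_label_assign(nodes, labels).items():
        print(node_id, group, name)
```

Coloring each node by its assigned group then gives the kind of colored map described for the 42-node and 642-node cases.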
Fig. 2. The obtained dendrogram with 42 nodes and 9 labels. The nodes far from
the labeled data are assigned by the All label algorithm and the resulting group is shown at the bottom.
The other unlabeled nodes at the bottom are likewise classified with the nearest label. For example,
node 0 is classified as ver, node 19 as set, and so on.
Fig. 3. A total of 51 points (the 9 iris data points and the 42 nodes) were pasted onto a 42-face map.
As shown at the bottom of the dendrogram in Fig. 2, the nodes are identified individually
by the All label algorithm. The label group of each identified
node could be pasted onto the 42-face map. The map could then be colored as in Fig. 3, and
a smooth boundary was obtained as shown. Because of this, the number of
nodes was increased to 642, and the algorithm was verified on the 150 data points of the iris
data.
The node groups (Nos. 31, 36, 38, and 35, 40, 20, 22, 23, 24, 41, 25 in the dendrogram),
which are on the far left, away from the labels, in the dendrogram of Fig.
2, are judged to be close to a variety of labels. The reason is as follows. The
group average method first forms groups among the nearest items, and the result
depends on which group grows larger than another. All these nodes are near the
boundaries. When they are examined individually, their assignment differs from that of the dendrogram, as
they are near a variety of labels, as shown at the bottom of Fig. 2.
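Why the group average method can attach boundary nodes to a different group than the nearest label does may be easier to see from a minimal sketch of the agglomeration. The code below is a generic group average (average linkage) procedure written for illustration; the function names and the one-dimensional sample points are assumptions, and it is not the implementation of the tool used in the paper.

```python
import math
from itertools import combinations


def group_average_distance(cluster_a, cluster_b, points):
    """Mean of all pairwise distances between the members of two clusters."""
    total = sum(math.dist(points[i], points[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def group_average_merges(points):
    """Agglomerate the points and return the merge history.

    Each history entry is (members_a, members_b, group average distance).
    """
    clusters = [(i,) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: group_average_distance(pair[0], pair[1], points))
        history.append((a, b, group_average_distance(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return history


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (5.0,), (6.0,)]   # hypothetical 1-D data
    for a, b, d in group_average_merges(pts):
        print(a, b, round(d, 3))
```

Because each merge uses the average over all members, a point near a boundary joins whichever group has the smaller average distance at that stage, which depends on how large the groups have already grown.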
versicolor and virginica. The result after 500 learning epochs for 642 nodes is shown
in Fig. 4(a). In this example, the boundary between set (setosa) and ver (versicolor) is
very smooth, as shown in Fig. 4(a). However, in this benchmark problem, it is difficult
to distinguish between ver_19 (versicolor) and gnc_20 (virginica). It is also difficult to
distinguish ver_23 and ver_24 from gnc. As shown in Fig. 4(b), ver_19 and gnc_20 are
clearly separated by the boundary. The boundary between the two ver labels
ver_23 and ver_24 and the gnc group is also clearly indicated by color in Fig. 4(b). Drawing
the boundary is a mathematically difficult problem. Judging from the colored
spherical surface, it would be possible to draw a smoother
boundary, like the one in (a) between the set and the ver groups, if ver_23 and ver_24 were read
as belonging to the gnc group rather than to the ver group. In Fig. 5, an unknown
sample UK_3 was projected onto the spherical surface. From Fig. 5(a), which shows
a usual SOM analysis, it can be seen that UK_3 lies on the U-matrix boundary [7].
In this situation, it cannot be decided whether UK_3 belongs to set or ver.
However, Fig. 5(b), which is colored by our technique, shows that although UK_3 is near
set_44, it lies in the ver group according to the boundary between the colors.
Fig. 4. (a) When the iris data are used to train a spherical 642-node map for 500 epochs,
a smooth boundary can be drawn between set (setosa) and ver (versicolor). (b) For the ver and
gnc groups, it was possible to draw the boundary between ver_19 and gnc_20,
which are very close to each other. ver_23 and ver_24 project into the gnc group like a
peninsula. There, a clear boundary was found, as shown in (b).
Incidentally, in Fig. 6(a), SOM learning was performed for 50 epochs on a 642-node
spherical surface map. In the iris benchmark problem, the boundary between ver and
gnc is always arguable. Therefore, set was excluded and only ver and gnc were
used for the experiment. The samples of ver and gnc were projected onto the spherical
surface. Then, the boundary was drawn on the spherical surface [8]. At the same time,
the boundary was also colored, as shown in Fig. 6. The line boundary agrees almost completely with
the coloring boundary. It should be noted that the coloring boundary is finer
than the line boundary.
Fig. 5. An unlabeled data point UK_3 (enclosed in the red square in (a) and in the white
square in (b)) was projected onto the spherical surface. (a) In ordinary SOM learning, UK_3 is on the
line of the U-matrix [7]. There, it cannot be determined whether UK_3 belongs to ver
or set. (b) With the coloring, it can be seen that UK_3 clearly belongs to ver.
Fig. 6. The boundary between the ver and gnc groups is shown by the line from b0T to b4. (a) The
part of the line boundary from b0T to b2. (b) The line boundary from b2 to b4 (ref. [8]).
Fig. 7. (a) The bundle of corrugated velocity pulse waves over 1 period. (b) Each
corrugated wave of (a) was cut at the position where its height first became 0 and was divided into
100 steps. Then, the rate within 1 period at which the wave height became 0 was added as a further data value.
Fig. 8. (a) The distances from label G1111_1, the first data item in the dendrogram
(No.1 enclosed with the square box), to all labels were measured using the blossom tool [10].
(b) The distances arranged in ascending order. There is a large discontinuity
at label G2_519. (c) This time, all the distances starting from label G2_519 were
measured and (d) they are shown in ascending order. As shown in (d), the distances are
distributed at equal intervals. The labels were then assigned to the intervals. For (d), since it is
obvious that there are many more people in the healthy region, the labels were relabeled as
indicated by the (arrowed) right-hand side labels.
The 683 prepared data items of 101 dimensions were used for training a spherical
surface SOM. The data waves were classified into six groups from A to F, as shown in
Fig. 9. The procedure is shown in Fig. 8. The classification into the groups A to F is
carried out in (a) and (c) of Fig. 8. There, the distance measurement was performed
by the procedure shown in [1] and [2]. The procedure supported by
the tool is described in Fig. 8(a). The first label of the dendrogram, G1111_1, is
searched for in No.1 (enclosed with the square box). Then, the label is put in
Label1 in No.2 (enclosed with the square box). Then, "Save selected" in Fig. 8(a)
was chosen in No.3 (enclosed with the square box). All the distances between G1111_1
and the 683 label data in total, including itself, were measured and saved as a csv file. When
the obtained distances were arranged in ascending order, a large discontinuous jump
appeared at the G2_519 position (Fig. 8(b)). Therefore, the distances from the G2_519
point to all other labels were recalculated by a procedure similar to that of Fig. 8(a),
as shown in (c). The obtained distances were again arranged in ascending order,
giving Fig. 8(d). A large discontinuous jump like that in Fig. 8(b) does not
occur in Fig. 8(d). Instead, approximately equal discontinuous steps are obtained.
The data are divided into 6 groups named A-F starting from the bottom.
However, the group with the largest number of people was provisionally considered
to contain the waves of healthy people, so D was read as A. The labels were then read in
descending order starting from that point. The obtained corrugated wave groups are shown
in Fig. 9. Since the A group contains many corrugated waves, its display was divided into
A1 and A2. In the A group, the waves fall steeply, whereas in the F group
the corrugated waves fall more gently.
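The search for discontinuities in the sorted distances can be sketched in a few lines. The paper reads the jumps from the plots of Fig. 8; cutting the sorted list at its largest jumps, as done below, is an assumed way of automating that reading, and the function names and sample distances are hypothetical.

```python
def largest_gap(distances):
    """Return (index, gap) of the biggest jump in the ascending-sorted distances.

    index is the position, in the sorted list, of the first value after the jump.
    """
    ordered = sorted(distances)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    gap = max(gaps)
    return gaps.index(gap) + 1, gap


def split_at_gaps(distances, n_groups):
    """Cut the sorted distances at the (n_groups - 1) largest jumps."""
    ordered = sorted(distances)
    by_size = sorted(range(1, len(ordered)),
                     key=lambda i: ordered[i] - ordered[i - 1],
                     reverse=True)
    cuts = sorted(by_size[:n_groups - 1])
    groups, start = [], 0
    for cut in cuts + [len(ordered)]:
        groups.append(ordered[start:cut])
        start = cut
    return groups


if __name__ == "__main__":
    d = [0.0, 0.1, 0.2, 0.9, 1.0, 2.0, 2.1]   # hypothetical label distances
    print(largest_gap(d))
    print(split_at_gaps(d, 3))
```

With the real data, the distances saved in the csv file for one reference label would be passed in place of the sample list.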
5 Conclusions
The nodes comprising a spherical surface were colored by measuring the distance between
each node and all labels on the transformed polyhedron. The
result was used as a way to decide on the boundary in
cluster classification. The method is called the All label algorithm. In order to verify
this algorithm, the number of nodes composing the spherical surface was
reduced to 42. In this way, the boundary based on the coloring was confirmed. Next, the
number of nodes was increased to 642. Our technique for deciding on the boundary between
the cluster groups using the coloring of each node was applied to
the iris benchmark problem [5] and [6]. For this problem, the
number of epochs for training the spherical maps was increased from the usual 50 to
500. In the past, the boundary between ver_19 and gnc_20 was not clear. With
the present method, however, the boundary became clear. This points to a problem
with the lower precision used in the past for this benchmark problem. For example,
ver_23 and ver_24 overhang like a peninsula in Fig. 4(b), compared with the smooth
boundary between set and ver in Fig. 4(a). If these ver labels can be read as gnc, it should be
possible to draw a smooth boundary like that in Fig. 4(a). Also, there is an unknown sample
UK_3 which is near set_44 of setosa (set) in Fig. 5(a). Using the node coloring method,
it was found that UK_3 belongs to the ver group.
With the usual planar SOM, the segregation of the clusters is in principle possible.
Here, however, in order to consider the distances between the labels, the sphere was
transformed into a polyhedron. For this case, the distance calculation among the labels was
performed and the result was displayed (see Fig. 8). We observed distance discontinuities
among the groups, as in Fig. 8(d). Large bundles of corrugated waves, which do not
have any labels, were classified successfully. With the node coloring method, the cluster
decision boundary became visible and was correctly estimated, as discussed above.
Detailed information on the cluster classification was obtained by the distance measurement
applied to the transformed polyhedron. Thus, a more quantitative evaluation
became possible, as was demonstrated on the cluster groups. As another demonstration,
the chain-link benchmark problem [11] with three-dimensional data is examined
in the appendix. This is a very suitable problem for the three-dimensional visualization of the
blossom tool [10]. Finally, we thank Prof. M. V. Hulle of K. U. Leuven
for kindly reading and correcting the manuscript, and Prof. T. Kohonen of the Academy of
Finland for kindly reading the manuscript and giving useful comments.
References
1. Tokutaka, H., Fujimura, K., Ohkita, M.: Cluster Analysis using Spherical SOM (in Japanese).
Journal of Biomedical Fuzzy Systems Association 8(1), 29-39 (2006)
2. Tokutaka, H., Fujimura, K., Ohkita, M.: Cluster Analysis using Spherical SOM. In:
WSOM 2007, Bielefeld, Germany, September 3-6 (2007)
3. Nakatsuka, D., Oyabu, M.: Application of Spherical SOM in Clustering. In: Proceedings of
Workshop on Self-Organizing Maps (WSOM 2003), pp. 203-207 (2003)
4. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Heidelberg (2005)
5. Fisher, R.A.: The Use of Multiple Measurements in Taxonomic Problems. Annals of
Eugenics 7, 179-188 (1936)
6. http://www.ics.uci.edu/~mlearn/databases/
7. Ultsch, A., Guimaraes, G., Korus, D., Li, H.: Knowledge Extraction from Artificial Neural
Networks and Applications. In: Proc. TAT/WTC 1993, pp. 194-203. Springer, Heidelberg
(1993)
8. Matsuda, M., Tokutaka, H.: Decision of class borders on a SOM (in Japanese). Journal of
Biomedical Fuzzy Systems Association 10(2), 27-38 (2008)
9. Tokutaka, H., Maniwa, Y., Gonda, E., Yamamoto, M., Kakihara, T., Kurata, M., Fujimura,
K., Shigang, L., Ohkita, M.: Construction of a general physical condition judgment system
using acceleration plethysmogram pulse-wave analysis. In: Príncipe, J.C., Miikkulainen, R.
(eds.) WSOM 2009. LNCS, vol. 5629, pp. 307-315. Springer, Heidelberg (2009)
10. http://www.somj.com
11. Herrmann, L., Ultsch, A.: Clustering with Swarm Algorithms Compared to Emergent
SOM. In: Príncipe, J.C., Miikkulainen, R. (eds.) WSOM 2009. LNCS, vol. 5629, pp. 80-88.
Springer, Heidelberg (2009)
Appendix
Fig. 10. (a) The chain-link problem [11], where two links 1 and 2 cross perpendicularly in
3-dimensional space. (b) The 3D input data of (a) are learned by the spherical SOM of blossom
[10]. The boundary between 1 and 2 is indicated by the darkened U-matrix, which is endlessly
continuous and never crossing.
Fig. 11. (a) The Griff value is increased from 0 (Fig. 10(b)) to 0.5, where the U-matrix is
emphasized and is endlessly continuous and never crossing. (b) Fig. 10(b) colored by
the All label algorithm. It is then easy to find the boundary of this two-cluster problem.
Fig. 10(b) and Fig. 11(a), (b), given by the cluster SSOM, are all displayed in the same position.
Abstract. Air pollution in big cities is a major health problem. Pollutants in the air
may have severe consequences for humans, creating
conditions for several illnesses and affecting tissues and organs, and they also
affect other animals and crop productivity. For several years now,
air quality has been monitored by stations distributed over major cities,
where the concentration of several pollutants is measured. From these data
sets, and applying the data visualization capabilities of the self-organizing
map, we analyzed the air quality in Mexico City. We were able to detect
some hidden patterns regarding the pollutant concentration, as well as
to study the evolution of air quality from 2003 to 2010.
1 Introduction
Pollutants are any substances that affect the normal cycle of any vital process
or degrade infrastructure [1]. The sources of pollutants are several and well identified.
Pollutants may originate from human actions, but also from natural
events. Among the former are the incomplete combustion of organic
fuels in cars, the end products of industrial reactors, and dust and
minerals from construction sites [2].
Air pollution affects several regions all over the planet, and mainly impacts
major cities, in which it has been reported to be a major health problem for the
last 30 years. Several respiratory illnesses have been reported to be a consequence of
high levels of pollutants [3]. In Mexico City, between the years 2000 and 2008, more
than 100,000 deaths were caused directly or indirectly by bad air conditions,
and almost one million visits to the hospital were attributed to pollutants [4,5].
It is expected that, if the tendency continues, more than four million related
illnesses will be reported by 2020 [6]. Also, air pollutants directly impact other
animals, green areas and harvest productivity [1].
Among the most dangerous pollutants is carbon monoxide (CO), which affects
blood oxygenation as it reacts with hemoglobin and may lead to severe health
problems and in many cases to death. Nitrogen oxide (NO) and dioxide (NO2)
are also dangerous air pollutants, as they decrease lung function and increase the
risk of acute bronchitis. Ozone (O3) is also a pollutant relevant for health issues,
as it takes part in reactions that lead to the so-called smog. Sulfur dioxide (SO2)
is also constantly monitored, as it is a precursor of acid rain, which affects crops
and some buildings. Lead (Pb) may impact nervous connections and cause
blood and brain disorders [1]. Finally, particulate matter (PM), also called fine
particles, smaller than 10 micrometers is also considered a pollutant, as it
affects the lungs. In general, PM is identified as PM2.5 for diameters smaller than
2.5 micrometers and as PM10 for diameters smaller than 10 micrometers [7].
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 318-327, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Fig. 1. Time series for O3. The concentration for one station in Mexico City is shown
for the year 2009, as well as the concentration for one day. The Fourier spectrum is
also presented. Two frequencies are visible: a 24-hour period and a weekly period.
Air quality is affected by several variables, most obviously the presence
of pollutants, but also the wind and atmospheric conditions as well as traffic
conditions. In particular, the case of the urban area of Mexico City and its
surroundings is a complicated one, as the specific conditions of altitude, wind
patterns, high population density, and traffic issues seem to be particularly adverse for air quality [7].
Pollutant concentration levels may vary on a daily basis, but may also be subject
to other influences. Fig. 1 shows the O3 concentration for one day, one month
and one year, as well as its Fourier spectrum. Finding patterns in just one
time series may be achievable, but making sense of several time series,
as well as identifying patterns and correlations among them, may require
special tools.
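As an illustration of how a Fourier spectrum exposes the daily and weekly periods mentioned for Fig. 1, the following sketch computes the amplitude spectrum of an hourly series with a direct discrete Fourier transform. The series is synthetic (a 24-hour and a 168-hour sine wave with arbitrary amplitudes), and the function names are hypothetical; no measured data from the paper are reproduced.

```python
import cmath
import math


def dft_amplitudes(series):
    """Amplitude spectrum of a real series via a direct O(n^2) DFT."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]      # remove the constant offset
    amps = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        amps.append(abs(coeff) / n)
    return amps


def dominant_periods(series, count=2):
    """Periods (in samples) of the `count` strongest spectral components."""
    amps = dft_amplitudes(series)
    n = len(series)
    ranked = sorted(range(1, len(amps)), key=lambda k: amps[k], reverse=True)
    return [n / k for k in ranked[:count]]


if __name__ == "__main__":
    # Synthetic hourly series covering two weeks: daily plus weekly cycle.
    hours = range(24 * 7 * 2)
    series = [50 + 20 * math.sin(2 * math.pi * h / 24)
              + 8 * math.sin(2 * math.pi * h / 168) for h in hours]
    print(dominant_periods(series))   # periods in hours
```

The direct transform is used only to keep the sketch free of dependencies; for a full year of hourly data a fast Fourier transform would be the usual choice.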
As part of a research group focused on studying the air quality in Mexico City,
we have analyzed multidimensional data consisting of air pollutants
measured every hour in several monitoring stations distributed over the most
polluted areas in Mexico City. From these measurements, we have been working on
finding patterns in the data, and in this contribution we describe some of them.
In section 2 we briefly describe the self-organizing map and its capabilities for
data analysis. Then, in section 3 we present the application of the SOM to the
visualization of air pollutants, and in section 4 we present some conclusions.
2 Data Visualization
We define the air quality vector for a given time as the average concentration
of NO2, NO, O3, SO2, CO, Pb, PM2.5, and PM10 and the number of times
each of these pollutants exceeded the norm. Thus, the space has 16 dimensions,
as there are eight pollutants and two statistics are considered for each one of
them. Italic font refers to the average pollutant concentration (e.g. $NO_2$), whereas
the pollutant with a bar over it refers to the number of times it exceeded the
safe level over the specified period (e.g. $\overline{NO_2}$). Monitoring and visualization of all
air quality data is not a straightforward task. Several pollutants have to be
analyzed simultaneously, and correlations and general patterns are to be found
in the data. The chemistry of air in polluted environments, although well studied,
is still under development [2]. Thus, for environmental and health officials to
make sense of the data, automatic tools for data analysis and multidimensional data
visualization are required.
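A minimal sketch of how such a 16-dimensional air quality vector could be assembled from the readings of one period is given below. The function name, the ordering of the components (eight averages followed by eight exceedance counts) and the threshold values used in the example are assumptions made for illustration; the actual safe levels are not stated in the text.

```python
POLLUTANTS = ["NO2", "NO", "O3", "SO2", "CO", "Pb", "PM2.5", "PM10"]


def air_quality_vector(readings, safe_levels, pollutants=POLLUTANTS):
    """Build [mean_1..mean_k, exceedances_1..exceedances_k] for one period.

    readings: dict mapping pollutant -> list of concentrations in the period
    safe_levels: dict mapping pollutant -> threshold (hypothetical values)
    """
    means, exceedances = [], []
    for p in pollutants:
        values = readings[p]
        means.append(sum(values) / len(values))
        # Count the measurements that lie above the safe level.
        exceedances.append(sum(1 for v in values if v > safe_levels[p]))
    return means + exceedances


if __name__ == "__main__":
    # Made-up readings and thresholds, identical for every pollutant.
    readings = {p: [1.0, 2.0, 3.0] for p in POLLUTANTS}
    safe = {p: 2.5 for p in POLLUTANTS}
    vec = air_quality_vector(readings, safe)
    print(len(vec), vec)
```

The same function could be applied per year, per month or per station-year pair by changing which readings are passed in.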
As a support tool for data visualization, we applied the self-organizing map
(SOM), because of its outstanding capabilities for high-dimensional
data visualization, which are well above those presented by other techniques
[8,9]. The SOM preserves neighborhood relationships during training through
the learning equation (1), which establishes the effect that each best matching
unit (BMU) has over any other neuron.
The SOM structure consists of a two-dimensional lattice of units referred to
as the map space. Each unit $n$ maintains a dynamic weight vector $w_n$, which is
the basic structure through which the algorithm leads to map formation. The dimension
of the input space is taken into account in the SOM by allowing weight vectors to have
as many components as there are features in the input space. The variables defining the input
and weight spaces are continuous. Weight vectors are adapted according to:
$$w_n(t+1) = w_n(t) + \alpha(t)\, h_n(g,t)\, \bigl(x_i - w_n(t)\bigr) \qquad (1)$$
where $\alpha(t)$ is the learning rate at epoch $t$, $h_n(g,t)$ is the neighborhood function
from BMU $g$ to unit $n$ at epoch $t$, and $x_i$ is the input vector. The SOM has been
widely applied as a data visualization tool because of its capabilities in
computing high-order statistics [9], which are translated into a two-dimensional map
that is a good approximation of the distribution observed in the high-dimensional
space.
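One step of the update in Eq. (1) can be sketched as follows. The maps in this paper were produced with an existing package, so the code is not that implementation; the function name, the Gaussian form of the neighborhood function and the parameter values are assumptions chosen for illustration.

```python
import math


def som_step(weights, grid, x, alpha, sigma):
    """Apply one update of Eq. (1) in place and return the BMU index.

    weights: list of weight vectors (lists of floats), one per unit
    grid: list of (row, col) lattice positions, one per unit
    alpha: learning rate; sigma: neighborhood width
    ASSUMPTION: Gaussian neighborhood on the lattice.
    """
    # Best matching unit g: the unit whose weight vector is closest to x.
    g = min(range(len(weights)), key=lambda n: math.dist(weights[n], x))
    for n, w in enumerate(weights):
        lattice_d2 = ((grid[n][0] - grid[g][0]) ** 2
                      + (grid[n][1] - grid[g][1]) ** 2)
        h = math.exp(-lattice_d2 / (2.0 * sigma ** 2))
        for k in range(len(w)):
            # w_n(t+1) = w_n(t) + alpha * h_n(g,t) * (x - w_n(t))
            w[k] += alpha * h * (x[k] - w[k])
    return g


if __name__ == "__main__":
    weights = [[0.0, 0.0], [1.0, 1.0]]   # two units on a 1 x 2 lattice
    grid = [(0, 0), (0, 1)]
    bmu = som_step(weights, grid, [1.0, 0.8], alpha=0.5, sigma=1.0)
    print(bmu, weights)
```

In a full training run, alpha and sigma would typically be decreased over the epochs, which is how the dependence on $t$ in Eq. (1) enters.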
SOMs have been applied in the air quality domain, for example in [12], in
which the authors classify monitoring stations according to pollutant levels. Here,
we are interested in studying the evolution of pollutant concentration in Mexico
City, considering several levels of resolution.
3 Results
Since 1986, an agency established by a governmental law has focused on air quality monitoring.
Under that agency, several monitoring stations measure the concentration
of air pollutants [7]. Monitoring of the full list of contaminants that are now measured
started in 2003. Prior to that year, only a subset of pollutants was measured,
and not all pollutants were monitored in all stations. That is the reason we
started our analysis with data from 2003. Every year is represented by a
16-dimensional vector, as there are 8 major pollutants and, for each one of them,
the average yearly concentration as well as the number of measurements in which the safe
level threshold was exceeded are considered. All variables are normalized and
all are assumed to have the same relevance, so no additional preprocessing was
considered. Fig. 2 shows the U-matrix [11] for the eight years analyzed. It is
observed that year 2003 is in a well-defined cluster, which makes sense, as that
year was particularly polluted [7].
Fig. 2. U-matrix for the air pollution map for years 2003-2010 in Mexico City. Each
year is represented by a 16-dimensional vector, as there are eight pollutants. For every
pollutant, its average over the whole year and all stations was considered, as well as the
number of times the measurement exceeded the safe level.
Fig. 3. Air pollution map for all months from January 2003 to November 2010. Each
month is again represented by a 16-dimensional vector, and no geographic information
is included. Codes are: J-January, F-February, M-March, A-April, Y-May, U-June, L-July,
G-August, S-September, O-October, N-November, D-December, plus the year in
two-digit format.
We start with a coarse-grain analysis, in which each year is represented by a
single vector with 16 components, considering the average over all stations and
months. Then, we increase the detail and each month is now defined by a vector,
so we have 12 × 8 − 1 vectors (December 2010 is not considered). At the same time,
we consider the case in which data is differentiated by monitoring station. Each
one of the 10 stations is represented by the average pollutant concentration over
each year, so this time we have 10 × 8 vectors. At the lowest level, each hour of
every day is represented by an 8-dimensional vector, in which each component is
the concentration of one of the pollutants measured at that hour, averaged
over all available monitoring stations. In all cases, maps were generated with
SOM_PAK, available at http://www.cis.hut.fi/research/som_lvq_pak.shtml.
Fig. 4. Air pollution map for all stations for years 2003 to 2010. Each station-year pair
is a 16-dimensional vector.
Fig. 3 presents the map for all months since January 2003. Some seasonality is
observed, as some months tend to be in the same cluster; for example,
December tends to be clustered at the upper left corner (D-03, D-04,
D-05). Those three years presented bad air quality conditions in general, and the
weather and traffic conditions of that month favor a high concentration
of pollutants. The cluster at the upper left corner contains the months with the
highest concentrations and the highest number of measurements exceeding the norm. It is observed
that other months in 2003 were also mapped to that area. Seasonality is also
observed in other months, such as May (Y-06, Y-07, Y-08, Y-09, Y-10). The
months with better air quality tend to be those in late spring and summer.
Those months are mapped to the bottom right corner, and it is observed that
June (U), July (L), August (G), and September (S) are mainly located there.
Fig. 4 presents the U-matrix of the map for air pollution from
2003 to 2010, but now the vectors are composed of the measurements for a specific
monitoring station and year. In this scheme, we have intrinsic geographic information,
which may be helpful in seeking patterns in the data. For this experiment,
only 11 stations were considered, as the rest of them (ten more) do not measure
all considered pollutants. So we have 11 × 8 vectors. In this map, the vectors
corresponding to the poorest air quality are located in the cluster at the bottom
left corner, and the gray levels surrounding that cluster show a heavy border.
Stations A and M recorded the highest pollutant concentrations during 2003 and
2004. Station A is situated in a residential area with several avenues and streets
and a heavy traffic flow, whereas station M is situated downtown, in an area
with one of the highest population densities in the city.
Fig. 5. Planes for NO2 (a), (b), NO (c), (d), CO (e), (f), O3 (g), (h), and SO2 (i)
for the map shown in Fig. 4. For each of the first four pollutants, the two planes are the
average concentration and the number of times the safe level was exceeded. Gray level
indicates the corresponding value of the plane. Light tones indicate higher values.
Fig. 6. Air pollution and weather conditions map for all months from January 2003 to
November 2010. Each month is represented by a 19-dimensional vector, as the previously
cited 16 variables (pollutants) are included, plus air temperature, wind speed, and
wind direction. Codes are: J-January, F-February, M-March, A-April, Y-May, U-June,
L-July, G-August, S-September, O-October, N-November, D-December.
Fig. 7. Planes for NO2 (a), NO (b), CO (c), O3 (d), SO2 (e), PM10 (f), PM2.5 (g),
Pb (h), temperature (i), wind speed (j), and wind direction (k) for the map shown in
Fig. 6. Gray level indicates the corresponding value of the variable. Light tones indicate
higher values.
Interestingly, neither station A nor station M is located near an industrial area, but stations L and E are. As some industrial processes are among the sources of SO2,
it is no surprise that the concentration of that pollutant
is very high at stations L and E, mainly for the years 2003 and 2004. In 2002, a major
modification of the environmental law urged industries to incorporate new air
quality controls. As this was not an immediate process, some pollutant concentrations,
such as that of SO2, were still very high in 2003 and 2004.
Fig. 5 shows some of the variables (average pollutant concentration and number
of measurements above the safe level). Light areas tend to be
located at the lower left corner, with the exception of the two O3 planes.
In Fig. 6 we consider additional variables in the data. Besides the air pollutants,
we included the wind conditions (speed and direction) as well as temperature.
Now each input vector has 19 components, as it contains the 16 mentioned
variables plus the average air temperature, the average wind speed and the wind
direction. As expected, the map is slightly different from map 3, as now some
Fig. 8. U-matrix for the SOM for all hours in 2010 (December not included). Each
hour of the year was represented by an 8-dimensional vector, with each component
associated with the pollutant concentration measured during that hour, averaged over
all available monitoring stations. The label indicates the day number, starting with
January 1st as day 0, and the hour of the day. Hours are numbered from 0 to 23.
Fig. 9. U-matrix for the SOM for all hours in 2009. Only labels for July (6) and December
(11) are shown. In the bottom left corner, the hours with the highest pollutant measurements
are clustered; they in fact correspond to the so-called thermal inversion. Only
measurements from December are found there, as the cold temperatures of that month
in Mexico City tend to favor pollutant concentration. In contrast, almost all measurements
in July tend to be very low, and many of them are clustered at the upper right corner.
Acknowledgments
This research is derived from a project supported by Instituto de Ciencia y
Tecnología del Distrito Federal (ICyTDF), under contract PICCT08-55.
References
1. Sportisse, B.: Fundamentals in air pollution. Springer, Heidelberg (2010)
2. Seinfeld, J., Pandis, S.: Atmospheric Chemistry and Physics: From Air Pollution
to Climate Change, 2nd edn. Wiler (2006)
3. Knox, E.: Atmospheric pollutants and mortalities in English local authority areas.
J Epidemiol Community Health 62, 442447 (2008)
4. Ferrer-Carbonell, J., Escalante-Semerena, R.: Contaminacin atmosfrica y efectos
sobre la salud en la Zona Metropolitana del Valle de Mxico. Revista Economa
Informa 360, 119143 (2008)
5. http://www.bvsde.paho.org/bvsacd/eco/038267/038267-04.pdf
6. Bell, M., Davis, D., Guoveia, N., Borja, V., Cifuentes, L.: The avoidable health
eects of air pollution in three Latin American cities: Santiago, So Paulo and
Mexico City. Environmental Research 100, 431440 (2006)
7. http://www.sma.df.gob.mx/simat2/
8. Kohonen, T.: Self-Organizing maps, 3rd edn. Springer, Heidelberg (2000)
9. Hujun, Y.: The self-organizing maps: Background, theories, extensions and applications. In: Computational Intelligence: A Compendium, pp. 715762 (2008)
10. Kaski, S., Kohonen, T.: Exploratory data analysis by the self-organizing map: structures of welfare and poverty in the world. In: Apostolos-Paul, N.R. (ed.) Neural
Nwteorks in Financial Engineering, pp. 498507 (1996)
11. Ultsch, A.: Self organizied feature maps for monitoring and knowledge aquisition
of a chemical process. In: Proc. of the Int. Conf. on Articial Neural Networks, pp.
864867 (1993)
12. Alvarez-Guerra, E.: A SOM-based methodology for classifying air quality monitoring stations (2010), doi:10.1002/ep.10474
1 Introduction
The Self-Organizing Map (SOM) is a powerful tool for exploring huge amounts of
multi-dimensional data. The SOM by Kohonen [1] is a neural network algorithm that projects high-dimensional data onto a low-dimensional space. In the traditional SOM algorithm, however, the border effect problem has been pointed out,
and several spherical SOMs based on a geodesic dome [2], as well as a toroidal SOM, have
been proposed as a remedy. To show its potential effectiveness, the spherical SOM
has been applied to clustering. For instance, Tokutaka et al. [3] proposed a highly
accurate cluster analysis using the spherical SOM.
On the other hand, there is a proposal [4] for the interpretation of information on
the map. However, the U-matrix [5] has been mainly used in the traditional SOM and
the spherical SOM. In the U-matrix, the Euclidean distance between nodes is expressed by a gray level. Therefore, it is difficult to decipher the information of the
class distributions or the borders when the shading due to the U-matrix changes continuously. Tokutaka et al. [3] converted the shade of the U-matrix to the distance and
obtained a dendrogram to perform a classification based on distances. They recommended analyzing the dendrogram and the graphical object on the polygon surface
interactively to eliminate misclassifications. In the discussion of their cluster analysis,
they used boundaries that were artificially drawn. When discussing the dendrogram,
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 328–337, 2011.
© Springer-Verlag Berlin Heidelberg 2011
however, it is crucial that the class boundary is decided accurately, because the precision of the computed class boundaries controls the accuracy of the cluster analysis.
Generally, it is difficult to depict the class borders of multi-dimensional data; however,
the capability of the SOM to project multi-dimensional data allows the class
borders to be visualized successfully. Therefore, a method for
accurately drawing the decision borders on the SOM is needed.
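To make the U-matrix idea concrete, the following minimal Python sketch computes, for every node, the mean Euclidean distance to its grid neighbours, which is the quantity a gray level would encode. It assumes a flat rectangular grid with 4-connected neighbours rather than the spherical lattice used in this paper; the function name `u_matrix` and the toy weights are illustrative only.

```python
import math

def u_matrix(weights):
    """Mean Euclidean distance from each node to its 4-connected neighbours.

    weights: grid (list of rows) of model vectors; a flat rectangular grid
    is an assumption made for this sketch, not the spherical SOM itself.
    """
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            # A node with no neighbours (1 x 1 grid) gets height 0.
            heights[r][c] = sum(dists) / len(dists) if dists else 0.0
    return heights

# Toy example: the third node is far from the first two, so the heights
# rise toward it, which a U-matrix would show as a gradual change of shade.
toy_weights = [[(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)]]
print(u_matrix(toy_weights))
```

Because the heights change gradually between neighbouring nodes, a shaded rendering of such values alone does not pin down where one class ends and the next begins, which is the difficulty the text describes.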
To achieve the above goal, we have proposed two methods for determining the
class decision borders [6]. The proposed methods mainly covered the case of equal
class distributions, but it is vital that the class boundary is decided as accurately as
possible. Therefore, in this paper we propose an approach that deals with the case of non-equal class distributions.
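$$ (x_1 - d_1)^2 + \cdots + (x_m - d_m)^2 + (y_1 - d_1)^2 + \cdots + (y_m - d_m)^2 \tag{1} $$
$$ 2(x_i - d_i) + 2(y_i - d_i) = 0 \qquad (i = 1, \ldots, m) \tag{2} $$
$$ d_i = \frac{x_i + y_i}{2} \qquad (i = 1, \ldots, m) \tag{3} $$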
Fig. 1. Decision border and feature vectors X and Y on the spherical SOM. Open diamond
nodes are class A and open square nodes are class B. Open circle nodes are the points calculated
on the border, and the closed circle node is the point D. The dotted line is a part of the borderline.
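$$ U = \begin{bmatrix} U_1^2 & U_{1,2} & \cdots & U_{1,n} \\ U_{2,1} & U_2^2 & \cdots & U_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ U_{n,1} & U_{n,2} & \cdots & U_n^2 \end{bmatrix}, \qquad V = \begin{bmatrix} V_1^2 & V_{1,2} & \cdots & V_{1,n} \\ V_{2,1} & V_2^2 & \cdots & V_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ V_{n,1} & V_{n,2} & \cdots & V_n^2 \end{bmatrix} $$
$$ M_1^2 = [x_1 - d_1, \ldots, x_m - d_m] \, U^{-1} \, [x_1 - d_1, \ldots, x_m - d_m]^T \tag{4} $$
$$ M_2^2 = [y_1 - d_1, \ldots, y_m - d_m] \, V^{-1} \, [y_1 - d_1, \ldots, y_m - d_m]^T \tag{5} $$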
The point D on the boundary should be selected to satisfy the condition that the sum T of $M_1^2$ and $M_2^2$ is minimal for the feature vectors X and Y.
$$ T = M_1^2 + M_2^2 = (X - D) U^{-1} (X - D)^T + (Y - D) V^{-1} (Y - D)^T \tag{6} $$
In order for T to be minimal, it is necessary to satisfy the simultaneous equations obtained by differentiating equation (6) with respect to $d_1, d_2, \ldots, d_m$. However,
directly differentiating equation (6) with respect to $d_i$ complicates the handling of the
expression. Thus the calculation is done after $U^{-1}$ and $V^{-1}$ in equation (6) are diagonalized by using their eigenvalues and the corresponding eigenvectors.
$$ T = (X - D) P^{-1} A P (X - D)^T + (Y - D) Q^{-1} B Q (Y - D)^T \tag{7} $$
Here A and B are diagonal matrices whose diagonal elements are the
eigenvalues of $U^{-1}$ and $V^{-1}$, respectively, and P and Q are matrices
composed of the eigenvectors of $U^{-1}$ and $V^{-1}$, respectively:
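$$ A = \begin{bmatrix} \lambda_1^A & & & 0 \\ & \lambda_2^A & & \\ & & \ddots & \\ 0 & & & \lambda_n^A \end{bmatrix}, \qquad B = \begin{bmatrix} \lambda_1^B & & & 0 \\ & \lambda_2^B & & \\ & & \ddots & \\ 0 & & & \lambda_n^B \end{bmatrix} \tag{8} $$
$$ \cdots \qquad (i = 1, \ldots, m) \tag{9} $$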
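Here $d_i'$, $x_i'$ and $y_i'$ are defined by the following equations:
$$ \begin{bmatrix} x_1' & d_1' \\ \vdots & \vdots \\ x_n' & d_n' \end{bmatrix} = P^{-1} \begin{bmatrix} x_1 & d_1 \\ \vdots & \vdots \\ x_n & d_n \end{bmatrix} \tag{10} $$
$$ \begin{bmatrix} y_1' & d_1' \\ \vdots & \vdots \\ y_n' & d_n' \end{bmatrix} = Q^{-1} \begin{bmatrix} y_1 & d_1 \\ \vdots & \vdots \\ y_n & d_n \end{bmatrix} \tag{11} $$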
We can see that equation (9) coincides with equation (3) when the two eigenvalues are
equal.
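As a worked check of this statement, consider a simplified case, assumed here for illustration only, in which both class matrices are already diagonal in a common coordinate system. The weights $a_i$ and $b_i$ are hypothetical symbols standing in for the eigenvalues of $U^{-1}$ and $V^{-1}$. This is a sketch of why equal eigenvalues lead back to the midpoint of equation (3), not a restatement of equation (9).

```latex
\begin{align*}
T &= \sum_{i=1}^{m} \left[ a_i (x_i - d_i)^2 + b_i (y_i - d_i)^2 \right], \\
\frac{\partial T}{\partial d_i} &= -2 a_i (x_i - d_i) - 2 b_i (y_i - d_i) = 0, \\
d_i &= \frac{a_i x_i + b_i y_i}{a_i + b_i}, \qquad
a_i = b_i \;\Rightarrow\; d_i = \frac{x_i + y_i}{2}.
\end{align*}
```

Under this assumption the border point is pulled toward the class with the larger weight in each coordinate, and it falls exactly halfway between X and Y when the weights are equal.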
2.3 Procedure for Determining Decision Border
When the probabilistic relation between a feature vector and the class distribution is
uncertain, it is difficult to obtain the decision borders. It is then necessary to ensure
that the map contains the information on the distribution of the feature vectors. The
SOM is the easiest method to obtain such a map.
After obtaining the spherical SOM as shown in Fig. 1, the following three steps are
repeated to determine the decision borders with equal class distributions.
Step 1
A number of candidates on the spherical SOM are selected from the data sets near
the boundary.
Step 2
The distances between candidates are calculated and a pair of data points with the
minimum distance is determined.
Step 3
Point D on the boundary is selected from the pair by equation (3).
After some points on the boundary have been calculated by repeating Steps 1 through 3, a
borderline is drawn. In Step 1, candidates in class A and class B are selected among the datasets according to equation (12).
$$ \exp\left(-(x - A)^2 / R\right), \qquad \exp\left(-(y - B)^2 / R\right) \tag{12} $$
where A and B stand for the vectorial reference points of each class, and R is a
parameter (a value in the range from 0.01 to 0.1 is usually used for R
in the retrieval). The vectorial reference points are chosen from the nodes located at the
central part of the classes. The values of equation (12) are compared with a threshold value. In Step 1, candidates for the decision
borders are usually selected from the boundary dataset. If necessary, nodes of the SOM or
other datasets can be chosen. They are selected on the basis of the distance between
candidates. Meanwhile, when the decision borders are determined with non-equal
class distributions, it is necessary to calculate the variance-covariance matrices
of each class, their eigenvalues and the corresponding eigenvectors before beginning
Step 1 of the procedure. In Step 3, point D on the boundary should then be selected from the pair by using equation (9) instead of equation (3).
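Steps 2 and 3 for the equal-distribution case can be sketched in Python as follows. The sketch assumes that Step 1 has already produced the two candidate lists, and that the pair in Step 2 consists of one candidate from each class; the function name `border_point` and the sample coordinates are hypothetical.

```python
import math

def border_point(candidates_a, candidates_b):
    """Return the closest cross-class pair (X, Y) and its midpoint D.

    Step 2: find the pair with the minimum Euclidean distance.
    Step 3: place D halfway between them, d_i = (x_i + y_i) / 2 (equation (3)).
    """
    x, y = min(
        ((a, b) for a in candidates_a for b in candidates_b),
        key=lambda pair: math.dist(pair[0], pair[1]),
    )
    d = tuple((xi + yi) / 2 for xi, yi in zip(x, y))
    return x, y, d

# Made-up 2D candidates, for illustration only.
class_a = [(0.0, 0.0), (1.0, 1.0)]
class_b = [(2.0, 1.0), (5.0, 5.0)]
print(border_point(class_a, class_b))
```

Repeating the call with different candidate sets gives several points D, through which a borderline can then be drawn. For non-equal class distributions, the midpoint in the last step would be replaced by the point given by equation (9).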
versicolor). The number of objects in each class is 50. The data are 4-dimensional, and
the attributes are the sepal length and width and the petal length and width. Each
data item is labelled on the SOM and in the figures with a brief symbol. The setosa,
versicolor and virginica classes are represented by set_n, ver_n and
gnc_n, respectively; for example, the first data item of virginica is written gnc_1.
Fig. 3. A borderline drawn in the vicinity of gnc_20. The short-dashed oval is drawn only to
help locate gnc_20 and is not the borderline.
misclassified as a versicolor class by the cluster analysis using the spherical SOM.
We can see from this figure that the borderline can be successfully expressed in the
ambiguous shade region of the U-matrix.
3.3 Result with Non-equal Class Distributions
Table 1 lists the eigenvalues and the corresponding eigenvectors obtained from the
variance-covariance matrices of each class when the class distributions are expressed by using the variance-covariance matrices.
The maximum eigenvalue is that of the setosa class and the minimum eigenvalue is
that of the versicolor class. From these eigenvalues and equation (9), one can see
that the magnitude of the effect of the non-equal class distributions on the decision
borders depends on the eigenvalues, and that the effect is largest
in the setosa class, next largest in the virginica class and smallest in the versicolor
class. Hence one can also see that the decision border between the virginica and versicolor classes will be shifted toward the virginica side.
Fig. 4 shows the decision borders determined with our approach on the polygon
surface. The solid decision border with the nodes from b0 to b4 was determined by
using equations (10) and (11) from the feature vectors obtained with
equation (3). The dotted decision border with the nodes from b0T to b4T was determined
by using equations (10) and (11) from the data obtained on the basis of the
coordinate system proposed in [6]. The solid decision border in blue with the
nodes from Q0 to Q4 was obtained by using equations (9), (10) and (11)
from the feature vectors. When comparing the decision borders determined with the
three methods, the values of three of the five nodes, b0T, b2T and b3T, are a very good match with the corresponding values of b0, b2 and b3, but
those of the other nodes, b1T and b4T, do not match the corresponding values of
b1 and b4.
Table 1. Eigenvalues and the corresponding eigenvectors of each iris data class

Eigenvalues
  Setosa    Virginica   Versicolor
  22.441    19.856      6.677
  10.414     2.314      1.734
   4.904     2.107      1.084
   0.047     0.022      0.011

Eigenvectors
  Class              Vector   Components
  Setosa (set)       W1       -0.011    0.082   -0.920    0.383
                     W2       -0.062   -0.119    0.372    0.919
                     W3        0.945   -0.325   -0.027    0.032
                     W4       -0.322   -0.935   -0.119   -0.094
  Virginica (ver)    W1        0.171   -0.139   -0.695    0.685
                     W2       -0.852    0.148    0.209    0.455
                     W3        0.025   -0.914    0.364    0.177
                     W4        0.493    0.351    0.584    0.540
  Versicolor (gnc)   W1        0.444   -0.243   -0.726    0.465
                     W2       -0.078    0.909    0.370    0.176
                     W3       -0.753    0.145   -0.103    0.633
                     W4        0.479    0.306    0.571    0.593
On the other hand, the solid decision border with the nodes from Q0 to Q4 was
mapped in a different region. The reason is that some of the input data are
remapped out of phase on the polygon
surface. Therefore, some pairs of candidates with non-equal class distributions
should be reselected to decide the decision borders, regardless of the candidates selected
with equal class distributions.
Fig. 5 shows two decision borders determined from some pairs of candidates with
non-equal class distributions by using our approach. As can be seen from this
figure, the part of the decision border along the nodes from R0C to R4C shifts from
the part of the decision border along the nodes from R0+ to R4+, even though the part
of the decision border along the nodes from R5C to R6C shows a slightly opposite tendency. This supports the claim that the magnitude of the deviation
from the decision border with equal class distributions depends on the magnitude of
the eigenvalues of the variance-covariance matrix of each class.
4 Conclusions
We have proposed an approach which approximates the decision borders on a spherical SOM with non-equal class distributions. The magnitude of the effect of the non-equal class distributions on the decision borders depends on the magnitude of the
eigenvalues, especially the maximum eigenvalue, of the variance-covariance matrices.
Using Fisher's iris dataset, we confirmed that our approach allows the magnitude
of the effect on the decision borders to be visualized successfully and qualitatively.
References
1. Kohonen, T.: Self-Organizing Maps. Springer Series in Information Sciences, vol. 30.
Springer, Heidelberg (2001)
2. Nakatsuka, D., Oyabu, M.: Usefulness of Spherical SOM for Clustering. In: Proceedings
19th Fuzzy System Symposium, pp. 67–70 (2003)
3. Tokutaka, H., Fujimura, K., Ohkita, M.: Cluster Analysis using Spherical SOM (in
Japanese). Journal of Biomedical Fuzzy Systems Association 8(1), 29–39 (2006)
4. Ultsch, A., Mörchen, F.: ESOM-Maps: tools for clustering, visualization, and classification
with Emergent SOM. Dept. of Computer Science, University of Marburg, Research Report 46 (2005)
5. Ultsch, A., Guimaraes, G., Korus, D., Li, H.: Knowledge extraction from artificial neural
networks and applications. In: Proceedings of TAT/WTC 1993, pp. 194–203. Springer,
Heidelberg (1993)
6. Matsuda, N., Tokutaka, H., Oyabu, M.: Decision of Class Borders on Spherical SOM and
Its Visualization. In: Leung, C.S., Lee, M., Chan, J.H. (eds.) ICONIP 2009. LNCS,
vol. 5864, pp. 802–811. Springer, Heidelberg (2009)
7. http://www.ics.uci.edu/~mlearn/databases/
8. http://www.somj.com/
Introduction
In this paper we study how the Self-Organizing Map (SOM) [2] can be used
in analysing the structure of semantic concepts in visual data, in particular in
the PASCAL VOC 2007 and TRECVID 2010 data sets. We compare how the
a priori ground-truth concept data available in the training material and the
a posteriori concept detections extracted from the testing material of the two
databases behave in the mapping.
Early content-based image and video retrieval systems relied on measuring
similarity solely using low-level visual features automatically extracted from the
objects. However, such generic low-level features are often insufficient to discriminate content well on the higher conceptual level required by humans. This
"semantic gap" is the fundamental problem in multimedia retrieval.
In recent years, high-level features, or semantic concepts, have emerged as a
partial answer to this problem. The main idea is to create semantic representations by extracting intermediate semantic levels (events, objects, locations,
people, etc.) from low-level visual features using machine learning techniques.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 338–347, 2011.
© Springer-Verlag Berlin Heidelberg 2011
For example, we can train detectors for semantic concepts such as "image containing a cat" or "video depicting an explosion or fire", which can then be used
as building blocks for higher-level querying into the database.
The accuracy of current state-of-the-art concept detectors can vary greatly
depending on the particular concept and on the quality of the training data. Still,
it has been observed that even concept detectors with relatively poor accuracy
can be very useful in supporting high-level indexing and querying on multimedia
data [1].
The rest of the paper is organised as follows. In Section 2 we explain the main
idea of self-organising semantic concept vectors and how they can be analysed.
Section 3 describes our experiments with two commonly used visual databases
that are provided with semantic concepts and corresponding labelled examples.
Conclusions are drawn in Section 4.
In our experiments, each database is divided into a training set with a given
ground truth of known concept members (labelled images or videos), and a
test set for which concept membership is unknown. By training detectors on
the known memberships we generate probability estimates of the corresponding
concept membership of the test set objects. In this way the concept memberships
are boolean in the training set (e.g. an image either depicts a cat or not), while
the test set has probabilistic membership values (e.g. this image contains a cat
with the probability 0.8). In this paper, we take the concept detectors as given.
We use state-of-the-art detectors based on a fusion of Support Vector Machine
(SVM) detectors on low-level visual features.
The ground truth labels and the concept detector outcomes give us two ways
of analysing the semantic concepts of a database. A Self-Organizing Map can be
trained either on the {0, 1} values of the training set or the [0, 1] probabilities
from the test set. Given a set of objects (e.g. images) $x_1, \ldots, x_N$ and concepts
$C_1, \ldots, C_K$, we can construct a concept vector for the object $x_i$:
$$ c_i = \begin{bmatrix} p_{i,1} \\ \vdots \\ p_{i,K} \end{bmatrix}, \tag{1} $$
where $p_{i,j} \in [0, 1]$ is the concept membership score of object $x_i$ in concept $C_j$,
often interpreted as the probability of the object belonging to the given concept.
Thus, such a concept vector concisely specifies the membership of a
given object in all concepts of the concept vocabulary used.
In the next step, we train Self-Organizing Maps using the concept vectors
of the database objects as input. Such concept vectors contain only 1s
and 0s in the training set, where we have labelled data, but take values in the range
[0, 1] in the test set, where we have detector outcomes. Because of the different
types of input, we have trained two SOMs for each database, one for the training
set and one for the test set. Looking at the organisation of the training set can
M. Sjöberg and J. Laaksonen
give us insight into how concepts are correlated, and the 2D relations can give us
important clues to how the concepts group together into larger patterns.
A map of size $M \times M$ has model vectors $m_1, \ldots, m_{M^2}$, and if we take the jth
component of each model vector, $m_{1,j}, \ldots, m_{M^2,j}$, we get a 2D distribution over
the map surface for the concept $C_j$. Comparing such distributions between different concepts can provide insight into the semantic organisation of the database.
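The extraction of such a per-concept distribution can be sketched in a few lines of Python. The sketch assumes that the model vectors are stored in row-major order over the map; the function name `component_plane` and the toy values are illustrative and not part of PicSOM.

```python
def component_plane(model_vectors, map_size, concept_index):
    """Arrange the j-th component of all M*M model vectors as an M x M grid.

    model_vectors: list of M*M vectors, assumed to be in row-major map order.
    concept_index: zero-based index j of the concept C_j.
    """
    if len(model_vectors) != map_size * map_size:
        raise ValueError("expected map_size * map_size model vectors")
    values = [vec[concept_index] for vec in model_vectors]
    return [values[r * map_size:(r + 1) * map_size] for r in range(map_size)]

# Toy 2 x 2 map with two concepts (made-up values).
toy_vectors = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
print(component_plane(toy_vectors, map_size=2, concept_index=1))
```

Each returned grid can be rendered as a gray-scale image, and two such grids can be compared directly to see where the corresponding concepts overlap on the map.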
While studying the organisation of the test set is less certain, since we only
have estimated probabilities of quite varying accuracy, it may still be useful
for analysing the overall organisation of the data set.
In addition to visually and qualitatively inspecting the concept distributions
on the SOM surface, we also wish to make a more quantitative analysis. In
particular, the "closeness" of concepts on the maps might be interesting. The
different component distributions of the SOM model vectors represent different
concepts, and thus the closeness of concepts could be estimated by calculating the
distance between these distributions arranged as vectors, with short distances
indicating semantically close concepts.
We first considered the Euclidean distance between the component vectors, but
this would not take into account the 2D organisation of the SOM. Two
distributions might be close to each other on the map but still be orthogonal. To take into
account the 2D distribution of concepts on the map, we instead decided to use
the Earth Mover's Distance (EMD) [5] to calculate their dissimilarity. The EMD
measures the minimum cost of turning one distribution into the other, where in
this case the cost is the "value" that needs to be moved times the Euclidean
distance over the 2D map surface. We used the C implementation of EMD¹
provided by the authors of [5].
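The difference between the two measures can be illustrated with a small, self-contained Python sketch. It is not the C implementation referred to above: it solves the same transportation problem with a simple successive-shortest-path min-cost flow, assumes that both grids are normalised to unit mass, and uses function names (`emd_2d`, `flat_euclidean`) that are hypothetical.

```python
import math

def emd_2d(grid_a, grid_b):
    """Earth Mover's Distance between two non-negative 2D grids of equal shape.

    Each grid is normalised to unit mass; the ground distance between two
    cells is the Euclidean distance between their (row, col) coordinates.
    Solved as a min-cost flow with successive shortest paths (Bellman-Ford).
    """
    cells = [(r, c) for r in range(len(grid_a)) for c in range(len(grid_a[0]))]
    total_a = sum(map(sum, grid_a))
    total_b = sum(map(sum, grid_b))
    src = [(rc, grid_a[rc[0]][rc[1]] / total_a) for rc in cells if grid_a[rc[0]][rc[1]] > 0]
    dst = [(rc, grid_b[rc[0]][rc[1]] / total_b) for rc in cells if grid_b[rc[0]][rc[1]] > 0]
    n, m = len(src), len(dst)
    source, sink, size = n + m, n + m + 1, n + m + 2

    edges = []                       # each edge: [target, capacity, cost]
    adj = [[] for _ in range(size)]  # edge k and edge k ^ 1 are mutual reverses

    def add_edge(u, v, cap, cost):
        adj[u].append(len(edges))
        edges.append([v, cap, cost])
        adj[v].append(len(edges))
        edges.append([u, 0.0, -cost])

    for i, (_, mass) in enumerate(src):
        add_edge(source, i, mass, 0.0)
    for j, (_, mass) in enumerate(dst):
        add_edge(n + j, sink, mass, 0.0)
    for i, (cell_a, _) in enumerate(src):
        for j, (cell_b, _) in enumerate(dst):
            add_edge(i, n + j, math.inf, math.dist(cell_a, cell_b))

    eps, total_cost = 1e-12, 0.0
    while True:
        # Cheapest augmenting path from source to sink in the residual graph.
        dist = [math.inf] * size
        prev = [-1] * size
        dist[source] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if dist[u] == math.inf:
                    continue
                for k in adj[u]:
                    v, cap, cost = edges[k]
                    if cap > eps and dist[u] + cost < dist[v] - eps:
                        dist[v], prev[v], changed = dist[u] + cost, k, True
            if not changed:
                break
        if dist[sink] == math.inf:
            return total_cost        # no mass left to move
        # Push as much mass as the path allows and accumulate its cost.
        flow, v = math.inf, sink
        while v != source:
            flow = min(flow, edges[prev[v]][1])
            v = edges[prev[v] ^ 1][0]
        v = sink
        while v != source:
            k = prev[v]
            edges[k][1] -= flow
            edges[k ^ 1][1] += flow
            v = edges[k ^ 1][0]
        total_cost += flow * dist[sink]

def flat_euclidean(grid_a, grid_b):
    """Euclidean distance between the two grids flattened into vectors."""
    return math.dist([v for row in grid_a for v in row],
                     [v for row in grid_b for v in row])

# Made-up component planes: one peak, a neighbouring peak, a distant peak.
peak = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
near = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
far = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
print(flat_euclidean(peak, near), flat_euclidean(peak, far))
print(emd_2d(peak, near), emd_2d(peak, far))
```

In this toy case the flattened Euclidean distance is the same for the neighbouring and the distant peak, because the vectors are orthogonal in both cases, whereas the EMD grows with the distance over the map surface.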
Experiments
In the following subsections we present the resulting SOMs for two different
visual databases. For training the SOMs and processing the databases we have
used the content-based retrieval and analysis framework PicSOM [4]. PicSOM by
default uses Tree-Structured SOMs (TS-SOMs) [3], in which successively larger
SOM layers are trained by fixing the previous layer and restricting the best-matching unit (BMU) search to the neighbourhood of the unit beneath the
BMU in the previous layer. For our purpose, this gives us the advantage that we
can visualise the SOM surface at different levels of detail by looking at
different layers of the TS-SOM.
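The restricted BMU search can be sketched as follows. This is a simplified illustration rather than the PicSOM implementation: it assumes square layers whose side length grows by an integer factor, takes "the unit beneath" a BMU to be the block of units below it, and widens that block by a hypothetical `margin` parameter to form the search neighbourhood.

```python
import math

def ts_som_bmu(layers, x, margin=1):
    """Find the BMU on the last layer by a layer-by-layer restricted search.

    layers: list of square grids (lists of rows of model vectors), smallest
            layer first; each side length is assumed to be an integer
            multiple of the previous one.
    x:      input vector.
    margin: extra units searched around the block beneath the previous BMU
            (an assumption of this sketch).
    """
    size = len(layers[0])
    candidates = [(r, c) for r in range(size) for c in range(size)]
    bmu = None
    for depth, layer in enumerate(layers):
        # Best-matching unit among the currently allowed candidates only.
        bmu = min(candidates, key=lambda rc: math.dist(layer[rc[0]][rc[1]], x))
        if depth + 1 < len(layers):
            next_size = len(layers[depth + 1])
            factor = next_size // len(layer)
            r0, c0 = bmu[0] * factor, bmu[1] * factor
            candidates = [
                (r, c)
                for r in range(max(0, r0 - margin), min(next_size, r0 + factor + margin))
                for c in range(max(0, c0 - margin), min(next_size, c0 + factor + margin))
            ]
    return bmu

# Made-up two-layer map with one-dimensional model vectors.
coarse = [[(0.0,), (10.0,)], [(20.0,), (30.0,)]]
fine = [[(100.0,)] * 4 for _ in range(4)]
fine[0][0] = (11.0,)   # globally closest unit, but outside the search window
fine[1][2] = (11.5,)   # closest unit inside the window beneath coarse BMU (0, 1)
print(ts_som_bmu([coarse, fine], (11.0,)))
```

The example is constructed so that the restricted search returns a unit that is not the global best match on the larger layer, which shows the trade-off of this scheme: far fewer distance computations per layer, at the price of an approximate BMU.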
3.1 VOC 2007
The Pascal VOC 2007 database² contains almost 10,000 images with a training
set of 2,500, an evaluation set of 2,500 and a test set of about 5,000 images. We used
the training set and the test set to generate two Self-Organizing Maps. The
¹ http://ai.stanford.edu/~rubner/emd/default.htm
² http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
components, e.g. the first image, "Aeroplane", is a plot of the first component of
each model vector. The gray scale values of the images are scaled so that white
corresponds to 0.0 and black to 1.0. As can be expected, the training set concepts
show mostly black and white, while the test set covers more of the gray scale.
At this point it should be noted that the two maps were trained separately and
were initialised randomly. Thus the concepts are in general located in different
regions of the training and test set SOMs. It can however be seen that the
distributions are similar, e.g. the concept Train has three nodes in both data
sets, but located at different points due to the random initialisation.
It is also interesting to note, especially while looking at the concept ground
truth distributions (bottom left in Figure 1), how different concepts relate. For example,
the two classes of domestic pet animals, Cat and Dog, have a similar structure
with two separate nodes in their distributions. The two distributions partially
overlap, which indicates some co-occurrence.
Other interesting phenomena can be found as well, e.g. that all forms of land-based transport have at least a partial presence in the lower right corner. Bicycle,
Bus, Car, Motorbike and even Train have strong clusters there. Interestingly, the
Bus block seems to be like a piece of a puzzle that fits right into the larger
distribution of the Car concept. On closer investigation it can be seen that they
have a thin slice of common units, which causes the areas to coincide on the map.
The concepts Bottle and DiningTable also partially overlap, since a bottle can
probably often be found on a dining table, and Sofa overlaps with both of
these concepts. TVMonitor also overlaps partially with Sofa. Furthermore,
Bird and Aeroplane seem to share the upper right corner, partially with Boat.
Apparently these concepts sometimes co-occur.
The same phenomena are mostly repeated in the SOM based on the detector
outcomes as well (bottom right in Figure 1), e.g. the two overlapping nodes of
Cat and Dog. It is also clear that there is more of a visual organisation in the map
based on the concept detectors, e.g. there are bluish images, often depicting the
sky, at the middle of the left border of the second-layer map. This effect arises because the
concept detectors are trained on the visual features of the objects. The concept
detectors really indicate which images in the test set have the same visual
features as those in the training set belonging to a particular class. They try to
learn which visual features are discriminating in the training set. These
may or may not generalise to the test set.
To get a more quantitative measure of the similarity of the concept distributions, we calculated the Earth Mover's Distance (EMD) between all concept
pairs, excluding the concept Person, which covers more than half of the map
area (its a priori probability is 43%). Table 1 shows the 10 closest and 10 most
distant concept pairs according to the ground truth in the training set, while Table 2 shows
the same, but calculated from the detector outcomes in the test set. The difference in EMD compared to the closest pair and to the farthest
pair, respectively, is shown as a percentage in the adjoining column.
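One plausible way to produce such a ranking with its percentage column is sketched below. It assumes that the percentage is the relative difference to the closest pair (for the closest list) and to the most distant pair (for the most distant list); the function name and the distance values are made up for illustration and are not results from the paper.

```python
def relative_differences(pair_distances, count=10):
    """Rank concept pairs by EMD and add a relative-difference column.

    pair_distances: dict mapping a concept pair to its EMD.
    Returns (closest, farthest); each entry is (pair, emd, percent), where
    percent is the difference to the first entry of its own list.
    """
    ranked = sorted(pair_distances.items(), key=lambda item: item[1])
    closest = ranked[:count]
    farthest = ranked[::-1][:count]

    def with_percent(rows):
        base = rows[0][1]
        return [(pair, dist, 100.0 * (dist - base) / base) for pair, dist in rows]

    return with_percent(closest), with_percent(farthest)

# Hypothetical EMD values, for illustration only.
example = {("Cat", "Dog"): 2.0, ("Bus", "Car"): 3.0, ("Bird", "Sofa"): 8.0}
closest, farthest = relative_differences(example, count=2)
print(closest)
print(farthest)
```

With this convention the first row of each list is the reference and has a difference of zero, the closest list carries positive percentages and the most distant list negative ones.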
The EMD ordering corresponds well with our intuitive understanding, and
with the concept distributions shown in Figure 1. For the ground truth-based
TRECVID 2010
One of the tasks in the annual TRECVID video retrieval evaluation [8] is to detect the presence of predefined high-level features (HLFs) [9] in broadcast videos
that are already partitioned into shots. Our research group has participated since
2005, and in this paper we use the database from TRECVID 2010.
The TRECVID 2010 video data is taken from the Internet Archive collection³.
A total of 130 concepts are provided, and the ground truth was specified by a
collaborative annotation process among the participants⁴. The training data set
contains about 120,000 video shots (200 hours) and the test set about 150,000
video shots (200 hours). Some videos did not belong to any concept and were
dropped for these experiments.
As concept detectors we used the ones we developed for the TRECVID 2010
competition [7]. These are based on a fusion of SVM detectors using SIFT
and ColorSIFT features calculated from a dense sampling of different spatial
partitions of the key frame images extracted from the video shots. We trained
a TS-SOM with four layers of sizes 4 × 4, 16 × 16, 64 × 64 and 256 × 256. The
SOM for the training set is visualised in Figure 2, and that for the test set in Figure 3.
Both figures show the second, 16 × 16-sized layer with image labels representing
the model vectors, and below that the component distributions of selected
concepts. The 10 closest and 10 most distant concept pairs measured by Earth
Mover's Distance are shown in Table 3 for the ground truth and in Table 4 for
the detector outcomes.
In the training set we can see that there are several larger clusters of concepts,
e.g. suburban scenes form a large group close to the centre. Here we find e.g.
the concepts Building, Car, Road, Streets, Suburban and Vehicle. Outdoor also
overlaps this area. Pairs of these concepts also occur many times in the list of the 10
closest concept pairs. The situation is similar with the detector outcomes (test set),
but now these concepts occupy the lower right corner and are naturally more
spread out. Another cluster covered by the Outdoor concept is in the upper left
corner in the training set, e.g. Landscape, Plant, Trees and Vegetation. Again, in
the test set they are placed differently, in the middle of the bottom edge.
Looking at the most distant concept pairs, it is not so strange that Fem.FaceCloseup (female human face closeup) is distant from other concepts, since a
closeup image tends to fill the image with only one object, excluding the possibility of finding other concepts. Again we find that the pair-wise distances
grow more rapidly in the test set. Curiously, the concept Canoe is very distant
from other concepts, probably because it is very rare, with only 11 examples in the
training set.
³ http://www.archive.org/
⁴ http://mrim.imag.fr/tvca/
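[Table residue; concept pair names not recoverable.] Percentage differences in EMD, closest pairs: +21.9%, +32.4%, +42.0%, +48.1%, +51.5%, +59.3%, +70.8%, +85.5%, +87.3%. Most distant pairs: -3.2%, -4.0%, -9.2%, -9.9%, -10.1%, -10.7%, -12.0%, -12.3%, -12.6%.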
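[Table residue; concept pair names not recoverable.] Percentage differences in EMD, closest pairs: +40.6%, +483.8%, +651.7%, +654.0%, +684.9%, +817.7%, +1018.7%, +1375.6%, +1886.3%. Most distant pairs: -0.5%, -0.8%, -0.8%, -0.9%, -1.0%, -1.2%, -1.3%, -2.0%, -2.1%.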
Conclusions
References
1. Hauptmann, A.G., Christel, M.G., Yan, R.: Video retrieval based on semantic concepts. Proceedings of the IEEE 96(4), 602–622 (2008)
2. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer Series in Information Sciences, vol. 30. Springer, Berlin (2001)
3. Koikkalainen, P.: Progress with the tree-structured self-organizing map. In: 11th
European Conference on Artificial Intelligence, pp. 211–215 (1994)
4. Laaksonen, J., Koskela, M., Oja, E.: PicSOM – Self-organizing image retrieval with
MPEG-7 content descriptions. IEEE Transactions on Neural Networks, Special Issue
on Intelligent Multimedia Processing 13(4), 841–853 (2002)
5. Rubner, Y., Tomasi, C., Guibas, L.J.: The Earth Mover's Distance as a metric for
image retrieval. Tech. Rep. CS-TN-98-86, Stanford University (1998)
6. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for
object and scene recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence 32(9), 1582–1596 (2010)
7. Sjöberg, M., Koskela, M., Chechev, M., Laaksonen, J.: PicSOM experiments in
TRECVID 2010. In: Proceedings of the TRECVID 2010 Workshop, Gaithersburg,
MD, USA (November 2010)
8. Smeaton, A.F., Over, P., Kraaij, W.: Evaluation campaigns and TRECVid. In:
MIR 2006: Proceedings of the 8th ACM International Workshop on Multimedia
Information Retrieval, pp. 321–330. ACM Press, New York (2006)
9. Smeaton, A.F., Over, P., Kraaij, W.: High-Level Feature Detection from Video in
TRECVid: a 5-Year Retrospective of Achievements. In: Divakaran, A. (ed.) Multimedia Content Analysis, Theory and Applications, pp. 151–174. Springer, Berlin
(2009)
Introduction
Holograms are widely used for 3-dimensional (3D) data processing. For example,
3D object displays, memories of 3D objects and hologram sheets are conventional
applications [1][2]. In this paper, the Computer Generated Hologram (CGH) [3] is
used for the representation of 3D objects, and is also applied to the recognition of
3D objects. In the conventional method of 3D object processing, the 3D objects
are interpreted as sets of surfaces, edges and vertices, and the interpreted
information is used for processing. Using CGH, the raw data of the points on
the object, which can be obtained from a 3D laser scanner or from the
processing results of a 3D stereo camera, are used directly for processing. The computational cost of CGH may become large, yet it can be accelerated by SIMD
processing on a CPU or GPU, because the computation of CGH consists of simple numerical calculations. The recognition of 2D objects using CGH was reported in [4].
In our research, CGH is extended to the recognition of 3D objects.
In this paper, we propose a Self-Organizing Map (SOM) composed of CGH planes. A SOM is a feed-forward neural network consisting of two layers, an input layer and a competitive layer, without hidden layers. It is trained by unsupervised learning. After learning, the SOM can map multi-dimensional data onto a two-dimensional plane. CGH-SOM is a SOM whose units learn the CGHs of 3D objects. After learning, the learned 3D objects can be mapped onto the two-dimensional plane, and the resulting map can be used for clustering and recognition of 3D objects. The matching result between a CGH and a 3D object is given as image data. An index for evaluating the image data is defined in order to introduce a metric among the objects. Experimental results are shown for artificial 3D objects and for measured data obtained from a 3D laser scanner. For the scanned data in particular, sampling is required before hologram processing, because the laser scanner produces too many points and because differences in the Z coordinates of the scanned data affect the hologram matching. For this purpose, a conventional SOM is applied to select a constant number of points.
J. Laaksonen and T. Honkela (Eds.): WSOM 2011, LNCS 6731, pp. 348-356, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Hologram
Fresnel Hologram
H. Dozono et al.
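$$
\begin{aligned}
I(x_0, y_0, 0) &= |R_2 + O_2|^2 \\
&= a_1^2 + \frac{a_2^2}{z_2^2} + \frac{2 a_1 a_2}{z_2} \cos\left[ k \left\{ \frac{(x_0 - x_2)^2 + (y_0 - y_2)^2}{2 z_2} - (x_0 \cos\theta_{x1} + y_0 \cos\theta_{y1}) + z_2 \right\} \right]
\end{aligned}
\tag{1}
$$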
A 3D object is represented by a set of point light sources, and the sum of the CGHs calculated for all points gives the CGH of the object.
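As a rough illustration of Eq. (1) and of the summation over point sources, the following NumPy sketch computes the intensity pattern of each point and adds the patterns up. The function names, the grid size, the pixel pitch, the wave number and the default beam parameters are assumptions chosen for illustration, not values from the experiments; the reference-beam term is assumed to enter the phase with a negative sign.

```python
import numpy as np

def point_cgh(x0, y0, point, a1=1.0, a2=1.0, k=2 * np.pi / 0.633,
              cos_x=0.0, cos_y=0.0):
    """Intensity on the hologram plane for one point source, following Eq. (1).

    a1, a2 : amplitudes of the reference beam and of the point source
    cos_x, cos_y : direction cosines of the reference beam (assumed names)
    """
    x2, y2, z2 = point
    r2 = (x0 - x2) ** 2 + (y0 - y2) ** 2
    # Phase difference between object wave (Fresnel approximation) and reference wave.
    phase = k * (r2 / (2.0 * z2) - (x0 * cos_x + y0 * cos_y) + z2)
    return a1 ** 2 + (a2 / z2) ** 2 + (2.0 * a1 * a2 / z2) * np.cos(phase)

def object_cgh(points, size=64, pitch=1.0, **beam):
    """CGH of an object given as a list of (x, y, z) points: sum of point CGHs."""
    axis = (np.arange(size) - size // 2) * pitch
    x0, y0 = np.meshgrid(axis, axis)
    return sum(point_cgh(x0, y0, p, **beam) for p in points)

# Hypothetical object made of two points.
holo = object_cgh([(0.0, 0.0, 100.0), (5.0, -3.0, 120.0)])
```

Because each point contributes independently, the loop over points is the part that lends itself to SIMD or GPU parallelisation.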
2.2
Fig. 2 shows a schematic representation of matched filtering using a Fresnel hologram. The object beam is Fourier transformed by the lens and projected on the Fresnel hologram. The projected beam is diffracted by the Fresnel hologram. If the beam matches the information on the hologram, a parallel beam is emitted as the reference beam and is converged to a correlation spot by a lens.
In CGH computation, the beam projected on the hologram is computed as the Fresnel transform of the object beam, and is calculated as follows. The object is sliced in the Z-axis direction, and at each $z_i$ the distribution of the object beam is given as $g_i(x_0, y_0)$. Assume that the transfer function of a point of light is given as $f_i(x_i, y_i)$ at $(x_i, y_i)$ on the hologram. Then the Fresnel transform of $g_i(x_0, y_0)$ is given as follows, where $F$ denotes the Fourier transform.
Fig. 4. Matching result for same 3D objects
Fig. 5. Matching result for different 3D object
(2)
(3)
The projected beam is calculated by summing the $u_i$ over all slices located at $z_i$. The diffracted beam is calculated by simply multiplying the corresponding pixel values of the projected beam and the hologram, and the matching result projected on the screen is given as the inverse Fourier transform of the multiplied image. Fig. 4 and Fig. 5 show the results of CGH processing. Both figures show the pixel values as height. When the input object matches the object recorded in the hologram, a correlation spot with an extreme peak value is observed in the matching area. The position of the correlation spot depends on the parameters (beam angles, position of the object, etc.) and on the shape of the object, but it appears within a small area if the parameters are almost the same. The small values observed in the center of the image are the transmitted beam. When the input object does not match, only small peaks are observed in the matching area. To evaluate the matching results, the following two indexes are defined, where $M_a$ is the matching area.
Maximum pixel value in matching area
$$P_{max} = \max_{P \in M_a} P \tag{4}$$
(5)
$P_{max}$ cannot be evaluated on an absolute scale, and $P_{dif}$ depends on the value of $P_{max}$. Thus, these indexes are integrated as follows.
Integrated index
$$P_m = P_{max}/P_{dif} - 1 \tag{6}$$
A smaller value of $P_m$ represents a better matching result.
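The matching step described above (pixel-wise multiplication of the projected beam with the hologram, followed by an inverse Fourier transform) and the index $P_{max}$ can be sketched in NumPy as follows. This is a minimal illustration: the function names and the rectangular matching area given as a pair of slices are assumptions, and only $P_{max}$ is computed here.

```python
import numpy as np

def match(projected, hologram):
    """Multiply beam and hologram pixel by pixel, then inverse Fourier transform."""
    diffracted = projected * hologram
    return np.abs(np.fft.ifft2(diffracted))

def p_max(result, area):
    """Maximum pixel value inside the matching area.

    area is a (row_slice, column_slice) pair; a rectangular area is an assumption.
    """
    rows, cols = area
    return result[rows, cols].max()

# Hypothetical inputs: random arrays stand in for a projected beam and a hologram.
rng = np.random.default_rng(0)
projected = rng.random((64, 64))
hologram = rng.random((64, 64))
result = match(projected, hologram)
peak = p_max(result, (slice(0, 16), slice(40, 64)))
```

Restricting the maximum to the matching area keeps the transmitted beam at the image center from dominating the index.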
(7)
4 Experimental Results
4.1 Experimental Results Using Artificial Objects
In this subsection, the experimental results using artificial objects are shown. First, the objects shown in Fig. 6 are given as input data. The size of the map is 5x5 and the number of iterations is 30. Fig. 7 shows the map. Each number on the map denotes the number of the object which is closest to the hologram associated with the unit. Each gray unit denotes the winner unit to which an object is mapped. Each object is clustered separately on the map. The map is organized using the metric in hologram space, so the similarity may not agree with human perception. Fig. 8 shows the images retrieved from the holograms learned on the map. The retrieved images are not clear, because even a small amount of superposition of holograms can strongly affect the retrieved images. However, features of the objects, such as the shape of the base plane, can be observed in the images, and images labeled with the same number look similar. Fig. 9 shows the distribution of $P_m$, the similarity index between the holograms on the units and the objects. The index becomes small for the units labeled as object 1 and object 2 in Fig. 7. The objects are clustered well based on the index $P_m$.
Next, an experiment was made by changing the parameters of the same object (cone). The radius of the base plane and the height are changed to 20, 23 and 26. Fig. 10 shows the results. All of the objects are clustered separately and mapped on different units.
Fig. 6. Input data of artificial objects: 1-cylinder, 2-cone, 3-quadrangular pyramid, 4-hemisphere, 5-square column
Fig. 7. CGH-SOM for artificial objects: 1-cylinder, 2-cone, 3-quadrangular pyramid, 4-hemisphere, 5-square column
Fig. 8. Retrieved images from CGH-SOM for artificial objects
Fig. 10. CGH-SOM for cones changing the radius (R) and height (H)
4.2
Next, we made experiments using scanned data from a 3D laser scanner. The 3D laser scanner generates 5000-9000 coordinates of surface points of the scanned object. The computational cost of hologram processing using all of these data becomes very large. To address this problem, we used a conventional SOM to reduce the number of points. Using the scanned data as input vectors, the coordinate data (x, y, z) are organized on the map. To eliminate the effect of randomness, the coordinates on the initial map are taken uniformly from the x-y plane which includes the center of mass of the object, and the batch learning algorithm is applied. Fig. 11 shows the original scanned data and the reduced data of the object cone, and Fig. 12 shows the reduced data of the other scanned objects. The map size is 25x25, so the number of points of the reduced data is 625. The scanned data are reduced uniformly, and the features of the object are sufficiently preserved. Additionally, the matching result of CGH is heavily affected by changes in the values of the Z coordinates. To address this problem, the SOM algorithm is modified so as to generate discrete values that are multiples of Ds = 2 for the Z coordinate.
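The point-reduction step can be sketched as a batch SOM over the (x, y, z) coordinates. The following is a simplified illustration under stated assumptions, not the authors' implementation: the function name, the Gaussian neighbourhood, its shrinking schedule and the iteration count are hypothetical, and the Z values are simply rounded to multiples of Ds after training rather than inside a modified learning rule.

```python
import numpy as np

def reduce_points(points, side=25, iters=20, ds=2.0):
    """Reduce an (N, 3) point cloud to side*side points with a batch SOM."""
    points = np.asarray(points, dtype=float)
    lo, hi = points.min(axis=0), points.max(axis=0)
    centre = points.mean(axis=0)
    # Grid coordinates of the map units.
    gy, gx = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1).astype(float)
    # Deterministic initialisation: uniform on the x-y plane through the centre of mass.
    w = np.empty((side * side, 3))
    w[:, 0] = lo[0] + grid[:, 0] / (side - 1) * (hi[0] - lo[0])
    w[:, 1] = lo[1] + grid[:, 1] / (side - 1) * (hi[1] - lo[1])
    w[:, 2] = centre[2]
    grid_d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    sigma0 = side / 4.0  # assumed initial neighbourhood radius
    for t in range(iters):
        # Neighbourhood radius shrinks geometrically from sigma0 to 0.5.
        sigma = sigma0 * (0.5 / sigma0) ** (t / max(iters - 1, 1))
        # Best-matching unit of every point.
        d2 = ((points[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)
        bmu = d2.argmin(axis=1)
        # Batch update: each unit becomes a neighbourhood-weighted mean of the points.
        h = np.exp(-grid_d2[:, bmu] / (2.0 * sigma ** 2))
        s = h.sum(axis=1, keepdims=True)
        w = np.where(s > 1e-12, (h @ points) / np.maximum(s, 1e-12), w)
    # Quantise Z to multiples of ds (post-hoc simplification).
    w[:, 2] = np.round(w[:, 2] / ds) * ds
    return w
```

With `side=25` the function returns 25 x 25 = 625 points regardless of how many points the scanner produced. For clouds of several thousand points the distance matrix is large, so computing it in chunks may be preferable.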
Fig. 11. The original scan data of cone and the reduced data by conventional SOM
Fig. 12. The reduced data by conventional SOM (cube, halfpipe, pyramid, half spheral)
Fig. 13. CGH-SOM for scanned objects: 1-cone, 2-cube, 3-halfpipe, 4-pyramid, 5-half spheral
Fig. 14. Retrieved images from CGH-SOM for scanned objects
The five objects shown in Fig. 11 and Fig. 12 were used for training the map. Fig. 13 and Fig. 14 show the results. Using scanned data, the objects are also mapped separately on the map. Next, we made an experiment on object recognition. In this experiment, we used the Supervised Pareto learning SOM [6]. The Pareto learning SOM can integrate multiple objective functions in the learning process. $P_{max}$ and $P_{dif}$ in equations (4) and (5) are used as the objective functions, and the category of each object is used for supervised learning. Each object is scanned twice, and the scanned data differ between measurements. The first scan of each object is used for learning and the second scan is used for testing. The experiments are made while changing the size of the map used for reducing the scanned data. Table 1 shows the result. Using maps of size 10x10, 15x15 and 20x20 for reduction, all objects except the cone are recognized successfully, but the cone cannot be recognized. Using maps of size 25x25 and 30x30 for reduction, all objects are recognized successfully. Using all the raw data, only two objects are recognized because of the changes of Z coordinates between scans.
Table 1. Experimental results for object recognition (O: success, X: fail). Rows: cone, cube, halfpipe, pyramid, spheral
Conclusion
We proposed the Computer Generated Hologram SOM (CGH-SOM) for mapping 3D objects on a two-dimensional plane. The 3D objects were successfully mapped using the metrics in the hologram space. Further research is needed in the following aspects to make CGH-SOM applicable as a practical system. First, the matching method of CGH should be reconsidered, because CGH matching is too rigid to guarantee generalization ability. Second, the learning method of the units in CGH-SOM should be reconsidered. A superposed hologram, resulting from simply adding holograms, may accidentally match an incorrect object. Third, the computing cost of CGH-SOM is high. GPU computing can reduce this cost, and the cost can be reduced significantly by applying an optical computing method which uses a laser beam and a hologram displayed on a liquid crystal display.
References
1. Yu, F.T.S., Lu, X.J.: A real-time programmable joint transform correlator. Opt. Commun. 52, 10-16 (2000)
2. Orihara, Y., Klaus, W., Fujino, M., Kodate, K.: Optimization and application of Hybrid level binary zone plates. Appl. Opt. 40(32), 5877-5885 (2001)
3. Dallas, W.J.: Computer-Generated Holograms. Digital Holography and Three Dimensional Display, pp. 1-49. Springer, Heidelberg (2006), doi:10.1007/0-387-31397-4_1
4. Yu, F.T.S., Jutammulia, S.: Optical pattern recognition. Cambridge Univ. Press, New York (1998)
5. Tudela, R., Martín-Badosa, E., Badosa, A., Labastida, I., Vallmitjana, S., Juvells, I., Carnicer, A.: Full complex Fresnel holograms displayed on liquid crystal devices. Journal of Optics A: Pure and Applied Optics 5, 189-194 (2003)
6. Dozono, H., Nakakuni, M.: Application of Supervised Pareto Learning Self Organizing Maps to Multi-modal Biometric Authentication (in Japanese). IPSJ Journal 49(9), 3028-3037 (2008)
R. Mayer
A strong focus on music's primary mode, the sound of a song, can be seen in the research of the last decade. A number of methods to extract descriptive features from the audio signal and to capture information such as rhythm, speed, amplitude or instrumentation have been proposed, ranging from low-level features describing the power spectrum to higher-level ones. However, other modalities associated with music have also increasingly been employed for common MIR tasks.
Several research teams have been working on analysing textual information,
often in the form of song lyrics and a vector representation of the term information contained in other text documents; an early example is a study on artist
similarity via song lyrics [8]. Other cultural data is included in the retrieval
process e.g. in the form of textual artist or album reviews [1].
The study in [2] suggests that an essential part of human psychology is the ability to identify music, text, images or other information based on associations provided by contextual information from different media. It further suggests that a well-chosen cover of a book can reveal its contents, or that the lyrics of a familiar song can remind one of the song's melody. Album covers are generally carefully designed for specific target groups, as searching for music in a record shop is facilitated by browsing through album covers. There, album covers have to reveal the musical content of the album very quickly, and are thus used as strong visual clues [3]. Due to the well-developed image recognition abilities of humans, this task can be performed very efficiently, much faster than listening to excerpts of the songs. This motivates an increased utilisation of this modality.
A multi-modal approach to query music, text, and images with a special focus on album covers is presented in [2]. In [5], a three-dimensional musical landscape is created via a Self-Organising Map and applied to small private music collections. Additional information such as web data and album covers is used for labelling; album covers should facilitate the recognition of music known to the user. The covers are, however, not used in the SOM training itself.
The Self-Organising Map has also been applied to image data in the PicSOM
project [6], for Information Retrieval in image databases, incorporating methods
of relevance feedback.
In this paper, we want to empirically validate the hypothesis that album covers can provide cues to the type of music. We therefore organise a music collection with Self-Organising Maps using both music features and image features, and analyse the way the album covers are organised over the map. We investigate whether musical similarity and similarity in the album cover art are correlated, and whether album covers can really give a clue to the music they represent.
The remainder of this paper is structured as follows. Section 2 gives a brief outline of the SOM framework and the visualisations employed, while Section 3 introduces the feature sets employed to describe our music collection. Our experiments are then detailed in Section 5, before we conclude in Section 6.
SOM Framework
We employ the Java SOMToolbox framework¹, developed at the Vienna University of Technology, which provides methods for training SOMs. It further comprises an application for interactive, exploratory analysis of the map, allowing for zooming, panning and selection of single nodes and regions on the map. The application can also display digital images on top of the map grid, so it can easily be used to visualise the album covers.
To facilitate the visual discovery of structures in the data, such as clusters, approximately 15 visualisations are provided, among them the U-Matrix [14] and Smoothed Data Histograms [13]. The former indicates distances between SOM nodes by colour-coding, and thus hints at cluster boundaries, while the latter visualises density in the data, also indicating clusters as nodes with high density. We also utilise the Thematic Classmap visualisation [9]. It shows the distribution of meta-data labels or categories attached to the data vectors mapped on the SOM by colouring the map in continuous regions, similar to the way a political map does for countries. To this end, it performs a Voronoi tessellation of the map space, and assigns colours to each Voronoi region to indicate how much a class contributes to the data items in that region.
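The two density and distance visualisations can be illustrated with a short NumPy sketch. It is not the SOMToolbox implementation (which is written in Java): the function names are hypothetical, the U-Matrix is simplified to one value per node (the mean distance to its direct grid neighbours), and the Smoothed Data Histogram uses a rank-based voting scheme in which each data item gives s, s-1, ..., 1 points to its s closest nodes, which is how the method is commonly described.

```python
import numpy as np

def u_matrix(weights):
    """weights: (rows, cols, dim). Mean distance of each node to its grid neighbours."""
    rows, cols, _ = weights.shape
    total = np.zeros((rows, cols))
    count = np.zeros((rows, cols))
    # Distances between vertically adjacent nodes.
    dv = np.linalg.norm(weights[1:] - weights[:-1], axis=2)
    total[1:] += dv
    total[:-1] += dv
    count[1:] += 1
    count[:-1] += 1
    # Distances between horizontally adjacent nodes.
    dh = np.linalg.norm(weights[:, 1:] - weights[:, :-1], axis=2)
    total[:, 1:] += dh
    total[:, :-1] += dh
    count[:, 1:] += 1
    count[:, :-1] += 1
    return total / count

def smoothed_data_histogram(weights, data, s=3):
    """Each data item votes for its s closest nodes with s, s-1, ..., 1 points."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    d = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2)
    ranked = np.argsort(d, axis=1)[:, :s]
    votes = np.zeros(rows * cols)
    for rank in range(s):
        np.add.at(votes, ranked[:, rank], s - rank)
    return votes.reshape(rows, cols)
```

High U-Matrix values mark likely cluster boundaries, while high histogram values mark dense regions.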
To provide a partition of the map into separate clusters, the framework provides several clustering algorithms that can be applied to the vectors of the SOM nodes, such as Ward's linkage [4] algorithm.
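Clustering the node vectors with Ward's linkage can be sketched with SciPy's hierarchical clustering routines. This only illustrates the idea and is not the SOMToolbox code; the function name and the choice to cut the tree at a fixed number of clusters are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_nodes(weights, n_clusters):
    """weights: (rows, cols, dim) SOM node vectors. Returns a (rows, cols) label map."""
    rows, cols, dim = weights.shape
    # Agglomerative clustering that merges the pair with the smallest variance increase.
    tree = linkage(weights.reshape(-1, dim), method="ward")
    # Cut the tree so that at most n_clusters clusters remain.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return labels.reshape(rows, cols)
```

Drawing a line between neighbouring nodes with different labels then yields cluster boundaries on the map lattice.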
Feature Sets
Audio Features
¹ http://www.ifs.tuwien.ac.at/dm/somtoolbox/
human sound perception. In the second step, a Discrete Fourier Transform is applied to this Sonogram, resulting in a spectrum of loudness amplitude modulation
per modulation frequency for each critical band. After additional weighting and
smoothing steps, a Rhythm Pattern exhibits magnitude of modulation for 60
modulation frequencies on the 24 critical bands [7].
Rhythm Histogram. A Rhythm Histogram (RH) aggregates the modulation amplitude values of the critical bands computed in a Rhythm Pattern, and is thus
a descriptor for general rhythmic characteristics in a piece of audio [7].
Statistical Spectrum Descriptor. The first part of the algorithm for computing a Statistical Spectrum Descriptor (SSD), the computation of the specific loudness sensation, is the same as in the Rhythm Pattern algorithm. Subsequently, a set of statistical values (mean, median, variance, skewness, kurtosis, min and max) is calculated for each individual critical band. SSDs thereby describe fluctuations on the critical bands; they capture both timbral and rhythmic information. In a number of evaluation studies, SSDs have often been shown to be superior for musical genre classification tasks [7].
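The relation between the three audio feature sets can be sketched as follows, starting from a sonogram with one row per critical band. This is a simplified illustration under stated assumptions: the psycho-acoustic preprocessing as well as the weighting and smoothing steps are omitted, the function names are hypothetical, dropping the constant (DC) bin is an assumption, and kurtosis is computed in its non-excess form.

```python
import numpy as np

def rhythm_pattern(sonogram, n_mod=60):
    """sonogram: (bands, frames) loudness values. Returns (bands, n_mod) modulation magnitudes."""
    spectrum = np.abs(np.fft.rfft(sonogram, axis=1))
    # Drop the constant (DC) bin and keep the first n_mod modulation frequencies.
    return spectrum[:, 1:n_mod + 1]

def rhythm_histogram(rp):
    """Aggregate the modulation magnitudes over all critical bands."""
    return rp.sum(axis=0)

def ssd(sonogram):
    """Seven statistics per critical band: mean, median, variance, skewness, kurtosis, min, max."""
    mean = sonogram.mean(axis=1)
    median = np.median(sonogram, axis=1)
    var = sonogram.var(axis=1)
    std = np.sqrt(var)
    safe = np.where(std > 0, std, 1.0)  # avoid division by zero for constant bands
    centred = sonogram - mean[:, None]
    skew = (centred ** 3).mean(axis=1) / safe ** 3
    kurt = (centred ** 4).mean(axis=1) / safe ** 4
    return np.stack([mean, median, var, skew, kurt,
                     sonogram.min(axis=1), sonogram.max(axis=1)], axis=1)
```

For 24 critical bands this gives 24 x 60 = 1440 values for a Rhythm Pattern, 60 for a Rhythm Histogram and 24 x 7 = 168 for an SSD.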
3.2 Image Features
Colour Histogram. This feature set computes the distribution of pixel values in the RGB colour space. For each colour channel, a histogram of values (from 0 to 255) is computed from all pixels in the image. To reduce the dimensionality, we employed binning of the values. Using 128 bins for each channel was determined to be a good choice through experimental evaluation in classification tasks. Thus the total dimensionality of such a feature vector is 3 x 128 = 384.
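A binned RGB histogram of this kind can be computed in a few lines of NumPy. The sketch below assumes the image is already available as a (height, width, 3) array of 8-bit values; the function name is hypothetical.

```python
import numpy as np

def colour_histogram(image, bins=128):
    """image: (height, width, 3) array of 8-bit RGB values. Returns a 3*bins vector."""
    channels = [
        np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]
    # Concatenate the red, green and blue histograms.
    return np.concatenate(channels)
```

With 128 bins, each bin covers two adjacent intensity values, and the three per-channel histograms concatenate to 384 dimensions. Dividing by the number of pixels would make covers of different sizes comparable.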
Colour Names. Colour names [16] are a level of abstraction on top of a colour histogram: the colour space is divided into the 11 basic colours black, blue, brown, gray, green, orange, pink, purple, red, white and yellow. Each pixel is associated with one of these colours, and then, as before, a histogram of values for the whole image is computed. This feature vector thus has eleven dimensions.
SIFT Bag of Visual Words. The Scale Invariant Feature Transform (SIFT) is a local feature descriptor which is invariant to certain transformations, such as scaling, rotation or changes in brightness. The algorithm extracts interest points in an image, which can then be used to identify similar objects. The points usually lie on high-contrast regions of the image, such as object edges. We utilise the algorithm presented in [15], which uses a Harris corner detector and subsequently the Laplacian for scale selection. We created a 1024-dimensional codebook (Bag of Visual Words), capturing the relative distribution of the SIFT features.
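Once local descriptors and a codebook are available, the Bag of Visual Words vector is a normalised histogram of nearest codebook entries. The sketch below assumes both are given as NumPy arrays (for instance, a codebook obtained beforehand by clustering descriptors); the function name is hypothetical and the descriptor extraction itself is not shown.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """descriptors: (n, d) local descriptors; codebook: (k, d) visual words.

    Returns a length-k vector with the relative frequency of each visual word.
    """
    # Squared Euclidean distances via ||a||^2 - 2ab + ||b||^2 to save memory.
    d2 = ((descriptors ** 2).sum(axis=1)[:, None]
          - 2.0 * descriptors @ codebook.T
          + (codebook ** 2).sum(axis=1)[None, :])
    words = d2.argmin(axis=1)
    counts = np.bincount(words, minlength=len(codebook)).astype(float)
    return counts / max(counts.sum(), 1.0)
```

With a codebook of 1024 entries, every album cover is described by a 1024-dimensional vector regardless of how many interest points were found.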
Collection
Music information retrieval research in general suffers from a lack of standardised benchmark collections, which is mainly attributable to copyright issues. Nonetheless, some collections have been used frequently in the literature. These were, however, not usable for the study in this paper, as none of these collections comes with a complementary set of album covers. Additionally, most collections either lack information about song title and artist, or consist of royalty-free music from relatively unknown artists; in both cases, automated fetching of album covers from the web is not feasible.
Therefore, we composed our own test collection containing both audio snippets and album covers, by crawling data from the webshop amazon.com, which provides rich information for its music shop. Considering the best-selling lists from several different genres, for each album (or maxi-single) found, we downloaded the cover and the 30-second audio snippet of the first song. We skipped entries for which either the cover was of too poor quality (below 400x400 pixels) or the 30-second song snippet was missing. Amazon organises the contents of its music shop into 25 top-level genres, with many sub-categories; songs may be, and frequently are, assigned to multiple genres. We aimed at selecting rather diverse and non-overlapping genres, to achieve distinctive styles in the cover art, and thus chose genres such as Goth and Industrial Rock, Rap and Hip-Hop, Reggae, Country, Electronic, Classical music and Blues. Overall, the collection comprises more than 900 songs.
Experimental Analysis
We trained maps of the size 22x18 nodes, i.e. a total of 352 nodes, with each of the audio features. From a manual inspection, the map trained with SSD features seems to provide the best arrangement of music according to the author's perception, superior to RP and RH features.
This map is depicted in Figure 1, with the result of a clustering of the nodes
superimposed on the map lattice.
It can be observed that the classical music (indicated by light-grey colour) is separated rather well from the other genres, being mostly located in the upper-right corner. This area also matches the boundary detected via the clustering of the map nodes using Ward's linkage method. Gothic and Alternative rock music, indicated in green, is mostly located in the lower-left corner, though a few pieces are distributed over other areas as well. These pieces are mostly slow songs, using a lot of instrumentation also found in e.g. classical music, such as violins, and therefore most of these mapping patterns appear logical from a musical point of view. Reggae music (red) can mostly be found in the upper-left corner and upper centre, often together with Hip-Hop (blue), with which it shares many rhythmic and tempo characteristics. Jazz/Blues (dark-grey), which borrows many styles from other genres, is organised in a number of smaller, but in themselves rather consistent, clusters. The distribution of these clusters all over the map is motivated by the nature of this genre, which is a confluence of several music traditions and has incorporated many aspects of popular music. Electronic music (pink) shows no clear pattern and is distributed in small groups all over the map.
In Figure 2, a Smoothed Data Histogram (SDH) visualisation of this map is depicted, with the Islands of Music [13] metaphor, where islands represent areas with high density. It can be seen that the arrangement of different genres correlates to some degree with the SDH, such as in the area of high density in the upper right, which represents the cluster of classical music.
Fig. 1. Distribution of genres over the map with SSD audio features. Clusters obtained via Ward's linkage clustering of the nodes are indicated by white lines.
For a more detailed inspection, Figure 3 depicts 24 nodes in the upper-right corner of the map, the area containing mostly classical music; this section of the map contains a total of 64 songs. To indicate the genre, the class visualisation [9] from Figure 1 is also used in this illustration, using the same background colours for the different genres as in Figure 1. At first glance, there seems to be a certain coherence between the album covers. The most striking shared characteristic of the classical music album covers seems to be the frequent use of photos of people, in some cases the artists themselves, in other cases the musician interpreting the piece of music. These album covers generally follow a rather simple pattern for the background, consisting of few colours and no or few objects. Many of the albums also simply feature a completely white background. The album covers on the top edge of the figure mostly belong to the electronic genre; most of them share very similar instrumentation with the classical pieces, mostly the use of a piano or flutes. However, the album art seems to differ quite strongly, with a stronger use of dark colours and more complex themes.
We can make similar observations for areas with Jazz and Country music, such as the area on the lower right of the map, shown in Figure 4(a). Again, most of the covers feature portraits of the artists; however, there is a slightly different pattern in the background, using darker colours, and thus allowing a subtle differentiation from the previous examples. Similar observations can be made for many Reggae songs.
Fig. 2. Smoothed Data Histograms of the map with SSD audio features
Fig. 3. Album covers in the cluster of classical music (SSD audio features, top-right
corner)
A cluster with songs from the Gothic and Alternative Rock genre is shown in Figure 4(b). Out of a total of 13 covers, only six show people, and in most cases these portraits are heavily altered and appear more artificial. Also noteworthy is the use of many dark and flashy colours, which create a dark appearance.
While other areas of the map do not show such clear patterns, it can be concluded that, at least to a certain degree, musical similarity as determined by the SSD audio features and the vector projection of the SOM also coincides with some similarity in album cover art.
Fig. 5. Music maps trained on the image features from the album covers
When organising the map with the image features, we build again on the assumption that album covers carry some clues about the music characteristics, and thus similar music should be located in neighbouring regions of the map. However, when using simple features such as colour histograms or colour names, the latter being depicted in Figure 5(a), this assumption is not fulfilled. While the organisation of album covers along the colour properties gives a nice overview, this arrangement does not match the genres they belong to, as can be seen in Figure 5(a). There is basically no region in the map that shows a continuous area of similar music. We can thus conclude that for an interface to music, simple features such as the ones derived from colours are not sufficient.
Figure 5(b) depicts a section of a map trained with the SIFT BoV features. This section holds covers that, with very few exceptions, depict people; further, most of the songs are from the Hip-Hop genre. It could thus be concluded that SIFT BoV features can be useful to detect shapes of faces, which we identified earlier as an important aspect for several genres. We can also observe in some other areas that these features work very well at depicting outliers, mostly albums with very complex cover art. However, similar observations as for the map with colour names hold true: the features do not seem to be able to capture the complex similarities in the covers very well.
Finally, we applied the method described in [11], which allows for an analytical comparison of SOMs. It makes it possible to identify differences in mappings obtained by different SOM trainings, by indicating which data items are mapped closely together in both maps. It can also be used to compare two maps trained on different features, for example on the music and song lyrics, as in [10]. Applying this method to the maps trained with the album cover features and the ones extracted from the music, we notice only a very small percentage of matches between the two mappings: most of the songs that were mapped together in the music SOM are mapped to divergent areas in the album cover SOM.
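The idea of checking which items stay close together across two maps can be illustrated with a deliberately simplified measure. The sketch below is a stand-in and not the method of [11]: it takes the grid position of each item on both maps and reports the fraction of item pairs that are close on the first map and are also close on the second one. The function name and the distance threshold are assumptions.

```python
import numpy as np

def shared_neighbour_ratio(pos_a, pos_b, radius=1.0):
    """pos_a, pos_b: (n, 2) grid coordinates of the same n items on two maps.

    Returns the share of pairs within `radius` on map A that are also
    within `radius` on map B.
    """
    da = np.linalg.norm(pos_a[:, None, :] - pos_a[None, :, :], axis=2)
    db = np.linalg.norm(pos_b[:, None, :] - pos_b[None, :, :], axis=2)
    # Consider each unordered pair once.
    pairs = np.triu(np.ones(da.shape, dtype=bool), k=1)
    close_a = (da <= radius) & pairs
    if not close_a.any():
        return 0.0
    return float(((db <= radius) & close_a).sum() / close_a.sum())
```

A value near 1 means that both maps group the items in much the same way, whereas a value near 0 corresponds to the situation reported above, where items mapped together on the music map end up in divergent areas of the album cover map.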
Conclusions
We performed an analysis of the similarity of album art and the music it represents. To this end, we extracted audio features from the music and image features from the album covers, and trained a set of SOMs with them. The SOM trained with the audio features revealed that in a number of cases, the musical similarity is also reflected in the album covers, e.g. by the use of portraits or rather abstract objects, and also partly by the colours. The maps trained with the image features could, however, only reconfirm some of these similarities, when using the SIFT features to describe the visual content.
We thus conclude that while there is potential in using album covers for music information related tasks, there is a need for more powerful image feature descriptors. Such descriptors could be face detectors, more advanced use of points-of-interest features, and a combination of these features into a single descriptor.
References
1. Baumann, S., Pohle, T., Vembu, S.: Towards a socio-cultural compatibility of MIR systems. In: Proceedings of the 5th International Conference of Music Information Retrieval (ISMIR 2004), Barcelona, Spain, October 10-14, pp. 460-465 (2004)
2. Brochu, E., de Freitas, N., Bao, K.: The sound of an album cover: Probabilistic multimedia and IR. In: Bishop, C.M., Frey, B.J. (eds.) Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics, Key West, FL, USA, January 3-6 (2003)
3. Cunningham, S.J., Reeves, N., Britland, M.: An ethnographic study of music information seeking: implications for the design of a music digital library. In: Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 5-16. IEEE Computer Society, Washington, DC (2003)
4. Ward Jr., J.H.: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58(301), 236-244 (1963)
5. Knees, P., Schedl, M., Pohle, T., Widmer, G.: An Innovative Three-Dimensional User Interface for Exploring Music Collections Enriched with Meta-Information from the Web. In: Proceedings of the ACM 14th International Conference on Multimedia (MM 2006), Santa Barbara, California, USA, October 23-26, pp. 17-24 (2006)
6. Laaksonen, J., Koskela, M., Laakso, S., Oja, E.: PicSOM - content-based image retrieval with self-organizing maps. Pattern Recogn. Lett. 21(13-14), 1199-1207 (2000)
7. Lidy, T., Rauber, A.: Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In: Proc. ISMIR, London, UK, September 11-15, 2005, pp. 34-41 (2005)
8. Logan, B., Kositsky, A., Moreno, P.: Semantic analysis of song lyrics. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2004), Taipei, Taiwan, June 27-30, 2004, pp. 827-830 (2004)
9. Mayer, R., Aziz, T.A., Rauber, A.: Visualising Class Distribution on Self-organising Maps. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D.P. (eds.) ICANN 2007. LNCS, vol. 4669, pp. 359-368. Springer, Heidelberg (2007)
10. Mayer, R., Frank, J., Rauber, A.: Analytic comparison of audio feature sets using self-organising maps. In: Proceedings of the Workshop on Exploring Musical Information Spaces, in Conjunction with ECDL 2009, Corfu, Greece, pp. 62-67 (October 2009)
11. Mayer, R., Neumayer, R., Baum, D., Rauber, A.: Analytic comparison of self-organising maps. In: Príncipe, J.C., Miikkulainen, R. (eds.) WSOM 2009. LNCS, vol. 5629, pp. 182-190. Springer, Heidelberg (2009)
12. Orio, N.: Music retrieval: A tutorial and review. Foundations and Trends in Information Retrieval 1(1), 1-90 (2006)
13. Pampalk, E., Rauber, A., Merkl, D.: Using Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps. In: Dorronsoro, J.R. (ed.) ICANN 2002. LNCS, vol. 2415, pp. 871-876. Springer, Heidelberg (2002)
14. Ultsch, A., Siemon, H.P.: Kohonen's Self-Organizing Feature Maps for Exploratory Data Analysis. In: Proceedings of the International Neural Network Conference (INNC 1990), pp. 305-308. Kluwer Academic Press, Dordrecht (1990)
15. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9), 1582-1596 (2010)
16. Van De Weijer, J., Schmid, C.: Applying color names to image description. In: IEEE International Conference on Image Processing (ICIP 2007), vol. 3. IEEE, Los Alamitos (2007)
Author Index
Backhaus, Andreas 90
Baranovskiy, Evgeny 178
Barreto, Guilherme A. 121, 267
Biehl, Michael 277
Bunte, Kerstin 277
Burkovski, Andre 178, 228
Chen, Xi 160
Coelho, Andre Luís V. 121
Côme, Etienne 298
Cottrell, Marie 298
Domínguez, Manuel 61
Dozono, Hiroshi 348
Eklund, Tomas 40
Estevez, Pablo A. 151
Fincke, Tonio 288
Fujimura, Kikuo 308
Fukui, Ken-ichi 131
Furukawa, Tetsuo 101
Geweniger, Tina 90
Gisbrecht, Andrej 1
Grasemann, Uli 207
Haase, Sven 90
Hai, Ying 308
Hammer, Barbara 1, 277
Hasenfuss, Alexander 1
Heidemann, Gunther 178, 228
Heinze, Geoffrey-Alexeij 178
Hernández, Leticia 318
Hernández, Rodrigo 151
Hernández, Sergio 51
Hollmen, Jaakko 61
Honkela, Timo 160, 247
Ikemura, Toshimichi 198
Itoh, Masae 198
Iwasaki, Yuki 198
Kästner, Marika 79, 90
Kessler, Wiltrud 228
Kiran, Swathi 207
Kobdani, Hamidreza 228
Kohonen, Teuvo 16
Kurasova, Olga 141
Laaksonen, Jorma 247, 338
Lacaille, Jerôme 298
Lamirel, Jean-Charles 257
Macedo, Ana Cristina P. 267
Maia, Jose Everardo B. 121
Mall, Raghvendra 257
Manalili, Sean 188
Matsuda, Nobuo 328
Mayer, Rudolf 238, 357
Mehmood, Yasir 160
Miikkulainen, Risto 207
Moehrmann, Julia 178
Mokbel, Bassam 1, 277
Nakakuni, Masanori 348
Neme, Antonio 51, 168, 318
Neme, Omar 51, 168
Nishijima, Shinya 348
Numao, Masayuki 131
Ohkita, Masaaki 308
Ohkubo, Takashi 101
Oyabu, Matashige 308
Prada, Miguel Angel 61
Pulido, JRG 168
207
Sarlin, Peter 40
Schleif, Frank-Michael 1
Schütze, Hinrich 228
Seiffert, Udo 90
Silva, Ana Cristina C. 267
Sjöberg, Mats 338
Stefanovic, Pavel 141
Sulkava, Mika 61
Tanaka, Asami 348
Tasdemir, Kadim 71
Tenhunen, Juhani 247
Tokunaga, Kazuhiro 101, 111
16