Technical Report
CSD-TR-03-02
May 28, 2003
Department of Computer Science
Royal Holloway, University of London
Egham, Surrey TW20 0EX, England
1 Introduction
During recent years there have been advances in data learning using kernel methods. Kernel representation offers an alternative to learning non-linear functions by projecting the data into a high-dimensional feature space in order to increase the computational power of linear learning machines, though this still leaves open the issue of how best to choose the features or the kernel function in ways that will improve performance. We review some of the methods that have been developed for learning the feature space.
Proposed by H. Hotelling in 1936 [12], CCA can be seen as the problem of finding basis vectors for two sets of variables such that the correlation between the projections of the variables onto these basis vectors is mutually maximised. In an attempt to increase the flexibility of the feature selection, kernelisation of CCA (KCCA) has been applied to map the hypotheses to a higher-dimensional feature space. KCCA has been applied in preliminary work by Fyfe & Lai [8], Akaho [1] and, more recently, Vinokourov et al. [19] with improved results.
During recent years there has been a vast increase in the amount of multimedia content available both off-line and online, though we are unable to access or make use of this data unless it is organised in such a way as to allow efficient browsing. To enable content-based retrieval with no reference to labelling, we attempt to learn the semantic representation of images and their associated text. We present a general approach using KCCA that can be used for content-based [11] as well as mate-based retrieval [18, 11]. In both cases we compare the KCCA approach to the Generalised Vector Space Model (GVSM), which aims at capturing some term-term correlations by looking at co-occurrence information.
This study aims to serve as a tutorial and to make novel contributions in the following ways:
• In this study we follow the work of Borga [4] in representing the eigenproblem as two eigenvalue equations, as this allows us to reduce the computation time and the dimensionality of the eigenvectors.
• Further to that, we follow the idea of Bach & Jordan [2] to compute a new correlation matrix with reduced dimensionality. Though Bach & Jordan [2] address a very different problem, they use the same underlying technique of Cholesky decomposition to re-represent the kernel matrices. We show that partial Gram-Schmidt orthogonalisation [6] is equivalent to incomplete Cholesky decomposition, in the sense that incomplete Cholesky decomposition can be seen as a dual implementation of partial Gram-Schmidt.
• We show that the general approach can be adapted to two different types of problems, content and mate retrieval, by only changing the selection of eigenvectors used in the semantic projection.
In Section 2 we review the theoretical foundations of CCA, and in Section 3 we present the CCA and KCCA algorithms. Approaches to deal with the computational problems that arise in Section 3 are presented in Section 4. Our experimental results are presented in Section 5. In Section 6 we present the generalisation framework for CCA, while Section 7 draws final conclusions.
2 Theoretical Foundations
Proposed by H. Hotelling in 1936 [12], Canonical Correlation Analysis can be seen as the problem of finding basis vectors for two sets of variables such that the correlation between the projections of the variables onto these basis vectors is mutually maximised. Correlation analysis is dependent on the co-ordinate system in which the variables are described, so even if there is a very strong linear relationship between two sets of multidimensional variables, depending on the co-ordinate system used, this relationship might not be visible as a correlation. Canonical correlation analysis seeks a pair of linear transformations, one for each of the sets of variables, such that when the sets of variables are transformed the corresponding co-ordinates are maximally correlated.
Consider a multivariate random vector of the form (x, y). Suppose we are given a sample of instances $S = ((x_1, y_1), \ldots, (x_n, y_n))$ of (x, y); we use $S_x$ to denote $(x_1, \ldots, x_n)$ and similarly $S_y$ to denote $(y_1, \ldots, y_n)$. We can consider defining a new co-ordinate for x by choosing a direction $w_x$ and projecting x onto that direction,
$$x \mapsto \langle w_x, x\rangle.$$
If we do the same for y by choosing a direction $w_y$, we obtain samples of the new x and y co-ordinates, and the first stage of canonical correlation is to choose $w_x$ and $w_y$ so that the correlation between the two projected samples is maximised:
$$\rho = \max_{w_x, w_y} \operatorname{corr}(S_x w_x, S_y w_y) = \max_{w_x, w_y} \frac{\langle S_x w_x, S_y w_y\rangle}{\|S_x w_x\|\,\|S_y w_y\|}.$$
If we use $\hat{E}[f(x, y)]$ to denote the empirical expectation of the function $f(x, y)$, where
$$\hat{E}[f(x, y)] = \frac{1}{n}\sum_{i=1}^{n} f(x_i, y_i),$$
we can rewrite the correlation as
$$\rho = \max_{w_x, w_y} \frac{\hat{E}[\langle w_x, x\rangle\langle w_y, y\rangle]}{\sqrt{\hat{E}[\langle w_x, x\rangle^2]\,\hat{E}[\langle w_y, y\rangle^2]}} = \max_{w_x, w_y} \frac{\hat{E}[w_x' x y' w_y]}{\sqrt{\hat{E}[w_x' x x' w_x]\,\hat{E}[w_y' y y' w_y]}}.$$
It follows that
$$\rho = \max_{w_x, w_y} \frac{w_x' \hat{E}[x y'] w_y}{\sqrt{w_x' \hat{E}[x x'] w_x \; w_y' \hat{E}[y y'] w_y}}.$$
The total covariance matrix of (x, y) is the block matrix
$$C = \hat{E}\!\left[\begin{pmatrix} x \\ y \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}'\right] = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix}, \qquad (2.1)$$
where the within-sets covariance matrices are $C_{xx}$ and $C_{yy}$ and the between-sets covariance matrices are $C_{xy} = C_{yx}'$. Hence we can rewrite the correlation as
$$\rho = \max_{w_x, w_y} \frac{w_x' C_{xy} w_y}{\sqrt{w_x' C_{xx} w_x \; w_y' C_{yy} w_y}}. \qquad (2.2)$$
The maximum canonical correlation is the maximum of $\rho$ with respect to $w_x$ and $w_y$.
3 Algorithm
In this section we give an overview of the Canonical Correlation Analysis (CCA) and kernel CCA (KCCA) algorithms, formulating the optimisation problem as a generalised eigenproblem.
Observe that the solution of equation (2.2) is not affected by re-scaling $w_x$ or $w_y$, either together or independently, so that, for example, replacing $w_x$ by $\alpha w_x$ gives the quotient
$$\frac{\alpha\, w_x' C_{xy} w_y}{\sqrt{\alpha^2\, w_x' C_{xx} w_x \; w_y' C_{yy} w_y}} = \frac{w_x' C_{xy} w_y}{\sqrt{w_x' C_{xx} w_x \; w_y' C_{yy} w_y}}.$$
Since the choice of re-scaling is therefore arbitrary, the CCA optimisation problem formulated in equation (2.2) is equivalent to maximising the numerator
subject to
$$w_x' C_{xx} w_x = 1, \qquad w_y' C_{yy} w_y = 1.$$
The corresponding Lagrangian is
$$L(\lambda, w_x, w_y) = w_x' C_{xy} w_y - \frac{\lambda_x}{2}\,(w_x' C_{xx} w_x - 1) - \frac{\lambda_y}{2}\,(w_y' C_{yy} w_y - 1).$$
Taking derivatives with respect to $w_x$ and $w_y$ we obtain
$$\frac{\partial f}{\partial w_x} = C_{xy} w_y - \lambda_x C_{xx} w_x = 0,$$
$$\frac{\partial f}{\partial w_y} = C_{yx} w_x - \lambda_y C_{yy} w_y = 0.$$
Subtracting $w_y'$ times the second equation from $w_x'$ times the first, and using the constraints, gives $\lambda_x = \lambda_y$; let $\lambda = \lambda_x = \lambda_y$. Assuming $C_{yy}$ is invertible we have $w_y = \frac{1}{\lambda} C_{yy}^{-1} C_{yx} w_x$, and substituting back gives
$$C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^2 C_{xx} w_x. \qquad (3.4)$$
As the covariance matrices $C_{xx}$ and $C_{yy}$ are symmetric positive definite, we are able to decompose them using a complete Cholesky decomposition (more details on Cholesky decomposition can be found in Section 4.2),
$$C_{xx} = R_{xx} R_{xx}',$$
where $R_{xx}$ is a lower triangular matrix. If we let $u_x = R_{xx}' w_x$ we are able to rewrite equation (3.4) as follows:
$$C_{xy} C_{yy}^{-1} C_{yx} (R_{xx}')^{-1} u_x = \lambda^2 R_{xx} u_x,$$
$$R_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx} (R_{xx}')^{-1} u_x = \lambda^2 u_x.$$
We are therefore left with a symmetric eigenproblem of the form $Ax = \lambda x$.
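For concreteness, the primal CCA solution can be computed as in the following NumPy sketch. This is our illustration, not code from the report; it assumes centred data and well-conditioned covariance matrices, and the function and variable names are ours.

```python
import numpy as np

def primal_cca(X, Y):
    """Minimal primal CCA sketch: leading correlation and directions.

    X, Y: data matrices with one (centred) sample per row.
    No regularisation; C_xx and C_yy are assumed well conditioned.
    """
    n = X.shape[0]
    Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

    # Complete Cholesky factor: C_xx = R_xx R_xx'
    Rxx = np.linalg.cholesky(Cxx)
    Rinv = np.linalg.inv(Rxx)

    # Symmetric matrix R_xx^{-1} C_xy C_yy^{-1} C_yx R_xx'^{-1}
    A = Rinv @ Cxy @ np.linalg.solve(Cyy, Cxy.T) @ Rinv.T
    lam2, U = np.linalg.eigh(A)              # eigenvalues are lambda^2, ascending
    u_x = U[:, -1]                           # leading eigenvector
    w_x = np.linalg.solve(Rxx.T, u_x)        # recover w_x from u_x = R_xx' w_x
    lam = np.sqrt(max(lam2[-1], 0.0))
    w_y = np.linalg.solve(Cyy, Cxy.T @ w_x) / max(lam, 1e-12)
    return lam, w_x, w_y
```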
CCA may not extract useful descriptors of the data because of its linearity. Kernel CCA offers an alternative solution by first projecting the data into a higher-dimensional feature space before performing CCA in the new feature space, essentially moving from the primal to the dual representation. Kernels are methods of implicitly mapping data into a higher-dimensional feature space, a method known as the "kernel trick". A kernel is a function $K$ such that for all $x, z \in X$
$$K(x, z) = \langle \phi(x), \phi(z)\rangle,$$
where $\phi$ is a mapping from $X$ to a feature space $F$.
Using the definition of the covariance matrix in equation (2.1) we can rewrite the covariance matrices using the data matrices (of vectors) $X$ and $Y$, which have the sample vectors as rows and are therefore of size $m \times N$; we obtain
$$C_{xx} = X'X, \qquad C_{xy} = X'Y.$$
The directions $w_x$ and $w_y$ can be rewritten as projections of the data, $w_x = X'\alpha$ and $w_y = Y'\beta$. Substituting into equation (2.2) and writing $K_x = XX'$ and $K_y = YY'$ for the kernel matrices, we obtain
$$\rho = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{\alpha' K_x^2 \alpha \cdot \beta' K_y^2 \beta}}. \qquad (3.7)$$
We find that in equation (3.7) the variables are now represented in the dual form.
Observe that, as with the primal form presented in equation (2.2), equation (3.7) is not affected by re-scaling of $\alpha$ and $\beta$ either together or independently. Hence the KCCA optimisation problem is equivalent to maximising the numerator subject to
$$\alpha' K_x^2 \alpha = 1, \qquad \beta' K_y^2 \beta = 1.$$
The corresponding Lagrangian is
$$L(\lambda, \alpha, \beta) = \alpha' K_x K_y \beta - \frac{\lambda_\alpha}{2}\,(\alpha' K_x^2 \alpha - 1) - \frac{\lambda_\beta}{2}\,(\beta' K_y^2 \beta - 1).$$
Taking derivatives with respect to $\alpha$ and $\beta$ we obtain
$$\frac{\partial f}{\partial \alpha} = K_x K_y \beta - \lambda_\alpha K_x^2 \alpha = 0, \qquad (3.8)$$
$$\frac{\partial f}{\partial \beta} = K_y K_x \alpha - \lambda_\beta K_y^2 \beta = 0. \qquad (3.9)$$
Subtracting $\beta'$ times the second equation from $\alpha'$ times the first we have
$$0 = \alpha' K_x K_y \beta - \lambda_\alpha\, \alpha' K_x^2 \alpha - \beta' K_y K_x \alpha + \lambda_\beta\, \beta' K_y^2 \beta = \lambda_\beta\, \beta' K_y^2 \beta - \lambda_\alpha\, \alpha' K_x^2 \alpha,$$
which together with the constraints implies that $\lambda_\alpha - \lambda_\beta = 0$; let $\lambda = \lambda_\alpha = \lambda_\beta$.
Considering the case where the kernel matrices $K_x$ and $K_y$ are invertible, we have
$$\beta = \frac{K_y^{-1} K_y^{-1} K_y K_x \alpha}{\lambda} = \frac{K_y^{-1} K_x \alpha}{\lambda},$$
and substituting in equation (3.8) we obtain
$$K_x K_y K_y^{-1} K_x \alpha - \lambda^2 K_x K_x \alpha = 0.$$
Hence
$$K_x K_x \alpha - \lambda^2 K_x K_x \alpha = 0$$
or
$$I\alpha = \lambda^2 \alpha. \qquad (3.10)$$
We are left with a generalised eigenproblem of the form $Ax = \lambda x$. We can deduce from equation (3.10) that $\lambda = 1$ for every vector $\alpha$; hence we can choose the projections $w_x$ to be the unit vectors $j_i$, $i = 1, \ldots, m$, while $w_y$ are the columns of $\frac{1}{\lambda} K_y^{-1} K_x$. Hence when $K_x$ or $K_y$ is invertible, perfect correlation can be formed.
Since kernel methods provide high dimensional representations such
independence is not uncommon. It is therefore clear that a naive application of
CCA in kernel defined feature space will not provide useful results. In the next
section we investigate how this problem can be avoided.
4 Computational Issues
We observe from equation (3.10) that if $K_x$ is invertible maximal correlation is obtained, suggesting that learning is trivial. To force non-trivial learning we introduce a control on the flexibility of the projections by penalising the norms of the associated weight vectors by a convex combination of constraints based on Partial Least Squares. Another computational issue that can arise is the use of large training sets, as this can lead to computational problems and degeneracy. To overcome this issue we apply partial Gram-Schmidt orthogonalisation (equivalently, incomplete Cholesky decomposition) to reduce the dimensionality of the kernel matrices.
4.1 Regularisation
To force non-trivial learning on the correlation we introduce a control on the flexibility of the projection mappings using Partial Least Squares (PLS) to penalise the norms of the associated weights. We convexly combine the PLS term with the KCCA term in the denominator of equation (3.7), obtaining
$$\rho = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{(\alpha' K_x^2 \alpha + \kappa \|w_x\|^2)\cdot(\beta' K_y^2 \beta + \kappa \|w_y\|^2)}} = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{(\alpha' K_x^2 \alpha + \kappa\, \alpha' K_x \alpha)\cdot(\beta' K_y^2 \beta + \kappa\, \beta' K_y \beta)}},$$
using $\|w_x\|^2 = \alpha' K_x \alpha$ and $\|w_y\|^2 = \beta' K_y \beta$. We observe that the new regularised equation is not affected by re-scaling of $\alpha$ or $\beta$, hence the optimisation problem is equivalent to maximising the numerator subject to
$$\alpha' K_x^2 \alpha + \kappa\, \alpha' K_x \alpha = 1,$$
$$\beta' K_y^2 \beta + \kappa\, \beta' K_y \beta = 1.$$
Taking derivatives with respect to $\alpha$ and $\beta$ we obtain
$$\frac{\partial f}{\partial \alpha} = K_x K_y \beta - \lambda_\alpha (K_x^2 \alpha + \kappa K_x \alpha) = 0, \qquad (4.1)$$
$$\frac{\partial f}{\partial \beta} = K_y K_x \alpha - \lambda_\beta (K_y^2 \beta + \kappa K_y \beta) = 0. \qquad (4.2)$$
Subtracting $\beta'$ times the second equation from $\alpha'$ times the first we have
$$0 = \alpha' K_x K_y \beta - \lambda_\alpha\, \alpha'(K_x^2 \alpha + \kappa K_x \alpha) - \beta' K_y K_x \alpha + \lambda_\beta\, \beta'(K_y^2 \beta + \kappa K_y \beta) = \lambda_\beta\, \beta'(K_y^2 \beta + \kappa K_y \beta) - \lambda_\alpha\, \alpha'(K_x^2 \alpha + \kappa K_x \alpha),$$
which together with the constraints implies that $\lambda_\alpha = \lambda_\beta$; let $\lambda = \lambda_\alpha = \lambda_\beta$. Considering the case where the kernel matrices $K_x$ and $K_y$ are invertible, we have
$$\beta = \frac{(K_y + \kappa I)^{-1} K_x \alpha}{\lambda},$$
and substituting in equation (4.1) gives
$$K_x K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2\, K_x (K_x + \kappa I)\alpha,$$
$$K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2\, (K_x + \kappa I)\alpha,$$
$$(K_x + \kappa I)^{-1} K_y (K_y + \kappa I)^{-1} K_x \alpha = \lambda^2 \alpha.$$
We obtain a generalised eigenproblem of the form $Ax = \lambda x$.
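When the kernel matrices are small enough to handle explicitly, the regularised dual problem above can be solved directly. The following NumPy sketch is ours (variable names are illustrative and no claim is made that it reproduces the authors' implementation); it solves the final eigenproblem and recovers β.

```python
import numpy as np

def regularised_kcca(Kx, Ky, kappa):
    """Sketch of regularised KCCA on full (centred) kernel matrices.

    Solves (K_x + kI)^{-1} K_y (K_y + kI)^{-1} K_x alpha = lambda^2 alpha and
    recovers beta = (K_y + kI)^{-1} K_x alpha / lambda.  Direct illustration
    only; for large kernels the report uses the reduction of Section 4.
    """
    n = Kx.shape[0]
    I = np.eye(n)
    A = np.linalg.solve(Kx + kappa * I, Ky) @ np.linalg.solve(Ky + kappa * I, Kx)
    lam2, V = np.linalg.eig(A)                     # A is not symmetric in general
    order = np.argsort(-lam2.real)
    lam = np.sqrt(np.clip(lam2.real[order], 0.0, None))
    alphas = V.real[:, order]
    betas = np.linalg.solve(Ky + kappa * I, Kx @ alphas) / np.maximum(lam, 1e-12)
    return lam, alphas, betas
```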
4.2 Cholesky Decomposition
A symmetric matrix $A$ does not always admit a factorisation of the form $A = LL'$ with $L$ lower triangular; for example, no such factorisation exists for the matrix
$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
On the other hand, if the symmetric matrix $A$ is positive definite (i.e., $x'Ax > 0$ whenever $x'x > 0$), then the factorisation is possible. For large kernel matrices we do not compute the factorisation exactly; instead we use the following incomplete Cholesky decomposition, which skips pivots once the remaining contribution falls below a precision parameter $\eta$.
Input: N x N matrix K, precision parameter η.

1. Initialisation: i = 1, K' = K, P = I; for j in [1, N], G_jj = K_jj.
2. While Σ_{j=i}^{N} G_jj > η and i ≠ N + 1:
   • Find best new element: j* = argmax_{j in [i,N]} G_jj
   • Update j* = (j* + i) − 1 (re-index j* relative to the full matrix)
   • Update permutation P: P_next = I, P_next(i,i) = 0, P_next(j*,j*) = 0, P_next(i,j*) = 1, P_next(j*,i) = 1; P = P · P_next
   • Permute elements i and j* in K': K' = P_next · K' · P_next
   • Update (due to the new permutation) the already calculated elements of G: G(i, 1:i−1) ↔ G(j*, 1:i−1), G(i,i) ↔ G(j*,j*)
   • Set G_ii = sqrt(G_ii)
   • Calculate the ith column of G: G_{i+1:n, i} = (1 / G_ii) (K'_{i+1:n, i} − Σ_{j=1}^{i−1} G_{i+1:n, j} G_{ij})
   • Update only the diagonal elements: for j in [i+1, N], G_jj = K'_jj − Σ_{k=1}^{i} G_{jk}^2
   • Update i = i + 1
3. Output P, G and M = i
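A compact NumPy rendering of the pivoted incomplete Cholesky procedure is given below. It is a sketch of ours that mirrors the pseudocode above rather than a verbatim transcription; the index bookkeeping (0-based) is an adaptation.

```python
import numpy as np

def incomplete_cholesky(K, eta):
    """Pivoted incomplete Cholesky sketch following the pseudocode above.

    Returns a permutation `perm` and an N x M factor G with
    K[perm][:, perm] approximately equal to G @ G.T (residual trace <= eta).
    """
    K = K.copy().astype(float)
    N = K.shape[0]
    perm = np.arange(N)
    d = np.diag(K).copy()                      # residual diagonal (G_jj above)
    G = np.zeros((N, N))
    i = 0
    while i < N and d[i:].sum() > eta:
        j = i + int(np.argmax(d[i:]))          # best new pivot
        # swap rows/columns i and j of K, the permutation, the residual diagonal
        perm[[i, j]] = perm[[j, i]]
        d[[i, j]] = d[[j, i]]
        K[[i, j], :] = K[[j, i], :]
        K[:, [i, j]] = K[:, [j, i]]
        G[[i, j], :i] = G[[j, i], :i]          # already calculated elements of G
        G[i, i] = np.sqrt(d[i])
        G[i + 1:, i] = (K[i + 1:, i] - G[i + 1:, :i] @ G[i, :i]) / G[i, i]
        d[i + 1:] = np.diag(K)[i + 1:] - (G[i + 1:, :i + 1] ** 2).sum(axis=1)
        i += 1
    return perm, G[:, :i]
```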
Initialisations:
m = size of the N x N matrix K (so m = N)
j = 1
size and index are vectors with the same length as K
feat is a zero matrix equal to the size of K
for i = 1 to m do
    norm2[i] = K_ii;

Algorithm:
while Σ_i norm2[i] > η do
    i_j = argmax_i (norm2[i]);
    index[j] = i_j;
    size[j] = sqrt(norm2[i_j]);
    for i = 1 to m do
        feat[i, j] = ( k(d_i, d_{i_j}) − Σ_{t=1}^{j−1} feat[i, t] · feat[i_j, t] ) / size[j];
        norm2[i] = norm2[i] − feat(i, j) · feat(i, j);
    end;
    j = j + 1;
end;
return feat

Output:
‖K − feat · feat'‖ ≤ η, where feat is an N × M lower triangular matrix (see Appendix 1.2 for a proof).
We observe that the output is equivalent to the output of ICD.
To project a new example into this representation we compute, for j = 1 to M,
$$newfeat[j] = \Big(K_{i, index[j]} - \sum_{t=1}^{j-1} newfeat[t]\cdot feat[index[j], t]\Big) / size[j],$$
where $K_{i, index[j]}$ denotes the kernel value between the new example $i$ and training example $index[j]$.
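As a small illustration (ours, with 0-based indexing and illustrative names), projecting a new example onto the Gram-Schmidt basis produced above can be written as:

```python
import numpy as np

def project_new_example(k_new, index, size, feat):
    """Project a new example onto the partial Gram-Schmidt basis.

    k_new : array of kernel values k(d_new, d_i) against the training examples.
    index, size, feat : output of the partial Gram-Schmidt routine above
                        (index and size truncated to the M selected basis vectors).
    Returns the M-dimensional feature vector `newfeat` of the pseudocode.
    """
    M = len(size)
    newfeat = np.zeros(M)
    for j in range(M):
        newfeat[j] = (k_new[index[j]]
                      - newfeat[:j] @ feat[index[j], :j]) / size[j]
    return newfeat
```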
Let $R_x$ and $R_y$ be the lower triangular matrices obtained by performing partial Gram-Schmidt orthogonalisation on $K_x$ and $K_y$, so that $K_x \approx R_x R_x'$ and $K_y \approx R_y R_y'$, and define $Z_{xx} = R_x' R_x$, $Z_{yy} = R_y' R_y$, $Z_{xy} = R_x' R_y$, $Z_{yx} = R_y' R_x$ together with the new variables
$$\tilde{\alpha} = R_x'\alpha, \qquad \tilde{\beta} = R_y'\beta.$$
Substituting in equations (4.9) and (4.10) we find that we return to the primal representation of CCA with a dual representation of the data,
$$Z_{xx} Z_{xy}\tilde{\beta} - \lambda Z_{xx}^2\tilde{\alpha} = 0,$$
$$Z_{yy} Z_{yx}\tilde{\alpha} - \lambda Z_{yy}^2\tilde{\beta} = 0.$$
Assuming that $Z_{xx}$ and $Z_{yy}$ are invertible, we multiply the first equation by $Z_{xx}^{-1}$ and the second by $Z_{yy}^{-1}$, giving
$$Z_{xy}\tilde{\beta} - \lambda Z_{xx}\tilde{\alpha} = 0, \qquad (4.11)$$
$$Z_{yx}\tilde{\alpha} - \lambda Z_{yy}\tilde{\beta} = 0. \qquad (4.12)$$
We are able to rewrite $\tilde{\beta}$ from equation (4.12) as
$$\tilde{\beta} = \frac{Z_{yy}^{-1} Z_{yx}\tilde{\alpha}}{\lambda},$$
and substituting in equation (4.11) gives
$$Z_{xy} Z_{yy}^{-1} Z_{yx}\tilde{\alpha} = \lambda^2 Z_{xx}\tilde{\alpha}. \qquad (4.13)$$
We are left with a generalised eigenproblem of the form $Ax = \lambda Bx$. Let $SS'$ be the complete Cholesky decomposition of $Z_{xx}$, such that $Z_{xx} = SS'$ where $S$ is a lower triangular matrix, and let $\hat{\alpha} = S'\tilde{\alpha}$. Substituting in equation (4.13) we obtain
$$S^{-1} Z_{xy} Z_{yy}^{-1} Z_{yx} (S')^{-1}\hat{\alpha} = \lambda^2\hat{\alpha},$$
a symmetric eigenproblem of the form $Ax = \lambda x$.
We are able to rewrite $\tilde{\beta}$ from equation (4.17) as
$$\tilde{\beta} = \frac{(Z_{yy} + \kappa I)^{-1} Z_{yx}\tilde{\alpha}}{\lambda},$$
and substituting in equation (4.16) gives
$$Z_{xy}(Z_{yy} + \kappa I)^{-1} Z_{yx}\tilde{\alpha} = \lambda^2 (Z_{xx} + \kappa I)\tilde{\alpha}.$$
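The reduced, regularised problem is a symmetric-definite generalised eigenproblem and can be handed to a standard solver. The sketch below is ours; it assumes the factors $R_x$, $R_y$ from partial Gram-Schmidt and uses scipy.linalg.eigh.

```python
import numpy as np
from scipy.linalg import eigh

def reduced_kcca(Rx, Ry, kappa):
    """Sketch of the reduced, regularised KCCA eigenproblem.

    Rx, Ry: lower-triangular factors from partial Gram-Schmidt, so that
    K_x ~ Rx Rx' and K_y ~ Ry Ry'.  Solves
        Z_xy (Z_yy + kI)^{-1} Z_yx a = lambda^2 (Z_xx + kI) a.
    """
    Zxx, Zyy, Zxy = Rx.T @ Rx, Ry.T @ Ry, Rx.T @ Ry
    My = Zyy + kappa * np.eye(Zyy.shape[0])
    A = Zxy @ np.linalg.solve(My, Zxy.T)          # Z_xy (Z_yy + kI)^{-1} Z_yx
    B = Zxx + kappa * np.eye(Zxx.shape[0])
    lam2, alphas = eigh(A, B)                     # generalised problem, ascending
    lam = np.sqrt(np.clip(lam2[::-1], 0.0, None)) # descending order
    alphas = alphas[:, ::-1]
    betas = np.linalg.solve(My, Zxy.T @ alphas) / np.maximum(lam, 1e-12)
    return lam, alphas, betas
```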
5 Experimental Results
In the following experiments the problem of learning the semantics of multimedia content by combining image and text data is addressed. The synthesis is addressed by the kernel Canonical Correlation Analysis described in Section 4.3. We test the use of the derived semantic space in an image retrieval task that uses only image content. The aim is to allow retrieval of images from a text query but without reference to any labelling associated with the image. This can be viewed as a cross-modal retrieval task. We used the combined multimedia image-text web database kindly provided by the authors of [15], on which we also attempt mate retrieval on a test set. The data was divided into three classes (Figure 1): Sports, Aviation and Paintball, with 400 records each, consisting of JPEG images retrieved from the Internet with attached text. We randomly split each class into two halves, used as training and test data respectively. The features extracted from the data are the same as in [15] (a detailed description of the features can be found there): image HSV colour, image Gabor texture and term frequencies in text.
We compute the value of κ for the regularisation by running KCCA with the association between image and text randomised. Let λ(κ) be the spectrum without randomisation (the database paired with itself) and λ_R(κ) be the spectrum with randomisation (the database paired with a randomised version of itself), where by spectrum we mean the vector whose entries are the eigenvalues. We would like the non-random spectrum to be as distant as possible from the randomised spectrum, since if the same correlation occurs for λ(κ) and λ_R(κ) then clearly over-fitting is taking place. Therefore, for κ = 0 (no regularisation) and j = (1, . . . , 1)' the all-ones vector, we expect that we may have λ(κ) = λ_R(κ) = j, since it is very possible that the examples are linearly independent. Though we find that only 50% of the examples are linearly independent, this does not affect the selection of κ through this method. We choose the κ for which the difference between the two spectra is maximal:
$$\kappa = \arg\max_{\kappa} \|\lambda_R(\kappa) - \lambda(\kappa)\|.$$

[Figure 1: example images from the three classes Sports, Aviation and Paintball.]
We find κ = 7, and we set the Gram-Schmidt precision parameter η = 0.5 via a heuristic technique.
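The κ-selection heuristic can be sketched as follows. This is our illustration: `kcca_spectrum` is a hypothetical placeholder for a routine returning the vector of KCCA correlations (for instance the solver sketched in Section 4), and the grid of candidate κ values is our assumption.

```python
import numpy as np

def select_kappa(Kx, Ky, kappa_grid, kcca_spectrum, rng=None):
    """Heuristic a-priori choice of the regularisation parameter kappa.

    kcca_spectrum(Kx, Ky, kappa) is assumed to return the spectrum (vector of
    KCCA correlations).  The pairing is shuffled to obtain lambda_R.
    """
    rng = np.random.default_rng(rng)
    shuffled = rng.permutation(Ky.shape[0])
    Ky_rand = Ky[np.ix_(shuffled, shuffled)]      # randomised image-text association
    best_kappa, best_gap = None, -np.inf
    for kappa in kappa_grid:
        lam = kcca_spectrum(Kx, Ky, kappa)
        lam_r = kcca_spectrum(Kx, Ky_rand, kappa)
        gap = np.linalg.norm(lam_r - lam)         # ||lambda_R(kappa) - lambda(kappa)||
        if gap > best_gap:
            best_kappa, best_gap = kappa, gap
    return best_kappa
```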
To perform the test image retrieval we compute the features of the test images and text queries using the Gram-Schmidt algorithm. Once we have obtained the features for the test query (text) and test images, we project them into the semantic feature space using $\tilde{\beta}$ and $\tilde{\alpha}$ (which are computed through training) respectively. We can then compare them using the inner product of their semantic feature vectors: the higher the value of the inner product, the more similar the two objects are. Hence we retrieve the images whose inner products with the test query are highest.
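The retrieval step itself reduces to inner products in the semantic space. The following sketch is ours (array names are illustrative) and ranks test images against a projected text query.

```python
import numpy as np

def retrieve_images(query_feats, image_feats, alpha, beta, n_retrieve=10):
    """Rank test images against a projected test text query.

    query_feats : feature vector of the text query in the Gram-Schmidt basis.
    image_feats : one row per test image in the corresponding image basis.
    alpha, beta : projection directions learned by KCCA (image and text sides).
    Returns the indices of the n_retrieve images with the largest inner product.
    """
    sem_query = query_feats @ beta           # project the text query
    sem_images = image_feats @ alpha         # project the candidate images
    scores = sem_images @ sem_query          # inner products in semantic space
    return np.argsort(-scores)[:n_retrieve]
```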
Image Set GVSM success KCCA success (30) KCCA success (5)
10 78.93% 85% 90.97%
30 76.82% 83.02% 90.69%
[Figure: success rate (%) of KCCA (with 5 and 30 eigenvectors) and GVSM against image set size.]
Figure 3 Images retrieved for the text query: ”height: 6-11 weight: 235 lbs
position: forward born: september 18, 1968, split, croatia college: none”
Image set GVSM success KCCA success (30) KCCA success (150)
10 8% 17.19% 59.5%
30 19% 32.32% 69%
In Table 3 we compare the performance of the KCCA algorithm with the GVSM over 10 and 30 image sets, while in Table 4 we present the overall success over all image sets. In Figure 6 we see the overall performance of the KCCA method against the GVSM for all possible image sets.

[Figure: overall success (%) against the number of eigenvectors used.]
The success rate in Table 3 and Figure 6 is computed as the percentage of the 600 test queries for which the correct match is contained in the retrieved image set,
$$\text{success} = \frac{1}{600}\sum_{j=1}^{600} \text{count}_j,$$
where count_j = 1 if the match for query j is retrieved and 0 otherwise.
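As a small illustration (ours, not from the original experiments), the success rate can be computed from a vector of per-query outcomes:

```python
import numpy as np

def success_rate(count):
    """Success rate (%) over the test queries.

    count : 0/1 array where count[j] = 1 if the correct match for test
            query j was among the retrieved images.  Illustrative helper only.
    """
    count = np.asarray(count, dtype=float)
    return 100.0 * count.sum() / len(count)
```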
Figure 5 Images retrieved for the text query: ”at phoenix sky harbor on july 6,
1997. 757-2s7, n907wa phoenix suns taxis past n902aw teamwork america
west america west 757-2s7, n907wa phoenix suns taxis past n901aw
arizona at phoenix sky harbor on july 6, 1997.” The actual match is the
middle picture in the first row.
Figure 6 Success plot for KCCA mate-based against GVSM (success (%)
against image set size).
The difference in performance between the a priori value κ̂ and the newly found optimal value κ is 1.0423% for 5 eigenvectors and 5.031% for 30 eigenvectors. The more substantial increase in performance in the latter case is due to the increase in the regularisation parameter, which compensates for the substantial decrease in performance of the content-based retrieval (Figure 6) when a high-dimensional semantic feature space is used.
[Figures: overall success (%) plotted against the number of eigenvectors and against the regularisation parameter κ.]
The difference in performance between the a priori value κ̂ and the newly found optimal value κ is 0.627% for 150 eigenvectors and 0.7586% for 30 eigenvectors.
Our observed results support our proposed method for selecting the regularisation parameter κ in an a priori fashion, since the difference between the actual optimal κ and the a priori κ̂ is very slight.
6 Generalisation of Canonical Correlation Analysis

Proposition 3. Consider an optimisation problem with a convex objective function $f(x, y)$ of the form
$$\min_{x, y} f(x, y) \qquad (6.4)$$
subject to (6.5)
$$g(y) = 0, \qquad (6.6)$$
$$x \in \mathbb{R}^m,\ y \in \mathbb{R}^n. \qquad (6.7)$$
Let $Y \subseteq \mathbb{R}^n$ be the feasibility domain for $y$ determined by the constraint $g(y) = 0$. Replacing $x$ by the optimal solution $x(y)$ of the inner problem for each fixed $y$, the problem
$$\min_{y} f(x(y), y) \qquad (6.8)$$
subject to (6.9)
$$g(y) = 0, \qquad (6.10)$$
$$y \in \mathbb{R}^n, \qquad (6.11)$$
has the same optimal solution in $y$ as equation (6.4).

Proof. Let the optimal solution of equation (6.4) be denoted by $(x_1, y_1)$ and that of equation (6.8) by $y_2$. From the convexity of $f$ and the identical feasibility domains, the optimal solutions must coincide.
Let $H^{(1)}$ and $H^{(2)}$ be the data matrices of the two views, with the observed variables as their columns. Introducing notation for the products of these matrices to simplify the formulas:
$$\Sigma_{ij} = H^{(i)\prime} H^{(j)}, \qquad i, j = 1, 2. \qquad (6.12)$$
We are looking for linear combinations of the columns of these matrices such that the first pair of vectors $(a^{(1)}_1, a^{(2)}_1)$ is the optimal solution of the optimisation problem
$$\max_{a^{(1)}_1, a^{(2)}_1} a^{(1)\prime}_1 \Sigma_{12}\, a^{(2)}_1 \qquad (6.13)$$
subject to
$$a^{(1)\prime}_1 \Sigma_{11}\, a^{(1)}_1 = 1, \qquad a^{(2)\prime}_1 \Sigma_{22}\, a^{(2)}_1 = 1.$$
The meaning of this optimisation problem is to find the maximum correlation between linear combinations of the columns of the matrices $H^{(1)}$ and $H^{(2)}$, subject to the lengths of the vectors corresponding to these linear combinations being normalised to 1.
To determine the remaining pairs of vectors, the columns of $A^{(1)}$ and $A^{(2)}$, a series of optimisation problems is solved successively. For the pair of vectors $(a^{(1)}_r, a^{(2)}_r)$, $r = 2, \ldots, p$, we have
$$\max_{a^{(1)}_r, a^{(2)}_r} a^{(1)\prime}_r \Sigma_{12}\, a^{(2)}_r$$
subject to
$$a^{(k)\prime}_r \Sigma_{kk}\, a^{(k)}_r = 1,$$
$$a^{(k)\prime}_r \Sigma_{kk}\, a^{(k)}_j = 0,$$
$$a^{(k)\prime}_r \Sigma_{kl}\, a^{(l)}_j = 0, \qquad (6.17)$$
$$k, l = 1, 2, \quad j = 1, \ldots, r - 1.$$
This is the problem (6.13) expanded by the orthogonality constraints (6.17): the components of every new pair in the iteration have to be orthogonal to the components of the previous pairs.
After the substitution $y^{(k)}_1 = \Sigma_{kk}^{1/2} a^{(k)}_1$, $k = 1, 2$, and with the notation $D_{12} = \Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}$, the problem for the first pair becomes
$$\max_{y^{(1)}_1, y^{(2)}_1} y^{(1)\prime}_1 D_{12}\, y^{(2)}_1 \qquad (6.26)$$
subject to (6.27)
$$y^{(k)\prime}_1 y^{(k)}_1 = 1, \qquad k = 1, 2. \qquad (6.28)$$
The corresponding Lagrangian is
$$L_1 = y^{(1)\prime}_1 D_{12}\, y^{(2)}_1 + \frac{\lambda_1}{2}\big(1 - y^{(1)\prime}_1 y^{(1)}_1\big) + \frac{\lambda_2}{2}\big(1 - y^{(2)\prime}_1 y^{(2)}_1\big), \qquad (6.30)$$
where $\lambda_1$ and $\lambda_2$ are the Lagrangian multipliers. The vectors of partial derivatives of $L_1$ with respect to the vectors $y^{(1)}_1$, $y^{(2)}_1$ are equal to 0 by the KKT conditions, thus we get
$$\frac{\partial L_1}{\partial y^{(1)}_1} = D_{12}\, y^{(2)}_1 - \lambda_1 y^{(1)}_1 = 0, \qquad (6.31)$$
$$\frac{\partial L_1}{\partial y^{(2)}_1} = D_{21}\, y^{(1)}_1 - \lambda_2 y^{(2)}_1 = 0. \qquad (6.32)$$
Multiplying equation (6.31) by $y^{(1)\prime}_1$ and equation (6.32) by $y^{(2)\prime}_1$ provides
$$y^{(1)\prime}_1 D_{12}\, y^{(2)}_1 - \lambda_1 y^{(1)\prime}_1 y^{(1)}_1 = 0, \qquad (6.33)$$
$$y^{(2)\prime}_1 D_{21}\, y^{(1)}_1 - \lambda_2 y^{(2)\prime}_1 y^{(2)}_1 = 0. \qquad (6.34)$$
Based on the constraints of the optimisation problem (6.26) and the identity $D_{21} = D_{12}'$ we have
$$\lambda_1 = \lambda_2 = y^{(1)\prime}_1 D_{12}\, y^{(2)}_1.$$
After replacing $\lambda_1$ and $\lambda_2$ with $\lambda$ the following equality system can be formulated:
$$\begin{pmatrix} -\lambda I & D_{12} \\ D_{21} & -\lambda I \end{pmatrix}\begin{pmatrix} y^{(1)}_1 \\ y^{(2)}_1 \end{pmatrix} = 0.$$
It is not too hard to realise that this equality system is a singular value problem of the matrix $D_{12}$, having $y^{(1)}_1$ and $y^{(2)}_1$ as a left and a right singular vector and the value of the Lagrangian $\lambda$ equal to the corresponding singular value. Based on these statements we can claim that the optimal solutions are the singular vectors belonging to the greatest singular value of the matrix $D_{12}$.
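This characterisation suggests a direct computation: form $D_{12}$ and take its singular value decomposition. The following NumPy sketch is ours (it assumes centred data and non-singular within-set matrices) and recovers the first p canonical correlations and weight vectors.

```python
import numpy as np

def _inv_sqrt(S):
    """Inverse symmetric square root of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def cca_via_svd(H1, H2, p=1):
    """Sketch of CCA through the SVD of D_12 = S11^{-1/2} S12 S22^{-1/2}."""
    S11, S22, S12 = H1.T @ H1, H2.T @ H2, H1.T @ H2
    S11i, S22i = _inv_sqrt(S11), _inv_sqrt(S22)
    D12 = S11i @ S12 @ S22i
    U, s, Vt = np.linalg.svd(D12)
    A1 = S11i @ U[:, :p]        # undo the substitution y = Sigma^{1/2} a
    A2 = S22i @ Vt[:p, :].T
    return s[:p], A1, A2        # singular values are the canonical correlations
```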
The canonical vectors can also be obtained simultaneously, by solving
$$\max_{(a^{(1)}_1, a^{(2)}_1), \ldots, (a^{(1)}_p, a^{(2)}_p)} \sum_{i=1}^{p} a^{(1)\prime}_i \Sigma_{12}\, a^{(2)}_i \qquad (6.38)$$
subject to (6.39)
$$a^{(1)\prime}_i \Sigma_{11}\, a^{(1)}_j = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases} \qquad (6.40)$$
$$a^{(2)\prime}_i \Sigma_{22}\, a^{(2)}_j = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases} \qquad (6.41)$$
$$i, j = 1, \ldots, p, \qquad (6.42)$$
$$a^{(1)\prime}_i \Sigma_{12}\, a^{(2)}_j = 0, \qquad (6.43)$$
$$i, j = 1, \ldots, p, \quad j \neq i.$$
Based on equation (6.37) and the definition of the Frobenius norm we have a compact formulation of the canonical correlation problem:
$$\max_{A^{(1)}, A^{(2)}} \operatorname{Tr}\!\big(A^{(1)\prime}\, \Sigma_{12}\, A^{(2)}\big) \qquad (6.44)$$
subject to
$$A^{(k)\prime}\, \Sigma_{kk}\, A^{(k)} = I,$$
$$a^{(k)\prime}_i \Sigma_{kl}\, a^{(l)}_j = 0,$$
$$k, l = \{1, 2\},\ l \neq k, \quad i, j = 1, \ldots, p,\ j \neq i,$$
where $I$ is the identity matrix of size $p \times p$. Repeating the substitution in equation (6.23), the set of feasible vectors for the simultaneous problem is equal to the left and right singular vectors of the matrix $D_{12}$; hence the optimal solution is compatible with that of the successive problems.
The same solution can be obtained from a minimisation problem, minimising the squared Frobenius-norm distance between the transformed sets,
$$\min_{A^{(1)}, A^{(2)}} \big\|H^{(1)} A^{(1)} - H^{(2)} A^{(2)}\big\|_F^2 \qquad (6.49)$$
subject to
$$A^{(k)\prime}\, \Sigma_{kk}\, A^{(k)} = I,$$
$$a^{(k)\prime}_i \Sigma_{kl}\, a^{(l)}_j = 0,$$
$$k, l = 1, \ldots, 2,\ l \neq k, \quad i, j = 1, \ldots, p,\ j \neq i.$$
Unfolding the objective function of the minimisation problem (6.49) shows that the optimisation problem is the same as the maximisation problem (6.44). This formulation generalises naturally to $K \geq 2$ sets of variables: minimise the sum of the squared distances between all pairs of transformed sets,
$$\min_{A^{(1)}, \ldots, A^{(K)}} \sum_{k \neq l} \big\|H^{(k)} A^{(k)} - H^{(l)} A^{(l)}\big\|_F^2$$
subject to
$$A^{(k)\prime}\, \Sigma_{kk}\, A^{(k)} = I,$$
$$a^{(k)\prime}_i \Sigma_{kl}\, a^{(l)}_j = 0,$$
$$k, l = 1, \ldots, K,\ l \neq k, \quad i, j = 1, \ldots, p,\ j \neq i.$$
In the forthcoming sections we will show how to simplify this problem.
The total squared distance, i.e. the sum of the squared Euclidean distances between all possible pairs of vectors in X, is equal to
$$\frac{1}{2}\sum_{k=1}^{m}\sum_{l=1,\,l\neq k}^{m} \|x_k - x_l\|_2^2 = \frac{1}{2}\sum_{k,l=1}^{m}\sum_{i=1}^{n} (x_{ki} - x_{li})^2 = \frac{1}{2}\sum_{k,l=1}^{m}\sum_{i=1}^{n} \big(x_{ki}^2 + x_{li}^2 - 2 x_{ki} x_{li}\big)$$
$$= \frac{1}{2}\sum_{i=1}^{n}\Big(m\sum_{k=1}^{m} x_{ki}^2 + m\sum_{l=1}^{m} x_{li}^2 - 2\sum_{k=1}^{m} x_{ki}\sum_{l=1}^{m} x_{li}\Big) = m\sum_{i=1}^{n}\sum_{k=1}^{m} (x_{ki} - M_i)^2,$$
where $M_i$ denotes the mean of the $i$th components of the vectors in X. Hence the total squared distance turns out to be equal to the sum of the component-wise variances of the vectors in X multiplied by the square of the number of vectors. Similarly, the vector $z$ minimising the total squared distance to the vectors of X has components
$$z_i = \frac{1}{m}\sum_{k=1}^{m} x_{ki},$$
i.e. the components of the optimal solution are equal to the mean values of the corresponding components of the known vectors.
where $a^{(k)}_i$ denotes the $i$th column of the matrix $A^{(k)}$ containing the possible linear combinations. After the substitution $y^{(k)}_i = \Sigma_{kk}^{1/2} a^{(k)}_i$ the constraints read
$$y^{(k)\prime}_i y^{(k)}_j = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases} \qquad (6.76)$$
$$k = 1, \ldots, K, \quad i, j = 1, \ldots, p, \qquad (6.77)$$
$$y^{(k)\prime}_i D_{kl}\, y^{(l)}_j = 0, \qquad (6.78)$$
$$k, l = 1, \ldots, K,\ k \neq l, \quad i, j = 1, \ldots, p,\ i \neq j, \qquad (6.79)$$
where
$$Q_k = H^{(k)}\, \Sigma_{kk}^{-1/2}, \qquad D_{kl} = Q_k' Q_l.$$
We can derive another statement about the optimal solution of the problem. Exploiting the definition of the Frobenius norm, the objective function (6.70) can be rewritten as a sum of Euclidean norms of column vectors, where $x_i$ denotes the $i$th column of the matrix X:
$$\frac{1}{K}\sum_{k=1}^{K}\big\|X - H^{(k)} A^{(k)}\big\|_F^2 = \frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{p}\big\|x_i - H^{(k)} a^{(k)}_i\big\|^2 = \frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{p}\big\|x_i - Q_k y^{(k)}_i\big\|^2$$
$$= \frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{p}\big\langle x_i - Q_k y^{(k)}_i,\ x_i - Q_k y^{(k)}_i\big\rangle.$$
The constraints are formulated in equation (6.76).
For the Lagrangian function of the optimisation problem we have
$$L = \sum_{k=1}^{K}\sum_{i=1}^{p}\big\langle x_i - Q_k y^{(k)}_i,\ x_i - Q_k y^{(k)}_i\big\rangle \qquad (6.88)$$
$$\quad + \sum_{k}\sum_{i} \lambda_{k,ii}\big(1 - y^{(k)\prime}_i y^{(k)}_i\big) \qquad (6.89)$$
$$\quad + \sum_{k}\sum_{\substack{i,j,\ i\neq j}} \lambda_{k,ij}\big(-y^{(k)\prime}_i y^{(k)}_j\big) \qquad (6.90)$$
$$\quad + \sum_{\substack{k,l,\ k\neq l}}\sum_{\substack{i,j,\ i\neq j}} \lambda_{kl,ij}\big(-y^{(k)\prime}_i D_{kl}\, y^{(l)}_j\big), \qquad (6.91)$$
where we disregard the constant $\frac{1}{K}$ from the objective function (6.70).
After computing the partial derivatives, where $x_i$ denotes the $i$th column of the matrix X, we get
$$\frac{\partial L}{\partial x_i} = \sum_{k=1}^{K}\big(2 x_i - 2 Q_k y^{(k)}_i\big) = 0, \qquad i = 1, \ldots, p, \qquad (6.92)$$
$$\frac{\partial L}{\partial y^{(k)}_i} = 2 D_{kk}\, y^{(k)}_i - 2 Q_k' x_i - 2\lambda_{k,ii}\, y^{(k)}_i - 2\sum_{j \neq i}^{p} \lambda_{k,ij}\, y^{(k)}_j - 2\sum_{l \neq k}^{K}\sum_{j \neq i}^{p} \lambda_{kl,ij}\, D_{kl}\, y^{(l)}_j = 0, \qquad (6.93)$$
$$k = 1, \ldots, K, \quad i = 1, \ldots, p. \qquad (6.94)$$
Based on Proposition 3 we can replace the variable X in equation (6.70) by an expression of the other variables without changing the optimum value or the optimal solution. Thus we arrive at the variance formulation of the problem.
7 Conclusions
Through this study we have presented a tutorial on canonical correlation analysis and have established a novel general approach to retrieving images based solely on their content, which we applied to content-based and mate-based retrieval. Experiments show that image retrieval can be more accurate than with the Generalised Vector Space Model. We demonstrate that one can choose the regularisation parameter κ a priori so that it performs well in very different regimes. Hence we have come to the conclusion that kernel Canonical Correlation Analysis is a powerful tool for image retrieval via content. In the future we will extend our experiments to other data collections.
These approaches can give tools to handle some problems in the kernel space, where the inner products and the distances between the points are known but the coordinates are not. For some problems it is sufficient to know only the coordinates of a few special points, which can be expressed from the known inner products, e.g. performing cluster analysis in the kernel space and computing the coordinates of the cluster centres only.
Acknowledgments
We would like to acknowledge the financial support of EU Projects KerMIT,
No. IST-2000-25341 and LAVA, No. IST-2001-34405.
1 Proof of $\|K - G^i G^{i\prime}\| \leq \eta$
1.1 Some notation
Lemma 4. Let A and B be square matrices and let $\operatorname{Trace}(A) = \sum_{i=1}^{n} a_{ii}$. Then $\operatorname{Trace}(AB) = \operatorname{Trace}(BA)$.
Proof.
$$\operatorname{Trace}(AB) = \sum_{i=1}^{n} (AB)_{ii} = \sum_{i,j=1}^{n} a_{ij} b_{ji} = \sum_{j,i=1}^{n} b_{ji} a_{ij} = \sum_{j=1}^{n} (BA)_{jj} = \operatorname{Trace}(BA).$$
Lemma 5. Let A be a symmetric matrix with eigenvalue decomposition $A = V\Lambda V'$, where $V$ is orthonormal and $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_i$. Then $\operatorname{Trace}(A) = \sum_i \lambda_i$.
Proof.
$$\operatorname{Trace}(\Lambda) = \operatorname{Trace}(V'AV) = \operatorname{Trace}((V'A)V) = \operatorname{Trace}(V(V'A)) = \operatorname{Trace}(VV'A) = \operatorname{Trace}(A),$$
and $\Lambda_{ii} = \lambda_i$.
Lemma 6. Let A be a symmetric positive semi-definite matrix with eigenvalues $\lambda_i$. Then $\|A\| = \max_i \lambda_i$.
Proof. By definition
$$\|A\| = \max_{x \neq 0} \frac{\|Ax\|}{\|x\|}.$$
For any $c \in \mathbb{R}$, scaling $x$ by $c$ does not change the quotient, hence we may take $\|x\| = 1$ and obtain
$$\|Ax\|^2 = x'A'Ax.$$
Let $UDU'$ be the eigenvalue decomposition of $A'A$, so that $D$ is a diagonal matrix containing the squares of the eigenvalues of A:
$$A'A = UDU', \qquad \|Ax\|^2 = x'UDU'x.$$
Setting $w = U'x$, and as $U$ is orthogonal we can rewrite $\|x\| = 1$ as $\|w\| = 1$, so
$$\|A\|^2 = \max_{\|w\|=1} w'Dw = \max_{\|w\|=1} \sum_i \lambda_i^2 w_i^2 = \max_i \lambda_i^2.$$
Hence we obtain $\|A\| = \max_i \lambda_i$.
1.2 Proof
Theorem 7. If K is a positive definite matrix and $GG'$ is its incomplete Cholesky decomposition, then the Euclidean norm of $GG'$ subtracted from K is less than or equal to the trace of the uncalculated part of K. Let $\Delta K^i$ be the uncalculated part of K and let $\eta = \operatorname{Trace}(\Delta K^i)$; then $\|K - G^i G^{i\prime}\| \leq \eta$.
Proof. Let $K = GG'$ be the complete Cholesky decomposition of K, where G is a lower triangular matrix (the upper triangular part is zero), written in block form as
$$G = \begin{pmatrix} A & 0 \\ B & C \end{pmatrix}.$$
Let $G^i G^{i\prime}$ be the incomplete decomposition of K, where $i$ is the number of iterations of the Cholesky factorisation procedure performed, so that
$$G^i = G_{1:n, 1:i} = \begin{pmatrix} A \\ B \end{pmatrix}$$
and $G^i G^{i\prime} = \tilde{K}^i$, where $\tilde{K}^i$ is the approximation of K subject to a symmetric permutation of rows and columns. Assuming that the rows and columns of K have already been permuted accordingly, we have
$$\tilde{K}^i = G^i G^{i\prime} = \begin{pmatrix} AA' & AB' \\ BA' & BB' \end{pmatrix}, \qquad \Delta K^i = K - \tilde{K}^i = \begin{pmatrix} 0 & 0 \\ 0 & CC' \end{pmatrix}.$$
We show that $CC'$ is positive semi-definite:
$$CC' = K_{i+1:n,\, i+1:n} - \tilde{K}^i_{i+1:n,\, i+1:n} = K_{i+1:n,\, i+1:n} - BB' = K_{i+1:n,\, i+1:n} - B \cdot A^{-1} \cdot A \cdot B'$$
$$= K_{i+1:n,\, i+1:n} - B \cdot A^{-1} \cdot (AB') = K_{i+1:n,\, i+1:n} - G_{i+1:n,\, 1:i} \cdot G_{1:i,\, 1:i}^{-1} \cdot K_{1:i,\, i+1:n}.$$
Therefore, for any x,
$$x'CC'x = \langle C'x,\ C'x\rangle \geq 0,$$
so all eigenvalues of $CC'$ are non-negative; $CC'$ is a positive semi-definite matrix and hence $\Delta K^i$ is also positive semi-definite. Using Lemma 6 we are now able to show that
$$\|K - \tilde{K}^i\| = \|K - G^i G^{i\prime}\| = \|\Delta K^i\| = \max_i \lambda_i,$$
where the $\lambda_i$ are the eigenvalues of $\Delta K^i$. As the maximum eigenvalue is less than or equal to the sum of all the eigenvalues, using Lemma 5 we are able to rewrite the expression as
$$\|K - G^i G^{i\prime}\| \leq \sum_i \lambda_i = \operatorname{Trace}(\Lambda) = \operatorname{Trace}(\Delta K^i).$$
Therefore,
$$\|K - G^i G^{i\prime}\| \leq \eta.$$
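The bound can be spot-checked numerically. The snippet below is ours; it reuses the pivoted incomplete Cholesky sketch from Section 4 (a function of our own, not the authors' code) and verifies that the spectral norm of the residual stays below the stopping threshold η.

```python
import numpy as np

# Numerical spot-check of Theorem 7 (illustrative only): for a random PSD
# kernel, the spectral norm of K - G G' should not exceed the stopping
# threshold eta used by the incomplete_cholesky sketch given earlier.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 5))
K = X @ X.T                                     # a positive semi-definite kernel
eta = 1e-3
perm, G = incomplete_cholesky(K, eta)           # sketch from Section 4
Kp = K[np.ix_(perm, perm)]                      # symmetrically permuted kernel
residual_norm = np.linalg.norm(Kp - G @ G.T, 2) # spectral norm of the residual
assert residual_norm <= eta + 1e-9
```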
Bibliography
[2] Francis Bach and Michael Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
[6] Nello Cristianini, John Shawe-Taylor, and Huma Lodhi. Latent semantic kernels. In Carla Brodley and Andrea Danyluk, editors, Proceedings of ICML-01, 18th International Conference on Machine Learning, pages 66–73. Morgan Kaufmann Publishers, San Francisco, US, 2001.
[7] Colin Fyfe and Pei Ling Lai. ICA using kernel canonical correlation analysis.
[8] Colin Fyfe and Pei Ling Lai. Kernel and nonlinear canonical correlation analysis.
International Journal of Neural Systems, 2001.
[11] David R. Hardoon and John Shawe-Taylor. KCCA for different level precision in content-based image retrieval. Submitted to the Third International Workshop on Content-Based Multimedia Indexing, IRISA, Rennes, France, 2003.
[12] H. Hotelling. Relations between two sets of variates. Biometrika, 28:312– 377,
1936.
[13] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. John Wiley &
Sons, Inc, 1966.
[16] Malte Kuss and Thore Graepel. The geometry of kernel canonical correlation analysis. 2002.
[17] Yong Rui, Thomas S. Huang, and Shih-Fu Chang. Image retrieval: Current techniques, promising directions, and open issues. Journal of Visual Communications and Image Representation, 10:39–62, 1999.
[18] Alexei Vinokourov, David R. Hardoon, and John Shawe-Taylor. Learning the semantics of multimedia content with application to web image retrieval and classification. In Proceedings of the Fourth International Symposium on Independent Component Analysis and Blind Source Separation, Nara, Japan, 2003.
[19] Alexei Vinokourov, John Shawe-Taylor, and Nello Cristianini. Inferring a semantic representation of text via cross-language correlation analysis. In Advances in Neural Information Processing Systems 15 (to appear), 2002.