A Hilbert Space Embedding for Distributions

Alex Smola¹, Arthur Gretton², Le Song¹, and Bernhard Schölkopf²

¹ National ICT Australia, North Road, Canberra 0200 ACT, Australia,
alex.smola@nicta.com.au, lesong@it.usyd.edu.au
² MPI for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany,
{arthur,bernhard.schoelkopf}@tuebingen.mpg.de

Abstract. We describe a technique for comparing distributions without
the need for density estimation as an intermediate step. Our approach re-
lies on mapping the distributions into a reproducing kernel Hilbert space.
Applications of this technique can be found in two-sample tests, which
are used for determining whether two sets of observations arise from the
same distribution, covariate shift correction, local learning, measures of
independence, and density estimation.

Kernel methods are widely used in supervised learning [1, 2, 3, 4]; however,
they are much less established in the areas of testing, estimation, and analysis
of probability distributions, where information theoretic approaches [5, 6] have
long been dominant. Recent examples include [7] in the context of construction
of graphical models, [8] in the context of feature extraction, and [9] in the context
of independent component analysis. These methods have by and large a com-
mon issue: to compute quantities such as the mutual information, entropy, or
Kullback-Leibler divergence, we require sophisticated space partitioning and/or
bias correction strategies [10, 9].
In this paper we give an overview of methods which are able to compute
distances between distributions without the need for intermediate density esti-
mation. Moreover, these techniques allow algorithm designers to specify which
properties of a distribution are most relevant to their problems. We are opti-
mistic that our embedding approach to distribution representation and analysis
will lead to the development of algorithms which are simpler and more effective
than entropy-based methods in a broad range of applications.
We begin our presentation in Section 1 with an overview of reproducing kernel
Hilbert spaces (RKHSs), and a description of how probability distributions can
be represented as elements in an RKHS. In Section 2, we show how these repre-
sentations may be used to address a variety of problems, including homogeneity
testing (Section 2.1), covariate shift correction (Section 2.2), independence mea-
surement (Section 2.3), feature extraction (Section 2.4), and density estimation
(Section 2.5).

1 Hilbert Space Embedding


1.1 Preliminaries
In the following we denote by X the domain of observations, and let Px be a
probability measure on X. Whenever needed, Y will denote a second domain,
with its own probability measure Py . A joint probability measure on X × Y will
be denoted by Px,y . We will assume all measures are Borel measures, and the
domains are compact.
We next introduce a reproducing kernel Hilbert space (RKHS) H of functions
on X with kernel k (the analogous definitions hold for a corresponding RKHS G
with kernel l on Y). This is defined as follows: H is a Hilbert space of functions
X → R with dot product ⟨·, ·⟩, satisfying the reproducing property:

⟨f(·), k(x, ·)⟩ = f(x)                                        (1a)
and consequently ⟨k(x, ·), k(x′, ·)⟩ = k(x, x′).              (1b)

This means we can view the linear map from a function f on X to its value at x
as an inner product. The evaluation functional is then given by k(x, ·), i.e. the
kernel function. Popular kernel functions on Rⁿ include the polynomial kernel
k(x, x′) = ⟨x, x′⟩^d, the Gaussian RBF kernel k(x, x′) = exp(−λ‖x − x′‖²), and
the Laplace kernel k(x, x′) = exp(−λ‖x − x′‖). Good kernel functions have been
defined on texts, graphs, time series, dynamical systems, images, and structured
objects. For recent reviews see [11, 12, 13].
An alternative view, which will come in handy when designing algorithms,
is that of a feature map. That is, we will consider maps x → φ(x) such that
k(x, x′) = ⟨φ(x), φ(x′)⟩ and likewise f(x) = ⟨w, φ(x)⟩, where w is a suitably
chosen “weight vector” (w can have infinite dimension, e.g. in the case of a
Gaussian kernel).
Many kernels are universal in the sense of [14]. That is, their Hilbert spaces H
are dense in the space of continuous bounded functions C0 (X) on the compact
domain X. For instance, the Gaussian and Laplacian RBF kernels share this
property. This is important since many results regarding distributions are stated
with respect to C0 (X) and we would like to translate them into results on Hilbert
spaces.

1.2 Embedding
At the heart of our approach are the following two mappings:

µ[Px] := Ex[k(x, ·)]                                          (2a)
µ[X] := (1/m) Σ_{i=1}^m k(xi, ·).                             (2b)

Here X = {x1, . . . , xm} is assumed to be drawn independently and identically
distributed from Px. If the (sufficient) condition Ex[k(x, x)] < ∞ is satisfied,
then µ[Px] is an element of the Hilbert space (as is, in any case, µ[X]). By virtue
of the reproducing property of H,
⟨µ[Px], f⟩ = Ex[f(x)]   and   ⟨µ[X], f⟩ = (1/m) Σ_{i=1}^m f(xi).

That is, we can compute expectations and empirical means with respect to Px
and X, respectively, by taking inner products with the means in the RKHS, µ[Px ]
and µ[X]. The representations µ[Px ] and µ[X] are attractive for the following
reasons [15, 16]:
Theorem 1. If the kernel k is universal, then the mean map µ : Px → µ[Px ]
is injective.
Moreover, we have fast convergence of µ[X] to µ[Px ] as shown in [17, Theorem
15]. Denote by Rm (H, Px ) the Rademacher average [18] associated with Px and
H via

Rm(H, Px) = E_{x1,...,xm} E_{σ1,...,σm} [ sup_{‖f‖_H ≤ 1} (1/m) Σ_{i=1}^m σi f(xi) ].     (3)

Rm (H, Px ) can be used to measure the deviation between empirical means and
expectations [17].
Theorem 2. Assume that ‖f‖_∞ ≤ R for all f ∈ H with ‖f‖_H ≤ 1. Then with
probability at least 1 − δ, ‖µ[Px] − µ[X]‖ ≤ 2Rm(H, Px) + R√(−m⁻¹ log δ).
This ensures that µ[X] is a good proxy for µ[Px ], provided the Rademacher
average is well behaved.
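To make the empirical embedding concrete, the following Python sketch (our own illustration; the Gaussian RBF kernel, its bandwidth, and the sample are assumptions, not taken from the paper) evaluates the empirical mean map, i.e. f(t) = ⟨µ[X], k(t, ·)⟩ = (1/m) Σ_{i=1}^m k(xi, t):

import numpy as np

def rbf_kernel(A, B, lam=1.0):
    # Gaussian RBF kernel matrix k(a, b) = exp(-lam * ||a - b||^2) between rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-lam * d2)

def mean_map(X, lam=1.0):
    # Returns f with f(T) = <mu[X], k(t, .)> = (1/m) sum_i k(x_i, t) for every row t of T.
    return lambda T: rbf_kernel(T, X, lam).mean(axis=1)

# Example (assumed data): embed a Gaussian sample and evaluate mu[X] on a small grid.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
f = mean_map(X, lam=0.5)
grid = np.linspace(-3.0, 3.0, 7).reshape(-1, 1)
print(f(grid))   # one smoothed value per grid point

Evaluating µ[X] on a grid in this way gives a kernel-smoothed summary of the sample; it is exactly this object that is compared across samples in what follows.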
Theorem 1 tells us that µ[Px] can be used to define distances between distri-
butions Px and Py, simply by letting D(Px, Py) := ‖µ[Px] − µ[Py]‖. Theorem 2
tells us that we do not need to have access to actual distributions in order to
compute D(Px, Py) approximately — as long as Rm(H, Px) = O(m^{-1/2}), a finite
sample from the distributions will yield error of O(m^{-1/2}). See [18] for an analysis
of the behavior of Rm(H, Px) when H is an RKHS.
This allows us to use D(Px, Py) as a drop-in replacement wherever information
theoretic quantities would otherwise be used, e.g. for the purpose of
determining whether two sets of observations have been drawn from the same
distribution. Note that there is a strong connection between Theorem 2 and uni-
form convergence results commonly used in Statistical Learning Theory [19, 16].
This is captured in the theorem below:
Theorem 3. Let F be the unit ball in the reproducing kernel Hilbert space H.
Then the deviation between empirical means and expectations for any f ∈ F is
bounded:

sup_{f∈F} | Ex[f(x)] − (1/m) Σ_{i=1}^m f(xi) | = ‖µ[Px] − µ[X]‖.


Bounding the probability that this deviation exceeds some threshold ε is one of
the key problems of statistical learning theory. See [16] for details. This means
that we have at our disposal a large range of tools typically used to assess
the quality of estimators. The key difference is that while those bounds are
typically used to bound the deviation between empirical and expected means
under the assumption that the data are drawn from the same distribution, we
will use the bounds in Section 2.1 to test whether this assumption is actually true,
and in Sections 2.2 and 2.5 to motivate strategies for approximating particular
distributions.
This is analogous to what is commonly done in the univariate case: the
Glivenko-Cantelli lemma allows one to bound deviations between empirical and
expected means for functions of bounded variation, as generalized by the work
of Vapnik and Chervonenkis [20, 21]. However, the Glivenko-Cantelli lemma also
leads to the Kolmogorov-Smirnov statistic comparing distributions by compar-
ing their cumulative distribution functions. Moreover, corresponding q-q plots
can be used as a diagnostic tool to identify where differences occur.

1.3 A View from the Marginal Polytope


The space of all probability distributions P is a convex set. Hence, the image
M := µ[P] of P under the linear map µ also needs to be convex. This set is
commonly referred to as the marginal polytope. Such mappings have become
a standard tool in deriving efficient algorithms for approximate inference in
graphical models and exponential families [22, 23].
We are interested in the properties of µ[P] in the case where P satisfies the
conditional independence relations specified by an undirected graphical model.
In [24], it is shown for this case that the sufficient statistics decompose along the
maximal cliques of the conditional independence graph.
More formally, denote by C the set of maximal cliques of the graph G and let
xc be the restriction of x ∈ X to the variables on clique c ∈ C. Moreover, let kc
be universal kernels in the sense of [14] acting on the restrictions of X to clique
c ∈ C. In this case [24] show that

k(x, x′) = Σ_{c∈C} kc(xc, x′c)                                (4)

can be used to describe all probability distributions with the above mentioned
conditional independence relations using an exponential family model with k as
its kernel. Since for exponential families expectations of the sufficient statistics
yield injections, we have the following result:

Corollary 1. On the class of probability distributions satisfying conditional in-


dependence properties according to a graph G with maximal clique set C and with
full support on their domain, the operator

µ[P] = Σ_{c∈C} µc[Pc] = Σ_{c∈C} E_{xc}[kc(xc, ·)]             (5)

is injective if the kernels kc are all universal. The same decomposition holds for
the empirical counterpart µ[X].
The condition of full support arises from the conditions of the Hammersley-
Clifford Theorem [25, 26]: without it, not all conditionally independent random
variables can be represented as the product of potential functions. Corollary 1
implies that we will be able to perform all subsequent operations on structured
domains simply by dealing with mean operators on the corresponding maximal
cliques.
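As a sketch of the clique decomposition in (4) (our own illustration; the chain-shaped graph and the RBF base kernels are assumptions), the kernel below sums a universal kernel over the coordinates of each maximal clique:

import numpy as np

def rbf(a, b, lam=1.0):
    # Gaussian RBF kernel between two vectors.
    return np.exp(-lam * np.sum((a - b) ** 2))

def clique_kernel(x, x_prime, cliques, lam=1.0):
    # Eq. (4): k(x, x') = sum_c k_c(x_c, x'_c), with each k_c an RBF on the clique coordinates.
    return sum(rbf(x[list(c)], x_prime[list(c)], lam) for c in cliques)

# Example (assumed structure): a chain x1 - x2 - x3 has maximal cliques {0, 1} and {1, 2}.
cliques = [(0, 1), (1, 2)]
x = np.array([0.1, 0.5, -0.2])
y = np.array([0.0, 0.4, 0.3])
print(clique_kernel(x, y, cliques))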

1.4 Choosing the Hilbert Space


Identifying probability distributions with elements of Hilbert spaces is not new:
see e.g. [27]. However, this leaves the obvious question of which Hilbert space to
employ. We could informally choose a space with a kernel equalling the Delta
distribution k(x, x′) = δ(x, x′), in which case the operator µ would simply be
the identity map (which restricts us to probability distributions with square
integrable densities).
The latter is in fact what is commonly done on finite domains (hence the L2
integrability condition is trivially satisfied). For instance, [22] effectively use the
Kronecker Delta δ(xc, x′c) as their feature map. The use of kernels has additional
advantages: we need not deal with the issue of representation of the sufficient
statistics or whether such a representation is minimal (i.e. whether the sufficient
statistics actually span the space).
Whenever we have knowledge about the class of functions F we would like to
analyze, we should be able to trade off simplicity in F with better approximation
behavior in P. For instance, assume that F contains only linear functions. In this
case, µ only needs to map P into the space of all expectations of x. Consequently,
one may expect very good constants in the convergence of µ[X] to µ[Px ].

2 Applications
While the previous description may be of interest on its own, it is in application to
areas of statistical estimation and artificial intelligence that its relevance becomes
apparent.

2.1 Two-Sample Test


Since we know that µ[X] → µ[Px ] with a fast rate (given appropriate behavior
of Rm (H, Px )), we may compare data drawn from two distributions Px and
Py , with associated samples X and Y , to test whether both distributions are
identical; that is, whether Px = Py . For this purpose, recall that we defined
D(Px, Py) = ‖µ[Px] − µ[Py]‖. Using the reproducing property of an RKHS we
may show [16] that

D²(Px, Py) = E_{x,x′}[k(x, x′)] − 2E_{x,y}[k(x, y)] + E_{y,y′}[k(y, y′)],

where x′ is an independent copy of x, and y′ an independent copy of y. An
unbiased empirical estimator of D²(Px, Py) is a U-statistic [28],

D̂²(X, Y) := (1/(m(m−1))) Σ_{i≠j} h((xi, yi), (xj, yj)),       (6)

where
h((x, y), (x′, y′)) := k(x, x′) − k(x, y′) − k(y, x′) + k(y, y′).
An equivalent interpretation, also in [16], is that we find a function that max-
imizes the difference in expectations between probability distributions. The re-
sulting problem may be written
D(Px, Py) := sup_{f∈F} | Ex[f(x)] − Ey[f(y)] |.               (7)

To illustrate this latter setting, we plot the witness function f in Figure 1,
when Px is Gaussian and Py is Laplace, for a Gaussian RKHS kernel. This
function is straightforward to obtain, since the solution to Eq. (7) can be written
f(x) = ⟨µ[Px] − µ[Py], φ(x)⟩.

Fig. 1. Illustration of the function maximizing the mean discrepancy in the case where
a Gaussian is being compared with a Laplace distribution. Both distributions have zero
mean and unit variance. The function f that witnesses the difference in feature means
has been scaled for plotting purposes, and was computed empirically on the basis of
2 × 10⁴ samples, using a Gaussian kernel with σ = 0.5.
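A minimal sketch of the quantities just described, assuming a Gaussian kernel with σ = 0.5 as in Figure 1 but a smaller sample size than in the figure: it computes the unbiased U-statistic D̂²(X, Y) of (6) and evaluates the (unnormalized) witness function f(t) = ⟨µ[X] − µ[Y], φ(t)⟩.

import numpy as np

def rbf_gram(A, B, sigma=0.5):
    # Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=0.5):
    # Unbiased U-statistic estimate of D^2(Px, Py), Eq. (6), for samples of equal size m.
    m = len(X)
    Kxx = rbf_gram(X, X, sigma)
    Kyy = rbf_gram(Y, Y, sigma)
    Kxy = rbf_gram(X, Y, sigma)
    np.fill_diagonal(Kxx, 0.0)                    # drop i = j terms
    np.fill_diagonal(Kyy, 0.0)
    c = 1.0 / (m * (m - 1))
    return c * Kxx.sum() + c * Kyy.sum() - 2 * c * (Kxy.sum() - np.trace(Kxy))

def witness(T, X, Y, sigma=0.5):
    # f(t) = <mu[X] - mu[Y], k(t, .)>, evaluated at the rows of T (up to normalization).
    return rbf_gram(T, X, sigma).mean(1) - rbf_gram(T, Y, sigma).mean(1)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1))                          # Gaussian, zero mean, unit variance
Y = rng.laplace(scale=1 / np.sqrt(2), size=(2000, 1))   # Laplace, zero mean, unit variance
print(mmd2_unbiased(X, Y))
print(witness(np.array([[0.0], [2.0]]), X, Y))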

The following two theorems give uniform convergence and asymptotic results,
respectively. The first theorem is a straightforward application of [29, p. 25].
Theorem 4. Assume that the kernel k is nonnegative and bounded by 1. Then
with probability at least 1 − δ the deviation |D²(Px, Py) − D̂²(X, Y)| is bounded
by 4√(log(2/δ)/m).

Note that an alternative uniform convergence bound is provided in [30], based
on McDiarmid’s inequality [31]. The second theorem appeared as [30, Theorem
8], and describes the asymptotic distribution of D̂²(X, Y). When Px ≠ Py, this
distribution is given by [28, Section 5.5.1]; when Px = Py, it follows from [28,
Section 5.5.2] and [32, Appendix].

Theorem 5. We assume E[h²] < ∞. When Px ≠ Py, D̂²(X, Y) converges in
distribution [33, Section 7.2] to a Gaussian according to

m^{1/2} ( D̂²(X, Y) − D²(Px, Py) ) →_D N(0, σu²),

where σu² = 4 ( Ez[(Ez′ h(z, z′))²] − [E_{z,z′} h(z, z′)]² ) and z := (x, y), uniformly
at rate 1/√m [28, Theorem B, p. 193]. When Px = Py, the U-statistic is degen-
erate, meaning Ez′ h(z, z′) = 0. In this case, D̂²(X, Y) converges in distribution
according to

m D̂²(X, Y) →_D Σ_{l=1}^∞ λl (gl² − 2),                        (8)

where gl ∼ N(0, 2) i.i.d., λi are the solutions to the eigenvalue equation

∫_X k̃(x, x′) ψi(x) dp(x) = λi ψi(x′),

and k̃(xi, xj) := k(xi, xj) − Ex k(xi, x) − Ex k(x, xj) + E_{x,x′} k(x, x′) is the centered
RKHS kernel.
We illustrate the MMD density by approximating it empirically for both Px =
Py (also called the null hypothesis, or H0) and Px ≠ Py (the alternative hypoth-
esis, or H1 ). Results are plotted in Figure 2. We may use this theorem directly to
test whether two distributions are identical, given an appropriate finite sample
approximation to the (1 − α)th quantile of (8). In [16], this was achieved via two
strategies: by using the bootstrap [34], and by fitting Pearson curves using the
first four moments [35, Section 18.8].
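As an illustration of the resampling strategy (our own sketch; it uses a permutation scheme rather than the exact bootstrap of [34], and mmd2_fn stands for any estimator of D̂², e.g. the U-statistic sketched above), the test rejects H0 when the observed statistic exceeds the empirical (1 − α) quantile of the resampled null statistics:

import numpy as np

def mmd2_resampling_test(X, Y, mmd2_fn, n_resamples=500, alpha=0.05, seed=0):
    # Approximate the (1 - alpha) quantile of the null distribution by re-assigning
    # the pooled sample to two groups at random and recomputing the statistic.
    rng = np.random.default_rng(seed)
    observed = mmd2_fn(X, Y)
    pooled = np.concatenate([X, Y])
    m = len(X)
    null_stats = []
    for _ in range(n_resamples):
        idx = rng.permutation(len(pooled))
        null_stats.append(mmd2_fn(pooled[idx[:m]], pooled[idx[m:]]))
    threshold = np.quantile(null_stats, 1 - alpha)
    return observed, threshold, observed > threshold   # True: reject H0, i.e. Px != Py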
While uniform convergence bounds have the theoretical appeal of making no
assumptions on the distributions, they produce very weak tests. We find the test
arising from Theorem 5 performs considerably better in practice. In addition, [36]
demonstrate that this test performs very well in circumstances of high dimension
and low sample size (i.e. when comparing microarray data), as well as being the
only test currently applicable for structured data such as distributions on graphs.
Moreover, the test can be used to determine whether records in databases may
be matched based on their statistical properties. Finally, one may also apply
it to extract features with the aim of maximizing discrepancy between sets of
observations (see Section 2.4).

2.2 Covariate Shift Correction and Local Learning


A second application of the mean operator arises in situations of supervised
learning where the training and test sets are drawn from different distributions,

Fig. 2. Left: Empirical distribution of the MMD under H0, with Px and Py both
Gaussians with unit standard deviation, using 50 samples from each. Right: Empir-
ical distribution of the MMD under H1, with Px a Laplace distribution with unit
standard deviation, and Py a Laplace distribution with standard deviation 3√2, using
100 samples from each. In both cases, the histograms were obtained by computing 2000
independent instances of the MMD.

i.e. X = {x1, . . . , xm} is drawn from Px and X′ = {x′1, . . . , x′m′} is drawn from
Px′. We assume, however, that the labels y are drawn from the same conditional
distribution Py|x on both the training and test sets.
The goal in this case is to find a weighting of the training set such that
minimizing a reweighted empirical error on the training set will come close to
minimizing the expected loss on the test set. That is, we would like to find
weights {β1, . . . , βm} for X with Σ_i βi = 1.
Obviously, if Py|x is a rapidly changing function of x, or if the loss measuring
the discrepancy between y and its estimate is highly non-smooth, this problem
is difficult to solve. However, under regularity conditions spelled out in [37], one
may show that by minimizing

∆ := ‖ Σ_{i=1}^m βi k(xi, ·) − µ[X′] ‖

subject to βi ≥ 0 and Σ_i βi = 1, we will obtain weights which achieve this
task. The idea here is that the expected loss with the expectation taken over
y|x should not change too quickly as a function of x. In this case we can use
points xi “nearby” to estimate the loss at location x′j on the test set. Hence we
are re-weighting the empirical distribution on the training set X such that the
distribution behaves more like the empirical distribution on X′.
Note that by re-weighting X we will assign some observations a higher weight
than 1/m. This means that the statistical guarantees can no longer be stated in
terms of the sample size m. One may show [37], however, that ‖β‖₂⁻² now behaves
like the effective sample size. Instead of minimizing ∆, it pays to minimize
∆² + λ‖β‖₂² subject to the above constraints. It is easy to show using the reproducing
property of H that this corresponds to the following quadratic program:

minimize_β   (1/2) β⊤(K + λ1)β − β⊤l                          (9a)
subject to   βi ≥ 0 and Σ_i βi = 1.                           (9b)

Here Kij := k(xi, xj) denotes the kernel matrix and li := (1/m′) Σ_{j=1}^{m′} k(xi, x′j) is
the expected value of k(xi, ·) on the test set X′, i.e. li = ⟨k(xi, ·), µ[X′]⟩.
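A minimal sketch of solving (9), assuming a Gaussian kernel and a general-purpose constrained solver in place of a dedicated QP solver:

import numpy as np
from scipy.optimize import minimize

def rbf_gram(A, B, lam=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-lam * d2)

def covariate_shift_weights(X, X_test, lam_kernel=1.0, lam_reg=1e-3):
    # Solve the QP (9): minimize 0.5 b'(K + lam*1)b - b'l  s.t.  b_i >= 0, sum_i b_i = 1.
    m = len(X)
    K = rbf_gram(X, X, lam_kernel) + lam_reg * np.eye(m)
    l = rbf_gram(X, X_test, lam_kernel).mean(axis=1)    # l_i = <k(x_i, .), mu[X']>
    obj = lambda b: 0.5 * b @ K @ b - b @ l
    grad = lambda b: K @ b - l
    res = minimize(obj, np.full(m, 1.0 / m), jac=grad, method="SLSQP",
                   bounds=[(0.0, None)] * m,
                   constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}])
    return res.x                                        # training-set weights beta

The resulting weights βi can then be passed as per-example weights to an otherwise unchanged learning algorithm on the training set.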
Experiments show that solving (9) leads to sample weights which perform
very well in covariate shift. Remarkably, the approach can even outperform
“importance sampler” weights, i.e. weights βi obtained by computing the ra-
tio βi = Px0 (xi )/Px (xi ). This is surprising, since the latter provide unbiased
estimates of the expected error on X 0 . A point to bear in mind is that the
kernels employed in the classification/regression learning algorithms of [37] are
somewhat large, suggesting that the feature mean matching procedure is helpful
when the learning algorithm returns relatively smooth classification/regression
functions (we observe the same situation in the example of [38, Figure 1], where
the model is “simpler” than the true function generating the data).
In the case where X′ contains only a single observation, i.e. X′ = {x′}, the
above procedure leads to estimates which try to find a subset of observations
in X and a weighting scheme such that the error at x′ is approximated well.
In practice, this leads to a local sample weighting scheme, and consequently an
algorithm for local learning [39]. Our key advantage, however, is that we do not
need to define the shape of the neighborhood in which we approximate the error
at x′. Instead, this is automatically taken care of via the choice of the Hilbert
space H and the location of x′ relative to X.

2.3 Independence Measures

A third application of our mean mapping arises in measures of whether two ran-
dom variables x and y are independent. Assume that pairs of random variables
(xi , yi ) are jointly drawn from some distribution Px,y . We wish to determine
whether this distribution factorizes.
A measure of (in)dependence between random variables is a very useful tool
in data analysis. One application is in independent component analysis
[40], where the goal is to find a linear mapping of the observations xi to obtain
mutually independent outputs. One of the first algorithms to gain popularity
was InfoMax, which relies on information theoretic quantities [41]. Recent devel-
opments using cross-covariance or correlation operators between Hilbert space
representations have since improved on these results significantly [42, 43, 44]; in
particular, a faster and more accurate quasi-Newton optimization procedure for
kernel ICA is given in [45]. In the following we re-derive one of the above kernel
independence measures using mean operators instead.

We begin by defining

µ[Pxy] := Ex,y[v((x, y), ·)]   and   µ[Px × Py] := Ex Ey[v((x, y), ·)].

Here we assumed that V is an RKHS over X × Y with kernel v((x, y), (x′, y′)). If
x and y are dependent, the equality µ[Pxy] = µ[Px × Py] will not hold. Hence
we may use ∆ := ‖µ[Pxy] − µ[Px × Py]‖ as a measure of dependence.
Now assume that v((x, y), (x′, y′)) = k(x, x′)l(y, y′), i.e. that the RKHS V is
a direct product H ⊗ G of the RKHSs on X and Y. In this case it is easy to see
that

∆² = ‖Exy[k(x, ·)l(y, ·)] − Ex[k(x, ·)] Ey[l(y, ·)]‖²
   = Exy Ex′y′[k(x, x′)l(y, y′)] − 2Ex Ey Ex′y′[k(x, x′)l(y, y′)]
     + Ex Ey Ex′ Ey′[k(x, x′)l(y, y′)].

The latter, however, is exactly what [43] show to be the Hilbert-Schmidt norm
of the covariance operator between RKHSs: this is zero if and only if x and y
are independent, for universal kernels. We have the following theorem:
Theorem 6. Denote by Cxy the covariance operator between random variables
x and y, drawn jointly from Pxy, where the functions on X and Y are the re-
producing kernel Hilbert spaces F and G respectively. Then the Hilbert-Schmidt
norm ‖Cxy‖_HS equals ∆.
Empirical estimates of this quantity are as follows:
Theorem 7. Denote by K and L the kernel matrices on X and Y respectively.
Moreover, denote by H = I − (1/m)11⊤ the projection matrix onto the subspace
orthogonal to the vector with all entries set to 1. Then m⁻² tr HKHL is an
estimate of ∆² with bias O(m⁻¹). With high probability the deviation from ∆²
is O(m^{-1/2}).
See [43] for explicit constants. In certain circumstances, including in the case
of RKHSs with Gaussian kernels, the empirical ∆² may also be interpreted in
terms of a smoothed difference between the joint empirical characteristic func-
tion (ECF) and the product of the marginal ECFs [46, 47]. This interpretation
does not hold in all cases, however, e.g. for kernels on strings, graphs, and other
structured spaces. An illustration of the witness function of the equivalent op-
timization problem in Eq. 7 is provided in Figure 3. We observe that this is
a smooth function which has large magnitude where the joint density is most
different from the product of the marginals.
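The biased empirical estimate m⁻² tr HKHL of ∆² from Theorem 7 takes only a few lines (our own sketch; Gaussian kernels on both variables and their bandwidths are assumptions):

import numpy as np

def rbf_gram(A, lam=1.0):
    d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-lam * d2)

def hsic(X, Y, lam_x=1.0, lam_y=1.0):
    # Biased estimate m^-2 tr(HKHL) of Delta^2 (Theorem 7).
    m = len(X)
    H = np.eye(m) - np.ones((m, m)) / m      # centering matrix
    K = rbf_gram(X, lam_x)
    L = rbf_gram(Y, lam_y)
    return np.trace(H @ K @ H @ L) / m ** 2

# Dependent pair (y a noisy function of x) versus an independent pair (assumed toy data).
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
y_dep = np.sin(3 * x) + 0.1 * rng.normal(size=(300, 1))
y_ind = rng.normal(size=(300, 1))
print(hsic(x, y_dep), hsic(x, y_ind))        # the first value is markedly larger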
Note that if v((x, y), ·) does not factorize we obtain a more general measure
of dependence. In particular, we might not care about all types of interaction
between x and y to an equal extent, and use an ANOVA kernel. Computationally
efficient recursions are due to [48], as reported in [49]. More importantly, this
representation will allow us to deal with structured random variables which are
not drawn independently and identically distributed, such as time series.

Fig. 3. Illustration of the function maximizing the mean discrepancy when MMD is
used as a measure of independence. A sample from dependent random variables x and
y is shown in black, and the associated function f that witnesses the MMD is plotted
as a contour. The latter was computed empirically on the basis of 200 samples, using
a Gaussian kernel with σ = 0.2.

For instance, in the case of EEG (electroencephalogram) data, we have both


spatial and temporal structure in the signal. That said, few algorithms take full
advantage of this when performing independent component analysis [50]. The
pyramidal kernel of [51] is one possible choice for dependent random variables.

2.4 Feature Extraction


Kernel measures of statistical dependence need not be applied only to the analy-
sis of independent components. On the contrary, we may also use them to extract
highly dependent random variables, i.e. features. This procedure leads to variable
selection algorithms with very robust properties [52].
The idea works as follows: given a set of patterns X and a set of labels Y,
find a subset of features from X which maximizes m⁻² tr HKHL. Here L is
the kernel matrix on the labels. In the most general case, the matrix K will
arise from an arbitrary kernel k, for which no efficient decompositions exist. In
this situation [52] suggests the use of a greedy feature removal procedure, i.e. to
remove subsets of features iteratively such that m⁻² tr HKHL is maximized for
the remaining features.
In general, for particular choices of k and l, it is possible to recover well known
feature selection methods, such as Pearson’s correlation, shrunken centroid, or
signal-to-noise ratio selection. Below we give some examples, mainly for the case
of a linear kernel k(x, x′) = ⟨x, x′⟩. For more details see [53].
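The greedy removal procedure can be sketched as follows (our own simplified illustration: it drops one feature at a time, whereas [52] may remove feature subsets in batches, and hsic_fn stands for a dependence estimator such as the m⁻² tr HKHL sketch in Section 2.3, with Y supplied as an (m, 1) label array):

import numpy as np

def greedy_feature_removal(X, Y, hsic_fn, n_keep=5):
    # Iteratively drop the feature whose removal leaves m^-2 tr(HKHL) largest,
    # until only n_keep features remain.
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        scores = [(hsic_fn(X[:, [f for f in remaining if f != j]], Y), j)
                  for j in remaining]
        _, drop = max(scores)                # removal preserving the most dependence on Y
        remaining.remove(drop)
    return remaining                         # indices of the selected features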

Pearson’s Correlation is commonly used in microarray analysis [54, 55]. It is
defined as

Rj := (1/m) Σ_{i=1}^m ((xij − x̄j)/sxj) ((yi − ȳ)/sy),   where                 (10)

x̄j = (1/m) Σ_{i=1}^m xij   and   ȳ = (1/m) Σ_{i=1}^m yi,

s²xj = (1/m) Σ_{i=1}^m (xij − x̄j)²   and   s²y = (1/m) Σ_{i=1}^m (yi − ȳ)².    (11)

This means that all features are individually centered by x̄j and scaled by
their coordinate-wise variance sxj as a preprocessing step. Performing those
operations before applying a linear kernel yields the formulation:

tr KHLH = tr XX⊤Hyy⊤H = ‖X⊤Hy‖²                                               (12)
        = Σ_{j=1}^d ( Σ_{i=1}^m ((xij − x̄j)/sxj) ((yi − ȳ)/sy) )² = Σ_{j=1}^d Rj².   (13)

Hence tr KHLH computes the sum of the squares of the Pearson Correlation
(pc) coefficients. Since the terms are additive, feature selection is straight-
forward by picking the list of best performing features; a small numerical check
of this identity is sketched after this list.

Centroid The difference between the means of the positive and negative classes
at the jth feature, (x̄j+ − x̄j−), is useful for scoring individual features. With
different normalization of the data and the labels, many variants can be
derived.
To obtain the centroid criterion, [56] use vj := λx̄j+ − (1−λ)x̄j− for λ ∈ (0, 1)
as the score for feature j (the parameterization in [56] is different, but it can
be shown to be equivalent). Features are subsequently selected according to
the absolute value |vj|. In experiments the authors typically choose λ = 1/2.
For λ = 1/2 we can achieve the same goal by choosing Lii′ = (yi yi′)/(m_{yi} m_{yi′})
(yi, yi′ ∈ {±1}), in which case HLH = L, since the label kernel matrix is already
centered. Hence we have

tr KHLH = Σ_{i,i′=1}^m (yi yi′)/(m_{yi} m_{yi′}) xi⊤ xi′                        (14)
        = Σ_{j=1}^d ( Σ_{i,i′=1}^m (yi yi′ xij xi′j)/(m_{yi} m_{yi′}) ) = Σ_{j=1}^d (x̄j+ − x̄j−)².   (15)

This proves that the centroid feature selector can be viewed as a special case
of BAHSIC in the case of λ = 1/2. From our analysis we see that other values
of λ amount to effectively rescaling the patterns xi differently for different
classes, which may lead to undesirable features being selected.
t-Statistic The normalization for the jth feature is computed as

s̄j = [ s²j+/m+ + s²j−/m− ]^{1/2}                               (16)

In this case we define the t-statistic for the jth feature via tj = (x̄j+ − x̄j−)/s̄j.
Compared to the Pearson correlation, the key difference is that
now we normalize each feature not by the overall sample standard deviation
but rather by a value which takes each of the two classes separately into
account.
Signal to noise ratio is yet another criterion to use in feature selection. The
key idea is to normalize each feature by s̄j = sj+ + sj− instead. Subsequently
the (x̄j+ − x̄j−)/s̄j are used to score features.
Moderated t-score is similar to the t-statistic and is used for microarray analy-
sis [57]. Its normalization for the jth feature is derived via a Bayes approach
as

s̃j = (m s̄²j + m0 s̄²0) / (m + m0)                              (17)

where s̄j is from (16), and s̄0 and m0 are hyperparameters for the prior dis-
tribution on s̄j (all s̄j are assumed to be iid). s̄0 and m0 are estimated using
information from all feature dimensions. This effectively borrows informa-
tion from the ensemble of features to aid with the scoring of an individual
feature. More specifically, s̄0 and m0 can be computed as [57]

m0 = 2Γ′⁻¹( (1/d) Σ_{j=1}^d (zj − z̄)² − Γ′(m/2) ),             (18)
s̄²0 = exp( z̄ − Γ(m0/2) + Γ(m/2) − ln(m0/m) ),                  (19)

where Γ(·) is the gamma function, ′ denotes derivative, zj = ln(s̄²j) and
z̄ = (1/d) Σ_{j=1}^d zj.
B-statistic is the logarithm of the posterior odds (lods) that a feature is dif-
ferentially expressed. [58, 57] show that, for a large number of features, the
B-statistic is given by

Bj = a + b t̃²j,                                              (20)

where both a and b are constant (b > 0), and t̃j is the moderated-t statistic
for the jth feature. Here we see that Bj is monotonically increasing in t̃j, and
thus results in the same gene ranking as the moderated-t statistic.
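As a small numerical check of the Pearson-correlation case above (our own sketch; random data, and the m⁻² normalization used throughout the paper is made explicit), the HSIC-style criterion on standardized features and labels coincides with Σ_j Rj²:

import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.normal(size=(m, d))
y = np.sign(rng.normal(size=m))              # binary labels in {-1, +1}

# Standardise each feature and the labels (population standard deviation, as in Eq. (11)).
Xs = (X - X.mean(0)) / X.std(0)
ys = (y - y.mean()) / y.std()

H = np.eye(m) - np.ones((m, m)) / m          # centering matrix
K = Xs @ Xs.T                                # linear kernel on standardised features
L = np.outer(ys, ys)                         # linear kernel on standardised labels

hsic = np.trace(H @ K @ H @ L) / m ** 2
R = (Xs * ys[:, None]).mean(0)               # per-feature Pearson correlations R_j
print(np.allclose(hsic, np.sum(R ** 2)))     # True: m^-2 tr HKHL = sum_j R_j^2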

2.5 Density Estimation

General setting Obviously, we may also use the connection between mean
operators and empirical means for the purpose of estimating densities. In fact,
[59, 17, 60] show that this may be achieved in the following fashion:

maximize_{Px}  H(Px)   subject to   ‖µ[X] − µ[Px]‖ ≤ ε.       (21)

Here H is an entropy-like quantity (e.g. Kullback-Leibler divergence, Csiszár di-
vergence, Bregman divergence, entropy, Amari divergence) that is to be max-
imized subject to the constraint that the expected mean should not stray too
far from its empirical counterpart. In particular, one may show that this ap-
proximate maximum entropy formulation is the dual of a maximum-a-posteriori
estimation problem.
In the case of conditional probability distributions, it is possible to recover
a raft of popular estimation algorithms, such as Gaussian Process classification,
regression, and conditional random fields. The key idea in this context is to
identify the sufficient statistics in generalized exponential families with the map
x → k(x, ·) into a reproducing kernel Hilbert space.

Mixture model In problem (21) we try to find the optimal Px over the en-
tire space of probability distributions on X. This can be an exceedingly costly
optimization problem, in particular in the nonparametric setting. For instance,
computing the normalization of the density itself may be intractable, in par-
ticular for high-dimensional data. In this case we may content ourselves with
finding a suitable mixture distribution such that ‖µ[X] − µ[Px]‖ is minimized
with respect to the mixture coefficients. The diagram below summarizes our
approach:

density Px −→ sample X −→ emp. mean µ[X] −→ estimate via µ[P̂x]         (22)

The connection between µ[Px] and µ[X] follows from Theorem 2. To obtain a
density estimate from µ[X] assume that we have a set of candidate densities P^i_x
on X. We want to use these as basis functions to obtain P̂x via

P̂x = Σ_{i=1}^M βi P^i_x   where   Σ_{i=1}^M βi = 1 and βi ≥ 0.          (23)

In other words we wish to estimate Px by means of a mixture model with
mixture densities P^i_x. The goal is to obtain good estimates for the coefficients βi
and to obtain performance guarantees which specify how well P̂x is capable of
estimating Px in the first place. This is possible using a very simple optimization
problem:

minimize_β  ‖µ[X] − µ[P̂x]‖²_H   subject to   β⊤1 = 1 and β ≥ 0.        (24)

To ensure good generalization performance we add a regularizer Ω[β] to the
optimization problem, such as (1/2)‖β‖². It follows using the expansion of P̂x in
(23) that the resulting optimization problem can be reformulated as a quadratic
program via

minimize_β  (1/2) β⊤[Q + λ1]β − l⊤β   subject to   β⊤1 = 1 and β ≥ 0.   (25)

Here λ > 0 is a regularization constant, and the quadratic matrix Q ∈ R^{M×M}
and the vector l ∈ R^M are given by

Qij = ⟨µ[P^i_x], µ[P^j_x]⟩ = E_{x^i,x^j}[k(x^i, x^j)]                   (26)
and lj = ⟨µ[X], µ[P^j_x]⟩ = (1/m) Σ_{i=1}^m E_{x^j}[k(xi, x^j)].        (27)

By construction Q ⪰ 0 is positive semidefinite, hence the quadratic program
(25) is convex. For a number of kernels and mixture terms P^i_x we are able to
compute Q, l in closed form.
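A sketch of (25)–(27) under our own assumptions: Gaussian candidate densities, a Gaussian kernel, and simple Monte Carlo estimates of Q and l in place of the closed-form expressions mentioned above.

import numpy as np
from scipy.optimize import minimize

def rbf(A, B, lam=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-lam * d2)

def mixture_weights(X, samplers, lam_kernel=1.0, lam_reg=1e-3, n_mc=2000, seed=0):
    # Solve (25): minimize 0.5 b'(Q + lam*1)b - l'b  s.t.  b >= 0, sum(b) = 1,
    # with Q and l of (26)-(27) estimated by Monte Carlo from the candidate densities.
    rng = np.random.default_rng(seed)
    S = [draw(rng, n_mc) for draw in samplers]          # samples from each P^i_x
    M = len(S)
    Q = np.array([[rbf(S[i], S[j], lam_kernel).mean() for j in range(M)] for i in range(M)])
    l = np.array([rbf(X, S[j], lam_kernel).mean() for j in range(M)])
    obj = lambda b: 0.5 * b @ (Q + lam_reg * np.eye(M)) @ b - l @ b
    res = minimize(obj, np.full(M, 1.0 / M), method="SLSQP",
                   bounds=[(0.0, None)] * M,
                   constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}])
    return res.x                                        # mixture coefficients beta

# Example (assumed setup): fit weights over three Gaussian bumps to a bimodal sample.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 0.5, (200, 1)), rng.normal(2, 0.5, (200, 1))])
samplers = [lambda r, n, c=c: r.normal(c, 0.5, (n, 1)) for c in (-2.0, 0.0, 2.0)]
print(mixture_weights(X, samplers))                     # most weight on the outer components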
Since P̂x is an empirical estimate it is quite unlikely that P̂x = Px. This raises
the question of how well expectations with respect to Px are approximated by
those with respect to P̂x. This can be answered by an extension of the Koksma-
Hlawka inequality [61].

Lemma 1. Let ε > 0 and let ε′ := ‖µ[X] − µ[P̂x]‖. Under the assumptions of
Theorem 2 we have that with probability at least 1 − exp(−ε²mR⁻²),

sup_{‖f‖_H ≤ 1} | E_{x∼Px}[f(x)] − E_{x∼P̂x}[f(x)] | ≤ 2Rm(H, Px) + ε + ε′.   (28)

Proof We use that in Hilbert spaces, E_{x∼Px}[f(x)] = ⟨f, µ[Px]⟩ and E_{x∼P̂x}[f(x)] =
⟨f, µ[P̂x]⟩ both hold. Hence the LHS of (28) equates to
sup_{‖f‖_H ≤ 1} ⟨µ[Px] − µ[P̂x], f⟩, which is given by ‖µ[Px] − µ[P̂x]‖.
The triangle inequality, our assumption on µ[P̂x], and Theorem 2 complete the
proof.
This means that we have good control over the behavior of expectations of
functions of the random variable, as long as those functions are “smooth” on X — the
uncertainty increases with their RKHS norm.
The above technique is useful when it comes to representing distributions in
message passing and data compression. Rather than minimizing an information
theoretic quantity, we can choose a Hilbert space which accurately reflects the
degree of smoothness required for any subsequent operations carried out by the
estimate. For instance, if we are only interested in linear functions, an accurate
match of the first order moments will suffice, without requiring a good match in
higher order terms.

2.6 Kernels on Sets


Up to now we used the mapping X → µ[X] to compute the distance between
two distributions (or their samples). However, since µ[X] itself is an element of
an RKHS we can define a kernel on sets (and distributions) directly via
k(X, X′) := ⟨µ[X], µ[X′]⟩ = (1/(mm′)) Σ_{i,j}^{m,m′} k(xi, x′j).          (29)

In other words, k(X, X′), and by analogy k(Px, Px′) := ⟨µ[Px], µ[Px′]⟩, define
kernels on sets and distributions, and obviously also between sets and distribu-
tions. If we have multisets and sample weights for instances we may easily include
this in the computation of µ[X]. It turns out that (29) is exactly the set kernel
proposed by [62], when dealing with multiple instance learning. This notion was
subsequently extended to deal with intermediate density estimates by [63]. It
therefore follows that in situations where estimation problems are well described
by distributions, we inherit the consistency properties of the underlying RKHS
simply by using a universal set kernel for which µ[X] converges to µ[Px]. We
have the following corollary:
Corollary 2. If k is universal the kernel matrix defined by the set/distribution
kernel (29) has full rank as long as the sets/distributions are not identical.

Note, however, that the set kernel may not be ideal for all multi-instance prob-
lems: in the latter one assumes that at least a single instance has a given property,
whereas for the use of (29) one needs to assume that at least a certain fraction
of instances have this property.
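A direct implementation of the set kernel (29) is straightforward (our own sketch, with a Gaussian base kernel as an assumption); note that the squared embedding distance of Section 2.1 is a combination of three such set-kernel evaluations:

import numpy as np

def rbf(A, B, lam=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-lam * d2)

def set_kernel(X, X_prime, lam=1.0):
    # Eq. (29): k(X, X') = <mu[X], mu[X']> = (1/(m m')) sum_{i,j} k(x_i, x'_j).
    return rbf(X, X_prime, lam).mean()

def embedding_distance2(X, X_prime, lam=1.0):
    # ||mu[X] - mu[X']||^2 expressed through three set-kernel evaluations.
    return set_kernel(X, X, lam) - 2 * set_kernel(X, X_prime, lam) + set_kernel(X_prime, X_prime, lam)

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 2))
B = rng.normal(loc=1.0, size=(120, 2))
print(set_kernel(A, B), embedding_distance2(A, B))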

3 Summary
We have seen that Hilbert space embeddings of distributions are a powerful tool
to deal with a broad range of estimation problems, including two-sample tests,
feature extraction, independence tests, covariate shift, local learning, density es-
timation, and the measurement of similarity between sets. Given these successes,
we are very optimistic that these embedding techniques can be used to address
further problems, ranging from issues in high dimensional numerical integration
(the connections to lattice and Sobol sequences are apparent) to more advanced
nonparametric property testing.

Acknowledgments We thank Karsten Borgwardt, Kenji Fukumizu, Jiayuan Huang,


Quoc Le, Malte Rasch, and Vladimir Vapnik for helpful discussions. NICTA is
funded through the Australian Government’s Backing Australia’s Ability initia-
tive, in part through the Australian Research Council. This work was supported
in part by the IST Programme of the European Community, under the PASCAL
Network of Excellence, IST-2002-506778.

References
[1] Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
[2] Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge, MA
(2002)
[3] Joachims, T.: Learning to Classify Text Using Support Vector Machines: Methods,
Theory, and Algorithms. Kluwer Academic Publishers, Boston (2002)
[4] Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning.
MIT Press, Cambridge, MA (2006)
[5] Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley and
Sons, New York (1991)
[6] Amari, S., Nagaoka, H.: Methods of Information Geometry. Oxford University
Press (1993)
[7] Krause, A., Guestrin, C.: Near-optimal nonmyopic value of information in graph-
ical models. In: Uncertainty in Artificial Intelligence UAI’05. (2005)
[8] Slonim, N., Tishby, N.: Agglomerative information bottleneck. In Solla, S.A.,
Leen, T.K., Müller, K.R., eds.: Advances in Neural Information Processing Sys-
tems 12, Cambridge, MA, MIT Press (2000) 617–623
[9] Stögbauer, H., Kraskov, A., Astakhov, S., Grassberger, P.: Least dependent com-
ponent analysis based on mutual information. Phys. Rev. E 70(6) (2004) 066123
[10] Nemenman, I., Shafee, F., Bialek, W.: Entropy and inference, revisited. In: Neural
Information Processing Systems. Volume 14., Cambridge, MA, MIT Press (2002)
[11] Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cam-
bridge University Press, Cambridge, UK (2004)

[12] Schölkopf, B., Tsuda, K., Vert, J.P.: Kernel Methods in Computational Biology.
MIT Press, Cambridge, MA (2004)
[13] Hofmann, T., Schölkopf, B., Smola, A.J.: A review of kernel methods in machine
learning. Technical Report 156, Max-Planck-Institut für biologische Kybernetik
(2006)
[14] Steinwart, I.: The influence of the kernel on the consistency of support vector
machines. Journal of Machine Learning Research 2 (2002)
[15] Fukumizu, K., Bach, F.R., Jordan, M.I.: Dimensionality reduction for supervised
learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 5 (2004)
73–99
[16] Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.J.: A kernel
method for the two-sample-problem. In Schölkopf, B., Platt, J., Hofmann, T.,
eds.: Advances in Neural Information Processing Systems. Volume 19., The MIT
Press, Cambridge, MA (2007)
[17] Altun, Y., Smola, A.: Unifying divergence minimization and statistical inference
via convex duality. In Simon, H., Lugosi, G., eds.: Proc. Annual Conf. Computa-
tional Learning Theory. LNCS, Springer (2006) 139–153
[18] Bartlett, P.L., Mendelson, S.: Rademacher and gaussian complexities: Risk bounds
and structural results. J. Mach. Learn. Res. 3 (2002) 463–482
[19] Koltchinskii, V.: Rademacher penalties and structural risk minimization. IEEE
Trans. Inform. Theory 47 (2001) 1902–1914
[20] Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies
of events to their probabilities. Theory Probab. Appl. 16(2) (1971) 264–281
[21] Vapnik, V., Chervonenkis, A.: The necessary and sufficient conditions for the
uniform convergence of averages to their expected values. Teoriya Veroyatnostei
i Ee Primeneniya 26(3) (1981) 543–564
[22] Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and vari-
ational inference. Technical Report 649, UC Berkeley, Department of Statistics
(September 2003)
[23] Ravikumar, P., Lafferty, J.: Variational chernoff bounds for graphical models. In:
Uncertainty in Artificial Intelligence UAI04. (2004)
[24] Altun, Y., Smola, A.J., Hofmann, T.: Exponential families for conditional random
fields. In: Uncertainty in Artificial Intelligence (UAI), Arlington, Virginia, AUAI
Press (2004) 2–9
[25] Hammersley, J.M., Clifford, P.E.: Markov fields on finite graphs and lattices.
unpublished manuscript (1971)
[26] Besag, J.: Spatial interaction and the statistical analysis of lattice systems (with
discussion). J. Roy. Stat. Soc. Ser. B Stat. Methodol. 36(B) (1974) 192–326
[27] Hein, M., Bousquet, O.: Hilbertian metrics and positive definite kernels on prob-
ability measures. In Ghahramani, Z., Cowell, R., eds.: Proc. of AI & Statistics.
Volume 10. (2005)
[28] Serfling, R.: Approximation Theorems of Mathematical Statistics. Wiley, New
York (1980)
[29] Hoeffding, W.: Probability inequalities for sums of bounded random variables.
Journal of the American Statistical Association 58 (1963) 13–30
[30] Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method
for the two-sample-problem. In: Advances in Neural Information Processing Sys-
tems 19, Cambridge, MA, MIT Press (2007)
[31] McDiarmid, C.: On the method of bounded differences. Surveys in Combinatorics
(1989) 148–188. Cambridge University Press.

[32] Anderson, N., Hall, P., Titterington, D.: Two-sample test statistics for measuring
discrepancies between two multivariate probability density functions using kernel-
based density estimates. Journal of Multivariate Analysis 50 (1994) 41–54
[33] Grimmet, G.R., Stirzaker, D.R.: Probability and Random Processes. Third edn.
Oxford University Press, Oxford (2001)
[34] Arcones, M., Giné, E.: On the bootstrap of u and v statistics. The Annals of
Statistics 20(2) (1992) 655–674
[35] Johnson, N.L., Kotz, S., Balakrishnan, N.: Continuous Univariate Distribu-
tions. Volume 1 (Second Edition). John Wiley and Sons (1994)
[36] Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H.P., Schölkopf, B., Smola,
A.J.: Integrating structured biological data by kernel maximum mean discrepancy.
Bioinformatics 22(14) (2006) e49–e57
[37] Huang, J., Smola, A., Gretton, A., Borgwardt, K., Schölkopf, B.: Correcting
sample selection bias by unlabeled data. In Schölkopf, B., Platt, J., Hofmann, T.,
eds.: Advances in Neural Information Processing Systems. Volume 19., The MIT
Press, Cambridge, MA (2007)
[38] Shimodaira, H.: Improving predictive inference under covariate shift by weight-
ing the log-likelihood function. Journal of Statistical Planning and Inference 90
(2000)
[39] Bottou, L., Vapnik, V.N.: Local learning algorithms. Neural Computation 4(6)
(1992) 888–900
[40] Comon, P.: Independent component analysis, a new concept? Signal Processing
36 (1994) 287–314
[41] Lee, T.W., Girolami, M., Bell, A., Sejnowski, T.: A unifying framework for inde-
pendent component analysis. Comput. Math. Appl. 39 (2000) 1–21
[42] Bach, F.R., Jordan, M.I.: Kernel independent component analysis. J. Mach.
Learn. Res. 3 (2002) 1–48
[43] Gretton, A., Bousquet, O., Smola, A., Schölkopf, B.: Measuring statistical depen-
dence with Hilbert-Schmidt norms. In Jain, S., Simon, H.U., Tomita, E., eds.: Pro-
ceedings Algorithmic Learning Theory, Berlin, Germany, Springer-Verlag (2005)
63–77
[44] Gretton, A., Herbrich, R., Smola, A., Bousquet, O., Schölkopf, B.: Kernel methods
for measuring independence. J. Mach. Learn. Res. 6 (2005) 2075–2129
[45] Shen, H., Jegelka, S., Gretton, A.: Fast kernel ICA using an approximate newton
method. In: AISTATS 11. (2007)
[46] Feuerverger, A.: A consistent test for bivariate dependence. International Statis-
tical Review 61(3) (1993) 419–433
[47] Kankainen, A.: Consistent Testing of Total Independence Based on the Empirical
Characteristic Function. PhD thesis, University of Jyväskylä (1995)
[48] Burges, C.J.C., Vapnik, V.: A new method for constructing artificial neural net-
works. Interim technical report, ONR contract N00014-94-c-0186, AT&T Bell
Laboratories (1995)
[49] Vapnik, V.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
[50] Anemuller, J., Duann, J.R., Sejnowski, T.J., Makeig, S.: Spatio-temporal dynam-
ics in fmri recordings revealed with complex independent component analysis.
Neurocomputing 69 (2006) 1502–1512
[51] Schölkopf, B.: Support Vector Learning. R. Oldenbourg Verlag, Munich (1997)
Download: http://www.kernel-machines.org.
[52] Song, L., Smola, A., Gretton, A., Borgwardt, K., Bedo, J.: Supervised feature
selection via dependence estimation. In: Proc. Intl. Conf. Machine Learning.
(2007)

[53] Song, L., Bedo, J., Borgwardt, K., Gretton, A., Smola, A.: Gene selection via the
BAHSIC family of algorithms. In: Bioinformatics (ISMB). (2007) To appear.
[54] van’t Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A.M., et al.: Gene
expression profiling predicts clinical outcome of breast cancer. Nature 415 (2002)
530–536
[55] Ein-Dor, L., Zuk, O., Domany, E.: Thousands of samples are needed to generate
a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci. USA
103(15) (Apr 2006) 5923–5928
[56] Bedo, J., Sanderson, C., Kowalczyk, A.: An efficient alternative to svm based
recursive feature elimination with applications in natural language processing and
bioinformatics. In: Artificial Intelligence. (2006)
[57] Smyth, G.: Linear models and empirical bayes methods for assessing differential
expression in microarray experiments. Statistical Applications in Genetics and
Molecular Biology 3 (2004)
[58] Lönnstedt, I., Speed, T.: Replicated microarray data. Statistica Sinica 12 (2002)
31–46
[59] Dudík, M., Schapire, R., Phillips, S.: Correcting sample selection bias in maxi-
mum entropy density estimation. In: Advances in Neural Information Processing
Systems 17. (2005)
[60] Dudík, M., Schapire, R.E.: Maximum entropy distribution estimation with gen-
eralized regularization. In Lugosi, G., Simon, H.U., eds.: Proc. Annual Conf.
Computational Learning Theory, Springer Verlag (June 2006)
[61] Hlawka, E.: Funktionen von beschränkter Variation in der Theorie der Gle-
ichverteilung. Annali di Matematica Pura ed Applicata 54 (1961)
[62] Gärtner, T., Flach, P.A., Kowalczyk, A., Smola, A.J.: Multi-instance kernels. In:
Proc. Intl. Conf. Machine Learning. (2002)
[63] Jebara, T., Kondor, I.: Bhattacharyya and expected likelihood kernels. In
Schölkopf, B., Warmuth, M., eds.: Proceedings of the Sixteenth Annual Con-
ference on Computational Learning Theory. Number 2777 in Lecture Notes in
Computer Science, Heidelberg, Germany, Springer-Verlag (2003) 57–71
