Robust Higher Order Statistics
Max Welling
School of Information and Computer Science
University of California Irvine
Irvine CA 92697-3425 USA
welling@ics.uci.edu
Abstract
1 INTRODUCTION
Moments and cumulants are widely used in scientific disciplines that deal with data, random variables or stochastic processes. They are well known tools that can be used
to quantify certain statistical properties of the probability
distribution like location (first moment) and scale (second
moment). Their definition is given by,
$$\mu_n = E[x^n] \qquad (1)$$

or, for multivariate random variables,

$$\mu_{i_1\ldots i_n} = E[x_{i_1}\cdots x_{i_n}] \qquad (2)$$
where $E[\cdot]$ denotes the average over the probability distribution p(x). In practice we have a set of samples from the probability distribution and compute sample estimates of these moments. However, for higher order moments these estimates become increasingly dominated by outliers, by which we mean the samples that are far away from the mean. Especially for heavy-tailed distributions this implies that these estimates have high variance and are generally unsuitable for measuring properties of the distribution.
Moments are conveniently collected in the characteristic function, whose logarithm generates the cumulants,

$$\varphi(t) = \sum_{n=0}^{\infty}\frac{1}{n!}\,\mu_n\,(it)^n \qquad (3)$$

$$\sum_{n=0}^{\infty}\frac{1}{n!}\,\kappa_n\,(it)^n = \ln\varphi(t) \qquad (4)$$
Definition 1 The robust moments of a probability density p(x) are defined by:

$$\mu^{(\beta)}_{i_1\ldots i_n} = E\left[(\beta x_{i_1})\cdots(\beta x_{i_n})\,\frac{\phi_\beta(x)}{\phi(x)}\right] \qquad (6)$$

where $\phi(x)$ is the multivariate standard normal density and $\phi_\beta(x) = \beta^d\,\phi(\beta x)$ its rescaled version. The decaying factor is thus given by $\frac{\phi_\beta(x)}{\phi(x)} = \beta^d \exp\left(-\frac{1}{2}(\beta^2-1)\,x^Tx\right)$, where $d$ is the dimension of the space. In the limit $\beta \to 1$ we obtain the usual definition of moments.
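As a concrete illustration, the following minimal NumPy sketch (the function name and parameter values are ours, not the paper's) estimates a robust moment from samples using this decaying factor; β = 1 recovers the classical sample moment.

```python
import numpy as np

def robust_moment(X, indices, beta=1.2):
    """Sample estimate of the robust moment mu^(beta)_{i1...in} (Eq. 6).

    X       : (N, d) array of zero-mean, unit-variance samples.
    indices : coordinate indices (i1, ..., in) of the moment.
    beta    : robustness parameter; beta = 1 gives the classical moment.
    """
    d = X.shape[1]
    # decaying factor phi_beta(x)/phi(x) = beta^d exp(-(beta^2 - 1) x^T x / 2)
    w = beta**d * np.exp(-0.5 * (beta**2 - 1.0) * np.sum(X**2, axis=1))
    prod = np.prod(beta * X[:, list(indices)], axis=1)  # (beta x_i1)...(beta x_in)
    return np.mean(prod * w)

# example: a fourth-order moment of heavy-tailed (Student-t) data
rng = np.random.default_rng(0)
X = rng.standard_t(df=3, size=(100_000, 2)) / np.sqrt(3.0)  # roughly unit variance
print(robust_moment(X, (0, 0, 0, 0), beta=1.2))
```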
In order to preserve most of the desirable properties that
cumulants obey, we will use the same definition to relate
moments to cumulants as in the classical case,
Definition 2 The robust cumulants are defined by:
$$\sum_{n=0}^{\infty}\,\sum_{i_1=1}^{M}\cdots\sum_{i_n=1}^{M}\frac{1}{n!}\,\kappa^{(\beta)}_{i_1\ldots i_n}\,(it_{i_1})\cdots(it_{i_n}) = \ln\left(\sum_{m=0}^{\infty}\,\sum_{j_1=1}^{M}\cdots\sum_{j_m=1}^{M}\frac{1}{m!}\,\mu^{(\beta)}_{j_1\ldots j_m}\,(it_{j_1})\cdots(it_{j_m})\right) \qquad (7)$$

where the argument of the logarithm is the robust characteristic function,

$$\varphi^{(\beta)}(t) = E\left[\exp(i\beta x^Tt)\,\frac{\phi_\beta(x)}{\phi(x)}\right] \qquad (8)$$

Recall that classical cumulants satisfy the corresponding properties: I. for a Gaussian density, all cumulants higher than second order vanish; II. for independent random variables, all cross-cumulants vanish.
The explicit relation between robust moments and cumulants up to fourth order is given in appendix A.
With the above definitions we can now state some important properties for the robust cumulants. Since we assume zero-mean and unit-variance data we cannot expect the cumulants to be invariant with respect to translations and scalings. However, we will prove that the following properties are still valid,
Theorem 1 The following properties are true for robust
cumulants:
I. For a standard Gaussian density, all robust cumulants
higher than second order vanish.
II. For independent random variables, robust cross-cumulants vanish.
III. All robust cumulants transform multi-linearly with respect to rotations.
Proof: I: For a standard Gaussian we can compute the moment generating function analytically, giving $\varphi^{(\beta)}(t) = \exp\left(-\frac{1}{2}t^Tt\right)$, implying that $\kappa^{(\beta)}_{i_1i_2} = \delta_{i_1i_2}$ and that all other cumulants vanish.
II: We note that if the variables $\{x_i\}$ are independent, $\varphi^{(\beta)}(t)$ factorizes into a product of expectations, which the logarithm turns into a sum, each term depending on only one $t_i$. Since the cross-cumulants on the left-hand side of Eq.7 are precisely the terms which contain distinct $t_i$, they must be zero.
III: From Eq.6 we see that since the decay factor is
isotropic, robust moments still transform multi-linearly
with respect to rotations. If we rotate both the moments
and t in the right-hand side of Eq.7, it remains invariant.
To ensure that the left-hand side of Eq.7 remains invariant we infer that the robust cumulants must also transform
multi-linearly with respect to rotations,
$$\kappa'^{(\beta)}_{i_1\ldots i_n} = \sum_{j_1\ldots j_n} O_{i_1j_1}\cdots O_{i_nj_n}\,\kappa^{(\beta)}_{j_1\ldots j_n}, \qquad OO^T = O^TO = I \qquad (9)$$
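Property I is easy to verify numerically: for standard Gaussian data the robust characteristic function should equal exp(−t²/2) for any β. A small Monte Carlo check (our own, one-dimensional):

```python
import numpy as np

beta = 1.4
x = np.random.default_rng(1).standard_normal(2_000_000)
w = beta * np.exp(-0.5 * (beta**2 - 1.0) * x**2)      # phi_beta/phi for d = 1
for t in (0.5, 1.0, 2.0):
    phi_t = np.mean(np.exp(1j * beta * x * t) * w)    # Eq. 8 in one dimension
    print(t, phi_t.real.round(4), np.exp(-0.5 * t**2).round(4))
```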
The robust Gram-Charlier expansion of a density p(x) then reads,

$$p(x) = \sum_{n=0}^{\infty} c^{(\beta)}_n\,H_n(\beta x)\,\phi(x) \qquad (10)$$

with

$$c^{(\beta)}_n = \frac{1}{n!}\,E\left[H_n(\beta x)\,\frac{\phi_\beta(x)}{\phi(x)}\right] \qquad (11)$$

where the $H_n$ denote the (probabilists') Hermite polynomials. Using the notation

$$\tilde\kappa^{(\beta)}_n = \kappa^{(\beta)}_n - \delta_{n,2} \qquad (13)$$

we arrive at the following result,

Theorem 2 The robust Edgeworth expansion of a density p(x) is given by:

$$p(x) = \frac{\phi(x)}{\phi_\beta(x)}\,\exp\left(\sum_{n=0}^{\infty}\frac{1}{n!}\,\tilde\kappa^{(\beta)}_n\,(-1)^n\,\frac{d^n}{d(\beta x)^n}\right)\phi_\beta(x) \qquad (14)$$

In practice the expansions have to be truncated at some finite order R,

$$p_R(x) = \left\{\sum_{n=0}^{R} c^{(\beta)}_n\,H_n(\beta x) + \Delta(x)\right\}\phi(x) \qquad (15)$$

where the correction term restores the normalization,

$$\Delta(x) = \gamma\left(\frac{\phi(x)}{\phi_\beta(x)} - \sum_{n=0}^{R} a_n\,H_n(\beta x)\right) \qquad (16)$$

with the constant $\gamma \propto 1 - \sum_{n=0}^{R} n!\,a_n\,c^{(\beta)}_n$ fixed by the requirement that $p_R(x)$ integrates to one, and with $a_n = \frac{(n-1)!!}{n!}\left(\beta^2-1\right)^{n/2}\,\delta_{n,2k}$ for $k \in \{0,1,2,3,\ldots\}$, where $(n-1)!!$ denotes the double factorial of $(n-1)$ defined by $1\cdot3\cdot5\cdots(n-1)$. The correction factor is thus orthogonal to all Hermite polynomials $H_n(\beta x)$ with $n = 1..R$ under the new measure $d\nu = \phi_\beta(x)\,dx$. We can also show that $p_R(x)$ always integrates to 1 and that when $\beta \to 1$ the correction term will reduce to $\phi(x)\,c_{R+K}\,H_{R+K}(x)$ with $K = 1$ when R is odd and $K = 2$ when R is even.

Finally, the multivariate robust Edgeworth expansion is given by,

$$p(x) = \frac{\phi(x)}{\phi_\beta(x)}\,\exp\left(\sum_{n=0}^{\infty}\sum_{i_1=1}^{M}\cdots\sum_{i_n=1}^{M}\frac{1}{n!}\,\tilde\kappa^{(\beta)}_{i_1\ldots i_n}\,(-1)^n\,\frac{d}{d(\beta x)_{i_1}}\cdots\frac{d}{d(\beta x)_{i_n}}\right)\phi_\beta(x)$$

with $\tilde\kappa^{(\beta)}_{i_1i_2} = \kappa^{(\beta)}_{i_1i_2} - \delta_{i_1i_2}$ and $\tilde\kappa^{(\beta)}_{i_1\ldots i_n} = \kappa^{(\beta)}_{i_1\ldots i_n}$ otherwise.
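The coefficients $a_n$ are precisely the expectations of the Hermite polynomials under the standard normal, $E_\phi[H_n(\beta x)] = n!\,a_n$. The following numerical check (our own, assuming probabilists' Hermite polynomials as implemented in numpy.polynomial.hermite_e) verifies this identity:

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval  # probabilists' Hermite He_n

def a_n(n, beta):
    """a_n = (n-1)!!/n! (beta^2 - 1)^(n/2) for even n, zero for odd n (Eq. 16)."""
    if n % 2:
        return 0.0
    double_fact = math.prod(range(1, n, 2)) if n > 0 else 1  # (n-1)!!
    return double_fact / math.factorial(n) * (beta**2 - 1.0)**(n // 2)

beta = 1.3
x = np.random.default_rng(0).standard_normal(1_000_000)
for n in range(6):
    coef = np.zeros(n + 1)
    coef[n] = 1.0                                   # selects He_n in hermeval
    mc = np.mean(hermeval(beta * x, coef))          # Monte Carlo E_phi[H_n(beta x)]
    print(n, round(mc, 3), round(math.factorial(n) * a_n(n, beta), 3))
```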
[Figure: the bias and the traces of the asymptotic variance and of the Cramér-Rao bound J−1 as functions of β, panels (a)-(d).]
First we mention that the estimators $c^{(\beta)}_n[p_R]$ for the truncated series expansion (Eq.15) are Fisher consistent. This can be shown by replacing p(x) in Eq.11 with $p_R(x)$ and using orthogonality between $\Delta(x)$ and the Hermite polynomials $H_n(\beta x)$, $n = 1..R$, w.r.t. the measure $d\nu$.

To prove B-robustness we need to define and calculate the influence function IF for the estimators $c^{(\beta)}_n$. Intuitively, the influence function measures the sensitivity of the estimators to adding one more observation at location x,

$$IF(x) = \lim_{\epsilon\downarrow0}\;\frac{c^{(\beta)}_n\left[(1-\epsilon)\,p + \epsilon\,\delta_x\right] - c^{(\beta)}_n[p]}{\epsilon} \qquad (17)$$

For the estimators of Eq.11 this evaluates to

$$IF(x) = \frac{1}{n!}\,H_n(\beta x)\,\frac{\phi_\beta(x)}{\phi(x)} - c^{(\beta)}_n \qquad (18)$$

Since for $\beta > 1$ the decaying factor $\phi_\beta(x)/\phi(x)$ decays faster than any polynomial grows, this influence function is bounded and the estimators are B-robust; at $\beta = 1$ we recover the unbounded influence function of the classical estimators.

The efficiency of the estimators can be assessed by comparing their asymptotic variance $V = E_p\left[IF(x)\,IF(x)^T\right]$ with the Cramér-Rao bound that follows from the inverse of the Fisher information,

$$J(c^{(\beta)}_n, c^{(\beta)}_m) = E_{p}\left[\frac{1}{p(x)^2}\,\frac{\partial p(x)}{\partial c^{(\beta)}_n}\,\frac{\partial p(x)}{\partial c^{(\beta)}_m}\right] = \int\frac{H_n(\beta x)\,H_m(\beta x)\,\phi(x)^2}{p(x)}\,dx \qquad (22)$$

The Cramér-Rao bound was also plotted (dashed line). The model included 10 orders in the expansion, n = 0, ..., 9, plus the normalization term $\Delta(x)$. All quantities were computed using numerical integration. We conclude that both bias and efficiency improve when $\beta$ moves away from the classical case $\beta = 1$.
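A small one-dimensional sketch (our own illustration) makes the boundedness of Eq.18 for β > 1 concrete: the Gaussian decay of φβ(x)/φ(x) dominates the polynomial growth of Hn, so distant outliers have vanishing influence, whereas at β = 1 the influence function is unbounded.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval  # probabilists' Hermite He_n

def influence(x, n, c_n, beta=1.2):
    """Influence function of Eq. 18 in one dimension:
    IF(x) = H_n(beta x) phi_beta(x)/phi(x) / n! - c_n."""
    coef = np.zeros(n + 1)
    coef[n] = 1.0
    w = beta * np.exp(-0.5 * (beta**2 - 1.0) * x**2)   # phi_beta/phi for d = 1
    return hermeval(beta * x, coef) * w / math.factorial(n) - c_n

xs = np.array([1.0, 5.0, 10.0, 50.0])
print(influence(xs, n=4, c_n=0.0, beta=1.2))   # bounded, tails vanish
print(influence(xs, n=4, c_n=0.0, beta=1.0))   # classical case: grows like x^4
```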
6 INDEPENDENT COMPONENTS ANALYSIS
We propose to use the robust Edgeworth and Gram-Charlier expansions discussed in this paper instead of the classical ones. As we will show in the experiments below, it is safe to include robust cumulants to very high order in these expansions (we have gone up to order 20), which at a moderate computational cost will have a significant impact on the accuracy of our estimates of the marginal distributions. We note that the derivation of the contrast function in e.g. [4] crucially depends on properties I, II and III from theorem 1. This makes our robust cumulants the ideal candidates to replace the classical ones. Instead of going through this derivation we will argue for a novel contrast function that represents a slight generalization of the one proposed in [4],

$$I(O) = \sum_{n=1}^{R}\sum_{i=1}^{M} w_n\left(\tilde\kappa^{(\beta)}_{i\ldots i}\right)^2, \qquad w_n \geq 0 \qquad (23)$$

where the $\tilde\kappa^{(\beta)}_{i\ldots i}$ only differ from the usual $\kappa^{(\beta)}_{i\ldots i}$ in second order, $\tilde\kappa^{(\beta)}_{ii} = \kappa^{(\beta)}_{ii} - 1$. These cumulants are defined on the rotated axes $e'_i = O^Te_i$.

We will now state a number of properties that show the validity of I(O) as a contrast function for ICA,

Theorem 4 The following properties are true for I(O):

i. I(O) is maximal if the probability distribution on the corresponding axes factors into an independent product of marginal distributions.

ii. I(O) is minimal (i.e. 0) if the marginal distributions on the corresponding axes are Gaussian.

Proof: To prove (i) we note that the following expression is scalar (i.e. invariant) w.r.t. rotations,

$$\sum_{i_1\ldots i_n}\left(\tilde\kappa^{(\beta)}_{i_1\ldots i_n}\right)^2 = \text{constant} \qquad (24)$$

at every order n. Maximizing the diagonal terms collected in I(O) therefore minimizes the squared cross-cumulants, which by property II of theorem 1 vanish precisely when the distribution factors into an independent product of marginals. Property (ii) follows from property I of theorem 1: for Gaussian marginal distributions all diagonal terms $\tilde\kappa^{(\beta)}_{i\ldots i}$ vanish, so I(O) attains its minimal value 0.

[Figure: estimated marginal densities p(x) and expansion coefficient magnitudes for a mixture-of-Gaussians example (panel title "Mixture of Gaussians (0.5, c=3, d=2)"), panels (a)-(d).]
In general many densities may maximize I(O) (for instance distributions which only differ in the statistics of order higher than R). Good objective functions are discriminative in the sense that there are only few (relevant) densities that maximize them. We can influence the ability of I(O) to discriminate by changing the weighting factors $w_n$. Doing this allows for a more directed search towards predefined qualities, e.g. a search for high kurtosis directions would imply a large $w_4$.
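To make Eq.23 concrete, here is a minimal NumPy sketch (our own; the weights w_n and the value of β are illustrative) that evaluates I(O) using the diagonal robust cumulants up to fourth order, computed from robust moments via the relations of appendix A:

```python
import numpy as np

def robust_diag_cumulants(Y, beta):
    """Diagonal robust cumulants of orders 2-4, per axis, obtained from
    robust sample moments via the relations of appendix A."""
    N, d = Y.shape
    w = beta**d * np.exp(-0.5 * (beta**2 - 1.0) * np.sum(Y**2, axis=1))
    m0 = np.mean(w)
    m1, m2, m3, m4 = (np.mean(((beta * Y)**n) * w[:, None], axis=0) / m0
                      for n in range(1, 5))
    k2 = m2 - m1**2
    k3 = m3 - 3*m2*m1 + 2*m1**3
    k4 = m4 - 3*m2**2 - 4*m3*m1 + 12*m2*m1**2 - 6*m1**4
    return k2, k3, k4

def contrast(X, O, beta=1.2, w=(1.0, 1.0, 1.0)):
    """I(O) of Eq. 23, restricted to orders n = 2, 3, 4,
    with kappa-tilde_ii = kappa_ii - 1 in second order."""
    k2, k3, k4 = robust_diag_cumulants(X @ O, beta)
    return w[0]*np.sum((k2 - 1.0)**2) + w[1]*np.sum(k3**2) + w[2]*np.sum(k4**2)
```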
A straightforward strategy to maximize I(O) is gradient
ascent while at every iteration projecting the solution back
onto the manifold of rotations (e.g. see [10]). A more efficient technique which exploits the tensorial property of
cumulants (i.e. property III of theorem 1) was proposed in
[3]. This technique, called Jacobi-optimization, iteratively solves two-dimensional sub-problems analytically.
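A minimal sketch of the first strategy, reusing contrast from the sketch above (the finite-difference gradient and the SVD-based projection onto rotations are our own simplifications; the Jacobi scheme of [3] is not shown):

```python
import numpy as np

def project_to_rotation(M):
    """Nearest orthogonal matrix to M via the polar decomposition (SVD)."""
    U, _, Vt = np.linalg.svd(M)
    if np.linalg.det(U @ Vt) < 0:
        U[:, -1] *= -1.0              # flip one axis to get a proper rotation
    return U @ Vt

def maximize_contrast(X, beta=1.2, steps=100, lr=0.1, eps=1e-4):
    """Gradient ascent on I(O), projecting back onto rotations each step."""
    d = X.shape[1]
    O = np.eye(d)
    for _ in range(steps):
        f0 = contrast(X, O, beta)
        G = np.zeros((d, d))
        for i in range(d):
            for j in range(d):
                dO = np.zeros((d, d))
                dO[i, j] = eps
                G[i, j] = (contrast(X, O + dO, beta) - f0) / eps
        O = project_to_rotation(O + lr * G)
    return O
```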
7 EXPERIMENTS
The following set of experiments focuses on density estimates based on the Gram-Charlier expansion (Eq.10), where we replace Eq.11 with a sample estimate,

$$c^{(\beta)}_n = \frac{1}{N}\,\frac{1}{n!}\,\sum_{A=1}^{N} H_n(\beta x_A)\,\frac{\phi_\beta(x_A)}{\phi(x_A)} \qquad (25)$$
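A one-dimensional sketch of the resulting density estimator (our own code; Eq.25 for the coefficients and the truncated expansion of Eq.15 with the correction term Δ(x) omitted; probabilists' Hermite polynomials are assumed):

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval  # probabilists' Hermite He_n

def fit_coefficients(x, R=10, beta=1.2):
    """Sample estimates c_n^(beta), n = 0..R (Eq. 25), for 1-D data x."""
    w = beta * np.exp(-0.5 * (beta**2 - 1.0) * x**2)   # phi_beta(x)/phi(x)
    cs = []
    for n in range(R + 1):
        coef = np.zeros(n + 1)
        coef[n] = 1.0
        cs.append(np.mean(hermeval(beta * x, coef) * w) / math.factorial(n))
    return np.array(cs)

def density(xs, cs, beta=1.2):
    """Truncated expansion sum_n c_n H_n(beta x) phi(x); Delta(x) omitted."""
    phi = np.exp(-0.5 * xs**2) / np.sqrt(2.0 * np.pi)
    total = np.zeros_like(xs)
    for n, c in enumerate(cs):
        coef = np.zeros(n + 1)
        coef[n] = 1.0
        total += c * hermeval(beta * xs, coef)
    return total * phi
```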
http://sweat.cs.unm.edu/bap/demos.html
[Figures: approximation error as a function of β and as a function of the order of the expansion, panels (a)-(d).]
8 DISCUSSION
In this paper we have proposed robust alternatives to higher order moments and cumulants. In order to arrive at robust statistics, a Gaussian decaying factor $\phi_\beta(x)/\phi(x)$ was incorporated into their definitions.
[Figure: estimated densities p(x), panels (a) and (b).]
A RELATION BETWEEN ROBUST MOMENTS AND CUMULANTS

Below we list the explicit relations between robust moments and cumulants up to fourth order, where we drop the superscript (β) for notational convenience,

$$\kappa_0 = \ln\mu_0 \qquad \kappa_1 = \frac{\mu_1}{\mu_0} \qquad \kappa_2 = \frac{\mu_2}{\mu_0} - \left(\frac{\mu_1}{\mu_0}\right)^2$$

$$\kappa_3 = \frac{\mu_3}{\mu_0} - 3\,\frac{\mu_2\,\mu_1}{\mu_0^2} + 2\left(\frac{\mu_1}{\mu_0}\right)^3$$

$$\kappa_4 = \frac{\mu_4}{\mu_0} - 3\left(\frac{\mu_2}{\mu_0}\right)^2 - 4\,\frac{\mu_3\,\mu_1}{\mu_0^2} + 12\,\frac{\mu_2\,\mu_1^2}{\mu_0^3} - 6\left(\frac{\mu_1}{\mu_0}\right)^4$$

and conversely,

$$\mu_0 = e^{\kappa_0} \qquad \frac{\mu_1}{\mu_0} = \kappa_1 \qquad \frac{\mu_2}{\mu_0} = \kappa_2 + \kappa_1^2$$

$$\frac{\mu_3}{\mu_0} = \kappa_3 + 3\,\kappa_2\,\kappa_1 + \kappa_1^3$$

$$\frac{\mu_4}{\mu_0} = \kappa_4 + 4\,\kappa_3\,\kappa_1 + 3\,\kappa_2^2 + 6\,\kappa_2\,\kappa_1^2 + \kappa_1^4$$
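For completeness, a direct transcription of these relations into code (our own helper functions; the two maps are mutual inverses):

```python
import math

def cumulants_from_moments(m0, m1, m2, m3, m4):
    """kappa_0..kappa_4 from the (unnormalized) robust moments mu_0..mu_4."""
    r1, r2, r3, r4 = m1/m0, m2/m0, m3/m0, m4/m0
    k0 = math.log(m0)
    k1 = r1
    k2 = r2 - r1**2
    k3 = r3 - 3*r2*r1 + 2*r1**3
    k4 = r4 - 3*r2**2 - 4*r3*r1 + 12*r2*r1**2 - 6*r1**4
    return k0, k1, k2, k3, k4

def moments_from_cumulants(k0, k1, k2, k3, k4):
    """Inverse relations: mu_0..mu_4 from kappa_0..kappa_4."""
    m0 = math.exp(k0)
    return (m0,
            m0 * k1,
            m0 * (k2 + k1**2),
            m0 * (k3 + 3*k2*k1 + k1**3),
            m0 * (k4 + 4*k3*k1 + 3*k2**2 + 6*k2*k1**2 + k1**4))
```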
B PROOF OF THEOREM 2
The characteristic function or moment generating function
of a PDF is defined by:
$$\varphi(t) = \int e^{ixt}\,p(x)\,dx = \sum_{n=0}^{\infty}\frac{1}{n!}\,\mu_n\,(it)^n = F[p(x)] \qquad (26)$$
where the last term follows from Taylor expanding the exponential and F denotes the Fourier transform. For arbitrary β we have,
$$\varphi^{(\beta)}(t) = \int e^{i\beta xt}\,p(x)\,\frac{\phi_\beta(x)}{\phi(x)}\,dx = \sum_{n=0}^{\infty}\frac{1}{n!}\,\mu^{(\beta)}_n\,(it)^n = F\left[p(x)\,\frac{\phi_\beta(x)}{\phi(x)}\right] \qquad (27)$$

where F is now taken with respect to the scaled variable $\beta x$. Exponentiating the cumulant expansion of $\ln\varphi^{(\beta)}(t)$ and applying the inverse transform then gives,

$$p(x) = \frac{\phi(x)}{\phi_\beta(x)}\,\frac{\beta}{2\pi}\int e^{-i\beta xt}\,e^{\sum_{n=0}^{\infty}\frac{1}{n!}\,\kappa^{(\beta)}_n\,(it)^n}\,dt \qquad (30)$$
Finally, we will need the result

$$F^{-1}\left[(it)^n\,e^{-\frac{1}{2}t^2}\right] = (-1)^n\,\frac{d^n}{d(\beta x)^n}\,\phi_\beta(x). \qquad (31)$$

If we expand the exponential containing the cumulants in a Taylor series, and do the inverse Fourier transform on every term separately, after which we combine the terms again in an exponential, we find the desired result (Eq.14).
References

[1] S. Amari, A. Cichocki, and H.H. Yang. A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems, 8:757-763, 1996.

[2] A.J. Bell and T.J. Sejnowski. The independent components of natural scenes are edge filters. Vision Research, 37:3327-3338, 1997.

[3] J.F. Cardoso. High-order contrasts for independent component analysis. Neural Computation, 11:157-192, 1999.

[4] P. Comon. Independent component analysis, a new concept? Signal Processing, 36:287-314, 1994.

[5] F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel. Robust Statistics. Wiley, 1986.

[6] P.J. Huber. Robust Statistics. Wiley, 1981.

[7] A. Hyvärinen. New approximations of differential entropy for independent component analysis and projection pursuit. In Advances in Neural Information Processing Systems, volume 10, pages 273-279, 1998.

[8] M.G. Kendall and A. Stuart. The Advanced Theory of Statistics, Vol. 1. Griffin, 1963.

[9] P. McCullagh. Tensor Methods in Statistics. Chapman and Hall, 1987.

[10] M. Welling and M. Weber. A constrained EM algorithm for independent component analysis. Neural Computation, 13:677-689, 2001.