On the Chi square and higher-order Chi distances for approximating f-divergences

Frank Nielsen¹   Richard Nock²
www.informationgeometry.org
¹ Sony Computer Science Laboratories, Inc.
² UAG-CEREGMIA

September 2013

© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.
Statistical divergences
A statistical divergence measures the separability between two distributions.

Examples: the Pearson/Neyman χ² and the Kullback-Leibler divergence:

\chi^2_P(X_1 : X_2) = \int \frac{(x_2(x) - x_1(x))^2}{x_1(x)} \, d\nu(x),

\chi^2_N(X_1 : X_2) = \int \frac{(x_1(x) - x_2(x))^2}{x_2(x)} \, d\nu(x),

\mathrm{KL}(X_1 : X_2) = \int x_1(x) \log \frac{x_1(x)}{x_2(x)} \, d\nu(x).
f-divergences: A generic definition

I_f(X_1 : X_2) = \int x_1(x) \, f\!\left(\frac{x_2(x)}{x_1(x)}\right) d\nu(x) \geq 0,

where f is a convex function f : (0, \infty) \subseteq \mathrm{dom}(f) \to [0, \infty] such that f(1) = 0.

Jensen's inequality: I_f(X_1 : X_2) \geq f\!\left(\int x_2(x) \, d\nu(x)\right) = f(1) = 0.

One may further require f'(1) = 0 and fix the scale of the divergence by setting f''(1) = 1.

Any f-divergence can be symmetrized:

S_f(X_1 : X_2) = I_f(X_1 : X_2) + I_{f^*}(X_1 : X_2)

with f^*(u) = u f(1/u), so that I_{f^*}(X_1 : X_2) = I_f(X_2 : X_1).
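To make the definition concrete, here is a minimal sketch (not from the slides) that evaluates I_f for finite discrete distributions; the distributions p, q and the two generators are illustrative choices.

import math

def f_divergence(x1, x2, f):
    """I_f(X1 : X2) = sum_x x1(x) * f(x2(x) / x1(x)) over a finite common support.

    x1, x2: sequences of probabilities (assumed strictly positive).
    f: convex generator with f(1) = 0.
    """
    return sum(p1 * f(p2 / p1) for p1, p2 in zip(x1, x2))

# Illustrative generators taken from the table of examples below.
kl_gen      = lambda u: -math.log(u)     # Kullback-Leibler:  I_f = KL(x1 : x2)
pearson_gen = lambda u: (u - 1.0) ** 2   # Pearson chi^2:     I_f = chi^2_P(x1 : x2)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(f_divergence(p, q, kl_gen))        # KL(p : q)
print(f_divergence(p, q, pearson_gen))   # chi^2_P(p : q)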
f-divergences: Some examples

Name of the f-divergence   | Formula I_f(P : Q)                                                      | Generator f(u) with f(1) = 0
Total variation (metric)   | ½ ∫ |p(x) − q(x)| dν(x)                                                 | ½ |u − 1|
Squared Hellinger          | ∫ (√p(x) − √q(x))² dν(x)                                                | (√u − 1)²
Pearson χ²_P               | ∫ (q(x) − p(x))² / p(x) dν(x)                                           | (u − 1)²
Neyman χ²_N                | ∫ (p(x) − q(x))² / q(x) dν(x)                                           | (1 − u)² / u
Pearson-Vajda χᵏ_P         | ∫ (q(x) − λp(x))ᵏ / p^{k−1}(x) dν(x)                                    | (u − 1)ᵏ
Pearson-Vajda |χ|ᵏ_P       | ∫ |q(x) − λp(x)|ᵏ / p^{k−1}(x) dν(x)                                    | |u − 1|ᵏ
Kullback-Leibler           | ∫ p(x) log(p(x)/q(x)) dν(x)                                             | − log u
reverse Kullback-Leibler   | ∫ q(x) log(q(x)/p(x)) dν(x)                                             | u log u
α-divergence               | 4/(1 − α²) (1 − ∫ p^{(1−α)/2}(x) q^{(1+α)/2}(x) dν(x))                  | 4/(1 − α²) (1 − u^{(1+α)/2})
Jensen-Shannon             | ½ ∫ (p(x) log(2p(x)/(p(x)+q(x))) + q(x) log(2q(x)/(p(x)+q(x)))) dν(x)   | −(u + 1) log((1+u)/2) + u log u
Stochastic approximations of f-divergences

\hat{I}_f^{(n)}(X_1 : X_2) = \frac{1}{2n} \sum_{i=1}^{n} \left[ f\!\left(\frac{x_2(s_i)}{x_1(s_i)}\right) + \frac{x_1(t_i)}{x_2(t_i)} \, f\!\left(\frac{x_2(t_i)}{x_1(t_i)}\right) \right],

with s_1, ..., s_n and t_1, ..., t_n drawn i.i.d. from X_1 and X_2, respectively.

\lim_{n \to \infty} \hat{I}_f^{(n)}(X_1 : X_2) = I_f(X_1 : X_2)

◮ Works for any generator f, but...
◮ in practice it is limited to small-dimensional supports.
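One possible implementation of this estimator (a sketch, not the authors' Java package; the univariate Gaussian densities and samplers below are illustrative choices):

import math
import random

def mc_f_divergence(f, pdf1, pdf2, sample1, sample2, n=100_000):
    """Monte Carlo estimate of I_f(X1 : X2) from n samples of each distribution."""
    total = 0.0
    for _ in range(n):
        s, t = sample1(), sample2()
        r_s = pdf2(s) / pdf1(s)           # density ratio at a sample from X1
        r_t = pdf2(t) / pdf1(t)           # density ratio at a sample from X2
        total += f(r_s) + f(r_t) / r_t    # the 1/r_t factor re-weights the X2 sample
    return total / (2 * n)

# Illustration: KL between two unit-variance Gaussians, generator f(u) = -log(u).
mu1, mu2 = 0.0, 1.0
gauss_pdf = lambda mu: (lambda x: math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi))
estimate = mc_f_divergence(lambda u: -math.log(u),
                           gauss_pdf(mu1), gauss_pdf(mu2),
                           lambda: random.gauss(mu1, 1.0),
                           lambda: random.gauss(mu2, 1.0))
print(estimate)   # close to the exact KL = (mu1 - mu2)**2 / 2 = 0.5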
Exponential families
Canonical decomposition of the probability measure:
p_\theta(x) = \exp(\langle t(x), \theta \rangle - F(\theta) + k(x)).

Here, consider an affine natural parameter space \Theta.

\mathrm{Poi}(\lambda): \; p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad \lambda > 0, \; x \in \{0, 1, \ldots\}

\mathrm{Nor}_I(\mu): \; p(x \mid \mu) = (2\pi)^{-d/2} e^{-\frac{1}{2}(x - \mu)^\top (x - \mu)}, \quad \mu \in \mathbb{R}^d, \; x \in \mathbb{R}^d

Family         | θ     | Θ    | F(θ)     | k(x)                       | t(x) | ν
Poisson        | log λ | ℝ    | e^θ      | −log x!                    | x    | ν_c (counting)
Iso. Gaussian  | μ     | ℝ^d  | ½ θ^⊤θ   | −(d/2) log(2π) − ½ x^⊤x    | x    | ν_L (Lebesgue)
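As a quick sanity check on the Poisson row (a sketch, not from the slides), the canonical form exp(⟨t(x), θ⟩ − F(θ) + k(x)) can be compared numerically against the standard pmf:

import math

def poisson_pmf(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_canonical(x, lam):
    theta = math.log(lam)                # natural parameter theta = log(lambda)
    F = math.exp(theta)                  # log-normalizer F(theta) = e^theta
    k = -math.log(math.factorial(x))     # carrier term k(x) = -log x!
    return math.exp(x * theta - F + k)   # t(x) = x, so <t(x), theta> = x * theta

for x in range(6):
    assert abs(poisson_pmf(x, 2.5) - poisson_canonical(x, 2.5)) < 1e-12
print("canonical decomposition matches the Poisson pmf")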
χ² for affine exponential families

Bypass the integral computation; closed-form formulas:

\chi^2_P(X_1 : X_2) = e^{F(2\theta_2 - \theta_1) - (2F(\theta_2) - F(\theta_1))} - 1,

\chi^2_N(X_1 : X_2) = e^{F(2\theta_1 - \theta_2) - (2F(\theta_1) - F(\theta_2))} - 1.

The Kullback-Leibler divergence amounts to a Bregman divergence [3]:

\mathrm{KL}(X_1 : X_2) = B_F(\theta_2 : \theta_1),
\qquad B_F(\theta : \theta') = F(\theta) - F(\theta') - (\theta - \theta')^\top \nabla F(\theta').
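A minimal sketch of these closed forms for the Poisson family (F(θ) = e^θ, see the table above); the direct summation over a truncated support is only there to cross-check the Pearson formula and is not needed in practice:

import math

F = math.exp   # Poisson log-normalizer F(theta) = e^theta; its gradient is also exp

def chi2_pearson(theta1, theta2):
    return math.exp(F(2 * theta2 - theta1) - (2 * F(theta2) - F(theta1))) - 1.0

def chi2_neyman(theta1, theta2):
    return math.exp(F(2 * theta1 - theta2) - (2 * F(theta1) - F(theta2))) - 1.0

def kl_bregman(theta1, theta2):
    # KL(X1 : X2) = B_F(theta2 : theta1)
    return F(theta2) - F(theta1) - (theta2 - theta1) * F(theta1)

lam1, lam2 = 0.6, 0.3
t1, t2 = math.log(lam1), math.log(lam2)

pmf = lambda x, lam: lam ** x * math.exp(-lam) / math.factorial(x)
direct = sum((pmf(x, lam2) - pmf(x, lam1)) ** 2 / pmf(x, lam1) for x in range(60))

print(chi2_pearson(t1, t2), direct)   # both ~0.1618: closed form vs direct summation
print(chi2_neyman(t1, t2))
print(kl_bregman(t1, t2))             # ~0.1158, matching the exact KL quoted later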
Higher-order Vajda χᵏ divergences

\chi^k_P(X_1 : X_2) = \int \frac{(x_2(x) - x_1(x))^k}{x_1(x)^{k-1}} \, d\nu(x),

|\chi|^k_P(X_1 : X_2) = \int \frac{|x_2(x) - x_1(x)|^k}{x_1(x)^{k-1}} \, d\nu(x)

are f-divergences for the generators (u − 1)^k and |u − 1|^k, respectively.

◮ When k = 1, χ¹_P(X_1 : X_2) = ∫ (x_1(x) − x_2(x)) dν(x) = 0 (never discriminative), and |χ|¹_P(X_1 : X_2) is twice the total variation distance.
◮ χ⁰_P is the unit constant.
◮ χᵏ_P is a signed distance.
Higher-order Vajda χᵏ divergences

Lemma
The (signed) χᵏ_P distance between members X_1 ∼ EF(θ_1) and X_2 ∼ EF(θ_2) of the same affine exponential family is always bounded and equal to (for k ∈ ℕ):

\chi^k_P(X_1 : X_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} \frac{e^{F((1-j)\theta_1 + j\theta_2)}}{e^{(1-j)F(\theta_1) + jF(\theta_2)}}.
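A sketch of this lemma as code (assuming, as before, that the family is specified by its log-normalizer F); it is instantiated with the Poisson family and cross-checked against direct summation:

import math

def vajda_chi_k(k, theta1, theta2, F):
    """Signed chi^k_P distance between EF(theta1) and EF(theta2), log-normalizer F."""
    return sum((-1) ** (k - j) * math.comb(k, j)
               * math.exp(F((1 - j) * theta1 + j * theta2)
                          - ((1 - j) * F(theta1) + j * F(theta2)))
               for j in range(k + 1))

# Illustration with the Poisson family: F(theta) = e^theta, theta = log(lambda).
lam1, lam2 = 0.6, 0.3
t1, t2 = math.log(lam1), math.log(lam2)
pmf = lambda x, lam: lam ** x * math.exp(-lam) / math.factorial(x)

for k in (2, 3, 4):
    closed = vajda_chi_k(k, t1, t2, math.exp)
    direct = sum((pmf(x, lam2) - pmf(x, lam1)) ** k / pmf(x, lam1) ** (k - 1)
                 for x in range(80))
    print(k, closed, direct)   # each pair of values should agree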
Higher-order Vajda χᵏ divergences
For Poisson and isotropic Normal distributions, we get closed-form formulas:

\chi^k_P(\lambda_1 : \lambda_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} e^{\lambda_1^{1-j} \lambda_2^{j} - ((1-j)\lambda_1 + j\lambda_2)},

\chi^k_P(\mu_1 : \mu_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} e^{\frac{1}{2} j (j-1) (\mu_1 - \mu_2)^\top (\mu_1 - \mu_2)},

both signed distances.
f-divergences from Taylor series

Lemma (extends Theorem 1 of [1])
When bounded, the f-divergence I_f can be expressed as a power series of higher-order chi-type distances:

I_f(X_1 : X_2) = \int x_1(x) \sum_{i=0}^{\infty} \frac{1}{i!} f^{(i)}(\lambda) \left( \frac{x_2(x)}{x_1(x)} - \lambda \right)^{i} d\nu(x)
              = \sum_{i=0}^{\infty} \frac{1}{i!} f^{(i)}(\lambda) \, \chi^{i}_{\lambda,P}(X_1 : X_2),

provided I_f < ∞, where \chi^{i}_{\lambda,P}(X_1 : X_2) generalizes \chi^{i}_P:

\chi^{i}_{\lambda,P}(X_1 : X_2) = \int \frac{(x_2(x) - \lambda x_1(x))^{i}}{x_1(x)^{i-1}} \, d\nu(x),

with \chi^{0}_{\lambda,P}(X_1 : X_2) = 1 by convention. Note that \chi^{k}_{\lambda,P}(X_1 : X_2) - (1 - \lambda)^k is an f-divergence for the generator f(u) = (u - \lambda)^k - (1 - \lambda)^k (which satisfies f(1) = 0), so that \chi^{k}_{\lambda,P} \geq (1 - \lambda)^k whenever this generator is convex.
f-divergences: Analytic formula

◮ λ = 1 ∈ int(dom(f^{(i)})): for f-divergences (Theorem 1 of [1]),

\left| I_f(X_1 : X_2) - \sum_{k=0}^{s} \frac{f^{(k)}(1)}{k!} \, \chi^{k}_P(X_1 : X_2) \right| \leq \frac{1}{(s+1)!} \, \|f^{(s+1)}\|_{\infty} \, (M - m)^{s},

where \|f^{(s+1)}\|_{\infty} = \sup_{t \in [m, M]} |f^{(s+1)}(t)| and m \leq \frac{p}{q} \leq M.

◮ λ = 0 (whenever 0 ∈ int(dom(f^{(i)}))) and affine exponential families: simpler expression

I_f(X_1 : X_2) = \sum_{i=0}^{\infty} \frac{f^{(i)}(0)}{i!} \, I_{1-i,i}(\theta_1 : \theta_2),
\qquad I_{1-i,i}(\theta_1 : \theta_2) = \frac{e^{F(i\theta_2 + (1-i)\theta_1)}}{e^{iF(\theta_2) + (1-i)F(\theta_1)}}.
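A sketch of the λ = 0 expansion (using, for illustration, the Poisson log-normalizer F(θ) = e^θ and the Pearson generator f(u) = (u − 1)², whose derivatives at 0 are finite and vanish beyond order 2, so the series terminates and should reproduce χ²_P exactly):

import math

def I_binding(a, b, theta1, theta2, F):
    """I_{a,b}(theta1 : theta2) = exp(F(a*theta1 + b*theta2) - (a*F(theta1) + b*F(theta2)))."""
    return math.exp(F(a * theta1 + b * theta2) - (a * F(theta1) + b * F(theta2)))

def f_div_lambda0(derivs_at_0, theta1, theta2, F):
    """I_f = sum_i f^(i)(0)/i! * I_{1-i, i}(theta1 : theta2)."""
    return sum(d / math.factorial(i) * I_binding(1 - i, i, theta1, theta2, F)
               for i, d in enumerate(derivs_at_0))

# Pearson generator f(u) = (u - 1)^2: f(0) = 1, f'(0) = -2, f''(0) = 2, higher orders 0.
lam1, lam2 = 0.6, 0.3
t1, t2 = math.log(lam1), math.log(lam2)
series = f_div_lambda0([1.0, -2.0, 2.0], t1, t2, math.exp)

# Direct evaluation of I_f = sum_x x1(x) f(x2(x)/x1(x)) over a truncated support.
pmf = lambda x, lam: lam ** x * math.exp(-lam) / math.factorial(x)
direct = sum(pmf(x, lam1) * (pmf(x, lam2) / pmf(x, lam1) - 1.0) ** 2 for x in range(60))

print(series, direct)   # both equal chi^2_P(X1 : X2), ~0.1618 for these parameters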
Corollary: Approximating f-divergences by χ² divergences

Corollary
A second-order Taylor expansion yields

I_f(X_1 : X_2) \approx f(1) + f'(1) \, \chi^1_N(X_1 : X_2) + \frac{1}{2} f''(1) \, \chi^2_N(X_1 : X_2).

Since f(1) = 0 and \chi^1_N(X_1 : X_2) = 0, it follows that

I_f(X_1 : X_2) \approx \frac{f''(1)}{2} \, \chi^2_N(X_1 : X_2)

(f''(1) > 0 follows from the strict convexity of the generator).
When f(u) = u \log u, this yields the well-known approximation [2]:

\chi^2_P(X_1 : X_2) \approx 2\,\mathrm{KL}(X_1 : X_2).
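A quick numeric illustration (a sketch reusing the Poisson closed forms from the earlier slides): half of the Pearson χ² approaches the KL divergence as the two Poisson rates get closer.

import math

F = math.exp   # Poisson log-normalizer

def chi2_pearson(theta1, theta2):
    return math.exp(F(2 * theta2 - theta1) - (2 * F(theta2) - F(theta1))) - 1.0

def kl(theta1, theta2):
    return F(theta2) - F(theta1) - (theta2 - theta1) * F(theta1)   # B_F(theta2 : theta1)

for lam1, lam2 in [(1.0, 2.0), (1.0, 1.2), (1.0, 1.05)]:
    t1, t2 = math.log(lam1), math.log(lam2)
    print(lam1, lam2, 0.5 * chi2_pearson(t1, t2), kl(t1, t2))
    # the last two columns agree better and better as lam2 -> lam1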
Kullback-Leibler divergence: Analytic expression
The Kullback-Leibler divergence has generator f(u) = -\log u, so

f^{(i)}(u) = (-1)^{i} (i-1)! \, u^{-i}

and hence \frac{f^{(i)}(1)}{i!} = \frac{(-1)^{i}}{i} for i \geq 1 (with f(1) = 0).
Since \chi^{1}_{1,P} = 0, it follows that:

\mathrm{KL}(X_1 : X_2) = \sum_{j=2}^{\infty} \frac{(-1)^{j}}{j} \, \chi^{j}_P(X_1 : X_2)

→ an alternating sign sequence.

Poisson distributions with λ₁ = 0.6 and λ₂ = 0.3: KL ≈ 0.1158 (exact, using the Bregman divergence); a stochastic evaluation with n = 10⁶ yields KL ≈ 0.1156.
KL divergence from Taylor truncation: 0.0809 (s = 2), 0.0910 (s = 3), 0.1017 (s = 4), 0.1135 (s = 10), 0.1150 (s = 15), etc.
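These truncation values can be reproduced with a short script (a sketch combining the Poisson closed form of χᵏ_P with the series coefficients (−1)^j / j; not part of the deck's Java package):

import math

def chi_k_poisson(k, lam1, lam2):
    """Closed-form signed chi^k_P between Poi(lam1) and Poi(lam2)."""
    return sum((-1) ** (k - j) * math.comb(k, j)
               * math.exp(lam1 ** (1 - j) * lam2 ** j - ((1 - j) * lam1 + j * lam2))
               for j in range(k + 1))

def kl_truncated(lam1, lam2, s):
    """Partial sum of KL(X1 : X2) = sum_{j>=2} (-1)^j / j * chi^j_P(X1 : X2)."""
    return sum((-1) ** j / j * chi_k_poisson(j, lam1, lam2) for j in range(2, s + 1))

lam1, lam2 = 0.6, 0.3
exact = lam2 - lam1 + lam1 * math.log(lam1 / lam2)   # exact KL, ~0.1158
for s in (2, 3, 4, 10, 15):
    print(s, kl_truncated(lam1, lam2, s))            # 0.0809, 0.0910, 0.1017, ...
print("exact:", exact)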
Contributions
Statistical f-divergences between members of the same exponential family with an affine natural parameter space:

◮ Generic closed-form formulas for the Pearson/Neyman χ² and Vajda χᵏ-type distances.
◮ Analytic expression of f-divergences using Pearson-Vajda-type distances.
◮ Second-order Taylor approximation for fast estimation of f-divergences.

Java™ package: www.informationgeometry.org/fDivergence/
Thank you.

@article{fDivChi-arXiv1309.3029,
author="Frank Nielsen and Richard Nock",
title="On the {C}hi square and higher-order {C}hi distances for approximating $f$-divergences",
year="2013",
eprint="arXiv/1309.3029"
}

www.informationgeometry.org

Bibliographic references I

[1] N. S. Barnett, P. Cerone, S. S. Dragomir, and A. Sofo. Approximating Csiszár f-divergence by the use of Taylor's formula with integral remainder. Mathematical Inequalities & Applications, 5(3):417–434, 2002.

[2] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.

[3] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.
