On the Chi square and higher-order Chi distances for approximating f-divergences

Frank Nielsen¹   Richard Nock²
www.informationgeometry.org
¹ Sony Computer Science Laboratories, Inc.
² UAG-CEREGMIA

September 2013

© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.
Statistical divergences
A statistical divergence measures the separability between two distributions.

Examples: the Pearson/Neyman χ² and the Kullback-Leibler divergence:

\chi^2_P(X_1 : X_2) = \int \frac{(x_2(x) - x_1(x))^2}{x_1(x)} \, d\nu(x),

\chi^2_N(X_1 : X_2) = \int \frac{(x_1(x) - x_2(x))^2}{x_2(x)} \, d\nu(x),

\mathrm{KL}(X_1 : X_2) = \int x_1(x) \log \frac{x_1(x)}{x_2(x)} \, d\nu(x).
f-divergences: A generic definition

I_f(X_1 : X_2) = \int x_1(x) \, f\!\left(\frac{x_2(x)}{x_1(x)}\right) d\nu(x) \geq 0,

where f is a convex function f : (0, \infty) \subseteq \mathrm{dom}(f) \to [0, \infty] such that f(1) = 0.

Jensen's inequality: I_f(X_1 : X_2) \geq f\!\left(\int x_2(x) \, d\nu(x)\right) = f(1) = 0.

One may further require f'(1) = 0 and fix the scale of the divergence by setting f''(1) = 1.

Any f-divergence can be symmetrized:

S_f(X_1 : X_2) = I_f(X_1 : X_2) + I_{f^*}(X_1 : X_2)

with f^*(u) = u f(1/u), so that I_{f^*}(X_1 : X_2) = I_f(X_2 : X_1).
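To make the definition concrete, here is a minimal sketch (not from the slides) that evaluates I_f for finite discrete distributions; the distributions p, q and the two generators are illustrative choices.

import math

def f_divergence(x1, x2, f):
    """I_f(X1 : X2) = sum_x x1(x) * f(x2(x) / x1(x)) over a finite common support.

    x1, x2: sequences of probabilities (assumed strictly positive).
    f: convex generator with f(1) = 0.
    """
    return sum(p1 * f(p2 / p1) for p1, p2 in zip(x1, x2))

# Illustrative generators taken from the table of examples below.
kl_gen      = lambda u: -math.log(u)     # Kullback-Leibler:  I_f = KL(x1 : x2)
pearson_gen = lambda u: (u - 1.0) ** 2   # Pearson chi^2:     I_f = chi^2_P(x1 : x2)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(f_divergence(p, q, kl_gen))        # KL(p : q)
print(f_divergence(p, q, pearson_gen))   # chi^2_P(p : q)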
f-divergences: Some examples

Name of the f-divergence   | Formula I_f(P : Q)                                                      | Generator f(u) with f(1) = 0
Total variation (metric)   | ½ ∫ |p(x) − q(x)| dν(x)                                                 | ½ |u − 1|
Squared Hellinger          | ∫ (√p(x) − √q(x))² dν(x)                                                | (√u − 1)²
Pearson χ²_P               | ∫ (q(x) − p(x))² / p(x) dν(x)                                           | (u − 1)²
Neyman χ²_N                | ∫ (p(x) − q(x))² / q(x) dν(x)                                           | (1 − u)² / u
Pearson-Vajda χᵏ_P         | ∫ (q(x) − λp(x))ᵏ / p^{k−1}(x) dν(x)                                    | (u − 1)ᵏ
Pearson-Vajda |χ|ᵏ_P       | ∫ |q(x) − λp(x)|ᵏ / p^{k−1}(x) dν(x)                                    | |u − 1|ᵏ
Kullback-Leibler           | ∫ p(x) log(p(x)/q(x)) dν(x)                                             | − log u
reverse Kullback-Leibler   | ∫ q(x) log(q(x)/p(x)) dν(x)                                             | u log u
α-divergence               | 4/(1 − α²) (1 − ∫ p^{(1−α)/2}(x) q^{(1+α)/2}(x) dν(x))                  | 4/(1 − α²) (1 − u^{(1+α)/2})
Jensen-Shannon             | ½ ∫ (p(x) log(2p(x)/(p(x)+q(x))) + q(x) log(2q(x)/(p(x)+q(x)))) dν(x)   | −(u + 1) log((1+u)/2) + u log u
Stochastic approximations of f-divergences

\hat{I}_f^{(n)}(X_1 : X_2) = \frac{1}{2n} \sum_{i=1}^{n} \left[ f\!\left(\frac{x_2(s_i)}{x_1(s_i)}\right) + \frac{x_1(t_i)}{x_2(t_i)} \, f\!\left(\frac{x_2(t_i)}{x_1(t_i)}\right) \right],

with s_1, ..., s_n and t_1, ..., t_n drawn i.i.d. from X_1 and X_2, respectively.

\lim_{n \to \infty} \hat{I}_f^{(n)}(X_1 : X_2) = I_f(X_1 : X_2)

◮ Works for any generator f, but...
◮ in practice it is limited to small-dimensional supports.
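One possible implementation of this estimator (a sketch, not the authors' Java package; the univariate Gaussian densities and samplers below are illustrative choices):

import math
import random

def mc_f_divergence(f, pdf1, pdf2, sample1, sample2, n=100_000):
    """Monte Carlo estimate of I_f(X1 : X2) from n samples of each distribution."""
    total = 0.0
    for _ in range(n):
        s, t = sample1(), sample2()
        r_s = pdf2(s) / pdf1(s)           # density ratio at a sample from X1
        r_t = pdf2(t) / pdf1(t)           # density ratio at a sample from X2
        total += f(r_s) + f(r_t) / r_t    # the 1/r_t factor re-weights the X2 sample
    return total / (2 * n)

# Illustration: KL between two unit-variance Gaussians, generator f(u) = -log(u).
mu1, mu2 = 0.0, 1.0
gauss_pdf = lambda mu: (lambda x: math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi))
estimate = mc_f_divergence(lambda u: -math.log(u),
                           gauss_pdf(mu1), gauss_pdf(mu2),
                           lambda: random.gauss(mu1, 1.0),
                           lambda: random.gauss(mu2, 1.0))
print(estimate)   # close to the exact KL = (mu1 - mu2)**2 / 2 = 0.5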
Exponential families
Canonical decomposition of the probability measure:
p_\theta(x) = \exp(\langle t(x), \theta \rangle - F(\theta) + k(x)).

Here, consider an affine natural parameter space \Theta.

\mathrm{Poi}(\lambda): \; p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad \lambda > 0, \; x \in \{0, 1, \ldots\}

\mathrm{Nor}_I(\mu): \; p(x \mid \mu) = (2\pi)^{-d/2} e^{-\frac{1}{2}(x - \mu)^\top (x - \mu)}, \quad \mu \in \mathbb{R}^d, \; x \in \mathbb{R}^d

Family         | θ     | Θ    | F(θ)     | k(x)                       | t(x) | ν
Poisson        | log λ | ℝ    | e^θ      | −log x!                    | x    | ν_c (counting)
Iso. Gaussian  | μ     | ℝ^d  | ½ θ^⊤θ   | −(d/2) log(2π) − ½ x^⊤x    | x    | ν_L (Lebesgue)
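As a quick sanity check on the Poisson row (a sketch, not from the slides), the canonical form exp(⟨t(x), θ⟩ − F(θ) + k(x)) can be compared numerically against the standard pmf:

import math

def poisson_pmf(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_canonical(x, lam):
    theta = math.log(lam)                # natural parameter theta = log(lambda)
    F = math.exp(theta)                  # log-normalizer F(theta) = e^theta
    k = -math.log(math.factorial(x))     # carrier term k(x) = -log x!
    return math.exp(x * theta - F + k)   # t(x) = x, so <t(x), theta> = x * theta

for x in range(6):
    assert abs(poisson_pmf(x, 2.5) - poisson_canonical(x, 2.5)) < 1e-12
print("canonical decomposition matches the Poisson pmf")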
χ² for affine exponential families

Bypass the integral computation; closed-form formulas:

\chi^2_P(X_1 : X_2) = e^{F(2\theta_2 - \theta_1) - (2F(\theta_2) - F(\theta_1))} - 1,

\chi^2_N(X_1 : X_2) = e^{F(2\theta_1 - \theta_2) - (2F(\theta_1) - F(\theta_2))} - 1.

The Kullback-Leibler divergence amounts to a Bregman divergence [3]:

\mathrm{KL}(X_1 : X_2) = B_F(\theta_2 : \theta_1),
\qquad B_F(\theta : \theta') = F(\theta) - F(\theta') - (\theta - \theta')^\top \nabla F(\theta').
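A minimal sketch of these closed forms for the Poisson family (F(θ) = e^θ, see the table above); the direct summation over a truncated support is only there to cross-check the Pearson formula and is not needed in practice:

import math

F = math.exp   # Poisson log-normalizer F(theta) = e^theta; its gradient is also exp

def chi2_pearson(theta1, theta2):
    return math.exp(F(2 * theta2 - theta1) - (2 * F(theta2) - F(theta1))) - 1.0

def chi2_neyman(theta1, theta2):
    return math.exp(F(2 * theta1 - theta2) - (2 * F(theta1) - F(theta2))) - 1.0

def kl_bregman(theta1, theta2):
    # KL(X1 : X2) = B_F(theta2 : theta1)
    return F(theta2) - F(theta1) - (theta2 - theta1) * F(theta1)

lam1, lam2 = 0.6, 0.3
t1, t2 = math.log(lam1), math.log(lam2)

pmf = lambda x, lam: lam ** x * math.exp(-lam) / math.factorial(x)
direct = sum((pmf(x, lam2) - pmf(x, lam1)) ** 2 / pmf(x, lam1) for x in range(60))

print(chi2_pearson(t1, t2), direct)   # both ~0.1618: closed form vs direct summation
print(chi2_neyman(t1, t2))
print(kl_bregman(t1, t2))             # ~0.1158, matching the exact KL quoted later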
Higher-order Vajda χᵏ divergences

\chi^k_P(X_1 : X_2) = \int \frac{(x_2(x) - x_1(x))^k}{x_1(x)^{k-1}} \, d\nu(x),

|\chi|^k_P(X_1 : X_2) = \int \frac{|x_2(x) - x_1(x)|^k}{x_1(x)^{k-1}} \, d\nu(x)

are f-divergences for the generators (u − 1)^k and |u − 1|^k, respectively.

◮ When k = 1, χ¹_P(X_1 : X_2) = ∫ (x_1(x) − x_2(x)) dν(x) = 0 (never discriminative), and |χ|¹_P(X_1 : X_2) is twice the total variation distance.
◮ χ⁰_P is the unit constant.
◮ χᵏ_P is a signed distance.
Higher-order Vajda χᵏ divergences

Lemma
The (signed) χᵏ_P distance between members X_1 ∼ EF(θ_1) and X_2 ∼ EF(θ_2) of the same affine exponential family is always bounded and equal to (for k ∈ ℕ):

\chi^k_P(X_1 : X_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} \frac{e^{F((1-j)\theta_1 + j\theta_2)}}{e^{(1-j)F(\theta_1) + jF(\theta_2)}}.
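A sketch of this lemma as code (assuming, as before, that the family is specified by its log-normalizer F); it is instantiated with the Poisson family and cross-checked against direct summation:

import math

def vajda_chi_k(k, theta1, theta2, F):
    """Signed chi^k_P distance between EF(theta1) and EF(theta2), log-normalizer F."""
    return sum((-1) ** (k - j) * math.comb(k, j)
               * math.exp(F((1 - j) * theta1 + j * theta2)
                          - ((1 - j) * F(theta1) + j * F(theta2)))
               for j in range(k + 1))

# Illustration with the Poisson family: F(theta) = e^theta, theta = log(lambda).
lam1, lam2 = 0.6, 0.3
t1, t2 = math.log(lam1), math.log(lam2)
pmf = lambda x, lam: lam ** x * math.exp(-lam) / math.factorial(x)

for k in (2, 3, 4):
    closed = vajda_chi_k(k, t1, t2, math.exp)
    direct = sum((pmf(x, lam2) - pmf(x, lam1)) ** k / pmf(x, lam1) ** (k - 1)
                 for x in range(80))
    print(k, closed, direct)   # each pair of values should agree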
Higher-order Vajda χᵏ divergences
For Poisson and isotropic Normal distributions, we get closed-form formulas:

\chi^k_P(\lambda_1 : \lambda_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} e^{\lambda_1^{1-j} \lambda_2^{j} - ((1-j)\lambda_1 + j\lambda_2)},

\chi^k_P(\mu_1 : \mu_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} e^{\frac{1}{2} j (j-1) (\mu_1 - \mu_2)^\top (\mu_1 - \mu_2)},

both signed distances.
f-divergences from Taylor series

Lemma (extends Theorem 1 of [1])
When bounded, the f-divergence I_f can be expressed as a power series of higher-order chi-type distances:

I_f(X_1 : X_2) = \int x_1(x) \sum_{i=0}^{\infty} \frac{1}{i!} f^{(i)}(\lambda) \left( \frac{x_2(x)}{x_1(x)} - \lambda \right)^{i} d\nu(x)
              = \sum_{i=0}^{\infty} \frac{1}{i!} f^{(i)}(\lambda) \, \chi^{i}_{\lambda,P}(X_1 : X_2),

provided I_f < ∞, where \chi^{i}_{\lambda,P}(X_1 : X_2) generalizes \chi^{i}_P:

\chi^{i}_{\lambda,P}(X_1 : X_2) = \int \frac{(x_2(x) - \lambda x_1(x))^{i}}{x_1(x)^{i-1}} \, d\nu(x),

with \chi^{0}_{\lambda,P}(X_1 : X_2) = 1 by convention. Note that \chi^{k}_{\lambda,P}(X_1 : X_2) - (1 - \lambda)^k is an f-divergence for the generator f(u) = (u - \lambda)^k - (1 - \lambda)^k (which satisfies f(1) = 0), so that \chi^{k}_{\lambda,P} \geq (1 - \lambda)^k whenever this generator is convex.
f-divergences: Analytic formula

◮ λ = 1 ∈ int(dom(f^{(i)})): for f-divergences (Theorem 1 of [1]),

\left| I_f(X_1 : X_2) - \sum_{k=0}^{s} \frac{f^{(k)}(1)}{k!} \, \chi^{k}_P(X_1 : X_2) \right| \leq \frac{1}{(s+1)!} \, \|f^{(s+1)}\|_{\infty} \, (M - m)^{s},

where \|f^{(s+1)}\|_{\infty} = \sup_{t \in [m, M]} |f^{(s+1)}(t)| and m \leq \frac{p}{q} \leq M.

◮ λ = 0 (whenever 0 ∈ int(dom(f^{(i)}))) and affine exponential families: simpler expression

I_f(X_1 : X_2) = \sum_{i=0}^{\infty} \frac{f^{(i)}(0)}{i!} \, I_{1-i,i}(\theta_1 : \theta_2),
\qquad I_{1-i,i}(\theta_1 : \theta_2) = \frac{e^{F(i\theta_2 + (1-i)\theta_1)}}{e^{iF(\theta_2) + (1-i)F(\theta_1)}}.
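A sketch of the λ = 0 expansion (using, for illustration, the Poisson log-normalizer F(θ) = e^θ and the Pearson generator f(u) = (u − 1)², whose derivatives at 0 are finite and vanish beyond order 2, so the series terminates and should reproduce χ²_P exactly):

import math

def I_binding(a, b, theta1, theta2, F):
    """I_{a,b}(theta1 : theta2) = exp(F(a*theta1 + b*theta2) - (a*F(theta1) + b*F(theta2)))."""
    return math.exp(F(a * theta1 + b * theta2) - (a * F(theta1) + b * F(theta2)))

def f_div_lambda0(derivs_at_0, theta1, theta2, F):
    """I_f = sum_i f^(i)(0)/i! * I_{1-i, i}(theta1 : theta2)."""
    return sum(d / math.factorial(i) * I_binding(1 - i, i, theta1, theta2, F)
               for i, d in enumerate(derivs_at_0))

# Pearson generator f(u) = (u - 1)^2: f(0) = 1, f'(0) = -2, f''(0) = 2, higher orders 0.
lam1, lam2 = 0.6, 0.3
t1, t2 = math.log(lam1), math.log(lam2)
series = f_div_lambda0([1.0, -2.0, 2.0], t1, t2, math.exp)

# Direct evaluation of I_f = sum_x x1(x) f(x2(x)/x1(x)) over a truncated support.
pmf = lambda x, lam: lam ** x * math.exp(-lam) / math.factorial(x)
direct = sum(pmf(x, lam1) * (pmf(x, lam2) / pmf(x, lam1) - 1.0) ** 2 for x in range(60))

print(series, direct)   # both equal chi^2_P(X1 : X2), ~0.1618 for these parameters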
Corollary: Approximating f-divergences by χ² divergences

Corollary
A second-order Taylor expansion yields

I_f(X_1 : X_2) \approx f(1) + f'(1) \, \chi^1_N(X_1 : X_2) + \frac{1}{2} f''(1) \, \chi^2_N(X_1 : X_2).

Since f(1) = 0 and \chi^1_N(X_1 : X_2) = 0, it follows that

I_f(X_1 : X_2) \approx \frac{f''(1)}{2} \, \chi^2_N(X_1 : X_2)

(f''(1) > 0 follows from the strict convexity of the generator).
When f(u) = u \log u, this yields the well-known approximation [2]:

\chi^2_P(X_1 : X_2) \approx 2\,\mathrm{KL}(X_1 : X_2).
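A quick numeric illustration (a sketch reusing the Poisson closed forms from the earlier slides): half of the Pearson χ² approaches the KL divergence as the two Poisson rates get closer.

import math

F = math.exp   # Poisson log-normalizer

def chi2_pearson(theta1, theta2):
    return math.exp(F(2 * theta2 - theta1) - (2 * F(theta2) - F(theta1))) - 1.0

def kl(theta1, theta2):
    return F(theta2) - F(theta1) - (theta2 - theta1) * F(theta1)   # B_F(theta2 : theta1)

for lam1, lam2 in [(1.0, 2.0), (1.0, 1.2), (1.0, 1.05)]:
    t1, t2 = math.log(lam1), math.log(lam2)
    print(lam1, lam2, 0.5 * chi2_pearson(t1, t2), kl(t1, t2))
    # the last two columns agree better and better as lam2 -> lam1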
Kullback-Leibler divergence: Analytic expression
The Kullback-Leibler divergence has generator f(u) = -\log u, so

f^{(i)}(u) = (-1)^{i} (i-1)! \, u^{-i}

and hence \frac{f^{(i)}(1)}{i!} = \frac{(-1)^{i}}{i} for i \geq 1 (with f(1) = 0).
Since \chi^{1}_{1,P} = 0, it follows that:

\mathrm{KL}(X_1 : X_2) = \sum_{j=2}^{\infty} \frac{(-1)^{j}}{j} \, \chi^{j}_P(X_1 : X_2)

→ an alternating sign sequence.

Poisson distributions with λ₁ = 0.6 and λ₂ = 0.3: KL ≈ 0.1158 (exact, using the Bregman divergence); a stochastic evaluation with n = 10⁶ yields KL ≈ 0.1156.
KL divergence from Taylor truncation: 0.0809 (s = 2), 0.0910 (s = 3), 0.1017 (s = 4), 0.1135 (s = 10), 0.1150 (s = 15), etc.
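These truncation values can be reproduced with a short script (a sketch combining the Poisson closed form of χᵏ_P with the series coefficients (−1)^j / j; not part of the deck's Java package):

import math

def chi_k_poisson(k, lam1, lam2):
    """Closed-form signed chi^k_P between Poi(lam1) and Poi(lam2)."""
    return sum((-1) ** (k - j) * math.comb(k, j)
               * math.exp(lam1 ** (1 - j) * lam2 ** j - ((1 - j) * lam1 + j * lam2))
               for j in range(k + 1))

def kl_truncated(lam1, lam2, s):
    """Partial sum of KL(X1 : X2) = sum_{j>=2} (-1)^j / j * chi^j_P(X1 : X2)."""
    return sum((-1) ** j / j * chi_k_poisson(j, lam1, lam2) for j in range(2, s + 1))

lam1, lam2 = 0.6, 0.3
exact = lam2 - lam1 + lam1 * math.log(lam1 / lam2)   # exact KL, ~0.1158
for s in (2, 3, 4, 10, 15):
    print(s, kl_truncated(lam1, lam2, s))            # 0.0809, 0.0910, 0.1017, ...
print("exact:", exact)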
Contributions
Statistical f-divergences between members of the same exponential family with an affine natural parameter space:

◮ Generic closed-form formulas for the Pearson/Neyman χ² and Vajda χᵏ-type distances.
◮ Analytic expression of f-divergences using Pearson-Vajda-type distances.
◮ Second-order Taylor approximation for fast estimation of f-divergences.

Java™ package: www.informationgeometry.org/fDivergence/
Thank you.

@article{fDivChi-arXiv1309.3029,
author="Frank Nielsen and Richard Nock",
title="On the {C}hi square and higher-order {C}hi distances for approximating $f$-divergences",
year="2013",
eprint="arXiv/1309.3029"
}

www.informationgeometry.org

Bibliographic references I

[1] N. S. Barnett, P. Cerone, S. S. Dragomir, and A. Sofo. Approximating Csiszár f-divergence by the use of Taylor's formula with integral remainder. Mathematical Inequalities & Applications, 5(3):417–434, 2002.

[2] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.

[3] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.
