Article

Entropy and Divergence Associated with Power Function and the Statistical Application

Shinto Eguchi * and Shogo Kato
The Institute of Statistical Mathematics, Tachikawa, Tokyo 190-8562, Japan
* Author to whom correspondence should be addressed.
Entropy 2010, 12(2), 262-274; https://doi.org/10.3390/e12020262
Submission received: 29 December 2009 / Revised: 20 February 2010 / Accepted: 23 February 2010 / Published: 25 February 2010
(This article belongs to the Special Issue Distance in Information and Statistical Physics Volume 2)

Abstract

In statistical physics, the Boltzmann-Shannon entropy provides a good understanding of the equilibrium states of a number of phenomena. In statistics, the entropy corresponds to the maximum likelihood method, in which the Kullback-Leibler divergence connects the Boltzmann-Shannon entropy with the expected log-likelihood function. The maximum likelihood estimation is supported by its optimal performance, which is, however, easily broken down in the presence of even a small degree of model uncertainty. To deal with this problem, a new statistical method, closely related to Tsallis entropy, is proposed and shown to be robust against outliers, and we discuss a local learning property associated with the method.

1. Introduction

Consider a practical situation in which a data set $\{x_1, \ldots, x_n\}$ is randomly sampled from a probability density function of a statistical model $\{f_\theta(x) : \theta \in \Theta\}$, where $\theta$ is a parameter vector and $\Theta$ is the parameter space. A fundamental tool for the estimation of the unknown parameter $\theta$ is the log-likelihood function defined by
$\ell(\theta) = \frac{1}{n}\sum_{i=1}^{n}\log f_\theta(x_i)$
which is commonly employed by statistical researchers, both frequentist and Bayesian. The maximum likelihood estimator (MLE) is defined by
$\hat{\theta} = \operatorname{argmax}_{\theta \in \Theta}\,\ell(\theta)$
The Fisher information matrix for θ is defined by
$I_\theta = \int f_\theta(x)\,\frac{\partial}{\partial\theta}\log f_\theta(x)\,\frac{\partial}{\partial\theta^T}\log f_\theta(x)\, dx$
where $\theta^T$ denotes the transpose of $\theta$. As the sample size $n$ tends to infinity, the variance matrix of $\sqrt{n}(\hat{\theta} - \theta)$ converges to $I_\theta^{-1}$. This inverse matrix gives the exact lower bound in the class of asymptotically consistent estimators in the sense of the matrix inequality
$\mathrm{AV}_\theta(\tilde{\theta}) \geq I_\theta^{-1}$
for any asymptotically consistent estimator $\tilde{\theta}$ of $\theta$, where $\mathrm{AV}_\theta$ denotes the limiting variance matrix under the distribution with density $f_\theta(x)$.
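As a purely illustrative aside (not part of the original article), the following Python sketch checks the bound (4) by simulation for the exponential model $f_\theta(x) = \theta e^{-\theta x}$, for which the MLE is $\hat{\theta} = 1/\bar{x}$ and $I_\theta^{-1} = \theta^2$; the model, sample sizes and random seed are our own choices.

import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0
n, n_rep = 2000, 2000

estimates = np.empty(n_rep)
for r in range(n_rep):
    x = rng.exponential(scale=1.0 / theta_true, size=n)
    estimates[r] = 1.0 / x.mean()        # MLE of theta in the exponential model

# empirical variance of sqrt(n)*(theta_hat - theta) versus the bound I_theta^{-1} = theta^2
print(n * estimates.var(), theta_true ** 2)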
On the other hand, the Boltzmann-Shannon entropy
$H_0(p) = -\int p(x)\log p(x)\, dx$
plays a fundamental role in various fields, such as statistical physics, information science and so forth. It is directly related to the MLE. Let us consider an underlying distribution with density function $p(x)$. The cross entropy is defined by
$C_0(p, f_\theta) = -\int p(x)\log f_\theta(x)\, dx$
We note that $C_0(p, f_\theta) = E_p\{-\ell(\theta)\}$, where $E_p$ denotes the expectation with respect to $p(x)$. Hence, the maximum likelihood principle is equivalent to the minimum cross entropy principle. The Kullback-Leibler (KL) divergence is defined by
$D_0(p, q) = \int p(x)\log\frac{p(x)}{q(x)}\, dx$
which gives a kind of information distance between $p$ and $q$. Note that $D_0(p, f_\theta) = C_0(p, f_\theta) - C_0(p, p)$.
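For concreteness, here is a small numerical check, added by us, of the relation $D_0(p, f_\theta) = C_0(p, f_\theta) - C_0(p, p)$ with two arbitrary univariate normal densities standing in for $p$ and $f_\theta$; the grid and Riemann-sum integration are only a convenient approximation.

import numpy as np
from scipy.stats import norm

x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]
p = norm.pdf(x, loc=0.0, scale=1.0)      # plays the role of the true density p
q = norm.pdf(x, loc=1.0, scale=2.0)      # plays the role of the model density f_theta

C_pq = -(p * np.log(q)).sum() * dx       # cross entropy C_0(p, q)
C_pp = -(p * np.log(p)).sum() * dx       # Boltzmann-Shannon entropy H_0(p)
D_pq = (p * np.log(p / q)).sum() * dx    # Kullback-Leibler divergence D_0(p, q)

print(D_pq, C_pq - C_pp)                 # the two values agree up to grid error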
An exponential (type) distribution model is defined by the density form
$f_\theta(x) = \exp\{\theta^T t(x) - \psi(\theta)\}$
where $\psi(\theta)$ is the cumulant transform defined by $\log\int\exp\{\theta^T t(x)\}\, dx$. Under the assumption of this family, the MLE has a number of convenient properties such as minimal sufficiency, unbiasedness and efficiency [1]. In particular, the MLE of the expectation parameter $\eta = E_\theta\{t(X)\}$ is explicitly given by
$\hat{\eta}_0 = \frac{1}{n}\sum_{i=1}^{n} t(x_i)$
which is associated with a dualistic relation between the canonical parameter $\theta$ and the expectation parameter $\eta$ [2,3]. Thus, the MLE enjoys these excellent properties, which are associated with the logarithmic and exponential functions as in (2) and (8).
The MLE has been widely employed in statistics, and its properties are supported by theoretical discussion, for example, as in [4]. However, the MLE has some inappropriate properties when the underlying distribution does not belong to the model $\{f_\theta(x) : \theta \in \Theta\}$. A statistical model is just a simulation of the true distribution, as Fisher pointed out in [1]. The model, which is used only as a working model, is wrong in most practical cases. In such situations, the MLE does not perform properly because of model uncertainty. In this paper we explore an alternative estimation method to the MLE.

2. Power Divergence

The logarithmic transform for observed values is widely employed in data analysis. On the other hand, a power transformation defined by
$t_\beta(x) = \frac{x^\beta - 1}{\beta}$
often provides more flexibility for obtaining a good approximation to the normal distribution [5]. In analogy with this transform, the power cross entropy is defined by
$C_\beta(p, q) = -\int p(x)\frac{q(x)^\beta - 1}{\beta}\, dx + \int\frac{q(x)^{\beta+1}}{\beta+1}\, dx$
where $\beta$ is a positive parameter. Thus, it is defined through the power transform of the density. If we take the limit of $\beta$ to 0, then $C_\beta(p, q)$ converges to $C_0(p, q)$, which is given in (6). The power parameter $\beta$ is not fixed, so different values of $\beta$ give different behaviors of the power entropy. The diagonal power entropy is defined by
$H_\beta(p) = \int\frac{(\beta+1)\,p(x) - p(x)^{\beta+1}}{\beta(\beta+1)}\, dx$
which is given by taking the diagonal of $C_\beta$. Actually, this is equivalent to Tsallis $q$-entropy via the relation $\beta = q - 1$.
Let $\{x_1, \ldots, x_n\}$ be a random sample from an unknown density function $p(x)$. Then we define the empirical mean power likelihood by
$\ell_\beta(\theta) = \frac{1}{n}\sum_{i=1}^{n}\frac{f_\theta(x_i)^\beta - 1}{\beta} - \kappa_\beta(\theta)$
where $\kappa_\beta(\theta) = \int f_\theta(x)^{\beta+1}\, dx/(\beta+1)$. See [6,7,8,9] for statistical applications. Accordingly, the minus expectation of $\ell_\beta(\theta)$ is equal to $C_\beta(p, f_\theta)$. In general, the relation between the cross and diagonal entropies leads to the inequality $C_\beta(p, q) \geq C_\beta(p, p)$, from which we define the power divergence by
$D_\beta(p, q) = C_\beta(p, q) - C_\beta(p, p)$
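The following sketch (ours, not the authors') evaluates $C_\beta$ and $D_\beta$ on a grid for two normal densities and illustrates that $D_\beta(p, q)$ approaches the KL divergence as $\beta \to 0$; the densities and the grid are arbitrary choices.

import numpy as np
from scipy.stats import norm

x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]
p = norm.pdf(x, 0.0, 1.0)
q = norm.pdf(x, 1.0, 2.0)

def C_beta(p, q, beta):
    # power cross entropy C_beta(p, q) by Riemann sum
    return (-(p * (q**beta - 1.0) / beta).sum() * dx
            + (q**(beta + 1.0) / (beta + 1.0)).sum() * dx)

def D_beta(p, q, beta):
    # power divergence D_beta(p, q) = C_beta(p, q) - C_beta(p, p)
    return C_beta(p, q, beta) - C_beta(p, p, beta)

kl = (p * np.log(p / q)).sum() * dx
for beta in (1.0, 0.5, 0.1, 0.01, 0.001):
    print(beta, D_beta(p, q, beta), "KL:", kl)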
We extend the power entropy and divergence to be defined over the space $\mathcal{M}$ of all nonnegative density functions, which are not assumed to have total mass one. In particular, this extension is useful for proposing boosting methods [10,11,12,13,14,15,16].
This derivation can be extended by a generator function $U$. Assume that $U(t)$ is strictly increasing and convex. The Fenchel duality argument leads to a conjugate convex function of $U(t)$ defined by
$U^*(s) = \max_{t\in\mathbb{R}}\{st - U(t)\}$
which reduces to $U^*(s) = s\,\xi(s) - U(\xi(s))$, where $\xi(s)$ is the inverse function of the derivative $\dot{U}$ of $U$. Then, the $U$-cross entropy is defined by
$C_U(\mu, \nu) = -\int\mu(x)\,\xi(\nu(x))\, dx + \int U(\xi(\nu(x)))\, dx$
Similarly U-divergence is defined by
$D_U(\mu, \nu) = \int\{U^*(\mu(x)) + U(\xi(\nu(x))) - \mu(x)\,\xi(\nu(x))\}\, dx$
We note that $D_U(\mu, \nu) = C_U(\mu, \nu) - C_U(\mu, \mu)$. By the definition of $U^*$ in (10) we see that the integrand on the right-hand side of (11) is always nonnegative. The power divergence is one example of $U$-divergence, obtained by fixing
$U_\beta(t) = \frac{1}{\beta+1}(1 + \beta t)^{\frac{\beta+1}{\beta}}$
The power divergence can be defined on $\mathcal{M}$ as
$D_\beta(\mu, \nu) = \int\left\{\mu(x)\frac{\mu(x)^\beta - \nu(x)^\beta}{\beta} + \frac{\nu(x)^{\beta+1} - \mu(x)^{\beta+1}}{\beta+1}\right\} dx$
for $\mu$ and $\nu$ of $\mathcal{M}$ [17]. Since $U_\beta(t)$ is strictly increasing and convex, the integrand on the right-hand side of (12) is nonnegative.
To explore this, it might seem sufficient to restrict the domain of definition of $D_\beta$ to the space $\mathcal{P}$ of probability densities. However, we observe that this restriction is not useful for statistical considerations. We instead discuss the restriction to the projective space as follows. Fix two functions $\mu$ and $\nu$ in $\mathcal{M}$. We say that $\mu$ and $\nu$ are projectively equivalent if there exists a positive scalar $\lambda$ such that
$\nu(x) = \lambda\,\mu(x) \quad (\text{a.e. } x)$
in which case we write $\nu \sim \mu$. Similarly, we call a divergence $D$ defined on $\mathcal{M}$ projectively invariant if, for all $\lambda > 0$, $\kappa > 0$,
$D(\lambda\mu, \kappa\nu) = D(\mu, \nu)$
We can derive a variant of power divergence as
$\Delta_\beta(\mu, \nu) = \frac{1}{\beta(\beta+1)}\log\int\mu(x)^{\beta+1}\, dx - \frac{1}{\beta}\log\int\mu(x)\,\nu(x)^\beta\, dx + \frac{1}{\beta+1}\log\int\nu(x)^{\beta+1}\, dx$
See Appendix 1 for the derivation. We immediately observe that $\Delta_\beta$ satisfies (13), that is, projective invariance. Hereafter, we call $\Delta_\beta$ the projective power divergence. In particular, for $p(x) = \mu(x)/\int\mu(x)\, dx$ and $q(x) = \nu(x)/\int\nu(x)\, dx$, it holds that
$\Delta_\beta(p, q) = \Delta_\beta(\mu, \nu)$
If we take a specific value of β, then
$\Delta_{\beta=1}(\mu, \nu) = \frac{1}{2}\log\frac{\int\mu(x)^2\, dx\int\nu(x)^2\, dx}{\left(\int\mu(x)\,\nu(x)\, dx\right)^2}$
and
$\lim_{\beta\to 0}\Delta_\beta(\mu, \nu) = D_0\!\left(\frac{\mu}{\int\mu(x)\, dx}, \frac{\nu}{\int\nu(x)\, dx}\right)$
where $D_0$ is nothing but the KL divergence (7). We observe that the projective power divergence satisfies information additivity. In fact, if $p$ and $q$ factor as $p(x_1, x_2) = p_1(x_1)p_2(x_2)$ and $q(x_1, x_2) = q_1(x_1)q_2(x_2)$, respectively, then
$\Delta_\beta(p, q) = \Delta_\beta(p_1, q_1) + \Delta_\beta(p_2, q_2)$
which expresses information additivity. We note that this property is not satisfied by the original power divergence $D_\beta$. Furthermore, $\Delta_\beta$ is associated with the Pythagorean identity as follows.
Proposition 1
Assume that there exist three different points $p$, $q$ and $r$ in $\mathcal{M}$ satisfying
$\Delta_\beta(p, r) = \Delta_\beta(p, q) + \Delta_\beta(q, r)$
Define a path $\{p_t\}_{0\le t\le 1}$ connecting $p$ with $q$ and a path $\{r_s\}_{0\le s\le 1}$ connecting $r$ with $q$ as
$p_t(x) = (1-t)\,p(x) + t\,q(x), \qquad \{r_s(x)\}^\beta = (1-s)\{r(x)\}^\beta + s\{q(x)\}^\beta$
Then
$\Delta_\beta(p_t, r_s) = \Delta_\beta(p_t, q) + \Delta_\beta(q, r_s)$
holds for all $t$ $(0 < t < 1)$ and all $s$ $(0 < s < 1)$.
The proof is given in Appendix 2. This Pythagorean-type identity is also satisfied by $D_\beta$ [16].
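Before turning to estimation, here is a short numerical illustration (added by us) of two properties of $\Delta_\beta$ stated above: projective invariance under rescaling of either argument, and the KL limit as $\beta \to 0$; the densities, scales and grid below are arbitrary.

import numpy as np
from scipy.stats import norm

x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]
mu = norm.pdf(x, 0.0, 1.0)
nu = norm.pdf(x, 1.0, 2.0)

def delta_beta(mu, nu, beta):
    # projective power divergence Delta_beta(mu, nu) for unnormalized densities
    return (np.log((mu**(beta + 1)).sum() * dx) / (beta * (beta + 1))
            - np.log((mu * nu**beta).sum() * dx) / beta
            + np.log((nu**(beta + 1)).sum() * dx) / (beta + 1))

beta = 0.5
print(delta_beta(mu, nu, beta))
print(delta_beta(3.0 * mu, 0.2 * nu, beta))     # unchanged: projective invariance

kl = (mu * np.log(mu / nu)).sum() * dx          # mu and nu already integrate to one here
print(delta_beta(mu, nu, 1e-4), "KL:", kl)      # small beta recovers the KL divergence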

3. Minimum Power Divergence Method

In this section we introduce a statistical method defined by minimization of the projective power divergence discussed in the previous section. By the definition of $\Delta_\beta$, the projective power cross entropy is given by
$\Gamma_\beta(\mu, \nu) = -\frac{1}{\beta}\log\int\mu(x)\,\nu(x)^\beta\, dx + \frac{1}{\beta+1}\log\int\nu(x)^{\beta+1}\, dx$
so that, for $\nu = f_\theta$, the second term equals $c_\beta(\theta) = (\beta+1)^{-1}\log\{\int f_\theta(x)^{\beta+1}\, dx\}$. We see that $\Delta_\beta(\mu, \nu) = \Gamma_\beta(\mu, \nu) - \Gamma_\beta(\mu, \mu)$. Hence, this decomposition leads to the empirical analogue, based on a given data set $\{x_1, \ldots, x_n\}$,
$L_\beta(\theta) = \frac{1}{\beta}\log\left\{\frac{1}{n}\sum_{i=1}^{n}f_\theta(x_i)^\beta\right\} - c_\beta(\theta)$
which we call the mean power likelihood with index $\beta$. The minus expectation of $L_\beta(\theta)$ with respect to the unknown density function $p(x)$ equals $\Gamma_\beta(p, f_\theta)$, and $L_\beta(\theta)$ converges to $\ell(\theta)$ in the limit of $\beta$ to 0. Assume that $\{x_1, \ldots, x_n\}$ is a random sample exactly from $f_{\theta_0}(x)$. Then the strong law of large numbers yields that
$L_\beta(\theta) \longrightarrow -\Gamma_\beta(f_{\theta_0}, f_\theta)$
as $n$ increases to infinity. From the property of the projective power divergence it follows that $\Gamma_\beta(f_{\theta_0}, f_\theta) \geq \Gamma_\beta(f_{\theta_0}, f_{\theta_0})$, which implies that $\theta_0 = \operatorname{argmin}_{\theta\in\Theta}\Gamma_\beta(f_{\theta_0}, f_\theta)$. Consequently, we conclude that the estimator $\hat{\theta}_\beta = \operatorname{argmax}_{\theta\in\Theta}L_\beta(\theta)$ converges to $\theta_0$ almost surely. The proof is similar to that for the MLE in Wald [18]. In general, any minimum divergence estimator is strongly consistent in this asymptotic sense.
The estimator $\hat{\theta}_\beta$ is associated with the estimating function
$s_\beta(x, \theta) = f_\theta(x)^\beta\left\{s(x, \theta) - \frac{\partial}{\partial\theta}c_\beta(\theta)\right\}$
where $s(x, \theta)$ is the score vector $(\partial/\partial\theta)\log f_\theta(x)$. We observe that the estimating function is unbiased in the sense that $E_\theta\{s_\beta(x, \theta)\} = 0$. This is because
$E_\theta\{s_\beta(x, \theta)\} = \int f_\theta(x)^{\beta+1}s(x, \theta)\, dx - \int f_\theta(x)^{\beta+1}\, dx\,\frac{\partial}{\partial\theta}c_\beta(\theta) = 0$
Thus the estimating equation is given by
$S_\beta(\theta) = \frac{1}{n}\sum_{i=1}^{n}s_\beta(x_i, \theta) = 0$
We see that the gradient vector of $L_\beta(\theta)$ is proportional to $S_\beta(\theta)$:
$\frac{\partial}{\partial\theta}L_\beta(\theta) = \left(\frac{1}{n}\sum_{i=1}^{n}f_\theta(x_i)^\beta\right)^{-1}S_\beta(\theta)$
Hence, the estimating function (17) exactly leads to the estimator $\hat{\theta}_\beta$.
Accordingly, we obtain the following asymptotic normality
$\sqrt{n}(\hat{\theta}_\beta - \theta) \xrightarrow{\;D\;} N(0, \mathrm{AV}_\beta(\theta))$
where $\xrightarrow{\;D\;}$ denotes convergence in law, and $N(\mu, V)$ denotes a normal distribution with mean vector $\mu$ and variance matrix $V$. Here, the limiting variance matrix is
$\mathrm{AV}_\beta(\theta) = E\!\left[\frac{\partial}{\partial\theta}s_\beta(x, \theta)\right]^{-T}\mathrm{var}\big(s_\beta(x, \theta)\big)\,E\!\left[\frac{\partial}{\partial\theta}s_\beta(x, \theta)\right]^{-1}$
The inequality (4) implies $\mathrm{AV}_\beta(\theta) \geq I_\theta^{-1}$ for any $\beta$, where $I_\theta$ denotes the Fisher information matrix defined in (3); hence the estimator $\hat{\theta}_\beta$ is not asymptotically efficient for $\beta > 0$. In fact, $\hat{\theta}_\beta$ becomes efficient only when $\beta = 0$, in which case it reduces to the MLE. Hence, there is no optimal estimator other than the MLE in the class $\{\hat{\theta}_\beta\}_{\beta\geq 0}$ as far as asymptotic efficiency is concerned.
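As an illustration (our own sketch, not code from the paper), the mean power likelihood (16) can be maximized numerically for a univariate normal model $N(m, s^2)$, for which $c_\beta(\theta)$ has the closed form used below; on clean data the resulting $\hat{\theta}_\beta$ is close to the MLE, reflecting the mild loss of efficiency for small $\beta > 0$.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=500)

def neg_L_beta(params, x, beta):
    m, log_s = params
    s2 = np.exp(2.0 * log_s)
    logf = -0.5 * np.log(2 * np.pi * s2) - (x - m) ** 2 / (2 * s2)
    # c_beta(theta) = (beta+1)^{-1} log int f_theta^{beta+1} dx for N(m, s^2)
    c_beta = (-0.5 * beta * np.log(2 * np.pi * s2)
              - 0.5 * np.log(beta + 1.0)) / (beta + 1.0)
    L = np.log(np.mean(np.exp(beta * logf))) / beta - c_beta
    return -L

for beta in (0.1, 0.5):
    res = minimize(neg_L_beta, x0=[0.0, 0.0], args=(x, beta), method="Nelder-Mead")
    print("beta =", beta, "-> (m, s) =", res.x[0], np.exp(res.x[1]))
print("MLE          -> (m, s) =", x.mean(), x.std())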

3.1. Super Robustness

We would like to investigate the influence of outliers on the estimator $\hat{\theta}_\beta$. We consider outliers in a probabilistic manner: an observation $x_o$ is called an outlier if $f_\theta(x_o)$ is very small. Let us look carefully at the estimating equation (17). We observe that the larger the value of $\beta$ is, the smaller $s_\beta(x_o, \theta)$ becomes for any outlier $x_o$. The estimator $\hat{\theta}_\beta$ is solved as
$\hat{\theta}_\beta = \operatorname{argsolve}_{\theta\in\Theta}\left\{\sum_{i=1}^{n}s_\beta(x_i, \theta) = 0\right\}$
which implies that, for a sufficiently large $\beta$, the estimating equation receives little impact from outliers contaminating the data set, because the terms $f_\theta(x_o)^\beta$ corresponding to outliers are negligibly small. In this sense, $\hat{\theta}_\beta$ is robust for such $\beta$ [19]. From an empirical viewpoint, it is usually sufficient to fix $\beta$ at around 0.1. In the case where $f_\theta(x)$ is absolutely continuous on $\mathbb{R}^p$, we see that $\lim_{|x|\to\infty}|s_\beta(x, \theta)| = 0$, which is in sharp contrast with the optimal robust method (cf. [20]). Consider an $\epsilon$-contamination model
$f_{\theta,\epsilon}(x) = (1-\epsilon)f_\theta(x) + \epsilon\,\delta(x)$
In this context, $\delta(x)$ is the density for outliers, which departs from the assumed density $f_\theta(x)$ to a large degree. It seems reasonable to suppose that $\int f_\theta(x)\delta(x)\, dx \approx 0$. Thus, if the true density function $p(x)$ equals $f_{\theta,\epsilon}(x)$, then $\hat{\theta}_\beta$ becomes a consistent estimator of $\theta$ for all $\epsilon$, $0 \leq \epsilon < 1$. In this sense we say that $\hat{\theta}_\beta$ satisfies super robustness. On the other hand, the mean power likelihood function $\ell_\beta(\theta)$ given in (9) is associated with the estimating function
$f_\theta(x)^\beta s(x, \theta) - \frac{\partial}{\partial\theta}\kappa_\beta(\theta)$
which is unbiased, but the corresponding estimator does not satisfy such super robustness.
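A toy experiment of our own design illustrates the point: for the normal mean with known unit variance, the estimating equation reduces to the weighted-mean fixed point $\mu = \sum_i w(x_i)^\beta x_i / \sum_i w(x_i)^\beta$ with $w(x) = \exp\{-\frac{1}{2}(x-\mu)^2\}$ (that is, equation (19) below with $V$ fixed at the identity); gross outliers receive essentially zero weight, whereas the MLE, the sample mean, is pulled towards them.

import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = np.where(rng.random(n) < 0.9,
             rng.normal(0.0, 1.0, n),       # target component, N(0, 1)
             rng.normal(20.0, 1.0, n))      # outlier component far from the model

def power_mean(x, beta, n_iter=100):
    mu = np.median(x)                       # any rough starting value
    for _ in range(n_iter):
        w = np.exp(-0.5 * beta * (x - mu) ** 2)
        mu = np.sum(w * x) / np.sum(w)
    return mu

print("MLE (sample mean):  ", x.mean())            # heavily biased towards 20
print("beta = 0.5 estimate:", power_mean(x, 0.5))  # stays near 0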
Let us consider a multivariate normal model $N(\mu, V)$ with mean vector $\mu$ and variance matrix $V$, for which the minimum projective power divergence method based on (16) is applicable to the estimation of $\mu$ and $V$ as follows:
$(\hat{\mu}_\beta, \hat{V}_\beta) = \operatorname{argmax}_{(\mu, V)\in\mathbb{R}^p\times\mathcal{S}}L_\beta(\mu, V)$
where $\mathcal{S}$ denotes the space of all symmetric, positive definite matrices.
Noting the projective invariance, we obtain
$L_\beta(\mu, V) = \frac{1}{\beta}\log\left[\frac{1}{n}\sum_{i=1}^{n}\exp\left\{-\frac{\beta}{2}(x_i-\mu)^T V^{-1}(x_i-\mu)\right\}\right] - \frac{1}{2(\beta+1)}\log\det(V)$
from which the estimating equation gives the weighted mean and variance as
$\mu = \frac{\sum_{i=1}^{n}w(x_i, \mu, V)^\beta x_i}{\sum_{i=1}^{n}w(x_i, \mu, V)^\beta},$
$V = (\beta+1)\,\frac{\sum_{i=1}^{n}w(x_i, \mu, V)^\beta (x_i-\mu)(x_i-\mu)^T}{\sum_{i=1}^{n}w(x_i, \mu, V)^\beta}$
where $w(x, \mu, V)$ is the weight function defined by $\exp\{-\frac{1}{2}(x-\mu)^T V^{-1}(x-\mu)\}$. Although no explicit solution is available, a natural iteration algorithm can be proposed in which the left-hand sides of (19) and (20), say $(\mu_{t+1}, V_{t+1})$, are both updated by plugging $(\mu_t, V_t)$ into the right-hand sides of (19) and (20). Obviously, for the estimator $(\hat{\mu}_\beta, \hat{V}_\beta)$ with $\beta = 0$, that is, the MLE, no iteration is needed: the sample mean vector and sample variance matrix give the exact solution.
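A sketch implementation of this iteration is given below. The updates follow (19) and (20); the initialization, convergence tolerance and the contaminated test data are our own choices, and a careful implementation would add safeguards (e.g., for nearly singular V) that are omitted here.

import numpy as np

def power_normal_fit(X, beta, mu0=None, V0=None, n_iter=200, tol=1e-8):
    # iteratively reweighted updates (19)-(20) for the multivariate normal model
    mu = X.mean(axis=0) if mu0 is None else np.asarray(mu0, dtype=float)
    V = np.cov(X, rowvar=False) if V0 is None else np.asarray(V0, dtype=float)
    for _ in range(n_iter):
        diff = X - mu
        mahal = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(V), diff)
        w = np.exp(-0.5 * beta * mahal)                                  # w(x_i, mu_t, V_t)^beta
        mu_new = w @ X / w.sum()                                         # right-hand side of (19)
        V_new = (beta + 1.0) * (diff * w[:, None]).T @ diff / w.sum()    # right-hand side of (20)
        done = np.abs(mu_new - mu).max() < tol and np.abs(V_new - V).max() < tol
        mu, V = mu_new, V_new
        if done:
            break
    return mu, V

# usage: 5% of the points are outliers near (8, 8); they barely affect the estimate
rng = np.random.default_rng(3)
X = np.vstack([rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], 950),
               rng.multivariate_normal([8, 8], np.eye(2), 50)])
print(power_normal_fit(X, beta=0.3))
print(X.mean(axis=0), np.cov(X, rowvar=False))           # MLE, visibly distorted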

3.2. Local Learning

We now discuss a statistical idea beyond robustness. Since the expression (16) is inconvenient for investigating the behavior of the expected mean power likelihood function, we focus on
$I_\beta(\theta) = \frac{1}{\beta}\left\{\int f_\theta(x)^\beta p(x)\, dx - 1\right\}$
as a core term, where $p(x)$ is the true density function, that is, the distribution generating the data set. Here we consider a $K$-mixture model, whereas $p(x)$ was modeled as the $\epsilon$-contaminated density function $f_{\theta,\epsilon}(x)$ in the previous section. Thus, $p(x)$ is written in terms of $K$ different density functions $p_k(x)$ as follows:
$p(x) = \pi_1 p_1(x) + \cdots + \pi_K p_K(x)$
where $\pi_k$ denotes the mixing ratio. We note that there is redundancy in this modeling unless the $p_k(x)$'s are specified; in fact, the case in which $\pi_1 = 1$ and $p_1(x)$ is arbitrary imposes no restriction on $p(x)$. Nevertheless, we discuss $I_\beta(\theta)$ on this redundant model and find that
$I_\beta(\mu, V) = \frac{1}{\beta}\left[\sum_{k=1}^{K}\pi_k\{(2\pi)^p\det(V)\}^{-\frac{\beta}{2}}\int\exp\left\{-\frac{\beta}{2}(x-\mu)^T V^{-1}(x-\mu)\right\}p_k(x)\, dx - 1\right]$
We confirm
$I_0(\mu, V) = -\frac{1}{2}\left\{\sum_{k=1}^{K}\pi_k\int(x-\mu)^T V^{-1}(x-\mu)\,p_k(x)\, dx + \log\det(V)\right\}$
by taking the limit of $\beta$ to 0, up to an additive constant not depending on $(\mu, V)$. It is noted that $I_0(\mu, V)$ has a global maximizer $(\hat{\mu}, \hat{V})$ given by the pair of the mean vector and variance matrix of $p(x)$, since we can write
$I_0(\mu, V) = -\frac{1}{2}\left\{(\mu-\hat{\mu})^T V^{-1}(\mu-\hat{\mu}) + \mathrm{trace}(\hat{V}V^{-1}) + \log\det(V)\right\}$
This suggests a limitation of the maximum likelihood method: the MLE cannot give any solution other than $N(\hat{\mu}, \hat{V})$ even if the true density function in (21) is taken arbitrarily. On the other hand, if $\beta$ becomes larger, then the graph of $I_\beta(\mu, V)$ changes flexibly in accordance with $p(x)$ in (21). For example, we assume
$p(x) = \pi_1 g(x, \mu_1, V_1) + \cdots + \pi_K g(x, \mu_K, V_K)$
where $g(x, \mu_k, V_k)$ is the normal density function of $N(\mu_k, V_k)$. Then,
$I_\beta(\mu, V) = \frac{1}{\beta}\left[\sum_{k=1}^{K}\pi_k\,\beta^{-\frac{p}{2}}\{(2\pi)^p\det(V)\}^{\frac{1-\beta}{2}}\int g(x, \mu, \beta^{-1}V)\,g(x, \mu_k, V_k)\, dx - 1\right]$
Here, we see a formula
$\int g(x, \mu, V)\,g(x, \mu_*, V_*)\, dx = g(\mu, \mu_*, V + V_*)$
as shown in Appendix 3, from which we get that
$I_\beta(\mu, V) = \frac{1}{\beta}\left[\beta^{-\frac{p}{2}}\{(2\pi)^p\det(V)\}^{\frac{1-\beta}{2}}\sum_{k=1}^{K}\pi_k\,g(\mu, \mu_k, \beta^{-1}V + V_k) - 1\right]$
In particular, when β = 1 ,
$I_1(\mu, V) = \sum_{k=1}^{K}\pi_k\,g(\mu, \mu_k, V + V_k) - 1$
which implies that $I_1(\mu, O) = p(\mu) - 1$, where $O$ is the zero matrix and $p(\cdot)$ is defined as in (23). If the normal mixture model has $K$ modes, then $I_1(\mu, V)$ has the same $K$ modes for sufficiently small $\det V$. Therefore, $I_\beta(\mu, V)$ with a large $\beta$ adaptively behaves according to the true density function. This suggests that the minimum projective power divergence method can remedy the weak point of the MLE when the true density function involves a large degree of model uncertainty. For example, such an adaptive selection of $\beta$ is discussed for principal component analysis (PCA), which enables us to provide exploratory analysis beyond the conventional PCA.
Consider the problem of extracting principal components when the data distribution has a multimodal density function as described in (21). We then wish to search for all the sets of principal vectors of $V_k$ for $k = 1, \ldots, K$. The minimum projective power divergence method can provide a PCA that searches for the principal vectors of $V_k$ at the centers $\mu_k$ separately for $k = 1, \ldots, K$. First we determine a starting point, say $(\mu_{(1)}, V_{(1)})$, from which we run the iteratively reweighted algorithm (19) and (20) to obtain the first estimator $(\hat{\mu}_{(1)}, \hat{V}_{(1)})$. The estimator $\hat{V}_{(1)}$ then gives the first PCA with center $\hat{\mu}_{(1)}$ by the standard method. Next, we update the second starting point $(\mu_{(2)}, V_{(2)})$ so as to keep away from the first estimator $(\hat{\mu}_{(1)}, \hat{V}_{(1)})$ by a heuristic procedure based on the weight function $w(x, \mu, V)$ (see [22] for a detailed discussion). Starting from $(\mu_{(2)}, V_{(2)})$, the same algorithm (19) and (20) leads to the second estimator $(\hat{\mu}_{(2)}, \hat{V}_{(2)})$, which yields the second PCA with center $\hat{\mu}_{(2)}$. In this way, we can iterate this sequential procedure to explore the multimodal structure with an appropriately determined stopping rule, as sketched below.
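The following heuristic sketch conveys the idea under simplifying assumptions of our own: the reweighted fit is restarted from the observation that received the smallest weight in the previous fit, which is a crude stand-in for the more elaborate restart rule of [22], and the number of components is fixed in advance instead of using a stopping rule.

import numpy as np

def power_normal_fit(X, beta, mu0, V0, n_iter=200):
    # reweighted updates (19)-(20); also returns the weights from the last iteration
    mu, V = np.asarray(mu0, dtype=float), np.asarray(V0, dtype=float)
    for _ in range(n_iter):
        d = X - mu
        mahal = np.einsum("ij,jk,ik->i", d, np.linalg.inv(V), d)
        w = np.exp(-0.5 * beta * mahal)
        mu = w @ X / w.sum()
        V = (beta + 1.0) * (d * w[:, None]).T @ d / w.sum()
    return mu, V, w

rng = np.random.default_rng(4)
X = np.vstack([rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 0.5]], 500),
               rng.multivariate_normal([10, 10], [[0.5, -0.3], [-0.3, 2.0]], 500)])

beta = 0.5
mu, V = X[0], 4.0 * np.eye(2)                 # first starting point
for k in range(2):                            # explore two local structures
    mu, V, w = power_normal_fit(X, beta, mu, V)
    _, evecs = np.linalg.eigh(V)
    print(f"component {k + 1}: center {mu.round(2)}, principal axis {evecs[:, -1].round(2)}")
    mu, V = X[np.argmin(w)], 4.0 * np.eye(2)  # restart far from the fitted structure

Each pass recovers one local center and the leading eigenvector of the corresponding local covariance estimate as the local principal axis.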

4. Concluding Remarks

We have focused on the fact that the optimality of the likelihood method is fragile under model uncertainty. Such weakness frequently appears in practice when the data set comes from an observational study rather than a purely randomized experiment. Nevertheless, the likelihood method is widely regarded as the most useful method in statistics. We note that the minimum projective power divergence method has one degree of freedom in the choice of the index $\beta$, and it reduces to the MLE in the limit of $\beta$ to 0. A data-adaptive selection of $\beta$ is possible by cross validation; however, an appropriate model selection criterion would be desirable for faster computation.
Recently, novel methods for pattern recognition have been proposed within the machine learning paradigm [23,24,25]. These approaches are directly concerned with the true distribution in the framework of probably approximately correct (PAC) learning in computational learning theory. We need to employ this theory for the minimum projective power divergence method. In statistical physics there are remarkable developments on Tsallis entropy in connection with nonequilibrium states, chaotic phenomena, scale-free networks and econophysics. We should explore these developments from the statistical point of view.

Acknowledgements

We thank the anonymous referees for their useful comments and suggestions, in particular on Proposition 1.

Appendix 1

We present the derivation of $\Delta_\beta$ as follows. Consider the minimization over the scalar multiplier $\kappa$:
$\kappa(\mu, \nu) = \operatorname{argmin}_{\kappa>0}D_\beta(\mu, \kappa\nu)$
The gradient is
$\frac{\partial}{\partial\kappa}D_\beta(\mu, \kappa\nu) = -\kappa^{\beta-1}\int\nu(x)^\beta\mu(x)\, dx + \kappa^\beta\int\nu(x)^{\beta+1}\, dx$
which leads to $\kappa(\mu, \nu) = \int\nu(x)^\beta\mu(x)\, dx\,/\int\nu(x)^{\beta+1}\, dx$. Hence
$\min_{\kappa>0}D_\beta(\mu, \kappa\nu) = \frac{1}{\beta(\beta+1)}\left\{\int\mu(x)^{\beta+1}\, dx - \frac{\left(\int\mu(x)\nu(x)^\beta\, dx\right)^{\beta+1}}{\left(\int\nu(x)^{\beta+1}\, dx\right)^\beta}\right\}$
Taking the logarithm of the ratio of the two terms in the braces as
$\Delta_\beta(\mu, \nu) = \frac{1}{\beta(\beta+1)}\log\frac{\int\mu(x)^{\beta+1}\, dx\left(\int\nu(x)^{\beta+1}\, dx\right)^\beta}{\left(\int\mu(x)\nu(x)^\beta\, dx\right)^{\beta+1}} = \frac{1}{\beta(\beta+1)}\log\int\mu(x)^{\beta+1}\, dx - \frac{1}{\beta}\log\int\mu(x)\nu(x)^\beta\, dx + \frac{1}{\beta+1}\log\int\nu(x)^{\beta+1}\, dx$
concludes the derivation of Δ β in (13).
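As a quick numerical sanity check of this derivation (added by us, with arbitrary unnormalized densities), the brute-force minimizer over $\kappa$ and the minimum value agree with the closed forms obtained above.

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]
mu = 1.7 * norm.pdf(x, 0.0, 1.0)             # unnormalized densities in M
nu = 0.4 * norm.pdf(x, 1.0, 2.0)
beta = 0.5

def D_beta(m, v):
    return ((m * (m**beta - v**beta) / beta
             + (v**(beta + 1) - m**(beta + 1)) / (beta + 1)).sum() * dx)

res = minimize_scalar(lambda k: D_beta(mu, k * nu), bounds=(1e-6, 100.0), method="bounded")
kappa_closed = (nu**beta * mu).sum() / (nu**(beta + 1)).sum()
print(res.x, kappa_closed)                   # minimizer matches the closed form
min_closed = ((mu**(beta + 1)).sum() * dx
              - ((mu * nu**beta).sum() * dx) ** (beta + 1)
              / ((nu**(beta + 1)).sum() * dx) ** beta) / (beta * (beta + 1))
print(res.fun, min_closed)                   # and so does the minimum value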

Appendix 2

We give a proof of Proposition 1.
Proof
By definition we get that
$\Delta_\beta(p, r) - \{\Delta_\beta(p, q) + \Delta_\beta(q, r)\} = \frac{1}{\beta}\log\frac{\int p(x)q(x)^\beta\, dx\int q(x)r(x)^\beta\, dx}{\int q(x)^{\beta+1}\, dx\int p(x)r(x)^\beta\, dx}$
which implies
$\frac{\int p(x)q(x)^\beta\, dx\int q(x)r(x)^\beta\, dx}{\int q(x)^{\beta+1}\, dx\int p(x)r(x)^\beta\, dx} = 1$
from (14). Similarly,
$\Delta_\beta(p_t, r_s) - \{\Delta_\beta(p_t, q) + \Delta_\beta(q, r_s)\} = \frac{1}{\beta}\log\frac{\int p_t(x)q(x)^\beta\, dx\int q(x)r_s(x)^\beta\, dx}{\int q(x)^{\beta+1}\, dx\int p_t(x)r_s(x)^\beta\, dx}$
which is written as
$\frac{1}{\beta}\log\frac{(1-t)\dfrac{\int p(x)q(x)^\beta\, dx}{\int q(x)^{\beta+1}\, dx} + t}{(1-t)\dfrac{\int p(x)r_s(x)^\beta\, dx}{\int q(x)r_s(x)^\beta\, dx} + t}$
Furthermore, (26) is rewritten as
$\frac{1}{\beta}\log\frac{(1-t)\dfrac{\int p(x)q(x)^\beta\, dx}{\int q(x)^{\beta+1}\, dx} + t}{(1-t)\dfrac{\int\{(1-s)\,p(x)r(x)^\beta + s\,p(x)q(x)^\beta\}\, dx}{\int\{(1-s)\,q(x)r(x)^\beta + s\,q(x)^{\beta+1}\}\, dx} + t}$
which is
$\frac{1}{\beta}\log\frac{(1-t)\dfrac{\int p(x)q(x)^\beta\, dx}{\int q(x)^{\beta+1}\, dx} + t}{(1-t)\dfrac{\int p(x)r(x)^\beta\, dx}{\int q(x)r(x)^\beta\, dx}\cdot\dfrac{(1-s) + s\,\dfrac{\int p(x)q(x)^\beta\, dx}{\int p(x)r(x)^\beta\, dx}}{(1-s) + s\,\dfrac{\int q(x)^{\beta+1}\, dx}{\int q(x)r(x)^\beta\, dx}} + t}$
From (25) we can write
$\Xi = \frac{\int p(x)q(x)^\beta\, dx}{\int q(x)^{\beta+1}\, dx} = \frac{\int p(x)r(x)^\beta\, dx}{\int q(x)r(x)^\beta\, dx}$
Then, we conclude that
$\Delta_\beta(p_t, r_s) - \{\Delta_\beta(p_t, q) + \Delta_\beta(q, r_s)\} = \frac{1}{\beta}\log\frac{(1-t)\,\Xi + t}{(1-t)\,\Xi\cdot\dfrac{(1-s) + s\,\dfrac{\int p(x)q(x)^\beta\, dx}{\int p(x)r(x)^\beta\, dx}}{(1-s) + s\,\dfrac{\int q(x)^{\beta+1}\, dx}{\int q(x)r(x)^\beta\, dx}} + t}$
which vanishes for any $s$, $0 < s < 1$, and $t$, $0 < t < 1$: indeed, (25) also implies $\int p(x)q(x)^\beta\, dx/\int p(x)r(x)^\beta\, dx = \int q(x)^{\beta+1}\, dx/\int q(x)r(x)^\beta\, dx$, so that the inner fraction equals one and the whole ratio reduces to $\{(1-t)\Xi + t\}/\{(1-t)\Xi + t\} = 1$. This completes the proof.   □

Appendix 3

By writing a $p$-variate normal density function as
$g(x, \mu, V) = \{(2\pi)^p\det(V)\}^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}(x-\mu)^T V^{-1}(x-\mu)\right\}$
we have the formula
$\int g(x, \mu, V)\,g(x, \mu_*, V_*)\, dx = g(\mu, \mu_*, V + V_*)$
The proof of this formula is immediate. In fact, the left-hand side of (27) can be written as
$(2\pi)^{-p}\{\det(V)\det(V_*)\}^{-\frac{1}{2}}\exp\left\{\frac{1}{2}b^T A^{-1}b - \frac{1}{2}\mu^T V^{-1}\mu - \frac{1}{2}\mu_*^T V_*^{-1}\mu_*\right\}\times\int\exp\left\{-\frac{1}{2}(x - A^{-1}b)^T A(x - A^{-1}b)\right\} dx$
where
$A = V^{-1} + V_*^{-1}, \qquad b = V^{-1}\mu + V_*^{-1}\mu_*$
Hence, we get
$\{(2\pi)^p\det(V)\det(V_*)\det(V^{-1} + V_*^{-1})\}^{-\frac{1}{2}}\exp\left\{\frac{1}{2}b^T A^{-1}b - \frac{1}{2}\mu^T V^{-1}\mu - \frac{1}{2}\mu_*^T V_*^{-1}\mu_*\right\}$
Noting that
$\{(2\pi)^p\det(V)\det(V_*)\det(V^{-1} + V_*^{-1})\}^{-\frac{1}{2}} = \{(2\pi)^p\det(V + V_*)\}^{-\frac{1}{2}}$
and
$\exp\left\{\frac{1}{2}b^T A^{-1}b - \frac{1}{2}\mu^T V^{-1}\mu - \frac{1}{2}\mu_*^T V_*^{-1}\mu_*\right\} = \exp\Big\{\frac{1}{2}\mu^T V^{-1}(V^{-1} + V_*^{-1})^{-1}\{I - (V^{-1} + V_*^{-1})V\}V^{-1}\mu + \frac{1}{2}\mu_*^T V_*^{-1}(V^{-1} + V_*^{-1})^{-1}\{I - (V^{-1} + V_*^{-1})V_*\}V_*^{-1}\mu_* + \mu^T V^{-1}(V^{-1} + V_*^{-1})^{-1}V_*^{-1}\mu_*\Big\}$
it is obtained that
$\exp\left\{-\frac{1}{2}(\mu - \mu_*)^T(V + V_*)^{-1}(\mu - \mu_*)\right\}$
because $V^{-1}(V^{-1} + V_*^{-1})^{-1}V_*^{-1} = (V + V_*)^{-1}$. Therefore, (28) and (29) imply (24).   □
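A brief Monte Carlo check of formula (24), added for illustration (the dimensions and parameters below are arbitrary): the left-hand side equals $E_{x\sim N(\mu, V)}[g(x, \mu_*, V_*)]$, which is estimated by simulation and compared with $g(\mu, \mu_*, V + V_*)$.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
mu, V = np.array([1.0, -1.0]), np.array([[2.0, 0.3], [0.3, 1.0]])
mu_s, V_s = np.array([0.5, 2.0]), np.array([[1.0, -0.2], [-0.2, 0.5]])

x = rng.multivariate_normal(mu, V, size=200000)
mc = multivariate_normal(mu_s, V_s).pdf(x).mean()     # Monte Carlo estimate of the integral
exact = multivariate_normal(mu_s, V + V_s).pdf(mu)    # right-hand side of (24)
print(mc, exact)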

References

  1. Fisher, R.A. On the mathematical foundations of theoretical statistics. Philos. Trans. Roy. Soc. London Ser. A 1922, 222, 309–368. [Google Scholar] [CrossRef]
  2. Amari, S. Lecture Notes in Statistics. In Differential-Geometrical Methods in Statistics; Springer-Verlag: New York, NY, USA, 1985; Volume 28. [Google Scholar]
  3. Amari, S.; Nagaoka, H. Translations of Mathematical Monographs. In Methods of Information Geometry; Oxford University Press: Oxford, UK, 2000; Volume 191. [Google Scholar]
  4. Akahira, M.; Takeuchi, K. Lecture Notes in Statistics. In Asymptotic Efficiency of Statistical Estimators: Concepts and Higher Order Asymptotic Efficiency; Springer-Verlag: New York, NY, USA, 1981; Volume 7. [Google Scholar]
  5. Box, G.E.P.; Cox, D.R. An Analysis of Transformations. J. R. Statist. Soc. B 1964, 26, 211–252. [Google Scholar]
  6. Fujisawa, H.; Eguchi, S. Robust estimation in the normal mixture model. J. Stat. Plan Inference 2006, 136, 3989–4011. [Google Scholar] [CrossRef]
  7. Minami, M.; Eguchi, S. Robust blind source separation by beta-divergence. Neural Comput. 2002, 14, 1859–1886. [Google Scholar]
  8. Mollah, N.H.; Minami, M.; Eguchi, S. Exploring latent structure of mixture ICA models by the minimum beta-divergence method. Neural Comput. 2006, 18, 166–190. [Google Scholar] [CrossRef]
  9. Scott, D.W. Parametric statistical modeling by minimum integrated square error. Technometrics 2001, 43, 274–285. [Google Scholar] [CrossRef]
  10. Eguchi, S.; Copas, J.B. A class of logistic type discriminant functions. Biometrika 2002, 89, 1–22. [Google Scholar] [CrossRef]
  11. Kanamori, T.; Takenouchi, T.; Eguchi, S.; Murata, N. Robust loss functions for boosting. Neural Comput. 2007, 19, 2183–2244. [Google Scholar] [CrossRef] [PubMed]
  12. Lebanon, G.; Lafferty, J. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems; 2002; Volume 14, pp. 447–454. MIT Press: New York, NY, USA. [Google Scholar]
  13. Murata, N.; Takenouchi, T.; Kanamori, T.; Eguchi, S. Information geometry of U-Boost and Bregman divergence. Neural Comput. 2004, 16, 1437–1481. [Google Scholar] [CrossRef] [PubMed]
  14. Takenouchi, T.; Eguchi, S. Robustifying AdaBoost by adding the naive error rate. Neural Comput. 2004, 16, 767–787. [Google Scholar] [CrossRef] [PubMed]
  15. Takenouchi, T.; Eguchi, S.; Murata, N.; Kanamori, T. Robust boosting algorithm for multiclass problem by mislabelling model. Neural Comput. 2008, 20, 1596–1630. [Google Scholar] [CrossRef] [PubMed]
  16. Eguchi, S. Information geometry and statistical pattern recognition. Sugaku Expo. 2006, 19, 197–216. [Google Scholar]
  17. Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M.C. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559. [Google Scholar] [CrossRef]
  18. Wald, A. Note on the Consistency of the Maximum Likelihood Estimate. Ann. Math. Statist. 1949, 20, 595–601. [Google Scholar] [CrossRef]
  19. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivariate Anal. 2008, 99, 2053–2081. [Google Scholar] [CrossRef]
  20. Hampel, F.R.; Ronchetti, E.M.; Rousseeuw, P.J.; Stahel, W.A. Robust Statistics: The Approach Based on Influence Functions; Wiley: New York, NY, USA, 2005. [Google Scholar]
  21. Eguchi, S.; Copas, J.B. A class of local likelihood methods and near-parametric asymptotics. J. R. Statist. Soc. B 1998, 60, 709–724. [Google Scholar] [CrossRef]
  22. Mollah, N.H.; Sultana, N.; Minami, M.; Eguchi, S. Robust extraction of local structures by the minimum beta-divergence method. Neural Netw. 2010, 23, 226–238. [Google Scholar] [CrossRef] [PubMed]
  23. Friedman, J.H.; Hastie, T.; Tibshirani, R. Additive logistic regression: A statistical view of boosting. Annals of Statistics 2000, 28, 337–407. [Google Scholar] [CrossRef]
  24. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2001. [Google Scholar]
  25. Schapire, R.E.; Freund, Y.; Bartlett, P.; Lee, W.S. Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 1998, 26, 1651–1686. [Google Scholar] [CrossRef]
