
The Annals of Statistics

2010, Vol. 38, No. 2, 753–783


DOI: 10.1214/09-AOS687
© Institute of Mathematical Statistics, 2010

MAXIMUM Lq-LIKELIHOOD ESTIMATION

BY DAVIDE FERRARI AND YUHONG YANG^1


Università di Modena e Reggio Emilia and University of Minnesota
In this paper, the maximum Lq-likelihood estimator (MLqE), a new
parameter estimator based on nonextensive entropy [Kibernetika 3 (1967)
30–35], is introduced. The properties of the MLqE are studied via asymptotic
analysis and computer simulations. The behavior of the MLqE is characterized
by the degree of distortion q applied to the assumed model. When q is
properly chosen for small and moderate sample sizes, the MLqE can successfully
trade bias for precision, resulting in a substantial reduction of the mean
squared error. When the sample size is large and q tends to 1, a necessary and
sufficient condition to ensure proper asymptotic normality and efficiency of
the MLqE is established.

1. Introduction. One of the major contributions to scientific thought of the
last century is information theory founded by Claude Shannon in the late 1940s.
Its triumph is highlighted by countless applications in various scientific domains
including statistics. The central quantity in information theory is a measure of
the “amount of uncertainty” inherent in a probability distribution (usually called
Shannon’s entropy). Provided a probability density function p(x) for a random
variable X, Shannon’s entropy is defined as H(X) = −E[log p(X)]. The quantity
− log p(x) is interpreted as the information content of the outcome x, and H(X)
represents the average uncertainty removed after the actual outcome of X is re-
vealed. The connection between logarithmic (or additive) entropies and inference
has been copiously studied (see, e.g., Cover and Thomas [9]). Akaike [3] intro-
duced a principle of statistical model building based on minimization of entropy.
In a parametric setting, he pointed out that the usual inferential task of maximizing
the log-likelihood function can be equivalently regarded as minimization of
the empirical version of Shannon's entropy, −Σ_{i=1}^n log p(X_i). Rissanen proposed
the well-known minimum description length criterion for model comparison (see,
e.g., Barron, Rissanen and Yu [5]).
Since the introduction of Shannon’s entropy, other and more general measures
of information have been developed. Rényi [27] and Aczél and Daróczy [2] in
the mid-1960s and 1970s proposed generalized notions of information (usually re-
ferred to as Rényi entropies) by keeping the additivity of independent information,

Received November 2007; revised January 2009.


1 Supported in part by NSF Grant DMS-07-06850.
AMS 2000 subject classifications. Primary 62F99; secondary 60F05, 94A17, 62G32.
Key words and phrases. Maximum Lq-likelihood estimation, nonextensive entropy, asymptotic
efficiency, exponential family, tail probability estimation.

but using a more general definition of mean. In a different direction, Havrda and
Charvát [16] proposed nonextensive entropies, sometimes referred to as q-order
entropies, where the usual definition of mean is maintained while the logarithm is
replaced by the more general function L_q(u) = (u^{1−q} − 1)/(1 − q) for q > 0. In
particular, when q → 1, L_q(u) → log(u), recovering the usual Shannon's entropy.
In recent years, q-order entropies have been of considerable interest in different
domains of application. Tsallis and colleagues have successfully exploited them in
physics (see, e.g., [29] and [30]). In thermodynamics, the q-entropy functional is
usually minimized subject to some properly chosen constraints, according to the
formalism proposed by Jaynes [19] and [20]. There is a large literature on analyz-
ing various loss functions as the convex dual of entropy minimization, subject to
constraints. From this standpoint, the classical maximum entropy estimation and
maximum likelihood are seen as convex duals of each other (see, e.g., Altun and
Smola [4]). Since Tsallis’ seminal paper [29], q-order entropy has encountered
an increasing wave of success and Tsallis’ nonextensive thermodynamics, based
on such information measure, is nowadays considered the most viable candidate
for generalizing the ideas of the famous Boltzmann–Gibbs theory. More recently,
a number of applications based on the q-entropy have appeared in other disciplines
such as finance, biomedical sciences, environmental sciences and linguistics [14].
Despite the broad success, so far little effort has been made to address the infer-
ential implications of using nonextensive entropies from a statistical perspective.
In this paper, we study a new class of parametric estimators based on the q-entropy
function, the maximum Lq-likelihood estimator (MLqE). In our approach, the role
of the observations is modified by slightly changing the model of reference by
means of the distortion parameter q. From this standpoint, Lq-likelihood estima-
tion can be regarded as the minimization of the discrepancy between a distribution
in a family and one that modifies the true distribution to diminish (or emphasize)
the role of extreme observations.
In this framework, we provide theoretical insights concerning the statistical us-
age of the generalized entropy function. In particular, we highlight the role of the
distortion parameter q and give the conditions that guarantee asymptotic efficiency
of the MLqE. Further, the new methodology is shown to be very useful when es-
timating high-dimensional parameters and small tail probabilities. This aspect is
important in many applications where we must deal with the fact that the number
of observations available is not large in relation to the number of parameters or
the probability of occurrence of the event of interest. Standard large sample theory
guarantees that the maximum likelihood estimator (MLE) is asymptotically effi-
cient, meaning that when the sample size is large, the MLE is at least as accurate
as any other estimator. However, for a moderate or small sample size, it turns out
that the MLqE can offer a dramatic improvement in terms of mean squared error
at the expense of a slightly increased bias, as will be seen in our numerical results.
For finite sample performance of MLqE, not only the size of qn − 1 but also its
sign (i.e., the direction of distortion) is important. It turns out that for different fam-
ilies or different parametric functions of the same family, the beneficial direction
of distortion can be different. In addition, for some parameters, MLqE does not
produce any improvement. We have found that an asymptotic variance expression
of the MLqE is very helpful to decide the direction of distortion for applications.
The paper is organized as follows. In Section 2, we examine some information-
theoretical quantities and introduce the MLqE; in Section 3, we present its basic
asymptotic properties for exponential families. In particular, a necessary and suf-
ficient condition on the choice of q in terms of the sample size to ensure a proper
asymptotic normality and efficiency is established. A generalization that goes out
of the exponential family is presented in Section 4. In Section 5, we consider the
plug-in approach for tail probability estimation based on MLqE. The asymptotic
properties of the plug-in estimator are derived and its efficiency is compared to
the traditional MLE. In Section 6, we discuss the choice of the distortion parame-
ter q. In Section 7, we present Monte Carlo simulations and examine the behavior
of MLqE in finite sample situations. In Section 8, concluding remarks are given.
Technical proofs of the theorems are deferred to Appendix A.

2. Generalized entropy and the maximum Lq-likelihood estimator. Consider
a σ-finite measure μ on a measurable space (Ω, F). The Kullback–Leibler
(KL) divergence [21, 22] (or relative entropy) between two density functions g
and f with respect to μ is

(2.1)  D(f‖g) = E_f[log{f(X)/g(X)}] = ∫_Ω f(x) log{f(x)/g(x)} dμ(x).

Note that finding the density g that minimizes D(f‖g) is equivalent to minimizing
Shannon's entropy [28] H(f, g) = −E_f log g(X).

DEFINITION 2.1. Let f and g be two density functions. The q-entropy of g
with respect to f is defined as

(2.2)  H_q(f, g) = −E_f L_q{g(X)},  q > 0,

where L_q(u) = log u if q = 1 and L_q(u) = (u^{1−q} − 1)/(1 − q) otherwise.

The function L_q represents a Box–Cox transformation in statistics and in other
contexts it is often called a deformed logarithm. Note that if q → 1, then L_q(u) →
log(u) and the usual definition of Shannon's entropy is recovered.
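As a small side illustration (not part of the original text), the deformed logarithm can be computed directly from this definition; the function name lq and the test values below are illustrative choices only.

```python
import numpy as np

def lq(u, q):
    """Deformed logarithm L_q(u): log(u) at q = 1, else (u^(1-q) - 1) / (1 - q)."""
    u = np.asarray(u, dtype=float)
    if q == 1.0:
        return np.log(u)
    return (u ** (1.0 - q) - 1.0) / (1.0 - q)

# L_q(u) approaches log(u) as q -> 1
u = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
for q in (0.9, 0.99, 0.999, 1.0):
    print(q, np.max(np.abs(lq(u, q) - np.log(u))))
```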
Let M = {f(x; θ), θ ∈ Θ} be a family of parametrized density functions and
suppose that the true density of the observations, denoted by f(x; θ_0), is a member
of M. Assume further that M is closed under the transformation

(2.3)  f(x; θ)^{(r)} = f(x; θ)^r / ∫_Ω f(x; θ)^r dμ(x),  r > 0.

The transformed density f(x; θ)^{(r)} is often referred to as the zooming or escort distribution
[1, 7, 26] and the parameter r provides a tool to accentuate different regions
of the untransformed true density f (x; θ ). In particular, when r < 1, regions with
density values close to zero are accentuated, while for r > 1, regions with density
values further from zero are emphasized.
Consider the following KL divergence between f(x; θ) and f(x; θ_0)^{(r)}:

(2.4)  D_r(θ_0‖θ) = ∫_Ω f(x; θ_0)^{(r)} log{f(x; θ_0)^{(r)} / f(x; θ)} dμ(x).
Let θ^* be the value such that f(x; θ^*) = f(x; θ_0)^{(r)} and assume that differentiation
can be passed under the integral sign. Then, clearly θ^* minimizes D_r(θ_0‖θ)
over θ. Let θ^{**} be the value such that f(x; θ^{**}) = f(x; θ_0)^{(1/q)}, q > 0. Since we
have ∇_θ H_q(θ_0, θ)|_{θ^{**}} = 0 and ∇_θ^2 H_q(θ_0, θ)|_{θ^{**}} is positive definite, H_q(θ_0, θ) has
a minimum at θ^{**}.
The derivations above show that the minimizer of D_r(θ_0‖θ) over θ is the same as the
minimizer of H_q(θ_0, θ) over θ when q = 1/r. Clearly, by considering the divergence
with respect to a distorted version of the true density we introduce a certain
amount of bias. Nevertheless, the bias can be properly controlled by an adequate
choice of the distortion parameter q, and later we shall discuss the benefits gained
from paying such a price for parameter estimation. The next definition introduces
the estimator based on the empirical version of the q-entropy.

DEFINITION 2.2. Let X_1, …, X_n be an i.i.d. sample from f(x; θ_0), θ_0 ∈ Θ.
The maximum Lq-likelihood estimator (MLqE) of θ_0 is defined as

(2.5)  θ̃_n = arg max_{θ∈Θ} Σ_{i=1}^n L_q[f(X_i; θ)],  q > 0.

When q → 1, if the estimator θ̃_n exists, then it approaches the maximum likelihood
estimate of the parameters, which maximizes Σ_i log f(X_i; θ). In this sense,
the MLqE extends the classic method, resulting in a general inferential procedure
that inherits most of the desirable features of traditional maximum likelihood, and
at the same time can improve over the MLE due to variance reduction, as will be seen.
Define

(2.6)  U(x; θ) = ∇_θ log{f(x; θ)},   U^*(X; θ, q) = U(X; θ) f(X; θ)^{1−q}.

In general, the estimating equations have the form

(2.7)  Σ_{i=1}^n U^*(X_i; θ, q) = 0.
Equation (2.7) offers a natural interpretation of the MLqE as a solution to
a weighted likelihood. When q ≠ 1, (2.7) provides a relative-to-the-model
re-weighting. Observations that disagree with the model receive low or high weight
depending on whether q < 1 or q > 1. In the case q = 1, all the observations receive the
same weight.
The strategy of setting weights that are proportional to a power transforma-
tion of the assumed density has some connections with the methods proposed by
Windham [33], Basu et al. [6] and Choi, Hall and Presnell [8]. In these approaches,
however, the main objective is robust estimation and the weights are set based on
a fixed constant not depending on the sample size.

EXAMPLE 2.1. The simple but illuminating case of an exponential distribution
will be used as a recurrent example in the course of the paper. Consider an
i.i.d. sample of size n from a distribution with density λ_0 exp{−xλ_0}, x > 0 and
λ_0 > 0. In this case, the Lq-likelihood equation is

(2.8)  Σ_{i=1}^n e^{−[X_i λ − log λ](1−q)} (−X_i + 1/λ) = 0.

With q = 1, the usual maximum likelihood estimator is λ̂ = (Σ_i X_i / n)^{−1} = X̄^{−1}.
However, when q ≠ 1, (2.8) can be rewritten as

(2.9)  λ = [Σ_{i=1}^n X_i w_i(X_i, λ, q) / Σ_{i=1}^n w_i(X_i, λ, q)]^{−1},

where w_i := e^{−[X_i λ − log λ](1−q)}. When q < 1, the role played by observations
corresponding to higher density values is accentuated; when q > 1, observations
corresponding to density values close to zero are accentuated.
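A minimal numerical sketch of this example (ours, not from the paper): the weighted form (2.9) can be iterated to solve the Lq-likelihood equation for an exponential sample, starting from the MLE. The function name, tolerance and sample settings are illustrative assumptions.

```python
import numpy as np

def mlqe_exponential(x, q, tol=1e-10, max_iter=500):
    """Solve the Lq-likelihood equation (2.8) for an exponential sample
    by iterating the weighted form (2.9); the MLE 1/mean(x) is the start."""
    lam = 1.0 / np.mean(x)                                  # MLE as starting value
    for _ in range(max_iter):
        w = np.exp(-(x * lam - np.log(lam)) * (1.0 - q))    # w_i in (2.9)
        lam_new = np.sum(w) / np.sum(w * x)                 # inverse of the weighted mean
        if abs(lam_new - lam) < tol:
            return lam_new
        lam = lam_new
    return lam

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=30)     # lambda_0 = 1
print("MLE :", 1.0 / x.mean())
print("MLqE:", mlqe_exponential(x, q=0.9))
```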

3. Asymptotics of the MLqE for exponential families. In this section, we
present the asymptotic properties of the new estimator when the degree of distortion
is chosen according to the sample size. In the remainder of the paper, we
focus on exponential families, although some generalization results are presented
in Section 4. In particular, we consider density functions of the form

(3.1)  f(x; θ) = exp{θ^T b(x) − A(θ)},

where θ ∈ Θ ⊆ R^p is a real-valued natural parameter vector, b(x) is the vector of
functions with elements b_j(x) (j = 1, …, p) and A(θ) = log ∫_Ω e^{θ^T b(x)} dμ(x) is
the cumulant generating function (or log normalizer). For simplicity in presentation,
the family is assumed to be of full rank (but similar results hold for curved
exponential families). The true parameter will be denoted by θ_0.

3.1. Consistency. Consider θ_n^*, the value such that

(3.2)  E_{θ_0} U^*(X; θ_n^*, q_n) = 0.

It can be easily shown that θ_n^* = θ_0/q_n. Since the actual target of θ̃_n is θ_n^*, to retrieve
asymptotic unbiasedness of θ̃_n, q_n must converge to 1. We call θ_n^* the surrogate
parameter of θ_0. We impose the following conditions:

A.1 q_n > 0 is a sequence such that q_n → 1 as n → ∞.
A.2 The parameter space Θ is compact and the parameter θ_0 is an interior point in Θ.

In similar contexts, the compactness condition on Θ is used for technical reasons
(see, e.g., Wang, van Eeden and Zidek [32]), as is the case here.

THEOREM 3.1. Under assumptions A.1 and A.2, with probability going to 1,
the Lq-likelihood equation yields a unique solution θ̃_n that is the maximizer of the
Lq-likelihood function in Θ. Furthermore, we have θ̃_n →^P θ_0.

REMARK. When Θ is compact, the MLqE always exists under our conditions,
although it is not necessarily unique with probability one.

3.2. Asymptotic normality.

THEOREM 3.2. If assumptions A.1 and A.2 hold, then we have

(3.3)  √n V_n^{−1/2}(θ̃_n − θ_n^*) →^D N_p(0, I_p)  as n → ∞,

where I_p is the (p × p) identity matrix, V_n = J_n^{−1} K_n J_n^{−1} and

(3.4)  K_n = E_{θ_0}[U^*(X; θ_n^*, q_n)][U^*(X; θ_n^*, q_n)]^T,
(3.5)  J_n = E_{θ_0}[∇_θ U^*(X; θ_n^*, q_n)].

A necessary and sufficient condition for asymptotic normality of the MLqE around θ_0
is √n(q_n − 1) → 0.

Let m(θ) := ∇_θ A(θ) and D(θ) := ∇_θ^2 A(θ). Note that K_n and J_n can be expressed as

(3.6)  K_n = c_{2,n}{D(θ_{2,n}) + [m(θ_{2,n}) − m(θ_n^*)][m(θ_{2,n}) − m(θ_n^*)]^T}

and

(3.7)  J_n = c_{1,n}(1 − q_n)D(θ_{1,n}) − c_{1,n}D(θ_n^*)
         + c_{1,n}(1 − q_n)[m(θ_{1,n}) − m(θ_n^*)][m(θ_{1,n}) − m(θ_n^*)]^T,

where c_{k,n} = exp{A(θ_{k,n}) − A(θ_0)} and θ_{k,n} = kθ_0(1/q_n − 1) + θ_0. When q_n → 1,
it is seen that V_n → D(θ_0)^{−1}, the asymptotic variance of the MLE. When Θ ⊆ R^1
we use the notation σ_n^2 for the asymptotic variance in place of V_n. Note that
the existence of moments is ensured by the functional form of the exponential
families (e.g., see [23]).

REMARKS. (i) When q is fixed, the MLqE is a regular M-estimator [18],
which converges in probability to θ^* = θ_0/q. (ii) With the explicit expression
of θ_n^*, one may consider correcting the bias of the MLqE by using the estimator q_n θ̃_n.
The numerical results are not promising in this direction under correct model specification.

EXAMPLE 3.1 (Exponential distribution). The surrogate parameter is θ_n^* =
λ_0/q_n and a lengthy but straightforward calculation shows that the asymptotic
variance of the MLqE of λ_0 is

(3.8)  σ_n^2 = (λ_0/q_n)^2 (q_n^2 − 2q_n + 2) / [q_n^3(2 − q_n)^3] → λ_0^2

as n → ∞. By Theorem 3.2, we conclude that n^{1/2} σ_n^{−1}(λ̃_n − λ_0/q_n) converges
weakly to a standard normal distribution as n → ∞. Clearly, the asymptotic calculation
does not produce any advantage of the MLqE in terms of reducing the limiting
variance. However, for an interval of q_n, we have σ_n^2 < λ_0^2 (see Section 6) and,
based on our simulations, an improvement of the accuracy is achieved in finite
sample sizes as long as 0 < q_n − 1 = o(n^{−1/2}), which ensures a proper asymptotic
normality of λ̃_n. For the re-scaled estimator q_n λ̃_n, the expression q_n^2 σ_n^2 is larger
than λ_0^2 unless q_n = 1, which suggests that q_n λ̃_n may be at best no better than λ̃_n.
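A quick numerical reading of (3.8) (our illustration, assuming the formula as written above): the factor σ_n^2/λ_0^2 and its rescaled version q^2 σ_n^2/λ_0^2 can be tabulated over q to see where the MLqE variance drops below λ_0^2 and why rescaling by q_n does not help.

```python
import numpy as np

def asy_var_factor(q):
    """sigma_n^2 / lambda_0^2 from (3.8): (q^2 - 2q + 2) / (q^5 (2 - q)^3)."""
    return (q**2 - 2*q + 2) / (q**5 * (2 - q)**3)

for q in (0.90, 1.00, 1.10, 1.25, 1.40, 1.50):
    f = asy_var_factor(q)
    print(f"q = {q:4.2f}  sigma^2/lambda0^2 = {f:6.3f}  rescaled q^2 * factor = {q**2 * f:6.3f}")
```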

EXAMPLE 3.2 (Multivariate normal distribution). Consider a multivariate
normal family with mean vector μ and covariance matrix Σ. Two convenient
matrix operators in this setting are the vec(·) (vector) and vech(·) (vector-half)
operators. Namely, vec : R^{r×p} → R^{rp} stacks the columns of the argument matrix. For
symmetric matrices, vech : S^{p×p} → R^{p(p+1)/2} stacks only the unique part of each
column that lies on or below the diagonal [25]. Further, for a symmetric matrix M,
define the extension matrix G as vec M = G vech M. Thus, θ_0 = (μ^T, vech(Σ)^T)^T
and under such a parametrization, it is easy to show that the surrogate parameter
solving (3.2) is θ_n^* = (μ^T, q_n vech(Σ)^T)^T, where interestingly the mean component
does not depend on q_n. In fact, for symmetric distributions about the mean, it can
be shown that the distortion imposed on the model affects the spread of the distribution
but leaves the mean unchanged. Consequently, the MLqE is expected to
influence the estimation of Σ without much effect on μ. This will be clearly seen
in our simulation results (see Section 7.3). The calculation in Appendix B shows
that the asymptotic variance of the MLqE of θ_0 is the block-diagonal matrix

(3.9)  V_n = diag( [(2 − q)^{2+p} / (3 − 2q)^{1+p/2}] Σ ,
         [4q^2[(3 − 2q)^2 + 1](2 − q)^{4+p} / ([(2 − q)^2 + 1]^2 (3 − 2q)^{2+p/2})] [G^T(Σ^{−1} ⊗ Σ^{−1})G]^{−1} ),

where ⊗ denotes the Kronecker product.
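For the multivariate normal family, the estimating equations (2.7) reduce to density-weighted mean and covariance updates with weights w_i = f(x_i; μ, Σ)^{1−q}. The following fixed-point sketch is our illustration of this, not code from the paper; q = 1 − 1/n anticipates the choice used in Section 7.3, and all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mlqe_mvnormal(x, q, n_iter=200, tol=1e-9):
    """Fixed-point iteration for the MLqE of (mu, Sigma) under a multivariate
    normal model: weights w_i = f(x_i; mu, Sigma)^(1-q), then density-weighted
    mean and covariance updates derived from the estimating equations (2.7)."""
    n, p = x.shape
    mu = x.mean(axis=0)                            # start at the MLE
    sigma = np.cov(x, rowvar=False, bias=True)
    for _ in range(n_iter):
        w = multivariate_normal.pdf(x, mean=mu, cov=sigma) ** (1.0 - q)
        w /= w.sum()
        mu_new = w @ x
        xc = x - mu_new
        sigma_new = (w[:, None] * xc).T @ xc
        done = (np.max(np.abs(mu_new - mu)) < tol and
                np.max(np.abs(sigma_new - sigma)) < tol)
        mu, sigma = mu_new, sigma_new
        if done:
            break
    return mu, sigma

rng = np.random.default_rng(1)
n, p, rho = 25, 4, 0.5
true_sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
x = rng.multivariate_normal(np.zeros(p), true_sigma, size=n)
mu_q, sigma_q = mlqe_mvnormal(x, q=1.0 - 1.0 / n)
print(mu_q)
print(sigma_q)
```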

4. A generalization. In this section, we relax the restriction to exponential
families and present consistency and asymptotic normality results for the MLqE under
some regularity conditions.

THEOREM 4.1. Let q_n be a sequence such that q_n → 1 as n → ∞ and assume
the following:

B.1 θ_0 is an interior point in Θ.
B.2 E_{θ_0} sup_{θ∈Θ} ‖U(X; θ)‖^2 < ∞ and E_{θ_0} sup_{θ∈Θ}[f(X; θ)^δ − 1]^2 → 0 as δ → 0.
B.3 sup_{θ∈Θ} ‖(1/n) Σ_{i=1}^n U(X_i; θ) − E_{θ_0} U(X; θ)‖ →^p 0 as n → ∞,

where ‖·‖ denotes the ℓ_2-norm. Then, with probability going to 1, the Lq-likelihood
equation yields a unique solution θ̃_n that maximizes the Lq-likelihood.
Furthermore, we have θ̃_n →^P θ_0.

REMARK 4.2. (i) Although for a large n the Lq-likelihood equation has a
unique zero with a high probability, for finite samples there may be roots that are
actually bad estimates. (ii) The uniform convergence in condition B.3 is satisfied if
the set of functions {U(x, θ) : θ ∈ Θ} is Glivenko–Cantelli under the true parameter
θ_0 (see, e.g., [31], Chapter 19.2). In particular, it suffices to require (i) that U(x; θ) is
continuous in θ for every x and dominated by an integrable function and (ii) compactness
of Θ.

For each θ ∈ Θ, define a symmetric p × p matrix I^*(x; θ, q) = ∇_θ U^*(x; θ, q),
where U^* represents the modified score function as in (2.6), and let the matrices
K_n, J_n and V_n be as defined in the previous section.

THEOREM 4.3. Let q_n be a sequence such that q_n → 1 and θ_n^* → θ_0 as n → ∞,
where θ_n^* is the solution of E U^*(X; θ_n^*, q_n) = 0. Suppose U^*(x; θ, q) is twice
differentiable in θ for every x and assume the following:

C.1 max_{1≤k≤p} E_{θ_0}|U_k^*(X, θ_n^*, q_n)|^3 is upper bounded by a constant.
C.2 The smallest eigenvalue of K_n is bounded away from zero.
C.3 E_{θ_0}{I^*(X, θ_n^*, q_n)}_{kl}^2, k, l = 1, …, p, are upper bounded by a constant.
C.4 The second-order partial derivatives of U^*(x, θ, q_n) are dominated by an integrable
function with respect to the true distribution of X for all θ in a neighborhood
of θ_0 and q_n in a neighborhood of 1.

Then,

(4.1)  √n V_n^{−1/2}(θ̃_n − θ_n^*) →^D N_p(0, I)  as n → ∞.

5. Estimation of the tail probability. In this section, we address the problem
of tail probability estimation, using the popular plug-in procedure, where the point
estimate of the unknown parameter is substituted into the parametric function of
interest. We focus on a one-dimensional case, that is, p = 1, and derive the asymp-
totic distribution of the plug-in estimator for the tail probability based on the MLq
method. For an application of the MLqE proposed in this work on financial risk
estimation, see Ferrari and Paterlini [12].
Let α(x; θ ) = Pθ (X ≤ x) or α(x; θ ) = 1 − Pθ (X ≤ x), depending on whether
we are considering the lower tail or the upper tail of the distribution. Without loss
of generality, we focus on the latter from now on, and assume α(x; θ ) > 0 for
all x [of course α(x; θ ) → 0 as x → ∞]. When x is fixed, under some condi-
tions, the familiar delta method shows that an asymptotically normally distributed
and efficient estimator of θ makes the plug-in estimator of α(x; θ ) also asymptot-
ically normal and efficient. However, in most applications a large sample size is
demanded in order for this asymptotic behavior to be accurate for a small tail prob-
ability. As a consequence, the setup with x fixed but n → ∞ presents an overly
optimistic view, as it ignores the possible difficulty due to smallness of the tail
probability in relation to the sample size n. Instead, allowing x to increase in n
(so that the tail probability to be estimated becomes smaller as the sample size
increases) more realistically addresses the problem.

5.1. Asymptotic normality of the plug-in MLq estimator. We are interested in
estimating α(x_n; θ_0), where x_n → ∞ as n → ∞. For θ^* ∈ Θ and δ > 0, define

(5.1)  β(x; θ^*; δ) = sup_{θ ∈ Θ ∩ [θ^* − δ/√n, θ^* + δ/√n]} |α''(x; θ) / α''(x; θ^*)|

and γ(x; θ) = α''(x; θ)/α'(x; θ).

THEOREM 5.1. Let θ_n^* be as in the previous section. Under assumptions A.1
and A.2, if n^{−1/2}|γ(x_n; θ_n^*)| β(x_n; θ_n^*; δ) → 0 for each δ > 0, then

   √n [α(x_n; θ̃_n) − α(x_n; θ_n^*)] / [σ_n α'(x_n; θ_n^*)] →^D N(0, 1),

where σ_n = −[E_{θ_0} U^*(X; θ_n^*, q_n)^2]^{1/2} / E_{θ_0}[∂U^*(X; θ, q_n)/∂θ |_{θ_n^*}].

REMARKS. (i) For the main requirement of the theorem on the order of the
sequence x_n, it is easiest to be verified on a case by case basis. For instance, in the
case of the exponential distribution in (A.4), for x_n > 0,

   β(x_n; λ_n^*; δ) = sup_{λ ∈ λ_n^* ± δ/√n} [x_n^2 e^{−x_n λ}] / [x_n^2 e^{−x_n λ_n^*}]
                    ≤ sup_{λ ∈ λ_n^* ± δ/√n} e^{x_n|λ − λ_n^*|} = e^{δ x_n/√n}.

Moreover, γ(x_n; λ_n^*) = −x_n. So, the condition reads n^{−1/2} x_n e^{δ x_n/√n} → 0, that is,
n^{−1/2} x_n → 0. (ii) The plug-in estimator based on q_n θ̃_n has been examined as well.
With q_n → 1, we did not find any significant advantage.
The condition n^{−1/2}|γ(x_n; θ_n^*)| β(x_n; θ_n^*; δ) → 0, to some degree, describes the
interplay between the sample size n, x_n and q_n for the asymptotic normality to
hold. When x_n → ∞ too fast so as to violate the condition, the asymptotic normality
is not guaranteed, which indicates the extreme difficulty in estimating a
tiny tail probability. In the next section, we will use this framework to compare
the MLqE of the tail probability, α(x_n; θ̃_n), with the one based on the traditional
MLE, α(x_n; θ̂_n).
In many applications, the quantity of interest is the quantile instead of the tail
probability. In our setting, the quantile function is defined as ρ(s; θ) = α^{−1}(s; θ),
0 < s < 1 and θ ∈ Θ. Next, we present the analogue of Theorem 5.1 for the plug-in
estimator of the quantile. Define

(5.2)  β_1(s; θ^*; δ) = sup_{θ ∈ Θ ∩ [θ^* − δ/√n, θ^* + δ/√n]} |ρ''(s; θ) / ρ''(s; θ^*)|,  δ > 0,

and γ_1(s; θ) = ρ''(s; θ)/ρ'(s; θ).

THEOREM 5.2. Let 0 < s_n < 1 be a nonincreasing sequence such that s_n ↓ 0
as n → ∞ and let θ_n^* and q_n be as in Theorem 5.1. Under assumptions A.1
and A.2, for a sequence s_n such that n^{−1/2}|γ_1(s_n; θ_n^*)| β_1(s_n; θ_n^*; δ) → 0 for each
δ > 0, we have

   √n [ρ(s_n; θ̃_n) − ρ(s_n; θ_n^*)] / [σ_n ρ'(s_n; θ_n^*)] →^D N(0, 1).

5.2. Relative efficiency between MLE and MLqE. In Section 3, we showed
that when (q_n − 1)√n → 0, the MLqE is asymptotically as efficient as the MLE.
For tail probability estimation, with x_n → ∞, it is unclear if the MLqE performs
efficiently.
Consider w_n and v_n, two estimators of a parametric function g_n(θ) such that
both √n(w_n − a_n)/σ_n and √n(v_n − b_n)/τ_n converge weakly to a standard normal
distribution as n → ∞ for some deterministic sequences a_n, b_n, σ_n > 0 and τ_n > 0.

DEFINITION 5.1. Define

(5.3)  E(w_n, v_n) := [(b_n − g_n(θ))^2 + τ_n^2/n] / [(a_n − g_n(θ))^2 + σ_n^2/n].

The bias-adjusted asymptotic relative efficiency of w_n with respect to v_n is
lim_{n→∞} E(w_n, v_n), provided that the limit exists.

It can be easily verified that the definition does not depend on the specific choice
of a_n, b_n, σ_n and τ_n among equivalent expressions.

COROLLARY 5.3. Under the conditions of Theorem 5.1, when q_n is chosen
such that

(5.4)  n^{1/2}[α(x_n; θ_n^*) α(x_n; θ_0)^{−1} − 1] → 0  and  α'(x_n; θ_n^*) α'(x_n; θ_0)^{−1} → 1,

then E(α(x_n; θ̂_n), α(x_n; θ̃_n)) = 1.

The result, which follows directly from Theorem 5.1 and Definition 5.1, says
that when q_n is chosen sufficiently close to 1, asymptotically speaking, the MLqE
is as efficient as the MLE.

EXAMPLE 5.1 (Continued). In this case, we have α(x_n; λ) = e^{−λx_n} and
α'(x_n; λ) = −x_n e^{−λx_n}. For sequences x_n and q_n such that x_n/√n → 0 and
(q_n − 1)√n → 0, we have that

(5.5)  √n (e^{−λ̃_n x_n} − e^{−(λ_0/q_n) x_n}) / (λ_0 x_n e^{−(λ_0/q_n) x_n}) →^D N(0, 1).

When q_n = 1 for all n, we recover the usual plug-in estimator based on the MLE. With
the asymptotic expressions given above,

(5.6)  E(α(x_n; λ̂_n), α(x_n; λ̃_n)) = [n/(λ_0^2 x_n^2)] (e^{−x_n(λ_0/q_n − λ_0)} − 1)^2 + e^{−2x_n(λ_0/q_n − λ_0)},

which is greater than 1 when q_n > 1. Thus, no advantage in terms of MSE is
expected by considering q_n > 1 (which introduces bias and enlarges the variance
at the same time).
Although in limits the MLqE is not more efficient than the MLE, the MLqE can be much
better than the MLE due to variance reduction, as will be clearly seen in Section 7. The
following calculation provides a heuristic understanding. Let r_n = 1 − 1/q_n. Add
and subtract 1 in (5.6), obtaining

(5.7)  [n r_n^2 L_{1/q_n}(e^{−x_n λ_0})^2] / (λ_0^2 x_n^2) + r_n L_{1/q_n}(e^{2x_n λ_0}) + 1 < n r_n^2 + 2 r_n x_n λ_0 + 1,

where the last inequality holds as L_{1/q_n}(u) < log(u) for any u > 0 and q < 1.
Next, we impose (5.7) to be smaller than 1 and solve for q_n, obtaining

(5.8)  T_n := (1 + 2λ_0 x_n/n)^{−1} < q_n < 1.

This provides some insights on the choice of the sequence q_n in accordance with the
size of the probability to be estimated. If q_n approaches 1 too quickly from below,
the gain obtained in terms of variance vanishes rapidly as n becomes larger. On the
other hand, if q_n converges to 1 too slowly, the bias dominates the variance and the
MLE outperforms the MLqE. This understanding is confirmed in our simulation
study.
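As a numerical illustration of this example (ours, not from the paper): the relative efficiency (5.6) and the heuristic bound T_n in (5.8) can be evaluated directly; values of (5.6) below 1 favor the MLqE plug-in. The particular threshold x and the midpoint choice of q are illustrative assumptions.

```python
import numpy as np

def rel_eff(n, q, lam0, x):
    """Bias-adjusted relative efficiency in (5.6): MLE plug-in vs MLqE plug-in."""
    d = x * (lam0 / q - lam0)
    return n / (lam0**2 * x**2) * (np.exp(-d) - 1.0)**2 + np.exp(-2.0 * d)

def q_lower_bound(n, lam0, x):
    """Heuristic lower bound T_n in (5.8) for a beneficial q_n < 1."""
    return 1.0 / (1.0 + 2.0 * lam0 * x / n)

lam0, x = 1.0, 5.0          # true rate and tail threshold (alpha = exp(-5) ~ 0.0067)
for n in (25, 50, 100, 500):
    tn = q_lower_bound(n, lam0, x)
    q = 0.5 * (tn + 1.0)    # a q inside (T_n, 1)
    print(f"n = {n:4d}  T_n = {tn:.3f}  q = {q:.3f}  eff (5.6) = {rel_eff(n, q, lam0, x):.3f}")
```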

6. On the choice of q. For the exponential distribution example, we have
observed the following:

1. For estimating the natural parameter, when q_n → 1, the asymptotic variance of
the MLqE is equivalent to that of the MLE in the limit, but can be smaller. For instance, in
the variance expression (3.8) one can easily check that (q^2 − 2q + 2)/[q^5(2 − q)^3] < 1
for 1 < q < 1.40; thus, choosing the distortion parameter in such a
range gives σ_n^2 < λ_0^2.
2. For estimating the tail probability, when qn → 1, the asymptotic variance of
MLqE can be of a smaller order than that of MLE, although there is a bias that
approaches 0. In particular:
(i) MLqE cannot be asymptotically more efficient than MLE.
(ii) MLqE is asymptotically as efficient as MLE when qn is chosen to be
close enough to 1. In the case of tail probability for the exponential distribution,
it suffices to choose qn such that (qn − 1)xn → 0.
3. One approach to choosing q is to minimize an estimated asymptotic mean
squared error of the estimator when it is mathematically tractable. In the case of
the exponential distribution, by Theorem 5.1 we have the following expression
for the asymptotic mean squared error:

(6.1)  MSE(q, λ_0) = (e^{−λ_0 x_n/q} − e^{−λ_0 x_n})^2
              + n^{−1} (λ_0/q)^2 [(q^2 − 2q + 2) / (q^3(2 − q)^3)] x_n^2 e^{−2λ_0 x_n/q}.

   However, since λ_0 is unknown, we consider

(6.2)  q^* = arg min_{q∈(0,1)} {MSE(q, λ̂)},

   where λ̂ is the MLE. This will be also used in some of our simulation studies.
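A sketch of this data-driven choice (our implementation of (6.1)–(6.2), not the authors' code): plug the MLE λ̂ into the asymptotic MSE and minimize over q; the search interval inside (0, 1) and the helper names are assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mse_hat(q, lam, x, n):
    """Estimated asymptotic MSE (6.1) with lambda_0 replaced by the MLE."""
    bias2 = (np.exp(-lam * x / q) - np.exp(-lam * x)) ** 2
    var = ((lam / q)**2 * (q**2 - 2*q + 2) / (q**3 * (2 - q)**3)
           * x**2 * np.exp(-2 * lam * x / q) / n)
    return bias2 + var

def choose_q(x_sample, x_tail):
    """Data-driven q* in (6.2): minimize (6.1) at the MLE over a sub-interval of (0, 1)."""
    n = len(x_sample)
    lam_hat = 1.0 / np.mean(x_sample)
    res = minimize_scalar(mse_hat, bounds=(0.5, 1.0), method="bounded",
                          args=(lam_hat, x_tail, n))
    return res.x

rng = np.random.default_rng(2)
sample = rng.exponential(scale=1.0, size=25)
print("q* =", choose_q(sample, x_tail=-np.log(0.01)))   # threshold for alpha = 0.01
```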
In general, unlike in the above example, closed-form expressions of the asymp-
totic mean squared error are not available, which calls for more work on this issue.
In the literature on applications of nonextensive entropy, although some discus-
sions on choosing q have been made often from physical considerations, it is
unclear how to do it from a statistical perspective. In particular, the direction of
distortion (i.e., q > 1 or q < 1) needs to be decided. We offer the following obser-
vations and thoughts:
1. For estimating the parameters in an exponential family, although |q_n − 1| = o(n^{−1/2})
guarantees the right asymptotic normality (i.e., asymptotic normality centered
around θ_0), one direction of distortion typically reduces the variance of estimation
and consequently improves the MSE. In the exponential distribution case,
q_n needs to be slightly greater than 1, but for estimating the covariance matrix
for multivariate normal observations, based on the asymptotic variance formula
in Example 3.2, q_n needs to be slightly smaller than 1. For a given family, the
expression of the asymptotic covariance matrix for the MLqE given in Section 3
can be used to find the beneficial direction of distortion. Our numerical
investigations confirm this understanding.
2. To minimize the mean squared error for tail probability estimation for the expo-
nential distribution family, we need 0 < qn < 1. This choice is in the opposite
direction for estimating the parameter λ itself. Thus, the optimal choice of qn
is not a characteristic of the family alone but also depends on the parametric
function to be estimated.
3. For some parametric functions, the MLqE makes little change. For the multi-
variate normal family, the surrogate value of the mean parameter stays exactly
the same while the variance parameters are altered.
4. We have found empirically that, given the right distortion direction, choices
of q_n with |1 − q_n| between 1/n and 1/√n usually improve, to different
extents, over the MLE.

7. Monte Carlo results. In this section, the performance of the MLqE in fi-
nite samples is explored via simulations. Our study includes (i) an assessment of
the accuracy for tail probability estimation and reliability of confidence intervals
and (ii) an assessment of the performance of MLqE for estimating multidimen-
sional parameters, including regression settings with generalized linear models.
The standard MLE is used as a benchmark throughout the study.
In this section, we present both deterministic and data-driven approaches on
choosing qn . First, deterministic choices are used to explore the possible advan-
tage of the MLqE for tail probability estimation with qn approaching 1 fast when
x is fixed and qn approaching 1 slowly when x increases with n. Then, the data-
driven choice in Section 6 is applied. For multivariate normal and GLM families,
where estimation of the MSE or prediction error becomes analytically cumber-
some, we choose qn = 1 − 1/n, which satisfies 1 − qn = o(n−1/2 ) that is needed
for asymptotic normality around θ0 . In all considered cases, numerical solution of
(2.7) is found using a variable metric algorithm (e.g., see Broyden [15]), where the
ML solution is chosen as the starting value.

7.1. Mean squared error: role of the distortion parameter q. In the first group
of simulations, we compare the estimators of the true tail probability α = α(x; λ0 ),
obtained via the MLq method and the traditional maximum likelihood approach.
Particularly, we are interested in assessing the relative performance of the two
estimators for different choices of the sample size by taking the ratio between the
two mean squared errors, MSE(α̂_n)/MSE(α̃_n). The simulations are structured as
follows: (i) For any given sample size n ≥ 2, a number B = 10,000 of Monte Carlo
samples X_1, …, X_n is generated from an exponential distribution with parameter
λ_0 = 1. (ii) For each sample, the MLq and ML estimates of α, respectively, α̃_{n,k} =
α(x; λ̃_{n,k}) and α̂_{n,k} = α(x; λ̂_{n,k}), k = 1, …, B, are obtained. (iii) For each sample
size n, the relative performance between the two estimators is evaluated by the
ratio R̂_n = MSE_MC(α̂_n)/MSE_MC(α̃_n), where MSE_MC denotes the Monte Carlo
estimate of the mean squared error. In addition, let ȳ_1 = B^{−1} Σ_{k=1}^B (α̂_{n,k} − α)^2
and ȳ_2 = B^{−1} Σ_{k=1}^B (α̃_{n,k} − α)^2. By the central limit theorem, for large values
of B, ȳ = (ȳ_1, ȳ_2) approximately has a bi-variate normal distribution with mean
(MSE(α̂_n), MSE(α̃_n)) and a certain covariance matrix Γ. Thus, the standard error
for R̂_n can be computed by the delta method [11] as

   se(R̂_n) = B^{−1/2} [γ̄_{11}/ȳ_2^2 − 2γ̄_{12} ȳ_1/ȳ_2^3 + γ̄_{22} ȳ_1^2/ȳ_2^4]^{1/2},

where γ̄_{11}, γ̄_{22} and γ̄_{12} denote, respectively, the Monte Carlo estimates for the
components of the covariance matrix Γ.
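The following sketch (ours) mirrors the Monte Carlo design just described on a smaller scale: it estimates R̂_n and its delta-method standard error for the exponential tail probability, reusing the fixed-point solver idea from Example 2.1. The value of B and the other settings are illustrative.

```python
import numpy as np

def mse_ratio(n, q, x, lam0=1.0, B=2000, seed=3):
    """Monte Carlo ratio R_n = MSE(MLE plug-in) / MSE(MLqE plug-in) for
    alpha = exp(-lam0 * x), with a delta-method standard error for the ratio."""
    rng = np.random.default_rng(seed)
    alpha = np.exp(-lam0 * x)
    err_mle, err_mlq = np.empty(B), np.empty(B)
    for k in range(B):
        s = rng.exponential(scale=1.0 / lam0, size=n)
        lam_hat = 1.0 / s.mean()
        lam = lam_hat                                   # MLqE via fixed-point iteration of (2.9)
        for _ in range(200):
            w = np.exp(-(s * lam - np.log(lam)) * (1.0 - q))
            lam = np.sum(w) / np.sum(w * s)
        err_mle[k] = np.exp(-lam_hat * x) - alpha
        err_mlq[k] = np.exp(-lam * x) - alpha
    y1, y2 = np.mean(err_mle**2), np.mean(err_mlq**2)   # MC estimates of the two MSEs
    g = np.cov(err_mle**2, err_mlq**2)                  # covariance of the squared errors
    se = np.sqrt((g[0, 0] / y2**2 - 2 * g[0, 1] * y1 / y2**3
                  + g[1, 1] * y1**2 / y2**4) / B)
    return y1 / y2, se

print(mse_ratio(n=20, q=0.5, x=-np.log(0.01)))
```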
Case 1: fixed α and q. Figure 1 illustrates the behavior of R̂_n for several choices
of the sample size. In general, we observe that for relatively small sample sizes,
R̂_n > 1 and the MLqE clearly outperforms the traditional MLE. Such behavior is
much more accentuated for smaller values of the tail probability to be estimated. In
contrast, when the sample size is larger, the bias component plays an increasingly
relevant role and eventually we observe that R̂_n < 1. This case is presented in
Figure 1(a) for values of the true tail probability α = 0.01, 0.005, 0.003 and a fixed
distortion parameter q = 0.5. Moreover, the results presented in Figure 1(b) show
that smaller values of the distortion parameter q accentuate the benefits attainable
in a small sample situation.

FIG. 1. Monte Carlo mean squared error ratio computed from B = 10,000 samples of size n. In (a)
we use a fixed distortion parameter q = 0.5 and true tail probability α = 0.01, 0.005, 0.003. The
dashed lines represent 99% confidence bands. In (b) we set α = 0.003 and use q_1 = 0.65, q_2 = 0.85
and q_3 = 0.95. The dashed lines represent 90% confidence bands.

FIG. 2. (a) Monte Carlo mean squared error ratio computed from B = 10,000 samples of size n, for
different values of the true probability: α_1 = 0.01, α_2 = 0.005 and α_3 = 0.003. The distortion
parameter is computed as q_n = [1/2 + e^{0.3(n−20)}]/[1 + e^{0.3(n−20)}]. (b) Monte Carlo mean squared error
ratio computed from B = 10,000 samples of size n. We use sequences q_n = 1 − [10 log(n + 10)]^{−1}
and x_n = n^{1/(2+δ)} (δ_1 = 0.5, δ_2 = 1.0 and δ_3 = 1.5). The dashed lines represent 99% confidence
bands.
Case 2: fixed α and q_n ↑ 1. In the second experimental setting, illustrated in
Figure 2(a), the tail probability α is fixed, while we let q_n be a sequence such
that q_n ↑ 1 and 0 < q_n < 1. For illustrative purposes we choose the sequence
q_n = [1/2 + e^{0.3(n−20)}]/[1 + e^{0.3(n−20)}], n ≥ 2, and study R̂_n for different choices
of the true tail probability to be estimated. For small values of the sample size, the
chosen sequence q_n converges relatively slowly to 1 and the distortion parameter
produces benefits in terms of variance. In contrast, when the sample size becomes
larger, q_n adjusts quickly to one. As a consequence, for large samples the MLqE
exhibits the same behavior shown by the traditional MLE.
Case 3: α_n ↓ 0 and q_n ↑ 1. The last experimental setting of this subsection
examines the case where both the true tail probability and the distortion parameter
change depending on the sample size. We consider sequences of distortion parameters
converging slowly relative to the sequence of quantiles x_n. In particular we
set q_n = 1 − [10 log(n + 10)]^{−1} and x_n = n^{1/(2+δ)}. In the simulation described in
Figure 2(b), we illustrate the behavior of the estimator for δ = 0.5, 1.0 and 1.5,
confirming the theoretical findings discussed in Section 5.

7.2. Asymptotic and bootstrap confidence intervals. The main objective of the
simulations presented in this subsection is twofold: (a) to study the reliability of
MLqE-based confidence intervals constructed using three commonly used methods:
asymptotic normality, parametric bootstrap and nonparametric bootstrap;
(b) to compare the results with those obtained using MLE. The structure of the simulations
is similar to that of Section 7.1, but a data-driven choice of q_n is used.
(i) For each sample, first we compute λ̂_n, the MLE of λ_0. We substitute λ̂_n in (6.1)
and solve it numerically in order to obtain q^* as described there. (ii) For each sample,
the MLq and ML estimates of the tail probability α are obtained. The standard
errors of the estimates are computed using three different methods: the asymptotic
formula derived in (5.5), nonparametric bootstrap and parametric bootstrap. The
number of replicates employed in bootstrap re-sampling is 500. We construct 95%
bootstrap confidence intervals based on the bootstrap quantiles and check the coverage
of the true value α.

TABLE 1
MC means and standard deviations of estimators of α, along with the MC mean of the standard
error computed using: (i) asymptotic normality, (ii) bootstrap and (iii) parametric bootstrap.
The true tail probability is α = 0.01 and q = 1 corresponds to the MLE

  n    q*      Estimate    St. dev.    se_asy      se_boot     se_pboot
 15    0.939   0.009489    0.010975    0.010472    0.011923    0.010241
       1.000   0.013464    0.014830    0.013313    0.013672    0.015090
 25    0.959   0.009693    0.008417    0.008470    0.009134    0.008298
       1.000   0.012108    0.010517    0.009919    0.010227    0.010950
 50    0.977   0.010108    0.006261    0.006326    0.006575    0.006249
       1.000   0.011385    0.007354    0.006894    0.007083    0.007318
100    0.988   0.010158    0.004480    0.004568    0.004680    0.004549
       1.000   0.010789    0.004908    0.004778    0.004880    0.004943
500    0.998   0.010006    0.002014    0.002052    0.002061    0.002050
       1.000   0.010122    0.002055    0.002070    0.002073    0.002087
In Table 1, we show the Monte Carlo means of α̂_n and α̃_n, their standard devia-
tions and the standard errors computed with the three methods described above. In
addition, we report the Monte Carlo average of the estimates of optimal distortion
parameter q ∗ . When q ∗ = 1, the results refer to the MLE case. Not surprisingly,
q ∗ approaches 1 as the sample size increases. When the sample size is small, the
MLqE has a smaller standard deviation and better performance. When n is larger,
the advantage of MLqE diminishes. As far as the standard errors are concerned, the
asymptotic method and the parametric bootstrap seem to provide values somewhat
closer to the Monte Carlo standard deviation for the considered sample sizes.
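A compact sketch (ours, not from the paper) of the parametric bootstrap interval used here for the tail probability: resample from the exponential fitted by the MLqE and take bootstrap quantiles. The 500 replicates follow the text, while the specific q and the function names are illustrative.

```python
import numpy as np

def mlqe_rate(s, q, iters=200):
    """MLqE of the exponential rate via the fixed-point form (2.9)."""
    lam = 1.0 / s.mean()
    for _ in range(iters):
        w = np.exp(-(s * lam - np.log(lam)) * (1.0 - q))
        lam = np.sum(w) / np.sum(w * s)
    return lam

def parametric_bootstrap_ci(s, q, x, b_boot=500, level=0.95, seed=4):
    """95% parametric bootstrap interval for alpha(x) = exp(-lambda * x):
    resample from Exp(lambda_tilde) and take bootstrap quantiles."""
    rng = np.random.default_rng(seed)
    lam_tilde = mlqe_rate(s, q)
    boot = np.empty(b_boot)
    for b in range(b_boot):
        sb = rng.exponential(scale=1.0 / lam_tilde, size=len(s))
        boot[b] = np.exp(-mlqe_rate(sb, q) * x)
    lo, hi = np.quantile(boot, [(1 - level) / 2, (1 + level) / 2])
    return np.exp(-lam_tilde * x), (lo, hi)

rng = np.random.default_rng(5)
sample = rng.exponential(scale=1.0, size=25)
print(parametric_bootstrap_ci(sample, q=0.96, x=-np.log(0.01)))
```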
In Table 2, we compare the accuracy of 95% confidence intervals and report the
relative length of the intervals for MLqE over those for MLE. Although the cover-
age probability for MLqE is slightly smaller than that of MLE (in the order of 1%),
we observe a substantial reduction in the interval length for all of the considered
cases. The most evident benefits occur when the sample size is small. Furthermore,
in general, the intervals computed via parametric bootstrap outperform the other
two methods in terms of coverage and length.

TABLE 2
MC coverage rate of 95% confidence intervals for α, computed using (i) asymptotic normality,
(ii) bootstrap and (iii) parametric bootstrap. RL is the length of the intervals of MLqE over that of
MLE. The true tail probability is α = 0.01 and q = 1 corresponds to the MLE

                    Asympt.              Boot.                Par. boot.
  n    q*      Coverage (%)   RL     Coverage (%)   RL     Coverage (%)   RL
 15    0.939       79.2      0.787       89.1      0.865       92.9      0.657
       1.000       80.9                  88.4                  92.5
 25    0.958       83.4      0.854       91.8      0.890       93.6      0.733
       1.000       84.3                  90.8                  94.2
 50    0.977       87.1      0.918       92.3      0.928       93.9      0.824
       1.000       88.4                  91.6                  93.4
100    0.988       91.1      0.956       93.3      0.960       94.7      0.889
       1.000       92.2                  92.9                  94.3
500    0.998       94.5      0.991       95.0      0.995       95.2      0.962
       1.000       94.7                  94.6                  94.8

7.3. Multivariate normal distribution. In this subsection, we evaluate the
MLq methodology for estimating the mean and covariance matrix of a multivariate
normal distribution. We generate B = 10,000 samples from a multivariate normal
N_p(μ, Σ), where μ is the p-dimensional unknown mean vector and Σ is the unknown
(p × p) covariance matrix. In our simulation, the true mean is μ = 0 and
the ijth element of Σ is ρ^{|i−j|}, where −1 < ρ < 1. To gauge performance for the
mean we employed the usual L_2-norm. For the covariance matrix, we considered
the loss function

(7.1)  Δ(Σ, Σ̃_q) = tr(Σ^{−1} Σ̃_q − I)^2,

where Σ̃_q represents the MLq estimate of Σ with q = 1 − 1/n. Note that the loss
is 0 when Σ = Σ̃_q and is positive otherwise. Moreover, the loss is invariant to the
transformations AΣA^T and AΣ̃_q A^T for a nonsingular matrix A. The use of such a
loss function is common in the literature (e.g., Huang et al. [17]).
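For concreteness, the loss (7.1) and its invariance under congruence transformations can be checked numerically; this snippet is our illustration and the name entropy_loss is not from the paper.

```python
import numpy as np

def entropy_loss(sigma, sigma_hat):
    """Loss (7.1): tr((Sigma^{-1} Sigma_hat - I)^2); zero iff Sigma_hat = Sigma."""
    p = sigma.shape[0]
    m = np.linalg.solve(sigma, sigma_hat) - np.eye(p)
    return np.trace(m @ m)

# invariance check: the loss is unchanged under Sigma -> A Sigma A^T
rng = np.random.default_rng(6)
p = 3
a = rng.normal(size=(p, p))
s = np.eye(p) + 0.3 * np.ones((p, p))
s_hat = s + 0.1 * np.diag(np.arange(1, p + 1))
print(entropy_loss(s, s_hat), entropy_loss(a @ s @ a.T, a @ s_hat @ a.T))
```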
In Table 3, we show simulation results for moderate or small sample sizes ranging
from 10 to 100 for various dimensions of the covariance matrix Σ. The entries
in the table represent the Monte Carlo mean of Δ(Σ, Σ̂_1) over that of Δ(Σ, Σ̃_q),
where Σ̂_1 is the usual ML estimate multiplied by the correction factor n/(n − 1).
The standard error of the ratio is computed via the delta method. Clearly, the
MLqE performs well for smaller sample sizes. Interestingly, the squared error for
the MLqE reduces dramatically compared to that of the MLE as the dimension
increases. Remarkably, when p = 8 the gain in accuracy persists even for larger
sample sizes, ranging from about 22% to 84%. We tried various structures of Σ
and obtained performances comparable to the ones presented. For μ we found
that the MLqE performs nearly identically to the MLE for all choices of p and n, which
is not surprising given the findings in Section 3. For brevity we omit the results
on μ.

TABLE 3
Monte Carlo mean of Δ(Σ, Σ̂_1) over that of Δ(Σ, Σ̃_q) with standard error in parenthesis

  n      p = 1            p = 2            p = 4            p = 8
 10    1.225 (0.018)    1.298 (0.019)    1.740 (0.029)    1.804 (0.022)
 15    1.147 (0.014)    1.249 (0.017)    1.506 (0.021)    1.840 (0.026)
 25    1.083 (0.011)    1.153 (0.012)    1.313 (0.016)    1.562 (0.020)
 50    1.041 (0.007)    1.052 (0.007)    1.199 (0.011)    1.377 (0.015)
100    1.018 (0.005)    1.033 (0.005)    1.051 (0.006)    1.222 (0.011)

7.4. Generalized linear models. Our methodology can be promptly extended
to the popular framework of generalized linear models. Consider the regression
setting where each outcome of the dependent variable, Y, is drawn from a
distribution in the exponential family. The mean η of the distribution is assumed
to depend on the independent variables, X, through E(Y|X) = η = g^{−1}(X^T β),
where X is the design matrix, β is a p-dimensional vector of unknown parameters
and g is the link function. In our simulations, we consider two notable instances:
(i) Y from an exponential distribution with η = exp(−x^T β); (ii) Y from a Bernoulli
distribution with η = 1/(1 + exp{x^T β}). The first case represents the exponential
regression model, which is a basic setup for time-to-event analysis. The latter is
the popular logistic regression model.
We initialize the simulations by generating design points randomly drawn from
the unit hypercube [−1, 1]^p. The entries of the true vector of coefficients β are
assigned by sampling p points at random in the interval [−1, 1], obtaining values
β = (−0.57, 0.94, 0.16, −0.72, 0.68, 0.92, 0.80, 0.04, 0.64, 0.34, 0.38, 0.47). The
values of X and β are kept fixed during the simulations. Then, 1000 Monte Carlo
samples of Y|X are generated according to the two models described above and
for each sample the MLq and ML estimates are computed. The prediction error based
on independent out-of-sample observations is

(7.2)  PE_q = 10^{−3} Σ_{j=1}^{10^3} (Y_j^{test} − g^{−1}(X_j^{test} β̃_{q,n}))^2,

where β̃_{q,n} is the MLqE of β. In Table 4 we present the prediction error for various
choices of n and p. For both models, the MLqE outperforms the classic MLE
for all considered cases. The benefits from the MLqE can be remarkable when the
dimension of the parameter space is larger. This is particularly evident in the case
of the exponential regression, where the prediction error of MLE is at least twice
that of MLqE. In one case, when n = 25 and p = 12, the MLqE is about nine
times more accurate. This is mainly due to MLqE's stabilization of the variance
component, which for the MLE tends to become large quickly when n is very small
compared to p. Although for the logistic regression we observe a similar behavior,
the gain in high dimension becomes more evident for larger n.

TABLE 4
Monte Carlo mean of PE_1 over that of PE_q for exponential and logistic regression
with standard error in parenthesis

  p     n = 25           n = 50           n = 100          n = 250
Exp. regression
  2    2.549 (0.003)    2.410 (0.002)    2.500 (0.003)    2.534 (0.003)
  4    2.469 (0.002)    2.392 (0.002)    2.543 (0.002)    2.493 (0.002)
  8    4.262 (0.012)    2.941 (0.004)    3.547 (0.006)    3.582 (0.006)
 12    9.295 (0.120)    3.644 (0.008)    3.322 (0.005)    5.259 (0.027)
Logistic regression
  2    1.156 (0.006)    1.329 (0.006)    1.205 (0.003)    1.385 (0.003)
  4    1.484 (0.022)    1.141 (0.003)    1.502 (0.007)    1.353 (0.003)
  8    1.178 (0.008)    1.132 (0.003)    1.290 (0.004)    1.300 (0.002)
 12    1.086 (0.005)    1.141 (0.003)    1.227 (0.003)    1.329 (0.002)
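A self-contained sketch (ours) of the logistic-regression case (ii): the MLqE of β is obtained by numerically maximizing Σ_i L_q(f(Y_i | X_i; β)), and the out-of-sample prediction error follows (7.2). The data-generating coefficients, optimizer and helper names are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from scipy.optimize import minimize

def lq(u, q):
    """Deformed logarithm L_q."""
    return np.log(u) if q == 1.0 else (u ** (1.0 - q) - 1.0) / (1.0 - q)

def fit_logistic_mlqe(x, y, q):
    """Maximize sum_i L_q(f(y_i | x_i; beta)) for the Bernoulli model with
    mean eta = 1 / (1 + exp(x'beta)), the link used in case (ii) above."""
    def neg_lq_lik(beta):
        eta = np.clip(x @ beta, -30.0, 30.0)      # guard against overflow in exp
        p_succ = 1.0 / (1.0 + np.exp(eta))
        dens = np.clip(np.where(y == 1, p_succ, 1.0 - p_succ), 1e-12, 1.0)
        return -np.sum(lq(dens, q))
    res = minimize(neg_lq_lik, np.zeros(x.shape[1]), method="BFGS")
    return res.x

rng = np.random.default_rng(7)
n, p = 50, 8
beta_true = rng.uniform(-1, 1, size=p)            # illustrative coefficients
x = rng.uniform(-1, 1, size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(x @ beta_true)))

beta_mle = fit_logistic_mlqe(x, y, q=1.0)         # q = 1 recovers the MLE
beta_q = fit_logistic_mlqe(x, y, q=1.0 - 1.0 / n) # q = 1 - 1/n as in this section

x_test = rng.uniform(-1, 1, size=(1000, p))
y_test = rng.binomial(1, 1.0 / (1.0 + np.exp(x_test @ beta_true)))
for name, b in [("MLE ", beta_mle), ("MLqE", beta_q)]:
    pred = 1.0 / (1.0 + np.exp(x_test @ b))
    print(name, "prediction error as in (7.2):", np.mean((y_test - pred) ** 2))
```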

8. Concluding remarks. In this work, we have introduced the MLqE, a new
parametric estimator inspired by a class of generalized information measures that
have been successfully used in several scientific disciplines. The MLqE may
also be viewed as a natural extension of the classical MLE. It can preserve the
large sample properties of the MLE, while—by means of a distortion parame-
ter q—allowing modification of the trade-off between bias and variance in small
or moderate sample situations. The Monte Carlo simulations support that when
the sample size is small or moderate, the MLqE can successfully trade bias for
variance, obtaining a reduction of the mean squared error, sometimes very dramat-
ically.
Overall, this work makes a significant contribution to parametric estimation and
applications of nonextensive entropies. For parametric models, MLE is by far the
most commonly used estimator and the substantial improvement as seen in our nu-
merical work seems relevant and important to applications. Given the increasing
attention to q-entropy in other closely related disciplines, our theoretical results
provide a useful view from a statistical perspective. For instance, from the litera-
ture, although q is chosen from interesting physical considerations, for statistical
estimation (e.g., for financial data analysis where q-entropy is considered), there
are few clues as to how to choose the direction and amount of distortion.
772 D. FERRARI AND Y. YANG

Besides the theoretical optimality results and often remarkably improved per-
formance over MLE, our proposed method is very practical in terms of imple-
mentability and computational efficiency. The estimating equations are simply obtained
by replacing the logarithm in the log-likelihood function of the usual maximum
likelihood procedure with the distorted logarithm. Thus, the resulting optimization
task can be easily formulated in terms of a weighted version of the familiar score
function, with weights proportional to the (1 − q)th power of the assumed density.
Hence, similarly to other techniques based on re-weighting of the likelihood, sim-
ple and fast algorithms for solving the MLq equations numerically (possibly even
for large problems) can be derived.
For the MLq estimators, helpful insights on their behaviors may be gained from
robust analysis. For a given q, (2.6) defines an M-estimator of the surrogate pa-
rameter θ ∗ . It seems that global robustness properties, such as a high breakdown
point, may be established for a properly chosen distortion parameter, which would
add value to the MLq methodology.
High-dimensional estimation has recently become a central theme in statistics.
The results in this work suggest that the MLq methodology may be a valuable tool
for some high-dimensional estimation problems (such as gamma regression and
covariance matrix estimation as demonstrated in this paper) as a powerful remedy
to the MLE. We believe this is an interesting direction for further exploration.
Finally, more research on the practical choices of q and their theoretical prop-
erties will be valuable. To this end, higher-order asymptotic treatment of the dis-
tribution (or moments) of the MLqE will be helpful. For instance, derivation of
saddle-point approximations of order n^{−3/2}, along the lines of Field and Ronchetti
[13] and Daniels [10], may be profitably used to give improved approximations of
the MSE.

APPENDIX A: PROOFS

In all of the following proofs we denote ψ_n(θ) := n^{−1} Σ_{i=1}^n ∇_θ L_{q_n}(f(X_i; θ)).
For exponential families, since f(x; θ) = e^{θ^T b(x) − A(θ)}, we have

(A.1)  ψ_n(θ) = (1/n) Σ_{i=1}^n e^{(1−q_n)(θ^T b(X_i) − A(θ))} [b(X_i) − m(θ)],

where m(θ) = ∇_θ A(θ). The MLq equation sets ψ_n(θ) = 0 and solves for θ. Moreover,
we define ϕ(x, θ) := θ^T b(x) − A(θ), and thus f(x; θ) = e^{ϕ(x,θ)}. When clear
from the context, ϕ(x, θ) is denoted by ϕ.

Proof of Theorem 3.1. Define ψ(θ) := E_{θ_0} ∇_θ log(f(X; θ)). Since f has the
form in (3.1), we can write ψ(θ) = E_{θ_0}[b(X) − m(θ)]. We want to show uniform
convergence of ψ_n(θ) to ψ(θ) for all θ ∈ Θ in probability. Clearly,

(A.2)  sup_{θ∈Θ} ‖(1/n) Σ_{i=1}^n e^{(1−q_n)(θ^T b(X_i) − A(θ))}[b(X_i) − m(θ)] − E_{θ_0}[b(X) − m(θ)]‖_1
         ≤ sup_{θ∈Θ} ‖(1/n) Σ_{i=1}^n [e^{(1−q_n)(θ^T b(X_i) − A(θ))} − 1][b(X_i) − m(θ)]‖_1
         + sup_{θ∈Θ} ‖(1/n) Σ_{i=1}^n [b(X_i) − m(θ)] − E_{θ_0}[b(X) − m(θ)]‖_1,

where ‖·‖_1 denotes the ℓ_1-norm. Note that the second summand in (A.2) actually
does not depend on θ [as m(θ) cancels out] and it converges to zero in probability
by the law of large numbers. Next, let s(X_i; θ) := e^{(1−q_n)(θ^T b(X_i) − A(θ))} − 1 and
t(X_i; θ) := b(X_i) − m(θ). By the Cauchy–Schwarz inequality, the first summand
in (A.2) is upper bounded by

(A.3)  sup_{θ∈Θ} Σ_{j=1}^p [(1/n) Σ_{i=1}^n s(X_i; θ)^2]^{1/2} [(1/n) Σ_{i=1}^n t_j(X_i; θ)^2]^{1/2},

where t_j denotes the jth element of the vector t(X_i; θ). It follows that for (A.2),
it suffices to show n^{−1} Σ_i sup_θ s(X_i; θ)^2 →^p 0 and that n^{−1} Σ_i sup_θ t_j(X_i; θ)^2 is
bounded in probability. Since Θ is compact, sup_θ |m(θ)| ≤ (c_1, c_1, …, c_1) for
some positive constant c_1 < ∞, and we have

(A.4)  (1/n) Σ_{i=1}^n sup_{θ∈Θ} t_j(X_i; θ)^2 ≤ (2/n) Σ_{i=1}^n b_j(X_i)^2 + 2(c_1)^2,

where the last inequality follows from the basic fact that (a − b)^2 ≤ 2a^2 + 2b^2 (a, b ∈ R).
The last expression in (A.4) is bounded in probability by some constant as
E_{θ_0} b_j(X)^2 < ∞ for all j = 1, …, p. Next, note that

(A.5)  (1/n) Σ_{i=1}^n sup_{θ∈Θ} s(X_i; θ)^2 ≤ (1/n) Σ_{i=1}^n sup_{θ∈Θ} e^{2(1−q_n)(θ^T b(X_i) − A(θ))}
         − (2/n) Σ_{i=1}^n inf_{θ∈Θ} e^{(1−q_n)(θ^T b(X_i) − A(θ))} + 1.

Thus, to show n^{−1} Σ_i sup_θ s(X_i; θ)^2 →^p 0, it suffices to obtain
n^{−1} Σ_i sup_θ e^{2(1−q_n)ϕ(θ)} − 1 →^p 0 and n^{−1} Σ_i inf_θ e^{(1−q_n)ϕ(θ)} − 1 →^p 0. Actually,
since Θ is compact and sup_θ e^{−A(θ)} < c_2 for some c_2 < ∞,

(A.6)  (1/n) Σ_{i=1}^n sup_{θ∈Θ} e^{2(1−q_n)(θ^T b(X_i) − A(θ))} ≤ (1/n) Σ_{i=1}^n e^{2|1−q_n|(|log c_2| + θ^{(∗)T}|b(X_i)|)},

where θ_j^{(∗)} = max{|θ_{j,0}^{(∗)}|, |θ_{j,1}^{(∗)}|}, j = 1, …, p, and (θ_{j,0}^{(∗)}, θ_{j,1}^{(∗)}) represent element-wise
boundary points of θ_j. For r = 1, 2,

(A.7)  E_{θ_0}[e^{2|1−q_n|(|log c_2| + θ^{(∗)T}|b(X)|)}]^r
(A.8)     = e^{2r|1−q_n||log c_2| − A(θ_0)} ∫_Ω e^{[2r|1−q_n| sign{b(x)} θ^{(∗)} + θ_0]^T b(x)} dμ(x).

We decompose Ω into 2^p subsets in terms of the sign of the elements of b(x). That
is, Ω = ∪_{k=1}^{2^p} B_k, where

(A.9)  B_1 = {x ∈ Ω : b_1(x) ≥ 0, b_2(x) ≥ 0, …, b_{p−1}(x) ≥ 0, b_p(x) ≥ 0},
       B_2 = {x ∈ Ω : b_1(x) ≥ 0, b_2(x) ≥ 0, …, b_{p−1}(x) ≥ 0, b_p(x) < 0},
       B_3 = {x ∈ Ω : b_1(x) ≥ 0, b_2(x) ≥ 0, …, b_{p−1}(x) < 0, b_p(x) ≥ 0}

and so on. Note that sign{b(x)} stays the same for each B_i, i = 1, …, 2^p. Also
because θ_0 is an interior point, when |1 − q_n| is small enough, the integral in (A.7)
on B_i is finite and by the dominated convergence theorem,

(A.10)  ∫_{B_k} e^{[2r|1−q_n| sign{b(x)} θ^{(∗)} + θ_0]^T b(x)} dμ(x) → ∫_{B_k} e^{θ_0^T b(x)} dμ(x)  as n → ∞.

Consequently,

(A.11)  e^{−A(θ_0)} ∫_Ω e^{[2r|1−q_n| sign{b(x)} θ^{(∗)} + θ_0]^T b(x)} dμ(x) → ∫_Ω e^{θ_0^T b(x) − A(θ_0)} dμ(x) = 1.

It follows that the mean and the variance of sup_θ e^{2(1−q_n)[θ^T b(X) − A(θ)]} converge
to 1 and 0, respectively, as n → ∞. Therefore, a straightforward application of
Chebyshev's inequality gives

(A.12)  (1/n) Σ_{i=1}^n sup_{θ∈Θ} e^{2(1−q_n)[θ^T b(X_i) − A(θ)]} →^p 1,  n → ∞.

An analogous argument shows that

(A.13)  (1/n) Σ_{i=1}^n inf_{θ∈Θ} e^{(1−q_n)[θ^T b(X_i) − A(θ)]} →^p 1,  n → ∞.

Therefore, we have established n^{−1} Σ_i sup_θ s(X_i; θ)^2 →^p 0. Hence, (A.2) converges
to zero in probability. By applying Lemma 5.9 on page 46 in [31], we know
that with probability converging to 1, the solution of the MLq equations is unique
and it maximizes the Lq-likelihood.

Proof of Theorem 3.2. By Taylor's theorem, there exists a random point θ̄ in
the line segment between θ_n^* and θ̃_n such that, with probability converging to one,
we have

(A.14)  0 = ψ_n(X; θ̃_n) = ψ_n(X; θ_n^*) + ψ̇_n(X; θ_n^*)(θ̃_n − θ_n^*) + (1/2)(θ̃_n − θ_n^*)^T ψ̈_n(X; θ̄)(θ̃_n − θ_n^*),

where ψ̇_n is a p × p matrix of first-order derivatives and, similarly to page 68 in
van der Vaart [31], ψ̈_n denotes a p-vector of (p × p) matrices of second-order
derivatives, respectively; X denotes the data vector. We can rewrite the above expression
as

(A.15)  −√n ψ̇(θ_n^*)^{−1} ψ_n(X; θ_n^*) = ψ̇(θ_n^*)^{−1} ψ̇_n(X; θ_n^*) √n(θ̃_n − θ_n^*)
(A.16)     + ψ̇(θ_n^*)^{−1} (√n/2)(θ̃_n − θ_n^*)^T ψ̈_n(X; θ̄)(θ̃_n − θ_n^*),

where ψ̇(θ) = E_{θ_0} ∇_θ^2 L_{q_n} f(X; θ). Note that

(A.17)  ψ̇(θ) = E_{θ_0} e^{(1−q_n)ϕ(θ)}[(1 − q_n)∇_θ ϕ(θ)^T ∇_θ ϕ(θ) − ∇_{θθ}^2 ϕ(θ)]
(A.18)        = K_{n,1} E_{μ_{n,1}}[(1 − q_n)∇_θ ϕ(θ)^T ∇_θ ϕ(θ) − ∇_{θθ}^2 ϕ(θ)],

where μ_{n,k} = k(1 − q_n)θ + θ_0 and K_{n,k} = e^{A(μ_{n,k}) − A(θ_0)}. For k, l ∈ {1, …, p}, we
have

(A.19)  {E_{μ_{n,1}} ∇_θ ϕ(θ)^T ∇_θ ϕ(θ)}_{kl} = E_{μ_{n,1}}[b_k(X) − m_k(θ)][b_l(X) − m_l(θ)]
(A.20)     = E_{μ_{n,1}}[b_k(X) − m_k(μ_{n,1}) + m_k(μ_{n,1}) − m_k(θ)]
(A.21)        × [b_l(X) − m_l(μ_{n,1}) + m_l(μ_{n,1}) − m_l(θ)]
(A.22)     = E_{μ_{n,1}}[b_k(X) − m_k(μ_{n,1})][b_l(X) − m_l(μ_{n,1})]
(A.23)        + [m_k(μ_{n,1}) − m_k(θ)][m_l(μ_{n,1}) − m_l(θ)],

where the first term in the last passage is the klth element of the covariance matrix
D(θ) evaluated at μ_{n,1}. Since Θ is compact, {ψ̇(θ)}_{kl} ≤ C_{kl}^* < ∞ for some
constants C_{kl}^*, k, l ∈ {1, …, p}. We take the following steps to derive asymptotic
normality.
Step 1. We first show that the left-hand side of (A.16) converges in distribution.
Define the vector Z_{n,i} := ∇_θ L_{q_n} f(X_i, θ_n^*) − E_{θ_0} ∇_θ L_{q_n} f(X_i, θ_n^*) in R^p. Consider
an arbitrary vector a ∈ R^p and let W_{n,i} := a^T Z_{n,i} and W̄_n = n^{−1} Σ_i W_{n,i}. Since
W_{n,i} (1 ≤ i ≤ n) form a triangular array where the W_{n,i} are rowwise i.i.d., we check
the Lyapunov condition. In our case, the condition reads

(A.24)  n^{−1/3} (E W_{n,1}^2)^{−1} (E[W_{n,1}^3])^{2/3} → 0  as n → ∞.

Next, denote μn,k = θ0 + k(1 − qn )θn∗ . One can see that


  p 3 !2/3


3
(E[Wn,1 ])1/3 = Kn Eμn,3 aj bj (X) − mj (θn∗ ) ,
j =1

where Kn = exp{− 23 A(θ0 ) − 2(1 − qn )A(θn∗ ) + 23 A(μn,3 )} and Kn → 1 as n → ∞.


Since θ0 is an interior point in  (compact) the above quantity is uniformly upper
bounded in n by some finite constant. Next, consider
2
E[Wn,1 ] = E[a T Zn,1 Zn,1
T
] = a T E[Zn,1 Zn,1
T
]a.
T shows that the above
A calculation similar to that in (A.23) for the matrix Zn,1 Zn,1
quantity satisfies
(A.25) a T [−D(μn,2 ) + Mn ]a → −a T D(θ0 )a > 0, n → ∞,
where the klth element of Mn is



(A.26) {Mn }kl = mk (μn,2 ) − mk (θn∗ ) ml (μn,2 ) − ml (θn∗ )
and μn,2 → θ0 and θn∗ → θ0 , as n → ∞. This shows that condition (A.24) holds
√ 2 ])−1/2 a T W →D
and n(E[Wn,1 n N1 (0, 1). Hence, by the Cramér–Wold device
(e.g., see [31]), we have
√ T −1/2 D
(A.27) n[EZn,1 Zn,1 ] W n → Np (0, Ip ).
Step 2. Next, we want convergence in probability of ψ̇(θn∗ )−1 ψ̇n (X, θn∗ ) to Ip .
For k, l ∈ {1, . . . , p}, given ε > 0, we have


Pθ0 |{ψ̇n (X, θn∗ )}kl − {ψ̇(θn∗ )}kl | > ε
(A.28) 2
∂2 
−1 −2 
≤n ε Eθ0 Lqn (f (X; θ )) ∗
∂θk θl θn

by the i.i.d. assumption and Chebyshev's inequality. When |1 − qn| ≤ 1, the expectation in (A.28) is

        Eθ0 { e^{2(1−qn)ϕ(θn∗)} [(1 − qn)(bk(X) − mk(θn∗))(bl(X) − ml(θn∗)) + D(θn∗)²]² }
            ≤ 2 Eμn,2 { [(bk(X) − mk(θn∗))(bl(X) − ml(θn∗))]² + D(θn∗)⁴ }
                × exp{−A(θ0) − 2(1 − qn)A(θn∗) + A(μn,2)},
where the inequality follows from the triangle inequality. Since Θ is compact and the existence of fourth moments is ensured for exponential families, the above quantity is upper bounded by some finite constant. Therefore, the right-hand side of (A.28) is bounded by a constant multiple of n^{−1} and hence converges to zero as n → ∞. Since convergence in probability holds for each k, l ∈ {1, . . . , p} and p < ∞, we have that the matrix difference |ψ̇n(X, θn∗) − ψ̇(θn∗)| converges in probability to the
zero matrix. From the calculation carried out in (A.17), one can see that ψ̇(θn∗) is a deterministic sequence such that ψ̇(θn∗) → ψ̇(θ0) = −∇²θ A(θ0) as n → ∞. Thus, we have

(A.29)  |ψ̇n(X; θn∗) − ψ̇(θ0)| ≤ |ψ̇n(X; θn∗) − ψ̇(θn∗)| + |ψ̇(θn∗) − ψ̇(θ0)| →p 0

as n → ∞. Therefore, ψ̇(θn∗)^{−1} ψ̇n(X, θn∗) →p Ip.
Step 3. Here, we show that the second term on the right-hand side of (A.16) is negligible. Let g(X; θ) be an element of the array ψ̈n(X, θ) of dimension p × p × p. For some fixed point θ̄ on the line segment between θ̃ and θn∗, we have that

(A.30)  |g(X; θ̃) − g(X; θn∗)| = |∇θ g(X, θ̄)^T (θ̃ − θn∗)| ≤ sup_{θ∈Θ} |∇θ g(X, θ)| |θ̃ − θn∗|.
A calculation shows that the hth element of the gradient vector in the expression above is

        {∇θ g(X, θ)}h
(A.31)      = n^{−1} Σ_{i=1}^n e^{(1−qn)ϕ(θ)} [(1 − qn)³ ϕ(θ)^{(1)} + (1 − qn)² ϕ(θ)^{(2)} + (1 − qn) ϕ(θ)^{(3)} + ϕ(θ)^{(4)}]
for h ∈ {1, . . . , p}, where ϕ^{(k)} denotes the product of the partial derivatives of order k with respect to θ. As shown before in the proof of Theorem 3.1, supθ e^{(1−qn)ϕ(Xi,θ)} has finite expectation when |1 − qn| is small enough. Thus, by Markov's inequality, supθ |∇θ g(X, θ)| is bounded in probability. In addition, recall that the deterministic sequence ψ̇(θn∗) converges to a constant. Hence, ψ̇(θn∗)^{−1} ψ̈n(X; θ̃) is bounded in probability.
Since the third term in the expansion (A.16) is of higher order than the second
term, by combining steps 1, 2 and 3 and applying Slutsky’s lemma we obtain the
desired asymptotic normality result. □
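To illustrate the content of Theorem 3.2 outside the formal argument, here is a minimal simulation sketch for a one-parameter exponential family. The model choice (an Exponential(θ) density), the fixed distortion value q and the root-finding routine are our own illustrative assumptions, not constructions from the paper; the MLq estimating equation being solved is Σi f(Xi; θ)^{1−q}(1/θ − Xi) = 0.

```python
import numpy as np
from scipy.optimize import brentq

# Illustration only: MLqE in an Exponential(theta) model, f(x; theta) = theta * exp(-theta * x).
# The MLq estimating equation is sum_i f(x_i; theta)^(1 - q) * (1/theta - x_i) = 0,
# i.e., the usual score equation reweighted by f(x_i; theta)^(1 - q).

def mlq_equation(theta, x, q):
    w = (theta * np.exp(-theta * x)) ** (1.0 - q)   # Lq weights f(x; theta)^(1 - q)
    return np.sum(w * (1.0 / theta - x))

def mlqe(x, q):
    mle = 1.0 / np.mean(x)                          # MLE, used to bracket the root
    return brentq(mlq_equation, 0.1 * mle, 10.0 * mle, args=(x, q))

rng = np.random.default_rng(0)
theta0, q, n, reps = 1.0, 0.95, 200, 2000
est = np.array([mlqe(rng.exponential(1.0 / theta0, size=n), q) for _ in range(reps)])

# Standardize around the Monte Carlo mean (a stand-in for theta_n^*) and inspect
# the approximately standard normal shape suggested by Theorem 3.2.
z = (est - est.mean()) / est.std()
print("skewness ~ %.3f, excess kurtosis ~ %.3f" % (np.mean(z**3), np.mean(z**4) - 3.0))
```

With q close to 1 and moderate n, the standardized estimates should look approximately standard normal, in line with the statement of the theorem.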
Proof of Theorem 4.1. Uniform convergence of ψn(θ) to ψ(θ) for all θ ∈ Θ in probability is satisfied if

        sup_{θ∈Θ} ‖ n^{−1} Σ_{i=1}^n f(Xi; θ)^{1−qn} U(Xi; θ) − Eθ0 U(X, θ) ‖1 →p 0.
The left-hand side of the above expression is upper bounded by

        sup_{θ∈Θ} ‖ n^{−1} Σ_{i=1}^n (f(Xi; θ)^{1−qn} − 1) U(Xi; θ) ‖1
(A.32)      + sup_{θ∈Θ} ‖ n^{−1} Σ_{i=1}^n U(Xi; θ) − Eθ0 U(X, θ) ‖1.
By the Cauchy–Schwarz inequality, the first term of the above expression is upper bounded by

        sup_{θ∈Θ} Σ_{j=1}^p [ n^{−1} Σ_{i=1}^n (f(Xi; θ)^{1−qn} − 1)² ]^{1/2} [ n^{−1} Σ_{i=1}^n Uj(Xi; θ)² ]^{1/2}.
By assumption B.2, n^{−1} Σi supθ Uj(Xi; θ)² is bounded in probability. Moreover, given ε > 0, by Markov's inequality we have

        P( n^{−1} Σi supθ (f(Xi; θ)^{1−qn} − 1)² > ε ) ≤ ε^{−1} E supθ (f(X; θ)^{1−qn} − 1)²,

which converges to zero by assumption B.2. By assumption B.3, the second summand in (A.32) converges to zero in probability. □
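As an informal numerical check of the uniform convergence just used (not part of the argument), the sketch below evaluates the supremum over a finite grid of θ values for an Exponential(θ0) sample, with a distortion qn → 1. The model, the grid standing in for a compact Θ, and the rate chosen for qn are illustrative assumptions on our part.

```python
import numpy as np

# Illustration only: a crude numerical look at the uniform convergence in the
# proof of Theorem 4.1, for an Exponential(theta) model where
# U(x; theta) = 1/theta - x and E_theta0 U(X, theta) = 1/theta - 1/theta0.
# The sup over the compact set Theta is approximated by a max over a grid.

rng = np.random.default_rng(2)
theta0 = 1.0
grid = np.linspace(0.5, 2.0, 200)                    # stand-in for a compact Theta

def sup_discrepancy(n):
    qn = 1.0 - 1.0 / np.sqrt(n)                      # a distortion with q_n -> 1
    x = rng.exponential(1.0 / theta0, size=n)
    out = 0.0
    for th in grid:
        w = (th * np.exp(-th * x)) ** (1.0 - qn)     # weights f(x; theta)^(1 - q_n)
        emp = np.mean(w * (1.0 / th - x))            # n^{-1} sum_i f^{1-q_n} U
        out = max(out, abs(emp - (1.0 / th - 1.0 / theta0)))
    return out

for n in (100, 1000, 10000):
    print(n, sup_discrepancy(n))
```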
Proof of Theorem 4.3. By Taylor's theorem, for a solution of the MLq equation, there exists a random point θ̃ between θ̂n and θn∗ such that

(A.33)  0 = n^{−1} Σ_{i=1}^n U∗(Xi, θn∗, qn) + n^{−1} Σ_{i=1}^n ∇θ U∗(Xi, θn∗, qn)(θ̂n − θn∗)
                + (1/2)(θ̂n − θn∗)^T [ n^{−1} Σ_{i=1}^n ∇²θ U∗(Xi, θ̃, qn) ](θ̂n − θn∗).
From Theorem 4.1, we know that with probability approaching 1, θ̂n is the unique MLqE and the above equation holds. Define Zn,i := U∗(Xi; θn∗, qn), i = 1, . . . , n, a triangular array of i.i.d. random vectors, and let a ∈ Rp be a vector of constants. Let Wn,i := a^T Zn,i. The Lyapunov condition for ensuring asymptotic normality of the linear combination a^T Σ_{i=1}^n Zn,i/n for a ∈ Rp with ‖a‖ > 0 in this case reads

        n^{−1/3} (E[Wn,1²])^{−1} (E[Wn,1³])^{2/3} → 0  as n → ∞.
Under C.1 and C.2, this can be easily checked. The Cramér–Wold device implies

        Cn n^{−1} Σ_{i=1}^n U∗(Xi, θn∗, qn) →D Np(0, Ip),

where Cn := √n [Eθ0 U∗(X, θn∗)^T U∗(X, θn∗)]^{−1/2}.
Next, consider the second term in (A.33). Given ε > 0, for k, l ∈ {1, . . . , p}, by Chebyshev's inequality

        P( | { n^{−1} Σ_{i=1}^n I∗(Xi, θn∗, qn) }k,l − {Jn}k,l | > ε ) ≤ ε^{−2} n^{−1} E {I∗(X, θn∗, qn)}²k,l.
Thus, the right-hand side of the above expression converges to zero as n → ∞
under C.3. Since convergence in probability is ensured for each k, l and p < ∞,
under C.2, we have that |n^{−1} Σi I∗(Xi, θn∗) − Jn| converges to the zero matrix in probability.
Finally, n^{−1} ∇²θ Σ_{i=1}^n U∗(Xi, θ, qn) in the third term of the expansion (A.33) is a p × p × p array of partial second-order derivatives. By assumption, there is a neighborhood B of θ0 for which each entry of ∇²θ U∗(x, θ, qn) is dominated by g0(x) for some g0(x) ≥ 0 for all θ ∈ B. With probability tending to 1,

        ‖ n^{−1} Σ_{i=1}^n ∇²θ U∗(Xi, θ̃, qn) ‖ ≤ p³ n^{−1} Σ_{i=1}^n |g0(Xi)|,
which is bounded in probability by the law of large numbers. Since the third term
in the expansion (A.33) is of higher order than the second term, the normality
result follows by applying Slutsky's lemma. □
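To make the estimating-equation setup of Theorem 4.3 concrete, the following sketch computes an MLqE for a two-parameter N(μ, σ²) model by directly maximizing the Lq-likelihood; any interior maximizer solves the MLq equations. The parameterization, optimizer and starting values are illustrative assumptions on our part, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Illustration only: an MLqE for a N(mu, sigma^2) model obtained by directly
# maximizing the Lq-likelihood sum_i L_q(f(x_i; mu, sigma)), with
# L_q(u) = (u^(1-q) - 1)/(1 - q).

def Lq(u, q):
    return np.log(u) if q == 1.0 else (u ** (1.0 - q) - 1.0) / (1.0 - q)

def neg_lq_likelihood(par, x, q):
    mu, log_sigma = par                              # work on log(sigma) to keep sigma > 0
    dens = norm.pdf(x, loc=mu, scale=np.exp(log_sigma))
    return -np.sum(Lq(dens, q))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=100)
start = np.array([np.mean(x), np.log(np.std(x))])    # MLE as the starting point
fit = minimize(neg_lq_likelihood, start, args=(x, 0.9), method="BFGS")
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
print(mu_hat, sigma_hat)
```

With q near 1 the result is close to the MLE, as the discussion of the distortion parameter would suggest.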
Proof of Theorem 5.1. From the second-order Taylor expansion of α(xn; θ̂n) about θn∗ one can obtain

        √n (α(xn; θ̂n) − α(xn; θn∗)) / (σn α′(xn; θn∗))
(A.34)      = √n (θ̂n − θn∗)/σn + (1/(2σn)) (α″(xn; θ̃)/α′(xn; θn∗)) √n (θ̂n − θn∗)²
            = √n (θ̂n − θn∗)/σn + (1/(2σn)) (α″(xn; θn∗)/α′(xn; θn∗)) (α″(xn; θ̃)/α″(xn; θn∗)) √n (θ̂n − θn∗)²,
where θ̃ is a value between θ̂n and θn∗. We need to show that the second term in (A.34) converges to zero in probability, that is,

(A.35)  (α″(xn; θn∗)/α′(xn; θn∗)) (α″(xn; θ̃)/α″(xn; θn∗)) (σn/√n) (n(θ̂n − θn∗)²/σn²) →p 0.
Since √n (θ̂n − θn∗)/σn →D N(0, 1) and σn is upper bounded, we need

(A.36)  (α″(xn; θn∗)/(√n α′(xn; θn∗))) (α″(xn; θ̃)/α″(xn; θn∗)) →p 0.
This holds under the assumptions of the theorem. This completes the proof. □
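Theorem 5.1 justifies a delta-method standardization of plug-in tail probabilities. As a hedged illustration, assume an Exponential(θ) model with α(x; θ) = e^{−θx}, and suppose an estimate θ̂ together with its standard error is already available (for example, from an MLqE fit); the function name and inputs below are ours, not the paper's.

```python
import numpy as np
from scipy.stats import norm

# Illustration only: a first-order (delta-method) interval for a plug-in tail
# probability alpha(x; theta) = exp(-theta * x) under an assumed Exponential(theta)
# model, in the spirit of the expansion used for Theorem 5.1.

def tail_prob_interval(theta_hat, se_theta, x, level=0.95):
    alpha_hat = np.exp(-theta_hat * x)               # alpha(x; theta_hat)
    dalpha = -x * np.exp(-theta_hat * x)             # d alpha / d theta at theta_hat
    se_alpha = abs(dalpha) * se_theta                # first-order standard error
    z = norm.ppf(0.5 + level / 2.0)
    return alpha_hat - z * se_alpha, alpha_hat + z * se_alpha

print(tail_prob_interval(theta_hat=1.0, se_theta=0.07, x=3.0))
```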
Proof of Theorem 5.2. The rationale presented here is analogous to that of Theorem 5.1. From the second-order Taylor expansion of ρ(sn; θ̂n) about θn∗ one can obtain

        √n (ρ(sn; θ̂n) − ρ(sn; θn∗)) / (σn ρ′(sn; θn∗))
(A.37)      = √n (θ̂n − θn∗)/σn + (1/(2σn)) (ρ″(sn; θ̃)/ρ′(sn; θn∗)) √n (θ̂n − θn∗)²,
where θ̃ is a value between θ̂n and θn∗. The assumptions combined with Theorem 3.2 imply that the second term in (A.37) converges to 0 in probability. Hence, the central limit theorem follows from Slutsky's lemma. □
APPENDIX B: MULTIVARIATE NORMAL Np(μ, Σ). ASYMPTOTIC DISTRIBUTION OF THE MLqE OF Σ
The log-likelihood function of a multivariate normal is

(B.1)  ℓ(θ) = log f(x; μ, Σ) = −(p/2) log(2π) − (1/2) log|Σ| − (1/2)(x − μ)^T Σ^{−1}(x − μ).
Recall that the surrogate parameter is θ∗ = (μ^T, q vech^T Σ)^T. The asymptotic variance is computed as V = J^{−1}(θ∗) K(θ∗) J^{−1}(θ∗), where

(B.2)  K(θ∗) = Eθ0 [f(x; θ∗)^{2(1−q)} U(x; θ∗)^T U(x; θ∗)]
(B.3)        = c2 E^{(2)}[U(x; θ∗)^T U(x; θ∗)]

and

(B.4)  J(θ∗) = −q Eθ0 [f(x; θ∗)^{1−q} U(x; θ∗)^T U(x; θ∗)]
(B.5)        = −q c1 E^{(1)}[U(x; θ∗)^T U(x; θ∗)],
where E^{(r)} denotes expectation taken with respect to a normal with mean μ and covariance matrix [r(1 − q) + 1]^{−1} Σ, r = 1, 2, and the normalizing constant cr is

(B.6)  cr := Eθ0 [f(x; θ∗)^{r(1−q)}] = ∫ e^{−[(r(1−q)+1)/2](x−μ)^T Σ^{−1}(x−μ)} dx / [ (2π)^{rp(1−q)/2} |qΣ|^{r(1−q)/2} (2π)^{p/2} |Σ|^{1/2} ]
(B.7)      = (r(1 − q) + 1)^{−p/2} / (2πq^p |Σ|)^{r(1−q)/2}.
Note that K and J can be partitioned into block form

(B.8)  K = ( K11  K12 ; K21  K22 ),    J = ( J11  J12 ; J21  J22 ),
where K11 and J11 depend on second-order derivatives of U with respect to μ, while K22 and J22 depend on second-order derivatives with respect to vech Σ. The off-diagonal matrices K12, K21 depend on mixed derivatives of U with respect to μ and vech^T Σ. Since the mixed moments of order three are zero, one can check that K21 = K12^T = 0. Consequently, only the calculation of K11, K22, J11 and J22 is required, and the expression of the asymptotic variance is given by
(B.9)  V = ( V11  0 ; 0  V22 ) := ( J11^{−1} K11 J11^{−1}  0 ; 0  J22^{−1} K22 J22^{−1} ).
Next, we compute the entries of K and J using the approach employed by McCulloch [25] for the usual log-likelihood function. First, we use standard matrix differentiation to compute K11 and J11,

(B.10)  K11 = c2 E^{(2)}[(qΣ)^{−1}(x − μ)(x − μ)^T (qΣ)^{−1}]
(B.11)        = c2 q^{−2} [2(1 − q) + 1]^{−1} Σ^{−1}
and similarly one can obtain J11 = −c1 q^{−1}[(1 − q) + 1]^{−1} Σ^{−1}. Some straightforward algebra gives

(B.12)  V11 = J11^{−1} K11 J11^{−1} = [ (2 − q)^{2+p} / (3 − 2q)^{1+p/2} ] Σ.
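As a quick numerical companion to (B.12), the snippet below (an illustration we add here, not part of the original derivation) evaluates the scalar factor multiplying Σ; at q = 1 it equals 1, recovering the usual asymptotic variance of the sample mean.

```python
# Illustration only: the variance inflation factor of the MLqE of mu in (B.12),
# relative to the MLE (q = 1 gives factor 1).
def v11_factor(q, p):
    return (2.0 - q) ** (2 + p) / (3.0 - 2.0 * q) ** (1 + p / 2.0)

for q in (1.0, 0.98, 0.95, 0.90):
    print(q, v11_factor(q, p=2))
```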
Next, we compute V22. Let z := Σ^{−1/2}(x − μ) and use the following relationship derived by McCulloch ([25], page 682):

        E[∇vech Σ ℓ(θ)]^T [∇vech Σ ℓ(θ)]
(B.13)      = (1/4) G^T (Σ^{−1/2} ⊗ Σ^{−1/2}) { E[(z ⊗ z)(z^T ⊗ z^T)] − vec Ip vec^T Ip } (Σ^{−1/2} ⊗ Σ^{−1/2}) G.
Moreover, a result by Magnus and Neudecker ([24], page 388) shows

(B.14)  E[(z ⊗ z)(z^T ⊗ z^T)] = I_{p²} + Kp,p + vec Ip vec^T Ip,

where Kp,p denotes the commutation matrix (see Magnus and Neudecker [24]).
To compute K22 and J22, we need to evaluate (B.13) at θ∗ = (μ^T, q vech^T Σ)^T, replacing the expectation operator with cr E^{(r)}[·]. In particular,

(B.15)  { E^{(r)}[(z ⊗ z)(z^T ⊗ z^T)]|_{θ∗} − vec Ip vec^T Ip } G = [r(1 − q) + 1]^{−2} {I_{p²} + Kp,p} G = 2[r(1 − q) + 1]^{−2} G,

where the last equality follows from the fact that Kp,p G = G. Therefore,
(B.16)  K22 = 1/(4q²) c2 G^T (Σ^{−1/2} ⊗ Σ^{−1/2})
(B.17)            × { E[(z ⊗ z)(z^T ⊗ z^T)] − vec Ip vec^T Ip } (Σ^{−1/2} ⊗ Σ^{−1/2}) G
(B.18)        = 1/(4q²) c2 [(r(1 − q) + 1)^{−2} + 1] G^T (Σ^{−1} ⊗ Σ^{−1}) G
(B.19)        = 1/(4q²) [(2(1 − q) + 1)^{−2} + 1](3 − 2q)^{−p/2} / (2πq^p |Σ|)^{2−q} G^T (Σ^{−1} ⊗ Σ^{−1}) G.
A similar calculation gives

(B.20)  J22 = 1/(4q²) [(2 − q)^{−2} + 1](2 − q)^{−p/2} / (2πq^p |Σ|)^{(2−q)/2} G^T (Σ^{−1} ⊗ Σ^{−1}) G.
Finally, we assemble (B.19) and (B.20) obtaining

        V22 = J22^{−1} K22 J22^{−1}
(B.21)      = { 4q² [(3 − 2q)² + 1](2 − q)^{4+p} / [ ((2 − q)² + 1)² (3 − 2q)^{2+p/2} ] } [G^T (Σ^{−1} ⊗ Σ^{−1}) G]^{−1}.
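For readers who wish to evaluate (B.13)–(B.21) numerically, the sketch below builds the duplication matrix G and the commutation matrix Kp,p, checks the identity Kp,p G = G used above, and forms the matrix [G^T(Σ^{−1} ⊗ Σ^{−1})G]^{−1} appearing in (B.21). The helper functions and the test covariance are our own illustrative choices, not constructions from the paper.

```python
import numpy as np

# Illustration only: numerical ingredients of (B.13)-(B.21).

def duplication_matrix(p):
    # G maps vech(S) to vec(S) for a symmetric p x p matrix S.
    G = np.zeros((p * p, p * (p + 1) // 2))
    col = 0
    for j in range(p):
        for i in range(j, p):
            E = np.zeros((p, p))
            E[i, j] = E[j, i] = 1.0
            G[:, col] = E.ravel(order="F")
            col += 1
    return G

def commutation_matrix(p):
    # K_{p,p} satisfies K vec(A) = vec(A^T) for every p x p matrix A.
    K = np.zeros((p * p, p * p))
    for i in range(p):
        for j in range(p):
            K[j * p + i, i * p + j] = 1.0
    return K

p = 3
G, K = duplication_matrix(p), commutation_matrix(p)
assert np.allclose(K @ G, G)                          # the identity K_{p,p} G = G

Sigma = np.eye(p) + 0.3 * np.ones((p, p))             # an arbitrary test covariance
Sinv = np.linalg.inv(Sigma)
core = np.linalg.inv(G.T @ np.kron(Sinv, Sinv) @ G)   # [G^T(Sigma^{-1} kron Sigma^{-1})G]^{-1}
print(core.shape)                                     # (p(p+1)/2, p(p+1)/2)
```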

Acknowledgments. The authors wish to thank Tiefeng Jiang for helpful discussions. Comments from two referees, especially the one with a number of very constructive suggestions on improving the paper, are greatly appreciated.
REFERENCES
[1] A BE , S. (2003). Geometry of escort distributions. Phys. Rev. E 68 031101.
[2] ACZÉL , J. D. and DARÓCZY, Z. (1975). On measures of information and their characteriza-
tions. Math. Sci. Eng. 115. Academic Press, New York–London. MR0689178
[3] A KAIKE , H. (1973). Information theory and an extension of the likelihood principle. In
2nd International Symposium of Information Theory 267–281. Akad. Kiadó, Budapest.
MR0483125
[4] A LTUN , Y. and S MOLA , A. (2006). Unifying divergence minimization and statistical inference
via convex duality. In Learning Theory. Lecture Notes in Computer Science 4005 139–
153. Springer, Berlin. MR2280603
[5] BARRON , A., R ISSANEN , J. and Y U , B. (1998). The minimum description length principle in
coding and modeling. IEEE Trans. Inform. Theory 44 2743–2760. MR1658898
[6] BASU, A., HARRIS, I. R., HJORT, N. L. and JONES, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika 85 549–559. MR1665873
[7] B ECK , C. and S CHLÖGL , F. (1993). Thermodynamics of Chaotic Systems: An Introduction.
Cambridge Univ. Press, Cambridge. MR1237638
[8] C HOI , E., H ALL , P. and P RESNELL , B. (2000). Rendering parametric procedures more robust
by empirically tilting the model. Biometrika 87 453–465. MR1782490
[9] C OVER , T. M. and T HOMAS , J. A. (2006). Elements of Information Theory. Wiley, New York.
MR2239987
[10] DANIELS , H. E. (1997). Saddlepoint approximations in statistics (Pkg: P171-200). In Break-
throughs in Statistics (S. Kotz and N. L. Johnson, eds.) 3 177–200. Springer, New York.
MR1479201
[11] F ERGUSON , T. S. (1996). A Course in Large Sample Theory. Chapman & Hall, London.
MR1699953
[12] F ERRARI , D. and PATERLINI , S. (2007). The maximum Lq-likelihood method: An application
to extreme quantile estimation in finance. Methodol. Comput. Appl. Probab. 11 3–19.
MR2476469
[13] F IELD , C. and RONCHETTI , E. (1990). Small Sample Asymptotics. IMS, Hayward, CA.
MR1088480
[14] G ELL -M ANN , M., ED . (2004). Nonextensive Entropy, Interdisciplinary Applications. Oxford
Univ. Press, New York. MR2073730
[15] GOLDFARB, D. (1970). A family of variable-metric methods derived by variational means. Math. Comp. 24 23–26. MR0258249
[16] H AVRDA , J. and C HARVÁT, F. (1967). Quantification method of classification processes: Con-
cept of structural entropy. Kibernetika 3 30–35. MR0209067
[17] H UANG , J. Z., L IU , N., P OURAHMADI , M. and L IU , L. (2006). Covariance matrix selection
and estimation via penalised normal likelihood. Biometrika 93 85–98. MR2277742
[18] H UBER , P. J. (1981). Robust Statistics. Wiley, New York. MR0606374
[19] JAYNES , E. T. (1957). Information theory and statistical mechanics. Phys. Rev. 106 620.
MR0087305
[20] JAYNES , E. T. (1957). Information theory and statistical mechanics II. Phys. Rev. 108 171.
MR0096414
[21] K ULLBACK , S. (1959). Information Theory and Statistics. Wiley, New York. MR0103557
[22] K ULLBACK , S. and L EIBLER , R. A. (1951). On information and sufficiency. Ann. Math. Sta-
tistics 22 79–86. MR0039968
[23] L EHMANN , E. L. and C ASELLA , G. (1998). Theory of Point Estimation. Springer, New York.
MR1639875
[24] M AGNUS , J. R. and N EUDECKER , H. (1979). The commutation matrix: Some properties and
applications. Ann. Statist. 7 381–394. MR0520247
[25] M C C ULLOCH , C. E. (1982). Symmetric matrix derivatives with applications. J. Amer. Statist.
Assoc. 77 679–682. MR0675898
[26] NAUDTS , J. (2004). Estimators, escort probabilities, and phi-exponential families in statistical
physics. J. Inequal. Pure Appl. Math. 5 102. MR2112455
[27] RÉNYI, A. (1961). On measures of entropy and information. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob. 1 547–561. Univ. California Press, Berkeley. MR0132570
[28] S HANNON , C. E. (1948). A mathematical theory of communication. Bell System Tech. J. 27
379–423. MR0026286
[29] T SALLIS , C. (1988). Possible generalization of Boltzmann–Gibbs statistics. J. Statist. Phys. 52
479–487. MR0968597
[30] T SALLIS , C., M ENDES , R. S. and P LASTINO , A. R. (1998). The role of constraints within
generalized nonextensive statistics. Physica A: Statistical and Theoretical Physics 261
534–554.
[31] VAN DER VAART, A. W. (1998). Asymptotic Statistics. Cambridge Univ. Press, Cambridge.
MR1652247
[32] WANG , X., VAN E EDEN , C. and Z IDEK , J. V. (2004). Asymptotic properties of maximum
weighted likelihood estimators. J. Statist. Plann. Inference 119 37–54. MR2018449
[33] W INDHAM , M. P. (1995). Robustifying model fitting. J. Roy. Statist. Soc. Ser. B 57 599–609.
MR1341326
DIPARTIMENTO DI ECONOMIA POLITICA
UNIVERSITÀ DI MODENA E REGGIO EMILIA
VIA BERENGARIO 51
MODENA, 41100
ITALY
E-MAIL: davide.ferrari@unimore.it

SCHOOL OF STATISTICS
UNIVERSITY OF MINNESOTA
313 FORD HALL
224 CHURCH STREET S.E.
MINNEAPOLIS, MINNESOTA 55455
USA
E-MAIL: yyang@stat.umn.edu