but using a more general definition of mean. In a different direction, Havrda and
Charvát [16] proposed nonextensive entropies, sometimes referred to as q-order
entropies, where the usual definition of mean is maintained while the logarithm is
replaced by the more general function $L_q(u) = (u^{1-q} - 1)/(1 - q)$ for $q > 0$. In
particular, when $q \to 1$, $L_q(u) \to \log(u)$, recovering the usual Shannon entropy.
In recent years, q-order entropies have been of considerable interest in different
domains of application. Tsallis and colleagues have successfully exploited them in
physics (see, e.g., [29] and [30]). In thermodynamics, the q-entropy functional is
usually minimized subject to some properly chosen constraints, according to the
formalism proposed by Jaynes [19] and [20]. There is a large literature on analyz-
ing various loss functions as the convex dual of entropy minimization, subject to
constraints. From this standpoint, the classical maximum entropy estimation and
maximum likelihood are seen as convex duals of each other (see, e.g., Altun and
Smola [4]). Since Tsallis’ seminal paper [29], q-order entropy has encountered
an increasing wave of success and Tsallis’ nonextensive thermodynamics, based
on such information measure, is nowadays considered the most viable candidate
for generalizing the ideas of the famous Boltzmann–Gibbs theory. More recently,
a number of applications based on the q-entropy have appeared in other disciplines
such as finance, biomedical sciences, environmental sciences and linguistics [14].
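As a quick numerical illustration of the deformed logarithm defined above, the following minimal sketch (the function name is ours) checks that $L_q(u)$ approaches $\log(u)$ as $q \to 1$:

    import numpy as np

    def Lq(u, q):
        # Deformed logarithm L_q(u) = (u**(1-q) - 1) / (1 - q); equals log(u) at q = 1.
        if q == 1.0:
            return np.log(u)
        return (u ** (1.0 - q) - 1.0) / (1.0 - q)

    # As q -> 1 the deformed logarithm recovers the ordinary logarithm.
    for q in (0.5, 0.9, 0.99, 0.999):
        print(q, Lq(2.0, q), np.log(2.0))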
Despite the broad success, so far little effort has been made to address the infer-
ential implications of using nonextensive entropies from a statistical perspective.
In this paper, we study a new class of parametric estimators based on the q-entropy
function, the maximum Lq-likelihood estimator (MLqE). In our approach, the role
of the observations is modified by slightly changing the model of reference by
means of the distortion parameter q. From this standpoint, Lq-likelihood estima-
tion can be regarded as the minimization of the discrepancy between a distribution
in a family and one that modifies the true distribution to diminish (or emphasize)
the role of extreme observations.
In this framework, we provide theoretical insights concerning the statistical us-
age of the generalized entropy function. In particular, we highlight the role of the
distortion parameter q and give the conditions that guarantee asymptotic efficiency
of the MLqE. Further, the new methodology is shown to be very useful when es-
timating high-dimensional parameters and small tail probabilities. This aspect is
important in many applications where we must deal with the fact that the number
of observations available is not large in relation to the number of parameters or
the probability of occurrence of the event of interest. Standard large sample theory
guarantees that the maximum likelihood estimator (MLE) is asymptotically effi-
cient, meaning that when the sample size is large, the MLE is at least as accurate
as any other estimator. However, for a moderate or small sample size, it turns out
that the MLqE can offer a dramatic improvement in terms of mean squared error
at the expense of a slightly increased bias, as will be seen in our numerical results.
For finite sample performance of MLqE, not only the size of qn − 1 but also its
sign (i.e., the direction of distortion) is important. It turns out that for different fam-
ilies or different parametric functions of the same family, the beneficial direction
of distortion can be different. In addition, for some parameters, MLqE does not
produce any improvement. We have found that an asymptotic variance expression
of the MLqE is very helpful to decide the direction of distortion for applications.
The paper is organized as follows. In Section 2, we examine some information-
theoretical quantities and introduce the MLqE; in Section 3, we present its basic
asymptotic properties for exponential families. In particular, a necessary and suf-
ficient condition on the choice of q in terms of the sample size to ensure a proper
asymptotic normality and efficiency is established. A generalization that goes out
of the exponential family is presented in Section 4. In Section 5, we consider the
plug-in approach for tail probability estimation based on MLqE. The asymptotic
properties of the plug-in estimator are derived and its efficiency is compared to
the traditional MLE. In Section 6, we discuss the choice of the distortion parame-
ter q. In Section 7, we present Monte Carlo simulations and examine the behavior
of MLqE in finite sample situations. In Section 8, concluding remarks are given.
Technical proofs of the theorems are deferred to Appendix A.
The transformed density $f(x;\theta)^{(r)}$ is often referred to as the zooming or escort distri-
bution [1, 7, 26] and the parameter r provides a tool to accentuate different regions
of the untransformed true density f (x; θ ). In particular, when r < 1, regions with
density values close to zero are accentuated, while for r > 1, regions with density
values further from zero are emphasized.
Consider the following KL divergence between $f(x;\theta)$ and $f(x;\theta_0)^{(r)}$:
$$D_r(\theta_0\,\|\,\theta) = \int f(x;\theta_0)^{(r)} \log\frac{f(x;\theta_0)^{(r)}}{f(x;\theta)}\, d\mu(x). \tag{2.4}$$
Let $\theta^*$ be the value such that $f(x;\theta^*) = f(x;\theta_0)^{(r)}$ and assume that differentiation can be passed under the integral sign. Then, clearly $\theta^*$ minimizes $D_r(\theta_0\,\|\,\theta)$
over θ. Let $\theta^{**}$ be the value such that $f(x;\theta^{**}) = f(x;\theta_0)^{(1/q)}$, $q > 0$. Since we
have $\nabla_\theta H_q(\theta_0,\theta)|_{\theta^{**}} = 0$ and $\nabla_\theta^2 H_q(\theta_0,\theta)|_{\theta^{**}}$ positive definite, $H_q(\theta_0,\theta)$ has
a minimum at $\theta^{**}$.
The derivations above show that the minimizer of $D_r(\theta_0\,\|\,\theta)$ over θ is the same as the
minimizer of $H_q(\theta_0,\theta)$ over θ when $q = 1/r$. Clearly, by considering the diver-
gence with respect to a distorted version of the true density we introduce a certain
amount of bias. Nevertheless, the bias can be properly controlled by an adequate
choice of the distortion parameter q, and later we shall discuss the benefits gained
from paying such a price for parameter estimation. The next definition introduces
the estimator based on the empirical version of the q-entropy.
depending on q < 1 or q > 1. In the case q = 1, all the observations receive the
same weight.
The strategy of setting weights that are proportional to a power transforma-
tion of the assumed density has some connections with the methods proposed by
Windham [33], Basu et al. [6] and Choi, Hall and Presnell [8]. In these approaches,
however, the main objective is robust estimation and the weights are set based on
a fixed constant not depending on the sample size.
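To make the weighting concrete, here is a minimal sketch (function and variable names are ours, not the authors') of the MLq estimating equation for the rate of an exponential sample: the $L_q$-likelihood equation reduces to a weighted score with weights $w_i \propto f(x_i;\lambda)^{1-q}$, which suggests the fixed-point iteration below.

    import numpy as np

    def mlqe_exponential_rate(x, q, tol=1e-10, max_iter=500):
        # The Lq-likelihood equation for f(x; lam) = lam * exp(-lam * x) is
        # sum_i w_i * (1/lam - x_i) = 0 with w_i = f(x_i; lam)**(1 - q),
        # i.e. lam = sum(w) / sum(w * x); iterate this map starting at the MLE.
        lam = 1.0 / np.mean(x)  # MLE as starting value; q = 1 reproduces it exactly
        for _ in range(max_iter):
            w = (lam * np.exp(-lam * x)) ** (1.0 - q)  # weights proportional to f^(1-q)
            lam_new = w.sum() / (w * x).sum()
            if abs(lam_new - lam) < tol:
                break
            lam = lam_new
        return lam

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=30)  # true rate lambda_0 = 1
    print(mlqe_exponential_rate(x, q=0.9), 1.0 / x.mean())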
THEOREM 3.1. Under assumptions A.1 and A.2, with probability going to 1,
the $L_q$-likelihood equation yields a unique solution $\tilde\theta_n$ that is the maximizer of the
$L_q$-likelihood function in $\Theta$. Furthermore, we have $\tilde\theta_n \xrightarrow{P} \theta_0$.

REMARK. When $\Theta$ is compact, the MLqE always exists under our conditions,
although it is not necessarily unique with probability one.
Let $m(\theta) := \nabla_\theta A(\theta)$ and $D(\theta) := \nabla_\theta^2 A(\theta)$. Note that $K_n$ and $J_n$ can be expressed as
$$K_n = c_{2,n}D(\theta_{2,n}) + [m(\theta_{2,n}) - m(\theta_n^*)][m(\theta_{2,n}) - m(\theta_n^*)]^T \tag{3.6}$$
and
$$J_n = c_{1,n}(1-q_n)D(\theta_{1,n}) - c_{1,n}D(\theta_n^*) + c_{1,n}(1-q_n)[m(\theta_{1,n}) - m(\theta_n^*)][m(\theta_{1,n}) - m(\theta_n^*)]^T, \tag{3.7}$$
where $c_{k,n} = \exp\{A(\theta_{k,n}) - A(\theta_0)\}$ and $\theta_{k,n} = k\theta_0(1/q_n - 1) + \theta_0$. When $q_n \to 1$,
it is seen that $V_n \to D(\theta_0)^{-1}$, the asymptotic variance of the MLE. When $\Theta \subseteq
\mathbb{R}^1$ we use the notation $\sigma_n^2$ for the asymptotic variance in place of $V_n$. Note that
the existence of moments is ensured by the functional form of the exponential
families (e.g., see [23]).
REMARK 4.2. (i) Although for large n the $L_q$-likelihood equation has a
unique zero with high probability, for finite samples there may be roots that are
actually bad estimates. (ii) The uniform convergence in condition B.3 is satisfied if
the set of functions $\{U(x,\theta) : \theta\in\Theta\}$ is Glivenko–Cantelli under the true parameter $\theta_0$ (see, e.g., [31], Chapter 19.2). In particular, it suffices to require (i) that $U(x;\theta)$ is
continuous in θ for every x and dominated by an integrable function and (ii) compactness of $\Theta$.
THEOREM 5.1. Let $\theta_n^*$ be as in the previous section. Under assumptions A.1
and A.2, if $n^{-1/2}|\gamma(x_n;\theta_n^*)|\,\beta(x_n;\theta_n^*;\delta) \to 0$ for each $\delta > 0$, then
$$\sqrt{n}\,\frac{\alpha(x_n;\tilde\theta_n) - \alpha(x_n;\theta_n^*)}{\sigma_n\,\alpha'(x_n;\theta_n^*)} \xrightarrow{D} N(0,1),$$
where $\sigma_n = -[E_{\theta_0} U^*(X;\theta_n^*)^2]^{1/2}\big/E_{\theta_0}[\partial U^*(X;\theta,q_n)/\partial\theta\,|_{\theta_n^*}]$.
REMARKS. (i) The main requirement of the theorem, on the order of the
sequence $x_n$, is easiest to verify on a case-by-case basis. For instance, in the
case of the exponential distribution in (A.4), for $x_n > 0$,
$$\beta(x_n;\lambda_n^*;\delta) = \sup_{\lambda\in\lambda_n^*\pm\delta/\sqrt{n}} \frac{e^{-x_n\lambda}x_n^2}{e^{-x_n\lambda_n^*}x_n^2} \le \sup_{\lambda\in\lambda_n^*\pm\delta/\sqrt{n}} e^{x_n|\lambda-\lambda_n^*|} = e^{\delta x_n/\sqrt{n}}.$$
Moreover, $\gamma(x_n;\lambda_n^*) = -x_n$. So, the condition reads $n^{-1/2}x_n e^{\delta x_n/\sqrt{n}} \to 0$, that is,
$n^{-1/2}x_n \to 0$. (ii) The plug-in estimator based on $q_n\tilde\theta_n$ has been examined as well.
With $q_n \to 1$, we did not find any significant advantage.
The condition n−1/2 |γ (xn ; θn∗ )|β(xn ; θn∗ ; δ) → 0, to some degree, describes the
interplay between the sample size n, xn and qn for the asymptotic normality to
hold. When xn → ∞ too fast so as to violate the condition, the asymptotic nor-
mality is not guaranteed, which indicates the extreme difficulty in estimating a
tiny tail probability. In the next section, we will use this framework to compare
the MLqE of the tail probability, $\alpha(x_n;\tilde\theta_n)$, with the one based on the traditional
MLE, $\alpha(x_n;\hat\theta_n)$.
In many applications, the quantity of interest is the quantile instead of the tail
probability. In our setting, the quantile function is defined as $\rho(s;\theta) = \alpha^{-1}(s;\theta)$,
$0 < s < 1$ and $\theta\in\Theta$. Next, we present the analogue of Theorem 5.1 for the plug-in
estimator of the quantile. Define
$$\beta_1(s;\theta^*;\delta) = \sup_{\theta\in\Theta\cap[\theta^*-\delta/\sqrt{n},\,\theta^*+\delta/\sqrt{n}]} \frac{\rho'(s;\theta)}{\rho'(s;\theta^*)}, \qquad \delta > 0. \tag{5.2}$$
It can be easily verified that the definition does not depend on the specific choice
of $a_n$, $b_n$, $\sigma_n$ and $\tau_n$ among equivalent expressions.
The result, which follows directly from Theorem 5.1 and Definition 5.1, says
that when qn is chosen sufficiently close to 1, asymptotically speaking, the MLqE
is as efficient as the MLE.
EXAMPLE 5.1 (Continued). In this case, we have $\alpha(x_n;\lambda) = e^{-\lambda x_n}$ and
$\alpha'(x_n;\lambda) = -x_n e^{-\lambda x_n}$. For sequences $x_n$ and $q_n$ such that $x_n/\sqrt{n} \to 0$ and
$(q_n - 1)\sqrt{n} \to 0$, we have that
$$\sqrt{n}\,\frac{e^{-\tilde\lambda_n x_n} - e^{-(\lambda_0/q_n)x_n}}{\lambda_0 x_n e^{-(\lambda_0/q_n)x_n}} \xrightarrow{D} N(0,1). \tag{5.5}$$
When $q_n = 1$ for all n, we recover the usual plug-in estimator based on the MLE. With
the asymptotic expressions given above, the ratio of the asymptotic mean squared error of $\alpha(x;\tilde\lambda_n)$ to that of $\alpha(x;\hat\lambda_n)$ is
$$\frac{n}{\lambda_0^2 x_n^2}\bigl(e^{-x_n(\lambda_0/q_n-\lambda_0)} - 1\bigr)^2 + e^{-2x_n(\lambda_0/q_n-\lambda_0)}, \tag{5.6}$$
which is greater than 1 when $q_n > 1$. Thus, no advantage in terms of MSE is
expected by considering $q_n > 1$ (which introduces bias and enlarges the variance
at the same time).
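To see how the distortion parameter moves this ratio, the short sketch below (ours; it simply evaluates the ratio in (5.6) as written above) computes the asymptotic MSE ratio for a few values of $q_n$; values below 1 favor the MLq plug-in.

    import numpy as np

    def amse_ratio(n, x_n, q_n, lam0=1.0):
        # Asymptotic MSE ratio of the MLq plug-in over the ML plug-in, as in (5.6).
        a = x_n * (lam0 / q_n - lam0)
        return (n / (lam0 * x_n) ** 2) * (np.exp(-a) - 1.0) ** 2 + np.exp(-2.0 * a)

    for q in (0.90, 0.95, 0.99, 1.00, 1.05):
        print(q, amse_ratio(n=50, x_n=5.0, q_n=q))

With n = 50 and $x_n = 5$, values of $q_n$ slightly below 1 give a ratio below 1, while $q_n > 1$ gives a ratio above 1, in line with the remark following (5.6).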
Although in the limit the MLqE is not more efficient than the MLE, the MLqE can be much
better than the MLE due to variance reduction, as will be clearly seen in Section 7. The
following calculation provides a heuristic understanding. Let $r_n = 1 - 1/q_n$. Add
and subtract 1 in (5.6), obtaining
$$\frac{n r_n^2\, L_{1/q_n}(e^{-x_n\lambda_0})^2}{\lambda_0^2 x_n^2} + r_n L_{1/q_n}(e^{2x_n\lambda_0}) + 1 < n r_n^2 + 2 r_n x_n\lambda_0 + 1, \tag{5.7}$$
where the last inequality holds since $L_{1/q_n}(u) < \log(u)$ for any $u > 0$ and $q_n < 1$.
Next, we impose the right-hand side of (5.7) to be smaller than 1 and solve for $q_n$, obtaining
$$T_n := \left(1 + \frac{2\lambda_0 x_n}{n}\right)^{-1} < q_n < 1. \tag{5.8}$$
This provides some insight into the choice of the sequence $q_n$ in accordance with the
size of the probability to be estimated. If $q_n$ approaches 1 too quickly from below,
the gain obtained in terms of variance vanishes rapidly as n becomes larger. On the
other hand, if qn converges to 1 too slowly, the bias dominates the variance and the
MLE outperforms the MLqE. This understanding is confirmed in our simulation
study.
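The bound $T_n$ in (5.8) is easy to tabulate; the following small sketch (ours) shows how the admissible window $(T_n, 1)$ for $q_n$ shrinks toward 1 as n grows, for a fixed quantile $x_n$ and $\lambda_0 = 1$:

    def qn_lower_bound(n, x_n, lam0=1.0):
        # Lower bound T_n from (5.8): variance gains require T_n < q_n < 1.
        return 1.0 / (1.0 + 2.0 * lam0 * x_n / n)

    for n in (25, 50, 100, 500):
        print(n, qn_lower_bound(n, x_n=5.0))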
where $\hat\lambda$ is the MLE. This will also be used in some of our simulation studies.
In general, unlike in the above example, closed-form expressions of the asymp-
totic mean squared error are not available, which calls for more work on this issue.
In the literature on applications of nonextensive entropy, although some discussions
on choosing q have been offered, often from physical considerations, it is
unclear how to choose it from a statistical perspective. In particular, the direction of
distortion (i.e., q > 1 or q < 1) needs to be decided. We offer the following obser-
vations and thoughts:
1. For estimating the parameters in an exponential family, although $|q_n - 1| = o(n^{-1/2})$
guarantees the right asymptotic normality (i.e., asymptotic normality centered
around $\theta_0$), one direction of distortion typically reduces the variance of estimation
and consequently improves the MSE. In the exponential distribution case,
$q_n$ needs to be slightly greater than 1, but for estimating the covariance matrix
7. Monte Carlo results. In this section, the performance of the MLqE in fi-
nite samples is explored via simulations. Our study includes (i) an assessment of
the accuracy for tail probability estimation and reliability of confidence intervals
and (ii) an assessment of the performance of MLqE for estimating multidimen-
sional parameters, including regression settings with generalized linear models.
The standard MLE is used as a benchmark throughout the study.
In this section, we present both deterministic and data-driven approaches on
choosing qn . First, deterministic choices are used to explore the possible advan-
tage of the MLqE for tail probability estimation with qn approaching 1 fast when
x is fixed and qn approaching 1 slowly when x increases with n. Then, the data-
driven choice in Section 6 is applied. For multivariate normal and GLM families,
where estimation of the MSE or prediction error becomes analytically cumber-
some, we choose $q_n = 1 - 1/n$, which satisfies the condition $1 - q_n = o(n^{-1/2})$ needed
for asymptotic normality around $\theta_0$. In all considered cases, the numerical solution of
(2.7) is found using a variable metric algorithm (see, e.g., Goldfarb [15]), where the
ML solution is chosen as the starting value.
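As an illustration of this computational recipe (a sketch under our own naming and model choice, not the authors' code), the snippet below maximizes the $L_q$-likelihood with SciPy's BFGS variable metric routine, starting from the ML solution and using $q_n = 1 - 1/n$, for a gamma model with unknown shape:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import gamma

    def Lq(u, q):
        # Deformed logarithm used by the Lq-likelihood.
        return np.log(u) if q == 1.0 else (u ** (1.0 - q) - 1.0) / (1.0 - q)

    # Gamma observations with known scale and unknown shape a = exp(par),
    # parametrized on the log scale to keep the shape positive.
    rng = np.random.default_rng(1)
    x = rng.gamma(shape=2.0, scale=1.0, size=40)
    density = lambda par: gamma.pdf(x, a=np.exp(par[0]), scale=1.0)

    # ML fit first, then the Lq fit with q_n = 1 - 1/n started at the ML solution.
    ml = minimize(lambda p: -np.sum(np.log(density(p))), x0=[np.log(2.0)], method="BFGS")
    q_n = 1.0 - 1.0 / len(x)
    mlq = minimize(lambda p: -np.sum(Lq(density(p), q_n)), x0=ml.x, method="BFGS")
    print(np.exp(ml.x[0]), np.exp(mlq.x[0]))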
7.1. Mean squared error: role of the distortion parameter q. In the first group
of simulations, we compare the estimators of the true tail probability α = α(x; λ0 ),
obtained via the MLq method and the traditional maximum likelihood approach.
Particularly, we are interested in assessing the relative performance of the two
estimators for different choices of the sample size by taking the ratio between the
two mean squared errors, $\mathrm{MSE}(\hat\alpha_n)/\mathrm{MSE}(\tilde\alpha_n)$. The simulations are structured as
follows: (i) For any given sample size n ≥ 2, a number B = 10,000 of Monte Carlo
samples X1 , . . . , Xn is generated from an exponential distribution with parameter
$\lambda_0 = 1$. (ii) For each sample, the MLq and ML estimates of α, respectively $\tilde\alpha_{n,k} =
\alpha(x;\tilde\lambda_{n,k})$ and $\hat\alpha_{n,k} = \alpha(x;\hat\lambda_{n,k})$, $k = 1,\ldots,B$, are obtained. (iii) For each sample
size n, the relative performance between the two estimators is evaluated by the
ratio $R_n = \mathrm{MSE}_{MC}(\hat\alpha_n)/\mathrm{MSE}_{MC}(\tilde\alpha_n)$, where $\mathrm{MSE}_{MC}$ denotes the Monte Carlo
estimate of the mean squared error. In addition, let $\bar y_1 = B^{-1}\sum_{k=1}^B(\hat\alpha_{n,k}-\alpha)^2$
and $\bar y_2 = B^{-1}\sum_{k=1}^B(\tilde\alpha_{n,k}-\alpha)^2$. By the central limit theorem, for large values
of B, $\bar y = (\bar y_1, \bar y_2)$ approximately has a bivariate normal distribution with mean
$(\mathrm{MSE}(\hat\alpha_n), \mathrm{MSE}(\tilde\alpha_n))$ and a certain covariance matrix $\Gamma$. Thus, the standard error
for $R_n$ can be computed by the delta method [11] as
$$\mathrm{se}(R_n) = B^{-1/2}\left[\frac{\gamma_{11}}{\bar y_2^2} - 2\gamma_{12}\frac{\bar y_1}{\bar y_2^3} + \gamma_{22}\frac{\bar y_1^2}{\bar y_2^4}\right]^{1/2},$$
where $\gamma_{11}$, $\gamma_{22}$ and $\gamma_{12}$ denote, respectively, the Monte Carlo estimates of the
components of the covariance matrix $\Gamma$.
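For reference, a compact sketch (names ours) of this Monte Carlo summary: given the B plug-in estimates from the ML and MLq fits, it returns the ratio $R_n$ together with its delta-method standard error as in the display above.

    import numpy as np

    def mc_mse_ratio(alpha_hat, alpha_tilde, alpha_true):
        # R_n = MSE_MC(ML plug-in) / MSE_MC(MLq plug-in) and its delta-method s.e.
        e1 = (alpha_hat - alpha_true) ** 2    # squared errors, ML plug-in
        e2 = (alpha_tilde - alpha_true) ** 2  # squared errors, MLq plug-in
        B = len(e1)
        y1, y2 = e1.mean(), e2.mean()
        g11, g22 = e1.var(ddof=1), e2.var(ddof=1)
        g12 = np.cov(e1, e2, ddof=1)[0, 1]
        se = np.sqrt((g11 / y2**2 - 2 * g12 * y1 / y2**3 + g22 * y1**2 / y2**4) / B)
        return y1 / y2, se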
Case 1: fixed α and q. Figure 1 illustrates the behavior of $R_n$ for several choices
of the sample size. In general, we observe that for relatively small sample sizes,
$R_n > 1$ and the MLqE clearly outperforms the traditional MLE. Such a behavior is
much more accentuated for smaller values of the tail probability to be estimated. In
contrast, when the sample size is larger, the bias component plays an increasingly
relevant role and eventually we observe that $R_n < 1$. This case is presented in
Figure 1(a) for values of the true tail probability α = 0.01, 0.005, 0.003 and a fixed
distortion parameter q = 0.5. Moreover, the results presented in Figure 1(b) show
FIG. 1. Monte Carlo mean squared error ratio computed from B = 10,000 samples of size n. In (a)
we use a fixed distortion parameter q = 0.5 and true tail probability α = 0.01, 0.005, 0.003. The
dashed lines represent 99% confidence bands. In (b) we set α = 0.003 and use q1 = 0.65, q2 = 0.85
and q3 = 0.95. The dashed lines represent 90% confidence bands.
FIG. 2. (a) Monte Carlo mean squared error ratio computed from B = 10,000 samples of size n, for
different values of the true probability: α1 = 0.01, α2 = 0.005 and α3 = 0.003. The distortion parameter
is computed as $q_n = [1/2 + e^{0.3(n-20)}]/[1 + e^{0.3(n-20)}]$. (b) Monte Carlo mean squared error
ratio computed from B = 10,000 samples of size n. We use sequences $q_n = 1 - [10\log(n+10)]^{-1}$
and $x_n = n^{1/(2+\delta)}$ (δ1 = 0.5, δ2 = 1.0 and δ3 = 1.5). The dashed lines represent 99% confidence
bands.
that smaller values of the distortion parameter q accentuate the benefits attainable
in a small sample situation.
Case 2: fixed α and $q_n \nearrow 1$. In the second experimental setting, illustrated in
Figure 2(a), the tail probability α is fixed, while we let $q_n$ be a sequence such
that $q_n \to 1$ and $0 < q_n < 1$. For illustrative purposes we choose the sequence
$q_n = [1/2 + e^{0.3(n-20)}]/[1 + e^{0.3(n-20)}]$, $n \ge 2$, and study $R_n$ for different choices
of the true tail probability to be estimated. For small values of the sample size, the
chosen sequence qn converges relatively slowly to 1 and the distortion parameter
produces benefits in terms of variance. In contrast, when the sample size becomes
larger, qn adjusts quickly to one. As a consequence, for large samples the MLqE
exhibits the same behavior shown by the traditional MLE.
Case 3: $\alpha_n \searrow 0$ and $q_n \nearrow 1$. The last experimental setting of this subsection
examines the case where both the true tail probability and the distortion parameter
change depending on the sample size. We consider sequences of distortion parameters
converging slowly relative to the sequence of quantiles $x_n$. In particular we
set $q_n = 1 - [10\log(n+10)]^{-1}$ and $x_n = n^{1/(2+\delta)}$. In the simulation described in
Figure 2(b), we illustrate the behavior of the estimator for δ = 0.5, 1.0 and 1.5,
confirming the theoretical findings discussed in Section 5.
7.2. Asymptotic and bootstrap confidence intervals. The main objective of the
simulations presented in this subsection is twofold: (a) to study the reliability of
MLqE based confidence intervals constructed using three commonly used meth-
ods: asymptotic normality, parametric bootstraps and nonparametric bootstraps;
TABLE 1
MC means and standard deviations of estimators of α, along with the MC mean of the standard
error computed using: (i) asymptotic normality, (ii) nonparametric bootstrap and (iii) parametric bootstrap.
The true tail probability is α = 0.01 and q = 1 corresponds to the MLE
(b) to compare the results with those obtained using MLE. The structure of sim-
ulations is similar to that of Section 7.1, but a data-driven choice of qn is used.
(i) For each sample, first we compute $\hat\lambda_n$, the MLE of $\lambda_0$. We substitute $\hat\lambda_n$ in (6.1)
and solve it numerically in order to obtain $q^*$ as described there. (ii) For each sample,
the MLq and ML estimates of the tail probability α are obtained. The standard
errors of the estimates are computed using three different methods: the asymptotic
formula derived in (5.5), nonparametric bootstrap and parametric bootstrap. The
number of replicates employed in bootstrap re-sampling is 500. We construct 95%
bootstrap confidence intervals based on the bootstrap quantiles and check the cov-
erage of the true value α.
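A minimal sketch (ours) of the parametric-bootstrap interval for the tail probability under the exponential model; the same routine can be fed either the ML rate estimator or an MLq rate estimator:

    import numpy as np

    def parametric_bootstrap_ci(x, x0, estimate_rate, level=0.95, B=500, seed=0):
        # Percentile bootstrap CI for alpha = exp(-lambda * x0): refit the rate on
        # samples drawn from the fitted exponential model and take quantiles.
        rng = np.random.default_rng(seed)
        lam = estimate_rate(x)
        boot = np.empty(B)
        for b in range(B):
            xb = rng.exponential(scale=1.0 / lam, size=len(x))
            boot[b] = np.exp(-estimate_rate(xb) * x0)
        lo, hi = np.quantile(boot, [(1 - level) / 2, (1 + level) / 2])
        return lo, hi

    # Usage, e.g. with the MLE of the rate:
    # parametric_bootstrap_ci(x, x0=4.6, estimate_rate=lambda s: 1.0 / s.mean())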
In Table 1, we show the Monte Carlo means of $\tilde\alpha_n$ and $\hat\alpha_n$, their standard deviations
and the standard errors computed with the three methods described above. In
tions and the standard errors computed with the three methods described above. In
addition, we report the Monte Carlo average of the estimates of optimal distortion
parameter q ∗ . When q ∗ = 1, the results refer to the MLE case. Not surprisingly,
q ∗ approaches 1 as the sample size increases. When the sample size is small, the
MLqE has a smaller standard deviation and better performance. When n is larger,
the advantage of MLqE diminishes. As far as the standard errors are concerned, the
asymptotic method and the parametric bootstrap seem to provide values somewhat
closer to the Monte Carlo standard deviation for the considered sample sizes.
In Table 2, we compare the accuracy of 95% confidence intervals and report the
relative length of the intervals for MLqE over those for MLE. Although the cover-
age probability for MLqE is slightly smaller than that of MLE (in the order of 1%),
we observe a substantial reduction in the interval length for all of the considered
cases. The most evident benefits occur when the sample size is small. Furthermore,
in general, the intervals computed via parametric bootstrap outperform the other
two methods in terms of coverage and length.
TABLE 2
MC coverage rate of 95% confidence intervals for α, computed using (i) asymptotic normality,
(ii) nonparametric bootstrap and (iii) parametric bootstrap. RL is the length of the intervals for the MLqE over that for the
MLE. The true tail probability is α = 0.01 and q = 1 corresponds to the MLE
where $\tilde\Sigma_q$ represents the MLq estimate of $\Sigma$ with $q = 1 - 1/n$. Note that the loss
is 0 when $\Sigma = \tilde\Sigma_q$ and is positive otherwise. Moreover, the loss is invariant under the
transformations $A\Sigma A^T$ and $A\tilde\Sigma_q A^T$ for a nonsingular matrix A. The use of such a
loss function is common in the literature (e.g., Huang et al. [17]).
In Table 3, we show simulation results for moderate or small sample sizes ranging
from 10 to 100 for various dimensions of the covariance matrix $\Sigma$. The entries
in the table represent the Monte Carlo mean of the loss for $\hat\Sigma_1$ over that for $\tilde\Sigma_q$,
where $\hat\Sigma_1$ is the usual ML estimate multiplied by the correction factor $n/(n-1)$.
The standard error of the ratio is computed via the delta method. Clearly, the
MLqE performs well for smaller sample sizes. Interestingly, the squared error for
the MLqE reduces dramatically compared to that of the MLE as the dimension
increases. Remarkably, when p = 8 the gain in accuracy persists even for larger
sample sizes, ranging from about 22% to 84%. We tried various structures of $\Sigma$
and obtained performances comparable to the ones presented. For μ we found
that MLqE performs nearly identically to MLE for all choices of p and n, which
is not surprising given the findings in Section 3. For brevity we omit the results on μ.
TABLE 3
Monte Carlo mean of the loss for $\hat\Sigma_1$ over that for $\tilde\Sigma_q$, with standard errors in parentheses
(rows: sample size n; columns: dimension p = 1, 2, 4, 8)
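For completeness, here is a minimal sketch (our own naming; an illustration, not the authors' implementation) of the MLqE of $(\mu, \Sigma)$ for multivariate normal data, obtained by iterating the weighted-score fixed point with weights proportional to $f(x_i;\mu,\Sigma)^{1-q}$; setting q = 1 returns the usual ML estimates.

    import numpy as np
    from scipy.stats import multivariate_normal

    def mlqe_mvnormal(x, q, n_iter=200, tol=1e-9):
        # Weighted-score fixed point: mu is the weighted mean and Sigma the weighted
        # scatter matrix, with weights w_i proportional to f(x_i; mu, Sigma)**(1-q).
        n, p = x.shape
        mu = x.mean(axis=0)                          # ML starting values
        sigma = np.cov(x, rowvar=False, bias=True)
        for _ in range(n_iter):
            w = multivariate_normal.pdf(x, mean=mu, cov=sigma) ** (1.0 - q)
            w = w / w.sum()
            mu_new = w @ x
            xc = x - mu_new
            sigma_new = (w[:, None] * xc).T @ xc
            done = np.max(np.abs(sigma_new - sigma)) < tol and np.max(np.abs(mu_new - mu)) < tol
            mu, sigma = mu_new, sigma_new
            if done:
                break
        return mu, sigma

    # Example with q = 1 - 1/n, as in the simulations described above.
    rng = np.random.default_rng(2)
    x = rng.multivariate_normal(np.zeros(4), np.eye(4), size=30)
    mu_q, sigma_q = mlqe_mvnormal(x, q=1.0 - 1.0 / 30)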
where $\tilde\beta_{q,n}$ is the MLqE of β. In Table 4 we present the prediction error for various
choices of n and p. For both models, the MLqE outperforms the classic MLE
for all considered cases. The benefits from MLqE can be remarkable when the
dimension of the parameter space is larger. This is particularly evident in the case
of the exponential regression, where the prediction error of MLE is at least twice
TABLE 4
Monte Carlo mean of PE1 over that of PEq for exponential and logistic regression,
with standard errors in parentheses (rows: dimension p; columns: sample size n)

p \ n       25              50              100             250
Exp. regression
 2      2.549 (0.003)   2.410 (0.002)   2.500 (0.003)   2.534 (0.003)
 4      2.469 (0.002)   2.392 (0.002)   2.543 (0.002)   2.493 (0.002)
 8      4.262 (0.012)   2.941 (0.004)   3.547 (0.006)   3.582 (0.006)
12      9.295 (0.120)   3.644 (0.008)   3.322 (0.005)   5.259 (0.027)
Logistic regression
 2      1.156 (0.006)   1.329 (0.006)   1.205 (0.003)   1.385 (0.003)
 4      1.484 (0.022)   1.141 (0.003)   1.502 (0.007)   1.353 (0.003)
 8      1.178 (0.008)   1.132 (0.003)   1.290 (0.004)   1.300 (0.002)
12      1.086 (0.005)   1.141 (0.003)   1.227 (0.003)   1.329 (0.002)
that of MLqE. In one case, when n = 25 and p = 12, the MLqE is about nine
times more accurate. This is mainly due to MLqE’s stabilization of the variance
component, which for the MLE tends to become large quickly when n is very small
compared to p. Although for the logistic regression we observe a similar behavior,
the gain in high dimension becomes more evident for larger n.
Besides the theoretical optimality results and often remarkably improved per-
formance over MLE, our proposed method is very practical in terms of imple-
mentability and computational efficiency. The estimating equations are simply obtained
by replacing the logarithm in the usual maximum likelihood procedure by the distorted
logarithm. Thus, the resulting optimization task can be easily formulated in terms of a
weighted version of the familiar score function, with weights proportional to the
(1 − q)th power of the assumed density. Hence, similarly to other techniques based on
re-weighting the likelihood, simple and fast algorithms for solving the MLq equations
numerically (possibly even for large problems) can be derived.
For the MLq estimators, helpful insights on their behaviors may be gained from
robust analysis. For a given q, (2.6) defines an M-estimator of the surrogate pa-
rameter θ ∗ . It seems that global robustness properties, such as a high breakdown
point, may be established for a properly chosen distortion parameter, which would
add value to the MLq methodology.
High-dimensional estimation has recently become a central theme in statistics.
The results in this work suggest that the MLq methodology may be a valuable tool
for some high-dimensional estimation problems (such as gamma regression and
covariance matrix estimation as demonstrated in this paper) as a powerful remedy
to the MLE. We believe this is an interesting direction for further exploration.
Finally, more research on the practical choices of q and their theoretical prop-
erties will be valuable. To this end, higher-order asymptotic treatment of the dis-
tribution (or moments) of the MLqE will be helpful. For instance, derivation of
saddle-point approximations of order n−3/2 , along the lines of Field and Ronchetti
[13] and Daniels [10], may be profitably used to give improved approximations of
the MSE.
APPENDIX A: PROOFS
In all of the following proofs we denote $\psi_n(\theta) := n^{-1}\sum_{i=1}^n \nabla_\theta L_{q_n}(f(X_i;\theta))$.
For exponential families, since $f(x;\theta) = e^{\theta^T b(x) - A(\theta)}$, we have
$$\psi_n(\theta) = \frac{1}{n}\sum_{i=1}^n e^{(1-q_n)(\theta^T b(X_i) - A(\theta))}\bigl[b(X_i) - m(\theta)\bigr], \tag{A.1}$$
where $m(\theta) = \nabla_\theta A(\theta)$. The MLq equation sets $\psi_n(\theta) = 0$ and solves for θ. Moreover,
we define $\varphi(x,\theta) := \theta^T b(x) - A(\theta)$, and thus $f(x;\theta) = e^{\varphi(x,\theta)}$. When clear
from the context, $\varphi(x,\theta)$ is denoted by $\varphi$.

Proof of Theorem 3.1. Define $\psi(\theta) := E_{\theta_0}\nabla_\theta\log f(X;\theta)$. Since f has the
form in (3.1), we can write $\psi(\theta) = E_{\theta_0}[b(X) - m(\theta)]$. We want to show uniform
where $t_j$ denotes the jth element of the vector $t(X_i;\theta)$. It follows that for (A.2),
it suffices to show $n^{-1}\sum_i\sup_\theta s(X_i;\theta)^2 \xrightarrow{p} 0$ and that $n^{-1}\sum_i\sup_\theta t_j(X_i;\theta)^2$ is
bounded in probability. Since $\Theta$ is compact, $\sup_\theta|m(\theta)| \le (c_1, c_1, \ldots, c_1)$ for
some positive constant $c_1 < \infty$, and we have
$$\frac{1}{n}\sum_{i=1}^n \sup_{\theta\in\Theta} t_j(X_i;\theta)^2 \le \frac{2}{n}\sum_{i=1}^n b_j(X_i)^2 + 2(c_1)^2, \tag{A.4}$$
where the last inequality follows from the basic fact that $(a-b)^2 \le 2a^2 + 2b^2$ ($a, b \in \mathbb{R}$).
The last expression in (A.4) is bounded in probability by some constant, since
$E_{\theta_0}b_j(X)^2 < \infty$ for all $j = 1,\ldots,p$. Next, note that
$$\frac{1}{n}\sum_{i=1}^n \sup_{\theta\in\Theta} s(X_i;\theta)^2 \le \frac{1}{n}\sum_{i=1}^n \sup_{\theta\in\Theta} e^{2(1-q_n)(\theta^T b(X_i)-A(\theta))} - \frac{2}{n}\sum_{i=1}^n \inf_{\theta\in\Theta} e^{(1-q_n)(\theta^T b(X_i)-A(\theta))} + 1. \tag{A.5}$$
Thus, to show $n^{-1}\sum_i\sup_\theta s(X_i;\theta)^2 \xrightarrow{p} 0$, it suffices to obtain
$n^{-1}\sum_i\sup_\theta e^{2(1-q_n)\varphi(\theta)} - 1 \xrightarrow{p} 0$ and $n^{-1}\sum_i\inf_\theta e^{(1-q_n)\varphi(\theta)} - 1 \xrightarrow{p} 0$. Actually,
since $\Theta$ is compact and $\sup_\theta e^{-A(\theta)} < c_2$ for some $c_2 < \infty$,
$$\frac{1}{n}\sum_{i=1}^n \sup_{\theta\in\Theta} e^{2(1-q_n)(\theta^T b(X_i)-A(\theta))} \le \frac{1}{n}\sum_{i=1}^n e^{2|1-q_n|(|\log c_2| + \theta^{(*)T}|b(X_i)|)}, \tag{A.6}$$
We decompose the sample space into $2^p$ subsets according to the signs of the elements
of b(x); that is, we write it as $\bigcup_{k=1}^{2^p} B_k$, where each $B_k$ corresponds to one
sign pattern of the components of b(x). Note that $\mathrm{sign}\{b(x)\}$ stays the same on each
$B_k$, $k = 1,\ldots,2^p$. Also, because $\theta_0$ is an interior point, when $|1-q_n|$ is small enough,
the integral in (A.7) on $B_k$ is finite and, by the dominated convergence theorem,
$$\int_{B_k} e^{[2r|1-q_n|\,\mathrm{sign}\{b(x)\}\theta^{(*)} + \theta_0]^T b(x)}\, d\mu(x) \xrightarrow{n\to\infty} \int_{B_k} e^{\theta_0^T b(x)}\, d\mu(x). \tag{A.10}$$
Consequently,
$$\int e^{[2r|1-q_n|\,\mathrm{sign}\{b(x)\}\theta^{(*)} + \theta_0]^T b(x) - A(\theta_0)}\, d\mu(x) \xrightarrow{n\to\infty} \int e^{\theta_0^T b(x) - A(\theta_0)}\, d\mu(x) = 1. \tag{A.11}$$
T
It follows that the mean and the variance of supθ e2(1−qn )[θ b(X)−A(θ )] converge
to 1 and 0, respectively, as n → ∞. Therefore, a straightforward application of
Chebyshev’s inequality gives
1 n
T p
(A.12) sup e2(1−qn )[θ b(Xi )−A(θ)] → 1, n → ∞.
n i=1 θ ∈
1 n
T p
(A.13) inf e(1−qn )[θ b(Xi )−A(θ)] → 1, n → ∞.
n i=1 θ ∈
p
Therefore, we have established n−1 i supθ s(Xi ; θ )2 → 0. Hence, (A.2) con-
verges to zero in probability. By applying Lemma 5.9 on page 46 in [31], we know
that with probability converging to 1, the solution of the MLq equations is unique
and it maximizes the MLqE.
zero matrix. From the calculation carried out in (A.17), one can see that $\dot\psi(\theta_n^*)$ is a
deterministic sequence such that $\dot\psi(\theta_n^*) \to \dot\psi(\theta_0) = -\nabla_\theta^2 A(\theta_0)$ as $n\to\infty$. Thus,
we have
$$|\dot\psi_n(X;\theta_n^*) - \dot\psi(\theta_0)| \le |\dot\psi_n(X;\theta_n^*) - \dot\psi(\theta_n^*)| + |\dot\psi(\theta_n^*) - \dot\psi(\theta_0)| \xrightarrow{p} 0 \tag{A.29}$$
as $n\to\infty$. Therefore, $\dot\psi(\theta_n^*)^{-1}\dot\psi_n(X,\theta_n^*) \xrightarrow{p} I_p$.
Step 3. Here, we show that the second term on the right-hand side of (A.16)
is negligible. Let $g(X;\theta)$ be an element of the array $\ddot\psi_n(X,\theta)$ of dimension
$p\times p\times p$. For some fixed $\bar\theta$ in the line segment between $\tilde\theta_n$ and $\theta_n^*$, we have the
decomposition (A.32). By the Cauchy–Schwarz inequality, the first term of that expression
is upper bounded by
$$\sum_{j=1}^p\left[\frac{1}{n}\sum_{i=1}^n \sup_{\theta\in\Theta}\bigl(f(X_i;\theta)^{1-q_n}-1\bigr)^2\right]^{1/2}\left[\frac{1}{n}\sum_{i=1}^n U_j(X_i;\theta)^2\right]^{1/2}.$$
By assumption B.2, $n^{-1}\sum_i\sup_\theta U_j(X_i;\theta)^2$ is bounded in probability. Moreover,
given $\epsilon > 0$, by Markov's inequality we have
$$P\left(n^{-1}\sum_i \sup_\theta\bigl(f(X_i;\theta)^{1-q_n}-1\bigr)^2 > \epsilon\right) \le \epsilon^{-1}\, E\sup_\theta\bigl(f(X;\theta)^{1-q_n}-1\bigr)^2,$$
which converges to zero by assumption B.2. By assumption B.3, the second summand
in (A.32) converges to zero in probability.
Proof of Theorem 4.3. By Taylor's theorem, for a solution of the MLq equation,
there exists a random point $\bar\theta$ between $\tilde\theta_n$ and $\theta_n^*$ such that
$$0 = \frac{1}{n}\sum_{i=1}^n U^*(X_i,\theta_n^*,q_n) + \frac{1}{n}\sum_{i=1}^n \nabla_\theta U^*(X_i,\theta_n^*,q_n)(\tilde\theta_n - \theta_n^*) + \frac{1}{2}(\tilde\theta_n - \theta_n^*)^T\left[\frac{1}{n}\sum_{i=1}^n \nabla_\theta^2 U^*(X_i,\bar\theta,q_n)\right](\tilde\theta_n - \theta_n^*). \tag{A.33}$$
From Theorem 4.1, we know that with probability approaching 1, $\tilde\theta_n$ is the unique
MLqE and the above equation holds. Define $Z_{n,i} := U^*(X_i;\theta_n^*,q_n)$, $i = 1,\ldots,n$,
a triangular array of i.i.d. random vectors, and let $a\in\mathbb{R}^p$ be a vector of constants.
Let $W_{n,i} := a^T Z_{n,i}$. The Lyapunov condition for ensuring asymptotic normality of
the linear combination $a^T\sum_{i=1}^n Z_{n,i}/n$ for $a\in\mathbb{R}^p$, $a \ne 0$, in this case reads
$$n^{-1/3}\bigl(E W_{n,1}^2\bigr)^{-1}\bigl(E[|W_{n,1}|^3]\bigr)^{2/3} \to 0 \qquad\text{as } n\to\infty.$$
Under C.1 and C.2, this can be easily checked. The Cramér–Wold device implies
$$C_n\,\frac{1}{n}\sum_{i=1}^n U^*(X_i,\theta_n^*,q_n) \xrightarrow{D} N_p(0, I_p),$$
where $C_n := \sqrt{n}\,\bigl[E_{\theta_0} U^*(X,\theta_n^*)U^*(X,\theta_n^*)^T\bigr]^{-1/2}$.
Next, consider the second term in (A.33). Given $\epsilon > 0$, for $k,l\in\{1,\ldots,p\}$, by
Chebyshev's inequality,
$$P\left(\left|\left\{n^{-1}\sum_{i=1}^n I^*(X_i,\theta_n^*,q_n)\right\}_{k,l} - \{J_n\}_{k,l}\right| > \epsilon\right) \le \epsilon^{-2}\, n^{-2}\sum_{i=1}^n E\{I^*(X_i,\theta_n^*,q_n)\}_{k,l}^2.$$
Thus, the right-hand side of the above expression converges to zero as $n\to\infty$
under C.3. Since convergence in probability is ensured for each k, l and $p < \infty$,
under C.2, we have that $|n^{-1}\sum_i I^*(X_i,\theta_n^*) - J_n|$ converges to the zero matrix in
probability.
Finally, $n^{-1}\sum_{i=1}^n\nabla_\theta^2 U^*(X_i,\bar\theta,q_n)$ in the third term of the expansion (A.33) is
a $p\times p\times p$ array of partial second-order derivatives. By assumption, there is a
neighborhood B of $\theta_0$ in which each entry of $\nabla_\theta^2 U^*(x,\theta,q_n)$ is dominated by
$g_0(x)$ for some $g_0(x)\ge 0$ for all $\theta\in B$. With probability tending to 1,
$$\left|n^{-1}\sum_{i=1}^n \nabla_\theta^2 U^*(X_i,\bar\theta,q_n)\right| \le p^3\, n^{-1}\sum_{i=1}^n |g_0(X_i)|,$$
which is bounded in probability by the law of large numbers. Since the third term
in the expansion (A.33) is of higher order than the second term, the normality
result follows by applying Slutsky's lemma.
$$\sqrt{n}\,\frac{\alpha(x_n;\tilde\theta_n) - \alpha(x_n;\theta_n^*)}{\sigma_n\,\alpha'(x_n;\theta_n^*)} = \frac{\sqrt{n}(\tilde\theta_n - \theta_n^*)}{\sigma_n} + \frac{1}{2\sigma_n}\,\frac{\alpha''(x_n;\theta_n^*)}{\alpha'(x_n;\theta_n^*)}\,\frac{\alpha''(x_n;\bar\theta)}{\alpha''(x_n;\theta_n^*)}\,\sqrt{n}(\tilde\theta_n - \theta_n^*)^2, \tag{A.34}$$
where $\bar\theta$ is a value between $\tilde\theta_n$ and $\theta_n^*$. We need to show that the second term in
(A.34) converges to zero in probability, that is,
$$\frac{\alpha''(x_n;\theta_n^*)}{\alpha'(x_n;\theta_n^*)}\,\frac{\alpha''(x_n;\bar\theta)}{\alpha''(x_n;\theta_n^*)}\,\frac{\sigma_n}{\sqrt{n}}\,\frac{n(\tilde\theta_n-\theta_n^*)^2}{\sigma_n^2} \xrightarrow{p} 0. \tag{A.35}$$
Since $\sqrt{n}(\tilde\theta_n-\theta_n^*)/\sigma_n \xrightarrow{D} N(0,1)$ and $\sigma_n$ is upper bounded, we need
$$\frac{\alpha''(x_n;\theta_n^*)}{\alpha'(x_n;\theta_n^*)}\,\frac{\alpha''(x_n;\bar\theta)}{\sqrt{n}\,\alpha''(x_n;\theta_n^*)} \xrightarrow{p} 0. \tag{A.36}$$
This holds under the assumptions of the theorem. This completes the proof of the
theorem.
where $\bar\theta$ is a value between $\tilde\theta_n$ and $\theta_n^*$. The assumptions combined with Theorem 3.2
imply that the second term in (A.37) converges to 0 in probability. Hence,
the central limit theorem follows from Slutsky's lemma.
$$V_{11} = J_{11}^{-1} K_{11} J_{11}^{-1} = \frac{(2-q)^{2+p}}{(3-2q)^{1+p/2}}\,\Sigma. \tag{B.12}$$
Next, we compute $V_{22}$. Let $z := \Sigma^{-1/2}(x-\mu)$ and use the following relationship
derived by McCulloch ([25], page 682):
$$E[\nabla_{\mathrm{vech}(\Sigma)}\log f(x;\theta)]^T[\nabla_{\mathrm{vech}(\Sigma)}\log f(x;\theta)] = \tfrac{1}{4}\,G^T(\Sigma^{-1/2}\otimes\Sigma^{-1/2})\bigl\{E[(z\otimes z)(z^T\otimes z^T)] - \mathrm{vec}\,I_p\,\mathrm{vec}^T I_p\bigr\}(\Sigma^{-1/2}\otimes\Sigma^{-1/2})G. \tag{B.13}$$
Moreover, a result by Magnus and Neudecker ([24], page 388) shows
$$E[(z\otimes z)(z^T\otimes z^T)] = I_{p^2} + K_{p,p} + \mathrm{vec}\,I_p\,\mathrm{vec}^T I_p, \tag{B.14}$$
where $K_{p,p}$ denotes the commutation matrix (see Magnus and Neudecker [24]).
To compute $K_{22}$ and $J_{22}$, we need to evaluate (B.13) at $\theta^* = (\mu^T, q\,\mathrm{vech}^T\Sigma)^T$,
replacing the expectation operator with $c_r E^{(r)}[\cdot]$. In particular,
$$\bigl\{E^{(r)}[(z\otimes z)(z^T\otimes z^T)]_{\theta^*} - \mathrm{vec}\,I_p\,\mathrm{vec}^T I_p\bigr\}G = \bigl[r(1-q)+1\bigr]^{-2}\{I_{p^2} + K_{p,p}\}G = 2\bigl[r(1-q)+1\bigr]^{-2}G, \tag{B.15}$$
where the last equality follows from the fact that $K_{p,p}G = G$. Therefore,
$$K_{22} = \frac{c_2}{4q^2}\,G^T(\Sigma^{-1/2}\otimes\Sigma^{-1/2}) \tag{B.16}$$
$$\qquad\times\bigl\{E[(z\otimes z)(z^T\otimes z^T)] - \mathrm{vec}\,I_p\,\mathrm{vec}^T I_p\bigr\}(\Sigma^{-1/2}\otimes\Sigma^{-1/2})G \tag{B.17}$$
$$= \frac{c_2}{4q^2}\bigl\{\bigl[r(1-q)+1\bigr]^{-2}+1\bigr\}\,G^T(\Sigma^{-1}\otimes\Sigma^{-1})G \tag{B.18}$$
$$= \frac{1}{4q^2}\,\frac{\bigl[(2(1-q)+1)^{-2}+1\bigr](3-2q)^{-p/2}}{4(2\pi q^p|\Sigma|)^{2-q}}\,G^T(\Sigma^{-1}\otimes\Sigma^{-1})G. \tag{B.19}$$
A similar calculation gives
$$J_{22} = \frac{1}{4q^2}\,\frac{\bigl[(2-q)^{-2}+1\bigr](2-q)^{-p/2}}{(2\pi q^p|\Sigma|)^{(2-q)/2}}\,G^T(\Sigma^{-1}\otimes\Sigma^{-1})G. \tag{B.20}$$
Acknowledgments. The authors wish to thank Tiefeng Jiang for helpful dis-
cussions. Comments from two referees, especially the one with a number of very
constructive suggestions on improving the paper, are greatly appreciated.
REFERENCES
[1] ABE, S. (2003). Geometry of escort distributions. Phys. Rev. E 68 031101.
[2] ACZÉL, J. D. and DARÓCZY, Z. (1975). On measures of information and their characterizations. Math. Sci. Eng. 115. Academic Press, New York–London. MR0689178
[3] AKAIKE, H. (1973). Information theory and an extension of the likelihood principle. In 2nd International Symposium of Information Theory 267–281. Akad. Kiadó, Budapest. MR0483125
[4] ALTUN, Y. and SMOLA, A. (2006). Unifying divergence minimization and statistical inference via convex duality. In Learning Theory. Lecture Notes in Computer Science 4005 139–153. Springer, Berlin. MR2280603
[5] BARRON, A., RISSANEN, J. and YU, B. (1998). The minimum description length principle in coding and modeling. IEEE Trans. Inform. Theory 44 2743–2760. MR1658898
[6] BASU, A., HARRIS, I. R., HJORT, N. L. and JONES, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika 85 549–559. MR1665873
[7] BECK, C. and SCHLÖGL, F. (1993). Thermodynamics of Chaotic Systems: An Introduction. Cambridge Univ. Press, Cambridge. MR1237638
[8] CHOI, E., HALL, P. and PRESNELL, B. (2000). Rendering parametric procedures more robust by empirically tilting the model. Biometrika 87 453–465. MR1782490
[9] COVER, T. M. and THOMAS, J. A. (2006). Elements of Information Theory. Wiley, New York. MR2239987
[10] DANIELS, H. E. (1997). Saddlepoint approximations in statistics. In Breakthroughs in Statistics (S. Kotz and N. L. Johnson, eds.) 3 177–200. Springer, New York. MR1479201
[11] FERGUSON, T. S. (1996). A Course in Large Sample Theory. Chapman & Hall, London. MR1699953
[12] FERRARI, D. and PATERLINI, S. (2007). The maximum Lq-likelihood method: An application to extreme quantile estimation in finance. Methodol. Comput. Appl. Probab. 11 3–19. MR2476469
[13] FIELD, C. and RONCHETTI, E. (1990). Small Sample Asymptotics. IMS, Hayward, CA. MR1088480
[14] GELL-MANN, M., ED. (2004). Nonextensive Entropy, Interdisciplinary Applications. Oxford Univ. Press, New York. MR2073730
[15] GOLDFARB, D. (1970). A family of variable-metric methods derived by variational means. Math. Comp. 24 23–26. MR0258249
[16] HAVRDA, J. and CHARVÁT, F. (1967). Quantification method of classification processes: Concept of structural entropy. Kibernetika 3 30–35. MR0209067
[17] HUANG, J. Z., LIU, N., POURAHMADI, M. and LIU, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika 93 85–98. MR2277742
[18] HUBER, P. J. (1981). Robust Statistics. Wiley, New York. MR0606374
[19] JAYNES, E. T. (1957). Information theory and statistical mechanics. Phys. Rev. 106 620. MR0087305
[20] JAYNES, E. T. (1957). Information theory and statistical mechanics II. Phys. Rev. 108 171. MR0096414
[21] KULLBACK, S. (1959). Information Theory and Statistics. Wiley, New York. MR0103557
[22] KULLBACK, S. and LEIBLER, R. A. (1951). On information and sufficiency. Ann. Math. Statistics 22 79–86. MR0039968
[23] LEHMANN, E. L. and CASELLA, G. (1998). Theory of Point Estimation. Springer, New York. MR1639875
[24] MAGNUS, J. R. and NEUDECKER, H. (1979). The commutation matrix: Some properties and applications. Ann. Statist. 7 381–394. MR0520247
[25] MCCULLOCH, C. E. (1982). Symmetric matrix derivatives with applications. J. Amer. Statist. Assoc. 77 679–682. MR0675898
[26] NAUDTS, J. (2004). Estimators, escort probabilities, and phi-exponential families in statistical physics. J. Inequal. Pure Appl. Math. 5 102. MR2112455
[27] RÉNYI, A. (1961). On measures of entropy and information. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob. 1 547–561. Univ. California Press, Berkeley. MR0132570
[28] SHANNON, C. E. (1948). A mathematical theory of communication. Bell System Tech. J. 27 379–423. MR0026286
[29] TSALLIS, C. (1988). Possible generalization of Boltzmann–Gibbs statistics. J. Statist. Phys. 52 479–487. MR0968597
[30] TSALLIS, C., MENDES, R. S. and PLASTINO, A. R. (1998). The role of constraints within generalized nonextensive statistics. Physica A: Statistical and Theoretical Physics 261 534–554.
[31] VAN DER VAART, A. W. (1998). Asymptotic Statistics. Cambridge Univ. Press, Cambridge. MR1652247
[32] WANG, X., VAN EEDEN, C. and ZIDEK, J. V. (2004). Asymptotic properties of maximum weighted likelihood estimators. J. Statist. Plann. Inference 119 37–54. MR2018449
[33] WINDHAM, M. P. (1995). Robustifying model fitting. J. Roy. Statist. Soc. Ser. B 57 599–609. MR1341326