Faculty of Economics and Applied Economics
Minimising average risk in regression models
b
Gerda Claeskens and Nils Lid Hjort
DEPARTMENT OF DECISION SCIENCES AND INFORMATION MANAGEMENT (KBI)
KBI 0610
Minimising Average Risk in Regression Models
Gerda Claeskens
Nils Lid Hjort
ORSTAT and University Centre for Statistics
Department of Mathematics
University of Oslo
Katholieke Universiteit Leuven
Naamsestraat 69
B-3000 Leuven, Belgium
Gerda.Claeskens@econ.kuleuven.be
N-0316 Oslo, Norway
nils@math.uio.no
December 2005
Abstract
Most model selection mechanisms work in an ‘overall’ modus, providing models without
specific concern for how the selected model is going to be used afterwards. The focussed
information criterion (FIC), on the other hand, is geared towards optimum model
selection when inference is required for a given estimand. In this paper the FIC method
is extended to weighted versions. This allows one to rank and select candidate models
for the purpose of handling a range of similar tasks well, as opposed to being forced
to focus on each task separately. Applications include selecting regression models
that perform well for specified regions of covariate values. We derive these wFIC
criteria, give asymptotic results, and apply the methods to real data. Formulae for
easy implementation are provided for the class of generalised linear models.
Key-words: focussed information criterion, model selection, regression models.
1
Introduction
Model selection is most often the starting point of any data analysis. It is therefore of
importance to carefully address this modelling step. Most model selection tools attach to
each potential model a number, and then proceed by picking the model with the best (usually
either smallest or largest) value of this number. Traditional model selection techniques work
this way, such as Mallows’s Cp (1973) for linear regression or the more generally applicable
information criteria AIC (from Akaike, 1974) and BIC (from Schwarz, 1978). Once a model
is selected, the actual estimation takes place in the selected model.
1
The focussed information criterion (Claeskens and Hjort, 2003) is also in this spirit,
though distinguishes itself from the other information criteria in that it can be directed
– focussed – towards a specific purpose. A model selected to estimate, for example, the
mean return of an investment taken place in one month, should not necessarily be the same
as the model to be used to estimate the mean return one year further in time. Or, in
medical studies, the model which is best for estimating the survival probability should not
be expected to be the same as the best model for estimating the median survival time. In the
construction of the FIC we therefore start by specifying the focus parameter, the quantity
we wish to estimate, and then use this information to obtain the actual FIC value via an
estimator of the mean squared error of the focus parameter’s estimator. When the focus
parameter changes, also the value of the FIC might change, leading to a possibly different
selected model.
The main issue addressed in this paper is that in many situations, the focussed information criterion demands too much focus of its user, so to speak. For example, for regression
models the FIC can easily be used to select a model for the mean response value for a given
single covariate position. This is sometimes of relevance, and one may take a ‘median’ or
‘average’ covariate position; but in other situations one wishes to construct a model that
does well across many covariate positions.
To address this problem, we derive a weighted focussed information criterion, where we
attach to each potential model a single wFIC value, valid over all or part of the covariate
space. To explain matters clearly, we mainly concentrate on the class of generalised linear
models, though our arguments and methods would extend without serious difficulties to more
general regression models. More information on generalised linear models can be found in
the books by McCullagh and Nelder (1989) and Dobson (2002).
Section 2 briefly reviews the FIC method, in a setting of generalised linear models.
Working with weighted versions of mean squared errors, across all or a subset of covariates,
leads in Section 3 to the weighted FIC method. The wFIC method relies on weights that
are user- and context-specific. Applying a specific weighting scheme, with what we term glm
weights, gives a procedure that turns out to be large-sample equivalent to the AIC. This is
discussed in Section 4, along with some other weighting schemes of interest. Strategies for
estimating (and then minimising) more generally formed averaged risks are then taken up
in Section 5. Practical data examples, showing the applicability of the method, appear in
Section 6. Section 7 offers some concluding comments, indicating the further scope of our
work.
2
2
The FIC for generalised linear models
Let the observed data be denoted Y1 , . . . , Yn , together with observed covariate information
c1 , . . . , cn . In a model selection situation the ci s are vector valued and we wish to build
methods that somehow manage to select the components of most relevance. Often, we are
sure about including some of these covariates in the model (an intercept is an example of
this), so that the actual selection should take place over the other variables. Notationally
we split the covariate vector ci into two parts, the ‘protected’ xi with say p covariates that
are deemed necessary a priori and the ‘open’ zi containing the say q remaining covariates,
amongst which we intend to perform the selection.
2.1
The GLM class
For a generalised linear model (glm), there is a monotone differentiable link function g(·)
such that
g(E(Yi | xi , zi )) = xti β + zit γ
for i = 1, . . . , n,
(1)
mapping the mean response to a linear predictor defined in terms of regression coefficients
β1 , . . . , βp and γ1 , . . . , γq . Thus selecting which covariate components zi,j to include amounts
to determining which γj s to keep in the model; excluding component zi,j corresponds to
using γj = 0. For (classical) linear models, the link function is the identity function. For
other examples, see Section 2.3.
We assume that Y1 , . . . , Yn are independent with density function belonging to an exponential family of the form
f (yi ; θi , φ) = exp
n y θ − b(θ )
i i
i
a(φ)
o
+ c(yi , φ)
for y ∈ Y,
where the sample space Y is the same in each case and does not depend on the unknown
parameters θi and φ. The functions a(·), b(·) and c(·, ·) are fully specified. The b(·) function
plays a crucial role since its derivatives yield the mean and variance function, while a(φ) is
a scale parameter. From the two first Bartlett identities, about moment properties of the
first and second log-derivatives of the density, follow
ξi = E(Yi | xi , zi ) = b′ (θi ) and Var(Yi | xi , zi ) = a(φ)b′′ (θi ).
This expresses θi as a function of the mean response, given the covariates. When the so-called
canonical link function is used, that is, g(·) = (b′ )−1 (·), then θi = g(ξi ) = xti β + zit γ.
3
2.2
The pointwise FIC for GLM
Here we derive explicit expressions for the FIC when applied to the linear predictor µ =
xt β + z t γ = g(E(Y | x, z)) associated with a given position (x, z) in the covariate space. The
(pointwise) FIC is large-sample equivariant under smooth transformations, so the FIC for
µ(x, z) will essentially yield the same model ranking as the FIC for ξ(x, z) = E(Y | x, z).
One of the main ingredients for computing the FIC is the (normalised) Fisher information matrix Jn,wide , computed via the second order partial derivatives of the log-likelihood
function with respect to (β, γ). We partition the matrix Jn,wide as
Jn,wide =
Ã
Jn,00
Jn,10
Jn,01
Jn,11
!
,
such that the block Jn,11 in the lower right corner has dimension q × q. In our calculations
−1
we will frequently need the the lower right submatrix of dimension q × q of Jn,wide
, which we
−1
−1
name Kn . This matrix may be found as Kn = (Jn,11 − Jn,10 Jn,00 Jn,01 ) .
For several members of the generalised linear model family, the scale parameter φ is
either known or completely specified. Examples are the Poisson and binomial distribution,
where a(φ) = 1, or the normal distribution with known variance. In case φ is known, the
information matrix takes a simple and general form, making it easy to compute in the full
generality of generalised linear models. We first decompose the n × (p + q) design matrix
into X = (xt1 , . . . , xtn )t of dimension n × p and Z = (z1t , . . . , znt )t of dimension n × q. Then
Jn,wide
n
1
1X
vi
=
n i=1 a(φ)
Ã
xi
zi
!Ã
xi
zi
!t
1 1
=
n a(φ)
Ã
X tV X
Z tV X
X tV Z
Z tV Z
!
,
where V is the diagonal weight matrix diag(v1 , . . . , vn ), with different formulae available for
vi = v(xi , zi ). One has
vi = [b′′ (θi ){g ′ (ξi )}2 ]−1 =
b′′ (θi ) ³ ∂θi ´2
,
a(φ) ∂ηi
(2)
where ηi = g −1 (ξi ) = xti β + zit γ and ξi = E(Yi | xi , zi ) = b′ (θi ). For the situation with φ
known, therefore, the Kn matrix takes the form
Kn = a(φ){n−1 Z t V (I − X(X t V X)−1 X t V )Z}−1 .
(3)
For canonical link functions, where θi = ηi , matters simplify to vi = b′′ (xti β + zit γ)/a(φ).
When the scale parameter φ is not known, such as with the normal distribution with
unknown mean and variance, or the gamma distribution, we shall argue that formula (3)
is still valid for Kn . This is because of an orthogonality property, namely that the mixed
second derivatives of the log-likelihood function, with respect to φ and β or γ, are seen to
4
have mean zero. Thus the full information matrix takes the form
Jn,scale
0
0
Jn,wide =
0
n−1 a(φ)−1 X t V X n−1 a(φ)−1 X t V Z
,
−1
−1 t
−1
−1 t
0
n a(φ) Z V X n a(φ) Z V Z
P
where Jn,scale is the required −n−1 ni=1 E∂ 2 log f (Yi ; θi , φ)/∂φ2 . This implies that the q × q
lower right hand corner of the inverse information matrix remains as in (3). For the normal
distribution with φ = σ and a(σ) = σ 2 , for example, one finds Jn,scale = 2/σ 2 .
We may now describe the FIC procedure for selecting the tentatively best model for
estimating the focus parameter µ = µ(x, z) = xt β + z t γ. The criterion works specifically
for the given (x, z) position in covariate space; if required, the procedure can be repeated
for several positions. Different submodels are indexed by the various subsets S of covariate
components 1, . . . , q, ranging from the empty set to the full list. For each submodel S one
can evaluate the estimate
µb S (x, z) = xt βbS + (πS z)t γbS = xt βbS + zSt γbS
(4)
that uses maximum likelihood estimates (βbS , γbS ) in the model that employs all of β1 , . . . , βp
but only those γj for which j ∈ S. Here πS is the |S| × q projection matrix that sends z to
πS z = zS , the vector of only those zj for which j ∈ S, and |S| is the number of components
in S. The essence of the FIC is to estimate the mean squared error for each candidate
estimator and then to pick the one with lowest possible mean squared error estimate.
−1 ∂µ
− ∂µ
, which here reads
We need the vector ω = Jn,10 Jn,00
∂β
∂γ
ω = Z t V X(X t V X)−1 x − z,
no matter whether φ is known or not. Note the dependence of ω on (x, z). Note furthermore
that premultiplying vS = πS v with πSt produces a vector πSt πS v of full length q, with zeroes
inserted for those components j with j 6∈ S. Define next Kn,S = (πS Kn−1 πSt )−1 , which is the
|S| × |S| lower right block of the inverse of Jn,S , the information matrix for the S model,
and let finally Gn,S = πSt Kn,S πS Kn−1 . The value of the focussed information criterion is now
obtained as
t
(Iq − Gn,S )t ω + 2ω t πSt Kn,S πS ω,
FIC(S; x, z) = nω t (Iq − Gn,S )γbwide γbwide
(5)
with γbwide being the maximum likelihood estimator of γ in the largest of the considered
models, i.e. the one containing all of γ1 , . . . , γq . For given values of (x, z), the best model
according to the FIC is that model, indexed by S, for which FIC(S; x, z) is the smallest.
The FIC stems from estimating and then adding a squared bias term and a variance
term; see Claeskens and Hjort (2003) for further discussion and applications. There is a
natural modification of (5) for the case of the squared bias being estimated with a negative
number; in such cases we truncate that term to zero. For further discussion of this point,
see also relevant comments in Hjort and Claeskens (2006).
5
2.3
Examples
Here we present a short list of examples that fit in with the general framework above. Each
model may be fitted using statistical software packages, and the Jn and Kn matrices are
easily computed via the appropriate form of the vi weights. For each situation one may use
(5) to determine the best submodel for estimating xt β + z t γ, or for any smooth function
thereof, like E(Y | x, z) or the median response median(Y | x, z).
1. Consider a non-linear normal regression setup where Yi is normal (θi , σ 2 ), with
θi = r(xti β + zit γ) for some specified function r(η). Then b(θ) = 12 θ2 , and the glm weights
of (2) become vi = r′ (xti β + zit γ)2 /σ 2 . The ordinary linear normal model corresponds to
r(η) = η with weights vi = 1/σ 2 .
2. Assume the Yi s are Poisson with parameters ξi = exp(xti β + zit γ). This is Poisson
regression with canonical link function. Then b(θ) = exp(θ), and vi = exp(xti β + zit γ).
3. Then let Yi be binomial (mi , pi ), with pi = H(xti β + zit γ), for a suitable distribution
function H. This is again a generalised linear model with b(θi ) = mi log{1 + exp(θi )}, and
one finds
vi = mi H ′ (ηi )2 /[H(ηi ){1 − H(ηi )}],
with ηi = xti β + zit γ. This can be used for probit regression, for example, where pi =
Φ(ηi ) with the cumulative standard normal. For logistic regression matters simplify to
vi = mi pi (1 − pi ).
4. Suppose positive observations Yi are modelled with Gamma distributions (c, di ),
where c is fixed but di = exp(xti β + zit γ). We use the parametrisation where the mean of Yi
is ξi = c/di . Here one finds vi = c, simply. Thus the Jn,wide matrix is proportional to the
sample variance matrix of the covariate vectors.
Suppose on the other hand that Yi is taken to be Gamma with parameters (ci , d),
this time with d fixed and flexible ci = exp(ηi ), with again ηi = xti β + zit γ. This actually corresponds to a generalised linear model in terms of the log Yi , and one finds vi =
exp(2ηi )ψ ′ (exp(ηi )), where ψ ′ is the trigamma function, the derivative of ψ = Γ′ /Γ.
3
The weighted FIC for generalised linear models
Due to the dependence of the focus on the covariate values (x, z), also the FIC takes different
values, and will produce different rankings of candidate models, for different locations in the
covariate space. Sometimes we choose an average or median value for the regression variables
(x, z), to represent some ‘average’ or ‘median’ subject, sometimes inside a stratum. Often,
such a detailed focus point is not wanted, one might rather wish to find a good model which
works well over a major part of, or over all of, the covariate space. We wish to select a good
6
model valid for a range of individuals simultaneously, say for a subgroup of the population.
We continue to use the generalised linear model setting of the previous section. To reach
precise results and concise arguments for our weighted-focussed model selection schemes we
shall employ a local misspecification framework where the true parameter is of the form
√
(β, γ) = (β0 , δ/ n). Here δ = (δ1 , . . . , δq )t is fixed and unknown. This is as in Hjort and
Claeskens (2003a) and Claeskens and Hjort (2003); see also the rejoinder Hjort and Claeskens
(2003b) to the discussion of these papers. The idea is to prove results about model selectors
and estimators in terms of the local model misspecification parameter δ, and to use such for
developing appropriate model information criteria.
3.1
Some preliminary results
Suppose in general terms that a parameter of interest µ(β, γ; u) depends on some quantity
u that varies in the population being studied. Under the framework outlined above, the
√
true parameter value is µ(β0 , δ/ n; u). We shall now use results developed in Hjort and
Claeskens (2003) and Claeskens and Hjort (2003), with suitable modifications. These rely
on certain regularity conditions, detailed in these papers. These conditions are mild, and
are not repeated in detail here. One such condition that we need to mention here, in order
to adequately identify the appropriate limits below, is the existence of a limit matrix J to
which Jn,wide converges, with blocks J00 , etc. Similarly this defines limit versions K of Kn ,
and a fortiori JS and KS , limit versions of Jn,S and Kn,S . Also, KS = (πS K −1 πSt )−1 , the
lower right block (JS−1 )11 of JS .
For each u the theory as developed in Claeskens & Hjort (2003) applies to the subsetmodel-based maximum likelihood estimators µb S (u), for which we have
√
d
−1
)t J00
M + ω(u)t (δ − πSt KS πS K −1 D),
n{µb S (u) − µtrue (u)} → ΛS (u) = ( ∂µ(β,γ;u)
∂β
−1
where ω(u) = J10 J00
∂µ(β, γ; u)/∂β − ∂µ(β, γ; u)/∂γ, with partial derivatives evaluated at
the null point (β0 , 0). Furthermore, M ∼ Np (0, J00 ) and D ∼ Nq (δ, K), and these random
vectors are independent. The (M, D) variables are needed in Section 5, and furthermore the
variables (CS , DS ) that appear now are linear functions of (M, D).
For a fixed set S and a fixed focus point, Lemma 3.2 of Hjort and Claeskens (2003)
applies, and yields
!
Ã
!
Ã√ b
n(βS − β0 ) d CS
∼ Np+|S| (ξS , JS−1 ),
→
√
b
DS
nγS
where
ξS =
!
à −1
J00 J01 (Iq − πSt KS πS K −1 )δ
KS πS K −1 δ
7
.
A special case of importance is that of S being the full set {1, . . . , q}, for which
δbwide =
√
d
nγbwide → D ∼ Nq (δ, K).
(6)
For convenience of notation we adopt the notion that estimators without subset-subscript
correspond to the full model, with all γ1 , . . . , γq parameters; thus δb = δbwide , etc.
The S-indexed model works with estimates γbS for γj with j ∈ S but uses 0 for j ∈
/ S.
To deal efficiently with different functions of these it will be convenient to introduce the
extended projection matrix of dimension (p + |S|) × (p + q),
πeS =
Ã
Ip
0p,q
πS
0|S|,p
!
.
Allow also the introduction of the q × q matrix GS = πSt KS πS K −1 . Then
√
Ã
βbS − β0
n
√
πSt γbS − δ/ n
!
d
→
Ã
CS
πSt DS − δ
!
,
which is seen to have mean vector and variance matrix
FS δ =
3.2
à −1
!
J00 J01 (Iq − GS )
−(Iq − GS )
and ΓS = πeSt JS−1 πeS .
δ
The wFIC for GLM
Consider again the linear predictor µ(x, z) = xt β +z t γ, and the collection of submodel-based
estimators (4). The vector βbS is always of length p, but takes on different values for different
index sets S. In contrast, the estimator γbS has length |S|. The construction of the weighted
FIC proceeds as follows. We start with the weighted average quadratic loss function on the
scale of the linear predictor, of the form
Ln (S) =
=
n
X
i=1
n
X
i=1
w(xi , zi ){µb S (xi , zi ) − µtrue (xi , zi )}2
(7)
√
t
w(xi , zi )(xti βbS + zi,S
γbS − xti β0 − zit δ/ n)2 .
The weights w(xi , zi ) are user-specified and in general different from the weights v(xi , zi )
in the glm weight matrix V . We shall show that this random loss has a limit distribution,
under mild regularity conditions. Let
Ωn,w
Ã
n
xi
1X
w(xi , zi )
=
n i=1
zi
!Ã
xi
zi
!t
,
(8)
and assume this matrix converges in probability to a nonnegative definite Ωw , depending of
course on the weight function w(x, z).
8
To properly study the random loss function it is convenient to express it as
à b
!
à b
!t
βS − βtrue
βS − βtrue
Ωn,w
.
Ln (S) = n
πSt γbS − γtrue
πSt γbS − γtrue
From conditions and results noted above,
d
Ln (S) → L(S) =
Ã
CS
t
π S DS − δ
!t
Ωw
Ã
CS
t
πS DS − δ
!
.
Under mild conditions, the expected loss w-riskn (S) = E Ln (S) will converge to w-risk(S) =
E L(S). The limit loss L(S) is a quadratic form in normal variables, and has mean value
w-risk(S) = E L(S) = δ t FSt Ωw FS δ + trace(Ωw ΓS )
= trace(Ωw FS δδ t FSt ) + trace(Ωw ΓS )
= I(S) + II(S),
say, involving matrices FS and ΓS defined above. We see that the I(S) term corresponds to
weighted squared bias whereas the second term II(S) is related to weighted variance. By
earlier efforts, this second term can be expressed as II(S) = trace(Ωw πeSt JS−1 πeS ).
This leads upon estimating unknown quantities to a weighted-focussed information
criterion. The second term is not problematic, and we use
−1
b
πeS ),
II(S)
= trace(Ωn,w πeSt Jbn,S
where Jbn,S is the appropriate sub-matrix of
n
1 X
vbi
b
n a(φ)
i=1
1
Jbn =
Ã
xi
zi
!Ã
xi
zi
!t
1 1
=
b
n a(φ)
Ã
X t Vb X
Z t Vb X
X t Vb Z
Z t Vb Z
!
,
and vbi is the estimated version of (2), inserting xti βbwide + zit γbwide for the linear predictor
xti β + zit γ. See the examples in Section 2.3. For the first term, we note that
d
δbδbt = nγb γb t → DDt ,
a variable with mean δδ t + K; recall the convention noted around (6) that γb means γbwide ,
etc. We therefore use
bI(S) = trace{Ω Fb (δbδbt − K
c )Fb t } if this is positive,
n,w S
n
S
= 0 if otherwise.
The weighted focussed information criterion consists in evaluating
b
wFIC(S) = bI(S) + II(S)
9
(9)
for each candidate model S, and in the end selecting the model with smallest value of this
estimated average risk. Note that all components of the above expressions are easily obtained
via standard output of any software fitting generalised linear models.
Remark 1. When more components are put into S, the I(S) term becomes smaller;
for S equal to the full {1, . . . , q} we find FS = 0, making I(wide) = 0. On the other hand,
with more components in S, the bigger is the variance II(S). Thus the wFIC method reflects
the squared modelling bias against variance balance.
Remark 2. In linear models the expression for the pointwise (unweighted) FIC has
been shown to be exact, see Claeskens and Hjort (2003), Section 5.5. There it is shown that,
without assuming normality and when using least squares estimators, the exact expression
for the mean squared error matches that obtained by FIC. This is a favourable property
which helps appreciating the obtained expression of the FIC. More precisely, in a linear
model with Yi = xti β + zit γ + εi , where µ = xt β + z t γ is to be estimated, we use least squares
estimators in a submodel indexed by S, leading to the estimator µb S = xt βbS + (πS z)t γbS .
Computing the exact bias and variance of this estimator leads to the following expression
for the mean squared error of µb S :
−1
x + ω t πSt Kn,S πS ω) + ω t (Iq − πSt Kn,S πS Kn−1 )γγ t (Iq − Kn−1 πSt Kn,S πS )ω. (10)
n−1 (xt Jn,00
This is, up to a constant not depending on the subset S, identical to the limiting expression
on which the FIC is based, compare with (5).
The same property holds when a (non-random) weight function w is included in the
loss function Ln in (8). For linear models, computing the exact mean squared error of Ln (S),
with least squares estimators inserted for βbS and γbS , leads to an expression which is equal
to that one which wFIC as in (9) is based upon. Note that the error variance σ 2 appears
in the denominator of the matrix Jn as the scaling factor a(σ). There is accordingly a σ 2
factor implicitly featuring in the first two terms of (10), but not in the final two terms.
4
GLM weights and the AIC
The wFIC method developed above provides a quite general and versatile model selection
scheme, in that the weights wi = w(xi , zi ) are fully user-specified, meant to reflect what
aspects are deemed more important than others for the use of the finally selected model.
The only caveat is that the weights should not be seriously unstable; the mathematical requirement for our asymptotics to go through is that the Ωn,w matrix converges in probability
with increasing sample size. This section discusses various types of weights.
10
4.1
GLM weights
The log-likelihood structure of a generalised linear model itself suggests a natural type of
weights for use in the wFIC method. If one chooses w(xi , zi ) = vi /a(φ), then Ωn,w = Jn,wide .
This leads to simplifications in the w-risk(S) and wFIC(S) expressions, as we shall see now.
For simplicity of notation and presentation we work directly in the limit experiment, where
c etc.; we also write G = π t K π K −1 . First look at
J and K etc. replace Jbn and K
n
S
S S S
t t
I(S) = δ FS JFS δ, where some manipulations give
JFS =
Ã
J00
J10
J01
J11
! Ã −1
!
J00 J01 (Iq − GS )
This leads with some further efforts to
−Iq + GS
=
Ã
!
0
.
−K −1 (Iq − GS )
I(S) = δ t (Iq − GS )t K −1 (Iq − GS )δ = δ t (K −1 − K −1 πSt KS πS K −1 )δ,
which is estimated by
bI(S) = trace{(K −1 − K −1 π t K π K −1 )(DD t − K)}
S S S
= Dt K −1 D − Dt K −1 πSt KS πS K −1 D − q + |S|,
as long as this expression is positive; it is otherwise truncated to zero. Next,
II(S) = trace(J πeSt JS−1 πeS ) = p + |S|.
This is easily obtained since the effect of pre- and post-multiplication by πeS is that only
that submatrix of J −1 is kept for which the row (and column) numbers belong to the index
set S, entries on all other rows and columns being replaced by zero. This implies that J is
multiplied by part of its inverse matrix, leading to the simplification p + |S|. This is also
b
true for the data-based version II(S).
To summarise this, with glm weights the limit risk takes the form
w-risk(S) = δ t (K −1 − K −1 πSt KS πS K −1 )δ + p + |S|,
and the canonical risk estimate (for the limit experiment) is
bI(S) + II(S)
b
=
(
Dt (K −1 − K −1 πSt KS πS K −1 )D + 2|S| + p − q
p + |S|
if N (S) takes place,
(11)
otherwise.
Here N (S) is the event that the trace in bI(S) is positive, i.e. that
Dt (K −1 − K −1 πSt KS πS K −1 )D > q − |S|.
The N (S) takes place with high probability if δ is some distance away from zero, but in
situations where the underlying γ vector is close to zero, i.e. the narrow model is close to
being correct, the probability that N (S) does not take place is significant. The finite-sample
c and K
c
version of the risk estimate uses δb for D and K
n
n,S for K and KS .
11
4.2
The wFIC and the AIC
Now consider Akaike’s information criterion, which in the present circumstances takes the
form
n
AIC(S) = 2
X
i=1
b − 2(p + |S|),
log f (yi ; θbi , φ)
with θbi being the appropriate maximum likelihood estimate of θi = θ(xi , zi ), which again
depends on βbS and γbS . The AIC scores may be computed for each submodel S, down to
that of the most narrow model which corresponds to S = ∅ and which uses only β1 , . . . , βp , φ
as model parameters. When subtracting the smallest model’s AIC value from AIC(S),
and performing a one-step Taylor expansion, we find that, for the limiting situation where
n → ∞,
d
AIC(S) − AIC(∅) → Dt K −1 πSt KS πS K −1 D − 2|S|.
See Claeskens and Hjort (2003, eq. (2.5)). The best models have the highest AIC scores.
We see from this and (11) that the wFIC method, when using glm weights, is essentially
large-sample equivalent to the AIC method. The word ‘essentially’ relates to the modification
for truncating an estimate of a squared bias to zero, when relevant, spelled out in (11).
Thus the wFIC provides a fresh perspective on the AIC, and our arguments even suggest a
correction to the AIC scores in cases where the event N (S) does not take place.
One may go through the list of examples in Section 2.3 to see the appropriate random
loss functions that correspond to the AIC. For the logistic regression setup of Example 3 in
that section, for example, model selection by estimating the mean of
Ln (S) =
n
X
i=1
³
mi p(xi , zi ){1 − p(zi , zi )} log
pbi,S
pbi ´2
− log
1 − pbi,S
1 − pbi
is essentially the same as AIC. Here pbi,S is the estimated probability under model S. Similarly, for Poisson regression, basing model selection on estimating
w-riskn (S) = E
n
X
i=1
b − log λ )2
exp(xti β + zit γ)(log λ
i,S
i
b
will be large-sample equivalent to the AIC scheme, where λ
i,S is the estimate of Poisson rate
i inside the S model. Of course other weights and other transformations can be worked with,
for logistic and Poisson regression, and such alternatives can in the perspective developed
here be seen as cousins to the AIC method.
4.3
Other types of weights
In addition to the perhaps canonical glm weights choice discussed above, the following types
of weights may be considered.
12
Uniform weights. A simple choice of weights could just assign mass 1/n to each contribution in the summand, with Ωn matrix simply equal to the empirical variance matrix of
the p + q-sized covariate vectors. Another choice is to let wi = 1 precisely for individuals i
belonging to a stratum of interest. This is exemplified in Section 6.
Gliding covariate window. A version of the above is to use wi = w(xi , zi ; x0 ) equal to
1 for individuals inside a certain neighbourhood of some fixed x0 . This gives a ranking of
candidate models around each fixed x0 . Smoothed versions can also be contemplated, also
in a context of model assessment where the x0 is moved in its covariate space. A good final
model, then, should rank highly in all these local competitions.
Averaging over z for fixed x. Another version with some appeal is to assess models in
terms of how well they perform for a given covariate x, averaged across all likely z values.
To indicate in one particular fashion how this may be handled, let
wi ∝ f (zi − ξx , Σx ),
in terms of the density f of the multinormal density Nq (0, Σx ), where ξx and Σx are the
estimated mean and variance matrix for z given x.
Robust weights. There is a whole area of research addressing issues related to robust
model choice. Robustness is concerned with the downweighting (and sometimes identification) of data vectors that are ‘outliers’ or that may exert too strong an influence on
maximum likelihood estimators. The methodology developed in this paper sticks to the
maximum likelihood estimators, as such, but the weight function of the wFIC allows downweighting schemes that make model selection less dependent on extreme data vectors. One
version of this would be
w(xi , zi ) = h(k(xi , zi ) − (x0 , z0 )k) for i = 1, . . . , n,
where the norm in question could measure a suitable distance from covariate vectors to some
robustly identified centre location (x0 , z0 ), and where h(u)could be taken as 1 over a broad
interval, but made to go to zero beyond that interval. Another choice is h(u) = exp(−c|u|)
for a perhaps small value of c. Such a scheme assures robustness with respect to extreme or
overly influential covariate vectors.
Note before we come to the next point that the wFIC methodology also works with
weights more general than wi = w(xi , zi ), as long as the Ωn,w matrix of (8) converges in probability. Thus weights of the form wi = h(resi ), functions of suitably defined glm residuals,
are allowed. One such version, among several, is
wi = h(resi ) =
(
1
if |resi | ≤ c,
c/|resi | if |resi | > c,
13
where resi = (Yi − ξbi )/σbi , featuring (perhaps robustified) estimates of mean and standard deviation for Yi . Weights of this type are dicussed, in a rather different context of robust linear
regression, by Ronchetti and Staudte (1994), who also find that c = 1.345 is a reasonable
default value for this weight function.
5
General risk averages
Above we developed a wFIC method for estimating naturally weighted averaged risks in
generalised linear models. Sometimes different risk averages are called for, however, and this
section extends the wFIC to handle such cases. We choose to stay inside the glm framework
exposited at the start of Section 3, although more general situations can be considered. A
brief motivating example is as follows: Suppose Yi data are exponentially distributed with
parameters θi = exp(xti β + zit γ), and that one wishes to estimate the quantile distribution
µ(u) = µ(u | x, z) = {− log(1 − u)}/ exp(xti β + zit γ) for u ∈ (0, 1).
How can we select a model that provides good estimates µb S (u) across all deciles u =
0.1, . . . , 0.9, say?
Assume in general terms that a parameter µ(u) is to be estimated, defined in terms of
the parameters (β, γ) of a generalised linear model, and depending on some parameter u that
may or may not depend on the covariates. As in previous sections limit distributions will
√
be established in the framework where (β, γ) = (β0 , δ/ n), aiming at providing adequate
finite-sample approximations for risks and averaged risks. We have
√
d
−1
n{µb S (u) − µtrue (u)} → ΛS (u) = ( ∂µ(u)
)t J00
M + ω(u)t (δ − GS D),
∂β
−1
where ω(u) = J10 J00
∂µ/∂β−∂µ/∂γ and GS = πSt KS πS K −1 ; also, M and D are independent,
−1
and M ∼ Np (0, J00
) and D ∼ Nq (δ, K). Consider the loss average function
Ln (S) = n
Z
{µb S (u) − µtrue (u)}2 dWn (u),
where Wn represents some relevant distribution of u values, like the deciles in the quantile
example above. Assuming Wn converging to a suitable weight distribution W (or that it
simply stays fixed, independent of sample size), we have
d
Ln (S) → L(S) =
Z
ΛS (u)2 dW (u)
under mild conditions. We measure the total averaged risk via the expected value of Ln ,
which converges to
w-risk(S) = E L(S) =
14
Z
EΛS (u)2 dW (u),
again under mild conditions. Here
EΛS (u)2 = τ0 (u)2 + ω(u)t E(δ − GS D)(δ − GS D)t ω(u)
= τ0 (u)2 + ω(u)t {(Iq − GS )δδ t (Iq − GS )t + KS−1 }ω(u)
−1 ∂µ(u)
)t J00
and the variance matrix
in terms of τ0 (u)2 = ( ∂µ(u)
∂β
∂β
Var GS D = GS KGtS = πSt KS πS K −1 πSt KS πS = πSt KS πS .
This leads to the expression
w-risk(S) =
Z
τ0 (u)2 dW (u) + trace{(Iq − GS )δδ t (Iq − GS )t R} + trace(πSt KS πS R) (12)
for the limit risk, where
R=
Z
ω(u)ω(u)t dW (u).
The first term is immaterial since it does not depend on S, so our generalised wFIC naturally
becomes
b
wFIC(S) = bI(S) + II(S)
b
b )(δbδbt − K
c )(I − G
b )t R},
b 0] + trace(π t K
c
= max[trace{(Iq − G
n,S
n
q
n,S
S n,S πS R).
b is a sample-based estimate of the R matrix.
Here R
This more general wFIC can be applied when one wishes to consider average risk across
both covariates and quantiles, for example.
6
6.1
Illustrations and applications
Diabetic retinopathy data
The Wisconsin Epidemiologic Study of Diabetic Retinopathy (Klein et al., 1984) provides
information to study diabetic retinopathy as a function of several other measurements. The
dataset consists of patient information for 348 men and 343 women. The binary outcome
variable Y = 0 indicates whether there is no or only mild nonproliferate retinopathy on both
of the eyes. A value Y = 1 is obtained when there is moderate to severe nonproliferate
retinopathy, or proliferate retinopathy for at least one of the eyes. Variables measured
are: duration of diabetes in years (x1 ), presence of macular edema in at least one eye (z1 ),
percentage of glycosylated hemoglobin (z2 ), body mass index (z3 ), pulse rate in beats per 30
seconds (z4 ), sex (z5 , with 1 for male and 0 for female), presence of urine protein (z6 ), and
area of residence (urban or rural, z7 ).
A logistic regression model is used for the analysis. Since in earlier analysis of this
dataset it is found that duration of diabetes is an important variable (see for example
15
Table 1: Values of the weighted focussed information criterion, where
weights are indicator values for men (1st column) and for women (3rd column), together with the selected variables. All logistic regression models
contain an intercept term, as well as the variable x1 , duration of diabetes.
Men
Women
wFICM z-variables
wFICF z-variables
21.201
1,4,6
24.828
1,4,6
23.270
1,3,4,6
24.974
1,4,6,7
24.230
1,2,4,6
26.155
1,2,4,6
24.393
1,4,6,7
26.160
1,3,4,6
27.590
1,2,3,4,6
28.072
1,2,4,6,7
27.767
1,3,4,6,7
28.616
1,3,4,6,7
28.680
1,2,4,6,7
28.673
1,2,3,4,6
29.982
1,4,5,6
31.096
1,4,5,6
31.186
1,3,4,5,6
31.636
1,4,5,6,7
31.345
1,2,4,5,6
32.945
1,3,4,5,6
Claeskens, Croux and Van Kerckhoven, 2006), we include this in all of the models we consider, as well as an intercept term. We perform model selection amongst the other seven
variables and allow for all possible subsets of the full model, leading to 128 possible models.
As a model selection criterion we take first wFICM (S) with weight vector (1/nM )I(male)
and next wFICF (S) with weight vector (1/nF )I(female) where the weights are indicator
variables for men (in case 1) and for women (in case 2), and nM (resp. nF ) denotes the
number of men (resp. women) in the dataset. Note that the values of wFIC(S)are computed
using the complete dataset, we are not splitting the dataset for model selection.
Table 6.1 gives for both criteria the ten smallest values, together with the variables in
the corresponding models. The best model is in both cases the model containing the binary
variables z1 : presence of macular edema in at least one of the eyes and z6 : indicator for
urine protein, as well as z4 : pulse rate. The subgroups differ in the ranking of the next best
models, for men the body mass index (variable z3 ) is an important variable, while for women
the area of residence (z7 ) is more important, and z3 only shows in the 4th best ranked model.
Thus we learn that body mass index influences Y in possibly different ways, for men and for
women, which may be taken into account for building a final model.
As a comparison we also used the overall model selection criteria AIC and BIC. The
AIC picks the same model as the FIC does, namely the model with variables z1 , z4 and
z6 , while the BIC omits from this model the variable z4 , pulse rate in beats per second.
16
These criteria are computed using the complete dataset. When we would split the data
into the results for men and for women separately and then run the AIC selection method
separately for both data parts, AIC selects the variables z1 , z2 , z6 for the subset of men, and
variables z1 , z4 , and z6 for the subset of women. No artificial data splitting is needed for the
computation of the wFIC.
6.2
CH4 concentrations
This example consists of CH4 data, which are atmospheric CH4 concentrations (ppbv) derived from flask samples collected at the Shetland Islands of Scotland (Steele, Krummel and
Langenfelds, 2002). Monthly values are expressed in parts per billion by volume (ppbv). In
total there are 110 monthly measurements, starting in December 1992 and ending December
2001. The regression variable u = time is rescaled to the (0, 1) interval, and the response
variable Y is the CH4 concentration. We use a cosine series estimator based on the model
µ(u) = E(Y | U = u) = β0 +
m
X
γj cos(πju),
j=1
where we will vary the value of m, which is the truncation point of the series. This defines
a sequence of nested models, which fits into the regression context of the previous sections
when defining zj = cos(πju). We wish to select the best order m. In our modelling efforts,
we let m be any number between 1 and 15 (the wide model). A scatter plot of the data is
shown in Figure 1(a). We applied the wFIC(S) method (9) with equal weights wi = 1, and
found that the best model is for m = 2; see the figure.
As a comparison we also computed FIC values for each of the individual 110 measurement months. This means that we take as a focus parameter the mean CH4 concentration at
that particular month (without averaging), which leads to a set of 15 FIC values, one for each
order of m = 1, . . . , 15, and this for each of the 110 months. The results are summarised in
the following frequency table for the individually chosen model by FIC. For example, model
order m = 1 was chosen 75 times in the 110 model selection applications, model order m = 2
was chosen 7 times, etc.
m
frequency
1 2 3 5 6 7
75 7 1 1 3 5
10 11
3 15
The overall chosen model with m = 2 is in this case not the model which was most frequently
selected by the individual searches. Remarkably, the model with order 11 is chosen 15 times
in the individual search. A possible explanation for this is that a high frequency model of
order 11 is reflecting the random variability in the data cloud.
Another set of weights which makes sense for variables measured in time, is that which
gives more weight to more recent measurements. As an example we used the weighting
17
(b)
1400
1200
wFIC values
1810
1800
1000
1790
800
1770
1780
CH4 concentration
1820
1600
1830
1800
1840
(a)
0.0
0.2
0.4
0.6
0.8
1.0
2
time
4
6
8
10
12
14
m
Figure 1: (a) Scatterplot of the CH4 data, along with the estimated mean curve for the
m = 2 model. (b) The wFIC(S) values with equal weights wi = 1.
scheme i/n (i = 1, . . . , n = 110), and found in this particular case the same FIC selected
model, namely the model with truncation point m = 2.
6.3
Highway data
To illustrate robust downweighting, we use Hoffstedt’s highway data, see also Weisberg
(2005, Section 7.2). This dataset is used to explain the 1973 accident rate per million vehicle
miles, as a function of several variables. There are 39 observations made. In every model we
include an intercept term and x1 , the length of the highway segment in miles. Variables to
choose from are average daily traffic count in thousands (z1 ), truck volume as a percent of
the total volume (z2 ), total number of lanes of traffic (z3 ), number of access points per mile
(z4 ), number of signalised interchanges per mile (z5 ), number of freeway-type interchanges
per mile (z6 ), speed limit in 1973 (z7 ), lane width, in feet (z8 ), width of the outer shoulder
on the roadway (in feet) (z9 ), and finally an indicator of the type of roadway or the source
of funding for the road (z10 ).
Based on robust Cp model selection, Ronchetti and Staudte (1994) support the model
which includes, in addition to x1 , the variables z5 , z6 , z7 and z10 , and also the model with
additional variables z2 , z3 , z4 and z9 . Here we construct weights w1 , . . . , w39 based on the
robustification method outlined in Section 4.3; these rely on intially used robust estimators
for the regression coefficients obtained in the full model. Five of the observations receive a
weight which is smaller than one, all other observations get weight one. The weights that
18
differ from one are 0.903, 0.568, 0.436, 0.577, 0.811. These weights are then used in the
wFIC(S) construction, and lead to preferring the model with the two variables x1 and z7 .
For this particular example, the model selected by FIC is more parsimonious than the model
suggested by Mallow’s Cp .
7
Concluding comments
Our paper has provided a fair middle ground between the extremes ‘blind model selection’
(where a model is found via say AIC or BIC, without any particular regard to the actual
use of the model after selection) and ‘fully focussed selection’ (where a model is selected to
perform optimally for a given estimand). We have seen that special versions of the wFIC
correspond to the AIC for generalised linear models, so our methods may be seen as suitable
generalisations of the AIC for use in situations where the context of one’s modelling and
analysis dictates more specific weighting schemes than the default ones. Below we give
some concluding remarks, pertaining to themes related to but outside the main scope of the
present paper.
1. In this paper we chose to concentrate on the generalised linear models framework,
where methods have a particularly clear structure, but it is clear that the methods can be
generalised to many other regression structures. Thus parametric regression models for multidimensional data, and for hazard rates with censored data, can be handled with essentially
the same methods. Extensions to the semiparametric Cox model are less immediate, but
can be accomplished via methods of Hjort and Claeskens (2006).
2. Our paper has developed strategies that for each model rely on maximum likelihood (or asymptotically equivalent) estimators. There is a need for generalising methods
and results to more robust strategies, for example involving M-estimators. This is entirely
possible, but requires more work and will result in algebraically and structurally somewhat
less elegant methods.
b
3. Our wFIC(S) = bI(S) + II(S)
is not algebraically equivalent to the simpler one of
averaging individual FIC(S) scores. The two methods are the same only in cases where
there are no modifications of setting negative estimates of squared bias to zero. The wFIC
method, as outlined in Sections 3 and 5, performs the squared bias modification only once,
at the end, as opposed to performing this operation for each individual application.
4. Our methods stem from precise limit distribution results that involve quantities
like J and K, along with further relatives like KS and GS . Our wFIC(S) formulae involve
c , and so on. The theory behind the methods
estimates of these quantities, say Jbn,S , K
n,S
ensure that they work well as long as estimates are used that are consistent, under the local
√
misspecification framework (β, γ) = (β0 , δ/ n). Among various possibilities we have chosen
19
to use the ‘wide model perspective’ in our implementations, starting from estimators Jbn that
use estimates (βbwide , γbwide ).
5. We have derived methods for selecting a model, but have not discussed the consequences of having selected the model in this fashion. Methods of Hjort and Claeskens
(2003a) and Claeskens and Hjort (2003) make it however possible to analyse the performance of estimator-after-selection, also with the wFIC methods developed in the present
article.
6. Though we have been specifically concerned with model selection, methods of Hjort
and Claeskens (2003a) can be applied to provide model average procedures, say of the type
µb =
X
S
c(wFIC(S))µb S ,
a data-dependent average across the estimators of the individual models, with
.X
c(wFIC(S)) = exp{− 21 κ wFIC(S)}
exp{− 12 κ wFIC(S ′ )}.
S′
Here κ is an algorithmic parameter, with small κ corresponding to near uniform weighting
while larger κ means giving nearly full weight to the model that wins the wFIC competition.
Methods of that paper also make it possible to study performances of such average estimators,
compared to the more usual estimators-post-selection.
7. Our methods have been developed inside a framework of ‘first order asymptotics’ for
general parametric models. It might be important to supplement such methods by suitable
second order corrections to make them work more precisely for moderate or smaller sample
sizes. The content of Remark 2 of Section 3 is that the wFIC(S) expression is exactly correct
for each finite n > p + q, when used in the linear model with non-random weights. This is an
indication that even the first-order approximations to risks and their estimates are adequate
also in other generalised linear models.
The situation is a bit more complicated when the weights themselves have a random
component, as with the robust type wi = h(resi ) discussed in Section 4.3; here approximations stemming from the first-order asymptotics might need adjustments to be more
accurate in practice. In other words, even though demonstrably Ln (S) →d L(S) and
w-riskn (S) → w-risk(S), one can expect the real variance of Ln (S) to be bigger than that of
L(S), in cases with complicated data-dependent weights.
20
References
Akaike, H. (1974). A new look at statistical model identification, I.E.E.E. Transactions on
Automatic Control 19, 716–723.
Claeskens, G., Croux, C. and Van Kerckhoven, J. (2006). Variable selection for logistic
regression using a prediction focussed information criterion. Biometrics, to appear.
Claeskens, G. and Hjort, N. L. (2003). The focused information criterion [with discussion].
Journal of the American Statistical Association 98, 900–916.
Dobson, A. J. (2002). An Introduction to Generalized Linear Models. Chapman and Hall,
London.
Hjort, N. L. and Claeskens, G. (2003a). Frequentist model average estimators [with discussion]. Journal of the American Statistical Association 98, 879–899.
Hjort, N. L. and Claeskens, G. (2003b). Rejoinder to ‘The Focussed Information Criterion’ and ‘Frequentist model average estimators’. Journal of the American Statistical
Association 98, 938–945.
Hjort, N. L. and Claeskens, G. (2006). Focussed information criteria and model averaging
for Cox’s hazard regression model. Journal of the American Statistical Association, to
appear.
Klein, R., Klein, B. E. K., Moss, S. E., Davis, M. D. and DeMets, D. L. (1984). The Wisconsin epidemiologic study of diabetic retinopathy: II. Prevalence and risk of diabetic
retinopathy when age at diagnosis is less than 30 years. Archives of Ophthalmology 102,
520–526.
Mallows, C. L. (1973). Some comments on Cp , Technometrics 15, 661–675.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, Chapman and Hall,
London.
Ronchetti, E. (1997). Robustness aspects of model choice. Statistica Sinica 2, 327–338.
Ronchetti, E. and Staudte, R. G, (1994). A robust version of Mallows’ Cp . Journal of the
American Statistical Association 89, 550–559.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6,
461–464.
Steele, L. P., Krummel, P. B. and Langenfelds, R. L. (2002). Atmospheric CH4 concentrations from sites in the CSIRO Atmospheric Research GASLAB air sampling network
(October 2002 version). In Trends: A Compendium of Data on Global Change, Carbon
Dioxide Information Analysis Center, Oak Ridge National Laboratory, U.S. Department
of Energy, Oak Ridge, TN, U.S.A.
Weisberg, S. (2005). Applied Linear Regression, 3rd edition. Wiley, New York.
21