Minimising Average Risk in Regression Models

Gerda Claeskens and Nils Lid Hjort

Faculty of Economics and Applied Economics, Department of Decision Sciences and Information Management (KBI), report KBI 0610.

Gerda Claeskens: ORSTAT and University Centre for Statistics, Katholieke Universiteit Leuven, Naamsestraat 69, B-3000 Leuven, Belgium; Gerda.Claeskens@econ.kuleuven.be
Nils Lid Hjort: Department of Mathematics, University of Oslo, N-0316 Oslo, Norway; nils@math.uio.no

December 2005

Abstract

Most model selection mechanisms work in an 'overall' modus, providing models without specific concern for how the selected model is going to be used afterwards. The focussed information criterion (FIC), on the other hand, is geared towards optimum model selection when inference is required for a given estimand. In this paper the FIC method is extended to weighted versions. This allows one to rank and select candidate models for the purpose of handling a range of similar tasks well, as opposed to being forced to focus on each task separately. Applications include selecting regression models that perform well for specified regions of covariate values. We derive these wFIC criteria, give asymptotic results, and apply the methods to real data. Formulae for easy implementation are provided for the class of generalised linear models.

Key-words: focussed information criterion, model selection, regression models.

1 Introduction

Model selection is most often the starting point of a data analysis, and it is therefore important to address this modelling step carefully. Most model selection tools attach a number to each potential model, and then proceed by picking the model with the best (usually either smallest or largest) value of this number. Traditional model selection techniques work this way, such as Mallows's Cp (Mallows, 1973) for linear regression or the more generally applicable information criteria AIC (Akaike, 1974) and BIC (Schwarz, 1978).
Once a model is selected, the actual estimation takes place in the selected model. The focussed information criterion (Claeskens and Hjort, 2003) is also in this spirit, though it distinguishes itself from the other information criteria in that it can be directed – focussed – towards a specific purpose. A model selected to estimate, for example, the mean return of an investment one month ahead need not be the same as the model used to estimate the mean return one year further in time. Or, in medical studies, the model which is best for estimating a survival probability should not be expected to be the same as the best model for estimating the median survival time. In the construction of the FIC we therefore start by specifying the focus parameter, the quantity we wish to estimate, and then use this information to obtain the actual FIC value via an estimator of the mean squared error of the focus parameter's estimator. When the focus parameter changes, the FIC values may change as well, leading to a possibly different selected model.

The main issue addressed in this paper is that in many situations the focussed information criterion demands too much focus of its user, so to speak. For example, for regression models the FIC can easily be used to select a model for the mean response at a single given covariate position. This is sometimes of relevance, and one may take a 'median' or 'average' covariate position; but in other situations one wishes to construct a model that does well across many covariate positions. To address this problem, we derive a weighted focussed information criterion, attaching to each potential model a single wFIC value, valid over all or part of the covariate space. To explain matters clearly, we mainly concentrate on the class of generalised linear models, though our arguments and methods extend without serious difficulties to more general regression models.
More information on generalised linear models can be found in the books by McCullagh and Nelder (1989) and Dobson (2002). Section 2 briefly reviews the FIC method in the setting of generalised linear models. Working with weighted versions of mean squared errors, across all or a subset of covariates, leads in Section 3 to the weighted FIC method. The wFIC method relies on weights that are user- and context-specific. Applying a specific weighting scheme, with what we term glm weights, gives a procedure that turns out to be large-sample equivalent to the AIC. This is discussed in Section 4, along with some other weighting schemes of interest. Strategies for estimating (and then minimising) more generally formed averaged risks are taken up in Section 5. Practical data examples, showing the applicability of the method, appear in Section 6. Section 7 offers some concluding comments, indicating the further scope of our work.

2 The FIC for generalised linear models

Let the observed data be denoted Y_1, …, Y_n, together with observed covariate information c_1, …, c_n. In a model selection situation the c_i are vector-valued and we wish to build methods that somehow manage to select the components of most relevance. Often we are sure about including some of these covariates in the model (an intercept is an example), so that the actual selection should take place over the other variables. Notationally we split the covariate vector c_i into two parts: the 'protected' x_i, with say p covariates that are deemed necessary a priori, and the 'open' z_i, containing the say q remaining covariates, amongst which we intend to perform the selection.

2.1 The GLM class

For a generalised linear model (glm), there is a monotone differentiable link function g(·) such that

  g(E(Y_i | x_i, z_i)) = x_i^t β + z_i^t γ   for i = 1, …, n,   (1)

mapping the mean response to a linear predictor defined in terms of regression coefficients β_1, …, β_p and γ_1, …, γ_q.
Thus selecting which covariate components z_{i,j} to include amounts to determining which γ_j to keep in the model; excluding component z_{i,j} corresponds to setting γ_j = 0. For (classical) linear models, the link function is the identity function. For other examples, see Section 2.3.

We assume that Y_1, …, Y_n are independent with density functions belonging to an exponential family of the form

  f(y_i; θ_i, φ) = exp[ {y_i θ_i − b(θ_i)}/a(φ) + c(y_i, φ) ]   for y_i ∈ Y,

where the sample space Y is the same in each case and does not depend on the unknown parameters θ_i and φ. The functions a(·), b(·) and c(·, ·) are fully specified. The b(·) function plays a crucial role, since its derivatives yield the mean and variance functions, while a(φ) is a scale parameter. From the first two Bartlett identities, about moment properties of the first and second log-derivatives of the density, it follows that

  ξ_i = E(Y_i | x_i, z_i) = b′(θ_i)   and   Var(Y_i | x_i, z_i) = a(φ) b″(θ_i).

This expresses θ_i as a function of the mean response, given the covariates. When the so-called canonical link function is used, that is, g(·) = (b′)^{-1}(·), then θ_i = g(ξ_i) = x_i^t β + z_i^t γ.

2.2 The pointwise FIC for GLM

Here we derive explicit expressions for the FIC when applied to the linear predictor µ = x^t β + z^t γ = g(E(Y | x, z)) associated with a given position (x, z) in the covariate space. The (pointwise) FIC is large-sample equivariant under smooth transformations, so the FIC for µ(x, z) will essentially yield the same model ranking as the FIC for ξ(x, z) = E(Y | x, z). One of the main ingredients for computing the FIC is the (normalised) Fisher information matrix J_{n,wide}, computed via the second order partial derivatives of the log-likelihood function with respect to (β, γ). We partition this matrix as

  J_{n,wide} = [ J_{n,00}   J_{n,01}
                 J_{n,10}   J_{n,11} ],

such that the block J_{n,11} in the lower right corner has dimension q × q.
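As a quick numerical illustration of these identities (our own sketch, not part of the paper's development), take the binomial member of the family with b(θ) = m log{1 + exp(θ)} and a(φ) = 1, as in Example 3 of Section 2.3: finite differences of b(·) recover the binomial mean mp and variance mp(1 − p).

```python
import math

# Finite-difference check (ours) of the Bartlett identities for the binomial member of the
# exponential family: b(theta) = m*log(1 + exp(theta)), a(phi) = 1.
# b'(theta) should equal the mean m*p and b''(theta) the variance m*p*(1-p).

def b(theta, m=10):
    return m * math.log(1.0 + math.exp(theta))

def num_deriv(f, x, h=1e-5):
    # central-difference approximation to f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

theta, m = 0.7, 10
p = math.exp(theta) / (1.0 + math.exp(theta))

mean_num = num_deriv(lambda t: b(t, m), theta)
var_num = num_deriv(lambda t: num_deriv(lambda s: b(s, m), t), theta, h=1e-4)

assert abs(mean_num - m * p) < 1e-4          # b'(theta)  = m p
assert abs(var_num - m * p * (1 - p)) < 1e-3  # b''(theta) = m p (1-p)
```

The same check works for any choice of b(·) from Section 2.3, which is a convenient way to validate a hand-derived weight v_i.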
In our calculations we will frequently need the lower right q × q submatrix of J_{n,wide}^{-1}, which we name K_n. This matrix may be found as K_n = (J_{n,11} − J_{n,10} J_{n,00}^{-1} J_{n,01})^{-1}. For several members of the generalised linear model family, the scale parameter φ is either known or completely specified. Examples are the Poisson and binomial distributions, where a(φ) = 1, or the normal distribution with known variance. In case φ is known, the information matrix takes a simple and general form, making it easy to compute in the full generality of generalised linear models. We first decompose the n × (p + q) design matrix into X = (x_1, …, x_n)^t of dimension n × p and Z = (z_1, …, z_n)^t of dimension n × q. Then

  J_{n,wide} = (1/n) Σ_{i=1}^n (v_i/a(φ)) (x_i; z_i)(x_i; z_i)^t
             = (1/(n a(φ))) [ X^t V X   X^t V Z
                              Z^t V X   Z^t V Z ],

where V is the diagonal weight matrix diag(v_1, …, v_n), with different formulae available for v_i = v(x_i, z_i). One has

  v_i = [b″(θ_i){g′(ξ_i)}²]^{-1} = {b″(θ_i)/a(φ)} (∂θ_i/∂η_i)²,   (2)

where η_i = g(ξ_i) = x_i^t β + z_i^t γ and ξ_i = E(Y_i | x_i, z_i) = b′(θ_i). For the situation with φ known, therefore, the K_n matrix takes the form

  K_n = a(φ) {n^{-1} Z^t V (I − X(X^t V X)^{-1} X^t V) Z}^{-1}.   (3)

For canonical link functions, where θ_i = η_i, matters simplify to v_i = b″(x_i^t β + z_i^t γ)/a(φ). When the scale parameter φ is not known, such as with the normal distribution with unknown mean and variance, or the gamma distribution, we shall argue that formula (3) is still valid for K_n. This is because of an orthogonality property: the mixed second derivatives of the log-likelihood function, with respect to φ and β or γ, are seen to have mean zero. Thus the full information matrix takes the form

  J_{n,wide} = [ J_{n,scale}   0                          0
                 0             n^{-1} a(φ)^{-1} X^t V X   n^{-1} a(φ)^{-1} X^t V Z
                 0             n^{-1} a(φ)^{-1} Z^t V X   n^{-1} a(φ)^{-1} Z^t V Z ],

where J_{n,scale} is the required −n^{-1} Σ_{i=1}^n E ∂² log f(Y_i; θ_i, φ)/∂φ².
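To make the computation of (3) concrete, here is a small sketch (ours, on simulated logistic-regression data rather than anything from the paper, with a(φ) = 1 and v_i = p_i(1 − p_i)); it also checks that (3) agrees with the lower right block of J_{n,wide}^{-1}.

```python
import numpy as np

# Sketch of K_n from equation (3) for a toy logistic regression: a(phi)=1, v_i = p_i(1-p_i).
rng = np.random.default_rng(0)
n, p, q = 200, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # protected part: intercept + x1
Z = rng.normal(size=(n, q))                             # open covariates
beta, gamma = np.array([0.2, 0.5]), np.zeros(q)         # narrow-model truth, for illustration
eta = X @ beta + Z @ gamma
pr = 1.0 / (1.0 + np.exp(-eta))
V = np.diag(pr * (1 - pr))                              # glm weight matrix

# Equation (3): K_n = { n^{-1} Z^t V (I - X (X^t V X)^{-1} X^t V) Z }^{-1}
H = np.eye(n) - X @ np.linalg.solve(X.T @ V @ X, X.T @ V)
Kn = np.linalg.inv(Z.T @ V @ H @ Z / n)

# Cross-check: K_n is the lower right q x q block of the inverse information matrix.
Jn = np.block([[X.T @ V @ X, X.T @ V @ Z],
               [Z.T @ V @ X, Z.T @ V @ Z]]) / n
Kn_direct = np.linalg.inv(Jn)[p:, p:]
assert np.allclose(Kn, Kn_direct)
```

In practice X, Z and the fitted probabilities would come from standard glm software output, as noted in Section 3.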
This implies that the q × q lower right hand corner of the inverse information matrix remains as in (3). For the normal distribution with φ = σ and a(σ) = σ², for example, one finds J_{n,scale} = 2/σ².

We may now describe the FIC procedure for selecting the tentatively best model for estimating the focus parameter µ = µ(x, z) = x^t β + z^t γ. The criterion works specifically for the given (x, z) position in covariate space; if required, the procedure can be repeated for several positions. Different submodels are indexed by the various subsets S of the covariate components 1, …, q, ranging from the empty set to the full list. For each submodel S one can evaluate the estimate

  µ̂_S(x, z) = x^t β̂_S + (π_S z)^t γ̂_S = x^t β̂_S + z_S^t γ̂_S   (4)

that uses maximum likelihood estimates (β̂_S, γ̂_S) in the model that employs all of β_1, …, β_p but only those γ_j for which j ∈ S. Here π_S is the |S| × q projection matrix that sends z to π_S z = z_S, the vector of only those z_j for which j ∈ S, and |S| is the number of components in S. The essence of the FIC is to estimate the mean squared error for each candidate estimator and then to pick the one with the lowest mean squared error estimate.

We need the vector ω = J_{n,10} J_{n,00}^{-1} ∂µ/∂β − ∂µ/∂γ, which here reads

  ω = Z^t V X (X^t V X)^{-1} x − z,

no matter whether φ is known or not. Note the dependence of ω on (x, z). Note furthermore that premultiplying v_S = π_S v with π_S^t produces a vector π_S^t π_S v of full length q, with zeroes inserted for those components j with j ∉ S. Define next K_{n,S} = (π_S K_n^{-1} π_S^t)^{-1}, which is the |S| × |S| lower right block of the inverse of J_{n,S}, the information matrix for the S model, and let finally G_{n,S} = π_S^t K_{n,S} π_S K_n^{-1}. The value of the focussed information criterion is now obtained as

  FIC(S; x, z) = n ω^t (I_q − G_{n,S}) γ̂_wide γ̂_wide^t (I_q − G_{n,S})^t ω + 2 ω^t π_S^t K_{n,S} π_S ω,   (5)

with γ̂_wide being the maximum likelihood estimator of γ in the largest of the considered models, i.e.
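Once K_n, ω and γ̂_wide are available from the wide-model fit, evaluating (5) over all subsets is mechanical. The sketch below is ours, with arbitrary toy inputs in place of fitted quantities; it enumerates all 2^q subsets and picks the FIC-minimising one.

```python
import numpy as np
from itertools import chain, combinations

def fic(S, Kn, omega, gamma_wide, n):
    """Evaluate (5): n w^t(I-G_S) g g^t (I-G_S)^t w + 2 w^t pi_S^t K_{n,S} pi_S w."""
    q = len(omega)
    Iq = np.eye(q)
    if len(S) == 0:
        GS, KS_mid = np.zeros((q, q)), np.zeros((q, q))
    else:
        piS = Iq[list(S), :]                                   # |S| x q projection
        KnS = np.linalg.inv(piS @ np.linalg.inv(Kn) @ piS.T)   # K_{n,S}
        GS = piS.T @ KnS @ piS @ np.linalg.inv(Kn)             # G_{n,S}
        KS_mid = piS.T @ KnS @ piS
    sqbias = n * omega @ (Iq - GS) @ np.outer(gamma_wide, gamma_wide) @ (Iq - GS).T @ omega
    return sqbias + 2 * omega @ KS_mid @ omega

q, n = 3, 100
Kn = np.eye(q)                               # toy values, not fitted quantities
omega = np.array([1.0, -0.5, 0.3])
gamma_wide = np.array([0.8, 0.05, 0.02])     # first open covariate clearly matters

subsets = list(chain.from_iterable(combinations(range(q), k) for k in range(q + 1)))
best = min(subsets, key=lambda S: fic(S, Kn, omega, gamma_wide, n))
assert best == (0,)   # keeping only the strong component minimises the criterion here
```

With these toy numbers the squared-bias term dominates unless component 1 is kept, while adding the two near-zero components only buys extra variance, so the subset {1} wins.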
the one containing all of γ_1, …, γ_q. For given values of (x, z), the best model according to the FIC is the model, indexed by S, for which FIC(S; x, z) is smallest. The FIC stems from estimating and then adding a squared bias term and a variance term; see Claeskens and Hjort (2003) for further discussion and applications. There is a natural modification of (5) for the case of the squared bias being estimated by a negative number; in such cases we truncate that term to zero. For further discussion of this point, see also relevant comments in Hjort and Claeskens (2006).

2.3 Examples

Here we present a short list of examples that fit in with the general framework above. Each model may be fitted using standard statistical software, and the J_n and K_n matrices are easily computed via the appropriate form of the v_i weights. For each situation one may use (5) to determine the best submodel for estimating x^t β + z^t γ, or any smooth function thereof, like E(Y | x, z) or the median response median(Y | x, z).

1. Consider a non-linear normal regression setup where Y_i is N(θ_i, σ²), with θ_i = r(x_i^t β + z_i^t γ) for some specified function r(η). Then b(θ) = θ²/2, and the glm weights of (2) become v_i = r′(x_i^t β + z_i^t γ)²/σ². The ordinary linear normal model corresponds to r(η) = η, with weights v_i = 1/σ².

2. Assume the Y_i are Poisson with parameters ξ_i = exp(x_i^t β + z_i^t γ). This is Poisson regression with canonical link function. Then b(θ) = exp(θ), and v_i = exp(x_i^t β + z_i^t γ).

3. Next let Y_i be binomial (m_i, p_i), with p_i = H(x_i^t β + z_i^t γ) for a suitable distribution function H. This is again a generalised linear model, with b(θ_i) = m_i log{1 + exp(θ_i)}, and one finds v_i = m_i H′(η_i)²/[H(η_i){1 − H(η_i)}], with η_i = x_i^t β + z_i^t γ. This can be used for probit regression, for example, where p_i = Φ(η_i), with Φ the standard normal cumulative distribution function. For logistic regression matters simplify to v_i = m_i p_i(1 − p_i).

4.
Suppose positive observations Y_i are modelled with Gamma distributions (c, d_i), where c is fixed but d_i = exp(x_i^t β + z_i^t γ). We use the parametrisation where the mean of Y_i is ξ_i = c/d_i. Here one finds v_i = c, simply. Thus the J_{n,wide} matrix is proportional to the sample variance matrix of the covariate vectors. Suppose on the other hand that Y_i is taken to be Gamma with parameters (c_i, d), this time with d fixed and flexible c_i = exp(η_i), with again η_i = x_i^t β + z_i^t γ. This actually corresponds to a generalised linear model in terms of log Y_i, and one finds v_i = exp(2η_i) ψ′(exp(η_i)), where ψ′ is the trigamma function, the derivative of the digamma function ψ = Γ′/Γ.

3 The weighted FIC for generalised linear models

Due to the dependence of the focus on the covariate values (x, z), the FIC takes different values, and will produce different rankings of candidate models, at different locations in the covariate space. Sometimes we choose an average or median value for the regression variables (x, z), to represent some 'average' or 'median' subject, sometimes inside a stratum. Often, however, such a detailed focus point is not wanted; one would rather find a good model which works well over a major part of, or over all of, the covariate space. We wish to select a good model valid for a range of individuals simultaneously, say for a subgroup of the population. We continue to use the generalised linear model setting of the previous section.

To reach precise results and concise arguments for our weighted-focussed model selection schemes we shall employ a local misspecification framework where the true parameter is of the form (β, γ) = (β_0, δ/√n). Here δ = (δ_1, …, δ_q)^t is fixed and unknown. This is as in Hjort and Claeskens (2003a) and Claeskens and Hjort (2003); see also the rejoinder Hjort and Claeskens (2003b) to the discussion of these papers.
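The trigamma weight of the second Gamma parametrisation is easy to evaluate with standard-library tools alone. The sketch below (ours) approximates ψ′ by finite differences of log Γ and checks the approximation against the classical value ψ′(1) = π²/6.

```python
import math

def digamma(x, h=1e-5):
    # psi(x) = d/dx log Gamma(x), by central differences of math.lgamma
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def trigamma(x, h=1e-4):
    # psi'(x), by central differences of psi
    return (digamma(x + h) - digamma(x - h)) / (2 * h)

# Accuracy check against the classical value psi'(1) = pi^2 / 6.
assert abs(trigamma(1.0) - math.pi**2 / 6) < 1e-4

eta = 0.3
v = math.exp(2 * eta) * trigamma(math.exp(eta))   # the Example 4 weight at a sample eta
assert v > 0
```

In production one would of course prefer a library trigamma (e.g. `scipy.special.polygamma`) over finite differences; the point is only that the weights of Section 2.3 are all cheap to compute.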
The idea is to prove results about model selectors and estimators in terms of the local model misspecification parameter δ, and to use these for developing appropriate model information criteria.

3.1 Some preliminary results

Suppose in general terms that a parameter of interest µ(β, γ; u) depends on some quantity u that varies in the population being studied. Under the framework outlined above, the true parameter value is µ(β_0, δ/√n; u). We shall now use results developed in Hjort and Claeskens (2003a) and Claeskens and Hjort (2003), with suitable modifications. These rely on certain regularity conditions, detailed in those papers; the conditions are mild and are not repeated here. One condition that we do need to mention, in order to adequately identify the appropriate limits below, is the existence of a limit matrix J to which J_{n,wide} converges, with blocks J_{00}, etc. Similarly this defines the limit version K of K_n, and a fortiori J_S and K_S, the limit versions of J_{n,S} and K_{n,S}. Also, K_S = (π_S K^{-1} π_S^t)^{-1}, the lower right block (J_S^{-1})_{11} of the inverse of J_S.

For each u the theory as developed in Claeskens and Hjort (2003) applies to the subset-model-based maximum likelihood estimators µ̂_S(u), for which we have

  √n {µ̂_S(u) − µ_true(u)} →_d Λ_S(u) = (∂µ(β, γ; u)/∂β)^t J_{00}^{-1} M + ω(u)^t (δ − π_S^t K_S π_S K^{-1} D),

where ω(u) = J_{10} J_{00}^{-1} ∂µ(β, γ; u)/∂β − ∂µ(β, γ; u)/∂γ, with partial derivatives evaluated at the null point (β_0, 0). Furthermore, M ∼ N_p(0, J_{00}) and D ∼ N_q(δ, K), and these random vectors are independent. The (M, D) variables are needed in Section 5, and the variables (C_S, D_S) that appear now are linear functions of (M, D). For a fixed set S and a fixed focus point, Lemma 3.2 of Hjort and Claeskens (2003a) applies, and yields

  ( √n (β̂_S − β_0) ; √n γ̂_S ) →_d ( C_S ; D_S ) ∼ N_{p+|S|}(ξ_S, J_S^{-1}),

where

  ξ_S = ( J_{00}^{-1} J_{01} (I_q − π_S^t K_S π_S K^{-1}) δ ; K_S π_S K^{-1} δ ).

A special case of importance is that of S being the full set {1, …
, q}, for which

  δ̂_wide = √n γ̂_wide →_d D ∼ N_q(δ, K).   (6)

For convenience of notation we adopt the convention that estimators without a subset subscript correspond to the full model, with all γ_1, …, γ_q parameters; thus δ̂ = δ̂_wide, etc. The S-indexed model works with estimates γ̂_S for γ_j with j ∈ S but uses 0 for j ∉ S. To deal efficiently with different functions of these it is convenient to introduce the extended projection matrix of dimension (p + |S|) × (p + q),

  π̃_S = [ I_p         0_{p,q}
           0_{|S|,p}   π_S ].

Allow also the q × q matrix G_S = π_S^t K_S π_S K^{-1}. Then

  √n ( β̂_S − β_0 ; π_S^t γ̂_S − δ/√n ) →_d ( C_S ; π_S^t D_S − δ ),

which is seen to have mean vector and variance matrix

  F_S δ = ( J_{00}^{-1} J_{01} (I_q − G_S) ; −(I_q − G_S) ) δ   and   Γ_S = π̃_S^t J_S^{-1} π̃_S.

3.2 The wFIC for GLM

Consider again the linear predictor µ(x, z) = x^t β + z^t γ, and the collection of submodel-based estimators (4). The vector β̂_S is always of length p, but takes on different values for different index sets S. In contrast, the estimator γ̂_S has length |S|.

The construction of the weighted FIC proceeds as follows. We start with the weighted average quadratic loss function on the scale of the linear predictor,

  L_n(S) = Σ_{i=1}^n w(x_i, z_i) {µ̂_S(x_i, z_i) − µ_true(x_i, z_i)}²
         = Σ_{i=1}^n w(x_i, z_i) (x_i^t β̂_S + z_{i,S}^t γ̂_S − x_i^t β_0 − z_i^t δ/√n)².   (7)

The weights w(x_i, z_i) are user-specified and in general different from the weights v(x_i, z_i) in the glm weight matrix V. We shall show that this random loss has a limit distribution, under mild regularity conditions. Let

  Ω_{n,w} = (1/n) Σ_{i=1}^n w(x_i, z_i) (x_i; z_i)(x_i; z_i)^t,   (8)

and assume this matrix converges in probability to a nonnegative definite Ω_w, depending of course on the weight function w(x, z). To properly study the random loss function it is convenient to express it as
  L_n(S) = n (β̂_S − β_true ; π_S^t γ̂_S − γ_true)^t Ω_{n,w} (β̂_S − β_true ; π_S^t γ̂_S − γ_true).

From the conditions and results noted above,

  L_n(S) →_d L(S) = (C_S ; π_S^t D_S − δ)^t Ω_w (C_S ; π_S^t D_S − δ).

Under mild conditions, the expected loss w-risk_n(S) = E L_n(S) converges to w-risk(S) = E L(S). The limit loss L(S) is a quadratic form in normal variables, and has mean value

  w-risk(S) = E L(S) = δ^t F_S^t Ω_w F_S δ + trace(Ω_w Γ_S)
            = trace(Ω_w F_S δ δ^t F_S^t) + trace(Ω_w Γ_S) = I(S) + II(S),

say, involving the matrices F_S and Γ_S defined above. The term I(S) corresponds to weighted squared bias, whereas the second term II(S) is related to weighted variance. By earlier efforts, this second term can be expressed as II(S) = trace(Ω_w π̃_S^t J_S^{-1} π̃_S).

Estimating the unknown quantities leads to a weighted-focussed information criterion. The second term is not problematic, and we use

  ÎI(S) = trace(Ω_{n,w} π̃_S^t Ĵ_{n,S}^{-1} π̃_S),

where Ĵ_{n,S} is the appropriate submatrix of

  Ĵ_n = (1/n) Σ_{i=1}^n (v̂_i/a(φ̂)) (x_i; z_i)(x_i; z_i)^t = (1/(n a(φ̂))) [ X^t V̂ X   X^t V̂ Z
                                                                            Z^t V̂ X   Z^t V̂ Z ],

and v̂_i is the estimated version of (2), inserting x_i^t β̂_wide + z_i^t γ̂_wide for the linear predictor x_i^t β + z_i^t γ; see the examples in Section 2.3. For the first term, we note that δ̂ δ̂^t = n γ̂ γ̂^t →_d D D^t, a variable with mean δ δ^t + K; recall the convention noted around (6) that γ̂ means γ̂_wide, etc. We therefore use

  Î(S) = trace{Ω_{n,w} F̂_S (δ̂ δ̂^t − K̂_n) F̂_S^t}   if this is positive,
       = 0   otherwise.

The weighted focussed information criterion consists in evaluating

  wFIC(S) = Î(S) + ÎI(S)   (9)

for each candidate model S, and in the end selecting the model with the smallest value of this estimated average risk. Note that all components of the above expressions are easily obtained via standard output of any software for fitting generalised linear models.

Remark 1. When more components are put into S, the I(S) term becomes smaller; for S equal to the full set {1, …, q} we find F_S = 0, making I(wide) = 0.
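The pieces F_S, Γ_S, G_S and the truncation in Î(S) can be assembled into one short routine. The following is our sketch, written in the limit-experiment notation (J, K in place of their hat versions) and run on toy inputs rather than fitted quantities.

```python
import numpy as np

def wfic(S, J, Omega_w, delta_hat, p, q):
    """Estimated average risk (9), with limit quantities J, K standing in for hat versions."""
    Jinv = np.linalg.inv(J)
    K = Jinv[p:, p:]                                   # lower right q x q block of J^{-1}
    piS = np.eye(q)[list(S), :]                        # |S| x q projection pi_S
    pitS = np.vstack([np.hstack([np.eye(p), np.zeros((p, q))]),
                      np.hstack([np.zeros((len(S), p)), piS])])   # extended projection
    JS = pitS @ J @ pitS.T
    GammaS = pitS.T @ np.linalg.inv(JS) @ pitS         # Gamma_S
    if len(S) > 0:
        KS = np.linalg.inv(piS @ np.linalg.inv(K) @ piS.T)
        GS = piS.T @ KS @ piS @ np.linalg.inv(K)       # G_S
    else:
        GS = np.zeros((q, q))
    FS = np.vstack([np.linalg.solve(J[:p, :p], J[:p, p:]) @ (np.eye(q) - GS),
                    -(np.eye(q) - GS)])                # F_S
    sqbias = np.trace(Omega_w @ FS @ (np.outer(delta_hat, delta_hat) - K) @ FS.T)
    return max(sqbias, 0.0) + np.trace(Omega_w @ GammaS)   # I(S) truncated at zero, plus II(S)

p, q = 2, 2
J = np.eye(p + q)
Omega_w = np.eye(p + q)
delta_hat = np.array([2.0, 0.1])                       # one clear and one near-zero component

val_empty = wfic((), J, Omega_w, delta_hat, p, q)
val_one = wfic((0,), J, Omega_w, delta_hat, p, q)
val_full = wfic((0, 1), J, Omega_w, delta_hat, p, q)
assert val_one < val_empty and val_one < val_full      # keeping only the strong component wins
```

For the full set the squared-bias estimate is zero (F_S = 0, as in Remark 1), so the value reduces to trace(Ω_w J^{-1}), which with these toy inputs equals p + q = 4.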
On the other hand, the more components in S, the bigger the variance term II(S). Thus the wFIC method reflects the balance between squared modelling bias and variance.

Remark 2. In linear models the expression for the pointwise (unweighted) FIC has been shown to be exact; see Claeskens and Hjort (2003), Section 5.5. There it is shown that, without assuming normality and when using least squares estimators, the exact expression for the mean squared error matches that obtained by the FIC. This favourable property helps in appreciating the obtained expression for the FIC. More precisely, in a linear model Y_i = x_i^t β + z_i^t γ + ε_i, where µ = x^t β + z^t γ is to be estimated, we use least squares estimators in a submodel indexed by S, leading to the estimator µ̂_S = x^t β̂_S + (π_S z)^t γ̂_S. Computing the exact bias and variance of this estimator leads to the following expression for the mean squared error of µ̂_S:

  n^{-1} (x^t J_{n,00}^{-1} x + ω^t π_S^t K_{n,S} π_S ω) + ω^t (I_q − π_S^t K_{n,S} π_S K_n^{-1}) γ γ^t (I_q − K_n^{-1} π_S^t K_{n,S} π_S) ω.   (10)

This is, up to a constant not depending on the subset S, identical to the limiting expression on which the FIC is based; compare with (5). The same property holds when a (non-random) weight function w is included in the loss function L_n of (7). For linear models, computing the exact mean squared error of L_n(S), with least squares estimators inserted for β̂_S and γ̂_S, leads to an expression equal to the one on which the wFIC of (9) is based. Note that the error variance σ² appears in the denominator of the matrix J_n as the scaling factor a(σ). There is accordingly a σ² factor implicitly featuring in the first two terms of (10), but not in the final two terms.
4 GLM weights and the AIC

The wFIC method developed above provides a quite general and versatile model selection scheme, in that the weights w_i = w(x_i, z_i) are fully user-specified, meant to reflect which aspects are deemed more important than others for the use of the finally selected model. The only caveat is that the weights should not be seriously unstable; the mathematical requirement for our asymptotics to go through is that the Ω_{n,w} matrix converges in probability with increasing sample size. This section discusses various types of weights.

4.1 GLM weights

The log-likelihood structure of a generalised linear model itself suggests a natural type of weights for use in the wFIC method. If one chooses w(x_i, z_i) = v_i/a(φ), then Ω_{n,w} = J_{n,wide}. This leads to simplifications in the w-risk(S) and wFIC(S) expressions, as we shall now see. For simplicity of notation and presentation we work directly in the limit experiment, where J and K etc. replace Ĵ_n and K̂_n etc.; we also write G_S = π_S^t K_S π_S K^{-1}. First look at I(S) = δ^t F_S^t J F_S δ, where some manipulations give

  J F_S = [ J_{00}   J_{01}    [ J_{00}^{-1} J_{01} (I_q − G_S)     [ 0
            J_{10}   J_{11} ]    −I_q + G_S ]                    =    −K^{-1} (I_q − G_S) ].

This leads with some further effort to

  I(S) = δ^t (I_q − G_S)^t K^{-1} (I_q − G_S) δ = δ^t (K^{-1} − K^{-1} π_S^t K_S π_S K^{-1}) δ,

which is estimated by

  Î(S) = trace{(K^{-1} − K^{-1} π_S^t K_S π_S K^{-1})(D D^t − K)}
       = D^t K^{-1} D − D^t K^{-1} π_S^t K_S π_S K^{-1} D − q + |S|,

as long as this expression is positive; otherwise it is truncated to zero. Next,

  II(S) = trace(J π̃_S^t J_S^{-1} π̃_S) = p + |S|.

This follows since π̃_S J π̃_S^t = J_S, so that by cyclic invariance of the trace, trace(J π̃_S^t J_S^{-1} π̃_S) = trace(J_S J_S^{-1}) = p + |S|. The same simplification holds for the data-based version ÎI(S).
To summarise, with glm weights the limit risk takes the form

  w-risk(S) = δ^t (K^{-1} − K^{-1} π_S^t K_S π_S K^{-1}) δ + p + |S|,

and the canonical risk estimate (for the limit experiment) is

  Î(S) + ÎI(S) = D^t (K^{-1} − K^{-1} π_S^t K_S π_S K^{-1}) D + 2|S| + p − q   if N(S) takes place,
               = p + |S|   otherwise.   (11)

Here N(S) is the event that the trace in Î(S) is positive, i.e. that D^t (K^{-1} − K^{-1} π_S^t K_S π_S K^{-1}) D > q − |S|. The event N(S) takes place with high probability if δ is some distance away from zero; but in situations where the underlying γ vector is close to zero, i.e. the narrow model is close to being correct, the probability that N(S) does not take place is significant. The finite-sample version of the risk estimate uses δ̂ for D and K̂_n and K̂_{n,S} for K and K_S.

4.2 The wFIC and the AIC

Now consider Akaike's information criterion, which in the present circumstances takes the form

  AIC(S) = 2 Σ_{i=1}^n log f(y_i; θ̂_i, φ̂) − 2(p + |S|),

with θ̂_i being the appropriate maximum likelihood estimate of θ_i = θ(x_i, z_i), which again depends on β̂_S and γ̂_S. The AIC scores may be computed for each submodel S, down to the most narrow model, which corresponds to S = ∅ and uses only β_1, …, β_p, φ as model parameters. Subtracting the smallest model's AIC value from AIC(S) and performing a one-step Taylor expansion, we find that, in the limiting situation where n → ∞,

  AIC(S) − AIC(∅) →_d D^t K^{-1} π_S^t K_S π_S K^{-1} D − 2|S|;

see Claeskens and Hjort (2003, eq. (2.5)). The best models have the highest AIC scores. We see from this and (11) that the wFIC method, when using glm weights, is essentially large-sample equivalent to the AIC method. The word 'essentially' relates to the modification for truncating an estimate of a squared bias to zero, when relevant, as spelled out in (11). Thus the wFIC provides a fresh perspective on the AIC, and our arguments even suggest a correction to the AIC scores in cases where the event N(S) does not take place.
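The equivalence can be checked numerically: the first case of (11) and the limiting AIC difference sum, for every S, to the quantity D^t K^{-1} D + p − q, which does not depend on S; minimising one is therefore maximising the other whenever the truncation in (11) is not active. The check below is ours, with arbitrary toy values for p, q, K and D.

```python
import numpy as np
from itertools import chain, combinations

rng = np.random.default_rng(1)
p, q = 2, 3
A = rng.normal(size=(q, q))
K = A @ A.T + q * np.eye(q)          # an arbitrary positive definite K
Kinv = np.linalg.inv(K)
D = rng.normal(size=q) + 5.0         # D well away from zero

def quad(S):
    # D^t K^{-1} pi_S^t K_S pi_S K^{-1} D, the quadratic form shared by (11) and the AIC limit
    if not S:
        return 0.0
    piS = np.eye(q)[list(S), :]
    KS = np.linalg.inv(piS @ Kinv @ piS.T)
    return D @ Kinv @ piS.T @ KS @ piS @ Kinv @ D

subsets = list(chain.from_iterable(combinations(range(q), k) for k in range(q + 1)))
risk11 = {S: D @ Kinv @ D - quad(S) + 2 * len(S) + p - q for S in subsets}   # first case of (11)
aic_lim = {S: quad(S) - 2 * len(S) for S in subsets}                         # limit of AIC(S) - AIC(empty)
sums = [risk11[S] + aic_lim[S] for S in subsets]
assert max(sums) - min(sums) < 1e-9   # constant in S: the two criteria induce the same ranking
```

The algebraic identity holds for every subset; the role of N(S) is only to decide whether the first or second branch of (11) is used.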
One may go through the list of examples in Section 2.3 to see the appropriate random loss functions that correspond to the AIC. For the logistic regression setup of Example 3 of that section, for instance, model selection by estimating the mean of

  L_n(S) = Σ_{i=1}^n m_i p(x_i, z_i){1 − p(x_i, z_i)} { log(p̂_{i,S}/p̂_i) − log((1 − p̂_{i,S})/(1 − p̂_i)) }²

is essentially the same as using the AIC. Here p̂_{i,S} is the estimated probability under model S. Similarly, for Poisson regression, basing model selection on estimating

  w-risk_n(S) = E Σ_{i=1}^n exp(x_i^t β + z_i^t γ) (log λ̂_{i,S} − log λ_i)²

will be large-sample equivalent to the AIC scheme, where λ̂_{i,S} is the estimate of the Poisson rate λ_i inside the S model. Of course other weights and other transformations can be worked with, for logistic and Poisson regression, and such alternatives can, in the perspective developed here, be seen as cousins of the AIC method.

4.3 Other types of weights

In addition to the perhaps canonical glm weights discussed above, the following types of weights may be considered.

Uniform weights. A simple choice of weights assigns mass 1/n to each contribution in the sum, with the Ω_n matrix simply equal to the empirical variance matrix of the (p + q)-sized covariate vectors. Another choice is to let w_i = 1 precisely for individuals i belonging to a stratum of interest. This is exemplified in Section 6.

Gliding covariate window. A version of the above is to use w_i = w(x_i, z_i; x_0) equal to 1 for individuals inside a certain neighbourhood of some fixed x_0. This gives a ranking of candidate models around each fixed x_0. Smoothed versions can also be contemplated, also in a context of model assessment where x_0 is moved through its covariate space. A good final model should then rank highly in all these local competitions.

Averaging over z for fixed x. Another version with some appeal is to assess models in terms of how well they perform for a given covariate x, averaged across all likely z values.
To indicate one particular way this may be handled, let w_i ∝ f(z_i − ξ̂_x, Σ̂_x), in terms of the density f of the multinormal distribution N_q(0, Σ̂_x), where ξ̂_x and Σ̂_x are the estimated mean and variance matrix of z given x.

Robust weights. There is a whole area of research addressing issues related to robust model choice. Robustness is concerned with the downweighting (and sometimes identification) of data vectors that are 'outliers' or that may exert too strong an influence on maximum likelihood estimators. The methodology developed in this paper sticks to the maximum likelihood estimators as such, but the weight function of the wFIC allows downweighting schemes that make model selection less dependent on extreme data vectors. One version of this would be

  w(x_i, z_i) = h(‖(x_i, z_i) − (x_0, z_0)‖)   for i = 1, …, n,

where the norm in question could measure a suitable distance from the covariate vectors to some robustly identified centre location (x_0, z_0), and where h(u) could be taken equal to 1 over a broad interval but made to go to zero beyond that interval. Another choice is h(u) = exp(−c|u|) for a perhaps small value of c. Such a scheme assures robustness with respect to extreme or overly influential covariate vectors.

Note, before we come to the next point, that the wFIC methodology also works with weights more general than w_i = w(x_i, z_i), as long as the Ω_{n,w} matrix of (8) converges in probability. Thus weights of the form w_i = h(res_i), functions of suitably defined glm residuals, are allowed. One such version, among several, is

  w_i = h(res_i) = 1   if |res_i| ≤ c,
                 = c/|res_i|   if |res_i| > c,

where res_i = (Y_i − ξ̂_i)/σ̂_i, featuring (perhaps robustified) estimates of the mean and standard deviation of Y_i. Weights of this type are discussed, in the rather different context of robust linear regression, by Ronchetti and Staudte (1994), who also find that c = 1.345 is a reasonable default value for this weight function.
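This residual-based weight is a one-liner; the sketch below (ours) implements h with the default c = 1.345 mentioned above.

```python
# Huber-type downweighting for the wFIC: w_i = h(res_i), with
# h(u) = 1 for |u| <= c and h(u) = c/|u| otherwise (default c = 1.345).

def robust_weight(res, c=1.345):
    """Full weight for small standardised residuals, damped weight beyond c."""
    return 1.0 if abs(res) <= c else c / abs(res)

assert robust_weight(0.5) == 1.0                 # small residual: untouched
assert abs(robust_weight(2.69) - 0.5) < 1e-12    # large residual: damped by c/|res|
```

Since the weights stay in (0, 1] and vary smoothly with the residuals, the convergence requirement on Ω_{n,w} is unproblematic for such schemes under the usual moment conditions.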
5 General risk averages

Above we developed the wFIC method for estimating naturally weighted average risks in generalised linear models. Sometimes different risk averages are called for, however, and this section extends the wFIC to handle such cases. We choose to stay inside the glm framework exposited at the start of Section 3, although more general situations can be considered. A brief motivating example is as follows. Suppose the Y_i are exponentially distributed with parameters θ_i = exp(x_i^t β + z_i^t γ), and that one wishes to estimate the quantile function

  µ(u) = µ(u | x, z) = {−log(1 − u)}/exp(x^t β + z^t γ)   for u ∈ (0, 1).

How can we select a model that provides good estimates µ̂_S(u) across all deciles u = 0.1, …, 0.9, say?

Assume in general terms that a parameter µ(u) is to be estimated, defined in terms of the parameters (β, γ) of a generalised linear model, and depending on some quantity u that may or may not depend on the covariates. As in previous sections, limit distributions are established in the framework where (β, γ) = (β_0, δ/√n), aiming at providing adequate finite-sample approximations for risks and averaged risks. We have

  √n {µ̂_S(u) − µ_true(u)} →_d Λ_S(u) = (∂µ(u)/∂β)^t J_{00}^{-1} M + ω(u)^t (δ − G_S D),

where ω(u) = J_{10} J_{00}^{-1} ∂µ(u)/∂β − ∂µ(u)/∂γ and G_S = π_S^t K_S π_S K^{-1}; also, M and D are independent, with M ∼ N_p(0, J_{00}) and D ∼ N_q(δ, K). Consider the loss average

  L_n(S) = n ∫ {µ̂_S(u) − µ_true(u)}² dW_n(u),

where W_n represents some relevant distribution of u values, like the deciles in the quantile example above. Assuming W_n converges to a suitable weight distribution W (or simply stays fixed, independent of sample size), we have

  L_n(S) →_d L(S) = ∫ Λ_S(u)² dW(u)

under mild conditions. We measure the total averaged risk via the expected value of L_n, which converges to

  w-risk(S) = E L(S) = ∫ E Λ_S(u)² dW(u),

again under mild conditions.
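For the motivating exponential example these ingredients take a fully explicit form. The following display is our own computation from the definitions above, with derivatives evaluated at the null point (β_0, 0):

```latex
% Exponential quantile example: mu(u | x, z) = {-log(1-u)} / exp(x^t beta + z^t gamma), so
\frac{\partial\mu(u)}{\partial\beta} = -\mu(u)\,x,
\qquad
\frac{\partial\mu(u)}{\partial\gamma} = -\mu(u)\,z,
% hence, writing mu_0(u) = {-log(1-u)} exp(-x^t beta_0) for the null-point quantile,
\omega(u) = J_{10}J_{00}^{-1}\,\frac{\partial\mu(u)}{\partial\beta}
          - \frac{\partial\mu(u)}{\partial\gamma}
          = -\mu_0(u)\,\bigl(J_{10}J_{00}^{-1}x - z\bigr).
```

Thus ω(u) is simply the pointwise ω of Section 2.2 rescaled by −µ_0(u), and averaging over the nine deciles u = 0.1, …, 0.9 multiplies the matrix ω ω^t by the single scalar (1/9) Σ_{k=1}^9 µ_0(k/10)².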
Here

E ΛS(u)^2 = τ0(u)^2 + ω(u)^t E(δ − GS D)(δ − GS D)^t ω(u)
          = τ0(u)^2 + ω(u)^t {(Iq − GS)δδ^t (Iq − GS)^t + πS^t KS πS} ω(u),

in terms of τ0(u)^2 = (∂μ(u)/∂β)^t J00^{-1} ∂μ(u)/∂β and the variance matrix

Var GS D = GS K GS^t = πS^t KS πS K^{-1} πS^t KS πS = πS^t KS πS.

This leads to the expression

w-risk(S) = ∫ τ0(u)^2 dW(u) + trace{(Iq − GS)δδ^t (Iq − GS)^t R} + trace(πS^t KS πS R)   (12)

for the limit risk, where

R = ∫ ω(u) ω(u)^t dW(u).

The first term is immaterial since it does not depend on S, so our generalised wFIC naturally becomes

wFIC(S) = Î(S) + ÎI(S) = max[trace{(Iq − Ĝn,S)(δ̂δ̂^t − K̂n)(Iq − Ĝn,S)^t R̂}, 0] + trace(πS^t K̂n,S πS R̂).

Here R̂ is a sample-based estimate of the R matrix. This more general wFIC can be applied when one wishes to consider average risk across both covariates and quantiles, for example.

6 Illustrations and applications

6.1 Diabetic retinopathy data

The Wisconsin Epidemiologic Study of Diabetic Retinopathy (Klein et al., 1984) provides information for studying diabetic retinopathy as a function of several other measurements. The dataset consists of patient information for 348 men and 343 women. The binary outcome variable takes the value Y = 0 when there is no or only mild nonproliferative retinopathy in both eyes, and Y = 1 when there is moderate to severe nonproliferative retinopathy, or proliferative retinopathy, in at least one eye. Variables measured are: duration of diabetes in years (x1), presence of macular edema in at least one eye (z1), percentage of glycosylated hemoglobin (z2), body mass index (z3), pulse rate in beats per 30 seconds (z4), sex (z5, with 1 for male and 0 for female), presence of urine protein (z6), and area of residence, urban or rural (z7). A logistic regression model is used for the analysis.
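A computational sketch of the estimated wFIC(S) above. We assume the standard FIC conventions KS = (πS K^{-1} πS^t)^{-1} and GS = πS^t KS πS K^{-1}, with all arguments taken to be plug-in estimates; the function name wfic and the toy numbers are ours:

```python
import numpy as np

def wfic(delta_hat, K_hat, pi_S, R_hat):
    """Sample version of the generalised wFIC(S):
        max[trace{(I_q - G)(dd^t - K)(I_q - G)^t R}, 0] + trace(pi^t K_S pi R),
    with K_S = (pi_S K^{-1} pi_S^t)^{-1} and G_S = pi_S^t K_S pi_S K^{-1}.
    pi_S is the |S| x q projection matrix selecting the components in S."""
    q = K_hat.shape[0]
    K_inv = np.linalg.inv(K_hat)
    K_S = np.linalg.inv(pi_S @ K_inv @ pi_S.T)
    G_S = pi_S.T @ K_S @ pi_S @ K_inv
    I_G = np.eye(q) - G_S
    sq_bias = np.trace(I_G @ (np.outer(delta_hat, delta_hat) - K_hat) @ I_G.T @ R_hat)
    variance = np.trace(pi_S.T @ K_S @ pi_S @ R_hat)
    return max(sq_bias, 0.0) + variance

# Toy numbers (q = 2): for the full model G_S = I_q, the squared-bias
# estimate vanishes, and wFIC reduces to trace(K R).
K = np.diag([2.0, 3.0]); R = np.eye(2); delta = np.array([1.0, 1.0])
full = wfic(delta, K, np.eye(2), R)              # trace(K R) = 5
sub = wfic(delta, K, np.array([[1.0, 0.0]]), R)  # submodel keeping only z1
```

Note how the truncation at zero is applied once, to the full trace, rather than term by term.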
Since earlier analyses of this dataset found duration of diabetes to be an important variable (see for example Claeskens, Croux and Van Kerckhoven, 2006), we include this variable in all of the models we consider, as well as an intercept term. We perform model selection amongst the other seven variables and allow all possible subsets of the full model, leading to 128 possible models. As model selection criteria we take first wFICM(S) with weight vector (1/nM) I(male) and next wFICF(S) with weight vector (1/nF) I(female), where the weights are indicator variables for men (case 1) and for women (case 2), and nM (resp. nF) denotes the number of men (resp. women) in the dataset. Note that the values of wFIC(S) are computed using the complete dataset; we are not splitting the dataset for model selection.

Table 1: Values of the weighted focussed information criterion, where weights are indicator values for men (1st column) and for women (3rd column), together with the selected variables. All logistic regression models contain an intercept term, as well as the variable x1, duration of diabetes.

        Men                      Women
wFICM     z-variables      wFICF     z-variables
21.201    1,4,6            24.828    1,4,6
23.270    1,3,4,6          24.974    1,4,6,7
24.230    1,2,4,6          26.155    1,2,4,6
24.393    1,4,6,7          26.160    1,3,4,6
27.590    1,2,3,4,6        28.072    1,2,4,6,7
27.767    1,3,4,6,7        28.616    1,3,4,6,7
28.680    1,2,4,6,7        28.673    1,2,3,4,6
29.982    1,4,5,6          31.096    1,4,5,6
31.186    1,3,4,5,6        31.636    1,4,5,6,7
31.345    1,2,4,5,6        32.945    1,3,4,5,6

Table 1 gives for both criteria the ten smallest values, together with the variables in the corresponding models. The best model is in both cases the model containing the binary variables z1 (presence of macular edema in at least one eye) and z6 (indicator for urine protein), as well as z4 (pulse rate).
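The all-subsets enumeration and the subgroup indicator weights can be sketched as follows (function names are ours; the wFIC evaluation of each subset is omitted):

```python
from itertools import combinations

def all_subsets(candidates):
    """All subsets of the candidate z-variables; with 7 candidates this gives
    2^7 = 128 models (intercept and x1 are included in every model)."""
    subsets = []
    for r in range(len(candidates) + 1):
        subsets.extend(combinations(candidates, r))
    return subsets

def indicator_weights(is_male, male=True):
    """Weight vector (1/n_M) I(male), or (1/n_F) I(female) when male=False,
    computed on the full dataset -- no data splitting is needed."""
    group = [float(g == male) for g in is_male]
    n_group = sum(group)
    return [g / n_group for g in group]

models = all_subsets(range(1, 8))           # the 128 candidate subsets of {z1,...,z7}
w = indicator_weights([True, True, False])  # toy data: weights 1/2, 1/2, 0
```

Each subset in `models` would then be scored by wFICM(S) or wFICF(S) using the corresponding weight vector, and the subsets ranked by their scores.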
The subgroups differ in the ranking of the next best models: for men the body mass index (variable z3) is an important variable, while for women the area of residence (z7) is more important, and z3 only shows up in the 4th best ranked model. Thus we learn that body mass index may influence Y in different ways for men and for women, which may be taken into account when building a final model.

As a comparison we also used the overall model selection criteria AIC and BIC. The AIC picks the same model as the wFIC does, namely the model with variables z1, z4 and z6, while the BIC omits from this model the variable z4, pulse rate. These criteria are computed using the complete dataset. If we were to split the data into the results for men and for women separately, and then run the AIC selection method separately on both data parts, the AIC would select the variables z1, z2, z6 for the subset of men, and variables z1, z4 and z6 for the subset of women. No artificial data splitting is needed for the computation of the wFIC.

6.2 CH4 concentrations

This example concerns atmospheric CH4 concentrations derived from flask samples collected at the Shetland Islands of Scotland (Steele, Krummel and Langenfelds, 2002). Monthly values are expressed in parts per billion by volume (ppbv). In total there are 110 monthly measurements, starting in December 1992 and ending in December 2001. The regression variable u = time is rescaled to the (0, 1) interval, and the response variable Y is the CH4 concentration. We use a cosine series estimator based on the model

μ(u) = E(Y | U = u) = β0 + ∑_{j=1}^{m} γj cos(πju),

where we vary the value of m, the truncation point of the series. This defines a sequence of nested models, which fits into the regression context of the previous sections when defining zj = cos(πju). We wish to select the best order m.
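The cosine-basis design matrix behind this sequence of nested models might be built as follows (the helper name cosine_design is ours):

```python
import numpy as np

def cosine_design(u, m):
    """Design matrix for mu(u) = beta0 + sum_{j=1}^m gamma_j cos(pi j u):
    a constant column followed by z_j = cos(pi j u) for j = 1, ..., m."""
    u = np.asarray(u, dtype=float)
    cols = [np.ones_like(u)] + [np.cos(np.pi * j * u) for j in range(1, m + 1)]
    return np.column_stack(cols)

# Time rescaled to (0, 1); a submodel of order m < 15 simply uses the first
# 1 + m columns of the wide-model matrix, which is what makes the models nested.
u = np.linspace(0.0, 1.0, 110)
Z_wide = cosine_design(u, m=15)   # wide model: 110 rows, 1 + 15 columns
```

Truncating columns rather than rebuilding the matrix mirrors how the nested candidate models share their lower-order terms.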
In our modelling efforts we let m be any number between 1 and 15 (the wide model). A scatter plot of the data is shown in Figure 1(a). We applied the wFIC(S) method (9) with equal weights wi = 1, and found that the best model is the one with m = 2; see the figure. As a comparison we also computed FIC values for each of the 110 individual measurement months. This means that we take as focus parameter the mean CH4 concentration at that particular month (without averaging), which leads to a set of 15 FIC values, one for each order m = 1, . . . , 15, and this for each of the 110 months. The results are summarised in the following frequency table for the models individually chosen by the FIC. For example, model order m = 1 was chosen 75 times in the 110 model selection applications, model order m = 2 was chosen 7 times, etc.

m           1    2    3    5    6    7   10   11
frequency  75    7    1    1    3    5    3   15

The overall chosen model with m = 2 is in this case not the model most frequently selected by the individual searches. Remarkably, the model with order 11 is chosen 15 times in the individual search. A possible explanation is that a high-frequency model of order 11 reflects the random variability in the data cloud.

Another set of weights that makes sense for variables measured over time is one that gives more weight to more recent measurements. As an example we used the weighting scheme wi = i/n (i = 1, . . . , n = 110), and found in this particular case the same wFIC-selected model, namely the model with truncation point m = 2.

Figure 1: (a) Scatterplot of the CH4 data, along with the estimated mean curve for the m = 2 model. (b) The wFIC(S) values with equal weights wi = 1.

6.3 Highway data

To illustrate robust downweighting we use Hoffstedt's highway data; see also Weisberg (2005, Section 7.2).
This dataset is used to explain the 1973 accident rate per million vehicle miles as a function of several variables. There are 39 observations. In every model we include an intercept term and x1, the length of the highway segment in miles. Variables to choose from are average daily traffic count in thousands (z1), truck volume as a percent of the total volume (z2), total number of lanes of traffic (z3), number of access points per mile (z4), number of signalised interchanges per mile (z5), number of freeway-type interchanges per mile (z6), speed limit in 1973 (z7), lane width in feet (z8), width of the outer shoulder on the roadway in feet (z9), and finally an indicator of the type of roadway or the source of funding for the road (z10).

Based on robust Cp model selection, Ronchetti and Staudte (1994) support the model which includes, in addition to x1, the variables z5, z6, z7 and z10, and also the model with additional variables z2, z3, z4 and z9. Here we construct weights w1, . . . , w39 based on the robustification method outlined in Section 4.3; these rely on robust estimators of the regression coefficients, initially obtained in the full model. Five of the observations receive a weight smaller than one; all other observations get weight one. The weights that differ from one are 0.903, 0.568, 0.436, 0.577, 0.811. These weights are then used in the wFIC(S) construction, and lead to preferring the model with the two variables x1 and z7. For this particular example, the model selected by the wFIC is more parsimonious than the model suggested by Mallows's Cp.

7 Concluding comments

Our paper has provided a fair middle ground between the two extremes of ‘blind model selection’ (where a model is found via, say, AIC or BIC, without any particular regard to the actual use of the model after selection) and ‘fully focussed selection’ (where a model is selected to perform optimally for a given estimand).
We have seen that special versions of the wFIC correspond to the AIC for generalised linear models, so our methods may be seen as suitable generalisations of the AIC for use in situations where the context of one's modelling and analysis dictates more specific weighting schemes than the default ones. Below we give some concluding remarks, pertaining to themes related to but outside the main scope of the present paper.

1. In this paper we chose to concentrate on the generalised linear models framework, where the methods have a particularly clear structure, but they can be generalised to many other regression structures. Thus parametric regression models for multidimensional data, and for hazard rates with censored data, can be handled with essentially the same methods. Extensions to the semiparametric Cox model are less immediate, but can be accomplished via methods of Hjort and Claeskens (2006).

2. Our paper has developed strategies that for each model rely on maximum likelihood (or asymptotically equivalent) estimators. There is a need to generalise the methods and results to more robust strategies, for example involving M-estimators. This is entirely possible, but requires more work and will result in algebraically and structurally somewhat less elegant methods.

3. Our wFIC(S) = Î(S) + ÎI(S) is not algebraically equivalent to the simpler strategy of averaging individual FIC(S) scores. The two methods agree only in cases where no negative estimates of squared bias need to be set to zero. The wFIC method, as outlined in Sections 3 and 5, performs the squared bias modification only once, at the end, as opposed to performing this operation for each individual application.

4. Our methods stem from precise limit distribution results that involve quantities like J and K, along with further relatives like KS and GS. Our wFIC(S) formulae involve estimates of these quantities, say Ĵn,S, K̂n,S, and so on.
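The distinction in point 3 can be illustrated with two toy squared-bias estimates (the numbers are purely illustrative):

```python
# Two individual squared-bias estimates, one of them negative:
b = [-2.0, 4.0]

# Averaging individual FIC scores truncates each estimate separately ...
per_focus = sum(max(x, 0.0) for x in b) / len(b)   # (0 + 4) / 2 = 2.0

# ... whereas the wFIC truncates only once, after averaging:
once = max(sum(b) / len(b), 0.0)                   # max(1, 0) = 1.0

# per_focus >= once always holds, so the two criteria need not rank
# candidate models in the same order.
```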
The theory behind the methods ensures that these work well as long as the estimates used are consistent under the local misspecification framework (β, γ) = (β0, δ/√n). Among the various possibilities we have chosen to use the ‘wide model perspective’ in our implementations, starting from estimators Ĵn that use the estimates (β̂wide, γ̂wide).

5. We have derived methods for selecting a model, but have not discussed the consequences of having selected the model in this fashion. Methods of Hjort and Claeskens (2003a) and Claeskens and Hjort (2003) make it possible, however, to analyse the performance of the estimator-after-selection, also with the wFIC methods developed in the present article.

6. Though we have been specifically concerned with model selection, methods of Hjort and Claeskens (2003a) can be applied to provide model average procedures, say of the type

μ̂ = ∑_S c(wFIC(S)) μ̂S,

a data-dependent average across the estimators of the individual models, with

c(wFIC(S)) = exp{−½ κ wFIC(S)} / ∑_{S′} exp{−½ κ wFIC(S′)}.

Here κ is an algorithmic parameter, with small κ corresponding to near uniform weighting, while larger κ means giving nearly full weight to the model that wins the wFIC competition. Methods of that paper also make it possible to study the performance of such average estimators, compared to the more usual estimators-post-selection.

7. Our methods have been developed inside a framework of ‘first order asymptotics’ for general parametric models. It might be important to supplement such methods with suitable second order corrections to make them work more precisely for moderate or smaller sample sizes. The content of Remark 2 of Section 3 is that the wFIC(S) expression is exactly correct for each finite n > p + q, when used in the linear model with non-random weights.
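The smoothed model-average weights c(wFIC(S)) can be sketched as follows (the function name and the shift by the minimum, which cancels in the ratio and only improves numerical stability, are ours):

```python
import numpy as np

def smooth_wfic_weights(wfic_values, kappa=1.0):
    """c(wFIC(S)) = exp{-kappa wFIC(S)/2} / sum_{S'} exp{-kappa wFIC(S')/2}.

    Small kappa gives near-uniform weights; large kappa puts nearly all
    weight on the wFIC winner. Shifting by the minimum avoids underflow
    without changing the ratio."""
    v = np.asarray(wfic_values, dtype=float)
    a = np.exp(-0.5 * kappa * (v - v.min()))
    return a / a.sum()

w_small = smooth_wfic_weights([1.0, 2.0, 5.0], kappa=0.01)  # near uniform
w_large = smooth_wfic_weights([1.0, 2.0, 5.0], kappa=50.0)  # winner takes all
```

The model-average estimator is then the weighted sum of the submodel estimators, with these weights.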
This is an indication that the first-order approximations to risks and their estimates are adequate also in other generalised linear models. The situation is a bit more complicated when the weights themselves have a random component, as with the robust type wi = h(resi) discussed in Section 4.3; here approximations stemming from the first-order asymptotics might need adjustments to be more accurate in practice. In other words, even though demonstrably Ln(S) →d L(S) and w-riskn(S) → w-risk(S), one can expect the real variance of Ln(S) to be bigger than that of L(S) in cases with complicated data-dependent weights.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716–723.

Claeskens, G., Croux, C. and Van Kerckhoven, J. (2006). Variable selection for logistic regression using a prediction focussed information criterion. Biometrics, to appear.

Claeskens, G. and Hjort, N. L. (2003). The focused information criterion [with discussion]. Journal of the American Statistical Association 98, 900–916.

Dobson, A. J. (2002). An Introduction to Generalized Linear Models. Chapman and Hall, London.

Hjort, N. L. and Claeskens, G. (2003a). Frequentist model average estimators [with discussion]. Journal of the American Statistical Association 98, 879–899.

Hjort, N. L. and Claeskens, G. (2003b). Rejoinder to ‘The focussed information criterion’ and ‘Frequentist model average estimators’. Journal of the American Statistical Association 98, 938–945.

Hjort, N. L. and Claeskens, G. (2006). Focussed information criteria and model averaging for Cox’s hazard regression model. Journal of the American Statistical Association, to appear.

Klein, R., Klein, B. E. K., Moss, S. E., Davis, M. D. and DeMets, D. L. (1984). The Wisconsin epidemiologic study of diabetic retinopathy: II. Prevalence and risk of diabetic retinopathy when age at diagnosis is less than 30 years. Archives of Ophthalmology 102, 520–526.
Mallows, C. L. (1973). Some comments on Cp. Technometrics 15, 661–675.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman and Hall, London.

Ronchetti, E. (1997). Robustness aspects of model choice. Statistica Sinica 7, 327–338.

Ronchetti, E. and Staudte, R. G. (1994). A robust version of Mallows’ Cp. Journal of the American Statistical Association 89, 550–559.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461–464.

Steele, L. P., Krummel, P. B. and Langenfelds, R. L. (2002). Atmospheric CH4 concentrations from sites in the CSIRO Atmospheric Research GASLAB air sampling network (October 2002 version). In Trends: A Compendium of Data on Global Change, Carbon Dioxide Information Analysis Center, Oak Ridge National Laboratory, U.S. Department of Energy, Oak Ridge, TN, U.S.A.

Weisberg, S. (2005). Applied Linear Regression, 3rd edition. Wiley, New York.