arXiv:1001.2152v1 [math.ST] 13 Jan 2010
Bernoulli 15(4), 2009, 1351–1367. DOI: 10.3150/09-BEJ191

Rate of convergence of predictive distributions for dependent data

PATRIZIA BERTI (1), IRENE CRIMALDI (2), LUCA PRATELLI (3) and PIETRO RIGO (4)

(1) Dipartimento di Matematica Pura ed Applicata “G. Vitali”, Università di Modena e Reggio-Emilia, via Campi 213/B, 41100 Modena, Italy. E-mail: patrizia.berti@unimore.it
(2) Dipartimento di Matematica, Università di Bologna, Piazza di Porta San Donato 5, 40126 Bologna, Italy. E-mail: crimaldi@dm.unibo.it
(3) Accademia Navale, viale Italia 72, 57100 Livorno, Italy. E-mail: pratel@mail.dm.unipi.it
(4) Dipartimento di Economia Politica e Metodi Quantitativi, Università di Pavia, via S. Felice 5, 27100 Pavia, Italy. E-mail: prigo@eco.unipv.it

This paper deals with empirical processes of the type
\[ C_n(B) = \sqrt{n}\,\{\mu_n(B) - P(X_{n+1} \in B \mid X_1, \ldots, X_n)\}, \]
where $(X_n)$ is a sequence of random variables and $\mu_n = (1/n)\sum_{i=1}^n \delta_{X_i}$ the empirical measure. Conditions for $\sup_B |C_n(B)|$ to converge stably (in particular, in distribution) are given, where $B$ ranges over a suitable class of measurable sets. These conditions apply when $(X_n)$ is exchangeable or, more generally, conditionally identically distributed (in the sense of Berti et al. [Ann. Probab. 32 (2004) 2029–2052]). By such conditions, in some relevant situations, one obtains that $\sup_B |C_n(B)| \xrightarrow{P} 0$ or even that $\sqrt{n}\,\sup_B |C_n(B)|$ converges a.s. Results of this type are useful in Bayesian statistics.

Keywords: Bayesian predictive inference; central limit theorem; conditional identity in distribution; empirical distribution; exchangeability; predictive distribution; stable convergence

1. Introduction and motivations

A number of real problems reduce to the evaluation of the predictive distribution
\[ a_n(\cdot) = P(X_{n+1} \in \cdot \mid X_1, \ldots, X_n) \]
for a sequence $X_1, X_2, \ldots$ of random variables.
Here, we focus on those situations where $a_n$ cannot be calculated in closed form and one decides to estimate it based on the available data $X_1, \ldots, X_n$. Related references are [1–3, 5, 6, 8, 10, 15, 18, 20].

This is an electronic reprint of the original article published by the ISI/BS in Bernoulli, 2009, Vol. 15, No. 4, 1351–1367. This reprint differs from the original in pagination and typographic detail. 1350-7265 © 2009 ISI/BS

For notational reasons, it is convenient to work in a coordinate probability space. Accordingly, we fix a measurable space $(S, \mathcal{B})$ and a probability $P$ on $(S^\infty, \mathcal{B}^\infty)$, and we let $X_n$ be the $n$th canonical projection on $(S^\infty, \mathcal{B}^\infty, P)$, $n \ge 1$. We also let $\mathcal{G}_n = \sigma(X_1, \ldots, X_n)$ and $X = (X_1, X_2, \ldots)$.

Since we are concerned with predictive distributions, it is reasonable to make some (qualitative) assumptions about them. In [6], $X$ is said to be conditionally identically distributed (c.i.d.) when
\[ E(I_B(X_k) \mid \mathcal{G}_n) = E(I_B(X_{n+1}) \mid \mathcal{G}_n) \quad \text{a.s. for all } B \in \mathcal{B} \text{ and } k > n \ge 0, \]
where $\mathcal{G}_0$ is the trivial $\sigma$-field. Thus, at each time $n \ge 0$, the future observations $(X_k : k > n)$ are identically distributed given the past $\mathcal{G}_n$. In a sense, this is a weak form of exchangeability. In fact, $X$ is exchangeable if and only if it is stationary and c.i.d., and various examples of non-exchangeable c.i.d. sequences are available.

In the sequel, $X = (X_1, X_2, \ldots)$ is a c.i.d. sequence of random variables. In that case, a sound estimate of $a_n$ is the empirical distribution
\[ \mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}. \]
The choice of $\mu_n$ can be defended as follows. Let $\mathcal{D} \subset \mathcal{B}$ and let $\|\cdot\|$ denote the sup-norm on $\mathcal{D}$. Suppose also that $\mathcal{D}$ is countably determined, as defined in Section 2. (The latter is a mild condition, only needed to handle measurability issues.) Then
\[ \|\mu_n - a_n\| = \sup_{B \in \mathcal{D}} |\mu_n(B) - a_n(B)| \xrightarrow{\text{a.s.}} 0, \tag{1} \]
provided ($X$ is c.i.d. and) $\mu_n$ converges uniformly on $\mathcal{D}$ with probability 1; see [5]. For instance, $\|\mu_n - a_n\| \xrightarrow{\text{a.s.}} 0$ whenever $X$ is exchangeable and $\mathcal{D}$ is a Glivenko–Cantelli class. Also, $\|\mu_n - a_n\| \xrightarrow{\text{a.s.}} 0$ if $S = \mathbb{R}$, $\mathcal{D} = \{(-\infty, t] : t \in \mathbb{R}\}$, and $X_1$ has a discrete distribution or $\inf_{\varepsilon > 0} \liminf_n P(|X_{n+1} - X_n| < \varepsilon) = 0$; see [4].

To sum up, under mild assumptions, $\mu_n$ is a consistent estimate of $a_n$ (with respect to uniform distance) for c.i.d. data. This is in line with de Finetti [10] in the particular case of exchangeable indicators.

Taking (1) as a starting point, the next step is to investigate the convergence rate, that is, to investigate whether $\alpha_n \|\mu_n - a_n\|$ converges in distribution, possibly to a null limit, for suitable constants $\alpha_n > 0$. This is precisely the purpose of this paper.

A first piece of information on the convergence rate of $\|\mu_n - a_n\|$ can be obtained as follows. For $B \in \mathcal{B}$, define
\[ \mu(B) = \limsup_n \mu_n(B), \qquad W_n(B) = \sqrt{n}\,\{\mu_n(B) - \mu(B)\}. \]
By the SLLN for c.i.d. sequences, $\mu_n(B) \xrightarrow{\text{a.s.}} \mu(B)$; see [6]. Hence, for fixed $n \ge 0$ and $B \in \mathcal{B}$, one obtains
\[ E(\mu(B) \mid \mathcal{G}_n) = \lim_k E(\mu_k(B) \mid \mathcal{G}_n) = \lim_k \frac{1}{k} \sum_{i=n+1}^{k} E(I_B(X_i) \mid \mathcal{G}_n) = E(I_B(X_{n+1}) \mid \mathcal{G}_n) = a_n(B) \quad \text{a.s.} \]
In turn, this implies that $\sqrt{n}\,\{\mu_n(B) - a_n(B)\} = E(W_n(B) \mid \mathcal{G}_n)$ a.s., so
\[ \|\mu_n - a_n\| \le \frac{1}{\sqrt{n}} \sup_{B \in \mathcal{D}} E(|W_n(B)| \mid \mathcal{G}_n) \le \frac{1}{\sqrt{n}} E(\|W_n\| \mid \mathcal{G}_n) \quad \text{a.s.} \]
If $\sup_n E\|W_n\|^k < \infty$ for some $k \ge 1$, it then follows that
\[ E\{(\alpha_n \|\mu_n - a_n\|)^k\} \le \Bigl(\frac{\alpha_n}{\sqrt{n}}\Bigr)^k E\|W_n\|^k \to 0 \quad \text{whenever } \frac{\alpha_n}{\sqrt{n}} \to 0. \]
Even if obvious, this fact is potentially useful since
\[ \sup_n E\|W_n\|^k < \infty \quad \text{for all } k \ge 1, \text{ if } X \text{ is exchangeable}, \tag{2} \]
for various choices of $\mathcal{D}$; see Remark 3. In particular, (2) holds if $\mathcal{D}$ is finite.

The intriguing case, however, is $\alpha_n = \sqrt{n}$. For each $B \in \mathcal{B}$ and probability $Q$ on $(S^\infty, \mathcal{B}^\infty)$, write
\[ C_n^Q(B) = E_Q(W_n(B) \mid \mathcal{G}_n) \quad \text{and} \quad C_n(B) = C_n^P(B) = \sqrt{n}\,\{\mu_n(B) - a_n(B)\}. \]
In Theorem 3.3 of [6], the asymptotic behavior of $C_n(B)$ is investigated for fixed $B$. Here, instead, we are interested in
\[ \|C_n\| = \sup_{B \in \mathcal{D}} |C_n(B)| = \sqrt{n}\,\|\mu_n - a_n\|. \]
Our main result (Theorem 1) is the following.
Fix a random probability measure $N$ on $\mathbb{R}$ and a probability $Q$ on $(S^\infty, \mathcal{B}^\infty)$ such that $\|C_n^Q\| \to N$ stably under $Q$ and $\|W_n\|$ is uniformly integrable under both $P$ and $Q$. Then,
\[ \|C_n\| \to N \text{ stably} \quad \text{whenever } P \ll Q. \tag{3} \]
A remarkable particular case is $N = \delta_0$. Suppose, in fact, that for some $Q$, one has $\|C_n^Q\| \xrightarrow{Q} 0$ and $\|W_n\|$ uniformly integrable under $P$ and $Q$. Then, $\|C_n\| \xrightarrow{P} 0$ whenever $P \ll Q$.

Stable convergence (in the sense of Rényi) is a stronger form of convergence in distribution. The definition is recalled in Section 2.

In general, one cannot dispense with the uniform integrability condition. However, this condition is often true. For instance, $\|W_n\|$ is uniformly integrable (under $P$ and $Q$) provided $\mathcal{D}$ meets (2) and $X$ is exchangeable (under $P$ and $Q$).

To make (3) concrete, a large list of reference probabilities $Q$ is needed. Various examples are available in the Bayesian nonparametrics framework; see, for example, [16] and references therein. The most popular is perhaps the Ferguson–Dirichlet law, denoted by $Q_0$. If $P = Q_0$, then $X$ is exchangeable and
\[ a_n(B) = \frac{\alpha P(X_1 \in B) + n \mu_n(B)}{\alpha + n} \quad \text{a.s. for some constant } \alpha > 0. \]
Since $\|\mu_n - a_n\| \le \alpha/n$ when $P = Q_0$, something more than $\|C_n\| \xrightarrow{P} 0$ can be expected in the case $P \ll Q_0$. Indeed, we prove that
\[ n\|\mu_n - a_n\| = \sqrt{n}\,\|C_n\| \quad \text{converges a.s.} \]
whenever $P \ll Q_0$ with a density satisfying a certain condition; see Theorem 2 and Corollary 5.

One more example should be mentioned. Let $X_n = (Y_n, Z_n)$, where $Z_n > 0$ and
\[ P(Y_{n+1} \in B \mid \mathcal{G}_n) = \frac{\alpha P(Y_1 \in B) + \sum_{i=1}^n Z_i I_B(Y_i)}{\alpha + \sum_{i=1}^n Z_i} \quad \text{a.s.} \]
for some constant $\alpha > 0$. Under some conditions, $X$ is c.i.d. (but not necessarily exchangeable), $\|W_n\|$ is uniformly integrable and $\|C_n\|$ converges stably; see Section 4.

The above material takes a nicer form when the condition $P \ll Q$ can be given a simple characterization. This happens, for instance, if $S = \{x_1, \ldots, x_k, x_{k+1}\}$ is finite, $X$ exchangeable and $P(X_1 = x) > 0$ for all $x \in S$.
Then, $P \ll Q_0$ (for some choice of $Q_0$) if and only if $(\mu\{x_1\}, \ldots, \mu\{x_k\})$ has an absolutely continuous distribution with respect to Lebesgue measure. In this particular case, however, a part of our results can also be obtained through the Bernstein–von Mises theorem; see Section 3.

Finally, we make two remarks:

(i) If $X$ is exchangeable, our results apply to Bayesian predictive inference. Suppose, in fact, that $S$ is Polish and $\mathcal{B}$ the Borel $\sigma$-field, so that de Finetti’s theorem applies. Then, $P$ is a unique mixture of product probabilities on $\mathcal{B}^\infty$ and the mixing measure is called the prior distribution in a Bayesian framework. Now, given $Q$, $P \ll Q$ is just an assumption on the prior distribution. This is plain in the last example where $S = \{x_1, \ldots, x_k, x_{k+1}\}$. In Bayesian terms, such an example can be summarized as follows. For a multinomial statistical model, $\|C_n\| \xrightarrow{P} 0$ if the prior is absolutely continuous with respect to Lebesgue measure, and $\sqrt{n}\,\|C_n\|$ converges a.s. if the prior density satisfies a certain condition.

(ii) To our knowledge, there is no general representation for the predictive distributions of an exchangeable sequence. Such a representation would be very useful. Even if only partially, results like (3) contribute to filling the gap. As an example, for fixed $B \in \mathcal{B}$, one obtains $a_n(B) = \mu_n(B) + o_P(1/\sqrt{n})$, provided $X$ is exchangeable and $P \ll Q$ for some $Q$ such that $C_n^Q(B) \xrightarrow{Q} 0$ and $W_n(B)$ is uniformly integrable.

2. Main results

A few definitions need to be recalled. Let $T$ be a metric space, $\mathcal{B}_T$ the Borel $\sigma$-field on $T$ and $(\Omega, \mathcal{A}, P)$ a probability space. A random probability measure on $T$ is a mapping $N$ on $\Omega \times \mathcal{B}_T$ such that:
(i) $N(\omega, \cdot)$ is a probability on $\mathcal{B}_T$ for each $\omega \in \Omega$;
(ii) $N(\cdot, B)$ is $\mathcal{A}$-measurable for each $B \in \mathcal{B}_T$.
Let $(Z_n)$ be a sequence of $T$-valued random variables and $N$ a random probability measure on $T$. Both $(Z_n)$ and $N$ are defined on $(\Omega, \mathcal{A}, P)$.
We say that $Z_n$ converges stably to $N$ in the case where
\[ P(Z_n \in \cdot \mid H) \to E(N(\cdot) \mid H) \quad \text{weakly for all } H \in \mathcal{A} \text{ such that } P(H) > 0. \]
Clearly, if $Z_n \to N$ stably, then $Z_n$ converges in distribution to the probability law $E(N(\cdot))$ (just let $H = \Omega$). Stable convergence was introduced by Rényi in [17] and subsequently investigated by various authors; see [9] for more information.

Next, we say that $\mathcal{D} \subset \mathcal{B}$ is countably determined in the case where, for some fixed countable subclass $\mathcal{D}_0 \subset \mathcal{D}$, one obtains
\[ \sup_{B \in \mathcal{D}_0} |\nu_1(B) - \nu_2(B)| = \sup_{B \in \mathcal{D}} |\nu_1(B) - \nu_2(B)| \]
for every pair $\nu_1, \nu_2$ of probabilities on $\mathcal{B}$. A sufficient condition is that for some countable $\mathcal{D}_0 \subset \mathcal{D}$, and for every $\varepsilon > 0$, $B \in \mathcal{D}$ and probability $\nu$ on $\mathcal{B}$, there is $B_0 \in \mathcal{D}_0$ satisfying $\nu(B \,\Delta\, B_0) < \varepsilon$. Most classes $\mathcal{D}$ involved in applications are countably determined. For instance, $\mathcal{D} = \{(-\infty, t] : t \in \mathbb{R}^k\}$ and $\mathcal{D} = \{\text{closed balls}\}$ are countably determined if $S = \mathbb{R}^k$ and $\mathcal{B}$ is the Borel $\sigma$-field. As another example, $\mathcal{D} = \mathcal{B}$ is countably determined if $\mathcal{B}$ is countably generated.

We are now in a position to state our main result. Let $N$ be a random probability measure on $\mathbb{R}$, defined on the measurable space $(S^\infty, \mathcal{B}^\infty)$, and let $Q$ be a probability on $(S^\infty, \mathcal{B}^\infty)$.

Theorem 1. Let $\mathcal{D}$ be countably determined. Suppose $\|C_n^Q\| \to N$ stably under $Q$, and $(\|W_n\| : n \ge 1)$ is uniformly integrable under $P$ and $Q$. Then,
\[ \|C_n\| = \sqrt{n}\,\|\mu_n - a_n\| \to N \text{ stably} \quad \text{whenever } P \ll Q. \]

Proof. Since $\mathcal{D}$ is countably determined, there are no measurability problems in taking $\sup_{B \in \mathcal{D}}$. In particular, $\|W_n\|$ and $\|C_n\|$ are random variables and $\|C_n\|$ is $\mathcal{G}_n$-measurable. Let $f$ be a version of $dP/dQ$ and $U_n = f - E_Q(f \mid \mathcal{G}_n)$. Then,
\[ C_n(B) = E(W_n(B) \mid \mathcal{G}_n) = \frac{E_Q(f W_n(B) \mid \mathcal{G}_n)}{E_Q(f \mid \mathcal{G}_n)} = C_n^Q(B) + \frac{E_Q(U_n W_n(B) \mid \mathcal{G}_n)}{E_Q(f \mid \mathcal{G}_n)}, \quad P\text{-a.s., for each } B \in \mathcal{B}. \]
Letting
\[ M_n = \frac{E_Q(|U_n| \, \|W_n\| \mid \mathcal{G}_n)}{E_Q(f \mid \mathcal{G}_n)} \]
and taking $\sup_{B \in \mathcal{D}}$, it follows that
\[ \|C_n^Q\| - M_n \le \|C_n\| \le \|C_n^Q\| + M_n, \quad P\text{-a.s.} \]
We first assume $f$ to be bounded.
Since $\|C_n^Q\| \to N$ stably under $Q$, given a bounded random variable $Z$ on $(S^\infty, \mathcal{B}^\infty)$, one obtains
\[ \int \phi(\|C_n^Q\|)\, Z \, dQ \longrightarrow \int N(\phi)\, Z \, dQ \]
for each bounded continuous $\phi : \mathbb{R} \to \mathbb{R}$, where $N(\phi) = \int \phi(x)\, N(\cdot, dx)$. Letting $Z = f I_H / P(H)$ with $H \in \mathcal{B}^\infty$ and $P(H) > 0$, it follows that $\|C_n^Q\| \to N$ stably under $P$. Therefore, it suffices to prove that $E M_n \to 0$.

Given $\varepsilon > 0$, since $\|W_n\|$ is uniformly integrable under $Q$, there exists some $c > 0$ such that
\[ E_Q\{\|W_n\| I_{\{\|W_n\| > c\}}\} < \frac{\varepsilon}{\sup f} \quad \text{for all } n. \]
Since $M_n$ is $\mathcal{G}_n$-measurable,
\[ E M_n = E_Q(f M_n) = E_Q(E_Q(f \mid \mathcal{G}_n) M_n) = E_Q(|U_n| \, \|W_n\|) \le c\, E_Q|U_n| + (\sup f)\, E_Q(\|W_n\| I_{\{\|W_n\| > c\}}) < c\, E_Q|U_n| + \varepsilon \quad \text{for all } n. \]
Therefore, the martingale convergence theorem implies that
\[ \limsup_n E M_n \le c \limsup_n E_Q|U_n| + \varepsilon = \varepsilon. \]
This concludes the proof when $f$ is bounded.

Next, let $f$ be any density. Fix $k > 0$ such that $P(f \le k) > 0$ and define $K = \{f \le k\}$ and $P_K(\cdot) = P(\cdot \mid K)$. Then, $P_K$ has the bounded density $f I_K / P(K)$ with respect to $Q$. By what has already been proven, $\|C_n^{P_K}\| \to N$ stably under $P_K$, where
\[ C_n^{P_K}(B) = E_{P_K}(W_n(B) \mid \mathcal{G}_n) = \frac{E\{I_K W_n(B) \mid \mathcal{G}_n\}}{E(I_K \mid \mathcal{G}_n)}, \quad P_K\text{-a.s.} \]
Letting $R_n = I_K - E(I_K \mid \mathcal{G}_n)$, it follows that
\[ E\{I_K \|C_n - C_n^{P_K}\|\} = E\Bigl\{ I_K \sup_{B \in \mathcal{D}} \frac{|E\{R_n W_n(B) \mid \mathcal{G}_n\}|}{E(I_K \mid \mathcal{G}_n)} \Bigr\} \le E\Bigl\{ I_K \frac{E\{|R_n| \, \|W_n\| \mid \mathcal{G}_n\}}{E(I_K \mid \mathcal{G}_n)} \Bigr\} = E\{|R_n| \, \|W_n\|\} \le c\, E|R_n| + E\{\|W_n\| I_{\{\|W_n\| > c\}}\} \]
for all $c > 0$. Since $E|R_n| \to 0$ and $\|W_n\|$ is uniformly integrable under $P$, arguing as above gives that
\[ E_{P_K}\bigl| \|C_n\| - \|C_n^{P_K}\| \bigr| \le \frac{E\{I_K \|C_n - C_n^{P_K}\|\}}{P(K)} \longrightarrow 0. \]
Therefore, $\|C_n\| \to N$ stably under $P_K$. Finally, fix $H \in \mathcal{B}^\infty$ with $P(H) > 0$ and a bounded continuous function $\phi : \mathbb{R} \to \mathbb{R}$. Then $P(H \cap K) = P(H \cap \{f \le k\}) > 0$ for $k$ sufficiently large and
\[ P(H)\, |E(\phi(\|C_n\|) \mid H) - E(N(\phi) \mid H)| \le 2 \sup|\phi| \, P(f > k) + |E(\phi(\|C_n\|) \mid H \cap K) - E(N(\phi) \mid H \cap K)|. \]
Since $E(\phi(\|C_n\|) \mid H \cap K) \to E(N(\phi) \mid H \cap K)$ as $n \to \infty$ and $P(f > k) \to 0$ as $k \to \infty$, this concludes the proof. □

Next, we deal with the particular case $Q = Q_0$, where $Q_0$ is a Ferguson–Dirichlet law on $(S^\infty, \mathcal{B}^\infty)$.
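As a reminder of why $Q_0$ is a convenient reference law, recall from Section 1 that under $P = Q_0$ the predictive distribution has the closed form $a_n(B) = \{\alpha P(X_1 \in B) + n\mu_n(B)\}/(\alpha + n)$, so that $\|\mu_n - a_n\| \le \alpha/n$. The following Python sketch checks this bound numerically. It is an illustration only, under assumptions not made in the paper: $S$ is a small finite set, $\mathcal{D}$ is the power set of $S$, and the function names, the base measure and the value of $\alpha$ are hypothetical choices.

```python
import random


def dirichlet_predictive(counts, base, alpha):
    """Predictive a_n{x} = (alpha * base{x} + count{x}) / (alpha + n) on a finite S."""
    n = sum(counts.values())
    return {x: (alpha * base[x] + counts[x]) / (alpha + n) for x in base}


def sup_distance(counts, base, alpha):
    """sup over all subsets B of S of |mu_n(B) - a_n(B)|.

    For two probabilities on a finite set, this sup equals the sum of the
    positive parts of the pointwise differences.
    """
    n = sum(counts.values())
    a_n = dirichlet_predictive(counts, base, alpha)
    return sum(max(counts[x] / n - a_n[x], 0.0) for x in base)


def simulate_urn_distances(base, alpha, n_steps, seed=0):
    """Draw X_1, ..., X_n sequentially from the predictive rule and
    return the list of ||mu_n - a_n|| for n = 1, ..., n_steps."""
    rng = random.Random(seed)
    states = list(base)
    counts = {x: 0 for x in states}
    distances = []
    for _ in range(n_steps):
        a_n = dirichlet_predictive(counts, base, alpha)
        x = rng.choices(states, weights=[a_n[s] for s in states])[0]
        counts[x] += 1
        distances.append(sup_distance(counts, base, alpha))
    return distances


if __name__ == "__main__":
    alpha = 2.0                               # hypothetical total mass
    base = {"a": 0.5, "b": 0.3, "c": 0.2}     # hypothetical law of X_1
    dists = simulate_urn_distances(base, alpha, 1000, seed=1)
    # The bound ||mu_n - a_n|| <= alpha / n holds along the whole path.
    print(all(d <= alpha / (i + 1) for i, d in enumerate(dists)))
```

The bound holds path by path because $\mu_n(B) - a_n(B) = \alpha\{\mu_n(B) - P(X_1 \in B)\}/(\alpha + n)$ for every $B$.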
If $P \ll Q_0$ with a density satisfying a certain condition, the convergence rate of $\|\mu_n - a_n\|$ can be remarkably improved.

Theorem 2. Suppose $\mathcal{D}$ is countably determined and $\sup_n E_{Q_0}\|W_n\|^2 < \infty$. Then, $\sqrt{n}\,\|C_n\| = n\|\mu_n - a_n\|$ converges a.s., provided $P \ll Q_0$ and
\[ E_{Q_0}(f^2) - E_{Q_0}\{E_{Q_0}(f \mid \mathcal{G}_n)^2\} = O\Bigl(\frac{1}{n}\Bigr) \quad \text{for some version } f \text{ of } \frac{dP}{dQ_0}. \]

Proof. Let $D_n(B) = \sqrt{n}\, C_n(B)$. Then, $\|D_n\|$ is $\mathcal{G}_n$-measurable (as $\mathcal{D}$ is countably determined) and
\[ E(\|D_{n+1}\| \mid \mathcal{G}_n) = E\Bigl( \sup_{B \in \mathcal{D}} \Bigl| \sum_{i=1}^{n+1} I_B(X_i) - (n+1) E(\mu(B) \mid \mathcal{G}_{n+1}) \Bigr| \,\Big|\, \mathcal{G}_n \Bigr) \]
\[ \ge \sup_{B \in \mathcal{D}} \Bigl| E\Bigl( \sum_{i=1}^{n+1} I_B(X_i) \,\Big|\, \mathcal{G}_n \Bigr) - (n+1) E(\mu(B) \mid \mathcal{G}_n) \Bigr| = \sup_{B \in \mathcal{D}} \Bigl| \sum_{i=1}^{n} I_B(X_i) - n E(\mu(B) \mid \mathcal{G}_n) \Bigr| = \|D_n\| \quad \text{a.s.} \]
Since $\|D_n\|$ is a $\mathcal{G}_n$-submartingale, it suffices to prove that $\sup_n E\|D_n\| < \infty$.

Let $U_n = f - E_0(f \mid \mathcal{G}_n)$, where $E_0$ stands for $E_{Q_0}$. By assumption, there exist $c_1, c_2 > 0$ such that
\[ E_0\|W_n\|^2 \le c_1, \qquad n E_0 U_n^2 = n\{E_0(f^2) - E_0(E_0(f \mid \mathcal{G}_n)^2)\} \le c_2 \quad \text{for all } n. \]
As noted in Section 1, since $Q_0$ is a Ferguson–Dirichlet law, there is an $\alpha > 0$ such that
\[ \sqrt{n}\,\|C_n^{Q_0}\| = \sqrt{n} \sup_{B \in \mathcal{D}} |E_0(W_n(B) \mid \mathcal{G}_n)| \le \alpha \quad \text{for all } n. \]
Define $M_n = E_0(|U_n| \, \|W_n\| \mid \mathcal{G}_n) / E_0(f \mid \mathcal{G}_n)$ and recall that $\|C_n\| \le \|C_n^{Q_0}\| + M_n$, $P$-a.s.; see the proof of Theorem 1. Then, for all $n$, one obtains
\[ E\|D_n\| = \sqrt{n}\, E\|C_n\| \le \sqrt{n}\,(E\|C_n^{Q_0}\| + E M_n) \le \alpha + \sqrt{n}\, E_0(f M_n) = \alpha + \sqrt{n}\, E_0(|U_n| \, \|W_n\|) \]
\[ \le \alpha + \sqrt{n}\, \sqrt{E_0 U_n^2 \; E_0\|W_n\|^2} \le \alpha + \sqrt{c_1\, n E_0 U_n^2} \le \alpha + \sqrt{c_1 c_2}. \]
□

Finally, we clarify a point raised in Section 1.

Remark 3. There is a long list of (countably determined) choices of $\mathcal{D}$ such that
\[ \sup_n E\|W_n\|^k \le c(k) \quad \text{for all } k \ge 1, \text{ if } X \text{ is i.i.d.}, \]
where $c(k)$ is some universal constant; see, for example, Sections 2.14.1 and 2.14.2 of [21]. Fix one such $\mathcal{D}$, $k \ge 1$, and suppose that $S$ is Polish and $\mathcal{B}$ is the Borel $\sigma$-field. If $X$ is exchangeable, then de Finetti’s theorem yields $E(\|W_n\|^k \mid \mathcal{T}) \le c(k)$ a.s. for all $n$, where $\mathcal{T}$ is the tail $\sigma$-field of $X$. Hence, $E\|W_n\|^k = E\{E(\|W_n\|^k \mid \mathcal{T})\} \le c(k)$ for all $n$. This proves inequality (2).

3. Exchangeable data with finite state space

When $X$ is exchangeable and $S$ finite, there is some overlap between Theorem 1 and a result of Bernstein and von Mises.

3.1. Connections with the Bernstein–von Mises theorem

For each $\theta$ in an open set $\Theta \subset \mathbb{R}^k$, let $P_\theta$ be a product probability on $(S^\infty, \mathcal{B}^\infty)$ (that is, $X$ is i.i.d. under $P_\theta$). Suppose the map $\theta \mapsto P_\theta(B)$ is Borel measurable for fixed $B \in \mathcal{B}^\infty$. Given a (prior) probability $\pi$ on the Borel subsets of $\Theta$, define
\[ P(B) = \int P_\theta(B)\, \pi(d\theta), \quad B \in \mathcal{B}^\infty. \]
Roughly speaking, the Bernstein–von Mises (BvM) theorem can be stated as follows. Suppose $\pi$ is absolutely continuous with respect to Lebesgue measure and the statistical model $(P_\theta : \theta \in \Theta)$ is suitably “smooth” (we refer to [13] for a detailed exposition of what “smooth” means). For each $n$, suppose that $\theta$ admits a (consistent) maximum likelihood estimator $\hat\theta_n$. Further, suppose the prior $\pi$ possesses the first moment and denote by $\theta_n^*$ the posterior mean of $\theta$. Then,
\[ \sqrt{n}\,(\hat\theta_n - \theta_n^*) \xrightarrow{P_{\theta_0}} 0 \]
for each $\theta_0 \in \Theta$ such that the density of $\pi$ is strictly positive and continuous at $\theta_0$. Actually, the BvM theorem yields much more than asserted; what is reported above is just the corollary connected to this paper. We refer to [13] and [14] for more information and historical notes; see also [18].

Assuming a smooth, finite-dimensional statistical model is fundamental; see, for example, [11]. Indeed, the BvM theorem does not apply when the only information is that $X$ is exchangeable (or even c.i.d.) and $P \ll Q$ for some reference probability $Q$.

One exception, however, is when $S$ is finite. Let us suppose $S = \{x_1, \ldots, x_k, x_{k+1}\}$, $X$ is exchangeable, $P(X_1 = x) > 0$ for all $x \in S$ and $\mathcal{D} = \mathcal{B}$ = power set of $S$. Also, let $\lambda$ denote Lebesgue measure on $\mathbb{R}^k$ and $\pi$ the probability distribution of $\theta = (\mu\{x_1\}, \ldots, \mu\{x_k\})$. As noted in Section 1, $\pi \ll \lambda$ if and only if $P \ll Q_0$ for some choice of $Q_0$.
Since $\mathcal{D}$ is finite and $X$ exchangeable under $P$ and $Q_0$, $\|W_n\|$ is uniformly integrable under $P$ and $Q_0$. Thus, Theorem 1 yields $\|C_n\| \xrightarrow{P} 0$ whenever $\pi \ll \lambda$. On the other hand, $\pi$ is the prior distribution for this problem. The underlying statistical model is smooth and finite-dimensional (it is just a multinomial model). Further, for each $n$, the maximum likelihood estimator and the posterior mean of $\theta$ are, respectively,
\[ \hat\theta_n = (\mu_n\{x_1\}, \ldots, \mu_n\{x_k\}), \qquad \theta_n^* = (a_n\{x_1\}, \ldots, a_n\{x_k\}). \]
Thus, the BvM theorem implies that $\|C_n\| \xrightarrow{P} 0$, provided $\pi \ll \lambda$ and the density of $\pi$ is continuous on the complement of a $\pi$-null set.

To sum up, in this particular case, the same conclusions as from Theorem 1 can be drawn from the BvM theorem. Unlike the latter, however, Theorem 1 does not require any conditions on the density of $\pi$.

3.2. Some consequences of Theorems 1 and 2

In this subsection, we focus on $S = \{0, 1\}$. Thus, $\mathcal{D} = \mathcal{B}$ = power set of $S$ and $\lambda$ denotes Lebesgue measure on $\mathbb{R}$. Let $\mathcal{N}(0, a)$ denote the one-dimensional Gaussian law with mean 0 and variance $a \ge 0$ (where $\mathcal{N}(0, 0) = \delta_0$).

Our first result allows $\pi$ to have a discrete part.

Corollary 4. With $S = \{0, 1\}$, let $\pi$ be the probability distribution of $\mu\{1\}$ and
\[ \Delta = \{\theta \in [0, 1] : \pi\{\theta\} > 0\}, \qquad A = \{\omega \in S^\infty : \mu(\omega, \{1\}) \in \Delta\}. \]
Define the random probability measure $N$ on $\mathbb{R}$ as
\[ N = (1 - I_A)\,\delta_0 + I_A\, \mathcal{N}(0, \mu\{1\}(1 - \mu\{1\})). \]
If $X$ is exchangeable and $\pi$ does not have a singular continuous part, then
\[ C_n\{1\} \to N \text{ stably} \quad \text{and} \quad \|C_n\| \to N \circ h^{-1} \text{ stably}, \]
where $h(x) = |x|$, $x \in \mathbb{R}$, is the modulus function.

Proof. By standard arguments, the corollary holds when $\pi(\Delta) \in (0, 1)$, provided it holds when $\pi(\Delta) = 0$ and $\pi(\Delta) = 1$. Let $\pi(\Delta) = 0$. Then, $\pi \ll \lambda$ as $\pi$ does not have a singular continuous part, and the corollary follows from Theorem 1. Thus, it can be assumed that $\pi(\Delta) = 1$. Since $C_n\{0\} = -C_n\{1\}$, $\|C_n\| = |C_n\{1\}|$ and the modulus function is continuous, it suffices to prove that $C_n\{1\} \to N$ stably.
Next, exchangeability of $X$ implies that $W_n\{1\} \to \mathcal{N}(0, \mu\{1\}(1 - \mu\{1\}))$ stably; see, for example, Theorem 3.1 of [6]. Since $\pi(\Delta) = 1$, we have $N = \mathcal{N}(0, \mu\{1\}(1 - \mu\{1\}))$ a.s. Hence, it is enough to show that $E|C_n\{1\} - W_n\{1\}| \to 0$.

Fix $\varepsilon > 0$ and let $M_n = W_n\{1\}$. Since $X$ is exchangeable, $M_n$ is uniformly integrable. Therefore, there exists some $c > 0$ such that
\[ \sup_n E(|M_n| I_{\{|M_n| > c\}}) < \frac{\varepsilon}{4}. \]
Define $\phi(x) = x$ if $|x| \le c$, $\phi(x) = c$ if $x > c$, and $\phi(x) = -c$ if $x < -c$. Since $C_n\{1\} = E(M_n \mid \mathcal{G}_n)$ a.s., it follows that
\[ E|C_n\{1\} - W_n\{1\}| \le E|E(M_n \mid \mathcal{G}_n) - E(\phi(M_n) \mid \mathcal{G}_n)| + E|E(\phi(M_n) \mid \mathcal{G}_n) - \phi(M_n)| + E|\phi(M_n) - M_n| \]
\[ \le E|E(\phi(M_n) \mid \mathcal{G}_n) - \phi(M_n)| + 4 E(|M_n| I_{\{|M_n| > c\}}) < E|E(\phi(M_n) \mid \mathcal{G}_n) - \phi(M_n)| + \varepsilon \quad \text{for all } n. \]
Write $\Delta = \{a_1, a_2, \ldots\}$ and $M_{n,j} = \sqrt{n}\,(\mu_n\{1\} - a_j)$. Since $\sigma(M_{n,j}) \subset \mathcal{G}_n$ and $P(\mu\{1\} \in \Delta) = \pi(\Delta) = 1$, so that $\phi(M_n) = \sum_j \phi(M_{n,j}) I_{\{\mu\{1\} = a_j\}}$ a.s., one also obtains
\[ E|E(\phi(M_n) \mid \mathcal{G}_n) - \phi(M_n)| \le \sum_j E\bigl| E(\phi(M_{n,j}) I_{\{\mu\{1\} = a_j\}} \mid \mathcal{G}_n) - \phi(M_{n,j}) I_{\{\mu\{1\} = a_j\}} \bigr| \]
\[ = \sum_j E\bigl| \phi(M_{n,j}) \{P(\mu\{1\} = a_j \mid \mathcal{G}_n) - I_{\{\mu\{1\} = a_j\}}\} \bigr| \le c \sum_{j=1}^{m} E\bigl| P(\mu\{1\} = a_j \mid \mathcal{G}_n) - I_{\{\mu\{1\} = a_j\}} \bigr| + 2c \sum_{j > m} \pi\{a_j\} \]
for all $m, n$. By the martingale convergence theorem, $E|P(\mu\{1\} = a_j \mid \mathcal{G}_n) - I_{\{\mu\{1\} = a_j\}}| \to 0$ as $n \to \infty$, for each $j$. Thus,
\[ \limsup_n E|C_n\{1\} - W_n\{1\}| \le \varepsilon + 2c \sum_{j > m} \pi\{a_j\} \quad \text{for all } m. \]
Taking the limit as $m \to \infty$, and recalling that $\varepsilon$ is arbitrary, completes the proof. □

If $\pi$ is singular continuous, we conjecture that $C_n\{1\}$ converges stably to a non-null limit. However, we do not have a proof.

In the next result, a real function $g$ on $(0, 1)$ is said to be almost Lipschitz in the case where $x \mapsto g(x) x^a (1 - x)^b$ is Lipschitz on $(0, 1)$ for some reals $a, b < 1$.

Corollary 5. Suppose $S = \{0, 1\}$, $X$ is exchangeable and $\pi$ is the probability distribution of $\mu\{1\}$. If $\pi$ admits an almost Lipschitz density with respect to $\lambda$, then $\sqrt{n}\,\|C_n\|$ converges a.s. to a real random variable.

Proof. Let $V = \mu\{1\}$. By assumption, there exist $a, b < 1$ and a version $g$ of $d\pi/d\lambda$ such that $\phi(\theta) = g(\theta)\theta^a(1 - \theta)^b$ is Lipschitz on $(0, 1)$.
For each $u_1, u_2 > 0$, we can take $Q_0$ such that $V$ has a beta distribution with parameters $u_1, u_2$ under $Q_0$. Let $Q_0$ be such that $V$ has a beta distribution with parameters $u_1 = 1 - a$ and $u_2 = 1 - b$ under $Q_0$. Then, for any $n \ge 1$ and $x_1, \ldots, x_n \in \{0, 1\}$, one obtains
\[ P(X_1 = x_1, \ldots, X_n = x_n) = \int_0^1 \theta^r (1 - \theta)^{n-r}\, \pi(d\theta) = \int_0^1 \theta^{r-a} (1 - \theta)^{n-r-b} \phi(\theta)\, d\theta = c \int V^r (1 - V)^{n-r} \phi(V)\, dQ_0, \]
where $r = \sum_{i=1}^n x_i$ and $c > 0$ is a constant. Let $h = c\phi$. Then, $h$ is Lipschitz and $f = h(V)$ is a version of $dP/dQ_0$.

Let $V_n = E_0(V \mid \mathcal{G}_n)$, where $E_0$ stands for $E_{Q_0}$. Since $h$ is Lipschitz,
\[ |f - E_0(f \mid \mathcal{G}_n)| \le |h(V) - h(V_n)| + E_0(|h(V) - h(V_n)| \mid \mathcal{G}_n) \le d\,|V - V_n| + d\, E_0(|V - V_n| \mid \mathcal{G}_n), \]
where $d$ is the Lipschitz constant of $h$. Since $E_0\|C_n^{Q_0}\|^2 \le E_0\|W_n\|^2$ and
\[ \sqrt{n}\,|V - V_n| = |C_n^{Q_0}\{1\} - W_n\{1\}| \le \|C_n^{Q_0}\| + \|W_n\|, \]
it follows that
\[ E_0(f^2) - E_0(E_0(f \mid \mathcal{G}_n)^2) = E_0\{(f - E_0(f \mid \mathcal{G}_n))^2\} \le 4d^2 E_0\{(V - V_n)^2\} \le \frac{4d^2}{n} E_0\{(\|C_n^{Q_0}\| + \|W_n\|)^2\} \le \frac{16 d^2}{n} E_0\|W_n\|^2. \]
Since $\sup_n E_0\|W_n\|^2 < \infty$, we have $E_0(f^2) - E_0(E_0(f \mid \mathcal{G}_n)^2) = O(1/n)$. An application of Theorem 2 completes the proof. □

Corollaries 4 and 5 deal with $S = \{0, 1\}$, but similar results can be proven for any finite $S$; see also [12] and [19].

4. Generalized Pólya urns

In this section, based on Examples 1.3 and 3.5 of [6], the asymptotic behavior of $\|C_n\|$ is investigated for a certain c.i.d. sequence.

Let $(\mathcal{Y}, \mathcal{B}_{\mathcal{Y}})$ be a measurable space, $\mathcal{B}_+$ the Borel $\sigma$-field on $(0, \infty)$ and
\[ S = \mathcal{Y} \times (0, \infty), \qquad \mathcal{B} = \mathcal{B}_{\mathcal{Y}} \otimes \mathcal{B}_+, \qquad X_n = (Y_n, Z_n), \]
where $Y_n(\omega) = y_n$, $Z_n(\omega) = z_n$ for all $\omega = (y_1, z_1, y_2, z_2, \ldots) \in S^\infty$. Given a law $P$ on $\mathcal{B}^\infty$, it is assumed that
\[ P(Y_{n+1} \in B \mid \mathcal{G}_n) = \frac{\alpha P(Y_1 \in B) + \sum_{i=1}^n Z_i I_B(Y_i)}{\alpha + \sum_{i=1}^n Z_i} \quad \text{a.s., } n \ge 1, \tag{4} \]
\[ P(Z_{n+1} \in C \mid X_1, \ldots, X_n, Y_{n+1}) = P(Z_1 \in C) \quad \text{a.s., } n \ge 0, \tag{5} \]
for some constant $\alpha > 0$ and all $B \in \mathcal{B}_{\mathcal{Y}}$, $C \in \mathcal{B}_+$. Note that $(Z_n)$ is i.i.d. and $Z_{n+1}$ is independent of $(Y_1, Z_1, \ldots, Y_n, Z_n, Y_{n+1})$ for all $n \ge 0$.
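The sampling scheme (4)–(5) is easy to simulate. The sketch below is a minimal Python illustration for the two-point case $\mathcal{Y} = \{0, 1\}$ with $B = \{1\}$; the function names, the value of $\alpha$, the initial probability `p0` $= P(Y_1 = 1)$ and the uniform law chosen for the weights are all assumptions made for the example, not part of the paper.

```python
import math
import random


def predictive(alpha, p0, ys, zs):
    """Right-hand side of (4) for B = {1}: weighted share of past ones.

    With empty histories it returns p0 = P(Y_1 = 1).
    """
    numerator = alpha * p0 + sum(z * y for y, z in zip(ys, zs))
    denominator = alpha + sum(zs)
    return numerator / denominator


def simulate_urn(alpha, p0, n_steps, draw_weight, seed=0):
    """Generate (Y_1, Z_1), ..., (Y_n, Z_n) according to (4)-(5).

    draw_weight(rng) returns one positive weight; the weights are i.i.d.
    and drawn independently of the past and of the current Y, as in (5).
    """
    rng = random.Random(seed)
    ys, zs = [], []
    for _ in range(n_steps):
        a = predictive(alpha, p0, ys, zs)
        ys.append(1 if rng.random() < a else 0)
        zs.append(draw_weight(rng))
    return ys, zs


def c_n(alpha, p0, ys, zs):
    """C_n{1} = sqrt(n) * (mu_n{1} - a_n{1}) for the observed history."""
    n = len(ys)
    return math.sqrt(n) * (sum(ys) / n - predictive(alpha, p0, ys, zs))


if __name__ == "__main__":
    # Hypothetical choice: weights uniform on [0.5, 1.5].
    ys, zs = simulate_urn(2.0, 0.3, 5000, lambda rng: rng.uniform(0.5, 1.5), seed=7)
    print(c_n(2.0, 0.3, ys, zs))
```

With `draw_weight` identically equal to 1, `predictive` reduces to the Ferguson–Dirichlet formula $(\alpha P(Y_1 = 1) + n\mu_n\{1\})/(\alpha + n)$ recalled in Section 1.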
In real problems, the $Z_n$ should be viewed as weights, while the $Y_n$ describe the phenomenon of interest. As an example, consider an urn containing white and black balls. At each time $n \ge 1$, a ball is drawn and then replaced together with $Z_n$ more balls of the same color. Let $Y_n$ be the indicator of the event {white ball at time $n$} and suppose that $Z_n$ is chosen according to a fixed distribution on the integers, independently of $(Y_1, Z_1, \ldots, Y_{n-1}, Z_{n-1}, Y_n)$. The predictive distributions of $X$ are then given by (4)–(5). Also, note that the probability law of $(Y_n)$ is Ferguson–Dirichlet in the case where $Z_n = 1$ for all $n$.

It is not hard to prove that $X$ is c.i.d. We state this fact as a lemma.

Lemma 6. The sequence $X$ assessed according to (4)–(5) is c.i.d.

Proof. Fix $k > n \ge 0$ and $A \in \mathcal{B}_{\mathcal{Y}} \otimes \mathcal{B}_+$. By a monotone class argument, it can be assumed that $A = B \times C$, where $B \in \mathcal{B}_{\mathcal{Y}}$ and $C \in \mathcal{B}_+$. Further, it can be assumed that $k = n + 2$. Let $n = 0$ and $\mathcal{G}_0$ be the trivial $\sigma$-field. Since $X_2 \sim X_1$ (as is easily seen), $E(I_B(Y_2) I_C(Z_2) \mid \mathcal{G}_0) = E(I_B(Y_1) I_C(Z_1) \mid \mathcal{G}_0)$ a.s. If $n \ge 1$, define $\mathcal{G}_n^* = \sigma(X_1, \ldots, X_n, Z_{n+1})$. Noting that $E(I_B(Y_{n+1}) \mid \mathcal{G}_n^*) = E(I_B(Y_{n+1}) \mid \mathcal{G}_n)$ a.s., one obtains
\[ E(I_B(Y_{n+2}) \mid \mathcal{G}_n^*) = E\{E(I_B(Y_{n+2}) \mid \mathcal{G}_{n+1}) \mid \mathcal{G}_n^*\} = \frac{\alpha P(Y_1 \in B) + \sum_{i=1}^n Z_i I_B(Y_i) + Z_{n+1} E(I_B(Y_{n+1}) \mid \mathcal{G}_n^*)}{\alpha + \sum_{i=1}^{n+1} Z_i} \]
\[ = \frac{(\alpha + \sum_{i=1}^n Z_i)\, E(I_B(Y_{n+1}) \mid \mathcal{G}_n) + Z_{n+1} E(I_B(Y_{n+1}) \mid \mathcal{G}_n)}{\alpha + \sum_{i=1}^{n+1} Z_i} = E(I_B(Y_{n+1}) \mid \mathcal{G}_n) = E(I_B(Y_{n+1}) \mid \mathcal{G}_n^*) \quad \text{a.s.} \]
Finally, since $\mathcal{G}_n \subset \mathcal{G}_n^*$, the previous equality implies that
\[ E(I_B(Y_{n+2}) I_C(Z_{n+2}) \mid \mathcal{G}_n) = P(Z_1 \in C)\, E\{E(I_B(Y_{n+2}) \mid \mathcal{G}_n^*) \mid \mathcal{G}_n\} = P(Z_1 \in C)\, E\{E(I_B(Y_{n+1}) \mid \mathcal{G}_n^*) \mid \mathcal{G}_n\} = E(I_B(Y_{n+1}) I_C(Z_{n+1}) \mid \mathcal{G}_n) \quad \text{a.s.} \]
Therefore, $X$ is c.i.d. □

Usually, one is interested in predicting $Y_n$ more than $Z_n$. Thus, in the sequel, we focus on $P(Y_{n+1} \in B \mid \mathcal{G}_n)$. For each $B \in \mathcal{B}_{\mathcal{Y}}$, we write
\[ C_n(B) = C_n(B \times (0, \infty)), \qquad a_n(B) = a_n(B \times (0, \infty)) = P(Y_{n+1} \in B \mid \mathcal{G}_n), \]
and so on.
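The central identity in the proof of Lemma 6, namely $E(I_B(Y_{n+2}) \mid \mathcal{G}_n^*) = E(I_B(Y_{n+1}) \mid \mathcal{G}_n)$, can be verified in exact arithmetic for the two-point case $\mathcal{Y} = \{0, 1\}$, $B = \{1\}$. The Python sketch below does this for one hypothetical history; the function names and the numerical values are illustrative assumptions, and the check covers only this single step, not the full c.i.d. property.

```python
from fractions import Fraction


def predictive(alpha, p0, ys, zs):
    """a_n{1} from (4): (alpha*p0 + sum Z_i Y_i) / (alpha + sum Z_i), exactly."""
    numerator = alpha * p0 + sum(z * y for y, z in zip(ys, zs))
    denominator = alpha + sum(zs)
    return Fraction(numerator) / Fraction(denominator)


def mean_next_predictive(alpha, p0, ys, zs, z_next):
    """E(a_{n+1}{1} | G_n, Z_{n+1} = z_next).

    By (5), Y_{n+1} is independent of Z_{n+1} given G_n, so Y_{n+1} = 1
    with probability a_n{1} whatever the value of z_next.
    """
    a_n = predictive(alpha, p0, ys, zs)
    a_if_one = predictive(alpha, p0, ys + [1], zs + [z_next])
    a_if_zero = predictive(alpha, p0, ys + [0], zs + [z_next])
    return a_n * a_if_one + (1 - a_n) * a_if_zero


if __name__ == "__main__":
    alpha, p0 = Fraction(2), Fraction(3, 10)          # hypothetical values
    ys = [1, 0, 0, 1]                                 # hypothetical history
    zs = [Fraction(1, 2), Fraction(3), Fraction(5, 4), Fraction(2)]
    for z_next in (Fraction(1, 3), Fraction(1), Fraction(7)):
        # The conditional mean does not depend on z_next and equals a_n{1}.
        assert mean_next_predictive(alpha, p0, ys, zs, z_next) == predictive(alpha, p0, ys, zs)
    print(predictive(alpha, p0, ys, zs))
```

Using `Fraction` rather than floating point makes the equality exact, which mirrors the algebraic cancellation in the displayed computation.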
In Example 3.5 of [6], assuming $EZ_1^2 < \infty$, it is shown that
\[ C_n(B) \to \mathcal{N}(0, \sigma_B^2) \text{ stably}, \quad \text{where } \sigma_B^2 = \frac{\operatorname{var}(Z_1)}{(EZ_1)^2}\, \mu(B)(1 - \mu(B)). \]
Here, we prove that $C_n$ converges stably when regarded as a map $C_n : S^\infty \to l^\infty(\mathcal{D})$, where $l^\infty(\mathcal{D})$ is the space of real bounded functions on $\mathcal{D}$ equipped with uniform distance; see Section 1.5 of [21]. In particular, stable convergence of $C_n$ as a random element of $l^\infty(\mathcal{D})$ implies stable convergence of $\|C_n\| = \sup_{B \in \mathcal{D}} |C_n(B)|$.

Intuitively, the stable limit of $C_n$ (when it exists) is connected to the Brownian bridge. Let $B_1, B_2, \ldots$ be pairwise disjoint elements of $\mathcal{B}_{\mathcal{Y}}$ and
\[ \mathcal{D} = \{B_k \times (0, \infty) : k \ge 1\}, \qquad T_0 = 0, \qquad T_k = \sum_{i=1}^k \mu(B_i). \]
Also, let $G$ be a standard Brownian bridge process on some probability space $(\Omega_0, \mathcal{A}_0, P_0)$. For fixed $\omega \in S^\infty$,
\[ L(\omega, B_k) = \frac{\sqrt{\operatorname{var}(Z_1)}}{EZ_1}\, \{G(T_k(\omega)) - G(T_{k-1}(\omega))\} \]
is a real random variable on $(\Omega_0, \mathcal{A}_0, P_0)$. Since the $B_k$ are pairwise disjoint and $G$ has continuous paths, $L(\omega, B_k) \to 0$ as $k \to \infty$. It thus makes sense to define $M(\omega, \cdot)$ as the probability distribution of
\[ L(\omega) = (L(\omega, B_1), L(\omega, B_2), \ldots), \]
that is, $M(\omega, A) = P_0(L(\omega) \in A)$ for each Borel set $A \subset l^\infty(\mathcal{D})$. Similarly, let $N(\omega, \cdot)$ be the probability distribution of $\sup_{k \ge 1} |L(\omega, B_k)|$, that is,
\[ N(\omega, A) = P_0\Bigl( \sup_{k \ge 1} |L(\omega, B_k)| \in A \Bigr) \quad \text{for each Borel set } A \subset \mathbb{R}. \]

Theorem 7. Suppose $B_1, B_2, \ldots \in \mathcal{B}_{\mathcal{Y}}$ are pairwise disjoint and $\mathcal{D}$, $M$, $N$ are defined as above. Let $X$ be assessed according to (4)–(5) with $a \le Z_1 \le b$ a.s. for some constants $0 < a < b$. Then,
\[ \sup_n E\|W_n\|^2 \le c\, \sqrt{P\Bigl( Y_1 \in \bigcup_k B_k \Bigr)} \tag{6} \]
for some constant $c$ independent of the $B_k$, and
\[ C_n \to M \text{ stably (in the metric space } l^\infty(\mathcal{D})). \]
In particular, $\|C_n\| \to N$ stably.

Let $Q_1$ denote the probability law of a sequence $X$ satisfying (4)–(5) and $a \le Z_1 \le b$ a.s. In view of Theorem 7, $Q_1$ can play the role of $Q$ in Theorem 1. That is, for an arbitrary c.i.d. sequence $X$ with distribution $P$, one has $\|C_n\| \to N$ stably, provided $P \ll Q_1$ and $\|W_n\|$ is uniformly integrable under $P$.
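For a fixed $\omega$, the limit law $N(\omega, \cdot)$ is simply the distribution of the largest absolute scaled increment of a Brownian bridge over consecutive intervals of lengths $\mu(\omega, B_1), \mu(\omega, B_2), \ldots$, so it can be sampled directly. The Python sketch below does so under stated assumptions: the values $\mu(\omega, B_k)$ are supplied by hand as a finite list `probs` (a truncation when there are infinitely many $B_k$), the moments of $Z_1$ are given, and all function names are hypothetical. It uses the standard representation $G(t) = W(t) - tW(1)$ of the bridge through a Brownian motion $W$.

```python
import math
import random


def bridge_increments(probs, rng):
    """Increments G(T_k) - G(T_{k-1}) of a standard Brownian bridge over
    consecutive intervals of lengths probs[0], probs[1], ...

    Uses G(t) = W(t) - t * W(1): each Brownian increment xi_k ~ N(0, p_k)
    is corrected by p_k * W(1).
    """
    rest = 1.0 - sum(probs)
    if rest < -1e-12:
        raise ValueError("interval lengths must sum to at most 1")
    rest = max(rest, 0.0)
    xi = [rng.gauss(0.0, math.sqrt(p)) for p in probs]
    w1 = sum(xi) + rng.gauss(0.0, math.sqrt(rest))  # W(1), including the uncovered part
    return [x - p * w1 for x, p in zip(xi, probs)]


def sample_sup_limit(probs, mean_z, var_z, n_samples, seed=0):
    """Draws from the law of sup_k |L(omega, B_k)| for given mu(omega, B_k) = probs[k]."""
    rng = random.Random(seed)
    scale = math.sqrt(var_z) / mean_z
    return [
        scale * max(abs(g) for g in bridge_increments(probs, rng))
        for _ in range(n_samples)
    ]


if __name__ == "__main__":
    # Hypothetical inputs: three sets with limit frequencies 0.2, 0.3, 0.1,
    # and weights with mean 1 and variance 1/12 (e.g. uniform on [0.5, 1.5]).
    draws = sample_sup_limit([0.2, 0.3, 0.1], mean_z=1.0, var_z=1.0 / 12.0, n_samples=10000)
    print(sum(draws) / len(draws))
```

The increments produced this way have variances $p_k(1 - p_k)$ and covariances $-p_k p_j$, which, after scaling by $\operatorname{var}(Z_1)/(EZ_1)^2$, is the covariance structure one expects for disjoint $B_k$.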
The condition of pairwise disjoint $B_k$ is actually rather strong. However, it holds in at least two relevant situations: when a single set $B$ is involved, and when $S = \{x_1, x_2, \ldots\}$ is countable and $B_k = \{x_k\}$ for all $k$.

Proof of Theorem 7. This proof involves some simple but long calculations. Accordingly, we provide only a sketch of the proof and refer to [7] for details.

Since $X$ is c.i.d., for fixed $B \in \mathcal{B}_{\mathcal{Y}}$, one has $a_n(B) = E(\mu(B) \mid \mathcal{G}_n)$ a.s. Hence, $(a_n(B) : n \ge 1)$ is a $\mathcal{G}_n$-martingale with $a_n(B) \xrightarrow{\text{a.s.}} \mu(B)$ and this implies that
\[ E\{(a_{n+1}(B) - \mu(B))^2\} = E\Bigl\{ \Bigl( \sum_{j > n} (a_j(B) - a_{j+1}(B)) \Bigr)^2 \Bigr\} = \sum_{j > n} E\{(a_j(B) - a_{j+1}(B))^2\}. \]
Replacing $a_j(B)$ by (4) and using the fact that $a \le Z_i \le b$ a.s. for all $i$, a long but straightforward calculation yields $\sum_{j > n} E\{(a_j(B) - a_{j+1}(B))^2\} \le \frac{c_1}{n} P(Y_1 \in B)$, where $c_1$ is a constant independent of $B$. It follows that
\[ E\|a_{n+1} - \mu\|^2 = E\Bigl\{ \sup_k (a_{n+1}(B_k) - \mu(B_k))^2 \Bigr\} \le \sum_k E\{(a_{n+1}(B_k) - \mu(B_k))^2\} = \sum_k \sum_{j > n} E\{(a_j(B_k) - a_{j+1}(B_k))^2\} \]
\[ \le \frac{c_1}{n} \sum_k P(Y_1 \in B_k) = \frac{c_1}{n} P\Bigl( Y_1 \in \bigcup_k B_k \Bigr) \quad \text{as the } B_k \text{ are pairwise disjoint.} \]
Precisely as above, after some algebra, one obtains
\[ E\|\mu_n - a_{n+1}\|^2 \le \frac{c_2}{n} \sqrt{P\Bigl( Y_1 \in \bigcup_k B_k \Bigr)} \]
for some constant $c_2$ independent of $B_1, B_2, \ldots$. Therefore,
\[ E\|W_n\|^2 = n E\|\mu_n - \mu\|^2 \le 2n E\|\mu_n - a_{n+1}\|^2 + 2n E\|a_{n+1} - \mu\|^2 \le c\, \sqrt{P\Bigl( Y_1 \in \bigcup_k B_k \Bigr)}, \]
where $c = 2(c_1 + c_2)$. This proves inequality (6).

It remains to prove that $C_n \to M$ stably (in the metric space $l^\infty(\mathcal{D})$). For each $m \ge 1$, let $\Sigma_m$ be the $m \times m$ matrix with elements
\[ \sigma_{k,j} = \frac{\operatorname{var}(Z_1)}{(EZ_1)^2}\, (\mu(B_k \cap B_j) - \mu(B_k)\mu(B_j)), \quad k, j = 1, \ldots, m. \]
By Theorems 1.5.4 and 1.5.6 of [21], for $C_n \to M$ stably, it is enough that:
(i) (finite-dimensional convergence): $(C_n(B_1), \ldots, C_n(B_m)) \to \mathcal{N}_m(0, \Sigma_m)$ stably for each $m \ge 1$, where $\mathcal{N}_m(0, \Sigma_m)$ is the $m$-dimensional Gaussian law with mean 0 and covariance matrix $\Sigma_m$;
(ii) (asymptotic tightness): for each $\varepsilon, \delta > 0$, there exists some $m \ge 1$ such that
\[ \limsup_n P\Bigl( \sup_{r,s > m} |C_n(B_r) - C_n(B_s)| > \varepsilon \Bigr) < \delta. \]
Fix $m \ge 1$, $b_1, \ldots, b_m \in \mathbb{R}$ and define $R_n = \sum_{k=1}^m b_k I_{B_k}(Y_n)$. Since $(R_n : n \ge 1)$ is c.i.d., arguing exactly as in Example 3.5 of [6], one obtains
\[ \sum_{k=1}^m b_k C_n(B_k) = \frac{\sum_{i=1}^n \{R_i - E(R_{n+1} \mid \mathcal{G}_n)\}}{\sqrt{n}} \longrightarrow \mathcal{N}\Bigl( 0, \sum_{k,j} b_k b_j \sigma_{k,j} \Bigr) \quad \text{stably.} \]
Since $b_1, \ldots, b_m$ are arbitrary, (i) holds. To check (ii), given $\varepsilon, \delta > 0$, take $m$ such that
\[ P\Bigl( Y_1 \in \bigcup_{r > m} B_r \Bigr) < \Bigl( \frac{\varepsilon^2 \delta}{4c} \Bigr)^2, \]
where $c$ is the constant involved in (6). By what has already been proven,
\[ P\Bigl( \sup_{r,s > m} |C_n(B_r) - C_n(B_s)| > \varepsilon \Bigr) \le P\Bigl( 2 \sup_{r > m} |C_n(B_r)| > \varepsilon \Bigr) \le P\Bigl( 2 E\Bigl\{ \sup_{r > m} |W_n(B_r)| \,\Big|\, \mathcal{G}_n \Bigr\} > \varepsilon \Bigr) \]
\[ \le \frac{4}{\varepsilon^2} E\Bigl\{ \sup_{r > m} W_n(B_r)^2 \Bigr\} \le \frac{4c}{\varepsilon^2} \sqrt{P\Bigl( Y_1 \in \bigcup_{r > m} B_r \Bigr)} < \delta. \]
Thus, (ii) holds and this completes the proof. □

Acknowledgments

This paper benefited from the helpful suggestions of two anonymous referees.

References

[1] Algoet, P.H. (1992). Universal schemes for prediction, gambling and portfolio selection. Ann. Probab. 20 901–941. MR1159579
[2] Algoet, P.H. (1995). Universal prediction schemes (correction). Ann. Probab. 23 474–478. MR1330780
[3] Berti, P. and Rigo, P. (2002). A uniform limit theorem for predictive distributions. Statist. Probab. Lett. 56 113–120. MR1881164
[4] Berti, P., Pratelli, L. and Rigo, P. (2002). Almost sure uniform convergence of empirical distribution functions. Int. Math. J. 2 1237–1250. MR1939015
[5] Berti, P., Mattei, A. and Rigo, P. (2002). Uniform convergence of empirical and predictive measures. Atti Semin. Mat. Fis. Univ. Modena 50 465–477. MR1958292
[6] Berti, P., Pratelli, L. and Rigo, P. (2004). Limit theorems for a class of identically distributed random variables. Ann. Probab. 32 2029–2052. MR2073184
[7] Berti, P., Crimaldi, I., Pratelli, L. and Rigo, P. (2008). Rate of convergence of predictive distributions for dependent data. Technical report. Available at http://amsacta.cib.unibo.it/2538/1/be-cri-pra-ri-preprint.pdf.
[8] Blackwell, D. and Dubins, L.E. (1962). Merging of opinions with increasing information. Ann. Math. Statist. 33 882–886. MR0149577
[9] Crimaldi, I., Letta, G. and Pratelli, L. (2007). A strong form of stable convergence. In Séminaire de Probabilités XL. Lecture Notes in Math. 1899 203–225. Berlin: Springer. MR2409006
[10] de Finetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives. Ann. Inst. H. Poincaré 7 1–68. MR1508036
[11] Freedman, D. (1999). On the Bernstein–von Mises theorem with infinite-dimensional parameters. Ann. Statist. 27 1119–1140. MR1740119
[12] Ghosh, J.K., Sinha, B.K. and Joshi, S.N. (1982). Expansions for posterior probability and integrated Bayes risk. In Statistical Decision Theory and Related Topics III 1 403–456. New York: Academic Press. MR0705299
[13] Ghosh, J.K. and Ramamoorthi, R.V. (2003). Bayesian Nonparametrics. New York: Springer. MR1992245
[14] Le Cam, L. and Yang, G.L. (1990). Asymptotics in Statistics: Some Basic Concepts. New York: Springer. MR1066869
[15] Morvai, G. and Weiss, B. (2005). Forward estimation for ergodic time series. Ann. Inst. H. Poincaré Probab. Statist. 41 859–870. MR2165254
[16] Pitman, J. (1996). Some developments of the Blackwell–MacQueen urn scheme. In Statistics, Probability and Game Theory (T.S. Ferguson, L.S. Shapley and J.B. MacQueen, eds.). IMS Lecture Notes Monogr. Ser. 30 245–267. Hayward, CA: Inst. Math. Statist. MR1481784
[17] Rényi, A. (1963). On stable sequences of events. Sankhyā A 25 293–302. MR0170385
[18] Romanovsky, V. (1931). Sulle probabilità “a posteriori”. Giornale dell’Istituto Italiano degli Attuari 4 493–511.
[19] Strasser, H. (1977). Improved bounds for equivalence of Bayes and maximum likelihood estimation. Theory Probab. Appl. 22 349–361. MR0440778
[20] Stute, W. (1986). On almost sure convergence of conditional empirical distribution functions. Ann. Probab. 14 891–901. MR0841591
[21] van der Vaart, A. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. New York: Springer. MR1385671

Received February 2008 and revised December 2008