Supplementary materials for this article are available online. Please click the JASA link at http://pubs.amstat.org.
Latent Stick-Breaking Processes
Abel R ODRÍGUEZ, David B. D UNSON, and Alan E. G ELFAND
We develop a model for stochastic processes with random marginal distributions. Our model relies on a stick-breaking construction for the
marginal distribution of the process, and introduces dependence across locations by using a latent Gaussian copula model as the mechanism
for selecting the atoms. The resulting latent stick-breaking process (LaSBP) induces a random partition of the index space, with points
closer in space having a higher probability of being in the same cluster. We develop an efficient and straightforward Markov chain Monte
Carlo (MCMC) algorithm for computation and discuss applications in financial econometrics and ecology. This article has supplementary
material online.
KEY WORDS: Nonparametric Bayes; Option pricing; Point-referenced counts; Random probability measure; Random stochastic
processes.
1. INTRODUCTION
Recent literature in Bayesian nonparametrics repeatedly focused on models for collections of distributions based on extensions of the Dirichlet process and other stick-breaking priors
(Ferguson 1973; Sethuraman 1994; Ishwaran and James 2001;
Ongaro and Cattaneo 2004). These extensions are generated
by introducing dependence in the weights and/or atoms of the
stick-breaking construction (MacEachern 2000; DeIorio et al.
2004; Gelfand, Kottas, and MacEachern 2005; Teh et al. 2006;
Griffin and Steel 2006b; Duan, Guindani, and Gelfand 2007;
Dunson and Park 2008; Rodriguez, Dunson, and Gelfand 2008),
or by considering mixtures of independent samples from one or
more Dirichlet processes (Müller, Quintana, and Rosner 2004;
Griffin and Steel 2006a; Dunson, Pillai, and Park 2007). In contrast, this article focuses on models for prior distributions on
stochastic processes on an index space D ⊂ Rd with rich but
constant marginal distributions. We no longer focus on building different distributions for each possible value of the index
space, but instead construct a stochastic process where observations at different locations are dependent, but have a common
(albeit unknown) marginal distribution. We call the resulting
construction a latent stick-breaking process (LaSBP). As the
name indicates, we rely on a stick-breaking construction to represent the unknown marginal distribution, while introducing an
underlying (latent) Gaussian process on D to drive the selection
of the atoms at each location s ∈ D, as suggested in MacEachern
(2007).
As a motivation, consider the problem of estimating species
abundance from point counts at pre-specified locations. It is
clear that abundance changes spatially with the specific characteristics of micro-habitats, and modeling these changes is key to
understanding population intensities and dynamics. However,
biologists are interested in not only the spatial distribution of a
species, but also its local and global abundance. When studying abundance of European Starlings (a nonnative bird species)
Abel Rodriguez is Assistant Professor, Department of Applied Mathematics and Statistics, University of California, Mailstop SOE2, Santa Cruz,
CA 95064 (E-mail: abel@soe.ucsc.edu). David B. Dunson is Professor
(E-mail: dunson@stat.duke.edu) and Alan E. Gelfand is Professor (E-mail:
alan@stat.duke.edu), Department of Statistical Sciences, Duke University, Box
90251, Durham, NC 27708. The work of the first author was supported in part
by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences and NSF DMS 0915272. The work of the third author
was supported in part by NSF DMS 0504953. The authors thank the associate
editor and two anonymous referees, whose comments greatly contributed to
improve the quality of the manuscript.
in North Carolina (see Section 7.2), one may be interested in
the marginal distribution of the number of individual birds observed at randomly selected locations. In this context, the underlying Gaussian process introduced previously can be interpreted as a latent, location-specific habitat quality score, which
changes spatially making higher rates more likely in certain regions. Similarly, we can consider the price of a financial asset to be a random variable whose volatility changes in time
according to an underlying score reflecting market sentiments,
with the common marginal distribution providing a summary of
the return levels that can be expected from the asset (see Section 7.1). Current nonparametric Bayes models for related problems focus on priors for dependent collections of conditional
distributions, and therefore, assume that the marginal distribution of the process changes at each location. Instead, we are
interested in models that assume a constant marginal distribution across the index space, but allow samples from it to be
dependent.
Due to the discrete nature of its marginal distribution, LaSBP
models induce a random segmentation of the index space D.
This segmentation is spatially coherent, in the sense that, the
closer two locations are in space, the larger the probability of
them being assigned to the same cluster. Typically, the segmentation has an interesting interpretation in specific applications.
For example, the partitions can be identified with micro-habitats
in species distribution applications. Nonetheless, smooth interpolations and predictions can be recovered through model
averaging. The literature in signal and image segmentation is
enormous, some interesting reviews include Pal and Pal (1993),
Pham, Xu, and Prince (2000), and Suri, Wilson, and Laxminarayan (2005). Mixture models are the most relevant for our
discussion. In contrast to our method, many of them do not
incorporate information on spatial location in the prior for allocating observations to mixture components. One interesting
exception is the model developed in Figueiredo (2005) and
Figueiredo, Cheng, and Murino (2007) for image segmentation,
which uses a latent conditionally autoregressive (CAR) structure to define spatially dependent weights, in a similar spirit to
Rodriguez, Dunson, and Taylor (2009).
Since the selection process in the LaSBP inherits the characteristics of the underlying Gaussian process, our models can
647
© 2010 American Statistical Association
Journal of the American Statistical Association
June 2010, Vol. 105, No. 490, Theory and Methods
DOI: 10.1198/jasa.2010.tm08241
648
easily accommodate a rich variety of correlation structures, including nonstationary processes. This is especially interesting
because (1) typical change-point models (like those used in
finance) assume a stationary Markovian structure to simplify
computation, and (2) nonstationary dependence structures are
hard to specify in other nonparametric models like the orderdependent Dirichlet process (Griffin and Steel 2006b). Also,
the model does not require replicates at each location, giving
it an advantage over Gelfand, Kottas, and MacEachern (2005)
for the purpose of our applications.
In addition, the use of order constraints in the definition of
the LaSBP allows us to generate skewed mixing distributions.
For scale mixtures (as in Section 7.1), this feature yields models
that favor heavier tails for the marginal distribution of the observations than those obtained by dependent Dirichlet process
(DDP) mixtures with the same baseline measure. For location
mixtures (as in Section 7.2), the LaSBP favors distributions that
have a heavy right or left tail. This is appealing in many applications, but can be avoided through suitable adjustment of the
distributions of the stick-breaking weights if desired.
The LaSBP can also be formulated as a Gaussian copula
process with a common and unknown marginal distribution,
which is assigned a stick-breaking prior. The literature on copula modeling is extensive; Nelsen (1999), among others, provided an introduction. Within a quasi-Bayesian setting, Hoff
(2007) recently developed a multivariate nonparametric model
for mixed discrete and continuous data that uses Gaussian copulas and marginal set likelihoods. Although similar in spirit to
the LaSBP, this model avoids the specification of the marginal
distributions by using a pseudo-likelihood, and is formulated as
a procedure to obtain a finite-dimensional distribution and not a
process. Also, Kottas, Müller, and Quintana (2005) formulated
a model for ordinal data using a nonparametric specification for
the copula. Again, the goals of this model are different from
ours, as they focus on contingency tables and build a mixture
model for the copula describing the correlation across entries in
the table.
The LaSBP is related to the hybrid Dirichlet process (hDP)
models in Petrone, Guindani, and Gelfand (2009), who also
proposed the use of Gaussian copula processes to build dependence across locations. However, our construction is conceptually simpler because it uses scalar atoms rather than atoms that
are process realizations. In addition, while we focus on problems where prediction and interpolation are the main goal, the
hDP is designed for local surface clustering. Although the hDP
is, in principle, a more flexible model (as it allows for more
general atoms than the LaSBP), the underlying latent field lacks
interpretability and can produce multimodal predictive distributions as an artifact (see Section 3 and Figure 2 in it). This is
a direct consequence of the use of process realizations in the
definition of the hDP, as an adjacent switch of labels does not
imply any closeness for the adjacent atoms.
For some applications (e.g., the stochastic volatility model in
Section 7.1), the infinite hidden Markov model (iHMM) (Beal,
Ghahramani, and Rasmussen 2001; Fox et al. 2008; van Gael et
al. 2008), which is closely connected to the hierarchical Dirichlet process (Teh et al. 2006), is a viable alternative to the LaSBP.
However, the iHMM assumes by construction that the index
space D for the stochastic process is discrete and unidimensional, which prevents its use in other interesting applications
Journal of the American Statistical Association, June 2010
such as the one presented in Section 7.2. In addition, the iHMM
relies on a Markovian assumption, which only allows a very
restrictive local dependence structure in allocation to mixture
components, while the LaSBP is much more flexible in using a
latent Gaussian process to drive the allocation.
This article is organized as follows: Section 2 reviews the literature on stick-breaking processes and sets the stage for the
LaSBP model. Sections 3 and 4 define the latent stick-breaking
process and discusses some of its properties, with Section 5 describing an efficient Markov chain Monte Carlo (MCMC) algorithm for posterior computation. Section 6 discusses extensions
of the model to multivariate problems. Section 7 presents two
applications in financial econometrics and ecology. Finally, in
Section 8 we present a short discussion and some future directions.
2. STICK–BREAKING PRIORS
Let G be a random probability measure on (, B), with B
the Borel sets of . Then G has a stick-breaking prior if
G(·) =
L
wl δθl∗ (·),
l=1
iid
θl∗ ∼
H for some baseline probability measure H on
with
and wl = vl l−1
k=1 (1 − vk ) where vl ∼ beta(al , bl ) for l < L and
vl = 1 for l = L. The number of atoms L can be finite (either
known or unknown) or infinite. The class of stick-breaking priors includes some well-known processes. For example, taking
L = ∞, al = a and bl = b + l(1 − a) for 0 ≤ a < 1 and b >
−(1 − a) yields the two-parameter Poisson–Dirichlet Process
(Pitman 1996), with the choice a = 1 resulting in the Dirichlet process with baseline measure H and precision parameter b
(Sethuraman 1994).
For L < ∞, setting vL = 1 ensures that Ll=1 wl = 1 with
probability one. If L = ∞, Ishwaran and James (2001) showed
that the weights add up to 1 (and therefore the process yields a
well-defined
probability distribution) almost surely if and only
if ∞
E(log(1
− vk )) = −∞. For any measurable set B ∈ B,
l=1
G(B) is a random variable whose moments are given by
E(G(B)) = H(B),
Var(G(B)) = H(B)(1 − H(B))
L
al (1 + al )(al + bl )
×
(al + bl )2 (al + bl + 1)
l=1
×
l−1
k=1
bk (1 + bk )(ak + bk )
.
(ak + bk )2 (ak + bk + 1)
Therefore, H describes the expected shape of the distributions,
while the sequences {al }Ll=1 and {bl }Ll=1 control the variance of
the sampled distributions around H.
To increase the flexibilityof the model, most applications
use convolutions of the form K(·|θ )G(dθ ), where G follows a
stick-breaking prior and K is a kernel, to model unknown probability distributions. Under fairly general conditions on the kernel K, mixtures of this type yield models that are dense (in the
L1 sense) in the space of absolutely continuous distributions
(Lo 1984).
Rodríguez, Dunson, and Gelfand: Latent Stick-Breaking Processes
649
One way to extend the stick-breaking priors to collections of
distributions on a space D is to replace the random variables
involved in the definition of the random distribution with stochastic processes on D. This is the strategy behind the dependent Dirichlet process (DP) (MacEachern 2000; DeIorio et al.
2004; Gelfand, Kottas, and MacEachern 2005; Griffin and Steel
2006b). Inherently, all of these approaches (including those that
allow the weights to depend on the location) assume that, conditional on the stick-breaking parameters, the probability of assigning observations to atoms is known. For example, for the
spatial Dirichlet process (SDP) (Gelfand, Kottas, and MacEachern 2005),
Pr θ (s) = θi∗ (s), θ (s′ ) = θj∗ (s′ )|{θl∗ (s)}Ll=1 , {wl (s)}Ll=1
=
1,
0,
i = j,
i = j,
for all s, s′ ∈ D and all i, j ∈ {1, . . . , L}. Therefore, conditional
on the atoms and weights, the selection mechanism is constant
for all locations s ∈ S. In the sequel, we explore an alternative approach, which constructs a process where the selection
probabilities are not conditionally independent across locations
while the atoms and weights are.
3. UNIVARIATE LATENT STICK–BREAKING PROCESS
Our goal is to construct a stochastic process θ = {θ (s) : s ∈
D ⊂ Rd } where θ (s) ∈ ⊂ R for all s such that, marginally, θ (s)|G ∼ G for all s ∈ D and some unknown G, but
Cov(θ (s), θ (s′ )|G) = 0. To do this, consider the general approach motivated by copula models: Define a stochastic process
U = {U(s) : s ∈ D} with uniform marginals, then define θ (s) =
G− (U(s)), where G is a random discrete probability distribution and G− is its generalized inverse. Then clearly θ (s) is
distributed according to G for any given s and θ (s) and θ (s′ )
are correlated. The question of model development becomes
how to choose the process U and a prior for G. In the sequel,
we do this through an independent triplet,
{{z(s) : s ∈ D}, {vl }Ll=1 , {θl∗ }Ll=1 }.
(1)
Here, the latent Gaussian process z is such that z(s) ∼
N(0, 1) for all s ∈ D and Cor(z(s), z(s′ )) = γ (s, s′ ), then U(s) =
(z(s)) and the sequences {vl }Ll=1 and {θl∗ }Ll=1 are used to define
a random distribution
G(·) =
L
wl δθl∗ (·),
and {vl }Ll=1 as the marginal prior probability of observing the
different rates.
Now, for any finite set of locations s1 , . . . , sn ∈ D, and any n
let
Pr θ (s1 ) = θl∗1 , . . . , θ(sn ) = θl∗n |{θl∗ }Ll=1 , {vl }Ll=1
= Pr z(s1 ) ∈ −1 πl1 −1 , −1 πl1 , . . . ,
z(sn ) ∈ −1 πln −1 , −1 πln ,
(2)
where (·) denotes thecumulative standard normal distribution
function and πl = 1 − k≤l (1 − vk ) is the proportion of the unit
stick assigned to the first l atoms, with π0 = 0. We call the prior
distribution induced by the hierarchical model above on the law
P of the process a latent stick-breaking process (LaSBP), and
write θ|P ∼ P with P ∼ LaSBPL ({al }Ll=1 , {bl }Ll=1 , H, γ ). The
definition in Equation (2) makes clear that we are modeling a
joint distribution, which encourages closeness among θ (s) and
θ (s′ ). This is very different from the effect of DDP models,
which encourage closeness among the distributions at s and s′ ,
but not necessarily between θ (s) and θ (s′ ).
Theorem 1. For every n > 1, P ∼ LaSBPL ({al }Ll=1 , {bl }Ll=1 ,
H, γ ) satisfies Kolmogorov’s conditions almost surely,
(1) symmetry:
Ps1 ,...,sn (θ (s1 ) ∈ A1 , . . . , θ(sn ) ∈ An )
= Psν1 ,...,sνn θ sν1 ∈ Aν1 , . . . , θ sνn ∈ Aνn
for any permutation ν = (ν1 , . . . , νn ) of the set {1, . . . , n};
(2) consistency:
Ps1 ,...,sn+1 (θ (s1 ) ∈ A1 , . . . , θ(sn ) ∈ An , θ(sn+1 ) ∈ (−∞, ∞))
= Ps1 ,...,sn (θ (s1 ) ∈ A, . . . , θ (sn ) ∈ An ),
where A1 , . . . , An are measurable subsets of .
The proof is in
the Appendix. Note that for any single given
location Ps (·) = Ll=1 wl δθl∗ (·) = G(·), and therefore, the marginal distribution of the process is constant. However, draws
at two different locations s and s′ are correlated because z(s)
and z(s′ ) are correlated. For example, predictions on the value
θ (s′ ) for an unobserved location s′ depend on the predictions
for z(s′ ), in particular
l=1
where the probability
weights are defined as wl = vl k<l (1 −
vk ) and satisfy
l wl = 1 almost surely. The sequences of
stick breaking ratios {vl }Ll=1 is such that vl ∼ beta(al , bl ) for
l < L and vL = 1. The sequence of atoms {θl∗ }Ll=1 is constructed by imposing an order constraint on a sample from
a baseline measure H on R using the following mechanism:
Sample θ1∗ ∼ H and for l > 1 draw θl∗ ∼ Hl , where Hl is de∗ }, i.e.,
fined as the restriction of H to the set Sl = {θ : θ > θl−1
Hl (B) = H(B ∩ Sl )/H(Sl ) for any measurable set B ∈ B. Referring back to our example in ecology, we can think of z(s) as the
quality of the habitat at location s, {θl∗ }Ll=1 as the (ordered) set
of possible abundance rates that can potentially be observed,
Pr θ (s′ ) = θl∗ |θ (s1 ), . . . , θ (sn ), {z(si )}ni=1 , {θl∗ }Ll=1 , {vl }Ll=1
= Pr z(s′ ) ∈ [−1 (πl−1 ), −1 (πl ))|z(s1 ), . . . , z(sn )
−1 (πl )
′
−1
φ(t| ′s′ ,s −1
=
s,s z; 1 − s′ ,s s,s s′ ,s ) dt,
−1 (πl−1 )
where φ(·|a; b) denotes the density of the univariate Gaussian
distribution with mean a and variance b, s′ ,s is a column vector with entries [ s′ ,s ]i = γ (si , s′ ), s,s is a n × n matrix with
entries [ s,s ]ij = γ (si , sj ), and z = (z(s1 ), . . . , z(sn ))′ .
An alternative but equivalent construction for the LaSBP can
be obtained by introducing the labeling process ξ = {ξ(s) : s ∈
650
Journal of the American Statistical Association, June 2010
D} with ξ(s) ∈ {1, . . . , L} such that ξ |{vl }Ll=1 ∼ Q and Q is characterized by the finite-dimensional distributions
Qs1 ,...,sn (l1 , . . . , ln )
= Pr(ξ(s1 ) = l1 , . . . , ξ(sn ) = ln |{vl }Ll=1 )
= Pr z(s1 ) ∈ −1 πl1 −1 , −1 πl1 , . . . ,
z(sn ) ∈ −1 πln −1 , −1 πln .
For example, Qs (k) = Pr(ξ(s) = k|{vl }Ll=1 ) = wk and
Qs,s′ (l, k) = γ (s,s′ ) (−1 (πl ), −1 (πk ))
− γ (s,s′ ) (−1 (πl−1 ), −1 (πk ))
− γ (s,s′ ) (−1 (πl ), −1 (πk−1 ))
+ γ (s,s′ ) (−1 (πl−1 ), −1 (πk−1 )),
where r (·, ·) denotes the cumulative distribution of the standard bivariate normal with correlation r. Conditionally on the
∗ . Intelabeling process we can define θ (s)|ξ(s), {θl∗ }Ll=1 = θξ(s)
grating out ξ we get θ|P ∼ P and P ∼ LaSBP({al }Ll=1 , {bl }Ll=1 ,
H, γ ) as before, however, by explicitly introducing the latent
labeling process ξ we immediately realize that the LaSBP
induces a segmentation of the θ surface by discretizing a
Gaussian process.
To gain some intuition into this construction, we show in Figure 1 three random realizations of the θ surface on D = [0, 1]2
associated with three different marginal distributions P sharing the same underlying latent process z, which was generated
using an exponential correlation function. For this simulation,
we took L = 100 and al = 1, bl = α, for three different values
of α. The latent process was sampled at 3600 locations over an
evenly spaced 60 × 60 grid, and we used a standard Gaussian
baseline measure H to generate the atoms of the process. The
resulting surfaces are tiled, segmenting the space according to
the level of the underlying Gaussian process. The parameter α
influences the roughness of the segmentation (essentially, the
number of different levels that the random surface takes), the
baseline measure H controls the height of the tiles, and the correlation function γ influences the relative size of the different
tiles.
Figure 1 also provides insight into the order constraint we introduced in the definition of the process. The order constraint
allows us to link the behavior of the atoms of the stick-breaking
construction to the behavior of the latent process. In particular, it enforces the internal consistency condition z(s) > z(s′ ) ⇒
θ (s) ≥ θ (s′ ), which simplifies interpretation. In the species
abundance application, this means that higher habitat quality
scores should imply a larger number of sightings. In a similar
spirit, a more nervous market should be subject to larger price
changes. Without the order constraint forcing adjacent atoms to
have similar values, interpretability is lost and label switching
can unexpectedly lead to very wiggly estimates of the underlying latent field z and multimodal predictive distributions at new
locations as an artifact.
Figure 1. Realizations from a random law P drawn from a LaSBP process on [0, 1]2 with a standard Gaussian baseline measure. We illustrate
the effect of different concentrations while keeping the underlying Gaussian process (shown in the upper left panel) constant to simplify interpretation. In the color scale, white corresponds to large values, yellow to medium values, and red to low values of the process. A color version
of this figure is available in the electronic version of this article.
Rodríguez, Dunson, and Gelfand: Latent Stick-Breaking Processes
651
L
=
l=1
al
bk
E(ψ(θl∗ ))
a l + bl
a k + bk
k<l
EH (ψ(θ1∗ )).
>
Therefore, due to the order constraint, some care must be exercised when interpreting the baseline measure in our model.
In particular, the average level of the process will depend not
only on the expected value of the baseline measure H, but also
on the stick-breaking parameters {al }Ll=1 and {bl }Ll=1 . To understand the consequences of this behavior, consider the case
al = 1 and bl = b. In that case, if b → ∞ then E(θ (s)) → ∞.
In turn, for a finite b, the mean value of the process is strictly
greater than the mean of the baseline measure.
Example 1. Consider θ|P ∼ P and P ∼ LaSBPL (a, b, H, γ ),
where H is an exponential baseline measure with mean λ. This
coincides with the illustrations we present in Section 7. In
∗ follows a shifted exponential distribution and
this case, θl∗ |θl−1
∗
∗
∗ . This implies that E(θ ∗ ) = lλ and
E(θl |θl−1 ) = λ + θl−1
l
Figure 2. Locations in our toy example. Black points correspond to
locations where data are observed, the white point is a location where
prediction is desired.
As an illustration of this phenomenon, consider a very simple setup where L = 3 and the process is observed at three
locations (labeled s1 , s2 , and s3 in Figure 2). Let the stickbreaking weights be such that −1 (π1 ) = −1 and −1 (π2 ) = 1
and the underlying z take the values z(s1 ) = −2, z(s2 ) = 2, and
z(s3 ) = 0. This implies that the component membership indicators are ξ(s1 ) = 1, ξ(s2 ) = 3, and ξ(s3 ) = 2. Also, let the atoms
take the value θ1∗ = −5, θ2∗ = 5, and θ3∗ = 0 (this is a valid combination of values if the order constraint is removed).
Consider now the value of the underlying process and the
corresponding predictions at location s4 . Since the Gaussian
process is continuous, predictions for z(s4 ) will be close to 0,
making ξ(s4 ) = 2 (and a prediction close to 5) highly likely,
even though s3 is far away from s1 , s2 , and s4 . This is undesirable if we are interested in convex interpolations, as in that case
we will expect θ (s4 ) < 0. This feature can be particularly acute
with small samples or low spatial coverage, as there may be little information to distinguish between multiple latent surfaces
consistent with different label permutations.
In most applications, the LaSBP will not be used to model the
observable Y(s) directly, but will be used to construct LaSBP
mixtures. Briefly, we will assume that Y(s)|θ (s) ∼ f (·|θ (s)),
where θ (s) is a scalar, and f (·|θ (s)) is a density function, e.g.,
N(μ, 1/θ(s)) (as in Section 7.1) or Poi(θ (s)) (as in Section 7.2),
and let θ |P ∼ P where P ∼ LaSBPL ({al }Ll=1 , {bl }Ll=1 , H, γ ).
This implies that Y(s) and Y(s′ ) are correlated since θ (s)
′
and
correlated, and that marginally, Y(s)|G ∼
L θ (s ) are
∗ ), which defines the same, possibly multimodal,
w
f
(·|θ
l=1 l
l
marginal for any s.
4. PROPERTIES
First, consider the expectation of θ (s). If ψ is a measurable
function on (, B),
E ψ(θ(s)) =
L
l=1
E(wl )E(ψ(θl∗ ))
E(θ (s)) =
Note that
L
λl
l=1
l−1
a
b
a+b a+b
L+1
a+b
b
=λ
1−
a
a+b
L
a
b
− (L + 1)
.
a+b
a+b
a+b
.
a
Therefore, the process is well defined (in the sense that it has
a finite first moment) even with an infinite number of atoms as
long as b is finite. A similar argument yields
a+b 2
lim E (θ (s))2 = 2λ2
L→∞
a
lim E[θ (s)] = λ
L→∞
and
lim Var[θ (s)] = λ
L→∞
2
a+b
a
2
.
Example 1 illustrates that in many cases we can interpret finite LaSBP’s as truncations of infinite processes. In this situation, the theory of truncations developed in Ishwaran and James
(2001, 2003) or Rodriguez, Dunson, and Gelfand (2008) can be
directly modified for the LaSBP. The following theorem gives
sufficient conditions for the limiting process to have finite expectation.
Theorem 2. Let θ |P ∼ P and P ∼ LaSBPL (a, b, H, γ ),
where H is such that, for large l,
∗
∗
) ≤ μ + λθl−1
E(θl∗ |θl−1
for some real constants μ and λ such that 0 < λ ≤ 1. Then
lim E(θ (s)) < ∞.
L→∞
Corollary 1. If E(θl∗ ) < c1 lp + c2 λl for some real constants
c1 , c2 , and λ with 0 < λ ≤ 1 and integer constant p, then
limL→∞ E(θ (s)) < ∞.
652
Journal of the American Statistical Association, June 2010
Corollary 2. If H
limL→∞ E(θ (s)) < ∞.
has
a
compact
support
then
The details of the proof are shown in the Appendix, and it can
be easily extended to provide conditions for finite higher-order
moments. As the following two examples show, the condition
in Theorem 2 is generally satisfied when H corresponds to a
distribution with light tails.
Example 2. Let θ |P ∼ P and P ∼ LaSBPL (a, b, H, γ )
where H is a Gaussian distribution with mean μ and variance
σ 2 . In this case
∗ − μ)/σ )
φ((θl−1
∗
E(θl∗ |θl−1
)=μ+σ
∗ − μ)/σ ) ,
1 − ((θl−1
where φ denotes the density of the standard normal distribution. Using some well-known properties (Abramowitz and Stegun 1965, p. 298),
σ
∗ − μ)/σ )
φ((θl−1
∗ − μ)/σ )
1 − ((θl−1
1
∗
∗ − μ)2 + 2σ 2 ]
≤ √ [(θl−1
− μ) + (θl−1
2π
2
∗
≈ √ (θl−1
− μ)
2π
√
∗ . Since 2/ 2π < 1, Theorem 1 applies and
for large θl−1
limL→∞ E(θ (s)) < ∞.
Example 3. Consider now a Pareto baseline measure with
density
α η α+1
,
θ ≥ η.
p(θ ) =
η θ
The conditional density reduces to
∗ α+1
α θl−1
∗
)= ∗
,
p(θl∗ |θl−1
θl−1 θl∗
∗
θl∗ ≥ θl−1
,
∗ ) = α θ ∗ . Therefore lim
and E(θl∗ |θl−1
L→∞ E(θ (s)) < ∞ iff
α−1 l−1
b/(a + b) < (α − 1)/α.
In addition to the mean of the process, it is important to understand the covariance structure induced by our construction.
The covariance of the θ process is given by
Cov(θ (s), θ (s′ )) =
L
L
E(Qs,s′ (l, k))E(θl∗ θk∗ )
l=1 k=1
− E(Qs (l))E(Qs′ (k))E(θl∗ )E(θk∗ ) .
(3)
Although we do not have a closed-form expression for the covariance function, it is clear from Equation (3) that the process
on θ is isotropic, stationary or nonstationary if z has these properties. Additional properties for the selection process U can be
seen in the Appendix.
Realizations from P are discontinuous almost surely, which
is a consequence of the stick-breaking construction. However,
note that
Qs,s′ (l, k) → Qs (l)Qs′ (k)
as γ (s, s′ ) → 0,
Qs (l), l = k,
Qs,s′ (l, k) →
0,
otherwise
as γ (s, s′ ) → 1.
In particular, this implies that Pr(θ (s) = θ (s′ )) → 1 as s′ → s
almost surely, and the selection mechanism is spatially coherent. This result also shows that, as the range of the latent
Gaussian process goes to zero, samples at different locations
become iid observations from G. On the other hand, as the range
grows, the random surfaces drawn from P are constant and the
corresponding LaSBP mixture reduces to a parametric model
with a prior H on the unknown parameter.
It is important to note that θ (s) and θ (s′ ) are never independent. Indeed, even if γ (s, s′ ) = 0, implying that Pr(ξ(s) =
l, ξ(s′ ) = k|{vr }Lr=1 ) = Pr(ξ(s) = l|{vr }Lr=1 )Pr(ξ(s′ ) =
k|{vr }Lr=1 ) = Qs (l)Qs′ (k), it is clear from Equation (3) that, in
general, Cov(θ (s), θ (s′ )) = 0 because Cov(E(θ (s)|G), E(θ (s′ )|
G)) = 0. It is also interesting to formally consider the effect of
the stick-breaking distribution on the surface. Assume for simplicity that al = a and bl = b for all l, which we will do in most
a
practical applications. Note that as a+b
→ 1 we have w1 → 1
and realizations from P will correspond to flat random surfaces,
θ (s) = θ1∗ for all s ∈ D.
Finally, we also note that, due to the order constraint on
the atoms, the LaSBP with constant a and b tends to generate skewed marginal distributions as the process will also encourage larger weights for smaller values of θ . This provides
our model with a particular flavor that differentiates it from the
regular Dirichlet process and the Dirichlet-multinomial model
underlying the hDP in Petrone, Guindani, and Gelfand (2009).
More precisely, the weights arising from a stick-breaking construction are stochastically ordered. Under the order constraint
for the θl∗ ’s, this implies that smaller θl∗ ’s receive stochastically larger weights. For scale LaSBP mixtures (as in Sections 7.1 and 7.2), this feature yields models that favor heavier
tails for the marginal distribution of the observations than those
obtained by dependent Dirichlet process (DDP) mixtures. For
location mixtures, the LaSBP favors distributions that have a
heavy right or left tail, which in our experience, can be appealing property in many applications. Indeed, there is often prior
information available that one tail of the distribution is heavier and is more subject to secondary modes, and our specification allows a natural incorporation of such prior information. In
the applications we considered, this was the case. To provide
a few additional examples: (1) In studies of premature delivery, the natural response is the gestational age at delivery. The
right tail tends to be very stable across populations and is very
light, while the left tail (corresponding to premature deliveries) is quite heavy and varies substantially across ethnic groups
and with other factors; (2) in studies of the level of a pollutant,
the left tail (corresponding to unusually low pollution levels)
tends to be very light relative to the right tail (corresponding
to high pollution levels); and (3) similar behavior is observed
typically for continuous adverse health responses, such as the
frequency of DNA strand breaks in cells. In applications where
this property not appealing, it can be mitigated through appropriate choices of the nonconstant sequences {al }Ll=1 and {bl }Ll=1
controlling the stick breaking weights.
5. INFERENCE FOR MIXTURES OF LASBP
For inference purposes, we develop an MCMC algorithm.
We assume al = a, bl = b, L < ∞ and use a blocked sampler
(Ishwaran and James 2001). If L = ∞, a retrospective sampler
Rodríguez, Dunson, and Gelfand: Latent Stick-Breaking Processes
(Roberts and Papaspiliopoulos 2008) can be easily obtained as
an extension. To facilitate computation, the latent processes,
z and ξ are sampled explicitly. Assuming that n data points
y(s1 ), . . . , y(sn ) were collected, one for each location s1 , . . . , sn ,
and using the familiar bracket notation, the joint distribution is
given in this case by
n
n
[y(si )|θ ∗ , ξ(si )] × [ξ(si )|z(si ), v] × [z(s1 ), . . . , z(sn )|λ]
i=1
i=1
×
∞
l=1
[vi |α] ×
∞
l=1
[θl ] × [α] × [λ].
After setting up initial values for all parameters in the model,
the algorithm proceeds by sequentially updating the parameters
according to the following steps:
(1) Jointly update the latent processes ξ and z, one location
at a time, by first sampling ξ(si ) from a discrete distribution
such that
Pr(ξ(si ) = l)
∝ p(y(si )|θl )Pr z(si ) ∈ [−1 (πl−1 ), −1 (πl ))|z(s1 ), . . . ,
z(si−1 ), z(si+1 ), . . . , z(sn ) ,
where the prior probability of component l can be obtained by
a univariate integration, and then sampling z(si ) from the restricted univariate normal distribution defined by
[z(si )|z(s1 ), . . . , z(si−1 ), z(si+1 ), . . . , z(sn )]1l ,
where l = {z(si ) : z(si ) ∈ [−1 (πξ(si )−1 ), −1 (πξ(si ) ))}. Note
that, if si = sj for some j, then the prior probability for observation i being assigned to component ξ(sj ) is one, and therefore,
ξ(si ) = ξ(sj ) and z(si ) = z(sj ), as expected.
(2) Sample the stick-breaking ratios one at a time from their
full conditional
vl |v(−l) ∝ va−1
(1 − vl )b−1 1Al ,
l
where Al = {vl : qll ≤ vl < qul } and
qll
= max
qul =
{i:ξ(si )≥l}
min
1 − (z(si ))
,
1−
k≤ξ(si ),k=l (1 − vk )
{i:ξ(si )≥l+1}
1 − (z(si ))
.
k≤ξ(si )−1,k=l (1 − vk )
1−
Note that vl depends on v(−l) only through the constraint on the
support. Also, vl depends on z only through {z(si ) : ξ(si ) ≥ l}.
Therefore, for l > l∗ = maxi {ξ(si )}, vl is conditionally independent from v(−l) .
(3) Sample the atoms {θl }. The set of full conditionals is
given by
p(θ1 | · · ·) ∝ h(θ1 )
p(y(si )|θ1 )
{i:ξ(si )=1}
and
p(θl | · · ·) ∝ h(θl |θl−1 )
p(y(si )|θl ),
l > 1,
{i:ξ(si )=l}
where hl is the density associated with the measure Hl .
653
(4) The prior parameters on the stick-breaking ratios a
and b can be jointly sampled using a random-walk Metropolis–
Hastings algorithm.
(5) The parameters of the underlying Gaussian process can
be sampled conditional on the current realization z(s1 ), . . . ,
z(sn ) using a random-walk Metropolis–Hastings algorithm, as
is customary in Gaussian process models.
(6) Since the conditional moves in step (1) can raise some
concerns about mixing rates, we suggest to additionally sample
the whole Gaussian process jointly. This can be easily done by
noting that the joint full conditional is given by
[z(s1 ), . . . , z(sn )| · · ·] · · · ∼ N(0, )1 ,
where []ij = γ (si , sj ), = ni=1 i and i = {z(si ) :
−1 (πξ(si )−1 ) ≤ z(si ) < −1 (πξ(si ) )} as before. Due to the
form of the restrictions, this is a relatively straightforward step.
The supplemental material available at http://pubs.amstat.
org/loi/jasa discusses some details related to efficient computational implementation for this model.
6. MULTIVARIATE LATENT
STICK–BREAKING PROCESSES
The order constraint in the definition of the LaSBP makes it
hard to directly extend our formulation to multivariate atoms.
One alternative we explore here is to use a latent multivariate
Gaussian process to induce dependence in the selection mechanism of multiple stick-breaking representations. As an example,
consider a two-dimensional process θ = {(θ1 (s), θ2 (s))′ : s ∈ D}
with marginal distributions G1 and
G2 , which have stickbreaking representations Gi (·) = Ll=1 wil δθil∗ (·) with θil∗ ≤
∗
for i = 1, 2 as defined in Section 3 for the univariθi,l+1
ate LaSBP, with components independent for i = 1, 2. Also,
let (z1 (s), z2 (s))′ = z(s) = Ax(s), where xi is a standardized
Gaussian process with correlation function γi (s, s′ ) and A is a
matrix such that
1
0
1 η
′
A=
.
⇒ AA=
η 1
η
1 − η2
Letting aj be the jth column of A, the corresponding crosscovariance matrix for z is given by z(s),z(s′ ) = γ1 (s, s′ )a1 a′1 +
γ1 (s, s′ )a2 a′2 . In particular, this implies that Var(zi (s)) = 1 for
i = 1, 2 and Cov(z1 (s), z2 (s)) = Cor(z1 (s), z2 (s)) = η. As in
Section 3, we can define finite-dimensional distributions
Ps1 ,...,sn θ1 (s1 ) = θ1l∗ 1 , . . . , θ1 (sn ) = θ1l∗ n ,
θ2 (s1 ) = θ2l∗ n+1 , . . . , θ2 (sn ) = θ2l∗ 2n |
{θ1l , θ2l }Ll=1 , {v1l , v2l }Ll=1
= Pr z1 (s1 ) ∈ −1 (π1,l1 −1 ), −1 π1,l1 , . . . ,
z1 (sn ) ∈ −1 π1,ln −1 , −1 π1,ln ,
z2 (s1 ) ∈ −1 π2,ln+1 −1 , −1 π2,ln+1 , . . . ,
z2 (sn ) ∈ −1 π2,l2n −1 , −1 π2,l2n .
Choosing η = 0 implies that θ1 (s) is independent from θ2 (s).
More generally, we can calculate the cross-covariance function
654
Journal of the American Statistical Association, June 2010
for this process by noting that
′
Cov(θi (s), θj (s )) =
L
L
ij
E[Qs,s′ (l, k)]E(θil∗ )E(θjk∗ )
l=1 k=1
j
− E[Qis (l)]E[Qs (k)]E(θil∗ )E(θjk∗ ),
(4)
ij
where as in Section 4, Qs,s′ (l, k) = Pr((zi (s)) ∈ [πi,l−1 ,
πi,l ), (zj (s′ )) ∈ [πj,k−1 , πj,k )|{v1l , v2l }Ll=1 ) and Qis (l) =
Pr((zi (s)) ∈ [πi,l−1 , πi,l )|{vil }Ll=1 ) = wil . As expected, Equation (4) reduces to Equation (3) when i = j.
The use of dependent Gaussian processes typically provides
an interesting interpretation for the model. For example, in
the context of the stochastic volatility model described in Section 7.1, the dependence between latent Gaussian processes
measures the correlation between market sentiments across different assets. Additionally, sampling for this multivariate extension can proceed along the lines described in Section 5, with an
additional step to sample the parameters in A conditionally on
all other parameters.
The multivariate extension we describe in this section is really just a straightforward modification of the univariate case
in that the order restrictions are placed on the atoms in each
marginal in the same manner as in the univariate case. Hence,
the assumptions are not more stringent than in the univariate
case, so our previous comments comparing the LaSBP with the
hDP in the univariate case still apply. As we discussed in Section 4, it is very often the case that such order restrictions are
justified. This is also true in the multivariate case. For example,
instead of one pollutant, we may measure multiple pollutants,
all of which should have a heavier right tail in their concentration distribution a priori. A posteriori these assumptions are not
strictly enforced; if there is sufficient evidence in the data to
refute our a priori (informed) guess of the shape of the distribution, the posterior estimate will be appropriately adapted.
7. ILLUSTRATIONS
7.1 Modeling Stochastic Volatility
In this section we consider an application of the LaSBP to the
modeling of stock-market returns. This example is interesting
as it allows us to compare the LaSBP with other more standard
models in a real life example.
The dataset under consideration consists of the weekly returns of the S&P500 index covering the 10-year period between
April 21, 1997 and April 9, 2007, for a total of 520 observations.
Figure 3 shows the evolution of these returns. The series does
not exhibit any long-term trend, but different levels of volatility
can be clearly seen. In particular, two slightly different regimes
are apparent; before May 2003, periods of high volatility are
relatively frequent, while after May 2003, we can appreciate
longer low-volatility periods. Indeed, it is well known that financial time series typically exhibit heavy tails and that periods
of low/high volatility tend to cluster together.
Similarly to other approaches in the literature (Bollerslev
1986; Jacquier, Polson, and Rossi 1994), we model r(t), the
log-return of the S&P500 at time t, as conditionally following
a normal distribution, i.e., r(t)|μ, θ (t) ∼ N(μ, θ (t)−1 ), where
θ|P ∼ P and the law P follows a LaSBP with an exponential
baseline measure with mean 1/φ0 and a decreasing (rather than
Figure 3. Weekly returns on the S&P500 index between April 21,
1997 and April 9, 2007. A color version of this figure is available in
the electronic version of this article.
increasing) order constraint. Also, we take al = 1 and bl = α
for all l. We noted before that this choice places stochastically
larger probabilities on larger variances. A prior with this structure is well motivated in this application, because low volatility (high precision) is the norm, while high volatility (low precision) is associated with sporadic shocks to the system. To
demonstrate the methodology we use an exponential correlation function with range parameter λ for the latent process, i.e.,
′|
γ (t, t′ ) = exp{− |t−t
λ }; however, implementations using more
general correlation functions (like Matérn, power exponential
or some nonstationary family) is just as straightforward. This
model results in a marginal distribution for the returns that is a
scale mixture of Gaussian distributions, yielding a rich model
on the class of unimodal distributions that can accommodate
heavy tails. Also, the structure of the LaSBP naturally induces
volatility clustering, while potentially allowing for richer dependence structures than Markov-switching models.
We complete the model by specifying priors for the different parameters. The mean of the returns is given a normal distribution with mean μ0 and variance τ 2 , the spatial range is
given a Gam(aλ , bλ ) distribution, the precision α is assigned a
Gam(aα , bα ) prior and φ0 is given a Gam(δ, φ00 ) prior. Since
the historical annual volatility of the S&P500 is traditionally estimated to be in the range of 12% –15%, we take τ 2 = 0.152 /52
and E(1/φ0 ) = 0.122 /52. The prior mean for the average return is taken to be μ0 = 0, and the degrees of freedom are set
to δ = 2. We picked aα = bα = 1 so that the prior for the precision parameter α is centered around 1 and has variance 1. The
prior for the temporal correlation is set as aλ = 1 and bλ = 1/5,
implying that the a priori range of the latent process driving the
correlation is approximately 15 weeks. Forty atoms (L = 40)
were used in the stick-breaking construction. A small sensitivity analysis was conducted by changing the hyperpriors on σ02 ,
Rodríguez, Dunson, and Gelfand: Latent Stick-Breaking Processes
α, and λ, as well as L, and results seemed mostly robust to the
selection. All results are based on 50,000 iterations obtained after a burn-in period of 5000 samples. We resampled the latent
process every 10 iterations to improve mixing. Visual inspection of the trace plots does not reveal any obvious mixing or
convergence problems. Sensitivity analysis was performed by
considering values for E(1/φ0 ) in the range between 0.052 /52
and 0.42 /52, leading to essentially the same results.
The model identifies between 2 and 11 volatility regimes,
with models having between 3 and 6 components receiving
the most weight (estimated posterior probabilities 0.27, 0.33,
0.20, and 0.10, respectively). The mean of the precision α is
estimated to be 0.700 [median 0.650, symmetric 90% credible
interval (0.260; 1.312)]. The posterior distribution of the correlation parameter λ has a mean of 110.72 [90% credible interval (34.05; 241.53)], showing strong evidence of long-range
dependence in the sample. This feature, which is well known in
financial time series, is hard to capture with stochastic volatility
models based on low-order autoregressive processes.
The posterior mean weekly return is estimated to be 0.00194
[credible interval (0.00050; 0.00336)], a reasonable value given
historical evidence. Figure 4 shows the posterior mean for the
volatility path estimated by the LaSBP model, along with the
sample standard deviation and the volatility path obtained from
the “standard” stochastic volatility described in Jacquier, Polson, and Rossi (1994). Note that, although under the LaSBP
each realization of the volatility path is piecewise constant, by
taking an average over multiple realizations of the posterior distribution we obtain a smoother path. Also, although both models provide similar volatility estimates, there are some interesting differences. For example, the low volatility period between October 2003 and December 2006 is estimated by the
Figure 4. Smoothed weekly volatilities for the S&P500 using the
LaSBP stochastic volatility model, a standard stochastic volatility
model fitted to the whole series, and separate standard stochastic
volatility models for the periods before and after 06/16/2003. The flat
dotted line corresponds to the empirical standard deviation of the sample. A color version of this figure is available in the electronic version
of this article.
655
LaSBP analysis to be a period with essentially constant volatility, while the regular volatility model implies frequent fluctuations. Nonetheless, both models rapidly adapt to the rise in
volatility in early 2007. Also the three high volatility peaks in
2000, 2001, and 2002 are more pronounced under our model.
To better understand the behavior of these models, we also
fitted separate standard stochastic volatility models for the periods before and after 06/16/2003 (labeled “separate” in Figure 4). Note that the separate stochastic volatility models fit a
smoother volatility estimate than the standard model to the second period. This suggests that lack of smoothness in the second
period for the standard stochastic volatility model is induced by
the lack of flexibility in the model.
Unlike standard volatility models, the LaSBP automatically
produces a nonparametric estimate of the marginal distribution
of returns on S&P500, which is shown in Figure 5. As expected
from historical evidence, the tails of the estimated marginal distribution are heavier than the tails of a normal distribution with
equivalent mean and variance.
7.2 Understanding Species Abundance: A Model for
Point-Referenced Counts
This section concentrates on models for spatial count data, in
the spirit of Richardson and Green (2001). The Christmas Bird
Count (CBC) is an annual census of early-winter bird populations conducted by over 50,000 observers each year between
December 14th and January 5th. The primary objective of the
CBC is “to monitor the status and distribution of bird populations across the Western Hemisphere.” Parties of volunteers
follow specified routes through a designated 15-mile diameter
circle, counting every bird they see or hear. The parties are organized by compilers who are also responsible for reporting
total counts to the organization that sponsors the activity, the
Figure 5. Marginal distribution of returns from the S&P500, compared with an empirical fit to the Gaussian distribution. A color version
of this figure is available in the electronic version of this article.
656
Audubon Society. Data and additional details about this survey
are available at http:// www.audubon.org/ bird/ cbc/ index.html.
In this illustration, we concentrate on jointly modeling the
abundance of two bird species in North Carolina during the
2006–2007 winter season: Sturnus vulgaris, also known as European Starling, and Zenaida macroura, also known as Mourning Dove. In this period, forty-four circles were active within
the state. Since the diameter of the circles is very small compared to the size of the region under study, we treat the data
as point referenced to the center of the circle. We elaborate on
the model described in Section 6 to construct a bivariate LaSBP
mixture of Poisson kernels.
Specifically, letting yi (s) stand for the number of birds observed at location s (expressed in latitude and longitude) belonging to species i, we assume yi (s) ∼ Poi(h(s)θi (s)), where
h(s) represents the number of man-hours invested at location s.
Next, we assume that (θ1 , θ2 )|P and P follows a bivariate
H20 with
LaSBP with exponential baseline measures H10 and
means θ10 and θ20 , stick-breaking weights wil = vil k<l (1 −
vik ) with vil ∼ beta(1, αi ), and exponential correlation functions γ1 (·, ·) and γ2 (·, ·) with range ρ1 and ρ2 , respectively.
We allow for a hierarchical structure on the mean of the baseline distributions H10 and H20 , whose inverse is assumed to
follow a gamma distribution, 1/θi0 ∼ Gam(2, 2θi00 ). Based on
historical data, we set θ100 = 10 and θ200 = 4. The correlation η
and range parameters ρ1 and ρ2 defining the latent Gaussian
processes are assumed unknown, and priors η ∼ Uni[−1, 1],
ρ1 ∼ Gam(1, 1/2) and ρ2 ∼ Gam(1, 1/2) are, respectively, assigned. Precision parameters are assigned independent priors
αi ∼ Gam(1, 1). Results are based on 100,000 iterations obtained after a burn-in period of 20,000 samples, and the latent
process was resampled every 10 iterations. Visual inspection of
trace plots reveals no obvious convergence problems, and autocorrelations seem within reasonable levels.
Figure 6 shows the posterior mean surfaces corresponding to
the per-man-hour sighting rates for each of the two species. As
expected, rates for both species tend to be higher in the coastal
plains region; the European Starling and the Mourning Dove
tend to be associated with man-altered environments, preferring
open country and grasslands, and avoiding woodlands, deserts,
and swamps. The latent processes appear to be positively correlated, the posterior expectation for η is 0.26 and (0.12; 0.44)
is a symmetric, 90% credible interval. However, the difference
in the level of spatial dependence between species is striking;
while suitable habitats for the Mourning Dove seem to cover
large geographical regions, European Starlings seem to concentrate on small microhabitats.
Figure 7 shows estimates of the marginal distribution of the
number of sightings in 1 man-hour for both species under consideration. Both distributions are bimodal and heavy tailed. The
smaller mode, which can be interpreted as the baseline rate of
sightings, falls around four or six birds per man per hour for the
European Starling and around two to four birds for the Mourning Dove. The higher mode, which can be interpreted as corresponding to specially suited habitats, is around 45 sightings per
man-hour for the European Sterling and around 23 sightings for
the Mourning Dove.
We wish like to emphasize that parametric alternatives to
our proposed model like a Poisson log-linear model do not allow for a multimodal marginal distribution like the one shown
previously. On the other hand, nonparametric alternatives like
Journal of the American Statistical Association, June 2010
(a)
(b)
Figure 6. Estimates of the spatial abundance (a) the European Starling and (b) Mourning Dove in North Carolina. A color version of this
figure is available in the electronic version of this article.
the SDP (Gelfand, Kottas, and MacEachern 2005) and the order dependent DP (Griffin and Steel 2006b) allow for flexible
location-specific distributions for the number of sightings, but
again do not provide a common estimate like the one in Figure 7.
8. DISCUSSION
We develop and illustrate a novel mechanism to generate stochastic processes with random marginal distributions. This is
done by representing the common marginal distribution using a
stick-breaking construction and inducing dependence in the selection mechanism of the atoms. Implementation of these models is straightforward using standard MCMC algorithms. One
important difference between our model and other hybridization schemes is its ability to provide reasonable predictions and
interpolations. Indeed, we show that some sort of identifiability
constraint on the atoms is necessary to facilitate interpretation
and avoid multimodality in the posterior distribution due to label switching.
LaSBP’s are very flexible. Unlike other nonparametric models in the literature, implementation of anisotropic and nonstationary models is straightforward, as is the modeling of discrete
or continuous data. In particular, we are currently exploring hierarchical specifications for the latent process that also allow
Rodríguez, Dunson, and Gelfand: Latent Stick-Breaking Processes
(a)
657
(in the Kolmogorov sense). Note that, from Equation (2), the finitedimensional distributions can be represented as mixtures. Therefore, it
is enough to prove that the process satisfies Kolmogorov’s conditions
conditional on {vl }Ll=1 and {θl∗ }Ll=1 , as mixtures of consistent processes
are consistent. Symmetry is immediate,
G θ sν1 = θl∗ν , . . . , θ sνn = θl∗ν |{θl∗ }Ll=1 , {vl }Ll=1
n
1
∈ −1 πlν −1 , −1 πlν
1
1
= Pr z sν1
z sν1 ∈ −1 πlνn −1 , −1 πlνn
,...,
= Pr z(s1 ) ∈ −1 πl1 −1 , −1 πl1 , . . . ,
z(sn ) ∈ −1 πln −1 , −1 πln
= G θ(s1 ) = θl∗1 , . . . , θ (sn ) = θl∗n |{θl∗ }Ll=1 , {vl }Ll=1 .
For consistency
G θ(s1 ) = θ1∗ , . . . , θ (sn ) = θn∗ , θ(sn+1 ) = u|
u∈{θl∗ }Ll=1
(b)
{θl∗ }Ll=1 , {vl }Ll=1
=
L
ln+1 =1
Pr z(s1 ) ∈ −1 πl1 −1 , −1 πl1 , . . . ,
z(sn+1 ) ∈ −1 πln+1 −1 , −1 πln+1
= Pr z(s1 ) ∈ −1 πl1 −1 , −1 πl1 , . . . , z(sn+1 ) ∈ (−∞, ∞)
= Pr z(s1 ) ∈ −1 πl1 −1 , −1 πl1 , . . . ,
z(sn ) ∈ −1 πln −1 , −1 πln
= G(θ (s1 ) = θ1∗ , . . . , θ (sn ) = θn∗ |{θl∗ }Ll=1 , {vl }Ll=1 ).
A.2 Proof of Theorem 2
If E(θl |θl−1 ) ≤ μ + λθl−1 then solving the recursion
E(θl ) ≤ μ
l−1
k=0
λk + λl E(θ1 ) =
μ
+ λl (E(θ1 ) − μ)
1−λ
and
E(θ (s)) =
Figure 7. Marginal distribution of sightings for (a) the European
Starling and (b) the Mourning Dove in North Carolina.
efficient high-dimensional specifications where the correlation
between processes varies in space.
An alternative construction for the LaSBP can be obtained
by using the order statistics from a sample of size L from the
baseline measure as the atoms of G. This approach, which was
suggested by one of the referees, will be explored elsewhere.
APPENDIX
A.1 Proof of Theorem 1
This is a simple consequence of the properties of a Gaussian
process, and can be extended for any other consistent selection process
∞
l=1
E(θl )
l−1
a
b
a+b a+b
≤
l−1
∞
b
μ
a
+ λl (E(θ1 ) − μ)
1−λ
a+b a+b
=
aλ(E(θ ) − μ)
μ
+
,
1−λ
a + b(1 − λ)
l=1
only if λb/(a + b) < 1. This result can be easily extended to the case
where E(θl ) ≤ c1 lp + c2 λl by noting that the expected value of the first
term corresponds to the value of the polylogarithm function of order p
(Abramowitz and Stegun 1965), which is finite.
A.3 Properties of the Selection Process
Let {z(s) : s ∈ D} be a standard Gaussian process with correlation
function γ (s, s′ ) and define U(s) = (z(s)). We call U(s) the selection
process, as it controls which atom is associated with each location.
This section describes some of the properties of U(s).
658
Journal of the American Statistical Association, June 2010
By construction, the marginals of U(s) are uniform on [0, 1], therefore, E(U(s)) = 1/2 for all s ∈ D. Equivalently, we can obtain this
result through computing the expectation,
∞ z
x2
z2
1
1
dx dz.
E(U(s)) =
√ exp −
√ exp −
2
2
2π
−∞ −∞ 2π
If we define random variables X and Z such that (X, Z)′ ∼ N(0, I), it
turns out that
∞ z
x2
z2
1
1
dx dz = Pr(Z − X > 0)
√ exp −
√ exp −
2
2
2π
−∞ −∞ 2π
= 1/2,
since Z − X ∼ N(0, 2). We can use a similar approach to compute
E(U(s)U(s′ )). Indeed,
E(U(s)U(s′ )) =
x2 + x22
1
exp − 1
2
−∞ −∞ −∞ −∞ 2π
∞ ∞ z1 z2
1
1 ′ −1
−1/2
||
×
exp − z z dx dz,
2π
2
where z = (z1 , z2 )′ and
=
1
γ (s, s′ )
Define
⎛X ⎞
1
⎛⎛ 0 ⎞ ⎛ 1
⎜ X2 ⎟
⎜0⎟ ⎜0
⎜ ⎟ ∼ N⎜
⎝⎝ ⎠ , ⎝
⎝Z ⎠
0
0
1
0
0
Z
2
γ (s, s′ )
.
1
0
0
0 ⎞⎞
1
0
0 ⎟⎟
⎠⎠ ,
0
1
γ (s, s′ )
′
1
0 γ (s, s )
then
E(U(s)U(s′ )) = Pr(Z1 − X1 > 0, Z2 − X2 > 0),
where
(Z1 − X1 > 0, Z2 − X2 > 0) ∼ N
2
0
,
γ (s, s′ )
0
γ (s, s′ )
2
.
Therefore,
1
Cov(U(s), U(s′ )) = (s,s′ ) (0, 0) − ,
4
where denotes the cumulative distribution of a bivariate normal random variable
with covariance matrix and (s, s′ ) =
γ (s,s′ )
2
. Clearly, if γ (s, s′ ) = 0 then E(U(s)U(s′ )) =
γ (s,s′ )
2
Pr(Z1 − X1 > 0, Z2 − X2 > 0) = Pr(Z1 − X1 > 0)Pr(Z2 − X2 > 0) =
E(U(s))E(U(s′ )), and therefore, Cov(U(s), U(s′ )) = 0. Higher-order
moments of the process can be similarly computed, resulting in expressions where the ith moment depends on the cumulative distribution of
an ith variate normal random variable.
SUPPLEMENTAL MATERIALS
Short note: Discusses details pertaining to the efficient implementation of the Markov chain Monte Carlo sampler discussed in Section 5, including the efficient generation of
pseudo-random numbers from truncated multivariate and
conditional normal distributions. (supplemental_material.
pdf)
[Received April 2008. Revised November 2009.]
REFERENCES
Abramowitz, M., and Stegun, I. A. (eds.) (1965), Handbook of Mathematical
Functions (1st ed.), New York: Dover. [652,657]
Beal, M. J., Ghahramani, Z., and Rasmussen, C. E. (2001), “The Infinite Hidden Markov Model,” in Proceedings of Fourteenth Annual Conference on
Neural Information Processing Systems, Vancouver, BC, Canada. [648]
Bollerslev, T. (1986), “Generalized Autoregressive Conditional Heteroscedasticity,” Journal of Econometrics, 31, 307–327. [654]
DeIorio, M., Müller, P., Rosner, G. L., and MacEachern, S. N. (2004), “An
ANOVA Model for Dependent Random Measures,” Journal of the American Statistical Association, 99, 205–215. [647,649]
Duan, J. A., Guindani, M., and Gelfand, A. E. (2007), “Generalized Spatial
Dirichlet Process Models,” Biometrika, 94, 809–825. [647]
Dunson, D. B., and Park, J.-H. (2008), “Kernel Stick-Breaking Processes,” Biometrika, 95, 307–323. [647]
Dunson, D. B., Pillai, N., and Park, J.-H. (2007), “Bayesian Density Regression,” Journal of the Royal Statistical Society, Ser. B, 69, 163–183. [647]
Ferguson, T. S. (1973), “A Bayesian Analysis of Some Nonparametric Problems,” The Annals of Statistics, 1, 209–230. [647]
Figueiredo, M. A. T. (2005), “Bayesian Image Segmentation Using Gaussian
Field Priors,” in Energy Minimization Methods in Computer Vision and Pattern Recognition, eds. A. Rangarajan, B. Vemuri, and A. L. Yuille, Berlin,
Heidelberg: Springer-Verlag, pp. 74–89. [647]
Figueiredo, M. A., Cheng, D. S., and Murino, V. (2007), “Clustering Under
Prior Knowledge With Application to Image Segmentation,” in Advances
in Neural Information Processing Systems 19, eds. B. Schölkopf, J. Platt,
and T. Hoffman, Cambridge, MA: MIT Press, pp. 401–408. [647]
Fox, E., Sudderth, E., Jordan, M. I., and Willsky, A. (2008), “An HDP-HMM for
Systems With State Persistence,” in Proceedings of the 25th International
Conference on Machine Learning (ICML), Helsinki, Finland. [648]
Gelfand, A. E., Kottas, A., and MacEachern, S. N. (2005), “Bayesian Nonparametric Spatial Modeling With Dirichlet Process Mixing,” Journal of the
American Statistical Association, 100, 1021–1035. [647-649,656]
Griffin, J. E., and Steel, M. F. J. (2006a), “Nonparametric Inference in Time Series Problems,” technical report, University of Warwick, Dept. of Statistics.
[647]
(2006b), “Order-Based Dependent Dirichlet Processes,” Journal of the
American Statistical Association, 101, 179–194. [647-649,656]
Hoff, P. D. (2007), “Extending the Rank Likelihood for Semiparametric Copula
Estimation,” The Annals of Applied Statistics, 1, 265–283. [648]
Ishwaran, H., and James, L. F. (2001), “Gibbs Sampling Methods for StickBreaking Priors,” Journal of the American Statistical Association, 96, 161–
173. [647,648,651,652]
(2003), “Some Further Developments for Stick-Breaking Priors: Finite and Infinite Clustering and Classification,” Sankhya, 65 577–592. [651]
Jacquier, E., Polson, N. G., and Rossi, P. E. (1994), “Bayesian Analysis of Stochastic Volatility Models,” Journal of Business and Economic Statistics, 12,
371–389. [654,655]
Kottas, A., Müller, P., and Quintana, F. (2005), “Nonparametric Bayesian Modeling for Multivariate Ordinal Data,” Journal of Computational and Graphical Statistics, 14, 610–625. [648]
Lo, A. Y. (1984), “On a Class of Bayesian Nonparametric Estimates: I. Density
Estimates,” The Annals of Statistics, 12, 351–357. [648]
MacEachern, S. N. (2000), “Dependent Dirichlet Processes,” technical report,
Ohio State University, Dept. of Statistics. [647,649]
(2007), Discussion of “Bayesian Nonparametric Modelling for Spatial Data Using Dirichlet Processes” by A. E. Gelfand, M. Guindani, and
S. Petrone, in Bayesian Statistics 8, eds. J. M. Bernardo, M. J. Bayarri,
J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West,
New York: Oxford University Press, pp. 196–198. [647]
Müller, P., Quintana, F., and Rosner, G. (2004), “Hierarchical Meta-Analysis
Over Related Non-Parametric Bayesian Models,” Journal of the Royal Statistical Society, Ser. B, 66, 735–749. [647]
Nelsen, R. B. (1999), An Introduction to Copulas, New York: Springer. [648]
Ongaro, A., and Cattaneo, C. (2004), “Discrete Random Probability Measures:
A General Framework for Nonparametric Bayesian Inference,” Statistics
and Probability Letters, 67, 33–45. [647]
Pal, N. R., and Pal, S. K. (1993), “A Review on Image Sementation Techniques,” Pattern Recognition, 26, 1277–1294. [647]
Petrone, S., Guindani, M., and Gelfand, A. E. (2009), “Hybrid Dirichlet Mixture Models for Functional Data,” Journal of the Royal Statistical Society,
Ser. B, 71, 755–782. [648,652]
Pham, D. L., Xu, C., and Prince, J. L. (2000), “A Survey of Current Methods in
Medial Image Segmentation,” Review of Biomedical Engineering, 2, 315–
337. [647]
Pitman, J. (1996), “Some Developments of the Blackwell–Macqueen Urn
Scheme,” in Statistics, Probability and Game Theory. Papers in Honor of
David Blackwell, eds. T. S. Ferguson, L. S. Shapeley, and J. B. MacQueen,
Hayward, CA: IMS, pp. 245–268. [648]
Rodríguez, Dunson, and Gelfand: Latent Stick-Breaking Processes
Richardson, S., and Green, P. J. (2001), “Hidden Markov Models and Disease
Mapping,” Journal of the American Statistical Association, 97, 1055–1070.
[655]
Roberts, G., and Papaspiliopoulos, O. (2008), “Retrospective Markov Chain
Monte Carlo Methods for Dirichlet Process Hierarchical Models,” Biometrika, 95, 169–186. [653]
Rodriguez, A., Dunson, D. B., and Gelfand, A. E. (2008), “The Nested Dirichlet
Process” (with discussion), Journal of the American Statistical Association,
103, 1131–1154. [647,651]
Rodriguez, A., Dunson, D. B., and Taylor, J. (2009), “Bayesian Hierarchically
Weighted Finite Mixtures Models for Samples of Distributions,” Biostatistics, 10, 155–171. [647]
659
Sethuraman, J. (1994), “A Constructive Definition of Dirichelt Priors,” Statistica Sinica, 4, 639–650. [647,648]
Suri, J. S., Wilson, D., and Laxminarayan, S. (eds.) (2005), Handbook of Biomedical Image Analysis. Volume 2: Segmentation Models. Part B, New
York: Springer. [647]
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006), “Sharing Clusters
Among Related Groups: Hierarchical Dirichlet Processes,” Journal of the
American Statistical Association, 101, 1566–1581. [647,648]
van Gael, J., Saatci, Y., Teh, Y.-W., and Ghahramani, Z. (2008), “Beam Sampling for the Infinite Hidden Markov Model,” in Proceedings of the 25th
International Conference on Machine Learning (ICML), Helsinki, Finland.
[648]