Bayesian Factor Zero-Inflated Poisson Model For Multiple Grouped Count Data
Abstract
This paper proposes a computationally efficient Bayesian factor model for multiple grouped count
data. Adopting the link function approach, the proposed model can capture the association within and
between the at-risk probabilities and Poisson counts over multiple dimensions. The likelihood function
for the grouped count data consists of the differences of the cumulative distribution functions evaluated
at the endpoints of the groups, defining the probabilities of each data point falling in the groups.
Posterior computation is facilitated by combining data augmentation of the underlying counts,
Pólya-Gamma augmentation of a negative binomial approximation to the Poisson distribution, and
parameter expansion for the factor components. The efficacy of the proposed factor model is
demonstrated using simulated data and real data on the involvement of youths in nineteen illegal
activities.
Key words: data augmentation; factor model; Markov chain Monte Carlo; multivariate count data;
parameter expansion; Pólya-Gamma augmentation
1 Introduction
Zero-inflation is a prevalent issue in the statistical analysis of count data in various applications,
such as epidemiology, health services research, and social studies. Several well-developed statistical
methods exist for analysing zero-inflated count data, with the zero-inflated model (Lambert, 1992)
being one of the commonly used approaches. See, for example, Neelon et al. (2016) for a review.
Another significant challenge in count data analysis arises from the occurrence of ‘grouped
counts’. Instead of actual counts, grouped count data provide frequencies of individuals for predefined
ordinal groups. Grouping occurs due to various factors, such as the sensitivity of the data topic and
the cognitive burden experienced by interviewees (Fu et al., 2018). For example, in our real data analysis,
the frequencies of involvement in illegal activities are reported in categories such as ‘never’, ‘once’,
‘twice’, ‘between three and five times’, ‘between six and ten times’, ‘between eleven and fifty times’
and ‘over fifty times’, instead of the exact frequencies.
∗ Corresponding author: gkobayashi@meiji.ac.jp
Although there exists a body of studies analysing grouped continuous data, especially in the context
of income data analysis (see, e.g., Kobayashi et al., 2022, 2023), the statistical analysis of grouped
count data has received much less attention, even though such data arise frequently, especially in
applied social science. To our knowledge, McGinley et al. (2015) is the only study introducing the model for
the grouped zero-inflated count data. McGinley et al. (2015) employs the likelihood function of an
ordinal response model where the likelihood contribution of each group is expressed by the difference
between the cumulative distribution function of a discrete probability distribution evaluated at the
endpoints of the group. These differences define the probabilities of the data points falling into the
groups. However, when zero-inflation is high, an analysis of zero-inflated grouped count data using
a univariate model can be distorted by the severe scarcity of information due to grouping and zero-
inflation. If the data include multiple count responses, leveraging shared information among them by
analysing them jointly considering a multivariate structure rather than treating them independently
would be beneficial.
In addition, there has also been a growing demand for the joint analysis of multiple count data
of which some or all dimensions are zero-inflated (see, e.g., Berry and West, 2020). However, unlike
continuous distributions such as the normal, developing and implementing a multivariate count model
is generally cumbersome, especially when the multivariate counts are zero-inflated, as in the recent
study of Liu and Tian (2015).
Factor analysis stands out as a common approach to analysing multivariate count data in a par-
simonious and computationally convenient manner. To introduce a factor structure into count data
analysis, a link function is commonly used to model a latent linear predictor incorporating latent
factors and covariates (Wedel et al., 2003). As an alternative approach, Larsson (2020) introduced a
distinct type of factor model for discrete data, differing from classical count factor models, which is
based on a dependent Poisson model (see, e.g. Karlis, 2003). Some previous research exists on the
factor models of zero-inflated count data, such as Neelon and Chung (2017) and Xu et al. (2021).
Neelon and Chung (2017) introduced the factor structure into the at-risk probability and the mean
count using the multiplicative function of the latent factor and regression components. Xu et al.
(2021) used the link-function approach to connect the zero-inflated count and latent linear predictor
with factors. The fundamental difference between our approach and the previous approaches lies in
the flexibility of the factor structure. Due to its multiplicative structure, the factor model studied in
Neelon and Chung (2017) permits only positive factors. Xu et al. (2021) employed the common latent
linear predictor for both the at-risk probability and the mean count, which results in a restrictive
correlation structure.
Based on the preceding, we propose the zero-inflated Poisson model with a flexible latent factor
structure for multiple grouped count data. For modelling grouped count in each dimension, we follow
McGinley et al. (2015) and introduce the likelihood function for an ordinal model described above. To
introduce the association within and between the at-risk and Poisson parts over different dimensions,
we introduce the individual-specific latent factors with the dimension-specific factor loadings for the
at-risk and Poisson parts. To facilitate posterior computation, we employ the Pólya-Gamma (PG)
mixture representation of Polson et al. (2013). Since our model is Poisson-based, following Hamura
et al. (2021), we approximate the Poisson model by the negative binomial model and apply PG data
augmentation. This augmentation enables us to carry out an efficient Gibbs sampling. Moreover, for
efficient sampling, we also borrow the idea of the parameter expansion technique of Ghosh and Dunson
(2009) for the factor components, but without the positive lower triangular constraints. The MCMC
draws of the unidentified working parameters are post-processed using the algorithm of Papastamoulis
and Ntzoufras (2022). While achieving a stable sampling of the factor components in the lower layer of
the hierarchical model may seem challenging, our sampling method works well, as illustrated in the
real data analysis where the counts are highly zero-inflated and highly coarsened into groups.
The remainder of this paper is organised as follows. Section 2 introduces the proposed factor model
for zero-inflated grouped counts. Then, the MCMC algorithm for the posterior inference is provided by
applying the PG augmentation, data augmentation of the underlying counts, and parameter expansion.
We also describe the post-processing for producing identified MCMC draws. The efficacy of the joint
modelling through the latent factors is demonstrated by using the simulated data in Section 3 and real
data in Section 4. Specifically, Section 4 analyses the grouped count data from the National Longitudinal
Survey of Youth 1979 (NLSY79) on illegal activities by youths. Finally, Section 5 provides some
conclusions and discussion.
2 Method
2.1 Model
Let $y_i = (y_{i1}, \dots, y_{iJ})'$ denote the $J$-dimensional vector of the response variables. Each
element of $y_i$ consists of zero-inflated grouped count data. Let $y_i^* = (y_{i1}^*, \dots, y_{iJ}^*)'$
denote the vector of the latent count data. Each element of $y_i^*$ is assumed to follow the
zero-inflated Poisson (ZIP) model expressed as
$$
y_{ij}^* \sim (1 - \pi_{ij})\, I(z_{ij} = 0,\ y_{ij}^* = 0) + \pi_{ij}\, Po(\mu_{ij})\, I(z_{ij} = 1),
\quad i = 1, \dots, N,\ j = 1, \dots, J,
$$
where $Po(\mu)$ denotes the Poisson distribution with the mean parameter $\mu$, and $z_{ij}$ is the latent
binary indicator such that $z_{ij} = 1$ with probability $\pi_{ij}$ and $z_{ij} = 0$ otherwise. If
$z_{ij} = 0$, the latent count is a structural zero with probability one; otherwise, it follows the
Poisson distribution (at-risk). Given a known grouping mechanism $c$, $y_{ij}$ is observed as
$c(y_{ij}^*)$. Generally, $c$ is in the form of
$$
y_{ij} = g \iff \kappa_g \le y_{ij}^* < \kappa_{g+1}, \quad g = 0, \dots, G - 1, \tag{1}
$$
where the $\kappa_g$'s define the thresholds of the ordinal groups (see, for example, McGinley et al.,
2015). Typically, $\kappa_0 = 0$ and $\kappa_G = \infty$. We utilise this data augmentation form for the
posterior computation.
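As an illustration of the grouping mechanism in (1), the following minimal sketch maps latent counts to
group indices. The function name `group_counts` and the array `KAPPA` are hypothetical names introduced
here; the threshold values are those of the NLSY79 categories described in the Introduction.

```python
import numpy as np

# Lower thresholds kappa_0, ..., kappa_{G-1}; kappa_G = infinity is implicit.
# These values encode the groups {0}, {1}, {2}, [3,5], [6,10], [11,50], [51, inf).
KAPPA = np.array([0, 1, 2, 3, 6, 11, 51])


def group_counts(y_star, kappa=KAPPA):
    """Return g such that kappa[g] <= y_star < kappa[g+1] for each latent count."""
    y_star = np.asarray(y_star)
    # With bins kappa[1:], digitize counts how many upper thresholds are <= y_star,
    # which is exactly the group index g in equation (1).
    return np.digitize(y_star, kappa[1:], right=False)


print(group_counts([0, 1, 2, 4, 10, 11, 50, 51, 200]))
```

Any other grouping mechanism of the form (1) can be represented by changing the threshold array only.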
The at-risk probability $\pi_{ij} = \Pr(z_{ij} = 1)$ is modelled using the logistic model given by
$$
\pi_{ij} = \frac{\exp(\eta_{1ij})}{1 + \exp(\eta_{1ij})}, \quad i = 1, \dots, N,\ j = 1, \dots, J.
$$
The Poisson mean is modelled through the log-link function $\mu_{ij} = \exp(\eta_{2ij})$.
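To make the grouped likelihood concrete, the sketch below computes the probability of each group for one
observation as differences of the Poisson cumulative distribution function, with the structural zeros
added to the group containing zero. It is only an illustration under the links above; `poisson_cdf` and
`grouped_zip_probs` are hypothetical helper names, and the linear predictors are passed in as plain numbers.

```python
import math


def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Po(mu); returns 0 for k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for y in range(1, k + 1):
        term *= mu / y
        total += term
    return min(total, 1.0)


def grouped_zip_probs(eta1, eta2, kappa):
    """Pr(y = g), g = 0..G-1, for lower thresholds kappa[0] < ... < kappa[G-1]."""
    pi = 1.0 / (1.0 + math.exp(-eta1))  # at-risk probability (logistic link)
    mu = math.exp(eta2)                 # Poisson mean (log link)
    probs = []
    for g, lower in enumerate(kappa):
        # The last group is open-ended, so its upper CDF value is one.
        last = g == len(kappa) - 1
        upper_cdf = 1.0 if last else poisson_cdf(kappa[g + 1] - 1, mu)
        p = pi * (upper_cdf - poisson_cdf(lower - 1, mu))
        if lower == 0:
            p += 1.0 - pi  # structural zeros fall in the group containing zero
        probs.append(p)
    return probs


print(grouped_zip_probs(0.0, 0.0, [0, 1, 2, 3, 6, 11, 51]))
```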
In order to connect the 2 × J responses, the common latent factor is introduced to the linear
predictor in such a way that
ηhij = x′ij β hj + u′i λhj , h = 1, 2,
where xij is the P × 1 vector of covariates with the associated coefficient β hj , ui = (ui1 , . . . , uiK )′ is
the K × 1 common latent factor and λhj = (λhj1 , . . . , λhjK )′ is the corresponding factor loading for
h = 1, 2 and j = 1, . . . , J.
In order to facilitate the posterior computation and identification, we follow Hamura et al. (2021)
to approximate the Poisson model by the negative binomial model and apply the Pólya-Gamma (PG)
mixture of Polson et al. (2013). It is well known that the negative binomial distribution has the
following mixture representation:
$$
Y \mid \epsilon \sim Po(\epsilon e^{\eta}), \quad \epsilon \sim Ga(r, r),
$$
where $Ga(a, b)$ denotes the gamma distribution with the mean $a/b$. The marginal probability function
of $Y$ is given by
$$
p(y) = \frac{\Gamma(y + r)}{\Gamma(r)\, y!} \frac{(e^{\eta}/r)^{y}}{(1 + e^{\eta}/r)^{y + r}}
     = \frac{\Gamma(y + r)}{\Gamma(r)\, y!} \frac{(e^{\psi})^{y}}{(1 + e^{\psi})^{y + r}},
$$
where $\psi = \eta - \log r$. The Poisson distribution is obtained in the limit of $r \to \infty$.
Therefore, for a sufficiently large $r$, we can apply the Pólya-Gamma (PG) mixture representation to this
approximate Poisson model:
$$
\frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}}
  = 2^{-b} e^{\kappa \psi} \int_0^{\infty} e^{-\omega \psi^2 / 2}\, p(\omega \mid b, 0)\, d\omega,
$$
where $a = y$, $b = y + r$, $\kappa = a - b/2$ and $\omega$ follows the PG distribution $PG(b, 0)$ with the
density $p(\omega \mid b, 0)$.
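The quality of the negative binomial approximation can be checked numerically. The following sketch
compares the log probability function above, with $\psi = \eta - \log r$, against the Poisson log
probability function for increasing $r$. The function names, the mean of 3 and the range of counts are
arbitrary choices made for illustration only.

```python
import math


def nb_log_pmf(y, eta, r):
    """Log of the negative binomial probability function with psi = eta - log r."""
    psi = eta - math.log(r)
    # Numerically stable evaluation of log(1 + exp(psi)).
    log1pexp = max(psi, 0.0) + math.log1p(math.exp(-abs(psi)))
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + y * psi - (y + r) * log1pexp)


def poisson_log_pmf(y, eta):
    """Log of the Poisson probability function with mean exp(eta)."""
    return y * eta - math.exp(eta) - math.lgamma(y + 1)


def max_gap(r, eta=math.log(3.0), y_max=20):
    """Largest absolute log-pmf difference over y = 0, ..., y_max - 1."""
    return max(abs(nb_log_pmf(y, eta, r) - poisson_log_pmf(y, eta))
               for y in range(y_max))


for r in (10, 100, 10_000):
    print(r, max_gap(r))
```

The printed gap shrinks as $r$ grows, which is the sense in which a large $r$ yields an approximate
Poisson model.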
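Collecting the $2J$ terms of the PG mixture, the contribution of the $i$th individual to the augmented
likelihood function conditionally on $\omega_{1ij}$, $\omega_{2ij}$, $y_{ij}^*$ and
$\prod_{j=1}^{J} z_{ij} = 1$ is proportional to
$$
\begin{aligned}
&\prod_{j=1}^{J} \exp\left\{ -\frac{\omega_{1ij}}{2} (x_i'\beta_{1j} + u_i'\lambda_{1j})^2
   + \kappa_{1ij} (x_i'\beta_{1j} + u_i'\lambda_{1j}) \right\} \\
&\quad \times \exp\left\{ -\frac{\omega_{2ij}}{2} (x_i'\beta_{2j} + u_i'\lambda_{2j} - \log r)^2
   + \kappa_{2ij} (x_i'\beta_{2j} + u_i'\lambda_{2j} - \log r) \right\} \\
&\propto \exp\left\{ -\frac{1}{2} (d_i - \beta x_i - \Lambda u_i - r)'\, \Omega_i\,
   (d_i - \beta x_i - \Lambda u_i - r) \right\}
\end{aligned}
\tag{2}
$$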
For the regression parameters $\beta_{hj}$, we assume the conditionally conjugate priors $N(b_0, B_0)$
for $h = 1, 2$ and $j = 1, \dots, J$. The standard prior distributions for the common factors and
loadings would be $u_{ik} \sim N(0, 1)$ and $\lambda_{hjk} \sim N(0, 1)$. Under this prior specification,
however, the mixing of an MCMC algorithm tends to be very slow.
The augmented model above is expanded for efficient posterior sampling, borrowing the idea of
Ghosh and Dunson (2009). Specifically, we introduce the working parameters λ∗hj = (λ∗hj1 , . . . , λ∗hjK )′
and u∗i = (u∗i1 , . . . , u∗iK )′ . The MCMC algorithm samples the working parameters from their posterior
distributions. The likelihood contribution of the ith individual in the expanded model is obtained
by simply replacing ui and λhj in (2) with u∗i and λ∗hj , respectively. The prior distributions for
the working parameters are given by λ∗hjk ∼ N (0, 1), h = 1, 2, j = 1, . . . , J, k = 1, . . . , K, and
u∗i ∼ N (0, Φ), i = 1, . . . , N where Φ = diag(ϕ1 , . . . , ϕK ). Further, it is assumed ϕk ∼ IG(ak , bk ), k =
1, . . . , K.
Our prior specification differs slightly from that in Ghosh and Dunson (2009). To correct for
the invariance of the factor loadings due to rotation and sign-switching, Ghosh and Dunson (2009)
employed the positive lower triangular (PLT) constraint where the diagonal elements of the factor
loading matrix are strictly positive, and the upper triangle elements are fixed to zero a priori. In our
model, it would have been λ∗1jk = 0, k = min(j, K) + 1, . . . , K and λ∗2jk = 0, k = min(J + j, K) +
1, . . . , K. However, PLT only partially solves the identification issues. For example, the identifiability
is lost when the loading for the first variable is close to zero. In this case, reordering the variables is
required. See Papastamoulis and Ntzoufras (2022) and references therein for the recent development
in the approaches to achieving identifiability of the factor model and their limitations.
Therefore, this paper employs the parameter expansion without constraining the factor loading
matrix. The MCMC draws of the unidentified parameters are post-processed to produce the posterior
draws of the identified parameters. See Section 2.4.
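The parameters and latent variables are sampled using the Gibbs sampler described in the following.
In this section, $\eta_{hij}$ is expressed in terms of the working parameters:
$\eta_{hij} = x_{ij}'\beta_{hj} + u_i^{*\prime}\lambda_{hj}^*$.
The joint distribution of the parameters and latent variables under the expanded model is proportional
to
$$
\begin{aligned}
&\prod_{j=1}^{J} \prod_{i=1}^{N} \Bigg[ \exp\left\{ -\frac{\omega_{1ij}}{2}
   (x_i'\beta_{1j} + u_i^{*\prime}\lambda_{1j}^*)^2
   + \kappa_{1ij} (x_i'\beta_{1j} + u_i^{*\prime}\lambda_{1j}^*) \right\} p(\omega_{1ij}) \\
&\quad \times \left( \exp\left\{ -\frac{\omega_{2ij}}{2}
   (x_i'\beta_{2j} + u_i^{*\prime}\lambda_{2j}^* - \log r)^2
   + \kappa_{2ij} (x_i'\beta_{2j} + u_i^{*\prime}\lambda_{2j}^* - \log r) \right\} p(\omega_{2ij})
   \right)^{I(z_{ij} = 1)\, I(y_{ij} = g,\ y_{ij}^* \in [\kappa_g, \kappa_{g+1}))} \Bigg] \\
&\quad \times \left[ \prod_{i=1}^{N} p(u_i^* \mid \Phi) \right]
   \left[ \prod_{h=1}^{2} \prod_{j=1}^{J} \prod_{k=1}^{K} p(\lambda_{hjk}^*) \right]
   \left[ \prod_{h=1}^{2} \prod_{j=1}^{J} p(\beta_{hj}) \right] p(\Phi)
\end{aligned}
\tag{3}
$$
where $I(\cdot)$ is the indicator function, and $p(\lambda_{hjk}^*)$, $p(\beta_{hj})$ and $p(\Phi)$ denote
the prior densities. Then,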
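the Gibbs sampler alternately samples $\{y_{ij}^*\}$, $\{\beta_{hj}\}$, $\{z_{ij}\}$, $\{\omega_{hij}\}$,
$\{\lambda_{hj}^*\}$, $\{u_i^*\}$ and $\{\phi_k\}$ from their full conditional distributions.
1. The full conditional distribution of $y_{ij}^*$ given $z_{ij} = 1$ is
$$
p(y_{ij}^* \mid z_{ij} = 1, \text{Rest}) \propto
  \frac{\Gamma(y_{ij}^* + r)}{\Gamma(r)\, y_{ij}^*!}
  \frac{(e^{\psi_{ij}})^{y_{ij}^*}}{(1 + e^{\psi_{ij}})^{y_{ij}^* + r}}\,
  I(y_{ij} = g,\ y_{ij}^* \in [\kappa_g, \kappa_{g+1})),
$$
where $\psi_{ij} = \eta_{2ij} - \log r$. This full conditional distribution is the negative binomial
distribution truncated on the interval $[\kappa_g, \kappa_{g+1})$.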
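One simple way to draw from such a truncated negative binomial distribution is to enumerate the
unnormalised probabilities over the group interval and sample proportionally, as in the sketch below.
This is not necessarily the sampling scheme used in the paper. The function name and the `cap` argument,
which truncates the open-ended top group at an assumed finite upper limit, are hypothetical choices made
for illustration.

```python
import math
import random


def sample_truncated_nb(lower, upper, eta2, r, rng, cap=5000):
    """Draw y* from the negative binomial kernel restricted to lower <= y* < upper.

    upper=None denotes the open-ended top group; `cap` is an assumed finite
    truncation point so that the support can be enumerated.
    """
    psi = eta2 - math.log(r)
    log1pexp = max(psi, 0.0) + math.log1p(math.exp(-abs(psi)))
    hi = lower + cap if upper is None else upper
    support = range(lower, hi)
    # Log kernel; the constant -lgamma(r) is dropped because it cancels on normalising.
    logw = [math.lgamma(y + r) - math.lgamma(y + 1) + y * psi - (y + r) * log1pexp
            for y in support]
    m = max(logw)
    weights = [math.exp(lw - m) for lw in logw]
    return rng.choices(support, weights=weights, k=1)[0]


rng = random.Random(0)
print([sample_truncated_nb(3, 6, 0.5, 1000.0, rng) for _ in range(5)])
```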
2. The sampling steps of $z_{ij}$, $\beta_{hj}$, $\lambda_{hj}^*$ and $\omega_{hij}$ are similar to those
provided in Neelon (2019).
$$
\Pr(z_{ij} = 1 \mid y_{ij}^* = 0, \text{Rest}) = \frac{\pi_{ij} v_{ij}^{r}}{1 - \pi_{ij}(1 - v_{ij}^{r})},
$$
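• Sampling $\beta_{1j}$ and $\lambda_{1j}^*$, $j = 1, \dots, J$: We sample $\beta_{1j}$ and
$\lambda_{1j}^*$ in one block. The full conditional distribution of
$(\beta_{1j}', \lambda_{1j}^{*\prime})'$ is given by $N(b_{1j}, B_{1j})$ where
$$
B_{1j} = \left( \sum_{i=1}^{N} \omega_{1ij} \tilde{x}_{ij} \tilde{x}_{ij}' + \tilde{B}_0^{-1} \right)^{-1},
\quad
b_{1j} = B_{1j} \left( \sum_{i=1}^{N} \kappa_{1ij} \tilde{x}_{ij} + \tilde{B}_0^{-1} \tilde{b}_0 \right),
$$
where $\tilde{x}_{ij} = (x_{ij}', u_i^{*\prime})'$, $\tilde{B}_0$ is the block diagonal matrix with $B_0$
and $I_K$ on the diagonal blocks and $\tilde{b}_0 = (b_0', 0_K')'$.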
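• Sampling $\beta_{2j}$ and $\lambda_{2j}^*$, $j = 1, \dots, J$: Similarly, $\beta_{2j}$ and
$\lambda_{2j}^*$ are sampled in one block. The full conditional distribution of
$(\beta_{2j}', \lambda_{2j}^{*\prime})'$ is given by $N(b_{2j}, B_{2j})$ where
$$
B_{2j} = \left( \sum_{i: z_{ij} = 1} \omega_{2ij} \tilde{x}_{ij} \tilde{x}_{ij}' + \tilde{B}_0^{-1} \right)^{-1},
\quad
b_{2j} = B_{2j} \left( \sum_{i: z_{ij} = 1} \tilde{x}_{ij}
  \left( \frac{y_{ij}^* - r}{2} + \omega_{2ij} \log r \right) + \tilde{B}_0^{-1} \tilde{b}_0 \right).
$$
• Sampling $u_i^*$, $i = 1, \dots, N$: The full conditional distribution of $u_i^*$ is given by
$N(m_i, V_i)$ where
$$
\begin{aligned}
V_i &= \left( \sum_{j=1}^{J} \omega_{1ij} \lambda_{1j}^* \lambda_{1j}^{*\prime}
  + \sum_{j: z_{ij} = 1} \omega_{2ij} \lambda_{2j}^* \lambda_{2j}^{*\prime} + \Phi^{-1} \right)^{-1}, \\
m_i &= V_i \left( \sum_{j=1}^{J} (\kappa_{1ij} - \omega_{1ij} x_i'\beta_{1j}) \lambda_{1j}^*
  + \sum_{j: z_{ij} = 1} (\kappa_{2ij} - \omega_{2ij} (x_i'\beta_{2j} - \log r)) \lambda_{2j}^* \right).
\end{aligned}
$$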
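The update of the working factor $u_i^*$ through $V_i$ and $m_i$ can be sketched in a few lines of NumPy.
The sketch assumes that the PG variables and the quantities $\kappa_{1ij}$ and $\kappa_{2ij}$ have already
been computed and are passed in as arrays. The function name `draw_u_star` and the argument layout are
hypothetical and serve only to illustrate the linear algebra.

```python
import numpy as np


def draw_u_star(x_i, beta1, beta2, lam1, lam2, omega1, omega2,
                kappa1, kappa2, z_i, phi, r, rng):
    """One Gaussian draw of u_i* together with its mean m_i and covariance V_i.

    Shapes: x_i (P,), beta1/beta2 (J, P), lam1/lam2 (J, K),
    omega1/omega2/kappa1/kappa2/z_i (J,), phi (K,).
    """
    at_risk = z_i == 1
    l2, w2, k2, b2 = lam2[at_risk], omega2[at_risk], kappa2[at_risk], beta2[at_risk]
    # Precision: sum_j omega_1ij lam_1j lam_1j' + sum_{j: z_ij=1} omega_2ij lam_2j lam_2j' + Phi^{-1}
    prec = np.diag(1.0 / phi)
    prec = prec + (lam1 * omega1[:, None]).T @ lam1
    prec = prec + (l2 * w2[:, None]).T @ l2
    V = np.linalg.inv(prec)
    # Linear term of m_i; the Poisson part uses only the dimensions with z_ij = 1.
    lin = lam1.T @ (kappa1 - omega1 * (beta1 @ x_i))
    lin = lin + l2.T @ (k2 - w2 * (b2 @ x_i - np.log(r)))
    m = V @ lin
    return rng.multivariate_normal(m, V), m, V
```

Explicit matrix inversion is used here for readability; a Cholesky-based solve would typically be
preferred for larger $K$.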
2.4 Post-processing
The MCMC draws of the factor components are processed in the following two steps. First, the
sampled working parameters are not identified in terms of scale (Section 2.2). The original parameters
are recovered through
$$
\lambda_{hjk} = \lambda_{hjk}^* \phi_k^{1/2}, \quad u_{ik} = u_{ik}^* \phi_k^{-1/2},
\quad j = 1, \dots, J,\ k = 1, \dots, K. \tag{4}
$$
Then, these parameters are still subject to the rotational and sign-switching invariance. We apply the
post-processing algorithm of Papastamoulis and Ntzoufras (2022) to the MCMC draws of λhjk . The
algorithm first applies the varimax rotation to each MCMC draw to solve the rotational invariance,
then to solve the sign-switching invariance, it applies the signed permutations to the MCMC output
until the transformed loadings are sufficiently close to some reference value. Their algorithm is provided
in the R package factor.switching. See Papastamoulis and Ntzoufras (2022) for details.
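The first post-processing step in (4) is a simple rescaling of each MCMC draw. A minimal sketch is given
below; the function name `recover_scale` and the array shapes are assumptions made for illustration. The
second step, which handles rotation and sign switching, relies on the R package mentioned above and is
not reproduced here.

```python
import numpy as np


def recover_scale(lam_star, u_star, phi):
    """Apply equation (4) to one MCMC draw.

    lam_star: (2, J, K) working loadings, u_star: (N, K) working factors,
    phi: (K,) diagonal elements of Phi.
    """
    scale = np.sqrt(phi)
    lam = lam_star * scale  # lambda_hjk = lambda*_hjk * phi_k^{1/2}
    u = u_star / scale      # u_ik = u*_ik * phi_k^{-1/2}
    return lam, u


rng = np.random.default_rng(0)
lam_star = rng.standard_normal((2, 3, 2))
u_star = rng.standard_normal((5, 2))
lam, u = recover_scale(lam_star, u_star, np.array([0.5, 4.0]))
# The product of factors and loadings, and hence the linear predictor, is unchanged.
print(np.allclose(u @ lam[0].T, u_star @ lam_star[0].T))
```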
3 Simulation study
Here, the performance of the proposed model is investigated using simulated data. We set
$N = 1000$, $J = 10$, $K = 1$ and $P = 2$. The regression coefficients are given by
$\beta_{1j}^{true} = (0.5, 0.5)$ and $\beta_{2j}^{true} = (-0.5, -1)$ for $j = 1, \dots, J$. The covariate
vector is $x_{ij} = (1, x_i)'$ for $i = 1, \dots, N$, $j = 1, \dots, J$, and $x_i \sim N(0, 1)$. For the
factor loadings, $\lambda_1^{true} = (0.89, 0, 0.25, 0, 0.8, 0, 0.5, 0, 0, 0)'$ and
$\lambda_2^{true} = (0, 0, 0.85, 0.8, 0, 0.75, 0.75, 0, 0.8, 0.8)'$. We consider the following two
settings for the grouping mechanism. In Setting 1, the groups are {0}, {1}, {2}, [3, 5], [6, 10],
[11, 50], [51, ∞), which is the same grouping as in the NLSY79 data in Section 4. Setting 2 considers a
finer grouping mechanism such that the grouped data contain more information: {0}, {1}, {2}, . . . ,
{10}, [11, 15], [16, 20], [21, 25], [26, 30], [31, 40], [41, 50], [51, ∞). The data are replicated
R = 100 times. The overall proportion of structural zeros is approximately 0.6.
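A data set of this kind can be generated along the following lines. The sketch uses the coefficient and
loading values stated above with the Setting 1 grouping. It additionally assumes that the single latent
factor is drawn from N(0, 1), which is not stated explicitly in the text; the function name
`simulate_setting1` and the seed are arbitrary.

```python
import numpy as np


def simulate_setting1(seed=0, N=1000):
    """Simulate one replicate of grouped zero-inflated counts under Setting 1."""
    rng = np.random.default_rng(seed)
    beta1 = np.array([0.5, 0.5])    # at-risk coefficients
    beta2 = np.array([-0.5, -1.0])  # Poisson coefficients
    lam1 = np.array([0.89, 0, 0.25, 0, 0.8, 0, 0.5, 0, 0, 0])
    lam2 = np.array([0, 0, 0.85, 0.8, 0, 0.75, 0.75, 0, 0.8, 0.8])
    J = lam1.size
    X = np.column_stack([np.ones(N), rng.standard_normal(N)])
    u = rng.standard_normal(N)  # assumption: factor drawn from N(0, 1)
    eta1 = (X @ beta1)[:, None] + np.outer(u, lam1)
    eta2 = (X @ beta2)[:, None] + np.outer(u, lam2)
    z = rng.random((N, J)) < 1.0 / (1.0 + np.exp(-eta1))  # at-risk indicators
    y_star = np.where(z, rng.poisson(np.exp(eta2)), 0)    # latent counts
    # Setting 1 groups: {0}, {1}, {2}, [3,5], [6,10], [11,50], [51, inf)
    y = np.digitize(y_star, [1, 2, 3, 6, 11, 51])
    return y, y_star, z


y, y_star, z = simulate_setting1()
print(y.shape, np.bincount(y.ravel(), minlength=7))
```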
The proposed factor ZIP model for grouped data (GFZIP) is compared with the following three
models. Firstly, the ZIP model for grouped data (GZIP) is considered. Since this model does not
include factors that provide links among structural zeros and grouped counts, it is essentially a
univariate model and thus is estimated separately for each j. Secondly, the zero-inflated negative
binomial model for grouped data (GZINB; McGinley et al., 2015) is also considered. Finally, to assess
the effect of the loss of information due to the
grouping mechanism, the factor ZIP (FZIP) model for the ungrouped count data is considered and
fitted to the underlying count data without the grouping mechanism.
For all models, we assume β hj ∼ N (0, 100I) for h = 1, 2 and j = 1, . . . , J. For each model, the
MCMC algorithm is run for 22,000 iterations, with the initial 2,000 draws discarded as the burn-in
period. The parameter estimation is based on the remaining 20,000 MCMC draws.
The performance of the models is assessed based on the bias
$$
\text{Bias}(\beta_{hjp}) = \frac{1}{R} \sum_{r=1}^{R} \left( \hat{\beta}_{hjp}^{(r)} - \beta_{hjp}^{true} \right)
$$
and the root mean squared error (RMSE)
$$
\text{RMSE}(\beta_{hjp}) = \sqrt{ \frac{1}{R} \sum_{r=1}^{R}
  \left( \hat{\beta}_{hjp}^{(r)} - \beta_{hjp}^{true} \right)^2 }
$$
for $h = 1, 2$, $j = 1, \dots, J$ and $p = 1, \dots, P$, where $\hat{\beta}_{hjp}^{(r)}$ is the posterior
mean from the $r$th replication of the data. For the factor loadings, we compute the bias and RMSE for
vec(ΛΛ′), as the post-processed signs of the loadings vary over the replications.
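Both criteria reduce to averages over the replication axis. A minimal sketch follows, in which
`estimates` is assumed to hold the posterior means stacked over replications; the function name
`bias_rmse` is hypothetical.

```python
import numpy as np


def bias_rmse(estimates, truth):
    """Bias and RMSE over replications.

    estimates: array of shape (R, ...) with posterior means per replication,
    truth: array broadcastable to the trailing dimensions.
    """
    err = np.asarray(estimates, dtype=float) - np.asarray(truth, dtype=float)
    bias = err.mean(axis=0)
    rmse = np.sqrt((err ** 2).mean(axis=0))
    return bias, rmse


bias, rmse = bias_rmse([[0.4, 0.6], [0.6, 0.6]], [0.5, 0.5])
print(bias, rmse)
```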
We also evaluate the true positive (TPR), true negative (TNR), false positive (FPR) and false
negative (FNR) rates for being at-risk conditionally on the zero response. The posterior probability of
the $i$th individual being at-risk in the $j$th dimension given the response $y_{ij} = 0$ is denoted by
$\Pr(z_{ij} = 1 \mid y_{ij} = 0)$. It is estimated by $\hat{\pi}_{ij} = \frac{1}{M} \sum_{m=1}^{M} z_{ij}^{m}$
for $i$ such that $y_{ij} = 0$ based on the $M$ draws of the MCMC algorithm. The individual $i$ is deemed
to be at-risk in the $j$th dimension if $\hat{\pi}_{ij} > 0.5$. Then, the TPR, TNR, FPR, and FNR are
calculated as
$$
\begin{aligned}
\text{TPR}_j &= \frac{\sum_{i=1}^{N} I(\hat{\pi}_{ij} > 0.5,\ y_{ij} = 0,\ z_{ij}^{true} = 1)}
                     {\sum_{i=1}^{N} I(y_{ij} = 0,\ z_{ij}^{true} = 1)}, &
\text{TNR}_j &= \frac{\sum_{i=1}^{N} I(\hat{\pi}_{ij} \le 0.5,\ y_{ij} = 0,\ z_{ij}^{true} = 0)}
                     {\sum_{i=1}^{N} I(y_{ij} = 0,\ z_{ij}^{true} = 0)}, \\
\text{FPR}_j &= \frac{\sum_{i=1}^{N} I(\hat{\pi}_{ij} > 0.5,\ y_{ij} = 0,\ z_{ij}^{true} = 0)}
                     {\sum_{i=1}^{N} I(y_{ij} = 0,\ z_{ij}^{true} = 0)}, &
\text{FNR}_j &= \frac{\sum_{i=1}^{N} I(\hat{\pi}_{ij} \le 0.5,\ y_{ij} = 0,\ z_{ij}^{true} = 1)}
                     {\sum_{i=1}^{N} I(y_{ij} = 0,\ z_{ij}^{true} = 1)},
\end{aligned}
$$
for $j = 1, \dots, J$, where $z_{ij}^{true}$ denotes the true value of the latent at-risk indicator
$z_{ij}$.
As in the real data application on the youths’ involvement in illegal activities in Section 4, the
proportion of at-risk individuals among those whose responses are zero would be a quantity of interest.
The proportion of interest is defined by
$$
\hat{R}_j \equiv \frac{\sum_{i=1}^{N} I(\hat{\pi}_{ij} > 0.5,\ y_{ij} = 0)}{\sum_{i=1}^{N} I(y_{ij} = 0)},
\quad j = 1, \dots, J. \tag{5}
$$
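The four rates and the proportion in (5) can be computed directly from the stored draws of the at-risk
indicators. The sketch below assumes the draws are stored in an array of shape (M, N, J); the function
name `at_risk_rates` is hypothetical. No guard is included for dimensions in which a denominator is zero.

```python
import numpy as np


def at_risk_rates(z_draws, y, z_true):
    """TPR, TNR, FPR, FNR and R-hat per dimension, conditional on y_ij = 0.

    z_draws: (M, N, J) MCMC draws of z_ij in {0, 1},
    y: (N, J) grouped responses, z_true: (N, J) true indicators.
    """
    pi_hat = z_draws.mean(axis=0)  # posterior probability of being at-risk
    zero = y == 0
    flagged = pi_hat > 0.5
    pos = zero & (z_true == 1)     # zero response, truly at-risk
    neg = zero & (z_true == 0)     # zero response, structural zero
    return {
        "TPR": (flagged & pos).sum(axis=0) / pos.sum(axis=0),
        "TNR": (~flagged & neg).sum(axis=0) / neg.sum(axis=0),
        "FPR": (flagged & neg).sum(axis=0) / neg.sum(axis=0),
        "FNR": (~flagged & pos).sum(axis=0) / pos.sum(axis=0),
        "R_hat": (flagged & zero).sum(axis=0) / zero.sum(axis=0),
    }
```

By construction, TPR and FNR sum to one in each dimension, as do TNR and FPR.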
Table 1 presents the biases and RMSEs for the coefficients β 1j and β 2j from 100 replications of the
data averaged over J dimensions. Since FZIP, which knows the true underlying counts before grouping,
is not affected by the grouping mechanism, it produces identical results under both simulation settings.
Therefore, the cells for FZIP in Setting 2 are left blank.
When comparing the proposed GFZIP, GZIP, and GZINB, ignoring the factor structure among
the at-risk probabilities and Poisson parts leads to larger bias and RMSE. As expected, the GFZIP
performed the best among the three models. GZINB resulted in large RMSE for the at-risk coefficients
β 1jp , especially in the case of Setting 1. This is due to the numerical instability from the coarse grouped
data. Compared to FZIP, GFZIP resulted in increased bias and RMSE for the Poisson coefficients β 2j
due to the loss of information through the grouping mechanism. It is also seen that the performance
of GFZIP regarding the Poisson coefficients improves as the number of groups increases from Setting 1
to Setting 2, where the grouped data contain more information. This is also the case for GZIP and
GZINB, and this phenomenon was also observed in McGinley et al. (2015).
Figure 1 presents the boxplots of the bias and RMSE for vec(ΛΛ′ ) under GFZIP and FZIP. The
bias under GFZIP is larger than that under FZIP in both settings due to grouping. It is also seen
that the bias under GFZIP decreases as the finer grouping mechanism is used in Setting 2. A similar
pattern is observed for the RMSE.
Figures 2 and 3 present TPR, TNR, FPR, and FNR averaged over 100 replications under GFZIP,
GZIP, GZINB and FZIP in Settings 1 and 2, respectively. Firstly, we observe that the results under
GFZIP and FZIP become almost identical in Setting 2, while there are some discrepancies in Setting 1.
In both settings, GZIP resulted in TPR and FPR for some dimensions being close to zero. On the
contrary, TNR and FNR in those dimensions are close to one.
Figure 4 presents the estimated proportions of at-risk individuals given yij = 0, R̂j , averaged over
100 replications. The GFZIP and FZIP models seem to work well, with the estimates being close
to the truth. Their results become almost identical in Setting 2, similar to Figures 2 and 3. The
figure also shows that GZINB overestimates Rj in both settings. The results for GZIP are similar
to those in the previous figures. In some dimensions, the estimates of Rj under GZIP are close to
zero, implying the false negative rates are close to one. In the other dimensions, the estimates under
GZIP are similar to those of GFZIP. The behaviour in GZIP results from ignoring the association
between the dimensions. In the real data analysis in Section 4, where it would be natural to consider
the association between the youths’ illegal activities, we observe a similar result under GZIP.
Table 1: Bias and RMSE for the coefficients β1jp and β2jp from 100 replications averaged over J = 10
dimensions. The results for FZIP, which are not affected by the grouping mechanism, are the same
for both settings and are therefore shown only for Setting 1.

                              Bias                                 RMSE
Parameter  Setting   GFZIP    GZIP    GZINB   FZIP      GFZIP   GZIP    GZINB   FZIP
β1j1       1         -0.093   -0.618   1.549   0.110    0.290   0.667   2.320   0.303
β1j2       1         -0.043   -0.329   0.389   0.067    0.205   0.382   0.797   0.218
β1j1       2          0.115   -0.428   1.044   --       0.302   0.565   1.680   --
β1j2       2          0.071   -0.210   0.330   --       0.217   0.315   0.633   --
β2j1       1          0.087    0.533   0.084  -0.023    0.153   0.555   0.194   0.111
β2j2       1         -0.036   -0.178  -0.029  -0.006    0.076   0.163   0.098   0.062
β2j1       2         -0.025    0.394   0.048   --       0.111   0.449   0.188   --
β2j2       2         -0.009   -0.076  -0.007   --       0.063   0.125   0.083   --
[Figure 1: image not reproduced. Recoverable vertical axis labels: bias, RMSE.]
[Figure 2: image not reproduced. Panels: TPR, TNR, FPR, FNR; horizontal axis: j; vertical axis from 0 to 1;
legend: GFZIP, GZIP, GZINB, FZIP.]
Figure 2: True positive (TPR), true negative (TNR), false positive (FPR) and false negative rates
(FNR) for GFZIP, GZIP, GZINB, and FZIP in Setting 1
[Figure 3: image not reproduced. Panels: TPR, TNR, FPR, FNR; horizontal axis: j; vertical axis from 0 to 1;
legend: GFZIP, GZIP, GZINB, FZIP.]
Figure 3: True positive (TPR), true negative (TNR), false positive (FPR) and false negative rates
(FNR) for GFZIP, GZIP, GZINB, and FZIP in Setting 2
[Figure 4: image not reproduced. Panels: Setting 1, Setting 2; horizontal axis: j; vertical axis: Rj from
0 to 1; legend: True, GFZIP, GZIP, GZINB, FZIP.]
We consider the number of times youths were involved in nineteen illegal activities (J = 19),
obtained from the 1980 round of the National Longitudinal Survey of Youth 1979 (NLSY79) data. In
NLSY79, the questionnaire was designed so that the respondents answered with an exact frequency or
an interval of frequencies for each illegal activity in the year prior to the interview. The answers are
then published as grouped count data. Although the data are old, they provide valuable information on
the problematic behaviour of youths, for which statistical analyses are still relevant today.
The choices are ‘never’ (g = 0: {0}), ‘once’ (g = 1: {1}), ‘twice’ (g = 2: {2}), ‘between three
and five times’ (g = 3: [3, 5]), ‘between six and ten times’ (g = 4: [6, 10]), ‘between eleven and fifty
times’ (g = 5: [11, 50]) and ‘over fifty times’ (g = 6: [51, ∞)). Table 2 describes the nineteen activities
considered in this analysis and associated labels used in the following figures and tables.
Figure 5 presents the histograms of the number of times 2865 youths were involved in the 19 illegal
activities in the previous year. The numbers in the panels indicate the fractions of zeros. A
substantially large proportion of youths were not involved in each activity, resulting in many zeros.
For example, the observed proportions of zeros for the activities of a highly criminal nature, such as
sell marijuana, sell hard drugs and break in, are particularly high, exceeding 0.9. The proportion of
zeros for alcohol is 0.39. It is much lower than those for other activities, as drinking is more common
among youths, though this value may still be relatively high in the context of zero-inflated count data.
The histograms only reveal the distribution of involvement in each activity separately and the
extent of the zero inflation. However, we are also interested in the association among the activities
because it would be natural to assume that involvement in one activity and its frequency may be
associated with those in another activity, such as the use of alcohol and marijuana. Figure 6 presents
the heatmaps of log frequencies for arbitrarily selected pairs of activities. One is added to the
frequencies before taking the log. Some observations from the figure are as follows. The frequencies for
non-involvement in both activities are the highest for all pairs of activities. The top left and middle panels
indicate that a certain fraction of youths had experience using marijuana or hard drugs while they
did not sell them. The top left panel also shows that the youths who sold marijuana frequently used
marijuana frequently, as indicated by the darker shades in the top right corner of the panel. A similar
pattern is seen in the pair of hard drugs and marijuana in the top right panel, where most youths
tended to use marijuana only, but the frequent users used both of them. The bottom left panel shows
that frequent drinking of alcohol is associated with frequent use of marijuana, indicating they may be
used together. Therefore, it would be more appropriate to analyse all activities jointly rather than
treat each separately. The proposed GFZIP model can take into account these data characteristics.
Since the information specific to each involvement in an activity is not available, only the individual
characteristics are used as covariates: xij = xi for i = 1, . . . , N . The covariate information includes the
constant, age, gender, race, grade, residence, poverty and mental status. Table 3 presents the summary
of the covariates. For the prior distributions for the coefficient vectors, we use β hj ∼ N (0, 100I) for
h = 1, 2 and j = 1, . . . , J.
As in the simulation study, we compare the proposed GFZIP model with GZIP and GZINB models.
For GFZIP, we consider the three cases for the number of factors: K = 1, 2, 3. For each model, the
MCMC algorithm is run for 60,000 iterations. The first 20,000 draws are discarded as the burn-in period
and the remaining 40,000 draws are retained for the posterior inference. The models are compared
based on a version of the posterior predictive loss (PPL) of Gelfand and Ghosh (1998), which is
similar to the one considered by Sugasawa et al. (2020):
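$$
\text{PPL}(\mathcal{M}) = \frac{1}{N} \sum_{j=1}^{J} \sum_{g=0}^{G} V_{jg}^{\mathcal{M}}
  + \frac{1}{N + 1} \sum_{j=1}^{J} \sum_{g=0}^{G} \left( c_{jg} - E_{jg}^{\mathcal{M}} \right)^2,
$$
where $c_{jg}$ is the number of individuals belonging to the $g$th group for the $j$th activity, and
$E_{jg}^{\mathcal{M}}$ and $V_{jg}^{\mathcal{M}}$, respectively, are the mean and variance of the posterior
predictive distribution for $c_{jg}$ under model $\mathcal{M}$.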
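Given posterior predictive draws of the group counts, this criterion is a short computation. The sketch
below assumes the replicated counts are stored in an array of shape (M, J, number of groups) and uses the
population variance (`ddof=0`) over draws, which is an assumption since the text does not specify the
variance estimator; the function name `ppl` is hypothetical.

```python
import numpy as np


def ppl(c_obs, c_rep, N):
    """Posterior predictive loss from observed and replicated group counts.

    c_obs: (J, G) observed counts per activity and group,
    c_rep: (M, J, G) posterior predictive draws of the same counts,
    N: number of individuals.
    """
    E = c_rep.mean(axis=0)  # posterior predictive means
    V = c_rep.var(axis=0)   # posterior predictive variances (ddof=0, assumed)
    return V.sum() / N + ((c_obs - E) ** 2).sum() / (N + 1)


c_obs = np.array([[2, 0]])
c_rep = np.array([[[1, 1]], [[3, 1]]])
print(ppl(c_obs, c_rep, N=2))
```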
Table 2: Illegal activities in NLSY79

Label             Description
alcohol           Drank beer, wine, or liquor without parents’ permission
run away          Run away from home
damage            Purposely damaged or destroyed property
fight             Got into a physical fight
shoplift          Taken something from a store without paying
steal lt $50      Stolen other’s belongings worth less than $50
steal ge $50      Stolen other’s belongings worth equal to or more than $50
extort            Used force to get money or things from a person
threaten          Hit or seriously threatened to hit someone
attack            Attacked someone with the idea of seriously hurting or killing
use marijuana     Smoked marijuana or hashish
use hard drugs    Used any drugs or chemicals except marijuana
sell marijuana    Sold marijuana or hashish
sell hard drugs   Sold hard drugs
con               Tried to get something by lying to a person
vehicle           Taken a vehicle without the owner’s permission
break in          Broken into a building or vehicle
sell stolen       Sold or held stolen goods
gambling          Helped in a gambling operation
[Figure 5: image not reproduced. One histogram panel per activity over the groups 0, 1, 2, 3−5, 6−10,
11−50, 51−, with counts from 0 to 2500. Legible panel titles and fractions of zeros: alcohol 0.39,
run_away 0.917, damage 0.754, fight 0.628. Fractions legible without panel titles: 0.702, 0.804, 0.942,
0.95, 0.571, 0.891, 0.615, 0.857.]
Figure 5: Histograms of NLSY79 data on the illegal activities of 2865 youths. The numbers indicate
the fractions of zeros.
[Figure 6: image not reproduced. Heatmap panels over the groups 0, 1, 2, 3−5, 6−10, 11−50, 51−. Legible
axis labels: use_marijuana, use_hard_drugs, sell_marijuana, sell_hard_drugs, alcohol, threaten, attack,
fight.]
Figure 6: Heatmaps of log frequencies for the arbitrarily selected pairs of activities
Table 3: Covariates

Label   Description                                                 Mean    s.d.
age     Age of respondent                                           16.11   0.777
male    Dummy variable for male                                     0.505   0.250
black   Dummy variable for the respondent’s race (black)            0.251   0.434
hisp    Dummy variable for the respondent’s race (Hispanic)         0.175   0.380
grade   Highest grade achieved                                      9.555   1.029
self    Log score of self-esteem                                    3.059   0.188
urban   Dummy variable for the respondent living in an urban area   0.761   0.428
pov     Dummy variable for the respondent in poverty                0.218   0.413
4.2 Results
First, we compare the PPL values presented in Table 4. GFZIP with one factor resulted in the
smallest PPL, followed by GZIP. The proposed GFZIP model, which accounts for the association among
the decisions on involvement in the activities and the frequencies of involvement, is more
appropriate than GZIP, which treats each activity separately. The PPL increases as the number
of factors increases. This is a natural result, as the information in our dataset is severely limited
by the coarse grouping mechanism. GZINB resulted in the largest PPL, presumably because it
suffers from computational instability when the groups are coarsely defined, as observed
in the simulation study.
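As a point of reference, the posterior predictive loss of Gelfand and Ghosh (1998) under squared
error loss is the sum of a goodness-of-fit term G and a predictive-variance penalty P. The following
is a minimal sketch under that assumption. The function name ppl, its inputs and the use of ungrouped
values are illustrative only; the criterion reported in Table 4 is evaluated for grouped responses
and may be defined differently.

```python
from statistics import fmean, pvariance


def ppl(observed, predictive_draws):
    """Posterior predictive loss under squared error loss (illustrative sketch).

    observed: list of n observed values.
    predictive_draws: list of S posterior predictive replicates, each of length n.
    Returns (G, P, G + P): fit term, variance penalty, and their sum.
    """
    goodness = 0.0
    penalty = 0.0
    for i, y_i in enumerate(observed):
        # Collect the S replicated values for observation i.
        draws_i = [float(rep[i]) for rep in predictive_draws]
        mean_i = fmean(draws_i)
        goodness += (mean_i - y_i) ** 2           # squared bias of the predictive mean
        penalty += pvariance(draws_i, mu=mean_i)  # predictive variance
    return goodness, penalty, goodness + penalty


# Toy example with three observations and three predictive replicates.
y_obs = [0, 2, 1]
y_rep = [[0, 1, 1], [0, 3, 1], [0, 2, 1]]
g_term, p_term, total = ppl(y_obs, y_rep)
```

A smaller total indicates a better balance between fit and predictive variability, which is how the
comparison among GFZIP, GZIP and GZINB is read.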
Figure 7 presents the trace plots of the Gibbs sampler for the selected parameters under GFZIP.
For the factor loadings, the reordered series are shown. Although the model includes many latent
variables within a multi-level hierarchy, the Gibbs sampler appears to be working reasonably
well.
Table 5 presents the posterior means and 95% credible intervals for the factor loadings under
GFZIP. Except for the at-risk loadings of fight and gambling, the 95% credible intervals do not
include zero for any activity. Among the at-risk loadings λ1 whose credible intervals exclude zero,
the three with the largest magnitudes in the posterior means are those for sell_hard_drugs (−1.075),
use_hard_drugs (−0.900) and sell_marijuana (−0.594), all of which are drug-related. For all Poisson
factor loadings λ2, the 95% credible intervals do not include zero. The loadings with the largest
magnitudes in the posterior means are again drug-related: sell_marijuana (−3.374), use_marijuana
(−2.991) and use_hard_drugs (−2.572). Therefore, the single common latent factor included in the
model is interpreted as the drug-related factor.
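The summaries of the kind reported in Table 5 can be obtained from the MCMC output along the following
lines. This is a minimal sketch that assumes equal-tailed intervals computed from the empirical
quantiles of the draws; the function name posterior_summary and the toy draws are hypothetical, and
the interval type used for Table 5 is not stated in this section.

```python
def posterior_summary(draws, level=0.95):
    """Posterior mean, equal-tailed credible interval and zero-exclusion flag."""
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - level) / 2.0

    def quantile(q):
        # Linear interpolation between adjacent order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    mean = sum(ordered) / n
    lower, upper = quantile(tail), quantile(1.0 - tail)
    # The interval excludes zero when it lies entirely on one side of it.
    excludes_zero = lower > 0.0 or upper < 0.0
    return mean, (lower, upper), excludes_zero


# Toy draws for a single loading (not taken from the paper).
toy_draws = [-1.2, -1.0, -0.9, -1.1, -0.8]
post_mean, (ci_lower, ci_upper), ci_excludes_zero = posterior_summary(toy_draws)
```

Because the sign of a factor loading is only identified after post-processing, such summaries are
meaningful only for the reordered and sign-aligned series.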
Figure 8 presents the heatmaps of the posterior means of λh λ′h, h = 1, 2, under GFZIP as indicators
of the association within the at-risk and Poisson parts. The activities are ordered in each panel based
on hierarchical clustering for better visibility and interpretability. The darker the shade of the
block for λhj λhj′, the greater the association between activities j and j′ in part h. In the top
left corner of the left panel, there is a noticeably dark patch. It corresponds to the
association between sell_hard_drugs and use_hard_drugs, the two activities with the largest factor
loadings in the at-risk part. The figure shows that involvement in these activities is also associated
with involvement in almost all the other activities except for fight, as indicated by the left and top
edges of the heatmap. Activities such as sell_marijuana, alcohol and steal_ge_50 are relatively
highly associated with sell_hard_drugs and use_hard_drugs. These four activities also exhibit a mild
degree of association among themselves. In the top left corner of the right panel, there is also a dark
patch indicating the association among use_hard_drugs, use_marijuana and sell_marijuana. Again,
these activities exhibit association with all the other activities, as indicated by the darker bands along
the left and top edges.
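The association matrices discussed here are posterior means of products of loadings, so the product
is formed within each MCMC iteration before averaging. A minimal sketch for the single-factor case
follows; the function name outer_product_mean and the toy draws are hypothetical. Passing the same
draws twice gives λh λ′h, while passing the at-risk and Poisson draws gives the cross product λ1 λ′2.

```python
def outer_product_mean(draws_a, draws_b):
    """Average the outer product of two loading vectors over MCMC draws.

    draws_a[s] and draws_b[s] are the loading vectors at iteration s
    (single-factor case, one loading per activity).
    """
    n_draws = len(draws_a)
    rows, cols = len(draws_a[0]), len(draws_b[0])
    total = [[0.0] * cols for _ in range(rows)]
    for vec_a, vec_b in zip(draws_a, draws_b):
        for j in range(rows):
            for k in range(cols):
                total[j][k] += vec_a[j] * vec_b[k]
    return [[value / n_draws for value in row] for row in total]


# Toy draws for two activities over two iterations (not taken from the paper).
lambda1_draws = [[-1.0, -0.5], [-1.2, -0.7]]
within_at_risk = outer_product_mean(lambda1_draws, lambda1_draws)
```

With a single factor, each product is unchanged when all loadings flip sign together, which makes
these matrices convenient summaries of association.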
Figure 9 presents the heatmap of the posterior means of λ1 λ′2, representing the association between
the at-risk and Poisson parts. The activities are ordered based on hierarchical clustering. As
in Figure 8, a dark patch for the drug-related activities between the at-risk and Poisson parts is
recognisable. The figure shows that being at risk for use_hard_drugs and sell_hard_drugs is highly
associated with the Poisson counts for the same activities and for use_marijuana. Being at risk
for using and selling hard drugs is also associated with the Poisson counts for all the other activities,
and the Poisson counts for these three activities are in turn associated with being at risk for most
activities.
Figure 10 presents the posterior means of βh for h = 1, 2 under GFZIP. The circles in the figure
indicate the parameters for which the 95% credible intervals do not include zero. Overall, the
signs of the coefficients are the same for most activities. For example, the left panel shows that age
has positive effects on the at-risk probabilities for sell_marijuana, use_hard_drugs and use_marijuana,
but has a negative effect on fight. male has positive effects on most activities other than run_away,
use_marijuana and use_hard_drugs. On the other hand, self has negative effects on the at-risk
probabilities for most activities other than alcohol, run_away and sell_stolen. This is expected
because the higher the self-esteem score, the less likely youths are to engage in illegal activities.
In the right panel, urban and male positively affect the Poisson counts of most activities. An
urban environment would offer more opportunities for various types of illegal activities. Combined
with the results on the at-risk coefficient for male, this suggests that male youths are more likely
to be involved in illegal activities and that their involvement is more frequent. self has a positive
impact on the frequencies of activities such as attack, extort, steal_ge_50 and con. Most of these
activities typically involve aggressive behaviour towards other individuals or audacity, so higher
self-esteem would plausibly increase their frequency. On the other hand, self has negative impacts on
the frequencies of sell_stolen, break_in, sell_hard_drugs, use_hard_drugs and run_away. It is
intuitive that the frequency of these activities, especially drug-related activities and running away, is
associated with lower self-esteem.
Finally, we estimate the proportions of at-risk youths among those who answered ‘never’ for each
activity based on (5). These are the estimated fractions of youths who are involved in the activities
but whose reported frequency of involvement happened to be zero during the year before the interview.
Figure 11 presents R̂j for the 19 activities under GFZIP and GZIP. Under the proposed GFZIP, R̂j for
alcohol, fight, threaten and use_marijuana are above 0.1. Among those activities, use_marijuana
resulted in the largest R̂j of 0.492. The result implies that nearly half of the youths who responded
‘never’ are actually regular users who did not use marijuana during the year before the interview.
R̂j = 0.232 for alcohol is the second largest, followed by 0.127 for fight and 0.114 for threaten.
These activities might be more common among youths, as indicated by the non-zero proportions in
Figure 5, compared to the other activities of a more criminal nature, such as selling drugs and
stealing vehicles. By contrast, the R̂j’s for the rest of the activities are zero or almost zero. The
figure also shows that under GZIP R̂j = 0 for all activities. The results for activities such as
use_marijuana and alcohol are suspected to be false negatives, as observed in the simulation study.
The result under the proposed model is more reasonable and indicates the efficacy of leveraging shared
information among activities through the latent factors.
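Equation (5) is not reproduced in this section. As an illustration only, assume the standard
zero-inflated Poisson structure in which an individual is at risk with probability p and, if at risk,
has a Poisson count with mean μ. Bayes' rule then gives the probability of being at risk given a zero
response as p exp(−μ) / (1 − p + p exp(−μ)). The sketch below averages this quantity over the zero
responders of one activity for a single set of parameter values. The function names and inputs are
hypothetical, and the definition of R̂j in (5) may differ, for example by averaging over posterior draws.

```python
import math


def at_risk_given_zero(p, mu):
    """P(at risk | count = 0) under a standard zero-inflated Poisson model."""
    zero_if_at_risk = p * math.exp(-mu)  # at risk, but the Poisson count is zero
    return zero_if_at_risk / ((1.0 - p) + zero_if_at_risk)


def at_risk_fraction(probs, means):
    """Average at-risk probability over individuals who reported zero.

    probs[i], means[i]: at-risk probability and Poisson mean of zero responder i.
    """
    values = [at_risk_given_zero(p, mu) for p, mu in zip(probs, means)]
    return sum(values) / len(values)


# Toy values for two zero responders (not taken from the paper).
toy_fraction = at_risk_fraction([0.5, 0.0], [0.0, 1.0])
```

The formula makes the reported pattern intuitive: the fraction is large when the at-risk probability
is high and the Poisson mean is small enough that zeros are common among at-risk individuals.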
[Trace plot panels. Recoverable panel titles: λ1: use_marijuana, λ1: sell_hard_drugs, λ2: extort, λ2: con. Horizontal axis: iteration.]
Figure 7: Trace plots of the Gibbs sampler for the selected parameters under GFZIP
Table 5: Posterior means and 95% credible intervals (CI) for the factor loadings under GFZIP
[Two heatmap panels: at-risk (h=1) and Poisson (h=2).
Activity order in the at-risk panel: use_hard_drugs, sell_hard_drugs, fight, sell_stolen, threaten,
run_away, damage, gambling, steal_lt_50, break_in, use_marijuana, con, shoplift, alcohol, attack,
vehicle, steal_ge_50, sell_marijuana, extort.
Activity order in the Poisson panel: use_hard_drugs, use_marijuana, sell_marijuana, run_away, fight,
extort, vehicle, threaten, con, damage, steal_ge_50, steal_lt_50, shoplift, attack, break_in,
sell_hard_drugs, alcohol, gambling, sell_stolen.]
Figure 8: Posterior means of λh λ′h under GFZIP. The activities are ordered based on the hierarchical
clustering.
[Heatmap panel. One axis: at-risk (h=1); other axis: Poisson (h=2); the 19 activities are listed along each axis.]
Figure 9: Posterior means of λ1 λ′2 under GFZIP. The activities are ordered based on the hierarchical
clustering.
[Two heatmap panels: GFZIP at-risk (h=1) and GFZIP Poisson (h=2). Rows: the 19 activities; columns: const., age, male, black, hisp, self, urban.]
Figure 10: Posterior means of βh under GFZIP. The circles indicate the parameters for which the 95%
credible intervals do not include zero.
[Plot of R̂j (vertical axis, 0.0 to 0.5) for each of the 19 activities (horizontal axis) under GFZIP and GZIP.]
Figure 11: Proportions of at-risk youths among those who answered ‘never’
5 Conclusion
We have proposed the factor zero-inflated Poisson model for multiple grouped count data, which
includes latent factors to account for association among the multiple count responses. Based on
data augmentation, Pólya-Gamma augmentation and parameter expansion, we have developed
an efficient MCMC algorithm. The identification of the factor components is achieved through a
post-processing algorithm. We have demonstrated the efficacy of the proposed model through
numerical examples. Notably, in the analysis of illegal activities of youths, we have found a single
common factor, which can be interpreted as the drug-related factor, producing a strong association
among the drug-related activities in both the at-risk and Poisson parts. The proposed model also
revealed the individuals at risk among those who reported zero in each activity, whereas treating each
activity separately failed to do so entirely.
Acknowledgement
This work was supported by JSPS KAKENHI (#21K01421, #21H00699, #20H00080, #22K13376,
#24K00244).
References
Berry, L. R. and M. West (2020). Bayesian forecasting of many count-valued time series. Journal of
Business & Economic Statistics 38 (4), 872–887.
Fu, Q., X. Guo, and K. C. Land (2018). A Poisson-multinomial mixture approach to grouped and
right-censored counts. Communications in Statistics - Theory and Methods 47 (2), 427–447.
Gelfand, A. E. and S. K. Ghosh (1998). Model choice: A minimum posterior predictive loss approach.
Biometrika 85 (1), 1–11.
Ghosh, J. and D. B. Dunson (2009). Default prior distributions and efficient posterior computation
in Bayesian factor analysis. Journal of Computational and Graphical Statistics 18 (2), 306–320.
Hamura, Y., K. Irie, and S. Sugasawa (2021). Robust hierarchical modeling of counts under
zero-inflation and outliers. arXiv:2106.10503v1.
Karlis, D. (2003). An EM algorithm for multivariate Poisson distribution and related models. Journal
of Applied Statistics 30 (1), 63–77.
Kobayashi, G., S. Sugasawa, and Y. Kawakubo (2023). Spatio-temporal smoothing, interpolation and
prediction of income distributions based on grouped data.
Kobayashi, G., Y. Yamauchi, K. Kakamu, Y. Kawakubo, and S. Sugasawa (2022). Bayesian approach
to Lorenz curve using time series grouped data. Journal of Business & Economic Statistics 40 (2),
897–912.
Larsson, R. (2020). Discrete factor analysis using a dependent Poisson model. Computational
Statistics 35 (3), 1133–1152.
Liu, Y. and G.-L. Tian (2015). Type I multivariate zero-inflated Poisson distribution with applications.
Computational Statistics & Data Analysis 83, 200–222.
McGinley, J. S., P. J. Curran, and D. Hedeker (2015). A novel modeling framework for ordinal data
defined by collapsed counts. Statistics in Medicine 34 (15), 2312–2324.
Neelon, B. and D. Chung (2017). The LZIP: A Bayesian latent factor model for correlated zero-inflated
counts. Biometrics 73 (1), 185–196.
Neelon, B., A. J. O’Malley, and V. A. Smith (2016). Modeling zero-modified count and semicontinuous
data in health services research part 1: background and overview. Statistics in Medicine 35 (27),
5070–5093.
Papastamoulis, P. and I. Ntzoufras (2022). On the identifiability of Bayesian factor analytic models.
Statistics and Computing 32 (2), 23.
Polson, N. G., J. G. Scott, and J. Windle (2013). Bayesian inference for logistic models using
Pólya–Gamma latent variables. Journal of the American Statistical Association 108 (504), 1339–1349.
Sugasawa, S., G. Kobayashi, and Y. Kawakubo (2020). Estimation and inference for area-wise spatial
income distributions from grouped data. Computational Statistics & Data Analysis 145, 106904.
Wedel, M., U. Böckenholt, and W. A. Kamakura (2003). Factor models for multivariate count data.
Journal of Multivariate Analysis 87 (2), 356–369.
Xu, T., R. T. Demmer, and G. Li (2021). Zero-inflated Poisson factor model with application to
microbiome read counts. Biometrics 77 (1), 91–101.