Zero Inflation as a Missing Data Problem: a Proxy-based Approach
Abstract
A common type of zero-inflated data has certain true values incorrectly replaced by zeros due to data recording conventions (rare outcomes assumed to be absent) or details of data recording equipment (e.g. artificial zeros in gene expression data).
Existing methods for zero-inflated data either fit the observed data likelihood via parametric mixture models that explicitly represent excess zeros, or aim to replace excess zeros by imputed values. If the goal of the analysis relies on knowing true data realizations, a particular challenge with zero-inflated data is identifiability, since it is difficult to correctly determine which observed zeros are real and which are inflated.
This paper views zero-inflated data as a general type of missing data problem, where the observability indicator for a potentially censored variable is itself unobserved whenever a zero is recorded. We show that, without additional assumptions, target parameters involving a zero-inflated variable are not identified. However, if a proxy of the missingness indicator is observed, a modification of the effect restoration approach of Kuroki and Pearl allows identification and estimation, given the proxy-indicator relationship is known.
If this relationship is unknown, our approach yields a partial identification strategy for sensitivity analysis. Specifically, we show that only certain proxy-indicator relationships are compatible with the observed data distribution. We give an analytic bound for this relationship in cases with a categorical outcome, which is sharp in certain models. For more complex cases, sharp numerical bounds may be computed using methods in Duarte et al. [2023].
We illustrate our method via simulation studies and a data application on central line-associated bloodstream infections (CLABSIs).
1 Introduction
Zero-inflated (ZI) data is prevalent in many empirical sciences such as public health, epidemiology, computational biology, and medical research. An important type of zero inflation occurs when some observed zeros of an outcome of interest do not represent true zero values.
As an example, consider patient surveillance for complications in outpatient settings, where any complication developed outside the hospital is of interest. One such complication is a central line-associated bloodstream infection (CLABSI) which can occur in patients undergoing therapies involving central venous catheters (CVCs). Such complications are fairly rare, but are associated with significant morbidity and mortality, and their prevalence is often assessed retrospectively. Because of this, absence of sufficient information on whether such a complication is present in a particular patient is often coded as a “presumed negative” rather than a “missing value” [Keller et al., 2020]. Since this type of value differs from a true negative value, indicating actual absence of a complication in a patient, the result is zero-inflated data. Another prominent example is single-cell RNA sequence data, whose zeros may signify either genuine values (representing, e.g. lack of gene expression) or artificial zeros resulting from technical artifacts of experimental protocols or recording equipment [Wagner et al., 2016, Jiang et al., 2022]. In all these cases, naive analysis of ZI data that does not distinguish true from artificial zeros can lead to markedly biased conclusions.
Existing approaches for zero inflation focus on observed data likelihood modelling using either hurdle models or zero-inflation models [Neelon et al., 2016, Greene, 2005]. Hurdle models are mixture models of a distribution truncated at zero and another distribution modeling the occurrence of values [Mullahy, 1986]. In genomics applications, Yu et al. [2023], Dai et al. [2023] use graphical models to represent the zero-inflated likelihood for the purposes of causal discovery. On the other hand, zero-inflation models [Lambert, 1992, Young et al., 2022] assume two sources of zeros, either structural (or inflated) zeros or true zeros due to sampling. More recent work has extended this type of approach to include semi-parametric models [Arab et al., 2012, Lam et al., 2006]. Kleinke and Reinecke [2013] apply an augmentation of the chained equations imputation approach to correct the bias introduced by inflated zeros. Lukusa et al. [2017] review methods in settings where inflated zeros co-occur with missing data; however, these settings do not include the cases considered here, where the excess zeros represent a censored realization.
The disadvantage of the first type of approach is that it does not aim to reconstruct underlying values, which are often of interest. The disadvantage of the second type of approach is that correctly distinguishing true from inflated zero values relies on assumptions that are unlikely to hold in practice, e.g., strict parametric assumptions. Moreover, these assumptions may not be congenial and may fail to yield a coherent full data distribution, which guarantees model misspecification. This is a more general issue than zero inflation, and occurs in standard missing data problems as well. In contrast, our approach to modeling inflated zeros has two important features. First, we aim to distinguish true from inflated zeros, and thus identify underlying realizations in the data. Second, we avoid imposing strong parametric assumptions to do so.
Specifically, we propose to model zero inflation using a generalization of missing data models. In standard missing data, the relationship between an observed variable and its corresponding underlying variable is determined by an observability indicator. If the indicator is 1, the observed and the underlying variables coincide, while if the indicator is 0, the observed variable is recorded as a missing value. In zero-inflated problems, we view improperly recorded zero values as missing values denoted by a zero. Hence, in this view, we cannot tell a zero indicating an actual value from a zero indicating missingness, and observing a zero means the observability indicator is itself unobserved.
This complication implies that even if we assume a missing data model where the full data distribution would have been identified absent zero inflation, such as the Missing-Completely-At-Random (MCAR) model, we would generally not obtain identification in the presence of zero inflation. Thus, the variant of the missing data problem we consider is significantly more complicated than standard missing data.
We approach this problem using recent theory of graphical models applied to missing data, which gives general identification results in the absence of zero inflation [Mohan et al., 2013, Bhattacharya et al., 2019, Nabi et al., 2020]. We first note that zero inflation problems viewed in this framework could be arranged in a hierarchy similarly to missing data problems [Rubin, 1976]: Zero-Inflated Missing-Completely-At-Random (ZI MCAR), Zero-Inflated Missing-At-Random (ZI MAR), and Zero-Inflated Missing-Not-At-Random (ZI MNAR).
We then show that if zero inflation is present, target parameters involving zero-inflated variables are not identified without additional assumptions, even in the relatively simple ZI MCAR model. We further show that if an informative proxy for a missingness indicator exists and the true proxy-indicator relationship is known, identification of the target parameters becomes possible whenever the missing data model (sans zero inflation) is identified, via a modification of the effect restoration approach in Kuroki and Pearl [2014].
If this relationship is not known, we show that only certain proxy-indicator relationships are compatible with the overall model, which provides a natural sensitivity analysis strategy. In particular, in the case of a categorical outcome, we provide an analytic bound for the proxy-indicator relationship in the presence of zero inflation in a number of missing data models, and show that in some models our bound is sharp. In more general cases, we show that the numeric approach for obtaining bounds detailed in Duarte et al. [2023] may be used instead.
Finally, we demonstrate an application of our method on simulated data, as well as a real world dataset on CLABSIs.
2 Graphical Models of Missing Data and Zero Inflated Data
In this section, we briefly review relevant existing work on missing data and describe the difficulties posed by zero inflation.
2.1 Missing data and identification
Let be a set of random variables (r.v.s) of interest. Denote as the state space of , which we assume is categorical, and without loss of generality, includes the value . Samples of are systematically missing, with true values being replaced by a special symbol “?”. To better represent missing data problems, it is convenient to use two additional sets of r.v.s: the proxies , where each proxy has the state space , and the binary observability indicators . Each proxy is deterministically defined in terms of the underlying variable and the observability indicator via the missing data version of the consistency rule: when and when . Thus, a variable may be described as " had it (hypothetically) been observed", i.e., a counterfactual. The superscript notation is deliberately chosen to make the connection to counterfactuals in causal inference explicit. In addition to , let represent other fully observed variables.
We define as , as and as , with analogous subsets of , defined similarly. Following the nomenclature in Nabi et al. [2020], Bhattacharya et al. [2019], we call the full law, the observed law, and the target law. A missing data model is a set of distributions over the variables that satisfy the above consistency rule.
Following Mohan et al. [2013], we consider missing data models defined using a class of directed acyclic graphs (DAGs) called missing data DAGs (m-DAGs). Specifically, an m-DAG consists of nodes . Like all DAGs, m-DAGs only have directed edges and lack directed cycles, but also have a number of additional restrictions: each proxy has exactly 2 incoming edges (due to consistency); there is no edge from any or to any . A joint in the missing data model corresponding to the m-DAG factorizes as
where all terms are deterministic. Using m-DAGs, one can represent many interesting missing data scenarios, see Fig. 1 for examples.
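For orientation, the factorization can be sketched as below. The symbols are assumptions of this sketch, following the usual m-DAG notation of Mohan et al. [2013]: $X^{(1)}$ for the underlying variables, $R$ for the observability indicators, $X$ for the proxies, and $C$ for fully observed variables.

```latex
p\bigl(c, x^{(1)}, r, x\bigr)
  \;=\; \prod_{V \in C \cup X^{(1)} \cup R} p\bigl(v \mid \mathrm{pa}_{\mathcal{G}}(v)\bigr)
  \;\times\; \prod_{X_k \in X} p\bigl(x_k \mid x_k^{(1)}, r_k\bigr)
```

Under this reading, each factor in the second product is an indicator function, because consistency fixes $x_k$ once $x_k^{(1)}$ and $r_k$ are given.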
An important goal in missing data problems, prior to statistical inference, is to ensure the target parameter, which is generally some function of the target law, is identified from the observed law. It follows by definition that the target law is identified if and only if the propensity score evaluated at is identified, while the full law is identified if and only if the propensity score at all values of is identified. While identification of the target law is still an open problem, Nabi et al. [2020] showed a sound and complete method for identification of the full law from the observed law in missing data models represented by m-DAGs and hidden variable m-DAGs.
2.2 Zero Inflation Non-identifiability
A zero inflated (ZI) model associated with an m-DAG is a variant of the missing data model associated with that m-DAG, with the following important difference: the missing data consistency relating variables is replaced by a zero inflation version, where if , and if .
Footnote 1: We consider ZI models with categorical state spaces only, unless stated otherwise.
There are several important consequences of zero inflated consistency. Firstly, both and take values in , and no variable in a ZI problem takes the value “?”. Secondly, as in missing data, the ZI-variable is counterfactual, and according to the ZI consistency rule, its true realizations are observed only when . In particular, if , we deduce and . However, since it is not possible to tell whether a realization corresponds to the situation where is the true value of , or corresponds to a censored realization of , is unobserved whenever . Moreover, while we still refer to and as the target law and the full law, respectively, we will refer to as the zero-inflated law (ZI law), rather than the observed law, since is not always observed. Thirdly, the ZI consistency imposes the following important restriction on the ZI law:
- (Z) For every and , .
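The difference between the two consistency rules can be illustrated with a small sketch. The function names and the toy records are hypothetical, and the sketch assumes the convention that an indicator value of 1 means the true value is recorded.

```python
MISSING = "?"

def missing_data_proxy(x_true, r):
    """Missing data consistency: reveal the true value if r == 1, else '?'."""
    return x_true if r == 1 else MISSING

def zero_inflated_proxy(x_true, r):
    """ZI consistency: reveal the true value if r == 1, else record a zero."""
    return x_true if r == 1 else 0

# Toy (true value, indicator) pairs; illustrative only.
records = [(2, 1), (0, 1), (3, 0), (0, 0)]
missing_view = [missing_data_proxy(x, r) for x, r in records]
zi_view = [zero_inflated_proxy(x, r) for x, r in records]

# Restriction (Z): a nonzero recorded value can never co-occur with r == 0.
violates_z = [(x_obs, r) for (_, r), x_obs in zip(records, zi_view)
              if x_obs != 0 and r == 0]
```

In the missing data view the last two records are visibly censored, whereas in the ZI view they are indistinguishable from the genuine zero in the second record.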
We classify a ZI model as ZI MCAR, ZI MAR, or ZI MNAR if its missing data version is MCAR, MAR, or MNAR, respectively. Examples of ZI models are shown in Fig. 2 and Fig. 3.
Just as in missing data problems, the goal in ZI problems is to identify (a function of) the target law or the full law from the observed law and possibly additional objects. We focus on the full law identification in this paper. Unsurprisingly, ZI problems are significantly harder than missing data problems, in the sense that both the target law and the full law are non-parametrically non-identified even in the simplest setting (ZI MCAR), as shown by the following result.
Lemma 1 (Non-identifiability).
Given a ZI model associated with any m-DAG , both the target law and the full law are non-parametrically non-identified.
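A minimal numerical illustration of the lemma, not the proof from the Appendix: two different full laws in a toy ZI MCAR model with a binary underlying variable produce the same observed law. The function name and the numbers are hypothetical.

```python
from fractions import Fraction as F

def zi_observed_law(p_x_true, p_r1):
    """Observed law p(X) under ZI MCAR.

    X equals the true value when R == 1 and 0 otherwise, with R
    independent of the true value.
    """
    law = {x: p * p_r1 for x, p in p_x_true.items()}
    law[0] = law.get(0, 0) + (1 - p_r1)
    return law

# Full law A: the true value is 1 half of the time, and nothing is censored.
law_a = zi_observed_law({0: F(1, 2), 1: F(1, 2)}, p_r1=F(1))
# Full law B: the true value is 1 three quarters of the time,
# but one third of the records are censored to zero.
law_b = zi_observed_law({0: F(1, 4), 1: F(3, 4)}, p_r1=F(2, 3))
```

Both calls return an observed law that puts mass 1/2 on each value, so the observed data alone cannot separate the two scenarios.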
3 Proxy-based Identification
We first demonstrate our approach to proxy-based identification with the simplest ZI missing data model, ZI MCAR, and then generalize it to arbitrary ZI m-DAGs.
3.1 Identification in the ZI MCAR Model
Lemma 1 implies that any identification method must rely on additional assumptions beyond those implied by the m-DAG. To illustrate additional assumptions that will be employed, consider a simple ZI MCAR model with a single ZI variable taking values in , and the corresponding inflation indicator , where .
To simplify subsequent presentation, we will use the following notational shorthand: to mean , and to mean the stochastic matrix whose elements are . Similarly for and . We also use a matrix multiplication shorthand, where is taken to mean .
We will assume the existence of an observed binary proxy variable informative for with the following properties:
- (A1) ,
- (A2) The matrix is invertible.
Note that since and are binary, A2 is equivalent to . Due to the existence of the proxy variable , we call this ZI MCAR model "proxy-augmented", whose graph is shown in Fig. 2 (a).
Kuroki and Pearl [2014] considered assumptions A1 and A2 in the context of obtaining identification of causal effects in the presence of unobserved confounding. In that work, the proxy variable was related to an unobserved categorical variable which was a common cause of the treatment and outcome variables.
In this paper, we adopt the method of Kuroki and Pearl [2014] to express the ZI law in terms of the observed law and the conditional distribution . In addition to A1 and A2, the Kuroki-Pearl method requires that the observed law and are from the same full law (i.e., they are compatible), and that is known.
To see that point identification is then possible, we write in matrix form
(1) |
where the entry in is due to the restriction Z. Since is invertible, we can solve for by , leading to the following result.
Theorem 1 (ZI law restoration in ZI MCAR).
For the ZI MCAR model in Fig. 2 (a) under A1, A2, the ZI law is point identified given the observed law and a compatible proxy-indicator conditional distribution , as follows
(2) |
After the ZI law is identified, the full law is identified, , by standard assumptions of the MCAR model.
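The restoration step can be sketched numerically as follows. This is an illustration under stated assumptions rather than the paper's implementation: the proxy is taken to depend on the indicator only (our reading of A1), `q0` and `q1` stand for the proxy probabilities p(W=1 | R=0) and p(W=1 | R=1), and all function names and numbers are hypothetical.

```python
from fractions import Fraction as F

def observed_law(p_x_true, p_r1, q0, q1):
    """Forward map of a toy proxy-augmented ZI MCAR model.

    Returns p_xw[x] = (p(X=x, W=0), p(X=x, W=1)).
    """
    p_xr = {x: [F(0), p * p_r1] for x, p in p_x_true.items()}
    p_xr.setdefault(0, [F(0), F(0)])
    p_xr[0][0] = 1 - p_r1  # every censored record is written as a zero
    return {x: ((1 - q0) * a + (1 - q1) * b, q0 * a + q1 * b)
            for x, (a, b) in p_xr.items()}

def restore_zi_law(p_xw, q0, q1):
    """Invert the 2x2 matrix p(W | R) to obtain {x: (p(x, R=0), p(x, R=1))}."""
    det = q1 - q0  # determinant of [[1-q0, 1-q1], [q0, q1]]; nonzero under A2
    restored = {}
    for x, (pw0, pw1) in p_xw.items():
        p_r0 = (q1 * pw0 - (1 - q1) * pw1) / det
        p_r1 = (-q0 * pw0 + (1 - q0) * pw1) / det
        restored[x] = (p_r0, p_r1)
    return restored

p_x_true = {0: F(1, 5), 1: F(1, 2), 2: F(3, 10)}
p_xw = observed_law(p_x_true, p_r1=F(7, 10), q0=F(1, 5), q1=F(9, 10))
restored = restore_zi_law(p_xw, q0=F(1, 5), q1=F(9, 10))

# Downstream MCAR step: p(X_true = x) = p(x, R=1) / p(R=1).
p_r1_hat = sum(b for _, b in restored.values())
target = {x: b / p_r1_hat for x, (_, b) in restored.items()}
```

With the true proxy-indicator conditional supplied, the restored ZI law places zero mass on nonzero values with R = 0, and the downstream step returns the law that generated the data.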
Remark.
There are two difficulties with this result. First, since is potentially unobserved, it is not always reasonable to specify the true in applications. Second, since our working model corresponds to a hidden variable DAG, the model imposes restrictions on the pair , meaning that not every potential distribution would be consistent with the observed data law under our model. Using inconsistent in the matrix inversion equation places us outside the model, and can yield inconsistent results, such as invalid negative probabilities . An example of such an inconsistency is provided in the Appendix. Kuroki and Pearl [2014] noted the latter issue in the context of causal inference, but did not provide bounds.
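The second difficulty can be seen in a toy calculation, which is separate from the Appendix example. Here `q0` and `q1` denote candidate values of p(W=1 | R=0) and p(W=1 | R=1); the function name and the numbers are hypothetical.

```python
from fractions import Fraction as F

def solve_for_r0_mass(pw0, pw1, q0, q1):
    """p(x, R=0) implied by inverting a candidate 2x2 proxy matrix."""
    return (q1 * pw0 - (1 - q1) * pw1) / (q1 - q0)

# Toy observed entries p(X=1, W=0) and p(X=1, W=1), generated with
# p(X=1, R=1) = 7/20 and a true p(W=1 | R=1) of 9/10.
pw0, pw1 = F(7, 200), F(63, 200)

compatible = solve_for_r0_mass(pw0, pw1, q0=F(1, 5), q1=F(9, 10))
incompatible = solve_for_r0_mass(pw0, pw1, q0=F(1, 5), q1=F(4, 5))
```

The first candidate returns zero mass, as restriction Z requires for a nonzero value, while the second candidate returns a negative "probability", signalling that it lies outside the model.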
3.2 Proxy-based identification in general ZI missing data models
In this section, we generalize our previous proxy-based approach to an arbitrary graphical ZI model corresponding to an m-DAG, given that the full law is point identified in the missing data model associated to that m-DAG.
Consider any ZI model associated with an arbitrary m-DAG, with a set of fully observed covariates , a set of ZI variables , inflation indicators , observed versions for variables in , and proxies for variables in .
We make the following assumptions which generalize A1 and A2:
- (A1∗) , .
- (A2∗) The matrix is invertible.
In addition, we will provide alternatives to A1∗ and A2∗ which allow the proxies to potentially depend on :
- (A1†) , .
- (A2†) The matrix is invertible for every value .
The identification strategy we adopt proceeds in two stages:
1. ZI law restoration: point identify (if true is known) or partially identify the ZI law from the observed law .
2. Downstream identification: identify the full law from the ZI law .
Since every is unobserved whenever in ZI problems, the purpose of the first stage is to recover the ZI law involving and other observed variables. Under the proxy assumptions above and knowledge of the true , point identification of this law is possible. Otherwise, partial identification bounds are computed. If point or partial identification is possible, variables in may now be treated as observed data, and the problem is reduced to classical identification in a missing data model. In particular, we adopt the sound and complete identification procedure described by Nabi et al. [2020] to point identify the full law in the second stage.
While we focus on non-parametric point identification results for the full law, one could instead employ any point or partial identification procedure developed for missing data problems for the second stage. We leave these types of extensions to future work.
3.2.1 ZI law restoration
Under the proxy assumptions, we have the following identification result, which generalizes Theorem 1.
Theorem 2 (ZI law restoration).
Given a ZI model satisfying assumptions A1∗ and A2∗ (or A1† and A2†), the ZI law is point identified given the observed law and a compatible proxy-indicator conditional distribution (or ),
(3) | ||||
Theorem 2 suffers from the same issue as Theorem 1: it is unlikely that the true distribution (or ) will always be available, and given a candidate distribution, it is not obvious how to verify that it is compatible with the model and the observed law.
If the true (or ) is not given, we must find the set of (or ) distributions compatible with the model and the observed law. In general, bounds on (or ) may be computed numerically by encoding the model as a system of polynomial equations and finding extrema of this system using polynomial programming. A method for solving such systems of equations using a primal/dual method is described in Duarte et al. [2023]. These bounds lead to a natural sensitivity analysis strategy according to our two-stage approach. In particular, each compatible in the bounds implies a valid ZI law by Theorem 2, which in turn implies a full law by Proposition 1 of the next section. In Section 4, we conduct a grid search of the compatible set to illustrate this point.
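A sensitivity sweep of this kind can be sketched for the simplest case, a toy ZI MCAR model with a binary proxy. The sketch is not the code used in Section 4: `q1` stands for p(W=1 | R=1), the grid holds candidate values of p(W=1 | R=0) assumed to lie in the compatible set, and all names and numbers are hypothetical.

```python
def sweep_zi_mcar(p_xw, q1, q0_grid):
    """For each candidate q0, restore p(R=0) and the target law p(X_true).

    p_xw[x] = (p(X=x, W=0), p(X=x, W=1)).
    """
    results = []
    for q0 in q0_grid:
        det = q1 - q0
        pw0, pw1 = p_xw[0]
        p_r0 = (q1 * pw0 - (1 - q1) * pw1) / det      # p(X=0, R=0) = p(R=0)
        p_x0_r1 = (-q0 * pw0 + (1 - q0) * pw1) / det  # p(X=0, R=1)
        p_r1 = 1 - p_r0
        target = {0: p_x0_r1 / p_r1}
        for x, (w0, w1) in p_xw.items():
            if x != 0:
                # Restriction Z gives p(x, R=1) = p(x) for nonzero x.
                target[x] = (w0 + w1) / p_r1
        results.append((q0, p_r0, target))
    return results

# Toy observed law generated with p(R=1) = 0.7, q0 = 0.2, q1 = 0.9 and
# p(X_true) = (0.2, 0.5, 0.3).
p_xw = {0: (0.254, 0.186), 1: (0.035, 0.315), 2: (0.021, 0.189)}
results = sweep_zi_mcar(p_xw, q1=0.9, q0_grid=[0.0, 0.1, 0.2, 0.3, 0.4])
```

Each grid point yields a different but valid full law; the entry for the candidate 0.2 recovers the law used to generate the toy data.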
While numeric bound computation is a general approach, finding such bounds is computationally challenging due to the need to solve polynomial programs. Fortunately, we show that in certain ZI models, it is possible to derive analytic bounds on (or ), instead. We also show that these bounds are sharp in some cases.
3.2.2 Downstream identification
After the ZI law is recovered in the restoration step, one may consider this law as the "observed law" in the missing data problem corresponding to the same m-DAG, and invoke missing data identification to obtain the full law . We note that this second identification stage is not precisely the same as that for standard missing data problems, because identification relies on consistency, and consistency under ZI differs from missing data consistency whenever .
Fortunately, consistency when coincides in ZI problems and missing data problems, and, as the following result shows, suffices for identification.
Proposition 1 (ZI full law identification).
The full law exhibiting zero inflation that is Markov relative to an m-DAG is identified given the ZI law if and only if does not contain edges of the form (no self-censoring) and structures of the form (no colluders), and the positivity assumption holds. Moreover, the identifying functional for the full law coincides with the functional given in Malinsky et al. [2021].
3.3 Partial Identification in ZI MCAR
In this subsection, we relax the requirement that the true must be given in the ZI law restoration step, and provide bounds for this conditional distribution in the proxy-augmented ZI MCAR model.
Consider the proxy-augmented ZI MCAR model in Fig. 2 (a). This model is equivalently described by the following model , satisfying Z, A1, and A2:
(4) |
Given an observed law , we are interested in the following subset of distributions yielding the observed law,
(5) |
In particular, our goal is finding all , which is the partial identification of w.r.t. the given observed law. This is equivalent to projecting onto the probability simplex of . From (5) and (4), one way to check whether an invertible is to compute and check . First, must be a stochastic matrix for any problem under A1 and A2. Second, must also satisfy ZI-consistency constraint Z. If these conditions are true, there is a joint distribution in the model that generates both and , and they are said to be compatible. After the compatible set of is derived, the partial identification of could be obtained using (2).
We note that Z implies, for all , , so . Then by considering , we obtain point identification and the marginal constraints . Note that these constraints may be used to design a falsification test of the model.
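The step from restriction Z to point identification can be spelled out as follows. This is a sketch in assumed notation: $r_0$ and $r_1$ denote $R=0$ and $R=1$, $W$ is the binary proxy, and A1 is taken to imply that $W$ is independent of $X$ given $R$.

```latex
\begin{align*}
p(x, w) &= \sum_{r} p(w \mid r)\, p(x, r)
        && \text{(by A1)} \\
        &= p(w \mid r_1)\, p(x, r_1)
        && \text{for } x \neq 0, \text{ since } p(x, r_0) = 0 \text{ by (Z)} \\
p(x)    &= p(x, r_1)
        && \text{for } x \neq 0 \\
\Rightarrow\quad p(w \mid r_1) &= \frac{p(x, w)}{p(x)} = p(w \mid x)
        && \text{for every } x \neq 0 \text{ with } p(x) > 0
\end{align*}
```

Since the left-hand side does not depend on $x$, the observed law must satisfy $p(w \mid x) = p(w \mid x')$ for all nonzero $x, x'$, which is the kind of equality a falsification test can check.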
However, is not identified, and its bounds must be obtained by solving the following polynomial program:
(6) | ||||||
s.t. | ||||||
Since both and are unknowns, the above system of equations corresponds to a quadratic program, which is difficult to solve in general.
However, it is possible to transform this optimization into an equivalent linear program with the following observations:
1. A specific solution to is not required. One merely needs to check if is a stochastic matrix.
2. If , where all matrices are non-negative, sum to 1 and sum to 1, then sum to 1. The proof of this fact is in the Appendix.
3. The inverse is
(7)
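For a binary proxy and a binary indicator, the inverse has a simple closed form. The following is a sketch in assumed notation, writing $q_0 = p(w_1 \mid r_0)$ and $q_1 = p(w_1 \mid r_1)$ and arranging the conditional distribution as a column-stochastic matrix; the paper's own layout of the matrix may differ.

```latex
\begin{pmatrix} 1 - q_0 & 1 - q_1 \\ q_0 & q_1 \end{pmatrix}^{-1}
  \;=\; \frac{1}{q_1 - q_0}
  \begin{pmatrix} q_1 & -(1 - q_1) \\ -q_0 & 1 - q_0 \end{pmatrix}
```

The determinant is $q_1 - q_0$, so the sign of every entry of the inverse flips depending on whether $q_1 > q_0$ or $q_1 < q_0$. Fixing the sign makes the non-negativity conditions linear in the unknown, which is why the problem splits into two cases.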
Observations 1 and 2 imply that checking compatibility involves only checking non-negativity of , reducing the unknowns in our optimization problem to only . Checking is still non-linear in , but (7) suggests an equivalent procedure consisting of two separate problems where or , respectively. Concretely, for each case and , we consider 2 linear programs
(8) | ||||||
s.t. | ||||||
These problems can be solved analytically (or numerically, using fast linear program solvers), yielding the following partial identification result for . A detailed proof is in the Appendix.
Theorem 3 (ZI MCAR compatibility bound).
Consider a ZI MCAR model in Fig. 2 (a) under proxy assumptions A1, A2, with categorical and binary . Given a consistent observed law satisfying the positivity assumption, , the set of compatible proxy-indicator conditionals is given by
These bounds are sharp. Moreover, if , must satisfy , and zero inflation does not occur, i.e., .
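One way to compute a compatibility interval of this kind in the binary-proxy case is sketched below. It is derived from the observation that p(W=1 | X=0) is a mixture of p(W=1 | R=0) and p(W=1 | R=1), and it should be read as an illustration rather than a transcription of the theorem's closed form. The function name, the tolerance, and the toy numbers are hypothetical.

```python
def zi_mcar_bounds(p_xw, tol=1e-12):
    """Return (q1, interval) for q0 = p(W=1 | R=0) in a toy ZI MCAR model.

    p_xw[x] = (p(X=x, W=0), p(X=x, W=1)). The interval is None when
    p(W=1 | X=0) equals q1, in which case no zero inflation is implied.
    """
    nonzero = [x for x in p_xw if x != 0]
    x = nonzero[0]
    # Restriction Z: p(W=1 | R=1) equals p(W=1 | X=x) for any nonzero x.
    q1 = p_xw[x][1] / sum(p_xw[x])
    # p(W=1 | X=0) is a convex combination of q0 and q1.
    t = p_xw[0][1] / sum(p_xw[0])
    if abs(t - q1) <= tol:
        return q1, None
    return q1, ((t, 1.0) if t > q1 else (0.0, t))

# Toy observed law generated with true q0 = 0.2 and q1 = 0.9.
p_xw = {0: (0.254, 0.186), 1: (0.035, 0.315), 2: (0.021, 0.189)}
q1_hat, interval = zi_mcar_bounds(p_xw)
```

In this toy example the interval has the form [0, ub] and contains the generating value 0.2; the mirrored case yields an interval of the form [lb, 1].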
3.4 Partial Identification in ZI MAR
We compute analytical bounds for two versions of the proxy-augmented ZI MAR model, illustrated in Fig. 2 (b) and (c). The first model has and satisfies A1† and A2†, while in the second model, and the proxy assumptions are A1∗ and A2∗.
In the first proxy-augmented ZI MAR model, the set of compatible is given by the Cartesian product of the independently determined ZI MCAR bounds for each value . This leads to the following direct analogue of Theorem 3. The proof is deferred to the Appendix.
Theorem 4 (ZI MAR compatibility bound 1).
Consider a ZI MAR model in Fig. 2 (b) under proxy assumptions A1† and A2†, with categorical and binary . Given a consistent observed law satisfying the positivity assumption, , the set of compatible proxy-indicator conditional distributions is given by, for each value ,
These bounds are sharp. Moreover, if , must satisfy , and zero inflation does not occur for stratum , i.e., .
On the other hand, the compatibility bound for the second ZI MAR model is the intersection of the ZI MCAR bounds for each value . The proof is deferred to the Appendix.
Theorem 5 (ZI MAR compatibility bound 2).
Consider a ZI MAR model in Fig. 2 (c) under proxy assumptions A1∗ and A2∗, with categorical and binary . Given a consistent observed law satisfying the positivity assumption, , the set of compatible proxy-indicator conditional distributions is given by
These bounds are sharp. Moreover, if , must satisfy , and zero inflation does not occur, i.e., .
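The intersection described before the theorem can be sketched as follows, assuming a per-stratum interval for the unknown proxy probability has already been computed for each covariate value. The function name and the interval values are hypothetical.

```python
def intersect_intervals(intervals):
    """Intersect closed intervals (lo, hi); return None if the result is empty."""
    lo = max(l for l, _ in intervals)
    hi = min(h for _, h in intervals)
    return (lo, hi) if lo <= hi else None

# Hypothetical per-stratum bounds, all of the form [0, ub].
same_side = intersect_intervals([(0.0, 0.42), (0.0, 0.35), (0.0, 0.50)])
# Hypothetical strata whose bounds fall on opposite sides.
opposite_sides = intersect_intervals([(0.0, 0.40), (0.60, 1.0)])
```

An empty intersection means that no single proxy-indicator conditional is compatible with all strata, so the observed law would be inconsistent with the model.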
Note that the first two cases are mutually exclusive due to the following lemma.
Lemma 2.
Note that, as before, the marginal constraints described may be used to design a model falsification test.
3.5 Partial Identification In ZI MNAR
Consider the ZI version of any MNAR model represented by an m-DAG where the target law is identified. In missing data, an important subclass of such models are submodels of the no-self-censoring model in Malinsky et al. [2021] due to the results in Nabi et al. [2020]. The ZI versions of such models exhibit a crucial complication not found in previously discussed ZI models, namely that multiple variables may be zero inflated. For these models, we posit a set of proxies corresponding to , and assume that assumptions A1∗ and A2∗ in Section 3.2.1 are satisfied. Fig. 3 (a) and (b) show two bivariate examples of such models. We use the shorthand .
Given observed law , we seek the compatible set of , whose elements allow restoration of via Theorem 2. Although sharp bounds for are unknown, the ZI MAR partial identification procedure could be applied to each independently to obtain bounds for . Moreover, due to the usual properties of ZI, is point identified for each .
For each , we apply Theorem 5 with being , respectively, and being the covariates . These bounds are not sharp as structural constraints of the MNAR model are not considered. However, these bounds are valid in the sense that the Cartesian product of these bounds contains the true model compatible set of distributions .
In addition, we note that (11) below holds in the observed law under our model, and may be used as a falsification test for our ZI model.
Lemma 3.
Consider any ZI model in Section 3.2.1 under A1∗ and A2∗. Denote . The observed law must satisfy, for each ,
(11) | ||||
3.6 Identification Given A Known Zero Inflation Probability
For the ZI MCAR model in Theorem 3 and the ZI MAR model in Theorem 5, we provided the identification and the bounds for , which lead to partial identification of the full law .
If is known a priori, the full law is point identified. Alternatively, point identification of the full law may be obtained if the zero inflation probability, or , is known.
This is because the joint distribution for binary has dimension , and one (variationally dependent) parameterization for this joint is via the following parameters , , . This is easy to see by noting that we can compute , and , , and are the Möbius parameters for [Evans and Richardson, 2014].
In particular, we have the following: , which in turn implies point identification of the full law.
Note that not every zero inflation probability is compatible with the model. This is easily seen by noting that the Möbius parameterization is variationally dependent, and two parameters, namely and , are known. However, our derived bounds for naturally imply bounds for , with sharp bounds for the former implying sharp bounds for the latter.
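The link between a known zero inflation probability and the proxy-indicator relationship can be sketched in the toy ZI MCAR setting with a binary proxy. The formula in the code follows from writing p(X=0, W=1) as a mixture over the indicator; it is an illustration, not the expression from the text. The function name and the numbers are hypothetical.

```python
def q0_from_inflation_prob(p_xw, p_r0):
    """Solve for q0 = p(W=1 | R=0) given a known p(R=0).

    p_xw[x] = (p(X=x, W=0), p(X=x, W=1)). Raises ValueError when the
    supplied p(R=0) is not compatible with the observed law.
    """
    x = next(v for v in p_xw if v != 0)
    q1 = p_xw[x][1] / sum(p_xw[x])   # p(W=1 | R=1), identified via Z
    pw0, pw1 = p_xw[0]
    p_x0_r1 = pw0 + pw1 - p_r0       # p(X=0, R=1)
    if p_r0 <= 0 or p_x0_r1 < 0:
        raise ValueError("p(R=0) must lie in (0, p(X=0)]")
    # p(X=0, W=1) = q0 * p(R=0) + q1 * p(X=0, R=1)
    q0 = (pw1 - q1 * p_x0_r1) / p_r0
    if not 0.0 <= q0 <= 1.0:
        raise ValueError("p(R=0) is incompatible with the observed law")
    return q0

# Toy observed law generated with p(R=0) = 0.3, q0 = 0.2 and q1 = 0.9.
p_xw = {0: (0.254, 0.186), 1: (0.035, 0.315), 2: (0.021, 0.189)}
q0_hat = q0_from_inflation_prob(p_xw, p_r0=0.3)

try:
    q0_from_inflation_prob(p_xw, p_r0=0.05)
    rejected = False
except ValueError:
    rejected = True
```

The second call illustrates the compatibility point made above: a zero inflation probability of 0.05 would require a negative proxy probability, so it is rejected.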
4 Experiments
We confirmed the validity of our analytical results for inflated zero models by sampling data generating processes (DGPs) and by using numerical methods. In addition, we used our methods to perform sensitivity analyses on CLABSI data. Details of these experiments are in the Appendix. The code can be found at https://github.com/trungpq-ci/zero-inflation-bounds.
4.1 Bound Validity In Random DGPs
We verified the results of Theorem 3, Theorem 5 and related observed law constraints by randomly generating DGPs in the models we described. We generated DGPs in the model in Fig. 2 (a), satisfying ZI-consistency, A1, A2, and DGPs in the model in Fig. 2 (c), satisfying ZI-consistency, A1∗, A2∗. For both cases, we verified identification of and the bounds for as predicted by the corresponding theorem. For the bounds, two tests were conducted:
1. Bound validity: is the true inside the bounds?
2.
Additionally, for ZI MAR, we checked marginal constraints in (10). We found that all considered results held up to floating point precision in every single DGP.
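A bound validity check of this kind can be sketched for a toy ZI MCAR model with a binary proxy. This is not the code from the repository above: the sampling ranges, the function names, and the interval formula (based on p(W=1 | X=0) being a mixture of the two proxy probabilities) are assumptions of the sketch.

```python
import random

def random_observed_law(rng, n_levels=3):
    """Draw a random toy ZI MCAR DGP; return (p_xw, true q0)."""
    weights = [rng.uniform(0.1, 1.0) for _ in range(n_levels)]
    p_x_true = [w / sum(weights) for w in weights]
    p_r1 = rng.uniform(0.1, 0.9)
    q1 = rng.uniform(0.05, 0.95)
    q0 = rng.uniform(0.05, 0.95)
    while abs(q0 - q1) < 0.05:  # keep the proxy informative (q0 != q1)
        q0 = rng.uniform(0.05, 0.95)
    p_xw = {}
    for x, p in enumerate(p_x_true):
        a = (1 - p_r1) if x == 0 else 0.0  # p(X=x, R=0)
        b = p * p_r1                        # p(X=x, R=1)
        p_xw[x] = ((1 - q0) * a + (1 - q1) * b, q0 * a + q1 * b)
    return p_xw, q0

def interval_for_q0(p_xw):
    """Interval for q0 implied by the observed law."""
    q1 = p_xw[1][1] / sum(p_xw[1])  # p(W=1 | X=1) = p(W=1 | R=1)
    t = p_xw[0][1] / sum(p_xw[0])   # p(W=1 | X=0)
    return (t, 1.0) if t > q1 else (0.0, t)

def count_violations(n_dgps=1000, seed=0, tol=1e-9):
    """Count DGPs whose true q0 falls outside the computed interval."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(n_dgps):
        p_xw, q0 = random_observed_law(rng)
        lo, hi = interval_for_q0(p_xw)
        if not (lo - tol <= q0 <= hi + tol):
            violations += 1
    return violations

n_violations = count_violations()
```

A violation count of zero is what the bound validity test expects; the tolerance guards against floating point error at the interval endpoints.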
4.2 Bounds By Numerical Methods
We compared our analytical bounds with numerical bounds computed using autobounds package in Duarte et al. [2023] for a subset of DGPs used for verification of bound validity in Section 4.1.
In particular, 20 DGPs were randomly selected for each model (ZI MCAR and ZI MAR), and their observed laws were computed. For each DGP, 2 polynomial programs were constructed, whose objective functions are maximizing or minimizing , respectively, and whose constraints are (i) structural constraints from the corresponding graph, (ii) probability constraints, (iii) the ZI-consistency constraint, and (iv) constraints resulting from the structure imposed on the observed law by the structure of the full law. The solutions to these programs are the numerical lower and upper bounds of . We refer the reader to Duarte et al. [2023] for details of the program’s construction and the methods used by the polynomial program solver.
For all DGPs, the numerical bounds coincided with our analytical bounds up to the th decimal place. Since the algorithm in the autobounds package is an anytime algorithm, our analytic bounds were always contained inside the numerical bounds. Table 1 shows a selection of these results.
MCAR | lb | ub | num lb | num ub | true value
0 | 0.5564 | 1 | 0.5564 | 1 | 0.8207
1 | 0.3578 | 1 | 0.3578 | 1 | 0.4936
2 | 0 | 0.5206 | 0 | 0.5206 | 0.4536
3 | 0.6064 | 1 | 0.6064 | 1 | 0.6826

MAR | lb | ub | num lb | num ub | true value
0 | 0 | 0.4290 | 0 | 0.4290 | 0.4132
1 | 0.8346 | 1 | 0.8346 | 1 | 0.8486
2 | 0 | 0.3404 | 0 | 0.3404 | 0.3192
3 | 0.3002 | 1 | 0.3002 | 1 | 0.5155
4.3 Data Application
Patients receiving therapies involving central venous catheters (CVCs) through home infusion agencies may develop CLABSI. Though relatively rare, CLABSIs are potentially dangerous. Knowing true CLABSI rates is essential in deploying and testing the impact of CLABSI prevention activities. Recorded CLABSI rates undercount true positive cases. This is because adjudicators performing CLABSI surveillance often lack access to the full information required to determine whether a CLABSI has occurred [Hannum et al., 2022, 2023]. If the available information does not meet the CLABSI definition criteria, then, because CLABSIs are relatively rare, the adjudicator typically records the CLABSI status as a presumed negative.
We apply our zero inflation correction method to data on patients undergoing CVC therapies and thus potentially susceptible to a CLABSI. Our data contains 652 unique patient records obtained from five different home infusion agencies across 14 states and the District of Columbia; see Keller et al. [2023] for additional details. These records correspond to cases investigated for patients who presented to a hospital due to a complication and on whom blood cultures were drawn and were positive. Many patients with CVCs who presented to the hospital due to a complication on whom blood cultures were drawn and were positive do have CLABSIs. In fact, the observed CLABSI rate in our data was more than , much higher than the prevalence in the population undergoing CVC therapies. However, due to zero inflation, even the elevated observed CLABSI rate undercounts the true CLABSI rate in this cohort.
Variables in our data included covariates , which indicated home infusion therapy type and CVC type, coded as binary variables. A description of these covariates is found in Appendix C. The outcome of interest is the true CLABSI probability (had zero inflation not occurred), which we denote by . This outcome is not directly observed. Instead, our data contains the observed CLABSI status , recorded as and . Given this variable, we define the inflation indicator which corresponds to the adjudicator having enough information to make a CLABSI determination for a particular case. The information could come from private meetings with patients and specialists, or from reading patients' test results and other data in health record systems. Recording conventions dictate that this indicator has a known value whenever the observed CLABSI is , and is unobserved otherwise (since we cannot distinguish true negatives from inflated zeros). We considered two candidates for the proxy : (i) adjudicator access to the shared electronic health record system EPIC, (ii) adjudicator access to either EPIC or the statewide health information exchange CRISP. Since encodes the state of knowing all required information from all sources, we have .
Our working model is the proxy-augmented ZI MAR under assumptions A1∗ and A2∗, shown in Fig. 2 (b). Using the analytic bounds for the ZI MAR model derived in Section 3.4, we perform a sensitivity analysis to understand how the true CLABSI rate changes as the proxy-indicator relationship varies, within its compatibility range. First, we use the EM algorithm [Dempster et al., 1977] to maximize the observed data likelihood defined via the full data distribution consistent with our assumptions. Next, we invoke Theorem 5 to obtain the plug-in estimate for and the bounds for . Finally, we do a grid search over the bounds interval, compute the full data distribution for each value of via (2) and obtain using standard g-formula adjustment in MAR models. The sensitivity analysis curve is shown in Fig. 4. (Note: this plot differs somewhat from the plot in the published version of the paper, due to a corrected data processing error. The conclusions on the underlying CLABSI rate were not substantially affected.)
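The grid-search step can be sketched as follows. This is a minimal illustration, not the analysis code used for the CLABSI data: it assumes binary variables, a proxy W that depends only on the indicator R, and two covariate strata with hypothetical probabilities. The EM step is omitted, and the closed-form expression for p(R = 1 | c) is derived under these assumptions from the mixture identity p(W = 1 | c) = q (1 - p(R = 1 | c)) + p1 p(R = 1 | c), where q = p(W = 1 | R = 0) and p1 = p(W = 1 | R = 1) are our own notation.

```python
import numpy as np

# Hypothetical inputs for two covariate strata (NOT the CLABSI data).
p_c = np.array([0.6, 0.4])          # p(C = c)
p_x1_obs = np.array([0.20, 0.10])   # observed p(X = 1 | c)
p_w1 = np.array([0.26, 0.20])       # p(W = 1 | c)
p1 = 0.30                           # p(W = 1 | R = 1), treated as point-identified


def true_rate(q):
    """Adjusted rate sum_c p(c) p(X = 1 | R = 1, c) for a candidate q = p(W = 1 | R = 0).

    Returns None when q is incompatible with the observed law.
    """
    rho = (p_w1 - q) / (p1 - q)     # implied p(R = 1 | c)
    # rho must be a probability, and p(X = 1 | c) = rho * rate_c forces rho >= p(X = 1 | c).
    if np.any(rho > 1 + 1e-12) or np.any(rho < p_x1_obs):
        return None
    return float(np.sum(p_c * p_x1_obs / rho))   # g-formula over strata


# Sweep q below p1 and keep only the compatible values.
grid = np.linspace(0.0, 0.29, 291)
curve = []
for q in grid:
    rate = true_rate(q)
    if rate is not None:
        curve.append((float(q), rate))
rates = [r for _, r in curve]
```

Plotting `rates` against the retained values of q gives a sensitivity curve of the kind described above. In this toy setting the rate increases with q, so the endpoints of the compatible interval give the range of the adjusted rate.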
The values of consistent with the model show that inability to make a CLABSI determination is strongly associated with access to patient data via electronic health records. For proxy EPIC, our obtained (sharp) bound for the nuisance parameter is , yielding the estimated range of the true CLABSI rate to be . Compared with the baseline rate of under no-zero-inflation assumption, the rate’s bound implies that anywhere from to of true CLABSI cases are undercounted, even in our patient cohort with a highly elevated CLABSI prevalence.
We repeated the analysis using the proxy-augmented ZI MAR model under assumptions A1† and A2†, shown in Fig. 2 (c). In this case, bounds for were obtained, for each value . The narrowest bound corresponds to adult patients receiving outpatient parenteral antimicrobial therapy (OPAT) via a peripherally inserted central catheter (PICC). On the other hand, the widest bound corresponds to pediatric patients receiving chemotherapy via tunneled CVC, a type of catheter under the skin. We performed a search of points over the polytope defined by these bounds and found the estimated range of the true CLABSI rate to be . That is, anywhere from 12% to 28% of true CLABSI cases are undercounted.
All derived bounds for the true CLABSI rate were deemed to be medically plausible by our medical collaborators.
5 Conclusion
In this paper, we considered inference on data with inflated zeros as a missing data problem where censored realizations are indicated by a rather than by a special token such as . This leads to a situation where the censoring indicator for a variable is unobserved any time the value is observed for such a variable. We have shown that this significantly complicates the problem, and results in lack of identification even in simple missing data models such as MCAR.
To address this, we proposed a generalization of the approach in Kuroki and Pearl [2014] which assumes the existence of an informative proxy for the censoring indicator. We show that only some relationships between this proxy and the indicator are compatible with the model, derive analytic bounds for this relationship in a number of cases, and show that in some cases our bound is sharp. Our bounds directly imply bounds on the zero inflated mean parameter. We verified our results by deriving bounds numerically using the autobounds package described in Duarte et al. [2023]. Finally, we applied our methods to CLABSI data, which exhibits significant zero inflation. Our methods led to informative bounds on the true CLABSI rate, and provided a natural sensitivity analysis strategy.
Zero inflation is common in many types of data, particularly in electronic health records. Our approach provides a principled strategy for deriving informative conclusions from such data without reliance on unrealistic modeling assumptions.
Acknowledgements.
This research is funded in part by ONR N00014-21-1-2820, NSF 2040804, NSF CAREER 1942239, NIH R01 AI127271-01A1, AHRQ R01 HS027819.
References
- Arab et al. [2012] Ali Arab, Scott H. Holan, Christopher K. Wikle, and Mark L. Wildhaber. Semiparametric bivariate zero-inflated Poisson models with application to studies of abundance for multiple species. Environmetrics, 23(2):183–196, March 2012. ISSN 1180-4009, 1099-095X. 10.1002/env.1142.
- Bhattacharya et al. [2019] Rohit Bhattacharya, Razieh Nabi, Ilya Shpitser, and James Robins. Identification in missing data models represented by directed acyclic graphs. In Proceedings of the Thirty Fifth Conference on Uncertainty in Artificial Intelligence (UAI-35th). AUAI Press, 2019.
- Dai et al. [2023] Haoyue Dai, Ignavier Ng, Gongxu Luo, Peter Spirtes, Petar Stojanov, and Kun Zhang. Gene Regulatory Network Inference in the Presence of Dropouts: A Causal View. In The Twelfth International Conference on Learning Representations (ICLR 2024), October 2023.
- Dempster et al. [1977] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
- Duarte et al. [2023] Guilherme Duarte, Noam Finkelstein, Dean Knox, Jonathan Mummolo, and Ilya Shpitser. An automated approach to causal inference in discrete settings. Journal of the American Statistical Association, 2023.
- Evans and Richardson [2014] Robin J. Evans and Thomas S. Richardson. Markovian acyclic directed mixed graphs for discrete data. Annals of Statistics, pages 1–30, 2014.
- Greene [2005] William H. Greene. Censored Data and Truncated Distributions, 2005.
- Hannum et al. [2022] Susan M. Hannum, Opeyemi Oladapo-Shittu, Alejandra B. Salinas, Kimberly Weems, Jill Marsteller, Ayse P Gurses, Sara E. Cosgrove, and Sara C. Keller. A task analysis of central line-associated bloodstream infection (CLABSI) surveillance in home infusion therapy. American Journal of Infection Control, 50(5):555–562, May 2022. ISSN 0196-6553. 10.1016/j.ajic.2022.01.008.
- Hannum et al. [2023] Susan M. Hannum, Opeyemi Oladapo-Shittu, Alejandra B. Salinas, Kimberly Weems, Jill Marsteller, Ayse P. Gurses, Ilya Shpitser, Eili Klein, Sara E. Cosgrove, and Sara C. Keller. Controlling the chaos: Information management in home-infusion central-line–associated bloodstream infection (CLABSI) surveillance. Antimicrobial Stewardship & Healthcare Epidemiology, 3(1):e69, January 2023. ISSN 2732-494X. 10.1017/ash.2023.134.
- Jiang et al. [2022] Ruochen Jiang, Tianyi Sun, Dongyuan Song, and Jingyi Jessica Li. Statistics or biology: The zero-inflation controversy about scRNA-seq data. Genome Biology, 23(1):31, January 2022. ISSN 1474-760X. 10.1186/s13059-022-02601-5.
- Keller et al. [2020] Sara Keller, Alejandra Salinas, Deborah Williams, Mary McGoldrick, Lisa Gorski, Mary Alexander, Anne Norris, Jennifer Charron, Roger Scott Stienecker, Catherine Passaretti, Lisa Maragakis, and Sara E. Cosgrove. Reaching consensus on a home infusion central line-associated bloodstream infection surveillance definition via a modified Delphi approach. American Journal of Infection Control, 48(9):993–1000, September 2020. ISSN 0196-6553. 10.1016/j.ajic.2019.12.015.
- Keller et al. [2023] Sara C. Keller, Susan M. Hannum, Kimberly Weems, Opeyemi Oladapo-Shittu, Alejandra B. Salinas, Jill A. Marsteller, Ayse P. Gurses, Eili Y. Klein, Ilya Shpitser, Christopher J. Crnich, Nitin Bhanot, Clare Rock, Sara E. Cosgrove, and the Home Infusion CLABSI Prevention Collaborative. Implementing and validating a home-infusion central-line–associated bloodstream infection surveillance definition. Infection Control & Hospital Epidemiology, 44(11):1748–1759, November 2023. ISSN 0899-823X, 1559-6834. 10.1017/ice.2023.70.
- Kleinke and Reinecke [2013] Kristian Kleinke and Jost Reinecke. Multiple imputation of incomplete zero-inflated count data. Statistica Neerlandica, 67(3):311–336, 2013. ISSN 1467-9574. 10.1111/stan.12009.
- Kuroki and Pearl [2014] Manabu Kuroki and Judea Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101:423–437, 2014.
- Lam et al. [2006] K. F. Lam, Hongqi Xue, and Yin Bun Cheung. Semiparametric Analysis of Zero-Inflated Count Data. Biometrics, 62(4):996–1003, December 2006. ISSN 0006-341X. 10.1111/j.1541-0420.2006.00575.x.
- Lambert [1992] Diane Lambert. Zero-Inflated Poisson Regression, With an Application to Defects in Manufacturing. Technometrics, February 1992. ISSN 1048-5228. 10.2307/1269547.
- Lukusa et al. [2017] T. Martin Lukusa, Shen-Ming Lee, and Chin-Shang Li. Review of Zero-Inflated Models with Missing Data. Current Research in Biostatistics, 7(1):1–12, October 2017. ISSN 2524-2229. 10.3844/amjbsp.2017.1.12.
- Malinsky et al. [2021] Daniel Malinsky, Ilya Shpitser, and Eric J Tchetgen Tchetgen. Semiparametric inference for nonmonotone missing-not-at-random data: the no self-censoring model. Journal of the American Statistical Association, pages 1–9, 2021.
- Mohan et al. [2013] Karthika Mohan, Judea Pearl, and Jin Tian. Graphical models for inference with missing data. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1277–1285. Curran Associates, Inc., 2013.
- Mullahy [1986] John Mullahy. Specification and Testing of some Modified Count Data Models. Journal of Econometrics, 33(3):341–365, December 1986. ISSN 0304-4076. 10.1016/0304-4076(86)90002-3.
- Nabi et al. [2020] Razieh Nabi, Rohit Bhattacharya, and Ilya Shpitser. Full law identification in graphical models of missing data: Completeness results. In Proceedings of the 37th International Conference on Machine Learning, pages 7153–7163. PMLR, November 2020.
- Neelon et al. [2016] Brian Neelon, A. James O’Malley, and Valerie A. Smith. Modeling zero-modified count and semicontinuous data in health services research Part 1: Background and overview. Statistics in Medicine, 35(27):5070–5093, 2016. ISSN 1097-0258. 10.1002/sim.7050.
- Rubin [1976] D. B. Rubin. Inference and missing data (with discussion). Biometrika, 63:581–592, 1976.
- Wagner et al. [2016] Allon Wagner, Aviv Regev, and Nir Yosef. Revealing the vectors of cellular identity with single-cell genomics. Nature Biotechnology, 34(11):1145–1160, November 2016. ISSN 1546-1696. 10.1038/nbt.3711.
- Young et al. [2022] Derek S. Young, Eric S. Roemmele, and Peng Yeh. Zero-inflated modeling part I: Traditional zero-inflated count regression models, their applications, and computational tools. WIREs Computational Statistics, 14(1):e1541, 2022. ISSN 1939-0068. 10.1002/wics.1541.
- Yu et al. [2023] Shiqing Yu, Mathias Drton, and Ali Shojaie. Directed Graphical Models and Causal Discovery for Zero-Inflated Data. In Proceedings of the Second Conference on Causal Learning and Reasoning, pages 27–67. PMLR, August 2023.
Supplementary Material
Appendix A Proofs
A.1 Downstream identification
Proposition 1. The full law exhibiting zero inflation that is Markov relative to an m-DAG is identified given if and only if does not contain edges of the form (no self-censoring) and structures of the form (no colluders), and the positivity assumption holds. Moreover, the identifying functional for the full data law coincides with the functional given in Malinsky et al. [2021].
Proof.
Following the proof in Nabi et al. [2020], the full law factorizes as
(12) | ||||
where , and similarly for .
- The no-colluder condition implies , so . Hence these factors use only the case of consistency.
- The 2-way odds ratio is not a function of . Therefore, only the case of consistency is used.
- The 3-way interaction term is not a function of . Therefore, only the case of consistency is used. The same holds for any k-way interaction term.
Hence the proof in Nabi et al. [2020] applies to ZI problems, whose consistency differs from missing data consistency only in the case. ∎
A.2 Non-identifiability proof
Lemma 1. Given a ZI model associated with any m-DAG , both the target law and the full law are non-parametrically non-identified.
Proof.
Let be an m-DAG over and its associated ZI-model. The m-DAG obtained from by deleting all edges while keeping defines a sub-model in which are jointly independent. If and are non-parametrically non-identified in this sub-model, they are also non-identified in .
It suffices to prove non-identification for binary variables. The target is , and the observed marginals are
(13) | ||||
using d-separation in . Since the second equation is just minus the first, if the quantity
(14) |
is shown to be identical for 2 joint distributions in , the proof is finished. Indeed, for any , we pick any real number and construct as follows
(15) |
Evidently, the target laws are different , yet the observed marginals are the same . Moreover, the full laws are also different
(16) | ||||
Hence, and are non-parametrically non-identified in . ∎
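A minimal numeric sketch of this construction, assuming binary variables and the independent sub-model used in the proof. The parameter values are illustrative choices of ours, and the variable names (`p_r1`, `p_x1_true`) are hypothetical. Two different full laws are built that induce exactly the same law for the recorded variable.

```python
from itertools import product


def full_law(p_r1, p_x1_true):
    """Full law over (R, X1, X) when R and X1 are independent and
    ZI-consistency holds: X = X1 if R = 1, and X = 0 if R = 0."""
    law = {}
    for r, x1 in product((0, 1), repeat=2):
        p = (p_r1 if r else 1 - p_r1) * (p_x1_true if x1 else 1 - p_x1_true)
        x = x1 if r == 1 else 0
        law[(r, x1, x)] = law.get((r, x1, x), 0.0) + p
    return law


def observed_law(law):
    """Marginal law of the recorded X, summing out R and the true value X1."""
    obs = {0: 0.0, 1: 0.0}
    for (_, _, x), p in law.items():
        obs[x] += p
    return obs


# Two distinct full laws: both give p(X = 1) = p(R = 1) * p(X1 = 1) = 0.2.
law_a = full_law(p_r1=0.5, p_x1_true=0.4)
law_b = full_law(p_r1=0.8, p_x1_true=0.25)
obs_a, obs_b = observed_law(law_a), observed_law(law_b)
```

Here the target parameter p(X1 = 1) is 0.4 under the first law and 0.25 under the second, yet `obs_a` and `obs_b` agree, which is the sense in which the target and full laws are not identified.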
A.3 Examples of Compatibility Issue
Consider the proxy-augmented ZI MCAR model, in which a joint distribution factorizes as
(17) |
Here, the proxy assumptions insist that . Therefore, any that obeys this inequality is said to be model compatible. Moreover, any joint distribution with violating this inequality is outside of the model. Work that investigates marginal models of hidden variable models often considers this type of compatibility.
In our paper, we mentioned another type of compatibility. Any joint distribution in the model yields a pair of observed law and proxy-indicator conditional distribution . Obviously, both and produced this way are model compatible. Furthermore, they are compatible with one another, in the sense that there exists a model compatible joint distribution producing them. It is possible to construct an incompatible pair whose components are both model compatible, because the joint distribution yielding them is not in the model. This is illustrated in the following simple examples.
Example 1:
(18) |
Since ZI MCAR does not impose any restriction on in the binary case (see our proof for the bound in the ZI MCAR case), we can pick any number for . In particular, let them all be non-zero. Then both and are model compatible. However, there is no valid (non-negative, summing to ) such that the Kuroki-Pearl equation holds . Attempting to invert in this equation will yield negative-valued .
Example 2:
We choose a joint distribution (DGP) Markov to the proxy-augmented ZI MCAR graph in Figure 2(a), from which we obtain the true , true , true .
We calculate via the matrix inversion equation using the true and the true . The calculated is valid, and equal to the true up to floating point precision. This indicates the true and the true are compatible with one another.
We sample data points from this DGP and estimate by counting, which is the MLE for binary data. Again, this estimate is in the model, since the marginal model for is saturated in the binary case. Then, we calculate via the matrix inversion equation, using the estimated and the true . This estimated has a negative value, which renders it invalid.
The code for this experiment can be found in the supplement of the paper. Its output is printed below.
True p(W,X):
[[0.42643891 0.31215362]
 [0.14620603 0.11520144]]
True p(W|R):
[[0.74919143 0.73043156]
 [0.25080857 0.26956844]]
True p(R,X):
[[0.43502295 0. ]
 [0.13762199 0.42735506]]
Computed p(R,X) via matrix inv using true p(W,X) and true p(W|R):
[[0.43502294 -1.81411279e-16]
 [0.13762199 0.42735505]]
Estimated p(W,X):
[[0.42883 0.30976]
 [0.14496 0.11645]]
Computed p(R,X) via matrix inv using estimated p(W,X) and true p(W|R):
[[ 0.5178968 -0.08300896]
 [ 0.05589317 0.50921896]]
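The check in Example 2 can be reproduced approximately from the printed numbers alone. The sketch below is not the supplement code: it copies the rounded matrices from the output above, assumes the convention p(W,X) = p(W|R) p(R,X) with rows indexing the first variable, and uses helper names (`recover_p_rx`, `is_valid_joint`) that are ours.

```python
import numpy as np

# Values copied from the printed output above (rounded to 8 digits).
p_w_given_r = np.array([[0.74919143, 0.73043156],
                        [0.25080857, 0.26956844]])   # p(W | R); columns sum to 1
p_rx_true = np.array([[0.43502295, 0.0],
                      [0.13762199, 0.42735506]])     # p(R, X)
p_wx_est = np.array([[0.42883, 0.30976],
                     [0.14496, 0.11645]])            # estimated p(W, X)


def recover_p_rx(p_wx, p_w_given_r):
    """Solve p(W, X) = p(W | R) p(R, X) for p(R, X)."""
    return np.linalg.solve(p_w_given_r, p_wx)


def is_valid_joint(p, tol=1e-9):
    """A joint distribution must be non-negative and sum to one."""
    return bool(np.all(p >= -tol) and abs(p.sum() - 1.0) < 1e-6)


# With the true p(W, X), inversion returns the true p(R, X).
p_wx_true = p_w_given_r @ p_rx_true
rx_from_true = recover_p_rx(p_wx_true, p_w_given_r)

# With the estimated p(W, X), inversion yields a negative entry.
rx_from_est = recover_p_rx(p_wx_est, p_w_given_r)
```

Because the two columns of p(W|R) are close to each other, the matrix is poorly conditioned, so even small sampling error in the estimated p(W,X) is amplified enough to push an entry of the recovered p(R,X) below zero.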
A.4 ZI MCAR model and bounds
In this section, and are categorical, while and are binary.
A.4.1 Model definition
Both the ZI MCAR model and ZI MAR model are Cartesian products, between model and model, or model, respectively. Firstly, the adjustment formula establishes a 1-to-1 relation between the model and the model. The constraint of the latter is fully understood.
Lemma 4.
For the 1-variable ZI MCAR and ZI MAR models, the full law model for is 1-to-1 to the model for satisfying Z: .
Proof.
We only need to prove the lemma for ZI MAR model.
- includes all full laws factorizing as
(19) where denotes the deterministic ZI-consistency.
- includes all laws factorizing as
(20) and obeying Z: .
These 2 models are 1-to-1:
- (): This is just summation . The ZI-consistency implies Z.
- (): By d-separation .
∎
In principle, asking whether and are compatible means pointing out a full law which yields both of them. The above lemma allows us to reformulate this compatibility question by pointing out a joint in the model. This has the advantage of simplifying the original compatibility question, whose polynomial program is of higher degree. Moreover, we do not sacrifice bound sharpness by invoking this lemma, since the joint satisfying Z is 1-to-1 to the full law.
Lemma 5.
Proof.
Due to ,
(22) |
Therefore, model for is a Cartesian product between the model for and the model for . The former is shown to be 1-to-1 to the model for with restriction Z, by lemma 4.
(23) |
While the latter is
(24) |
We just need to rewrite . Since are binary
(25) | ||||
∎
A.4.2 Bounds for ZI MCAR
Before proving the bound theorem, we have the following useful lemma:
Lemma 6.
For the ZI MCAR model in Theorem 3, Z constraint is equivalent to . This means: (i) there is a marginal constraint , and (ii) is point-identified.
Proof.
() direction: Suppose . Then for all
(26) |
Then, for all
(27) | ||||
() direction: Suppose . Then, for all
(28) | ||||
Since , we must have . This is true for all .
∎
Due to this lemma, an observed law is consistent to the model if and only if . We also require positivity , so that is well-defined.
Theorem 3. Consider a ZI MCAR model in Fig. 2 (a) (reproduced in Fig. 5) under proxy assumptions A1, A2, with categorical and binary . Given a consistent observed law satisfying positivity assumption, , the set of compatible proxy-indicator conditionals is given by
These bounds are sharp. Moreover, if , must satisfy , and zero inflation does not occur, i.e., .
Proof.
In the following, denotes an element in a model, while is derived from the given marginal .
In principle, any compatible to must be derived from some full joint distribution , such that . Since the model for is 1-to-1 to the model for with restriction Z, we can simplify this process by considering the marginal model for
(29) |
The subset of yielding the observed law is
(30) |
Polynomial program
Since is invertible, is the set of all pairs with and ,
(31) |
is called the compatibility set of w.r.t. . As mentioned in the main paper, one can directly solve for via the following polynomial program, where are slack variables.
(32) | ||||||
s.t. | ||||||
As we will show below, the constraint is equivalent to . This is a quadratic program due to the first constraint.
Linear program
We will simplify . We do so by considering its superset and adding constraints to it. Firstly, could be parameterized by only 2 numbers, because its superset is
(33) |
Secondly, when , the 2-2 matrix has inverse
(34) |
Therefore, we can transform the quadratic constraint into the following equivalent linear constraints
(35) | ||||
Next, given where all terms are non-negative, , , and is invertible, then . Hence, is a redundant constraint. Proof: entry -th . Then .
Finally, lemma 6 says , and in . Note that this lemma also requires to satisfy the marginal constraint . Therefore, the constraint .
Putting together, we can write as
(36) |
or,
(37) | ||||
The set is called the compatible set of w.r.t. . As will be shown, this is an interval in , hence the name compatibility bound.
To find , we will find each and take their union. Each could be numerically computed by solving the 2 linear programs
(38) | ||||||
s.t. | ||||||
These problems are linear programs, as is the only unknown and all constraints are linear. The set is the interval whose endpoints are the 2 numbers returned by these programs.
Solutions to linear programs
Solving : We expand the matrix multiplication equation
(39) | ||||||||||
In the derivations above, we use the positivity assumption . At the very least, we assume there are zeros, i.e., , otherwise the problem does not make sense. If positivity is violated, e.g., , one can show that , and hence this value does not place any restriction on , and can be ignored in the following discussion.
The first equation shows that the case has no solution if . When the LHS is non-negative, the feasible region is . We can further split into 2 cases, and note that , per .
1. If , which is true for all values in . Then . For this to make sense, we must have .
2. If , which is true for all . Then .
The bounds are sharp because they are the feasible regions .
Solving : Similarly
(40) | ||||||||||
The first equation shows that the case has no solution if . When the LHS is non-positive, the feasible region is . We can further split into 2 cases, and note that , per .
1. If , which is true for all values in . Then . For this to make sense, we must have .
2. If , which is true for all . Then .
The bounds are sharp because they are the feasible regions .
Result
Combine these results to get the compatibility bound . The bounds are sharp.
The situations are not allowed by the model. Moreover, if then , i.e., zero inflation does not occur. Proof:
(41) | ||||
Therefore, subtracting both sides,
(42) |
Due to proxy assumption A2: . Then the LHS equals if and only if . Moreover, Z implies . Then .
∎
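As a numerical illustration of the compatibility set, the sketch below replaces the linear programs with a plain grid search, for the binary case only. It reuses the true p(W,X) printed in Appendix A.3, writes q = p(W=1 | R=0) and p1 = p(W=1 | R=1) (our notation, not necessarily the paper's), and takes p1 = p(W=1 | X=1), which holds because X = 1 implies R = 1 under ZI-consistency. A candidate q is kept when the implied p(R,X) has no negative entries.

```python
import numpy as np

# True p(W, X) from the printed output in A.3; rows index W, columns index X.
p_wx = np.array([[0.42643891, 0.31215362],
                 [0.14620603, 0.11520144]])

# p(W=1 | R=1) is point-identified as p(W=1 | X=1), since X=1 implies R=1.
p1 = p_wx[1, 1] / p_wx[:, 1].sum()


def implied_p_rx(q, p_wx, p1):
    """p(R, X) implied by the candidate q = p(W=1 | R=0)."""
    p_w_given_r = np.array([[1 - q, 1 - p1],
                            [q, p1]])
    return np.linalg.solve(p_w_given_r, p_wx)


def compatible(q, p_wx, p1, tol=1e-9):
    """True when the implied p(R, X) is a valid (non-negative) joint."""
    if abs(q - p1) < 1e-6:   # singular matrix: the proxy would carry no information
        return False
    return bool(np.all(implied_p_rx(q, p_wx, p1) >= -tol))


grid = np.linspace(0.0, 1.0, 1001)
ok = [float(q) for q in grid if compatible(q, p_wx, p1)]
q_lo, q_hi = min(ok), max(ok)
```

On this grid the retained values form a single interval from 0 up to roughly p(W=1 | X=0), and it contains the value 0.25080857 used to generate the data, consistent with the interval form of the bound described above.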
A.5 ZI MAR proofs
Theorem 4. Consider a ZI MAR model in Fig. 2 (b) (reproduced in Fig. 6) under proxy assumptions A1† and A2†, with categorical and binary . Given a consistent observed law satisfying positivity assumption, , the set of compatible proxy-indicator conditional distributions is given by, for each value ,
These bounds are sharp. Moreover, if , must satisfy , and zero inflation does not occur for stratum , i.e., .
Proof.
Model definition.
We assume is a categorical variable, taking values in a finite set . Any joint distribution in this ZI MAR model is
(43) |
Since the Markov factors are variationally independent, the ZI MAR model is a Cartesian product
(44) | ||||
Note how constraints A1†, A2† are equivalent to imposing A1, A2 on each stratum . Notation: (i) means is a non-parametric model containing all probability distributions , and (ii) probability constraints are assumed to hold.
In this product, for all are the same ZI MCAR model described in Theorem 3, repeated times. The value is not a parameter of the model , but a constant. Its only purpose is for the sake of book-keeping when constructing the joint distribution in . For MAR, standard adjustment method point identifies as a functional of . Therefore, as shown in lemma 4 the set is 1-to-1 to the set . Hence, we are interested in the marginal model
(45) |
Finding compatible set.
Given an observed law , we want to find the compatible set w.r.t. this law
(46) |
Geometrically speaking, this set is the intersection of our model with the constraint set , which is itself a Cartesian product,
(47) | ||||
Here we abuse notation to mean the set with 1 element - the observed law , which is not the model .
Since the constraint only concerns and does not concern other in any way, we push each constraint to the corresponding . In other words, we will proceed to find the ZI MCAR compatibility bound for each level , as shown below. Mathematically, as Cartesian product could be written as intersection: If and , then
(48) | ||||
We could transform
(49) | ||||
where
(50) |
This is exactly the set described in Theorem 3. Therefore, this equation suggests the application of Theorem 3 to each stratum . First, there are marginal constraints: . Second,
(51) |
where contains stochastic matrix satisfying
Moreover, if then is an additional condition, and zero inflation does not occur for stratum , i.e., . Since the compatibility set in this case is a Cartesian product of compatibility sets described in Theorem 3, which is sharp, is sharp. ∎
Theorem 5. Consider a ZI MAR model in Fig. 2 (c) (reproduced in Fig. 7) under proxy assumptions A1∗ and A2∗, with categorical and binary . Given a consistent observed law satisfying positivity assumption, , the set of compatible proxy-indicator conditional distributions is given by
These bounds are sharp. Moreover, if , must satisfy , and zero inflation does not occur, i.e., .
Proof.
Model definition. We assume is a categorical variable, taking values in a finite set . Any joint distribution in this ZI MAR model is
(52) |
Since the Markov factors are variationally independent, the ZI MAR model is a Cartesian product
(53) | ||||
Lemma 4 says is 1-to-1 to the set . Hence, we are interested in the marginal model
(54) |
Finding compatible set.
Given observed law , we want to find the compatible set w.r.t. this law
(55) |
This is similar to the compatible set we consider when A1†, A2† hold (e.g., when ), except the same is shared between the constraints . Each constraint restricts in a different way, hence we cannot write as a Cartesian product to separate the constraints as we did before.
To proceed, note that is 1-to-1 to a set containing only , just as in ZI MCAR proof.
(56) | ||||
This set is the intersection , in which each contains only constraints associated with values .
(57) |
We have already solved before: it is the ZI MCAR compatibility set of in Theorem 3. Then all we need is to take the intersection of these results, one for each . This intersection is non-empty, because there is some that produces the given observed law. First, the identification of and marginal constraints are
(58) |
The last equality is due to the marginal constraint discussed in Theorem 3. Then we can write
(59) |
Next, we consider each case of the bound for .
- 1.
- 2.
This means these 2 cases are disjoint, i.e., we must have the following marginal constraint
(60) |
The corresponding bounds are
The max/min appears since we take the intersection of the bounds for . Moreover, if , then is an additional condition, and zero inflation does not occur, i.e., . Due to the marginal constraints above, this exhausts all the cases.
Since the compatibility set is the intersection of each , and each is sharp in its own ZI MCAR model, the above bound is sharp.
∎
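The intersection step can be sketched for the binary case as follows. This is an illustration under assumptions we state explicitly, not a transcription of the theorem's displayed bound: q = p(W=1 | R=0) is shared across strata, p1 = p(W=1 | R=1), and a_c = p(W=1 | X=0, C=c) is taken to be a mixture of q and p1 (which presumes the proxy is independent of X and C given R), so each stratum forces a_c to lie between q and p1. The function names and the numbers in the usage note are hypothetical.

```python
def stratum_interval(a_c, p1):
    """Interval for q = p(W=1 | R=0) implied by one stratum.

    a_c = p(W=1 | X=0, C=c) is a mixture of q and p1 = p(W=1 | R=1),
    so it must lie between them.
    """
    if a_c < p1:
        return (0.0, a_c)
    if a_c > p1:
        return (a_c, 1.0)
    return None   # a_c == p1: no zero inflation in this stratum


def shared_interval(a_by_stratum, p1):
    """Intersect the per-stratum intervals when q is shared across strata."""
    intervals = [stratum_interval(a, p1) for a in a_by_stratum]
    intervals = [iv for iv in intervals if iv is not None]
    if not intervals:          # no stratum restricts q
        return (0.0, 1.0)
    lo = max(iv[0] for iv in intervals)   # largest lower endpoint
    hi = min(iv[1] for iv in intervals)   # smallest upper endpoint
    if lo > hi:
        raise ValueError("observed law is incompatible with a shared q")
    return (lo, hi)
```

For instance, with p1 = 0.30 and hypothetical stratum values 0.22, 0.19 and 0.25, `shared_interval([0.22, 0.19, 0.25], 0.30)` returns `(0.0, 0.19)`. If some strata fall below p1 and others above it, the intervals do not overlap and the function raises, mirroring the requirement that the two cases be disjoint.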
A.6 ZI MNAR proofs
Lemma 3. Consider any ZI model in Section 3.2.1 under A1∗ and A2∗. Denote . The observed law must satisfy, for each ,
(63) |
Proof.
Let be the graph where is fully connected, and . The Markov model for this graph contains all joint distributions , where is from the saturated model restricted by Z. In the original ZI model, we have with satisfying Z. Hence the model for this joint distribution is contained in .
Appendix B Simulations
B.1 Bound validity in random DGPs
DGPs for ZI MCAR and ZI MAR are randomly selected according to Fig. 2 (a) and (b), respectively. In particular, a DGP for ZI MAR is a joint distribution which factorizes as
(64) |
Then, the observed law is .
We randomly select a DGP by sampling the following parameters
(65) | ||||
Furthermore, to satisfy the ZI-consistency
(66) |
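A rough sketch of how one such random DGP could be generated and turned into an observed law is given below. The sampling distributions (a flat Dirichlet for the covariate and uniform draws on [0.05, 0.95] for the remaining parameters), the restriction to binary variables, and the choice to let the proxy depend on R only are all assumptions made for illustration; they are not the settings used in the paper's experiments.

```python
import numpy as np


def sample_zi_mar_dgp(rng, n_c=2):
    """Draw one random binary ZI MAR DGP (sampling distributions are assumed)."""
    p_c = rng.dirichlet(np.ones(n_c))            # p(C)
    p_x1 = rng.uniform(0.05, 0.95, size=n_c)     # p(X1 = 1 | C = c), true outcome
    p_r1 = rng.uniform(0.05, 0.95, size=n_c)     # p(R = 1 | C = c)
    p_w1 = rng.uniform(0.05, 0.95, size=2)       # p(W = 1 | R = r), r = 0, 1
    return p_c, p_x1, p_r1, p_w1


def observed_law(p_c, p_x1, p_r1, p_w1):
    """Build the law over (C, R, W, X) under ZI-consistency, then sum out R."""
    n_c = len(p_c)
    full = np.zeros((n_c, 2, 2, 2))   # axes: C, R, W, X
    for c in range(n_c):
        for r in (0, 1):
            pr = p_r1[c] if r else 1 - p_r1[c]
            for w in (0, 1):
                pw = p_w1[r] if w else 1 - p_w1[r]
                for x1 in (0, 1):
                    px1 = p_x1[c] if x1 else 1 - p_x1[c]
                    x = x1 if r == 1 else 0   # ZI-consistency: R = 0 records a zero
                    full[c, r, w, x] += p_c[c] * pr * pw * px1
    return full, full.sum(axis=1)     # law over (C, R, W, X) and observed p(C, W, X)


rng = np.random.default_rng(0)
full, obs = observed_law(*sample_zi_mar_dgp(rng))
```

Repeating the draw with different seeds gives a collection of DGPs whose known proxy-indicator conditional can then be compared against the bounds computed from `obs`.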
B.2 Numerical bounds results
We compute numerical bounds using the method in Duarte et al. [2023] and compare them to our analytical bounds for DGPs in ZI MCAR and ZI MAR. Since the computation time for the dual bound may be very long (some DGPs may take more than 36 hours), we report only DGPs where the primal bound is available (whose computation may take only a few minutes). We refer the reader to the original paper for the distinction between dual and primal bounds.
ZI MCAR DGPs:

dgp | lb | ub | num lb | num ub | |
---|---|---|---|---|---|
0 | 0.556406 | 1.0 | 0.556411 | 1.0 | 0.820732 |
1 | 0.357830 | 1.0 | 0.357830 | 1.0 | 0.493695 |
2 | 0.0 | 0.520689 | 0.0 | 0.520689 | 0.453609 |
4 | 0.606499 | 1.0 | 0.606499 | 1.0 | 0.682699 |
5 | 0.0 | 0.524069 | 0.0 | 0.524061 | 0.496676 |
6 | 0.381825 | 1.0 | 0.381825 | 1.0 | 0.441227 |
8 | 0.652288 | 1.0 | 0.652288 | 1.0 | 0.659347 |
9 | 0.698149 | 1.0 | 0.698149 | 1.0 | 0.738794 |
10 | 0.0 | 0.443595 | 0.0 | 0.443595 | 0.442502 |
11 | 0.656867 | 1.0 | 0.656867 | 1.0 | 0.850498 |
12 | 0.211359 | 1.0 | 0.211359 | 1.0 | 0.856658 |
14 | 0.183034 | 1.0 | 0.183034 | 1.0 | 0.303129 |
15 | 0.648430 | 1.0 | 0.648430 | 1.0 | 0.833933 |
16 | 0.292337 | 1.0 | 0.292337 | 1.0 | 0.307559 |
17 | 0.500542 | 1.0 | 0.500542 | 1.0 | 0.553972 |
18 | 0.0 | 0.102988 | 0.0 | 0.102988 | 0.087253 |
20 | 0.0 | 0.479532 | 0.0 | 0.479532 | 0.238318 |
21 | 0.426615 | 1.0 | 0.426615 | 1.0 | 0.426787 |
22 | 0.399169 | 1.0 | 0.399169 | 1.0 | 0.494816 |
23 | 0.0 | 0.216052 | 0.0 | 0.216052 | 0.158163 |
24 | 0.436636 | 1.0 | 0.436636 | 1.0 | 0.533412 |
26 | 0.429579 | 1.0 | 0.429579 | 1.0 | 0.710488 |
27 | 0.0 | 0.500198 | 0.0 | 0.500199 | 0.451856 |
28 | 0.0 | 0.383471 | 0.0 | 0.383471 | 0.136093 |
29 | 0.0 | 0.325871 | 0.0 | 0.325871 | 0.070747 |
30 | 0.363744 | 1.0 | 0.363744 | 1.0 | 0.374293 |

ZI MAR DGPs:

dgp | lb | ub | num lb | num ub | |
---|---|---|---|---|---|
0 | 0.0 | 0.429089 | 0.0 | 0.429089 | 0.413267 |
1 | 0.834644 | 1.0 | 0.834644 | 1.0 | 0.848638 |
2 | 0.0 | 0.340484 | 0.0 | 0.340484 | 0.319264 |
3 | 0.300217 | 1.0 | 0.300217 | 1.0 | 0.515513 |
4 | 0.582249 | 1.0 | 0.582249 | 1.0 | 0.688620 |
5 | 0.938604 | 1.0 | 0.938604 | 1.0 | 0.991572 |
6 | 0.0 | 0.147758 | 0.0 | 0.147758 | 0.053637 |
7 | 0.534321 | 1.0 | 0.534321 | 1.0 | 0.569545 |
8 | 0.720775 | 1.0 | 0.720775 | 1.0 | 0.726467 |
9 | 0.585611 | 1.0 | 0.592962 | 1.0 | 0.686385 |
10 | 0.261442 | 1.0 | 0.261442 | 1.0 | 0.303129 |
11 | 0.378136 | 1.0 | 0.378136 | 1.0 | 0.481036 |
12 | 0.0 | 0.729282 | 0.0 | 0.729282 | 0.703234 |
13 | 0.425249 | 1.0 | 0.425249 | 1.0 | 0.426797 |
14 | 0.612665 | 1.0 | 0.612665 | 1.0 | 0.632688 |
15 | 0.319180 | 1.0 | 0.319180 | 1.0 | 0.628988 |
16 | 0.0 | 0.702582 | 0.0 | 0.702582 | 0.660187 |
17 | 0.661849 | 1.0 | 0.661849 | 1.0 | 0.726963 |
18 | 0.594456 | 1.0 | 0.594456 | 1.0 | 0.600720 |
19 | 0.531541 | 1.0 | 0.532509 | 1.0 | 0.536156 |
21 | 0.596110 | 1.0 | 0.600331 | 1.0 | 0.606306 |
22 | 0.513144 | 1.0 | 0.513144 | 1.0 | 0.692834 |
23 | 0.0 | 0.560519 | 0.0 | 0.560519 | 0.536384 |
24 | 0.837194 | 1.0 | 0.837212 | 1.0 | 0.844800 |
25 | 0.0 | 0.443658 | 0.0 | 0.443658 | 0.302563 |
26 | 0.469323 | 1.0 | 0.469323 | 1.0 | 0.479800 |
28 | 0.688084 | 1.0 | 0.688084 | 1.0 | 0.826820 |
30 | 0.0 | 0.230720 | 0.0 | 0.230720 | 0.103594 |
Appendix C Variable descriptions in the CLABSI Data Application
This section describes the covariates used in the CLABSI data application (all coded as binary variables). These covariates correspond to types of therapy, and types of catheter used.
- Pediatrics: the CVC therapy is tailored for children.
- Chemotherapy: the CVC therapy is used to administer chemotherapy.
- OPAT: outpatient parenteral antimicrobial therapy (IV antibiotics).
- TPN: total parenteral nutrition delivered via the CVC.
- Other therapy: any other type of therapy not included in the above categories, such as hydration.
- Port: a type of CVC in use.
- PICC: peripherally inserted central catheter, another type of CVC.
- Tunneled CVC: a CVC tunneled under the skin.