Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
\authormark

Yuliang et. al.

\corres

*Yuliang Shi.

\presentaddress

200 University Ave W, Waterloo, ON, Canada, N2L 3G1.

Model Selection for Causal Modeling in Missing Exposure Problems

Yuliang Shi    Yeying Zhu    Joel A. Dubin \orgdivDepartment of Statistics and Actuarial Science, \orgnameUniversity of Waterloo, \orgaddress\stateWaterloo, \countryCanada yuliang.shi@uwaterloo.ca
(June 18, 2024; <day> <Month>, <year>; <day> <Month>, <year>)
Abstract

[Abstract]In causal inference, properly selecting the propensity score (PS) model is a popular topic and has been widely investigated in observational studies. In addition, there is a large literature concerning the missing data problem. However, there are very few studies investigating the model selection issue for causal inference when the exposure is missing at random (MAR). In this paper, we discuss how to select both imputation and PS models, which can result in the smallest RMSE of the estimated causal effect. Then, we provide a new criterion, called the “rank score" for evaluating the overall performance of both models. The simulation studies show that the full imputation plus the outcome-related PS models lead to the smallest RMSE and the rank score can also pick the best models. An application study is conducted to study the causal effect of CVD on the mortality of COVID-19 patients.

keywords:
Missing Exposure; Model Selection; Propensity Score; Multiple Imputation Chained Equations;
articletype: RESEARCH ARTICLE

1 Introduction

In real application studies, missing data can occur in multiple ways 1. One of the common cases is that the exposure of interest may not be fully observed 2. For example, a patient’s exposure status may not be fully recorded when the patient suddenly drops out of the clinical study; or an individual may decline to answer a sensitive question regarding his/her health problem in a survey questionnaire. In addressing the challenge of missing data, researchers have proposed various methodologies for handling cases where the outcome is missing3. However, when the exposure status is missing, few approaches have been provided in literature to deal with this issue 4, 5.

In causal inference, propensity score (PS) analysis is a popular tool to address the confounding issue 6. However, when the exposure status is missing at random (MAR), estimation of the causal effect is challenging. The main reason is that MAR on exposure significantly differs from MAR on the outcome in this case. When the outcome is missing, only the outcome model is influenced by the missing data, and the PS model will not be affected by the missing outcome. In contrast, in cases where the exposure is missing, all three models — imputation, PS, and outcome models— are largely affected. Consequently, although PS-based methods have been discussed when the outcome is MAR 7, 8, their direct application to cases where the exposure is missing is not straightforward. Careful adjustments regarding both missing and confounding issues are necessary in such scenarios.

As a motivating example, we consider the case in which researchers aim to determine the causal relationship between CVD (as the exposure) and COVID-19 patients’ mortality (as the outcome) in an observational study where CVD status is not completely observed. In this situation, we need to make adjustments for both missing CVD status and other confounding factors. However, which variables should be selected for the imputation and PS models is not clear because we do not know the true missing values. If we only include some clinical variables related to CVD, such as diabetes and metabolic status without including the outcome and outcome-related variables in the imputation and PS models, both models may not be valid, so the imputed exposure may be deviated from the true missing values and extreme weights may occur, leading to a large bias and variance for the estimated causal effect.

One of the common approaches to dealing with missing data is via imputation. Single imputation (SI) is easy to conduct based on the observed data 8, 9. However, imputing data only once may not be reliable and a large variation may be induced 10. In contrast, the multiple imputations chained equations (MICE) approach is proposed through random sampling, and applying Rubin’s rules can account for both the sampling variability and the uncertainty in the missing values11, 12. After addressing the missing data issue, we can apply either inverse probability weighting (IPW) or double robust (DR) method based on the imputed dataset 13, 14, 15.

If we rely on parametric methods to fit the imputation and PS models, model selection is an essential step to obtain valid results. In the previous literature, researchers have shown that improper model selection for PS model can result in higher bias and variance 16. For example, if we include variables that are only related to the exposure in the PS model, the variance of estimated causal effects will become larger. On the other hand, if we only include outcome-related variables in the PS model, the estimated causal effects will become more efficient.

However, little attention has been given to addressing situations when both PS and imputation models need to be selected. That motivates us to conduct studies to investigate which variables should be included in the imputation and PS models so that the smallest bias and variance of the causal estimates can be achieved among candidate models. Furthermore, since we do not know the true causal effect in the application, evaluating those selected models through some model selection criteria is also essential. Therefore, we propose a new rank-based criterion to combine the performance of the imputation and PS models, compared with the traditional criterion.

The rest of this article is organized as follows. In Section 2, we describe the goal and the notation; we design a simulation study to investigate model selection in both imputation and PS models. Section 3 discusses the traditional and newly proposed criteria for conducting model selection. In Section 4, an application study is conducted to identify the causal effect of CVD on mortality for a cohort of COVID-19 patients. The article ends with a discussion in Section 5.

2 Notation and Methods

2.1 Framework and Assumptions

Based on the counterfactual causal framework 17, we denote (Yi1,Yi0),i=1,,n,formulae-sequencesuperscriptsubscript𝑌𝑖1superscriptsubscript𝑌𝑖0𝑖1𝑛(Y_{i}^{1},Y_{i}^{0}),i=1,\dots,n,( italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) , italic_i = 1 , … , italic_n , as the potential outcomes if individual i𝑖iitalic_i is exposed or unexposed (or equivalently, the treated or untreated), respectively. To study the different roles of covariates in model selection, we consider a simplified scenario with three covariates. As shown in Figure 1, X1subscript𝑋1X_{1}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is the main confounder, X2subscript𝑋2X_{2}italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is the exposure-related covariate, and X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT is the covariate related only to the outcome. We denote the 4×1414\times 14 × 1 vector of covariates for subject i𝑖iitalic_i as 𝑿i=(1,Xi1,Xi2,Xi3)Tsubscript𝑿𝑖superscript1subscript𝑋𝑖1subscript𝑋𝑖2subscript𝑋𝑖3𝑇\bm{X}_{i}=(1,X_{i1},X_{i2},X_{i3})^{T}bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( 1 , italic_X start_POSTSUBSCRIPT italic_i 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_i 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_i 3 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT. Let Aisubscript𝐴𝑖A_{i}italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denote the exposure of interest, Yisubscript𝑌𝑖Y_{i}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denote the binary outcome, and Ri=I{Ai is missing}subscript𝑅𝑖𝐼subscript𝐴𝑖 is missingR_{i}=I\{A_{i}\text{ is missing}\}italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_I { italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is missing } denote the indicator of whether the exposure value is missing or not. In this study, we focus on the exposure variable being missing, so all other variables are assumed to be completely observed. In addition, we study the model selection problem on a finite set of covariates instead of the high-dimensional setting.

Refer to caption
Figure 1: Causal diagram for simulation studies. The black arrows refer to the causal relationship among confounders, the exposure, and the outcome. The double-sided dash arrows refer to the associational relationship among the missing indicator, the outcome, and the covariates. The red arrow is the causal effect of primary interest. The dashed short line refers to no association between the missing indicator and the exposure given covariates and the outcome due to MAR assumption.

For each individual i𝑖iitalic_i, Yi1superscriptsubscript𝑌𝑖1Y_{i}^{1}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT and Yi0superscriptsubscript𝑌𝑖0Y_{i}^{0}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT cannot be observed at the same time; instead, we only observe Yisubscript𝑌𝑖Y_{i}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and Aisubscript𝐴𝑖A_{i}italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. The observed dataset is (𝑿i,Ai,Yi,Ri),i=1,,nformulae-sequencesubscript𝑿𝑖subscript𝐴𝑖subscript𝑌𝑖subscript𝑅𝑖𝑖1𝑛(\bm{X}_{i},A_{i},Y_{i},R_{i}),i=1,\dots,n( bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , italic_i = 1 , … , italic_n, and the relationship between the observed and potential outcomes can be written as Yi=AiYi1+(1Ai)Yi0subscript𝑌𝑖subscript𝐴𝑖superscriptsubscript𝑌𝑖11subscript𝐴𝑖superscriptsubscript𝑌𝑖0Y_{i}=A_{i}Y_{i}^{1}+(1-A_{i})Y_{i}^{0}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT + ( 1 - italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT, for i=1,2,,n𝑖12𝑛i=1,2,\dots,nitalic_i = 1 , 2 , … , italic_n. The true propensity score (PS) is defined as π(𝑿i)=P(Ai=1|𝑿i)𝜋subscript𝑿𝑖𝑃subscript𝐴𝑖conditional1subscript𝑿𝑖\pi(\bm{X}_{i})=P(A_{i}=1|\bm{X}_{i})italic_π ( bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_P ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 | bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) for individual i=1,,n𝑖1𝑛i=1,\dots,nitalic_i = 1 , … , italic_n. We denote τ1=E(Y1)subscript𝜏1𝐸superscript𝑌1\tau_{1}=E(Y^{1})italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_E ( italic_Y start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ) and τ0=E(Y0)subscript𝜏0𝐸superscript𝑌0\tau_{0}=E(Y^{0})italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_E ( italic_Y start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) as the average effect for treated and untreated, respectively. We define the causal estimate of interest τ𝜏\tauitalic_τ as the risk ratio: τ=τ1/τ0𝜏subscript𝜏1subscript𝜏0\tau=\tau_{1}/\tau_{0}italic_τ = italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT / italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. For continuous outcomes, we can also define the causal effect as the average causal effect, but the key idea for the model selection process will not be largely changed by the different formats of the causal quantities. To estimate the risk ratio, three assumptions are required as follows:

  1. 1.

    Strongly ignorable treatment assignment assumption (SITA): (Y1,Y0)A|𝑿bottomsuperscript𝑌1superscript𝑌0conditional𝐴𝑿(Y^{1},Y^{0})\bot A|\bm{X}( italic_Y start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , italic_Y start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) ⊥ italic_A | bold_italic_X.

  2. 2.

    Positivity assumption: 0<P(A=1|𝑿=𝒙)<1,for all possible 𝒙formulae-sequence0𝑃𝐴conditional1𝑿𝒙1for all possible 𝒙0<P(A=1|\bm{X}=\bm{x})<1,\text{for all possible }\bm{x}0 < italic_P ( italic_A = 1 | bold_italic_X = bold_italic_x ) < 1 , for all possible bold_italic_x.

  3. 3.

    MAR assumption: RA|(X1,X2,X3)bottom𝑅conditional𝐴subscript𝑋1subscript𝑋2subscript𝑋3R\bot A|(X_{1},X_{2},X_{3})italic_R ⊥ italic_A | ( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ).

SITA assumption is also known as the assumption of no unmeasured confounder, which cannot be tested in the application studies. The second positivity assumption requires the propensity score to be a positive probability, which can be checked after imputing the missing exposure status correctly. MAR assumption assumes that the missing indicator is conditionally independent of the exposure itself given all covariates in the dataset, which also cannot be tested on the observed data due to the unknown missing values.

To investigate the role of each variable in the imputation model, one of the approaches is to use the theories of directed acyclic graph (DAG), which is a common tool using “d-separation" 18 to check conditional independence among variables without specifying the form of the model. From Figure 1, we know that X1,X2subscript𝑋1subscript𝑋2X_{1},X_{2}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT should be included in the imputation model because they affect A𝐴Aitalic_A and R𝑅Ritalic_R directly. The main question is whether Y𝑌Yitalic_Y should be included in the imputation model or not. Even though we do not include the outcome in the true missingness model from Figure 1, Y𝑌Yitalic_Y is directly correlated with the exposure variable, so adding Y𝑌Yitalic_Y can help us predict the missing exposure values.

Next, we aim to study the role of X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT in the imputation model. Notice that that X3Abottomsubscript𝑋3𝐴X_{3}\bot Aitalic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ⊥ italic_A, but if given (X1,X2)subscript𝑋1subscript𝑋2(X_{1},X_{2})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ), Y𝑌Yitalic_Y is a function of X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT and R𝑅Ritalic_R is also correlated with X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT from Figure 1. Therefore, X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT should be conditioned on for MAR assumption to hold. In addition, there is a new finding between X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT and A𝐴Aitalic_A. If we condition on Y𝑌Yitalic_Y, Y𝑌Yitalic_Y will be the collider in the path of AYX3𝐴𝑌subscript𝑋3A\rightarrow Y\leftarrow X_{3}italic_A → italic_Y ← italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, so X3A|(Y,X1,X2)perpendicular-tosubscript𝑋3conditional𝐴𝑌subscript𝑋1subscript𝑋2X_{3}\not\!\perp A|(Y,X_{1},X_{2})italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT not ⟂ italic_A | ( italic_Y , italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ). In other words, including the outcome and outcome-related variables in the imputation model can largely improve the predictive ability of the imputation model. In summary, even though MAR assumption only requires us to control (X1,X2,X3)subscript𝑋1subscript𝑋2subscript𝑋3(X_{1},X_{2},X_{3})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ), we should include (X1,X2,X3,Y)subscript𝑋1subscript𝑋2subscript𝑋3𝑌(X_{1},X_{2},X_{3},Y)( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , italic_Y ) into the imputation model.

The objective of this study is two-fold. Firstly, since the roles of X1,X2,X3subscript𝑋1subscript𝑋2subscript𝑋3X_{1},X_{2},X_{3}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT are quite different as shown in Figure 1, our primary goal is to find the best selection on both imputation and PS models when the exposure is MAR. Secondly, we aim to find a suitable criterion to evaluate the performance of the two models in the application study when a DAG is not known.

2.2 Estimation

Before we discuss the model selection strategy, we first provide the algorithm to estimate τ𝜏\tauitalic_τ. As we have previously mentioned, estimating τ𝜏\tauitalic_τ requires us to account for both missing and confounding issues. For the missing data problem, we will impute the exposure status via MICE based on MAR assumption 10, 11. After the missing exposures are imputed, τ𝜏\tauitalic_τ can be estimated using inverse-probability weighting (IPW) or double robust (DR) method to account for the confounding issues.

In summary, we present an algorithm with three key steps as follows:

  1. 1.

    Fit imputation (Imp) model based on MAR assumption using MICE: fit logistic regression model for A𝑿+Ysimilar-to𝐴𝑿𝑌A\sim\bm{X}+Yitalic_A ∼ bold_italic_X + italic_Y to obtain Aiimpsuperscriptsubscript𝐴𝑖impA_{i}^{\text{imp}}italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT as the imputed exposure.

  2. 2.

    Fit PS model using the logistic model on imputed exposure: Aimp𝑿similar-tosuperscript𝐴imp𝑿A^{\text{imp}}\sim\bm{X}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X to obtain fitted PS values, denoted as π^i(𝑿)subscript^𝜋𝑖𝑿\hat{\pi}_{i}(\bm{X})over^ start_ARG italic_π end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_X ).

  3. 3.

    Apply inverse weighting to adjust for the confounding issue, IPW estimator can be written as 19:

    τ^1IPW=1ni=1nAiimpYiπi^(𝑿i),τ^0IPW=1ni=1n(1Aiimp)Yi1πi^(𝑿i)formulae-sequencesuperscriptsubscript^𝜏1IPW1𝑛superscriptsubscript𝑖1𝑛superscriptsubscript𝐴𝑖impsubscript𝑌𝑖^subscript𝜋𝑖subscript𝑿𝑖superscriptsubscript^𝜏0IPW1𝑛superscriptsubscript𝑖1𝑛1superscriptsubscript𝐴𝑖impsubscript𝑌𝑖1^subscript𝜋𝑖subscript𝑿𝑖\displaystyle\hat{\tau}_{1}^{\text{IPW}}=\frac{1}{n}\sum_{i=1}^{n}\frac{A_{i}^% {\text{imp}}Y_{i}}{\hat{\pi_{i}}(\bm{X}_{i})},\hat{\tau}_{0}^{\text{IPW}}=% \frac{1}{n}\sum_{i=1}^{n}\frac{(1-A_{i}^{\text{imp}})Y_{i}}{1-\hat{\pi_{i}}(% \bm{X}_{i})}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT IPW end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG over^ start_ARG italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ( bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG , over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT IPW end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG ( 1 - italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ) italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG 1 - over^ start_ARG italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ( bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG (1)

    Then, IPW estimator for τ𝜏\tauitalic_τ will be: τ^IPW=τ^1IPW/τ^0IPWsuperscript^𝜏IPWsuperscriptsubscript^𝜏1IPWsuperscriptsubscript^𝜏0IPW\hat{\tau}^{\text{IPW}}=\hat{\tau}_{1}^{\text{IPW}}/\hat{\tau}_{0}^{\text{IPW}}over^ start_ARG italic_τ end_ARG start_POSTSUPERSCRIPT IPW end_POSTSUPERSCRIPT = over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT IPW end_POSTSUPERSCRIPT / over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT IPW end_POSTSUPERSCRIPT.

    or the DR estimator can be written as:

    τ^1DR=1ni=1n[AiimpYiπi^(𝑿i)πi^(X)Aiimpπi^(𝑿i)m1^(Aiimp,𝑿i)],τ^0DR=1ni=1n[(1Aiimp)Yi1πi^(𝑿i)Aiimpπi^(X)1πi^(𝑿i)m0^(Aiimp,𝑿i)]formulae-sequencesuperscriptsubscript^𝜏1DR1𝑛superscriptsubscript𝑖1𝑛delimited-[]superscriptsubscript𝐴𝑖impsubscript𝑌𝑖^subscript𝜋𝑖subscript𝑿𝑖^subscript𝜋𝑖𝑋superscriptsubscript𝐴𝑖imp^subscript𝜋𝑖subscript𝑿𝑖^subscript𝑚1superscriptsubscript𝐴𝑖impsubscript𝑿𝑖superscriptsubscript^𝜏0DR1𝑛superscriptsubscript𝑖1𝑛delimited-[]1superscriptsubscript𝐴𝑖impsubscript𝑌𝑖1^subscript𝜋𝑖subscript𝑿𝑖superscriptsubscript𝐴𝑖imp^subscript𝜋𝑖𝑋1^subscript𝜋𝑖subscript𝑿𝑖^subscript𝑚0superscriptsubscript𝐴𝑖impsubscript𝑿𝑖\displaystyle\hat{\tau}_{1}^{\text{DR}}=\frac{1}{n}\sum_{i=1}^{n}\left[\frac{A% _{i}^{\text{imp}}Y_{i}}{\hat{\pi_{i}}(\bm{X}_{i})}-\frac{\hat{\pi_{i}}(X)-A_{i% }^{\text{imp}}}{\hat{\pi_{i}}(\bm{X}_{i})}\hat{m_{1}}(A_{i}^{\text{imp}},\bm{X% }_{i})\right],\hat{\tau}_{0}^{\text{DR}}=\frac{1}{n}\sum_{i=1}^{n}\left[\frac{% (1-A_{i}^{\text{imp}})Y_{i}}{1-\hat{\pi_{i}}(\bm{X}_{i})}-\frac{A_{i}^{\text{% imp}}-\hat{\pi_{i}}(X)}{1-\hat{\pi_{i}}(\bm{X}_{i})}\hat{m_{0}}(A_{i}^{\text{% imp}},\bm{X}_{i})\right]over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT DR end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT [ divide start_ARG italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG over^ start_ARG italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ( bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG - divide start_ARG over^ start_ARG italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ( italic_X ) - italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT end_ARG start_ARG over^ start_ARG italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ( bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG over^ start_ARG italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT , bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ] , over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT DR end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT [ divide start_ARG ( 1 - italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ) italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG 1 - over^ start_ARG italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ( bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG - divide start_ARG italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT - over^ start_ARG italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ( italic_X ) end_ARG start_ARG 1 - over^ start_ARG italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ( bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG over^ start_ARG italic_m start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT , bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ] (2)

    where m1^(Aiimp,𝑿i)^subscript𝑚1superscriptsubscript𝐴𝑖impsubscript𝑿𝑖\hat{m_{1}}(A_{i}^{\text{imp}},\bm{X}_{i})over^ start_ARG italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT , bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) is the fitted response for the treatment group, i.e., m1^(Aiimp,𝑿i)=E[Yi|Aiimp=1,𝑿i;𝜷^]^subscript𝑚1superscriptsubscript𝐴𝑖impsubscript𝑿𝑖𝐸delimited-[]conditionalsubscript𝑌𝑖superscriptsubscript𝐴𝑖imp1subscript𝑿𝑖^𝜷\hat{m_{1}}(A_{i}^{\text{imp}},\bm{X}_{i})=E[Y_{i}|A_{i}^{\text{imp}}=1,\bm{X}% _{i};\hat{\bm{\beta}}]over^ start_ARG italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT , bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_E [ italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT = 1 , bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; over^ start_ARG bold_italic_β end_ARG ]; m0^(Aiimp,𝑿i)^subscript𝑚0superscriptsubscript𝐴𝑖impsubscript𝑿𝑖\hat{m_{0}}(A_{i}^{\text{imp}},\bm{X}_{i})over^ start_ARG italic_m start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT , bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) is the fitted response for the control group, i.e., m0^(Aiimp,𝑿i)=E[Yi|Aiimp=0,𝑿i;𝜷^]^subscript𝑚0superscriptsubscript𝐴𝑖impsubscript𝑿𝑖𝐸delimited-[]conditionalsubscript𝑌𝑖superscriptsubscript𝐴𝑖imp0subscript𝑿𝑖^𝜷\hat{m_{0}}(A_{i}^{\text{imp}},\bm{X}_{i})=E[Y_{i}|A_{i}^{\text{imp}}=0,\bm{X}% _{i};\hat{\bm{\beta}}]over^ start_ARG italic_m start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT , bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_E [ italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT = 0 , bold_italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; over^ start_ARG bold_italic_β end_ARG ]. Here, 𝜷^^𝜷\hat{\bm{\beta}}over^ start_ARG bold_italic_β end_ARG is estimated parameters from the outcome model, written as YA+Xsimilar-to𝑌𝐴𝑋Y\sim A+Xitalic_Y ∼ italic_A + italic_X. Then, DR estimator for τ𝜏\tauitalic_τ will be: τ^DR=τ^1DR/τ^0DRsuperscript^𝜏DRsuperscriptsubscript^𝜏1DRsuperscriptsubscript^𝜏0DR\hat{\tau}^{\text{DR}}=\hat{\tau}_{1}^{\text{DR}}/\hat{\tau}_{0}^{\text{DR}}over^ start_ARG italic_τ end_ARG start_POSTSUPERSCRIPT DR end_POSTSUPERSCRIPT = over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT DR end_POSTSUPERSCRIPT / over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT DR end_POSTSUPERSCRIPT.

If the imputation model is correctly specified, based on MAR assumption, we know P(Aimp=1|𝑿,Y)=P(A=1|𝑿,Y)𝑃superscript𝐴impconditional1𝑿𝑌𝑃𝐴conditional1𝑿𝑌P(A^{\text{imp}}=1|\bm{X},Y)=P(A=1|\bm{X},Y)italic_P ( italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT = 1 | bold_italic_X , italic_Y ) = italic_P ( italic_A = 1 | bold_italic_X , italic_Y ). Furthermore, if PS model is also correct, due to SITA assumption, τ^IPWpτpsuperscript^𝜏IPW𝜏\hat{\tau}^{\text{IPW}}\xrightarrow[]{\text{p}}\tauover^ start_ARG italic_τ end_ARG start_POSTSUPERSCRIPT IPW end_POSTSUPERSCRIPT start_ARROW overp → end_ARROW italic_τ and its asymptotic normality holds as n𝑛n\to\inftyitalic_n → ∞ 19. For DR estimator, it can protect against misspecification of either PS or the outcome model 14, 20. In other words, if the imputation model is correct and either PS or the outcome model is correct, we know τ^DRpτpsuperscript^𝜏DR𝜏\hat{\tau}^{\text{DR}}\xrightarrow[]{\text{p}}\tauover^ start_ARG italic_τ end_ARG start_POSTSUPERSCRIPT DR end_POSTSUPERSCRIPT start_ARROW overp → end_ARROW italic_τ and its asymptotic normality holds as n𝑛n\to\inftyitalic_n → ∞ 15.

2.3 A Simulation Study

To investigate which covariates should be included in the imputation and PS models to achieve the smallest RMSE of τ^^𝜏\hat{\tau}over^ start_ARG italic_τ end_ARG, a simulation study is conducted. The number of replications is N=1000𝑁1000N=1000italic_N = 1000, and the sample size is n=500𝑛500n=500italic_n = 500 in each data generation. Three covariates X1,X2,X3subscript𝑋1subscript𝑋2subscript𝑋3X_{1},X_{2},X_{3}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT are generated from N(0,1)𝑁01N(0,1)italic_N ( 0 , 1 ) independently. Y,A𝑌𝐴Y,Aitalic_Y , italic_A, and the missing indicator R𝑅Ritalic_R are generated by the following three models:

  • the missingness model: logit{P(R=1|𝑿)}=0.3+0.4X1+0.6X2+1.8X3logit𝑃𝑅conditional1𝑿0.30.4subscript𝑋10.6subscript𝑋21.8subscript𝑋3\text{logit}\{P(R=1|\bm{X})\}=-0.3+0.4X_{1}+0.6X_{2}+1.8X_{3}logit { italic_P ( italic_R = 1 | bold_italic_X ) } = - 0.3 + 0.4 italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + 0.6 italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + 1.8 italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT;

  • the treatment model: logit{P(A=1|𝑿)}=0.2+0.3X1+1X2logit𝑃𝐴conditional1𝑿0.20.3subscript𝑋11subscript𝑋2\text{logit}\{P(A=1|\bm{X})\}=-0.2+0.3X_{1}+1X_{2}logit { italic_P ( italic_A = 1 | bold_italic_X ) } = - 0.2 + 0.3 italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + 1 italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT;

  • the outcome model: logit{P(Y=1|A,𝑿)}=0.2+2A0.3X1+2.5X3;logit𝑃𝑌conditional1𝐴𝑿0.22𝐴0.3subscript𝑋12.5subscript𝑋3\text{logit}\{P(Y=1|A,\bm{X})\}=-0.2+2A-0.3X_{1}+2.5X_{3};logit { italic_P ( italic_Y = 1 | italic_A , bold_italic_X ) } = - 0.2 + 2 italic_A - 0.3 italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + 2.5 italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ;

The missing rate is about 44% in this scenario and the true causal effect τ=1.523𝜏1.523\tau=1.523italic_τ = 1.523. Notice that the true missingness model does not include the outcome Y𝑌Yitalic_Y, but we aim to investigate whether adding the outcome into the imputation model will improve the performance of the estimated causal effect in the simulation studies.

When we conduct imputation using MICE, multiple imputation times m=20𝑚20m=20italic_m = 20 are chosen for imputing the missing data. Then, IPW and DR estimators for τ𝜏\tauitalic_τ can be constructed on each imputed dataset and Rubin’s rule can be applied to obtain final estimated values based on the Section 2.2 12.

For each simulated data set, we investigate five different imputation models and four different PS models. In total, twenty possible combinations of the imputation and PS models are investigated. We name these models in Tables 1 and 2:

Table 1: Five imputation models in the simulation studies
Model Name (shortened form) Model Form
exposure-related model (Exp) AX1+X2similar-to𝐴subscript𝑋1subscript𝑋2A\sim X_{1}+X_{2}italic_A ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT
outcome-related model (Out) AX1+X3similar-to𝐴subscript𝑋1subscript𝑋3A\sim X_{1}+X_{3}italic_A ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT
all covariates-included model (Covs) AX1+X2+X3similar-to𝐴subscript𝑋1subscript𝑋2subscript𝑋3A\sim X_{1}+X_{2}+X_{3}italic_A ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT
response-included model (Res) AX1+X2+Ysimilar-to𝐴subscript𝑋1subscript𝑋2𝑌A\sim X_{1}+X_{2}+Yitalic_A ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_Y
full model (Full) AX1+X2+X3+Ysimilar-to𝐴subscript𝑋1subscript𝑋2subscript𝑋3𝑌A\sim X_{1}+X_{2}+X_{3}+Yitalic_A ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + italic_Y

We provide some rationales behind the specification of these imputation and PS models. When selecting the imputation model, we start with an exposure-related model, which includes only X1subscript𝑋1X_{1}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and X2subscript𝑋2X_{2}italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT as they directly affect A𝐴Aitalic_A. To increase the predictive ability, we gradually modify the imputation model by adding X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT and Y𝑌Yitalic_Y in a step-by-step process until the full imputation model is obtained.

In contrast, the goal of fitting PS model is not to increase the predictive ability, but to adjust for the confounding issue. Therefore, we start with a naive PS model with only X1subscript𝑋1X_{1}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT as the main confounder. Then, we gradually add exposure-related or outcome-related variables until PS model includes all covariates.

Table 2: Four PS models in the simulation studies
Model Name (shortened form) Model Form
naive model (Naive) AimpX1similar-tosuperscript𝐴impsubscript𝑋1A^{\text{imp}}\sim X_{1}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT
exposure-relate model (Exp) AimpX1+X2similar-tosuperscript𝐴impsubscript𝑋1subscript𝑋2A^{\text{imp}}\sim X_{1}+X_{2}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT
outcome-relate model (Out) AimpX1+X3similar-tosuperscript𝐴impsubscript𝑋1subscript𝑋3A^{\text{imp}}\sim X_{1}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT
all covariates-included model (Covs) AimpX1+X2+X3similar-tosuperscript𝐴impsubscript𝑋1subscript𝑋2subscript𝑋3A^{\text{imp}}\sim X_{1}+X_{2}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT

2.4 Simulation Results

In this subsection, we investigate which combination of imputation and PS models can lead to the optimal RMSE of τ^^𝜏\hat{\tau}over^ start_ARG italic_τ end_ARG. Table 3 shows simulation results for IPW estimator when n=500𝑛500n=500italic_n = 500 and Figure 2 is plotted to visualize the results. We also provide the simulation results for risk ratio and average causal effects in Appendix A.6. To evaluate the performance of different models, we present the bias rate as the main interest: (τ¯τ)/τ¯𝜏𝜏𝜏(\bar{{\tau}}-\tau)/\tau( over¯ start_ARG italic_τ end_ARG - italic_τ ) / italic_τ, where τ¯=1Nj=1Nτj^¯𝜏1𝑁superscriptsubscript𝑗1𝑁^subscript𝜏𝑗\bar{{\tau}}=\frac{1}{N}\sum_{j=1}^{N}\hat{{\tau}_{j}}over¯ start_ARG italic_τ end_ARG = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT over^ start_ARG italic_τ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG, and τ^jsubscript^𝜏𝑗\hat{{\tau}}_{j}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is the point estimate in the jthsuperscript𝑗𝑡j^{th}italic_j start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT replication, j=1,2,,N𝑗12𝑁j=1,2,\dots,Nitalic_j = 1 , 2 , … , italic_N. The empirical standard error (ESE), which is also called the sampling standard error, is written as ESE=1N1j=1N(τj^τ¯)2ESE1𝑁1superscriptsubscript𝑗1𝑁superscript^subscript𝜏𝑗¯𝜏2\text{ESE}=\sqrt{\frac{1}{N-1}\sum_{j=1}^{N}(\hat{{\tau}_{j}}-\bar{{\tau}})^{2}}ESE = square-root start_ARG divide start_ARG 1 end_ARG start_ARG italic_N - 1 end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( over^ start_ARG italic_τ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG - over¯ start_ARG italic_τ end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG. The square root of mean squared error (RMSE) is defined as RMSE=1N1j=1N(τ^jτ)2RMSE1𝑁1superscriptsubscript𝑗1𝑁superscriptsubscript^𝜏𝑗𝜏2\text{RMSE}=\sqrt{\frac{1}{N-1}\sum_{j=1}^{N}(\hat{{\tau}}_{j}-\tau)^{2}}RMSE = square-root start_ARG divide start_ARG 1 end_ARG start_ARG italic_N - 1 end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - italic_τ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG.

Refer to caption
Figure 2: Boxplot of estimated values for IPW estimators in the simulation studies when n=500𝑛500n=500italic_n = 500. The x-axis refers to which imputation and PS models are used. Here, Exp, Out, Res, and Covs, refer to the exposure-related, outcome-related, response-included, and covariates-included models, respectively. The red line is a true causal effect.
Table 3: Performance of different models using IPW method when n=500𝑛500n=500italic_n = 500
Imp, PS Models Bias Bias Rate ESE RMSE
Exp, naive -0.186 -12.187 0.085 0.204
Exp, exp -0.217 -14.283 0.092 0.236
Exp, out -0.188 -12.324 0.067 0.199
Exp, covs -0.219 -14.384 0.080 0.233
\hdashlineOut, naive -0.290 -19.069 0.109 0.310
Out, exp -0.333 -21.858 0.103 0.348
Out, out -0.230 -15.074 0.059 0.237
Out, covs -0.263 -17.301 0.059 0.270
\hdashlineCovs, naive -0.181 -11.920 0.123 0.219
Covs, exp -0.211 -13.883 0.138 0.252
Covs, out -0.188 -12.335 0.067 0.199
Covs, covs -0.218 -14.295 0.080 0.232
\hdashlineRes, naive 0.102 6.679 0.176 0.204
Res, exp 0.137 9.026 0.217 0.257
Res, out -0.046 -2.996 0.107 0.116
Res, covs -0.038 -2.523 0.137 0.143
\hdashlineFull, naive -0.017 -1.096 0.160 0.161
Full, exp -0.011 -0.710 0.194 0.194
Full, out 0.010 0.640 0.110 0.111
Full, covs 0.023 1.498 0.137 0.139
  • Here, Exp, Out, Res, and Covs, refer to the exposure-related, outcome-related, response-included, and covariates-included models, respectively.

2.4.1 Comparison of Imputation Models

Based on Table 3 and Figure 2, if we focus on the same PS model and compare different imputation models, we find that the full imputation model has resulted in the smallest bias of τ^^𝜏\hat{\tau}over^ start_ARG italic_τ end_ARG compared to other imputation models. In contrast, the exposure-related, outcome-related, and all covariates-included imputation models generate huge bias rates.

Since (X1,X2)subscript𝑋1subscript𝑋2(X_{1},X_{2})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) are exposure-related variables from Figure 1, some researchers may argue that the preferred imputation model only needs to include (X1,X2)subscript𝑋1subscript𝑋2(X_{1},X_{2})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) without (X3,Y)subscript𝑋3𝑌(X_{3},Y)( italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , italic_Y ) . However, our simulation results indicate that adding (X3,Y)subscript𝑋3𝑌(X_{3},Y)( italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , italic_Y ) into the imputation model is necessary to reduce the bias of the estimated causal effect. For example, comparing the exposure-related model with the response-included imputation model, Table 3 shows that adding Y𝑌Yitalic_Y into the imputation model reduces about 10% of bias. To study the effect of X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, if we compare the exposure-related model versus the covariate-included imputation model, adding X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT into the imputation model does not reduce the large bias if we do not condition on Y𝑌Yitalic_Y. In contrast, when we condition on Y𝑌Yitalic_Y, compared with the response-included imputation versus the full imputation model, Figure 2 shows that adding X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT leads to a reduction of bias. These findings are consistent with the discussion of DAG in Section 2, which indicate that even though the true missing model does not include Y𝑌Yitalic_Y and A𝐴Aitalic_A is not directly correlated with X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, adding both X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT and Y𝑌Yitalic_Y into the imputation model can improve the predictive ability.

In summary, even though the true missingness model does not include Y𝑌Yitalic_Y, all exposure-related, outcome-related covariates and the outcome itself should be included in the imputation model, so the full imputation model results in the smallest bias among others.

2.4.2 Comparison of PS Models

Based on the above discussion, we next focus on the full imputation model and compare the performance of different PS models in Table 3 and Figure 2. We find that all PS models that incorporate the main confounder X1subscript𝑋1X_{1}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT can lead to very small bias, as X1subscript𝑋1X_{1}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is the "sufficient set" for confounding adjustment, in this case, 21. Even though X2subscript𝑋2X_{2}italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT directly causes A𝐴Aitalic_A, adding X2subscript𝑋2X_{2}italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT does not reduce the bias because it only causes the change of the exposure and has no direct impact on the outcome Y𝑌Yitalic_Y, so X2subscript𝑋2X_{2}italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is not considered as the main confounder.

In terms of standard errors, the outcome-related PS model with X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT can generate the smallest ESE when we fit the full imputation model. In contrast, both exposure-related and all-covariates PS models lead to an increase in ESE because adding exposure-related variable X2subscript𝑋2X_{2}italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT may generate more extreme PS values to enlarge the variation in finite samples 16. In summary, we recommend choosing the outcome-related PS model over the exposure-related PS model.

The results using DR estimator are presented in Table 4 and visualized in Figure 3. Overall, we observe a similar trend of model selection using DR estimator as IPW estimator, suggesting that full imputation plus outcome-based PS models can effectively reduce RMSE of τ^^𝜏\hat{\tau}over^ start_ARG italic_τ end_ARG compared to other candidate models.

Overall, the best model selection is to choose the full imputation plus outcome-related PS models for both IPW or DR estimator, which results in the smallest RMSE of τ^^𝜏\hat{\tau}over^ start_ARG italic_τ end_ARG. In contrast, using outcome-related imputation plus exposure-related PS models leads to the largest RMSE, which is not recommended.

Refer to caption
Figure 3: Boxplot of estimated values for DR estimators in the simulation studies when n=500𝑛500n=500italic_n = 500. The x-axis refers to which imputation and PS models are used. Here, Exp, Out, Res, and Covs, refer to the exposure-related, outcome-related, response-included, and covariates-included models, respectively. The red line is the true causal effect.
Table 4: Performance of different models using DR method when n=500𝑛500n=500italic_n = 500
Imp, PS Models Bias Bias Rate ESE RMSE
Exp, naive -0.186 -12.198 0.084 0.204
Exp, exp -0.222 -14.553 0.085 0.237
Exp, out -0.188 -12.374 0.065 0.199
Exp, covs -0.225 -14.804 0.066 0.235
\hdashlineOut, naive -0.291 -19.087 0.109 0.310
Out, exp -0.321 -21.056 0.103 0.337
Out, out -0.226 -14.859 0.058 0.233
Out, covs -0.264 -17.354 0.057 0.270
\hdashlineCovs, naive -0.182 -11.933 0.123 0.219
Covs, exp -0.215 -14.105 0.134 0.253
Covs, out -0.189 -12.392 0.065 0.200
Covs, covs -0.225 -14.786 0.067 0.235
\hdashlineRes, naive 0.102 6.679 0.176 0.204
Res, exp 0.133 8.763 0.211 0.250
Res, out -0.056 -3.654 0.100 0.114
Res, covs -0.066 -4.350 0.115 0.132
\hdashlineFull, naive -0.017 -1.102 0.160 0.161
Full, exp -0.015 -0.977 0.189 0.189
Full, out -0.010 -0.660 0.104 0.104
Full, covs -0.011 -0.691 0.120 0.120
  • Here, Exp, Out, Res, and Covs, refer to the exposure-related, outcome-related, response-included, and covariates-included models, respectively.

3 Model Selection Criteria

In application studies, the true causal effect is unknown, so we are not able to find which combination of models leads to the best performance. In such a case, we need some model selection criteria to choose the best imputation and PS models. In this section, we discuss some traditional criteria for model selection and propose a new criterion that takes into consideration both models.

3.1 Weighted Accuracy (Accuracy(w))\text{Accuracy}^{(w)})Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT )

We first discuss a theoretical way to evaluate the performance of the imputation model via weighted accuracy based on the observed data. Due to the missing exposure, even though we can impute the missing data, we do not know the true missing values, which makes it challenging to estimate the accuracy only based on the observed data. A naive way is to fit the imputation model based on the observed data and compare the imputed exposure with those known values. However, since the accuracy is a function of (𝑿,Y)𝑿𝑌(\bm{X},Y)( bold_italic_X , italic_Y ) and the distribution of (𝑿,Y)𝑿𝑌(\bm{X},Y)( bold_italic_X , italic_Y ) on observed data has been changed on the whole data, the way to estimate accuracy from the observed data is no longer valid.

One of the approaches to deal with this issue is to apply inverse probability weighting on the accuracy when we consider wi=P(Ri=1|𝑿𝒊,Yi)subscript𝑤𝑖𝑃subscript𝑅𝑖conditional1subscript𝑿𝒊subscript𝑌𝑖w_{i}=P(R_{i}=1|\bm{X_{i}},Y_{i})italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_P ( italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 | bold_italic_X start_POSTSUBSCRIPT bold_italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) as the propensity for the missingness. Obtaining the weighted accuracy can be described in the following four steps:

  1. 1.

    First, we randomly split the individuals into either the training or the testing data according to a ratio q𝑞qitalic_q, where q=ntestnobs𝑞subscript𝑛testsubscript𝑛obsq=\frac{n_{\text{test}}}{n_{\text{obs}}}italic_q = divide start_ARG italic_n start_POSTSUBSCRIPT test end_POSTSUBSCRIPT end_ARG start_ARG italic_n start_POSTSUBSCRIPT obs end_POSTSUBSCRIPT end_ARG. Here, nobs,ntestsubscript𝑛obssubscript𝑛testn_{\text{obs}},n_{\text{test}}italic_n start_POSTSUBSCRIPT obs end_POSTSUBSCRIPT , italic_n start_POSTSUBSCRIPT test end_POSTSUBSCRIPT is the sample size for the whole observed data and for the testing data, respectively. Certainly, q𝑞qitalic_q can be arbitrarily chosen by the user, but the simplest way is to set the ratio equal to the missing rate in the original dataset.

  2. 2.

    Next, we fit the imputation model on the training data and impute the exposure on the testing data to compare with the known exposure status.

  3. 3.

    After that, we fit the full missingness model on the whole dataset, written as logit(wi)=𝑿𝒊𝑻𝜸logitsubscript𝑤𝑖superscriptsubscript𝑿𝒊𝑻𝜸\text{logit}(w_{i})=\bm{X_{i}^{T}\gamma}logit ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = bold_italic_X start_POSTSUBSCRIPT bold_italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_T end_POSTSUPERSCRIPT bold_italic_γ, where 𝜸𝜸\bm{\gamma}bold_italic_γ is the vector of coefficients including the intercept to be estimated so that we can obtain wi^^subscript𝑤𝑖\hat{w_{i}}over^ start_ARG italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG as the estimated propensity of missingness for individual i=1,2,,n𝑖12𝑛i=1,2,\dots,nitalic_i = 1 , 2 , … , italic_n.

  4. 4.

    Finally, we calculate the weighted accuracy, called “Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT":

    Accuracy(w)(Aimp)=i=1n1Ri1w^iδi𝟙(Ai=Aimp)n×q,superscriptAccuracy𝑤superscript𝐴impsuperscriptsubscript𝑖1𝑛1subscript𝑅𝑖1subscript^𝑤𝑖subscript𝛿𝑖1subscript𝐴𝑖superscript𝐴imp𝑛𝑞\displaystyle\text{Accuracy}^{(w)}(A^{\text{imp}})=\frac{\sum_{i=1}^{n}\frac{1% -R_{i}}{1-\hat{w}_{i}}\delta_{i}\mathbbm{1}(A_{i}=A^{\text{imp}})}{n\times q},Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT ( italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ) = divide start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG 1 - italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG 1 - over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG italic_δ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_1 ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ) end_ARG start_ARG italic_n × italic_q end_ARG , (3)

where δi=𝟙(individual i is chosen in the test set)subscript𝛿𝑖1individual i is chosen in the test set\delta_{i}=\mathbbm{1}(\text{individual $i$ is chosen in the test set})italic_δ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = blackboard_1 ( individual italic_i is chosen in the test set ) and 𝟙(Ai=Aimp)1subscript𝐴𝑖superscript𝐴imp\mathbbm{1}(A_{i}=A^{\text{imp}})blackboard_1 ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ) is an indicator for whether the imputed value equals to the observed value. Here, w^isubscript^𝑤𝑖\hat{w}_{i}over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is estimated from the full missingness model including all covariates and the outcome.

The weighted accuracy is consistent to the true accuracy when the full missingness model is correctly specified. The detailed proof is attached in Appendix A.7.1. We also verify this conclusion in the simulation study, in which we find that the difference between the true accuracy and the weighted accuracy is close to 0.01 (the results are omitted).

3.2 ASMD, KS, and BIC

To evaluate the PS model, the traditional approaches are based on the balance statistic in the confounders. For example, one can calculate absolute standardized mean difference (ASMD) and Kolmogorov-Smirnov (KS) statistics in the covariates after propensity score adjustment.

Another approach to evaluate the PS model is using BIC criterion to select the best model adjusting for the sample size and number of variables in the model. Since in PS model selection, we aim to select the outcome-related variables instead of the exposure-related variables, we suggest using BIC of the outcome model YA+𝑿similar-to𝑌𝐴𝑿Y\sim A+\bm{X}italic_Y ∼ italic_A + bold_italic_X as the selection criterion. From Figure 1, since X2Y|A=abottomsubscript𝑋2conditional𝑌𝐴𝑎X_{2}\bot Y|A=aitalic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⊥ italic_Y | italic_A = italic_a, a smaller BIC will indicate that we have included outcome-related variables and excluded the exposure-related variables. For example, based on the simulation results in Table 5, the naive and exposure-related PS models generally have larger BIC than outcome-related and covariates-included PS models. Notice that we do not recommend using the c-statistics or accuracy of PS model to select variables because in PS analysis, we aim to balance the confounder across the exposure and non-exposure group instead of maximizing the predictability of the PS model 22, 23.

We should also notice that only using one traditional criterion may not be appropriate to select both imputation and PS models, such as either Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT, ASMD, or KS statistics, because they just focus on the performance of a single model. For example in Table 5, if we just adopt Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT to select imputation and PS models, we cannot distinct between different PS models because the same imputation model will lead to the same Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT value. In contrast, if we only choose BIC to select two models and we choose the same exposure-related PS model, we cannot easily distinguish the performance of the outcome-related vs covariates-included imputation models. That motivated us to combine the evaluation of Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT and BIC into an integrated criterion, called rank score proposed in the next section.

3.3 Rank Score

The idea of the new criterion is to combine the performance of both imputation and PS models, so the “rank score" is proposed to average over the ranks of Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT and BIC among all candidates of models. We can also think of the rank score similar as the “best subset" strategy, which considers all possible combination of the candidate models and select the best choice in contrast to the “stepwise" strategy. For a given model, we can calculate its rank score value by:

Rank Score=Rank(1-Accuracy(w))+Rank(BIC)2,Rank ScoreRank(1-Accuracy(w))+Rank(BIC)2\displaystyle\text{ Rank Score}=\frac{\text{Rank(1-$\text{Accuracy}^{(w)}$)+% Rank(BIC)}}{2},Rank Score = divide start_ARG Rank(1- Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT )+Rank(BIC) end_ARG start_ARG 2 end_ARG , (4)

where “Rank(1Accuracy(w)1superscriptAccuracy𝑤1-\text{Accuracy}^{(w)}1 - Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT)" is the rank of the model based on the value of 1Accuracy(w)1superscriptAccuracy𝑤1-\text{Accuracy}^{(w)}1 - Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT if we order 1Accuracy(w)1superscriptAccuracy𝑤1-\text{Accuracy}^{(w)}1 - Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT from the largest to the smallest. “Rank(BIC)" is the rank for a given model based on the value of BIC for regressing Y𝑌Yitalic_Y on A𝐴Aitalic_A and the covariates selected in the given model. We can calculate the rank score for every possible combination of the imputation and PS models. Then, we select the smallest rank score, which leads to the highest Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT and the smallest BIC, so the best imputation and PS models can be successfully chosen. From the simulation studies in Table 5, the full imputation plus outcome-related PS models will result in the smallest rank score as the best choice, consistent with the smallest RMSE. In comparison, if we choose the outcome imputation plus exposure-related model, the rank score has the largest value, which is also consistent with the highest RMSE.

In summary, the main advantage of the rank score is to combine the performance of both imputation and PS models and directly find the best model based on this rank-based criterion. In addition, the rank score is a unit-free score, which is not affected by the different magnitudes of the accuracy and BIC. The evaluation of those criteria is shown in Table 7.

Table 5: Criterion values for different models using IPW method when n=500𝑛500n=500italic_n = 500
Imp, PS Models Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT Out BIC ASMD KS ABIC Rank Score Rank(RMSE)
Exp, naive 0.586 677.561 0.161 0.130 1.2155 11.789 10
Exp, exp 0.586 720.437 0.022 0.130 1.3491 14.272 14
Exp, out 0.586 430.074 0.149 0.130 0.4498 7.849 7
Exp, covs 0.586 433.966 0.014 0.130 0.4619 8.959 13
\hdashlineOut, naive 0.507 683.862 0.103 0.148 1.8608 15.947 19
Out, exp 0.507 724.226 0.031 0.148 1.9864 18.306 20
Out, out 0.507 434.737 0.075 0.148 1.0901 12.460 15
Out, covs 0.507 435.659 0.008 0.148 1.0929 12.918 18
\hdashlineCovs, naive 0.585 676.668 0.170 0.136 1.2244 11.875 11
Covs, exp 0.585 719.239 0.033 0.136 1.3570 14.354 16
Covs, out 0.585 430.103 0.148 0.136 0.4616 8.043 8
Covs, covs 0.585 433.958 0.014 0.136 0.4736 9.136 12
\hdashlineRes, naive 0.609 652.471 0.182 0.124 0.9677 8.103 9
Res, exp 0.609 694.546 0.040 0.124 1.0989 10.450 17
Res, out 0.609 411.095 0.154 0.124 0.2209 3.993 2
Res, covs 0.609 415.906 0.015 0.124 0.2358 4.580 4
\hdashlineFull, naive 0.622 662.369 0.172 0.138 0.8601 7.463 5
Full, exp 0.622 705.051 0.034 0.138 0.9931 9.941 6
Full, out 0.622 401.525 0.149 0.138 0.0536 1.989 1
Full, covs 0.622 406.367 0.014 0.138 0.0687 2.574 3
  • Here, Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT: the weighted accuracy on the observed data. Out BIC is BIC of the outcome model. ASMD: absolute standardized mean difference. KS: Kolmogorov-Smirnov distance for (X1,X2,X3)subscript𝑋1subscript𝑋2subscript𝑋3(X_{1},X_{2},X_{3})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ). ABIC is the rescaled form of Accuracy and BIC. Rank Score: average ranks of 1Accuracy(w)1superscriptAccuracy𝑤1-\text{Accuracy}^{(w)}1 - Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT and BIC.

Table 6: Criterion values for different models using DR method when n=500𝑛500n=500italic_n = 500
Imp, PS Models Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT Out BIC ASMD KS ABIC Rank Score Rank(RMSE)
Exp, naive 0.586 677.561 0.161 0.130 1.2155 11.789 10
Exp, exp 0.586 720.437 0.022 0.130 1.3491 14.272 15
Exp, out 0.586 430.074 0.149 0.130 0.4498 7.849 7
Exp, covs 0.586 433.966 0.014 0.130 0.4619 8.959 14
\hdashlineOut, naive 0.507 683.862 0.103 0.148 1.8608 15.947 19
Out, exp 0.507 724.226 0.031 0.148 1.9864 18.306 20
Out, out 0.507 434.737 0.075 0.148 1.0901 12.460 12
Out, covs 0.507 435.659 0.008 0.148 1.0929 12.918 18
\hdashlineCovs, naive 0.585 676.668 0.170 0.136 1.2244 11.875 11
Covs, exp 0.585 719.239 0.033 0.136 1.3570 14.354 17
Covs, out 0.585 430.103 0.148 0.136 0.4616 8.043 8
Covs, covs 0.585 433.958 0.014 0.136 0.4736 9.136 13
\hdashlineRes, naive 0.609 652.471 0.182 0.124 0.9677 8.103 9
Res, exp 0.609 694.546 0.040 0.124 1.0989 10.450 16
Res, out 0.609 411.095 0.154 0.124 0.2209 3.993 2
Res, covs 0.609 415.906 0.015 0.124 0.2358 4.580 4
\hdashlineFull, naive 0.622 662.369 0.172 0.138 0.8601 7.463 5
Full, exp 0.622 705.051 0.034 0.138 0.9931 9.941 6
Full, out 0.622 401.525 0.149 0.138 0.0536 1.989 1
Full, covs 0.622 406.367 0.014 0.138 0.0687 2.574 3
  • Here, Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT: the weighted accuracy on the observed data. Out BIC is BIC of the outcome model. ASMD: absolute standardized mean difference. KS: Kolmogorov-Smirnov distance for (X1,X2,X3)subscript𝑋1subscript𝑋2subscript𝑋3(X_{1},X_{2},X_{3})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ). ABIC is the rescaled form of Accuracy and BIC. Rank Score: average ranks of 1Accuracy(w)1superscriptAccuracy𝑤1-\text{Accuracy}^{(w)}1 - Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT and BIC.

3.4 Other Criterion

Another possible criterion to combine accuracy and BIC is to directly conduct a product of two values, written as:

(1Accuracy(w))×BIC.1superscriptAccuracy𝑤BIC\displaystyle(1-\text{Accuracy}^{(w)})\times\text{BIC}.( 1 - Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT ) × BIC . (5)

However, since the magnitude of Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT and BIC are very different, the product values are easily affected by the extreme value, so this criterion is not very helpful in selecting both models.

Even though BIC and Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT do not have the same magnitude, we try to rescale both terms to the range of [0,1]01[0,1][ 0 , 1 ]. We call the following criterion as “ABIC":

ABIC=(1Accuracy(w)min(Accuracy(w))max(Accuracy(w))min(Accuracy(w)))+BICmin(BIC)max(BIC)min(BIC),ABIC1superscriptAccuracy𝑤minsuperscriptAccuracy𝑤max(Accuracy(w))min(Accuracy(w))BICminBICmax(BIC)min(BIC)\displaystyle\text{ABIC}=\left(1-\frac{\text{$\text{Accuracy}^{(w)}$}-\text{% min}(\text{$\text{Accuracy}^{(w)}$})}{\text{max(\text{$\text{Accuracy}^{(w)}$}% )}-\text{min(\text{$\text{Accuracy}^{(w)}$})}}\right)+\frac{\text{BIC}-\text{% min}(\text{BIC})}{\text{max(\text{BIC})}-\text{min(\text{BIC})}},ABIC = ( 1 - divide start_ARG Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT - min ( Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT ) end_ARG start_ARG max( Accuracy(w) ) - min( Accuracy(w) ) end_ARG ) + divide start_ARG BIC - min ( BIC ) end_ARG start_ARG max( roman_BIC ) - min( roman_BIC ) end_ARG , (6)

where “min" means the smallest value and “max" means the largest value among all candidates of models. Even though ABIC solves the different magnitude issues, the min and max values of ABIC can still be affected by the extreme value of BIC or Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT. Then, the distinction of ABIC values among the candidate models may be largely shrunk to a quite small value, which makes it hard for us to distinguish the model performance, shown in Table 5-6.

3.5 Results of Selection Criteria

To evaluate the performance of different criteria, we also provide the correlation between RMSE and each criterion in Table 7. More specifically, after N=1000𝑁1000N=1000italic_N = 1000 replications, we have obtained RMSE for all candidate models. Within each replication of simulation studies, we can evaluate each candidate model with different criteria. Then, in the jthsuperscript𝑗thj^{\text{th}}italic_j start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT replication, we can calculate the Spearman correlation between the proposed criterion and RMSE of τ^^𝜏\hat{\tau}over^ start_ARG italic_τ end_ARG. Table 7 shows the average of the Spearman correlations over 1000 replications.

From Table 7, compared with other criteria, the rank score shows a stronger correlation (larger than 0.8) with the RMSE of τ^^𝜏\hat{\tau}over^ start_ARG italic_τ end_ARG using either IPW or DR method, which indicates that a smaller value of the rank score is more likely to result in a smaller RMSE. In other words, even though we do not know the true causal effect and RMSE in the application, the rank score is still valid for researchers to evaluate the performance of imputation and PS models together. From Table 5 and 6, we can confirm that full imputation plus outcome-related PS models have a higher chance to be selected via the “rank score" criterion.

Table 7: Spearman correlation between criteria and RMSE when n=500𝑛500n=500italic_n = 500
Criterion IPW Method DR Method
1-Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT 0.678 0.655
ASMD -0.306 -0.337
KS 0.212 0.191
Out BIC 0.682 0.702
\hdashlineABIC 0.793 0.785
Rank Score 0.841 0.837
  • Here, Out BIC is BIC of the outcome model. ASMD: absolute standardized mean difference for (X1,X2,X3)subscript𝑋1subscript𝑋2subscript𝑋3(X_{1},X_{2},X_{3})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ). KS: Kolmogorov-Smirnov distance for (X1,X2,X3)subscript𝑋1subscript𝑋2subscript𝑋3(X_{1},X_{2},X_{3})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ). ABIC is the rescaled form of Accuracy and BIC. Rank Score: averaged ranks of 1Accuracy(w)1superscriptAccuracy𝑤1-\text{Accuracy}^{(w)}1 - Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT and BIC.

In summary, our proposed rank score combines the performance of both imputation and PS models. As a rank-based criterion, the rank score shows a stronger correlation with the RMSE of τ^^𝜏\hat{\tau}over^ start_ARG italic_τ end_ARG compared with other traditional criteria. Based on our simulation results, the rank score can also successfully select the full imputation plus the outcome-related PS models with the smallest RMSE as the best choice.

4 Application Study

4.1 Background

COVID-19 is a highly contagious disease that can lead to severe clinical manifestations and long-term complications for patients. Some researchers have shown that the presence of cardiovascular disease (CVD) is associated with significantly worse mortality in patients 24. However, the associational analysis cannot answer the question of whether CVD directly causes higher mortality or whether other risk factors that lead to CVD are also a common cause of COVID-19 death. In this section, we present an application study on a COVID-19 dataset to investigate the causal effect of CVD (as the exposure) on the mortality of COVID-19 patients (as the outcome) when CVD status is not fully observed. We aim to select both imputation and PS models using the proposed methods and compare their performance.

The data were collected during the second wave of the pandemic in Brazil from June 25th to December 31st, 2020. The dataset contains 2878 COVID-19 patients, of whom 1253 died from COVID-19. Although all patients were exposed to COVID-19, 548 patients had missing information on CVD, resulting in a missing rate of 19%. Demographic and clinical variables were also collected including age, sex, and diabetes without missing data. A summary of the dataset is provided in Table 8-9, stratified by the mortality or the missingness. The independence between mortality and covariates was tested using Fisher test for categorical variables and the Mann-Whitney test for continuous variables. From Table 8, we find a significant association between age and mortality, which is a well-known risk factor. There are also significant associations between CVD and mortality, which may be influenced by both confounders and missingness in the exposure. However, no significant association is identified between diabetes, sex, and mortality.

Table 8: Summary table of COVID-19 data stratified by mortality
Covariate Full Sample (n=2878) Alive (n=1625) Died (n=1253) p-value
CVD 0.022
   No 802 (34) 478 (36) 324 (32)
   Yes 1528 (66) 835 (64) 693 (68)
   Missing 548 312 236
\hdashlineage <<<0.001
   Mean (sd) 68.9 (16.3) 65.4 (16.6) 73.3 (14.7)
   Median (Min,Max) 72 (0,107) 67 (0,104) 76 (0,107)
\hdashlinesex 0.054
   Male 1551 (54) 850 (52) 701 (56)
   Female 1327 (46) 775 (48) 552 (44)
\hdashlinediabetes 0.3
   No 1117 (39) 617 (38) 500 (40)
   Yes 1761 (61) 1008 (62) 753 (60)
00footnotetext: CVD: cardiovascular disease. For the continuous variable, it shows the mean, se, min, max, and median. For the categorical variable, its frequency and percentage of total samples are provided in the bracket.
Table 9: Summary table of COVID-19 data stratified by the missing indicator
Covariate Full Sample (n=2878) Missing (n=2330) Observed (n=548) p-value
age 0.099
   Mean (sd) 68.9 (16.3) 69 (16.6) 68.3 (14.8)
   Median (Min,Max) 72 (0,107) 72 (0,104) 71 (1,107)
\hdashlinesex 0.37
   Male 1551 (54) 1246 (53) 305 (56)
   Female 1327 (46) 1084 (47) 243 (44)
\hdashlinediabetes <<<0.001
   No 1117 (39) 1111 (48) 6 (1)
   Yes 1761 (61) 1219 (52) 542 (99)

In this application, we aim to estimate the causal risk ratio of mortality for patients with CVD versus those without CVD as our main interest in the application study. However, in such observational studies, we cannot simply randomize the exposure status for patients, making it challenging to estimate the causal effect of CVD on COVID-19 mortality in the presence of confounding issues. Meanwhile, since CVD status is not fully observed, we also have to deal with the missing exposure problem, so we need to make assumptions about the missing exposure and confounding issues. From Table 9, we find a significant association between missingness and diabetes. Older COVID-19 patients with chronic diseases, such as diabetes, may be more likely to report whether they have CVD or not, so assuming MCAR in this study is not reasonable. Instead, we adopt the MAR assumption, which assumes that the missing indicator for CVD is conditionally independent of CVD status itself, given other clinical variables. In addition, we assume SITA assumption holds, i.e. the potential outcomes are conditionally independent of the CVD given all pre-measured covariates. After that, we intend to apply MICE to impute the missing values and IPW-based estimators to account for confounding issues.

4.2 Model Selection

Since the effect of each clinical variable on CVD status or the outcome may be different, properly selecting imputation and PS models is essential to obtain a valid estimation of the causal effect. However, a DAG in this case is not readily available if we lack sufficient background information in the application. Previous clinical papers show that age and sex are important risk factors for death 25, 26. In addition, many clinical studies find that diabetes and age can affect the risk of CVD 27. Based on the prior knowledge, we can draw a hypothetical causal diagram in Figure 4.

Refer to caption
Figure 4: A hypothetical causal diagram for application study. The black arrows refer to the causal relationship among confounders, the exposure, and the outcome. The double-sided dash arrows refer to the associational relationship among the missing indicator, the outcome, and the covariates. The red arrow is the effect of primary interest. The dashed short line refers to no association between the missing indicator and the exposure given covariates and the outcome due to MAR assumption.

For simplicity, we rename the age as 𝑿𝟏subscript𝑿1\bm{X_{1}}bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT, diabetes as X2subscript𝑋2X_{2}italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, sex as X3subscript𝑋3X_{3}italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, CVD as the exposure A𝐴Aitalic_A, and death as the outcome Y𝑌Yitalic_Y. Then, following the same model setup in the simulation study described in Section 2.3, we consider five different imputations and four different PS models as candidate models.

Table 10: Results of model selection for COVID-19 data using IPW and DR methods
Imp Model PS Model IPW Estimate DR Estimate Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT Out BIC ASMD KS Rank Score
A𝑿𝟏+X2similar-to𝐴subscript𝑿1subscript𝑋2A\sim\bm{X_{1}}+X_{2}italic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT AimpX1similar-tosuperscript𝐴impsubscript𝑋1A^{\text{imp}}\sim X_{1}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT 0.945 0.986 0.607 3790.876 0.050 0.292 7.500
A𝑿𝟏+X2similar-to𝐴subscript𝑿1subscript𝑋2A\sim\bm{X_{1}}+X_{2}italic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT Aimp𝑿𝟏+X2similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋2A^{\text{imp}}\sim\bm{X_{1}}+X_{2}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT 0.961 0.988 0.607 3796.287 0.006 0.290 12.000
A𝑿𝟏+X2similar-to𝐴subscript𝑿1subscript𝑋2A\sim\bm{X_{1}}+X_{2}italic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT Aimp𝑿𝟏+X3similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋3A^{\text{imp}}\sim\bm{X_{1}}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 0.950 0.991 0.607 3789.885 0.048 0.292 5.000
A𝑿𝟏+X2similar-to𝐴subscript𝑿1subscript𝑋2A\sim\bm{X_{1}}+X_{2}italic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT Aimp𝑿𝟏+X2+X3similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋2subscript𝑋3A^{\text{imp}}\sim\bm{X_{1}}+X_{2}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 0.963 0.989 0.607 3795.815 0.008 0.290 9.500
\hdashlineA𝑿𝟏+X3similar-to𝐴subscript𝑿1subscript𝑋3A\sim\bm{X_{1}}+X_{3}italic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT AimpX1similar-tosuperscript𝐴impsubscript𝑋1A^{\text{imp}}\sim X_{1}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT 0.946 0.983 0.581 3790.743 0.042 0.267 12.000
A𝑿𝟏+X3similar-to𝐴subscript𝑿1subscript𝑋3A\sim\bm{X_{1}}+X_{3}italic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT Aimp𝑿𝟏+X2similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋2A^{\text{imp}}\sim\bm{X_{1}}+X_{2}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT 0.960 0.984 0.581 3796.138 0.006 0.264 16.500
A𝑿𝟏+X3similar-to𝐴subscript𝑿1subscript𝑋3A\sim\bm{X_{1}}+X_{3}italic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT Aimp𝑿𝟏+X3similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋3A^{\text{imp}}\sim\bm{X_{1}}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 0.951 0.988 0.581 3789.763 0.040 0.267 9.500
A𝑿𝟏+X3similar-to𝐴subscript𝑿1subscript𝑋3A\sim\bm{X_{1}}+X_{3}italic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT Aimp𝑿𝟏+X2+X3similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋2subscript𝑋3A^{\text{imp}}\sim\bm{X_{1}}+X_{2}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 0.962 0.986 0.581 3795.679 0.007 0.264 14.500
\hdashlineA𝑿𝟏+X2+X3similar-to𝐴subscript𝑿1subscript𝑋2subscript𝑋3A\sim\bm{X_{1}}+X_{2}+X_{3}italic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT AimpX1similar-tosuperscript𝐴impsubscript𝑋1A^{\text{imp}}\sim X_{1}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT 0.942 0.984 0.606 3790.866 0.050 0.294 11.000
A𝑿𝟏+X2+X3similar-to𝐴subscript𝑿1subscript𝑋2subscript𝑋3A\sim\bm{X_{1}}+X_{2}+X_{3}italic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT Aimp𝑿𝟏+X2similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋2A^{\text{imp}}\sim\bm{X_{1}}+X_{2}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT 0.958 0.985 0.606 3796.309 0.006 0.291 16.500
A𝑿𝟏+X2+X3similar-to𝐴subscript𝑿1subscript𝑋2subscript𝑋3A\sim\bm{X_{1}}+X_{2}+X_{3}italic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT Aimp𝑿𝟏+X3similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋3A^{\text{imp}}\sim\bm{X_{1}}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 0.947 0.990 0.606 3789.882 0.049 0.294 8.500
A𝑿𝟏+X2+X3similar-to𝐴subscript𝑿1subscript𝑋2subscript𝑋3A\sim\bm{X_{1}}+X_{2}+X_{3}italic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT Aimp𝑿𝟏+X2+X3similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋2subscript𝑋3A^{\text{imp}}\sim\bm{X_{1}}+X_{2}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 0.960 0.986 0.606 3795.837 0.008 0.291 14.000
\hdashlineA𝑿𝟏+X2+Ysimilar-to𝐴subscript𝑿1subscript𝑋2𝑌A\sim\bm{X_{1}}+X_{2}+Yitalic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_Y AimpX1similar-tosuperscript𝐴impsubscript𝑋1A^{\text{imp}}\sim X_{1}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT 0.942 0.985 0.606 3790.846 0.050 0.293 8.500
A𝑿𝟏+X2+Ysimilar-to𝐴subscript𝑿1subscript𝑋2𝑌A\sim\bm{X_{1}}+X_{2}+Yitalic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_Y Aimp𝑿𝟏+X2similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋2A^{\text{imp}}\sim\bm{X_{1}}+X_{2}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT 0.959 0.986 0.606 3796.272 0.006 0.291 13.500
A𝑿𝟏+X2+Ysimilar-to𝐴subscript𝑿1subscript𝑋2𝑌A\sim\bm{X_{1}}+X_{2}+Yitalic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_Y Aimp𝑿𝟏+X3similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋3A^{\text{imp}}\sim\bm{X_{1}}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 0.947 0.990 0.606 3789.852 0.049 0.293 6.000
A𝑿𝟏+X2+Ysimilar-to𝐴subscript𝑿1subscript𝑋2𝑌A\sim\bm{X_{1}}+X_{2}+Yitalic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_Y Aimp𝑿𝟏+X2+X3similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋2subscript𝑋3A^{\text{imp}}\sim\bm{X_{1}}+X_{2}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 0.961 0.987 0.606 3795.797 0.008 0.291 11.000
\hdashlineA𝑿𝟏+X2+X3+Ysimilar-to𝐴subscript𝑿1subscript𝑋2subscript𝑋3𝑌A\sim\bm{X_{1}}+X_{2}+X_{3}+Yitalic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + italic_Y AimpX1similar-tosuperscript𝐴impsubscript𝑋1A^{\text{imp}}\sim X_{1}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT 0.946 0.986 0.608 3790.742 0.048 0.289 3.500
A𝑿𝟏+X2+X3+Ysimilar-to𝐴subscript𝑿1subscript𝑋2subscript𝑋3𝑌A\sim\bm{X_{1}}+X_{2}+X_{3}+Yitalic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + italic_Y Aimp𝑿𝟏+X2similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋2A^{\text{imp}}\sim\bm{X_{1}}+X_{2}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT 0.960 0.987 0.608 3796.144 0.006 0.287 9.000
A𝑿𝟏+X2+X3+Ysimilar-to𝐴subscript𝑿1subscript𝑋2subscript𝑋3𝑌A\sim\bm{X_{1}}+X_{2}+X_{3}+Yitalic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + italic_Y Aimp𝑿𝟏+X3similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋3A^{\text{imp}}\sim\bm{X_{1}}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 0.950 0.991 0.608 3789.735 0.048 0.289 1.000
A𝑿𝟏+X2+X3+Ysimilar-to𝐴subscript𝑿1subscript𝑋2subscript𝑋3𝑌A\sim\bm{X_{1}}+X_{2}+X_{3}+Yitalic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + italic_Y Aimp𝑿𝟏+X2+X3similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋2subscript𝑋3A^{\text{imp}}\sim\bm{X_{1}}+X_{2}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 0.962 0.988 0.608 3795.661 0.008 0.287 6.000

Without assuming the hypothetical DAG is correct, we apply the rank score to select the best imputation and PS models among the candidates. Table 10 reports the values for each criterion and based on rank score, the best imputation model includes “age", “sex", “diabetes", and “death" as the predictors and the best PS model selects “age", “sex", and “diabetes" as the predictors. We find that the above selection result leads to the smallest rank score, which is consistent with the simulation results. In addition, if we do not know DAG in Figure 4 is correct or not, using rank score can help us select the full imputation model plus the outcome-related PS model as the best choice.

4.3 Estimation Results

After the model selection, we obtain the estimated causal effect of CVD on mortality using IPW and DR estimators. We also provide confidence intervals (CI) after running B=2000𝐵2000B=2000italic_B = 2000 bootstrap resamples, as shown in Table 11. We compare two different types of CI including a CI based on bootstrap standard error (BSE) with a normal approximation, called “CI Normal", and another CI based on bootstrap percentiles (between 2.5% and 97.5% percentiles of bootstrap point estimates), called “CI Percentile" 28.

The estimated causal risk ratio of CVD on the mortality for COVID-19 patients is 0.95 using IPW method with a 95% CI of [0.867, 1.052] CI Percentile. Using DR method, we also obtain a similar result as IPW method.

Although some researchers may think that patients with CVD status increase the risk of death from the common sense, our analysis does not support this assumption. Since CI covers the value of 1, we would expect that the statistical test for whether the ratio of patients with CVD is higher than those without CVD is not significant after adjusting for missing data and confounding factors. One possible explanation for this finding is due to the confounding issue, i.e. age is strongly associated with both CVD status and death. Additionally, researchers may have ignored the effect of missing exposure on the estimated results. These results align with recent clinical research 29, 30, which suggests that those CVD risk factors, such as age, rather than CVD status itself, are the primary contributors to mortality.

One of the limitations of the application study is that we only include four clinical variables. Since we lack enough observations of other variables in the real dataset, we may miss some confounders in the analysis. In addition, the model selection and estimated results are only based on the samples in Brazil instead of the whole population in the world. These issues can be addressed if we have individual-level health data with better quality.

Table 11: Estimated results using IPW and DR estimators for COVID-19 data
Method Selected Imputation, PS Models Estimate (risk ratio) BSE CI Normal CI Percentile
IPW A𝑿𝟏+X2+X3+Ysimilar-to𝐴subscript𝑿1subscript𝑋2subscript𝑋3𝑌A\sim\bm{X_{1}}+X_{2}+X_{3}+Yitalic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + italic_Y, Aimp𝑿𝟏+X3similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋3A^{\text{imp}}\sim\bm{X_{1}}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 0.950 0.048 (0.856,1.044) (0.867,1.052)
DR A𝑿𝟏+X2+X3+Ysimilar-to𝐴subscript𝑿1subscript𝑋2subscript𝑋3𝑌A\sim\bm{X_{1}}+X_{2}+X_{3}+Yitalic_A ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + italic_Y, Aimp𝑿𝟏+X3similar-tosuperscript𝐴impsubscript𝑿1subscript𝑋3A^{\text{imp}}\sim\bm{X_{1}}+X_{3}italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ∼ bold_italic_X start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 0.991 0.050 (0.897,1.085) (0.905,1.098)
  • Estimate: causal effect (as the risk ratio) of CVD on mortality. BSE: bootstrap standard errors. CI-Normal: 95% bootstrap normal confidence interval using BSE. CI-Percentile: 95% bootstrap percentile confidence interval.

5 Discussion

Model selection on both imputation and PS models when the exposure is missing at random (MAR) is a challenging problem, and few studies discuss the appropriate criterion for selecting these models in practice. In this paper, our simulation studies investigate the effect of different imputation and PS models on the estimated causal effect. Unexpectedly, we find that even though in the simulation setup the missing mechanism does not depend on the outcome, selecting the outcome and outcome-related variables for both the imputation and PS models can improve the accuracy of the imputed exposure and reduce the RMSE of the estimated causal effect. In addition, we propose “rank score" as a rank-based and unit-free criterion to evaluate the performance of both imputation and PS models, which shows a higher correlation with RMSE in simulation studies.

The limitation of the rank score is that it requires ranking both Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT and BIC among all candidate models, which may be affected by the pool of candidate models. If the number of covariates in the dataset is small, we can recommend fitting all possible combinations of models. On the other hand, when the number of candidate models is very large, ranking all candidate models becomes time-consuming. To address this issue, researchers can perform model selection sequentially, called “stepwise" selection, i.e., we use the Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT to select the best imputation model, and then we select PS model through BIC given that imputation model. However, such a sequential strategy is not ideal in some cases because the selection of PS model is only conditioned on the selected imputation model from the first step, which may not lead to the overall best models with the smallest RMSE. On the other side, the rank score is similar to the “best subset" strategy because both accuracy and BIC contribute equally to the rank score, ensuring a balance between the rankings of imputation and PS models.

Further research can expand the rank score into multiple areas such as MAR on both exposure and outcome. Besides that, one can also consider other missing mechanisms such as MNAR. It may be interesting for researchers to incorporate missing exposure into other complex settings, such as longitudinal or survival data in future studies.

6 Acknowledgements

We appreciate IntegraSUS 111https://integrasus.saude.ce.gov.br/ for publicly sharing the COVID-19 data in Brazil. We also appreciate Wei Liang for his great suggestions on the Appendix.

References

  • 1 Rubin DB. Inference and missing data. Biometrika 1976; 63(3): 581–592.
  • 2 Kennedy EH. Efficient nonparametric causal inference with missing exposure information. The international journal of biostatistics 2020; 16(1): 20190087.
  • 3 Kang JD, Schafer JL, others . Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 2007; 22(4): 523–539.
  • 4 Williamson EJ, Forbes A, Wolfe R. Doubly robust estimators of causal exposure effects with missing data in the outcome, exposure or a confounder. Statistics in medicine 2012; 31(30): 4382–4400.
  • 5 Zhang Z, Liu W, Zhang B, Tang L, Zhang J. Causal inference with missing exposure information: Methods and applications to an obstetric study. Statistical Methods in Medical Research 2016; 25(5): 2053–2066.
  • 6 Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70(1): 41–55.
  • 7 Robins J, Sued M, Lei-Gomez Q, Rotnitzky A. Comment: Performance of double-robust estimators when" inverse probability" weights are highly variable. Statistical Science 2007; 22(4): 544–559.
  • 8 Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research 2013; 22(3): 278–295.
  • 9 Donders ART, Van Der Heijden GJ, Stijnen T, Moons KG. A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology 2006; 59(10): 1087–1091.
  • 10 Buuren vS, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 1999; 18(6): 681–694.
  • 11 van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 2011; 45(3): 1-67. doi: 10.18637/jss.v045.i03
  • 12 Rubin DB. Multiple imputation for nonresponse in surveys. 81. John Wiley & Sons . 2004.
  • 13 Neyman JS. On the application of probability theory to agricultural experiments. essay on principles. section 9.(tlanslated and edited by dm dabrowska and tp speed, statistical science (1990), 5, 465-480). Annals of Agricultural Sciences 1923; 10: 1–51.
  • 14 Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association 1994; 89(427): 846–866.
  • 15 Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics 2005; 61(4): 962–973.
  • 16 Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Stürmer T. Variable selection for propensity score models. American journal of epidemiology 2006; 163(12): 1149–1156.
  • 17 Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70(1): 41–55.
  • 18 Glymour M, Pearl J, Jewell NP. Causal inference in statistics: A primer. John Wiley & Sons . 2016.
  • 19 Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 1952; 47(260): 663–685.
  • 20 Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association 1999; 94(448): 1096–1120.
  • 21 Tanner-Smith EE, Lipsey MW. Identifying baseline covariates for use in propensity scores: A novel approach illustrated for a nonrandomized study of recovery high schools. Peabody journal of education 2014; 89(2): 183–196.
  • 22 Rubin DB. On principles for modeling propensity scores in medical research.. Pharmacoepidemiology and drug safety 2004; 13(12): 855–857.
  • 23 Westreich D, Cole SR, Funk MJ, Brookhart MA, Stürmer T. The role of the c-statistic in variable selection for propensity score models. Pharmacoepidemiology and drug safety 2011; 20(3): 317–320.
  • 24 Kassir R. Risk of COVID-19 for patients with obesity. Obesity Reviews 2020; 21(6).
  • 25 Dana PM, Sadoughi F, Hallajzadeh J, et al. An insight into the sex differences in COVID-19 patients: what are the possible causes?. Prehospital and disaster medicine 2020; 35(4): 438–441.
  • 26 Abdi A, Jalilian M, Sarbarzeh PA, Vlaisavljevic Z. Diabetes and COVID-19: A systematic review on the current evidences. Diabetes research and clinical practice 2020; 166: 108347.
  • 27 Bertoluci MC, Rocha VZ. Cardiovascular risk assessment in patients with diabetes. Diabetology & metabolic syndrome 2017; 9: 1–13.
  • 28 Tibshirani RJ, Efron B. An introduction to the bootstrap. Monographs on statistics and applied probability 1993; 57: 1–436.
  • 29 Di Castelnuovo A, Bonaccio M, Costanzo S, et al. Common cardiovascular risk factors and in-hospital mortality in 3,894 patients with COVID-19: survival analysis and machine learning-based findings from the multicentre Italian CORIST Study. Nutrition, Metabolism and Cardiovascular Diseases 2020; 30(11): 1899–1913.
  • 30 Vasbinder A, Meloche C, Azam TU, et al. Relationship between preexisting cardiovascular disease and death and cardiovascular outcomes in critically ill patients with COVID-19. Circulation: Cardiovascular Quality and Outcomes 2022; 15(10): e008942.

7 Appendix

A.7.1 Proof of Weighted Accuracy

The key idea of weighted accuracy is to create a pseudo sample after weighting on those observed individuals so that the weighted accuracy on the observed dataset can represent the original accuracy on the whole dataset. Based on MAR assumption, if the missingness model is correct, under the true value of wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, we can prove that,

Accuracy(w)(Aimp)ppsuperscriptAccuracy𝑤superscript𝐴impabsent\displaystyle\text{$\text{Accuracy}^{(w)}$}(A^{\text{imp}})\xrightarrow[]{% \text{p}}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT ( italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ) start_ARROW overp → end_ARROW E[1Ri1wi𝟙(Ai=Aimp)]E(δi)E(δi)𝐸delimited-[]1subscript𝑅𝑖1subscript𝑤𝑖1subscript𝐴𝑖superscript𝐴imp𝐸subscript𝛿𝑖𝐸subscript𝛿𝑖\displaystyle E\left[\frac{1-R_{i}}{1-w_{i}}\mathbbm{1}(A_{i}=A^{\text{imp}})% \right]\frac{E(\delta_{i})}{E(\delta_{i})}italic_E [ divide start_ARG 1 - italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG 1 - italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG blackboard_1 ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ) ] divide start_ARG italic_E ( italic_δ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG start_ARG italic_E ( italic_δ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG (A.7)
=E{E[1Ri|𝑿,Y]1wiE[𝟙(Ai=Aimp)|𝑿,Y]}absent𝐸𝐸delimited-[]1conditionalsubscript𝑅𝑖𝑿𝑌1subscript𝑤𝑖𝐸delimited-[]conditional1subscript𝐴𝑖superscript𝐴imp𝑿𝑌\displaystyle=E\left\{\frac{E[1-R_{i}|\bm{X},Y]}{1-w_{i}}E[\mathbbm{1}(A_{i}=A% ^{\text{imp}})|\bm{X},Y]\right\}= italic_E { divide start_ARG italic_E [ 1 - italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_italic_X , italic_Y ] end_ARG start_ARG 1 - italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG italic_E [ blackboard_1 ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ) | bold_italic_X , italic_Y ] }
=E{E[𝟙(Ai=Aimp)|𝑿,Y]}=P(Ai=Aimp).absent𝐸𝐸delimited-[]conditional1subscript𝐴𝑖superscript𝐴imp𝑿𝑌𝑃subscript𝐴𝑖superscript𝐴imp\displaystyle=E\left\{E[\mathbbm{1}(A_{i}=A^{\text{imp}})|\bm{X},Y]\right\}=P(% A_{i}=A^{\text{imp}}).= italic_E { italic_E [ blackboard_1 ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ) | bold_italic_X , italic_Y ] } = italic_P ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ) .

Here the first equality holds because δisubscript𝛿𝑖\delta_{i}italic_δ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is an indicator of whether the individual is randomly selected into the testing data or not, which is free of (𝑿,Y)𝑿𝑌(\bm{X},Y)( bold_italic_X , italic_Y ). Its expectation is an arbitrarily chosen ratio, which can also be canceled. The second equality holds because after taking double expectation with the condition of (𝑿,Y)𝑿𝑌(\bm{X},Y)( bold_italic_X , italic_Y ), we know AiRi|𝑿,Ybottomsubscript𝐴𝑖conditionalsubscript𝑅𝑖𝑿𝑌A_{i}\bot R_{i}|\bm{X},Yitalic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⊥ italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_italic_X , italic_Y based on the MAR assumption. The third equality holds because the full missingness model is correctly specified, so we know E(Ri|𝑿,Y)=wi𝐸conditionalsubscript𝑅𝑖𝑿𝑌subscript𝑤𝑖E(R_{i}|\bm{X},Y)=w_{i}italic_E ( italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_italic_X , italic_Y ) = italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. After taking the double expectation again, we know Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT will be converged to the true accuracy on the original data as n𝑛n\to\inftyitalic_n → ∞.

Notice that Equation A.7 also describes why we cannot only use the naive way to estimate the accuracy on the observed data, written as: Naive-Accuracy=i=1n(1Ri)𝟙(Ai=Aimp)Naive-Accuracysuperscriptsubscript𝑖1𝑛1subscript𝑅𝑖1subscript𝐴𝑖superscript𝐴imp\text{Naive-Accuracy}=\sum_{i=1}^{n}(1-R_{i})\mathbbm{1}(A_{i}=A^{\text{imp}})Naive-Accuracy = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( 1 - italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) blackboard_1 ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_A start_POSTSUPERSCRIPT imp end_POSTSUPERSCRIPT ). In such a way, the missing indicator will largely affect the naive estimator, which cannot be canceled without the weights wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Therefore, to properly evaluate the performance of the imputation model, we need to adopt the inverse weights of the missingness on the estimated accuracy of the observed data.

In the simulation studies, since we know the true missing values, we can also impute all missing values on the whole data via MICE and calculate the standard accuracy, called the “Benchmark-Accuracy" on the original data. In such a way, the bias between Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT and Benchmark-Accuracy is relatively small, which supports the statement that the weighted accuracy will converge to the true accuracy as n𝑛n\to\inftyitalic_n → ∞. However, we will not know the true accuracy in the application due to the unknown missing data, so Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT will be a useful tool for researchers to evaluate the selected imputation model.

Additionally, IPW method for accuracy can also be easily extended into other criteria, such as sensitivity or specificity. We only need to modify the indicator function in Equation A.7 to represent the corresponding criterion of interest. Then, following the same steps, we can obtain the weighted criterion, which can evaluate the performance of the imputation model only based on the observed data.

A.7.2 Estimate ACE for Continuous Outcome

In this subsection, we provide the selection and estimated results when we intend to estimate the average causal effect (ACE), defined as τ=τ1τ0𝜏subscript𝜏1subscript𝜏0\tau=\tau_{1}-\tau_{0}italic_τ = italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT for continuous outcome Y𝑌Yitalic_Y. The other setup for the simulation studies are same as before.

Similarly, Figure A.5-A.6 show that the full imputation model plus the outcome-related PS model leads to the smallest RMSE among other candidate models. In contrast, the outcome-related imputation model plus the exposure-related PS model leads to the largest RMSE, which is the worst model. Additionally, the correlation of the rank score with RMSE is higher than 0.8 and the rank score is the smallest value for the full imputation plus the outcome-related PS model. Even though the correlation between weighted accuracy and RMSE is also higher than 0.8 in this case, we cannot select the best PS model if we only use weighted accuracy. Compared with other traditional criteria, the rank score combines the performance of both models, so it can successfully pick up the best models together.

Refer to caption
Figure A.5: Boxplot of estimated values for IPW estimators in the simulation studies for the continuous outcome to estimate ACE when n=500𝑛500n=500italic_n = 500. The x-axis refers to which imputation and PS models are used. Here, Exp, Out, Res, and Covs, refer to the exposure-related, outcome-related, response-included, and covariates-included models, respectively. The red line is a true causal effect.
Refer to caption
Figure A.6: Boxplot of estimated values for DR estimators in the simulation studies for the continuous outcome to estimate ACE when n=500𝑛500n=500italic_n = 500. The x-axis refers to which imputation and PS models are used. Here, Exp, Out, Res, and Covs, refer to the exposure-related, outcome-related, response-included, and covariates-included models, respectively. The red line is the true causal effect.
Table A.12: Performance of different models using IPW method when n=500𝑛500n=500italic_n = 500 for continuous outcome to estimate ACE
Imp, PS Models Bias Bias Rate ESE RMSE
Exp, naive -0.754 -37.708 0.186 0.777
Exp, exp -0.895 -44.766 0.203 0.918
Exp, out -0.757 -37.865 0.088 0.762
Exp, covs -0.898 -44.885 0.122 0.906
\hdashlineOut, naive -1.163 -58.158 0.366 1.219
Out, exp -1.277 -63.846 0.366 1.328
Out, out -0.910 -45.479 0.082 0.913
Out, covs -1.005 -50.270 0.090 1.009
\hdashlineCovs, naive -0.752 -37.589 0.358 0.833
Covs, exp -0.889 -44.452 0.419 0.983
Covs, out -0.758 -37.915 0.088 0.763
Covs, covs -0.897 -44.874 0.123 0.906
\hdashlineRes, naive 0.428 21.390 0.299 0.522
Res, exp 0.548 27.392 0.362 0.656
Res, out -0.466 -23.292 0.113 0.479
Res, covs -0.538 -26.893 0.195 0.572
\hdashlineFull, naive -0.001 -0.033 0.298 0.298
Full, exp 0.000 0.012 0.361 0.361
Full, out -0.018 -0.922 0.101 0.103
Full, covs -0.019 -0.935 0.141 0.143
  • Here, Exp, Out, Res, and Covs, refer to the exposure-related, outcome-related, response-included, and covariates-included models, respectively.

Table A.13: Performance of different models using DR method when n=500𝑛500n=500italic_n = 500 for continuous outcome to estimate ACE
Imp, PS Models Bias Bias Rate ESE RMSE
Exp, naive -0.754 -37.718 0.186 0.777
Exp, exp -0.897 -44.845 0.204 0.920
Exp, out -0.757 -37.831 0.088 0.762
Exp, covs -0.901 -45.051 0.097 0.906
\hdashlineOut, naive -1.163 -58.174 0.366 1.220
Out, exp -1.271 -63.529 0.367 1.323
Out, out -0.903 -45.130 0.082 0.906
Out, covs -1.043 -52.127 0.087 1.046
\hdashlineCovs, naive -0.752 -37.599 0.358 0.833
Covs, exp -0.890 -44.476 0.422 0.984
Covs, out -0.756 -37.817 0.088 0.761
Covs, covs -0.901 -45.043 0.098 0.906
\hdashlineRes, naive 0.428 21.389 0.299 0.522
Res, exp 0.551 27.542 0.363 0.660
Res, out -0.492 -24.621 0.103 0.503
Res, covs -0.569 -28.453 0.133 0.584
\hdashlineFull, naive -0.001 -0.036 0.297 0.297
Full, exp 0.003 0.130 0.363 0.363
Full, out -0.018 -0.878 0.101 0.102
Full, covs -0.019 -0.932 0.117 0.119
  • Here, Exp, Out, Res, and Covs, refer to the exposure-related, outcome-related, response-included, and covariates-included models, respectively.

Table A.14: Criterion values for different models using IPW method when n=500𝑛500n=500italic_n = 500 for continuous outcome to estimate ACE
Imp, PS Models Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT Out BIC ASMD KS ABIC Rank Score Rank(RMSE)
Exp, naive 0.592 2470.074 0.161 0.130 1.7032 13.364 11
Exp, exp 0.592 2473.226 0.022 0.130 1.7062 13.946 16
Exp, out 0.592 1680.469 0.149 0.130 0.9350 8.879 9
Exp, covs 0.592 1676.146 0.014 0.130 0.9308 8.284 14
\hdashlineOut, naive 0.512 2481.170 0.102 0.148 1.9921 18.033 19
Out, exp 0.512 2479.270 0.031 0.148 1.9903 17.950 20
Out, out 0.512 1707.195 0.075 0.148 1.2392 13.368 15
Out, covs 0.512 1684.670 0.008 0.148 1.2172 12.552 18
\hdashlineCovs, naive 0.591 2467.830 0.170 0.136 1.7068 13.376 12
Covs, exp 0.591 2470.496 0.033 0.136 1.7094 13.835 17
Covs, out 0.591 1680.727 0.148 0.136 0.9410 9.018 10
Covs, covs 0.591 1676.198 0.014 0.136 0.9366 8.403 13
\hdashlineRes, naive 0.639 2396.528 0.219 0.153 1.4664 8.541 6
Res, exp 0.639 2400.459 0.077 0.153 1.4703 8.949 8
Res, out 0.639 1627.598 0.156 0.152 0.7184 4.648 5
Res, covs 0.639 1628.720 0.018 0.152 0.7195 4.836 7
\hdashlineFull, naive 0.794 2429.053 0.167 0.130 0.9480 7.019 3
Full, exp 0.794 2433.731 0.028 0.130 0.9526 7.489 4
Full, out 0.794 1454.705 0.149 0.130 0.0004 1.024 1
Full, covs 0.794 1459.247 0.014 0.130 0.0049 1.488 2
  • Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT: the weighted accuracy on the observed data. Out BIC is BIC of the outcome model. ASMD: absolute standardized mean difference. KS: Kolmogorov-Smirnov distance for (X1,X2,X3)subscript𝑋1subscript𝑋2subscript𝑋3(X_{1},X_{2},X_{3})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ). ABIC is the rescaled form of Accuracy and BIC. Rank Score: average ranks of 1Accuracy(w)1superscriptAccuracy𝑤1-\text{Accuracy}^{(w)}1 - Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT and BIC.

Table A.15: Criterion values for different models using DR method when n=500𝑛500n=500italic_n = 500 for continuous outcome to estimate ACE
Imp, PS Models Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT Out BIC ASMD KS ABIC Rank Score Rank(RMSE)
Exp, naive 0.592 2470.074 0.161 0.130 1.7032 13.364 11
Exp, exp 0.592 2473.226 0.022 0.130 1.7062 13.946 16
Exp, out 0.592 1680.469 0.149 0.130 0.9350 8.879 10
Exp, covs 0.592 1676.146 0.014 0.130 0.9308 8.284 14
\hdashlineOut, naive 0.512 2481.170 0.102 0.148 1.9921 18.033 19
Out, exp 0.512 2479.270 0.031 0.148 1.9903 17.950 20
Out, out 0.512 1707.195 0.075 0.148 1.2392 13.368 15
Out, covs 0.512 1684.670 0.008 0.148 1.2172 12.552 18
\hdashlineCovs, naive 0.591 2467.830 0.170 0.136 1.7068 13.376 12
Covs, exp 0.591 2470.496 0.033 0.136 1.7094 13.835 17
Covs, out 0.591 1680.727 0.148 0.136 0.9410 9.018 9
Covs, covs 0.591 1676.198 0.014 0.136 0.9366 8.403 13
\hdashlineRes, naive 0.639 2396.528 0.219 0.153 1.4664 8.541 6
Res, exp 0.639 2400.459 0.077 0.153 1.4703 8.949 8
Res, out 0.639 1627.598 0.156 0.152 0.7184 4.648 5
Res, covs 0.639 1628.720 0.018 0.152 0.7195 4.836 7
\hdashlineFull, naive 0.794 2429.053 0.167 0.130 0.9480 7.019 3
Full, exp 0.794 2433.731 0.028 0.130 0.9526 7.489 4
Full, out 0.794 1454.705 0.149 0.130 0.0004 1.024 1
Full, covs 0.794 1459.247 0.014 0.130 0.0049 1.488 2
  • Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT: the weighted accuracy on the observed data. Out BIC is BIC of the outcome model. ASMD: absolute standardized mean difference. KS: Kolmogorov-Smirnov distance for (X1,X2,X3)subscript𝑋1subscript𝑋2subscript𝑋3(X_{1},X_{2},X_{3})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ). ABIC is the rescaled form of Accuracy and BIC. Rank Score: average ranks of 1Accuracy(w)1superscriptAccuracy𝑤1-\text{Accuracy}^{(w)}1 - Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT and BIC.

Table A.16: Spearman correlation between criteria and RMSE when n=500𝑛500n=500italic_n = 500 for continuous outcome to estimate ACE
Criterion IPW Method DR Method
1-Accuracy(w)superscriptAccuracy𝑤\text{Accuracy}^{(w)}Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT 0.867 0.867
ASMD -0.339 -0.339
KS 0.203 0.201
Out BIC 0.583 0.583
\hdashlineABIC 0.730 0.730
Rank Score 0.854 0.854
  • Here, ASMD: absolute standardized mean difference for (X1,X2,X3)subscript𝑋1subscript𝑋2subscript𝑋3(X_{1},X_{2},X_{3})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ). KS: Kolmogorov-Smirnov distance for (X1,X2,X3)subscript𝑋1subscript𝑋2subscript𝑋3(X_{1},X_{2},X_{3})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ). ABIC is the rescaled form of Accuracy and BIC. Rank Score: averaged ranks of 1Accuracy(w)1superscriptAccuracy𝑤1-\text{Accuracy}^{(w)}1 - Accuracy start_POSTSUPERSCRIPT ( italic_w ) end_POSTSUPERSCRIPT and BIC.