DOI: https://doi.org/10.1145/3178876.3186151
WWW '18: Proceedings of The Web Conference 2018, Lyon, France, April 2018
Scientific and business practices are increasingly resulting in large collections of randomized experiments. Analyzed together multiple experiments can tell us things that individual experiments cannot. We study how to learn causal relationships between variables from the kinds of collections faced by data scientists: experiments are numerous, many experiments have very small effects, and the analyst lacks metadata (e.g., descriptions of the interventions). We use experimental groups as instrumental variables (IV) and show that a standard method (two-stage least squares) is biased even when the number of experiments is infinite. We show how a sparsity-inducing l 0 regularization can (in a reversal of the standard bias–variance tradeoff) reduce bias (and thus error) of interventional predictions. We are interested in estimating causal effects, rather than just predicting outcomes, so we also propose a modified cross-validation procedure (IVCV) to feasibly select the regularization parameter. We show, using a trick from Monte Carlo sampling, that IVCV can be done using summary statistics instead of raw data. This makes our full procedure simple to use in many real-world applications.
CCS Concepts: • General and reference → Experimentation; • Mathematics of computing → Probability and statistics; • Computing methodologies → Machine learning;
ACM Reference Format:
Alexander Peysakhovich and Dean Eckles. 2018. Learning Causal Effects From Many Randomized Experiments Using Regularized Instrumental Variables. In WWW 2018: The 2018 Web Conference, April 23–27, 2018, Lyon, France. ACM, New York, NY, USA 9 Pages. https://doi.org/10.1145/3178876.3186151
Randomized experiments (i.e. A/B tests, randomized controlled trials) are a popular practice in medicine, business, and public policy [6, 25]. When decision-makers employ experimentation they have a far greater chance of learning true causal relationships and making good decisions than via observation alone [21, 26, 29]. However, a single experiment is often insufficient to learn about the causal mechanisms linking multiple variables. Learning such multivariate causal structures is important for both theory building and making decisions [17, 22].
Consider the situation of a internet service for watching videos. The firm is interested in how watching different types of videos (e.g., funny vs. serious, short vs. long) affects user behaviors (e.g., by increasing time spent on the site, inducing subscriptions, etc.). Such knowledge will inform decisions about content recommendation or content acquisition. Even though the firm can measure all relevant variables, training a model on observational data will likely be misleading. Existing content recommendation systems and heterogeneous user dispositions will produce strong correlations between exposure and time spent or subscription, but the magnitude of this correlation will, in general, not match what would occur if the decision-maker intervened and changed the promotion or availability of various video types. Thus, we are interested not just in prediction but prediction under intervention [9, 10, 31].
The standard solution is to run a randomized experiment exposing some users to more of some type of video. However, a single test will likely change many things in the complex system. It is hard to change the number of views of funny videos without affecting the number of views of serious videos or short videos. This is sometimes called the problem of ‘fat hand’ interventions because such interventions touch multiple causal variables at once and so the effect on a single variable is not identified. To solve this issue the company would need to experiment with several factors simultaneously, perhaps conducting new experiments specifically to measure effects via each mechanism [22].
However, because routine product experimentation is common in internet companies [5, 25, 39], this firm has likely already run many A/B tests, including on the video recommendation algorithm. The method proposed in this paper can either be applied to a new set of experiments run explicitly to learn a causal effect vector [as in, e.g., 13], or can be applied to repurpose already run tests by treating them as random perturbations injected into the system and using that randomness in a smart way.
Our contributions arise from adapting the econometric method of instrumental variables [IV; 41, 33, 1] to this setting. It is well known that a standard IV estimator — two-stage least squares (TSLS) — is biased in finite samples [3, 36]. For our case, it also has asymptotic bias. We show that this bias depends on the distribution of the treatment effects in the set of experiments under consideration.
Our main technical contribution is to introduce a multivariate l 0 regularization into the first stage of the TSLS procedure and show that it can reduce the bias of estimated causal effects. Because in finite samples this regularization procedure reduces bias but adds variance, we introduce a method to trade these off and select a regularization parameter. We call this procedure instrumental variables cross-validation (IVCV). In an empirical evaluation that combines simulation and data from hundreds of real randomized experiments, we show that the l 0 regularization with IVCV outperforms TSLS and a Bayesian random effects model.
Finally, we show how to perform this estimation in a computationally and practically efficient way. Our regularization and cross-validation procedures only require summary statistics at the level of experimental groups. This is advantageous when using raw data is computationally or practically burdensome, e.g., in the case of internet companies. This means the computational and data storage complexities of the method are actually quite low. In addition, standard A/B testing platforms [5, 42] should already compute and store all the required statistics, so the method here can be thought of as an “upcycling” of existing statistics.
Suppose we have some (potentially vector valued) random variable X and a scalar valued outcome variable Y. We want to ask: what happens to Y if we change some component of X by one unit, holding the rest constant? Formally, we study a linear structural (i.e. data generating) equation pair
In general, we are interested in estimating the causal effect β because we are interested in intervention, e.g., one which will change our data-generating model to
In the presence of unobserved confounders trying to learn causal relationships using predictive models naively can lead us astray [9, 10, 31, 34]. Suppose that we have observational data of the form (X, Y) with U completely unobserved. If we use this data to estimate the causal effect β we can, due to the influence of the unobserved confounder, get an estimate that is (even in infinite samples) larger, smaller or even the opposite sign of the true causal effect β.
To see this consider the linear structure equation model above and suppose that we only observe (X, Y) where both are scalar. Since the underlying model is linear, we can try to estimate it using a linear regression. However, not including the confounder U in the regression yields the estimator:
When all variables are scalar algebra yields
Thus, the best linear predictor of Y given X ($\hat{\beta }_\text{obs}$ ) may not be lead to a good estimate of what would happen to Y if we intervened (β).
We now discuss instrumental variable (IV) estimator as a method for learning the causal effects. Suppose that we have some variable Z that has two properties (see Figure 1 for the directed acyclic graph which represents these assumptions):
In terms of the structural equations, this modifies the equation for X to be
The standard IV estimator for β is two-stage least squares (TSLS) and works off the principle that the variance in X can be broken down into two components. The first component is confounded with the true causal effect (i.e. comes from U). The second component, on the other hand, is independent of U. Thus, if we could regress Y only on the random component, we could recover the causal effect β. Knowing Z allows us to do nearly this (i.e. by using only the variation in X caused by Z not U).
TSLS can be thought of as follows: in the first stage we regress X on Z. We then replace X by the predicted values from the regression. In the second stage, we regress Y on these fitted values. It is straightforward to show that as n approaches infinity this estimator converges to the true causal effect β [40, Theorem 5.1].
All IV methods make a full rank assumption. In order to estimate the effect of each variable Xj on Y with the other X’s held constant it must be the case that Z is such that it causes independent variation in all dimensions of X. This implies that we must, at least, have as many instruments as the dimension of β for TSLS to work. An interesting and fruitful direction for future work is what to do when some subspace of X is well spanned by our instruments but some some subspace is not.
In our setting of interest, randomly assigned groups from a large collection of experiments are the instruments.
Formally, here the IV is a categorical variable indicating which of K test groups a unit (e.g., user) was assigned to in one of many experiments. For simplicity of notation, we assume that each treatment group g ∈ {1, ..., K} has exactly ng = n per units assigned to it at random.
The way to represent the first stage regression of the TSLS is to use the one-hot representation (or dummy-variable encoding) of the group which each unit is assigned to, such that Zi is a K-dimensional vector of 0s and a single 1 indicating the randomly assigned group.
In this setup the TSLS estimator has a very convenient form. The first stage regression of X on Z simply yields estimates that are group level means of X in each group. This means that if each group has the same number of units (e.g., users) and the same error variance, the second stage has a convenient form as well: we can recover β by simply regressing group level averages of X on Y [3, section 4.1.3].
Thus, to estimate causal effects from large meta-analyses practitioners do not need to retain or compute with the raw data (which can span millions or billions of rows in the context of A/B testing at a medium or large internet company), but rather can retain and compute with sample means of X and Y in each A/B test group (this is now just thousands of rows of data). These are quantities that are recorded already in many A/B testing systems [5, 42]. Working with summary statistics simplifies computation enormously and allows us to reuse existing pipelines and data.
There are now multiple ways to think about the asymptotic properties of this “groups as IVs” estimator. Either we increase the size of each experiment (n per → ∞) or we get more experiments (K → ∞). The former is the standard asymptotic sequence, but for meta-analysis of a growing collection of experiments, the latter is the more natural asymptotic series, so we fix n per but we raise K.
We fix ideas with the case where X, Y, Z, U are scalar. We denote the group level means of our variables with bars (e.g., $\bar{X}$ to be the random variable that is the group-level means of X). Recall that our TSLS is, in the group case, a regression of $\bar{Y}$ on $\bar{X}$ .
Decompose the causal variable group level average into
While we are not considering asymptotic series where n per goes to infinity, n per will generally also be large enough that so that we can use the normality of sample means guaranteed by the central limit theorem. Thus, $\bar{U}$ and $\bar{\epsilon }_X$ are normal with mean 0 and variance proportional to nper − 1.
With finite n per we can show that, even as K → ∞, TSLS will be biased [cf. 7, 2]. Suppose for intuition that $\bar{Z}$ has mean 0 and finite variance $\sigma ^2_{\bar{Z}}$ this bias has the closed form which can be derived as follows. First, denote $\bar{A}$ as the group level mean of variable A. From the structural equations we know that:
Since the TSLS estimator in this case is a regression of $\bar{X}$ on $\bar{Y}$ we can use the equation derived above for the scalar case to rewrite
Thus, we simply recover the original observational estimate that we have already discussed as including omitted variable bias. When Z is not degenerate, $\bar{X}$ and $\bar{Y}$ include variation from both $\bar{U}$ and $\bar{Z}$ . As n per increases the influence of $\bar{U}$ decreases and so $\hat{\beta }_{\text{TSLS}}$ is consistent for β.
While in many cases, where variation induced by instrumental variables is large, this bias can be safely ignored, in the case of online A/B testing this is likely not the case. Since much of online experimentation involves hill climbing and small improvements (on the order of a few percent or less) that add up, the TSLS estimator can be quite biased in practice (more on this below).
We now introduce a regularization procedure that can decrease bias in the TSLS estimator. We show that, in this setting a l 0-regularized first stage is computationally feasible and can help reduce this bias under some conditions on the distribution of the latent treatment effects.
There are many types of A/B tests conducted — some are micro-optimizations at the margin and some are larger explorations of the action space. Consider the stylized case with two types of tests calling the smaller variance type ‘weak’ tests while the larger variance ones are ‘strong.’ Here we can model the first stage $\bar{Z}$ as being drawn from a two-component mixture distribution:
If we knew which group was drawn from which component and ran two separate TSLS procedures using only groups whose $\bar{Z}$ is drawn from the same component, we would asymptotically get two estimators:
In reality, we would likely not know which component each group is drawn from and if simply ran a TSLS on the full data set, this estimator will be a weighted combination of the two estimators.
Within this discrete mixture model, we are limited to how much we can reduce bias (since $\mathop{plim}_{K \rightarrow \infty } \hat{\beta }_\text{TSLS, strong} \ne \beta$ ). However if the treatment effects are drawn from a distribution which is an infinite mixture of normals that has full support on normals of all variances (for example a t distribution) then we can asymptotically reduce the bias below any ϵ by using only observations which come from components with arbitrarily large variances.
We now introduce a regularization procedure that tries to perform this selection. Because using this regularization effectively decreases our dataset size, decreasing the bias increases the variance. Thus, afterwards we will turn to a procedure to set the regularization parameter to make good finite-sample bias–variance tradeoffs.
Consider a data set $(\bar{X}_g, \bar{Y}_g)$ of vectors of group-level averages. Let
For a given threshold q ∈ (0, 1], let
Thus, this procedure is equivalent to an l 0 regularization in the first stage of the TSLS regression. In particular, when $\bar{U}_g + \bar{\epsilon }_{x,g}$ has a normal distribution, as in the present case, then this is equivalent to l 0-regularized least squares.
Recall that in the binary mixture example above, this regularization would preferentially retain groups that come from the higher variance (strong) component. This extends to infinite mixtures, such as the t, where this procedure will preferentially set $\bar{X}_g$ to zero for groups where $\bar{Z}_g$ is drawn from a lower variance component.
So far we have focused on scalar X. This procedure naturally extends to multidimensional settings. Just as with the single dimension we compute the ‘null’ distribution from no-intervention conditions. We then compute $p(\bar{X}_g)$ for each group and threshold all dimensions of the experimental group g; that is, if this probability is above a threshold q we set the whole vector $\bar{X}_g$ to 0. This gives us the multi-dimensional, group-based l 0 regularizer which we will apply in our experiments.
This group-l 0 regularization can be inefficient in certain regiments of treatment effects — for example, in a regime where each A/B test explicitly only moves a single dimension of X (i.e. ‘skinny hand’ interventions). We show how this can go wrong in our synthetic data experiment but we also see that real A/B tests appear not to fall into this regime.
We now turn to an important practical question: because there is a bias–variance tradeoff how should one set the regularization parameter when K is finite to optimize for prediction under intervention?
First, let us suppose that we have access to the raw data where a row is a (Xi , Zi , Yi ) which is a unit i’s, X, Y and treatment assignment Z. We propose a procedure to set our hyperparameter q. We describe 2-fold version as it conveys the full intuition, but extension to k-folds is straightforward.
Instrumental variables cross-validation algorithm (IVCV):
The intuition behind IVCV is similar to the main idea behind IV in general. Recall that our objective is to use variation in X that is not caused by U. The IVCV algorithm uses the X value from fold 1 and compares the prediction to the Y value in fold 2 because fold 1 and fold 2 share a Z but differ in U (since U is independent across units but Z is the same within group). This intuition has been exploited in split-sample based estimators [2, 19, 23].
We can demonstrate the importance of using the full causal loss by comparing the IVCV procedure to other two candidates. The first is simply applying naive CV in the second stage (i.e., splitting each group into 2, training a model on fold 1 and computing the CV loss naively as $\Vert Y_2 - X_2 \hat{\beta }_q \Vert ^2$ ). The second is stagewise, in which the regularization parameter is chosen to minimize MSE in the first stage, and then the second stage is fit conditional on the selected model [as in 8, 20]. We compare these approaches in a simple linear model with scalar X, such that
Recall that our goal is to find a hyperparameter (regularization threshold) which gives us the best prediction $\hat{\beta }_q$ of the causal parameter. Formally, we write this as
This gives us 3 candidate cross-validation loss functions to compare to the true casual loss in our simulations
Figure 2 shows these losses as a function of the parameter q averaged over 500 simulations of the model above. We see that both the first stage loss curve and the second stage loss curve look very different from the causal loss curve. However, the IVCV loss curve matches almost exactly. Thus, either stage error naively yields a very different objective function from minimizing the causal error. In particular, we see that making the bias–variance tradeoffs for the first stage need not coincide with a desirable bias-variance tradeoff for inferring β.
The l 0-regularized IV estimator only requires the kinds of summary statistics per experimental group that are already recorded in the course of running A/B tests, which has practical and computational utility. However, the cross-validation procedure above requires the use of raw data. We now turn to the following question: if the raw data is unavailable, but summary statistics are, can we use these summary statistics to choose a threshold q?
Suppose that we have access to summary means $\lbrace (\bar{X}_g, \bar{Y}_g) \rbrace$ for each treatment j and the covariance matrix of $(\bar{X}, \bar{Y})$ conditional on Z = 0 which we denote by τ. We note that τ can be estimated very precisely from observational data or, in the case of the experimental meta-analysis just looking at covariances among known control groups. We assume that n per is large enough such that the distributions of U and ϵ in groups of size $\frac{{n_{\text{per}}}}{2}$ are well approximated by the Gaussian $\mathcal {N}(0, \frac{\sigma ^2_i}{\frac{{n_{\text{per}}}}{2}}).$
To perform IVCV under these assumptions, we use a result from the literature on Monte Carlo [30, ch. 8]. If some vector X is distributed
This means if we know the observational covariance matrix τ then for every group g we can take the group level averages $(\bar{X}_g, \bar{Y}_g)$ and sample using the equation above to get $\bar{X}^1_g$ and $\bar{X}^2_g$ such that $\bar{X}^1_g + \bar{X}^2_g = 2\bar{X}_g$ . Since by the central limit theorem the generating Gaussian model is approximately correct, this procedure simulates the split required by IVCV without having access to the raw data. The algorithm is as follows:
Summary statistics instrumental variables cross-validation algorithm (sIVCV):
We now evaluate the IVCV procedure empirically. True causal effects in real IV data are generally unobservable, so comparisons of methods usually lack a single number to validate against. Examples of the kinds of evaluations usually done in work on IV include comparing different procedures and showing that one yields estimates which are more ‘reasonable.’
Simulations allow us to know the true causal effects, but can lack realism. We strike a middle ground by using simulations where we set the causal effects ourselves but use real data to generate distributions for other variables. In our simulations we use a model given by
The multivariate case is made difficult and interesting when U has a non-diagonal covariance matrix and $\bar{Z}$ has some unknown underlying distribution, so we generate these distributions from real data derived from 798 randomly assigned test groups from a sample of A/B tests run on a recommendation algorithm at Facebook. We define our endogenous, causal Xs as 7 key performance indicators (i.e. intermediate outcomes examined by decision-makers and analysts); we standardize these to have mean 0 and variance 1. As the distribution of U we use the estimated covariance matrix among these outcomes in observational data. Third, we take the experiment-level empirical means of the Xs as the true $\bar{Z}$ , to which we add the confounding noise according to the distribution of U.
We show a projection of these $\bar{Z}$ onto 2 of the X dimensions in Figure 3(A). We see that the A/B tests appear to have correlated effects but do span both dimensions independently, many groups are retained even with strong first stage regularization, and the distribution has much more pronounced extremes than would be expected under a Gaussian model. Figure 3(B) compares the observed and Gaussian quantiles, illustrating that all dimensions are notably non-normal (Shapiro–Wilk tests of normality reject normality for all the variables at ps < 10− 39).
We set β as the vector of ones and γ as a diagonal matrix with alternating elements 1 and − 1, so that there is both positive and negative confounding. For each simulated data set, we compute the causal MSE loss for β; that is, the expected risk from intervening on one of the causal variables at random. If $\hat{\beta }$ is our estimated β vector then recall that this is given by
In addition to the l 0-regularized IV method and TSLS, we examine a Bayesian random effects model, as in Chamberlain and Imbens [12] but with a t, rather than Gaussian, distribution for the instruments. Formally we let
Figure 4(A) shows the results for various dimensions of X for 1,000 simulations. Because of the high level of confounding in the observational data, the observational (OLS) estimates of the causal effect are highly biased, such that even the standard TSLS decreases our causal MSE by over $70\%.$
We see that the l 0-regularization path (black line) reduces error compared with TSLS and, with high regularization, approaches the Oracle estimator. Furthermore, feasible selection of this hyperparameter using IVCV leads to near optimal performance (purple line). The Bayesian random effects model can reduce bias, but substantially increases variance and thus MSE.
We also look at how large the collection of experimental groups needs to be to see advantages of a regularized estimator relative to a TSLS procedure.
We repeat the TSLS, Oracle, and l 0-regularization with IVCV analyses in 100 simulations with smaller K (Figure 4(B)) for the case of the 7 dimensional X. Intuitively, what is important is the relative size of the tails of the distribution of the latent treatment effects $\bar{Z}$ . As the tails get fatter, fewer experiments are required to get draws from the more extreme components of the mixture. We see that in this realistic case where $\bar{Z}$ is determined using a sampled set of Facebook A/B tests, feasible selection of the l 0-regularization hyperparameter using IVCV outperforms TSLS substantially for many values of K. Thus, meta-analyses of even relatively small collections of experiments can be improved by the first-stage l 0 regularization.
In addition to the evaluation using a real data set. We also consider the IVCV procedure in several completely synthetic data sets. The completely synthetic experiments allows us to elucidate the important assumptions for our procedure to work while the real data-based experiment shows that these assumptions are indeed satisfied in real world conditions.
We consider the same exact model as in the previous section except that we generate the first stage effects $\bar{Z}$ from a known parametric distribution and let U be normal. First, we consider
Note that in case 1 effects are axis aligned while in the next two larger values of one dimension can predict more extreme values of Z (and X) on another dimension. The final setup corresponds to the case where components are mean-uncorrelated but have correlated variance. This is the multivariate analog of our motivating example where some A/B tests are strong explorations of the parameter spaces and others are micro-optimizations at the margin. Note that the marginal distribution for each dimension is, in all cases, a t distribution with 3 degrees of freedom (since the t can be written as a mixture of normals drawn from the inverse gamma).
Figure 5 shows the results of applying IVCV to the data generating processes above (top = independent t, middle = Wishart t, bottom = correlated variances). We restrict to dim(X) ∈ {2, 4} because it is sufficient to illustrate our main points. We see that in the independent t case the IVCV procedure (and indeed our multivariate l 0 regularization) can underperform the Bayesian random effects model and fail to substantially improve on TSLS. This happens because in the independent t case there is a high probability that a single dimension is extreme enough to pass the regularization threshold and thus even strong regularization does not necessarily remove bias. On the other hand, when outcomes are correlated (or their variances are) we see that multivariate IVCV performs well because being extreme in one X component predicts having extreme outcomes in other components. Note that the fact that IVCV performs well in the distribution generated by real A/B tests suggests that real world A/B test effect variance is correlated within A/B test — ie. if a test moves one metric by a large amount, it likely moves others by a large amount.
Most analysis of randomized experiments, whether in academia, business, or public policy tends to look at each test in isolation. When meta-analyses of experiments are conducted, the goal is usually either to pool data about multiple instances of the same intervention or to find heterogeneity in the effects of interventions across settings. We instead propose that combining many experiments can help us learn richer causal structures. IV methods give a way of doing this pooling. We have shown that in such situations using easily-implemented l 0 regularization in the first stage can lead to much better estimates of the causal effects, and thus better predictions about interventions, than using standard TSLS methods.
We expand on the literature which uses multi-condition experiments as instruments [13, 16]. Such analyses usually feature a smaller number of experimental groups and a single causal variable. Our work contributes to the growing literature merging experimental techniques with methods from machine learning to allow deeper analyses than has been traditionally possible [4, 9, 14, 18, 32, 34]. In addition, our procedure is intended to be simple to implement and not require strong assumptions about the data-generating process.
Our work is also related to research on IV estimation with weak instruments [35, 36, 37]. In addition, we also contribute to existing research on regularized IV estimation [8, 12, 19]. In the case of univariate X and disjoint groups as instruments, the post-lasso method in Belloni et al. [8] coincides with the proposed l 0 regularization, however in the case where X is a vector, it does not.
Recently, active learning in the form of bandit optimization [15, 28] and reinforcement learning [38] have become quite popular in the AI community. Such approaches can be used in many of the same contexts as the IV analysis we have discussed here, and so may appear to be substitutes. However, we note there are many important differences the largest of which is that RL and bandit approaches try to perform policy optimization rather than learning of a causal graph. For this reason, often why RL/bandit estimated policies work can be hard to understand. This is in contrast to explicit causal models (e.g., the linear model described above) which can be stated explicitly and in terms that are more natural to human decision-makers [24]. On the other hand, RL/bandit approaches have advantages in that they are explicitly online whereas many causal inference procedures (including the one we have described here) are ‘batch’ procedures that assume that data collection is a passive enterprise, separate from analysis. There is growing interest in combining these approaches [27] and we think that future work would benefit greatly from this fusion of thought.
This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '18, April 23-27, 2018, Lyon, France
© 2018; IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. ACM ISBN 978-1-4503-5639-8/18/04.
DOI: https://doi.org/10.1145/3178876.3186151