Posterior Robustness with Milder Conditions: Contamination Models Revisited
Abstract
Robust Bayesian linear regression is a classical but essential statistical tool. Although novel robustness properties of posterior distributions have been proved recently under a certain class of error distributions, their sufficient conditions are restrictive and exclude several important situations. In this work, we revisit a classical two-component mixture model for response variables, also known as the contamination model, where one component is a light-tailed regression model and the other component is heavy-tailed. The latter component is independent of the regression parameters, which is crucial in proving the posterior robustness. We obtain new sufficient conditions for posterior (non-)robustness and reveal non-trivial robustness results by using those conditions. In particular, we find that even the Student’s $t$ error distribution can achieve the posterior robustness in our framework. A numerical study is performed to check the Kullback-Leibler divergence between the posterior distribution based on full data and that based on data obtained by removing outliers.
Keywords: heavy-tailed distribution; posterior robustness; two-component mixture
Introduction
Bayesian posterior robustness (O’Hagan, 1979) and related topics have long been studied (e.g., West, 1984; Andrade and O’Hagan, 2006, 2011; O’Hagan and Pericchi, 2012). There, one of the most important objectives is to perform posterior analysis using moderate observations only and discarding outliers that are not related to the parameters of interest. Because the task of manually detecting or determining outliers is difficult in general, robust models are desired under which the effects of outliers are automatically removed.
Although many robust regression models have been proposed in the literature, few works (e.g., O’Hagan, 1979) have given theoretical justifications for those models. In fact, it is only recently that Desgagné (2013, 2015) and Gagnon et al. (2019) have proved posterior robustness for scale, location-scale, and regression models, respectively. Here, posterior densities are said to be robust if they converge to the corresponding conditional densities of parameters based only on non-outliers as the absolute values of outliers tend to infinity. Since then, posterior robustness has been established in various practically important settings; Hamura et al. (2022) obtained robustness results for regressions with shrinkage priors, whereas Hamura et al. (2021) considered the case of integer-valued observations.
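In proving the posterior robustness, Gagnon et al. (2019) and Hamura et al. (2022) considered the following model: with observations $y_1,\dots,y_n$, $p$-dimensional covariate vectors $x_1,\dots,x_n$, regression coefficients $\beta$ and a scale parameter $\sigma$, they assume
$$ y_i \mid \beta,\sigma \sim \frac{1}{\sigma}\, f\!\left(\frac{y_i - x_i^\top\beta}{\sigma}\right), \quad i=1,\dots,n, \qquad (\beta,\sigma)\sim\pi(\beta,\sigma), \tag{1} $$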
for some error density $f$ and prior $\pi$. In their proof of posterior robustness, it is crucial to assume that $f$ is a log-regularly varying error density. A typical density tail of the log-regularly varying distributions is $f(x) \propto |x|^{-1}(\log|x|)^{-1-c}$ as $|x|\to\infty$, where $c>0$ (for the rigorous definition, see Desgagné (2013)). This distribution has no finite moments and heavier density tails than those of the Student’s $t$-distribution. If $f$ is the Student’s $t$-distribution, the posterior is not robust (Gagnon and Hayashi, 2023). These theoretical findings imply the superiority of the log-regularly varying error density to the Student’s $t$-distribution. However, it has also been reported that the Student’s $t$ error distribution is fairly competitive in posterior inference in several numerical studies (Hamura et al., 2022).
In this paper, we revisit the following classical two-component mixture regression model, also known as the contamination model:
$$ y_i \mid \beta,\sigma \sim (1-s)\,\frac{1}{\sigma}\, f_0\!\left(\frac{y_i - x_i^\top\beta}{\sigma}\right) + s\, f_1(y_i), \qquad (\beta,\sigma)\sim\pi(\beta,\sigma), \tag{2} $$
where $i=1,\dots,n$ and $s\in(0,1)$ is a prior probability that an observation becomes an outlier. The first density, $f_0$, has thinner tails and is typically the standard normal distribution. The second density, $f_1$, is a heavy-tailed distribution, such as Student’s $t$-distribution, and is expected to accommodate outliers. One notable feature of the above model is that the second term is completely independent of the parameters $(\beta,\sigma)$. This is a significant difference from the classical two-component mixtures in Box and Tiao (1968) and subsequent research (Tak et al., 2019; Silva et al., 2020), where the second component is also scaled by the observational standard error $\sigma$. Scaling the second component by $\sigma$ is reasonable in terms of data fit, but could affect the inference on $(\beta,\sigma)$ in the presence of outliers. This observation motivates our research on the above model.
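As a rough illustration of the data-generating mechanism described above, the following sketch draws responses from a two-component contamination model in which the heavy-tailed component ignores the regression parameters. The function name `simulate_contamination`, the choice of a Student’s $t$ density with `nu = 2` degrees of freedom for the heavy-tailed component, and all numerical values are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def simulate_contamination(X, beta, sigma, s, nu=2.0, seed=0):
    """Draw y_i from (1 - s) * N(x_i' beta, sigma^2) + s * t_nu (hypothetical choices)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Latent indicator: True when the observation comes from the heavy-tailed part.
    is_outlier = rng.random(n) < s
    # Light-tailed regression component, which depends on (beta, sigma).
    regular = X @ beta + sigma * rng.standard_normal(n)
    # Heavy-tailed component, which does not depend on (beta, sigma) at all.
    heavy = rng.standard_t(nu, size=n)
    y = np.where(is_outlier, heavy, regular)
    return y, is_outlier

# Illustrative usage with an intercept and one covariate.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
beta = np.array([1.0, -2.0])
y, is_outlier = simulate_contamination(X, beta, sigma=0.5, s=0.1)
# Residuals of the non-outlying observations should have spread close to sigma.
resid = y[~is_outlier] - X[~is_outlier] @ beta
print(is_outlier.mean(), resid.std())
```

Because the heavy-tailed draws never touch `beta` or `sigma`, an extreme value generated by that branch carries no information about the regression parameters, which is the feature the model exploits.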
Under the model (2), we show that the posterior is robust if $\pi(\sigma)$, the marginal prior for $\sigma$, has tails sufficiently lighter than those of the error density $f_1$. When $f_1$ is log-regularly varying, most prior distributions satisfy this sufficient condition for robustness. Furthermore, we prove that the sufficient condition on the tails of $\pi(\sigma)$ is “nearly” necessary as well; if the error distribution is not log-regularly varying and has lighter tails than $\pi(\sigma)$, then the posterior is not robust. With these conditions, we can identify the posterior (non-)robustness for most of the error and prior distributions used in regression models.
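Our result can also explain the gap between the non-robustness of the Student’s $t$-distribution in model (1) and its success in posterior inference in numerical studies. For simplicity, assume that only the first observation, $y_1$, is outlying and let $y=(y_1,\dots,y_n)$ and $y_{-1}=(y_2,\dots,y_n)$. Then, under the model (1) with $f(x)\propto|x|^{-1-\nu}$ as $|x|\to\infty$ for $\nu>0$ (Student’s $t$-distribution with $\nu$ degrees of freedom), it holds that
$$ p(\beta,\sigma\mid y) \propto \pi(\beta,\sigma)\prod_{i=1}^{n}\frac{1}{\sigma}\,f\!\left(\frac{y_i-x_i^\top\beta}{\sigma}\right) \approx |y_1|^{-1-\nu}\,\sigma^{\nu}\,\pi(\beta,\sigma)\prod_{i=2}^{n}\frac{1}{\sigma}\,f\!\left(\frac{y_i-x_i^\top\beta}{\sigma}\right) \propto \sigma^{\nu}\,p(\beta,\sigma\mid y_{-1}) $$
as $|y_1|\to\infty$. This limit is the product of the posterior density without $y_1$ and the factor $\sigma^{\nu}$. In other words, the Student’s $t$-distribution can never achieve the posterior robustness. By contrast, under the model (2), we have
$$ p(\beta,\sigma\mid y) \propto \pi(\beta,\sigma)\prod_{i=1}^{n}\left\{(1-s)\frac{1}{\sigma}\,f_0\!\left(\frac{y_i-x_i^\top\beta}{\sigma}\right)+s\,f_1(y_i)\right\} \approx s\,f_1(y_1)\,\pi(\beta,\sigma)\prod_{i=2}^{n}\left\{(1-s)\frac{1}{\sigma}\,f_0\!\left(\frac{y_i-x_i^\top\beta}{\sigma}\right)+s\,f_1(y_i)\right\} \propto p(\beta,\sigma\mid y_{-1}) $$
as $|y_1|\to\infty$, provided that $f_1$ has sufficiently heavier tails than $f_0$ (for the rigorous proof, including the computation of the ignored normalizing constant, see the proof of Theorem 1 in the Supplementary Materials). This is precisely the posterior without $y_1$, for which we confirm the posterior robustness. Also, note that $f_1$ can be the Student’s $t$-distribution and still achieve the posterior robustness under this model. The main difference from the model (1) is that the second component of (2) does not involve the parameters $(\beta,\sigma)$. Thanks to this difference, outliers are not linked to the parameters in this model and therefore have no effect on the posterior distribution of $(\beta,\sigma)$, as long as $f_1$ has heavier tails than $f_0$. This observation applies to the general case of multiple outliers, as will be seen below.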
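The contrast between the two models can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: an intercept-only design (a single location `mu` in place of the regression coefficients), a grid approximation of the posterior, a N(0, 10²) prior on the location, a Gamma(2, 1) prior on the scale, a Student’s $t$ density with `nu = 3` both as the error density of model (1) and as the heavy-tailed component of model (2), and `s = 0.1`. None of these choices comes from the paper. For each model it measures the total variation distance between the grid posterior with one extreme observation and the grid posterior with that observation removed.

```python
import math
import numpy as np

def t_pdf(z, nu):
    """Student's t density with nu degrees of freedom (accepts scalars or arrays)."""
    c = math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)) / math.sqrt(nu * math.pi)
    return c * (1.0 + np.asarray(z) ** 2 / nu) ** (-(nu + 1) / 2)

def norm_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)

# Uniform grids for the location and the scale (illustrative ranges).
MU = np.linspace(-5.0, 5.0, 201)
SIGMA = np.linspace(0.02, 10.0, 500)

def grid_posterior(y, model, nu=3.0, s=0.1):
    """Normalized posterior weights on the (mu, sigma) grid."""
    M, S = np.meshgrid(MU, SIGMA, indexing="ij")
    # Log prior: mu ~ N(0, 10^2), sigma ~ Gamma(2, 1), up to additive constants.
    log_post = -0.5 * (M / 10.0) ** 2 + np.log(S) - S
    for yi in y:
        z = (yi - M) / S
        if model == "t":
            # Model (1): scaled Student's t error density.
            lik = t_pdf(z, nu) / S
        else:
            # Model (2): normal regression part plus a parameter-free t part.
            lik = (1 - s) * norm_pdf(z) / S + s * t_pdf(yi, nu)
        log_post += np.log(lik)
    w = np.exp(log_post - log_post.max())
    return w / w.sum()

def tv(p, q):
    """Total variation distance between two sets of grid weights."""
    return 0.5 * np.abs(p - q).sum()

clean = [-1.0, -0.4, 0.1, 0.5, 1.2]
with_outlier = [1000.0] + clean

tv_t = tv(grid_posterior(with_outlier, "t"), grid_posterior(clean, "t"))
tv_contam = tv(grid_posterior(with_outlier, "contamination"),
               grid_posterior(clean, "contamination"))
print(tv_t, tv_contam)
```

On this grid the outlier's contribution under the contamination model is numerically constant in the parameters, so the two posteriors coincide up to rounding, whereas under the scaled Student’s $t$ model the outlier tilts the posterior towards larger scale values.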
The remainder of this paper is organized as follows. In Section 2, sufficient conditions and necessary conditions for posterior robustness are given. In Section 3, a numerical example is given, in which we see that the Kullback-Leibler divergence between the target and available posteriors can diverge or converge to $0$ in some cases. Proofs are given in the Supplementary Material.
Contamination Models and Posterior Robustness
Suppose that we observe
for , where is a set of continuous explanatory variables and where and are parameters of interest following a prior distribution . Here, is an error density, and is a prior probability that observation is generated from .
Following the work of Desgagné (2015), let satisfy , , and . Suppose that , , and , , for , such that represents the set of indices of outlying observations. We say that the posterior is robust to outliers under the above model if as , where , , and .
To derive conditions for posterior robustness, we limit the class of prior distributions for . Suppose that
(3) |
for some , and , where . That is, the ratio of the prior density and some double-sided scaled-beta density (with spike at the origin) must be bounded uniformly by some constant. This condition is satisfied by most of the conditionally independent priors that are commonly used in practice. Examples include shrinkage priors, such as the horseshoe prior (Carvalho et al., 2009, 2010), as well as the normal priors. The condition is also satisfied by some multivariate priors for dependent , including the multivariate normal prior.
Likewise, we assume the error distributions, , are bounded as
(4) |
for some , and . The class of distributions that satisfy this condition includes Student’s $t$-distributions ( and ) and log-regularly varying distributions ( and ).
The following theorem gives a sufficient condition for the posterior to be robust.
Theorem 1.
The moment condition for $\sigma$ in (5) could be a strong requirement when and is large. We will compare this condition with those in the literature later in Table 2. Next, we prove that the posterior robustness does not hold if this moment condition is not satisfied and, in addition, the tails of the error density are not sufficiently heavy.
Theorem 2.
Let be a probability density and suppose that . Let and suppose that
for all for some . Suppose that
(6) |
for all for some and . Then we have
at each .
Clearly, under the assumptions of Theorem 2, the posterior does not converge in the usual sense. Indeed, we see in the next section that the Kullback-Leibler divergence between and diverges in such a situation.
From Theorems 1 and 2, we can determine whether a prior yields a robust posterior or not in most cases. Suppose that and are independent (e.g., and ) and that (3) holds. Suppose that equality holds in (4). Then, if we use a gamma prior for , the moment condition in (5) is always satisfied; hence the posterior is robust regardless of the choice of . If we use an inverse gamma prior or a scaled beta prior for , either (5) or (6) is satisfied, depending on the hyperparameters. That is, there exists a threshold separating robust and non-robust cases. These observations are summarized in Table 1.
Prior for | Density | Condition (5) for robustness | Condition (6) for non-robustness |
---|---|---|---|
Inverse-gamma: | |||
Gamma: | ✓ | NA | |
Scaled-beta: | |||
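The threshold behaviour summarized in Table 1 reflects a standard fact about moments: a gamma-distributed scale has finite positive moments of every order, whereas an inverse-gamma-distributed scale has a finite moment of order `k` only when `k` is smaller than its shape parameter. The sketch below evaluates these closed-form moments. The exponent `k` is a hypothetical stand-in, since the exact exponent required by condition (5) is not reproduced here, and the parameterizations (shape and rate for the gamma, shape and scale for the inverse-gamma) are assumptions of this example.

```python
import math

def gamma_moment(k, shape, rate):
    """E[sigma^k] for sigma ~ Gamma(shape, rate); finite for every k > -shape."""
    return math.exp(math.lgamma(shape + k) - math.lgamma(shape) - k * math.log(rate))

def inverse_gamma_moment(k, shape, scale):
    """E[sigma^k] for sigma ~ Inverse-Gamma(shape, scale); infinite once k >= shape."""
    if k >= shape:
        return math.inf
    return math.exp(k * math.log(scale) + math.lgamma(shape - k) - math.lgamma(shape))

# Moments of increasing (hypothetical) order k under both priors.
for k in (1, 2, 3, 4):
    print(k, gamma_moment(k, shape=2.0, rate=1.0),
          inverse_gamma_moment(k, shape=3.0, scale=2.0))
```

With shape 3, the inverse-gamma moments are finite for `k = 1, 2` and infinite from `k = 3` onwards, while the gamma moments stay finite throughout; this mirrors why a moment condition can always hold for a gamma prior but holds for an inverse-gamma prior only for some hyperparameter values.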
The sufficient conditions obtained in this study differ from those in Gagnon et al. (2019) and Hamura et al. (2022) not only in the model specification given in (1) and (2) but also in the requirement of the error and prior densities. Table 2 summarizes the sufficient conditions for posterior robustness in the literature and Theorem 1. As pointed out in the introduction, $f_1$ in our model does not have to be log-regularly varying to achieve the posterior robustness, which is significantly different from the settings in the literature. Instead, at the cost of allowing for a wider class of error distributions for $f_1$, more constraints on the choice of priors for are needed for the proof of Theorem 1. Consequently, the conditions used in the literature and Theorem 1 are not nested in one another. For example, the conditions in Gagnon et al. (2019) cover the improper prior for .
It is also worth emphasizing that, as clarified in Table 2, no assumption is made directly on , the number of outliers, in Theorem 1. Note that this number is defined by the residuals; is outlying for and close to zero for . The key result that enables the proof without any assumption on is the lemma we obtained about the residuals; for details, see Lemma 1 in the Supplementary Materials.
Number of | Error density | Prior density | |||
---|---|---|---|---|---|
outliers | tails ( or ) | Density bounds | Moments | Improper | |
Gagnon et al. | LRVD | – | ✓ | ||
(2019) | |||||
Hamura et al. | LRVD | NA | |||
(2022) | |||||
Theorem 1 | Not needed | NA | |||
of this study |
Numerical Examples
Here, we consider a numerical example to illustrate the property of the posterior (non-)robustness. In doing so, we numerically evaluated the Kullback-Leibler (KL) divergence of the target posterior distribution from the available posterior distribution , or , as well as the point estimates of parameters and predictive intervals. We used the conjugate normal-inverse gamma prior , where . Under this prior, the posterior becomes a finite mixture of known distributions and is analytically and numerically tractable. We considered the following two error densities:
where . The first error distribution, , is the double-sided scaled-beta distribution, whose tail behavior is equivalent to that of Student’s $t$-distribution. The second error distribution, , is the unfolded version of the log-Pareto distribution of Cormann and Reiss (2009).
As an example, we deterministically created the dataset as
and . In this example, where and , we considered . In computing the KL divergence, the fifth observation with is viewed as an outlier; and . Our experiment includes the case of to see the performance of the robust model in the absence of outliers. For the prior, we set , , and , and we considered the two cases and . Combining the two priors with the two error distributions and , we have four models in total.
First, we obtained the Monte Carlo approximation of the KL divergence by using 1,000 samples from the posterior distributions. The result is summarized in the left panel of Figure 1. It is clearly seen that the KL divergence does not decrease when and , since the condition of Theorem 2 is satisfied and the posterior does not converge. In the other three cases, where the sufficient condition of Theorem 1 is satisfied, the KL divergence converges to $0$ as .
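A Monte Carlo approximation of a KL divergence of this kind can be sketched as follows: draw samples from one distribution and average the difference of the two log densities at those samples. The helper `mc_kl`, the direction of the divergence, and the pair of normal distributions used to check it are assumptions made for illustration; the posteriors of the paper are finite mixtures and are not reproduced here.

```python
import numpy as np

def mc_kl(sample_p, log_p, log_q, n_samples=1000, seed=0):
    """Monte Carlo estimate of KL(p || q) = E_p[log p(X) - log q(X)]."""
    rng = np.random.default_rng(seed)
    x = sample_p(rng, n_samples)
    return float(np.mean(log_p(x) - log_q(x)))

def normal_logpdf(x, mean, sd):
    """Log density of N(mean, sd^2)."""
    return -0.5 * np.log(2 * np.pi * sd ** 2) - 0.5 * ((x - mean) / sd) ** 2

# Sanity check on a case with a known answer: KL(N(0,1) || N(1,1)) = 0.5.
kl_hat = mc_kl(
    sample_p=lambda rng, n: rng.normal(0.0, 1.0, size=n),
    log_p=lambda x: normal_logpdf(x, 0.0, 1.0),
    log_q=lambda x: normal_logpdf(x, 1.0, 1.0),
)
print(kl_hat)
```

With 1,000 samples the estimate carries Monte Carlo error of a few hundredths in this example, so small differences between curves obtained this way should be read with that noise level in mind.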
In addition, we computed the posterior means of and in each scenario, which are shown in the middle and right panels of Figure 1, respectively. The point estimates of and are stable regardless of the value of in the three cases where the posterior robustness holds. It should also be noted that the difference of the point estimates with and without outliers (say, and ) is small under the posterior robustness. In contrast, the point estimates become unreasonable as increases when and .
Next, under the same setting, we computed the posterior and predictive distributions of and given with When and , the credible intervals become extremely wide since the posterior robustness does not hold and the posterior of converges to zero. When and the posterior robustness holds even for , the lengths of the interval estimates become reasonable. The interval lengths obtained under and are similar but slightly different, reflecting the difference between their error density tails.
(regression value at ) | (unobserved data at ) | |||||||||
heavy () | 1.87 | 2.27 | 2.35 | 3.97 | 5.91 | 8.00 | 6.96 | 9.21 | ||
heavy () | 1.32 | 1.62 | 1.35 | 2.37 | 4.63 | 6.84 | 5.02 | 7.09 | ||
light () | 1.90 | 2.27 | 60.7 | 82.8 | 5.13 | 7.42 | 145 | 157 | ||
light () | 1.24 | 1.56 | 1.44 | 2.39 | 4.51 | 6.78 | 4.48 | 6.91 |
Acknowledgments
Research of the authors was supported in part by JSPS KAKENHI Grant Numbers 22K20132, 19K11852, 17K17659, and 21H00699 from the Japan Society for the Promotion of Science.
References
- Andrade and O’Hagan (2006) Andrade, J. A. A. and A. O’Hagan (2006). Bayesian robustness modeling using regularly varying distributions. Bayesian Analysis 1(1), 169–188.
- Andrade and O’Hagan (2011) Andrade, J. A. A. and A. O’Hagan (2011). Bayesian robustness modelling of location and scale parameters. Scandinavian Journal of Statistics 38(4), 691–711.
- Box and Tiao (1968) Box, G. E. and G. C. Tiao (1968). A Bayesian approach to some outlier problems. Biometrika 55(1), 119–129.
- Carvalho et al. (2009) Carvalho, C. M., N. G. Polson, and J. G. Scott (2009). Handling sparsity via the horseshoe. In AISTATS, Volume 5, pp. 73–80.
- Carvalho et al. (2010) Carvalho, C. M., N. G. Polson, and J. G. Scott (2010). The horseshoe estimator for sparse signals. Biometrika 97(2), 465–480.
- Cormann and Reiss (2009) Cormann, U. and R.-D. Reiss (2009). Generalizing the Pareto to the log-Pareto model and statistical inference. Extremes 12(1), 93–105.
- Desgagné (2013) Desgagné, A. (2013). Full robustness in Bayesian modelling of a scale parameter. Bayesian Analysis 8, 187–220.
- Desgagné (2015) Desgagné, A. (2015). Robustness to outliers in location–scale parameter model using log-regularly varying distributions. The Annals of Statistics 43(4), 1568–1595.
- Gagnon et al. (2019) Gagnon, P., A. Desgagné, and M. Bédard (2019). A new Bayesian approach to robustness against outliers in linear regression. Bayesian Analysis 15(2), 389–414.
- Gagnon and Hayashi (2023) Gagnon, P. and Y. Hayashi (2023). Theoretical properties of Bayesian Student-$t$ linear regression. Statistics and Probability Letters 193.
- Hamura et al. (2021) Hamura, Y., K. Irie, and S. Sugasawa (2021). Robust hierarchical modeling of counts under zero-inflation and outliers. arXiv preprint arXiv:2106.10503.
- Hamura et al. (2022) Hamura, Y., K. Irie, and S. Sugasawa (2022). Log-regularly varying scale mixture of normals for robust regression. Computational Statistics & Data Analysis 173, 107517.
- O’Hagan (1979) O’Hagan, A. (1979). On outlier rejection phenomena in Bayes inference. Journal of the Royal Statistical Society: Series B 41(3), 358–367.
- O’Hagan and Pericchi (2012) O’Hagan, A. and L. Pericchi (2012). Bayesian heavy-tailed models and conflict resolution: A review. Brazilian Journal of Probability and Statistics 26, 372–401.
- Silva et al. (2020) Silva, N., M. Prates, and F. Gonçalves (2020). Bayesian linear regression models with flexible error distributions. Journal of Statistical Computation and Simulation 90, 2571–2591.
- Tak et al. (2019) Tak, H., J. A. Ellis, and S. K. Ghosh (2019). Robust and accurate inference via a mixture of Gaussian and Student’s t errors. Journal of Computational and Graphical Statistics 28(2), 415–426.
- West (1984) West, M. (1984). Outlier models and prior distributions in Bayesian linear regression. Journal of the Royal Statistical Society: Series B (Methodological) 46(3), 431–439.
Supplementary Material for “Posterior Robustness with Milder Conditions: Contamination Models Revisited”
A Basic Lemma
Lemma 1 is used in the proof of Theorem 1. If satisfy , we write for the th unit vector in , namely the th column of the identity matrix.
Lemma 1.
Let . Let be continuous variables. Let . Let and . Let . Let and for . Suppose that . Suppose that and are continuous variables.
(i) Let . Let satisfy . Let and be arbitrary. Then there exist , , , and such that for all and all , the condition that
implies the following conditions:
(a) There exist distinct indices such that .
(b) There exist distinct indices such that .
(ii) Let and be arbitrary. Then there exist , , , and such that for all ,
Proof.
Part (ii) follows from part (i). For part (i), fix and and . Suppose that . Then
If is sufficiently large,
(S1) |
Now, suppose that . Then
Since is invertible by assumption,
for some and . This is a contradiction if is sufficiently small since by assumption. Thus, we have if is sufficiently small and we assume that .
For part (a), suppose that there exist distinct indices such that . Then
(S2) |
Let be such that . Let . Then if is sufficiently small, by (S1) and (S2)
or
Let . Then there exist invertible matrices and such that
Therefore,
Thus, there exists such that
or
which is a contradiction if is sufficiently small since by assumption. This proves part (a).
For part (b), suppose that there exist distinct indices such that . Then
Let , , and be such that , , and . Then
Let be such that and let . Then if is sufficiently large, we have, by (S1),
(S3) |
Also,
where . If is sufficiently small, the matrix has rank since otherwise for some invertible matrix . Therefore, there exists an invertible matrix such that
(S4) |
where . It follows from (S3) and (S4) that if is sufficiently large,
Thus, since the rank of the matrix is less than or equal to ,
for some invertible matrix , which is a contradiction if is sufficiently small. This completes the proof. ∎
Proof of Theorem 1
Here, we prove Theorem 1.
Proof of Theorem 1. The posterior is
where
Since
it is sufficient to show that
Since for all and all , and imply
for some , it follows from the dominated convergence theorem that
for all . Thus, since
it suffices to prove that for all and all , there exists such that
where
converges to as . This clearly holds for for all .
First, fix and let . Then, by Lemma 1, there exist , , and such that for all ,
Since
for all , clearly
by the dominated convergence theorem. Fix , with , and and let
Then
for some for any and therefore
for some .
Next, fix and . Let . Let and for . Then, by Lemma 1, there exist , , , and such that for all ,
Clearly,
Fix , with , and . Let
As in the previous case, for some that is sufficiently close to ,
for some . Therefore,
as . Now, suppose that and fix and with . Then if ,
where the equality follows since there is no point satisfying and for some . The right-hand side converges to as regardless of whether or not. This completes the proof.
Remark 1.
Although we assume for simplicity that is the standard normal density, similar results about posterior robustness will be established for other choices of as well. The most important property of that is used throughout the above proof is that has thinner tails than .
Proof of Theorem 2
Here, we prove Theorem 2.
Proof of Theorem 2. As in the proof of Theorem 1, we have
where
and
Now, if is sufficiently large such that for all , then
Therefore, by making the change of variables , we obtain
This completes the proof.