Abstract
Conditional Gaussian graphical models are a reparametrization of the multivariate linear regression model that explicitly exhibits (i) the partial covariances between the predictors and the responses, and (ii) the partial covariances between the responses themselves. Such models are particularly well suited to interpretation, since partial covariances describe direct relationships between variables. In this framework, we propose a regularization scheme that enhances the learning strategy of the model by letting prior structural information drive the selection of the relevant input features. It comes with an efficient alternating optimization procedure that is guaranteed to converge to the global minimum. On top of showing competitive performance on artificial and real datasets, our method demonstrates capabilities for fine interpretation, as illustrated on three high-dimensional datasets from spectroscopy, genetics, and genomics.
Notes
We set \(\mathbf {R}\) to be a correlation matrix so as not to excessively penalize the LASSO or the group-LASSO, which both use the same tuning parameter \(\lambda _1\) across the outcomes (and thus the same variance estimator).
We also used the same seed and CV-folds for both scenarios.
This value directly arises from the definition of the genetic distance itself.
References
Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn. 4(1), 1–106 (2012)
Brown, P., Vannucci, M., Fearn, T.: Multivariate Bayesian variable selection and prediction. J. R. Stat. Soc. B 60(3), 627–641 (1998)
Brown, P., Fearn, T., Vannucci, M.: Bayesian wavelet regression on curves with applications to a spectroscopic calibration problem. J. Am. Stat. Assoc. 96, 398–408 (2001)
Chiquet, J., Grandvalet, Y., Ambroise, C.: Inferring multiple graphical structures. Stat. Comput. 21(4), 537–553 (2011)
de los Campos, G., Hickey, J., Pong-Wong, R., Daetwyler, H., Calus, M.: Whole genome regression and prediction methods applied to plant and animal breeding. Genetics 193(2), 327–345 (2012)
Efron, B.: The estimation of prediction error: covariance penalties and cross-validation (with discussion). J. Am. Stat. Assoc. 99, 619–642 (2004)
Ferreira, M., Satagopan, J., Yandell, B., Williams, P., Osborn, T.: Mapping loci controlling vernalization requirement and flowering time in Brassica napus. Theor. Appl. Genet. 90, 727–732 (1995)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010)
Gasch, A., Spellman, P., Kao, C., Carmel-Harel, O., Eisen, M.B., Storz, G., Botstein, D., Brown, P.: Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 11(12), 4241–4257 (2000)
Hans, C.: Elastic net regression modeling with the orthant normal prior. J. Am. Stat. Assoc. 106, 1383–1393 (2011)
Harville, D.: Matrix Algebra from a Statistician’s Perspective. Springer, New York (1997)
Hebiri, M., van de Geer, S.: The smooth-lasso and other \(\ell _1 + \ell _2\) penalized methods. Electron. J. Stat. 5, 1184–1226 (2011)
Hesterberg, T., Choi, N.M., Meier, L., Fraley, C.: Least angle and \(\ell _{1}\) penalized regression: a review. Stat. Surv. 2, 61–93 (2008)
Hoefling, H.: A path algorithm for the fused lasso signal approximator. J. Comput. Graph. Stat. 19(4), 984–1006 (2010)
Kim, S., Xing, E.: Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet. 5(8), e1000587 (2009)
Kim, S., Xing, E.: Tree-guided group lasso for multi-task regression with structured sparsity. In: Proceedings of the 27th International Conference on Machine Learning, pp. 543–550 (2010)
Kim, S.J., Koh, K., Boyd, S., Gorinevsky, D.: \(\ell _1\) trend filtering. SIAM Rev. 51(2), 339–360 (2009)
Kole, C., Thorman, C., Karlsson, B., Palta, J., Gaffney, P., Yandell, B., Osborn, T.: Comparative mapping of loci controlling winter survival and related traits in oilseed Brassica rapa and B. napus. Mol. Breed. 1, 201–210 (2002)
Krishna, A., Bondell, H., Ghosh, S.: Bayesian variable selection using an adaptive powered correlation prior. J. Stat. Plan. Inference 139(8), 2665–2674 (2009)
Lajoie, M., Gascuel, O., Lefort, V., Brehelin, L.: Computational discovery of regulatory elements in a continuous expression space. Genome Biol. 13(11), R109 (2012). doi:10.1186/gb-2012-13-11-r109
Li, C., Li, H.: Variable selection and regression analysis for graph-structured covariates with an application to genomics. Ann. Appl. Stat. 4(3), 1498–1516 (2010)
Li, X., Panea, C., Wiggins, C., Reinke, V., Leslie, C.: Learning “graph-mer” motifs that predict gene expression trajectories in development. PLoS Comput. Biol. 6(4), e1000761 (2010)
Lorbert, A., Eis, D., Kostina, V., Blei, D., Ramadge, P.: Exploiting covariate similarity in sparse regression via the pairwise elastic net. In: Teh, Y.W., Titterington, D.M. (eds.) Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-10), vol. 9, pp. 477–484 (2010)
Mardia, K., Kent, J., Bibby, J.: Multivariate Analysis. Academic Press, London (1979)
Marin, J.M., Robert, C.P.: Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer, New York (2007)
Obozinski, G., Wainwright, M., Jordan, M.: Support union recovery in high-dimensional multivariate regression. Ann. Stat. 39(1), 1–47 (2011)
Osborne, B., Fearn, T., Miller, A., Douglas, S.: Application of near infrared reflectance spectroscopy to compositional analysis of biscuits and biscuit doughs. J. Sci. Food Agric. 35, 99–105 (1984)
Osborne, M.R., Presnell, B., Turlach, B.A.: On the lasso and its dual. J. Comput. Graph. Stat. 9(2), 319–337 (2000)
Park, T., Casella, G.: The Bayesian lasso. J. Am. Stat. Assoc. 103, 681–686 (2008)
Rapaport, F., Zinovyev, A., Dutreix, M., Barillot, E., Vert, J.P.: Classification of microarray data using gene networks. BMC Bioinform. 8, 35 (2007)
Rothman, A., Levina, E., Zhu, J.: Sparse multivariate regression with covariance estimation. J. Comput. Graph. Stat. 19(4), 947–962 (2010)
Shannon, P.: MotifDb: An Annotated Collection of Protein-DNA Binding Sequence Motifs. R package version 1.4.0 (2013)
Slawski, M., zu Castell, W., Tutz, G.: Feature selection guided by structural information. Ann. Appl. Stat. 4, 1056–1080 (2010)
Sohn, K., Kim, S.: Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. JMLR W&CP 22, 1081–1089 (2012)
Städler, N., Bühlmann, P., van de Geer, S.: \(\ell _1\)-penalization for mixture regression models. Test 19(2), 209–256 (2010). doi:10.1007/s11749-010-0197-z
Stein, C.: Estimation of the mean of a multivariate normal distribution. Ann. Stat. 9, 1135–1151 (1981)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996)
Tibshirani, R., Taylor, J.: The solution path of the generalized lasso. Ann. Stat. 39(3), 1335–1371 (2011). doi:10.1214/11-AOS878
Tibshirani, R., Taylor, J.: Degrees of freedom in lasso problems. Ann. Stat. 40, 1198–1232 (2012)
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K.: Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. B 67, 91–108 (2005)
Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001)
Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. 117, 387–423 (2009)
Yin, J., Li, H.: A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann. Appl. Stat. 5, 2630–2650 (2011)
Yuan, X.T., Zhang, T.: Partial Gaussian graphical model estimation. IEEE Trans. Inform. Theory 60(3), 1673–1687 (2014)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301–320 (2005)
Acknowledgments
We would like to thank Mathieu Lajoie and Laurent Bréhélin for kindly sharing the dataset from Gasch et al. (2000). We also thank the reviewers for their questions and remarks, which helped us improve our manuscript. This project was conducted in the framework of the AMAIZING project funded by the French ANR. This work has been partially supported by the grant Reg4Sel from the French INRA-SelGen metaprogram.
Appendix: Proofs
Most of the proofs rely on basic algebra and on properties of the trace and the \(\mathrm {vec}\) operators (see, e.g., Harville 1997).
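Representative identities of this kind include, for instance,
\[
\mathrm {tr}(\mathbf {A}\mathbf {B}) = \mathrm {tr}(\mathbf {B}\mathbf {A}),
\qquad
\mathrm {tr}(\mathbf {A}^{\top }\mathbf {B}) = \mathrm {vec}(\mathbf {A})^{\top }\mathrm {vec}(\mathbf {B}),
\qquad
\mathrm {vec}(\mathbf {A}\mathbf {B}\mathbf {C}) = (\mathbf {C}^{\top }\otimes \mathbf {A})\,\mathrm {vec}(\mathbf {B}).
\]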
1.1 Derivation of Proposition 1
Concerning the two regularization terms in (7), we have \(\Vert {\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\Vert _1 = \Vert {\varvec{\omega }}\Vert _1\) since the \(\ell _1\)-norm applies element-wise here.
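The structural term can be handled with the \(\mathrm {vec}\) identities above: assuming the column-stacking convention \({\varvec{\omega }} = \mathrm {vec}({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}})\) and a quadratic structural penalty of the form \(\mathrm {tr}({\varvec{\varOmega }}_{\mathbf {y}\mathbf {x}}\mathbf {L}{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}})\) (consistent with the quadratic appearing in the proof of Proposition 2 below), one gets
\[
\mathrm {tr}\big({\varvec{\varOmega }}_{\mathbf {y}\mathbf {x}}\,\mathbf {L}\,{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\big)
\;=\;
{\varvec{\omega }}^{\top }\big(\mathbf {I}_q \otimes \mathbf {L}\big)\,{\varvec{\omega }}.
\]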
As for the log-likelihood (5), we work on the trace term:
The rest of the proof is straightforward.
1.2 Convexity lemma
Lemma 1
The function
is jointly convex in \(({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}},{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})\) and admits at least one global minimum which is unique when \(n\ge q\) and \((\lambda _2\mathbf {L}+ \mathbf {S}_{\mathbf {x}\mathbf {x}})\) is positive definite.
The convexity of \(-\frac{1}{n}\log L({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}},{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})\) is proved in Yuan and Zhang (2014) (Proposition 1). Similar arguments can be straightforwardly applied in the case at hand. Existence of the global minimum is related to strict convexity in both \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\), where direct differentiation leads to the corresponding conditions.
1.3 Proof of Proposition 2
The convexity of the criterion \(J({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}},{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})\) in \(({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}},{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})\) follows from Lemma 1 together with the fact that \(\Vert {\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\Vert _1\) is also convex. One can then apply the results of Tseng and Yun (2009) and Tseng (2001) on the convergence of block coordinate descent for the minimization of nonsmooth, separable functions. Since (7) is clearly separable in \(({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}, {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})\) for the nonsmooth part induced by the \(\ell _1\)-norm, the alternating scheme is guaranteed to converge to the unique global minimum under the assumptions of Lemma 1. It remains to show that the two convex optimization subproblems (9a) and (9b) can be solved efficiently in practice.
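For illustration only (this is our own toy sketch on a made-up objective, not the authors' implementation of (9a)–(9b)), the structure of such an alternating scheme can be mimicked in a few lines of Python: one block is updated in closed form, the other by solving a small LASSO, on a jointly convex objective whose nonsmooth part is separable across the blocks.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def block_coordinate_descent(M, y, lam=0.1, rho=1.0, n_outer=50, n_inner=200):
    """Alternate over the two blocks (u, v) of the toy objective
        f(u, v) = 0.5*||M u - v||^2 + 0.5*rho*||v - y||^2 + lam*||u||_1,
    which is jointly convex and whose nonsmooth part (the l1 term) is
    separable across blocks, as in the setting of Tseng and Yun (2009).
    The v-block has a closed-form update; the u-block is a small LASSO
    solved here by proximal gradient (ISTA)."""
    n, p = M.shape
    u, v = np.zeros(p), np.zeros(n)
    step = 1.0 / np.linalg.norm(M, 2) ** 2  # 1 / Lipschitz constant of the u-gradient
    for _ in range(n_outer):
        # (a) smooth block: exact minimization over v, u held fixed
        v = (M @ u + rho * y) / (1.0 + rho)
        # (b) sparse block: LASSO in u, v held fixed
        for _ in range(n_inner):
            u = soft_threshold(u - step * (M.T @ (M @ u - v)), step * lam)
    return u, v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.normal(size=(30, 10))
    y = M @ np.concatenate([np.ones(3), np.zeros(7)]) + 0.1 * rng.normal(size=30)
    u_hat, _ = block_coordinate_descent(M, y)
    print("selected coordinates:", np.flatnonzero(u_hat))
```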
Firstly, (9b) can be recast as an Elastic-Net problem, which in turn can be recast as a LASSO problem (see, e.g., Zou and Hastie 2005; Slawski et al. 2010). This is straightforward thanks to Proposition 1: when \(\hat{{\varvec{\varOmega }}}_{\mathbf {y}\mathbf {y}}\) is fixed, the solution to (9b) can be obtained via
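a penalized least-squares problem which, in generic Elastic-Net form (constant factors in \(n\), \(\lambda _1\) and \(\lambda _2\) inherited from (7) are omitted here), reads
\[
\hat{{\varvec{\omega }}} \in \mathop {\mathrm {arg\,min}}_{{\varvec{\omega }}} \;
\frac{1}{2}\Vert \mathbf {A}{\varvec{\omega }} - \mathbf {b}\Vert _2^2
+ \frac{\lambda _2}{2}\,{\varvec{\omega }}^{\top }\tilde{\mathbf {L}}\,{\varvec{\omega }}
+ \lambda _1 \Vert {\varvec{\omega }}\Vert _1,
\]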
where \(\mathbf {A},\mathbf {b}\) and \(\tilde{\mathbf {L}}\) are defined by
Secondly, we can solve (9a) analytically with simple matrix algebra. Differentiating the objective (7) with respect to \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1}\) yields the quadratic form
After right-multiplying both sides by \(\mathbf {S}_{\mathbf {y}\mathbf {y}}\), it becomes clear that \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}} \mathbf {S}_{\mathbf {y}\mathbf {y}}\) and \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {x}}(\lambda _2 \mathbf {L}+ \mathbf {S}_{\mathbf {x}\mathbf {x}}){\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\mathbf {S}_{\mathbf {y}\mathbf {y}}\) commute and thus share the same eigenvectors \(\mathbf {U}\). Moreover, this induces the relationship \(\eta _j^2 -\eta _j = \zeta _j\) between their respective eigenvalues \(\eta _j\) and \(\zeta _j\), and we are looking for the positive root \(\eta _j\). To this end, first note that we may assume that \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {x}} (\lambda _2 \mathbf {L}+ \mathbf {S}_{\mathbf {x}\mathbf {x}}) {\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and \(\mathbf {S}_{\mathbf {y}\mathbf {y}}\) are positive definite when \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}} \ne \mathbf {0}\) and \(n\ge q\); second, recall that if a matrix is the product of two positive definite matrices, then its eigenvalues are positive. Hence \(\zeta _j>0\), and the positive root is \(\eta _j = (1 + \sqrt{1+4\zeta _j})/2\). We thus obtain
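a representation of the form
\[
\hat{{\varvec{\varOmega }}}_{\mathbf {y}\mathbf {y}}\,\mathbf {S}_{\mathbf {y}\mathbf {y}}
\;=\;
\mathbf {U}\,\mathrm {diag}\big(\eta _1,\dots ,\eta _q\big)\,\mathbf {U}^{-1},
\qquad
\eta _j = \frac{1+\sqrt{1+4\zeta _j}}{2},
\]
where the \(\zeta _j\) are the eigenvalues of \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {x}}(\lambda _2 \mathbf {L}+ \mathbf {S}_{\mathbf {x}\mathbf {x}}){\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\mathbf {S}_{\mathbf {y}\mathbf {y}}\).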
Direct inversion yields
To get an expression for \(\hat{{\varvec{\varOmega }}}_{\mathbf {y}\mathbf {y}}\) which does not require additional matrix inversion, just note that
where \(\hat{{\varvec{\varSigma }}}_{\mathbf {x}\mathbf {x}}^{\lambda _2} = (\lambda _2\mathbf {L}+ \mathbf {S}_{\mathbf {x}\mathbf {x}})\). Combined with (14), this last equality leads to
Finally, in the particular case where \(\hat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}} = \mathbf {0}\), \(\hat{{\varvec{\varOmega }}}_{\mathbf {y}\mathbf {y}} = \mathbf {S}_{\mathbf {y}\mathbf {y}}^{-1}\).
1.4 Derivation of Proposition 3
To apply the results developed in Tibshirani and Taylor (2012), which rely on the well-known Stein's Lemma (Stein 1981), we basically need to recast our problem as a classical LASSO applied to a Gaussian linear regression model in which the response vector follows a normal distribution of the form \(\mathscr {N}({\varvec{\mu }},\sigma \mathbf {I})\). This is straightforward by means of Proposition 1: in the same way as we derived expression (13), we can go a little further and reach the following LASSO formulation
with \(\mathbf {A},\mathbf {b}\) and \(\tilde{\mathbf {L}}\) defined as in (13). From model (4), it is not difficult to see that \(\mathbf {b}\) corresponds to an uncorrelated vector form of \(\mathbf {Y}\) so as
The explicit form of \({\varvec{\mu }}\) is of no interest here. The point is essentially to underline that the response vector is uncorrelated in (15), which allows us to apply Theorem 2 of Tibshirani and Taylor (2012) and results therein, notably for the Elastic-Net. By these means, an unbiased estimator of the degrees of freedom in (15) can be written as a function of the active set \(\mathscr {A}\) in \(\hat{{\varvec{\omega }}}^{\lambda _1,\lambda _2}\):
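For an Elastic-Net-type problem such as (15), this estimator takes the generic form (up to the exact scaling of \(\lambda _2\) inherited from (13))
\[
\widehat{\mathrm {df}}_{\lambda _1,\lambda _2}
\;=\;
\mathrm {tr}\Big( \mathbf {A}_{\mathscr {A}} \big( \mathbf {A}_{\mathscr {A}}^{\top }\mathbf {A}_{\mathscr {A}} + \lambda _2 \tilde{\mathbf {L}}_{\mathscr {A}\mathscr {A}} \big)^{-1} \mathbf {A}_{\mathscr {A}}^{\top } \Big),
\]
where \(\mathbf {A}_{\mathscr {A}}\) collects the columns of \(\mathbf {A}\) indexed by \(\mathscr {A}\) and \(\tilde{\mathbf {L}}_{\mathscr {A}\mathscr {A}}\) denotes the corresponding block of \(\tilde{\mathbf {L}}\).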
Routine simplifications lead to the desired result for the degrees of freedom.
Cite this article
Chiquet, J., Mary-Huard, T. & Robin, S. Structured regularization for conditional Gaussian graphical models. Stat Comput 27, 789–804 (2017). https://doi.org/10.1007/s11222-016-9654-1