Abstract
We propose methods to construct a biased linear estimator for linear regression that optimizes the relative mean squared error (MSE). Although biased estimators have previously been proposed and shown to have smaller MSE than the ordinary least squares estimator, our construction is based on minimizing the relative MSE directly. The performance of the proposed methods is illustrated by a simulation study and a real data example. The results show that our methods can improve on MSE, particularly when the predictors are correlated.
Appendices
Appendix A Proof of Theorem 1
Let \(\varvec{x}_i\) and \(\varvec{m}_i\) denote the \(i\)th row of \(\varvec{X}\) and \(\varvec{M}\), respectively. When \(\tilde{\varvec{\beta}}^T\tilde{\varvec{\beta}}\le c'\), the choice \(\varvec{M}=\varvec{0}\) is feasible, since the constraint function evaluated at \(\varvec{M}=\varvec{0}\) equals \(\sum_{i=1}^{p}\tilde{\beta}_i^2=\tilde{\varvec{\beta}}^T\tilde{\varvec{\beta}}\le c'\), and \(\mathrm{Tr}(\varvec{M}\varvec{M}^T)\) attains its unrestricted minimum of 0 there. Hence \(\hat{\varvec{M}}=\varvec{0}\) whenever \(\tilde{\varvec{\beta}}^T\tilde{\varvec{\beta}}\le c'\).
If \(c'=0\), the problem becomes minimizing \(\varvec{m}_i^T\varvec{m}_i\) subject to \(\varvec{b}^T\varvec{m}_i=\tilde{\beta}_i\) for each \(i=1,2,\ldots,p\), whose solution is \(\varvec{m}_i=\frac{\tilde{\beta}_i}{\varvec{b}^T\varvec{b}}\varvec{b}\) by the minimum-norm property of the Moore–Penrose pseudoinverse. Therefore, \(\hat{\varvec{M}}=(\varvec{b}^T\varvec{b})^{-1}\varvec{b}^T\otimes\tilde{\varvec{\beta}}\).
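As a quick check of the minimum-norm property (a verification sketch, not part of the original argument), decompose each row as \(\varvec{m}_i=\alpha\varvec{b}+\varvec{v}\) with \(\varvec{b}^T\varvec{v}=0\). The constraint \(\varvec{b}^T\varvec{m}_i=\tilde{\beta}_i\) forces \(\alpha=\tilde{\beta}_i/(\varvec{b}^T\varvec{b})\), and
\[
\varvec{m}_i^T\varvec{m}_i=\alpha^2\,\varvec{b}^T\varvec{b}+\varvec{v}^T\varvec{v}
\]
is minimized by taking \(\varvec{v}=\varvec{0}\), recovering \(\varvec{m}_i=\tilde{\beta}_i\varvec{b}/(\varvec{b}^T\varvec{b})\).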
Now consider the situation in which \(\tilde{\varvec{\beta}}^T\tilde{\varvec{\beta}}>c'>0\). Note that if we increase the upper bound of the constraint, the minimum value of the objective function is non-increasing, since the feasible region expands. Let \(c''\) denote the value of the constraint function at the minimizer for a given bound \(c'\). It follows that \(c''\le c'\) and that the minimum of the objective function is the same for any choice of bound between \(c''\) and \(c'\). Therefore, without loss of generality, assume that the solution is attained on the boundary. Then the optimization problem described in (6) is equivalent to minimizing \(L(\varvec{M},\lambda)\), which is defined as:
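A Lagrangian consistent with the derivative calculation and the identities used below (written here as a sketch; the multiplier convention is our assumption) is
\[
L(\varvec{M},\lambda)=\sum_{i=1}^{p}\varvec{m}_i^T\varvec{m}_i+\lambda\left\{\sum_{i=1}^{p}\big(\varvec{b}^T\varvec{m}_i-\tilde{\beta}_i\big)^2-c'\right\},
\]
where \(\lambda\) denotes the Lagrange multiplier associated with the boundary constraint.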
Taking the derivative of \(L({{\varvec{M}}},\lambda )\) with respect to each \({{{\varvec{m}}}}_i\) (\(i=1, 2, \ldots , p\)) and setting them to 0, we have
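Under the Lagrangian sketched above, this stationarity condition reads, for each \(i\) (a form consistent with how (9) is used below),
\[
\varvec{m}_i+\lambda\big(\varvec{b}^T\varvec{m}_i-\tilde{\beta}_i\big)\varvec{b}=\varvec{0},
\qquad\text{i.e.}\qquad
\varvec{m}_i=-\lambda\big(\varvec{b}^T\varvec{m}_i-\tilde{\beta}_i\big)\varvec{b}.
\]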
From (9) and the constraint \(\sum_{i=1}^{p}(\varvec{b}^T\varvec{m}_i-\tilde{\beta}_i)^2=c'\), we obtain \(\sum_{i=1}^{p}\varvec{m}_i^T\varvec{m}_i=\lambda^2\sum_{i=1}^{p}(\varvec{b}^T\varvec{m}_i-\tilde{\beta}_i)^2\varvec{b}^T\varvec{b}=\lambda^2c'\varvec{b}^T\varvec{b}\). This implies that \(\lambda\) cannot be a constant independent of \(c'\); otherwise the minimum value of the objective function, \(\sum_{i=1}^{p}\varvec{m}_i^T\varvec{m}_i\), would be a strictly increasing function of \(c'\), contradicting the fact that it is non-increasing in \(c'\). In particular, since \(\lambda\) is not a constant, it cannot equal \(-1/(\varvec{b}^T\varvec{b})\), and hence \(\lambda\varvec{b}^T\varvec{b}+1\ne 0\). Multiplying both sides of (9) by \(\varvec{b}^T\) and rearranging terms, we obtain
Plugging (10) into the constraint \(\sum_{i=1}^{p}(\varvec{b}^T\varvec{m}_i-\tilde{\beta}_i)^2=c'\), we obtain
For either choice of \(\lambda\), the denominator of (12) is \(\sum_{i=1}^{p}\tilde{\beta}_i^2/c'\), and we get \(\sum_{i=1}^{p}\varvec{m}_i^T\varvec{m}_i=\lambda^2c'\varvec{b}^T\varvec{b}\). We know that \(\sum_{i=1}^{p}\varvec{m}_i^T\varvec{m}_i\) must be non-increasing in \(c'\); hence \(\lambda^2\) must be non-increasing in \(c'\). Combining (11) and (12), we get \(\mathrm{Tr}(\varvec{M}\varvec{M}^T)=\sum_{i=1}^{p}\varvec{m}_i^T\varvec{m}_i=\frac{c'\big(-1\pm\sqrt{\sum_{i=1}^{p}\tilde{\beta}_i^2/c'}\big)^2}{\varvec{b}^T\varvec{b}}\). It can then be shown directly that \(\lambda=\frac{-1+\sqrt{\sum_{i=1}^{p}\tilde{\beta}_i^2/c'}}{\varvec{b}^T\varvec{b}}\) is the correct choice, as it is the only one that makes \(\mathrm{Tr}(\varvec{M}\varvec{M}^T)\) a strictly decreasing function of \(c'\) for \(0<c'<\tilde{\varvec{\beta}}^T\tilde{\varvec{\beta}}\). This strict monotonicity also shows that the optimal value is indeed attained on the boundary of the constraint for any \(0<c'<\tilde{\varvec{\beta}}^T\tilde{\varvec{\beta}}\). Based on (9), \(\hat{\varvec{m}}_i=\big((1/\lambda)\varvec{I}_n+\varvec{b}\varvec{b}^T\big)^{-1}\varvec{b}\tilde{\beta}_i\) for all \(i=1,2,\ldots,p\), which is equivalent to solving a ridge regression with single response \(\tilde{\beta}_i\), covariate matrix \(\varvec{b}^T\), and tuning parameter \(1/\lambda\). This further shows that the solution for \(\varvec{M}\) is unique.
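As a sanity check (not part of the original argument), the Sherman–Morrison formula makes this ridge-form solution explicit:
\[
\hat{\varvec{m}}_i=\Big(\tfrac{1}{\lambda}\varvec{I}_n+\varvec{b}\varvec{b}^T\Big)^{-1}\varvec{b}\,\tilde{\beta}_i
=\frac{\lambda\tilde{\beta}_i}{1+\lambda\varvec{b}^T\varvec{b}}\,\varvec{b},
\qquad
\varvec{b}^T\hat{\varvec{m}}_i-\tilde{\beta}_i=\frac{-\tilde{\beta}_i}{1+\lambda\varvec{b}^T\varvec{b}},
\]
so each row of \(\hat{\varvec{M}}\) is a scalar multiple of \(\varvec{b}\), interpolating between \(\tilde{\beta}_i\varvec{b}/(\varvec{b}^T\varvec{b})\) as \(\lambda\to\infty\) (i.e. \(c'\to 0\)) and \(\varvec{0}\) as \(\lambda\to 0\) (i.e. \(c'\to\tilde{\varvec{\beta}}^T\tilde{\varvec{\beta}}\)), matching the two boundary cases treated above.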
Appendix B Proof of Proposition 1
In order to prove Proposition 1, we require the following two lemmas, which we state here for completeness.
Lemma 1
For any symmetric \(p\times p\) matrix \(\varvec{A}\) and any nonzero \(\varvec{x}\in\mathbb{R}^p\),
\[
\lambda_p\le\frac{\varvec{x}^T\varvec{A}\varvec{x}}{\varvec{x}^T\varvec{x}}\le\lambda_1,
\]
where \(\lambda_1\ge\cdots\ge\lambda_p\) are the ordered eigenvalues of \(\varvec{A}\); the upper bound is attained when \(\varvec{x}\) is an eigenvector associated with \(\lambda_1\).
Lemma 2
(Schur complement lemma) Let \(\varvec{D}\) be positive definite. Then the symmetric block matrix \(\begin{bmatrix} \varvec{A}&\varvec{B}\\ \varvec{B}^T&\varvec{D} \end{bmatrix}\) is nonnegative definite (n.n.d.) if and only if the Schur complement of \(\varvec{D}\), \(\varvec{A}-\varvec{B}\varvec{D}^{-1}\varvec{B}^T\), is n.n.d.
We now prove Proposition 1. By Lemma 1, \(\underset{\varvec{\beta}\in\mathbb{R}^p}{\max}\ \frac{\varvec{\beta}^T(\varvec{MX}-\varvec{I}_p)^T(\varvec{MX}-\varvec{I}_p)\varvec{\beta}}{\varvec{\beta}^T\varvec{\beta}}\) is the largest eigenvalue of the matrix \((\varvec{MX}-\varvec{I}_p)^T(\varvec{MX}-\varvec{I}_p)\), denoted by \(\lambda_1\). Hence, problem (7) is equivalent to
The last equivalence is based on Lemma 2.
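For concreteness, the way Lemma 2 yields such an equivalence can be sketched as follows (our reading of the reformulation; the remaining constraints of problem (7) are not restated here). For any \(t\ge 0\),
\[
\lambda_1\big((\varvec{MX}-\varvec{I}_p)^T(\varvec{MX}-\varvec{I}_p)\big)\le t
\;\Longleftrightarrow\;
t\varvec{I}_p-(\varvec{MX}-\varvec{I}_p)^T(\varvec{MX}-\varvec{I}_p)\succeq 0
\;\Longleftrightarrow\;
\begin{bmatrix} t\varvec{I}_p & (\varvec{MX}-\varvec{I}_p)^T\\ \varvec{MX}-\varvec{I}_p & \varvec{I}_p \end{bmatrix}\succeq 0,
\]
since \(\varvec{I}_p\) is positive definite. Minimizing \(\lambda_1\) over \(\varvec{M}\) is then equivalent to minimizing \(t\) over \((t,\varvec{M})\) subject to this linear matrix inequality, which is jointly convex in \((t,\varvec{M})\).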
Appendix C Proof of Theorem 2
Let \(\varvec{x}_i^*\) denote the \(i\)th column of \(\varvec{X}\). From Theorem 1, when \(c'\ge\tilde{\varvec{\beta}}^T\tilde{\varvec{\beta}}\), it follows that \(\mathrm{Tr}(\varvec{X}\hat{\varvec{M}})=0\), since \(\hat{\varvec{M}}=\varvec{0}\). When \(c'=0\), \(\mathrm{Tr}(\varvec{X}\hat{\varvec{M}})=\sum_{i=1}^{p}(\varvec{x}_i^*)^T\hat{\varvec{m}}_i=\big[\sum_{i=1}^{p}\tilde{\beta}_i(\varvec{x}_i^*)^T\big]\big(\frac{\varvec{b}}{\varvec{b}^T\varvec{b}}\big)=1\). When \(0<c'<\tilde{\varvec{\beta}}^T\tilde{\varvec{\beta}}\), each row of \(\hat{\varvec{M}}\) satisfies (9) and (10). Multiplying both sides of (9) by \((\varvec{x}_i^*)^T\) and using (10), we have
Therefore,
Because \(\varvec{b}=\varvec{X}\tilde{\varvec{\beta}}\) is nonzero, \(\varvec{b}^T\varvec{b}\) is positive. We have already shown in the proof of Theorem 1 that \(\lambda=\frac{-1+\sqrt{\sum_{i=1}^{p}\tilde{\beta}_i^2/c'}}{\varvec{b}^T\varvec{b}}\), which is a strictly decreasing function of \(c'\). Since \(\mathrm{Tr}(\varvec{X}\hat{\varvec{M}})\) is a strictly increasing function of \(\lambda\), it follows that \(\mathrm{Tr}(\varvec{X}\hat{\varvec{M}})\) is a strictly decreasing function of \(c'\) when \(0<c'<\tilde{\varvec{\beta}}^T\tilde{\varvec{\beta}}\). As \(c'\to 0\), \(\mathrm{Tr}(\varvec{X}\hat{\varvec{M}})\to 1\), and as \(c'\to\tilde{\varvec{\beta}}^T\tilde{\varvec{\beta}}\), \(\mathrm{Tr}(\varvec{X}\hat{\varvec{M}})\to 0\). Therefore, the statement of the theorem holds.
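For completeness, the monotonicity and the limits can be checked directly from the Sherman–Morrison form of the rows sketched in Appendix A (again a verification under that reconstruction):
\[
\mathrm{Tr}(\varvec{X}\hat{\varvec{M}})
=\sum_{i=1}^{p}(\varvec{x}_i^*)^T\hat{\varvec{m}}_i
=\frac{\lambda}{1+\lambda\varvec{b}^T\varvec{b}}\sum_{i=1}^{p}\tilde{\beta}_i(\varvec{x}_i^*)^T\varvec{b}
=\frac{\lambda\,\varvec{b}^T\varvec{b}}{1+\lambda\,\varvec{b}^T\varvec{b}},
\]
using \(\sum_{i=1}^{p}\tilde{\beta}_i\varvec{x}_i^*=\varvec{X}\tilde{\varvec{\beta}}=\varvec{b}\). This expression increases strictly from 0 to 1 as \(\lambda\) increases from 0 to \(\infty\), and \(\lambda\) is strictly decreasing in \(c'\), which gives both the strict monotonicity in \(c'\) and the limits at \(c'\to\tilde{\varvec{\beta}}^T\tilde{\varvec{\beta}}\) and \(c'\to 0\).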