1 Introduction

In the simultaneous estimation of the means of multivariate normal distributions, the so-called Stein problem has been studied in the literature since Stein (1956) established the inadmissibility of the maximum likelihood estimator when the dimension of the mean vector is three or more. Notably, James and Stein (1961) provided a simple and concrete shrinkage estimator which improves on the maximum likelihood estimator, namely a minimax shrinkage estimator under the squared error loss. Since then, many minimax shrinkage estimators have been suggested for various models. For a good account of this topic, see Stein (1981), Berger (1985), Brandwein and Strawderman (1990), Robert (2007) and Fourdrinier et al. (2018).

In particular, Efron and Morris (1972a, b) showed that the James–Stein estimator can be derived as an empirical Bayes estimator, and Strawderman (1971) discovered prior distributions which yield Bayes and minimax estimators. Strawderman’s estimator is an ultimate goal in the framework of decision theory, because it is admissible and minimax. Since then, Bayesian minimax estimation has been studied actively and extensively.

To explain the details, we consider a random variable \({\varvec{X}}=(X_1, \ldots , X_p)^\top \) having the p-variate normal distribution \(\mathcal {N}_p({\varvec{\theta }}, {\varvec{I}})\) for \({\varvec{\theta }}=({\theta }_1, \ldots , {\theta }_p)^\top \), where an estimator \(\widehat{{\varvec{\theta }}}=(\hat{{\theta }}_1, \ldots , \hat{{\theta }}_p)^\top \) of \({\varvec{\theta }}\) is evaluated by its risk relative to the squared error loss \(\Vert \widehat{{\varvec{\theta }}}-{\varvec{\theta }}\Vert ^2=\sum _{i=1}^p (\hat{{\theta }}_i-{\theta }_i)^2\). The James–Stein estimator is

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{JS}=\left( 1-{p-2\over \Vert {\varvec{X}}\Vert ^2}\right) {\varvec{X}}, \end{aligned}$$
(1.1)

which is improved on by the positive-part Stein estimator

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{PS}=\max \left( 0, 1-{p-2\over \Vert {\varvec{X}}\Vert ^2}\right) {\varvec{X}}. \end{aligned}$$
(1.2)

Both estimators are minimax for \(p\ge 3\), but inadmissible. An admissible and minimax estimator is the generalized Bayes estimator against the shrinkage prior \(\Vert {\varvec{\theta }}\Vert ^{2-p}\), given by

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{GB}=\left( 1-{\phi _0\left( \Vert {\varvec{X}}\Vert ^2\right) \over \Vert {\varvec{X}}\Vert ^2}\right) {\varvec{X}}, \end{aligned}$$
(1.3)

for

$$\begin{aligned} \phi _0(w)=&w {\int _0^\infty ({\lambda }+1)^{-p/2-1}\exp \left[ -w/\{2({\lambda }+1)\}\right] \textrm{d}{\lambda }\over \int _0^\infty ({\lambda }+1)^{-p/2}\exp \left[ -w/\{2({\lambda }+1)\}\right] \textrm{d}{\lambda }}\\ =&p-2 - {2\exp (-w/2) \over \int _0^\infty ({\lambda }+1)^{-p/2}\exp \left[ -w/\{2({\lambda }+1)\}\right] \textrm{d}{\lambda }}. \end{aligned}$$

Kubokawa (1991, 1994) showed that this Bayes estimator improves on the James–Stein estimator.
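
To make the estimators (1.1)–(1.3) concrete, the following is a minimal numerical sketch in Python (our own code, not part of the original analysis); the function names are ours, and \(\phi _0\) is evaluated by numerical quadrature.

```python
# A minimal sketch (not from the paper) of the estimators (1.1)-(1.3).
# Function names are ours; phi0 evaluates the two integrals in phi_0 by quadrature.
import numpy as np
from scipy.integrate import quad

def james_stein(x):
    p = x.size
    return (1.0 - (p - 2) / np.sum(x**2)) * x             # (1.1)

def positive_part_stein(x):
    p = x.size
    return max(0.0, 1.0 - (p - 2) / np.sum(x**2)) * x     # (1.2)

def phi0(w, p):
    f = lambda lam, k: (lam + 1.0)**(-k) * np.exp(-w / (2.0 * (lam + 1.0)))
    num = quad(f, 0, np.inf, args=(p / 2 + 1,))[0]
    den = quad(f, 0, np.inf, args=(p / 2,))[0]
    return w * num / den

def generalized_bayes(x):
    p, w = x.size, np.sum(x**2)
    return (1.0 - phi0(w, p) / w) * x                     # (1.3)
```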

The joint density of \(({\varvec{X}}, {\varvec{\theta }})\) for the shrinkage prior \(\Vert {\varvec{\theta }}\Vert ^{2-p}\) is

$$\begin{aligned} {1\over (2\pi )^{p/2}}\exp \left\{ -{1\over 2}\Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2\right\} {1\over \Vert {\varvec{\theta }}\Vert ^{p-2}}. \end{aligned}$$

Bayes estimators in the literature often arise as posterior means, but in this paper we work with modes of posterior distributions. The logarithm of the joint density is, up to an additive constant,

$$\begin{aligned} -{1\over 2}\Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2 - {p-2\over 2} \log \Vert {\varvec{\theta }}\Vert ^2. \end{aligned}$$

This expression motivates us to consider the posterior mode, namely the minimization of

$$\begin{aligned} \Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2 + (p-2) \log \Vert {\varvec{\theta }}\Vert ^2. \end{aligned}$$
(1.4)

In linear regression models, ridge regression and lasso regression are regularized least squares methods with \(L_2\) and \(L_1\) penalties, respectively. These methods do not assume normality for the underlying distribution. Similarly, the minimization problem (1.4) is set up without assuming normality. However, it may be hard to find an explicit solution of the minimization. Instead, in this paper we utilize the equation of the first-order condition (FOC), and based on the FOC equation, we suggest shrinkage estimators.

The paper is organized as follows. In Sect. 2, we consider the estimation of means with a known scale parameter, and set up the minimization problem with the logarithmic penalty. Using the first-order condition, we propose two kinds of shrinkage estimators, called the one-step shrinkage and the two-step shrinkage estimators. The one-step shrinkage estimator resembles the James–Stein estimator, and the two-step shrinkage estimator is differentiable and performs similarly to the positive-part Stein estimator. It is noted that the two-step shrinkage estimator does not belong to Baranchik’s class of minimax estimators, so that we need to evaluate the risk function directly. It can be shown that both estimators are minimax under normality. The estimators are numerically compared, and it is revealed that the two-step shrinkage estimator performs well for non-normal distributions. The extension to the case of unknown scale is given in Sect. 3. In Sect. 4, we study the simultaneous estimation of means of discrete distributions. Two kinds of shrinkage estimators are derived and their minimaxity is established. In Sect. 5, we give remarks on extensions to estimation of matrix means and linear regression models. For numerical computations in this paper, we used the Ox program by Doornik (2007).

2 Shrinkage estimators under the logarithmic penalties

Let \(X_1, \ldots , X_p\) be observable random variables generated from the model

$$\begin{aligned} X_i = {\theta }_i + {\sigma }{\varepsilon }_i,\quad i=1, \ldots , p, \end{aligned}$$
(2.1)

where \({\varepsilon }_1, \ldots , {\varepsilon }_p\) are independently distributed with \(\textrm{E}[{\varepsilon }_i]=0\) and common finite variance. Let \({\varvec{X}}=(X_1, \ldots , X_p)^\top \) and \({\varvec{\theta }}=({\theta }_1, \ldots , {\theta }_p)^\top \). An estimator \(\widehat{{\varvec{\theta }}}\) of \({\varvec{\theta }}\) is evaluated in terms of the risk relative to the quadratic loss \(\Vert \widehat{{\varvec{\theta }}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2\). The estimator \({\varvec{X}}\) minimizes \(\Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2\) with respect to \({\varvec{\theta }}\), while the minimizers of \(\Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2+{\lambda }\sum _{i=1}^p |{\theta }_i|\) and \(\Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2+{\lambda }\Vert {\varvec{\theta }}\Vert ^2\) with \(L_1\) and \(L_2\) penalties for positive \({\lambda }\) give shrinkage estimators. Those penalized procedures correspond to the lasso and the ridge estimator, respectively, in the framework of linear regression models.

In this paper, we consider deriving estimators of \({\varvec{\theta }}\) through the minimization

$$\begin{aligned} L({\varvec{\theta }})=\Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2+ c \log \left( \Vert {\varvec{\theta }}\Vert ^2\right) , \end{aligned}$$

for a known positive constant c, which will be suitably chosen later. Since

$$\begin{aligned} {\partial \over \partial {\varvec{\theta }}}L({\varvec{\theta }})=- {2\over {\sigma }^2}({\varvec{X}}-{\varvec{\theta }})+{2c\over \Vert {\varvec{\theta }}\Vert ^2}{\varvec{\theta }}, \end{aligned}$$

the first-order condition (FOC) yields the equation

$$\begin{aligned} {\varvec{\theta }}= {\Vert {\varvec{\theta }}\Vert ^2 \over \Vert {\varvec{\theta }}\Vert ^2 + c{\sigma }^2}{\varvec{X}}=\left( 1-{c{\sigma }^2\over \Vert {\varvec{\theta }}\Vert ^2 + c{\sigma }^2}\right) {\varvec{X}}. \end{aligned}$$
(2.2)

The optimal solution may be obtained by solving the FOC equation numerically. Instead, we want to provide estimators in closed form by using the FOC equation. A simple method is to replace \(\Vert {\varvec{\theta }}\Vert ^2\) in the RHS of (2.2) with \(\Vert {\varvec{X}}\Vert ^2\). Then we have

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{OS}=\left( 1-{c{\sigma }^2\over \Vert {\varvec{X}}\Vert ^2 + c{\sigma }^2}\right) {\varvec{X}}, \end{aligned}$$
(2.3)

which we call the one-step shrinkage estimator. This type of shrinkage estimator was first treated by Stein (1956).

We next consider using the FOC equation (2.2) twice. From (2.2), we have

$$\begin{aligned} \Vert {\varvec{\theta }}\Vert ^2 =\left( {\Vert {\varvec{\theta }}\Vert ^2 \over \Vert {\varvec{\theta }}\Vert ^2 + c{\sigma }^2}\right) ^2\Vert {\varvec{X}}\Vert ^2. \end{aligned}$$

Again substituting this quantity into (2.2) yields

$$\begin{aligned} {\varvec{\theta }}=&\left( 1-{c{\sigma }^2\over \left\{ \Vert {\varvec{\theta }}\Vert ^2/\left( \Vert {\varvec{\theta }}\Vert ^2 + c{\sigma }^2\right) \right\} ^2\Vert {\varvec{X}}\Vert ^2 + c{\sigma }^2}\right) {\varvec{X}}\nonumber \\ =&\left( 1-{c{\sigma }^2\left( 1+c{\sigma }^2/\Vert {\varvec{\theta }}\Vert ^2\right) ^2\over \Vert {\varvec{X}}\Vert ^2 + c{\sigma }^2\left( 1+c{\sigma }^2/\Vert {\varvec{\theta }}\Vert ^2\right) ^2}\right) {\varvec{X}}. \end{aligned}$$
(2.4)

Replacing \(\Vert {\varvec{\theta }}\Vert ^2\) with \(\Vert {\varvec{X}}\Vert ^2\), we have

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{TS}= \left( 1-{c{\sigma }^2\left( 1+c{\sigma }^2/\Vert {\varvec{X}}\Vert ^2\right) ^2\over \Vert {\varvec{X}}\Vert ^2 + c{\sigma }^2\left( 1+c{\sigma }^2/\Vert {\varvec{X}}\Vert ^2\right) ^2}\right) {\varvec{X}}, \end{aligned}$$
(2.5)

which we call the two-step shrinkage estimator.
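
The following is a small Python sketch (our own code, not the paper's) of the one-step estimator (2.3), the two-step estimator (2.5) and a naive fixed-point iteration on the FOC equation (2.2); the iteration is only illustrative and we do not claim convergence properties for it.

```python
# A sketch (our own code) of the one-step (2.3) and two-step (2.5) estimators for known
# sigma^2, together with a naive fixed-point iteration on the FOC equation (2.2).
# The default c = p - 2 anticipates the choice made later in this section.
import numpy as np

def theta_os(x, sigma2=1.0, c=None):
    c = (x.size - 2) if c is None else c
    w = np.sum(x**2)
    return (1.0 - c * sigma2 / (w + c * sigma2)) * x

def theta_ts(x, sigma2=1.0, c=None):
    c = (x.size - 2) if c is None else c
    w = np.sum(x**2)                                   # assumes x is not the zero vector
    a = c * sigma2 * (1.0 + c * sigma2 / w)**2
    return (1.0 - a / (w + a)) * x

def theta_foc(x, sigma2=1.0, c=None, n_iter=200):
    # iterate theta <- ||theta||^2 / (||theta||^2 + c sigma^2) * x, starting from x
    c = (x.size - 2) if c is None else c
    theta = x.copy()
    for _ in range(n_iter):
        s = np.sum(theta**2)
        theta = s / (s + c * sigma2) * x
    return theta
```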

We here look into the two-step shrinkage estimator, which is rewritten as \(\widehat{{\varvec{\theta }}}^\textrm{TS}=\{1-\phi _\textrm{TS}(W)/W\}{\varvec{X}}\) for \(W=\Vert {\varvec{X}}\Vert ^2/{\sigma }^2\), where

$$\begin{aligned} \phi _{\textrm{TS}}(w)={c w (1+c/w)^2\over w + c(1+c/w)^2}. \end{aligned}$$
(2.6)

The derivative of \(\phi _\textrm{TS}(w)\) is

$$\begin{aligned} \phi _\textrm{TS}'(w)&= 1-{2w\over w+c(1+c/w)^2}+{w^2\over \{w+c(1+c/w)^2\}^2}\left\{ 1-2{c^2\over w^2}\left( 1+{c\over w}\right) \right\} \nonumber \\&= \{\phi _\textrm{TS}(w)\}^2{1\over w^2}\left\{ 1- {2\over (1+c/w)^3}\right\} . \end{aligned}$$
(2.7)

Let \(w_0\) be the solution of \(\phi _\textrm{TS}'(w)=0\) or \((1+c/w)^3=2\). The solution \(w_0\) satisfies that \(c/w_0=2^{1/3}-1\) or \(w_0=c/(2^{1/3}-1)\), and the maximum value of \(\phi _\textrm{TS}(w)\) is given by

$$\begin{aligned} \phi _\textrm{TS}(w_0) = {cw_02^{2/3} \over w_0+c 2^{2/3}}={c2^{2/3} \over 1+(2^{1/3}-1)2^{2/3}} = {2^{2/3}\over 3-2^{2/3}}c \approx 1.12 c < 1.13 c. \end{aligned}$$
(2.8)

Thus, \(\phi _\textrm{TS}(w)\) is increasing for \(0\le w\le w_0\), and decreasing for \(w>w_0\), which implies that \(\phi _\textrm{TS}(w)\le \phi _\textrm{TS}(w_0)\) for any \(w\ge 0\). The function \(\phi _\textrm{TS}(w)\) is illustrated in Fig. 1.

Fig. 1 The function \(\phi _\textrm{TS}(w)\) for \(c=p-2=1\)
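
The location and size of this maximum can be checked numerically; the following short sketch (our own code) uses \(c=1\) as in Fig. 1.

```python
# Numerical check (our own) of the maximizer w_0 = c/(2^{1/3}-1) and of the bound
# phi_TS(w_0) = 2^{2/3}/(3 - 2^{2/3}) c, approximately 1.12 c, for c = 1 as in Fig. 1.
import numpy as np

def phi_ts(w, c):
    return c * w * (1.0 + c / w)**2 / (w + c * (1.0 + c / w)**2)

c = 1.0
w = np.linspace(1e-3, 50.0, 500_000)
vals = phi_ts(w, c)
print(w[np.argmax(vals)], c / (2**(1/3) - 1))       # maximizer: both about 3.847
print(vals.max() / c, 2**(2/3) / (3 - 2**(2/3)))    # maximum divided by c: both about 1.124
```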

An appropriate choice of c is \(c=p-2\), because both estimators \(\widehat{{\varvec{\theta }}}^\textrm{OS}\) and \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) with \(c=p-2\) are minimax under the normal distribution as shown below. When \(c=p-2\), \(\widehat{{\varvec{\theta }}}^\textrm{OS}\) resembles the James–Stein estimator \(\widehat{{\varvec{\theta }}}^\textrm{JS}\) in (1.1), while \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) is close to the positive-part Stein estimator \(\widehat{{\varvec{\theta }}}^\textrm{PS}\) in (1.2). Although \(\widehat{{\varvec{\theta }}}^\textrm{PS}\) is not differentiable at some point, \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) is differentiable. We compare the corresponding shrinkage coefficients of \({\varvec{X}}\) in \(\widehat{{\varvec{\theta }}}^\textrm{PS}\) and \(\widehat{{\varvec{\theta }}}^\textrm{TS}\). Let

$$\begin{aligned} \psi ^\textrm{PS}(w)=&\max (0, 1-c/w),\\ \psi ^\textrm{TS}(w)=&1-{\phi _\textrm{TS}(w)\over w}={w\over w + c(1+c/w)^2}, \end{aligned}$$

for \(c=p-2\). These values for some small w are given in Table 1. For \(w\le 1.5\), \(\widehat{{\varvec{\theta }}}^\textrm{PS}\) is more shrunken than \(\widehat{{\varvec{\theta }}}^\textrm{TS}\), but less shrunken for \(w\ge 2\). Numerical investigations given below demonstrate that \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) is more efficient than \(\widehat{{\varvec{\theta }}}^\textrm{PS}\) for small \(\Vert {\varvec{\theta }}\Vert ^2\).

Table 1 Values of shrinkage coefficients of \({\varvec{X}}\) in \(\widehat{{\varvec{\theta }}}^\textrm{PS}\) and \(\widehat{{\varvec{\theta }}}^\textrm{TS}\)
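
The comparison in Table 1 can be reproduced with the following sketch (our own code; the grid of w values and the choice \(c=1\), as in Fig. 1, are illustrative assumptions).

```python
# Shrinkage coefficients psi_PS and psi_TS of X for c = p - 2 = 1 (our own sketch;
# the w grid is an assumption, not necessarily the one used in Table 1).
def psi_ps(w, c):
    return max(0.0, 1.0 - c / w)

def psi_ts(w, c):
    return w / (w + c * (1.0 + c / w)**2)

c = 1.0
for w in (0.5, 1.0, 1.5, 2.0, 3.0, 5.0):
    print(f"w={w:4.1f}  PS={psi_ps(w, c):.3f}  TS={psi_ts(w, c):.3f}")
```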

It is noted that normal distributions are not used in the derivation of \(\widehat{{\varvec{\theta }}}^\textrm{OS}\) and \(\widehat{{\varvec{\theta }}}^\textrm{TS}\). Under normality, however, we can investigate the minimaxity of these estimators. All the equivariant estimators under orthogonal transformations are of the form

$$\begin{aligned} \widehat{{\varvec{\theta }}}_ \phi ={\varvec{X}}- {\phi \left( \Vert {\varvec{X}}\Vert ^2\right) \over \Vert {\varvec{X}}\Vert ^2}{\varvec{X}}, \end{aligned}$$

for function \(\phi (\cdot )\). When estimator \(\widehat{{\varvec{\theta }}}\) is evaluated by the risk function \(R({\varvec{\theta }},\widehat{{\varvec{\theta }}})=\textrm{E}[\Vert \widehat{{\varvec{\theta }}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2]\), Stein (1973, 1981) showed that the unbiased estimator of the risk difference \({\Delta }_\phi =R({\varvec{\theta }},\widehat{{\varvec{\theta }}}_\phi )-R({\varvec{\theta }},{\varvec{X}})\) is

$$\begin{aligned} \widehat{{\Delta }}_\phi (W) = \left\{ \phi (W)^2-2(p-2)\phi (W)\right\} /W - 4\phi '(W), \end{aligned}$$
(2.9)

for \(W=\Vert {\varvec{X}}\Vert ^2\). Baranchik (1970) demonstrated that \(\widehat{{\varvec{\theta }}}_\phi \) is minimax if (a) \(\phi (w)\) is nondecreasing and (b) \(0\le \phi (w)\le 2(p-2)\). A class of estimators satisfying these conditions is called Baranchik’s class. It can be easily seen that \(\widehat{{\varvec{\theta }}}^\textrm{OS}\) belongs to Baranchik’s class of minimax estimators for \(0<c\le 2(p-2)\). The estimator \(\widehat{{\varvec{\theta }}}^\textrm{TS}\), however, does not belong to Baranchik’s class, because the monotonicity of \(\phi _\textrm{TS}(w)\) is not satisfied, as shown above. Thus, we need to investigate the unbiased estimator of the risk directly.

Theorem 2.1

Under the normality with \({\varvec{X}}\sim \mathcal {N}_p({\varvec{\theta }}, {\sigma }^2{\varvec{I}})\), the two-step shrinkage estimator \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) is minimax relative to the quadratic loss \(\Vert \widehat{{\varvec{\theta }}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2\) if \(0<c\le (3/2)(p-2)\) for \(p\ge 3\).

Proof

Without loss of generality, we set \({\sigma }^2=1\). We shall show that \(\widehat{{\Delta }}_{\phi _\textrm{TS}}(w)\le 0\) for any nonnegative w, where

$$\begin{aligned} \widehat{{\Delta }}_{\phi _\textrm{TS}}(W) = \left\{ \phi _\textrm{TS}(W)^2-2(p-2)\phi _\textrm{TS}(W)\right\} /W - 4\phi _\textrm{TS}'(W). \end{aligned}$$

Using the derivative given in (2.7), we can write \(\widehat{{\Delta }}_{\phi _\textrm{TS}}(w)\) as

$$\begin{aligned} \widehat{{\Delta }}_{\phi _\textrm{TS}}(w)&= {\left\{ \phi _\textrm{TS}(w)\right\} ^2\over w}\left[ 1 - {2(p-2)\over \phi _\textrm{TS}(w)}-{4\over w}\left\{ 1- {2\over (1+c/w)^3}\right\} \right] \\&={\{\phi _\textrm{TS}(w)\}^2\over w^2(1+c/w)^3}\left[ w(1+c/w)^3 - {2(p-2)\over c}(1+c/w)\left\{ w+c(1+c/w)^2\right\} \right. \\&\quad \left. -4(1+c/w)^3+8\right] , \end{aligned}$$

which is further rewritten as

$$\begin{aligned} \widehat{{\Delta }}_{\phi _\textrm{TS}}(w)&={\{\phi _\textrm{TS}(w)\}^2\over w^2(1+c/w)^3}\left[ w(1+c/w)^3 \!-\! {2(p-2)\over c}(w\!+\!c) -2p (1+c/w)^3+8\right] \\&= {\{\phi _\textrm{TS}(w)\}^2\over w^5(1+c/w)^3}g(w), \end{aligned}$$

where

$$\begin{aligned}{} & {} g(w)=\{1-2(p-2)/c\}w^4\\{} & {} \quad +(3c-4p+12)w^3+3c(c-2p)w^2+c^2(c-6p)w-2pc^3. \end{aligned}$$

Then, g(w) is rewritten as

$$\begin{aligned} g(w)=&\left\{ 1-{2(p-2)\over c}\right\} \left\{ w+ {3c-4p+12\over 2\{1-2(p-2)/c\}}\right\} ^2w^2\\&+ \left\{ {(3c-4p+12)^2\over 4\{2(p-2)/c-1\}}+3c(c-2p) \right\} w^2+c^2(c-6p)w-2pc^3. \end{aligned}$$

Note that \(1-{2(p-2)/ c}<0\), \(c-2p<0\) and \(c-6p<0\) for \(0<c\le (3/2)(p-2)\). Thus, it is sufficient to show that the coefficient of \(w^2\) is not positive, namely, \(h(c)\le 0\) for

$$\begin{aligned} h(c)=(3c-4p+12)^2 + 12\{2(p-2)-c\}(c-2p). \end{aligned}$$

Let \(d=(3/2)(p-2)-c\). Substituting \(c=(3/2)(p-2)-d\) into h(c), we have

$$\begin{aligned} h\left( {3(p-2)\over 2}-d\right)&=(p/2+3-3d)^2-3(p-2+2d)(p+6+2d)\\&= -3d^2-3(5p+14)d-{1\over 4}(p+6)(11p-30), \end{aligned}$$

which is not positive for \(p\ge 3\) and \(d\ge 0\). Thus, one can see that \(\widehat{{\Delta }}_{\phi _\textrm{TS}}(w)\le 0\) for any \(w\ge 0\) when \(0<c\le (3/2)(p-2)\), and the minimaxity of \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) is established. \(\square \)
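
The analytical bound can also be checked numerically through the unbiased risk estimate (2.9); the following sketch (our own code, using a numerical derivative) evaluates \(\widehat{{\Delta }}_{\phi _\textrm{TS}}\) on a grid of w, where it should be nonpositive for \(0<c\le (3/2)(p-2)\).

```python
# Numerical check (our own sketch) of Theorem 2.1 via the unbiased risk estimate (2.9):
# Delta_hat should be <= 0 on a grid of w when 0 < c <= (3/2)(p - 2).
import numpy as np

def phi_ts(w, c):
    return c * w * (1.0 + c / w)**2 / (w + c * (1.0 + c / w)**2)

def delta_hat(w, p, c, eps=1e-6):
    dphi = (phi_ts(w + eps, c) - phi_ts(w - eps, c)) / (2 * eps)   # numerical phi'(w)
    return (phi_ts(w, c)**2 - 2 * (p - 2) * phi_ts(w, c)) / w - 4 * dphi

p = 6
for c in (p - 2, 1.5 * (p - 2)):
    w = np.linspace(0.1, 500.0, 20_000)
    print(c, np.max(delta_hat(w, p, c)) <= 0.0)   # expected: True for both values of c
```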

We investigate the performance of the suggested shrinkage estimators and compare their risks with some existing shrinkage estimators in the model (2.1). The estimators we compare are the James–Stein estimator \(\widehat{{\varvec{\theta }}}^\textrm{JS}\) in (1.1), the positive-part Stein estimator \(\widehat{{\varvec{\theta }}}^\textrm{PS}\) in (1.2), the generalized Bayes estimator \(\widehat{{\varvec{\theta }}}^\textrm{GB}\) against the shrinkage prior \(\Vert {\varvec{\theta }}\Vert ^{2-p}\), given in (1.3), the one-step and two-step shrinkage estimators \(\widehat{{\varvec{\theta }}}^\textrm{OS}\) and \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) with \(c=p-2\), which are denoted by JS, PS, GB, OS and TS, respectively. Since these estimators are orthogonally equivariant, their risks depend on \({\varvec{\theta }}\) only through \(\Vert {\varvec{\theta }}\Vert ^2\). Thus, the simulation experiments are conducted for \({\varvec{\theta }}=(k/3){\varvec{I}}\) for \(k=0, \ldots , 9\) and \({\sigma }=1\).

In the normal distribution \(\mathcal {N}({\varvec{\theta }}, {\varvec{I}})\), the exact values of the risk functions can be obtained by using numerical integration based on the noncentral chi-squared distribution. These values are illustrated in Fig. 2 and given in Table 2 for \(p=6\) and \({\lambda }=\Vert {\varvec{\theta }}\Vert ^2\).

Fig. 2 Risk functions of the five estimators in the normal distribution for \(p=6\) and \({\lambda }=\Vert {\varvec{\theta }}\Vert ^2\)

Table 2 Exact values of risk functions of the five estimators in the normal distribution

We also investigate the performance when the underlying distributions of \({\varepsilon }_i\) are Student t-distribution with 5 degrees of freedom, logistic distribution and double exponential distribution, where these probability density functions are, respectively, given by

$$\begin{aligned} f(x)&= {{\Gamma }((n+1)/2)\over \sqrt{n\pi }{\Gamma }(n/2)}{1\over (1+x^2/n)^{(n+1)/2}}\ \textrm{for}\ n=5,\\ f(x)&= {e^{-x}\over (1+e^{-x})^2},\\ f(x)&= {1\over 2}e^{-|x|}. \end{aligned}$$

In these cases, the average losses based on 10,000 replications are calculated as approximations of the risk functions and reported in Table 3.
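
The simulation design can be sketched as follows (our own Python code rather than the Ox program used for the reported tables; the seed and the positive-part estimator shown are only for illustration).

```python
# A Monte Carlo sketch (our own; the paper's tables were produced with Ox) of the
# average-loss approximation of the risk under model (2.1) with sigma = 1.
import numpy as np

rng = np.random.default_rng(0)

def draw_errors(dist, p):
    if dist == "t5":
        return rng.standard_t(5, size=p)
    if dist == "logistic":
        return rng.logistic(size=p)
    if dist == "laplace":                      # double exponential
        return rng.laplace(size=p)
    return rng.standard_normal(p)

def mc_risk(estimator, theta, dist, n_rep=10_000):
    loss = 0.0
    for _ in range(n_rep):
        x = theta + draw_errors(dist, theta.size)
        loss += np.sum((estimator(x) - theta)**2)
    return loss / n_rep

ps = lambda x: max(0.0, 1.0 - (x.size - 2) / np.sum(x**2)) * x   # positive-part Stein
theta = (3 / 3) * np.ones(6)                                     # p = 6, k = 3
print(mc_risk(ps, theta, "t5"))
```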

From Fig. 2 and Tables 2 and 3, it is seen that the minimaxity of the five estimators is robust under the t-, logistic and double exponential distributions. It is also revealed that OS is close to, but worse than, JS, and that TS is close to, but better than, PS in most cases. As justified theoretically by Kubokawa (1991, 1994), GB is better than JS under normality. However, it is confirmed numerically that this property does not hold for the \(t_5\)- and logistic distributions with smaller \(\Vert {\varvec{\theta }}\Vert ^2\). Comparing TS and PS, we can see that TS is better than PS except for the cases of larger \(\Vert {\varvec{\theta }}\Vert ^2\) in the normal distribution. TS is the best of the five in most cases. In particular, TS provides the smallest risk of the five at \(\Vert {\varvec{\theta }}\Vert ^2=0\). TS is differentiable and its performance is recommendable in this simulation setup. However, its admissibility is not guaranteed.

Table 3 Average losses of five shrinkage estimators for \(p=6\) and \({\varvec{\theta }}=(k/3){\varvec{I}}\)

3 Extension to the case of unknown scale

We consider extending the result of the previous section to the case of an unknown scale parameter. To this end, let us consider the random sampling model

$$\begin{aligned} Y_{ij} = {\theta }_i + \tau {\varepsilon }_{ij}, \quad i=1, \ldots , p,\ j=1, \ldots , m, \end{aligned}$$
(3.1)

where the \({\varepsilon }_{ij}\)’s are independently and identically distributed with \(\textrm{E}[{\varepsilon }_{ij}]=0\) and \(\textrm{Var}({\varepsilon }_{ij})<\infty \), and \(\tau \) is an unknown scale parameter. Let \(X_i=m^{-1}\sum _{j=1}^mY_{ij}\), \(S=\sum _{i=1}^p\sum _{j=1}^m(Y_{ij}-X_i)^2/m\), \({\sigma }^2=\tau ^2/m\) and \(n=p(m-1)\). Then, \(X_1, \ldots , X_p\) are independent random variables with \(\textrm{E}[X_i]={\theta }_i\) and \(\textrm{Var}(X_i)={\sigma }^2\), \(i=1, \ldots , p\), and \(\textrm{E}[S]=p(m-1)\tau ^2/m=n{\sigma }^2\).

Consider the estimation of \({\varvec{\theta }}=({\theta }_1, \ldots , {\theta }_p)^\top \) based on \({\varvec{X}}=(X_1, \ldots , X_p)^\top \) and S. Let \(W=\Vert {\varvec{X}}\Vert ^2/S\). Replacing \(\Vert {\varvec{X}}\Vert ^2/{\sigma }^2\) in (2.3) and (2.5) with W, we have

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{OS}=&\left( 1-{c\over W + c}\right) {\varvec{X}}, \end{aligned}$$
(3.2)
$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{TS}=&\left( 1-{c(1+c/W)^2\over W + c(1+c/W)^2}\right) {\varvec{X}}. \end{aligned}$$
(3.3)
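
A minimal sketch (our own code) of (3.2) and (3.3) follows; here \(W=\Vert {\varvec{X}}\Vert ^2/S\), and the value of c is left as an argument.

```python
# A sketch (our own) of the unknown-scale estimators (3.2) and (3.3);
# here W = ||X||^2 / S and c is typically (p - 2)/(n + 2).
import numpy as np

def theta_os_unknown(x, s, c):
    w = np.sum(x**2) / s
    return (1.0 - c / (w + c)) * x               # (3.2)

def theta_ts_unknown(x, s, c):
    w = np.sum(x**2) / s
    a = c * (1.0 + c / w)**2
    return (1.0 - a / (w + a)) * x               # (3.3)
```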

Estimator \(\widehat{{\varvec{\theta }}}\) is evaluated in terms of the risk function given by \(R({\varvec{\theta }},\widehat{{\varvec{\theta }}})=\textrm{E}[\Vert \widehat{{\varvec{\theta }}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2]\). Under the normality, we can show minimaxity of the suggested estimators. When we consider estimators of the general form

$$\begin{aligned} \widehat{{\varvec{\theta }}}_\phi ={\varvec{X}}-{\phi (W)\over W}{\varvec{X}}, \end{aligned}$$

for function \(\phi (w)\), Efron and Morris (1976a) showed that the unbiased estimator of the risk difference \({\Delta }=R({\varvec{\theta }},\widehat{{\varvec{\theta }}}_\phi )-R({\varvec{\theta }},{\varvec{X}})\) is

$$\begin{aligned} \widehat{{\Delta }}_\phi (W) = (n+2) \left\{ \phi (W)^2-2a\phi (W)\right\} /W - 4\left\{ 1+\phi (W)\right\} \phi '(W) \end{aligned}$$
(3.4)

for \(a=(p-2)/(n+2)\). Baranchik’s class of minimax estimators is characterized by (a) \(\phi (w)\) is nondecreasing and (b) \(0\le \phi (w)\le 2a=2(p-2)/(n+2)\). The one-step shrinkage estimator \(\widehat{{\varvec{\theta }}}^\textrm{OS}\) belongs to Baranchik’s class for \(0<c\le 2a\). Concerning \(\widehat{{\varvec{\theta }}}^\textrm{TS}\), however, it does not belong to Baranchik’s class, similarly to the case of Sect. 2.

Theorem 3.1

Under the normality with \({\varvec{X}}\sim \mathcal {N}_p({\varvec{\theta }}, {\sigma }^2{\varvec{I}})\) and \(S/{\sigma }^2\sim \chi _n^2\), the two-step shrinkage estimator \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) is minimax relative to the quadratic loss \(\Vert \widehat{{\varvec{\theta }}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2\) if \(0<c\le (4/3)(p-2)/(n+2)\) for \(p\ge 3\).

Proof

It is noted that \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) is written as \(\widehat{{\varvec{\theta }}}^\textrm{TS}=\{1-\phi _\textrm{TS}(W)/W\}{\varvec{X}}\) for \(\phi _\textrm{TS}(w)\) given in (2.6). From (3.4), the unbiased estimator of the risk difference \({\Delta }=R({\varvec{\theta }},\widehat{{\varvec{\theta }}}^\textrm{TS})-R({\varvec{\theta }},{\varvec{X}})\) is

$$\begin{aligned} \widehat{{\Delta }}_{\phi _\textrm{TS}}(W) = (n+2) \left\{ \phi _\textrm{TS}(W)^2-2a\phi _\textrm{TS}(W)\right\} /W - 4\left\{ 1+\phi _\textrm{TS}(W)\right\} \phi _\textrm{TS}'(W) \end{aligned}$$

for \(a=(p-2)/(n+2)\). We shall show that \(\widehat{{\Delta }}_{\phi _\textrm{TS}}(w)\le 0\) for any nonnegative w. The proof is harder than that of Theorem 2.1, because of the term \(\{1+\phi _\textrm{TS}(W)\}\phi _\textrm{TS}'(W)\).

From the arguments around (2.8), for \(c\le (4/3)a\), it is observed that \(\phi _\textrm{TS}(w_0) \le 1.13 c \le 1.13\times {4\over 3}a=1.507 a < 2 a\), which implies that \(\phi _\textrm{TS}(w)\le \phi _\textrm{TS}(w_0) < 2a\). Then,

$$\begin{aligned} {\phi _\textrm{TS}(w)^2 - 2a \phi _\textrm{TS}(w) \over 1+\phi _\textrm{TS}(w)}\le {\phi _\textrm{TS}(w)^2 - 2a \phi _\textrm{TS}(w) \over 1+\phi _\textrm{TS}(w_0)}, \end{aligned}$$

which is used to rewrite \(\widehat{{\Delta }}(w)\) as

$$\begin{aligned} \widehat{{\Delta }}_{\phi _\textrm{TS}}(w)&= \{1+\phi _\textrm{TS}(w)\} \left\{ (n+2) {\phi _\textrm{TS}(w)^2-2a\phi _\textrm{TS}(w) \over w\{1+\phi _\textrm{TS}(w)\}} - 4\phi _\textrm{TS}'(w)\right\} \\&\le \{1+\phi _\textrm{TS}(w)\} \left\{ (n+2) {\phi _\textrm{TS}(w)^2-2a\phi _\textrm{TS}(w) \over w\{1+\phi _\textrm{TS}(w_0)\}} - 4\phi _\textrm{TS}'(w)\right\} \\&={1+\phi _\textrm{TS}(w)\over 1+\phi _\textrm{TS}(w_0)}(n+2)\widetilde{{\Delta }}(w), \end{aligned}$$

where

$$\begin{aligned} \widetilde{{\Delta }}(w)= {\phi _\textrm{TS}(w)^2-2a\phi _\textrm{TS}(w) \over w} - 4{1+\phi _\textrm{TS}(w_0)\over n+2}\phi _\textrm{TS}'(w). \end{aligned}$$

Let \(A=\{1+\phi _\textrm{TS}(w_0)\}/(n+2)\) for simplicity. Using the same arguments as in the proof of Theorem 2.1, we can see that

$$\begin{aligned} \widetilde{{\Delta }}(w)={\{\phi _\textrm{TS}(w)\}^2\over w}\left[ 1 - {2a\over \phi _\textrm{TS}(w)}-{4A\over w}\left\{ 1- {2\over (1+c/w)^3}\right\} \right] , \end{aligned}$$

which is rewritten as

$$\begin{aligned}&{\{\phi _\textrm{TS}(w)\}^2\over w^2(1+c/w)^3}\left[ w(1+c/w)^3 - {2a\over c}(1+c/w)\left\{ w+c(1+c/w)^2\right\} -4A(1+c/w)^3+8A\right] \\&\quad ={\left\{ \phi _\textrm{TS}(w)\right\} ^2\over w^5(1+c/w)^3}\left[ w(w+c)^3 - {2a\over c}(w+c)\left\{ w^3+c(w+c)^2\right\} -4A(w+c)^3+8Aw^3\right] . \end{aligned}$$

Thus, we have

$$\begin{aligned} \widetilde{{\Delta }}(w)= {\{\phi _\textrm{TS}(w)\}^2\over w^5(1+c/w)^3}g(w), \end{aligned}$$

where

$$\begin{aligned} g(w)= & {} (1-2a/c)w^4+(3c-4a+4A)w^3\\{} & {} +3c(c-2a-4A)w^2+c^2(c-6a-12A)w-2(a+2A)c^3. \end{aligned}$$

Then, g(w) is rewritten as

$$\begin{aligned} g(w)=&\left( 1-{2a\over c}\right) \left\{ w+ {3c-4a+4A\over 2(1-2a/c)}\right\} ^2w^2\\&+ \left\{ {(3c-4a+4A)^2\over 4(2a/c-1)}+3c(c-2a-4A)\right\} w^2\\&+c^2(c-6a-12A)w-2(a+2A)c^3. \end{aligned}$$

Note that \(1-{2a/ c}<0\), \(c-2a-4A<0\) and \(c-6a-12A<0\) for \(0<c\le (4/3)a\). Thus, it is sufficient to show that the coefficient of \(w^2\) is not positive, namely, \(h(c)\le 0\) for

$$\begin{aligned} h(c)=(3c-4a+4A)^2 + 12(2a-c) (c-2a-4A). \end{aligned}$$

Let \(d=(4/3)a-c\). Substituting \(c=(4/3)a-d\) into h(c), we have

$$\begin{aligned} h\left( {4a\over 3}-d\right)&=(4A-3d)^2-12\left( {2\over 3}a+d\right) \left( {2\over 3}a+4A+d\right) \\&= -3d^2-24\left( {2\over 3}a+3A\right) d-8\left( {2\over 3}a^2 + 4aA - 2A^2\right) . \end{aligned}$$

We can verify that \((2/3)a^2 + 4aA - 2A^2\ge 0\) for \(p\ge 3\) and \(d\ge 0\). In fact, it is observed that

$$\begin{aligned} {2\over 3}a^2 + 4aA - 2A^2= A^2 \left\{ {2\over 3}\left( {a\over A}\right) ^2 + 4{a\over A}-2\right\} , \end{aligned}$$

and from (2.8),

$$\begin{aligned} {a\over A}&={p-2\over 1+\phi _\textrm{TS}(w_0)} \ge {p-2\over 1+ 1.13c} \ge {p-2\over 1+ 1.13\times (4/3)(p-2)/(n+2)}\\&\ge {1\over 1+1.13\times (4/9)}>0.66. \end{aligned}$$

From this inequality, it follows that

$$\begin{aligned} {2\over 3}a^2 + 4aA - 2A^2\ge A^2\left\{ {2\over 3}\times 0.66^2+4\times 0.66 -2\right\} > 0. \end{aligned}$$

Thus, one can see that \(\widetilde{{\Delta }}(w)\le 0\) or \(\widehat{{\Delta }}_{\phi _\textrm{TS}}(w)\le 0\) for any \(w\ge 0\) when \(0<c\le (4/3)(p-2)/(n+2)\), and the minimaxity of \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) is established. \(\square \)

We investigate the performance of the suggested shrinkage estimators in the random sampling model (3.1). As the distribution of \({\varepsilon }_{ij}\), we consider the standard normal distribution, the t-distribution with 5 degrees of freedom, the logistic distribution and the double exponential distribution. Although \({\varvec{X}}\) and S are independent under normality, they are not independent for the other distributions.

Let \(W=\Vert {\varvec{X}}\Vert ^2/S\) and \(c=(p-2)/(n+2)\). The estimators we compare are the James–Stein estimator \(\widehat{{\varvec{\theta }}}^\textrm{JS}=(1-c/W){\varvec{X}}\), the positive-part Stein estimator \(\widehat{{\varvec{\theta }}}^\textrm{PS}=\max (0, 1-c/W){\varvec{X}}\), and the one-step and two-step shrinkage estimators \(\widehat{{\varvec{\theta }}}^\textrm{OS}\) and \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) with \(c=(p-2)/(n+2)\), which are denoted by JS, PS, OS and TS, respectively. Based on simulation experiments with 50,000 replications for \({\varvec{\theta }}=(k/3){\varvec{I}}\) for \(k=0, \ldots , 9\) and \(\tau =1\), their average losses are calculated as approximations of the risk functions and reported in Table 4.
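
For completeness, the statistics entering these estimators can be computed from the data of model (3.1) as follows (our own sketch; normal errors and \(\tau =1\) are used only for illustration).

```python
# A sketch (our own) of the statistics in model (3.1): the means X_i, the pooled
# statistic S = sum_ij (Y_ij - X_i)^2 / m, and W = ||X||^2 / S, with (p, m) = (6, 3).
import numpy as np

rng = np.random.default_rng(0)
p, m = 6, 3
n = p * (m - 1)
theta = (3 / 3) * np.ones(p)                      # k = 3
y = theta[:, None] + rng.standard_normal((p, m))  # tau = 1; normal errors for illustration
x = y.mean(axis=1)                                # X_i = mean of the i-th row
s = np.sum((y - x[:, None])**2) / m               # S
w = np.sum(x**2) / s                              # W = ||X||^2 / S
c = (p - 2) / (n + 2)
print(w, c)
```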

Overall, Table 4 shows that the performances of JS, PS, OS and TS are similar to the case of known scale in Tables 2 and 3. Comparing OS and TS, we can see that TS is better for smaller \(\Vert {\varvec{\theta }}\Vert ^2\), but OS is slightly better for larger \(\Vert {\varvec{\theta }}\Vert ^2\). Comparing the values at \(\Vert {\varvec{\theta }}\Vert ^2=0\) in Tables 2, 3 and 4, we also see that the cases of known scale give smaller risks than the cases of unknown scale in the normal distribution, but the cases of unknown scale give smaller risks in the non-normal distributions.

Table 4 Average losses of four shrinkage estimators for \((p, m, n)=(6,3,12)\) and \({\varvec{\theta }}=(k/3){\varvec{I}}\)

4 Estimation of means of discrete distributions

We here consider the simultaneous estimation of means of discrete distributions. Let \(X_1, \ldots , X_p\) be independent nonnegative random variables with \(\textrm{E}[X_i]={\theta }_i\) and finite variances. In the estimation of \({\varvec{\theta }}=({\theta }_1, \ldots , {\theta }_p)^\top \) based on \({\varvec{X}}=(X_1, \ldots , X_p)^\top \), we consider the minimization of

$$\begin{aligned} L({\varvec{\theta }}) = \sum _{i=1}^p\Big \{ X_i\log \Big ({X_i\over {\theta }_i}\Big )-X_i+{\theta }_i\Big \} + c\log \Big ( \sum _{j=1}^p{\theta }_j\Big ), \end{aligned}$$

for positive constant c. The first term of \(L({\varvec{\theta }})\) is a special case of the Bregman loss treated in Banerjee et al. (2005), and is often referred to as a generalized Kullback–Leibler loss. The partial derivative of \(L({\varvec{\theta }})\) with respect to \({\theta }_i\) is

$$\begin{aligned} {\partial L({\varvec{\theta }})\over \partial {\theta }_i}=-{X_i\over {\theta }_i}+1+{c\over \sum _{j=1}^p{\theta }_j}, \end{aligned}$$

and \(\partial L({\varvec{\theta }})/ \partial {\theta }_i=0\) gives the equation

$$\begin{aligned} {\theta }_i = {\sum _{j=1}^p{\theta }_j\over \sum _{j=1}^p{\theta }_j+c}X_i = X_i - {c\over \sum _{j=1}^p{\theta }_j+c}X_i. \end{aligned}$$

Replacing \(\sum _{j=1}^p{\theta }_j\) with \(Z=\sum _{j=1}^pX_j\), we have the one-step shrinkage estimator

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{OS}={\varvec{X}}- {c \over Z + c}{\varvec{X}}. \end{aligned}$$
(4.1)

Similarly to (2.4), we have

$$\begin{aligned} {\theta }_i = X_i - {c \over \left\{ \sum _{j=1}^p{\theta }_j/(\sum _{j=1}^p{\theta }_j+c)\right\} \sum _{j=1}^pX_j+c}X_i. \end{aligned}$$

Replacing \(\sum _{j=1}^p{\theta }_j\) with \(Z=\sum _{j=1}^pX_j\), we have the two-step shrinkage estimator

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{TS}&= {\varvec{X}}- {c\over Z^2/(Z+c)+c}{\varvec{X}}= {\varvec{X}}-{c(Z+c)\over Z^2+cZ+c^2}{\varvec{X}}\nonumber \\&= {\varvec{X}}- {c(1+c/Z)\over Z+c(1+c/Z)}{\varvec{X}}. \end{aligned}$$
(4.2)

These estimators belong to the class of estimators

$$\begin{aligned} \widehat{{\varvec{\theta }}}_\phi = {\varvec{X}}- {\phi (Z)\over Z+c}{\varvec{X}}, \end{aligned}$$

for positive function \(\phi (\cdot )\). It is noted that the one-step shrinkage estimator \(\widehat{{\varvec{\theta }}}^\textrm{OS}\) with \(c=p-1\) is identical to the shrinkage estimator suggested by Clevenson and Zidek (1975), denoted by \(\widehat{{\varvec{\theta }}}^\textrm{CZ}\).
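
A short sketch (our own code) of (4.1) and (4.2) follows; with the default \(c=p-1\), the one-step version coincides with the Clevenson–Zidek estimator.

```python
# A sketch (our own) of the count-data estimators (4.1) and (4.2); the default c = p - 1
# makes the one-step estimator the Clevenson-Zidek estimator.
import numpy as np

def theta_os_count(x, c=None):
    c = (x.size - 1) if c is None else c
    z = np.sum(x)
    return x - c / (z + c) * x                                    # (4.1)

def theta_ts_count(x, c=None):
    c = (x.size - 1) if c is None else c
    z = np.sum(x)
    if z == 0:
        return x.astype(float)                                    # all counts zero: nothing to shrink
    return x - c * (1.0 + c / z) / (z + c * (1.0 + c / z)) * x    # (4.2)
```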

We demonstrate that \(\widehat{{\varvec{\theta }}}^\textrm{OS}\) and \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) improve on \({\varvec{X}}\) relative to the loss \(\sum _{i=1}^p(\hat{{\theta }}_i-{\theta }_i)^2/{\theta }_i\) when \(X_i\) has the Poisson distribution with mean \({\theta }_i\), \(Po({\theta }_i)\), for \(i=1, \ldots , p\). Clevenson and Zidek (1975) showed that \(\widehat{{\varvec{\theta }}}^\textrm{CZ}\) improves on \({\varvec{X}}\) and Tsui (1984) proved that this improvement still holds for negative binomial distributions. Using the same arguments as in Clevenson and Zidek (1975), we can show that the unbiased estimator of the risk difference \({\Delta }=R({\varvec{\theta }}, \widehat{{\varvec{\theta }}}_\phi )-R({\varvec{\theta }},{\varvec{X}})\) for \(X_i\sim Po({\theta }_i)\), \(i=1, \ldots , p\), is

$$\begin{aligned} \widehat{{\Delta }}(Z)=-2\left\{ {(Z+p)\phi (Z+1)\over Z+c+1}-{Z\phi (Z)\over Z+c}\right\} +{(Z+p)\phi ^2(Z+1)\over (Z+c+1)^2}. \end{aligned}$$
(4.3)

Then, it can be seen that \(\widehat{{\Delta }}(z)\le 0\) if (a) \(\phi (z)\) is nondecreasing and (b) \(0\le \phi (z)\le 2c\) for \(0<c\le p-1\) and \(0\le \phi (z)\le 2(p-1)\) for \(c>p-1\). In fact, since \(\phi (z)\le \phi (z+1)\) from (a), we have

$$\begin{aligned} \widehat{{\Delta }}(z)\le \left\{ -2{(p-1)z+pc\over z+c}+{z+p\over z+c+1}\phi (z+1)\right\} {\phi (z+1)\over z+c+1}, \end{aligned}$$

so that \(\widehat{{\Delta }}(z)\le 0\) if

$$\begin{aligned} \phi (z+1) \le 2{\{(p-1)z+pc\}(z+c+1)\over (z+p)(z+c)}. \end{aligned}$$

This inequality is satisfied by condition (b).

Concerning the one-step shrinkage estimator, it corresponds to \(\phi (z)=c\), so that it is verified that \(\widehat{{\varvec{\theta }}}^\textrm{OS}\) improves on \({\varvec{X}}\) if \(0< c\le 2(p-1)\) for \(p\ge 2\). For the two-step shrinkage estimator, however, it is harder to show the improvement, because the corresponding function \(\phi (z)\) is not nondecreasing. To further investigate, we restrict ourselves to \(c=p-1\) for simplicity.

Theorem 4.1

When \(X_i \sim Po({\theta }_i)\) for \(i=1, \ldots , p\), the two-step shrinkage estimator \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) with \(c=p-1\) improves on \({\varvec{X}}\) relative to the quadratic loss \(\sum _{i=1}^p(\hat{{\theta }}_i-{\theta }_i)^2/{\theta }_i\) for \(p\ge 2\).

Proof

Let \(c=p-1\) or \(p=c+1\). It is noted that \(c\ge 1\) for \(p\ge 2\). From (4.3),

$$\begin{aligned} \widehat{{\Delta }}(z)=-2\left\{ \phi (z+1)-{z\phi (z)\over z+c}\right\} +{\phi ^2(z+1)\over z+c+1}, \end{aligned}$$

where

$$\begin{aligned} \phi (z)={c(z+c)^2\over z^2+cz+c^2}. \end{aligned}$$

Since the derivative of \(\phi (z)\) is

$$\begin{aligned} \phi '(z)={c^2(c-z)(c+z)\over (z^2+cz+c^2)^2}, \end{aligned}$$

\(\phi (z)\) is increasing for \(z\le c\) and decreasing for \(z>c\). This means that

$$\begin{aligned} \phi (z) \le \phi (c) = {4\over 3}c. \end{aligned}$$

In the case of \(z\le c\), since \(\phi (z)\le \phi (z+1)\), we have

$$\begin{aligned} \widehat{{\Delta }}(z)\le&\left\{ -2 {c\over z+c}+{\phi (z+1)\over z+c+1}\right\} \phi (z+1) \le \{-2c+\phi (z+1)\}{\phi (z+1)\over z+c+1}\\ \le&\left\{ -2c+{4\over 3}c \right\} {\phi (z+1)\over z+c+1}<0. \end{aligned}$$

In the case of \(z>c\), note that \(\phi (z)> \phi (z+1)\). \(\widehat{{\Delta }}(z)\) is rewritten as

$$\begin{aligned} \widehat{{\Delta }}(z)&= 2\{\phi (z)-\phi (z+1)\} -2 {c\phi (z)\over z+c}+{\phi ^2(z+1)\over z+c+1}\\&\le 2\{\phi (z)-\phi (z+1)\} -2 {c\phi (z+1)\over z+c}+{\phi ^2(z+1)\over z+c+1}. \end{aligned}$$

It is here observed that

$$\begin{aligned} \phi (z)-\phi (z+1)&= {c(z+c)^2\over z^2+cz+c^2} - {c(z+c+1)^2\over (z+1)^2+c(z+1)+c^2} \\&= {c^2(z^2+z-c^2)\over \{(z+1)^2+c(z+1)+c^2\}\{z^2+cz+c^2\}}\\&= {c(z^2+z-c^2)\phi (z+1)\over (z+c+1)^2\{z^2+cz+c^2\}}, \end{aligned}$$

so that \(\widehat{{\Delta }}(z)\) is evaluated as

$$\begin{aligned} \widehat{{\Delta }}(z)\le&\left\{ 2{c(z^2+z-c^2)\over (z+c+1)^2(z^2+cz+c^2)} -2 {c\over z+c}+{\phi (z+1)\over z+c+1}\right\} \phi (z+1). \end{aligned}$$

Since \((z^2+z-c^2)/(z^2+cz+c^2)\) is increasing in z for \(c\ge 1\), we have \((z^2+z-c^2)/(z^2+cz+c^2)<1\), so that for \(z>c\ge 1\),

$$\begin{aligned} \widehat{{\Delta }}(z)\le&\left\{ 2c{1\over z+c+1} -2c {z+c+1\over z+c}+\phi (z+1)\right\} {\phi (z+1)\over z+c+1}\\ \le&\Big \{ 2c{1\over 2c+1} -2c +{4\over 3}c\Big \} {\phi (z+1)\over z+c+1}\\ \le&2c \Big \{ {1\over 3} -1 +{2\over 3}\Big \} {\phi (z+1)\over z+c+1}, \end{aligned}$$

which is zero. Thus, \(\widehat{{\Delta }}(z) \le 0\), and the improvement of \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) over \({\varvec{X}}\) is established. \(\square \)
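
The improvement can also be checked numerically from (4.3); the following sketch (our own code) evaluates the unbiased risk-difference estimate for \(\phi _\textrm{TS}\) with \(c=p-1\) on a grid of z.

```python
# Numerical check (our own sketch) of Theorem 4.1 via the unbiased risk-difference
# estimate (4.3): Delta_hat(z) should be <= 0 for phi_TS with c = p - 1.
import numpy as np

def delta_hat_count(phi, z, p, c):
    return (-2.0 * ((z + p) * phi(z + 1) / (z + c + 1) - z * phi(z) / (z + c))
            + (z + p) * phi(z + 1)**2 / (z + c + 1)**2)

p = 6
c = p - 1
phi_ts = lambda z: c * (z + c)**2 / (z**2 + c * z + c**2)
z = np.arange(0, 1000)
print(np.max(delta_hat_count(phi_ts, z, p, c)) <= 0.0)   # expected: True
```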

We compare the risk performances of the three estimators \({\varvec{X}}\), \(\widehat{{\varvec{\theta }}}^\textrm{CZ}\) and \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) with \(c=p-1\) under three distributions of \(X_i\): the Poisson distribution \(Po({\theta }_i)\), the geometric distribution \(Geo(1/({\theta }_i+1))\) and the negative binomial distribution \(Nbin(5, 5/({\theta }_i+5))\). The estimators \(\widehat{{\varvec{\theta }}}^\textrm{CZ}\) and \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) are denoted by CZ and TS. The estimator \(\widehat{{\varvec{\theta }}}=(\hat{{\theta }}_1, \ldots , \hat{{\theta }}_p)^\top \) of \({\varvec{\theta }}=({\theta }_1, \ldots , {\theta }_p)^\top \) is evaluated by the risk function relative to the loss \(\sum _{i=1}^p(\hat{{\theta }}_i-{\theta }_i)^2/{\theta }_i\). The simulation experiment has been conducted with \(p=6\) and \({\theta }_i=k/2\) for \(i=1, \ldots , p\) and \(k=1, \ldots , 10\), and the average losses based on simulation with 100,000 replications are reported in Table 5. From Table 5, it is seen that the improvements of \(\widehat{{\varvec{\theta }}}^\textrm{CZ}\) and \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) over \({\varvec{X}}\) are significant. In the Poisson distribution, \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) is slightly worse than \(\widehat{{\varvec{\theta }}}^\textrm{CZ}\), while \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) is slightly better in the geometric and negative binomial distributions.

Table 5 Average Losses of \(\widehat{{\varvec{\theta }}}^\textrm{CZ}\) and \(\widehat{{\varvec{\theta }}}^\textrm{TS}\) for \(p=6\) and \({\theta }_{ik}=k/2\)

5 Remarks on extensions

We would like to conclude the paper with remarks on extensions of the results in the previous sections. An interesting extension is the estimation of an \(m\times p\) matrix mean \({\varvec{{\Theta }}}\) of a random matrix \({\varvec{X}}\) with \(\textrm{E}[{\varvec{X}}]={\varvec{{\Theta }}}\) and \(\textrm{E}[({\varvec{X}}-{\varvec{{\Theta }}})({\varvec{X}}-{\varvec{{\Theta }}})^\top ]={\varvec{I}}_m\). As an extension of the result of Sect. 2, we consider minimizing

$$\begin{aligned} L({\varvec{{\Theta }}})=\textrm{tr}\,\left\{ ({\varvec{X}}-{\varvec{{\Theta }}})({\varvec{X}}-{\varvec{{\Theta }}})^\top \right\} + c \log \left| {\varvec{{\Theta }}}{\varvec{{\Theta }}}^\top \right| , \end{aligned}$$

for known c. The optimal solution satisfies the equation

$$\begin{aligned} {\varvec{{\Theta }}}= \left( {\varvec{I}}+c\left( {\varvec{{\Theta }}}{\varvec{{\Theta }}}^\top \right) ^{-1}\right) ^{-1}{\varvec{X}}={\varvec{X}}- c\left( c{\varvec{I}}+{\varvec{{\Theta }}}{\varvec{{\Theta }}}^\top \right) ^{-1}{\varvec{X}}. \end{aligned}$$
(5.1)

When \({\varvec{X}}{\varvec{X}}^\top \) is of full rank and \(p\ge m\), replacing \({\varvec{{\Theta }}}{\varvec{{\Theta }}}^\top \) with \({\varvec{X}}{\varvec{X}}^\top \) gives the one-step shrinkage estimator

$$\begin{aligned} \widehat{{\varvec{{\Theta }}}}^\textrm{OS}={\varvec{X}}- c \left( {\varvec{X}}{\varvec{X}}^\top + c {\varvec{I}}\right) ^{-1}{\varvec{X}}. \end{aligned}$$
(5.2)

Similarly to (2.4), we have the two-step shrinkage estimator

$$\begin{aligned} \widehat{{\varvec{{\Theta }}}}^\textrm{TS}= {\varvec{X}}- c\left[ c{\varvec{I}}+\left\{ {\varvec{I}}+c({\varvec{X}}{\varvec{X}}^\top )^{-1}\right\} ^{-1}{\varvec{X}}{\varvec{X}}^\top \left\{ {\varvec{I}}+c({\varvec{X}}{\varvec{X}}^\top )^{-1}\right\} ^{-1}\right] ^{-1}{\varvec{X}}. \end{aligned}$$
(5.3)

It is noted that \(\widehat{{\varvec{{\Theta }}}}^\textrm{OS}\) is close to the Efron and Morris (1976b) estimator \(\widehat{{\varvec{{\Theta }}}}^\textrm{EM}={\varvec{X}}-(p-m-1)({\varvec{X}}{\varvec{X}}^\top )^{-1}{\varvec{X}}\) and can be verified to be minimax under a suitable condition on c. Although simulation results indicate that \(\widehat{{\varvec{{\Theta }}}}^\textrm{TS}\) performs better than \(\widehat{{\varvec{{\Theta }}}}^\textrm{EM}\), we could not give an analytical proof of the minimaxity of \(\widehat{{\varvec{{\Theta }}}}^\textrm{TS}\).
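
A sketch (our own code) of the matrix estimators (5.2) and (5.3) is given below, assuming \({\varvec{X}}{\varvec{X}}^\top \) is nonsingular; the choice of c is left to the user.

```python
# A sketch (our own) of the matrix estimators (5.2) and (5.3), assuming X X^T is
# nonsingular (p >= m); the constant c is left as an argument.
import numpy as np

def theta_os_matrix(x, c):
    m = x.shape[0]
    g = x @ x.T
    return x - c * np.linalg.solve(g + c * np.eye(m), x)            # (5.2)

def theta_ts_matrix(x, c):
    m = x.shape[0]
    g = x @ x.T
    b = np.linalg.inv(np.eye(m) + c * np.linalg.inv(g))             # {I + c (XX^T)^{-1}}^{-1}
    return x - c * np.linalg.solve(c * np.eye(m) + b @ g @ b, x)    # (5.3)
```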

Another interesting extension is the linear regression model \({\varvec{y}}={\varvec{X}}{\varvec{\beta }}+{\sigma }{\varvec{\epsilon }}\), where \({\varvec{y}}\) is an \(n\times 1\) vector of observations, \({\varvec{X}}\) is an \(n\times p\) matrix of known variables for \(p<n\), \({\varvec{\beta }}=({\beta }_1, \ldots , {\beta }_p)^\top \) is a \(p\times 1\) vector of unknown coefficients, \({\sigma }\) is an unknown scale parameter and \({\varvec{\epsilon }}=({\varepsilon }_1, \ldots , {\varepsilon }_n)^\top \) is an \(n\times 1\) vector of random errors with \(\textrm{E}[{\varepsilon }_i]=0\) and \(\textrm{E}[{\varepsilon }_i^2]<\infty \). When \({\varvec{X}}\) is of full rank, \({\varvec{\beta }}\) is estimated by the least squares estimator \(\widehat{{\varvec{\beta }}}=({\varvec{X}}^\top {\varvec{X}})^{-1}{\varvec{X}}^\top {\varvec{y}}\).

In the case of multicollinearity, \({\varvec{X}}\) is not of full rank and we cannot use the least squares estimator. A feasible solution is the estimator \(\widehat{{\varvec{\beta }}}^\textrm{MP}=({\varvec{X}}^\top {\varvec{X}})^{+}{\varvec{X}}^\top {\varvec{y}}\), where \({\varvec{A}}^+\) is the Moore–Penrose inverse of a matrix \({\varvec{A}}\). Alternative approaches to the problem of multicollinearity are the regularized least squares methods. The minimization of

$$\begin{aligned} \Vert {\varvec{y}}-{\varvec{X}}{\varvec{\beta }}\Vert ^2/{\sigma }^2 + {\lambda }\Vert {\varvec{\beta }}\Vert ^2, \quad {\lambda }>0 \end{aligned}$$

yields the ridge regression estimator \(\widehat{{\varvec{\beta }}}^\textrm{RR}=({\varvec{X}}^\top {\varvec{X}}+{\lambda }{\sigma }^2{\varvec{I}})^{-1}{\varvec{X}}^\top {\varvec{y}}\), and the minimization of

$$\begin{aligned} \Vert {\varvec{y}}-{\varvec{X}}{\varvec{\beta }}\Vert ^2/{\sigma }^2 + {\lambda }\sum _{i=1}^p |{\beta }_i|, \quad {\lambda }>0 \end{aligned}$$

yields the lasso regression estimator. As a regularized method with a logarithmic penalty, we consider minimizing

$$\begin{aligned} L({\varvec{\beta }})=\Vert {\varvec{y}}-{\varvec{X}}{\varvec{\beta }}\Vert ^2/{\sigma }^2 + c \log (\Vert {\varvec{\beta }}\Vert ^2), \end{aligned}$$
(5.4)

for positive constant c. The optimal solution satisfies the equation

$$\begin{aligned} {\varvec{\beta }}= \Big ({\varvec{X}}^\top {\varvec{X}}+{c{\sigma }^2\over \Vert {\varvec{\beta }}\Vert ^2}{\varvec{I}}\Big )^{-1}{\varvec{X}}^\top {\varvec{y}}. \end{aligned}$$

Using the same arguments as in Sect. 2 based on the above equation and replacing \(\Vert {\varvec{\beta }}\Vert ^2\) with \(\Vert \widetilde{{\varvec{\beta }}}\Vert ^2\) for some stable estimator \(\widetilde{{\varvec{\beta }}}\) like \(\widehat{{\varvec{\beta }}}^\textrm{MP}\) or \(\widehat{{\varvec{\beta }}}^\textrm{RR}\), one gets the one-step and two-step estimators, given by

$$\begin{aligned} \widehat{{\varvec{\beta }}}^\textrm{OS}=&\left( {\varvec{X}}^\top {\varvec{X}}+(c\hat{{\sigma }}^2/\Vert \widetilde{{\varvec{\beta }}}\Vert ^2){\varvec{I}}\right) ^{-1}{\varvec{X}}^\top {\varvec{y}}, \end{aligned}$$
(5.5)
$$\begin{aligned} \widehat{{\varvec{\beta }}}^\textrm{TS}=&\left[ {\varvec{X}}^\top {\varvec{X}}+{c\hat{{\sigma }}^2\over {\varvec{y}}^\top {\varvec{X}}\{{\varvec{X}}^\top {\varvec{X}}+(c\hat{{\sigma }}^2/\Vert \widetilde{{\varvec{\beta }}}\Vert ^2){\varvec{I}}\}^{-2}{\varvec{X}}^\top {\varvec{y}}}{\varvec{I}}\right] ^{-1}{\varvec{X}}^\top {\varvec{y}}, \end{aligned}$$
(5.6)

where \(\hat{{\sigma }}^2={\varvec{y}}^\top ({\varvec{I}}-{\varvec{X}}({\varvec{X}}^\top {\varvec{X}})^-{\varvec{X}}^\top ){\varvec{y}}/(n-r)\) for \(r=\textrm{rank}\,({\varvec{X}})\). The minimaxity of these estimators is an interesting problem to address in future work.
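
As a final illustration, the following sketch (our own code) computes (5.5) and (5.6) with \(\widetilde{{\varvec{\beta }}}\) taken to be the Moore–Penrose solution; this particular pilot estimator and the treatment of c are only one possibility among those mentioned above.

```python
# A sketch (our own) of the regression estimators (5.5) and (5.6), with the Moore-Penrose
# fit as the pilot estimator beta_tilde; c is left as an argument.
import numpy as np

def beta_shrinkage(x, y, c):
    n, p = x.shape
    xtx, xty = x.T @ x, x.T @ y
    beta_tilde = np.linalg.pinv(x) @ y                               # Moore-Penrose pilot fit
    r = np.linalg.matrix_rank(x)
    resid = y - x @ beta_tilde
    sigma2_hat = resid @ resid / (n - r)                             # residual variance estimate
    a = c * sigma2_hat / np.sum(beta_tilde**2)
    m_os = xtx + a * np.eye(p)
    beta_os = np.linalg.solve(m_os, xty)                             # one-step (5.5)
    denom = xty @ np.linalg.inv(m_os) @ np.linalg.inv(m_os) @ xty
    beta_ts = np.linalg.solve(xtx + (c * sigma2_hat / denom) * np.eye(p), xty)  # two-step (5.6)
    return beta_os, beta_ts
```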