Shrinkage estimation with logarithmic penalties

Kubokawa, Tatsuya

doi:10.1007/s42081-023-00225-y

Shrinkage estimation with logarithmic penalties

Original Paper
Stein Estimation and Statistical Shrinkage Methods
Open access
Published: 29 November 2023

Volume 7, pages 341–360, (2024)
Cite this article

Download PDF

You have full access to this open access article

Japanese Journal of Statistics and Data Science Aims and scope Submit manuscript

Shrinkage estimation with logarithmic penalties

Download PDF

Tatsuya Kubokawa¹

500 Accesses
Explore all metrics

Abstract

In this paper, we have developed a novel approach for deriving shrinkage estimators of means without assuming normality. Our method is based on the equation of the first-order condition with a logarithmic penalty, and it introduces both one-step and two-step shrinkage estimators. The one-step estimator closely resembles the James–Stein estimator, while the differentiable two-step estimator exhibits similar performance to the positive-part Stein estimator. Although the latter does not satisfy Baranchik’s conditions, both estimators can be demonstrated to be minimax under normality assumptions. Furthermore, we have extended this method to handle cases involving an unknown scale. We have successfully applied this approach to the simultaneous estimation of Poisson means.

Weak Versus Strong Dominance of Shrinkage Estimators

Article 18 November 2021

On Shrinkage Estimators and “Effective Degrees of Freedom”

On Minimax Shrinkage Estimation with Variable Selection

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

In the simultaneous estimation of means of multivariate normal distributions, the so-called Stein problem has been studied in the literature since Stein (1956) established the inadmissibility of the maximum likelihood estimator when the dimension of means is larger than or equal to three. Especially, James and Stein (1961) provided a simple and concrete shrinkage estimator which improves on the maximum likelihood estimator, namely a minimax shrinkage estimator under the squared error loss. Then, a lot of minimax shrinkage estimators have been suggested for various models. For good account of this topic, see Stein (1981), Berger (1985), Brandwein and Strawderman (1990), Robert (2007) and Fourdrinier et al. (2018).

In particular, Efron and Morris (1972a, b) showed that the James–Stein estimator can be derived as an empirical Bayes estimator, and Strawderman (1971) discovered prior distributions which yield Bayes and minimax estimators. Strawderman’s estimator is an ultimate goal in the framework of decision theory, because it is admissible and minimax. Since then, Bayesian minimax estimation has been studied actively and extensively.

To explain the details, we consider random variable ${\varvec{X}}=(X_1, \ldots , X_p)^\top $ having a p-variate normal distribution $\mathcal {N}_p({\varvec{\theta }}, {\varvec{I}})$ for ${\varvec{\theta }}=({\theta }_1, \ldots , {\theta }_p)^\top $, where estimator $\widehat{{\varvec{\theta }}}=(\hat{{\theta }}_1, \ldots , \hat{{\theta }}_p)^\top $ of ${\varvec{\theta }}$ is evaluated by the risk relative to the squared error loss $\Vert \widehat{{\varvec{\theta }}}-{\varvec{\theta }}\Vert ^2=\sum _{i=1}^p (\hat{{\theta }}_i-{\theta }_i)^2$. The James–Stein estimator is

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{JS}=\left( 1-{p-2\over \Vert {\varvec{X}}\Vert ^2}\right) {\varvec{X}}, \end{aligned}$$

(1.1)

which is improved on by the positive-part Stein estimator

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{PS}=\max \left( 0, 1-{p-2\over \Vert {\varvec{X}}\Vert ^2}\right) {\varvec{X}}. \end{aligned}$$

(1.2)

Both estimators are minimax for $p\ge 3$, but inadmissible. An admissible and minimax estimator is the generalized Bayes estimator against the shrinkage prior $\Vert {\varvec{\theta }}\Vert ^{2-p}$, given by

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{GB}=\left( 1-{\phi _0\left( \Vert {\varvec{X}}\Vert ^2\right) \over \Vert {\varvec{X}}\Vert ^2}\right) {\varvec{X}}, \end{aligned}$$

(1.3)

for

$$\begin{aligned} \phi _0(w)=&w {\int _0^\infty ({\lambda }+1)^{-p/2-1}\exp \left[ -w/\{2({\lambda }+1)\}\right] \textrm{d}{\lambda }\over \int _0^\infty ({\lambda }+1)^{-p/2}\exp \left[ -w/\{2({\lambda }+1)\}\right] \textrm{d}{\lambda }}\\ =&p-2 - {2\exp (-w/2) \over \int _0^\infty ({\lambda }+1)^{-p/2}\exp \left[ -w/\{2({\lambda }+1)\}\right] \textrm{d}{\lambda }}. \end{aligned}$$

Kubokawa (1991, 1994) showed that this Bayes estimator improves on the James–Stein estimator.

The joint density of $({\varvec{X}}, {\varvec{\theta }})$ for the shrinkage prior $\Vert {\varvec{\theta }}\Vert ^{2-p}$ is

$$\begin{aligned} {1\over (2\pi )^{p/2}}\exp \left\{ -{1\over 2}\Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2\right\} {1\over \Vert {\varvec{\theta }}\Vert ^{p-2}}. \end{aligned}$$

Bayes estimators in the literature often arises as posterior means, but in this paper, we try to handle modes of posterior distributions. Since the log likelihood of the joint density is proportional to

$$\begin{aligned} -{1\over 2}\Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2 - {p-2\over 2} \log \Vert {\varvec{\theta }}\Vert ^2. \end{aligned}$$

This expression motivates us to consider the posterior mode, namely the minimization of

$$\begin{aligned} \Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2 + (p-2) \log \Vert {\varvec{\theta }}\Vert ^2. \end{aligned}$$

(1.4)

In linear regression models, ridge regression and lasso regression are regularized least squares methods with $L_2$ and $L_1$ penalties. These methods do not assume normality as the underlying distributions. Similarly, the minimization problem of (1.4) is set up without assuming normality. However, it may be hard to find the optimal solution of the minimization. Instead of that, in this paper, we utilize the equation of the first-order condition (FOC), and based on the FOC equation, we suggest shrinkage estimators.

The paper is organized as follows. In Sect. 2, we consider the estimation of means with known scale parameter, and set up the minimization problem with the logarithmic penalty. Using the optimal relation, we propose two kinds of shrinkage estimators, called the one-step shrinkage and the two-step shrinkage estimators. The one-step shrinkage estimator resembles the James–Stein estimator, and the two-step shrinkage estimator is differentiable and performs similarly to the positive-part Stein estimator. It is noted that the two-step shrinkage estimator does not belong to Baranchik’s class of minimax estimators, so that we need to evaluate the risk function directly. It can be shown that both estimators are minimax under normality. The estimators are numerically compared and it is revealed that the two-step shrinkage estimator has a good performance for non-normal distributions. The extension to the case of unknown scale is given in Sect. 3. In Sect. 4, we study the simultaneous estimation of means of discrete distributions. Two kinds of shrinkage estimators are derived and their minimaxity is established. In Sect. 5, we give remarks on extensions to estimation of matrix means and linear regression models. For numerical computations in this paper, we used the ox program by Doornik (2007).

2 Shrinkage estimators under the logarithmic penalties

Let $X_1, \ldots , X_p$ be observable random variables generated from the model

$$\begin{aligned} X_i = {\theta }_i + {\sigma }{\varepsilon }_i,\quad i=1, \ldots , p, \end{aligned}$$

(2.1)

where ${\varepsilon }_1, \ldots , {\varepsilon }_p$ are independently distributed with $\textrm{E}[{\varepsilon }_i]=0$ and common finite variance. Let ${\varvec{X}}=(X_1, \ldots , X_p)^\top $ and ${\varvec{\theta }}=({\theta }_1, \ldots , {\theta }_p)^\top $. Estimator $\widehat{{\varvec{\theta }}}$ of ${\varvec{\theta }}$ is evaluated in terms of the risk relative to the quadratic loss $\Vert \widehat{{\varvec{\theta }}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2$. The estimator ${\varvec{X}}$ is the optimal solution of $\Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2$, and the solutions of $\Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2+{\lambda }\sum _{i=1}^p |{\theta }_i|$ and $\Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2+{\lambda }\Vert {\varvec{\theta }}\Vert ^2$ with $L_1$ and $L_2$ penalties for positive ${\lambda }$ give shrinkage estimators. Those penalized procedures correspond to the ridge estimator and the lasso in the framework of linear regression models.

In this paper, we consider deriving estimators of ${\varvec{\theta }}$ through the minimization

$$\begin{aligned} L({\varvec{\theta }})=\Vert {\varvec{X}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2+ c \log \left( \Vert {\varvec{\theta }}\Vert ^2\right) , \end{aligned}$$

for known positive constant c suitably chosen later. Since

$$\begin{aligned} {\partial \over \partial {\varvec{\theta }}}L({\varvec{\theta }})=- {2\over {\sigma }^2}({\varvec{X}}-{\varvec{\theta }})+{2c\over \Vert {\varvec{\theta }}\Vert ^2}{\varvec{\theta }}, \end{aligned}$$

the first-order condition (FOC) satisfies the equation

$$\begin{aligned} {\varvec{\theta }}= {\Vert {\varvec{\theta }}\Vert ^2 \over \Vert {\varvec{\theta }}\Vert ^2 + c{\sigma }^2}{\varvec{X}}=\left( 1-{c{\sigma }^2\over \Vert {\varvec{\theta }}\Vert ^2 + c{\sigma }^2}\right) {\varvec{X}}. \end{aligned}$$

(2.2)

The optimal solution may be given by solving the FOC equation numerically. Instead of that, we want to provide estimators with closed forms by using the FOC equation. A simple method is to replace $\Vert {\varvec{\theta }}\Vert ^2$ in RHS of (2.2) with $\Vert {\varvec{X}}\Vert ^2$. Then we have

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{OS}=\left( 1-{c{\sigma }^2\over \Vert {\varvec{X}}\Vert ^2 + c{\sigma }^2}\right) {\varvec{X}}, \end{aligned}$$

(2.3)

which we call the one-step shrinkage estimator. This type of shrinkage estimators were first treated by Stein (1956).

We next consider to use the FOC equation (2.2) twice. From (2.2), we have

$$\begin{aligned} \Vert {\varvec{\theta }}\Vert ^2 =\left( {\Vert {\varvec{\theta }}\Vert ^2 \over \Vert {\varvec{\theta }}\Vert ^2 + c{\sigma }^2}\right) ^2\Vert {\varvec{X}}\Vert ^2. \end{aligned}$$

Again substituting this quantity into (2.2) yields

$$\begin{aligned} {\varvec{\theta }}=&\left( 1-{c{\sigma }^2\over \left\{ \Vert {\varvec{\theta }}\Vert ^2/\left( \Vert {\varvec{\theta }}\Vert ^2 + c{\sigma }^2\right) \right\} ^2\Vert {\varvec{X}}\Vert ^2 + c{\sigma }^2}\right) {\varvec{X}}\nonumber \\ =&\left( 1-{c{\sigma }^2\left( 1+c{\sigma }^2/\Vert {\varvec{\theta }}\Vert ^2\right) ^2\over \Vert {\varvec{X}}\Vert ^2 + c{\sigma }^2\left( 1+c{\sigma }^2/\Vert {\varvec{\theta }}\Vert ^2\right) ^2}\right) {\varvec{X}}. \end{aligned}$$

(2.4)

Replacing $\Vert {\varvec{\theta }}\Vert ^2$ with $\Vert {\varvec{X}}\Vert ^2$, we have

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{TS}= \left( 1-{c{\sigma }^2\left( 1+c{\sigma }^2/\Vert {\varvec{X}}\Vert ^2\right) ^2\over \Vert {\varvec{X}}\Vert ^2 + c{\sigma }^2\left( 1+c{\sigma }^2/\Vert {\varvec{X}}\Vert ^2\right) ^2}\right) {\varvec{X}}, \end{aligned}$$

(2.5)

which we call the two-step shrinkage estimator.

We here look into the two-step shrinkage estimator, which is rewritten as $\widehat{{\varvec{\theta }}}^\textrm{TS}=\{1-\phi _\textrm{TS}(W)/W\}{\varvec{X}}$ for $W=\Vert {\varvec{X}}\Vert ^2/{\sigma }^2$, where

$$\begin{aligned} \phi _{\textrm{TS}}(w)={c w (1+c/w)^2\over w + c(1+c/w)^2}. \end{aligned}$$

(2.6)

The derivative of $\phi _\textrm{TS}(w)$ is

$$\begin{aligned} \phi _\textrm{TS}'(w)&= 1-{2w\over w+c(1+c/w)^2}+{w^2\over \{w+c(1+c/w)^2\}^2}\left\{ 1-2{c^2\over w^2}\left( 1+{c\over w}\right) \right\} \nonumber \\&= \{\phi _\textrm{TS}(w)\}^2{1\over w^2}\left\{ 1- {2\over (1+c/w)^3}\right\} . \end{aligned}$$

(2.7)

Let $w_0$ be the solution of $\phi _\textrm{TS}'(w)=0$ or $(1+c/w)^3=2$. The solution $w_0$ satisfies that $c/w_0=2^{1/3}-1$ or $w_0=c/(2^{1/3}-1)$, and the maximum value of $\phi _\textrm{TS}(w)$ is given by

$$\begin{aligned} \phi _\textrm{TS}(w_0) = {cw_02^{2/3} \over w_0+c 2^{2/2}}={c2^{2/3} \over 1+(2^{1/3}-1)2^{2/2}} = {2^{2/3}\over 3-2^{2/3}}c \approx 1.12 c < 1.13 c. \end{aligned}$$

(2.8)

Thus, $\phi _\textrm{TS}(w)$ is increasing for $0\le w\le w_0$, and decreasing for $w>w_0$, which implies that $\phi _\textrm{TS}(w)\le \phi _\textrm{TS}(w_0)$ for any $w\ge 0$. The function $\phi _\textrm{TS}(w)$ is illustrated in Fig. 1.

An appropriate choice of c is $c=p-2$, because both estimators $\widehat{{\varvec{\theta }}}^\textrm{OS}$ and $\widehat{{\varvec{\theta }}}^\textrm{TS}$ with $c=p-2$ are minimax under the normal distribution as shown below. When $c=p-2$, $\widehat{{\varvec{\theta }}}^\textrm{OS}$ resembles the James–Stein estimator $\widehat{{\varvec{\theta }}}^\textrm{JS}$ in (1.1), while $\widehat{{\varvec{\theta }}}^\textrm{TS}$ is close to the positive-part Stein estimator $\widehat{{\varvec{\theta }}}^\textrm{PS}$ in (1.2). Although $\widehat{{\varvec{\theta }}}^\textrm{PS}$ is not differentiable at some point, $\widehat{{\varvec{\theta }}}^\textrm{TS}$ is differentiable. We compare the corresponding shrinkage coefficients of ${\varvec{X}}$ in $\widehat{{\varvec{\theta }}}^\textrm{PS}$ and $\widehat{{\varvec{\theta }}}^\textrm{TS}$. Let

$$\begin{aligned} \psi ^\textrm{PS}(w)=&\max (0, 1-c/w),\\ \psi ^\textrm{TS}(w)=&1-{\phi _\textrm{TS}(w)\over w}={w\over w + c(1+c/w)^2}, \end{aligned}$$

for $c=p-2$. These values for some small w’s are given in Table 2. For $w\le 1.5$, $\widehat{{\varvec{\theta }}}^\textrm{PS}$ is more shrunken than $\widehat{{\varvec{\theta }}}^\textrm{TS}$, but less shrunken for $w\ge 2$. Numerical investigations given below demonstrate that $\widehat{{\varvec{\theta }}}^\textrm{TS}$ is more efficient than $\widehat{{\varvec{\theta }}}^\textrm{PS}$ for small $\Vert {\varvec{\theta }}\Vert ^2$.

Table 1 Values of shrinkage coefficients of ${\varvec{X}}$ in $\widehat{{\varvec{\theta }}}^\textrm{PS}$ and $\widehat{{\varvec{\theta }}}^\textrm{TS}$

Full size table

It is noted that normal distributions are not used in the derivation of $\widehat{{\varvec{\theta }}}^\textrm{OS}$ and $\widehat{{\varvec{\theta }}}^\textrm{TS}$. Under the normality, we can investigate minimaxity of these estimators. All the equivariant estimators under orthogonal transformations are of the form

$$\begin{aligned} \widehat{{\varvec{\theta }}}_ \phi ={\varvec{X}}- {\phi \left( \Vert {\varvec{X}}\Vert ^2\right) \over \Vert {\varvec{X}}\Vert ^2}{\varvec{X}}, \end{aligned}$$

for function $\phi (\cdot )$. When estimator $\widehat{{\varvec{\theta }}}$ is evaluated by the risk function $R({\varvec{\theta }},\widehat{{\varvec{\theta }}})=\textrm{E}[\Vert \widehat{{\varvec{\theta }}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2]$, Stein (1973, 1981) showed that the unbiased estimator of the risk difference ${\Delta }_\phi =R({\varvec{\theta }},\widehat{{\varvec{\theta }}}_\phi )-R({\varvec{\theta }},{\varvec{X}})$ is

$$\begin{aligned} \widehat{{\Delta }}_\phi (W) = \left\{ \phi (W)^2-2(p-2)\phi (W)\right\} /W - 4\phi '(W), \end{aligned}$$

(2.9)

for $W=\Vert {\varvec{X}}\Vert ^2$. Baranchik (1970) demonstrated that $\widehat{{\varvec{\theta }}}_\phi $ is minimax if (a) $\phi (w)$ is nondecreasing and (b) $0\le \phi (w)\le 2(p-2)$. A class of estimators satisfying these conditions is called Baranchik’s class. It can be easily seen that $\widehat{{\varvec{\theta }}}^\textrm{OS}$ belongs to Baranchik’s class of minimaxity for $0<c\le 2(p-2)$. Concerning $\widehat{{\varvec{\theta }}}^\textrm{TS}$, however, it does not belong to Baranchik’s class, because monotonicity of $\phi _\textrm{TS}(w)$ is not satisfied as shown above. Thus, we need to investigate the unbiased estimator of the risk directly.

Theorem 2.1

Under the normality with ${\varvec{X}}\sim \mathcal {N}_p({\varvec{\theta }}, {\sigma }^2{\varvec{I}})$, the two-step shrinkage estimator $\widehat{{\varvec{\theta }}}^\textrm{TS}$ is minimax relative to the quadratic loss $\Vert \widehat{{\varvec{\theta }}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2$ if $0<c\le (3/2)(p-2)$ for $p\ge 3$.

Proof

Without loss of generality, we set ${\sigma }^2=1$. We shall show that $\widehat{{\Delta }}_{\phi _\textrm{TS}}(w)\le 0$ for any nonnegative w, where

$$\begin{aligned} \widehat{{\Delta }}_{\phi _\textrm{TS}}(W) = \left\{ \phi _\textrm{TS}(W)^2-2(p-2)\phi _\textrm{TS}(W)\right\} /W - 4\phi _\textrm{TS}'(W). \end{aligned}$$

Using the derivative given in (2.7), we can write $\widehat{{\Delta }}_{\phi _\textrm{TS}}(w)$ as

$$\begin{aligned} \widehat{{\Delta }}_{\phi _\textrm{TS}}(w)&= {\left\{ \phi _\textrm{TS}(w)\right\} ^2\over w}\left[ 1 - {2(p-2)\over \phi _\textrm{TS}(w)}-{4\over w}\left\{ 1- {2\over (1+c/w)^3}\right\} \right] \\&={\{\phi _\textrm{TS}(w)\}^2\over w^2(1+c/w)^3}\left[ w(1+c/w)^3 - {2(p-2)\over c}(1+c/w)\left\{ w+c(1+c/w)^2\right\} \right. \\&\quad \left. -4(1+c/w)^3+8\right] , \end{aligned}$$

which is further rewritten as

$$\begin{aligned} \widehat{{\Delta }}_{\phi _\textrm{TS}}(w)&={\{\phi _\textrm{TS}(w)\}^2\over w^2(1+c/w)^3}\left[ w(1+c/w)^3 \!-\! {2(p-2)\over c}(w\!+\!c) -2p (1+c/w)^3+8\right] \\&= {\{\phi _\textrm{TS}(w)\}^2\over w^5(1+c/w)^3}g(w), \end{aligned}$$

where

$$\begin{aligned}{} & {} g(w)=\{1-2(p-2)/c\}w^4\\{} & {} \quad +(3c-4p+12)w^3+3c(c-2p)w^2+c^2(c-6p)w-2pc^3. \end{aligned}$$

Then, g(w) is rewritten as

$$\begin{aligned} g(w)=&\left\{ 1-{2(p-2)\over c}\right\} \left\{ w+ {3c-4p+12\over 2\{1-2(p-2)/c\}}\right\} ^2w^2\\&+ \left\{ {(3c-4p+12)^2\over 4\{2(p-2)/c-1\}}+3c(c-2p) \right\} w^2+c^2(c-6p)w-2pc^3. \end{aligned}$$

Note that $1-{2(p-2)/ c}<0$, $c-2p<0$ and $c-6p<0$ for $0<c<(3/2)(p-2)$. Thus, it is sufficient to show that the coefficient of $w^2$ is not positive, namely, $h(c)\le 0$ for

$$\begin{aligned} h(c)=(3c-4p+12)^2 + 12\{2(p-2)-c\}(c-2p). \end{aligned}$$

Let $d=(3/2)(p-2)-c$. Substituting $c=(3/2)(p-2)-d$ into h(c), we have

$$\begin{aligned} h\left( {3(p-2)\over 2}-d\right)&=(p/2+3-3d)^2-3(p-2+2d)(p+6+2d)\\&= -3d^2-3(5p+14)d-{1\over 4}(p+6)(11p-30), \end{aligned}$$

which is not positive for $p\ge 3$ and $d\ge 0$. Thus, one can see that $\widehat{{\Delta }}_{\phi _\textrm{TS}}(w)\le 0$ for any $w\ge 0$ when $0<c\le (3/2)(p-2)$, and the minimaxity of $\widehat{{\varvec{\theta }}}^\textrm{TS}$ is established. $\square $

We investigate the performance of the suggested shrinkage estimators and compare their risks with some existing shrinkage estimators in the model (2.1). The estimators we compare are the James–Stein estimator $\widehat{{\varvec{\theta }}}^\textrm{JS}$ in (1.1), the positive-part Stein estimator $\widehat{{\varvec{\theta }}}^\textrm{PS}$ in (1.2), the generalized Bayes estimator $\widehat{{\varvec{\theta }}}^\textrm{GB}$ against the shrinkage prior $\Vert {\varvec{\theta }}\Vert ^{2-p}$, given in (1.3), the one-step and two-step shrinkage estimators $\widehat{{\varvec{\theta }}}^\textrm{OS}$ and $\widehat{{\varvec{\theta }}}^\textrm{TS}$ with $c=p-2$, which are denoted by JS, PS, GB, OS and TS, respectively. Since these estimators are orthogonally equivariant, their risks depend on ${\varvec{\theta }}$ only through $\Vert {\varvec{\theta }}\Vert ^2$. Thus, the simulation experiments are conducted for ${\varvec{\theta }}=(k/3){\varvec{I}}$ for $k=0, \ldots , 9$ and ${\sigma }=1$.

In the normal distribution $\mathcal {N}({\varvec{\theta }}, {\varvec{I}})$, the exact values of the risk functions can be obtained by using numerical integration based on the noncentral chi-squared distribution. These values are illustrated in Fig. 2 and given in Table 2 for $p=6$ and ${\lambda }=\Vert {\varvec{\theta }}\Vert ^2$.

Table 2 Exact values of risk functions of the five estimators in the normal distribution

Full size table

We also investigate the performance when the underlying distributions of ${\varepsilon }_i$ are Student t-distribution with 5 degrees of freedom, logistic distribution and double exponential distribution, where these probability density functions are, respectively, given by

$$\begin{aligned} f(x)&= {{\Gamma }((n+1)/2)\over \sqrt{n\pi }{\Gamma }(n/2)}{1\over (1+x^2/n)^{(n+1)/2}}\ \textrm{for}\ n=5,\\ f(x)&= {e^{-x}\over (1+e^{-x})^2},\\ f(x)&= {1\over 2}e^{-|x|}. \end{aligned}$$

In these cases, the average losses based on 10,000 replications are calculated as approximation of risk functions and reported in Table 3.

From Fig. 2 and Tables 2 and 3, it is seen that the minimaxity of the five estimators is robust for the t-, logistic and double exponential distributions. It is also revealed that OS is close to, but worse than JS, and that TS is close to, but better than PS in most cases. As justified theoretically by Kubokawa (1991, 1994), GB is better than JS under the normality. However, it is confirmed numerically that this property does not hold for $t_5$- and logistic distributions with smaller $\Vert {\varvec{\theta }}\Vert ^2$. Comparing TS and PS, we can see that TS is better than PS except for the cases of larger $\Vert {\varvec{\theta }}\Vert ^2$ in the normal distribution. TS is the best of five in most cases. In particular, TS provides the smallest risk of five at $\Vert {\varvec{\theta }}\Vert ^2=0$. TS is differentiable and the performance is recommendable in this setup of simulation. However, its admissibility is not guaranteed.

Table 3 Average losses of five shrinkage estimators for $p=6$ and ${\varvec{\theta }}=(k/3){\varvec{I}}$

Full size table

3 Extension to the case of unknown scale

We consider to extend the result of the previous section to the case of an unknown scale parameter. To this end, let us consider the random sampling model

$$\begin{aligned} Y_{ij} = {\theta }_i + \tau {\varepsilon }_{ij}, \quad i=1, \ldots , p,\ j=1, \ldots , m, \end{aligned}$$

(3.1)

where ${\varepsilon }_{ij}$’s are independently and identically distributed as $\textrm{E}[{\varepsilon }_{ij}]=0$ and $\textrm{Var}({\varepsilon }_{ij})<\infty $, and $\tau $ is an unknown scale parameter. Let $X_i=m^{-1}\sum _{j=1}^mY_{ij}$, $S=\sum _{i=1}^p\sum _{j=1}^m(Y_{ij}-X_i)^2/m$, ${\sigma }^2=\tau ^2/m$ and $n=p(m-1)$. Then, $X_1, \ldots , X_p$ are independent random variables with $\textrm{E}[X_i]={\theta }_i$ and $\textrm{Var}(X_i)={\sigma }^2$, $i=1, \ldots , p$, and $\textrm{E}[S]=p(m-1)\tau ^2/m=n{\sigma }^2$.

Consider the estimation of ${\varvec{\theta }}=({\theta }_1, \ldots , {\theta }_p)^\top $ based on ${\varvec{X}}=(X_1, \ldots , X_p)^\top $ and S. Let $W=\Vert {\varvec{X}}\Vert ^2/S$. Replacing $\Vert {\varvec{X}}\Vert ^2/{\sigma }^2$ in (2.3) and (2.5) with W, we have

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{OS}=&\left( 1-{c\over W + c}\right) {\varvec{X}}, \end{aligned}$$

(3.2)

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{TS}=&\left( 1-{c(1+c/W)^2\over W + c(1+c/W)^2}\right) {\varvec{X}}. \end{aligned}$$

(3.3)

Estimator $\widehat{{\varvec{\theta }}}$ is evaluated in terms of the risk function given by $R({\varvec{\theta }},\widehat{{\varvec{\theta }}})=\textrm{E}[\Vert \widehat{{\varvec{\theta }}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2]$. Under the normality, we can show minimaxity of the suggested estimators. When we consider estimators of the general form

$$\begin{aligned} \widehat{{\varvec{\theta }}}_\phi ={\varvec{X}}-{\phi (W)\over W}{\varvec{X}}, \end{aligned}$$

for function $\phi (w)$, Efron and Morris (1976a) showed that the unbiased estimator of the risk difference ${\Delta }=R({\varvec{\theta }},\widehat{{\varvec{\theta }}}_\phi )-R({\varvec{\theta }},{\varvec{X}})$ is

$$\begin{aligned} \widehat{{\Delta }}_\phi (W) = (n+2) \left\{ \phi (W)^2-2a\phi (W)\}/W - 4\{1+\phi (W)\right\} \phi '(W) \end{aligned}$$

(3.4)

for $a=(p-2)/(n+2)$. Baranchik’s class of minimax estimators is characterized as (a) $\phi (w)$ is nondecreasing and (b) $0\le \phi (w)\le a=(p-2)/(n+2)$. The one-step shrinkage estimator $\widehat{{\varvec{\theta }}}^\textrm{OS}$ belongs to Baranchik’s class for $0<c\le 2a$. Concerning $\widehat{{\varvec{\theta }}}^\textrm{TS}$ however, it does not belong to Baranchik’s class similarly to the case of Sect. 2.

Theorem 3.1

Under the normality with ${\varvec{X}}\sim \mathcal {N}_p({\varvec{\theta }}, {\sigma }^2{\varvec{I}})$ and $S/{\sigma }^2\sim \chi _n^2$, the two-step shrinkage estimator $\widehat{{\varvec{\theta }}}^\textrm{TS}$ is minimax relative to the quadratic loss $\Vert \widehat{{\varvec{\theta }}}-{\varvec{\theta }}\Vert ^2/{\sigma }^2$ if $0<c\le (4/3)(p-2)/(n+2)$ for $p\ge 3$.

Proof

It is noted that $\widehat{{\varvec{\theta }}}^\textrm{TS}$ is written as $\widehat{{\varvec{\theta }}}^\textrm{TS}=\{1-\phi _\textrm{TS}(W)/W\}{\varvec{X}}$ for $\phi _\textrm{TS}(w)$ given in (2.6). From (3.4), the unbiased estimator of the risk difference ${\Delta }=R({\varvec{\theta }},\widehat{{\varvec{\theta }}}^\textrm{TS})-R({\varvec{\theta }},{\varvec{X}})$ is

$$\begin{aligned} \widehat{{\Delta }}_{\phi _\textrm{TS}}(W) = (n+2) \left\{ \phi _\textrm{TS}(W)^2-2a\phi _\textrm{TS}(W)\right\} /W - 4\left\{ 1+\phi _\textrm{TS}(W)\right\} \phi _\textrm{TS}'(W) \end{aligned}$$

for $a=(p-2)/(n+2)$. We shall show that $\widehat{{\Delta }}_{\phi _\textrm{TS}}(w)\le 0$ for any nonnegative w. The proof is harder than that of Theorem 2.1, because of the term $\{1+\phi _\textrm{TS}(W)\}\phi _\textrm{TS}'(W)$.

From the arguments around (2.8), for $c\le (4/3)a$, it is observed that $\phi _\textrm{TS}(w_0) \le 1.13 c \le 1.13\times {4\over 3}a=1.507 a < 2 a$, which implies that $\phi _\textrm{TS}(w)\le \phi _\textrm{TS}(w_0) < 2a$. Then,

$$\begin{aligned} {\phi _\textrm{TS}(w)^2 - 2a \phi _\textrm{TS}(w) \over 1+\phi _\textrm{TS}(w)}\le {\phi _\textrm{TS}(w)^2 - 2a \phi _\textrm{TS}(w) \over 1+\phi _\textrm{TS}(w_0)}, \end{aligned}$$

which is used to rewrite $\widehat{{\Delta }}(w)$ as

$$\begin{aligned} \widehat{{\Delta }}_{\phi _\textrm{TS}}(w)&= \{1+\phi _\textrm{TS}(w)\} \left\{ (n+2) {\phi _\textrm{TS}(w)^2-2a\phi _\textrm{TS}(w) \over w\{1+\phi _\textrm{TS}(w)\}} - 4\phi _\textrm{TS}'(w)\right\} \\&\le \{1+\phi _\textrm{TS}(w)\} \left\{ (n+2) {\phi _\textrm{TS}(w)^2-2a\phi _\textrm{TS}(w) \over w\{1+\phi _\textrm{TS}(w_0)\}} - 4\phi _\textrm{TS}'(w)\right\} \\&={1+\phi _\textrm{TS}(w)\over 1+\phi _\textrm{TS}(w_0)}(n+2)\widetilde{{\Delta }}(w), \end{aligned}$$

where

$$\begin{aligned} \widetilde{{\Delta }}(w)= {\phi _\textrm{TS}(w)^2-2a\phi _\textrm{TS}(w) \over w} - 4{1+\phi _\textrm{TS}(w_0)\over n+2}\phi _\textrm{TS}'(w). \end{aligned}$$

Let $A=\{1+\phi _\textrm{TS}(w_0)\}/(n+2)$ for simplicity. Using the same arguments as in the proof of Theorem 2.1, we can see that

$$\begin{aligned} \widetilde{{\Delta }}(w)={\{\phi _\textrm{TS}(w)\}^2\over w}\left[ 1 - {2(p-2)\over \phi _\textrm{TS}(w)}-{4\over w}A\left\{ 1- {2\over (1+c/w)^3}\right\} \right] , \end{aligned}$$

which is rewritten as

$$\begin{aligned}&{\{\phi _\textrm{TS}(w)\}^2\over w^2(1+c/w)^3}\left[ w(1+c/w)^3 - {2(p-2)\over c}\right. \\ {}&\left. (1+c/w)\left\{ w+c(1+c/w)^2\right\} -4A(1+c/w)^3+8A\right] \\&\quad ={\left\{ \phi _\textrm{TS}(w)\right\} ^2\over w^5(1+c/w)^3}\left[ w(w+c)^3 - {2(p-2)\over c}(w+c)\{w^3+c(w+c)^2\}\right. \\ {}&\qquad \left. -4A(w+c)^3+8Aw^3\right] . \end{aligned}$$

Thus, we have

$$\begin{aligned} \widetilde{{\Delta }}(w)= {\{\phi _\textrm{TS}(w)\}^2\over w^5(1+c/w)^3}g(w), \end{aligned}$$

where

$$\begin{aligned} g(w)= & {} (1-2a/c)w^4+(3c-4a+4A)w^3\\{} & {} +3c(c-2a-4A)w^2+c^2(c-6a-12A)w-2(a+2A)c^3. \end{aligned}$$

Then, g(w) is rewritten as

$$\begin{aligned} g(w)=&\left( 1-{2a\over c}\right) \left\{ w+ {3c-4a+4A\over 2(1-2a/c)}\right\} ^2w^2\\&+ \left\{ {(3c-4a+4A)^2\over 4(2a/c-1)}+3c(c-2a-4A)\right\} w^2\\&+c^2(c-6a-12A)w-2(a+2A)c^3. \end{aligned}$$

Note that $1-{2a/ c}<0$, $c-2a-4A<0$ and $c-6a-12A<0$ for $0<c<(4/3)a$. Thus, it is sufficient to show that the coefficient of $w^2$ is not positive, namely, $h(c)\le 0$ for

$$\begin{aligned} h(c)=(3c-4a+4A)^2 + 12(2a-c) (c-2a-4A). \end{aligned}$$

Let $d=(4/3)a-c$. Substituting $c=(4/3)a-d$ into h(c), we have

$$\begin{aligned} h\left( {4a\over 3}-d\right)&=(4A-3d)^2-12\left( {2\over 3}a+d\right) \left( {2\over 3}a+4A+d\right) \\&= -3d^2-24\left( {2\over 3}a+3A\right) d-8\left( {2\over 3}a^2 + 4aA - 2A^2\right) . \end{aligned}$$

We can verify that $(2/3)a^2 + 4aA - 2A^2\ge 0$ for $p\ge 3$ and $d\ge 0$. In fact, it is observed that

$$\begin{aligned} {2\over 3}a^2 + 4aA - 2A^2= A^2 \left\{ \left( {a\over A}\right) ^2 + 4{a\over A}-2\right\} , \end{aligned}$$

and from (2.8),

$$\begin{aligned} {a\over A}&={p-2\over 1+\phi _\textrm{TS}(w_0)} \ge {p-2\over 1+ 1.13c} \ge {p-2\over 1+ 1.13\times (4/3)(p-2)/(n+2)}\\&\ge {1\over 1+1.13\times (4/9)}>0.66. \end{aligned}$$

From this inequality, it follows that

$$\begin{aligned} {2\over 3}a^2 + 4aA - 2A^2\ge A^2\left( 0.66^2+4\times 0.66 -2\right) \ge A^2. \end{aligned}$$

Thus, one can see that $\widetilde{{\Delta }}(w)\le 0$ or $\widehat{{\Delta }}_{\phi _\textrm{TS}}(w)\le 0$ for any $w\ge 0$ when $0<c\le (4/3)(p-2)/(n+2)$, and the minimaxity of $\widehat{{\varvec{\theta }}}^\textrm{TS}$ is established. $\square $

We investigate the performance of the suggested shrinkage estimators in the random sampling model (3.1). As a distribution of ${\varepsilon }_{ij}$, we handle standard normal distribution, t-distribution with 5 degrees of freedom, logistic distribution and double exponential distribution. Although ${\varvec{X}}$ and S are independent under the normality, they are not independent for the other distributions.

Let $W=\Vert {\varvec{X}}\Vert ^2/S$ and $c=(p-2)/(n+2)$. The estimators we compare are the James–Stein estimator $\widehat{{\varvec{\theta }}}^\textrm{JS}=(1-c/W){\varvec{X}}$, the positive-part Stein estimator $\widehat{{\varvec{\theta }}}^\textrm{PS}=\max (0, 1-c/W){\varvec{X}}$, the one-step and two-step shrinkage estimators $\widehat{{\varvec{\theta }}}^\textrm{OS}$ and $\widehat{{\varvec{\theta }}}^\textrm{TS}$ with $c=(p-2)/(n+2)$, which are denoted by JS, PS, OS and TS, respectively. Based on simulation experiments with 50,000 replications for ${\varvec{\theta }}=(k/3){\varvec{I}}$ for $k=0, \ldots , 9$ and $\tau =1$, their average losses are calculated as approximation of risk functions and reported in Table 4.

Overall, Table 4 shows that the performances of JS, PS, OS and TS are similar to the case of known scale in Tables 2 and 3. Comparing OS and TS, we can see that TS is better for smaller $\Vert {\varvec{\theta }}\Vert ^2$, but OS is slightly better for larger $\Vert {\varvec{\theta }}\Vert ^2$. Comparing the values at $\Vert {\varvec{\theta }}\Vert ^2=0$ in Tables 2, 3 and 4, we also see that the cases of known scale give smaller risks than the cases of unknown scale in the normal distribution, but the cases of unknown scale give smaller risks in the non-normal distributions.

Table 4 Average losses of four shrinkage estimators for $(p, m, n)=(6,3,12)$ and ${\varvec{\theta }}=(k/3){\varvec{I}}$

Full size table

4 Estimation of means of discrete distributions

We here consider the simultaneous estimation of means of discrete distributions. Let $X_1, \ldots , X_p$ be independent nonnegative random variables with $\textrm{E}[X_i]={\theta }_i$ and finite variances. In the estimation of ${\varvec{\theta }}=({\theta }_1, \ldots , {\theta }_p)^\top $ based on ${\varvec{X}}=(X_1, \ldots , X_p)^\top $, we consider the minimization of

$$\begin{aligned} L({\varvec{\theta }}) = \sum _{i=1}^p\Big \{ X_i\log \Big ({X_i\over {\theta }_i}\Big )-X_i+{\theta }_i\Big \} + c\log \Big ( \sum _{j=1}^p{\theta }_j\Big ), \end{aligned}$$

for positive constant c. The first term of $L({\varvec{\theta }})$ is a special case of the Bregman loss treated in Banerjee et al. (2005), and is often referred to as a generalized Kullback–Leibler loss. The partial derivative of $L({\varvec{\theta }})$ with respect to ${\theta }_i$ is

$$\begin{aligned} {\partial L({\varvec{\theta }})\over \partial {\theta }_i}=-{X_i\over {\theta }_i}+1+{c\over \sum _{j=1}^p{\theta }_j}, \end{aligned}$$

and $\partial L({\varvec{\theta }})/ \partial {\theta }_i=0$ gives the equation

$$\begin{aligned} {\theta }_i = {\sum _{j=1}^p{\theta }_j\over \sum _{j=1}^p{\theta }_j+c}X_i = X_i - {c\over \sum _{j=1}^p{\theta }_j+c}X_i. \end{aligned}$$

Replacing $\sum _{j=1}^p{\theta }_j$ with $Z=\sum _{j=1}^pX_j$, we have the one-step shrinkage estimator

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{OS}={\varvec{X}}- {c \over Z + c}{\varvec{X}}. \end{aligned}$$

(4.1)

Similarly to (2.4), we have

$$\begin{aligned} {\theta }_i = X_i - {c \over \left\{ \sum _{j=1}^p{\theta }_j/(\sum _{j=1}^p{\theta }_j+c)\right\} \sum _{j=1}^pX_j+c}X_i. \end{aligned}$$

Replacing $\sum _{j=1}^p{\theta }_j$ with $Z=\sum _{j=1}^pX_j$, we have the two-step shrinkage estimator

$$\begin{aligned} \widehat{{\varvec{\theta }}}^\textrm{TS}&= {\varvec{X}}- {c\over Z^2/(Z+c)+c}{\varvec{X}}= {\varvec{X}}-{c(Z+c)\over Z^2+cZ+c^2}{\varvec{X}}\nonumber \\&= {\varvec{X}}- {c(1+c/Z)\over Z+c(1+c/Z)}{\varvec{X}}. \end{aligned}$$

(4.2)

These estimators belong to the class of estimators

$$\begin{aligned} \widehat{{\varvec{\theta }}}_\phi = {\varvec{X}}- {\phi (Z)\over Z+c}{\varvec{X}}, \end{aligned}$$

for positive function $\phi (\cdot )$. It is noted that the one-step shrinkage estimator $\widehat{{\varvec{\theta }}}^\textrm{OS}$ with $c=p-1$ is identical to the shrinkage estimator suggested by Clevenson and Zidek (1975), denoted by $\widehat{{\varvec{\theta }}}^\textrm{CZ}$.

We demonstrate that $\hat{{\theta }}^\textrm{OS}$ and $\hat{{\theta }}^\textrm{TS}$ improve on ${\varvec{X}}$ relative to the loss $\sum _{i=1}^p(\hat{{\theta }}_i-{\theta })^2/{\theta }_i$ when $X_i$ has the Poisson distribution with mean ${\theta }_i$, $Po({\theta }_i)$ for $i=1, \ldots , p$. Clevenson and Zidek (1975) showed that $\widehat{{\varvec{\theta }}}^\textrm{CZ}$ improves on ${\varvec{X}}$ and Tsui (1984) proved that this improvement still holds for negative binomial distributions. Using the same arguments as in Clevenson and Zidek (1975), we can show that the unbiased estimator of the risk difference ${\Delta }=R({\varvec{\theta }}, \widehat{{\varvec{\theta }}}_\phi )-R({\varvec{\theta }},{\varvec{X}})$ for $X_i\sim Po({\theta }_i)$, $i=1, \ldots , p$, is

$$\begin{aligned} \widehat{{\Delta }}(Z)=-2\left\{ {(Z+p)\phi (Z+1)\over Z+c+1}-{Z\phi (Z)\over Z+c}\right\} +{(Z+p)\phi ^2(Z+1)\over (Z+c+1)^2}. \end{aligned}$$

(4.3)

Then, it can be seen that $\widehat{{\Delta }}(z)\le 0$ if (a) $\phi (z)$ is nondecreasing and (b) $0\le \phi (z)\le 2c$ for $0<c\le p-1$ and $0\le \phi (z)\le 2(p-1)$ for $c>p-1$. In fact, since $\phi (z)\le \phi (z+1)$ from (a), we have

$$\begin{aligned} \widehat{{\Delta }}(z)\le \left\{ -2{(p-1)z+pc\over z+c}+{z+p\over z+c+1}\phi (z+1)\right\} {\phi (z+1)\over z+c+1}, \end{aligned}$$

so that $\widehat{{\Delta }}(z)\le 0$ if

$$\begin{aligned} \phi (z+1) \le 2{\{(p-1)z+pc\}(z+c+1)\over (z+p)(z+c)}. \end{aligned}$$

This inequality is satisfied by condition (b).

Concerning the one-step shrinkage estimator, it corresponds to $\phi (z)=c$, so that it is verified that $\widehat{{\varvec{\theta }}}^\textrm{OS}$ improves on ${\varvec{X}}$ if $0< c\le 2(p-1)$ for $p\ge 2$. For the two-step shrinkage estimator, however, it is harder to show the improvement, because the corresponding function $\phi (z)$ is not nondecreasing. To further investigate, we restrict ourselves to $c=p-1$ for simplicity.

Theorem 4.1

When $X_i \sim Po({\theta }_i)$ for $i=1, \ldots , p$, the two-step shrinkage estimator $\widehat{{\varvec{\theta }}}^\textrm{TS}$ with $c=p-1$ improves on ${\varvec{X}}$ relative to the quadratic loss $\sum _{i=1}^p(\hat{{\theta }}_i-{\theta })^2/{\theta }_i$ for $p\ge 2$.

Proof

Let $c=p-1$ or $p=c+1$. It is noted that $c\ge 1$ for $p\ge 2$. From (4.3),

$$\begin{aligned} \widehat{{\Delta }}(z)=-2\left\{ \phi (z+1)-{z\phi (z)\over z+c}\right\} +{\phi ^2(z+1)\over z+c+1}, \end{aligned}$$

where

$$\begin{aligned} \phi (z)={c(z+c)^2\over z^2+cz+c^2}. \end{aligned}$$

Since the derivative of $\phi (z)$ is

$$\begin{aligned} \phi '(z)={c^2(c-z)(c+z)\over (z^2+cz+c^2)^2}, \end{aligned}$$

$\phi (z)$ is increasing for $z\le c$ and decreasing for $z>c$. This means that

$$\begin{aligned} \phi (z) \le \phi (c) = {4\over 3}c. \end{aligned}$$

In the case of $z\le c$, since $\phi (z)\le \phi (z+1)$, we have

$$\begin{aligned} \widehat{{\Delta }}(z)\le&\left\{ -2 {c\over z+c}+{\phi (z+1)\over z+c+1}\right\} \phi (z+1) \le \{-2c+\phi (z+1)\}{\phi (z+1)\over z+c+1}\\ \le&\left\{ -2c+{4\over 3}c \right\} {\phi (z+1)\over z+c+1}<0. \end{aligned}$$

In the case of $z>c$, note that $\phi (z)> \phi (z+1)$. $\widehat{{\Delta }}(z)$ is rewritten as

$$\begin{aligned} \widehat{{\Delta }}(z)&= 2\{\phi (z)-\phi (z+1)\} -2 {c\phi (z)\over z+c}+{\phi ^2(z+1)\over z+c+1}\\&\le 2\{\phi (z)-\phi (z+1)\} -2 {c\phi (z+1)\over z+c}+{\phi ^2(z+1)\over z+c+1}. \end{aligned}$$

It is here observed that

$$\begin{aligned} \phi (z)-\phi (z+1)&= {c(z+c)^2\over z^2+cz+c^2} - {c(z+c+1)^2\over (z+1)^2+c(z+1)+c^2} \\&= {c^2(z^2+z-c^2)\over \{(z+1)^2+c(z+1)+c^2\}\{z^2+cz+c^2\}}\\&= {c(z^2+z-c^2)\phi (z+1)\over (z+c+1)^2\{z^2+cz+c^2\}}, \end{aligned}$$

so that $\widehat{{\Delta }}(z)$ is evaluated as

$$\begin{aligned} \widehat{{\Delta }}(z)\le&\left\{ 2{c(z^2+z-c^2)\over (z+c+1)^2(z^2+cz+c^2)} -2 {c\over z+c}+{\phi (z+1)\over z+c+1}\right\} \phi (z+1). \end{aligned}$$

Since $(z^2+z-c^2)/(z^2+cz+c^2)$ is increasing in z for $c\ge 1$, we have $(z^2+z-c^2)/(z^2+cz+c^2)<1$, so that for $z>c\ge 1$,

$$\begin{aligned} \widehat{{\Delta }}(z)\le&\left\{ 2c{1\over z+c+1} -2c {z+c+1\over z+c}+\phi (z+1)\right\} {\phi (z+1)\over z+c+1}\\ \le&\Big \{ 2c{1\over 2c+1} -2c +{4\over 3}c\Big \} {\phi (z+1)\over z+c+1}\\ \le&2c \Big \{ {1\over 3} -1 +{2\over 3}\Big \} {\phi (z+1)\over z+c+1}, \end{aligned}$$

which is zero. Thus, $\widehat{{\Delta }}(z) \le 0$, and the improvement of $\widehat{{\varvec{\theta }}}^\textrm{TS}$ over ${\varvec{X}}$ is established. $\square $

We compare the risk performances of the three estimators ${\varvec{X}}$, $\widehat{{\varvec{\theta }}}^\textrm{CZ}$ and $\widehat{{\varvec{\theta }}}^\textrm{TS}$ with $c=p-1$ under the three distributions of $X_i$: the Poisson distribution $Po({\theta }_i)$, geometric distribution $Geo(1/({\theta }_i+1))$ and negative binomial distribution $Nbin(5, 5/({\theta }_i+5)$. The estimators $\widehat{{\varvec{\theta }}}^\textrm{CZ}$ and $\widehat{{\varvec{\theta }}}^\textrm{TS}$ are denoted by CZ and TS. The estimator $\widehat{{\varvec{\theta }}}=(\hat{{\theta }}_1, \ldots , \hat{{\theta }}_p)^\top $ of ${\varvec{\theta }}=({\theta }_1, \ldots , {\theta }_p)^\top $ is evaluated by the risk function relative to the loss $\sum _{i=1}^p(\hat{{\theta }}_i-{\theta }_i)^2/{\theta }_i$. The simulation experiment has been conducted with $p=6$ and ${\theta }_i=k/2$ for $i=1, \ldots , p$ and $k=1, \ldots , 10$, and the average losses based on simulation with 100,000 replications are reported Table 5. From Table 5, it is seen that the improvements of $\widehat{{\varvec{\theta }}}^\textrm{CZ}$ and $\widehat{{\varvec{\theta }}}^\textrm{TS}$ over ${\varvec{X}}$ are significant. In the Poisson distribution, $\widehat{{\varvec{\theta }}}^\textrm{TS}$ is slightly worse than $\widehat{{\varvec{\theta }}}^\textrm{CZ}$, while $\widehat{{\varvec{\theta }}}^\textrm{TS}$ is slightly better in the geometric and negative binomial distributions.

Table 5 Average Losses of $\widehat{{\varvec{\theta }}}^\textrm{CZ}$ and $\widehat{{\varvec{\theta }}}^\textrm{TS}$ for $p=6$ and ${\theta }_{ik}=k/2$

Full size table

5 Remarks on extensions

We would like to conclude the paper with remarks on extensions of the results in the previous sections. An interesting extension is the estimation of an $m\times p$ matrix mean ${\varvec{{\Theta }}}$ of a random matrix ${\varvec{X}}$ with $\textrm{E}[{\varvec{X}}]={\varvec{{\Theta }}}$ and $\textrm{E}[({\varvec{X}}-{\varvec{{\Theta }}})({\varvec{X}}-{\varvec{{\Theta }}})^\top ]={\varvec{I}}_m$. As an extension of the result of Sect. 2, we consider minimizing

$$\begin{aligned} L({\varvec{{\Theta }}})=\textrm{tr}\,\left\{ ({\varvec{X}}-{\varvec{{\Theta }}})({\varvec{X}}-{\varvec{{\Theta }}})^\top \right\} + c \log \left| {\varvec{{\Theta }}}{\varvec{{\Theta }}}^\top \right| , \end{aligned}$$

for known c. The optimal solution satisfies the equation

$$\begin{aligned} {\varvec{{\Theta }}}= \left( {\varvec{I}}+c\left( {\varvec{{\Theta }}}{\varvec{{\Theta }}}^\top \right) ^{-1}\right) ^{-1}{\varvec{X}}={\varvec{X}}- c\left( c{\varvec{I}}+{\varvec{{\Theta }}}{\varvec{{\Theta }}}^\top \right) ^{-1}{\varvec{X}}. \end{aligned}$$

(5.1)

When ${\varvec{X}}{\varvec{X}}^\top $ is of full rank and $p\ge m$, replacing ${\varvec{{\Theta }}}{\varvec{{\Theta }}}^\top $ with ${\varvec{X}}{\varvec{X}}^\top $ gives the one-step shrinkage estimator

$$\begin{aligned} \widehat{{\varvec{{\Theta }}}}^\textrm{OS}={\varvec{X}}- c \left( {\varvec{X}}{\varvec{X}}^\top + c {\varvec{I}}\right) ^{-1}{\varvec{X}}. \end{aligned}$$

(5.2)

Similarly to (2.4), we have the two-step shrinkage estimator

$$\begin{aligned} \widehat{{\varvec{{\Theta }}}}^\textrm{TS}= {\varvec{X}}- c\left[ c{\varvec{I}}+\left\{ {\varvec{I}}+c({\varvec{X}}{\varvec{X}}^\top )^{-1}\right\} ^{-1}{\varvec{X}}{\varvec{X}}^\top \left\{ {\varvec{I}}+c({\varvec{X}}{\varvec{X}}^\top )^{-1}\right\} ^{-1}\right] ^{-1}{\varvec{X}}. \end{aligned}$$

(5.3)

It is noted that $\widehat{{\varvec{{\Theta }}}}^\textrm{OS}$ is close to the Efron and Morris (1976b) estimator $\widehat{{\varvec{{\Theta }}}}^\textrm{EM}={\varvec{X}}-(p-m-1)({\varvec{X}}{\varvec{X}}^\top )^{-1}{\varvec{X}}$ and it can be verified to be minimax for a suitable condition on c. Although a numerical investigation shows that $\widehat{{\varvec{{\Theta }}}}^\textrm{TS}$ is better than $\widehat{{\varvec{{\Theta }}}}^\textrm{EM}$ by simulation, we could not give an analytical proof for minimaxity of $\widehat{{\varvec{{\Theta }}}}^\textrm{TS}$.

Another interesting extension is the linear regression model ${\varvec{y}}={\varvec{X}}{\varvec{\beta }}+{\sigma }{\varvec{\epsilon }}$, where ${\varvec{y}}$ is an $n\times 1$ vector of observations, ${\varvec{X}}$ is an $n\times p$ matrix of known variables for $p<n$, ${\varvec{\beta }}=({\beta }_1, \ldots , {\beta }_p)^\top $ is a $p\times 1$ vector of unknown coefficients, ${\sigma }$ is an unknown scale parameter and ${\varvec{\epsilon }}=({\varepsilon }_1, \ldots , {\varepsilon }_n)^\top $ is an $n\times 1$ vector of random errors with $\textrm{E}[{\varepsilon }_i]=0$ and $\textrm{E}[{\varepsilon }_i^2]<\infty $. When ${\varvec{X}}$ is of full rank, ${\varvec{\beta }}$ is estimated by the least squares estimator $\widehat{{\varvec{\beta }}}=({\varvec{X}}^\top {\varvec{X}})^{-1}{\varvec{X}}^\top {\varvec{y}}$.

In the case of multicollinearity, ${\varvec{X}}$ is not of full rank and we can not use the least squares estimator. A feasible solution is the estimator $\widehat{{\varvec{\beta }}}^\textrm{MP}=({\varvec{X}}^\top {\varvec{X}})^{+}{\varvec{X}}^\top {\varvec{y}}$, where ${\varvec{A}}^+$ is the Moore-Penrose inverse of matrix ${\varvec{A}}$. Alternative approaches to address the problem of multicollinearity are the regularized least squares methods. The minimization of

$$\begin{aligned} \Vert {\varvec{y}}-{\varvec{X}}{\varvec{\beta }}\Vert ^2/{\sigma }^2 + {\lambda }\Vert {\varvec{\beta }}\Vert ^2, \quad {\lambda }>0 \end{aligned}$$

yields the ridge regression estimator $\widehat{{\varvec{\beta }}}^\textrm{RR}=({\varvec{X}}^\top {\varvec{X}}+{\lambda }{\sigma }^2{\varvec{I}})^{-1}{\varvec{X}}^\top {\varvec{y}}$, and the minimization of

$$\begin{aligned} \Vert {\varvec{y}}-{\varvec{X}}{\varvec{\beta }}\Vert ^2/{\sigma }^2 + {\lambda }\sum _{i=1}^p |{\beta }_i|, \quad {\lambda }>0 \end{aligned}$$

yields the lasso regression estimator. As a regularized method with a logarithmic penalty, we consider minimizing

$$\begin{aligned} L({\varvec{\beta }})=\Vert {\varvec{y}}-{\varvec{X}}{\varvec{\beta }}\Vert ^2/{\sigma }^2 + c \log (\Vert {\varvec{\beta }}\Vert ^2), \end{aligned}$$

(5.4)

for positive constant c. The optimal solution satisfies the equation

$$\begin{aligned} {\varvec{\beta }}= \Big ({\varvec{X}}^\top {\varvec{X}}+{c{\sigma }^2\over \Vert {\varvec{\beta }}\Vert ^2}{\varvec{I}}\Big )^{-1}{\varvec{X}}^\top {\varvec{y}}. \end{aligned}$$

Using the same arguments as in Sect. 2 based on the above equation and replacing $\Vert {\varvec{\beta }}\Vert ^2$ with $\Vert \widetilde{{\varvec{\beta }}}\Vert ^2$ for some stable estimator $\widetilde{{\varvec{\beta }}}$ like $\widehat{{\varvec{\beta }}}^\textrm{MP}$ or $\widehat{{\varvec{\beta }}}^\textrm{RR}$, one gets the one-step and two-step estimators, given by

$$\begin{aligned} \widehat{{\varvec{\beta }}}^\textrm{OS}=&\left( {\varvec{X}}^\top {\varvec{X}}+(c\hat{{\sigma }}^2/\Vert \widetilde{{\varvec{\beta }}}\Vert ^2){\varvec{I}}\right) ^{-1}{\varvec{X}}^\top {\varvec{y}}, \end{aligned}$$

(5.5)

$$\begin{aligned} \widehat{{\varvec{\beta }}}^\textrm{TS}=&\left[ {\varvec{X}}^\top {\varvec{X}}+{c\hat{{\sigma }}^2\over {\varvec{y}}^\top {\varvec{X}}\{{\varvec{X}}^\top {\varvec{X}}+(c\hat{{\sigma }}^2/\Vert \widetilde{{\varvec{\beta }}}\Vert ^2){\varvec{I}}\}^{-2}{\varvec{X}}^\top {\varvec{y}}}{\varvec{I}}\right] ^{-1}{\varvec{X}}^\top {\varvec{y}}, \end{aligned}$$

(5.6)

where $\hat{{\sigma }}^2={\varvec{y}}^\top ({\varvec{I}}-{\varvec{X}}({\varvec{X}}^\top {\varvec{X}})^-{\varvec{X}}^\top ){\varvec{y}}/(n-r)$ for $r=\textrm{rank}\,({\varvec{X}})$. The minimaxity of these estimators is an interesting query to address in a future.

References

Banerjee, A., Merugu, S., Dhillon, I. S., & Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6, 1705–1749.
MathSciNet Google Scholar
Baranchik, A. J. (1970). A family of minimax estimators of the mean of a multivariate normal distribution. Annals of Mathematical Statistics, 41, 642–645.
Article MathSciNet Google Scholar
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. Springer Verlag.
Book Google Scholar
Brandwein, A. C., & Strawderman, W. E. (1990). Stein estimation: The spherically symmetric case. Statistical Science, 5, 356–369.
Article MathSciNet Google Scholar
Clevenson, M. L., & Zidek, J. V. (1975). Simultaneous estimation of the means of independent Poisson laws. Journal of the American Statistical Association, 70, 698–705.
MathSciNet Google Scholar
Doornik, J. A. (2007). Object-oriented matrix programming using Ox (3rd ed.). Timberlake Consultants Press and Oxford.
Google Scholar
Efron, B., & Morris, C. (1972a). Empirical Bayes on vector observations: An extension of Stein’s method. Biometrika, 59, 335–347.
Article MathSciNet Google Scholar
Efron, B., & Morris, C. (1972b). Limiting the risk of Bayes and empirical Bayes estimators—Part II: The empirical Bayes case. Journal of the American Statistical Association, 67, 130–139.
MathSciNet Google Scholar
Efron, B., & Morris, C. (1976a). Families of minimax estimators of the mean of a multivariate normal distribution. Annals of Statistics, 4, 11–21.
Article MathSciNet Google Scholar
Efron, B., & Morris, C. (1976b). Multivariate empirical Bayes and estimation of covariance matrices. Annals of Statistics, 4, 22–32.
Article MathSciNet Google Scholar
Fourdrinier, D., Strawderman, W. E., & Wells, M. T. (2018). Shrinkage estimation. Springer Nature.
Book Google Scholar
James, W., & Stein, C. (1961). Estimation with quadratic loss. In: Proceedings of the Fourth Berkeley Symposium on mathematical statistics and probability, 1, 361–379. University of California Press, Berkeley.
Kubokawa, T. (1991). An approach to improving the James-Stein estimator. Journal of Multivariate Analysis, 36, 121–126.
Article MathSciNet Google Scholar
Kubokawa, T. (1994). A unified approach to improving equivariant estimators. Annals of Statistics, 22, 290–299.
Article MathSciNet Google Scholar
Robert, C. P. (2007). The Bayesian choice: From decision-theoretic foundations to computational implementation (2nd ed.). Springer Verlag.
Google Scholar
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In: Proceedings of the Third Berkeley Symposium on mathematical statistics and probability, 1, 197–206. University of California Press, Berkeley.
Stein, C. (1973). Estimation of the mean of a multivariate normal distribution. In: Proceedings of the Prague Symposium on asymptotic statistics, II, 345–381. Charles University, Prague.
Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Annals of Statistics, 9, 1135–1151.
Article MathSciNet Google Scholar
Strawderman, W. E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. Annals of Mathematical Statistics, 42, 385–388.
Article MathSciNet Google Scholar
Tsui, K.-W. (1984). Robustness of Clevenson-Zidek-type estimators. Journal of the American Statistical Association, 79, 152–157.
Article MathSciNet Google Scholar

Download references

Acknowledgements

We would like to thank the Editor, the Associate Editor and the reviewer for many valuable comments and helpful suggestions which led to an improved version of this paper. This research was supported in part by Grant-in-Aid for Scientific Research (18K11188, 22K11928) from Japan Society for the Promotion of Science.

Funding

Open access funding provided by The University of Tokyo.

Author information

Authors and Affiliations

Faculty of Economics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan
Tatsuya Kubokawa

Authors

Tatsuya Kubokawa
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Tatsuya Kubokawa.

Ethics declarations

Conflict of interest

The author declares that he has no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Kubokawa, T. Shrinkage estimation with logarithmic penalties. Jpn J Stat Data Sci 7, 341–360 (2024). https://doi.org/10.1007/s42081-023-00225-y

Download citation

Received: 29 June 2023
Revised: 05 October 2023
Accepted: 05 October 2023
Published: 29 November 2023
Issue Date: June 2024
DOI: https://doi.org/10.1007/s42081-023-00225-y

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Shrinkage estimation with logarithmic penalties

Abstract

Similar content being viewed by others

Weak Versus Strong Dominance of Shrinkage Estimators

On Shrinkage Estimators and “Effective Degrees of Freedom”

On Minimax Shrinkage Estimation with Variable Selection

1 Introduction

2 Shrinkage estimators under the logarithmic penalties

Theorem 2.1

Proof

3 Extension to the case of unknown scale

Theorem 3.1

Proof

4 Estimation of means of discrete distributions

Theorem 4.1

Proof

5 Remarks on extensions

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Shrinkage estimation with logarithmic penalties

Abstract

Similar content being viewed by others

Weak Versus Strong Dominance of Shrinkage Estimators

On Shrinkage Estimators and “Effective Degrees of Freedom”

On Minimax Shrinkage Estimation with Variable Selection

1 Introduction

2 Shrinkage estimators under the logarithmic penalties

Theorem 2.1

Proof

3 Extension to the case of unknown scale

Theorem 3.1

Proof

4 Estimation of means of discrete distributions

Theorem 4.1

Proof

5 Remarks on extensions

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation