Notes Estimation Theory
Estimation theory
• Parametric estimation;
• Bayesian estimation.
- $F_y^\theta(y)$, $f_y^\theta(y)$: the cumulative distribution function and the probability density function, respectively, of the observation vector y, which depend on the unknown vector θ.
An estimator is a function
$$T : Y \to \Theta.$$
The value θ̂ = T (y), returned by the estimator when applied to the observa-
tion y of y, is called estimate of θ.
Unbiasedness
A first desirable property is that the expected value of the estimate θ̂ = T (y)
be equal to the actual value of the parameter θ.
In the above definition we used the notation Eθ [·], which stresses the
dependency on θ of the expected value of T (y), due to the fact that the pdf
of y is parameterized by θ itself.
The unbiasedness condition (1.1) guarantees that the estimator T(·) does not introduce systematic errors, i.e., errors that are not averaged out even when considering an infinite number of observations of y. In other words, T(·) neither overestimates nor underestimates θ, on average (see Fig. 1.1).
[Figure 1.1: unbiased vs. biased estimator.]
one has
$$E[\bar{y}] = E\left[\frac{1}{n}\sum_{i=1}^n y_i\right] = \frac{1}{n}\sum_{i=1}^n E[y_i] = \frac{1}{n}\sum_{i=1}^n m = m.$$
However,
$$\begin{aligned}
E\left[\left(n(y_i - m) - \sum_{j=1}^n (y_j - m)\right)^2\right]
&= n^2 E\left[(y_i - m)^2\right]
- 2n\,E\left[(y_i - m)\sum_{j=1}^n (y_j - m)\right]
+ E\left[\left(\sum_{j=1}^n (y_j - m)\right)^2\right] \\
&= n^2\sigma^2 - 2n\sigma^2 + n\sigma^2 \\
&= n(n-1)\sigma^2
\end{aligned}$$
because, by the independence assumption, $E\left[(y_i - m)(y_j - m)\right] = 0$ for $i \neq j$. Therefore,
$$E\left[\hat\sigma_y^2\right] = \frac{1}{n}\sum_{i=1}^n \frac{1}{n^2}\, n(n-1)\sigma^2 = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2.$$
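A quick Monte Carlo check of this bias (a sketch only: NumPy is assumed available, and the values of n, m and σ² are invented for the experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma2 = 10, 2.0, 4.0                  # invented example values
trials = 200_000

y = rng.normal(m, np.sqrt(sigma2), size=(trials, n))
sigma2_hat = y.var(axis=1)                   # (1/n) sum_i (y_i - ybar)^2, the biased estimator

print("empirical mean of sigma2_hat:", sigma2_hat.mean())
print("predicted (n-1)/n * sigma2  :", (n - 1) / n * sigma2)
```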
Consistency
[Figure: estimator distributions for n = 20, 50, 100, 500.]
If
$$\lim_{n\to\infty} E\left[(\hat\theta_n - \theta)^2\right] = 0,$$
The result in Example 1.4 is a special case of the following more general
celebrated result.
where $m_{T(y)} = E[T(y)]$. The above expression shows that the MSE of a biased estimator is the sum of the variance of the estimator and of the square of the deterministic quantity $m_{T(y)} - \theta$, which is called bias error. As we will see, the trade-off between the variance of the estimator and the bias error is a fundamental limitation in many practical estimation problems.
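The decomposition MSE = variance + squared bias can be checked numerically. The sketch below (again with invented values and NumPy assumed) does so for the biased sample variance estimator discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma2, trials = 10, 2.0, 4.0, 200_000   # invented values

y = rng.normal(m, np.sqrt(sigma2), size=(trials, n))
est = y.var(axis=1)                             # biased estimator of sigma2

mse = np.mean((est - sigma2) ** 2)              # Monte Carlo MSE
var = est.var()                                 # variance of the estimator
bias2 = (est.mean() - sigma2) ** 2              # squared bias

print(mse, var + bias2)                         # identical up to floating-point error
```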
The MSE can be used to decide which estimator is better within a family
of estimators.
Definition 1.5. Let T1 (·) and T2 (·) be two estimators of the parameter θ.
Then, T1 (·) is uniformly preferable to T2 (·) if
• be unbiased;
• the previous condition must hold for every admissible value of the pa-
rameter θ.
Unfortunately, there are many problems for which there does not exist any
UMV UE estimator. For this reason, we often restrict the class of estimators,
in order to find the best one within the considered class. A popular choice
is that of linear estimators, i.e., taking the form
$$T(y) = \sum_{i=1}^n a_i\, y_i, \qquad (1.4)$$
with $a_i \in \mathbb{R}$.
Now, among all the estimators of the form (1.4), with the coefficients $a_i$ satisfying (1.5), we need to find the minimum variance one. Since the observations $y_i$ are independent, the variance of T(y) is given by
$$E^\theta\left[(T(y) - m)^2\right] = E^\theta\left[\left(\sum_{i=1}^n a_i y_i - m\right)^2\right] = \sum_{i=1}^n a_i^2\,\sigma_i^2.$$
The coefficients are therefore obtained by solving the constrained problem
$$\min_{a_1,\dots,a_n}\ \sum_{i=1}^n a_i^2\,\sigma_i^2 \qquad \text{s.t.} \quad \sum_{i=1}^n a_i = 1.$$
By introducing the Lagrangian
$$L(a_1,\dots,a_n,\lambda) = \sum_{i=1}^n a_i^2\,\sigma_i^2 + \lambda\left(\sum_{i=1}^n a_i - 1\right),$$
the solution must satisfy
$$\frac{\partial L(a_1,\dots,a_n,\lambda)}{\partial a_i} = 0, \qquad i = 1,\dots,n, \qquad (1.6)$$
$$\frac{\partial L(a_1,\dots,a_n,\lambda)}{\partial \lambda} = 0. \qquad (1.7)$$
From (1.7) we obtain the constraint (1.5), while (1.6) implies that
$$2\,a_i\,\sigma_i^2 + \lambda = 0, \qquad i = 1,\dots,n,$$
from which
$$\lambda = -\,\frac{1}{\displaystyle\sum_{i=1}^n \frac{1}{2\sigma_i^2}} \qquad (1.8)$$
and
$$a_i = \frac{\dfrac{1}{\sigma_i^2}}{\displaystyle\sum_{j=1}^n \dfrac{1}{\sigma_j^2}}, \qquad i = 1,\dots,n. \qquad (1.9)$$
Notice that if all the measurements have the same variance $\sigma_i^2 = \sigma^2$, the estimator $\hat{m}_{BLUE}$ boils down to the sample mean $\bar{y}$. This means that the BLUE estimator can be seen as a generalization of the sample mean to the case when the measurements $y_i$ have different accuracy (i.e., different variance $\sigma_i^2$). In fact, the BLUE estimator is a weighted average of the observations, in which the weights are inversely proportional to the variance of the measurements or, seen another way, directly proportional to the precision of each observation. Let us assume that for a certain i, $\sigma_i^2 \to \infty$. This means that the measurement $y_i$ is completely unreliable. Then, the weight $1/\sigma_i^2$ of $y_i$ within $\hat{m}_{BLUE}$ will tend to zero. On the other hand, for an infinitely precise measurement $y_j$ ($\sigma_j^2 \to 0$), the corresponding weight $1/\sigma_j^2$ will be predominant over all the other weights and the BLUE estimate will approach that measurement, i.e., $\hat{m}_{BLUE} \simeq y_j$. △
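As a quick illustration of this weighted-average interpretation, the following sketch (NumPy assumed; measurement values and variances invented) implements the combination with the weights (1.9) and checks the two limit cases discussed above.

```python
import numpy as np

def blue_mean(y, var):
    """Weighted mean with weights proportional to 1/sigma_i^2, as in (1.9)."""
    w = 1.0 / np.asarray(var, dtype=float)
    return np.sum(w * np.asarray(y)) / np.sum(w)

y = np.array([1.2, 0.8, 1.1])                    # hypothetical measurements of the same quantity

print(blue_mean(y, [1.0, 1.0, 1.0]), y.mean())   # equal variances: BLUE = sample mean
print(blue_mean(y, [1.0, 1e6, 0.01]))            # y_2 nearly useless, y_3 very precise:
                                                 # the estimate approaches y_3 = 1.1
```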
where
$$I_n(\theta) = E^\theta\left[\left(\frac{\partial \ln f_y^\theta(y)}{\partial\theta}\right)^2\right] \qquad (1.12)$$
is the Fisher information of the n observations; in the case of i.i.d. observations,
$$I_n(\theta) = n\, I_1(\theta).$$
where the inequality must be understood in the matrix sense and the matrix $I_n(\theta) \in \mathbb{R}^{p\times p}$ is the so-called Fisher information matrix
$$I_n(\theta) = E^\theta\left[\left(\frac{\partial \ln f_y^\theta(y)}{\partial\theta}\right)\left(\frac{\partial \ln f_y^\theta(y)}{\partial\theta}\right)^T\right].$$
Notice that the matrix $E^\theta\left[(T(y) - \theta)(T(y) - \theta)^T\right]$ is the covariance matrix of the unbiased estimator $T(\cdot)$.
Theorem 1.3 states that there does not exist any unbiased estimator with covariance smaller than $[I_n(\theta)]^{-1}$. Notice that $I_n(\theta)$ depends, in general, on the actual value of the parameter θ (because the partial derivatives must be evaluated at θ), which is unknown. For this reason, an approximation of the lower bound is usually computed in practice, by replacing θ with an estimate θ̂. Nevertheless, the Cramér-Rao bound is also important because it allows one to define the key concept of efficiency of an estimator.
An efficient estimator has the least possible variance among all unbiased
estimators (therefore, it is also a UMVUE).
In the special case of i.i.d. observations $y_i$, Theorem 1.3 states that $I_n(\theta) = n\,I_1(\theta)$, where $I_1(\theta)$ is the Fisher information of a single observation. Therefore, for a fixed θ, the Cramér-Rao bound decreases as $1/n$ as the number of observations n grows.
$$E^\theta\left[(\bar{y} - m_y)^2\right] = \frac{\sigma_y^2}{n} \;\ge\; [I_n(\theta)]^{-1} = \frac{[I_1(\theta)]^{-1}}{n}.$$
Let us now assume that the $y_i$ are distributed according to the Gaussian pdf
$$f_{y_i}(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma_y}\, e^{-\frac{(y_i - m_y)^2}{2\sigma_y^2}}.$$
Let us compute the Fisher information of a single measurement,
$$I_1(\theta) = E^\theta\left[\left(\frac{\partial \ln f_{y_1}^\theta(y_1)}{\partial\theta}\right)^2\right].$$
and hence,
$$I_1(\theta) = E^\theta\left[\frac{(y_1 - m_y)^2}{\sigma_y^4}\right] = \frac{1}{\sigma_y^2}.$$
The Cramér-Rao bound takes on the value $[I_n(\theta)]^{-1} = \frac{1}{n\,I_1(\theta)} = \frac{\sigma_y^2}{n}$, which coincides with the variance of the sample mean: in this Gaussian setting, the sample mean is therefore an efficient estimator of $m_y$.
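The sketch below compares, via Monte Carlo simulation, the variance of the sample mean with the bound σ_y²/n (NumPy assumed; the values of n, m_y and σ_y² are invented).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m_y, sigma2_y, trials = 25, 1.5, 2.0, 100_000   # invented values

y = rng.normal(m_y, np.sqrt(sigma2_y), size=(trials, n))
ybar = y.mean(axis=1)                               # sample mean of each simulated data set

print("variance of the sample mean:", ybar.var())
print("Cramer-Rao bound sigma2_y/n:", sigma2_y / n)   # the two values should be close
```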
Definition 1.10. Let y be a vector of observations with pdf $f_y^\theta(y)$, depending on the unknown parameter θ ∈ Θ. The likelihood function is defined as
$$L(\theta\,|\,y) = f_y^\theta(y).$$
In practice, it is often more convenient to work with the log-likelihood $\ln L(\theta\,|\,y)$.
Remark 1.1. Assuming that the pdf $f_y^\theta(y)$ is a differentiable function of $\theta = (\theta_1, \dots, \theta_p) \in \Theta \subseteq \mathbb{R}^p$, with Θ an open set, if θ̂ is a maximum of L(θ|y), it has to be a solution of the equations
$$\left.\frac{\partial L(\theta\,|\,y)}{\partial \theta_i}\right|_{\theta=\hat\theta} = 0, \qquad i = 1,\dots,p, \qquad (1.13)$$
or equivalently of
$$\left.\frac{\partial \ln L(\theta\,|\,y)}{\partial \theta_i}\right|_{\theta=\hat\theta} = 0, \qquad i = 1,\dots,p. \qquad (1.14)$$
from which
$$\sum_{i=1}^n \frac{y_i - \hat{m}_{ML}}{\sigma_y^2} = 0,$$
and hence
$$\hat{m}_{ML} = \frac{1}{n}\sum_{i=1}^n y_i.$$
Therefore, in this case the ML estimator coincides with the sample mean. Since the observations are i.i.d. Gaussian variables, this estimator is also efficient (see Example 1.6). △
The result in Example 1.7 is not restricted to the specific setting or pdf
considered. The following general theorem illustrates the importance of max-
imum likelihood estimators, in the context of parametric estimation.
Theorem 1.4. Under the same assumptions for which the Cramér-Rao bound
holds, if there exists an efficient estimator T ∗ (·), then T ∗ (·) is a maximum
likelihood estimator.
from which
$$\hat{m}_{ML} = \frac{1}{n}\sum_{i=1}^n y_i, \qquad \sigma^2_{ML} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{m}_{ML})^2.$$
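A minimal sketch of these closed-form ML estimates on synthetic Gaussian data (all numerical values invented, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=2.0, scale=1.5, size=1000)    # synthetic i.i.d. Gaussian observations

m_ml = y.mean()                                  # (1/n) sum_i y_i
sigma2_ml = np.mean((y - m_ml) ** 2)             # (1/n) sum_i (y_i - m_ml)^2  (not 1/(n-1))

print(m_ml, sigma2_ml)
```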
Under mild regularity assumptions, the maximum likelihood estimator is:
• asymptotically unbiased;
• consistent;
• asymptotically efficient;
• asymptotically normal.
h : Θ ⊆ Rp → Rn
y = Uθ + ε. (1.16)
In the following, we will assume that rank(U) = p, which means that the
number of linearly independent measurements is not smaller than the number
of parameters to be estimated (otherwise, the problem is ill posed).
We now introduce two popular estimators that can be used to estimate
θ in the setting (1.16). We will discuss their properties, depending on the
assumptions we make on the measurement noise ε. Let us start with the
Least Squares estimator.
The name of this estimator comes from the fact that it minimizes the
sum of the squared differences between the data realization y and the model
Uθ, i.e.
$$\hat\theta_{LS} = \arg\min_{\theta}\, \|y - U\theta\|^2.$$
Indeed,
$$\left.\frac{\partial\, \|y - U\theta\|^2}{\partial \theta}\right|_{\theta=\hat\theta_{LS}} = 2\,\hat\theta_{LS}^T U^T U - 2\,y^T U = 0,$$
where the properties $\frac{\partial\, x^T A x}{\partial x} = 2 x^T A$ (for a symmetric matrix A) and $\frac{\partial\, A x}{\partial x} = A$ have been exploited. By solving with respect to $\hat\theta_{LS}^T$, one gets
$$\hat\theta_{LS}^T = y^T U (U^T U)^{-1}.$$
Finally, by transposing the above expression and taking into account that
the matrix (U T U) is symmetric, one obtains the equation (1.17).
It is worth stressing that computing the LS estimator does not require any a priori information about the noise ε. As we will see in the sequel, however, the properties of ε will influence those of the LS estimator.
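A minimal numerical sketch of the LS estimate (1.17) is given below; the matrix U, the true parameter vector and the noise level are invented for illustration, and NumPy is assumed available.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 3
U = rng.normal(size=(n, p))                      # full column rank with probability one
theta_true = np.array([1.0, -2.0, 0.5])
y = U @ theta_true + 0.1 * rng.normal(size=n)    # linear observations with additive noise

theta_ls = np.linalg.solve(U.T @ U, U.T @ y)     # (U^T U)^{-1} U^T y, as in (1.17)
print(theta_ls)
```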
Similarly to what has been shown for the LS estimator, it is easy to verify
that the GM estimator minimizes the weighted sum of squared errors between
y and Uθ, i.e.
$$\hat\theta_{GM} = \arg\min_{\theta}\, (y - U\theta)^T \Sigma_\varepsilon^{-1} (y - U\theta).$$
Notice that the Gauss-Markov estimator requires knowledge of the covariance matrix $\Sigma_\varepsilon$ of the measurement noise. By using this information, the measurements are weighted with a matrix weight that is inversely proportional to their uncertainty.
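Under the same kind of synthetic setting, a sketch of the GM estimate (1.18) with a known diagonal noise covariance (the diagonal structure is an assumption of this example, not a requirement of the estimator):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 3
U = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
sig2 = rng.uniform(0.01, 1.0, size=n)                        # a different variance per measurement
y = U @ theta_true + rng.normal(size=n) * np.sqrt(sig2)

Sinv = np.diag(1.0 / sig2)                                   # Sigma_eps^{-1}
theta_gm = np.linalg.solve(U.T @ Sinv @ U, U.T @ Sinv @ y)   # as in (1.18)
print(theta_gm)
```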
Under the assumption that the noise has zero mean, E[ε] = 0, it is easy to show that both the LS and the GM estimators are unbiased. For the LS estimator one has
$$E^\theta\big[\hat\theta_{LS}\big] = E^\theta\big[(U^T U)^{-1} U^T y\big] = E^\theta\big[(U^T U)^{-1} U^T (U\theta + \varepsilon)\big] = E^\theta\big[\theta + (U^T U)^{-1} U^T \varepsilon\big] = \theta.$$
Similarly, for the GM estimator,
$$E^\theta\big[\hat\theta_{GM}\big] = E^\theta\big[\theta + (U^T \Sigma_\varepsilon^{-1} U)^{-1} U^T \Sigma_\varepsilon^{-1} \varepsilon\big] = \theta.$$
If the noise vector ε has non-zero mean, mε = E [ε], but the mean mε
is known, the LS and GM estimators can be easily amended to remove the
bias. In fact, if we define the new vector of random variables ε̃ = ε − mε ,
the equation (1.16) can be rewritten as
y − mε = Uθ + ε̃, (1.19)
and since clearly $E[\tilde\varepsilon] = 0$ and $E[\tilde\varepsilon\tilde\varepsilon^T] = \Sigma_\varepsilon$, all the treatment can be repeated
by replacing y with y − mε . Therefore, the expressions of the LS and GM
estimators remain those in (1.17) and (1.18), with y replaced by y − mε .
The case in which the mean of ε is unknown is more intriguing. In some
cases, one may try to estimate it from the data, along with the parameter θ.
Assume for example that E [εi ] = m̄ε , ∀i. This means that E [ε] = m̄ε · 1,
where $\mathbf{1} = [1\ 1\ \cdots\ 1]^T$. Now, one can define the extended parameter vector $\bar\theta = [\theta^T\ \ \bar m_\varepsilon]^T \in \mathbb{R}^{p+1}$, and use the same decomposition as in (1.19) to obtain
$$y = [U\ \ \mathbf{1}]\,\bar\theta + \tilde\varepsilon.$$
Then, one can apply the LS or GM estimator, by replacing U with [U 1], to
obtain a simultaneous estimate of the p parameters θ and of the scalar mean
m̄ε .
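A sketch of this augmentation (all numbers invented): a column of ones is appended to U and the LS estimator returns both θ and the noise mean.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 2
U = rng.normal(size=(n, p))
theta_true, mean_eps = np.array([0.7, -1.3]), 0.4
y = U @ theta_true + mean_eps + 0.1 * rng.normal(size=n)    # noise with unknown mean

Ue = np.hstack([U, np.ones((n, 1))])                        # extended regressor [U 1]
theta_bar = np.linalg.solve(Ue.T @ Ue, Ue.T @ y)            # estimates [theta; mean_eps]
print(theta_bar)
```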
In the special case Σε = σε2 In (with In identity matrix of dimension n), i.e.,
when the variables ε are uncorrelated and have the same variance σε2 , the
BLUE estimator is the Least Squares estimator (1.17).
Proof
Since we consider the class of linear unbiased estimators, we have T (y) = Ay,
and E [Ay] = AE [y] = AUθ. Therefore, one must impose the constraint
AU = Ip to guarantee that the estimator is unbiased.
In order to find the minimum variance estimator, it is necessary to minimize (in the matrix sense) the covariance of the estimation error. Since $AU = I_p$, the estimation error is $Ay - \theta = A(U\theta + \varepsilon) - \theta = A\varepsilon$, whose covariance is
$$E\big[A\varepsilon\varepsilon^T A^T\big] = A\Sigma_\varepsilon A^T.$$
Writing a generic matrix A satisfying this constraint as
$$A = (U^T \Sigma_\varepsilon^{-1} U)^{-1} U^T \Sigma_\varepsilon^{-1} + M, \qquad (1.22)$$
where the unbiasedness constraint $AU = I_p$ implies $MU = 0$, one obtains
$$\begin{aligned}
A\Sigma_\varepsilon A^T &= (U^T \Sigma_\varepsilon^{-1} U)^{-1} U^T \Sigma_\varepsilon^{-1}\Sigma_\varepsilon\Sigma_\varepsilon^{-1} U (U^T \Sigma_\varepsilon^{-1} U)^{-1}
+ (U^T \Sigma_\varepsilon^{-1} U)^{-1} U^T \Sigma_\varepsilon^{-1}\Sigma_\varepsilon M^T \\
&\quad + M\Sigma_\varepsilon\Sigma_\varepsilon^{-1} U (U^T \Sigma_\varepsilon^{-1} U)^{-1} + M\Sigma_\varepsilon M^T \\
&= (U^T \Sigma_\varepsilon^{-1} U)^{-1} + M\Sigma_\varepsilon M^T \\
&\ge (U^T \Sigma_\varepsilon^{-1} U)^{-1},
\end{aligned}$$
where the two cross terms vanish because $MU = 0$.
As it has been noticed after Definition 1.13, the solution of (1.23) is actu-
ally the Gauss-Markov estimator. Therefore, we can state that: in the case
of linear observations corrupted by additive Gaussian noise, the Maximum
Likelihood estimator coincides with the Gauss-Markov estimator. Moreover,
it is possible to show that in this setting
! !T
θ θ
∂ ln fy (y) ∂ ln fy (y)
Eθ = U T Σ−1 U
∂θ ∂θ
When, in addition, the noise components are independent and identically distributed, i.e.,
$$\varepsilon \sim N(0,\, \sigma_\varepsilon^2 I_n),$$
the GM estimator boils down to the LS one. Therefore: in the case of linear observations corrupted by independent and identically distributed Gaussian noise, the Maximum Likelihood estimator coincides with the Least Squares estimator.
yi = θ + vi , i = 1, . . . , n
E [T (y)] = E [x] .
where d(x, T (y)) denotes the distance between x and its estimate T (y),
according to a suitable metric.
Since the distance d(x, T (y)) is a random variable, the aim is to minimize
its expected value, i.e. to find
$$\hat{x}_{MSE} = E\,[x\,|\,y].$$
The previous result states that the estimator minimizing the MSE is the
a posteriori expected value of x, given the observation of y, i.e.
$$\hat{x}_{MSE} = \int_{-\infty}^{+\infty} x\, f_{x|y}(x|y)\, dx. \qquad (1.25)$$
Since, by the law of iterated expectations,
$$E\big[E[x\,|\,y]\big] = E[x],$$
one can conclude that the minimum MSE estimator is always unbiased.
The minimum MSE estimator has other attractive properties. In particular, if we consider the matrix
• $\hat{x}_{MSE}$ is the estimator minimizing (in the matrix sense) Q(x, T(y)), i.e.
Example 1.10. Consider two random variables x and y, whose joint pdf is
given by
$$f_{x,y}(x, y) = \begin{cases} -\frac{3}{2}x^2 + 2xy & \text{if } 0 \le x \le 1,\ 1 \le y \le 2 \\ 0 & \text{elsewhere} \end{cases}$$
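For this joint pdf, the conditional mean (1.25) can also be evaluated numerically; the sketch below (arbitrary grid resolution, NumPy assumed) computes x̂_MSE(y) for a few values of y.

```python
import numpy as np

def f_xy(x, y):
    # joint pdf of the example (zero outside the rectangle [0,1] x [1,2])
    return np.where((0.0 <= x) & (x <= 1.0) & (1.0 <= y) & (y <= 2.0),
                    -1.5 * x**2 + 2.0 * x * y, 0.0)

dx = 1e-4
x = np.arange(0.0, 1.0, dx) + dx / 2          # midpoints of a fine grid on [0, 1]
for y in (1.0, 1.25, 1.5, 1.75, 2.0):
    num = np.sum(x * f_xy(x, y)) * dx         # integral of x * f_{x,y}(x, y) over x
    den = np.sum(f_xy(x, y)) * dx             # marginal pdf f_y(y)
    print(y, num / den)                       # x_MSE(y) = E[x | y]
```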
T (y) = Ay + b (1.26)
in which the matrix A ∈ Rm×n and the vector b ∈ Rm are the coefficients of
the estimator to be determined. Among all estimators of the form (1.26), we
aim at finding the one minimizing the MSE.
Definition 1.17. The Linear Mean Square Error (LMSE) estimator is defined as $\hat{x}_{LMSE} = A^* y + b^*$, where
$$A^* = R_{xy} R_y^{-1}, \qquad b^* = m_x - R_{xy} R_y^{-1} m_y. \qquad (1.28)$$
Observe that the last two terms of the previous expression are positive
semidefinite matrices. Hence, the solution of problem (1.28) is obtained by
choosing A∗ , b∗ such that the last two terms are equal to zero, i.e.
$$A^* = R_{xy} R_y^{-1}, \qquad b^* = m_x - A^* m_y = m_x - R_{xy} R_y^{-1} m_y.$$
The LMSE estimator is unbiased because the expected value of the esti-
mation error is equal to zero. In fact,
$$E\big[x - \hat{x}_{LMSE}\big] = m_x - m_x - R_{xy} R_y^{-1}\, E[y - m_y] = 0.$$
In the case in which the random variables x, y are jointly Gaussian, with mean and covariance matrix defined as in Theorem 1.8, we recall that the conditional expected value of x given the observation of y is given by
$$E[x\,|\,y] = m_x + R_{xy} R_y^{-1} (y - m_y),$$
which coincides with the LMSE estimator $\hat{x}_{LMSE}$.
y 1 = x + ε1 ,
y 2 = x + ε2 .
Let ε1 , ε2 be two independent random variables, with zero mean and variance
σ12 , σ22 , respectively. Under the assumption that x and εi , i = 1, 2, are
independent, we aim at computing the LMSE estimator of x.
y = 1 x + ε,
where 1 = (1 1)T .
First, let us compute the mean and the covariance matrix of y:
$$E[y] = E[\mathbf{1}\,x + \varepsilon] = \mathbf{1}\, m_x, \qquad R_y = \mathbf{1}\,\sigma_x^2\,\mathbf{1}^T + R_\varepsilon,$$
where
$$R_\varepsilon = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}.$$
Applying Definition 1.17, after some algebra one obtains
$$\hat{x}_{LMSE} = \frac{m_x\,\sigma_1^2\sigma_2^2 + \sigma_x^2\left(\sigma_2^2\, y_1 + \sigma_1^2\, y_2\right)}{\sigma_1^2\sigma_2^2 + \sigma_x^2\left(\sigma_1^2 + \sigma_2^2\right)}
= \frac{\dfrac{m_x}{\sigma_x^2} + \dfrac{y_1}{\sigma_1^2} + \dfrac{y_2}{\sigma_2^2}}{\dfrac{1}{\sigma_x^2} + \dfrac{1}{\sigma_1^2} + \dfrac{1}{\sigma_2^2}}.$$
Hence, the LMSE estimate is a weighted average of the prior mean $m_x$ and of the two measurements, with weights given by the corresponding precisions.
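As a cross-check, the sketch below evaluates both the matrix form A*y + b* of Definition 1.17 and the scalar expression above, for invented values of m_x, σ_x², σ_1², σ_2² and of the measurements; the two results coincide.

```python
import numpy as np

m_x, s2x, s21, s22 = 1.0, 4.0, 1.0, 0.25        # invented prior mean/variance and noise variances
y1, y2 = 2.0, 1.4                               # invented measurements

# Matrix form: x_hat = m_x + R_xy R_y^{-1} (y - m_y)
one = np.ones((2, 1))
R_eps = np.diag([s21, s22])
R_y = s2x * (one @ one.T) + R_eps               # covariance of y
R_xy = s2x * one.T                              # 1 x 2 cross-covariance of x and y
y = np.array([[y1], [y2]])
x_mat = m_x + (R_xy @ np.linalg.solve(R_y, y - one * m_x)).item()

# Scalar (precision-weighted) form derived in the example
x_scal = (m_x / s2x + y1 / s21 + y2 / s22) / (1 / s2x + 1 / s21 + 1 / s22)

print(x_mat, x_scal)                            # the two values coincide
```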
1.7 Exercises
1.1. Verify that in the problem of Example 1.9, the LS and GM estimators of θ coincide respectively with $\bar{y}$ in (1.2) and $\hat{m}_{BLUE}$ in (1.10).
e) Find the variance of the estimation error for the estimator T (·) defined
in item d), in the case n = 1. Compute the Fisher information I1 (θ)
and show that the inequality (1.11) holds.
1.7. Let a and b be two unknown quantities, for which we have three different
measurements:
y1 = a + v1
y2 = b + v2
y3 = a + b + v3
where $v_i$, $i = 1, 2, 3$, are independent random variables with zero mean. Let $E[v_1^2] = E[v_3^2] = 1$ and $E[v_2^2] = \frac{1}{2}$. Find:
Compare the obtained estimates with those one would have if the observation
y 3 were not available. How does the variance of the estimation error change?
$$f_{x,y}(x, y) = \begin{cases} -\frac{3}{2}x^2 + 2xy & 0 \le x \le 1,\ 1 \le y \le 2 \\ 0 & \text{elsewhere} \end{cases}$$
a) Find the estimators x̂M SE and x̂LM SE of x, and plot them as functions
of the observation y.