
Chapter 1

Estimation theory

In this chapter, an introduction to estimation theory is provided. The objective of an estimation problem is to infer the value of an unknown quantity, by using information concerning other quantities (the data).
Depending on the type of a priori information available on the unknown
quantity to be estimated, two different settings can be considered:

• Parametric estimation;

• Bayesian estimation.

Paragraphs 1.1-1.5 discuss parametric estimation problems, while paragraph 1.6 concerns the Bayesian estimation framework.

1.1 Parametric estimation


The aim of a parametric estimation problem is to estimate a deterministic
quantity θ from observations of the random variables y 1 , . . . y n .

1.1.1 Problem formulation


Let:

- θ ∈ Θ ⊆ Rp , an unknown vector of parameters;


- y = (y 1 , . . . , y n )T ∈ Y ⊆ Rn , a vector of random variables, hereafter called observations or measurements;

- Fyθ (y) and fyθ (y), the cumulative distribution function and the probability density function, respectively, of the observation vector y, which depend on the unknown vector θ.

The set Θ, to which the parameter vector θ belongs, is referred to as the parameter set. It represents the a priori information available on the
admissible values of the vector θ. If all values are admissible, Θ = Rp .
The set Y, containing all the values that the random vector y may take,
is known as observation set. It is assumed that the cdf Fyθ (y) (or equivalently
the pdf fyθ (y)) is parameterized by the p parameters θ ∈ Rp (which means
that such parameters enter in the expressions of those functions). Hereafter,
the word parameter will be used to denote the entire unknown vector θ. To
emphasize the special case p = 1, we will sometimes use the expression scalar
parameter.
We are now ready to formulate the general version of a parametric esti-
mation problem.

Problem 1.1. Estimate the unknown parameter θ ∈ Θ, by using an observation y of the random vector y ∈ Y.

In order to solve Problem 1.1, one has to construct an estimator.

Definition 1.1. An estimator T (·) of the parameter θ is a function that maps the set of observations to the parameter set:

T : Y → Θ.

The value θ̂ = T (y), returned by the estimator when applied to the observa-
tion y of y, is called estimate of θ.

An estimator T (·) defines a rule that associates to each realization y of the measurement vector y the quantity θ̂ = T (y), which is an estimate of θ.

Notice that θ̂ can be seen as a realization of the random variable T (y); in fact, since T (y) is a function of the random variable y, the estimate θ̂ is
a random variable itself.

1.1.2 Properties of an estimator


According to Definition 1.1, the class of possible estimators is infinite. In
order to characterize the quality of an estimator, it is useful to introduce
some desired properties.

Unbiasedness

A first desirable property is that the expected value of the estimate θ̂ = T (y)
be equal to the actual value of the parameter θ.

Definition 1.2. An estimator T (y) of the parameter θ is unbiased (or correct) if
Eθ [T (y)] = θ, ∀θ ∈ Θ. (1.1)

In the above definition we used the notation Eθ [·], which stresses the
dependency on θ of the expected value of T (y), due to the fact that the pdf
of y is parameterized by θ itself.
The unbiasedness condition (1.1) guarantees that the estimator T (·) does
not introduce systematic errors, i.e., errors that are not averaged out even
when considering an infinite amount of observations of y. In other words,
T (·) neither overestimates nor underestimates θ, on average (see Fig. 1.1).

Example 1.1. Let y 1 , . . . , y n be random variables with mean m. The quantity

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad (1.2)$$

is the so-called sample mean. It is easy to verify that ȳ is an unbiased estimator of m. Indeed, due to the linearity of the expected value operator,
Figure 1.1: Probability density function of an unbiased estimator and of a biased one.

one has

$$E[\bar{y}] = E\left[\frac{1}{n}\sum_{i=1}^{n} y_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[y_i] = \frac{1}{n}\sum_{i=1}^{n} m = m.$$

Example 1.2. Let y 1 , . . . , y n be scalar random variables, independent and identically distributed (i.i.d.) with mean m and variance σ². The quantity

$$\hat{\sigma}_y^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$$

is a biased estimator of the variance σ². Indeed, from (1.2) one has

$$E\left[\hat{\sigma}_y^2\right] = E\left[\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \frac{1}{n}\sum_{j=1}^{n} y_j\right)^2\right]
= \frac{1}{n}\sum_{i=1}^{n}\frac{1}{n^2}\, E\left[\left(n\,y_i - \sum_{j=1}^{n} y_j\right)^2\right]
= \frac{1}{n}\sum_{i=1}^{n}\frac{1}{n^2}\, E\left[\left(n(y_i - m) - \sum_{j=1}^{n}(y_j - m)\right)^2\right].$$

However,

$$E\left[\left(n(y_i - m) - \sum_{j=1}^{n}(y_j - m)\right)^2\right]
= n^2 E\left[(y_i - m)^2\right] - 2n\, E\left[(y_i - m)\sum_{j=1}^{n}(y_j - m)\right] + E\left[\left(\sum_{j=1}^{n}(y_j - m)\right)^2\right]
= n^2\sigma^2 - 2n\sigma^2 + n\sigma^2 = n(n-1)\sigma^2$$

because, by the independence assumption, E[(y i − m)(y j − m)] = 0 for i ≠ j. Therefore,

$$E\left[\hat{\sigma}_y^2\right] = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{n^2}\, n(n-1)\sigma^2 = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2.$$

Example 1.3. Let y 1 , . . . , y n be i.i.d. scalar random variables, with mean m and variance σ². The quantity

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$$

is called sample variance. It is straightforward to verify that S² is an unbiased estimator of the variance σ². In fact, observing that

$$S^2 = \frac{n}{n-1}\,\hat{\sigma}_y^2,$$

one has immediately

$$E\left[S^2\right] = \frac{n}{n-1}\, E\left[\hat{\sigma}_y^2\right] = \frac{n}{n-1}\cdot\frac{n-1}{n}\,\sigma^2 = \sigma^2.$$
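
The bias computed above is easy to check numerically. The following Python sketch (a minimal illustration; the sample size, mean and variance are arbitrary choices) compares the empirical averages of σ̂_y² and S² over many repeated experiments with the true variance σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_trials = 10, 100_000
m, sigma2 = 1.0, 4.0                                   # true mean and variance

samples = rng.normal(m, np.sqrt(sigma2), size=(n_trials, n))
ybar = samples.mean(axis=1, keepdims=True)

sigma2_hat = ((samples - ybar) ** 2).mean(axis=1)      # biased estimator (divide by n)
S2 = ((samples - ybar) ** 2).sum(axis=1) / (n - 1)     # sample variance (divide by n - 1)

print("average of sigma2_hat:", sigma2_hat.mean())     # close to (n-1)/n * sigma2 = 3.6
print("average of S2:        ", S2.mean())             # close to sigma2 = 4.0
```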

Notice that, if T (·) is an unbiased estimator of θ, then g(T (·)) is not in general an unbiased estimator of g(θ), unless g(·) is a linear function.

Consistency

Another desirable property of an estimator is to provide an estimate that converges to the actual value of θ as the number of measurements grows.
Being the estimate a random variable, we need to introduce the notion of
convergence in probability.

Definition 1.3. Let {y_i}_{i=1}^∞ be a sequence of random variables. The sequence of estimators θ̂_n = Tn (y 1 , . . . , y n ) of θ is said to be consistent if θ̂_n converges in probability to θ, for all admissible values of θ, i.e.

$$\lim_{n\to\infty} P\left(\left|\hat{\theta}_n - \theta\right| \geq \varepsilon\right) = 0, \quad \forall \varepsilon > 0,\ \forall \theta \in \Theta.$$

Figure 1.2: Probability density function of a consistent estimator (shown for n = 20, 50, 100, 500).

Notice that consistency is an asymptotic property of an estimator. It guarantees that, as the number of data goes to infinity, the probability that the estimate differs from the actual value of the parameter goes to zero (see Fig. 1.2).
The next Theorem provides a sufficient condition for consistency of un-
biased estimators.

Theorem 1.1. Let θ̂_n be a sequence of unbiased estimators of the scalar parameter θ:

$$E\left[\hat{\theta}_n\right] = \theta, \quad \forall n,\ \forall \theta \in \Theta.$$

If

$$\lim_{n\to\infty} E\left[(\hat{\theta}_n - \theta)^2\right] = 0,$$

then the sequence θ̂_n is consistent.


Proof
For a random variable x, the Chebyshev inequality holds:

$$P(|x - m_x| \geq \varepsilon) \leq \frac{1}{\varepsilon^2}\, E\left[(x - m_x)^2\right].$$

Therefore, one has

$$\lim_{n\to\infty} P\left(\left|\hat{\theta}_n - \theta\right| \geq \varepsilon\right) \leq \lim_{n\to\infty} \frac{1}{\varepsilon^2}\, E\left[(\hat{\theta}_n - \theta)^2\right],$$

from which the result follows immediately. □


Therefore, for a sequence of unbiased estimators to be consistent, it is suffi-
cient that the variance of the estimates goes to zero as the number of mea-
surements grows.

Example 1.4. Let y 1 , . . . , y n be i.i.d. random variables with mean m and variance σ². In Example 1.1 it has been shown that the sample mean

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

is an unbiased estimator of the mean m. Let us now show that it is also a consistent estimator of m. The variance of the estimate is given by

$$\mathrm{Var}(\bar{y}) = E\left[(\bar{y} - m)^2\right] = E\left[\left(\frac{1}{n}\sum_{i=1}^{n} y_i - m\right)^2\right] = \frac{1}{n^2}\, E\left[\left(\sum_{i=1}^{n}(y_i - m)\right)^2\right] = \frac{\sigma^2}{n}$$

because the random variables y i are independent. Therefore,

$$\mathrm{Var}(\bar{y}) = \frac{\sigma^2}{n} \to 0 \quad \text{as } n \to \infty.$$

Hence, due to Theorem 1.1, the sample mean is a consistent estimator of the mean m. △
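
This behaviour can also be visualized numerically. The short sketch below (the distribution and sample sizes are illustrative choices) estimates Var(ȳ) by Monte Carlo for increasing n and compares it with σ²/n:

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma2, n_trials = 0.0, 2.0, 50_000

for n in (20, 50, 100, 500):
    samples = rng.normal(m, np.sqrt(sigma2), size=(n_trials, n))
    ybar = samples.mean(axis=1)
    # empirical variance of the sample mean vs. the theoretical value sigma^2 / n
    print(f"n={n:4d}   Var(ybar) ~ {ybar.var():.5f}   sigma^2/n = {sigma2/n:.5f}")
```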

The result in Example 1.4 is a special case of the following more general
celebrated result.

Theorem 1.2. (Law of large numbers)
Let {y_i}_{i=1}^∞ be a sequence of independent random variables with mean m and finite variance. Then, the sample mean ȳ converges to m in probability.

Mean square error

A criterion for measuring the quality of the estimate provided by an estimator is the Mean Square Error. Let us first consider the case of a scalar parameter
(θ ∈ R).

Definition 1.4. Let θ ∈ R. The Mean Square Error (MSE) of an estimator T (·) is defined as

$$\mathrm{MSE}\left[T(\cdot)\right] = E_\theta\left[(T(y) - \theta)^2\right].$$

Notice that if an estimator is unbiased, then the MSE is equal to the variance of the estimate T (y), and also to the variance of the estimation error T (y) − θ. On the other hand, for a biased estimator one has

$$\mathrm{MSE}\left[T(\cdot)\right] = E_\theta\left[(T(y) - m_{T(y)} + m_{T(y)} - \theta)^2\right] = E_\theta\left[(T(y) - m_{T(y)})^2\right] + (m_{T(y)} - \theta)^2$$

where m_T(y) = E[T(y)]. The above expression shows that the MSE of a biased estimator is the sum of the variance of the estimator and of the square of the deterministic quantity m_T(y) − θ, which is called bias error. As we will see, the trade-off between the variance of the estimator and the bias error is a fundamental limitation in many practical estimation problems.
The MSE can be used to decide which estimator is better within a family
of estimators.

Definition 1.5. Let T1 (·) and T2 (·) be two estimators of the parameter θ.
Then, T1 (·) is uniformly preferable to T2 (·) if

$$E_\theta\left[(T_1(y) - \theta)^2\right] \leq E_\theta\left[(T_2(y) - \theta)^2\right], \quad \forall \theta \in \Theta.$$

It is worth stressing that in order to be preferable to other estimators, an estimator must provide a smaller MSE for all the admissible values of the
parameter θ.
The above definitions can be extended quite naturally to the case of a
parameter vector θ ∈ Rp .

Definition 1.6. Let θ ∈ Rp . The Mean Square Error (MSE) of an estimator T (·) is defined as

$$\mathrm{MSE}\left[T(\cdot)\right] = E_\theta\left[\|T(y) - \theta\|^2\right] = E_\theta\left[\mathrm{tr}\{(T(y) - \theta)(T(y) - \theta)^T\}\right]$$

where tr(M) denotes the trace of the matrix M.

The concept of uniformly preferable estimator is analogous to that in Definition 1.5. It can also be defined in terms of an inequality between the corresponding covariance matrices, i.e., T1 (·) is uniformly preferable to T2 (·) if

$$E_\theta\left[(T_1(y) - \theta)(T_1(y) - \theta)^T\right] \leq E_\theta\left[(T_2(y) - \theta)(T_2(y) - \theta)^T\right]$$

where the matrix inequality A ≤ B means that B − A is a positive semidefinite matrix.

1.1.3 Minimum variance unbiased estimator


Let us restrict our attention to unbiased estimators. Since we have introduced
the concept of mean square error, it is natural to look for the estimator which
minimizes this performance index.

Definition 1.7. An unbiased estimator T ∗ (·) of the scalar parameter θ is a Uniformly Minimum Variance Unbiased Estimator (UMVUE) if

$$E_\theta\left[(T^*(y) - \theta)^2\right] \leq E_\theta\left[(T(y) - \theta)^2\right], \quad \forall \theta \in \Theta \qquad (1.3)$$

for all unbiased estimators T (·) of θ.



Notice that for an estimator to be UMVUE, it has to satisfy the following conditions:

• be unbiased;

• have minimum variance among all unbiased estimators;

• the previous condition must hold for every admissible value of the pa-
rameter θ.

Unfortunately, there are many problems for which there does not exist any
UMVUE estimator. For this reason, we often restrict the class of estimators,
in order to find the best one within the considered class. A popular choice
is that of linear estimators, i.e., taking the form
$$T(y) = \sum_{i=1}^{n} a_i y_i, \qquad (1.4)$$

with ai ∈ R.

Definition 1.8. A linear unbiased estimator T ∗ (·) of the scalar parameter θ is the Best Linear Unbiased Estimator (BLUE) if

$$E_\theta\left[(T^*(y) - \theta)^2\right] \leq E_\theta\left[(T(y) - \theta)^2\right], \quad \forall \theta \in \Theta$$

for every linear unbiased estimator T (·) of θ.

Unlike the UMVUE estimator, the BLUE estimator takes on a simple form and can be easily computed (one has just to find the optimal
values of the coefficients ai ).

Example 1.5. Let y i be independent random variables with mean m and variance σ_i², i = 1, . . . , n. Assume the variances σ_i² are known. Let us compute the BLUE estimator of m. Being the estimator linear, it takes on the form (1.4). In order to be unbiased, T (·) must satisfy

$$E_\theta[T(y)] = E_\theta\left[\sum_{i=1}^{n} a_i y_i\right] = \sum_{i=1}^{n} a_i E_\theta[y_i] = m \sum_{i=1}^{n} a_i = m.$$

Therefore, we must enforce the constraint

$$\sum_{i=1}^{n} a_i = 1. \qquad (1.5)$$

Now, among all the estimators of form (1.4), with the coefficients ai satisfying
(1.5), we need to find the minimum variance one. Being the observations y i
independent, the variance of T (y) is given by

$$E_\theta\left[(T(y) - m)^2\right] = E_\theta\left[\left(\sum_{i=1}^{n} a_i y_i - m\right)^2\right] = \sum_{i=1}^{n} a_i^2 \sigma_i^2.$$

Summing up, in order to determine the BLUE estimator, we have to solve the following constrained optimization problem:

$$\min_{a_i}\ \sum_{i=1}^{n} a_i^2 \sigma_i^2 \quad \text{s.t.} \quad \sum_{i=1}^{n} a_i = 1.$$

Let us write the Lagrangian function

$$L(a_1, \ldots, a_n, \lambda) = \sum_{i=1}^{n} a_i^2 \sigma_i^2 + \lambda\left(\sum_{i=1}^{n} a_i - 1\right)$$

and compute the stationary points by imposing

$$\frac{\partial L(a_1, \ldots, a_n, \lambda)}{\partial a_i} = 0, \quad i = 1, \ldots, n \qquad (1.6)$$

$$\frac{\partial L(a_1, \ldots, a_n, \lambda)}{\partial \lambda} = 0. \qquad (1.7)$$

From (1.7) we obtain the constraint (1.5), while (1.6) implies that

2ai σi2 + λ = 0, i = 1, . . . , n
from which

$$\lambda = -\frac{1}{\displaystyle\sum_{i=1}^{n} \frac{1}{2\sigma_i^2}} \qquad (1.8)$$

$$a_i = \frac{\dfrac{1}{\sigma_i^2}}{\displaystyle\sum_{j=1}^{n} \frac{1}{\sigma_j^2}}, \quad i = 1, \ldots, n \qquad (1.9)$$

Therefore, the BLUE estimator of the mean m is given by

$$\hat{m}_{BLUE} = \frac{1}{\displaystyle\sum_{i=1}^{n} \frac{1}{\sigma_i^2}}\,\sum_{i=1}^{n} \frac{1}{\sigma_i^2}\, y_i. \qquad (1.10)$$

Notice that if all the measurements have the same variance σi2 = σ 2 , the
estimator m̂BLU E boils down to the sample mean y. This means that the
BLUE estimator can be seen as a generalization of the sample mean, in
the case when the measurements y i have different accuracy (i.e., different
variance σi2 ). In fact, the BLUE estimator is a weighted average of the ob-
servations, in which the weights are inversely proportional to the variance of
the measurements or, seen another way, directly proportional to the precision
of each observation. Let us assume that for a certain i, σ_i² → ∞. This means that the measurement y i is completely unreliable. Then, the weight 1/σ_i² of y i within m̂_BLUE will tend to zero. On the other hand, for an infinitely precise measurement y j (σ_j² → 0), the corresponding weight 1/σ_j² will be predominant over all the other weights and the BLUE estimate will approach that measurement, i.e., m̂_BLUE ≃ y j . △
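
A small numerical sketch of the BLUE weights (1.9) and estimate (1.10) is given below; the true mean and the measurement variances are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)
m_true = 5.0
sigma2 = np.array([0.5, 1.0, 4.0, 9.0])           # known (different) measurement variances
y = m_true + rng.normal(0.0, np.sqrt(sigma2))     # one noisy measurement per sensor

w = (1.0 / sigma2) / np.sum(1.0 / sigma2)         # weights a_i from (1.9)
m_blue = np.sum(w * y)                            # BLUE estimate (1.10)

print("weights:", np.round(w, 3))                 # accurate sensors receive larger weights
print("sample mean:", y.mean(), "  BLUE:", m_blue)
```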

1.2 Cramér-Rao bound


This paragraph introduces a fundamental result which establishes a lower bound on the variance of every unbiased estimator of the parameter θ.

Theorem 1.3. (Cramér-Rao bound) Let T (·) be an unbiased estimator of the scalar parameter θ based on the observations y of the random variables y ∈ Y, and let the observation set Y be independent of θ. Then, under some technical regularity assumptions (see (Rohatgi and Saleh, 2001)), it holds that

$$E_\theta\left[(T(y) - \theta)^2\right] \geq \left[I_n(\theta)\right]^{-1}, \qquad (1.11)$$

where

$$I_n(\theta) = E_\theta\left[\left(\frac{\partial \ln f_y^\theta(y)}{\partial \theta}\right)^2\right] \qquad (1.12)$$

is called the Fisher information. Moreover, if the observations y 1 , . . . , y n are independent and identically distributed with the same pdf f_{y_1}^θ(y_1), one has

$$I_n(\theta) = n\, I_1(\theta).$$

When θ is a p-dimensional vector, the Cramér-Rao bound (1.11) becomes

$$E_\theta\left[(T(y) - \theta)(T(y) - \theta)^T\right] \geq \left[I_n(\theta)\right]^{-1},$$

where the inequality must be understood in the matrix sense and the matrix I_n(θ) ∈ R^{p×p} is the so-called Fisher information matrix

$$I_n(\theta) = E_\theta\left[\left(\frac{\partial \ln f_y^\theta(y)}{\partial \theta}\right)\left(\frac{\partial \ln f_y^\theta(y)}{\partial \theta}\right)^T\right].$$

Notice that the matrix E_θ[(T (y) − θ)(T (y) − θ)^T] is the covariance matrix of the unbiased estimator T (·).
Theorem 1.3 states that there does not exist any unbiased estimator with variance smaller than [I_n(θ)]^{−1}. Notice that I_n(θ) depends, in general, on the actual value of the parameter θ (because the partial derivatives must be evaluated at θ), which is unknown. For this reason, an approximation of the lower bound is usually computed in practice, by replacing θ with an estimate θ̂. Nevertheless, the Cramér-Rao bound is also important because it allows one to define the key concept of efficiency of an estimator.
Definition 1.9. An unbiased estimator T (·) is efficient if its variance achieves the Cramér-Rao bound, i.e.

$$E_\theta\left[(T(y) - \theta)^2\right] = \left[I_n(\theta)\right]^{-1}.$$

An efficient estimator has the least possible variance among all unbiased
estimators (therefore, it is also a UMVUE).
In the special case of i.i.d. observations y i , Theorem 1.3 states that
In (θ) = nI1 (θ), where I1 (θ) is the Fisher information of a single observa-
tion. Therefore, for a fixed θ, the Cramér-Rao bound decreases as 1/n as the
number of observations n grows.

Example 1.6. Let y 1 , . . . , y n be i.i.d. random variables with mean m_y and variance σ_y². In Examples 1.1 and 1.4, we have seen that the sample mean

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

is a consistent unbiased estimator of the mean m_y. Being the observations i.i.d., from Theorem 1.3 one has

$$E_\theta\left[(\bar{y} - m_y)^2\right] = \frac{\sigma_y^2}{n} \geq \left[I_n(\theta)\right]^{-1} = \frac{[I_1(\theta)]^{-1}}{n}.$$
Let us now assume that the y i are distributed according to the Gaussian pdf

$$f_{y_i}(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma_y}\, e^{-\frac{(y_i - m_y)^2}{2\sigma_y^2}}.$$

Let us compute the Fisher information of a single measurement

$$I_1(\theta) = E_\theta\left[\left(\frac{\partial \ln f_{y_1}^\theta(y_1)}{\partial \theta}\right)^2\right].$$

In this example, the unknown parameter to be estimated is the mean θ = m. Therefore,

$$\frac{\partial \ln f_{y_1}^\theta(y_1)}{\partial \theta} = \frac{\partial}{\partial m}\left(\ln\frac{1}{\sqrt{2\pi}\,\sigma_y} - \frac{(y_1 - m)^2}{2\sigma_y^2}\right)\bigg|_{m=m_y} = \frac{y_1 - m_y}{\sigma_y^2},$$

and hence,

$$I_1(\theta) = E_\theta\left[\frac{(y_1 - m_y)^2}{\sigma_y^4}\right] = \frac{1}{\sigma_y^2}.$$

The Cramér-Rao bound takes on the value

$$\left[I_n(\theta)\right]^{-1} = \frac{[I_1(\theta)]^{-1}}{n} = \frac{\sigma_y^2}{n},$$

which is equal to the variance of the estimator ȳ. Therefore, we can conclude that: in the case of i.i.d. Gaussian observations, the sample mean is an efficient estimator of the mean. △
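
The efficiency of the sample mean in the Gaussian case can be checked empirically. The following sketch (with illustrative values of m_y, σ_y and n) compares the Monte Carlo variance of ȳ with the Cramér-Rao bound σ_y²/n:

```python
import numpy as np

rng = np.random.default_rng(3)
m_y, sigma_y, n, n_trials = 2.0, 1.5, 25, 200_000

samples = rng.normal(m_y, sigma_y, size=(n_trials, n))
ybar = samples.mean(axis=1)

crb = sigma_y**2 / n                      # Cramér-Rao bound [I_n(theta)]^{-1}
print("Var(ybar) ~", ybar.var())          # empirical variance of the sample mean
print("CRB       =", crb)                 # the two values essentially coincide
```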

1.3 Maximum Likelihood Estimator


In general, for a given parametric estimation problem, an efficient estimator
may not exist. In Example 1.6, it has been shown that the Cramér-Rao
bound allows one to check if an estimator is efficient. However, it remains
unclear how to find suitable candidates for efficient estimators and, in the
case that such candidates turn out to be not efficient, whether it is possible
to conclude that for the problem at hand there are no efficient estimators.
An answer to these questions is provided by the class of Maximum Likelihood
estimators.

Definition 1.10. Let y be a vector of observations with pdf fyθ (y), depend-
ing on the unknown parameter θ ∈ Θ. The likelihood function is defined
as
L(θ|y) = fyθ (y) .

It is worth remarking that, once the realization y of the random variable y has been observed (i.e., after the data have been collected), the likelihood
function depends only on the unknown parameter θ (indeed, we refer to
L(θ|y) as the likelihood of θ “given” y).
A meaningful way to estimate θ is to choose the value that maximizes the
probability of the observed data. In fact, by exploiting the meaning of the
probability density function, maximizing fyθ (y) with respect to θ corresponds to choosing θ in such a way that the measurement y has the highest possible
probability of having been observed, among all feasible scenarios θ ∈ Θ.

Definition 1.11. The Maximum Likelihood (ML) estimator of the unknown parameter θ is given by

$$T_{ML}(y) = \arg\max_{\theta \in \Theta}\, L(\theta|y).$$

In several problems, in order to ease the computation, it may be convenient to maximize the so-called log-likelihood function:

ln L(θ|y).

Being the natural logarithm a monotonically increasing function, L(θ|y) and ln L(θ|y) achieve their maxima at the same values.

Remark 1.1. Assuming that the pdf fyθ (y) is a differentiable function of θ = (θ_1 , . . . , θ_p ) ∈ Θ ⊆ Rp , with Θ an open set, if θ̂ is a maximum of L(θ|y), it has to be a solution of the equations

$$\left.\frac{\partial L(\theta|y)}{\partial \theta_i}\right|_{\theta=\hat{\theta}} = 0, \quad i = 1, \ldots, p \qquad (1.13)$$

or equivalently of

$$\left.\frac{\partial \ln L(\theta|y)}{\partial \theta_i}\right|_{\theta=\hat{\theta}} = 0, \quad i = 1, \ldots, p. \qquad (1.14)$$

It is worth observing that in many problems, even for a scalar parameter (p = 1), equation (1.13) may admit more than one solution. It may also happen that the likelihood function is not differentiable everywhere in Θ, or that Θ is not an open set, in which case the maximum can be achieved on the boundary of Θ. For all these reasons, the computation of the maximum likelihood estimator requires studying the function L(θ|y) over the entire domain Θ (see Exercise 1.5). Clearly, this may be a formidable task for high-dimensional parameter vectors.
Example 1.7. Let y 1 , . . . , y n be independent Gaussian random variables, with unknown mean m_y and known variance σ_y². Let us compute the ML estimator of the mean m_y.
Being the measurements independent, the likelihood is given by

$$L(\theta|y) = f_y^\theta(y) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_y}\, e^{-\frac{(y_i - m)^2}{2\sigma_y^2}}.$$

In this case, it is convenient to maximize the log-likelihood, which takes on the form

$$\ln L(\theta|y) = \sum_{i=1}^{n}\left(\ln\frac{1}{\sqrt{2\pi}\,\sigma_y} - \frac{(y_i - m)^2}{2\sigma_y^2}\right) = n \ln\frac{1}{\sqrt{2\pi}\,\sigma_y} - \sum_{i=1}^{n}\frac{(y_i - m)^2}{2\sigma_y^2}.$$

By imposing the condition (1.14), one gets

$$\frac{\partial \ln L(\theta|y)}{\partial \theta} = \frac{\partial}{\partial m}\left(n \ln\frac{1}{\sqrt{2\pi}\,\sigma_y} - \sum_{i=1}^{n}\frac{(y_i - m)^2}{2\sigma_y^2}\right)\bigg|_{m=\hat{m}_{ML}} = 0,$$

from which

$$\sum_{i=1}^{n}\frac{y_i - \hat{m}_{ML}}{\sigma_y^2} = 0,$$

and hence

$$\hat{m}_{ML} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$

Therefore, in this case the ML estimator coincides with the sample mean.
Since the observations are i.i.d. Gaussian variables, this estimator is also ef-
ficient (see Example 1.6). △

The result in Example 1.7 is not restricted to the specific setting or pdf
considered. The following general theorem illustrates the importance of max-
imum likelihood estimators, in the context of parametric estimation.

Theorem 1.4. Under the same assumptions for which the Cramér-Rao bound
holds, if there exists an efficient estimator T ∗ (·), then T ∗ (·) is a maximum
likelihood estimator.

Therefore, if we are looking for an efficient estimator, the only candidates are maximum likelihood estimators.

Example 1.8. Let y 1 , . . . , y n be independent Gaussian random variables, with mean m_y and variance σ_y², both unknown. Let us compute the Maximum Likelihood estimator of the mean and the variance.
Similarly to what was observed in Example 1.7, the log-likelihood turns out to be

$$\ln L(\theta|y) = n \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \sum_{i=1}^{n}\frac{(y_i - m)^2}{2\sigma^2}.$$

The unknown parameter vector to be estimated is θ = (m, σ²)^T, for which condition (1.14) becomes

$$\frac{\partial \ln L(\theta|y)}{\partial \theta_1} = \frac{\partial}{\partial m}\left(n \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \sum_{i=1}^{n}\frac{(y_i - m)^2}{2\sigma^2}\right)\bigg|_{(m=\hat{m}_{ML},\,\sigma^2=\hat{\sigma}^2_{ML})} = 0,$$

$$\frac{\partial \ln L(\theta|y)}{\partial \theta_2} = \frac{\partial}{\partial \sigma^2}\left(n \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \sum_{i=1}^{n}\frac{(y_i - m)^2}{2\sigma^2}\right)\bigg|_{(m=\hat{m}_{ML},\,\sigma^2=\hat{\sigma}^2_{ML})} = 0.$$

By differentiating with respect to m and σ², one gets

$$\sum_{i=1}^{n}\frac{y_i - \hat{m}_{ML}}{\hat{\sigma}^2_{ML}} = 0$$

$$-\frac{n}{2\hat{\sigma}^2_{ML}} + \frac{1}{2\hat{\sigma}^4_{ML}}\sum_{i=1}^{n}(y_i - \hat{m}_{ML})^2 = 0,$$

from which

$$\hat{m}_{ML} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

$$\hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{m}_{ML})^2.$$

Although E_θ[m̂_ML] = m_y (see Example 1.1), one has E_θ[σ̂²_ML] = (n−1)/n · σ_y² (see Example 1.2). Therefore, in this case, the Maximum Likelihood estimator is biased and hence it is not efficient. Due to Theorem 1.4, we can conclude that there does not exist any efficient estimator for the parameter θ = (m, σ²)^T. △
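
As a sanity check, the log-likelihood can also be maximized numerically and the result compared with the closed-form expressions derived above. The sketch below uses a generic optimizer; the data and the reparameterization through ln σ² (used only to keep the variance positive) are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
y = rng.normal(3.0, 2.0, size=200)            # data with "unknown" mean and variance

def neg_log_lik(theta, y):
    m, log_s2 = theta
    s2 = np.exp(log_s2)
    return 0.5 * len(y) * np.log(2 * np.pi * s2) + np.sum((y - m) ** 2) / (2 * s2)

res = minimize(neg_log_lik, x0=[0.0, 0.0], args=(y,))
m_num, s2_num = res.x[0], np.exp(res.x[1])

m_ml = y.mean()                               # closed-form ML estimates derived above
s2_ml = np.mean((y - m_ml) ** 2)
print(m_num, m_ml)                            # numerical and analytical estimates agree
print(s2_num, s2_ml)
```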

The previous example shows that Maximum Likelihood estimators can be biased. However, besides the motivations provided by Theorem 1.4, there
exist other reasons that make such estimators attractive.

Theorem 1.5. If the random variables y 1 , . . . , y n are i.i.d., then (under suitable technical assumptions)

$$\lim_{n\to+\infty} \sqrt{I_n(\theta)}\, (T_{ML}(y) - \theta)$$

is a random variable with standard normal distribution N(0, 1).

Theorem 1.5 states that the maximum likelihood estimator is:

• asymptotically unbiased;

• consistent;

• asymptotically efficient;

• asymptotically normal.

1.4 Nonlinear estimation with additive noise


A popular class of estimation problems is the one in which the aim is to esti-
mate a parameter θ, by using n measurements y = (y 1 , . . . , y n )T corrupted
by additive noise. Formally, let

h : Θ ⊆ Rp → Rn

be a deterministic function of θ. The aim is to estimate θ by using the observations
y = h(θ) + ε
where ε ∈ Rn represents the measurement noise, modeled as a vector of random variables with pdf fε (ε).
Under these assumptions, the likelihood function is given by

L(θ|y) = fyθ (y) = fε (y − h(θ)) .

In the case in which the measurement noise ε is distributed according to the Gaussian pdf

$$f_\varepsilon(\varepsilon) = \frac{1}{(2\pi)^{n/2}(\det \Sigma_\varepsilon)^{1/2}}\, e^{-\frac{1}{2}\varepsilon^T \Sigma_\varepsilon^{-1} \varepsilon}$$

with zero mean and known covariance matrix Σε , the log-likelihood function takes on the form

$$\ln L(\theta|y) = K - \frac{1}{2}(y - h(\theta))^T \Sigma_\varepsilon^{-1}(y - h(\theta)),$$
where K is a constant that does not depend on θ. The computation of
the maximum likelihood estimator boils down to the following optimization
problem

$$\hat{\theta}_{ML} = \arg\max_{\theta}\, \ln L(\theta|y) = \arg\min_{\theta}\, (y - h(\theta))^T \Sigma_\varepsilon^{-1}(y - h(\theta)). \qquad (1.15)$$

Being h(·), in general, a nonlinear function of θ, the solution of (1.15) can be computed by resorting to numerical methods. Clearly, the computational complexity depends not only on the number p of parameters to be estimated and on the size n of the data set, but also on the structure of h(·). For example, if h(·) is convex there are efficient algorithms that allow one to solve problems with very large n and p, while if h(·) is nonconvex the problem may become intractable even for relatively small values of p.
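
As an illustration of this numerical approach, the sketch below minimizes the weighted least-squares cost in (1.15) for a simple scalar example; the exponential model h_i(θ) = e^{−θ t_i}, the time grid and the noise covariance are illustrative assumptions, not taken from the text:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

t = np.linspace(0.0, 4.0, 30)                          # measurement "times"
theta_true = 0.7
Sigma = np.diag(0.05 + 0.1 * rng.random(t.size))       # known noise covariance
eps = rng.multivariate_normal(np.zeros(t.size), Sigma)
y = np.exp(-theta_true * t) + eps                      # y = h(theta) + eps

Sigma_inv = np.linalg.inv(Sigma)

def cost(theta):
    # weighted least-squares cost of (1.15)
    r = y - np.exp(-theta[0] * t)
    return r @ Sigma_inv @ r

theta_ml = minimize(cost, x0=[0.1]).x[0]
print("ML estimate of theta:", theta_ml)               # close to 0.7
```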

1.5 Linear estimation problems


An interesting scenario is the one in which the relationship between the un-
known parameters and the data is linear, i.e. h(θ) = U θ, where U ∈ Rn×p .

In this case, the measurement equation takes on the form

y = Uθ + ε. (1.16)

In the following, we will assume that rank(U) = p, which means that the
number of linearly independent measurements is not smaller than the number
of parameters to be estimated (otherwise, the problem is ill posed).
We now introduce two popular estimators that can be used to estimate
θ in the setting (1.16). We will discuss their properties, depending on the
assumptions we make on the measurement noise ε. Let us start with the
Least Squares estimator.

Definition 1.12. Let y be a vector of random variables related to θ according to (1.16). The estimator

TLS (y) = (U T U)−1 U T y (1.17)

is called Least Squares (LS) estimator of the parameter θ.

The name of this estimator comes from the fact that it minimizes the sum of the squared differences between the data realization y and the model Uθ, i.e.

$$\hat{\theta}_{LS} = \arg\min_{\theta}\, \|y - U\theta\|^2.$$

Indeed,

$$\|y - U\theta\|^2 = (y - U\theta)^T(y - U\theta) = y^T y + \theta^T U^T U \theta - 2 y^T U \theta.$$

By differentiating with respect to θ, one gets

$$\left.\frac{\partial \|y - U\theta\|^2}{\partial \theta}\right|_{\theta=\hat{\theta}_{LS}} = 2\hat{\theta}_{LS}^T U^T U - 2 y^T U = 0,$$

where the properties ∂(x^T A x)/∂x = 2x^T A (for A symmetric) and ∂(Ax)/∂x = A have been exploited. By solving with respect to θ̂_LS^T, one gets

$$\hat{\theta}_{LS}^T = y^T U (U^T U)^{-1}.$$

Finally, by transposing the above expression and taking into account that
the matrix (U T U) is symmetric, one obtains the equation (1.17).
It is worth stressing that the LS estimator does not require any a priori
information about the noise ε to be computed. As we will see in the sequel,
however, the properties of ε will influence those of the LS estimator.
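
A minimal numerical sketch of the LS estimator (1.17) follows; the regressor matrix and the noise level are arbitrary illustrative choices. Solving the normal equations and calling a standard least-squares routine give the same result:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 2
theta_true = np.array([1.5, -0.7])

U = rng.normal(size=(n, p))                            # known regressor matrix, rank p
y = U @ theta_true + rng.normal(0.0, 0.5, n)           # measurements y = U theta + eps

theta_ls = np.linalg.solve(U.T @ U, U.T @ y)           # closed form (1.17)
theta_lstsq, *_ = np.linalg.lstsq(U, y, rcond=None)    # equivalent, numerically safer

print(theta_ls, theta_lstsq)
```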

Definition 1.13. Let y be a vector of random variables related to θ according to (1.16). Let Σε be the covariance matrix of ε. The estimator

$$T_{GM}(y) = (U^T \Sigma_\varepsilon^{-1} U)^{-1} U^T \Sigma_\varepsilon^{-1} y \qquad (1.18)$$

is called Gauss-Markov (GM) estimator (or Weighted Least Squares estimator) of the parameter θ.

Similarly to what has been shown for the LS estimator, it is easy to verify
that the GM estimator minimizes the weighted sum of squared errors between
y and Uθ, i.e.
$$\hat{\theta}_{GM} = \arg\min_{\theta}\, (y - U\theta)^T \Sigma_\varepsilon^{-1}(y - U\theta).$$

Notice that the Gauss-Markov estimator requires the knowledge of the co-
variance matrix Σε of the measurement noise. By using this information, the
measurements are weighted with a matrix weight that is inversely proportional to their uncertainty.
Under the assumption that the noise has zero mean, E [ε] = 0, it is easy
to show that both the LS and the GM estimator are unbiased. For the LS
estimator one has
h i
Eθ θ̂LS = Eθ (U T U)−1 U T y = Eθ (U T U)−1 U T (Uθ + ε)
   

= Eθ θ + (U T U)−1 U T ε = θ.
 

For the GM estimator,


h i
Eθ θ̂GM = Eθ (U T Σ−1 −1 T −1 θ
   T −1 −1 T −1 
ε U) U Σε y = E (U Σε U) U Σε (Uθ + ε)

= Eθ θ + (U T Σ−1 −1 T −1
 
ε U) U Σε ε = θ.
1.5. LINEAR ESTIMATION PROBLEMS 23

If the noise vector ε has non-zero mean, mε = E [ε], but the mean mε
is known, the LS and GM estimators can be easily amended to remove the
bias. In fact, if we define the new vector of random variables ε̃ = ε − mε ,
the equation (1.16) can be rewritten as

y − mε = Uθ + ε̃, (1.19)

and being clearly E [ε̃] = 0, E [ε̃ε̃′ ] = Σε , all the treatment can be repeated
by replacing y with y − mε . Therefore, the expressions of the LS and GM
estimators remain those in (1.17) and (1.18), with y replaced by y − mε .
The case in which the mean of ε is unknown is more intriguing. In some
cases, one may try to estimate it from the data, along with the parameter θ.
Assume for example that E [εi ] = m̄ε , ∀i. This means that E [ε] = m̄ε · 1,
where 1 = [1 1 ... 1]T . Now, one can define the extended parameter
vector θ̄ = [θ′ m̄ε ]T ∈ Rp+1 , and use the same decomposition as in (1.19) to
obtain
y = [U 1]θ̄ + ε̃
Then, one can apply the LS or GM estimator, by replacing U with [U 1], to
obtain a simultaneous estimate of the p parameters θ and of the scalar mean
m̄ε .
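
A short numerical sketch of this idea is reported below: the Gauss-Markov estimator (1.18) is applied to the extended regressor [U 1], so that the unknown common noise mean is estimated together with θ (all numerical values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 2
theta_true = np.array([2.0, -1.0])
mean_eps = 0.8                                   # unknown common noise mean

U = rng.normal(size=(n, p))
sig2 = 0.2 + rng.random(n)                       # known, heterogeneous noise variances
Sigma_inv = np.diag(1.0 / sig2)
y = U @ theta_true + mean_eps + rng.normal(0.0, np.sqrt(sig2))

Ubar = np.hstack([U, np.ones((n, 1))])           # extended regressor [U 1]
# Gauss-Markov estimate of the extended parameter (theta, mean_eps)
theta_bar = np.linalg.solve(Ubar.T @ Sigma_inv @ Ubar, Ubar.T @ Sigma_inv @ y)

print("theta estimate:     ", theta_bar[:p])     # close to (2.0, -1.0)
print("noise mean estimate:", theta_bar[p])      # close to 0.8
```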

An important property of the Gauss-Markov estimator is that of being the minimum variance estimator among all linear unbiased estimators, i.e.,
the BLUE (see Definition 1.8). In fact, the following result holds.

Theorem 1.6. Let y be a vector of random variables related to the parameter θ according to (1.16). Let Σε be the covariance matrix of ε. Then, the BLUE estimator of θ is the Gauss-Markov estimator (1.18). The corresponding covariance of the estimation error is given by

$$E\left[(\hat{\theta}_{GM} - \theta)(\hat{\theta}_{GM} - \theta)^T\right] = (U^T \Sigma_\varepsilon^{-1} U)^{-1}. \qquad (1.20)$$

In the special case Σε = σε2 In (with In identity matrix of dimension n), i.e.,
when the variables ε are uncorrelated and have the same variance σε2 , the
BLUE estimator is the Least Squares estimator (1.17).

Proof
Since we consider the class of linear unbiased estimators, we have T (y) = Ay,
and E [Ay] = AE [y] = AUθ. Therefore, one must impose the constraint
AU = Ip to guarantee that the estimator is unbiased.
In order to find the minimum variance estimator, it is necessary to minimize (in the matrix sense) the covariance of the estimation error

$$E\left[(Ay - \theta)(Ay - \theta)^T\right] = E\left[(AU\theta + A\varepsilon - \theta)(AU\theta + A\varepsilon - \theta)^T\right] = E\left[A\varepsilon\varepsilon^T A^T\right] = A\Sigma_\varepsilon A^T$$

where we have enforced the constraint AU = I_p in the second equality. Then, the BLUE estimator is obtained by solving the constrained optimization problem

$$A_{BLUE} = \arg\min_{A}\ A\Sigma_\varepsilon A^T \quad \text{s.t.} \quad AU = I_p \qquad (1.21)$$

and then setting T (y) = A_BLUE y.
Being the constraint AU = Ip linear in the matrix A, it is possible to param-
eterize all the admissible solutions A as

$$A = (U^T \Sigma_\varepsilon^{-1} U)^{-1} U^T \Sigma_\varepsilon^{-1} + M \qquad (1.22)$$

with M ∈ R^{p×n} such that MU = 0. It is easy to check that all matrices A defined by (1.22) satisfy the constraint AU = I_p. It is therefore sufficient to
find the one that minimizes the quantity AΣε AT . By substituting A with
the expression (1.22), one gets

$$\begin{aligned}
A\Sigma_\varepsilon A^T &= (U^T \Sigma_\varepsilon^{-1} U)^{-1} U^T \Sigma_\varepsilon^{-1} \Sigma_\varepsilon \Sigma_\varepsilon^{-1} U (U^T \Sigma_\varepsilon^{-1} U)^{-1} + (U^T \Sigma_\varepsilon^{-1} U)^{-1} U^T \Sigma_\varepsilon^{-1} \Sigma_\varepsilon M^T \\
&\quad + M \Sigma_\varepsilon \Sigma_\varepsilon^{-1} U (U^T \Sigma_\varepsilon^{-1} U)^{-1} + M \Sigma_\varepsilon M^T \\
&= (U^T \Sigma_\varepsilon^{-1} U)^{-1} + M \Sigma_\varepsilon M^T \\
&\geq (U^T \Sigma_\varepsilon^{-1} U)^{-1}
\end{aligned}$$

where the second equality is due to MU = 0, while the final inequality exploits the fact that Σε is positive definite and hence MΣεM^T is positive semidefinite. Since the expression (U^T Σε^{-1} U)^{-1} does not depend on M, we can conclude that the solution of problem (1.21) is obtained by setting M = 0 in (1.22), which amounts to choosing A_BLUE = (U^T Σε^{-1} U)^{-1} U^T Σε^{-1}. Therefore, the BLUE estimator coincides with the Gauss-Markov one. The expression of the covariance of the estimation error (1.20) is obtained from AΣεA^T when M = 0.
Finally, if Σε = σε² I_n one has A_BLUE = (U^T U)^{-1} U^T (whatever the value of σε²) and hence the GM estimator boils down to the LS one. □

In Section 1.4 it has been observed that, if the measurement noise ε is Gaussian, the Maximum Likelihood estimator can be computed by solving the optimization problem (1.15). If the observations depend linearly on θ, as in (1.16), such a problem becomes

$$\hat{\theta}_{ML} = \arg\min_{\theta}\, (y - U\theta)^T \Sigma_\varepsilon^{-1}(y - U\theta). \qquad (1.23)$$

As it has been noticed after Definition 1.13, the solution of (1.23) is actu-
ally the Gauss-Markov estimator. Therefore, we can state that: in the case
of linear observations corrupted by additive Gaussian noise, the Maximum
Likelihood estimator coincides with the Gauss-Markov estimator. Moreover,
it is possible to show that in this setting

$$E_\theta\left[\left(\frac{\partial \ln f_y^\theta(y)}{\partial \theta}\right)\left(\frac{\partial \ln f_y^\theta(y)}{\partial \theta}\right)^T\right] = U^T \Sigma_\varepsilon^{-1} U$$

and hence the Gauss-Markov estimator is efficient (and UMVUE).


Finally, if the measurements are also independent and have the same
variance σε2 , i.e., being the noise Gaussian,

ε ∼ N(0, σε2 In ),

the, the GM estimator boils down to the LS one. Therefore: in the case
of linear observations, corrupted by independent and identically distributed

Gaussian noise, the Maximum Likelihood estimator coincides with the Least
Squares estimator.

The following table summarizes the properties of the GM and LS estimators, depending on the assumptions made on the noise ε.

Assumptions on ε                                 | Properties of GM                                       | Properties of LS
none                                             | arg min_θ (y − Uθ)^T Σε^{-1} (y − Uθ) (with known Σε)  | arg min_θ ‖y − Uθ‖²
E[ε] known                                       | unbiased                                               | unbiased
E[ε] = mε, E[(ε − mε)(ε − mε)^T] = Σε            | BLUE                                                   | BLUE if Σε = σε² I_n
ε ∼ N(mε, Σε)                                    | ML estimator, efficient, UMVUE                         | ML estimator if Σε = σε² I_n

Table 1.1: Properties of GM and LS estimators.

Example 1.9. On an unknown parameter θ, we collect n measurements

yi = θ + vi , i = 1, . . . , n

where the vi are realizations of n random variables v i , independent, with zero mean and variance σi2 , i = 1, . . . , n.
It is immediate to verify that the measurements yi are realizations of random
variables y i , with mean θ and variance σi2 . Therefore, the estimate of θ can
be cast in terms of the estimate of the mean of n random variables (see
Examples 1.1 and 1.5, and Exercise 1.1).

1.6 Bayesian Estimation


In the Bayesian estimation setting, the quantity to be estimated is not deter-
ministic, but it is modeled as a random variable. In particular, the objective
is to estimate the random variable x ∈ Rm , by using observations of the ran-
dom variable y ∈ Rn (we will denote the unknown variable to be estimated

by x instead of θ, to distinguish between the parametric and the Bayesian framework). Clearly, the complete knowledge of the stochastic relationship
between x and y is given by the joint pdf fx,y (x, y).
As in the parametric setting, the aim is to find an estimator x̂ = T (y),
where
T (·) : Rn → Rm

Definition 1.14. In the Bayesian setting, an estimator T (·) is unbiased if

E [T (y)] = E [x] .

Similarly to what has been done in parametric estimation problems, it is necessary to introduce a criterion to evaluate the quality of an estimator.

Definition 1.15. We define the Bayes risk function as the quantity

$$J_r = E\left[d(x, T(y))\right] = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} d(x, T(y))\, f_{x,y}(x, y)\, dx\, dy$$

where d(x, T (y)) denotes the distance between x and its estimate T (y),
according to a suitable metric.

Since the distance d(x, T (y)) is a random variable, the aim is to minimize
its expected value, i.e. to find

$$T^*(\cdot) = \arg\min_{T(\cdot)}\, J_r.$$

1.6.1 Minimum Mean Square Error Estimator


A standard choice for the distance d(·) is the quadratic error

d(x, T (y)) = ‖x − T (y)‖².

Definition 1.16. The minimum Mean Square Error (MSE) estimator is defined as x̂M SE = T ∗ (·), where

$$T^*(\cdot) = \arg\min_{T(\cdot)}\, E\left[\|x - T(y)\|^2\right]. \qquad (1.24)$$
Notice that in (1.24), the expected value is computed with respect to both random variables x and y, and hence it is necessary to know the joint
pdf fx,y (x, y).
The following fundamental result provides the solution to the minimum
MSE estimation problem.

Theorem 1.7. Let x be a random variable and y a vector of observations. The minimum MSE estimator x̂M SE of x based on y is equal to the condi-
tional expected value of x given y:

x̂M SE = E [x|y] .

The previous result states that the estimator minimizing the MSE is the
a posteriori expected value of x, given the observation of y, i.e.
$$\hat{x}_{MSE} = \int_{-\infty}^{+\infty} x\, f_{x|y}(x|y)\, dx, \qquad (1.25)$$

which is indeed a function of y.


Since it is easy to prove that

E [E [x|y]] = E [x] ,

one can conclude that the minimum MSE estimator is always unbiased.
The minimum MSE estimator has other attractive properties. In particular, if we consider the matrix

$$Q(x, T(y)) = E\left[(x - T(y))(x - T(y))^T\right],$$

it can be shown that:

• x̂M SE is the estimator minimizing (in the matrix sense) Q(x, T (y)), i.e.

Q(x, x̂M SE ) ≤ Q(x, T (y)), ∀ T (y);

• x̂M SE minimizes every monotonically increasing scalar function of Q(x, T (y)), like for example the trace of Q, corresponding to the MSE, E[‖x − T (y)‖²].

The computation of the minimum MSE estimator may be difficult, or even intractable, in practical problems, because it requires the knowledge of
the joint pdf fx,y (x, y) and the computation of the integral (1.25).

Example 1.10. Consider two random variables x and y, whose joint pdf is given by

$$f_{x,y}(x, y) = \begin{cases} -\frac{3}{2}x^2 + 2xy & \text{if } 0 \leq x \leq 1,\ 1 \leq y \leq 2 \\ 0 & \text{elsewhere} \end{cases}$$

Let us find the minimum MSE estimator of x based on one observation of y. From Theorem 1.7, we know that

$$\hat{x}_{MSE} = \int_{-\infty}^{+\infty} x\, f_{x|y}(x|y)\, dx.$$

First, we need to compute

$$f_{x|y}(x|y) = \frac{f_{x,y}(x, y)}{f_y(y)}.$$

The marginal pdf of y can be calculated from the joint pdf as

$$f_y(y) = \int_0^1 \left(-\frac{3}{2}x^2 + 2xy\right) dx = \left[-\frac{x^3}{2} + yx^2\right]_{x=0}^{x=1} = y - \frac{1}{2}.$$
Hence, the conditional pdf is given by

$$f_{x|y}(x|y) = \begin{cases} \dfrac{-\frac{3}{2}x^2 + 2xy}{y - \frac{1}{2}} & \text{if } 0 \leq x \leq 1,\ 1 \leq y \leq 2 \\ 0 & \text{elsewhere} \end{cases}$$

Now, it is possible to compute the minimum MSE estimator

$$\hat{x}_{MSE} = \int_0^1 x\, \frac{-\frac{3}{2}x^2 + 2xy}{y - \frac{1}{2}}\, dx = \frac{1}{y - \frac{1}{2}}\left[-\frac{3}{8}x^4 + \frac{2}{3}x^3 y\right]_{x=0}^{x=1} = \frac{\frac{2}{3}y - \frac{3}{8}}{y - \frac{1}{2}}.$$
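
The closed-form expression can be verified by computing the conditional expectation through numerical integration. The following sketch uses SciPy's quad for the one-dimensional integrals; the test values of y are arbitrary:

```python
import numpy as np
from scipy.integrate import quad

def f_joint(x, y):
    # joint pdf of Example 1.10 on 0 <= x <= 1, 1 <= y <= 2
    return -1.5 * x**2 + 2 * x * y

for y in (1.2, 1.5, 1.8):
    f_y, _ = quad(lambda x: f_joint(x, y), 0.0, 1.0)        # marginal pdf of y
    num, _ = quad(lambda x: x * f_joint(x, y), 0.0, 1.0)    # numerator of E[x|y]
    x_mse_numeric = num / f_y
    x_mse_formula = (2 * y / 3 - 3 / 8) / (y - 0.5)         # closed form derived above
    print(y, x_mse_numeric, x_mse_formula)                  # the two values match
```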

1.6.2 Linear Mean Square Error Estimator


We now restrict our attention to the class of affine linear estimators

T (y) = Ay + b (1.26)

in which the matrix A ∈ Rm×n and the vector b ∈ Rm are the coefficients of
the estimator to be determined. Among all estimators of the form (1.26), we
aim at finding the one minimizing the MSE.

Definition 1.17. The Linear Mean Square Error (LMSE) estimator is de-
fined as x̂LM SE = A∗ y + b∗ , where

$$A^*, b^* = \arg\min_{A,b}\, E\left[\|x - Ay - b\|^2\right]. \qquad (1.27)$$

Theorem 1.8. Let x be a random variable and y a vector of observations, such that

$$E[x] = m_x, \qquad E[y] = m_y$$

$$E\left[\begin{pmatrix} x - m_x \\ y - m_y \end{pmatrix}\begin{pmatrix} x - m_x \\ y - m_y \end{pmatrix}^T\right] = \begin{pmatrix} R_x & R_{xy} \\ R_{xy}^T & R_y \end{pmatrix}.$$

Then, the solution of problem (1.27) is given by

$$A^* = R_{xy} R_y^{-1}, \qquad b^* = m_x - R_{xy} R_y^{-1} m_y,$$

and hence, the LMSE estimator x̂LMSE of x is given by

$$\hat{x}_{LMSE} = m_x + R_{xy} R_y^{-1}(y - m_y).$$


Proof
First, observe that the cost to be minimized is

$$E\left[\|x - Ay - b\|^2\right] = \mathrm{tr}\left\{E\left[(x - Ay - b)(x - Ay - b)^T\right]\right\}.$$

Since the trace is a monotonically increasing function, solving problem (1.27) is equivalent to finding A*, b* such that

$$E\left[(x - A^*y - b^*)(x - A^*y - b^*)^T\right] \leq E\left[(x - Ay - b)(x - Ay - b)^T\right] \quad \forall A, b. \qquad (1.28)$$
Therefore, by denoting the estimation error as x̃ = x − Ay − b, one gets

$$\begin{aligned}
E\left[\tilde{x}\tilde{x}^T\right] &= E\left[(x - m_x - A(y - m_y) + m_x - Am_y - b)(x - m_x - A(y - m_y) + m_x - Am_y - b)^T\right] \\
&= R_x + A R_y A^T - R_{xy} A^T - A R_{yx} + (m_x - Am_y - b)(m_x - Am_y - b)^T \\
&= R_x + A R_y A^T - R_{xy} A^T - A R_{xy}^T + R_{xy} R_y^{-1} R_{xy}^T - R_{xy} R_y^{-1} R_{xy}^T + (m_x - Am_y - b)(m_x - Am_y - b)^T \\
&= R_x - R_{xy} R_y^{-1} R_{xy}^T + \left(R_{xy} R_y^{-1} - A\right) R_y \left(R_{xy} R_y^{-1} - A\right)^T + (m_x - Am_y - b)(m_x - Am_y - b)^T. \qquad (1.29)
\end{aligned}$$

Observe that the last two terms of the previous expression are positive
semidefinite matrices. Hence, the solution of problem (1.28) is obtained by
choosing A∗ , b∗ such that the last two terms are equal to zero, i.e.

A∗ = Rxy Ry−1 ;
b∗ = mx − Amy = mx − Rxy Ry−1 my .

This concludes the proof. □

The LMSE estimator is unbiased because the expected value of the esti-
mation error is equal to zero. In fact,

$$E[\tilde{x}] = E[x - \hat{x}_{LMSE}] = m_x - E\left[m_x + R_{xy} R_y^{-1}(y - m_y)\right] = m_x - m_x - R_{xy} R_y^{-1} E[y - m_y] = 0.$$

By setting A = A* and b = b* in the last expression in (1.29), we can compute the covariance of the estimation error of the LMSE estimator, which is equal to

$$E\left[\tilde{x}\tilde{x}^T\right] = R_x - R_{xy} R_y^{-1} R_{xy}^T.$$
It is worth noting that, by interpreting R_x as the a priori uncertainty on x, R_x − R_xy R_y^{-1} R_xy^T represents the new uncertainty on x after having observed the measurement y. Since the matrix R_xy R_y^{-1} R_xy^T is always positive

semidefinite, the effect of the observations is to reduce the uncertainty on x. Moreover, such a reduction depends on the size of Rxy , i.e., on the cor-
relation between the measurement y and the unknown x (notice that there
is no uncertainty reduction when Rxy = 0, as expected).
It is worth stressing that in order to compute the LMSE estimator it is
not necessary to know the joint pdf fx,y (x, y), but only the first and second
order statistics mx , my , Rx , Ry , Rxy .
An interesting property of the LMSE estimator is that the estimation error x̃ is uncorrelated with the observations y. In fact, one has

$$E\left[\tilde{x}y^T\right] = E\left[\left(x - m_x - R_{xy}R_y^{-1}(y - m_y)\right)y^T\right] = R_{xy} - R_{xy}R_y^{-1}R_y = 0. \qquad (1.30)$$

This result is often known as the orthogonality principle. Conversely, it is possible to show that if a linear estimator satisfies the orthogonality condition E[x̃ y^T] = 0, then it is the LMSE estimator.

In the case in which the random variables x, y are jointly Gaussian, with
mean and covariance matrix defined as in Theorem 1.8, we recall that the
conditional expected value of x given the observation of y is given by

E [x|y] = mx + Rxy Ry−1 (y − my ).

Therefore, we can conclude that: if x, y are Gaussian variables, the LMSE estimator coincides with the minimum MSE estimator, i.e., x̂M SE = x̂LM SE .
In other words, in the Gaussian case the minimum MSE estimator is a linear
function of the observations y.

Example 1.11. Let y 1 , y 2 be two noisy observations of the scalar random variable x, having mean mx and variance σx2 :

y 1 = x + ε1 ,
y 2 = x + ε2 .

Let ε1 , ε2 be two independent random variables, with zero mean and variance
σ12 , σ22 , respectively. Under the assumption that x and εi , i = 1, 2, are
independent, we aim at computing the LMSE estimator of x.
Define the vectors y = (y 1 y 2 )T and ε = (ε1 ε2 )T , and rewrite the measurement equations in the form

y = 1 x + ε,

where 1 = (1 1)T .
First, let us compute the mean of y

E [y] = E [1 x + ε] = 1 mx

In order to find the estimate x̂LM SE , we have to compute the covariance matrices Rxy and Ry . We get

$$R_{xy} = E\left[(x - m_x)\left(1(x - m_x) + \varepsilon\right)^T\right] = \sigma_x^2\, 1^T,$$

because x and ε are uncorrelated. Moreover,

$$R_y = E\left[\left(1(x - m_x) + \varepsilon\right)\left(1(x - m_x) + \varepsilon\right)^T\right] = 1\,\sigma_x^2\, 1^T + R_\varepsilon,$$

where

$$R_\varepsilon = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}$$

is the covariance matrix of ε. Finally, let us compute the inverse of the measurement covariance matrix

$$R_y^{-1} = \left[\sigma_x^2\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} + \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}\right]^{-1}
= \begin{pmatrix} \sigma_x^2 + \sigma_1^2 & \sigma_x^2 \\ \sigma_x^2 & \sigma_x^2 + \sigma_2^2 \end{pmatrix}^{-1}
= \frac{1}{\sigma_x^2(\sigma_1^2 + \sigma_2^2) + \sigma_1^2\sigma_2^2}\begin{pmatrix} \sigma_x^2 + \sigma_2^2 & -\sigma_x^2 \\ -\sigma_x^2 & \sigma_x^2 + \sigma_1^2 \end{pmatrix}.$$
Hence, the LMSE estimator is given by

$$\begin{aligned}
\hat{x}_{LMSE} &= m_x + R_{xy} R_y^{-1}(y - 1\, m_x) = m_x + \sigma_x^2\, 1^T R_y^{-1}(y - 1\, m_x) \\
&= m_x + \frac{\sigma_x^2}{\sigma_x^2(\sigma_1^2 + \sigma_2^2) + \sigma_1^2\sigma_2^2}\,(1\ \ 1)\begin{pmatrix} \sigma_x^2 + \sigma_2^2 & -\sigma_x^2 \\ -\sigma_x^2 & \sigma_x^2 + \sigma_1^2 \end{pmatrix}\begin{pmatrix} y_1 - m_x \\ y_2 - m_x \end{pmatrix} \\
&= m_x + \frac{1}{\sigma_1^2 + \sigma_2^2 + \frac{\sigma_1^2\sigma_2^2}{\sigma_x^2}}\,(\sigma_2^2\ \ \sigma_1^2)\begin{pmatrix} y_1 - m_x \\ y_2 - m_x \end{pmatrix} \\
&= m_x + \frac{\sigma_2^2 y_1 + \sigma_1^2 y_2 - m_x(\sigma_1^2 + \sigma_2^2)}{\sigma_1^2 + \sigma_2^2 + \frac{\sigma_1^2\sigma_2^2}{\sigma_x^2}} \\
&= \frac{\frac{m_x\sigma_1^2\sigma_2^2}{\sigma_x^2} + \sigma_2^2 y_1 + \sigma_1^2 y_2}{\sigma_1^2 + \sigma_2^2 + \frac{\sigma_1^2\sigma_2^2}{\sigma_x^2}}
= \frac{\frac{m_x}{\sigma_x^2} + \frac{y_1}{\sigma_1^2} + \frac{y_2}{\sigma_2^2}}{\frac{1}{\sigma_x^2} + \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}}.
\end{aligned}$$

Notice that each measurement is weighted with a weight that is inversely proportional to the variance of the noise affecting the measurement. Moreover, the a priori information on x (i.e., its mean m_x and variance σ_x²) is treated as an additional observation of x. In particular, it is interesting to observe that if σ_x² → +∞ (i.e., the a priori information on x is completely unreliable), the estimate x̂_LMSE takes on the same form as the Gauss-Markov estimate of the mean m_x (see Example 1.5 and Exercise 1.1). This highlights the relationship between Bayesian and parametric estimation. △
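
The computation of Example 1.11 can be reproduced numerically for illustrative values of the prior and noise variances, comparing the matrix form of the LMSE estimator with the precision-weighted closed form derived above:

```python
import numpy as np

m_x, s2_x = 0.0, 4.0             # prior mean and variance of x
s2_1, s2_2 = 1.0, 2.0            # noise variances of the two measurements
y1, y2 = 1.2, 0.7                # observed values (illustrative numbers)

ones = np.ones(2)
R_eps = np.diag([s2_1, s2_2])
R_y = s2_x * np.outer(ones, ones) + R_eps      # covariance matrix of y
R_xy = s2_x * ones                             # cross-covariance of x and y

x_lmse = m_x + R_xy @ np.linalg.solve(R_y, np.array([y1, y2]) - m_x)

# closed form: precision-weighted average of the prior and the measurements
x_closed = (m_x / s2_x + y1 / s2_1 + y2 / s2_2) / (1 / s2_x + 1 / s2_1 + 1 / s2_2)
print(x_lmse, x_closed)                        # the two values coincide
```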

1.7 Exercises

1.1. Verify that in the problem of Example 1.9, the LS and GM estimators
of θ coincide respectively with y in (1.2) and m̂BLU E in (1.10).

1.2. Let d1 , d2 be two i.i.d. random variables, with pdf given by

$$f(\delta) = \begin{cases} \theta e^{-\theta\delta} & \text{if } \delta \geq 0 \\ 0 & \text{if } \delta < 0 \end{cases}$$
Let δ1 , δ2 be the available observations of d1 , d2 . Find the Maximum Like-
lihood estimator of θ.

1.3. Let d1 , d2 be independent Gaussian random variables such that

$$E[d_1] = m, \quad E[d_2] = 3m, \quad E\left[(d_1 - m)^2\right] = 2, \quad E\left[(d_2 - 3m)^2\right] = 4.$$
Let δ1 , δ2 be the available observations of d1 , d2 . Find:

a) the minimum variance estimator of m among all linear unbiased estimators;

b) the variance of such an estimator;

c) the Maximum Likelihood estimator (is it different from the estimator in item a)?).

1.4. Two measurements are available on the unknown quantity x:


y 1 = x + d1
y 2 = 2x + d2
where d1 and d2 are independent disturbances modeled as random variables with pdf

$$f(\delta) = \begin{cases} \lambda e^{-\lambda\delta} & \text{if } \delta \geq 0 \\ 0 & \text{if } \delta < 0 \end{cases}$$
a) Find the Maximum Likelihood estimator of x.

b) Determine if the ML estimator is unbiased.

1.5. Let x and y be two random variables with joint pdf


$$f_{x,y}(x, y) = \begin{cases} \dfrac{1}{2\theta^3}(3x + y) & 0 \leq x \leq \theta,\ 0 \leq y \leq \theta \\ 0 & \text{elsewhere} \end{cases}$$
where θ is a real parameter.

a) Assume θ = 1 and suppose that an observation y of the random variable y is available. Compute the minimum MSE estimator x̂M SE of x, based on the observation y.

b) Assume θ is unknown and suppose that an observation y of the random variable y is available. Compute the ML estimator θ̂M L of the parameter θ, based on the measurement y. Establish if such an estimator is unbiased.

c) Assume θ is unknown and suppose that two observations x and y of the random variables x and y are available. Compute the ML estimator θ̂M L of the parameter θ, based on the measurements x and y.

1.6. Let θ ∈ [−2, 2] and consider the function

$$f^\theta(x) = \begin{cases} \theta x + 1 - \dfrac{\theta}{2} & \text{if } x \in [0, 1] \\ 0 & \text{elsewhere} \end{cases}$$

a) Show that for all θ ∈ [−2, 2], f θ is a probability density function.

b) Let y be a random variable with pdf f θ . Compute the mean and variance of y as functions of θ.

c) Compute the Maximum Likelihood estimator of θ, based on an observation y of the random variable y.

d) Let y 1 , . . ., y n be n random variables, each one distributed according to the pdf f θ , and consider the estimator

$$T(y_1, \ldots, y_n) = 12\left(\frac{1}{n}\sum_{k=1}^{n} y_k - \frac{1}{2}\right).$$

Show that T (·) is an unbiased estimator of θ.

e) Find the variance of the estimation error for the estimator T (·) defined
in item d), in the case n = 1. Compute the Fisher information I1 (θ)
and show that the inequality (1.11) holds.

1.7. Let a and b be two unknown quantities, for which we have three different
measurements:
y1 = a + v1
y2 = b + v2
y3 = a + b + v3
where v i , i = 1, 2, 3, are independent random variables with zero mean. Let E[v₁²] = E[v₃²] = 1 and E[v₂²] = 1/2. Find:

a) The LS estimator of a and b;

b) The GM estimator of a and b;


c) The variance of the estimation error E[(a − â)² + (b − b̂)²], for the estimators computed in items a) and b).

Compare the obtained estimates with those one would have if the observation
y 3 were not available. How does the variance of the estimation error change?

1.8. Consider two random variables x and y, whose joint pdf is

$$f_{x,y}(x, y) = \begin{cases} -\frac{3}{2}x^2 + 2xy & 0 \leq x \leq 1,\ 1 \leq y \leq 2 \\ 0 & \text{elsewhere} \end{cases}$$

Find the LMSE estimate x̂LM SE of x, based on one observation of y.


Plot the estimate x̂LM SE computed above and the minimum MSE estimate
x̂M SE derived in Example 1.10, as functions of y (the realization of y).
Compute the expected values of both estimates and compare them with the
a priori mean E [x].

1.9. Let x and y be two random variables with joint pdf

$$f_{x,y}(x, y) = \begin{cases} \dfrac{1}{12}(x + y)e^{-y} & 0 \leq x \leq 4,\ y \geq 0 \\ 0 & \text{elsewhere} \end{cases}$$

Assume that an observation y of y is available.

a) Find the estimators x̂M SE and x̂LM SE of x, and plot them as functions
of the observation y.

b) Compute the MSE of the estimators obtained in item a) [Hint: use MATLAB to compute the integrals].

1.10. Let X be an unknown quantity and assume the following measurement is available

$$y = \ln\left(\frac{1}{X}\right) + v$$

where v is a random variable, whose pdf is given by

$$f_v(v) = \begin{cases} e^{-v} & v \geq 0 \\ 0 & v < 0 \end{cases}$$

a) Find the Maximum Likelihood estimator of X. Establish if it is biased or not. Is it possible to find an unbiased estimator of X?

b) Assume that X is a random variable independent of v, whose a priori pdf is given by

$$f_X(x) = \begin{cases} 1 & 0 \leq x \leq 1 \\ 0 & \text{elsewhere} \end{cases}$$

Find the MSE and LMSE estimators of X.

c) Plot the estimates obtained in items a) and b) as functions of y.


Bibliography

Rohatgi, V. K. and A. K. Md. E. Saleh (2001). An Introduction to Probability and Statistics. 2nd ed. Wiley Interscience.
