Article

Entropy, Information Theory, Information Geometry and Bayesian Inference in Data, Signal and Image Processing and Inverse Problems †

by
Ali Mohammad-Djafari
Laboratoire des Signaux et Systèmes, UMR 8506 CNRS-SUPELEC-UNIV PARIS SUD, SUPELEC, Plateau de Moulon, 3 rue Juliot-Curie, 91192 Gif-sur-Yvette, France
This paper is an extended version of the paper published in Proceedings of the 34th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), Amboise, France, 21–26 September 2014.
Entropy 2015, 17(6), 3989-4027; https://doi.org/10.3390/e17063989
Submission received: 20 November 2014 / Revised: 4 May 2015 / Accepted: 5 May 2015 / Published: 12 June 2015
(This article belongs to the Special Issue Information, Entropy and Their Geometric Structures)

Abstract

This review article first surveys the main inference tools: Bayes rule, the maximum entropy principle (MEP), information theory, relative entropy and the Kullback–Leibler (KL) divergence, and Fisher information with its corresponding geometries. For each of these tools, the precise context of its use is described. The second part of the paper focuses on the ways these tools have been used in data, signal and image processing and in the inverse problems that arise in different physical sciences and engineering applications. A few examples of the applications are described: entropy in independent components analysis (ICA) and in blind source separation, Fisher information in data model selection, different maximum entropy-based methods in time series spectral estimation and in linear inverse problems and, finally, Bayesian inference for general inverse problems. Some original material concerning approximate Bayesian computation (ABC) and, in particular, variational Bayesian approximation (VBA) methods is also presented. VBA is proposed as an alternative Bayesian computational tool to the classical Markov chain Monte Carlo (MCMC) methods. We will also see that VBA encompasses joint maximum a posteriori (MAP), as well as the different expectation-maximization (EM) algorithms, as particular cases.

1. Introduction

As this paper is an overview and an extension of my tutorial paper at the MaxEnt 2014 workshop [1], this Introduction gives a summary of its content.
The qualification Bayesian refers to the influence of Thomas Bayes [2], who introduced what is now known as Bayes’ rule, even if the idea had been developed independently by Pierre-Simon de Laplace [3]. For this reason, I ask the community whether we should change Bayes to Laplace and Bayesian to Laplacian, or at least mention them both. Whatever the answer, we assume that the reader knows what probability means in a Bayesian or Laplacian framework. The main idea is that a probability law P(X) assigned to a quantity X represents our state of knowledge about it. If X is a discrete valued variable, $\{P(X = x_n) = p_n;\ n = 1, \cdots, N\}$ with mutually exclusive values $x_n$ is its probability distribution. When X is a continuous valued variable, p(x) is its probability density function, from which we can compute $P(a \le X < b) = \int_a^b p(x)\,dx$ or any other probabilistic quantity, such as its mode, mean, median, region of high probabilities, etc.
In science, it happens very often that a quantity cannot be directly observed or, even when it can be observed, the observations are uncertain (commonly said to be noisy). By uncertain or noisy, I mean here that, if we repeat the experiment under the same practical conditions, we obtain different data. However, in the Bayesian approach, for a given experiment, we have to use the data as they are, and we want to infer the quantity of interest X from those observations. Before starting the observation and gathering new data, we have very incomplete knowledge about X. However, this incomplete knowledge can be translated into probability theory via an a priori probability law P(X). We will discuss how to do this later on. For now, we assume that this can be done. When a new observation (data D) on X becomes available (direct or indirect), we gain some knowledge via the likelihood P(D|X). Then, our state of knowledge is updated by combining P(D|X) and P(X) to obtain an a posteriori law P(X|D), which represents the new state of knowledge on X. This is the main spirit of the Bayes rule, which can be summarized as:
$$P(X|D) = \frac{P(D|X)\,P(X)}{P(D)} .$$
As P(X|D) has to be a probability law, we have:
$$P(D) = \sum_{X} P(D|X)\,P(X) .$$
This relation can be extended to the continuous case. Some more details will be given in Section 2.
Associated with a probability law is the quantity of information it contains. Shannon [4] introduced the notion of quantity of information $I_n$ associated with one of the possible values $x_n$ of X with probabilities $P(X = x_n) = p_n$ to be $I_n = \ln \frac{1}{p_n} = -\ln p_n$ and the entropy H as its expected value:
$$H = -\sum_{n=1}^{N} p_n \ln p_n .$$
The word entropy has also its roots in thermodynamics and physics. However, this notion of entropy has no direct link with entropy in physics, even if for a particular physical system, we may attribute a probability law to a quantity of interest of that system and then define its entropy. This information definition of Shannon entropy became the main basis of information theory in many data analyses and the science of communication. More details and extensions about this subject will be given in Section 3.
Up to now, we have not discussed how to assign a probability law to a quantity. For a discrete valued variable, when X can take one of the N values $\{x_1, \cdots, x_N\}$ and when we do not know anything else about it, Laplace proposed the “Principe d’indifférence”, where $P(X = x_n) = p_n = \frac{1}{N},\ n = 1, \cdots, N$, a uniform distribution. However, what if we know more, but not enough to be able to assign the probability law $\{p_1, \cdots, p_N\}$ completely?
For example, if we know that the expected value is $\sum_n x_n p_n = d$, this problem can be handled by considering this equation as a constraint on the probability distribution $\{p_1, \cdots, p_N\}$. If we have a sufficient number of constraints (at least N), then we may obtain a unique solution. However, very often, this is not the case. The question now is how to assign a probability distribution $\{p_1, \cdots, p_N\}$ that satisfies the available constraints. This question is an ill-posed problem in the mathematical sense of Hadamard [5], in that the solution is not unique: we can propose many probability distributions that satisfy the constraints imposed by this problem. To answer this question, Jaynes [6–8] introduced the maximum entropy principle (MEP) as a tool for assigning a probability law to a quantity on which we have some incomplete or macroscopic (expected values) information. Some more details about this MEP, the mathematical optimization problem, the expression of the solution and the algorithm to compute it will be given in Sections 3 and 6.
Kullback [9] was interested in comparing two probability laws and introduced a tool to measure the increase of information content of a new probability law with respect to a reference one. This tool is called either the Kullback–Leibler (KL) divergence, cross entropy or relative entropy. It has also been used to update a prior law when new pieces of information in the form of expected values are given. As we will see later, this tool can also be used as an extension to the MEP of Jaynes. Furthermore, this criterion of comparison of two probability laws is not symmetric: one of the probability laws has to be chosen to be the reference, and then, the second is compared to this reference. Some more details and extensions will be given in Section 4.
Fisher [10] wanted to measure the amount of information that a random variable X carries about an unknown parameter θ upon which its probability law p(x|θ) depends. The partial derivative with respect to θ of the logarithm of this probability law, called the log-likelihood function for θ, is called the score. He showed that the first order moment of the score is zero, but its second order moment is positive and is also equal to the negative of the expected value of the second derivative of the log-likelihood function with respect to θ. This quantity is called the Fisher information. It has also been shown that, for small variations of θ, the Fisher information locally induces a distance in the space of parameters Θ, which is useful if we have to compare two very close values of θ. In this way, the notion of the geometry of information is introduced [11,12]. We must be careful here that this geometrical property is related to the space of the parameters Θ for small changes of the parameter for a given family of parametric probability laws p(x|θ) and not to the space of probabilities. However, for two probability laws $p_1(x) = p(x|\theta_1)$ and $p_2(x) = p(x|\theta_2)$ in the same exponential family, the Kullback–Leibler divergence $\mathrm{KL}[p_1 : p_2]$ induces a Bregman divergence $B[\theta_1 : \theta_2]$ between the two parameters [13,14]. More details will be given in Section 8.
At this stage, we have almost introduced all of the necessary tools that we can use for different levels of data, signal and image processing. In the following, we give some more details for each of these tools and their inter-relations. Then, we review a few examples of their use in different applications. As examples, we demonstrate how these tools can be used in independent components analysis (ICA) and source separation, data model selection, in spectral analysis of the signals and in inverse problems, which arise in many sciences and engineering applications. At the end, we focus more on the Bayesian approach for inverse problems. We present some details concerning unsupervised methods, where the hyper parameters of the problem have to be estimated jointly with the unknown quantities (hidden variables). Here, we will see how the Kullback–Leibler divergence can help approximate Bayesian computation (ABC). In particular, some original materials concerning variational Bayesian approximation (VBA) methods are presented.

2. Bayes Rule

Let us introduce things very simply. If we have two discrete valued related variables X and Y, for which we have assigned probability laws P(X) and P(Y), respectively, and their joint probability law P(X, Y), then from the sum and product rule, we have:
$$P(X, Y) = P(X|Y)\,P(Y) = P(Y|X)\,P(X)$$
where P(X, Y) is the joint probability law, $P(X) = \sum_Y P(X, Y)$ and $P(Y) = \sum_X P(X, Y)$ are the marginals and $P(X|Y) = P(X, Y)/P(Y)$ and $P(Y|X) = P(X, Y)/P(X)$ are the conditionals. Now, consider the situation where Y can be observed, but not X. Because these two quantities are related, we may want to infer X from the observations on Y. Then, we can use:
$$P(X|Y) = \frac{P(Y|X)\,P(X)}{P(Y)}$$
which is called the Bayes rule.
This relation is extended to continuous valued variables using measure theory [15,16]:
$$p(x|y) = \frac{p(y|x)\,p(x)}{p(y)}$$
with:
$$p(y) = \int p(y|x)\,p(x)\,dx .$$
More simply, the Bayes rule is often written as:
$$p(x|y) \propto p(y|x)\,p(x) .$$
This writing can be used when we want to use p(x|y) to compute quantities that are only dependent on the shape of p(x|y), such as the mode, the median or quantiles. However, we must be careful that the denominator is of importance in many other cases, for example when we want to compute expected values. There is no need for more sophisticated mathematics here if we want to use this approach.
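As a small numerical illustration of this point, the sketch below applies the rule to a discrete variable with three possible values. The values, prior and likelihood are hypothetical numbers chosen only for illustration. It suggests that the mode can be read directly from the unnormalized product, whereas the expected value needs the normalization constant.

```python
def bayes_posterior(prior, likelihood):
    """Combine P(X) and P(D|X), given as lists over the values of X, into P(X|D)."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]  # P(D|X) P(X)
    evidence = sum(unnormalized)                                # P(D)
    posterior = [u / evidence for u in unnormalized]
    return posterior, unnormalized, evidence


values = [0.0, 1.0, 2.0]        # hypothetical values x_n
prior = [0.5, 0.3, 0.2]         # hypothetical P(X = x_n)
likelihood = [0.1, 0.4, 0.7]    # hypothetical P(D | X = x_n)

posterior, unnormalized, evidence = bayes_posterior(prior, likelihood)

# The mode depends only on the shape, so both versions agree.
mode_from_posterior = max(range(3), key=posterior.__getitem__)
mode_from_product = max(range(3), key=unnormalized.__getitem__)

# The expected value requires the normalized posterior.
posterior_mean = sum(x * p for x, p in zip(values, posterior))
```

Dividing by the evidence is the only step that distinguishes the posterior from the product of likelihood and prior, which is why it can be skipped for shape-only summaries.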
As we mentioned, the main use of this rule is when X cannot be observed (unknown quantity), but Y is observed and we want to infer X. In this case, the term p(y|x) is called the likelihood (of the unknown quantity X given the observed data y), p(x) is called the a priori law and p(x|y) the a posteriori law. The likelihood is assigned using the link between the observed Y and the unknown X, and p(x) is assigned using the prior knowledge about X. The Bayes rule is then a way to fuse states of knowledge. Before taking into account any observation, our state of knowledge is represented by p(x), and after the observation of Y, it becomes p(x|y).
However, in this approach, two steps are very important. The first step is the assigning of p(x) and p(y|x) before being able to use the Bayes rule. As noted in the Introduction and as we will see later, we need other tools for this step. The second important step is after: how to use p(x|y) to summarize it. When X is just a scalar value variable, we can do this computation easily. For example, we can compute the probability that X is in the interval [a, b] via:
$$P(a \le X < b \,|\, y) = \int_a^b p(x|y)\,dx .$$
However, when the unknown becomes a high dimensional vectorial variable X, as is the case in many signal and image processing applications, this computation becomes very costly [17]. We may then want to summarize p(x|y) by a few interesting or significant point estimates. For example, compute the maximum a posteriori (MAP) solution:
$$\hat{x}_{\mathrm{MAP}} = \arg\max_x \{ p(x|y) \} ,$$
the expected a posteriori (EAP) solution:
$$\hat{x}_{\mathrm{EAP}} = \int x\, p(x|y)\,dx ,$$
the domains of X which include an integrated probability mass of more than some desired value (0.95 for example):
$$[x_1, x_2] : \int_{x_1}^{x_2} p(x|y)\,dx = 0.95 ,$$
or any other questions, such as median or any α-quantiles:
$$x_q : \int_{x_q}^{\infty} p(x|y)\,dx = (1 - \alpha) .$$
As we see, computation of MAP needs an optimization algorithm, while these last three cases need integration, which may become very complicated for high dimensional cases [17].
We can also just explore numerically the whole space of the distribution using Markov chain Monte Carlo (MCMC) [18–26] or any other sampling techniques [17]. In the scalar case (one dimension), all of these computations can be done numerically very easily. For the vectorial case, when the dimensions become large, we need to develop specialized approximation methods, such as VBA, and algorithms to do these computations. We give some more details about these when using this approach for inverse problems in real applications.
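To make the scalar case concrete, here is a minimal sketch that evaluates an unnormalized posterior on a regular grid and computes the MAP, the EAP, the median and a central 95% interval by simple Riemann sums. The Gaussian-shaped posterior, the grid limits and the function names are assumptions made for illustration only.

```python
import math


def grid_summaries(unnormalized, lo, hi, n=20001):
    """Summarize a scalar posterior known up to a constant on a regular grid."""
    dx = (hi - lo) / (n - 1)
    xs = [lo + i * dx for i in range(n)]
    w = [unnormalized(x) for x in xs]
    z = sum(w) * dx                          # approximates the evidence p(y)
    p = [wi / z for wi in w]                 # normalized density on the grid

    x_map = xs[max(range(n), key=p.__getitem__)]            # optimization
    x_eap = sum(x * pi for x, pi in zip(xs, p)) * dx        # integration

    def quantile(q):
        """Smallest grid point whose cumulative mass reaches q."""
        acc = 0.0
        for x, pi in zip(xs, p):
            acc += pi * dx
            if acc >= q:
                return x
        return xs[-1]

    return x_map, x_eap, quantile(0.5), (quantile(0.025), quantile(0.975))


def example_posterior(x):
    """Hypothetical unnormalized posterior: Gaussian shape, mean 1, std 0.5."""
    return math.exp(-0.5 * ((x - 1.0) / 0.5) ** 2)


x_map, x_eap, x_median, interval = grid_summaries(example_posterior, -4.0, 6.0)
```

The cost of such a grid grows exponentially with the dimension of X, which is the practical reason why sampling or approximation methods are needed in the vectorial case.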
Remarks on notation used for the expected value in this paper: For a variable X with the probability density function (pdf) p(x) and any regular function h(X), we use interchangeably:
$$E\{X\} = E_p\{X\} = \langle X \rangle = \langle X \rangle_p = \int x\, p(x)\,dx$$
and:
$$E\{h(X)\} = E_p\{h(X)\} = \langle h(X) \rangle = \langle h(X) \rangle_p = \int h(x)\, p(x)\,dx .$$
As an example, as we will see later, the entropy of p(x) is written interchangeably:
$$H[p] = E\{-\ln p(X)\} = E_p\{-\ln p(X)\} = \langle -\ln p(X) \rangle = \langle -\ln p(X) \rangle_p = -\int p(x) \ln p(x)\,dx .$$
For any conditional probability density function (pdf) p(x|y) and any regular function h(X), we use interchangeably:
$$E\{X|y\} = E_{p(x|y)}\{X\} = \langle X|y \rangle = \langle X \rangle_{p(x|y)} = \int x\, p(x|y)\,dx$$
and:
$$E\{h(X)|y\} = E_{p(x|y)}\{h(X)\} = \langle h(X)|y \rangle = \langle h(X) \rangle_{p(x|y)} = \int h(x)\, p(x|y)\,dx .$$
As another example, as we will see later, the relative entropy of p(x) over q(x) is written interchangeably:
$$D[p|q] = E_p\left\{-\ln \frac{p(X)}{q(X)}\right\} = \left\langle -\ln \frac{p(X)}{q(X)} \right\rangle_{p(x)} = -\int p(x) \ln \frac{p(x)}{q(x)}\,dx$$
and when there is no ambiguity about the integration variable, we may omit it. For example, we may write:
$$D[p|q] = E_p\left\{-\ln \frac{p}{q}\right\} = \left\langle -\ln \frac{p}{q} \right\rangle_p = -\int p \ln \frac{p}{q} .$$
Finally, when we have two variables X and Y with their joint pdf p(x, y), their marginals p(x) and p(y) and their conditionals p(x|y) and p(y|x), we may use the following notations:
$$E\{h(X)|y\} = E_{p(x|y)}\{h(X)\} = E_{X|Y}\{h(X)\} = \langle h(X)|y \rangle = \langle h(X) \rangle_{p(x|y)} = \int h(x)\, p(x|y)\,dx .$$

3. Quantity of Information and Entropy

3.1. Shannon Entropy

To introduce the quantity of information and the entropy, Shannon first considered a discrete valued variable X taking values $\{x_1, \cdots, x_N\}$ with probabilities $\{p_1, \cdots, p_N\}$ and defined the quantity of information associated with each of them as $I_n = \ln \frac{1}{p_n} = -\ln p_n$ and its expected value as the entropy:
$$H[X] = -\sum_{i=1}^{N} p_i \ln p_i .$$
Later, this definition was extended to the continuous case by:
$$H[X] = -\int p(x) \ln p(x)\,dx .$$
By extension, if we consider two related variables (X, Y) with the probability laws, joint p(x, y), marginals, p(x), p(y), and conditionals, p(y|x), p(x|y), we can define, respectively, the joint entropy:
$$H[X, Y] = -\iint p(x, y) \ln p(x, y)\,dx\,dy ,$$
as well as H[X], H[Y], H[Y|X] and H[X|Y].
Therefore, for any well-defined probability law, we can write an expression for its entropy. The quantities H[X], H[Y], H[Y|X], H[X|Y] and H[X, Y] would be better written as H[p(x)], H[p(y)], H[p(y|x)], H[p(x|y)] and H[p(x, y)], since the entropy is a property of the probability law rather than of the variable itself.
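For the discrete case, these definitions can be sketched in a few lines. The helper names below are hypothetical, the joint law is passed as a nested list `pxy[i][j]`, and the conditional entropy is obtained from the chain rule H[Y|X] = H[X, Y] − H[X], which is an assumption of this sketch (it is the standard identity recalled later in the paper).

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete law; terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def joint_entropy(pxy):
    """H[X, Y] for a joint law given as pxy[i][j] = P(X = x_i, Y = y_j)."""
    return entropy([v for row in pxy for v in row])


def conditional_entropy_y_given_x(pxy):
    """H[Y|X] computed through the chain rule H[Y|X] = H[X, Y] - H[X]."""
    px = [sum(row) for row in pxy]
    return joint_entropy(pxy) - entropy(px)


# Hypothetical independent pair: P(X, Y) = P(X) P(Y).
px = [0.5, 0.5]
py = [0.2, 0.8]
pxy = [[a * b for b in py] for a in px]

h_uniform = entropy([0.25] * 4)                  # equals ln 4 for a uniform law
h_y_given_x = conditional_entropy_y_given_x(pxy)  # equals H[Y] under independence
```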

3.2. Thermodynamical Entropy

Entropy is also a property of thermodynamical systems introduced by Clausius [27]. For a closed homogeneous system with reversible transformation, the differential entropy δS is related to δQ the incremental reversible transfer of heat energy into that system by δS = δQ/T with T being the uniform temperature of the closed system.
It is very hard to establish a direct link between these two entropies. However, in statistical mechanics, thanks to Boltzmann, Gibbs and many others, we can establish some link if we consider the microstates (for example, the number, positions and speeds of the particles) and the macrostates (for example, the temperature T, pressure P, volume V and energy E) of the system and if we assign a probability law to microstates and consider the macrostates as the average (expected values) of some functions of those microstates. Let us give a very brief summary of some of those interpretations.

3.3. Statistical Mechanics Entropy

The interpretation of entropy in statistical mechanics is the measure of uncertainty that remains about the state of a system after its observable macroscopic properties, such as temperature (T), pressure (P) and volume (V), have been taken into account. For a given set of macroscopic variables T, P and V, the entropy measures the degree to which the probability of the system is spread out over different possible microstates. In contrast to the macrostate, which characterizes plainly observable average quantities, a microstate specifies all atomic details about the system, including the position and velocity of every atom. Entropy in statistical mechanics is a measure of the number of ways in which the microstates of the system may be arranged, often taken to be a measure of “disorder” (the higher the entropy, the higher the disorder). This definition describes the entropy as being proportional to the natural logarithm of the number of possible microscopic configurations of the system (microstates), which could give rise to the observed macroscopic state (macrostate) of the system. The proportionality constant is the Boltzmann constant.

3.4. Boltzmann Entropy

Boltzmann described the entropy as a measure of the number of possible microscopic configurations Ω of the individual atoms and molecules of the system (microstates) that comply with the macroscopic state (macrostate) of the system. Boltzmann then went on to show that k ln Ω was equal to the thermodynamic entropy. The factor k has since been known as Boltzmann’s constant.
In particular, Boltzmann showed that the entropy S of an ideal gas is related to the number of states of the molecules (microstates Ω) with a given temperature (macrostate):
S = k ln Ω

3.5. Gibbs Entropy

The macroscopic state of the system is defined by a distribution on the microstates that are accessible to a system in the course of its thermal fluctuations. Therefore, the entropy is defined over two different levels of description of the given system. The entropy is given by the Gibbs entropy formula, named after J. Willard Gibbs. For a classical system (i.e., a collection of classical particles) with a discrete set of microstates, if $E_i$ is the energy of microstate i and $p_i$ is the probability that it occurs during the system’s fluctuations, then the entropy of the system is:
$$S = -k \sum_{i=1}^{N} p_i \ln p_i ,$$
where k is again the physical constant of Boltzmann, which, like the entropy, has units of heat capacity. The logarithm is dimensionless. It is interesting to note that Relation (17), the Boltzmann formula $S = k \ln \Omega$, can be obtained from Relation (18), the Gibbs formula above, when the probability distribution is uniform over the Ω accessible microstates [28–30].
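The step from the Gibbs formula to the Boltzmann formula can be written out explicitly. As a sketch, assume that all Ω accessible microstates are equally probable, so that $N = \Omega$ and $p_i = 1/\Omega$ for every i:

$$S = -k \sum_{i=1}^{\Omega} \frac{1}{\Omega} \ln \frac{1}{\Omega} = k \ln \Omega \sum_{i=1}^{\Omega} \frac{1}{\Omega} = k \ln \Omega .$$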

4. Relative Entropy or Kullback–Leibler Divergence

Kullback wanted to compare the relative quantity of information between two probability laws $p_1$ and $p_2$ on the same variable X. Two related notions have been defined:
  • Relative Entropy of $p_1$ with respect to $p_2$:
    $$D[p_1 : p_2] = -\int p_1(x) \ln \frac{p_1(x)}{p_2(x)}\,dx$$
  • Kullback–Leibler divergence of $p_1$ with respect to $p_2$:
    $$\mathrm{KL}[p_1 : p_2] = -D[p_1 : p_2] = \int p_1(x) \ln \frac{p_1(x)}{p_2(x)}\,dx$$
We may note that:
  • $\mathrm{KL}[q : p] \ge 0$,
  • $\mathrm{KL}[q : p] = 0$, if q = p and
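  • $\mathrm{KL}[q : p_0] \ge \mathrm{KL}[q : p_1] + \mathrm{KL}[p_1 : p_0]$ when $p_1$ minimizes $\mathrm{KL}[p : p_0]$ over a convex set of laws that contains q (a Pythagorean-type inequality; no general triangle inequality holds).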
  • KL [q : p] is invariant with respect to a scale change, but is not symmetric.
  • A symmetric quantity can be defined as:
    $$J[q, p] = \frac{1}{2} \left( \mathrm{KL}[q : p] + \mathrm{KL}[p : q] \right) .$$
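The first properties in this list can be checked numerically in the discrete case. The following minimal sketch uses two hypothetical three-point laws and hypothetical helper names; it computes the divergence in both directions, which makes the lack of symmetry visible, as well as the symmetrized quantity J.

```python
import math


def kl(q, p):
    """KL[q : p] = sum_i q_i ln(q_i / p_i), assuming p_i > 0 wherever q_i > 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def j_symmetric(q, p):
    """Symmetrized divergence J[q, p] = (KL[q : p] + KL[p : q]) / 2."""
    return 0.5 * (kl(q, p) + kl(p, q))


q = [0.5, 0.4, 0.1]   # hypothetical law
p = [0.2, 0.3, 0.5]   # hypothetical reference law

kl_qp = kl(q, p)      # non-negative
kl_pq = kl(p, q)      # differs from kl_qp: KL is not symmetric
j_qp = j_symmetric(q, p)
```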

5. Mutual Information

The purpose of mutual information is to compare two related variables Y and X. It can be defined as the expected amount of information that one gains about X if we observe the value of Y, and vice versa. Mathematically, the mutual information between X and Y is defined as:
$$I[Y, X] = H[X] - H[X|Y] = H[Y] - H[Y|X]$$
or equivalently as:
$$I[Y, X] = \mathrm{KL}[p(X, Y) : p(X)\,p(Y)] .$$
With this definition, we have the following properties:
$$H[X, Y] = H[X] + H[Y|X] = H[Y] + H[X|Y] = H[X] + H[Y] - I[Y, X]$$
and:
$$I[Y, X] = E_X\{\mathrm{KL}[p(Y|X) : p(Y)]\} \triangleq \int \mathrm{KL}[p(y|x) : p(y)]\, p(x)\,dx = E_Y\{\mathrm{KL}[p(X|Y) : p(X)]\} \triangleq \int \mathrm{KL}[p(x|y) : p(x)]\, p(y)\,dy .$$
We may also remark on the following properties:
  • I[Y, X] is a concave function of p(y) when p(x|y) is fixed and a convex function of p(x|y) when p(y) is fixed.
  • $I[Y, X] \ge 0$, with equality if and only if X and Y are independent.
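The equivalence of the entropy-based and divergence-based expressions can be verified on a small discrete example. In this sketch, the joint law `pxy` is a hypothetical 2 × 2 table and the function names are illustrative; the first function compares the joint law with the product of its marginals, the second uses H[X] + H[Y] − H[X, Y].

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete law."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def marginals(pxy):
    """Return (P(X), P(Y)) from the joint table pxy[i][j]."""
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    return px, py


def mutual_information_divergence(pxy):
    """I[Y, X] as the divergence between p(x, y) and p(x) p(y)."""
    px, py = marginals(pxy)
    return sum(v * math.log(v / (px[i] * py[j]))
               for i, row in enumerate(pxy)
               for j, v in enumerate(row) if v > 0.0)


def mutual_information_entropies(pxy):
    """I[Y, X] = H[X] + H[Y] - H[X, Y]."""
    px, py = marginals(pxy)
    return entropy(px) + entropy(py) - entropy([v for row in pxy for v in row])


pxy = [[0.30, 0.10],
       [0.15, 0.45]]                      # hypothetical dependent pair
independent = [[0.5 * 0.2, 0.5 * 0.8],
               [0.5 * 0.2, 0.5 * 0.8]]    # hypothetical independent pair

i_div = mutual_information_divergence(pxy)
i_ent = mutual_information_entropies(pxy)
i_indep = mutual_information_divergence(independent)
```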

6. Maximum Entropy Principle

The first step before applying any probability rules for inference is to assign a probability law to a quantity. Very often, the available knowledge on that quantity can be described mathematically as the constraints on the desired probability law. However, in general, those constraints are not enough to determine in a unique way that probability law. There may exist many solutions that satisfy those constraints. We need then a tool to select one.
Jaynes introduced the MEP [8], which can be summarized as follows: When the available constraints are not enough to determine a unique probability law, we may select, among all of the laws that satisfy them, the one with maximum entropy.
Let us now be more precise. Let us assume that the available information on the quantity X is in the form of:
$$E\{\phi_k(X)\} = d_k , \quad k = 1, \cdots, K ,$$
where $\phi_k$ are any known functions. First, we assume that such probability laws exist by defining:
$$\mathcal{P} = \left\{ p(x) : \int \phi_k(x)\, p(x)\,dx = d_k , \ k = 0, \cdots, K \right\}$$
with $\phi_0 = 1$ and $d_0 = 1$ for the normalization purpose. Then, the MEP is written as an optimization problem:
$$p_{ME}(x) = \arg\max_{p \in \mathcal{P}} \left\{ H[p] = -\int p(x) \ln p(x)\,dx \right\}$$
whose solution is given by:
$$p_{ME}(x) = \frac{1}{Z(\lambda)} \exp\left[ -\sum_{k=1}^{K} \lambda_k \phi_k(x) \right]$$
where Z(λ), called the partition function, is given by $Z(\lambda) = \int \exp\left[ -\sum_{k=1}^{K} \lambda_k \phi_k(x) \right] dx$ and $\lambda = [\lambda_1, \cdots, \lambda_K]'$ has to satisfy:
$$-\frac{\partial \ln Z(\lambda)}{\partial \lambda_k} = d_k , \quad k = 1, \cdots, K ,$$
which can also be written as $-\nabla_\lambda \ln Z(\lambda) = d$. Different algorithms have been proposed to compute the ME distributions numerically. See, for example, [31–37].
The maximum value of entropy reached is given by:
$$H_{\max} = \ln Z(\lambda) + \lambda' d .$$
This optimization can easily be extended to the use of relative entropy by replacing H(p) by D[p : q], where q(x) is a given reference or a priori law. See [9,38,39] and [34,40–42] for more details.
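A minimal numerical sketch of this procedure, for a discrete variable with a single mean constraint (K = 1, $\phi_1(x) = x$), is given below. The six-valued support, the target mean d = 4.5 and the use of bisection are assumptions chosen for illustration; they are not one of the algorithms of the cited references. The sketch relies on the fact that, for the form $\exp[-\lambda x]/Z(\lambda)$, the mean decreases monotonically as λ increases.

```python
import math


def maxent_pmf(values, lam):
    """Return the law p_n = exp(-lam * x_n) / Z(lam) and the partition function Z."""
    w = [math.exp(-lam * x) for x in values]
    z = sum(w)
    return [wi / z for wi in w], z


def solve_lambda(values, d, lo=-20.0, hi=20.0, iters=200):
    """Find lam such that the mean of the ME law equals d, by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        p, _ = maxent_pmf(values, mid)
        mean = sum(x * pi for x, pi in zip(values, p))
        if mean > d:
            lo = mid      # mean too large: increase lam
        else:
            hi = mid      # mean too small: decrease lam
    return 0.5 * (lo + hi)


values = [1, 2, 3, 4, 5, 6]   # hypothetical support
d = 4.5                       # hypothetical expected value constraint

lam = solve_lambda(values, d)
p_me, z = maxent_pmf(values, lam)
mean_me = sum(x * pi for x, pi in zip(values, p_me))
h_max = -sum(pi * math.log(pi) for pi in p_me)   # should equal ln Z + lam * d
```

With d = 3.5, the same code returns λ close to zero and the uniform law, which recovers the indifference principle as a special case.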

7. Link between Entropy and Likelihood

Consider the problem of estimating the parameter θ of a probability law p(x|θ) from an n-element sample of data $x = \{x_1, \cdots, x_n\}$.
The log-likelihood of θ is defined as:
$$L(\theta) = \ln \prod_{i=1}^{n} p(x_i|\theta) = \sum_{i=1}^{n} \ln p(x_i|\theta) .$$
Maximizing L(θ) with respect to θ gives what is called the maximum likelihood (ML) estimate of θ.
Noting that L(θ) depends on n, we may consider $\frac{1}{n} L(\theta)$ and define:
$$\bar{L}(\theta) = \lim_{n \to \infty} \frac{1}{n} L(\theta) = E\{\ln p(x|\theta)\} = \int p(x|\theta^*) \ln p(x|\theta)\,dx ,$$
where θ* is the true value of the parameter and p(x|θ*) its corresponding probability law. We may then remark that:
$$D[p(x|\theta^*) : p(x|\theta)] = \int p(x|\theta^*) \ln \frac{p(x|\theta)}{p(x|\theta^*)}\,dx = -\int p(x|\theta^*) \ln p(x|\theta^*)\,dx + \bar{L}(\theta) .$$
The first term on the right-hand side being a constant, we derive that:
$$\arg\max_\theta \{ D[p(x|\theta^*) : p(x|\theta)] \} = \arg\max_\theta \{ \bar{L}(\theta) \} .$$
In this way, there is a link between the maximum likelihood and maximum relative entropy solutions [24].
There is also a link between the maximum relative entropy and the Bayes rule. See, for example, [43,44] and their corresponding references.
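The link between the two criteria can be illustrated on a Bernoulli law, for which both the limiting average log-likelihood and the relative entropy have closed forms. In the sketch below, the true parameter `theta_star = 0.3`, the search grid and the function names are hypothetical choices; the two criteria differ only by a constant, so they should be maximized at the same grid point.

```python
import math

theta_star = 0.3   # hypothetical true parameter of a Bernoulli law


def mean_log_likelihood(theta):
    """Expected log-likelihood under the true law: E{ln p(x|theta)}."""
    return theta_star * math.log(theta) + (1.0 - theta_star) * math.log(1.0 - theta)


def relative_entropy(theta):
    """D[p(x|theta*) : p(x|theta)] = sum_x p(x|theta*) ln(p(x|theta) / p(x|theta*))."""
    return (theta_star * math.log(theta / theta_star)
            + (1.0 - theta_star) * math.log((1.0 - theta) / (1.0 - theta_star)))


grid = [i / 1000 for i in range(1, 1000)]        # candidate values of theta
theta_ml = max(grid, key=mean_log_likelihood)     # maximizes the likelihood criterion
theta_re = max(grid, key=relative_entropy)        # maximizes the relative entropy

# The difference between the two criteria is the entropy of the true law.
constant = -(theta_star * math.log(theta_star)
             + (1.0 - theta_star) * math.log(1.0 - theta_star))
```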

8. Fisher Information, Bregman and Other Divergences

Fisher [10] was interested in measuring the amount of information that samples of a variable X carry about an unknown parameter θ upon which its probability law p(x|θ) depends. For a given sample of observation x and its probability law p(x|θ), the function L(θ) = p(x|θ) is called the likelihood of θ in the sample x. He called the score of x over θ the partial derivative with respect to θ of the logarithm of this function:
$$S(x|\theta) = \frac{\partial \ln p(x|\theta)}{\partial \theta}$$
He also showed that the first order moment of the score is zero:
$$E\{S(X|\theta)\} = E\left\{ \frac{\partial \ln p(x|\theta)}{\partial \theta} \right\} = 0$$
but its second order moment is positive and is also equal to the negative of the expected value of the second derivative of the log-likelihood function with respect to θ:
$$E\{S^2(X|\theta)\} = E\left\{ \left| \frac{\partial \ln p(x|\theta)}{\partial \theta} \right|^2 \right\} = -E\left\{ \frac{\partial^2 \ln p(x|\theta)}{\partial \theta^2} \right\} = F$$
This quantity is called the Fisher information [14].
It has also been shown that, for small variations of θ, the Fisher information locally induces a distance in the space of parameters Θ, which is useful if we have to compare two very close values of θ. In this way, the notion of the geometry of information is introduced. The main steps for introducing this notion are the following: Consider $\mathrm{KL}[p(x|\theta^*) : p(x|\theta^* + \Delta\theta)]$ and assume that ln p(x|θ) can be expanded in a Taylor series. Then, keeping the terms up to the second order, we obtain:
$$\mathrm{KL}[p(x|\theta^*) : p(x|\theta^* + \Delta\theta)] \approx \frac{1}{2}\, \Delta\theta'\, F(\theta^*)\, \Delta\theta ,$$
where F is the Fisher information:
$$F(\theta^*) = -E\left\{ \left. \frac{\partial^2 \ln p(x|\theta)}{\partial \theta\, \partial \theta'} \right|_{\theta = \theta^*} \right\} .$$
We must be careful here that this geometrical property is related to the space of the parameters Θ for a given family of parametric probability laws p(x|θ) and not to the space of probabilities. However, for two probability laws $p_1(x) = p(x|\theta_1)$ and $p_2(x) = p(x|\theta_2)$ in the same exponential family, the Kullback–Leibler divergence $\mathrm{KL}[p_1 : p_2]$ induces a Bregman divergence $B[\theta_1 : \theta_2]$ between the two parameters [14,45–48].
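These relations can be checked by direct enumeration for a Bernoulli law $p(x|\theta) = \theta^x (1-\theta)^{1-x}$, which is used here only as a convenient hypothetical example. The sketch computes the mean of the score, the Fisher information from the squared score and from the curvature of the log-likelihood, and compares the exact Kullback–Leibler divergence between two close parameter values with the second order approximation.

```python
import math


def score(x, theta):
    """d/dtheta of ln p(x|theta) for a Bernoulli law."""
    return x / theta - (1 - x) / (1.0 - theta)


def curvature(x, theta):
    """d^2/dtheta^2 of ln p(x|theta) for a Bernoulli law."""
    return -x / theta ** 2 - (1 - x) / (1.0 - theta) ** 2


def expect(f, theta):
    """Expectation over x in {0, 1} under p(x|theta)."""
    return (1.0 - theta) * f(0, theta) + theta * f(1, theta)


def kl_bernoulli(a, b):
    """KL divergence of Bernoulli(a) with respect to Bernoulli(b)."""
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))


theta = 0.3      # hypothetical parameter value
delta = 1e-3     # hypothetical small variation of the parameter

mean_score = expect(score, theta)                               # zero
fisher_score = expect(lambda x, t: score(x, t) ** 2, theta)     # E{S^2}
fisher_curvature = -expect(curvature, theta)                    # -E{second derivative}

kl_exact = kl_bernoulli(theta, theta + delta)
kl_quadratic = 0.5 * fisher_score * delta ** 2
```

For this family, the Fisher information has the closed form 1/(θ(1 − θ)), which both numerical routes should reproduce.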
To go further into detail, let us extend the discussion about the link between Fisher information and KL divergence, as well as other divergences, such as f-divergences, Rényi’s divergences and Bregman divergences.
  • f-divergences:
    The f-divergences are a general class of divergences, indexed by convex functions f, that includes the KL divergence as a special case. Let $f: (0, \infty) \mapsto \mathbb{R}$ be a convex function for which f(1) = 0. The f-divergence between two probability measures P and Q, with densities p and q, is defined by:
    $$D_f[P : Q] = \int q\, f\!\left( \frac{p}{q} \right)$$
    Every f-divergence can be viewed as a measure of distance between probability measures, each with different properties. Some important special cases are:
    • $f(x) = x \ln x$ gives the KL divergence: $\mathrm{KL}[P : Q] = \int p \ln \left( \frac{p}{q} \right)$.
    • $f(x) = |x - 1|/2$ gives the total variation distance: $\mathrm{TV}[P, Q] = \int |p - q| / 2$.
    • $f(x) = (\sqrt{x} - 1)^2$ gives the square of the Hellinger distance: $H^2[P, Q] = \int (\sqrt{p} - \sqrt{q})^2$.
    • $f(x) = (x - 1)^2$ gives the chi-squared divergence: $\chi^2[P : Q] = \int \frac{(p - q)^2}{q}$.
  • Rényi divergences:
    These are another generalization of the KL divergence. The Rényi divergence between two probability distributions P and Q is:
    $$D_\alpha[P : Q] = \frac{1}{\alpha - 1} \ln \int p^\alpha q^{1 - \alpha} .$$
    As α → 1, by a continuity argument, $D_\alpha[P : Q]$ converges to $\mathrm{KL}[P : Q]$.
    $D_{1/2}[P : Q] = -2 \ln \int \sqrt{p\, q}$ is called the Bhattacharyya divergence (closely related to the Hellinger distance). Interestingly, this quantity is never larger than KL:
    $$D_{1/2}[P : Q] \le \mathrm{KL}[P : Q] .$$
    As a result, it is sometimes easier to derive risk bounds with $D_{1/2}$ as the loss function as opposed to KL.
  • Bregman divergences:
    The Bregman divergences provide another class of divergences that are indexed by convex functions and include both the squared Euclidean distance and the KL divergence as special cases. Let ϕ be a differentiable strictly convex function. The Bregman divergence $B_\phi$ is defined by:
    $$B_\phi[x : y] = \phi(x) - \phi(y) - \langle x - y, \nabla\phi(y) \rangle$$
    where $\langle x, y \rangle = \sum_j x_j y_j$ here means the scalar product of x and y and where the domain of ϕ is a space where convexity and differentiability make sense (e.g., the whole or a subset of $\mathbb{R}^d$ or an $L_p$ space). For example, $\phi(x) = \|x\|^2$ on $\mathbb{R}^d$ gives the squared Euclidean distance:
    $$B_\phi[x : y] = \phi(x) - \phi(y) - \langle x - y, \nabla\phi(y) \rangle = \|x\|^2 - \|y\|^2 - \langle x - y, 2y \rangle = \|x - y\|^2$$
    and $\phi(x) = \sum_j x_j \ln x_j$ on the simplex in $\mathbb{R}^d$ gives the KL divergence:
    $$B_\phi[x : y] = \sum_j x_j \ln x_j - \sum_j y_j \ln y_j - \sum_j (x_j - y_j)(1 + \ln y_j) = \sum_j x_j \ln \frac{x_j}{y_j} = \mathrm{KL}[x : y]$$
    where it is assumed that $\sum_j x_j = \sum_j y_j = 1$.
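The three families above can be compared numerically on discrete laws, where the integrals become sums. The sketch below is a minimal illustration with two hypothetical strictly positive three-point laws; the generator functions and helper names are assumptions of this example. It recovers the KL divergence both as an f-divergence and as a Bregman divergence, and evaluates the Rényi divergence of order 1/2.

```python
import math


def f_divergence(p, q, f):
    """D_f[P : Q] = sum_i q_i f(p_i / q_i) for strictly positive discrete laws."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def renyi(p, q, alpha):
    """Renyi divergence of order alpha (alpha != 1)."""
    return math.log(sum(pi ** alpha * qi ** (1.0 - alpha)
                        for pi, qi in zip(p, q))) / (alpha - 1.0)


def bregman(x, y, phi, grad_phi):
    """B_phi[x : y] = phi(x) - phi(y) - <x - y, grad phi(y)>."""
    inner = sum((xi - yi) * gi for xi, yi, gi in zip(x, y, grad_phi(y)))
    return phi(x) - phi(y) - inner


def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)


def grad_neg_entropy(x):
    return [1.0 + math.log(xi) for xi in x]


def sq_norm(x):
    return sum(xi * xi for xi in x)


def grad_sq_norm(x):
    return [2.0 * xi for xi in x]


p = [0.5, 0.4, 0.1]   # hypothetical law P
q = [0.2, 0.3, 0.5]   # hypothetical law Q

kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
kl_as_f = f_divergence(p, q, lambda x: x * math.log(x))
total_variation = f_divergence(p, q, lambda x: abs(x - 1.0) / 2.0)
hellinger_sq = f_divergence(p, q, lambda x: (math.sqrt(x) - 1.0) ** 2)
chi_squared = f_divergence(p, q, lambda x: (x - 1.0) ** 2)
d_half = renyi(p, q, 0.5)
kl_as_bregman = bregman(p, q, neg_entropy, grad_neg_entropy)
sq_euclidean = bregman(p, q, sq_norm, grad_sq_norm)
```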
Let X be a quantity taking values in the domain of ϕ with a probability distribution function p(x).
Then, $E_{p(x)}\{B_\phi[X : m]\}$ is minimized over m in the domain of ϕ at m = E{X}:
$$\hat{m} = \arg\min_m E\{B_\phi[X : m]\} = E\{X\} .$$
Moreover, this property characterizes the Bregman divergences. When applied to the Bayesian approach, this means that, using a Bregman divergence as the loss function, the Bayes estimator is the posterior mean. This point is detailed in the following.
Links between all of these through an example:
Let us consider the Bayesian parameter estimation where we have some data y, a set of parameters x, a likelihood p(y|x) and a prior π(x), which gives the posterior $p(x|y) \propto p(y|x)\, \pi(x)$. Let us also consider a cost function $C[x, \tilde{x}]$ in the parameter space $x \in \mathcal{X}$. The classical Bayesian point estimation of x is expressed as the minimizer of an expected risk:
$$\hat{x} = \arg\min_{\tilde{x}} \{ \bar{C}(\tilde{x}) \}$$
where:
$$\bar{C}(\tilde{x}) = E_{p(x|y)}\{ C[x, \tilde{x}] \} = \int C[x, \tilde{x}]\, p(x|y)\,dx$$
It is very well known that the mean squared error estimator, which corresponds to $C[x, \tilde{x}] = \|x - \tilde{x}\|^2$, is the posterior mean. It is now interesting to know that, choosing $C[x, \tilde{x}]$ to be any Bregman divergence $B_\phi[x, \tilde{x}]$, we also obtain the posterior mean:
$$\hat{x} = \arg\min_{\tilde{x}} \{ \bar{B}_\phi(\tilde{x}) \} = E_{p(x|y)}\{x\} = \int x\, p(x|y)\,dx$$
where:
$$\bar{B}_\phi(\tilde{x}) = E_{p(x|y)}\{ B_\phi[x, \tilde{x}] \} = \int B_\phi[x, \tilde{x}]\, p(x|y)\,dx$$
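This property can be checked numerically on a scalar example. The sketch below assumes a hypothetical discrete posterior on four positive values and uses the Bregman divergence generated by $\phi(x) = x \ln x$, which for scalars reads $x \ln(x/m) - x + m$. A grid search over the candidate estimate is used only for simplicity; both this loss and the squared error should be minimized at the posterior mean.

```python
import math

xs = [0.5, 1.0, 2.0, 4.0]      # hypothetical support of the posterior
post = [0.1, 0.4, 0.3, 0.2]    # hypothetical posterior probabilities


def bregman_xlogx(x, m):
    """Scalar Bregman divergence generated by phi(x) = x ln x (x, m > 0)."""
    return x * math.log(x / m) - x + m


def squared_error(x, m):
    """Scalar Bregman divergence generated by phi(x) = x^2."""
    return (x - m) ** 2


def expected_loss(m, loss):
    """Posterior expected loss of the candidate estimate m."""
    return sum(p * loss(x, m) for x, p in zip(xs, post))


grid = [0.01 * i for i in range(1, 501)]   # candidate estimates in (0, 5]
m_bregman = min(grid, key=lambda m: expected_loss(m, bregman_xlogx))
m_squared = min(grid, key=lambda m: expected_loss(m, squared_error))
posterior_mean = sum(x * p for x, p in zip(xs, post))
```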
Consider now that we have two prior probability laws π1(x) and π2(x), which give rise to two posterior probability laws p1(x|y) and p2(x|y). If the prior laws and the likelihood are in the exponential families, then the posterior laws are also in the exponential family. Let us note them as p1(x|y; θ1) and p2(x|y; θ2), where θ1 and θ2 are the parameters of those posterior laws. We then have the following properties:
  • KL [p1 : p2] is expressed as a Bregman divergence B[θ1 : θ2].
  • A Bregman divergence B[x1 : x2] is induced when KL [p1 : p2] is used to compare the two posteriors.

9. Vectorial Variables and Time Indexed Process

The extension of the scalar variable to the finite dimensional vectorial case is almost immediate. In particular, for the Gaussian case $p(x) = \mathcal{N}(x|\mu, R)$, the mean becomes a vector μ = E{X}, and the variances are replaced by a covariance matrix $R = E\{(X - \mu)(X - \mu)'\}$; almost all of the quantities can be defined immediately. For example, for a Gaussian vector $p(x) = \mathcal{N}(x|0, R)$ of dimension n, the entropy is given by [49]:
$$H = \frac{n}{2} \ln(2\pi e) + \frac{1}{2} \ln(|\det(R)|)$$
and the relative entropy of $\mathcal{N}(x|0, R)$ with respect to $\mathcal{N}(x|0, S)$ is given by:
$$D = \frac{1}{2}\left(\mathrm{tr}(R S^{-1}) - \log\frac{|\det(R)|}{|\det(S)|} - n\right).$$
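As an illustration, the two Gaussian quantities above can be evaluated numerically. This is a minimal sketch assuming NumPy and the standard closed forms, with the entropy constant written as (n/2) ln(2πe); the function names and the test matrices are hypothetical.

```python
import numpy as np

def gaussian_entropy(R):
    """Differential entropy of N(0, R): (n/2) ln(2 pi e) + (1/2) ln det R."""
    n = R.shape[0]
    _, logdet = np.linalg.slogdet(R)
    return 0.5 * n * np.log(2.0 * np.pi * np.e) + 0.5 * logdet

def gaussian_relative_entropy(R, S):
    """D of N(0, R) with respect to N(0, S):
    (1/2) (tr(R S^-1) - ln(det R / det S) - n)."""
    n = R.shape[0]
    _, logdet_R = np.linalg.slogdet(R)
    _, logdet_S = np.linalg.slogdet(S)
    # tr(S^-1 R) equals tr(R S^-1); solve avoids forming the inverse.
    trace_term = np.trace(np.linalg.solve(S, R))
    return 0.5 * (trace_term - (logdet_R - logdet_S) - n)

# Hypothetical covariance matrices (symmetric positive definite).
R = np.array([[2.0, 0.5], [0.5, 1.0]])
S = np.array([[1.0, 0.2], [0.2, 3.0]])

print(gaussian_entropy(R))
print(gaussian_relative_entropy(R, S))
```

The relative entropy vanishes when the two covariance matrices coincide and is positive otherwise.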
The notion of time series or processes needs extra definitions. For example, for a random time series X(t), we can define p(X(t)), ∀t, the expected value time series $\bar{x}(t) = E\{X(t)\}$ and what is called the autocorrelation function Γ(t, τ) = E {X(t) X(t + τ)}. A time series is called stationary when these quantities do not depend on t, i.e., $\bar{x}(t) = m$ and Γ(t, τ) = Γ(τ) [50]. Another quantity of interest for a stationary time series is its power spectral density (PSD) function:
$$S(\omega) = \mathrm{FT}\{\Gamma(\tau)\} = \int \Gamma(\tau) \exp[-j\omega\tau]\, d\tau.$$
When X(t) is observed on times t = nΔT with ΔT = 1, we have X(n), and for a sample {X(1),⋯, X(N)}, we may define the mean μ = E {X} and the covariance matrix $\Sigma = E\{(X - \mu)(X - \mu)'\}$.
With these definitions, it can easily be shown that the covariance matrix of a stationary Gaussian process is Toeplitz [49]. It is also possible to show that the entropy of such a process can be expressed as a function of its PSD function:
$$\lim_{n \to \infty} \frac{1}{n} H(p) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln S(\omega)\, d\omega.$$
For two stationary Gaussian processes with two spectral density functions S1(ω) and S2(ω), we have:
$$\lim_{n \to \infty} \frac{1}{n} D[p_1 : p_2] = \frac{1}{4\pi} \int_{-\pi}^{\pi} \left(\frac{S_1(\omega)}{S_2(\omega)} - \ln\frac{S_1(\omega)}{S_2(\omega)} - 1\right) d\omega$$
where we find the Itakura–Saito distance in the spectral analysis literature [50–53].
These definitions and expressions have often been used in time series analysis. In what follows, we give a few examples of the different ways these notions and quantities have been used in different applications of data, signal and image processing.

10. Entropy in Independent Component Analysis and Source Separation

Given a vector of time series x(t), independent component analysis (ICA) consists of finding a separating matrix B, such that the components y(t) = Bx(t) are as independent as possible. The notion of entropy is used here as a measure of independence. For example, to find B, we may choose $D[p(y) : \prod_j p_j(y_j)]$ as a criterion of independence of the components yj. The next step is to choose a probability law p(x), from which we can derive an expression for p(y) and then for $D[p(y) : \prod_j p_j(y_j)]$ as a function of the matrix B, which can then be optimized to obtain B.
The ICA problem has a tight link with the source separation problem, where it is assumed that the measured time series x(t) is a linear combination of the sources s(t), i.e., x(t) = As(t), with A being the mixing matrix. The objective of source separation is then to find the separating matrix $B = A^{-1}$.
To see how the entropy is used here, let us note y = Bx. Then,
$$p_Y(y) = \frac{1}{|\partial y / \partial x|}\, p_X(x) \;\longrightarrow\; H(y) = -E\{\ln p_Y(y)\} = E\{\ln |\partial y / \partial x|\} + H(x).$$
H(y) is used as a criterion for ICA or source separation. As the objective in ICA is to obtain y in such a way that its components become as independent as possible, the separating matrix B has to maximize H(y). Many ICA algorithms are based on this optimization [54–65].
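For a linear transform y = Bx, the Jacobian is constant, |∂y/∂x| = |det B|, so the entropy relation reduces to H(y) = H(x) + ln |det B|. The sketch below checks this in the Gaussian case, where both entropies have closed forms. It assumes NumPy; the covariance of x and the matrix B are hypothetical random draws used only for illustration.

```python
import numpy as np

def gaussian_entropy(R):
    """Differential entropy of N(0, R): (n/2) ln(2 pi e) + (1/2) ln det R."""
    n = R.shape[0]
    _, logdet = np.linalg.slogdet(R)
    return 0.5 * n * np.log(2.0 * np.pi * np.e) + 0.5 * logdet

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
R_x = A @ A.T + np.eye(3)          # hypothetical covariance of x
B = rng.standard_normal((3, 3))    # hypothetical separating matrix
R_y = B @ R_x @ B.T                # covariance of y = Bx

entropy_y = gaussian_entropy(R_y)
entropy_x_plus_jacobian = gaussian_entropy(R_x) + np.log(abs(np.linalg.det(B)))

print(entropy_y, entropy_x_plus_jacobian)
```

Both printed values should agree up to rounding error.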

11. Entropy in Parametric Modeling and Model Selection

Determining the order of a model, i.e., the dimension of the vector parameter θ in a probabilistic model p(x|θ), is an important subject in many data and signal processing problems. As an example, in autoregressive (AR) modeling:
$$x(n) = \sum_{k=1}^{K} \theta_k\, x(n - k) + \epsilon(n)$$
where θ = {θ1,⋯, θK}, we may want to compare two models with two different values of K.
When the order K is fixed, the estimation of the parameters θ is a very well-known problem, and there are likelihood-based [66] or Bayesian approaches for that [67]. The determination of the order is, however, more difficult [68]. Among the tools, we may mention here the Bayesian methods [69–74], but also the use of the relative entropy D [p(x|θ*): p(x|θ)], where θ* is the parameter vector of dimension K* and θ the parameter vector of dimension K ≤ K*. In such cases, even if the two probability laws to be compared have parameters with different dimensions, we can always use KL [p(x|θ*): p(x|θ)] to compare them. The famous criterion of Akaike [75–78] uses this quantity to determine the optimal order. For a linear parameter model with Gaussian probability laws and likelihood-based methods, there are analytic solutions for it [68].

12. Entropy in Spectral Analysis

Entropy and MEP have been used in different ways in the spectral analysis problem, which has been an important subject of signal processing for decades. Here, we briefly present these different approaches.

12.1. Burg’s Entropy-Based Method

A classical one is Burg’s entropy method [79], which can be summarized as follows: Let X(n) be a stationary, centered process, and assume we have as data a finite number of samples (lags) of its autocorrelation function:
$$r(k) = E\{X(n) X(n + k)\} = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\omega) \exp[j k \omega]\, d\omega, \quad k = 0, \dots, K.$$
The task is then to estimate its power spectral density function:
$$S(\omega) = \sum_{k=-\infty}^{\infty} r(k) \exp[-j k \omega]$$
As we can see, because only the terms of the right-hand side with k = −K, ⋯, +K are available, the problem is ill posed. To obtain a probabilistic solution, we may start by assigning a probability law p(x) to the vector $\bar{X} = [X(0), \dots, X(N-1)]'$. For this, we can use the principle of maximum entropy (PME) with the data as constraints (54). As these constraints are second order moments, the PME solution is a Gaussian probability law: $\mathcal{N}(x|0, R)$. For a stationary Gaussian process, when the number of samples N → ∞, the expression of the entropy becomes:
$$H = \int_{-\pi}^{\pi} \ln S(\omega)\, d\omega.$$
This expression is called Burg’s entropy [79]. Thus, Burg’s method consists of maximizing H subject to the constraints (54). The solution is:
$$S(\omega) = \frac{1}{\left|\sum_{k=-K}^{K} \lambda_k \exp[j k \omega]\right|^2},$$
where $\lambda = [\lambda_0, \dots, \lambda_K]'$ are the Lagrange multipliers associated with the constraints (54). This solution is equivalent to the AR modeling of the Gaussian process X(n).
We may note that, in this particular case, we have an analytical expression for λ, which provides the possibility to give an analytical expression for S(ω) as a function of the data {r(k), k = 0,⋯, K}:
$$S(\omega) = \frac{\delta' \Gamma^{-1} \delta}{e' \Gamma^{-1} e},$$
where Γ = Toeplitz(r(0),⋯, r(K)) is the correlation matrix and δ and e are two vectors defined by δ = [1, 0,⋯, 0]′ and $e = [1, e^{j\omega}, e^{j2\omega}, \dots, e^{jK\omega}]'$.
We may note that we first used MEP to choose a probability law for X(n). With the prior knowledge that we have second order moments, the MEP results in a Gaussian probability density function. Then, as for a stationary Gaussian process, the expression of the entropy is related to the power spectral density S(ω), and as this is related to the correlation data by a Fourier transform, an ME solution could be computed easily.

12.2. Extensions to Burg’s Method

The second approach consists of maximizing the relative entropy D [p(x): p0(x)] or minimizing KL [p(x) : p0(x)] where p0(x) is an a priori law. The choice of the prior is important. Choosing a uniform p0(x), we retrieve the previous case [77].
However, choosing a Gaussian law for p0(x), the expression to maximize becomes:
$$D[p(x) : p_0(x)] = \frac{1}{4\pi} \int_{-\pi}^{\pi} \left(\frac{S(\omega)}{S_0(\omega)} - \ln\frac{S(\omega)}{S_0(\omega)} - 1\right) d\omega$$
when N → ∞ and where S0(ω) corresponds to the power spectral density of the reference process p0(x). Now, the problem becomes: minimize D [p(x): p0(x)] subject to the constraints (54).

12.3. Shore and Johnson Approach

Another approach is to decompose first the process X(n) on the Fourier basis {cos kωt, sin kωt}, to consider ω to be the variable of interest and S(ω), normalized properly, to be considered as its probability distribution function. Then, the problem can be reformulated as the determination of the S(ω), which maximizes the entropy:
$$-\int_{-\pi}^{\pi} S(\omega) \ln S(\omega)\, d\omega$$
subject to the linear constraints (54). The solution is in the form of:
$$S(\omega) = \exp\left[-\sum_{k=-K}^{K} \lambda_k \exp[j k \omega]\right],$$
which can be considered as the most uniform power spectral density that satisfies those constraints.

12.4. ME in the Mean Approach

In this approach, we consider S(ω) as the expected value of a quantity Z(ω), for which we have a prior law μ(z), and we look for the p(z) that maximizes the relative entropy D [p(z) : μ(z)] subject to the constraints (54).
When p(z) is determined, the solution is given by:
$$S(\omega) = E\{Z(\omega)\} = \int Z(\omega)\, p(z)\, dz.$$
The expression of S(ω) depends on μ(z). When μ(z) is Gaussian, we obtain the Rényi entropy:
$$H = \int_{-\pi}^{\pi} S^2(\omega)\, d\omega.$$
If we choose a Poisson measure for μ(z), we obtain the Shannon entropy:
$$H = -\int_{-\pi}^{\pi} S(\omega) \ln S(\omega)\, d\omega,$$
and if we choose a Lebesgue measure over [0, ∞], we obtain Burg’s entropy:
$$H = \int_{-\pi}^{\pi} \ln S(\omega)\, d\omega.$$
When this step is done, the next step becomes maximizing these entropies subject to the constraints of the correlations. The obtained solutions are very different. For more details, see [39,79–85].

13. Entropy-Based Methods for Linear Inverse Problems

13.1. Linear Inverse Problems

A general way to introduce inverse problems is the following: Infer an unknown signal f(t), image f(x, y) or any multi-variable function f(r) through an observed signal g(t), image g(x, y) or any multi-variable observable function g(s), which are related through an operator $\mathcal{H}: f \mapsto g$. This operator can be linear or nonlinear. Here, we consider only linear operators $g = \mathcal{H} f$:
$$g(s) = \int h(r, s)\, f(r)\, dr$$
where h(r, s) is the response of the measurement system. Such linear operators are very common in many applications of signal and image processing. We may mention a few examples of them:
  • Convolution operations g = h * f in 1D (signal):
    $$g(t) = \int h(t - t')\, f(t')\, dt'$$
    or in 2D (image):
    $$g(x, y) = \iint h(x - x', y - y')\, f(x', y')\, dx'\, dy'$$
  • Radon transform (RT) in computed tomography (CT) in the 2D case [86]:
    $$g(r, \phi) = \iint \delta(r - x \cos\phi - y \sin\phi)\, f(x, y)\, dx\, dy$$
  • Fourier transform (FT) in the 2D case:
    $$g(u, v) = \iint \exp[-j(ux + vy)]\, f(x, y)\, dx\, dy$$
    which arises in magnetic resonance imaging (MRI), in synthetic aperture radar (SAR) imaging or in microwave and diffraction optical tomography (DOT) [86–90].
No matter the category of the linear transforms, when the problem is discretized, we arrive at the relation:
g = Hf + ϵ .
where f = [f1,⋯, fn]′ represents the unknowns, g = [g1,⋯, gm]′ the observed data, ϵ = [ϵ1,⋯, ϵm]′ the errors of modeling and measurement and H the matrix of the system response.

13.2. Entropy-Based Methods

Let us consider first the simple no noise case:
g = Hf ,
where H is a matrix of dimensions (M × N), which is in general singular or very ill conditioned. Even if the cases M > N or M = N may appear easier, they have the same difficulties as those of the underdetermined case M < N that we consider here. In this case, evidently the problem has an infinite number of solutions, and we need to choose one.
Among the numerous methods, we may mention the minimum norm solution, which consists of choosing between all of the possible solutions:
$$\mathcal{F} = \{f : Hf = g\}$$
the one that has the minimum norm:
$$\Omega(f) = \|f\|_2^2 = \sum_j f_j^2.$$
This optimization problem can be solved easily in this case, and we obtain:
$$\hat{f}_{NM} = \arg\min_{f \in \mathcal{F}} \{\Omega(f) = \|f\|_2^2\} = H'(HH')^{-1} g.$$
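The minimum norm solution can be illustrated on a small underdetermined system. This is a minimal sketch assuming NumPy, a randomly drawn full row rank matrix H and random data g (both hypothetical); it compares the closed form H′(HH′)⁻¹g with the pseudo-inverse solution and with another solution of Hf = g obtained by adding a null-space component.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 6                       # underdetermined case M < N
H = rng.standard_normal((M, N))   # hypothetical system matrix
g = rng.standard_normal(M)        # hypothetical data

# Minimum norm solution f = H'(HH')^{-1} g, computed with a linear solve.
f_mn = H.T @ np.linalg.solve(H @ H.T, g)

# The Moore-Penrose pseudo-inverse gives the same solution for full row rank H.
f_pinv = np.linalg.pinv(H) @ g

# Any other exact solution differs by a vector in the null space of H.
null_component = (np.eye(N) - np.linalg.pinv(H) @ H) @ rng.standard_normal(N)
f_other = f_mn + null_component

print(np.linalg.norm(f_mn), np.linalg.norm(f_other))
```

Both vectors reproduce the data exactly, but the first has the smaller Euclidean norm.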
In fact, we may choose any other convex criterion Ω(f) and satisfy the uniqueness of the solution. For example:
$$\Omega(f) = -\sum_j f_j \ln f_j$$
which can be interpreted as the entropy when fj > 0 and ∑ fj = 1, thus considering fj as a probability distribution fj = P (U = uj). The variable U can correspond (or not) to a physical quantity. Ω(f) is the entropy associated with this variable.
If we consider fj > 0 to represent the power spectral density of a physical quantity, then the entropy becomes:
$$\Omega(f) = -\sum_j f_j \ln f_j$$
and we can use it as a criterion to select a solution to the problem (71).
As we can see, any convex criterion Ω(f) can be used. Here, we mentioned four of them with different interpretations.
  • L2 or quadratic:
    $$\Omega(f) = \sum_j f_j^2$$
    which can be interpreted as the Rényi’s entropy with q = 2.
  • Lβ:
    $$\Omega(f) = \sum_j |f_j|^\beta$$
    When β < 1 the criterion is not bounded at zero. When β ≥ 1 the criterion is convex.
  • Shannon entropy:
    $$\Omega(f) = -\sum_j f_j \ln f_j$$
    which has a valid interpretation if 0 < fj < 1.
  • The Burg entropy:
    $$\Omega(f) = \sum_j \ln f_j$$
    which needs fj > 0.
Unfortunately, only for the first case is there an analytical solution for the problem, which is $\hat{f} = H'(HH')^{-1} g$. For all of the other cases, we may need an optimization algorithm to obtain a numerical solution [91–95].

13.3. Maximum Entropy in the Mean Approach

A second approach consists of considering fj = E {Uj} or f = E {U} [41,42]. Again, here, Uj or U can, but need not, correspond to some physical quantities. In any case, we now want to assign a probability law $\hat{p}(u)$ to it. Noting that the data g = H f = HE {U} = E {HU} can be considered as constraints on it, we again need a criterion to determine $\hat{p}(u)$. Assuming that we have some prior μ(u), we may use the relative entropy as that criterion. The mathematical problem then becomes:
$$\text{minimize } D[p(u) : \mu(u)] \quad \text{subject to} \quad \int H u\, p(u)\, du = g$$
The solution is:
$$\hat{p}(u) = \frac{1}{Z(\lambda)}\, \mu(u) \exp[-\lambda' H u]$$
where:
$$Z(\lambda) = \int \mu(u) \exp[-\lambda' H u]\, du.$$
When $\hat{p}(u)$ is obtained, we may be interested in computing:
$$\hat{f} = E\{U\} = \int u\, \hat{p}(u)\, du$$
which is the required solution.
Interestingly, if we focus on $\hat{f} = E\{U\}$, we will see that its expression depends on the choice of the prior μ(u). When μ(u) is separable, $\mu(u) = \prod_j \mu_j(u_j)$, the expression of $\hat{p}(u)$ will also be separable.
To go a little more into the details, let us introduce $s = -H'\lambda$ and define:
$$G(s) = \ln \int \mu(u) \exp[s' u]\, du$$
and its convex conjugate:
$$F(f) = \sup_s \{f' s - G(s)\}.$$
It can be shown easily that $\hat{f} = E\{U\}$ can be obtained either via the dual variables $\hat{\lambda}$:
$$\hat{f} = G'(-H'\hat{\lambda})$$
where $\hat{\lambda}$ is obtained by:
$$\hat{\lambda} = \arg\min_{\lambda} \{D(\lambda) = \ln Z(\lambda) + \lambda' g\},$$
or directly:
$$\hat{f} = \arg\min_{\{f : Hf = g\}} \{F(f)\}.$$
D(λ) is called the dual criterion and F (f) the primal one. However, it is not always easy to obtain an analytical expression for G(s) and its gradient G′(s). The functions F (f) and G(s) are convex conjugates.
For the computational aspect, unfortunately, the cases where we may have analytical expressions for Z(λ) or G(s) = ln Z or F (f) are very limited. However, when there are analytical expressions for them, the computations can be done very easily. In Table 1, we summarize some of those solutions:

14. Bayesian Approach for Inverse Problems

In this section, we present in a brief way the Bayesian approach for the inverse problems in signal and image processing.

14.1. Simple Bayesian Approach

The different steps to find a solution to an inverse problem using the Bayesian approach can be summarized as follows:
  • Assign a prior probability law p(ϵ) to the modeling and observation errors ϵ. From this, find the expression of the likelihood p(g|f, θ1). As an example, consider the Gaussian case:
    $$p(\epsilon) = \mathcal{N}(\epsilon|0, v_\epsilon I) \;\longrightarrow\; p(g|f) = \mathcal{N}(g|Hf, v_\epsilon I).$$
    θ1 in this case is the noise variance vϵ.
  • Assign a prior probability law p(f|θ2) to the unknown f to translate your prior knowledge on it. Again, as an example, consider the Gaussian case:
    $$p(f) = \mathcal{N}(f|0, v_f I)$$
    θ2 in this case is the variance vf.
  • Apply the Bayes rule to obtain the expression of the posterior law:
    $$p(f|g, \theta_1, \theta_2) = \frac{p(g|f, \theta_1)\, p(f|\theta_2)}{p(g|\theta_1, \theta_2)} \propto p(g|f, \theta_1)\, p(f|\theta_2),$$
    where the sign ∝ stands for “is proportional to”, p(g|f, θ1) is the likelihood, p(f|θ2) the prior model, θ = [θ1, θ2]′ their corresponding parameters (often called the hyper-parameters of the problem) and p(g|θ1, θ2) is called the evidence of the model.
  • Use p(f|g, θ1, θ2) to infer any quantity dependent on f.
For the expressions of likelihood in (90) and the prior in (91), we obtain very easily the expression of the posterior:
$$p(f|g, v_\epsilon, v_f) = \mathcal{N}(f|\hat{f}, \hat{V}) \quad \text{with} \quad \hat{V} = \left(H'H + \frac{v_\epsilon}{v_f} I\right)^{-1} \quad \text{and} \quad \hat{f} = \hat{V} H' g$$
When the hyper-parameters θ can be fixed a priori, the problem is easy. In practice, we may use some summaries, such as:
  • MAP:
    $$\hat{f}_{MAP} = \arg\max_f \{p(f|g, \theta)\}$$
  • EAP or posterior mean (PM):
    $$\hat{f}_{EAP} = \int f\, p(f|g, \theta)\, df$$
For the Gaussian case of (91), the MAP and EAP are the same and can be obtained by noting that:
$$\hat{f}_{MAP} = \arg\min_f \{J(f)\} \quad \text{with} \quad J(f) = \|g - Hf\|_2^2 + \lambda \|f\|_2^2, \quad \text{where } \lambda = v_\epsilon / v_f.$$
However, in real applications, the computation of even these simple point estimators may need efficient algorithms:
  • For MAP, we need optimization algorithms, which can handle the huge dimensional criterion $J(f) = -\ln p(f|g, \theta)$. Very often, we may be limited to using gradient-based algorithms.
  • For EAP, we need integration algorithms, which can handle huge dimensional integrals. The most common tools here are the MCMC methods [24]. However, for real applications, very often, the computational costs are huge. Recently, different methods, called approximate Bayesian computation (ABC) [96–100] or VBA, have been proposed [74,96,98,101–107].
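The Gaussian case above can be sketched numerically as a regularized least squares problem. The code below assumes NumPy and a small synthetic problem; the matrix H, the true signal and the variance values are hypothetical. It computes the estimate by solving the normal equations (H′H + λI) f = H′g and evaluates the gradient of J(f) at the solution.

```python
import numpy as np

def gaussian_posterior_mean(H, g, v_eps, v_f):
    """Posterior mean (= MAP) for g = Hf + eps with iid Gaussian noise and prior."""
    lam = v_eps / v_f
    n = H.shape[1]
    # Solve (H'H + lam I) f = H'g instead of forming the inverse explicitly.
    return np.linalg.solve(H.T @ H + lam * np.eye(n), H.T @ g)

def criterion(H, g, f, lam):
    """J(f) = ||g - Hf||^2 + lam ||f||^2."""
    return np.sum((g - H @ f) ** 2) + lam * np.sum(f ** 2)

# Hypothetical synthetic problem.
rng = np.random.default_rng(0)
H = rng.standard_normal((8, 5))
f_true = rng.standard_normal(5)
v_eps, v_f = 0.01, 1.0
g = H @ f_true + np.sqrt(v_eps) * rng.standard_normal(8)

f_hat = gaussian_posterior_mean(H, g, v_eps, v_f)
lam = v_eps / v_f

# Gradient of J at f_hat; it should vanish at the minimizer.
gradient = 2.0 * (H.T @ (H @ f_hat - g) + lam * f_hat)
print(f_hat, np.linalg.norm(gradient))
```

Using a linear solve rather than an explicit inverse is a design choice for numerical stability; for large problems, an iterative gradient-based solver would play the same role.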

14.2. Full Bayesian: Hyperparameter Estimation

When the hyperparameters θ have also to be estimated, a prior p(θ) is assigned to them, and the expression of the joint posterior:
$$p(f, \theta|g) = \frac{p(g|f, \theta_1)\, p(f|\theta_2)\, p(\theta)}{p(g)}$$
is obtained, which can then be used to infer them jointly. Very often, the expression of this joint posterior law is complex, and any computation may become very costly. The VBA methods try to approximate p(f, θ|g) by a simpler distribution, which can be handled more easily. Two particular and extreme cases are:
  • Block separable, such as q(f, θ) = q1(f) q2(θ) or
  • Completely separable, such as $q(f, \theta) = \prod_j q_{1j}(f_j) \prod_k q_{2k}(\theta_k)$.
Any mixed solution is also valid. For example, the one we have chosen is:
$$q(f, \theta) = q_1(f) \prod_k q_{2k}(\theta_k)$$
Obtaining the expressions of these approximated separable probability laws has to be done via a criterion. The natural criterion with some geometrical interpretation for the probability law manifolds is the Kullback–Leibler (KL) criterion:
$$\mathrm{KL}[q : p] = \int q \ln\frac{q}{p} = \left\langle \ln\frac{q}{p} \right\rangle_q.$$
For hierarchical prior models with hidden variables z, the problem becomes more complex, because we have to give the expression of the joint posterior law:
$$p(f, z, \theta|g) \propto p(g|f, \theta_1)\, p(f|z, \theta_2)\, p(z|\theta_3)\, p(\theta)$$
and then approximate it by separable ones:
$$q(f, z, \theta|g) = q_1(f)\, q_2(z)\, q_3(\theta) \quad \text{or} \quad q(f, z, \theta|g) = \prod_j q_{1j}(f_j|z_{fj}) \prod_j q_{2j}(z_{fj}) \prod_k q_{3k}(\theta_k)$$
and then use them for estimation. See more discussions in [9,31,38,108–110].
In the following, first the general VBA method is detailed for the inference problems with hierarchical prior models. Then, a particular class of prior model (Student t) is considered, and the details of VBA algorithms for that are given.

15. Basic Algorithms of the Variational Bayesian Approximation

To illustrate the basic ideas and tools, let us consider a vector X and its probability density function p(x), which we want to approximate by q(x) = ∏j qj(xj). Using the KL criterion:
$$\begin{aligned} \mathrm{KL}[q : p] &= \int q(x) \ln\frac{q(x)}{p(x)}\, dx = \int q(x) \ln q(x)\, dx - \int q(x) \ln p(x)\, dx \\ &= \sum_j \int q_j(x_j) \ln q_j(x_j)\, dx_j - \langle \ln p(x) \rangle_q \\ &= \sum_j \int q_j(x_j) \ln q_j(x_j)\, dx_j - \int q_j(x_j)\, \langle \ln p(x) \rangle_{q_{-j}}\, dx_j \end{aligned}$$
where we used the notation: 〈ln p(x)〉q = ∫ q(x) ln p(x) dx and q−j(x) = ∏i≠j qi (xi).
From here, trying to find the solution qj, the basic method is an alternate optimization algorithm:
$$q_j(x_j) \propto \exp\left[\langle \ln p(x) \rangle_{q_{-j}}\right].$$
As we can see, the expression of qj(xj) depends on qi(xi), i ≠ j. It is not always possible to obtain analytical expressions for qj(xj). It is however possible to show that, if p(x) is a member of exponential families, then qj(xj) are also members of exponential families. These iterations then become much simpler, because at each iteration, we need to update the parameters of the exponential families. To go a little more into the details, let us consider some particular simple cases.

15.1. Case of Two Gaussian Variables

In the case of two variables x = [x1, x2]′, we have:
$$\begin{cases} q_1(x_1) \propto \exp\left[\langle \ln p(x) \rangle_{q_2(x_2)}\right] \\ q_2(x_2) \propto \exp\left[\langle \ln p(x) \rangle_{q_1(x_1)}\right] \end{cases}$$
As an illustrative example, consider the case where we want to approximate p(x1, x2) by q(x1, x2) = q1(x1) q2(x2) to be able to compute the expected values:
$$\begin{cases} m_1 = E\{x_1\} = \iint x_1\, p(x_1, x_2)\, dx_1\, dx_2 \\ m_2 = E\{x_2\} = \iint x_2\, p(x_1, x_2)\, dx_1\, dx_2 \end{cases}$$
which need double integrations when p(x1, x2) is not separable in its two variables. If we can do that separable approximation, then, we can compute:
$$\begin{cases} \tilde{\mu}_1 = E\{x_1\} = \int x_1\, q_1(x_1)\, dx_1 \\ \tilde{\mu}_2 = E\{x_2\} = \int x_2\, q_2(x_2)\, dx_2 \end{cases}$$
which need only 1D integrals. Let us see if $(\tilde{\mu}_1, \tilde{\mu}_2)$ will converge to (m1, m2). To illustrate this, let us consider the very simple case of the Gaussian:
$$p(x_1, x_2) = \mathcal{N}\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \Big| \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}, \begin{bmatrix} v_1 & \rho\sqrt{v_1 v_2} \\ \rho\sqrt{v_1 v_2} & v_2 \end{bmatrix}\right).$$
It is then easy to see that $q_1(x_1) = \mathcal{N}(x_1|\tilde{\mu}_1, \tilde{v}_1)$ and $q_2(x_2) = \mathcal{N}(x_2|\tilde{\mu}_2, \tilde{v}_2)$ and that:
$$\begin{cases} q_1^{(k+1)}(x_1) = p(x_1|x_2 = \tilde{\mu}_2^{(k)}) = \mathcal{N}(x_1|\tilde{\mu}_1^{(k+1)}, \tilde{v}_1^{(k+1)}) \\ q_2^{(k+1)}(x_2) = p(x_2|x_1 = \tilde{\mu}_1^{(k)}) = \mathcal{N}(x_2|\tilde{\mu}_2^{(k+1)}, \tilde{v}_2^{(k+1)}) \end{cases}$$
with:
$$\begin{cases} \tilde{\mu}_1^{(k+1)} = m_1 + \rho\sqrt{v_1/v_2}\,(\tilde{\mu}_2^{(k)} - m_2) \\ \tilde{v}_1^{(k+1)} = (1 - \rho^2)\, v_1 \\ \tilde{\mu}_2^{(k+1)} = m_2 + \rho\sqrt{v_2/v_1}\,(\tilde{\mu}_1^{(k)} - m_1) \\ \tilde{v}_2^{(k+1)} = (1 - \rho^2)\, v_2. \end{cases}$$
See [111] for details, where we showed that, initializing the algorithm with $\tilde{\mu}_1^{(0)} = 0$ and $\tilde{\mu}_2^{(0)} = 0$, the means converge to the right values m1 and m2. However, we must be careful about the variances: the approximate variances $(1 - \rho^2) v_1$ and $(1 - \rho^2) v_2$ are smaller than the true marginal variances v1 and v2 whenever ρ ≠ 0.
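The two-variable Gaussian iteration can be sketched in a few lines. This is a minimal illustration of the update equations of this subsection, with the means updated simultaneously from the previous iterate and hypothetical values for m1, m2, v1, v2 and ρ; it is not the implementation of [111].

```python
import math

def vba_two_gaussians(m1, m2, v1, v2, rho, n_iter=200):
    """Alternate updates of the separable Gaussian approximation q1(x1) q2(x2)."""
    mu1, mu2 = 0.0, 0.0   # initialization used in the text
    for _ in range(n_iter):
        # Both means are updated from the values of the previous iteration.
        new_mu1 = m1 + rho * math.sqrt(v1 / v2) * (mu2 - m2)
        new_mu2 = m2 + rho * math.sqrt(v2 / v1) * (mu1 - m1)
        mu1, mu2 = new_mu1, new_mu2
    # The approximate variances do not depend on the iteration.
    vt1 = (1.0 - rho ** 2) * v1
    vt2 = (1.0 - rho ** 2) * v2
    return mu1, mu2, vt1, vt2

# Hypothetical parameters of the joint Gaussian p(x1, x2).
m1, m2, v1, v2, rho = 1.0, -2.0, 2.0, 0.5, 0.8
mu1, mu2, vt1, vt2 = vba_two_gaussians(m1, m2, v1, v2, rho)

print(mu1, mu2)   # approximate means
print(vt1, vt2)   # approximate variances, to compare with v1 and v2
```

The errors on the two means are multiplied by ρ² every two iterations, so the means converge for |ρ| < 1, while the variances stay at (1 − ρ²) times the true marginal variances.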

15.2. Case of Exponential Families

As we could see, to be able to use such an algorithm in practical cases, we need to be able to compute < ln p(x) >q2(x2) and < ln p(x) >q1(x1). Only for a few cases can we do this analytically. Different algorithms can be obtained depending on the choice of a particular family for qj(xj) [103,112–120].
To show this, let us consider the exponential family:
$$p(x|\theta) = g(\theta) \exp[\theta' u(x)]$$
where θ is a vector of parameters and g(θ) and u(x) are known functions.
This parametric exponential family has the following conjugacy property: For a given prior p(θ) in the family:
$$p(\theta|\eta, \nu) = h(\eta, \nu)\, g(\theta)^{\eta} \exp[\nu' \theta]$$
the corresponding posterior:
$$p(\theta|x) \propto p(x|\theta)\, p(\theta|\eta, \nu) \propto g(\theta)^{\eta + 1} \exp\left[[\nu + u(x)]' \theta\right] \propto p(\theta|\eta + 1, \nu + u(x))$$
is in the same family.
For this family, we have:
$$\langle \ln p(x|\theta) \rangle_q = \ln g(\theta) + \theta' \langle u(x) \rangle_q.$$
It is then easy to show that:
$$q_j(x_j) \propto g(\theta) \exp\left[\theta' \langle u(x) \rangle_{q_{-j}}\right]$$
which are in the same exponential family. This simplifies greatly the computations, thanks to the fact that, in each iteration, we only need to compute $\tilde{u}(x) = \langle u(x) \rangle_{q_{-j}}$ and update the parameters.
Now, if we consider:
$$p(x|\theta) = g(\theta) \exp[\theta' u(x)]$$
with a prior on θ:
$$p(\theta|\eta, \nu) = h(\eta, \nu)\, g(\theta)^{\eta} \exp[\nu' \theta]$$
and the joint p(x, θ|η, ν) = p(x|θ) p(θ|η, ν), which is not separable in x and θ, and we want to approximate it with the separable q(x, θ) = q1(x) q2(θ), then we will have:
$$\begin{cases} q(\theta) = h(\tilde{\eta}, \tilde{\nu})\, g(\theta)^{\tilde{\eta}} \exp[\tilde{\nu}' \theta] \\ q(x) = g(\tilde{\theta}) \exp[\tilde{\theta}' u(x)] \end{cases} \quad \text{with} \quad \begin{cases} \tilde{\eta} = \eta + 1 \\ \tilde{\nu} = \nu + \tilde{u}(x) \\ \tilde{\theta} = \tilde{\nu} \end{cases}$$
where $\tilde{u}(x) = \langle u(x) \rangle_{q_1(x)}$.

16. VBA for the Unsupervised Bayesian Approach to Inverse Problems

Before going into the details and for similarity with the notations in the next sections, we replace x by f, such that now we are trying to approximate p(f, θ) = p(f|θ) p(θ) by a separable q(f, θ) = q1(f) q2(θ). Interestingly, depending on the choice of the family laws for q1 and q2, we obtain different algorithms:
  • $q_1(f) = \delta(f - \tilde{f})$ and $q_2(\theta) = \delta(\theta - \tilde{\theta})$. In this case, we have:
    $$\begin{cases} q_1(f) \propto \exp\left[\langle \ln p(f, \theta) \rangle_{q_2}\right] \propto \exp[\ln p(f, \tilde{\theta})] \propto p(f, \theta = \tilde{\theta}) \propto p(f|\theta = \tilde{\theta}) \\ q_2(\theta) \propto \exp\left[\langle \ln p(f, \theta) \rangle_{q_1}\right] \propto \exp[\ln p(\tilde{f}, \theta)] \propto p(f = \tilde{f}, \theta) \propto p(\theta|f = \tilde{f}) \end{cases}$$
    and so:
    $$\begin{cases} \tilde{f} = \arg\max_f \{p(f, \theta = \tilde{\theta})\} \\ \tilde{\theta} = \arg\max_\theta \{p(f = \tilde{f}, \theta)\} \end{cases}$$
    which can be interpreted as an alternate optimization algorithm for obtaining the JMAP estimates:
    $$(\tilde{f}, \tilde{\theta}) = \arg\max_{(f, \theta)} \{p(f, \theta)\}.$$
    The main drawback here is that the uncertainties of the f are not used for the estimation of θ and the uncertainties of θ are not used for the estimation of f.
  • q1(f) is free form and $q_2(\theta) = \delta(\theta - \tilde{\theta})$. In the same way, this time we obtain:
    $$\begin{cases} \langle \ln p(f, \theta) \rangle_{q_2(\theta)} = \ln p(f, \tilde{\theta}) \\ \langle \ln p(f, \theta) \rangle_{q_1(f)} = \langle \ln p(f, \theta) \rangle_{q_1(f|\tilde{\theta})} = Q(\theta, \tilde{\theta}) \end{cases}$$
    which leads to:
    $$\begin{cases} q_1(f) \propto \exp[\ln p(f, \theta = \tilde{\theta})] \propto p(f, \tilde{\theta}) \\ q_2(\theta) \propto \exp[Q(\theta, \tilde{\theta})] \;\longrightarrow\; \tilde{\theta} = \arg\max_\theta \{Q(\theta, \tilde{\theta})\} \end{cases}$$
    which can be compared with the Bayesian expectation maximization (BEM) algorithm. The E-step is the computation of the expectation $Q(\theta, \tilde{\theta})$ in (121), and the M-step is the maximization in (122). Here, the uncertainties of the f are used for the estimation of θ, but the uncertainties of θ are not used for the estimation of f.
  • $q_1(f) = \delta(f - \tilde{f})$ and q2(θ) is free form. In the same way, this time we obtain:
    $$\begin{cases} \langle \ln p(f, \theta) \rangle_{q_1(f)} = \ln p(f = \tilde{f}, \theta) \\ \langle \ln p(f, \theta) \rangle_{q_2(\theta)} = \langle \ln p(f, \theta) \rangle_{p(\theta|f = \tilde{f})} = Q(\tilde{f}, f) \end{cases}$$
    which leads to:
    $$\begin{cases} q_2(\theta) \propto \exp[\ln p(f = \tilde{f}, \theta)] \propto p(\theta|f = \tilde{f}) \\ q_1(f) \propto \exp[Q(\tilde{f}, f)] \;\longrightarrow\; \tilde{f} = \arg\max_f \{Q(\tilde{f}, f)\} \end{cases}$$
    which can be compared with the classical EM algorithm. Here, the uncertainties of θ are used for the estimation of f, but the uncertainties of the f are not used for the estimation of θ.
  • Both q1(f) and q2(θ) have free form. The main difficulty here is that, at each iteration, the expression of q1 and q2 may change. However, if p(f, θ) is in the generalized exponential family, the expressions of q1(f) and q2(θ) will also be in the same family, and we have only to update the parameters at each iteration.

17. VBA for a Linear Inverse Problem with Simple Gaussian Priors

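As a simple example, consider the Gaussian case where $p(g|f, \theta_1) = \mathcal{N}(g|Hf, (1/\theta_1) I)$, $p(f|\theta_2) = \mathcal{N}(f|0, (1/\theta_2) I)$, $p(\theta_1) = \mathcal{G}(\theta_1|\alpha_{10}, \beta_{10})$ and $p(\theta_2) = \mathcal{G}(\theta_2|\alpha_{20}, \beta_{20})$, and so, up to an additive constant, we have:
$$\ln p(f, \theta_1, \theta_2|g) = \frac{M}{2} \ln\theta_1 - \frac{\theta_1}{2} \|g - Hf\|_2^2 + \frac{N}{2} \ln\theta_2 - \frac{\theta_2}{2} \|f\|_2^2 + (\alpha_{10} - 1) \ln\theta_1 - \beta_{10}\theta_1 + (\alpha_{20} - 1) \ln\theta_2 - \beta_{20}\theta_2.$$
From this expression J(f, θ1, θ2) = ln p(f, θ1, θ2|g), it is easy to obtain the equations of an alternate JMAP algorithm by computing its derivatives with respect to its arguments and equating them to zero:
$$\begin{cases} \dfrac{\partial J}{\partial f} = 0 \;\longrightarrow\; f = (H'H + \lambda I)^{-1} H' g \quad \text{with } \lambda = \dfrac{\theta_2}{\theta_1} \\ \dfrac{\partial J}{\partial \theta_1} = 0 \;\longrightarrow\; \theta_1 = \dfrac{\tilde{\alpha}_1}{\tilde{\beta}_1} \quad \text{with } \tilde{\alpha}_1 = (\alpha_{10} - 1) + \dfrac{M}{2} \text{ and } \tilde{\beta}_1 = \beta_{10} + \dfrac{1}{2} \|g - Hf\|_2^2 \\ \dfrac{\partial J}{\partial \theta_2} = 0 \;\longrightarrow\; \theta_2 = \dfrac{\tilde{\alpha}_2}{\tilde{\beta}_2} \quad \text{with } \tilde{\alpha}_2 = (\alpha_{20} - 1) + \dfrac{N}{2} \text{ and } \tilde{\beta}_2 = \beta_{20} + \dfrac{1}{2} \|f\|_2^2 \end{cases}$$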
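The alternate JMAP updates can be sketched as a coordinate-wise maximization of J(f, θ1, θ2). The code below assumes NumPy, Gamma priors of the form θ^(α−1) exp[−βθ], the stationarity conditions f = (H′H + λI)⁻¹H′g, θ1 = ((α10 − 1) + M/2)/(β10 + ‖g − Hf‖²/2) and θ2 = ((α20 − 1) + N/2)/(β20 + ‖f‖²/2), and a hypothetical synthetic problem; the default hyperparameter values are illustrative only.

```python
import numpy as np

def log_joint(H, g, f, t1, t2, a10, b10, a20, b20):
    """J(f, theta1, theta2) = ln p(f, theta1, theta2 | g), up to a constant."""
    M, N = H.shape
    return (0.5 * M * np.log(t1) - 0.5 * t1 * np.sum((g - H @ f) ** 2)
            + 0.5 * N * np.log(t2) - 0.5 * t2 * np.sum(f ** 2)
            + (a10 - 1.0) * np.log(t1) - b10 * t1
            + (a20 - 1.0) * np.log(t2) - b20 * t2)

def jmap(H, g, a10=2.0, b10=1.0, a20=2.0, b20=1.0, n_iter=30):
    """Alternate maximization of J with respect to f, theta1 and theta2."""
    M, N = H.shape
    t1, t2 = 1.0, 1.0
    f = np.zeros(N)
    history = []
    for _ in range(n_iter):
        lam = t2 / t1
        f = np.linalg.solve(H.T @ H + lam * np.eye(N), H.T @ g)
        t1 = ((a10 - 1.0) + 0.5 * M) / (b10 + 0.5 * np.sum((g - H @ f) ** 2))
        t2 = ((a20 - 1.0) + 0.5 * N) / (b20 + 0.5 * np.sum(f ** 2))
        history.append(log_joint(H, g, f, t1, t2, a10, b10, a20, b20))
    return f, t1, t2, history

# Hypothetical synthetic problem.
rng = np.random.default_rng(0)
H = rng.standard_normal((20, 10))
f_true = rng.standard_normal(10)
g = H @ f_true + 0.1 * rng.standard_normal(20)

f_hat, theta1, theta2, history = jmap(H, g)
print(theta1, theta2, history[-1])
```

Because each update maximizes J exactly with respect to one block of variables, the recorded values of J should be non-decreasing over the iterations, which gives a simple convergence check.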
From the expression of the joint probability law p(f, θ1, θ2|g), we can also obtain the expressions of the conditionals:
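$$\begin{cases} p(f|g, \theta_1, \theta_2) = \mathcal{N}(f|\tilde{f}, \tilde{V}) \quad \text{with } \tilde{V} = (H'H + \lambda I)^{-1},\ \tilde{f} = \tilde{V} H' g,\ \lambda = \dfrac{\theta_2}{\theta_1} \\ p(\theta_1|g, f, \theta_2) = \mathcal{G}(\theta_1|\tilde{\alpha}_1, \tilde{\beta}_1) \quad \text{with } \tilde{\alpha}_1 = (\alpha_{10} - 1) + \dfrac{M}{2},\ \tilde{\beta}_1 = \beta_{10} + \dfrac{1}{2} \|g - Hf\|_2^2 \\ p(\theta_2|g, f, \theta_1) = \mathcal{G}(\theta_2|\tilde{\alpha}_2, \tilde{\beta}_2) \quad \text{with } \tilde{\alpha}_2 = (\alpha_{20} - 1) + \dfrac{N}{2},\ \tilde{\beta}_2 = \beta_{20} + \dfrac{1}{2} \|f\|_2^2 \end{cases}$$
However, obtaining analytical expressions of the marginals p(f|g), p(θ1|g) and p(θ2|g) is not easy. We can then obtain approximate expressions q1(f|g), q2(θ1|g) and q3(θ2|g) using the VBA method. For this case, thanks to the conjugacy property, we have:
$$\begin{cases} q(f) = \mathcal{N}(f|\tilde{f}, \tilde{V}) \quad \text{with } \tilde{V} = (H'H + \tilde{\lambda} I)^{-1},\ \tilde{f} = \tilde{V} H' g,\ \tilde{\lambda} = \dfrac{\langle \theta_2 \rangle}{\langle \theta_1 \rangle} \\ q(\theta_1) = \mathcal{G}(\theta_1|\tilde{\alpha}_1, \tilde{\beta}_1) \quad \text{with } \tilde{\alpha}_1 = (\alpha_{10} - 1) + \dfrac{M}{2},\ \tilde{\beta}_1 = \beta_{10} + \dfrac{1}{2} \langle \|g - Hf\|_2^2 \rangle \\ q(\theta_2) = \mathcal{G}(\theta_2|\tilde{\alpha}_2, \tilde{\beta}_2) \quad \text{with } \tilde{\alpha}_2 = (\alpha_{20} - 1) + \dfrac{N}{2},\ \tilde{\beta}_2 = \beta_{20} + \dfrac{1}{2} \langle \|f\|_2^2 \rangle \end{cases}$$
We can then compare the three algorithms in Table 2:
It is important to remark that, in JMAP, the computation of f can be done via the optimization of the criterion J(f, θ1, θ2) = ln p(f, θ1, θ2|g), which does not need explicitly the matrix inversion $\tilde{V} = (H'H + \lambda I)^{-1}$. However, in BEM and VBA, we need to compute it due to the following requirements:
$$\begin{aligned} \langle f \rangle_q &= \tilde{f}, \\ \langle \|f\|^2 \rangle_q &= \mathrm{tr}(\langle f f' \rangle_q) = \mathrm{tr}(\tilde{f}\tilde{f}' + \tilde{V}) = \|\tilde{f}\|^2 + \mathrm{tr}(\tilde{V}), \\ \langle f_j^2 \rangle_q &= [\tilde{V}]_{jj} + \tilde{f}_j^2, \\ \langle \|g - Hf\|^2 \rangle_q &= g'g - 2 \langle f \rangle_q' H' g + \mathrm{tr}(H \langle f f' \rangle_q H') = g'g - 2 \tilde{f}' H' g + \mathrm{tr}(H (\tilde{V} + \tilde{f}\tilde{f}') H') = \|g - H\tilde{f}\|^2 + \mathrm{tr}(H \tilde{V} H') \end{aligned}$$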
For some extensions and more details, see [111].

18. Bayesian Variational Approximation with Hierarchical Prior Models

For a linear inverse problem:
$$\mathcal{M}: \quad g = Hf + \epsilon$$
with an assigned likelihood p(g|f, θ1), when a hierarchical prior model p(f|z, θ2) p(z|θ3) is used and when the estimation of the hyper-parameters θ = [θ1, θ2, θ3]′ has to be considered, the joint posterior law of all the unknowns becomes:
$$p(f, z, \theta|g) = \frac{p(f, z, \theta, g)}{p(g)} = \frac{p(g|f, \theta_1)\, p(f|z, \theta_2)\, p(z|\theta_3)\, p(\theta)}{p(g)}$$
The main idea behind the VBA is to approximate this joint posterior by a separable one, for example: q(f, z, θ|g) = q1(f) q2(z) q3(θ) and where the expressions of q(f, z, θ|g) are obtained by minimizing the Kullback–Leibler divergence (99), as explained in the previous section. This approach can also be used for model selection based on the evidence of the model ln p(g) [121] where:
$$p(g) = \iiint p(f, z, \theta, g)\, df\, dz\, d\theta.$$
Interestingly, it is easy to show that:
$$\ln p(g) = \mathrm{KL}[q : p] + \mathcal{F}(q)$$
where $\mathcal{F}(q)$ is the free energy associated with q defined as:
$$\mathcal{F}(q) = \left\langle \ln\frac{p(f, z, \theta, g)}{q(f, z, \theta)} \right\rangle_q$$
Therefore, for a given model $\mathcal{M}$, minimizing KL [q : p] is equivalent to maximizing $\mathcal{F}(q)$ and when optimized, $\mathcal{F}(q^*)$ gives a lower bound for ln p(g). Indeed, the name variational approximation is due to the fact that $\ln p(g) \ge \mathcal{F}(q)$, and so, $\mathcal{F}(q)$ is a lower bound to the evidence ln p(g).
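The decomposition of the log-evidence can be verified on a toy discrete model. The following minimal sketch replaces the unknowns (f, z, θ) by a single discrete variable with three states and uses hypothetical values for the joint probabilities at the observed data; it checks that ln p(g) equals the KL divergence between q and the posterior plus the free energy of q.

```python
import math

# Hypothetical joint probabilities p(x, g_obs) for a discrete unknown x in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                       # p(g_obs)
posterior = [p / evidence for p in joint]   # p(x | g_obs)

def free_energy(q):
    """F(q) = < ln p(x, g) / q(x) >_q."""
    return sum(qi * math.log(pj / qi) for qi, pj in zip(q, joint) if qi > 0)

def kl(q, p):
    """KL[q : p] = < ln q / p >_q."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.3, 0.2]   # an arbitrary approximating distribution

log_evidence = math.log(evidence)
print(log_evidence, kl(q, posterior) + free_energy(q))
print(free_energy(q), free_energy(posterior))
```

The free energy of any q stays below ln p(g), and the bound is attained when q equals the posterior.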
Without any other constraint than the normalization of q, an alternate optimization of ( q ) with respect to q1, q2 and q3 results in:
{ q 1 ( f ) exp [ ln p ( f , z , θ , g ) q ( z ) q ( θ ) ] , q 2 ( z ) exp [ ln p ( f , z , θ , g ) q ( f ) q ( θ ) ] , q 2 ( θ ) exp [ ln p ( f , z , θ , g ) q ( f ) q ( z ) ] ,
Note that these relations represent an implicit solution for q1(f), q2(z) and q3(θ), which need, at each iteration, the expression of the expectations in the right hand of exponentials. If p(g|f, z, θ1) is a member of an exponential family and if all of the priors p(f|z, θ2), p(z|θ3), p(θ1), p(θ2) and p(θ3) are conjugate priors, then it is easy to see that these expressions lead to standard distributions for which the required expectations are easily evaluated. In that case, we may note:
q ( f , z , θ ) = q 1 ( f | z ˜ , θ ˜ ) q 2 ( z | f ˜ , θ ˜ ) q 3 ( θ | f ˜ , z ˜ )
where the tilded quantities z ˜, f ˜ and θ ˜are, respectively, functions of ( f ˜, θ ˜), ( f ˜, z ˜) and ( f ˜, z ˜). This means that the expression of q 1 ( f | z ˜ , θ ˜ ) depends on ( f ˜, z ˜), the expression of q 2 ( z | f ˜ , θ ˜ ) depends on ( z ˜, θ ˜) and the expression of q 3 ( θ | f ˜ , z ˜ ) depends on ( f ˜, z ˜). With this notation, the alternate optimization results in alternate updating of the parameters ( z ˜, θ ˜) of q1, the parameters ( f ˜, θ ˜) of q2 and the parameters ( f ˜, z ˜) of q3. Finally, we may note that, to monitor the convergence of the algorithm, we may evaluate the free energy:
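$$\begin{aligned}
\mathcal{F}(q) &= \left\langle \ln p(f, z, \theta, g) \right\rangle_q - \left\langle \ln q(f, z, \theta) \right\rangle_q \\
&= \left\langle \ln p(g \mid f, z, \theta) \right\rangle_q + \left\langle \ln p(f \mid z, \theta) \right\rangle_q + \left\langle \ln p(z \mid \theta) \right\rangle_q + \left\langle \ln p(\theta) \right\rangle_q \\
&\quad - \left\langle \ln q_1(f) \right\rangle_q - \left\langle \ln q_2(z) \right\rangle_q - \left\langle \ln q_3(\theta) \right\rangle_q .
\end{aligned}$$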
Other decompositions for $q(f, z, \theta)$ are also possible. For example: $q(f, z, \theta) = q_1(f \mid z) \, q_2(z) \, q_3(\theta)$ or even: $q(f, z, \theta) = \prod_j q_{1j}(f_j) \prod_j q_{2j}(z_{fj}) \prod_l q_{3l}(\theta_l)$. Here, we consider the first case and give some more details on it.
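The alternate optimization of a separable q can be illustrated on a toy target for which every step is available in closed form. The sketch below assumes a two-dimensional Gaussian target with mean `mu` and precision matrix `Lam` (both hypothetical, not taken from this paper) and a factorized approximation q(x1) q(x2). Taking the expectation of the log-density with respect to one factor gives a Gaussian for the other factor, whose mean is updated in turn.

```python
import numpy as np

def mean_field_gaussian(mu, Lam, n_iter=50):
    """Alternate updates of q1(x1) and q2(x2) for a 2D Gaussian target.

    Each factor is Gaussian with precision Lam[k, k]; its mean depends
    on the current mean of the other factor, hence the alternation.
    """
    m = np.zeros(2)                       # initial means of q1 and q2 (assumption)
    for _ in range(n_iter):
        # q1(x1) is proportional to exp[<ln p(x1, x2)>_{q2}]
        m[0] = mu[0] - Lam[0, 1] * (m[1] - mu[1]) / Lam[0, 0]
        # q2(x2) is proportional to exp[<ln p(x1, x2)>_{q1}]
        m[1] = mu[1] - Lam[1, 0] * (m[0] - mu[0]) / Lam[1, 1]
    variances = 1.0 / np.diag(Lam)        # variances of the two factors
    return m, variances

mu = np.array([1.0, -2.0])                       # hypothetical target mean
Lam = np.array([[2.0, 0.8], [0.8, 1.0]])         # hypothetical target precision
m, variances = mean_field_gaussian(mu, Lam)
true_marginal_variances = np.diag(np.linalg.inv(Lam))
```

On this example the factor means converge to the true mean, while the factor variances are smaller than the true marginal variances, a well-known property of separable approximations obtained by minimizing KL[q : p].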

19. Bayesian Variational Approximation with Student t Priors

The Student t model is:
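$$p(f \mid \nu) = \prod_j \mathcal{S}t(f_j \mid \nu) \quad \text{with} \quad \mathcal{S}t(f_j \mid \nu) = \frac{1}{\sqrt{\pi \nu}} \, \frac{\Gamma((\nu + 1)/2)}{\Gamma(\nu/2)} \left( 1 + f_j^2 / \nu \right)^{-(\nu + 1)/2}$$
The Cauchy model is obtained when ν = 1. Knowing that:
$$\mathcal{S}t(f_j \mid \nu) = \int_0^\infty \mathcal{N}(f_j \mid 0, 1/z_{fj}) \, \mathcal{G}(z_{fj} \mid \nu/2, \nu/2) \, \mathrm{d}z_{fj}$$
we can write this model via the positive hidden variables $z_{fj}$:
$$\begin{cases}
p(f_j \mid z_{fj}) = \mathcal{N}(f_j \mid 0, 1/z_{fj}) \propto \exp\left[ -\tfrac{1}{2} z_{fj} f_j^2 \right] \\
p(z_{fj} \mid \alpha, \beta) = \mathcal{G}(z_{fj} \mid \alpha, \beta) \propto z_{fj}^{(\alpha - 1)} \exp\left[ -\beta z_{fj} \right]
\end{cases}$$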
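The scale-mixture representation of the Student t law can be verified numerically. The sketch below assumes the rate parametrization of the gamma law, G(z | α, β) ∝ z^(α−1) exp[−βz], and integrates over the hidden precision with a simple trapezoidal rule; the grid size and the upper integration limit are arbitrary choices of this illustration.

```python
import math
import numpy as np

def student_t_pdf(f, nu):
    """Student t density St(f | nu) in closed form."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(math.pi * nu))
    return math.exp(log_c) * (1 + f**2 / nu) ** (-(nu + 1) / 2)

def scale_mixture_pdf(f, nu, n_grid=200_000, z_max=60.0):
    """Integrate N(f | 0, 1/z) G(z | nu/2, nu/2) over z > 0 numerically."""
    z = np.linspace(1e-8, z_max, n_grid)
    a = b = nu / 2
    # log of the gamma density of the hidden precision z
    log_gamma = a * math.log(b) - math.lgamma(a) + (a - 1) * np.log(z) - b * z
    # log of the Gaussian density of f with variance 1/z
    log_normal = 0.5 * np.log(z / (2 * math.pi)) - 0.5 * z * f**2
    integrand = np.exp(log_gamma + log_normal)
    # trapezoidal rule written out explicitly
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(z)))

cauchy_at_zero = student_t_pdf(0.0, 1)   # Cauchy case, nu = 1
```

For example, `scale_mixture_pdf(0.5, 3)` and `student_t_pdf(0.5, 3)` should agree up to the quadrature error.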
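Now, let us consider the forward model $g = Hf + \epsilon$ and assign a Gaussian law with unknown variance $\upsilon_i$ to the noise $\epsilon_i$, which results in $p(\epsilon) = \mathcal{N}(\epsilon \mid 0, V)$ with $V = \mathrm{diag}[\upsilon]$ and $\upsilon = [\upsilon_1, \cdots, \upsilon_M]$, and so:
$$p(g \mid f, \upsilon) = \mathcal{N}(g \mid Hf, V) \propto \exp\left[ -\tfrac{1}{2} (g - Hf)' V^{-1} (g - Hf) \right] .$$
Let us also denote $z_i = 1/\upsilon_i$, $z = [z_1, \cdots, z_M]$ and $Z = \mathrm{diag}[z] = V^{-1}$, and assign an inverse gamma prior $p(\upsilon_i \mid \alpha_0, \beta_0) = \mathcal{IG}(\upsilon_i \mid \alpha_0, \beta_0)$ to each variance or, equivalently, a gamma prior to each precision:
$$p(z_i \mid \alpha_0, \beta_0) = \mathcal{G}(z_i \mid \alpha_0, \beta_0) \quad \text{and} \quad p(z \mid \alpha_0, \beta_0) = \prod_i \mathcal{G}(z_i \mid \alpha_0, \beta_0) .$$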
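Let us also denote $\upsilon_f = [\upsilon_{f1}, \cdots, \upsilon_{fN}]$, $V_f = \mathrm{diag}[\upsilon_f]$, $z_{fj} = 1/\upsilon_{fj}$, $Z_f = \mathrm{diag}[z_f] = V_f^{-1}$ and write:
$$p(f \mid \upsilon_f) = \prod_j \mathcal{N}(f_j \mid 0, \upsilon_{fj}) = \mathcal{N}(f \mid 0, V_f)$$
and finally, with an inverse gamma law on each variance $\upsilon_{fj}$ (that is, a gamma law on each precision $z_{fj}$):
$$p(\upsilon_f \mid \alpha_{f0}, \beta_{f0}) = \prod_j \mathcal{IG}(\upsilon_{fj} \mid \alpha_{f0}, \beta_{f0}) .$$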
Then, we obtain the following expressions for the VBA:
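$$\begin{cases}
q_1(f \mid \tilde{\mu}, \tilde{V}) = \mathcal{N}(f \mid \tilde{\mu}, \tilde{V}) & \text{with } \tilde{\mu} = \tilde{V} H' \tilde{Z} g, \quad \tilde{V} = (H' \tilde{Z} H + \tilde{Z}_f)^{-1}; \\
q_{2j}(z_{fj}) = \mathcal{G}(z_{fj} \mid \tilde{\alpha}_j, \tilde{\beta}_j) & \text{with } \tilde{\alpha}_j = \alpha_{f0} + 1/2, \quad \tilde{\beta}_j = \beta_{f0} + \langle f_j^2 \rangle / 2; \\
q_3(z_i) = \mathcal{G}(z_i \mid \tilde{\alpha}_i, \tilde{\beta}_i) & \text{with } \tilde{\alpha}_i = \alpha_0 + 1/2, \quad \tilde{\beta}_i = \beta_0 + \tfrac{1}{2} \langle |g_i - [Hf]_i|^2 \rangle;
\end{cases}$$
where $\tilde{Z} = \mathrm{diag}[\tilde{\alpha}_i / \tilde{\beta}_i]$ and $\tilde{Z}_f = \mathrm{diag}[\tilde{\alpha}_j / \tilde{\beta}_j]$ are the expected noise and signal precisions under $q_3$ and $q_{2j}$, and:
$$\langle |g_i - [Hf]_i|^2 \rangle = |g_i - [H \langle f \rangle]_i|^2 + [H \tilde{V} H']_{ii}, \quad \langle f \rangle = \tilde{\mu}, \quad \langle f f' \rangle = \tilde{V} + \tilde{\mu} \tilde{\mu}', \quad \langle f_j^2 \rangle = [\tilde{V}]_{jj} + \tilde{\mu}_j^2 .$$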
We have implemented these algorithms for many linear inverse problems [102], such as periodic components estimation in time series [122] or computed tomography [123], blind deconvolution [124], blind image separation [125,126] and blind image restoration [89].
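The alternate updates of this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the cited applications: the data are weighted by the expected noise precisions ⟨z_i⟩, each gamma shape parameter is incremented by 1/2 (one Gaussian term per precision), the expected precisions are initialized to one, and the problem sizes and hyperparameter values are hypothetical.

```python
import numpy as np

def vba_student_t(g, H, alpha0, beta0, alpha_f0, beta_f0, n_iter=100):
    """Alternate the q1, q2j and q3 updates for g = H f + noise.

    Returns the mean and covariance of q1(f) and the expected
    noise and signal precisions <z_i> and <z_fj>.
    """
    M, N = H.shape
    z = np.ones(M)      # expected noise precisions, initial guess (assumption)
    z_f = np.ones(N)    # expected signal precisions, initial guess (assumption)
    for _ in range(n_iter):
        # q1(f) = N(mu, V) with V = (H' Z H + Z_f)^-1 and mu = V H' Z g
        V = np.linalg.inv(H.T @ (z[:, None] * H) + np.diag(z_f))
        mu = V @ (H.T @ (z * g))
        # q2j(z_fj): gamma law, using <f_j^2> = V_jj + mu_j^2
        f2 = np.diag(V) + mu**2
        z_f = (alpha_f0 + 0.5) / (beta_f0 + 0.5 * f2)
        # q3(z_i): gamma law, using
        # <|g_i - [Hf]_i|^2> = |g_i - [H mu]_i|^2 + [H V H']_ii
        r2 = (g - H @ mu) ** 2 + np.sum((H @ V) * H, axis=1)
        z = (alpha0 + 0.5) / (beta0 + 0.5 * r2)
    return mu, V, z, z_f

# Synthetic test problem (all sizes and values are hypothetical).
rng = np.random.default_rng(1)
M, N = 60, 20
H = rng.standard_normal((M, N))
f_true = np.zeros(N)
f_true[[2, 7, 15]] = [2.0, -1.5, 3.0]
g = H @ f_true + 0.05 * rng.standard_normal(M)

mu, V, z, z_f = vba_student_t(g, H, alpha0=1.0, beta0=1.0,
                              alpha_f0=1e-3, beta_f0=1e-3)
rel_err = np.linalg.norm(mu - f_true) / np.linalg.norm(f_true)
```

The explicit matrix inverse keeps the sketch close to the formulas. For large N one would rather solve the linear system iteratively and approximate the diagonal terms, which is a design choice outside the scope of this illustration.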

20. Conclusions

The main conclusions of this paper can be summarized as follows:
  • A probability law is a tool for representing our state of knowledge about a quantity.
  • The Bayes or Laplace rule is an inference tool for updating our state of knowledge about an inaccessible quantity when another accessible, related quantity is observed.
  • Entropy is a measure of information content in a variable with a given probability law.
  • The maximum entropy principle can be used to assign a probability law to a quantity when the available information about it is in the form of a limited number of constraints on that probability law.
  • Relative entropy and Kullback–Leibler divergence are tools for updating probability laws in the same context.
  • When a parametric probability law is assigned to a quantity and we want to measure the amount of information gained about the parameters when some direct observations of that quantity are available, we can use the Fisher information. The structure of the Fisher information geometry in the space of parameters is derived from the relative entropy by a second-order Taylor series approximation.
  • All of these rules and tools are currently used in different ways in data and signal processing. In this paper, a few examples of the ways these tools are used in data and signal processing problems are presented. One main conclusion is that each of these tools has to be used in its appropriate context. The example in spectral estimation shows that it is very important to define the problem very clearly at the beginning, to use the appropriate tools and to interpret the results appropriately.
  • The Laplacian or Bayesian inference is the appropriate tool for proposing satisfactory solutions to inverse problems. Indeed, the expression of the posterior probability law represents the combination of the state of the knowledge in the forward model and the data and the state of the knowledge before using the data.
  • The Bayesian approach can also easily be used to propose unsupervised methods for the practical application of these methods.
  • One of the main limitations of these sophisticated methods is their computational cost. To address it, we proposed to use VBA as an alternative to MCMC methods in order to obtain realistic algorithms for very high dimensional inverse problems, where we want to estimate an unknown signal (1D), image (2D), volume (3D) or even more (3D + time or 3D + wavelength), etc.

Acknowledgments

The author would like to thank the reviewers, whose careful review work and extensive comments and remarks helped to improve this review paper greatly.
  • This paper is an extended version of the paper published in Proceedings of the 34th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), Amboise, France, 21–26 September 2014.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Mohammad-Djafari, A. Bayesian or Laplacian inference, entropy and information theory and information geometry in data and signal processing. AIP Conf. Proc. 2014, 1641, 43–58.
  2. Bayes, T. An Essay toward Solving a Problem in the Doctrine of Chances. Philos. Trans. 1763, 53, 370–418, By the late Rev. Mr. Bayes communicated by Mr. Price, in a Letter to John Canton.
  3. De Laplace, P. S. Mémoire sur la probabilité des causes par les évènements. Mémoires de l’Academie Royale des Sciences Presentés par Divers Savan 1774, 6, 621–656.
  4. Shannon, C. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
  5. Hadamard, J. Mémoire sur le problème d’analyse relatif à l’équilibre des plaques élastiques encastrées; Mémoires présentés par divers savants à l’Académie des sciences de l’Institut de France; Imprimerie nationale, 1908.
  6. Jaynes, E.T. Information Theory and Statistical Mechanics. Phys. Rev. 1957, 106, 620–630.
  7. Jaynes, E.T. Information Theory and Statistical Mechanics II. Phys. Rev. 1957, 108, 171–190.
  8. Jaynes, E.T. Prior Probabilities. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 227–241.
  9. Kullback, S. Information Theory and Statistics; Wiley: New York, NY, USA, 1959.
  10. Fisher, R. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. A 1922, 222, 309–368.
  11. Rao, C. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
  12. Sindhwani, V.; Belkin, M.; Niyogi, P. The Geometric basis for Semi-supervised Learning. In Semi-supervised Learning; Chapelle, O., Schölkopf, B., Zien, A., Eds.; MIT press: Cambridge, MA, USA, 2006; pp. 209–226.
  13. Lin, J. Divergence Measures Based on the Shannon Entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151.
  14. Johnson, O.; Barron, A.R. Fisher Information Inequalities and the Central Limit Theorem. Probab. Theory Relat. Fields 2004, 129, 391–409.
  15. Berger, J. Statistical Decision Theory and Bayesian Analysis, 2nd ed; Springer-Verlag: New York, NY, USA, 1985.
  16. Gelman, A.; Carlin, J.B.; Stern, H.S.; Rubin, D.B. Bayesian Data Analysis, 2nd ed; Chapman & Hall/CRC Texts in Statistical Science; Chapman and Hall/CRC: Boca Raton, FL, USA, 2003.
  17. Skilling, J. Nested Sampling. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering; Proceedings of 24th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Garching, Germany, 25–30 July 2004, Fischer, R., Preuss, R., Toussaint, U.V., Eds.; pp. 395–405.
  18. Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 1953, 21, 1087–1092.
  19. Hastings, W.K. Monte Carlo Sampling Methods using Markov Chains and their Applications. Biometrika 1970, 57, 97–109.
  20. Gelfand, A.E.; Smith, A.F.M. Sampling-Based Approaches to Calculating Marginal Densities. J. Am. Stat. Assoc. 1990, 85, 398–409.
  21. Gilks, W.R.; Richardson, S.; Spiegelhalter, D.J. Introducing Markov Chain Monte Carlo. In Markov Chain Monte Carlo in Practice; Gilks, W.R., Richardson, S., Spiegelhalter, D.J., Eds.; Chapman and Hall: London, UK, 1996; pp. 1–19.
  22. Gilks, W.R. Strategies for Improving MCMC. In Markov Chain Monte Carlo in Practice; Gilks, W.R., Richardson, S., Spiegelhalter, D.J., Eds.; Chapman and Hall: London, UK, 1996; pp. 89–114.
  23. Roberts, G.O. Markov Chain Concepts Related to Sampling Algorithms. In Markov Chain Monte Carlo in Practice; Gilks, W.R., Richardson, S., Spiegelhalter, D.J., Eds.; Chapman and Hall: London, UK, 1996; pp. 45–57.
  24. Tanner, M.A. Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions; Springer series in Statistics; Springer: New York, NY, USA, 1996.
  25. Djurić, P.M.; Godsill, S.J. (Eds.) Special Issue on Monte Carlo Methods for Statistical Signal Processing; IEEE: New York, NY, USA, 2002.
  26. Andrieu, C.; de Freitas, N.; Doucet, A.; Jordan, M.I. An Introduction to MCMC for Machine Learning. Mach. Learn. 2003, 50, 5–43.
  27. Clausius, R. On the Motive Power of Heat, and on the Laws Which Can be Deduced From it for the Theory of Heat; Poggendorff’s Annalen der Physik, LXXIX, Dover Reprint: New York, NY, USA, 1850; ISBN 0-486-59065-8.
  28. Caticha, A. Maximum Entropy, fluctuations and priors.
  29. Giffin, A.; Caticha, A. Updating Probabilities with Data and Moments.
  30. Caticha, A.; Preuss, R. Maximum Entropy and Bayesian Data Analysis: Entropic Priors Distributions. Phys. Rev. E 2004, 70, 046127.
  31. Akaike, H. On Entropy Maximization Principle. In Applications of Statistics; Krishnaiah, P.R., Ed.; North-Holland: Amsterdam, The Netherlands, 1977; pp. 27–41.
  32. Agmon, N.; Alhassid, Y.; Levine, D. An Algorithm for Finding the Distribution of Maximal Entropy. J. Comput. Phys. 1979, 30, 250–258.
  33. Jaynes, E.T. Where do we go from here? In Maximum-Entropy and Bayesian Methods in Inverse Problems; Smith, C.R., Grandy, W.T., Jr, Eds.; Springer: Dordrecht, The Netherlands, 1985; pp. 21–58.
  34. Borwein, J.M.; Lewis, A.S. Duality relationships for entropy-like minimization problems. SIAM J. Control Optim. 1991, 29, 325–338.
  35. Elfving, T. On some Methods for Entropy Maximization and Matrix Scaling. Linear Algebra Appl. 1980, 34, 321–339.
  36. Eriksson, J. A note on Solution of Large Sparse Maximum Entropy Problems with Linear Equality Constraints. Math. Program. 1980, 18, 146–154.
  37. Erlander, S. Entropy in linear programs. Math. Program. 1981, 21, 137–151.
  38. Jaynes, E.T. On the Rationale of Maximum-Entropy Methods. Proc. IEEE 1982, 70, 939–952.
  39. Shore, J.E.; Johnson, R.W. Properties of Cross-Entropy Minimization. IEEE Trans. Inf. Theory 1981, 27, 472–482.
  40. Mohammad-Djafari, A. Maximum d’entropie et problèmes inverses en imagerie. Traitement Signal 1994, 11, 87–116.
  41. Bercher, J. Développement de critères de nature entropique pour la résolution des problèmes inverses linéaires. Ph.D. Thesis, Université de Paris–Sud, Orsay, France, 1995.
  42. Le Besnerais, G. Méthode du maximum d’entropie sur la moyenne, critère de reconstruction d’image et synthèse d’ouverture en radio astronomie. Ph.D. Thesis, Université de Paris-Sud, Orsay, France, 1993.
  43. Caticha, A.; Giffin, A. Updating Probabilities.
  44. Caticha, A. Entropic Inference.
  45. Costa, S.I.R.; Santos, S.A.; Strapasson, J.E. Fisher information distance: A geometrical reading. 2012. arXiv:1210.2354.
  46. Rissanen, J. Fisher Information and Stochastic Complexity. IEEE Trans. Inf. Theory 1996, 42, 40–47.
  47. Shimizu, R. On Fisher’s amount of information for location family. In A Modern Course on Statistical Distributions in Scientific Work; D. Reidel Dordrecht: The Netherlands, 1975; Volume 3, pp. 305–312.
  48. Nielsen, F.; Nock, R. Sided and Symmetrized Bregman Centroids. IEEE Trans. Inf. Theory 2009, 55, 2048–2059.
  49. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006.
  50. Schroeder, M.R. Linear prediction, entropy and signal analysis. IEEE ASSP Mag. 1984, 1, 3–11.
  51. Itakura, F.; Saito, S. A Statistical Method for Estimation of Speech Spectral Density and Formant Frequencies. Electron. Commun. Jpn. 1970, 53-A, 36–43.
  52. Kitagawa, G.; Gersch, W. Smoothness Priors Analysis of Time Series; Lecture Notes in Statistics; Volume 116, Springer: New York, NY, USA, 1996.
  53. Rue, H.; Held, L. Gaussian Markov Random Fields: Theory and Applications; CRC Press: New York, NY, USA, 2005.
  54. Amari, S.; Cichocki, A.; Yang, H.H. A new learning algorithm for blind source separation. 757–763.
  55. Amari, S. Neural learning in structured parameter spaces—Natural Riemannian gradient. 127–133.
  56. Amari, S. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276.
  57. Knuth, K.H. Bayesian source separation and localization. SPIE Proc. 1998, 3459.
  58. Knuth, K.H. A Bayesian approach to source separation. 283–288.
  59. Attias, H. Independent Factor Analysis. Neural Comput. 1999, 11, 803–851.
  60. Mohammad-Djafari, A. A Bayesian approach to source separation. 221–244.
  61. Choudrey, R.A.; Roberts, S. Variational Bayesian Mixture of Independent Component Analysers for Finding Self-Similar Areas in Images. 107–112.
  62. Lopes, H.F.; West, M. Bayesian Model Assessment in Factor Analysis. Stat. Sin. 2004, 14, 41–67.
  63. Ichir, M.; Mohammad-Djafari, A. Bayesian Blind Source Separation of Positive Non Stationary Sources. 493–500.
  64. Mohammad-Djafari, A. Bayesian Source Separation: Beyond PCA and ICA.
  65. Comon, P.; Jutten, C. (Eds.) Handbook of Blind Source Separation: Independent Component Analysis and Applications; Academic Press: Burlington, MA, USA, 2010.
  66. Yuan, M.; Lin, Y. Model selection and estimation in the Gaussian graphical model. Biometrika 2007, 94, 19–35.
  67. Fitzgerald, W. Markov Chain Monte Carlo methods with Applications to Signal Processing. Signal Process. 2001, 81, 3–18.
  68. Matsuoka, T.; Ulrych, T. Information theory measures with application to model identification. IEEE Trans. Acoust. Speech Signal Process. 1986, 34, 511–517.
  69. Bretthorst, G.L. Bayesian Model Selection: Examples Relevant to NMR. In Maximum Entropy and Bayesian Methods; Springer: Dordrecht, The Netherlands, 1989; pp. 377–388.
  70. Gelfand, A.E.; Dey, D.K. Bayesian model choice: Asymptotics and exact calculations. J. R. Stat. Soc. Ser. B 1994, 56, 501–514.
  71. Mohammad-Djafari, A. Model selection for inverse problems: Best choice of basis function and model order selection.
  72. Clyde, M.A.; Berger, J.O.; Bullard, F.; Ford, E.B.; Jefferys, W.H.; Luo, R.; Paulo, R.; Loredo, T. Current Challenges in Bayesian Model Choice. 71, 224–240.
  73. Wyse, J.; Friel, N. Block clustering with collapsed latent block models. Stat. Comput. 2012, 22, 415–428.
  74. Giovannelli, J.F.; Giremus, A. Bayesian noise model selection and system identification based on approximation of the evidence. 125–128.
  75. Akaike, H. A new look at the statistical model identification. IEEE Trans. Automat. Control 1974, AC-19, 716–723.
  76. Akaike, H. Power spectrum estimation through autoregressive model fitting. Ann. Inst. Stat. Math. 1969, 21, 407–419.
  77. Farrier, D. Jaynes’ principle and maximum entropy spectral estimation. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 1176–1183.
  78. Wax, M. Detection and Estimation of Superimposed Signals. Ph.D. Thesis, Stanford University, CA, USA, March 1985.
  79. Burg, J.P. Maximum Entropy Spectral Analysis.
  80. McClellan, J.H. Multidimensional spectral estimation. Proc. IEEE 1982, 70, 1029–1039.
  81. Lang, S.; McClellan, J.H. Multidimensional MEM spectral estimation. IEEE Trans. Acoust. Speech Signal Process. 1982, 30, 880–887.
  82. Johnson, R.; Shore, J. Which is the Better Entropy Expression for Speech Processing: −S log S or log S? IEEE Trans. Acoust. Speech Signal Process. 1984, ASSP-32, 129–137.
  83. Wester, R.; Tummala, M.; Therrien, C. Multidimensional Autoregressive Spectral Estimation Using Iterative Methods. 1.
  84. Picinbono, B.; Barret, M. Nouvelle présentation de la méthode du maximum d’entropie. Traitement Signal 1990, 7, 153–158.
  85. Borwein, J.M.; Lewis, A.S. Convergence of best entropy estimates. SIAM J. Optim. 1991, 1, 191–205.
  86. Mohammad-Djafari, A. (Ed.) Inverse Problems in Vision and 3D Tomography; digital signal and image processing series; ISTE: London, UK; Wiley: Hoboken, NJ, USA, 2010.
  87. Mohammad-Djafari, A.; Demoment, G. Tomographie de diffraction et synthèse de Fourier à maximum d’entropie. Rev. Phys. Appl. (Paris) 1987, 22, 153–167.
  88. Féron, O.; Chama, Z.; Mohammad-Djafari, A. Reconstruction of piecewise homogeneous images from partial knowledge of their Fourier transform. 68–75.
  89. Ayasso, H.; Mohammad-Djafari, A. Joint NDT Image Restoration and Segmentation Using Gauss–Markov–Potts Prior Models and Variational Bayesian Computation. IEEE Trans. Image Process. 2010, 19, 2265–2277.
  90. Ayasso, H.; Duchêne, B.; Mohammad-Djafari, A. Bayesian inversion for optical diffraction tomography. J. Mod. Opt. 2010, 57, 765–776.
  91. Burch, S.; Gull, S.F.; Skilling, J. Image Restoration by a Powerful Maximum Entropy Method. Comput. Vis. Graph. Image Process. 1983, 23, 113–128.
  92. Gull, S.F.; Skilling, J. Maximum entropy method in image processing. IEE Proc. F 1984, 131, 646–659.
  93. Gull, S.F. Developments in maximum entropy data analysis. In Maximum Entropy and Bayesian Methods; Skilling, J., Ed.; Springer: Dordrecht, The Netherlands, 1989; pp. 53–71.
  94. Jones, L.K.; Byrne, C.L. General entropy criteria for inverse problems with application to data compression, pattern classification and cluster analysis. IEEE Trans. Inf. Theory 1990, 36, 23–30.
  95. Macaulay, V.A.; Buck, B. Linear inversion by the method of maximum entropy. Inverse Probl. 1989, 5.
  96. Rue, H.; Martino, S. Approximate Bayesian inference for hierarchical Gaussian Markov random field models. J. Stat. Plan. Inference 2007, 137, 3177–3192.
  97. Wilkinson, R. Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. 2009. arXiv:0811.3355.
  98. Rue, H.; Martino, S.; Chopin, N. Approximate Bayesian Inference for Latent Gaussian Models Using Integrated Nested Laplace Approximations. J. R. Stat. Soc. Ser. B 2009, 71, 319–392.
  99. Fearnhead, P.; Prangle, D. Constructing Summary Statistics for Approximate Bayesian Computation: Semi-automatic ABC. 2011. arXiv:1004.1112v2.
  100. Turner, B.M.; van Zandt, T. A tutorial on approximate Bayesian computation. J. Math. Psych. 2012, 56, 69–85.
  101. MacKay, D.J.C. A Practical Bayesian Framework for Backpropagation Networks. Neural Comput. 1992, 4, 448–472.
  102. Mohammad-Djafari, A. Variational Bayesian Approximation for Linear Inverse Problems with a hierarchical prior models. 8085, 669–676.
  103. Likas, C.L.; Galatsanos, N.P. A Variational Approach For Bayesian Blind Image Deconvolution. IEEE Trans. Signal Process. 2004, 52, 2222–2233.
  104. Beal, M.; Ghahramani, Z. Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Stat. 2006, 1, 793–832.
  105. Kim, H.; Ghahramani, Z. Bayesian Gaussian Process Classification with the EM-EP Algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1948–1959.
  106. Jordan, M.I.; Ghahramani, Z.; Jaakkola, T.S.; Saul, L.K. An introduction to variational methods for graphical models. Mach. Learn. 2006, 37, 183–233.
  107. Forbes, F.; Fort, G. Combining Monte Carlo and Mean-Field-Like Methods for Inference in Hidden Markov Random Fields. IEEE Trans. Image Process. 2007, 16, 824–837.
  108. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. (B) 1977, 39, 1–38.
  109. Miller, M.I.; Snyder, D.L. The Role of Likelihood and Entropy in Incomplete-Data Problems: Applications to Estimating Point-Process Intensities and Toeplitz Constrained Covariances. Proc. IEEE 1987, 75, 892–907.
  110. Snoussi, H.; Mohammad-Djafari, A. Information geometry of Prior Selection.
  111. Mohammad-Djafari, A. Approche variationnelle pour le calcul bayésien dans les problèmes inverses en imagerie. 2009. arXiv:0904.4148.
  112. Beal, M. Variational Algorithms for Approximate Bayesian Inference. Ph.D. Thesis, Gatsby Computational Neuroscience Unit, University College London, UK, 2003.
  113. Winn, J.; Bishop, C.M.; Jaakkola, T. Variational message passing. J. Mach. Learn. Res. 2005, 6, 661–694.
  114. Chatzis, S.; Varvarigou, T. Factor Analysis Latent Subspace Modeling and Robust Fuzzy Clustering Using t-Distributions. IEEE Trans. Fuzzy Syst. 2009, 17, 505–517.
  115. Park, T.; Casella, G. The Bayesian Lasso. J. Am. Stat. Assoc. 2008, 103, 681–686.
  116. Mohammad-Djafari, A. A variational Bayesian algorithm for inverse problem of computed tomography. In Mathematical Methods in Biomedical Imaging and Intensity-Modulated Radiation Therapy (IMRT); Censor, Y., Jiang, M., Louis, A.K., Eds.; Publications of the Scuola Normale Superiore/CRM Series; Edizioni della Normale: Rome, Italy, 2008; pp. 231–252.
  117. Mohammad-Djafari, A.; Ayasso, H. Variational Bayes and mean field approximations for Markov field unsupervised estimation. 1–6.
  118. Tipping, M.E. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 2001, 1, 211–244.
  119. He, L.; Chen, H.; Carin, L. Tree-Structured Compressive Sensing With Variational Bayesian Analysis. IEEE Signal Process. Lett. 2010, 17, 233–236.
  120. Fraysse, A.; Rodet, T. A gradient-like variational Bayesian algorithm. In Proceedings of 2011 IEEE Conference on Statistical Signal Processing Workshop (SSP), Nice, France, 28–30 June 2011; pp. 605–608.
  121. Johnson, V.E. On Numerical Aspects of Bayesian Model Selection in High and Ultrahigh-dimensional Settings. Bayesian Anal. 2013, 8, 741–758.
  122. Dumitru, M.; Mohammad-Djafari, A. Estimating the periodic components of a biomedical signal through inverse problem modeling and Bayesian inference with sparsity enforcing prior. AIP Conf. Proc. 2015, 1641, 548–555.
  123. Wang, L.; Gac, N.; Mohammad-Djafari, A. Bayesian 3D X-ray computed tomography image reconstruction with a scaled Gaussian mixture prior model. AIP Conf. Proc. 2015, 1641, 556–563.
  124. Mohammad-Djafari, A. Bayesian Blind Deconvolution of Images Comparing JMAP, EM and VBA with a Student-t a priori Model. 98–103.
  125. Su, F.; Mohammad-Djafari, A. An Hierarchical Markov Random Field Model for Bayesian Blind Image Separation.
  126. Su, F.; Cai, S.; Mohammad-Djafari, A. Bayesian blind separation of mixed text patterns. 1373–1378.
Table 1. Analytical solutions for different measures μ(u)
$\mu(u) \propto \exp\left[ -\tfrac{1}{2} \sum_j u_j^2 \right]$ | $\hat{f} = H' \lambda$ | $\hat{f} = H' (H H')^{-1} g$
$\mu(u) \propto \exp\left[ -\sum_j |u_j| \right]$ | $\hat{f} = \mathbf{1} ./ (H' \lambda \pm \mathbf{1})$ | $H \hat{f} = g$
$\mu(u) \propto \prod_j u_j^{\alpha - 1} \exp\left[ -\beta u_j \right], \; u_j > 0$ | $\hat{f} = \alpha \mathbf{1} ./ (H' \lambda + \beta \mathbf{1})$ | $H \hat{f} = g$
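Table 2. Comparison of three algorithms: JMAP, BEM and VBA
JMAP | BEM | VBA
$q(f) = \delta(f - \tilde{f})$ | $q(f) = \mathcal{N}(f \mid \tilde{f}, \tilde{V})$ | $q(f) = \mathcal{N}(f \mid \tilde{f}, \tilde{V})$
$\tilde{V} = (H'H + \tilde{\lambda} I)^{-1}$ | $\tilde{V} = (H'H + \tilde{\lambda} I)^{-1}$ | $\tilde{V} = (H'H + \tilde{\lambda} I)^{-1}$
$\tilde{f} = \tilde{V} H' g$ | $\tilde{f} = \tilde{V} H' g$ | $\tilde{f} = \tilde{V} H' g$
$q(\theta_1) = \delta(\theta_1 - \tilde{\theta}_1)$ | $q(\theta_1) = \delta(\theta_1 - \tilde{\theta}_1)$ | $q(\theta_1) = \mathcal{G}(\theta_1 \mid \tilde{\alpha}_1, \tilde{\beta}_1)$
$\tilde{\alpha}_1 = (\alpha_{10} - 1) + \frac{M}{2}$ | $\tilde{\alpha}_1 = (\alpha_{10} - 1) + \frac{M}{2}$ | $\tilde{\alpha}_1 = (\alpha_{10} - 1) + \frac{M}{2}$
$\tilde{\beta}_1 = \beta_{10} + \frac{1}{2} \lVert g - Hf \rVert_2^2$ | $\tilde{\beta}_1 = \beta_{10} + \frac{1}{2} \langle \lVert g - Hf \rVert_2^2 \rangle$ | $\tilde{\beta}_1 = \beta_{10} + \frac{1}{2} \langle \lVert g - Hf \rVert_2^2 \rangle$
$\tilde{\theta}_1 = \tilde{\alpha}_1 / \tilde{\beta}_1$ | $\tilde{\theta}_1 = \tilde{\alpha}_1 / \tilde{\beta}_1$ | $\tilde{\theta}_1 = \tilde{\alpha}_1 / \tilde{\beta}_1$
$q(\theta_2) = \delta(\theta_2 - \tilde{\theta}_2)$ | $q(\theta_2) = \delta(\theta_2 - \tilde{\theta}_2)$ | $q(\theta_2) = \mathcal{G}(\theta_2 \mid \tilde{\alpha}_2, \tilde{\beta}_2)$
$\tilde{\alpha}_2 = (\alpha_{20} - 1) + \frac{N}{2}$ | $\tilde{\alpha}_2 = (\alpha_{20} - 1) + \frac{N}{2}$ | $\tilde{\alpha}_2 = (\alpha_{20} - 1) + \frac{N}{2}$
$\tilde{\beta}_2 = \beta_{20} + \frac{1}{2} \lVert f \rVert_2^2$ | $\tilde{\beta}_2 = \beta_{20} + \frac{1}{2} \langle \lVert f \rVert_2^2 \rangle$ | $\tilde{\beta}_2 = \beta_{20} + \frac{1}{2} \langle \lVert f \rVert_2^2 \rangle$
$\tilde{\theta}_2 = \tilde{\alpha}_2 / \tilde{\beta}_2$ | $\tilde{\theta}_2 = \tilde{\alpha}_2 / \tilde{\beta}_2$ | $\tilde{\theta}_2 = \tilde{\alpha}_2 / \tilde{\beta}_2$
$\tilde{\lambda} = \tilde{\theta}_2 / \tilde{\theta}_1$ | $\tilde{\lambda} = \tilde{\theta}_2 / \tilde{\theta}_1$ | $\tilde{\lambda} = \tilde{\theta}_2 / \tilde{\theta}_1$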

MDPI and ACS Style

Mohammad-Djafari, A. Entropy, Information Theory, Information Geometry and Bayesian Inference in Data, Signal and Image Processing and Inverse Problems. Entropy 2015, 17, 3989-4027. https://doi.org/10.3390/e17063989
