Continuous Entropy
Charles Marsh
Department of Computer Science
Princeton University
crmarsh@princeton.edu
December 13, 2013
Abstract
Classically, Shannon entropy was formalized over discrete probability distributions. However, the concept of entropy can be extended to continuous distributions through a quantity known as continuous (or differential) entropy. The most common definition of continuous entropy is seemingly straightforward; however, further analysis reveals a number of shortcomings that render it far less useful than it appears. Instead, relative entropy (or KL divergence) proves to be the key to information theory in the continuous case, as the notion of comparing entropy across probability distributions retains its value. Expanding on this notion, we present several results in the field of maximum entropy and, in particular, conclude with an information-theoretic proof of the Central Limit Theorem using continuous relative entropy.
1 Introduction
Much discussion of information theory implicitly or explicitly assumes the (exclusive) use of discrete probability distributions. However, many of information theory's key results and principles can be extended to the continuous case, that is, to operate over continuous probability distributions. In particular, continuous (or differential) entropy is seen as the continuous-case extension of Shannon entropy. In this paper, we define and evaluate continuous entropy, relative entropy, maximum entropy, and several other topics in continuous information theory, concluding with an information-theoretic proof of the Central Limit Theorem using the techniques introduced throughout.
1.1 Goals
1. Define continuous entropy and evaluate its properties and shortcomings.
2. Discuss some results of maximum entropy (i.e., for distributions with fixed
mean, fixed variance, finite support, etc.).
3. Derive the Central Limit Theorem using information-theoretic principles.
2 Continuous Entropy
2.1 A Definition
Information theory truly began with Shannon entropy, i.e., entropy in the discrete case. While we will not review the concept extensively, recall the definition:
Definition (Shannon entropy). The Shannon entropy H(X) of a discrete random variable X with distribution P(x) is defined as:
$$H(X) = \sum_i P(x_i)\log\frac{1}{P(x_i)}$$
Definition (Continuous entropy). The continuous entropy h(X) of a continuous random variable X with density f(x) is defined as:
$$h(X) = \int_S f(x)\log\frac{1}{f(x)}\,dx$$
where S is the support set of X.

As an example, let X be uniformly distributed on the interval [a, b], so that $f(x) = \frac{1}{b-a}$. Then:
$$h(X) = \int_a^b \frac{1}{b-a}\log(b-a)\,dx = \log(b-a)\int_a^b \frac{1}{b-a}\,dx = \log(b-a)$$
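As a quick numerical sanity check (my own sketch, not from the original text), the snippet below approximates h(X) for a uniform density by a Riemann sum and compares it against the closed form log(b - a); the interval endpoints are arbitrary choices.

```python
import numpy as np

# Approximate h(X) = -\int f(x) log f(x) dx for X ~ Uniform(a, b) with a
# Riemann sum, and compare against the closed form log(b - a).
a, b = 0.0, 4.0                                 # arbitrary example interval
xs = np.linspace(a, b, 100_000, endpoint=False)
dx = xs[1] - xs[0]
f = np.full_like(xs, 1.0 / (b - a))             # uniform density on [a, b]
h_numeric = -np.sum(f * np.log(f)) * dx         # Riemann approximation of h(X)
print(h_numeric, np.log(b - a))                 # both ~ log(4) = 1.386...
```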
2.3 Weaknesses
2.3.1 Not the Limit of Shannon Entropy

As mentioned earlier, Shannon entropy was derived from a set of axioms, but our definition of continuous entropy was provided with no such derivation. Where does the definition actually come from?
The natural approach to deriving continuous entropy would be to take discrete entropy in the limit as n, the number of symbols in our distribution, goes to infinity. This is analogous to rigorously defining integrals in calculus using a Riemannian approach: it makes sense that the continuous case would come from extending the discrete case towards infinity.
To begin, we discretize our continuous distribution f into bins of size $\Delta$. By the Mean Value Theorem, there exists an $x_i$ in each bin such that
$$f(x_i)\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx.$$
This implies that we can approximate f by a Riemann sum:
$$\int f(x)\,dx = \lim_{\Delta\to 0}\sum_{i=-\infty}^{\infty} f(x_i)\Delta$$
The Shannon entropy of the quantized variable is then
$$H^\Delta = -\sum_{i=-\infty}^{\infty} f(x_i)\Delta\log\big(f(x_i)\Delta\big) = -\sum_{i=-\infty}^{\infty} f(x_i)\Delta\log\big(f(x_i)\big) - \sum_{i=-\infty}^{\infty} f(x_i)\Delta\log\Delta$$
Taking limits as $\Delta\to 0$:
$$\lim_{\Delta\to 0}\sum_{i=-\infty}^{\infty} f(x_i)\Delta = \int f(x)\,dx = 1$$
$$\lim_{\Delta\to 0}\sum_{i=-\infty}^{\infty} f(x_i)\log\big(f(x_i)\big)\Delta = \int f(x)\log f(x)\,dx$$
so that, for small $\Delta$,
$$H^\Delta \approx -\int f(x)\log f(x)\,dx - \log\Delta$$
(more precisely, $H^\Delta + \log\Delta \to h(X)$ as $\Delta \to 0$).
Ideally, we'd have that $H^\Delta$ were equal to our definition of continuous entropy, as it represents Shannon entropy in the limit. But note that $-\log\Delta \to \infty$ as $\Delta \to 0$: the Shannon entropy of the quantized variable diverges, differing from h(X) by an offset that blows up in the limit. Continuous entropy, then, is not simply the limit of Shannon entropy.
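To make the divergence concrete, here is a small numerical sketch (my own illustration, not the paper's): we quantize a standard normal into bins of width Δ and compare the Shannon entropy of the quantized variable against h(X) - log Δ. The discrete entropy grows without bound as Δ shrinks.

```python
import numpy as np
from scipy.stats import norm

# Quantize a standard normal into bins of width delta; the Shannon entropy of
# the quantized variable tracks h(X) - log(delta) and diverges as delta -> 0.
h_true = 0.5 * np.log(2 * np.pi * np.e)        # differential entropy of N(0, 1)
for delta in [0.5, 0.1, 0.01]:
    edges = np.arange(-10.0, 10.0 + delta, delta)
    probs = np.diff(norm.cdf(edges))           # bin probabilities ~ f(x_i) * delta
    probs = probs[probs > 0]
    H_discrete = -np.sum(probs * np.log(probs))
    print(delta, H_discrete, h_true - np.log(delta))
```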
2.3.2 Variant under Change of Variables

h(X) is variant under change of variables: depending on your coordinate system, a distribution might have a different continuous entropy. This shouldn't be the case, but it is. Informally, this means that the same underlying distribution, represented with different variables, might not have the same continuous entropy.
To understand why, note that the probability contained in a differential area should not change under a change of variables. That is, for corresponding x and y:
$$|f_Y(y)\,dy| = |f_X(x)\,dx|$$
Further, define g(x) to be the mapping from x to y, and $g^{-1}(y)$ its inverse. Then we get:
Lemma 2.1. $f_Y(y) = \frac{d}{dy}\big(g^{-1}(y)\big)\, f_X\big(g^{-1}(y)\big)$

Proof.
$$f_Y(y) = \frac{dx}{dy}\, f_X(x) = \frac{d}{dy}(x)\, f_X(x) = \frac{d}{dy}\big(g^{-1}(y)\big)\, f_X\big(g^{-1}(y)\big)$$
We'll use this fact in the following example[2]. Say, abstractly, that you have an infinite collection of circles. Let p(x) be the distribution of their radii and q(w) the distribution of their areas, with $x(w) = \sqrt{w}$ and $w(x) = x^2$. You'd expect this collection to have the same continuous entropy regardless of its representation. In fact, we'll show that $H(p) \neq H(q)$.

Claim. $H(p) \neq H(q)$
Proof.
$$p(x) = \left|\frac{d}{dx}\big(g^{-1}(x)\big)\right| q(w) = w'(x)\,q(w) = 2x\,q(w)$$
Thus $q(w) = \frac{p(x)}{2x}$, and, substituting $dw = 2x\,dx$:
$$
\begin{aligned}
H(w) &= \int q(w)\log\frac{1}{q(w)}\,dw \\
&= \int \frac{p(x)}{2x}\log\frac{2x}{p(x)}\,(2x\,dx) \\
&= \int p(x)\big(\log(2x) - \log(p(x))\big)\,dx \\
&= \int p(x)\log(2x)\,dx + \int p(x)\log\frac{1}{p(x)}\,dx \\
&= H(x) + \int p(x)\log(2x)\,dx
\end{aligned}
$$
Therefore $H(w) = H(x) + \int p(x)\log(2x)\,dx \neq H(x)$ in general.
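The following sketch checks this numerically under an assumed example (not from the paper): radii drawn from an Exponential(1) distribution, with areas given by w = x². Monte Carlo estimates of E[-log density] give different entropies for the two representations of the same collection.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=1_000_000)       # radii X ~ Exp(1) (assumed example)
w = x**2                                       # "areas" W = X^2

# Exact densities; the change of variables gives q(w) = p(sqrt(w)) / (2 sqrt(w)).
p = lambda x: np.exp(-x)
q = lambda w: np.exp(-np.sqrt(w)) / (2.0 * np.sqrt(w))

# Monte Carlo estimates of h = E[-log density] for each representation.
print((-np.log(p(x))).mean())                  # ~1.00 for the radii
print((-np.log(q(w))).mean())                  # ~1.12 for the areas: h differs
```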
To quote Shannon: "The scale of measurements sets an arbitrary zero corresponding to a uniform distribution over a unit volume."[8] The implication is that all continuous entropy quantities are somehow relative to the coordinate system in use. Further, one could extend this argument to say that continuous entropy is useless when viewed on its own; relative entropy between distributions could instead be the valuable quantity (as we'll see later on).
2.3.3 Scale Variant
Generalizing this result, we can also show that continuous entropy is not scale invariant.

Theorem 2.2. If Y = aX, then $h(Y) = h(X) + \log|a|$.[14]

Proof.
$$
\begin{aligned}
h(Y) &= h(X) - E\left[\log\left|\frac{dx}{dy}\right|\right] \\
&= h(X) - E\left[\log\frac{1}{|a|}\right] \\
&= h(X) + \log|a|
\end{aligned}
$$
2.3.4 Negativity

In the discrete case, entropy could also be defined as the expected value of the information of the distribution, or the number of bits you'd need to reliably encode n symbols. In the continuous case, this intuition deteriorates, as h(X) does not give you the amount of information in X.
To see why, note that h(X) can be negative! For example, if X is uniformly distributed on $[0, \frac{1}{2}]$, then $h(X) = \log(\frac{1}{2} - 0) = \log\frac{1}{2} = -1$ (taking logs base 2). If entropy can be negative, how can this quantity have any relationship to the information content of X?
2.4 An Alternative Definition
E.T. Jaynes[8] argued that we should define an invariant factor m(X) that
defines the density (note: not probability density) of a discrete distribution in
the limit.
Definition. Suppose we have a discrete set of points $\{x_i\}$ drawn from an increasingly dense distribution. The invariant factor m(x) is defined as:
$$\lim_{n\to\infty}\frac{1}{n}\,\big(\text{number of points in } a < x < b\big) = \int_a^b m(x)\,dx$$
This would give us an alternative definition of continuous entropy that is
invariant under change of variables.
Definition. Let X be a random variable with probability distribution p(X). An
alternative definition of the entropy H(X) follows:
$$H(X) = -\int_S p(x)\log\frac{p(x)}{m(x)}\,dx$$
where S is the support set of X.
We provide this definition solely for educational purposes. The rest of the paper will assume that $H(X) = \int_S p(x)\log\frac{1}{p(x)}\,dx$.
2.5 Relative Entropy

Despite the aforementioned flaws, there's hope yet for information theory in the continuous case. A key result is that the definitions for relative entropy and mutual information follow naturally from the discrete case and retain their usefulness. Let's go ahead and define relative entropy in the continuous case, using the definition in [6].
Definition. The relative entropy $D(f\|g)$ of two PDFs f and g is defined as:
$$D(f\|g) = \int_S f(x)\log\frac{f(x)}{g(x)}\,dx$$
where S is the support set of f. Note that $D(f\|g) = \infty$ if $\operatorname{supp}(f) \not\subseteq \operatorname{supp}(g)$.
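As an illustration (an assumed example, not the paper's), the sketch below evaluates D(f‖g) for two normal densities by numerical integration and compares the result against the well-known Gaussian closed form.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# D(f || g) for f = N(0, 1) and g = N(1, 4), via numerical integration of the
# definition, compared against the Gaussian closed form.
f = norm(loc=0.0, scale=1.0)
g = norm(loc=1.0, scale=2.0)

integrand = lambda x: f.pdf(x) * np.log(f.pdf(x) / g.pdf(x))
d_numeric, _ = integrate.quad(integrand, -20, 20)

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
d_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
print(d_numeric, d_closed)                     # both ~0.4431
```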
2.5.1 Non-Negativity

Theorem 2.3. For any two probability densities p and q, $D(p\|q) \geq 0$.

Proof.
$$
\begin{aligned}
D(p\|q) &= \int p(x)\log\frac{p(x)}{q(x)}\,dx \\
&= E_p\left[\log\frac{p(X)}{q(X)}\right] \\
&= E_p\left[-\log\frac{q(X)}{p(X)}\right] \\
&\geq -\log E_p\left[\frac{q(X)}{p(X)}\right] \quad\text{by Jensen's Inequality} \\
&= -\log\int p(x)\,\frac{q(x)}{p(x)}\,dx \\
&= -\log\int q(x)\,dx \\
&\geq -\log 1 = 0
\end{aligned}
$$
2.5.2 A Useful Lemma

Before we advance, it's worth formalizing a key lemma that follows from the non-negativity of relative entropy.
Lemma 2.4. If f and g are continuous probability distributions, then:
$$h(f) \leq -\int f(x)\log g(x)\,dx$$
We can use this lemma to prove upper bounds on the entropy of probability distributions given certain constraints. Examples follow in the subsequent sections.
2.6 Mutual Information

We can use our definition of relative entropy to define mutual information for continuous distributions as well. Recall that in the discrete case, we had:
$$I(X;Y) = D\big(p(x,y)\,\|\,p(x)p(y)\big)$$
We'll use this statement to define mutual information for continuous distributions[6].
Definition. The mutual information I(X;Y) of two random variables X and Y drawn from continuous probability distributions is defined as:
$$I(X;Y) = D\big(p(x,y)\,\|\,p(x)p(y)\big) = \int p(x,y)\log\frac{p(x,y)}{p(x)p(y)}\,dx\,dy$$
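For a concrete check (an assumed example, not from the paper), the sketch below computes I(X;Y) for a standard bivariate normal with correlation ρ by two-dimensional numerical integration and compares it against the known closed form -½ log(1 - ρ²).

```python
import numpy as np
from scipy import integrate

# I(X;Y) for a standard bivariate normal with correlation rho, by 2-D numerical
# integration of the definition, compared with the closed form -0.5*log(1 - rho^2).
rho = 0.6

def phi(t):                                    # standard normal density
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def joint(x, y):                               # bivariate normal density, unit variances
    z = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

integrand = lambda y, x: joint(x, y) * np.log(joint(x, y) / (phi(x) * phi(y)))
mi_numeric, _ = integrate.dblquad(integrand, -8, 8, -8, 8)
print(mi_numeric, -0.5 * np.log(1 - rho**2))   # both ~0.2231
```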
3 Maximum Entropy
Now that we've defined and analyzed continuous entropy, we can focus on some interesting results that follow from our formulation. Recall that the entropy of a continuous distribution is a highly problematic quantity, as it is variant under change of coordinates, potentially negative, etc. The true quantity of interest, then, is the relative entropy between (sets of) distributions. This leads us to examine the problem of maximum entropy, defined in [5] as follows:
Definition. Given a set of constraints (e.g., on support or moments), the maximum entropy problem is to find the distribution that satisfies those constraints and has the greatest possible entropy.
3.1 Finite Support
The first constraint we will examine is that of finite support. That is, let's find the distribution of maximum entropy among all distributions with support limited to the interval [a, b].
Recall that in the discrete case, entropy is maximized when a set of events is equally likely, i.e., uniformly distributed. Intuitively, as the events are equiprobable, we can't make any educated guesses about which event might occur; thus, we learn a lot when we're told which event occurred.
In the continuous case, the result is much the same.
Claim. The uniform distribution is the maximum entropy distribution on any
interval [a, b].
Proof. From [14]: Suppose f(x) is a distribution for $x \in (a, b)$ and u(x) is the uniform distribution on that interval. Then:
$$
\begin{aligned}
D(f\|u) &= \int f(x)\log\frac{f(x)}{u(x)}\,dx \\
&= \int f(x)\big(\log(f(x)) - \log(u(x))\big)\,dx \\
&= -h(f) - \int f(x)\log(u(x))\,dx \\
&= -h(f) + \log(b-a) \geq 0 \quad\text{by Theorem 2.3}
\end{aligned}
$$
Therefore, $\log(b-a) \geq h(f)$. That is, no distribution with support limited to [a, b] can have greater entropy than the uniform distribution on the same interval.
3.2 Fixed Variance

Next, let's find the maximum entropy distribution among all distributions with a fixed mean $\mu$ and variance $\sigma^2$.

Claim. The normal distribution $\varphi$ with mean $\mu$ and variance $\sigma^2$,
$$\varphi(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},$$
is the maximum entropy distribution among all distributions with that mean and variance.

Proof. Let f be any distribution with mean $\mu$ and variance $\sigma^2$. By Lemma 2.4:
$$
\begin{aligned}
h(f) &\leq -\int f(x)\log\varphi(x)\,dx \\
&= \int f(x)\left(\frac{1}{2}\log(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\right)dx
\end{aligned}
$$
As $\int f(x)(x-\mu)^2\,dx$ is the variance of f:
$$h(f) \leq \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2} = \frac{1}{2}\log(2\pi e\sigma^2) = h(\varphi)$$
Therefore, the entropy of f must be less than or equal to the entropy of the
normal distribution with identical mean and variance.
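As a quick check of this result (my own illustration, not the paper's), the closed-form entropies of a few unit-variance distributions all fall below the normal's ½ log(2πe) ≈ 1.42 (natural log):

```python
import numpy as np

# Differential entropies (natural log) of a few unit-variance distributions;
# none exceeds the normal's 0.5 * log(2*pi*e), as the fixed-variance result predicts.
h_normal = 0.5 * np.log(2 * np.pi * np.e)      # N(0, 1)                                ~1.4189
h_uniform = np.log(np.sqrt(12.0))              # Uniform(-sqrt(3), sqrt(3)), variance 1 ~1.2425
h_laplace = 1 + np.log(2 / np.sqrt(2.0))       # Laplace(scale 1/sqrt(2)), variance 1   ~1.3466
print(h_normal, h_uniform, h_laplace)
```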
3.3 Fixed Mean

Finally, consider distributions supported on $[0,\infty)$ with a fixed mean $1/\lambda$, and let $q(x) = \lambda e^{-\lambda x}$ be the exponential distribution with that mean.

Claim. The exponential distribution q is the maximum entropy distribution among all distributions supported on $[0,\infty)$ with mean $1/\lambda$.

Proof. Let p be any such distribution. By Lemma 2.4:
$$
\begin{aligned}
h(p) &\leq -\int p(x)\log q(x)\,dx \\
&= \int p(x)\left(\log\frac{1}{\lambda} + \lambda x\right)dx \\
&= \log\frac{1}{\lambda} + \lambda\int p(x)\,x\,dx \\
&= \log\frac{1}{\lambda} + \lambda E[X] \\
&= \log\frac{1}{\lambda} + 1 \\
&= h(q)
\end{aligned}
$$
Therefore, no distribution on $[0,\infty)$ with mean $1/\lambda$ can have greater entropy than the exponential.
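Similarly (again my own illustration, not the paper's), the differential entropies of a few mean-1 distributions supported on [0, ∞) never exceed the exponential's entropy of 1 (natural log):

```python
import numpy as np
from scipy.stats import expon, uniform, gamma

# Differential entropies of several mean-1 distributions on [0, inf); none
# exceeds the exponential's, matching the fixed-mean maximum entropy result.
print(expon(scale=1.0).entropy())              # Exponential, mean 1   -> 1.0
print(uniform(loc=0.0, scale=2.0).entropy())   # Uniform(0, 2), mean 1 -> log 2 ~ 0.693
print(gamma(a=2.0, scale=0.5).entropy())       # Gamma(2, 1/2), mean 1 -> ~0.884
```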
4 The Central Limit Theorem

Let $X_1, X_2, \ldots$ be i.i.d. random variables with mean $\mu$ and variance $\sigma^2$, and let $S_n = \sum_{i=1}^n (X_i - \mu)/\sqrt{n}$ denote their standardized sum. The Central Limit Theorem states that $S_n$ converges in distribution to a normal with mean 0 and variance $\sigma^2$ as $n \to \infty$.
For the rest of the proof, we assume that $\mu = 0$, and thus $S_n = \sum_{i=1}^n X_i/\sqrt{n}$, as $\mu$ is simply a shifting factor: if $S_n$ is asymptotically normal for $\mu = 0$, then it will be for any $\mu$, as this factor just shifts the center of the distribution.
4.1 Overview
Typically, proofs of the CLT use inverse Fourier transforms or moment generating functions, as in [11]. In this paper, we'll use information-theoretic principles. The broad outline of the proof is to show that the relative entropy of $S_n$ with respect to a normal distribution goes to zero.
To see that this is sufficient to prove the CLT, we use Pinsker's Inequality (from [10]).
Theorem 4.1 (Pinsker's Inequality). The variational distance between two probability mass functions P and Q, defined as
$$d(P, Q) = \sum_{x\in\mathcal{X}} |P(x) - Q(x)|,$$
is bounded above by the relative entropy between the two distributions, in the sense that
$$D(P\|Q) \geq \frac{1}{2}\,d^2(P, Q)$$
Thus, if $\lim_{n\to\infty} D(S_n\|\varphi) = 0$, then the distance between the two distributions goes to 0. In other words, $S_n$ approaches the normal.
(Note: from here onwards, we'll define $D(X) = D(f\|\varphi)$, where X has distribution f and $\varphi$ is the normal with the same mean and variance.)
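To illustrate the quantity being driven to zero (a sketch under assumed Exponential(1) summands, not an example from the paper), note that the standardized sum of n Exp(1) variables is a shifted and scaled Gamma(n), so D(S_n) can be computed by direct numerical integration:

```python
import numpy as np
from scipy import integrate
from scipy.stats import gamma, norm

# D(S_n || phi) for standardized sums of Exp(1) variables: S_n = (Gamma(n) - n)/sqrt(n).
# The relative entropy to the standard normal shrinks toward 0 as n grows.
def D_to_normal(n):
    f = lambda s: gamma(n).pdf(n + np.sqrt(n) * s) * np.sqrt(n)   # density of S_n
    def integrand(s):
        fs = f(s)
        return fs * np.log(fs / norm.pdf(s)) if fs > 0 else 0.0
    val, _ = integrate.quad(integrand, -np.sqrt(n), 30, limit=200)
    return val

for n in [1, 2, 5, 20, 100]:
    print(n, D_to_normal(n))
```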
To begin the proof, we provide a number of definitions and useful lemmas.
4.2 Fisher Information

Fisher information is a useful quantity in the proof of the Central Limit Theorem. Intuitively, Fisher information measures the minimum error involved in estimating a parameter of a distribution. Alternatively, it can be seen as a measurement of how much information a random variable X carries about a parameter upon which it depends.
We provide the following definitions. While they will be necessary in our proofs, it is not imperative that you understand their significance.
Definition. The standardized Fisher information of a random variable Y with density g(y) and variance $\sigma^2$ is defined as
$$J(Y) = \sigma^2\, E\big[(\rho(Y) - \rho_\varphi(Y))^2\big]$$
where $\rho = g'/g$ is the score function for Y and $\rho_\varphi = \varphi'/\varphi$ is the score function for the normal with the same mean and variance as Y.[2]

Definition. The Fisher information is defined in [2] as
$$I(Y) = E\big[\rho^2(Y)\big]$$
Equivalently, if Y has density f,
$$I(Y) = \int \left(\frac{f'(y)}{f(y)}\right)^2 f(y)\,dy$$
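As a sanity check on this integral form (my own example, not the paper's), the Fisher information of a N(0, σ²) variable is 1/σ²:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Evaluate I(Y) = \int (f'(y)/f(y))^2 f(y) dy numerically for Y ~ N(0, sigma^2);
# the known value is 1 / sigma^2.
sigma = 2.0
f = lambda y: norm.pdf(y, scale=sigma)
df = lambda y: -y / sigma**2 * norm.pdf(y, scale=sigma)   # derivative f'(y)
I_numeric, _ = integrate.quad(lambda y: (df(y) / f(y))**2 * f(y), -30, 30)
print(I_numeric, 1 / sigma**2)                 # both 0.25
```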
4.3 Relative Entropy and Fisher Information

From [1], we can relate relative entropy to Fisher information through the following lemma.
Lemma 4.2. Let X be a random variable with finite variance, and let Z be an independent normal with the same variance. Then:
$$
\begin{aligned}
D(X) &= \int_0^1 J\big(\sqrt{t}\,X + \sqrt{1-t}\,Z\big)\,\frac{dt}{2t}, \qquad t \in (0,1) \\
&= \int_0^\infty J\big(X + \sqrt{\tau}\,Z\big)\,\frac{d\tau}{2(1+\tau)}, \qquad \tau = \frac{1-t}{t} \in (0,\infty)
\end{aligned}
$$
This connection will be key in proving the Central Limit Theorem.
4.4 Convolution Inequalities
Again from [1] (drawing on [3] and [15]), we have the following result:

Lemma 4.3. If $Y_1$ and $Y_2$ are independent random variables and $\lambda_i \geq 0$, $\lambda_1 + \lambda_2 = 1$, then
$$I\big(\sqrt{\lambda_1}\,Y_1 + \sqrt{\lambda_2}\,Y_2\big) \leq \lambda_1 I(Y_1) + \lambda_2 I(Y_2).$$
Using this result, we can prove something stronger.

Lemma 4.4. If $Y_1$ and $Y_2$ are independent with the same variance, then
$$J\big(\sqrt{\lambda_1}\,Y_1 + \sqrt{\lambda_2}\,Y_2\big) \leq \lambda_1 J(Y_1) + \lambda_2 J(Y_2)$$
and, more generally, for $\lambda_i \geq 0$ summing to 1,
$$J\Big(\sum_i \sqrt{\lambda_i}\,Y_i\Big) \leq \sum_i \lambda_i J(Y_i).$$

Proof. Since $J(Y) = \sigma^2 I(Y) - 1$ for a random variable with variance $\sigma^2$, multiplying Lemma 4.3 through by the common variance $\sigma^2$ gives:
$$
\begin{aligned}
I\big(\sqrt{\lambda_1}\,Y_1 + \sqrt{\lambda_2}\,Y_2\big) &\leq \lambda_1 I(Y_1) + \lambda_2 I(Y_2) \\
J\big(\sqrt{\lambda_1}\,Y_1 + \sqrt{\lambda_2}\,Y_2\big) + 1 &\leq \lambda_1\big(J(Y_1) + 1\big) + \lambda_2\big(J(Y_2) + 1\big) \\
J\big(\sqrt{\lambda_1}\,Y_1 + \sqrt{\lambda_2}\,Y_2\big) + 1 &\leq \lambda_1 J(Y_1) + \lambda_2 J(Y_2) + (\lambda_1 + \lambda_2) \\
J\big(\sqrt{\lambda_1}\,Y_1 + \sqrt{\lambda_2}\,Y_2\big) + 1 &\leq \lambda_1 J(Y_1) + \lambda_2 J(Y_2) + 1 \\
J\big(\sqrt{\lambda_1}\,Y_1 + \sqrt{\lambda_2}\,Y_2\big) &\leq \lambda_1 J(Y_1) + \lambda_2 J(Y_2)
\end{aligned}
$$
Proof. From [1]: let $Y_i = X_i + \sqrt{\tau}\,Z_i$, where $Z_i$ is a normal with the same variance as $X_i$. Combining Lemma 4.5 with the equation
$$D(X) = \int_0^\infty J\big(X + \sqrt{\tau}\,Z\big)\,\frac{d\tau}{2(1+\tau)}$$
yields the desired bound.
Limit is infimum. Next, we prove that the limit exists and equals the infimum. Fix $\epsilon > 0$ and let p be such that $H(S_p) \geq \sup_n(H(S_n)) - \epsilon$. Let $n = mp + r$ where $r < p$. Note that
$$H(S_{mp}) = H\left(\sum_{i=1}^m S_p^{(i)}\Big/\sqrt{m}\right)$$
where the $S_p^{(i)}$ are i.i.d. copies of $S_p$. Then:
$$
\begin{aligned}
H(S_n) &= H(S_{mp+r}) \\
&= H\left(\sqrt{\tfrac{mp}{n}}\,S_{mp} + \sqrt{\tfrac{r}{n}}\,S_r\right) \\
&\geq H\left(\sqrt{\tfrac{mp}{n}}\,S_{mp}\right) \quad\text{as samples are i.i.d.\ and entropy increases on convolution} \\
&= H(S_{mp}) + \frac{1}{2}\log(mp/n) \\
&= H(S_{mp}) + \frac{1}{2}\log\big(mp/(mp+r)\big) \\
&= H(S_{mp}) + \frac{1}{2}\log\big(1 - (r/n)\big) \\
&\geq H(S_p) + \frac{1}{2}\log\big(1 - (r/n)\big) \quad\text{by Lemma 4.8}
\end{aligned}
$$
This quantity converges to $H(S_p)$ as $n \to \infty$. As a result, we get that:
$$\lim_{n\to\infty} H(S_n) \geq H(S_p) \geq \sup_n\big(H(S_n)\big) - \epsilon$$
If we let $\epsilon \to 0$, we get that $\lim_{n\to\infty} H(S_n) = \sup_n(H(S_n))$. From the definition of relative entropy, we have $H(S_n) = \frac{1}{2}\log(2\pi e\sigma^2) - D(S_n)$. Thus, the previous statement is equivalent to $\lim_{n\to\infty} D(S_n) = \inf_n(D(S_n))$.
Infimum is 0. The skeleton of the proof in [2] is to show that the infimum is 0 for a subsequence of the $n_k$'s. As the limit exists, all subsequences must converge to the limit of the sequence, and thus we can infer the limit of the entire sequence from the limit of one of its subsequences.
In particular, the subsequence is $n_k = 2^k n_0$, implying that the goal is to prove $\lim_{k\to\infty} D(S_{2^k n_0}) = 0$. This is done by showing that $\lim_{k\to\infty} J(S_{2^k n_0} + \sqrt{\tau}\,Z) = 0$ and then applying Lemma 4.2.
5 Conclusion
Beginning with a definition for continuous entropy, we've shown that the quantity on its own holds little value due to its many shortcomings. While the definition was, on the surface, a seemingly minor notational deviation from the discrete case, continuous entropy lacks invariance under change of coordinates, non-negativity, and other desirable properties that helped motivate the original definition of Shannon entropy.
But while continuous entropy on its own proved problematic, comparing entropy across continuous distributions (with relative entropy) yielded fascinating
results, both through maximum entropy problems and, interestingly enough,
the information-theoretic proof of the Central Limit Theorem, where the relative entropy of the standardized sum and the normal distribution was shown to
drop to 0 as the sample size grew to infinity.
The applications of continuous information-theoretic techniques are varied; but, perhaps best of all, they give us a means of justifying and proving results with the same familiar, intuitive feel granted to us in the discrete realm. An information-theoretic proof of the Central Limit Theorem makes sense when we see that the relative entropy of the standardized sum and the normal decreases over time; similarly, the normal as the maximum entropy distribution for fixed mean and variance feels intuitive as well. Calling on information theory to prove and explain these results in the continuous case yields both rigorous justifications and intuitive explanations.
Appendix
Convolution Increases Entropy. From [9]: Recall that conditioning decreases
entropy. Then, for independent X and Y , we have:
$$h(X+Y \mid X) = h(Y \mid X) = h(Y) \quad\text{by independence}$$
$$\Rightarrow\quad h(Y) = h(X+Y \mid X) \leq h(X+Y)$$
References
[1] Andrew R. Barron. Monotonic Central Limit Theorem for Densities. Technical report, Stanford University, 1984.
[2] Andrew R. Barron. Entropy and the Central Limit Theorem. The Annals of Probability, 14:336-342, 1986.
[3] Nelson M. Blachman. The Convolution Inequality for Entropy Powers. IEEE Transactions on Information Theory, pages 267-271, April 1965.
[4] R.M. Castro. Maximum likelihood estimation and complexity regularization (lecture notes). May 2011.