0% found this document useful (0 votes)

201 views

Continuous Entropy

1) Continuous (or differential) entropy was proposed as an extension of Shannon entropy to continuous probability distributions, defined as the integral of the probability density function multiplied by the logarithm of the density. 2) However, further analysis revealed weaknesses in continuous entropy, such as its dependence on the choice of coordinates and scaling, and the possibility of negative values which contradict its interpretation as "information content". 3) Relative entropy (KL divergence), which measures the difference between two probability distributions, is a more useful information-theoretic concept in the continuous case, as it retains the desirable property of being invariant to changes of variables or scale. The document concludes by presenting results on maximum entropy and proving the Central Limit Theorem using

Uploaded by

DanielDanielli

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

201 views

Continuous Entropy

Uploaded by

DanielDanielli

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 17

Introduction to Continuous Entropy

Charles Marsh
Department of Computer Science
Princeton University
crmarsh@princeton.edu
December 13, 2013
Abstract
Classically, Shannon entropy was formalized over discrete probability distributions. However, the concept of entropy can be extended to
continuous distributions through a quantity known as continuous (or differential ) entropy. The most common definition for continuous entropy is
seemingly straightforward; however, further analysis reveals a number of
shortcomings that render it far less useful than it appears. Instead, relative entropy (or KL divergence) proves to be the key to information theory
in the continuous case, as the notion of comparing entropy across probability distributions retains value. Expanding off this notion, we present
several results in the field of maximum entropy and, in particular, conclude with an information-theoretic proof of the Central Limit Theorem
using continuous relative entropy.

Introduction

Much discussion of information theory implicitly or explicitly assumes the (exclusive) usage of discrete probability distributions. However, many of information theorys key results and principles can be extended to the continuous
casethat is, to operate over continuous probability distributions. In particular,
continuous (or differential ) entropy is seen as the continuous-case extension of
Shannon entropy. In this paper, we define and evaluate continuous entropy,
relative entropy, maximum entropy, and several other topics in continuous information theory, concluding with an information-theoretic proof of the Central
Limit Theorem using the techniques introduced throughout.
1.1

Goals

More specifically, our goals are as follows:

1. Introduce and evaluate a definition for continuous entropy.

Charles Marsh (crmarsh@)

Continuous Entropy

2. Discuss some results of maximum entropy (i.e., for distributions with fixed
mean, fixed variance, finite support, etc.).
3. Derive the Central Limit Theorem using information-theoretic principles.

Continuous Entropy

2.1

A Definition

Information theory truly began with Shannon entropy, i.e., entropy in the discrete case. While we will not review the concept extensively, recall the definition:
Definition (Shannon entropy). The Shannon entropy h(X) of a discrete random variable X with distribution P (x) is defined as:
H(X) = i P (xi ) log

1
P (xi )

The formula for continuous entropy is a (seemingly) logical extension of the

discrete case. In fact, we merely replace the summation with an integral.
Definition (Continuous entropy). The continuous entropy h(X) of a continuous random variable X with density f (x) is defined as:
R
1
h(X) = S f (x) log f (x)
dx
where S is the support set of the random variable.[6] As shorthand, we can also
write H(f ) = H(X), where random variable X has distribution f (x).
To see how continuous entropy operates in the wild, consider the following
example.
2.2

An Example: The Uniform Distribution

Allow f to be the uniform distribution on [a, b]. That is:

(
1
, for x [a, b]
f (x) = ba
0,
else
Lets solve for the continuous entropy of this distribution.
Z
h(f ) =

f (x) log
S
b

1
dx
f (x)

1
log (b a)dx
b

a
a
Z b
1
dx
=
log (b a)
ba
a
= log (b a)
=

Informally, the continuous entropy of the uniform distribution is equal to

the log of the width of the interval.
2

Charles Marsh (crmarsh@)

2.3

Continuous Entropy

Weaknesses

The definition of continuous entropy provided seems to follow quite naturally

from Shannon entropy. But rigorously, how well does it help up? Is it a good
extension of Shannon entropy?
As well show, there are a number of kinks or weaknesses with our definition
of continuous entropy. In the discrete case, we had a set of axioms from which we
derived Shannon entropy and thus a bunch of nice properties that it exhibited.
In the continuous case, however, our definition is highly problematicto the
point that, on its own, it may not be an entirely useful mathematical quantity.
2.3.1

Shannon entropy in the Limit

As mentioned earlier, Shannon entropy was derived from a set of axioms. But
our definition for continuous entropy was provided with no such derivation.
Where does the definition actually come from?
The natural approach to deriving continuous entropy would be to take discrete entropy in the limit of n, the number of symbols in our distribution. This
is equivalent to rigorously defining integrals in calculus using a Reimannian approach: it makes sense that the continuous case would come from extending the
discrete case towards infinity.
To begin, we discretize our continuous distribution f into bins of size . By
the Mean Value Theorem, we get that there exists an xi such that f (xi ) =
R (i+1)
f (x)dx. This implies that we can approximate f by a Reimann sum:
i
Z

f (x)dx = lim
i= f (xi )

Claim. Continuous entropy differs from Shannon entropy in the limit by a

potentially infinite offset.
Proof.
H = lim
i= f (xi ) log (f (xi ))
0

= lim
i= f (xi ) log (f (xi )) lim i= f (xi ) log
0
0
Z
lim
f
(x
)
=
f (x)dx = 1
i
i=
0

Z
lim
f
(x
)
log
(f
(x
))
=
f (x) log f (x)dx
i
i
i=
0

Z
H =
f (x) log f (x)dx lim
i= f (xi )

Ideally, wed have that H were equal to our definition for continuos entropy,
as it represents Shannon entropy in the limit. But note that log () as
3

Charles Marsh (crmarsh@)

Continuous Entropy

0. As a result, the right term will explode. So instead, we need a special

definition for continuous entropy:
Z
h(f ) = lim (H + log ) =
f (x) log f (x)dx
0

In this sense, continuous entropy differs in the limit by an infinite offset!

This demonstrates that the formula for continuous entropy is not a derivation of anything, unlike Shannon entropyits merely the result of replacing
the summation with an integration.[8] This result may not be a problem in and
of itself, but it helps to explain some of the proceeding difficulties with the
definition.
2.3.2

Variable Under Change of Coordinates

h(X) is variant under change of variables. Depending on your coordinate system, a distribution might have a different continuous entropy. This shouldnt
be the casebut it is. Informally, this means that the same underlying distribution, represented with different variables, might not have the same continuous
entropy.
To understand why, note that the probability contained in a differential area
should not alter under change of variables. That is, for x, y:
|fY (y)dy| = |fX (x)dx|
Further, define g(x) to be the mapping from x to y, and g 1 (y), its inverse.
Then, we get:
Lemma 2.1. fY (y) =

d
1
(y))fX (g 1 (y))
dy (g

Proof.
dx
fX (x)
dy
d
=
(x)fX (x)
dy
d 1
=
(g (y))fX (g 1 (y))
dy

fY (y) =

Well use this fact in the following example[2]: Say, abstractly, that you have
an infinite distribution of circles. Let p(x) be the distribution
p w of their radii and
2
q(w), the distribution of their areas. Further, x(w) =
and w(x) = x .
Youd expect this distribution to have the same continuous entropy regardless
of its representation, In fact, well show that H(p) 6= H(q).
Claim. H(p) 6= H(q)

Charles Marsh (crmarsh@)

Continuous Entropy

Proof.
d 1
(g (x))|q(w)
dx
0
= w (x)q(w)

p(x) = |

= 2xq(w)
Thus: q(w) =

p(x)
Z2x

1
dw
q(w)
Z
p(x)
2x
=
log
(2xdx)
2x
p(x)
Z
= p(x)(log (2x) log (p(x)))dx
Z
Z
1
dx
= p(x) log (2x)dx + p(x) log
p(x)
Z
= H(x) + p(x) log (2x)dx

Therefore: H(w) =

q(w) log

6= H(x)
To quote Shannon: The scale of measurements sets an arbitrary zero corresponding to a uniform distribution over a unit volume.[8] The implication here
is that all continuous entropy quantities are somehow relative to the coordinate
system in-use. Further, one could extend this argument to say that continuous entropy is useless when viewed on its own. In particular, relative entropy
between distributions could be the valuable quantity (which well see later on).
2.3.3

Scale Variant

Generalizing this result, we can also get that continuous entropy is not scale
invariant.
Theorem 2.2. If Y = X, then h(Y ) = h(X) + log ||.[14]
Proof.
dx
|]
dy
1
= h(X) E[log ]

= h(X) + log ||

h(Y ) = h(X) E[log |

2.3.4

Negativity & Information Content

With Shannon entropy, we had this wonderful intuition in which it represented

the information content of a discrete distribution. That is, Shannon entropy
5

Charles Marsh (crmarsh@)

Continuous Entropy

could also be defined as the expected value of the information of the distribution or the number of bits youd need to reliably encode n symbols. In
the continuous case, this intuition deteriorates as h(X) does not give you the
amount of information in X.
To see why, note that h(X) can be negative! For example: if X is uniformly
distributed in [0, 21 ], then h(X) = log ( 12 0) = log 12 = 1. If entropy can be
negative, how can this quantity have any relationship to the information content
of X?
2.4

An Alternative Definition

E.T. Jaynes[8] argued that we should define an invariant factor m(X) that
defines the density (note: not probability density) of a discrete distribution in
the limit.
Definition. Suppose we have a discrete set {xi } of an increasingly dense distribution. The invariant factor m(X) is defined as:
Rb
limn n1 (number of points in a < x < b) = a m(x)dx
This would give us an alternative definition of continuous entropy that is
invariant under change of variables.
Definition. Let X be a random variable with probability distribution p(X). An
alternative definition of the entropy H(X) follows:
R
p(x)
H(X) = S p(x) log m(x)
dx
where S is the support set of X.
We provide this definition solely
purposes. The rest of the
R for educational
1
paper will assume that H(X) = S p(x) log p(x)
dx.
2.5

Continuous Relative Entropy (KL divergence)

Despite the aforementioned flaws, theres hope yet for information theory in the
continuous case. A key result is that definitions for relative entropy and mutual
information follow naturally from the discrete case and retain their usefulness.
Lets go ahead and define relative entropy in the continuous case, using the
definition in [6].
Definition. The relative entropy D(f ||g) of two PDFs f and g is defined as:
R
(x)
D(f ||g) = S f (x) log fg(x)
dx
where S is the support set of f . Note that D(f ||g) = 0 if supp(g) 6 supp(f ).

Charles Marsh (crmarsh@)

2.5.1

Continuous Entropy

Non-Negativity of Relative Entropy

Importantly, relative entropy remains non-negative in the continuous case. We

prove this using Jensens Inequality[4].
Theorem 2.3. For any two distributions f and g:
D(f ||g) 0
Proof.
Z
D(p||q) =

p(x) log

p(x)
dx
q(x)

p(X)
]
q(X)
q(X)
= E[ log
]
p(X)
p
q(X)
] by Jensens Inequality
log E[
p p(X)
Z
q(x)
= log p(x)
dx
p(x)
Z
= log q(x)dx
= E[log
p

= log 1
=0
2.5.2

Using Relative Entropy to Prove Upper Bounds

Before we advance, its worth formalizing a key lemma that follows from the
non-negativity of relative entropy.
Lemma 2.4. If f and g are continuous probability distributions, then:
R
h(f ) f (x) log g(x)dx

Charles Marsh (crmarsh@)

Continuous Entropy

Proof. Using relative entropy.

Z
f (x)
D(f ||g) = f (x) log
dx
g(x)
Z
= f (x)(log (f (x)) log (g(x)))dx
Z
Z
= f (x) log (f (x))dx f (x) log (g(x))dx
Z
Z
1
= f (x) log
dx f (x) log (g(x))dx
(f (x))
Z
= h(x) f (x) log (g(x))dx
Z
= h(x) f (x) log (g(x))dx
0
Z
Therefore: h(x)

f (x) log (g(x))dx

We can use this lemma to prove upper bounds on the entropy of probability
distributions given certain constraints. Examples will follow in the proceeding
sections.
2.6

Continuous Mutual Information

We can use our definition of relative entropy to define mutual information for
continuous distributions as well. Recall that in the discrete case, we had:
I(X; Y ) = D(p(x, y)||p(x)p(y))
Well use this statement to define mutual information for continuous distributions[6].
Definition. The mutual information I(X; Y ) of two random variables X and
Y drawn from continuous probability distributions is defined as:
I(X; Y ) = D(p(x, y)||p(x)p(y))
Z
p(x, y)
= p(x, y) log
dx
p(x)p(y)

Maximum Entropy

Now that weve defined and analyzed continuous entropy, we can now focus
on some interesting results that follow from our formulation. Recall that the
entropy of a continuous distribution is a highly problematic quantity as it is
variant under change of coordinates, potentially non-negative, etc. The true
quantity of interest, then, is the relative entropy between (sets of) distributions.
This leads us to examine the problem of maximum entropy, defined in [5] as
follows:
8

Charles Marsh (crmarsh@)

Continuous Entropy

Definition. The maximum entropy problem is to find a probability distribution

that maximizes entropy satisfying some set of constraints.
Intuitively, the maximum entropy problem focuses on finding the most random distribution under some conditions. For example, finding the maximum
entropy among all distributions with mean , or all distributions with variance
2 . (Incidentally, both of these constraints yield interesting solutions.)
We further motivate maximum entropy by noting the following from [16]:
1. Maximizing entropy will minimize the amount of prior information built
into the probability distribution.
2. Physical systems tend to move towards maximum entropy as time progresses.
3.1

Maximum Entropy on an Interval

The first constraint we will examine is that of finite support. That is, lets find
the distribution of maximum entropy for all distributions with support limited
to the interval [a, b].
Recall that in the discrete case, entropy is maximized when a set of events are
equally likely, i.e., uniformly distributed. Intuitively, as the events are equiprobable, we cant make any educated guesses about which event might occur; thus,
we learn a lot when were told which event occurred.
In the continuous case, the result is much the same.
Claim. The uniform distribution is the maximum entropy distribution on any
interval [a, b].
Proof. From [14]: Suppose f (x) is a distribution for x (a, b) and u(x) is the
uniform distribution on that interval. Then:
Z
f (x)
dx
D(f ||u) = f (x) log
u(x)
Z
= f (x)(log (f (x)) log (u(x)))dx
Z
= h(x) f (x) log (u(x))dx
= h(x) + log (b a) 0 by Theorem 2.3
Therefore, log (b a) h(x). That is, no distribution with finite support
can have greater entropy than the uniform on the same interval.
3.2

Maximum Entropy for Fixed Variance

Maximizing entropy over all distributions with fixed variance 2 is particularly

interesting. Variance seems like the most natural quantity to vary when discussing entropy. Intuitively, if entropy is interpreted as a barometer for the
9

Charles Marsh (crmarsh@)

Continuous Entropy

randomness of a distribution, then it would hopefully have some significant

relationship to variance.
Recall (or see [13]) that the normal distribution with mean and variance
2 is defined as:
f (x) =

1
2 2

exp {

(x )2
}
2 2

We will prove (from [5]) that the normal maximizes entropy.

Theorem 3.1. The normal distribution maximizes entropy for all distributions
with fixed variance 2 and mean .
Proof. Again, we use relative entropy. Consider some distribution f and the
normal distribution .
It is easily verified that the normal distribution with mean and variance
2 has entropy equal to h() = 21 log (2e 2 ).
Combining this result with Lemma 2.4, we get:
Z
h(f ) f (x) log ((x))dx
Z
1
(x )2
f (x) log (
})dx
exp {
2 2
2 2
Z
(x )2
1
))dx
f (x)(log (exp {
}) + log (
2 2
2 2
Z
(x )2
1
f (x)(
log (2 2 ))dx
2 2
2
Z
Z
2
(x )
1
2
f (x)
dx
+
log
(2
)
f (x)dx
2 2
2
Z
1
1
f (x)(x )2 dx + log (2 2 )

2
2
2
Z
2
As
f (x)(x ) dx is the variance of f :
1
1
log (2 2 ) +
2
2
1
2
= log (2e )
2
= h()

Therefore, the entropy of f must be less than or equal to the entropy of the
normal distribution with identical mean and variance.
3.3

Maximum Entropy for Fixed Mean

As another example, consider the following problem in which we put a constraint

on the mean of the distribution: Find the continuous probability density function
p of maximum entropy on (0, ) with mean 1 .
10

Charles Marsh (crmarsh@)

Continuous Entropy

Claim. The exponential distribution with parameter maximizes entropy on

(0, ) for distributions with mean 1
Proof. Consider the exponential distribution q with parameter (and, consequently, mean 1 ). It is easily verified that h(q) = log 1 + 1.
Let p be some other distribution on (0, ) with mean 1 . Then, by Lemma
2.4:
Z
h(p) p(x) log (q(x))dx
Z
p(x) log (ex )dx
Z
p(x)(log + log ex )dx
Z
1
p(x)(log log ex )dx

Z
1
log + p(x)xdx

1
log + E[X]

1
log + 1

= h(q)

The Central Limit Theorem

We start with an informal definition of the Central Limit Theorem, motivated

by [7].
Definition. The Central Limit Theorem (CLT) states that the distribution of
the mean of a sample of i.i.d. random variables will approach normal in the limit.
Specifically, if our variables Xi have mean and variance 2 , the arithmetic
mean will approach normal with parameters (, 2 /n).
The CLT has massive implications within statistics. Intuitively, it says that
the distribution of the standardized sum of a bunch of Xi s will be normal
regardless of the shape of the Xi s themselves. This allows us to make normality
assumptions fairly often when handling real-world data.
In this paper, we prove the version of the CLT defined in [2].
Claim. Let X1 , X2 , ... be i.i.d. random variables with mean and variance
2 . Further, let Sn = n[(ni=1 Xi )/n ] be the standardized sum (that is, the
convolution of the Xi s divided by n). We claim that the underlying distribution
of Sn approaches normal with mean 0 and variance 2 as n .
11

Charles Marsh (crmarsh@)

Continuous Entropy

For the rest of the proof, we assume that = 0, and thus Sn = ni=1 Xi / n,
as is simply a shifting factor. If Sn is normal for = 0, then it will be normal
for any , as this factor just modifies the center of the distribution.
4.1

Overview

Typically, proofs of the CLT use inverse Fourier transforms or moment generating functions, as in [11]. In this paper, well use information-theoretic principles.
The broad outline of the proof will be to show that the relative entropy of
Sn with respect to a normal distribution goes to zero.
To see that this is sufficient to prove the CLT, we use Pinksers Inequality
(from [10]).
Theorem 4.1 (Pinskers Inequality). The variational distance between two
probability mass functions P and Q, defined as:
d(P, Q) = xX |P (x) Q(x)|
is bounded above the relative entropy between the two distributions in the sense
that
D(P ||Q) 12 d2 (P, Q)
Thus, if limn D(Sn ||) = 0, then the distance d(P, Q) between the two
distributions goes to 0. In other words, Sn approaches the normal.
(Note: from here onwards, well define D(X) = D(f ||), where X has distribution f .)
To begin the proof, we provide a number of definitions and useful lemmas.
4.2

Fisher Information and its Connection to Entropy

Fisher Information is a useful quantity in the proof of the Central Limit Theorem. Intuitively, Fisher Information is the minimum error involved in estimating
a parameter of a distribution. Alternatively, it can be seen as a measurement of
how much information a random variable X carries about a parameter upon
which it depends.
We provide the following definitions. While they will be necessary in our
proofs, it is not imperative that you understand their significance.
Definition. The standardized Fisher information of a random variable Y with
density g(y) and variance 2 is defined as
J(Y ) = 2 E[(Y ) (Y )]2
where = g 0 /g is the score function for Y and = 0 / is the score
function for the normal with the same mean and variance as Y .[2]
Definition. The Fisher information is defined in [2] as
I(Y ) = E[2 (Y )]
12

Charles Marsh (crmarsh@)

Continuous Entropy

Alternatively, from [12]:

I(Y ) =

0
( f (y) )2 f (y)dy
f (y)

where the two quantities are related by I = (J + 1)/ 2 .

4.3

Relationship Between Relative Entropy and Fisher Information

From [1], we can relate relative entropy to Fisher Information through the following lemma.
Lemma 4.2. Let X be a random variable with finite variance. Then:
Z 1

dt
D(X) =
J( tX + 1 tZ) , t (0, 1)
2t
Z0

d
t
=
J(X + Z)
,=
, (0, 1)
1+
1t
0
This connection will be key in proving the Central Limit Theorem.
4.4

Convolution Inequalities

Again from [1] (drawing on [3] and [15]), we have the following result:
Lemma 4.3. If Y1 and Y2 are random variables and i 0, 1 + 2 = 1, then

I( 1 Y1 + 2 Y2 ) 1 I(Y1 ) + 2 I(Y2 ).
Using this result, we can prove something stronger.
Lemma 4.4. If Y1 and Y2 have the same variance, then

J( 1 Y1 + 2 Y2 ) 1 J(Y1 ) + 2 J(Y2 )

and

J(i i Yi ) i i J(Yi )

Proof. Recall that I = (J + 1)/ 2 . Then:

I( 1 Y1 + 2 Y2 ) 1 I(Y1 ) + 2 I(Y2 )

(J( 1 Y1 + 2 Y2 ) + 1)/ 2 1 (J(Y1 ) + 1)/ 2 + 2 (J(Y2 ) + 1)/ 2

J( 1 Y1 + 2 Y2 ) + 1 1 (J(Y1 ) + 1) + 2 (J(Y2 ) + 1)

J( 1 Y1 + 2 Y2 ) + 1 1 J(Y1 ) + 2 J(Y2 ) + (1 + 2 )

J( 1 Y1 + 2 Y2 ) + 1 1 J(Y1 ) + 2 J(Y2 ) + 1

J( 1 Y1 + 2 Y2 ) 1 J(Y1 ) + 2 J(Y2 )

Charles Marsh (crmarsh@)

Continuous Entropy

This argument can be extended to yield the stronger statement.

Next, we apply Lemma 4.4 to prove a number of helpful convolution inequalities.

Lemma 4.5. D(i i Xi ) i i D(Xi )

Lemma 4.6. H(i i Xi ) i i H(Xi )

Proof. From [1]. Let Yi = Xi + Zi , where Zi is the normal with the same
variance as Xi . By combining Lemma 4.5 with the equation:
Z

d
D(X) =
J(X + Z)
1+
0

We get Lemma 4.5: D(i i Xi ) i i D(Xi ). Noting that H(X) =

1
log
(2e 2 ) D(X) gives us Lemma 4.6.
2
Well need a few more results before we can complete the proof of the CLT.
m

) H(X1 ) if the Xi are i.i.d.

Lemma 4.7. H( X1 +...+X
m

Proof. Apply Lemma 4.6 with i =

1
m.

Lemma 4.8. For any integers n = mp, H(Smp ) H(Sp ).

Proof. Returning to the standardized sum, we note that Smp = m

i=0 Sp / m. If
we apply Lemma 4.6 with Xi = Sp and i = 1/m, we get:
H(Smp ) H(Sp )
4.5

Subadditivity of Relative Entropy for Standardized Sum

The main theorem follows.

Theorem 4.9. Let Sn be the standardized sum. Then nD(Sn ) is a subadditive
sequence, and D(S2n ) D(Sn ). As a result, we get convergence of the relative
entropy:
limn D(Sn ) = 0
Proof. We divide our proof into several stages.
Subadditivity. Recall that H(Smp ) H(Sp ). Setting m = 2 and p = n, we get
H(S2n ) H(Sn ), which implies that D(S2n ) D(Sn ).

Charles Marsh (crmarsh@)

Continuous Entropy

Limit is infimum. Next, we prove that the limit exists and equals the infimum.
Let p be such that H(Sp ) supn (H(Sn )) . Let n = mp + r where r < p.

H(Smp ) = H(m
i=1 Sp / m)
H(Sn ) = H(Smp+r )

mp
r
= H( Smp + Sr )
n
n

mp
H( Smp ) as samples i.i.d., entropy increases on convolution
n
1
= H(Smp ) + log (mp/n)
2
1
= H(Smp ) + log (mp/(mp + r))
2
1
= H(Smp ) + log (1 (r/n))
2
1
H(Sp ) + log (1 (r/n)) by Lemma 4.8
2
This quantity converges to H(Sp ) as n . As a result, we get that:
lim H(Sn ) H(Sp ) + 0

sup(H(Sn ))
n

If we let 0, we get that limn H(Sn ) = supn (H(Sn )). From the
definition of relative entropy, we have H(Sn ) = 21 log (2e 2 ) D(Sn ). Thus,
the previous statement is equivalent to limn D(Sn ) = inf n (D(Sn )).
Infimum is 0. The skeleton of the proof in [2] is to show that the infimum is
0 for a subsequence of the nk s. As the limit exists, all subsequences must
converge to the limit of the sequence, and thus we can infer the limit of the
entire sequence given a limit of one of the subsequences.
In particular, the subsequence is nk = 2k n0 , implying that the goal is to
prove
limk D(S2k n0 ) = 0. This is done by showing that limk J(S2k n0 +

Z) = 0, i.e., that J goes to zero for a subsequence of the nk s (proven by going

back to the definition of Fisher Information). Using the relationship between D
and J demonstrated in Lemma 4.2, we get that limk D(S2k n0 ) = 0.
As the limit exists, all subsequences must converge to the limit of the sequence, and thus the limit of the entire sequence is 0.
With that established, weve proven that limn D(Sn ) = 0.
The significance of Theorem 4.9 is that the distribution of the standardized
sum deviates by less and less from the normal as n increases and, in the limit,
does not deviate at all. Therefore, as we sample (i.e., as we increase n), the
distribution of the standardized sum approaches the normal, proving the Central
Limit Theorem.
15

Charles Marsh (crmarsh@)

Continuous Entropy

Conclusion

Beginning with a definition for continuous entropy, weve shown that the quantity on its own holds little value due to its many shortcomings. While the
definition wason the surfacea seemingly minor notational deviation from the
discrete case, continuous entropy lacks invariance under change of coordinates,
non-negativity, and other desirable quantities that helped motivate the original
definition for Shannon entropy.
But while continuous entropy on its own proved problematic, comparing entropy across continuous distributions (with relative entropy) yielded fascinating
results, both through maximum entropy problems and, interestingly enough,
the information-theoretic proof of the Central Limit Theorem, where the relative entropy of the standardized sum and the normal distribution was shown to
drop to 0 as the sample size grew to infinity.
The applications of continuous information-theoretic techniques are varied;
but, perhaps best of all, they allow us a means of justifying and proving results
with the same familiar, intuitive feel granted us in the discrete realm. An
information-theoretic proof of the Central Limit Theorem makes sense when we
see that the relative entropy of the standardized sum and the normal decreases
over time; similarly, the normal as the maximum entropy distribution for fixed
mean and variance feels intuitive as well. Calling on information theory to
prove and explain these results in the continuous case results in both rigorous
justifications and intuitive explanations.

Appendix
Convolution Increases Entropy. From [9]: Recall that conditioning decreases
entropy. Then, for independent X and Y , we have:
h(X + Y |X) = h(Y |X) = h(Y ) by independence
h(Y ) = h(X + Y |X) h(X + Y )

References
[1] Andrew R. Barron. Monotonic Central Limit Theorem for Densities. Technical report, Stanford University, 1984.
[2] Andrew R. Barron. Entropy and the Central Limit Theorem. The Annals
of Probability, 14:336342, 1986.
[3] Nelson M. Blachman. The Convolution Inequality for Entropy Powes. IEEE
Interactions on Information Theory, pages 267271, April 1965.
[4] R.M. Castro. Maximum likelihood estimation and complexity regularization (lecture notes). May 2011.
16

Charles Marsh (crmarsh@)

Continuous Entropy

[5] Keith Conrad. Probability Distributions and Maximum Entropy.

[6] Natasha Devroye. University of Illinois at Chicago ECE 534 notes on Differential Entropy. 2009.
[7] Justin Domke and Linwei Wang. The Central Limit Theorem (RIT lecture
notes). 2012.
[8] E.T. Jaynes. Information theory and statistical mechanics. Brandeis University Summer Institute Lectures in Theoretical Physics, pages 182218,
1963.
[9] O. Johnson and Y. Suhov. Cambridge University Information and Cdoing
notes. 2006.
[10] Sanjeev Khudanpur. Johns Hopkins University ECE 520.674 notes. 1999.
[11] Hank Krieger. Proof of the Central Limit Theorem (Harvey Mudd College
lecture notes). 2005.
[12] Miller Smith Puckette. Shannon Entropy and the Central Limit Theorem.
PhD thesis, Harvard University, 1986.
[13] Raul Rojas. Why the Normal Distribution? (Freis Universitat Berlin lecture notes). 2010.
[14] Besma Smida. Harvard University ES250 notes on Differential Entropy.
2009.
[15] A.J. Stam. Some Inequalities Satisfied by the Quantities of Information of
Fisher and Shannon. Information and Control, 2:101112, 1959.
[16] Yao Xie. Duke university ECE587 notes on Differential Entropy.

Lab Report 5 CHM138
50% (2)
Lab Report 5 CHM138
7 pages
Multigas Instrument 1438382546BH-4A 英文说明书
No ratings yet
Multigas Instrument 1438382546BH-4A 英文说明书
13 pages
Entropy Post
No ratings yet
Entropy Post
27 pages
Information Theory: 1 Random Variables and Probabilities X
No ratings yet
Information Theory: 1 Random Variables and Probabilities X
8 pages
Bayesian Updating With Continuous Priors Class 13, 18.05 Jeremy Orloff and Jonathan Bloom 1 Learning Goals
No ratings yet
Bayesian Updating With Continuous Priors Class 13, 18.05 Jeremy Orloff and Jonathan Bloom 1 Learning Goals
10 pages
Random Variables PDF
No ratings yet
Random Variables PDF
90 pages
Entropie Eng PDF
No ratings yet
Entropie Eng PDF
6 pages
ent-var-two-rmks
No ratings yet
ent-var-two-rmks
13 pages
1 Heisenberg Uncertainty Principle
No ratings yet
1 Heisenberg Uncertainty Principle
5 pages
A Central Limit Theorem For Convex Sets
No ratings yet
A Central Limit Theorem For Convex Sets
45 pages
Class6 Prep A
No ratings yet
Class6 Prep A
7 pages
Lecture 4 - Inequalities
No ratings yet
Lecture 4 - Inequalities
19 pages
The Binary Entropy Function: ECE 7680 Lecture 2 - Definitions and Basic Facts
No ratings yet
The Binary Entropy Function: ECE 7680 Lecture 2 - Definitions and Basic Facts
8 pages
Balady
No ratings yet
Balady
11 pages
Rousseeuwhubert Highbdmultivariatelocscatter Fests
No ratings yet
Rousseeuwhubert Highbdmultivariatelocscatter Fests
19 pages
MIT18.650. Statistics For Applications Fall 2016. Problem Set 3
No ratings yet
MIT18.650. Statistics For Applications Fall 2016. Problem Set 3
3 pages
5 Continuous Random Variables
No ratings yet
5 Continuous Random Variables
11 pages
mean-variance
No ratings yet
mean-variance
14 pages
All Simulation Lectures
No ratings yet
All Simulation Lectures
41 pages
A Light Discussion and Derivation of Entropy
No ratings yet
A Light Discussion and Derivation of Entropy
4 pages
Elements of Information Theory 2006 Thomas M. Cover and Joy A. Thomas
No ratings yet
Elements of Information Theory 2006 Thomas M. Cover and Joy A. Thomas
16 pages
Hoeffding Bounds
No ratings yet
Hoeffding Bounds
9 pages
Communication Theory and Coding: Basics
No ratings yet
Communication Theory and Coding: Basics
17 pages
Shannon's Theorems: Math and Science Summer Program 2020
No ratings yet
Shannon's Theorems: Math and Science Summer Program 2020
28 pages
Lecture 2: Entropy and Mutual Information: 2.1 Example
No ratings yet
Lecture 2: Entropy and Mutual Information: 2.1 Example
8 pages
Chapter 4
No ratings yet
Chapter 4
6 pages
Mathematical Analysis
No ratings yet
Mathematical Analysis
31 pages
Lecture 3: Fano's, Differential Entropy, Maximum Entropy Distributions
No ratings yet
Lecture 3: Fano's, Differential Entropy, Maximum Entropy Distributions
4 pages
Entropy Uncertainty Final Revision JPA
No ratings yet
Entropy Uncertainty Final Revision JPA
13 pages
Stanford Stats 200
No ratings yet
Stanford Stats 200
6 pages
Asymptotic Statistics (By Changliang ZOU)
No ratings yet
Asymptotic Statistics (By Changliang ZOU)
115 pages
statatics and probability chapter 3 and 4
No ratings yet
statatics and probability chapter 3 and 4
10 pages
Convergence Concepts: 2.1 Convergence of Random Variables
No ratings yet
Convergence Concepts: 2.1 Convergence of Random Variables
6 pages
CoverThomas Ch2 PDF
No ratings yet
CoverThomas Ch2 PDF
38 pages
Entropy, Relative Entropy and Mutual Information
No ratings yet
Entropy, Relative Entropy and Mutual Information
38 pages
Archive of SID: Information Indices in The Theory of Fuzzy Sets
No ratings yet
Archive of SID: Information Indices in The Theory of Fuzzy Sets
9 pages
13 Aihp548
No ratings yet
13 Aihp548
52 pages
Probability Theory Nate Eldredge
No ratings yet
Probability Theory Nate Eldredge
65 pages
Unit-14 IGNOU STATISTICS
No ratings yet
Unit-14 IGNOU STATISTICS
13 pages
Introductory Probability and The Central Limit Theorem
No ratings yet
Introductory Probability and The Central Limit Theorem
11 pages
Handout2 BasicsOf Random Variables
No ratings yet
Handout2 BasicsOf Random Variables
3 pages
Kunneth Formulas Allen Yuan
No ratings yet
Kunneth Formulas Allen Yuan
6 pages
MIT14 30s09 Lec17
No ratings yet
MIT14 30s09 Lec17
9 pages
Empirical Processes
No ratings yet
Empirical Processes
47 pages
s10884-004-4286-0 (1)
No ratings yet
s10884-004-4286-0 (1)
7 pages
CPD
No ratings yet
CPD
5 pages
2540. 重The Information Bottleneck Method -2000
No ratings yet
2540. 重The Information Bottleneck Method -2000
11 pages
Elements of Mathematical Analysis Author Harald Hanche-Olsen
No ratings yet
Elements of Mathematical Analysis Author Harald Hanche-Olsen
31 pages
R300 Advanced Econometrics Methods Lecture Slides
No ratings yet
R300 Advanced Econometrics Methods Lecture Slides
362 pages
revised (1)
No ratings yet
revised (1)
24 pages
Non Asymptotic Information
No ratings yet
Non Asymptotic Information
9 pages
Basic Statistics For Lms
0% (1)
Basic Statistics For Lms
23 pages
Chapter1p2
No ratings yet
Chapter1p2
57 pages
07-BEJ6114
No ratings yet
07-BEJ6114
27 pages
STAT0009 Introductory Notes
No ratings yet
STAT0009 Introductory Notes
4 pages
Quotient Topology: S. Kumaresan School of Math. and Stat. University of Hyderabad Hyderabad 500046
No ratings yet
Quotient Topology: S. Kumaresan School of Math. and Stat. University of Hyderabad Hyderabad 500046
8 pages
The Uniform Distributn
No ratings yet
The Uniform Distributn
7 pages
Lecture 5 - Dirac Delta Functions, Characteristic Functions, and The Law of Large Numbers
No ratings yet
Lecture 5 - Dirac Delta Functions, Characteristic Functions, and The Law of Large Numbers
3 pages
8 Structure Formation: 8.1 Inhomogeneity
No ratings yet
8 Structure Formation: 8.1 Inhomogeneity
67 pages
Lec 6
No ratings yet
Lec 6
7 pages
Elgenfunction Expansions Associated with Second Order Differential Equations
From Everand
Elgenfunction Expansions Associated with Second Order Differential Equations
E. C. Titchmarsh
No ratings yet
Theory of Approximation
From Everand
Theory of Approximation
N. I. Achieser
No ratings yet
Presented By: Jackie Nicholas Mathematics Learning Centre
No ratings yet
Presented By: Jackie Nicholas Mathematics Learning Centre
9 pages
GFC Tutorial
No ratings yet
GFC Tutorial
19 pages
Rules of Differentiation and Their Use in Comparative Statics - Alpha Chiang
No ratings yet
Rules of Differentiation and Their Use in Comparative Statics - Alpha Chiang
24 pages
VIRTIS: The Visible and Infrared Thermal Imaging Spectrometer
No ratings yet
VIRTIS: The Visible and Infrared Thermal Imaging Spectrometer
27 pages
Argmax Over Continuous Indices of Random Variables An Approach Us-Ing Random Fields
No ratings yet
Argmax Over Continuous Indices of Random Variables An Approach Us-Ing Random Fields
34 pages
BARBEL Entropy
No ratings yet
BARBEL Entropy
14 pages
MPRA Paper 10963 - Cópia
No ratings yet
MPRA Paper 10963 - Cópia
49 pages
11-01 - Cópia
No ratings yet
11-01 - Cópia
61 pages
Spanish Alphabets and Phonetics
No ratings yet
Spanish Alphabets and Phonetics
3 pages
g11 Plumbing
No ratings yet
g11 Plumbing
23 pages
8 Greeks
No ratings yet
8 Greeks
5 pages
DIN standreds (flender)
No ratings yet
DIN standreds (flender)
10 pages
Lab3 Sei
No ratings yet
Lab3 Sei
10 pages
Low Glycemic Index Guide
67% (3)
Low Glycemic Index Guide
24 pages
Carabusul de Aur
No ratings yet
Carabusul de Aur
28 pages
Instant download (Ebook) Antennas for All Applications by John D. Kraus, Ronald J. Marhefka ISBN 9780072321036, 9780071122405, 0072321032, 0071122400 pdf all chapter
100% (4)
Instant download (Ebook) Antennas for All Applications by John D. Kraus, Ronald J. Marhefka ISBN 9780072321036, 9780071122405, 0072321032, 0071122400 pdf all chapter
86 pages
JMGNance Care Sheet
No ratings yet
JMGNance Care Sheet
1 page
Silikophen P 80X
No ratings yet
Silikophen P 80X
1 page
Practical A5 Stoke' S Law and Viscosity of Oil.
No ratings yet
Practical A5 Stoke' S Law and Viscosity of Oil.
7 pages
Sellos
No ratings yet
Sellos
40 pages
Skateboard Truck
100% (1)
Skateboard Truck
14 pages
1019dahmk2 Spec Sheet
No ratings yet
1019dahmk2 Spec Sheet
1 page
The Representation of Female Body
No ratings yet
The Representation of Female Body
10 pages
Glovebox Guide To Evs Esf
No ratings yet
Glovebox Guide To Evs Esf
20 pages
Vor Vir-351
No ratings yet
Vor Vir-351
3 pages
Analizador de Redes Mpr52s
No ratings yet
Analizador de Redes Mpr52s
34 pages
AWS Certified Solutions Architect - Associate SAA-C03 Exam - Free Exam Q&as, Page 1 - ExamTopics
100% (3)
AWS Certified Solutions Architect - Associate SAA-C03 Exam - Free Exam Q&as, Page 1 - ExamTopics
449 pages
Lesson 21: Incomplete Adverb
No ratings yet
Lesson 21: Incomplete Adverb
7 pages
Emaillog
No ratings yet
Emaillog
1,365 pages
Download Complete Designing with the Mind in Mind Simple Guide to Understanding User Interface Design Guidelines 2nd Edition Johnson PDF for All Chapters
No ratings yet
Download Complete Designing with the Mind in Mind Simple Guide to Understanding User Interface Design Guidelines 2nd Edition Johnson PDF for All Chapters
38 pages
Bts SN 2019 E4 Documents Techniques Snir
No ratings yet
Bts SN 2019 E4 Documents Techniques Snir
25 pages
Applications of Binomial Distribution
No ratings yet
Applications of Binomial Distribution
3 pages
Bilston Urban Village SPD Complete Document
No ratings yet
Bilston Urban Village SPD Complete Document
74 pages
Frequently Used Equations - The Physics Hypertextbook
No ratings yet
Frequently Used Equations - The Physics Hypertextbook
4 pages
1st Periodical Exam Midterm
No ratings yet
1st Periodical Exam Midterm
9 pages
Cosmic Dialogues 2
No ratings yet
Cosmic Dialogues 2
8 pages