1. Introduction
In many areas of science, it is of primary importance to assess the “randomness” of a certain random variable X. That variable could represent, for example, a cryptographic key, a signature, some sensitive data, or any type of intended secret. For simplicity, we assume that X is an M-ary discrete random variable, taking values in a finite alphabet of size M, with known probability distribution (in short, p).
Depending on the application, many different criteria can be used to evaluate randomness. Some are information-theoretic, others are related to detection/estimation theory or to hypothesis testing. We review the most common ones in the following subsections.
1.1. Entropy
A “sufficiently random”
X is often described as “entropic” in the literature. The usual notion of entropy is the Shannon entropy [1]
$H(X) = -\sum_{k=1}^{M} p_k \log p_k \qquad (1)$
which is classically thought of as a measure of “uncertainty”. It has, however, an operational definition in the fields of data compression or source coding. The problem is to find the binary description of X with the shortest average description length or “coding rate”.
Note that the base of the logarithm is not specified in (1). As with all information-theoretic quantities, the choice of the base determines the unit of information. Logarithms of base 2 give binary units (bits) or Shannons (Sh). Logarithms of base 10 give decimal units (dits) or Hartleys. Natural logarithms (base e) give natural units (nats).
This compression problem can be seen as equivalent to a “game of 20 questions” (§ 5.7.1 in [2]), where a binary codeword for
X is identified as a sequence of answers to yes–no questions about
X that uniquely identifies it. There is no limitation on the type of questions asked, except that they must be answered by yes (1) or no (0). The goal of the game is to minimize the average number of questions, which is equal to the coding rate. It is well known, since Shannon [
1], that the entropy
is a lower bound on the coding rate that can be achieved asymptotically for repeated descriptions.
In this perspective, entropy is a natural measure of efficient (lossless) compression rate. A highly random variable (with high entropy) cannot be compressed too much without losing information: “random” means “hard to compress”.
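For concreteness, here is a minimal Python sketch of this quantity (the function name and the example distribution are illustrative choices, not taken from the text): the entropy of a biased distribution falls short of the maximum log M attained by the uniform one.

```python
import math

def shannon_entropy(p, base=2):
    """Shannon entropy: -sum_k p_k log p_k (terms with p_k = 0 contribute 0)."""
    return -sum(pk * math.log(pk, base) for pk in p if pk > 0)

# Example: a biased 4-ary distribution vs. the uniform one.
p = [0.7, 0.15, 0.1, 0.05]
print(shannon_entropy(p))           # ~1.32 bits: compressible below 2 bits per symbol on average
print(shannon_entropy([0.25] * 4))  # 2.0 bits = log2(4), the maximum for M = 4
```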
1.2. Guessing Entropy
Another perspective arises in cryptography when one wants to guess a secret key. The situation is similar to the “game of 20 questions” of the preceding subsection. The difference is that the only possibility is to actually try out one possible key hypothesis at a time. In other words, yes–no questions are restricted to be of the form “is
X equal to
x?” until the correct value has been found. The optimal strategy that minimizes the average number of questions is to guess the values of
X in order of decreasing probabilities: first, the value with maximum probability
, then the second maximum
, and so on. The corresponding minimum average number of guesses is the guessing entropy [
3] (also known as “guesswork” [
4]):
Massey [
3] has shown that the guessing entropy
G is exponentially increasing as entropy
H increases. A recent improved inequality is [
5,
6]
. It is sometimes convenient to use
instead of
G, to express it in the same logarithmic unit of information as entropy
H.
In this perspective, a highly random variable (with high guessing entropy) cannot be guessed rapidly: “random” means “hard to guess”.
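The optimal strategy described above translates directly into a short computation; a minimal sketch (illustrative names and numbers):

```python
def guessing_entropy(p):
    """Average number of guesses when trying values in decreasing order of probability:
    G = sum_k k * p_(k), with p_(1) >= p_(2) >= ... the sorted probabilities."""
    return sum(k * pk for k, pk in enumerate(sorted(p, reverse=True), start=1))

p = [0.7, 0.15, 0.1, 0.05]
print(guessing_entropy(p))           # 0.7*1 + 0.15*2 + 0.1*3 + 0.05*4 = 1.5 guesses on average
print(guessing_entropy([0.25] * 4))  # uniform case: (M + 1) / 2 = 2.5 guesses
```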
1.3. Coincidence or Collision
Another perspective is to view
X as a (publicly available) “identifier”, “fingerprint” or “signature” obtained by a randomized algorithm from some sensitive data. In such a scheme, to prevent “collision attacks”, it is important to ensure that
X is “unique” in the sense that there is only a small chance that another independent
obtained by the same randomized algorithm coincides with
X. Since
X and
are i.i.d., the “index of coincidence”
should be as small as possible, that is, the complementary quantity (sometimes called quadratic entropy [
7]):
should be as large as possible. In the context of hash functions, this is called “universality” (Chapter 8 in [
8]). The corresponding logarithmic measure is known as the collision entropy (Rényi entropy [
9] of order 2, also known as quadratic entropy [
10]):
which should also be as large as possible. By concavity of the logarithm,
, that is,
; hence, high collision entropy implies high entropy.
In this perspective, a highly random variable (with high collision entropy) cannot be found easily by coincidence: “random” means “unique” or “hard to collide”.
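A minimal sketch of these two quantities (our own naming), where the collision entropy is the negative logarithm of the index of coincidence:

```python
import math

def index_of_coincidence(p):
    """Probability that two i.i.d. draws from p coincide: sum_k p_k^2."""
    return sum(pk ** 2 for pk in p)

def collision_entropy(p, base=2):
    """Renyi entropy of order 2: -log sum_k p_k^2."""
    return -math.log(index_of_coincidence(p), base)

p = [0.7, 0.15, 0.1, 0.05]
print(index_of_coincidence(p))  # 0.525: a collision occurs more than half of the time
print(collision_entropy(p))     # ~0.93 bits, below the Shannon entropy of the same distribution
```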
1.4. Estimation Error
In estimation or detection theory, one observes some disclosed data which may depend on
X and tries to estimate
X from the observation. The best estimator
minimizes the probability of error,
. Therefore, given the observation, the best estimate is the value
x with highest probability
, and the minimum probability of error is written:
If
X is meant to be kept secret, then this probability of error should be as large as possible. The corresponding logarithmic measure is known as the min-entropy:
which should also be as large as possible. It is easily seen that
; hence, high min-entropy implies high entropy in all the previous senses.
In this perspective, a highly random variable (with high min-entropy) cannot be efficiently estimated: “random” means “hard to estimate” or “hard to detect”.
Figure 1 illustrates various randomness measures for a binary distribution.
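In the blind case (no observation), the error probability and the min-entropy reduce to simple functions of the maximum probability; a minimal sketch (illustrative names and numbers):

```python
import math

def error_probability(p):
    """Minimum probability of error of a blind guess: 1 - max_k p_k."""
    return 1 - max(p)

def min_entropy(p, base=2):
    """Min-entropy: -log max_k p_k."""
    return -math.log(max(p), base)

p = [0.7, 0.15, 0.1, 0.05]
print(error_probability(p))  # 0.3
print(min_entropy(p))        # ~0.51 bits, never larger than the collision or Shannon entropy
```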
1.5. Some Generalizations
One can generalize the above concepts in multiple ways. We only mention a few.
The
-entropy, or Rényi entropy of order
, is defined as follows [
9]:
where
is the “
-norm” (strictly speaking,
is a norm only when
). The Shannon entropy
is recovered in the limiting case
, the collision entropy
is recovered in the case
, and the min-entropy
is recovered in the limiting case
.
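A minimal sketch of the α-entropy with its three notable special cases handled explicitly (the function name is ours; the cases α = 1 and α = ∞ are hard-coded rather than obtained as limits):

```python
import math

def renyi_entropy(p, alpha, base=2):
    """Renyi entropy of order alpha: log(sum_k p_k^alpha) / (1 - alpha) for alpha != 1."""
    if alpha == 1:            # limiting case: Shannon entropy
        return -sum(pk * math.log(pk, base) for pk in p if pk > 0)
    if math.isinf(alpha):     # limiting case: min-entropy
        return -math.log(max(p), base)
    return math.log(sum(pk ** alpha for pk in p if pk > 0), base) / (1 - alpha)

p = [0.7, 0.15, 0.1, 0.05]
for a in (0.5, 1, 2, 10, math.inf):
    print(a, renyi_entropy(p, a))  # nonincreasing in alpha for a fixed distribution
```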
The
-guessing entropy, or guessing moment [
11] of order
, is defined as the minimum
th-order moment of the number of guesses needed to find
X. The same optimal strategy as for the guessing entropy yields the following:
which generalizes
for
. Arikan [
11] has shown that
behaves asymptotically as
. In particular,
behaves asymptotically as the ½-entropy
.
In some cryptographic scenarios, one has the ability to estimate or guess
X in a given maximum number
m of tries. The corresponding error probability takes the form
. The same optimal strategy as for guessing entropy
yields an error probability of order
m:
which generalizes
for
.
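Both generalizations follow from the same decreasing-probability ordering; a minimal sketch (names and numbers are illustrative):

```python
def guessing_moment(p, rho):
    """Minimum rho-th moment of the number of guesses: sum_k k^rho * p_(k), probabilities sorted decreasingly."""
    return sum((k ** rho) * pk for k, pk in enumerate(sorted(p, reverse=True), start=1))

def error_probability_m(p, m):
    """Minimum probability of not finding X within m tries: 1 minus the sum of the m largest probabilities."""
    return 1 - sum(sorted(p, reverse=True)[:m])

p = [0.7, 0.15, 0.1, 0.05]
print(guessing_moment(p, rho=1))                       # 1.5: the guessing entropy G is recovered for rho = 1
print(guessing_moment(p, rho=2))                       # 3.0: second-order guessing moment
print([error_probability_m(p, m) for m in (1, 2, 3)])  # decreases toward 0 as m approaches M
```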
One obtains similar randomness measures by replacing
p with its “negation”
, as explained in [
12].
1.6. “Distances” to the Uniform
A fairly common convention is that, when we “draw X at random”, we sample it according to a uniform distribution unless otherwise explicitly indicated. Thus, the uniform distribution u, in which all outcomes are equally likely (each of the M values has probability 1/M), is considered the ideal randomness.
From this viewpoint, a variable X with distribution p should be all the more “random” as p is “close to uniform”: randomness can be measured as some complementary “distance” from p to the uniform u, in the form, say, , where “distance” d has maximum value . Such a “distance” need not obey all the axioms of a mathematical distance, but should at least be nonnegative and vanish only when p = u.
Many of the above entropic criteria fall into this category. For example:
where
denotes the (Kullback–Leibler) divergence (or “distance”). More generally:
where
denotes the (Rényi)
-divergence [
13].
In the particular case
, since
, the complementary index of coincidence
—hence, the collision entropy
—is also related to the squared 2-norm distance to the uniform:
It follows that the 2-norm distance is related to the 2-divergence by the formula
(see, e.g., Lemma 3 in [
14]).
Similarly, in the particular case
, one can write
, where
is a complementary quantity of the squared
Hellinger distance , which is related to the
-divergence by the formula
.
Another important example is given next.
1.7. Statistical Distance to the Uniform
Suppose one wants to design a statistical experiment to determine whether X follows either distribution p (null hypothesis ) or another distribution q (alternative hypothesis). Any statistical test takes the form “is ?”: if yes, then accept ; otherwise, reject it. Type-I and type-II errors have total probability , where , are the probability measures corresponding to p and q, respectively. Clearly, if is small enough, the two hypotheses p and q are indistinguishable in the sense that decision errors have total probability arbitrarily close to 1.
The statistical (total variation) distance (§ 8.8 in [8]) is defined as follows:
where the 1/2 factor is present to ensure that the distance is at most 1. The maximum in the definition of the statistical distance:
is attained for any event
, satisfying the following:
The statistical distance is particularly important from a hypothesis testing viewpoint, since, as we have just seen, a very small distance
ensures that no statistical test can distinguish the two hypotheses
p and
q.
Following the discussion of the preceding subsection, we can define “statistical randomness” as the complementary value of the statistical distance
between
p and the uniform distribution
u. Therefore, if
is uniform and letting
, then
has maximum value
and statistical randomness can be defined as follows:
This is similar to (
12), where half the 1-norm is used in place of the squared 2-norm.
From the hypothesis testing perspective, it follows that a high statistical randomness R ensures that no statistical test can effectively distinguish between the actual distribution and the uniform. This is, for example, the usual criterion used to evaluate randomness extractors in cryptology. Since equiprobable values are the least predictable, a highly random variable cannot be easily statistically predicted: “random” means “hard to predict”.
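A minimal sketch (our own naming), assuming, as suggested by the discussion above, that the statistical randomness is the complementary value (1 − 1/M) − Δ(p, u) of the statistical distance to the uniform:

```python
def statistical_distance(p, q):
    """Total variation distance: half the 1-norm of p - q."""
    return 0.5 * sum(abs(pk - qk) for pk, qk in zip(p, q))

def statistical_randomness(p):
    """Complementary value of the statistical distance to the M-ary uniform distribution."""
    M = len(p)
    return (1 - 1 / M) - statistical_distance(p, [1 / M] * M)

p = [0.7, 0.15, 0.1, 0.05]
print(statistical_distance(p, [0.25] * 4))  # 0.45
print(statistical_randomness(p))            # 0.75 - 0.45 = 0.30
```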
1.8. Conditional Versions
In many applications, the randomness of
X is evaluated after observing some disclosed data or side information
Y. The observed random variable
Y can model any type of data and is not necessarily discrete. The conditional probability distribution of
X having observed
is denoted by
to distinguish it from the unconditional distribution
(without side information). By the law of total probability
,
is recovered by averaging all conditional distributions:
where
denotes the expectation operator over
Y.
The “conditional randomness” of
X given
Y can then be defined as the average randomness measure of
over all possible observations, that is, the expectation over
Y of all randomness measures of
. For example, Shannon’s conditional entropy or equivocation [
1] is given by the following:
Similarly:
gives the average minimum number of guesses to find
X after having observed
Y. Additionally:
gives the average probability of non-collision to identify
X upon observation of
Y, and
gives the minimum average probability of error, as achieved by the maximum a posteriori (MAP) decision rule. The “conditional statistical randomness” is likewise defined as shown:
For the generalized quantities of
Section 1.5, the conditional
-guessing entropy is given by the following:
and the conditional
mth-order probability of error is as below:
For
-entropy, however, many different definitions of conditional
-entropy have been proposed in the literature [
15]. The preferred choice for most applications seems to be Arimoto’s definition [
16]:
where the expectation over
Y is taken on the
-norm inside the logarithm and not outside. Shannon’s conditional entropy
is recovered in the limiting case
. One nice property of Arimoto’s definition is that it is compatible with that of
in the limiting case
, since the relation
of (
6) naturally extends to conditional quantities:
Notice that for any order
, Arimoto’s definition can be rewritten as a simple expectation of
instead of
:
where
is the increasing function, defined as follows:
The requirement that
is increasing is important in the following. The signum term was introduced so that
is increasing, not only for
, but also for
. The exponential function exp is assumed to be of the same base as the logarithm (exp x = 2^x for x in bits, 10^x in dits, e^x in nats). In what follows, we interchangeably refer to
or
.
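As an illustration of Arimoto’s definition, here is a minimal sketch (the dictionary encoding of the joint distribution, the function names and the numbers are our own choices): the expectation over Y of the α-norm of the conditional distribution is taken inside the logarithm, with the prefactor α/(1 − α) of the standard Arimoto form.

```python
import math

def alpha_norm(p, alpha):
    """(sum_k p_k^alpha)^(1/alpha); a true norm only for alpha >= 1."""
    return sum(pk ** alpha for pk in p) ** (1 / alpha)

def arimoto_conditional_entropy(joint, alpha, base=2):
    """Arimoto's conditional alpha-entropy (alpha != 1).
    `joint` maps each value y to the list of joint probabilities P(X = x, Y = y)."""
    expected_norm = 0.0
    for pxy in joint.values():
        py = sum(pxy)                    # marginal P(Y = y)
        cond = [v / py for v in pxy]     # conditional distribution of X given Y = y
        expected_norm += py * alpha_norm(cond, alpha)
    return (alpha / (1 - alpha)) * math.log(expected_norm, base)

# Hypothetical joint distribution of a 3-ary X and a binary Y.
joint = {0: [0.3, 0.1, 0.1], 1: [0.05, 0.05, 0.4]}
print(arimoto_conditional_entropy(joint, alpha=2))  # conditional collision entropy, in bits
```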
1.9. Aim and Outline
The enumeration in the preceding subsections is by no means exhaustive. Every subfield or application has its preferred criterion, either information/estimation theoretic or statistical, conditioned on some observations or not. Clearly, all these randomness measures share many properties.
Therefore, a natural question is to determine a (possibly minimal) set of properties that characterize all possible randomness measures. Many axiomatic approaches have been proposed for entropy [
1,
17],
-entropy [
9], information leakage [
18] or conditional entropy [
19,
20].
Extending the work in [
21],
Section 2 presents a simple alternative, which naturally encompasses all common randomness measures
H,
,
G,
,
,
,
and
R, based on two natural axioms:
Many properties, shared by all randomness measures described above, are deduced from these two axioms.
Another important issue is to study the relationship between randomness measures, by establishing the exact locus or joint range of two such measures among all probability distributions with tight lower and upper bounds. In this paper, extending the presentation made in [
21], we establish the optimal bounds relating information-theoretic (e.g., entropic) quantities on one hand and statistical quantities (probability of error and statistical distance) on the other hand.
Section 3 establishes general optimal Fano and reverse-Fano inequalities, relating any randomness measure to the probability of error. This generalizes Fano’s original inequality [
22]
, which has become ubiquitous in information theory (e.g., to derive converse channel coding theorems) and in statistics (e.g., to derive lower bounds on the maximum probability of error in multiple hypothesis testing).
Section 4 establishes general optimal Pinsker and reverse-Pinsker inequalities, relating any randomness measure to the statistical randomness or the statistical distance to the uniform. Generally speaking, Pinsker and reverse-Pinsker inequalities relate some divergence measure (e.g.,
or
) between two distributions to their statistical distance
. Here, following the discussion in
Section 1.6, we restrict ourselves to the divergence or distance to the uniform distribution
. (For the general case of arbitrary distributions
see, e.g., the historical perspective on Pinsker–Schützenberger inequalities in [
23].). In this context, we improve the well-known Pinsker inequality [
24,
25], which reads
. This inequality, of more general applicability for any distributions
, is no longer optimal in the particular case
.
Finally,
Section 5 lists some applications in the literature, and
Section 6 gives some research perspectives.
2. An Axiomatic Approach
Let X be any M-ary random variable with distribution . How should a measure of “randomness” of X be defined in general? To simplify the discussion, we assume that is nonnegative.
As advocated by Shannon [
26], such a notion should not depend on the particular “reversible encoding” of
X. In other words, any two equivalent random variables should have the same measure
, where equivalence is defined as follows.
Definition 1 (Equivalent Variables)
. Two random variables X and Y are equivalent if there exist two mappings f and g such that Y = f(X) a.s. (almost surely, i.e., with probability one) and X = g(Y) a.s.
Remark 1 (Equivalent Measures)
. Obviously, it is also essentially equivalent to study or , for example, or any quantity of the form , where is any increasing (invertible) function.
Definition 2 (Conditional Randomness)
. Given any random variable Y, the conditional form of is defined as follows: where (or ) denotes the random variable X, conditioned on the event . This quantity represents the average amount of randomness of X knowing Y. Remark 2 (Equivalent Conditional Measures)
. Again, it is essentially equivalent to study or , where is any increasing function. One may, therefore, generalize the notion of conditional randomness by writing in place of (31), the same as (29) for α-entropy. However, in the sequel, we stay with the basic Definition 2 and simply assume that is considered instead of whenever it is convenient to do so. In what follows, we study the implications of only two axioms:
Axiom 1 (Equivalence)
.
Axiom 2 (Knowledge Reduces Randomness)
. We find such postulates quite intuitive and natural. First, equivalent random variables should be equally random. Second, knowledge of some side observation should, on average, reduce randomness.
All randomness quantities described in
Section 1 obviously satisfy Axiom 1. That they also satisfy Axiom 2 is shown in the following examples.
Example 1 (Entropies)
. For Shannon’s entropy H, the inequality is well known (Thm. 2.6.5 in [2]). This is often paraphrased as “conditioning reduces entropy”, “knowledge reduces uncertainty” or “information can’t hurt”. The difference is the mutual information, which is always nonnegative. Inequality is also known to hold for any , see [15,16] and Example 4 below. Example 2 (Guessing Entropies)
. Axiom 2 for the guessing entropies G or can be easily checked from their definition, as follows.
Let be any random variable giving the number of guesses needed to find X in any guessing strategy. N is equivalent to X (Definition 1) since every value of N corresponds to a unique value of X, and vice versa. By definition, , where the minimum is over all possible equivalent to X (corresponding to all possible strategies). Now, , by the law of total expectation. Taking the minimum over gives , which is Axiom 2.
The case was already shown in [27]. The result is quite intuitive: any side information Y can only improve the guess of X. Example 3 (Error Probabilities)
. Axiom 2 for the error probability follows from the corresponding inequality for (see (28) and Example 1 for ), but it can also be checked directly from its definition, as well as in the case of of order m, as follows. The mth order error probability is , i.e., the minimum probability that X is not equal to any of the m first estimates . Then, , by the law of total probability, for every sequence . Taking the minimum over such sequences gives , which is Axiom 2.
The case was already shown, e.g., in [27]. Again, the result is quite intuitive: any side information Y can only improve the estimation of X.
2.1. Symmetry and Concavity
We now rewrite Axioms 1 and 2 as equivalent conditions on probability distributions.
Definition 3 (Probability “Simplex”)
. Let be the set of all sequences of nonnegative numbers:such that the following are satisfied: Notice that has infinite dimension even though only a finite number of components are nonzero in every . Thus, any can be seen as the probability distribution of M-ary random variables with arbitrary large M.
Theorem 1 (Symmetry)
. Axiom 1 is equivalent to the condition that is a symmetric function of , identified as the probability distribution of X.
Proof. Let be the finite set (“alphabet”) of all values taken by , and let f be an injective mapping from to , whose image is a finite subset of . From Definition 1, X is equivalent to , with probabilities . Then, by Axiom 1, does not depend on the particular values of but only on the corresponding probabilities, so that , where is identified to . Now, letting h be any bijection (permutation) of , Axiom 1 implies that does not depend on the ordering of the s, that is, is a symmetric function of p. Conversely, any bijection applied to X can only change the ordering of the s in , which leaves as invariant. □
Accordingly, it is easily checked directly that all expressions, in terms of the probability distribution p, of the randomness measures given in
Section 1 are symmetric in p.
Remark 3. Some authors [17] define as the union of all for , where is the M-simplex . With this viewpoint, even when the expression of does not explicitly depend on M, one has to define separately for all different values of M as a function , defined over , and further impose the compatibility condition that , as in [17] (this is called “expansibility” in [20]). Such expansibility condition is unnecessary to state explicitly in our approach: it is an obvious consequence of an appropriate choice of f in Definition 1, namely, the injective embedding of into .
Theorem 2 (Concavity)
. Axiom 2 is equivalent to the condition that is concave in p.
Proof. Using the notations of Theorem 1, Definition 2 and (
19), Axiom 2 can be rewritten as shown:
This is exactly Jensen’s inequality for concave functions on the convex “simplex”
. □
Remark 4 (
-Concavity)
. Similarly as in Remark 2, we may consider in place of in the definition of conditional randomness, where is any increasing function. Then, by Theorem 2, is concave, that is, is a φ-concave function of p (for example, for , one recovers the usual definition of a log-concave function). This is called “core-concavity” in [20]. Example 4 (Symmetric Concave Measures)
. All randomness measures of Examples 1–3 satisfy both Axioms 1 and 2, and are, therefore, symmetric concave in p. This can also be checked directly from certain closed-form expressions given in Section 1: Shannon’s entropy H, as well as the complementary index of coincidence , can be written in the form , where r is a strictly concave function. Thus, both are symmetric and strictly concave in p;
Statistical randomness can also be written in this form, where is concave in . Thus, is also symmetric concave and, therefore, is also an acceptable randomness measure satisfying Axioms 1 and 2;
For α-entropy, consider where is the increasing function (30). It is known that the α-norm is strictly convex for finite (by Minkowski’s inequality) and strictly concave for (by the reverse Minkowski inequality). Thus, α-entropy is symmetric and (strictly) -concave in the sense of Remark 4. Therefore, one finds anew that it satisfies Axioms 1 and 2.
Corollary 1 (Mixing Increases Randomness)
. Let be any two probability distributions and consider the “mixed” distribution , where , , and . Then:In particular, mixing two equally random distributions results in a “more random” distribution: . Proof. Immediate from the concavity of . □
Example 5. The mixing property of the Shannon entropy H is well known (Thm. 2.7.3 in [2]). A well-known thermodynamic interpretation is that mixing two gases of equal entropy results in a gas with higher entropy.
2.2. Basic Properties in Terms of Random Variables
In terms of random variables, one can deduce the following properties.
Corollary 2 (Consistency)
. If X is independent of Y, then . In particular, let 0 denote any deterministic variable (by Definition 1, any deterministic random variable is equivalent to the constant 0). Then: Thus “absolute” (unconditional) randomness can be recovered as a special case of conditional randomness.
Proof. If X and Y are independent, then for (almost) any y, so that . In particular, X and 0 are always independent. □
Remark 5 (Strict Concavity)
. A randomness measure is “strictly concave” in p if Jensen’s inequality (34) holds with equality only when for almost all y. This can be stated in terms of random variables as follows. For any strictly concave random measure , (32) is strict unless independence holds: Example 6 (Strictly Concave Measures)
. As already seen in Example 4, entropy H, all α-entropies for finite and are strictly concave.
In particular, for entropy, if and only if X and Y are independent. This is well known since the mutual information vanishes only in the case of independence [2] (p. 28). More generally, for α-entropy, if and only if X and Y are independent. Guessing entropy G, or, more generally, ρ-guessing entropy , is not strictly concave in p. For example, is linear in .
Corollary 3 (Additional Knowledge Reduces Randomness)
. Inequality (32) is equivalent to the following:for any . Proof. Inequality (
32) applied to
and Z for fixed y gives
. Taking the expectation over Y of both sides yields the announced inequality. Conversely, letting
, one obtains
, which is (
32). □
Corollary 4 (Data Processing Inequality: Processing Knowledge Increases Randomness)
. For any Markov chain (i.e., such that ), one has the following:This property is equivalent to (32). Proof. Since
for (almost) any z, one has
, which, from Corollary 3, is
. Conversely, letting
, one recovers (
32). □
Example 7 (Data Processing Inequalities)
. For entropy H, the property amounts to , i.e., (post-)processing in the Markov chain can never increase information (§ 2.8 in [2]). The data processing inequality for and G was already shown in [27].
2.3. Equalization (Minorization) via Robin Hood Operations
We now turn to another type of “mixing” probability distributions which are sometimes known as Robin Hood operations. To quote Arnold [
28]:
“When Robin and his merry hoods performed an operation in the woods they took from the rich and gave to the poor. The Robin Hood principle asserts that this decreases inequality (subject only to the obvious constraint that you don’t take too much from the rich and turn them into poor.)”
Definition 4 (Robin Hood operations [
28])
. An elementary
“Robin Hood” operation in modifies only two probabilities () in such a way that . A (general) “Robin Hood operation” results from a finite sequence of elementary Robin Hood operations. Notice that in an elementary Robin Hood operation, the sum
should remain the same, since p and q are probability distributions. The fact that
decreases “increases equality”, i.e., makes the probabilities more equal. This can be written as follows:
provided that (“you don’t take too much from the rich and turn them into poor”). Setting , (40) can be easily rewritten in the form:
where
,
and
.
Remark 6 (Increasing Probability Product)
. In any elementary Robin Hood operation , the product:always increases, with equality if and only if either or 1, or else . This equality condition boils down to , that is, the unordered set is unchanged. Therefore, in any general Robin Hood operation, the product of all modified probabilities always increases, unless the probability distribution is unchanged (up to the order of the probabilities).
Remark 7 (Inverse Robin Hood Operation)
. One can also define a “Sheriff of Nottingham” operation as an inverse Robin Hood operation, resulting from a finite sequence of elementary Sheriff of Nottingham operations of the form , where . Increasing the quantity “increases inequality”, i.e., makes the probabilities more unequal.
Definition 5 (Equalization Relation)
. We write (“X is equalized by Y”) if can be obtained from by a Robin Hood operation. Such operation “equalizes” in the sense that is “more equal” or “more uniform” than . In terms of distributions, we also write . Equivalently, can be obtained from by a Sheriff of Nottingham operation ( is more unequal than ). We may also write or .
Remark 8 (Generalization)
. The above definitions hold verbatim for any vector or finitely many nonnegative numbers with a fixed sum (not necessarily equal to one). In the following, we sometimes use the concept of “equalization” in this slightly more general context.
Remark 9 (Minorization)
. amounts to saying that “majorizes” in majorization theory [28,29]. So, in fact, the equalization relation ⪯ is a “minorization”—the opposite of a majorization. Unfortunately, it is common in majorization theory to write “” when X “majorizes” Y, instead of when Y is “more equal” than X. Arguably, the notation adopted in this paper is more convenient, since it follows the usual relation order between randomness measures such as entropy. Also notice that the present approach avoids the use of Lorenz order [28,29] and focuses on the more intuitive Robin Hood operations. Remark 10 (Partial Order)
. It is easily seen that ⪯ is a partial order on the set of (finitely valued) discrete random variables (considering two variables “equal” if they are equivalent in the sense of Definition 1). Indeed, reflexivity and transitivity are immediate from the definition, and antisymmetry is, e.g., an easy consequence of Remark 6: if and , then the product of all modified probabilities of X cannot increase by the two combined Robin Hood operations. Therefore, should be the same as up to order; hence, .
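To anticipate the Schur concavity established in Theorem 3 below, here is a minimal sketch of an elementary Robin Hood operation (function names and numbers are ours): equalizing a distribution in this way can only increase a symmetric concave randomness measure such as the Shannon entropy.

```python
import math

def shannon_entropy(p):
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

def robin_hood(p, i, j, delta):
    """Elementary Robin Hood operation: take delta from the richer index i and give it to the
    poorer index j, without turning the rich into the poor (p[i] - delta >= p[j] + delta)."""
    assert p[i] >= p[j] and 0 <= delta <= (p[i] - p[j]) / 2
    q = list(p)
    q[i] -= delta
    q[j] += delta
    return q

p = [0.7, 0.15, 0.1, 0.05]
q = robin_hood(p, 0, 3, 0.2)                     # q = [0.5, 0.15, 0.1, 0.25] is "more equal" than p
print(shannon_entropy(p) <= shannon_entropy(q))  # True: equalization increases randomness
```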
The following fundamental lemmas establish expressions for maximally equal and unequal distributions.
Lemma 1 (Maximally Equal = Uniform)
. For any vector of nonnegative numbers with sum :In particular, any probability distribution p is equalized by the uniform distribution u: Proof. Suppose at least one component of p is . Since the s sum to s, there should be at least one and one . By a suitable Robin Hood operation on , at least one of these two probabilities can be made , reducing the total number of components . Continuing in this manner, we arrive at all probabilities equal to after, at most, Robin Hood operations. □
Lemma 2 (Maximally Unequal)
. For any vector of nonnegative numbers with sum and constrained maximum :with remainder component . Without the maximum constraint (), one simply has the following:In particular, for any probability distribution p:where δ is the (Dirac) probability distribution of any deterministic variable. (This can be written in terms of random variables as , since, by Definition 1, any deterministic random variable is equivalent to the constant 0.) Proof. Suppose at least two components lie between 0 and P: . By a suitable Sheriff of Nottingham operation on , at least one of these two probabilities can be made either or , reducing the number of components lying inside . Continuing in this manner, we arrive at, at most, one component . Finally, the sum constraint implies where , whence . □
Theorem 3 (Schur Concavity [
28,
29])
. Proof. It suffices to prove the inequality for an elementary Robin Hood operation
. Dropping the dependence on the other (fixed) probabilities, one has, by symmetry, (Theorem 1) and concavity (Theorem 2):
□
Inequality (
48), expressed in terms of distributions:
is known as “Schur concavity” [
28,
29].
Remark 11. Theorem 3 can also be given a physical interpretation similar to Corollary 1. In fact, from (41), any Robin Hood operation can be seen as mixing two permuted probability distributions, which have equal randomness. Such mixing can only increase randomness. Example 8 (Entropy is Schur-Concave)
. That the Shannon entropy is Schur-concave is well known (§ 13 E in [29]). As with concavity (Example 5), this has a similar physical interpretation: a liquid mixed with another results in a “more disordered”, “more chaotic” system, which results in a “more equal” distribution and a higher entropy (§ 1 A9 in [29]). Remark 12 (-Schur Concavity). Schur concavity is not equivalent to concavity (even when assuming symmetry). In fact, with the notations of Remark 4, it is obvious that Schur concavity of is equivalent to Schur concavity of , where is any increasing function. In other words, while “φ-concavity” (in the sense of Remark 4) is not the same as concavity, there is no need to introduce “φ-Schur concavity”, since it is always equivalent to Schur concavity.
Remark 13 (Strict Schur Concavity)
. A randomness measure is “strictly Schur concave” if the inequality for holds with equality if and only if .
If is strictly concave (see Remark 5), then equality holds in (49) if and only if either or 1, or else . Either of these conditions means that is unchanged. Therefore, in this case, is also strictly Schur concave. Remark 6 states that the product of nonzero probabilities is strictly Schur-concave.
Example 9 (Strictly Schur Concave Measures)
. Randomness measures presented in Section 1 are (Schur) concave, but not all of them are strictly
Schur concave: Not only the Shannon entropy H is Schur concave (Example 8), but, as seen in Example 6, H, as well as all α-entropies for finite and , are strictly concave and, hence, strictly Schur concave;
As seen also in Example 6, guessing entropy G, or, more generally, ρ-guessing entropy , is not strictly concave in p. However, G and are strictly Schur concave by the following argument.
It suffices to show that some elementary Robin Hood operation (40) (with ) strictly increases . One may always choose δ as small as one pleases, since any elementary Robin Hood operation on can be seen as resulting from other ones on with smaller δ. One chooses δ small enough such that the elementary Robin Hood operation does not change the order of the probabilities in p. With the notations of Section 1.2, assuming, for example, that , where , then and , since . This shows that strictly increases; Error probability , or, more generally, , is neither strictly concave nor strictly Schur concave in general. In fact, if , any elementary Robin Hood operation on leaves unchanged;
Statistical randomness R is neither strictly concave nor strictly Schur concave if . For example, it is easily checked from the definition (18) that the elementary Robin Hood operation leaves R unchanged.
2.4. Resulting Properties in Terms of Random Variables
Corollary 5 (Minimal and Maximal Randomness)
. In other words, minimal randomness is achieved for (for any deterministic variable 0) and maximal randomness is achieved for uniformly distributed X.
Proof. From Lemmas 1 and 2, one obtains . The result follows by Theorem 3. □
Remark 14 (Zero Randomness)
. Without loss of generality, we may always impose that by considering in place of . Then, zero randomness is achieved when . It is easily checked from the expressions given in Section 1 that this convention holds for H, , , , , , and R. To simplify notations in the remainder of this paper, we assume that the zero randomness convention always holds.
Example 10 (Distribution Achieving Zero Randomness)
. By Remark 13, if is strictly Schur concave, zero randomness is achieved only when : As seen in Example 9, this is the case for H, , , and . In particular, we recover the well known property that zero entropy is achieved only when X is deterministic;
Although the error probability is not strictly Schur concave, one can check directly that if and only if , which corresponds to the δ distribution;
Similarly, from the discussion in Section 1.7, correspond to the maximum value of attained for and , which, again, corresponds to a δ distribution.
To summarize, all quantities H, , , , , and R satisfy (52). Remark 15 (Maximal Randomness Increases with
M)
. For an M-ary random variable, maximal randomness is attained for a uniform distribution . Since, by Lemma 1, , one has : maximal randomness increases with M.
Example 11 (Distribution Achieving Maximum Randomness)
. The following maximum values for M-ary random variables are easily checked from the expressions given in Section 1: , and, more generally, . Since H and are strictly Schur-concave, the maximum is attained if and only if X is uniformly distributed. This observation is also an easy consequence of (10) or (11); , , , etc. Again, since G and are strictly Schur-concave, their maximum is achieved if and only if X is uniformly distributed;
, and, more generally, . The maximum of is achieved if and only if the maximum probability equals , which implies that X is uniformly distributed;
(see (12) and (18)) is achieved if and only if .
To summarize, for all quantities H, , , , , and R, the unique maximizing distribution is the uniform distribution. Notice that, as expected, each of these maximum values increases with M.
Corollary 6 (Deterministic Data Processing Inequality: Processing Reduces Randomness)
. For any deterministic function f: Proof. Consider preimages by f of values . The application of f can be seen as resulting from a sequence of elementary operations, each of which puts together two distinct values of x (say, and ) in the same preimage of some y. In terms of probability distributions, this amounts to a Sheriff of Nottingham operation . Overall, one has . The result then follows by Schur concavity (Theorem 3). □
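A quick numerical check of Corollary 6 (our own sketch, with an arbitrary merging function f): applying a deterministic map to X lumps probability masses together, which can only decrease randomness.

```python
import math
from collections import defaultdict

def shannon_entropy(p):
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

def pushforward(p, f):
    """Distribution of f(X) when X has distribution p over the indices 0, ..., M-1."""
    q = defaultdict(float)
    for x, px in enumerate(p):
        q[f(x)] += px
    return list(q.values())

p = [0.7, 0.15, 0.1, 0.05]
q = pushforward(p, lambda x: x // 2)             # merge the values pairwise
print(shannon_entropy(q) <= shannon_entropy(p))  # True: processing reduces randomness
```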
Example 12. The fact that is well known (see Ex. 2.4 in [2]). This can also be seen from the data processing inequality of Corollary 4 by noting that, since is trivially a Markov chain, . Remark 16 (Lattices of Information and Majorization)
Shannon [26] defined the order relation if a.s. and showed that it satisfies the properties of a lattice, called the “information lattice” (see [30] for detailed proofs). With this notation, (53) reads as follows:Majorization (or the order relation ) also satisfies the properties of a lattice—the “majorization lattice”, as studied in [31]. From the proof of Corollary 6, one actually obtains the following:Therefore, the majorization lattice is denser than the information lattice. Corollary 7 (Addition Increases Randomness)
. This property is equivalent to (53). Proof. Apply Corollary 6 to the projection
. Conversely, (
53) follows from (
56), by taking
and noting that
. □
Corollary 8 (Total Dependence)
. Assuming the zero randomness convention (Remark 14), if (52) holds, then the following holds:that is, in the sense of Shannon (Remark 16). Proof. Since
for any y,
if and only if
for (almost) all y. By (
52), this implies that X is deterministic given
, i.e., X is a deterministic function of Y. □
Example 13. From Example 10, (57) is true for H, , , , , and R. The equivalence is well known ([2], Ex. 2.5). Knowledge of Y removes equivocation only when X is fully determined by Y; is intuitively clear: knowing Y allows one to fully determine X in only one guess;
: knowing Y allows one to estimate X without error only when X is fully determined by Y.
3. Fano and Reverse-Fano Inequalities
Definition 6 (Fano-type inequalities)
. A “Fano inequality” (resp. “reverse Fano inequality”) for gives an upper (resp. lower) bound of as a function of the probability of error . Fano and reverse-Fano inequalities are similarly defined for conditional randomness , lower or upper bounded as a function of .
In this section, we establish optimal Fano and reverse-Fano inequalities, where upper and lower bounds are tight. In other words, we determine the maximum and minimum of for fixed . The exact locus of the region , as well as the exact locus of all attainable values of , is determined analytically for fixed M, based on the following.
Lemma 3. Let and . For any M-ary probability distribution : Proof. On the left side, apply Lemma 2 with and . On the right side, with being fixed, apply Lemma 1 to the remaining probabilities , which sum to . □
Theorem 4 (Optimal Fano and Reverse-Fano Inequalities for
)
. The optimal Fano and reverse-Fano inequalities for the randomness measure of any M-ary random variable X in terms of are given analytically by the following: Proof. The proof is immediate from Lemma 3 and Theorem 3. The Fano and reverse-Fano bounds are achieved by the distributions on the left and right sides of (
58), respectively. □
A similar proof, valid for any Schur-concave randomness measure, was already given by Vajda and Vašek [17].
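Theorem 4 can be exploited numerically: for a given error probability, it suffices to evaluate the randomness measure on the two extremal distributions of Lemma 3. The following sketch (our own; the function names are ours, and the construction assumes Pe ≤ 1 − 1/M) builds these two distributions and evaluates the Shannon entropy on them; the value on the “high” distribution is the classical Fano bound h(Pe) + Pe log2(M − 1).

```python
import math

def fano_extremal_distributions(pe, M):
    """Extremal M-ary distributions with maximum probability p_max = 1 - pe:
    - 'low' is maximally unequal with constrained maximum p_max (Lemma 2), giving the reverse-Fano bound;
    - 'high' keeps p_max and spreads the remaining mass pe uniformly, giving the Fano bound."""
    assert 0 <= pe <= 1 - 1 / M
    pmax = 1 - pe
    k = int(1 // pmax)                                  # number of components equal to p_max
    low = ([pmax] * k + [1 - k * pmax] + [0.0] * M)[:M]
    high = [pmax] + [pe / (M - 1)] * (M - 1)
    return low, high

def shannon_entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

low, high = fano_extremal_distributions(pe=0.3, M=4)
print(shannon_entropy(low), shannon_entropy(high))  # optimal reverse-Fano and Fano values at Pe = 0.3
```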
Assuming the zero randomness convention for simplicity (Remark 14), Fano and reverse-Fano bounds can be qualitatively described as follows. They are illustrated in
Figure 2.
Proposition 1 (Shape of Fano Bounds)
. The (upper) Fano bound:where denotes maximal randomness (Remark 15) is continuous in , concave in and increases from 0
(for ) to (for ). For any fixed , it also increases with M. Proof. Since is concave over (Theorem 2), it is continuous on the interior of . Since is linear, the Fano bound results from the composition of a linear and a concave function. It is, therefore, concave, and continuous at every . It is clear from Lemma 3, or using a suitable Robin Hood operation, that the maximizing distribution becomes more equal as increases. Therefore, the Fano bound increases with . The maximum is attained for , which corresponds to the uniform distribution achieving maximum randomness . For fixed , it is also clear, using a suitable Robin Hood operation, that the maximizing distribution becomes more equal if M is increased by one. Therefore, the Fano bound also increases with M. □
Proposition 2 (Shape of reverse-Fano Bounds)
. The (lower) reverse-Fano bound:is continuous in , increases from 0
(for ) to (for ) and is composed of continuous concave increasing curves connecting successive points (, ) for . Proof. For any , the reverse-Fano bound at is . It suffices to prove that the reverse-Fano bound is continuous, concave and increasing for . When , that is, , the reverse-Fano bound is . This results from the composition of a linear and a concave function , which is continuous in the interior of . Therefore, it is concave in , and continuous on the whole closed interval . Finally, it is clear from Lemma 2 or using a suitable Robin Hood operation that becomes more equal as increases. Therefore, each curve increases from to . □
Remark 17 (Independence of the reverse-Fano Bound from the Alphabet Size)
. Contrary to the (upper) Fano bound, the (lower) reverse-Fano bound is achieved by a probability distribution that does not depend on M. As a result, when the definition of does not itself explicitly depend on M (as is the case for H, , G, , , , ), the reverse-Fano bound is the same for all M, except that it is truncated up to , at which point it meets the (upper) Fano bound (see Figure 2). Theorem 5 (Optimal Fano and Reverse-Fano Inequalities for
)
. The optimal Fano and reverse-Fano inequalities for the randomness measure of any M-ary random variable X in terms of are given analytically by the following:where we have noted ( is the usual ceil function , unless x is an integer), and . Proof. The Fano region for
, i.e., the locus of the points
for each
, is given by the inequalities (
59). From the definition of conditional randomness, the exact locus of points
is composed of all convex combinations of points in the Fano region, that is, its convex envelope. The extreme points
and
are unchanged. The upper Fano bound joining these two extreme points is concave by Proposition 1 and, therefore, already belongs to the convex envelope. It follows that the upper Fano bound in (
59) remains the same, as given in (
62). However, the lower reverse-Fano bound for
is the convex hull of the lower bound in (
59). By Proposition 2, it is easily seen to be the piecewise linear curve joining all singular points (
,
) for
(see
Figure 2). A closed-form expression is obtained by noting that, when
, that is,
, the equation of the straight line joining (
,
) and (
,
) is
. Plugging
and
gives the lower reverse-Fano bound in (
62). □
Remark 18 (Shape of Fano and reverse-Fano bounds for Conditional Randomness)
. By Theorem 5, the Fano inequality for the conditional version takes the same form as for . In particular, it is increasing and concave in . Compared to that for , the reverse-Fano bound for , however, is a piecewise linear convex hull. Clearly, it is still continuous and increasing in , as illustrated in Figure 2. If the corresponding sequence of slopes is increasing in k, then the reverse-Fano bound for is also convex in . Remark 19 (-Fano Bounds). If is used instead of , where φ is an increasing function (in particular, to define conditional randomness as in Remark 4), then Theorem 4 and the (upper) Fano bound of Theorem 5 can be directly applied to . When φ is nonlinear, this may result in (upper) Fano bounds that are no longer concave.
However, to obtain the reverse-Fano inequalities for , one has to apply Theorem 5 to and then apply the inverse function to the left side of (62). When φ is nonlinear, the resulting “reverse-Fano bound” for will not be piecewise linear anymore. This is the case, e.g., for conditional α-entropies (see Example 15 below). Example 14 (Fano and reverse-Fano Inequalities for Entropy)
. For the Shannon entropy, the optimal Fano inequality (right sides of (59) and (62)) takes the form:where is the binary entropy function. Inequality (64) is the original Fano inequality established in 1952 [22], which has become ubiquitous in information theory and in statistics to relate equivocation to probability of error. Inequality (63) trivially follows, in case of blind estimation (). That these inequalities are sharp is well known (see, e.g., [32]). The optimal reverse-Fano inequality (left sides of (59) and (62) with ) takes the form:whereThese two lower bounds were first derived by Kovalevsky [33] in 1965. Optimality was already proven in [32]. Example 15 (Fano and reverse-Fano Inequalities for
-Entropy)
. By Remark 19, the optimal Fano inequality for is obtained as the right side of (59), which gives the following:This was proven by Toussaint [34] for and, independently, by Ben-Bassat and Raviv [35] for .Additionally, by Remark 19, the optimal Fano inequality for is obtained by averaging over Y the Fano upper bound of , which is of the form , where , which is concave Lemma 1 in [36]. Therefore, the optimal Fano inequality for is likewise obtained as the right side of (62), which gives the following: The optimal reverse-Fano inequality for is obtained as the left side of (59). By Remark 19, is obtained by applying to the left side of (62) for , where is given by (30). This gives the following:whereFano and reverse-Fano inequalities for and were recently established by Sason and Verdú [36]. Example 16 (Fano and reverse-Fano Inequalities for non collision
)
. Theorem 4 readily gives the optimal Fano region for :This can also be easily deduced from (69) and (71) for via (4). Fano and reverse-Fano inequalities for were first stated without proof in [7].The optimal Fano region for , however, cannot be directly deduced from that of , because a different kind of average over Y is involved. However, a direct application of Theorem 5 with gives the optimal Fano region:Remarkably, the reverse-Fano inequality has a very simple form (see Figure 3). Example 17 (Fano and reverse-Fano Inequalities for Guessing Entropy)
. For guessing entropy G, the Fano inequality is written as shown:One obtains similarly , , etc. Due to the fact that is linear in p, for fixed , the reverse-Fano bound for is linear in . It follows that the bound is already piecewise linear, with a sequence of slopes , which is easily seen to be increasing. Therefore, the (lower) reverse-Fano bound is piecewise linear and convex and coincides with its convex hull. In other words, the reverse-Fano inequality for and takes the same form:The following is easily determined from the left side of either (59) or (62):For example, , such that the following occurs:Fano and reverse-Fano inequalities for were recently established by Sason and Verdú [37]. As already shown in [27] for , the use of Schur concavity greatly simplifies the derivation. Figure 4 shows some optimal Fano regions for
,
,
and
.
4. Pinsker and Reverse-Pinsker Inequalities
Pinsker and reverse-Pinsker inequalities relate some divergence measure (e.g.,
or
) between two distributions to their statistical distance
. For simplicity, even though we restrict ourselves to the divergence or distance to the uniform distribution
, we still use the generic name “Pinsker inequalities”. Following the discussion in
Section 1.6, we adopt the following.
Definition 7 (Pinsker-type inequalities)
. A “Pinsker inequality” (resp. “reverse-Pinsker inequality”) for gives an upper (resp. lower) bound of as a function of the statistical randomness (or statistical distance ). Pinsker and reverse-Pinsker inequalities are similarly defined for conditional randomness , lower or upper bounded as a function of .
In this Section, we establish optimal Pinsker and reverse-Pinsker inequalities, where upper and lower bounds are tight. In other words, we determine the maximum and minimum of for fixed R (or fixed Δ). The exact locus of the region , as well as the exact locus of all attainable values of is determined analytically for fixed M, based on the following.
Lemma 4. Let and . For any M-ary probability distribution and any integer K such thatwhere denotes the cardinality of the set A, one has the following: Proof. Let
be defined as in (
17) for a uniform distribution
. Then,
satisfies (
84), and (
16) gives
. First, consider the largest
K probabilities, which are all
and sum to
. One obtains the following:
where, on the right side, we have used Lemma 1 and, on the left side, we have used Lemma 2, applied to
, which sum to Δ. Next, consider the smallest
probabilities, which are all
and sum to
. One has the following:
where, on the right side, we have used Lemma 1 and, on the left side, we have used Lemma 2 with
. Combining (
86) and (
87) gives (
85), where the remainder component
is computed so that the sum of probabilities on the left side equals one, which gives
. □
Theorem 6 (Optimal Pinsker and Reverse-Pinsker Inequalities for
)
. The optimal Pinsker and reverse-Pinsker inequalities for the randomness measure of any M-ary random variable X in terms of are given analytically as below:where and the maximum is over all integers . Proof. Apply Lemma 4 and Theorem 3. The Pinsker and reverse-Pinsker bounds are achieved by the distributions on the left and right sides of (
85), respectively. The best value of
K maximizes the randomness
of the distribution on the right side of (
85), with the constraint
, that is,
. □
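A numerical sketch of the (upper) Pinsker bound of Theorem 6 for the Shannon entropy (our own code): reading off the right side of Lemma 4, the candidate distributions have K probabilities equal to 1/M + Δ/K and M − K probabilities equal to 1/M − Δ/(M − K), and the admissible values of K are scanned exhaustively.

```python
import math

def shannon_entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def pinsker_upper_bound(randomness, M, delta):
    """Maximum randomness at statistical distance delta from the M-ary uniform distribution,
    obtained by scanning the two-level candidate distributions over the admissible K."""
    best = None
    for K in range(1, M):
        small = 1 / M - delta / (M - K)
        if small < 0:        # inadmissible split: the small probabilities would be negative
            continue
        p = [1 / M + delta / K] * K + [small] * (M - K)
        value = randomness(p)
        best = value if best is None else max(best, value)
    return best

print(pinsker_upper_bound(shannon_entropy, M=4, delta=0.25))  # ~1.81 bits at distance 0.25 from uniform
```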
Assuming the zero randomness convention for simplicity (Remark 14), Pinsker and reverse-Pinsker bounds can be qualitatively described as follows. They are illustrated in
Figure 5.
Proposition 3 (Shape of Pinsker Bounds)
. The (upper) Pinsker bound:where and the maximum is over all integers , is increasing and piecewise continuous in each subinterval , (), with possible jump discontinuities at points (). Proof. First, notice that the distributions are not necessarily comparable in terms of equalization (partial) order for different values of K. It follows that, in general, the optimal value of K maximizing depends not only on Δ (or R), but also on the choice of the randomness measure .
However, for fixed K, is linear. In addition, since is concave over (Theorem 2), it is continuous on the interior of . Therefore, the bound results from the composition of a linear and a continuous concave function. It is, therefore, continuous and concave over the domain , that is, . Also, it is clear, using a suitable Robin Hood operation, that, for a fixed K, is decreasing in Δ, and, therefore, increasing in R.
It follows that the (upper) Pinsker bound is a maximum of at most M increasing continuous concave functions, defined over intervals of the form
. It is, therefore, increasing over the entire interval
and piecewise continuous in each subinterval
, with possible jumps at the endpoints (see
Figure 5). □
Proposition 4 (Shape of reverse-Pinsker Bounds)
. The (lower) reverse-Pinsker bound:is continuous in , increases from 0
(for ) to (for ) and is composed of continuous concave increasing curves connecting successive points (, for , where the following holds: Proof. For fixed , that is, , the bound results from the composition of a linear and a concave function. It is, therefore, concave, and continuous at every . It is clear, using a suitable Robin Hood operation on , that this bound increases with R on the subinterval . For , it equals , which is easily seen, using a suitable Robin Hood operation, to be increasing with k, with maximum . □
Theorem 7 (Optimal Pinsker and Reverse-Pinsker Inequalities for
)
. The optimal Pinsker and reverse-Pinsker inequalities for the randomness measure of any M-ary random variable X in terms of are given by the convex envelope of the Pinsker region determined by (88). In particular, consider the following:If the (upper) Pinsker bound for is concave (with no discontinuities), then the same optimal bound holds for in terms of : If the sequence () is nondecreasing, where is defined by (91), then the optimal (lower) reverse-Pinsker bound for is given by the piecewise linear function connecting points ; If the sequence () is nonincreasing, then the optimal (lower) reverse-Pinsker bound for writes as follows:where, as before: and .
Proof. The Pinsker region for
, i.e., the locus of the points
for each
, is given by the inequalities (
88). From the definition of conditional randomness, the exact locus of points
is composed of all convex combinations of points in the Pinsker region, that is, its convex envelope.
The extreme points
and
are unchanged. The upper Pinsker bound joining these two extreme points is piecewise concave by Proposition 3 and, therefore, if continuous, already belongs to the convex envelope. It follows, in this case, that the upper Pinsker bound in (
88) remains the same, as given in (
92).
The lower reverse-Pinsker bound for
is the convex hull of the lower bound in (
88). By Proposition 4, if the sequence
is nondecreasing, the piecewise linear curve joining all singular points (
,
) for
) is convex and already coincides with its convex hull. If, on the contrary, the sequence
is non nonincreasing, that piecewise linear curve is concave, and its convex hull is simply the straight line joining the extreme endpoints (
,
) and (
,
), which is given by (
93). □
Remark 20 (-Pinsker Bounds). If is used instead of , where φ is an increasing function (in particular, to define conditional randomness as in Remark 4), then Theorem 6 can be directly applied to . When φ is nonlinear, this may result in (upper) Pinsker bounds that are no longer concave.
However, to obtain the reverse-Pinsker inequalities for , one has to apply Theorem 7 to and then apply the inverse function to (92). When φ is nonlinear, the resulting “reverse-Pinsker bound” for is no longer piecewise linear. This is the case, e.g., for conditional α-entropies (see Example 19 below). Example 18 (Pinsker and reverse-Pinsker Inequalities for Entropy)
. For the Shannon entropy, the optimal Pinsker bounds of Theorem 6 are easily determined as shown:where and . The maximizing value of K depends on the value of Δ
. The lower bound was proven in implicit form in Thm. 3 in [38], while the upper bound was given in Thm. 26 in [39]. Here, (91) is of the form , where is strictly concave increasing for . As a consequence, the sequence is decreasing for , and, by Theorem 7, the optimal reverse-Pinsker inequality for conditional entropy is simply the following: Example 19 (Pinsker and reverse-Pinsker Inequalities for
-Entropy and for
)
. By Remark 20, the optimal Pinsker and reverse-Pinsker inequalities (88) for α-entropy are given as below:where and . Again, the maximizing value of K depends on the value of Δ
.For collision entropy (), since achieves its minimum when the integer K is closest to , the optimal Pinsker and reverse-Pinsker inequalities simplify to the following:where . In terms of , the optimal Pinsker and reverse-Pinsker inequalities read as shown: Since , one always has (maximum achieved when ), so that the (upper) Pinsker bound can be further bounded:This upper bound was derived by Shoup Thm 8.36 in [8] and was later re-derived in the Lemma in 4 [40]. This, however, is the optimal
Pinsker bound only
when , that is, when M is even and (i.e., ). By Remark 20, to obtain the optimal reverse-Pinsker inequality for , we consider , where, from (30), and . For this quantity, one has, from (91), of the form , where is strictly concave increasing for . As a consequence, the sequence is decreasing for , and, by Theorem 7, the optimal reverse-Pinsker bound for conditional 2-entropy is , which gives the optimal reverse-Pinsker inequality: For , one has , where is strictly concave increasing for . As a consequence, the sequence is decreasing for , and, since , by Theorem 7, the optimal reverse-Pinsker inequality for is simply as below:(see Figure 6). Example 20 (Pinsker and reverse-Pinsker Inequalities for Guessing Entropy)
. For the guessing entropy, the optimal Pinsker bounds of Theorem 6 are easily determined:A notable property is that the optimal upper bound does not depend on the value of K. The upper bound is mentioned by Pliam in [4] as an upper bound of . The methodology of this paper, based on Schur concavity, greatly simplifies the derivation. For the conditional guessing entropy , observe that the upper Pinsker bound for is linear (hence, concave) in R and that (91) is of the form , where the sequence is increasing. Therefore, by Theorem 7, the optimal Pinsker region for conditional entropy is the same as for : Figure 7 shows some optimal Pinsker regions for
,
,
and
.
Example 21 (Statistical Randomness vs. Probability of Error)
. As a final example, we present the optimal regions of statistical randomness R vs. probability of error . In this case, observe the following from Definitions 6 and 7:
The (optimal) Fano inequality for R is the same as the (optimal) reverse-Pinsker inequality for ;
The (optimal) Pinsker inequality for is the same as the (optimal) reverse-Fano inequality for R.
Letting and , Theorem 4 readily gives the optimal Fano and reverse-Fano inequalities:while Theorem 6 gives the optimal Pinsker and reverse-Pinsker inequalities:since the maximum of in the right side of (88) is for maximum . Similarly, letting and , Theorem 5 with readily gives the optimal Fano and reverse-Fano inequalities:while Theorem 7 gives the optimal Pinsker and reverse-Pinsker inequalities:where the upper bound is the piecewise linear function connecting points for . From the above observation, the left (reverse-Fano) inequality in (104) is equivalent to the right (Pinsker) inequality in (105), and, similarly, the left (reverse-Fano) inequality in (106) is equivalent to the right (Pinsker) inequality in (107), which do not seem obvious from the expressions above. The optimal Fano/Pinsker region is illustrated in Figure 8.
5. Some Applications
Fano and Pinsker inequalities find many applications in many areas of science; we only mention a few. They have been applied in character recognition [
33], feature selection [
7], Bayesian statistical experiments [
17], statistical data processing [
13], quantization [
41], hypothesis testing [
36], entropy estimation [
38], channel coding [
42], sequential decoding [
11] and list decoding [
36,
43], lossless compression [
37,
43,
44] and guessing [
37,
44], knowledge representation [
12], cipher security measures [
4], hash functions [
8], randomness extractors [
40], information flow [
18], statistical decision making [
20] and side-channel analysis [
14,
27,
45]. Some of the various inequalities used for these applications are not optimal (or not proven optimal) for various reasons (simplicity of the expressions, approximations, etc.). By contrast, the methodology of this paper always provides optimal direct or reverse-Fano and -Pinsker inequalities.
6. Conclusions and Perspectives
We have derived optimal regions for randomness measures compared to either the error probability or the statistical randomness (or the total variation distance). One perspective is to provide similar optimal regions relating two arbitrary randomness measures. Of course, by (
6), Fano regions such as
vs.
can be trivially reinterpreted as regions
vs.
(see, e.g.,
Figure 2 in [
42] for the region H vs.
). Using some more involved derivations, the authors of [
46] have investigated the optimal regions H vs.
and, more generally, the authors of [
47,
48] have investigated the optimal regions between two
α-entropies of different orders. It would be desirable to apply the methods of this paper to the more general case of two arbitrary randomness measures. In particular, the determination of the optimal regions
vs.
will allow one to assess the sharpness of the “Massey-type” inequalities of [
5].
Catalytic majorization [
49] was found to be a necessary and sufficient condition for the increase of all Rényi entropies (including the ones with negative parameters
α). It would be interesting to find similar necessary and sufficient conditions for other types of randomness measures.
It is also possible to generalize the notion of entropies and other randomness quantities with respect to an arbitrary dominating measure instead of the counting measure, e.g., to extend the considerations of this paper from the discrete case to the continuous case. The relevant notion of majorization in this more general context is studied, e.g., in [
50].
Concerning Pinsker regions, another perspective is to extend the results of this paper to the more general case of Pinsker and reverse-Pinsker inequalities, relating “distances” of two arbitrary distributions
by removing the restriction that
is uniform. Some results in this direction appear in [
38,
51,
52,
53,
54,
55,
56,
57].
Other types of inequalities on randomness measures with different constraints can also be obtained via majorization theory [
43,
44].