1. Introduction
The multinomial distribution is an adequate model for describing how observations fall into categories. Quoting Johnson et al. [
1], “The Multinomial distribution, like the Multivariate Normal distribution among the continuous multivariate distributions, consumed a sizable amount of the attention that numerous theoretical as well as applied researchers directed towards the area of discrete multivariate distributions.”
The entropy of a (multivariate, in our case) random variable is a fundamental quantity. It quantifies the predictability of a system whose outputs can be described by such a model. Entropy has several definitions, both conceptual and mathematical. The concept of entropy originated as a way to relate a system’s energy and temperature [2]. The same concept was later used to describe the number of ways in which the particles of a system can be arranged.
Entropy has seldom been studied as a random variable. Hutcheson [
3] and Hutcheson and Shenton [
4] discussed the exact expected value and variance of the Shannon entropy under the multinomial model. These works also provided approximate expressions that circumvent the numerical issues when using the exact value.
Jacquet and Szpankowski [
5] studied high-quality analytic approximations of the Rényi entropy, of which the Shannon entropy is a particular case, under the binomial model. With the same approach, Cichoń and Golębiewski [
6] obtained expressions for more general functionals that include the multinomial distribution. These works treat the entropy as a fixed quantity. Cook et al. [
7] studied almost unbiased estimators of functions of the parameter of the binomial distribution. The authors extended those results to find an almost-unbiased estimator for the entropy under multinomial laws.
Chagas et al. [
8] treated the Shannon entropy as a random variable. The authors obtained its asymptotic distribution when indexed by the maximum likelihood estimators of the proportions under the multinomial distribution. This result allowed devising one-sided and two-sided tests for comparing the entropy of two samples in a very general way. These tests do not require the samples to have the same number of categories.
In this work, our attention is directed toward the asymptotic distribution of other forms of entropy under the multinomial model. This allows the comparison of large samples through their entropies, even when the samples have different numbers of classes. The comparison also allows using different types of entropy. We first apply the multivariate delta method and, in the case of the Rényi entropy, we transform the resulting multivariate normal distribution into that of the logarithm of the absolute value of a normally distributed random variable. Then, we provide the general expression of a test statistic that suits our needs.
This paper unfolds as follows.
Section 2 recalls the main properties of the multinomial distribution and defines the four types of entropies we will study. In
Section 3, we present the central results, i.e., the asymptotic distributions of those entropies. We describe the techniques we used and leave the technical details for Appendix A.1. We validate our results with simulation studies in
Section 4: we show the adequacy of the normal distribution as limit law for the entropies under three probability models of different support, considering various sample sizes. In
Section 5, we show that these asymptotic properties lead to a helpful hypothesis test between samples with different categories. We conclude the article in
Section 6.
Appendix A.2 comments on applications that justify our choices of the number of categories and sample sizes in the simulation studies.
Appendix A.3 discloses relevant computational information, including reproducibility.
2. Entropies and the Multinomial Distribution
Consider a series of $n$ independent trials, in each of which exactly one of $k$ mutually exclusive events $E_1, E_2, \dots, E_k$ is observed, with probabilities $p_1, p_2, \dots, p_k$ such that $p_i > 0$ and $\sum_{i=1}^{k} p_i = 1$. Let $\mathbf{N} = (N_1, N_2, \dots, N_k)$ be the random vector that counts the number of occurrences of the events $E_1, E_2, \dots, E_k$ in the $n$ trials, with $N_i \in \{0, 1, \dots, n\}$ and $\sum_{i=1}^{k} N_i = n$. A sample from $\mathbf{N}$, say $\mathbf{n} = (n_1, n_2, \dots, n_k)$, is a $k$-variate vector of integer values such that $\sum_{i=1}^{k} n_i = n$. Then, the joint distribution of $\mathbf{N}$ is
$$\Pr(\mathbf{N} = \mathbf{n}) = \frac{n!}{n_1! \, n_2! \cdots n_k!} \, p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}.$$
We denote this situation as $\mathbf{N} \sim \operatorname{Mult}(n, \mathbf{p})$, with $\mathbf{p} = (p_1, p_2, \dots, p_k)$.
In practice, one does not know the true values of $\mathbf{p}$, the probabilities that index this multinomial distribution. Such values are estimated by computing $\widehat{p}_i = N_i / n$, the proportion of times the $i$-th class (category, event) was observed among the $k$ possible categories during the $n$ trials. The maximum likelihood estimator for $\mathbf{p}$ is the random vector of proportions $\widehat{\mathbf{p}} = (\widehat{p}_1, \widehat{p}_2, \dots, \widehat{p}_k)$. This maximum likelihood estimator coincides with the intuitive estimator based on the distribution’s first moments, and is the most frequently used in applications.
We study the distribution of several forms of entropy of the random vector $\widehat{\mathbf{p}}$ for fixed $k$. Notice that each entropy is computed over a single $k$-variate measurement of random proportions corresponding to a single random sample from $\operatorname{Mult}(n, \mathbf{p})$. The asymptotic behaviors we derive hold for the typical cases in which $n$ is much larger than $k$.
The Shannon entropy measures the disorder or unpredictability of systems characterized by a probability distribution. On the one hand, the minimum value of the Shannon entropy occurs when there is complete knowledge about the system’s behavior and total confidence in predicting the next observation. On the other hand, when a uniform distribution describes the system’s behavior, that is, when all possibilities have the same probability of occurrence, the knowledge about the behavior of the data is minimal. In Chagas et al. [
8], we studied the asymptotic distribution of the Shannon entropy. In this work, we extend those results to three other forms of entropy.
Other types of descriptors have been proposed in the literature to extract additional information not captured by the Shannon entropy. Tsallis [
9] and Rényi [
10], for instance, proposed parametric generalizations that include the Shannon entropy as a limiting case.
Fisher information [
11] is defined as the average of the squared logarithmic derivative of a continuous probability density function. In the case of discrete distributions, this measure can be approximated using differences of probabilities between consecutive elements of the distribution. While the Shannon entropy captures the degree of unpredictability of a system, the Fisher information is related to the rate of change between consecutive observations and, thus, quantifies small changes and perturbations.
Given a type of entropy $H$, we are interested in the distribution of $H(\widehat{\mathbf{p}})$ when indexed by $\widehat{\mathbf{p}}$, the maximum likelihood estimator of $\mathbf{p}$. Our problem then becomes finding the distribution of $H(\widehat{\mathbf{p}})$ for the following (a computational sketch follows the list):
The Shannon entropy
$$H_S(\widehat{\mathbf{p}}) = -\sum_{i=1}^{k} \widehat{p}_i \ln \widehat{p}_i . \qquad (2)$$
The Tsallis entropy with index $q > 0$, $q \neq 1$,
$$H_T^{(q)}(\widehat{\mathbf{p}}) = \frac{1}{q-1}\left(1 - \sum_{i=1}^{k} \widehat{p}_i^{\,q}\right). \qquad (3)$$
The Rényi entropy of order $q > 0$, $q \neq 1$,
$$H_R^{(q)}(\widehat{\mathbf{p}}) = \frac{1}{1-q} \ln \sum_{i=1}^{k} \widehat{p}_i^{\,q}. \qquad (4)$$
The Fisher information, also termed “Fisher Information Measure” in the literature, with renormalization coefficient $F_0$,
$$F(\widehat{\mathbf{p}}) = F_0 \sum_{i=1}^{k-1} \left(\sqrt{\widehat{p}_{i+1}} - \sqrt{\widehat{p}_i}\right)^{2}. \qquad (5)$$
Among other possibilities, we used Equation (2.7) from Ref. [12].
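The plug-in versions of these descriptors are straightforward to compute. The following is a minimal sketch in Python (using NumPy), assuming natural logarithms and the square-root-difference form of the Fisher information measure written above; the function names and the default value of the renormalization coefficient `f0` are illustrative choices, not part of the original formulation.

```python
import numpy as np

def shannon(p):
    """Plug-in Shannon entropy, with the convention 0*ln(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def tsallis(p, q):
    """Plug-in Tsallis entropy of index q (q != 1)."""
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def renyi(p, q):
    """Plug-in Renyi entropy of order q (q != 1)."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** q)) / (1.0 - q)

def fisher_information(p, f0=1.0):
    """Discrete Fisher information measure built from consecutive
    square-root differences; the coefficient f0 is an assumption
    of this sketch."""
    p = np.asarray(p, dtype=float)
    return f0 * np.sum((np.sqrt(p[1:]) - np.sqrt(p[:-1])) ** 2)
```

In practice, these functions would be evaluated at the observed proportions $\widehat{\mathbf{p}} = \mathbf{N}/n$.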
3. Asymptotic Distributions of Entropies
The main results of this section are the asymptotic distributions of the Shannon (2), Tsallis of order $q$ (3), and Rényi of order $q$ (4) entropies, and of the Fisher information (5). These results are presented, respectively, in Equations (30)–(32) and (35). Notice that the Rényi entropy is not asymptotically normally distributed, while the other three are.
We recall the following theorems, known respectively as the delta method and its multivariate version. We refer to Lehmann and Casella [13] for their proofs.
Theorem 1. Let $Y_1, Y_2, \dots$ be a sequence of independent and identically distributed random variables such that $\sqrt{n}\,(\overline{Y}_n - \theta)$ converges in distribution to a $\mathcal{N}(0, \sigma^2)$ law. If $g'(\theta)$ exists and does not vanish, then $\sqrt{n}\,\bigl(g(\overline{Y}_n) - g(\theta)\bigr)$ converges in distribution to a $\mathcal{N}\bigl(0, \sigma^2 [g'(\theta)]^2\bigr)$ law.
Theorem 2. Let $\mathbf{Y}_1, \mathbf{Y}_2, \dots$ be a sequence of independent and identically distributed vectors of random variables such that $\sqrt{n}\,(\overline{\mathbf{Y}}_n - \boldsymbol{\theta})$ converges in distribution to a multivariate normal distribution $\mathcal{N}_k(\mathbf{0}, \Sigma)$, where $\Sigma$ is the covariance matrix. Suppose that $g_1, \dots, g_m$ are real functions continuously differentiable in a neighborhood of the parameter point $\boldsymbol{\theta}$ and such that the matrix $B = \bigl(\partial g_i / \partial \theta_j\bigr)$ of partial derivatives is non-singular in the mentioned neighborhood. Then, the following convergence in distribution holds:
$$\sqrt{n}\,\Bigl(\bigl(g_1(\overline{\mathbf{Y}}_n), \dots, g_m(\overline{\mathbf{Y}}_n)\bigr) - \bigl(g_1(\boldsymbol{\theta}), \dots, g_m(\boldsymbol{\theta})\bigr)\Bigr) \xrightarrow{\mathcal{D}} \mathcal{N}_m\bigl(\mathbf{0}, B \Sigma B^{\top}\bigr),$$
where $B^{\top}$ denotes the transpose of $B$. Now, we focus on the case $\mathbf{N} \sim \operatorname{Mult}(n, \mathbf{p})$. Let $\widehat{\mathbf{p}} = \mathbf{N}/n$ be the vector of sample proportions, which coincides with the maximum likelihood estimator (MLE) of $\mathbf{p}$. Then
$$\sqrt{n}\,(\widehat{\mathbf{p}} - \mathbf{p}) \xrightarrow{\mathcal{D}} \mathcal{N}_k(\mathbf{0}, \Sigma),$$
where $\Sigma$ is the limit covariance matrix.
Let us explore the covariance matrix in this case. For the multinomial counts, $\operatorname{Var}(N_i) = n\,p_i(1 - p_i)$ and $\operatorname{Cov}(N_i, N_j) = -n\,p_i p_j$ for $i \neq j$. It means that the covariance matrix $\Sigma$ we are interested in is of the form
$$\Sigma = \operatorname{diag}(\mathbf{p}) - \mathbf{p}\,\mathbf{p}^{\top} =
\begin{pmatrix}
p_1(1 - p_1) & -p_1 p_2 & \cdots & -p_1 p_k \\
-p_2 p_1 & p_2(1 - p_2) & \cdots & -p_2 p_k \\
\vdots & \vdots & \ddots & \vdots \\
-p_k p_1 & -p_k p_2 & \cdots & p_k(1 - p_k)
\end{pmatrix}.$$
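This limit law can be checked numerically. The sketch below, under assumed values of $\mathbf{p}$, $n$, and the number of replicates, compares the empirical covariance of $\sqrt{n}\,(\widehat{\mathbf{p}} - \mathbf{p})$ with $\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^{\top}$.

```python
import numpy as np

rng = np.random.default_rng(12345)

p = np.array([0.1, 0.2, 0.3, 0.4])   # illustrative probability vector (assumption)
n = 5000                              # trials per sample
replicates = 20000

# Empirical covariance of sqrt(n) * (p_hat - p) over many replicates.
counts = rng.multinomial(n, p, size=replicates)
p_hat = counts / n
z = np.sqrt(n) * (p_hat - p)
empirical_cov = np.cov(z, rowvar=False)

# Theoretical limit covariance: Sigma = diag(p) - p p^T.
sigma = np.diag(p) - np.outer(p, p)

print(np.round(empirical_cov, 4))
print(np.round(sigma, 4))
```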
We now apply the above statements to our problem. In the following, we obtain new results for the Tsallis and Rényi entropies, and for the Fisher information. For the sake of completeness, we also include the results for the Shannon entropy.
In order to apply the delta method using Theorem 2, we consider, for each entropy, functions $g_1, \dots, g_k$ defined in a neighborhood of $\mathbf{p}$; they are indexed by $i = 1, \dots, k$, except for the Fisher information, whose summands involve consecutive components and are indexed by $i = 1, \dots, k-1$. The assumptions of Theorem 2 are verified, and thus the corresponding convergence to a multivariate normal law holds. Finally, we need the covariance matrix of the multivariate normal limit distribution, which is $B \Sigma B^{\top}$, where $B$ is the matrix of partial derivatives of the functions $g_i$ evaluated at $\mathbf{p}$. Since these matrices are diagonal for the Shannon, Tsallis, and Rényi entropies, we can use Equation (A1) to conclude the form of the limit covariance matrix.
In the case of the Fisher information, from Equations (A3) and (A4), we obtain the entries of the limit covariance matrix case by case, according to the relative position of the indices involved. Hence, we conclude that, for every entropy considered, the vector of transformed proportions is asymptotically multivariate normal, centered at the corresponding population values and with covariance matrix $B \Sigma B^{\top}$; the matrix $B$ is diagonal in all cases except for the case of the Fisher information, in which it also couples consecutive components. An equivalent expression is given in (28).
If $\mathbf{X} = (X_1, \dots, X_k)$ is a vector of random variables that follows, at least asymptotically, a multivariate normal law with mean vector $\boldsymbol{\mu} = (\mu_1, \dots, \mu_k)$ and covariance matrix $(\sigma_{ij})$, then it can be proved that the sum of its components is also (asymptotically) normally distributed. Provided well-known properties, it holds that $\operatorname{E}\bigl(\sum_{i=1}^{k} X_i\bigr) = \sum_{i=1}^{k} \mu_i$ and $\operatorname{Var}\bigl(\sum_{i=1}^{k} X_i\bigr) = \sum_{i=1}^{k}\sum_{j=1}^{k} \sigma_{ij}$. Applying this to (28), we obtain the limit law stated in (29).
Now, using (
29), we find the asymptotic distribution of (
2)–(
5). In order to do so, we need to know the distribution of the sum of
k Gaussian random variables with different means and an arbitrary covariance matrix.
For any $k$-dimensional multivariate normal random vector $(X_1, \dots, X_k) \sim \mathcal{N}_k(\boldsymbol{\mu}, \Sigma)$, with $\boldsymbol{\mu} = (\mu_1, \dots, \mu_k)$ and covariance matrix $\Sigma = (\sigma_{ij})$, it holds that the distribution of $\sum_{i=1}^{k} a_i X_i$, with constants $a_1, \dots, a_k$, is $\mathcal{N}\bigl(\sum_{i=1}^{k} a_i \mu_i, \sum_{i=1}^{k}\sum_{j=1}^{k} a_i a_j \sigma_{ij}\bigr)$. Using the limit distribution presented in (29) and unit weights, we directly have the asymptotic distribution of the Shannon entropy as follows:
$$\sqrt{n}\,\bigl(H_S(\widehat{\mathbf{p}}) - H_S(\mathbf{p})\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}\left(0, \sum_{i=1}^{k} p_i \ln^2 p_i - \Bigl(\sum_{i=1}^{k} p_i \ln p_i\Bigr)^{2}\right). \qquad (30)$$
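A quick Monte Carlo experiment illustrates this normal approximation; the probability vector, sample size, and number of replicates below are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

p = np.array([0.1, 0.2, 0.3, 0.4])   # illustrative model (assumption)
n, replicates = 10000, 5000

# Monte Carlo replicates of the plug-in Shannon entropy.
p_hat = rng.multinomial(n, p, size=replicates) / n
h_hat = stats.entropy(p_hat, axis=1)

# First-order asymptotic mean and variance, as in (30).
h = stats.entropy(p)
var = (np.sum(p * np.log(p) ** 2) - np.sum(p * np.log(p)) ** 2) / n

# The empirical moments should be close to the asymptotic ones.
print(h_hat.mean(), h)
print(h_hat.var(ddof=1), var)
```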
With similar arguments and unit weights, we obtain the asymptotic distribution for the Tsallis entropy of order $q$:
$$\sqrt{n}\,\bigl(H_T^{(q)}(\widehat{\mathbf{p}}) - H_T^{(q)}(\mathbf{p})\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}\left(0, \frac{q^2}{(q-1)^2}\Bigl[\sum_{i=1}^{k} p_i^{2q-1} - \Bigl(\sum_{i=1}^{k} p_i^{q}\Bigr)^{2}\Bigr]\right). \qquad (31)$$
The procedure is analogous for the Fisher information, but with a non-diagonal matrix of partial derivatives. Hence, it can be proved that
$$\sqrt{n}\,\bigl(F(\widehat{\mathbf{p}}) - F(\mathbf{p})\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}\bigl(0, \sigma_F^{2}\bigr), \qquad (32)$$
where $\sigma_F^{2}$ is given by (33) and (34). To obtain expression (33), we use the symmetry of the covariance matrix, which implies that $\sigma_{ij} = \sigma_{ji}$. It is worth noticing that the expression of the covariance matrix for the Fisher information is more complicated than for the previously analyzed entropies, since the matrix of partial derivatives is not diagonal in this case.
The case of the Rényi entropy is different because, following the previous methodology, we can prove that the sum $\sum_{i=1}^{k} \widehat{p}_i^{\,q}$ is asymptotically normally distributed. Hence, the asymptotic distribution of $H_R^{(q)}(\widehat{\mathbf{p}})$ is that of
$$\frac{1}{1-q}\,\ln\lvert Z\rvert, \qquad Z \sim \mathcal{N}\!\left(\sum_{i=1}^{k} p_i^{\,q},\; \frac{q^2}{n}\Bigl[\sum_{i=1}^{k} p_i^{2q-1} - \Bigl(\sum_{i=1}^{k} p_i^{q}\Bigr)^{2}\Bigr]\right). \qquad (35)$$
Notice that this is not a normal distribution but the distribution of the logarithm of the absolute value of a normally distributed random variable.
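The following sketch illustrates why, when the dispersion of $Z$ is small relative to its mean, the law of $\frac{1}{1-q}\ln|Z|$ is hard to distinguish from a Gaussian; the mean, standard deviation, order $q$, and number of replicates below are arbitrary choices for the illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative parameters (assumptions): Z plays the role of the asymptotically
# normal sum of powers, with a small dispersion relative to its mean.
mu, sigma, q = 0.35, 0.01, 2.0
z = rng.normal(mu, sigma, size=300)

# Renyi-type transformation: logarithm of the absolute value of a normal variable.
h = np.log(np.abs(z)) / (1.0 - q)

# A first-order (delta method) Gaussian reference has mean log(mu)/(1-q)
# and standard deviation sigma / (|1-q| * mu).
print(h.mean(), np.log(mu) / (1.0 - q))
print(h.std(ddof=1), sigma / (abs(1.0 - q) * mu))

# Anderson-Darling statistic against normality, to be compared with the
# critical values reported by SciPy.
result = stats.anderson(h, dist="norm")
print(result.statistic, result.critical_values)
```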
Often, in practice, these entropies are scaled to be in $[0, 1]$; these are called “normalized entropies”. The following modifications must be considered in the normalized versions of the entropies. For the normalized Shannon entropy, the asymptotic mean and variance are multiplied by $1/\ln k$ and $1/\ln^{2} k$, respectively. In the case of the normalized Tsallis entropy, the asymptotic mean and variance are multiplied by $(q-1)/(1 - k^{1-q})$ and its square, respectively. Finally, the asymptotic distribution of the normalized Rényi entropy is that of (35) divided by $\ln k$. Notice that normalized entropies do not depend on the logarithm basis. The Fisher information is, as defined in (5), already normalized.
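In computational terms, the normalized versions amount to dividing each entropy by its maximum value, $\ln k$ for the Shannon and Rényi entropies and $(1 - k^{1-q})/(q-1)$ for the Tsallis entropy, as in the sketch below (function names are illustrative).

```python
import numpy as np

def normalized_shannon(p):
    """Shannon entropy scaled by its maximum value ln(k)."""
    p = np.asarray(p, dtype=float)
    k = p.size
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(k)

def normalized_tsallis(p, q):
    """Tsallis entropy scaled by its maximum value (1 - k**(1-q)) / (q - 1)."""
    p = np.asarray(p, dtype=float)
    k = p.size
    return (1.0 - np.sum(p ** q)) / (1.0 - k ** (1.0 - q))

def normalized_renyi(p, q):
    """Renyi entropy scaled by its maximum value ln(k)."""
    p = np.asarray(p, dtype=float)
    k = p.size
    return np.log(np.sum(p ** q)) / ((1.0 - q) * np.log(k))
```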
4. Analysis and Validation
In this section, we study the empirical distribution of the entropies computed from $\widehat{\mathbf{p}}$ under three models, four numbers of categories ($k$), and three sample sizes ($n$) that depend on the number of categories. These choices of $k$ and $n$ are based on the values that appear in signal analysis with ordinal patterns; see details of this technique in Appendix A.2.
We considered the following probability functions (a construction sketch follows the list):
Linear: probabilities that grow linearly with the category index.
One-Almost-Zero: equal probabilities for all categories but one, which receives an almost-null probability $\varepsilon$, the smallest positive number distinguishable from zero in our computer platform.
Half-and-Half: one constant probability for the first half of the categories and a different constant probability for the second half.
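A possible construction of these three probability functions is sketched below; the exact parametrizations (linear weights proportional to the index, the machine epsilon as the almost-null probability, and a 25%/75% split of the mass between the two halves) are assumptions of this sketch rather than the exact values used in the simulations.

```python
import numpy as np

def linear_model(k):
    """Probabilities growing linearly with the category index (assumed p_i proportional to i)."""
    i = np.arange(1, k + 1)
    return i / i.sum()

def one_almost_zero_model(k):
    """Uniform mass on the first k-1 categories and a near-zero mass on the last;
    using the machine epsilon is an assumption of this sketch."""
    eps = np.finfo(float).eps
    p = np.full(k, (1.0 - eps) / (k - 1))
    p[-1] = eps
    return p

def half_and_half_model(k, mass_first_half=0.25):
    """Constant probability within each half of the categories; the 25%/75%
    split of the total mass is an assumption of this sketch."""
    half = k // 2
    p = np.empty(k)
    p[:half] = mass_first_half / half
    p[half:] = (1.0 - mass_first_half) / (k - half)
    return p
```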
These probability functions are illustrated in Figure 1. We studied the behavior of the Shannon entropy, the Rényi and Tsallis entropies for selected values of the order $q$, and the Fisher information, computed on samples of the sizes mentioned above. We used 300 independent samples (replicates).
Although Equation (35) shows that the Rényi entropy is not asymptotically normal, we verified that its density is similar to that of a Gaussian distribution. With this in mind, we also checked the normality of the Rényi entropies. We used the Anderson–Darling test to verify the null hypothesis that the data follow a normal distribution. We chose this test because it uses the hypothesized distribution in calculating critical values. This test is more sensitive than other alternatives; see, for instance, the book by Lehmann and Romano [14].
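A sketch of this verification for a single configuration is shown below; the model, $k$, and $n$ are illustrative assumptions, while the number of replicates matches the 300 used in the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

k, n, replicates = 24, 2400, 300                       # illustrative sizes (assumptions)
p = np.arange(1, k + 1) / np.arange(1, k + 1).sum()    # Linear-type model (assumption)

# 300 plug-in Shannon entropies, one per independent multinomial sample.
p_hat = rng.multinomial(n, p, size=replicates) / n
h_hat = stats.entropy(p_hat, axis=1)

# Anderson-Darling test of normality: compare the statistic with the
# critical values that SciPy reports for several significance levels.
result = stats.anderson(h_hat, dist="norm")
print(result.statistic)
print(dict(zip(result.significance_level, result.critical_values)))
```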
From Table 1, we notice that the Fisher information is the one that most often fails to pass the normality test at the 1% level. One borderline entry of the table is, in fact, slightly different from the displayed value; the table shows rounded values.
Figure 2 shows four of these cases. We notice that the deviation from the normal hypothesis is more prevalent in both tails, where the observations are larger than the theoretical quantiles.
The normality hypothesis was rejected at the 1% level by the Anderson–Darling test in only 24 out of 432 situations, showing that the asymptotic Gaussian model for the entropies is a good description for these data.
Table 1 shows those situations.
With the aim of assessing the goodness of fit of the asymptotic models, we applied the Kolmogorov–Smirnov test to fifty replicates of samples. Table 2 shows the results in which the p-value of the test is at least equal to the adopted significance level.
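The sketch below reproduces the spirit of this check for the Shannon entropy under an assumed model: fifty plug-in entropies are compared, via the Kolmogorov–Smirnov test, with the normal law whose mean and variance are the theoretical asymptotic values, not quantities estimated from the replicates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

p = np.array([0.05, 0.15, 0.30, 0.50])   # illustrative model (assumption)
n, replicates = 5000, 50

# Fifty plug-in Shannon entropies.
h_hat = stats.entropy(rng.multinomial(n, p, size=replicates) / n, axis=1)

# Kolmogorov-Smirnov test against the asymptotic model of Section 3,
# i.e., a normal law with the theoretical (not estimated) mean and variance.
mean = stats.entropy(p)
var = (np.sum(p * np.log(p) ** 2) - np.sum(p * np.log(p)) ** 2) / n
print(stats.kstest(h_hat, "norm", args=(mean, np.sqrt(var))))
```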
It is worth noticing that even in those cases where the p-value is smaller than this level, the asymptotic models are a good fit to the data, as can be seen in several examples exhibited in Figure 3. The Fisher information shows the worst fit. Additionally, notice in
Figure 3d that, although the asymptotic distribution of the Rényi entropy is not normal, the probability density function is visually very close to the Gaussian model. We verified this similarity in all the cases we considered.
5. Application
Inspired by an example from Agresti [
16] (p. 200), we extracted data from the General Social Survey (GSS, a project of the independent research organization NORC at the University of Chicago, with principal funding from the National Science Foundation, available at
https://gss.norc.org/. The data were downloaded on 24 December 2022).
Table 3 shows the level of agreement to the assertion “Religious people are often too intolerant” as measured in three years.
The p-values of pairwise tests for the null hypotheses that the underlying probabilities are equal are
1998 and 2008:,
1998 and 2018:,
2008 and 2018:.
On the one hand, these values attest that the 1998 proportions differ markedly from those of both 2008 and 2018. On the other hand, although statistically significant, the change between 2008 and 2018 is less pronounced.
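Assuming these pairwise comparisons are Pearson chi-squared tests of homogeneity, they can be reproduced from the contingency tables as sketched below; the counts are hypothetical placeholders, since the actual figures are those reported in Table 3.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 5 contingency table for one pair of years
# (illustrative counts only; the actual data are in Table 3).
table = np.array([
    [150, 320, 260, 410, 120],   # year A
    [110, 280, 300, 450, 140],   # year B
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value, dof)
```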
Table 4 shows the asymptotic means and variances of the normalized entropies of the proportions reported in Table 3.
We perform the same hypothesis test with the asymptotic quantities presented in Table 4. Table 5 shows the p-values of the null hypothesis that the entropies are equal, using the test discussed by Chagas et al. [8] (Section 5):
$$p\text{-value} = 2\left[1 - \Phi\!\left(\frac{\bigl|H(\widehat{\mathbf{p}}_1) - H(\widehat{\mathbf{p}}_2)\bigr|}{\sqrt{\widehat{\sigma}_1^{2} + \widehat{\sigma}_2^{2}}}\right)\right], \qquad (37)$$
where $\Phi$ is the cumulative distribution function of a standard normal random variable, $H$ is any of the considered entropies computed with the observed proportions $\widehat{\mathbf{p}}_1$ and $\widehat{\mathbf{p}}_2$, and $\widehat{\sigma}_j^{2}$ is the corresponding sample asymptotic variance that takes into account the sample size $n_j$, $j \in \{1, 2\}$. Notice that the test based on entropies compares only these features, and not the underlying distributions.
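A minimal sketch of this comparison for the Shannon entropy is given below; it combines two plug-in entropy estimates and their estimated asymptotic variances in the Wald-type statistic above. The counts are hypothetical and serve only to show that the two samples may even have different numbers of categories.

```python
import numpy as np
from scipy import stats

def shannon_and_var(counts):
    """Plug-in Shannon entropy and its estimated asymptotic variance."""
    n = counts.sum()
    p = counts / n
    logs = np.log(p, out=np.zeros_like(p), where=p > 0)
    h = -np.sum(p * logs)
    var = (np.sum(p * logs ** 2) - np.sum(p * logs) ** 2) / n
    return h, var

def entropy_test_pvalue(counts1, counts2):
    """Two-sided p-value of the null hypothesis of equal Shannon entropies,
    following the Wald-type comparison sketched above."""
    h1, v1 = shannon_and_var(counts1)
    h2, v2 = shannon_and_var(counts2)
    z = abs(h1 - h2) / np.sqrt(v1 + v2)
    return 2.0 * (1.0 - stats.norm.cdf(z))

# Hypothetical counts for two survey waves (illustrative values only);
# the vectors may have different numbers of categories.
counts_three_levels = np.array([300.0, 180.0, 220.0])
counts_five_levels = np.array([120.0, 180.0, 240.0, 160.0, 100.0])
print(entropy_test_pvalue(counts_three_levels, counts_five_levels))
```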
The results in Table 5 are consistent with those provided by the $\chi^2$ tests, i.e., the most significant differences arise between 1998 and 2008 and between 1998 and 2018. Moreover, the tests based on entropies do not reject the null hypothesis in the pair 2008–2018, except for the Rényi entropy for one of the considered orders. The increased p-values are a consequence of the information reduction: whereas the $\chi^2$ test compares count-by-count, those based on entropies compare two scalars.
In the second part of this application, we illustrate the use of test statistics based on entropies for comparing samples with different numbers of categories. Situations like this may appear when applying alternative versions of the same questionnaire in a series of surveys.
We collapsed the categories of 1998 into three: “agreement” (by adding “strongly agree” and “agree”), “indifference” (“neither agree nor disagree”), and “disagreement” (by adding “disagree” and “strongly disagree”). The resulting asymptotic mean entropies and asymptotic variances are shown in
Table 6.
Table 7 presents the
p-values of the tests that verify the null hypothesis of the same entropy between the collapsed 1998 data (three categories), and 2008 and 2018 (five categories). These results agree with those presented in
Table 5. Such an agreement suggests that, although the number of categories was reduced in 1998 from five to three, the tests based on entropies cope with the loss of information.
6. Conclusions
We presented expressions for the asymptotic distribution of the Rényi and Tsallis entropies of order $q$, and of the Fisher information. The Fisher information and the Tsallis and Shannon entropies have limiting normal distributions with means and variances that depend on the underlying probabilities of the patterns and on the number of patterns. The Rényi entropy follows, asymptotically, a different distribution, cf. (35), but a Gaussian law approximates it well. Those expressions pose no numerical challenges other than adopting the usual convention for terms of the form $0 \ln 0$. We verified that these asymptotic distributions are good models for data arising both from simulations with a variety of models and from the analysis of actual data.
On the one hand, the Fisher information is the one that fails most frequently to pass the Anderson–Darling normality tests. On the other hand, it does not provide evidence to reject that hypothesis under the One-Almost-Zero model.
The distributions we present here can be used for building test statistics, as discussed by Chagas et al. [
8]. Moreover, Equation (
37) allows performing tests with mixed types of distributions, a situation that may appear in Internet of Things applications, in which, citing Borges et al. [
17], one has to deal with “large time series data generated at different rates, of different types and magnitudes, possibly having issues concerning uncertainty, inconsistency, and incompleteness due to missing readings and sensor failures.”