1. Introduction
When experimentally testing Bayesian network learning algorithms, performance is most often evaluated by looking at the structural differences between the graphs of the original Bayesian network and the learned one [1], as in the case of the structural Hamming distance. This measure is used in recent contributions such as [2,3,4]. A study and comparison of the different metrics used to measure the structural differences between two Bayesian networks can be found in [1].
However, in most cases the aim of learning a Bayesian network is to estimate a joint probability distribution for the variables in the problem. In that situation, the error of a learning procedure should be computed by measuring the difference between the probability distribution associated with the learned network and the original joint probability distribution. Therefore, it can be useful to estimate a network that is less dense than the original one, but whose parameters can be estimated more accurately. This is the case of the naive Bayes classifier, which obtains very good results in classification problems despite the fact that its structure is not the correct one. In this situation, structural graphical differences are not a good measure of performance.
The basic measure to determine the divergence between an estimated distribution and a true one is the so-called Kullback–Leibler divergence [5]. Some papers use it to assess the quality of a learning procedure, as in [6,7,8]. A direct computation of the divergence is unfeasible if the number of variables is high. However, some basic decomposition properties ([9], Theorem 8.5) can be applied to reduce the cost of computing the divergence. This is the basis of the procedure implemented in the Elvira system [10], which is the one used in [6]. The methods in [7,8] are also based on the same basic decomposition. The Kullback–Leibler divergence is meaningful not only for measuring the divergence between a learned network and a true one, but also for other tasks, such as the approximation of a Bayesian network by a simpler one [11,12,13] obtained by removing some of the existing links.
The aim of this work is to improve existing methods for computing the Kullback–Leibler divergence in Bayesian networks and to provide a basic algorithm for this task, implemented in Python and integrated into the pgmpy [14] environment. The algorithm implemented in the Elvira system [10] is based on carrying out a number of propagation computations in the original true network. The hypothesis underlying our approach is that many computations are repeated in these propagation algorithms; therefore, we determine which operations with potentials are repeated and store their results in a cache of operations so that they can be reused. The experimental work will show that this is an effective method for improving the efficiency of the algorithms, especially in large networks.
The paper is organized as follows: Section 2 is devoted to setting the basic framework and presenting fundamental results for the computation of the Kullback–Leibler divergence; Section 3 describes the method implemented in the Elvira system for computing the Kullback–Leibler divergence; Section 4 describes our proposal based on the cache of operations with potentials; Section 5 contains the experimental setting and the obtained results; finally, the conclusions are presented in Section 6.
2. Kullback–Leibler Divergence
Let N be a Bayesian network defined on a set of variables $\mathbf{X} = \{X_1, \dots, X_n\}$. The family of a variable $X_i$ in $\mathbf{X}$ is termed $F(X_i) = \{X_i\} \cup Pa(X_i)$, where $Pa(X_i)$ is the set of parents of $X_i$ in the directed acyclic graph (DAG) defined by N. $\mathcal{F} = \{F(X_1), \dots, F(X_n)\}$ denotes the complete set of families, one for each one of the variables. Sometimes simplified notations for families and parent sets will be used: $F_i$ (for $F(X_i)$) and $Pa_i$ (for $Pa(X_i)$), respectively. As a running example, assume a network with three variables, $\mathbf{X} = \{X_1, X_2, X_3\}$, and the following structure: $X_1 \rightarrow X_2 \rightarrow X_3$ (see right part of Figure 1). Then the set of families for this network is given by $\mathcal{F} = \{F_1, F_2, F_3\}$, where $F_1 = \{X_1\}$, $F_2 = \{X_1, X_2\}$ and $F_3 = \{X_2, X_3\}$.
A configuration or assignment of values to a set of variables $\mathbf{Y} = \{Y_1, \dots, Y_k\}$, $Y_1 = y_1, \dots, Y_k = y_k$, can be abbreviated with $\mathbf{Y} = \mathbf{y}$ and is denoted as $\mathbf{y} = (y_1, \dots, y_k)$. If the set of possible values for each variable in the previous example is $\{0, 1\}$, then a configuration for $\mathbf{X}$ can be $\mathbf{x} = (0, 1, 0)$, representing the assignment $X_1 = 0$, $X_2 = 1$, $X_3 = 0$.
A partial configuration involving a subset of variables $\mathbf{Y} \subseteq \mathbf{X}$ is denoted as $\mathbf{y}$. If the set of variables is $F_i$ or $Pa_i$, then the partial configuration will be denoted by $\mathbf{f}_i$ or $\mathbf{pa}_i$, respectively. In our example, if $\mathbf{Y} = F_2 = \{X_1, X_2\}$, an example of partial configuration about these variables will be $\mathbf{f}_2 = (0, 1)$.
The set of configurations for variables $\mathbf{Y}$ is denoted by $\Omega_{\mathbf{Y}}$. If $\mathbf{y}$ is an assignment for $\mathbf{Y}$ and $\mathbf{Z} \subseteq \mathbf{Y}$, then the configuration obtained by deleting the values of the variables in $\mathbf{Z}$ is denoted by $\mathbf{y}^{-\mathbf{Z}}$. If $\mathbf{f}_2 = (0, 1)$ is a partial configuration about variables $F_2 = \{X_1, X_2\}$ and we consider $\mathbf{Z} = \{X_1\}$, then $\mathbf{f}_2^{-\mathbf{Z}} = (1)$ is the configuration obtained by removing the value of $X_1$, i.e., $X_2 = 1$.
If $\mathbf{y}$ and $\mathbf{z}$ are configurations for $\mathbf{Y}$ and $\mathbf{Z}$ respectively, and $\mathbf{Y} \cap \mathbf{Z} = \emptyset$, then $(\mathbf{y}, \mathbf{z})$ is a configuration for $\mathbf{Y} \cup \mathbf{Z}$, and will be called the composition of $\mathbf{y}$ and $\mathbf{z}$. For example, if $\mathbf{y} = (0)$ is the configuration about variable $X_1$ and $\mathbf{z} = (1, 0)$ is the configuration defined on $\{X_2, X_3\}$, then its composition will be the configuration $(\mathbf{y}, \mathbf{z}) = (0, 1, 0)$ for variables $\{X_1, X_2, X_3\}$.
The conditional probability distribution for $X_i$ given its parents will be denoted as $\phi_i = p(X_i \mid Pa_i)$, which is a potential defined on the set of variables $F_i$. In general, a potential $\phi$ for variables $\mathbf{Y}$ is a mapping defined on $\Omega_{\mathbf{Y}}$ into the set of real numbers: $\phi : \Omega_{\mathbf{Y}} \rightarrow \mathbb{R}$. The set of variables of potential $\phi$ will be denoted as $s(\phi)$. If $\Phi$ is a set of potentials, $s(\Phi)$ will denote $\bigcup_{\phi \in \Phi} s(\phi)$.
In our example, there are three potentials $\phi_1$, $\phi_2$ and $\phi_3$ (that is, a probability distribution for $X_1$, and two conditional probability distributions: one for $X_2$ given $X_1$ and the other for $X_3$ given $X_2$, respectively).
There are three basic operations that can be performed on potentials (a Python sketch illustrating the three of them is given after this list):
Multiplication. If $\phi_1, \phi_2$ are potentials, then their multiplication is the potential $\phi_1 \cdot \phi_2$, with set of variables $s(\phi_1 \cdot \phi_2) = s(\phi_1) \cup s(\phi_2)$ and obtained by pointwise multiplication:

$(\phi_1 \cdot \phi_2)(\mathbf{y}) = \phi_1(\mathbf{y}^{\downarrow s(\phi_1)}) \cdot \phi_2(\mathbf{y}^{\downarrow s(\phi_2)}), \quad \mathbf{y} \in \Omega_{s(\phi_1) \cup s(\phi_2)}$

where $\mathbf{y}^{\downarrow \mathbf{Z}}$ denotes the restriction of configuration $\mathbf{y}$ to the variables in $\mathbf{Z}$. In our example, the combination of $\phi_2$ and $\phi_3$ will be the potential $\phi_2 \cdot \phi_3$ defined on $\{X_1, X_2, X_3\}$ and given by $(\phi_2 \cdot \phi_3)(x_1, x_2, x_3) = \phi_2(x_1, x_2) \cdot \phi_3(x_2, x_3)$.
Marginalization. If $\phi$ is a potential defined for variables $\mathbf{Y}$ and $\mathbf{Z} \subseteq \mathbf{Y}$, then the marginalization of $\phi$ on $\mathbf{Z}$ is denoted by $\phi^{\downarrow \mathbf{Z}}$ and it is obtained by summing in the variables in $\mathbf{Y} \setminus \mathbf{Z}$:

$\phi^{\downarrow \mathbf{Z}}(\mathbf{z}) = \sum_{\mathbf{w} \in \Omega_{\mathbf{Y} \setminus \mathbf{Z}}} \phi(\mathbf{z}, \mathbf{w}), \quad \mathbf{z} \in \Omega_{\mathbf{Z}}$

When $\mathbf{Z}$ is equal to $\mathbf{Y}$ minus a variable W, then $\phi^{\downarrow \mathbf{Z}}$ will be called the result of removing W in $\phi$ and also denoted as $\phi^{-W}$. In the example, a marginalization of $\phi_2$ is obtained by removing $X_2$, producing $\phi_2^{-X_2}$ defined on $\{X_1\}$ and given by $\phi_2^{-X_2}(x_1) = \sum_{x_2} \phi_2(x_1, x_2)$. If $\phi_2$ represents the conditional probability of $X_2$ given $X_1$, then it can be obtained that $\phi_2^{-X_2}$ is always equal to 1 ($\phi_2^{-X_2}(x_1) = \sum_{x_2} p(x_2 \mid x_1) = 1$).
Selection. If $\phi$ is a potential defined for variables $\mathbf{Y}$ and $\mathbf{z}$ is a configuration for variables $\mathbf{Z} \subseteq \mathbf{Y}$, then the selection of $\phi$ for this configuration $\mathbf{z}$ is the potential $\phi^{R(\mathbf{z})}$ defined on variables $\mathbf{Y} \setminus \mathbf{Z}$ and given by

$\phi^{R(\mathbf{z})}(\mathbf{y}') = \phi(\mathbf{y}', \mathbf{z}), \quad \mathbf{y}' \in \Omega_{\mathbf{Y} \setminus \mathbf{Z}}$

In this expression $(\mathbf{y}', \mathbf{z})$ is the composition of configurations $\mathbf{y}'$ and $\mathbf{z}$, which is a configuration for variables $\mathbf{Y}$. Going back to the example, assume that we want to perform the selection of $\phi_2$ to configuration $\mathbf{z} = (1)$ for variables $\mathbf{Z} = \{X_2\}$; then $\phi_2^{R(\mathbf{z})}$ will be a potential defined for variables $\{X_1\}$ given by $\phi_2^{R(\mathbf{z})}(x_1) = \phi_2(x_1, 1)$, as we are reducing $\phi_2$ to a configuration in which $X_2 = 1$.
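To make these operations concrete, the following minimal Python sketch represents a potential as an explicit table over configurations. The class and its methods (Potential, multiply, remove, select) are our own illustration, assuming binary variables as in the running example; this is not the pgmpy-integrated implementation discussed later in the paper:

    from itertools import product

    class Potential:
        # A potential phi: a mapping from configurations of its variables to reals.
        # Minimal illustrative sketch; binary variables are assumed throughout.

        def __init__(self, variables, values):
            self.variables = tuple(variables)      # scope s(phi)
            self.values = dict(values)             # {configuration tuple: float}

        def multiply(self, other):
            # Pointwise multiplication; the scope is the union of both scopes.
            new_vars = self.variables + tuple(v for v in other.variables
                                              if v not in self.variables)
            result = {}
            for conf in product((0, 1), repeat=len(new_vars)):
                assign = dict(zip(new_vars, conf))
                result[conf] = (self.values[tuple(assign[v] for v in self.variables)]
                                * other.values[tuple(assign[v] for v in other.variables)])
            return Potential(new_vars, result)

        def remove(self, var):
            # Marginalization phi^{-W}: sum out a single variable W.
            new_vars = tuple(v for v in self.variables if v != var)
            result = {}
            for conf, val in self.values.items():
                assign = dict(zip(self.variables, conf))
                key = tuple(assign[v] for v in new_vars)
                result[key] = result.get(key, 0.0) + val
            return Potential(new_vars, result)

        def select(self, evidence):
            # Selection phi^{R(z)}: restrict phi to a configuration z of some variables.
            new_vars = tuple(v for v in self.variables if v not in evidence)
            result = {}
            for conf, val in self.values.items():
                assign = dict(zip(self.variables, conf))
                if all(assign[v] == x for v, x in evidence.items()):
                    result[tuple(assign[v] for v in new_vars)] = val
            return Potential(new_vars, result)

For instance, phi2.select({'X2': 1}) returns the selection of $\phi_2$ described above, and phi2.remove('X2') its marginalization.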
The family of all the conditional distributions is denoted as $\Phi = \{\phi_1, \dots, \phi_n\}$. It is well known that, given N, the joint probability distribution of the variables in N, p, is a potential that decomposes as the product of the potentials included in $\Phi$:

$p = \prod_{i=1}^{n} \phi_i \qquad (1)$

Considering the example, $p = \phi_1 \cdot \phi_2 \cdot \phi_3$ and $p(x_1, x_2, x_3) = \phi_1(x_1) \cdot \phi_2(x_1, x_2) \cdot \phi_3(x_2, x_3)$. The marginal distribution of p for a set of variables $\mathbf{Y}$ is equal to $p^{\downarrow \mathbf{Y}}$. When $\mathbf{Y}$ contains only one variable $X_i$, then to simplify the notation, $p^{\downarrow \{X_i\}}$ will be denoted as $p^{\downarrow X_i}$. Sometimes it will be needed to make reference to the Bayesian network containing a potential or family. In these cases we will use a superscript. For example, $F_i^N$ and $Pa_i^N$ refer to the family and parents set of $X_i$ in a Bayesian network N, respectively.
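With the previous sketch, the factorization in Equation (1) for the running example can be reproduced directly; the numerical parameters below are arbitrary and only for illustration:

    # Chain network X1 -> X2 -> X3 with arbitrary (illustrative) parameters.
    phi1 = Potential(('X1',), {(0,): 0.6, (1,): 0.4})               # p(X1)
    phi2 = Potential(('X1', 'X2'), {(0, 0): 0.7, (0, 1): 0.3,
                                    (1, 0): 0.2, (1, 1): 0.8})      # p(X2 | X1)
    phi3 = Potential(('X2', 'X3'), {(0, 0): 0.5, (0, 1): 0.5,
                                    (1, 0): 0.9, (1, 1): 0.1})      # p(X3 | X2)

    p = phi1.multiply(phi2).multiply(phi3)       # joint distribution, Equation (1)
    assert abs(sum(p.values.values()) - 1.0) < 1e-9
    marginal_x2 = p.remove('X3').remove('X1')    # marginal of p on {X2}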
The aim of this paper is to compute the Kullback–Leibler divergence (termed KL) between the joint probability distributions, $p^A$ and $p^B$, of two different Bayesian networks $N^A$ and $N^B$ defined on the same set of variables $\mathbf{X}$ but possibly having different structures. This divergence, denoted as $KL(p^A, p^B)$, can be computed considering the probabilities for each configuration $\mathbf{x} \in \Omega_{\mathbf{X}}$ in both distributions as follows:

$KL(p^A, p^B) = \sum_{\mathbf{x} \in \Omega_{\mathbf{X}}} p^A(\mathbf{x}) \log \frac{p^A(\mathbf{x})}{p^B(\mathbf{x})} \qquad (2)$
However, the computation of the joint probability distribution may be unfeasible for complex models, as the number of configurations $|\Omega_{\mathbf{X}}|$ is exponential in the number of variables. If $p, q$ are probability distributions on $\Omega_{\mathbf{X}}$, then the expected log-likelihood (LL) of q with respect to p is:

$LL(q \mid p) = \sum_{\mathbf{x} \in \Omega_{\mathbf{X}}} p(\mathbf{x}) \log q(\mathbf{x})$

then, from Equation (2) it is immediate that:

$KL(p^A, p^B) = LL(p^A \mid p^A) - LL(p^B \mid p^A) \qquad (3)$
The probability distribution $p^B$ can be decomposed as well, as considered in Equation (1). Therefore, the term $LL(p^B \mid p^A)$ in Equation (3) can be obtained as follows, considering the families of variables in $N^B$ and their corresponding configurations, $\mathbf{f}_i^B = \mathbf{x}^{\downarrow F_i^B}$:

$LL(p^B \mid p^A) = \sum_{\mathbf{x} \in \Omega_{\mathbf{X}}} p^A(\mathbf{x}) \log \prod_{i=1}^{n} \phi_i^B(\mathbf{f}_i^B) = \sum_{\mathbf{x} \in \Omega_{\mathbf{X}}} \sum_{i=1}^{n} p^A(\mathbf{x}) \log \phi_i^B(\mathbf{f}_i^B) \qquad (4)$
Interchanging additions and reorganizing the terms in Equation (4):

$LL(p^B \mid p^A) = \sum_{i=1}^{n} \sum_{\mathbf{f}_i^B \in \Omega_{F_i^B}} (p^A)^{\downarrow F_i^B}(\mathbf{f}_i^B) \log \phi_i^B(\mathbf{f}_i^B) \qquad (5)$
Equation (5) implies a decomposition of the term $LL(p^B \mid p^A)$ and, as a consequence, of the $KL(p^A, p^B)$ computation as well. Observe that $\phi_i^B(\mathbf{f}_i^B)$ is the value of the potential $\phi_i^B$ for a configuration $\mathbf{f}_i^B$ and can be obtained directly from the potential $\phi_i^B$ of the Bayesian network $N^B$. The main difficulty in Equation (5) consists of the computation of the $(p^A)^{\downarrow F_i^B}(\mathbf{f}_i^B)$ values, as it is necessary to compute the marginal probability distribution for the variables in $F_i^B$, the family of $X_i$ in the Bayesian network $N^B$, but using the joint probability distribution $p^A$ associated with the Bayesian network $N^A$.
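As an illustration of Equations (3) and (5), the sketch below computes the divergence for small networks by obtaining every required family marginal of $p^A$ by brute force. The function names are our own, and the parameters are assumed strictly positive so that the logarithms are well defined:

    import math

    def marginal(joint, subset):
        # Brute-force marginal of a joint Potential on a subset of its variables.
        result = joint
        for v in joint.variables:
            if v not in subset:
                result = result.remove(v)
        return result

    def expected_ll(joint_a, potentials):
        # Decomposed expected log-likelihood, Equation (5): only marginals of p^A
        # over the families of the other network are needed.
        total = 0.0
        for phi in potentials:
            m = marginal(joint_a, phi.variables)
            for conf, prob in m.values.items():
                # re-order the configuration to match phi's variable ordering
                phi_conf = tuple(conf[m.variables.index(v)] for v in phi.variables)
                total += prob * math.log(phi.values[phi_conf])
        return total

    def kl_divergence(potentials_a, potentials_b):
        # KL(p^A, p^B) = LL(p^A | p^A) - LL(p^B | p^A), Equation (3).
        joint_a = potentials_a[0]
        for phi in potentials_a[1:]:
            joint_a = joint_a.multiply(phi)
        return expected_ll(joint_a, potentials_a) - expected_ll(joint_a, potentials_b)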
4. Inference with Operations Cache
The approach proposed in this paper is based on the following fact: the computation of the KL divergence using Equations (3) and (5) requires us to obtain the following families of marginal distributions (a naive pgmpy-based baseline for this computation is sketched after the list):

- $(p^A)^{\downarrow F_i^A}$, for each $X_i$ in $\mathbf{X}$, for computing $LL(p^A \mid p^A)$;
- $(p^A)^{\downarrow F_i^B}$, for each $X_i$ in $\mathbf{X}$, for obtaining $LL(p^B \mid p^A)$.
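For reference, the following sketch obtains these marginals with pgmpy's standard variable elimination and evaluates Equations (3) and (5) directly. It is only a naive baseline of our own (each query is solved independently, repeating work that the cache described below avoids); it assumes that both models share variables and state names and have strictly positive parameters:

    import math
    from itertools import product
    from pgmpy.inference import VariableElimination

    def expected_ll(model_q, infer_a):
        # LL(p^Q | p^A), Equation (5): for every family of model_q, query the
        # marginal of p^A on that family and average log phi under it.
        total = 0.0
        for var in model_q.nodes():
            phi = model_q.get_cpds(var).to_factor()
            marg = infer_a.query(variables=list(phi.variables), show_progress=False)
            states = [marg.state_names[v] for v in marg.variables]
            for conf in product(*states):
                assign = dict(zip(marg.variables, conf))
                total += marg.get_value(**assign) * math.log(phi.get_value(**assign))
        return total

    def kl_between_networks(model_a, model_b):
        # KL(p^A, p^B) = LL(p^A | p^A) - LL(p^B | p^A), Equation (3).
        infer_a = VariableElimination(model_a)
        return expected_ll(model_a, infer_a) - expected_ll(model_b, infer_a)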
We have designed a procedure to compute each one of the required marginals. Each marginal is computed by deleting the variables not in the corresponding family. The procedure uses a cache of computations whose results can be reused in the different marginalizations in order to avoid repeated computations.
We have implemented a general procedure to calculate the marginals for a family $\mathcal{H} = \{H_1, \dots, H_K\}$ of subsets of $\mathbf{X}$ in a Bayesian network N. In our case, the family $\mathcal{H}$ contains $F_i^A$ and $F_i^B$ for each $X_i$, and the Bayesian network is $N^A$. A previous step consists of determining the relevant potentials for computing the marginal on a subset $H_j$, as not all the initial potentials are necessary. If $\Phi$ is the list of potentials, then a conditional probability potential $\phi_i$ for variable $X_i$ is relevant for $H_j$ when $X_i$ is an ascendant of some of the variables in $H_j$. This is a consequence of known relevance properties in Bayesian networks [17]. Let us call $\Phi_j$ the family of relevant potentials for subset $H_j$.
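A possible implementation of this relevance step is the ancestor closure below (a small illustrative helper of our own; parents is assumed to map each variable to its parent set, and each potential is indexed by its head variable):

    def relevant_potentials(parents, potentials, subset):
        # Keep phi_i when X_i belongs to `subset` or is an ascendant (ancestor)
        # of some variable in `subset`.
        needed, stack = set(), list(subset)
        while stack:
            v = stack.pop()
            if v not in needed:
                needed.add(v)
                stack.extend(parents[v])     # climb to the parents of v
        return {v: phi for v, phi in potentials.items() if v in needed}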
Our algorithm assumes that the subsets in $\mathcal{H}$ are numbered from 1 to K: $H_1, \dots, H_K$. The algorithm first carries out the deletion algorithm symbolically, without actually doing numerical computations, in order to determine which operations can be reused. A symbolic combination of $\phi_1$ and $\phi_2$ consists of determining a potential $\phi$ defined for variables $s(\phi_1) \cup s(\phi_2)$ but without computing its numerical values (only the scope of the resulting potential is actually computed). The same is done in the case of marginalization.
In fact, two repositories are employed: one for potentials ($\Phi_R$) and another for operations ($O_R$). The entry for each potential in $\Phi_R$ contains a value acting as its identifier ($id$), the potential itself, and the identifier of the last operation for which this potential was required (this is denoted as the potential $time$). Initially, $\Phi_R$ contains the potentials in $\Phi$, assigning $time = 0$ to all of them. When a potential is no longer required, it is removed from $\Phi_R$ in order to alleviate memory space requirements. The potentials representing the required marginals (the results of the queries) are set with $time = \infty$ in order to avoid their deletion.
The repository $O_R$ contains an entry for each operation (combination or marginalization) with potentials performed during the running of the algorithm in order to compute the required marginals. Thus, if an operation is needed again in the future, its result can be retrieved from $O_R$, preventing repeated computations. Initially, $O_R$ will be empty. At the end of the analysis, it will include the description of the elementary operations carried out throughout the evaluation of all the queries. Two kinds of operations will be stored in $O_R$:

- combination of two potentials $\phi_1$ and $\phi_2$, producing a new one as result, $\phi_r = \phi_1 \cdot \phi_2$;
- marginalization of a potential $\phi_1$, in order to sum out a variable W, obtaining $\phi_r = \phi_1^{-W}$ as result.

The operation descriptions will be stored as registers with the following information:

- A unique identifier for the operation ($id$; an integer).
- The type of operation ($type$): marginalization or combination.
- Identifiers of the potentials involved as operands and result (identifiers allow retrieving the potentials from $\Phi_R$). If the operation is a marginalization, then the second operand will identify the index of the variable to remove.
The computation of a marginal for a set $H_j$ will also require a deletion order for the variables in some cases. This order is always obtained with a fixed triangulation heuristic [18]. However, the procedure described here does not depend on this heuristic, and any one of them could be used, as in the sketch below.
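For instance, a greedy heuristic of this kind could be sketched as follows (our own illustration of one possible choice: repeatedly delete the variable whose combined potential scope is smallest):

    def deletion_order(scopes, keep):
        # Order in which to delete the variables not in `keep`, choosing at each
        # step the variable whose combined potential scope is smallest.
        scopes = [set(s) for s in scopes]
        to_delete = set().union(*scopes) - set(keep)
        order = []
        while to_delete:
            v = min(to_delete,
                    key=lambda w: len(set().union(*(s for s in scopes if w in s))))
            merged = set().union(*(s for s in scopes if v in s)) - {v}
            scopes = [s for s in scopes if v not in s] + [merged]
            order.append(v)
            to_delete.remove(v)
        return order

For the last marginal of the running example below, deletion_order([{'X1'}, {'X1', 'X2'}, {'X1', 'X3'}], {'X2', 'X3'}) simply returns ['X1'].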
Algorithm 2 depicts the basic structure of the procedure. The result is R, an ordered list of K potentials containing the required marginals for $\mathcal{H}$. The algorithm is divided into two main parts.
In the first part (lines 2–26), the operations are planned (using symbolic propagation), detecting repeated operations. It is assumed that there are two basic functions, SCombine and SMarginalize, representing the symbolic operations: SCombine($\phi_1$, $\phi_2$) will create a new potential $\phi$ with $s(\phi) = s(\phi_1) \cup s(\phi_2)$, and SMarginalize($\phi_1$, W) will produce another potential $\phi$ with $s(\phi) = s(\phi_1) \setminus \{W\}$.
We will also consider that there are two conditional versions of these operations: if the operation already exists, only the $time$ values are updated; if it does not exist, it is symbolically carried out and added to the repository of operations. The conditional combination will be CondSCombine and the conditional marginalization will be CondSMarginalize; they are depicted in Algorithms 3 and 4, respectively (a Python sketch of both is given after the listings). It is assumed that both repositories are global variables for all the procedures. The potentials representing the required marginals are never deleted. For that, a time equal to $\infty$ is assigned: if $time = \infty$, then the potential should not be removed, and this time is never updated. We will assume a function UpdateTime($\phi$, t) which does nothing if the time of $\phi$ is equal to $\infty$, and otherwise updates the time of $\phi$ to t in repository $\Phi_R$.
Observe that the first part of Algorithm 2 (lines 2–26) just determines the operations necessary for the deletion algorithm to obtain all the marginals, while the second part (lines 27–32) carries out the numerical computations in the order established in the first part. After each operation, the potentials that are no longer necessary are removed from $\Phi_R$ and their memory is deallocated. We will assume a function DeleteIf($\phi$, t) doing this (remove $\phi$ from $\Phi_R$ if the $time$ of $\phi$ is equal to t).
As mentioned above, the analysis of the operation sequence will be carried out using symbolic operations, taking into account the scopes of the potentials but without numerical computations. This allows an efficient analysis. The result of the analysis will be used as an operation plan for the subsequent numerical computation.
Assume the same example considered in previous sections for the networks in Figure 1. The marginals to compute on $N^A$ (as reference model) will correspond to families $F_1^A$, $F_2^A$, $F_3^A$ and $F_3^B$ (observe that $F_1^B = F_1^A$, and $F_2^B = F_2^A$). Therefore, in this case $\mathcal{H} = \{F_1^A, F_2^A, F_3^A, F_3^B\} = \{\{X_1\}, \{X_1, X_2\}, \{X_1, X_3\}, \{X_2, X_3\}\}$.
Initially, the potentials repository $\Phi_R$ contains the potentials of $N^A$ (a conditional probability for each variable given its parents): $\phi_1(X_1)$, $\phi_2(X_1, X_2)$ and $\phi_3(X_1, X_3)$, all with time 0. We indicate in parentheses the variables involved in each potential. The operations repository, $O_R$, will be empty. Table 1 contains the initial repositories. Notice that the superscript A has been omitted in order to simplify the notation.
Algorithm 2 Computation of marginals of p for subsets $\mathcal{H} = \{H_1, \dots, H_K\}$

1: function Marginal($\mathcal{H}$, $\Phi$)
2:   $t \leftarrow 0$; $R \leftarrow$ empty list
3:   for each $j \in \{1, \dots, K\}$ do
4:     Let $H_j$ be the j-th subset in $\mathcal{H}$
5:     Let $\Phi_j$ be the family of potentials from $\Phi$ relevant to subset $H_j$
6:     for $W \in s(\Phi_j) \setminus H_j$ do   ▹ determine operations for the query
7:       Let $\Phi_W = \{\phi \in \Phi_j : W \in s(\phi)\}$
8:       Assume $\Phi_W = \{\phi_{i_1}, \dots, \phi_{i_m}\}$
9:       $\psi \leftarrow \phi_{i_1}$
10:      for $l = 2, \dots, m$ do
11:        $\psi \leftarrow$ CondSCombine($\psi$, $\phi_{i_l}$, t)
12:        $t \leftarrow t + 1$
13:      end for
14:      $\psi \leftarrow$ CondSMarginalize($\psi$, W, t)
15:      $t \leftarrow t + 1$
16:      $\Phi_j \leftarrow (\Phi_j \setminus \Phi_W) \cup \{\psi\}$
17:    end for
18:    Assume $\Phi_j = \{\psi_1, \dots, \psi_r\}$   ▹ compute joint distribution
19:    $\psi \leftarrow \psi_1$
20:    for $l = 2, \dots, r$ do
21:      $\psi \leftarrow$ CondSCombine($\psi$, $\psi_l$, t)
22:      $t \leftarrow t + 1$
23:    end for
24:    Append $\psi$ to R
25:    Set $time$ of $\psi$ to $\infty$
26:  end for
27:  $t_{max} \leftarrow t$
28:  for each $t \in \{0, \dots, t_{max} - 1\}$ do   ▹ start numerical computation using operations planning
29:    Select register with time t from $O_R$: ($t$, $type$, $\phi_1$, $\phi_2$, $\phi_r$)
30:    Compute numerical values of $\phi_r$   ▹ Actual computation
31:    DeleteIf($\phi_1$, t), DeleteIf($\phi_2$, t), DeleteIf($\phi_r$, t)
32:  end for
33:  return R
34: end function
Algorithm 3 Conditional symbolic combination

1: function CondSCombine($\phi_1$, $\phi_2$, t)
2:   if register (combination, $\phi_1$, $\phi_2$, $\phi$) is in $O_R$ then
3:     UpdateTime($\phi_1$, t), UpdateTime($\phi_2$, t), UpdateTime($\phi$, t)
4:   else
5:     $\phi \leftarrow$ SCombine($\phi_1$, $\phi_2$)
6:     Add register (combination, $\phi_1$, $\phi_2$, $\phi$) to $O_R$ with t as identifier
7:     UpdateTime($\phi_1$, t), UpdateTime($\phi_2$, t)
8:     Add $\phi$ to $\Phi_R$ with t as time
9:   end if
10:  return $\phi$
11: end function
Algorithm 4 Conditional symbolic marginalization

1: function CondSMarginalize($\phi_1$, W, t)
2:   if register (marginalization, $\phi_1$, W, $\phi$) is in $O_R$ then
3:     UpdateTime($\phi_1$, t), UpdateTime($\phi$, t)
4:   else
5:     $\phi \leftarrow$ SMarginalize($\phi_1$, W)
6:     Add register (marginalization, $\phi_1$, W, $\phi$) to $O_R$ with t as identifier
7:     UpdateTime($\phi_1$, t)
8:     Add $\phi$ to $\Phi_R$ with t as time
9:   end if
10:  return $\phi$
11: end function
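The following Python sketch illustrates one way the symbolic phase of Algorithms 2–4 can be organized (class and method names are our own illustration; potentials are reduced to their scopes, $\Phi_R$ to the dictionaries scopes/times, and $O_R$ to a dictionary keyed by the operation description):

    class SymbolicPlanner:
        # Sketch of the symbolic phase of Algorithms 2-4: potentials are reduced
        # to their scopes, and the cache O_R is a dictionary keyed by operations.
        INF = float('inf')

        def __init__(self, initial_scopes):
            self.scopes = {}               # Phi_R: pid -> scope (set of variables)
            self.times = {}                # pid -> id of last operation requiring it
            self.ops = {}                  # O_R: operation key -> result pid
            self.plan = []                 # registers, in execution order
            for scope in initial_scopes:
                self._new_potential(frozenset(scope), 0)

        def _new_potential(self, scope, t):
            pid = len(self.scopes)
            self.scopes[pid], self.times[pid] = scope, t
            return pid

        def _update_time(self, pid, t):    # UpdateTime of Algorithms 3 and 4
            if self.times[pid] != self.INF:
                self.times[pid] = t

        def cond_combine(self, p1, p2, t):     # CondSCombine (Algorithm 3)
            key = ('combine', frozenset((p1, p2)))
            if key not in self.ops:            # plan a new symbolic combination
                res = self._new_potential(self.scopes[p1] | self.scopes[p2], t)
                self.ops[key] = res
                self.plan.append((t, 'combine', p1, p2, res))
            res = self.ops[key]
            for pid in (p1, p2, res):
                self._update_time(pid, t)
            return res

        def cond_marginalize(self, p1, var, t):    # CondSMarginalize (Algorithm 4)
            key = ('marginalize', p1, var)
            if key not in self.ops:            # plan a new symbolic marginalization
                res = self._new_potential(self.scopes[p1] - {var}, t)
                self.ops[key] = res
                self.plan.append((t, 'marginalize', p1, var, res))
            res = self.ops[key]
            for pid in (p1, res):
                self._update_time(pid, t)
            return res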
The first marginal to compute is for $F_1 = \{X_1\}$. In this case, the set of relevant potentials is $\{\phi_1\}$ and there are no operations to carry out. Therefore, the first marginal is $\phi_1$ itself, which is appended to R.
The second marginal to be computed is for $F_2 = \{X_1, X_2\}$. In this case, the relevant potentials are $\{\phi_1, \phi_2\}$ and there are no variables to remove, but it is necessary to carry out the symbolic combination of $\phi_1$ and $\phi_2$ in order to compute the marginal (lines 19–23 of Algorithm 2). If we call $\phi_4$ the result, then the repositories after this operation will be as shown in Table 2.
The third marginal to compute is for set $F_3 = \{X_1, X_3\}$. Now, the relevant potentials are $\{\phi_1, \phi_3\}$. The situation is analogous to the computation of the previous marginal, with the difference that now the symbolic combination to carry out is $\phi_1 \cdot \phi_3$. The repositories status after this operation is shown in Table 3. We have that the third desired marginal is the resulting potential, $\phi_5$.
Finally, for the last subset, we have to compute the marginal for $F_3^B = \{X_2, X_3\}$. The relevant potentials are now $\{\phi_1, \phi_2, \phi_3\}$. Variable $X_1$ has to be deleted from this set of potentials. As all the potentials contain this variable, as a first step it is necessary to combine all of them, and afterwards to remove $X_1$ by marginalizing on $\{X_2, X_3\}$. Assume that the order of the symbolic operations is the following: first, $\phi_1$ and $\phi_2$ are combined, and then its result is combined with $\phi_3$ (producing $\phi_6$); then, this result is marginalized by removing $X_1$ (producing $\phi_7$). The repositories after these operations are as presented in Table 4. The combination of $\phi_1$ and $\phi_2$ was previously carried out for $F_2$, and therefore its result, $\phi_4$, can be retrieved without new computations.
After that, the numerical parts of the operations in Table 4 are carried out in the same order in which they are described in that table. In this process, after doing an operation with an identifier ($id$) equal to t, the potentials with $time$ equal to t are removed from the $\Phi_R$ table. For example, in this case, potentials $\phi_2$ and $\phi_3$ can be removed from $\Phi_R$ after doing the operation producing $\phi_6$, and potential $\phi_6$ can be removed after the operation producing $\phi_7$, leaving only in $\Phi_R$ the potentials containing the desired marginals needed to compute the KL divergence between both networks (potentials with $time = \infty$).
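Continuing the planner sketch above, the numerical phase and the whole example trace could look as follows (again an illustration with our own names; Potential is the toy class of Section 2, and potentials are freed as soon as their last planned use has been executed):

    def execute_plan(planner, numeric):
        # Numerical phase (lines 27-32 of Algorithm 2): run the planned operations
        # in order; `numeric` maps pid -> Potential for the initial potentials.
        for t, kind, op1, op2, res in planner.plan:
            if kind == 'combine':
                numeric[res] = numeric[op1].multiply(numeric[op2])
            else:                                  # marginalize: op2 is the variable
                numeric[res] = numeric[op1].remove(op2)
            for pid in list(numeric):              # DeleteIf: free finished potentials
                if planner.times[pid] <= t and planner.times[pid] != planner.INF:
                    del numeric[pid]
        return numeric

    # Symbolic planning for the running example (potentials of N^A).
    planner = SymbolicPlanner([{'X1'}, {'X1', 'X2'}, {'X1', 'X3'}])
    t = 0
    phi4 = planner.cond_combine(0, 1, t); t += 1      # marginal for F2: phi1 * phi2
    phi5 = planner.cond_combine(0, 2, t); t += 1      # marginal for F3^A: phi1 * phi3
    tmp = planner.cond_combine(0, 1, t); t += 1       # cache hit: phi4 is reused
    phi6 = planner.cond_combine(tmp, 2, t); t += 1    # (phi1 * phi2) * phi3
    phi7 = planner.cond_marginalize(phi6, 'X1', t)    # remove X1: marginal for F3^B
    for pid in (0, phi4, phi5, phi7):                 # query results are never freed
        planner.times[pid] = SymbolicPlanner.INF

    # With pot = {0: phi1, 1: phi2, 2: phi3} holding the Potential objects of N^A,
    # execute_plan(planner, pot) ends up keeping only phi1, phi4, phi5 and phi7.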