
Generalised max entropy classifiers

2018, International Conference on Belief Functions (BELIEF 2018)

In this paper we propose a generalised maximum-entropy classification framework, in which the empirical expectation of the feature functions is bounded by the lower and upper expectations induced by a belief measure. This generalised setting permits a more cautious appreciation of the information content of a training set. We analytically derive the Karush-Kuhn-Tucker conditions for the generalised max-entropy classifier in the case in which a Shannon-like entropy is adopted.

Fabio Cuzzolin, Oxford Brookes University, UK. fabio.cuzzolin@brookes.ac.uk

Keywords: Classification · Max entropy · Constrained optimisation.

1 Introduction

The emergence of new challenging real-world applications has exposed serious issues with current approaches to model adaptation in machine learning. Existing theory and algorithms focus on fitting the available training data, but cannot provide worst-case guarantees in mission-critical applications. Vapnik's statistical learning theory is useless for model selection, as the bounds on generalisation errors it predicts are too wide to be useful, and rely on the assumption that training and testing data come from the same (unknown) distribution. The crucial question is: what exactly can one infer from a training set?

Max entropy classifiers [19] provide a significant example, due to their simplicity and widespread application. There, the entropy of the sought joint (or conditional) probability distribution of data and class is maximised, following the maximum entropy principle that the least informative distribution which matches the available evidence should be chosen. Having picked a set of feature functions, selected to efficiently encode the training information, the joint distribution is subject to the constraint that the empirical expectation of the feature functions equals the expectation associated with the max entropy distribution. The assumptions that (i) training and test data come from the same probability distribution, and that (ii) the empirical expectation of the training data is correct, so that the model expectation should match it, are rather strong, and work against generalisation power.

A way around this issue is to adopt as models convex sets of probability distributions, rather than standard probability measures. Random sets, in particular, are mathematically equivalent to a special class of credal sets induced by probability mass assignments on the power set of the sample space. When random sets are defined on finite domains, they are often called belief functions [20]. One can then envisage a robust theory of learning based on generalising traditional statistical learning theory in order to allow for test data to be sampled from a different probability distribution than the training data, under the weaker assumption that both belong to the same random set.

In this paper we make a step in that direction by generalising the max entropy classification framework. We take the view that a training set does not provide, in general, sufficient information to precisely estimate the joint probability distribution of class and data. We assume instead that a belief measure can be estimated, providing lower and upper bounds on the joint probability of data and class. As in the classical case, an appropriate measure of entropy for belief measures is maximised.
In contrast to the classical case, however, the empirical expectation of the chosen feature functions is only assumed to be compatible with the lower and upper bounds associated with the sought belief measure. This leads to a constrained optimisation problem with inequality constraints, rather than equality ones, which needs to be solved by looking at the Karush-Kuhn-Tucker (KKT) conditions. Due to the concavity of the objective function and the convexity of the constraints, the KKT conditions are both necessary and sufficient.

Related work. A significant amount of work has been conducted in the past on machine learning approaches based on belief theory. Most efforts were directed at developing clustering tools, including evidential clustering [4] and evidential and belief C-means [15]. Ensemble classification [23], in particular, has been extensively studied. Concerning classification, Denoeux [5] proposed in a seminal work a k-nearest neighbour classifier based on belief theory. Relevantly to this paper, interesting work has been conducted to generalise the framework of decision trees to situations in which uncertainty is encoded by belief functions, mainly by Elouedi and co-authors [7], and by Vannoorenberghe and Denoeux [22].

Paper outline. After reviewing in Section 2 max-entropy classification, we recall in Section 3 the necessary notions of belief theory. In Section 4 the possible generalisations of Shannon's entropy to the case of belief measures are reviewed. In Section 5 the generalised max-entropy problem is formulated, together with the associated Karush-Kuhn-Tucker conditions. It is shown that for several generalised measures of entropy the KKT conditions are necessary and sufficient for the optimality of the generalised max-entropy solution (Section 5.1). In Section 5.2 we derive the analytical expression of the system of KKT conditions for the case of a Shannon-like entropy for belief measures. Section 6 concludes the paper.

2 Max-entropy classifiers

The objective of maximum entropy classifiers is to maximise the Shannon entropy of the conditional classification distribution $p(C_k|x)$, where $x \in X$ is the observable and $C_k \in \mathcal{C} = \{C_1, ..., C_K\}$ is the associated class. Given a training set in which each observation is attached a class, namely $\mathcal{D} = \{(x_i, y_i),\ i = 1, ..., N \,|\, x_i \in X, y_i \in \mathcal{C}\}$, a set of $M$ feature maps is designed, $\phi(x, C_k) = [\phi_1(x, C_k), \cdots, \phi_M(x, C_k)]'$, whose values depend on both the object observed and its class. Each feature map $\phi_m : X \times \mathcal{C} \rightarrow \mathbb{R}$ is then a random variable whose expectation is: $E[\phi_m] = \sum_{x,k} p(x, C_k)\,\phi_m(x, C_k)$. In contrast, the empirical expectation of $\phi_m$ is: $\hat{E}[\phi_m] = \sum_{x,k} \hat{p}(x, C_k)\,\phi_m(x, C_k)$, where $\hat{p}$ is a histogram constructed by counting occurrences of the pair $(x, C_k)$ in the training set: $\hat{p}(x, C_k) = \frac{1}{N} \sum_{(x_i,y_i) \in \mathcal{D}} \delta(x_i = x \wedge y_i = C_k)$. The theoretical expectation $E[\phi_m]$ can be approximated by decomposing $p(x, C_k) = p(x)p(C_k|x)$ via Bayes' rule, and approximating the (unknown) prior of the observations $p(x)$ with the empirical prior $\hat{p}$, i.e., the histogram of observed values in the training set: $\tilde{E}[\phi_m] = \sum_{x,k} \hat{p}(x)\,p(C_k|x)\,\phi_m(x, C_k)$.

Definition 1. Given a training set $\mathcal{D} = \{(x_i, y_i),\ i = 1, ..., N \,|\, x_i \in X, y_i \in \mathcal{C}\}$ related to the problem of classifying $x \in X$ as belonging to one of the classes $\mathcal{C} = \{C_1, ..., C_K\}$, the max entropy classifier is the conditional probability $p^*(C_k|x)$ such that: $p^*(C_k|x) \doteq \arg\max_{p(C_k|x)} H_s(P)$, where $H_s$ is the traditional Shannon entropy, subject to: $\tilde{E}_p[\phi_m] = \hat{E}[\phi_m]$ $\forall m = 1, ..., M$.
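As a concrete illustration of the quantities in Definition 1, the following Python sketch computes the empirical expectation $\hat{E}[\phi_m]$ and the approximate model expectation $\tilde{E}[\phi_m]$ on a toy training set; the training pairs, the indicator feature maps and the candidate conditional distribution are hypothetical choices made purely for the example.

```python
# Toy setting, with hypothetical data: two observable values, two classes, indicator features
X_vals = [0, 1]                          # observable domain X
classes = [0, 1]                         # class labels C = {C_1, ..., C_K}, here K = 2
D = [(0, 0), (0, 0), (1, 1), (1, 0)]     # training set of (x_i, y_i) pairs
N = len(D)

# Indicator feature maps: phi_m(x, C_k) = 1 iff (x, C_k) equals the m-th reference pair
feature_pairs = [(0, 0), (1, 1)]
def phi(m, x, k):
    return 1.0 if (x, k) == feature_pairs[m] else 0.0

# Empirical joint histogram p_hat(x, C_k), counting occurrences of each pair in D
p_hat = {(x, k): sum(1 for (xi, yi) in D if (xi, yi) == (x, k)) / N
         for x in X_vals for k in classes}

# Empirical expectation E_hat[phi_m] = sum_{x,k} p_hat(x, C_k) phi_m(x, C_k)
E_hat = [sum(p_hat[(x, k)] * phi(m, x, k) for x in X_vals for k in classes)
         for m in range(len(feature_pairs))]

# Model expectation E_tilde[phi_m] = sum_{x,k} p_hat(x) p(C_k|x) phi_m(x, C_k),
# here evaluated for a (deliberately arbitrary) uniform candidate conditional p(C_k|x)
p_hat_x = {x: sum(p_hat[(x, k)] for k in classes) for x in X_vals}
p_cond = {(x, k): 1.0 / len(classes) for x in X_vals for k in classes}
E_tilde = [sum(p_hat_x[x] * p_cond[(x, k)] * phi(m, x, k)
               for x in X_vals for k in classes)
           for m in range(len(feature_pairs))]

print("empirical expectations E_hat:", E_hat)
print("model expectations E_tilde (uniform p(C_k|x)):", E_tilde)
```

The classical equality constraint of Definition 1 asks for a conditional distribution $p(C_k|x)$ under which the two printed vectors coincide.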
The constraint requires the classifier to be consistent with the empirical frequencies of the features in the training set, while seeking the least informative probability distribution that does so. The solution of the maximum entropy classification problem (Definition 1) is the so-called log-linear model: $p^*(C_k|x) = \frac{1}{Z_\lambda(x)} e^{\sum_m \lambda_m \phi_m(x, C_k)}$, where $\lambda = [\lambda_1, ..., \lambda_M]'$ are the Lagrange multipliers associated with the linear constraints $\tilde{E}_p[\phi_m] = \hat{E}[\phi_m]$, and $Z_\lambda(x)$ is a normalisation factor. The related classification function is: $y(x) = \arg\max_k \sum_m \lambda_m \phi_m(x, C_k)$, i.e., $x$ is assigned the class which maximises the linear combination of the feature functions with coefficients $\lambda$.

3 Belief functions

Definition 2. A basic probability assignment (BPA) [1] over a discrete set $\Theta$ is a function $m : 2^\Theta \rightarrow [0, 1]$ defined on $2^\Theta = \{A \subseteq \Theta\}$ such that: $m(\emptyset) = 0$, $\sum_{A \subseteq \Theta} m(A) = 1$. The belief function (BF) associated with a BPA $m : 2^\Theta \rightarrow [0, 1]$ is the set function $Bel : 2^\Theta \rightarrow [0, 1]$ defined as: $Bel(A) = \sum_{B \subseteq A} m(B)$.

The elements of the power set $2^\Theta$ associated with non-zero values of $m$ are called the focal elements of $m$. For each subset ('event') $A \subset \Theta$ the quantity $Bel(A)$ is called the degree of belief that the outcome lies in $A$, and represents the total belief committed to a set of outcomes $A$ by the available evidence $m$. Dually, the upper probability of $A$: $Pl(A) \doteq 1 - Bel(\bar{A})$, $\bar{A} = \Theta \setminus A$, expresses the 'plausibility' of a proposition $A$ or, in other words, the amount of evidence not against $A$ [3]. The plausibility function $Pl : 2^\Theta \rightarrow [0, 1]$ thus conveys the same information as $Bel$, and can be expressed as: $Pl(A) = \sum_{B \cap A \neq \emptyset} m(B) \geq Bel(A)$.

Belief functions are mathematically equivalent to a special class of credal sets (convex sets of probability measures), as each BF $Bel$ is associated with the set $\mathcal{P}[Bel] = \{P : P(A) \geq Bel(A)\}$ of probabilities $P$ dominating it. Its centre of mass is the pignistic function $BetP[Bel](x) = \sum_{A \ni x} m(A)/|A|$, $x \in \Theta$. Given a function $f : \Theta \rightarrow \mathbb{R}$, the lower expectation and upper expectation of $f$ w.r.t. $Bel$ are, respectively:
$$E_{Bel*}[f] \doteq \inf_{P \in \mathcal{P}[Bel]} E_P[f] = \sum_{A \subseteq \Theta} m(A) \inf_{x \in A} f(x), \qquad E^*_{Bel}[f] \doteq \sup_{P \in \mathcal{P}[Bel]} E_P[f] = \sum_{A \subseteq \Theta} m(A) \sup_{x \in A} f(x).$$

4 Measures of generalised entropy

The issue of how to assess the level of uncertainty associated with a belief function [10] is not trivial, as authors such as Yager and Klir argued that there are several facets to uncertainty, such as conflict (or discord, dissonance) and non-specificity (also called vagueness, ambiguity or imprecision).

Some measures are directly inspired by Shannon's entropy of probability measures: $H_s[p] = -\sum_{x \in \Theta} p(x) \log p(x)$. While Nguyen's measure is a direct generalisation in which probability values are replaced by mass values [17]: $H_n[m] = -\sum_{A \in \mathcal{F}} m(A) \log m(A)$, where $\mathcal{F}$ is the list of focal elements of $m$, in Yager's entropy [24] probabilities are (partly) replaced by plausibilities: $H_y[m] = -\sum_{A \in \mathcal{F}} m(A) \log Pl(A)$. Hohle's measure of confusion [9] is the dual measure: $H_o[m] = -\sum_{A \in \mathcal{F}} m(A) \log Bel(A)$. All such measures only capture the 'conflict' portion of uncertainty.

Other measures are designed to capture the specificity of belief measures, i.e., the degree of concentration of the mass assigned to focal elements. A first such measure was due to Klir, Dubois and Prade [6]: $H_d[m] = \sum_{A \in \mathcal{F}} m(A) \log |A|$, and can be considered as a generalisation of Hartley's entropy ($H = \log(|\Theta|)$) to belief functions.
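To make the notions above concrete, here is a small Python sketch which, for a hypothetical mass assignment on a three-element frame, computes $Bel$, $Pl$, the lower and upper expectations of a toy function, and two of the entropies just recalled ($H_n$ and $H_d$); the frame, the masses and the function values are assumed for illustration only.

```python
import numpy as np

# Hypothetical frame of discernment and mass assignment, assumed purely for illustration
Theta = ('a', 'b', 'c')
m = {frozenset({'a'}): 0.5, frozenset({'a', 'b'}): 0.3, frozenset(Theta): 0.2}

def bel(A):
    """Bel(A) = sum of m(B) over the subsets B of A."""
    return sum(v for B, v in m.items() if B <= A)

def pl(A):
    """Pl(A) = sum of m(B) over the focal elements B intersecting A."""
    return sum(v for B, v in m.items() if B & A)

def lower_expectation(f):
    """E_*[f] = sum_A m(A) inf_{x in A} f(x)."""
    return sum(v * min(f[x] for x in A) for A, v in m.items())

def upper_expectation(f):
    """E^*[f] = sum_A m(A) sup_{x in A} f(x)."""
    return sum(v * max(f[x] for x in A) for A, v in m.items())

# Two of the entropies recalled in Section 4, computed over the focal elements of m
H_n = -sum(v * np.log(v) for v in m.values())           # Nguyen's entropy
H_d = sum(v * np.log(len(A)) for A, v in m.items())     # Dubois and Prade's specificity measure

f = {'a': 1.0, 'b': 2.0, 'c': 5.0}   # a toy function on Theta
A = frozenset({'a', 'b'})
print("Bel:", bel(A), "Pl:", pl(A))
print("lower/upper expectation of f:", lower_expectation(f), upper_expectation(f))
print("H_n:", H_n, "H_d:", H_d)
```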
A more sophisticated proposal by Pal [18]: $H_a[m] = \sum_{A \in \mathcal{F}} m(A)/|A|$, assesses the dispersion of the evidence and is linked to the pignistic transform. A final proposal, based on the commonality function $Q(A) = \sum_{B \supseteq A} m(B)$, is due to Smets: $H_t = \sum_{A \in \mathcal{F}} \log\big(\frac{1}{Q(A)}\big)$.

Composite measures, such as Lamata and Moral's $H_l[m] = H_y[m] + H_d[m]$ [14], are designed to capture both entropy and specificity. Klir and Ramer [13] proposed a 'global uncertainty measure' defined as: $H_k[m] = D[m] + H_d[m]$, where $D(m) = -\sum_{A \in \mathcal{F}} m(A) \log\big[\sum_{B \in \mathcal{F}} m(B) \frac{|A \cap B|}{|B|}\big]$. Pal et al. [18] argued that none of these composite measures is really satisfactory, as they do not admit a unique maximum and there is no sound rationale for simply adding conflict and non-specificity measures together.

In the credal interpretation of belief functions, Harmanec and Klir's aggregated uncertainty (AU) [8] is defined as the maximal Shannon entropy of all the probabilities consistent with the given BF: $H_h[m] = \max_{P \in \mathcal{P}[Bel]} \{H_s[P]\}$. $H_h[m]$ is the minimal measure meeting a set of rationality requirements which include: symmetry, continuity, expansibility, subadditivity, additivity, monotonicity, and normalisation. Similarly, Maeda and Ichihashi [16] proposed a composite measure $H_i[m] = H_h[m] + H_d[m]$, whose first component consists of the maximum entropy of the set of probability distributions consistent with $m$, and whose second part is the generalised Hartley entropy. As both $H_h$ and $H_i$ have high computational complexity, Jousselme et al. [11] proposed an ambiguity measure (AM), defined as the classical entropy of the pignistic function: $H_j[m] = H_s[BetP[m]]$.

Jirousek and Shenoy [10] analysed all these proposals in 2016, assessing them against a number of significant properties, and concluded that only the Maeda-Ichihashi proposal meets all of them. The issue remains unsettled. In the following we will adopt a straightforward generalisation of Shannon's entropy, together with a few selected proposals chosen for their concavity properties.

5 Generalised max-entropy problem

Technically, in order to generalise the max-entropy optimisation problem (Definition 1) to the case of belief functions, we need to: (i) choose an appropriate measure of entropy for belief functions as the objective function; (ii) revisit the constraints that the (theoretical) expectations of the feature maps are equal to the empirical ones computed over the training set. As for (ii), it is sensible to require that the empirical expectation of the feature functions is bracketed by the lower and upper expectations associated with the sought belief function $Bel : 2^{X \times \mathcal{C}} \rightarrow [0, 1]$. In this paper we only make use of the 2-monotonicity of belief functions, and write:
$$\sum_{(x,C_k)} Bel(x, C_k)\,\phi_m(x, C_k) \;\leq\; \hat{E}[\phi_m] \;\leq\; \sum_{(x,C_k)} Pl(x, C_k)\,\phi_m(x, C_k) \qquad (1)$$
$\forall m = 1, ..., M$, as we only consider probability intervals on singleton elements $(x, C_k) \in X \times \mathcal{C}$. Fully fledged lower and upper expectations (cf. Section 3), which express the full monotonicity of BFs, will be considered in future work.

Going even further, should constraints of the form (1) be enforced on all possible subsets $A \subset X \times \mathcal{C}$, rather than just on singleton pairs $(x, C_k)$? This goes back to the question of what information a training set actually carries. More general constraints would require extending the domain of the feature functions to set values; we will investigate this idea in the near future as well.
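The bracketing requirement in (1) can be checked numerically. The sketch below, in which the joint frame, the candidate mass assignment, the feature map and the empirical expectation are all hypothetical toy values, simply verifies whether the singleton-level $Bel$ and $Pl$ sums bracket $\hat{E}[\phi_m]$ for one feature.

```python
# Assumed joint frame Theta = X x C, with two observables and two classes
pairs = [(x, k) for x in (0, 1) for k in (0, 1)]

# Hypothetical candidate mass assignment on 2^(X x C): two singletons plus two larger focal elements
m = {frozenset({(0, 0)}): 0.3, frozenset({(1, 1)}): 0.3,
     frozenset({(0, 1), (1, 0)}): 0.2, frozenset(pairs): 0.2}

def bel_singleton(p):
    """Bel on a singleton pair: only the mass of the singleton itself contributes."""
    return m.get(frozenset({p}), 0.0)

def pl_singleton(p):
    """Pl on a singleton pair: mass of every focal element containing the pair."""
    return sum(v for A, v in m.items() if p in A)

# Hypothetical feature function and empirical expectation (as would be computed from a training set)
phi = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
E_hat = 0.7

lower = sum(bel_singleton(p) * phi[p] for p in pairs)
upper = sum(pl_singleton(p) * phi[p] for p in pairs)
print(f"constraint (1): {lower:.2f} <= {E_hat} <= {upper:.2f} ->", lower <= E_hat <= upper)
```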
5.1 Formulation and Karush-Kuhn-Tucker (KKT) conditions

In the same classification setting of Section 2, the maximum belief entropy classifier is the joint belief measure $Bel^*(x, C_k) : 2^{X \times \mathcal{C}} \rightarrow [0, 1]$ which solves the following optimisation problem: $Bel^*(x, C_k) \doteq \arg\max_{Bel(x,C_k)} H(Bel)$, subject to the inequality constraints (1), where $H$ is an appropriate measure of entropy for belief measures. As the above optimisation problem involves the inequality constraints (1), as opposed to the equality constraints of traditional max entropy classifiers, we need to analyse the Karush-Kuhn-Tucker (KKT) [12] necessary conditions for a belief function $Bel$ to be an optimal solution to the problem.

Definition 3. Suppose that the objective function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ and the constraint functions $g_i : \mathbb{R}^n \rightarrow \mathbb{R}$ and $h_j : \mathbb{R}^n \rightarrow \mathbb{R}$ of a nonlinear optimisation problem $\arg\max_x f(x)$, subject to $g_i(x) \leq 0$, $i = 1, ..., m$, and $h_j(x) = 0$, $j = 1, ..., l$, are continuously differentiable at a point $x^*$. If $x^*$ is a local optimum, then under appropriate regularity conditions there exist constants $\mu_i$ ($i = 1, ..., m$) and $\lambda_j$ ($j = 1, ..., l$), called KKT multipliers, such that the following conditions hold:
1. Stationarity: $\nabla f(x^*) = \sum_{i=1}^m \mu_i \nabla g_i(x^*) + \sum_{j=1}^l \lambda_j \nabla h_j(x^*)$;
2. Primal feasibility: $g_i(x^*) \leq 0$ $\forall i = 1, ..., m$, and $h_j(x^*) = 0$ $\forall j = 1, ..., l$;
3. Dual feasibility: $\mu_i \geq 0$ for all $i = 1, ..., m$;
4. Complementary slackness: $\mu_i g_i(x^*) = 0$ for all $i = 1, ..., m$.

Crucially, the KKT conditions are also sufficient whenever the objective function $f$ is concave, the inequality constraints $g_i$ are continuously differentiable convex functions, and the equality constraints $h_j$ are affine (more general sufficient conditions can be given in terms of invexity [2] requirements).

Theorem 1. If either $H_t$, $H_n$, $H_d$, $H_s[Bel]$ or $H_s[Pl]$ is adopted as measure of entropy, the generalised max entropy optimisation problem has a concave objective function and convex constraints. Therefore, the KKT conditions are sufficient for the optimality of its solution(s).

Concavity of the entropy objective function. It is well known that Shannon's entropy is a concave function of probability distributions, represented as vectors of probability values (see http://projecteuclid.org/euclid.lnms/1215465631). Furthermore: any linear combination of concave functions is concave; a monotonic and concave function of a concave function is still concave; the logarithm is a concave function. As shown by Smets [21], the transformations which map mass vectors to vectors of belief (and commonality) values are linear, as they can be expressed in the form of matrices. In particular, $bel = BfrM\, m$, where $BfrM$ is a matrix whose $(A, B)$ entry is $BfrM(A, B) = 1$ if $B \subseteq A$, 0 otherwise, and $bel$, $m$ are the vectors collecting the belief (mass) values of all events $A \subseteq \Theta$. The same can be said of the mapping $q = QfrM\, m$ between a mass vector and the associated commonality vector. As a consequence, belief, plausibility and commonality are all linear (and therefore concave) functions of a mass vector.

Using this matrix representation, it is easy to conclude that several of the entropies defined in Section 4 are indeed concave. In particular, Smets' specificity measure $H_t = \sum_A \log\big(\frac{1}{Q(A)}\big)$ is concave, as a linear combination of concave functions. Nguyen's entropy $H_n = -\sum_A m(A) \log(m(A)) = H_s[m]$ is also concave, as the Shannon entropy of a mass assignment. Dubois and Prade's measure $H_d = \sum_A m(A) \log(|A|)$ is also concave with respect to $m$, as a linear combination of mass values.
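The linearity argument can be made tangible by explicitly constructing Smets' matrices for a small frame. The following Python sketch (the frame, the event ordering and the mass values are illustrative assumptions) builds $BfrM$ and $QfrM$ and recovers the belief and commonality vectors as linear images of the mass vector.

```python
import numpy as np
from itertools import combinations

# Small frame; the enumeration order of the events is an arbitrary implementation choice
Theta = ('a', 'b', 'c')
events = [frozenset(c) for r in range(len(Theta) + 1)
          for c in combinations(Theta, r)]

# BfrM(A, B) = 1 if B is a subset of A, 0 otherwise: the linear map taking m to bel
BfrM = np.array([[1.0 if B <= A else 0.0 for B in events] for A in events])

# QfrM(A, B) = 1 if B is a superset of A, 0 otherwise: the linear map taking m to the commonality q
QfrM = np.array([[1.0 if B >= A else 0.0 for B in events] for A in events])

# A mass vector over the ordered list of events (hypothetical values)
m_vec = np.zeros(len(events))
m_vec[events.index(frozenset({'a'}))] = 0.5
m_vec[events.index(frozenset({'a', 'b'}))] = 0.3
m_vec[events.index(frozenset(Theta))] = 0.2

bel_vec = BfrM @ m_vec    # bel = BfrM m: linear, hence concave, in the mass vector
q_vec = QfrM @ m_vec      # q = QfrM m

for A, b, q in zip(events, bel_vec, q_vec):
    print(tuple(sorted(A)), "Bel:", round(float(b), 2), "Q:", round(float(q), 2))
```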
Direct applications of Shannon's entropy function to $Bel$ and $Pl$: $H_{Bel}[m] = H_s[Bel] = \sum_{A \subseteq \Theta} Bel(A) \log\big(\frac{1}{Bel(A)}\big)$ and $H_{Pl}[m] = H_s[Pl] = \sum_{A \subseteq \Theta} Pl(A) \log\big(\frac{1}{Pl(A)}\big)$ are also trivially concave, due to the concavity of the entropy function and to the linearity of the mapping from $m$ to $Bel$, $Pl$. Drawing conclusions on the other measures is less immediate, as they involve products of concave functions (which are not, in general, guaranteed to be concave).

Convexity of the interval expectation constraints. As for the constraints (1) of the generalised max entropy problem, we first note that (1) can be decomposed into the following pair of constraints: $g^1_m(m) \doteq \sum_{x,k} Bel(x, C_k)\,\phi_m(x, C_k) - \hat{E}[\phi_m] \leq 0$ and $g^2_m(m) \doteq \sum_{x,k} \phi_m(x, C_k)\,[\hat{p}(x, C_k) - Pl(x, C_k)] \leq 0$, for all $m = 1, ..., M$. The first inequality constraint is a linear combination of linear functions of the sought mass assignment $m^* : 2^{X \times \mathcal{C}} \rightarrow [0, 1]$ (since $Bel^*$ results from applying a matrix transformation to $m^*$). As $pl = 1 - J\, bel = 1 - J\, BfrM\, m$, constraint $g^2_m$ is also a linear combination of mass values. Hence, as linear functions, the constraints $g^1_m$ and $g^2_m$ are both concave and convex.

5.2 Belief max-entropy classifier for Shannon's entropy

For the Shannon-like entropy $H_{Bel}$, Condition 1 (stationarity), applied to the sought optimal BF $Bel^* : 2^{X \times \mathcal{C}} \rightarrow [0, 1]$, reads as: $\nabla H_{Bel}(Bel^*) = \sum_{m=1}^M \big[\mu^1_m \nabla g^1_m(Bel^*) + \mu^2_m \nabla g^2_m(Bel^*)\big]$. The components of $\nabla H_{Bel}$ are the partial derivatives of the entropy with respect to the mass values $m(B)$, for all $B \subseteq \Theta$. They read as:
$$\frac{\partial H_{Bel}}{\partial m(B)} = \frac{\partial}{\partial m(B)} \Big[ - \sum_{A \subseteq \Theta} \Big( \sum_{C \subseteq A} m(C) \Big) \log \Big( \sum_{C \subseteq A} m(C) \Big) \Big] = - \sum_{A \supseteq B} \big[1 + \log Bel(A)\big].$$
As for $\nabla g^1_m(Bel^*)$, we have: $\frac{\partial g^1_m}{\partial m(B)} = \frac{\partial}{\partial m(B)} \big[ \sum_{(x,C_k) \in \Theta} Bel(x, C_k)\,\phi_m(x, C_k) - \hat{E}[\phi_m] \big] = \frac{\partial}{\partial m(B)} \big[ \sum_{(x,C_k) \in \Theta} m(x, C_k)\,\phi_m(x, C_k) - \hat{E}[\phi_m] \big]$, which is equal to $\phi_m(x, C_k)$ for $B = \{(x, C_k)\}$, and to 0 otherwise (if we could define feature functions over non-singleton subsets $A \subseteq \Theta$, this would simply generalise to $\phi(B)$ for all $B \subseteq \Theta$). As for the second set of constraints: $\frac{\partial g^2_m}{\partial m(B)} = \frac{\partial}{\partial m(B)} \sum_{(x,C_k) \in \Theta} \phi_m(x, C_k)\,[\hat{p}(x, C_k) - Pl(x, C_k)]$ which, recalling that $Pl(x, C_k) = \sum_{B \cap \{(x,C_k)\} \neq \emptyset} m(B)$, becomes equal to $- \sum_{(x,C_k) \in B} \phi_m(x, C_k)$.

Assembling all our results, the KKT stationarity conditions for the generalised, belief-theoretical maximum entropy problem amount to, for all $B \subseteq X \times \mathcal{C}$:
$$\begin{cases} - \sum_{A \supseteq B} \big[1 + \log Bel(A)\big] = \sum_{m=1}^M \phi_m(x, C_k)\,[\mu^1_m - \mu^2_m], & B = \{(x, C_k)\}, \\[4pt] - \sum_{A \supseteq B} \big[1 + \log Bel(A)\big] = \sum_{m=1}^M \mu^2_m \sum_{(x,C_k) \in B} \phi_m(x, C_k), & |B| > 1. \end{cases} \qquad (2)$$
The other conditions are, $\forall m = 1, ..., M$: primal feasibility (1); dual feasibility, $\mu^1_m, \mu^2_m \geq 0$; and complementary slackness, $\mu^1_m \big( \sum_{(x,C_k) \in \Theta} Bel(x, C_k)\,\phi_m(x, C_k) - \hat{E}[\phi_m] \big) = 0$, $\mu^2_m \sum_{(x,C_k) \in \Theta} \phi_m(x, C_k)\,[\hat{p}(x, C_k) - Pl(x, C_k)] = 0$.
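Rather than solving the KKT system (2) analytically, the generalised problem can also be explored numerically. The sketch below is a tentative illustration on an assumed toy joint frame, with a single hypothetical feature function and empirical expectation: it maximises $H_{Bel} = H_s[Bel]$ over mass assignments subject to the simplex constraints and to (1), using an off-the-shelf SLSQP solver. It is not the procedure derived above, only a way to experiment with the formulation.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import minimize

# Assumed toy joint frame Theta = X x C (two observables, two classes) and one feature
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
events = [frozenset(c) for r in range(1, len(pairs) + 1)
          for c in combinations(pairs, r)]                   # all non-empty subsets of Theta
phi = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}   # hypothetical feature map
E_hat = 0.7                                                  # hypothetical empirical expectation

def bel_values(m):
    # Bel(A) = sum of the masses of the subsets of A, for every non-empty event A
    return np.array([sum(m[i] for i, B in enumerate(events) if B <= A) for A in events])

def neg_H_bel(m):
    # Shannon-like entropy H_Bel = -sum_A Bel(A) log Bel(A), negated for minimisation
    b = bel_values(m)
    b = b[b > 1e-12]
    return float(np.sum(b * np.log(b)))

def lower_gap(m):
    # constraint (1), left inequality:  E_hat - sum over singletons of Bel * phi >= 0
    return E_hat - sum(m[events.index(frozenset({p}))] * phi[p] for p in pairs)

def upper_gap(m):
    # constraint (1), right inequality: sum over singletons of Pl * phi - E_hat >= 0
    pl = {p: sum(m[i] for i, B in enumerate(events) if p in B) for p in pairs}
    return sum(pl[p] * phi[p] for p in pairs) - E_hat

constraints = [{'type': 'eq', 'fun': lambda m: np.sum(m) - 1.0},   # masses sum to one
               {'type': 'ineq', 'fun': lower_gap},
               {'type': 'ineq', 'fun': upper_gap}]
m0 = np.full(len(events), 1.0 / len(events))                       # uniform initial mass
res = minimize(neg_H_bel, m0, method='SLSQP',
               bounds=[(0.0, 1.0)] * len(events), constraints=constraints)
print("maximised H_Bel:", -res.fun)
```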
6 Conclusions

In this paper we proposed a generalisation of the max entropy classifier in which the assumptions that test and training data are sampled from the same probability distribution, and that the empirical expectation of the feature functions is 'correct', are relaxed within the formalism of belief theory. We also studied the conditions under which the associated KKT conditions are necessary and sufficient for the optimality of the solution.

Much work remains to be done: (i) providing analytical model expressions, similar to log-linear models, for the Shannon-like and other major entropy measures for belief functions; (ii) analysing the case in which the full lower and upper expectations are plugged in; (iii) comparing the resulting classifiers; (iv) analysing a formulation based on the least commitment principle, rather than max entropy, for the objective function to optimise; finally, (v) relaxing the constraint that feature functions be defined on singleton pairs $(x, C_k)$, in a further generalisation of this important framework.

References

1. Augustin, T.: Modeling weak information with generalized basic probability assignments. In: Data Analysis and Information Systems, pp. 101–113, Springer, 1996.
2. Ben-Israel, A. et al.: What is invexity? J. Austral. Math. Soc. Ser. B 28:1–9, 1986.
3. Cuzzolin, F.: Three alternative combinatorial formulations of the theory of evidence. Intelligent Data Analysis 14(4):439–464, 2010.
4. Denoeux, T., Masson, M.-H.: EVCLUS: Evidential clustering of proximity data. IEEE Trans Syst Man Cybern B 34(1):95–109, 2004.
5. Denoeux, T.: A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Trans Syst Man Cybern 25(5):804–813, 1995.
6. Dubois, D., Prade, H.: Properties of measures of information in evidence and possibility theories. Fuzzy Sets Syst 100:35–49, 1999.
7. Elouedi, Z., Mellouli, K., Smets, P.: Belief decision trees: theoretical foundations. Int J Approx Reason 28(2-3):91–124, 2001.
8. Harmanec, D., Klir, G.J.: Measuring total uncertainty in Dempster-Shafer theory: A novel approach. Int J Gen Syst 22(4):405–419, 1994.
9. Hohle, U.: Entropy with respect to plausibility measures. In: Proceedings of the 12th IEEE Symposium on Multiple-Valued Logic, pp. 167–169, 1982.
10. Jirousek, R., Shenoy, P.P.: Entropy of belief functions in the Dempster-Shafer theory: A new perspective. In: Proceedings of BELIEF, pp. 3–13, 2016.
11. Jousselme, A.-L. et al.: Measuring ambiguity in the evidence theory. IEEE Trans Syst Man Cybern A 36(5):890–903, 2006.
12. Karush, W.: Minima of functions of several variables with inequalities as side constraints. MSc Dissertation, Dept. of Mathematics, Univ. of Chicago, 1939.
13. Klir, G.J.: Measures of uncertainty in the Dempster-Shafer theory of evidence. In: Advances in the Dempster-Shafer Theory of Evidence, pp. 35–49, 1994.
14. Lamata, M.T., Moral, S.: Measures of entropy in the theory of evidence. Int J Gen Syst 14(4):297–305, 1988.
15. Liu, Z. et al.: Belief C-means: An extension of fuzzy c-means algorithm in belief functions framework. Pattern Recognit Lett 33(3):291–300, 2012.
16. Maeda, Y., Ichihashi, H.: An uncertainty measure with monotonicity under the random set inclusion. Int J Gen Syst 21(4):379–392, 1993.
17. Nguyen, H.: On entropy of random sets and possibility distributions. In: The Analysis of Fuzzy Information, pp. 145–156, 1985.
18. Pal, N.R., Bezdek, J.C., Hemasinha, R.: Uncertainty measures for evidential reasoning II: A new measure of total uncertainty. Int J Approx Reason 8:1–16, 1993.
19. Della Pietra, S. et al.: Inducing features of random fields. IEEE Trans Pattern Anal Mach Intell 19(4):380–393, 1997.
20. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, 1976.
21. Smets, P.: The application of the matrix calculus to belief functions. Int J Approx Reason 31(1-2):1–30, 2002.
22. Vannoorenberghe, P. et al.: Handling uncertain labels in multiclass problems using belief decision trees. In: Proceedings of IPMU, 2002.
23. Xu, L. et al.: Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Trans Syst Man Cybern 22(3):418–435, 1992.
24. Yager, R.R.: Entropy and specificity in a mathematical theory of evidence. Int J Gen Syst 9:249–260, 1983.