An Introduction To Functional Derivatives
Béla A. Frigyik, Santosh Srivastava, and Maya R. Gupta
Department of Electrical Engineering, University of Washington, Seattle, WA 98195-2500
January 2008
Abstract

This tutorial on functional derivatives focuses on Fréchet derivatives, a subtopic of functional analysis and of the calculus of variations. The reader is assumed to have experience with real analysis. Definitions and properties are discussed, and examples with the functional Bregman divergence illustrate how to work with the Fréchet derivative.
1 Introduction

Consider a function $f$ defined over vectors, $f : \mathbb{R}^d \to \mathbb{R}$. The gradient $\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_d} \right)$ describes the instantaneous vector direction in which the function changes the most. The gradient $\nabla f(x_0)$ at $x_0 \in \mathbb{R}^d$ tells you, if you are starting at $x_0$, which direction would lead to the greatest instantaneous change in $f$. The inner product (dot product) $\nabla f(x_0)^T y$ for $y \in \mathbb{R}^d$ gives the directional derivative (how much $f$ instantaneously changes) of $f$ at $x_0$ in the direction defined by the vector $y$, as the sketch below illustrates. One generalization of the gradient is the Jacobian, the matrix of derivatives for a function that maps vectors to vectors ($f : \mathbb{R}^d \to \mathbb{R}^m$). In this tutorial we consider the generalization of the gradient to functions that map functions to scalars; such functions are called functionals.
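As a quick concrete check of this finite-dimensional picture, the following minimal sketch (our illustrative addition, not part of the original report; it assumes NumPy) compares the gradient-based directional derivative of $f(x) = \|x\|^2$ with a finite-difference estimate:

import numpy as np

# f(x) = ||x||^2 has gradient grad f(x) = 2x, so the directional
# derivative at x0 in the direction y is 2 * (x0 . y).
f = lambda x: np.dot(x, x)
grad_f = lambda x: 2.0 * x

x0 = np.array([1.0, -2.0, 0.5])
y = np.array([0.3, 0.1, -0.7])

exact = np.dot(grad_f(x0), y)               # gradient-based directional derivative
approx = (f(x0 + 1e-6 * y) - f(x0)) / 1e-6  # finite-difference estimate
print(exact, approx)                        # both are -0.5 (up to ~1e-6)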
For example, let a functional be defined over the convex set of functions

G = \left\{ g : \mathbb{R}^d \to \mathbb{R} \ \text{s.t.} \ g(x) \ge 0 \ \forall x, \ \int_x g(x)\, dx = 1 \right\}.   (1)
An example functional defined on this set is the entropy: $\phi : G \to \mathbb{R}$, where $\phi(g) = -\int_x g(x) \ln g(x)\, dx$ for $g \in G$. In this tutorial we will consider functional derivatives, which are analogs of vector gradients. We will focus on the Fréchet derivative, which can be used to answer questions like, "What function $g$ will maximize $\phi(g)$?" First we will introduce the Fréchet derivative, then discuss higher-order derivatives and some basic properties, and note optimality conditions useful for optimizing functionals. This material will require a familiarity with measure theory that can be found in any standard measure theory text or garnered from the informal measure theory tutorial by Gupta [1]. In Section 3 we illustrate the functional derivative with the definition and properties of the functional Bregman divergence [2]. Readers may find it useful to prove these properties for themselves as an exercise.
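To make the notion of a functional concrete, here is a minimal numerical sketch (our addition, assuming NumPy and a simple Riemann-sum discretization) that evaluates the entropy functional for a sampled density:

import numpy as np

# Represent a density g in G by its samples on a uniform grid and
# approximate the entropy integral by a Riemann sum.
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]

def entropy(g):
    """Approximate phi(g) = -int g(x) ln g(x) dx."""
    g = np.clip(g, 1e-300, None)  # guard log(0); the 0*ln(0) terms vanish
    return -np.sum(g * np.log(g)) * dx

g = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density
print(entropy(g))  # ~ 0.5 * ln(2*pi*e) = 1.4189...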
2 Fréchet Derivative
Let $(\mathbb{R}^d, \Omega, \nu)$ be a measure space, where $\nu$ is a Borel measure, $d$ is a positive integer, and define the set of functions $A = \{ a \in L^p(\nu) \ \text{s.t.} \ a : \mathbb{R}^d \to \mathbb{R} \}$, where $1 \le p \le \infty$. The functional $\psi : L^p(\nu) \to \mathbb{R}$ is linear and continuous if

1. $\psi[a_1 + \alpha a_2] = \psi[a_1] + \alpha \psi[a_2]$ for any $a_1, a_2 \in L^p(\nu)$ and any real number $\alpha$;

2. there is a constant $C$ such that $|\psi[a]| \le C \|a\|$ for all $a \in L^p(\nu)$.
Let $\phi$ be a real functional over the normed space $L^p(\nu)$, so that $\phi$ maps functions that are $L^p$-integrable with respect to $\nu$ to the real line: $\phi : L^p(\nu) \to \mathbb{R}$. The bounded linear functional $\delta\phi[f; \cdot]$ is the Fréchet derivative of $\phi$ at $f \in L^p(\nu)$ if

\phi[f + a] - \phi[f] = \triangle\phi[f; a] = \delta\phi[f; a] + \epsilon[f, a]\, \|a\|_{L^p(\nu)}   (2)

for all $a \in L^p(\nu)$, with $\epsilon[f, a] \to 0$ as $\|a\|_{L^p(\nu)} \to 0$. Intuitively, what we are doing is perturbing the input function $f$ by another function $a$, then shrinking the perturbing function $a$ to zero in terms of its $L^p$ norm, and considering the difference $\phi[f + a] - \phi[f]$ in this limit. Note that this functional derivative is linear: $\delta\phi[f; a_1 + a_2] = \delta\phi[f; a_1] + \delta\phi[f; a_2]$.

When the second variation $\delta^2\phi$ and the third variation $\delta^3\phi$ exist, they are described by

\phi[f + a] - \phi[f] = \delta\phi[f; a] + \frac{1}{2}\, \delta^2\phi[f; a, a] + \epsilon[f, a]\, \|a\|^2_{L^p(\nu)},
\phi[f + a] - \phi[f] = \delta\phi[f; a] + \frac{1}{2}\, \delta^2\phi[f; a, a] + \frac{1}{6}\, \delta^3\phi[f; a, a, a] + \epsilon[f, a]\, \|a\|^3_{L^p(\nu)},   (3)

where $\epsilon[f, a] \to 0$ as $\|a\|_{L^p(\nu)} \to 0$. The term $\delta^2\phi[f; a, b]$ is bilinear with respect to the arguments $a$ and $b$, and $\delta^3\phi[f; a, b, c]$ is trilinear with respect to $a$, $b$, and $c$.
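Definition (2) can be checked numerically. The following sketch (our illustrative addition, assuming NumPy; it uses the squared-norm functional $\phi[u] = \int u^2\, d\nu$ that Section 3.1 works out analytically) discretizes $[0, 1]$, shrinks a perturbation $a$, and watches the remainder $\epsilon[f, a]$ vanish:

import numpy as np

# Discretize [0, 1]; integrals become Riemann sums and the L2 norm
# a weighted vector norm.
x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]

phi = lambda u: np.sum(u**2) * dx              # phi[u] = int u^2 dnu
dphi = lambda u, a: 2.0 * np.sum(u * a) * dx   # delta phi[u; a] = 2 int u a dnu

f = np.sin(2 * np.pi * x)
a0 = np.cos(2 * np.pi * x)

for t in [1.0, 0.1, 0.01, 0.001]:
    a = t * a0
    norm_a = np.sqrt(np.sum(a**2) * dx)                 # ||a||_{L2}
    eps = (phi(f + a) - phi(f) - dphi(f, a)) / norm_a   # epsilon[f, a]
    print(norm_a, eps)  # eps -> 0 as ||a|| -> 0, as (2) requires;
                        # for this quadratic phi, eps equals ||a|| exactly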
2.1 Continuity
Consider sequences of functions $\{a_n\}, \{f_n\} \subset L^p(\nu)$, where $a_n \to a$, $f_n \to f$, and $a, f \in L^p(\nu)$. If $\phi \in C^3(L^p(\nu); \mathbb{R})$ and $\delta\phi[f; a]$, $\delta^2\phi[f; a, a]$, and $\delta^3\phi[f; a, a, a]$ are defined as above, then $\delta\phi[f_n; a_n] \to \delta\phi[f; a]$, $\delta^2\phi[f_n; a_n, a_n] \to \delta^2\phi[f; a, a]$, and $\delta^3\phi[f_n; a_n, a_n, a_n] \to \delta^3\phi[f; a, a, a]$.
2.2 Strong Positivity
The quadratic functional $\delta^2\phi[f; a, a]$ defined on the normed linear space $L^p(\nu)$ is strongly positive if there exists a constant $k > 0$ such that $\delta^2\phi[f; a, a] \ge k \|a\|^2_{L^p(\nu)}$ for all $a \in A$. In a finite-dimensional space, strong positivity of a quadratic form is equivalent to the quadratic form being positive definite. From (3),

\phi[f + a] = \phi[f] + \delta\phi[f; a] + \frac{1}{2}\, \delta^2\phi[f; a, a] + o(\|a\|^2),
\phi[f] = \phi[f + a] - \delta\phi[f + a; a] + \frac{1}{2}\, \delta^2\phi[f + a; a, a] + o(\|a\|^2),

where $o(\|a\|^2)$ denotes a function that goes to zero as $\|a\|$ goes to zero, even if it is divided by $\|a\|^2$. Adding the above two equations and canceling the $\phi$'s yields

0 = \delta\phi[f; a] - \delta\phi[f + a; a] + \frac{1}{2}\, \delta^2\phi[f; a, a] + \frac{1}{2}\, \delta^2\phi[f + a; a, a] + o(\|a\|^2),   (4)

which is equivalent to

\delta\phi[f + a; a] - \delta\phi[f; a] = \delta^2\phi[f; a, a] + o(\|a\|^2),

because $\delta^2\phi[f + a; a, a] - \delta^2\phi[f; a, a] \le \|\delta^2\phi[f + a; \cdot, \cdot] - \delta^2\phi[f; \cdot, \cdot]\|\, \|a\|^2$, and we assumed $\phi \in C^2$, so $\delta^2\phi[f + a; a, a] - \delta^2\phi[f; a, a]$ is of order $o(\|a\|^2)$. This shows that the variation of the first variation of $\phi$ is the second variation of $\phi$. A procedure like the above can be used to prove that analogous statements hold for higher variations, if they exist.
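This conclusion can also be verified numerically. In the sketch below (our addition, using the same NumPy discretization as before), the quadratic functional $\phi[u] = \int u^2\, d\nu$ makes the $o(\|a\|^2)$ remainder exactly zero:

import numpy as np

x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]
phi = lambda u: np.sum(u**2) * dx              # phi[u] = int u^2 dnu
dphi = lambda u, a: 2.0 * np.sum(u * a) * dx   # delta phi[u; a] = 2 int u a dnu

f = np.sin(2 * np.pi * x)
a = 0.05 * np.cos(2 * np.pi * x)

# Variation of the first variation versus the second variation:
lhs = dphi(f + a, a) - dphi(f, a)   # delta phi[f + a; a] - delta phi[f; a]
rhs = 2.0 * np.sum(a**2) * dx       # delta^2 phi[f; a, a] = 2 int a^2 dnu
print(lhs, rhs)                     # equal up to floating point: phi is quadratic,
                                    # so the o(||a||^2) remainder vanishes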
2.3 Optimality Conditions
Consider a functional $J$ and the problem of finding the function $\hat f$ such that $J[\hat f]$ achieves a local minimum of $J$. For $J[f]$ to have an extremum (minimum) at $\hat f$, it is necessary that

\delta J[f; a] = 0 \quad \text{and} \quad \delta^2 J[f; a, a] \ge 0

for $f = \hat f$ and for all admissible functions $a \in A$. A sufficient condition for $\hat f$ to be a minimum is that the first variation $\delta J[f; a]$ vanishes for $f = \hat f$, and its second variation $\delta^2 J[f; a, a]$ is strongly positive for $f = \hat f$.
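As a worked illustration of these conditions (our added example, restricting the entropy functional of Section 1 to densities on $[0, 1]$ so the problem is well posed), consider extremizing

J[g] = -\int_0^1 g \ln g \, dx + \lambda \left( \int_0^1 g \, dx - 1 \right),

where $\lambda$ is a Lagrange multiplier enforcing normalization. Expanding $J[g + a] - J[g]$ to first order in $a$ gives

\delta J[g; a] = \int_0^1 \left( -\ln g(x) - 1 + \lambda \right) a(x) \, dx,

which vanishes for every admissible $a$ only if $-\ln g - 1 + \lambda = 0$, that is, only if $g$ is constant; the constraint then forces $\hat g \equiv 1$, the uniform density. The second variation is $\delta^2 J[g; a, a] = -\int_0^1 a^2/g \, dx$, which is strictly negative for $a \neq 0$, so $\hat g$ is a maximum (the conditions above apply with the signs flipped).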
2.4 Other Derivatives
The Fréchet derivative is a common functional derivative, but other functional derivatives have been defined for various purposes. Another common one is the Gâteaux derivative, which, instead of considering any perturbing function $a$ in (2), only considers perturbing functions in a particular direction.
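For a Fréchet-differentiable functional, the Gâteaux (directional) derivative exists and agrees with $\delta\phi[f; a]$. Here is a quick numerical comparison (our added sketch, reusing the discretized $\phi[u] = \int u^2\, d\nu$ from above):

import numpy as np

x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]
phi = lambda u: np.sum(u**2) * dx              # phi[u] = int u^2 dnu
dphi = lambda u, a: 2.0 * np.sum(u * a) * dx   # Frechet derivative delta phi[u; a]

f = x                    # f(x) = x
a = np.ones_like(x)      # fixed direction a(x) = 1

# Gateaux-style derivative: the limit of (phi[f + tau*a] - phi[f]) / tau.
for tau in [1e-1, 1e-3, 1e-5]:
    print((phi(f + tau * a) - phi(f)) / tau)   # -> 1.0 as tau -> 0
print(dphi(f, a))  # 1.0 = 2 int x dx: the two derivatives agree in this direction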
3 Functional Bregman Divergence

We illustrate working with the Fréchet derivative by introducing a class of distortions between any two functions called the functional Bregman divergences, giving an example for squared error, and then proving a number of properties. First, we review the vector case. Bregman divergences were first defined for vectors [3], and are a class of distortions that includes squared error, relative entropy, and many other dissimilarities common in engineering and statistics [4]. Given any strictly convex and twice differentiable function $\phi : \mathbb{R}^n \to \mathbb{R}$, one can define a Bregman divergence over vectors $x, y \in \mathbb{R}^n$ that are admissible inputs to $\phi$:

d_\phi(x, y) = \phi(x) - \phi(y) - \nabla\phi(y)^T (x - y).   (5)
By re-arranging the terms of (5), one sees that the Bregman divergence $d_\phi$ is the tail of the Taylor series expansion of $\phi$ around $y$:

\phi(x) = \phi(y) + \nabla\phi(y)^T (x - y) + d_\phi(x, y).   (6)

The Bregman divergences have the useful property that the mean of a set has the minimum mean Bregman divergence to all the points in the set [4]. Recently, we generalized the Bregman divergence to a functional Bregman divergence [5] in order to show that the mean of a set of functions minimizes the mean Bregman divergence to the set of functions; see [2] for full details. The functional Bregman divergence is a straightforward analog of the vector case. Let $\phi : L^p(\nu) \to \mathbb{R}$ be a strictly convex, twice-continuously Fréchet-differentiable functional. The Bregman divergence $d_\phi : A \times A \to [0, \infty)$ is defined for all $f, g \in A$ as

d_\phi[f, g] = \phi[f] - \phi[g] - \delta\phi[g; f - g],   (7)

where $\delta\phi[g; f - g]$ is the Fréchet derivative of $\phi$ at $g$ in the direction of $f - g$.
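Definition (7) translates directly into a generic numerical recipe: given routines for $\phi$ and its Fréchet derivative, the divergence between two discretized functions is one line. The sketch below is our illustrative addition; the particular phi and dphi (a negative entropy differentiated formally on a grid) are stand-ins for any strictly convex functional and its derivative:

import numpy as np

def bregman(phi, dphi, f, g):
    """d_phi[f, g] = phi[f] - phi[g] - delta phi[g; f - g], per (7),
    with f and g represented by samples on a common grid."""
    return phi(f) - phi(g) - dphi(g, f - g)

x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]

# One convenient choice: the negative entropy phi[u] = int u ln u dnu,
# whose (formal) Frechet derivative is delta phi[u; a] = int (ln u + 1) a dnu.
phi = lambda u: np.sum(u * np.log(u)) * dx
dphi = lambda u, a: np.sum((np.log(u) + 1.0) * a) * dx

f = 1.0 + 0.5 * np.sin(2 * np.pi * x)   # a positive density on [0, 1]
g = np.ones_like(x)                     # the uniform density
print(bregman(phi, dphi, f, g))         # > 0; equals 0 only when f = g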
3.1 Example: Total Squared Error
Let's consider how a particular choice of $\phi$ turns (7) into the total squared error between two functions. Let $\phi[g] = \int g^2\, d\nu$, where $\phi : L^2(\nu) \to \mathbb{R}$, and let $g, f, a \in L^2(\nu)$. Then

\phi[g + a] - \phi[g] = \int (g + a)^2\, d\nu - \int g^2\, d\nu = 2 \int g a\, d\nu + \int a^2\, d\nu.

Because $\int a^2\, d\nu = \|a\|^2_{L^2(\nu)}$ and $\|a\|^2_{L^2(\nu)} / \|a\|_{L^2(\nu)} = \|a\|_{L^2(\nu)} \to 0$ as $\|a\|_{L^2(\nu)} \to 0$, it holds that

\delta\phi[g; a] = 2 \int g a\, d\nu,

which is a continuous linear functional in $a$. Then, by definition of the second Fréchet derivative,

\delta^2\phi[g; b, a] = \delta\phi[g + b; a] - \delta\phi[g; a] = 2 \int (g + b) a\, d\nu - 2 \int g a\, d\nu = 2 \int b a\, d\nu.

Thus $\delta^2\phi[g; b, a]$ is a quadratic form, where $\delta^2\phi$ is actually independent of $g$ and strongly positive, since $\delta^2\phi[g; a, a] = 2 \int a^2\, d\nu = 2 \|a\|^2_{L^2(\nu)}$. Substituting into (7),

d_\phi[f, g] = \int f^2\, d\nu - \int g^2\, d\nu - 2 \int g (f - g)\, d\nu = \int (f^2 - 2fg + g^2)\, d\nu = \int (f - g)^2\, d\nu,

the total squared error between $f$ and $g$.
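A numerical spot-check of this identity (our addition, with the same discretization conventions as the earlier sketches): for $\phi[u] = \int u^2\, d\nu$, the divergence (7) and the total squared error agree to floating-point accuracy.

import numpy as np

x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]
phi = lambda u: np.sum(u**2) * dx              # phi[u] = int u^2 dnu
dphi = lambda u, a: 2.0 * np.sum(u * a) * dx   # delta phi[u; a] = 2 int u a dnu

f = np.exp(-x)
g = np.cos(x)

d_breg = phi(f) - phi(g) - dphi(g, f - g)   # d_phi[f, g] from (7)
sq_err = np.sum((f - g)**2) * dx            # total squared error ||f - g||^2
print(d_breg, sq_err)                       # identical up to floating point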
3.2 Properties of the Functional Bregman Divergence
Next we establish some properties of the functional Bregman divergence. We have listed these in order from easiest to hardest to prove, in case the reader would like to use proving the properties as exercises.

Linearity. The functional Bregman divergence is linear with respect to $\phi$.
Proof:

d_{(c_1\phi_1 + c_2\phi_2)}[f, g] = (c_1\phi_1 + c_2\phi_2)[f] - (c_1\phi_1 + c_2\phi_2)[g] - \delta(c_1\phi_1 + c_2\phi_2)[g; f - g] = c_1 d_{\phi_1}[f, g] + c_2 d_{\phi_2}[f, g].   (8)

Convexity. The Bregman divergence $d_\phi[f, g]$ is always convex with respect to $f$.
Proof: Consider

\triangle d_\phi[f, g; a] = d_\phi[f + a, g] - d_\phi[f, g] = \phi[f + a] - \phi[f] - \delta\phi[g; f - g + a] + \delta\phi[g; f - g].

Using linearity in the third term,

\triangle d_\phi[f, g; a] = \phi[f + a] - \phi[f] - \delta\phi[g; a],

and the conclusion follows from (3): expanding $\phi[f + a] - \phi[f]$ shows that the second variation of $d_\phi[f, g]$ with respect to $f$ is $\delta^2\phi[f; a, a]$, which is nonnegative because $\phi$ is convex.

Linear Separation. The set of functions $f \in A$ that are equidistant from two functions $g_1, g_2 \in A$ in terms of functional Bregman divergence forms a hyperplane.
Proof: Fix two non-equal functions $g_1, g_2 \in A$, and consider the set of all functions in $A$ that are equidistant in terms of functional Bregman divergence from $g_1$ and $g_2$:

d_\phi[f, g_1] = d_\phi[f, g_2]
-\phi[g_1] - \delta\phi[g_1; f - g_1] = -\phi[g_2] - \delta\phi[g_2; f - g_2]
-\delta\phi[g_1; f - g_1] = \phi[g_1] - \phi[g_2] - \delta\phi[g_2; f - g_2].

Using linearity, the above relationship can be equivalently expressed as

-\delta\phi[g_1; f] + \delta\phi[g_1; g_1] = \phi[g_1] - \phi[g_2] - \delta\phi[g_2; f] + \delta\phi[g_2; g_2],
\delta\phi[g_2; f] - \delta\phi[g_1; f] = \phi[g_1] - \phi[g_2] - \delta\phi[g_1; g_1] + \delta\phi[g_2; g_2],
Lf = c,

where $L$ is the bounded linear functional defined by $Lf = \delta\phi[g_2; f] - \delta\phi[g_1; f]$, and $c$ is the constant corresponding to the right-hand side. In other words, $f$ has to be in the set $\{a \in A : La = c\}$, where $c$ is a constant. This set is a hyperplane.

Generalized Pythagorean Inequality. For any $f, g, h \in A$,

d_\phi[f, h] = d_\phi[f, g] + d_\phi[g, h] + \delta\phi[g; f - g] - \delta\phi[h; f - g].

Proof:

d_\phi[f, g] + d_\phi[g, h] = \phi[f] - \phi[h] - \delta\phi[g; f - g] - \delta\phi[h; g - h]
= \phi[f] - \phi[h] - \delta\phi[h; f - h] + \delta\phi[h; f - h] - \delta\phi[g; f - g] - \delta\phi[h; g - h]
= d_\phi[f, h] + \delta\phi[h; f - g] - \delta\phi[g; f - g],

where the last line follows from the definition of the functional Bregman divergence and the linearity of the fourth and last terms: $\delta\phi[h; f - h] - \delta\phi[h; g - h] = \delta\phi[h; f - g]$.
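The generalized Pythagorean identity is exact, so it is easy to test numerically (our added sketch; f, g, and h are arbitrary random functions on the grid):

import numpy as np

x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]
phi = lambda u: np.sum(u**2) * dx
dphi = lambda u, a: 2.0 * np.sum(u * a) * dx
d = lambda f, g: phi(f) - phi(g) - dphi(g, f - g)   # d_phi[f, g] from (7)

rng = np.random.default_rng(0)
f, g, h = rng.standard_normal((3, x.size))

lhs = d(f, h)
rhs = d(f, g) + d(g, h) + dphi(g, f - g) - dphi(h, f - g)
print(abs(lhs - rhs))   # ~ 1e-16: the identity holds exactly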
Equivalence Classes. Partition the set of strictly convex, differentiable functionals $\{\phi\}$ on $A$ into classes with respect to functional Bregman divergence, so that $\phi_1$ and $\phi_2$ belong to the same class if $d_{\phi_1}[f, g] = d_{\phi_2}[f, g]$ for all $f, g \in A$. For brevity we will denote $d_{\phi_1}[f, g]$ simply by $d_{\phi_1}$. Let $\phi_1 \sim \phi_2$ denote that $\phi_1$ and $\phi_2$ belong to the same class; then $\sim$ is an equivalence relation because it satisfies the properties of reflexivity (because $d_{\phi_1} = d_{\phi_1}$), symmetry (because if $d_{\phi_1} = d_{\phi_2}$, then $d_{\phi_2} = d_{\phi_1}$), and transitivity (because if $d_{\phi_1} = d_{\phi_2}$ and $d_{\phi_2} = d_{\phi_3}$, then $d_{\phi_1} = d_{\phi_3}$). Further, if $\phi_1 \sim \phi_2$, then they differ only by an affine transformation.
Proof: It only remains to be shown that if $\phi_1 \sim \phi_2$, then they differ only by an affine transformation. Note that by assumption, $\phi_1[f] - \phi_1[g] - \delta\phi_1[g; f - g] = \phi_2[f] - \phi_2[g] - \delta\phi_2[g; f - g]$, and fix $g$ so that $\phi_1[g]$ and $\phi_2[g]$ are constants. By the linearity property, $\delta\phi[g; f - g] = \delta\phi[g; f] - \delta\phi[g; g]$, and because $g$ is fixed, this equals $\delta\phi[g; f] + c_0$, where $c_0$ is a scalar constant. Then $\phi_2[f] = \phi_1[f] + (\delta\phi_2[g; f] - \delta\phi_1[g; f]) + c_1$, where $c_1$ is a constant. Thus $\phi_2[f] = \phi_1[f] + Af + c_1$, where $A = \delta\phi_2[g; \cdot] - \delta\phi_1[g; \cdot]$, and thus $A : A \to \mathbb{R}$ is a linear operator that does not depend on $f$.
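A numerical spot-check of the equivalence-class property (our addition): shifting $\phi$ by an affine term $Au + c$, here with $Au = \int w u\, d\nu$ for a fixed weight $w$, leaves the divergence unchanged.

import numpy as np

x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]

phi1 = lambda u: np.sum(u**2) * dx
dphi1 = lambda u, a: 2.0 * np.sum(u * a) * dx

# phi2 = phi1 + (affine term A u + c), with A u = int w u dnu for a fixed w:
w = np.cos(2 * np.pi * x)
phi2 = lambda u: phi1(u) + np.sum(w * u) * dx + 3.0
dphi2 = lambda u, a: dphi1(u, a) + np.sum(w * a) * dx

d = lambda p, dp, f, g: p(f) - p(g) - dp(g, f - g)
f = np.exp(-x)
g = np.cos(x)
print(d(phi1, dphi1, f, g) - d(phi2, dphi2, f, g))   # 0: affine terms cancel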
Dual Divergence. Given a pair $(g, \phi)$, where $g \in L^p(\nu)$ and $\phi$ is a strictly convex, twice-continuously Fréchet-differentiable functional, the function-functional pair $(G, \psi)$ is the Legendre transform of $(g, \phi)$ [6] if

\phi[g] = -\psi[G] + \int g(x) G(x)\, d\nu(x),   (9)
\delta\phi[g; a] = \int G(x) a(x)\, d\nu(x),   (10)

where $\psi$ is a strictly convex, twice-continuously Fréchet-differentiable functional, and $G \in L^q(\nu)$, where $\frac{1}{p} + \frac{1}{q} = 1$.

Given Legendre transformation pairs $f, g \in L^p(\nu)$ and $F, G \in L^q(\nu)$,

d_\phi[f, g] = d_\psi[G, F].

Proof: The proof begins by substituting (9) and (10) into (7):

d_\phi[f, g] = \phi[f] + \psi[G] - \int g(x) G(x)\, d\nu(x) - \int G(x) (f - g)(x)\, d\nu(x)
= \phi[f] + \psi[G] - \int G(x) f(x)\, d\nu(x).   (11)

Applying the Legendre transformation to $(G, \psi)$ implies that

\psi[G] = -\phi[g] + \int g(x) G(x)\, d\nu(x),   (12)
\delta\psi[G; a] = \int g(x) a(x)\, d\nu(x).   (13)
Using (12) and (13), $d_\psi[G, F]$ can be reduced to (11).

Non-negativity. The functional Bregman divergence is non-negative.
Proof: To show this, define $\tilde\phi : \mathbb{R} \to \mathbb{R}$ by $\tilde\phi(t) = \phi[tf + (1 - t)g]$ for $f, g \in A$. From the definition of the Fréchet derivative,

\frac{d\tilde\phi}{dt} = \delta\phi[tf + (1 - t)g; f - g].   (14)

The function $\tilde\phi$ is convex because $\phi$ is convex by definition. Then from the mean value theorem there is some $0 \le t_0 \le 1$ such that

\tilde\phi(1) - \tilde\phi(0) = \frac{d\tilde\phi}{dt}(t_0) \ge \frac{d\tilde\phi}{dt}(0).   (15)

Because $\tilde\phi(1) = \phi[f]$, $\tilde\phi(0) = \phi[g]$, and (14), subtracting the right-hand side of (15) implies that

\phi[f] - \phi[g] - \delta\phi[g; f - g] \ge 0.   (16)

If $f = g$, then (16) holds in equality. To finish, we prove the converse. Suppose (16) holds in equality; then

\tilde\phi(1) - \tilde\phi(0) = \frac{d\tilde\phi}{dt}(0).   (17)

The equation of the straight line connecting $\tilde\phi(0)$ to $\tilde\phi(1)$ is $\ell(t) = \tilde\phi(0) + (\tilde\phi(1) - \tilde\phi(0)) t$, and the tangent line to the curve $\tilde\phi$ at $\tilde\phi(0)$ is $y(t) = \tilde\phi(0) + t\, \frac{d\tilde\phi}{dt}(0)$. Because $\tilde\phi(\tau) = \tilde\phi(0) + \int_0^\tau \frac{d\tilde\phi}{dt}(t)\, dt$ and $\frac{d\tilde\phi}{dt}(t) \ge \frac{d\tilde\phi}{dt}(0)$ as a direct consequence of convexity, it must be that $\tilde\phi(t) \ge y(t)$. Convexity also implies that $\ell(t) \ge \tilde\phi(t)$. However, the assumption that (16) holds in equality implies (17), which means that $y(t) = \ell(t)$, and thus $\ell(t) = \tilde\phi(t)$, so that $\tilde\phi$ is not strictly convex. Because $\phi$ is by definition strictly convex, it must be true that $\tilde\phi(t) = \phi[tf + (1 - t)g] < t\phi[f] + (1 - t)\phi[g]$ unless $f = g$. Thus, under the assumption of equality in (16), it must be true that $f = g$.
4 Further Reading
For further reading, try the texts by Gelfand and Fomin [6] or Luenberger [7], and the Wikipedia pages on functional derivatives, Fréchet derivatives, and Gâteaux derivatives. Readers may also find our paper [2] helpful, which further illustrates the use of functional derivatives in the context of the functional Bregman divergence, conveniently using the same notation as this introduction.
References
[1] M. R. Gupta, "A measure theory tutorial: Measure theory for dummies," Univ. Washington Technical Report 2006-0008, 2008. Available at idl.ee.washington.edu/publications.php.
[2] B. Frigyik, S. Srivastava, and M. R. Gupta, "Functional Bregman divergence and Bayesian estimation of distributions," IEEE Trans. on Information Theory, vol. 54, no. 11, pp. 5130-5139, 2008.
[3] L. Bregman, "The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming," USSR Computational Mathematics and Mathematical Physics, vol. 7, pp. 200-217, 1967.
[4] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705-1749, 2005.
[5] S. Srivastava, M. R. Gupta, and B. A. Frigyik, "Bayesian quadratic discriminant analysis," Journal of Machine Learning Research, vol. 8, pp. 1287-1314, 2007.
[6] I. M. Gelfand and S. V. Fomin, Calculus of Variations. USA: Dover, 2000.
[7] D. Luenberger, Optimization by Vector Space Methods. USA: Wiley-Interscience, 1997.