Logical Methods in Computer Science, Volume 18, Issue 1, Paper 41. Submitted Jan. 19, 2021; published Mar. 22, 2022.
Higher Order Automatic Differentiation of Higher Order Functions
Abstract.
We present semantic correctness proofs of automatic differentiation (AD). We consider a forward-mode AD method on a higher order language with algebraic data types, and we characterise it as the unique structure preserving macro given a choice of derivatives for basic operations. We describe a rich semantics for differentiable programming, based on diffeological spaces. We show that it interprets our language, and we phrase what it means for the AD method to be correct with respect to this semantics. We show that our characterisation of AD gives rise to an elegant semantic proof of its correctness based on a gluing construction on diffeological spaces. We explain how this is, in essence, a logical relations argument. Throughout, we show how the analysis extends to AD methods for computing higher order derivatives using a Taylor approximation.
Key words and phrases:
automatic differentiation, software correctness, denotational semantics
1. Introduction
Automatic differentiation (AD), loosely speaking, is the process of taking a program describing a function, and constructing the derivative of that function by applying the chain rule across the program code. As gradients play a central role in many aspects of machine learning, so too do automatic differentiation systems such as TensorFlow [AAB+16], PyTorch [PGC+17] or Stan [CHB+15].
Differentiation has a well-developed mathematical theory in terms of differential geometry. The aim of this paper is to formalize this connection between differential geometry and the syntactic operations of AD, particularly for AD methods that calculate higher order derivatives. In this way we achieve two things: (1) a compositional, denotational understanding of differentiable programming and AD; (2) an explanation of the correctness of AD.
This intuitive correspondence (summarized in Fig. 1) is in fact rather complicated. In this paper, we focus on resolving the following problem: higher order functions play a key role in programming, and yet they have no counterpart in traditional differential geometry. Moreover, we resolve this problem while retaining the compositionality of denotational semantics.
1.0.1. Higher order functions and differentiation.
A major application of higher order functions is to support disciplined code reuse. The need for code reuse is particularly acute in machine learning. For example, a multi-layer neural network might be built of millions of near-identical neurons, as follows.
(Here $\varsigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function, as illustrated.) We can use these functions to build a network as follows (see also Fig. 2):
(1) |
Here with . This program (1) describes a smooth (infinitely differentiable) function. The goal of automatic differentiation is to find its derivative.
If we $\beta$-reduce all the $\lambda$'s, we end up with a very long function expression just built from the sigmoid function and linear algebra. We can then find a program for calculating its derivative by applying the chain rule. However, automatic differentiation can also be expressed without first $\beta$-reducing, in a compositional way, by explaining how the higher order functions from which (1) is built propagate derivatives. This paper is a semantic analysis of this compositional approach.
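To make the role of higher order functions concrete, here is a minimal Python sketch of a small network assembled from reusable combinators. The names `neuron`, `layer` and `comp`, the tuple encoding of parameters, and the network shape (two inputs, three hidden neurons, one output) are assumptions made for illustration; they are not the paper's program (1).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(arg):
    """First order: takes (x, (w, b)) and returns sigmoid(w . x + b)."""
    x, (w, b) = arg
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def layer(f):
    """Higher order: run f on the same input once per parameter block."""
    return lambda arg: tuple(f((arg[0], p)) for p in arg[1])

def comp(f, g):
    """Higher order: feed the output of f into g, splitting the parameters."""
    return lambda arg: g((f((arg[0], arg[1][0])), arg[1][1]))

# Hypothetical network: a hidden layer of three neurons, then one output neuron.
network = comp(layer(neuron), neuron)

hidden_params = (((0.1, -0.2), 0.0), ((0.3, 0.4), 0.1), ((-0.5, 0.2), -0.1))
output_params = ((0.7, -0.3, 0.2), 0.05)
y = network(((1.0, 2.0), (hidden_params, output_params)))
```

Inlining `layer` and `comp` by hand would give one long first order expression; the compositional view instead asks how each combinator should act on derivatives.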
The general idea of denotational semantics is to interpret types as spaces and programs as functions between the spaces. In this paper, we propose to use diffeological spaces and smooth functions [Sou80, IZ13] to this end. These satisfy the following three desiderata:
• $\mathbb{R}$ is a space, and the smooth functions $\mathbb{R} \to \mathbb{R}$ are exactly the functions that are infinitely differentiable;
• The set of smooth functions between spaces again forms a space, so we can interpret function types.
• The disjoint union of a sequence of spaces again forms a space, and this enables us to interpret variant types and inductive types, e.g. lists of reals form the space $\biguplus_{i=0}^{\infty} \mathbb{R}^i$.
We emphasise that the most standard formulation of differential geometry, using manifolds, does not support spaces of functions. Diffeological spaces seem to us the simplest notion of space that satisfies these conditions, but there are other candidates [BH11, Sta11]. A diffeological space is in particular a set $X$ equipped with a chosen set of curves $C_X \subseteq X^{\mathbb{R}}$, and a smooth map $f : X \to Y$ must be such that if $\gamma \in C_X$ then $f \circ \gamma \in C_Y$. This is reminiscent of the method of logical relations.
1.0.2. From smoothness to automatic derivatives at higher types.
Our denotational semantics in diffeological spaces guarantees that all definable functions are smooth. But we need more than just to know that a definable function happens to have a mathematical derivative: we need to be able to find that derivative.
In this paper we focus on forward mode automatic differentiation methods for computing higher derivatives, which are macro translations on syntax (called in Section 3). We are able to show that they are correct, using our denotational semantics.
Here there is one subtle point that is central to our development. Although differential geometry provides established derivatives for first order functions (such as the neurons above), there is no canonical notion of derivative for higher order functions (such as the combinators used to build (1)) in the theory of diffeological spaces (e.g. [CW14]). We propose a new way to resolve this, by interpreting types as triples $(X, X', S)$ where, intuitively, $X$ is a space of inhabitants of the type, $X'$ is a space serving as a chosen bundle of tangents (or jets, in the case of higher order derivatives) over $X$, and $S$ is a binary relation between curves, informally relating curves in $X$ with their tangent (resp. jet) curves in $X'$. This new model gives a denotational semantics for higher order automatic differentiation on a language with higher order functions.
In Section 4 we boil this new approach down to a straightforward and elementary logical relations argument for the correctness of higher order automatic differentiation. The approach is explained in detail in Section 6. We explore some subtleties of non-uniqueness of derivatives of higher order functions in Section 7.
1.0.3. Related work and context.
AD has a long history and has many implementations. AD was perhaps first phrased in a functional setting in [PS08], and there are now a number of teams working on AD in the functional setting (e.g. [WWE+19, SFVPJ19, Ell18]), some providing efficient implementations. Although that work does not involve formal semantics, it is inspired by intuitions from differential geometry and category theory.
This paper adds to a very recent body of work on verified automatic differentiation. In the first order setting, there are recent accounts based on denotational semantics in manifolds [FST19, LYRY20] and based on synthetic differential geometry [CGM19], work making a categorical abstraction [CCG+20] and work connecting operational semantics with denotational semantics [AP20, Plo18], as well as work focussing on how to correctly differentiate programs that operate on tensors [BML+20] and programs that make use of quantum computing [ZHCW20]. Recently there has also been significant progress at higher types. The work of Brunel et al. [BMP20] and Mazza and Pagani [MP21] gives formal correctness proofs for reverse-mode derivatives on a linear $\lambda$-calculus with a particular operational semantics. The work of Barthe et al. [BCLG20] provides a general discussion of some new syntactic logical relations arguments, including one very similar to our syntactic proof of Theorem 3. Sherman et al. [SMC20] discuss a differential programming technique that works at higher types, based on exact real arithmetic, and relate it to a computable semantics. We understand that the authors of [CGM19] are working on higher types. Vákár et al. [Vák21, VS21, LNV21] phrase and prove correct a reverse mode AD technique on a higher order language based on a similar gluing technique. Vákár [Vák20] extends a standard $\lambda$-calculus with type recursion, and proves correct a forward-mode AD on such a higher-order language, also using a gluing argument.
The differential $\lambda$-calculus [ER03] is related to AD, and explicit connections are made in [MO20, Man12]. One difference is that the differential $\lambda$-calculus allows the addition of terms at all types, and hence vector space models are suitable to interpret all types. This choice would appear peculiar with the variant and inductive types that we consider here, as the dimension of a disjoint union of spaces is only defined locally.
This paper builds on our previous work [HSV20a, Vák20] in which we gave denotational correctness proofs for forward mode AD algorithms for computing first derivatives. Here, we explain how these techniques extend to methods that calculate higher derivatives.
The Faà di Bruno construction has also been investigated [CS11] in the context of Cartesian differential categories.
The idea of directly calculating higher order derivatives by using automatic differentiation methods that work with Taylor approximations (also known as jets in differential geometry) is well-known [GUW00] and it has recently gained renewed interest [Bet18, BJD19]. So far, such “Taylor-mode AD” methods have only been applied to first order functional languages, however. This paper shows how to extend these higher order AD methods to languages with support for higher order functions and algebraic data types.
The two main methods for implementing AD are operator overloading and, the method used in this paper, source code transformation [VMBBL18]. Taylor-mode AD has been seen to be significantly faster than iterated AD in the context of operator overloading [BJD19] in Jax [FJL18]. There are other notable implementations of forward Taylor-mode [BS96, BS97, Kar01, PS07, WGP16]. Some of them are implemented in a functional language [Kar01, PS07]. Taylor-mode implementations use the rich algebraic structure of derivatives to share intermediate results and so avoid many of the redundant computations that occur when first order methods are iterated. Perhaps the simplest example to see this is with the sin function, whose iterated derivatives only involve sin, cos, and negation. Importantly, most AD tools have the right complexity up to a constant factor, but this constant is quite important in practice and Taylor-mode helps achieve better performance. Another striking result for a version of Taylor-mode was obtained in [LMG18], where a performance gain of up to two orders of magnitude was achieved for computing certain Hessian-vector products using Ricci calculus. In essence, the algorithm used is a mixed-mode method that is derived via jets in [Bet18]. This is further improved in [LMG20]. Taylor-mode can also be useful for ODE solvers and hence will be important for neural differential equations [CRBD18].
Finally, we emphasise that we have chosen the neural network (1) as our running example mainly for its simplicity. Indeed one would typically use reverse-mode AD to train neural networks in practice. There are many other examples of AD outside the neural networks literature: AD is useful whenever derivatives need to be calculated on high dimensional spaces. This includes optimization problems more generally, where the derivative is passed to a gradient descent method (e.g. [RM51, KW+52, Qia99, KB14, DHS11, LN89]). Optimization problems involving higher order functions naturally show up in the calculus of variations and its applications in physics, where one typically looks for a function minimizing a certain integral [GSS00]. Other applications of AD are in advanced integration methods, since derivatives play a role in Hamiltonian Monte Carlo [Nea11, HG14] and variational inference [KTR+17]. Second order methods for gradient descent have also been extensively studied. As the basic second order Newton method requires inverting a high dimensional Hessian matrix, several alternatives and approximations have been studied. Some of them still require Taylor-like modes of differentiation and require a matrix-vector product where the matrix resembles the Hessian or inverse Hessian [KK04, Mar10, Ama12].
1.0.4. Summary of contributions.
We have provided a semantic analysis of higher order automatic differentiation. Our syntactic starting points are higher order forward-mode AD macros on a typed higher order language that extend their well-known first order equivalents (e.g. [SFVPJ19, WWE+19, HSV20a]). We present these in Section 3 for function types, and in Section 5 we extend them to inductive types and variants. The main contributions of this paper are as follows.
• We give a denotational semantics for the language in diffeological spaces, showing that every definable expression is smooth (Section 4).
• We show correctness of the higher order AD macros by a logical relations argument (Th. 3).
• We give a categorical analysis of this correctness argument with two parts: a universal property satisfied by the macro in terms of syntactic categories, and a new notion of glued space that abstracts the logical relation (Section 6).
• We then use this analysis to state and prove a correctness argument at all first order types (Th. 8).
Relation to previous work
This paper extends and develops the paper [HSV20a] presented at the 23rd International Conference on Foundations of Software Science and Computation Structures (FoSSaCS 2020). This version includes numerous elaborations, notably the extension of the definition, semantics and correctness of automatic differentiation methods for computing higher order derivatives (introduced in Sections 2.2-2.4) and a novel discussion about derivatives of higher-order functions (Section 7).
2. Rudiments of differentiation: how to calculate with dual numbers and Taylor approximations
2.1. First order differentiation: the chain rule and dual numbers.
We will now recall the definition of the gradient of a differentiable function, the goal of AD, and what it means for AD to be correct. Recall that the derivative of a function $f : \mathbb{R} \to \mathbb{R}$, if it exists, is a function $\nabla f : \mathbb{R} \to \mathbb{R}$ such that for all $a$, $\nabla f(a)$ is the gradient of $f$ at $a$ in the sense that the function $x \mapsto f(a) + \nabla f(a) \cdot (x - a)$ gives the best linear approximation of $f$ at $a$. (The gradient $\nabla f(a)$ is often written $\frac{\mathrm{d} f(x)}{\mathrm{d} x}(a)$.)
The chain rule for differentiation tells us that we can calculate $\nabla (g \circ f)(a) = \nabla g(f(a)) \cdot \nabla f(a)$. In that sense, the chain rule tells us how linear approximations to a function transform under post-composition with another function.
To find $\nabla f$ in a compositional way, using the chain rule, two generalizations are reasonable:
• We need both $f$ and $\nabla f$ when calculating $\nabla (g \circ f)$ of a composition $g \circ f$, using the chain rule, so we are really interested in the pair $(f, \nabla f) : \mathbb{R} \to \mathbb{R} \times \mathbb{R}$;
• In building $f$ we will need to consider functions of multiple arguments, such as $+ : \mathbb{R}^2 \to \mathbb{R}$, and these functions should propagate derivatives.
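Thus we are more generally interested in transforming a function $g : \mathbb{R}^n \to \mathbb{R}$ into a function $h : (\mathbb{R} \times \mathbb{R})^n \to \mathbb{R} \times \mathbb{R}$ in such a way that for any $f_1, \dots, f_n : \mathbb{R} \to \mathbb{R}$,
$$h \circ (f_1, \nabla f_1, \dots, f_n, \nabla f_n) = \big(g \circ (f_1, \dots, f_n),\ \nabla (g \circ (f_1, \dots, f_n))\big). \qquad (2)$$
Computing automatically the program representing $h$, given a program representing $g$, is the goal of automatic differentiation. An intuition for $h$ is often given in terms of dual numbers. The transformed function operates on pairs of numbers, $(x, x')$, and it is common to think of such a pair as $x + x'\epsilon$ for an ‘infinitesimal’ $\epsilon$. But while this is a helpful intuition, the formalization of infinitesimals can be intricate, and the development in this paper is focussed on the elementary formulation in (2).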
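A function $h$ satisfying (2) encodes all the partial derivatives of $g$. For example, if $g : \mathbb{R}^2 \to \mathbb{R}$, then with $f_1(x) = x$ and $f_2(x) = x_2$, by applying (2) to $x_1$ we obtain $h(x_1, 1, x_2, 0) = \big(g(x_1, x_2), \frac{\partial g(x, x_2)}{\partial x}(x_1)\big)$ and similarly $h(x_1, 0, x_2, 1) = \big(g(x_1, x_2), \frac{\partial g(x_1, x)}{\partial x}(x_2)\big)$. And conversely, if $g$ is differentiable in each argument, then a unique $h$ satisfying (2) can be found by taking linear combinations of partial derivatives, for example:
$$h(x_1, x_1', x_2, x_2') = \Big(g(x_1, x_2),\ \frac{\partial g(x, x_2)}{\partial x}(x_1) \cdot x_1' + \frac{\partial g(x_1, x)}{\partial x}(x_2) \cdot x_2'\Big).$$
(Here, recall that the partial derivative $\frac{\partial g(x, x_2)}{\partial x}(x_1)$ is a particular notation for the gradient $\nabla (g(-, x_2))(x_1)$, i.e. with $x_2$ fixed.)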
In summary, the idea of differentiation with dual numbers is to transform a differentiable function $g : \mathbb{R}^n \to \mathbb{R}$ to a function $h : \mathbb{R}^{2n} \to \mathbb{R}^2$ which captures $g$ and all its partial derivatives. We packaged this up in (2) as an invariant which is useful for building derivatives of compound functions in a compositional way. The idea of (first order) forward mode automatic differentiation is to perform this transformation at the source code level.
We say that a macro for AD is correct if, given a semantic model $[\![-]\!]$, the program $P$ representing $g = [\![P]\!]$ is transformed by the macro to a program $P'$ representing $h = [\![P']\!]$. This means in particular that $P'$ computes correct partial derivatives of (the function represented by) $P$.
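The invariant (2) can be illustrated with a small operator-overloading sketch in Python. This is not the source-code macro studied in the paper: the class `Dual`, the helper `lift` and the test function `g` are hypothetical names chosen for this example, and only addition, multiplication and the sigmoid are supported.

```python
import math

class Dual:
    """A pair (value, tangent), read informally as value + tangent * eps."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)
    __radd__ = __add__

    def __mul__(self, other):
        other = lift(other)
        # Product rule on the tangent component.
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)
    __rmul__ = __mul__

def lift(x):
    """Treat a plain number as a constant, i.e. with zero tangent."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)

def sigmoid(d):
    d = lift(d)
    s = 1.0 / (1.0 + math.exp(-d.val))
    # Chain rule: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).
    return Dual(s, s * (1.0 - s) * d.tan)

def g(x1, x2):
    return sigmoid(x1 * x2 + x1)

a1, a2 = 0.5, -1.5
# Seeding the tangents with (1, 0) and (0, 1) reads off the two partial derivatives.
dg_dx1 = g(Dual(a1, 1.0), Dual(a2, 0.0)).tan
dg_dx2 = g(Dual(a1, 0.0), Dual(a2, 1.0)).tan
```

Seeding with a general pair of tangents instead returns the corresponding linear combination of the two partial derivatives, as in the displayed formula for $h$.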
Smooth functions.
In what follows we will often speak of smooth functions $\mathbb{R}^k \to \mathbb{R}^n$, which are functions that are continuous and differentiable, such that their derivatives are also continuous and differentiable, and so on.
2.2. Higher order differentiation: the Faà di Bruno formula and Taylor approximations.
We now generalize the above in two directions:
• We look for the best local approximations to $f$ with polynomials of some order $R$, generalizing the above use of linear functions ($R = 1$).
• We can work directly with multivariate functions $\mathbb{R}^k \to \mathbb{R}$ instead of functions of one variable $\mathbb{R} \to \mathbb{R}$ ($k = 1$).
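To make this precise, we recall that, given a smooth function $f : \mathbb{R}^k \to \mathbb{R}$ and a natural number $R \geq 0$, the $R$-th order Taylor approximation of $f$ at $a \in \mathbb{R}^k$ is defined in terms of the partial derivatives of $f$:
$$x \mapsto \sum_{\alpha_1 + \dots + \alpha_k \leq R} \frac{1}{\alpha_1! \cdots \alpha_k!} \, \frac{\partial^{\alpha_1 + \dots + \alpha_k} f}{\partial x_1^{\alpha_1} \cdots \partial x_k^{\alpha_k}}(a) \cdot (x_1 - a_1)^{\alpha_1} \cdots (x_k - a_k)^{\alpha_k}.$$
This is an $R$-th order polynomial. Similarly to the case of first order derivatives, we can recover the partial derivatives of $f$ up to the $R$-th order from its Taylor approximation by evaluating the series at basis vectors. See Section 2.3 below for an example.
Recall that the ordering of partial derivatives does not matter for smooth functions (Schwarz/Clairaut’s theorem). So there will be $\binom{R+k-1}{k-1}$ $R$-th order partial derivatives, and altogether there are $\binom{R+k}{k}$ summands in the $R$-th order Taylor approximation. (This can be seen by a ‘stars-and-bars’ argument.)
Since there are $\binom{R+k}{k}$ partial derivatives of $f$ of order $\leq R$, we can store them in the Euclidean space $\mathbb{R}^{\binom{R+k}{k}}$, which can also be regarded as the space of $k$-variate polynomials of degree $\leq R$.
We use a convention of coordinates $(y_{\alpha_1 \dots \alpha_k} \in \mathbb{R})_{\alpha_1 + \dots + \alpha_k \leq R}$ where $y_{\alpha_1 \dots \alpha_k}$ is intended to represent a partial derivative $\frac{\partial^{\alpha_1 + \dots + \alpha_k} f}{\partial x_1^{\alpha_1} \cdots \partial x_k^{\alpha_k}}(a)$ for some function $f : \mathbb{R}^k \to \mathbb{R}$. We will choose these coordinates in lexicographic order of the multi-indices $(\alpha_1, \dots, \alpha_k)$, that is, the indexes in the Euclidean space $\mathbb{R}^{\binom{R+k}{k}}$ will typically range from $(0, \dots, 0)$ to $(R, 0, \dots, 0)$.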
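The counting argument and the lexicographic coordinate convention can be checked with a few lines of Python. The function name `multi_indices` is a hypothetical helper introduced only for this sketch.

```python
from itertools import product
from math import comb

def multi_indices(k, R):
    """All (a_1, ..., a_k) of naturals with a_1 + ... + a_k <= R, in lexicographic order."""
    # itertools.product already enumerates tuples lexicographically.
    return [a for a in product(range(R + 1), repeat=k) if sum(a) <= R]

k, R = 2, 2
idx = multi_indices(k, R)
# idx == [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0)]

# Number of partial derivatives of order exactly R, and of order at most R.
exactly_R = [a for a in idx if sum(a) == R]
counts_match = (len(exactly_R) == comb(R + k - 1, k - 1)
                and len(idx) == comb(R + k, k))
```

The position of a multi-index in `idx` is then the coordinate at which the corresponding partial derivative is stored.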
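The $(k,R)$-Taylor representation of a function $g : \mathbb{R}^n \to \mathbb{R}$ is a function $h : \big(\mathbb{R}^{\binom{R+k}{k}}\big)^n \to \mathbb{R}^{\binom{R+k}{k}}$ that transforms the partial derivatives of $f = (f_1, \dots, f_n) : \mathbb{R}^k \to \mathbb{R}^n$ of order $\leq R$ under postcomposition with $g$: for all $a \in \mathbb{R}^k$,
$$h\Big(\Big(\tfrac{\partial^{\alpha_1 + \dots + \alpha_k} f_j}{\partial x_1^{\alpha_1} \cdots \partial x_k^{\alpha_k}}(a)\Big)_{1 \leq j \leq n,\ \alpha_1 + \dots + \alpha_k \leq R}\Big) = \Big(\tfrac{\partial^{\alpha_1 + \dots + \alpha_k} (g \circ f)}{\partial x_1^{\alpha_1} \cdots \partial x_k^{\alpha_k}}(a)\Big)_{\alpha_1 + \dots + \alpha_k \leq R}. \qquad (3)$$
Thus the Taylor representation generalizes the dual numbers representation ($R = k = 1$).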
To explicitly calculate the Taylor representation for a smooth function, we recall a generalization of the chain rule to higher derivatives. The chain rule tells us how the coefficients of linear approximations transform under composition of the functions. The Faà di Bruno formula [Sav06, EM03, CS96] tells us how coefficients of Taylor approximations – that is, higher derivatives – transform under composition. We recall the multivariate form from [Sav06, Theorem 2.1]. Given functions and , for ,
where are an enumeration of all the vectors of natural numbers such that and and we write for the number of such vectors. The details of this formula reflect the complicated combinatorics the arise from repeated applications of the chain and product rules for differentiation that one uses to prove it. Conceptually, however, it is rather straightforward: it tells us that the coefficients of the -th order Taylor approximation of can be expressed exclusively in terms of those of and .
Thus the Faà di Bruno formula uniquely determines the Taylor approximation in terms of the derivatives of of order , and we can also recover all such derivatives from .
2.3. Example: a two-dimensional second order Taylor series
As an example, we can specialize the Faà di Bruno formula above to the second order Taylor series of a function and its behaviour under postcomposition with a smooth function :
where might either coincide or be distinct.
Rather than working with the full -Taylor representation of , we ignore the non-mixed second order derivatives and for the moment, and we represent the derivatives of order of (at some point ) as the numbers
and we can choose a similar representation for the derivatives of . Then, we observe that the Faà di Bruno formula induces the function
In particular, we can note that
We see that we can use this method to calculate any directional first and second order derivative of in one pass. For example, if , so , then the last component of is the result of taking the first derivative in direction and the second derivative in direction , and evaluating at .
In the proper Taylor representation we explicitly include the non-mixed second order derivatives as inputs and outputs, leading to a function . Above we have followed a common trick to avoid some unnecessary storage and computation, since these extra inputs and outputs are not required for computing the second order derivatives of . For instance, if then the last component of computes .
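As a hedged illustration of the restricted representation, the following Python sketch carries four numbers per quantity: the value, the two first order derivatives along two seed directions $s$ and $t$, and the mixed second derivative. The class `Jet22`, the helper `lift` and the example function are assumptions of this sketch; the update rules are the product rule and the second order chain rule, as in the Faà di Bruno formula.

```python
import math

class Jet22:
    """(value, d/ds, d/dt, d^2/(ds dt)) of a quantity depending on parameters s, t."""
    def __init__(self, v, ds=0.0, dt=0.0, dst=0.0):
        self.v, self.ds, self.dt, self.dst = v, ds, dt, dst

    def __add__(self, o):
        return Jet22(self.v + o.v, self.ds + o.ds, self.dt + o.dt, self.dst + o.dst)

    def __mul__(self, o):
        return Jet22(self.v * o.v,
                     self.ds * o.v + self.v * o.ds,
                     self.dt * o.v + self.v * o.dt,
                     self.dst * o.v + self.ds * o.dt + self.dt * o.ds + self.v * o.dst)

def lift(f, df, d2f):
    """Lift a unary smooth function, given its first and second derivatives."""
    def lifted(x):
        return Jet22(f(x.v),
                     df(x.v) * x.ds,
                     df(x.v) * x.dt,
                     d2f(x.v) * x.ds * x.dt + df(x.v) * x.dst)
    return lifted

sin = lift(math.sin, math.cos, lambda x: -math.sin(x))

def g(x1, x2):
    return sin(x1 * x2)

a1, a2 = 0.7, 1.3
# Seed x1 with direction s and x2 with direction t; all second order seeds are zero.
out = g(Jet22(a1, 1.0, 0.0, 0.0), Jet22(a2, 0.0, 1.0, 0.0))
# out.dst is the mixed partial derivative of g at (a1, a2).
```

With these seeds a single pass yields the value, both first order partial derivatives and the mixed second order partial derivative.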
2.4. Example: a one-dimensional second order Taylor series
As opposed to (2,2)-AD, (1,2)-AD computes the first and second order derivatives in the same direction. For example, if is a smooth function, then . An intuition for can be given in terms of triple numbers. The transformed function operates on triples of numbers, , and it is common to think of such a triple as for an ‘infinitesimal’ which has the property that . For instance we have
We see that we directly get non-mixed second-order partial derivatives but not the mixed-ones. We can recover as .
More generally, if , then satisfies:
We can always recover the mixed second order partial derivatives from this but this requires several computations involving . This is thus different from the (2,2) method which was more direct.
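The following Python sketch illustrates the one-directional second order method and the extra work needed to recover a mixed partial derivative from it. It stores raw derivatives along a curve (value, first derivative, second derivative) rather than Taylor coefficients; the names `Triple`, `second_directional` and `mixed_partial` are hypothetical, and the recovery step uses the standard polarization identity, which is one possible choice.

```python
import math

class Triple:
    """(x, x', x''): value, first and second derivative along one curve."""
    def __init__(self, v, d1=0.0, d2=0.0):
        self.v, self.d1, self.d2 = v, d1, d2

    def __add__(self, o):
        return Triple(self.v + o.v, self.d1 + o.d1, self.d2 + o.d2)

    def __mul__(self, o):
        # Leibniz rule up to second order.
        return Triple(self.v * o.v,
                      self.d1 * o.v + self.v * o.d1,
                      self.d2 * o.v + 2.0 * self.d1 * o.d1 + self.v * o.d2)

def sin(x):
    return Triple(math.sin(x.v),
                  math.cos(x.v) * x.d1,
                  -math.sin(x.v) * x.d1 ** 2 + math.cos(x.v) * x.d2)

def f(x1, x2):
    return sin(x1) * x2

def second_directional(fun, a, v):
    """Second derivative of t -> fun(a + t * v) at t = 0."""
    return fun(*[Triple(ai, vi, 0.0) for ai, vi in zip(a, v)]).d2

def mixed_partial(fun, a):
    """Recover d^2 fun / dx1 dx2 by polarization: three passes instead of one."""
    return 0.5 * (second_directional(fun, a, (1.0, 1.0))
                  - second_directional(fun, a, (1.0, 0.0))
                  - second_directional(fun, a, (0.0, 1.0)))

a = (0.4, 2.0)
m = mixed_partial(f, a)   # the mixed partial of sin(x1) * x2 is cos(x1)
```

The three passes in `mixed_partial` are the price for carrying only one direction, compared with the single pass of the (2,2) method.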
2.5. Remark
In the rest of this article, we study forward-mode $(k,R)$-automatic differentiation for a language with higher-order functions. The reader may like to fix $k = R = 1$ for a standard automatic differentiation with first-order derivatives, based on dual numbers. This is the approach taken in the conference version of this paper [HSV20b]. But the generalization to higher-order derivatives with arbitrary $k$ and $R$ flows straightforwardly through the whole narrative.
3. A Higher Order Forward-Mode AD Translation
3.1. A simple language of smooth functions.
We consider a standard higher order typed language with a first order type of real numbers. The types and terms are as follows.
The typing rules are in Figure 3. We have included some abstract basic -ary operations for every . These are intended to include the usual (smooth) mathematical operations that are used in programs to which automatic differentiation is applied. For example,
• for any real constant $c \in \mathbb{R}$, we typically include a constant $\underline{c}$; we slightly abuse notation and will simply write $c$ for $\underline{c}$ in our examples;
• we include some unary operations such as $\varsigma$, which we intend to stand for the usual sigmoid function, $x \mapsto \frac{1}{1+e^{-x}}$;
• we include some binary operations such as addition and multiplication $(+), (*)$.
We add some simple syntactic sugar and, for some natural number ,
Similarly, we will frequently denote repeated sums and products using $\Sigma$- and $\Pi$-signs, respectively: for example, we write as and as . This is in addition to programming sugar such as for and for .
3.2. Syntactic automatic differentiation: a functorial macro.
The aim of higher order forward mode AD is to find the -Taylor representation of a function by syntactic manipulations, for some choice of that we fix. For our simple language, we implement this as the following inductively defined macro on both types and terms (see also [WWE+19, SFVPJ19]). For the sake of legibility, we simply write as here and leave the dimension and order of the Taylor representation implicit. The following definition is for general and , but we treat specific cases afterwards in Example 3.2.
where | |||
Here, are some chosen terms of type in the language with free variables from . We think of these terms as implementing the partial derivative of the smooth function that implements. For example, we could choose the following representations of derivatives of order of our example operations
Note that our rules, in particular, imply that .
[- and -AD] Our choices of partial derivatives of the example operations are sufficient to implement -Taylor forward AD with . To be explicit, the distinctive formulas for - and -AD methods (specializing our abstract definition of above) are
where we informally write for the one-hot encoding of (the sequence of length consisting exclusively of zeros except in position where it has a ) and for the two-hot encoding of and (the sequence of length consisting exclusively of zeros except in positions and where it has a if and a if )
As noted in Section 2, it is often unnecessary to include all components of the (2,2)-algorithm, for example when computing a second order directional derivative. In that case, we may define a restricted (2,2)-AD algorithm that drops the non-mixed second order derivatives from the definitions above and defines
and
We extend to contexts: . This turns into a well-typed, functorial macro in the following sense.
Lemma 1 (Functorial macro).
If then .
If and
then
.
Proof 3.1.
By induction on the structure of typing derivations.
[Inner products] Let us write for the -fold product . Then, given we can define their inner product
To illustrate the calculation of , let us expand (and $\beta$-reduce) :
Let us also expand the calculation of :
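To give a feel for a structure preserving macro on syntax, here is a minimal Python sketch for the first order derivative case on a toy term language. Everything here is an assumption made for illustration: terms are encoded as tuples, the names `D`, `ev`, `PRIMS` and `PARTIALS` are hypothetical, and translated subterms are duplicated instead of being bound once, which is simple but inefficient.

```python
import math

PRIMS = {"add": lambda x, y: x + y,
         "mul": lambda x, y: x * y,
         "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x))}

def const(c): return ("const", c)
def op(name, *args): return ("op", name, list(args))

# Chosen terms for the partial derivatives of each operation.
PARTIALS = {
    "add": lambda x, y: [const(1.0), const(1.0)],
    "mul": lambda x, y: [y, x],
    "sigmoid": lambda x: [op("mul", op("sigmoid", x),
                             op("add", const(1.0),
                                op("mul", const(-1.0), op("sigmoid", x))))],
}

def D(t):
    """Translate a term into one that computes (value, derivative) pairs."""
    tag = t[0]
    if tag == "var":
        return t                      # variables now range over pairs
    if tag == "const":
        return ("pair", t, const(0.0))
    if tag == "lam":
        return ("lam", t[1], D(t[2]))  # homomorphic on functions
    if tag == "app":
        return ("app", D(t[1]), D(t[2]))
    if tag == "op":
        ds = [D(a) for a in t[2]]
        primals = [("fst", d) for d in ds]
        total = const(0.0)
        for partial, d in zip(PARTIALS[t[1]](*primals), ds):
            total = op("add", total, op("mul", partial, ("snd", d)))
        return ("pair", op(t[1], *primals), total)
    raise ValueError(tag)

def ev(t, env):
    """A small evaluator for source and translated terms."""
    tag = t[0]
    if tag == "var": return env[t[1]]
    if tag == "const": return t[1]
    if tag == "op": return PRIMS[t[1]](*[ev(a, env) for a in t[2]])
    if tag == "lam": return lambda v: ev(t[2], {**env, t[1]: v})
    if tag == "app": return ev(t[1], env)(ev(t[2], env))
    if tag == "pair": return (ev(t[1], env), ev(t[2], env))
    if tag == "fst": return ev(t[1], env)[0]
    if tag == "snd": return ev(t[1], env)[1]
    raise ValueError(tag)

# twice = \f. \z. f (f z), square = \y. y * y, term = twice square x = x^4
twice = ("lam", "f", ("lam", "z",
         ("app", ("var", "f"), ("app", ("var", "f"), ("var", "z")))))
square = ("lam", "y", op("mul", ("var", "y"), ("var", "y")))
term = ("app", ("app", twice, square), ("var", "x"))

value = ev(term, {"x": 3.0})
value_and_derivative = ev(D(term), {"x": (3.0, 1.0)})
```

Note that `D` never inspects or reduces the higher order function `twice`; it only changes what happens at constants and basic operations, which is the sense in which the translation is compositional.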
4. Semantics of differentiation
Consider for a moment the first order fragment of the language in Section 3, with only one type, $\mathbf{real}$, and no $\lambda$'s or pairs. This has a simple semantics in the category of cartesian spaces and smooth maps. Indeed, a term $x_1 : \mathbf{real}, \dots, x_n : \mathbf{real} \vdash t : \mathbf{real}$ has a natural reading as a function $[\![t]\!] : \mathbb{R}^n \to \mathbb{R}$ by interpreting our operation symbols by the well-known operations on $\mathbb{R}$ with the corresponding name. In fact, the functions that are definable in this first order fragment are smooth. Let us write $\mathbf{CartSp}$ for this category of cartesian spaces ($\mathbb{R}^n$ for some $n$) and smooth functions.
The category has cartesian products, and so we can also interpret product types, tupling and pattern matching, giving us a useful syntax for constructing functions into and out of products of . For example, the interpretation of in (1) becomes
where , and are the usual inner product, addition and the sigmoid function on , respectively.
Inside this category, we can straightforwardly study the first order language without $\lambda$'s, and automatic differentiation.
In fact, we can prove the following by plain induction on the syntax:
The interpretation of the (syntactic) forward AD of a first order term equals the usual (semantic) derivative of the interpretation of as a smooth function.
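However, as is well-known, the category of cartesian spaces does not support function spaces. To see this, notice that we have polynomial terms
$$x_1, \dots, x_d : \mathbf{real} \vdash \lambda y.\, \sum_{n=1}^{d} x_n y^n : \mathbf{real} \to \mathbf{real}$$
for each $d$, and so if we could interpret $\mathbf{real} \to \mathbf{real}$ as a Euclidean space $\mathbb{R}^p$ then, by interpreting these polynomial expressions, we would be able to find continuous injections $\mathbb{R}^d \to \mathbb{R}^p$ for every $d$, which is topologically impossible for any $p$, for example as a consequence of the Borsuk-Ulam theorem (see Appx. A).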
This lack of function spaces means that we cannot interpret the functions and from (1) in , as they are higher order functions, even though they are very useful and innocent building blocks for differential programming! Clearly, we could define neural nets such as (1) directly as smooth functions without any higher order subcomponents, though that would quickly become cumbersome for deep networks. A problematic consequence of the lack of a semantics for higher order differential programs is that we have no obvious way of establishing compositional semantic correctness of for the given implementation of (1).
We now show that every definable function is smooth, and then in Section 4.2 we show that the macro witnesses its derivatives.
4.1. Smoothness at higher types and diffeologies
The aim of this section is to introduce diffeological spaces as a semantic model for the simple language in Section 3. By way of motivation, we begin with a standard set theoretic semantics, where types are interpreted as follows
and a term is interpreted as a function , mapping a valuation of the context to a result.
We can show that the interpretation of a term is always a smooth function , even if it has higher order subterms. We begin with a fairly standard logical relations proof of this, and then move from this to the semantic model of diffeological spaces.
Proposition 2.
If then the function is smooth.
Proof 4.1.
For each type define a set by induction on the structure of types:
Now we show the fundamental lemma: if and then . This is shown by induction on the structure of typing derivations. The only interesting step here is that the basic operations (, , etc.) are smooth. We deduce the statement of the theorem by putting , , and letting be the projections.
At higher types, the logical relations show that we can only define functions that send smooth functions to smooth functions, meaning that we can never use them to build first order functions that are not smooth. For example, in (1) has this property.
This logical relations proof suggests building a semantic model by interpreting types as sets with structure: for each type we have a set $X$ together with a set of plots.
Definition 4.1 (Diffeological space). A diffeological space $(X, \mathcal{P}_X)$ consists of a set $X$ together with, for each $n$ and each open subset $U$ of $\mathbb{R}^n$, a set $\mathcal{P}_X^U \subseteq [U \to X]$ of functions, called plots, such that
• all constant functions are plots;
• if $f : V \to U$ is a smooth function and $p \in \mathcal{P}_X^U$, then $p \circ f \in \mathcal{P}_X^V$;
• if $(p_i \in \mathcal{P}_X^{U_i})_{i \in I}$ is a compatible family of plots (that is, $p_i(x) = p_j(x)$ whenever $x \in U_i \cap U_j$) and $(U_i)_{i \in I}$ covers $U$, then the gluing $p : U \to X$, given by $p(x) = p_i(x)$ for $x \in U_i$, is a plot.
We call a function $f : X \to Y$ between diffeological spaces smooth if, for all plots $p \in \mathcal{P}_X^U$, we have that $f \circ p \in \mathcal{P}_Y^U$. We write $\mathbf{Diff}(X, Y)$ for the set of smooth maps from $X$ to $Y$. Smooth functions compose, and so we have a category $\mathbf{Diff}$ of diffeological spaces and smooth functions.
A diffeological space is thus a set equipped with structure. Many constructions of sets carry over straightforwardly to diffeological spaces.
Example (Cartesian diffeologies). Each open subset $U$ of $\mathbb{R}^n$ can be given the structure of a diffeological space by taking all the smooth functions $V \to U$ as $\mathcal{P}_U^V$. Smooth functions from $V \to U$ in the traditional sense coincide with smooth functions in the sense of diffeological spaces [IZ13]. Thus diffeological spaces have a profound relationship with ordinary calculus.
In categorical terms, this gives a full embedding of the category of cartesian spaces in $\mathbf{Diff}$.
Example (Product diffeologies). Given a family $(X_i)_{i \in I}$ of diffeological spaces, we can equip the product $\prod_{i \in I} X_i$ of sets with the product diffeology in which $U$-plots are precisely the functions of the form $(p_i)_{i \in I}$ for $p_i \in \mathcal{P}_{X_i}^U$.
This gives us the categorical product in $\mathbf{Diff}$.
Example (Functional diffeology). We can equip the set $\mathbf{Diff}(X, Y)$ of smooth functions between diffeological spaces with the functional diffeology in which $U$-plots consist of functions $f : U \to \mathbf{Diff}(X, Y)$ such that $(u, x) \mapsto f(u)(x)$ is an element of $\mathbf{Diff}(U \times X, Y)$.
This specifies the categorical function object in $\mathbf{Diff}$.
We can now give a denotational semantics to our language from Section 3 in the category of diffeological spaces. We interpret each type as a set equipped with the relevant diffeology, by induction on the structure of types:
A context is interpreted as a diffeological space . Now well typed terms are interpreted as smooth functions , giving a meaning for for every valuation of the context. This is routinely defined by induction on the structure of typing derivations once we choose a smooth function to interpret each -ary operation . For example, constants are interpreted as constant functions; and the first order operations () are interpreted by composing with the corresponding functions, which are smooth: e.g., , where . Variables are interpreted as . The remaining constructs are interpreted as follows, and it is straightforward to show that smoothness is preserved.
The logical relations proof of Proposition 2 is reminiscent of diffeological spaces. We now briefly remark on the suitability of the axioms of diffeological spaces (Def 4.1) for a semantic model of smooth programs. The first axiom says that we only consider reflexive logical relations. From the perspective of the interpretation, it recognizes in particular that the semantics of an expression of type is defined by its value on smooth functions rather than arbitrary arguments. That is to say, the set-theoretic semantics at the beginning of this section, , is different to the diffeological semantics, . The second axiom for diffeological spaces ensures that the smooth maps in are exactly the plots in . The third axiom ensures that categories of manifolds fully embed into ; it will not play a visible role in this paper — in fact, [BCLG20] prove similar results for a simple language like ours by using plain logical relations (over ) and without demanding the diffeology axioms. However, we expect the third axiom to be crucial for programming with other smooth structures or partiality.
4.2. Correctness of AD
We have shown that a term is interpreted as a smooth function , even if involves higher order functions (like (1)). Moreover, the macro differentiation is a function (Proposition 1). This enables us to state a limited version of our main correctness theorem:
Theorem 3 (Semantic correctness of (limited)).
For any term
, the function
is the -Taylor representation (3) of
.
In detail: for any smooth functions
,
For instance, if , then .
Proof 4.2.
We prove this by logical relations. A categorical version of this proof is in Section 6.2.
For each type , we define a binary relation between (open) -dimensional plots in and (open) -dimensional plots in , i.e. , by induction on :
Then, we establish the following ‘fundamental lemma’:
If and, for all , and
are such that is in , then we have that
is in .
This is proved routinely by induction on the typing derivation of . The case for relies on the precise definition of .
We conclude the theorem from the fundamental lemma by considering the case where , and .
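The statement of Theorem 3 can be checked numerically in the simplest case of first derivatives along one direction (dual numbers). The following is a minimal Python sketch, not the paper's macro: the names `Dual`, `d_mul`, `d_sin`, `twice`, `program` and `derivative_at` are hypothetical, and the example program is chosen only because it passes through a higher order function before producing a real number.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A dual number: a primal value paired with a single tangent."""
    val: float
    tan: float


def d_mul(a: Dual, b: Dual) -> Dual:
    # Product rule on the tangent component.
    return Dual(a.val * b.val, a.val * b.tan + a.tan * b.val)


def d_sin(a: Dual) -> Dual:
    # Chain rule for sin on the tangent component.
    return Dual(math.sin(a.val), math.cos(a.val) * a.tan)


def twice(f):
    # A higher order function. Function types are translated
    # homomorphically, so its code does not change under the translation.
    return lambda x: f(f(x))


def program(x: Dual) -> Dual:
    # The term  twice (\y. y * sin y) x,  written against dual-number operations.
    return twice(lambda y: d_mul(y, d_sin(y)))(x)


def derivative_at(x0: float) -> float:
    # Seeding the tangent with 1 corresponds to precomposing with the
    # curve r -> x0 + r and reading off the derivative at r = 0.
    return program(Dual(x0, 1.0)).tan
```

Comparing `derivative_at(x0)` with the hand-computed derivative of the composite g(g(x)), where g(y) = y sin y, gives agreement up to floating point error, even though the intermediate value `twice(...)` is a function.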
5. Extending the language: variant and inductive types
In this section, we show that the definition of forward AD and the semantics generalize if we extend the language of Section 3 with variants and inductive types. As an example of inductive types, we consider lists. This specific choice is only for expository purposes and the whole development works at the level of generality of arbitrary algebraic data types generated as initial algebras of (polynomial) type constructors formed by finite products and variants. These types are easily interpreted in the category of diffeological spaces in much the same way. The categorically minded reader may regard this as a consequence of the category of diffeological spaces being a concrete Grothendieck quasitopos (e.g. [BH11]), and hence complete and cocomplete.
5.1. Language.
We additionally consider the following types and terms:
We extend the type system according to the rules of Fig. 4.
We can then extend (again, writing it as , for legibility) to our new types and terms by
To demonstrate the practical use of expressive type systems for differential programming, we consider the following two examples.
Example (Lists of inputs for neural nets). Usually, we run a neural network on a large data set, the size of which might be determined at runtime. To evaluate a neural network on multiple inputs, in practice, one often sums the outcomes. This can be coded in our extended language as follows. Suppose that we have a network that operates on single input vectors. We can construct one that operates on lists of inputs as follows:
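As an illustrative Python analogue of this list-lifting construction (not the paper's term), the sketch below folds a single-input network over a list and sums the outputs, then differentiates with respect to the parameter using dual numbers. The one-parameter "network" `net`, the class `Dual`, and the helpers `lift`, `net_on_list` and `grad_w` are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Dual:
    """A dual number: primal value and one tangent."""
    val: float
    tan: float

    def __add__(self, other: "Dual") -> "Dual":
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other: "Dual") -> "Dual":
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)


def lift(c: float) -> Dual:
    # Constants carry a zero tangent.
    return Dual(float(c), 0.0)


def net(w: Dual, x: Dual) -> Dual:
    # Hypothetical network on a single input: w * x * x + w.
    return w * x * x + w


def net_on_list(w: Dual, xs: list) -> Dual:
    # Fold over the list, summing the network outputs; the list length
    # is only known at runtime.
    return reduce(lambda acc, x: acc + net(w, x), xs, lift(0.0))


def grad_w(w0: float, data: list) -> float:
    # Seed the parameter with tangent 1 and the data with tangent 0.
    return net_on_list(Dual(w0, 1.0), [lift(x) for x in data]).tan
```

For this particular `net`, the derivative with respect to `w` is the sum of `x * x + 1` over the data, which is what `grad_w` computes regardless of the list length.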
Example (Missing data). In practically every application of statistics and machine learning, we face the problem of missing data: for some observations, only partial information is available.
In an expressive typed programming language like we consider, we can model missing data conveniently using the data type . In the context of a neural network, one might use it as follows. First, define some helper functions
Given a neural network , we can build a new one that operates on a data set for which some covariates (features) are missing, by passing in default values to replace the missing covariates:
Then, given a data set with missing covariates, we can perform automatic differentiation on this network to optimize, simultaneously, the ordinary network parameters and the default values for missing covariates .
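The idea of differentiating with respect to both the network parameters and the default values can be sketched in Python, using `None` as a stand-in for a missing covariate. This is an illustration under assumptions: `Dual`, `from_maybe`, `net` and `net_with_defaults` are hypothetical names, and the linear `net` is a placeholder for an actual network.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Dual:
    """A dual number: primal value and one tangent (default tangent 0)."""
    val: float
    tan: float = 0.0

    def __add__(self, other: "Dual") -> "Dual":
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other: "Dual") -> "Dual":
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)


def from_maybe(default: Dual, m: Optional[Dual]) -> Dual:
    # Replace a missing covariate by its (differentiable) default value.
    return default if m is None else m


def net(w: Dual, xs: List[Dual]) -> Dual:
    # Hypothetical network: w * (x_1 + ... + x_n).
    acc = Dual(0.0)
    for x in xs:
        acc = acc + w * x
    return acc


def net_with_defaults(w: Dual, defaults: List[Dual],
                      row: List[Optional[Dual]]) -> Dual:
    # Fill in missing covariates, then run the original network.
    filled = [from_maybe(d, m) for d, m in zip(defaults, row)]
    return net(w, filled)


ROW = [None, Dual(2.0)]  # first covariate missing, second observed

# One forward pass per input direction: seed exactly one tangent with 1.
d_dw = net_with_defaults(Dual(3.0, 1.0), [Dual(10.0), Dual(20.0)], ROW).tan
d_dd0 = net_with_defaults(Dual(3.0), [Dual(10.0, 1.0), Dual(20.0)], ROW).tan
d_dd1 = net_with_defaults(Dual(3.0), [Dual(10.0), Dual(20.0, 1.0)], ROW).tan
```

The default for the observed covariate has derivative zero, because the branch that would use it is never taken for this row.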
5.2. Semantics.
In Section 4 we gave a denotational semantics for the simple language in diffeological spaces. This extends to the language in this section, as follows. As before, each type is interpreted as a diffeological space, which is a set equipped with a family of plots:
• A variant type is inductively interpreted as the disjoint union of the semantic spaces, , with -plots
• A list type is interpreted as the union of the sets of length tuples for all natural numbers , with -plots
The constructors and destructors for variants and lists are interpreted as in the usual set theoretic semantics.
It is routine to show inductively that these interpretations are smooth. Thus every term in the extended language is interpreted as a smooth function between diffeological spaces. List objects as initial algebras are computed as usual in a cocomplete category (e.g. [JR11]). More generally, the interpretation for algebraic data types follows exactly the usual categorical semantics of variant types and inductive types (e.g. [Pit95]).
6. Categorical analysis of (higher order) forward AD and its correctness
This section has three parts. First, we give a categorical account of the functoriality of AD (Ex. 6.1). Then we introduce our gluing construction, and relate it to the correctness of AD (dgm. (4)). Finally, we state and prove a correctness theorem for all first order types by considering a category of manifolds (Th. 8).
6.1. Syntactic categories.
The key contribution of this subsection is that the AD macro translation (Section 3.2) has a canonical status as a unique functor between categories with structure. To this end, we build a syntactic category from our language, which has the property of a free category with certain structure. This means that for any category with this structure, there is a unique structure-preserving functor , which is an interpretation of our language in that category. Generally speaking, this is the categorical view of denotational semantics (e.g. [Pit95]). But in this particular setting, the category itself admits alternative forms of this structure, given by the dual numbers interpretation, the triple numbers interpretation etc. of Section 2. This gives canonical functors translating the language into itself, which are the AD macro translations (Section 3.2). A key point is that is almost entirely determined by universal properties (for example, cartesian closure for the function space); the only freedom is in the choice of interpretation of
(1) the real numbers , which can be taken as the plain type , or as the dual numbers interpretation etc.;
(2) the primitive operations , which can be taken as the operation itself, or as the derivative of the operation etc.
In more detail, our language induces a syntactic category as follows.
Definition. Let be the category whose objects are types, and where a morphism is a term in context modulo the -laws (Fig. 5). Composition is by substitution. For simplicity, we do not impose identities involving the primitive operations, such as the arithmetic identity in .
As is standard, this category has the following universal property.
Lemma 4 (e.g. [Pit95]).
For every bicartesian closed category with list objects, and every choice of an object and morphisms for all and , in , there is a unique functor respecting the interpretation and preserving the bicartesian closed structure as well as list objects.
Proof 6.1 (Proof notes).
The functor is a canonical denotational semantics for the language, interpreting types as objects of and terms as morphisms. For instance, , the function space in the category , and is the composite .
When , the denotational semantics of the language in diffeological spaces (Sections 4 and 5.2) can be understood as the unique structure preserving functor satisfying , and so on.
Example (Canonical definition of forward AD). The forward AD macro (Sections 3 and 5.1) arises as a canonical bicartesian closed functor on that preserves list objects. Consider the unique bicartesian closed functor that preserves list objects such that and
Then for any type , , and for any term , as morphisms in the syntactic category.
This observation is a categorical counterpart to Lemma 1.
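The idea that an interpretation is fixed once one chooses how to interpret the reals and the primitive operations can be sketched as a small Python interpreter parameterised by a "model". This is an interpreter rather than the source-to-source macro itself, and the term encoding (`Var`, `Const`, `Op`, `Lam`, `App`), the `Model` record and the two models `STANDARD` and `DUAL` are hypothetical names introduced for this sketch.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Op:
    name: str
    args: Tuple[Any, ...]


@dataclass(frozen=True)
class Lam:
    param: str
    body: Any


@dataclass(frozen=True)
class App:
    fn: Any
    arg: Any


@dataclass(frozen=True)
class Model:
    """The only choices: how to interpret constants and primitive operations."""
    const: Callable[[float], Any]
    ops: Dict[str, Callable[..., Any]]


def interpret(term: Any, env: Dict[str, Any], model: Model) -> Any:
    # Variables, abstraction and application are interpreted the same way
    # in every model; only Const and Op consult the model.
    if isinstance(term, Var):
        return env[term.name]
    if isinstance(term, Const):
        return model.const(term.value)
    if isinstance(term, Op):
        return model.ops[term.name](*(interpret(a, env, model) for a in term.args))
    if isinstance(term, Lam):
        return lambda v: interpret(term.body, {**env, term.param: v}, model)
    if isinstance(term, App):
        return interpret(term.fn, env, model)(interpret(term.arg, env, model))
    raise TypeError(f"unknown term: {term!r}")


STANDARD = Model(
    const=float,
    ops={"add": lambda a, b: a + b, "mul": lambda a, b: a * b, "sin": math.sin},
)

DUAL = Model(
    const=lambda c: (float(c), 0.0),
    ops={
        "add": lambda a, b: (a[0] + b[0], a[1] + b[1]),
        "mul": lambda a, b: (a[0] * b[0], a[0] * b[1] + a[1] * b[0]),
        "sin": lambda a: (math.sin(a[0]), math.cos(a[0]) * a[1]),
    },
)

# The open term  (\f. f (f x)) (\y. y * sin y)  with x free.
TWICE = Lam("f", App(Var("f"), App(Var("f"), Var("x"))))
G = Lam("y", Op("mul", (Var("y"), Op("sin", (Var("y"),)))))
TERM = App(TWICE, G)
```

Running `interpret(TERM, {"x": x0}, STANDARD)` gives the value of the program, while `interpret(TERM, {"x": (x0, 1.0)}, DUAL)` gives the value paired with its derivative; the clauses for variables, abstraction and application are shared between the two runs.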
6.2. Categorical gluing and logical relations.
Gluing is a method for building new categorical models which has been used for many purposes, including logical relations and realizability [MS92]. Our logical relations argument in the proof of Theorem 3 can be understood in this setting. (In fact we originally found the proof of Theorem 3 in this way.) In this subsection, for the categorically minded, we explain this, and in doing so we quickly recover a correctness result for the more general language in Section 5 and for arbitrary first order types.
The general, established idea of categorical logical relations starts from the observation that logical relations are defined by induction on the structure of types. Types have universal properties in a categorical semantics (e.g. cartesian closure for the function space), and so we can organize the logical relations argument by defining some category of relations and observing that it has the requisite categorical structure. The interpretation of types as relations can then be understood as coming from a unique structure preserving map . In this paper, our logical relations are not quite as simple as a binary relation on sets; rather they are relations between plots. Nonetheless, this still forms a category with the appropriate structure, which follows because it can still be regarded as arising from a gluing construction, as we now explain.
We define a category whose objects are triples where and are diffeological spaces and is a relation between their -dimensional plots. A morphism is a pair of smooth functions , , such that if then . The idea is that this is a semantic domain where we can simultaneously interpret the language and its automatic derivatives.
Proposition 5.
The category is bicartesian closed, has list objects, and the projection functor preserves this structure.
Proof 6.2 (Proof notes).
The category is a full subcategory of the comma category
.
The result thus follows by the general theory of categorical gluing (e.g. [JLS07, Lemma 15]).
We give a semantics for the language in , interpreting types as objects , and terms as morphisms. We let and , with the relation
We interpret the operations according to in , but according to the -Taylor representation of in . For instance, when and , is
At this point one checks that these interpretations are indeed morphisms in . This is equivalent to the statement that is the -Taylor representation of (3). The remaining constructions of the language are interpreted using the categorical structure of , following Lemma 4.
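As a worked instance, consider the simplest case of first derivatives along a single direction (dual numbers) and take the primitive operation to be multiplication. The symbols $x, x', y, y'$ and $f, g$ below are illustrative names and not the paper's notation. The derivative-tracking interpretation of multiplication is then the product rule:

```latex
\[
  \bigl((x, x'),\,(y, y')\bigr) \;\longmapsto\; \bigl(x \cdot y,\; x \cdot y' + x' \cdot y\bigr).
\]
% Morphism check: if the inputs are the pairs (f, f') and (g, g') for smooth
% f, g : \mathbb{R} \to \mathbb{R}, the output is again a function paired
% with its derivative.
\[
  \bigl(f \cdot g,\; f \cdot g' + f' \cdot g\bigr) \;=\; \bigl(f \cdot g,\; (f \cdot g)'\bigr).
\]
```

The second line is the check that related inputs are sent to related outputs, in this special case.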
Notice that the diagram below commutes. One can check this by hand or note that it follows from the initiality of (Lemma 4): all the functors preserve all the structure.
(4)
We thus arrive at a restatement of the correctness theorem (Th. 3), which holds even for the extended language with variants and lists, because for any , the interpretations are in the image of the projection , and hence is a -Taylor representation of .
6.3. Correctness at all first order types, via manifolds.
We now generalize Theorem 3 to hold at all first order types, not just the reals.
So far, we have shown that our macro translation (Section 3.2) gives correct derivatives to functions of the real numbers, even if other types are involved in the definitions of the functions (Theorem 3 and Section 6.2). We can state this formally because functions of the real numbers have well understood derivatives (Section 2). There are no established mathematical notions of derivatives at higher types, and so we cannot even begin to argue that our syntactic derivatives of functions match with some existing mathematical notion (see also Section 7).
However, for functions of first order type, like , there are established mathematical notions of derivative, because we can understand as the manifold of all tuples of reals, and then appeal to the well-known theory of manifolds and jet bundles. We do this now, to achieve a correctness theorem for all first order types (Theorem 8). The key high level points are that
• manifolds support a notion of differentiation, and an interpretation of all first order types, but not an interpretation of higher types;
• diffeological spaces support all types, including higher types, but not an established notion of differentiation in general;
• manifolds and smooth maps embed fully and faithfully in diffeological spaces, preserving the interpretation of first order types, so we can use the two notions together.
We now explain this development in more detail.
For our purposes, a smooth manifold is a second-countable Hausdorff topological space together with a smooth atlas. In more detail, a topological space is second-countable when there exists a countable collection of open subsets of such that any open subset of can be written as a union of elements of . A topological space is Hausdorff if for every two distinct points and , there exist disjoint open subsets of such that . A smooth atlas of a topological space is an open cover together with homeomorphisms (called charts, or local coordinates) such that is smooth on its domain of definition for all . A function between manifolds is smooth if is smooth for all charts and of and , respectively. Let us write for this category. This definition of manifolds is a slight generalisation of the more usual one from differential geometry because different charts in an atlas may have different finite dimensions . Thus we consider manifolds with dimensions that are potentially unbounded, albeit locally finite.
Each open subset of can be regarded as a manifold. This lets us regard the category of manifolds as a full subcategory of the category of diffeological spaces. We consider a manifold as a diffeological space with the same carrier set and where the plots , called the manifold diffeology, are the smooth functions in . A function is smooth in the sense of manifolds if and only if it is smooth in the sense of diffeological spaces [IZ13]. For the categorically minded reader, this means that we have a full embedding of into . Moreover, the natural interpretation of the first order fragment of our language in coincides with that in . That is, the embedding of into preserves finite products and countable coproducts (hence initial algebras of polynomial endofunctors).
Proposition 6.
Suppose that a type is first order, i.e. it is just built from reals, products, variants, and lists (or, again, arbitrary inductive types), and not function types. Then the diffeological space is a manifold.
Proof 6.3 (Proof notes).
This is proved by induction on the structure of types. In fact, one may show that every such is isomorphic to a manifold of the form where the bound is either finite or , but this isomorphism is typically not an identity function.
We recall how the Taylor representation of any morphism of manifolds is given by its action on jets [KSM99, Chapter IV]. For each point in a manifold , define the -jet space to be the set of equivalence classes of -dimensional plots in based at , where we identify iff all partial derivatives of order coincide in the sense that
for all smooth and all multi-indices .
In the case of , a -jet space is better known as a tangent space.
The -jet bundle (a.k.a. tangent bundle, in case ) of is the set . The charts of equip with a canonical manifold structure.
The (manifold) diffeology of these jet bundles can be concisely summarized by the plots
.
Then acts on smooth maps to give is defined as .
In local coordinates, this action is seen to coincide precisely with the
-Taylor representation of given by the Faà di Bruno formula [Mer04].
All told, the -jet bundle is a functor [KSM99].
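For a single variable and derivatives up to order two, the coincidence between composition of jets and the Faà di Bruno formula can be checked numerically. The Python sketch below propagates triples (value, first derivative, second derivative) through multiplication and sine. Storing raw derivatives rather than Taylor coefficients divided by factorials is an assumption of this sketch, and `Jet2`, `j_var`, `j_mul`, `j_sin` and `h` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Jet2:
    """Value, first derivative and second derivative along one direction."""
    v: float
    d1: float
    d2: float


def j_var(x0: float) -> Jet2:
    # The identity curve r -> x0 + r at r = 0.
    return Jet2(x0, 1.0, 0.0)


def j_mul(a: Jet2, b: Jet2) -> Jet2:
    # Leibniz rule up to order two.
    return Jet2(
        a.v * b.v,
        a.v * b.d1 + a.d1 * b.v,
        a.v * b.d2 + 2.0 * a.d1 * b.d1 + a.d2 * b.v,
    )


def j_sin(a: Jet2) -> Jet2:
    # Faa di Bruno at order two: (g . f)'' = g''(f) * f'^2 + g'(f) * f''.
    return Jet2(
        math.sin(a.v),
        math.cos(a.v) * a.d1,
        -math.sin(a.v) * a.d1 ** 2 + math.cos(a.v) * a.d2,
    )


def h(x: Jet2) -> Jet2:
    # The function x -> sin(x * x), written against jet operations.
    return j_sin(j_mul(x, x))
```

Evaluating `h(j_var(x0))` returns the value of sin(x²) together with 2x cos(x²) and 2 cos(x²) − 4x² sin(x²) at `x0`, matching the derivatives computed by hand.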
We can understand the jet bundle of a composite space in terms of that of its parts.
Lemma 7.
There are canonical isomorphisms and .
Proof 6.4 (Proof notes).
For disjoint unions, notice that smooth morphisms from into a disjoint union of manifolds always factor over a single inclusion, because is connected. For products, it is well-known that partial derivatives of a morphism are calculated component-wise [Lee13, ex. 3-2].
We define a canonical isomorphism for every type , by induction on the structure of types. We let be given by . For the other types, we use Lemma 7. We can now phrase correctness at all first order types.
Theorem 8 (Semantic correctness of (full)).
For any ground , any first order context and any term , the syntactic translation coincides with the -jet bundle functor, modulo these canonical isomorphisms:
Proof 6.5 (Proof notes).
For any -dimensional plot , let be the -jet curve, given by . First, we note that a smooth map is of the form for some if for all smooth we have . This generalizes (3). Second, for any first order type , . This is shown by induction on the structure of types. We conclude the theorem from diagram (4), by putting these two observations together.
7. Discussion: What are derivatives of higher order functions?
In our gluing categories of Section 6.2, we have avoided the question of what semantic derivatives should be associated with higher order functions. Our syntactic macro provides a specific derivative for every definable function, but in the model there is only a relation between plots and their corresponding Taylor representations, and this relation is not necessarily single-valued. Our approach has been rather indifferent about what “the” correct derivative of a higher order function should be. Instead, all we have cared about is that we are using “a” derivative that is correct in the sense that it can never be used to produce incorrect derivatives for first order functions, where we do have an unambiguous notion of correct derivative.
7.1. Automatic derivatives of higher order functions may not be unique!
For a concrete example to show that derivatives of higher order functions might not be unique in our framework, let us consider the case and focus on first derivatives of the evaluation function
Our macro will return . In this section we show that the lambda term is also a valid derivative of the evaluation map, where is defined by
This map is idempotent and it converts any map into the dual-numbers representation of its first component. For example, is the constantly function, where we write
According to our gluing semantics, a function defines a correct -Taylor representation of a function iff defines a morphism in . In particular, there is no guarantee that every has a unique correct -Taylor representation . (Although such Taylor representations are, in fact, unique when are first order types.) The gluing relation in relates curves in to “tangent curves” . In this relation, the function is related to at least two different tangent curves.
Lemma 9.
We have a smooth map
Proof 7.1.
Let and let . Then, also by definition of the exponential in . Therefore, we also have that , as we are working with infinitely differentiable smooth maps. Consequently,
by definition of the product in . It follows that .
Proposition 10.
We have that both and for
Proof 7.2.
By definition of , we need to show that for any , we have that . This means that we need to show that for
Unrolling further, this means we need to show that for any and such that for any (which means that and ), we have that
The latter part finally means that we need to show that
Now, focussing on : we need to show that
Inlining the definition of : we need to show that
This follows by assumption by choosing , and hence .
Focussing on : we need to show that
Inlining ’s definition: we need to show that
is equal to
That is, we need to show that for all , which holds by the assumption that by choosing (and hence ) and then specializing to .
Yet, as and . This shows that are both “valid” semantic derivatives of the evaluation function in our framework. In particular, it shows that semantic derivatives of higher order functions might not be unique. Our macro will return , but everything would still work just as well if it instead returned .
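The non-uniqueness can be illustrated numerically. The Python sketch below implements two candidate derivatives of evaluation on dual numbers: one applies the function argument directly, the other first passes it through an idempotent map that rebuilds a dual-number representation from the first component only. The name `normalize`, its exact definition, and the use of a central finite difference in place of an exact derivative are all assumptions of this sketch, based on a reading of the prose description above.

```python
import math


def d_sin(p):
    # Dual-number representation of sin: a genuine derivative pair.
    x, dx = p
    return (math.sin(x), math.cos(x) * dx)


def swap(p):
    # A smooth map on pairs that is NOT the dual representation of anything.
    x, dx = p
    return (dx, x)


def normalize(f, h=1e-6):
    # Keep only r -> first component of f(r, 0) and pair it with its own
    # derivative (approximated here by a central finite difference).
    def g(p):
        x, dx = p
        first = lambda r: f((r, 0.0))[0]
        slope = (first(x + h) - first(x - h)) / (2.0 * h)
        return (first(x), slope * dx)
    return g


def ev_macro(f, p):
    # Candidate 1: plain evaluation, as a homomorphic translation would give.
    return f(p)


def ev_alt(f, p):
    # Candidate 2: evaluation after normalising the function argument.
    return normalize(f)(p)
```

On `d_sin`, which really is a function paired with its derivative, the two candidates agree up to the finite-difference error. On `swap` they differ: `ev_macro(swap, (0.3, 2.0))` is `(2.0, 0.3)` while `ev_alt(swap, (0.3, 2.0))` is `(0.0, 0.0)`. Both candidates behave identically on the arguments that matter for correctness at first order types, yet they are different functions.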
7.2. Canonical derivatives of higher order functions?
Differential geometers and analysts have long pursued notions of a canonical derivative of various higher order functions arising, for example, in the calculus of variations and in the study of infinite dimensional Lie groups [KM97]. Such an uncontroversial notion of derivative exists on various (infinite dimensional) spaces of functions that form suitable (so-called convenient) vector spaces, or, manifolds locally modelled on such vector spaces. At the level of generality of diffeological spaces, however, various natural notions of derivative that coincide in convenient vector spaces start to diverge and it is no longer clear what the best definition of a derivative is [CW14]. Another, fundamentally different setting that defines canonical derivatives of many higher order functions is given by synthetic differential geometry [Koc06].
While derivatives of higher order functions are of deep interest and have rightly been studied in their own right in differential geometry, we believe the situation is subtly different in computer science:
(1) In programming applications, we use higher order programs only to construct the first order functions that we ultimately end up running and calculating derivatives of. Automatic differentiation methods can exploit this freedom: derivatives of higher order functions only matter in so far as they can be used to construct the correct derivatives of first order functions, so we can choose a simple and cheap notion of derivative among the valid options. As such, the fact that our semantics does not commit to a single notion of derivative of higher order functions can be seen as a feature rather than a bug that models the pragmatics of programming practice.
(2) While function spaces in differential geometry are typically infinite dimensional objects that are unsuitable for representation in the finite memory of a computer, higher order functions as used in programming are much more restricted: all they can do is call a function on finitely many arguments and analyse the function outputs. As such, function types in programming can be thought of as (locally) finite dimensional. In case a canonical notion of automatic derivative of higher order function is really desired, it may be worth pursuing a more intensional notion of semantics such as one based on game semantics. Such intensional techniques could capture the computational notion of higher order function better than our current (and other) extensional semantics using existing techniques from differential geometry. We hope that an exploration of such techniques might lead to an appropriate notion of computable derivative, even for higher order functions.
8. Discussion and future work
8.1. Summary
We have shown that diffeological spaces provide a denotational semantics for a higher order language with variants and inductive types (Sections 4 and 5). We have used this to show correctness of simple forward-mode AD translations for calculating higher derivatives (Theorem 3, Theorem 8).
The structure of our elementary correctness argument for Theorem 3 is a typical logical relations proof over a denotational semantics. As explained in Section 6, this can equivalently be understood as a denotational semantics in a new kind of space obtained by categorical gluing.
Overall, then, there are two logical relations at play. One is in diffeological spaces, which ensures that all definable functions are smooth. The other is in the correctness proof (equivalently in the categorical gluing), which explicitly tracks the derivative of each function, and tracks the syntactic AD even at higher types.
8.2. Connection to the state of the art in AD implementation
As is common in denotational semantics research, we have here focused on an idealized language and simple translations to illustrate the main aspects of the method. There are a number of points where our approach is simplistic compared to the advanced current practice, as we now explain.
8.2.1. Representation of vectors
In our examples we have treated -vectors as tuples of length . This style of programming does not scale to large . A better solution would be to use array types, following [SFVPJ19]. As demonstrated by [CJS20], our categorical semantics and correctness proofs straightforwardly extend to cover them, in a similar way to our treatment of lists. In fact, [CJS20] formalizes our correctness arguments in Coq and extends them to apply to the system of [SFVPJ19].
8.2.2. Efficient forward-mode AD
For AD to be useful, it must be fast. The -AD macro that we use is the basis of an efficient AD library [SFVPJ19]. Numerous optimizations are needed, ranging from algebraic manipulations, to partial evaluations, to the use of an optimizing C compiler, but the resulting implementation is performant in experiments [SFVPJ19]. The Coq formalization [CJS20] validates some of these manipulations using a similar semantics to ours. We believe the implementation in [SFVPJ19] can be extended to apply to the more general -AD methods we described in this paper through minor changes.
8.2.3. Reverse-mode and mixed-mode AD
While forward-mode AD methods are useful, many applications require reverse-mode AD, or even mixed-mode AD for efficiency. In [HSV20a], we described how our correctness proof applies to a continuation-based AD technique that closely resembles reverse-mode AD, but only has the correct complexity under a non-standard operational semantics [BMP20] (in particular, the linear factoring rule is crucial). It remains to be seen whether this technique and its correctness proof can be adapted to yield genuine reverse AD under a standard operational semantics.
Alternatively, by relying on a variation of our techniques, [Vák21] gives a correctness proof of a rather different -reverse AD algorithm that stores the (primal, adjoint)-vector pair as a struct-of-arrays rather than as an array-of-structs. Future work could explore extending its analysis to -reverse AD and mixed-mode AD for efficiently computing higher order derivatives.
8.2.4. Other language features
The idealized languages that we considered so far do not touch on several useful language constructs. For example: the use of functions that are partial (such as division) or partly-smooth (such as ReLU); phenomena such as iteration and recursion; and probabilities. Recent work by MV [Vák20] shows how our analysis of -AD extends to apply to partiality, iteration, and recursion. This development is orthogonal to the one in this paper: its methods combine directly with those in the present paper to analyze -forward mode AD of recursive programs. We leave the analysis of AD of probabilistic programs for future work.
Acknowledgment
We have benefited from discussing this work with many people, including M. Betancourt, B. Carpenter, O. Kammar, C. Mak, L. Ong, B. Pearlmutter, G. Plotkin, A. Shaikhha, J. Sigal, and others. In the course of this work, MV has also been employed at Oxford (EPSRC Project EP/M023974/1) and at Columbia in the Stan development team. This project has also received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 895827; a Royal Society University Research Fellowship; the ERC BLAST grant; the Air Force Office of Scientific Research under award number FA9550–21–1–0038; and a Facebook Research Award.
References
- [AAB+16] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
- [Ama12] Shun-ichi Amari. Differential-geometrical methods in statistics, volume 28. Springer Science & Business Media, 2012.
- [AP20] Martín Abadi and Gordon D Plotkin. A simple differentiable programming language. In Proc. POPL 2020. ACM, 2020.
- [BCLG20] Gilles Barthe, Raphaëlle Crubillé, Ugo Dal Lago, and Francesco Gavazzo. On the versatility of open logical relations: Continuity, automatic differentiation, and a containment theorem. In Proc. ESOP 2020. Springer, 2020. To appear.
- [Bet18] Michael Betancourt. A geometric theory of higher-order automatic differentiation. arXiv preprint arXiv:1812.11592, 2018.
- [BH11] John Baez and Alexander Hoffnung. Convenient categories of smooth spaces. Transactions of the American Mathematical Society, 363(11):5789–5825, 2011.
- [BJD19] Jesse Bettencourt, Matthew J Johnson, and David Duvenaud. Taylor-mode automatic differentiation for higher-order derivatives in JAX. 2019.
- [BML+20] Gilbert Bernstein, Michael Mara, Tzu-Mao Li, Dougal Maclaurin, and Jonathan Ragan-Kelley. Differentiating a tensor language. arXiv preprint arXiv:2008.11256, 2020.
- [BMP20] Alois Brunel, Damiano Mazza, and Michele Pagani. Backpropagation in the simply typed lambda-calculus with linear negation. In Proc. POPL 2020, 2020.
- [BS96] Claus Bendtsen and Ole Stauning. Fadbad, a flexible C++ package for automatic differentiation. Technical report, Technical Report IMM–REP–1996–17, Department of Mathematical Modelling, Technical University of Denmark, Lyngby, 1996.
- [BS97] Claus Bendtsen and Ole Stauning. Tadiff, a flexible C++ package for automatic differentiation. Technical report IMM-REP-1997-07, Department of Mathematical Modelling, Technical University of Denmark, Lyngby, 1997.
- [CCG+20] J. Robin B. Cockett, Geoff S. H. Cruttwell, Jonathan Gallagher, Jean-Simon Pacaud Lemay, Benjamin MacAdam, Gordon D. Plotkin, and Dorette Pronk. Reverse derivative categories. In Proc. CSL 2020, 2020.
- [CGM19] Geoff Cruttwell, Jonathan Gallagher, and Ben MacAdam. Towards formalizing and extending differential programming using tangent categories. In Proc. ACT 2019, 2019.
- [CHB+15] Bob Carpenter, Matthew D Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt. The Stan math library: Reverse-mode automatic differentiation in C++. arXiv preprint arXiv:1509.07164, 2015.
- [CJS20] Curtis Chin Jen Sem. Formalized correctness proofs of automatic differentiation in Coq. Master’s Thesis, Utrecht University, 2020. Thesis: https://dspace.library.uu.nl/handle/1874/400790. Coq code: https://github.com/crtschin/thesis.
- [CRBD18] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
- [CS96] G Constantine and T Savits. A multivariate Faa di Bruno formula with applications. Transactions of the American Mathematical Society, 348(2):503–520, 1996.
- [CS11] J Robin B Cockett and Robert AG Seely. The Faa di Bruno construction. Theory and Applications of Categories, 25(15):394–425, 2011.
- [CW14] J Daniel Christensen and Enxin Wu. Tangent spaces and tangent bundles for diffeological spaces. arXiv preprint arXiv:1411.5425, 2014.
- [DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- [Ell18] Conal Elliott. The simple essence of automatic differentiation. Proceedings of the ACM on Programming Languages, 2(ICFP):70, 2018.
- [EM03] L Hernández Encinas and J Munoz Masque. A short proof of the generalized Faà di Bruno’s formula. Applied Mathematics Letters, 16(6):975–979, 2003.
- [ER03] Thomas Ehrhard and Laurent Regnier. The differential lambda-calculus. Theoretical Computer Science, 309(1-3):1–41, 2003.
- [FJL18] Roy Frostig, Matthew James Johnson, and Chris Leary. Compiling machine learning programs via high-level tracing. Systems for Machine Learning, 2018.
- [FST19] Brendan Fong, David Spivak, and Rémy Tuyéras. Backprop as functor: A compositional perspective on supervised learning. In 2019 34th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pages 1–13. IEEE, 2019.
- [GSS00] Izrail Moiseevitch Gelfand and Richard A Silverman. Calculus of variations. Courier Corporation, 2000.
- [GUW00] Andreas Griewank, Jean Utke, and Andrea Walther. Evaluating higher derivative tensors by forward propagation of univariate Taylor series. Mathematics of Computation, 69(231):1117–1130, 2000.
- [HG14] Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
- [HSV20a] Mathieu Huot, Sam Staton, and Matthijs Vákár. Correctness of automatic differentiation via diffeologies and categorical gluing. In FoSSaCS, pages 319–338, 2020.
- [HSV20b] Mathieu Huot, Sam Staton, and Matthijs Vákár. Correctness of automatic differentiation via diffeologies and categorical gluing. Full version, 2020. arxiv:2001.02209.
- [IZ13] Patrick Iglesias-Zemmour. Diffeology. American Mathematical Soc., 2013.
- [JLS07] Peter T Johnstone, Stephen Lack, and P Sobocinski. Quasitoposes, quasiadhesive categories and Artin glueing. In Proc. CALCO 2007, 2007.
- [JR11] Bart Jacobs and JMMM Rutten. An introduction to (co)algebras and (co)induction. In Advanced Topics in Bisimulation and Coinduction, pages 38–99. CUP, 2011.
- [Kar01] Jerzy Karczmarczuk. Functional differentiation of computer programs. Higher-Order and Symbolic Computation, 14(1):35–57, 2001.
- [KB14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [KK04] Dana A Knoll and David E Keyes. Jacobian-free Newton–Krylov methods: a survey of approaches and applications. Journal of Computational Physics, 193(2):357–397, 2004.
- [KM97] Andreas Kriegl and Peter W Michor. The convenient setting of global analysis, volume 53. American Mathematical Soc., 1997.
- [Koc06] Anders Kock. Synthetic differential geometry, volume 333. Cambridge University Press, 2006.
- [KSM99] Ivan Kolár, Jan Slovák, and Peter W Michor. Natural operations in differential geometry. 1999.
- [KTR+17] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei. Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1):430–474, 2017.
- [KW+52] Jack Kiefer, Jacob Wolfowitz, et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
- [Lee13] John M Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–31. Springer, 2013.
- [LMG18] Sören Laue, Matthias Mitterreiter, and Joachim Giesen. Computing higher order derivatives of matrix and tensor expressions. Advances in Neural Information Processing Systems, 31:2750–2759, 2018.
- [LMG20] Sören Laue, Matthias Mitterreiter, and Joachim Giesen. A simple and efficient tensor calculus. In AAAI, pages 4527–4534, 2020.
- [LN89] Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.
- [LNV21] Fernando Lucatelli Nunes and Matthijs Vákár. CHAD for expressive total languages. arXiv e-prints, pages arXiv–2110, 2021.
- [LYRY20] Wonyeol Lee, Hangyeol Yu, Xavier Rival, and Hongseok Yang. On correctness of automatic differentiation for non-differentiable functions. In Advances in Neural Information Processing Systems, 2020.
- [Man12] Oleksandr Manzyuk. A simply typed $\lambda$-calculus of forward automatic differentiation. In Proc. MFPS 2012, 2012.
- [Mar10] James Martens. Deep learning via Hessian-free optimization. In ICML, volume 27, pages 735–742, 2010.
- [Mer04] Joel Merker. Four explicit formulas for the prolongations of an infinitesimal Lie symmetry and multivariate Faa di Bruno formulas. arXiv preprint math/0411650, 2004.
- [MO20] Carol Mak and Luke Ong. A differential-form pullback programming language for higher-order reverse-mode automatic differentiation. arxiv:2002.08241, 2020.
- [MP21] Damiano Mazza and Michele Pagani. Automatic differentiation in PCF. Proc. ACM Program. Lang., 5(POPL):1–27, 2021. doi:10.1145/3434309.
- [MS92] John C Mitchell and Andre Scedrov. Notes on sconing and relators. In International Workshop on Computer Science Logic, pages 352–378. Springer, 1992.
- [Nea11] Radford M Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo, chapter 5. Chapman & Hall / CRC Press, 2011.
- [PGC+17] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
- [Pit95] Andrew M Pitts. Categorical logic. Technical report, University of Cambridge, Computer Laboratory, 1995.
- [Plo18] Gordon D Plotkin. Some principles of differential programming languages. Invited talk, POPL 2018, 2018.
- [PS07] Barak A Pearlmutter and Jeffrey Mark Siskind. Lazy multivariate higher-order forward-mode AD. ACM SIGPLAN Notices, 42(1):155–160, 2007.
- [PS08] Barak A Pearlmutter and Jeffrey Mark Siskind. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Transactions on Programming Languages and Systems (TOPLAS), 30(2):7, 2008.
- [Qia99] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
- [RM51] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- [Sav06] Thomas H Savits. Some statistical applications of Faa di Bruno. Journal of Multivariate Analysis, 97(10):2131–2140, 2006.
- [SFVPJ19] Amir Shaikhha, Andrew Fitzgibbon, Dimitrios Vytiniotis, and Simon Peyton Jones. Efficient differentiable programming in a functional array-processing language. Proceedings of the ACM on Programming Languages, 3(ICFP):97, 2019.
- [SMC20] Benjamin Sherman, Jesse Michel, and Michael Carbin. $\lambda_S$: Computable semantics for differentiable programming with higher-order functions and datatypes. arXiv preprint arXiv:2007.08017, 2020.
- [Sou80] Jean-Marie Souriau. Groupes différentiels. In Differential geometrical methods in mathematical physics, pages 91–128. Springer, 1980.
- [Sta11] Andrew Stacey. Comparative smootheology. Theory Appl. Categ., 25(4):64–117, 2011.
- [Vák20] Matthijs Vákár. Denotational correctness of foward-mode automatic differentiation for iteration and recursion. arXiv preprint arXiv:2007.05282, 2020.
- [Vák21] Matthijs Vákár. Reverse AD at higher types: Pure, principled and denotationally correct. In ESOP, pages 607–634, 2021.
- [VMBBL18] Bart Van Merriënboer, Olivier Breuleux, Arnaud Bergeron, and Pascal Lamblin. Automatic differentiation in ML: Where we are and where we should be going. In Advances in Neural Information Processing Systems, pages 8757–8767, 2018.
- [VS21] Matthijs Vákár and Tom Smeding. CHAD: Combinatory homomorphic automatic differentiation. arXiv preprint arXiv:2103.15776, 2021.
- [WGP16] Mu Wang, Assefaw Gebremedhin, and Alex Pothen. Capitalizing on live variables: new algorithms for efficient Hessian computation via automatic differentiation. Mathematical Programming Computation, 8(4):393–433, 2016.
- [WWE+19] Fei Wang, Xilun Wu, Gregory Essertel, James Decker, and Tiark Rompf. Demystifying differentiable programming: Shift/reset the penultimate backpropagator. Proceedings of the ACM on Programming Languages, 3(ICFP), 2019.
- [ZHCW20] Shaopeng Zhu, Shih-Han Hung, Shouvanik Chakrabarti, and Xiaodi Wu. On the principles of differentiable quantum programming languages. In Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020, pages 272–285. ACM, 2020. doi:10.1145/3385412.3386011.
Appendix A $\mathbf{CartSp}$ and $\mathbf{Man}$ are not cartesian closed categories
Lemma 11.
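There is no continuous injection $\mathbb{R}^{d+1} \to \mathbb{R}^{d}$.
Proof A.1.
If there were, it would restrict to a continuous injection $S^{d} \to \mathbb{R}^{d}$. The Borsuk-Ulam theorem, however, tells us that every continuous $f : S^{d} \to \mathbb{R}^{d}$ has some $x \in S^{d}$ such that $f(x) = f(-x)$, which is a contradiction.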
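Let us define the terms:
$x_0 : \mathbf{real}, \ldots, x_n : \mathbf{real} \vdash t_n = \lambda y.\, x_0 + x_1 * y + \cdots + x_n * y^n : \mathbf{real} \to \mathbf{real}$
Assuming that $\mathbf{CartSp}$/$\mathbf{Man}$ is cartesian closed, observe that these get interpreted as injective continuous (because smooth) functions $\llbracket t_n \rrbracket : \mathbb{R}^{n+1} \to \llbracket \mathbf{real} \to \mathbf{real} \rrbracket$ in $\mathbf{CartSp}$ and $\mathbf{Man}$.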
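The injectivity claim can be illustrated with a minimal Python sketch. It assumes that the terms in question are the polynomial functions $\lambda y.\, x_0 + x_1 y + \cdots + x_n y^n$ in the free variables $x_0, \ldots, x_n$; the helper names `poly_term` and `separating_point` are hypothetical and chosen only for this illustration. Two distinct coefficient tuples of length $n+1$ give functions that differ at one of the $n+1$ sample points $0, 1, \ldots, n$, because a nonzero polynomial of degree at most $n$ has at most $n$ roots.

```python
from fractions import Fraction


def poly_term(coeffs):
    """Return the function y -> x_0 + x_1*y + ... + x_n*y^n.

    `coeffs` is the tuple (x_0, ..., x_n). This polynomial shape is an
    assumption made for illustration.
    """
    def f(y):
        result = Fraction(0)
        # Horner's scheme, starting from the highest-degree coefficient.
        for c in reversed(coeffs):
            result = result * y + c
        return result
    return f


def separating_point(coeffs_a, coeffs_b):
    """Return a sample point where the two functions differ, or None.

    Both tuples must have the same length n + 1. Probing the n + 1 points
    0, 1, ..., n is enough: the difference of the two polynomials has
    degree at most n, so it vanishes on all of them only if it is zero.
    """
    assert len(coeffs_a) == len(coeffs_b)
    f, g = poly_term(coeffs_a), poly_term(coeffs_b)
    for y in range(len(coeffs_a)):
        if f(y) != g(y):
            return y
    return None


if __name__ == "__main__":
    print(separating_point((1, 2, 3), (1, 2, 4)))  # a point where they differ
    print(separating_point((1, 2, 3), (1, 2, 3)))  # None: same coefficients
```

Exact rational arithmetic (`Fraction`) is used so that the comparison of function values is not affected by floating-point rounding. The sketch only checks injectivity pointwise; continuity of the interpretation is a separate matter that the text derives from smoothness.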
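Theorem 12.
$\mathbf{CartSp}$ is not cartesian closed.
Proof A.2.
In case $\mathbf{CartSp}$ were cartesian closed, we would have $\llbracket \mathbf{real} \to \mathbf{real} \rrbracket = \mathbb{R}^{n}$ for some $n$. Then, we would get, in particular, a continuous injection $\llbracket t_n \rrbracket : \mathbb{R}^{n+1} \to \mathbb{R}^{n}$, which contradicts Lemma 11.
Theorem 13.
$\mathbf{Man}$ is not cartesian closed.
Proof A.3.
Observe that we have $\iota_n : \mathbb{R}^{n+1} \to \mathbb{R}^{n+2}$, $\langle a_0, \ldots, a_n \rangle \mapsto \langle a_0, \ldots, a_n, 0 \rangle$; and that $\llbracket t_n \rrbracket = \llbracket t_{n+1} \rrbracket \circ \iota_n$. Let us write $A_n$ for the image of $\llbracket t_n \rrbracket$ and $A = \bigcup_{n \in \mathbb{N}} A_n$. Then, $A_n$ is connected because it is the continuous image of a connected set. Similarly, $A$ is connected because it is the non-disjoint union of connected sets. This means that $A$ lies in a single connected component of $\llbracket \mathbf{real} \to \mathbf{real} \rrbracket$, which is a manifold with some finite dimension, say $d$.
Take some $x \in \mathbb{R}^{d+1}$ (say, $0$), take some open $d$-ball $U$ around $\llbracket t_d \rrbracket(x)$, and take some open $(d+1)$-ball $V$ around $x$ in $\llbracket t_d \rrbracket^{-1}(U)$. Then, $\llbracket t_d \rrbracket$ restricts to a continuous injection from $V$ to $U$, or equivalently, $\mathbb{R}^{d+1}$ to $\mathbb{R}^{d}$, which contradicts Lemma 11.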