
Automatic Derivation of Statistical Data Analysis Algorithms: Planetary Nebulae and Beyond

AIP Conference Proceedings, 2004
Bernd Fischer*, Arsen Hajian†, Kevin Knuth** and Johann Schumann*
*RIACS / NASA Ames Research Center
†United States Naval Observatory
**NASA Ames Research Center

Abstract. AUTOBAYES is a fully automatic program synthesis system for the data analysis domain. Its input is a declarative problem description in the form of a statistical model; its output is documented and optimized C/C++ code. The synthesis process relies on the combination of three key techniques. Bayesian networks are used as a compact internal representation mechanism which enables problem decompositions and guides the algorithm derivation. Program schemas are used as independently composable building blocks for the algorithm construction; they can encapsulate advanced algorithms and data structures. A symbolic-algebraic system is used to find closed-form solutions for problems and emerging subproblems. In this paper, we describe the application of AUTOBAYES to the analysis of planetary nebulae images taken by the Hubble Space Telescope. We explain the system architecture, and present in detail the automatic derivation of the scientists' original analysis [1] as well as a refined analysis using clustering models. This study demonstrates that AUTOBAYES is now mature enough that it can be applied to realistic scientific data analysis tasks.

INTRODUCTION

Planetary nebulae are remnants of dying stars. Scientists try to understand their physics by collecting and analyzing data, for example images taken by the Hubble Space Telescope (HST). The analysis follows a general pattern in science: formulate an initial understanding of the underlying physical processes, formalize it as a statistical model, fit the model to the collected data, interpret the results, and refine the model as long as necessary. However, the large data volumes collected by modern instruments make computer support indispensable. Consequently, development and refinement of the necessary data analysis programs have become a bottleneck, and tapes of unanalyzed data often sit in warehouses, waiting for the software to be completed.

AUTOBAYES [2, 3] is a fully automatic program synthesis system for data analysis problems which increases the speed with which reliable data analysis software can be developed and thus eliminates this bottleneck. Its input is a declarative problem description in the form of a statistical model; its output is documented and optimized C/C++ code. Externally, it thus looks like a compiler; internally, however, it is quite different. AUTOBAYES first derives a customized algorithm implementing the model and then transforms it into code implementing the algorithm. The algorithm derivation or synthesis process—which distinguishes AUTOBAYES from traditional compilers—relies on the intricate interplay of three key techniques. (i) AUTOBAYES uses Bayesian networks (BNs) [4, 5] as a compact internal representation of the statistical models. BNs provide an efficient encoding of the joint probability distribution over all variables and thus enable replacing expensive probabilistic reasoning by faster graphical reasoning. In particular, they speed up the decomposition of a problem into statistically independent simpler subproblems. (ii) AUTOBAYES uses program schemas as the basic building blocks for the algorithm derivation. Schemas consist of a parameterized code fragment (i.e., template) and a set of constraints which are formulated as conditions on BNs. The templates can encapsulate advanced algorithms and data structures, which lifts the abstraction level of the algorithm derivation. The constraints allow the network structure to guide the application of the schemas, which prevents a combinatorial explosion of the search space. (iii) AUTOBAYES contains a specialized symbolic-algebraic subsystem which allows it to find closed-form solutions for many problems and emerging subproblems. The combination of these techniques results in fast turnaround times comparable to compilation times, supporting the iterative development style typical for the domain. AUTOBAYES thus enables scientists to think and to program in models instead of code.

In this paper, we take one typical scientific data analysis problem—the analysis of planetary nebulae images taken by the HST—and show that AUTOBAYES can be used to automate the implementation of the necessary programs. We initially follow the analysis by Knuth and Hajian [1] and use AUTOBAYES to derive code for the published models. We show how concisely these models can be specified and how the code is derived; in particular, we show how the interaction between graphical reasoning, code instantiation, and symbolic computation is crucial for a fully automatic code derivation. We then go beyond the original analysis and use AUTOBAYES to derive code for a simple mixture model which can be used as an image segmentation procedure, automating a previously manual preprocessing step. Finally, we show how AUTOBAYES derives customized code for modifications of the standard model, which yield a more detailed analysis. The main contribution of this paper is to demonstrate that AUTOBAYES has reached a level of maturity which makes it applicable to realistic scientific data analysis applications.
AUTOBAYES

AUTOBAYES is implemented in SWI-Prolog [6] and currently comprises about 75,000 lines of documented code. Figure 1 shows the system architecture; in the following paragraphs we explain the major components.

[FIGURE 1. AUTOBAYES system architecture. A statistical model (the figure shows an example mixture-model specification as input) is read by the input parser; the synthesis kernel, with its schema library, rewriting engine, and equation solver, produces intermediate code, which the optimizer and code generator turn into C/C++ code; a test-data generator is attached to the same kernel. The figure flanks the pipeline with example raw data and the corresponding fitted class densities.]

Statistical Models and Specification Language. A statistical model describes the expected properties of the data in a fully declarative fashion: for each problem variable of interest (i.e., observation or parameter), properties and dependencies are specified via probability distributions and constraints. Figure 2 shows how the Gaussian model discussed later in more detail is represented in AUTOBAYES's specification language. Line 1 identifies the model. Lines 2 and 4 introduce symbolic constants whose values are left unspecified but constrained by the where-clauses in lines 3 and 5, respectively. In general, constraints can be complex boolean formulae tying together multiple variables (cf. line 11). Lines 6–15 introduce the parameters, again constrained by where-clauses. Variables can be annotated with as-clauses; these textual annotations are propagated into the generated code to improve its legibility. Line 16 declares the observation (denoted by the data-modifier) as a matrix; its expected properties are specified in the distribution clause in line 17. Distributions can be either discrete (e.g., binomial, Poisson) or continuous (e.g., Gaussian) but have to be univariate; support for multivariate distributions is currently under development. Distributions are chosen from a predefined list; adding more distributions is straightforward and requires only the definitions of the density functions. The final line in the model is the task clause. It specifies the analysis problem the synthesized program has to solve. Since AUTOBAYES only supports parameter learning (i.e., the estimation of the parameter values best explaining the observed data, given a model) but not structure learning (i.e., the estimation of the best model itself), tasks take the form of maximizing a conditional probability with respect to a set of goal variables. In this case, the task is a maximum likelihood estimation because all goal variables occur to the right of the conditioning bar in the conditional probability and there are no priors on the parameters. However, AUTOBAYES can also solve maximum a posteriori (MAP) estimation problems.
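In standard statistical notation (our rendering, with θ abbreviating the set of goal variables), the two supported task forms are

$$\hat\theta_{\mathrm{ML}} = \operatorname*{arg\,max}_{\theta}\; pr(\mathrm{pixel} \mid \theta) \qquad\text{and}\qquad \hat\theta_{\mathrm{MAP}} = \operatorname*{arg\,max}_{\theta}\; pr(\mathrm{pixel} \mid \theta)\cdot pr(\theta),$$

where the MAP form requires priors pr(θ) to be declared in the specification.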
```
 1  model gauss as '2D Gauss-Model for Nebulae Analysis'.
    % Image size
 2  const nat nx as 'number of pixels, x-dimension'.
 3  where 0 < nx.
 4  const nat ny as 'number of pixels, y-dimension'.
 5  where 0 < ny.
    % Center; assume center is on the image
 6  double x0 as 'center position, x-dimension'.
 7  where 1 =< x0 && x0 =< nx.
 8  double y0 as 'center position, y-dimension'.
 9  where 1 =< y0 && y0 =< ny.
    % Extent; assume full nebula is on the image
10  double r as 'radius of the nebula'.
11  where 0 < r && r < nx/2 && r < ny/2.
    % Intensity; upper bound determined by instrument
12  double i0 as 'overall intensity of the nebula'.
13  where 0 < i0 && i0 =< 255.
    % Noise; upper bound arbitrary, for initialization
14  double sigma as 'noise'.
15  where 0 < sigma && sigma < 100000.
    % Data and Distribution
16  data double pixel(1..nx, 1..ny) as 'image'.
17  pixel(I,J) ~ gauss(i0*exp(-((I-x0)**2+(J-y0)**2)/(2*r**2)), sigma).
    % Task
18  max pr(pixel|{i0,x0,y0,r,sigma}) for {i0,x0,y0,r,sigma}.
```

FIGURE 2. Complete AUTOBAYES specification for the Gaussian model. Line numbers have been added for reference; comments start with a % and extend to the end of the line.

Bayesian Networks. A Bayesian network (cf. Figure 3) is a directed, acyclic graph whose nodes represent random variables and whose edges define probabilistic dependencies between the random variables. AUTOBAYES uses a variant of hybrid BNs to represent the statistical models internally. Here, nodes can represent discrete as well as continuous random variables; these are rendered as boxes and ellipses, respectively. However, in Figure 3 all variables are continuous. Shaded nodes represent known variables, i.e., input data. Shaded boxes enclosing a set of nodes represent plates [4], i.e., collections of independent, co-indexed random variables. Distribution information for the random variables is attached to the respective nodes. Here, pixel is a nx×ny matrix of independent and identically distributed (i.i.d.) Gaussian random variables with observed values. Bayesian networks combine probability theory and graph theory. They are a common representation method in machine learning because they provide an efficient encoding of the joint probability distribution over all variables and thus allow expensive probabilistic reasoning to be replaced by faster graphical reasoning [4, 5].

[FIGURE 3. Bayesian net for the Gaussian model, automatically extracted from the model specification and drawn using the dot graph layout package [7].]
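The efficiency gain rests on the standard network factorization [4, 5]: for variables x1, ..., xn with parent sets pa(xi), the joint distribution decomposes as

$$pr(x_1, \ldots, x_n) \;=\; \prod_{i=1}^{n} pr(x_i \mid \mathrm{pa}(x_i)),$$

so that independence questions can be answered by inspecting edges instead of manipulating distributions.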
Schemas and Schema Library. Program synthesis from first principles is notoriously difficult to scale up (cf. [8, 9]). AUTOBAYES thus follows a schema-based approach. A schema consists of a parameterized code fragment (i.e., template) and a set of constraints. The parameters are instantiated by AUTOBAYES, either directly or by calling itself recursively with a modified problem. The constraints determine whether a schema is applicable and how the parameters can be instantiated. Constraints are formulated as conditions on the Bayesian network or directly on the specified model; they include the task clause as a special case. This allows the network structure to guide the application of the schemas and thus to constrain combinatorial explosion of the search space, even if a large number of schemas is available. Schemas are implemented as Prolog clauses and search control is thus simply relegated to the Prolog interpreter: schemas are tried in their textual order. This simple approach has not caused problems so far, mainly because the domain admits a natural layering which can be used to organize the schema library. The top layer comprises network decomposition schemas which try to break down the network into independent subnets, based on independence theorems for Bayesian networks. These are domain-specific divide-and-conquer schemas: the emerging subnets are fed back into the synthesis process and the resulting programs are composed to achieve a program for the original problem. AUTOBAYES is thus able to automatically synthesize larger programs by composition of different schemas. The next layer comprises more localized decomposition schemas which work on products of i.i.d. variables. Their application is also guided by the network structure but they require more substantial symbolic computations. The core layer of the library contains statistical algorithm schemas, for example expectation maximization (EM) [10, 11] and k-Means (i.e., nearest neighbor clustering); these generate the skeleton of the program. The final layer contains standard numeric optimization methods, for example the Nelder-Mead simplex method or different conjugate gradient methods. These are applied after the statistical problem has been transformed into an ordinary numeric optimization problem and if AUTOBAYES failed to find a symbolic solution for the problem. Currently, the library consists of 28 top-level schemas, with a number of additional variations (e.g., different initializations).

Symbolic Subsystem. AUTOBAYES relies significantly on symbolic computations to support schema instantiation and code optimization. The core part of the symbolic subsystem implements symbolic-algebraic computations, similar to those in Mathematica [12]. It is based on the concept of term rewriting [13] and uses a small but reasonably efficient rewrite engine. Expression simplification and symbolic differentiation are implemented as sets of rewrite rules for this rewrite engine. The basic rules are straightforward; however, the presence of vectors and matrices introduces a few complications and requires a careful formalization. In addition, AUTOBAYES contains a rewrite system which implements a domain-specific refinement of the standard sign abstraction, where numbers are not only abstracted into pos and neg but also into small (i.e., |x| < 1) and large.
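The following C++ sketch illustrates this four-valued abstraction. It is our illustration only, with hypothetical names; AUTOBAYES itself implements the abstraction as Prolog rewrite rules.

```cpp
#include <cmath>

// Illustrative four-valued sign abstraction: numbers are abstracted
// into pos/neg and, orthogonally, into small (|x| < 1) or large.
enum class AbsVal { NegLarge, NegSmall, PosSmall, PosLarge };

AbsVal abstract_value(double x) {
    bool small = std::fabs(x) < 1.0;
    if (x < 0.0) return small ? AbsVal::NegSmall : AbsVal::NegLarge;
    return small ? AbsVal::PosSmall : AbsVal::PosLarge;
}

// One abstract inference such a refinement enables: the product of two
// small factors is again small, so powers x**n with |x| < 1 can be
// recognized as bounded during simplification.
bool is_small(AbsVal v) {
    return v == AbsVal::NegSmall || v == AbsVal::PosSmall;
}
bool product_is_small(AbsVal a, AbsVal b) {
    return is_small(a) && is_small(b);
}
```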
The basic rules are straightforward; however, the presence of vectors and matrices introduce a few complications and require a careful formalization. In addition, AUTO BAYES contains a rewrite system which implements a domain-specific refinement of the standard sign abstraction where numbers are not only abstracted into pos and neg but also into small (i.e., | x | < 1) and large. AUTO BAYES then uses a relatively simple symbolic equation solver built on top of these rewrite systems. This handles only low-order polynomials (i.e., linear, quadratic, and simple cubic). However, it also shifts and normalizes exponents, recognizes multiple FIGURE 4. Planetary nebula IC 418 or Spirograph Nebula (a) Composite false-color image taken by the HST (Sahai et al., NASA and The Hubble Heritage Team). The different colors (resp. gray-scales) indicate the different chemicals prevalent in the different regions of the nebula; the origin of the visible texture is still unknown. The central white dwarf is discernible as a white dot in the center of the nebula. (b) Manually masked original image. (c) Original image with estimated Gaussian model parameters superimposed. (d) Sample data generated using estimated Gaussian model parameters. roots and bi-quadratic forms, and tries to find polynomial factors, and handles expressions in x and (1 − x) which are common in statistical applications. A smaller part of the symbolic subsystem implements the graphical reasoning routines necessary for Bayesian networks with plates, for example, computing the parents, children, or Markov blanket [5] of a node. Backend. The code constructed by schema instantiation and composition is represented in an imperative intermediate language. This is essentially a “sanitized” subset of C (e.g., no pointers), which is extended by a number of domain-specific constructs like vector/matrix operations, finite sums, and convergence-loops. Since straightforward schema application can produce suboptimal code, AUTO BAYES interleaves synthesis and code optimization (cf. [14] for an overview of advanced optimization techniques). Schemas can explicitly trigger aggressive large-scale optimizations like code motion, common sub-expression elimination, and memoization which can take advantage of information from the synthesis process. Traditional low-level optimizations like constant propagation or loop fusion, however, are left to the compiler. In a final step, AUTO BAYES translates the intermediate code into code tailored for a specific run-time environment. Currently, AUTO BAYES includes code generators for the Octave and Matlab environments; it can also produce stand-alone C and Modula-2 code. Each code generator is implemented as a rewrite system which eliminates the intermediate language constructs not supported by the target environment; most rules are shared between the different code generators. Test-Data Generation. Statistical models also allow to generate problem-specific test data (i.e., sampling), if the the edges of the extracted network are followed forward from the sources and random numbers are generated along the way. AUTO BAYES uses this interpretation to generate model-specific sampling programs which can be used to validate the models and solutions. PLANETARY NEBULAE Stars with initial masses between roughly 0.8 and 8 solar masses turn into swollen red giants when they run out of hydrogen to support their primary fusion process. 
PLANETARY NEBULAE

Stars with initial masses between roughly 0.8 and 8 solar masses turn into swollen red giants when they run out of hydrogen to support their primary fusion process. In a secondary fusion process, these giants then burn the helium produced by the hydrogen fusion, resulting in a carbon-oxygen core roughly the size of the earth. Eventually, the secondary fusion runs out of fuel as well and the red giants begin to collapse into extremely hot white dwarfs. During this collapse, most of the material is expelled, forming blown-out gaseous shells which are called planetary nebulae. The shells continue to expand and after 10,000 to 50,000 years their density becomes too small for the nebulae to be visible. Figure 4(a) shows an image of the planetary nebula IC 418. Planetary nebulae occupy an important position in the stellar life cycle and are the major sources of interstellar carbon and oxygen, but their physics and dynamics are not yet well understood. The characterization and analysis of their properties is thus an important task in astronomy.

[FIGURE 4. Planetary nebula IC 418, the Spirograph Nebula. (a) Composite false-color image taken by the HST (Sahai et al., NASA and The Hubble Heritage Team). The different colors (resp. gray-scales) indicate the different chemicals prevalent in the different regions of the nebula; the origin of the visible texture is still unknown. The central white dwarf is discernible as a white dot in the center of the nebula. (b) Manually masked original image. (c) Original image with estimated Gaussian model parameters superimposed. (d) Sample data generated using estimated Gaussian model parameters.]

A HIERARCHICAL SET OF MODELS

Knuth and Hajian [1] present a hierarchical set of three models they use to analyze images of the planetary nebula IC 418 (cf. Figure 4(a)). Each model estimates a parameter set which is then refined by the subsequent models. Their common idea is (i) that the light intensity which is expected at a given pixel position (x, y) on the image can be described by a function F of this position, the (unknown) nebula center (x0, y0), and some additional parameters, and (ii) that the measured intensities can be fitted against F using a maximum likelihood estimation of the parameters, which results in a simple mean square error minimization due to the Gaussian likelihood. The only difference between the models is the form of F. Here we sketch how these models are represented in AUTOBAYES's specification language, and how the code is derived. In particular, we show how the interaction between graphical reasoning, code instantiation, and symbolic computation is crucial for a fully automatic code derivation.

Gaussian Model

The simplest model describes the nebula as a blurred two-dimensional Gaussian cloud. The function F thus describes the shape of a bell whose apex is at (x0, y0), which can be formalized by a two-dimensional Gaussian curve:

$$F(x, y) = i_0 \cdot e^{-\frac{(x_0 - x)^2 + (y_0 - y)^2}{2r^2}} \qquad (1)$$

The additional parameters i0 and r capture the overall intensity and extent of the nebula (i.e., the height and diameter of the bell).
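Equation (1) enters the model as the mean of the pixel distribution; in statistical notation (our rendering), the specification of Figure 2 asserts

$$\mathrm{pixel}(i, j) \sim \mathcal{N}\!\left(F(i, j),\, \sigma^2\right) \quad \text{for } 1 \le i \le nx,\; 1 \le j \le ny.$$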
Model Specification. The AUTOBAYES specification shown in Figure 2 is a direct transcription of the underlying mathematics. The distribution clause for the image pixels in line 17 formalizes the idea that the expected value of the pixel (i, j) can be described by the function F(x, y) from Equation (1), since the expected value of a Gaussian random variable is given by the mean of the distribution. The variance of the distribution represents the error of the fit. In combination with the Gaussian distribution, the task clause in line 18 thus specifies a mean square error minimization. The constraints formalize additional assumptions on the structure of the image or the output of the instrument (cf. line 13).

Mathematical Derivation. For the Gaussian model, the program derivation can neatly be separated into two phases such that the first phase is a purely mathematical derivation and the second phase only instantiates code templates. In general, however, this is not the case and symbolic computation and template instantiation are interleaved.

The first step is to unfold the entire pixel-matrix element by element, using a decomposition schema based on the conditionalized version of the general product rule for probabilities:

$$pr(\mathrm{pixel} \mid i_0, x_0, y_0, r, \sigma) = \prod_{i=1}^{nx} \prod_{j=1}^{ny} pr(\mathrm{pixel}(i, j) \mid i_0, x_0, y_0, r, \sigma)$$

The precondition for this step is that the pixels are pairwise independent, given the remaining variables, i.e., that

$$pr(\mathrm{pixel}(i, j) \mid \mathrm{pixel}(i', j'), i_0, x_0, y_0, r, \sigma) = pr(\mathrm{pixel}(i, j) \mid i_0, x_0, y_0, r, \sigma)$$

holds for all i, j, i', j' such that i ≠ i' or j ≠ j'. Instead of proving this from first principles, using the distribution information given in the model specification, AUTOBAYES can easily check it on the Bayesian network: there is no edge going from the pixel-node into itself and, hence, by definition of Bayesian networks, the pixels are pairwise independent. The probability is now elementary in the sense that on the left of the conditioning bar we have only a single variable pixel(i, j) which depends exactly on all the variables to the right of the conditioning bar. Hence, the probability expression can be replaced by the distribution function. AUTOBAYES's domain theory contains rewrite rules for the most common probability density functions; additional density functions can easily be added. This rewrite yields the likelihood function

$$\prod_{i=1}^{nx} \prod_{j=1}^{ny} \frac{1}{\sqrt{2\pi}\,\sigma} \cdot \exp\left(-\frac{\left(\mathrm{pixel}(i, j) - i_0 \cdot e^{-\frac{(x_0 - i)^2 + (y_0 - j)^2}{2r^2}}\right)^2}{2\sigma^2}\right)$$

which must be maximized w.r.t. the goal variables i0, x0, y0, r, and σ. In general it is easier to work with the log-likelihood function, which yields the same solutions during maximization since the logarithm is strictly monotone. After simplification using the symbolic subsystem, AUTOBAYES thus derives the following log-likelihood function:

$$L = -\frac{nx \cdot ny}{2} \cdot \log(2\pi) - nx \cdot ny \cdot \log(\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^{nx} \sum_{j=1}^{ny} \left(\mathrm{pixel}(i, j) - i_0 \cdot e^{-\frac{(i - x_0)^2 + (j - y_0)^2}{2r^2}}\right)^2$$

A solution can now be attempted in two different ways, numerically or symbolically. By default, AUTOBAYES tries to find symbolic solutions first. In this case, it computes the partial differentials

$$\frac{\partial L}{\partial i_0} = \frac{1}{\sigma^2} \sum_{i=1}^{nx} \sum_{j=1}^{ny} \mathrm{pixel}(i, j) \cdot e^{-\frac{(i - x_0)^2 + (j - y_0)^2}{2r^2}} - \frac{i_0}{\sigma^2} \sum_{i=1}^{nx} \sum_{j=1}^{ny} e^{-\frac{(i - x_0)^2 + (j - y_0)^2}{r^2}}$$

$$\frac{\partial L}{\partial \sigma} = \frac{1}{\sigma^3} \sum_{i=1}^{nx} \sum_{j=1}^{ny} \left(\mathrm{pixel}(i, j) - i_0 \cdot e^{-\frac{(i - x_0)^2 + (j - y_0)^2}{2r^2}}\right)^2 - \frac{nx \cdot ny}{\sigma}$$

and solves the equations ∂L/∂i0 = 0 and ∂L/∂σ = 0, which are essentially simple polynomials in i0 and σ. These can easily be handled by the built-in equation solver; however, attempts to solve for the remaining three variables x0, y0, and r fail.
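Writing $g(i, j) = e^{-((i - x_0)^2 + (j - y_0)^2)/(2r^2)}$ for the exponential factor, setting the two differentials to zero yields closed-form solutions of the following form (our explicit rendering of what the equation solver produces):

$$i_0 = \frac{\sum_{i,j} \mathrm{pixel}(i, j) \cdot g(i, j)}{\sum_{i,j} g(i, j)^2}, \qquad \sigma^2 = \frac{1}{nx \cdot ny} \sum_{i,j} \left(\mathrm{pixel}(i, j) - i_0 \cdot g(i, j)\right)^2$$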
Code Derivation. At this point, the symbolic computations have been exhausted without leading to a complete symbolic (i.e., closed-form) solution. AUTOBAYES thus derives code for a numeric solution, incorporating the computed partial symbolic solution. This is done in two steps, which correspond to schemas in the schema library.

In the first step, AUTOBAYES converts the symbolic solutions into assignment statements. Then it identifies their order and position relative to the remaining code which is still to be synthesized. Since both solutions contain at least one (in fact all) of the remaining variables, they must follow that code. Similarly, since the solution for σ contains i0, its assignment must in turn follow that of i0. AUTOBAYES then eliminates both variables from the formula by applying the substitution corresponding to the solution. Variables whose solutions do not depend on any unsolved variable can be considered as symbolic constants and need not be eliminated; their corresponding assignments must precede the missing code block. Since this reasoning is done on the expression level, and expression evaluation is free of side effects, a dataflow analysis is not required.

In the second step, AUTOBAYES instantiates a numeric optimization routine, in this case the Fletcher-Reeves conjugate gradient method. The schema's template actually contains a wrapper to the implementation provided by the GNU Scientific Library (GSL) [15]. However, the schema itself contains not just boilerplate code but also constructs specific initialization code and a number of auxiliary functions to evaluate the goal function and the derivatives. AUTOBAYES contains different heuristics to derive initialization code from specification information; here, the initial values are taken as the midpoints of the specified ranges. AUTOBAYES also generates the auxiliary functions; since a straightforward translation from the goal expression would be prohibitively inefficient, AUTOBAYES aggressively optimizes the auxiliary functions. The optimizations applied here include common subexpression elimination, memoization (i.e., caching of expressions which depend on int-variables), and code motion. They are applied both intra- and inter-procedurally, although the latter is restricted to the generated auxiliary functions. The optimizations can also take into account locally constant variables, since AUTOBAYES knows the set of goal variables. Again, a dataflow analysis is not required since the reasoning is done on the expression level.
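To give a flavor of such a wrapper, a minimal use of the GSL's Fletcher-Reeves minimizer is sketched below. This is our sketch under stated assumptions, not AUTOBAYES output; the neg_log_likelihood* functions stand in for the generated auxiliary functions and are assumed, not part of the GSL.

```cpp
#include <gsl/gsl_errno.h>
#include <gsl/gsl_multimin.h>
#include <gsl/gsl_vector.h>

// Hypothetical generated functions evaluating the negative
// log-likelihood and its gradient w.r.t. (i0, x0, y0, r, sigma).
double neg_log_likelihood(const gsl_vector* v, void* params);
void   neg_log_likelihood_df(const gsl_vector* v, void* params, gsl_vector* df);
void   neg_log_likelihood_fdf(const gsl_vector* v, void* params,
                              double* f, gsl_vector* df);

void fit(void* image_data, gsl_vector* x /* initial point, dimension 5 */) {
    gsl_multimin_function_fdf F;
    F.n      = 5;                       // five goal variables
    F.f      = &neg_log_likelihood;
    F.df     = &neg_log_likelihood_df;
    F.fdf    = &neg_log_likelihood_fdf;
    F.params = image_data;

    // Fletcher-Reeves conjugate gradient minimizer from the GSL.
    const gsl_multimin_fdfminimizer_type* T =
        gsl_multimin_fdfminimizer_conjugate_fr;
    gsl_multimin_fdfminimizer* s = gsl_multimin_fdfminimizer_alloc(T, 5);
    gsl_multimin_fdfminimizer_set(s, &F, x, 0.01, 1e-4);

    int status;
    do {                                // iterate until the gradient is small
        status = gsl_multimin_fdfminimizer_iterate(s);
        if (status) break;
        status = gsl_multimin_test_gradient(s->gradient, 1e-6);
    } while (status == GSL_CONTINUE);

    gsl_vector_memcpy(x, s->x);         // return the estimate
    gsl_multimin_fdfminimizer_free(s);
}
```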
Program Results. We have applied the generated program to the manually masked image of IC 418 shown in the second panel of Figure 4. The third panel shows the results. The program roughly approximates the center but its estimate of the overall extent is predictably off the mark. The last panel shows sample data generated using the estimated Gaussian model parameters.

Sigmoidal Models

The Gaussian model hard-codes a number of assumptions about the structure of the image, in particular that it is circular, with a pronounced intensity peak and a gradual intensity falloff at the edges. However, a quick look at Figure 4 shows that the image is clearly elliptic, with a broad intensity plateau and a pronounced falloff at the edges.

Simple Sigmoidal Model. Knuth and Hajian thus refine their initial model and replace the two-dimensional Gaussian by a two-dimensional sigmoidal function of the form

$$F(x, y) = i_0 \cdot \left[1 - \frac{1}{1 + e^{-a \left(\sqrt{r(x, y)} - 1\right)}}\right] \qquad (2)$$

with the auxiliary function

$$r(x, y) = c_{xx} \cdot (x_0 - x)^2 + 2 c_{xy} \cdot (x_0 - x)(y_0 - y) + c_{yy} \cdot (y_0 - y)^2$$

and constants

$$c_{xx} = \frac{\cos^2\theta}{r_x^2} + \frac{\sin^2\theta}{r_y^2}, \qquad c_{yy} = \frac{\sin^2\theta}{r_x^2} + \frac{\cos^2\theta}{r_y^2}, \qquad c_{xy} = \sin\theta \cdot \cos\theta \cdot \left(\frac{1}{r_x^2} - \frac{1}{r_y^2}\right)$$

where rx and ry are the extent of the nebula along the major and minor axis, respectively, θ is its orientation, and a the intensity falloff. The specification for this modified model can easily be derived from the one for the Gaussian model shown in Figure 2, essentially by replacing the mean value in the distribution clause (cf. line 17) with the new version of F and adding declarations for the new model variables. The auxiliary constants and functions are represented as deterministic nodes in the Bayesian network, which are expanded like C-style macro definitions during the program derivation.

The program for this model is then derived using the same steps as before; the only difference is that the symbolic expressions become much more complicated.

Axis-aligned Sigmoidal Model. The derivation and the resulting program can be simplified and sped up if the nebula image is assumed to be axis-aligned. This can be modeled by changing the random variable θ into a constant with known value, i.e.,

const double theta := 0 as 'orientation'.

AUTOBAYES can then propagate this constant value already on the specification level and derive code from the simplified model.

Dual Sigmoidal Model. In a final refinement step, Knuth and Hajian try to estimate the thickness of the shell as well. Since projecting the three-dimensional ellipsoidal shell of gas onto a two-dimensional image produces an ellipsoidal blob surrounded by a ring of higher intensity, the image can be modeled as the difference of two sigmoidal functions with the same center and orientation but different extents, intensities, and falloffs. Re-using the auxiliary definitions from the simple sigmoidal model, this refinement can also be specified easily for AUTOBAYES.
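In the notation of Equation (2), the dual model's mean function can be written as (our rendering of the description above)

$$F_{\mathrm{dual}}(x, y) = F_{i_1, a_1, r_{x,1}, r_{y,1}}(x, y) - F_{i_2, a_2, r_{x,2}, r_{y,2}}(x, y),$$

where both instances share the center (x0, y0) and orientation θ, but carry their own intensities, extents, and falloffs.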
IMAGE SEGMENTATION MODELS

In their original analysis, Knuth and Hajian manually masked the central star and the diffraction spikes (cf. Figure 4(b)). This prevents their analysis from misinterpreting the comparatively bright star as the center of the nebula. A simple segmentation of the image into different conceptual classes (e.g., central star, nebula, and background) can replace the generic mask. In this section we show how AUTOBAYES can be used to derive code for this and further refined segmentation models.

```
 1  model segment as 'Image segmentation via Clustering'.
    ... (see Figure 2) ...
    % Class parameters and relative frequencies
 6  const nat n_classes as 'number of classes'.
 7  where 0 < n_classes.
 8  double mu(1..n_classes), sigma(1..n_classes).
 9  where 0 < sigma(_).
10  double phi(1..n_classes).
11  where sum(I := 1..n_classes, phi(I)) = 1.
    % Classes and Distribution
12  output nat c(1..nx, 1..ny) as 'class'.
13  where 1 =< c(_,_) && c(_,_) =< n_classes.
14  c(_,_) ~ discrete(phi).
    % Data and Distribution
15  data double pixel(1..nx, 1..ny) as 'image'.
16  pixel(I,J) ~ gauss(mu(c(I,J)), sigma(c(I,J))).
    % Task
17  max pr(pixel|{phi,mu,sigma}) for {phi,mu,sigma}.
```

FIGURE 5. AUTOBAYES specification for the image segmentation model.

Segmentation via Clustering. The simplest segmentation model is the straightforward Gaussian mixture model shown in Figure 5. At its core is the latent variable c (cf. lines 12–14) which represents the unknown class each pixel belongs to and which determines the mean and variance (cf. line 16); its values are independently drawn from a discrete distribution with relative frequencies φ, i.e., pr(c_ij = k) = φ_k. The constraint on the probability vector φ (cf. line 11) is required to make this model well-formed. Since we need the class assignments for the segmentation, c is declared as output. Note that the model assumes that the number of classes is known. AUTOBAYES solves this model by an application of the EM-schema, which is triggered by the latent variable structure of the corresponding Bayesian network. After instantiation of the EM-schema, all emerging subproblems (for the E- and M-step) can eventually be solved symbolically (for the detailed derivation cf. [2, 3]).
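For this specification, the instantiated schema computes the familiar EM updates for Gaussian mixtures [10, 11]. In textbook notation (our rendering, not the generated code, with N = nx·ny and responsibilities q), the E-step and M-step are

$$q_{ij,k} = \frac{\phi_k\, \mathcal{N}(\mathrm{pixel}_{ij} \mid \mu_k, \sigma_k^2)}{\sum_{l} \phi_l\, \mathcal{N}(\mathrm{pixel}_{ij} \mid \mu_l, \sigma_l^2)}$$

$$\phi_k = \frac{1}{N} \sum_{i,j} q_{ij,k}, \qquad \mu_k = \frac{\sum_{i,j} q_{ij,k}\, \mathrm{pixel}_{ij}}{\sum_{i,j} q_{ij,k}}, \qquad \sigma_k^2 = \frac{\sum_{i,j} q_{ij,k}\, (\mathrm{pixel}_{ij} - \mu_k)^2}{\sum_{i,j} q_{ij,k}}$$

iterated until convergence; the final responsibilities also yield the class assignments c.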
Figure 6 shows the result of segmenting the IC 418 image according to the identified classes. In particular, a segmentation into three classes already isolates the central star, the nebula, and the background from each other. It can thus be used as a precise, image-specific mask (cf. Figures 4(b) and 6(a)). More classes reveal more details, e.g., they identify the nebula's shell and separate the halo from the background, but too many classes bear the usual risk of overfitting the image.

[FIGURE 6. Segmentations of the IC 418 image: (a) three-class segmentation; (b) ditto, with the white class used as a mask on the original image; (c) five-class segmentation; (d) sample data from the geometrically refined segmentation model; class 1 (white) corresponds to the hull, class 2 (gray) to the core, and class 3 (not shown) to the background.]

Model Modifications. Code implementing this standard model can be found in many libraries. AUTOBAYES, however, can also derive customized code for model modifications. For example, a uniform variance for all classes can be enforced by simply changing the distribution clause into

pixel(I,J) ~ gauss(mu(c(I,J)), sigma).

and adapting the declaration of σ accordingly. AUTOBAYES still solves this model by an application of the EM-schema as before, but the structure of the M-step changes. In the standard model, it contains a loop over all classes which computes the current estimates of μi and σi. In the modified model, only the μi are computed inside the loop, while the single σ is computed separately. This modification cannot be achieved by simple code parametrization—it requires an actual code adaptation. Similarly, the standard model can be refined by adding priors (e.g., on the class means) to capture additional background knowledge. In AUTOBAYES, these priors are simply added to the specification as distribution clauses for the parameters. For example, a conjugate prior on each individual mean can be specified by

mu(I) ~ gauss(mu0(I), sqrt(sigma_sq(I)) * kappa0(I)).

AUTOBAYES again solves the model by an application of the EM-schema, and again the structure of the M-step changes, now to reflect the MAP estimate.
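For illustration, abstracting the prior standard deviation above as τ_k, i.e., μ_k ~ N(μ_{0,k}, τ_k²), the M-step mean update becomes the standard precision-weighted MAP estimate (our notation, not the generated code):

$$\mu_k = \frac{\mu_{0,k}/\tau_k^2 + \left(\sum_{i,j} q_{ij,k}\, \mathrm{pixel}_{ij}\right)/\sigma_k^2}{1/\tau_k^2 + \left(\sum_{i,j} q_{ij,k}\right)/\sigma_k^2}$$

which reduces to the maximum likelihood update as τ_k → ∞.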
More model modifications can be specified easily: a conjugate prior using the same parameters for all class means, non-conjugate priors, a mixture of different distributions (e.g., a single Cauchy and multiple Gaussians), and many more. In each case, AUTOBAYES derives an appropriate version of the EM-algorithm, instantiating and composing schemas as required. This makes it much more flexible than a simple code library.

Multivariate Approximation. Images of planetary nebulae are taken at different wavelengths, usually in the infrared and visible bands. Such data sets can be analyzed by AUTOBAYES if the covariance matrix is (assumed to be) diagonal, even though AUTOBAYES cannot yet properly handle multivariate distributions. In the specification shown in Figure 5, only the declaration (line 15) and distribution (line 16) of the input data need to be modified as shown below. Again, AUTOBAYES generates an appropriately instantiated EM algorithm.

data double pixel(1..nc, 1..nx, 1..ny) as 'data cube'.
pixel(C,I,J) ~ gauss(mu(C,c(I,J)), sigma(C,c(I,J))).

Segmentation with Geometric Information. A further refinement of the basic model can be obtained by adding spatial information, e.g., geometric knowledge about the elliptic shape of the nebula. In AUTOBAYES, we can model such a refinement, for example, by making the mean values of the mixed Gaussian distributions depend on the location (i, j) of the pixel. Here, the pixels in each class are located on an elliptical ring around the center x0(c_ij), y0(c_ij) with radii rx(c_ij) and ry(c_ij), and a thickness d(c_ij). Each of these unknown parameters depends on the class c_ij of the pixel at position i, j. In the AUTOBAYES specification of Figure 5, we only need to replace the simple mean μ(c_ij) of the Gaussian distribution (cf. line 16) by the expression

$$i_0(c_{ij}) \cdot e^{-\frac{\left(\sqrt{r(i, j)} - 1\right)^2 \cdot \left(r_x(c_{ij})^2 + r_y(c_{ij})^2\right)}{4 d(c_{ij})^2}} \qquad (3)$$

where $r(i, j) = (i - x_0(c_{ij}))^2 / r_x(c_{ij})^2 + (j - y_0(c_{ij}))^2 / r_y(c_{ij})^2$. AUTOBAYES again synthesizes an EM-algorithm, this time with a numerical optimization routine (e.g., a Nelder-Mead simplex method) nested within the M-step. Figure 6(d) shows sampling data generated with the parameters estimated by this model.

EVALUATION

Table 1 summarizes the results; it shows that AUTOBAYES's specification language allows a compact problem representation: none of the models discussed here required more than 35 lines of specification. The major difficulty in writing the specifications was to understand and then to express the core idea of the original models. After that, each specification took only a few minutes to write and in general one or two iterations to debug and complete (e.g., adding constraints).

TABLE 1. Summary of Results. For each model, the size of the specification |Spec| and of the generated program (including generated comments) |Code|, both in lines, and the synthesis time Tsynth (measured on an unloaded 2GHz/4GB Linux PC) are given. -nosolve and -nolib are AUTOBAYES command line options which suppress the application of the schemas using partial symbolic solutions and library components, respectively. The entries for sigmoid-0, sigmoid-2, and segment-geom refer to the -nolib -nosolve variants.

  Model                 |Spec|   |Code|     Tsynth
  gauss                   18      1045      36.4s
    -nosolve                       703       2.4s
    -nolib                         764       4.9s
    -nolib -nosolve                494       1.1s
  sigmoid                 28     12650   39m42.9s
    -nolib -nosolve                872       3.2s
  sigmoid-0               27       581       1.3s
  sigmoid-2               35      1202       6.7s
  segment                 17       518       1.2s
  segment-multi           18       602       1.4s
  segment-geom            30      1765      40.6s

The table also shows the overall feasibility of the approach. AUTOBAYES was able to derive code for each of the models; scale-up factors from the model specification to the generated code are generally around 1:30. Synthesis times are generally only a few seconds and comparable to compilation times of the derived code. However, for the gauss and sigmoid models, AUTOBAYES spends almost all of its time simplifying the partial differentials and then optimizing the auxiliary functions evaluating them; in the sigmoid case, this even exhausts the available memory. With the command line options -nosolve and -nolib, AUTOBAYES can be forced to stay away from these expensive calculations and code is derived much faster, using the Nelder-Mead simplex method which requires no differentials.

Search space explosion is a common problem in program synthesis. In our case, it is mitigated by the deterministic nature of the symbolic-algebraic computations, the higher level of abstraction inherent to the schemas, and the inherent structure of the schema library. Still, search spaces are large since schemas can be functionally equivalent and solve the same class of problems. For the gauss model, AUTOBAYES derives 224 programs, many of them identical, in 35 minutes total synthesis time. Parts of the search space can be pruned away manually by command line options like -nosolve, but more implicit control is required, especially when the schema library grows further.

The original Matlab code written by Knuth [1] uses a gradient descent method; it computes the differentials in a single iteration over all pixels while the synthesized code iterates once for each differential but reuses memoized subexpressions. This decomposition requires domain knowledge not yet formalized for AUTOBAYES. For the gauss model, the (interpreted) original Matlab code is approximately 70 lines, mainly for the computation of the gradient. It requires on average 174 seconds to converge, while the synthesized C++ code requires only 11 seconds.

RELATED WORK

Scientific computing is a (relatively) popular application domain for program synthesis; due to its complexity, most related work also follows a schema-based approach. However, most work focuses on wrapping solvers for partial differential equations (PDEs). SciNapse [16] is a general "problem-solving environment" for PDEs; it supports code generation for a variety of features in the PDE domain, for example coordinate transformation or grid generation. Ellman et al. [17] describe a system to generate simulation programs from PDEs. Both systems use Mathematica as the underlying symbolic engine. Ctadel [18] is a PDE-based synthesis system for the weather forecasting domain. Like AUTOBAYES, it is implemented in SWI-Prolog and contains its own symbolic engine. Domains other than PDEs have been tackled less often. Amphion [19] is a purely deductive synthesis system which has been used to synthesize astronomical software using library components. Amphion/NAV [20] applies the same technology to the state estimation domain. The scale-up difficulties encountered there led to a switch to a schema-based approach and the development of the AUTOFILTER system [21] as a domain-specific extension of AUTOBAYES. It provides its own specification language and schemas but reuses the core system. Planware [22] deductively synthesizes high-performance schedulers. It uses concepts from higher-order logic and category theory to structure the domain theory and thus to reduce the required proof effort.

Schema-based program synthesis is also related to generative programming [23], since schemas (more precisely, the code fragments) correspond to templates. The major differences are that (i) the almost completely syntax-directed template instantiation is less powerful and less secure than schema instantiation and (ii) the users still have to write the core algorithm. Code libraries are common in scientific computing and data analysis, but they lack the level of automation achievable by program synthesis. For example, the Bayes Net Toolbox [24] is a Matlab library which allows program development on the model level, but it does not derive algorithms or generate code. BUGS [25] is a statistical model interpreter based on Gibbs sampling, a more general but less efficient Bayesian inference technique; it could be integrated into AUTOBAYES as an additional schema.

CONCLUSIONS

We presented AUTOBAYES, a fully automatic program synthesis system for the data analysis domain, and demonstrated how it can be used to support a realistic scientific task, the analysis of planetary nebulae images taken by the HST. We specified the hierarchy of models presented in [1]. AUTOBAYES was able to automatically generate code for these models; in tests, the synthesized code gave the same results as the scientists' Matlab code.
We also used AUTO BAYES to refine the analysis, combining cluster-based image segmentation with geometric constraints. Key elements to achieve these results are the schema-based synthesis approach, an efficient problem representation via Bayesian networks, and a strong symbolic-algebraic subsystem. This combination is unique to AUTO BAYES and allows us to solve realistic data analysis problems. In this particular case study the recently completed integration of the GSL and the aggressive optimization of the goal expressions proved to be instrumental to derive efficient code. However, AUTO BAYES must still be improved before it can be delivered to the working data analyst. In particular, the domain coverage must be extended and schemas must be added to enable solutions for new classes of models (e.g., multivariate distributions). Synthesis from large models requires support for hierarchical specifications and more powerful optimizations to generate faster code. Finally, numeric optimization schemas must be extended with numeric constraint handling to produce more versatile and robust code. Yet, AUTO BAYES already demonstrates that program synthesis technology has become a viable and versatile technology that enables scientists to think and to program in models rather than in code. Availability. A web-based interface to AUTO BAYES and links to other papers are available at http://ase.arc.nasa.gov/autobayes. Acknowledgments. A. Hajian was supported for this work by NASA through grant numbers GO7501, GO8390, GO8773 from the Space Telescope Science Institute, which is operated by AURA, Inc. under NASA contract NAS5-26555. K. Knuth was supported by the NASA Ames DDF Grant 274-51-02-51, the NASA IDU/IS/CICT Program, and the NASA Aerospace Technology Enterprise. W. Buntine contributed substantially to the early development of AUTO BAYES. REFERENCES 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. Knuth, K. H., and Hajian, A. R., “Hierarchies of Models: Toward Understanding of Planetary Nebulae,” in Proc. Bayesian Inference and Maximum Entropy Methods in Science and Engineering, edited by C. Willimans, American Institute of Physics, 2002, pp. 92–103. Fischer, B., and Schumann, J., J. Functional Programming, 13, 483–508 (2003). Gray, A. G., Fischer, B., Schumann, J., and Buntine, W., “Automatic Derivation of Statistical Algorithms: The EM Family and Beyond,” in Advances in Neural Information Processing Systems 15, edited by S. Becker, S. Thrun, and K. Obermayer, MIT Press, 2003, pp. 689–696. Buntine, W. L., J. AI Research, 2, 159–225 (1994). Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers, 1988. Wielemaker, J., SWI-Prolog 5.2.9 Reference Manual, Amsterdam (2003). Koutsofios, E., and North, S., Drawing graphs with dot, Tech. rep., AT&T Bell Laboratories, USA (1996). Ayari, A., and Basin, D., J. Symbolic Computation, 31, 487–520 (2001). Manna, Z., and Waldinger, R., The Deductive Foundations of Computer Programming, AddisonWesley, 1993. Dempster, A. P., Laird, N. M., and Rubin, D. B., J. of the Royal Statistical Society series B, 39, 1–38 (1977). McLachlan, G., and Krishnan, T., The EM Algorithm and Extensions, Wiley Series in Probability and Statistics, John Wiley & Sons, New York, 1997. Wolfram, S., The Mathematica Book, Cambridge Univ. Press, Cambridge, UK, 1999, 4th edn. Baader, F., and Nipkow, T., Term Rewriting and all that, Cambridge Univ. Press, Cambridge, 1998. Muchnick, S. 
14. Muchnick, S. S., Advanced Compiler Design and Implementation, Morgan Kaufmann Publishers, San Mateo, CA, USA, 1997.
15. Galassi, M., Davies, J., Theiler, J., Gough, B., Jungman, G., Booth, M., and Rossi, F., GNU Scientific Library Reference Manual, edition 1.3, for GSL version 1.3 (2002).
16. Akers, R., Baffes, P., Kant, E., Randall, C., Steinberg, S., and Young, R., IEEE Computational Science and Engineering, 4, 33–42 (1997).
17. Ellman, T., Deak, R., and Fotinatos, J., "Knowledge-Based Synthesis of Numerical Programs for Simulation of Rigid-Body Systems in Physics-Based Animation," in Proc. 17th Intl. Conf. Automated Software Engineering, edited by W. Emmerich and D. Wile, IEEE Comp. Soc. Press, 2002, pp. 93–104.
18. van Engelen, R. A., Wolters, L., and Cats, G., "Ctadel: A Generator of Efficient Code for PDE-based Scientific Applications," in Proc. 9th Intl. Conf. on Supercomputing, ACM Press, 1996, pp. 86–93.
19. Stickel, M., Waldinger, R., Lowry, M., Pressburger, T., and Underwood, I., "Deductive Composition of Astronomical Software from Subroutine Libraries," in Proc. 12th Intl. Conf. Automated Deduction, edited by A. Bundy, vol. 814 of Lect. Notes Artificial Intelligence, Springer, 1994, pp. 341–355.
20. Whittle, J., Van Baalen, J., Schumann, J., Robinson, P., Pressburger, T., Penix, J., Oh, P., Lowry, M., and Brat, G., "Amphion/NAV: Deductive Synthesis of State Estimation Software," in Proc. 16th Intl. Conf. Automated Software Engineering, edited by M. S. Feather and M. Goedicke, IEEE Comp. Soc. Press, 2001, pp. 395–399.
21. Whittle, J., and Schumann, J., Automating the Implementation of Kalman-Filter Algorithms (2003), in review.
22. Blaine, L., Gilham, L.-M., Liu, J., Smith, D. R., and Westfold, S., "Planware – Domain-Specific Synthesis of High-Performance Schedulers," in Proc. 13th Intl. Conf. Automated Software Engineering, edited by D. F. Redmiles and B. Nuseibeh, IEEE Comp. Soc. Press, 1998, pp. 270–280.
23. Czarnecki, K., and Eisenecker, U. W., Generative Programming: Methods, Tools, and Applications, Addison-Wesley, 2002.
24. Murphy, K., Computing Science and Statistics, 33 (2001).
25. Thomas, A., Spiegelhalter, D. J., and Gilks, W. R., "BUGS: A Program to Perform Bayesian Inference Using Gibbs Sampling," in Bayesian Statistics 4, edited by J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford Univ. Press, 1992, pp. 837–842.