
Automated MPI code generation for scalable finite-difference solvers

Authors: George Bisbas, Rhodri Nelson, Mathias Louboutin, Paul H. J. Kelly, Fabio Luporini, Gerard Gorman

Abstract: Partial differential equations (PDEs) are crucial in modelling diverse phenomena across scientific disciplines, including seismic and medical imaging, computational fluid dynamics, image processing, and neural networks. Solving these PDEs on a large scale is an intricate and time-intensive process that demands careful tuning. This paper introduces automated code-generation techniques specifically tailored for distributed memory parallelism (DMP) to solve explicit finite-difference (FD) stencils at scale, a fundamental challenge in numerous scientific applications. These techniques are implemented and integrated into the Devito DSL and compiler framework, a well-established solution for automating the generation of FD solvers based on a high-level symbolic math input. Users benefit from modelling simulations at a high-level symbolic abstraction and effortlessly harnessing HPC-ready distributed-memory parallelism without altering their source code. This results in drastic reductions both in execution time and developer effort. While the contributions of this work are implemented and integrated within the Devito framework, the DMP concepts and the techniques applied are generally applicable to any FD solvers. A comprehensive performance evaluation of Devito's DMP via MPI demonstrates highly competitive weak and strong scaling on the Archer2 supercomputer, demonstrating the effectiveness of the proposed approach in meeting the demands of large-scale scientific simulations.

Submitted 7 May, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: 11 pages, 12 figures (18 pages with References and Appendix)

arXiv:2309.03600 [pdf, other]

A Novel Immersed Boundary Approach for Irregular Topography with Acoustic Wave Equations

Authors: Edward Caunt, Rhodri Nelson, Fabio Luporini, Gerard Gorman

Abstract: Irregular terrain has a pronounced effect on the propagation of seismic and acoustic wavefields but is not straightforwardly reconciled with structured finite-difference (FD) methods used to model such phenomena. Methods currently detailed in the literature are generally limited in scope application-wise or non-trivial to apply to real-world geometries. With this in mind, a general immersed boundary treatment capable of imposing a range of boundary conditions in a relatively equation-agnostic manner has been developed, alongside a framework implementing this approach, intending to complement emerging code-generation paradigms. The approach is distinguished by the use of N-dimensional Taylor-series extrapolants constrained by boundary conditions imposed at some suitably-distributed set of surface points. The extrapolation process is encapsulated in modified derivative stencils applied in the vicinity of the boundary, utilizing hyperspherical support regions. This method ensures boundary representation is consistent with the FD discretization: both must be considered in tandem. Furthermore, high-dimensional and vector boundary conditions can be applied without approximation prior to discretization. A consistent methodology can thus be applied across free and rigid surfaces with the first and second-order acoustic wave equation formulations. Application to both equations is demonstrated, and numerical examples based on analytic and real-world topography implementing free and rigid surfaces in 2D and 3D are presented.

Submitted 7 September, 2023; originally announced September 2023.

Comments: Submitted to Geophysics. 24 pages, 26 figures

arXiv:2110.03345 [pdf, other]

doi 10.1016/j.cmpb.2022.106855

Stride: a flexible platform for high-performance ultrasound computed tomography

Authors: Carlos Cueto, Oscar Bates, George Strong, Javier Cudeiro, Fabio Luporini, Oscar Calderon Agudo, Gerard Gorman, Lluis Guasch, Meng-Xing Tang

Abstract: Advanced ultrasound computed tomography techniques like full-waveform inversion are mathematically challenging and orders of magnitude more computationally expensive than conventional ultrasound imaging methods. This computational and algorithmic complexity, and a lack of open-source libraries in this field, represent a barrier preventing the generalised adoption of these techniques, slowing the pace of research and hindering reproducibility. Consequently, we have developed Stride, an open-source Python library for the solution of large-scale ultrasound tomography problems. On one hand, Stride provides high-level interfaces and tools for expressing the types of optimisation problems encountered in medical ultrasound tomography. On the other, these high-level abstractions seamlessly integrate with high-performance wave-equation solvers and with scalable parallelisation routines. The wave-equation solvers are generated automatically using Devito, a domain specific language, and the parallelisation routines are provided through the custom actor-based library Mosaic. Through a series of examples, we show how Stride can handle realistic tomographic problems, in 2D and 3D, providing intuitive and flexible interfaces that scale from a local multi-processing environment to a multi-node high-performance cluster.

Submitted 18 May, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Journal ref: Computer Methods and Programs in Biomedicine, 221, 2022

arXiv:2010.10248 [pdf, other]

Temporal blocking of finite-difference stencil operators with sparse "off-the-grid" sources

Authors: George Bisbas, Fabio Luporini, Mathias Louboutin, Rhodri Nelson, Gerard Gorman, Paul H. J. Kelly

Abstract: Stencil kernels dominate a range of scientific applications, including seismic and medical imaging, image processing, and neural networks. Temporal blocking is a performance optimization that aims to reduce the required memory bandwidth of stencil computations by re-using data from the cache for multiple time steps. It has already been shown to be beneficial for this class of algorithms. However, applying temporal blocking to practical applications' stencils remains challenging. These computations often consist of sparsely located operators not aligned with the computational grid ("off-the-grid"). Our work is motivated by modeling problems in which source injections result in wavefields that must then be measured at receivers by interpolation from the gridded wavefield. The resulting data dependencies make the adoption of temporal blocking much more challenging. We propose a methodology to inspect these data dependencies and reorder the computation, leading to performance gains in stencil codes where temporal blocking has not been applicable. We implement this novel scheme in the Devito domain-specific compiler toolchain. Devito implements a domain-specific language embedded in Python to generate optimized partial differential equation solvers using the finite-difference method from high-level symbolic problem definitions. We evaluate our scheme using isotropic acoustic, anisotropic acoustic, and isotropic elastic wave propagators of industrial significance. After auto-tuning, performance evaluation shows that this enables substantial performance improvement through temporal blocking over highly-optimized vectorized spatially-blocked code of up to 1.6x.

Submitted 25 February, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

Comments: Accepted for publication at 35th IEEE International Parallel & Distributed Processing Symposium

arXiv:2004.10519 [pdf, other]

Scaling through abstractions -- high-performance vectorial wave simulations for seismic inversion with Devito

Authors: Mathias Louboutin, Fabio Luporini, Philipp Witte, Rhodri Nelson, George Bisbas, Jan Thorbecke, Felix J. Herrmann, Gerard Gorman

Abstract: Devito is an open-source Python project based on domain-specific language and compiler technology. Driven by the requirements of rapid HPC applications development in exploration seismology, the language and compiler have evolved significantly since inception. Sophisticated boundary conditions, tensor contractions, sparse operations and features such as staggered grids and sub-domains are all supported; operators of essentially arbitrary complexity can be generated. To accommodate this flexibility whilst ensuring performance, data dependency analysis is utilized to schedule loops and detect computational properties such as parallelism. In this article, the generation and simulation of MPI-parallel propagators (along with their adjoints) for the pseudo-acoustic wave-equation in tilted transverse isotropic media and the elastic wave-equation are presented. Simulations are carried out on industry scale synthetic models in an HPC Cloud system and reach a performance of 28TFLOP/s, hence demonstrating Devito's suitability for production-grade seismic inversion problems.

Submitted 22 April, 2020; originally announced April 2020.

Comments: 11 pages, 3 figures

arXiv:1912.00695 [pdf, ps, other]

doi 10.1007/978-3-030-41005-6_16

GPU Support for Automatic Generation of Finite-Differences Stencil Kernels

Authors: Vitor Hugo Mickus Rodrigues, Lucas Cavalcante, Maelso Bruno Pereira, Fabio Luporini, István Reguly, Gerard Gorman, Samuel Xavier de Souza

Abstract: The growth of data to be processed in the Oil & Gas industry matches the requirements imposed by evolving algorithms based on stencil computations, such as Full Waveform Inversion and Reverse Time Migration. Graphical processing units (GPUs) are an attractive architectural target for stencil computations because of their high degree of data parallelism. However, the rapid architectural and technological progression makes it difficult for even the most proficient programmers to remain up-to-date with the technological advances at a micro-architectural level. In this work, we present an extension for Devito, an open source compiler designed to produce highly optimized finite difference kernels for use in inversion methods. We embed it with the Oxford Parallel Domain Specific Language (OP-DSL) in order to enable automatic code generation for GPU architectures from a high-level representation. We aim to enable users coding in a symbolic representation level to effortlessly get their implementations leveraged by the processing capacities of GPU architectures. The implemented backend is evaluated on an NVIDIA GTX Titan Z, and on an NVIDIA Tesla V100 in terms of operational intensity through the roof-line model for varying space-order discretization levels of 3D acoustic isotropic wave propagation stencil kernels with and without symbolic optimizations. It achieves approximately 63% of V100's peak performance and 24% of Titan Z's peak performance for stencil kernels over grids with 256 points. Our study reveals that improving memory usage should be the most efficient strategy for leveraging the performance of the implemented solution on the evaluated architectures.

Submitted 2 December, 2019; originally announced December 2019.

Comments: This work was accepted and presented to Latin America High Performance Computing (CARLA 2019)

arXiv:1908.03653 [pdf, other]

Performance of Devito on HPC-Optimised ARM Processors

Authors: Hermes Senger, Jaime F. de Souza, Edson S. Gomi, Fabio Luporini, Gerard J. Gorman

Abstract: We evaluate the performance of Devito, a domain specific language (DSL) for finite differences, on Arm ThunderX2 processors. Experiments with two common seismic computational kernels demonstrate that Arm processors can deliver competitive performance compared to Intel Xeon processors.

Submitted 19 August, 2019; v1 submitted 9 August, 2019; originally announced August 2019.

Comments: 2 pages, one figure, 2 tables

arXiv:1907.02818 [pdf, other]

doi 10.1145/3337821.3337906

Automatic Differentiation for Adjoint Stencil Loops

Authors: Jan Hückelheim, Navjot Kukreja, Sri Hari Krishna Narayanan, Fabio Luporini, Gerard Gorman, Paul Hovland

Abstract: Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoint differentiation, or back-propagation, is sometimes used to obtain gradients of programs that contain stencil loops. Unfortunately, conventional automatic differentiation results in a memory access pattern that is not stencil-like and not easily parallelisable. In this paper we present a novel combination of automatic differentiation and loop transformations that preserves the structure and memory access pattern of stencil loops, while computing fully consistent derivatives. The generated loops can be parallelised and optimised for performance in the same way and using the same tools as the original computation. We have implemented this new technique in the Python tool PerforAD, which we release with this paper along with test cases derived from seismic imaging and computational fluid dynamics applications.

Submitted 5 July, 2019; originally announced July 2019.

Comments: ICPP 2019
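
The difference between the two access patterns mentioned in the abstract above can be illustrated with a small hand-written sketch. It assumes a three-point 1D stencil with constant coefficients; the function names are hypothetical, nothing here is PerforAD's API or its generated output, and the boundary handling uses simple branches where a real transformation would more likely emit separate remainder loops.

```python
def stencil_forward(x, a, b, c):
    """y[i] = a*x[i-1] + b*x[i] + c*x[i+1] on interior points; boundaries stay 0."""
    n = len(x)
    y = [0.0] * n
    for i in range(1, n - 1):
        y[i] = a * x[i - 1] + b * x[i] + c * x[i + 1]
    return y


def adjoint_scatter(y_bar, a, b, c):
    """Conventional reverse mode: each iteration increments three entries of x_bar."""
    n = len(y_bar)
    x_bar = [0.0] * n
    for i in range(1, n - 1):
        x_bar[i - 1] += a * y_bar[i]
        x_bar[i] += b * y_bar[i]
        x_bar[i + 1] += c * y_bar[i]
    return x_bar


def adjoint_gather(y_bar, a, b, c):
    """Same derivative, rewritten so that iteration j writes only x_bar[j]."""
    n = len(y_bar)
    x_bar = [0.0] * n
    for j in range(n):
        total = 0.0
        # A term contributes only if the forward loop visited that index (1..n-2).
        if 1 <= j - 1 <= n - 2:
            total += c * y_bar[j - 1]
        if 1 <= j <= n - 2:
            total += b * y_bar[j]
        if 1 <= j + 1 <= n - 2:
            total += a * y_bar[j + 1]
        x_bar[j] = total
    return x_bar
```

In the scatter form, neighbouring iterations write to overlapping entries of `x_bar`, so the loop cannot be parallelised without atomics or a reduction. In the gather form, each iteration owns exactly one output entry, which is the stencil-like structure the abstract refers to.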

arXiv:1810.05268 [pdf, other]

Combining Checkpointing and Data Compression to Accelerate Adjoint-Based Optimization Problems

Authors: Navjot Kukreja, Jan Hueckelheim, Mathias Louboutin, Fabio Luporini, Paul Hovland, Gerard Gorman

Abstract: Seismic inversion and imaging are adjoint-based optimization problems that process up to terabytes of data, regularly exceeding the memory capacity of available computers. Data compression is an effective strategy to reduce this memory requirement by a certain factor, particularly if some loss in accuracy is acceptable. A popular alternative is checkpointing, where data is stored at selected points in time, and values at other times are recomputed as needed from the last stored state. This allows arbitrarily large adjoint computations with limited memory, at the cost of additional recomputations. In this paper, we combine compression and checkpointing for the first time to compute a realistic seismic inversion. The combination of checkpointing and compression allows larger adjoint computations compared to using only compression, and reduces the recomputation overhead significantly compared to using only checkpointing.

Submitted 20 September, 2021; v1 submitted 11 October, 2018; originally announced October 2018.

Comments: Accepted in European Conference on Parallel Processing (EuroPar) 2019. Part of the Lecture Notes in Computer Science book series (LNCS, volume 11725)
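
The checkpointing idea summarised in the abstract above (store selected states, recompute the rest from the last stored one) can be sketched in a few lines. This is a minimal illustration under stated assumptions: a fixed checkpoint interval, a scalar state, and a made-up `step` function. It is not the schedule or the compression scheme used in the paper, and all names are hypothetical.

```python
def step(state):
    # Hypothetical forward time step; any deterministic update would do.
    return 0.5 * state + 1.0


def forward_with_checkpoints(initial, num_steps, interval):
    """Run the forward sweep, keeping only every `interval`-th state."""
    checkpoints = {0: initial}
    state = initial
    for t in range(1, num_steps + 1):
        state = step(state)
        if t % interval == 0:
            checkpoints[t] = state
    return checkpoints


def restore(checkpoints, t, interval):
    """Recompute the state at time t from the last stored checkpoint."""
    base = (t // interval) * interval
    state = checkpoints[base]
    for _ in range(t - base):
        state = step(state)
    return state


def reverse_sweep(checkpoints, num_steps, interval):
    """Visit states in reverse time order, as an adjoint computation would."""
    return [restore(checkpoints, t, interval) for t in range(num_steps, -1, -1)]
```

With `num_steps = 10` and `interval = 4`, only the states at times 0, 4 and 8 are held in memory, and every other state costs at most three extra forward steps when the reverse sweep asks for it. That is the memory-for-recomputation trade-off the abstract describes.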

arXiv:1808.01995 [pdf, other]

doi 10.5194/gmd-12-1165-2019

Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration

Authors: Mathias Louboutin, Michael Lange, Fabio Luporini, Navjot Kukreja, Philipp A. Witte, Felix J. Herrmann, Paulius Velesko, Gerard J. Gorman

Abstract: We introduce Devito, a new domain-specific language for implementing high-performance finite difference partial differential equation solvers. The motivating application is exploration seismology where methods such as Full-Waveform Inversion and Reverse-Time Migration are used to invert terabytes of seismic data to create images of the earth's subsurface. Even using modern supercomputers, it can take weeks to process a single seismic survey and create a useful subsurface image. The computational cost is dominated by the numerical solution of wave equations and their corresponding adjoints. Therefore, a great deal of effort is invested in aggressively optimizing the performance of these wave-equation propagators for different computer architectures. Additionally, the actual set of partial differential equations being solved and their numerical discretization is under constant innovation as increasingly realistic representations of the physics are developed, further ratcheting up the cost of practical solvers. By embedding a domain-specific language within Python and making heavy use of SymPy, a symbolic mathematics library, we make it possible to develop finite difference simulators quickly using a syntax that strongly resembles the mathematics. The Devito compiler reads this code and applies a wide range of analyses to generate highly optimized and parallel code. This approach can reduce the development time of a verified and optimized solver from months to days.

Submitted 9 August, 2019; v1 submitted 6 August, 2018; originally announced August 2018.

Journal ref: https://www.geosci-model-dev.net/12/1165/2019/

arXiv:1807.03032 [pdf, other]

Architecture and performance of Devito, a system for automated stencil computation

Authors: Fabio Luporini, Michael Lange, Mathias Louboutin, Navjot Kukreja, Jan Hückelheim, Charles Yount, Philipp Witte, Paul H. J. Kelly, Felix J. Herrmann, Gerard J. Gorman

Abstract: Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process, from mathematical equations down to C++ code, is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expressions elimination, tiling and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the back-end of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications.

Submitted 7 February, 2020; v1 submitted 9 July, 2018; originally announced July 2018.

Comments: Submitted to ACM Transactions on Mathematical Software

MSC Class: 65N06; 68N20

arXiv:1708.03183 [pdf, other]

Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modelling

Authors: Fabio Luporini, Michael Lange, Christian T. Jacobs, Gerard J. Gorman, J. Ramanujam, Paul H. J. Kelly

Abstract: Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]] -- hence the name "sparse". One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation -- not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing: a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased run-time performance; implementation in a publicly-available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28x on a platform consisting of 896 Intel Broadwell cores.

Submitted 19 June, 2019; v1 submitted 10 August, 2017; originally announced August 2017.

Comments: 29 pages including supplementary materials and references

ACM Class: D.1.2; G.4
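
The kind of loop pair that the abstract above describes, with different iteration spaces touching shared data through `A[map[i]]`-style accesses, might look like the following. This sketch shows only the access pattern that motivates sparse tiling; it does not implement the tiling algorithm and does not use SLOPE's interface. The mesh, maps and function names are hypothetical.

```python
def edge_loop(edge_to_vertex, edge_weight, vertex_acc):
    """Loop 1: iterate over edges, incrementing vertex data through a map."""
    for e, (v0, v1) in enumerate(edge_to_vertex):
        vertex_acc[v0] += edge_weight[e]  # indirect write, A[map[i]]
        vertex_acc[v1] += edge_weight[e]


def cell_loop(cell_to_vertex, vertex_acc):
    """Loop 2: iterate over cells, reading the same vertex data through another map."""
    return [sum(vertex_acc[v] for v in vertices) for vertices in cell_to_vertex]


# Two triangles sharing the edge between vertices 1 and 2.
edge_to_vertex = [(0, 1), (1, 2), (2, 0), (1, 3), (3, 2)]
cell_to_vertex = [(0, 1, 2), (1, 3, 2)]
edge_weight = [1.0, 2.0, 3.0, 4.0, 5.0]

vertex_acc = [0.0] * 4
edge_loop(edge_to_vertex, edge_weight, vertex_acc)
cell_sums = cell_loop(cell_to_vertex, vertex_acc)
```

Because the two loops run over different entities (edges, then cells) and reach the vertex data only through maps, a compiler cannot fuse or block them with the usual affine analysis. The dependencies have to be discovered from the maps themselves.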

arXiv:1707.03776 [pdf, other]

Optimised finite difference computation from symbolic equations

Authors: Michael Lange, Navjot Kukreja, Fabio Luporini, Mathias Louboutin, Charles Yount, Jan Hückelheim, Gerard J. Gorman

Abstract: Domain-specific high-productivity environments are playing an increasingly important role in scientific computing due to the levels of abstraction and automation they provide. In this paper we introduce Devito, an open-source domain-specific framework for solving partial differential equations from symbolic problem definitions by the finite difference method. We highlight the generation and automated execution of highly optimized stencil code from only a few lines of high-level symbolic Python for a set of scientific equations, before exploring the use of Devito operators in seismic inversion problems.

Submitted 12 July, 2017; originally announced July 2017.

Comments: Accepted for publication in Proceedings of the 16th Python in Science Conference (SciPy 2017)

arXiv:1705.03667 [pdf, other]

doi 10.1137/17M1130642

TSFC: a structure-preserving form compiler

Authors: Miklós Homolya, Lawrence Mitchell, Fabio Luporini, David A. Ham

Abstract: A form compiler takes a high-level description of the weak form of partial differential equations and produces low-level code that carries out the finite element assembly. In this paper we present the Two-Stage Form Compiler (TSFC), a new form compiler with the main motivation to maintain the structure of the input expression as long as possible. This facilitates the application of optimizations at the highest possible level of abstraction. TSFC features a novel, structure-preserving method for separating the contributions of a form to the subblocks of the local tensor in discontinuous Galerkin problems. This enables us to preserve the tensor structure of expressions longer through the compilation process than other form compilers. This is also achieved in part by a two-stage approach that cleanly separates the lowering of finite element constructs to tensor algebra in the first stage, from the scheduling of those tensor operations in the second stage. TSFC also efficiently traverses complicated expressions, and experimental evaluation demonstrates good compile-time performance even for highly complex forms.

Submitted 9 April, 2018; v1 submitted 10 May, 2017; originally announced May 2017.

Comments: Accepted version. 28 pages plus 5 pages supplement

MSC Class: 68N20; 65M60; 65N30

Journal ref: SIAM Journal on Scientific Computing, 40 (2018), pp. C401-C428

arXiv:1609.03361 [pdf, other]

Devito: Towards a generic Finite Difference DSL using Symbolic Python

Authors: Michael Lange, Navjot Kukreja, Mathias Louboutin, Fabio Luporini, Felippe Vieira, Vincenzo Pandolfo, Paulius Velesko, Paulius Kazakas, Gerard Gorman

Abstract: Domain specific languages (DSLs) have been used in a variety of fields to express complex scientific problems in a concise manner and provide automated performance optimization for a range of computational architectures. As such, DSLs provide a powerful mechanism to speed up scientific Python computation that goes beyond traditional vectorization and pre-compilation approaches, while allowing domain scientists to build applications within the comforts of the Python software ecosystem. In this paper we present Devito, a new finite difference DSL that provides optimized stencil computation from high-level problem specifications based on symbolic Python expressions. We demonstrate Devito's symbolic API and performance advantages over traditional Python acceleration methods before highlighting its use in the scientific context of seismic inversion problems.

Submitted 12 September, 2016; originally announced September 2016.

Comments: pyHPC 2016 conference submission

arXiv:1608.08658 [pdf, other]

Devito: automated fast finite difference computation

Authors: Navjot Kukreja, Mathias Louboutin, Felippe Vieira, Fabio Luporini, Michael Lange, Gerard Gorman

Abstract: Domain specific languages have successfully been used in a variety of fields to cleanly express scientific problems as well as to simplify implementation and performance optimization on different computer architectures. Although a large number of stencil languages are available, finite difference domain specific languages have proved challenging to design because most practical use cases require additional features that fall outside the finite difference abstraction. Inspired by the complexity of real-world seismic imaging problems, we introduce Devito, a domain specific language in which high level equations are expressed using symbolic expressions from the SymPy package. Complex equations are automatically manipulated, optimized, and translated into highly optimized C code that aims to perform comparably or better than hand-tuned code. All this is transparent to users, who only see concise symbolic mathematical expressions.

Submitted 10 October, 2016; v1 submitted 30 August, 2016; originally announced August 2016.

Comments: Accepted at WolfHPC 2016

arXiv:1604.05937 [pdf, other]

doi 10.5194/gmd-9-3803-2016

A structure-exploiting numbering algorithm for finite elements on extruded meshes, and its performance evaluation in Firedrake

Authors: Gheorghe-Teodor Bercea, Andrew T. T. McRae, David A. Ham, Lawrence Mitchell, Florian Rathgeber, Luigi Nardi, Fabio Luporini, Paul H. J. Kelly

Abstract: We present a generic algorithm for numbering and then efficiently iterating over the data values attached to an extruded mesh. An extruded mesh is formed by replicating an existing mesh, assumed to be unstructured, to form layers of prismatic cells. Applications of extruded meshes include, but are not limited to, the representation of 3D high aspect ratio domains employed by geophysical finite element simulations. These meshes are structured in the extruded direction. The algorithm presented here exploits this structure to avoid the performance penalty traditionally associated with unstructured meshes. We evaluate the implementation of this algorithm in the Firedrake finite element system on a range of low compute intensity operations which constitute worst cases for data layout performance exploration. The experiments show that having structure along the extruded direction enables the cost of the indirect data accesses to be amortized after 10-20 layers as long as the underlying mesh is well-ordered. We characterise the resulting spatial and temporal reuse in a representative set of both continuous-Galerkin and discontinuous-Galerkin discretisations. On meshes with realistic numbers of layers the performance achieved is between 70% and 90% of a theoretical hardware-specific limit.

Submitted 28 October, 2016; v1 submitted 20 April, 2016; originally announced April 2016.

Comments: Bibliography fixes, 23 pages

Journal ref: Geoscientific Model Development 9:3803-3815 (2016)

arXiv:1604.05872 [pdf, other]

doi 10.1145/3054944

An algorithm for the optimization of finite element integration loops

Authors: Fabio Luporini, David A. Ham, Paul H. J. Kelly

Abstract: We present an algorithm for the optimization of a class of finite element integration loop nests. This algorithm, which exploits fundamental mathematical properties of finite element operators, is proven to achieve a locally optimal operation count. In specified circumstances the optimum achieved is global. Extensive numerical experiments demonstrate significant performance improvements over the state of the art in finite element code generation in almost all cases. This validates the effectiveness of the algorithm presented here, and illustrates its limitations.

Submitted 20 April, 2016; originally announced April 2016.

ACM Class: G.1.8; G.4

arXiv:1501.01809 [pdf, other]

doi 10.1145/2998441

Firedrake: automating the finite element method by composing abstractions

Authors: Florian Rathgeber, David A. Ham, Lawrence Mitchell, Michael Lange, Fabio Luporini, Andrew T. T. McRae, Gheorghe-Teodor Bercea, Graham R. Markall, Paul H. J. Kelly

Abstract: Firedrake is a new tool for automating the numerical solution of partial differential equations. Firedrake adopts the domain-specific language for the finite element method of the FEniCS project, but with a pure Python runtime-only implementation centred on the composition of several existing and new abstractions for particular aspects of scientific computing. The result is a more complete separation of concerns which eases the incorporation of separate contributions from computer scientists, numerical analysts and application specialists. These contributions may add functionality, or improve performance. Firedrake benefits from automatically applying new optimisations. This includes factorising mixed function spaces, transforming and vectorising inner loops, and intrinsically supporting block matrix operations. Importantly, Firedrake presents a simple public API for escaping the UFL abstraction. This allows users to implement common operations that fall outside pure variational formulations, such as flux-limiters.

Submitted 1 July, 2016; v1 submitted 8 January, 2015; originally announced January 2015.

Comments: Minor revisions to v2

ACM Class: G.1.8; G.4

Journal ref: ACM Transactions on Mathematical Software 43(3):24:1–24:27 (2016)

arXiv:1407.0904 [pdf, other]

doi 10.1145/2687415

COFFEE: an Optimizing Compiler for Finite Element Local Assembly

Authors: Fabio Luporini, Ana Lucia Varbanescu, Florian Rathgeber, Gheorghe-Teodor Bercea, J. Ramanujam, David A. Ham, Paul H. J. Kelly

Abstract: The numerical solution of partial differential equations using the finite element method is one of the key applications of high performance computing. Local assembly is its characteristic operation. This entails the execution of a problem-specific kernel to numerically evaluate an integral for each element in the discretized problem domain. Since the domain size can be huge, executing efficient kernels is fundamental. Their optimization is, however, a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions make it hard to determine a single or unique sequence of successful transformations. Therefore, we present the design and systematic evaluation of COFFEE, a domain-specific compiler for local assembly kernels. COFFEE manipulates abstract syntax trees generated from a high-level domain-specific language for PDEs by introducing domain-aware composable optimizations aimed at improving instruction-level parallelism, especially SIMD vectorization, and register locality. It then generates C code including vector intrinsics. Experiments using a range of finite-element forms of increasing complexity show that significant performance improvement is achieved.

Submitted 4 July, 2014; v1 submitted 3 July, 2014; originally announced July 2014.

Comments: Remove volume metadata

ACM Class: G.1.8; G.4

arXiv:1111.2259 [pdf, other]

A Survey on Open Problems for Mobile Robots

Authors: Alberto Bandettini, Fabio Luporini, Giovanni Viglietta

Abstract: Gathering mobile robots is a widely studied problem in robotic research. This survey first introduces the related work, summarizing models and results. Then, the focus shifts on the open problem of gathering fat robots. In this context, "fat" means that the robot is not represented by a point in a bidimensional space, but it has an extent. Moreover, it can be opaque in the sense that other robots cannot "see through" it. All these issues lead to a redefinition of the original problem and an extension of the CORDA model. For at most 4 robots an algorithm is provided in the literature, but is gathering always possible for n>4 fat robots? Another open problem is considered: Boundary Patrolling by mobile robots. A set of mobile robots with constraints only on speed and visibility is working in a polygonal environment having boundary and possibly obstacles. The robots have to perform a perpetual movement (possibly within the environment) so that the maximum timespan in which a point of the boundary is not being watched by any robot is minimized.

Submitted 7 November, 2011; originally announced November 2011.

Comments: 28 pages, 4 figures

Showing 1–21 of 21 results for author: Luporini, F