Supervisors: Erik Lindahl and Berk Hess
Visiting address: Science for Life Laboratory, Tomtebodavägen 23A, 171 65 Solna, Sweden
Mailing address: Science for Life Laboratory, Box 1031, 171 21 Solna, Sweden
Long-range lattice summation techniques such as the particle-mesh Ewald (PME) algorithm for electrostatics have been revolutionary to the precision and accuracy of molecular simulations in general. Despite the performance penalty associated with lattice summation electrostatics, few biomolecular simulations today are performed without it. There are increasingly strong arguments for moving in the same direction for Lennard-Jones (LJ) interactions, and by using geometric approximations of the combination rules in reciprocal space, we have been able to make a very high-performance implementation available in GROMACS. Here, we present a new way to correct for these approximations to achieve exact treatment of Lorentz–Berthelot combination rules within the cutoff; only a very small approximation error remains outside the cutoff (a part that would be completely ignored without LJ-PME). This not only improves accuracy by almost an order of magnitude but also achieves absolute biomolecular simulation performance that is an order of magnitude faster than any other available lattice summation technique for LJ interactions. The implementation includes both CPU and GPU acceleration, and in combination with improved scaling, LJ-PME simulations now provide performance close to that of the truncated-potential methods in GROMACS, but with much higher accuracy.
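To make the approximation concrete, the sketch below contrasts the two combination rules mentioned in the abstract: Lorentz–Berthelot (arithmetic mean for sigma, geometric mean for epsilon) versus purely geometric means, which are SIMD- and FFT-friendly in reciprocal space. The function names and example parameter values are illustrative, not taken from the GROMACS source.

```python
import math

def lorentz_berthelot(sigma_i, sigma_j, eps_i, eps_j):
    """Lorentz-Berthelot rules: arithmetic mean for sigma,
    geometric mean for epsilon."""
    return 0.5 * (sigma_i + sigma_j), math.sqrt(eps_i * eps_j)

def geometric(sigma_i, sigma_j, eps_i, eps_j):
    """Geometric rules (the reciprocal-space approximation):
    geometric mean for both parameters."""
    return math.sqrt(sigma_i * sigma_j), math.sqrt(eps_i * eps_j)

# The two rules agree when sigma_i == sigma_j and differ otherwise;
# that difference is what the real-space correction must cancel
# inside the cutoff (hypothetical parameter values, in nm / kJ/mol).
sig_lb, eps_lb = lorentz_berthelot(0.30, 0.35, 0.8, 0.5)
sig_geo, eps_geo = geometric(0.30, 0.35, 0.8, 0.5)
```

Note that epsilon is combined identically under both rules; only the sigma combination introduces an error, which is largest for particle pairs with very different sizes.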
GROMACS is one of the most widely used open-source and free software codes in chemistry, used primarily for dynamical simulations of biomolecules. It provides a rich set of calculation types as well as preparation and analysis tools. Several advanced techniques for free-energy calculations are supported. In version 5, it reaches new performance heights through several new and enhanced parallelization algorithms. These work on every level: SIMD registers inside cores, multithreading, heterogeneous CPU–GPU acceleration, state-of-the-art 3D domain decomposition, and ensemble-level parallelization through built-in replica exchange and the separate Copernicus framework. The latest best-in-class compressed trajectory storage format is supported.
The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware, from commodity workstations to high-performance computing clusters. Hardware features are well exploited with a combination of SIMD, multi-threading, and MPI-based SPMD/MPMD parallelism, while GPUs can be used as accelerators to compute interactions offloaded from the CPU. Here we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Though hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer-class GPUs this improvement is equally reflected in the performance-to-price ratio. Although memory issues in consumer-class GPUs could pass unnoticed since these cards do not support ECC memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants of cost-efficiency, like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime of a few years until replacement, the costs of electrical power and cooling can exceed the costs of the hardware itself. Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime.
GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of parallelization on all levels, combined with a constant focus on absolute performance. Release 4.6 of GROMACS uses SIMD acceleration on a wide range of architectures, GPU offloading acceleration, and both OpenMP and MPI parallelism within and between nodes, respectively. The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of neighbor searching, and we discuss the present and future challenges we see for exascale simulation, in particular very fine-grained task parallelism. We also discuss the software management, code peer review, and continuous integration testing required for a project of this complexity.
Calculating interactions or correlations between pairs of particles is typically the most time-consuming task in particle simulation or correlation analysis. Straightforward implementations using a double loop over particle pairs have traditionally worked well, especially since compilers usually do a good job of unrolling the inner loop. In order to reach high performance on modern CPU and accelerator architectures, single-instruction multiple-data (SIMD) parallelization has become essential. Avoiding memory bottlenecks is also increasingly important and requires reducing the ratio of memory to arithmetic operations. Moreover, when pairs only interact within a certain cut-off distance, good SIMD utilization can only be achieved by reordering input and output data, which quickly becomes a limiting factor. Here we present an algorithm for SIMD parallelization based on grouping a fixed number of particles, e.g. 2, 4, or 8, into spatial clusters. Calculating all interactions between particles in a pair of such clusters improves data reuse compared to the traditional scheme and results in a more efficient SIMD parallelization. Adjusting the cluster size allows the algorithm to map to SIMD units of various widths. This flexibility not only enables fast and efficient implementation on current CPUs and accelerator architectures like GPUs or Intel MIC, but it also makes the algorithm future-proof. We present the algorithm with an application to molecular dynamics simulations, where we can also make use of the effective buffering the method introduces.
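The cluster-pair idea can be illustrated with a minimal NumPy sketch, assuming clusters of 4 particles: all cluster-size × cluster-size distances are evaluated branch-free, and pairs beyond the cut-off are masked to zero. This trades a few wasted FLOPs for regular, SIMD-friendly memory access; the function name and the toy inverse-square "force" are illustrative, not the actual GROMACS kernels.

```python
import numpy as np

CLUSTER = 4  # particles per spatial cluster (maps onto the SIMD width)

def cluster_pair_forces(xi, xj, cutoff):
    """Toy forces between all particle pairs of two clusters.

    xi, xj: (CLUSTER, 3) position arrays.
    Returns a (CLUSTER, CLUSTER, 3) array of pairwise force vectors;
    pairs beyond the cutoff are masked to zero rather than branched on,
    mirroring how a SIMD kernel keeps all lanes busy.
    """
    d = xi[:, None, :] - xj[None, :, :]           # (CLUSTER, CLUSTER, 3)
    r2 = np.sum(d * d, axis=-1)                   # squared distances
    mask = r2 < cutoff * cutoff
    inv_r2 = np.where(mask, 1.0 / np.maximum(r2, 1e-12), 0.0)
    return d * inv_r2[..., None]                  # toy 1/r^2 pair forces
```

Because every lane computes the same operations on contiguous cluster data, one cluster's coordinates are loaded once and reused against all particles of the other cluster, which is the data-reuse gain the abstract describes.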
Motivation: Molecular simulation has historically been a low-throughput technique, but faster computers and increasing amounts of genomic and structural data are changing this by enabling large-scale automated simulation of, for instance, many conformers or mutants of biomolecules with or without a range of ligands. At the same time, advances in performance and scaling now make it possible to model complex biomolecular interaction and function in a manner directly testable by experiment. These applications share a need for fast and efficient software that can be deployed on massive scale in clusters, web servers, distributed computing, or cloud resources.
Results: Here, we present a range of new simulation algorithms and features developed over the last four years, leading up to the GROMACS 4.5 software package. The software now automatically handles wide classes of biomolecules such as proteins, nucleic acids, and lipids and comes with all commonly used force fields for these molecules built-in. GROMACS supports several implicit solvent models as well as new free energy algorithms, and the software now employs multithreading for efficient parallelization even on low-end systems, including Windows-based workstations. Together with hand-tuned assembly kernels and state-of-the-art parallelization, this provides extremely high performance and cost efficiency for high-throughput as well as massively parallel simulations.
Availability: GROMACS is open source and free software, and available from http://www.gromacs.org.
Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis
Abstract: Several GPU-based algorithms have been developed to accelerate biomolecular simulations, but although they provide benefits over single-core implementations, they have not been able to surpass the performance of state-of-the-art SIMD CPU implementations (…)
The core goal of parallel computing is to speed up computations by executing independent computational tasks concurrently ("in parallel") on multiple units in a processor, on multiple processors in a computer, or on multiple networked computers, which may even be spread across large geographical scales (distributed and grid computing); it is the dominant principle behind "supercomputing" or "high-performance computing". For several decades, the density of transistors on a computer chip has doubled every 18–24 months ("Moore's Law"); until recently, this rate could be translated directly into a corresponding increase in a processor's clock frequency, and thus into an automatic performance gain for sequential programs. However, since a processor's power consumption also increases with its clock frequency, this strategy of "frequency scaling" ultimately became unsustainable: since 2004, clock frequencies have remained essentially stable, and additional transistors have primarily been used to build multiple processors on a single chip (multi-core processors). Therefore, today every kind of software (not only scientific software) must be written in a parallel style to profit from newer computer hardware.
The rapid evolution of CUDA GPU architecture and the new heterogeneous platforms that break the hegemony of x86 offer opportunities for performance optimizations, but also pose challenges for scalable heterogeneous parallelization of the GROMACS molecular simulation package. This session will present our latest efforts to harness recent CUDA architectures to improve the algorithmic efficiency and performance of our molecular dynamics kernels. We will also discuss load-balancing and latency-hiding challenges emphasized by the expansion of GPU-accelerated platforms with CPUs ranging from power-optimized ARM architectures to extreme-performance, highly multi-threaded Power and Xeon CPUs. Come learn about our experiences in developing portable, heterogeneous, high-performance code!
Molecular dynamics (MD) simulations are able to provide a wealth of detailed information about biological systems, enabling studies impossible to perform in a laboratory. As such, MD simulations provide insight into important biological processes like protein folding and the mechanism of viral infections, and aid modern drug design.
Typical MD simulations require billions of time-steps, at each step evaluating forces between tens to hundreds of thousands of particles. This represents a floating-point bottleneck and suggests that the problem is well-suited for HPC. However, the iterative nature of MD algorithms means that speeding up simulations requires reducing the calculation time of a single time-step. As the amount of parallelism is limited by the range of relevant problem sizes, improving absolute performance requires strong scaling. Highly optimized MD codes like GROMACS reach iteration rates in the millisecond range, which renders MD strongly latency-sensitive and presents unique challenges for efficient parallelization.
The HPC landscape is rapidly changing toward increasing on-chip parallelism as well as heterogeneity through emerging accelerator platforms like GPUs or Intel MIC.
The GROMACS MD package is known for its high performance thanks to hand-tuned SIMD kernels and state-of-the-art parallel algorithms. However, making efficient use of modern architectures required a substantial redesign of the previously MPI-only parallelization, as well as entirely new algorithms for wide SIMD units and massive accelerator parallelism.
Here we present our recent work on multi-level heterogeneous parallelization of MD implemented in GROMACS. We have developed new, highly efficient SIMD algorithms for pair force calculation targeting current and future architectures ranging from CPUs to GPUs to FPGAs. Thanks to recent efforts, all compute-intensive code in GROMACS uses SIMD acceleration to maximize single-core CPU performance. OpenMP multithreading allows combining a set of cores with GPUs for node-level heterogeneous parallelization. At the top level, MPI-based neutral-territory domain decomposition ensures scaling across compute nodes. To maximize the utilization of all compute resources, we employ multi-level, inter- and intra-node dynamic load balancing. The latest version of GROMACS not only provides a dramatic improvement in strong scaling, down to tens of particles per core at peak, but is also able to efficiently utilize all compute units in heterogeneous HPC hardware, with absolute simulation performance ranging from hundreds of nanoseconds to a microsecond per day.
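The latency sensitivity described above stems from the strictly sequential dependence between time-steps. A minimal velocity-Verlet sketch makes this explicit: each step consumes the positions, velocities, and forces produced by the previous one, so the only way to simulate faster is to make this single iteration cheaper. This is a textbook integrator for illustration, not GROMACS's actual implementation.

```python
def velocity_verlet_step(x, v, f, mass, dt, force_fn):
    """One velocity-Verlet time-step for a single degree of freedom.

    The output of each call is the input of the next, so steps cannot
    be executed concurrently; parallelism is only available *within*
    a step (i.e. in force_fn), which is why MD demands strong scaling.
    """
    v_half = v + 0.5 * dt * f / mass   # half-kick with old forces
    x_new = x + dt * v_half            # drift
    f_new = force_fn(x_new)            # the expensive part of every step
    v_new = v_half + 0.5 * dt * f_new / mass  # half-kick with new forces
    return x_new, v_new, f_new
```

For a harmonic oscillator (force_fn = lambda x: -x), iterating this step conserves total energy to high accuracy, while the wall-clock cost is dominated by the force evaluation, exactly the term the SIMD and GPU kernels discussed here accelerate.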
GROMACS is a state-of-the-art molecular simulation package that employs extensive multi-level heterogeneous parallelization. Our new CUDA-based algorithms provide a 4x speedup over hand-tuned CPU SIMD assembly, and unprecedented absolute performance. However, the heterogeneity of hardware and the inherent bottlenecks involved make efficient resource utilization and strong scaling very challenging. This advanced session describes our recent efforts on multi-level load balancing, kernel execution strategies, CPU-GPU work splitting, and ways to exploit Kepler features such as Hyper-Q. Join us to talk about the current limits of GPU acceleration in MD, and how to take molecular dynamics simulations to 0.1 milliseconds/iteration, equivalent to 10,000 fps, in the near future!
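A quick back-of-the-envelope calculation connects these iteration rates to simulation throughput. Assuming a typical 2 fs time-step (an assumption, not stated in the abstract), 0.1 ms of wall time per step, i.e. 10,000 steps per second, yields on the order of a microsecond of trajectory per day:

```python
def ns_per_day(wall_ms_per_step, timestep_fs=2.0):
    """Simulated nanoseconds per day of wall time, given the wall-clock
    cost of one MD step and the (assumed) integration time-step."""
    steps_per_day = 86400.0 / (wall_ms_per_step * 1e-3)  # seconds/day / s per step
    return steps_per_day * timestep_fs * 1e-6            # fs -> ns

rate = ns_per_day(0.1)  # 0.1 ms/step, 2 fs time-step -> 1728 ns/day
```

This is why sub-millisecond iteration times are the prerequisite for the hundreds-of-nanoseconds-to-a-microsecond-per-day performance figures quoted elsewhere on this page.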
Molecular Dynamics is an important application for GPU acceleration, but many algorithmic optimizations and features still rely on code that prefers traditional CPUs. It is only with the latest hardware and software that we have been able to realize a heterogeneous GPU/CPU implementation and reach performance significantly beyond the state of the art of hand-tuned CPU code in our GROMACS program. The sub-millisecond iteration time poses challenges on all levels of parallelization. Come learn about our new atom-cluster pair interaction approach for non-bonded force evaluation, which achieves 60% work-efficiency, and other innovative solutions for heterogeneous GPU systems.
Papers by Szilárd Páll
Talks by Szilárd Páll
Conference Presentations by Szilárd Páll
Typical MD simulations require billions of time steps, at each step evaluating forces between tens to hundreds of thousands of particles. This represents a floating-point bottleneck and suggests that the problem is well suited for HPC. However, the iterative nature of MD algorithms means that speeding up simulations requires reducing the calculation time of a time step. As the amount of parallelism is limited by the range of relevant problem sizes, improving absolute performance requires strong scaling. Highly optimized MD codes like GROMACS reach iteration rates in the millisecond range, which renders MD strongly latency-sensitive and presents unique challenges for efficient parallelization.
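To see why per-step latency dominates: even at a one-millisecond iteration, a billion steps take on the order of eleven days of wall time. A back-of-the-envelope helper (a sketch with assumed names, not GROMACS code):

```cpp
#include <cmath>

// Total wall-clock days for a simulation campaign, given the number
// of MD steps and the wall-clock cost per step in milliseconds.
double wall_days(double n_steps, double ms_per_step) {
    double seconds = n_steps * ms_per_step * 1.0e-3;
    return seconds / 86400.0;  // 86400 s per day
}
```

One billion steps at 1 ms/step is about 11.6 days, so halving the iteration time saves nearly a week per campaign; this is why strong scaling, not just throughput, is the relevant target.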
The HPC landscape is rapidly changing toward increasing on-chip parallelism as well as heterogeneity through emerging accelerator platforms like GPUs or Intel MIC.
The GROMACS MD package is known for its high performance thanks to hand-tuned SIMD kernels and state-of-the-art parallel algorithms. However, making efficient use of modern architectures required a substantial redesign of the previously MPI-only parallelization, as well as entirely new algorithms for wide SIMD units and massive accelerator parallelism.
Here we present our recent work on multi-level heterogeneous parallelization of MD implemented in GROMACS. We have developed new, highly efficient SIMD algorithms for pair force calculation targeting current and future architectures ranging from CPUs to GPUs to FPGAs. Thanks to recent efforts, all compute-intensive code in GROMACS uses SIMD acceleration to maximize single-core CPU performance. OpenMP multithreading allows combining a set of cores with GPUs for node-level heterogeneous parallelization. At the top level, MPI-based neutral-territory domain decomposition ensures scaling across compute nodes. To maximize the utilization of all compute resources, we employ multi-level, inter- and intra-node dynamic load balancing. The latest version of GROMACS not only provides a dramatic improvement in strong scaling, down to tens of particles per core at peak, but is also able to efficiently utilize all compute units in heterogeneous HPC hardware, with absolute simulation performance ranging from hundreds of nanoseconds to a microsecond per day.
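The layered decomposition described above can be sketched as a nested work split (a minimal illustration with assumed names, not the GROMACS domain-decomposition code): the global pair list is first divided among MPI ranks, each rank's share among OpenMP threads, and each thread walks its chunk in SIMD-width blocks.

```cpp
#include <algorithm>

// Half-open index range [begin, end) of work items assigned to one worker.
struct Range {
    int begin;
    int end;
};

// Even partition of n_items into n_parts, with the remainder spread
// over the first parts. Applying this twice -- once per MPI rank,
// then per OpenMP thread within a rank -- yields the nested split.
Range partition(int n_items, int n_parts, int part) {
    int base = n_items / n_parts;
    int rem  = n_items % n_parts;
    int begin = part * base + std::min(part, rem);
    int end   = begin + base + (part < rem ? 1 : 0);
    return {begin, end};
}
```

In practice such a static split is only the starting point; the dynamic load balancing mentioned above shifts boundaries at runtime as per-domain cost varies.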