SaC/C formulations of the all-pairs N-body problem and their performance on SMPs and GPGPUs

Authors:

Artjoms Šinkarovs,

Sven-Bodo Scholz,

Robert Bernecky,

Clemens Grelck

Concurrency and Computation: Practice & Experience, Volume 26, Issue 4

Pages 952 - 971

https://doi.org/10.1002/cpe.3078

Published: 25 March 2014

Abstract

This paper describes our experience in implementing the classical N-body algorithm in SaC and analysing the runtime performance achieved on three different machines: a dual-processor 8-core Dell PowerEdge 2950 (a Beowulf cluster node, the reference machine), a quad-core hyper-threaded Intel Core-i7 based system equipped with an NVidia GTX-480 graphics accelerator, and an Oracle Sparc T4-4 server with a total of 256 hardware threads. We contrast our findings with those resulting from the reference C code and a few variants of it that employ OpenMP pragmas as well as explicit vectorisation. Our experiments demonstrate that the SaC implementation successfully combines a high level of abstraction, very close to the mathematical specification, with very competitive runtimes. In fact, SaC matches or outperforms the hand-vectorised and hand-parallelised C codes on all three systems under investigation without the need for any source code modification. Furthermore, only SaC is able to effectively harness the advanced compute power of the graphics accelerator, again by mere recompilation of the same source code. Our results illustrate the benefits that SaC provides to application programmers in terms of coding productivity, source-code and performance portability across different machine architectures, and long-term maintainability in evolving hardware environments. Copyright © 2013 John Wiley & Sons, Ltd.

Published In

Concurrency and Computation: Practice & Experience, Volume 26, Issue 4

March 2014

169 pages

ISSN: 1532-0626

Publisher

John Wiley and Sons Ltd., United Kingdom

Cited By

Grelck C, Wiesinger H (2019). Persistent Asynchronous Adaptive Specialization for Generic Array Programming. International Journal of Parallel Programming 47:2, pp. 164-183. doi:10.1007/s10766-018-0567-9. Online publication date: 15-May-2019.
https://dl.acm.org/doi/10.1007/s10766-018-0567-9

Matsuda M, Fukuda K, Maruyama N (2018). A Portability Layer of an All-pairs Operation for Hierarchical N-Body Algorithm Framework Tapas. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, pp. 241-250. doi:10.1145/3149457.3149471. Online publication date: 28-Jan-2018.
https://dl.acm.org/doi/10.1145/3149457.3149471

Pachev I, Lupo C (2017). GPUMap. Proceedings of the 7th Workshop on Python for High-Performance and Scientific Computing, pp. 1-10. doi:10.1145/3149869.3149875. Online publication date: 12-Nov-2017.
https://dl.acm.org/doi/10.1145/3149869.3149875

Šinkarovs A, Scholz S (2016). Type-driven data layouts for improved vectorisation. Concurrency and Computation: Practice & Experience 28:7, pp. 2092-2119. doi:10.1002/cpe.3501. Online publication date: 1-May-2016.
https://dl.acm.org/doi/10.1002/cpe.3501

Grelck C (2015). Single Assignment C (SAC). Central European Functional Programming School, pp. 207-282. doi:10.1007/978-3-030-28346-9_7. Online publication date: 6-Jul-2015.
https://dl.acm.org/doi/10.1007/978-3-030-28346-9_7

Loidl H, Singer J (2014). SICSA multicore challenge editorial preface. Concurrency and Computation: Practice & Experience 26:4, pp. 929-934. doi:10.1002/cpe.3077. Online publication date: 25-Mar-2014.
https://dl.acm.org/doi/10.1002/cpe.3077

Grelck C, Wiesinger H, Plasmeijer R (2013). Next Generation Asynchronous Adaptive Specialization for Data-Parallel Functional Array Processing in SAC. Proceedings of the 25th symposium on Implementation and Application of Functional Languages, pp. 117-128. doi:10.1145/2620678.2620690. Online publication date: 28-Aug-2013.
https://dl.acm.org/doi/10.1145/2620678.2620690
