GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models

Deakin, Tom; Price, James; Martineau, Matt; McIntosh-Smith, Simon

doi:10.1007/978-3-319-46079-6_34

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9945)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Many scientific codes consist of memory-bandwidth-bound kernels: the dominating factor of the runtime is the speed at which data can be loaded from memory into the Arithmetic Logic Units, before results are written back to memory. One major advantage of many-core devices such as General Purpose Graphics Processing Units (GPGPUs) and the Intel Xeon Phi is their focus on providing increased memory bandwidth over traditional CPU architectures. However, as with CPUs, this peak memory bandwidth is usually unachievable in practice, and so benchmarks are required to measure a practical upper bound on expected performance.

The choice of one programming model over another should ideally not limit the performance that can be achieved on a device. GPU-STREAM has been updated to incorporate a wide variety of the latest parallel programming models, all implementing the same parallel scheme. As such, this tool can be used as a kind of Rosetta Stone, providing an array of achievable memory bandwidth results that spans both platforms and programming models.
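To make the idea of "achievable memory bandwidth" concrete, the following is a minimal sketch of how a STREAM-style benchmark might account for bytes moved and convert a timing into a bandwidth figure. It is not the GPU-STREAM source: the kernel names and per-kernel array counts are assumed from McCalpin's original STREAM definitions, and all function names (`bytes_moved`, `bandwidth_gb_per_s`, `triad`, `run_triad`) are hypothetical. Because it runs in pure Python, the reported number reflects interpreter overhead rather than hardware limits; only the accounting is illustrative.

```python
import time
from array import array

# Arrays read or written per element by each STREAM-style kernel.
# Assumption: taken from McCalpin's STREAM definitions, not from this page.
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def bytes_moved(kernel, n, elem_size=8):
    """Bytes counted for one pass of `kernel` over arrays of n elements."""
    return ARRAYS_TOUCHED[kernel] * n * elem_size


def bandwidth_gb_per_s(kernel, n, seconds, elem_size=8):
    """Achieved bandwidth in GB/s (10^9 bytes per second)."""
    return bytes_moved(kernel, n, elem_size) / seconds / 1e9


def triad(a, b, c, scalar):
    """Triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def run_triad(n=1_000_000, repeats=5, scalar=0.4):
    """Time the triad kernel several times and report the best bandwidth."""
    a = array("d", [0.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [1.0]) * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        triad(a, b, c, scalar)
        # Keep the fastest run, as STREAM-style benchmarks report the best rate.
        best = min(best, time.perf_counter() - start)
    return bandwidth_gb_per_s("triad", n, best, a.itemsize)


if __name__ == "__main__":
    print(f"Triad: {run_triad():.3f} GB/s")
```

Comparing the figure such a loop reports against a device's quoted peak bandwidth gives the fraction of peak actually achieved, which is the kind of practical upper bound the abstract refers to.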

Notes

1.
https://github.com/clang-ykt.

Acknowledgements

We would like to thank Cray Inc. for providing access to the Cray XC40 supercomputer, Swan, and the Cray CS cluster, Falcon. Our thanks to Codeplay for access to the ComputeCpp SYCL compiler and to Douglas Miles at PGI (NVIDIA) for access to the PGI compiler. We would also like to thank the University of Bristol Intel Parallel Computing Center (IPCC). This work was carried out using the computational facilities of the Advanced Computing Research Centre, University of Bristol - http://www.bris.ac.uk/acrc/. Thanks also go to the University of Oxford for access to the Power 8 system.

Author information

Authors and Affiliations

Department of Computer Science, University of Bristol, Bristol, UK
Tom Deakin, James Price, Matt Martineau & Simon McIntosh-Smith

Corresponding author

Correspondence to Tom Deakin.

Editor information

Editors and Affiliations

University of Delaware, Newark, Delaware, USA
Michela Taufer
Forschungszentrum Jülich, Jülich, Germany
Bernd Mohr
DKRZ, Hamburg, Germany
Julian M. Kunkel

About this paper

Cite this paper

Deakin, T., Price, J., Martineau, M., McIntosh-Smith, S. (2016). GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_34

DOI: https://doi.org/10.1007/978-3-319-46079-6_34
Published: 06 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46078-9
Online ISBN: 978-3-319-46079-6
eBook Packages: Computer Science, Computer Science (R0)