A comparative investigation of device-specific mechanisms for exploiting HPC accelerators

Published: 07 February 2015
Abstract

    Computational accelerators have improved considerably in recent years. Intel's MIC (Many Integrated Core) architecture and the two major GPU architectures, NVIDIA's Kepler and AMD's Graphics Core Next, all represent real innovations in the field of HPC. Using OpenCL as a single, unified programming interface, this paper reports a careful study of a carefully chosen selection of such devices. A micro-benchmark suite is designed and implemented to investigate how well each accelerator can exploit parallelism in OpenCL. Our results expose the relationship between several programming aspects and their possible impact on performance. Instruction-level parallelism, intra-kernel vector parallelism, multiple issue, work-group size, instruction scheduling, and a variety of other aspects are explored, highlighting interactions that must be carefully considered when developing applications for heterogeneous architectures. Evidence-based findings about microarchitectural features and performance characteristics are cross-checked against the compiled code actually being executed. Finally, a case study involving a real application is presented to verify these findings.
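    The abstract describes the methodology only at a high level, but the flavor of such a micro-benchmark can be sketched in OpenCL C. The kernel below is a minimal, hypothetical illustration, not code from the paper's suite: it runs ILP independent dependency chains per work-item, so measured throughput should grow with ILP until the device's issue width or latency-hiding capacity is saturated. The knobs ILP and ITERS and the kernel name are assumptions for illustration.

        /* A minimal sketch of an ILP probe in OpenCL C; the knobs ILP and
           ITERS and the kernel name are assumptions, not the paper's code. */
        #ifndef ILP
        #define ILP 4        /* independent dependency chains per work-item */
        #endif
        #ifndef ITERS
        #define ITERS 4096   /* iterations of the timed inner loop */
        #endif

        __kernel void ilp_probe(__global float *out, const float seed)
        {
            float acc[ILP];
            for (int i = 0; i < ILP; ++i)
                acc[i] = seed + (float)i;             /* independent start values */

            for (int it = 0; it < ITERS; ++it)
                for (int i = 0; i < ILP; ++i)         /* chains never read each other */
                    acc[i] = acc[i] * 1.0001f + 0.5f; /* one multiply-add per chain */

            /* Fold the chains into one value so the compiler cannot
               eliminate the timed loop as dead code. */
            float sum = 0.0f;
            for (int i = 0; i < ILP; ++i)
                sum += acc[i];
            out[get_global_id(0)] = sum;
        }

    Rebuilding with, for example, -D ILP=8 and timing the kernel at several work-group sizes is one plausible way to expose the multiple-issue and latency-hiding behavior the abstract refers to.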



        Reviews

        Reviewer: Khaled Hamidouche

        The performance of different high-performance computing (HPC) accelerator/coprocessor devices is evaluated and compared in this well-written paper. It analyzes the behavior of the Intel Xeon Phi, NVIDIA K20c, and AMD FirePro S9000 using the Open Computing Language (OpenCL) framework. To enable a fine-grained evaluation, the authors propose and develop FeatureBench, a benchmark suite. The comparison considers only a single-accelerator configuration, however. Even though I agree with the authors that OpenCL is a portable framework and probably the best fit for this evaluation, it does not offer the productivity features available to hybrid message passing interface (MPI+X) models on HPC systems, such as CUDA-aware MPI and OpenACC-aware MPI. I wish this aspect had been addressed in the discussion and comparison. Also, it is not clear how the comparison between hardware-accelerated and non-hardware-accelerated transcendental operations is performed: is it through different benchmarks and application programming interface (API) calls, or through compiler options? Finally, what about memory bandwidth and behavior? An analysis and comparison of memory bandwidth and cache effects would have been a welcome contribution.

        Online Computing Reviews Service
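        For context on the reviewer's question about transcendentals: OpenCL itself exposes both routes, precise built-ins such as sin and exp, device-native variants such as native_sin and native_exp, and the -cl-fast-relaxed-math build option, which lets the compiler substitute the native variants. A minimal sketch of how such a comparison could be set up follows; the kernel names are illustrative, and this is one plausible harness, not the paper's actual code.

            /* Illustrative OpenCL C kernels contrasting a precise built-in with
               its device-native variant; an assumed setup, not the paper's. */
            __kernel void sin_precise(__global const float *in, __global float *out)
            {
                size_t gid = get_global_id(0);
                out[gid] = sin(in[gid]);        /* full-precision built-in */
            }

            __kernel void sin_native(__global const float *in, __global float *out)
            {
                size_t gid = get_global_id(0);
                out[gid] = native_sin(in[gid]); /* maps to special-function hardware
                                                   where the device provides it */
            }

        Timing both kernels, plus the precise one rebuilt with -cl-fast-relaxed-math, would separate the API-call route from the compiler-option route the reviewer asks about.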



        Published In

        GPGPU-8: Proceedings of the 8th Workshop on General Purpose Processing using GPUs
        February 2015
        120 pages
        ISBN: 9781450334075
        DOI: 10.1145/2716282
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Author Tags

        1. GCN
        2. GPGPU
        3. Kepler
        4. MIC
        5. OpenCL

        Qualifiers

        • Research-article

        Conference

        GPGPU-8

        Acceptance Rates

        Overall Acceptance Rate 57 of 129 submissions, 44%

