A comparative investigation of device-specific mechanisms for exploiting HPC accelerators

Published: 07 February 2015
Abstract

    Computational accelerators have improved considerably in recent years. Intel's MIC (Many Integrated Core) architecture and the two major GPU architectures, NVIDIA's Kepler and AMD's Graphics Core Next, all represent real innovations in the field of HPC. Using OpenCL as a single, unified programming interface, this paper reports a careful study of a carefully chosen selection of such devices. A micro-benchmark suite is designed and implemented to investigate how well each accelerator can exploit parallelism in OpenCL. Our results expose the relationship between several programming aspects and their possible impact on performance. Instruction-level parallelism, intra-kernel vector parallelism, multiple issue, work-group size, instruction scheduling, and a variety of other aspects are explored, highlighting interactions that must be carefully considered when developing applications for heterogeneous architectures. Evidence-based findings about microarchitectural features and performance characteristics are cross-checked against the compiled code actually being executed. Finally, a case study involving a real application is presented to verify these findings.
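    The abstract describes the methodology only at a high level, but the flavor of such a micro-benchmark can be sketched in OpenCL C. The kernel below is a minimal, hypothetical illustration, not code from the paper's suite: it runs ILP independent dependency chains per work-item, so measured throughput should grow with ILP until the device's issue width or latency-hiding capacity is saturated. The knobs ILP and ITERS and the kernel name are assumptions for illustration.

        /* A minimal sketch of an ILP probe in OpenCL C; the knobs ILP and
           ITERS and the kernel name are assumptions, not the paper's code. */
        #ifndef ILP
        #define ILP 4        /* independent dependency chains per work-item */
        #endif
        #ifndef ITERS
        #define ITERS 4096   /* iterations of the timed inner loop */
        #endif

        __kernel void ilp_probe(__global float *out, const float seed)
        {
            float acc[ILP];
            for (int i = 0; i < ILP; ++i)
                acc[i] = seed + (float)i;             /* independent start values */

            for (int it = 0; it < ITERS; ++it)
                for (int i = 0; i < ILP; ++i)         /* chains never read each other */
                    acc[i] = acc[i] * 1.0001f + 0.5f; /* one multiply-add per chain */

            /* Fold the chains into one value so the compiler cannot
               eliminate the timed loop as dead code. */
            float sum = 0.0f;
            for (int i = 0; i < ILP; ++i)
                sum += acc[i];
            out[get_global_id(0)] = sum;
        }

    Rebuilding with, for example, -D ILP=8 and timing the kernel at several work-group sizes is one plausible way to expose the multiple-issue and latency-hiding behavior the abstract refers to.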



        Reviews

        Reviewer: Khaled Hamidouche

        The performance of different high-performance computing (HPC) accelerator/coprocessor devices is evaluated and compared in this well-written paper. It analyzes the behavior of the Intel Xeon Phi, NVIDIA K20c, and AMD FirePro S9000 using the Open Computing Language (OpenCL) framework. To enable a fine-grained evaluation, the authors propose and develop FeatureBench, a benchmark suite. The comparison considers only a single-accelerator configuration, however. Even though I agree with the authors that OpenCL is a portable framework and probably the best fit for this evaluation, it does not offer the productivity features available to hybrid message passing interface (MPI+X) models on HPC systems, such as CUDA-aware MPI and OpenACC-aware MPI. I wish this aspect had been addressed in the discussion and comparison. Also, it is not clear how the comparison between hardware-accelerated and non-hardware-accelerated transcendental operations is performed: is it through different benchmarks and application programming interface (API) calls, or through compiler options? Finally, what about memory bandwidth and behavior? An analysis and comparison of memory bandwidth and cache effects would have been a welcome contribution.

        Online Computing Reviews Service
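        For context on the reviewer's question about transcendentals: OpenCL itself exposes both routes, precise built-ins such as sin and exp, device-native variants such as native_sin and native_exp, and the -cl-fast-relaxed-math build option, which lets the compiler substitute the native variants. A minimal sketch of how such a comparison could be set up follows; the kernel names are illustrative, and this is one plausible harness, not the paper's actual code.

            /* Illustrative OpenCL C kernels contrasting a precise built-in with
               its device-native variant; an assumed setup, not the paper's. */
            __kernel void sin_precise(__global const float *in, __global float *out)
            {
                size_t gid = get_global_id(0);
                out[gid] = sin(in[gid]);        /* full-precision built-in */
            }

            __kernel void sin_native(__global const float *in, __global float *out)
            {
                size_t gid = get_global_id(0);
                out[gid] = native_sin(in[gid]); /* maps to special-function hardware
                                                   where the device provides it */
            }

        Timing both kernels, plus the precise one rebuilt with -cl-fast-relaxed-math, would separate the API-call route from the compiler-option route the reviewer asks about.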



        Published In

        GPGPU-8: Proceedings of the 8th Workshop on General Purpose Processing using GPUs
        February 2015
        120 pages
        ISBN: 9781450334075
        DOI: 10.1145/2716282
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Author Tags

        1. GCN
        2. GPGPU
        3. Kepler
        4. MIC
        5. OpenCL

        Qualifiers

        • Research-article

        Conference

        GPGPU-8

        Acceptance Rates

        Overall Acceptance Rate 57 of 129 submissions, 44%

