DOI: 10.1145/2451116.2451162
ASPLOS Conference Proceedings · Research article

Portable performance on heterogeneous architectures

Published: 16 March 2013

Abstract

Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of programs across all types of resources in the machine.
To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that allows the space of possible implementations to be searched at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures.
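The empirical search described above can be sketched in miniature: enumerate every (algorithm, device) configuration, time each on representative input, and keep the fastest. This is an illustrative stand-in, not the paper's actual PetaBricks API; the algorithm implementations, the `DEVICES` list, and all function names here are hypothetical, and a real tuner would also enumerate GPU placements and memory mappings.

```python
import itertools
import time

def sort_builtin(data):
    # one algorithmic choice: the library sort
    return sorted(data)

def sort_insertion(data):
    # another choice: insertion sort (competitive only on tiny inputs)
    out = []
    for x in data:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

ALGORITHMS = {"builtin": sort_builtin, "insertion": sort_insertion}
DEVICES = ["cpu"]  # a real tuner would also enumerate GPU placements

def timed_run(fn, data):
    # time one execution on a fresh copy of the input
    start = time.perf_counter()
    fn(list(data))
    return time.perf_counter() - start

def autotune(sample, trials=3):
    """Time every (algorithm, device) configuration empirically and
    return the fastest one as an (algorithm_name, device) pair."""
    best_cfg, best_time = None, float("inf")
    for name, dev in itertools.product(ALGORITHMS, DEVICES):
        elapsed = min(timed_run(ALGORITHMS[name], sample)
                      for _ in range(trials))
        if elapsed < best_time:
            best_cfg, best_time = (name, dev), elapsed
    return best_cfg

if __name__ == "__main__":
    # tune once on representative input, e.g. at installation time
    print("chosen configuration:", autotune(list(range(2000, 0, -1))))
```

Because the choice is made by measurement rather than by a fixed heuristic, the same program can select different configurations on machines with different relative CPU/GPU performance, which is the portability argument the abstract makes.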


Cited By

  • IRIS: A Performance-Portable Framework for Cross-Platform Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems 35(10):1796-1809, Oct 2024. DOI: 10.1109/TPDS.2024.3429010
  • Mapping parallelism in a functional IR through constraint satisfaction: a case study on convolution for mobile GPUs. Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, pp. 218-230, Mar 2022. DOI: 10.1145/3497776.3517777
  • OptCL: A Middleware to Optimise Performance for High Performance Domain-Specific Languages on Heterogeneous Platforms. Algorithms and Architectures for Parallel Processing, pp. 772-791, Feb 2022. DOI: 10.1007/978-3-030-95391-1_48
  • A performance predictor for implementation selection of parallelized static and temporal graph algorithms. Concurrency and Computation: Practice and Experience 34(2), Apr 2021. DOI: 10.1002/cpe.6267
  • AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems using Integer Linear Programming. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 875-890, Mar 2020. DOI: 10.1145/3373376.3378465
  • NoT. The Journal of Supercomputing 75(7):3810-3841, Jul 2019. DOI: 10.1007/s11227-019-02749-1
  • Mozart: Efficient Composition of Library Functions for Heterogeneous Execution. Languages and Compilers for Parallel Computing, pp. 182-202, Nov 2019. DOI: 10.1007/978-3-030-35225-7_13
  • Floem. Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, pp. 663-679, Oct 2018. DOI: 10.5555/3291168.3291217
  • Automatic Matching of Legacy Code to Heterogeneous APIs. ACM SIGPLAN Notices 53(2):139-153, Mar 2018. DOI: 10.1145/3296957.3173182
  • Unconventional Parallelization of Nondeterministic Applications. ACM SIGPLAN Notices 53(2):432-447, Mar 2018. DOI: 10.1145/3296957.3173181



      Published In

      ASPLOS '13: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems
      March 2013, 574 pages
      ISBN: 9781450318709
      DOI: 10.1145/2451116

      Also published in:
      • ACM SIGARCH Computer Architecture News, Volume 41, Issue 1 (ASPLOS '13), March 2013, 540 pages. ISSN: 0163-5964. DOI: 10.1145/2490301
      • ACM SIGPLAN Notices, Volume 48, Issue 4 (ASPLOS '13), April 2013, 540 pages. ISSN: 0362-1340, EISSN: 1558-1160. DOI: 10.1145/2499368
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. autotuning
      2. compilers
      3. gpgpu
      4. heterogeneous


      Conference

      ASPLOS '13

      Acceptance Rates

      Overall Acceptance Rate 535 of 2,713 submissions, 20%

