Portable performance on heterogeneous architectures

Published: 16 March 2013

Abstract

    Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of programs across all types of resources in the machine.
    To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that allows the space of possible implementations to be searched at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures.
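The core idea the abstract describes — determining the best mapping of algorithms to hardware empirically, at installation time — can be sketched in a few lines of Python. This is an illustrative toy, far simpler than the paper's actual autotuning framework: it times each candidate implementation of a kernel on the installed machine and keeps the fastest. All function and parameter names here (`prefix_sum_loop`, `autotune`, `trials`, and so on) are invented for the example, not taken from the paper.

```python
import timeit
from itertools import accumulate

# Two hypothetical candidate implementations of the same kernel,
# standing in for the algorithmic choices the autotuner searches over.
def prefix_sum_loop(xs):
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

def prefix_sum_accumulate(xs):
    return list(accumulate(xs))

def autotune(candidates, make_input, trials=5):
    """Empirically pick the fastest candidate on this machine.

    Times each implementation on representative input and returns
    the name of the winner -- the 'mapping determined empirically'.
    """
    data = make_input()
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        # Take the minimum over several runs to reduce timing noise.
        t = min(timeit.repeat(lambda: fn(data), number=1, repeat=trials))
        if t < best_time:
            best_name, best_time = name, t
    return best_name

choice = autotune(
    {"loop": prefix_sum_loop, "accumulate": prefix_sum_accumulate},
    make_input=lambda: list(range(100_000)),
)
print("selected:", choice)
```

A real autotuner additionally searches over processor placement (CPU vs. GPU), memory layouts, and compositions of techniques into poly-algorithms, and the winner legitimately differs from machine to machine — which is exactly why the selection is deferred to installation time rather than fixed at compile time.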




        Published In

ASPLOS '13: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2013, 574 pages
ISBN: 9781450318709
DOI: 10.1145/2451116

Also appears in:
• ACM SIGARCH Computer Architecture News, Volume 41, Issue 1 (ASPLOS '13), March 2013, 540 pages. ISSN: 0163-5964. DOI: 10.1145/2490301
• ACM SIGPLAN Notices, Volume 48, Issue 4 (ASPLOS '13), April 2013, 540 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/2499368
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. autotuning
        2. compilers
        3. gpgpu
        4. heterogeneous

        Qualifiers

        • Research-article

        Conference

        ASPLOS '13

        Acceptance Rates

        Overall Acceptance Rate 535 of 2,713 submissions, 20%


        Cited By

• (2022) Mapping parallelism in a functional IR through constraint satisfaction: a case study on convolution for mobile GPUs. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction (CC '22), 218-230. DOI: 10.1145/3497776.3517777. Published 19 Mar 2022.
• (2022) OptCL: A Middleware to Optimise Performance for High Performance Domain-Specific Languages on Heterogeneous Platforms. In Algorithms and Architectures for Parallel Processing, 772-791. DOI: 10.1007/978-3-030-95391-1_48. Published 23 Feb 2022.
• (2021) A performance predictor for implementation selection of parallelized static and temporal graph algorithms. Concurrency and Computation: Practice and Experience, 34(2). DOI: 10.1002/cpe.6267. Published 26 Apr 2021.
• (2020) AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems using Integer Linear Programming. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '20), 875-890. DOI: 10.1145/3373376.3378465. Published 9 Mar 2020.
• (2019) NoT. The Journal of Supercomputing, 75(7), 3810-3841. DOI: 10.1007/s11227-019-02749-1. Published 1 Jul 2019.
• (2019) Mozart: Efficient Composition of Library Functions for Heterogeneous Execution. In Languages and Compilers for Parallel Computing, 182-202. DOI: 10.1007/978-3-030-35225-7_13. Published 15 Nov 2019.
• (2018) Floem. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI '18), 663-679. DOI: 10.5555/3291168.3291217. Published 8 Oct 2018.
• (2018) Automatic Matching of Legacy Code to Heterogeneous APIs. ACM SIGPLAN Notices, 53(2), 139-153. DOI: 10.1145/3296957.3173182. Published 19 Mar 2018.
• (2018) Unconventional Parallelization of Nondeterministic Applications. ACM SIGPLAN Notices, 53(2), 432-447. DOI: 10.1145/3296957.3173181. Published 19 Mar 2018.
• (2018) Predictable Thread Coarsening. ACM Transactions on Architecture and Code Optimization, 15(2), 1-26. DOI: 10.1145/3194242. Published 12 Jun 2018.
