Portable performance on heterogeneous architectures

Published: 16 March 2013

Abstract

    Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of programs across all types of resources in the machine.
    To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that allows the space of possible implementations to be searched at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures.
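The core idea the abstract describes — determining the best mapping of algorithms to hardware empirically, at installation time — can be sketched in a few lines of Python. This is an illustrative toy, far simpler than the paper's actual autotuning framework: it times each candidate implementation of a kernel on the installed machine and keeps the fastest. All function and parameter names here (`prefix_sum_loop`, `autotune`, `trials`, and so on) are invented for the example, not taken from the paper.

```python
import timeit
from itertools import accumulate

# Two hypothetical candidate implementations of the same kernel,
# standing in for the algorithmic choices the autotuner searches over.
def prefix_sum_loop(xs):
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

def prefix_sum_accumulate(xs):
    return list(accumulate(xs))

def autotune(candidates, make_input, trials=5):
    """Empirically pick the fastest candidate on this machine.

    Times each implementation on representative input and returns
    the name of the winner -- the 'mapping determined empirically'.
    """
    data = make_input()
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        # Take the minimum over several runs to reduce timing noise.
        t = min(timeit.repeat(lambda: fn(data), number=1, repeat=trials))
        if t < best_time:
            best_name, best_time = name, t
    return best_name

choice = autotune(
    {"loop": prefix_sum_loop, "accumulate": prefix_sum_accumulate},
    make_input=lambda: list(range(100_000)),
)
print("selected:", choice)
```

A real autotuner additionally searches over processor placement (CPU vs. GPU), memory layouts, and compositions of techniques into poly-algorithms, and the winner legitimately differs from machine to machine — which is exactly why the selection is deferred to installation time rather than fixed at compile time.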




        Published In

ASPLOS '13: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2013, 574 pages
ISBN: 9781450318709
DOI: 10.1145/2451116

Also appears in:
• ACM SIGARCH Computer Architecture News, Volume 41, Issue 1 (ASPLOS '13), March 2013, 540 pages. ISSN: 0163-5964. DOI: 10.1145/2490301
• ACM SIGPLAN Notices, Volume 48, Issue 4 (ASPLOS '13), April 2013, 540 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/2499368
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. autotuning
        2. compilers
        3. gpgpu
        4. heterogeneous

        Qualifiers

        • Research-article

        Conference

        ASPLOS '13

        Acceptance Rates

        Overall Acceptance Rate 535 of 2,713 submissions, 20%


        Cited By

• (2022) Mapping parallelism in a functional IR through constraint satisfaction: a case study on convolution for mobile GPUs. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction (CC '22), 218-230. DOI: 10.1145/3497776.3517777. Published 19 Mar 2022.
• (2022) OptCL: A Middleware to Optimise Performance for High Performance Domain-Specific Languages on Heterogeneous Platforms. In Algorithms and Architectures for Parallel Processing, 772-791. DOI: 10.1007/978-3-030-95391-1_48. Published 23 Feb 2022.
• (2021) A performance predictor for implementation selection of parallelized static and temporal graph algorithms. Concurrency and Computation: Practice and Experience, 34(2). DOI: 10.1002/cpe.6267. Published 26 Apr 2021.
• (2020) AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems using Integer Linear Programming. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '20), 875-890. DOI: 10.1145/3373376.3378465. Published 9 Mar 2020.
• (2019) NoT. The Journal of Supercomputing, 75(7), 3810-3841. DOI: 10.1007/s11227-019-02749-1. Published 1 Jul 2019.
• (2019) Mozart: Efficient Composition of Library Functions for Heterogeneous Execution. In Languages and Compilers for Parallel Computing, 182-202. DOI: 10.1007/978-3-030-35225-7_13. Published 15 Nov 2019.
• (2018) Floem. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI '18), 663-679. DOI: 10.5555/3291168.3291217. Published 8 Oct 2018.
• (2018) Automatic Matching of Legacy Code to Heterogeneous APIs. ACM SIGPLAN Notices, 53(2), 139-153. DOI: 10.1145/3296957.3173182. Published 19 Mar 2018.
• (2018) Unconventional Parallelization of Nondeterministic Applications. ACM SIGPLAN Notices, 53(2), 432-447. DOI: 10.1145/3296957.3173181. Published 19 Mar 2018.
• (2018) Predictable Thread Coarsening. ACM Transactions on Architecture and Code Optimization, 15(2), 1-26. DOI: 10.1145/3194242. Published 12 Jun 2018.
