DySel: Lightweight Dynamic Selection for Kernel-based Data-parallel Programming Model

Authors:

Wen-mei W. Hwu

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems

Pages 667 - 680

https://doi.org/10.1145/2872362.2872373

Published: 25 March 2016

Abstract

The rising pressure for simultaneously improving performance and reducing power is driving more diversity into all aspects of computing devices. An algorithm that is well matched to the target hardware can run multiple times faster and more energy efficiently than one that is not. The problem is complicated by the fact that a program's input also affects the appropriate choice of algorithm. As a result, software developers face the challenge of determining the appropriate algorithm for each potential combination of target device and data. This paper presents DySel, a novel runtime system that automates this determination for kernel-based data-parallel programming models such as OpenCL, CUDA, OpenACC, and C++ AMP. These programming models cover many applications that demand high performance in mobile, cloud, and high-performance computing. DySel systematically deploys candidate kernels on a small portion of the actual data to determine which achieves the best performance for the hardware-data combination. This test deployment, referred to as micro-profiling, contributes to the final execution result and incurs less than 8% overhead in the worst observed case when compared to an oracle. We show four major use cases where DySel provides significantly more consistent performance without tedious effort from the developer.
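The micro-profiling idea in the abstract can be illustrated with a minimal sketch. This is not DySel's API or implementation: the function `micro_profile_and_run`, the two toy candidate kernels, and the `profile_fraction` parameter are all hypothetical names chosen for illustration, and plain Python functions stand in for device kernels. The sketch assumes each candidate can be launched on an arbitrary index range and that per-element wall-clock time is a reasonable proxy for performance.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple

# A "kernel" here is any function that fills dst[start:end] from src[start:end].
Kernel = Callable[[Sequence[float], List[float], int, int], None]


def kernel_loop(src, dst, start, end):
    """Hypothetical candidate 1: element-wise square with an explicit loop."""
    for i in range(start, end):
        dst[i] = src[i] * src[i]


def kernel_slice(src, dst, start, end):
    """Hypothetical candidate 2: same result via a comprehension over a slice."""
    dst[start:end] = [x * x for x in src[start:end]]


def micro_profile_and_run(
    kernels: Sequence[Kernel],
    src: Sequence[float],
    profile_fraction: float = 0.05,
) -> Tuple[List[float], Optional[Kernel]]:
    """Profile each candidate on a small disjoint chunk of the real input,
    then finish the remaining work with the fastest candidate."""
    n = len(src)
    dst = [0.0] * n
    if n == 0 or not kernels:
        return dst, None

    chunk = max(1, int(n * profile_fraction))
    timings = []  # (seconds per element, kernel)
    cursor = 0
    for kernel in kernels:
        end = min(n, cursor + chunk)
        if cursor >= end:
            break  # input too small to profile every candidate
        t0 = time.perf_counter()
        kernel(src, dst, cursor, end)  # profiled work is kept, not discarded
        elapsed = time.perf_counter() - t0
        timings.append((elapsed / (end - cursor), kernel))
        cursor = end

    # Pick the candidate with the lowest measured cost per element.
    best = min(timings, key=lambda t: t[0])[1]
    if cursor < n:
        best(src, dst, cursor, n)
    return dst, best
```

Because every profiled chunk writes into the same output buffer, the profiling runs contribute to the final result rather than being thrown away, which is the property the abstract attributes to micro-profiling. Which candidate wins depends on the machine and the input, so the selected kernel may differ from run to run.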



Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments
      2. Source code generation


Published In

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems

March 2016

824 pages

ISBN: 9781450340915

DOI: 10.1145/2872362

General Chair: Tom Conte, Georgia Tech, USA

Program Chair: Yuanyuan Zhou, University of California, San Diego, USA

ACM SIGPLAN Notices, Volume 51, Issue 4 (ASPLOS '16)
April 2016
774 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2954679
Editor: Andy Gill, University of Kansas, Lawrence, KS

ACM SIGARCH Computer Architecture News, Volume 44, Issue 2 (ASPLOS '16)
May 2016
774 pages
ISSN: 0163-5964
DOI: 10.1145/2980024
Editor: Doug DeGroot, acm dot org

Copyright © 2016 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ASPLOS '16

ASPLOS '16: Architectural Support for Programming Languages and Operating Systems

April 2 - 6, 2016

Atlanta, Georgia, USA

Acceptance Rates

ASPLOS '16 Paper Acceptance Rate 53 of 232 submissions, 23%

Overall Acceptance Rate 535 of 2,713 submissions, 20%
