DOI: 10.1145/2872362.2872373
Research article

DySel: Lightweight Dynamic Selection for Kernel-based Data-parallel Programming Model

Published: 25 March 2016

Abstract

The rising pressure for simultaneously improving performance and reducing power is driving more diversity into all aspects of computing devices. An algorithm that is well-matched to the target hardware can run multiple times faster and more energy efficiently than one that is not. The problem is complicated by the fact that a program's input also affects the appropriate choice of algorithm. As a result, software developers have been faced with the challenge of determining the appropriate algorithm for each potential combination of target device and data. This paper presents DySel, a novel runtime system for automating such determination for kernel-based data parallel programming models such as OpenCL, CUDA, OpenACC, and C++AMP. These programming models cover many applications that demand high performance in mobile, cloud and high-performance computing. DySel systematically deploys candidate kernels on a small portion of the actual data to determine which achieves the best performance for the hardware-data combination. The test-deployment, referred to as micro-profiling, contributes to the final execution result and incurs less than 8% of overhead in the worst observed case when compared to an oracle. We show four major use cases where DySel provides significantly more consistent performance without tedious effort from the developer.
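
The micro-profiling idea described in the abstract can be illustrated with a minimal sketch. This is not DySel's API or implementation: the function `micro_profile_and_run`, the two candidate kernels, and the `chunk_size` parameter are all hypothetical names chosen for illustration, and plain Python functions stand in for OpenCL or CUDA kernels. The sketch assumes each candidate is timed on its own small slice of the real input, that those slice results are kept as part of the final output, and that the fastest candidate then processes the remaining data.

```python
import time


def square_comprehension(chunk):
    # Hypothetical candidate kernel 1: element-wise square.
    return [x * x for x in chunk]


def square_loop(chunk):
    # Hypothetical candidate kernel 2: same result, different code shape.
    result = []
    for x in chunk:
        result.append(x * x)
    return result


def micro_profile_and_run(candidates, data, chunk_size):
    """Time each candidate on its own small slice of the real data,
    keep those results, then run the fastest candidate on the rest.

    Assumption: every candidate computes the same function, so the
    profiled slices can be used directly in the final output.
    """
    out = [None] * len(data)
    timings = {}
    pos = 0
    for name, kernel in candidates.items():
        end = min(pos + chunk_size, len(data))
        if pos == end:
            break  # no data left to profile this candidate on
        start = time.perf_counter()
        out[pos:end] = kernel(data[pos:end])  # profiled work is kept
        timings[name] = time.perf_counter() - start
        pos = end
    if not timings:
        return None, out  # empty input: nothing was profiled
    best = min(timings, key=timings.get)
    out[pos:] = candidates[best](data[pos:])  # remaining data
    return best, out


if __name__ == "__main__":
    kernels = {"comprehension": square_comprehension, "loop": square_loop}
    chosen, result = micro_profile_and_run(kernels, list(range(1000)), 16)
    print("selected kernel:", chosen)
```

Because the profiled slices produce real output, the only cost of selection in this sketch is running slower candidates on their small slices, which is the intuition behind the low overhead the abstract reports. A real runtime would also have to handle timing noise and kernels with side effects, which this sketch ignores.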


      Published In

      ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
      March 2016
      824 pages
      ISBN:9781450340915
      DOI:10.1145/2872362
      General Chair: Tom Conte
      Program Chair: Yuanyuan Zhou
      Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. dynamic profiling
      2. graphics processing unit

      Qualifiers

      • Research-article

      Conference

      ASPLOS '16

      Acceptance Rates

      ASPLOS '16 Paper Acceptance Rate 53 of 232 submissions, 23%;
      Overall Acceptance Rate 535 of 2,713 submissions, 20%

      Article Metrics

      • Downloads (last 12 months): 8
      • Downloads (last 6 weeks): 1
      Reflects downloads up to 01 Nov 2024

      Cited By

      • (2022) Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 228-244. DOI: 10.1109/MICRO56248.2022.00029. Online publication date: Oct-2022
      • (2019) Automatic generation of warp-level primitives and atomic instructions for fast and portable parallel reduction on GPUs. Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, pages 73-84. DOI: 10.5555/3314872.3314884. Online publication date: 16-Feb-2019
      • (2019) Poly: Efficient Heterogeneous System and Application Management for Interactive Applications. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 199-210. DOI: 10.1109/HPCA.2019.00038. Online publication date: Feb-2019
      • (2019) Automatic Generation of Warp-Level Primitives and Atomic Instructions for Fast and Portable Parallel Reduction on GPUs. 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 73-84. DOI: 10.1109/CGO.2019.8661187. Online publication date: Feb-2019
      • (2017) Efficient and Portable ALS Matrix Factorization for Recommender Systems. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 409-418. DOI: 10.1109/IPDPSW.2017.91. Online publication date: May-2017
      • (2016) Efficient kernel synthesis for performance portable programming. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1-13. DOI: 10.5555/3195638.3195653. Online publication date: 15-Oct-2016
      • (2016) A programming system for future proofing performance critical libraries. ACM SIGPLAN Notices, 51:8, pages 1-2. DOI: 10.1145/3016078.2851178. Online publication date: 27-Feb-2016
      • (2016) A programming system for future proofing performance critical libraries. Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 1-2. DOI: 10.1145/2851141.2851178. Online publication date: 27-Feb-2016
      • (2016) Efficient kernel synthesis for performance portable programming. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1-13. DOI: 10.1109/MICRO.2016.7783715. Online publication date: Oct-2016
      • (2023) Development of Healthcare Architecture based on Cloud Technology and IoT Applications. 2023 7th International Conference on Computing Methodologies and Communication (ICCMC), pages 1385-1388. DOI: 10.1109/ICCMC56507.2023.10084029. Online publication date: 23-Feb-2023
