DOI: 10.1145/2628071.2628088
research-article

Adaptive heterogeneous scheduling for integrated GPUs

Published: 24 August 2014

Abstract

Many processors today integrate a CPU and GPU on the same die, which allows them to share resources such as physical memory and lowers the cost of CPU-GPU communication. As a consequence, programmers can effectively use both the CPU and GPU to execute a single application. This paper presents novel adaptive scheduling techniques for integrated CPU-GPU processors. We present two online profiling-based scheduling algorithms: naïve and asymmetric. Our asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of data-parallel kernels between the CPU and GPU without input from application developers. It profiles on the CPU and GPU in a way that does not penalize GPU-centric workloads, i.e., those that run significantly faster on the GPU. It adapts to application characteristics by addressing: 1) load imbalance due to irregularity caused by, e.g., data-dependent control flow, 2) different amounts of work on each kernel call, and 3) multiple kernels with different characteristics. Unlike many existing approaches, which primarily target NVIDIA discrete GPUs, our scheduling algorithm does not require offline processing.
We evaluate our asymmetric scheduling algorithm on a desktop system with an Intel 4th generation Core processor using a set of sixteen regular and irregular workloads from diverse application areas. On average, our asymmetric scheduling algorithm performs within 3.2% of the maximum throughput achieved by a perfect CPU-and-GPU oracle that always chooses the ideal work partitioning between the CPU and GPU. These results underscore the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.
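
To make the idea of profile-based work partitioning concrete, the following is a minimal sketch and not the paper's naïve or asymmetric algorithm, whose details the abstract does not give. It assumes an idealized regular workload in which each device processes items at a constant rate that a short profiling run has already measured. The function names (`gpu_fraction`, `partition`, `makespan`, `oracle_share`) and the example rates are hypothetical and chosen for illustration only. The sketch splits work in proportion to measured throughput and compares the result against a brute-force "oracle" that tries many splits.

```python
def gpu_fraction(cpu_rate, gpu_rate):
    """Share of work for the GPU so both devices finish at the same time.

    Assumes constant per-device throughput (items per second), which is an
    idealization; irregular workloads would violate it.
    """
    return gpu_rate / (cpu_rate + gpu_rate)


def partition(n_items, cpu_rate, gpu_rate):
    """Split n_items into (cpu_items, gpu_items) by measured throughput."""
    gpu_items = round(n_items * gpu_fraction(cpu_rate, gpu_rate))
    return n_items - gpu_items, gpu_items


def makespan(n_items, gpu_share, cpu_rate, gpu_rate):
    """Completion time when both devices run concurrently.

    The slower of the two devices determines when the kernel finishes.
    """
    gpu_items = n_items * gpu_share
    cpu_items = n_items - gpu_items
    return max(cpu_items / cpu_rate, gpu_items / gpu_rate)


def oracle_share(n_items, cpu_rate, gpu_rate, steps=100):
    """Brute-force stand-in for an oracle: try evenly spaced GPU shares
    and keep the one with the smallest completion time."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda s: makespan(n_items, s, cpu_rate, gpu_rate))


if __name__ == "__main__":
    # Hypothetical profiled rates: the GPU is three times faster than the CPU.
    cpu_rate, gpu_rate = 100.0, 300.0
    cpu_items, gpu_items = partition(1000, cpu_rate, gpu_rate)
    best = oracle_share(1000, cpu_rate, gpu_rate)
    print(cpu_items, gpu_items, best)
```

Under these assumptions the proportional split and the brute-force search agree. The gap to an oracle reported in the abstract comes from effects this model ignores, such as profiling overhead, irregular work, and varying work per kernel call.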



Published In

PACT '14: Proceedings of the 23rd international conference on Parallel architectures and compilation
August 2014
514 pages
ISBN: 9781450328098
DOI: 10.1145/2628071
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. heterogeneous computing
  2. integrated gpus
  3. irregular applications
  4. load balancing
  5. scheduling

Qualifiers

  • Research-article

Conference

PACT '14
Sponsor:
  • IFIP WG 10.3
  • SIGARCH
  • IEEE CS TCPP
  • IEEE CS TCAA

Acceptance Rates

PACT '14 Paper Acceptance Rate 54 of 144 submissions, 38%;
Overall Acceptance Rate 121 of 471 submissions, 26%

