
MDR: performance model driven runtime for heterogeneous parallel platforms

Published: 31 May 2011

Abstract

We present a runtime framework for the execution of workloads represented as parallel-operator directed acyclic graphs (PO-DAGs) on heterogeneous multi-core platforms. PO-DAGs combine coarse-grained parallelism at the graph level with fine-grained parallelism within each node, lending themselves naturally to exploiting the intra- and inter-processing-element parallelism present in heterogeneous platforms. We identify four important criteria: Suitability, Locality, Availability, and Criticality (SLAC). We show that a heterogeneous runtime framework must consider all four in order to achieve good performance under varying application and platform characteristics.
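As a rough illustration, the hypothetical C++ sketch below (not the paper's actual API; all names are assumptions) shows how a PO-DAG node might be represented: each node carries one operator variant per processing-element type, each exploiting fine-grained parallelism internally, while the successor list expresses the coarse-grained dependencies of the graph.

// Hypothetical sketch of a PO-DAG node (illustrative only; names are assumptions).
#include <functional>
#include <vector>

enum class PEKind { CPU, GPU };

struct PODAGNode {
    // One variant of the parallel operator per PE type; each variant exploits
    // fine-grained parallelism internally (e.g., threads on the CPU, a GPU kernel).
    std::function<void()> cpu_impl;
    std::function<void()> gpu_impl;

    // Coarse-grained, graph-level parallelism: a successor becomes ready once all
    // of its predecessors have completed.
    std::vector<PODAGNode*> successors;
    int pending_predecessors = 0;
};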
The proposed model-driven runtime (MDR) considers all of these factors, and the tradeoffs among them, by utilizing performance models. These performance models drive key runtime decisions such as the mapping of tasks to processing elements (PEs), the scheduling of tasks on each PE, and the copying of data between memory spaces.
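For illustration only, the hypothetical sketch below (again, not MDR's published algorithm; field and function names are assumptions) shows one way performance-model estimates could fold the SLAC factors into these decisions: Suitability, Locality, and Availability feed a predicted completion time used to map a task to a PE, while Criticality can be used to dispatch critical-path tasks first.

// Hypothetical mapping decision driven by performance-model estimates
// (illustrative only).
#include <cstddef>
#include <limits>
#include <vector>

struct PEEstimate {
    double exec_time;   // Suitability: modeled execution time of the task on this PE
    double copy_time;   // Locality: time to copy missing inputs into this PE's memory
    double free_at;     // Availability: when the PE will have drained its queued work
};

// Map the task to the PE with the earliest predicted completion time.
int choose_pe(const std::vector<PEEstimate>& pes) {
    int best = -1;
    double best_finish = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < pes.size(); ++i) {
        double finish = pes[i].free_at + pes[i].copy_time + pes[i].exec_time;
        if (finish < best_finish) {
            best_finish = finish;
            best = static_cast<int>(i);
        }
    }
    return best;
}

// Criticality: when several tasks are ready, a scheduler could order them by the
// length of their remaining downstream (critical-path) work and dispatch the most
// critical task first, so that long dependency chains are not delayed behind
// off-path work.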
We discuss the software architecture and implementation of MDR, and evaluate it using several benchmark programs on three different heterogeneous platforms that contain multi-core CPUs and GPUs; the platforms represent server-, laptop-, and netbook-class systems. MDR achieves up to 4.2X speedup (1.5X on average) over the best of CPU-only, GPU-only, round-robin, GPU-first, and utilization-driven schedulers. We also perform a sensitivity analysis that establishes the importance of considering all four SLAC criteria in order to achieve high-performance execution in a heterogeneous runtime framework.

Published In

ICS '11: Proceedings of the international conference on Supercomputing
May 2011
398 pages
ISBN: 9781450301022
DOI: 10.1145/1995896

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. gpus
  2. heterogeneous platforms
  3. many-core
  4. multi-core
  5. parallel computing
  6. performance model
  7. runtime system

Qualifiers

  • Research-article

Conference

ICS '11: International Conference on Supercomputing
May 31 - June 4, 2011
Tucson, Arizona, USA

Acceptance Rates

Overall acceptance rate: 629 of 2,180 submissions (29%)


