
MDR: performance model driven runtime for heterogeneous parallel platforms

Published: 31 May 2011

Abstract

We present a runtime framework for the execution of workloads represented as parallel-operator directed acyclic graphs (PO-DAGs) on heterogeneous multi-core platforms. PO-DAGs combine coarse-grained parallelism at the graph level with fine-grained parallelism within each node, lending themselves naturally to exploiting the intra- and inter-processing-element parallelism present in heterogeneous platforms. We identify four important criteria: Suitability, Locality, Availability, and Criticality (SLAC). We show that a heterogeneous runtime framework must consider all four in order to achieve good performance under varying application and platform characteristics.
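As a rough illustration, the hypothetical C++ sketch below (not the paper's actual API; all names are assumptions) shows how a PO-DAG node might be represented: each node carries one operator variant per processing-element type, each exploiting fine-grained parallelism internally, while the successor list expresses the coarse-grained dependencies of the graph.

// Hypothetical sketch of a PO-DAG node (illustrative only; names are assumptions).
#include <functional>
#include <vector>

enum class PEKind { CPU, GPU };

struct PODAGNode {
    // One variant of the parallel operator per PE type; each variant exploits
    // fine-grained parallelism internally (e.g., threads on the CPU, a GPU kernel).
    std::function<void()> cpu_impl;
    std::function<void()> gpu_impl;

    // Coarse-grained, graph-level parallelism: a successor becomes ready once all
    // of its predecessors have completed.
    std::vector<PODAGNode*> successors;
    int pending_predecessors = 0;
};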
The proposed model-driven runtime (MDR) considers all of these factors, and the tradeoffs among them, by utilizing performance models. These performance models drive key runtime decisions such as the mapping of tasks to processing elements (PEs), the scheduling of tasks on each PE, and the copying of data between memory spaces.
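For illustration only, the hypothetical sketch below (again, not MDR's published algorithm; field and function names are assumptions) shows one way performance-model estimates could fold the SLAC factors into these decisions: Suitability, Locality, and Availability feed a predicted completion time used to map a task to a PE, while Criticality can be used to dispatch critical-path tasks first.

// Hypothetical mapping decision driven by performance-model estimates
// (illustrative only).
#include <cstddef>
#include <limits>
#include <vector>

struct PEEstimate {
    double exec_time;   // Suitability: modeled execution time of the task on this PE
    double copy_time;   // Locality: time to copy missing inputs into this PE's memory
    double free_at;     // Availability: when the PE will have drained its queued work
};

// Map the task to the PE with the earliest predicted completion time.
int choose_pe(const std::vector<PEEstimate>& pes) {
    int best = -1;
    double best_finish = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < pes.size(); ++i) {
        double finish = pes[i].free_at + pes[i].copy_time + pes[i].exec_time;
        if (finish < best_finish) {
            best_finish = finish;
            best = static_cast<int>(i);
        }
    }
    return best;
}

// Criticality: when several tasks are ready, a scheduler could order them by the
// length of their remaining downstream (critical-path) work and dispatch the most
// critical task first, so that long dependency chains are not delayed behind
// off-path work.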
We discuss the software architecture and implementation of MDR, and evaluate it using several benchmark programs on three different heterogeneous platforms that contain multi-core CPUs and GPUs; the platforms represent server-, laptop-, and netbook-class systems. MDR achieves up to 4.2X speedup (1.5X on average) over the best of CPU-only, GPU-only, round-robin, GPU-first, and utilization-driven schedulers. We also perform a sensitivity analysis that establishes the importance of considering all four SLAC criteria in order to achieve high-performance execution in a heterogeneous runtime framework.

Published In

ICS '11: Proceedings of the international conference on Supercomputing
May 2011
398 pages
ISBN: 9781450301022
DOI: 10.1145/1995896

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. gpus
  2. heterogeneous platforms
  3. many-core
  4. multi-core
  5. parallel computing
  6. performance model
  7. runtime system

Qualifiers

  • Research-article

Conference

ICS '11: International Conference on Supercomputing
May 31 - June 4, 2011
Tucson, Arizona, USA

Acceptance Rates

Overall acceptance rate: 629 of 2,180 submissions (29%)


