research-article
Open access

Analysis of dependence tracking algorithms for task dataflow execution

Published: 01 December 2013
  Abstract

    Processor architectures have taken a turn toward many-core processors, which integrate multiple processing cores on a single chip to increase overall performance, and there are no signs that this trend will stop in the near future. Many-core processors are harder to program than multicore and single-core processors because they require parallel or concurrent programs with high degrees of parallelism. Moreover, many-core processors have to operate in a strong-scaling regime because of memory bandwidth constraints. Under strong scaling, ever finer-grained parallelism must be extracted to keep all processing cores busy.
    Task dataflow programming models have high potential to simplify parallel programming because they relieve the programmer of precisely identifying all intertask dependences when writing programs. Instead, the task dataflow runtime system detects and enforces intertask dependences during execution, based on a description of the memory accessed by each task. The runtime constructs a task dataflow graph that captures all tasks and their dependences, and it schedules tasks to execute in parallel while respecting the dependences recorded in that graph.
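    To make the mechanism concrete, here is a minimal Python sketch of how a runtime could derive intertask dependences from `in`, `out`, and `inout` annotations alone. It is an illustration under stated assumptions, not the article's implementation: memory objects are identified by hashable keys, each object appears at most once per task, commutative in/out and reductions are ignored, and the names `DepTracker` and `spawn` are hypothetical.

```python
from collections import defaultdict


class DepTracker:
    """Hypothetical sketch: dependence detection from in/out/inout annotations.

    For every memory object (any hashable key, an assumption) the tracker
    remembers the last task that wrote it and the tasks that read it since.
    """

    def __init__(self):
        self.last_writer = {}             # object -> last task that wrote it
        self.readers = defaultdict(list)  # object -> readers since the last write
        self.preds = {}                   # task -> set of predecessor tasks (graph edges)

    def spawn(self, task, accesses):
        """Record a task in program order; return the earlier tasks it must wait for.

        `accesses` is a list of (object, mode) pairs, mode in {"in", "out", "inout"}.
        """
        deps = set()
        for obj, mode in accesses:
            writer = self.last_writer.get(obj)
            if writer is not None:
                deps.add(writer)                # read-after-write or write-after-write
            if mode in ("out", "inout"):
                deps.update(self.readers[obj])  # write-after-read
        # Update per-object state only after all dependences have been collected.
        for obj, mode in accesses:
            if mode == "in":
                self.readers[obj].append(task)
            else:
                self.last_writer[obj] = task
                self.readers[obj] = []
        self.preds[task] = deps
        return deps


tracker = DepTracker()
program = [
    ("t0", [("A", "out")]),
    ("t1", [("A", "in")]),
    ("t2", [("A", "in")]),
    ("t3", [("A", "inout")]),
    ("t4", [("A", "in"), ("B", "out")]),
]
for name, accesses in program:
    print(name, "waits for", sorted(tracker.spawn(name, accesses)))
```

    Here `t1` and `t2` depend only on `t0` and may run in parallel, `t3` waits for `t0`, `t1`, and `t2`, and `t4` waits for `t3`. In this sketch every reader contributes one edge to the next writer, so the number of stored edges grows with the number of readers.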
    Several papers report significant overheads for task dataflow systems, which severely limit the scalability and usability of such systems. In this article, we study efficient schemes to manage task graphs and analyze their scalability. We assume a programming model that supports input, output, and in/out annotations on task arguments, as well as commutative in/out and reductions. We analyze the structure of task graphs and identify versions and generations as key concepts for efficient management of task graphs. Then, we present three schemes to manage task graphs, building on graph, hypergraph, and list representations. We also consider a fourth, edgeless scheme that synchronizes tasks using integers. Analysis using microbenchmarks shows that the graph representation is not always scalable and that the edgeless scheme introduces the least overhead in nearly all situations.
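    The idea of synchronizing tasks with integers instead of edges can be sketched as well, again only as a hypothetical illustration. We assume here that a generation is a maximal run of consecutive tasks that may access an object concurrently (a run of `in` accesses, or a single `out`/`inout` access); that reading, and all names below (`Obj`, `spawn`, `is_ready`, `complete`, `run_in_waves`), are our assumptions rather than the article's definitions. Each object keeps a few integers, and a task is ready once every object it accesses has reached the task's generation, so no edge lists are stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(eq=False)
class Obj:
    """Per-object state: integers and a mode, no edge lists (hypothetical layout)."""
    name: str
    issued_gen: int = 0              # generation given to the most recent task
    active_gen: int = 1              # oldest generation not yet completed
    last_mode: Optional[str] = None  # mode of the most recent access
    pending: Dict[int, int] = field(default_factory=dict)  # generation -> unfinished tasks


@dataclass(eq=False)
class Task:
    name: str
    args: List[Tuple[Obj, str, int]]  # (object, mode, generation)


def spawn(name, accesses):
    """Assign a generation per argument; consecutive 'in' accesses share one."""
    args = []
    for obj, mode in accesses:
        joinable = (mode == "in" and obj.last_mode == "in"
                    and obj.issued_gen >= obj.active_gen)  # reader run still open
        if not joinable:
            obj.issued_gen += 1
        gen = obj.issued_gen
        obj.pending[gen] = obj.pending.get(gen, 0) + 1
        obj.last_mode = mode
        args.append((obj, mode, gen))
    return Task(name, args)


def is_ready(task):
    # One integer comparison per argument replaces inspecting incoming edges.
    return all(obj.active_gen == gen for obj, _, gen in task.args)


def complete(task):
    for obj, _, gen in task.args:
        obj.pending[gen] -= 1
        if obj.pending[gen] == 0:  # last task of the active generation finished
            del obj.pending[gen]
            obj.active_gen += 1


def run_in_waves(tasks):
    """Sequential simulation: each wave lists tasks that could run in parallel."""
    waves, remaining = [], list(tasks)
    while remaining:
        wave = [t for t in remaining if is_ready(t)]
        if not wave:
            raise RuntimeError("no ready task")
        for t in wave:
            complete(t)
        waves.append([t.name for t in wave])
        remaining = [t for t in remaining if t not in wave]
    return waves


A = Obj("A")
program = [spawn("t0", [(A, "out")]),
           spawn("t1", [(A, "in")]),
           spawn("t2", [(A, "in")]),
           spawn("t3", [(A, "inout")]),
           spawn("t4", [(A, "in")])]
print(run_in_waves(program))  # [['t0'], ['t1', 't2'], ['t3'], ['t4']]
```

    In this sketch, adding a reader to an open generation only touches counters, and releasing a whole generation is a single increment of `active_gen`. Finding the tasks that just became ready is deliberately left out: the simulation rescans all remaining tasks, which a real runtime would avoid.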




    Information

    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 10, Issue 4
    December 2013
    1046 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2541228
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 December 2013
    Accepted: 01 November 2013
    Revised: 01 November 2013
    Received: 01 June 2013
    Published in TACO Volume 10, Issue 4


    Author Tags

    1. Task dataflow
    2. scheduling

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Bibliometrics

    • Downloads (last 12 months): 46
    • Downloads (last 6 weeks): 8
    Reflects downloads up to 10 Aug 2024

    Cited By

    • (2021) Advanced synchronization techniques for task-based runtime systems. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 334-347. DOI: 10.1145/3437801.3441601. Online publication date: 17-Feb-2021.
    • (2019) Hyperqueues. ACM Transactions on Parallel Computing 6, 4, 1-35. DOI: 10.1145/3365660. Online publication date: 19-Nov-2019.
    • (2015) Efficiently Scheduling Task Dataflow Parallelism. Proceedings of the 3rd International Conference on Exascale Applications and Software, 36-41. DOI: 10.5555/2820083.2820091. Online publication date: 21-Apr-2015.
    • (2015) Contech. ACM Transactions on Architecture and Code Optimization 12, 2, 1-24. DOI: 10.1145/2776893. Online publication date: 8-Jul-2015.
    • (2015) Architectural Support for Data-Driven Execution. ACM Transactions on Architecture and Code Optimization 11, 4, 1-25. DOI: 10.1145/2686874. Online publication date: 9-Jan-2015.
    • (2015) Nexus#. Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 1129-1138. DOI: 10.1109/IPDPS.2015.79. Online publication date: 25-May-2015.
    • (2014) An Integrated Hardware-Software Approach to Task Graph Management. Proceedings of the 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC, CSS, ICESS), 392-399. DOI: 10.1109/HPCC.2014.66. Online publication date: 20-Aug-2014.
