Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/1989493.1989553acmconferencesArticle/Chapter ViewAbstractPublication PagesspaaConference Proceedingsconference-collections
research-article

Scheduling irregular parallel computations on hierarchical caches

Published: 04 June 2011 Publication History
  • Get Citation Alerts
  • Abstract

    For nested-parallel computations with low depth (span, critical path length) analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds.
    The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses are evenly balanced across the processors---i.e., the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.

    References

    [1]
    U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. In Theory of Computing Systems, 2000.
    [2]
    B. Alpern, L. Carter, E. Feig, and T. Selker. The uniform memory hierarchy model of computation. Algorithmica, 12, 1994.
    [3]
    B. Alpern, L. Carter, and J. Ferrante. Modeling parallel computers as memory hierarchies. In Programming Models for Massively Parallel Computers, 1993.
    [4]
    L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In SPAA, 2008.
    [5]
    M. A. Bender, J. T. Fineman, S. Gilbert, and B. C. Kuszmaul. Concurrent cache-oblivious B-trees. In SPAA, 2005.
    [6]
    G. Bilardi, A. Pietracaprina, G. Pucci, and F. Silvestri. Network-oblivious algorithms. In IPDPS, 2007.
    [7]
    G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In SODA, 2008.
    [8]
    G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and H. V. Simhadri. A cache-oblivious model for parallel memory hierarchies. Technical Report Carnegie Mellon University-CS-10--154, Computer Science Department, Carnegie Mellon University, 2010.
    [9]
    G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In SPAA, 2004.
    [10]
    G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low-depth cache oblivious algorithms. In SPAA, 2010.
    [11]
    R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall. Dag-consistent distributed shared memory. In IPPS, 1996.
    [12]
    R. A. Chowdhury and V. Ramachandran. The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. In SPAA, 2007.
    [13]
    R. A. Chowdhury and V. Ramachandran. Cache-efficient dynamic programming algorithms for multicores. In SPAA, 2008.
    [14]
    R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In IPDPS, 2010.
    [15]
    R. Cole and V. Ramachandran. Efficient resource oblivious scheduling of multicore algorithms. manuscript, 2010.
    [16]
    R. Cole and V. Ramachandran. Resource oblivious sorting on multicores. In ICALP, 2010.
    [17]
    T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, 2nd Edition. MIT Press, 2001.
    [18]
    D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. Logp: towards a realistic model of parallel computation. SIGPLAN Not., 28(7), 1993.
    [19]
    P. de la Torre and C. P. Kruskal. Submachine locality in the bulk synchronous setting. In Euro-Par, Vol. II, 1996.
    [20]
    M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In FOCS, 1999.
    [21]
    M. Frigo and V. Strumpen. The cache complexity of multithreaded cache oblivious algorithms. In SPAA, 2006.
    [22]
    P. Kumar. Cache oblivious algorithms. In U. Meyer, P. Sanders, and J. Sibeyn, editors, Algorithms for Memory Hierarchies. Springer, 2003.
    [23]
    C. E. Leiserson. Fat-Trees: Universal networks for hardware-efficient supercomputing. IEEE Transactions on Computers, C--34(10), 1985.
    [24]
    L. G. Valiant. A bridging model for parallel computation. CACM, 33(8), 1990.
    [25]
    L. G. Valiant. A bridging model for multi-core computing. In ESA, 2008.

    Cited By

    View all
    • (2024)Teaching Parallel Algorithms Using the Binary-Forking Model2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW63119.2024.00080(346-351)Online publication date: 27-May-2024
    • (2023)Itoyori: Reconciling Global Address Space and Global Fork-Join Task ParallelismProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607049(1-15)Online publication date: 12-Nov-2023
    • (2022)Many Sequential Iterative Algorithms Can Be Parallel and (Nearly) Work-efficientProceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3490148.3538574(273-286)Online publication date: 11-Jul-2022
    • Show More Cited By

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
    June 2011
    404 pages
    ISBN:9781450307437
    DOI:10.1145/1989493
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    In-Cooperation

    • EATCS: European Association for Theoretical Computer Science

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 June 2011

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. analysis of parallel algorithms
    2. cache complexity
    3. cost models
    4. parallel hierarchical memory
    5. schedulers

    Qualifiers

    • Research-article

    Conference

    SPAA '11

    Acceptance Rates

    Overall Acceptance Rate 447 of 1,461 submissions, 31%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)11
    • Downloads (Last 6 weeks)2
    Reflects downloads up to 30 Jul 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Teaching Parallel Algorithms Using the Binary-Forking Model2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW63119.2024.00080(346-351)Online publication date: 27-May-2024
    • (2023)Itoyori: Reconciling Global Address Space and Global Fork-Join Task ParallelismProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607049(1-15)Online publication date: 12-Nov-2023
    • (2022)Many Sequential Iterative Algorithms Can Be Parallel and (Nearly) Work-efficientProceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3490148.3538574(273-286)Online publication date: 11-Jul-2022
    • (2022)Improving Cache Utilization of Nested Parallel Programs by Almost Deterministic Work StealingIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.319619233:12(4530-4546)Online publication date: 1-Dec-2022
    • (2021)Processor-Aware Cache-Oblivious Algorithms✱Proceedings of the 50th International Conference on Parallel Processing10.1145/3472456.3472506(1-10)Online publication date: 9-Aug-2021
    • (2021)Provably space-efficient parallel functional programmingProceedings of the ACM on Programming Languages10.1145/34342995:POPL(1-33)Online publication date: 4-Jan-2021
    • (2021)Low-Span Parallel Algorithms for the Binary-Forking ModelProceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3409964.3461802(22-34)Online publication date: 6-Jul-2021
    • (2021)Data Oblivious Algorithms for MulticoresProceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3409964.3461783(373-384)Online publication date: 6-Jul-2021
    • (2021)Efficient Stepping Algorithms and Implementations for Parallel Shortest PathsProceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3409964.3461782(184-197)Online publication date: 6-Jul-2021
    • (2021)Open problems in queueing theory inspired by datacenter computingQueueing Systems: Theory and Applications10.1007/s11134-020-09684-697:1-2(3-37)Online publication date: 27-Jan-2021
    • Show More Cited By

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media