Scheduling irregular parallel computations on hierarchical caches

Authors: Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Harsha Vardhan Simhadri

SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures

Pages 355 - 366

https://doi.org/10.1145/1989493.1989553

Published: 04 June 2011

Abstract

For nested-parallel computations with low depth (span, critical path length), analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification of the CO model that can be used to account for costs on a broad range of cache hierarchies. The modification makes two changes: the first avoids capturing artificial data sharing among parallel threads, and the second accounts for parallelism-memory imbalances within tasks. Although the PCO model is more restrictive than the CO model, many algorithms have the same asymptotic cache complexity bounds in both.

The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses is evenly balanced across the processors; that is, the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. To further support our model, we show that, in contrast, no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.
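As a rough illustration of the "work efficient" statement above, the following Python sketch sums a total cache-miss cost over the levels of a hierarchy and divides it by the number of processors. It is not the paper's analysis: the function name, the per-level miss counts, and the per-miss costs are all hypothetical values chosen for the example, and the result stands in for a bound that holds only within a constant factor and under the paper's assumptions.

```python
def balanced_runtime_estimate(misses_per_level, cost_per_miss, num_processors):
    """Return (total cache-miss cost) / (number of processors).

    misses_per_level[i]: hypothetical miss count at cache level i
    cost_per_miss[i]:    hypothetical cost of one miss at level i
    The result is only a constant-factor style estimate, not an exact runtime.
    """
    if num_processors <= 0:
        raise ValueError("num_processors must be positive")
    if len(misses_per_level) != len(cost_per_miss):
        raise ValueError("need exactly one miss cost per cache level")
    # Total cost: each level's misses weighted by that level's miss cost.
    total_cost = sum(q * c for q, c in zip(misses_per_level, cost_per_miss))
    # Evenly balanced across processors.
    return total_cost / num_processors


# Hypothetical three-level hierarchy (all numbers are made up for illustration).
misses = [1_000_000, 50_000, 2_000]
costs = [4, 40, 400]
estimate = balanced_runtime_estimate(misses, costs, num_processors=8)
print(estimate)
```

With these made-up inputs the total miss cost is 6,800,000 cost units, so the estimate for 8 processors is 850,000 units.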

