Cache locality optimization for recursive programs

Authors:

Jonathan Lifflander,

Sriram Krishnamoorthy

ACM SIGPLAN Notices, Volume 52, Issue 6

Pages 1 - 16

https://doi.org/10.1145/3140587.3062385

Published: 14 June 2017

Abstract

We present an approach to optimize the cache locality for recursive programs by dynamically splicing (recursively interleaving) the execution of distinct function invocations. By utilizing data effect annotations, we identify concurrency and data reuse opportunities across function invocations and interleave them to reduce reuse distance. We present algorithms that efficiently track effects in recursive programs, detect interference and dependencies, and interleave execution of function invocations using user-level (non-kernel) lightweight threads. To enable multi-core execution, a program is parallelized using a nested fork/join programming model. Our cache optimization strategy is designed to work in the context of a random work stealing scheduler. We present an implementation using the MIT Cilk framework that demonstrates significant improvements in sequential and parallel performance, competitive with a state-of-the-art compile-time optimizer for loop programs and a domain-specific optimizer for stencil programs.
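
To make the idea of reducing reuse distance by interleaving concrete, here is a minimal Python sketch. It is not the authors' Cilk-based implementation: the names `traverse`, `sequential`, `spliced`, and `max_reuse_distance` are hypothetical, Python generators stand in for user-level lightweight threads, and the two invocations are assumed to only read the same data, so they do not interfere. Reuse distance is taken here to mean the number of distinct blocks touched between two accesses to the same block.

```python
from typing import Dict, Iterator, List


def traverse(lo: int, hi: int, leaf: int) -> Iterator[int]:
    """Recursively visit the range [lo, hi), yielding the id of each leaf block.

    Each yield is a point where execution can switch to another invocation,
    much like a lightweight thread giving up control.
    """
    if hi - lo <= leaf:
        yield lo  # "touch" the block that starts at index lo
        return
    mid = (lo + hi) // 2
    yield from traverse(lo, mid, leaf)
    yield from traverse(mid, hi, leaf)


def sequential(n: int, leaf: int) -> List[int]:
    """Run the first invocation to completion, then the second."""
    return list(traverse(0, n, leaf)) + list(traverse(0, n, leaf))


def spliced(n: int, leaf: int) -> List[int]:
    """Interleave the two invocations at leaf granularity."""
    trace: List[int] = []
    for a, b in zip(traverse(0, n, leaf), traverse(0, n, leaf)):
        trace.extend((a, b))
    return trace


def max_reuse_distance(trace: List[int]) -> int:
    """Largest number of distinct blocks touched between two uses of one block."""
    last_seen: Dict[int, int] = {}
    worst = 0
    for i, block in enumerate(trace):
        if block in last_seen:
            between = set(trace[last_seen[block] + 1 : i])
            worst = max(worst, len(between))
        last_seen[block] = i
    return worst


if __name__ == "__main__":
    print("sequential:", max_reuse_distance(sequential(64, 8)))
    print("spliced:   ", max_reuse_distance(spliced(64, 8)))
```

In this toy setting, running the two traversals back to back means every block is revisited only after all the other blocks have been touched, whereas the interleaved order revisits each block immediately. A real system would additionally have to check the effect annotations for interference and dependencies before interleaving, which this sketch skips by assumption.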

Index Terms

Cache locality optimization for recursive programs
1. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features
        Coroutines
        Recursion



Published In

ACM SIGPLAN Notices Volume 52, Issue 6

PLDI '17

June 2017

708 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/3140587

Editor:
Matthew Fluet

PLDI 2017: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2017
708 pages
ISBN:9781450349888
DOI:10.1145/3062341
General Chair: Albert Cohen, Inria, France
Program Chair: Martin Vechev, DeepCode, Switzerland / ETH Zurich, Switzerland

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

