Improving Parallelism of Recursive Stencil Computations without Sacrificing Cache Performance

Authors: Jesmin Jahan Tithi, Pramod Ganapathi, Rezaul A. Chowdhury

WOSC '14: Proceedings of the Second Workshop on Optimizing Stencil Computations

Pages 1 - 7

https://doi.org/10.1145/2686745.2686752

Published: 20 October 2014

Abstract

The state-of-the-art "trapezoidal decomposition algorithm" for stencil computations on modern multicore machines uses recursive divide-and-conquer (DAC) to achieve asymptotically optimal cache complexity in a cache-oblivious manner. The same DAC approach, however, restricts parallelism: it introduces artificial dependencies among subtasks in addition to those arising from the defining stencil equations. As a result, the trapezoidal decomposition algorithm has suboptimal parallelism.

In this paper we present a variant of the parallel trapezoidal decomposition algorithm called "cache-oblivious wavefront" (COW). COW starts executing recursive subtasks earlier than the start time prescribed by the original algorithm, without violating any real dependencies implied by the underlying recurrences, and thereby reduces the serialization caused by artificial dependencies. The reduction in serialization leads to an improvement in parallelism. Moreover, since we do not change the DAC-based decomposition of tasks used in the original algorithm, cache performance does not suffer.
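The effect of artificial dependencies on parallelism can be illustrated with a toy task graph. The sketch below is not the paper's algorithm: the four tasks, their durations, and the helper `span` are hypothetical and chosen only for illustration. It compares the critical-path length (span) of the same set of subtasks under two dependency sets: one where a fork-join synchronization forces every task in the second phase to wait for every task in the first phase, and one where each task waits only for the task it really depends on.

```python
from functools import lru_cache


def span(durations, deps):
    """Return the critical-path length of a task DAG.

    durations: dict mapping task name -> running time
    deps:      dict mapping task name -> list of tasks it must wait for
    """
    @lru_cache(maxsize=None)
    def finish(task):
        # A task starts as soon as all of its predecessors have finished.
        start = max((finish(d) for d in deps.get(task, ())), default=0)
        return start + durations[task]

    return max(finish(t) for t in durations)


# Hypothetical subtasks with unequal running times (illustrative values).
durations = {"A": 1, "B": 3, "C": 3, "D": 1}

# Real dependencies only: C needs A's results, D needs B's results.
real_deps = {"C": ["A"], "D": ["B"]}

# Fork-join style: A and B run in parallel, then a sync, then C and D.
# The sync adds the artificial edges B -> C and A -> D.
fork_join_deps = {"C": ["A", "B"], "D": ["A", "B"]}

fork_join_span = span(durations, fork_join_deps)
earliest_start_span = span(durations, real_deps)

print("span with artificial dependencies:", fork_join_span)
print("span with real dependencies only: ", earliest_start_span)
```

In this toy instance the total work is the same under both dependency sets, and only the span changes. The decomposition into tasks is also identical in both cases, which mirrors the abstract's argument that cache behavior need not suffer when only the start times change.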

We provide experimental measurements of absolute running times, burdened span (measured with Cilkview), and L1/L2 cache misses (measured with PAPI) to validate our claims.

Published In

WOSC '14: Proceedings of the Second Workshop on Optimizing Stencil Computations

October 2014

70 pages

ISBN:9781450323086

DOI:10.1145/2686745

Program Chairs: Saman Amarasinghe (Massachusetts Institute of Technology, USA), Shoaib Kamil (Massachusetts Institute of Technology, USA), P. Sadayappan (Ohio State University, USA)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

In-Cooperation

SIGAda: ACM Special Interest Group on Ada Programming Language

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

SPLASH '14: Conference on Systems, Programming, and Applications: Software for Humanity

October 20, 2014

Portland, Oregon, USA

Sponsor: SIGPLAN

Bibliometrics

Total Citations: 4
Total Downloads: 148
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 11 Aug 2024

Cited By

Shubham Prakash S, Ganapathi P (2022). An Algorithm for the Sequence Alignment with Gap Penalty Problem using Multiway Divide-and-Conquer and Matrix Transposition. Information Processing Letters, 173:C. DOI: 10.1016/j.ipl.2021.106166. Online publication date: 1-Jan-2022.
https://dl.acm.org/doi/10.1016/j.ipl.2021.106166

Javanmard M, Ganapathi P, Das R, Ahmad Z, Tschudi S, Chowdhury R (2019). Toward Efficient Architecture-Independent Algorithms for Dynamic Programs. High Performance Computing, pages 143-164. DOI: 10.1007/978-3-030-20656-7_8. Online publication date: 17-May-2019.
https://doi.org/10.1007/978-3-030-20656-7_8

Yuan L, Zhang Y, Guo P, Huang S, Mohr B, Raghavan P (2017). Tessellating stencils. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-13. DOI: 10.1145/3126908.3126920. Online publication date: 12-Nov-2017.
https://dl.acm.org/doi/10.1145/3126908.3126920

Chowdhury R, Ganapathi P, Tang Y, Tithi J, Scheideler C, Hajiaghayi M (2017). Provably Efficient Scheduling of Cache-oblivious Wavefront Algorithms. Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 339-350. DOI: 10.1145/3087556.3087586. Online publication date: 24-Jul-2017.
https://dl.acm.org/doi/10.1145/3087556.3087586
