Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/2686745.2686752acmconferencesArticle/Chapter ViewAbstractPublication PagessplashConference Proceedingsconference-collections
research-article

Improving Parallelism of Recursive Stencil Computations without Sacrificing Cache Performance

Published: 20 October 2014 Publication History
  • Get Citation Alerts
  • Abstract

    The state-of-the-art "trapezoidal decomposition algorithm" for stencil computations on modern multicore machines use recursive divide-and-conquer (DAC) to achieve asymptotically optimal cache complexity cache-obliviously. But the same DAC approach restricts parallelism by introducing artificial dependencies among subtasks in addition to those arising from the defining stencil equations. As a result, the trapezoidal decomposition algorithm has suboptimal parallelism.
    In this paper we present a variant of the parallel trapezoidal decomposition algorithm called "cache-oblivious wavefront" (COW) that starts execution of recursive subtasks earlier than the start time prescribed by the original algorithm without violating any real dependencies implied by the underlying recurrences, and thus reducing serialization due to artificial dependencies. The reduction in serialization leads to an improvement in parallelism. Moreover, since we do not change the DAC-based decomposition of tasks used in the original algorithm, cache performance does not suffer.
    We provide experimental measurements of absolute running times, burdened span by Cilkview, and L1/L2 cache misses by PAPI to validate our claims.

    References

    [1]
    U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. In Proc. of the 12th ACM Annual Symp. on Parallel Algorithms and Architectures (SPAA 2000), pages 1--12, 2000.
    [2]
    K. Agrawal, C. E. Leiserson, and J. Sukha. Executing task graphs using work stealing. In IPDPS, pages 1--12. IEEE, April 2010.
    [3]
    R. Bleck, C. Rooth, D. Hu, and L. T. Smith. Salinity-driven thermocline transients in a wind- and thermohaline-forced isopycnic coordinate model of the North Atlantic. Journal of Physical Oceanography, 22(12):1486--1505, 1992. ISSN 0022--3670.
    [4]
    G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures, pages 235--244. ACM, 2004.
    [5]
    R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207--216, Santa Barbara, California, July 1995.
    [6]
    S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. SC Conference, 0: 42, 2000.
    [7]
    R. Chowdhury. Cache-efficient Algorithms and Data Structures: Theory and Experimental Evaluation. PhD thesis, Department of Computer Sciences, The University of Texas at Austin, Austin, Texas, 2007.
    [8]
    R. Chowdhury and V. Ramachandran. Cache-efficient Dynamic Programming Algorithms for Multicores. In Proceedings of ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 207--216, 2008.
    [9]
    R. Chowdhury and V. Ramachandran. The cache-oblivious Gaussian elimination paradigm: Theoretical framework, parallelization and experimental evaluation. Theory of Computing Systems, 47(4):878--919, 2010.
    [10]
    R. A. Chowdhury and V. Ramachandran. Cache-oblivious dynamic programming. In In Proc. of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 06, pages 591--600, 2006.
    [11]
    R. A. Chowdhury, H.-S. Le, and V. Ramachandran. Cacheoblivious dynamic programming for bioinformatics. TCBB, 7 (3):495--510, July-Sept. 2010.
    [12]
    T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, third edition, 2009.
    [13]
    K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In SC, pages 4:1--4:12, Austin, TX, Nov. 15-18 2008.
    [14]
    H. Dursun, K.-i. Nomura, L. Peng, R. Seymour, W. Wang, R. K. Kalia, A. Nakano, and P. Vashishta. A multilevel parallelization framework for high-order stencil computations. In Euro-Par, pages 642--653, Delft, The Netherlands, Aug. 25-28 2009.
    [15]
    H. Dursun, K.-i. Nomura, W. Wang, M. Kunaseth, L. Peng, R. Seymour, R. K. Kalia, A. Nakano, and P. Vashishta. In-core optimization of high-order stencil computations. In PDPTA, pages 533--538, Las Vegas, NV, July13-16 2009.
    [16]
    M. Frigo and V. Strumpen. Cache oblivious stencil computations. In ICS, pages 361--366, Cambridge, MA, June 20-22, 2005.
    [17]
    M. Frigo and V. Strumpen. The cache complexity of multithreaded cache oblivious algorithms. In SPAA, pages 271--280, 2006.
    [18]
    M. Frigo and V. Strumpen. The cache complexity of multithreadedcache oblivious algorithms. Theory of Computing Systems, 45(2):203--233, 2009.
    [19]
    M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI '98, pages 212--223, 1998.
    [20]
    M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In FOCS, pages 285--297, New York, NY, Oct. 17-19 1999.
    [21]
    Y. He, C. E. Leiserson, and W. M. Leiserson. The Cilkview scalability analyzer. In SPAA, pages 145--156, Santorini, Greece, June 13-15 2010.
    [22]
    Intel Corporation. The Intel Many Integrated Core Architecture. http://www.intel.com/content/www/us/en/architectureand- technology/many-integrated-core/intel-many-integratedcore- architecture.html, 2011.
    [23]
    S. Kamil, P. Husbands, L. Oliker, J. Shalf, and K. Yelick. Impact of modern memory subsystems on cache optimizations for stencil computations. In MSP, pages 36--43, Chicago, IL, June 12 2005.
    [24]
    S. Kamil, K. Datta, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Implicit and explicit optimizations for stencil computations. In MSPC, pages 51--60, San Jose, CA, 2006. ISBN 1--59593--578--9.
    [25]
    S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, San Diego, CA, June 10-13 2007.
    [26]
    A. Nakano, R. Kalia, and P. Vashishta. Multiresolution molecular dynamics algorithm for realistic materials modeling on parallel computers. Computer Physics Communications, 83 (2--3):197--214, 1994. ISSN 0010--4655.
    [27]
    A. Nitsure. Implementation and optimization of a cache oblivious lattice Boltzmann algorithm. Master's thesis, Institut fur Informatic, Friedrich-Alexander-Universitat Erlangen-Nurnberg, July 2006.
    [28]
    L. Peng, R. Seymour, K.-i. Nomura, R. K. Kalia, A. Nakano, P. Vashishta, A. Loddoch, M. Netzband,W. R. Volz, and C. C. Wong. High-order stencil computations on multicore clusters. In IPDPS, pages 1--11, Rome, Italy, May 23-29 2009.
    [29]
    A. Taflove and S. Hagness. Computational Electrodynamics: The Finite-Difference Time-Domain Method. Artech House, Norwood, MA, 2000. ISBN 1580530761.
    [30]
    Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The Pochoir stencil compiler. In SPAA, San Jose, CA, USA, 2011.
    [31]
    Y. Tang, R. A. Chowdhury, C.-K. Luk, and C. E. Leiserson. Coding stencil computation using the Pochoir stencilspecification language. In HotPar'11, Berkeley, CA, USA, May 2011.
    [32]
    G. Tzenakis, A. Papatriantafyllou, H. Vandierendonck, P. Pratikakis, and D. S. Nikolopoulos. BDDT: block-level dynamic dependence analysis for task-based parallelism. In Advanced Parallel Processing Technologies - 10th International Symposium, APPT 2013, Stockholm, Sweden, August 27--28, 2013, Revised Selected Papers, pages 17--31, 2013.
    [33]
    S. Williams, J. Carter, L. Oliker, J. Shalf, and K. Yelick. Lattice Boltzmann simulation optimization on leading multicore platforms. In IPDPS, pages 1--14, Miami, FL, Apr. 2008.

    Cited By

    View all
    • (2022)An Algorithm for the Sequence Alignment with Gap Penalty Problem using Multiway Divide-and-Conquer and Matrix TranspositionInformation Processing Letters10.1016/j.ipl.2021.106166173:COnline publication date: 1-Jan-2022
    • (2019)Toward Efficient Architecture-Independent Algorithms for Dynamic ProgramsHigh Performance Computing10.1007/978-3-030-20656-7_8(143-164)Online publication date: 17-May-2019
    • (2017)Tessellating stencilsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3126908.3126920(1-13)Online publication date: 12-Nov-2017
    • Show More Cited By

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    WOSC '14: Proceedings of the Second Workshop on Optimizing Stencil Computations
    October 2014
    70 pages
    ISBN:9781450323086
    DOI:10.1145/2686745
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    In-Cooperation

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 20 October 2014

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. atomic operation
    2. cache-oblivious wavefront
    3. cilk
    4. multi-core
    5. nested parallel computation
    6. parallel cache-oblivious algorithm
    7. stencil

    Qualifiers

    • Research-article

    Funding Sources

    Conference

    SPLASH '14
    Sponsor:

    Upcoming Conference

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)3
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 11 Aug 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)An Algorithm for the Sequence Alignment with Gap Penalty Problem using Multiway Divide-and-Conquer and Matrix TranspositionInformation Processing Letters10.1016/j.ipl.2021.106166173:COnline publication date: 1-Jan-2022
    • (2019)Toward Efficient Architecture-Independent Algorithms for Dynamic ProgramsHigh Performance Computing10.1007/978-3-030-20656-7_8(143-164)Online publication date: 17-May-2019
    • (2017)Tessellating stencilsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3126908.3126920(1-13)Online publication date: 12-Nov-2017
    • (2017)Provably Efficient Scheduling of Cache-oblivious Wavefront AlgorithmsProceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3087556.3087586(339-350)Online publication date: 24-Jul-2017

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media