DOI: 10.1145/3126908.3126920
research-article

Tessellating stencils

Published: 12 November 2017
  • Abstract

    Stencil computations are a very common class of nested loops in scientific and engineering applications. Tiling, which has been studied exhaustively, is one of the most powerful transformation techniques for exploiting data locality and parallelism. Unlike previous work, which mostly blocks the iteration space of a stencil directly, this paper proposes a novel two-level tessellation scheme. A set of blocks is designed to tessellate the spatial space in various ways. The blocks can be processed in parallel without redundant computation. Processing them corresponds to extending them along the time dimension, which forms a tessellation of the iteration space. Experimental results show that our code performs up to 12% better than the existing highly concurrent schemes for the 3d27p stencil.
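As a rough illustration of the idea in the abstract, the sketch below applies a two-stage tessellation to a 1D 3-point averaging stencil with fixed boundary values. It is not the paper's implementation: the stencil, the triangle and inverted-triangle block shapes, and the names `step_range`, `make_grid`, `naive`, `tessellated` and the parameter `b` (time-block height) are all assumptions made for this example. For clarity it stores every time step instead of using the compact buffers a tuned code would use, and it runs the blocks sequentially. Within one stage the blocks touch disjoint points and never read each other's results, so they could be processed in parallel, and every point is computed exactly once.

```python
def step_range(u, t, lo, hi):
    """Compute u[t+1][i] from u[t] for i in [lo, hi), skipping the fixed boundaries."""
    n = len(u[t])
    for i in range(max(lo, 1), min(hi, n - 1)):
        u[t + 1][i] = (u[t][i - 1] + u[t][i] + u[t][i + 1]) / 3.0


def make_grid(init, steps):
    """Allocate all time levels; boundary cells keep their initial values."""
    n = len(init)
    u = [[0.0] * n for _ in range(steps + 1)]
    u[0] = list(init)
    for t in range(1, steps + 1):
        u[t][0], u[t][-1] = init[0], init[-1]
    return u


def naive(init, steps):
    """Reference version: sweep the whole domain once per time step."""
    u = make_grid(init, steps)
    for t in range(steps):
        step_range(u, t, 0, len(init))
    return u[steps]


def tessellated(init, steps, b):
    """Two-stage sweep: spatial blocks of width 2*b, time blocks of height b."""
    n = len(init)
    w = 2 * b
    nblocks = -(-n // w)  # ceiling division
    u = make_grid(init, steps)
    for t0 in range(0, steps, b):
        h = min(b, steps - t0)  # the last time block may be shorter
        # Stage 1: each spatial block shrinks by one point per side per step,
        # so it only ever reads values it produced itself (a triangle in space-time).
        for k in range(nblocks):
            for s in range(h):
                step_range(u, t0 + s, k * w + s, (k + 1) * w - s)
        # Stage 2: fill the gaps around each block boundary, which grow by one
        # point per side per step (an inverted triangle in space-time).
        for k in range(nblocks + 1):
            for s in range(1, h):
                step_range(u, t0 + s, k * w - s, k * w + s)
    return u[steps]


if __name__ == "__main__":
    start = [float((i * 7) % 11) for i in range(37)]
    print(tessellated(start, 10, 3) == naive(start, 10))
```

Because each point is evaluated with the same formula on the same inputs as in the reference sweep, the two versions should agree exactly, which makes the naive loop a convenient check when experimenting with other block sizes.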

    Information & Contributors

    Information

    Published In

    SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2017
    801 pages
    ISBN:9781450351140
    DOI:10.1145/3126908
    • General Chair: Bernd Mohr
    • Program Chair: Padma Raghavan
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Sponsors

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. stencil computation
    2. tessellation
    3. tiling

    Qualifiers

    • Research-article

    Conference

    SC '17
    Acceptance Rates

    SC '17 Paper Acceptance Rate 61 of 327 submissions, 19%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Article Metrics

    • Downloads (Last 12 months): 38
    • Downloads (Last 6 weeks): 4

    Cited By

    • (2024) Efficiency of Various Tiling Strategies for the Zuker Algorithm Optimization. Mathematics 12:5 (728). DOI: 10.3390/math12050728. Online publication date: 29-Feb-2024
    • (2024) ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (333-347). DOI: 10.1145/3627535.3638476. Online publication date: 2-Mar-2024
    • (2023) Adapting combined tiling to stencil optimizations on sunway processor. CCF Transactions on High Performance Computing 5:3 (322-333). DOI: 10.1007/s42514-023-00147-x. Online publication date: 17-May-2023
    • (2022) Scalable distributed high-order stencil computations. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (1-13). DOI: 10.5555/3571885.3571924. Online publication date: 13-Nov-2022
    • (2022) Exploiting temporal data reuse and asynchrony in the reverse time migration. The International Journal of High Performance Computing Applications 37:2 (132-150). DOI: 10.1177/10943420221128529. Online publication date: 3-Oct-2022
    • (2022) Scalable Distributed High-Order Stencil Computations. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (1-13). DOI: 10.1109/SC41404.2022.00035. Online publication date: Nov-2022
    • (2022) An Efficient Vectorization Scheme for Stencil Computation. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (650-660). DOI: 10.1109/IPDPS53621.2022.00069. Online publication date: May-2022
    • (2021) Reducing redundancy in data organization and arithmetic calculation for stencil computations. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-15). DOI: 10.1145/3458817.3476154. Online publication date: 14-Nov-2021
    • (2021) Temporal vectorization for stencils. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-13). DOI: 10.1145/3458817.3476149. Online publication date: 14-Nov-2021
    • (2020) Asynchronous computations for solving the acoustic wave propagation equation. International Journal of High Performance Computing Applications 34:4 (377-393). DOI: 10.1177/1094342020923027. Online publication date: 30-Jun-2020
