Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3458817.3476154acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

Reducing redundancy in data organization and arithmetic calculation for stencil computations

Published: 13 November 2021 Publication History
  • Get Citation Alerts
  • Abstract

    Stencil computation is one of the most important kernels in various scientific and engineering applications. A variety of work has focused on vectorization techniques, aiming at exploiting the in-core data parallelism. However, they either incur spatial data conflicts or hurt the data locality when integrated with tiling. In this paper, a novel spatial computation folding is devised to reduce the data reorganization overhead for vectorization and preserve the data locality for tiling in the data space simultaneously. We then propose an approach of temporal computation folding enhanced with shifts reusing, tessellate tiling, and semi-automatic code generation. It aims to further reduce the redundancy of arithmetic calculations and exploit the register reuse along the time dimension. Experimental results on the AVX2 and AVX-512 CPUs show that our approach obtains significant performance improvements compared with state-of-the-art techniques.

    Supplementary Material

    MP4 File (Reducing Redundancy in Data Organization and Arithmetic Calculation for Stencil Computations.mp4)
    Presentation video

    References

    [1]
    Alfred V Aho, Stephen C Johnson, and Jeffrey D Ullman. 1976. Code generation for expressions with common subexpressions. In Proceedings of the 3rd ACM SIGACT-SIGPLAN symposium on Principles on programming languages. 19--31.
    [2]
    Randy Allen and Ken Kennedy. 2002. Optimizing compilers for modern architectures: a dependence-based approach. Taylor & Francis US.
    [3]
    Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, et al. 2006. The landscape of parallel computing research: A view from berkeley. (2006).
    [4]
    Krste Asanovic, Ras Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John D Kubiatowicz, Edward A Lee, Nelson Morgan, George Necula, David A Patterson, et al. 2008. The parallel computing laboratory at UC Berkeley: A research agenda based on the Berkeley view. EECS Department, University of California, Berkeley, Tech. Rep (2008).
    [5]
    Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling stencil computations to maximize parallelism. In SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1--11.
    [6]
    P. Basu, M. Hall, S. Williams, B. V. Straalen, L. Oliker, and P. Colella. 2015. Compiler-Directed Transformation for Higher-Order Stencils. In 2015 IEEE International Parallel and Distributed Processing Symposium. 313--323.
    [7]
    Uday Bondhugula, Albert Hartono, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation. 101--113.
    [8]
    Matthias Christen, Olaf Schenk, and Helmar Burkhart. 2011. Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In 2011 IEEE International Parallel & Distributed Processing Symposium. IEEE, 676--687.
    [9]
    Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. 2008. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In SC'08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. IEEE, 1--12.
    [10]
    Raúl de la Cruz and Mauricio Araya-Polo. 2014. Algorithm 942: Semi-Stencil. ACM Trans. Math. Softw. 40, 3, Article 23 (April 2014), 39 pages.
    [11]
    Steven J Deitz, Bradford L Chamberlain, and Lawrence Snyder. 2001. Eliminating redundancies in sum-of-product array computations. In Proceedings of the 15th international conference on Supercomputing. 65--77.
    [12]
    Chris Ding and Yun He. 2001. A Ghost Cell Expansion Method for Reducing Communications in Solving PDE Problems (SC '01). 50--50.
    [13]
    Matteo Frigo and Volker Strumpen. 2005. Cache oblivious stencil computations (ICS '05). 361--366.
    [14]
    Tobias Gysi, Tobias Grosser, and Torsten Hoefler. 2015. Modesto: Data-centric analytic optimization of complex stencil programs on heterogeneous architectures. In Proceedings of the 29th ACM on International Conference on Supercomputing. 177--186.
    [15]
    Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J Ramanujam, and P Sadayappan. 2011. Data layout transformation for stencil computations on short-vector simd architectures. In International Conference on Compiler Construction. Springer, 225--245.
    [16]
    Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2013. A Stencil Compiler for Short-Vector SIMD Architectures. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS '13). Association for Computing Machinery, New York, NY, USA, 13--24.
    [17]
    F. Irigoin and R. Triolet. 1988. Supernode Partitioning (POPL '88). 319--329.
    [18]
    Guohua Jin, John Mellor-Crummey, and Robert Fowler. 2001. Increasing Temporal Locality with Skewing and Recursive Blocking (SC '01). 43--43.
    [19]
    Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, and Samuel Williams. 2010. An auto-tuning framework for parallel multicore stencil computations. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE, 1--12.
    [20]
    Shoaib Kamil, Kaushik Datta, Samuel Williams, Leonid Oliker, John Shalf, and Katherine Yelick. 2006. Implicit and explicit optimizations for stencil computations. In Proceedings of the 2006 workshop on Memory system performance and correctness. 51--60.
    [21]
    Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, J. Ramanujam, Atanas Rountev, and P Sadayappan. 2007. Effective Automatic Parallelization of Stencil Computations. SIGPLAN Not. 42, 6 (June 2007), 235--244.
    [22]
    Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. 1991. The Cache Performance and Optimizations of Blocked Algorithms (ASPLOS IV). 63--74.
    [23]
    Kun Li, Honghui Shang, Yunquan Zhang, Shigang Li, Baodong Wu, Dong Wang, Libo Zhang, Fang Li, Dexun Chen, and Zhiqiang Wei. 2019. OpenKMC: a KMC design for hundred-billion-atom simulation using millions of cores on Sunway Taihulight. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1--16.
    [24]
    Tareq M. Malas, Georg Hager, Hatem Ltaief, and David E. Keyes. 2017. Multidimensional Intratile Parallelization for Memory-Starved Stencil Computations. ACM Trans. Parallel Comput. 4, 3, Article Article 12 (Dec. 2017), 32 pages.
    [25]
    A. C. McKellar and E. G. Coffman, Jr. 1969. Organizing Matrices and Matrix Operations for Paged Memory Systems. Commun. ACM 12, 3 (1969), 153--165.
    [26]
    Jiayuan Meng and Kevin Skadron. 2009. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs. In Proceedings of the 23rd international conference on Supercomputing. 256--265.
    [27]
    A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 2010. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs (SC '10). 1--13.
    [28]
    C.S. Department of University of Oregon. 2014. Stencil Pattern. https://ipcc.cs.uoregon.edu/lectures/lecture-8-stencil.pdf [Online; accessed 29-July-2020].
    [29]
    Fabrice Rastello and Thierry Dauxois. 2002. Efficient Tiling for an ODE Discrete Integration Program: Redundant Tasks Instead of Trapezoidal Shaped-Tiles (IPDPS '02). 138--.
    [30]
    Prashant Singh Rawat, Fabrice Rastello, Aravind Sukumaran-Rajam, Louis-Noël Pouchet, Atanas Rountev, and P Sadayappan. 2018. Register optimizations for stencils on GPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 168--182.
    [31]
    P. S. Rawat, A. Sukumaran-Rajam, A. Rountev, F. Rastello, L. Pouchet, and P. Sadayappan. 2018. Associative Instruction Reordering to Alleviate Register Pressure. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 590--602.
    [32]
    Gabriel Rivera and Chau-Wen Tseng. 2000. Tiling Optimizations for 3D Scientific Computations (SC '00). Article 32.
    [33]
    Aaron Sawdey, Matthew O'Keefe, Rainer Bleck, and Robert W Numrich. 1995. The design, implementation, and performance of a parallel ocean circulation model. In Proceedings of 6th ECMWF Workshop on the Use of Parallel Processors in Meteorology: Coming of Age. 523--550.
    [34]
    Yonghong Song and Zhiyuan Li. 1999. New Tiling Techniques to Improve Cache Temporal Locality (PLDI '99). 215--228.
    [35]
    Kevin Stock, Martin Kong, Tobias Grosser, Louis-Noël Pouchet, Fabrice Rastello, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2014. A framework for enhancing data reuse via associative reordering. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. 65--76.
    [36]
    Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. 2010. Cache oblivious parallelograms in iterative stencil computations. In Proceedings of the 24th ACM International Conference on Supercomputing. 49--59.
    [37]
    Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. 2011. The Pochoir Stencil Compiler. In Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '11). Association for Computing Machinery, New York, NY, USA, 117--128.
    [38]
    Xinmin Tian, Aart Bik, Milind Girkar, Paul Grey, Hideki Saito, and Ernesto Su. 2002. Intel® OpenMP C++/Fortran Compiler for Hyper-Threading Technology: Implementation and Performance. Intel Technology Journal 6, 1 (2002).
    [39]
    Sundaresan Venkatasubramanian, Richard W Vuduc, and none none. 2009. Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems. In Proceedings of the 23rd international conference on Supercomputing. 244--255.
    [40]
    Wikichip.org. 2019. Wikichip of Intel Xeon Gold 6140. https://en.wikichip.org/wiki/intel/xeon_gold/6140 [Online; accessed 29-July-2020].
    [41]
    Michael E. Wolf and Monica S. Lam. 1991. A Data Locality Optimizing Algorithm (PLDI '91). 30--44.
    [42]
    M. Wolfe. 1989. More Iteration Space Tiling (Supercomputing '89). 655--664.
    [43]
    David Wonnacott. 2002. Achieving Scalable Locality with Time Skewing. Int. J. Parallel Program. 30, 3 (June 2002), 181--221.
    [44]
    David G Wonnacott and Michelle Mills Strout. 2013. On the scalability of loop tiling techniques. IMPACT 2013 (2013).
    [45]
    Charles Yount. 2015. Vector Folding: improving stencil performance via multidimensional SIMD-vector representation. In 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems. IEEE, 865--870.
    [46]
    Charles Yount, Josh Tobin, Alexander Breuer, and Alejandro Duran. 2016. YASK-Yet another stencil kernel: A framework for HPC stencil code-generation and tuning. In 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC). IEEE, 30--39.
    [47]
    Liang Yuan, Shan Huang, Yunquan Zhang, and Hang Cao. 2019. Tessellating Star Stencils. In Proceedings of the 48th International Conference on Parallel Processing. 1--10.
    [48]
    Liang Yuan, Yunquan Zhang, Peng Guo, and Shan Huang. 2017. Tessellating Stencils. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). Association for Computing Machinery, New York, NY, USA, Article Article 49, 13 pages.
    [49]
    Yongpeng Zhang and Frank Mueller. 2012. Auto-generation and auto-tuning of 3D stencil codes on GPU clusters. In Proceedings of the Tenth International Symposium on Code Generation and Optimization. 155--164.
    [50]
    Tuowen Zhao, Protonu Basu, Samuel Williams, Mary Hall, and Hans Johansen. 2019. Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1--44.

    Cited By

    View all
    • (2024)Acceleration of Remote Sensing Image Filtering Based on Embedded CPU+GPU Heterogeneous PlatformChinese Journal of Space Science10.11728/cjss2024.01.2023-003344:1(95)Online publication date: 2024
    • (2023)Adapting combined tiling to stencil optimizations on sunway processorCCF Transactions on High Performance Computing10.1007/s42514-023-00147-x5:3(322-333)Online publication date: 17-May-2023
    • (2021)An Accurate and Efficient Large-scale Regression Method through Best Friend ClusteringIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.3134336(1-1)Online publication date: 2021

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2021
    1493 pages
    ISBN:9781450384421
    DOI:10.1145/3458817
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 November 2021

    Permissions

    Request permissions for this article.

    Check for updates

    Badges

    Author Tags

    1. data locality
    2. register reuse
    3. stencil
    4. vectorization

    Qualifiers

    • Research-article

    Conference

    SC '21
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)46
    • Downloads (Last 6 weeks)9

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Acceleration of Remote Sensing Image Filtering Based on Embedded CPU+GPU Heterogeneous PlatformChinese Journal of Space Science10.11728/cjss2024.01.2023-003344:1(95)Online publication date: 2024
    • (2023)Adapting combined tiling to stencil optimizations on sunway processorCCF Transactions on High Performance Computing10.1007/s42514-023-00147-x5:3(322-333)Online publication date: 17-May-2023
    • (2021)An Accurate and Efficient Large-scale Regression Method through Best Friend ClusteringIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.3134336(1-1)Online publication date: 2021

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media