research-article

Understanding stencil code performance on multicore architectures

Authors:

Shah M. Faizur Rahman,

Apan Qasem

CF '11: Proceedings of the 8th ACM International Conference on Computing Frontiers

Article No.: 30, Pages 1 - 10

https://doi.org/10.1145/2016604.2016641

Published: 03 May 2011

Abstract

Stencil computations are the foundation of many large applications in scientific computing. Previous research has shown that several optimization mechanisms, including rectangular blocking and time skewing combined with wavefront- and pipeline-based parallelization, can significantly improve the performance of stencil kernels on multi-core architectures. However, the overall performance impact of these optimizations is difficult to predict because of the interplay of load imbalance, synchronization overhead, and cache locality. This paper presents a detailed performance study of these optimizations: we apply them in a wide variety of configurations, use hardware counters to monitor the efficiency of architectural components, and then develop a set of formulas via regression analysis that model their overall performance impact in terms of the affected hardware counter values. We have applied our methodology to three stencil computation kernels: a 7-point Jacobi, a 27-point Jacobi, and a 7-point Gauss-Seidel computation. Our experimental results show that a precise formula can be developed for each kernel to accurately model the overall performance impact of varying optimizations and thereby effectively guide the performance analysis and tuning of these kernels.
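For readers unfamiliar with the kernels named in the abstract, the following is a minimal sketch of a 7-point Jacobi sweep and of the same sweep traversed in rectangular blocks. It is not the authors' implementation: the function names, the coefficients `alpha` and `beta`, the flat-list grid layout, and the block sizes are all assumptions chosen for illustration. In pure Python the blocked version brings no cache benefit; the sketch only shows the loop structure and checks that blocking leaves the result unchanged.

```python
# Illustrative sketch only. All names, coefficients, and sizes below are
# assumptions for demonstration and are not taken from the paper.

def idx(i, j, k, n):
    """Flatten (i, j, k) into an index for an n*n*n grid stored as a flat list."""
    return (i * n + j) * n + k


def jacobi7_point(src, i, j, k, n, alpha=0.4, beta=0.1):
    """7-point Jacobi update: the centre cell plus its six face neighbours."""
    return (alpha * src[idx(i, j, k, n)]
            + beta * (src[idx(i - 1, j, k, n)] + src[idx(i + 1, j, k, n)]
                      + src[idx(i, j - 1, k, n)] + src[idx(i, j + 1, k, n)]
                      + src[idx(i, j, k - 1, n)] + src[idx(i, j, k + 1, n)]))


def sweep_naive(src, n):
    """One Jacobi sweep over the interior; boundary cells are copied unchanged."""
    dst = list(src)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                dst[idx(i, j, k, n)] = jacobi7_point(src, i, j, k, n)
    return dst


def sweep_blocked(src, n, bi, bj, bk):
    """The same sweep, visiting the interior in rectangular bi x bj x bk blocks.

    Jacobi reads only from src and writes only to dst, so the cells can be
    visited in any order without changing the result.
    """
    dst = list(src)
    for ii in range(1, n - 1, bi):
        for jj in range(1, n - 1, bj):
            for kk in range(1, n - 1, bk):
                for i in range(ii, min(ii + bi, n - 1)):
                    for j in range(jj, min(jj + bj, n - 1)):
                        for k in range(kk, min(kk + bk, n - 1)):
                            dst[idx(i, j, k, n)] = jacobi7_point(src, i, j, k, n)
    return dst


def make_grid(n):
    """Deterministic test data in the same (i, j, k) order that idx() assumes."""
    return [float((i * 7 + j * 3 + k) % 11)
            for i in range(n) for j in range(n) for k in range(n)]


if __name__ == "__main__":
    n = 8
    grid = make_grid(n)
    print(sweep_blocked(grid, n, 3, 2, 4) == sweep_naive(grid, n))
```

A Gauss-Seidel sweep differs in that it updates the grid in place, so reordering its iterations is constrained by data dependences. This is one reason the abstract pairs blocking with time skewing and wavefront- or pipeline-based parallelization.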

Published In

CF '11: Proceedings of the 8th ACM International Conference on Computing Frontiers

May 2011

268 pages

ISBN:9781450306980

DOI:10.1145/2016604

General Chair: Calin Cascaval (Qualcomm Research)

Program Chairs: Pedro Trancoso (University of Cyprus, CY), Viktor Prasanna (University of Southern California)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CF'11

Sponsor:

SIGMICRO

CF'11: Computing Frontiers Conference

May 3 - 5, 2011

Ischia, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%

Cited By

Tao X, Pang J, Xu J, Zhu Y (2021). Compiler-directed scratchpad memory data transfer optimization for multithreaded applications on a heterogeneous many-core architecture. The Journal of Supercomputing. Online publication date: 15-May-2021.
https://doi.org/10.1007/s11227-021-03853-x
Yantır H, Eltawil A, Salama K (2020). Efficient Acceleration of Stencil Applications through In-Memory Computing. Micromachines 11:6 (622). Online publication date: 26-Jun-2020.
https://doi.org/10.3390/mi11060622
Kang D, Rubel O, Byna S, Blanas S (2020). Predicting and Comparing the Performance of Array Management Libraries. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 906-915. Online publication date: May-2020.
https://doi.org/10.1109/IPDPS47924.2020.00097
Guerrera D, Maffia A, Burkhart H (2019). Reproducible stencil compiler benchmarks using prova! Future Generation Computer Systems 92:C (933-946). Online publication date: 1-Mar-2019.
https://dl.acm.org/doi/10.1016/j.future.2018.05.023
Hernández-Hernández M, Hernández-Hernández J, Maldonado E, Miranda I (2019). Modern Code Applied in Stencil in Edge Detection of an Image for Architecture Intel Xeon Phi KNL. Technologies and Innovation, pp. 151-163. Online publication date: 20-Nov-2019.
https://doi.org/10.1007/978-3-030-34989-9_12
Waidyasooriya H, Takei Y, Tatsumi S, Hariyama M (2017). OpenCL-Based FPGA-Platform for Stencil Computation and Its Optimization Methodology. IEEE Transactions on Parallel and Distributed Systems 28:5 (1390-1402). Online publication date: 1-May-2017.
https://dl.acm.org/doi/10.1109/TPDS.2016.2614981
Michelogiannakis G, Shalf J (2017). Last Level Collective Hardware Prefetching For Data-Parallel Applications. 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pp. 72-83. Online publication date: Dec-2017.
https://doi.org/10.1109/HiPC.2017.00018
Louboutin M, Lange M, Herrmann F, Kukreja N, Gorman G (2017). Performance prediction of finite-difference solvers for different computer architectures. Computers & Geosciences 105 (148-157). Online publication date: Aug-2017.
https://doi.org/10.1016/j.cageo.2017.04.014
Sourouri M, Baden S, Cai X (2017). Panda. International Journal of Parallel Programming 45:3 (711-729). Online publication date: 1-Jun-2017.
https://dl.acm.org/doi/10.1007/s10766-016-0454-1
Saxena G, Jimack P, Walkley M (2017). A quasi-cache-aware model for optimal domain partitioning in parallel geometric multigrid. Concurrency and Computation: Practice and Experience 30:9. Online publication date: 9-Oct-2017.
https://doi.org/10.1002/cpe.4328