Loop transformations leveraging hardware prefetching

Authors:

Savvas Sioutas,

Henk Corporaal,

Lou Somers

CGO 2018: Proceedings of the 2018 International Symposium on Code Generation and Optimization

Pages 254 - 264

https://doi.org/10.1145/3168823

Published: 24 February 2018

Abstract

Memory-bound applications depend heavily on the bandwidth of the system to achieve high performance. Improving temporal and/or spatial locality through loop transformations is a common way of mitigating this dependency. However, choosing the right combination of optimizations is not trivial, because most of them alter the memory access pattern of the application and, as a result, interfere with the efficiency of the hardware prefetching mechanisms present in modern architectures. We propose an optimization algorithm that analytically classifies an algorithmic description of a loop nest to decide whether it should be optimized for temporal or for spatial locality, while also taking hardware prefetching into account. We implement our technique as a tool for use with the Halide compiler and test it on a variety of benchmarks. We find an average performance improvement of over 40% compared to previous analytical models targeting the Halide language and compiler.
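To make the interaction between loop transformations and prefetching concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes a row-major 2D array and uses a deliberately crude proxy for prefetcher friendliness: the number of consecutive accesses whose flat addresses are not adjacent. The function names (`row_major_order`, `tiled_order`, `stream_breaks`) and the proxy itself are illustrative assumptions, not a model of any real hardware prefetcher.

```python
def row_major_order(rows, cols):
    """Visit order of a plain row-major loop nest over a rows x cols array."""
    return [(y, x) for y in range(rows) for x in range(cols)]


def tiled_order(rows, cols, tile_rows, tile_cols):
    """Visit order after tiling both loops; the tile loops run outermost."""
    order = []
    for ty in range(0, rows, tile_rows):
        for tx in range(0, cols, tile_cols):
            # Intra-tile loops, clamped at the array boundary.
            for y in range(ty, min(ty + tile_rows, rows)):
                for x in range(tx, min(tx + tile_cols, cols)):
                    order.append((y, x))
    return order


def stream_breaks(order, cols):
    """Count consecutive accesses whose flat addresses are not adjacent.

    Toy proxy (an assumption, not a hardware model): each break marks a
    point where a simple sequential stream prefetcher would lose its stream.
    """
    addresses = [y * cols + x for (y, x) in order]
    return sum(1 for a, b in zip(addresses, addresses[1:]) if b != a + 1)
```

Under this proxy, an 8x8 array traversed in plain row-major order has no breaks, 4x4 tiles introduce 14, and 4x8 tiles (full width along the contiguous dimension) again have none. Tiling can improve temporal reuse, but in this toy setting narrow tiles fragment the sequential streams, which is the kind of trade-off the abstract describes.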

Index Terms

Loop transformations leveraging hardware prefetching
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. Context specific languages
      1. Domain specific languages

Published In

CGO 2018: Proceedings of the 2018 International Symposium on Code Generation and Optimization

February 2018

377 pages

ISBN:9781450356176

DOI:10.1145/3179541

General Chairs: Jens Knoop (Vienna University of Technology, Austria), Markus Schordan (Lawrence Livermore National Laboratory, USA)

Program Chairs: Teresa Johnson (Google, USA), Michael O'Boyle (University of Edinburgh, UK)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CGO '18

CGO '18: 16th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 24 - 28, 2018

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Cited By

Babalad S, Shevade S, Thazhuthaveetil M, Govindarajan R (2024). Tile Size and Loop Order Selection using Machine Learning for Multi-/Many-Core Architectures. Proceedings of the 38th ACM International Conference on Supercomputing, pages 388-399. Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3650200.3656630

Singh S, Hegarty J, Leather H, Steiner B, Chaudhuri S, Sutton C (2022). A graph neural network-based performance model for deep learning applications. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 11-20. Online publication date: 13-Jun-2022.
https://dl.acm.org/doi/10.1145/3520312.3534863

Sioutas S, Stuijk S, Waeijen L, Basten T, Corporaal H, Somers L (2019). Schedule Synthesis for Halide Pipelines through Reuse Analysis. ACM Transactions on Architecture and Code Optimization, 16(2), pages 1-22. Online publication date: 18-Apr-2019.
https://dl.acm.org/doi/10.1145/3310248

Adams A, Ma K, Anderson L, Baghdadi R, Li T, Gharbi M, Steiner B, Johnson S, Fatahalian K, Durand F, Ragan-Kelley J (2019). Learning to optimize halide with tree search and random programs. ACM Transactions on Graphics, 38(4), pages 1-12. Online publication date: 12-Jul-2019.
https://dl.acm.org/doi/10.1145/3306346.3322967
