DOI: 10.1145/3168823
research-article

Loop transformations leveraging hardware prefetching

Published: 24 February 2018

Abstract

Memory-bound applications depend heavily on the memory bandwidth of the system to achieve high performance. Improving temporal and/or spatial locality through loop transformations is a common way of mitigating this dependency. However, choosing the right combination of optimizations is not trivial, because most of them alter the memory access pattern of the application and, as a result, interfere with the efficiency of the hardware prefetching mechanisms present in modern architectures. We propose an optimization algorithm that analytically classifies an algorithmic description of a loop nest in order to decide whether it should be optimized for temporal or for spatial locality, while also taking hardware prefetching into account. We implement our technique as a tool to be used with the Halide compiler and test it on a variety of benchmarks. We find an average performance improvement of over 40% compared to previous analytical models targeting the Halide language and compiler.
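
To make concrete the claim that loop transformations change the memory access pattern, the following is a minimal sketch, not the algorithm proposed in the paper. It counts misses in a toy, fully associative LRU cache for three traversal orders of a row-major N x N array. The function names, the cache model, and the parameters (64 x 64 array, 8 elements per cache line, 16 cache lines) are all assumptions chosen for illustration, and hardware prefetching is deliberately not modelled.

```python
from collections import OrderedDict


def count_misses(accesses, n, line_elems, num_lines):
    """Count misses of a fully associative LRU cache (toy model).

    accesses   : iterable of (i, j) indices into an n x n row-major array
    line_elems : elements per cache line (assumed value, not from the paper)
    num_lines  : number of lines the cache can hold (assumed value)
    """
    cache = OrderedDict()  # keys are cache-line ids, ordered oldest to newest
    misses = 0
    for i, j in accesses:
        line = (i * n + j) // line_elems  # cache line holding element (i, j)
        if line in cache:
            cache.move_to_end(line)  # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def row_major(n):
    """Unit-stride traversal: the inner loop walks along a row."""
    return [(i, j) for i in range(n) for j in range(n)]


def column_major(n):
    """Interchanged loops: the inner loop jumps n elements per step."""
    return [(i, j) for j in range(n) for i in range(n)]


def tiled_column_major(n, tile):
    """Column-major order restricted to tile x tile blocks (loop tiling)."""
    return [(i, j)
            for ii in range(0, n, tile)
            for jj in range(0, n, tile)
            for j in range(jj, min(jj + tile, n))
            for i in range(ii, min(ii + tile, n))]


if __name__ == "__main__":
    N, LINE_ELEMS, NUM_LINES = 64, 8, 16  # illustrative parameters only
    for name, order in [("row-major", row_major(N)),
                        ("column-major", column_major(N)),
                        ("tiled column-major", tiled_column_major(N, 8))]:
        print(name, count_misses(order, N, LINE_ELEMS, NUM_LINES))
```

Under these assumed parameters the model reports 512 misses for the row-major order, 4096 for the column-major order, and 512 for the tiled order. Tiling restores the spatial locality that the loop interchange destroyed, but the tiled loop no longer walks memory as one long unit-stride stream. That changed access pattern is the kind of effect the abstract says can interfere with hardware prefetching, and this toy model does not capture it.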

Cited By

  • (2024) Tile Size and Loop Order Selection using Machine Learning for Multi-/Many-Core Architectures. Proceedings of the 38th ACM International Conference on Supercomputing, pages 388-399. DOI: 10.1145/3650200.3656630. Online publication date: 30-May-2024
  • (2022) A graph neural network-based performance model for deep learning applications. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 11-20. DOI: 10.1145/3520312.3534863. Online publication date: 13-Jun-2022
  • (2019) Schedule Synthesis for Halide Pipelines through Reuse Analysis. ACM Transactions on Architecture and Code Optimization, 16:2, pages 1-22. DOI: 10.1145/3310248. Online publication date: 18-Apr-2019
  • (2019) Learning to optimize halide with tree search and random programs. ACM Transactions on Graphics, 38:4, pages 1-12. DOI: 10.1145/3306346.3322967. Online publication date: 12-Jul-2019

Published In

CGO 2018: Proceedings of the 2018 International Symposium on Code Generation and Optimization
February 2018
377 pages
ISBN:9781450356176
DOI:10.1145/3179541
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Halide
  2. compiler optimizations
  3. loop optimizations

Qualifiers

  • Research-article

Conference

CGO '18

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%
