Analytical bounds for optimal tile size selection

Authors:

Louis-Noël Pouchet,

Vivek Sarkar

CC'12: Proceedings of the 21st international conference on Compiler Construction

Pages 101 - 121

https://doi.org/10.1007/978-3-642-28652-0_6

Published: 24 March 2012

Abstract

In this paper, we introduce a novel approach to guide tile size selection by employing analytical models to limit empirical search to a subspace of the full search space. Two analytical models are used together: 1) an existing conservative model, based on the data footprint of a tile, which ignores intra-tile cache block replacement, and 2) an aggressive new model that assumes optimal cache block replacement within a tile. Experimental results on multiple platforms demonstrate the practical effectiveness of the approach: it reduces the search space for the optimal tile size by 1,307× to 11,879× on an Intel Core-2 Quad system, 358× to 1,978× on an Intel Nehalem system, and 45× to 1,142× on an IBM Power7 system. Rectangularly tiled code tuned by searching the subspace identified by our model achieves speed-ups of up to 1.40× (Intel Core-2 Quad), 1.28× (Nehalem) and 1.19× (Power7) relative to the best possible square tile sizes on these processor architectures. We also demonstrate the integration of the analytical bounds with existing search optimization algorithms. Our approach not only reduces the total search time of the Nelder-Mead Simplex and Parallel Rank Ordering methods by factors of up to 4.95× and 4.33×, respectively, but also finds better tile sizes that yield higher performance in tuned tiled code.
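To make the idea of bounding the search concrete, the following is a minimal sketch, not the paper's actual models. It assumes a tiled matrix-multiply kernel `C[i][j] += A[i][k] * B[k][j]` with tile sizes `(ti, tj, tk)` and a cache capacity counted in array elements. `footprint` plays the role of the conservative model (every distinct element touched by a tile must fit), while `optimistic_working_set` is a hypothetical stand-in for an aggressive model that assumes ideal replacement. The function names, the formulas, the candidate sizes and the capacity are all illustrative assumptions.

```python
from itertools import product


def footprint(ti, tj, tk):
    """Conservative estimate: all distinct elements of A, B and C touched by one tile."""
    return ti * tk + tk * tj + ti * tj


def optimistic_working_set(ti, tj, tk):
    """Hypothetical aggressive estimate for an i-j-k order inside the tile:
    one row of the A tile, the whole B tile, and one element of C."""
    return tk + tk * tj + 1


def bounded_search_space(sizes, capacity):
    """Keep only tiles lying between the two capacity estimates (assumed pruning rule)."""
    sizes = sorted(sizes)
    next_size = dict(zip(sizes, sizes[1:]))  # next-larger candidate for each size
    kept = []
    for tile in product(sizes, repeat=3):
        # Upper bound: discard tiles that overflow even under the optimistic estimate.
        if optimistic_working_set(*tile) > capacity:
            continue
        # Lower bound: discard tiles that can still grow in some dimension
        # while the conservative footprint keeps fitting in the cache.
        grown = [
            tile[:d] + (next_size[tile[d]],) + tile[d + 1:]
            for d in range(3)
            if tile[d] in next_size
        ]
        if any(footprint(*g) <= capacity for g in grown):
            continue
        kept.append(tile)
    return kept


candidates = [4, 8, 16, 32, 64]   # illustrative candidate tile sizes
capacity = 4096                   # e.g. a 32 KB cache holding 8-byte elements
kept = bounded_search_space(candidates, capacity)
print(len(kept), "of", len(candidates) ** 3, "tiles remain for empirical search")
```

Only the tiles in `kept` would then be timed empirically, or handed to a search method such as Nelder-Mead as its feasible region. A real implementation would count cache lines rather than elements and derive both estimates from the loop nest's access functions.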



Published In

CC'12: Proceedings of the 21st international conference on Compiler Construction

March 2012, 243 pages

ISBN: 9783642286513

Editor: Michael O'Boyle, School for Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh, UK

Publisher: Springer-Verlag, Berlin, Heidelberg
