Article (DOI: 10.1007/978-3-642-28652-0_6)

Analytical bounds for optimal tile size selection

Published: 24 March 2012

Abstract

In this paper, we introduce a novel approach to guide tile size selection by employing analytical models to limit empirical search to a subspace of the full search space. Two analytical models are used together: 1) an existing conservative model, based on the data footprint of a tile, which ignores intra-tile cache block replacement, and 2) an aggressive new model that assumes optimal cache block replacement within a tile. Experimental results on multiple platforms demonstrate the practical effectiveness of the approach: it reduces the search space for the optimal tile size by 1,307× to 11,879× on an Intel Core-2 Quad system, 358× to 1,978× on an Intel Nehalem system, and 45× to 1,142× on an IBM Power7 system. Rectangularly tiled code tuned by searching the subspace identified by our model achieves speed-ups of up to 1.40× (Intel Core-2 Quad), 1.28× (Nehalem) and 1.19× (Power7) relative to the best possible square tile sizes on these processor architectures. We also demonstrate the integration of the analytical bounds with existing search optimization algorithms. Our approach not only reduces the total search time of the Nelder-Mead Simplex and Parallel Rank Ordering methods by factors of up to 4.95× and 4.33×, respectively, but also finds better tile sizes that yield higher performance in the tuned tiled code.
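
The general idea of restricting an empirical search with analytical capacity bounds can be sketched as follows. This is only an illustration, not the paper's two models: the matrix-multiply footprint formula, the cache capacity, the candidate sizes, and the `lower_frac` threshold are all assumptions chosen for the example, and `footprint_elems` and `prune` are hypothetical helper names.

```python
from itertools import product


def footprint_elems(ti, tj, tk):
    """Distinct array elements touched by one tile of C[i][j] += A[i][k] * B[k][j].

    Assumed footprint: a ti x tk block of A, a tk x tj block of B,
    and a ti x tj block of C.
    """
    return ti * tk + tk * tj + ti * tj


def prune(candidates, cache_elems, lower_frac=0.25):
    """Keep tiles whose footprint fits in the cache but is not too small.

    The upper bound (footprint <= cache_elems) and the lower bound
    (footprint >= lower_frac * cache_elems) are illustrative stand-ins
    for analytical bounds; they are not the bounds derived in the paper.
    """
    lower = lower_frac * cache_elems
    return [t for t in candidates if lower <= footprint_elems(*t) <= cache_elems]


# Hypothetical candidate tile sizes per loop dimension.
sizes = [8, 16, 32, 64, 128, 256]
full_space = list(product(sizes, repeat=3))

# Hypothetical 32 KiB cache holding 8-byte doubles, i.e. 4096 elements.
subspace = prune(full_space, cache_elems=32 * 1024 // 8)

print(f"full space: {len(full_space)} tiles, pruned subspace: {len(subspace)} tiles")
```

Only the tiles left in `subspace` would then be timed empirically, either exhaustively or by a search method such as Nelder-Mead Simplex.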

Published In

CC'12: Proceedings of the 21st international conference on Compiler Construction
March 2012
243 pages
ISBN:9783642286513
  • Editor: Michael O'Boyle

Publisher

Springer-Verlag

Berlin, Heidelberg

Cited By

  • (2021) Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators. ACM Transactions on Architecture and Code Optimization 19:1 (1-26). DOI: 10.1145/3485137. Online publication date: 6-Dec-2021
  • (2021) IOOpt: automatic derivation of I/O complexity bounds for affine programs. Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (1187-1202). DOI: 10.1145/3453483.3454103. Online publication date: 19-Jun-2021
  • (2021) A practical tile size selection model for affine loop nests. Proceedings of the 35th ACM International Conference on Supercomputing (27-39). DOI: 10.1145/3447818.3462213. Online publication date: 3-Jun-2021
  • (2021) Analytical characterization and design space exploration for optimization of CNNs. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (928-942). DOI: 10.1145/3445814.3446759. Online publication date: 19-Apr-2021
  • (2020) Efficient tiled sparse matrix multiplication through matrix signatures. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-14). DOI: 10.5555/3433701.3433816. Online publication date: 9-Nov-2020
  • (2020) A Vector-Length Agnostic Compiler for the Connex-S Accelerator with Scratchpad Memory. ACM Transactions on Embedded Computing Systems 19:6 (1-30). DOI: 10.1145/3406536. Online publication date: 3-Oct-2020
  • (2020) An Effective Fusion and Tile Size Model for PolyMage. ACM Transactions on Programming Languages and Systems 42:3 (1-27). DOI: 10.1145/3404846. Online publication date: 8-Nov-2020
  • (2019) Analytical cache modeling and tilesize optimization for tensor contractions. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-13). DOI: 10.1145/3295500.3356218. Online publication date: 17-Nov-2019
  • (2019) An Autotuning Framework for Scalable Execution of Tiled Code via Iterative Polyhedral Compilation. ACM Transactions on Architecture and Code Optimization 15:4 (1-23). DOI: 10.1145/3293449. Online publication date: 8-Jan-2019
  • (2019) Absinthe: Learning an Analytical Performance Model to Fuse and Tile Stencil Codes in One Shot. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (369-381). DOI: 10.1109/PACT.2019.00036. Online publication date: 23-Sep-2019
