
A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD

Published: 01 June 2014

Abstract

This paper presents a new methodology for speeding up Matrix–Matrix Multiplication using the Single Instruction Multiple Data (SIMD) unit, on one or more cores sharing a cache. The methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.08 up to 3.5) by decreasing the number of load/store and arithmetic instructions, as well as the number of data-cache accesses and misses across the memory hierarchy. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and the hardware parameters (e.g., data-cache sizes and associativities) as one problem rather than separately, yielding high-quality solutions from a smaller search space.
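The core idea the abstract describes, reordering and tiling the multiplication loops so that data reuse matches the cache hierarchy while the inner loop stays unit-stride and SIMD-friendly, can be sketched as below. This is a generic cache-blocked kernel, not the paper's method: the fixed `TILE` of 64 is an assumption, whereas the paper derives tile sizes from the actual data-cache sizes and associativities.

```c
#include <stddef.h>

/* Cache-blocked (tiled) matrix-matrix multiply, C += A*B, row-major n x n.
   TILE = 64 is a placeholder; the paper selects tile sizes from the
   data-cache sizes and associativities instead of a fixed constant. */
#define TILE 64

void mmm_tiled(const double *A, const double *B, double *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t i_end = ii + TILE < n ? ii + TILE : n;
                size_t k_end = kk + TILE < n ? kk + TILE : n;
                size_t j_end = jj + TILE < n ? jj + TILE : n;
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k]; /* reused across the j loop */
                        /* unit-stride over B and C: amenable to SIMD
                           vectorization and few cache misses per tile */
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

With tiles sized so that a block of A, a block of B, and a block of C fit in cache together (and do not evict one another, which is where associativity enters), each element loaded from memory is reused TILE times before being displaced.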



Published In

The Journal of Supercomputing, Volume 68, Issue 3
June 2014
629 pages

Publisher

Kluwer Academic Publishers

United States


Author Tags

  1. Cache associativity
  2. Data cache
  3. Matrix–Matrix Multiplication
  4. Memory management
  5. Multi-core
  6. SIMD



Cited By

  • (2023) Improving the Performance of Task-Based Linear Algebra Software with Autotuning Techniques on Heterogeneous Architectures. Computational Science – ICCS 2023, pp 668–682. doi:10.1007/978-3-031-35995-8_47. Online publication date: 3-Jul-2023
  • (2022) A Methodology for Efficient Tile Size Selection for Affine Loop Kernels. International Journal of Parallel Programming 50(3–4):405–432. doi:10.1007/s10766-022-00734-5. Online publication date: 1-Aug-2022
  • (2021) An Analytical Model for Loop Tiling Transformation. Embedded Computer Systems: Architectures, Modeling, and Simulation, pp 95–107. doi:10.1007/978-3-031-04580-6_7. Online publication date: 4-Jul-2021
  • (2020) HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs. Computing 102(12):2607–2631. doi:10.1007/s00607-020-00846-1. Online publication date: 1-Dec-2020
  • (2018) SCP. ACM Transactions on Architecture and Code Optimization 15(4):1–21. doi:10.1145/3274654. Online publication date: 10-Oct-2018
  • (2017) Automatic generation of fast BLAS3-GEMM: a portable compiler approach. Proceedings of the 2017 International Symposium on Code Generation and Optimization, pp 122–133. doi:10.5555/3049832.3049846. Online publication date: 4-Feb-2017
  • (2016) Analytical Modeling Is Enough for High-Performance BLIS. ACM Transactions on Mathematical Software 43(2):1–18. doi:10.1145/2925987. Online publication date: 16-Aug-2016
  • (2016) A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures. The Journal of Supercomputing 72(3):804–844. doi:10.1007/s11227-015-1613-7. Online publication date: 1-Mar-2016
