
A Matrix–Matrix Multiplication methodology for single/multi-core architectures using SIMD

Published: 01 June 2014

Abstract

This paper presents a new methodology for speeding up Matrix–Matrix Multiplication using the Single Instruction Multiple Data (SIMD) unit, on one or more cores sharing a cache. The methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.08 up to 3.5) by decreasing the number of load/store and arithmetic instructions, as well as the number of data-cache accesses and misses across the memory hierarchy. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and the hardware parameters (e.g., data-cache sizes and associativities) as one problem rather than separately, yielding high-quality solutions from a smaller search space.
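The core idea the abstract describes, reordering and tiling the multiplication loops so that data reuse matches the cache hierarchy while the inner loop stays unit-stride and SIMD-friendly, can be sketched as below. This is a generic cache-blocked kernel, not the paper's method: the fixed `TILE` of 64 is an assumption, whereas the paper derives tile sizes from the actual data-cache sizes and associativities.

```c
#include <stddef.h>

/* Cache-blocked (tiled) matrix-matrix multiply, C += A*B, row-major n x n.
   TILE = 64 is a placeholder; the paper selects tile sizes from the
   data-cache sizes and associativities instead of a fixed constant. */
#define TILE 64

void mmm_tiled(const double *A, const double *B, double *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t i_end = ii + TILE < n ? ii + TILE : n;
                size_t k_end = kk + TILE < n ? kk + TILE : n;
                size_t j_end = jj + TILE < n ? jj + TILE : n;
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k]; /* reused across the j loop */
                        /* unit-stride over B and C: amenable to SIMD
                           vectorization and few cache misses per tile */
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

With tiles sized so that a block of A, a block of B, and a block of C fit in cache together (and do not evict one another, which is where associativity enters), each element loaded from memory is reused TILE times before being displaced.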



Published In

The Journal of Supercomputing, Volume 68, Issue 3
June 2014
629 pages

Publisher

Kluwer Academic Publishers

United States


Author Tags

  1. Cache associativity
  2. Data cache
  3. Matrix–Matrix Multiplication
  4. Memory management
  5. Multi-core
  6. SIMD



Cited By

  • (2023) Improving the Performance of Task-Based Linear Algebra Software with Autotuning Techniques on Heterogeneous Architectures. Computational Science – ICCS 2023, pp 668–682. doi:10.1007/978-3-031-35995-8_47. Online publication date: 3-Jul-2023
  • (2022) A Methodology for Efficient Tile Size Selection for Affine Loop Kernels. International Journal of Parallel Programming 50(3–4):405–432. doi:10.1007/s10766-022-00734-5. Online publication date: 1-Aug-2022
  • (2021) An Analytical Model for Loop Tiling Transformation. Embedded Computer Systems: Architectures, Modeling, and Simulation, pp 95–107. doi:10.1007/978-3-031-04580-6_7. Online publication date: 4-Jul-2021
  • (2020) HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs. Computing 102(12):2607–2631. doi:10.1007/s00607-020-00846-1. Online publication date: 1-Dec-2020
  • (2018) SCP. ACM Transactions on Architecture and Code Optimization 15(4):1–21. doi:10.1145/3274654. Online publication date: 10-Oct-2018
  • (2017) Automatic generation of fast BLAS3-GEMM: a portable compiler approach. Proceedings of the 2017 International Symposium on Code Generation and Optimization, pp 122–133. doi:10.5555/3049832.3049846. Online publication date: 4-Feb-2017
  • (2016) Analytical Modeling Is Enough for High-Performance BLIS. ACM Transactions on Mathematical Software 43(2):1–18. doi:10.1145/2925987. Online publication date: 16-Aug-2016
  • (2016) A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures. The Journal of Supercomputing 72(3):804–844. doi:10.1007/s11227-015-1613-7. Online publication date: 1-Mar-2016
