research-article
Open access

Autotuning Runtime Specialization for Sparse Matrix-Vector Multiplication

Published: 28 March 2016

Abstract

Runtime specialization optimizes programs using partial information that becomes available only at runtime. In this paper we apply autotuning to the runtime specialization of Sparse Matrix-Vector Multiplication (SpMV), predicting the best specialization method among several. In 91% to 96% of the predictions, either the best or the second-best method is chosen, and the predictions achieve average speedups very close to those obtained when only the best methods are used. Using an efficient code generator and a carefully designed set of matrix features, we show that the runtime costs can be amortized, bringing performance benefits for many real-world cases.
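    The core idea the abstract describes — generating matrix-specific multiply code at runtime instead of running a generic sparse kernel — can be sketched as follows. This is a minimal illustration, not the paper's actual generator: the CSR layout and specialization-by-unrolling are standard techniques, the paper emits and compiles native code rather than Python, and all function names here are invented.

    ```python
    # Generic CSR SpMV vs. a runtime-specialized kernel generated for one matrix.

    def csr_spmv(values, col_idx, row_ptr, x):
        """Generic CSR sparse matrix-vector product: y = A @ x."""
        y = [0.0] * (len(row_ptr) - 1)
        for i in range(len(row_ptr) - 1):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                y[i] += values[k] * x[col_idx[k]]
        return y

    def specialize_spmv(values, col_idx, row_ptr):
        """Generate source code with this matrix's sparsity pattern (and values)
        baked in, eliminating index loads and inner-loop overhead."""
        lines = ["def spmv(x):", "    return ["]
        for i in range(len(row_ptr) - 1):
            terms = [f"{values[k]!r} * x[{col_idx[k]}]"
                     for k in range(row_ptr[i], row_ptr[i + 1])]
            lines.append("        " + (" + ".join(terms) or "0.0") + ",")
        lines.append("    ]")
        ns = {}
        exec("\n".join(lines), ns)   # runtime code generation
        return ns["spmv"]

    # The 2x3 matrix [[5, 0, 2], [0, 3, 0]] in CSR form:
    values, col_idx, row_ptr = [5.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3]
    x = [1.0, 2.0, 3.0]
    generic = csr_spmv(values, col_idx, row_ptr, x)        # [11.0, 6.0]
    specialized = specialize_spmv(values, col_idx, row_ptr)
    assert specialized(x) == generic
    ```

    Generating the kernel costs time up front, which is why the paper's autotuner must weigh generation cost against per-multiply savings: specialization pays off only when the same matrix is multiplied many times, as in iterative solvers.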



    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 1
    April 2016, 347 pages
    ISSN: 1544-3566
    EISSN: 1544-3973
    DOI: 10.1145/2899032

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 March 2016
    Accepted: 01 November 2015
    Revised: 01 November 2015
    Received: 01 June 2015
    Published in TACO Volume 13, Issue 1


    Author Tags

    1. Autotuning
    2. runtime code generation
    3. sparse matrix-vector multiplication

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Tübitak
    • National Science Foundation


    Cited By

    • (2023) WISE. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 329–341. DOI: 10.1145/3572848.3577506. Online publication date: 25-Feb-2023.
    • (2022) Dense dynamic blocks. Proceedings of the 36th ACM International Conference on Supercomputing, 1–14. DOI: 10.1145/3524059.3532369. Online publication date: 28-Jun-2022.
    • (2020) Duff's-Device-Based Sparse Matrix-Vector Multiplication [Duff Aygıtı Tabanlı Seyrek Matris-Vektör Çarpımı]. Deu Muhendislik Fakultesi Fen ve Muhendislik 22:65, 315–324. DOI: 10.21205/deufmd.2020226501. Online publication date: 15-May-2020.
    • (2020) Enabling Runtime SpMV Format Selection through an Overhead Conscious Method. IEEE Transactions on Parallel and Distributed Systems 31:1, 80–93. DOI: 10.1109/TPDS.2019.2932931. Online publication date: 1-Jan-2020.
    • (2019) Performance-Aware Model for Sparse Matrix-Matrix Multiplication on the Sunway TaihuLight Supercomputer. IEEE Transactions on Parallel and Distributed Systems 30:4, 923–938. DOI: 10.1109/TPDS.2018.2871189. Online publication date: 1-Apr-2019.
    • (2019) ZAKI: A Smart Method and Tool for Automatic Performance Optimization of Parallel SpMV Computations on Distributed Memory Machines. Mobile Networks and Applications 28:2, 744–763. DOI: 10.1007/s11036-019-01318-3. Online publication date: 31-Jul-2019.
    • (2018) Data motifs. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, 1–14. DOI: 10.1145/3243176.3243190. Online publication date: 1-Nov-2018.
    • (2018) The Sparse Polyhedral Framework: Composing Compiler-Generated Inspector-Executor Code. Proceedings of the IEEE 106:11, 1921–1934. DOI: 10.1109/JPROC.2018.2857721. Online publication date: Dec-2018.
    • (2018) Overhead-Conscious Format Selection for SpMV-Based Applications. 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 950–959. DOI: 10.1109/IPDPS.2018.00104. Online publication date: May-2018.
    • (2018) Preconditioner Auto-Tuning Using Deep Learning for Sparse Iterative Algorithms. 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), 257–262. DOI: 10.1109/CANDARW.2018.00055. Online publication date: Dec-2018.
