A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-blas on FPGAs

Published: 26 February 2014

Abstract

Sparse Matrix-Vector Multiplication (SpMxV) is a widely used mathematical operation in many high-performance scientific and engineering applications. In recent years, tuned software libraries for multi-core microprocessors (CPUs) and graphics processing units (GPUs) have become the status quo for computing SpMxV. However, the computational throughput of these libraries for sparse matrices tends to be significantly lower than for dense matrices, largely because the compression formats required to store sparse matrices efficiently are a poor match for traditional computing architectures. This paper describes an FPGA-based SpMxV kernel that scales to efficiently utilize the available memory bandwidth and computing resources. Benchmarking on a Virtex-5 SX95T FPGA demonstrates an average computational efficiency of 91.85%. The kernel achieves a peak computational efficiency of 99.8%, a >50x improvement over two Intel Core i7 processors (i7-2600 and i7-4770) running the MKL sparse-BLAS library and a >300x improvement over two NVIDIA GPUs (GTX 660 and GTX Titan) running cuSPARSE. In addition, the SpMxV FPGA kernel achieves higher performance than its CPU and GPU counterparts while using only 64 single-precision processing elements, for an overall 38-50x improvement in energy efficiency.
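
To make the architectural mismatch concrete, the sketch below shows y = A*x for a matrix stored in compressed sparse row (CSR) form, one common sparse-BLAS compression format. This is illustrative context only, written in C under the assumption of a CSR layout; it is not the paper's FPGA kernel nor the MKL/cuSPARSE implementation, and all identifiers are hypothetical. The gather x[col_idx[k]] is the irregular, data-dependent memory access that limits cache and SIMD efficiency on CPUs and GPUs.

    /* Reference single-precision CSR SpMxV: y = A*x (illustrative sketch only). */
    #include <stddef.h>

    void spmv_csr(size_t n_rows,
                  const size_t *row_ptr,  /* length n_rows + 1; row i's nonzeros occupy [row_ptr[i], row_ptr[i+1]) */
                  const size_t *col_idx,  /* length nnz; column index of each stored nonzero */
                  const float  *val,      /* length nnz; value of each stored nonzero */
                  const float  *x,        /* dense input vector */
                  float        *y)        /* dense output vector */
    {
        for (size_t i = 0; i < n_rows; ++i) {
            float acc = 0.0f;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                acc += val[k] * x[col_idx[k]];  /* indirect gather into x: irregular access */
            y[i] = acc;
        }
    }

As a general observation (not a description of this paper's specific architecture), FPGA SpMxV designs can stream row_ptr, col_idx, and val sequentially from memory and absorb the irregular accesses to x with on-chip buffering, which is what makes high utilization of both memory bandwidth and processing elements attainable.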

References

[1]
S. Kestur, J.D. Davis, and E.S. Chung, "Towards a Universal FPGA Matrix-Vector Multiplication Architecture," Int. Symp. Field-Programmable Custom Comp. Mach. (FCCM 2012), pp. 9--16, May 2012.
[2]
S. Sun, M. Monga, P.H. Jones, and J. Zambreno, "An I/O Bandwidth-Sensitive Sparse Matrix-Vector Multiplication Engine on FPGAs," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 1, pp. 113--123, Jan. 2012.
[3]
G. Goumas, K. Kourtis, N. Anastopoulos, V. Karakasis, and N. Koziris, "Understanding the Performance of Sparse Matrix-Vector Multiplication," Euromicro Conf. Parallel, Distributed and Network-Based Process. (PDP 2008), pp. 283--292, Feb. 2008.
[4]
J. Sun, G. Peterson, and O. Storaasli, "Mapping Sparse Matrix-Vector Multiplication on FPGAs," Reconfigurable Systems Summer Institute (RSSI 2007), July 2007.
[5]
"Intel Math Kernel library." {Online}. Available: http://software.intel.com/en-us/intel-mkl
[6]
S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demme, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms," in Proc. ACM/IEEE Conf. Supercomputing (SC 2007), pp.1--12, Nov. 2007.
[7]
"Nvidia cuBLAS." {Online}. Available: http://developer.nvidia.com/cublas
[8]
"Nvidia cuSPARSE." {Online}. Available: http://developer.nvidia.com/cusparse
[9]
N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proc. ACM/IEEE Conf. Supercomputing (SC 2009), pp. 18:1--18:11, Nov. 2009.
[10]
G. Kuzmanov and M. Taouil, "Reconfigurable sparse/dense matrix-vector multiplier," Int. Conf. Field-Programmable Tech. (FPT 2009), pp. 483--488, Dec. 2009.
[11]
L. Zhuo and V.K. Prasanna, "Sparse Matrix-Vector multiplication on FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA '05), pp. 63--74, Feb. 2005.
[12]
Yan Zhang, Y.H. Shalabi, R. Jain, K.K. Nagar, and J.D. Bakos, "FPGA vs. GPU for sparse matrix vector multiply," Int. Conf. Field-Programmable Tech. (FPT 2009), pp. 255--262, Dec. 2009.
[13]
D. Gregg, C. McSweeney, C. McElroy, F. Connor, S. McGettrick, D. Moloney, and D. Geraghty, "FPGA Based Sparse Matrix Vector Multiplication using Commodity DRAM Memory," Int. Conf. Field Programmable Logic Applicat. (FPL 2007), pp. 786--791, Aug. 2007.
[14]
C.Y. Lin, H. K.-H. So, and P.H.-W. Leong, "A Model for Matrix Multiplication Performance on FPGAs," Int. Conf. Field Programmable Logic Applicat. (FPL 2011), pp.305--310, Sept. 2011.
[15]
"ROACH." {Online}. Available: https://casper.berkeley.edu/wiki/ROACH
[16]
T. A. Davis and Y. Hu., "The university of Florida sparse matrix collection.," ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1--1:25, Dec. 2011.
[17]
P. Gepner, D. L. Fraser, and V. Gamayunov, "Evaluation of the 3rd generation Intel Core Processor focusing on HPC applications," Int. Conf. Parallel Distrib. Process. Techn. Applicat. (PDPTA 2012), pp. 818--823, July 2009.
[18]
"NVIDIA GeForce GTX 680: The fastest, most efficient GPU ever built." {Online}. Available: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
[19]
"Intel® Core™ i7--4770 Processor." {Online}. Available: http://ark.intel.com/products/75122/Intel-Core-i7--4770-Processor-8M-Cache-up-to-3_90-GHz
[20]
"Introducing the GeForce GTX TITAN." {Online}. Available: http://www.geforce.com/whats-new/articles/introducing-the-geforce-gtx-titan


Information

Published In

FPGA '14: Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2014
272 pages
ISBN:9781450326711
DOI:10.1145/2554688
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. benchmarking
  2. computational efficiency
  3. cpu
  4. energy-efficiency
  5. fpga
  6. gpu
  7. sparse-blas
  8. spmxv

Qualifiers

  • Research-article

Conference

FPGA '14

Acceptance Rates

FPGA '14 Paper Acceptance Rate: 30 of 110 submissions, 27%
Overall Acceptance Rate: 125 of 627 submissions, 20%


Cited By

  • (2024) Dynamic Probabilistic Pruning: A General Framework for Hardware-Constrained Pruning at Different Granularities. IEEE Transactions on Neural Networks and Learning Systems 35:1 (733-744). DOI: 10.1109/TNNLS.2022.3176809. Online publication date: Jan-2024
  • (2023) BitSET: Bit-Serial Early Termination for Computation Reduction in Convolutional Neural Networks. ACM Transactions on Embedded Computing Systems 22:5s (1-24). DOI: 10.1145/3609093. Online publication date: 31-Oct-2023
  • (2023) Towards High-Bandwidth-Utilization SpMV on FPGAs via Partial Vector Duplication. Proceedings of the 28th Asia and South Pacific Design Automation Conference (33-38). DOI: 10.1145/3566097.3567839. Online publication date: 16-Jan-2023
  • (2023) Efficient Spectral Graph Convolutional Network Deployment on Memristive Crossbars. IEEE Transactions on Emerging Topics in Computational Intelligence 7:2 (415-425). DOI: 10.1109/TETCI.2022.3210998. Online publication date: Apr-2023
  • (2023) Efficient FPGA-Based Sparse Matrix–Vector Multiplication With Data Reuse-Aware Compression. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42:12 (4606-4617). DOI: 10.1109/TCAD.2023.3281715. Online publication date: Dec-2023
  • (2023) Efficient Acceleration of Deep Learning Inference on Resource-Constrained Edge Devices: A Review. Proceedings of the IEEE 111:1 (42-91). DOI: 10.1109/JPROC.2022.3226481. Online publication date: Jan-2023
  • (2023) Path-Based Processing using In-Memory Systolic Arrays for Accelerating Data-Intensive Applications. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD) (1-9). DOI: 10.1109/ICCAD57390.2023.10323622. Online publication date: 28-Oct-2023
  • (2023) CNN Hardware Accelerator Architecture Design for Energy-Efficient AI. Artificial Intelligence and Hardware Accelerators (319-357). DOI: 10.1007/978-3-031-22170-5_10. Online publication date: 16-Mar-2023
  • (2022) Offloading Transprecision Calculation Using FPGA. International Conference on High Performance Computing in Asia-Pacific Region Workshops (19-28). DOI: 10.1145/3503470.3503472. Online publication date: 11-Jan-2022
  • (2022) Structured and tiled-based pruning of Deep Learning models targeting FPGA implementations. 2022 IEEE International Symposium on Circuits and Systems (ISCAS) (1392-1396). DOI: 10.1109/ISCAS48785.2022.9937748. Online publication date: 28-May-2022
