A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-blas on FPGAs

Published: 26 February 2014

Abstract

Sparse Matrix-Vector Multiplication (SpMxV) is a widely used mathematical operation in many high-performance scientific and engineering applications. In recent years, tuned software libraries for multi-core microprocessors (CPUs) and graphics processing units (GPUs) have become the status quo for computing SpMxV. However, the computational throughput of these libraries for sparse matrices tends to be significantly lower than for dense matrices, largely because the compression formats required to store sparse matrices efficiently are a poor match for traditional computing architectures. This paper describes an FPGA-based SpMxV kernel that scales to efficiently utilize the available memory bandwidth and computing resources. Benchmarking on a Virtex-5 SX95T FPGA demonstrates an average computational efficiency of 91.85%. The kernel achieves a peak computational efficiency of 99.8%, a >50x improvement over two Intel Core i7 processors (i7-2600 and i7-4770) running the MKL sparse-BLAS library and a >300x improvement over two NVIDIA GPUs (GTX 660 and GTX Titan) running cuSPARSE. In addition, the SpMxV FPGA kernel achieves higher performance than its CPU and GPU counterparts while using only 64 single-precision processing elements, for an overall 38-50x improvement in energy efficiency.
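
To make the architectural mismatch concrete, the sketch below shows y = A*x for a matrix stored in compressed sparse row (CSR) form, one common sparse-BLAS compression format. This is illustrative context only, written in C under the assumption of a CSR layout; it is not the paper's FPGA kernel nor the MKL/cuSPARSE implementation, and all identifiers are hypothetical. The gather x[col_idx[k]] is the irregular, data-dependent memory access that limits cache and SIMD efficiency on CPUs and GPUs.

    /* Reference single-precision CSR SpMxV: y = A*x (illustrative sketch only). */
    #include <stddef.h>

    void spmv_csr(size_t n_rows,
                  const size_t *row_ptr,  /* length n_rows + 1; row i's nonzeros occupy [row_ptr[i], row_ptr[i+1]) */
                  const size_t *col_idx,  /* length nnz; column index of each stored nonzero */
                  const float  *val,      /* length nnz; value of each stored nonzero */
                  const float  *x,        /* dense input vector */
                  float        *y)        /* dense output vector */
    {
        for (size_t i = 0; i < n_rows; ++i) {
            float acc = 0.0f;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                acc += val[k] * x[col_idx[k]];  /* indirect gather into x: irregular access */
            y[i] = acc;
        }
    }

As a general observation (not a description of this paper's specific architecture), FPGA SpMxV designs can stream row_ptr, col_idx, and val sequentially from memory and absorb the irregular accesses to x with on-chip buffering, which is what makes high utilization of both memory bandwidth and processing elements attainable.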

References

[1]
S. Kestur, J.D. Davis, and E.S. Chung, "Towards a Universal FPGA Matrix-Vector Multiplication Architecture," Int. Symp. Field-Programmable Custom Comp. Mach. (FCCM 2012), pp. 9--16, May 2012.
[2]
S. Sun, M. Monga, P.H. Jones, and J. Zambreno, "An I/O Bandwidth-Sensitive Sparse Matrix-Vector Multiplication Engine on FPGAs," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 1, pp. 113--123, Jan. 2012.
[3]
G. Goumas, K. Kourtis, N. Anastopoulos, V. Karakasis, and N. Koziris, "Understanding the Performance of Sparse Matrix-Vector Multiplication," Euromicro Conf. Parallel, Distributed and Network-Based Process. (PDP 2008), pp. 283--292, Feb. 2008.
[4]
J. Sun, G. Peterson, and O. Storaasli, "Mapping Sparse Matrix-Vector Multiplication on FPGAs," Reconfigurable Systems Summer Institute (RSSI 2007), July 2007.
[5]
"Intel Math Kernel library." {Online}. Available: http://software.intel.com/en-us/intel-mkl
[6]
S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demme, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms," in Proc. ACM/IEEE Conf. Supercomputing (SC 2007), pp.1--12, Nov. 2007.
[7]
"Nvidia cuBLAS." {Online}. Available: http://developer.nvidia.com/cublas
[8]
"Nvidia cuSPARSE." {Online}. Available: http://developer.nvidia.com/cusparse
[9]
N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proc. ACM/IEEE Conf. Supercomputing (SC 2009), pp. 18:1--18:11, Nov. 2009.
[10]
G. Kuzmanov and M. Taouil, "Reconfigurable sparse/dense matrix-vector multiplier," Int. Conf. Field-Programmable Tech. (FPT 2009), pp. 483--488, Dec. 2009.
[11]
L. Zhuo and V.K. Prasanna, "Sparse Matrix-Vector multiplication on FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA '05), pp. 63--74, Feb. 2005.
[12]
Yan Zhang, Y.H. Shalabi, R. Jain, K.K. Nagar, and J.D. Bakos, "FPGA vs. GPU for sparse matrix vector multiply," Int. Conf. Field-Programmable Tech. (FPT 2009), pp. 255--262, Dec. 2009.
[13]
D. Gregg, C. McSweeney, C. McElroy, F. Connor, S. McGettrick, D. Moloney, and D. Geraghty, "FPGA Based Sparse Matrix Vector Multiplication using Commodity DRAM Memory," Int. Conf. Field Programmable Logic Applicat. (FPL 2007), pp. 786--791, Aug. 2007.
[14]
C.Y. Lin, H. K.-H. So, and P.H.-W. Leong, "A Model for Matrix Multiplication Performance on FPGAs," Int. Conf. Field Programmable Logic Applicat. (FPL 2011), pp.305--310, Sept. 2011.
[15]
"ROACH." {Online}. Available: https://casper.berkeley.edu/wiki/ROACH
[16]
T. A. Davis and Y. Hu., "The university of Florida sparse matrix collection.," ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1--1:25, Dec. 2011.
[17]
P. Gepner, D. L. Fraser, and V. Gamayunov, "Evaluation of the 3rd generation Intel Core Processor focusing on HPC applications," Int. Conf. Parallel Distrib. Process. Techn. Applicat. (PDPTA 2012), pp. 818--823, July 2009.
[18]
"NVIDIA GeForce GTX 680: The fastest, most efficient GPU ever built." {Online}. Available: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
[19]
"Intel® Core™ i7--4770 Processor." {Online}. Available: http://ark.intel.com/products/75122/Intel-Core-i7--4770-Processor-8M-Cache-up-to-3_90-GHz
[20]
"Introducing the GeForce GTX TITAN." {Online}. Available: http://www.geforce.com/whats-new/articles/introducing-the-geforce-gtx-titan


Information

Published In

FPGA '14: Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2014
272 pages
ISBN:9781450326711
DOI:10.1145/2554688
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. benchmarking
  2. computational efficiency
  3. cpu
  4. energy-efficiency
  5. fpga
  6. gpu
  7. sparse-blas
  8. spmxv

Qualifiers

  • Research-article

Conference

FPGA '14

Acceptance Rates

FPGA '14 Paper Acceptance Rate: 30 of 110 submissions, 27%
Overall Acceptance Rate: 125 of 627 submissions, 20%


Cited By

  • (2024) Dynamic Probabilistic Pruning: A General Framework for Hardware-Constrained Pruning at Different Granularities. IEEE Transactions on Neural Networks and Learning Systems 35:1 (733-744). DOI: 10.1109/TNNLS.2022.3176809. Online publication date: Jan-2024
  • (2023) BitSET: Bit-Serial Early Termination for Computation Reduction in Convolutional Neural Networks. ACM Transactions on Embedded Computing Systems 22:5s (1-24). DOI: 10.1145/3609093. Online publication date: 31-Oct-2023
  • (2023) Towards High-Bandwidth-Utilization SpMV on FPGAs via Partial Vector Duplication. Proceedings of the 28th Asia and South Pacific Design Automation Conference (33-38). DOI: 10.1145/3566097.3567839. Online publication date: 16-Jan-2023
  • (2023) Efficient Spectral Graph Convolutional Network Deployment on Memristive Crossbars. IEEE Transactions on Emerging Topics in Computational Intelligence 7:2 (415-425). DOI: 10.1109/TETCI.2022.3210998. Online publication date: Apr-2023
  • (2023) Efficient FPGA-Based Sparse Matrix–Vector Multiplication With Data Reuse-Aware Compression. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42:12 (4606-4617). DOI: 10.1109/TCAD.2023.3281715. Online publication date: Dec-2023
  • (2023) Efficient Acceleration of Deep Learning Inference on Resource-Constrained Edge Devices: A Review. Proceedings of the IEEE 111:1 (42-91). DOI: 10.1109/JPROC.2022.3226481. Online publication date: Jan-2023
  • (2023) Path-Based Processing using In-Memory Systolic Arrays for Accelerating Data-Intensive Applications. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD) (1-9). DOI: 10.1109/ICCAD57390.2023.10323622. Online publication date: 28-Oct-2023
  • (2023) CNN Hardware Accelerator Architecture Design for Energy-Efficient AI. Artificial Intelligence and Hardware Accelerators (319-357). DOI: 10.1007/978-3-031-22170-5_10. Online publication date: 16-Mar-2023
  • (2022) Offloading Transprecision Calculation Using FPGA. International Conference on High Performance Computing in Asia-Pacific Region Workshops (19-28). DOI: 10.1145/3503470.3503472. Online publication date: 11-Jan-2022
  • (2022) Structured and tiled-based pruning of Deep Learning models targeting FPGA implementations. 2022 IEEE International Symposium on Circuits and Systems (ISCAS) (1392-1396). DOI: 10.1109/ISCAS48785.2022.9937748. Online publication date: 28-May-2022
