Sparse Matrix-Dense Matrix Multiplication on Heterogeneous CPU+FPGA Embedded System

Authors:

Mohammad Hosseinabady,

Jose Nunez-Yanez

PARMA-DITAM'2020: Proceedings of the 11th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures / 9th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms

Article No.: 1, Pages 1 - 6

https://doi.org/10.1145/3381427.3381428

Published: 16 March 2020

Abstract

Embedded intelligence is becoming the primary driver for new applications in industry, healthcare, and automotive, to name a few. These applications are characterized by high computational demand, real-time interaction with the environment, security, low power consumption, and local autonomy, among others. To address these diverse characteristics, researchers have proposed heterogeneous multicore embedded systems comprising CPUs, GPUs, FPGAs, and ASICs. While each computing element provides a unique capability that supports one of these characteristics, coordinating the processing cores so that an application achieves maximum performance is a crucial challenge. This paper considers the collaborative use of a multicore CPU and an FPGA in a heterogeneous embedded system to improve the performance of sparse matrix operations, which are essential for reducing inference complexity in machine learning, especially in deep convolutional neural networks. Experimental results show that collaborative execution of sparse-matrix-dense-matrix multiplication on the Xilinx Zynq MPSoC, a heterogeneous CPU+FPGA embedded system, can improve performance by up to 42% compared with using the FPGA alone as an accelerator.
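To make the idea of collaborative execution concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the sparse matrix is stored in CSR form and that its rows are divided between two workers according to their share of non-zero elements. The abstract does not describe the paper's actual partitioning scheme, so this split is an illustrative assumption. The function names and the `accel_fraction` parameter are hypothetical, and the "accelerator" worker is only a second thread standing in for the FPGA.

```python
from concurrent.futures import ThreadPoolExecutor


def spmdm_csr_rows(row_ptr, col_idx, values, dense, row_start, row_end):
    """Multiply rows [row_start, row_end) of a CSR matrix by a dense matrix."""
    n_cols = len(dense[0])
    out = []
    for r in range(row_start, row_end):
        acc = [0.0] * n_cols
        # Visit only the stored non-zeros of row r.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            a = values[k]
            b_row = dense[col_idx[k]]
            for j in range(n_cols):
                acc[j] += a * b_row[j]
        out.append(acc)
    return out


def split_row_by_nnz(row_ptr, fraction):
    """Return the first row index whose preceding rows hold at least
    `fraction` of all non-zeros (hypothetical load-balancing rule)."""
    n_rows = len(row_ptr) - 1
    target = fraction * row_ptr[-1]
    for r in range(n_rows + 1):
        if row_ptr[r] >= target:
            return r
    return n_rows


def collaborative_spmdm(row_ptr, col_idx, values, dense, accel_fraction=0.5):
    """Compute C = A * B with the rows of A shared between two workers.

    Assumption: the first worker stands in for the FPGA and the second
    for the CPU cores; both run the same Python kernel here.
    """
    n_rows = len(row_ptr) - 1
    split = split_row_by_nnz(row_ptr, accel_fraction)
    with ThreadPoolExecutor(max_workers=2) as pool:
        accel = pool.submit(spmdm_csr_rows, row_ptr, col_idx, values,
                            dense, 0, split)
        cpu = pool.submit(spmdm_csr_rows, row_ptr, col_idx, values,
                          dense, split, n_rows)
        # Row blocks are disjoint, so the results simply concatenate.
        return accel.result() + cpu.result()
```

Because each worker owns a disjoint block of output rows, no synchronization is needed beyond waiting for both to finish. On a real CPU+FPGA system the split fraction would presumably be tuned to the relative throughput of the two devices, which this sketch leaves as a free parameter.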

Cited By

Joseph T, T.S B (2024). Real-time Blood Pressure Prediction on Wearables with Edge-Based DNNs: A Co-Design Approach. ACM Transactions on Design Automation of Electronic Systems 30:1 (1-24). DOI: 10.1145/3699512. Online publication date: 7-Oct-2024.
https://dl.acm.org/doi/10.1145/3699512

Xiao G, Yin C, Zhou T, Li X, Chen Y, Li K (2023). A Survey of Accelerating Parallel Sparse Linear Algebra. ACM Computing Surveys 56:1 (1-38). DOI: 10.1145/3604606. Online publication date: 28-Aug-2023.
https://dl.acm.org/doi/10.1145/3604606

Index Terms

Sparse Matrix-Dense Matrix Multiplication on Heterogeneous CPU+FPGA Embedded System
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
    2. Parallel architectures
      1. Multicore architectures
  2. Embedded and cyber-physical systems
    1. Embedded systems
2. Hardware
  1. Electronic design automation
    1. High-level and register-transfer level synthesis
  2. Integrated circuits
    1. Reconfigurable logic and FPGAs
      1. Hardware accelerators

Information & Contributors

Information

Published In

PARMA-DITAM'2020: Proceedings of the 11th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures / 9th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms

January 2020

30 pages

ISBN:9781450375450

DOI:10.1145/3381427

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

HiPEAC: HiPEAC Network of Excellence

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Engineering and Physical Sciences Research Council

Conference

PARMA-DITAM'2020

PARMA-DITAM'2020: 11th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures / 9th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms

January 21, 2020

Bologna, Italy

Acceptance Rates

PARMA-DITAM'2020 Paper Acceptance Rate 5 of 9 submissions, 56%;

Overall Acceptance Rate 11 of 24 submissions, 46%
