
Sparse Matrix Multiplication on an Associative Processor

Published: 01 November 2015

Abstract

Sparse matrix multiplication is an important component of linear algebra computations. Implementing sparse matrix multiplication on an associative processor (AP) enables a high level of parallelism, where a row of one matrix is multiplied in parallel with the entire second matrix, and where the execution time of a vector dot product does not depend on the vector size. Four sparse matrix multiplication algorithms, combining AP and baseline CPU processing to varying degrees, are explored in this paper. They are evaluated by simulation on a large set of sparse matrices. The computational complexity of sparse matrix multiplication on the AP is shown to be O(nnz), where nnz is the number of nonzero elements. The AP is found to be especially efficient for binary sparse matrix multiplication, and it outperforms conventional solutions in power efficiency.
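To make the row-parallel scheme concrete, below is a minimal Python sketch of Gustavson-style row-by-row sparse matrix multiplication over CSR operands. The function name spmm_csr and the list-of-dicts output format are illustrative choices, not the paper's interface. The sequential inner loops here stand in for the associative match-and-accumulate that the AP performs in parallel across the entire second matrix, which is why total AP work scales with the nonzero count rather than the matrix dimensions.

```python
# Illustrative sketch only (not the paper's AP microcode): row-by-row
# sparse matrix multiplication over CSR operands. On the AP, each
# nonzero of A is matched against all of B's rows at once, so the
# sequential loops below collapse into O(nnz) parallel steps.

def spmm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val):
    """Multiply two CSR matrices A*B; returns one dict per result row,
    mapping column index -> accumulated value."""
    result = []
    n_rows_a = len(a_ptr) - 1
    for i in range(n_rows_a):
        acc = {}  # sparse accumulator for row i of the product
        for p in range(a_ptr[i], a_ptr[i + 1]):      # nonzeros of A's row i
            k, a_ik = a_idx[p], a_val[p]
            for q in range(b_ptr[k], b_ptr[k + 1]):  # nonzeros of B's row k
                j = b_idx[q]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[q]
        result.append(acc)
    return result

# Tiny usage example: A = [[1, 0], [0, 2]], B = [[0, 3], [4, 0]]
a_ptr, a_idx, a_val = [0, 1, 2], [0, 1], [1.0, 2.0]
b_ptr, b_idx, b_val = [0, 1, 2], [1, 0], [3.0, 4.0]
print(spmm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val))
# -> [{1: 3.0}, {0: 8.0}]
```

Note that every nonzero of A is visited exactly once in the outer two loops; the innermost loop is the dot-product work whose latency, on the AP, is independent of vector size.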


Cited By

  • AIDA: Associative In-Memory Deep Learning Accelerator. IEEE Micro, vol. 42, no. 6, pp. 67-75, Nov. 2022. doi:10.1109/MM.2022.3190924
  • ApproxTuner. Proc. 26th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP), pp. 262-277, Feb. 2021. doi:10.1145/3437801.3446108
  • GIRAF: General Purpose In-Storage Resistive Associative Framework. Proc. Int. Conf. Parallel Architectures and Compilation Techniques (PACT), pp. 476-477, Sep. 2019. doi:10.1109/PACT.2019.00053
  • Accelerator for Sparse Machine Learning. IEEE Computer Architecture Letters, vol. 17, no. 1, pp. 21-24, Jan. 2018. doi:10.1109/LCA.2017.2714667



Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 26, Issue 11
Nov. 2015
292 pages

Publisher

IEEE Press


Author Tags

  1. In-memory computing
  2. Sparse linear algebra
  3. SIMD
  4. Associative processor
  5. Memory-intensive computing

Qualifiers

  • Research-article


