
Sparse Matrix Multiplication on an Associative Processor

Published: 01 November 2015

Abstract

Sparse matrix multiplication is an important component of linear algebra computations. Implementing sparse matrix multiplication on an associative processor (AP) enables a high level of parallelism, where a row of one matrix is multiplied in parallel with the entire second matrix, and where the execution time of a vector dot product does not depend on the vector size. Four sparse matrix multiplication algorithms, combining AP and baseline CPU processing to varying degrees, are explored in this paper. They are evaluated by simulation on a large set of sparse matrices. The computational complexity of sparse matrix multiplication on the AP is shown to be O(nnz), where nnz is the number of nonzero elements. The AP is found to be especially efficient for binary sparse matrix multiplication, and it outperforms conventional solutions in power efficiency.
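To make the row-parallel scheme concrete, below is a minimal Python sketch of Gustavson-style row-by-row sparse matrix multiplication over CSR operands. The function name spmm_csr and the list-of-dicts output format are illustrative choices, not the paper's interface. The sequential inner loops here stand in for the associative match-and-accumulate that the AP performs in parallel across the entire second matrix, which is why total AP work scales with the nonzero count rather than the matrix dimensions.

```python
# Illustrative sketch only (not the paper's AP microcode): row-by-row
# sparse matrix multiplication over CSR operands. On the AP, each
# nonzero of A is matched against all of B's rows at once, so the
# sequential loops below collapse into O(nnz) parallel steps.

def spmm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val):
    """Multiply two CSR matrices A*B; returns one dict per result row,
    mapping column index -> accumulated value."""
    result = []
    n_rows_a = len(a_ptr) - 1
    for i in range(n_rows_a):
        acc = {}  # sparse accumulator for row i of the product
        for p in range(a_ptr[i], a_ptr[i + 1]):      # nonzeros of A's row i
            k, a_ik = a_idx[p], a_val[p]
            for q in range(b_ptr[k], b_ptr[k + 1]):  # nonzeros of B's row k
                j = b_idx[q]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[q]
        result.append(acc)
    return result

# Tiny usage example: A = [[1, 0], [0, 2]], B = [[0, 3], [4, 0]]
a_ptr, a_idx, a_val = [0, 1, 2], [0, 1], [1.0, 2.0]
b_ptr, b_idx, b_val = [0, 1, 2], [1, 0], [3.0, 4.0]
print(spmm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val))
# -> [{1: 3.0}, {0: 8.0}]
```

Note that every nonzero of A is visited exactly once in the outer two loops; the innermost loop is the dot-product work whose latency, on the AP, is independent of vector size.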


Cited By

  • AIDA: Associative In-Memory Deep Learning Accelerator. IEEE Micro, vol. 42, no. 6, pp. 67-75, Nov. 2022. doi:10.1109/MM.2022.3190924
  • ApproxTuner. Proc. 26th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP), pp. 262-277, Feb. 2021. doi:10.1145/3437801.3446108
  • GIRAF: General Purpose In-Storage Resistive Associative Framework. Proc. Int. Conf. Parallel Architectures and Compilation Techniques (PACT), pp. 476-477, Sep. 2019. doi:10.1109/PACT.2019.00053
  • Accelerator for Sparse Machine Learning. IEEE Computer Architecture Letters, vol. 17, no. 1, pp. 21-24, Jan. 2018. doi:10.1109/LCA.2017.2714667



Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 26, Issue 11
Nov. 2015
292 pages

Publisher

IEEE Press


Author Tags

  1. In-memory computing
  2. Sparse linear algebra
  3. SIMD
  4. Associative processor
  5. Memory-intensive computing

Qualifiers

  • Research-article


