Towards Efficient SpMV on Sunway Manycore Architectures

Authors:

Hailong Yang, and

Xu Liu

ICS '18: Proceedings of the 2018 International Conference on Supercomputing

June 2018

Pages 363 - 373

https://doi.org/10.1145/3205289.3205313

Published: 12 June 2018

Abstract

Sparse Matrix-Vector Multiplication (SpMV) is an essential computation kernel for many data-analytic workloads running in both supercomputers and data centers. The intrinsic irregularity of SpMV makes high performance hard to achieve, especially when porting to new architectures. In this paper, we present our work on designing and implementing efficient SpMV algorithms on Sunway, a novel architecture with many unique features. To fully exploit the Sunway architecture, we have designed a dual-side multi-level partition mechanism that partitions both the sparse matrices and the hardware resources to improve locality and parallelism. On the matrix side, we partition sparse matrices into blocks, tiles, and slices at different granularities. On the hardware side, we partition the cores of a Sunway processor into fleets, and further designate cores within a fleet as computation cores and I/O cores. Moreover, we have optimized the communication between partitions to further improve performance. Our scheme is generally applicable to different SpMV formats and implementations. For evaluation, we have applied our techniques atop a popular SpMV format, CSR. Experimental results on 18 datasets show that our optimization yields speedups of up to 15.5x (12.3x on average).
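For readers unfamiliar with the CSR format mentioned above, the following is a minimal Python sketch of a baseline CSR SpMV kernel, together with a variant that processes the matrix one block of consecutive rows at a time. The row blocking is only a generic stand-in for the idea of partitioning the matrix into independent units of work; it is not the paper's block/tile/slice scheme, whose details the abstract does not give. The names `csr_spmv`, `csr_spmv_row_blocks`, and `rows_per_block` are assumptions made for this illustration.

```python
from typing import List, Sequence


def csr_spmv(row_ptr: Sequence[int], col_idx: Sequence[int],
             values: Sequence[float], x: Sequence[float]) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i + 1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (nonzero values).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access to x through col_idx is the source of
            # the irregular memory behaviour of SpMV.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def csr_spmv_row_blocks(row_ptr: Sequence[int], col_idx: Sequence[int],
                        values: Sequence[float], x: Sequence[float],
                        rows_per_block: int) -> List[float]:
    """Same product, computed one block of consecutive rows at a time.

    Each block writes a disjoint slice of y, so in a parallel setting
    the blocks could be handed to different groups of cores.
    (Hypothetical illustration, not the paper's partitioning.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for start in range(0, n_rows, rows_per_block):
        stop = min(start + rows_per_block, n_rows)
        for i in range(start, stop):
            acc = 0.0
            for k in range(row_ptr[i], row_ptr[i + 1]):
                acc += values[k] * x[col_idx[k]]
            y[i] = acc
    return y


# Example: A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]] and x = [1, 2, 3].
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

print(csr_spmv(row_ptr, col_idx, values, x))                # [7.0, 6.0, 19.0]
print(csr_spmv_row_blocks(row_ptr, col_idx, values, x, 2))  # [7.0, 6.0, 19.0]
```

Both functions return the same vector. The blocked variant only changes the order in which rows are visited, which is what lets partitioned work be assigned to separate hardware units without changing the result.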

Cited By

Xiao G, Yin C, Zhou T, Li X, Chen Y, Li K. (2023). A Survey of Accelerating Parallel Sparse Linear Algebra. ACM Computing Surveys 56:1 (1-38). Online publication date: 28-Aug-2023.
https://dl.acm.org/doi/10.1145/3604606
Song L, Chen F, Li H, Chen Y, Mohror K, Arnold D, Badia R. (2023). ReFloat: Low-Cost Floating-Point Processing in ReRAM for Accelerating Iterative Linear Solvers. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-15). Online publication date: 12-Nov-2023.
https://dl.acm.org/doi/10.1145/3581784.3607077
Ye Y, Guo H, Wang B, Wang P, Chen D, Li F. (2023). Coupled Incomplete Cholesky and Jacobi Preconditioned Conjugate Gradient on the New Generation of Sunway Many-Core Architecture. IEEE Transactions on Computers 72:11 (3326-3339). Online publication date: 1-Nov-2023.
https://doi.org/10.1109/TC.2023.3296884

Index Terms

Towards Efficient SpMV on Sunway Manycore Architectures
1. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices
  2. Mathematical software
    1. Mathematical software performance
2. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms

Information & Contributors

Information

Published In

ICS '18: Proceedings of the 2018 International Conference on Supercomputing

June 2018

407 pages

ISBN:9781450357838

DOI:10.1145/3205289

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Natural Science Foundation of China
National Key R&D Program of China

Conference

ICS '18

Sponsor:

SIGARCH

ICS '18: 2018 International Conference on Supercomputing

June 12 - 15, 2018

Beijing, China

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 35
Total Downloads: 383
Downloads (Last 12 months): 53
Downloads (Last 6 weeks): 1

Citations

Cited By

Pan J, Xiao L, Tian M, Wang L, Yang C, Chen R, Ren Z, Liu A, Zhu G. (2023). hsSpMV: A Heterogeneous and SPM-aggregated SpMV for SW26010-Pro many-core processor. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (62-70). Online publication date: May-2023.
https://doi.org/10.1109/CCGrid57682.2023.00016
Sun B, Li M, Yang H, Xu J, Luan Z, Qian D. (2023). Adapting combined tiling to stencil optimizations on sunway processor. CCF Transactions on High Performance Computing 5:3 (322-333). Online publication date: 17-May-2023.
https://doi.org/10.1007/s42514-023-00147-x
Li M, Liu C, Liao J, Zheng X, Yang H, Sun R, Xu J, Gan L, Yang G, Luan Z, Qian D. (2023). Towards optimized tensor code generation for deep learning on sunway many-core processor. Frontiers of Computer Science 18:2. Online publication date: 13-Sep-2023.
https://doi.org/10.1007/s11704-022-2440-7
Li M, Liu Y, Chen B, Yang H, Luan Z, Qian D. (2023). Building a domain-specific compiler for emerging processors with a reusable approach. Science China Information Sciences 67:1. Online publication date: 27-Dec-2023.
https://doi.org/10.1007/s11432-022-3727-6
Gao J, Ji W, Tan Z, Wang Y, Shi F. (2022). TaiChi: A Hybrid Compression Format for Binary Sparse Matrix-Vector Multiplication on GPU. IEEE Transactions on Parallel and Distributed Systems 33:12 (3732-3745). Online publication date: 1-Dec-2022.
https://dl.acm.org/doi/10.1109/TPDS.2022.3170501
Zarebavani B, Cheshmi K, Liu B, Strout M, Dehnavi M. (2022). HDagg: Hybrid Aggregation of Loop-carried Dependence Iterations in Sparse Matrix Computations. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (1217-1227). Online publication date: May-2022.
https://doi.org/10.1109/IPDPS53621.2022.00121
Tran H, Fernando M, Saurabh K, Ganapathysubramanian B, Kirby R, Sundar H. (2022). A scalable adaptive-matrix SPMV for heterogeneous architectures. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (13-24). Online publication date: May-2022.
https://doi.org/10.1109/IPDPS53621.2022.00011
