Abstract
To achieve good performance and scalability on multicore clusters, parallel applications must be carefully optimized to exploit intra-node parallelism and to reduce inter-node communication. This paper investigates automatic tuning of the sparse matrix-vector multiplication (SpMV) kernel implemented in a partitioned global address space language whose runtime supports a hybrid thread- and process-based communication layer for multicore systems. One-sided communication is used for inter-node data exchange, while intra-node communication combines process shared memory with multithreading. We develop performance models that guide the selection of the best hybrid thread/process configuration and the best communication pattern for SpMV. As a result, our tuned SpMV in the hybrid runtime environment consumes less memory and incurs less inter-node communication volume, without sacrificing data locality. Experiments are conducted on 12 real sparse matrices. On 16-node Xeon and 8-node Opteron clusters, the tuned SpMV kernel achieves average speedups of 1.4X and 1.5X, respectively, over a well-optimized process-based message-passing implementation.
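For context, the kernel being tuned is the standard compressed sparse row (CSR) sparse matrix-vector product y = A*x. The sketch below is a minimal plain-C version over one block of rows, not the paper's UPC implementation; the function name, its signature, and the assumption that the required entries of x have already been gathered (for example via one-sided communication in the hybrid runtime) are illustrative assumptions, not details taken from the paper.

    /* Minimal sketch: sequential CSR SpMV over the rows [row_begin, row_end)
     * owned by one process or thread. In a distributed setting, remote
     * entries of x would be fetched before this loop runs. */
    #include <stddef.h>

    void spmv_csr_block(size_t row_begin, size_t row_end,
                        const size_t *row_ptr,  /* CSR row pointers        */
                        const size_t *col_idx,  /* CSR column indices      */
                        const double *val,      /* CSR nonzero values      */
                        const double *x,        /* input vector (gathered) */
                        double *y)              /* output vector (local)   */
    {
        for (size_t i = row_begin; i < row_end; i++) {
            double sum = 0.0;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }

In the hybrid runtime described in the abstract, the row range assigned to each thread or process and the pattern used to gather x are exactly the knobs the performance models are meant to tune.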