
SCP: Shared Cache Partitioning for High-Performance GEMM

Published: 10 October 2018

Abstract

GEneral Matrix Multiply (GEMM) is the most fundamental computational kernel routine in the BLAS library. To achieve high performance, in-memory data must be prefetched into fast on-chip caches before it is used. Two techniques, software prefetching and data packing, have been used to effectively exploit the capability of on-chip least recently used (LRU) caches, which are popular in traditional high-performance processors used in high-end servers and supercomputers. However, the market has recently witnessed a new diversity in processor design, resulting in high-performance processors equipped with shared caches that use non-LRU replacement policies. This poses a challenge to the development of high-performance GEMM in a multithreaded context: as several threads try to load data into a shared cache simultaneously, interthread cache conflicts increase significantly. We present a Shared Cache Partitioning (SCP) method that eliminates interthread cache conflicts in the GEMM routines by partitioning a shared cache into physically disjoint sets and assigning different sets to different threads. We have implemented SCP in the OpenBLAS library and evaluated it on Phytium 2000+, a 64-core AArch64 processor with private LRU L1 caches and a shared pseudo-random L2 cache per four-core cluster. Our evaluation shows that SCP effectively reduces conflict misses in both the L1 and L2 caches of a highly optimized GEMM implementation, improving its performance by 2.75% to 6.91%.
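To make the set-partitioning idea concrete, the sketch below shows one way to confine each thread's packed GEMM buffer to a disjoint range of cache sets. This is a minimal illustration, not the paper's OpenBLAS implementation: the cache parameters (64-byte lines, 2048 sets, four threads per shared cache) and the helper pack_partitioned are assumptions made for the example, and the sketch further assumes the set index is taken from the low bits of the buffer address, which for a physically indexed cache requires support from the OS or allocator.

```c
/* Minimal sketch of set-based shared-cache partitioning for packed
 * GEMM buffers. All parameters and names are illustrative assumptions,
 * not OpenBLAS's actual code. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum {
    LINE     = 64,                  /* cache-line size in bytes (assumed)     */
    SETS     = 2048,                /* sets in the shared cache (assumed)     */
    NTHREADS = 4,                   /* threads sharing one cache (4-core
                                       cluster, as on Phytium 2000+)          */
    WAY_SIZE = LINE * SETS,         /* bytes that walk through every set once */
    CHUNK    = WAY_SIZE / NTHREADS  /* contiguous bytes per thread per window */
};

/* Copy 'bytes' of packed matrix data into thread t's partition of a
 * shared buffer 'buf' (which must be WAY_SIZE-aligned). Thread t only
 * ever touches the t-th quarter of each WAY_SIZE window, i.e., a fixed
 * quarter of the cache sets. */
static void pack_partitioned(uint8_t *buf, int t,
                             const uint8_t *src, size_t bytes) {
    uint8_t *dst = buf + (size_t)t * CHUNK;   /* t's chunk in window 0 */
    while (bytes > 0) {
        size_t n = bytes < (size_t)CHUNK ? bytes : (size_t)CHUNK;
        memcpy(dst, src, n);                  /* stays inside t's set range */
        src   += n;
        bytes -= n;
        dst   += WAY_SIZE;                    /* t's chunk in the next window */
    }
}

int main(void) {
    /* WAY_SIZE alignment keeps the address-to-set decomposition stable. */
    uint8_t *buf   = aligned_alloc(WAY_SIZE, 4 * (size_t)WAY_SIZE);
    uint8_t *block = malloc(3 * (size_t)CHUNK);
    if (!buf || !block) return 1;
    memset(block, 7, 3 * (size_t)CHUNK);      /* stand-in for a packed block */
    pack_partitioned(buf, /*thread*/ 1, block, 3 * (size_t)CHUNK);
    free(block);
    free(buf);
    return 0;
}
```

Because thread t only ever writes the t-th quarter of each WAY_SIZE-aligned window, its packed data maps to a fixed quarter of the cache sets, so packed blocks belonging to different threads can never evict one another from the shared cache. This is exactly the property the abstract describes: the partition is enforced by address layout alone, with no hardware support required.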


Cited By

  • Characterizing Small-Scale Matrix Multiplications on ARMv8-based Many-Core Architectures. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS'21). 101-110, May 2021. DOI: 10.1109/IPDPS49936.2021.00019
  • Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+. Journal of Computer Science and Technology 36, 1 (Jan. 2021), 33-43. DOI: 10.1007/s11390-020-0741-6
  • Runtime Adaptive Matrix Multiplication for the SW26010 Many-Core Processor. IEEE Access 8 (2020), 156915-156928. DOI: 10.1109/ACCESS.2020.3019302

Published In

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 4
December 2018, 706 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3284745

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2018
Accepted: 01 August 2018
Revised: 01 August 2018
Received: 01 May 2018
Published in TACO Volume 15, Issue 4


Author Tags

  1. BLAS
  2. GEMM
  3. high-performance computing
  4. linear algebra
  5. optimization

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Australian Research Council
  • Innovative Team Support Program of Hunan
  • National Natural Science Foundation of Hunan
  • National Key Research and Development Program of China
