
SCP: Shared Cache Partitioning for High-Performance GEMM

Published: 10 October 2018

Abstract

GEneral Matrix Multiply (GEMM) is the most fundamental computational kernel routine in the BLAS library. To achieve high performance, in-memory data must be prefetched into fast on-chip caches before it is used. Two techniques, software prefetching and data packing, have been used to effectively exploit the capability of on-chip least recently used (LRU) caches, which are popular in traditional high-performance processors used in high-end servers and supercomputers. However, the market has recently witnessed a new diversity in processor design, resulting in high-performance processors equipped with shared caches that use non-LRU replacement policies. This poses a challenge to the development of high-performance GEMM in a multithreaded context: as several threads try to load data into a shared cache simultaneously, interthread cache conflicts increase significantly. We present a Shared Cache Partitioning (SCP) method that eliminates interthread cache conflicts in the GEMM routines by partitioning a shared cache into physically disjoint sets and assigning different sets to different threads. We have implemented SCP in the OpenBLAS library and evaluated it on Phytium 2000+, a 64-core AArch64 processor with private LRU L1 caches and a shared pseudo-random L2 cache per four-core cluster. Our evaluation shows that SCP effectively reduces conflict misses in both the L1 and L2 caches of a highly optimized GEMM implementation, improving its performance by 2.75% to 6.91%.
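To make the set-partitioning idea concrete, the sketch below shows one way to confine each thread's packed GEMM buffer to a disjoint range of cache sets. This is a minimal illustration, not the paper's OpenBLAS implementation: the cache parameters (64-byte lines, 2048 sets, four threads per shared cache) and the helper pack_partitioned are assumptions made for the example, and the sketch further assumes the set index is taken from the low bits of the buffer address, which for a physically indexed cache requires support from the OS or allocator.

```c
/* Minimal sketch of set-based shared-cache partitioning for packed
 * GEMM buffers. All parameters and names are illustrative assumptions,
 * not OpenBLAS's actual code. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum {
    LINE     = 64,                  /* cache-line size in bytes (assumed)     */
    SETS     = 2048,                /* sets in the shared cache (assumed)     */
    NTHREADS = 4,                   /* threads sharing one cache (4-core
                                       cluster, as on Phytium 2000+)          */
    WAY_SIZE = LINE * SETS,         /* bytes that walk through every set once */
    CHUNK    = WAY_SIZE / NTHREADS  /* contiguous bytes per thread per window */
};

/* Copy 'bytes' of packed matrix data into thread t's partition of a
 * shared buffer 'buf' (which must be WAY_SIZE-aligned). Thread t only
 * ever touches the t-th quarter of each WAY_SIZE window, i.e., a fixed
 * quarter of the cache sets. */
static void pack_partitioned(uint8_t *buf, int t,
                             const uint8_t *src, size_t bytes) {
    uint8_t *dst = buf + (size_t)t * CHUNK;   /* t's chunk in window 0 */
    while (bytes > 0) {
        size_t n = bytes < (size_t)CHUNK ? bytes : (size_t)CHUNK;
        memcpy(dst, src, n);                  /* stays inside t's set range */
        src   += n;
        bytes -= n;
        dst   += WAY_SIZE;                    /* t's chunk in the next window */
    }
}

int main(void) {
    /* WAY_SIZE alignment keeps the address-to-set decomposition stable. */
    uint8_t *buf   = aligned_alloc(WAY_SIZE, 4 * (size_t)WAY_SIZE);
    uint8_t *block = malloc(3 * (size_t)CHUNK);
    if (!buf || !block) return 1;
    memset(block, 7, 3 * (size_t)CHUNK);      /* stand-in for a packed block */
    pack_partitioned(buf, /*thread*/ 1, block, 3 * (size_t)CHUNK);
    free(block);
    free(buf);
    return 0;
}
```

Because thread t only ever writes the t-th quarter of each WAY_SIZE-aligned window, its packed data maps to a fixed quarter of the cache sets, so packed blocks belonging to different threads can never evict one another from the shared cache. This is exactly the property the abstract describes: the partition is enforced by address layout alone, with no hardware support required.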


Cited By

  • Characterizing Small-Scale Matrix Multiplications on ARMv8-based Many-Core Architectures. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS'21). 101-110, May 2021. DOI: 10.1109/IPDPS49936.2021.00019
  • Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+. Journal of Computer Science and Technology 36, 1 (Jan. 2021), 33-43. DOI: 10.1007/s11390-020-0741-6
  • Runtime Adaptive Matrix Multiplication for the SW26010 Many-Core Processor. IEEE Access 8 (2020), 156915-156928. DOI: 10.1109/ACCESS.2020.3019302

Published In

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 4
December 2018, 706 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3284745

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2018
Accepted: 01 August 2018
Revised: 01 August 2018
Received: 01 May 2018
Published in TACO Volume 15, Issue 4


Author Tags

  1. BLAS
  2. GEMM
  3. high-performance computing
  4. linear algebra
  5. optimization

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Australian Research Council
  • Innovative Team Support Program of Hunan
  • National Natural Science Foundation of Hunan
  • National Key Research and Development Program of China
