Article

Optimization of dense matrix multiplication on IBM Cyclops-64: challenges and experiences

Authors: Juan del Cuvillo, Guang R. Gao

Euro-Par'06: Proceedings of the 12th international conference on Parallel Processing

Pages 134 - 144

https://doi.org/10.1007/11823285_14

Published: 28 August 2006

Abstract

This paper presents a study of performance optimization of dense matrix multiplication on the IBM Cyclops-64 (C64) chip architecture. Although much has been published on how to optimize dense matrix applications on shared-memory architectures with multi-level caches, little has been reported on the applicability of the existing methods to the new generation of multi-core architectures like C64. For such architectures, a more economical use of on-chip storage resources appears to discourage the use of caches, while providing tremendous on-chip memory bandwidth per storage area.

This paper presents an in-depth case study of a collection of well-known optimization methods and attempts to re-engineer them to address the new challenges and opportunities provided by this emerging class of multi-core chip architectures. Our study demonstrates that efficiently exploiting the memory hierarchy is the key to achieving good performance. The main contributions of this paper include: (a) identifying a set of key optimizations for C64-like architectures, and (b) exploring a practical order of the optimizations, which yields good performance for applications like matrix multiplication.
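The abstract does not list the specific optimizations studied. As a generic illustration of the kind of well-known memory-hierarchy optimization it refers to, the sketch below shows loop tiling (blocking) applied to dense matrix multiplication. It is written in Python for readability only; the function names `matmul_naive` and `matmul_tiled` and the `tile` parameter are illustrative assumptions, not taken from the paper, and nothing here models the C64 hardware.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same product, computed one tile at a time.

    The three outer loops walk over tile origins; the three inner loops
    stay inside one tile, so the working set per step is bounded by the
    tile size (hypothetical parameter, to be tuned to the fast memory).
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # min(...) handles edge tiles when sizes are not multiples of tile
                for i in range(ii, min(ii + tile, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + tile, m)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + tile, p)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

On a machine with explicitly managed on-chip memory, one would presumably pick the tile size so that the tiles of all three matrices fit in the fast memory at once; that choice is an assumption of this sketch rather than a result from the paper.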



Recommendations

Optimizing symmetric dense matrix-vector multiplication on GPUs
SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

GPUs are excellent accelerators for data parallel applications with regular data access patterns. It is challenging, however, to optimize computations with irregular data access patterns on GPUs. One such computation is the Symmetric Matrix Vector ...

Optimization of quasi-diagonal matrix-vector multiplication on GPU

Sparse matrix-vector multiplication SpMV is of singular importance in sparse linear algebra, which is an important issue in scientific computing and engineering practice. Much effort has been put into accelerating SpMV, and a few parallel solutions have ...

Landing openMP on cyclops-64: an efficient mapping of openMP to a many-core system-on-a-chip
CF '06: Proceedings of the 3rd conference on Computing frontiers

This paper presents our experience mapping OpenMP parallel programming model to the IBM Cyclops-64 (C64) architecture. The C64 employs a many-core-on-a-chip design that integrates processing logic (160 thread units), embedded memory (5MB) and ...


Published In

Euro-Par'06: Proceedings of the 12th international conference on Parallel Processing

August 2006

1221 pages

ISBN: 3540377832

Editors: Wolfgang E. Nagel (ZIH, TU Dresden, Germany), Wolfgang V. Walter (Fakultät Mathematik, Institut für wissenschaftliches Rechnen, TU Dresden, Dresden, Germany), Wolfgang Lehner (Database Technology Group, Technische Universität Dresden, Dresden, Germany)

Publisher: Springer-Verlag, Berlin, Heidelberg


Cited By

Baroudi T, Seghir R, Loechner V (2017). Optimization of Triangular and Banded Matrix Operations Using 2d-Packed Layouts. ACM Transactions on Architecture and Code Optimization 14:4 (1-19). DOI: 10.1145/3162016. Online publication date: 18-Dec-2017
https://dl.acm.org/doi/10.1145/3162016

Garcia E, Gao G, Franke H, Heinecke A, Palem K, Upfal E (2013). Strategies for improving performance and energy efficiency on a many-core. Proceedings of the ACM International Conference on Computing Frontiers (1-4). DOI: 10.1145/2482767.2482779. Online publication date: 14-May-2013
https://dl.acm.org/doi/10.1145/2482767.2482779

Cui H, Yi Q, Xue J, Feng X (2013). Layout-oblivious compiler optimization for matrix computations. ACM Transactions on Architecture and Code Optimization 9:4 (1-20). DOI: 10.1145/2400682.2400694. Online publication date: 20-Jan-2013
https://dl.acm.org/doi/10.1145/2400682.2400694

Garcia E, Venetis I, Khan R, Gao G (2010). Optimized dense matrix multiplication on a many-core architecture. Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II (316-327). DOI: 10.5555/1885276.1885308. Online publication date: 31-Aug-2010
https://dl.acm.org/doi/10.5555/1885276.1885308

Venetis I, Gao G, Johnson G, Trinitis C, Gaydadjiev G, Veidenbaum A (2009). Mapping the LU decomposition on a many-core architecture. Proceedings of the 6th ACM conference on Computing frontiers (71-80). DOI: 10.1145/1531743.1531756. Online publication date: 18-May-2009
https://dl.acm.org/doi/10.1145/1531743.1531756

Gan G, Manzano J (2009). TL-DAE. Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing (80-94). DOI: 10.1007/978-3-642-13374-9_6. Online publication date: 8-Oct-2009
https://dl.acm.org/doi/10.1007/978-3-642-13374-9_6

Yuan N, Zhou Y, Tan G, Zhang J, Fan D (2009). High Performance Matrix Multiplication on Many Cores. Proceedings of the 15th International Euro-Par Conference on Parallel Processing (948-959). DOI: 10.1007/978-3-642-03869-3_87. Online publication date: 23-Aug-2009
https://dl.acm.org/doi/10.1007/978-3-642-03869-3_87

Gan G, Wang X, Manzano J, Gao G (2009). Tile Percolation. Proceedings of the 15th International Euro-Par Conference on Parallel Processing (839-850). DOI: 10.1007/978-3-642-03869-3_78. Online publication date: 23-Aug-2009
https://dl.acm.org/doi/10.1007/978-3-642-03869-3_78

Long G, Fan D, Zhang J, Song F, Yuan N, Lin W (2008). A Performance Model of Dense Matrix Operations on Many-Core Architectures. Proceedings of the 14th international Euro-Par conference on Parallel Processing (120-129). DOI: 10.1007/978-3-540-85451-7_14. Online publication date: 26-Aug-2008
https://dl.acm.org/doi/10.1007/978-3-540-85451-7_14

Zhu W, Sreedhar V, Hu Z, Gao G (2007). Synchronization state buffer. ACM SIGARCH Computer Architecture News 35:2 (35-45). DOI: 10.1145/1273440.1250668. Online publication date: 9-Jun-2007
https://dl.acm.org/doi/10.1145/1273440.1250668
