DOI: 10.1007/11823285_14

Optimization of dense matrix multiplication on IBM Cyclops-64: challenges and experiences

Published: 28 August 2006

Abstract

This paper presents a study of performance optimization of dense matrix multiplication on the IBM Cyclops-64 (C64) chip architecture. Although much has been published on how to optimize dense matrix applications on shared-memory architectures with multi-level caches, little has been reported on the applicability of the existing methods to the new generation of multi-core architectures like C64. For such architectures, a more economical use of on-chip storage resources appears to discourage the use of caches, while providing tremendous on-chip memory bandwidth per storage area.
This paper presents an in-depth case study of a collection of well-known optimization methods and tries to re-engineer them to address the new challenges and opportunities provided by this emerging class of multi-core chip architectures. Our study demonstrates that efficiently exploiting the memory hierarchy is the key to achieving good performance. The main contributions of this paper include: (a) identifying a set of key optimizations for C64-like architectures, and (b) exploring a practical ordering of these optimizations that yields good performance for applications like matrix multiplication.
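
As a rough illustration of what exploiting the memory hierarchy can mean for dense matrix multiplication, the sketch below shows loop tiling (blocking), one widely used optimization of the general kind the abstract alludes to. It is not the paper's implementation. The function names `matmul_naive` and `matmul_tiled`, the `tile` parameter, and the choice of pure Python are all assumptions made for illustration.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_tiled(a, b, tile=4):
    """Blocked variant: work on tile x tile sub-blocks so that the
    data touched by the inner loops stays small enough to be reused
    from fast memory. `tile` is a hypothetical tuning parameter."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):              # block row of C
        for k0 in range(0, m, tile):          # block of the shared dimension
            for j0 in range(0, p, tile):      # block column of C
                for i in range(i0, min(i0 + tile, n)):
                    row_c = c[i]
                    for k in range(k0, min(k0 + tile, m)):
                        aik = a[i][k]         # reused across the whole j loop
                        row_b = b[k]
                        for j in range(j0, min(j0 + tile, p)):
                            row_c[j] += aik * row_b[j]
    return c
```

Both functions compute the same product; only the order in which elements are visited differs. On a machine like C64 one would presumably size the tiles to fit explicitly managed on-chip memory rather than a cache, but that mapping is an assumption here and is not taken from the abstract.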

Published In

Euro-Par'06: Proceedings of the 12th international conference on Parallel Processing
August 2006
1221 pages
ISBN: 3540377832
Editors: Wolfgang E. Nagel, Wolfgang V. Walter, Wolfgang Lehner

Publisher

Springer-Verlag, Berlin, Heidelberg

Qualifiers

  • Article

Cited By

  • (2017) Optimization of Triangular and Banded Matrix Operations Using 2d-Packed Layouts. ACM Transactions on Architecture and Code Optimization 14:4 (1-19). DOI: 10.1145/3162016. Online publication date: 18-Dec-2017
  • (2013) Strategies for improving performance and energy efficiency on a many-core. Proceedings of the ACM International Conference on Computing Frontiers (1-4). DOI: 10.1145/2482767.2482779. Online publication date: 14-May-2013
  • (2013) Layout-oblivious compiler optimization for matrix computations. ACM Transactions on Architecture and Code Optimization 9:4 (1-20). DOI: 10.1145/2400682.2400694. Online publication date: 20-Jan-2013
  • (2010) Optimized dense matrix multiplication on a many-core architecture. Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II (316-327). DOI: 10.5555/1885276.1885308. Online publication date: 31-Aug-2010
  • (2009) Mapping the LU decomposition on a many-core architecture. Proceedings of the 6th ACM conference on Computing frontiers (71-80). DOI: 10.1145/1531743.1531756. Online publication date: 18-May-2009
  • (2009) TL-DAE. Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing (80-94). DOI: 10.1007/978-3-642-13374-9_6. Online publication date: 8-Oct-2009
  • (2009) High Performance Matrix Multiplication on Many Cores. Proceedings of the 15th International Euro-Par Conference on Parallel Processing (948-959). DOI: 10.1007/978-3-642-03869-3_87. Online publication date: 23-Aug-2009
  • (2009) Tile Percolation. Proceedings of the 15th International Euro-Par Conference on Parallel Processing (839-850). DOI: 10.1007/978-3-642-03869-3_78. Online publication date: 23-Aug-2009
  • (2008) A Performance Model of Dense Matrix Operations on Many-Core Architectures. Proceedings of the 14th international Euro-Par conference on Parallel Processing (120-129). DOI: 10.1007/978-3-540-85451-7_14. Online publication date: 26-Aug-2008
  • (2007) Synchronization state buffer. ACM SIGARCH Computer Architecture News 35:2 (35-45). DOI: 10.1145/1273440.1250668. Online publication date: 9-Jun-2007
