A three-dimensional (3D) matrix multiplication algorithm for massively parallel processing systems is presented. The P processors are configured as a "virtual" processing cube with dimensions p1, p2, and p3 proportional to the matrices' dimensions M, N, and K. Each processor performs a single local matrix multiplication of size M/p1 x N/p2 x K/p3. Before the local computation can be carried out, each subcube must receive a single submatrix of A and B. After the single matrix multiplication has completed, K/p3 submatrices of this product must be sent to their respective destination processors and then summed together with the resulting matrix C. The 3D parallel matrix multiplication approach has a factor of P^(1/6) less communication than the 2D parallel algorithms. This algorithm has been implemented on IBM POWERparallel SP2 systems (up to 216 nodes) and has yielded close to the peak performance of the machine. The algorithm has been combined with Winograd's varia...
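The block decomposition described in the abstract can be illustrated with a small serial sketch. This is not the authors' implementation: it only simulates, in one process, the work that each cell of the virtual processor cube would do, and it models no communication at all. It assumes A is M x K, B is K x N, and C is M x N, and that each matrix dimension divides evenly by the matching cube dimension. The names `matmul`, `block`, and `matmul_3d` are hypothetical helpers introduced for illustration.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


def block(mat, r0, r1, c0, c1):
    """Return the submatrix mat[r0:r1, c0:c1] as a new list of lists."""
    return [row[c0:c1] for row in mat[r0:r1]]


def matmul_3d(a, b, p1, p2, p3):
    """Serially simulate a p1 x p2 x p3 processor cube computing C = A * B.

    Assumed layout (not taken from the source): cell (i, j, l) owns block
    (i, l) of A and block (l, j) of B, forms one local product, and the p3
    partial products for each (i, j) are summed into block (i, j) of C.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert m % p1 == 0 and n % p2 == 0 and k % p3 == 0, "uneven blocks"
    bm, bn, bk = m // p1, n // p2, k // p3
    c = [[0] * n for _ in range(m)]
    for i in range(p1):
        for j in range(p2):
            for l in range(p3):
                # The single local multiplication done by cell (i, j, l).
                a_blk = block(a, i * bm, (i + 1) * bm, l * bk, (l + 1) * bk)
                b_blk = block(b, l * bk, (l + 1) * bk, j * bn, (j + 1) * bn)
                partial = matmul(a_blk, b_blk)
                # Stand-in for the reduction along the third cube dimension.
                for r in range(bm):
                    for s in range(bn):
                        c[i * bm + r][j * bn + s] += partial[r][s]
    return c


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul_3d(a, b, 2, 2, 2))  # [[19, 22], [43, 50]]
```

In a real distributed setting the three nested loops would run concurrently, one iteration per processor, and the accumulation into `c` would become a reduction across the processors that share the same (i, j) coordinates.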
Papers by Fred Gustavson