DOI: 10.5555/3571885.3571918

Deinsum: practically I/O optimal multi-linear algebra

Published: 18 November 2022

Abstract

Multilinear algebra kernel performance on modern massively-parallel systems is determined mainly by data movement. However, deriving data movement-optimal distributed schedules for programs with many high-dimensional inputs is a notoriously hard problem. State-of-the-art libraries rely on heuristics and often fall back to suboptimal tensor folding and BLAS calls. We present Deinsum, an automated framework for distributed multilinear algebra computations expressed in Einstein notation, based on rigorous mathematical tools to address this problem. Our framework automatically derives data movement-optimal tiling and generates corresponding distributed schedules, further optimizing the performance of local computations by increasing their arithmetic intensity. To show the benefits of our approach, we test it on two important tensor kernel classes: Matricized Tensor Times Khatri-Rao Products and Tensor Times Matrix chains. We show performance results and scaling on the Piz Daint supercomputer, with up to 19x speedup over state-of-the-art solutions on 512 nodes.
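
To make the two benchmark kernel classes concrete, the minimal NumPy sketch below writes a third-order MTTKRP and a TTM chain in Einstein notation, using arbitrary placeholder shapes. It only illustrates the kind of expression Deinsum targets; it is not Deinsum's own interface and says nothing about the distributed schedules the framework derives.

    import numpy as np

    # Arbitrary placeholder sizes for a third-order tensor and its factors.
    I, J, K, R = 64, 64, 64, 16
    rng = np.random.default_rng(0)

    X = rng.standard_normal((I, J, K))   # third-order input tensor
    B = rng.standard_normal((J, R))      # factor matrix for mode 2
    C = rng.standard_normal((K, R))      # factor matrix for mode 3

    # MTTKRP along mode 1: M[i, r] = sum_{j, k} X[i, j, k] * B[j, r] * C[k, r]
    M = np.einsum('ijk,jr,kr->ir', X, B, C)            # shape (I, R)

    # TTM chain: contract every mode of X with a matrix,
    # Y[p, q, s] = sum_{i, j, k} X[i, j, k] * A1[p, i] * A2[q, j] * A3[s, k]
    P, Q, S = 32, 32, 32
    A1 = rng.standard_normal((P, I))
    A2 = rng.standard_normal((Q, J))
    A3 = rng.standard_normal((S, K))
    Y = np.einsum('ijk,pi,qj,sk->pqs', X, A1, A2, A3)  # shape (P, Q, S)

Per the abstract, einsum-style strings of this form are the starting point from which a framework such as Deinsum derives tiling and distributed schedules; plain NumPy is used here only to show the notation.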

Supplementary Material

MP4 File (SC22_Presentation_Ziogas.mp4)
Presentation at SC '22

Information

Published In

SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2022
1277 pages
ISBN:9784665454445

In-Cooperation

  • IEEE CS

Publisher

IEEE Press


Author Tags

  1. automatic programming
  2. distributed computing
  3. hardware acceleration
  4. linear algebra
  5. performance analysis
  6. tensors

Qualifiers

  • Research-article

Conference

SC '22

Acceptance Rates

Overall acceptance rate: 1,516 of 6,373 submissions (24%)

Article Metrics

  • Total citations: 0
  • Total downloads: 82
  • Downloads (last 12 months): 18
  • Downloads (last 6 weeks): 2

Reflects downloads up to 25 Dec 2024
