
Processor-Aware Cache-Oblivious Algorithms✱

Published: 05 October 2021. DOI: 10.1145/3472456.3472506

Abstract

Frigo et al. proposed an ideal cache model and a recursive technique for designing sequential cache-efficient algorithms in a cache-oblivious fashion. Ballard et al. pointed out that extending this technique to an arbitrary architecture is a fundamental open problem. Ballard et al. also raised a second open question: how to parallelize Strassen’s algorithm exactly and efficiently on an arbitrary number of processors.
We propose a novel way of partitioning a cache-oblivious algorithm to achieve perfect strong scaling on an arbitrary number of processors, even a prime number, within a certain range in a shared-memory setting. Our approach is Processor-Aware but Cache-Oblivious (PACO). We apply the approach to classic rectangular matrix-matrix multiplication (MM) and to Strassen’s algorithm, and we provide an almost exact solution to the open problem of parallelizing Strassen. Although this paper focuses mainly on a homogeneous shared-memory setting, we also discuss extensions of our approach to distributed-memory and heterogeneous settings. Our approach may provide a new perspective on extending the recursive cache-oblivious technique to an arbitrary architecture. Preliminary experiments show that our MM algorithm significantly outperforms Intel MKL’s dgemm. A full version of this paper is hosted on arXiv.
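
For readers less familiar with the recursive technique of Frigo et al. that the abstract refers to, the following is a minimal background sketch, in C, of cache-oblivious divide-and-conquer matrix multiplication: the largest of the three problem dimensions is halved recursively, so every subproblem eventually fits in each cache level even though the code never mentions a cache size. This is an illustration under common assumptions (row-major storage, a hypothetical base-case constant BASE); it is not the paper’s PACO partitioning scheme, which concerns how such a recursion is divided among an arbitrary number of processors.

/* Cache-oblivious C += A * B, where C is m x n, A is m x k, B is k x n.
 * Row-major layout; ldc/lda/ldb are row strides (leading dimensions). */
#include <stddef.h>

#define BASE 32  /* hypothetical small base-case size; the asymptotics do not depend on it */

static void mm_rec(double *C, const double *A, const double *B,
                   size_t m, size_t n, size_t k,
                   size_t ldc, size_t lda, size_t ldb)
{
    if (m <= BASE && n <= BASE && k <= BASE) {
        /* Base case: plain triple loop on a small block. */
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < n; j++) {
                double s = C[i * ldc + j];
                for (size_t t = 0; t < k; t++)
                    s += A[i * lda + t] * B[t * ldb + j];
                C[i * ldc + j] = s;
            }
        return;
    }
    /* Recursive case: halve whichever of m, n, k is largest. */
    if (m >= n && m >= k) {
        size_t m1 = m / 2;
        mm_rec(C,            A,            B, m1,     n, k, ldc, lda, ldb);
        mm_rec(C + m1 * ldc, A + m1 * lda, B, m - m1, n, k, ldc, lda, ldb);
    } else if (n >= k) {
        size_t n1 = n / 2;
        mm_rec(C,      A, B,      m, n1,     k, ldc, lda, ldb);
        mm_rec(C + n1, A, B + n1, m, n - n1, k, ldc, lda, ldb);
    } else {
        size_t k1 = k / 2;
        /* Both halves of the k-dimension update the same C, so they run
         * one after the other; this dependence is part of what makes
         * splitting the work across processors nontrivial. */
        mm_rec(C, A,      B,            m, n, k1,     ldc, lda, ldb);
        mm_rec(C, A + k1, B + k1 * ldb, m, n, k - k1, ldc, lda, ldb);
    }
}

In the sequential ideal-cache model this recursion incurs Θ(mnk/(L√Z)) cache misses for matrices that do not fit in cache, where Z is the cache size and L the line length, without the code ever knowing Z or L. The paper’s question is how to preserve this obliviousness while dividing the work among an arbitrary number of processors, including a prime number of them.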

References

[1]
U. A. Acar, G. E. Blelloch, and R. D. Blumofe. 2000. The Data Locality of Work Stealing. In Proc. of the 12th ACM Annual Symp. on Parallel Algorithms and Architectures (SPAA 2000). ACM, New York, NY, USA, 1–12.
[2]
R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar. 1995. A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development 39 (Sep. 1995), 575–582. Issue 5.
[3]
Alok Aggarwal, Ashok K. Chandra, and Marc Snir. 1990. Communication Complexity of PRAMs. Theor. Comput. Sci. 71, 1 (March 1990), 3–28.
[4]
Grey Ballard, James Demmel, and Andrew Gearhart. 2011. Brief announcement: communication bounds for heterogeneous architectures. In SPAA 2011: Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, San Jose, CA, USA, June 4-6, 2011 (Co-located with FCRC 2011). 257–258.
[5]
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds. In 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, Pittsburgh, PA, USA, June 25-27, 2012. 77–79.
[6]
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Communication-optimal Parallel Algorithm for Strassen’s Matrix Multiplication. In Proceedings of the Twenty-fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’12). ACM, New York, NY, USA, 193–204.
[7]
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds. CoRR abs/1202.3177 (2012).
[8]
Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2011. Minimizing Communication in Numerical Linear Algebra. SIAM J. Matrix Analysis Applications 32, 3 (2011), 866–901.
[9]
Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2013. Graph Expansion and Communication Costs of Fast Matrix Multiplication. J. ACM 59, 6 (Jan. 2013).
[10]
Olivier Beaumont, Brett A. Becker, Ashley M. DeFlumere, Lionel Eyraud-Dubois, Thomas Lambert, and Alexey L. Lastovetsky. 2019. Recent Advances in Matrix Partitioning for Parallel Computing on Heterogeneous Platforms. IEEE Trans. Parallel Distrib. Syst. 30, 1 (2019), 218–229.
[11]
Olivier Beaumont, Lionel Eyraud-Dubois, and Thomas Lambert. 2016. Cuboid Partitioning for Parallel Matrix Multiplication on Heterogeneous Platforms. In Euro-Par 2016: Parallel Processing - 22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24-26, 2016, Proceedings. Springer, 171–182.
[12]
Laszlo A. Belady. 1966. A Study of Replacement Algorithms for a Virtual-Storage Computer. IBM Systems Journal 5, 2 (1966), 78–101.
[13]
Austin R. Benson and Grey Ballard. 2015. A Framework for Practical Parallel Fast Matrix Multiplication. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015). ACM, New York, NY, USA, 42–53.
[14]
Gianfranco Bilardi, Andrea Pietracaprina, Geppino Pucci, Michele Scquizzato, and Francesco Silvestri. 2016. Network-Oblivious Algorithms. J. ACM 63, 1 (2016), 3:1–3:36.
[15]
Guy E. Blelloch, Rezaul Alam Chowdhury, Phillip B. Gibbons, Vijaya Ramachandran, Shimin Chen, and Michael Kozuch. 2008. Provably good multicore cache performance for divide-and-conquer algorithms. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, San Francisco, California, USA, January 20-22, 2008. 501–510.
[16]
Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Harsha Vardhan Simhadri. 2011. Scheduling irregular parallel computations on hierarchical caches. In SPAA 2011: Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, San Jose, CA, USA, June 4-6, 2011 (Co-located with FCRC 2011). 355–366.
[17]
Guy E. Blelloch, Phillip B. Gibbons, and Harsha Vardhan Simhadri. 2010. Low Depth Cache-oblivious Algorithms. In Proceedings of the Twenty-second Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’10). ACM, New York, NY, USA, 189–199.
[18]
Robert D. Blumofe, Matteo Frigo, Chrisopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. 1996. An Analysis of Dag-Consistent Distributed Shared-Memory Algorithms. In SPAA ’96. 297–308.
[19]
Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. 1996. Dag-Consistent Distributed Shared Memory. In Proceedings of the 10th International Parallel Processing Symposium (IPPS ’96). Honolulu, Hawaii, 132–141.
[20]
Robert D. Blumofe and Charles E. Leiserson. 1999. Scheduling Multithreaded Computations by Work Stealing. JACM 46, 5 (Sept. 1999), 720–748.
[21]
Randal E. Bryant and David R. O’Hallaron. 2015. Computer Systems: A Programmer’s Perspective (3rd ed.). Pearson Education, USA.
[22]
Lynn Elliot Cannon. 1969. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.D. Dissertation. Bozeman, MT, USA. AAI7010025.
[23]
Rezaul Chowdhury. 2007. Cache-efficient Algorithms and Data Structures: Theory and Experimental Evaluation. Ph.D. Dissertation. Department of Computer Sciences, The University of Texas at Austin, Austin, Texas.
[24]
R. Chowdhury and V. Ramachandran. 2008. Cache-efficient Dynamic Programming Algorithms for Multicores. In Proceedings of ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 207–216.
[25]
Rezaul Alam Chowdhury and Vijaya Ramachandran. 2006. Cache-Oblivious Dynamic Programming. In Proc. of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’06. 591–600.
[26]
Rezaul Alam Chowdhury, Vijaya Ramachandran, Francesco Silvestri, and Brandon Blakeley. 2013. Oblivious algorithms for multicores and networks of processors. J. Parallel Distrib. Comput. 73, 7 (2013), 911–925.
[27]
Rezaul A. Chowdhury, Francesco Silvestri, Brandon Blakeley, and Vijaya Ramachandran. 2010. Oblivious Algorithms for Multicores and Network of Processors. In Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium. 1–12. https://doi.org/10.1109/IPDPS.2010.5470354
[28]
Richard Cole and Vijaya Ramachandran. 2011. Efficient Resource Oblivious Algorithms for Multicores. CoRR abs/1103.4071 (2011).
[29]
Richard Cole and Vijaya Ramachandran. 2012. Efficient Resource Oblivious Algorithms for Multicores with False Sharing. In 26th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2012, Shanghai, China, May 21-25, 2012. 201–214.
[30]
Richard Cole and Vijaya Ramachandran. 2012. Revisiting the Cache Miss Analysis of Multithreaded Algorithms. In LATIN 2012: Theoretical Informatics - 10th Latin American Symposium, Arequipa, Peru, April 16-20, 2012. Proceedings. 172–183.
[31]
Richard Cole and Vijaya Ramachandran. 2017. Resource Oblivious Sorting on Multicores. ACM Trans. Parallel Comput. 3, 4, Article 23 (March 2017), 31 pages.
[32]
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (third ed.). The MIT Press.
[33]
James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. 2013. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication. In 27th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2013, Cambridge, MA, USA, May 20-24, 2013. 261–272.
[34]
David Dinh, Harsha Vardhan Simhadri, and Yuan Tang. 2016. Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers. In SPAA ’16. Pacific Grove, CA, USA.
[35]
Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. 2012. Cache-Oblivious Algorithms. ACM Trans. Algorithms 8, 1 (Jan. 2012), 4:1–4:22.
[36]
Matteo Frigo and Volker Strumpen. 2009. The Cache Complexity of Multithreaded Cache Oblivious Algorithms. Theory Comput. Syst. 45, 2 (2009), 203–233.
[37]
Z. Galil and R. Giancarlo. 1989. Speeding up Dynamic Programming with Applications to Molecular Biology. Theoretical Computer Science 64 (1989), 107–118.
[38]
Z. Galil and K. Park. 1994. Parallel Algorithms for Dynamic Programming Recurrences with More Than O(1) Dependency. J. Parallel and Distrib. Comput. 21 (1994), 213–222.
[39]
Dror Irony, Sivan Toledo, and Alexander Tiskin. 2004. Communication Lower Bounds for Distributed-memory Matrix Multiplication. J. Parallel Distrib. Comput. 64, 9 (Sept. 2004), 1017–1026.
[40]
Charles E. Leiserson. [n.d.]. Performance Engineering of Software Systems.
[41]
Charles E. Leiserson, Neil C. Thompson, Joel S. Emer, Bradley C. Kuszmaul, Butler W. Lampson, Daniel Sanchez, and Tao B. Schardl. 2020. There’s plenty of room at the top: What will drive computer performance after Moore’s Law. Science 368 (June 2020). Issue 6495.
[42]
Benjamin Lipshitz, Grey Ballard, James Demmel, and Oded Schwartz. 2012. Communication-avoiding parallel strassen: implementation and performance. In SC Conference on High Performance Computing Networking, Storage and Analysis, SC ’12, Salt Lake City, UT, USA - November 11 - 15, 2012. 101.
[43]
W. F. McColl and A. Tiskin. 1999. Memory-Efficient Matrix Multiplication in the BSP Model. Algorithmica 24, 3 (1999), 287–297.
[44]
Hiroshi Nagamochi and Yuusuke Abe. 2007. An approximation algorithm for dissecting a rectangle into rectangles with specified areas. Discrete Applied Mathematics 155, 4 (2007), 523–537.
[45]
Daniel Dominic Sleator and Robert Endre Tarjan. 1985. Amortized Efficiency of List Update and Paging Rules. Commun. ACM 28, 2 (1985), 202–208.
[46]
Edgar Solomonik and James Demmel. 2011. Communication-optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms. In Proceedings of the 17th International Conference on Parallel Processing - Volume Part II (Euro-Par ’11). Springer-Verlag, Berlin, Heidelberg, 90–109.
[47]
Daniel Spoonhower, Guy E. Blelloch, Phillip B. Gibbons, and Robert Harper. 2009. Beyond Nested Parallelism: Tight Bounds on Work-stealing Overheads for Parallel Futures. In Proceedings of the Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures (SPAA ’09). ACM, New York, NY, USA, 91–100.
[48]
Volker Strassen. 1969. Gaussian Elimination is not Optimal. Numer. Math. 14, 3 (1969), 354–356.
[49]
Yuan Tang, Ronghui You, Haibin Kan, Jesmin Jahan Tithi, Pramod Ganapathi, and Rezaul A. Chowdhury. 2015. Cache-Oblivious Wavefront: Improving Parallelism of Recursive Dynamic Programming Algorithms without Losing Cache-Efficiency. In PPoPP ’15. San Francisco, CA, USA.

Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN:9781450390682
DOI:10.1145/3472456

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. cache-oblivious
  2. perfect strong scaling
  3. processor-aware
  4. processor-oblivious
  5. shared-memory architecture

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • National Science Foundation of China
  • Shanghai Natural Science Funding

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
