
Processor-Aware Cache-Oblivious Algorithms✱

Published: 05 October 2021. DOI: 10.1145/3472456.3472506

Abstract

Frigo et al. proposed an ideal cache model and a recursive technique for designing sequential cache-efficient algorithms in a cache-oblivious fashion. Ballard et al. pointed out that extending this technique to an arbitrary architecture is a fundamental open problem. Ballard et al. also raised a second open question: how to parallelize Strassen’s algorithm exactly and efficiently on an arbitrary number of processors.
We propose a novel way of partitioning a cache-oblivious algorithm to achieve perfect strong scaling on an arbitrary number of processors, even a prime number, within a certain range in a shared-memory setting. Our approach is Processor-Aware but Cache-Oblivious (PACO). We apply the approach to classic rectangular matrix-matrix multiplication (MM) and to Strassen’s algorithm, and we provide an almost exact solution to the open problem of parallelizing Strassen. Although this paper focuses mainly on a homogeneous shared-memory setting, we also discuss extensions of our approach to distributed-memory and heterogeneous settings. Our approach may provide a new perspective on extending the recursive cache-oblivious technique to an arbitrary architecture. Preliminary experiments show that our MM algorithm significantly outperforms Intel MKL’s dgemm. A full version of this paper is hosted on arXiv.
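
For readers less familiar with the recursive technique of Frigo et al. that the abstract refers to, the following is a minimal background sketch, in C, of cache-oblivious divide-and-conquer matrix multiplication: the largest of the three problem dimensions is halved recursively, so every subproblem eventually fits in each cache level even though the code never mentions a cache size. This is an illustration under common assumptions (row-major storage, a hypothetical base-case constant BASE); it is not the paper’s PACO partitioning scheme, which concerns how such a recursion is divided among an arbitrary number of processors.

/* Cache-oblivious C += A * B, where C is m x n, A is m x k, B is k x n.
 * Row-major layout; ldc/lda/ldb are row strides (leading dimensions). */
#include <stddef.h>

#define BASE 32  /* hypothetical small base-case size; the asymptotics do not depend on it */

static void mm_rec(double *C, const double *A, const double *B,
                   size_t m, size_t n, size_t k,
                   size_t ldc, size_t lda, size_t ldb)
{
    if (m <= BASE && n <= BASE && k <= BASE) {
        /* Base case: plain triple loop on a small block. */
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < n; j++) {
                double s = C[i * ldc + j];
                for (size_t t = 0; t < k; t++)
                    s += A[i * lda + t] * B[t * ldb + j];
                C[i * ldc + j] = s;
            }
        return;
    }
    /* Recursive case: halve whichever of m, n, k is largest. */
    if (m >= n && m >= k) {
        size_t m1 = m / 2;
        mm_rec(C,            A,            B, m1,     n, k, ldc, lda, ldb);
        mm_rec(C + m1 * ldc, A + m1 * lda, B, m - m1, n, k, ldc, lda, ldb);
    } else if (n >= k) {
        size_t n1 = n / 2;
        mm_rec(C,      A, B,      m, n1,     k, ldc, lda, ldb);
        mm_rec(C + n1, A, B + n1, m, n - n1, k, ldc, lda, ldb);
    } else {
        size_t k1 = k / 2;
        /* Both halves of the k-dimension update the same C, so they run
         * one after the other; this dependence is part of what makes
         * splitting the work across processors nontrivial. */
        mm_rec(C, A,      B,            m, n, k1,     ldc, lda, ldb);
        mm_rec(C, A + k1, B + k1 * ldb, m, n, k - k1, ldc, lda, ldb);
    }
}

In the sequential ideal-cache model this recursion incurs Θ(mnk/(L√Z)) cache misses for matrices that do not fit in cache, where Z is the cache size and L the line length, without the code ever knowing Z or L. The paper’s question is how to preserve this obliviousness while dividing the work among an arbitrary number of processors, including a prime number of them.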

References

[1]
U. A. Acar, G. E. Blelloch, and R. D. Blumofe. 2000. The Data Locality of Work Stealing. In Proc. of the 12th ACM Annual Symp. on Parallel Algorithms and Architectures (SPAA 2000). ACM, New York, NY, USA, 1–12.
[2]
R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar. 1995. A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development 39 (Sep. 1995), 575–582. Issue 5.
[3]
Alok Aggarwal, Ashok K. Chandra, and Marc Snir. 1990. Communication Complexity of PRAMs. Theor. Comput. Sci. 71, 1 (March 1990), 3–28.
[4]
Grey Ballard, James Demmel, and Andrew Gearhart. 2011. Brief announcement: communication bounds for heterogeneous architectures. In SPAA 2011: Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, San Jose, CA, USA, June 4-6, 2011 (Co-located with FCRC 2011). 257–258.
[5]
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds. In 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, Pittsburgh, PA, USA, June 25-27, 2012. 77–79.
[6]
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Communication-optimal Parallel Algorithm for Strassen’s Matrix Multiplication. In Proceedings of the Twenty-fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’12). ACM, New York, NY, USA, 193–204.
[7]
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds. CoRR abs/1202.3177 (2012).
[8]
Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2011. Minimizing Communication in Numerical Linear Algebra. SIAM J. Matrix Analysis Applications 32, 3 (2011), 866–901.
[9]
Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2013. Graph Expansion and Communication Costs of Fast Matrix Multiplication. J. ACM 59, 6 (Jan. 2013).
[10]
Olivier Beaumont, Brett A. Becker, Ashley M. DeFlumere, Lionel Eyraud-Dubois, Thomas Lambert, and Alexey L. Lastovetsky. 2019. Recent Advances in Matrix Partitioning for Parallel Computing on Heterogeneous Platforms. IEEE Trans. Parallel Distrib. Syst. 30, 1 (2019), 218–229.
[11]
Olivier Beaumont, Lionel Eyraud-Dubois, and Thomas Lambert. 2016. Cuboid Partitioning for Parallel Matrix Multiplication on Heterogeneous Platforms. In Euro-Par 2016: Parallel Processing - 22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24-26, 2016, Proceedings. Springer, 171–182.
[12]
Laszlo A. Belady. 1966. A Study of Replacement Algorithms for a Virtual-Storage Computer. IBM Systems Journal 5, 2 (1966), 78–101.
[13]
Austin R. Benson and Grey Ballard. 2015. A Framework for Practical Parallel Fast Matrix Multiplication. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015). ACM, New York, NY, USA, 42–53.
[14]
Gianfranco Bilardi, Andrea Pietracaprina, Geppino Pucci, Michele Scquizzato, and Francesco Silvestri. 2016. Network-Oblivious Algorithms. J. ACM 63, 1 (2016), 3:1–3:36.
[15]
Guy E. Blelloch, Rezaul Alam Chowdhury, Phillip B. Gibbons, Vijaya Ramachandran, Shimin Chen, and Michael Kozuch. 2008. Provably good multicore cache performance for divide-and-conquer algorithms. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, San Francisco, California, USA, January 20-22, 2008. 501–510.
[16]
Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Harsha Vardhan Simhadri. 2011. Scheduling irregular parallel computations on hierarchical caches. In SPAA 2011: Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, San Jose, CA, USA, June 4-6, 2011 (Co-located with FCRC 2011). 355–366.
[17]
Guy E. Blelloch, Phillip B. Gibbons, and Harsha Vardhan Simhadri. 2010. Low Depth Cache-oblivious Algorithms. In Proceedings of the Twenty-second Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’10). ACM, New York, NY, USA, 189–199.
[18]
Robert D. Blumofe, Matteo Frigo, Chrisopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. 1996. An Analysis of Dag-Consistent Distributed Shared-Memory Algorithms. In SPAA ’96. 297–308.
[19]
Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. 1996. Dag-Consistent Distributed Shared Memory. In Proceedings of the 10th International Parallel Processing Symposium (IPPS ’96). Honolulu, Hawaii, 132–141.
[20]
Robert D. Blumofe and Charles E. Leiserson. 1999. Scheduling Multithreaded Computations by Work Stealing. JACM 46, 5 (Sept. 1999), 720–748.
[21]
Randal E. Bryant and David R. O’Hallaron. 2015. Computer Systems: A Programmer’s Perspective (3rd ed.). Pearson Education, USA.
[22]
Lynn Elliot Cannon. 1969. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.D. Dissertation. Bozeman, MT, USA. AAI7010025.
[23]
Rezaul Chowdhury. 2007. Cache-efficient Algorithms and Data Structures: Theory and Experimental Evaluation. Ph.D. Dissertation. Department of Computer Sciences, The University of Texas at Austin, Austin, Texas.
[24]
R. Chowdhury and V. Ramachandran. 2008. Cache-efficient Dynamic Programming Algorithms for Multicores. In Proceedings of ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 207–216.
[25]
Rezaul Alam Chowdhury and Vijaya Ramachandran. 2006. Cache-Oblivious Dynamic Programming. In Proc. of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’06. 591–600.
[26]
Rezaul Alam Chowdhury, Vijaya Ramachandran, Francesco Silvestri, and Brandon Blakeley. 2013. Oblivious algorithms for multicores and networks of processors. J. Parallel Distrib. Comput. 73, 7 (2013), 911–925.
[27]
Rezaul A. Chowdhury, Francesco Silvestri, Brandon Blakeley, and Vijaya Ramachandran. 2010. Oblivious Algorithms for Multicores and Network of Processors. In Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium. 1–12. https://doi.org/10.1109/IPDPS.2010.5470354
[28]
Richard Cole and Vijaya Ramachandran. 2011. Efficient Resource Oblivious Algorithms for Multicores. CoRR abs/1103.4071 (2011).
[29]
Richard Cole and Vijaya Ramachandran. 2012. Efficient Resource Oblivious Algorithms for Multicores with False Sharing. In 26th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2012, Shanghai, China, May 21-25, 2012. 201–214.
[30]
Richard Cole and Vijaya Ramachandran. 2012. Revisiting the Cache Miss Analysis of Multithreaded Algorithms. In LATIN 2012: Theoretical Informatics - 10th Latin American Symposium, Arequipa, Peru, April 16-20, 2012. Proceedings. 172–183.
[31]
Richard Cole and Vijaya Ramachandran. 2017. Resource Oblivious Sorting on Multicores. ACM Trans. Parallel Comput. 3, 4, Article 23 (March 2017), 31 pages.
[32]
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (third ed.). The MIT Press.
[33]
James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. 2013. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication. In 27th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2013, Cambridge, MA, USA, May 20-24, 2013. 261–272.
[34]
David Dinh, Harsha Vardhan Simhadri, and Yuan Tang. 2016. Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers. In SPAA ’16. Pacific Grove, CA, USA.
[35]
Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. 2012. Cache-Oblivious Algorithms. ACM Trans. Algorithms 8, 1 (Jan. 2012), 4:1–4:22.
[36]
Matteo Frigo and Volker Strumpen. 2009. The Cache Complexity of Multithreaded Cache Oblivious Algorithms. Theory Comput. Syst. 45, 2 (2009), 203–233.
[37]
Z. Galil and R. Giancarlo. 1989. Speeding up Dynamic Programming with Applications to Molecular Biology. Theoretical Computer Science 64 (1989), 107–118.
[38]
Z. Galil and K. Park. 1994. Parallel Algorithms for Dynamic Programming Recurrences with More Than O(1) Dependency. J. Parallel and Distrib. Comput. 21 (1994), 213–222.
[39]
Dror Irony, Sivan Toledo, and Alexander Tiskin. 2004. Communication Lower Bounds for Distributed-memory Matrix Multiplication. J. Parallel Distrib. Comput. 64, 9 (Sept. 2004), 1017–1026.
[40]
Charles E. Leiserson. [n.d.]. Performance Engineering of Software Systems.
[41]
Charles E. Leiserson, Neil C. Thompson, Joel S. Emer, Bradley C. Kuszmaul, Butler W. Lampson, Daniel Sanchez, and Tao B. Schardl. 2020. There’s plenty of room at the top: What will drive computer performance after Moore’s Law. Science 368 (June 2020). Issue 6495.
[42]
Benjamin Lipshitz, Grey Ballard, James Demmel, and Oded Schwartz. 2012. Communication-avoiding parallel strassen: implementation and performance. In SC Conference on High Performance Computing Networking, Storage and Analysis, SC ’12, Salt Lake City, UT, USA - November 11 - 15, 2012. 101.
[43]
W. F. McColl and A. Tiskin. 1999. Memory-Efficient Matrix Multiplication in the BSP Model. Algorithmica 24, 3 (1999), 287–297.
[44]
Hiroshi Nagamochi and Yuusuke Abe. 2007. An approximation algorithm for dissecting a rectangle into rectangles with specified areas. Discrete Applied Mathematics 155, 4 (2007), 523–537.
[45]
Daniel Dominic Sleator and Robert Endre Tarjan. 1985. Amortized Efficiency of List Update and Paging Rules. Commun. ACM 28, 2 (1985), 202–208.
[46]
Edgar Solomonik and James Demmel. 2011. Communication-optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms. In Proceedings of the 17th International Conference on Parallel Processing - Volume Part II (Euro-Par ’11). Springer-Verlag, Berlin, Heidelberg, 90–109.
[47]
Daniel Spoonhower, Guy E. Blelloch, Phillip B. Gibbons, and Robert Harper. 2009. Beyond Nested Parallelism: Tight Bounds on Work-stealing Overheads for Parallel Futures. In Proceedings of the Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures (SPAA ’09). ACM, New York, NY, USA, 91–100.
[48]
Volker Strassen. 1969. Gaussian Elimination is not Optimal. Numer. Math. 14, 3 (1969), 354–356.
[49]
Yuan Tang, Ronghui You, Haibin Kan, Jesmin Jahan Tithi, Pramod Ganapathi, and Rezaul A. Chowdhury. 2015. Cache-Oblivious Wavefront: Improving Parallelism of Recursive Dynamic Programming Algorithms without Losing Cache-Efficiency. In PPoPP ’15. San Francisco, CA, USA.

Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN:9781450390682
DOI:10.1145/3472456

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. cache-oblivious
  2. perfect strong scaling
  3. processor-aware
  4. processor-oblivious
  5. shared-memory architecture

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • National Science Foundation of China
  • Shanghai Natural Science Funding

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
