Article

The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation

Authors:

Rezaul Alam Chowdhury,

Vijaya Ramachandran

SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures

Pages 71 - 80

https://doi.org/10.1145/1248377.1248392

Published: 09 June 2007

Abstract

The Gaussian Elimination Paradigm (GEP) was introduced by the authors in [6] to represent the triply-nested loop computation that occurs in several important algorithms including Gaussian elimination without pivoting and Floyd-Warshall's all-pairs shortest paths algorithm. An efficient cache-oblivious algorithm for these instances of GEP was presented in [6]. In this paper we establish several important properties of this cache-oblivious framework, and extend the framework to solve GEP in its full generality within the same time and I/O bounds. We then analyze a parallel implementation of the framework and its caching performance for both shared and distributed caches. We present extensive experimental results for both in-core and out-of-core performance of our algorithms. We consider both sequential and parallel implementations of our algorithms, and compare them with finely-tuned cache-aware BLAS code for matrix multiplication and Gaussian elimination without pivoting. Our results indicate that cache-oblivious GEP offers an attractive tradeoff between efficiency and portability.
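The triply-nested loop the abstract refers to can be illustrated with a minimal Python sketch. It assumes the update takes the form c[i][j] = f(c[i][j], c[i][k], c[k][j], c[k][k]) applied over a selected set of (i, j, k) triples; this page does not spell that form out, so treat it as an assumption. The names `gep`, `fw_update`, `recursive_gep` and `in_sigma` are hypothetical. The recursive variant splits the matrix into quadrants and makes a forward and a backward pass over the two halves of the k range, in the spirit of the cache-oblivious algorithm of [6]. It is not the authors' code, and the quadrant order shown is an assumption that is only claimed to work for instances such as Floyd-Warshall.

```python
def gep(c, f, in_sigma=lambda i, j, k: True):
    """Iterative GEP: the triply-nested loop, with k outermost."""
    n = len(c)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if in_sigma(i, j, k):
                    c[i][j] = f(c[i][j], c[i][k], c[k][j], c[k][k])
    return c


def fw_update(x, u, v, w):
    """Floyd-Warshall instance: relax the path i -> k -> j (w is unused)."""
    return min(x, u + v)


def _recurse(c, f, i0, j0, k0, size):
    """Update the size x size block at (i0, j0) for k in [k0, k0 + size)."""
    if size == 1:
        c[i0][j0] = f(c[i0][j0], c[i0][k0], c[k0][j0], c[k0][k0])
        return
    h = size // 2
    # Forward pass: first half of the k range, quadrants 11, 12, 21, 22.
    for di, dj in ((0, 0), (0, h), (h, 0), (h, h)):
        _recurse(c, f, i0 + di, j0 + dj, k0, h)
    # Backward pass: second half of the k range, quadrants 22, 21, 12, 11.
    for di, dj in ((h, h), (h, 0), (0, h), (0, 0)):
        _recurse(c, f, i0 + di, j0 + dj, k0 + h, h)


def recursive_gep(c, f):
    """Recursive (quadrant-based) sketch; assumes every triple is selected."""
    n = len(c)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes n is a power of two")
    _recurse(c, f, 0, 0, 0, n)
    return c


if __name__ == "__main__":
    INF = float("inf")
    graph = [
        [0, 5, 9, INF],
        [INF, 0, 2, INF],
        [INF, INF, 0, 1],
        [INF, INF, INF, 0],
    ]
    # Both variants should report the same all-pairs shortest distances.
    print(gep([row[:] for row in graph], fw_update))
    print(recursive_gep([row[:] for row in graph], fw_update))
```

The recursion never consults a block size or cache parameter, which is what makes the approach cache-oblivious: sub-blocks eventually fit in whatever cache exists. The power-of-two restriction is a simplification made for this sketch only.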

References

[1] G. Blelloch and P. Gibbons. Effectively sharing a cache among threads. Proc. SPAA, pp. 235--244, 2004.

[2] R. Blumofe, M. Frigo, C. Joerg, C. Leiserson, and K. Randall. An analysis of DAG-consistent distributed shared-memory algorithms. Proc. SPAA, pp. 297--308, 1996.

[3] E. Chan, E. Quintana-Orti, G. Quintana-Orti, and R. van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. Proc. SPAA, 2007.

[4] S. Chatterjee, A. Lebeck, P. Patnala, and M. Thotethodi. Recursive array layouts and fast parallel matrix multiplication. Proc. SPAA, pp. 222--231, 1999.

[5] C. Cherng and R. Ladner. Cache efficient simple dynamic programming. Proc. Intl. Conf. on the Analysis of Algorithms, pp. 49--58, 2005.

[6] R. Chowdhury and V. Ramachandran. Cache-oblivious dynamic programming. Proc. SODA, pp. 591--600, 2006.

[7] R. Chowdhury and V. Ramachandran. The cache-oblivious Gaussian elimination paradigm: theoretical framework and experimental evaluation. CS TR-06-04, UT Austin, 2006.

[8] P. D'Alberto and A. Nicolau. R-Kleene: a high-performance divide-and-conquer algorithm for the all-pair shortest path for densely connected networks. Algorithmica, 47(2):203--213, 2007.

[9] R. Dementiev, L. Kettner, and P. Sanders. STXXL: Standard template library for XXL data sets. Proc. ESA, pp. 640--651, 2005.

[10] R. Floyd. Algorithm 97 (SHORTEST PATH). CACM, 5(6):345, 1962.

[11] M. Frigo, C. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. Proc. FOCS, pp. 285--297, 1999.

[12] M. Frigo, C. Leiserson, and K. Randall. The implementation of the Cilk-5 multithreaded language. Proc. PLDI, pp. 212--223, 1998.

[13] M. Frigo and V. Strumpen. The cache-complexity of multithreaded cache-oblivious algorithms. Proc. SPAA, pp. 271--280, 2006.

[14] J. Gunnels, F. Gustavson, G. Henry, and R. van de Geijn. FLAME: Formal linear algebra methods environment. ACM TOMS, 27(4):422--455, 2001.

[15] K. Goto. GotoBLAS, 2005. http://www.tacc.utexas.edu/resources/software.

[16] K. Iverson. A Programming Language. Wiley, 1962.

[17] MAP3147NC/NP MAP3735NC/NP MAP3367NC/NP disk drives product/maintenance manual. http://www.fujitsu.com/downloads/COMP/fcpa/hdd/.

[18] S. Muchnick. Advanced Compiler Design & Implementation. Morgan Kaufmann, 1997.

[19] S. Pan, C. Cherng, K. Dick, and R. Ladner. Algorithms to take advantage of hardware prefetching. Proc. ALENEX, pp. 91--98, 2007.

[20] J. Park, M. Penner, and V. Prasanna. Optimizing graph algorithms for improved cache performance. IEEE TPDS, 15(9):769--782, 2004.

[21] J. Seward and N. Nethercote. Valgrind (debugging/profiling tool for x86-Linux programs). http://valgrind.kde.org/

[22] S. Warshall. A theorem on boolean matrices. JACM, 9(1):11--12, 1962.

[23] D. Womble, D. Greenberg, S. Wheat, and R. Riesen. Beyond core: making parallel computer I/O practical. Proc. DAGS/PC Symp., pp. 56--63, 1993.

[24] M. Wolf and M. Lam. A data locality optimizing algorithm. Proc. PLDI, pp. 30--44, 1991.

[25] R. Whaley, A. Petitet, and J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3--35, 2001.

[26] K. Yotov, T. Roeder, K. Pingali, J. Gunnels, and F. Gustavson. An experimental comparison of cache-oblivious and cache-aware programs. Proc. SPAA, 2007.

Cited By

Simhadri H, Blelloch G, Fineman J, Gibbons P, Kyrola A (2016). Experimental Analysis of Space-Bounded Schedulers. ACM Transactions on Parallel Computing 3:1 (1-27). DOI: 10.1145/2938389. Online publication date: 28-Jun-2016.
https://dl.acm.org/doi/10.1145/2938389
Ma L, Agrawal K, Chamberlain R (2014). A memory access model for highly-threaded many-core architectures. Future Generation Computer Systems 30:C (202-215). DOI: 10.5555/2747903.2748175. Online publication date: 1-Jan-2014.
https://dl.acm.org/doi/10.5555/2747903.2748175
Simhadri H, Blelloch G, Fineman J, Gibbons P, Kyrola A, Blelloch G, Sanders P (2014). Experimental analysis of space-bounded schedulers. Proceedings of the 26th ACM symposium on Parallelism in algorithms and architectures (30-41). DOI: 10.1145/2612669.2612678. Online publication date: 23-Jun-2014.
https://dl.acm.org/doi/10.1145/2612669.2612678

Published In

SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures

June 2007

376 pages

ISBN:9781595936677

DOI:10.1145/1248377

General Chair: Phillip B. Gibbons, Intel Research, USA

Program Chair: Christian Scheideler, Technische Universität München, Germany

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SPAA07

Sponsor:

SPAA07: 19th ACM Symposium on Parallelism in Algorithms and Architectures

June 9 - 11, 2007

San Diego, California, USA

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%

Bibliometrics

Article Metrics

Total Citations: 25
Total Downloads: 550
Downloads (Last 12 months): 10
Downloads (Last 6 weeks): 0

Reflects downloads up to 04 Sep 2024

Cited By

Ma L, Agrawal K, Chamberlain R (2014). A memory access model for highly-threaded many-core architectures. Future Generation Computer Systems 30 (202-215). DOI: 10.1016/j.future.2013.06.020. Online publication date: Jan-2014.
https://doi.org/10.1016/j.future.2013.06.020
Solomonik E, Buluc A, Demmel J (2013). Minimizing Communication in All-Pairs Shortest Paths. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (548-559). DOI: 10.1109/IPDPS.2013.111. Online publication date: 20-May-2013.
https://dl.acm.org/doi/10.1109/IPDPS.2013.111
Ajwani D, Sitchinava N (2013). Empirical Evaluation of the Parallel Distribution Sweeping Framework on Multicore Architectures. Algorithms – ESA 2013 (25-36). DOI: 10.1007/978-3-642-40450-4_3. Online publication date: 2013.
https://doi.org/10.1007/978-3-642-40450-4_3
Ma L, Agrawal K, Chamberlain R (2012). A Memory Access Model for Highly-threaded Many-core Architectures. 2012 IEEE 18th International Conference on Parallel and Distributed Systems (339-347). DOI: 10.1109/ICPADS.2012.54. Online publication date: Dec-2012.
https://doi.org/10.1109/ICPADS.2012.54
Blelloch G, Fineman J, Gibbons P, Simhadri H, Meyer auf der Heide F, Rajaraman R (2011). Scheduling irregular parallel computations on hierarchical caches. Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures (355-366). DOI: 10.1145/1989493.1989553. Online publication date: 4-Jun-2011.
https://dl.acm.org/doi/10.1145/1989493.1989553
Ajwani D, Sitchinava N, Zeh N (2011). I/O-Optimal Distribution Sweeping on Private-Cache Chip Multiprocessors. 2011 IEEE International Parallel & Distributed Processing Symposium (1114-1123). DOI: 10.1109/IPDPS.2011.106. Online publication date: May-2011.
https://doi.org/10.1109/IPDPS.2011.106
Ajwani D, Sitchinava N, Zeh N (2010). Geometric algorithms for private-cache chip multiprocessors. Proceedings of the 18th annual European conference on Algorithms: Part II (75-86). DOI: 10.5555/1882123.1882133. Online publication date: 6-Sep-2010.
https://dl.acm.org/doi/10.5555/1882123.1882133
