DOI: 10.1145/1248377.1248392

The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation

Published: 09 June 2007

Abstract

The Gaussian Elimination Paradigm (GEP) was introduced by the authors in [6] to represent the triply-nested loop computation that occurs in several important algorithms including Gaussian elimination without pivoting and Floyd-Warshall's all-pairs shortest paths algorithm. An efficient cache-oblivious algorithm for these instances of GEP was presented in [6]. In this paper we establish several important properties of this cache-oblivious framework, and extend the framework to solve GEP in its full generality within the same time and I/O bounds. We then analyze a parallel implementation of the framework and its caching performance for both shared and distributed caches. We present extensive experimental results for both in-core and out-of-core performance of our algorithms. We consider both sequential and parallel implementations of our algorithms, and compare them with finely-tuned cache-aware BLAS code for matrix multiplication and Gaussian elimination without pivoting. Our results indicate that cache-oblivious GEP offers an attractive tradeoff between efficiency and portability.
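
For illustration only (this sketch is not taken from the paper), the triply-nested loop that GEP abstracts might look as follows in Python. It assumes the usual formulation in which entry c[i][j] is updated from c[i][j], c[i][k], c[k][j] and c[k][k] for a chosen set of (i, j, k) triples. The names `gep`, `f` and `in_sigma` are hypothetical, and this is the plain iterative loop, not the cache-oblivious algorithm of [6].

```python
import math


def gep(c, f, in_sigma):
    """Plain iterative GEP loop (k outermost), updating the square matrix c in place.

    f(x, u, v, w) -> new value for c[i][j], given x = c[i][j], u = c[i][k],
                     v = c[k][j], w = c[k][k]           (hypothetical signature)
    in_sigma(i, j, k) -> True if the triple (i, j, k) should be updated
    """
    n = len(c)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if in_sigma(i, j, k):
                    c[i][j] = f(c[i][j], c[i][k], c[k][j], c[k][k])
    return c


def floyd_warshall(dist):
    """All-pairs shortest paths as a GEP instance: every triple is updated."""
    return gep([row[:] for row in dist],
               lambda x, u, v, w: min(x, u + v),
               lambda i, j, k: True)


def gaussian_elimination_no_pivot(a):
    """Gaussian elimination without pivoting as a GEP instance.

    Only entries with i > k and j > k are updated, so the upper triangle of the
    result holds U; entries below the diagonal are left as they were, not zeroed.
    Assumes every pivot c[k][k] encountered is nonzero.
    """
    return gep([row[:] for row in a],
               lambda x, u, v, w: x - (u / w) * v,
               lambda i, j, k: i > k and j > k)


if __name__ == "__main__":
    inf = math.inf
    print(floyd_warshall([[0, 3, inf], [inf, 0, 1], [2, inf, 0]]))
    print(gaussian_elimination_no_pivot([[2.0, 1.0], [4.0, 5.0]]))
```

The two instances differ only in the update function and in the set of triples they touch, which is the sense in which a single framework can cover both problems.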

References

[1]
G. Blelloch and P. Gibbons. Effectively sharing a cache among threads. Proc. SPAA, pp. 235--244, 2004.
[2]
R. Blumofe, M. Frigo, C. Joerg, C. Leiserson, and K. Randall. An analysis of DAG-consistent distributed shared-memory algorithms. Proc. SPAA, pp. 297--308, 1996.
[3]
E. Chan, E. Quintana-Orti, G. Quintana-Orti, and R. van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. Proc. SPAA, 2007.
[4]
S. Chatterjee, A. Lebeck, P. Patnala, and M. Thotethodi. Recursive array layouts and fast parallel matrix multiplication. Proc. SPAA, pp. 222--231, 1999.
[5]
C. Cherng and R. Ladner. Cache efficient simple dynamic programming. In Proc. Intl. Conf. on the Analysis of Algorithms, pp. 49--58, 2005.
[6]
R. Chowdhury and V. Ramachandran. Cache-oblivious dynamic programming. Proc. SODA, pp. 591--600, 2006.
[7]
R. Chowdhury and V. Ramachandran. The cache-oblivious Gaussian elimination paradigm: theoretical framework and experimental evaluation. CS TR-06-04, UT Austin, 2006.
[8]
P. D'Alberto and A. Nicolau. R-Kleene: a high-performance divide-and-conquer algorithm for the all-pair shortest path for densely connected networks. Algorithmica, 47(2):203--213, 2007.
[9]
R. Dementiev, L. Kettner, and P. Sanders. STXXL: Standard template library for XXL data sets. Proc. ESA, pp. 640--651, 2005.
[10]
R. Floyd. Algorithm 97 (SHORTEST PATH). CACM, 5(6):345, 1962.
[11]
M. Frigo, C. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. Proc. FOCS, pp. 285--297, 1999.
[12]
M. Frigo, C. Leiserson, and K. Randall. The implementation of the Cilk-5 multithreaded language. Proc. PLDI, pp. 212--223, 1998.
[13]
M. Frigo and V. Strumpen. The cache-complexity of multithreaded cache-oblivious algorithms. Proc. SPAA, pp. 271--280, 2006.
[14]
J. Gunnels, F. Gustavson, G. Henry, and R. van de Geijn. FLAME: Formal linear algebra methods environment. ACM TOMS, 27(4):422--455, 2001.
[15]
K. Goto. GotoBLAS, 2005. http://www.tacc.utexas.edu/resources/software.
[16]
K. Iverson. A Programming Language. Wiley, 1962.
[17]
MAP3147NC/NP MAP3735NC/NP MAP3367NC/NP disk drives product/maintenance manual. http://www.fujitsu.com/downloads/COMP/fcpa/hdd/.
[18]
S. Muchnick. Advanced Compiler Design & Implementation. Morgan Kaufmann, 1997.
[19]
S. Pan, C. Cherng, K. Dick, and R. Ladner. Algorithms to take advantage of hardware prefetching. Proc. ALENEX, pp. 91--98, 2007.
[20]
J. Park, M. Penner and V. Prasanna. Optimizing graph algorithms for improved cache performance. IEEE TPDS, 15(9):769--782, 2004.
[21]
J. Seward and N. Nethercote. Valgrind (debugging/profiling tool for x86-Linux programs). http://valgrind.kde.org/
[22]
S. Warshall. A theorem on boolean matrices. JACM, 9(1):11--12, 1962.
[23]
D. Womble, D. Greenberg, S. Wheat, and R. Riesen. Beyond core: making parallel computer I/O practical. Proc. DAGS/PC Symp., pp. 56--63, 1993.
[24]
M. Wolf and M. Lam. A data locality optimizing algorithm. Proc. PLDI, pp. 30--44, 1991.
[25]
R. Whaley, A. Petitet, and J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3--35, 2001.
[26]
K. Yotov, T. Roeder, K. Pingali, J. Gunnels, and F. Gustavson. An experimental comparison of cache-oblivious and cache-aware programs. Proc. SPAA, 2007.

Published In

SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
June 2007
376 pages
ISBN:9781595936677
DOI:10.1145/1248377
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Gaussian elimination
  2. all-pairs shortest path
  3. cache-oblivious algorithm
  4. matrix multiplication
  5. tiling

Qualifiers

  • Article

Conference

SPAA07

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%

Article Metrics

  • Downloads (Last 12 months): 10
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 04 Sep 2024

Cited By

  • (2016) Experimental Analysis of Space-Bounded Schedulers. ACM Transactions on Parallel Computing, 3:1 (1-27). DOI: 10.1145/2938389. Online publication date: 28-Jun-2016
  • (2014) A memory access model for highly-threaded many-core architectures. Future Generation Computer Systems, 30:C (202-215). DOI: 10.5555/2747903.2748175. Online publication date: 1-Jan-2014
  • (2014) Experimental analysis of space-bounded schedulers. Proceedings of the 26th ACM symposium on Parallelism in algorithms and architectures (30-41). DOI: 10.1145/2612669.2612678. Online publication date: 23-Jun-2014
  • (2014) A memory access model for highly-threaded many-core architectures. Future Generation Computer Systems, 30 (202-215). DOI: 10.1016/j.future.2013.06.020. Online publication date: Jan-2014
  • (2013) Minimizing Communication in All-Pairs Shortest Paths. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (548-559). DOI: 10.1109/IPDPS.2013.111. Online publication date: 20-May-2013
  • (2013) Empirical Evaluation of the Parallel Distribution Sweeping Framework on Multicore Architectures. Algorithms – ESA 2013 (25-36). DOI: 10.1007/978-3-642-40450-4_3. Online publication date: 2013
  • (2012) A Memory Access Model for Highly-threaded Many-core Architectures. 2012 IEEE 18th International Conference on Parallel and Distributed Systems (339-347). DOI: 10.1109/ICPADS.2012.54. Online publication date: Dec-2012
  • (2011) Scheduling irregular parallel computations on hierarchical caches. Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures (355-366). DOI: 10.1145/1989493.1989553. Online publication date: 4-Jun-2011
  • (2011) I/O-Optimal Distribution Sweeping on Private-Cache Chip Multiprocessors. 2011 IEEE International Parallel & Distributed Processing Symposium (1114-1123). DOI: 10.1109/IPDPS.2011.106. Online publication date: May-2011
  • (2010) Geometric algorithms for private-cache chip multiprocessors. Proceedings of the 18th annual European conference on Algorithms: Part II (75-86). DOI: 10.5555/1882123.1882133. Online publication date: 6-Sep-2010
