research-article

Optimizing Graph Algorithms for Improved Cache Performance

Authors: Joon-Sang Park, Michael Penner, Viktor K. Prasanna

IEEE Transactions on Parallel and Distributed Systems, Volume 15, Issue 9

Pages 769 - 782

https://doi.org/10.1109/TPDS.2004.44

Published: 01 September 2004

Abstract

In this paper, we develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. First, we present a cache-oblivious implementation of the Floyd-Warshall algorithm for the all-pairs shortest paths problem, obtained by relaxing some dependencies in the iterative version. We show that this implementation achieves the lower bound on processor-memory traffic of $\Omega(N^3/\sqrt{C})$, where $N$ and $C$ are the problem size and cache size, respectively. Experimental results show that this cache-oblivious implementation achieves more than a sixfold improvement in real execution time over the iterative implementation with the usual row-major data layout, on three state-of-the-art architectures. Second, we address Dijkstra's algorithm for the single-source shortest paths problem and Prim's algorithm for the minimum spanning tree problem. For these algorithms, we demonstrate up to a twofold improvement in real execution time by using a simple cache-friendly graph representation, namely adjacency arrays. Finally, we address the matching algorithm for bipartite graphs. We show performance improvements of two to three times in real execution time by first having the algorithm work on subproblems to generate a suboptimal solution and then solving the whole problem using that suboptimal solution as a starting point. Experimental results are shown for the Pentium III, UltraSPARC III, Alpha 21264, and MIPS R12000 machines.
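To make the idea of a recursive, cache-oblivious Floyd-Warshall concrete, the following is a minimal Python sketch of one commonly used quadrant recursion, compared against the plain iterative triple loop. It is an illustration under stated assumptions and is not taken from the article text: the function names, the `base` cutoff parameter, the requirement that the number of vertices be a power of two, and the absence of negative cycles are all assumptions made here. The sketch shows only the traversal order; it does not model the data layouts or measure the cache behavior discussed in the abstract.

```python
INF = float("inf")


def floyd_warshall_iterative(dist):
    """Standard k-i-j triple loop on a copy of the distance matrix."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via = d[i][k] + d[k][j]
                if via < d[i][j]:
                    d[i][j] = via
    return d


def _fwr(d, a, b, c, n, base):
    """Relax tile A with tiles B and C: A[i][j] = min(A[i][j], B[i][k] + C[k][j]).

    a, b, c are (row, col) offsets of n-by-n tiles inside the matrix d.
    """
    (ar, ac), (br, bc), (cr, cc) = a, b, c
    if n <= base:
        # Base case: ordinary iterative relaxation restricted to the tiles.
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    via = d[br + i][bc + k] + d[cr + k][cc + j]
                    if via < d[ar + i][ac + j]:
                        d[ar + i][ac + j] = via
        return

    h = n // 2

    def q(t, qi, qj):
        # Offset of quadrant (qi, qj) of tile t.
        return (t[0] + qi * h, t[1] + qj * h)

    # Intermediate vertices from the first half of the k range.
    _fwr(d, q(a, 0, 0), q(b, 0, 0), q(c, 0, 0), h, base)
    _fwr(d, q(a, 0, 1), q(b, 0, 0), q(c, 0, 1), h, base)
    _fwr(d, q(a, 1, 0), q(b, 1, 0), q(c, 0, 0), h, base)
    _fwr(d, q(a, 1, 1), q(b, 1, 0), q(c, 0, 1), h, base)
    # Intermediate vertices from the second half, visited in reverse order.
    _fwr(d, q(a, 1, 1), q(b, 1, 1), q(c, 1, 1), h, base)
    _fwr(d, q(a, 1, 0), q(b, 1, 1), q(c, 1, 0), h, base)
    _fwr(d, q(a, 0, 1), q(b, 0, 1), q(c, 1, 1), h, base)
    _fwr(d, q(a, 0, 0), q(b, 0, 1), q(c, 1, 0), h, base)


def floyd_warshall_recursive(dist, base=2):
    """Recursive variant; assumes len(dist) is a power of two (assumption of this sketch)."""
    n = len(dist)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("this sketch requires a power-of-two number of vertices")
    if base < 1:
        raise ValueError("base must be at least 1")
    d = [row[:] for row in dist]
    origin = (0, 0)
    _fwr(d, origin, origin, origin, n, base)
    return d


if __name__ == "__main__":
    # Hypothetical 4-vertex example graph; INF marks a missing edge.
    graph = [
        [0, 5, INF, 10],
        [INF, 0, 2, INF],
        [INF, INF, 0, 1],
        [4, INF, INF, 0],
    ]
    print(floyd_warshall_recursive(graph, base=1))
```

The recursion never refers to a cache size, which is what makes a scheme of this shape cache-oblivious: the `base` cutoff here only limits Python call overhead and could be set to 1 without changing the result.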


Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 15, Issue 9

September 2004, 96 pages

ISSN: 1045-9219

Copyright © 2004.

Publisher: IEEE Press


Cited By

Wang K, Lin X, Qin L, Zhang W, Zhang Y (2022). Accelerated butterfly counting with vertex priority on bipartite graphs. The VLDB Journal: The International Journal on Very Large Data Bases, 32:2 (257-281). Online publication date: 16-May-2022.
https://dl.acm.org/doi/10.1007/s00778-022-00746-0
Zhu L, Hua Q, Jin H (2021). Communication Avoiding All-Pairs Shortest Paths Algorithm for Sparse Graphs. Proceedings of the 50th International Conference on Parallel Processing (1-10). Online publication date: 9-Aug-2021.
https://dl.acm.org/doi/10.1145/3472456.3472524
Vandierendonck H, Ayguadé E, Hwu W, Badia R, Hofstee H (2020). Graptor. Proceedings of the 34th ACM International Conference on Supercomputing (1-13). Online publication date: 29-Jun-2020.
https://dl.acm.org/doi/10.1145/3392717.3392753
Endo T (2020). Integrating Cache Oblivious Approach with Modern Processor Architecture. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (123-130). Online publication date: 15-Jan-2020.
https://dl.acm.org/doi/10.1145/3368474.3368477
Bender M, Chowdhury R, Das R, Johnson R, Kuszmaul W, Lincoln A, Liu Q, Lynch J, Xu H, Scheideler C, Spear M (2020). Closing the Gap Between Cache-oblivious and Cache-adaptive Analysis. Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (63-73). Online publication date: 6-Jul-2020.
https://dl.acm.org/doi/10.1145/3350755.3400274
Wang K, Lin X, Qin L, Zhang W, Zhang Y (2019). Vertex priority based butterfly counting for large-scale bipartite networks. Proceedings of the VLDB Endowment, 12:10 (1139-1152). Online publication date: 1-Jun-2019.
https://dl.acm.org/doi/10.14778/3339490.3339497
Yuan Y, Lian X, Chen L, Wang G, Yu J, Wang Y, Ma Y (2019). GCache: Neighborhood-Guided Graph Caching in a Distributed Environment. IEEE Transactions on Parallel and Distributed Systems, 30:11 (2463-2477). Online publication date: 1-Nov-2019.
https://dl.acm.org/doi/10.1109/TPDS.2019.2915300
Chowdhury R, Ramachandran V (2018). Cache-Oblivious Buffer Heap and Cache-Efficient Computation of Shortest Paths in Graphs. ACM Transactions on Algorithms, 14:1 (1-33). Online publication date: 3-Jan-2018.
https://dl.acm.org/doi/10.1145/3147172
Goglin B, Jacob B (2016). Exposing the Locality of Heterogeneous Memory Architectures to HPC Applications. Proceedings of the Second International Symposium on Memory Systems (30-39). Online publication date: 3-Oct-2016.
https://dl.acm.org/doi/10.1145/2989081.2989115
Wei H, Yu J, Lu C, Lin X, Özcan F, Koutrika G, Madden S (2016). Speedup Graph Processing by Graph Ordering. Proceedings of the 2016 International Conference on Management of Data (1813-1828). Online publication date: 26-Jun-2016.
https://dl.acm.org/doi/10.1145/2882903.2915220
