Achieving Scalable Locality with Time Skewing

Published: 01 June 2002

Abstract

Microprocessor speed has been growing exponentially faster than memory system speed in the recent past. This paper explores the long-term implications of this trend. We define scalable locality, which measures our ability to apply ever faster processors to increasingly large problems (just as scalable parallelism measures our ability to apply more numerous processors to larger problems). We provide an algorithm called time skewing that derives an execution order and storage mapping to produce any desired degree of locality, for certain programs that can be made to exhibit scalable locality. Our approach is unusual in that it derives the transformation from the algorithm's dataflow (a fundamental characteristic of the algorithm) instead of searching a space of transformations of the execution order and array layout used by the programmer (artifacts of the expression of the algorithm). We provide empirical results for data sets using L2 cache, main memory, and virtual memory.
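
As an illustration only, the following Python sketch shows the kind of execution-order change the abstract describes, applied to a three-point one-dimensional stencil. The stencil, the function names, the `tile_width` parameter, and the choice to keep every time level in memory are assumptions made for this example and are not taken from the paper. In particular, the storage mapping that the paper derives is not reproduced here.

```python
def stencil_reference(initial, steps):
    """Plain order: one full sweep over the array per time step."""
    grid = list(initial)
    n = len(grid)
    for _ in range(steps):
        nxt = list(grid)  # boundary cells 0 and n-1 keep their values
        for i in range(1, n - 1):
            nxt[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
        grid = nxt
    return grid


def stencil_time_skewed(initial, steps, tile_width):
    """Same computation, visited tile by tile in the skewed coordinate j = i + t.

    Assumption for this sketch: hist[t][i] stores every time level, so each
    value is written once and no storage compression is attempted.
    """
    n = len(initial)
    # Every level starts as a copy of the input, which also fixes the boundaries.
    hist = [list(initial) for _ in range(steps + 1)]
    j_min = 2                # smallest j: cell i = 1 at time t = 1
    j_max = (n - 2) + steps  # largest j: cell i = n - 2 at time t = steps
    for tile_start in range(j_min, j_max + 1, tile_width):
        tile_end = min(tile_start + tile_width, j_max + 1)
        # All time steps for this tile run before the next tile starts.
        for t in range(1, steps + 1):
            prev = hist[t - 1]
            for j in range(tile_start, tile_end):
                i = j - t  # undo the skew to recover the array index
                if 1 <= i <= n - 2:
                    hist[t][i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3.0
    return hist[steps]
```

In the skewed coordinate, the three inputs of the value at (t, j) sit at (t - 1, j - 2), (t - 1, j - 1), and (t - 1, j), so none of them lies in a later tile. That is why each tile can run through all of its time steps before the next tile begins, and why a narrower `tile_width` reduces the number of distinct cells touched between reuses.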

Published In

International Journal of Parallel Programming, Volume 30, Issue 3
June 2002
72 pages

Publisher

Kluwer Academic Publishers

United States

Author Tags

  1. compute balance
  2. machine balance
  3. memory locality
  4. scalable locality
  5. storage transformation

Qualifiers

  • Article
