
Achieving Scalable Locality with Time Skewing

Author:

David Wonnacott

International Journal of Parallel Programming, Volume 30, Issue 3

Pages 181 - 221

https://doi.org/10.1023/A:1015460304860

Published: 01 June 2002

Abstract

Microprocessor speed has been growing exponentially faster than memory system speed in the recent past. This paper explores the long term implications of this trend. We define scalable locality, which measures our ability to apply ever faster processors to increasingly large problems (just as scalable parallelism measures our ability to apply more numerous processors to larger problems). We provide an algorithm called time skewing that derives an execution order and storage mapping to produce any desired degree of locality, for certain programs that can be made to exhibit scalable locality. Our approach is unusual in that it derives the transformation from the algorithm's dataflow (a fundamental characteristic of the algorithm) instead of searching a space of transformations of the execution order and array layout used by the programmer (artifacts of the expression of the algorithm). We provide empirical results for data sets using L2 cache, main memory, and virtual memory.
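As a rough illustration of the kind of execution order the abstract describes, the sketch below applies a hand-written time-skewed tiling to a one-dimensional three-point averaging stencil. It is not the paper's algorithm, which derives the transformation from dataflow; the stencil itself, the function names `stencil_naive` and `stencil_time_skewed`, the `tile` parameter, and the two-buffer storage are all assumptions chosen for illustration. Each tile covers a band of the skewed coordinate `s = i + t` and runs every time step for that band before moving on, so values are reused while they are still likely to be in fast memory.

```python
def stencil_naive(a, steps):
    """Reference order: sweep the whole array once per time step."""
    cur = list(a)
    for _ in range(steps):
        nxt = list(cur)  # boundary cells are carried over unchanged
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def stencil_time_skewed(a, steps, tile):
    """Same computation, reordered into tiles of the skewed index s = i + t.

    Illustrative assumption: point (t, i) reads (t-1, i-1), (t-1, i) and
    (t-1, i+1), whose skewed indices are all <= i + t. Visiting tiles in
    increasing s, and time steps in increasing t inside a tile, therefore
    respects every dependence, including reuse of the two buffers.
    """
    n = len(a)
    buf = [list(a), list(a)]  # buf[t % 2] holds values for time step t
    # Skewed indices of interior points range from 2 to steps + n - 2.
    for s_lo in range(0, steps + n - 1, tile):
        s_hi = s_lo + tile - 1
        for t in range(1, steps + 1):
            src, dst = buf[(t - 1) % 2], buf[t % 2]
            lo = max(1, s_lo - t)        # clip the band to interior cells
            hi = min(n - 2, s_hi - t)
            for i in range(lo, hi + 1):
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return buf[steps % 2]


if __name__ == "__main__":
    data = [float((7 * i) % 11) for i in range(40)]
    print(stencil_time_skewed(data, steps=10, tile=8) == stencil_naive(data, 10))
```

In this sketch a smaller `tile` means a smaller working set per tile, and a larger `steps` means more reuse of each value loaded into that working set. This loosely mirrors the abstract's idea of trading additional time steps for a higher degree of locality.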



Published In

International Journal of Parallel Programming Volume 30, Issue 3

June 2002

72 pages

ISSN:0885-7458

Copyright © 2002 Plenum Publishing Corporation.

Publisher

Kluwer Academic Publishers

United States
