
Exploiting Locality for Irregular Scientific Codes

Published: 01 July 2006

Abstract

    Irregular scientific codes experience poor cache performance due to their irregular memory access patterns. In this paper, we present two new locality-improving techniques for irregular scientific codes. Our techniques exploit geometric structures hidden in data access patterns and computation structures. Our new data reordering technique (Gpart) finds the graph structure within data accesses and applies hierarchical clustering. Quality partitions are constructed quickly by clustering multiple neighboring nodes, giving priority to high-degree nodes, and repeating a few passes. Overhead is kept low by clustering multiple nodes in each pass and considering only edges between partitions. Our new computation reordering technique (Z-Sort) treats the values of index arrays as coordinates and reorders the corresponding computations in Z-curve order. Applied to dense inputs, Z-Sort achieves performance close to that of data reordering combined with other computation reorderings, but without the overhead of data reordering. Experiments on irregular scientific codes for a variety of meshes show that these locality optimization techniques are effective for both sequential and parallelized codes, improving performance by 60-87 percent. Gpart comes within 1-2 percent of the performance of more sophisticated partitioning algorithms, at one third of the overhead. Z-Sort yields a performance improvement of 64 percent for dense inputs, comparable to data reordering combined with computation reordering.
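    The Z-curve ordering at the heart of Z-Sort can be sketched as follows. This is an illustrative reconstruction from the abstract only, not the paper's implementation: the function names and the 2D integer-coordinate setup are assumptions, and real irregular codes would work on index arrays over 2D/3D mesh coordinates.

    ```python
    # Sketch of Z-curve (Morton-order) computation reordering, the idea behind
    # Z-Sort as described in the abstract. Illustrative only; names are hypothetical.

    def morton_key(x, y, bits=16):
        """Interleave the bits of integer coordinates x and y into a Z-curve key."""
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)       # x contributes even bit positions
            key |= ((y >> i) & 1) << (2 * i + 1)   # y contributes odd bit positions
        return key

    def zsort_interactions(edges, coords):
        """Reorder edge computations (pairs of node indices) in Z-curve order.

        edges  -- list of (i, j) node-index pairs (the index arrays)
        coords -- coords[n] = (x, y) integer grid coordinates of node n
        """
        # Key each interaction by the Z-curve position of its first node;
        # iterating in this order visits nearby nodes together, which is
        # what improves cache locality without relocating the data itself.
        return sorted(edges, key=lambda e: morton_key(*coords[e[0]]))
    ```

    Because only the iteration order changes, this avoids the cost of physically relocating node data, which matches the abstract's claim that Z-Sort sidesteps the overhead of data reordering.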




    Published In

    IEEE Transactions on Parallel and Distributed Systems  Volume 17, Issue 7
    July 2006
    143 pages

    Publisher

    IEEE Press


    Author Tags

    1. compiler optimization
    2. cache memories
    3. computation reordering
    4. data reordering
    5. inspector/executor

    Qualifiers

    • Research-article


    Cited By

    • (2022) Vectorizing SpMV by Exploiting Dynamic Regular Patterns. Proc. 51st Int'l Conf. Parallel Processing, pp. 1-12. doi:10.1145/3545008.3545042
    • (2022) Autoscheduling for Sparse Tensor Algebra with an Asymptotic Cost Model. Proc. 43rd ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 269-285. doi:10.1145/3519939.3523442
    • (2020) Optimizing the Linear Fascicle Evaluation Algorithm for Multi-core and Many-core Systems. ACM Trans. Parallel Computing, vol. 7, no. 4, pp. 1-45. doi:10.1145/3418075
    • (2020) A Conflict-free Scheduler for High-performance Graph Processing on Multi-pipeline FPGAs. ACM Trans. Architecture and Code Optimization, vol. 17, no. 2, pp. 1-26. doi:10.1145/3390523
    • (2019) Efficient Parameterized Algorithms for Data Packing. Proc. ACM on Programming Languages, vol. 3, no. POPL, pp. 1-28. doi:10.1145/3290366
    • (2019) Spatiotemporal Graph and Hypergraph Partitioning Models for Sparse Matrix-Vector Multiplication on Many-Core Architectures. IEEE Trans. Parallel and Distributed Systems, vol. 30, no. 2, pp. 445-458. doi:10.1109/TPDS.2018.2864729
    • (2018) Enhancing Computation-to-Core Assignment with Physical Location Information. ACM SIGPLAN Notices, vol. 53, no. 4, pp. 312-327. doi:10.1145/3296979.3192386
    • (2018) swSpTRSV. ACM SIGPLAN Notices, vol. 53, no. 1, pp. 338-353. doi:10.1145/3200691.3178513
    • (2018) Making Pull-Based Graph Processing Performant. ACM SIGPLAN Notices, vol. 53, no. 1, pp. 246-260. doi:10.1145/3200691.3178506
    • (2018) Enhancing Computation-to-Core Assignment with Physical Location Information. Proc. 39th ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 312-327. doi:10.1145/3192366.3192386
