
An integrated compile-time/run-time software distributed shared memory system

Published: 01 September 1996

    Abstract

    On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into message passing programs, but efficient execution is limited to those programs for which precise analysis can be carried out. Shared memory is easier to program than message passing, and its domain is not constrained by the limitations of parallelizing compilers, but it lags in performance. Our goal is to close that performance gap while retaining the benefits of shared memory. In other words, our goal is (1) to make shared memory as efficient as message passing, whether hand-coded or compiler-generated, (2) to retain its ease of programming, and (3) to retain the broader class of applications it supports.

    To this end, we have designed and implemented an integrated compile-time and run-time software DSM system. The programming model remains identical to that of the original pure run-time DSM system, and no user intervention is required to obtain the benefits of our system. The compiler computes data access patterns for the individual processors. It then performs a source-to-source transformation, inserting calls into the program that inform the run-time system of the computed data access patterns. The run-time system uses this information to aggregate communication, to combine data and synchronization into a single message, to eliminate consistency overhead, and to replace global synchronization with point-to-point synchronization wherever possible.

    We extended the ParaScope programming environment to perform the required analysis, and we augmented the TreadMarks run-time DSM library to take advantage of the analysis. We used six Fortran programs to assess the performance benefits: Jacobi, 3D-FFT, Integer Sort (IS), Shallow, Gauss, and Modified Gram-Schmidt, each with two different data set sizes.

    The experiments were run on an 8-node IBM SP/2 using user-space communication. Compiler optimization in conjunction with the augmented run-time system achieves substantial execution time improvements over the base TreadMarks system, ranging from 4% to 59% on 8 processors. Relative to message passing implementations of the same applications, the integrated compile-time/run-time system is 0-29% slower, while the base run-time system is 5-212% slower. For the five programs that XHPF could parallelize (all except IS), the execution times achieved by the compiler-optimized shared memory programs are within 9% of those achieved by XHPF.
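    As a rough illustration of one of the run-time optimizations listed in the abstract (aggregating communication once the compiler has computed a processor's access pattern), the sketch below contrasts on-demand page faults with a single up-front fetch. It is a toy model in Python, not TreadMarks code: `ToyDSMNode`, `validate_section`, `sweep`, and the 512-element page size are hypothetical, and every fetch is counted as one request/reply exchange regardless of how many processors actually hold modifications to the pages.

```python
"""Toy model of compile-time access information feeding a page-based DSM.

Hypothetical names throughout; this is NOT the TreadMarks interface.
Each fetch is counted as one request/reply exchange, regardless of how
many processors actually hold modifications to the requested pages.
"""
from dataclasses import dataclass, field

PAGE_SIZE = 512  # array elements per page (assumed for illustration)


@dataclass
class ToyDSMNode:
    """One processor's view of shared pages (hypothetical model)."""
    valid: set = field(default_factory=set)  # pages with an up-to-date local copy
    messages: int = 0                        # request/reply exchanges issued

    def invalidate_all(self) -> None:
        # Assume a synchronization point just invalidated every page,
        # e.g. because other processors wrote to all of them.
        self.valid.clear()

    def read(self, index: int) -> None:
        # Base run-time behaviour: the first access to an invalid page
        # faults and costs one exchange for that page alone.
        page = index // PAGE_SIZE
        if page not in self.valid:
            self.messages += 1
            self.valid.add(page)

    def validate_section(self, lo: int, hi: int) -> None:
        # Hypothetical compiler-inserted call: the compiler has determined
        # that this processor reads elements lo..hi-1, so the run-time
        # fetches every invalid page in that range in one aggregated request.
        if hi <= lo:
            return
        pages = set(range(lo // PAGE_SIZE, (hi - 1) // PAGE_SIZE + 1))
        missing = pages - self.valid
        if missing:
            self.messages += 1
            self.valid |= missing


def sweep(node: ToyDSMNode, lo: int, hi: int, use_compiler_info: bool) -> int:
    """Read elements lo..hi-1 after a sync point; return exchanges issued."""
    node.invalidate_all()
    if use_compiler_info:
        node.validate_section(lo, hi)  # inserted by the (hypothetical) compiler pass
    for i in range(lo, hi):
        node.read(i)
    return node.messages


if __name__ == "__main__":
    base = sweep(ToyDSMNode(), 0, 4096, use_compiler_info=False)
    opt = sweep(ToyDSMNode(), 0, 4096, use_compiler_info=True)
    print(base, opt)  # prints "8 1"
```

    In this model, knowing the access pattern in advance collapses the eight per-page faults of a full sweep into one aggregated request. The savings in a real system also depend on how many processors must supply modifications, which the sketch deliberately ignores.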

    Cited By

    • (2015) Probabilistic Analysis of Load-Imbalanced Parallel Applications with Partially Eliminated Barriers. Journal of the Operations Research Society of Japan 58:2 (149-164). DOI: 10.15807/jorsj.58.149. Online publication date: 2015
    • (2015) HYDRA. Revised Selected Papers of the 28th International Workshop on Languages and Compilers for Parallel Computing, Volume 9519 (140-155). DOI: 10.1007/978-3-319-29778-1_9. Online publication date: 9-Sep-2015
    • (2014) Probabilistic Analysis of Barrier Eliminating Method Applied to Load-Imbalanced Parallel Application. Parallel Processing and Applied Mathematics (196-206). DOI: 10.1007/978-3-642-55195-6_18. Online publication date: 8-May-2014

    Information & Contributors

    Information

    Published In

    ACM SIGOPS Operating Systems Review, Volume 30, Issue 5
    Dec. 1996
    273 pages
    ISSN:0163-5980
    DOI:10.1145/248208
    • ACM Conferences
      ASPLOS VII: Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
      October 1996
      290 pages
      ISBN:0897917677
      DOI:10.1145/237090
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 September 1996
    Published in SIGOPS Volume 30, Issue 5

    Qualifiers

    • Article

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 82
    • Downloads (last 6 weeks): 26
    Reflects downloads up to 11 Aug 2024

    Citations

    • (2013) Automatic Scaling of OpenMP Beyond Shared Memory. Languages and Compilers for Parallel Computing (1-15). DOI: 10.1007/978-3-642-36036-7_1. Online publication date: 2013
    • (2007) A Characterization of Shared Data Access Patterns in UPC Programs. Languages and Compilers for Parallel Computing (111-125). DOI: 10.1007/978-3-540-72521-3_9. Online publication date: 2007
    • (2006) Dyn-MPI. Journal of Parallel and Distributed Computing 66:6 (822-838). DOI: 10.1016/j.jpdc.2006.02.002. Online publication date: 1-Jun-2006
    • (2001) A Synthesis of Memory Mechanisms for Distributed Architectures. Proceedings of the 15th International Conference on Supercomputing (13-22). DOI: 10.1145/377792.377799. Online publication date: 17-Jun-2001
    • (1999) Ace. ACM Transactions on Computer Systems 17:3 (202-248). DOI: 10.1145/320656.320657. Online publication date: 1-Aug-1999
    • (2024) TrackFM: Far-out Compiler Support for a Far Memory World. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (401-419). DOI: 10.1145/3617232.3624856. Online publication date: 27-Apr-2024
    • (2019) CoSMIX. Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference (555-570). DOI: 10.5555/3358807.3358854. Online publication date: 10-Jul-2019
