Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
article
Free access

Algorithms for scalable synchronization on shared-memory multiprocessors

Published: 01 February 1991 Publication History
  • Get Citation Alerts
  • Abstract

    Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become markedly more pronounced as applications scale. We argue that this problem is not fundamental, and that one can in fact construct busy-wait synchronization algorithms that induce no memory or interconnect contention. The key to these algorithms is for every processor to spin on separate locally-accessible flag variables, and for some other processor to terminate the spin with a single remote write operation at an appropriate time. Flag variables may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local portion of physically distributed shared memory.
    We present a new scalable algorithm for spin locks that generates 0(1) remote references per lock acquisition, independent of the number of processors attempting to acquire the lock. Our algorithm provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than a swap-with-memory instruction. We also present a new scalable barrier algorithm that generates 0(1) remote references per processor reaching the barrier, and observe that two previously-known barriers can likewise be cast in a form that spins only on locally-accessible flag variables. None of these barrier algorithms requires hardware support beyond the usual atomicity of memory reads and writes.
    We compare the performance of our scalable algorithms with other software approaches to busy-wait synchronization on both a Sequent Symmetry and a BBN Butterfly. Our principal conclusion is that contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors. The existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides a case against so-called “dance hall” architectures, in which shared memory locations are equally far from all processors. —From the Authors' Abstract

    References

    [1]
    ADVE, S. V., ANn HILL, M. D. Weak ordering--A new definition. In Proceedings of the International Symposium on Computer Architecture (May 1990), 2-14.
    [2]
    AGARWAL, A., AND CHEmAN, M. Adaptive backoff synchronization techniques. In Proceedings of the International Symposium on Computer Architecture (May 1989), 396 406.
    [3]
    AGARWAL, A., SIMONI, R., HENNESSY, J., AND HOROWITZ, M. An evaluation of directory schemes for cache coherence. In Proceedings of the International Symposium on Computer Architecture (June 1988), 280-289.
    [4]
    ALMASI, G. S., ANn GOTTLIEB, A. Highly Parallel Computing. Benjamin/Cummings, Redwood City, Calif., 1989.
    [5]
    ANDERSON, T.E. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distributed Syst. 1, I (Jan. 1990), 6-16.
    [6]
    ANDEaSON, T. E., L~ZOWSKA, E. D., AND LEVY, H. M. The performance implications of thread management alternatives for shared-memory multiprocessors IEEE Trans. Comput. 38, 12 (Dec. 1989), 1631-1644.
    [7]
    A~CnmALn, J. AND BAER, J.-L. An economical solution to the cache coherence problem. In Proceedings of the International Symposium on Computer Architecture (1984), 355-362.
    [8]
    BBN Laboratories. Butterfly parallel processor overview. Tech. Rep. 6148, version 1, BBN Laboratories, Cambridge, Mass, Mar. 1986.
    [9]
    BROOKS, E. D., III. The butterfly barrier Int. J Parallel Program. 15, 4 (1986), 295-307.
    [10]
    BZOOKS, E. D., III. The shared memory hypercube. Parallel Comput. 6 (1988), 235-245.
    [11]
    DAVIS, H, AND HENNESSY, J. Characterizing the synchronization behavior of parallel programs. In Proceedings of the ACM Conference on Parallel Programming: Experzence with Apphcat~ons, Languages and Systems (July 1988), 198-211.
    [12]
    DIJKSTRA, E Solution of a problem in concurrent programming and control. Commun. ACM. 8, 9 (Sept. 1965), 569.
    [13]
    GHARACHORLOO, K., LENOSKI, D., LAUDON, J., GmBONS, P., GUPTA, A., AND HENNESSY, J. L. Memory consistency and event ordering in scalable shared.memory multiprocessors. In Proceedlngs of the International Symposium on Computer Architecture (May 1990), 15-26.
    [14]
    GOODMAN, J.R, VERNON, M. K., AND WOES% P.J. Efficient synchronization primitives for large-scale cache-coherent multiprocessors. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (Apr. 1989), 64-75
    [15]
    GOOVMAN, J R., AND WOEST, P. J. The Wisconsin Multlcube: A new large-scale cache coherent multiprocessor. In Proceedings of the International Symposium on Computer Architecture (May 1988), 422-431.
    [16]
    GOTTLIEB, A. Scalability, combining and the NYU ultracomputer. Ohio State University Parallel Computing Workshop, Mar. 1990. Invited Lecture.
    [17]
    GOTTLIEB, A., GRISHMAN, R., KRUSKAL, C. P., McAuLIFFE, K. P., RUDOLPH, L., AND SNIR, M. The NYU Ultracomputer--Designing an MIMD shared memory parallel computer. IEEE Trans. Comput. C-32, 2 (Feb. 1983), 175-189.
    [18]
    GRAUNI<E, G., AND THAKKAR, S. Synchronization algorithms for shared-memory multiprocessors. Computer 23, 6 (June 1990), 60-69
    [19]
    HENSGEN, D, FINKEL, R, AND MANBER, U. Two algorithms for barrier synchroniztion. Int. J. Parallel Program. 17, 1 (1988) 1-17.
    [20]
    HERLIHY, M. A methodology for implementing highly concurrent data structures. In Proceed~ngs of the Second A CM Symposium on Principles and Practice of Parallel Programming (Mar 1990), 197-206.
    [21]
    JAYASIMHA, D. N. Distributed synchronizers. In Proceedings of the 1988 Internatwnal Conference on Parallel Processing (Aug. 1988), 23-27.
    [22]
    KRUSKAL, C.P., RUDOLPH, L., AND SNm, M. Efficient synchronization on multiprocessors with shared memory. In Proceedings of the F~fih ACM Symposium on Principles of Dzstributecl Computing (1986), 218-228.
    [23]
    KUMAR, M., AND PFISTER, G.F. The onset of hot spot contention In Proceedings of the 1986 International Conference on Parallel Processing (1986), 28-34.
    [24]
    LAMPORT, L. A new solution of Dijkstra's concurrent programming problem. Commun. ACM 17, 8 (Aug. 1974), 453-455.
    [25]
    LAMPORT, L. The mutual exclusion problem: Part I--A theory of interprocess communication: Part II--Statement and solutions. J ACM 33, 2 (Apr. 1986), 313-348.
    [26]
    LAMPORT, L. A fast mutual exclusion algorithm. ACM Trans. Comput. Syst. 5, 1 (Feb. 1987) 1-11.
    [27]
    LEE, C.A. Barrier synchronization over multistage interconnection networks. In Proceedrags of the Second IEEE Symposium on Parallel and Distributed Processing (Dallas, Tex, Dec 1990), 130-133.
    [28]
    LEE, J., AND RAMACHANDRAN, U. Synchronization with multiprocessor caches. In Proceedrags of the Internatwnal Symposlum on Computer Architecture (May 1990), 27-37.
    [29]
    LUBACHEVSK~, B. An approach to automating the verification of compact parallel coordination programs. I Acta Inf. 21 (1984), 125-169.
    [30]
    LUBACnEVSKY, B. Synchronization barrier and related tools for shared memory parallel programming. In Proceedings of the 1989 International Converence on Parallel Processing (Aug 1989), II-175-II-179.
    [31]
    MAEKAWA, M. A ~ algorithm for mutual exclusion in decentralized systems. ACM Trans. Comput. Syst. 3, 2 (May 1985), 145-159.
    [32]
    MELLOR-CRUMMEY, J.M. Concurrent queues: Practical fetch-and-~ algorithms. Tech. Rep. 229, Computer Science Dept., Univ. of Rochester, Nov. 1987.
    [33]
    MELLOR-CRUMMEY, J. M., AND SCOTT, M. L. Algorithms for scalable synchronization on shared-memory multiprocessors. Tech. Rep. 342, Computer Science Dept., Univ. of Rochester, Apr. 1990. Also COMP TR90-114, Dept. of Computer Science, Rice Univ., May 1990
    [34]
    OUSTERHOUT, J. K, SCELZA, D. A., AND SINDHU, P. S. Medusa: An experiment in distributed operating system structure. Commun. ACM 23, 2 (Feb. 1980), 92-105.
    [35]
    P1596 Working Group of the IEEE Computer Society Microprocessor Standards Committee. SCI (scalable coherent interface): An overview of extended cache-coherence protocols, Feb 5, 1990. Draft 0.59 P1596/Part III-D.
    [36]
    PETERSON, G. L. A new solution to Lamport's concurrent programming problem using small shared variables. ACM Trans. Program. Lang. Syst. 5, 1 (Jan. 1983), 56-65.
    [37]
    PFISTER, G., BRANTLEY, W. C., GEORGE, D. A., HARVEY, S. L., KLEINFELDER, W. J., McAvLIFFE, K. P., MELTON, T. A., NORTON, V A., AND WEISS, J. The IBM research parallel processor prototype (RP3): Introduction and architecture. In Proceedings of the 1985 International Conference on Parallel Processing (Aug. 1985), 764-771.
    [38]
    PFISTER, G., AND NORTON, V. A. "Hot spot" contention and combining in multistage interconnection networks. IEEE Trans. Comput. C-34, 10 (Oct. 1985), 943 948.
    [39]
    RAYMOND, K. A tree-based algorithm for distributed mutual exclusion. ACM Trans. Cornput. Syst. 7, i (Feb. 1989), 61-77.
    [40]
    RAYNAL, M. Algorithms for Mutual Exclusion. MIT Press Series in Scientific Computation. MIT Press, Cambridge, Mass., 1986. Translated from the French by D. Beeson.
    [41]
    REED, D. P., A~D KANODIA, R K. Synchronization with eventcounts and sequencers. Commun. ACM22, 2 (Feb. 1979), 115-123.
    [42]
    RETTBERG, R. D., CROWTHER, W. R., CARVEY, P. P., AND TOMHNSON, R. S. The Monarch parallel processor hardware design. Computer 23, 4 (Apr. 1990), 18-30
    [43]
    RICART, G., AND AGRAWALA, A.K. An optimal algorithm for mutual exclusion in computer networks. Commun. ACM. 24, i (Jan. 1981) 9-17.
    [44]
    RUDOLPH, L., A~D SEGALL, Z. Dynamic decentralized cache schemes for MIMD parallel processors. In Proceedings of the International Symposium on Computer Architecture (1984), 340-347.
    [45]
    SANDERS, B. A. The information structure of distributed mutual exclusion algorithms. ACM Trans. Comput. Syst. 5, 3 (Aug. 1987), 284--299
    [46]
    SCHNEIDER, F.B. Synchronization in distributed programs. ACM Trans. Program. Lang. Syst. 4, 2 (Apr. 1982), 179-195.
    [47]
    SCOTT, M. L., LEBLANC, T. J., AND MARSH, B. D. Multi-model parallel programming in Psyche. In Proceedings of the Second ACM Symposium on Principles and Practice of Parallel Programming (Mar. 1990), 70-78.
    [48]
    TANG, P., AND YEW, P.-C. Processor self-scheduling for multiple-nested parallel loops. In Proceedtngs of the 1986 International Conference on Parallel Processing (Aug. 1986), 528-535.
    [49]
    TANG, P., AND Y~W, P.-C. Algorithms for distributing hot-spot addressing CSRD Rep. 617, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana- Champaign, Jan. 1987.
    [50]
    YEW, P.-C. Architecture of the Cedar parallel supercomputer. CSRD Rep 609, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana-Champaign, Aug. 1986.
    [51]
    YEW, P.-C., TZENG, N.-F., AND LAWRIE, D.H. Distributing hot-spot addressing in large-scale multiprocessors. IEEE Trans. Comput. C-36, 4 (Apr. 1987), 388-395.
    [52]
    ZAHORIAN, J., LAZOWSKA, E. D., AND EAGER, D. L. The effect of scheduling discipline on spin overhead in shared memory parallel processors. Tech. Rep. TR-89-07-03, Computer Science Dept., Univ. of Washington, July 1989.

    Cited By

    View all
    • (2024)Wavefront Threading Enables Effective High-Level SynthesisProceedings of the ACM on Programming Languages10.1145/36564208:PLDI(1066-1090)Online publication date: 20-Jun-2024
    • (2024)Polynima: Practical Hybrid Recompilation for Multithreaded BinariesProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3650065(1126-1141)Online publication date: 22-Apr-2024
    • (2024)Locks as a Resource: Fairly Scheduling Lock Occupation with CFLProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638477(17-29)Online publication date: 2-Mar-2024
    • Show More Cited By

    Recommendations

    Reviews

    Roy Levin

    The authors have compiled a competent survey of spin-lock and barrier synchronization algorithms, including some new ones of their own. Two sections of the paper present these algorithms with clear pseudocode. Performance comparisons on a BBN Butterfly and a Sequent Symmetry support the conclusions in the subsequent section on architectural implications. The paper closes with a concise list of recommendations that will aid software practitioners who must navigate this sometimes bewildering space of possible algorithms. I was distracted by several small inconsistencies in notation and presentation that should have been picked up in the editorial process (like the references to the nonexistent Algorithms 8 and 11). It also took me a while to realize that a direct correspondence does not always exist between the algorithms whose code is presented and the algorithms whose performance is graphed. Finally, the space devoted to a lame attempt to summarize the correctness proof of the authors' intriguing new spin-lock algorithm would have been better spent explaining how a key race-condition is resolved. These inconsistencies are relatively small flaws in an otherwise well-organized and well-reasoned paper. I encourage designers of parallel architectures and parallel algorithms to read it.

    Access critical reviews of Computing literature here

    Become a reviewer for Computing Reviews.

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Computer Systems
    ACM Transactions on Computer Systems  Volume 9, Issue 1
    Feb. 1991
    97 pages
    ISSN:0734-2071
    EISSN:1557-7333
    DOI:10.1145/103727
    Issue’s Table of Contents

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 February 1991
    Published in TOCS Volume 9, Issue 1

    Permissions

    Request permissions for this article.

    Check for updates

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)723
    • Downloads (Last 6 weeks)65

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Wavefront Threading Enables Effective High-Level SynthesisProceedings of the ACM on Programming Languages10.1145/36564208:PLDI(1066-1090)Online publication date: 20-Jun-2024
    • (2024)Polynima: Practical Hybrid Recompilation for Multithreaded BinariesProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3650065(1126-1141)Online publication date: 22-Apr-2024
    • (2024)Locks as a Resource: Fairly Scheduling Lock Occupation with CFLProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638477(17-29)Online publication date: 2-Mar-2024
    • (2024)Scaling Up Transactions with Slower ClocksProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638472(2-16)Online publication date: 2-Mar-2024
    • (2024)ALock: Asymmetric Lock Primitive for RDMA SystemsProceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3626183.3659977(15-26)Online publication date: 17-Jun-2024
    • (2024)TCSA: Efficient Localization of Busy-Wait Synchronization Bugs for Latency-Critical ApplicationsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.334257335:2(297-309)Online publication date: 1-Feb-2024
    • (2024)A biological-like synthesis framework for software engineering environmentsInternational Journal of Computers and Applications10.1080/1206212X.2023.230118346:4(208-217)Online publication date: 9-Jan-2024
    • (2024)ParSweet: A Suite of Codes for Benchmarking and Testing Mutex-Based Parallel SystemsAsynchronous Many-Task Systems and Applications10.1007/978-3-031-61763-8_7(71-83)Online publication date: 14-Feb-2024
    • (2023)Shared Memory Access Optimization Analysis System for PMIx Standard ImplementationThe Herald of the Siberian State University of Telecommunications and Information Science10.55648/1998-6920-2024-18-1-29-3918:1(29-39)Online publication date: 17-Dec-2023
    • (2023)CITRONProceedings of the 21st USENIX Conference on File and Storage Technologies10.5555/3585938.3585957(297-314)Online publication date: 21-Feb-2023
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Get Access

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media