Synchronization Using Remote-Scope Promotion

Published: 14 March 2015

    Abstract

    Heterogeneous System Architecture (HSA) and OpenCL define scoped synchronization to facilitate low-overhead communication across a subset of threads. Scoped synchronization works well for static sharing patterns, where consumer threads are known a priori. It works poorly for dynamic sharing patterns (e.g., work stealing) where programmers cannot use a faster small scope due to the rare possibility that the work is stolen by a thread in a distant slower scope. This puts programmers in a conundrum: optimize the common case by synchronizing at a faster small scope or use work stealing at a slower large scope. In this paper, we propose to extend scoped synchronization with remote-scope promotion. This allows the most frequent sharers to synchronize through a small scope. Infrequent sharers synchronize by promoting that remote small scope to a larger shared scope. Synchronization using remote-scope promotion provides performance robustness for dynamic workloads, where the benefits provided by scoped synchronization and work stealing are hard to anticipate. Compared to a naïve baseline, static scoped synchronization alone achieves a 1.07x speedup on average and dynamic work stealing alone achieves a 1.18x speedup on average. In contrast, synchronization using remote-scope promotion achieves a robust 1.25x speedup on average, across a diverse set of graph benchmarks and inputs.
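
    The trade-off described in the abstract can be illustrated with a toy cost model. The sketch below is not the paper's mechanism or evaluation: the `SyncCosts` values, the strategy names, and the `total_sync_cost` function are all hypothetical, chosen only to show why paying a higher price on rare steals can beat paying a large-scope price on every operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncCosts:
    """Hypothetical per-operation synchronization costs, in arbitrary units."""
    small_scope: int = 1   # owner synchronizing within its own small scope
    large_scope: int = 10  # any thread synchronizing at the large shared scope
    promotion: int = 15    # a thief promoting a remote small scope


def total_sync_cost(strategy: str, owner_ops: int, steals: int,
                    costs: SyncCosts = SyncCosts()) -> int:
    """Return the modeled synchronization cost of a run of queue operations."""
    if strategy == "large_scope_stealing":
        # Stealing is always possible, so every operation pays the large-scope cost.
        return (owner_ops + steals) * costs.large_scope
    if strategy == "remote_scope_promotion":
        # Owners stay at the small scope; only the rare steal pays for promotion.
        return owner_ops * costs.small_scope + steals * costs.promotion
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    for name in ("large_scope_stealing", "remote_scope_promotion"):
        print(name, total_sync_cost(name, owner_ops=1000, steals=10))
```

    With these assumed costs, 1000 owner operations and 10 steals cost 10100 units under large-scope work stealing and 1150 units under the promotion model, even though a single promotion is priced above a single large-scope operation. The numbers are illustrative only and are unrelated to the speedups the paper reports.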

    Published In

    ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2015
    720 pages
    ISBN:9781450328357
    DOI:10.1145/2694344
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. graphics processing unit (GPU)
    2. memory model
    3. scope promotion
    4. scoped synchronization
    5. work stealing

    Qualifiers

    • Research-article

    Conference

    ASPLOS '15

    Acceptance Rates

    ASPLOS '15 Paper Acceptance Rate 48 of 287 submissions, 17%;
    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 32
    • Downloads (last 6 weeks): 2

    Cited By

    • (2023) Improving the Scalability of GPU Synchronization Primitives. IEEE Transactions on Parallel and Distributed Systems 34:1, pp. 275-290. DOI: 10.1109/TPDS.2022.3218508. Online publication date: 1-Jan-2023
    • (2022) Only Buffer When You Need To: Reducing On-chip GPU Traffic with Reconfigurable Local Atomic Buffers. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 676-691. DOI: 10.1109/HPCA53966.2022.00056. Online publication date: Apr-2022
    • (2022) Consistency and Coherence for Heterogeneous Systems. A Primer on Memory Consistency and Cache Coherence, pp. 211-251. DOI: 10.1007/978-3-031-01764-3_10. Online publication date: 28-Mar-2022
    • (2020) A Primer on Memory Consistency and Cache Coherence, Second Edition. Synthesis Lectures on Computer Architecture 15:1, pp. 1-294. DOI: 10.2200/S00962ED2V01Y201910CAC049. Online publication date: 4-Feb-2020
    • (2020) ScoRD. Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, pp. 1036-1049. DOI: 10.1109/ISCA45697.2020.00088. Online publication date: 30-May-2020
    • (2019) CoNDA. Proceedings of the 46th International Symposium on Computer Architecture, pp. 629-642. DOI: 10.1145/3307650.3322266. Online publication date: 22-Jun-2019
    • (2019) Optimizing GPU Cache Policies for MI Workloads. 2019 IEEE International Symposium on Workload Characterization (IISWC), pp. 243-248. DOI: 10.1109/IISWC47752.2019.9041977. Online publication date: Nov-2019
    • (2018) A Case for Scoped Persist Barriers in GPUs. Proceedings of the 11th Workshop on General Purpose GPUs, pp. 2-12. DOI: 10.1145/3180270.3180275. Online publication date: 24-Feb-2018
    • (2018) Spandex. Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 261-274. DOI: 10.1109/ISCA.2018.00031. Online publication date: 2-Jun-2018
    • (2017) Automatically comparing memory consistency models. ACM SIGPLAN Notices 52:1, pp. 190-204. DOI: 10.1145/3093333.3009838. Online publication date: 1-Jan-2017