Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article
Public Access

What Scalable Programs Need from Transactional Memory

Published: 04 April 2017 Publication History
  • Get Citation Alerts
  • Abstract

    Transactional memory (TM) has been the focus of numerous studies, and it is supported in processors such as the IBM Blue Gene/Q and Intel Haswell. Many studies have used the STAMP benchmark suite to evaluate their designs. However, the speedups obtained for the STAMP benchmarks on all TM systems we know of are quite limited; for example, with 64 threads on the IBM Blue Gene/Q, we observe a median speedup of 1.4X using the Blue Gene/Q hardware transactional memory (HTM), and a median speedup of 4.1X using a software transactional memory (STM).
    What limits the performance of these benchmarks on TMs? In this paper, we argue that the problem lies with the programming model and data structures used to write them. To make this point, we articulate two principles that we believe must be embodied in any scalable program and argue that STAMP programs violate both of them. By modifying the STAMP programs to satisfy both principles, we produce a new set of programs that we call the Stampede suite. Its median speedup on the Blue Gene/Q is 8.0X when using an STM. The two principles also permit us to simplify the TM design. Using this new STM with the Stampede benchmarks, we obtain a median speedup of 17.7X with 64 threads on the Blue Gene/Q and 13.2X with 32 threads on an Intel Westmere system.
    These results suggest that HTM and STM designs will benefit if more attention is paid to the division of labor between application programs, systems software, and hardware.

    References

    [1]
    M. Abadi, A. Birrell, T. Harris, and M. Isard. Semantics of transactional memory and automatic mutual exclusion. ACM Trans. Programming Language and Systems, 33 (1): 2:1--2:50, Jan. 2011. 10.1145/1889997.1889999.
    [2]
    A.-R. Adl-Tabatabai, B. T. Lewis, V. Menon, B. R. Murphy, B. Saha, and T. Shpeisman. Compiler and runtime support for efficient software transactional memory. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, PLDI, pages 26--37, 2006. 10.1145/1133981.1133985.
    [3]
    A. W. Appel. Compiling with Continuations. Cambridge University Press, 2007.
    [4]
    H. Avni and N. Shavit. Maintaining consistent transactional states without a global clock. In Proc. Intl Colloq. Structural Information and Communication Complexity, pages 131--140, 2008. 10.1007/978--3--540--69355-0_12.
    [5]
    M. J. Best, S. Mottishaw, C. Mustard, M. Roth, A. Fedorova, and A. Brownsword. Synchronization via scheduling: Techniques for efficiently managing shared state. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, PLDI, pages 640--652, 2011. 10.1145/1993498.1993573.
    [6]
    C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: Characterization and architectural implications. In Proc. Intl Conf. Parallel Architectures and Compilation Techniques, PACT, pages 72--81, 2008. 10.1145/1454115.1454128.
    [7]
    R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: an efficient multithreaded runtime system. SIGPLAN Notices, 30 (8): 207--216, 1995. 10.1145/209937.209958.
    [8]
    C. Blundell, J. Devietti, E. C. Lewis, and M. M. K. Martin. Making the fast case common and the uncommon case simple in unbounded transactional memory. In Proc. Intl Symp. Computer Architecture, ISCA, pages 24--34, 2007. 10.1145/1250662.1250667.
    [9]
    J. Bobba, N. Goyal, M. D. Hill, M. M. Swift, and D. A. Wood. Token™: Efficient execution of large transactions with hardware transactional memory. In Proc. Intl Symp. Computer Architecture, ISCA, pages 127--138, 2008. 10.1109/ISCA.2008.24.
    [10]
    C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford transactional applications for multi-processing. In Proc. IEEE Intl Symp. Workload Characterization, IISWC, Sept. 2008.
    [11]
    B. D. Carlstrom, A. McDonald, H. Chafi, J. Chung, C. C. Minh, C. Kozyrakis, and K. Olukotun. The Atomos transactional programming language. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, PLDI, pages 1--13, 2006. 10.1145/1133981.1133983.
    [12]
    B. Chamberlain, D. Callahan, and H. Zima. Parallel programmability and the Chapel language. Int. J. High Perform. Comput. Appl., 21 (3): 291--312, Aug. 2007. 10.1177/1094342007078442.
    [13]
    A. T. Clements, M. F. Kaashoek, N. Zeldovich, R. T. Morris, and E. Kohler. The scalable commutativity rule: Designing scalable software for multicore processors. In Proc. ACM Symp. Operating Systems Principles, SOSP, pages 1--17, 2013. 10.1145/2517349.2522712.
    [14]
    C. Click. Azul's experiences with hardware transactional memory. In HP Labs' Bay Area Workshop on Transactional Memory, 2009.
    [15]
    L. Dalessandro, F. Carouge, S. White, Y. Lev, M. Moir, M. L. Scott, and M. F. Spear. Hybrid NOrec: a case study in the effectiveness of best effort hardware transactional memory. In Proc. Intl Conf. Architectural Support for Programming Languages and Operating Systems, ASPLOS, pages 39--52, 2011. 10.1145/1950365.1950373.
    [16]
    D. Dice, O. Shalev, and N. Shavit. Transactional locking II. In Proc. Intl Conf. Distributed Computing, pages 194--208, 2006. 10.1007/11864219_14.
    [17]
    D. Dice, Y. Lev, M. Moir, and D. Nussbaum. Early experience with a commercial hardware transactional memory implementation. In Proc. Intl Conf. Architectural Support for Programming Languages and Operating Systems, ASPLOS, pages 157--168, 2009. 10.1145/1508244.1508263.
    [18]
    N. Diegues, P. Romano, and L. Rodrigues. Virtues and limitations of commodity hardware transactional memory. In Proc. Intl Conf. Parallel Architectures and Compilation, PACT, pages 3--14, 2014. 10.1145/2628071.2628080.
    [19]
    S. Dolev, D. Hendler, and A. Suissa. CAR-S™: Scheduling-based collision avoidance and resolution for software transactional memory. In Proc. ACM Symp. Principles of Distributed Computing, PODC, pages 125--134, 2008. 10.1145/1400751.1400769.
    [20]
    A. Dragojević, R. Guerraoui, and M. Kapalka. Stretching transactional memory. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, PLDI, pages 155--165, 2009. 10.1145/1542476.1542494.
    [21]
    P. Felber, C. Fetzer, and T. Riegel. Dynamic performance tuning of word-based software transactional memory. In Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, PPoPP, pages 237--246, 2008. 10.1145/1345206.1345241.
    [22]
    S. Ghemawat and P. Menage. TCMalloc: Thread-caching malloc. http://goog-perftools.sourceforge.net/doc/tcmalloc.html, 2014.
    [23]
    T. Harris and K. Fraser. Language support for lightweight transactions. In Proc. ACM SIGPLAN Conf. Object-oriented Programing, Systems, Languages and Applications, OOPSLA, pages 388--402, New York, NY, USA, 2003. 10.1145/949305.949340.
    [24]
    M. Herlihy and J. E. B. Moss. Transactional memory: architectural support for lock-free data structures. In Proc. Intl Symp. Computer Architecture, ISCA, 1993. 10.1145/165123.165164.
    [25]
    M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann, March 2008. ISBN 0123705916.
    [26]
    M. Kulkarni, L. P. Chew, and K. Pingali. Using transactions in Delaunay mesh generation. In Proc. Workshop on Transactional Memory Workloads, WTW, 2006.
    [27]
    M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, PLDI, pages 211--222, 2007. 10.1145/1250734.1250759.
    [28]
    C. Lattner, A. Lenharth, and V. Adve. Making context-sensitive points-to analysis with heap cloning practical for the real world. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, PLDI, pages 278--289, 2007. 10.1145/1250734.1250766.
    [29]
    A. Lenharth and K. Pingali. Scaling runtimes for irregular algorithms to large-scale NUMA systems. Computer, 48 (8): 35--44, 2015. 10.1109/MC.2015.229.
    [30]
    A. Lenharth, D. Nguyen, and K. Pingali. Priority queues are not good concurrent priority schedulers. In Proc. European Conf. Parallel Processing, pages 209--221, 2015.
    [31]
    V. Luchangco, M. Wong, H. Boehm, J. Gottschlich, J. Maurer, P. McKenney, M. Michael, M. Moir, T. Riegel, M. Scott, T. Shpeisman, and M. Spear. Transactional memory support for C+. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3919.pdf, Feb. 2014.
    [32]
    M. Mendez-Lojo, A. Mathew, and K. Pingali. Parallel inclusion-based points-to analysis. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA, October 2010.
    [33]
    V. Menon and K. Pingali. High-level semantic optimization of numerical codes. In Proc. Intl Conf. Supercomputing, ICS, pages 434--443, 1999. 10.1145/305138.305230.
    [34]
    C. C. Minh, M. Trautmann, J. Chung, A. McDonald, N. Bronson, J. Casper, C. Kozyrakis, and K. Olukotun. An effective hybrid transactional memory system with strong isolation guarantees. In Proc. Intl Symp. Computer Architecture, ISCA, pages 69--80, 2007. 10.1145/1250662.1250673.
    [35]
    R. Nasre, M. Burtscher, and K. Pingali. Data-driven versus topology-driven irregular computations on GPUs. In Proc. IEEE Intl Symp. Parallel and Distributed Processing, pages 463--474, 2013.
    [36]
    R. Nasre, M. Burtscher, and K. Pingali. Morph algorithms on GPUs. In ACM SIGPLAN Notices, volume 48, pages 147--156, 2013
    [37]
    D. Nguyen and K. Pingali. Synthesizing concurrent schedulers for irregular algorithms. In Proc. Intl Conf. Architectural Support for Programming Languages and Operating Systems, ASPLOS, pages 333--344, 2011. 10.1145/1950365.1950404.
    [38]
    D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In Proc. ACM Symp. Operating Systems Principles, SOSP, pages 456--471, New York, NY, USA, 2013. 10.1145/2517349.2522739.
    [39]
    Y. Ni, A. Welc, A.-R. Adl-Tabatabai, M. Bach, S. Berkowits, J. Cownie, R. Geva, S. Kozhukow, R. Narayanaswamy, J. Olivier, S. Preis, B. Saha, A. Tal, and X. Tian. Design and implementation of transactional constructs for C/C++. In Proc. ACM SIGPLAN Intl. Conf. Object-oriented Programming Systems Languages and Applications, OOPSLA, pages 195--212, 2008. 10.1145/1449764.1449780.
    [40]
    S. Pai and K. Pingali. A compiler for throughput optimization of graph algorithms on GPUs. In Proc. ACM SIGPLAN Intl Conf. Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA, pages 1--19, 2016. 10.1145/2983990.2984015.
    [41]
    K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R. Kaleem, T.-H. Lee, A. Lenharth, R. Manevich, M. Méndez-Lojo, D. Prountzos, and X. Sui. The tao of parallelism in algorithms. In Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, PLDI, pages 12--25, 2011. 10.1145/1993498.1993501.
    [42]
    T. Riegel, P. Felber, and C. Fetzer. A lazy snapshot algorithm with eager validation. In Proc. Intl Conf. on Distributed Computing, DISC, pages 284--298, 2006. 10.1007/11864219_20.
    [43]
    T. Riegel, C. Fetzer, and P. Felber. Time-based transactional memory with scalable time bases. In Proc. ACM Symp. on Parallel Algorithms and Architectures, SPAA, pages 221--228, 2007. 10.1145/1248377.1248415.
    [44]
    W. Ruan, Y. Liu, and M. Spear. Boosting timestamp-based transactional memory by exploiting hardware cycle counters. ACM Trans. Archit. Code Optim., 10 (4): 40:1--40:21, Dec. 2013. 10.1145/2541228.2555297.
    [45]
    W. Ruan, T. Vyas, Y. Liu, and M. Spear. Transactionalizing legacy code: An experience report using gcc and memcached. In Proc. Intl Conf. Architectural Support for Programming Languages and Operating Systems, ASPLOS, pages 399--412, 2014. 10.1145/2541940.2541960.
    [46]
    B. Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. C. Minh, and B. Hertzberg. McRT-S™: a high performance software transactional memory system for a multi-core runtime. In Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, PPoPP, pages 187--197, 2006. 10.1145/1122971.1123001.
    [47]
    A. Shriraman, M. F. Spear, H. Hossain, V. J. Marathe, S. Dwarkadas, and M. L. Scott. An integrated hardware-software approach to flexible transactional memory. In Proc. Intl Symp. Computer Architecture, ISCA, pages 104--115, 2007. 10.1145/1250662.1250676.
    [48]
    M. F. Spear, M. M. Michael, and C. von Praun. RingS™: scalable transactions with a single atomic instruction. In Proc. Symp. Parallelism in Algorithms and Architectures, SPAA, pages 275--284, 2008. 10.1145/1378533.1378583.
    [49]
    S. Tomić, C. Perfumo, C. Kulkarni, A. Armejach, A. Cristal, O. Unsal, T. Harris, and M. Valero. EazyH™: eager-lazy hardware transactional memory. In Proc. IEEE/ACM Intl Symp. Microarchitecture, MICRO, pages 145--155, 2009. 10.1145/1669112.1669132.
    [50]
    A. Wang, M. Gaudet, P. Wu, J. N. Amaral, M. Ohmacht, C. Barton, R. Silvera, and M. Michael. Evaluation of Blue Gene/Q hardware support for transactional memories. In Proc. Intl Conf. Parallel Architectures and Compilation Techniques, PACT, pages 127--136, 2012. 10.1145/2370816.2370836.
    [51]
    L. Yen, J. Bobba, M. R. Marty, K. E. Moore, H. Volos, M. D. Hill, M. M. Swift, and D. A. Wood. LogTM-SE: Decoupling hardware transactional memory from caches. In Proc. IEEE Intl Symp. High Performance Computer Architecture, HPCA, pages 261--272, 2007. 10.1109/HPCA.2007.346204.
    [52]
    R. M. Yoo, C. J. Hughes, K. Lai, and R. Rajwar. Performance evaluation of Intel transactional synchronization extensions for high-performance computing. In Proc. Intl Conf. for High Performance Computing, Networking, Storage and Analysis, SC, pages 19:1--19:11, 2013. 10.1145/2503210.2503232.

    Cited By

    View all

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM SIGARCH Computer Architecture News
    ACM SIGARCH Computer Architecture News  Volume 45, Issue 1
    Asplos'17
    March 2017
    812 pages
    ISSN:0163-5964
    DOI:10.1145/3093337
    Issue’s Table of Contents
    • cover image ACM Conferences
      ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
      April 2017
      856 pages
      ISBN:9781450344654
      DOI:10.1145/3037697
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 April 2017
    Published in SIGARCH Volume 45, Issue 1

    Check for updates

    Author Tags

    1. programming models
    2. scalability
    3. stampede benchmarks
    4. transactional memory
    5. transactions

    Qualifiers

    • Research-article

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)79
    • Downloads (Last 6 weeks)9
    Reflects downloads up to 30 Jul 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2019)Adaptive Versioning in Transactional MemoriesStabilization, Safety, and Security of Distributed Systems10.1007/978-3-030-34992-9_22(277-295)Online publication date: 14-Nov-2019
    • (2021)Adaptive Versioning in Transactional Memory SystemsAlgorithms10.3390/a1406017114:6(171)Online publication date: 31-May-2021
    • (2020)TardisTMProceedings of the Eleventh International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3380536.3380538(1-10)Online publication date: 22-Feb-2020
    • (2020)A survey on optimizations towards best-effort hardware transactional memoryCCF Transactions on High Performance Computing10.1007/s42514-020-00049-2Online publication date: 15-Sep-2020
    • (2019)FPGA-Accelerated Optimistic Concurrency Control for Transactional MemoryProceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3352460.3358270(911-923)Online publication date: 12-Oct-2019
    • (2019)MV-RLUProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304040(779-792)Online publication date: 4-Apr-2019
    • (2019)Optimization of the Load Balancing Policy for Tiled Many-Core ProcessorsIEEE Access10.1109/ACCESS.2018.28834157(10176-10188)Online publication date: 2019
    • (2018)Boosting Transactional Memory with Stricter SerializabilityCoordination Models and Languages10.1007/978-3-319-92408-3_11(231-251)Online publication date: 27-May-2018
    • (2017)Providing QoS in contention management for software transactional memory2017 13th International Computer Engineering Conference (ICENCO)10.1109/ICENCO.2017.8289793(231-236)Online publication date: Dec-2017

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Get Access

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media