DOI: 10.1145/3383669.3398278
research-article

Scaling Shared Memory Multiprocessing Applications in Non-cache-coherent Domains

Published: 30 May 2020

    Abstract

    Due to the slowdown of Moore's Law, system designers have begun integrating non-cache-coherent heterogeneous computing elements in order to continue scaling performance. Programming such systems has traditionally been difficult: developers were forced to use programming models that exposed multiple memory regions and to maintain memory consistency manually. Previous work proposed distributed shared memory (DSM) as a way to achieve high programmability in such systems. However, past DSM systems were plagued by low-bandwidth networking and relied on complex memory consistency protocols, which limited their adoption. Recently, new networking technologies have begun to change the assumptions about which components are the bottlenecks in the system. Additionally, many popular shared-memory programming models now use memory consistency semantics similar to those proposed for DSM, making those semantics widespread in mainstream programming.
    In this work, we argue that it is time to revive DSM as a means for achieving good programmability and performance on non-cache-coherent systems. We explore optimizing an existing DSM protocol by relaxing memory consistency semantics and exposing new cross-node barrier primitives. We integrate the new mechanisms into an existing OpenMP runtime, allowing developers to leverage cross-node execution without changing a single line of code. When evaluated on an x86 server connected to an ARMv8 server via InfiniBand, the DSM optimizations achieve an average of 11% (up to 33%) improvement versus the baseline DSM implementation.
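
    As a rough illustration of what relaxed consistency combined with a cross-node barrier can mean, the toy model below keeps each node's writes local and only publishes them when every node reaches a barrier. It is a sketch in the spirit of release-consistency DSM designs, not the protocol from the paper. The names `Node`, `barrier`, and `home` are hypothetical, and the model assumes a data-race-free program in which no two nodes write the same key between barriers.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One coherence domain holding a private, possibly stale copy of shared data."""
    name: str
    cache: dict = field(default_factory=dict)    # this node's local view
    pending: dict = field(default_factory=dict)  # writes not yet published

    def write(self, key, value):
        # Writes stay local until the next barrier (the relaxed part).
        self.cache[key] = value
        self.pending[key] = value

    def read(self, key):
        # Reads are served locally and may return stale data between barriers.
        return self.cache.get(key)


def barrier(home, nodes):
    """Hypothetical cross-node barrier: publish pending writes, then refresh caches."""
    for node in nodes:
        home.update(node.pending)   # merge this node's writes into the home copy
        node.pending.clear()
    for node in nodes:
        node.cache = dict(home)     # every node leaves with an up-to-date view


# Assumed setup: two nodes sharing a home copy of the data.
home = {"x": 0, "y": 0}
x86 = Node("x86", dict(home))
arm = Node("arm", dict(home))

x86.write("x", 1)
arm.write("y", 2)
stale = arm.read("x")       # arm has not yet observed x86's write
barrier(home, [x86, arm])
fresh = arm.read("x")       # the write is visible after the barrier
```

    In this model, communication happens only at synchronization points, which is why barrier-heavy models such as OpenMP can map onto it without source changes. A real DSM tracks memory pages and their diffs rather than dictionary keys.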

    Cited By

    • Itoyori: Reconciling Global Address Space and Global Fork-Join Task Parallelism (2023). Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1145/3581784.3607049. Online publication date: 12-Nov-2023

    Published In

    SYSTOR '20: Proceedings of the 13th ACM International Systems and Storage Conference
    May 2020
    118 pages
    ISBN: 9781450375887
    DOI: 10.1145/3383669
    © 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. DSM
    2. Heterogeneous Architectures
    3. InfiniBand
    4. Shared Memory Programming
    5. System Software

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • NAVSEA/NEEC
    • US Office of Naval Research

    Conference

    SYSTOR '20
    Acceptance Rates

    Overall Acceptance Rate 94 of 285 submissions, 33%
