Scaling Shared Memory Multiprocessing Applications in Non-cache-coherent Domains

Authors:

Binoy Ravindran

SYSTOR '20: Proceedings of the 13th ACM International Systems and Storage Conference

Pages 13 - 24

https://doi.org/10.1145/3383669.3398278

Published: 30 May 2020

Abstract

Due to the slowdown of Moore's Law, systems designers have begun integrating non-cache-coherent heterogeneous computing elements in order to continue scaling performance. Programming such systems has traditionally been difficult: developers were forced to use programming models that exposed multiple memory regions and to maintain memory consistency manually. Previous works proposed distributed shared memory (DSM) as a way to achieve high programmability in such systems. However, past DSM systems were plagued by low-bandwidth networking and utilized complex memory consistency protocols, which limited their adoption. Recently, new networking technologies have begun to change the assumptions about which components are bottlenecks in the system. Additionally, many popular shared-memory programming models utilize memory consistency semantics similar to those proposed for DSM, meaning that such semantics are already widespread in mainstream programming.

In this work, we argue that it is time to revive DSM as a means for achieving good programmability and performance on non-cache-coherent systems. We explore optimizing an existing DSM protocol by relaxing memory consistency semantics and exposing new cross-node barrier primitives. We integrate the new mechanisms into an existing OpenMP runtime, allowing developers to leverage cross-node execution without changing a single line of code. When evaluated on an x86 server connected to an ARMv8 server via InfiniBand, the DSM optimizations achieve an average of 11% (up to 33%) improvement versus the baseline DSM implementation.
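As a rough illustration of what "without changing a single line of code" means for the programmer, the sketch below is an ordinary OpenMP loop in C. It is not taken from the paper; the function name `dot` and the workload are hypothetical. Under the approach described above, a DSM-backed OpenMP runtime would be responsible for placing threads on different nodes and keeping memory consistent, and the implicit barrier that ends the parallel loop is the kind of synchronization point a cross-node barrier primitive would plausibly serve.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: dot product written as plain shared-memory OpenMP.
 * Nothing here names a node, a memory region, or a message; any cross-node
 * distribution is assumed to happen inside the runtime, not in this code. */
double dot(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);

    double sum = 0.0;

    /* Each thread accumulates a private partial sum; the partial sums are
     * combined when the loop finishes. The loop ends with an implicit
     * barrier, so all threads have finished before dot() returns. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Compiled without OpenMP support, the pragma is ignored and the function runs serially with the same result for inputs like small integers, which is a convenient way to check correctness before trying any parallel or distributed runtime.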

Index Terms

1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing
2. Software and its engineering
  1. Software creation and management
    1. Designing software
      1. Software implementation planning
        1. Software design techniques

Published In

SYSTOR '20: Proceedings of the 13th ACM International Systems and Storage Conference

May 2020

118 pages

ISBN: 9781450375887

DOI: 10.1145/3383669

Copyright © 2020 ACM.

© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

Technion: Israel Institute of Technology
SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

NAVSEA/NEEC
US Office of Naval Research

Conference

SYSTOR '20

Sponsor:

Technion
SIGOPS

SYSTOR '20: The 13th ACM International Systems and Storage Conference

June 2 - 4, 2020

Haifa, Israel

Acceptance Rates

Overall Acceptance Rate 94 of 285 submissions, 33%
