Itoyori: Reconciling Global Address Space and Global Fork-Join Task Parallelism

Authors:

Shumpei Shiina,

Kenjiro Taura

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 14, Pages 1 - 15

https://doi.org/10.1145/3581784.3607049

Published: 11 November 2023

Abstract

This paper introduces Itoyori, a task-parallel runtime system designed to scale task parallelism (more specifically, nested fork-join parallelism) beyond a single node. Task-parallel systems often employ the partitioned global address space (PGAS) model, but naively combining the two can lead to poor performance due to fine-grained and redundant remote memory accesses. Itoyori addresses this issue by automatically caching global memory accesses at runtime, enabling efficient cache sharing among parallel tasks running on the same processor. As a real-world case study, we ported an existing task-parallel implementation of the Fast Multipole Method (FMM) to distributed memory with Itoyori, achieving a 7.5× speedup when scaled from a single node to 12 nodes and running up to 6.0× faster than without caching. This study demonstrates that global-view fork-join programming can be made practical and scalable while requiring minimal changes to the shared-memory code.
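To make the motivation concrete, the following toy model illustrates why fine-grained accesses to a global address space benefit from a software cache shared by tasks on the same process. It is only a sketch under stated assumptions, not Itoyori's API or implementation: `GlobalArray`, `BlockCache`, `fork_join_sum`, the block size, and the cutoff are all hypothetical names and values chosen for illustration. The "fork" is executed sequentially, and coherence, write-back, and work stealing are not modeled.

```python
BLOCK = 64  # elements per cache block (hypothetical value)


class GlobalArray:
    """Stands in for remote memory and counts simulated remote fetches."""

    def __init__(self, data):
        self.data = list(data)
        self.remote_gets = 0

    def get(self, start, count):
        # Each call models one remote memory access (e.g. an RDMA get).
        self.remote_gets += 1
        return self.data[start:start + count]


class BlockCache:
    """Read-only software cache shared by all tasks on one process."""

    def __init__(self, backing, block=BLOCK):
        self.backing = backing
        self.block = block
        self.blocks = {}  # block index -> list of cached elements

    def read(self, index):
        b = index // self.block
        if b not in self.blocks:
            # Miss: fetch the whole block with a single remote access.
            self.blocks[b] = self.backing.get(b * self.block, self.block)
        return self.blocks[b][index % self.block]


def fork_join_sum(read, lo, hi, cutoff=8):
    """Divide-and-conquer sum over [lo, hi).

    In a real runtime the two halves would be forked as parallel tasks
    and joined; here they simply run one after the other.
    """
    if hi - lo <= cutoff:
        return sum(read(i) for i in range(lo, hi))
    mid = (lo + hi) // 2
    left = fork_join_sum(read, lo, mid, cutoff)
    right = fork_join_sum(read, mid, hi, cutoff)
    return left + right


def run(n=1024):
    # Without caching: every element read is its own remote access.
    uncached = GlobalArray(range(n))
    total_uncached = fork_join_sum(lambda i: uncached.get(i, 1)[0], 0, n)

    # With caching: leaf tasks that touch the same block share one fetch.
    backing = GlobalArray(range(n))
    cache = BlockCache(backing)
    total_cached = fork_join_sum(cache.read, 0, n)

    return total_uncached, uncached.remote_gets, total_cached, backing.remote_gets
```

In this model, eight leaf tasks of eight elements each fall within one 64-element block, so they share a single fetch instead of issuing 64 separate remote reads. The counts say nothing about real performance, which depends on network latency, task placement, and the coherence protocol.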

Index Terms

1. Computing methodologies
  1. Distributed computing methodologies
  2. Parallel computing methodologies


Published In

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2023

1428 pages

ISBN:9798400701092

DOI:10.1145/3581784

Chair:
Dorian Arnold,
Program Chair:
Rosa M Badia,
Program Co-chair:
Kathryn Mohror

Copyright © 2023 Owner/Author(s).

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC '23

Sponsor:

SIGHPC

SC '23: International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 17, 2023

Denver, CO, USA

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
