DOI: 10.1145/3624062.3624139
Research article

Dragon Proxy Runtimes and Multi-system Workflows

Published: 12 November 2023

Abstract

We present a new method for obtaining proxy access to remote instances of the Dragon distributed runtime. Dragon is a composable distributed runtime for managing dynamic processes, high-performance communication objects, memory, and data at scale that is based on an abstraction of a distributed system. Proxy access to a remote instance of the Dragon runtime makes it possible for the client program to expand and contract its compute resources by extending into a remote instance of Dragon as needed. Compute to be run on a remote Dragon runtime is mediated by a Python object that acts as a proxy, which we call a proxy runtime. These proxy runtimes, combined with the ability to start and tear down remote Dragon runtimes both programmatically and via the command line interface, make a number of challenging workflows simple to program. Such workflows include edge-to-cloud scientific workflows, batch services, and scientific applications based on Python multiprocessing. The ability to program complex workflows on systems that span clusters, scientific instruments, and cloud resources is critical to the development of post-exascale applications, infrastructures and frameworks.
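The proxy idea in the abstract can be illustrated with a minimal sketch. It does not use Dragon's API: `LocalStandInRuntime`, `ProxyRuntime`, `square`, and `run_workflow` are hypothetical names invented for this illustration, and the "remote" runtime is simulated by a local thread pool. It shows only the general pattern of a client program that starts a second runtime, forwards part of its work through a proxy object, and then tears that runtime down.

```python
from concurrent.futures import ThreadPoolExecutor


class LocalStandInRuntime:
    """Hypothetical stand-in for a remote runtime (NOT Dragon's API).

    Work is executed by local threads so the sketch runs anywhere.
    """

    def __init__(self, name, workers=2):
        self.name = name
        self.running = True
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, func, *args):
        # Schedule func(*args) on this runtime and return a future.
        return self._pool.submit(func, *args)

    def shutdown(self):
        # Tear the runtime down, contracting the client's resources.
        self._pool.shutdown(wait=True)
        self.running = False


class ProxyRuntime:
    """Hypothetical proxy object that mediates work sent to a runtime."""

    def __init__(self, runtime):
        self._runtime = runtime

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Do not suppress exceptions raised inside the with-block.
        return False

    def map(self, func, items):
        # Forward each call to the runtime, then collect results in order.
        futures = [self._runtime.submit(func, item) for item in items]
        return [future.result() for future in futures]


def square(x):
    return x * x


def run_workflow():
    # Start a second runtime programmatically (simulated locally here).
    remote = LocalStandInRuntime("remote-cluster")
    try:
        # Part of the work runs in the client itself.
        local_part = [square(x) for x in range(3)]
        # The rest is mediated by the proxy object.
        with ProxyRuntime(remote) as proxy:
            remote_part = proxy.map(square, range(3, 6))
    finally:
        remote.shutdown()
    return local_part + remote_part


if __name__ == "__main__":
    print(run_workflow())
```

In this sketch the client code inside the `with` block does not need to know where the work runs; only the proxy holds a reference to the other runtime. How Dragon itself starts remote runtimes, authenticates, and moves data is described in the paper, not here.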

Supplemental Material

MP4 File
Recording of "Dragon Proxy Runtimes and Multi-system Workflows" presentation at HPPSS 2023.


Published In

SC-W '23: Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis
November 2023
2180 pages
ISBN:9798400707858
DOI:10.1145/3624062
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. HPC
  2. Python
  3. batch service
  4. cloud computing
  5. distributed computing
  6. parallel computing
  7. supercomputing
  8. workflows

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SC-W 2023

