DOI: 10.5555/3291168.3291210

Ray: a distributed framework for emerging AI applications

Published: 08 October 2018

Abstract

The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this paper, we consider these requirements and present Ray, a distributed system to address them. Ray implements a unified interface that can express both task-parallel and actor-based computations, supported by a single dynamic execution engine. To meet the performance requirements, Ray employs a distributed scheduler and a distributed and fault-tolerant store to manage the system's control state. In our experiments, we demonstrate scaling beyond 1.8 million tasks per second and better performance than existing specialized systems for several challenging reinforcement learning applications.
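
As an illustration only, the two computation styles named in the abstract (stateless tasks and stateful actors) can be sketched with Python's standard library. This is not Ray's API: `submit_task`, `Actor`, `Counter`, and `run_demo` are hypothetical names invented for this sketch, and a local thread pool stands in for a cluster of workers.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared pool standing in for a cluster of workers (an assumption of this sketch).
_pool = ThreadPoolExecutor(max_workers=4)


def submit_task(fn, *args):
    """Task-parallel style: run a stateless function and return a future."""
    return _pool.submit(fn, *args)


class Actor:
    """Actor style: wrap a stateful object and run its methods one at a time."""

    def __init__(self, obj):
        self._obj = obj
        # A single-thread executor serializes calls, so the state is never raced.
        self._mailbox = ThreadPoolExecutor(max_workers=1)

    def call(self, method, *args):
        """Queue a method call on the wrapped object and return a future."""
        return self._mailbox.submit(getattr(self._obj, method), *args)


def square(x):
    return x * x


class Counter:
    """Hypothetical stateful object: keeps a running total."""

    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self.total


def run_demo():
    # Stateless tasks fan out across the pool.
    futures = [submit_task(square, i) for i in range(5)]
    squares = [f.result() for f in futures]
    # Task results feed a stateful actor, whose calls execute in submission order.
    counter = Actor(Counter())
    totals = [counter.call("add", s).result() for s in squares]
    return squares, totals
```

Both styles return futures here, which is one plausible reading of what a "unified interface" allows: the output of a task can be passed to an actor method without changing how results are awaited.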

Published In

OSDI'18: Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation
October 2018
815 pages
ISBN: 9781931971478

Sponsors

  • NetApp
  • Google Inc.
  • NSF
  • Microsoft
  • Facebook

Publisher

USENIX Association

United States

Cited By

  • (2025) Automatic Tracing in Task-Based Runtime Systems. Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 84-99. DOI: 10.1145/3669940.3707237. Online publication date: 3-Feb-2025
  • (2024) SplitDF: Splitting Dataframes for Memory-Efficient Data Analysis. Proceedings of the VLDB Endowment 17:9, pp. 2175-2184. DOI: 10.14778/3665844.3665849. Online publication date: 1-May-2024
  • (2024) Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pp. 195-210. DOI: 10.1145/3694715.3695975. Online publication date: 4-Nov-2024
  • (2023) Parallel Q-learning. Proceedings of the 40th International Conference on Machine Learning, pp. 19440-19459. DOI: 10.5555/3618408.3619211. Online publication date: 23-Jul-2023
  • (2023) RALF: Accuracy-Aware Scheduling for Feature Store Maintenance. Proceedings of the VLDB Endowment 17:3, pp. 563-576. DOI: 10.14778/3632093.3632116. Online publication date: 1-Nov-2023
  • (2023) P2D: A Transpiler Framework for Optimizing Data Science Pipelines. Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning, pp. 1-4. DOI: 10.1145/3595360.3595853. Online publication date: 18-Jun-2023
  • (2023) Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-13. DOI: 10.1145/3581784.3607042. Online publication date: 12-Nov-2023
  • (2022) Chukonu. Proceedings of the VLDB Endowment 15:4, pp. 872-885. DOI: 10.14778/3503585.3503596. Online publication date: 14-Apr-2022
  • (2022) Rearchitecting in-memory object stores for low latency. Proceedings of the VLDB Endowment 15:3, pp. 555-568. DOI: 10.14778/3494124.3494138. Online publication date: 4-Feb-2022
  • (2022) Protea. Proceedings of the 1st ACM Workshop on Data Privacy and Federated Learning Technologies for Mobile Edge Network, pp. 1-6. DOI: 10.1145/3556557.3557950. Online publication date: 17-Oct-2022
