research-article
Open access

Scaling Reliably: Improving the Scalability of the Erlang Distributed Actor Platform

Published: 17 August 2017

Abstract

Distributed actor languages are an effective means of constructing scalable reliable systems, and the Erlang programming language has a well-established and influential model. While the Erlang model conceptually provides reliable scalability, it has some inherent scalability limits and these force developers to depart from the model at scale. This article establishes the scalability limits of Erlang systems and reports the work of the EU RELEASE project to improve the scalability and understandability of the Erlang reliable distributed actor model.
We systematically study the scalability limits of Erlang and then address the issues at the virtual machine, language, and tool levels. More specifically: (1) We have evolved the Erlang virtual machine so that it can work effectively in large-scale single-host multicore and NUMA architectures. We have made important changes and architectural improvements to the widely used Erlang/OTP release. (2) We have designed and implemented Scalable Distributed (SD) Erlang libraries to address language-level scalability issues and provided and validated a set of semantics for the new language constructs. (3) To make large Erlang systems easier to deploy, monitor, and debug, we have developed and made open source releases of five complementary tools, some specific to SD Erlang.
Throughout the article we use two case studies to investigate the capabilities of our new technologies and tools: a distributed hash table based Orbit calculation and Ant Colony Optimisation (ACO). Chaos Monkey experiments show that two versions of ACO survive random process failure and hence that SD Erlang preserves the Erlang reliability model. While we report measurements on a range of NUMA and cluster architectures, the key scalability experiments are conducted on the Athos cluster with 256 hosts (6,144 cores). Even for programs with no global recovery data to maintain, SD Erlang partitions the network to reduce network traffic and hence improves performance of the Orbit and ACO benchmarks above 80 hosts. ACO measurements show that maintaining global recovery data dramatically limits scalability; however, scalability is recovered by partitioning the recovery data. We exceed the established scalability limits of distributed Erlang, and do not reach the limits of SD Erlang for these benchmarks at this scale (256 hosts, 6,144 cores).
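The claim that partitioning the network reduces traffic can be illustrated with a rough connection count. The sketch below is not taken from the article: it assumes one Erlang node per host, a fully connected mesh for unpartitioned distributed Erlang, and disjoint groups of a hypothetical size of 16 with the links between groups ignored. The function names are illustrative only.

```python
def full_mesh_links(nodes: int) -> int:
    """Number of connections when every node connects to every other node."""
    return nodes * (nodes - 1) // 2


def partitioned_links(nodes: int, group_size: int) -> int:
    """Number of connections when nodes connect only within disjoint,
    equally sized groups.

    Assumption: links between groups (for example via gateway nodes)
    are ignored, so this is a lower bound for a partitioned network.
    """
    if group_size <= 0 or nodes % group_size != 0:
        raise ValueError("group_size must be positive and divide nodes evenly")
    groups = nodes // group_size
    return groups * full_mesh_links(group_size)


if __name__ == "__main__":
    hosts = 256       # cluster size reported in the abstract
    group_size = 16   # hypothetical group size, chosen only for illustration
    print("full mesh:  ", full_mesh_links(hosts))
    print("partitioned:", partitioned_links(hosts, group_size))
```

Under these assumptions the full mesh needs 32,640 connections for 256 nodes, while sixteen groups of 16 need 1,920 within-group connections. The real saving depends on how groups overlap and how many links between groups an application needs.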

Information & Contributors

Information

Published In

ACM Transactions on Programming Languages and Systems  Volume 39, Issue 4
December 2017
191 pages
ISSN:0164-0925
EISSN:1558-4593
DOI:10.1145/3133234
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 August 2017
Accepted: 01 June 2017
Revised: 01 April 2017
Received: 01 December 2015
Published in TOPLAS Volume 39, Issue 4

Author Tags

  1. Erlang
  2. reliability
  3. scalability

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • UK’s Engineering and Physical Sciences Research Council
  • HPC-GAP: High Performance Computational Algebra and Discrete Mathematics
  • European Union
  • SCIEnce: Symbolic Computing Infrastructure in Europe
  • RELEASE: A High-Level Paradigm for Reliable Large-scale Server Software
