
Litz: elastic framework for high-performance distributed machine learning

Published: 11 July 2018

Abstract

Machine Learning (ML) is an increasingly popular application in the cloud and data-center, inspiring new algorithmic and systems techniques that leverage unique properties of ML applications to improve their distributed performance by orders of magnitude. However, applications built using these techniques tend to be static, unable to elastically adapt to the changing resource availability that is characteristic of multi-tenant environments. Existing distributed frameworks are either inelastic, or offer programming models which are incompatible with the techniques employed by high-performance ML applications.
Motivated by these trends, we present Litz, an elastic framework supporting distributed ML applications. We categorize the wide variety of techniques employed by these applications into three general themes -- stateful workers, model scheduling, and relaxed consistency -- which are collectively supported by Litz's programming model. Our implementation of Litz's execution system transparently enables elasticity and low-overhead execution.
We implement several popular ML applications using Litz, and show that they can scale in and out quickly to adapt to changing resource availability, as well as how a scheduler can leverage elasticity for faster job completion and more efficient resource allocation. Lastly, we show that Litz enables elasticity without compromising performance, achieving competitive performance with state-of-the-art non-elastic ML frameworks.
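
To make the abstract's three themes concrete, below is a deliberately simplified sketch in Python. It is an illustration under assumptions, not Litz's actual API: the names ParameterServer, StatefulWorker, can_advance, and checkpoint are all hypothetical. The sketch shows stateful workers whose local state must migrate with them, plus a bounded-staleness rule in the style of stale synchronous parallel as one form of relaxed consistency; model scheduling (deciding which parameters each worker may update and when) is omitted for brevity.

# Hypothetical sketch only; class and method names are illustrative
# assumptions, not Litz's actual interface.

class ParameterServer:
    """Toy shared model store enforcing a bounded-staleness rule."""

    def __init__(self, staleness_bound: int = 2):
        self.params: dict[str, float] = {}
        self.worker_clocks: dict[int, int] = {}
        self.staleness_bound = staleness_bound

    def register(self, worker_id: int) -> None:
        self.worker_clocks[worker_id] = 0

    def can_advance(self, worker_id: int) -> bool:
        # Relaxed consistency: a worker may run ahead of the slowest
        # worker by at most `staleness_bound` clock ticks.
        slowest = min(self.worker_clocks.values())
        return self.worker_clocks[worker_id] - slowest <= self.staleness_bound

    def pull(self, key: str) -> float:
        return self.params.get(key, 0.0)  # possibly stale read

    def push(self, key: str, delta: float) -> None:
        self.params[key] = self.params.get(key, 0.0) + delta

    def tick(self, worker_id: int) -> None:
        self.worker_clocks[worker_id] += 1


class StatefulWorker:
    """Worker whose local state is part of its identity; an elastic
    runtime must serialize and migrate it when scaling in or out."""

    def __init__(self, worker_id: int, ps: ParameterServer):
        self.worker_id = worker_id
        self.ps = ps
        self.local_state = {"steps": 0}
        ps.register(worker_id)

    def step(self) -> bool:
        if not self.ps.can_advance(self.worker_id):
            return False  # too far ahead; a real runtime would block here
        w = self.ps.pull("w")            # possibly stale parameter value
        grad = 0.1 * w - 1.0             # stand-in for a real gradient
        self.ps.push("w", -0.01 * grad)  # apply update to the shared model
        self.local_state["steps"] += 1
        self.ps.tick(self.worker_id)
        return True

    def checkpoint(self) -> dict:
        # Elasticity hook: the state a runtime would ship before migration.
        return {"id": self.worker_id, "state": dict(self.local_state)}


if __name__ == "__main__":
    ps = ParameterServer(staleness_bound=2)
    workers = [StatefulWorker(i, ps) for i in range(3)]
    for _ in range(10):
        for worker in workers:
            worker.step()
    # On scale-in, the runtime would collect these and reassign the work.
    print([w.checkpoint() for w in workers], ps.pull("w"))

In this single-threaded toy loop all workers advance in lockstep, so can_advance never blocks; the point is the contract it expresses, which is what would let an elastic runtime pause, checkpoint, and migrate workers without violating the application's consistency assumptions.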


Cited By

  • Distributed deep learning on data systems. Proceedings of the VLDB Endowment 14(10), 1769-1782, 26 Oct 2021. DOI: 10.14778/3467861.3467867
  • In-database distributed machine learning. Proceedings of the VLDB Endowment 12(12), 1854-1857, 1 Aug 2019. DOI: 10.14778/3352063.3352083
  • Crossbow. Proceedings of the VLDB Endowment 12(11), 1399-1412, 1 Jul 2019. DOI: 10.14778/3342263.3342276


Information

Published In

USENIX ATC '18: Proceedings of the 2018 USENIX Annual Technical Conference
July 2018
1019 pages
ISBN: 9781931971447

Sponsors

  • VMware
  • NetApp
  • NSF
  • Facebook
  • Oracle

Publisher

USENIX Association

United States


Qualifiers

  • Article

