research-article

Distributed deep learning on data systems: a comparative analysis of approaches

Authors: Frank McQuillan, Nandish Jayaram, Domino Valdano, Arun Kumar

Proceedings of the VLDB Endowment, Volume 14, Issue 10

Pages 1769 - 1782

https://doi.org/10.14778/3467861.3467867

Published: 01 June 2021

Abstract

Deep learning (DL) is growing in popularity for many data analytics applications, including among enterprises. Large business-critical datasets in such settings typically reside in RDBMSs or other data systems. The DB community has long aimed to bring machine learning (ML) to DBMS-resident data. Given past lessons from in-DBMS ML and recent advances in scalable DL systems, DBMS and cloud vendors are increasingly interested in adding more DL support for DB-resident data. Recently, a new parallel execution approach for DL model selection, called Model Hopper Parallelism (MOP), was proposed. In this paper, we characterize the particular suitability of MOP for DL on data systems. We also show that, for bringing MOP-based DL to DB-resident data, there is no single "best" approach; rather, an interesting tradeoff space of approaches exists. We explain four canonical approaches, build prototypes of them on Greenplum Database, compare them analytically on multiple criteria (e.g., runtime efficiency and ease of governance), and compare them empirically on large-scale DL workloads. Our experiments and analyses show that it is non-trivial to meet all practical desiderata well and that there is a Pareto frontier; for instance, some approaches are 3x-6x faster but fare worse on governance and portability. Our results and insights can help DBMS and cloud vendors design better DL support for DB users. All of our source code, data, and other artifacts are available at https://github.com/makemebitter/cerebro-ds.

Index Terms

1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Data management systems


Published In

Proceedings of the VLDB Endowment, Volume 14, Issue 10, June 2021, 219 pages

ISSN: 2150-8097

Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)

Publisher: VLDB Endowment
