
Scalable linear algebra on a relational database system

Published: 22 July 2020

Abstract

As data analytics has become an important application for modern data management systems, a new category of data management system has appeared recently: the scalable linear algebra system. We argue that a parallel or distributed database system is actually an excellent platform upon which to build such functionality. Most relational systems already support cost-based optimization, which is vital to scaling linear algebra computations, and it is well known how to make relational systems scalable.
We show that by making just a few changes to a parallel/distributed relational database system, such a system can become a competitive platform for scalable linear algebra. Taken together, our results should at least raise the possibility that brand new systems designed from the ground up to support scalable linear algebra are not absolutely necessary, and that such systems could instead be built on top of existing relational technology.
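To make the idea of running linear algebra on relational machinery concrete, here is a minimal sketch, not the system described in the article. It assumes a matrix is stored as a table of (row, column, value) triples, which is an illustrative encoding chosen here and not one stated in the abstract. Under that assumption, matrix multiplication becomes a join on the shared index followed by a grouped sum. The table names `A` and `B` and the helper `matmul_sql` are hypothetical, and SQLite stands in for a parallel or distributed engine.

```python
import sqlite3


def matmul_sql(a, b):
    """Multiply two dense matrices (lists of lists) with a SQL join and aggregate.

    Assumption: each matrix is stored as (row, col, val) triples. This layout
    is illustrative only and is not taken from the article.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (i INTEGER, k INTEGER, val REAL)")
    conn.execute("CREATE TABLE B (k INTEGER, j INTEGER, val REAL)")
    conn.executemany(
        "INSERT INTO A VALUES (?, ?, ?)",
        [(i, k, v) for i, row in enumerate(a) for k, v in enumerate(row)],
    )
    conn.executemany(
        "INSERT INTO B VALUES (?, ?, ?)",
        [(k, j, v) for k, row in enumerate(b) for j, v in enumerate(row)],
    )
    # C[i][j] = sum over k of A[i][k] * B[k][j], written as join + GROUP BY.
    rows = conn.execute(
        "SELECT A.i, B.j, SUM(A.val * B.val) "
        "FROM A JOIN B ON A.k = B.k "
        "GROUP BY A.i, B.j"
    ).fetchall()
    conn.close()

    # Rebuild a dense result from the (i, j, value) triples.
    out = [[0.0] * len(b[0]) for _ in range(len(a))]
    for i, j, v in rows:
        out[i][j] = v
    return out


result = matmul_sql([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

Because the multiplication is expressed as an ordinary join and aggregation, the choice of join order and physical plan is left to the database's query optimizer, which is the kind of cost-based optimization the abstract refers to.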


Published In

Communications of the ACM  Volume 63, Issue 8
August 2020
93 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/3411844

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2024) Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems. The VLDB Journal 33:5 (1231-1255). DOI: 10.1007/s00778-024-00845-0. Online publication date: 12-Apr-2024
  • (2024) Givens rotations for QR decomposition, SVD and PCA over database joins. The VLDB Journal: The International Journal on Very Large Data Bases 33:4 (1013-1037). DOI: 10.1007/s00778-023-00818-9. Online publication date: 1-Jul-2024
  • (2023) Optimizing Tensor Programs on Flexible Storage. Proceedings of the ACM on Management of Data 1:1 (1-27). DOI: 10.1145/3588717. Online publication date: 30-May-2023
  • (2023) Pushing ML Predictions Into DBMSs. IEEE Transactions on Knowledge and Data Engineering 35:10 (10295-10308). DOI: 10.1109/TKDE.2023.3269592. Online publication date: 1-Oct-2023
  • (2022) In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle. Proceedings of the 2022 International Conference on Management of Data (1286-1300). DOI: 10.1145/3514221.3526150. Online publication date: 10-Jun-2022
  • (2022) Givens QR Decomposition over Relational Databases. Proceedings of the 2022 International Conference on Management of Data (1948-1961). DOI: 10.1145/3514221.3526144. Online publication date: 10-Jun-2022
  • (2022) End-to-end Optimization of Machine Learning Prediction Queries. Proceedings of the 2022 International Conference on Management of Data (587-601). DOI: 10.1145/3514221.3526141. Online publication date: 10-Jun-2022
  • (2021) Matrix Query Languages. ACM SIGMOD Record 50:3 (6-19). DOI: 10.1145/3503780.3503782. Online publication date: 1-Dec-2021
  • (2021) Expressiveness within Sequence Datalog. Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (70-81). DOI: 10.1145/3452021.3458327. Online publication date: 20-Jun-2021
