DOI: 10.1145/3085504.3085512
Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics

Published: 27 June 2017

Abstract

Big Model analytics tackles the training of massive models that exceed the available memory of a single computing device, e.g., a CPU or GPU. It generalizes Big Data analytics, which targets the training of memory-resident models over out-of-memory training data. In this paper, we propose an in-database solution for Big Model analytics. We identify the dot-product as the primary operation in training generalized linear models and introduce the first array-relation dot-product join database operator between a set of sparse arrays and a dense relation. This is a constrained formulation of the extensively studied sparse matrix-vector multiplication (SpMV) kernel. The paramount challenge in designing the dot-product join operator is how to optimally schedule access to the dense relation based on the non-contiguous entries in the sparse arrays. We propose a practical solution characterized by two technical contributions -- dynamic batch processing and array reordering. We devise three heuristics -- LSH, Radix, and K-center -- for array reordering and analyze them thoroughly. Extensive experiments over synthetic and real data confirm that the operator incurs minimal overhead when sufficient memory is available and degrades gracefully as memory becomes scarce. Moreover, dot-product join achieves an order-of-magnitude reduction in execution time over alternative solutions.
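The core operation the abstract describes can be sketched in a few lines. The sketch below is illustrative only and not the authors' implementation; the names (`dot_product_join`, `model`, `examples`) and the in-memory dict standing in for the dense relation are assumptions made for the example.

```python
# A minimal, hypothetical sketch of the array-relation dot-product.
# Each "sparse array" (e.g., a training example) is stored as
# (indices, values); the "dense relation" holds the model, conceptually
# one (index, weight) row per feature. In the Big Model setting this
# relation exceeds memory, so the operator must schedule its accesses
# to it carefully -- here it is just a dict.

def dot_product_join(sparse_arrays, model):
    """Compute, for every sparse array, the dot product with the model
    restricted to the array's non-zero indices."""
    results = []
    for indices, values in sparse_arrays:
        # The real operator batches these lookups dynamically and
        # reorders the arrays so that accesses to `model` are clustered.
        results.append(sum(v * model[i] for i, v in zip(indices, values)))
    return results

# Toy data: two sparse examples over a 6-dimensional model.
model = {0: 0.5, 1: -1.0, 2: 2.0, 3: 0.0, 4: 1.5, 5: -0.5}
examples = [
    ([0, 2, 4], [1.0, 1.0, 2.0]),  # 0.5 + 2.0 + 3.0 = 5.5
    ([1, 5], [2.0, 2.0]),          # -2.0 - 1.0 = -3.0
]
print(dot_product_join(examples, model))  # [5.5, -3.0]
```

Only the model entries at each array's non-zero indices are touched, which is why the paper's reordering heuristics matter: arrays with overlapping index sets should be processed together so the corresponding pages of the dense relation stay resident.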




      Published In

SSDBM '17: Proceedings of the 29th International Conference on Scientific and Statistical Database Management
June 2017, 373 pages
ISBN: 9781450352826
DOI: 10.1145/3085504

      In-Cooperation

• Northwestern University

      Publisher

Association for Computing Machinery, New York, NY, United States



      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      SSDBM '17

      Acceptance Rates

Overall acceptance rate: 56 of 146 submissions (38%)


Cited By

• TenSQL: An SQL Database Built on GraphBLAS. 2023 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-8, Sep 2023. DOI: 10.1109/HPEC58863.2023.10363601
• Data Management in Machine Learning Systems. Synthesis Lectures on Data Management 14(1), pp. 1-173, Feb 2019. DOI: 10.2200/S00895ED1V01Y201901DTM057
• Stochastic Gradient Descent on Modern Hardware: Multi-core CPU or GPU? Synchronous or Asynchronous? 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1063-1072, May 2019. DOI: 10.1109/IPDPS.2019.00113
• Concept acquisition and improved in-database similarity analysis for medical data. Distributed and Parallel Databases 37(2), pp. 297-321, May 2019. DOI: 10.1007/s10619-018-7249-x
• Regularizing irregularity. Proceedings of the 1st ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), pp. 1-8, Jun 2018. DOI: 10.1145/3210259.3210263
