DOI: 10.1145/3514221.3526150 · Research Article · SIGMOD Conference Proceedings

In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle

Published: 11 June 2022

Abstract

Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., in-DB ML systems and TensorFlow/PyTorch over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access).
In this paper, we first conduct a systematic empirical study of existing data shuffling strategies, which reveals that all of them have room for improvement: they suffer in either I/O performance or convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle. We provide a non-trivial theoretical analysis of CorgiPile's convergence behavior. We further integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD while being 1.6X-12.8X faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
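The abstract describes CorgiPile as a hierarchical (two-level) shuffle: first shuffle the order in which contiguous blocks are read from storage, then shuffle the individual tuples inside an in-memory buffer that holds several blocks before feeding them to SGD. The following is a minimal Python sketch of that idea only; it is not the paper's PostgreSQL implementation, and the names corgipile_shuffle, block_size, and buffer_blocks are illustrative assumptions.

```python
import random
from typing import Iterator, List, Sequence


def corgipile_shuffle(
    dataset: Sequence,      # tuples stored in on-disk (table) order
    block_size: int,        # tuples per contiguous block (the sequential I/O unit)
    buffer_blocks: int,     # number of blocks the in-memory buffer can hold
    seed: int = 0,
) -> Iterator:
    """Sketch of a two-level (block-level, then tuple-level) shuffle."""
    rng = random.Random(seed)

    # Level 1: block-level shuffle. Only block IDs are permuted, so each
    # block can still be read sequentially from secondary storage.
    n_blocks = (len(dataset) + block_size - 1) // block_size
    block_ids = list(range(n_blocks))
    rng.shuffle(block_ids)

    buffer: List = []
    for i, b in enumerate(block_ids):
        # Sequential read of one whole block into the buffer.
        buffer.extend(dataset[b * block_size:(b + 1) * block_size])
        if len(buffer) >= buffer_blocks * block_size or i == n_blocks - 1:
            # Level 2: tuple-level shuffle inside the buffer, then
            # stream the buffered tuples to the SGD consumer.
            rng.shuffle(buffer)
            yield from buffer
            buffer = []


# Hypothetical usage for one SGD epoch over the shuffled stream:
# for example in corgipile_shuffle(rows, block_size=10_000, buffer_blocks=8):
#     model.sgd_step(example)
```

The intended trade-off, as stated in the abstract, is that each block is still a sequential read (so I/O behaves close to a full scan), while the combined block-level and tuple-level randomness keeps SGD's convergence close to that of a fully shuffled pass.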

Supplemental Material

MP4 File
Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. However, SGD requires random data access that is inefficient when implemented in systems that rely on secondary storage such as HDD and SSD, e.g., in-DB ML systems and TensorFlow/PyTorch over large files. In this paper, we propose a simple but novel two-level data shuffling strategy for SGD, namely CorgiPile. CorgiPile avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle. We provide a theoretical analysis of CorgiPile's convergence behavior. We further integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD while being 1.6X-12.8X faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.



Index Terms

  1. In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle

      Recommendations

      Comments

      Information & Contributors


      Published In

      SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
      June 2022
      2597 pages
      ISBN:9781450392495
      DOI:10.1145/3514221


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 11 June 2022


      Author Tags

      1. in-database machine learning
      2. shuffle
      3. stochastic gradient descent

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '22

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%


      Article Metrics

• Downloads (last 12 months): 125
• Downloads (last 6 weeks): 8
      Reflects downloads up to 25 Feb 2025


      Cited By

      • (2025) Powering In-Database Dynamic Model Slicing for Structured Data Analytics. Proceedings of the VLDB Endowment 17(13), 4813-4826. https://doi.org/10.14778/3704965.3704985. Online publication date: 18-Feb-2025.
      • (2024) Database Native Model Selection: Harnessing Deep Neural Networks in Database Systems. Proceedings of the VLDB Endowment 17(5), 1020-1033. https://doi.org/10.14778/3641204.3641212. Online publication date: 2-May-2024.
      • (2024) A Selective Preprocessing Offloading Framework for Reducing Data Traffic in DL Training. Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, 63-70. https://doi.org/10.1145/3655038.3665947. Online publication date: 8-Jul-2024.
      • (2024) GaussML: An End-to-End In-Database Machine Learning System. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 5198-5210. https://doi.org/10.1109/ICDE60146.2024.00391. Online publication date: 13-May-2024.
      • (2024) In-database query optimization on SQL with ML predicates. The VLDB Journal 34(1). https://doi.org/10.1007/s00778-024-00888-3. Online publication date: 23-Dec-2024.
      • (2024) Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems. The VLDB Journal 33(5), 1231-1255. https://doi.org/10.1007/s00778-024-00845-0. Online publication date: 12-Apr-2024.
      • (2023) Auto-differentiation of relational computations for very large scale machine learning. Proceedings of the 40th International Conference on Machine Learning, 33581-33598. https://doi.org/10.5555/3618408.3619806. Online publication date: 23-Jul-2023.
      • (2022) Towards communication-efficient vertical federated learning training via cache-enabled local updates. Proceedings of the VLDB Endowment 15(10), 2111-2120. https://doi.org/10.14778/3547305.3547316. Online publication date: 1-Jun-2022.
