DOI: 10.1145/3514221.3526150 · Research Article · SIGMOD Conference Proceedings

In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle

Published: 11 June 2022

Abstract

Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., in-DB ML systems and TensorFlow/PyTorch over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access).
In this paper, we first conduct a systematic empirical study of existing data shuffling strategies, which reveals that all of them have room for improvement: they suffer in either I/O performance or convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle. We provide a non-trivial theoretical analysis of CorgiPile's convergence behavior. We further integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD while being 1.6X-12.8X faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
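The abstract describes CorgiPile as a hierarchical (two-level) shuffle: first shuffle the order in which contiguous blocks are read from storage, then shuffle the individual tuples inside an in-memory buffer that holds several blocks before feeding them to SGD. The following is a minimal Python sketch of that idea only; it is not the paper's PostgreSQL implementation, and the names corgipile_shuffle, block_size, and buffer_blocks are illustrative assumptions.

```python
import random
from typing import Iterator, List, Sequence


def corgipile_shuffle(
    dataset: Sequence,      # tuples stored in on-disk (table) order
    block_size: int,        # tuples per contiguous block (the sequential I/O unit)
    buffer_blocks: int,     # number of blocks the in-memory buffer can hold
    seed: int = 0,
) -> Iterator:
    """Sketch of a two-level (block-level, then tuple-level) shuffle."""
    rng = random.Random(seed)

    # Level 1: block-level shuffle. Only block IDs are permuted, so each
    # block can still be read sequentially from secondary storage.
    n_blocks = (len(dataset) + block_size - 1) // block_size
    block_ids = list(range(n_blocks))
    rng.shuffle(block_ids)

    buffer: List = []
    for i, b in enumerate(block_ids):
        # Sequential read of one whole block into the buffer.
        buffer.extend(dataset[b * block_size:(b + 1) * block_size])
        if len(buffer) >= buffer_blocks * block_size or i == n_blocks - 1:
            # Level 2: tuple-level shuffle inside the buffer, then
            # stream the buffered tuples to the SGD consumer.
            rng.shuffle(buffer)
            yield from buffer
            buffer = []


# Hypothetical usage for one SGD epoch over the shuffled stream:
# for example in corgipile_shuffle(rows, block_size=10_000, buffer_blocks=8):
#     model.sgd_step(example)
```

The intended trade-off, as stated in the abstract, is that each block is still a sequential read (so I/O behaves close to a full scan), while the combined block-level and tuple-level randomness keeps SGD's convergence close to that of a fully shuffled pass.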

Supplemental Material

MP4 File
Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. However, SGD requires random data access that is inefficient when implemented in systems that rely on secondary storage such as HDD and SSD, e.g., in-DB ML systems and TensorFlow/PyTorch over large files. In this paper, we propose a simple but novel two-level data shuffling strategy for SGD, namely CorgiPile. CorgiPile avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle. We provide a theoretical analysis of CorgiPile's convergence behavior. We further integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD while being 1.6X-12.8X faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.



Index Terms

  1. In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle

      Recommendations

      Comments

      Information & Contributors


      Published In

      SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
      June 2022
      2597 pages
      ISBN:9781450392495
      DOI:10.1145/3514221


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 11 June 2022


      Author Tags

      1. in-database machine learning
      2. shuffle
      3. stochastic gradient descent

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '22

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%


      Article Metrics

• Downloads (last 12 months): 125
• Downloads (last 6 weeks): 8
      Reflects downloads up to 25 Feb 2025


      Cited By

      • (2025) Powering In-Database Dynamic Model Slicing for Structured Data Analytics. Proceedings of the VLDB Endowment 17(13), 4813-4826. https://doi.org/10.14778/3704965.3704985. Online publication date: 18-Feb-2025.
      • (2024) Database Native Model Selection: Harnessing Deep Neural Networks in Database Systems. Proceedings of the VLDB Endowment 17(5), 1020-1033. https://doi.org/10.14778/3641204.3641212. Online publication date: 2-May-2024.
      • (2024) A Selective Preprocessing Offloading Framework for Reducing Data Traffic in DL Training. Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, 63-70. https://doi.org/10.1145/3655038.3665947. Online publication date: 8-Jul-2024.
      • (2024) GaussML: An End-to-End In-Database Machine Learning System. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 5198-5210. https://doi.org/10.1109/ICDE60146.2024.00391. Online publication date: 13-May-2024.
      • (2024) In-database query optimization on SQL with ML predicates. The VLDB Journal 34(1). https://doi.org/10.1007/s00778-024-00888-3. Online publication date: 23-Dec-2024.
      • (2024) Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems. The VLDB Journal 33(5), 1231-1255. https://doi.org/10.1007/s00778-024-00845-0. Online publication date: 12-Apr-2024.
      • (2023) Auto-differentiation of relational computations for very large scale machine learning. Proceedings of the 40th International Conference on Machine Learning, 33581-33598. https://doi.org/10.5555/3618408.3619806. Online publication date: 23-Jul-2023.
      • (2022) Towards communication-efficient vertical federated learning training via cache-enabled local updates. Proceedings of the VLDB Endowment 15(10), 2111-2120. https://doi.org/10.14778/3547305.3547316. Online publication date: 1-Jun-2022.
