DOI:10.1145/3299869.3314050
research-article
Open access

Data Platform for Machine Learning

Published: 25 June 2019

Abstract

In this paper, we present MLdp, a purpose-built data management system for all machine learning (ML) datasets. ML applications pose unique requirements that differ from those of conventional data processing applications, including but not limited to: data lineage and provenance tracking, rich data semantics and formats, integration with diverse ML frameworks and access patterns, trial-and-error driven data exploration and evolution, rapid experimentation, reproducibility of model training, and strict compliance and privacy regulations. Current ML systems and services, often named MLaaS, have to date focused on ML algorithms and offer no integrated data management system. Instead, they require users to bring their own data and to manage it on either blob storage or file systems. The burden of data management tasks, such as versioning and access control, falls onto the users, and not all compliance features, such as terms of use, privacy measures, and auditing, are available. MLdp offers a minimalist and flexible data model for all varieties of data, strong version management to guarantee reproducibility of ML experiments, and integration with major ML frameworks. MLdp also maintains data provenance to help users track lineage and dependencies among data versions and models in their ML pipelines. In addition to table-stakes features, such as security, availability, and scalability, MLdp's internal design choices are strongly influenced by the goal of supporting rapid ML experiment iterations, which cycle through data discovery, data exploration, feature engineering, model training, model evaluation, and back to data discovery.
The contributions of this paper are: 1) to recognize the needs and call out the requirements of an ML data platform, 2) to share our experiences in building MLdp by adapting existing database technologies to the new problem as well as by devising new solutions, and 3) to call for action from our communities on future challenges.
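
The abstract names two mechanisms, strong version management and provenance tracking, without showing what they might look like. The following is a minimal illustrative sketch in Python, not MLdp's actual data model or API: the names `DatasetVersion`, `VersionStore`, `commit`, and `lineage` are hypothetical. It assumes that each commit produces an immutable, numbered snapshot and that each snapshot records the (name, version) pairs it was derived from, so that a model can be traced back to the exact data versions behind it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, int]  # (dataset name, version number)


@dataclass(frozen=True)
class DatasetVersion:
    """Immutable snapshot of a dataset (hypothetical, not MLdp's model)."""
    name: str
    version: int
    records: Tuple[str, ...]
    parents: Tuple[Key, ...] = ()  # versions this snapshot was derived from


class VersionStore:
    """Toy in-memory store: snapshots are append-only and never mutated."""

    def __init__(self) -> None:
        self._versions: Dict[Key, DatasetVersion] = {}

    def commit(self, name: str, records: Iterable[str],
               parents: Iterable[Key] = ()) -> DatasetVersion:
        # The next version number is one past the highest existing one.
        latest = max((v for (n, v) in self._versions if n == name), default=0)
        snapshot = DatasetVersion(name, latest + 1, tuple(records), tuple(parents))
        self._versions[(name, snapshot.version)] = snapshot
        return snapshot

    def get(self, name: str, version: int) -> DatasetVersion:
        return self._versions[(name, version)]

    def lineage(self, name: str, version: int) -> List[Key]:
        """Return all ancestors of a version, nearest first (breadth-first)."""
        seen: List[Key] = []
        queue = list(self.get(name, version).parents)
        while queue:
            key = queue.pop(0)
            if key not in seen:
                seen.append(key)
                queue.extend(self.get(*key).parents)
        return seen


store = VersionStore()
store.commit("raw_images", ["a.jpg", "b.jpg"])
store.commit("raw_images", ["a.jpg", "b.jpg", "c.jpg"])  # becomes version 2
store.commit("features", ["fa", "fb"], parents=[("raw_images", 1)])
store.commit("model", ["weights"], parents=[("features", 1)])
print(store.lineage("model", 1))
```

Because earlier snapshots are never overwritten, a training run pinned to `("raw_images", 1)` still reads the same records after version 2 is committed, which is the property reproducibility depends on in this sketch.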

Published In

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
June 2019
2106 pages
ISBN:9781450356435
DOI:10.1145/3299869
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data platform
  2. data streaming access
  3. data version control
  4. dataset management for machine learning
  5. physical data layout

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '19
SIGMOD/PODS '19: International Conference on Management of Data
June 30 - July 5, 2019
Amsterdam, Netherlands

Acceptance Rates

SIGMOD '19 Paper Acceptance Rate 88 of 430 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%
