Research article | Open access | DOI: 10.1145/3620678.3624666

tf.data service: A Case for Disaggregating ML Input Data Processing

Published: 31 October 2023

Abstract

Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on it. The host CPU and RAM required per accelerator core to process input data fast enough to avoid data stalls vary across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to under-utilizing either the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system.
We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can horizontally scale out to right-size the host CPU/RAM resources for data processing in each job, reducing training time by 32× and cost by 26×, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers caused by differing input sizes in distributed training, reducing training time by 2.2×, on average. Our design is informed by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.
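
To make this concrete, here is a minimal sketch of offloading an input pipeline to the service using the public tf.data.experimental.service Python API. For illustration it runs the dispatcher and a single worker in-process; in a real deployment these run on CPU hosts separate from the accelerator hosts, and it is the worker pool that scales out horizontally. The toy pipeline and the job name "shared_preprocessing" are hypothetical stand-ins for a real preprocessing job.

```python
import tensorflow as tf

# In-process dispatcher and worker for illustration only; production
# deployments run these as separate, independently scaled CPU servers.
dispatcher = tf.data.experimental.service.DispatchServer()
worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(
        dispatcher_address=dispatcher.target.split("://")[1]))

# A toy input pipeline; the map stands in for real preprocessing work.
dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: x * 2)

# Offload the pipeline: elements are now produced by service workers
# rather than on the training host. Clients that pass the same job_name
# share one preprocessing job instead of recomputing its output.
dataset = dataset.apply(
    tf.data.experimental.service.distribute(
        processing_mode="parallel_epochs",
        service=dispatcher.target,
        job_name="shared_preprocessing"))

for element in dataset.take(5):
    print(element.numpy())
```

Coordinated reads are exposed through the same distribute call: passing num_consumers together with a per-client consumer_index asks the service to hand out data to the clients in lockstep, which is what prevents the stragglers described above.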




Published In

SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing
October 2023
624 pages
ISBN:9798400703874
DOI:10.1145/3620678
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 October 2023


Author Tags

  1. Data Processing
  2. Distributed Systems
  3. Machine Learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '23: ACM Symposium on Cloud Computing
October 30 - November 1, 2023
Santa Cruz, CA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Article Metrics

  • Downloads (Last 12 months)579
  • Downloads (Last 6 weeks)49
Reflects downloads up to 11 Jan 2025

Cited By
  • (2024) Pecan. In Proceedings of the 2024 USENIX Annual Technical Conference, 649-665. https://doi.org/10.5555/3691992.3692032. Online publication date: 10-Jul-2024.
  • (2024) A Selective Preprocessing Offloading Framework for Reducing Data Traffic in DL Training. In Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, 63-70. https://doi.org/10.1145/3655038.3665947. Online publication date: 8-Jul-2024.
  • (2024) PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 340-353. https://doi.org/10.1109/ISCA59077.2024.00033. Online publication date: 29-Jun-2024.
  • (2023) FPGA-Accelerated Data Preprocessing for Personalized Recommendation Systems. IEEE Computer Architecture Letters 23(1), 7-10. https://doi.org/10.1109/LCA.2023.3336841. Online publication date: 28-Nov-2023.
