DOI: 10.1145/3404397.3404472 (research article)

DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training

Published: 17 August 2020
Abstract

We observe three problems in existing storage and caching systems for deep-learning training (DLT) tasks: (1) accessing a dataset containing a large number of small files takes a long time, (2) global in-memory caching systems are vulnerable to node failures and slow to recover, and (3) repeatedly reading a dataset of files in shuffled orders is inefficient when the dataset is too large to be cached in memory. We therefore propose DIESEL, a dataset-based distributed storage and caching system for DLT tasks, built as a storage–cache co-design. Firstly, since accessing small files is a metadata-intensive operation, DIESEL decouples metadata processing from metadata storage and introduces a metadata snapshot mechanism for each dataset, which speeds up metadata access significantly. Secondly, DIESEL deploys a task-grained distributed cache across the worker nodes of a DLT task, so node failures are contained within a single DLT task. Furthermore, files are grouped into large chunks in storage, which greatly reduces the recovery time of the caching system. Thirdly, DIESEL provides a chunk-based shuffle, improving the performance of random file access without sacrificing training accuracy. Our experiments show that DIESEL achieves a linear speedup on metadata access and outperforms an existing distributed caching system in both file caching and file reading. In real DLT tasks, DIESEL halves the data access time of an existing storage system and reduces training time by hours without changing any training code.
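The chunk-based shuffle described in the abstract can be illustrated with a two-level scheme: randomize the order in which stored chunks are read, then randomize the file order within each chunk, so that reads stay sequential at chunk granularity while the training framework still sees a randomized sample order. This is a minimal sketch, not DIESEL's actual implementation; the function name, `chunk_size` parameter, and two-level structure are illustrative assumptions.

```python
import random

def chunk_based_shuffle(file_ids, chunk_size, seed=None):
    """Two-level shuffle: shuffle the chunk order, then shuffle files
    within each chunk, keeping each chunk's files grouped together so
    storage reads remain large and sequential."""
    rng = random.Random(seed)
    # Group consecutive file ids into fixed-size chunks, mirroring how
    # small files are packed into large chunks on disk.
    chunks = [file_ids[i:i + chunk_size]
              for i in range(0, len(file_ids), chunk_size)]
    rng.shuffle(chunks)          # random chunk order across the dataset
    for chunk in chunks:
        rng.shuffle(chunk)       # random file order within each chunk
    return [f for chunk in chunks for f in chunk]

# Example: a 10-file dataset packed into chunks of 4 files.
order = chunk_based_shuffle(list(range(10)), chunk_size=4, seed=0)
```

Each epoch can use a different seed to produce a fresh permutation, while every chunk is still fetched with one large read instead of many small random ones.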



    Published In

    ICPP '20: Proceedings of the 49th International Conference on Parallel Processing
    August 2020
    844 pages
ISBN: 9781450388160
DOI: 10.1145/3404397

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. dataset management
    2. dataset shuffling
    3. deep learning
    4. distributed cache
    5. storage system

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICPP '20

    Acceptance Rates

    Overall Acceptance Rate 91 of 313 submissions, 29%


Cited By

• (2024) Towards High-Performance Data Loading in Cloud-Native Deep Learning Systems. 2024 16th International Conference on COMmunication Systems & NETworkS (COMSNETS), 361–369. DOI: 10.1109/COMSNETS59351.2024.10427257. Online publication date: 3-Jan-2024.
• (2023) SHADE. Proceedings of the 21st USENIX Conference on File and Storage Technologies, 135–151. DOI: 10.5555/3585938.3585947. Online publication date: 21-Feb-2023.
• (2023) I/O Access Patterns in HPC Applications: A 360-Degree Survey. ACM Computing Surveys 56, 2, 1–41. DOI: 10.1145/3611007. Online publication date: 15-Sep-2023.
• (2023) Liquid. Proceedings of the 14th ACM SIGOPS Asia-Pacific Workshop on Systems, 50–57. DOI: 10.1145/3609510.3609811. Online publication date: 24-Aug-2023.
• (2023) SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters. Proceedings of the Eighteenth European Conference on Computer Systems, 883–898. DOI: 10.1145/3552326.3567499. Online publication date: 8-May-2023.
• (2023) High-Level Data Abstraction and Elastic Data Caching for Data-Intensive AI Applications on Cloud-Native Platforms. IEEE Transactions on Parallel and Distributed Systems 34, 11, 2946–2964. DOI: 10.1109/TPDS.2023.3314659. Online publication date: Nov-2023.
• (2023) Dataset Placement and Data Loading Optimizations for Cloud-Native Deep Learning Workloads. 2023 IEEE 26th International Symposium on Real-Time Distributed Computing (ISORC), 107–116. DOI: 10.1109/ISORC58943.2023.00023. Online publication date: May-2023.
• (2023) Towards Optimizing Storage Costs on the Cloud. 2023 IEEE 39th International Conference on Data Engineering (ICDE), 2919–2932. DOI: 10.1109/ICDE55515.2023.00223. Online publication date: Apr-2023.
• (2023) The Art of Losing to Win: Using Lossy Image Compression to Improve Data Loading in Deep Learning Pipelines. 2023 IEEE 39th International Conference on Data Engineering (ICDE), 936–949. DOI: 10.1109/ICDE55515.2023.00077. Online publication date: Apr-2023.
• (2023) iCache: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 220–232. DOI: 10.1109/HPCA56546.2023.10070964. Online publication date: Feb-2023.
