DOI: 10.1145/3211890.3211910

Keeping deep learning GPUs well fed using object storage

Published: 04 June 2018

Abstract

In recent years, machine learning and deep learning techniques such as deep neural networks and recurrent neural networks have found uses in diverse fields including computer vision, speech recognition, natural language processing, social network analysis, bioinformatics, and medicine, where they have produced results comparable to, and in some cases surpassing, those of human experts. Machine learning requires large amounts of data to train its models, and much of this data resides in object storage, an inexpensive and scalable data store. Deep learning also makes use of state-of-the-art processing capabilities from high-end GPUs and accelerators, such as Google Tensor Processing Units (TPUs), which enable parallel and efficient execution. The throughput that such GPUs can support is very high.
This, however, constitutes an impedance mismatch: object storage is not designed for high-performance data transfers, and standard practices for feeding deep learning models from object storage can result in poor training performance. Furthermore, the typical deep learning framework uses a file access interface, while object storage supports a REST-based interface with different APIs and semantics than a file system [2]. To take full advantage of these GPUs and operate at full utilization, frameworks such as TensorFlow, Caffe, and Torch need to deliver data as fast as possible to keep the GPUs busy. This becomes a significant challenge when the training data does not reside on the same machine as the GPUs, as is the case when using object storage, resulting in a utilization challenge for the expensive processing units.
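As a hypothetical illustration of this interface gap (the bucket, key, and endpoint names below are invented for the example), the sketch contrasts the POSIX-style read a framework's input pipeline issues with the REST-style GET that an S3-compatible object store expects:

```python
import boto3  # S3-compatible SDK; any REST client would illustrate the same point

# What a deep learning input pipeline typically performs: POSIX file I/O.
with open("/data/train/part-000.tfrecord", "rb") as f:
    record_bytes = f.read()

# What the object store actually speaks: a REST GET on a bucket/key pair.
s3 = boto3.client("s3", endpoint_url="https://s3.example.cloud")  # hypothetical endpoint
obj = s3.get_object(Bucket="training-data", Key="train/part-000.tfrecord")
record_bytes = obj["Body"].read()
```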
To resolve the impedance mismatch and keep the processing units fully utilized, we have added a FUSE-based file system, S3fs [1], to our deep learning stack. S3fs translates POSIX file API requests into REST API calls against the object storage. It is an open source project which, as part of this work, we optimized with new read logic that translates each read request into multiple concurrent range-read requests against the object storage. This yields higher throughput from the object storage than is possible with the naive approach. Reads are cached in memory and served back to the deep learning framework asynchronously. Since deep learning frameworks often train over multiple epochs, the in-memory cache is highly beneficial.
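The following is a minimal sketch of the concurrent range-read idea in Python, not the S3fs C++ implementation itself; the part size, worker count, and endpoint are illustrative assumptions:

```python
import concurrent.futures as cf
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example.cloud")  # hypothetical endpoint
PART = 8 * 1024 * 1024  # 8 MiB per range request; a tunable, not an S3fs default

def fetch_range(bucket, key, start, end):
    # HTTP Range headers use inclusive byte offsets.
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return start, resp["Body"].read()

def parallel_read(bucket, key, size, workers=8):
    """Split one logical read into concurrent range GETs and reassemble in memory."""
    ranges = [(off, min(off + PART, size) - 1) for off in range(0, size, PART)]
    buf = bytearray(size)
    with cf.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_range, bucket, key, s, e) for s, e in ranges]
        for fut in cf.as_completed(futures):
            start, data = fut.result()
            buf[start:start + len(data)] = data
    return bytes(buf)  # held in memory, so later epochs can be served from cache
```

Issuing several smaller GETs in parallel hides per-request latency and aggregates bandwidth across connections, which is where the throughput gain over a single sequential GET comes from.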
Our FUSE-based architecture has been implemented in the Deep Learning as a Service offering on the IBM Cloud, and our S3fs enhancements have been contributed to the S3fs project repository. Using our architecture we are able to speed up deep learning training many times over and keep expensive GPUs fully utilized.
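To sketch how the pieces fit together end to end (the mount options, bucket name, and paths below are illustrative; consult the S3fs documentation for exact flags), a training job simply reads from the mountpoint as if it were a local directory:

```python
# Bucket mounted beforehand with s3fs-fuse, e.g. (illustrative invocation):
#   s3fs training-data /mnt/cos -o url=https://s3.example.cloud -o passwd_file=~/.passwd-s3fs
import tensorflow as tf

# The framework sees an ordinary directory tree; S3fs turns these file reads
# into concurrent range GETs against the object store underneath.
files = tf.data.Dataset.list_files("/mnt/cos/train/*.tfrecord")
dataset = (tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))
```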

References

[1] S3fs git repository. 2018. https://github.com/s3fs-fuse/s3fs-fuse
[2] Gil Vernik, Michael Factor, Elliot Kolodner, Pietro Michiardi, Effi Ofer, and Francesco Pace. 2018. Stocator: Providing High Performance and Fault Tolerance for Apache Spark over Object Storage. In 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid).

Cited By

  • (2019) The case for dual-access file systems over object storage. Proceedings of the 11th USENIX Conference on Hot Topics in Storage and File Systems. DOI: 10.5555/3357062.3357080. Online publication date: 8-Jul-2019.
  • (2019) Agni. Proceedings of the ACM Symposium on Cloud Computing, 390-402. DOI: 10.1145/3357223.3362703. Online publication date: 20-Nov-2019.


Published In

SYSTOR '18: Proceedings of the 11th ACM International Systems and Storage Conference
June 2018
144 pages
ISBN:9781450358491
DOI:10.1145/3211890


Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

SYSTOR '18

Acceptance Rates

Overall acceptance rate: 108 of 323 submissions, 33%


