Ceph: a scalable, high-performance distributed file system

Published: 06 November 2006

Abstract

We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
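The idea of replacing allocation tables with a computed placement function can be illustrated with a small sketch. The code below is not CRUSH: it uses rendezvous (highest-random-weight) hashing as a simpler stand-in, and it ignores device weights and failure domains. The names `place` and `_score`, and the `osd.N` device labels, are hypothetical and chosen only for this illustration.

```python
import hashlib


def _score(object_id: str, osd: str) -> int:
    """Pseudo-random but deterministic score for an (object, device) pair."""
    digest = hashlib.sha256(f"{object_id}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_id: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Return the devices that should hold an object's replicas.

    Any client that knows the device list computes the same answer,
    so no per-object allocation table has to be stored or looked up.
    """
    if replicas > len(osds):
        raise ValueError("not enough OSDs for the requested replica count")
    # Rank every device by its score for this object; the top ones win.
    ranked = sorted(osds, key=lambda osd: _score(object_id, osd), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    cluster = [f"osd.{i}" for i in range(8)]  # hypothetical device names
    before = place("inode-42.chunk-0", cluster)
    # Simulate the failure of the primary replica's device.
    after = place("inode-42.chunk-0", [o for o in cluster if o != before[0]])
    print("before failure:", before)
    print("after failure: ", after)
```

In this sketch, removing a device only changes the placement of objects that were stored on it; the surviving replicas keep their relative order, which limits data movement when the cluster changes.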

Published In

OSDI '06: Proceedings of the 7th symposium on Operating systems design and implementation
November 2006
407 pages
ISBN: 1931971471

Publisher

USENIX Association

United States

Cited By

  • (2024) RFUSE. Proceedings of the 22nd USENIX Conference on File and Storage Technologies, pages 141-158. DOI: 10.5555/3650697.3650706. Online publication date: 27-Feb-2024.
  • (2024) I/O Passthru. Proceedings of the 22nd USENIX Conference on File and Storage Technologies, pages 107-122. DOI: 10.5555/3650697.3650704. Online publication date: 27-Feb-2024.
  • (2024) Specializing Data Access in a Distributed File System (Generative Pearl). Proceedings of the 23rd ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, pages 44-52. DOI: 10.1145/3689484.3690736. Online publication date: 21-Oct-2024.
  • (2024) zQoS: Unleashing full performance capabilities of NVMe SSDs while enforcing SLOs in distributed storage systems. Proceedings of the 53rd International Conference on Parallel Processing, pages 618-628. DOI: 10.1145/3673038.3673156. Online publication date: 12-Aug-2024.
  • (2024) DPC: DPU-accelerated High-Performance File System Client. Proceedings of the 53rd International Conference on Parallel Processing, pages 63-72. DOI: 10.1145/3673038.3673123. Online publication date: 12-Aug-2024.
  • (2024) Inductive Diagrams for Causal Reasoning. Proceedings of the ACM on Programming Languages, 8(OOPSLA1):529-554. DOI: 10.1145/3649830. Online publication date: 29-Apr-2024.
  • (2024) ScaleDFS: Accelerating Decentralized and Private File Sharing via Scaling Directed Acyclic Graph Processing. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pages 295-308. DOI: 10.1145/3625549.3658690. Online publication date: 3-Jun-2024.
  • (2024) Block-level Image Service for the Cloud. ACM Transactions on Storage, 20(1):1-28. DOI: 10.1145/3620672. Online publication date: 30-Jan-2024.
  • (2024) EXO: Accelerating Storage Paravirtualization with eBPF. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pages 1-15. DOI: 10.1109/SC41406.2024.00112. Online publication date: 17-Nov-2024.
  • (2024) Tango: A Cross-layer Approach to Managing I/O Interference over Local Ephemeral Storage. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pages 1-15. DOI: 10.1109/SC41406.2024.00020. Online publication date: 17-Nov-2024.
