research-article

CORFU: A distributed shared log

Authors: Mahesh Balakrishnan, Vijayan Prabhakaran, Ted Wobber

ACM Transactions on Computer Systems (TOCS), Volume 31, Issue 4

Article No.: 10, Pages 1 - 24

https://doi.org/10.1145/2535930

Published: 20 December 2013

Abstract

CORFU is a global log that clients can append to and read from over a network. Internally, CORFU is distributed over a cluster of machines in such a way that there is no single I/O bottleneck to either appends or reads. Data is fully replicated for fault tolerance, and a modest cluster of about 16 to 32 machines with SSD drives can sustain 1 million 4-KByte operations per second.

The CORFU log enabled the construction of a variety of distributed applications that require strong consistency at high speeds, such as databases, transactional key-value stores, replicated state machines, and metadata services.
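
As a rough illustration of the append/read interface described above, the following is a minimal in-memory sketch, not the system's actual design. The abstract says only that the log is spread over a cluster with no single I/O bottleneck and that data is fully replicated; the round-robin mapping of log positions to replica sets, the local tail counter, and every name here (`StorageUnit`, `ToySharedLog`, `replicas_per_set`) are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StorageUnit:
    """Hypothetical stand-in for one machine's write-once storage."""

    pages: dict[int, bytes] = field(default_factory=dict)

    def write(self, position: int, data: bytes) -> None:
        # Each log position may be written at most once.
        if position in self.pages:
            raise ValueError(f"position {position} already written")
        self.pages[position] = data

    def read(self, position: int) -> bytes:
        return self.pages[position]


class ToySharedLog:
    """Toy shared log: positions are striped round-robin over replica sets.

    Assumption: striping plus per-set replication is one simple way to avoid
    a single I/O bottleneck; the abstract does not specify the mapping.
    """

    def __init__(self, num_replica_sets: int, replicas_per_set: int = 2) -> None:
        self.replica_sets = [
            [StorageUnit() for _ in range(replicas_per_set)]
            for _ in range(num_replica_sets)
        ]
        self.tail = 0  # next unused log position (a plain local counter here)

    def _replicas_for(self, position: int) -> list[StorageUnit]:
        # Consecutive positions land on different replica sets.
        return self.replica_sets[position % len(self.replica_sets)]

    def append(self, data: bytes) -> int:
        position = self.tail
        self.tail += 1
        # Write the entry to every replica in the chosen set.
        for unit in self._replicas_for(position):
            unit.write(position, data)
        return position

    def read(self, position: int) -> bytes:
        # Any replica holds the entry; this sketch reads the first one.
        return self._replicas_for(position)[0].read(position)


log = ToySharedLog(num_replica_sets=4)
positions = [log.append(f"entry-{i}".encode()) for i in range(8)]
print(positions, log.read(5))
```

In this sketch, eight appends over four replica sets place two entries on each set, so neither appends nor reads concentrate on one machine. A real deployment would also need a fault-tolerant way to hand out log positions and to handle failed units, which the sketch leaves out.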

Index Terms

CORFU: A distributed shared log
1. Information systems
  1. Information retrieval
    1. Search engine architectures and scalability
      1. Distributed retrieval
      2. Peer-to-peer retrieval
  2. Information storage systems
    1. Storage architectures
      1. Distributed storage
2. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles
        1. Organizing principles for web applications

Published In

ACM Transactions on Computer Systems Volume 31, Issue 4

December 2013

90 pages

ISSN: 0734-2071

EISSN: 1557-7333

DOI: 10.1145/2542150

Editor: Todd C. Mowry

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 December 2013

Accepted: 01 March 2013

Received: 01 December 2012

Published in TOCS Volume 31, Issue 4

Qualifiers

Research-article
Research
Refereed

Article Metrics

Total Citations: 62

Total Downloads: 1,741

Downloads (Last 12 months): 60

Downloads (Last 6 weeks): 19

Reflects downloads up to 04 Oct 2024

Cited By

Li Y, Zhu Y, Shi C, Zhang G, Wang J, Zhang X (2024). Timestamp as a Service, Not an Oracle. Proceedings of the VLDB Endowment 17:5 (994-1006). Online publication date: 1-Jan-2024.
https://dl.acm.org/doi/10.14778/3641204.3641210

Che Y, Cheng D, Wang X, Wang R (2024). Opca: Enabling Optimistic Concurrent Access for Multiple Users in Oblivious Data Storage. IEEE Transactions on Parallel and Distributed Systems 35:11 (1891-1903). Online publication date: 1-Nov-2024.
https://dl.acm.org/doi/10.1109/TPDS.2024.3441623

Kim T, Chung W (2023). Collaborative Social Metric Learning in Trust Network for Recommender Systems. International Journal on Semantic Web & Information Systems 19:1 (1-15). Online publication date: 20-Jan-2023.
https://dl.acm.org/doi/10.4018/IJSWIS.316535

Coelho F, Alonso A, Ferreira L, Pereira J, Oliveira R (2023). Loom: A Closed-Box Disaggregated Database System. Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing (30-39). Online publication date: 16-Oct-2023.
https://dl.acm.org/doi/10.1145/3615366.3615424

Scharf J, Xavier L, Mendizabal O (2023). Joining Parallel and Partitioned State Machine Replication Models for Enhanced Shared Logging Performance. Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing (90-99). Online publication date: 16-Oct-2023.
https://dl.acm.org/doi/10.1145/3615366.3615422

Liu Z, Grunwald D, Izraelevitz J, Wang G, Ha S (2023). MRTOM: Mostly Reliable Totally Ordered Multicast, a Network Primitive to Offload Distributed Systems. 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS) (638-648). Online publication date: Jul-2023.
https://doi.org/10.1109/ICDCS57875.2023.00022

Chen Q, Ma S, Chen K, Ma T, Liu X, Chen D, Wu Y, Chen Z, Wolf F, Shende S, Culhane C, Alam S, Jagode H (2022). SeqDLM. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (1-14). Online publication date: 13-Nov-2022.
https://dl.acm.org/doi/10.5555/3571885.3571958

Arun B, Ravindran B (2022). Scalable byzantine fault tolerance via partial decentralization. Proceedings of the VLDB Endowment 15:9 (1739-1752). Online publication date: 27-Jul-2022.
https://dl.acm.org/doi/10.14778/3538598.3538599

Chen Q, Ma S, Chen K, Ma T, Liu X, Chen D, Wu Y, Chen Z (2022). SeqDLM: A Sequencer-Based Distributed Lock Manager for Efficient Shared File Access in a Parallel File System. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (1-14). Online publication date: Nov-2022.
https://doi.org/10.1109/SC41404.2022.00060

Honoré W, Kim J, Shin J, Shao Z (2021). Much ADO about failures: a fault-aware model for compositional verification of strongly consistent distributed systems. Proceedings of the ACM on Programming Languages 5:OOPSLA (1-31). Online publication date: 15-Oct-2021.
https://dl.acm.org/doi/10.1145/3485474
