
CORFU: A distributed shared log

Published: 20 December 2013

Abstract

CORFU is a global log that clients can append to and read from over a network. Internally, CORFU is distributed over a cluster of machines in such a way that there is no single I/O bottleneck for either appends or reads. Data is fully replicated for fault tolerance, and a modest cluster of about 16 to 32 machines with SSDs can sustain 1 million 4-KByte operations per second.
The CORFU log enabled the construction of a variety of distributed applications that require strong consistency at high speeds, such as databases, transactional key-value stores, replicated state machines, and metadata services.
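
The abstract does not describe the mechanism behind these properties, so the following is only an illustrative sketch of how a single log abstraction could spread appends and reads over several replicated storage groups. The class names (`StorageUnit`, `SharedLog`), the round-robin mapping from log position to group, and the in-process tail counter are all assumptions made for this example and should not be read as the CORFU protocol itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StorageUnit:
    """Hypothetical write-once page store standing in for one SSD-backed machine."""
    pages: Dict[int, bytes] = field(default_factory=dict)

    def write(self, position: int, data: bytes) -> None:
        # Each log position may be written at most once.
        if position in self.pages:
            raise ValueError(f"position {position} already written")
        self.pages[position] = data

    def read(self, position: int) -> bytes:
        return self.pages[position]


class SharedLog:
    """Toy shared log: positions are striped round-robin over replica groups."""

    def __init__(self, num_groups: int, replicas_per_group: int = 2) -> None:
        self.groups: List[List[StorageUnit]] = [
            [StorageUnit() for _ in range(replicas_per_group)]
            for _ in range(num_groups)
        ]
        self.tail = 0  # next free log position (assumed local counter)

    def _group_for(self, position: int) -> List[StorageUnit]:
        # Assumption: consecutive positions land on different groups,
        # so no single group serves every append or every read.
        return self.groups[position % len(self.groups)]

    def append(self, data: bytes) -> int:
        position = self.tail
        self.tail += 1
        # Write the entry to every replica in the chosen group.
        for unit in self._group_for(position):
            unit.write(position, data)
        return position

    def read(self, position: int) -> bytes:
        # Any replica in the group holds the entry; read the first one.
        return self._group_for(position)[0].read(position)


log = SharedLog(num_groups=4)
positions = [log.append(entry) for entry in (b"a", b"b", b"c", b"d", b"e")]
```

In this sketch, five appends to a four-group log place positions 0 and 4 on the first group and one position on each of the others, which is the sense in which the I/O load is spread. A real deployment would also need a networked way to agree on the tail and to handle failures, both of which are outside what the abstract states.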



Published In

ACM Transactions on Computer Systems, Volume 31, Issue 4
December 2013
90 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/2542150
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 December 2013
Accepted: 01 March 2013
Received: 01 December 2012
Published in TOCS Volume 31, Issue 4


Qualifiers

  • Research-article
  • Research
  • Refereed
