research-article

A practical scalable distributed B-tree

Published: 01 August 2008

Abstract

Internet applications increasingly rely on scalable data structures that must support high throughput and store huge amounts of data. These data structures can be hard to implement efficiently. Recent proposals have overcome this problem by giving up on generality and implementing specialized interfaces and functionality (e.g., Dynamo [4]). We present the design of a more general and flexible solution: a fault-tolerant and scalable distributed B-tree. In addition to the usual B-tree operations, our B-tree provides some important practical features: transactions for atomically executing several operations in one or more B-trees, online migration of B-tree nodes between servers for load-balancing, and dynamic addition and removal of servers for supporting incremental growth of the system.
Our design is conceptually simple. Rather than using complex concurrency and locking protocols, we use distributed transactions to make changes to B-tree nodes. We show how to extend the B-tree and keep additional information so that these transactions execute quickly and efficiently. Our design relies on an underlying distributed data sharing service, Sinfonia [1], which provides fault tolerance and a light-weight distributed atomic primitive. We use this primitive to commit our transactions. We implemented our B-tree and show that it performs comparably to an existing open-source B-tree and that it scales to hundreds of machines. We believe that our approach is general and can be used to implement other distributed data structures easily.
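
The idea of replacing locking protocols with transactions on B-tree nodes can be illustrated with a minimal sketch. It assumes an optimistic scheme in which each node carries a version number and a commit succeeds only if every node read by the transaction is unchanged. The names `ToyStore`, `Node`, `commit`, and `insert` are hypothetical and exist only for this illustration; they are not Sinfonia's interface, and the single-process store below only stands in for a distributed service.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Node:
    """A versioned B-tree leaf; `version` increases on every committed write."""
    version: int = 0
    keys: Dict[str, str] = field(default_factory=dict)


class ToyStore:
    """Hypothetical in-memory stand-in for a distributed storage service."""

    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}

    def read(self, node_id: int) -> Tuple[int, Dict[str, str]]:
        node = self.nodes[node_id]
        # Return a copy so callers cannot mutate the store directly.
        return node.version, dict(node.keys)

    def commit(self, compares: Dict[int, int],
               writes: Dict[int, Dict[str, str]]) -> bool:
        """Apply `writes` only if every compared version is unchanged."""
        for node_id, seen_version in compares.items():
            if self.nodes[node_id].version != seen_version:
                return False  # a node changed since it was read: abort
        for node_id, new_keys in writes.items():
            node = self.nodes[node_id]
            node.keys = dict(new_keys)
            node.version += 1
        return True


def insert(store: ToyStore, node_id: int, key: str, value: str,
           max_retries: int = 10) -> bool:
    """Optimistically insert into one leaf, retrying if validation fails."""
    for _ in range(max_retries):
        version, keys = store.read(node_id)
        keys[key] = value
        if store.commit({node_id: version}, {node_id: keys}):
            return True
    return False
```

In this sketch a writer never holds a lock. A transaction that read a stale version of a node simply fails validation at commit time and is retried against the current contents.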

References

[1]
M. Aguilera, A. Merchant, M. Shah, A. Veitch, and C. Karamanolis. Sinfonia: a new paradigm for building scalable distributed systems. In Proc. SOSP '07, pages 159--174, Oct. 2007.
[2]
A. Andrzejak and Z. Xu. Scalable, efficient range queries for grid information services. In Proc. P2P '02, pages 33--40, Sept. 2002.
[3]
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured data. In Proc. OSDI '06, pages 205--218, Nov. 2006.
[4]
G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In Proc. SOSP '07, pages 205--220, Oct. 2007.
[5]
S. Gribble, E. Brewer, J. Hellerstein, and D. Culler. Scalable, distributed data structures for Internet service construction. In Proc. OSDI '00, pages 319--332, Oct. 2000.
[6]
M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III. Software transactional memory for dynamic-sized data structures. In Proc. PODC '03, pages 92--101, July 2003.
[7]
M. Herlihy and J. E. B. Moss. Transactional memory: architectural support for lock-free data structures. In Proc. ISCA '93, pages 289--300, May 1993.
[8]
HyperTable. HyperTable: An Open Source, High Performance, Scalable Database, 2008. Online: http://hypertable.org/.
[9]
T. Johnson and A. Colbrook. A distributed data-balanced dictionary based on the B-link tree. In Proc. IPPS '92, pages 319--324, Mar. 1992. A longer version appears as MIT Tech Report MIT/LCS/TR-530, Feb. 1992.
[10]
H. T. Kung and J. T. Robinson. On optimistic methods for concurrency control. ACM Trans. Database Syst., 6(2):213--226, June 1981.
[11]
P. L. Lehman and S. B. Yao. Efficient locking for concurrent operations on B-trees. ACM Trans. Database Syst., 6(4):650--670, Dec. 1981.
[12]
W. Litwin, M.-A. Neimat, and D. Schneider. RP*: A Family of Order Preserving Scalable Distributed Data Structures. In Proc. VLDB '94, pages 342--353, Sept. 1994.
[13]
W. Litwin, M.-A. Neimat, and D. A. Schneider. LH* - a scalable, distributed data structure. ACM Trans. Database Syst., 21(4):480--525, Dec. 1996.
[14]
J. MacCormick, N. Murphy, M. Najork, C. Thekkath, and L. Zhou. Boxwood: Abstractions as the foundation for storage infrastructure. In Proc. OSDI '04, pages 105--120, Dec. 2004.
[15]
C. Mohan. ARIES/KVL: A key-value locking method for concurrency control of multiaction transactions operating on B-tree indexes. In Proc. VLDB '90, pages 392--405, Aug. 1990.
[16]
S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proc. SIGCOMM '01, pages 161--172, Aug. 2001.
[17]
Y. Sagiv. Concurrent operations on B-trees with overtaking. In Proc. PODS '85, pages 28--37, Mar. 1985.
[18]
N. Shavit and D. Touitou. Software transactional memory. In Proc. PODC '95, pages 204--213, Aug. 1995.
[19]
A. Silberstein, B. F. Cooper, U. Srivastava, E. Vee, R. Yerneni, and R. Ramakrishnan. Efficient bulk insertion into a distributed ordered table. In Proc. SIGMOD '08, pages 765--778, June 2008.
[20]
I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proc. SIGCOMM '01, pages 149--160, Aug. 2001.
[21]
Sun Microsystems. Lustre, 2008. Online: http://lustre.org/.
[22]
The Apache Software Foundation. Hadoop, 2008. Online: http://hadoop.apache.org/.

Published In

Proceedings of the VLDB Endowment, Volume 1, Issue 1
August 2008
1216 pages

Publisher

VLDB Endowment

Publication History

Published: 01 August 2008
Published in PVLDB Volume 1, Issue 1

Qualifiers

  • Research-article

Cited By

  • (2024) A Memory-Disaggregated Radix Tree. ACM Transactions on Storage 20:3 (1-41). DOI: 10.1145/3664289. Online publication date: 6-Jun-2024
  • (2023) A distributed B+Tree indexing method for processing range queries over streaming data. Cluster Computing 27:2 (1251-1274). DOI: 10.1007/s10586-023-04015-9. Online publication date: 7-May-2023
  • (2023) Learning Optimal Tree-Based Index Placement for Autonomous Database. Database and Expert Systems Applications (304-309). DOI: 10.1007/978-3-031-39847-6_22. Online publication date: 28-Aug-2023
  • (2022) Provenance-based data skipping. Proceedings of the VLDB Endowment 15:3 (451-464). DOI: 10.14778/3494124.3494130. Online publication date: 4-Feb-2022
  • (2020) A Framework for supporting DBMS-like indexes in the cloud. Proceedings of the VLDB Endowment 4:11 (702-713). DOI: 10.14778/3402707.3402711. Online publication date: 3-Jun-2020
  • (2017) DITIR. Proceedings of the VLDB Endowment 10:12 (1865-1868). DOI: 10.14778/3137765.3137795. Online publication date: 1-Aug-2017
  • (2017) Toward a New Model of Indexing Big Uncertain Data. Proceedings of the 9th International Conference on Management of Digital EcoSystems (93-98). DOI: 10.1145/3167020.3167034. Online publication date: 7-Nov-2017
  • (2017) Eunomia. ACM SIGPLAN Notices 52:8 (385-399). DOI: 10.1145/3155284.3018752. Online publication date: 26-Jan-2017
  • (2017) Eunomia. Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (385-399). DOI: 10.1145/3018743.3018752. Online publication date: 26-Jan-2017
  • (2017) A MapReduce-based scalable discovery and indexing of structured big data. Future Generation Computer Systems 73:C (32-43). DOI: 10.1016/j.future.2017.03.028. Online publication date: 1-Aug-2017