research-article

uTree: a persistent B+-tree with low tail latency

Published: 01 July 2020

Abstract

Tail latency is a critical design issue in recent storage systems. The B+-tree, a fundamental building block of storage systems, incurs high tail latency, especially when placed in persistent memory (PM). Our empirical study identifies two factors that lead to such latency spikes: (i) internal structural refinement operations (i.e., split, merge, and balance), and (ii) interference between concurrent operations. The problem worsens when high concurrency meets the low write bandwidth of persistent memory.
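To illustrate why occasional structural operations matter for tail latency, the following minimal sketch compares the mean, median, and 99th percentile of a set of operation latencies. The numbers are hypothetical and chosen only for illustration (they are not measurements from the paper): most operations are fast, and a small fraction, standing in for operations that trigger a split or merge, are slow. The `percentile` helper is an assumed nearest-rank implementation, not part of any benchmark tool.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p percent of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in microseconds: 980 fast operations and
# 20 slow ones (for example, operations that hit a node split).
latencies_us = [1.0] * 980 + [100.0] * 20

mean_us = sum(latencies_us) / len(latencies_us)
p50_us = percentile(latencies_us, 50)
p99_us = percentile(latencies_us, 99)

print(f"mean={mean_us:.2f} us, p50={p50_us:.1f} us, p99={p99_us:.1f} us")
```

In this made-up distribution only 2% of operations are slow, so the median stays at the fast-path latency and the mean rises modestly, while the 99th percentile lands entirely on the slow operations.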
In this paper, we propose a B+-tree variant named μTree. It adds a shadow list-based layer to the leaf nodes of a B+-tree to gain the benefits of both list and tree data structures. The list layer in PM is exempt from structural refinement operations because its list nodes own separate PM spaces, which are organized in an element-based way. Meanwhile, μTree still gains the locality benefit of the tree-based nodes. To alleviate the interference overhead, μTree coordinates concurrency control between the tree and list layers, which moves slow PM accesses off the critical path. We compare μTree with state-of-the-art PM-aware B+-tree indices under both YCSB workloads and real-world applications. μTree achieves a 99th percentile latency that is one order of magnitude lower and 2.8 to 4.7 times higher throughput.
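The two-layer idea can be sketched as a toy, single-threaded data structure. This is not the authors' implementation: it has no persistence, no concurrency control, and no real B+-tree. It assumes, purely for illustration, that a sorted in-memory array (`keys`, `nodes`) stands in for the tree layer and that each `ListNode` stands in for a separately allocated list element. All class and method names are hypothetical.

```python
import bisect


class ListNode:
    """One key-value element; stands in for a separately allocated list node."""

    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node


class TwoLayerIndex:
    """Toy index: a sorted array (tree stand-in) over a sorted linked list."""

    def __init__(self):
        self.head = ListNode(None, None)  # sentinel at the front of the list
        self.keys = []   # sorted keys, the stand-in for the tree layer
        self.nodes = []  # nodes[i] is the list node that holds keys[i]

    def insert(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.nodes[i].value = value  # update the element in place
            return
        # List layer: link one new element after its predecessor.
        # No other list element is moved, split, or merged.
        pred = self.nodes[i - 1] if i > 0 else self.head
        node = ListNode(key, value, pred.next)
        pred.next = node
        # Tree stand-in: record the new key and a reference to its element.
        self.keys.insert(i, key)
        self.nodes.insert(i, node)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.nodes[i].value
        return None

    def scan(self, start_key, count):
        """Return up to `count` pairs with key >= start_key, in key order."""
        i = bisect.bisect_left(self.keys, start_key)
        node = self.nodes[i] if i < len(self.nodes) else None
        out = []
        while node is not None and len(out) < count:
            out.append((node.key, node.value))
            node = node.next
        return out


index = TwoLayerIndex()
for k in (5, 1, 3):
    index.insert(k, str(k))
print(index.get(3), index.scan(2, 2))
```

In this sketch an insert touches exactly one predecessor link in the list, regardless of how the index above it is reorganized, which is the property the abstract attributes to the element-based list layer. The sorted array is only a placeholder; a real tree layer would have its own node structure and rebalancing.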


Published In

Proceedings of the VLDB Endowment  Volume 13, Issue 12
August 2020
1710 pages
ISSN:2150-8097
Publisher

VLDB Endowment
