research-article
Just Accepted

Competitive Data-Structure Dynamization

Online AM: 28 June 2024

Abstract

Data-structure dynamization is a general approach for making static data structures dynamic. It is used extensively in geometric settings and, in the guise of so-called merge (or compaction) policies, in big-data databases such as LevelDB and Google Bigtable. Previous theoretical work is based on worst-case analyses for uniform inputs: insertions of one item at a time and a non-varying read rate. In practice, merge policies must not only handle batch insertions and varying read/write ratios; they can also take advantage of such non-uniformity to reduce cost on a per-input basis.
To model this, we initiate the study of data-structure dynamization through the lens of competitive analysis, via two new online set-cover problems. For each, the input is a sequence of disjoint sets of weighted items. The sets are revealed one at a time. The algorithm must respond to each with a set cover that covers all items revealed so far. It obtains the cover incrementally from the previous cover by adding one or more sets and optionally removing existing sets. For each new set, the algorithm incurs a build cost equal to the weight of the items in the set. In the first problem, the objective is to minimize total build cost plus total query cost, where the algorithm incurs a query cost at each time \(t\) equal to the current cover size. In the second problem, the objective is to minimize the build cost while keeping the query cost from exceeding \(k\) (a given parameter) at any time. We give deterministic online algorithms for both variants, with competitive ratios of \(\Theta(\log^* n)\) and \(k\), respectively. The latter ratio is optimal for the second variant.
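The cost model described above can be made concrete with a small sketch. The Python below is only an illustration of how build cost and query cost are tallied for a given sequence of covers; it is not the algorithm from the paper. The names `evaluate`, `never_merge`, and `always_merge` are hypothetical, and the two policies are naive baselines chosen to show the trade-off between the two costs.

```python
from typing import Dict, FrozenSet, Hashable, List, Sequence, Set, Tuple

Item = Hashable
Component = FrozenSet[Item]
Cover = Set[Component]


def evaluate(
    batches: Sequence[Set[Item]],
    weight: Dict[Item, float],
    covers: Sequence[Cover],
) -> Tuple[float, int, int]:
    """Return (total build cost, total query cost, largest cover size).

    batches[t] is the set revealed at time t; covers[t] is the cover
    chosen in response. Each set that is new in covers[t] costs the
    total weight of its items; each time step costs len(covers[t]).
    """
    seen: Set[Item] = set()
    prev: Cover = set()
    build, query, max_size = 0.0, 0, 0
    for batch, cover in zip(batches, covers):
        seen |= batch
        covered = set().union(*cover) if cover else set()
        if not seen <= covered:
            raise ValueError("cover misses an item revealed so far")
        for comp in cover - prev:  # sets added at this step
            build += sum(weight[x] for x in comp)
        query += len(cover)
        max_size = max(max_size, len(cover))
        prev = set(cover)
    return build, query, max_size


def never_merge(batches: Sequence[Set[Item]]) -> List[Cover]:
    """Baseline: keep every revealed set as its own component."""
    covers: List[Cover] = []
    current: Cover = set()
    for b in batches:
        current = current | {frozenset(b)}
        covers.append(current)
    return covers


def always_merge(batches: Sequence[Set[Item]]) -> List[Cover]:
    """Baseline: rebuild one component holding all items at every step."""
    covers: List[Cover] = []
    seen: FrozenSet[Item] = frozenset()
    for b in batches:
        seen = seen | frozenset(b)
        covers.append({seen})
    return covers
```

For three unit-weight singleton batches, `never_merge` pays build cost 3 and query cost 6, while `always_merge` pays build cost 6 and query cost 3. The first problem asks to minimize the sum of the two costs; the second asks to minimize build cost subject to the largest cover size never exceeding \(k\).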

Information & Contributors

Information

Published In

ACM Transactions on Algorithms Just Accepted
EISSN:1549-6333
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 28 June 2024
Accepted: 06 June 2024
Revised: 30 April 2024
Received: 09 December 2021

Author Tags

  1. online algorithms
  2. competitive analysis
  3. data-structure dynamization
  4. log-structured merge-tree
  5. compaction

Qualifiers

  • Research-article
