research-article

Practical Dynamic Extension for Sampling Indexes

Authors:

Douglas B. Rumbaugh,

Dong Xie

Proceedings of the ACM on Management of Data, Volume 1, Issue 4

Article No.: 254, Pages 1 - 26

https://doi.org/10.1145/3626744

Published: 12 December 2023

Abstract

Executing analytical queries on massive datasets is challenging because of long response times and high computational costs. As a result, analyzing representative samples of the data has emerged as an attractive alternative: it avoids the cost of processing queries against the entire dataset while still producing statistically valid results. Unfortunately, the sampling techniques in common use sacrifice either sample quality or performance, and so are poorly suited to this task. It is possible, however, to build high-quality sample sets efficiently with the assistance of indexes. This introduces a new challenge: real-world data is subject to continuous updates, and so the indexes must be kept up to date. This is difficult because existing sampling indexes present a dichotomy: efficient sampling indexes are difficult to update, while easily updatable indexes have poor sampling performance. This paper seeks to address this gap by proposing a general and practical framework for extending most sampling indexes with efficient update support, based on splitting indexes into smaller shards combined with a systematic approach to their periodic reconstruction. The framework's design space is examined with an eye towards the trade-offs between update performance, sampling performance, and memory usage. Three existing static sampling indexes are extended using this framework to support updates, and the generalization of the framework to concurrent operations and larger-than-memory data is discussed. Through a comprehensive suite of benchmarks, the extended indexes are shown to match or exceed the update throughput of state-of-the-art dynamic baselines while presenting significant improvements in sampling latency.
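To make the idea of sharding plus periodic reconstruction concrete, the following is a minimal sketch in the spirit of the classic Bentley-Saxe logarithmic method. It is not the authors' framework. It assumes a sorted array as the static "index", a small unsorted buffer for new records, insert-only updates (no deletes), and uniform range sampling with replacement. The class name `ShardedSampler` and its parameters are hypothetical and chosen only for illustration.

```python
import bisect
import random


class ShardedSampler:
    """Illustrative only: sorted-array shards of doubling size plus a buffer."""

    def __init__(self, buffer_cap=4, seed=None):
        self.buffer_cap = buffer_cap  # assumed flush threshold
        self.buffer = []              # mutable, unsorted recent inserts
        self.levels = []              # levels[i] is None or a sorted list
        self.rng = random.Random(seed)

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) < self.buffer_cap:
            return
        # Buffer is full: rebuild. Merge it with every occupied level,
        # starting at level 0, until the first empty level is reached.
        carry = sorted(self.buffer)
        self.buffer = []
        i = 0
        while i < len(self.levels) and self.levels[i] is not None:
            # sorted() keeps the sketch short; a linear merge would be cheaper.
            carry = sorted(carry + self.levels[i])
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = carry

    def sample_range(self, lo, hi, k):
        """Return k keys drawn uniformly, with replacement, from [lo, hi]."""
        sources, weights = [], []
        buf_hits = [x for x in self.buffer if lo <= x <= hi]
        if buf_hits:
            sources.append(buf_hits)
            weights.append(len(buf_hits))
        for shard in self.levels:
            if shard is None:
                continue
            left = bisect.bisect_left(shard, lo)
            right = bisect.bisect_right(shard, hi)
            if right > left:
                sources.append(shard[left:right])  # copy is fine for a sketch
                weights.append(right - left)
        if not sources:
            return []
        # Pick a shard in proportion to its number of qualifying records,
        # then pick one qualifying record uniformly inside that shard.
        picks = self.rng.choices(sources, weights=weights, k=k)
        return [self.rng.choice(src) for src in picks]


sampler = ShardedSampler(buffer_cap=4, seed=0)
for key in range(100):
    sampler.insert(key)
print(sampler.sample_range(10, 20, 5))
```

Weighting each shard by its count of qualifying records keeps every record in the range equally likely to be drawn, even though the shards differ in size. Each insert touches only the buffer until a flush is triggered, so rebuild work is spread across many inserts, while a query has to consult every occupied shard. This is the kind of tension between update and sampling performance that the abstract refers to.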

Index Terms

1. Information systems
  1. Data management systems
    1. Data structures

Published In

Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 4

December 2023, 1317 pages

EISSN: 2836-6573

DOI: 10.1145/3637468

Editor: Divyakant Agrawal, UC Santa Barbara, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States
