Practical Dynamic Extension for Sampling Indexes

Published: 12 December 2023
    Abstract

    The execution of analytical queries on massive datasets presents challenges due to long response times and high computational costs. As a result, the analysis of representative samples of data has emerged as an attractive alternative; this avoids the cost of processing queries against the entire dataset, while still producing statistically valid results. Unfortunately, the sampling techniques in common use sacrifice either sample quality or performance, and so are poorly suited for this task. However, it is possible to build high-quality sample sets efficiently with the assistance of indexes. This introduces a new challenge: real-world data is subject to continuous update, and so the indexes must be kept up to date. This is difficult, because existing sampling indexes present a dichotomy: efficient sampling indexes are difficult to update, while easily updatable indexes have poor sampling performance. This paper seeks to address this gap by proposing a general and practical framework for extending most sampling indexes with efficient update support, based on splitting indexes into smaller shards, combined with a systematic approach to their periodic reconstruction. The framework's design space is examined, with an eye towards exploring trade-offs between update performance, sampling performance, and memory usage. Three existing static sampling indexes are extended using this framework to support updates, and the generalization of the framework to concurrent operations and larger-than-memory data is discussed. Through a comprehensive suite of benchmarks, the extended indexes are shown to match or exceed the update throughput of state-of-the-art dynamic baselines, while presenting significant improvements in sampling latency.
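
    The abstract describes the framework only at a high level: shards plus periodic reconstruction. As a rough illustration of that general idea, and not the paper's actual design, the sketch below applies a textbook binary-decomposition scheme to a static sorted-array shard. Every name here (SortedShard, DynamicSampler, buffer_cap) is hypothetical. It handles inserts only, and the doubling merge policy is an assumption made for the example.

```python
import bisect
import random


class SortedShard:
    """Static shard: an immutable sorted array, standing in for any static sampling index."""

    def __init__(self, keys):
        self.keys = sorted(keys)

    def range_bounds(self, lo, hi):
        # Half-open slice [start, stop) of the keys that fall in the closed range [lo, hi].
        return bisect.bisect_left(self.keys, lo), bisect.bisect_right(self.keys, hi)


class DynamicSampler:
    """Insert-only dynamic wrapper: a small buffer plus shards of doubling size."""

    def __init__(self, buffer_cap=4, rng=None):
        self.buffer_cap = buffer_cap
        self.buffer = []   # unsorted recent inserts
        self.levels = []   # levels[i] is None or a shard holding buffer_cap * 2**i keys
        self.rng = rng or random.Random()

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) < self.buffer_cap:
            return
        # Buffer is full: rebuild shards like a binary counter carrying upward.
        carry, self.buffer, i = self.buffer, [], 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = SortedShard(carry)
                return
            carry = carry + self.levels[i].keys
            self.levels[i] = None
            i += 1

    def sample(self, lo, hi, k):
        """Draw k keys uniformly, with replacement, from all stored keys in [lo, hi]."""
        sources, weights = [], []
        for shard in self.levels:
            if shard is None:
                continue
            start, stop = shard.range_bounds(lo, hi)
            if stop > start:
                sources.append((shard.keys, start, stop))
                weights.append(stop - start)
        in_buffer = [x for x in self.buffer if lo <= x <= hi]
        if in_buffer:
            sources.append((in_buffer, 0, len(in_buffer)))
            weights.append(len(in_buffer))
        if not sources:
            return []
        # Pick a source in proportion to its matching count, then a uniform key inside it.
        picks = self.rng.choices(sources, weights=weights, k=k)
        return [keys[self.rng.randrange(start, stop)] for keys, start, stop in picks]


sampler = DynamicSampler(buffer_cap=4, rng=random.Random(0))
for key in range(100):
    sampler.insert(key)
print(sampler.sample(10, 20, 5))
```

    Weighting each shard by the number of its keys that match the range keeps the combined draw uniform over the whole dataset, which is why the per-shard counts are gathered before any key is picked.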

      Published In

      Proceedings of the ACM on Management of Data, Volume 1, Issue 4
      PACMMOD
      December 2023
      1317 pages
      EISSN: 2836-6573
      DOI: 10.1145/3637468
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. dynamic extension
      2. independent range sampling
      3. sampling indexes

      Qualifiers

      • Research-article
