DOI: 10.1145/3555041.3589394

SynopsisDB: Distributed Synopsis-based Data Processing System

Published: 05 June 2023

Abstract

As data volumes continue to grow at an unprecedented rate, data scientists face the challenge of processing and exploring vast amounts of data effectively. To carry out tasks such as analyzing wildfire clusters, querying diverse datasets, and visualizing results with tools like IncVisage, Pangloss, Marviq, and GeoSparkViz, they need data processing systems that are efficient, flexible, and able to handle different types of queries across various data sources. Two features are critical for such systems: efficient data processing and support for a wide range of queries over diverse data types.

Supplemental Material

MP4 File
In this video, we introduce SynopsisDB, a distributed synopsis-based data processing system. SynopsisDB comprises two projects: partition-based synopsis combination and progressive query processing. SynopsisDB fills a gap: no existing system combines approximate and progressive queries. It uses data synopses as the bridge between the two. SynopsisDB supports 1D and 2D histograms, geometric histograms, 1D and 2D wavelets, uniform samples, and stratified samples. Partition-based data synopses cannot be merged directly if they are constructed from different partition functions. Therefore, we design a two-step merging framework for SynopsisDB that combines partition-based synopses with different shapes. For progressive processing, SynopsisDB uses data synopses to estimate the result size and improve the quality of progressive results. If you have any questions, feel free to contact us by email: [email protected]
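
To illustrate why synopses built from different partition functions need an extra step before they can be combined, here is a minimal sketch for 1D histograms. It is not SynopsisDB's actual framework, which this page does not describe in detail. It assumes that values are spread uniformly inside each bucket and that bucket edges are strictly increasing. The function names realign and merge_histograms are hypothetical.

```python
def realign(edges, counts, target_edges):
    """Step 1: redistribute bucket counts onto target_edges.

    Assumption (not from the source): values are uniformly
    distributed inside each source bucket.
    """
    out = [0.0] * (len(target_edges) - 1)
    for i, count in enumerate(counts):
        lo, hi = edges[i], edges[i + 1]
        width = hi - lo
        for j in range(len(target_edges) - 1):
            # Length of the overlap between source bucket i and target bucket j.
            overlap = min(hi, target_edges[j + 1]) - max(lo, target_edges[j])
            if overlap > 0:
                out[j] += count * overlap / width
    return out


def merge_histograms(edges_a, counts_a, edges_b, counts_b):
    """Step 2: add the realigned counts bucket by bucket."""
    # The common grid is the union of both sets of bucket boundaries.
    common = sorted(set(edges_a) | set(edges_b))
    a = realign(edges_a, counts_a, common)
    b = realign(edges_b, counts_b, common)
    return common, [x + y for x, y in zip(a, b)]


# Two histograms over [0, 20] with different bucket boundaries.
edges, counts = merge_histograms([0, 10, 20], [10, 20], [0, 5, 20], [5, 30])
print(edges, counts)
```

The total count is preserved by the merge, but the per-bucket values are only estimates, because the uniformity assumption rarely holds exactly.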

    Published In

    SIGMOD '23: Companion of the 2023 International Conference on Management of Data
    June 2023
    330 pages
    ISBN:9781450395076
    DOI:10.1145/3555041
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Abstract

    Data Availability

    Supplemental video (MP4): https://dl.acm.org/doi/10.1145/3555041.3589394#SIGMOD23-modug007.mp4

    Funding Sources

    • NSF

    Conference

    SIGMOD/PODS '23
    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
