SynopsisDB: Distributed Synopsis-based Data Processing System

Author:

Xin Zhang

SIGMOD '23: Companion of the 2023 International Conference on Management of Data

Pages 289 - 291

https://doi.org/10.1145/3555041.3589394

Published: 05 June 2023

Abstract

As data volumes continue to expand at an unprecedented rate, data scientists face the challenge of effectively processing and exploring vast amounts of data. To carry out tasks such as analyzing wildfire clusters, querying diverse datasets, and visualizing results with tools like IncVisage, Pangloss, Marviq, and GeoSparkViz, they require data processing systems that are efficient, flexible, and capable of handling different types of queries across various data sources. Two features are therefore critical for these systems: processing data efficiently, and handling a wide range of queries over diverse data types.

Supplemental Material

MP4 File

In this video, we introduce SynopsisDB, a distributed synopsis-based data processing system. SynopsisDB comprises two projects: partition-based synopsis combination and progressive query processing. It fills a gap: no existing system combines approximate and progressive query processing, and SynopsisDB uses data synopses as the bridge between the two. SynopsisDB supports 1D and 2D histograms, geometric histograms, 1D and 2D wavelets, uniform samples, and stratified samples. Partition-based data synopses cannot be merged directly if they are constructed from different partition functions. Therefore, we design a two-step merging framework for SynopsisDB that combines partition-based synopses with different shapes. For progressive processing, SynopsisDB uses data synopses to estimate the result size and improve the quality of progressive results. If you have any questions, feel free to contact us by email: [email protected]
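To make the merging problem concrete, here is a minimal Python sketch of one way two 1D histograms with different bucket boundaries could be combined in two steps. It is an illustration under stated assumptions, not SynopsisDB's actual framework: the function names (`estimate_range_count`, `realign`, `merge_histograms`) are hypothetical, the common partition is assumed to be the union of both sets of bucket edges, and values are assumed to be spread uniformly inside each bucket. The same overlap-based estimate also shows how a synopsis might be used to guess a range query's result size.

```python
from typing import List, Sequence, Tuple

# A histogram is (bucket edges, bucket counts); edges must be strictly
# increasing and have exactly one more entry than counts.
Histogram = Tuple[Sequence[float], Sequence[float]]


def estimate_range_count(hist: Histogram, lo: float, hi: float) -> float:
    """Estimate how many records fall in [lo, hi].

    Assumption: records are spread uniformly inside each bucket, so a bucket
    contributes in proportion to the fraction of its width that overlaps.
    """
    edges, counts = hist
    total = 0.0
    for i, count in enumerate(counts):
        overlap = min(hi, edges[i + 1]) - max(lo, edges[i])
        if overlap > 0:
            total += count * overlap / (edges[i + 1] - edges[i])
    return total


def realign(hist: Histogram, target_edges: Sequence[float]) -> List[float]:
    """Step 1 (assumed): redistribute counts onto a common set of edges."""
    return [
        estimate_range_count(hist, target_edges[j], target_edges[j + 1])
        for j in range(len(target_edges) - 1)
    ]


def merge_histograms(a: Histogram, b: Histogram) -> Tuple[List[float], List[float]]:
    """Step 2 (assumed): add the realigned counts bucket by bucket."""
    edges = sorted(set(a[0]) | set(b[0]))  # union of both partitions
    counts_a = realign(a, edges)
    counts_b = realign(b, edges)
    return edges, [x + y for x, y in zip(counts_a, counts_b)]


if __name__ == "__main__":
    hist_a = ([0, 10, 20], [10, 20])  # two buckets of width 10
    hist_b = ([0, 5, 20], [5, 30])    # buckets of width 5 and 15
    merged = merge_histograms(hist_a, hist_b)
    print(merged)
    print(estimate_range_count(merged, 5, 15))
```

The merged histogram preserves the total record count of its inputs, but the per-bucket counts are only as accurate as the uniformity assumption; a real system would likely need to track or bound that error.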

References

[1]

Ildar Absalyamov, Michael J Carey, and Vassilis J Tsotras. 2018. Lightweight cardinality estimation in LSM-based systems. In Proceedings of the 2018 International Conference on Management of Data. 841--855.

[2]

Pankaj K Agarwal, Graham Cormode, Zengfeng Huang, Jeff M Phillips, Zhewei Wei, and Ke Yi. 2013a. Mergeable summaries. ACM Transactions on Database Systems (TODS), Vol. 38, 4 (2013), 1--28.

[3]

Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. 2013b. BlinkDB: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems. 29--42.

[4]

Sattam Alsubaiee et al. 2014. AsterixDB: a scalable, open source BDMS. Proceedings of the VLDB Endowment, Vol. 7, 14 (2014), 1905--1916.

[5]

Ahmed M Aly, Ahmed R Mahmood, Mohamed S Hassan, Walid G Aref, Mourad Ouzzani, Hazem Elmeleegy, and Thamir Qadah. 2015. Aqwa: adaptive query workload aware partitioning of big spatial data. Proceedings of the VLDB Endowment, Vol. 8, 13 (2015), 2062--2073.

[6]

Ning An, Zhen-Yu Yang, and Anand Sivasubramaniam. 2001. Selectivity estimation for spatial joins. In Proceedings 17th International Conference on Data Engineering. IEEE, 368--375.

[7]

Apache. 2011. Apache Flink. https://flink.apache.org

[8]

Richard Beigel and Egemen Tanin. 1998. The geometry of browsing. In Latin American Symposium on Theoretical Informatics. Springer, 331--340.

[9]

Kaushik Chakrabarti, Minos Garofalakis, Rajeev Rastogi, and Kyuseok Shim. 2001. Approximate query processing using wavelets. The VLDB Journal, Vol. 10, 2 (2001), 199--223.

[10]

Badrish Chandramouli, Jonathan Goldstein, and Abdul Quamar. 2013. Scalable progressive analytics on big data in the cloud. PVLDB, Vol. 6, 14 (2013), 1726--1737.

[11]

Abhinandan Das, Johannes Gehrke, and Mirek Riedewald. 2004. Approximation techniques for spatial data. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data. 695--706.

[12]

Bolin Ding, Silu Huang, Surajit Chaudhuri, Kaushik Chakrabarti, and Chi Wang. 2016. Sample seek: Approximating aggregates with distribution precision guarantee. In Proceedings of the 2016 International Conference on Management of Data. 679--694.

[13]

Liming Dong et al. 2020. Marviq: Quality-Aware Geospatial Visualization of Range-Selection Queries Using Materialization. In SIGMOD. 67--82.

[14]

Facebook. 2012. RocksDB. https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format

[15]

Edward Gan, Peter Bailis, and Moses Charikar. 2020. Coopstore: Optimizing precomputed summaries for aggregation. Proceedings of the VLDB Endowment, Vol. 13, 12 (2020), 2174--2187.

[16]

Jaemin Jo, Sehi L'Yi, Bongshin Lee, and Jinwook Seo. 2019. Proreveal: Progressive visual analytics with safeguards. IEEE Transactions on Visualization and Computer Graphics, Vol. 27, 7 (2019), 3109--3122.

[17]

Ge Luo, Lu Wang, Ke Yi, and Graham Cormode. 2016. Quantiles over data streams: experimental comparisons, new analyses, and further improvements. The VLDB Journal, Vol. 25, 4 (2016), 449--472.

[18]

Yossi Matias, Jeffrey Scott Vitter, and Min Wang. 1998. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 ACM SIGMOD international conference on Management of data. 448--459.

[19]

Dominik Moritz, Danyel Fisher, Bolin Ding, and Chi Wang. 2017. Trust, but verify: Optimistic visualizations of approximate queries for exploring big data. In CHI. 2904--2915.

[20]

Johns Paul et al. 2020. Poet: an Interactive Spatial Query Processing System in Grab. In SIGSPATIAL. 477--486.

[21]

Rudi Poepsel-Lemaitre, Martin Kiefer, Joscha Von Hein, Jorge-Arnulfo Quiané-Ruiz, and Volker Markl. 2021. In the land of data streams where synopses are missing, one framework to bring them all. Proceedings of the VLDB Endowment, Vol. 14, 10 (2021), 1818--1831.

[22]

Marianne Procopio et al. 2021. Impact of cognitive biases on progressive visualization. TVCG, Vol. 28, 9 (2021), 3093--3112.

[23]

Sajjadur Rahman et al. 2017. I've seen "enough": incrementally improving visualizations to support rapid decision making. PVLDB, Vol. 10, 11 (2017), 1262--1273.

[24]

Florin Rusu and Alin Dobra. 2008. Sketches for size of join estimation. ACM Transactions on Database Systems (TODS), Vol. 33, 3 (2008), 1--46.

[25]

Salman Salloum and Joshua Zhexue Huang. 2021. RSP-Hist: Approximate Histograms for Big Data Exploration on Hadoop Clusters. In 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC). IEEE, 412--417.

[26]

AB Siddique, Ahmed Eldawy, and Vagelis Hristidis. 2019a. Euler: Improved Selectivity Estimation for Rectangular Spatial Records. In 2019 IEEE International Conference on Big Data (Big Data). IEEE, 4129--4133.

[27]

Abu Bakar Siddique, Ahmed Eldawy, and Vagelis Hristidis. 2019b. Comparing synopsis techniques for approximate spatial data analysis. Proceedings of the VLDB Endowment, Vol. 12, 11 (2019).

[28]

Samriddhi Singla et al. 2020. WildfireDB: A Spatio-Temporal Dataset Combining Wildfire Occurrence with Relevant Covariates. (2020).

[29]

Chengyu Sun, Divyakant Agrawal, and Amr El Abbadi. 2002. Selectivity estimation for spatial joins with geometric selections. In International Conference on Extending Database Technology. Springer, 609--626.

[30]

Yufei Tao, George Kollios, Jeffrey Considine, Feifei Li, and Dimitris Papadias. 2004. Spatio-temporal aggregation using sketches. In Proceedings. 20th International Conference on Data Engineering. IEEE, 214--225.

[31]

Wee Hyong Tok and Stéphane Bressan. 2013. Progressive and approximate join algorithms on data streams. In Advanced query processing. Springer.

[32]

Jeffrey Scott Vitter and Min Wang. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelets. ACM SIGMOD Record, Vol. 28, 2 (1999), 193--204.

[33]

Jeffrey Scott Vitter, Min Wang, and Bala Iyer. 1998. Data cube approximation and histograms via wavelets. In Proceedings of the seventh international conference on Information and knowledge management. 96--104.

[34]

Jin-Feng Wang, A Stein, Bin-Bo Gao, and Yong Ge. 2012. A review of spatial sampling. Spatial Statistics, Vol. 2 (2012), 1--14.

[35]

Min Wang, Jeffrey Scott Vitter, Lipyeow Lim, and Sriram Padmanabhan. 2001. Wavelet-based cost estimation for spatial queries. In International Symposium on Spatial and Temporal Databases. Springer, 175--193.

[36]

Jia Yu and Mohamed Sarwat. 2021. GeoSparkViz: a cluster computing system for visualizing massive-scale geospatial data. PVLDB, Vol. 30, 2 (2021), 237--258.

[37]

Fuheng Zhao, Sujaya Maiyya, Ryan Wiener, Divyakant Agrawal, and Amr El Abbadi. 2021. KLL± approximate quantile sketches over dynamic datasets. Proceedings of the VLDB Endowment, Vol. 14, 7 (2021), 1215--1227.

[38]

Zhuoyue Zhao, Feifei Li, and Yuxi Liu. 2020. Efficient join synopsis maintenance for data warehouse. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2027--2042.

Index Terms

SynopsisDB: Distributed Synopsis-based Data Processing System
1. Information systems
  1. Data management systems
    1. Database design and models

Published In

SIGMOD '23: Companion of the 2023 International Conference on Management of Data

June 2023

330 pages

ISBN: 9781450395076

DOI: 10.1145/3555041

General Chairs: Sudipto Das (Amazon Web Services, USA), Ippokratis Pandis (Amazon Web Services, USA)

Program Chairs: K. Selçuk Candan (Arizona State University, USA), Sihem Amer-Yahia (CNRS, Université Grenoble Alpes, France)

Copyright © 2023 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Abstract

Data Availability

https://dl.acm.org/doi/10.1145/3555041.3589394#SIGMOD23-modug007.mp4

Funding Sources

NSF

Conference

SIGMOD/PODS '23

Sponsor:

SIGMOD

SIGMOD/PODS '23: International Conference on Management of Data

June 18 - 23, 2023

Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
