research-article

In the land of data streams where synopses are missing, one framework to bring them all

Authors: Rudi Poepsel-Lemaitre, Joscha von Hein, Jorge-Arnulfo Quiané-Ruiz, Volker Markl

Proceedings of the VLDB Endowment, Volume 14, Issue 10

Pages 1818 - 1831

https://doi.org/10.14778/3467861.3467871

Published: 01 June 2021

Abstract

In pursuit of real-time data analysis, approximate summarization structures, i.e., synopses, have gained importance over the years. However, existing stream processing systems, such as Flink, Spark, and Storm, do not support synopses as first-class citizens, i.e., as pipeline operators; implementing synopses is left to users. This is mainly because the diversity of synopses makes a unified implementation difficult. We present Condor, a framework that supports synopses as first-class citizens. Condor facilitates the specification and processing of synopsis-based streaming jobs while hiding all internal processing details. Condor's key component is its model, which represents synopses as a particular case of windowed aggregate functions. An inherent divide-and-conquer strategy allows Condor to distribute the computation efficiently, enabling high performance and linear scalability. Our evaluation shows that Condor outperforms existing approaches by up to a factor of 75 and that it scales linearly with the number of cores.
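
To make the idea of a synopsis as a windowed aggregate function concrete, the following is a minimal Python sketch. It is not Condor's API or implementation: the class `CountMinSketch`, the function `tumbling_window_sketches`, and all parameters (`width`, `depth`, `window_size`, `parallelism`) are hypothetical names chosen for illustration. It assumes a Count-Min sketch as the synopsis, tumbling windows over integer timestamps, and round-robin partitioning; partial sketches are built per partition and then merged, which is one simple reading of the divide-and-conquer strategy described above.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


class CountMinSketch:
    """Count-Min sketch with the three steps of an aggregate function:
    update (add one element), merge (combine partials), estimate (query)."""

    def __init__(self, width: int = 256, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One salted hash per row; deterministic across sketch instances.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def merge(self, other: "CountMinSketch") -> "CountMinSketch":
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same dimensions")
        merged = CountMinSketch(self.width, self.depth)
        for row in range(self.depth):
            merged.table[row] = [
                a + b for a, b in zip(self.table[row], other.table[row])
            ]
        return merged

    def estimate(self, item: str) -> int:
        # Count-Min never underestimates: take the minimum over all rows.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


def tumbling_window_sketches(
    events: Iterable[Tuple[int, str]],
    window_size: int,
    parallelism: int = 4,
    width: int = 256,
    depth: int = 4,
) -> Dict[int, CountMinSketch]:
    """Build one sketch per tumbling window from (timestamp, item) events.

    Divide: each event goes round-robin to one of `parallelism` partial
    sketches of its window. Conquer: partials of a window are merged.
    """
    partials: Dict[Tuple[int, int], CountMinSketch] = {}
    for position, (timestamp, item) in enumerate(events):
        window_start = timestamp - timestamp % window_size
        key = (window_start, position % parallelism)
        if key not in partials:
            partials[key] = CountMinSketch(width, depth)
        partials[key].update(item)

    windows: Dict[int, CountMinSketch] = {}
    for (window_start, _), partial in partials.items():
        if window_start in windows:
            windows[window_start] = windows[window_start].merge(partial)
        else:
            windows[window_start] = partial
    return windows
```

Because merging adds the counter tables cell by cell, the merged sketch for a window is identical to one built sequentially over the same elements. That property is what lets the partial sketches be computed independently, for instance one per core.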




Published In

Proceedings of the VLDB Endowment, Volume 14, Issue 10, June 2021, 219 pages

ISSN: 2150-8097

Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)

Publisher: VLDB Endowment
