DOI: 10.1145/564691.564699

Processing complex aggregate queries over data streams

Published: 03 June 2002

Abstract

Recent years have witnessed an increasing interest in designing algorithms for querying and analyzing streaming data (i.e., data that is seen only once in a fixed order) with only limited memory. Providing (perhaps approximate) answers to queries over such continuous data streams is a crucial requirement for many application environments; examples include large telecom and IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed.

In this paper, we consider the problem of approximately answering general aggregate SQL queries over continuous data streams with limited memory. Our method relies on randomizing techniques that compute small "sketch" summaries of the streams that can then be used to provide approximate answers to aggregate queries with provable guarantees on the approximation error. We also demonstrate how existing statistical information on the base data (e.g., histograms) can be used in the proposed framework to improve the quality of the approximation provided by our algorithms. The key idea is to intelligently partition the domain of the underlying attribute(s) and, thus, decompose the sketching problem in a way that provably tightens our guarantees. Results of our experimental study with real-life as well as synthetic data streams indicate that sketches provide significantly more accurate answers compared to histograms for aggregate queries. This is especially true when our domain partitioning methods are employed to further boost the accuracy of the final estimates.
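
As a rough illustration of what a "sketch" summary can look like, the Python fragment below estimates the size of an equi-join (a COUNT aggregate over two streams) in the style of the randomized join-size sketches of Alon, Gibbons, Matias, and Szegedy. It is a minimal sketch under stated assumptions, not the algorithm of this paper: the names `SignHash`, `StreamSketch`, and `estimate_join_size`, the choice of prime, and the cubic-polynomial sign hash (used here as an approximation of a four-wise independent +1/-1 family) are all hypothetical choices made for the example, and the domain partitioning step is not shown.

```python
import random
from statistics import median

PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1 (arbitrary illustrative choice)


class SignHash:
    """Hypothetical +1/-1 hash: parity of a random cubic polynomial mod PRIME."""

    def __init__(self, rng):
        self.coeffs = [rng.randrange(PRIME) for _ in range(4)]

    def __call__(self, value):
        acc = 0
        for c in self.coeffs:  # Horner evaluation of the cubic
            acc = (acc * value + c) % PRIME
        return 1 if acc & 1 else -1


class StreamSketch:
    """One-pass summary of a stream of integer join-attribute values."""

    def __init__(self, groups, per_group, seed):
        # Sketches that will be joined must use the same seed so that
        # counter i in both sketches is driven by the same sign function.
        rng = random.Random(seed)
        self.groups = groups
        self.per_group = per_group
        self.signs = [SignHash(rng) for _ in range(groups * per_group)]
        self.counters = [0] * (groups * per_group)

    def update(self, value, weight=1):
        # Each counter holds sum(weight * sign(value)) over the stream;
        # a negative weight models a deletion.
        for i, sign in enumerate(self.signs):
            self.counters[i] += weight * sign(value)


def estimate_join_size(left, right):
    """Median over groups of the mean counter product within each group."""
    k = left.per_group
    products = [x * y for x, y in zip(left.counters, right.counters)]
    means = [sum(products[g * k:(g + 1) * k]) / k for g in range(left.groups)]
    return median(means)
```

In this arrangement, memory use depends only on `groups * per_group`, not on the stream length. Averaging within a group is meant to reduce the variance of the estimate, and taking the median across groups is meant to raise the confidence that the estimate is close to the true join size.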

Published In

SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data
June 2002
654 pages
ISBN:1581134975
DOI:10.1145/564691
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Article

Conference

SIGMOD/PODS02

Acceptance Rates

SIGMOD '02 Paper Acceptance Rate 42 of 240 submissions, 18%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Cited By

  • (2024) Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries. Proceedings of the ACM on Management of Data 2:3 (1-26). DOI: 10.1145/3654932. Online publication date: 30-May-2024
  • (2024) Unbiased Real-Time Traffic Sketching. IEEE Transactions on Network Science and Engineering 11:3 (2371-2383). DOI: 10.1109/TNSE.2023.3284004. Online publication date: May-2024
  • (2023) JoinSketch: A Sketch Algorithm for Accurate and Unbiased Inner-Product Estimation. Proceedings of the ACM on Management of Data 1:1 (1-26). DOI: 10.1145/3588935. Online publication date: 30-May-2023
  • (2022) Persistent Summaries. ACM Transactions on Database Systems 47:3 (1-42). DOI: 10.1145/3531053. Online publication date: 18-Aug-2022
  • (2022) Efficient Transmission and Reconstruction of Dependent Data Streams via Edge Sampling. 2022 IEEE International Conference on Cloud Engineering (IC2E) (47-57). DOI: 10.1109/IC2E55432.2022.00013. Online publication date: Sep-2022
  • (2021) Experimental Comparison of ATS Algorithm for Wireless Sensor Network. 2021 40th Chinese Control Conference (CCC) (5708-5713). DOI: 10.23919/CCC52363.2021.9550192. Online publication date: 26-Jul-2021
  • (2021) COMPASS: Online Sketch-based Query Optimization for In-Memory Databases. Proceedings of the 2021 International Conference on Management of Data (804-816). DOI: 10.1145/3448016.3452840. Online publication date: 9-Jun-2021
  • (2021) Weighted Distinct Sampling: Cardinality Estimation for SPJ Queries. Proceedings of the 2021 International Conference on Management of Data (1465-1477). DOI: 10.1145/3448016.3452821. Online publication date: 9-Jun-2021
  • (2020) Delegation sketch. Proceedings of the Fifteenth European Conference on Computer Systems (1-16). DOI: 10.1145/3342195.3387542. Online publication date: 15-Apr-2020
  • (2020) Efficient Join Synopsis Maintenance for Data Warehouse. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (2027-2042). DOI: 10.1145/3318464.3389717. Online publication date: 11-Jun-2020