DOI: 10.1145/2463676.2465312
research-article

Quantiles over data streams: an experimental study

Published: 22 June 2013

Abstract

A fundamental problem in data management and analysis is to generate descriptions of the distribution of data. It is most common to give such descriptions in terms of the cumulative distribution, which is characterized by the quantiles of the data. The design and engineering of efficient methods to find these quantiles has attracted much study, especially in the case where the data is described incrementally, and we must compute the quantiles in an online, streaming fashion. Yet while such algorithms have proved to be tremendously useful in practice, there has been limited formal comparison of the competing methods, and no comprehensive study of their performance. In this paper, we remedy this deficit by providing a taxonomy of different methods and describing efficient implementations. In doing so, we propose and analyze variations that have not been explicitly studied before, yet which turn out to perform the best. To illustrate this, we provide detailed experimental comparisons demonstrating the tradeoffs between space, time, and accuracy for quantile computation.
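
To make the problem setting concrete, here is a minimal Python sketch of answering quantile queries over a stream in bounded space. It uses plain reservoir sampling, chosen only because it is short; it is not claimed to be any of the methods the paper compares, and the class name `ReservoirQuantiles` and its parameters `capacity` and `seed` are hypothetical names introduced for illustration.

```python
import random


class ReservoirQuantiles:
    """Illustrative only: keeps a uniform random sample of the stream.

    Quantile queries are answered from the sample, so answers are
    approximate once the stream is longer than `capacity`.
    """

    def __init__(self, capacity, seed=None):
        self.capacity = capacity          # maximum number of items stored
        self.sample = []                  # the current reservoir
        self.count = 0                    # number of items seen so far
        self.rng = random.Random(seed)

    def update(self, x):
        """Process one stream item in O(1) time."""
        self.count += 1
        if len(self.sample) < self.capacity:
            self.sample.append(x)
        else:
            # Keep the new item with probability capacity / count.
            j = self.rng.randrange(self.count)
            if j < self.capacity:
                self.sample[j] = x

    def quantile(self, phi):
        """Return an estimate of the phi-quantile, for 0 <= phi <= 1."""
        if not self.sample:
            raise ValueError("no items seen")
        ordered = sorted(self.sample)
        idx = min(int(phi * len(ordered)), len(ordered) - 1)
        return ordered[idx]


if __name__ == "__main__":
    summary = ReservoirQuantiles(capacity=2000, seed=1)
    values = list(range(100_000))
    random.Random(0).shuffle(values)
    for v in values:
        summary.update(v)
    print("estimated median:", summary.quantile(0.5))
```

The space/accuracy tradeoff mentioned in the abstract shows up directly here: a larger `capacity` uses more memory and tends to give estimates whose rank is closer to the requested one, while a stream no longer than `capacity` is stored in full and answered exactly.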


    Published In

    SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
    June 2013
    1322 pages
    ISBN:9781450320375
    DOI:10.1145/2463676
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data stream algorithms
    2. quantiles

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'13

    Acceptance Rates

    SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%
