
Mergeable summaries

Published: 21 May 2012

    Abstract

    We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be merged in a way like other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the data sets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations and ε-kernels.
    We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
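
To make the notion of merging concrete, the following is a minimal Python sketch of an MG (Misra-Gries) heavy-hitters summary with a merge step: add the counters of the two summaries, then, if more than k counters remain, subtract the (k+1)-th largest count from every counter and drop those that are no longer positive. The names `mg_summary` and `mg_merge` and the parameter `k` (the number of counters, on the order of 1/ε) are illustrative assumptions that do not appear on this page, so this should be read as a sketch of the idea and not as the paper's exact procedure.

```python
from collections import Counter


def mg_summary(items, k):
    """Build a Misra-Gries summary of `items` holding at most k counters."""
    counters = Counter()
    for x in items:
        counters[x] += 1
        if len(counters) > k:
            # Too many counters: decrement all of them by one
            # and discard any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def mg_merge(a, b, k):
    """Merge two MG summaries into one with at most k counters."""
    merged = Counter(a)
    merged.update(b)  # Counter.update adds counts key by key
    if len(merged) > k:
        # Subtract the (k+1)-th largest count from every counter
        # and keep only the counters that stay positive.
        cut = sorted(merged.values(), reverse=True)[k]
        merged = Counter(
            {key: c - cut for key, c in merged.items() if c > cut}
        )
    return merged


if __name__ == "__main__":
    left = ["a"] * 50 + ["b"] * 20 + list("cdefghij")
    right = ["a"] * 10 + ["k"] * 30 + list("lmnop")
    k = 3
    merged = mg_merge(mg_summary(left, k), mg_summary(right, k), k)
    print(merged.most_common())
```

Each stored count never exceeds the item's true frequency, and it falls short by at most (n - s)/(k+1), where n is the total number of items in both inputs and s is the sum of the stored counts. Because the merged summary has the same form as its inputs, it can be merged again, which is the property the abstract describes.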

    Published In

    PODS '12: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems
    May 2012
    332 pages
    ISBN:9781450312486
    DOI:10.1145/2213556

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. streaming algorithms
    2. summaries

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '12

    Acceptance Rates

    PODS '12 Paper Acceptance Rate 26 of 101 submissions, 26%;
    Overall Acceptance Rate 642 of 2,707 submissions, 24%

    Cited By

    • (2023) Out-of-Order Sliding-Window Aggregation with Efficient Bulk Evictions and Insertions. Proceedings of the VLDB Endowment 16:11 (3227-3239). DOI: 10.14778/3611479.3611521. Online publication date: 1-Jul-2023
    • (2023) Efficient Framework for Operating on Data Sketches. Proceedings of the VLDB Endowment 16:8 (1967-1978). DOI: 10.14778/3594512.3594526. Online publication date: 1-Apr-2023
    • (2023) LAQy: Efficient and Reusable Query Approximations via Lazy Sampling. Proceedings of the ACM on Management of Data 1:2 (1-26). DOI: 10.1145/3589319. Online publication date: 20-Jun-2023
    • (2023) Quancurrent: A Concurrent Quantiles Sketch. Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures (15-25). DOI: 10.1145/3558481.3591074. Online publication date: 17-Jun-2023
    • (2023) JanusAQP: Efficient Partition Tree Maintenance for Dynamic Approximate Query Processing. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (572-584). DOI: 10.1109/ICDE55515.2023.00050. Online publication date: Apr-2023
    • (2022) SpaceSaving±. Proceedings of the VLDB Endowment 15:6 (1215-1227). DOI: 10.14778/3514061.3514068. Online publication date: 1-Feb-2022
    • (2022) Fast Concurrent Data Sketches. ACM Transactions on Parallel Computing 9:2 (1-35). DOI: 10.1145/3512758. Online publication date: 11-Apr-2022
    • (2021) On the algebra of data sketches. Proceedings of the VLDB Endowment 14:9 (1655-1667). DOI: 10.14778/3461535.3461553. Online publication date: 22-Oct-2021
    • (2021) KLL± approximate quantile sketches over dynamic datasets. Proceedings of the VLDB Endowment 14:7 (1215-1227). DOI: 10.14778/3450980.3450990. Online publication date: 1-Mar-2021
    • (2021) Scotch. Proceedings of the VLDB Endowment 14:3 (281-293). DOI: 10.14778/3430915.3430919. Online publication date: 9-Dec-2021
