
Mergeable summaries

Published: 04 December 2013

    Abstract

    We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two datasets, there is a way to merge the two summaries into a single summary on the two datasets combined together, while preserving the error and size guarantees. This property means that the summaries can be merged in a way akin to other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the datasets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this article, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations, which permit approximate multidimensional range counting queries. While most of the results in this article are theoretical in nature, some of the algorithms are actually very simple and even perform better than the previously best known algorithms, which we demonstrate through experiments in a simulated sensor network.
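
    To make the heavy-hitters claim concrete, the following is a minimal Python sketch of one way two Misra-Gries (MG) summaries can be merged: add the counters item by item, then subtract the (k+1)-th largest combined value from every counter and drop whatever is no longer positive. The abstract does not spell out the procedure, so the function names (`mg_update`, `mg_build`, `mg_merge`) and the parameter `k` (the number of counters, on the order of 1/ε) are illustrative assumptions, not the authors' code.

```python
from collections import Counter


def mg_update(summary, item, k):
    """Feed one stream item into a Misra-Gries summary holding at most k counters."""
    if item in summary:
        summary[item] += 1
    elif len(summary) < k:
        summary[item] = 1
    else:
        # Summary is full: decrement every counter and drop those that hit zero.
        for key in list(summary):
            summary[key] -= 1
            if summary[key] == 0:
                del summary[key]


def mg_build(stream, k):
    """Build a Misra-Gries summary of an iterable of hashable items."""
    summary = {}
    for item in stream:
        mg_update(summary, item, k)
    return summary


def mg_merge(a, b, k):
    """Merge two k-counter summaries into one summary with at most k counters."""
    combined = Counter(a)
    combined.update(b)  # add counters item by item
    if len(combined) <= k:
        return dict(combined)
    # Value of the (k+1)-th largest combined counter.
    cut = sorted(combined.values(), reverse=True)[k]
    # Subtract it everywhere and keep only strictly positive counters.
    return {x: c - cut for x, c in combined.items() if c > cut}


if __name__ == "__main__":
    left = mg_build("aaabbbcdaae", k=2)
    right = mg_build("bbbbafgab", k=2)
    print(mg_merge(left, right, k=2))
```

    Under these assumptions each counter never overestimates the true frequency, and the underestimate is at most (n − Σ counters)/(k+1), where n is the combined stream length. That is the same form of guarantee a single MG summary gives, which is what lets merges be chained in any order.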
    We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
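
    The isomorphism in (2) can be checked empirically. The sketch below assumes the usual form of the correspondence, which the abstract itself does not state: a SpaceSaving summary with k+1 counters, after its minimum counter is subtracted from every counter, matches an MG summary with k counters on the same stream. The function names (`mg_run`, `ss_run`, `ss_to_mg`) are hypothetical.

```python
import random


def mg_run(stream, k):
    """Misra-Gries with at most k counters."""
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
        else:
            # Full: decrement all counters, dropping those that reach zero.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts


def ss_run(stream, k):
    """SpaceSaving with at most k counters."""
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
        else:
            # Full: evict an item with the minimum count and inherit its count.
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts


def ss_to_mg(ss, capacity):
    """Map a SpaceSaving summary of the given capacity to MG-style counters."""
    base = min(ss.values()) if len(ss) == capacity else 0
    return {x: c - base for x, c in ss.items() if c > base}


if __name__ == "__main__":
    rng = random.Random(1)
    stream = [rng.randint(1, 30) for _ in range(2000)]
    k = 8
    print(ss_to_mg(ss_run(stream, k + 1), k + 1) == mg_run(stream, k))
```

    Ties among minimum-count items in SpaceSaving only affect which item is evicted. Every such item maps to an MG count of zero, so the choice of tie-breaking rule does not change the comparison.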



    Published In

    ACM Transactions on Database Systems  Volume 38, Issue 4
    Invited papers issue
    November 2013
    294 pages
    ISSN:0362-5915
    EISSN:1557-4644
    DOI:10.1145/2539032

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 December 2013
    Accepted: 01 June 2013
    Revised: 01 April 2013
    Received: 01 October 2012
    Published in TODS Volume 38, Issue 4


    Author Tags

    1. Data summarization
    2. heavy hitters
    3. quantiles

    Qualifiers

    • Research-article
    • Research
    • Refereed
