DOI: 10.1145/2452376.2452456
research-article

HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm

Published: 18 March 2013

    Abstract

    Cardinality estimation has a wide range of applications and is of particular importance in database systems. Various algorithms have been proposed in the past, and the HyperLogLog algorithm is one of them. In this paper, we present a series of improvements to this algorithm that reduce its memory requirements and significantly increase its accuracy for an important range of cardinalities. We have implemented our proposed algorithm for a system at Google and evaluated it empirically, comparing it to the original HyperLogLog algorithm. Like HyperLogLog, our improved algorithm parallelizes perfectly and computes the cardinality estimate in a single pass.
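
For readers unfamiliar with the baseline, the following is a minimal Python sketch of a textbook HyperLogLog estimator, not the improved algorithm this paper proposes. It illustrates the two properties the abstract highlights: items are consumed in a single pass, and sketches built on separate partitions combine through a register-wise maximum. The class name, the SHA-1-derived 64-bit hash, and the default precision p=12 are assumptions made for illustration only.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with 2**p registers (illustrative, not the paper's variant)."""

    def __init__(self, p=12):
        if p < 7:
            # The alpha formula used in estimate() is only valid for m >= 128.
            raise ValueError("this sketch assumes p >= 7")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Assumption: a 64-bit hash taken from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> (64 - self.p)                 # first p bits select a register
        rest = x & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in `rest`
        # (64 - p + 1 when `rest` is zero).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Sketches over disjoint or overlapping partitions combine by
        # taking the register-wise maximum.
        if self.p != other.p:
            raise ValueError("precision mismatch")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


if __name__ == "__main__":
    left, right = HyperLogLog(), HyperLogLog()
    for i in range(100_000):
        (left if i % 2 == 0 else right).add(i)
    left.merge(right)
    print(round(left.estimate()))
```

With p=12 the sketch keeps 4096 registers, and the expected relative error of the raw estimator is roughly 1.04 / sqrt(4096), about 1.6%. The bias correction and memory-saving representations described in the abstract are deliberately left out here.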




        Published In

        EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology
        March 2013
        793 pages
        ISBN:9781450315975
        DOI:10.1145/2452376

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Qualifiers

        • Research-article

        Conference

        EDBT/ICDT '13

        Acceptance Rates

        Overall Acceptance Rate 7 of 10 submissions, 70%


        Cited By

        • (2024) UltraLogLog: A Practical and More Space-Efficient Alternative to HyperLogLog for Approximate Distinct Counting. Proceedings of the VLDB Endowment 17:7 (1655-1668). DOI: 10.14778/3654621.3654632. Online publication date: 1-Mar-2024
        • (2024) An LDP Compatible Sketch for Securely Approximating Set Intersection Cardinalities. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639281. Online publication date: 26-Mar-2024
        • (2024) TTLs Matter: Efficient Cache Sizing with TTL-Aware Miss Ratio Curves and Working Set Sizes. Proceedings of the Nineteenth European Conference on Computer Systems (387-404). DOI: 10.1145/3627703.3650066. Online publication date: 22-Apr-2024
        • (2024) Cardinality Counting in "Alcatraz": A Privacy-aware Federated Learning Approach. Proceedings of the ACM on Web Conference 2024 (3076-3084). DOI: 10.1145/3589334.3645655. Online publication date: 13-May-2024
        • (2024) From CountMin to Super kJoin Sketches for Flow Spread Estimation. IEEE Transactions on Network Science and Engineering 11:3 (2353-2370). DOI: 10.1109/TNSE.2023.3279665. Online publication date: May-2024
        • (2024) Multi-Resolution Odd Sketch for Mining Extended Jaccard Similarity of Dynamic Streaming sets. IEEE Transactions on Network Science and Engineering 11:3 (2399-2414). DOI: 10.1109/TNSE.2023.3275809. Online publication date: May-2024
        • (2024) Cardinality Estimation Adaptive Cuckoo Filters (CE-ACF): Approximate Membership Check and Distinct Query Count for High-Speed Network Monitoring. IEEE/ACM Transactions on Networking 32:2 (959-970). DOI: 10.1109/TNET.2023.3302306. Online publication date: 1-Apr-2024
        • (2024) Half-Xor: A Fully-Dynamic Sketch for Estimating the Number of Distinct Values in Big Tables. IEEE Transactions on Knowledge and Data Engineering 36:7 (3111-3125). DOI: 10.1109/TKDE.2024.3359710. Online publication date: Jul-2024
        • (2024) SuperGuardian. Information Systems 122:C. DOI: 10.1016/j.is.2024.102351. Online publication date: 1-May-2024
        • (2023) Sketch-flip-merge. Proceedings of the 40th International Conference on Machine Learning (12846-12865). DOI: 10.5555/3618408.3618930. Online publication date: 23-Jul-2023
