DOI: 10.1145/1807085.1807094
research-article

An optimal algorithm for the distinct elements problem

Published: 06 June 2010

Abstract

We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, network topology, and data mining. For a stream of indices in {1,...,n}, our algorithm computes a (1 ± ε)-approximation using an optimal O(1/ε² + log(n)) bits of space with 2/3 success probability, where 0<ε<1 is given. This probability can be amplified by independent repetition. Furthermore, our algorithm processes each stream update in O(1) worst-case time, and can report an estimate at any point midstream in O(1) worst-case time, thus settling both the space and time complexities simultaneously.
We also give an algorithm to estimate the Hamming norm of a stream, a generalization of the number of distinct elements, which is useful in data cleaning, packet tracing, and database auditing. Our algorithm uses nearly optimal space, and has optimal O(1) update and reporting times.
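
To make the problem statement and the amplification step concrete, the following is a minimal Python sketch. It is not the algorithm of this paper: it uses a simple k-minimum-values style estimator, which needs more space than the optimal bound and O(log k) rather than O(1) time per update. The class name `KMVSketch`, the function `estimate_distinct`, the parameters `k` and `repetitions`, and the use of BLAKE2b as a stand-in for a suitable hash family are all assumptions made for illustration. Choosing `k` on the order of 1/ε² gives roughly a (1 ± ε) guarantee per sketch, and taking the median over independent sketches is the standard way to raise a constant success probability.

```python
import hashlib
import heapq
import statistics


class KMVSketch:
    """Illustrative k-minimum-values sketch (not the paper's algorithm)."""

    def __init__(self, k, seed):
        self.k = k
        self._salt = seed.to_bytes(8, "little")  # one salt per repetition
        self._heap = []      # negated hashes: a max-heap of the k smallest
        self._kept = set()   # hashes currently stored, to skip duplicates

    def _hash01(self, item):
        # Assumption: a salted cryptographic hash behaves like a uniform
        # random value in (0, 1]; the paper uses other hash families.
        digest = hashlib.blake2b(
            repr(item).encode(), digest_size=8, salt=self._salt
        ).digest()
        return (int.from_bytes(digest, "little") + 1) / 2**64

    def add(self, item):
        h = self._hash01(item)
        if h in self._kept:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._kept.add(h)
        elif h < -self._heap[0]:
            # New hash beats the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._kept.discard(evicted)
            self._kept.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct hashes: exact count
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


def estimate_distinct(stream, k=256, repetitions=5):
    """One pass over the stream; median of independent sketches."""
    sketches = [KMVSketch(k, seed) for seed in range(repetitions)]
    for item in stream:
        for sketch in sketches:
            sketch.add(item)
    return statistics.median(sketch.estimate() for sketch in sketches)
```

For example, `estimate_distinct(i % 10000 for i in range(30000))` processes 30,000 updates over 10,000 distinct indices and should return a value near 10,000; increasing `repetitions` lowers the chance of a bad estimate, while increasing `k` tightens the error.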

      Published In

      PODS '10: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
      June 2010
      350 pages
      ISBN: 9781450300339
      DOI: 10.1145/1807085
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data mining
      2. distinct elements
      3. query optimization
      4. streaming

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '10: International Conference on Management of Data
      June 6 - 11, 2010
      Indianapolis, Indiana, USA

      Acceptance Rates

      PODS '10 Paper Acceptance Rate 27 of 113 submissions, 24%;
      Overall Acceptance Rate 642 of 2,707 submissions, 24%
