DOI: 10.1145/3357713.3384259
research-article

Fast hashing with strong concentration bounds

Published: 22 June 2020

Abstract

Previous work on tabulation hashing by Pǎtraşcu and Thorup, from STOC’11 on simple tabulation and from SODA’13 on twisted tabulation, offered Chernoff-style concentration bounds on hash-based sums, e.g., the number of balls/keys hashing to a given bin, but under quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of c=O(1) characters, e.g., a 64-bit key as c=8 characters of 8 bits each. The character domain Σ should be small enough that character tables of size |Σ| fit in fast cache. The schemes then use O(1) tables of this size, so the space of tabulation hashing is O(|Σ|). However, the concentration bounds by Pǎtraşcu and Thorup only apply if the expected sums are ≪ |Σ|.
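As a rough illustration of the character view described above, the following minimal Python sketch implements simple tabulation for 64-bit keys split into c=8 characters of 8 bits. The output width `OUT_BITS`, the seed, the helper names, and the final reduction to 16 bins are assumptions made for the example and are not taken from the paper.

```python
import random

C = 8                    # characters per key (c in the abstract)
CHAR_BITS = 8            # bits per character, so |Sigma| = 256
SIGMA = 1 << CHAR_BITS   # size of the character domain
OUT_BITS = 32            # assumed hash-value width, chosen for illustration

def make_tables(rng):
    """Build one table of |Sigma| random OUT_BITS-bit words per character position."""
    return [[rng.getrandbits(OUT_BITS) for _ in range(SIGMA)] for _ in range(C)]

def simple_tabulation(key, tables):
    """XOR together one table entry per character of a 64-bit key."""
    h = 0
    for i in range(C):
        ch = (key >> (CHAR_BITS * i)) & (SIGMA - 1)  # extract the i-th character
        h ^= tables[i][ch]
    return h

rng = random.Random(2020)            # fixed seed so the example is reproducible
tables = make_tables(rng)
h = simple_tabulation(0x0123456789ABCDEF, tables)
bin_index = h % 16                   # hypothetical use: pick one of 16 bins
```

The c tables together hold c·|Σ| words, which matches the O(|Σ|) space stated above for constant c.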
To see the problem, consider the very simple case where we use tabulation hashing to throw n balls into m bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if n=m, for then the expected value is 1. However, if m=2, as when tossing n unbiased coins, the expected value n/2 is ≫ |Σ| for large data sets, e.g., data sets that do not fit in fast cache.
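The two cases can be written out as a short calculation. It assumes each ball lands in the given bin with probability 1/m, so that the expected load X follows from linearity of expectation; the concrete values n = 10^6 and |Σ| = 2^8 are illustrative assumptions, not figures from the paper.

```latex
\mathbb{E}[X] = \frac{n}{m}, \qquad
n = m:\ \mathbb{E}[X] = 1 \ll |\Sigma|, \qquad
m = 2,\ n = 10^{6},\ |\Sigma| = 2^{8}:\ \mathbb{E}[X] = \frac{n}{2} = 5 \cdot 10^{5} \gg 256 .
```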
To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call tabulation-permutation hashing, which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.
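The abstract names tabulation-permutation hashing without defining it. The sketch below assumes the construction is simple tabulation into d output characters followed by an independent random permutation of Σ applied to each output character; this reading, the choice d=4, the seed, and all helper names are assumptions for illustration and should be checked against the full paper.

```python
import random

C, D = 8, 4              # input characters c; output characters d (d is assumed)
CHAR_BITS = 8
SIGMA = 1 << CHAR_BITS
MASK = SIGMA - 1

def make_scheme(rng):
    """Return (tab, perms): simple-tabulation tables and per-character permutations."""
    # Round 1: simple tabulation from c input characters to d output characters.
    tab = [[rng.getrandbits(CHAR_BITS * D) for _ in range(SIGMA)] for _ in range(C)]
    # Round 2: one random permutation of the character domain per output position.
    perms = []
    for _ in range(D):
        p = list(range(SIGMA))
        rng.shuffle(p)
        perms.append(p)
    return tab, perms

def tabulation_permutation(key, tab, perms):
    """Apply simple tabulation, then permute each output character independently."""
    g = 0
    for i in range(C):
        g ^= tab[i][(key >> (CHAR_BITS * i)) & MASK]
    h = 0
    for j in range(D):
        ch = (g >> (CHAR_BITS * j)) & MASK
        h |= perms[j][ch] << (CHAR_BITS * j)
    return h

rng = random.Random(52)              # fixed seed so the example is reproducible
tab, perms = make_scheme(rng)
key = 0x0123456789ABCDEF
value = tabulation_permutation(key, tab, perms)
```

Under this reading, evaluation costs c lookups in the first round and d in the second, which would be consistent with the "at most twice as slow" claim when d ≤ c.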

References

[1]
Anders Aamand, Evangelos Kipouridis, Jakob B. T. Knudsen, Peter M. R. Rasmussen, and Mikkel Thorup. 2020. No Repetitions: Fast Streaming with Highly Concentrated Hashing. ArXiv, abs/2004.01156.
[2]
Anders Aamand, Jakob Bæk Tejs Knudsen, Mathias B. T. Knudsen, Peter M. R. Rasmussen, and Mikkel Thorup. 2019. Fast hashing with Strong Concentration Bounds. ArXiv, abs/1905.00369 (2019).
[3]
Arne Andersson, Peter Bro Miltersen, Søren Riis, and Mikkel Thorup. 1996. Static Dictionaries on AC^0 RAMs: Query Time Θ(√(log n / log log n)) is Necessary and Sufficient. In 37th Annual Symposium on Foundations of Computer Science (FOCS). 441–450. https://doi.org/10.1109/SFCS.1996.548503
[4]
Austin Appleby. 2016. MurmurHash3.
[5]
Jean-Philippe Aumasson, Samuel Neves, Zooko Wilcox-O’Hearn, and Christian Winnerlein. 2013. BLAKE2: Simpler, Smaller, Fast as MD5. In Applied Cryptography and Network Security, Michael Jacobson, Michael Locasto, Payman Mohassel, and Reihaneh Safavi-Naini (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg. 119–135. isbn:978-3-642-38980-1
[6]
Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. 2002. Counting Distinct Elements in a Data Stream. In International Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM). 1–10.
[7]
George Bennett. 1962. Probability Inequalities for the Sum of Independent Random Variables. J. Amer. Statist. Assoc., 57, 297 (1962), 33–45. https://doi.org/10.1080/01621459.1962.10482149
[8]
Sergei Natanovich Bernstein. 1924. On a modification of Chebyshev’s inequality and of the error formula of Laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math., 38–49.
[9]
Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Compression and Complexity of Sequences (SEQUENCES). 21–29.
[10]
Larry Carter and Mark N. Wegman. 1979. Universal classes of hash functions. J. Comput. System Sci., 18, 2 (1979), 143–154. Announced at STOC’77.
[11]
L. Elisa Celis, Omer Reingold, Gil Segev, and Udi Wieder. 2011. Balls and Bins: Smaller Hash Families and Faster Evaluation. In 52nd Annual Symposium on Foundations of Computer Science (FOCS). 599–608.
[12]
Ashok K. Chandra, Larry J. Stockmeyer, and Uzi Vishkin. 1984. Constant Depth Reducibility. SIAM J. Comput., 13, 2 (1984), 423–439. https://doi.org/10.1137/0213028
[13]
Herman Chernoff. 1952. A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the sum of Observations. Annals of Mathematical Statistics, 23, 4 (1952), 493–507.
[14]
Tobias Christiani and Rasmus Pagh. 2014. Generating k-Independent Variables in Constant Time. In 55th Annual Symposium on Foundations of Computer Science (FOCS). 196–205.
[15]
Tobias Christiani, Rasmus Pagh, and Mikkel Thorup. 2015. From independence to expansion and back again. In Proceedings of the 47th ACM Symposium on Theory of Computing (STOC).
[16]
Kai-Min Chung, Michael Mitzenmacher, and Salil Vadhan. 2013. Why simple hash functions work: Exploiting the entropy in a data stream. Theory of Computing, 9, 1 (2013), 897–945.
[17]
Søren Dahlgaard, Mathias Bæk Tejs Knudsen, Eva Rotenberg, and Mikkel Thorup. 2015. Hashing for Statistics over K-Partitions. In 56th Annual Symposium on Foundations of Computer Science (FOCS). 1292–1310. https://doi.org/10.1109/FOCS.2015.83
[18]
Søren Dahlgaard, Mathias Bæk Tejs Knudsen, and Mikkel Thorup. 2017. Practical Hash Functions for Similarity Estimation and Dimensionality Reduction. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS). Curran Associates Inc., 6618–6628. isbn:978-1-5108-6096-4 http://dl.acm.org/citation.cfm?id=3295222.3295407
[19]
Martin Dietzfelbinger. 1996. Universal hashing and k-wise independent random variables via integer arithmetic without primes. In Proceedings of the 13th Symposium on Theoretical Aspects of Computer Science (STACS). 569–580.
[20]
Martin Dietzfelbinger and Friedhelm Meyer auf der Heide. 1992. Dynamic Hashing in Real Time. In Informatik, Festschrift zum 60. Geburtstag von Günter Hotz. 95–119. https://doi.org/10.1007/978-3-322-95233-2_7
[21]
Martin Dietzfelbinger and Michael Rink. 2009. Applications of a Splitting Trick. In Proceedings of the 36th International Colloquium on Automata, Languages and Programming (ICALP). 354–365.
[22]
Martin Dietzfelbinger and Christoph Weidling. 2007. Balanced allocation and dictionaries with tightly packed constant size bins. Theor. Comput. Sci., 380 (2007), 47–68. https://doi.org/10.1016/j.tcs.2007.02.054
[23]
Martin Dietzfelbinger and Philipp Woelfel. 2003. Almost Random Graphs with Simple Hash Functions. In Proceedings of the 35th ACM Symposium on Theory of Computing (STOC). 629–638.
[24]
A. I. Dumey. 1956. Indexing for rapid random access memory systems. Computers and Automation, 5, 12 (1956), 6–9.
[25]
Dimitris Fotakis, Rasmus Pagh, Peter Sanders, and Paul Spirakis. 2005. Space Efficient Hash Tables with Worst Case Constant Access Time. Theory of Computing Systems, 38, 2 (2005), 229–248. https://doi.org/10.1007/s00224-004-1195-x
[26]
Parikshit Gopalan, Daniel M. Kane, and Raghu Meka. 2018. Pseudorandomness via the Discrete Fourier Transform. SIAM J. Comput., 47, 6 (2018), 2451–2487. https://doi.org/10.1137/16M1062132
[27]
Torben Hagerup and Torsten Tholey. 2001. Efficient Minimal Perfect Hashing in Nearly Minimal Space. In Proceedings of the 18th Symposium on Theoretical Aspects of Computer Science (STACS). 317–326.
[28]
John L. Hennessy and David A. Patterson. 2012. Computer Architecture - A Quantitative Approach, 5th Edition. Morgan Kaufmann. isbn:978-0-12-383872-8
[29]
Donald E. Knuth. 1963. Notes on open addressing. Unpublished memorandum. See http://citeseer.ist.psu.edu/knuth63notes.html.
[30]
Balachander Krishnamurthy, Subhabrata Sen, Yin Zhang, and Yan Chen. 2003. Sketch-based change detection: methods, evaluation, and applications. In Proceedings of the 3rd Internet Measurement Conference (IMC). 234–247. https://doi.org/10.1145/948205.948236
[31]
Daniel Lemire and Owen Kaser. 2016. Faster 64-bit universal hashing using carry-less multiplications. Journal of Cryptographic Engineering, 6, 3 (2016), 171–185. issn:2190-8516 https://doi.org/10.1007/s13389-015-0110-5
[32]
Yishay Mansour, Noam Nisan, and Prasoon Tiwari. 1993. The Computational Complexity of Universal Hashing. Theor. Comput. Sci., 107, 1 (1993), 121–133. https://doi.org/10.1016/0304-3975(93)90257-T
[33]
Raghu Meka, Omer Reingold, Guy N. Rothblum, and Ron D. Rothblum. 2014. Fast Pseudorandomness for Independence and Load Balancing - (Extended Abstract). In Proceedings of the 41st International Colloquium on Automata, Languages and Programming (ICALP). 859–870.
[34]
Peter Bro Miltersen. 1996. Lower Bounds for Static Dictionaries on RAMs with Bit Operations But No Multiplication. In Proceedings of the 23rd International Colloquium on Automata, Languages and Programming (ICALP). 442–453. https://doi.org/10.1007/3-540-61440-0_149
[35]
Rajeev Motwani and Prabhakar Raghavan. 1995. Randomized Algorithms. Cambridge University Press.
[36]
Anna Pagh and Rasmus Pagh. 2008. Uniform Hashing in Constant Time and Optimal Space. SIAM J. Comput., 38, 1 (2008), 85–96.
[37]
Mihai Pǎtraşcu and Mikkel Thorup. 2012. The Power of Simple Tabulation-Based Hashing. J. ACM, 59, 3 (2012), Article 14. Announced at STOC’11.
[38]
Mihai Pǎtraşcu and Mikkel Thorup. 2016. On the k-Independence Required by Linear Probing and Minwise Independence. ACM Trans. Algorithms, 12, 1 (2016), 8:1–8:27.
[39]
Geoff Pike and Jyrki Alakuijala. 2011. Introducing cityhash. https://opensource.googleblog.com/2011/04/introducing-cityhash.html
[40]
Mihai Pǎtraşcu and Mikkel Thorup. 2013. Twisted Tabulation Hashing. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 209–228.
[41]
Jeanette P. Schmidt, Alan Siegel, and Aravind Srinivasan. 1995. Chernoff-Hoeffding bounds for applications with limited independence. SIAM Journal on Discrete Mathematics, 8, 2 (1995), 223–250. Announced at SODA’93.
[42]
Alan Siegel. 2004. On Universal Classes of Extremely Random Constant-Time Hash Functions. SIAM J. Comput., 33, 3 (2004), 505–543. Announced at FOCS’89.
[43]
Mikkel Thorup. 2013. Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence. In 54th Annual Symposium on Foundations of Computer Science (FOCS). 90–99.
[44]
Mikkel Thorup. 2015. High Speed Hashing for Integers and Strings. ArXiv, abs/1504.06804 (2015).
[45]
Mikkel Thorup and Yin Zhang. 2012. Tabulation-Based 5-Independent Hashing with Applications to Linear Probing and Second Moment Estimation. SIAM J. Comput., 41, 2 (2012), 293–331. Announced at SODA’04 and ALENEX’10.
[46]
Mark N. Wegman and Larry Carter. 1981. New Classes and Applications of Hash Functions. J. Comput. System Sci., 22, 3 (1981), 265–279. Announced at FOCS’79.
[47]
Albert Lindsey Zobrist. 1970. A New Hashing Method with Application for Game Playing. Computer Sciences Department, University of Wisconsin, Madison, Wisconsin.

Cited By

  • (2024) FairHash: A Fair and Memory/Time-efficient Hashmap. Proceedings of the ACM on Management of Data 2:3 (1-29). DOI: 10.1145/3654939. Online publication date: 30-May-2024
  • (2023) Locally Uniform Hashing. 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS) (1440-1470). DOI: 10.1109/FOCS57990.2023.00089. Online publication date: 6-Nov-2023
  • (2022) No Repetition. Proceedings of the VLDB Endowment 15:13 (3989-4001). DOI: 10.14778/3565838.3565851. Online publication date: 1-Sep-2022

Published In

STOC 2020: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing
June 2020
1429 pages
ISBN:9781450369794
DOI:10.1145/3357713

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Chernoff bounds
  2. concentration bounds
  3. hashing
  4. sampling
  5. streaming algorithms

Qualifiers

  • Research-article

Conference

STOC '20

Acceptance Rates

Overall Acceptance Rate 1,469 of 4,586 submissions, 32%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 13
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 10 Nov 2024
