DOI: 10.1145/1321440.1321532
research-article

External perfect hashing for very large key sets

Published: 06 November 2007

Abstract

We present a simple and efficient external perfect hashing scheme (referred to as the EPH algorithm) for very large static key sets. We combine a number of techniques from the literature into a novel scheme that is theoretically well understood and, at the same time, handles problem sizes an order of magnitude larger than previous "practical" methods. We demonstrate the scalability of our algorithm by constructing minimal perfect hash functions for a set of 1.024 billion URLs from the World Wide Web, of average length 64 characters, in approximately 62 minutes on a commodity PC. Our scheme produces minimal perfect hash functions using approximately 3.8 bits per key. For perfect hash functions in the range {0,...,2n - 1}, where n is the number of keys, the space usage drops to approximately 2.7 bits per key. The main contribution is the first algorithm that has experimentally proven practicality for sets on the order of billions of keys and whose time and space usage are carefully analyzed without unrealistic assumptions.
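
To make the term concrete, the following toy sketch builds a minimal perfect hash function for a small in-memory key set: every key maps to a distinct slot in {0,...,n - 1}. It is not the EPH algorithm described in the abstract. It assumes a simple bucket-and-seed search, and the names `_h`, `build_mphf`, and `lookup`, the `bucket_size` parameter, and the use of BLAKE2 as the underlying hash are all illustrative choices made here, not taken from the paper. Keys must be distinct, otherwise the seed search never terminates.

```python
import hashlib


def _h(key: str, seed: int, m: int) -> int:
    """Seeded hash of `key` into {0, ..., m - 1} (illustrative choice: BLAKE2b)."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % m


def build_mphf(keys, bucket_size=4):
    """Return one seed per bucket so that all keys land in distinct slots 0..n-1.

    Assumes the keys are distinct; duplicates would make the search loop forever.
    """
    n = len(keys)
    num_buckets = max(1, (n + bucket_size - 1) // bucket_size)
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[_h(k, 0, num_buckets)].append(k)  # seed 0 selects the bucket

    seeds = [0] * num_buckets
    taken = [False] * n
    # Place the largest buckets first, while most slots are still free.
    for b in sorted(range(num_buckets), key=lambda i: -len(buckets[i])):
        if not buckets[b]:
            continue
        seed = 1
        while True:
            slots = [_h(k, seed, n) for k in buckets[b]]
            # Accept the seed only if the bucket's keys hit distinct, free slots.
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                break
            seed += 1
        for s in slots:
            taken[s] = True
        seeds[b] = seed
    return seeds


def lookup(key: str, seeds, n: int) -> int:
    """Evaluate the function: pick the key's bucket, then hash with its seed."""
    return _h(key, seeds[_h(key, 0, len(seeds))], n)


if __name__ == "__main__":
    urls = [f"http://example.com/page{i}" for i in range(100)]  # made-up keys
    seeds = build_mphf(urls)
    slots = sorted(lookup(u, seeds, len(urls)) for u in urls)
    assert slots == list(range(len(urls)))  # a bijection onto 0..n-1
```

The function itself is just the list of seeds, which is why space is reported in bits per key. At the figures quoted in the abstract, 1.024 billion keys at about 3.8 bits per key come to roughly 486 MB, and at about 2.7 bits per key to roughly 346 MB.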

Published In

CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management
November 2007
1048 pages
ISBN: 9781595938039
DOI: 10.1145/1321440
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. functions
  2. hash
  3. key sets
  4. large
  5. minimal
  6. perfect

Qualifiers

  • Research-article

Conference

CIKM07

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 6
  • Downloads (last 6 weeks): 1
Reflects downloads up to 03 Sep 2024

Cited By

  • (2022) Can Learned Models Replace Hash Functions? Proceedings of the VLDB Endowment 16:3 (532-545). DOI: 10.14778/3570690.3570702. Online publication date: 1-Nov-2022
  • (2022) Vectorising k-Truss Decomposition for Simple Multi-Core and SIMD Acceleration. 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA) (1-6). DOI: 10.1109/IISA56318.2022.9904350. Online publication date: 18-Jul-2022
  • (2020) Data-Parallel Hashing Techniques for GPU Architectures. IEEE Transactions on Parallel and Distributed Systems 31:1 (237-250). DOI: 10.1109/TPDS.2019.2929768. Online publication date: 1-Jan-2020
  • (2018) Information and data management at PUC-Rio and UFMG. Proceedings of the VLDB Endowment 11:12 (2114-2129). DOI: 10.14778/3229863.3240490. Online publication date: 1-Aug-2018
  • (2017) Fast Plagiarism Detection in Large-Scale Data. Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation (329-343). DOI: 10.1007/978-3-319-58274-0_27. Online publication date: 27-Apr-2017
  • (2016) Research on the privacy-preserving retrieval over ciphertext on cloud. 2016 6th International Conference on Information Communication and Management (ICICM) (100-104). DOI: 10.1109/INFOCOMAN.2016.7784223. Online publication date: Oct-2016
  • (2016) External-Memory State Space Search. Algorithm Engineering (185-225). DOI: 10.1007/978-3-319-49487-6_6. Online publication date: 11-Nov-2016
  • (2014) Symbolic and explicit search hybrid through perfect hash functions - a case study in connect four. Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling (101-110). DOI: 10.5555/3038794.3038807. Online publication date: 21-Jun-2014
  • (2014) Retrieval and Perfect Hashing Using Fingerprinting. Proceedings of the 13th International Symposium on Experimental Algorithms - Volume 8504 (138-149). DOI: 10.1007/978-3-319-07959-2_12. Online publication date: 29-Jun-2014
  • (2013) Linked Data in Enterprise Integration. Big Data Computing (169-203). DOI: 10.1201/b16014-8. Online publication date: 4-Nov-2013
