
Near-Optimal Space Perfect Hashing Algorithms

2008

Fabiano C. Botelho (1,2) and Nivio Ziviani (1)
(1) Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil
(2) Department of Computer Engineering, Federal Center for Technological Education of Minas Gerais, Belo Horizonte, Brazil
fabiano@decom.cefetmg.br, nivio@dcc.ufmg.br

Abstract. A perfect hash function (PHF) is an injective function that maps keys from a set S to unique values. Since no collisions occur, each key can be retrieved from a hash table with a single probe. A minimal perfect hash function (MPHF) is a PHF with the smallest possible range, that is, the hash table size is exactly the number of keys in S. Differently from other hashing schemes, MPHFs completely avoid the problem of wasted space and wasted time to deal with collisions. The study of perfect hash functions started in the early 80s, when it was proved that the information theoretic lower bound to describe a minimal perfect hash function is approximately 1.44 bits per key. Although the proof indicates that it would be possible to build an algorithm capable of generating optimal functions, no one was able to obtain a practical algorithm that could be used in real applications. Thus, there was a gap between theory and practice. The main result of the thesis filled this gap, lowering the space complexity to represent MPHFs that are useful in practice from O(n log n) to O(n) bits. This allows the use of perfect hashing in applications for which it was not previously considered a good option. This explicit construction of PHFs is something that the data structures and algorithms community has been looking for since the 1980s.

1. Introduction

The need to access items based on the value of a key is ubiquitous in Computer Science. Some types of databases are updated only rarely, typically by periodic batch updates. This happens for most data warehousing applications (see [Seltzer 2005] for more examples and discussion).
In applications where the key set is fixed for a long period of time, the construction of a minimal perfect hash function can be done as part of the preprocessing phase. For example, On-Line Analytical Processing applications use extensive preprocessing of data to allow very fast evaluation of certain types of queries. More formally, given a static key set S ⊆ U of size n from a key universe U of size u, where each key is associated with satellite data, the question we are interested in is: what are the data structures that provide the best tradeoff between space usage and lookup time?

Perfect hashing is a space-efficient way of creating a compact representation for a static set S of n keys. For applications with successful searches (a successful search happens when the queried key is found in the hash table, and an unsuccessful search happens otherwise), the representation of a key x ∈ S is simply the value h(x), where h is a perfect hash function (PHF) for the set S of values considered. The word "perfect" refers to the fact that the function maps the elements of S to unique values (i.e., it is one-to-one on S). A minimal perfect hash function (MPHF) produces values that are integers in the range [0, n − 1], which is the smallest possible range. Figure 1(a) illustrates a PHF and Figure 1(b) illustrates an MPHF.

[Figure 1. (a) Perfect hash function. (b) Minimal perfect hash function.]

The study of perfect hash functions started in the early 80s, when it was proved that the information theoretic lower bound to describe a minimal perfect hash function is approximately 1.44 bits per key [Mehlhorn 1984]. Although the proof indicates that it would be possible to build an algorithm capable of generating optimal functions, no one was able to obtain a practical algorithm that could be used in real applications. Thus, there was a gap between theory and practice.
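To make these definitions concrete, here is a toy sketch (ours, not from the thesis; names and constants are illustrative) that builds an MPHF for a tiny set by brute force, simply searching for a hash seed that happens to map the n keys bijectively onto [0, n − 1]:

```python
import hashlib

def slot(seed, key, n):
    # position of `key` in a table of size n, under hash seed `seed`
    d = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % n

def toy_mphf_seed(keys):
    # brute-force a seed that sends the n keys to n distinct slots;
    # injective into [0, n - 1] means bijective, i.e., a minimal PHF
    n = len(keys)
    for seed in range(1_000_000):
        if len({slot(seed, k, n) for k in keys}) == n:
            return seed
    raise RuntimeError("no perfect seed found")

keys = ["jan", "feb", "mar", "apr", "may", "jun"]
seed = toy_mphf_seed(keys)
assert sorted(slot(seed, k, len(keys)) for k in keys) == list(range(len(keys)))
```

As a back-of-envelope connection to the bound quoted above: a random seed succeeds with probability n!/nⁿ ≈ e⁻ⁿ, so the expected magnitude of the first successful seed grows like eⁿ, and writing that seed down costs roughly n·log₂e ≈ 1.44n bits. This is why brute force matches the information theoretic minimum only at toy sizes and a real algorithm is needed beyond them.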
The main result of the thesis filled this gap, lowering the space complexity to represent minimal perfect hash functions that are useful in practice from O(n log n) to O(n) bits. This allows the use of perfect hashing in applications for which it was not previously considered a good option. This explicit construction of PHFs is something that the data structures and algorithms community has been looking for since the 1980s, as noted by a reviewer of a prior submission: "Taking into account the fact that people had been looking for such constructions all the time since the 1980s, this is a big achievement and might make the central result of the paper a candidate for...".

The remainder of this paper is organized as follows. Section 2 discusses the main contributions. Section 3 discusses the impact of the results. Section 4 presents the conclusions. Section 5 discusses some ongoing work and future directions.

2. Key Contributions

The attractiveness of using PHFs and MPHFs depends on the following issues [Hagerup and Tholey 2001]: (i) the amount of CPU time required for generating the functions; (ii) the space requirements for generating the functions; (iii) the amount of CPU time required by the functions for each retrieval; and (iv) the space requirements of the description of the resulting functions to be used at retrieval time. No previously known algorithm performs well for all these requirements. Usually, the space requirement for generating the functions is overlooked, which is why the algorithms in the literature cannot scale to key sets on the order of billions of keys. Also, as mentioned before, there is a gap between theory and practice on perfect hashing algorithms [Botelho 2008]. The main contributions of the thesis are:

1. We present a simple, practical and highly scalable perfect hashing algorithm that takes into account the four aforementioned requirements [Botelho et al. 2007, Botelho and Ziviani 2007, Botelho et al. 2009b]. When the input key set fits in the internal memory available, it becomes an internal random access memory algorithm, referred to as RAM algorithm from now on; otherwise, it becomes an external memory algorithm, referred to as EM algorithm from now on.

2. We provide a scalable parallel implementation of the EM algorithm, referred to as parallel external memory (PEM) algorithm from now on [Botelho et al. 2008a].

3. We present techniques that allow the generation of PHFs and MPHFs based on random graphs containing cycles [Botelho et al. 2005].

4. We show that the PHFs and MPHFs we have designed can now be used in applications for which they were not considered a good option in the past. In [Botelho et al. 2008b, Botelho et al. 2009a] we show that MPHFs provide the best tradeoff between space usage and lookup time when compared to other hashing schemes for indexing internal memory when static key sets are involved.

5. We have created the C Minimal Perfect Hashing Library [Botelho et al. 2006], referred to as CMPH Library from now on, a free software library available under the GNU Lesser General Public License (LGPL). The library was conceived for two reasons. First, we wanted to make our algorithms available in order to test their applicability in practice. Second, we realized that there was a lack of similar libraries in the open source community.

We now describe the key contributions in the order they appear in the original thesis document [Botelho 2008]. For the sake of space, we do not provide extended details about each contribution; please check the thesis document for details about the algorithms and implementations related to each contribution.

2.1. Random Access Memory and External Memory Algorithms

The RAM algorithm [Botelho et al. 2007, Botelho et al. 2009b] works on acyclic random graphs given by function values of uniform hash functions on the keys of S (see [Botelho 2008] for the definition of uniform hashing).
The idea of basing perfect hashing on acyclic random graphs is not new, see e.g. [Majewski et al. 1996], but we proceed differently to achieve a space usage of O(1) bits per key rather than O(log n) bits per key. We use r hash functions and acyclic hypergraphs with hyperedges e(x) = {h_0(x), ..., h_{r-1}(x)}, for x a key, but add two tricks: (i) to each key x we assign an element h_{i(x)}(x) of e(x) such that the assignment x ↦ h_{i(x)}(x) is one-to-one on S; (ii) we use a linear equation to calculate the index i(x) ∈ [0, r − 1] from x. This makes it possible to obtain a space usage of c(r)⌈log(r + 1)⌉ bits per key, for certain numbers c(2), c(3), ...; the value that minimizes the cost per key is r = 3. The connection to acyclic random graphs allows us to perform a tight analysis and to optimize the space usage constant by using appropriate succinct data structures in a theoretically sound way.

The EM algorithm [Botelho and Ziviani 2007, Botelho et al. 2009b] is the result of a careful engineering process that uses a number of techniques from the literature to allow the generation of PHFs or MPHFs for sets on the order of billions of keys. The EM algorithm is a first step toward bridging the gap between theory and practice on perfect hashing: it is the first algorithm that can be used in practice, has time and space usage carefully analyzed without unrealistic assumptions, and scales to billions of keys. We have designed two versions of the EM algorithm. The first uses the hash functions described in [Botelho 2008], which guarantee that the EM algorithm can be made to work for every key set. The second uses the faster and more compact pseudo-random hash functions proposed in [Jenkins 1997]; it is referred to as the heuristic EM algorithm, or simply HEM algorithm from now on, because it is not guaranteed to work for every key set. However, limited randomness often suffices in practice [Alon et al. 1999], and the HEM algorithm has worked for all key sets we have applied it to.

The RAM and EM algorithms generate, in linear time, PHFs and MPHFs that are evaluated in O(1) time. The space requirements to describe the resulting functions depend on the relation between m and n. For m = n, the space usage is approximately 2.62n bits for the RAM algorithm and approximately 3.3n bits for the EM algorithm. For m = 1.23n, the space usage is approximately 1.95n bits for the RAM algorithm and approximately 2.7n bits for the EM algorithm. In all cases, this is within a small constant factor of the information theoretical minimum of approximately 1.44n bits for MPHFs and 0.89n bits for PHFs, something that had not been achieved by previous algorithms, except asymptotically for very large n.

The main practical perfect hashing algorithms we found in the literature to compare the RAM, EM and HEM algorithms with are: Botelho, Kohayakawa and Ziviani [Botelho et al. 2005] (referred to as BKZ), Fox, Chen and Heath [Fox et al. 1992] (referred to as FCH), Majewski, Wormald, Havas and Czech [Majewski et al. 1996] (referred to as MWHC), and Pagh [Pagh 1999] (referred to as PAGH). For the MWHC algorithm we used the version based on random hypergraphs with r = 3. We did not consider the one that uses random graphs with r = 2 because it is shown in [Botelho et al. 2005] that the BKZ algorithm outperforms it.

Table 1 shows that the RAM (for r = 3), EM and HEM algorithms are the fastest ones to generate the functions, and that the resulting functions are the most compact. The performance of both the EM and HEM algorithms is quite surprising, since they use external memory at generation time and the other algorithms do not. However, as the key set is stored in external memory, all the other algorithms scan the whole key set every time a failure occurs, whereas both the EM and HEM algorithms scan the whole key set only once and map it to a set of fixed-length fingerprints.
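The hypergraph construction behind the RAM algorithm, described above, can be sketched in Python for r = 3. This is a simplified illustration under our own choices (function names, the retry-on-cycle strategy, plain integer arrays); the thesis uses carefully engineered succinct data structures to reach the quoted space bounds:

```python
import hashlib

def _vertex(seed, i, key, block):
    # i-th hash function (i = 0, 1, 2); each maps into its own block of
    # vertices, so the three endpoints of a hyperedge are always distinct
    d = hashlib.blake2b(f"{seed}:{i}:{key}".encode(), digest_size=8).digest()
    return i * block + int.from_bytes(d, "big") % block

def build_phf(keys, seed=0):
    n = len(keys)
    block = int(0.41 * n) + 1            # ~1.23n vertices in total
    m = 3 * block
    edges = [tuple(_vertex(seed, i, k, block) for i in range(3)) for k in keys]

    # peeling: repeatedly delete a hyperedge incident to a degree-1 vertex;
    # emptying the edge set means the random 3-hypergraph is acyclic
    deg, inc = [0] * m, [[] for _ in range(m)]
    for idx, e in enumerate(edges):
        for v in e:
            deg[v] += 1
            inc[v].append(idx)
    order, removed = [], [False] * n
    stack = [v for v in range(m) if deg[v] == 1]
    while stack:
        v = stack.pop()
        if deg[v] != 1:                  # stale stack entry
            continue
        idx = next(i for i in inc[v] if not removed[i])
        removed[idx] = True
        order.append((idx, v))
        for u in edges[idx]:
            deg[u] -= 1
            if deg[u] == 1:
                stack.append(u)
    if not all(removed):                 # cyclic: retry with a fresh seed
        return build_phf(keys, seed + 1)

    # assignment: choose g so that i(x) = (g[v0] + g[v1] + g[v2]) mod 3 is
    # the position, inside its own edge, of the vertex the edge was peeled at
    g, visited = [0] * m, [False] * m
    for idx, v in reversed(order):
        e = edges[idx]
        s = sum(g[u] for u in e if visited[u])
        for u in e:
            visited[u] = True
        g[v] = (e.index(v) - s) % 3

    def phf(key):
        e = tuple(_vertex(seed, i, key, block) for i in range(3))
        return e[(g[e[0]] + g[e[1]] + g[e[2]]) % 3]
    return phf, m

def build_mphf(keys):
    # compress the PHF range [0, m) down to [0, n) by ranking the used slots
    phf, m = build_phf(keys)
    used = [False] * m
    for k in keys:
        used[phf(k)] = True
    rank, c = [0] * m, 0
    for v in range(m):
        rank[v] = c
        c += used[v]
    return lambda k: rank[phf(k)]

keys = [f"key-{i}" for i in range(1000)]
mphf = build_mphf(keys)
assert sorted(mphf(k) for k in keys) == list(range(len(keys)))
```

Since each vertex is the peeling vertex of at most one edge, the map from a key to the peeling vertex of its edge is one-to-one on S, which is exactly trick (i); the modular sum over the g values is the linear equation of trick (ii). Storing one value in {0, 1, 2, unassigned} per vertex is what yields the c(r)⌈log(r + 1)⌉ bits per key mentioned above.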
Also, the whole key set is broken into buckets with at most 256 keys, and memory is accessed in a less random fashion, implying fewer cache misses.

Table 1. Comparison of the algorithms for constructing MPHFs, considering generation time and storage space, and using n = 3,541,615 keys for the two collections.

Algorithm     | Generation Time (sec)                 | Storage Space
              | 4-byte Integers    | URLs             | Bits/Key | Size (MB)
RAM, r = 2    | 11.39 ± 1.33       | 16.73 ± 1.89     | 3.60     | 1.52
RAM, r = 3    | 5.46 ± 0.01        | 6.74 ± 0.01      | 2.62     | 1.11
EM            | 5.86 ± 0.17        | 7.68 ± 0.22      | 3.31     | 1.40
Heuristic EM  | 5.56 ± 0.16        | 6.27 ± 0.11      | 3.17     | 1.34
BKZ           | 9.22 ± 0.63        | 11.33 ± 0.70     | 21.76    | 9.19
FCH           | 2,052.7 ± 530.96   | 2,400.1 ± 711.60 | 4.22     | 1.78
MWHC          | 5.98 ± 0.01        | 7.18 ± 0.01      | 26.76    | 11.30
PAGH          | 39.18 ± 2.36       | 42.84 ± 2.42     | 44.16    | 18.65

Figure 2 illustrates that both versions of the EM algorithm are able to generate an MPHF for a key set of 1.024 billion keys in less than 46 minutes, using a commodity PC. There is no algorithm in the perfect hashing literature that gets even close.

[Figure 2. Number of keys in S versus generation time for the EM algorithm and the heuristic HEM algorithm. The solid lines correspond to linear regression models for the generation time.]

2.2. Parallel External Memory Algorithm

The Parallel External Memory (PEM) algorithm [Botelho et al. 2008a] makes it possible to distribute the construction, description and evaluation of the resulting functions, which is of fundamental importance when the key set size increases considerably. For instance, using a 14-computer cluster the PEM algorithm generates an MPHF for 1.024 billion URLs in approximately 4 minutes, achieving an almost linear speedup. Also, for 14.336 billion 16-byte random integers evenly distributed among the 14 participating machines, the PEM algorithm outputs an MPHF in approximately 50 minutes, with a performance degradation of only 20%.
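Both the sequential pass of the EM algorithm and the distribution step of the PEM algorithm rest on the fingerprint-and-bucket idea just described. A minimal sketch, in which the bucket count, fingerprint width and hash function are our illustrative choices rather than the thesis parameters:

```python
import hashlib

def to_buckets(keys, nbuckets, seed=0):
    # one pass over the key set: replace each variable-length key by a
    # fixed-length fingerprint and route it to a small bucket; each bucket
    # is later given its own local MPHF (on one machine in EM, or spread
    # across a cluster in PEM)
    buckets = [[] for _ in range(nbuckets)]
    for k in keys:
        fp = hashlib.blake2b(f"{seed}:{k}".encode(), digest_size=16).digest()
        buckets[int.from_bytes(fp[:4], "big") % nbuckets].append(fp)
    return buckets

keys = [f"http://example.org/page/{i}" for i in range(10_000)]
buckets = to_buckets(keys, nbuckets=len(keys) // 64)
assert sum(len(b) for b in buckets) == len(keys)            # no key lost
assert len({fp for b in buckets for fp in b}) == len(keys)  # no fingerprint clash
```

After this single scan the original keys are never touched again: every later step (spilling buckets to disk, building one small MPHF per bucket, shipping buckets to cluster machines) works on short fixed-length fingerprints, which is what keeps the memory access pattern cache-friendly and lets the construction scale to billions of keys.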
To the best of our knowledge, there is no previous result in the perfect hashing literature that can be implemented in a parallel way to obtain better scalability and performance than the results presented by the PEM algorithm.

2.3. MPHFs and Random Graphs with Cycles

The reason to use random graphs with cycles is that the functions are generated faster and are more compact than the ones generated from acyclic random graphs. This is because both the generation time and the space usage of the resulting functions depend on the number of vertices in the random graphs, and the acyclic ones are sparser: the ratio between the number of vertices and the number of edges must be larger than two. Our result presented in [Botelho et al. 2005] improved the space requirement of one instance of the algorithms proposed in [Majewski et al. 1996]. Both algorithms are linear in n, but our algorithm runs 59% faster than the one in [Majewski et al. 1996], and the resulting MPHFs are stored using half of the space. The resulting MPHFs still need O(n log n) bits to be stored. As in [Majewski et al. 1996], the algorithm assumes uniform hashing and needs O(n) computer words of the Word RAM model to construct the functions. Recently, using ideas similar to the ones presented in [Botelho et al. 2005], we have optimized the version of the RAM algorithm that works on random bipartite graphs to output the resulting functions 40% faster when cycles are allowed. These results are presented in [Botelho 2008, Chapter 6] and are being prepared for publication.

2.4. Indexing Internal Memory with MPHFs

We have shown that MPHFs provide the best tradeoff between space usage and lookup time when compared to other hashing schemes for indexing static key sets in internal memory [Botelho et al. 2008b]. This was not the case in the past, because the space overhead to store MPHFs was O(log n) bits per key for practical algorithms [Majewski et al. 1996].
However, the MPHFs generated with the RAM algorithm [Botelho et al. 2007, Botelho et al. 2009b] require approximately 2.6 bits per key to describe the function and can be evaluated in O(1) time, which completely changed that scenario. In [Botelho et al. 2009a] we extended our prior study in two aspects. First, we designed an optimization of the MPHFs that considerably improves their lookup time. Second, we surveyed the main hashing schemes available in the literature and added four other methods to our comparative study. We have shown that other hashing schemes cannot outperform minimal perfect hashing in lookup time even when the hash table occupancy is lower than 20%. An MPHF requiring just 2.6 bits per key of storage space is able to represent sets on the order of 10 million keys within a 4 MB CPU cache, which is enough for a large range of applications. Besides, the space overhead of minimal perfect hashing is a factor of O(log n) bits lower than that of the other hashing schemes.

2.5. CMPH Library

The CMPH Library [Botelho et al. 2006] contains a professional implementation of our main results and is the state-of-the-art software for perfect hashing. We have received very good feedback about the practicality of the library. For instance, it has had more than 3,300 downloads (July 2009) and is incorporated in two Linux distributions: Debian and Ubuntu. This has helped the results of this thesis become widely used in a short period of time, something that usually takes much longer.

3. Impact of the Results

Three published papers have 21 citations (excluding self-citations), and one of them has had more than 145 downloads in the ACM Portal in the last 12 months. Two papers that cite our results mention that we have the first really practical perfect hashing result in 20 years of research [Edelkamp and Sulewski 2008].
As mentioned before, the CMPH library has had more than 3,300 downloads up to July 2009, is incorporated in two Linux distributions (Debian and Ubuntu), and has been used for applications that were infeasible in the past. For instance, the results are being used in the products of two big companies based in California, United States: (i) Symantec Inc. and (ii) Data Domain Inc. Due to the impact of the results on the products of Data Domain Inc. (a company with a net revenue exceeding 270 million dollars in 2008 and an expected growth of 100% in 2009), Fabiano C. Botelho was offered a position and will join the company's team from August 2009 on. Besides, some of the knowledge acquired during the doctorate was used in a book [Ziviani and Botelho 2006] that has sold more than 1,500 copies.

4. Conclusions

In the thesis summarized here we have designed a time-efficient, highly scalable and near-optimal space perfect hashing algorithm. In a 64-bit architecture our algorithm is able to deal with key sets of size up to n = 1.8 × 10^21. The resulting functions are evaluated in O(1) time. The space necessary to describe the functions is a constant number of bits per key, and it is within a factor of two of the information theoretical minimum of approximately 1.44n bits for MPHFs and 0.89n bits for PHFs, something that had not been achieved by previous algorithms, except asymptotically for very large n. The algorithm is theoretically well understood and is the first one with theoretical properties that scales to billions of keys and can be used in practice. The algorithm is suitable for a distributed and parallel implementation, such as the one presented in [Botelho et al. 2008a], which is able to generate an MPHF for a set of 14.336 billion 16-byte integer keys in 50 minutes using 14 commodity PCs, achieving an almost linear speedup.
We have also shown that MPHFs provide the best tradeoff between space usage and lookup time when compared to other hashing schemes for indexing internal memory when static key sets are involved.

5. Ongoing and Future Work

We strongly believe that our results on perfect hashing and the advent of solid state disks, which are built on flash memory technology, are a perfect match to improve the performance of computer systems in several contexts. For example, this has been successfully done in [Edelkamp and Sulewski 2008]. So, we are working on very promising applications in the Information Retrieval field. Besides, we are working on three more papers to be submitted in 2009. The first paper is joint work with Professor Nicholas C. Wormald from the Department of Combinatorics and Optimization at the University of Waterloo, the second is a journal paper that extends the results presented in [Botelho et al. 2008a], and the third is a journal paper that extends the results presented in [Botelho et al. 2005].

Acknowledgements

We thank the partial support given by the Brazilian National Institute of Science and Technology for the Web (grant MCT/CNPq 573871/2008-6), Project InfoWeb (grant MCT/CNPq/CT-INFO 550874/2007-0) and CNPq Grant 30.5237/02-0 (Nivio Ziviani).

References

Alon, N., Dietzfelbinger, M., Miltersen, P. B., Petrank, E., and Tardos, G. (1999). Linear hash functions. Journal of the ACM, 46(5):667–683.

Botelho, F. C. (2008). Near-Optimal Space Perfect Hashing Algorithms. PhD thesis, Federal University of Minas Gerais. Supervised by Nivio Ziviani. http://www.decom.cefetmg.br/docentes/fabiano_botelho/en/publications.html.

Botelho, F. C., Galinkin, D., Meira-Jr., W., and Ziviani, N. (2008a). Distributed perfect hashing for very large key sets. In Proceedings of the 3rd International ICST Conference on Scalable Information Systems (InfoScale'08). ACM Press. One non-self-citation from scientific articles in Google Scholar.

Botelho, F. C., Kohayakawa, Y., and Ziviani, N. (2005). A practical minimal perfect hashing method. In Proceedings of the 4th International Workshop on Efficient and Experimental Algorithms (WEA'05), pages 488–500. Springer LNCS vol. 3503. Seven non-self-citations from scientific articles in Google Scholar.

Botelho, F. C., Lacerda, A., Menezes, G. V., and Ziviani, N. (2009a). Minimal perfect hashing: A competitive method for indexing internal memory. Information Sciences. Submitted.

Botelho, F. C., Langbehn, H. R., Menezes, G. V., and Ziviani, N. (2008b). Indexing internal memory with minimal perfect hash functions. In Proceedings of the 23rd Brazilian Symposium on Databases (SBBD'08), pages 16–30.

Botelho, F. C., Pagh, R., and Ziviani, N. (2007). Simple and space-efficient minimal perfect hash functions. In Proceedings of the 10th Workshop on Algorithms and Data Structures (WADS'07), pages 139–150. Springer LNCS vol. 4619. Ten non-self-citations from scientific articles in Google Scholar.

Botelho, F. C., Pagh, R., and Ziviani, N. (2009b). A scalable and near-optimal space perfect hashing algorithm. Transactions on Algorithms. Submitted.

Botelho, F. C., Reis, D., and Ziviani, N. (2006). CMPH: C minimal perfect hashing library. Free software library. More than three thousand downloads by February 2009.

Botelho, F. C. and Ziviani, N. (2007). External perfect hashing for very large key sets. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM'07), pages 653–662. ACM Press. Four non-self-citations from scientific articles in Google Scholar.

Edelkamp, S. and Sulewski, D. (2008). Flash-efficient LTL model checking with minimal counterexamples. In Proceedings of the Sixth IEEE International Conference on Software Engineering and Formal Methods (SEFM'08), pages 73–82, Washington, DC, USA. IEEE Computer Society.

Fox, E. A., Chen, Q. F., and Heath, L. S. (1992). A faster algorithm for constructing minimal perfect hash functions. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'92), pages 266–273.

Hagerup, T. and Tholey, T. (2001). Efficient minimal perfect hashing in nearly minimal space. In Proceedings of the 18th Symposium on Theoretical Aspects of Computer Science (STACS'01), pages 317–326. Springer LNCS vol. 2010.

Jenkins, B. (1997). Algorithm alley: Hash functions. Dr. Dobb's Journal of Software Tools, 22(9).

Majewski, B. S., Wormald, N. C., Havas, G., and Czech, Z. J. (1996). A family of perfect hashing methods. The Computer Journal, 39(6):547–554.

Mehlhorn, K. (1984). Data Structures and Algorithms 1: Sorting and Searching. Springer-Verlag.

Pagh, R. (1999). Hash and displace: Efficient evaluation of minimal perfect hash functions. In Workshop on Algorithms and Data Structures (WADS'99), pages 49–54.

Seltzer, M. (2005). Beyond relational databases. ACM Queue, 3(3).

Ziviani, N. and Botelho, F. C. (2006). Projeto de Algoritmos com Implementações em Java e C++ (Design of Algorithms with Implementations in Java and C++). Thomson Learning.