
Pb-Hash: Partitioned b-bit Hashing

Authors:

Ping Li,

Weijie Zhao

ICTIR '24: Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval

Pages 239 - 246

https://doi.org/10.1145/3664190.3672523

Published: 05 August 2024


Abstract

Many hashing algorithms, including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS), generate integers of B bits. With k hashes for each data vector, the storage would be B × k bits; when used for large-scale learning, the model size would be 2^B × k, which can be expensive. A standard strategy is to use only the lowest b bits out of the B bits and somewhat increase k, the number of hashes. In this study, we propose to re-use the hashes by partitioning the B bits into m chunks, e.g., b × m = B. Correspondingly, the model size becomes m × 2^b × k, which can be substantially smaller than 2^B × k.
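To make the partitioning and the model-size arithmetic concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`partition_hash`, `feature_indices`, `model_size`), the lowest-bits-first chunk order, and the one-hot index layout are all hypothetical choices, and the sketch assumes m divides B exactly.

```python
from typing import List


def partition_hash(h: int, B: int, m: int) -> List[int]:
    """Split a B-bit integer h into m chunks of b = B // m bits each.

    Assumption of this sketch: chunks are returned lowest bits first.
    """
    if B % m != 0:
        raise ValueError("this sketch assumes m divides B")
    if not 0 <= h < (1 << B):
        raise ValueError("h must fit in B bits")
    b = B // m
    mask = (1 << b) - 1
    return [(h >> (i * b)) & mask for i in range(m)]


def feature_indices(hashes: List[int], B: int, m: int) -> List[int]:
    """Map k hashes to k * m active positions in a one-hot vector.

    Hypothetical layout: each (hash j, chunk i) pair owns its own block
    of 2^b positions, so the vector has m * 2^b * k positions in total.
    """
    width = 1 << (B // m)
    indices = []
    for j, h in enumerate(hashes):
        for i, chunk in enumerate(partition_hash(h, B, m)):
            indices.append((j * m + i) * width + chunk)
    return indices


def model_size(B: int, m: int, k: int) -> int:
    """Number of weights for a linear model: m * 2^(B/m) * k."""
    return m * (1 << (B // m)) * k


if __name__ == "__main__":
    B, k = 16, 3
    print(partition_hash(0xABCD, B, 2))    # two 8-bit chunks
    print(model_size(B, 1, k))             # full B bits: 2^16 * 3
    print(model_size(B, 2, k))             # m = 2 chunks: 2 * 2^8 * 3
```

With B = 16 and k = 3, using the full hashes needs 2^16 × 3 = 196,608 weights, while m = 2 chunks of 8 bits need 2 × 2^8 × 3 = 1,536. Setting m = 1 recovers the unpartitioned case.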

The proposed "partitioned b-bit hashing" (Pb-Hash) is desirable for various reasons: (1) Generating hashes can be expensive for industrial-scale (user-facing) systems. Thus, engineers may hope to make use of each hash as much as possible, instead of generating more hashes (i.e., increasing k). (2) To protect user privacy, the hashes might be artificially "polluted", and the differential privacy (DP) budget is proportional to k. (3) After hashing, the original data are not necessarily stored, and hence it might not even be possible to generate more hashes. (4) For advertising and recommendation, engineers can also apply Pb-Hash to large categorical (ID) features.

Our theoretical analysis reveals that by partitioning the hash values into m chunks, the accuracy would drop. In other words, using m chunks of B/m bits would not be as accurate as directly using B bits. This is due to the correlation from re-using the same hash. On the other hand, our analysis also shows that the accuracy would not drop much for, e.g., m = 2 to 4. In some regions, Pb-Hash still works well even for m much larger than 4. We expect Pb-Hash would be a good addition to the family of hashing methods/applications and benefit industrial practitioners. Finally, we verify the effectiveness of Pb-Hash for linear SVM models as well as deep learning models.

