DOI: 10.1145/3664190.3672523
research-article

Pb-Hash: Partitioned b-bit Hashing

Published: 05 August 2024

Abstract

Many hashing algorithms, including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS), generate integers of B bits. With k hashes for each data vector, the storage would be B × k bits; and when used for large-scale learning, the model size would be 2^B × k, which can be expensive. A standard strategy is to use only the lowest b bits out of the B bits and somewhat increase k, the number of hashes. In this study, we propose to re-use the hashes by partitioning the B bits into m chunks, e.g., b × m = B. Correspondingly, the model size becomes m × 2^b × k, which can be substantially smaller than 2^B × k.
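The partitioning and the two model sizes above can be illustrated with a small sketch. It assumes that m divides B evenly and that chunks are taken from the lowest bits upward; the function names `split_hash`, `model_size_full`, and `model_size_pb` are hypothetical and do not come from the paper.

```python
def split_hash(h: int, B: int, m: int) -> list[int]:
    """Split a B-bit hash into m chunks of b = B // m bits, lowest chunk first.

    Assumption for this sketch: m divides B exactly (b * m == B).
    """
    if B % m != 0:
        raise ValueError("this sketch assumes m divides B")
    if not 0 <= h < (1 << B):
        raise ValueError("hash value does not fit in B bits")
    b = B // m
    mask = (1 << b) - 1  # b low-order ones
    return [(h >> (i * b)) & mask for i in range(m)]


def model_size_full(B: int, k: int) -> int:
    """Model size when every B-bit hash is used directly: 2^B * k."""
    return (1 << B) * k


def model_size_pb(B: int, m: int, k: int) -> int:
    """Model size with m chunks of b = B // m bits each: m * 2^b * k."""
    if B % m != 0:
        raise ValueError("this sketch assumes m divides B")
    b = B // m
    return m * (1 << b) * k


# A 16-bit hash split into 4 chunks of 4 bits each.
chunks = split_hash(0xABCD, B=16, m=4)

# Model sizes for B = 16 and k = 100 hashes.
full = model_size_full(16, 100)
pb_m2 = model_size_pb(16, 2, 100)
pb_m4 = model_size_pb(16, 4, 100)
```

With these illustrative values, the full model has 2^16 × 100 entries, while m = 4 needs only 4 × 2^4 × 100, which shows how quickly the size shrinks as m grows.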
The proposed "partitioned b-bit hashing" (Pb-Hash) is desirable for various reasons: (1) Generating hashes can be expensive for industrial-scale (user-facing) systems. Thus, engineers may hope to make use of each hash as much as possible, instead of generating more hashes (i.e., increasing k). (2) To protect user privacy, the hashes might be artificially "polluted" and the differential privacy (DP) budget is proportional to k. (3) After hashing, the original data are not necessarily stored and hence it might not even be possible to generate more hashes. (4) For advertising and recommendation, engineers can also apply Pb-Hash to large categorical (ID) features.
Our theoretical analysis reveals that by partitioning the hash values into m chunks, the accuracy would drop. In other words, using m chunks of B/m bits would not be as accurate as directly using B bits. This is due to the correlation from re-using the same hash. On the other hand, our analysis also shows that the accuracy would not drop much for, e.g., m = 2 to 4. In some regions, Pb-Hash still works well even for m much larger than 4. We expect Pb-Hash would be a good addition to the family of hashing methods/applications and benefit industrial practitioners. Finally, we verify the effectiveness of Pb-Hash for linear SVM models as well as deep learning models.
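The abstract does not say how the chunks are fed to a linear model. One plausible reading, sketched below, is a one-hot expansion in which each (hash, chunk) pair owns its own block of 2^b coordinates, giving a sparse vector of dimension m × 2^b × k with exactly m × k nonzeros. This layout is an assumption made for illustration, and the function name `pb_hash_feature_indices` is hypothetical.

```python
def pb_hash_feature_indices(hashes: list[int], B: int, m: int) -> list[int]:
    """Map k B-bit hashes to the nonzero positions of a one-hot feature vector.

    Assumed layout (not taken from the paper): chunk i of hash j occupies the
    block of 2^b coordinates starting at (j * m + i) * 2^b, where b = B // m.
    The resulting vector has dimension m * 2^b * k.
    """
    if B % m != 0:
        raise ValueError("this sketch assumes m divides B")
    b = B // m
    width = 1 << b      # coordinates per (hash, chunk) block
    mask = width - 1    # b low-order ones
    indices = []
    for j, h in enumerate(hashes):
        for i in range(m):
            chunk = (h >> (i * b)) & mask
            indices.append((j * m + i) * width + chunk)
    return indices


# Two 8-bit hashes (k = 2), each split into m = 2 chunks of b = 4 bits.
example_hashes = [0xAB, 0x01]
nonzeros = pb_hash_feature_indices(example_hashes, B=8, m=2)
dimension = 2 * (1 << 4) * len(example_hashes)  # m * 2^b * k
```

Because chunks of the same hash are correlated, the m blocks belonging to one hash are not independent features, which is the source of the accuracy drop the analysis describes.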


Published In

ICTIR '24: Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval
August 2024
267 pages
ISBN: 9798400706813
DOI: 10.1145/3664190
Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. categorical features
2. efficiency
3. hashing
4. partition bits

Acceptance Rates

ICTIR '24 Paper Acceptance Rate: 26 of 45 submissions, 58%
Overall Acceptance Rate: 235 of 527 submissions, 45%
