Minwise-Independent Permutations with Insertion and Deletion of Features

Authors: Rameshwar Pratap, Raghav Kulkarni

Similarity Search and Applications: 16th International Conference, SISAP 2023, A Coruña, Spain, October 9–11, 2023, Proceedings

Pages 171 - 184

https://doi.org/10.1007/978-3-031-46994-7_15

Published: 27 October 2023

Abstract

The seminal work of Broder et al. [5] introduces the minHash algorithm, which computes a low-dimensional sketch of high-dimensional binary data that closely approximates pairwise Jaccard similarity. Since its invention, minHash has been commonly used by practitioners in various big data applications. In many real-life scenarios, the data is dynamic and its feature set evolves over time. We consider the case in which features are dynamically inserted into and deleted from the dataset. A naive solution to this problem is to recompute minHash with respect to the updated dimension after every change. However, this is expensive, as it requires generating fresh random permutations. To the best of our knowledge, no systematic study of minHash has been recorded in the context of dynamic insertion and deletion of features. In this work, we initiate this study and suggest algorithms that make minHash sketches adaptable to the dynamic insertion and deletion of features. We give a rigorous theoretical analysis of our algorithms and complement it with supporting experiments on several real-world datasets. Empirically, we observe a significant speed-up in running time while offering performance comparable to running minHash from scratch. Our proposal is efficient, accurate, and easy to implement in practice.
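
As background for the abstract, the following is a minimal sketch of the standard, static minHash of [5], written with explicit random permutations. It is not the dynamic algorithm proposed in the paper, which the abstract does not specify. All function names (`jaccard`, `make_permutations`, `minhash_sketch`, `estimate_jaccard`) and parameters are hypothetical and chosen for illustration only. A binary vector is represented by the set of indices of its non-zero features.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two feature-index sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_permutations(dim, num_perm, seed=0):
    """Draw num_perm independent random permutations of range(dim)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_perm):
        perm = list(range(dim))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def minhash_sketch(features, perms):
    """One sketch coordinate per permutation: the smallest permuted index
    among the present features. Assumes `features` is non-empty."""
    return [min(perm[i] for i in features) for perm in perms]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of coordinates on which two equal-length sketches agree."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    dim = 1000
    a = set(range(0, 300))
    b = set(range(150, 450))
    perms = make_permutations(dim, num_perm=256)
    est = estimate_jaccard(minhash_sketch(a, perms), minhash_sketch(b, perms))
    print(f"exact={jaccard(a, b):.3f} estimate={est:.3f}")
```

In this sketch, the naive solution mentioned in the abstract would correspond to calling `make_permutations` again with the new `dim` and rebuilding every sketch whenever a feature is inserted or deleted, which is the cost the paper aims to avoid.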

References

[1]

Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, pp. 131–140. Association for Computing Machinery, New York, NY, USA (2007)

[2]

Bera, D., Pratap, R.: Frequent-itemset mining using locality-sensitive hashing. In: Dinh, T.N., Thai, M.T. (eds.) Computing and Combinatorics, pp. 143–155. Springer, Cham (2016)

[3]

Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of Compression and Complexity of Sequences 1997, pp. 21–29. IEEE (1997)

[4]

Broder, A.Z.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) Combinatorial Pattern Matching, pp. 1–10. Springer, Heidelberg (2000)

[5]

Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations (extended abstract). In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC 1998, pp. 327–336. Association for Computing Machinery, New York, NY, USA (1998)

[6]

Broder, A.Z., Glassman, S.C., Nelson, C.G., Manasse, M.S., Zweig, G.G.: Method for clustering closely resembling data objects, September 12 2000. US Patent 6,119,124

[7]

Christiani, T., Pagh, R.: Set similarity search beyond minhash. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pp. 1094–1107. Association for Computing Machinery, New York, NY, USA (2017)

[8]

Christiani, T., Pagh, R., Sivertsen, J.: Scalable and robust set similarity join. In: 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16–19, 2018, pp. 1240–1243. IEEE Computer Society (2018)

[9]

Chum, O., Philbin, J., Zisserman, A.: Near duplicate image detection: min-hash and TF-IDF weighting. In: Everingham, M., Needham, C.J., Fraile, R. (eds.) Proceedings of the British Machine Vision Conference 2008, Leeds, UK, September 2008, pp. 1–10. British Machine Vision Association (2008)

[10]

Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)

[11]

Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, pp. 271–280. Association for Computing Machinery, New York, NY, USA (2007)

[12]

Henzinger, M.: Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2006, pp. 284–291. Association for Computing Machinery, New York, NY, USA (2006)

[13]

Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, May 23–26, 1998, pp. 604–613 (1998)

[14]

Li, P., König, A.C.: Theory and applications of b-bit minwise hashing. Commun. ACM 54(8), 101–109 (2011)

[15]

Li, P., Owen, A.B., Zhang, C.-H.: One permutation hashing. In: Bartlett, P.L., Pereira, F.C.N., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3–6, 2012, Lake Tahoe, Nevada, United States, pp. 3122–3130 (2012)

[16]

Li, P., Shrivastava, A., König, A.C.: B-bit minwise hashing in practice. In: Proceedings of the 5th Asia-Pacific Symposium on Internetware, Internetware 2013, New York, NY, USA. Association for Computing Machinery (2013)

[17]

Lichman, M.: UCI machine learning repository (2013)

[18]

Singh Manku, G., Jain, A., Sarma, A.D.: Detecting near-duplicates for web crawling. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, pp. 141–150. Association for Computing Machinery, New York, NY, USA (2007)

[19]

McCauley, S., Mikkelsen, J.W., Pagh, R.: Set similarity search for skewed data. In: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, SIGMOD/PODS 2018, pp. 63–74. Association for Computing Machinery, New York, NY, USA (2018)

[20]

Mitzenmacher, M., Pagh, R., Pham, N.: Efficient estimation for high similarities using odd sketches. In: Proceedings of the 23rd International Conference on World Wide Web, WWW 2014, p–118. Association for Computing Machinery, New York, NY, USA (2014)

[21]

Pratap, R., Kulkarni, R.: Minwise-independent permutations with insertion and deletion of features. arxiv.org/abs/2308.11240 (2023)

[22]

Shrivastava, A., Li, P.: Improved densification of one permutation hashing. In: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI 2014, pp. 732–741. AUAI Press, Arlington, Virginia, USA (2014)

[23]

Sundaram, N., et al.: Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proc. VLDB Endow. 6(14), 1930–1941 (2013)

Published In

Similarity Search and Applications: 16th International Conference, SISAP 2023, A Coruña, Spain, October 9–11, 2023, Proceedings

Oct 2023

324 pages

ISBN:978-3-031-46993-0

DOI:10.1007/978-3-031-46994-7

Editors:
Oscar Pedreira, University of A Coruña, Coruña, Spain (https://ror.org/01qckj285)
Vladimir Estivill-Castro, Pompeu Fabra University, Barcelona, Spain (https://ror.org/04n0g0b29)

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.

Publisher

Springer-Verlag

Berlin, Heidelberg
