research-article

Practical near neighbor search via group testing

AUTHORs:

Benjamin Coleman,

Anshumali ShrivastavaAuthors Info & Claims

NIPS'21: Proceedings of the 35th International Conference on Neural Information Processing Systems

Article No.: 761, Pages 9950 - 9962

Published: 10 June 2024 Publication History

Abstract

We present a new algorithm for the approximate near neighbor problem that combines classical ideas from group testing with locality-sensitive hashing (LSH). We reduce the near neighbor search problem to a group testing problem by designating neighbors as "positives," non-neighbors as "negatives," and approximate membership queries as group tests. We instantiate this framework using distance-sensitive Bloom Filters to Identify Near-Neighbor Groups (FLINNG). We prove that FLINNG has sub-linear query time and show that our algorithm comes with a variety of practical advantages. For example, FLINNG can be constructed in a single pass through the data, consists entirely of efficient integer operations, and does not require any distance computations. We conduct large-scale experiments on high-dimensional search tasks such as genome search, URL similarity search, and embedding search over the massive YFCC100M dataset. In our comparison with leading algorithms such as HNSW and FAISS, we find that FLINNG can provide up to a 10x query speedup with substantially smaller indexing time and memory.

Supplementary Material

Additional material (3540261.3541022_supp.pdf)

Supplemental material.

Download
771.06 KB

References

[1]

Matthew Aldridge, Oliver Johnson, Jonathan Scarlett, et al. Group testing: An information theory perspective. Foundations and Trends® in Communications and Information Theory, 15(3-4):196–392, 2019.

Digital Library

[2]

Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. Practical and optimal lsh for angular distance. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages 1225–1233. Curran Associates, Inc., 2015.

[3]

Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 793–801, 2015.

Digital Library

[4]

Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is "nearest neighbor" meaningful? In International conference on database theory, pages 217–235. Springer, 1999.

[5]

Benjamin Coleman, Richard Baraniuk, and Anshumali Shrivastava. Sub-linear memory sketches for near neighbor search on streaming data. In International Conference on Machine Learning, pages 2089–2099. PMLR, 2020.

[6]

Yihe Dong, Piotr Indyk, Ilya Razenshteyn, and Tal Wagner. Learning space partitions for nearest neighbor search. In International Conference on Learning Representations, 2019.

[7]

Aristides Gionis, Piotr Indyk, and Rajeev Motwaniz. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Base, 1999, 1999.

[8]

Gaurav Gupta, Minghao Yan, Benjamin Coleman, Bryce Kille, R. A. Leo Elworth, Tharun Medini, Todd Treangen, and Anshumali Shrivastava. Fast processing and querying of 170tb of genomics data via a repeated and merged bloom filter (rambo). In Proceedings of the 2021 International Conference on Management of Data, SIGMOD/PODS '21, page 2226–2234, New York, NY, USA, 2021. Association for Computing Machinery.

Digital Library

[9]

Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613, 1998.

Digital Library

[10]

Ahmet Iscen, Teddy Furon, Vincent Gripon, Michael Rabbat, and Hervé Jégou. Memory vectors for similarity search in high-dimensional spaces. IEEE transactions on big data, 4(1):65–77, 2017.

[11]

Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017.

[12]

Adam Kirsch and Michael Mitzenmacher. Distance-sensitive bloom filters. In 2006 Proceedings of the Eighth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 41–50. SIAM, 2006.

[13]

Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe lsh: efficient indexing for high-dimensional similarity search. In 33rd International Conference on Very Large Data Bases, VLDB 2007, pages 950–961. Association for Computing Machinery, Inc, 2007.

[14]

Yu A Malkov and Dmitry A Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4):824–836, 2018.

[15]

Tharun Medini, Beidi Chen, and Anshumali Shrivastava. Solar: Sparse orthogonal learned and random embeddings. In International Conference on Learning Representations, 2021.

[16]

Samuel M Nicholls, Joshua C Quick, Shuiquan Tang, and Nicholas J Loman. Ultra-deep, long-read nanopore sequencing of mock microbial community standards. Gigascience, 8(5):giz043, 2019.

[17]

Brian D Ondov, Todd J Treangen, Páll Melsted, Adam B Mallonee, Nicholas H Bergman, Sergey Koren, and Adam M Phillippy. Mash: fast genome and metagenome distance estimation using minhash. Genome biology, 17(1):1–14, 2016.

[18]

Liudmila Prokhorenkova and Aleksandr Shekhovtsov. Graph-based nearest neighbor search: From practice to theory. In International Conference on Machine Learning, pages 7803–7813. PMLR, 2020.

[19]

Parikshit Ram and Kaushik Sinha. Revisiting kd-tree for nearest neighbor search. In Proceedings of the 25th acm sigkdd international conference on knowledge discovery & data mining, pages 1378–1388, 2019.

Digital Library

[20]

Miaojing Shi, Teddy Furon, and Hervé Jégou. A group testing framework for similarity search in high-dimensional spaces. In Proceedings of the 22nd ACM international conference on Multimedia, pages 407–416, 2014.

Digital Library

[21]

Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.

Digital Library

[22]

Yiqiu Wang, Anshumali Shrivastava, Jonathan Wang, and Junghee Ryu. Randomized algorithms accelerated over cpu-gpu for ultra-high dimensional similarity search. In Proceedings of the 2018 International Conference on Management of Data, pages 889–903, 2018.

Digital Library

Recommendations

Densifying one permutation hashing via rotation for fast near neighbor search
ICML'14: Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32

The query complexity of locality sensitive hashing (LSH) based similarity search is dominated by the number of hash evaluations, and this number grows with the data size (Indyk & Motwani, 1998). In industrial applications such as search where the data ...
Ranked Reverse Nearest Neighbor Search

Given a set of data points P and a query point q in a multidimensional space, Reverse Nearest Neighbor (RNN) query finds data points in P whose nearest neighbors are q. Reverse k-Nearest Neighbor (RkNN) query (where k ≥ 1) generalizes RNN query to find ...
Confirmation Sampling for Exact Nearest Neighbor Search
Similarity Search and Applications
Abstract
Locality-sensitive hashing (LSH), introduced by Indyk and Motwani in STOC ’98, has been an extremely influential framework for nearest neighbor search in high-dimensional data sets. While theoretical work has focused on the approximate nearest ...

Comments

Information & Contributors

Information

Published In

cover image Guide Proceedings

NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems

December 2021

30517 pages

ISBN:9781713845393

Copyright © 2021 Neural Information Processing Systems Foundation, Inc.

Publisher

Curran Associates Inc.

Red Hook, NY, United States

Publication History

Published: 10 June 2024

Qualifiers

Research-article
Research
Refereed limited

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 10 Aug 2024

Other Metrics

View Author Metrics

Citations

View Options

View options

Media

Figures

Other

Tables

View Table of Contents