research-article

Permutation search methods are efficient, yet faster search is possible

Editors: Chen Li, Volker Markl

Authors: Bilegsaikhan Naidan, Leonid Boytsov, Eric Nyberg

Proceedings of the VLDB Endowment, Volume 8, Issue 12

Pages 1618 - 1629

https://doi.org/10.14778/2824032.2824059

Published: 01 August 2015

Abstract

We survey permutation-based methods for approximate k-nearest neighbor search. In these methods, every data point is represented by a ranked list of pivots sorted by the distance to this point. Such ranked lists are called permutations. The underpinning assumption is that, for both metric and non-metric spaces, the distance between permutations is a good proxy for the distance between original points. Thus, it should be possible to efficiently retrieve most true nearest neighbors by examining only a tiny subset of data points whose permutations are similar to the permutation of a query. We further test this assumption by carrying out an extensive experimental evaluation where permutation methods are pitted against state-of-the-art benchmarks (the multi-probe LSH, the VP-tree, and proximity-graph based retrieval) on a variety of realistically large data sets from the image and textual domains. The focus is on high-accuracy retrieval methods for generic spaces. Additionally, we assume that both data and indices are stored in main memory. We find permutation methods to be reasonably efficient and describe a setup where these methods are most useful. To ease reproducibility, we make our software and data sets publicly available.
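The filter-and-refine idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`pivot_permutation`, `footrule`, `permutation_knn`, `exact_knn`) are hypothetical, the data are random 2-D points, pivots are sampled at random, the original distance is assumed to be Euclidean, and the permutation distance is assumed to be the Spearman footrule. The filtering step is a brute-force scan over all permutations, whereas a practical method would index them.

```python
import math
import random


def pivot_permutation(point, pivots, dist):
    """Return pivot ranks: result[i] is the rank of pivot i when pivots
    are sorted by increasing distance to `point`."""
    order = sorted(range(len(pivots)), key=lambda i: dist(point, pivots[i]))
    ranks = [0] * len(pivots)
    for rank, pivot_id in enumerate(order):
        ranks[pivot_id] = rank
    return ranks


def footrule(ranks_a, ranks_b):
    """Spearman footrule: sum of absolute differences between pivot ranks."""
    return sum(abs(a - b) for a, b in zip(ranks_a, ranks_b))


def permutation_knn(query, data, perms, pivots, dist, k, num_candidates):
    """Approximate k-NN: filter by permutation distance, refine by true distance."""
    q_perm = pivot_permutation(query, pivots, dist)
    # Filtering: keep the points whose permutations are closest to the query's.
    # (Brute-force scan for illustration only.)
    candidates = sorted(
        range(len(data)), key=lambda i: footrule(q_perm, perms[i])
    )[:num_candidates]
    # Refinement: compute the original distance only for the candidates.
    candidates.sort(key=lambda i: dist(query, data[i]))
    return candidates[:k]


def exact_knn(query, data, dist, k):
    """Exact k-NN by a full scan, used as ground truth."""
    return sorted(range(len(data)), key=lambda i: dist(query, data[i]))[:k]


# Toy setup (all values are assumptions chosen for illustration).
rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(2000)]
pivots = rng.sample(data, 16)
perms = [pivot_permutation(p, pivots, math.dist) for p in data]

query = (0.5, 0.5)
approx = permutation_knn(query, data, perms, pivots, math.dist,
                         k=10, num_candidates=200)
exact = exact_knn(query, data, math.dist, k=10)
recall = len(set(approx) & set(exact)) / 10
print(f"recall@10 = {recall:.2f}")
```

In this sketch the true distance is computed for only `num_candidates` points per query instead of all 2000. Raising `num_candidates` trades speed for recall, and setting it to `len(data)` reduces the procedure to exact search.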



Index Terms

1. Computing methodologies
2. Information systems
   1. Information retrieval


Published In


Proceedings of the VLDB Endowment Volume 8, Issue 12

Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii

August 2015

728 pages

ISSN:2150-8097

Editors: Chen Li (University of California, Irvine), Volker Markl (TU Berlin)


Publisher

VLDB Endowment




