research-article

Permutation search methods are efficient, yet faster search is possible

Published: 01 August 2015

Abstract

We survey permutation-based methods for approximate k-nearest neighbor search. In these methods, every data point is represented by a ranked list of pivots sorted by the distance to this point. Such ranked lists are called permutations. The underpinning assumption is that, for both metric and non-metric spaces, the distance between permutations is a good proxy for the distance between original points. Thus, it should be possible to efficiently retrieve most true nearest neighbors by examining only a tiny subset of data points whose permutations are similar to the permutation of a query. We further test this assumption by carrying out an extensive experimental evaluation where permutation methods are pitted against state-of-the-art benchmarks (the multi-probe LSH, the VP-tree, and proximity-graph based retrieval) on a variety of realistically large data sets from the image and textual domains. The focus is on high-accuracy retrieval methods for generic spaces. Additionally, we assume that both data and indices are stored in main memory. We find permutation methods to be reasonably efficient and describe a setup where these methods are most useful. To ease reproducibility, we make our software and data sets publicly available.
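The filter-and-refine idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`permutation`, `footrule`, `knn_filter_refine`) are hypothetical, the Euclidean distance stands in for an arbitrary (possibly non-metric) distance, and Spearman's footrule is assumed as the distance between permutations, which the abstract does not specify. The filtering step here scans all permutations by brute force, whereas practical methods would index them.

```python
import math


def euclidean(a, b):
    """Placeholder distance; any distance function could be swapped in."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def permutation(point, pivots, dist=euclidean):
    """Return ranks[i] = position of pivot i when pivots are sorted
    by increasing distance to `point`."""
    order = sorted(range(len(pivots)), key=lambda i: dist(point, pivots[i]))
    ranks = [0] * len(pivots)
    for pos, pivot_id in enumerate(order):
        ranks[pivot_id] = pos
    return ranks


def footrule(p, q):
    """Spearman's footrule: sum of absolute rank differences (assumed here)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def knn_filter_refine(query, data, pivots, perms, k, n_candidates, dist=euclidean):
    """Approximate k-NN: filter by permutation distance, refine by true distance."""
    q_perm = permutation(query, pivots, dist)
    # Filter: keep the points whose permutations are closest to the query's.
    candidates = sorted(range(len(data)),
                        key=lambda i: footrule(q_perm, perms[i]))[:n_candidates]
    # Refine: compute the original distance only for the surviving candidates.
    candidates.sort(key=lambda i: dist(query, data[i]))
    return candidates[:k]


# Toy usage: two pivots, four data points, one query.
data = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
pivots = [(0.0, 0.0), (6.0, 6.0)]
perms = [permutation(p, pivots) for p in data]
nearest = knn_filter_refine((5.2, 5.1), data, pivots, perms, k=1, n_candidates=2)
```

In this toy setup only two of the four points reach the refinement step, so the original distance is evaluated twice rather than four times. The savings on realistic data depend on how well the permutation distance tracks the original one, which is the assumption the paper evaluates.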



Published In

Proceedings of the VLDB Endowment, Volume 8, Issue 12
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
August 2015
728 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
