research-article

iDEC: indexable distance estimating codes for approximate nearest neighbor search

Published: 01 May 2020

Abstract

Approximate Nearest Neighbor (ANN) search is a fundamental algorithmic problem with numerous applications in many areas of computer science. In this work, we propose indexable distance estimating codes (iDEC), a new solution framework for ANN that extends and improves the locality sensitive hashing (LSH) framework in a fundamental and systematic way. Empirically, an iDEC-based solution has a low index space complexity of O(n) and can achieve a low average query time complexity of approximately O(log n). We show that our iDEC-based solutions for ANN in Hamming distance (ANN-H) and in edit distance outperform the respective state-of-the-art LSH-based solutions for both in-memory and external-memory processing. We also show that our iDEC-based in-memory ANN-H solution is more scalable than all existing solutions, and we discover deep connections between Error-Estimating Codes (EEC), LSH, and iDEC.
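
For readers unfamiliar with the LSH baseline that the abstract refers to, the following is a minimal sketch of classic bit-sampling LSH for Hamming space. It is not the iDEC scheme and is not taken from the paper; the class name `BitSamplingLSH`, its parameters (`dim`, `bits_per_key`, `num_tables`), and the representation of points as Python integers are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integers used as bit vectors."""
    return bin(a ^ b).count("1")


class BitSamplingLSH:
    """Illustrative bit-sampling LSH index for Hamming space (hypothetical, not iDEC)."""

    def __init__(self, dim: int, bits_per_key: int, num_tables: int, seed: int = 0):
        rng = random.Random(seed)
        # Each table hashes a point by its values at a random subset of bit positions.
        self.positions = [rng.sample(range(dim), bits_per_key) for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    @staticmethod
    def _key(point: int, positions):
        return tuple((point >> p) & 1 for p in positions)

    def insert(self, point: int) -> None:
        for pos, table in zip(self.positions, self.tables):
            table[self._key(point, pos)].append(point)

    def query(self, q: int):
        """Return the closest point that collides with q in any table, or None."""
        candidates = set()
        for pos, table in zip(self.positions, self.tables):
            candidates.update(table.get(self._key(q, pos), []))
        return min(candidates, key=lambda p: hamming(p, q), default=None)
```

Points that are close in Hamming distance agree on most bit positions, so they are likely to share a bucket in at least one table; the query then ranks only those colliding candidates by exact distance. Raising `bits_per_key` makes buckets more selective, while raising `num_tables` increases the chance that a true near neighbor collides.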

    Published In

    Proceedings of the VLDB Endowment, Volume 13, Issue 9
    May 2020
    295 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 May 2020
    Published in PVLDB Volume 13, Issue 9

    Qualifiers

    • Research-article

    Article Metrics

    • Downloads (Last 12 months): 29
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 20 Feb 2025

    Cited By

    • (2025) DEG: Efficient Hybrid Vector Search Using the Dynamic Edge Navigation Graph. Proceedings of the ACM on Management of Data 3:1 (1-28). DOI: 10.1145/3709679. Online publication date: 11-Feb-2025
    • (2025) Efficient top-k spatial-range-constrained approximate nearest neighbor search on geo-tagged high-dimensional vectors. The VLDB Journal 34:1. DOI: 10.1007/s00778-024-00894-5. Online publication date: 4-Jan-2025
    • (2024) Scalable billion-point approximate nearest neighbor search using SmartSSDs. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference (1135-1150). DOI: 10.5555/3691992.3692061. Online publication date: 10-Jul-2024
    • (2024) Fast vector query processing for large datasets beyond GPU memory with reordered pipelining. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (23-40). DOI: 10.5555/3691825.3691827. Online publication date: 16-Apr-2024
    • (2024) RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search. Proceedings of the ACM on Management of Data 2:3 (1-27). DOI: 10.1145/3654970. Online publication date: 30-May-2024
    • (2024) ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data. Proceedings of the ACM on Management of Data 2:3 (1-27). DOI: 10.1145/3654923. Online publication date: 30-May-2024
    • (2024) Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639269. Online publication date: 26-Mar-2024
    • (2024) Optimizing the Number of Clusters for Billion-Scale Quantization-Based Nearest Neighbor Search. IEEE Transactions on Knowledge and Data Engineering 36:11 (6786-6800). DOI: 10.1109/TKDE.2024.3408815. Online publication date: 4-Jun-2024
    • (2024) Verifiable Graph-Based Approximate Nearest Neighbor Search. Advanced Data Mining and Applications (3-17). DOI: 10.1007/978-981-96-0821-8_1. Online publication date: 3-Dec-2024
    • (2023) An efficient and robust framework for approximate nearest neighbor search with attribute constraint. Proceedings of the 37th International Conference on Neural Information Processing Systems (15738-15751). DOI: 10.5555/3666122.3666814. Online publication date: 10-Dec-2023
