DOI: 10.1007/978-3-031-46994-7_4
Article

An Alternating Optimization Scheme for Binary Sketches for Cosine Similarity Search

Published: 27 October 2023

Abstract

Searching for similar objects in intrinsically high-dimensional data sets is a challenging task. Sketches have been proposed for faster similarity search using linear scans. Binary sketches are one such approach, seeking a good mapping from the original data space to bit strings of a fixed length. These bit strings can be compared efficiently using only a few XOR and bit count operations, replacing costly similarity computations with an inexpensive approximation. We propose a new scheme to initialize and improve binary sketches for similarity search on the unit sphere, i.e., for cosine similarity. Our optimization iteratively improves the quality of the sketches with a form of orthogonalization. We provide empirical evidence that the quality of the sketches has a peak beyond which it is correlated with neither bit independence nor bit balance, which contradicts a previous hypothesis in the literature. Regularization in the form of noise added to the training data can turn the peak into a plateau. Applying the optimization in a stochastic fashion, i.e., training on smaller subsets of the data, allows for rapid initialization.
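To make the XOR and bit count comparison concrete, here is a minimal sketch of the classic random-hyperplane ("project and sign") baseline. It is not the alternating optimization scheme proposed in the paper, which the abstract does not specify in detail. All function names, the seed, and the choice of 256 bits are assumptions made for illustration. The cosine estimate relies on the known property that, for random hyperplanes, the expected fraction of differing bits equals the angle between the two vectors divided by pi. A small helper also reports bit balance, taken here to mean the fraction of sketches in which each bit is set.

```python
import math
import random

def random_hyperplanes(num_bits, dim, seed=0):
    """Draw one Gaussian normal vector per sketch bit (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def sketch(vector, hyperplanes):
    """Pack the signs of the projections into a Python int used as a bit string."""
    bits = 0
    for i, normal in enumerate(hyperplanes):
        if sum(n * x for n, x in zip(normal, vector)) >= 0.0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Hamming distance: XOR the two sketches, then count the set bits."""
    return bin(a ^ b).count("1")

def estimated_cosine(a, b, num_bits):
    """Estimate cosine similarity from the fraction of differing bits (angle / pi)."""
    return math.cos(math.pi * hamming(a, b) / num_bits)

def bit_balance(sketches, num_bits):
    """Fraction of sketches with each bit set; 0.5 means a perfectly balanced bit."""
    return [sum((s >> i) & 1 for s in sketches) / len(sketches)
            for i in range(num_bits)]

# Usage: two vectors 45 degrees apart, so the true cosine is about 0.707.
planes = random_hyperplanes(num_bits=256, dim=16)
x = [1.0] + [0.0] * 15
y = [1.0, 1.0] + [0.0] * 14
print(estimated_cosine(sketch(x, planes), sketch(y, planes), 256))
```

Because only the sign of each projection is used, the input vectors do not need to be normalized first. The printed estimate is noisy and tightens as the number of bits grows.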


Published In

Similarity Search and Applications: 16th International Conference, SISAP 2023, A Coruña, Spain, October 9–11, 2023, Proceedings
Oct 2023
324 pages
ISBN: 978-3-031-46993-0
DOI: 10.1007/978-3-031-46994-7

Publisher

Springer-Verlag, Berlin, Heidelberg

Qualifiers

  • Article
