Similarity Sketching

Pagh, Rasmus

doi:10.1007/978-3-319-63962-8_58-1


Synonyms

Distance estimation; Similarity estimation; Similarity summarization

Overview

Similarity between a pair of objects, usually expressed as a similarity score in [0, 1], is a key concept when dealing with noisy or uncertain data, as is common in big data applications.

The aim of similarity sketching is to estimate similarities in a (high-dimensional) space using fewer computational resources (time and/or storage) than a naïve approach that stores unprocessed objects. This is achieved using a form of lossy compression that produces succinct representations of objects in the space, from which similarities can be estimated. In some spaces, it is more natural to consider distances rather than similarities; we will consider both of these measures of proximity in the following.
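As a concrete illustration of this idea, the following minimal sketch uses min-wise hashing (MinHash), a well-known technique for estimating the Jaccard similarity |A ∩ B| / |A ∪ B| of two sets. It is not taken from the text above: the function names `minhash_sketch` and `estimate_jaccard`, the sketch length `k`, and the use of seeded BLAKE2b as a stand-in for random hash functions are all assumptions made for the example. Each set is compressed to `k` integers, and the fraction of positions where two sketches agree serves as the similarity estimate.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of `item` under hash function number `seed`.

    Assumption: seeded BLAKE2b behaves closely enough to a random hash
    function for illustration purposes.
    """
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(items, k=128):
    """Compress a non-empty set of strings into k integers.

    Coordinate i holds the minimum hash value of any element under
    hash function i.
    """
    items = set(items)
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(seed, x) for x in items) for seed in range(k)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of agreeing coordinates."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, computed from the unprocessed sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two sets of 300 strings that share 150 elements.
    set_a = {f"item{i}" for i in range(0, 300)}
    set_b = {f"item{i}" for i in range(150, 450)}
    sk_a, sk_b = minhash_sketch(set_a), minhash_sketch(set_b)
    print("exact:   ", exact_jaccard(set_a, set_b))
    print("estimate:", estimate_jaccard(sk_a, sk_b))
```

The sketch length `k` controls the trade-off described above: each sketch occupies `k` integers regardless of the size of the original set, and under the random-hash assumption the standard error of the estimate shrinks roughly as 1/√k.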

Definitions

Formally, consider a space X of objects and a function d : X × X → R₊. We refer to d as a distance function for X. Similarity sketching with respect to (X, d) is done by using a sketching function c:...

Acknowledgements

This work received support from the European Research Council under the European Union’s 7th Framework Programme (FP7/2007-2013)/ ERC grant agreement no. 614331.

Author information

Authors and Affiliations

Computer Science Department, IT University of Copenhagen, Rued Langgaards Vej 7, 2300 Copenhagen S, Denmark
Rasmus Pagh

Corresponding author

Correspondence to Rasmus Pagh.

Editor information

Editors and Affiliations

School of Comp. Sci. and Engineering, University of New South Wales, Eveleigh, New South Wales, Australia
Sherif Sakr
Sch of Info Techno, Building J12, University of Sydney, Sydney, Australia
Albert Zomaya

Section Editor information

Department of Computer Science, University of Pisa, Largo B. Pontecorvo 3, 56127, Pisa, Italy
Paolo Ferragina

About this entry

Cite this entry

Pagh, R. (2018). Similarity Sketching. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_58-1

DOI: https://doi.org/10.1007/978-3-319-63962-8_58-1
Published: 29 January 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Living Reference Mathematics; Reference Module Computer Science and Engineering
