Overview
Similarity between a pair of objects, usually expressed as a similarity score in [0, 1], is a key concept when dealing with noisy or uncertain data, as is common in big data applications.
The aim of similarity sketching is to estimate similarities in a (high-dimensional) space using fewer computational resources (time and/or storage) than a naïve approach that stores unprocessed objects. This is achieved using a form of lossy compression that produces succinct representations of objects in the space, from which similarities can be estimated. In some spaces, it is more natural to consider distances rather than similarities; we will consider both of these measures of proximity in the following.
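To make the idea of estimating similarity from succinct representations concrete, the following is a minimal illustrative sketch in Python. It assumes one particular, well-known instance that the text above does not name: MinHash, which sketches sets so that their Jaccard similarity can be estimated. The function names (minhash_sketch, estimate_similarity, jaccard), the sketch length k, and the use of salted BLAKE2b as a stand-in for random hash functions are all choices made for this example only.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """64-bit hash of item; each seed acts as a different hash function (assumption: salted BLAKE2b)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(items: set, k: int = 128) -> list:
    """Compress a non-empty set of strings into k integers (the lossy sketch)."""
    # For each of the k hash functions, keep only the smallest hash value.
    return [min(_hash(seed, x) for x in items) for seed in range(k)]


def estimate_similarity(sketch_a: list, sketch_b: list) -> float:
    """Estimate Jaccard similarity as the fraction of coordinates on which the sketches agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, computed from the unprocessed sets for comparison."""
    return len(a & b) / len(a | b)
```

Each sketch has a fixed size of k integers regardless of how large the original set is, and the estimate is computed from the sketches alone. A larger k should give a more accurate estimate at the cost of more storage per object.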
Definitions
Formally, consider a space X of objects and a function d : X × X → R+. We refer to d as a distance function for X. Similarity sketching with respect to (X, d) is done by using a sketching function c:...