The cosine similarity in terms of the Euclidean distance

M Kryszkiewicz - Encyclopedia of Business Analytics and Optimization, 2014 - igi-global.com
In many applications, especially in information retrieval, text mining, biomedical engineering and chemistry, the cosine similarity is often used to find the objects most similar to a given one, the so-called nearest neighbors. Objects are typically represented as vectors. In particular, documents are often represented as term frequency vectors or their variants, such as tf-idf vectors (Salton, Wong, & Yang, 1975; Han, Kamber, & Pei, 2011). The cosine similarity measure between two vectors is interpreted as the cosine of the angle between them. According to this measure, two vectors are treated as similar if the angle between them is sufficiently small; that is, if its cosine is sufficiently close to 1. Determining nearest neighbors is challenging when the analyzed vectors are high dimensional. In the case of distance metrics, one may apply the triangle inequality to quickly prune large numbers of objects that certainly are not nearest neighbors of a given vector (Uhlmann, 1991; Moore, 2000; Elkan, 2003; Kryszkiewicz & Lasek, 2010a; Kryszkiewicz & Lasek, 2010b; Kryszkiewicz & Lasek, 2010c; Patra, Hubballi, Biswas, & Nandi, 2010; Kryszkiewicz & Lasek, 2011). The cosine similarity, however, is not a distance metric and, in particular, does not satisfy the triangle inequality in general. Despite this, it was recently shown (Kryszkiewicz, 2011; Kryszkiewicz, 2013a) that the problem of determining a cosine similarity neighborhood can be transformed into the problem of determining a neighborhood with respect to the Euclidean distance. This result allows the triangle inequality to be applied to speed up the determination of cosine similarity neighborhoods.
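The transformation mentioned above can be illustrated with a standard identity: for vectors u and v normalized to unit length, ||u - v||^2 = 2 - 2 cos(u, v), so cos(u, v) >= eps holds exactly when ||u - v|| <= sqrt(2 - 2 eps). The Python sketch below is only an illustration under that assumption, not the procedure from the cited papers; all function names, the reference vector, and the threshold are hypothetical choices made for this example.

```python
import math


def normalize(v):
    """Scale v to unit length; cosine similarity is undefined for the zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_neighborhood(query, vectors, eps):
    """Indices of vectors whose cosine similarity to query is at least eps."""
    return [i for i, v in enumerate(vectors) if cosine_similarity(query, v) >= eps]


def euclidean_neighborhood(query, vectors, eps, reference=None):
    """Same neighborhood, found via Euclidean distance on normalized vectors.

    If a (hypothetical) reference vector is given, the triangle inequality
    is used to skip candidates: |d(q, r) - d(v, r)| > radius implies
    d(q, v) > radius. In a real setting d(v, r) would be precomputed once.
    """
    radius = math.sqrt(2.0 - 2.0 * eps)
    q = normalize(query)
    r = normalize(reference) if reference is not None else None
    d_qr = euclidean_distance(q, r) if r is not None else None

    result = []
    for i, v in enumerate(vectors):
        n = normalize(v)
        if r is not None and abs(d_qr - euclidean_distance(n, r)) > radius:
            continue  # pruned without computing d(q, v)
        if euclidean_distance(q, n) <= radius:
            result.append(i)
    return result


query = [1.0, 0.0]
vectors = [[2.0, 0.1], [0.0, 3.0], [1.0, 1.0], [-1.0, 0.0]]
eps = 0.7

by_cosine = cosine_neighborhood(query, vectors, eps)
by_distance = euclidean_neighborhood(query, vectors, eps)
by_pruning = euclidean_neighborhood(query, vectors, eps, reference=[0.0, 1.0])
```

In this small example all three calls select the same vectors. Thresholds that fall exactly on a neighborhood boundary may behave differently because of floating-point rounding, so a tolerance would be needed in practice.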
IGI Global