An Information-Theoretic Definition of Similarity
Dekang Lin
Department of Computer Science
University of Manitoba
Winnipeg, Manitoba, Canada R3T 2N2
Intuition 3: The maximum similarity between A and B is reached when A and B are identical, no matter how much commonality they share.

Our goal is to arrive at a definition of similarity that captures the above intuitions. However, there are many alternative ways to define similarity that would be consistent with the intuitions. In this section, we first make a set of additional assumptions about similarity that we believe to be reasonable. A similarity measure can then be derived from those assumptions.

In order to capture the intuition that the similarity of two objects is related to their commonality, we need a measure of commonality. Our first assumption is:

Assumption 1: The commonality between A and B is measured by

  I(common(A, B))

where common(A, B) is a proposition that states the commonalities between A and B; I(s) is the amount of information contained in a proposition s.

For example, suppose A is an orange and B is an apple. The proposition that states the commonality between A and B is "fruit(A) and fruit(B)". In information theory [Cover and Thomas, 1991], the information contained in a statement is measured by the negative logarithm of the probability of the statement. Therefore,

  I(common(A, B)) = -log P(fruit(A) and fruit(B))

We also need a measure of the differences between two objects. Since knowing both the commonalities and the differences between A and B means knowing what A and B are, we assume:

Assumption 2: The differences between A and B are measured by

  I(description(A, B)) - I(common(A, B))

where description(A, B) is a proposition that describes what A and B are.

Intuitions 1 and 2 state that the similarity between two objects is related to their commonalities and differences. We assume that commonalities and differences are the only factors.

Assumption 3: The similarity between A and B, sim(A, B), is a function of their commonalities and differences; that is, sim(A, B) = f(I(common(A, B)), I(description(A, B))).

Assumption 4: The similarity between a pair of identical objects is 1.

When A and B are identical, knowing their commonalities means knowing what they are, i.e., I(common(A, B)) = I(description(A, B)). Therefore, the function f must have the property: for all x > 0, f(x, x) = 1.

When there is no commonality between A and B, we assume their similarity is 0, no matter how different they are. For example, the similarity between "depth-first search" and "leather sofa" is neither higher nor lower than the similarity between "rectangle" and "interest rate".

Assumption 5: For all y > 0, f(0, y) = 0.

Suppose two objects A and B can be viewed from two independent perspectives. Their similarity can be computed separately from each perspective. For example, the similarity between two documents can be calculated by comparing the sets of words in the documents or by comparing their stylistic parameter values, such as average word length, average sentence length, average number of verbs per sentence, etc. We assume that the overall similarity of the two documents is a weighted average of their similarities computed from the different perspectives. The weights are the amounts of information in the descriptions. In other words, we make the following assumption:

Assumption 6: For all x1 ≤ y1 and x2 ≤ y2,

  f(x1 + x2, y1 + y2) = (y1 / (y1 + y2)) f(x1, y1) + (y2 / (y1 + y2)) f(x2, y2)

From the above assumptions, we can prove the following theorem:

Similarity Theorem: The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are:

  sim(A, B) = log P(common(A, B)) / log P(description(A, B))

Proof:

  f(x, y) = f(x + 0, x + (y - x))
          = (x / y) f(x, x) + ((y - x) / y) f(0, y - x)    (Assumption 6)
          = (x / y) × 1 + ((y - x) / y) × 0                (Assumptions 4 and 5)
          = x / y

Q.E.D.

Since similarity is the ratio between the amount of information in the commonality and the amount of information in the description of the two objects, if we know the commonality of the two objects, their similarity tells us how much more information is needed to determine what these two objects are.
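The theorem can be read directly as a computation: given the probability of the commonality and of the full description, the similarity is the ratio of their log-probabilities. A minimal sketch (the function names and the toy probabilities are illustrative, not taken from the paper):

```python
import math

def information(p: float) -> float:
    """Amount of information (in nats) of a proposition with probability p."""
    return -math.log(p)

def similarity(p_common: float, p_description: float) -> float:
    """sim(A, B) = log P(common(A, B)) / log P(description(A, B)).

    Equivalently I(common) / I(description); assumes
    0 < p_description <= p_common < 1.
    """
    return math.log(p_common) / math.log(p_description)

# Toy illustration of the orange/apple example: if P(fruit(A) and fruit(B)) = 0.25
# and the full description of A and B has probability 0.001 (both numbers made up),
# then sim(A, B) ≈ 0.20.
print(similarity(0.25, 0.001))
```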
In the next 4 sections, we demonstrate how the above definition can be applied in different domains.
3 Similarity between Ordinal Values

Many features have ordinal values. For example, the "quality" attribute can take one of the following values: "excellent", "good", "average", "bad", or "awful". None of the previous definitions of similarity provides a measure for the similarity between two ordinal values. We now show how our definition can be applied here.

If "the quality of X is excellent" and "the quality of Y is average", the maximally specific statement that can be said of both X and Y is that "the quality of X and Y are between 'average' and 'excellent'". Therefore, the commonality between two ordinal values is the interval delimited by them.

Suppose the distribution of the "quality" attribute is known (Figure 1). The following are four examples of similarity calculations:

Figure 1: Example Distribution of Ordinal Values (probability: excellent 5%, good 10%, average 50%, bad 20%, awful 15%)

  sim(excellent, good) = 2 log P(excellent ∨ good) / (log P(excellent) + log P(good))
                       = 2 log(0.05 + 0.10) / (log 0.05 + log 0.10) = 0.72

  sim(good, average) = 2 log P(good ∨ average) / (log P(good) + log P(average))
                     = 2 log(0.10 + 0.50) / (log 0.10 + log 0.50) = 0.34

  sim(excellent, average) = 2 log P(excellent ∨ good ∨ average) / (log P(excellent) + log P(average))
                          = 2 log(0.05 + 0.10 + 0.50) / (log 0.05 + log 0.50) = 0.23

  sim(good, bad) = 2 log P(good ∨ average ∨ bad) / (log P(good) + log P(bad))
                 = 2 log(0.10 + 0.50 + 0.20) / (log 0.10 + log 0.20) = 0.11

The results show that, given the probability distribution in Figure 1, the similarity between "excellent" and "good" is much higher than the similarity between "good" and "average"; the similarity between "excellent" and "average" is much higher than the similarity between "good" and "bad".
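The four numbers above can be reproduced directly from the Figure 1 distribution; the commonality of two ordinal values is the interval delimited by them. A short sketch (the function and variable names are ours):

```python
import math

# Ordered values and their probabilities (Figure 1).
DIST = {"excellent": 0.05, "good": 0.10, "average": 0.50, "bad": 0.20, "awful": 0.15}
ORDER = ["excellent", "good", "average", "bad", "awful"]

def sim_ordinal(a: str, b: str) -> float:
    """sim(a, b) = 2 log P(interval from a to b) / (log P(a) + log P(b))."""
    i, j = sorted((ORDER.index(a), ORDER.index(b)))
    p_interval = sum(DIST[v] for v in ORDER[i:j + 1])
    return 2 * math.log(p_interval) / (math.log(DIST[a]) + math.log(DIST[b]))

for pair in [("excellent", "good"), ("good", "average"),
             ("excellent", "average"), ("good", "bad")]:
    print(pair, round(sim_ordinal(*pair), 2))   # 0.72, 0.34, 0.23, 0.11
```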
4 Feature Vectors

Feature vectors are one of the simplest and most commonly used forms of knowledge representation, especially in case-based reasoning [Aha et al., 1991; Stanfill and Waltz, 1986] and machine learning. Weights are often assigned to features to account for the fact that the dissimilarity caused by more important features is greater than the dissimilarity caused by less important features. The assignment of the weight parameters is generally heuristic in nature in previous approaches. Our definition of similarity provides a more principled approach, as demonstrated in the following case study.

4.1 String Similarity—A case study

Consider the task of retrieving from a word list the words that are derived from the same root as a given word. For example, given the word "eloquently", our objective is to retrieve the other related words, such as "ineloquent", "ineloquently", "eloquent", and "eloquence". To do so, assuming that a morphological analyzer is not available, one can define a similarity measure between two strings and rank the words in the word list in descending order of their similarity to the given word. The similarity measure should be such that words derived from the same root as the given word appear early in the ranking.

We experimented with three similarity measures. The first one is defined as follows:

  sim_edit(x, y) = 1 / (1 + editDist(x, y))

where editDist(x, y) is the minimum number of character insertion and deletion operations needed to transform one string into the other.

The second similarity measure is based on the number of different trigrams in the two strings:

  sim_tri(x, y) = 1 / (1 + |tri(x)| + |tri(y)| - 2 |tri(x) ∩ tri(y)|)

where tri(x) is the set of trigrams in x. For example, tri(eloquent) = {elo, loq, oqu, que, ent}.
Table 1: Top-10 Most Similar Words to "grandiloquent"

Rank  sim_edit              sim_tri                sim
 1    grandiloquently 1/3   grandiloquently 1/2    grandiloquently 0.92
 2    grandiloquence  1/4   grandiloquence  1/4    grandiloquence  0.89
 3    magniloquent    1/6   eloquent        1/8    eloquent        0.61
 4    gradient        1/6   grand           1/9    magniloquent    0.59
 5    grandaunt       1/7   grande          1/10   ineloquent      0.55
 6    gradients       1/7   rand            1/10   eloquently      0.55
 7    grandiose       1/7   magniloquent    1/10   ineloquently    0.50
 8    diluent         1/7   ineloquent      1/10   magniloquence   0.50
 9    ineloquent      1/8   grands          1/10   eloquence       0.50
10    grandson        1/8   eloquently      1/10   ventriloquy     0.42
The third similarity measure is based on our proposed definition of similarity, under the assumption that the probability of a trigram occurring in a word is independent of the other trigrams in the word. The word list used in this experiment was obtained from the AI Repository at http://www.cs.cmu.edu/afs/cs/project/ai-repository. For a given word, the words in the list are ranked in descending order of their similarity to it; the quality of the resulting sequence w1, ..., wn can be measured by the 11-point average of its precisions at recall levels 0%, 10%, 20%, ..., and 100%. The average precision values are then averaged over all the words in Wroot.

5 Word Similarity

A dependency triple consists of a head, a dependency relationship, and a modifier, for example (dog det a), where "subj" is the relationship between a verb and its subject; "obj" is the relationship between a verb and its object; "adj-mod" is the relationship between a noun and its adjective modifier; and "det" is the relationship between a noun and its determiner.

We can view dependency triples extracted from a corpus as features of the heads and modifiers in the triples. Suppose (avert obj duty) is a dependency triple; we say that "duty" has the feature obj-of(avert) and "avert" has the feature obj(duty). Other words that also possess the feature obj-of(avert) include "default", "crisis", "eye", "panic", "strike", "war", etc., which are also used as objects of "avert" in the corpus.

Let F(w) be the set of features possessed by w. F(w) can be viewed as a description of the word w. The commonalities between two words w1 and w2 are then F(w1) ∩ F(w2). The similarity between two words is defined as follows:

  (2)  sim(w1, w2) = 2 I(F(w1) ∩ F(w2)) / (I(F(w1)) + I(F(w2)))

where I(S) is the amount of information contained in a set of features S. Assuming that features are independent of one another, I(S) = -Σ_{f∈S} log P(f), where P(f) is the probability of feature f. When two words have identical sets of features, their similarity reaches the maximum value of 1. The minimum similarity 0 is reached when two words do not have any common feature.

The probability P(f) can be estimated by the percentage of words that have feature f among the set of words that have the same part of speech. For example, there are 32868 unique nouns in a corpus, 1405 of which were used as subjects of "include". The probability of subj-of(include) is therefore 1405/32868. The probability of the feature adj-mod(fiduciary) is 14/32868 because only 14 (unique) nouns were modified by "fiduciary". The amount of information in the feature adj-mod(fiduciary), 7.76, is greater than the amount of information in subj-of(include), 3.15. This agrees with our intuition that saying that a word can be modified by "fiduciary" is more informative than saying that the word can be the subject of "include".

The fourth column in Table 3 shows the amount of information contained in each feature. If the features in Table 3 were all the features of "duty" and "sanction", the similarity between "duty" and "sanction" would be:

  2 I({f1, f3, f5, f7}) / (I({f1, f2, f3, f5, f6, f7}) + I({f1, f3, f4, f5, f7, f8}))

The following is the list of similar words of "duty" generated by our program:

(3) responsibility, position, sanction, tariff, obligation, fee, post, job, role, tax, penalty, condition, function, assignment, power, expense, task, deadline, training, work, standard, ban, restriction, authority, commitment, award, liability, requirement, staff, membership, limit, pledge, right, chore, mission, care, title, capability, patrol, fine, faith, seat, levy, violation, load, salary, attitude, bonus, schedule, instruction, rank, purpose, personnel, worth, jurisdiction, presidency, exercise.

The following is the entry for "duty" in the Random House Thesaurus [Stein and Flexner, 1984]:

(4) duty n. 1. obligation, responsibility; onus; business, province; 2. function, task, assignment, charge. 3. tax, tariff, customs, excise, levy.

Eight of the words in (4) — obligation, responsibility, function, task, assignment, tax, tariff, and levy — also appear in (3). It can be seen that our program captured all three senses of "duty" in [Stein and Flexner, 1984].

Two words are a pair of respective nearest neighbors (RNNs) if each is the other's most similar word. Our program found 622 pairs of RNNs among the 5230 nouns that occurred at least 50 times in the parsed corpus. Table 4 shows every 10th RNN.
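A compact sketch of formula (2) over dependency-triple features. The corpus counts below are limited to the two figures quoted above (1405 subjects of "include" and 14 nouns modified by "fiduciary", out of 32868 unique nouns); the feature sets passed to sim_word are invented purely for illustration:

```python
import math

TOTAL_NOUNS = 32868  # unique nouns in the corpus (from the example above)
FEATURE_COUNTS = {   # how many unique nouns carry each feature
    "subj-of(include)": 1405,
    "adj-mod(fiduciary)": 14,
}

def info(feature: str) -> float:
    """I(f) = -log P(f); natural logarithms reproduce the values quoted above."""
    return -math.log(FEATURE_COUNTS[feature] / TOTAL_NOUNS)

def info_set(features) -> float:
    """I(S) = sum of I(f) over f in S (features assumed independent)."""
    return sum(info(f) for f in features)

def sim_word(f1: set, f2: set) -> float:
    """Formula (2): sim(w1, w2) = 2 I(F1 ∩ F2) / (I(F1) + I(F2))."""
    return 2 * info_set(f1 & f2) / (info_set(f1) + info_set(f2))

print(round(info("subj-of(include)"), 2))    # 3.15
print(round(info("adj-mod(fiduciary)"), 2))  # 7.76

# Hypothetical feature sets for two words, just to exercise sim_word:
F_a = {"subj-of(include)", "adj-mod(fiduciary)"}
F_b = {"subj-of(include)"}
print(round(sim_word(F_a, F_b), 2))
```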
Table 4: Respective Nearest Neighbors

Rank  RNN                          Sim
   1  earnings profit              0.50
  11  revenue sale                 0.39
  21  acquisition merger           0.34
  31  attorney lawyer              0.32
  41  data information             0.30
  51  amount number                0.27
  61  downturn slump               0.26
  71  there way                    0.24
  81  fear worry                   0.23
  91  jacket shirt                 0.22
 101  film movie                   0.21
 111  felony misdemeanor           0.21
 121  importance significance      0.20
 131  reaction response            0.19
 141  heroin marijuana             0.19
 151  championship tournament      0.18
 161  consequence implication      0.18
 171  rape robbery                 0.17
 181  dinner lunch                 0.17
 191  turmoil upheaval             0.17
 201  biggest largest              0.17
 211  blaze fire                   0.16
 221  captive westerner            0.16
 231  imprisonment probation       0.16
 241  apparel clothing             0.15
 251  comment elaboration          0.15
 261  disadvantage drawback        0.15
 271  infringement negligence      0.15
 281  angler fishermen             0.14
 291  emission pollution           0.14
 301  granite marble               0.14
 311  gourmet vegetarian           0.14
 321  publicist stockbroker        0.14
 331  maternity outpatient         0.13
 341  artillery warplanes          0.13
 351  psychiatrist psychologist    0.13
 361  blunder fiasco               0.13
 371  door window                  0.13
 381  counseling therapy           0.12
 391  austerity stimulus           0.12
 401  ours yours                   0.12
 411  procurement zoning           0.12
 421  neither none                 0.12
 431  briefcase wallet             0.11
 441  audition rite                0.11
 451  nylon silk                   0.11
 461  columnist commentator        0.11
 471  avalanche raft               0.11
 481  herb olive                   0.11
 491  distance length              0.10
 501  interruption pause           0.10
 511  ocean sea                    0.10
 521  flying watching              0.10
 531  ladder spectrum              0.09
 541  lotto poker                  0.09
 551  camping skiing               0.09
 561  lip mouth                    0.09

Some of the pairs may look peculiar. Detailed examination actually reveals that they are quite reasonable. For example, the pair ranked 221st is "captive" and "westerner". It is very unlikely that any manually created thesaurus will list them as near-synonyms. We manually examined all 274 occurrences of "westerner" in the corpus and found that 55% of them refer to westerners in captivity. Some of the bad RNNs, such as (avalanche, raft) and (audition, rite), were due to their relatively low frequencies, which make them susceptible to accidental commonalities, such as:

(5) The {avalanche, raft} {drifted, hit} ....
    To {hold, attend} the {audition, rite}.
    An uninhibited {audition, rite}.

6 Semantic Similarity in a Taxonomy

Semantic similarity [Resnik, 1995b] refers to the similarity between two concepts in a taxonomy such as WordNet [Miller, 1990] or the CYC upper ontology. The semantic similarity between two classes C and C′ is not about the classes themselves. When we say "rivers and ditches are similar", we are not comparing the set of rivers with the set of ditches. Instead, we are comparing a generic river and a generic ditch. Therefore, we define sim(C, C′) to be the similarity between x and x′ if all we know about x and x′ is that x ∈ C and x′ ∈ C′.

The two statements "x ∈ C" and "x′ ∈ C′" are independent (instead of being assumed to be independent) because the selection of a generic C is not related to the selection of a generic C′. The amount of information contained in "x ∈ C and x′ ∈ C′" is

  -log P(C) - log P(C′)

where P(C) and P(C′) are the probabilities that a randomly selected object belongs to C and C′, respectively.

Assuming that the taxonomy is a tree, if x1 ∈ C1 and x2 ∈ C2, the commonality between x1 and x2 is "x1 ∈ C0 ∧ x2 ∈ C0", where C0 is the most specific class that subsumes both C1 and C2. Therefore,

  sim(x1, x2) = 2 log P(C0) / (log P(C1) + log P(C2))
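A small sketch of this measure over a toy IS-A tree, where P(C) is the fraction of objects that fall under each class; the tree and the probabilities below are invented for illustration:

```python
import math

# Toy IS-A tree: child -> parent, plus a probability for each class
# (fraction of objects that belong to the class; the root has probability 1).
PARENT = {"hill": "geological-formation", "coast": "geological-formation",
          "geological-formation": "entity"}
P = {"entity": 1.0, "geological-formation": 0.1, "hill": 0.02, "coast": 0.03}

def ancestors(c: str) -> list:
    """c together with all of its superclasses, most specific first."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def most_specific_common(c1: str, c2: str) -> str:
    a2 = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in a2)

def sim_taxonomy(c1: str, c2: str) -> float:
    """sim(x1, x2) = 2 log P(C0) / (log P(C1) + log P(C2))."""
    c0 = most_specific_common(c1, c2)
    return 2 * math.log(P[c0]) / (math.log(P[c1]) + math.log(P[c2]))

print(round(sim_taxonomy("hill", "coast"), 2))  # ≈ 0.62 with these made-up probabilities
```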
The similarity measure proposed by Wu and Palmer [Wu and Palmer, 1994] is defined as

  sim_Wu&Palmer(A, B) = 2 N3 / (N1 + N2 + 2 N3)

where N1 and N2 are the numbers of IS-A links from A and B to their most specific common superclass C, and N3 is the number of IS-A links from C to the root of the taxonomy. For example, the most specific common superclass of Hill and Coast is Geological-Formation. Thus N1 = 2, N2 = 2, N3 = 3, and sim_Wu&Palmer(Hill, Coast) = 0.6.
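The link counting can be checked directly. A sketch on a toy tree that reproduces the Hill/Coast example; the intermediate class names are illustrative:

```python
PARENT = {  # toy IS-A chain; intermediate class names are illustrative
    "hill": "natural-elevation", "natural-elevation": "geological-formation",
    "coast": "shore", "shore": "geological-formation",
    "geological-formation": "natural-object", "natural-object": "object",
    "object": "entity",  # "entity" is the root
}

def path_up(c):
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def sim_wu_palmer(a: str, b: str) -> float:
    """2 N3 / (N1 + N2 + 2 N3) with N1, N2, N3 counted as described above."""
    up_a, up_b = path_up(a), path_up(b)
    c = next(x for x in up_a if x in set(up_b))   # most specific common superclass
    n1, n2 = up_a.index(c), up_b.index(c)         # links from A and from B up to C
    n3 = len(path_up(c)) - 1                      # links from C up to the root
    return 2 * n3 / (n1 + n2 + 2 * n3)

print(sim_wu_palmer("hill", "coast"))  # 0.6, matching the example above
```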
Interestingly, if P(C|C′) is the same for all pairs of concepts such that there is an IS-A link from C to C′ in the taxonomy, sim_Wu&Palmer(A, B) coincides with sim(A, B).
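A numeric illustration of this claim on the toy tree above: if every IS-A link scales the probability by the same factor q (i.e., P(C | parent of C) = q for all classes), the information-theoretic measure reduces to the link-counting one. The value of q is arbitrary:

```python
import math

q = 0.1  # assumed constant P(C | parent of C); any value in (0, 1) works

# Depth = number of IS-A links from the class to the root, as in the example above.
DEPTH = {"geological-formation": 3, "hill": 5, "coast": 5}
P = {c: q ** d for c, d in DEPTH.items()}

sim_info = 2 * math.log(P["geological-formation"]) / (math.log(P["hill"]) + math.log(P["coast"]))
sim_wp = 2 * 3 / (2 + 2 + 2 * 3)  # 2*N3 / (N1 + N2 + 2*N3) with N1 = N2 = 2, N3 = 3
print(round(sim_info, 6), sim_wp)  # both 0.6
```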
Resnik [Resnik, 1995a] evaluated three different similarity measures by correlating their similarity scores on 28 pairs of concepts in WordNet with assessments made by human subjects [Miller and Charles, 1991]. We adopted the same setup and compared sim_Resnik, sim_Wu&Palmer, and sim. Table 5 shows the similarities between 28 pairs of concepts, using the three different similarity measures. The column Miller&Charles lists the average similarity scores (on a scale of 0 to 4) assigned by human subjects in Miller&Charles's experiments [Miller and Charles, 1991]. Our definition of similarity yielded slightly higher correlation with human judgments than the other two measures.
7 Comparison between Different Similarity Measures

One of the most commonly used similarity measures is called the Dice coefficient. Suppose two objects can be described with two numerical vectors (a1, a2, ..., an) and (b1, b2, ..., bn); their Dice coefficient is defined as

  sim_dice(A, B) = 2 Σ_{i=1..n} a_i b_i / (Σ_{i=1..n} a_i² + Σ_{i=1..n} b_i²)

Another class of similarity measures is based on a distance metric. Suppose dist(A, B) is a distance metric between two objects; sim_dist can be defined as follows:

  sim_dist(A, B) = 1 / (1 + dist(A, B))

Table 6: Comparison between Similarity Measures
Similarity measures: WP = sim_Wu&Palmer; R = sim_Resnik; Dice = sim_dice.

Property                    sim   WP    R     Dice  sim_dist
increase with commonality   yes   yes   yes   yes   no
decrease with difference    yes   yes   no    yes   yes
triangle inequality         no    no    no    no    yes
Assumption 6                yes   yes   no    yes   no
maximum value = 1           yes   yes   no    yes   yes
semantic similarity         yes   yes   yes   no    yes
word similarity             yes   no    no    yes   yes
ordinal values              yes   no    no    no    no

Triangle Inequality: In Figure 3, A and B are similar in their shades, B and C are similar in their shape, but A and C are not similar.

Figure 3: Counter-example of Triangle Inequality

Assumption 6: The strongest assumption that we made in Section 2 is Assumption 6. However, this assumption is not unique to our proposal. Both sim_Wu&Palmer and sim_dice also satisfy Assumption 6. Suppose two objects A and B are represented by two feature vectors (a1, a2, ..., an) and (b1, b2, ..., bn), respectively. Without loss of generality, suppose the first k features and the remaining n - k features represent two independent perspectives of the objects. Then

  sim_dice(A, B) = 2 Σ_{i=1..n} a_i b_i / (Σ_{i=1..n} a_i² + Σ_{i=1..n} b_i²)
    = [(Σ_{i=1..k} a_i² + Σ_{i=1..k} b_i²) / (Σ_{i=1..n} a_i² + Σ_{i=1..n} b_i²)] × [2 Σ_{i=1..k} a_i b_i / (Σ_{i=1..k} a_i² + Σ_{i=1..k} b_i²)]
    + [(Σ_{i=k+1..n} a_i² + Σ_{i=k+1..n} b_i²) / (Σ_{i=1..n} a_i² + Σ_{i=1..n} b_i²)] × [2 Σ_{i=k+1..n} a_i b_i / (Σ_{i=k+1..n} a_i² + Σ_{i=k+1..n} b_i²)]

which is a weighted average of the similarity between A and B in each of the two perspectives.
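A quick numeric check of this decomposition; the vectors and the split point k below are arbitrary:

```python
def dice(a, b):
    """sim_dice over two numerical vectors."""
    return 2 * sum(x * y for x, y in zip(a, b)) / (sum(x * x for x in a) + sum(y * y for y in b))

# Arbitrary feature vectors, split into two "perspectives" at k = 2.
a, b, k = [1.0, 2.0, 0.5, 3.0], [2.0, 1.0, 1.0, 2.0], 2

def weight(a_part, b_part):
    """Share of the overall denominator contributed by one perspective."""
    return (sum(x * x for x in a_part) + sum(y * y for y in b_part)) / \
           (sum(x * x for x in a) + sum(y * y for y in b))

lhs = dice(a, b)
rhs = weight(a[:k], b[:k]) * dice(a[:k], b[:k]) + weight(a[k:], b[k:]) * dice(a[k:], b[k:])
print(round(lhs, 6), round(rhs, 6))  # identical: the weighted average equals sim_dice
```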
Maximum Similarity Values: With most similarity measures the maximum similarity is 1; the exception is sim_Resnik, which has no upper bound on its similarity values.

Application Domains: The similarity measure proposed in this paper can be applied in all the domains listed in Table 6.