An Information-Theoretic Definition of Similarity
Dekang Lin
Department of Computer Science
University of Manitoba
Winnipeg, Manitoba, Canada R3T 2N2
Intuition 3: The maximum similarity between A and B is reached when A and B are identical, no matter how much commonality they share.

Our goal is to arrive at a definition of similarity that captures the above intuitions. However, there are many alternative ways to define similarity that would be consistent with the intuitions. In this section, we first make a set of additional assumptions about similarity that we believe to be reasonable. A similarity measure can then be derived from those assumptions.

In order to capture the intuition that the similarity of two objects is related to their commonality, we need a measure of commonality. Our first assumption is:

Assumption 1: The commonality between A and B is measured by

  I(common(A, B))

where common(A, B) is a proposition that states the commonalities between A and B; I(s) is the amount of information contained in a proposition s.

For example, suppose A is an orange and B is an apple. The proposition that states the commonality between A and B is "fruit(A) and fruit(B)". In information theory [Cover and Thomas, 1991], the information contained in a statement is measured by the negative logarithm of the probability of the statement. Therefore,

  I(common(A, B)) = -log P(fruit(A) and fruit(B))

We also need a measure of the differences between two objects. Since knowing both the commonalities and the differences between A and B means knowing what A and B are, we assume:

Assumption 2: The differences between A and B are measured by

  I(description(A, B)) - I(common(A, B))

where description(A, B) is a proposition that describes what A and B are.

Intuitions 1 and 2 state that the similarity between two objects is related to their commonalities and differences. We assume that commonalities and differences are the only factors.

Assumption 3: The similarity between A and B, sim(A, B), is a function of their commonalities and differences; that is, sim(A, B) = f(I(common(A, B)), I(description(A, B))).

Assumption 4: The similarity between a pair of identical objects is 1.

When A and B are identical, knowing their commonalities means knowing what they are, i.e., I(common(A, B)) = I(description(A, B)). Therefore, the function f must have the property: for all x > 0, f(x, x) = 1.

When there is no commonality between A and B, we assume their similarity is 0, no matter how different they are. For example, the similarity between "depth-first search" and "leather sofa" is neither higher nor lower than the similarity between "rectangle" and "interest rate".

Assumption 5: For all y > 0, f(0, y) = 0.

Suppose two objects A and B can be viewed from two independent perspectives. Their similarity can be computed separately from each perspective. For example, the similarity between two documents can be calculated by comparing the sets of words in the documents or by comparing their stylistic parameter values, such as average word length, average sentence length, average number of verbs per sentence, etc. We assume that the overall similarity of the two documents is a weighted average of their similarities computed from the different perspectives. The weights are the amounts of information in the descriptions. In other words, we make the following assumption:

Assumption 6: For all x1 ≤ y1 and x2 ≤ y2,

  f(x1 + x2, y1 + y2) = (y1 / (y1 + y2)) f(x1, y1) + (y2 / (y1 + y2)) f(x2, y2)

From the above assumptions, we can prove the following theorem:

Similarity Theorem: The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are:

  sim(A, B) = log P(common(A, B)) / log P(description(A, B))

Proof:

  f(x, y) = f(x + 0, x + (y - x))
          = (x / y) f(x, x) + ((y - x) / y) f(0, y - x)    (Assumption 6)
          = (x / y) × 1 + ((y - x) / y) × 0                (Assumptions 4 and 5)
          = x / y

Q.E.D.

Since similarity is the ratio between the amount of information in the commonality and the amount of information in the description of the two objects, if we know the commonality of the two objects, their similarity tells us how much more information is needed to determine what these two objects are.
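The theorem can be read directly as a computation: given the probability of the commonality and of the full description, the similarity is the ratio of their log-probabilities. A minimal sketch (the function names and the toy probabilities are illustrative, not taken from the paper):

```python
import math

def information(p: float) -> float:
    """Amount of information (in nats) of a proposition with probability p."""
    return -math.log(p)

def similarity(p_common: float, p_description: float) -> float:
    """sim(A, B) = log P(common(A, B)) / log P(description(A, B)).

    Equivalently I(common) / I(description); assumes
    0 < p_description <= p_common < 1.
    """
    return math.log(p_common) / math.log(p_description)

# Toy illustration of the orange/apple example: if P(fruit(A) and fruit(B)) = 0.25
# and the full description of A and B has probability 0.001 (both numbers made up),
# then sim(A, B) ≈ 0.20.
print(similarity(0.25, 0.001))
```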
In the next 4 sections, we demonstrate how the above definition can be applied in different domains.
3 Similarity between Ordinal Values

Many features have ordinal values. For example, the "quality" attribute can take one of the following values: "excellent", "good", "average", "bad", or "awful". None of the previous definitions of similarity provides a measure for the similarity between two ordinal values. We now show how our definition can be applied here.

If "the quality of X is excellent" and "the quality of Y is average", the maximally specific statement that can be said of both X and Y is that "the quality of X and Y are between 'average' and 'excellent'". Therefore, the commonality between two ordinal values is the interval delimited by them.

Suppose the distribution of the "quality" attribute is known (Figure 1). The following are four examples of similarity calculations:

Figure 1: Example Distribution of Ordinal Values (probability: excellent 5%, good 10%, average 50%, bad 20%, awful 15%)

  sim(excellent, good) = 2 log P(excellent ∨ good) / (log P(excellent) + log P(good))
                       = 2 log(0.05 + 0.10) / (log 0.05 + log 0.10) = 0.72

  sim(good, average) = 2 log P(good ∨ average) / (log P(good) + log P(average))
                     = 2 log(0.10 + 0.50) / (log 0.10 + log 0.50) = 0.34

  sim(excellent, average) = 2 log P(excellent ∨ good ∨ average) / (log P(excellent) + log P(average))
                          = 2 log(0.05 + 0.10 + 0.50) / (log 0.05 + log 0.50) = 0.23

  sim(good, bad) = 2 log P(good ∨ average ∨ bad) / (log P(good) + log P(bad))
                 = 2 log(0.10 + 0.50 + 0.20) / (log 0.10 + log 0.20) = 0.11

The results show that, given the probability distribution in Figure 1, the similarity between "excellent" and "good" is much higher than the similarity between "good" and "average"; the similarity between "excellent" and "average" is much higher than the similarity between "good" and "bad".
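The four numbers above can be reproduced directly from the Figure 1 distribution; the commonality of two ordinal values is the interval delimited by them. A short sketch (the function and variable names are ours):

```python
import math

# Ordered values and their probabilities (Figure 1).
DIST = {"excellent": 0.05, "good": 0.10, "average": 0.50, "bad": 0.20, "awful": 0.15}
ORDER = ["excellent", "good", "average", "bad", "awful"]

def sim_ordinal(a: str, b: str) -> float:
    """sim(a, b) = 2 log P(interval from a to b) / (log P(a) + log P(b))."""
    i, j = sorted((ORDER.index(a), ORDER.index(b)))
    p_interval = sum(DIST[v] for v in ORDER[i:j + 1])
    return 2 * math.log(p_interval) / (math.log(DIST[a]) + math.log(DIST[b]))

for pair in [("excellent", "good"), ("good", "average"),
             ("excellent", "average"), ("good", "bad")]:
    print(pair, round(sim_ordinal(*pair), 2))   # 0.72, 0.34, 0.23, 0.11
```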
4 Feature Vectors

Feature vectors are one of the simplest and most commonly used forms of knowledge representation, especially in case-based reasoning [Aha et al., 1991; Stanfill and Waltz, 1986] and machine learning. Weights are often assigned to features to account for the fact that the dissimilarity caused by more important features is greater than the dissimilarity caused by less important features. The assignment of the weight parameters is generally heuristic in nature in previous approaches. Our definition of similarity provides a more principled approach, as demonstrated in the following case study.

4.1 String Similarity—A case study

Consider the task of retrieving from a word list the words that are derived from the same root as a given word. For example, given the word "eloquently", our objective is to retrieve the other related words, such as "ineloquent", "ineloquently", "eloquent", and "eloquence". To do so, assuming that a morphological analyzer is not available, one can define a similarity measure between two strings and rank the words in the word list in descending order of their similarity to the given word. The similarity measure should be such that words derived from the same root as the given word appear early in the ranking.

We experimented with three similarity measures. The first one is defined as follows:

  sim_edit(x, y) = 1 / (1 + editDist(x, y))

where editDist(x, y) is the minimum number of character insertion and deletion operations needed to transform one string into the other.

The second similarity measure is based on the number of different trigrams in the two strings:

  sim_tri(x, y) = 1 / (1 + |tri(x)| + |tri(y)| - 2 |tri(x) ∩ tri(y)|)

where tri(x) is the set of trigrams in x. For example, tri(eloquent) = {elo, loq, oqu, que, ent}.
Table 1: Top-10 Most Similar Words to "grandiloquent"

Rank  sim_edit              sim_tri                sim
 1    grandiloquently 1/3   grandiloquently 1/2    grandiloquently 0.92
 2    grandiloquence  1/4   grandiloquence  1/4    grandiloquence  0.89
 3    magniloquent    1/6   eloquent        1/8    eloquent        0.61
 4    gradient        1/6   grand           1/9    magniloquent    0.59
 5    grandaunt       1/7   grande          1/10   ineloquent      0.55
 6    gradients       1/7   rand            1/10   eloquently      0.55
 7    grandiose       1/7   magniloquent    1/10   ineloquently    0.50
 8    diluent         1/7   ineloquent      1/10   magniloquence   0.50
 9    ineloquent      1/8   grands          1/10   eloquence       0.50
10    grandson        1/8   eloquently      1/10   ventriloquy     0.42
The third similarity measure is based on our proposed definition of similarity, under the assumption that the probability of a trigram occurring in a word is independent of the other trigrams in the word. The word list used in this experiment was obtained from the AI Repository at http://www.cs.cmu.edu/afs/cs/project/ai-repository. For a given word, the words in the list are ranked in descending order of their similarity to it; the quality of the resulting sequence w1, ..., wn can be measured by the 11-point average of its precisions at recall levels 0%, 10%, 20%, ..., and 100%. The average precision values are then averaged over all the words in Wroot.

5 Word Similarity

A dependency triple consists of a head, a dependency relationship, and a modifier, for example (dog det a), where "subj" is the relationship between a verb and its subject; "obj" is the relationship between a verb and its object; "adj-mod" is the relationship between a noun and its adjective modifier; and "det" is the relationship between a noun and its determiner.

We can view dependency triples extracted from a corpus as features of the heads and modifiers in the triples. Suppose (avert obj duty) is a dependency triple; we say that "duty" has the feature obj-of(avert) and "avert" has the feature obj(duty). Other words that also possess the feature obj-of(avert) include "default", "crisis", "eye", "panic", "strike", "war", etc., which are also used as objects of "avert" in the corpus.

Let F(w) be the set of features possessed by w. F(w) can be viewed as a description of the word w. The commonalities between two words w1 and w2 are then F(w1) ∩ F(w2). The similarity between two words is defined as follows:

  (2)  sim(w1, w2) = 2 I(F(w1) ∩ F(w2)) / (I(F(w1)) + I(F(w2)))

where I(S) is the amount of information contained in a set of features S. Assuming that features are independent of one another, I(S) = -Σ_{f∈S} log P(f), where P(f) is the probability of feature f. When two words have identical sets of features, their similarity reaches the maximum value of 1. The minimum similarity 0 is reached when two words do not have any common feature.

The probability P(f) can be estimated by the percentage of words that have feature f among the set of words that have the same part of speech. For example, there are 32868 unique nouns in a corpus, 1405 of which were used as subjects of "include". The probability of subj-of(include) is therefore 1405/32868. The probability of the feature adj-mod(fiduciary) is 14/32868 because only 14 (unique) nouns were modified by "fiduciary". The amount of information in the feature adj-mod(fiduciary), 7.76, is greater than the amount of information in subj-of(include), 3.15. This agrees with our intuition that saying that a word can be modified by "fiduciary" is more informative than saying that the word can be the subject of "include".

The fourth column in Table 3 shows the amount of information contained in each feature. If the features in Table 3 were all the features of "duty" and "sanction", the similarity between "duty" and "sanction" would be:

  2 I({f1, f3, f5, f7}) / (I({f1, f2, f3, f5, f6, f7}) + I({f1, f3, f4, f5, f7, f8}))

The following is the list of similar words of "duty" generated by our program:

(3) responsibility, position, sanction, tariff, obligation, fee, post, job, role, tax, penalty, condition, function, assignment, power, expense, task, deadline, training, work, standard, ban, restriction, authority, commitment, award, liability, requirement, staff, membership, limit, pledge, right, chore, mission, care, title, capability, patrol, fine, faith, seat, levy, violation, load, salary, attitude, bonus, schedule, instruction, rank, purpose, personnel, worth, jurisdiction, presidency, exercise.

The following is the entry for "duty" in the Random House Thesaurus [Stein and Flexner, 1984]:

(4) duty n. 1. obligation, responsibility; onus; business, province; 2. function, task, assignment, charge. 3. tax, tariff, customs, excise, levy.

Eight of the words in (4) — obligation, responsibility, function, task, assignment, tax, tariff, and levy — also appear in (3). It can be seen that our program captured all three senses of "duty" in [Stein and Flexner, 1984].

Two words are a pair of respective nearest neighbors (RNNs) if each is the other's most similar word. Our program found 622 pairs of RNNs among the 5230 nouns that occurred at least 50 times in the parsed corpus. Table 4 shows every 10th RNN.
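A compact sketch of formula (2) over dependency-triple features. The corpus counts below are limited to the two figures quoted above (1405 subjects of "include" and 14 nouns modified by "fiduciary", out of 32868 unique nouns); the feature sets passed to sim_word are invented purely for illustration:

```python
import math

TOTAL_NOUNS = 32868  # unique nouns in the corpus (from the example above)
FEATURE_COUNTS = {   # how many unique nouns carry each feature
    "subj-of(include)": 1405,
    "adj-mod(fiduciary)": 14,
}

def info(feature: str) -> float:
    """I(f) = -log P(f); natural logarithms reproduce the values quoted above."""
    return -math.log(FEATURE_COUNTS[feature] / TOTAL_NOUNS)

def info_set(features) -> float:
    """I(S) = sum of I(f) over f in S (features assumed independent)."""
    return sum(info(f) for f in features)

def sim_word(f1: set, f2: set) -> float:
    """Formula (2): sim(w1, w2) = 2 I(F1 ∩ F2) / (I(F1) + I(F2))."""
    return 2 * info_set(f1 & f2) / (info_set(f1) + info_set(f2))

print(round(info("subj-of(include)"), 2))    # 3.15
print(round(info("adj-mod(fiduciary)"), 2))  # 7.76

# Hypothetical feature sets for two words, just to exercise sim_word:
F_a = {"subj-of(include)", "adj-mod(fiduciary)"}
F_b = {"subj-of(include)"}
print(round(sim_word(F_a, F_b), 2))
```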
Table 4: Respective Nearest Neighbors

Rank  RNN                          Sim
   1  earnings profit              0.50
  11  revenue sale                 0.39
  21  acquisition merger           0.34
  31  attorney lawyer              0.32
  41  data information             0.30
  51  amount number                0.27
  61  downturn slump               0.26
  71  there way                    0.24
  81  fear worry                   0.23
  91  jacket shirt                 0.22
 101  film movie                   0.21
 111  felony misdemeanor           0.21
 121  importance significance      0.20
 131  reaction response            0.19
 141  heroin marijuana             0.19
 151  championship tournament      0.18
 161  consequence implication      0.18
 171  rape robbery                 0.17
 181  dinner lunch                 0.17
 191  turmoil upheaval             0.17
 201  biggest largest              0.17
 211  blaze fire                   0.16
 221  captive westerner            0.16
 231  imprisonment probation       0.16
 241  apparel clothing             0.15
 251  comment elaboration          0.15
 261  disadvantage drawback        0.15
 271  infringement negligence      0.15
 281  angler fishermen             0.14
 291  emission pollution           0.14
 301  granite marble               0.14
 311  gourmet vegetarian           0.14
 321  publicist stockbroker        0.14
 331  maternity outpatient         0.13
 341  artillery warplanes          0.13
 351  psychiatrist psychologist    0.13
 361  blunder fiasco               0.13
 371  door window                  0.13
 381  counseling therapy           0.12
 391  austerity stimulus           0.12
 401  ours yours                   0.12
 411  procurement zoning           0.12
 421  neither none                 0.12
 431  briefcase wallet             0.11
 441  audition rite                0.11
 451  nylon silk                   0.11
 461  columnist commentator        0.11
 471  avalanche raft               0.11
 481  herb olive                   0.11
 491  distance length              0.10
 501  interruption pause           0.10
 511  ocean sea                    0.10
 521  flying watching              0.10
 531  ladder spectrum              0.09
 541  lotto poker                  0.09
 551  camping skiing               0.09
 561  lip mouth                    0.09

Some of the pairs may look peculiar. Detailed examination actually reveals that they are quite reasonable. For example, the pair ranked 221st is "captive" and "westerner". It is very unlikely that any manually created thesaurus will list them as near-synonyms. We manually examined all 274 occurrences of "westerner" in the corpus and found that 55% of them refer to westerners in captivity. Some of the bad RNNs, such as (avalanche, raft) and (audition, rite), were due to their relatively low frequencies, which make them susceptible to accidental commonalities, such as:

(5) The {avalanche, raft} {drifted, hit} ....
    To {hold, attend} the {audition, rite}.
    An uninhibited {audition, rite}.

6 Semantic Similarity in a Taxonomy

Semantic similarity [Resnik, 1995b] refers to the similarity between two concepts in a taxonomy such as WordNet [Miller, 1990] or the CYC upper ontology. The semantic similarity between two classes C and C′ is not about the classes themselves. When we say "rivers and ditches are similar", we are not comparing the set of rivers with the set of ditches. Instead, we are comparing a generic river and a generic ditch. Therefore, we define sim(C, C′) to be the similarity between x and x′ if all we know about x and x′ is that x ∈ C and x′ ∈ C′.

The two statements "x ∈ C" and "x′ ∈ C′" are independent (instead of being assumed to be independent) because the selection of a generic C is not related to the selection of a generic C′. The amount of information contained in "x ∈ C and x′ ∈ C′" is

  -log P(C) - log P(C′)

where P(C) and P(C′) are the probabilities that a randomly selected object belongs to C and C′, respectively.

Assuming that the taxonomy is a tree, if x1 ∈ C1 and x2 ∈ C2, the commonality between x1 and x2 is "x1 ∈ C0 ∧ x2 ∈ C0", where C0 is the most specific class that subsumes both C1 and C2. Therefore,

  sim(x1, x2) = 2 log P(C0) / (log P(C1) + log P(C2))
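A small sketch of this measure over a toy IS-A tree, where P(C) is the fraction of objects that fall under each class; the tree and the probabilities below are invented for illustration:

```python
import math

# Toy IS-A tree: child -> parent, plus a probability for each class
# (fraction of objects that belong to the class; the root has probability 1).
PARENT = {"hill": "geological-formation", "coast": "geological-formation",
          "geological-formation": "entity"}
P = {"entity": 1.0, "geological-formation": 0.1, "hill": 0.02, "coast": 0.03}

def ancestors(c: str) -> list:
    """c together with all of its superclasses, most specific first."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def most_specific_common(c1: str, c2: str) -> str:
    a2 = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in a2)

def sim_taxonomy(c1: str, c2: str) -> float:
    """sim(x1, x2) = 2 log P(C0) / (log P(C1) + log P(C2))."""
    c0 = most_specific_common(c1, c2)
    return 2 * math.log(P[c0]) / (math.log(P[c1]) + math.log(P[c2]))

print(round(sim_taxonomy("hill", "coast"), 2))  # ≈ 0.62 with these made-up probabilities
```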
The similarity measure proposed by Wu and Palmer [Wu and Palmer, 1994] is defined as

  sim_Wu&Palmer(A, B) = 2 N3 / (N1 + N2 + 2 N3)

where N1 and N2 are the numbers of IS-A links from A and B to their most specific common superclass C, and N3 is the number of IS-A links from C to the root of the taxonomy. For example, the most specific common superclass of Hill and Coast is Geological-Formation. Thus N1 = 2, N2 = 2, N3 = 3, and sim_Wu&Palmer(Hill, Coast) = 0.6.
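The link counting can be checked directly. A sketch on a toy tree that reproduces the Hill/Coast example; the intermediate class names are illustrative:

```python
PARENT = {  # toy IS-A chain; intermediate class names are illustrative
    "hill": "natural-elevation", "natural-elevation": "geological-formation",
    "coast": "shore", "shore": "geological-formation",
    "geological-formation": "natural-object", "natural-object": "object",
    "object": "entity",  # "entity" is the root
}

def path_up(c):
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def sim_wu_palmer(a: str, b: str) -> float:
    """2 N3 / (N1 + N2 + 2 N3) with N1, N2, N3 counted as described above."""
    up_a, up_b = path_up(a), path_up(b)
    c = next(x for x in up_a if x in set(up_b))   # most specific common superclass
    n1, n2 = up_a.index(c), up_b.index(c)         # links from A and from B up to C
    n3 = len(path_up(c)) - 1                      # links from C up to the root
    return 2 * n3 / (n1 + n2 + 2 * n3)

print(sim_wu_palmer("hill", "coast"))  # 0.6, matching the example above
```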
Interestingly, if P(C|C′) is the same for all pairs of concepts such that there is an IS-A link from C to C′ in the taxonomy, sim_Wu&Palmer(A, B) coincides with sim(A, B).
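A numeric illustration of this claim on the toy tree above: if every IS-A link scales the probability by the same factor q (i.e., P(C | parent of C) = q for all classes), the information-theoretic measure reduces to the link-counting one. The value of q is arbitrary:

```python
import math

q = 0.1  # assumed constant P(C | parent of C); any value in (0, 1) works

# Depth = number of IS-A links from the class to the root, as in the example above.
DEPTH = {"geological-formation": 3, "hill": 5, "coast": 5}
P = {c: q ** d for c, d in DEPTH.items()}

sim_info = 2 * math.log(P["geological-formation"]) / (math.log(P["hill"]) + math.log(P["coast"]))
sim_wp = 2 * 3 / (2 + 2 + 2 * 3)  # 2*N3 / (N1 + N2 + 2*N3) with N1 = N2 = 2, N3 = 3
print(round(sim_info, 6), sim_wp)  # both 0.6
```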
Resnik [Resnik, 1995a] evaluated three different similarity measures by correlating their similarity scores on 28 pairs of concepts in WordNet with assessments made by human subjects [Miller and Charles, 1991]. We adopted the same setup and compared sim_Resnik, sim_Wu&Palmer, and sim. Table 5 shows the similarities between 28 pairs of concepts, using the three different similarity measures. The column Miller&Charles lists the average similarity scores (on a scale of 0 to 4) assigned by human subjects in Miller&Charles's experiments [Miller and Charles, 1991]. Our definition of similarity yielded slightly higher correlation with human judgments than the other two measures.
7 Comparison between Different Similarity Measures

One of the most commonly used similarity measures is called the Dice coefficient. Suppose two objects can be described with two numerical vectors (a1, a2, ..., an) and (b1, b2, ..., bn); their Dice coefficient is defined as

  sim_dice(A, B) = 2 Σ_{i=1..n} a_i b_i / (Σ_{i=1..n} a_i² + Σ_{i=1..n} b_i²)

Another class of similarity measures is based on a distance metric. Suppose dist(A, B) is a distance metric between two objects; sim_dist can be defined as follows:

  sim_dist(A, B) = 1 / (1 + dist(A, B))

Table 6: Comparison between Similarity Measures
Similarity measures: WP = sim_Wu&Palmer; R = sim_Resnik; Dice = sim_dice.

Property                    sim   WP    R     Dice  sim_dist
increase with commonality   yes   yes   yes   yes   no
decrease with difference    yes   yes   no    yes   yes
triangle inequality         no    no    no    no    yes
Assumption 6                yes   yes   no    yes   no
maximum value = 1           yes   yes   no    yes   yes
semantic similarity         yes   yes   yes   no    yes
word similarity             yes   no    no    yes   yes
ordinal values              yes   no    no    no    no

Triangle Inequality: In Figure 3, A and B are similar in their shades, B and C are similar in their shape, but A and C are not similar.

Figure 3: Counter-example of Triangle Inequality

Assumption 6: The strongest assumption that we made in Section 2 is Assumption 6. However, this assumption is not unique to our proposal. Both sim_Wu&Palmer and sim_dice also satisfy Assumption 6. Suppose two objects A and B are represented by two feature vectors (a1, a2, ..., an) and (b1, b2, ..., bn), respectively. Without loss of generality, suppose the first k features and the remaining n - k features represent two independent perspectives of the objects. Then

  sim_dice(A, B) = 2 Σ_{i=1..n} a_i b_i / (Σ_{i=1..n} a_i² + Σ_{i=1..n} b_i²)
    = [(Σ_{i=1..k} a_i² + Σ_{i=1..k} b_i²) / (Σ_{i=1..n} a_i² + Σ_{i=1..n} b_i²)] × [2 Σ_{i=1..k} a_i b_i / (Σ_{i=1..k} a_i² + Σ_{i=1..k} b_i²)]
    + [(Σ_{i=k+1..n} a_i² + Σ_{i=k+1..n} b_i²) / (Σ_{i=1..n} a_i² + Σ_{i=1..n} b_i²)] × [2 Σ_{i=k+1..n} a_i b_i / (Σ_{i=k+1..n} a_i² + Σ_{i=k+1..n} b_i²)]

which is a weighted average of the similarity between A and B in each of the two perspectives.
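A quick numeric check of this decomposition; the vectors and the split point k below are arbitrary:

```python
def dice(a, b):
    """sim_dice over two numerical vectors."""
    return 2 * sum(x * y for x, y in zip(a, b)) / (sum(x * x for x in a) + sum(y * y for y in b))

# Arbitrary feature vectors, split into two "perspectives" at k = 2.
a, b, k = [1.0, 2.0, 0.5, 3.0], [2.0, 1.0, 1.0, 2.0], 2

def weight(a_part, b_part):
    """Share of the overall denominator contributed by one perspective."""
    return (sum(x * x for x in a_part) + sum(y * y for y in b_part)) / \
           (sum(x * x for x in a) + sum(y * y for y in b))

lhs = dice(a, b)
rhs = weight(a[:k], b[:k]) * dice(a[:k], b[:k]) + weight(a[k:], b[k:]) * dice(a[k:], b[k:])
print(round(lhs, 6), round(rhs, 6))  # identical: the weighted average equals sim_dice
```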
Maximum Similarity Values: With most similarity measures the maximum similarity is 1; the exception is sim_Resnik, which has no upper bound on its similarity values.

Application Domains: The similarity measure proposed in this paper can be applied in all the domains listed in Table 6.