Term Weighting and Similarity Measures
Terms
Terms are usually stems. Terms can also be phrases,
such as "Computer Science", "World Wide Web", etc.
Documents and queries are represented as vectors
or "bags of words" (BOW).
Each vector holds a place for every term in the
collection.
Position 1 corresponds to term 1, position 2 to term 2,
..., position n to term n.
Di = (wdi1, wdi2, ..., wdin)
Q = (wq1, wq2, ..., wqn)
W = 0 if a term is absent.
Documents are represented by binary weights or
non-binary weighted vectors of terms.
Term Weighting
Term weighting is the assignment of numerical values to
terms, representing how important each term is to a
document.
What does weighting mean?
Term weighting is a procedure that takes place during
indexing: each term is given a weight reflecting how well
it describes the document's content.
Document Collection
A collection of n documents can be represented in
the vector space model by a term-document matrix.
An entry in the matrix corresponds to the "weight" of
a term in the document; zero means the term has no
significance in the document or it simply doesn't exist
in the document.

      T1    T2    ...   Tt
D1    w11   w21   ...   wt1
D2    w12   w22   ...   wt2
:     :     :           :
Dn    w1n   w2n   ...   wtn
Binary Weights
Only the presence (1) or absence (0) of a term is
included in the vector.

docs   t1   t2   t3
D1     1    0    1
D2     1    0    0
D3     0    1    1
D4     1    0    0
D5     1    1    1
D6     1    1    0
D7     0    1    0
D8     0    1    0
D9     0    0    1
D10    0    1    1
D11    1    0    1

DF = document frequency
Document Frequency
Document frequency (df_t) counts the number of
documents in the document set in which term t is
present; a document is counted once if the term
occurs in it at all.
Terms that appear in many different documents are less
indicative of the overall topic.
For example, for a term that occurs 3 times in a
100-word document:
tf = f_t / (words in document) = 3/100 = 0.03
idf_t = log(N / df_t),
where N = 10,000,000 (total number of documents) and
df_t = 1,000:
idf_t = log(10,000,000 / 1,000) = 4
tf-idf = 0.03 * 4 = 0.12
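The computation above can be sketched in Python (base-10 logarithm, which matches the slide's result of idf = 4):

```python
import math

def tf_idf(term_count, doc_length, num_docs, doc_freq):
    """tf normalized by document length, idf in base 10, and their product."""
    tf = term_count / doc_length            # 3 / 100 = 0.03
    idf = math.log10(num_docs / doc_freq)   # log(10,000,000 / 1,000) = 4
    return tf, idf, tf * idf

tf, idf, w = tf_idf(3, 100, 10_000_000, 1_000)
print(tf, idf, w)  # 0.03 4.0 0.12
```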
Inverse Document Frequency
E.g.: given a collection of 1000 documents and the
document frequency of each word, compute the IDF for
each word as IDF = log2(N/DF):

Word    N     DF    IDF
the     1000  1000  0
some    1000  100   3.322
car     1000  10    6.644
merge   1000  1     9.966

IDF provides high values for rare words and low values
for common words.
IDF is an indication of a term's discrimination power.
Low-IDF terms are stop words.
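The IDF column can be reproduced with base-2 logarithms (a minimal sketch):

```python
import math

N = 1000  # total number of documents in the collection
df = {"the": 1000, "some": 100, "car": 10, "merge": 1}

idfs = {word: math.log2(N / d) for word, d in df.items()}
for word, idf in idfs.items():
    print(f"{word:6s} IDF = {idf:.3f}")
# the 0.000, some 3.322, car 6.644, merge 9.966
```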
TF, IDF, TF*IDF Results
A high tf*idf weight is reached by a high term frequency
(in the given document) and a low document frequency (in
the whole collection of documents);
the weights hence tend to filter out common terms.
TF*IDF Weighting
The most widely used term-weighting scheme is tf*idf:
w_ij = tf_ij * idf_j = tf_ij * log(N / df_j)
If a term occurs in every document of the collection, its
idf, and hence its tf*idf weight, is zero.
Computing TF-IDF: An Example
Assume the collection contains 10,000 documents, and
statistical analysis shows that the document frequencies
(DF) of three terms are: A(50), B(1300), C(250), and the
term frequencies (TF) of these terms are: A(3), B(2), C(1).
Compute TF*IDF for each term (TF is normalized by the
maximum TF in the document; IDF = log(N/DF)):
A: tf = 3/3 = 1.00; idf = log(10000/50) = 2.30103;
tf*idf = 2.30103 * 1.00 = 2.30
B: tf = 2/3 = 0.67; idf = log(10000/1300) = 0.88606;
tf*idf = 0.88606 * 0.67 = 0.59
C: tf = 1/3 = 0.33; idf = log(10000/250) = 1.60206;
tf*idf = 1.60206 * 0.33 = 0.53
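A quick check of the three computations in Python (base-10 log, matching the idf values above):

```python
import math

N = 10_000
terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}  # term: (raw tf, df)
max_tf = max(tf for tf, _ in terms.values())

weights = {}
for name, (tf, df) in terms.items():
    ntf = tf / max_tf                # tf normalized by the maximum tf
    idf = math.log10(N / df)
    weights[name] = ntf * idf
    print(f"{name}: tf = {ntf:.2f}, idf = {idf:.5f}, tf*idf = {weights[name]:.2f}")
```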
More Example
Consider a document containing 100 words wherein the
word cow appears 3 times. Now, assume we have 10
million documents and cow appears in 1,000 of these.
Exercise
Compute tf, normalized tf, idf, and tf-idf for:
Doc 1: Ben studies about computers in Computer Lab.
Compute the tf, idf, and tf-idf for:
Doc1: The sky is blue
Doc2: The sky is not blue
Exercises
Consider a document containing 100 words wherein the
Exercise (5%)
Let C = number of times a given word appears in a
document; TW = total number of words in a document;
TD = total number of documents in a corpus; and
DF = total number of documents containing a given word.
Compute TF = C/TW and IDF = log2(TD/DF) for each word
(the first row's TF and IDF cells are filled in as an
example):

Word      C (tf)  TW (n)  TD (N)  DF  TF    IDF        TF*IDF
airplane  5       46      3       1   5/46  log2(3/1)
blue      1       46      3       1
chair     7       46      3       3
computer  3       46      3       1
forest    2       46      3       1
justice   7       46      3       3
love      2       46      3       1
might     2       46      3       1
2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
Cosine Similarity
Cosine similarity is a metric that determines how alike
two objects are: a low similarity score means the objects
are not similar; a large value means they are.
It is often used to measure document similarity in text
analysis.
This similarity score ranges from 0 to 1, with 0 being the
lowest (the least similar) and 1 being the highest (the most
similar).
As the angle between the vectors increases, the cosine
decreases; greater similarity means a smaller angle:
cos 0° = 1
cos 30° = 0.866
cos 45° = 0.7071
cos 90° = 0
Example: Computing Cosine Similarity
Let us say we have a query vector Q = (0.4, 0.8) and a
document vector D1 = (0.2, 0.7). Compute their
similarity using the cosine measure:
cos(Q, D1) = (0.4*0.2 + 0.8*0.7) /
             (sqrt(0.4² + 0.8²) * sqrt(0.2² + 0.7²))
           = 0.64 / (sqrt(0.80) * sqrt(0.53)) ≈ 0.98
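The same computation as a small Python function:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

Q = (0.4, 0.8)
D1 = (0.2, 0.7)
print(round(cosine(Q, D1), 2))  # 0.98
```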
Example
Term weights per document:

Doc   WT1  WT2  TM1  TM2  SM1  SM2
A     3    0    5    0    0    0
B     2    3    0    3    4    0
C     5    4    4    0    5    0
D     5    5    2    0    5    4
Exercise:
D1: the best Italian restaurant enjoy the best pasta
Compute:
Sim(d1, d2)
Sim(d1, d3)
Cosine Similarity
D1 = {3, 2, 0, 5, 0, 0, 0, 2, 0, 0}
D2 = {1, 0, 0, 0, 0, 0, 0, 1, 0, 1}
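Cosine similarity of these two vectors, as a minimal sketch:

```python
import math

D1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
D2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1]

dot = sum(a * b for a, b in zip(D1, D2))    # 3*1 + 2*1 = 5 (positions 1 and 8)
norms = (math.sqrt(sum(a * a for a in D1)) *
         math.sqrt(sum(b * b for b in D2)))  # sqrt(42) * sqrt(3)
print(round(dot / norms, 3))  # 0.445
```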
Example
Given three documents D1, D2 and D3 with the
corresponding TF-IDF weights, which documents are
more similar?

Terms      D1     D2     D3
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254
Cos(D1, D2) = 0.996*0.993 + 0.087*0.120 + 0.017*0.000
            ≈ 0.999
Cos(D1, D3) = 0.996*0.847 + 0.087*0.466 + 0.017*0.254
            ≈ 0.888
So D1 and D2 are the more similar pair.
Advantages of TF-IDF
Easy to compute document similarity.
Keeps relevant words' scores high.
Lowers the scores of words that are merely frequent.
Drawbacks
Based only on terms/words.
Weak at capturing document topics.
Weak at handling synonyms (different words with the
same meaning).
Inner Product
Similarity between the vectors for document dj and
query q can be computed as the vector inner product:
sim(dj, q) = dj · q = Σ (i = 1..n) wij * wiq
where n is the number of unique terms, wij is the weight
of term i in document dj, and wiq is the weight of term i
in the query q.
Again, the issue of normalization arises.
Inner Product -- Examples
Binary weights (size of vector = size of vocabulary = 7):

     Retrieval  Database  Term  Computer  Text  Manage  Data
D    1          1         1     0         1     1       0
Q    1          0         1     0         0     1       1

sim(D, Q) = 3

Term weighted:

     Retrieval  Database  Architecture
D1   2          3         5
D2   3          7         1
Q    1          0         2
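Both examples can be checked with a plain dot product (the term-weighted similarities are not given on the slide; they follow directly from the formula):

```python
def inner(d, q):
    """Inner (dot) product of two equally sized weight vectors."""
    return sum(a * b for a, b in zip(d, q))

# Binary example (Retrieval, Database, Term, Computer, Text, Manage, Data)
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner(D, Q))  # 3

# Term-weighted example (Retrieval, Database, Architecture)
D1, D2, Qw = [2, 3, 5], [3, 7, 1], [1, 0, 2]
print(inner(D1, Qw), inner(D2, Qw))  # 12 5
```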
Inner Product: Example 1

(Figure: Venn diagram of documents d1-d7 over the
index terms k1, k2 and k3.)

     k1  k2  k3  q · dj
d1   1   0   1   2
d2   1   0   0   1
d3   0   1   1   2
d4   1   0   0   1
d5   1   1   1   3
d6   1   1   0   2
d7   0   1   0   1
q    1   1   1
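The q · dj column of the table can be reproduced as:

```python
def inner(d, q):
    return sum(a * b for a, b in zip(d, q))

docs = {
    "d1": [1, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 1], "d4": [1, 0, 0],
    "d5": [1, 1, 1], "d6": [1, 1, 0], "d7": [0, 1, 0],
}
q = [1, 1, 1]
scores = [inner(d, q) for d in docs.values()]
print(scores)  # [2, 1, 2, 1, 3, 2, 1]
```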
Inner Product: Exercise

(Figure: the same Venn diagram of documents d1-d7
over the index terms k1, k2 and k3.)

     k1  k2  k3  q · dj
d1   1   0   1   ?
d2   1   0   0   ?
d3   0   1   1   ?
d4   1   0   0   ?
d5   1   1   1   ?
d6   1   1   0   ?
d7   0   1   0   ?
q    1   2   3
Euclidean Distance
Similarity between the vectors for document dj and
query q can also be measured by their Euclidean
distance (a smaller distance means greater similarity):
dist(dj, q) = |dj - q| = sqrt( Σ (i = 1..n) (wij - wiq)² )
For example, with dj = (0, 3, 2, 1, 10) and q = (2, 7, 1, 0, 0):
sqrt((0-2)² + (3-7)² + (2-1)² + (1-0)² + (10-0)²)
= sqrt(122) ≈ 11.05
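The worked distance as code:

```python
import math

def euclidean(d, q):
    """Euclidean distance between two equally sized weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d, q)))

d = [0, 3, 2, 1, 10]
q = [2, 7, 1, 0, 0]
print(round(euclidean(d, q), 2))  # sqrt(122) -> 11.05
```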
Exercises
A database collection consists of 1 million documents,
of which 200,000 contain the term holiday while 250,000
contain the term season. A document repeats holiday 7
times and season 5 times. It is known that holiday is
repeated more than any other term in the document.
Calculate the weight of both terms in this document
using different term-weighting methods:
(i) normalized and unnormalized TF;
(ii) TF*IDF based on normalized and unnormalized TF.
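One way to check your answers (a sketch; base-10 logarithms and normalization by the document's maximum term frequency are assumptions, since the exercise does not fix them):

```python
import math

N = 1_000_000
df = {"holiday": 200_000, "season": 250_000}
tf_raw = {"holiday": 7, "season": 5}
max_tf = max(tf_raw.values())  # holiday is the most frequent term

weights = {}
for term in df:
    ntf = tf_raw[term] / max_tf        # normalized TF
    idf = math.log10(N / df[term])
    weights[term] = {
        "tf": tf_raw[term],
        "ntf": round(ntf, 3),
        "tf*idf": round(tf_raw[term] * idf, 3),
        "ntf*idf": round(ntf * idf, 3),
    }
    print(term, weights[term])
```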
Questions?