
Term Weighting and Similarity Measures


Term weighting and similarity measures

1
Terms
Terms are usually stems. Terms can also be phrases, such as "Computer Science", "World Wide Web", etc.
Documents and queries are represented as vectors or "bags of words" (BOW).
Each vector holds a place for every term in the collection: position 1 corresponds to term 1, position 2 to term 2, ..., position n to term n.
Di = (wdi1, wdi2, ..., wdin)
Q = (wq1, wq2, ..., wqn)
w = 0 if a term is absent.
Documents are represented by binary weights or non-binary weighted vectors of terms.
2
Term weighting
Term weighting is the assignment of numerical values to terms that represent their importance in a document, in order to improve retrieval effectiveness.
Weighting the terms is what enables the retrieval system to determine the importance of a given term in a certain document or query.

What is term frequency and weighting?
The simplest approach is to assign the weight to be equal to the number of occurrences of the term in the document.

3
What does weighting mean?
Term weighting is a procedure that takes place during the text indexing process in order to assess the value of each term to the document.
Weighting the terms is what enables the retrieval system to determine the importance of a given term in a certain document or a query.


4
Term weighting improves the quality of the answer set.
Term weighting enables ranking of retrieved documents, so that the best-matching documents are ordered at the top because they are more relevant than the others.

5
Document Collection
A collection of n documents can be represented in
the vector space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of
a term in the document; zero means the term has no
significance in the document or it simply doesn’t exist
in the document.

T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn

6
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.

docs t1 t2 t3
D1   1  0  1
D2   1  0  0
D3   0  1  1
D4   1  0  0
D5   1  1  1
D6   1  1  0
D7   0  1  0
D8   0  1  0
D9   0  0  1
D10  0  1  1
D11  1  0  1

• Binary Weights Formula:
w_ij = 1 if freq_ij > 0
w_ij = 0 if freq_ij = 0
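
To make this concrete, here is a minimal Python sketch (not from the slides; the toy corpus is invented for illustration) that builds a binary term-document matrix using the rule w_ij = 1 if freq_ij > 0, else 0.

```python
# Minimal sketch: binary term-document matrix for a tiny hypothetical corpus.
docs = {
    "D1": "information retrieval system",
    "D2": "database system",
    "D3": "information database",
}

# Vocabulary: one vector position per term in the collection.
vocab = sorted({term for text in docs.values() for term in text.split()})

# Binary weights: 1 if the term occurs in the document, 0 otherwise.
matrix = {
    doc_id: [1 if term in text.split() else 0 for term in vocab]
    for doc_id, text in docs.items()
}

print(vocab)
for doc_id, row in matrix.items():
    print(doc_id, row)
```

Each row is that document's vector over the whole vocabulary, exactly as in the term-document matrix above.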

Term Weighting: Term Frequency (TF)
TF (term frequency): count the number of times a term occurs in a document; it measures how frequently a term occurs in the document.
Terms will appear many more times in longer documents than in shorter ones.
fij = frequency of term i in document j
The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of the topic.
However, raw counts give too much credit to words that simply appear more frequently, so we may want to normalize term frequency (tf) across the entire corpus.

Raw term counts:
docs t1 t2 t3
D1   2  0  3
D2   1  0  0
D3   0  4  7
D4   3  0  0
D5   1  6  3
D6   3  5  0
D7   0  8  0
D8   0 10  0
D9   0  0  1
D10  0  3  5
D11  4  0  1
Term frequency (normalized)
tf_ij = f_ij / ∑ f_ij (sum over all terms in document j)

tf(t) = number of times t appears in the document / total number of terms in the document.
If the term does not exist in a particular document, its tf value is 0 for that document.
In the extreme case where all words in the document are the same, tf is 1.
The final normalized tf value therefore lies in the range [0, 1].
10
Examples
Doc1: good boy
Doc2: good girl
Doc3: boy girl good

Word frequency (whole collection):
term  frequency
good  3
boy   2
girl  2

Normalized tf:
terms  doc1  doc2  doc3
good   1/2   1/2   1/3
boy    1/2   0     1/3
girl   0     1/2   1/3
11
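
A small Python sketch of this normalized tf computation, using the same three toy documents; tf is the count of the term divided by the total number of terms in the document.

```python
from collections import Counter

docs = {
    "doc1": "good boy",
    "doc2": "good girl",
    "doc3": "boy girl good",
}

def tf(document):
    """Normalized term frequency: count of term / total terms in the document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

for doc_id, text in docs.items():
    print(doc_id, tf(text))
# doc1: good 0.5, boy 0.5 | doc2: good 0.5, girl 0.5 | doc3: 1/3 each
```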
Document Frequency
Document frequency (DF) is defined as the number of documents in the collection that contain a term t, i.e. how many documents in the collection contain the term t.
Count the frequency considering the whole collection of documents.
The less frequently a term appears in the whole collection, the more discriminating it is.

df_i = document frequency of term i = number of documents containing term i

12
Document frequency
Counts the occurrence of term t across the document set: the number of documents in which the word is present.
We count an occurrence if the term is present in the document at least once; we do not need to know the number of times the term is present.
13
14
Inverse Document Frequency (IDF)
IDF measures the rarity of a term in the collection.
IDF is a measure of the general importance of a term: it diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.
Gives full weight to terms that occur in one document only.
Gives lowest weight to terms that occur in all documents.
Terms that appear in many different documents are less indicative of the overall topic.
Used to filter out common terms.
With a large collection, e.g. N = 1,000,000 documents, the raw ratio N/df explodes, so to dampen the effect we take the log.

idf_i = inverse document frequency of term i = log(N / df_i)   (N: total number of documents)


15
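
A minimal sketch of the idf formula above. Log base 10 is assumed here (the worked examples later in the deck use base 10, although a log2 variant also appears), and df = 0 is guarded against for terms that never occur.

```python
import math

def idf(term, documents):
    """idf_i = log(N / df_i): N = number of documents, df_i = documents containing term i."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc.lower().split())
    return math.log10(n / df) if df else 0.0

corpus = ["good boy", "good girl", "boy girl good"]
print(idf("good", corpus))  # log10(3/3) = 0.0
print(idf("boy", corpus))   # log10(3/2) ~= 0.176
```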
Example:
Consider a document containing 100 words in which the word cat appears 3 times, and a collection of N = 10,000,000 documents in which cat appears in df_t = 1,000 documents.
tf = 3/100 = 0.03
idf_t = log(N / df_t) = log(10,000,000 / 1,000) = 4
Tf-idf = 0.03 * 4 = 0.12

16
Inverse Document Frequency
• E.g.: given a collection of 1000 documents and the document frequency of each word, compute the IDF for each word (using log2):

Word   N     DF    IDF
the    1000  1000  0
some   1000  100   3.322
car    1000  10    6.644
merge  1000  1     9.966

• IDF provides high values for rare words and low values for common words.
• IDF is an indication of a term's discrimination power.
• Terms with low IDF are typically stop words.
17
18
Tf, idf, tf*idf results

terms  tf (doc1)  tf (doc2)  tf (doc3)   idf               tf*idf (doc1)      tf*idf (doc2)      tf*idf (doc3)
good   1/2 = 0.5  1/2 = 0.5  1/3 = 0.33  log(3/3) = 0      0                  0                  0
boy    1/2 = 0.5  0          1/3 = 0.33  log(3/2) = 0.176  0.5*0.176 ≈ 0.088  0                  (1/3)*0.176 ≈ 0.059
girl   0          1/2 = 0.5  1/3 = 0.33  log(3/2) = 0.176  0                  0.5*0.176 ≈ 0.088  (1/3)*0.176 ≈ 0.059
19
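
The table above can be reproduced with a short Python sketch (log base 10 and length-of-document normalized tf are assumed, matching the table's own values; the documents are the good/boy/girl example from earlier):

```python
import math
from collections import Counter

docs = {"doc1": "good boy", "doc2": "good girl", "doc3": "boy girl good"}
tokenized = {d: text.split() for d, text in docs.items()}
vocab = sorted({t for tokens in tokenized.values() for t in tokens})
n_docs = len(docs)

def tf(term, tokens):
    # Normalized term frequency: count in document / document length.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Inverse document frequency with log base 10.
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log10(n_docs / df)

for term in vocab:
    weights = {d: round(tf(term, tokens) * idf(term), 4) for d, tokens in tokenized.items()}
    print(term, round(idf(term), 4), weights)
# good: idf = 0       -> tf*idf = 0 in every document
# boy : idf ~= 0.1761 -> doc1 ~= 0.088, doc3 ~= 0.0587
# girl: idf ~= 0.1761 -> doc2 ~= 0.088, doc3 ~= 0.0587
```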
TF*IDF weighting
When does TF*IDF register a high weight? When a term t occurs many times within a small number of documents.
The highest tf*idf for a term arises when the term has a high term frequency (in the given document) and a low document frequency (in the whole collection of documents); the weights hence tend to filter out common terms and lend high discriminating power to those documents.
A lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many documents, thus offering a less pronounced relevance signal.
TF*IDF
TF*IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
It is often used as a weighting factor in information retrieval, text mining and user modeling.
It is the most popular term-weighting scheme today.
It is often used by search engines as a central tool in scoring and ranking a document's relevance given a user's query.
It can also be used successfully for stop-word filtering in various subject fields.
21
TF*IDF Weighting
The most used term-weighting scheme is tf*idf:

wij = tfij * idfi = tfij * log2 (N / dfi) = Tf-Idf

A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
The tf-idf value for a term will always be greater than or equal to zero.
22
Computing TF-IDF: An Example
Assume the collection contains 10,000 documents and statistical analysis shows that the document frequencies (DF) of three terms are: A(50), B(1300), C(250), and that the raw term frequencies (TF) of these terms in a given document are: A(3), B(2), C(1). Compute TF*IDF for each term (tf normalized by the maximum frequency, idf with log base 10):
A: tf = 3/3 = 1.00; idf = log(10000/50) = 2.30103; tf*idf = 2.30103
B: tf = 2/3 = 0.67; idf = log(10000/1300) = 0.88606; tf*idf = 0.88606 * 0.67 ≈ 0.59
C: tf = 1/3 = 0.33; idf = log(10000/250) = 1.60206; tf*idf = 1.60206 * 0.33 ≈ 0.53

23
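
A quick sketch that reproduces the numbers above. Normalizing tf by the largest raw count (3) and using log base 10 are assumptions read off from the slide's own values:

```python
import math

N = 10_000                            # documents in the collection
raw_tf = {"A": 3, "B": 2, "C": 1}     # raw counts in the document
df = {"A": 50, "B": 1300, "C": 250}   # document frequencies

max_tf = max(raw_tf.values())         # tf is normalized by the largest count
for term in raw_tf:
    tf = raw_tf[term] / max_tf
    idf = math.log10(N / df[term])
    print(term, round(tf, 2), round(idf, 5), round(tf * idf, 3))
# A: tf = 1.0,  idf = 2.30103, tf*idf ~= 2.301
# B: tf = 0.67, idf = 0.88606, tf*idf ~= 0.591
# C: tf = 0.33, idf = 1.60206, tf*idf ~= 0.534
```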
More Examples
Consider a document containing 100 words in which the word cow appears 3 times. Now assume we have 10 million documents and cow appears in 1,000 of these.
The term frequency (TF) for cow: 3/100 = 0.03
The inverse document frequency: log2(10,000,000 / 1,000) = log2(10,000) ≈ 13.288
The TF*IDF score is the product of these quantities: 0.03 * 13.288 ≈ 0.399

24
Exercise
Compute tf, normalized tf (Ntf), idf and tf-idf for:
Doc 1: Ben studies about computers in Computer Lab.
Doc 2: Steve teaches at Brown University.
Doc 3: Data Scientists work on large datasets.

25
Compute the tf, idf and tf-idf for each of the following collections:
Collection 1:
Doc1: The sky is blue
Doc2: The sky is not blue
Collection 2:
Doc1: The president talks about Uconn
Doc2: Apple stock is hot
Doc3: The job market is hot and hot

26
Exercises
Consider a document containing 100 words in which the word apple appears 5 times. The term frequency (TF) for apple is then 5/100 = 0.05.
Now assume we have 10 million documents and the word apple appears in one thousand of these.
Calculate the inverse document frequency (IDF).

27
Exercise (5%)
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in a corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term.

Word      C (tf)  TW (n)  TD (N)  DF  TF    IDF        TF*IDF
airplane  5       46      3       1   5/46  log2(3/1)
blue      1       46      3       1
chair     7       46      3       3
computer  3       46      3       1
forest    2       46      3       1
justice   7       46      3       3
love      2       46      3       1
might     2       46      3       1
perl      5       46      3       2
rose      6       46      3       3
shoe      4       46      3       1
thesis    2       46      3       2
28
29
Similarity Measure
We now have vectors for all documents in the collection, and a vector for the query. How do we compute similarity?
A similarity measure is a function that computes the degree of similarity or distance between a document vector and a query vector.
Using a similarity measure between the query and each document, it is possible to rank the retrieved documents in the order of presumed relevance.

[Figure: document vectors D1 and D2 and query vector Q in the term space spanned by t1, t2, t3]
30
Similarity Measure
1. If d1 is near d2, then d2 is near d1.
2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
3. No document is closer to d than d itself.
A similarity measure attempts to compute the distance between the document vector wj and the query vector wq.
The assumption here is that documents whose vectors are close to the query vector are more relevant to the query than documents whose vectors are far away from the query vector.
31
Similarity Measure: Techniques
• Euclidean distance
It is the most common distance measure. Euclidean distance examines the root of the squared differences between the coordinates of a pair of document and query vectors.
• Dot product
The dot product is also known as the scalar product or inner product; it is computed as the sum of the products of the corresponding weights in the query and document vectors.
• Cosine similarity (or normalized inner product)
It projects the document and query vectors into a term space and calculates the cosine of the angle between them.

32
Cosine similarity
Cosine similarity is a metric that determines how alike two objects are: a low similarity score means the objects are not similar, a large value means they are similar.
Cosine similarity measures the similarity between two vectors of an inner product space.
It is measured by the cosine of the angle between the two vectors and determines whether the two vectors are pointing in roughly the same direction.

33
It is often used to measure document similarity in text analysis.
The similarity score ranges from 0 to 1, with 0 being the lowest (the least similar) and 1 being the highest (the most similar).
As the angle increases, the cosine decreases; a smaller angle means greater similarity.
cos 0° = 1
cos 30° = 0.866
cos 45° = 0.7071
cos 90° = 0

34
Example: Computing Cosine Similarity
• Let us say we have a query vector Q = (0.4, 0.8) and a document vector D1 = (0.2, 0.7). Compute their cosine similarity.

sim(Q, D1) = (0.4*0.2 + 0.8*0.7) / ( sqrt(0.4^2 + 0.8^2) * sqrt(0.2^2 + 0.7^2) )
           = 0.64 / sqrt(0.42)
           ≈ 0.98
Example: Computing Cosine Similarity
• Let us say we have two documents in our corpus, D1 = (0.8, 0.3) and D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8), determine which document is the most relevant to the query.

cos θ1 = sim(Q, D1) ≈ 0.73
cos θ2 = sim(Q, D2) ≈ 0.98
D2 is therefore the most relevant document for the query.

36
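
A minimal sketch of the cosine computation used in the two examples above (plain Python, no external libraries; the vector values are the ones given on the slides):

```python
import math

def cosine_similarity(v1, v2):
    """cos(theta) = (v1 . v2) / (|v1| * |v2|)"""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

Q = (0.4, 0.8)
print(round(cosine_similarity(Q, (0.2, 0.7)), 2))  # 0.98
print(round(cosine_similarity(Q, (0.8, 0.3)), 2))  # 0.73
```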
Example
Given the term weights below, compute the cosine similarity of (A, B) and of (A, C).

Doc  WT1  WT2  TM1  TM2  SM1  SM2
A    3    0    5    0    0    0
B    2    3    0    3    4    0
C    5    4    4    0    5    0
D    5    5    2    0    5    4

37
Cosine similarity

38
Exercise:
D1: the best Italian restaurant enjoy the best pasta
D2: American restaurant enjoy the best hamburger
D3: Korean restaurant enjoy the best bibimbap
Q1: the best of the best American restaurant
Compute:
i. the cosine similarity of (D1, Q1), (D2, Q1) and (D3, Q1)
ii. which document is most similar to the query Q1
iii. the distance between the most similar documents


39
Exercise
D1 = American restaurant
D2 = American restaurant hamburger pizza
D3 = hamburger pizza

Compute sim(D1, D2) and sim(D1, D3).

40
Cosine similarity
D1 = {3, 2, 0, 5, 0, 0, 0, 2, 0, 0}
D2 = {1, 0, 0, 0, 0, 0, 0, 1, 0, 1}
Compute the cosine similarity of D1 and D2.

41
Example
Given three documents D1, D2 and D3 with the corresponding TF-IDF weights below, which documents are most similar according to the three similarity measures?

Terms      D1     D2     D3
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254

42
Here the same three documents are labelled SaS (D1), PaP (D2) and WH (D3):

Terms      SaS    PaP    WH
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254

The vectors are already length-normalized, so the cosine is just the dot product:
cos(SaS, PaP) = 0.996*0.993 + 0.087*0.120 + 0.017*0.000 ≈ 0.999
cos(SaS, WH)  = 0.996*0.847 + 0.087*0.466 + 0.017*0.254 ≈ 0.888

43
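
Because the three tf-idf vectors above have unit norm, a plain dot product is enough; a small sketch verifying the two values:

```python
def dot(v1, v2):
    """For unit-length vectors the dot product equals the cosine similarity."""
    return sum(a * b for a, b in zip(v1, v2))

# Components in the order: affection, jealous, gossip
sas = (0.996, 0.087, 0.017)
pap = (0.993, 0.120, 0.000)
wh  = (0.847, 0.466, 0.254)

print(round(dot(sas, pap), 3))  # 0.999
print(round(dot(sas, wh), 3))   # 0.888
```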
Advantages of tf-idf
Easy to compute document similarity.
Keeps the scores of relevant (distinctive) words high.
Lowers the scores of words that are merely frequent.

44
Drawbacks
Based only on individual terms/words.
Weak at capturing document topics.
Weak at handling synonyms (different words with the same meaning).

45
Compute the cosine similarity of:

D1 = U.S. president speech in public

D2 = Donald Trump presentation to people

46
Inner Product
The similarity between the vectors for document dj and query q can be computed as the vector inner product:

sim(dj, q) = dj • q = Σ (i = 1 to n) wij * wiq

where wij is the weight of term i in document j and wiq is the weight of term i in the query q.
For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection).
For weighted term vectors, it is the sum of the products of the weights of the matched terms.
47
Properties of Inner Product
Favors long documents with a large number of unique terms; again, this raises the issue of normalization.
Measures how many terms are matched, but not how many terms are not matched.

48
Inner Product -- Examples
• Binary weights (size of vector = size of vocabulary = 7):

    Retrieval  Database  Term  Computer  Text  Manage  Data
D   1          1         1     0         1     1       0
Q   1          0         1     0         0     1       1

sim(D, Q) = 3

• Term weighted:

    Retrieval  Database  Architecture
D1  2          3         5
D2  3          7         1
Q   1          0         2
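
A short Python sketch of the inner-product measure, applied to the two examples above (binary and weighted):

```python
def inner_product(doc, query):
    """sim(d, q) = sum of w_ij * w_iq over all terms i."""
    return sum(w_d * w_q for w_d, w_q in zip(doc, query))

# Binary vectors: the result is the number of matched query terms (3 here).
D = (1, 1, 1, 0, 1, 1, 0)
Q = (1, 0, 1, 0, 0, 1, 1)
print(inner_product(D, Q))  # 3

# Weighted vectors (Retrieval, Database, Architecture).
D1, D2, Qw = (2, 3, 5), (3, 7, 1), (1, 0, 2)
print(inner_product(D1, Qw))  # 2*1 + 3*0 + 5*2 = 12
print(inner_product(D2, Qw))  # 3*1 + 7*0 + 1*2 = 5
```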
Inner Product: Example 1
[Figure: documents d1–d7 positioned relative to terms k1, k2, k3]

    k1  k2  k3  q • dj
d1  1   0   1   2
d2  1   0   0   1
d3  0   1   1   2
d4  1   0   0   1
d5  1   1   1   3
d6  1   1   0   2
d7  0   1   0   1

q   1   1   1

50
Inner Product: Exercise
[Figure: documents d1–d7 positioned relative to terms k1, k2, k3]

    k1  k2  k3  q • dj
d1  1   0   1   ?
d2  1   0   0   ?
d3  0   1   1   ?
d4  1   0   0   ?
d5  1   1   1   ?
d6  1   1   0   ?
d7  0   1   0   ?

q   1   2   3
51
Euclidean distance
The similarity between the vectors for document dj and query q can be computed as:

sim(dj, q) = |dj – q| = sqrt( Σ (i = 1 to n) (wij – wiq)^2 )

where wij is the weight of term i in document j and wiq is the weight of term i in the query q.
• Example: Determine the Euclidean distance between the document vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0). A weight of 0 means the corresponding term is not found in the document or query.

sqrt( (0-2)^2 + (3-7)^2 + (2-1)^2 + (1-0)^2 + (10-0)^2 ) = sqrt(122) ≈ 11.05
52
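
A minimal sketch of this Euclidean distance computation (remember: a smaller distance means the document is closer, i.e. more similar, to the query):

```python
import math

def euclidean_distance(doc, query):
    """dist(d, q) = sqrt(sum of (w_ij - w_iq)^2 over all terms i)."""
    return math.sqrt(sum((w_d - w_q) ** 2 for w_d, w_q in zip(doc, query)))

d1 = (0, 3, 2, 1, 10)
q = (2, 7, 1, 0, 0)
print(round(euclidean_distance(d1, q), 2))  # sqrt(122) ~= 11.05
```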
Exercises
A database collection consists of 1 million documents, of which 200,000 contain the term holiday while 250,000 contain the term season. A document repeats holiday 7 times and season 5 times. It is known that holiday is repeated more than any other term in the document. Calculate the weight of both terms in this document using different term-weighting methods. Try with
(i) normalized and unnormalized TF;
(ii) TF*IDF based on normalized and unnormalized TF.

53
Questions?
54
