03 Laws
Laws of Text
Instructor:
Walid Magdy
30-Sep-2020
Pre-Lecture
• Lab 0: How did it go?
• Lab 1: next week, important to everyone
• Join Piazza (search for TTDS)
• Live lecture
Lecture Objectives
• Learn about some text laws
• Zipf’s law
• Benford’s law
• Heaps’ law
• Clumping/contagion
Words’ nature
• Word → basic unit to represent text
• Certain characteristics are observed in the words we use!
• These characteristics are so consistent that we can describe them with laws
• These laws apply for:
• Different languages
• Different domains of text
Frequency of words
• Some words are very frequent
e.g. “the”, “of”, “to”
• Many words are less frequent
e.g. “schizophrenia”, “bazinga”
• ~50% of terms appear only once
• Frequency of words decays very sharply (a power-law) with rank
[Plot: frequency vs. rank, and log(frequency) vs. log(rank)]
Zipf’s Law:
• For a given collection of text, if unique terms are ranked by their frequency, then:
𝑟 × 𝑃𝑟 ≅ 𝑐𝑜𝑛𝑠𝑡
• 𝑟, rank of term according to frequency
• 𝑃𝑟 , probability of appearance of term
• 𝑃𝑟 ≅ 𝑐𝑜𝑛𝑠𝑡 / 𝑟 → 𝑓(𝑥) ≅ 1 / 𝑥
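A minimal sketch (not from the slides) of how this could be checked on a text collection in Python; the file name corpus.txt and the simple whitespace tokenisation are assumptions chosen for illustration:

from collections import Counter

# Count term frequencies in a plain-text corpus (file name is an assumption).
with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

counts = Counter(tokens)
total = len(tokens)

# Rank terms by frequency and check that r * Pr stays roughly constant.
for rank, (term, freq) in enumerate(counts.most_common(20), start=1):
    pr = freq / total
    print(f"{rank:>4}  {term:<10}  {freq:>10}  r*Pr = {rank * pr:.4f}")

If Zipf's law holds, the r*Pr column should vary far less across rows than the raw frequencies do.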
Zipf’s Law:
Wikipedia abstracts → 3.5M English abstracts

𝑟 × 𝑃𝑟 ≅ 𝑐𝑜𝑛𝑠𝑡 → 𝑟 × 𝑓𝑟𝑒𝑞𝑟 ≅ 𝑐𝑜𝑛𝑠𝑡

Term    Rank    Frequency     r x freq
the        1    5,134,790    5,134,790
of         2    3,102,474    6,204,948
in         3    2,607,875    7,823,625
a          4    2,492,328    9,969,312
is         5    2,181,502   10,907,510
and        6    1,962,326   11,773,956
was        7    1,159,088    8,113,616
to         8    1,088,396    8,707,168
by         9      766,656    6,899,904
an        10      566,970    5,669,700
it        11      557,492    6,132,412
for       13      493,374    6,413,862
as        14      480,277    6,723,878
on        15      471,544    7,073,160
from      16      412,785    6,604,560
Practical
Benford’s Law:
• The first digit of a number follows a Zipf-like law!
• Term frequencies
• Physical constants
• Energy bills
• Population numbers
• Benford’s law:
𝑃(𝑑) = log₁₀(1 + 1/𝑑), for 𝑑 ∈ {1, …, 9}
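For example, 𝑃(1) = log₁₀(2) ≈ 0.301 while 𝑃(9) ≈ 0.046. A hedged Python sketch comparing the observed leading-digit distribution of a list of numbers with Benford's prediction; the sample numbers here are simply the term frequencies from the Zipf table above, standing in for any real data set:

import math
from collections import Counter

def benford_expected(d: int) -> float:
    """Predicted probability that the leading digit is d (1-9)."""
    return math.log10(1 + 1 / d)

# Illustrative data: term frequencies, populations, bills, etc. would go here.
numbers = [5134790, 3102474, 2607875, 2492328, 2181502, 1962326, 1159088,
           1088396, 766656, 566970, 557492, 493374, 480277, 471544, 412785]

# Count leading digits and compare with the Benford prediction.
first_digits = Counter(str(abs(n))[0] for n in numbers if n != 0)
total = sum(first_digits.values())

for d in range(1, 10):
    observed = first_digits.get(str(d), 0) / total
    print(f"d={d}: observed {observed:.3f}  expected {benford_expected(d):.3f}")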
Practical
Heaps’ Law:
• While going through documents, the rate at which new terms are encountered decreases over time
• For a book/collection, while reading through, record:
• 𝑛: number of words read so far
• 𝑣: number of new words seen (unique words, i.e. vocabulary size)
• Vocabulary growth:
𝑣(𝑛) = 𝑘 × 𝑛^𝑏
where 𝑏 < 1, typically 0.4 < 𝑏 < 0.7
[Plot: vocabulary size 𝑣 vs. number of words read 𝑛]
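A minimal sketch of how vocabulary growth could be measured while reading through a collection and compared against 𝑣(𝑛) = 𝑘 × 𝑛^𝑏; the file name corpus.txt and the parameters k = 30, b = 0.5 are illustrative assumptions, not fitted values:

# Track vocabulary growth v(n) while reading a corpus token by token.
seen = set()
growth = []  # (n, v) pairs sampled along the way

with open("corpus.txt", encoding="utf-8") as f:   # file name is an assumption
    n = 0
    for line in f:
        for token in line.lower().split():
            n += 1
            seen.add(token)
            if n % 10_000 == 0:          # sample every 10k tokens
                growth.append((n, len(seen)))
growth.append((n, len(seen)))            # final point

# Compare against Heaps' law with illustrative parameters k=30, b=0.5.
k, b = 30, 0.5
for n, v in growth:
    print(f"n={n:>10}  v={v:>8}  k*n^b={k * n**b:,.0f}")

Fitting k and b properly would be done by linear regression on log(v) vs. log(n).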
Practical
Clumping/Contagion in text
• From Zipf’s law, we notice:
• Most words do not appear that much!
• Once you see a word once → expect to see it again soon!
• Words are like:
• A rare contagious disease
• Not a rare, independent lightning strike
Clumping/Contagion in text
• Wiki abstract collection
• Identify terms that appear exactly twice
• Measure the distance between the two occurrences of each term:
𝑑 = 𝑛(occurrence 2) − 𝑛(occurrence 1)
• Plot the density function of 𝑑
[Plot: density of 𝑑]
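A hedged sketch of the experiment described above: record token positions, keep the terms that occur exactly twice, and inspect the distribution of the gap 𝑑 between their two occurrences. The file name, tokenisation, and bin width are assumptions; clumping shows up as most of the mass sitting at small gaps.

from collections import defaultdict

# Record every token position for every term.
positions = defaultdict(list)
with open("corpus.txt", encoding="utf-8") as f:   # file name is an assumption
    for i, token in enumerate(f.read().lower().split()):
        positions[token].append(i)

# For terms that occur exactly twice, compute d = position2 - position1.
gaps = [p[1] - p[0] for p in positions.values() if len(p) == 2]

# A crude density estimate: count gaps in fixed-width bins.
bin_width = 1000
density = defaultdict(int)
for d in gaps:
    density[d // bin_width] += 1
for b in sorted(density)[:10]:
    print(f"gap {b * bin_width:>8}-{(b + 1) * bin_width - 1:<8}: {density[b]}")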
Summary
• Text follows well-known phenomena
• Text Laws:
• Zipf’s law
• Benford’s law
• Heaps’ law
• Contagion in text
Resources
• Text book:
• Search Engines: Information Retrieval in Practice → Chapter 4
• Videos:
• Zipf’s law, Vsauce:
https://www.youtube.com/watch?v=fCn8zs912OE
• Benford’s law, Numberphile:
https://www.youtube.com/watch?v=XXjlR2OK1kM
• Tools:
Unix commands for Windows
https://sourceforge.net/projects/unxutils
Next Lecture
• Getting ready for indexing?
• Pre-processing steps before the indexing process