[Zhang+ ACL2014] Kneser-Ney Smoothing on Expected Count
[Pickhardt+ ACL2014] A Generalized Language Model as the
Combination of Skipped n-grams and Modified Kneser-Ney Smoothing
2014/7/12 ACL Reading @ PFI
Nakatani Shuyo, Cybozu Labs Inc.
Kneser-Ney Smoothing
[Kneser+ 1995]
• Discounting & Interpolation
$$P(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\left(c(w_{i-n+1}^{i}) - D,\ 0\right)}{c(w_{i-n+1}^{i-1})} + \frac{D}{c(w_{i-n+1}^{i-1})}\, N_{1+}(w_{i-n+1}^{i-1}\,\cdot) \cdot P(w_i \mid w_{i-n+2}^{i-1})$$
• where $w_m^n = w_m \cdots w_n$ and $N_{1+}(w_m^n\,\cdot) = \left|\{ w_i \mid c(w_m^n w_i) > 0 \}\right|$ is the number of distinct words that follow $w_m^n$
• the second term hands the total discounted mass $D \cdot N_{1+}(w_{i-n+1}^{i-1}\,\cdot)$ over to the lower-order model
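To make the interpolation concrete, here is a minimal Python sketch of interpolated Kneser-Ney for bigrams with a single absolute discount D. The input format (a dict of bigram counts) and all function names are assumptions for illustration, not code from either paper.

```python
from collections import defaultdict

def kneser_ney_bigram(bigram_counts, D=0.75):
    """Interpolated Kneser-Ney for bigrams with one absolute discount D.

    bigram_counts: dict mapping (u, w) -> integer count c(u w).
    Returns a function p(w, u) estimating P(w | u).
    """
    context_total = defaultdict(int)   # c(u .) = sum_w c(u w)
    continuations = defaultdict(set)   # {w : c(u w) > 0}, so len(...) = N1+(u .)
    left_contexts = defaultdict(set)   # {u : c(u w) > 0}, so len(...) = N1+(. w)
    for (u, w), c in bigram_counts.items():
        if c > 0:
            context_total[u] += c
            continuations[u].add(w)
            left_contexts[w].add(u)
    total_bigram_types = sum(len(s) for s in left_contexts.values())  # N1+(. .)

    def p_continuation(w):
        # Lower-order Kneser-Ney distribution: fraction of distinct contexts
        # that w completes, not its raw unigram frequency.
        return len(left_contexts[w]) / total_bigram_types

    def p(w, u):
        c_uw = bigram_counts.get((u, w), 0)
        c_u = context_total[u]
        if c_u == 0:
            return p_continuation(w)                      # unseen context: full backoff
        discounted = max(c_uw - D, 0.0) / c_u
        backoff_weight = D * len(continuations[u]) / c_u  # D * N1+(u .) / c(u .)
        return discounted + backoff_weight * p_continuation(w)

    return p
```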
Modified KN-Smoothing
[Chen+ 1999]
$$P(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) - D\left(c(w_{i-n+1}^{i})\right)}{c(w_{i-n+1}^{i-1})} + \gamma(w_{i-n+1}^{i-1})\, P(w_i \mid w_{i-n+2}^{i-1})$$
• where the discount now depends on the count (weighted discounting):
$$D(c) = \begin{cases} 0 & \text{if } c = 0 \\ D_1 & \text{if } c = 1 \\ D_2 & \text{if } c = 2 \\ D_{3+} & \text{if } c \ge 3 \end{cases} \qquad \gamma(w_{i-n+1}^{i-1}) = \frac{\text{[amount of discounting]}}{c(w_{i-n+1}^{i-1})}$$
• the discounts $D_1, D_2, D_{3+}$ are estimated by leave-one-out cross-validation
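As a reference point, the leave-one-out estimation has the well-known closed-form approximation from [Chen+ 1999] in terms of counts-of-counts. The sketch below is a minimal Python version, assuming the counts-of-counts n_1 … n_4 are all nonzero.

```python
def modified_kn_discounts(counts):
    """Closed-form estimates of D1, D2, D3+ from counts-of-counts
    (Chen & Goodman 1999), derived from leave-one-out likelihood.

    counts: iterable of n-gram counts c > 0 observed in the training data.
    """
    n = [0, 0, 0, 0, 0]            # n[r] = number of n-gram types seen exactly r times
    for c in counts:
        if 1 <= c <= 4:
            n[c] += 1
    Y = n[1] / (n[1] + 2 * n[2])
    D1 = 1 - 2 * Y * n[2] / n[1]
    D2 = 2 - 3 * Y * n[3] / n[2]
    D3_plus = 3 - 4 * Y * n[4] / n[3]
    return D1, D2, D3_plus

def D(c, D1, D2, D3_plus):
    """The count-dependent discount D(c) used in the modified KN formula."""
    if c == 0:
        return 0.0
    if c == 1:
        return D1
    if c == 2:
        return D2
    return D3_plus
```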
[Zhang+ ACL2014] Kneser-Ney
Smoothing on Expected Count
• When each sentence has a fractional weight
  – Domain adaptation
  – EM algorithm on word alignment
• Proposes KN smoothing using expected fractional counts
I’m interested in it!
Model
• $\boldsymbol{u}$ denotes the context $w_{i-n+1}^{i-1}$, and $\boldsymbol{u}'$ the shortened context $w_{i-n+2}^{i-1}$
• Suppose a sequence $\boldsymbol{u}w$ occurs $k$ times and the $i$-th occurrence carries probability $p_i$ ($i = 1, \cdots, k$) as its weight;
• then the count $c(\boldsymbol{u}w)$ is distributed according to a Poisson binomial distribution:
$$p\left(c(\boldsymbol{u}w) = r\right) = s(k, r), \quad \text{where} \quad s(k, r) = \begin{cases} s(k-1, r)\,(1 - p_k) + s(k-1, r-1)\, p_k & \text{if } 0 \le r \le k \\ 1 & \text{if } k = r = 0 \\ 0 & \text{otherwise} \end{cases}$$
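The recurrence translates directly into a small dynamic program. Below is a minimal Python sketch; the function name and list representation are my own, for illustration.

```python
def poisson_binomial_pmf(p):
    """P(count = r) for r = 0..k, via the recurrence
    s(k, r) = s(k-1, r)(1 - p_k) + s(k-1, r-1) p_k.

    p: list of per-occurrence weights p_1..p_k, each in [0, 1].
    Returns a list s with s[r] = p(c = r).
    """
    s = [1.0]                                 # k = 0: the count is 0 with probability 1
    for p_k in p:
        nxt = [0.0] * (len(s) + 1)
        for r, prob in enumerate(s):
            nxt[r] += prob * (1.0 - p_k)      # occurrence k is not counted
            nxt[r + 1] += prob * p_k          # occurrence k is counted
        s = nxt
    return s

# Example: three occurrences with weights 0.9, 0.5, 0.2
pmf = poisson_binomial_pmf([0.9, 0.5, 0.2])
expected_count = sum(r * pr for r, pr in enumerate(pmf))   # equals 0.9 + 0.5 + 0.2
```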
MLE on this model
• Expectations
– $\mathbb{E}[c(\boldsymbol{u}w)] = \sum_r r \cdot p\left(c(\boldsymbol{u}w) = r\right)$
– $\mathbb{E}[N_r(\boldsymbol{u}\,\cdot)] = \sum_w p\left(c(\boldsymbol{u}w) = r\right)$
– $\mathbb{E}[N_{r+}(\boldsymbol{u}\,\cdot)] = \sum_w p\left(c(\boldsymbol{u}w) \ge r\right)$
• Maximize the (expected) likelihood
– $\mathbb{E}[L] = \mathbb{E}\left[\sum_{\boldsymbol{u}w} c(\boldsymbol{u}w) \log p(w \mid \boldsymbol{u})\right] = \sum_{\boldsymbol{u}w} \mathbb{E}[c(\boldsymbol{u}w)] \log p(w \mid \boldsymbol{u})$
– which yields $p_{\mathrm{MLE}}(w \mid \boldsymbol{u}) = \dfrac{\mathbb{E}[c(\boldsymbol{u}w)]}{\mathbb{E}[c(\boldsymbol{u}\,\cdot)]}$
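Given the per-occurrence weights for each n-gram, these expectations follow directly from the Poisson binomial pmf. A minimal sketch, building on the `poisson_binomial_pmf` function above; the dict-of-weight-lists input format is an assumption for illustration.

```python
def expected_statistics(weights_by_ngram):
    """Expected counts and expected type counts under the Poisson binomial model.

    weights_by_ngram: dict mapping (u, w) -> list of per-occurrence weights p_1..p_k.
    Returns (E_c, E_ctx, E_N1plus) where
      E_c[(u, w)]  ~ E[c(u w)]       (equals the sum of the weights)
      E_ctx[u]     ~ E[c(u .)]
      E_N1plus[u]  ~ E[N_{1+}(u .)]  = sum_w p(c(u w) >= 1)
    """
    E_c, E_ctx, E_N1plus = {}, {}, {}
    for (u, w), weights in weights_by_ngram.items():
        pmf = poisson_binomial_pmf(weights)
        E_c[(u, w)] = sum(r * pr for r, pr in enumerate(pmf))
        E_ctx[u] = E_ctx.get(u, 0.0) + E_c[(u, w)]
        E_N1plus[u] = E_N1plus.get(u, 0.0) + (1.0 - pmf[0])   # p(c(u w) >= 1)
    return E_c, E_ctx, E_N1plus

def p_mle(w, u, E_c, E_ctx):
    """Expected-count MLE: p_MLE(w | u) = E[c(u w)] / E[c(u .)]."""
    return E_c.get((u, w), 0.0) / E_ctx[u]
```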
Expected Kneser-Ney
• The smoothed count is $\bar{c}(\boldsymbol{u}w) = \max\left(0,\ c(\boldsymbol{u}w) - D\right) + N_{1+}(\boldsymbol{u}\,\cdot)\, D\, p'(w \mid \boldsymbol{u}')$
• So, $\mathbb{E}[\bar{c}(\boldsymbol{u}w)] = \mathbb{E}[c(\boldsymbol{u}w)] - p\left(c(\boldsymbol{u}w) > 0\right) D + \mathbb{E}[N_{1+}(\boldsymbol{u}\,\cdot)]\, D\, p'(w \mid \boldsymbol{u}')$
  – where $p'(w \mid \boldsymbol{u}') = \dfrac{\mathbb{E}[N_{1+}(\cdot\,\boldsymbol{u}'w)]}{\mathbb{E}[N_{1+}(\cdot\,\boldsymbol{u}'\,\cdot)]}$
• then $p(w \mid \boldsymbol{u}) = \dfrac{\mathbb{E}[\bar{c}(\boldsymbol{u}w)]}{\mathbb{E}[\bar{c}(\boldsymbol{u}\,\cdot)]}$
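Putting the pieces together for the bigram case: a minimal sketch of expected KN, assuming a single discount D ≤ 1 (so $\max(0, c - D)$ never clips for $c \ge 1$) and a unigram continuation model as the lower order (i.e. $\boldsymbol{u}'$ is empty). The data format and names are illustrative, not the authors' implementation.

```python
def expected_kneser_ney(weights_by_ngram, D=0.75):
    """Expected Kneser-Ney for bigrams (u, w) from per-occurrence weights.
    Builds on poisson_binomial_pmf from the earlier sketch."""
    E_c, p_pos = {}, {}
    E_N1plus, ctx_total = {}, {}          # E[N1+(u .)] and E[c(u .)]
    cont_num, cont_den = {}, 0.0          # E[N1+(. w)] and E[N1+(. .)]
    for (u, w), weights in weights_by_ngram.items():
        pmf = poisson_binomial_pmf(weights)
        E_c[(u, w)] = sum(weights)                       # E[c(u w)]
        p_pos[(u, w)] = 1.0 - pmf[0]                     # p(c(u w) > 0)
        E_N1plus[u] = E_N1plus.get(u, 0.0) + p_pos[(u, w)]
        ctx_total[u] = ctx_total.get(u, 0.0) + E_c[(u, w)]
        cont_num[w] = cont_num.get(w, 0.0) + p_pos[(u, w)]
        cont_den += p_pos[(u, w)]

    def p_prime(w):                                      # p'(w | u') with empty u'
        return cont_num.get(w, 0.0) / cont_den

    def p(w, u):
        # E[c_bar(u w)] = E[c(u w)] - p(c(u w) > 0) D + E[N1+(u .)] D p'(w | u');
        # summed over the whole vocabulary this equals E[c(u .)], the normalizer.
        e_cbar = (E_c.get((u, w), 0.0) - p_pos.get((u, w), 0.0) * D
                  + E_N1plus[u] * D * p_prime(w))
        return e_cbar / ctx_total[u]

    return p
```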
Language model adaptation
• Our corpus consists of
  – large general-domain data and
  – small specific-domain data
• Sentence $\boldsymbol{w}$'s weight:
  – $p(\boldsymbol{w} \text{ is in-domain}) = \dfrac{1}{1 + \exp\left(-H(\boldsymbol{w})\right)}$
  – where $H(\boldsymbol{w}) = \dfrac{\log p_{\mathrm{in}}(\boldsymbol{w}) - \log p_{\mathrm{out}}(\boldsymbol{w})}{|\boldsymbol{w}|}$
  – $p_{\mathrm{in}}$: language model of the in-domain data, $p_{\mathrm{out}}$: that of the out-of-domain (general-domain) data
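The weight is just a sigmoid of the length-normalized log-probability difference between the two language models. A minimal sketch, assuming `logprob_in` and `logprob_out` are callables returning the total log-probability of a sentence under the in-domain and general-domain models respectively.

```python
import math

def in_domain_weight(sentence, logprob_in, logprob_out):
    """Fractional weight p(sentence is in-domain) = sigmoid(H(sentence)),
    where H is the per-word log-probability difference."""
    num_words = len(sentence.split())
    H = (logprob_in(sentence) - logprob_out(sentence)) / num_words
    return 1.0 / (1.0 + math.exp(-H))
```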
• Figure 1: On the language model adaptation task, expected KN outperforms all
other methods across all sizes of selected subsets. Integral KN is applied to
unweighted instances, while fractional WB, fractional KN and expected KN are
applied to weighted instances. (via [Zhang+ ACL2014])
[Figure 1 plot: results across sizes of subsets selected from the general-domain data; in-domain data: 54k training sentences, 3k test sentences]
Why isn't there Modified KN as a baseline?
[Pickhardt+ ACL2014] A Generalized Language Model
as the Combination of Skipped n-grams
and Modified Kneser-Ney Smoothing
• Higher-order n-grams are very sparse
  – Especially remarkable on small data (e.g. domain-specific data!)
• Improves performance on small data by combining skipped n-grams with Modified KN smoothing
  – Perplexity is reduced by 25.7% for very small training data of only 736KB of text
“Generalized Language Models”
• $\partial_3\, w_1 w_2 w_3 w_4 = w_1 w_2\, \_\, w_4$
  – "_" is a word placeholder, i.e. $\partial_j$ skips the $j$-th word
$$P_{\mathrm{GLM}}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) - D\left(c(w_{i-n+1}^{i})\right)}{c(w_{i-n+1}^{i-1})} + \gamma_{\mathrm{high}}(w_{i-n+1}^{i-1})\, \frac{1}{n-1} \sum_{j=1}^{n-1} P_{\mathrm{GLM}}(w_i \mid \partial_j w_{i-n+1}^{i-1})$$
$$P_{\mathrm{GLM}}(w_i \mid \partial_j w_{i-n+1}^{i-1}) = \frac{N_{1+}(\partial_j w_{i-n}^{i}) - D\left(c(\partial_j w_{i-n+1}^{i})\right)}{N_{1+}(\partial_j w_{i-n+1}^{i-1}\,\ast)} + \gamma_{\mathrm{mid}}(\partial_j w_{i-n+1}^{i-1})\, \frac{1}{n-2} \sum_{k=1,\, k \ne j}^{n-1} P_{\mathrm{GLM}}(w_i \mid \partial_j \partial_k w_{i-n+1}^{i-1})$$
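The skip operator itself is simple to state in code. A short sketch of $\partial_j$ and of the set of skipped contexts the highest-order GLM interpolates over (first level of the recursion only); names and the tuple representation are illustrative.

```python
def skip(ngram, j):
    """The skip operator: replace the j-th word (1-indexed) with the placeholder "_".
    E.g. skip(("w1", "w2", "w3", "w4"), 3) == ("w1", "w2", "_", "w4")."""
    return tuple("_" if i == j - 1 else w for i, w in enumerate(ngram))

def skipped_contexts(context):
    """One skipped copy of the (n-1)-word context per position j = 1..n-1,
    i.e. the contexts averaged over in the highest-order GLM interpolation."""
    return [skip(context, j) for j in range(1, len(context) + 1)]

# Example
print(skip(("w1", "w2", "w3", "w4"), 3))      # ('w1', 'w2', '_', 'w4')
print(skipped_contexts(("w1", "w2", "w3")))   # three skipped variants of the context
```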
• The bold arrows correspond to interpolation of models in traditional
modified Kneser-Ney smoothing. The lighter arrows illustrate the
additional interpolations introduced by our generalized language
models. (via [Pickhardt+ ACL2014])
• [Figure: results on shrunk training data sets for the English Wikipedia, used as small domain-specific data]
Space Complexity
• model size = 9.5GB, # of entries = 427M
• model size = 15GB, # of entries = 742M
References
• [Zhang+ ACL2014] Kneser-Ney Smoothing on Expected Count
• [Pickhardt+ ACL2014] A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing
• [Kneser+ 1995] Improved backing-off for m-gram language modeling
• [Chen+ 1999] An Empirical Study of Smoothing Techniques for Language Modeling