Compressed Nonparametric Language Modelling

Ehsan Shareghi, Gholamreza Haffari, Trevor Cohn

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
Main track. Pages 2701-2707. https://doi.org/10.24963/ijcai.2017/376

Hierarchical Pitman-Yor Process (HPYP) priors are compelling for learning language models, outperforming point-estimate based methods. However, these models remain unpopular due to computational and statistical inference issues, such as memory and time usage, as well as poor mixing of the sampler. In this work we propose a novel framework which represents the HPYP model compactly using compressed suffix trees. We then develop an efficient approximate inference scheme in this framework that has a much lower memory footprint than the full HPYP and is fast at inference time. The experimental results illustrate that our model can be built on significantly larger datasets than previous HPYP models, while being several orders of magnitude smaller, fast at training and inference, and outperforming the perplexity of state-of-the-art Modified Kneser-Ney count-based LM smoothing by up to 15%.
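For context, the sketch below shows the standard HPYP predictive probability recursion (the Chinese-restaurant-franchise rule underlying hierarchical Pitman-Yor language models); it is only an illustration of the smoothing such models perform, not the paper's compressed suffix tree representation or its approximate inference scheme. The data structures and parameter names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class Restaurant:
    """Hypothetical per-context statistics: c_{uw} customer counts and t_{uw} table counts."""
    customers: Dict[str, int] = field(default_factory=dict)
    tables: Dict[str, int] = field(default_factory=dict)

def hpyp_prob(word: str,
              context: Tuple[str, ...],
              restaurants: Dict[Tuple[str, ...], Restaurant],
              discounts: Dict[int, float],
              concentrations: Dict[int, float],
              vocab_size: int) -> float:
    """Predictive probability P(word | context) under a hierarchical Pitman-Yor prior."""
    # Base case: the empty context backs off to a uniform distribution over the vocabulary.
    if not context:
        base = 1.0 / vocab_size
    else:
        # The parent context drops the earliest word (standard back-off order).
        base = hpyp_prob(word, context[1:], restaurants,
                         discounts, concentrations, vocab_size)

    node = restaurants.get(context)
    if node is None:  # context never observed: fall back entirely to the parent
        return base

    d = discounts[len(context)]           # order-dependent discount
    theta = concentrations[len(context)]  # order-dependent concentration
    c_w = node.customers.get(word, 0)
    t_w = node.tables.get(word, 0)
    c_total = sum(node.customers.values())
    t_total = sum(node.tables.values())

    # Chinese-restaurant-franchise rule: discounted counts interpolated
    # with the shorter-context (back-off) distribution.
    return ((max(c_w - d * t_w, 0.0) + (theta + d * t_total) * base)
            / (theta + c_total))
```

Tying the discount and concentration to the context length, as above, mirrors the usual HPYP language model parameterisation; the discounted-count plus back-off form is also what links HPYP smoothing to Kneser-Ney, the baseline the abstract compares against.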
Keywords:
Machine Learning: Machine Learning
Natural Language Processing: Natural Language Processing
Uncertainty in AI: Approximate Probabilistic Inference