Abstract
Smoothing the parameters of multinomial distributions is an important concern in statistical inference tasks. In this paper, we present a new smoothing prior for the Multinomial Naive Bayes classifier. Our approach takes advantage of the Beta-Liouville distribution for estimating the multinomial parameters. To handle sparse documents, we exploit vocabulary knowledge to define two distinct priors over the “observed” and the “unseen” words. We address the problem of large-scale, sparse data by enhancing the Multinomial Naive Bayes classifier with Beta-Liouville-based smoothing of the word probability estimates. Our approach is evaluated on two challenging applications involving sparse and large-scale documents: emotion intensity analysis and hate speech detection. Experiments on real-world datasets show the effectiveness of the proposed classifier compared to related methods.
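To make the smoothing idea concrete, the following is a minimal Python/NumPy sketch of a Multinomial Naive Bayes classifier whose per-class word probabilities receive two distinct pseudo-count priors, one for vocabulary words observed in the class and one for words never seen in it. This is an illustration only, under assumed names and values (SmoothedMultinomialNB, alpha_seen, alpha_unseen); it does not reproduce the paper's Beta-Liouville estimator.

import numpy as np

class SmoothedMultinomialNB:
    def __init__(self, alpha_seen=1.0, alpha_unseen=0.1):
        # Hypothetical pseudo-counts: larger prior mass for observed words,
        # smaller for vocabulary words unseen in a given class.
        self.alpha_seen = alpha_seen
        self.alpha_unseen = alpha_unseen

    def fit(self, X, y):
        # X: (n_docs, vocab_size) term-count matrix; y: class labels.
        self.classes_ = np.unique(y)
        n_classes, vocab = len(self.classes_), X.shape[1]
        self.log_prior_ = np.empty(n_classes)
        self.log_word_prob_ = np.empty((n_classes, vocab))
        for i, c in enumerate(self.classes_):
            Xc = X[y == c]
            counts = Xc.sum(axis=0)
            # Two distinct priors over observed vs. unseen words.
            alpha = np.where(counts > 0, self.alpha_seen, self.alpha_unseen)
            smoothed = counts + alpha
            self.log_word_prob_[i] = np.log(smoothed / smoothed.sum())
            self.log_prior_[i] = np.log(Xc.shape[0] / X.shape[0])
        return self

    def predict(self, X):
        # Log-posterior (up to a constant) of each class for each document.
        scores = X @ self.log_word_prob_.T + self.log_prior_
        return self.classes_[np.argmax(scores, axis=1)]

# Toy usage with a 5-word vocabulary and two classes.
X = np.array([[3, 0, 1, 0, 0],
              [2, 1, 0, 0, 0],
              [0, 0, 0, 4, 1],
              [0, 1, 0, 2, 3]])
y = np.array([0, 0, 1, 1])
clf = SmoothedMultinomialNB().fit(X, y)
print(clf.predict(np.array([[1, 0, 2, 0, 0]])))  # expected: class 0

The design choice being illustrated is that unseen words are not all penalized with the same flat Laplace constant; the prior mass assigned to them is controlled separately from that of observed words, which is the role the Beta-Liouville prior plays in the paper.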
Cite this paper
Najar, F., Bouguila, N. (2021). Sparse Document Analysis Using Beta-Liouville Naive Bayes with Vocabulary Knowledge. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) Document Analysis and Recognition – ICDAR 2021. Lecture Notes in Computer Science, vol. 12822. Springer, Cham. https://doi.org/10.1007/978-3-030-86331-9_23