On smoothing and scaling language model for sentiment based information retrieval

Najar, Fatma; Bouguila, Nizar

doi:10.1007/s11634-022-00522-6

Regular Article
Published: 13 October 2022

Advances in Data Analysis and Classification, volume 17, pages 725–744 (2023)

439 Accesses
3 Citations

Abstract

Sentiment analysis, or opinion mining, refers to the discovery of sentiment information within textual documents, tweets, or review posts. The field has grown with the rise of social media and is of great interest for applications such as marketing, tourism, and business. In this work, we approach Twitter sentiment analysis through a novel framework that simultaneously addresses problems of text representation such as sparseness and high dimensionality. We propose a probabilistic information retrieval model based on a new distribution, the Smoothed Scaled Dirichlet distribution. We present a likelihood-based learning method for estimating the parameters of the distribution, and we propose a feature generation scheme built on the information retrieval system. We apply the proposed approach, the Smoothed Scaled Relevance Model, to four Twitter sentiment datasets: STD, STS-Gold, SemEval14, and SentiStrength. We evaluate its performance against baseline models and related work.

Notes

Available at http://help.sentiment140.com/.
Available at https://www.kaggle.com/divyansh22/stsgold-dataset/.
Available at http://www.alchemyapi.com/.
Available at https://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools.
Available at http://sentistrength.wlv.ac.uk/documentation/.

Author information

Authors and Affiliations

Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, QC, Canada
Fatma Najar & Nizar Bouguila

Corresponding author

Correspondence to Fatma Najar.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Calculation of first-order derivatives

$$\begin{aligned} \frac{\partial \mathcal {L}}{\partial \alpha _{v}}&=\sum _{i=1}^N \Big ( \log \beta _{v} -(\log \alpha _{v} +1) + \log (w_{iv}^s) \\&\quad + \log \alpha _+ +1 - \log \Big ( \sum _{k=1}^D \beta _{k} w_{ik}^s \Big ) \Big ) \end{aligned}$$

(20)

$$\begin{aligned} \frac{\partial \mathcal {L}}{\partial \alpha _{v}}&= 0\nonumber \\&\iff \end{aligned}$$

(21)

$$\begin{aligned} \sum _{i=1}^N \log \alpha _{v}&=\sum _{i=1}^N \Bigg ( \log \beta _{v} + \log (w_{iv}^s) - \log \Bigg (\sum _{k=1}^D \beta _{k} w_{ik}^s \Bigg ) \Bigg ) \end{aligned}$$

(22)

$$\begin{aligned}&\hat{\alpha }_{v} = \sum _{i=1}^N \beta _{v} \frac{w^s_{iv}}{\sum _{k=1}^D \beta _{k} w^s_{ik}} \end{aligned}$$

(23)

$$\begin{aligned} \frac{\partial \mathcal {L}}{\partial \beta _{v}}&=\sum _{i=1}^N \Bigg ( \alpha _{v} \frac{1}{\beta _{v}} - \alpha _+ \frac{w_{iv}^s}{\sum _{k=1}^D \beta _{k} w_{ik}^s} \Bigg ) \end{aligned}$$

(24)

$$\begin{aligned} \frac{\partial \mathcal {L}}{\partial \beta _{v}}&= 0\nonumber \\&\iff \end{aligned}$$

(25)

$$\begin{aligned} \sum _{i=1}^N \frac{1}{\beta _{v}}&=\sum _{i=1}^N \Bigg ( \frac{\alpha _+}{\alpha _{v}} \frac{w_{iv}^s}{\sum _{k=1}^D \beta _{k} w_{ik}^s} \Bigg ) \end{aligned}$$

(26)

$$\begin{aligned}&\hat{\beta }_{v} = \sum _{i=1}^N \frac{\alpha _{v}}{\alpha _+} \frac{\sum _{k=1}^D \beta _{k} w^s_{ik}}{w^s_{iv}} \end{aligned}$$

(27)
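
The derivatives in Eqs. (20) and (24) can be checked numerically. The sketch below is illustrative only and is not the authors' implementation. The objective `log_likelihood` is an assumption: it is one function whose partial derivatives are exactly the right-hand sides of Eqs. (20) and (24), with terms that are constant in the parameters dropped; the full log-likelihood of the article is not reproduced on this page. All function names, the array `W` (rows are the smoothed vectors w_i^s, columns are the D components), and the random test data are hypothetical.

```python
import numpy as np


def log_likelihood(alpha, beta, W):
    """Assumed objective, up to terms constant in alpha and beta.

    Chosen so that its partial derivatives reproduce Eqs. (20) and (24).
    """
    a_plus = alpha.sum()                 # alpha_+ = sum of all alpha_v
    mix = W @ beta                       # shape (N,): sum_k beta_k * w_ik
    per_doc = (
        a_plus * np.log(a_plus)
        - np.sum(alpha * np.log(alpha))
        + np.sum(alpha * np.log(beta))
        + np.log(W) @ alpha              # sum_v alpha_v * log w_iv, per document
        - a_plus * np.log(mix)
    )
    return per_doc.sum()                 # sum over the N documents


def grad_alpha(alpha, beta, W):
    """Right-hand side of Eq. (20), one entry per component v."""
    n = W.shape[0]
    a_plus = alpha.sum()
    mix = W @ beta
    # The +1 and -1 terms of Eq. (20) cancel inside the sum.
    return (
        n * (np.log(beta) - np.log(alpha) + np.log(a_plus))
        + np.log(W).sum(axis=0)
        - np.log(mix).sum()
    )


def grad_beta(alpha, beta, W):
    """Right-hand side of Eq. (24), one entry per component v."""
    n = W.shape[0]
    a_plus = alpha.sum()
    mix = W @ beta
    return n * alpha / beta - a_plus * (W / mix[:, None]).sum(axis=0)


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        g[k] = (f(x + step) - f(x - step)) / (2 * eps)
    return g


# Hypothetical data: N = 5 strictly positive vectors on the simplex, D = 4.
rng = np.random.default_rng(0)
W = rng.dirichlet(np.ones(4), size=5)
alpha = rng.uniform(0.5, 2.0, size=4)
beta = rng.uniform(0.5, 2.0, size=4)

ok_alpha = np.allclose(
    grad_alpha(alpha, beta, W),
    numeric_grad(lambda a: log_likelihood(a, beta, W), alpha),
    atol=1e-4,
)
ok_beta = np.allclose(
    grad_beta(alpha, beta, W),
    numeric_grad(lambda b: log_likelihood(alpha, b, W), beta),
    atol=1e-4,
)
print("Eq. (20) matches finite differences:", ok_alpha)
print("Eq. (24) matches finite differences:", ok_beta)
```

A finite-difference comparison of this kind only confirms that the analytic gradients agree with the assumed objective; it says nothing about the closed-form updates in Eqs. (23) and (27).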

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Najar, F., Bouguila, N. On smoothing and scaling language model for sentiment based information retrieval. Adv Data Anal Classif 17, 725–744 (2023). https://doi.org/10.1007/s11634-022-00522-6

Received: 29 September 2021
Revised: 02 September 2022
Accepted: 22 September 2022
Published: 13 October 2022
Issue Date: September 2023
DOI: https://doi.org/10.1007/s11634-022-00522-6

Mathematics Subject Classification

68P20: Information storage and retrieval of data
