Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity

Azarbonyad, Hosein; Dehghani, Mostafa; Kenter, Tom; Marx, Maarten; Kamps, Jaap; de Rijke, Maarten

doi:10.1007/978-3-319-56608-5_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10193)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

A high degree of topical diversity is often considered to be an important characteristic of interesting text documents. A recent proposal for measuring topical diversity identifies three elements for assessing diversity: words, topics, and documents as collections of words. Topic models play a central role in this approach. Using standard topic models for measuring the diversity of documents is suboptimal due to generality and impurity. General topics only include common information from a background corpus and are assigned to most of the documents in the collection. Impure topics contain words that are not related to the topic; impurity lowers the interpretability of topic models, and impure topics are likely to get assigned to documents erroneously. We propose a hierarchical re-estimation approach for topic models to combat generality and impurity; the proposed approach operates at three levels: words, topics, and documents. Our re-estimation approach for measuring documents' topical diversity outperforms the state of the art on the PubMed dataset, which is commonly used for diversity experiments.
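To make the three elements concrete, the following is a minimal sketch of one way to score a document's topical diversity. It assumes Rao's diversity coefficient (Rao, 1982, in the reference list) with cosine distance between topic-word distributions; the abstract does not state the exact measure, and the names `rao_diversity`, `cosine_distance`, and the toy distributions are illustrative only.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rao_diversity(doc_topics, topic_words):
    """Rao's coefficient: sum over topic pairs of p_i * p_j * distance(i, j).

    doc_topics  -- non-negative topic weights for one document
    topic_words -- one word distribution per topic (same vocabulary order)
    """
    total = sum(doc_topics)
    p = [x / total for x in doc_topics]  # normalise to a distribution
    score = 0.0
    for i, p_i in enumerate(p):
        for j, p_j in enumerate(p):
            if i != j:  # a topic has zero distance to itself
                score += p_i * p_j * cosine_distance(topic_words[i], topic_words[j])
    return score


# Toy topic-word distributions (assumed values, not from the paper).
# The third topic is deliberately "general": uniform over the vocabulary.
topic_words = [
    [0.7, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
    [0.25, 0.25, 0.25, 0.25],
]

focused = rao_diversity([0.9, 0.05, 0.05], topic_words)
mixed = rao_diversity([0.4, 0.4, 0.2], topic_words)
print(focused, mixed)
```

In this toy setting the document spread over two dissimilar topics scores higher than the document dominated by a single topic, which is the behaviour a diversity measure of this kind is meant to capture.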

Notes

1. As the DR level of re-estimation directly employs the parsimonious language modeling techniques in [9], we omit it from our in-depth analysis.
2. We use a dump of June 2, 2015, containing 15.6 million articles.
3. Available at http://www.ai.mit.edu/people/~jrennie/20Newsgroups/.
4. Available at http://disi.unitn.it/moschitti/corpora.htm.
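Note 1 refers to parsimonious language modeling. As a rough illustration of that idea, the sketch below assumes the usual EM formulation: a document model is mixed with a background model using a fixed weight, and terms that the background explains well lose probability mass and are eventually pruned. The function name, the parameter values, and the toy counts are assumptions for illustration, not the authors' implementation.

```python
def parsimonious_lm(term_counts, background, lam=0.1, iterations=50, threshold=1e-4):
    """Re-estimate P(t | document) so that background-like terms are pushed out.

    term_counts -- dict term -> positive count in the document
    background  -- dict term -> P(t | background corpus)
    lam         -- assumed mixing weight of the document-specific model
    threshold   -- terms whose probability falls below this are pruned
    """
    total = sum(term_counts.values())
    model = {t: c / total for t, c in term_counts.items()}  # maximum-likelihood start
    for _ in range(iterations):
        # E-step: expected count of each term attributed to the specific model.
        expected = {}
        for t, p in model.items():
            specific = lam * p
            general = (1.0 - lam) * background.get(t, 0.0)
            expected[t] = term_counts[t] * specific / (specific + general)
        # M-step: normalise, prune low-probability terms, renormalise.
        norm = sum(expected.values())
        model = {t: e / norm for t, e in expected.items() if e / norm >= threshold}
        z = sum(model.values())
        model = {t: p / z for t, p in model.items()}
    return model


# Toy data (assumed): two common words and two content words.
counts = {"the": 10, "of": 8, "topic": 4, "diversity": 3}
background = {"the": 0.5, "of": 0.3, "topic": 0.001, "diversity": 0.0005}

model = parsimonious_lm(counts, background)
print(model)
```

With these toy numbers the content words end up holding nearly all of the probability mass, while the common words shrink on each iteration until they fall below the pruning threshold.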

References

1. U.S. National Library of Medicine: PubMed Central Open Access Initiative (2010)
2. Azarbonyad, H., Saan, F., Dehghani, M., Marx, M., Kamps, J.: Are topically diverse documents also interesting? In: Mothe, J., Savoy, J., Kamps, J., Pinel-Sauvagnat, K., Jones, G.J.F., SanJuan, E., Cappellato, L., Ferro, N. (eds.) CLEF 2015. LNCS, vol. 9283, pp. 215–221. Springer, Cham (2015). doi:10.1007/978-3-319-24027-5_19
3. Bache, K., Newman, D., Smyth, P.: Text-based measures of document diversity. In: KDD (2013)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(4–5), 993–1022 (2003)
5. Boyd-Graber, J., Mimno, D., Newman, D.: Care and feeding of topic models. In: Mixed Membership Models & Their Applic. CRC Press (2014)
6. Dehghani, M., Azarbonyad, H., Kamps, J., Marx, M.: Two-way parsimonious classification models for evolving hierarchies. In: Fuhr, N., Quaresma, P., Gonçalves, T., Larsen, B., Balog, K., Macdonald, C., Cappellato, L., Ferro, N. (eds.) CLEF 2016. LNCS, vol. 9822, pp. 69–82. Springer, Heidelberg (2016). doi:10.1007/978-3-319-44564-9_6
7. Dehghani, M., Azarbonyad, H., Kamps, J., Marx, M.: On horizontal and vertical separation in hierarchical text classification. In: ICTIR (2016)
8. Derzinski, M., Rohanimanesh, K.: An information theoretic approach to quantifying text interestingness. In: NIPS MLNLP Workshop (2014)
9. Hiemstra, D., Robertson, S., Zaragoza, H.: Parsimonious language models for information retrieval. In: SIGIR (2004)
10. Lacoste-Julien, S., Sha, F., Jordan, M.I.: DiscLDA: discriminative learning for dimensionality reduction and classification. In: NIPS (2009)
11. Lau, J.H., Newman, D., Baldwin, T.: Machine reading tea leaves: automatically evaluating topic coherence and topic model quality. In: EACL (2014)
12. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004)
13. Lin, T., Tian, W., Mei, Q., Cheng, H.: The dual-sparse topic model: mining focused topics and focused terms in short text. In: WWW (2014)
14. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
15. Mehrotra, R., Sanner, S., Buntine, W., Xie, L.: Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: SIGIR (2013)
16. Nguyen, D.Q., Billingsley, R., Du, L., Johnson, M.: Improving topic models with latent feature word representations. Trans. Assoc. Comput. Linguist. 3, 299–313 (2015)
17. Rao, C.: Diversity and dissimilarity coefficients: a unified approach. Theoret. Popul. Biol. 21(1), 24–43 (1982)
18. Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: WSDM (2015)
19. Soleimani, H., Miller, D.: Parsimonious topic models with salient word discovery. IEEE Trans. Knowl. Data Eng. 27(3), 824–837 (2015)
20. Solow, A., Polasky, S., Broadus, J.: On the measurement of biological diversity. J. Environ. Econ. Manag. 24(1), 60–68 (1993)
21. Wallach, H.M., Mimno, D.M., McCallum, A.: Rethinking LDA: why priors matter. In: NIPS (2009)
22. Wang, C., Blei, D.M.: Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In: NIPS (2009)
23. Williamson, S., Wang, C., Heller, K.A., Blei, D.M.: The IBP compound Dirichlet process and its application to focused topic modeling. In: ICML (2010)
24. Xie, P., Xing, E.P.: Integrating document clustering and topic modeling. In: UAI (2013)
25. Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: WWW (2013)
26. Zhai, C., Lafferty, J.: Model-based feedback in the language modeling approach to information retrieval. In: CIKM (2001)

Acknowledgments

This research was supported by Ahold Delhaize, Amsterdam Data Science, Blendle, the Bloomberg Research Grant program, the Dutch national program COMMIT, Elsevier, the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreements nr 283465 (ENVRI) and 312827 (VOX-Pol), the Microsoft Research Ph.D. program, the Netherlands eScience Center under project number 027.012.105, the Netherlands Institute for Sound and Vision, the Netherlands Organisation for Scientific Research (NWO) under project nrs 314.99.108, 600.006.014, HOR-11-10, CI-14-25, 652.002.001, 612.001.551, 652.001.003, 314-98-071, and Yandex. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

Author information

Authors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Hosein Azarbonyad, Mostafa Dehghani, Tom Kenter, Maarten Marx, Jaap Kamps & Maarten de Rijke

Corresponding author

Correspondence to Hosein Azarbonyad.

Editor information

Editors and Affiliations

University of Glasgow, Glasgow, United Kingdom
Joemon M Jose
TU Delft - EWI/ST/WIS, Delft, The Netherlands
Claudia Hauff
Middle East Technical University, Ankara, Turkey
Ismail Sengor Altıngovde
Open University, Milton Keynes, United Kingdom
Dawei Song
Signal Media, London, United Kingdom
Dyaa Albakour
Toronto, Canada
Stuart Watt
JohnTait.net Ltd. and BCS IRSG, Sunderland, United Kingdom
John Tait

Cite this paper

Azarbonyad, H., Dehghani, M., Kenter, T., Marx, M., Kamps, J., de Rijke, M. (2017). Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity. In: Jose, J., et al. Advances in Information Retrieval. ECIR 2017. Lecture Notes in Computer Science, vol 10193. Springer, Cham. https://doi.org/10.1007/978-3-319-56608-5_6

DOI: https://doi.org/10.1007/978-3-319-56608-5_6
Published: 08 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56607-8
Online ISBN: 978-3-319-56608-5
eBook Packages: Computer Science, Computer Science (R0)
