DOI: 10.1145/2911451.2914720

Topic Quality Metrics Based on Distributed Word Representations

Published: 07 July 2016

Abstract

Automated evaluation of topic quality remains an important unsolved problem in topic modeling and represents a major obstacle for development and evaluation of new topic models. Previous attempts at the problem have been formulated as variations on the coherence and/or mutual information of top words in a topic. In this work, we propose several new metrics for evaluating topic quality with the help of distributed word representations; our experiments suggest that the new metrics are a better match for human judgement, which is the gold standard in this case, than previously developed approaches.
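The specific metrics are defined in the full paper; as a minimal sketch of the general idea, a topic can be scored by the mean pairwise cosine similarity of the embedding vectors of its top words, so that topics whose top words cluster in the embedding space score higher. The function name and the toy 2-D vectors below are invented for illustration and are not the paper's actual metrics or data.

```python
import numpy as np

def embedding_coherence(top_words, vectors):
    """Score a topic as the mean pairwise cosine similarity of the
    embedding vectors of its top words (higher = more coherent)."""
    vecs = [vectors[w] for w in top_words if w in vectors]
    sims = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            a, b = vecs[i], vecs[j]
            sims.append(float(np.dot(a, b) /
                              (np.linalg.norm(a) * np.linalg.norm(b))))
    return sum(sims) / len(sims) if sims else 0.0

# Toy 2-D "embeddings" (invented for illustration): two related words
# and one unrelated word.
toy = {
    "cat": np.array([1.0, 0.1]),
    "dog": np.array([0.9, 0.2]),
    "tax": np.array([-0.1, 1.0]),
}

coherent = embedding_coherence(["cat", "dog"], toy)      # ~0.99
mixed = embedding_coherence(["cat", "dog", "tax"], toy)  # ~0.37
```

In practice the vectors would come from pretrained embeddings such as word2vec or GloVe rather than from a hand-built dictionary, and the score would be compared against human judgements of topic quality.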



Information
    Published In

    SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
    July 2016
    1296 pages
    ISBN:9781450340694
    DOI:10.1145/2911451


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. text mining
    2. topic modeling
    3. topic quality

    Qualifiers

    • Short-paper

    Funding Sources

    • Russian Science Foundation

    Conference

    SIGIR '16

    Acceptance Rates

    SIGIR '16 paper acceptance rate: 62 of 341 submissions (18%)
    Overall acceptance rate: 792 of 3,983 submissions (20%)

    Cited By

    • (2023) Evaluating the Limits of the Current Evaluation Metrics for Topic Modeling. Proceedings of the 29th Brazilian Symposium on Multimedia and the Web, pages 119-127. DOI: 10.1145/3617023.3617040
    • (2023) Assessment of the Quality of Topic Models for Information Retrieval Applications. Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, pages 265-274. DOI: 10.1145/3578337.3605118
    • (2023) Topic Models with Sentiment Priors Based on Distributed Representations. Journal of Mathematical Sciences, 273(4):639-652. DOI: 10.1007/s10958-023-06525-8
    • (2023) Benchmarking Multilabel Topic Classification in the Kyrgyz Language. Analysis of Images, Social Networks and Texts, pages 21-35. DOI: 10.1007/978-3-031-54534-4_2
    • (2022) Towards Better Evaluation of Topic Model Quality. 2022 32nd Conference of Open Innovations Association (FRUCT), pages 128-134. DOI: 10.23919/FRUCT56874.2022.9953874
    • (2022) Topic modeling revisited: New evidence on algorithm performance and quality metrics. PLOS ONE, 17(4):e0266325. DOI: 10.1371/journal.pone.0266325
    • (2022) Evaluating Topic Modeling Pre-processing Pipelines for Portuguese Texts. Proceedings of the Brazilian Symposium on Multimedia and the Web, pages 191-201. DOI: 10.1145/3539637.3557052
    • (2022) Semantic Academic Profiler (SAP): a framework for researcher assessment based on semantic topic modeling. Scientometrics, 127(8):5005-5026. DOI: 10.1007/s11192-022-04449-9
    • (2021) BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation. ACM Transactions on Intelligent Systems and Technology, 12(5):1-29. DOI: 10.1145/3468268
    • (2021) A Topic Coverage Approach to Evaluation of Topic Models. IEEE Access, 9:123280-123312. DOI: 10.1109/ACCESS.2021.3109425
