DOI: 10.1145/2911451.2914720

Topic Quality Metrics Based on Distributed Word Representations

Published: 07 July 2016

Abstract

Automated evaluation of topic quality remains an important unsolved problem in topic modeling and represents a major obstacle for development and evaluation of new topic models. Previous attempts at the problem have been formulated as variations on the coherence and/or mutual information of top words in a topic. In this work, we propose several new metrics for evaluating topic quality with the help of distributed word representations; our experiments suggest that the new metrics are a better match for human judgement, which is the gold standard in this case, than previously developed approaches.
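The specific metrics are defined in the full paper; as a minimal sketch of the general idea, a topic can be scored by the mean pairwise cosine similarity of the embedding vectors of its top words, so that topics whose top words cluster in the embedding space score higher. The function name and the toy 2-D vectors below are invented for illustration and are not the paper's actual metrics or data.

```python
import numpy as np

def embedding_coherence(top_words, vectors):
    """Score a topic as the mean pairwise cosine similarity of the
    embedding vectors of its top words (higher = more coherent)."""
    vecs = [vectors[w] for w in top_words if w in vectors]
    sims = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            a, b = vecs[i], vecs[j]
            sims.append(float(np.dot(a, b) /
                              (np.linalg.norm(a) * np.linalg.norm(b))))
    return sum(sims) / len(sims) if sims else 0.0

# Toy 2-D "embeddings" (invented for illustration): two related words
# and one unrelated word.
toy = {
    "cat": np.array([1.0, 0.1]),
    "dog": np.array([0.9, 0.2]),
    "tax": np.array([-0.1, 1.0]),
}

coherent = embedding_coherence(["cat", "dog"], toy)      # ~0.99
mixed = embedding_coherence(["cat", "dog", "tax"], toy)  # ~0.37
```

In practice the vectors would come from pretrained embeddings such as word2vec or GloVe rather than from a hand-built dictionary, and the score would be compared against human judgements of topic quality.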



Information
    Published In

    SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
    July 2016
    1296 pages
    ISBN:9781450340694
    DOI:10.1145/2911451


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. text mining
    2. topic modeling
    3. topic quality

    Qualifiers

    • Short-paper

    Funding Sources

    • Russian Science Foundation

    Conference

    SIGIR '16

    Acceptance Rates

    SIGIR '16 paper acceptance rate: 62 of 341 submissions (18%)
    Overall acceptance rate: 792 of 3,983 submissions (20%)

    Cited By

    • (2023) Evaluating the Limits of the Current Evaluation Metrics for Topic Modeling. Proceedings of the 29th Brazilian Symposium on Multimedia and the Web, pages 119-127. DOI: 10.1145/3617023.3617040
    • (2023) Assessment of the Quality of Topic Models for Information Retrieval Applications. Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, pages 265-274. DOI: 10.1145/3578337.3605118
    • (2023) Topic Models with Sentiment Priors Based on Distributed Representations. Journal of Mathematical Sciences, 273(4):639-652. DOI: 10.1007/s10958-023-06525-8
    • (2023) Benchmarking Multilabel Topic Classification in the Kyrgyz Language. Analysis of Images, Social Networks and Texts, pages 21-35. DOI: 10.1007/978-3-031-54534-4_2
    • (2022) Towards Better Evaluation of Topic Model Quality. 2022 32nd Conference of Open Innovations Association (FRUCT), pages 128-134. DOI: 10.23919/FRUCT56874.2022.9953874
    • (2022) Topic modeling revisited: New evidence on algorithm performance and quality metrics. PLOS ONE, 17(4):e0266325. DOI: 10.1371/journal.pone.0266325
    • (2022) Evaluating Topic Modeling Pre-processing Pipelines for Portuguese Texts. Proceedings of the Brazilian Symposium on Multimedia and the Web, pages 191-201. DOI: 10.1145/3539637.3557052
    • (2022) Semantic Academic Profiler (SAP): a framework for researcher assessment based on semantic topic modeling. Scientometrics, 127(8):5005-5026. DOI: 10.1007/s11192-022-04449-9
    • (2021) BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation. ACM Transactions on Intelligent Systems and Technology, 12(5):1-29. DOI: 10.1145/3468268
    • (2021) A Topic Coverage Approach to Evaluation of Topic Models. IEEE Access, 9:123280-123312. DOI: 10.1109/ACCESS.2021.3109425
