DOI: 10.1145/3485447.3512007

A Guided Topic-Noise Model for Short Texts

Published: 25 April 2022

    Researchers using social media data want to understand the discussions occurring in and about their respective fields. These domain experts often turn to topic models to see the entire landscape of the conversation, but unsupervised topic models often produce topic sets that miss topics the experts expect or want to see. To solve this problem, we propose the Guided Topic-Noise Model (GTM), a semi-supervised topic model designed with large, domain-specific social media data sets in mind. The input to GTM is a set of topics of interest to the user, each with a small number of words or phrases that belong to it. These seed topics guide the topic generation process and can be augmented interactively: the user expands the seed word list as the model surfaces new relevant words for different topics. GTM uses a novel initialization and a new sampling algorithm called Generalized Polya Urn (GPU) seed word sampling to produce a topic set that includes expanded seed topics as well as new unsupervised topics. We demonstrate the robustness of GTM on open-ended responses from a public opinion survey and four domain-specific Twitter data sets.
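
The abstract does not give GTM's sampling equations, so the following is only a minimal sketch of the general idea behind a generalized Polya urn applied to seed words, not the authors' algorithm. It assumes that when a seed word is assigned to its topic, the other seed words of that topic also receive a fractional count. The seed topics, the `gpu_increment` helper, and the `promotion` weight of 0.3 are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical seed topics: topic name -> a few seed words (illustrative only).
SEED_TOPICS = {
    "vaccines": ["vaccine", "dose", "booster"],
    "schools": ["school", "teacher", "homeschooling"],
}


def make_counts():
    """Topic -> word -> (possibly fractional) count, defaulting to 0.0."""
    return defaultdict(lambda: defaultdict(float))


def gpu_increment(counts, topic, word, seed_topics, promotion=0.3):
    """Record one assignment of `word` to `topic`, generalized Polya urn style.

    In a simple Polya urn only `word` gains weight. In the variant sketched
    here (an assumption, not the paper's definition), if `word` is a seed
    word of `topic`, every other seed word of that topic also gains
    `promotion`, so seed words reinforce one another.
    """
    counts[topic][word] += 1.0
    seeds = seed_topics.get(topic, [])
    if word in seeds:
        for other in seeds:
            if other != word:
                counts[topic][other] += promotion
    return counts


counts = make_counts()
gpu_increment(counts, "vaccines", "dose", SEED_TOPICS)      # seed word: promotes its siblings
gpu_increment(counts, "vaccines", "pharmacy", SEED_TOPICS)  # non-seed word: plain increment
```

In a full sampler these counts would feed the topic-word probabilities used when resampling each word's topic; that step is omitted here because the abstract does not specify it.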


          Published In

          WWW '22: Proceedings of the ACM Web Conference 2022
          April 2022
          3764 pages
          ISBN:9781450390965
          DOI:10.1145/3485447

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Author Tags

          1. guided topic model
          2. seed topics
          3. semi-supervised topic model
          4. social media
          5. topic modeling
          6. topic-noise model

          Conference

          WWW '22: The ACM Web Conference 2022
          April 25–29, 2022
          Virtual Event, Lyon, France

          Acceptance Rates

          Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
