DOI: 10.1145/3269206.3271797
research-article

Semantically-Enhanced Topic Modeling

Published: 17 October 2018

Abstract

In this paper, we advance the state of the art in topic modeling by designing and developing a novel (semi-formal) general topic modeling framework. The contributions of our solution include: (i) new semantically-enhanced data representations for topic modeling based on pooling, and (ii) a novel topic extraction strategy, ASToC, that solves the difficulty of representing topics in our semantically-enhanced information space. In our extensive experimental evaluation, covering 12 datasets and 12 state-of-the-art baselines and totaling 108 tests, we outperform the baselines (with a few ties) in almost 100 cases, with gains of more than 50% against the best baselines (and up to 80% against some runners-up). We provide qualitative and quantitative statistical analyses of why our solutions work so well. Finally, we show that our method is able to improve document representation in automatic text classification.



Published In

CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management
October 2018
2362 pages
ISBN: 9781450360142
DOI: 10.1145/3269206

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. bag of words
  2. topic modeling
  3. word embeddings

Qualifiers

  • Research-article

Funding Sources

  • Fapemig
  • Astrein
  • InWeb
  • CNPq
  • CAPES
  • Finep
  • Mundiale
  • MASWeb

Conference

CIKM '18
Acceptance Rates

CIKM '18 Paper Acceptance Rate 147 of 826 submissions, 18%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


Cited By

  • (2022) Correlating Historical Events and Cinematic Releases Using Web Information. Proceedings of the Brazilian Symposium on Multimedia and the Web, pp. 178-181. DOI: 10.1145/3539637.3557059. Online publication date: 7-Nov-2022
  • (2022) Evaluating Topic Modeling Pre-processing Pipelines for Portuguese Texts. Proceedings of the Brazilian Symposium on Multimedia and the Web, pp. 191-201. DOI: 10.1145/3539637.3557052. Online publication date: 7-Nov-2022
  • (2022) Accurate Context Extraction from Unstructured Text Based on Deep Learning. 2022 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), pp. 309-314. DOI: 10.1109/WI-IAT55865.2022.00052. Online publication date: Nov-2022
  • (2022) Latent Semantic Indexing (LSI) and Hierarchical Dirichlet Process (HDP) Models on News Data. 2022 5th International Conference of Computer and Informatics Engineering (IC2IE), pp. 314-319. DOI: 10.1109/IC2IE56416.2022.9970067. Online publication date: 13-Sep-2022
  • (2021) On the cost-effectiveness of neural and non-neural approaches and representations for text classification. Information Processing and Management: an International Journal, 58:3. DOI: 10.1016/j.ipm.2020.102481. Online publication date: 1-May-2021
  • (2021) An Article-Oriented Framework for Automatic Semantic Analysis of COVID-19 Researches. Computational Science and Its Applications – ICCSA 2021, pp. 172-187. DOI: 10.1007/978-3-030-86970-0_13. Online publication date: 11-Sep-2021
  • (2020) Generating Representative Test Sequences from Real Workload for Minimizing DRAM Verification Overhead. ACM Transactions on Design Automation of Electronic Systems, 25:4, pp. 1-23. DOI: 10.1145/3391891. Online publication date: 27-May-2020
  • (2020) An Unsupervised Approach for Precise Context Identification from Unstructured Text Documents. 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), pp. 821-826. DOI: 10.1109/ICTAI50040.2020.00130. Online publication date: Nov-2020
  • (2019) Novel semantic tagging detection algorithms based non-negative matrix factorization. SN Applied Sciences, 2:1. DOI: 10.1007/s42452-019-1836-y. Online publication date: 9-Dec-2019
