Mining topics in documents: standing on the shoulders of big data

Authors:

Bing Liu

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 1116 - 1125

https://doi.org/10.1145/2623330.2623622

Published: 24 August 2014

Abstract

Topic modeling has been widely used to mine topics from documents. However, a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics for generating coherent topics. In practice, many document collections are not that large. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results. In recent years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model toward better topics. Our research takes a radically different approach. We propose to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge from the past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web. The proposed algorithm mines two forms of knowledge: must-link (meaning that two words should be in the same topic) and cannot-link (meaning that two words should not be in the same topic). It also deals with two problems of the automatically mined knowledge, i.e., wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.
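To make the two forms of knowledge concrete, the following is a minimal sketch of how must-links and cannot-links might be collected from topics learned in earlier domains. It is not the paper's algorithm: the input format (`prior_topics` as lists of top words), the function name `mine_links`, and the three thresholds are assumptions made for illustration. The sketch simply counts how often two words share a prior topic, and it does not address wrong knowledge or knowledge transitivity.

```python
from collections import Counter
from itertools import combinations


def mine_links(prior_topics, min_together=2, max_together=0, min_apart=2):
    """Collect candidate must-links and cannot-links from prior topics.

    prior_topics: list of topics, each a list of top words, pooled from
    models fitted earlier on other domains (hypothetical input format).
    All thresholds are illustrative assumptions, not values from the paper.
    """
    together = Counter()  # number of prior topics that contain both words
    appears = Counter()   # number of prior topics that contain the word
    for topic in prior_topics:
        words = sorted(set(topic))
        appears.update(words)
        together.update(combinations(words, 2))

    # Must-link candidates: word pairs that share a topic often enough.
    must = {pair for pair, n in together.items() if n >= min_together}

    # Cannot-link candidates: both words are frequent across prior topics,
    # yet they (almost) never share one.
    frequent = sorted(w for w, n in appears.items() if n >= min_apart)
    cannot = {
        pair
        for pair in combinations(frequent, 2)
        if together[pair] <= max_together
    }
    return must, cannot


# Toy prior topics, as if taken from three earlier product domains.
prior_topics = [
    ["price", "cheap", "expensive", "cost"],
    ["price", "cost", "money", "cheap"],
    ["battery", "life", "charge", "hour"],
    ["battery", "charge", "power", "life"],
]
must_links, cannot_links = mine_links(prior_topics)
```

On this toy input, pairs such as ("cheap", "price") become must-link candidates and pairs such as ("battery", "price") become cannot-link candidates. Pairs are stored as alphabetically ordered tuples so that each pair has a single representation.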

Supplementary Material

MP4 File (p1116-sidebyside.mp4), 280.68 MB

Index Terms

Mining topics in documents: standing on the shoulders of big data
1. Information systems
  1. Information retrieval
    1. Document representation

Published In

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2014

2028 pages

ISBN:9781450329569

DOI:10.1145/2623330

General Chairs: Sofus Macskassy (Facebook), Claudia Perlich (Dstillery)

Program Chairs: Jure Leskovec (Stanford University), Wei Wang (UCLA), Rayid Ghani (University of Chicago)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Division of Information and Intelligent Systems

Conference

KDD '14

KDD '14: The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 24 - 27, 2014

New York, New York, USA

Acceptance Rates

KDD '14 Paper Acceptance Rate 151 of 1,036 submissions, 15%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
