Knowledge discovery through directed probabilistic topic models: a survey

Daud, Ali; Li, Juanzi; Zhou, Lizhu; Muhammad, Faqir

doi:10.1007/s11704-009-0062-y

Knowledge discovery through directed probabilistic topic models: a survey

Review Article
Published: 27 January 2010

Volume 4, pages 280–301, (2010)
Cite this article

Frontiers of Computer Science in China Aims and scope Submit manuscript

Ali Daud¹,
Juanzi Li¹,
Lizhu Zhou¹ &
…
Faqir Muhammad²

810 Accesses
Explore all metrics

Abstract

Graphical models have become the basic framework for topic based probabilistic modeling. Especially models with latent variables have proved to be effective in capturing hidden structures in the data. In this paper, we survey an important subclass Directed Probabilistic Topic Models (DPTMs) with soft clustering abilities and their applications for knowledge discovery in text corpora. From an unsupervised learning perspective, “topics are semantically related probabilistic clusters of words in text corpora; and the process for finding these topics is called topic modeling”. In topic modeling, a document consists of different hidden topics and the topic probabilities provide an explicit representation of a document to smooth data from the semantic level. It has been an active area of research during the last decade. Many models have been proposed for handling the problems of modeling text corpora with different characteristics, for applications such as document classification, hidden association finding, expert finding, community discovery and temporal trend analysis. We give basic concepts, advantages and disadvantages in a chronological order, existing models classification into different categories, their parameter estimation and inference making algorithms with models performance evaluation measures. We also discuss their applications, open challenges and future directions in this dynamic area of research.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Exploratory Topic Modeling with Distributional Semantics

A Novel Document Generation Process for Topic Detection Based on Hierarchical Latent Tree Models

Topic Representation using Semantic-Based Patterns

References

Popescul A, Flake G W, Lawrence S, Ungar L H, Giles C L. Clustering and identifying temporal trends in document databases. IEEE ADL, 2000, 173–182
McCallum A, Nigam K, Ungar L H. Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the 6th ACM SIGKDD, 2000, 169–178
Hofmann T. Probabilistic latent semantic analysis. In: Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI), Stockholm, Sweden, July 30–August 1, 1999
Steyvers M, Griffiths T. Probabilistic topic models. In: Landauer T, Mcnamara D, Dennis S, Kintsch W (Eds), Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2007
Heinrich G. Parameter Estimation for Text Analysis. Technical report, Version 2, February 2008
Smolensky P. Information processing in dynamical systems: foundations of harmony theory. In: Rumehart D E, McClelland J L (Eds), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. McGraw-Hill, New York, 1986
Google Scholar
Welling M, Rosen-Zvi M, Hinton G. Exponential family harmoniums with an application to information retrieval. In: Advances in Neural Information Processing Systems (NIPS). Cambridge, MA, MIT Press, 2004
Google Scholar
Blei D M, Ng A Y, Jordan M I. Latent dirichlet allocation. Journal of Machine Learning Research, 2003, 3: 993–1022
Article MATH Google Scholar
Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P. The author-topic model for authors and documents. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI), Banff, Canada, July 7–11, 2004
Griffiths T L, Steyvers M. Finding scientific topics. In: Proceedings of the National Academy of Sciences. USA, 2004, 101: 5228–5235
Article Google Scholar
Teh Y W, Jordan M I, Beal M J, Blei D M. Hierarhical Dirichlet Processes. Technical Report 653, Department of Statistics, UC Berkeley, 2004
Google Scholar
Blei D M, McAuliffe J. Supervised topic models. In: Advances in Neural Information Processing Systems (NIPS) 21. Cambridge, MA, MIT Press, 2007, 121–128
Google Scholar
Buntine W L. Operations for learning with graphical models. Journal of Artificial Intelligence Research, 1994, 2: 159–225
Google Scholar
Steyvers M, Smyth P, Rosen-Zvi M, Griffiths T. Probabilistic author-topic models for information discovery. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington, August 22–25, 2004
Wang X, Li W, McCallum A. A continuous-time model of topic cooccurrence trends. In: AAAI Workshop on Event Detection. Boston, Massachusetts, USA, July 16–20, 2006
Nigam K, McCallum A K, Thrun S, Mitchell T. Text classification from labeled and unlabeled documents using EM. Journal of Machine Learning, 2000, 39(2–3): 103–134
Article MATH Google Scholar
Griffiths T L, Steyvers M. A probabilistic approach to semantic representation. In: Proceedings of the 24th Conference of the Cognitive Science Society. USA, 2002
Griffiths T L, Steyvers M. Prediction and semantic association. In: Advances in Neural Information Processing Systems (NIPS) 15. Cambridge, MA, MIT Press, 2003
Google Scholar
Wray L, Buntine, Jakulin A. Applying discrete PCA in data analysis. In: Proceedings of 20th Conference on Uncertainty in Artificial Intelligence (UAI), Banff, Canada, July 7–11, 2004, 59–66
Minka T, Lafferty J. Expectation-propagation for the generative aspect model. In: Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI), Alberta, Canada, August 1–4, 2002, 352–359
Hofmann T, Puzicha J, Jordan M I. Learning from dyadic data. In: Advances in Neural Information Processing Systems (NIPS) 11. Cambridge, MA, MIT Press, 1999
Google Scholar
Cohn D, Hofmann T. The missing link- a probabilistic model of document content and hypertext connectivity. In: Advances in Neural Information Processing Systems (NIPS) 13. Cambridge, MA, MIT Press, 2001
Google Scholar
Blei D M, Moreno P J. Topic segmentation with an aspect hidden Markov model. In: Proceedings of 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans. LA USA, September 9–13, 2001, 343–348
Erosheva E, Fienberg S, Lafferty J. Mixed-membership models of scientific publications. In: Proceedings of the National Academy of Sciences, USA, 2004, 101: 5220–5227
Article Google Scholar
Nallapati R, Cohen W. Link-plsa-lda: A new unsupervised model for topics and influence of blogs. In: Proceedings of International Conference for Weblogs and Social Media, Seattle, Washington, USA, March 30–April 2, 2008
McCallum A, Corrada-Emmanuel A, Wang X. The Authorrecipient-topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email. Technical Report UM-CS-2004-096, 2004
Blei D M, Lafferty J. Correlated topic models. In: Advances in Neural Information Processing Systems (NIPS) 18. Cambridge, MA, MIT Press, 2006, 147–154
Google Scholar
Li W, McCallum A. Pachinko allocation: Dag-structured mixture models of topic correlations. In: Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, Pennsylvania, June 25–29, 2006, 577–584
Newman D, Chemudugunta C, Smyth P, Steyvers M. Statistical entity-topic models. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, USA, August 20–23, 2006, 680–686
Zhang H, Giles C L, Foley H C, Yen J. Probabilistic community discovery using hierarchical latent Gaussian mixture model. In: Proceedings of 22nd AAAI Conference on Artificial Intelligence, Vancouver, British Columbia, Canada, July 22–26, 2007, 663–668
Dietz L, Bickel S, Scheffer T. Unsupervised prediction of citation influences. In: Proceedings of 24th International Conference on Machine Learning (ICML), Corvallis, Oregon, USA, June 20–24, 2007
Gruber A, Rosen-Zvi M, Weiss Y. Latent topic models for hypertext. In: Proceedings of Uncertainty in Artificial Intelligence (UAI), Helsinki, Finland, July 9–12, 2008
Tang J, Zhang J, Yao L, Li J, Zhang L, Su Z. ArnetMiner: extraction and mining of academic social networks. In: Proceedings of ACM SIGKDD, 2008
Daud A, Li J, Zhu L, Muhammad F. A generalized topic modeling approach for maven search. In: Proceedings of International Asia-Pacific Web Conference and Web-Age Information Management (APWEB-WAIM), Suzhou, China, 2009
Daud A, Li J, Zhu L, Muhammad F. Conference mining via generalized topic modeling. In: Proceedings of European Conference on Machine Learning and Principles and Practices of Knowledge Discovery in Databases (ECML PKDD), Bled, Slovenia, 2009
Griffiths T L, Steyvers M, Blei D M, Tenenbaum J B. Integrating topics and syntax. In: Advances in Neural Information Processing Systems (NIPS) 17. Cambridge, MA, MIT Press, 2005, 537–544
Google Scholar
Gruber A, Rosen-Zvi M, Weiss Y. Hidden topic Markov models. In: Proceedings of Artificial Intelligence and Statistics (AISTATS), San Juan, Puerto Rico, USA, March 21–24, 2007
Wallach J M. Topic modeling: Beyond bag-of-words. In: Proceedings of 23rd International Conference on Machine Learning (ICML), Pittsburgh, Pennsylvania, USA, June 25–29, 2006
Mei Q, Zhai C X. A mixture model for contextual text mining. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, USA, August 20–23, 2006, 649–655
Deerwester S, Dumais S T, Furnas G W, Landauer T K, Harshman R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, 41(6): 391–407
Article Google Scholar
Wang X, McCallum A, Wei X. Topical N-grams: phrase and topic discovery, with an application to information retrieval. In: Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), Omaha NE, USA, October 28–31, 2007
Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition. In: Proceedings of the IEEE, 1989, 77(2): 257–286
Article Google Scholar
Blei D M, Lafferty J. Dynamic topic models. In: Proceedings of 23rd International Conference on Machine Learning (ICML), Pittsburgh, Pennsylvania, USA, June 25–29, 2006
Nallapati R, Cohen W, Ditmore S, Lafferty J, Ung K. Multiscale topic tomography. In: Proceedings of 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 12–15, 2007
Wang C, Blei M D, Heckerman D. Continuous time dynamic topic models. In: Proceedings of Uncertainty in Artificial Intelligence (UAI), Helsinki, Finland, July 9–12, 2008
Uhlenbeck G E, Ornstein L S. On the theory of Brownian motion. Physics Reviews, 1930, 36: 823–841
Article Google Scholar
Wang X, McCallum A. Topics over time: A non-markov continuous-time model of topical trends. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, USA, August 20–23, 2006
Daud A, Li J, Zhu L, Muhammad F. Exploiting temporal authors interests via temporal-author-topic modeling. In: Proceedings of 5th International Conference on Advance Data Mining and Applications (ADMA), Beijing, China, 2009
Blei D M, Jordan M. Modeling annotated data. In: Proceedings of 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, July 28–August 1, 2003, 127–134
Flaherty P, Giaever G, Kumm J, Jordan M, Arkin A. A latent variable model for chemogenomic profiling. Bioinformatics, 2005, 21(15): 3286–3293
Article Google Scholar
Murphy K. An Introduction to Graphical Models. Technical report, University of California, Berkeley, May 2001
Google Scholar
Bilmes J A. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov modals. Berkeley, ICSI TR-97-021, 1997
Jordan MI, Ghahramani Z, Jaakkola T S, Saul L K. An introduction to variational methods for graphical models. In: Jordan M (Eds), Learning in Graphical Models. MIT Press, 1998
Buntine W. Variational Extensions to EM and Multinomial PCA. In: Elomaa T et al. (Eds.): ECML, LNAI 2430, Springer-Verlag, Berlin, 2002, 23–34
Google Scholar
Gilks W R, Richardson S, Spiegelhalter D J. Markov Chain Monte Carlo in Practice. London: Chapman & Hall, 1996
MATH Google Scholar
Andrieu C, Freitas N D, Doucet A, Jordan M. An introduction to MCMC for machine learning. Journal of Machine Learning, 2003, 50: 5–43
Article MATH Google Scholar
Erosheva E A. Grade of membership and latent structure models with applications to disability survey data. Unpublished doctoral dissertation, Department of Statistics, Carnegie Mellon University, 2002
Teh Y W, Newman D, Wellingm M. A collapsed variational Bayesian inference algorithm for latent dirichlet allocation. In: Advances in Neural Information Processing Systems (NIPS). Cambridge, MA, MIT Press, 2006
Google Scholar
Azzopardi L, Girolami M, Risjbergen K V. Investigating the relationship between language model perplexity and IR precision-recall measures. In: Proceedings of the 26th ACM SIGIR, Toronto, Canada, 2003
Zhang J, Tang J, Liu L, Li J. A mixture model for expert finding. In: Proceedings of the PAKDD, Washio T et al. (Eds). LNAI, 2008, 5012: 466–478
Chang Y L, Chien J T. Latent dirichlet learning for document summarization. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2009
Arora R, Ravindran B. Latent dirichlet allocation based multidocument summarization. In: Proceedings of the 2nd Workshop on Analytics for Noisy Unstructured Rext Data, 2008
Bíró I, Szabó J, Benczúr A A. Latent dirichlet allocation in web spam filtering. In: Proceedings of the Adversarial Information Retrieval on the Web (AIRWeb’08), 2008
Elango P K, Jayaraman K. Clustering images using the latent dirichlet allocation model, 2005
Wang Y, Mori G. Human action recognition by semi-latent topic models. IEEE Transactions on Pattern Analysis and Machine Intelligence Special Issue on Probabilistic Graphical Models in Computer Vision (T-PAMI), 2009
Wang Y, Sabzmeydani P, Mori G. Semi-latent dirichlet allocation: A hierarchical model for fuman action recognition. In: 2nd Workshop on Human Motion Understanding, Modeling, Capture and Animation (ICCV), 2007
Rath T M, Lavrenko V, Manmatha R. A Statistical Approach to Retrieving Historical Manuscript Images Without Recognition. Technical Report, 2003

Download references

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Ali Daud, Juanzi Li & Lizhu Zhou
Department of Mathematics and Statistics, Allama Iqbal Open University, Sector H-8, Islamabad, 44000, Pakistan
Faqir Muhammad

Authors

Ali Daud
View author publications
You can also search for this author in PubMed Google Scholar
Juanzi Li
View author publications
You can also search for this author in PubMed Google Scholar
Lizhu Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Faqir Muhammad
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ali Daud.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Daud, A., Li, J., Zhou, L. et al. Knowledge discovery through directed probabilistic topic models: a survey. Front. Comput. Sci. China 4, 280–301 (2010). https://doi.org/10.1007/s11704-009-0062-y

Download citation

Received: 30 May 2009
Accepted: 05 September 2009
Published: 27 January 2010
Issue Date: June 2010
DOI: https://doi.org/10.1007/s11704-009-0062-y

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Knowledge discovery through directed probabilistic topic models: a survey

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Exploratory Topic Modeling with Distributional Semantics

A Novel Document Generation Process for Topic Detection Based on Hierarchical Latent Tree Models

Topic Representation using Semantic-Based Patterns

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Knowledge discovery through directed probabilistic topic models: a survey

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Exploratory Topic Modeling with Distributional Semantics

A Novel Document Generation Process for Topic Detection Based on Hierarchical Latent Tree Models

Topic Representation using Semantic-Based Patterns

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation