A Hierarchical Model for Clustering and Categorising Documents

Gaussier, E.; Goutte, C.; Popat, K.; Chen, F.

doi:10.1007/3-540-45886-7_16

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2291)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

We propose a new hierarchical generative model for textual data, where words may be generated by topic specific distributions at any level in the hierarchy. This model is naturally well-suited to clustering documents in preset or automatically generated hierarchies, as well as categorising new documents in an existing hierarchy. Training algorithms are derived for both cases, and illustrated on real data by clustering news stories and categorising newsgroup messages. Finally, the generative model may be used to derive a Fisher kernel expressing similarity between documents.
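The central idea in the abstract, that each word of a document may be emitted by a topic-specific distribution at any level of the hierarchy, can be illustrated with a small sketch. This is not the authors' parametrisation or training algorithm: the toy hierarchy, the vocabulary, all probability values, and the names `PARENT`, `WORD_DIST`, `LEVEL_DIST`, `generate_document` and `word_probability` are assumptions made purely for illustration. The sketch assumes that a document belongs to a leaf class and that each word is drawn from one node on the path from that class up to the root.

```python
import random

# Hypothetical toy hierarchy: child -> parent (the root has parent None).
PARENT = {"root": None, "sports": "root", "politics": "root"}

# Hypothetical per-node word distributions P(word | node).
WORD_DIST = {
    "root": {"the": 0.6, "of": 0.4},
    "sports": {"goal": 0.7, "match": 0.3},
    "politics": {"vote": 0.5, "party": 0.5},
}

# Hypothetical P(node | class): how often a word in a document of the given
# class is emitted by each node on the path from that class to the root.
LEVEL_DIST = {
    "sports": {"sports": 0.5, "root": 0.5},
    "politics": {"politics": 0.5, "root": 0.5},
}


def path_to_root(node):
    """Return the list of nodes from `node` up to and including the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def draw(dist, rng):
    """Sample one key of `dist` with probability proportional to its value."""
    keys = list(dist)
    weights = [dist[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def generate_document(leaf, length, rng):
    """Generate `length` words for a document assigned to class `leaf`."""
    words = []
    for _ in range(length):
        node = draw(LEVEL_DIST[leaf], rng)         # pick a level in the hierarchy
        words.append(draw(WORD_DIST[node], rng))   # pick a word from that node
    return words


def word_probability(word, leaf):
    """P(word | class), marginalised over the emitting node on the path."""
    return sum(
        p_node * WORD_DIST[node].get(word, 0.0)
        for node, p_node in LEVEL_DIST[leaf].items()
    )


if __name__ == "__main__":
    rng = random.Random(0)
    print(generate_document("sports", 8, rng))
    print(word_probability("the", "sports"))
```

Under these assumed values, general words such as "the" receive probability from the root for every class, while class-specific words such as "goal" receive probability only in documents of their own class, which is the kind of sharing across levels that the abstract describes.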

Author information

Authors and Affiliations

Xerox Research Center Europe, 6 Ch. de Maupertuis, F-38240, Meylan, France
E. Gaussier & C. Goutte
Xerox PARC, 3333 Coyote Hill Rd, 94304, Palo Alto, CA, USA
K. Popat & F. Chen

Editor information

Editors and Affiliations

Department of Computer and Information Sciences, University of Strathclyde, 26 Richmond Street, G1 1XH, Glasgow, UK
Fabio Crestani
School of Information and Communication Technologies, University of Paisley, High Street, PA1 2BE, Paisley, UK
Mark Girolami
Computing Science Department, University of Glasgow, 17 Lilybank Gardens, G12 8RZ, Glasgow, UK
Cornelis Joost van Rijsbergen

Cite this paper

Gaussier, E., Goutte, C., Popat, K., Chen, F. (2002). A Hierarchical Model for Clustering and Categorising Documents. In: Crestani, F., Girolami, M., van Rijsbergen, C.J. (eds) Advances in Information Retrieval. ECIR 2002. Lecture Notes in Computer Science, vol 2291. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45886-7_16

DOI: https://doi.org/10.1007/3-540-45886-7_16
Published: 14 March 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43343-9
Online ISBN: 978-3-540-45886-9
eBook Packages: Springer Book Archive
