DOI: 10.1145/1281192.1281277
Article

Generalized component analysis for text with heterogeneous attributes

Published: 12 August 2007

Abstract

We present a class of richly structured, undirected hidden variable models suitable for simultaneously modeling text along with other attributes encoded in different modalities. Our model generalizes techniques such as principal component analysis to heterogeneous data types. In contrast to other approaches, this framework allows modalities such as words, authors and timestamps to be captured in their natural, probabilistic encodings. A latent space representation for a previously unseen document can be obtained through a fast matrix multiplication using our method. We demonstrate the effectiveness of our framework on the task of author prediction from 13 years of the NIPS conference proceedings and for a recipient prediction task using a 10-month academic email archive of a researcher. Our approach should be more broadly applicable to many real-world applications where one wishes to efficiently make predictions for a large number of potential outputs using dimensionality reduction in a well-defined probabilistic framework.
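
The abstract's claim that a new document's latent representation comes from a single matrix multiplication can be illustrated with a minimal sketch. It assumes, as in harmonium-style models, that each modality contributes linearly to the latent coordinates; the abstract does not give the exact form, so the function names, weight matrices, and toy values below are all hypothetical and are not the paper's learned parameters.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def project(weights: Matrix, features: Vector) -> Vector:
    """Compute weights^T @ features: one latent coordinate per weight column."""
    if len(weights) != len(features):
        raise ValueError("weights must have one row per feature")
    n_latent = len(weights[0])
    return [
        sum(weights[i][k] * features[i] for i in range(len(features)))
        for k in range(n_latent)
    ]


def embed_document(
    word_counts: Vector,
    w_words: Matrix,
    author_indicator: Vector,
    w_authors: Matrix,
) -> Vector:
    """Sum the linear contributions of each modality (assumed form)."""
    h_words = project(w_words, word_counts)
    h_authors = project(w_authors, author_indicator)
    return [a + b for a, b in zip(h_words, h_authors)]


# Hypothetical toy parameters: 3 vocabulary words, 2 authors, 2 latent dimensions.
W_WORDS = [[0.5, -1.0], [1.0, 0.0], [0.0, 2.0]]
W_AUTHORS = [[1.0, 1.0], [-1.0, 0.5]]

# An unseen document: word 0 appears twice, word 2 once, written by author 0.
latent = embed_document([2.0, 0.0, 1.0], W_WORDS, [1.0, 0.0], W_AUTHORS)
print(latent)
```

No iterative inference is run for the new document: the cost is one pass over its nonzero features per latent dimension, which is why such a projection stays cheap even when the number of candidate outputs is large.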

Cited By

  • (2012) Dimensionality Reduction on Heterogeneous Feature Space. Proceedings of the 2012 IEEE 12th International Conference on Data Mining, pp. 635-644. DOI: 10.1109/ICDM.2012.30. Online publication date: 10-Dec-2012
  • (2010) Nonnegative shared subspace learning and its application to social media retrieval. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1169-1178. DOI: 10.1145/1835804.1835951. Online publication date: 25-Jul-2010

      Published In

      KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2007
      1080 pages
      ISBN:9781595936097
      DOI:10.1145/1281192
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. author prediction
      2. multimodal heterogeneous data
      3. recipient prediction
      4. text mining
      5. topic modeling
      6. undirected graphical models

      Qualifiers

      • Article

      Conference

      KDD07

      Acceptance Rates

      KDD '07 Paper Acceptance Rate 111 of 573 submissions, 19%;
      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
