Abstract
A theoretical foundation for latent semantic indexing (LSI) is proposed by adapting a model first used in array signal processing to the context of information retrieval using the concept of subspaces. It is shown that this subspace-based model, coupled with the minimum description length (MDL) principle, leads to a statistical test to determine the dimensions of the latent-concept subspaces in LSI. The effect of weighting on the choice of the optimal dimensions of latent-concept subspaces is illustrated. It is also shown that the model imposes a so-called low-rank-plus-shift structure that is approximately satisfied by the cross-product of the term-document matrices. This structure can be exploited to give a more accurate updating scheme for LSI and to correct some of the misconceptions about the achievable retrieval accuracy in LSI updating. Variants of Lanczos algorithms are illustrated with numerical test results on a Cray T3E using document collections generated from the World Wide Web.
This work was supported by the Director, Office of Energy Research, Office of Laboratory Policy and Infrastructure Management, of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098 and NSF grant CCR-9619452.
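The low-rank-plus-shift structure mentioned in the abstract can be written out as a short sketch. The symbols below (A for the term-document matrix with n documents as columns, C, W, k, and the shift sigma^2) are notation assumed here for illustration; the abstract itself fixes no notation, and the paper may state the model differently.

```latex
% Assumed notation: A is m x n (terms by documents), C is n x k with full
% column rank, W is k x k symmetric positive definite, and k < n <= m.
\[
  A^{T} A \;\approx\; C\,W\,C^{T} + \sigma^{2} I_{n}
\]
% If the relation held exactly, C W C^T would be positive semidefinite of
% rank k, so the eigenvalues of A^T A would satisfy
\[
  \lambda_{1} \ge \cdots \ge \lambda_{k} > \lambda_{k+1} = \cdots = \lambda_{n} = \sigma^{2},
\]
% and, since the singular values of A are s_i = \sqrt{\lambda_i},
\[
  s_{k+1} = \cdots = s_{n} = \sigma .
\]
```

Under these assumptions, the dimension k of the latent-concept subspace is the number of eigenvalues that stand above the flat tail at sigma^2. With real data the tail is only approximately flat, which is why a statistical criterion such as MDL is needed to choose k.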
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zha, H., Marques, O., Simon, H.D. (1998). Large-scale SVD and subspace-based methods for information retrieval. In: Ferreira, A., Rolim, J., Simon, H., Teng, SH. (eds) Solving Irregularly Structured Problems in Parallel. IRREGULAR 1998. Lecture Notes in Computer Science, vol 1457. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0018525
DOI: https://doi.org/10.1007/BFb0018525
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64809-3
Online ISBN: 978-3-540-68533-3
eBook Packages: Springer Book Archive