DOI: 10.1145/1008992.1009014

On scaling latent semantic indexing for large peer-to-peer systems

Published: 25 July 2004

Abstract

The exponential growth of data demands scalable infrastructures capable of indexing and searching rich content such as text, music, and images. A promising direction is to combine information retrieval with peer-to-peer technology for scalability, fault-tolerance, and low administration cost. One pioneering work along this direction is pSearch [32, 33]. pSearch places documents onto a peer-to-peer overlay network according to semantic vectors produced using Latent Semantic Indexing (LSI). The search cost for a query is reduced since documents related to the query are likely to be co-located on a small number of nodes. Unfortunately, because of its reliance on LSI, pSearch also inherits the limitations of LSI. (1) When the corpus is large and heterogeneous, LSI's retrieval quality is inferior to methods such as Okapi. (2) The Singular Value Decomposition (SVD) used in LSI is unscalable in terms of both memory consumption and computation time.

This paper addresses the above limitations of LSI and makes the following contributions. (1) To reduce the cost of SVD, we reduce the size of its input matrix through document clustering and term selection. Our method retains the retrieval quality of LSI but is several orders of magnitude more efficient. (2) Through extensive experimentation, we found that proper normalization of semantic vectors for terms and documents improves recall by 76%. (3) To further improve retrieval quality, we use low-dimensional subvectors of semantic vectors to cluster documents in the overlay and then use Okapi to guide the search and document selection.
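
As a rough illustration of the Okapi ranking that the abstract relies on for search and document selection, the sketch below scores tokenized documents with the widely used Okapi BM25 formula. It is not the paper's implementation: the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document for a tokenized query.

    k1 and b are common default values, assumed here rather than taken
    from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Longer-than-average documents are penalized, shorter ones boosted.
        length_norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" keeps weights non-negative (an assumption).
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus, already tokenized.
docs = [
    "latent semantic indexing maps documents to semantic vectors".split(),
    "peer to peer overlay networks route queries between nodes".split(),
    "okapi ranks documents by term frequency and document length".split(),
]
scores = bm25_scores("semantic indexing".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

In a pSearch-like setting, such a scorer would presumably run only on the small set of nodes that the semantic vectors select, so the overlay narrows the candidate set and the term-based score ranks it.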

References

[1]
M. Bawa, G. S. Manku, and P. Raghavan. SETS: Search Enhanced by Topic Segmentation. In SIGIR'03, 2003.
[2]
P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[3]
M. Berry, Z. Drmac, and E. Jessup. Matrices, vector spaces, and information retrieval. SIAM Review, 41(2):335--362, 1999.
[4]
M. W. Berry and M. Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval (Software, Environments, Tools). Society for Industrial & Applied Mathematics, 1999.
[5]
M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573--595, 1995.
[6]
E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In SIGKDD'01, 2001.
[7]
C. Buckley. Implementation of the SMART information retrieval system. Technical Report TR85-686, Department of Computer Science, Cornell University, Ithaca, NY 14853, May 1985. Source code available at ftp://ftp.cs.cornell.edu/pub/smart.
[8]
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391--407, 1990.
[9]
I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143--175, 2001.
[10]
S. Dumais. Using LSI for information filtering: TREC-3 experiments. In Third Text REtrieval Conference (TREC-3), 1995.
[11]
P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory Ser. B, 44(3):355--362, 1988.
[12]
G. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, Maryland, second edition, 1989.
[13]
L. Gravano, H. García-Molina, and A. Tomasic. GlOSS: text-source discovery over the Internet. ACM Transactions on Database Systems, 24(2), 1999.
[14]
P. Husbands, H. Simon, and C. Ding. On the use of singular value decomposition for text retrieval. In M. Berry, editor, Proc. of SIAM Comp. Info. Retrieval Workshop, October 2000.
[15]
G. Karypis and E.-H. S. Han. Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval and categorization. In CIKM'00, 2000.
[16]
T. G. Kolda and D. P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Trans. Information Systems, 16:322--346, 1998.
[17]
L. S. Larkey, M. E. Connell, and J. P. Callan. Collection Selection and Results Merging with Topically Organized U.S. Patents and TREC Data. In CIKM'00, 2000.
[18]
T. A. Letsche and M. W. Berry. Large-scale information retrieval with latent semantic indexing. Information Sciences, 100(1-4):105--137, 1997.
[19]
J. Li, B. T. Loo, J. Hellerstein, F. Kaashoek, D. R. Karger, and R. Morris. On the Feasibility of Peer-to-Peer Web Indexing and Search. In IPTPS'03, February 2003.
[20]
D. S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu. Peer-to-peer computing. Technical Report HPL-2002-57, HP Lab, 2002.
[21]
J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1):38--71, 1984.
[22]
C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent Semantic Indexing: A Probabilistic Analysis. In PODC'98, 1998.
[23]
H. Park, M. Jeon, and J. Rosen. Lower dimensional representation of text data based on centroids and least squares. BIT, 43(2):1--22, 2003.
[24]
C. D. Prete, J. T. McArthur, R. L. Villars, I. L. Nathan Redmond, and D. Reinsel. Industry developments and models, Disruptive Innovation in Enterprise Computing: storage. IDC, February 2003.
[25]
S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A Scalable Content-Addressable Network. In SIGCOMM'01, 2001.
[26]
S. Rhea and J. Kubiatowicz. Probabilistic Location and Routing. In INFOCOM'02, 2002.
[27]
S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In TREC-3, 1994.
[28]
G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613--620, 1975.
[29]
A. Singhal, C. Buckley, and M. Mitra. Pivoted Document Length Normalization. In SIGIR'96, 1996.
[30]
SVDPACK. http://www.netlib.org/svdpack.
[31]
C. Tang and S. Dwarkadas. Peer-to-Peer Information Retrieval in Distributed Hashtable Systems. In NSDI'04, 2004.
[32]
C. Tang, Z. Xu, and S. Dwarkadas. Peer-to-Peer Information Retrieval Using Self-Organizing Semantic Overlay Networks. In SIGCOMM'03, 2003.
[33]
C. Tang, Z. Xu, and M. Mahalingam. pSearch: Information Retrieval in Structured Overlays. In The First Workshop on Hot Topics in Networks (HotNets I), 2002. Older but partially expanded version available as technical report HPL-2002-198, "PeerSearch: Efficient Information Retrieval in Peer-to-Peer Networks".
[34]
Text Retrieval Conference (TREC). http://trec.nist.gov.
[35]
R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB'98, 1998.
[36]
J. Xu and W. B. Croft. Cluster-Based Language Models for Distributed Retrieval. In SIGIR'99, 1999.

Published In

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN:1581138814
DOI:10.1145/1008992
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dimensionality reduction
  2. latent semantic indexing
  3. peer-to-peer IR

Qualifiers

  • Article

Conference

SIGIR04

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

  • (2019) Semantic-Aware Data Cube for Cloud Networks. Searchable Storage in Cloud Computing, pp. 179-204. DOI: 10.1007/978-981-13-2721-6_8. Online publication date: 9-Feb-2019
  • (2016) Investigating the Optimise k-Dimensions and Threshold Values of Latent Semantic Indexing Retrieval Performance for Small Malay Language Corpus. Regional Conference on Science, Technology and Social Sciences (RCSTSS 2014), pp. 325-336. DOI: 10.1007/978-981-10-0534-3_31. Online publication date: 25-Mar-2016
  • (2015) A word prediction methodology for automatic sentence completion. Proceedings of the 2015 IEEE 9th International Conference on Semantic Computing (IEEE ICSC 2015), pp. 240-243. DOI: 10.1109/ICOSC.2015.7050813. Online publication date: Feb-2015
  • (2014) ANTELOPE. IEEE Transactions on Computers, 63:9, pp. 2146-2159. DOI: 10.1109/TC.2013.110. Online publication date: 1-Sep-2014
  • (2013) An approach to semantic indexing and information retrieval. Revista Facultad de Ingeniería Universidad de Antioquia, pp. 174-187. DOI: 10.17533/udea.redin.16528. Online publication date: 30-Aug-2013
  • (2013) MapReduce Based Method for Big Data Semantic Clustering. Proceedings of the 2013 IEEE International Conference on Systems, Man, and Cybernetics, pp. 2814-2819. DOI: 10.1109/SMC.2013.480. Online publication date: 13-Oct-2013
  • (2012) Research of Media Material Retrieval Scheme Based on XPath. Intelligent Information Processing VI, pp. 196-201. DOI: 10.1007/978-3-642-32891-6_25. Online publication date: 2012
  • (2011) Trace-Oriented Feature Analysis for Large-Scale Text Data Dimension Reduction. IEEE Transactions on Knowledge and Data Engineering, 23:7, pp. 1103-1117. DOI: 10.1109/TKDE.2010.34. Online publication date: 1-Jul-2011
  • (2011) Applying Information Retrieval to Distributed Hash Table (DHT) Systems. 2011 11th Annual International Conference on New Technologies of Distributed Systems, pp. 1-7. DOI: 10.1109/NOTERE.2011.5957990. Online publication date: May-2011
  • (2011) A Content-Based Information Retrieval Model Using Non-Negative Matrix Factorization Method. 2011 International Conference on Management and Service Science, pp. 1-4. DOI: 10.1109/ICMSS.2011.5999091. Online publication date: Aug-2011
