research-article

PageRank without hyperlinks: Structural reranking using links induced by language models

Published: 23 November 2010

Abstract

The ad hoc retrieval task is to find documents in a corpus that are relevant to a query. Inspired by the PageRank and HITS (hubs and authorities) algorithms for Web search, we propose a structural reranking approach to ad hoc retrieval that applies to settings with no hyperlink information. We reorder the documents in an initially retrieved set by exploiting implicit asymmetric relationships among them. We consider generation links, which indicate that the language model induced from one document assigns high probability to the text of another. We study a number of reranking criteria based on measures of centrality in the graphs formed by generation links, and show that integrating centrality into standard language-model-based retrieval is quite effective at improving precision at top ranks; the best resultant performance is comparable, and often superior, to that of a state-of-the-art pseudo-feedback-based retrieval approach. In addition, we demonstrate the merits of our language-model-based method for inducing interdocument links by comparing it to previously suggested notions of interdocument similarities (e.g., cosines within the vector-space model). We also show that our methods for inducing centrality are substantially more effective than approaches based on document-specific characteristics, several of which are novel to this study.
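
As a rough illustration of the idea in the abstract, the sketch below builds a small graph of generation links and scores documents with a PageRank-style power iteration. It is a minimal sketch under stated assumptions, not the estimation procedure used in the article: the Jelinek-Mercer-smoothed unigram models, the `top_k` cutoff, the edge direction (from a document to its best generators), the damping factor, and all function names are hypothetical choices made for this example.

```python
import math
from collections import Counter


def smoothed_lm(doc_tokens, collection_counts, collection_len, lam=0.5):
    """Unigram model with Jelinek-Mercer smoothing (an assumption of this sketch)."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)

    def prob(term):
        p_doc = counts[term] / n
        p_coll = collection_counts[term] / collection_len
        return lam * p_doc + (1.0 - lam) * p_coll

    return prob


def generation_graph(docs, top_k=2, lam=0.5):
    """weights[i][j] > 0 means document j is among the top_k generators of document i."""
    collection = Counter(t for d in docs for t in d)
    total = sum(collection.values())
    models = [smoothed_lm(d, collection, total, lam) for d in docs]
    n = len(docs)
    weights = [[0.0] * n for _ in range(n)]
    for i, target in enumerate(docs):
        scores = []
        for j in range(n):
            if j != i:
                # Length-normalised log-likelihood of document i's text under model j.
                ll = sum(math.log(models[j](t)) for t in target) / len(target)
                scores.append((ll, j))
        scores.sort(reverse=True)
        for ll, j in scores[:top_k]:
            weights[i][j] = math.exp(ll)  # geometric-mean per-term probability
    return weights


def pagerank(weights, damping=0.85, iters=100):
    """Power iteration over the row-normalised weight matrix."""
    n = len(weights)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i, row in enumerate(weights):
            out = sum(row)
            for j in range(n):
                # A node without out-links spreads its mass uniformly.
                share = row[j] / out if out > 0 else 1.0 / n
                new[j] += damping * rank[i] * share
        rank = new
    return rank


# Toy "initially retrieved" list: three related documents and one off-topic one.
docs = [
    "language model retrieval ranking".split(),
    "language model smoothing retrieval".split(),
    "language model retrieval graph".split(),
    "soccer match goal".split(),
]
ranks = pagerank(generation_graph(docs))
for i, r in sorted(enumerate(ranks), key=lambda p: -p[1]):
    print(i, round(r, 4))
```

In this toy list the off-topic fourth document is never chosen as a generator, so it receives no in-links and ends up with the lowest centrality score. A reranker along the lines described above would then combine such centrality scores with the initial retrieval scores; how to combine them is left open here.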



    Published In

    ACM Transactions on Information Systems  Volume 28, Issue 4
    November 2010
    204 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/1852102
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 November 2010
    Accepted: 01 March 2010
    Revised: 01 October 2009
    Received: 01 March 2008
    Published in TOIS Volume 28, Issue 4


    Author Tags

    1. HITS
    2. Language modeling
    3. PageRank
    4. authorities
    5. graph-based retrieval
    6. high-accuracy retrieval
    7. hubs
    8. social networks
    9. structural reranking

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Article Metrics

    • Downloads (Last 12 months): 16
    • Downloads (Last 6 weeks): 3
    Reflects downloads up to 03 Oct 2024


    Cited By

    • (2023) Optimal Page Ranking Technique for Webpage Personalization Using Semantic Classifier. Handbook of Artificial Intelligence (144-164). DOI: 10.2174/9789815124514123010010. Online publication date: 9-Nov-2023
    • (2023) The Impact of Language Technologies in the Legal Domain. Multidisciplinary Perspectives on Artificial Intelligence and the Law (25-46). DOI: 10.1007/978-3-031-41264-6_2. Online publication date: 27-Dec-2023
    • (2022) Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs. ACM Transactions on Information Systems 40:4 (1-45). DOI: 10.1145/3495209. Online publication date: 11-Jan-2022
    • (2022) CRL: Collaborative Representation Learning by Coordinating Topic Modeling and Network Embeddings. IEEE Transactions on Neural Networks and Learning Systems 33:8 (3765-3777). DOI: 10.1109/TNNLS.2021.3054422. Online publication date: Aug-2022
    • (2022) VoCSK: Verb-oriented commonsense knowledge mining with taxonomy-guided induction. Artificial Intelligence 310 (103744). DOI: 10.1016/j.artint.2022.103744. Online publication date: Sep-2022
    • (2022) A retrieval model family based on the probability ranking principle for ad hoc retrieval. Journal of the Association for Information Science and Technology 73:8 (1140-1154). DOI: 10.1002/asi.24619. Online publication date: 5-Feb-2022
    • (2021) Recommending Search Queries in Documents Using Inter N-Gram Similarities. Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval (211-220). DOI: 10.1145/3471158.3472252. Online publication date: 11-Jul-2021
    • (2019) Incremental C-Rank: An effective and efficient ranking algorithm for dynamic Web environments. Knowledge-Based Systems. DOI: 10.1016/j.knosys.2019.03.034. Online publication date: Apr-2019
    • (2018) Mix 'n Match. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (1373-1382). DOI: 10.1145/3269206.3271668. Online publication date: 17-Oct-2018
    • (2018) Utilizing Inter-Passage Similarities for Focused Retrieval. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (1453-1453). DOI: 10.1145/3209978.3210222. Online publication date: 27-Jun-2018
