Do TREC web collections look like the web?

Published: 01 September 2002

Abstract

We measure the WT10g test collection, used in the TREC-9 and TREC 2001 Web Tracks, and the .GOV test collection used in the TREC 2002 Web and Interactive Tracks, with common measures used in the web topology community, in order to see if these collections "look like" the web. This is not an idle question; characteristics of the web, such as power law relationships, diameter, and connected components have all been observed within the scope of general web crawls, constructed by blindly following links. The .GOV collection is a fairly straightforward 18GB crawl of sites in the .gov domain. In contrast, WT10g was carved out from a much larger crawl specifically to be a web search test collection within the reach of university researchers. Do such collections retain the properties of the larger web? In the case of WT10g and .GOV, yes.
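As a rough illustration of the kinds of measures the abstract names (degree distributions that may follow a power law, and connected components), the following minimal sketch computes them for a link graph given as a list of (source, destination) pairs. It is not the authors' method: the function names and the toy edge list are hypothetical, and the exponent estimate uses the standard approximate maximum-likelihood formula for discrete power laws, which is an assumption about how one might fit the tail rather than anything stated in the abstract.

```python
import math
from collections import Counter, defaultdict, deque


def in_degree_distribution(edges):
    """Map each in-degree value to the number of pages having it.

    Pages with no in-links do not appear in the result.
    """
    in_degree = Counter(dst for _, dst in edges)
    return Counter(in_degree.values())


def power_law_exponent(degrees, k_min=1):
    """Approximate MLE of alpha in P(k) ~ k**(-alpha) for k >= k_min.

    Uses alpha = 1 + n / sum(ln(k / (k_min - 0.5))), a common
    approximation for discrete data (an assumed fitting choice).
    """
    tail = [k for k in degrees if k >= k_min]
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum


def weak_component_sizes(edges):
    """Sizes of weakly connected components, largest first.

    Link direction is ignored, so two pages are in the same component
    if any chain of links (in either direction) joins them.
    """
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    seen, sizes = set(), []
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            size += 1
            for nb in neighbours[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        sizes.append(size)
    return sorted(sizes, reverse=True)


if __name__ == "__main__":
    # Hypothetical toy link graph, not data from WT10g or .GOV.
    toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("e", "f")]
    in_degrees = list(Counter(dst for _, dst in toy_edges).values())
    print(in_degree_distribution(toy_edges))
    print(power_law_exponent(in_degrees))
    print(weak_component_sizes(toy_edges))
```

On a real collection the edge list would come from the extracted hyperlinks, and a fit over only a handful of pages, as in the toy graph, says nothing meaningful about whether a power law holds.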

Published In

ACM SIGIR Forum, Volume 36, Issue 2
Fall 2002
99 pages
ISSN: 0163-5840
DOI: 10.1145/792550

Publisher

Association for Computing Machinery, New York, NY, United States
