Replicating Web Structure in Small-Scale Test Collections

Published: 01 September 2004

Abstract

Linkage analysis is widely assumed to be of significant benefit as an aid to web search, and we know that many major search engines implement it. Why, then, have so few TREC participants been able to demonstrate its benefits scientifically in recent years? In this paper we put forward reasons why TREC experiments have produced so many disappointing results and, by examining the linkage structure of the WWW, we identify the linkage density a dataset requires to faithfully support experiments into linkage-based retrieval. Based on these requirements we report on methodologies for synthesising such a test collection.

        Published In

        Information Retrieval, Volume 7, Issue 3-4
        Sep 2004
        173 pages

        Publisher

        Kluwer Academic Publishers

        United States

        Author Tags

        1. linkage analysis
        2. search engine
        3. retrieval evaluation
        4. test collections

        Qualifiers

        • Research-article

        Cited By

        • (2016) Power Law Distributions in Information Retrieval. ACM Transactions on Information Systems 34:2 (1-37). DOI: 10.1145/2816815. Online publication date: 16-Feb-2016
        • (2014) LifeLogging. Foundations and Trends in Information Retrieval 8:1 (1-125). DOI: 10.1561/1500000033. Online publication date: 16-Jun-2014
        • (2013) Incorporating social anchors for ad hoc retrieval. Proceedings of the 10th Conference on Open Research Areas in Information Retrieval (181-188). DOI: 10.5555/2491748.2491786. Online publication date: 15-May-2013
        • (2010) The importance of anchor text for ad hoc search revisited. Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval (122-129). DOI: 10.1145/1835449.1835472. Online publication date: 19-Jul-2010
        • (2006) Hierarchical web structuring from the web as a graph approach with repetitive cycle proof. Proceedings of the 2006 international conference on Advanced Web and Network Technologies, and Applications (1004-1011). DOI: 10.1007/11610496_140. Online publication date: 16-Jan-2006
        • (2005) Scalability influence on retrieval models. Proceedings of the 27th European conference on Advances in Information Retrieval Research (388-402). DOI: 10.1007/978-3-540-31865-1_28. Online publication date: 21-Mar-2005
        • (2005) The effect of collection fusion strategies on information seeking performance in distributed hypermedia digital libraries. Proceedings of the 9th European conference on Research and Advanced Technology for Digital Libraries (57-68). DOI: 10.1007/11551362_6. Online publication date: 18-Sep-2005
        • (2004) The SPIRIT collection. ACM SIGIR Forum 38:2 (57-61). DOI: 10.1145/1041394.1041395. Online publication date: 1-Dec-2004
