Ranking web sites with real user traffic

Authors: Filippo Menczer, Santo Fortunato, Alessandro Flammini, Alessandro Vespignani

WSDM '08: Proceedings of the 2008 International Conference on Web Search and Data Mining

Pages 65 - 76

https://doi.org/10.1145/1341531.1341543

Published: 11 February 2008

Abstract

We analyze the traffic-weighted Web host graph obtained from a large sample of real Web users over about seven months. This complex dynamic network reveals a number of interesting structural properties, some in line with the well-studied Boolean link host graph and others pointing to important differences. We find that while search is directly involved in a surprisingly small fraction of user clicks, it leads to a much larger fraction of all sites visited. The temporal traffic patterns display strong regularities, with a large portion of future requests being statistically predictable from past ones. Given the importance of topological measures such as PageRank in modeling user navigation, as well as their role in ranking sites for Web search, we use the traffic data to validate the PageRank random surfing model. The ranking obtained from the actual frequency with which users visit a site differs significantly from that approximated by the uniform surfing/teleportation behavior modeled by PageRank, especially for the most important sites. To interpret this finding, we consider each of the fundamental assumptions underlying PageRank and show how each is violated by actual user behavior.
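The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code or data: the four-host graph, the click counts, the damping factor of 0.85, and the function names `pagerank` and `kendall_tau` are all assumptions made for illustration. The sketch computes PageRank by power iteration with uniform teleportation, ranks the same hosts by incoming clicks, and compares the two rankings with Kendall's tau, one common rank correlation; the abstract does not say which measure the paper uses.

```python
from itertools import combinations


def pagerank(links, damping=0.85, iters=100):
    """PageRank by power iteration with uniform link choice and teleportation."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Teleportation: every host receives an equal share of (1 - damping).
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                # Uniform surfing: rank is split equally among out-links.
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Dangling host: spread its rank uniformly over all hosts.
                share = damping * rank[u] / n
                for v in nodes:
                    new[v] += share
        rank = new
    return rank


def kendall_tau(x, y):
    """Kendall tau-a between two score dicts defined on the same keys."""
    keys = sorted(x)
    concordant = discordant = 0
    for a, b in combinations(keys, 2):
        s = (x[a] - x[b]) * (y[a] - y[b])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(keys) * (len(keys) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical host graph (unweighted links) and click counts per link.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
clicks = {("a", "b"): 5, ("a", "c"): 50, ("b", "c"): 5, ("c", "a"): 40, ("d", "c"): 1}

pr = pagerank(links)

# Traffic score of a host: total clicks arriving at it.
traffic = {u: 0 for u in pr}
for (_, dst), count in clicks.items():
    traffic[dst] += count

tau = kendall_tau(pr, traffic)
print("PageRank order:", sorted(pr, key=pr.get, reverse=True))
print("Traffic order: ", sorted(traffic, key=traffic.get, reverse=True))
print("Kendall tau:", tau)
```

A tau close to 1 means the two rankings largely agree, while values near 0 indicate little agreement. On real data, the per-link click weights would replace the uniform split over out-links that PageRank assumes, which is the kind of assumption the abstract says is violated.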

Published In

WSDM '08: Proceedings of the 2008 International Conference on Web Search and Data Mining

February 2008

270 pages

ISBN: 9781595939272

DOI: 10.1145/1341531

General Chair: Marc Najork (Microsoft, USA)

Program Chairs: Andrei Broder (Yahoo!, USA), Soumen Chakrabarti (IIT Bombay, India)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
