research-article

Ranking web sites with real user traffic

Published: 11 February 2008

Abstract

We analyze the traffic-weighted Web host graph obtained from a large sample of real Web users over about seven months. This complex dynamic network reveals a number of interesting structural properties, some in line with the well-studied boolean link host graph and others pointing to important differences. We find that while search is directly involved in a surprisingly small fraction of user clicks, it leads to a much larger fraction of all sites visited. The temporal traffic patterns display strong regularities, with a large portion of future requests being statistically predictable from past ones. Given the importance of topological measures such as PageRank in modeling user navigation, as well as their role in ranking sites for Web search, we use the traffic data to validate the PageRank random surfing model. The ranking obtained from the actual frequency with which users visit a site differs significantly from the one approximated by the uniform surfing/teleportation behavior modeled by PageRank, especially for the most important sites. To interpret this finding, we consider each of the fundamental assumptions underlying PageRank and show how each is violated by actual user behavior.
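
The kind of comparison described in the abstract can be illustrated with a small sketch. This is not the authors' code or data: the four hosts, the click counts, the damping factor of 0.85, and the use of Kendall's tau as the rank-agreement measure are all assumptions chosen for illustration. The sketch ranks hosts once by PageRank with uniform teleportation on the unweighted link graph and once by their share of observed clicks, then measures how far the two rankings agree.

```python
from itertools import combinations


def pagerank(links, damping=0.85, iters=100):
    """Power iteration with uniform teleportation.

    links maps each host to the list of hosts it links to (unweighted).
    """
    hosts = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(hosts)
    rank = {h: 1.0 / n for h in hosts}
    for _ in range(iters):
        # Rank held by hosts without out-links is spread uniformly.
        dangling = sum(rank[h] for h in hosts if not links.get(h))
        new = {h: (1.0 - damping) / n + damping * dangling / n for h in hosts}
        for src, targets in links.items():
            for dst in targets:
                # Uniform surfing: every out-link is followed equally often.
                new[dst] += damping * rank[src] / len(targets)
        rank = new
    return rank


def traffic_share(clicks):
    """Fraction of all observed clicks that land on each host.

    clicks maps (source host, destination host) to a click count.
    """
    total = sum(clicks.values())
    share = {h: 0.0 for pair in clicks for h in pair}
    for (_, dst), count in clicks.items():
        share[dst] += count / total
    return share


def kendall_tau(a, b, items):
    """Kendall tau-a between two score dicts over the same items."""
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        sign = (a[x] - a[y]) * (b[x] - b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical toy data: a boolean link graph and made-up click counts.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
clicks = {("a", "b"): 90, ("a", "c"): 10, ("b", "c"): 5,
          ("c", "a"): 80, ("d", "c"): 1}

pr = pagerank(links)
share = traffic_share(clicks)
hosts = sorted(pr)
tau = kendall_tau(pr, share, hosts)

print("PageRank order:", sorted(hosts, key=pr.get, reverse=True))
print("Traffic order: ", sorted(hosts, key=share.get, reverse=True))
print("Kendall tau:", tau)
```

In this toy graph the link from a to b carries most of a's outgoing clicks, so the traffic ranking favors b, while uniform surfing splits a's rank evenly between b and c. A tau value well below 1 indicates that the two rankings disagree on many host pairs.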

Published In

WSDM '08: Proceedings of the 2008 International Conference on Web Search and Data Mining
February 2008
270 pages
ISBN:9781595939272
DOI:10.1145/1341531

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. navigation
  2. pagerank
  3. ranking
  4. search
  5. teleportation
  6. web traffic
  7. weighted host graph

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Article Metrics

  • Downloads (Last 12 months): 14
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 09 Nov 2024

Cited By

  • (2022) A lightweight and proactive rule-based incremental construction approach to detect phishing scam. Information Technology and Management 23:4 (271-298). DOI: 10.1007/s10799-021-00351-7. Online publication date: 17-Jan-2022
  • (2021) A systematic evaluation of assumptions in centrality measures by empirical flow data. Social Network Analysis and Mining 11:1. DOI: 10.1007/s13278-021-00725-3. Online publication date: 1-Mar-2021
  • (2019) Extending the Adapted PageRank Algorithm Centrality to Multiplex Networks with Data Using the PageRank Two-Layer Approach. Symmetry 11:2 (284). DOI: 10.3390/sym11020284. Online publication date: 22-Feb-2019
  • (2019) Do Directionality and Network Size Affect Network Structure in Online Social Networks? 2019 Portland International Conference on Management of Engineering and Technology (PICMET) (1-18). DOI: 10.23919/PICMET.2019.8893732. Online publication date: Aug-2019
  • (2018) Product Popularity versus Size of Conversation in Social Media: An Analysis of Twitter Conversations about YouTube Product Categories. 2018 Portland International Conference on Management of Engineering and Technology (PICMET) (1-7). DOI: 10.23919/PICMET.2018.8481830. Online publication date: Aug-2018
  • (2018) It's All in the Name: Why Some URLs are More Vulnerable to Typosquatting. IEEE INFOCOM 2018 - IEEE Conference on Computer Communications (2618-2626). DOI: 10.1109/INFOCOM.2018.8486271. Online publication date: Apr-2018
  • (2018) Dataset for Web Traffic Security Analysis. IECON 2018 - 44th Annual Conference of the IEEE Industrial Electronics Society (2700-2705). DOI: 10.1109/IECON.2018.8591589. Online publication date: Oct-2018
  • (2017) Mapping Higher-Order Network Flows in Memory and Multilayer Networks with Infomap. Algorithms 10:4 (112). DOI: 10.3390/a10040112. Online publication date: 30-Sep-2017
  • (2016) The Atlas of Chinese World Wide Web Ecosystem Shaped by the Collective Attention Flows. PLOS ONE 11:11 (e0165240). DOI: 10.1371/journal.pone.0165240. Online publication date: 3-Nov-2016
  • (2016) Inspiration or Preparation? Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (741-750). DOI: 10.1145/2983323.2983820. Online publication date: 24-Oct-2016
