DOI: 10.1145/2615569.2615674
research-article

Graph structure in the web: aggregated by pay-level domain

Published: 23 June 2014

Abstract

Previous research on the overall graph structure of the World Wide Web has mostly focused on the page level, meaning that the graph that directly results from hyperlinks between individual web pages was analyzed. This paper aims to provide additional insights into the macroscopic structure of the World Wide Web by analyzing an aggregated version of a recent web graph. The graph covers over 3.5 billion web pages and 128 billion hyperlinks between pages. It was crawled in the first half of 2012. We aggregate this graph by pay-level domain (PLD), meaning that all pages that belong to the same pay-level domain are represented by a single node and that an arc exists between two nodes if there is at least one hyperlink between pages of the corresponding pay-level domains. The resulting PLD graph covers 43 million PLDs and contains 623 million arcs between PLDs. Analyzing this aggregated graph allows us to present findings about linkage patterns between complete websites rather than only between individual HTML pages. In this paper, we present basic statistics about the PLD graph, such as degree distributions, top-ranked PLDs, distances, and diameter. We analyze whether the bow-tie structure introduced by Broder et al. can also be identified in our PLD graph and reveal a backbone of highly interlinked websites within the graph. We group the websites by top-level domain and report findings about the overall linkage within and between different top-level domains. In a final experiment, we use data from the Open Directory Project (DMOZ) to categorize websites by topic and report findings about linkage patterns between websites belonging to different topical categories.
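
The aggregation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the example URLs, and the tiny hard-coded suffix set are all assumptions made for illustration, and a real analysis would need the full Public Suffix List to identify pay-level domains. Dropping links within the same PLD is also an assumption, since the abstract does not say how such links are treated.

```python
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for the Public Suffix List (assumption);
# a real analysis would need the full list to find pay-level domains.
TOY_PUBLIC_SUFFIXES = {"com", "org", "de", "uk", "co.uk"}


def pay_level_domain(url, suffixes=TOY_PUBLIC_SUFFIXES):
    """Return the longest matching public suffix plus one label to its left."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    # Scan candidate suffixes from longest to shortest.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return host  # fall back to the full host when no suffix matches


def aggregate_by_pld(page_links, keep_self_loops=False):
    """Collapse page-level hyperlinks into a set of unique PLD-level arcs."""
    arcs = set()
    for src_url, dst_url in page_links:
        src, dst = pay_level_domain(src_url), pay_level_domain(dst_url)
        # Assumption: links inside one PLD are dropped unless requested.
        if src != dst or keep_self_loops:
            arcs.add((src, dst))
    return arcs


# Made-up page-level links, for illustration only.
page_links = [
    ("http://www.example.com/a", "http://blog.example.com/b"),    # same PLD
    ("http://www.example.com/a", "http://news.example.co.uk/x"),
    ("http://shop.example.com/c", "http://example.co.uk/y"),      # repeats an arc
    ("http://example.co.uk/y", "http://wiki.example.org/"),
]

print(sorted(aggregate_by_pld(page_links)))
```

Because the arcs are stored in a set, many page-level hyperlinks between the same pair of PLDs collapse into a single arc, which matches the "at least one hyperlink" condition stated in the abstract.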

References

[1]
R. Baeza-Yates and B. Poblete. Evolution of the Chilean web structure composition. In Web Congress, 2003. Proceedings. First Latin American, pages 11--13, 2003.
[2]
A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509--512, October 1999.
[3]
P. Boldi, B. Codenotti, M. Santini, and S. Vigna. Structural properties of the African web. In WWW '02, volume 66, 2002.
[4]
P. Boldi and S. Vigna. The webgraph framework I: compression techniques. In WWW '04, pages 595--602. ACM, 2004.
[5]
P. Boldi and S. Vigna. In-core computation of geometric centralities with HyperBall: A hundred billion nodes and beyond. In ICDMW 2013. IEEE, 2013.
[6]
A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer networks, 33(1):309--320, 2000.
[7]
A. Clauset, C. R. Shalizi, and M. E. Newman. Power-law distributions in empirical data. SIAM review, 51(4):661--703, 2009.
[8]
S. Dill, R. Kumar, K. S. McCurley, S. Rajagopalan, D. Sivakumar, and A. Tomkins. Self-similarity in the web. ACM Trans. Internet Technol., 2(3):205--223, Aug. 2002.
[9]
D. Donato, S. Leonardi, S. Millozzi, and P. Tsaparas. Mining the inner structure of the web graph. In WebDB, pages 145--150, 2005.
[10]
D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. International Workshop on the Web, pages 1--6, 2004.
[11]
Y. Hirate, S. Kato, and H. Yamana. Web structure in 2005. In Algorithms and models for the web-graph, pages 36--46. Springer, 2008.
[12]
R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber-communities. Computer Networks, 31(11--16):1481--1493, 1999.
[13]
R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web-revisited: A trick of the heavy tail. In Proc. of WWW Companion '14, pages 427--432, 2014.
[14]
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the web. pages 1--17, 1999.
[15]
G. Pandurangan, P. Raghavan, and E. Upfal. Using PageRank to characterize web structure. Computing and Combinatorics, pages 330--339, 2002.
[16]
M. P. Rombach, M. A. Porter, J. H. Fowler, and P. J. Mucha. Core-periphery structure in networks. arXiv preprint arXiv:1202.2684, 2012.
[17]
M. Serrano, A. Maguitman, M. Boguñá, S. Fortunato, and A. Vespignani. Decoding the structure of the www: A comparative analysis of web crawls. ACM Transactions on the Web (TWEB), 1(2):10, 2007.
[18]
S. Spiegler. Statistics of the Common Crawl corpus 2012. Technical report, SwiftKey, June 2013. Document viewed on September 16th 2013 from https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgUiZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9.
[19]
S. Vigna. Fibonacci binning. CoRR, abs/1312.3749, 2013.
[20]
J. J. H. Zhu, T. Meng, Z. Xie, G. Li, and X. Li. A teapot graph and its hierarchical structure of the Chinese web. WWW '08, pages 1133--1134, 2008.

Cited By

  • (2024) Skyway: Accelerate Graph Applications with a Dual-Path Architecture and Fine-Grained Data Management. Journal of Computer Science and Technology, 39:4 (871-894). DOI: 10.1007/s11390-023-2939-x. Online publication date: 1-Jul-2024
  • (2023) Readability and topics of the German Health Web: Exploratory study and text analysis. PLOS ONE, 18:2 (e0281582). DOI: 10.1371/journal.pone.0281582. Online publication date: 10-Feb-2023
  • (2023) Connectivity-Aware Link Analysis for Skewed Graphs. Proceedings of the 52nd International Conference on Parallel Processing (482-491). DOI: 10.1145/3605573.3605579. Online publication date: 7-Aug-2023

      Published In

      WebSci '14: Proceedings of the 2014 ACM conference on Web science
      June 2014
      318 pages
      ISBN:9781450326223
      DOI:10.1145/2615569
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. graph analysis
      2. network analysis
      3. web graph
      4. web mining
      5. web science
      6. world wide web

      Qualifiers

      • Research-article

      Conference

      WebSci '14: ACM Web Science Conference
      June 23 - 26, 2014
      Bloomington, Indiana, USA

      Acceptance Rates

      WebSci '14 Paper Acceptance Rate 29 of 144 submissions, 20%;
      Overall Acceptance Rate 245 of 933 submissions, 26%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (last 12 months): 10
      • Downloads (last 6 weeks): 1
      Reflects downloads up to 25 Feb 2025

      Citations

      Cited By

      • (2023) Web Mining. Machine Learning for Data Science Handbook (447-467). DOI: 10.1007/978-3-031-24628-9_20. Online publication date: 26-Feb-2023
      • (2023) GEM: Execution-Aware Cache Management for Graph Analytics. Algorithms and Architectures for Parallel Processing (273-292). DOI: 10.1007/978-3-031-22677-9_15. Online publication date: 11-Jan-2023
      • (2022) MASTIFF. Proceedings of the 36th ACM International Conference on Supercomputing (1-13). DOI: 10.1145/3524059.3532365. Online publication date: 28-Jun-2022
      • (2022) LOTUS. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (219-233). DOI: 10.1145/3503221.3508402. Online publication date: 2-Apr-2022
      • (2022) SAPCo Sort: Optimizing Degree-Ordering for Power-Law Graphs. 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (138-140). DOI: 10.1109/ISPASS55109.2022.00015. Online publication date: May-2022
      • (2022) Improving Locality of Irregular Updates with Hardware Assisted Propagation Blocking. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (543-557). DOI: 10.1109/HPCA53966.2022.00047. Online publication date: Apr-2022
      • (2022) Onion under Microscope: An in-depth analysis of the Tor Web. World Wide Web, 25:3 (1287-1313). DOI: 10.1007/s11280-022-01044-z. Online publication date: 1-Apr-2022