Practical Social Network Analysis with Python
Series Editors
Jacek Rak
Department of Computer Communications, Faculty of Electronics,
Telecommunications and Informatics, Gdansk University of Technology,
Gdansk, Poland
A. J. Sammes
Cyber Security Centre, Faculty of Technology, De Montfort University,
Leicester, UK
Krishna Raj P. M.
Department of ISE, Ramaiah Institute of Technology, Bangalore, Karnataka,
India
Ankith Mohan
Department of ISE, Ramaiah Institute of Technology, Bangalore, Karnataka,
India
K. G. Srinivasa
Department of Information Technology, C.B.P. Government Engineering
College, Jaffarpur, Delhi, India
Additional material to this book can be downloaded from http://extras.springer.com.
Preface
Although there are innumerable complex systems and therefore such a large
number of networks, the focus of this book is social networks . A social
network contains individuals as nodes and links representing the relationship
between these individuals. The study of social networks is of particular
interest because it focuses on this abstract view of human relationships
which gives insight into various kinds of social relationships ranging all the
way from bargaining power to psychological health.
Figure 1 depicts the Internet on a global scale. This figure can help paint a
sort of picture as to how complicated a system really is, and how one must
proceed in order to understand such a complex system.
Fig. 1 The Internet on a global scale
Each of these complex systems (especially the Internet) has its own unique
idiosyncrasies but all of them share a particular commonality in the fact that
they can be described by an intricate wiring diagram, a network , which
defines the interactions between the components. We can never fully
understand the system unless we gain a full understanding of its network.
Network
A network is a collection of objects where some pairs of these objects are
connected by links. These objects are also sometimes referred to as nodes.
By representing a complex system through its network, we are able to better
visualize the system and observe the interconnections among the various
nodes. From close examination of networks, we can gather information
about which nodes are closely linked to one another, which are sparsely
linked, whether there is a concentration of links in a particular part of the
network, whether some nodes have a much higher number of links than
others, and so on.
Graph
Several properties can be retrieved directly from these networks, but others
require a more mathematical approach. A network as drawn is not directly
amenable to such treatment, so it is represented as a graph . In this view, a
graph is a mathematical representation of a network which acts as a
framework for reasoning about numerous concepts. More formally, a graph
can be defined as $$ G(V,E) $$ where $$ V $$ denotes the set of all
vertices of $$ G $$ and $$ E $$ denotes the set of edges of $$ G $$ .
Each edge $$ e = (u,v) \in E $$ where $$ u,v \in V $$ describes an edge
from $$ u $$ to $$ v $$ . If an edge exists between $$ u $$ and
$$ v $$ then they are considered as neighbours . The number of vertices in
$$ G $$ is denoted as $$ |V| $$ and the number of edges is denoted by
$$ |E| $$ . These notations will be used throughout the book.
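The following minimal sketch shows how such a graph $$ G(V,E) $$ can be built and queried in Python. It uses the networkx library purely for illustration (the book itself relies on SNAP [1]), and the four vertex names are hypothetical.

```python
# A minimal sketch (not from the book): building an undirected graph G(V, E)
# and querying |V|, |E| and the neighbours of a vertex.
import networkx as nx

G = nx.Graph()                       # undirected graph
G.add_edges_from([("A", "B"), ("B", "C"), ("B", "D"), ("A", "C"), ("A", "D")])

print(G.number_of_nodes())           # |V| = 4
print(G.number_of_edges())           # |E| = 5
print(sorted(G.neighbors("B")))      # neighbours of B: ['A', 'C', 'D']
```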
Network Datasets
In this age of information technology, we are blessed with an ever increasing
availability of large and detailed network datasets. These datasets generally
fall into one or more of the following groups.
Krishna Raj P. M.
Ankith Mohan
K. G. Srinivasa
Bangalore, India
Bangalore, India
Jaffarpur, India
List of Figures
Fig. 1.9 A folded directed graph with its corresponding bipartite directed graph
Fig. 1.19 A weakly connected graph whose vertices are the SCCs of $$ G $$ and whose edges exist in $$ G^{\prime } $$ because there is an edge between the corresponding SCCs in $$ G $$
Fig. 2.1 In-degree distribution when only off-site edges are considered
Fig. 2.2 In-degree distribution over May and October 1999 crawls
Fig. 2.3 Out-degree distribution when only off-site edges are considered
Fig. 2.4 Out-degree distribution over May and October 1999 crawls
Fig. 2.7 Cumulative distribution on the number of nodes reached when BFS is started from a random vertex and follows in-links
Fig. 2.8 Cumulative distribution on the number of nodes reached when BFS is started from a random vertex and follows out-links
Fig. 2.9 Cumulative distribution on the number of nodes reached when BFS is started from a random vertex and follows both in-links and out-links
Fig. 2.10 In-degree distribution plotted as power law and Zipf distribution
Fig. 2.20 Log-log plot of the number of pairs of nodes $$ P(h) $$ within $$ h $$ hops versus the number of hops $$ h $$ for Int-11-97
Fig. 2.21 Log-log plot of the number of pairs of nodes $$ P(h) $$ within $$ h $$ hops versus the number of hops $$ h $$ for Int-04-98
Fig. 2.22 Log-log plot of the number of pairs of nodes $$ P(h) $$ within $$ h $$ hops versus the number of hops $$ h $$ for Int-12-98
Fig. 2.23 Log-log plot of the number of pairs of nodes $$ P(h) $$ within $$ h $$ hops versus the number of hops $$ h $$ for Rout-95
Fig. 2.24 Log-log plot of the eigenvalues in decreasing order for Int-11-97
Fig. 2.25 Log-log plot of the eigenvalues in decreasing order for Int-04-98
Fig. 2.26 Log-log plot of the eigenvalues in decreasing order for Int-12-98
Fig. 2.27 Log-log plot of the eigenvalues in decreasing order for Rout-95
Fig. 3.3 A random graph generated using the Bollobás configuration model
Fig. 4.5 “Ideal” histogram of chain lengths in Fig. 4.4 by accounting for message attrition in Fig. 4.3
The source and target are chosen randomly; at each step, the message is forwarded to the contact geographically closest to the target; when no closer contact exists, a random person in the same city as the target is picked to pass the message to, and the chain fails only if there is no such person available
a The background friendship probability is estimated by computing the distance between randomly chosen pairs of people in the network. b The same data are plotted against distance
Fig. 4.17 The relationship between friendship probability and rank. c and b The same data are replotted (unaveraged and averaged, respectively), correcting for the background friendship probability
Fig. 4.18 Depiction of a regular network proceeding first to a small world
Fig. 5.1 Degree distribution of the global and US Facebook active users, alongside its CCDF
Fig. 5.2 Neighbourhood function showing the percentage of users that are
Fig. 5.7 Neighbor’s logins versus user’s logins to Facebook over a period of
Fig. 7.2 Two nodes who are friends but are both enemies of a third node
Fig. 7.8 A simplified labelling of the supernodes for the graph in Fig. 7.6
Fig. 7.10 Surprise values and predictions based on the competing theories of structural balance and status
Fig. 7.14 Helpfulness ratio declines with the absolute value of a review’s deviation from the computed star average. The line segments within the bars indicate the median helpfulness ratio; the bars depict the helpfulness ratio’s second and third quartiles. Grey bars indicate that the amount of data at that $$ x $$ value represents $$ 0.1\% $$ or less of the data depicted in the plot
Fig. 7.16 As the variance of the star ratings of reviews for a particular product increases, the median helpfulness ratio curve becomes two-humped and the helpfulness ratio at signed deviation $$ 0 $$ (indicated in red) no longer represents the unique global maximum
Fig. 7.17 Signed deviations vs. helpfulness ratio for variance = $$ 3 $$ , in the Japanese (left) and U.S. (right) data. The curve for Japan has a pronounced lean towards the left
Fig. 7.25 The Delta-similarity half-plane. Votes in each quadrant are treated as a group
Fig. 7.27 Accuracy of predicting the sign of an edge based on the signs of all the other edges in the network in a Epinions, b Slashdot and c Wikipedia
Fig. 7.28 Accuracy of predicting the sign of an edge based on the signs of all the other edges in the network in a Epinions, b Slashdot and c Wikipedia
Fig. 8.2 Initial network where all nodes exhibit the behaviour $$ B $$
Fig. 8.6 Initial network where all nodes have behaviour $$ B $$
Fig. 8.8 After three time steps there are no further cascades
Fig. 8.11 By dividing the a-c plot based on the payoffs, we get the regions corresponding to the different choices
Fig. 8.12 The payoffs to node $$ w $$ on an infinite path with neighbours
Fig. 8.13 The a-c plot shows the regions where the node chooses each of the possible strategies
Fig. 8.14 The plot shows the four possible outcomes for how a new behaviour spreads or fails to spread on the infinite path, indicated by this division of the a-c plot
Fig. 8.17 Contact network for branching process where high infection probability leads to widespread infection
Fig. 8.18 Contact network for branching process where low infection probability leads to the disappearance of the disease
Fig. 8.19 Repeated application of
Fig. 8.20 A contact network where each edge has an associated period of time denoting the time of contact between the connected vertices
Fig. 9.3 Results for the independent cascade model with probability
Fig. 9.4 Results for the independent cascade model with probability
Fig. 10.1 (Left) Performance of CELF algorithm and off-line and on-line bounds for PA objective function. (Right) Compares objective functions
Fig. 10.2 Heuristic blog selection methods. (Left) unit cost model, (Right) number of posts cost model
Fig. 10.3 (Left) CELF with offline and online bounds for PA objective. (Right) Different objective functions
Fig. 10.4 Water network sensor placements: (Left) when optimizing PA, sensors are concentrated in high population areas. (Right) when optimizing DL, sensors are uniformly spread out
Fig. 11.4 Average out-degree over time. The increasing trend signifies that the graph is densifying
Fig. 11.6 Effective diameter over time for different datasets. There is a consistent decrease of the diameter over time
Fig. 11.7 The fraction of nodes that are part of the giant connected component
Fig. 11.14 Edge gap distribution for a node to obtain the second edge, $$ \delta (1) $$ , and MLE power law with exponential cutoff fits
Fig. 12.1 Top: a “3-chain” and its Kronecker product with itself; each of the $$ X_{i} $$ nodes gets expanded into $$ 3 $$ nodes, which are then linked. Bottom: the corresponding adjacency matrices, along with the matrix for the fourth Kronecker power $$ G_{4} $$
Fig. 12.2 CIT-HEP-TH: Patterns from the real graph (top row), the deterministic Kronecker graph with $$ K_{1} $$ being a star graph on $$ 4 $$ nodes (center $$ +3 $$ satellites) (middle row), and the Stochastic Kronecker graph ( $$ \alpha = 0.41 $$ , $$ \beta = 0.11 $$ - bottom row). Static patterns: a is the PDF of degrees in the graph (log-log scale), and b the distribution of eigenvalues (log-log scale). Temporal patterns: c gives the effective diameter over time (linear-linear scale), and d is the number of edges versus number of nodes over time (log-log scale)
Fig. 12.6 A small network generated with the multifractal network generator. a The generating measure (on the left) and the link probability measure (on the right). The generating measure consists of $$ 3\times 3 $$ rectangles for which the magnitude of the associated probabilities is indicated by the colour. The number of iterations, $$ k $$ , is set to $$ k = 3 $$ , thus the final link probability measure consists of $$ 27\times 27 $$ boxes, as shown in the right panel. b A network with $$ 500 $$ nodes generated from the link probability measure. The colours of the nodes were chosen as follows. Each row in the final linking probability measure was assigned a different colour, and the nodes were coloured according to their position in the link probability measure. (Thus, nodes falling into the same row have the same colour)
Fig. 13.4 Re-weighting votes for the query “newspapers”: each of the labelled page’s new score is equal to the sum of the values of all lists that point to it
Fig. 13.6 Limiting hub and authority values for the query “newspapers”
Fig. 13.8 Equilibrium PageRank values for the network in Fig. 13.7
Fig. 14.1 Undirected graph with four vertices and four edges. Vertices $$ A $$ and $$ C $$ have mutual contacts $$ B $$ and $$ D $$ , while $$ B $$ and $$ D $$ have mutual friends $$ A $$ and $$ C $$
Fig. 14.4 Each edge of the graph in Fig. 14.3 is labelled either as a strong tie (S) or a weak tie (W). The labelling in the figure satisfies the Strong Triadic Closure property
Fig. 14.5 a Degree distribution. b Tie strength distribution. The blue line in a and b corresponds to $$P(x) = a(x+x_{0})^{-\gamma }exp(-x/x_{c})$$ , where x corresponds to either k or w . The parameter values for the fits in (A) are $$k_{0}=10.9$$ , $$\gamma _{k}=8.4$$ , $$k_{c}=\infty $$ , and for the fits in (B) are $$w_{0}=280, \gamma _{w}=1.9, w_{c}=3.45\times 10^{5}$$ . c Illustration of the overlap between two nodes, $$v_{i}$$ and $$v_{j}$$ , its value being shown for four local network configurations. d In the real network, the overlap $$\langle O\rangle _{w}$$ (blue circles) increases as a function of cumulative tie strength $$P_{cum}(w)$$ , representing the fraction of links with tie strength smaller than w . The dyadic hypothesis is tested by randomly permuting the weights, which removes the coupling between $$\langle O\rangle _{w}$$ and w (red squares). The overlap $$\langle O\rangle _{b}$$ decreases as a function of cumulative link betweenness centrality b (black diamonds)
Fig. 14.6 Each link represents mutual calls between the two users, and all
nodes are shown that are at distance less than six from the selected user,
marked by a circle in the center. a The real tie strengths, observed in the call
logs, defined as the aggregate call duration in minutes. b The dyadic
hypothesis suggests that the tie strength depends only on the relationship
between the two individuals. To illustrate the tie strength distribution in this
case, we randomly permuted tie strengths for the sample in a . c The weight
of the links assigned on the basis of their betweenness centrality
$$ b_{ij} $$ values for the sample in $$ A $$ as suggested by the global
efficiency principle. In this case, the links connecting communities have
high $$ b_{ij} $$ values (red), whereas the links within the communities
have low $$ b_{ij} $$ values (green)
Fig. 14.7 The control parameter f denotes the fraction of removed links. a
and c These graphs correspond to the case in which the links are removed on
the basis of their strengths ( $$w_{ij}$$ removal). b and d These graphs
correspond to the case in which the links were removed on the basis of their
overlap ( $$O_{ij}$$ removal). The black curves correspond to removing
first the high-strength (or high $$O_{ij}$$ ) links, moving toward the
weaker ones, whereas the red curves represent the opposite, starting with the
low-strength (or low $$O_{ij}$$ ) ties and moving toward the stronger
ones. a and b The relative size of the largest component
$$R_{GC}(f)=N_{GC}(f)/N_{GC}(f=0)$$ indicates that the removal of
the low $$w_{ij}$$ or $$O_{ij}$$ links leads to a breakdown of the
network, whereas the removal of the high $$w_{ij}$$ or $$O_{ij}$$
links leads only to the network’s gradual shrinkage. a Inset Shown is the
blowup of the high $$w_{ij}$$ region, indicating that when the low
$$w_{ij}$$ ties are removed first, the red curve goes to zero at a finite f
value. c and d According to percolation theory,
$$\tilde{S}=\sum _{s<s_{max}}n_{s}s^{2}/N$$ diverges for
Fig. 14.8 The dynamics of spreading on the weighted mobile call graph,
assuming that the probability for a node $$ v_{i} $$ to pass on the
information to its neighbour $$ v_{j} $$ in one time step is given by
$$ P_{ij}=xw_{ij} $$ , with $$ x=2.59\times 10^{-4} $$ . a The
fraction of infected nodes as a function of time $$ t $$ . The blue curve
(circles) corresponds to spreading on the network with the real tie strengths,
whereas the black curve (asterisks) represents the control simulation, in
which all tie strengths are considered equal. b Number of infected nodes as a
function of time for a single realization of the spreading process. Each steep
part of the curve corresponds to invading a small community. The flatter part
indicates that the spreading becomes trapped within the community. c and d
Distribution of strengths of the links responsible for the first infection for a
node in the real network ( c ) and control simulation ( d ). e and f Spreading
in a small neighbourhood in the simulation using the real weights (E) or the
control case, in which all weights are taken to be equal ( f ). The infection in
all cases was released from the node marked in red, and the empirically
observed tie strength is shown as the thickness of the arrows (right-hand
scale). The simulation was repeated 1,000 times; the size of the arrowheads
is proportional to the number of times that information was passed in the
given direction, and the colour indicates the total number of transmissions
on that link (the numbers in the colour scale refer to percentages of
$$ 1,000 $$ ). The contours are guides to the eye, illustrating the
difference in the information direction flow in the two simulations
Fig. 15.1 a Graph of the Zachary Karate Club network where nodes represent members and edges indicate friendship between members. b Two-dimensional visualization of node embeddings generated from this graph using the DeepWalk method. The distances between nodes in the embedding space reflect proximity in the original graph, and the node embeddings are spatially clustered according to the different colour-coded communities
Fig. 15.2 Graph of the Les Misérables novel where nodes represent characters and edges indicate interaction at some point in the novel between corresponding characters. (Left) Global positioning of the nodes. Same colour indicates that the nodes belong to the same community. (Right) Colour denotes structural equivalence between nodes, i.e., they play the same roles in their local neighbourhoods. Blue nodes are the articulation points. These equivalences were generated using the node2vec algorithm
List of Tables
Table 1.1 Table shows the vertices of the graph in Fig. 1.1 and their corresponding degrees
Table 1.2 Table shows the vertices of the graph in Fig. 1.2 and their corresponding in-degrees, out-degrees and degrees
Table 1.5 $$ In $$ and $$ Out $$ for all the vertices in Fig. 1.1
Table 1.6 $$ In $$ and $$ Out $$ for all the vertices in Fig. 1.2
Table 1.7 $$ In $$ and $$ Out $$ for all the vertices in Fig. 1.16
Table 1.8 $$ In $$ and $$ Out $$ for all the vertices in Fig. 1.17
Table 2.1 Size of the largest surviving weak component when links to pages with in-degree at least k are removed from the graph
A graph can shed light on several properties of a complex system, but this
depends on how faithfully the network represents the system. If the network
does not capture all the features of the system, then the graph will also fail
to describe all the properties lying therein. Therefore, the choice of a proper
network representation of the complex system is of primary importance.
Fig. 1.1 An undirected graph with four vertices and five edges
Fig. 1.2 A directed graph with four vertices and five edges
Tables 1.1 and 1.2 tabulate the degrees of all the vertices in Figs. 1.1 and
1.2 respectively.
Table 1.1
Table shows the vertices of the graph in Fig. 1.1 and their corresponding
degrees
Vertex Degree
A 3
B 3
C 2
D 2
Total 10
Table 1.2
Table shows the vertices of the graph in Fig. 1.2 and their corresponding in-
degrees, out-degrees and degrees
Vertex In-degree Out-degree Degree
A 0 3 3
B 1 2 3
C 2 0 2
D 2 0 2
Total 5 5 10
A vertex with zero in-degree is called a source vertex , a vertex with zero
out-degree is called a sink vertex , and a vertex with in-degree and out-
degree both equal to zero is called an isolated vertex .
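As a rough illustration of these definitions (again using networkx rather than SNAP), the directed graph of Fig. 1.2 can be classified vertex by vertex; the vertex names and edges come from that figure, and the helper logic is only a sketch.

```python
# Sketch: classify vertices of the directed graph in Fig. 1.2 as
# source (in-degree 0), sink (out-degree 0) or isolated (both 0).
import networkx as nx

D = nx.DiGraph([("A", "B"), ("B", "C"), ("B", "D"), ("A", "C"), ("A", "D")])

for v in sorted(D.nodes()):
    din, dout = D.in_degree(v), D.out_degree(v)
    if din == 0 and dout == 0:
        kind = "isolated"
    elif din == 0:
        kind = "source"
    elif dout == 0:
        kind = "sink"
    else:
        kind = "internal"
    print(v, din, dout, kind)   # e.g. A 0 3 source, C 2 0 sink
```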
Figure 1.8 depicts the undirected bipartite graph with its corresponding
folded graph and Fig. 1.9 illustrates the directed bipartite graph and its
corresponding folded graph.
Fig. 1.8 An undirected bipartite graph with its corresponding folded graph
Fig. 1.9 A directed bipartite graph with its corresponding folded graph
The edge list for both the graph in Fig. 1.1 and for the graph in Fig. 1.2 is
given by {(A,B), (B,C), (B,D), (A,C), (A,D)}.
Table 1.3
A: [B, C, D]
B: [A, C, D]
C: [A, B]
D: [A, B]
Table 1.4
A: [ $$\phi $$ ]
B: [A]
C: [A, B]
D: [A, B]
The adjacency list for the graph in Fig. 1.1 and that for the graph in Fig. 1.2
is as shown in Tables 1.3 and 1.4 respectively.
When the graphs are small, there is not a notable difference between any of
these representations. However, when the graph is large and sparse (as in
the case of most real world networks), the adjacency matrix will be a large
matrix filled mostly with zeros. This wastes a great deal of space and
computation time. Although edge lists are concise, they are not the best data
structure for most graph algorithms. When dealing with such graphs,
adjacency lists are comparatively more effective. Adjacency lists can be
easily implemented in most programming languages as a hash table with
keys for the source vertices and a vector of destination vertices as values.
Working with this implementation can save a lot of computation time.
SNAP uses this hash table and vector representation for storing graphs [1].
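A minimal dictionary-based sketch of this idea, a hash table keyed by source vertex with a list of destination vertices as its value, might look as follows; it is illustrative only and not the SNAP implementation itself, and it reuses the edge list of Fig. 1.2.

```python
# Sketch: adjacency-list storage as a hash table (dict) of vectors (lists),
# built from the edge list of the directed graph in Fig. 1.2.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("A", "C"), ("A", "D")]

adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)   # out-neighbours of u
    adj.setdefault(v, [])             # make sure sinks also get a key

print(adj)   # {'A': ['B', 'C', 'D'], 'B': ['C', 'D'], 'C': [], 'D': []}
```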
Ranking: best friend, second best friend, third best friend, and so on.
Based on edge attributes, the number of edges between vertices and the
source and destination of an edge, graphs can be further classified.
Unweighted graphs are graphs where the edges do not have any associated
weights. Figures 1.1 and 1.2 are instances of an undirected unweighted
graph and a directed unweighted graph respectively.
Weighted graphs are graphs where the edges are associated with a certain
weight. Figures 1.10 and 1.11 depict an undirected weighted graph and a
directed weighted graph respectively.
Fig. 1.10 An undirected weighted graph with 4 vertices and 5 weighted edges
Fig. 1.11
A directed weighted graph with 4 vertices and 5 weighted edges
The adjacency matrix for the graph in Fig. 1.10 is given below
$$ \begin{bmatrix} 0&1&2.5&13.75 \\
1&0&0.3&100 \\ 2.5&0.3&0&0 \\
13.75&100&0&0 \\ \end{bmatrix} $$
The adjacency matrix for the graph in Fig. 1.11 is as depicted below
$$ \begin{bmatrix} 0&1&2.5&13.75 \\
0&0&0.3&100 \\ 0&0&0&0 \\
0&0&0&0 \\ \end{bmatrix} $$
Undirected weighted graphs satisfy properties in Eqs. 1.7, 1.9 and 1.13.
$$\begin{aligned} |E| = \frac{1}{2}\sum \limits
_{i,j=1}^{|V|}nonzero(A_{ij}) \end{aligned}$$
(1.13)
Directed weighted graphs on the other hand satisfy properties in Eqs. 1.7,
1.10 and 1.14.
$$\begin{aligned} |E| = \sum \limits _{i,j=1}^{|V|}nonzero(A_{ij})
\end{aligned}$$
(1.14)
Collaborations and transportation networks are some examples of weighted
graphs.
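Equations 1.13 and 1.14 can be checked directly on the two adjacency matrices above. The short numpy sketch below is an illustration (not the book's code): it counts the non-zero entries and halves the count in the undirected case.

```python
# Sketch: |E| from a weighted adjacency matrix via Eqs. 1.13 and 1.14.
import numpy as np

A_undirected = np.array([[0, 1, 2.5, 13.75],
                         [1, 0, 0.3, 100],
                         [2.5, 0.3, 0, 0],
                         [13.75, 100, 0, 0]])

A_directed = np.array([[0, 1, 2.5, 13.75],
                       [0, 0, 0.3, 100],
                       [0, 0, 0, 0],
                       [0, 0, 0, 0]])

print(np.count_nonzero(A_undirected) // 2)   # Eq. 1.13: |E| = 5
print(np.count_nonzero(A_directed))          # Eq. 1.14: |E| = 5
```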
Self-loops are defined as edges whose source and destination vertices are
the same. More formally, an edge $$e\in E$$ is called a self-looped edge
if $$e = (u,u)$$ where $$u\in V$$. A graph that contains one or more
self-loops is called a self-looped graph. Figures 1.12 and 1.13 illustrate an
undirected self-looped graph and a directed self-looped graph respectively.
Fig. 1.12 An undirected self-looped graph
The adjacency matrix for the graph in Fig. 1.12 is as shown below
$$ \begin{bmatrix} 1&1&1&1 \\
1&1&1&1 \\ 1&1&1&0 \\
1&1&0&1 \\ \end{bmatrix} $$
The adjacency matrix for the graph in Fig. 1.13 is illustrated as follows
$$ \begin{bmatrix} 1&1&1&1 \\
0&1&1&1 \\ 0&0&1&0 \\
0&0&0&1 \\ \end{bmatrix} $$
Undirected self-looped graphs satisfy properties in Eqs. 1.8, 1.9 and 1.15.
$$\begin{aligned} |E| = \frac{1}{2}\sum \limits _{i,j=1;i\ne
j}^{|V|}A_{ij} + \sum \limits _{i=1}^{|V|}A_{ii} \end{aligned}$$
(1.15)
Directed self-looped graphs on the other hand satisfy properties in Eqs. 1.8,
1.10 and 1.16.
$$\begin{aligned} |E| = \sum \limits _{i,j=1;i\ne j}^{|V|}A_{ij} + \sum
\limits _{i=1}^{|V|}A_{ii} \end{aligned}$$
(1.16)
Protein interaction networks and hyperlink networks are commonly encountered
examples of self-looped graphs.
1.9.4 Multigraphs
A multigraph is a graph where multiple edges may share the same source
and destination vertices. Figures 1.14 and 1.15 are instances of undirected
and directed multigraphs respectively.
Fig. 1.14 An undirected multigraph
The adjacency matrix for the graph in Fig. 1.14 is as shown below
$$ \begin{bmatrix} 0&1&2&1 \\
1&0&1&3 \\ 2&1&0&3 \\
1&3&3&0 \\ \end{bmatrix} $$
The adjacency matrix for the graph in Fig. 1.15 is illustrated as follows
$$ \begin{bmatrix} 0&1&2&1 \\
0&0&1&2 \\ 0&0&0&1 \\
0&1&2&0 \\ \end{bmatrix} $$
Undirected multigraphs satisfy properties in Eqs. 1.7, 1.9 and 1.13.
Directed multigraphs on the other hand satisfy properties in Eqs. 1.7, 1.10
and 1.14.
1.10 Path
A path from a vertex u to a vertex v is defined either as a sequence of
vertices in which each vertex is linked to the next, {u, $$u_{1}$$,
$$u_{2}$$, $$\ldots $$, $$u_{k}$$,v} or as a sequence of edges {(u,
$$u_{1}$$), ( $$u_{1}$$, $$u_{2}$$), $$\ldots $$, ( $$u_{k}$$
,v)}. A path can pass through the same edge multiple times. A path that
does not contain any repetition in either the edges or the nodes is called a
simple path . Following a sequence of such edges gives us a walk through
the graph from a vertex u to a vertex v. A path from u to v does not
necessarily imply a path from v to u.
The path {A,B,C} is a simple path in both Figs. 1.1 and 1.2. {A,B,C,B,D} is
a path in Fig. 1.1 while there are no non-simple paths in Fig. 1.2.
1.11 Cycle
A cycle is defined as a path with atleast three edges, in which the first and
the last vertices are the same, but otherwise all other vertices are distinct.
The path {A,B,C,A} and {A,B,D,A} are cycles in Fig. 1.1.
1.13 Distance
The distance between a pair of vertices u and v is defined as the number of
edges along the shortest path connecting u and v. If two nodes are not
connected, the distance is usually defined as infinity. Distance is symmetric
in undirected graphs but not necessarily symmetric in directed graphs.
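A small sketch of this definition, assuming networkx, computes the distance by breadth-first search and falls back to infinity when no path exists; the extra vertex E is added purely for illustration.

```python
# Sketch: distance between two vertices as the number of edges on the
# shortest path, with infinity when the pair is not connected.
import math
import networkx as nx

G = nx.Graph([("A", "B"), ("B", "C"), ("B", "D"), ("A", "C"), ("A", "D")])
G.add_node("E")                       # an extra, disconnected vertex

def distance(graph, u, v):
    try:
        return nx.shortest_path_length(graph, u, v)   # BFS on unweighted graphs
    except nx.NetworkXNoPath:
        return math.inf

print(distance(G, "A", "C"))   # 1
print(distance(G, "C", "D"))   # 2 (via B or A)
print(distance(G, "A", "E"))   # inf
```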
1.15 Diameter
The diameter of a graph is the maximum of the distances between all pairs
of vertices in this graph. While computing the diameter of a graph, all
infinite distances are disregarded.
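Building on the previous definition, a sketch of the diameter computation (again with networkx, disregarding infinite distances as the text prescribes) is shown below; the helper name is of course arbitrary.

```python
# Sketch: diameter as the maximum finite shortest-path distance over all pairs.
import networkx as nx

def diameter_ignoring_infinite(graph):
    best = 0
    for source, lengths in nx.all_pairs_shortest_path_length(graph):
        # 'lengths' only contains reachable targets, so infinite
        # distances are disregarded automatically.
        best = max(best, max(lengths.values()))
    return best

G = nx.Graph([("A", "B"), ("B", "C"), ("B", "D"), ("A", "C"), ("A", "D")])
print(diameter_ignoring_infinite(G))   # 2
```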
Fig. 1.16
Fig. 1.17
Giant Component
A giant component is a connected component that contains a significant
fraction of all the vertices in the graph.
Bridge Edge
A bridge edge is an edge whose removal disconnects the graph. Every edge
in Fig. 1.17 is a bridge edge. Fig. 1.1 has no single bridge edge, but removing
the combination of edges (A,B), (A,C), (A,D); or (A,C), (B,C); or (A,D),
(B,D) disconnects the graph.
Articulation Vertex
An articulation vertex is a vertex whose removal disconnects the graph.
For a graph G(V, E), we define the In and Out of a vertex $$v\in V$$ as
given in Eqs. 1.19 and 1.20.
$$\begin{aligned} In(v) = \{ w\in V |\ there\ exists\ a\ path\ from\ w\ to\ v
\} \end{aligned}$$
(1.19)
$$\begin{aligned} Out(v) = \{ w\in V |\ there\ exists\ a\ path\ from\ v\ to\
w \} \end{aligned}$$
(1.20)
In other words if Eq. 1.21 is satisfied by a directed graph, then this graph is
said to be a strongly connected directed graph. This means that a weakly
connected directed graph must satisfy Eq. 1.22.
$$\begin{aligned} In(v) = Out(v)\ \forall \ v\in V \end{aligned}$$
(1.21)
$$\begin{aligned} In(v) \ne Out(v)\ \forall \ v\in V \end{aligned}$$
(1.22)
Tables 1.5, 1.6, 1.7 and 1.8 tabulate the In and Out for all the vertices in
each of these graphs. From these tables, we observe that Fig. 1.17 is
strongly connected and Fig. 1.2 is weakly connected because they satisfy
Eqs. 1.21 and 1.22 respectively.
Table 1.5
Vertex In Out
A A,B,C,D A,B,C,D
B A,B,C,D A,B,C,D
C A,B,C,D A,B,C,D
D A,B,C,D A,B,C,D
Table 1.6
Vertex In Out
A $$\phi $$ B,C,D
B A C,D
C A,B $$\phi $$
D A,B $$\phi $$
Table 1.7
Vertex In Out
A A,B,C A,B,C
B A,B,C A,B,C
C A,B,C A,B,C
D $$\phi $$ $$\phi $$
Table 1.8
Vertex In Out
A A,B,C,D A,B,C,D
B A,B,C,D A,B,C,D
C A,B,C,D A,B,C,D
D A,B,C,D A,B,C,D
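The In and Out sets of Eqs. 1.19 and 1.20 can be reproduced with backward and forward reachability searches. The sketch below (networkx, with the vertices of Fig. 1.2) mirrors the entries of Table 1.6 and then checks strong connectivity in the sense of Eq. 1.21.

```python
# Sketch: In(v) (Eq. 1.19) and Out(v) (Eq. 1.20) via backward/forward
# reachability, and a strong-connectivity check based on Eq. 1.21.
import networkx as nx

D = nx.DiGraph([("A", "B"), ("B", "C"), ("B", "D"), ("A", "C"), ("A", "D")])

for v in sorted(D.nodes()):
    in_v = nx.ancestors(D, v)      # vertices with a path *to* v
    out_v = nx.descendants(D, v)   # vertices reachable *from* v
    print(v, sorted(in_v), sorted(out_v))

print(nx.is_strongly_connected(D))   # False: Fig. 1.2 is only weakly connected
```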
Fig. 1.18
Theorem 1
1.
For the proof of this theorem, we will use the graphs G in Fig. 1.18 and
$$G'$$ in Fig. 1.19.
Fig. 1.19 A weakly connected graph whose vertices are the SCCs of $$ G $$ and whose edges exist in $$ G^{\prime } $$ because there is an edge between the corresponding SCCs in $$ G $$
Proof
1.
2. Assume that $$G'$$ is not a DAG, i.e., there exists a directed cycle
Fig. 1.20
Fig. 1.21
A graph with the same vertices and edges as shown in Fig. 1.19
with the exception that there exists an edge between D and E to
make the graph a strongly connected one
A $$\frac{2}{3}$$
B $$\frac{2}{3}$$
C 3
D 3
Problems
Download the email-Eu-core directed network from the SNAP dataset
repository available at http://snap.stanford.edu/data/email-Eu-core.html.
For this network, compute the following:
Number of nodes
Number of edges
In-degree distribution
Out-degree distribution
Diameter
Number of nodes in In(v) for five random nodes
References
1.
Leskovec, Jure, and Rok Sosič. 2016. Snap: A general-purpose
network analysis and graph-mining library. ACM Transactions on
Intelligent Systems and Technology (TIST) 8 (1): 1.
In this chapter, we will take a close look at the Web when it is represented
in the form of a graph and attempt to understand its structure. We will begin
by looking at why we should be interested in this problem concerning the
Web’s structure. There are numerous reasons as to
why the structure of the Web is worth studying, but the most prominent
ones are as follows: the Web is a large system that evolved naturally over
time, understanding such a system’s organization and properties could help
better comprehend several other real-world systems; the study could yield
valuable insight into Web algorithms for crawling, searching and
community discovery, which could in turn help improve strategies to
accomplish these tasks; we could gain insight into the sociological
phenomena characterising content creation in its evolution; the
study could help predict the evolution of known or new web structures and
lead to the development of better algorithms for discovering and organizing
them; and we could predict the emergence of new phenomena in the Web.
Bowtie Structure of the Web
Reference [1] represented the Web as a directed graph where the
webpages are treated as vertices and hyperlinks are the edges and studied
the following properties in this graph: diameter, degree distribution,
connected components and macroscopic structure. However, the dark Web
(the part of the Web composed of webpages that are not directly accessible,
even by Web browsers) was disregarded. This study consisted
of performing web crawls on a snapshot of the graph consisting of 203
million URLs connected by 1.5 billion hyperlinks. The web crawl is based
on a large set of starting points accumulated over time from various
sources, including voluntary submissions. A 465 MHz server with 12 GB of
memory was dedicated for this purpose.
Reference [1] took a large snapshot of the Web and using Theorem 1,
attempted to understand how its SCCs fitted together as a DAG.
Power Law
A power law is a functional relationship between two quantities in which
a relative change in one quantity results in a proportional relative change
in the other quantity, independent of the initial size of those quantities,
i.e., one quantity varies as a power of the other. Hence the name
power law [3]. A power law distribution defined on the positive integers
assigns to the value i a probability proportional to $$\frac{1}{i^k}$$ for
a small positive number k.
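A quick way to get a feel for such a distribution is to sample from it and compare the empirical frequencies with the $$\frac{1}{i^k}$$ form. The sketch below uses illustrative values (k = 2.5, support 1..1000) chosen purely for demonstration.

```python
# Sketch: sampling from a power-law distribution P(i) ~ 1 / i^k on the
# integers 1..N and comparing empirical frequencies with the model.
import random
from collections import Counter

k, N, samples = 2.5, 1000, 100_000
weights = [1 / i**k for i in range(1, N + 1)]
values = random.choices(range(1, N + 1), weights=weights, k=samples)

counts = Counter(values)
for i in [1, 2, 4, 8, 16]:
    print(i, counts[i] / samples)   # frequencies drop roughly as 1 / i^2.5
```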
2.1 Algorithms
In this section we will look at several algorithms used by [1] in their
experiments.
The Web crawl proceeds in roughly a BFS manner, subject to various rules
designed to avoid overloading, infinite paths, spam, time-outs, etc. Each
build is based on crawl data after further filtering and processing. Due to
multiple starting points, it is possible for the resulting graph to have several
connected components.
Three sets of experiments on these web crawls were performed from May
1999 to October 1999.
Fig. 2.2 In-degree distribution over May and October 1999 crawls
Fig. 2.3 Out-degree distribution when only off-site edges are considered
Fig. 2.4 Out-degree distribution over May and October 1999 crawls
Fig. 2.5 Distribution of WCCs on the Web
However, there still exists the question as to whether this widespread
connectivity results from a few vertices of large in-degree. The answer is
tabulated in Table 2.1, which shows the size of the largest surviving weak
component when links to pages with in-degree at least k are removed from
the graph.
Table 2.1 Size of the largest surviving weak component when links to pages with in-degree at least k
are removed from the graph
k 1000 100 10 5 4 3
Size (millions) 177 167 105 59 41 15
Table 2.1 gives us the following insights: the connectivity of the Web graph
is extremely resilient and does not depend on the existence of vertices of
high in-degree; vertices that are useful, such as those with a high PageRank
or those considered good hubs or authorities, are embedded in a graph that
remains well connected without them.
To understand what the giant component is composed of, it was
subjected to the SCC algorithm. The algorithm returned a single large SCC
consisting of 56 million vertices, which amounts to barely 28% of all the
vertices in the crawl. This corresponds to all of the vertices that can reach
one another along the directed edges situated at the heart of the Web graph.
The diameter of this component is at least 28. The distribution of the sizes
of the SCCs also obeys a power law with an exponent of 2.5, as observed in
Fig. 2.6.
Fig. 2.6 Distribution of SCCs on the Web
Fig. 2.7 Cumulative distribution on the number of nodes reached when BFS is started from a random vertex and follows in-links
Fig. 2.8 Cumulative distribution on the number of nodes reached when BFS is started from a random vertex and follows out-links
Fig. 2.9 Cumulative distribution on the number of nodes reached when BFS is started from a random vertex and follows both in-links and out-links
Zipf’s Law
Zipf’s law states that the frequency of occurrence of a certain value is
inversely proportional to its rank in the frequency table [3]. The in-degree
distribution fits a Zipf distribution more closely than a power law, as is
evident from Fig. 2.10.
Fig. 2.10 In-degree distribution plotted as power law and Zipf distribution
From the previous set of experiments, we got the giant undirected
component, the $$DISCONNECTED\ COMPONENTS$$ and the
SCC. The 100 million vertices whose forward BFS traversals exploded
correspond to either the SCC or a component called IN. Since the SCC
accounts for 56 million vertices, this leaves 44 million vertices
( $$\approx {22\%}$$ ) for IN. On similar lines, the 100 million
vertices whose backward BFS traversals exploded correspond to either the
SCC or a component called OUT, which will also have 44 million vertices
( $$\approx {22\%}$$ ). This leaves us with roughly 44 million vertices
( $$\approx {22\%}$$ ) which are not yet accounted for. These
vertices were placed in a component called TENDRILS. These components
altogether form the famous bowtie structure (Fig. 2.11) of the Web.
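The same component bookkeeping can be reproduced on any directed graph: find the largest SCC, then use forward and backward reachability from it to separate OUT, IN and the remainder (TENDRILS plus disconnected pieces). The sketch below is a schematic illustration with networkx and a synthetic random graph, not the code of [1]; the function name and parameters are arbitrary.

```python
# Sketch: rough bowtie decomposition of a directed graph -- largest SCC,
# IN (reaches the SCC), OUT (reachable from the SCC), and the rest.
import networkx as nx

def bowtie(D):
    scc = max(nx.strongly_connected_components(D), key=len)
    seed = next(iter(scc))
    out_side = nx.descendants(D, seed) - scc     # reachable from the SCC
    in_side = nx.ancestors(D, seed) - scc        # can reach the SCC
    rest = set(D) - scc - out_side - in_side     # tendrils + disconnected pieces
    return scc, in_side, out_side, rest

D = nx.gnp_random_graph(2000, 0.001, directed=True, seed=42)
scc, in_side, out_side, rest = bowtie(D)
print(len(scc), len(in_side), len(out_side), len(rest))
```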
Fig. 2.11 Bowtie structure of the graph of the Web
Fig. 2.12 Log-log plot of the out-degree $$d_{v}$$ as a function of the rank $$r_{v}$$ in the sequence of decreasing out-degree for Int-11-97
Fig. 2.13 Log-log plot of the out-degree $$d_{v}$$ as a function of the rank $$r_{v}$$ in the sequence of decreasing out-degree for Int-04-98
Fig. 2.14 Log-log plot of the out-degree $$d_{v}$$ as a function of the rank $$r_{v}$$ in the sequence of decreasing out-degree for Int-12-98
Fig. 2.15 Log-log plot of the out-degree $$d_{v}$$ as a function of the rank $$r_{v}$$ in the sequence of decreasing out-degree for Rout-95
Fig. 2.16 Log-log plot of frequency $$f_{d}$$ versus the out-degree d for Int-11-97
Fig. 2.17 Log-log plot of frequency $$f_{d}$$ versus the out-degree d for Int-04-98
Fig. 2.18 Log-log plot of frequency $$f_{d}$$ versus the out-degree d for Int-12-98
Fig. 2.19 Log-log plot of frequency $$f_{d}$$ versus the out-degree d for Rout-95
Fig. 2.20 Log-log plot of the number of pairs of nodes P(h) within h hops versus the number of hops h for Int-11-97
Fig. 2.21 Log-log plot of the number of pairs of nodes P(h) within h hops versus the number of hops h for Int-04-98
Fig. 2.22 Log-log plot of the number of pairs of nodes P(h) within h hops versus the number of hops h for Int-12-98
Fig. 2.23 Log-log plot of the number of pairs of nodes P(h) within h hops versus the number of hops h for Rout-95
Fig. 2.24 Log-log plot of the eigenvalues in decreasing order for Int-11-97
Fig. 2.25 Log-log plot of the eigenvalues in decreasing order for Int-04-98
Fig. 2.26 Log-log plot of the eigenvalues in decreasing order for Int-12-98
Fig. 2.27 Log-log plot of the eigenvalues in decreasing order for Rout-95
Problems
Download the Epinions directed network from the SNAP dataset repository
available at http://snap.stanford.edu/data/soc-Epinions1.html.
For this dataset compute the structure of this social network using the
same methods as Broder et al. employed.
22 Compute the in-degree and out-degree distributions and plot the power
law for each of these distributions.
23 Choose 100 nodes at random from the network and do one forward
and one backward BFS traversal for each node. Plot the cumulative
distributions of the nodes covered in these BFS runs as shown in Fig. 2.7.
Create one figure for the forward BFS and one for the backward BFS. How
many nodes are in the OUT and IN components? How many nodes are in
the TENDRILS component?
(Hint: The forward BFS plot gives the number of nodes in SCC
$$+$$OUT and similarly, the backward BFS plot gives the number of
nodes in SCC $$+$$IN).
24 What is the probability that a path exists between two nodes chosen
uniformly from the graph? What if the node pairs are only drawn from the
WCC of the two networks? Compute the percentage of node pairs that were
connected in each of these cases.
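One possible way to approach the last question is to sample node pairs and test reachability. The sketch below assumes networkx and a hypothetical local file name ('soc-Epinions1.txt') for the downloaded edge list; the trial count is arbitrary.

```python
# Sketch: estimate the probability that a path exists between two nodes
# chosen uniformly at random from a directed network (cf. Problem 24).
import random
import networkx as nx

D = nx.read_edgelist("soc-Epinions1.txt", create_using=nx.DiGraph, nodetype=int)

nodes = list(D.nodes())
trials, connected = 10_000, 0
for _ in range(trials):
    u, v = random.sample(nodes, 2)
    if nx.has_path(D, u, v):
        connected += 1

print(connected / trials)   # estimated probability of a directed path u -> v
```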
References
1. Broder, Andrei, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan,
Raymie Stata, Andrew Tomkins, and Janet Wiener. 2000. Graph structure in the web. Computer
Networks 33 (1–6): 309–320.
2. Faloutsos, Michalis, Petros Faloutsos, and Christos Faloutsos. 1999. On power-law relationships
of the internet topology. In ACM SIGCOMM computer communication review, vol. 29, 251–262,
ACM.
3. Newman, Mark E.J. 2005. Power laws, Pareto distributions and Zipf's law. Contemporary Physics
46 (5): 323–351.
4. Pansiot, Jean-Jacques, and Dominique Grad. 1998. On routes and multicast trees in the internet.
ACM SIGCOMM Computer Communication Review 28 (1): 41–50.
Let $$G_{|V|,|E|}$$ denote the set of all graphs having |V| vertices,
$$v_{1}, v_{2},\ldots , v_{|V|}$$ and |E| edges. These graphs must not
have self-edges or multiple edges. Thus a graph belonging to
$$G_{|V|,|E|}$$ is obtained by choosing |E| out of the possible
$${|V|\atopwithdelims ()2}$$ edges between the vertices
$$v_{1}, v_{2}, \ldots , v_{|V|}$$, and therefore the number of elements
of $$G_{|V|,|E|}$$ is equal to
$${{|V|\atopwithdelims ()2}\atopwithdelims ()|E|}$$. A random graph
$$\Gamma _{|V|,|E|}$$ can be defined as an element of
$$G_{|V|,|E|}$$ chosen at random, so that each of the elements of
$$G_{|V|,|E|}$$ has the same probability of being chosen, namely
$$\frac{1}{{{|V|\atopwithdelims ()2}\atopwithdelims ()|E|}}$$.
n, p and m do not uniquely determine the graphs. Since the graph is the
result of a random process, we can have several different realizations for
the same values of n, p and m. Figure 3.1 gives instances of undirected
graphs generated using this model and Fig. 3.2 gives examples of directed
graphs generated using the same model. Each of these graphs has
$$n = 5$$ and $$m = 5$$.
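A quick sketch of both variants with networkx (illustrative parameters only): the G(n, m) generator fixes the number of edges exactly, while G(n, p) fixes only the edge probability, so repeated runs generally produce different realizations.

```python
# Sketch: generating Erdos-Renyi random graphs in the G(n, m) and
# G(n, p) variants; repeated calls give different realizations.
import networkx as nx

n, m, p = 5, 5, 0.5

G_nm = nx.gnm_random_graph(n, m, seed=1)            # exactly m edges
G_np = nx.gnp_random_graph(n, p, seed=1)            # each edge present with probability p
D_nm = nx.gnm_random_graph(n, m, seed=1, directed=True)

print(G_nm.number_of_edges())   # always 5
print(G_np.number_of_edges())   # varies around p * n * (n - 1) / 2
print(D_nm.is_directed())       # True
```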
Fig. 3.1 Undirected random graphs generated with $$n = 5$$ and $$m = 5$$
Fig. 3.2 Directed random graphs generated with $$n = 5$$ and $$m = 5$$
Theorem 2
Theorem 3
Theorem 4
If $$\prod _{k}(|V|,|E_{c}|)$$ denotes the probability of
$$\Gamma _{|V|,|E|}$$ consisting of exactly $$k+1$$ disjoint
components ( $$\prod _{0}(|V|,|E_{c}|) = P_{0}(|V|,|E_{c}|)$$ ), then we
get Eq. 3.3
$$\begin{aligned} \lim _{|V|\rightarrow \infty }\prod _{k}(|V|,|E_{c}|) =
\frac{(e^{-2c})^{k}e^{-e^{-2c}}}{k!} \end{aligned}$$
(3.3)
i.e., the number of components of $$\Gamma _{|V|,|E|}$$ diminished by
one is, in the limit, distributed according to Poisson's law with mean value
$$e^{-2c}$$.
Theorem 5
3.2.1 Properties
3.2.1.1 Edges
Let P(|E|) denote the probability that $$G_{n,p}$$ generates a graph on
|E| edges.
$$\begin{aligned} P(|E|) = {E_{max}\atopwithdelims ()|E|}p^{|E|}(1-
p)^{E_{max}-|E|} \end{aligned}$$
(3.6)
From Eq. 3.6, we observe that P(|E|) follows a binomial distribution with
mean of $$pE_{max}$$ and variance of $$p(1-p)E_{max}$$.
3.2.1.2 Degree
3.2.1.5 Diameter
3.2.2 Drawbacks of $$G_{n,p}$$
The exercises in this chapter will require computations on both the real
world graph and $$G_{n,p}$$.
1. Start with |V| vertices and draw d half-edges emanating from each of
these vertices, so that the ends of these half-edges are all distinct.
2. Pick two of the remaining half-edges uniformly at random and join them
to form an edge.
3. Repeat the previous step until no half-edges remain.
../images/462433_1_En_3_Chapter/462433_1_En_3_Fig3_HTML.
gif
Fig. 3.3
The graph is simple if and only if the following conditions are satisfied,
1.
2.
3.
This process will correctly generate random graphs with the desired
properties. However, the expected number of edges between two vertices
will often exceed one, which makes it unlikely that the procedure will run
to completion except in the rarest of cases. To obviate this problem, a
modification of the method can be used in which, following the selection of
a pair of half-edges that would create a multiple edge, an alternate half-edge
pair is randomly selected instead of discarding the graph. Although the
method generates a biased sample of the possible graphs, the bias does not
significantly affect the required purpose.
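A sketch of this construction with networkx: `configuration_model` performs the half-edge matching and returns a multigraph, which can then be simplified (removing the multiple edges and self-loops discussed above) at the cost of slightly perturbing the degree sequence. The degree sequence used here is an arbitrary illustration.

```python
# Sketch: configuration-model graph from a prescribed degree sequence,
# followed by removal of multiple edges and self-loops.
import networkx as nx

degree_sequence = [3, 3, 2, 2, 2, 2]                  # must have an even sum
M = nx.configuration_model(degree_sequence, seed=7)   # MultiGraph with stubs matched

G = nx.Graph(M)                               # collapse parallel edges
G.remove_edges_from(nx.selfloop_edges(G))     # drop self-loops

print(sorted(d for _, d in M.degree()))       # matches the prescribed sequence
print(sorted(d for _, d in G.degree()))       # may differ slightly after cleaning
```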
3.5.4 Comparison
From these points it is evident that any of these methods are adequate for
generating suitable random graphs to act as null models. Although the “go
with the winners” and the switching algorithms, while slower, are clearly
more satisfactory theoretically, the matching algorithm gives better results
on real-world problems.
Reference [2] argues in favour of using the switching method, with the “go
with the winners” method finding limited use as a check on the accuracy of
sampling.
Reference [3] shows how, using generating functions, one can calculate
exactly many of the statistical properties of random graphs generated from
prescribed degree sequences in the limit of a large number of vertices.
Additionally, the position at which the giant component forms, the size of
the giant component, the average and the distribution of the sizes of the
other components, the average number of vertices at a certain distance from
a given vertex, the clustering coefficient and the typical vertex-vertex
distances are explained in detail.
Problems
Download the Astro Physics collaboration network from the SNAP dataset
repository available at http://snap.stanford.edu/data/ca-AstroPh.html. This
co-authorship network contains 18772 nodes and 198110 edges.
Generate the graph for this dataset (we will refer to this graph as the real
world graph).
Erdös–Rényi random graph (G(n, m)): Generate a random instance of this
model by using the same number of nodes and edges as the real world graph.
For each of the real world graph, the Erdös–Rényi graph and the
configuration model graph, compute the following:
Degree distributions
For each of these distributions, state whether or not the random models
have the same property as the real world graph.
Are the random graph generators capable of generating graphs that are
representative of real world graphs?
References
1. Ellis, David. 2011. The expansion of random regular graphs, Lecture notes. Lent.
2. Milo, Ron, Nadav Kashtan, Shalev Itzkovitz, Mark E.J. Newman, and Uri Alon. 2003. On the uniform generation of random graphs with prescribed degree sequences. arXiv:cond-mat/0312028.
3. Newman, Mark E.J., Steven H. Strogatz, and Duncan J. Watts. 2001. Random graphs with arbitrary degree distributions and their applications. Physical Review E 64 (2): 026118.
4. Erdös, Paul, and Alfréd Rényi. 1959. On random graphs, I. Publicationes Mathematicae (Debrecen) 6: 290–297.
5. Erdös, Paul, and Alfréd Rényi. 1960. On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences 5: 17–61.
We have all had the experience of encountering someone far from home,
who turns out to share a mutual acquaintance with us. Then comes the
cliché, “My, it's a small world”. Similarly, there exists a speculative idea
that the distance between vertices in a very large graph is surprisingly
small, i.e., the vertices co-exist in a “small world”. Hence the term
small-world phenomenon.
The volunteers were sent a folder containing a document and a set of tracer
cards, with the name and some relevant information concerning their target.
The following rules were set: (i) The subjects could only send the folder to
individuals they knew on a first-name basis. If the subject did not know the
target on this basis, they could send it to only one other person who met
this condition and was more likely to know the target. (ii) The folder
contained a roster on which the subject and all subsequent recipients had to
write their name. This roster told the recipients all the people who were
part of the chain and thereby prevented endless loops. (iii) Every participant
had to fill out in the tracer card their relationship to the person they were
sending it to, and send it back to the authors. This helped the authors keep
track of the complete and incomplete links.
Kansas Study
The Kansas study started with 145 participants and showed a certain
pattern: 56 of the chains involved a female sending the folder to another
female, 58 of them involved a male sending the folder to a male, 18 of
them involved a female sending the folder to a male, and 13 of the chains
had a male passing the folder to a female. This shows a three times greater
tendency for a same-gender passing. 123 of these participants were
observed to send the folder to friends and acquaintances while the other 22
sent the folder to relatives. However, the authors indicate that this
preference towards friends could be exclusive to the participants and this
does not necessarily generalize to the public.
Nebraska Study
The starting population for the Nebraska study consisted of 296 people:
100 of them resided in Boston and were designated the “Boston random”
group, 100 were blue-chip stockholders from Nebraska, and 96 were
randomly chosen Nebraska inhabitants called the “Nebraska random”
group. Only 64 chains ( $$29\%$$ ) were completed, with $$24\%$$
by “Nebraska random”, $$31\%$$ by Nebraska stockholder and
$$35\%$$ by “Boston random”. There were 453 participants who made
up the intermediaries solicited by other participants as people likely to
extend the chains towards the target. Similar to the Kansas study,
$$86\%$$ of the participants sent the folders to friends or acquaintances,
and the rest to relatives. Men were ten times more likely to send the
document to other men than to women, while women were equally likely to
send the folder to males as to females. These results were probably affected
by the fact that the target was male.
Figure 4.1 shows that the average number of intermediaries on the path of
the folders was between 4.4 and 5.7, depending on the sample of people
chosen. The average number of intermediaries in these chains was 5.2, with
considerable difference between the Boston group (4.4) and the rest of the
starting population, whereas the difference between the two other sub-
populations were not statistically significant. The random group from
Nebraska needed 5.7 intermediaries on average. Overall, it took an average
of 6.2 steps to reach the target. Hence the expression “six degrees of
separation”. The
main conclusion was that the average path length is much smaller than
expected, and that geographic location has an impact on the average length
whereas other information, such as profession, did not. It was also observed
that some chains moved from Nebraska to the target’s neighbourhood but
then went round in circles, never making contact to complete the chain. So,
social communication is restricted less by physical distance than by social
distance. Additionally, a funnelling-like phenomenon was observed because
$$48\%$$ of chains in the last link passed through only 3 of the target’s
acquaintances, thereby showing that not all individuals in a person’s circle
are equally likely to be selected as the next node in a chain.
Fig. 4.1
Figure 4.2 depicts the distribution of the incomplete chains. The median of
this distribution is found to be 2.6. There are two probable reasons for this:
(i) Participants are not motivated enough to participate in this study. (ii)
Participants do not know to whom they must send the folder in order to
advance the chain towards the target.
Fig. 4.2
The authors conclude from this study that although each and every
individual is embedded in a small-world structure, not all acquaintances are
equally important. Some are more important than others in establishing
contacts with broader social realms because while some are relatively
isolated, others possess a wide circle of acquaintances.
Although the idea of the six degrees of separation gained quick and wide
acceptance, there is empirical evidence which suggests that we actually live
in a world deeply divided by social barriers such as race and class.
Income Stratification
Acquaintance Networks
Social Stratification
Urban Myth
Reference [10] questions whether Milgram's low success rates result from
people not bothering to send the folder or whether they reveal that the
theory is incorrect. Or do some people live in a small, small world where
they can easily reach people across boundaries while others do not?
Further, the paper suggests that the research on the small world problem
may follow a familiar pattern:
People who owned stock had shorter paths (5.4) to stockbrokers than
random people (6.7). People from the Boston area had even shorter paths
(4.4). Only 31 out of the 64 chains passed through one of three people as
their final step, thus not all vertices and edges are equal. The sources and
target were non-random. 64 is not a sufficiently large sample to base a
conclusion on. $$25\%$$ of those approached refused to participate. It was
commonly thought that people would find the shortest path to the target,
but the experiment revealed that they instead used additional information
to form a strategy.
Despite all this, the experiment and the resulting phenomena have formed a
crucial aspect in our understanding of social networks. The conclusion has
been accepted in the broad sense: social networks tend to have very short
paths between essentially arbitrary pairs of people. The existence of these
short paths has substantial consequences for the potential speed with which
information, diseases, and other kinds of contagion can spread through
society, as well as for the potential access that the social network provides
to opportunities and to people with very different characteristics from one’s
own.
This gives us the small world property : networks of size n have diameter
O(log n), meaning that between any two nodes there exists a path of length
O(log n).
The study observed the following: When passing messages, the senders
typically used friendships in preference to business or family ties.
Successful chains in comparison with incomplete chains disproportionately
involved professional ties (33.9 versus $$13.2\%$$) rather than
friendship and familial relationships (59.8 versus $$83.4\%$$). Men
passed messages more frequently to other men ( $$57\%$$) and women
to other women ( $$61\%$$) and this tendency to pass to a same gender
contact was strengthened by about $$3\%$$ if the target was the same
gender as the sender and similarly weakened in the opposite case. Senders
were also asked why they considered their nominated acquaintance a
suitable recipient. Geographical proximity of acquaintance to target and
similarity of occupation accounted for at least half of all choices. Presence
of highly connected individuals appear to have limited relevance to any
kind of social search involved in this experiment. Participants relatively
rarely nominated an acquaintance primarily because he or she had more
friends, and individuals in successful chains were far less likely than those
in incomplete chains to send messages to hubs (1.6 versus $$8.2\%$$).
There was no observation of message funneling : at most $$5\%$$ of
messages passed through a single acquaintance of any target, and
$$95\%$$ of all chains were completed through individuals who
delivered at most 3 messages. From these observations, the study concluded
that social search is an egalitarian exercise, not one whose success depends
on a small minority of exceptional individuals.
The aggregate of the 384 completed chains across targets (Fig. 4.3) gives us
an average length of 4.05 as shown in Fig. 4.4.
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig3_HTML.
gif
Fig. 4.3
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig4_HTML.
gif
Fig. 4.4
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig5_HTML.
gif
Fig. 4.5
The study finds that successful chains are due to intermediate to weak
strength ties. It does not require highly connected hubs to succeed but
instead relies on professional relationships. Small variations in chain
lengths and participation rates generate large differences in target
reachability. Although global social networks are searchable, actual success
depends on individual incentives. Since individuals have only limited, local
information about global social networks, finding short paths represents a
non-trivial search effort.
On the one hand, all the targets may be reachable from random initial
seeders in only a few steps, with surprisingly little variation across targets
in different countries and professions. On the other hand, small differences
in either the participation rates or the underlying chain lengths can have a
dramatic impact on the apparent reachability of the different targets.
Therefore, the results suggest that if the individuals searching for remote
targets do not have sufficient incentives to proceed, the small-world
hypothesis will not appear to hold, but that even the slightest increase in
incentives can render social searches successful under broad conditions.
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig6_HTML.
gif
Fig. 4.6
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig7_HTML.
gif
Fig. 4.7
In each step of a decentralized search, the current message holder u has
knowledge of:
1. The set of local contacts among all the vertices (the underlying lattice structure)
2. The location, on the lattice, of the target
3. The locations and long-range contacts of all the vertices who were part of this message chain
Given this information, u must choose one of its contacts v to pass on the
message.
4.7 Searchable
A graph is said to be searchable or navigable if the expected delivery time of
a decentralized algorithm on it is $$O((log|V|)^{\beta })$$ for some constant
$$\beta $$, i.e., polylogarithmic in the number of vertices. Conversely, a
non-searchable or non-navigable graph is one in which the expected delivery
time of any decentralized algorithm is polynomial, of the order
$$|V|^{\alpha }$$ for some constant $$\alpha > 0$$, and hence not
polylogarithmic.
Reference [7] conducted a small-world study where they analysed 10,920
shortest path connections and small-world routes between 105 members of
an interviewing bureau. They observed that the mean small-world path
length (3.23) is $$40\%$$ longer than the mean of the actual shortest
paths (2.30), showing that mistakes are prevalent. A Markov model with a
probability of simply guessing an intermediary of 0.52 gives an excellent fit
to these observations, thus concluding that people make the wrong small-
world choice more than half the time.
The study drew the following conclusions. First, there was a mean of 210
choices per subject; however, only 35 choices were necessary to account for half the
world. Of the 210, 95 ( $$45\%$$) were chosen most often for location
reasons, 99 ( $$47\%$$) were chosen most often for occupation reasons,
and only $$7\%$$ of the choices were mainly based on ethnicity or other
reasons. Second, the choices were mostly friends and acquaintances, and
not family. For any given target, about $$82\%$$ of the time, a male is
likely to be chosen, unless both subject and target are female, or if the
target has a low-status occupation. Over $$64\%$$ of the time, this trend
appears even when female choices are more likely. Lastly, the order of the
reasons given were: location, occupation and ethnicity.
The authors conclude that the strong similarity between predictions from
these results and the results of [16] suggests that the experiment is an
adequate proxy for behavior.
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig8_HTML.
gif
Fig. 4.8
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig9_HTML.
gif
Fig. 4.9
Degree distribution of HP Lab network
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig10_HTML.
gif
Fig. 4.10
The last strategy was based on the target’s physical location. Individuals’
locations are given by their building, the floor of the building, and the
nearest building post to their cubicle. Figure 4.12 shows the email
correspondence mapped onto the physical layout of the buildings. The
general tendency of individuals in close physical proximity to correspond
holds: over $$87\%$$ of the 4000 emails are between individuals
on the same floor. From Fig. 4.13, we observe that geography could be
used to find most individuals, but was slower, taking a median number of 6
steps, and a mean of 12.
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig12_HTML.
gif
Fig. 4.12
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig13_HTML.
gif
Fig. 4.13
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig14_HTML.
gif
Fig. 4.14
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig15_HTML.
gif
Fig. 4.15
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig16_HTML.
gif
Fig. 4.16
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig17_HTML.
gif
Fig. 4.17
The agents were constrained by several rules to make them behave the
same way humans would, thereby levelling the playing field. The agents
could only use local node features independent of global information about
the network. Only nodes visited so far, immediate neighbours, and the
target may play a role in picking the next node. The user can only follow
links from the current page or return to its immediate predecessor by
clicking the back button. Jumping between two unconnected pages was not
possible, even if they were both visited separately before.
The supervised and reinforcement learning agents were found to obtain the best
results. With the exception of DBN, even the fairly simple agents were
found to do well.
The study observed that agents find shorter paths than humans on average
and therefore, no sophisticated background knowledge or high-level
reasoning is required for navigating networks. However, humans are less
likely to get totally lost during search because they typically form robust
high-level plans with backup options, something automatic agents cannot
do. Instead, agents compensate for this lack of smartness with increased
thoroughness: since they cannot know what to expect, they always have to
inspect all options available, thereby missing out on fewer immediate
opportunities than humans, who, focused on executing a premeditated plan,
may overlook shortcuts. In other words, humans have a common sense
expectation about what links may exist and strategize on the route to take
before even making the first move. Following through on this premeditated
plan, they might often just skim pages for links they already expect to exist,
thereby not taking notice of shortcuts hidden in the abundant textual article
contents. This deeper understanding of the world is the reason why their
searches completely fail less often: instead of exact, narrow plans, they
sketch out rough, high-level strategies with backup options that are robust
to contingencies.
4.10 Small World Models
We have observed that random graphs have low clustering and a low
diameter, while a regular graph has high clustering and a high diameter. A
real-world graph, however, has a clustering value comparable with that of a
regular graph but a diameter of the order of that of a random graph. This means
that neither the random graph models nor a regular graph generator is the
best model that can produce a graph with clustering and diameter
comparable with those of real-world graphs. In this section, we will look at
models that are capable of generating graphs that exhibit the small world
property.
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig18_HTML.
gif
Fig. 4.18
In this model, [8] begins with a set of nodes that are identified with a set of
lattice points in a $$n\times n$$ square,
$$\{(i,j): i \in \{1,2,\ldots ,n\}, j \in \{1,2,\ldots ,n\}\}$$
and the lattice distance between two nodes (i,j) and (k,l) is the number of
lattice steps between them $$d((i,j),(k,l)) = |k-i| + |l-j|$$. For a universal
constant $$p\ge 1$$, node u has a directed edge to every other node
within lattice distance p. These are node u’s local contacts. For universal
constants $$q\ge 0$$ and $$r\ge 0$$, we construct directed edges from
u to q other nodes using independent random trials; the $$i^{th}$$
directed edge from u has endpoint v with probability proportional to
$$[d(u,v)]^{-r}$$. These are node u’s long-range contacts. When r is
very small, the long-range edges are too random to facilitate decentralized
search (as observed in Sect. 4.10.1); when r is too large, the long-range
edges are not random enough to provide the jumps necessary for the small-world
phenomenon to be exhibited.
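A minimal sketch of this construction and of greedy decentralized routing on it is shown below. It relies on NetworkX’s navigable_small_world_graph, which follows this grid-plus-long-range-contacts construction; the routing loop, the number of trials, the grid size and the seeds are illustrative choices, not values from the text.

import random
import networkx as nx

def lattice_dist(u, v):
    # Manhattan (lattice) distance between two grid coordinates
    return sum(abs(a - b) for a, b in zip(u, v))

def greedy_search(G, s, t, max_steps=10000):
    # Decentralized greedy routing: always forward to the out-neighbour
    # closest, in lattice distance, to the target t.
    cur, steps = s, 0
    while cur != t and steps < max_steps:
        cur = min(G.successors(cur), key=lambda w: lattice_dist(w, t))
        steps += 1
    return steps if cur == t else None

n, r = 50, 2                     # r = 2 is the searchable exponent in two dimensions
G = nx.navigable_small_world_graph(n, p=1, q=1, r=r, seed=1)
nodes = list(G.nodes())
random.seed(1)
trials = [greedy_search(G, *random.sample(nodes, 2)) for _ in range(200)]
ok = [t for t in trials if t is not None]
print("mean delivery time:", sum(ok) / len(ok))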
Theorem 6
../images/462433_1_En_4_Chapter/462433_1_En_4_Fig19_HTML.
gif
Fig. 4.19
Theorem 7
Theorem 8
1. Let $$0 \le r < 2$$. There is a constant $$\alpha _{r}$$, depending
on p, q, r but independent of n, so that the expected delivery time of
any decentralized algorithm is at least $$\alpha _{r}n^{\frac{2-r}{3}}$$.
2. Let $$r > 2$$. There is a constant $$\alpha _{r}$$, depending
on p, q, r but independent of n, so that the expected delivery time of
any decentralized algorithm is at least
$$\alpha _{r}n^{\frac{r-2}{r-1}}$$.
Consider a hierarchy using b-ary tree T for a constant b. Let V denote the
set of leaves of T and n denote the size of V. For two leaves v and w, h(v, w)
denotes the height of the least common ancestor of v and w in T and
$$f(\cdot )$$ determines the link probability. For each node $$v\in V$$
, a random link to w is created with probability proportional to f(h(v, w)),
i.e, the probability of choosing w is
$$f(h(v, w))/\sum \limits _{x\ne v}f(h(v, x))$$. k links out of each node v
are created in this way, choosing the endpoint w each time independently and
with repetitions allowed. This results in a graph G on the set V. Here the
out-degree is $$k = c\log ^{2}n$$ for a constant c. This process of
producing G is the hierarchical model with exponent $$\alpha $$ if f(h)
grows asymptotically like $$b^{-\alpha h}$$.
$$\begin{aligned} \lim _{h\rightarrow \infty } \frac{f(h)}{b^{-\alpha '
h}} = 0\ \forall \ \alpha ' < \alpha \ and\ \lim _{h\rightarrow \infty }
\frac{b^{-\alpha '' h}}{f(h)} = 0\ \forall \ \alpha '' > \alpha
\end{aligned}$$
A decentralized algorithm has knowledge of the tree T, and knows the
location of a target leaf that it must reach; however, it only learns the
structure of G as it visits nodes. The exponent $$\alpha $$ determines
how the structures of G and T are related.
Theorem 9
1. 1.
There is a hierarchical model with exponent $$\alpha = 1$$ and
polylogarithmic out-degree in which decentralized algorithm can
achieve search time of O(logn).
2. 2.
1. 1.
2. 2.
If $$R_{1}, R_{2}, \ldots $$ are groups that all have sizes at most q
and all contain a common node v, then their union has size at most
$$\beta q$$.
Theorem 10
1. 1.
2. 2.
To start with a hierarchical model and construct graphs with constant out-
degree k, the value of k must be sufficiently large in terms of other
parameters of this model. To obviate the problem that t itself may have no
incoming links, the search problem is relaxed to finding a cluster of nodes
containing t. Given a complete b-ary tree T, where b is a constant, let L
denote the set of leaves of T and m denote the size of L. r nodes are placed
at each leaf of T, forming a set V of $$n = mr$$ nodes total. A graph G
on V is defined for a non-increasing function $$f(\cdot )$$, we create k
links out of each node $$v\in V$$, choosing w as an endpoint with
probability proportional to f(h(v, w)). Each set of r nodes at a common leaf
of T is referred to as a cluster and the resolution of the hierarchical model is
defined to be r.
Theorem 11
1. 1.
2. 2.
The model defines V as a set of vertices in a regular lattice and E as the set
of long-range (shortcut) edges between vertices in V. This gives a digraph
G(V, E). Let $$G'$$ be G augmented with edges going both ways
between each pair of adjacent vertices in lattice.
1. 1.
2. 2.
3. 2.
33
Degree distribution
34
35
36
37
For each of these distributions, state whether or not the small world model
has the same property as the real world graph
38
Is the small world graph generator capable of generating graphs that are
representative of real world graphs?
References
1. Adamic, Lada, and Eytan Adar. 2005. How to search a social network. Social Networks 27 (3): 187–203.
2. Adamic, Lada A., Rajan M. Lukose, Amit R. Puniyani, and Bernardo A. Huberman. 2001. Search in power-law networks. Physical Review E 64 (4): 046135.
3. Beck, M., and P. Cadamagnani. 1968. The extent of intra- and inter-social group contact in the American society. Unpublished manuscript, Stanley Milgram Papers, Manuscripts and Archives, Yale University.
4. Dodds, Peter Sheridan, Roby Muhamad, and Duncan J. Watts. 2003. An experimental study of search in global social networks. Science 301 (5634): 827–829.
5. Horvitz, Eric, and Jure Leskovec. 2007. Planetary-scale views on an instant-messaging network. Redmond, USA: Microsoft Research Technical report.
6. Killworth, Peter D., and H. Russell Bernard. 1978. The reversal small-world experiment. Social Networks 1 (2): 159–192.
7. Killworth, Peter D., Christopher McCarty, H. Russell Bernard, and Mark House. 2006. The accuracy of small world chains in social networks. Social Networks 28 (1): 85–96.
8. Kleinberg, Jon. 2000. The small-world phenomenon: An algorithmic perspective. In Proceedings of the thirty-second annual ACM symposium on theory of computing, 163–170. ACM.
9. Kleinberg, Jon M. 2002. Small-world phenomena and the dynamics of information. In Advances in neural information processing systems, 431–438.
10. Kleinfeld, Judith. 2002. Could it be a big world after all? The six degrees of separation myth. Society 12:5–2.
11. Korte, Charles, and Stanley Milgram. 1970. Acquaintance networks between racial groups: Application of the small world method. Journal of Personality and Social Psychology 15 (2): 101.
12. Liben-Nowell, David, Jasmine Novak, Ravi Kumar, Prabhakar Raghavan, and Andrew Tomkins. 2005. Geographic routing in social networks. Proceedings of the National Academy of Sciences of the United States of America 102 (33): 11623–11628.
13. Lin, Nan, Paul Dayton, and Peter Greenwald. 1977. The urban communication network and social stratification: A small world experiment. Annals of the International Communication Association 1 (1): 107–119.
14. Sandberg, Oskar, and Ian Clarke. 2006. The evolution of navigable small-world networks. arXiv:cs/0607025.
15. Travers, Jeffrey, and Stanley Milgram. 1967. The small world problem. Psychology Today 1 (1): 61–67.
16. Travers, Jeffrey, and Stanley Milgram. 1977. An experimental study of the small world problem. Social Networks, 179–197. Elsevier.
17. West, Robert, and Jure Leskovec. 2012. Automatic versus human navigation in information networks. In ICWSM.
18. West, Robert, and Jure Leskovec. 2012. Human wayfinding in information networks. In Proceedings of the 21st international conference on World Wide Web, 619–628. ACM.
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_5
Reference [2] explains the study of the social graph of the active users of
the world’s largest online social network, Facebook . The study mainly
focused on computing the number of users and their friendships, degree
distribution, path length, clustering and various mixing patterns. All
calculations concerning this study were performed on a Hadoop cluster of
2250 machines using the Hadoop/Hive data analysis framework developed at
Facebook. This social network is seen to display a broad range of unifying
structural properties such as homophily, clustering, small-world effect,
heterogeneous distribution of friends and community structure.
The graph of the entire social network of the active members of
Facebook as of May 2011 is analysed in an anonymized form and the focus
is placed on the set of active user accounts reliably corresponding to people.
A user of Facebook is deemed as an active member if they logged into the
site in the last 28 days from the time of measurement in May 2011 and had
at least one Facebook friend. The restriction to study only active users
allows us to eliminate accounts that have been abandoned in the early
stages of creation, and focus on accounts that plausibly represent actual
individuals. This graph precedes the existence of “subscriptions” and does
not include “pages” that people may “like” . According to this definition,
the population of active Facebook users is 721 million at the time of
measurement. The world’s population at the time was 6.9 billion people
which means that this graph includes roughly $$10\%$$ of the Earth’s
inhabitants. There were 68.7 billion friendships in this graph, so the average
Facebook user had around 190 Facebook friends.
The study also focuses on the subgraph of the 149 million US Facebook users.
The US Census Bureau for 2011 shows roughly 260 million individuals in
the US over the age of 13 and therefore eligible to create a Facebook
account. Therefore this social network includes more than half the eligible
US population. This graph had 15.9 billion edges, so an average US user
had 214 other US users as friends. Note that this average is higher than that
of the global graph.
The neighbourhood function of a graph G returns, for each t, the number of
pairs of vertices (x, y) such that x has a path of length at most t to y. It
provides data about how fast the “average ball” around each vertex expands,
and it measures what percentage of vertex pairs are within a given distance.
Although the diameter of a graph can be wildly distorted by the presence of a
single ill-connected path in some peripheral region of the graph, the
neighbourhood function and the average path length are thought to robustly
capture the distances between pairs of vertices.
From this function, it is possible to derive the distance distribution which
gives for each t, the fraction of reachable pairs at a distance of exactly t.
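As an illustration, the sketch below computes both quantities exactly by running a BFS from every vertex of a small stand-in graph with NetworkX; exact computation is feasible only for small graphs, and studies at Facebook scale rely on approximate neighbourhood-function algorithms instead.

from collections import Counter
import networkx as nx

def distance_distribution(G):
    # number of ordered pairs (x, y) at shortest-path distance exactly t
    counts = Counter()
    for x in G:
        for d in nx.single_source_shortest_path_length(G, x).values():
            if d > 0:
                counts[d] += 1
    return counts

def neighbourhood_function(G):
    # N(t): number of ordered pairs (x, y) with a path of length at most t
    dist = distance_distribution(G)
    N, running = {}, 0
    for t in range(1, max(dist) + 1):
        running += dist.get(t, 0)
        N[t] = running
    return N

G = nx.erdos_renyi_graph(500, 0.02, seed=0)      # small stand-in graph
N = neighbourhood_function(G)
pairs = G.number_of_nodes() * (G.number_of_nodes() - 1)
for t in sorted(N):
    print(t, "hops:", round(100 * N[t] / pairs, 1), "% of pairs reachable")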
5.3 Spid
Spid, an acronym for “shortest-paths index of dispersion”, measures the
dispersion of the distance distribution and is defined as its variance-to-mean
ratio. It is sometimes referred to as the
webbiness of a social network. Networks with spid greater than one should
be considered web-like whereas networks with spid less than one should be
considered properly social.
The intuition behind this measure is that proper social networks strongly
favour short connections, whereas in the web, long connections are not
uncommon. The correlation between spid and average distance is inverse,
i.e., the larger the average distance, the smaller the spid.
The spid of the Facebook graph is 0.09 thereby confirming that it is a
proper social network.
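The spid itself is a one-line statistic of the distance distribution; a self-contained sketch on an illustrative toy graph (not the Facebook data) follows.

from collections import Counter
import networkx as nx

def spid(G):
    # variance-to-mean ratio of the shortest-path distance distribution
    dist = Counter()
    for x in G:
        for d in nx.single_source_shortest_path_length(G, x).values():
            if d > 0:
                dist[d] += 1
    total = sum(dist.values())
    mean = sum(t * c for t, c in dist.items()) / total
    var = sum(c * t * t for t, c in dist.items()) / total - mean ** 2
    return var / mean

G = nx.erdos_renyi_graph(300, 0.05, seed=0)      # toy stand-in graph
print("spid:", round(spid(G), 3))                # < 1: properly social, > 1: web-like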
../images/462433_1_En_5_Chapter/462433_1_En_5_Fig1_HTML.gif
Fig. 5.1 Degree distribution of the global and US Facebook active users, alongside its CCDF
../images/462433_1_En_5_Chapter/462433_1_En_5_Fig2_HTML.gif
Fig. 5.2 Neighbourhood function showing the percentage of users that are within h hops of one
another
../images/462433_1_En_5_Chapter/462433_1_En_5_Fig3_HTML.gif
Fig. 5.3 Distribution of component sizes on log–log scale
../images/462433_1_En_5_Chapter/462433_1_En_5_Fig4_HTML.gif
Fig. 5.4 Clustering coefficient and degeneracy as a function of degree on log–log scale
../images/462433_1_En_5_Chapter/462433_1_En_5_Fig5_HTML.gif
Fig. 5.5 Average number of unique and non-unique friends-of-friends as a function of degree
5.8 Friends-of-Friends
The friends-of-friends, as the name suggests, denotes the number of users
that are within two hops of an initial user. Figure 5.5 computes the average
counts of both the unique and non-unique friends-of-friends as a function of
the degree. The non-unique friends-of-friends count corresponds to the
number of length 2 paths starting at an initial vertex and not returning to
that vertex. The unique friends-of-friends count corresponds to the number
of unique vertices that can be reached at the end of a length 2 path.
A naive approach to counting friends-of-friends would assume that a
user with k friends has roughly $$k^{2}$$ non-unique friends-of-
friends, assuming that their friends have roughly the same friend count as
them. The same principle could also apply to estimating the number of
unique friends-of-friends. However, the number of unique friends-of-
friends grows very close to linear, and the number of non-unique friends-of-
friends grows only moderately faster than linear. While the growth rate may
be slower than expected, as Fig. 5.5 illustrates, until a user has more than
800 friends the absolute amounts are unexpectedly large: a user with 100
friends has 27500 unique friends-of-friends and 40300 non-unique friends-
of-friends. This is significantly more than the $$100\times 99 = 9900$$
non-unique friends-of-friends we would have expected if our friends had
roughly the same number of friends as us.
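The counting itself is straightforward; the sketch below computes both quantities for the highest-degree vertex of a synthetic heavy-tailed graph (the graph model and parameters are illustrative stand-ins, not the Facebook data).

import networkx as nx

def friends_of_friends(G, v):
    # non-unique: length-2 paths from v that do not return to v
    # unique: distinct vertices reachable at the end of such paths
    friends = set(G[v])
    non_unique = sum(len(set(G[u]) - {v}) for u in friends)
    unique = len(set().union(*(set(G[u]) for u in friends)) - {v}) if friends else 0
    return non_unique, unique

G = nx.barabasi_albert_graph(5000, 5, seed=0)    # synthetic heavy-tailed graph
v = max(G.nodes, key=G.degree)
nu, uq = friends_of_friends(G, v)
print("degree:", G.degree(v), " non-unique FoF:", nu, " unique FoF:", uq)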
../images/462433_1_En_5_Chapter/462433_1_En_5_Fig6_HTML.gif
Fig. 5.6 Average neighbour degree as a function of an individual’s degree, and the conditional
probability $$p(k'|k)$$ that a randomly chosen neighbour of an individual with degree k has
degree $$k'$$
5.10 Login Correlation
Figure 5.7 shows the correlation calculation of the number of days users
logged in during the 28-day window of the study. The definition of a
random neighbour of vertices with trait x is to first select a vertex with trait
x in proportion to their degree and select an edge connected to that vertex
uniformly at random, i.e., we give each edge connected to vertices with trait
x equal weight. From Fig. 5.7, it is evident that, similar to the assortativity
property of degree, login activity also shows a correlation between an
individual and his or her neighbours.
../images/462433_1_En_5_Chapter/462433_1_En_5_Fig7_HTML.gif
Fig. 5.7 Neighbor’s logins versus user’s logins to Facebook over a period of 28 days, and a user’s
degree versus the number of days a user logged into Facebook in the 28 day period
5.11.1 Age
To understand the friendship patterns among individuals with different ages,
we compute the probability $$p(t'|t)$$ of selecting a random neighbour of individuals with age
t who has age $$t'$$. A random neighbour means that each edge
connected to a user with age t is given equal probability of being followed.
Figure 5.8 shows that the resulting distribution is not merely a function of
the magnitude of the age difference $$|t-t'|$$ as might naively be expected,
and is instead asymmetric about a maximum value of $$t'=t$$.
Unsurprisingly, a random neighbour is most likely to be the same age as
you. Younger individuals have most of their friends within a small age
range while older individuals have a much wider range.
../images/462433_1_En_5_Chapter/462433_1_En_5_Fig8_HTML.gif
Fig. 5.8 Distribution $$p(t'|t)$$ of ages $$t'$$ for the neighbours of users with age t
5.11.2 Gender
By computing $$p(g'|g)$$, we get the probability that a random
neighbour of individuals with gender g has gender $$g'$$ where M
denotes male and F denotes female. The Facebook graph gives us the
following probabilities, $$p(F|M) = 0.5131$$, $$p(M|M) = 0.4869$$,
$$p(F|F) = 0.5178$$ and $$p(M|F) = 0.4822$$. By these computations,
a random neighbour is more likely to be a female. There are roughly 30
million fewer active female users on Facebook with average female degree
(198) larger than the average male degree (172) with $$p(F) = 0.5156$$
and $$p(M) = 0.4844$$. Therefore, we have
$$p(F|M)< p(F) < p(F|F)$$ and $$p(M|F)< p(M) < p(M|M)$$
. However, the difference between these probabilities is extremely small,
indicating only a minimal preference for same-gender friendships on
Facebook.
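These conditional probabilities are simple edge-level counts. The following sketch shows the computation on a toy graph with randomly assigned, purely hypothetical gender attributes, so the resulting values hover around 0.5 rather than matching the Facebook figures above.

import random
import networkx as nx

random.seed(0)
G = nx.erdos_renyi_graph(2000, 0.01, seed=0)
gender = {v: random.choice("MF") for v in G}     # hypothetical attribute

counts = {(g1, g2): 0 for g1 in "MF" for g2 in "MF"}
for u, v in G.edges():
    counts[(gender[u], gender[v])] += 1          # edge seen from u's side
    counts[(gender[v], gender[u])] += 1          # and from v's side

for g in "MF":
    total = counts[(g, "M")] + counts[(g, "F")]
    for g2 in "MF":
        print("p(%s|%s) = %.3f" % (g2, g, counts[(g, g2)] / total))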
../images/462433_1_En_5_Chapter/462433_1_En_5_Fig9_HTML.gif
Fig. 5.9 Normalized country adjacency matrix as a heatmap on a log scale. Normalized by dividing
each element of the adjacency matrix by the product of the row country degree and column country
degree
Country Code
Indonesia ID
Philippines PH
Sri Lanka LK
Australia AU
New Zealand NZ
Thailand TH
Malaysia MY
Singapore SG
Hong Kong HK
Taiwan TW
United States US
Dominican Republic DO
Puerto Rico PR
Mexico MX
Canada CA
Venezuela VE
Chile CL
Argentina AR
Uruguay UY
Colombia CO
Costa Rica CR
Guatemala GT
Ecuador EC
Peru PE
Bolivia BO
Spain ES
Ghana GH
United Kingdom GB
South Africa ZA
Israel IL
Jordan JO
United Arab Emirates AE
Kuwait KW
Algeria DZ
Tunisia TN
Italy IT
Macedonia MK
Albania AL
Serbia RS
Slovenia SI
Bosnia and Herzegovina BA
Croatia HR
Turkey TR
Portugal PT
Belgium BE
France FR
Hungary HU
Ireland IE
Denmark DK
Norway NO
Sweden SE
Czech Republic CZ
Bulgaria BG
Greece GR
Problems
Download the Friendster undirected social network data available at https://
snap.stanford.edu/data/com-Friendster.html.
This network consists of 65 million nodes and 180 million edges. The
world’s population in 2012 was 7 billion people. This means that the
network has $$1\%$$ of the world’s inhabitants.
For this graph, compute the following network parameters:
39 Degree distribution
References
1. Backstrom, Lars, Paolo Boldi, Marco Rosa, Johan Ugander, and Sebastiano Vigna. 2012. Four
degrees of separation. In Proceedings of the 4th Annual ACM Web Science Conference, 33–42.
ACM.
2. Ugander, Johan, Brian Karrer, Lars Backstrom, and Cameron Marlow. 2011. The anatomy of the
facebook social graph. arXiv:1111.4503.
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_6
6. Peer-To-Peer Networks
../images/462433_1_En_6_Chapter/462433_1_En_6_Fig1_HTML.
gif
Fig. 6.1
6.1 Chord
Chord [3] is a distributed lookup protocol that addresses the problem of
efficiently locating the node that stores a particular data item in a structured
P2P application. Given a key, it maps the key onto a node. A key is
associated with each data item and key/data item pair is stored at the node
to which the key maps. Chord is designed to adapt efficiently as nodes join
and leave the network dynamically.
The Chord software takes the form of a library to be linked with the client
and server applications that use it. The application interacts with Chord in
two ways: First, Chord provides a lookup algorithm that yields the IP address
of the node responsible for a given key. Next, the Chord software on each node
notifies the application of changes in the set of keys that the node is
responsible for. This allows the application to move the corresponding
values to their new homes when a new node joins.
Chord uses consistent hashing . In this algorithm, each node and key has an
m-bit identifier. One has to ensure that m is large enough to make the
probability of two nodes or keys hashing to the same identifier negligible.
The keys are assigned to nodes as follows: The identifiers are ordered in an
identifier circle modulo $$2^{m}$$. Key k is assigned to the first node
whose identifier is equal to or follows k. This node is called the successor
node of key k, denoted by successor(k). If the identifiers are represented as
a circle of numbers from 0 to $$2^{m}-1$$, then successor(k) is the first
node clockwise from k. This tends to balance the load, since each node
receives roughly the same number of keys and involves relatively little
movement of keys when nodes join and leave the system. In an N-node
system, each node maintains information about O(logN) other nodes and
resolves all lookups via O(logN) messages to the other nodes. To maintain
consistent hashing mapping when a node n joins the network, certain keys
previously assigned to n’s successor now become assigned to n. When n
leaves the network, all of its assigned keys are reassigned to n’s successor.
Each node n maintains a routing table with at most m entries, called the
finger table . The $$i^{th}$$ entry in the table at node n contains the identity
of the first node s that succeeds n by at least $$2^{i-1}$$ on the identifier circle, i.e.,
$$s = successor(n + 2^{i-1})$$ where $$1 \le i \le m$$ (and all
arithmetic is modulo $$2^{m}$$). Node s is the $$i^{th}$$ finger of node n.
This finger table scheme is designed for two purposes: First, each node
stores information about only a small number of other nodes and knows
more about nodes closely following it than the nodes far away. Next, the
node’s finger table generally does not contain enough information to
determine the successor of an arbitrary key k.
If n does not know the successor of key k, then it finds the node whose ID
is closer than its own to k. That node will know more about identifier circle
in region of k than n does. Thus, n searches its finger table for node j whose
ID immediately precedes k, and asks j for the node it knows whose ID is
closest to k. By repeating this process, n learns about nodes with IDs closer
to k.
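A minimal sketch of the finger table and of the resulting lookup on a static identifier circle is shown below; the choice of m = 6 bits and the node identifiers are arbitrary, and the joins, failures and RPC machinery of real Chord are omitted.

M = 6                                            # identifier bits; ring size 2^M = 64
nodes = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
succ_of = {n: nodes[(i + 1) % len(nodes)] for i, n in enumerate(nodes)}

def between(x, a, b):
    # True if x lies in the open circular interval (a, b) on the ring
    return (a < x < b) if a < b else (x > a or x < b)

def successor(ident):
    # first node whose identifier is equal to or follows ident on the circle
    following = [n for n in nodes if n >= ident]
    return following[0] if following else nodes[0]

def finger_table(n):
    # i-th entry: successor(n + 2^(i-1)) mod 2^M, for i = 1..M
    return [successor((n + 2 ** (i - 1)) % 2 ** M) for i in range(1, M + 1)]

def find_successor(n, key, hops=0):
    # resolve the node responsible for key, starting at node n
    if between(key, n, succ_of[n]) or key == succ_of[n]:
        return succ_of[n], hops
    # otherwise forward to the finger that most closely precedes the key
    nxt = next((f for f in reversed(finger_table(n)) if between(f, n, key)),
               succ_of[n])
    return find_successor(nxt, key, hops + 1)

print("finger table of node 8:", finger_table(8))
print("node responsible for key 54, found from node 8:", find_successor(8, 54))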
In a dynamic network where the nodes can join and leave at any time, to
preserve the ability to locate every key in the network, each node’s
successor must be correctly maintained. For fast lookups, the finger tables
must be correct. To simplify this joining and leaving mechanisms, each
node maintains a predecessor pointer . A node’s predecessor pointer
contains Chord identifier and IP address of the immediate predecessor of
this node, and can be used to walk counter-clockwise around the identifier
circle. To preserve this, Chord performs the following tasks when a node n
joins the network: First, it initializes the predecessor and fingers of the
node n. Next, it updates the fingers and predecessors of the existing nodes
to reflect the addition of n. Finally, it notifies the application software so
that it can transfer state associated with the keys that node n is now responsible for.
6.2 Freenet
Freenet [1] is an unstructured P2P network application that allows the
publication, replication and retrieval of data while protecting the anonymity
of both the authors and the readers. It operates as a network of identical
nodes that collectively pool their storage space to store data files and
cooperate to route requests to the most likely physical location of data. The
files are referred to in a location-independent manner, and are dynamically
replicated in locations near requestors and deleted from locations where
there is no interest. It is infeasible to discover the true origin or destination
of a file passing through the network, and difficult for a node operator to be
held responsible for the actual physical contents of his or her node.
Nodes in this adaptive peer-to-peer network query one another to store and
retrieve data files, which are named by location-independent keys. Each
node maintains a datastore which it makes available to the network for
reading and writing, as well as a dynamic routing table containing
addresses of their immediate neighbour nodes and the keys they are thought
to hold. Most users run nodes, both for security from hostile foreign nodes
and to contribute to the network’s storage capacity. Thus, the system is a
cooperative distributed file system with location independence and
transparent lazy replication.
The basic model is that the request for keys are passed along from node to
node through a chain of proxy requests in which each node makes a local
decision about where to send the request next, in the style of IP routing.
Since the nodes only have knowledge of their immediate neighbours, the
routing algorithms are designed to adaptively adjust routes over time to
provide efficient performance while using only local knowledge. Each
request is identified by a pseudo-unique random number, so that nodes can
reject requests they have seen before, and a hops-to-live limit which is
decremented at each node, to prevent infinite chains. If a request is rejected
by a node, then the immediately preceding node chooses a different node to
forward to. The process continues until the request is either satisfied or
exceeds its hops-to-live. The result is then passed back along the chain to the
sending node.
To retrieve a file, a user must first obtain or calculate its key (calculation of
the key is explained in [1]). Then, a request message is sent to his or her
own node specifying that key and a hops-to-live value. When a node
receives a request, it first checks its own store for the file and returns it if
found, together with a note saying it was the source of the data. If not
found, it looks up the nearest key in its routing table to the key requested
and forwards the request to the corresponding node. If that request is
ultimately successful and returns with the data, the node will pass the data
back to the user, cache the file in its own datastore, and create a new entry
in its routing table associating the actual data source with the requested key.
A subsequent request for the same key will be immediately satisfied by the
user’s node. To obviate the security issue which could potentially be caused
by maintaining a table of data sources, any node can unilaterally decide to
change the reply message to claim itself or another arbitrarily chosen node
as the data source.
If a node cannot forward a request to its preferred node, the node having
the second-nearest key will be tried, then the third-nearest, and so on. If a
node runs out of candidates to try, it reports failure back to its predecessor
node, which will then try its second choice, etc. In this manner, a request
operates as a steepest-ascent hill-climbing search with backtracking. If the
hops-to-live limit is exceeded, a failure result is propagated back to the
original requestor without any further nodes being tried. As nodes process
requests, they create new routing table entries for previously unknown
nodes that supply files, thereby increasing connectivity.
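A minimal sketch of this request-routing behaviour is given below, with toy data structures standing in for a real node’s datastore and routing table; the key-distance function, topology and values are hypothetical and chosen only to illustrate the steepest-ascent search with backtracking.

def closeness(a, b):
    return abs(a - b)                      # toy "key distance"

def route_request(node, key, htl, stores, routing, seen=None):
    # Return the node holding key, or None. stores[n] is n's datastore and
    # routing[n] maps each neighbour to the key it is thought to hold.
    seen = seen if seen is not None else set()
    if key in stores[node]:
        return node
    if htl == 0:
        return None
    seen.add(node)
    # try neighbours in order of how close their advertised key is to key
    for nbr in sorted(routing[node], key=lambda n: closeness(routing[node][n], key)):
        if nbr in seen:
            continue
        found = route_request(nbr, key, htl - 1, stores, routing, seen)
        if found is not None:
            return found                   # success propagates back along the chain
    return None                            # all candidates failed: report back

# tiny hypothetical topology
stores = {0: set(), 1: set(), 2: {42}, 3: set()}
routing = {0: {1: 10, 3: 50}, 1: {0: 5, 2: 40}, 2: {1: 10}, 3: {0: 5}}
print(route_request(0, key=42, htl=5, stores=stores, routing=routing))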
File insertions work in the same manner as file requests. To insert a file, a
user first calculates a file key and then sends an insert message to his or her
node specifying the proposed key and a hops-to-live value (this will
determine the number of nodes to store it on). When a node receives an
insert message, it first checks its own store to see if the key is already
taken. If the key exists, the node returns the existing file as if a request has
been made for it. This notifies the user of a collision. If the key is not
found, the node looks up the nearest key in its routing table to the key
proposed and forwards the insert to the corresponding node. If that insert
also causes a collision and returns with the data, the node will pass the data
back to the upstream inserter and again behave as if a request has been
made. If the hops-to-live limit is reached without a key collision being
detected, a success message will be propagated back to the original inserter.
The user then sends the data to insert, which will be propagated along the
path established by the initial query and stored in each node along the way.
Each node will also create an entry in its routing table associating the
inserter with the new key. To avoid the obvious security problem, any node
along the way can arbitrarily decide to change the insert message to claim
itself or another arbitrarily-chosen node as the data source. If a node cannot
forward an insert to its preferred node, it uses the same backtracking
approach as was used while handling requests.
When a new node intends to join the network, it chooses a random seed and
sends an announcement message containing its address and the hash of that
seed to some existing node. When a node receives a new node
announcement, it generates a random seed, XORs that with the hash it
received and hashes the result again to create a commitment. It then
forwards this new hash to some randomly chosen node. This forwarding
continues until the hops-to-live of the announcement runs out. The last
node to receive the announcement just generates a seed. Now all the nodes
in the chain reveal their seeds and the key of the new node is assigned as
the XOR of all the seeds. Checking the commitments enables each node to
confirm that everyone revealed their seeds truthfully. This yields a
consistent random key, which each node then adds as an entry for this new
node in its routing table.
../images/462433_1_En_6_Chapter/462433_1_En_6_Fig2_HTML.
gif
Fig. 6.2
Problems
In this exercise, the task is to evaluate a decentralized search algorithm on a
network where the edges are created according to a hierarchical tree
structure. The leaves of the tree will form the nodes of the network and the
edge probabilities between two nodes depends on their proximity in the
underlying tree structure.
P2P networks can be organized in a tree hierarchy, where the root is the
main software application and the second level contains the different
countries. The third level represents the different states and the fourth level
is the different cities. There could be several more levels depending on the
size and structure of the P2P network. Nevertheless, the final level consists of
the clients.
In this problem, there are two networks: one is the observed network, i.e.,
the edges between P2P clients, and the other is the hierarchical tree structure
that is used to generate the edges in the observed network.
For this exercise, we will use a complete, perfectly balanced b-ary tree T
(each node has b children and $$b\ge 2$$), and a network whose nodes
are the leaves of T. For any pair of network nodes v and w, h(v, w) denotes
the distance between the nodes and is defined as the height of the subtree
L(v, w) of T rooted at the lowest common ancestor of v and w. The distance
captures the intuition that clients in the same city are more likely to be
connected than, for example, in the same state.
To model this intuition, generate a random network on the leaf nodes where
for a node v, the probability distribution of node v creating an edge to any
other node w is given by Eq. 6.1
$$\begin{aligned} p_{v}(w) = \frac{1}{Z}b^{-h(v,w)} \end{aligned}$$
(6.1)
where $$Z = \sum _{w \ne v}b^{-h(v,w)}$$ is a normalizing constant.
Next, set some parameter k and ensure that every node v has exactly k
outgoing edges, using the following procedure. For each node v, sample a
random node w according to $$p_{v}$$ and create edge (v, w) in the
network. Continue this until v has exactly k neighbours. Equivalently, after
an edge is added from v to w, set $$p_{v}(w)$$ to 0 and renormalize
with a new Z to ensure that $$\sum _{w} p(w) = 1$$. This results in a k-
regular directed network.
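A sketch of this generation procedure is given below. Since the exercise that follows varies an exponent $$\alpha $$, a natural reading is that the sampling probability is proportional to $$b^{-\alpha h(v,w)}$$, with Eq. 6.1 corresponding to the case $$\alpha = 1$$; the function and parameter names are illustrative, not prescribed by the text.

import random
from itertools import product

def make_network(b, depth, k, alpha, seed=0):
    # Leaves of a complete b-ary tree of the given depth are the network nodes;
    # each node v receives k distinct out-edges, sampling w with probability
    # proportional to b^(-alpha * h(v, w)).
    rng = random.Random(seed)
    leaves = list(product(range(b), repeat=depth))   # a leaf is its root-to-leaf path

    def h(v, w):
        # height of the subtree rooted at the lowest common ancestor of v and w
        shared = next((i for i in range(depth) if v[i] != w[i]), depth)
        return depth - shared

    edges = {v: set() for v in leaves}
    for v in leaves:
        targets = [w for w in leaves if w != v]
        weights = [b ** (-alpha * h(v, w)) for w in targets]
        while len(edges[v]) < k:
            w = rng.choices(targets, weights=weights)[0]
            edges[v].add(w)                          # duplicates are simply resampled
    return leaves, edges, h

leaves, edges, h = make_network(b=2, depth=8, k=5, alpha=1.0)
print(len(leaves), "nodes, each with", len(edges[leaves[0]]), "out-edges")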
46
Create random networks for $$\alpha = 0.1, 0.2, \ldots , 10$$. For each of
these networks, sample 1000 unique random (s, t) pairs $$(s \ne t)$$.
Then do a decentralized search starting from s as follows. Assuming that
the current node is s, pick its neighbour u with smallest h(u, t) (break ties
arbitrarily). If $$u = t$$, the search succeeds. If $$h(s, t) > h(u, t)$$,
set s to u and repeat. If $$h(s, t) \le h(u, t)$$, the search fails.
For each $$\alpha $$, pick 1000 pairs of nodes and compute the average
path length for the searches that succeeded. Then draw a plot of the average
path length as a function of $$\alpha $$. Also, plot the search success
probability as a function of $$\alpha $$.
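Building on the make_network sketch above (and therefore reusing its edges and h), the following illustrative code runs the described greedy search and reports the success probability and the average successful path length for a few values of $$\alpha $$.

import random

def search(s, t, edges, h):
    # greedy search: move to the out-neighbour u minimising h(u, t);
    # succeed on reaching t, fail as soon as no strict progress is made
    path_len = 0
    while s != t:
        u = min(edges[s], key=lambda x: h(x, t))
        if u == t:
            return path_len + 1
        if h(u, t) >= h(s, t):
            return None
        s, path_len = u, path_len + 1
    return path_len

rng = random.Random(1)
for alpha in [0.5, 1.0, 2.0, 5.0]:
    leaves, edges, h = make_network(b=2, depth=8, k=5, alpha=alpha)
    results = [search(*rng.sample(leaves, 2), edges, h) for _ in range(1000)]
    ok = [r for r in results if r is not None]
    print("alpha =", alpha,
          " success:", round(len(ok) / len(results), 3),
          " avg path:", round(sum(ok) / max(len(ok), 1), 2))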
References
1. Clarke, Ian, Oskar Sandberg, Brandon Wiley, and Theodore W. Hong. 2001. Freenet: A distributed anonymous information storage and retrieval system. In Designing privacy enhancing technologies, 46–66. Berlin: Springer.
2. Lua, Eng Keong, Jon Crowcroft, Marcelo Pias, Ravi Sharma, and Steven Lim. 2005. A survey and comparison of peer-to-peer overlay network schemes. IEEE Communications Surveys & Tutorials 7 (2): 72–93.
3. Stoica, Ion, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek, and Hari Balakrishnan. 2003. Chord: A scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking (TON) 11 (1): 17–32.
4. Zhang, Hui, Ashish Goel, and Ramesh Govindan. 2002. Using the small-world model to improve freenet performance. In INFOCOM 2002. Twenty-first annual joint conference of the IEEE computer and communications societies. Proceedings, vol. 3, 1228–1237. IEEE.
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_7
7. Signed Networks
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig1_HTML.gif
Fig. 7.1 A, B and C are all friends of each other. Therefore this triad ( $$T_{3}$$ ) by satisfying
structural balance property is balanced
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig2_HTML.gif
Fig. 7.2 A and B are friends. However, both of them are enemies of C. Similar to Fig. 7.1, this triad (
$$T_{1}$$ ) is balanced
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig3_HTML.gif
Fig. 7.3 A is friends with B and C. However, B and C are enemies. Therefore the triad (
$$T_{2}$$ ) by failing to satisfy the structural balance property is unbalanced
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig4_HTML.gif
Fig. 7.4 A, B and C are all enemies of one another. Similar to Fig. 7.3, the triad ( $$T_{0}$$ ) is
unbalanced
A graph in which all possible triads satisfy the structural balance
property is called a balanced graph and one which does not is called an
unbalanced graph.
Figure 7.5 depicts a balanced and an unbalanced graph. The graph to the
left is balanced because all of the triads A,B,C; A,B,D; B,C,D and A,C,D
satisfy the structural balance property. However, the graph to the right is
unbalanced because the triads A,B,C and B,C,D do not satisfy the structural
balance property.
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig5_HTML.gif
Fig. 7.5 A balanced (left) and an unbalanced (right) graph
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig6_HTML.gif
Fig. 7.6 A signed graph
1.
Identify the supernodes. Supernodes are defined as the connected
components where each pair of adjacent vertices have a positive edge.
If the entire graph forms one supernode, then the graph is balanced. If
there is a vertex that cannot be part of any supernode, then this vertex is
a supernode in itself. Figure 7.7 depicts the supernodes of Fig. 7.6.
Here, we see that the vertices 4, 11, 14 and 15 are supernodes by
themselves.
2.
Now, beginning at a supernode, we assign each of the supernodes to
groups X or Y alternately. If every adjacent connected pair of
supernodes can be placed in different groups, then the graph is said to
be balanced. If such a placement is not possible, then the graph is
unbalanced. Figure 7.6 is unbalanced because, if we consider a
simplified graph of the supernodes (Fig. 7.8), then there are two ways
to bipartition these vertices starting from the vertex A: either A, C, G
and E are assigned to X with B, D and F assigned to Y or A, F, D and B
are assigned to X with E, G and C assigned to Y. In the first case, A and
E are placed in the same group while in the next case, A and B are
assigned to the same group. Since this is the case no matter which of
the vertices we start from, this graph is unbalanced.
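The two-step procedure above translates directly into code. The sketch below assumes an undirected NetworkX graph whose edges carry a 'sign' attribute of $$+1$$ or $$-1$$; the attribute name and the toy example are illustrative choices.

import networkx as nx

def is_balanced(G):
    # Step 1: supernodes = connected components of the positive-edge subgraph
    pos = nx.Graph()
    pos.add_nodes_from(G)
    pos.add_edges_from((u, v) for u, v, s in G.edges(data="sign") if s > 0)
    comp = {v: i for i, c in enumerate(nx.connected_components(pos)) for v in c}
    # Step 2: negative edges must connect different supernodes, and the
    # resulting supernode graph must be two-colourable (bipartite)
    reduced = nx.Graph()
    reduced.add_nodes_from(set(comp.values()))
    for u, v, s in G.edges(data="sign"):
        if s < 0:
            if comp[u] == comp[v]:
                return False               # negative edge inside a supernode
            reduced.add_edge(comp[u], comp[v])
    return nx.is_bipartite(reduced)

G = nx.Graph()
G.add_edge("A", "B", sign=+1)              # two friends ...
G.add_edge("B", "C", sign=-1)              # ... with a common enemy
G.add_edge("A", "C", sign=-1)
print(is_balanced(G))                      # True: this is the balanced triad T1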
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig7_HTML.gif
Fig. 7.7 Supernodes of the graph in Fig. 7.6
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig8_HTML.gif
Fig. 7.8 A simplified labelling of the supernodes for the graph in Fig. 7.6
Table 7.2 tabulates the number of all the balanced and unbalanced triads
in each of these datasets. Let p denote the fraction of positive edges in the
network, $$T_{i}$$ denote the type of the triad, $$|T_{i}|$$
denote the number of $$T_{i}$$, and $$p(T_{i})$$ denote the
fraction of triads of type $$T_{i}$$, computed as
$$p(T_{i}) = |T_{i}|/\Delta $$ where $$\Delta $$ denotes the
total number of triads. Now, we shuffle the signs of all the edges in the
graph (keeping the fraction p of positive edges the same), and we let
$$p_{0}(T_{i})$$ denote the expected fraction of triads that are of
type $$T_{i}$$ after this shuffling.
Table 7.2 Dataset statistics
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig9_HTML.gif
Fig. 7.9 All contexts (A,B;X) where the red edge closes the triad
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig10_HTML.gif
Fig. 7.10 Surprise values and predictions based on the competing theories of structural balance and
status
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig11_HTML.gif
Fig. 7.11 Given that the first edge was of sign X, $$P(Y \mid X)$$ gives the probability that
reciprocated edge is of sign Y
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig12_HTML.gif
Fig. 7.12 Edge reciprocation in balanced and unbalanced triads. Triads: number of
balanced/unbalanced triads in the network where one of the edges was reciprocated. P(RSS):
probability that the reciprocated edge is of the same sign. $$P(+ \mid +)$$ : probability that the
positive edge is later reciprocated with a plus. $$P(- \mid -)$$ : probability that the negative edge
is reciprocated with a minus
To understand the boundary between the balance and status theories and
where they apply, it is interesting to consider a particular subset of these
networks where the directed edges are used to create reciprocating
relationships. Figure 7.11 shows that users treat each other differently in the
context of reciprocating interactions than when they are using links to refer
to others who do not link back.
To consider how reciprocation between A and B is affected by the
context of A and B’s relationships to third nodes X, suppose that an A-B link
is part of a directed triad in which each of A and B has a link to or from a
node X. Now, B reciprocates the link to A. As indicated in Fig. 7.12, we find
that the B-A link is significantly more likely to have the same sign as the A-
B link when the original triad on A-B-X (viewed as an undirected triad) is
structurally balanced. In other words, when the initial A-B-X triad is
unbalanced, there is more of a latent tendency for B to “reverse the sign”
when she links back to A.
Transition of an Unbalanced Network to a Balanced One
A large signed network whose signs were randomly assigned was
almost surely unbalanced. This currently unbalanced network had to evolve
to a more balanced state with nodes changing their signs accordingly.
Reference [3] studied how an unbalanced network transitions to a balanced
one and focused on any human tendencies being exhibited.
They considered local triad dynamics (LTD) wherein every update step
chooses a triad at random. If this triad is balanced, $$T_{1}$$ or
$$T_{3}$$, no evolution occurs. If the triad is unbalanced,
$$T_{0}$$ or $$T_{2}$$, the sign of one of the links is
changed as follows: $$T_{2} \rightarrow T_{3}$$ occurs with
probability p, $$T_{2} \rightarrow T_{1}$$ occurs with probability
$$1-p$$, while $$T_{0} \rightarrow T_{1}$$ occurs with
probability 1. After an update step, the unbalanced triad becomes balanced
but this could cause a balanced triad that shares a link with this target to
become unbalanced. These could subsequently evolve to balance, leading to
new unbalanced triads.
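A small simulation sketch of these update rules on a complete signed graph follows; the network size, the value of p, the number of update steps and the random seed are arbitrary illustrative choices.

import random
from itertools import combinations

def ltd_step(sign, nodes, p, rng):
    # One LTD update: pick a random triad; if it is unbalanced, flip one link:
    # T2 -> T3 with probability p, T2 -> T1 with probability 1 - p, T0 -> T1 always.
    triad = rng.sample(nodes, 3)
    edges = list(combinations(triad, 2))
    plus = [e for e in edges if sign[frozenset(e)] > 0]
    if len(plus) in (1, 3):                          # T1 or T3: balanced
        return
    if len(plus) == 2:                               # T2: one negative link
        if rng.random() < p:
            neg = next(e for e in edges if sign[frozenset(e)] < 0)
            sign[frozenset(neg)] = +1                # T2 -> T3
        else:
            sign[frozenset(rng.choice(plus))] = -1   # T2 -> T1
    else:                                            # T0: all links negative
        sign[frozenset(rng.choice(edges))] = +1      # T0 -> T1

rng = random.Random(0)
N, p = 30, 0.6
nodes = list(range(N))
sign = {frozenset(e): rng.choice([+1, -1]) for e in combinations(nodes, 2)}
for _ in range(200000):
    ltd_step(sign, nodes, p, rng)

balanced = sum(
    1 for triad in combinations(nodes, 3)
    if sum(sign[frozenset(e)] > 0 for e in combinations(triad, 2)) in (1, 3))
total = N * (N - 1) * (N - 2) // 6
print("fraction of balanced triads:", round(balanced / total, 3))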
They show that for $$p < \frac{1}{2}$$, LTD quickly drives an
infinite network to a quasi-stationary dynamic state. As p passes through a
critical value of $$\frac{1}{2}$$, the network undergoes a phase
transition to a paradise state. On the other hand, a finite network always
reaches a balanced state. For $$p < \frac{1}{2}$$, this balanced
state is bipolar and the time to reach this state scales faster than
exponentially with network size. For $$p \ge \frac{1}{2}$$, the final
state is paradise. The time to reach this state scales algebraically with N
when $$p = \frac{1}{2}$$, and logarithmically in N for
$$p > \frac{1}{2}$$.
They also investigated constrained triad dynamics (CTD). A random
link was selected, and the sign of this link was changed if the total number
of unbalanced triads decreases. If the total number of unbalanced triads is
conserved in an update, then the update occurs with probability
$$\frac{1}{2}$$. Updates that would increase the total number of
unbalanced triads are not allowed. On average each link is changed once in
a unit of time. A crucial outcome of this is that a network is quickly driven
to a balanced state in a time that scales as ln N.
What is most important with user evaluations is to determine the factors
that drive one’s evaluations, and how a composite description
that accurately reflects the aggregate opinion of the community can be
created. The following are some of the studies that focus on addressing this
problem.
Reference [12] designed and analysed a large-scale randomized
experiment on a social news aggregation Web site to investigate whether
knowledge of such aggregates distorts decision-making . Prior ratings were
found to create significant bias in individual rating behaviour, and positive
and negative social influences were found to create asymmetric herding
effects. Whereas negative social influence inspired users to correct
manipulated ratings, positive social influence increased the likelihood of
positive ratings by $$32\%$$ and created accumulating positive
herding that increased final ratings by $$25\%$$ on average. This
positive herding was topic-dependent and affected by whether individuals
were viewing the opinions of friends or enemies. A mixture of changing
opinion and greater turnout under both manipulations together with a
natural tendency to up-vote on the site combined to create the herding
effects.
Reference [13] studied factors in how users give ratings in different
contexts, i.e, whether they are given anonymously or under one’s own name
and whether they are displayed publicly or held confidentially. They
investigated three datasets, Amazon.com reviews, Epinions ratings, and
CouchSurfing.com trust and friendship networks, which were found to
represent a variety of design choices in how ratings are collected and
shared. The findings indicate that ratings should not be taken at face value,
but rather that one should examine the context in which they were given.
Public, identified ratings tend to be disproportionately positive, but only
when the ratee is another user who can reciprocate.
Reference [1] studied YahooAnswers (YA) which is a large and diverse
question answer community, acting not only as a medium for knowledge
sharing, but as a place to seek advice, gather opinions, and satisfy one’s
curiosity about things which may not have a single best answer. The fact
about YA that sets it apart from others is that participants believe that
anything from the sordid intricacies of celebrities’ lives to conspiracy
theories is considered knowledge, worthy of being exchanged. Taking
advantage of the range of user behaviour in YA, several aspects of question-
answer dynamics were investigated. First, content properties and social
network interactions across different YA categories (or topics) were
contrasted. It was found that the categories could be clustered according to
thread length and overlap between the set of users who asked and those who
replied. While discussion topics or topics that did not focus on factual
answers tended to have longer threads, broader distributions of activity
levels, and their users tended to participate by both posing and replying to
questions, YA categories favouring factual questions had shorter thread
lengths on average and users typically did not occupy both a helper and
asker role in the same forum. It was found that the ego-networks easily
revealed YA categories where discussion threads, even in this constrained
question-answer format, tended to dominate. While many users are quite
broad, answering questions in many different categories, this was of a mild
detriment for specialized, technical categories. In those categories, users
who focused the most (had a lower entropy and a higher proportion of
answers just in that category) tended to have their answers selected as best
more often. Finally, they attempted to predict best answers based on
attributes of the question and the replier. They found that just the very basic
metric of reply length, along with the number of competing answers, and
the track record of the user, was most predictive of whether the answer
would be selected. The number of other best answers by a user, a potential
indicator of expertise, was predictive of an answer being selected as best,
but most significantly so for the technically focused categories.
Reference [8] explored CouchSurfing, an application which enables
users to either allow other users to sleep in their couch or sleep on someone
else’s couch. Due to security and privacy concerns, this application heavily
depends on reciprocity and trust among these users. By studying the surfing
activities, social networks and vouch networks, they found the following:
First, CouchSurfing is a community rife with generalized reciprocity, i.e,
active participants take on the role of both hosts and surfers, in roughly
equal proportion. About a third of those who hosted or surfed are in the
giant strongly connected component, such that one couch can be reached
from any other by following previous surfs. Second, the high degree of
activity and reciprocity is enabled by a reputation system wherein users
vouch for one another. They found that connections that are vouched, or
declared trustworthy can best be predicted based on the direct interaction
between the two individuals: their friendship degree, followed by the
overall experience from surfing or hosting with the other person, and also
how the two friends met. However, global measures that aim to propagate
trust, such as PageRank, are found to be poor predictors of whether an edge
is vouched. Although such metrics may be useful in assigning overall
reputation scores to individuals, they are too diffuse to predict specifically
whether one individual will vouch for another. Finally, the analysis revealed
a high rate of vouching: about a quarter of all edges that can be vouched
are, as are a majority of highly active users. While this could be a reflection
of a healthy web of trust, there are indications that vouches may be given
too freely. The main reason behind this high rate of vouching may be its
public nature. It can be awkward for friends to not give or reciprocate a
vouch, even if privately they have reservations about the trustworthiness of
the other person.
7.4.4 Rounding
The next problem encountered was that of converting a continuous value
into a discrete one ( $$\pm 1$$). There were the following three ways of
accomplishing this rounding:
1.
Global rounding: This rounding tries to align the ratio of trust to
distrust values in F to that in the input M. In the row vector
$$F_{i}$$, i trusts j if and only if $$F_{ij}$$ is within the top
$$\tau $$ fraction of entries of the vector $$F_{i}$$. This
threshold $$\tau $$ is chosen based on the overall relative fractions
of trust and distrust in the input.
2.
Local rounding: Here, we account for the trust/distrust behavior of i.
The conditions on $$F_{ij}$$ and $$\tau $$ are the same as in the
previous definition, except that $$\tau $$ is now chosen based on i’s own
relative fractions of trust and distrust.
3. Majority rounding: This rounding intends to capture the local structure
of the original trust and distrust matrix. Consider the set J of users on
whom i has expressed either trust or distrust. J is then a set of labelled
examples using which we are to predict the label of a user j,
$$j\ \notin \ J$$. We order J along with j according to the entries
$$F_{ij'}$$ where $$j' \in J \cup \{j\}$$. At the end of this, we
have an ordered sequence of trust and distrust labels with the unknown
label for j embedded in the sequence at a unique location. From this,
we predict the label of j to be that of the majority of the labels in the
smallest local neighbourhood surrounding it where the majority is well-
defined.
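These three rounding rules can be illustrated roughly as follows (a minimal sketch of our own, not the implementation used in [7]; F is assumed to be a dense NumPy array of continuous beliefs and M the original +1/−1/0 trust matrix):

import numpy as np

def round_row(f_row, tau):
    # Round one row of F to +/-1: the top tau fraction of entries become trust.
    cutoff = np.quantile(f_row, 1.0 - tau)
    return np.where(f_row >= cutoff, 1, -1)

def global_rounding(F, M):
    # tau is the overall fraction of trust statements in the input M
    tau = (M > 0).sum() / max((M != 0).sum(), 1)
    return np.vstack([round_row(F[i], tau) for i in range(len(F))])

def local_rounding(F, M):
    # tau_i reflects user i's own trust/distrust behaviour in M
    rows = []
    for i in range(len(F)):
        labelled = (M[i] != 0).sum()
        tau_i = (M[i] > 0).sum() / labelled if labelled else 0.5
        rows.append(round_row(F[i], tau_i))
    return np.vstack(rows)

def majority_rounding(F, M, i, j):
    # order the users i has already labelled by F[i][.], embed j, and take the
    # majority label of the smallest surrounding window with a clear majority
    J = np.flatnonzero(M[i] != 0)
    order = J[np.argsort(F[i, J])]
    pos = int(np.searchsorted(F[i, order], F[i, j]))
    for radius in range(1, len(order) + 1):
        window = order[max(0, pos - radius):pos + radius]
        s = int(M[i, window].sum())
        if s != 0:
            return 1 if s > 0 else -1
    return 1  # fall back to trust if no majority is ever well defined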
Epinions
For this study, the Epinions web of trust was constructed as a directed
graph consisting of 131,829 nodes and 841,372 edges, each labelled either
trust or distrust. Of these labelled edges, $$85.29\%$$ were labelled
trust; we interpret trust to be the real value $$+1.0$$ and distrust to be
$$-1.0$$.
The combinations of atomic propagations, distrust propagations,
iterative propagations and rounding methods gave 81 experimental
schemes. To determine whether any particular algorithm can correctly
induce the trust or distrust that i holds for j, a single trial edge (i, j) is
masked from the truth graph, and then each of the 81 schemes is asked to
guess whether i trusts j. This trial was performed on 3,250 randomly
masked edges for each of the 81 schemes, resulting in 263,000 total trust
computations; the results are depicted in Fig. 7.13. In this table, $$\epsilon $$
denotes the prediction error of an algorithm and a given rounding method,
i.e., $$\epsilon $$ is the fraction of incorrect predictions made by the
algorithm.
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig13_HTML.gif
Fig. 7.13 Prediction of the algorithms. Here, $$e^{*} = (0.4, 0.4, 0.1, 0.1)$$ , $$K = 20$$
The trust edges in the graph outnumber the distrust edges by a huge
margin: $$85\%$$ versus $$15\%$$. Hence, a naive algorithm that always predicts “trust”
will incur a prediction error of only $$15\%$$. Nevertheless, the
results are first reported for prediction on randomly masked edges in the
graph, as it reflects the underlying problem. However, to ensure that the
algorithms are not benefiting unduly from this bias, the largest balanced
subset of the 3,250 randomly masked trial edges is taken such that half
the edges are trust and the other half are distrust—this is done by taking all
the 498 distrust edges in the trial set as well as 498 randomly chosen
trust edges from the trial set. Thus, the size of this subset S is 996. The
prediction error in S is called $$\epsilon _{S}$$. The naive prediction
error on S would be $$50\%$$.
They found that even a small amount of information about distrust
(rather than information about trust alone) can provide tangibly better
judgements about how much user i should trust user j, and that a small
number of expressed trusts/distrust per individual can allow prediction of
trust between any two people in the system with surprisingly high accuracy.
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig14_HTML.gif
Fig. 7.14 Helpfulness ratio declines with the absolute value of a review’s deviation from the
computed star average. The line segments within the bars indicate the median helpfulness ratio; the
bars depict the helpfulness ratio’s second and third quartiles. Grey bars indicate that the amount of
data at that x value represents $$0.1\%$$ or less of the data depicted in the plot
1.
The conformity hypothesis: This hypothesis holds that a review is
evaluated as more helpful when its star rating is closer to the consensus
star rating for the product.
2. The individual-bias hypothesis: According to this hypothesis, when a
user considers a review, she will rate it more highly if it expresses an
opinion that she agrees with. However, one might expect that if a
diverse range of individuals apply this rule, then the overall helpfulness
evaluation could be hard to distinguish from one based on conformity.
3.
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig15_HTML.gif
Fig. 7.15 Helpfulness ratio as a function of a review’s signed deviation
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig16_HTML.gif
Fig. 7.16 As the variance of the star ratings of reviews for a particular product increases, the median
helpfulness ratio curve becomes two-humped and the helpfulness ratio at signed deviation 0
(indicated in red) no longer represents the unique global maximum
For a given review, let the computed product-average star rating be the
average star rating as computed over all reviews of that product in the
dataset.
To investigate the straw-man quality-only hypothesis, we must assess the
text quality of the reviews. To avoid the subjectivity that may be involved when human
evaluators are used, a machine learning algorithm is trained to
automatically determine the degree of helpfulness of each review. For
$$i,j\in \{0,0.5,\ldots ,3.5\}$$ where $$i < j$$, we write
$$i \succ j$$ when the helpfulness ratio of reviews with absolute
deviation i is significantly larger than that of reviews with absolute deviation j.
Reference [5] explains a model that can explain the observed behaviour.
We evaluate the robustness of the observed social-effects phenomena by
comparing review data from three additional national Amazon
sites: Amazon.co.uk (UK), Amazon.de (Germany) and Amazon.co.jp
(Japan). The Japanese data exhibits a left hump that is higher than the right
one for reviews with high variance, i.e., reviews with star ratings below the
mean are more favored by helpfulness evaluators than the respective
reviews with positive deviations (Fig. 7.17).
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig17_HTML.gif
Fig. 7.17 Signed deviations vs. helpfulness ratio for variance = 3, in the Japanese (left) and U.S.
(right) data. The curve for Japan has a pronounced lean towards the left
They found that the perceived helpfulness of a review depends not just
on its content but also in subtle ways on how the expressed evaluation
relates to other evaluations of the same product.
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig18_HTML.gif
Fig. 7.18 Probability of a positive evaluation ( $$P(+)$$ ) as a function of the similarity (binary
cosine) between the evaluator and target edit vectors (s(e, t)) in Wikipedia
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig19_HTML.gif
Fig. 7.19 Probability of E positively evaluating T as a function of similarity in Stack Overflow
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig21_HTML.gif
Fig. 7.21 Probability of E voting positively on T as a function of $$\Delta $$ for different levels
of similarity on Stack Overflow for a all evaluators b no low status evaluators
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig22_HTML.gif
Fig. 7.22 Similarity between E and T pairs as a function of $$\Delta $$ for a English Wikipedia
and b Stack Overflow
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig23_HTML.gif
Fig. 7.23 Probability of E positively evaluating T versus $$\sigma _{T}$$ for various fixed
levels of $$\Delta $$ in Stack Overflow
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig24_HTML.gif
Fig. 7.24 Dip in various datasets
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig25_HTML.gif
Fig. 7.25 The Delta-similarity half-plane. Votes in each quadrant are treated as a group
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig26_HTML.gif
Fig. 7.26 Ballot-blind prediction results for a English Wikipedia b German Wikipedia
These results demonstrate that without even looking at the actual votes,
it is possible to derive a lot of information about the outcome of the election
from a small prefix of the evaluators. Very informative implicit feedback
could be gleaned from a small sampling of the audience consuming the
content in question, especially if previous evaluation behaviour by the
audience members is known.
7.7 Predicting Positive and Negative Links
Reference [9] investigated relationships in Epinions, Slashdot and
Wikipedia. They observed that the signs of links in the underlying social
networks can be predicted with high accuracy using generalized models.
These models are found to provide insight into some of the fundamental
principles that drive the formation of signed links in networks, shedding
light on theories of balance and status from social psychology. They also
suggest social computing applications by which the attitude of one user
towards another can be estimated from evidence provided by their
relationships with other members of the surrounding social network.
The study looked at three datasets.
The trust network of Epinions data spanning from the inception of the
site in 1999 until 2003. This network contained 119,217 nodes and
841,000 edges, of which $$85.0\%$$ were positive. 80,668 users
received at least one trust or distrust edge, while there were 49,534 users
that created at least one and received at least one signed edge. In this
network, s(u, v) indicates whether u had expressed trust or distrust of user
v.
The social network of the technology-related news website Slashdot,
where u can designate v as either a “friend” or “foe” to indicate u’s
approval or disapproval of v’s comments. Slashdot was crawled
in 2009 to obtain its network of 82,144 users and 549,202 edges of
which $$77.4\%$$ are positive. 70,284 users received at least one
signed edge, and there were 32,188 users with non-zero in- and out-
degree.
The network of votes cast by Wikipedia users in elections for promoting
individuals to the role of admin. A signed link indicated a positive or
negative vote by one user on the promotion of another. Using the January
2008 complete dump of Wikipedia page edit history all administrator
election and vote history data was extracted. This gave 2,794 elections
with 103,747 total votes and 7,118 users participating in the elections.
Out of this total, 1,235 elections resulted in a successful promotion,
while 1,559 elections did not result in the promotion of the candidate.
The resulting network contained 7,118 nodes and 103,747 edges of
which $$78.7\%$$ are positive. There were 2,794 nodes that received
at least one edge and 1,376 users that both received and created signed
edges. s(u, v) indicates whether u voted for or against the promotion of v
to admin status.
In all of these networks the background proportion of positive edges is
about the same, with $$\approx \!80\%$$ of the edges having a
positive sign.
The aim was to answer the question: “How does the sign of a given link
interact with the pattern of link signs in its local vicinity, or more broadly
throughout the network? Moreover, what are the plausible configurations of
link signs in real social networks?” The attempt is to infer the attitude of
one user towards another, using the positive and negative relations that have
been observed in the vicinity of this user. This becomes particularly
important in the case of recommender systems where users’ pre-existing
attitudes and opinions must be deliberated before recommending a link.
This involves predicting the sign of the link to be recommended before
actually suggesting this.
This gives us the edge sign prediction problem. Formally, suppose we are given a social
network with signs on all its edges, except that the sign on the edge from node u to
node v, denoted s(u, v), has been “hidden”. With what probability can we
infer this sign s(u, v) using the information provided by the rest of the
network? In a way, this problem is similar to the problem of link prediction.
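As an illustration of how such a predictor could be set up (a simplified sketch, not the exact feature set or methodology of [9]; G is assumed to be a networkx DiGraph and edge_sign a dict mapping (u, v) to $$\pm 1$$), one can train a logistic regression over simple signed-degree and embeddedness features:

import networkx as nx
from sklearn.linear_model import LogisticRegression

def pair_features(G, edge_sign, u, v):
    # Local features for the ordered pair (u, v): signed out-degree of u,
    # signed in-degree of v, and the embeddedness (common neighbours) of the pair.
    # The (u, v) edge itself is excluded so its sign does not leak into the features.
    pos_out = sum(1 for w in G.successors(u) if w != v and edge_sign.get((u, w)) == 1)
    neg_out = sum(1 for w in G.successors(u) if w != v and edge_sign.get((u, w)) == -1)
    pos_in = sum(1 for w in G.predecessors(v) if w != u and edge_sign.get((w, v)) == 1)
    neg_in = sum(1 for w in G.predecessors(v) if w != u and edge_sign.get((w, v)) == -1)
    common = len(set(nx.all_neighbors(G, u)) & set(nx.all_neighbors(G, v)))
    return [pos_out, neg_out, pos_in, neg_in, common]

def fit_sign_predictor(G, edge_sign):
    X = [pair_features(G, edge_sign, u, v) for u, v in G.edges()]
    y = [edge_sign[(u, v)] for u, v in G.edges()]
    return LogisticRegression(max_iter=1000).fit(X, y)

# To guess a "hidden" sign s(u, v): build the features for (u, v) and query the model.
# model = fit_sign_predictor(G, edge_sign)
# print(model.predict([pair_features(G, edge_sign, u, v)])[0])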
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig28_HTML.gif
Fig. 7.28 Accuracy of predicting the sign of an edge based on the signs of all the other edges in the
network in a Epinions, b Slashdot and c Wikipedia
../images/462433_1_En_7_Chapter/462433_1_En_7_Fig29_HTML.gif
Fig. 7.29 Accuracy for handwritten heuristics as a function of minimum edge embeddedness
Problems
Download the signed Epinions social network dataset available at https://
snap.stanford.edu/data/soc-sign-epinions.txt.gz.
Consider the graph as undirected and compute the following:
47 Calculate the count and fraction of triads of each type in this network.
48 Calculate the fraction of positive and negative edges in the graph. Let
the fraction of positive edges be p. Assuming that each edge of a triad will
independently be assigned a positive sign with probability p and a negative
sign with probability $$1 - p$$, calculate the probability of each type of
triad.
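One possible way to approach these two problems is sketched below (it assumes the gzipped file contains lines of the form "FromNodeId ToNodeId Sign" after comment lines beginning with '#'; reciprocal edges with conflicting signs are simply overwritten, since the graph is treated as undirected):

import gzip
from itertools import combinations
from collections import Counter
from math import comb
import networkx as nx

G = nx.Graph()
with gzip.open("soc-sign-epinions.txt.gz", "rt") as f:
    for line in f:
        if line.startswith("#"):
            continue
        u, v, s = line.split()
        G.add_edge(int(u), int(v), sign=int(s))

pos_edges = sum(1 for _, _, d in G.edges(data=True) if d["sign"] == 1)
p = pos_edges / G.number_of_edges()
print("fraction positive:", p, "fraction negative:", 1 - p)

# Count each triad once (u < v < w) and classify it by its number of + edges.
triads = Counter()
for u in G:
    nbrs = sorted(v for v in G[u] if v > u)
    for v, w in combinations(nbrs, 2):
        if G.has_edge(v, w):
            k = sum(G[a][b]["sign"] == 1 for a, b in ((u, v), (u, w), (v, w)))
            triads[k] += 1

total = sum(triads.values())
for k in range(3, -1, -1):
    expected = comb(3, k) * p**k * (1 - p)**(3 - k)   # independence assumption of Problem 48
    print(f"{k} positive edges: count={triads[k]}, "
          f"fraction={triads[k] / total:.4f}, expected={expected:.4f}")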
References
1. Adamic, Lada A., Jun Zhang, Eytan Bakshy, and Mark S. Ackerman. 2008. Knowledge sharing
and Yahoo answers: Everyone knows something. In Proceedings of the 17th international
conference on World Wide Web, 665–674. ACM.
2. Anderson, Ashton, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2012. Effects of user
similarity in social media. In Proceedings of the fifth ACM international conference on Web
search and data mining, 703–712. ACM.
3. Antal, Tibor, Pavel L. Krapivsky, and Sidney Redner. 2005. Dynamics of social balance on
networks. Physical Review E 72(3):036121.
4. Brzozowski, Michael J., Tad Hogg, and Gabor Szabo. 2008. Friends and foes: Ideological social
networking. In Proceedings of the SIGCHI conference on human factors in computing systems,
817–820. ACM.
5. Danescu-Niculescu-Mizil, Cristian, Gueorgi Kossinets, Jon Kleinberg, and Lillian Lee. 2009.
How opinions are received by online communities: A case study on Amazon. com helpfulness
votes. In Proceedings of the 18th international conference on World Wide Web, 141–150. ACM.
6. Davis, James A. 1963. Structural balance, mechanical solidarity, and interpersonal relations.
American Journal of Sociology 68 (4): 444–462.
7. Guha, Ramanthan, Ravi Kumar, Prabhakar Raghavan, and Andrew Tomkins. 2004. Propagation
of trust and distrust. In Proceedings of the 13th international conference on World Wide Web,
403–412. ACM.
8. Lauterbach, Debra, Hung Truong, Tanuj Shah, and Lada Adamic. 2009. Surfing a web of trust:
Reputation and reciprocity on couchsurfing. com. In 2009 International conference on
computational science and engineering (CSE’09), vol. 4, 346–353. IEEE.
9. Leskovec, Jure, Daniel Huttenlocher, and Jon Kleinberg. 2010. Predicting positive and negative
links in online social networks. In Proceedings of the 19th international conference on World
Wide Web, 641–650. ACM.
10. Leskovec, Jure, Daniel Huttenlocher, and Jon Kleinberg. 2010. Signed networks in social media.
In Proceedings of the SIGCHI conference on human factors in computing systems, 1361–1370.
ACM.
11. Marvel, Seth A., Jon Kleinberg, Robert D. Kleinberg, and Steven H. Strogatz. 2011. Continuous-
time model of structural balance. Proceedings of the National Academy of Sciences 108 (5):
1771–1776.
12. Muchnik, Lev, Sinan Aral, and Sean J. Taylor. 2013. Social influence bias: A randomized
experiment. Science 341(6146):647–651.
13. Teng, ChunYuen, Debra Lauterbach, and Lada A. Adamic. 2010. I rate you. You rate me. Should
we do so publicly? In WOSN.
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_8
Krishna Raj P. M.
Email: krishnarajpm@gmail.com
A B
A a,a 0,0
B 0,0 b,b
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig1_HTML.gif
Fig. 8.1 v must choose between behaviours A and B based on its neighbours behaviours
In the long run, this behaviour adoption leads to one of the following
states of equilibria. Either everyone adopts behaviour A, or everyone adopts
behaviour B. Additionally, there exists a third possibility where nodes
adopting behaviour A coexist with nodes adopting behaviour B. In this
section, we will understand the network circumstances that will lead to one
of these possibilities.
Assume a network where every node initially has behaviour B. Let a
small portion of the nodes be early adopters of the behaviour A. These early
adopters choose A for reasons other than those guiding the coordination
game, while the other nodes operate within the rules of the game. Now with
every time step, each of these nodes following B will adopt A based on the
threshold rule. This adoption will cascade until one of the two possibilities
occur: Either all nodes eventually adopt A leading to a complete cascade ,
or an impasse arises between those adopting A and the ones adopting B
forming a partial cascade .
Figures 8.2, 8.3, 8.4 and 8.5 all show an example of a complete cascade.
In this network, let the payoffs be $$a = 4$$ and $$b = 5$$, and
the initial adopters be v and w. Assume that all the nodes start with
behaviour B. In the first time step, $$q = \frac{5}{9}$$ but
$$p = \frac{2}{3}$$ for the nodes r and t, and
$$p = \frac{1}{3}$$ for the nodes s and u. Since the p for nodes r and
t is greater than q, r and t will adopt A. However, the p for nodes s and u is
less than q. Therefore these two nodes will not adopt A. In the second time
step, $$p = \frac{2}{3}$$ for the nodes s and u. p being greater than q
causes the remaining two nodes to adopt the behaviour, thereby causing a
complete cascade.
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig2_HTML.gif
Fig. 8.2 Initial network where all nodes exhibit the behaviour B
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig3_HTML.gif
Fig. 8.3 Nodes v and w are initial adopters of behaviour A while all the other nodes still exhibit
behaviour B
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig4_HTML.gif
Fig. 8.4 First time step where r and t adopt behaviour A by threshold rule
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig5_HTML.gif
Fig. 8.5 Second time step where s and u adopt behaviour A also by threshold rule
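This threshold process is easy to simulate directly. The sketch below (our own illustration, assuming the threshold q = b/(a + b) and the "adopt A when the fraction p of neighbours playing A is at least q" rule, consistent with the values used in the examples) reproduces the cascade of Figs. 8.2–8.5:

import networkx as nx

def threshold_cascade(G, initial_adopters, a, b, steps=10):
    # Nodes switch from B to A when the fraction p of their neighbours playing A
    # reaches q = b / (a + b); the initial adopters play A unconditionally.
    q = b / (a + b)
    adopters = set(initial_adopters)
    for _ in range(steps):
        new = {v for v in G if v not in adopters and G.degree(v) > 0
               and sum(1 for u in G[v] if u in adopters) / G.degree(v) >= q}
        if not new:
            break        # impasse: partial cascade (or the cascade is already complete)
        adopters |= new
    return adopters

# the example of Figs. 8.2-8.5: v and w are the early adopters, a = 4, b = 5
# complete = threshold_cascade(G, {"v", "w"}, a=4, b=5)
# print(len(complete) == G.number_of_nodes())   # True indicates a complete cascade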
Figures 8.6, 8.7 and 8.8 illustrate a partial cascade. Similar to the
complete cascade depiction, we begin with a network where all the nodes
exhibit behaviour B with payoffs $$a = 3$$ and $$b = 2$$. The
nodes 7 and 8 are the initial adopters, and $$q = \frac{2}{5}$$. In the
first step, node 5 with $$p = \frac{2}{3}$$ and node 10 with
$$p = \frac{1}{2}$$ are the only two nodes that can change
behaviour to A. In the second step, node 4 has $$p = \frac{2}{3}$$
and node 9 has $$p = \frac{1}{1}$$, causing them to switch to A. In
the third time step, node 6 with $$p = \frac{2}{3}$$ adopts A.
Beyond these adoptions, there are no possible cascades leading to a partial
cascade of behaviour A.
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig6_HTML.gif
Fig. 8.6 Initial network where all nodes have behaviour B
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig7_HTML.gif
Fig. 8.7 Nodes 7 and 8 are early adopters of behaviour A while all the other nodes exhibit B
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig8_HTML.gif
Fig. 8.8 After three time steps there are no further cascades
There are two ways for a partial cascade to turn into a complete
cascade. The first way is for the payoff of one of the behaviours to exceed
the other. In Fig. 8.6, if the payoff a is increased, then the whole network
will eventually switch to A. The other way is to coerce some critical nodes
of one of the behaviours to adopt the other. This would restart the cascade,
ending in a complete cascade in favour of the coerced behaviour. In
Fig. 8.6, if nodes 12 and 13 were forced to adopt A, then this would lead to
a complete cascade. Instead, if 11 and 14 were coerced to switch, then there
would be no further cascades.
Partial cascades are caused due to the fact that the spread of a new
behaviour can stall when it tries to break into a tightly-knit community
within the network, i.e., homophily can often serve as a barrier to diffusion,
by making it hard for innovations to arrive from outside densely connected
communities. More formally, consider a set of initial adopters of behaviour
A, with a threshold of q for nodes in the remaining network to adopt
behaviour A. If the remaining network contains a cluster of density greater
than $$1 - q$$, then the set of initial adopters will not cause a
complete cascade. Whenever a set of initial adopters does not cause a
complete cascade with threshold q, the remaining network must contain a
cluster of density greater than $$1 - q$$.
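This condition is simple to check for a candidate set of nodes (a minimal sketch; here, as in the standard statement of this result, a cluster of density d is taken to be a set in which every node has at least a d fraction of its neighbours within the set):

def cluster_density(G, S):
    # The density of S as a cluster: the largest d such that every node in S
    # has at least a d fraction of its neighbours inside S.
    S = set(S)
    fracs = [sum(1 for u in G[v] if u in S) / G.degree(v)
             for v in S if G.degree(v) > 0]
    return min(fracs) if fracs else 0.0

def blocks_cascade(G, S, q):
    # S (assumed disjoint from the initial adopters) stops a cascade with
    # threshold q whenever its density exceeds 1 - q.
    return cluster_density(G, S) > 1 - q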
The above discussed models work on the assumption that the payoffs
that each adopter receives are the same, i.e., a for each adopter of A and b for
each adopter of B. However, we may consider the possibility of node-specific
payoffs, i.e., every node v receives a payoff $$a_{v}$$ for adopting A
and a payoff $$b_{v}$$ for adopting B. The payoff matrix in such a
coordination game is then as shown in Table 8.2.
Table 8.2 Coordination game for node specific payoffs
A B
A $$a_{v},a_{w}$$ 0, 0
B 0, 0 $$b_{v},b_{w}$$
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig9_HTML.gif
Fig. 8.9 An infinite grid with 9 early adopters of behaviour A while the others exhibit behaviour B
A B AB
A $$1-q,1-q$$ 0, 0 $$1-q,1-q-c$$
B 0, 0 q, q $$q,q-c$$
AB $$1-q-c,1-q$$ $$q-c,q$$ $$max(q,1-q)-c,max(q,1-q)-c$$
Let us consider that G is infinite and each node has a degree
$$\Delta \in \mathbb {N}$$. The per-edge cost for adopting AB is
defined as $$r = c/\Delta $$.
Reference [9] found that for values of q close to but less than
$$\frac{1}{2}$$, strategy A can cascade on the infinite line if r is
sufficiently small or sufficiently large, but not if r takes values in some
intermediate interval. In other words, strategy B (which represents the
worse technology, since $$q < \frac{1}{2}$$) will survive if and
only if the cost of adopting AB is calibrated to lie in this middle interval.
The explanation for this is as follows:
When r is very small, it is cheap for nodes to adopt AB as a strategy, and
so AB spreads through the whole network. Once AB is everywhere, the
best-response updates cause all nodes to switch to A, since they get the
same interaction benefits without paying the penalty of r.
When r is very large, nodes at the interface, with one A neighbour and
one B neighbour, will find it too expensive to choose AB, so they will
choose A (the better technology), and hence A will spread step-by-step
through the network.
When r takes an intermediate value, a node with one A neighbour and
one B neighbour, will find it most beneficial to adopt AB as a strategy.
Once this happens, the neighbour who is playing B will not have
sufficient incentive to switch, and the updates make no further progress.
Hence, this intermediate value of r allows a “boundary” of AB to form
between the adopters of A and the adopters of B.
Therefore, the situation facing B is this: if it is too permissive, it gets
invaded by AB followed by A; if it is too inflexible, forcing nodes to choose
just one of A or B, it gets destroyed by a cascade of direct conversions to A.
But if it has the right balance in the value of r, then the adoptions of A come
to a stop at a boundary where nodes adopt AB.
Reference [9] also shows that when there are three competing
technologies, two of these technologies have an incentive to adopt a limited
“strategic alliance” , partially increasing their interoperability to defend
against a new entrant.
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig10_HTML.gif
Fig. 8.10 The payoffs to node w on an infinite path with neighbours exhibiting behaviour A and B
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig11_HTML.gif
Fig. 8.11 By dividing the a-c plot based on the payoffs, we get the regions corresponding to the
different choices
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig12_HTML.gif
Fig. 8.12 The payoffs to node w on an infinite path with neighbours exhibiting behaviour AB and B
If $$a < 1$$, then B will provide w with the highest payoff
regardless of the value of c. Instead, let $$a \ge 1$$ and
$$b = 2$$ (if w adopts behaviour B, it can interact with both of its
neighbours). Similar to the case where w was sandwiched between
neighbours exhibiting A and B, $$a-c$$ plot as shown in Fig. 8.13
simplifies the view of the payoffs. This $$a-c$$ plot shows a diagonal
line segment from the point (1, 0) to the point (2, 1). To the left of this line
segment, B wins and the cascade stops. To the right of this line segment, AB
wins—so AB continues spreading to the right, and behind this wave of
AB’s, nodes will steadily drop B and use only A.
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig13_HTML.gif
Fig. 8.13 The a-c plot shows the regions where w chooses each of the possible strategies
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig14_HTML.gif
Fig. 8.14 The plot shows the four possible outcomes for how A spreads or fails to spread on the
infinite path, indicated by this division of the (a, c)-plane into four regions
Models are developed to understand systems where users are faced with
alternatives and the costs and benefits of each depend on how many others
choose which alternative. This is further complicated when the idea of a
“threshold” is introduced. The threshold is defined as the proportion of
neighbours who must take a decision before the user’s adoption of the same
decision yields net benefits that exceed the net costs for this user.
Reference [8] aimed to present a formal model that could predict, from
the initial distribution of thresholds, the final proportion of the users in a
system making each of the two decisions, i.e, finding a state of equilibrium.
Let the threshold be x, the frequency distribution be f(x), and the cumulative
distribution be F(x). Let the proportion of adopters at time t be denoted by
r(t). This process is described by the difference equation
$$r(t+1) = F[r(t)]$$. When the frequency distribution has a simple
form, the difference equation can be solved to give an expression for r(t) at
any value of t. Then, by setting $$r(t+1) = r(t)$$, the equilibrium
outcome can be found. However, when the functional form is not simple,
the equilibrium must be computed by forward recursion. Graphical
observation can be used to compute the equilibrium points instead of
manipulating the difference equations. Let’s start with the fact that we know
r(t). Since $$r(t+1) = F[r(t)]$$, this gives us $$r(t+1)$$.
Repeating this process we find the point $$r_{e}$$ at
$$F[r_{e}] = r_{e}$$. This is illustrated in Fig. 8.15.
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig15_HTML.gif
Fig. 8.15 Graphical method of finding the equilibrium point of a threshold distribution
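The forward recursion described above is straightforward to carry out numerically. The following sketch assumes F is supplied as a Python function returning the cumulative distribution of thresholds; the example distribution in the comment is our own illustration:

def equilibrium(F, r0=0.0, tol=1e-9, max_iter=10_000):
    # Iterate r(t+1) = F[r(t)] from r(0) = r0 until the proportion of adopters
    # stops changing; the fixed point r_e satisfies F[r_e] = r_e.
    r = r0
    for _ in range(max_iter):
        r_next = F(r)
        if abs(r_next - r) < tol:
            return r_next
        r = r_next
    return r

# example: 10% of the population always adopts and the remaining thresholds are
# uniform on (0, 1], so F(r) = 0.1 + 0.9 r; the stable fixed point is r_e = 1.0
# print(equilibrium(lambda r: 0.1 + 0.9 * min(r, 1.0)))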
0 1
0 u(0, 0), u(0, 0) u(0, 1), u(1, 0)
1 u(1, 0), u(0, 1) u(1, 1), u(1, 1)
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig16_HTML.gif
Fig. 8.16 Contact network for branching process
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig17_HTML.gif
Fig. 8.17 Contact network for branching process where high infection probability leads to
widespread
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig18_HTML.gif
Fig. 8.18 Contact network for branching process where low infection probability leads to the
disappearance of the disease
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig19_HTML.gif
Fig. 8.19 Repeated application of $$f(x) = 1-(1-px)^{k}$$
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig20_HTML.gif
Fig. 8.20 A contact network where each edge has an associated period of time denoting the time of
contact between the connected vertices
8.2.6.1 Concurrency
In certain situations it is not just the timing of the contacts that matter but
the pattern of timing also which can influence the severity of the epidemic.
A timing pattern of particular interest is concurrency. A person is involved
in concurrent partnerships if she has multiple active partnerships that
overlap in time. These concurrent patterns cause the disease to circulate
more vigorously through the network and enable any node with the disease
to spread it to any other.
Reference [7] proposed a model which when given as input the social
network graph with the edges labelled with probabilities of influence
between users could predict the time by which a user may be expected to
perform an action.
../images/462433_1_En_8_Chapter/462433_1_En_8_Fig21_HTML.gif
Fig. 8.21 Exposure curve for top 500 hashtags
Stickiness is defined as the probability that a piece of information will
pass from a person who knows or mentions it to a person who is exposed to
it. It is computed as the maximum value of p(k). Persistence is defined as
the relative extent to which repeated exposures to a piece of information
continue to have significant marginal effects on its adoption.
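Given an estimated exposure curve p(k), both quantities can be computed along the following lines. This is a sketch only: stickiness follows the definition above, while persistence is operationalised here as the ratio of the area under the curve to the area of its enclosing rectangle, one natural way to capture how slowly the marginal effect of repeated exposures decays.

import numpy as np

def stickiness(p):
    # Maximum value of the exposure curve p(k).
    return float(np.max(p))

def persistence(p):
    # Area under p(k) divided by the area of its enclosing rectangle; values
    # near 1 mean repeated exposures keep having a large marginal effect.
    p = np.asarray(p, dtype=float)
    return float(np.trapz(p) / (max(len(p) - 1, 1) * np.max(p)))

# an illustrative slowly decaying curve scores high, an early-peaking one low
# print(persistence([0.01, 0.02, 0.025, 0.024, 0.023]))   # roughly 0.86
# print(persistence([0.03, 0.01, 0.005, 0.003, 0.002]))   # roughly 0.28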
To understand how these exposure curves vary across different kinds of
hashtags, the 500 most-mentioned hashtags were classified according to
their topic. Then the curves p(k) were averaged separately within each
category and their shapes were compared. Many of the categories had p(k)
curves that did not differ significantly in shape from the average, but
unusual shapes for several important categories were found. First, for
political hashtags, the persistence had a significantly larger value than the
average - in other words, successive exposures to a political hashtag had an
unusually large effect relative to the peak. In contrast stand Twitter idiom hashtags
- a kind of hashtag that will be familiar to Twitter users, in which common
English words are concatenated together to serve as a marker for a
conversational theme (e.g. #cantlivewithout, #dontyouhate, #iloveitwhen,
and many others, including concatenated markers for weekly Twitter events
such as #musicmonday and #followfriday). For such hashtags, the stickiness was
high, but the persistence was unusually low; if a user didn’t adopt an idiom
after a small number of exposures, the marginal chance that they did so later fell
off quickly.
Next, the paper looks at the overall structure of interconnections among
the initial adopters of a hashtag. For this, the first m individuals to mention
a particular hashtag H were taken, and the structure of the subgraph
$$G_{m}$$ induced on these first m mentioners were studied. In this
structural context, it was found that political hashtags exhibited distinctive
features - in particular, the subgraphs $$G_{m}$$ for political
hashtags H tended to exhibit higher internal degree, a greater density of
triangles, and a large number of nodes not in $$G_{m}$$ who had a
significant number of neighbors in it.
In the real-world there are multiple pieces of information spreading
through the network simultaneously. These pieces of information do not
spread in isolation, independent of all other information currently diffusing
in the network. These pieces sometimes cooperate while in other times
compete with one another. Reference [14] explains a statistical model that
allows for this competition as well as cooperation of different contagions in
information diffusion. Competing contagions decrease each other’s
probability of spreading, while cooperating contagions help each other in
being adopted throughout the network. It was observed that the sequence in
which the contagions were exposed to a person played a role in the effect of
a particular contagion X. Thus all possible sequences of contagions have to
be considered. This model was evaluated on 18, 000 contagions
simultaneously spreading through the Twitter network. It learnt how
different contagions interact with each other and then used these interactions
to more accurately predict the diffusion of a contagion through the network.
The following summary for the effects of contagions was learnt:
A user is exposed to a certain contagion $$u_{1}$$. Whether or not
she retweets it, this contagion influences her.
This user is next exposed to another contagion $$u_{2}$$. If
$$u_{2}$$ is more infectious than $$u_{1}$$, the user shifts focus
from $$u_{1}$$ to $$u_{2}$$.
On the other hand, if $$u_{2}$$ is less infectious than $$u_{1}$$,
then the user will still be influenced by $$u_{1}$$ instead of switching
to $$u_{2}$$. However, one of two things can happen:
– If $$u_{2}$$ is highly related to $$u_{1}$$ in content, then the
user will be more receptive of $$u_{2}$$ and this will increase the
infection probability of $$u_{2}$$.
– If $$u_{2}$$ is unrelated to $$u_{1}$$, then the user is less
receptive of $$u_{2}$$ and the infection probability is decreased.
Moreover, the model also provided a compelling hypothesis for the
principles that govern content interaction in information diffusion. It was
found that contagions have a negative (suppressive) effect on
less infectious contagions that are of unrelated content or subject matter,
while at the same time they can dramatically increase the infection
probability of contagions that are less infectious but are highly related in
subject matter. These interactions caused a relative change in the spreading
probability of a contagion by $$71\%$$ on average. This model
provides ways in which promotion content could strategically be placed to
encourage its adoption, and also ways to suppress previously exposed
contagions (such as in the case of fake news).
Information reaches people not only through ties in online networks but
also through offline networks. Reference [15] devised such a model where a
node can receive information through both offline as well as online
connections. Using the Twitter data in [14], they studied how information
reached the nodes of the network. It was discovered that the information
tended to “jump” across the network, which could only be explained as an
effect of an unobservable external influence on the network. Additionally, it
was found that only about $$71\%$$ of the information volume in
Twitter could be attributed to network diffusion, and the remaining
$$29\%$$ was due to external events and factors outside the network.
Cascades in Recommendations
Reference [11] studied information cascades in the context of
recommendations, and in particular studied the patterns of cascading
recommendations that arise in large social networks. A large person-to-
person recommendation network from an online retailer, consisting of four
million people who made sixteen million recommendations on half a
million products, was investigated. During the period covered by the
dataset, each time a person purchased a book, DVD, video, or music
product, she was given the option of sending an e-mail message
recommending the item to friends. The first recipient to purchase the item
received a discount, and the sender received a referral credit with monetary
value. A person could make recommendations on a product only after
purchasing it. Since each sender had an incentive for making effective
referrals, it is natural to hypothesize that this dataset is a good source of
cascades. Such a dataset allowed the discovery of novel patterns: the
distribution of cascade sizes is approximately heavy-tailed; cascades tend to
be shallow, but occasional large bursts of propagation can occur. The
cascade sub-patterns revealed mostly small tree-like subgraphs; however
differences in connectivity, density, and the shape of cascades across
product types were observed. It was observed that the frequency of different
cascade subgraphs was not a simple consequence of differences in size or
density; rather, there were instances where denser subgraphs were more
frequent than sparser ones. The relative abundance of different cascade
subgraphs thus suggested subtle properties of the underlying social network
and recommendation process.
Reference [10] analysed blog information to find out how blogs behave
and how information propagates through the Blogspace , the space of all
blogs. Their findings are summarized as follows: The decline of a post’s
popularity was found to follow a power law with slope
$$\approx -1.5$$. The size of the cascades, size of blogs, in-degrees
and out-degrees all follow a power law. Stars and chains were the basic
components of cascades, with stars being more common. A SIS model was
found to generate cascades that match very well the real cascades with
respect to in-degree distribution and cascade size distribution.
Reference [6] developed a generative model that is able to produce the
temporal characteristics of the Blogspace. This model
$$\mathbb {ZC}$$ used a ‘zero-crossing’ approach based on a
random walk, combined with exploration and exploitation.
Reference [12] analysed a large network of mass media and blog posts
to determine how sentiment features of a post affect the sentiment of
connected posts and the structure of the network itself. The experiments
were conducted on a graph containing nearly 8 million nodes and 15
million edges. They found that (i) the nodes are not only influenced by their
immediate neighbours but also by their position within a cascade and that
cascade’s characteristics, (ii) deep cascades lead complex but predictable
lives, (iii) shallow cascades tend to be objective, and (iv) sentiment
becomes more polarized as depth increases.
Problems
Generate the following two graphs with random seed of 10:
Let the nodes in each of these graphs have IDs ranging from 0 to 9999.
Assume that the graphs represent the political climate of an upcoming
election between yourself and a rival with a total of 10000 voters. Most of
the voters have already made up their minds: $$40\%$$ will vote for
you, $$40\%$$ for your rival, and the remaining $$20\%$$ are
undecided. Let us say that each voter’s support is determined by the last
digit of their node ID. If the last digit is 0, 2, 4 or 6, the node supports you.
If the last digit is 1, 3, 5 or 7, the node supports your rival. And if the last
digit is 8 or 9, the node is undecided.
Assume that the loyalties of the ones that have already made up their
minds are steadfast. There are 10 days to the election and the undecided
voters will choose their candidate in the following manner:
1.
In each iteration, every undecided voter decides on a candidate. Voters
are processed in increasing order of node ID. For every undecided
voter, if the majority of their friends support you, they now support
you. If the majority of their friends support your rival, they now
support your rival.
2.
If a voter has equal number of friends supporting you and your rival,
support is assigned in an alternating fashion, starting from yourself. In
other words, the first tie leads to support for you, the second tie leads to
support for your rival, the third for you, the fourth for your rival, and so
on.
3.
When processing the updates, the values from the current iteration are
used.
4.
There are 10 iterations of the process described above. One happening
on each day.
5.
The 11th day is the election day, and the votes are counted.
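A direct simulation of this process might look as follows. This is a sketch under one reading of the rules above: undecided voters re-evaluate their choice each day using the values from the current iteration, and any tie (including the case where no friend has yet decided) is broken by the alternating rule, starting with yourself.

def initial_support(node_id):
    d = node_id % 10
    if d in (0, 2, 4, 6): return "you"
    if d in (1, 3, 5, 7): return "rival"
    return None                                    # undecided

def run_election(G):
    # 10 days of updates in increasing node-ID order, then count votes on day 11.
    decided = {v: initial_support(v) for v in G}   # None marks the undecided voters
    support = dict(decided)
    tie_goes_to_you = True
    for _day in range(10):
        for v in sorted(G):
            if decided[v] is not None:
                continue                           # steadfast voters never change
            yours = sum(1 for u in G[v] if support[u] == "you")
            rivals = sum(1 for u in G[v] if support[u] == "rival")
            if yours > rivals:
                support[v] = "you"
            elif rivals > yours:
                support[v] = "rival"
            else:                                  # tie: alternate, starting with you
                support[v] = "you" if tie_goes_to_you else "rival"
                tie_goes_to_you = not tie_goes_to_you
    return (sum(1 for v in G if support[v] == "you")
            - sum(1 for v in G if support[v] == "rival"))

For Problems 53 and 54 one would override the support of the persuaded voters (here, IDs 3000 to $$3000 + k/100 - 1$$) before running the ten iterations, and sweep over the values of k.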
Let us say that you have a total funding of Rs. 9000, and you have decided
to spend this money by hosting a live stream. Unfortunately, only the voters
with IDs 3000–3099 can watch the stream. However, your stream is so persuasive that any voter
who sees it will immediately decide to vote for you, regardless of whether
they had decided to vote for you, your rival, or were undecided. It
costs Rs. 1000 to reach 10 voters in sequential order, i.e., the first Rs. 1000
reaches voters 3000–3009, the second Rs. 1000 reaches voters 3010–3019,
and so on. In other words, a total of Rs. k reaches voters with IDs from
3000 to $$3000 + k/100 - 1$$. The live stream happens before the 10 day
period, and the persuaded voters never change their mind.
53 Simulate the effect of spending on the two graphs. First, read in the
two graphs again and assign the initial configurations as before. Now,
before the decision process, you purchase Rs. k of ads and go through the
decision process of counting votes.
For each of the two social graphs, plot Rs. k (the amount you spend) on
the x-axis (for values k = ) and the number of votes
you win by on the y-axis (that is, the number of votes for yourself less the
number of votes for your rival). Put these on the same plot. What is the
minimum amount you can spend to win the election in each of these
graphs?
54 Simulate the effect of the high roller dinner on the two graphs. First,
read in the graphs and assign the initial configuration as before. Now,
before the decision process, you spend Rs. k on the fancy dinner and then
go through the decision process of counting votes.
For each of the two social graphs, plot Rs. k (the amount you spend) on
the x-axis (for values ) and the number of votes
you win by on the y-axis (that is, the number of votes for yourself less the
number of votes for your rival). What is the minimum amount you can
spend to win the election in each of the two social graphs?
Assume that a mob has to choose between two behaviours, riot or not.
However, this behaviour depends on a threshold which varies from one
individual to another, i.e., an individual i has a threshold that determines
whether or not to participate. If the number of individuals already rioting is at
least i’s threshold, then i will also participate, otherwise i will refrain from the behaviour.
Assume that each individual has full knowledge of the behaviour of
all the other nodes in the network. In order to explore the impact of
thresholds on the final number of rioters, for a mob of n individuals, the
histogram of thresholds is defined, where $$N_{x}$$
expresses the number of individuals that have threshold x. For
example, $$N_{0}$$ is the number of people who riot no matter
what, $$N_{1}$$ is the number of people who will riot if one person
is rioting, and so on.
Let T = [1, 1, 1, 1, 1, 4, 1, 0, 4, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 4, 0, 1, 4,
0, 1, 1, 1, 4, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 4, 1, 1,
4, 1, 4, 0, 1, 0, 1, 1, 1, 0, 4, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1,
1, 1, 1, 0, 4, 0, 4, 0, 0, 1, 1, 1, 4, 0, 4, 0] be the vector of thresholds of 101
rioters.
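The final number of rioters for a given threshold vector can be found by simple iteration (a sketch of the forward recursion, to be applied to the vector T above):

def count_rioters(thresholds):
    # Repeatedly activate anyone whose threshold is at most the current
    # number of rioters, until the count stops growing.
    rioting = 0
    while True:
        new_total = sum(1 for t in thresholds if t <= rioting)
        if new_total == rioting:
            return rioting
        rioting = new_total

# small illustration: thresholds [0, 1, 1, 2, 5] settle at 4 rioters
# print(count_rioters([0, 1, 1, 2, 5]))
# print(count_rioters(T))   # apply to the 101 thresholds listed above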
References
1. Berger, Eli. 2001. Dynamic monopolies of constant size. Journal of Combinatorial Theory,
Series B 83 (2): 191–200.
[MathSciNet][Crossref]
2. Bikhchandani, Sushil, David Hirshleifer, and Ivo Welch. 1992. A theory of fads, fashion,
custom, and cultural change as informational cascades. Journal of Political Economy 100 (5):
992–1026.
[Crossref]
3. Centola, Damon. 2010. The spread of behavior in an online social network experiment. Science
329 (5996): 1194–1197.
[Crossref]
4.
Centola, Damon, and Michael Macy. 2007. Complex contagions and the weakness of long ties.
American Journal of Sociology 113 (3): 702–734.
[Crossref]
5. Centola, Damon, Víctor M. Eguíluz, and Michael W. Macy. 2007. Cascade dynamics of complex
propagation. Physica A: Statistical Mechanics and its Applications 374 (1): 449–456.
[Crossref]
6. Goetz, Michaela, Jure Leskovec, Mary McGlohon, and Christos Faloutsos. 2009. Modeling blog
dynamics. In ICWSM.
7. Goyal, Amit, Francesco Bonchi, and Laks V.S. Lakshmanan. 2010. Learning influence
probabilities in social networks. In Proceedings of the third ACM international conference on
Web search and data mining, 241–250. ACM.
8. Granovetter, Mark. 1978. Threshold models of collective behavior. American Journal of
Sociology 83 (6): 1420–1443.
9. Immorlica, Nicole, Jon Kleinberg, Mohammad Mahdian, and Tom Wexler. 2007. The role of
compatibility in the diffusion of technologies through social networks. In Proceedings of the 8th
ACM conference on Electronic commerce, 75–83. ACM.
10. Leskovec, Jure, Mary McGlohon, Christos Faloutsos, Natalie Glance, and Matthew Hurst. 2007.
Patterns of cascading behavior in large blog graphs. In Proceedings of the 2007 SIAM
international conference on data mining, 551–556. SIAM
11. Leskovec, Jure, Ajit Singh, and Jon Kleinberg. 2006. Patterns of influence in a recommendation
network. In Pacific-Asia conference on knowledge discovery and data mining, 380–389. Berlin:
Springer.
12. Miller, Mahalia, Conal Sathi, Daniel Wiesenthal, Jure Leskovec, and Christopher Potts. 2011.
Sentiment flow through hyperlink networks. In ICWSM.
13. Morris, Stephen. 2000. Contagion. The Review of Economic Studies 67 (1): 57–78.
[MathSciNet][Crossref]
14. Myers, Seth A., and Jure Leskovec. 2012. Clash of the contagions: Cooperation and competition
in information diffusion. In 2012 IEEE 12th international conference on data mining (ICDM),
539–548. IEEE.
15. Myers, Seth A., Chenguang Zhu, and Jure Leskovec. 2012. Information diffusion and external
influence in networks. In Proceedings of the 18th ACM SIGKDD international conference on
Knowledge discovery and data mining, 33–41. ACM.
16. Romero, Daniel M., Brendan Meeder, and Jon Kleinberg. 2011. Differences in the mechanics of
information diffusion across topics: idioms, political hashtags, and complex contagion on twitter.
In Proceedings of the 20th international conference on World wide web, 695–704. ACM.
17. Watts, Duncan J. 2002. A simple model of global cascades on random networks. Proceedings of
the National Academy of Sciences 99 (9): 5766–5771.
[MathSciNet][Crossref]
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_9
9. Influence Maximisation
Krishna Raj P. M.1 , Ankith Mohan1 and K. G. Srinivasa2
Krishna Raj P. M.
Email: krishnarajpm@gmail.com
(9.1)
(9.2)
(9.3)
For calculating the optimal marketing plan for a product that has not yet
been introduced to the market, we compute
(9.4)
where is the set of all possible configurations of and hence is
(9.5)
leads to
(9.6)
The goal is to find the marketing plan that maximises profit. Assume that M
is a boolean vector, c is the cost of marketing to a customer, is the
revenue from selling to the customer if no marketing action is performed,
and is the revenue if marketing is performed. and is the same
unless marketing action includes offering a discount. Let be the
isolation is then
(9.7)
This equation gives the intrinsic value of the customer. Let be the null
vector. The global lift in profit that results from a particular choice M of
customers to market to is then
(9.8)
(9.9)
(9.10)
To find the optimal M that maximises ELP, we must try all possible
combinations of assignments to its components. However, this is
intractable, so [4] proposes the following approximate procedures:
Single pass : For each i, set if and set
otherwise.
Greedy search : Set . Loop through the ’s setting each to
1 if . Continue looping until there are no
The effect that marketing to a person has on the rest of the network is
independent of the marketing actions to other customers. From a customer’s
network effect, we can directly compute whether he is worth marketing to.
Let the be the network effect of customer i for a product with
attributes Y. It is defined as the total increase in probability of purchasing in
the network (including ) that results from a unit change in :
(9.11)
(9.13)
(9.14)
(9.15)
This approximation is exact when r(z) is constant, which is the case in any
marketing scenario that is advertising-based. When this is the case, the
equation simplifies to:
(9.16)
With Eq. 9.16, we can directly estimate customer i’s lift in profit for any
marketing action z. To find the z that maximises the lift in profit, we take
the derivative with respect to z and set it equal to zero, resulting in:
$$\begin{aligned} r \Delta _{i}(Y)\frac{d\Delta P_{i}(z,Y)}{dz} = \frac{dc(z)}{dz} \end{aligned}$$
(9.17)
Assuming $$\Delta P_{i}(z,Y)$$ is differentiable, this allows us to
directly calculate the z which maximises
$$ELP_{i,total}^{z}(Y,M)$$, which is the optimal value for
$$M_{i}$$ in the M that maximizes ELP(Y, M). Hence, from the
customers’ network effects, $$\Delta _{i}(Y)$$, we can directly
calculate the optimal marketing plan.
Collaborative filtering systems were used to identify the items to
recommend to customers. In these systems, users rate a set of items, and
these ratings are used to recommend other items that the user might be
interested in. The quantitative collaborative filtering algorithm proposed in
[11] was used in this study. The algorithm predicts a user’s rating of an item
as the weighted average of the ratings given by similar users, and
recommends these items with high predicted ratings.
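A minimal version of this prediction rule, in the spirit of the GroupLens algorithm [11] but not its exact formulation, predicts a user's rating as her own mean plus a similarity-weighted average of other users' deviations from their means (R is assumed to be a users × items NumPy array with np.nan marking missing ratings):

import numpy as np

def predict_rating(R, user, item):
    # Predicted rating = user's mean + weighted average of other users'
    # deviations on this item, weighted by Pearson correlation with the user.
    means = np.array([np.nanmean(row) for row in R])
    num, den = 0.0, 0.0
    for v in range(R.shape[0]):
        if v == user or np.isnan(R[v, item]):
            continue
        both = ~np.isnan(R[user]) & ~np.isnan(R[v])   # items rated by both users
        if both.sum() < 2:
            continue
        sim = np.corrcoef(R[user, both], R[v, both])[0, 1]
        if np.isnan(sim):
            continue
        num += sim * (R[v, item] - means[v])
        den += abs(sim)
    return means[user] + num / den if den > 0 else means[user]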
These methodologies were applied on the problem of marketing movies
using the EachMovie collaborative filtering database. EachMovie contains
2.8 million ratings of 1628 movies by 72916 users, collected between 1996 and 1997
by the eponymous recommendation site. EachMovie consists of three
databases: one contains the ratings (on a scale of zero to five), one contains
the demographic information, and one contains information about the
movies. The methods, with certain modifications, were applied to the
Epinions dataset.
In summary, the goal is to market to a good customer. Although in a
general sense, the definition of good is a subjective one, this paper uses the
following operating definitions. A good customer is one who satisfies the
following conditions: (i) Likely to give the product a high rating, (ii) Has a
strong weight in determining the rating prediction for many of her
neighbours, (iii) Has many neighbours who are easily influenced by the
rating prediction they receive, (iv) Will have high probability of purchasing
the product, and thus will be likely to actually submit a rating that will
affect her neighbours, (v) has many neighbours with the aforementioned
four characteristics, (vi) will enjoy the marketed movie, (vii) has many
close friends, (viii) these close friends are easily swayed, (ix) the friends
will very likely see this movie, and (x) has friends whose friends have these
properties.
This property can be extended to show that for any $$\epsilon > 0$$,
there is a $$\gamma > 0$$ such that by using $$(1+\gamma )$$-
approximate values for the function to be optimized, we obtain a
$$(1-\frac{1}{e}-\epsilon )$$-approximation.
Further it is assumed that each node v has an associated non-negative
weight $$w_{v}$$ which tells us how important it is that v be
activated in the final outcome. If B denotes the set activated by the process
with initial activation A, then the weighted influence function
$$\sigma _{w}(A)$$ is defined to be the expected value over
outcomes B of the quantity $$\sum \limits _{v \in B}w_{v}$$.
The paper proves the following theorems:
Theorem 16
For a given targeted set A, the following two distributions over sets of
nodes are the same:
1.
The distribution over active sets obtained by running the linear
threshold model to completion starting from A.
2.
The distribution over sets reachable from A via live-edge paths,
under the random selection of live edges defined above.
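The greedy hill-climbing strategy that these guarantees justify can be sketched as follows, using Monte Carlo simulation of the independent cascade model to estimate the influence function (a simplified illustration with a uniform propagation probability p, not an optimized implementation):

import random
import networkx as nx

def simulate_ic(G, seeds, p):
    # One run of the independent cascade model with uniform probability p.
    active, frontier = set(seeds), list(seeds)
    while frontier:
        u = frontier.pop()
        for v in G[u]:
            if v not in active and random.random() < p:
                active.add(v)
                frontier.append(v)
    return len(active)

def sigma(G, seeds, p, runs=100):
    # Monte Carlo estimate of the expected spread sigma(A).
    return sum(simulate_ic(G, seeds, p) for _ in range(runs)) / runs

def greedy_seeds(G, k, p=0.01, runs=100):
    # Add, k times, the node with the largest estimated marginal gain.
    A = set()
    for _ in range(k):
        best = max((v for v in G if v not in A),
                   key=lambda v: sigma(G, A | {v}, p, runs))
        A.add(best)
    return A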
../images/462433_1_En_9_Chapter/462433_1_En_9_Fig1_HTML.gif
Fig. 9.1 Results for the linear threshold model
../images/462433_1_En_9_Chapter/462433_1_En_9_Fig2_HTML.gif
Fig. 9.2 Results for the weight cascade model
../images/462433_1_En_9_Chapter/462433_1_En_9_Fig3_HTML.gif
Fig. 9.3 Results for the independent cascade model with probability $$1\%$$
../images/462433_1_En_9_Chapter/462433_1_En_9_Fig4_HTML.gif
Fig. 9.4 Results for the independent cascade model with probability $$10\%$$
References
1. Bakshy, Eytan, J.M. Hofman, W.A. Mason, and D.J. Watts. 2011. Everyone’s an influencer:
quantifying influence on twitter. In Proceedings of the fourth ACM international conference on
Web search and data mining, 65–74. ACM.
2. Chen, Wei, Yajun Wang, and Siyu Yang. 2009. Efficient influence maximization in social
networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge
discovery and data mining, 199–208. ACM.
3. Chen, Wei , Yifei Yuan, and Li Zhang. 2010. Scalable influence maximization in social networks
under the linear threshold model. In 2010 IEEE 10th international conference on data mining
(ICDM), 88–97. IEEE.
4. Domingos, Pedro, and Matt Richardson. 2001. Mining the network value of customers. In
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and
data mining, 57–66. ACM.
5. Goldenberg, Jacob, Barak Libai, and Eitan Muller. 2001. Talk of the network: A complex
systems look at the underlying process of word-of-mouth. Marketing Letters 12 (3): 211–223.
[Crossref]
6. Goyal, Amit, Wei Lu, and Laks V.S. Lakshmanan. 2011. Simpath: An efficient algorithm for
influence maximization under the linear threshold model. In 2011 IEEE 11th international
conference on data mining (ICDM), 211–220. IEEE.
7. Granovetter, Mark S. 1977. The strength of weak ties. In Social networks, 347–367. Elsevier.
8.
Kempe, David, Jon Kleinberg, and Éva Tardos. 2003. Maximizing the spread of influence
through a social network. In Proceedings of the ninth ACM SIGKDD international conference on
Knowledge discovery and data mining, 137–146. ACM.
9. Leskovec, Jure, Lada A. Adamic, and Bernardo A. Huberman. 2007. The dynamics of viral
marketing. ACM Transactions on the Web (TWEB) 1 (1): 5.
[Crossref]
10. Leskovec, Jure, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne Van Briesen, and
Natalie Glance. 2007. Cost-effective outbreak detection in networks. In Proceedings of the 13th
ACM SIGKDD international conference on Knowledge discovery and data mining, 420–429.
ACM.
11. Resnick, Paul, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. Grouplens:
An open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM
conference on computer supported cooperative work, 175–186. ACM.
12. Richardson, Matthew, and Pedro Domingos. 2002. Mining knowledge-sharing sites for viral
marketing. In Proceedings of the eighth ACM SIGKDD international conference on knowledge
discovery and data mining, 61–70. ACM.
13. Singer, Yaron. 2012. How to win friends and influence people, truthfully: Influence
maximization mechanisms for social networks. In Proceedings of the fifth ACM international
conference on web search and data mining, 733–742. ACM.
14. Watts, Duncan J., and Peter Sheridan Dodds. 2007. Influentials, networks, and public opinion
formation. Journal of Consumer Research 34 (4): 441–458.
[Crossref]
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_10
Krishna Raj P. M.
Email: krishnarajpm@gmail.com
In the domain of blogs, bloggers publish posts and use hyper-links to refer
to other bloggers’ posts and content on the web. Each post is time stamped,
so the spread of information on the Blogspace can be observed. In this
setting, we want to select a set of blogs which are most up to date, i.e., our
goal is to select a small set of blogs which “catch” as many cascades as
possible.
The following four quantitative design objectives were used for evaluation:
10.1.5 Evaluation
A methodology for evaluating a given sensor design should comply with
two basic requirements: (i) it should be objective, and (ii) it should assess a
design regardless of the method used to obtain it.
This utility was distributed to all participants prior to the BWSN for testing
both the networks.
Thus, the goal of the BWSN was to objectively compare the solutions
obtained using different approaches to the problem of placing five and 20
sensors for two real water distribution systems of increasing complexity and
for four derivative cases, taking into account the aforementioned four design
objectives. Fifteen contributions were received from academia and
practitioners.
The paper proves that the objective functions described in the previous
section are submodular and by exploiting this property, an efficient
approximation algorithm was proposed.
For the DL function, $$\pi _{i}(t) = 0$$ and $$\pi _{i}(\infty ) = 1$$,
i.e, no penalty is incurred if the outbreak is detected in finite time, otherwise
the incurred penalty is 1. For the DT measure,
$$\pi _{i}(t) = min(t,T_{max})$$, where $$T_{max}$$ is the time
horizon. For the PA criterion, $$\pi _{i}(t)$$ is the size of the cascade i at
time t, and $$\pi _{i}(\infty )$$ is the size of the cascade at the end of the
horizon.
If we consider that every node has equal cost (i.e., unit cost, $$c(s) = 1$$
for all locations s), then the greedy algorithm starts with
$$A_{0} = \phi $$, and iteratively, in step k, adds the location
$$s_{k}$$ which maximizes the marginal gain
$$\begin{aligned} s_{k} = argmax_{s \in V \setminus A_{k-1}} R(A_{k-
1} \cup {s}) - R(A_{k-1}) \end{aligned}$$
(10.13)
The algorithm stops once it has selected B elements. For this unit cost case,
the greedy algorithm is proved to achieve at least $$63\%$$ of the optimal score
(Chap. 9). We will refer to this algorithm as the unit cost algorithm .
In the case where the nodes have non-constant costs, the greedy algorithm that
iteratively adds sensors until the budget is exhausted can fail badly, since a
very expensive sensor providing reward r is preferred over a cheaper sensor
providing reward $$r - \epsilon $$. To avoid this, we modify Eq. 10.13 to
take costs into account
$$\begin{aligned} s_{k} = argmax_{s \in V \setminus A_{k-1}}
\frac{R(A_{k-1} \cup {s}) - R(A_{k-1})}{c(s)} \end{aligned}$$
(10.14)
So the greedy algorithm picks the element maximising the benefit/cost ratio.
The algorithm stops once no element can be added to the current set A
without exceeding the budget. Unfortunately, this intuitive generalization of
the greedy algorithm can perform arbitrarily worse than the optimal
solution. Consider the case where we have two locations, $$s_{1}$$ and
$$s_{2}$$ , $$c(s_{1}) = \epsilon $$ and $$c(s_{2}) = B$$. Also
assume we have only one contaminant i, and $$R(\{s_{1}\}) = 2\epsilon $$
, and $$R(\{s_{2}\}) = B$$. Now,
$$(R(\{s_{1}\})-R(\phi ))/c(s_{1}) = 2$$, and
$$(R(\{s_{2}\})-R(\phi ))/c(s_{2}) = 1$$. Hence the greedy algorithm
would pick $$s_{1}$$. After selecting $$s_{1}$$, we cannot afford
$$s_{2}$$ anymore, and our total reward would be $$2\epsilon $$.
However, the optimal solution would pick $$s_{2}$$, achieving total
penalty reduction of B. As $$\epsilon $$ goes to 0, the performance of the
greedy algorithm becomes arbitrarily bad. This algorithm is called the
benefit-cost greedy algorithm .
The paper proposes the Cost-Effective Forward selection (CEF) algorithm. It computes $$A_{GCB}$$ using the benefit-cost greedy algorithm and $$A_{GUC}$$ using the unit-cost greedy algorithm. For both of these, CEF only considers elements which do not exceed the budget B. CEF then returns the solution with the higher score. Even though each solution by itself can be arbitrarily bad, if R is a non-decreasing submodular function with $$R(\phi ) = 0$$, then we get Eq. 10.15.
$$\begin{aligned} max\{R(A_{GCB}),R(A_{GUC})\} \ge \frac{1}{2}\left( 1-\frac{1}{e}\right) max_{A, c(A) \le B}R(A) \end{aligned}$$
(10.15)
The running time of CEF is O(B|V|). The approximation guarantees of $$(1-\frac{1}{e})$$ and $$\frac{1}{2}(1-\frac{1}{e})$$ in the unit and non-constant cost cases are offline, i.e., we can state them in advance, before running the actual algorithm. Online bounds on the performance can be computed for an arbitrary set of sensor locations. For $$\hat{A} \subseteq V$$ and each $$s \in V \setminus \hat{A}$$, let $$\delta _{s} = R(\hat{A} \cup \{s\}) - R(\hat{A})$$. Let $$r_{s} = \delta _{s}/c(s)$$, and let $$s_{1}, \ldots , s_{m}$$ be the sequence of locations with $$r_{s}$$ in decreasing order. Let k be such that $$C = \sum \limits _{i=1}^{k-1}c(s_{i}) \le B$$ and $$\sum \limits _{i=1}^{k}c(s_{i}) > B$$. Let $$\lambda = (B-C)/c(s_{k})$$; then we get Eq. 10.16.
$$\begin{aligned} max_{A,c(A)\le B}R(A) \le R(\hat{A}) + \sum \limits _{i=1}^{k-1}\delta _{s_{i}} + \lambda \delta _{s_{k}} \end{aligned}$$
(10.16)
This computes how far away $$\hat{A}$$ is from the optimal solution. This is found to give a 31% bound.
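As a sketch, the online bound of Eq. 10.16 can be evaluated directly once the marginal gains and costs of all remaining locations are known; the names below (deltas, costs) are illustrative:

def online_bound(R_hat, deltas, costs, B):
    """Upper bound on max_{A, c(A) <= B} R(A) for a candidate solution A_hat (Eq. 10.16)."""
    # Sort locations by benefit/cost ratio r_s = delta_s / c(s), in decreasing order.
    order = sorted(deltas, key=lambda s: deltas[s] / costs[s], reverse=True)
    bound, spent = R_hat, 0.0
    for s in order:
        if spent + costs[s] <= B:
            bound += deltas[s]                 # full contribution of s_1, ..., s_{k-1}
            spent += costs[s]
        else:
            lam = (B - spent) / costs[s]       # lambda = (B - C) / c(s_k)
            bound += lam * deltas[s]           # fractional contribution of s_k
            break
    return bound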
Most outbreaks are sparse, i.e., they affect only a small area of the network, and hence are detected by only a small number of nodes. Hence, most nodes s do not reduce the penalty incurred by an outbreak, i.e., $$R_{i}(s) = 0$$. However, this sparsity is only present when penalty reductions are considered. If for each sensor $$s \in V$$ and contaminant $$i \in I$$ we store the actual penalty $$\pi _{i}(s)$$, the resulting representation is not sparse. By representing the penalty function R as an inverted index, which allows fast lookup of the penalty reductions by sensor s, the sparsity can be exploited. Therefore, the penalty reduction can be computed as given in Eq. 10.17.
$$\begin{aligned} R(A) = \sum \limits _{i:\ i\ detected\ by\ A}P(i)\,max_{s\in A}R_{i}(\{s\}) \end{aligned}$$
(10.17)
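A small sketch of Eq. 10.17 with the inverted-index representation; penalty_reduction[i] is assumed to map scenario i only to the few sensors that actually detect it, and P[i] is the scenario probability (both names are illustrative):

def score(A, penalty_reduction, P):
    """Penalty reduction R(A) for a sensor set A (Eq. 10.17)."""
    A = set(A)
    total = 0.0
    for i, by_sensor in penalty_reduction.items():
        detecting = A & by_sensor.keys()        # sensors in A that detect scenario i
        if detecting:                           # scenario i is detected by A
            total += P[i] * max(by_sensor[s] for s in detecting)
    return total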
Even if we can quickly evaluate the score R(A) of any A, we still need to
perform a large number of these evaluations in order to run the greedy
algorithm. If we select k sensors among |V| locations, we roughly need k|V|
function evaluations. We can exploit submodularity further to require far
fewer function evaluations in practice. Assume we have computed the
marginal increments $$\delta _{s}(A) = R(A \cup \{s\}) - R(A)$$ (or $$\delta _{s}(A)/c(s)$$) for all $$s \in V \setminus A$$. As our node selection A grows, the marginal increments $$\delta _{s'}$$ (and $$\delta _{s'}/c(s')$$), i.e., the benefits of adding sensor $$s'$$, can never increase: for $$A \subseteq B \subseteq V$$, it holds that $$\delta _{s}(A) \ge \delta _{s}(B)$$. So instead of recomputing
$$\delta _{s} \equiv \delta _{s}(A)$$ for every sensor after adding
$$s'$$ (and hence requiring $$|V| - |A|$$ evaluations of R), we perform
lazy evaluations: Initially, we mark all $$\delta _{s}$$ as invalid. When
finding the next location to place a sensor, we go through the nodes in
decreasing order of their $$\delta _{s}$$. If the $$\delta _{s}$$ for the
top node s is invalid, we recompute it, and insert it into the existing order of
the $$\delta _{s}$$ (e.g., by using a priority queue). In many cases, the
recomputation of $$\delta _{s}$$ will lead to a new value which is not
much smaller, and hence often, the top element will stay the top element
even after recomputation. In this case, we found a new sensor to add,
without having re-evaluated $$\delta _{s}$$ for every location s. The
correctness of this lazy procedure follows directly from submodularity, and
leads to far fewer (expensive) evaluations of R. This is called the lazy
greedy algorithm CELF (Cost-Effective Lazy Forward selection) . This is
found to have a factor 700 improvement in speed compared to CEF.
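A compact sketch of the lazy evaluations behind CELF, using a heap of (possibly stale) marginal gains and assuming unit costs for brevity; R is again an arbitrary penalty-reduction oracle:

import heapq
from itertools import count

def celf(V, R, B):
    """Lazy (CELF-style) greedy selection of B locations under unit costs."""
    A, current = set(), R(frozenset())
    tie = count()  # tie-breaker so the heap never compares locations directly
    heap = [(-(R(frozenset({s})) - current), next(tie), s, 0) for s in V]
    heapq.heapify(heap)
    while len(A) < B and heap:
        neg_gain, _, s, valid_for = heapq.heappop(heap)
        if valid_for == len(A):
            # The gain is current w.r.t. A; by submodularity no stale entry can beat it.
            A.add(s)
            current += -neg_gain
        else:
            # Stale entry: recompute the marginal gain and push it back.
            gain = R(frozenset(A | {s})) - current
            heapq.heappush(heap, (-gain, next(tie), s, len(A)))
    return A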
10.2.1 Blogspace
A dataset having 45,000 blogs with 10.5 million posts and 16.2 million links was taken. Every cascade has a single starting post, and other posts recursively join by linking to posts within the cascade, whereby the links obey time order. We detect cascades by first identifying the starting post and then following in-links. 346,209 non-trivial cascades having at least 2 nodes were discovered. Since the cascade size distribution is heavy-tailed, the analysis was limited to cascades that had at least 10 nodes. The final dataset had 17,589 cascades, where each blog participated in 9.4 different cascades on average.
Figure 10.1 shows the results when the PA function is optimized. The offline and the online bounds can be computed regardless of the algorithm used. CELF is at most $$13.8\%$$ away from the optimal solution. On the right, we have the performance using various objective functions (from top to bottom: DL, DT, PA). DL increases the fastest, which means that one only needs to read a few blogs to detect most of the cascades, or equivalently that most cascades hit one of the big blogs. However, the population affected (PA) increases much more slowly, which means that one needs many more blogs to know about stories before the rest of the population does.
../images/462433_1_En_10_Chapter/462433_1_En_10_Fig1_HTML.
gif
Fig. 10.1
In Fig. 10.2, the CELF method is compared with several intuitive heuristic
selection techniques. The heuristics are: the number of posts on the blog, the
cumulative number of out-links of blog’s posts, the number of in-links the
blog received from other blogs in the dataset, and the number of out-links to
other blogs in the dataset. CELF is observed to greatly outperform the other
methods. For the figure on the right, given a budget of B posts, we select a set of blogs to optimize the PA objective. For the heuristics, a set of blogs optimizing the chosen heuristic (e.g., the total number of in-links of the selected blogs) was selected, while still fitting inside the budget of B posts. Again, CELF outperforms the next best heuristic by $$41\%$$.
../images/462433_1_En_10_Chapter/462433_1_En_10_Fig2_HTML.
gif
Fig. 10.2
In this water distribution system application, the data and rules introduced by the Battle of the Water Sensor Networks (BWSN) challenge were used. Both the small network on 129 nodes (BWSN1) and a large, realistic, 12,527-node distribution network (BWSN2) provided as part of the BWSN challenge were considered. In addition, a third water distribution network (NW3) of a large metropolitan area in the United States was considered. This network (not including the household level) contains 21,000 nodes and 25,000 pipes (edges). The networks consist of a static description (junctions and pipes) and dynamic parameters (time-varying water consumption demand patterns at different nodes, opening and closing of valves, pumps, tanks, etc.).
In the BWSN challenge, the goal is to select a set of 20 sensors,
simultaneously optimizing the objective functions DT, PA and DL. To
obtain cascades a realistic disease model was used, which depends on the
demands and the contaminant concentration at each node. In order to
evaluate these objectives, the EPANET simulator was used, which is based
on a physical model to provide realistic predictions on the detection time
and concentration of contaminant for any possible contamination event.
Simulations of 48 hours in length, with 5-minute simulation timesteps, were considered. Contaminations can happen at any node and at any time within the first 24 hours, and spread through the network according to the EPANET simulation. The time of the outbreak is important, since water consumption varies over the day and the contamination spreads at different rates depending on the time of the day. Altogether, a set of 3.6 million possible contamination scenarios was considered, each of which was associated with a “cascade” of contaminant spreading over the network.
Figure 10.3 presents the CELF score and the offline and online bounds for the PA objective on the BWSN2 network. On the right is shown CELF’s performance on all three objective functions.
../images/462433_1_En_10_Chapter/462433_1_En_10_Fig3_HTML.
gif
Fig. 10.3
../images/462433_1_En_10_Chapter/462433_1_En_10_Fig4_HTML.
gif
Fig. 10.4
Figure 10.5 shows the scores achieved by CELF compared with several
heuristic sensor placement techniques, where the nodes were ordered by
some “goodness” criteria, and then the top nodes were picked for the PA
objective function. The following criteria were considered: population at the
node, water flow through the node, and the diameter and the number of
pipes at the node. CELF outperforms the best heuristic by $$45\%$$.
../images/462433_1_En_10_Chapter/462433_1_En_10_Fig5_HTML.
gif
Fig. 10.5
Reference [1] proposed the CELF++ algorithm, which the paper showed to be 35–55% faster than CELF. Here $$\sigma _{S}$$ denotes the spread of seed set S. A heap Q is maintained with nodes corresponding to users in the network. The node of Q corresponding to user u stores a tuple of the form $$<u.mg1, u.prev\_best, u.mg2, u.flag>$$, where $$u.mg1 = \Delta _{u}(S)$$ is the marginal gain of u w.r.t. the current seed set S, $$u.prev\_best$$ is the node that has the maximal marginal gain among all the users examined in the current iteration before user u, $$u.mg2 = \Delta _{u}(S \cup \{prev\_best\})$$, and u.flag is the iteration number when u.mg1 was last updated. The idea is that if the node $$u.prev\_best$$ is picked as a seed in the current iteration, the marginal gain of u w.r.t. ( $$S \cup \{prev\_best\}$$) need not be recomputed in the next iteration. In addition, $$\Delta _{u}(S \cup \{prev\_best\})$$ need not be computed from scratch when $$\Delta _{u}(S)$$ is computed: the algorithm can be implemented in an efficient manner such that both $$\Delta _{u}(S)$$ and $$\Delta _{u}(S \cup \{prev\_best\})$$ are evaluated simultaneously in a single iteration of Monte Carlo simulation (which typically contains 10,000 runs).
The variable S denotes the current seed set, $$last\_seed$$ tracks the ID of the last seed user picked by the algorithm, and $$cur\_best$$ tracks the user having the maximum marginal gain w.r.t. S over all users examined in the current iteration. The algorithm starts by building the heap Q initially. Then, it continues to select seeds until the budget k is exhausted. As in CELF, the root element u of Q is examined, and if u.flag is equal to the size of the seed set, we pick u as the seed, as this indicates that u.mg1 is actually $$\Delta _{u}(S)$$. The optimization of CELF++ is that, if $$u.prev\_best$$ is the last seed picked, we can update u.mg1 to u.mg2 without recomputing the marginal gain; clearly, this can be done since u.mg2 has already been computed efficiently w.r.t. the last seed node picked. If none of the above cases applies, we recompute the marginal gain of u.
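The bookkeeping above can be sketched as follows; sigma is a (typically Monte Carlo) spread oracle over node sets, and the heap entries mirror the $$<mg1, prev\_best, mg2, flag>$$ tuples. This is only an illustrative outline of the idea, not the authors' implementation:

import heapq
from itertools import count

def celf_pp(nodes, sigma, k):
    """Sketch of CELF++-style seed selection with lazy gain updates."""
    S, spread = set(), 0.0
    last_seed, cur_best, best_mg = None, None, float("-inf")
    tie, heap = count(), []
    for u in nodes:                                 # build the initial heap
        mg1 = sigma({u}) - spread                   # u.mg1 w.r.t. S = {}
        mg2 = sigma({u, cur_best}) - sigma({cur_best}) if cur_best is not None else mg1
        heapq.heappush(heap, (-mg1, next(tie), u, cur_best, mg2, 0))
        if mg1 > best_mg:
            best_mg, cur_best = mg1, u
    while len(S) < k and heap:
        neg_mg1, _, u, prev_best, mg2, flag = heapq.heappop(heap)
        if flag == len(S):                          # u.mg1 is valid w.r.t. S: pick u as a seed
            S.add(u)
            spread += -neg_mg1
            last_seed, cur_best, best_mg = u, None, float("-inf")
            continue
        if prev_best == last_seed and flag == len(S) - 1:
            mg1 = mg2                               # the CELF++ shortcut: reuse mg2 as the new mg1
            new_prev, new_mg2 = None, mg1           # mg2 would be refreshed lazily in a full version
        else:                                       # otherwise recompute both gains
            mg1 = sigma(S | {u}) - spread
            new_prev = cur_best
            new_mg2 = (sigma(S | {cur_best, u}) - sigma(S | {cur_best})) if cur_best is not None else mg1
        heapq.heappush(heap, (-mg1, next(tie), u, new_prev, new_mg2, len(S)))
        if mg1 > best_mg:
            best_mg, cur_best = mg1, u
    return S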
Problems
Download the DBLP collaboration network dataset at https://snap.stanford.edu/data/bigdata/communities/com-dblp.ungraph.txt.gz.
This exercise is to explore how varying the set of initially infected nodes in a SIR model can affect how a contagion spreads through a network. We learnt in Sect. 8.2.3 that under the SIR model, every node can be either susceptible, infected, or recovered, and every node starts off as either susceptible or infected. Every infected neighbour of a susceptible node infects the susceptible node with probability $$\beta $$, and infected nodes can recover with probability $$\delta $$; once recovered, a node can no longer be infected.
../images/462433_1_En_10_Chapter/462433_1_En_10_Figa_HTML.
gif
57
Implement the SIR model in Algorithm 7 and run 100 simulations with
$$\beta = 0.05$$ and $$\delta = 0.5$$ for each of the following three
graphs:
1. 1.
The graph for the network in the dataset (will be referred to as the real
world graph).
2. 2.
3. 3.
58
Repeat the above process, but instead of selecting a random starting node,
infect the node with the highest degree. Compute the total percentage of
nodes that became infected in each simulation.
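A brief sketch of the simulation needed for these problems (Algorithm 7 itself is not reproduced here; the update rule simply follows the SIR dynamics described above, and the file path assumes the unzipped DBLP edge list):

import random
import networkx as nx

def run_sir(G, initially_infected, beta=0.05, delta=0.5):
    """Simulate one SIR run; return the fraction of nodes that were ever infected."""
    infected, recovered = set(initially_infected), set()
    while infected:
        newly_infected = set()
        # Each infected node infects each susceptible neighbour with probability beta.
        for u in infected:
            for v in G.neighbors(u):
                if v not in infected and v not in recovered and random.random() < beta:
                    newly_infected.add(v)
        # Each currently infected node recovers with probability delta.
        still_infected = {u for u in infected if random.random() >= delta}
        recovered |= infected - still_infected
        infected = still_infected | newly_infected
    return len(recovered) / G.number_of_nodes()

# Problem 58: start the infection at the highest-degree node and average over 100 runs.
# G = nx.read_edgelist("com-dblp.ungraph.txt")
# seed = max(G.nodes(), key=G.degree)
# print(sum(run_sir(G, [seed]) for _ in range(100)) / 100)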
59
References
1. Goyal, Amit, Wei Lu, and Laks V.S. Lakshmanan. 2011. CELF++: Optimizing the greedy algorithm for influence maximization in social networks. In Proceedings of the 20th international conference companion on World Wide Web, 47–48. ACM.
2. Krause, Andreas, and Carlos Guestrin. 2005. A note on the budgeted maximization of submodular functions.
3. Krause, Andreas, Jure Leskovec, Carlos Guestrin, Jeanne VanBriesen, and Christos Faloutsos. 2008. Efficient sensor placement optimization for securing large water distribution networks. Journal of Water Resources Planning and Management 134 (6): 516–526.
4. Leskovec, Jure, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie Glance. 2007. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 420–429. ACM.
5. Ostfeld, Avi, James G. Uber, Elad Salomons, Jonathan W. Berry, William E. Hart, Cindy A. Phillips, Jean-Paul Watson, Gianluca Dorini, Philip Jonkergouw, Zoran Kapelan, et al. 2008. The battle of the water sensor networks (BWSN): A design challenge for engineers and algorithms. Journal of Water Resources Planning and Management 134 (6): 556–568.
6. Pastor-Satorras, Romualdo, and Alessandro Vespignani. 2002. Immunization of complex networks. Physical Review E 65 (3): 036104.
An interesting question to ask is: which are the popular websites? Although popularity may be considered an elusive term with a lot of varying definitions, here we will restrict ourselves to taking a snapshot of the Web, counting the number of in-links to websites and using this as a measure of the site’s popularity.
On first thought, one would guess that the popularity would follow a normal
or Gaussian distribution, since the probability of observing a value that
exceeds the mean by more than c times the standard deviation decreases
exponentially in c. Central Limit Theorem supports this fact because it states
that if we take any sequence of small independent random quantities, then in
the limit their sum (or average) will be distributed according to the normal
distribution. So, if we assume that each website decides independently at
random whether to link to any other site, then the number of in-links to a
given page is the sum of many independent random quantities, and hence
we’d expect it to be normally distributed. Then, by this model, the number of
pages with k in-links should decrease exponentially in k, as k grows large.
However, we learnt in Chap. 2 that this is not the case. The fraction of
websites that have k in-links is approximately proportional to $$1/k^{2}$$
(the exponent is slightly larger than 2). A function that decreases as k to some
fixed power, such as $$1/k^{2}$$, is called a power law ; when used to measure the fraction of items having value k, it says that it is possible to see very large values of k.
Going back to the question of popularity, this means that extreme imbalances with very large values are likely to arise. This is true in reality because there are a few websites that are hugely popular compared to the others. Additionally, we have learnt in previous chapters that the degree distributions of several social network applications exhibit a power law. We could say that just as the normal distribution is widespread in natural science settings, so is the power law in settings where the network in question has a popularity factor to it.
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig1_HTML.
gif
Fig. 11.1
A log-log plot exhibiting a power law
In practice, few empirical phenomena obey power laws for all values of x.
More often the power law applies only for values greater than some
minimum $$x_{min}$$. In such cases we say that the tail of the
distribution follows a power law.
Reference [5] proposes the following recipe for analysing power-law distributed data:
1. Estimate the parameters $$x_{min}$$ and $$\alpha $$ of the power-law model.
2. Calculate the goodness-of-fit between the data and the power law. If the resulting p-value is greater than 0.1 the power law is a plausible hypothesis for the data, otherwise it is rejected.
3. Compare the power law with alternative hypotheses via a likelihood ratio test.
The power law has several monikers in various fields. It goes by the term Lotka distribution for scientific productivity [15], Bradford law for journal use [4], Pareto law of income distribution [16] and the Zipf law for literary word frequencies.
11.1.1 Model A
Model A is the basic model which the subsequent models rely upon. It starts
with no nodes and no edges at time step 0. At time step 1, a node with in-
weight 1 and out-weight 1 is added. At time step $$t + 1$$, with
probability $$1 - \alpha $$ a new node with in-weight 1 and out-weight 1 is
added. With probability $$\alpha $$ a new directed edge (u, v) is added to
the existing nodes. Here the origin u is chosen with probability proportional
to the current out-weight
$$w_{u,t}^{out} \overset{\mathrm {def}}{=} 1+ \delta _{u,t}^{out}$$
and the destination v is chosen with probability proportional to the current in-
weight
$$w_{v,t}^{in} \overset{\mathrm {def}}{=} 1+ \delta _{v,t}^{in}$$.
$$\delta _{u,t}^{out}$$ and $$\delta _{v,t}^{in}$$ denote the out-
degree of u and the in-degree of v at time t, respectively.
11.1.2 Model B
Model B starts with no node and no edge at time step 0. At time step 1, a
node with in-weight $$\gamma ^{in}$$ and out-weight
$$\gamma ^{out}$$ is added. At time step $$t + 1$$, with probability
$$1-\alpha $$ a new node with in-weight $$\gamma ^{in}$$ and out-
weight $$\gamma ^{out}$$ is added. With probability $$\alpha $$ a new
directed edge (u, v) is added to the existing nodes. Here the origin u is chosen with probability proportional to the current out-weight $$w_{u,t}^{out} \overset{\mathrm {def}}{=} \gamma ^{out} + \delta _{u,t}^{out}$$ and the destination v with probability proportional to the current in-weight $$w_{v,t}^{in} \overset{\mathrm {def}}{=} \gamma ^{in} + \delta _{v,t}^{in}$$. $$\delta _{u,t}^{out}$$ and $$\delta _{v,t}^{in}$$ denote the out-degree of u and the in-degree of v at time t, respectively.
In model B, at time step t the total in-weight $$w_{t}^{in}$$ and the out-
weight $$w_{t}^{out}$$ of the graph are random variables. The
probability that a new edge is added onto two particular nodes u and v is as
given in Eq. 11.3.
$$\begin{aligned} \alpha \frac{(\gamma ^{out}+\delta _{u,t}^{out})(\gamma ^{in}+\delta _{v,t}^{in})}{w_{t}^{in}w_{t}^{out}} \end{aligned}$$
(11.3)
11.1.3 Model C
Now we consider Model C, a general model with four specified types of edges to be added.
Assume that the random process of model C starts at time step $$t_{0}$$.
At $$t = t_{0}$$, we start with an initial directed graph with some vertices
and edges. At time step $$t > t_{0}$$, a new vertex is added and four
numbers $$m^{e,e}$$, $$m^{n,e}$$, $$m^{e,n}$$, $$m^{n,n}$$
are drawn according to some probability distribution. Assuming that the four
random variables are bounded, we proceed as follows:
Add $$m^{e,e}$$ edges randomly. The origins are chosen with the
probability proportional to the current out-degree and the destinations
are chosen proportional to the current in-degree.
Add $$m^{e,n}$$ edges into the new vertex randomly. The origins
are chosen with the probability proportional to the current out-degree
and the destinations are the new vertex.
11.1.4 Model D
Model A, B and C are all power law models for directed graphs. Here we
describe a general undirected model which we denote by Model D. It is a
natural variant of Model C. We assume that the random process of Model D starts at time step $$t_{0}$$. At $$t = t_{0}$$, we start with an initial undirected graph with some vertices and edges. At time step
$$t > t_{0}$$, a new vertex is added and three numbers $$m^{e,e}$$,
$$m^{n,e}$$, $$m^{n,n}$$ are drawn according to some probability
distribution. We assume that the three random variables are bounded. Then
we proceed as follows:
Add $$m^{e,e}$$ edges randomly. The vertices are chosen with the
probability proportional to the current degree.
For $$t \ge t_{0}$$ we form $$G(t + 1)$$ from G(t) according to the
following rules:
1. 1.
With probability $$\alpha $$, add a new vertex v together with an
edge from v to an existing vertex w, where w is chosen according to
$$d_{in} + \delta _{in}$$.
2. 2.
3. 3.
With probability $$\gamma $$, add a new vertex w and an edge from
an existing vertex v to w, where v is chosen according to
$$d_{out} + \delta _{out}$$.
Reference [7] studied how power law was exhibited in the demand for
products and how companies could operate based on this exhibition to
improve the size and quality of service provided to their patronage.
Companies that have limited inventory space have to opt for operating on
only those products that are at the top of the distribution, so as to remain in
the competition. On the other hand, there exist so-called “infinite-inventory” companies which are capable of servicing all products no matter how low their demand. This way these companies can profit by catering to all kinds of customers and earn from selling products their limited-space competitors could not.
This is attributed to two properties of the real-world networks that are not
incorporated in either of these models. First, both of these models begin with
a fixed number of vertices which does not change. However, this is rarely the
case in real-world networks, where this number varies. Second, the models
assume that the probability that two vertices are connected is uniform and
random. In contrast, real-world networks exhibit preferential attachment. So,
the probability with which a new vertex connects to the existing vertices is
not uniform, but there is a higher probability to be linked to a vertex that
already has a large number of connections.
This scale-free model was thus proposed to deal with these two issues. Scale-
free in the sense that the frequency of sampling (w.r.t. the growth rate) is
independent of the parameter of the resulting power law graphs. This model
is found to show properties resembling those of real-world networks. To
counter the first issue, starting with a small number $$m_{0}$$ of vertices,
at every time step a new vertex is added with $$m \le m_{0}$$ edges that
link the new vertex to m different vertices already present in the network. To
exhibit preferential attachment, the probability $$\prod $$ with which a
new vertex will be connected to a vertex i depends on the connectivity
$$k_{i}$$ of that vertex, such that
$$\prod (k_{i}) = k_{i}/\sum \limits _{j}k_{j}$$. After t time steps the
model leads to a random network with $$t+m_{0}$$ vertices and mt edges.
This network is found to exhibit a power-law with exponent
$$2.9 \pm 0.1$$.
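As a quick illustration (using the generator built into networkx rather than a hand-written loop), the resulting degree distribution can be inspected as follows:

import collections
import networkx as nx

# Grow a preferential attachment network: each arriving vertex attaches m = 3 edges.
G = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)

# Fraction of vertices with degree k; on a log-log plot this is roughly a straight line.
degree_counts = collections.Counter(d for _, d in G.degree())
n = G.number_of_nodes()
for k in sorted(degree_counts):
    print(k, degree_counts[k] / n)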
The following simple model, in which new sites copy the linking decisions of earlier sites, also gives rise to a power law:
1. Sites are created in order, and named $$1, 2, 3, \ldots , N$$.
2. When site j is created, it produces a link to an earlier site according to the following probabilistic rule, governed by a single number p between 0 and 1.
a. With probability p, site j chooses a site i uniformly at random from among all earlier sites, and creates a link to this site i.
b. With probability $$1-p$$, site j instead chooses a site i uniformly at random from among all earlier sites, and creates a link to the site that i points to.
3. This describes the creation of a single link from site j. One can repeat this process to create multiple, independently generated links from site j.
Step 2(b) is the key because site j is imitating the decision of site i. The main
result about this model is that if we run it for many sites, the fraction of sites
with k in-links will be distributed approximately according to a power law
$$1/k^{c}$$, where the value of the exponent c depends on the choice of
p. This dependence goes in an intuitive direction: as p gets smaller, so that
copying becomes more frequent, the exponent c gets smaller as well, making
one more likely to see extremely popular pages.
Popularity distribution is found to have a long tail as shown in Fig. 11.2. This shows some nodes which have very high popularity when compared to the other nodes. This popularity however drops off to give a long set of nodes who have more or less the same popularity. It is this latter set, a long list of nodes, which contributes to the long tail.
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig2_HTML.
gif
Fig. 11.2
Let the number of in-links to node j at time step $$t \ge j$$ be a random variable $$X_{j}(t)$$. The following two conditions exist for $$X_{j}(t)$$:
1. 1.
Initial condition: Since node j has no links when first created at time j,
$$X_{j}(j) = 0$$.
2. 2.
Expected change to $$X_{j}$$ over time: At time step $$t+1$$,
node j gains an in-link if and only if a link from the newly created node $$t+1$$ points to it. From step (2a) of the model, node $$t+1$$
links to j with probability 1 / t. From step (2b) of the model, node
$$t+1$$ links to j with probability $$X_{j}(t)/t$$ since at the
moment node $$t+1$$ was created, the total number of links in the
network was t and of these $$X_{j}(t)$$ point to j. Therefore, the
overall probability that node $$t+1$$ links to node j is as given in
Eq. 11.4.
$$\begin{aligned} \frac{p}{t} + \frac{(1-p)X_{j}(t)}{t}
\end{aligned}$$
(11.4)
However, this deals with the probabilistic case. For the deterministic case, we
define the time as continuously running from 0 to N instead of the
probabilistic case considered in the model. The function of time
$$X_{j}(t)$$ in the discrete case is approximated in the continuous case as
$$x_{j}(t)$$. The two properties of $$x_{j}(t)$$ are:
1. Initial condition: Just as in the probabilistic case, since node j has no links when first created at time j, $$x_{j}(j) = 0$$.
2. Growth equation: In the probabilistic case, when node $$t+1$$ arrives, the number of in-links to j increases with probability given by Eq. 11.4. In the deterministic approximation provided by $$x_{j}$$, the rate of growth is modeled by the differential equation given in Eq. 11.5.
$$\begin{aligned} \frac{dx_{j}}{dt} = \frac{p}{t} + \frac{(1-p)x_{j}}{t} \end{aligned}$$
(11.5)
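Equation 11.5 can be solved by a standard separation-of-variables step (not spelled out in the text): writing it as $$\frac{dx_{j}}{p + (1-p)x_{j}} = \frac{dt}{t}$$ and integrating gives $$\frac{1}{1-p}\ln \left( p+(1-p)x_{j}\right) = \ln t + c$$ ; applying the initial condition $$x_{j}(j) = 0$$ to fix the constant c yields
$$\begin{aligned} x_{j}(t) = \frac{p}{1-p}\left[ \left( \frac{t}{j}\right) ^{1-p} - 1\right] \end{aligned}$$
so in this approximation the number of in-links of node j grows as a power of t with exponent $$1-p$$.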
In the case where each event in a stream is either relevant or irrelevant , this
two-state model can be extended to generate events with a particular mix of
these relevant and irrelevant events according to a binomial distribution. A
sequence of events is considered bursty if the fraction of relevant events
alternates between periods in which it is large and long periods in which it is
small. Associating a weight with each burst solves the problem of enumerating all the bursts in order of weight.
The paper uses a graph object called the time graph . A time graph
$$G = (V,E)$$ consists of
1. 1.
A set V of vertices where each vertex $$v \in V$$ has an associated
interval D(v) on the time axis called the duration of v.
2. 2.
A set E of edges where each edge $$e \in E$$ is a triplet (u, v, t)
where u and v are vertices in V and t is a point in time in the interval
$$D(u) \cap D(v)$$.
A vertex v is said to be alive at time t if $$t \in D(v)$$. This means that
each edge is created at a point in time at which its two endpoints are alive.
1. 1.
Identify dense subgraphs in the time graph of the Blogspace which will
correspond to all potential communities. However, this will result in
finding all the clusters regardless of whether or not they are bursty.
2. 2.
The time graph was generated for blogs from seven blog sites. The resulting
graph consisted of 22299 vertices, 70472 unique edges and 777653 edges
counting multiplicity. Applying the steps to this graph, the burstiness of the
graph is as shown in Fig. 11.3.
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig3_HTML.
gif
Fig. 11.3
Twelve different datasets from seven different sources were considered for
the study. These included HEP-PH and HEP-TH arXiv citation graphs, a
citation graph for U.S. utility patents, a graph of routers comprising the
Internet, five bipartite affiliation graphs of authors with papers they authored
for ASTRO-PH, HEP-TH, HEP-PH, COND-MAT and GR-QC, a bipartite
graph of actors-to-movies corresponding to IMDB, a person to person
recommendation graph, and an email communication network from an
European research institution.
Figure 11.4 depicts the plot of average out-degree over time. We observe an
increase indicating that graphs become dense. Figure 11.5 illustrates the log-
log plot of the number of edges as a function of the number of vertices. They
all obey the densification power law. This could mean that densification of
graphs is an intrinsic phenomenon.
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig4_HTML.
gif
Fig. 11.4
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig5_HTML.
gif
Fig. 11.5
In this paper, for every $$d \in \mathbb {N}$$, g(d) denotes the fraction of connected node pairs whose shortest connecting path has length at most d. The effective diameter of the network is defined as the value of d at which this function g(d) achieves the value 0.9. So, if D is a value where $$g(D) = 0.9$$, then the graph has effective diameter D. Figure 11.6 shows
the effective diameter over time. A decrease in diameter can be observed
from the plots. Since all of these plots exhibited a decrease in the diameter, it
could be that the shrinkage was an inherent property of networks.
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig6_HTML.
gif
Fig. 11.6
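For a graph small enough to enumerate all shortest paths, g(d) and the effective diameter can be sketched with networkx as follows (the large graphs studied in the paper require approximate methods instead):

import math
import networkx as nx

def effective_diameter(G, q=0.9):
    """Smallest d such that a fraction q of connected node pairs lie within d hops."""
    lengths = []
    for _, dists in nx.all_pairs_shortest_path_length(G):
        lengths.extend(d for d in dists.values() if d > 0)   # connected, ordered pairs
    lengths.sort()
    idx = math.ceil(q * len(lengths)) - 1                    # index where g(d) first reaches q
    return lengths[idx]

# Example on a small random graph:
# print(effective_diameter(nx.erdos_renyi_graph(500, 0.02, seed=1)))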
To verify that the shrinkage of diameters was not intrinsic to the datasets,
experiments were performed to account for:
1. 1.
3. 3.
Effects of missing past: With almost any dataset, one does not have data
reaching all the way back to the network’s birth. This is referred to as
the problem of missing past . This means that there will be edges
pointing to nodes prior to the beginning of the observation period. Such
nodes and edges are referred to as phantom nodes and phantom edges
respectively.
2. b.
3. c.
4. 4.
Within a few years, the giant component accounts for almost all the
nodes in the graph. The effective diameter, however, continues to
steadily decrease beyond this point. This indicates that the decrease is
happening in a mature graph and not because many small disconnected
components are being rapidly glued together.
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig7_HTML.
gif
Fig. 11.7
The models studied so far do not exhibit this densification and diameter
shrinkage. The paper proposes the following models which can achieve these
properties. The first model is the Community Guided Attachment where the
idea is that if the nodes of a graph belong to communities-within-
communities [6], and if the cost for cross-community edges is scale-free,
then the densification power-law follows naturally. Also, this model exhibits
a heavy-tailed degree distribution. However, this model cannot capture the
shrinking effective diameters. To capture all of these, the Forest Fire model
was proposed. In this model, nodes arrive over time. Each node has a
center-of-gravity in some part of the network and its probability of linking to
other nodes decreases rapidly with their distance from this center-of-gravity.
However, occasionally a new node will produce a very large number of
outlinks. Such nodes will help cause a more skewed out-degree distribution.
These nodes will serve as bridges that connect formerly disparate parts of the
network, bringing the diameter down.
Formally, the forest fire model has two parameters, a forward burning
probability p , and a backward burning probability r . Consider a node v
joining the network at time $$t > 1$$, and let $$G_{t}$$ be the graph
constructed thus far. Node v forms outlinks to nodes in $$G_{t}$$
according to the following process:
1. v first chooses an ambassador node w uniformly at random, and forms a link to w.
2. v generates two random numbers, x and y, that are geometrically distributed with means $$p/(1-p)$$ and $$rp/(1-rp)$$ respectively. Then v chooses x out-links and y in-links of w incident to nodes that have not yet been visited, and forms out-links to the corresponding endpoints $$w_{1}, \ldots , w_{x+y}$$.
3. v then applies step 2 recursively to each of $$w_{1}, \ldots , w_{x+y}$$, with nodes never being visited a second time.
Thus, the burning of links in the Forest Fire model begins at w, spreads to
$$w_{1}$$, $$\ldots $$, $$w_{x+y}$$, and proceeds recursively until
it dies out. The model can be extended to account for isolated vertices and
vertices with large degree by having newcomers choose no ambassadors in
the former case (called orphans ) and multiple ambassadors in the latter case
(simply called multiple ambassadors ). Orphans and multiple ambassadors
help further separate the diameter decrease/increase boundary from the
densification transition and so widen the region of parameter space for which
the model produces reasonably sparse graphs with decreasing effective
diameters.
There are three core processes to the models: (i) Node arrival process : This
governs the arrival of new nodes into the network, (ii) Edge initiation process
: Determines for each node when it will initiate a new edge, and (iii) Edge
destination selection process : Determines the destination of a newly initiated
edge.
Let $$G_{t}$$ denote a network composed from the earliest t edges,
$$e_{1}$$, $$\ldots $$, $$e_{t}$$ for $$t \in \{1, \ldots , |E|\}$$. Let
$$t_{e}$$ be the time when edge e is created, let t(u) be the time when the
node u joined the network, and let $$t_{k}(u)$$ be the time when the
$$k^{th}$$ edge of the node u is created. Then $$a_{t}(u) = t - t(u)$$
denotes the age of the node u at time t. Let $$d_{t}(u)$$ denote the degree
of the node u at time t and $$d(u) = d_{T}(u)$$. $$[\cdot ]$$ denotes a
predicate (takes value of 1 if expression is true, else 0).
The maximum likelihood estimation (MLE) was applied to pick the best
model in the following manner: the network is evolved in an edge by edge
manner, and for every edge that arrives into this network, the likelihood that
the particular edge endpoints would be chosen under some model is
measured. The product of these likelihoods over all edges will give the
likelihood of the model. A higher likelihood means a better model in the
sense that it offers a more likely explanation of the observed data.
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig8_HTML.
gif
Fig. 11.8
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig9_HTML.
gif
Fig. 11.9
Using the MLE principle, the combined effect of node age and degree was
studied by considering the following four parameterised models for choosing
the edge endpoints at time t.
Figure 11.10 plots the log-likelihood under the different models as a function
of $$\tau $$. The red curve plots the log-likelihood of selecting a source
node and the green curve for selecting the destination node of an edge.
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig10_HTML.
gif
Fig. 11.10
Selecting an edge’s destination node is harder than selecting its source (the
green curve is usually below the red). Also, selecting a destination appears
more random than selecting a source, the maximum likelihood $$\tau $$ of
the destination node (green curve) for models D and DR is shifted to the left
when compared to the source node (red), which means the degree bias is
weaker. Similarly, there is a stronger bias towards young nodes in selecting
an edge’s source than in selecting its destination. Based on the observations,
model D performs reasonably well compared to more sophisticated variants
based on degree and age.
Even though the analysis suggests that model D is a reasonable model for
edge destination selection, it is inherently “non-local” in that edges are no
more likely to form between nodes which already have friends in common. A
detailed study of the locality properties of edge destination selection is
required.
Consider the following notion of edge locality: for each new edge (u, w), the number of hops it spans is measured, i.e., the length of the shortest path between nodes u and w immediately before the edge was created.
Figure 11.11 shows the distribution of these shortest path values induced by
each new edge for $$G_{np}$$ (with $$p = 12/n$$), PA, and the four
social networks. (The isolated dot on the left counts the number of edges that
connected previously disconnected components of the network). For
$$G_{np}$$ most new edges span nodes that were originally six hops
away, and then the number decays polynomially in the hops. In the PA
model, we see a lot of long-range edges; most of them span four hops but
none spans more than seven. The hop distributions corresponding to the four
real-world networks look similar to one another, and strikingly different from
both $$G_{np}$$ and PA. The number of edges decays exponentially with
the hop distance between the nodes, meaning that most edges are created
locally between nodes that are close. The exponential decay suggests that the
creation of a large fraction of edges can be attributed to locality in the
network structure, namely most of the times people who are close in the
network (e.g., have a common friend) become friends themselves. These results involve counting the number of edges that link nodes a certain distance away. In a sense, this overcounts edges (u, w) for which u and w are far apart, since there are many more distant candidates to choose from: the number of long-range edges decays exponentially while the number of long-range candidates grows exponentially. To explore this phenomenon, the number of hops each new edge spans is counted but then normalized by the total number of nodes at h hops, i.e., we compute
$$\begin{aligned} p_{e}(h) = \frac{\sum \limits _{t}[e_{t}\ connects\ nodes\ at\ distance\ h\ in\ G_{t-1}]}{\sum \limits _{t}(\#\ nodes\ at\ distance\ h\ from\ the\ source\ node\ of\ e_{t})} \end{aligned}$$
(11.12)
First, Fig. 11.12a, b show the results for $$G_{np}$$ and PA models.
(Again, the isolated dot at $$h = 0$$ plots the probability of a new edge
connecting disconnected components.) In $$G_{np}$$, edges are created
uniformly at random, and so the probability of linking is independent of the
number of hops between the nodes. In PA, due to degree correlations short
(local) edges prevail. However, a non-trivial amount of probability goes to
edges that span more than two hops. Figure 11.12c–f show the plots for the
four networks. Notice the probability of linking to a node h hops away
decays double-exponentially, i.e., $$p_{e}(h) \propto exp(exp(-h))$$, since the number of nodes at h hops increases exponentially with h. This behaviour
is drastically different from both the PA and $$G_{np}$$ models. Also
note that almost all of the probability mass is on edges that close length-two
paths. This means that edges are most likely to close triangles, i.e, connect
people with common friends.
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig11_HTML.
gif
Fig. 11.11
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig12_HTML.
gif
Fig. 11.12
Here, we assume that the sequence and timing of node arrivals is given, and
we model the process by which nodes initiate edges. We begin by studying
how long a node remains active in the social network, and then during this
active lifetime, we study the specific times at which the node initiates new
edges.
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig13_HTML.
gif
Fig. 11.13
Now that we have a model for the lifetime of a node u, we must model that
amount of elapsed time between edge initiations from u. Let
$$\delta _{u}(d) = t_{d+1}(u) - t_{d}(u)$$ be the time it takes for the node
u with current degree d to create its $$(d+1)$$-st out-edge; we call
$$\delta _{u}(d)$$ the edge gap. Again, we examine several candidate
distributions to model edge gaps. The best likelihood is provided by a power
law with exponential cut-off:
$$p_{g}(\delta (d);\alpha ,\beta ) \propto \delta (d)^{-\alpha } exp(-\beta \delta (d))$$, where d is the current degree of the node. These results are confirmed in
Fig. 11.14, in which we plot the MLE estimates of the gap distribution $$\delta (1)$$, i.e., the distribution of times that it took a node of degree 1 to add its second edge. We find that all gap distributions $$\delta (d)$$ are best modelled by a power law with exponential cut-off. For each
$$\delta (d)$$ we fit a separate distribution and Fig. 11.15 shows the
evolution of the parameters $$\alpha $$ and $$\beta $$ of the gap
distribution, as a function of the degree d of the node. Interestingly, the
power law exponent $$\alpha (d)$$ remains constant as a function of d, at
almost the same value for all four networks. On the other hand, the
exponential cutoff parameter $$\beta (d)$$ increases linearly with d, and
varies by an order of magnitude across networks; this variation models the
extent to which the “rich get richer” phenomenon manifests in each network.
This means that the slope $$\alpha $$ of power-law part remains constant,
only the exponential cutoff part (parameter $$\beta $$) starts to kick in
sooner and sooner. So, nodes add their $$(d + 1)$$-st edge faster than their
dth edge, i.e, nodes start to create more and more edges (sleeping times get
shorter) as they get older (and have higher degree). So, based on Fig. 11.15,
the overall gap distribution can be modelled by $$p_{g}(\delta |d;\alpha ,\beta ) \propto \delta ^{-\alpha } exp(-\beta d \delta )$$.
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig14_HTML.
gif
Fig. 11.14
Fig. 11.15
Figure 11.16 shows the number of users in each of our networks over time. FLICKR grows exponentially over much of the observation period, while the growth of the other networks is much slower. DELICIOUS grows slightly super-linearly, LINKEDIN quadratically, and YAHOO! ANSWERS sub-linearly. Given these wild variations we conclude that the node arrival process needs to be specified in advance, as it varies greatly across networks due to external factors.
../images/462433_1_En_11_Chapter/462433_1_En_11_Fig16_HTML.
gif
Fig. 11.16
Putting the pieces together, the complete network evolution model proceeds as follows:
1. Nodes arrive into the network according to the node arrival function, which is specified in advance.
2. An arriving node u samples its lifetime from the node lifetime distribution.
3. Node u adds the first edge to node v with probability proportional to its degree.
4. A node u with current degree d samples a time gap $$\delta $$ from the gap distribution $$p_{g}(\delta |d;\alpha ,\beta )$$ and goes to sleep for $$\delta $$ time steps.
5. When a node wakes up, if its lifetime has not expired yet, it creates a two-hop edge using the random-random triangle-closing model.
6. If a node’s lifetime has expired, then it stops adding edges; otherwise it repeats from step 4.
Problems
Given that the probability density function (PDF) of a power-law distribution
is given by Eq. 11.13.
$$\begin{aligned} P(X=x) = \frac{\alpha -1}{x_{min}}\left( \frac{x}{x_{min}}\right) ^{-\alpha } \end{aligned}$$
(11.13)
where $$x_{min}$$ is the minimum value that X can take.
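Two small helpers that follow directly from Eq. 11.13 (the continuous maximum-likelihood estimator of the exponent and an inverse-transform sampler), which may be useful for the exercises below:

import math
import random

def alpha_mle(xs, x_min):
    """Continuous MLE of the power-law exponent, using only samples xs >= x_min."""
    tail = [x for x in xs if x >= x_min]
    return 1 + len(tail) / sum(math.log(x / x_min) for x in tail)

def sample_power_law(x_min, alpha, n, rng=random):
    """Draw n samples from the PDF of Eq. 11.13 by inverse-transform sampling."""
    return [x_min * (1 - rng.random()) ** (-1 / (alpha - 1)) for _ in range(n)]

# Round trip: sample with alpha = 2.5 and recover the exponent from the sample.
xs = sample_power_law(1.0, 2.5, 100_000)
print(alpha_mle(xs, 1.0))   # approximately 2.5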
60
61
62
References
1. Aiello, William, Fan Chung, and Linyuan Lu. 2002. Random evolution in massive graphs. In Handbook of massive data sets, 97–122. Berlin: Springer.
2. Barabási, Albert-László, and Réka Albert. 1999. Emergence of scaling in random networks. Science 286 (5439): 509–512.
3. Bollobás, Béla, Christian Borgs, Jennifer Chayes, and Oliver Riordan. 2003. Directed scale-free graphs. In Proceedings of the fourteenth annual ACM-SIAM symposium on discrete algorithms, 132–139. Society for Industrial and Applied Mathematics.
4. Brookes, Bertram C. 1969. Bradford’s law and the bibliography of science. Nature 224 (5223): 953.
5. Clauset, Aaron, Cosma Rohilla Shalizi, and Mark E.J. Newman. 2009. Power-law distributions in empirical data. SIAM Review 51 (4): 661–703.
6. Schroeder, Manfred. 1991. Fractals, chaos, power laws: Minutes from an infinite paradise. W.H. Freeman.
7. Goel, Sharad, Andrei Broder, Evgeniy Gabrilovich, and Bo Pang. 2010. Anatomy of the long tail: Ordinary people with extraordinary tastes. In Proceedings of the third ACM international conference on Web search and data mining, 201–210. ACM.
8. Grünwald, Peter D. 2007. The minimum description length principle. Cambridge: MIT Press.
9. Kass, Robert E., and Adrian E. Raftery. 1995. Bayes factors. Journal of the American Statistical Association 90 (430): 773–795.
10. Kleinberg, Jon. 2003. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery 7 (4): 373–397.
11. Kumar, Ravi, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. 2005. On the bursty evolution of blogspace. World Wide Web 8 (2): 159–178.
12. Lattanzi, Silvio, and D. Sivakumar. 2009. Affiliation networks. In Proceedings of the forty-first annual ACM symposium on Theory of computing, 427–434. ACM.
13. Leskovec, Jure, Jon Kleinberg, and Christos Faloutsos. 2007. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD) 1 (1): 2.
14. Leskovec, Jure, Lars Backstrom, Ravi Kumar, and Andrew Tomkins. 2008. Microscopic evolution of social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 462–470. ACM.
15. Lotka, Alfred J. 1926. The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences 16 (12): 317–323.
16. Mandelbrot, Benoit. 1960. The Pareto-Levy law and the distribution of income. International Economic Review 1 (2): 79–106.
17. Stone, Mervyn. 1977. An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society. Series B (Methodological) 39: 44–47.
The graph models that we have learnt heretofore cater to specific network properties. We need a graph generator that can produce graphs exhibiting the long list of properties observed in real networks. Such generated graphs can be used for simulations, scenarios and extrapolation, and they draw a boundary over which properties it is realistic to focus on. Reference [4] proposed that such a realistic graph is the Kronecker graph, which is generated using the Kronecker product.
The idea behind these Kronecker graphs is to create self-similar graphs
recursively. Beginning with an initiator graph $$G_{1}$$, with
$$|V|_{1}$$ vertices and $$|E|_{1}$$ edges, produce
successively larger graphs $$G_{2}, \dots , G_{n}$$ such that the kth
graph $$G_{k}$$ is on $$|V|_{k} = |V|_{1}^{k}$$ vertices. To
exhibit the densification power-law, $$G_{k}$$ should have
$$|E|_{k} = |E|_{1}^{k}$$ edges. The Kronecker product of two
matrices generates this recursive self-similar graphs.
Given two matrices $$A = [a_{i,j}]$$ and B of sizes
$$n \times m$$ and $$n' \times m'$$ respectively, the
Kronecker product matrix C of size $$(n*n') \times (m*m')$$ is given
by Eq. 12.1.
$$\begin{aligned} C = A \otimes B = \left( \begin{matrix} a_{1,1}B &{} a_{1,2}B &{} \dots &{} a_{1,m}B \\ a_{2,1}B &{} a_{2,2}B &{} \dots &{} a_{2,m}B \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ a_{n,1}B &{} a_{n,2}B &{} \dots &{} a_{n,m}B \\ \end{matrix} \right) \end{aligned}$$
(12.1)
So, the Kronecker product of two graphs is the Kronecker product of their
adjacency matrices.
In a Kronecker graph, $$Edge(X_{ij},X_{kl}) \in G \otimes H$$
iff $$(X_{i},X_{k}) \in G$$ and $$(X_{j},X_{l}) \in H$$
where $$X_{ij}$$ and $$X_{kl}$$ are vertices in
$$G \otimes H$$, and $$X_{i}, X_{j}, X_{k}$$ and
$$X_{l}$$ are the corresponding vertices in G and H.
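A short sketch of building a deterministic Kronecker graph with NumPy (the initiator below is a 3-node path with self-loops, an illustrative choice rather than necessarily the exact initiator of Fig. 12.1):

import numpy as np
import networkx as nx

def kronecker_power(K1, k):
    """Adjacency matrix of the k-th Kronecker power of the initiator K1."""
    A = K1
    for _ in range(k - 1):
        A = np.kron(A, K1)          # Kronecker product of the adjacency matrices
    return A

K1 = np.array([[1, 1, 0],
               [1, 1, 1],
               [0, 1, 1]])          # 3-node path, with self-loops on the diagonal
A4 = kronecker_power(K1, 4)         # 3^4 = 81 vertices
G = nx.from_numpy_array(A4)
print(G.number_of_nodes(), G.number_of_edges())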
Figure 12.1 shows the recursive construction of $$G \otimes H$$,
when $$G = H$$ is a 3-node path.
../images/462433_1_En_12_Chapter/462433_1_En_12_Fig1_HTML.gif
Fig. 12.1 Top: a “3-chain” and its Kronecker product with itself; each of the $$X_{i}$$ nodes
gets expanded into 3 nodes, which are then linked. Bottom: the corresponding adjacency matrices,
along with the matrix for the fourth Kronecker power $$G_{4}$$
../images/462433_1_En_12_Chapter/462433_1_En_12_Fig2_HTML.gif
Fig. 12.2 CIT-HEP-TH: Patterns from the real graph (top row), the deterministic Kronecker graph
with $$K_{1}$$ being a star graph on 4 nodes (center + 3 satellites) (middle row), and the
Stochastic Kronecker graph ( $$\alpha = 0.41$$ , $$\beta = 0.11$$ - bottom row). Static
patterns: a is the PDF of degrees in the graph (log-log scale), and b the distribution of eigenvalues
(log-log scale). Temporal patterns: c gives the effective diameter over time (linear-linear scale), and d
is the number of edges versus number of nodes over time (log-log scale)
../images/462433_1_En_12_Chapter/462433_1_En_12_Fig3_HTML.gif
Fig. 12.3 AS-ROUTEVIEWS: Real (top) versus Kronecker (bottom). Columns a and b show the
degree distribution and the scree plot. Columns c shows the distribution of network values (principal
eigenvector components, sorted, versus rank) and d shows the hop-plot (the number of reachable
pairs g(h) within h hops or less, as a function of the number of hops h
12.3 KRONFIT
Reference [5] presented KRONFIT, a fast and scalable algorithm for fitting Kronecker graphs by using the maximum likelihood principle. A Metropolis sampling algorithm was developed for sampling node correspondences and approximating the likelihood, yielding a linear time algorithm for Kronecker graph model parameter estimation that scales to large networks with millions of nodes and edges.
../images/462433_1_En_12_Chapter/462433_1_En_12_Fig5_HTML.gif
Fig. 12.5 Schematic illustration of the multifractal graph generator. a The construction of the link
probability measure. Start from a symmetric generating measure on the unit square defined by a set
of probabilities $$p_{ij} = p_{ji}$$ associated to $$m \times m$$ rectangles (shown on the
left). Here $$m = 2$$ , the length of the intervals defining the rectangles is given by $$l_{1}$$
and $$l_{2}$$ respectively, and the magnitude of the probabilities is indicated by both the height
and the colour of the corresponding boxes. The generating measure is iterated by recursively
multiplying each box with the generating measure itself as shown in the middle and on the right,
yielding $$m^{k} \times m^{k}$$ boxes at iteration k. The variance of the height of the boxes
(corresponding to the probabilities associated to the rectangles) becomes larger at each step,
producing a surface which is getting rougher and rougher, meanwhile the symmetry and the self
similar nature of the multifractal is preserved. b Drawing linking probabilities from the obtained
measure. Assign random coordinates in the unit interval to the nodes in the graph, and link each node
pair I, J with a probability given by the probability measure at the corresponding coordinates
../images/462433_1_En_12_Chapter/462433_1_En_12_Fig6_HTML.gif
Fig. 12.6 A small network generated with the multifractal network generator. a The generating
measure (on the left) and the link probability measure (on the right). The generating measure consists
of $$3\times 3$$ rectangles for which the magnitude of the associated probabilities is indicated by
the colour. The number of iterations, k, is set to $$k = 3$$ , thus the final link probability measure
consists of $$27\times 27$$ boxes, as shown in the right panel. b A network with 500 nodes
generated from the link probability measure. The colours of the nodes were chosen as follows. Each
row in the final linking probability measure was assigned a different colour, and the nodes were
coloured according to their position in the link probability measure. (Thus, nodes falling into the
same row have the same colour)
12.4 KRONEM
Reference [3] addressed the network completion problem by using the
observed part of the network to fit a model of network structure, and then
estimating the missing part of the network using the model, re-estimating
the parameters and so on. This is combined with the Kronecker graphs
model to design a scalable Metropolized Gibbs sampling approach that
allows for the estimation of the model parameters as well as the inference
about missing nodes and edges of the network.
The problem of network completion is cast into the Expectation
Maximisation (EM) framework and the KRONEM algorithm is developed
that alternates between the following two stages. First, the observed part of
the network is used to estimate the parameters of the network model. This
estimated model then gives us a way to infer the missing part of the
network. Now, we act as if the complete network is visible and we re-
estimate the model. This in turn gives us a better way to infer the missing
part of the network. We iterate between the model estimation step (the M-
step) and the inference of the hidden part of the network (the E-step) until
the model parameters converge.
The advantages of KronEM are the following: It requires a small
number of parameters and thus does not overfit the network. It infers not
only the model parameters but also the mapping between the nodes of the
true and the estimated networks. The approach can be directly applied to
cases when collected network data is incomplete. It provides an accurate
probabilistic prior over the missing network structure and easily scales to
large networks.
2. Chakrabarti, Deepayan, Yiping Zhan, and Christos Faloutsos. 2004. R-mat: A recursive model for
graph mining. In Proceedings of the 2004 SIAM international conference on data mining, 442–
446. SIAM.
3. Kim, Myunghwan, and Jure Leskovec. 2011. The network completion problem: Inferring missing
nodes and edges in networks. In Proceedings of the 2011 SIAM International Conference on Data
Mining, 47–58. SIAM.
4. Leskovec, Jurij, Deepayan Chakrabarti, Jon Kleinberg, and Christos Faloutsos. 2005. Realistic,
mathematically tractable graph generation and evolution, using Kronecker multiplication. In
European conference on principles of data mining and knowledge discovery, 133–145. Berlin:
Springer.
5. Leskovec, Jure, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin
Ghahramani. 2010. Kronecker graphs: An approach to modeling networks. Journal of Machine
Learning Research 11 (Feb): 985–1042.
6. Mahdian Mohammad, and Ying Xu. 2007. Stochastic Kronecker graphs. In International
workshop on algorithms and models for the web-graph, 179–186. Berlin: Springer.
7. Palla, Gergely, László Lovász, and Tamás Vicsek. 2010. Multifractal network generator.
Proceedings of the National Academy of Sciences 107 (17): 7640–7645.
8. Seshadhri, Comandur, Ali Pinar, and Tamara G. Kolda. 2013. An in-depth analysis of stochastic
Kronecker graphs. Journal of the ACM (JACM) 60 (2): 13.
../images/462433_1_En_13_Chapter/462433_1_En_13_Fig1_HTML.gif
Fig. 13.1 Typical search engine architecture
13.1.1 Crawling
The crawler module starts with an initial set of URLs $$S_{0}$$ which
are initially kept in a priority queue. From the queue, the crawler gets a
URL, downloads the page, extracts any URLs in the downloaded page, and
puts the new URLs in the queue. This process repeats until the crawl
control module asks the crawler to stop.
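A toy version of this crawl loop using only the standard library (with a plain FIFO queue in place of the priority queue, and none of the politeness, robots.txt or crawl-control logic a real crawler needs):

import re
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl starting from the seed set S0; returns url -> page text."""
    queue, seen, pages = deque(seed_urls), set(seed_urls), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                            # skip pages that fail to download
        pages[url] = html                       # hand the page over to the storage module
        # Extract absolute links and enqueue the ones not seen before.
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages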
13.1.2 Storage
The storage repository of a search engine must perform two basic functions.
First, it must provide an interface for the crawler to store pages. Second, it
must provide an efficient access API that the indexer module and the
collection analysis module can use to retrieve pages.
The issues a repository must deal with are as follows: First, it must be scalable, i.e., it must be capable of being distributed across a cluster of systems. Second, it must be capable of supporting both random access and streaming access equally efficiently: random access for quickly retrieving a specific Web page, given the page’s unique identifier, to serve out cached copies to the end-user; streaming access for receiving the entire collection as a stream of pages, to provide pages in bulk for the indexer and analysis module to process and analyse. Third, since the Web changes rapidly, the
repository needs to handle a high rate of modifications since pages have to
be refreshed periodically. Lastly, the repository must have a mechanism of
detecting and removing obsolete pages that have been removed from the
Web.
A distributed Web repository that is designed to function over a cluster
of interconnected storage nodes must deal with the following issues that
affect the characteristics and performance of the repository. A detailed explanation of these issues is found in [15].
13.1.3 Indexing
The indexer and the collection analysis modules are responsible for
generating the text, structure and the utility indexes. Here, we describe the
indexes.
13.1.4 Ranking
The query engine collects search terms from the user and retrieves pages
that are likely to be relevant. These pages cannot be returned to the user in
this format and must be ranked.
However, the problem of ranking is faced with the following issues.
First, before the task of ranking pages, one must first retrieve the pages that
are relevant to a search. This information retrieval suffers from problems of
synonymy (multiple terms that more or less mean the same thing) and
polysemy (multiple meanings of the same term: by the term “jaguar”, do
you mean the animal, the automobile company, the American football team,
or the operating system?).
Second, the time of retrieval plays a pivotal role. For instance, in the
case of an event such as a calamity, government and news sites will update
their pages as and when they receive information. The search engine not
only has to retrieve the pages repeatedly, but will have to re-rank the pages
depending on which page currently has the latest report.
Third, with everyone capable of writing a Web page, the Web has an
abundance of information on any topic. For example, the term “social
network analysis” returns around 56 million results. Now, the task is to
rank these results in a manner such that the most important ones appear
first.
13.1.4.1 HITS Algorithm
When you search for the term “msrit”, what makes www.msrit.edu a good
answer? The idea is that the page www.msrit.edu is not going to use the
term “msrit” more frequently or prominently than other pages. Therefore,
there is nothing in the page that makes it stand out in particular. Rather, it
stands out because of features on other Web pages: when a page is relevant
to “msrit”, very often www.msrit.edu is among the pages it links to.
One approach is to first retrieve a large collection of Web pages that are
relevant to the query “msrit” using traditional information retrieval. Then,
a vote is taken: each page is scored by the number of in-links it receives
from pages that are relevant to “msrit”. In this manner, there will ultimately
be a page that is ranked first.
Consider the search for the term “newspapers”. Unlike the case of
“msrit”, there is not necessarily a single good answer for the term
“newspapers”. If we try the traditional information retrieval
approach, then we will get a set of pages variably pointing to several of the
news sites.
Now we attempt to tackle the problem from another direction. Since
finding the best answer for a query depends on the pages that are retrieved,
finding good retrieved pages will automatically result in finding good
answers. Figure 13.2 shows a set of retrieved pages pointing to newspapers.
../images/462433_1_En_13_Chapter/462433_1_En_13_Fig2_HTML.gif
Fig. 13.2 Counting in-links to pages for the query “newspapers”
We observe from Fig. 13.2 that among the sites casting votes, a few of
them voted for many of the pages that received a lot of votes. Therefore, we
could say that these pages have a sense of where the good answers are, and we
should score them highly as lists. Thus, a page’s value as a list is equal to the sum
of the votes received by all pages that it voted for. Figure 13.3 depicts the
result of applying this rule to the pages casting votes.
../images/462433_1_En_13_Chapter/462433_1_En_13_Fig3_HTML.gif
Fig. 13.3 Finding good lists for the query “newspapers”: each page’s value as a list is written as a
number inside it
If pages scoring well as lists are believed to actually have a better sense
for where the good results are, then we should weigh their votes more
heavily. So, in particular, we could tabulate the votes again, but this time
giving each page’s vote a weight equal to its value as a list. Figure 13.4
illustrates what happens when weights are accounted for in the newspaper
case.
../images/462433_1_En_13_Chapter/462433_1_En_13_Fig4_HTML.gif
Fig. 13.4 Re-weighting votes for the query “newspapers”: each of the labelled page’s new score is
equal to the sum of the values of all lists that point to it
../images/462433_1_En_13_Chapter/462433_1_En_13_Fig5_HTML.gif
Fig. 13.5 Re-weighting votes after normalizing for the query “newspapers”
../images/462433_1_En_13_Chapter/462433_1_En_13_Fig6_HTML.gif
Fig. 13.6 Limiting hub and authority values for the query “newspapers”
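This repeated re-weighting is the hub-authority computation of the HITS algorithm: list values play the role of hub scores and vote counts play the role of authority scores, and with the normalization of Fig. 13.5 the scores settle to the limiting values of Fig. 13.6. A minimal sketch of the iteration in Python follows; the adjacency-list dictionary, the fixed iteration count and the variable names are illustrative choices, not the book's own implementation.

def hits(out_links, iterations=50):
    """Iteratively compute hub and authority scores for a directed graph.

    out_links maps each page to the list of pages it links to.
    """
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: a page's authority is the sum of the hub scores
        # (list values) of the pages that vote for it.
        auth = {p: 0.0 for p in pages}
        for p, targets in out_links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: a page's hub score is the sum of the authority scores
        # of the pages it votes for.
        hub = {p: sum(auth[q] for q in out_links.get(p, ())) for p in pages}
        # Normalization, as in Fig. 13.5, so the scores stay bounded.
        auth_total = sum(auth.values()) or 1.0
        hub_total = sum(hub.values()) or 1.0
        auth = {p: v / auth_total for p, v in auth.items()}
        hub = {p: v / hub_total for p, v in hub.items()}
    return hub, auth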
../images/462433_1_En_13_Chapter/462433_1_En_13_Fig7_HTML.gif
Fig. 13.7 A collection of eight web pages
Step  A      B      C      D      E      F      G      H
1     1/2    1/16   1/16   1/16   1/16   1/16   1/16   1/8
2     3/16   1/4    1/4    1/32   1/32   1/32   1/32   1/16
This scaling factor makes the PageRank measure less sensitive to the
addition or deletion of a small number of nodes or edges.
A survey of the PageRank algorithm and its developments can be found
in [4].
We will first analyse the Basic PageRank Update rule and then move on
to the scaled version. Under the basic rule, each node takes its current
PageRank and divides it equally over all the nodes it points to. This
suggests that the “flow” of PageRank specified by the update rule can be
naturally represented using a matrix N. Let $$N_{ij}$$ be the share of
i’s PageRank that j should get in one update step. $$N_{ij} = 0$$ if i
doesn’t link to j, and when i links to j, then $$N_{ij} = 1/l_{i}$$,
where $$l_{i}$$ is the number of links out of i. (If i has no outgoing
links, then we define $$N_{ii} = 1$$, in keeping with the rule that a
node with no outgoing links passes all its PageRank to itself.)
We represent the PageRank of all the nodes using a vector r, where
$$r_{i}$$ is the PageRank of node i. In this manner, the Basic
PageRank Update rule is
$$\begin{aligned} r \leftarrow N^{T} \cdot r \end{aligned}$$ (13.22)
We can similarly represent the Scaled PageRank Update rule using the
matrix $$\hat{N}$$ to denote the different flow of PageRank. To
account for the scaling, we define $$\hat{N_{ij}}$$ to be
$$sN_{ij}+(1-s)/n$$. This gives the scaled update rule as
$$\begin{aligned} r \leftarrow \hat{N^{T}} \cdot r \end{aligned}$$ (13.23)
Starting from an initial PageRank vector $$r^{[0]}$$, a sequence of
vectors $$r^{[1]},r^{[2]},\dots $$ are obtained from repeated
improvement by multiplying the previous vector by $$\hat{N^{T}}$$. This gives us
$$\begin{aligned} r^{[k]} = (\hat{N^{T}})^{k}r^{[0]} \end{aligned}$$ (13.24)
This means that if the Scaled PageRank Update rule converges to a limiting
vector $$r^{[*]}$$, this limit would satisfy
$$\hat{N^{T}}r^{[*]} = r^{[*]}$$. This is proved using Perron’s
Theorem [18].
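A minimal sketch of this repeated improvement in Python with numpy is given below; the adjacency-list dictionary, the node ordering and the choice s = 0.85 are illustrative, and the loop simply applies Eq. 13.23 k times, as in Eq. 13.24.

import numpy as np

def scaled_pagerank(out_links, nodes, s=0.85, k=100):
    """Apply the Scaled PageRank Update rule (Eq. 13.23) k times (Eq. 13.24)."""
    n = len(nodes)
    index = {v: i for i, v in enumerate(nodes)}
    N = np.zeros((n, n))
    for u, targets in out_links.items():
        if targets:
            for v in targets:
                N[index[u], index[v]] = 1.0 / len(targets)
        else:
            N[index[u], index[u]] = 1.0   # a node with no out-links keeps its PageRank
    N_hat = s * N + (1 - s) / n           # the scaled flow matrix
    r = np.full(n, 1.0 / n)               # initial PageRank vector r^[0]
    for _ in range(k):
        r = N_hat.T @ r
    return dict(zip(nodes, r))

# Illustrative two-node example: A and B point to each other.
print(scaled_pagerank({"A": ["B"], "B": ["A"]}, ["A", "B"]))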
../images/462433_1_En_13_Chapter/462433_1_En_13_Fig8_HTML.gif
Fig. 13.8 Equilibrium PageRank values for the network in Fig. 13.7
../images/462433_1_En_13_Chapter/462433_1_En_13_Fig9_HTML.gif
Fig. 13.9 The same collection of eight pages, but F and G have changed their links to point to each
other instead of to A. Without a smoothing effect, all the PageRank would go to F and G
Proof
If $$b_{1},b_{2},\dots ,b_{n}$$ denote the probabilities of the walk
being at nodes $$1,2,\dots ,n$$ respectively in a given step, then the
probability it will be at node i in the next step is computed as follows:
1.
For each node j that links to i, if we are given that the walk is currently
at node j, then there is a $$1/l_{j}$$ chance that it moves from j to i
in the next step, where $$l_{j}$$ is the number of links out of j.
2.
The walk has to actually be at node j for this to happen, so node j
contributes $$b_{j}(1/l_{j})=b_{j}/l_{j}$$ to the probability of
being at i in the next step.
3.
Therefore, summing $$b_{j}/l_{j}$$ over all nodes j that link to i
gives the probability that the walk is at node i in the next step.
So the overall probability that the walk is at i in the next step is the sum of
$$b_{j}/l_{j}$$ over all nodes that link to i.
If we represent the probabilities of being at different nodes using a
vector b, where the coordinate $$b_{i}$$ is the probability of being at
node i, then this update rule can be written using matrix-vector
multiplication as
$$\begin{aligned} b \leftarrow N^{T} \cdot b \end{aligned}$$ (13.25)
This is exactly the same as Eq. 13.22. Since both PageRank values and
random-walk probabilities start out the same (they are initially 1 / n for all
nodes), and they then evolve according to exactly the same rule, so they
remain the same forever. This justifies the claim.
Proof
We go by the same lines as the proof of Remark 1. If
$$b_{1}, b_{2}, \dots , b_{n}$$ denote the probabilities of the walk
being at nodes $$1, 2, \dots , n$$ respectively in a given step, then the
probability it will be at node i in the next step, is the sum of
$$sb_{j}/l_{j}$$, over all nodes j that link to i, plus $$(1 - s)/n$$. If we
use the matrix $$\hat{N}$$, then we can write the probability update as
$$\begin{aligned} b \leftarrow \hat{N^{T}}b \end{aligned}$$ (13.26)
This is the same as the update rule from Eq. 13.23 for the scaled PageRank
values. The random-walk probabilities and the scaled PageRank values start
at the same initial values, and then evolve according to the same update, so
they remain the same forever. This justifies the argument.
A problem with both PageRank and HITS is topic drift . Because they give
the same weights to all edges, the pages with the most in-links in the
network being considered tend to dominate, whether or not they are most
relevant to the query. References [8] and [5] propose heuristic methods for
differently weighting links. Reference [20] biased PageRank towards pages
containing a specific word, and [14] proposed applying an optimized
version of PageRank to the subset of pages containing the query terms.
13.2 Google
Reference [7] describes Google , a prototype of a large-scale search engine
which makes use of the structure present in hypertext. This prototype forms
the base of the Google search engine we know today.
Figure 13.10 illustrates a high level view of how the whole system
works. The paper states that most of Google was implemented in C or C++
for efficiency and was available to be run on either Solaris or Linux.
../images/462433_1_En_13_Chapter/462433_1_En_13_Fig10_HTML.gif
Fig. 13.10 High level Google architecture
The URL Server plays the role of the crawl control module here by
sending the list of URLs to be fetched by the crawlers. The Store Server
receives the fetched pages and compresses them before storing them in the
repository. Every Webpage has an associated ID number called a docID
which is assigned whenever a new URL is parsed out of a Webpage. The
Indexer module reads the repository, uncompresses the documents and
parses them. Each of these documents is converted to a set of word
occurrences called hits. Each hit records the word, its position in the document,
an approximation of the font size, and capitalization. These hits are then
distributed into a set of Barrels, creating a partially sorted forward index.
The indexer also parses out all the links in every Webpage and stores
important information about them in anchor files placed in Anchors . These
anchor files contain enough information to determine where each link
points from and to, and the text of the link.
The URL Resolver reads these anchor files and converts relative URLs
into absolute URLs and in turn into docIDs. These anchor texts are put into
the forward index, associated with the docID that the anchor points to. It
also generates a database of links which are pairs of docIDs. The Links
database is used to compute PageRanks for all the documents. The Sorter
takes the barrels, which are sorted by docID, and resorts them by wordID to
generate the inverted index. The sorter also produces a list of wordIDs and
offsets into the inverted index. A program called DumpLexicon takes this
list together with the lexicon produced by the indexer and generates a new
lexicon to be used by the searcher. The Searcher is run by a Web server and
uses the lexicon built by DumpLexicon together with the inverted index and
the PageRanks to answer queries.
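The forward and inverted indexes can be illustrated with a small sketch; the toy documents and the whitespace tokenization below are placeholders for the repository contents and the parser, and position is the only hit attribute kept.

from collections import defaultdict

documents = {0: "pagerank ranks web pages",
             1: "hits ranks hubs and authorities",
             2: "web crawlers fetch web pages"}

# Forward index: docID -> list of (word, position) hits.
forward_index = {doc_id: [(word, pos) for pos, word in enumerate(text.split())]
                 for doc_id, text in documents.items()}

# Inverted index: word -> postings list of (docID, position), which is what
# the Sorter produces by regrouping the barrels by word.
inverted_index = defaultdict(list)
for doc_id, hits in forward_index.items():
    for word, pos in hits:
        inverted_index[word].append((doc_id, pos))

print(inverted_index["web"])   # [(0, 2), (2, 0), (2, 3)]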
13.2.2 Crawling
Crawling is the most fragile application since it involves interacting with
hundreds of thousands of Web servers and various name servers which are
all beyond the control of the system. In order to scale to hundreds of
millions of Webpages, Google has a fast distributed crawling system. A
single URLserver serves lists of URLs to a number of crawlers. Both the
URLserver and the crawlers were implemented in Python. Each crawler
keeps roughly 300 connections open at once. This is necessary to retrieve
Web pages at a fast enough pace. At peak speeds, the system can crawl over
100 Web pages per second using four crawlers. A major performance stress
is DNS lookup so each crawler maintains a DNS cache. Each of the
hundreds of connections can be in a number of different states: looking up
DNS, connecting to host, sending requests, and receiving responses. These
factors make the crawler a complex component of the system. It uses
asynchronous IO to manage events, and a number of queues to move page
fetches from state to state.
13.2.3 Searching
Every hitlist includes position, font, and capitalization information.
Additionally, hits from anchor text and the PageRank of the document are
factored in. Combining all of this information into a rank is difficult. The
ranking function is designed so that no one factor can have too much influence. For
every matching document we compute counts of hits of different types at
different proximity levels. These counts are then run through a series of
lookup tables and eventually are transformed into a rank. This process
involves many tunable parameters.
Problems
Download the Wikipedia hyperlinks network available at
https://snap.stanford.edu/data/wiki-topcats.txt.gz.
64 Compute the PageRank, hub score and authority score for each of the
nodes in the graph.
65 Report the nodes that have the top 3 PageRank, hub and authority
scores respectively.
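A possible starting point for Problems 64 and 65 using networkx is sketched below; it assumes the file has already been downloaded and decompressed locally, and note that PageRank and HITS on a graph of this size may require substantial memory and time.

import networkx as nx

# Assumes wiki-topcats.txt has been downloaded and un-gzipped beforehand.
G = nx.read_edgelist("wiki-topcats.txt", create_using=nx.DiGraph(), nodetype=int)

pagerank = nx.pagerank(G, alpha=0.85)          # PageRank scores
hubs, authorities = nx.hits(G, max_iter=100)   # hub and authority scores

for name, scores in [("PageRank", pagerank), ("hub", hubs), ("authority", authorities)]:
    top3 = sorted(scores, key=scores.get, reverse=True)[:3]
    print(name, "top 3 nodes:", top3)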
References
1. Agirre, Eneko, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. Semeval-2012 task 6:
a pilot on semantic textual similarity. In Proceedings of the first joint conference on lexical and
computational semantics-volume 1: Proceedings of the main conference and the shared task, and
Volume 2: Proceedings of the sixth international workshop on semantic evaluation, 385–393.
Association for Computational Linguistics.
2. Altman, Alon, and Moshe Tennenholtz. 2005. Ranking systems: the pagerank axioms. In
Proceedings of the 6th ACM conference on electronic commerce, 1–8. ACM.
3. Arasu, Arvind, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan.
2001. Searching the web. ACM Transactions on Internet Technology (TOIT) 1 (1): 2–43.
[Crossref]
4. Berkhin, Pavel. 2005. A survey on pagerank computing. Internet Mathematics 2 (1): 73–120.
[MathSciNet][Crossref]
5. Bharat, Krishna, and Monika R. Henzinger. 1998. Improved algorithms for topic distillation in a
hyperlinked environment. In Proceedings of the 21st annual international ACM SIGIR
conference on research and development in information retrieval, 104–111. ACM.
6. Borodin, Allan, Gareth O. Roberts, Jeffrey S. Rosenthal, and Panayiotis Tsaparas. 2001. Finding
authorities and hubs from link structures on the world wide web. In Proceedings of the 10th
international conference on World Wide Web, 415–429. ACM.
7. Brin, Sergey, and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search
engine. Computer networks and ISDN systems 30 (1–7): 107–117.
[Crossref]
8. Chakrabarti, Soumen, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson,
and Jon Kleinberg. 1998. Automatic resource compilation by analyzing hyperlink structure and
associated text. Computer Networks and ISDN Systems 30 (1–7): 65–74.
[Crossref]
9. Cho, Junghoo, and Hector Garcia-Molina. 2003. Estimating frequency of change. ACM
Transactions on Internet Technology (TOIT) 3 (3): 256–290.
[Crossref]
10. Cohn, David, and Huan Chang. 2000. Learning to probabilistically identify authoritative
documents. In ICML, 167–174. Citeseer.
11. Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from
incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B
(methodological) 1–38.
12. Gyöngyi, Zoltán, Hector Garcia-Molina, and Jan Pedersen. 2004. Combating web spam with
trustrank. In Proceedings of the thirtieth international conference on very large data bases-
volume 30, 576–587. VLDB Endowment.
13. Gyongyi, Zoltan, Pavel Berkhin, Hector Garcia-Molina, and Jan Pedersen. 2006. Link spam
detection based on mass estimation. In Proceedings of the 32nd international conference on very
large data bases, 439–450. VLDB Endowment.
14. Haveliwala, Taher H. 2002. Topic-sensitive pagerank. In Proceedings of the 11th international
conference on World Wide Web, 517–526. ACM.
15. Hirai, Jun, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. 2000. Webbase: a
repository of web pages. Computer Networks 33 (1–6): 277–293.
[Crossref]
17. Lempel, Ronny, and Shlomo Moran. 2000. The stochastic approach for link-structure analysis
(salsa) and the tkc effect1. Computer Networks 33 (1–6): 387–401.
[Crossref]
18. MacCluer, Charles R. 2000. The many proofs and applications of perron’s theorem. Siam Review 42 (3): 487–498.
19. Melink, Sergey, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. 2001. Building a
distributed full-text index for the web. ACM Transactions on Information Systems (TOIS) 19 (3):
217–241.
[Crossref]
20. Rafiei, Davood, and Alberto O. Mendelzon. 2000. What is this page known for? computing web
page reputations. Computer Networks 33 (1–6): 823–835.
21. Ribeiro-Neto, Berthier A., and Ramurti A. Barbosa. 1998. Query performance for tightly coupled
distributed digital libraries. In Proceedings of the third ACM conference on digital libraries,
182–190. ACM.
22. Tomasic, Anthony, and Hector Garcia-Molina. 1993. Performance of inverted indices in shared-
nothing distributed text document information retrieval systems. In Proceedings of the second
international conference on parallel and distributed information systems, 8–17. IEEE.
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_14
Krishna Raj P. M.
Email: krishnarajpm@gmail.com
../images/462433_1_En_14_Chapter/462433_1_En_14_Fig1_HTML.gif
Fig. 14.1 Undirected graph with four vertices and four edges. Vertices A and C have mutual
contacts B and D, while B and D have mutual contacts A and C
../images/462433_1_En_14_Chapter/462433_1_En_14_Fig2_HTML.gif
Fig. 14.2 Figure 14.1 with an edge between A and C, and B and D due to triadic closure property
Granovetter postulated that triadic closure was one of the most crucial
reasons why acquaintances are the ones to thank for a person’s new job.
Consider the graph in Fig. 14.3, B has edges to the tightly-knit group
containing A, D and C, and also has an edge to E. The connection of B to E
is qualitatively different from the links to the tightly-knit group, because the
opinions and information that B and the group have access to are similar.
The information that E will provide to B will be things that B will not
necessarily have access to.
We define an edge to be a local bridge if the end vertices of the edge do
not have mutual friends. By this definition, the edge between B and E is a
local bridge. Observe that local bridge and triadic closure are conceptually
opposite terms: while triadic closure implies an edge between vertices that
have mutual friends, a local bridge is an edge between vertices that have
none. So, Granovetter’s observation was that acquaintances connected to
an individual by local bridges can provide information such as job openings
which the individual might otherwise not have access to because the tightly-
knit group, although with greater motivation to find their buddy a job, will
know roughly the same things that the individual is exposed to.
../images/462433_1_En_14_Chapter/462433_1_En_14_Fig3_HTML.gif
Fig. 14.3 Graph with a local bridge between B and E
In formal terms, acquaintanceships are considered weak ties, while
friendships within closely-knit groups are called strong ties. Let us assume
that we are given the strength of each tie in
Fig. 14.3 (shown in Fig. 14.4). So, if a vertex A has edges to both B and C,
then the B-C edge is likely to form if A’s edges to B and C are both strong
ties. This could be considered a similarity to the theory of structural balance
studied in Chap. 7, where the affinity was to keep triads balanced by
promoting positive edges between each pair of vertices in the triangle.
../images/462433_1_En_14_Chapter/462433_1_En_14_Fig4_HTML.gif
Fig. 14.4 Each edge of the graph in Fig. 14.3 is labelled either as a strong tie (S) or a weak tie (W).
The labelling in the figure satisfies the Strong Triadic Closure property
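The Strong Triadic Closure property referred to in Fig. 14.4 says, informally, that if a node has strong ties to two other nodes, then some edge (strong or weak) should exist between those two. A minimal sketch of a check for violations of this property follows; the dictionary-based representation and the 'S'/'W' labels are illustrative, not the book's own data structures.

from itertools import combinations

def stc_violators(adj, strength):
    """Nodes with strong ties to two neighbours that are not themselves connected.

    adj maps each node to the set of its neighbours; strength maps a
    frozenset({u, v}) edge to 'S' (strong) or 'W' (weak).
    """
    violators = set()
    for a in adj:
        strong = [b for b in adj[a] if strength[frozenset((a, b))] == "S"]
        for b, c in combinations(strong, 2):
            if c not in adj[b]:          # a strong-strong pair with no b-c edge
                violators.add(a)
    return violators

# Illustrative check on a tiny labelled graph.
adj = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
strength = {frozenset(("A", "B")): "S", frozenset(("A", "C")): "S"}
print(stc_violators(adj, strength))   # {'A'}: strong ties A-B and A-C, but no B-C edge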
3.
Recalculate the betweenness for all edges affected by the removal.
4.
Repeat from step 2 until no edges remain.
The algorithm in [5] can calculate the betweenness for all the |E| edges
in a graph of |V| vertices in time O(|V||E|). Since the calculation has to be
repeated once for the removal of each edge, the entire algorithm runs in
worst-case time $$O(|E|^{2}|V|)$$.
Consequently, on sparse graphs the Girvan-Newman algorithm runs in
$$O(|V|^{3})$$ time, which makes it impractical for very large graphs.
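A minimal sketch of this edge-removal procedure using networkx, whose edge betweenness routine implements the computation of [5]; here the loop is stopped at the first split into two components rather than continuing until no edges remain, and the Zachary karate club graph is used purely as an illustrative input.

import networkx as nx

def girvan_newman_once(G):
    """Remove highest-betweenness edges until the graph first splits in two."""
    H = G.copy()
    while nx.number_connected_components(H) == 1 and H.number_of_edges() > 0:
        betweenness = nx.edge_betweenness_centrality(H)   # recomputed after every removal
        edge = max(betweenness, key=betweenness.get)
        H.remove_edge(*edge)
    return list(nx.connected_components(H))

G = nx.karate_club_graph()
print(girvan_newman_once(G))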
14.2.2 Modularity
Reference [2] proposed a measure called “modularity” and suggested that
the optimization of this quality function over the possible divisions of the
network is an efficient way to detect communities in a network. The
modularity is, up to a multiplicative constant, the number of edges falling
within groups minus the expected number in an equivalent network with
edges placed at random. The modularity can be either positive or negative,
with positive values indicating the possible presence of community
structure. Thus, we can search for community structure precisely by looking
for the divisions of a network that have positive, and preferably large,
values of the modularity.
Let us suppose that all the vertices in our graph belong to one of two
groups, i.e., $$s_{i} = 1$$ if vertex i belongs to group 1 and
$$s_{i} = -1$$ if it belongs to group 2. Let the number of edges
between vertices i and j be $$A_{ij}$$, which will normally be 0 or
1, although larger values are possible for graphs with multiple edges. The
expected number of edges between vertices i and j if edges are placed at
random is $$k_{i}k_{j}/2|E|$$, where $$k_{i}$$ and
$$k_{j}$$ are the degrees of vertices i and j respectively, and
$$|E| = \sum _{i}k_{i}/2$$. The modularity Q is given by the sum of
$$A_{ij}-k_{i}k_{j}/2|E|$$ over all pairs of vertices i and j that fall
in the same group.
Observing that the quantity $$\frac{1}{2}(s_{i}s_{j}+1)$$ is 1 if
i and j are in the same group and 0 otherwise, the modularity is formally
expressed as
$$\begin{aligned} Q = \frac{1}{4|E|}\sum \limits _{ij}\left( A_{ij}-\frac{k_{i}k_{j}}{2|E|}\right) (s_{i}s_{j}+1) = \frac{1}{4|E|}\sum \limits _{ij}\left( A_{ij}-\frac{k_{i}k_{j}}{2|E|}\right) s_{i}s_{j} \end{aligned}$$ (14.3)
where the second equality follows from the observation that
$$2|E| = \sum _{i}k_{i} = \sum _{ij}A_{ij}$$.
Equation 14.3 can be written in matrix form as
$$\begin{aligned} Q = \frac{1}{4|E|}s^{T}Bs \end{aligned}$$ (14.4)
where s is the column vector whose elements are $$s_{i}$$, and we have
defined a real symmetric matrix B with elements
$$\begin{aligned} B_{ij} = A_{ij} - \frac{k_{i}k_{j}}{2|E|} \end{aligned}$$ (14.5)
which is called the modularity matrix . The elements of each of its rows and
columns sum to zero, so that it always has an eigenvector
$$(1,1,1,\ldots )$$ with eigenvalue zero.
Given Eq. 14.4, writing s as a linear combination of the
normalized eigenvectors $$u_{i}$$ of B, so that
$$s = \sum _{i=1}^{|V|}a_{i}u_{i}$$ with
$$a_{i} = u_{i}^{T}\cdot s$$, we get
$$\begin{aligned} Q = \frac{1}{4|E|}\sum \limits _{i}a_{i}u_{i}^{T}B\sum \limits _{j}a_{j}u_{j} = \frac{1}{4|E|}\sum \limits _{i=1}^{|V|}(u_{i}^{T}\cdot s)^{2}\beta _{i} \end{aligned}$$ (14.6)
where $$\beta _{i}$$ is the eigenvalue of B corresponding to
eigenvector $$u_{i}$$.
Assume that the eigenvalues are labelled in decreasing order,
$$\beta _{1}\ge \beta _{2}\ge \cdots \ge \beta _{|V|}$$. To maximize
the modularity, we have to choose the value of s that concentrates as much
weight as possible in terms of the sum in Eq. 14.6 involving the largest
eigenvalues. If there were no other constraints on the choice of s, this would
be an easy task: we would simply choose s proportional to the eigenvector
$$u_{1}$$. This places all of the weight in the term involving the
largest eigenvalue $$\beta _{1}$$, the other terms being
automatically zero, because the eigenvectors are orthogonal.
There is another constraint on the problem imposed by the restriction of
the elements of s to the values $$\pm 1$$, which means s cannot
normally be chosen parallel to $$u_{1}$$ . To make it as close to
parallel as possible, we have to maximize the dot product
$$u_{1}^{T}\cdot s$$. It is straightforward to see that the maximum
is achieved by setting $$s_{i}=+1$$ if the corresponding element of
$$u_{1}$$ is positive and $$s_{i}=-1$$ otherwise. In other
words, all vertices whose corresponding elements are positive go in one
group and all of the rest in the other. This gives us the algorithm for
dividing the network: we compute the leading eigenvector of the modularity
matrix and divide the vertices into two groups according to the signs of the
elements in this vector.
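A minimal sketch of this leading-eigenvector bisection with numpy on a small dense adjacency matrix is given below; the toy graph (two triangles joined by one edge) is illustrative, and large sparse graphs would instead use the faster multiplication of Eq. 14.9 discussed later in this chapter.

import numpy as np

def spectral_bisection(A):
    """Split the vertices by the signs of the leading eigenvector of the
    modularity matrix B of Eq. 14.5, and report the resulting Q (Eq. 14.4)."""
    k = A.sum(axis=1)                    # degree vector
    two_m = k.sum()                      # 2|E|
    B = A - np.outer(k, k) / two_m       # modularity matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    leading = eigvecs[:, np.argmax(eigvals)]
    s = np.where(leading >= 0, 1, -1)
    Q = s @ B @ s / (2 * two_m)          # 1/(4|E|) * s^T B s
    return s, Q

# Illustrative graph: two triangles (vertices 0-2 and 3-5) joined by one edge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
s, Q = spectral_bisection(A)
print(s, Q)   # the two triangles receive opposite signs and Q is positive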
There are some satisfying features of this method. First, it works even
though the sizes of the communities are not specified. Unlike conventional
partitioning methods that minimize the number of between-group edges,
there is no need to constrain the group sizes or artificially forbid the trivial
solution with all vertices in a single group. There is an eigenvector
$$(1,1,1,\ldots )$$ corresponding to such a trivial solution, but its
eigenvalue is zero. All other eigenvectors are orthogonal to this one and
hence must possess both positive and negative elements. Thus, as long as
there is any positive eigenvalue this method will not put all vertices in the
same group.
It is, however, possible for there to be no positive eigenvalues of the
modularity matrix. In this case the leading eigenvector is the vector
$$(1,1,1,\ldots )$$ corresponding to all vertices in a single group
together. But this is precisely the correct result: the algorithm is in this case
telling us that there is no division of the network that results in positive
modularity, as can immediately be seen from Eq. 14.6, because all terms in
the sum will be zero or negative. The modularity of the undivided network
is zero, which is the best that can be achieved. This is an important feature
of the algorithm. The algorithm has the ability not only to divide networks
effectively, but also to refuse to divide them when no good division exists.
The networks in this latter case will be called indivisible, i.e., a network is
indivisible if the modularity matrix has no positive eigenvalues. This idea
will play a crucial role in later developments.
The algorithm as described makes use only of the signs of the elements
of the leading eigenvector, but the magnitudes convey information, too.
Vertices corresponding to elements of large magnitude make large
contributions to the modularity, Eq. 14.6, and conversely for small ones.
Alternatively, if we take the optimal division of a network into two groups
and move a vertex from one group to the other, the vector element for that
vertex gives an indication of how much the modularity will decrease:
vertices corresponding to elements of large magnitude cannot be moved
without incurring a large modularity penalty, whereas those corresponding
to smaller elements can be moved at relatively little cost. Thus, the
elements of the leading eigenvector measure how firmly each vertex
belongs to its assigned community, those with large vector elements being
strong central members of their communities, whereas those with smaller
elements are more ambivalent.
When dealing with networks that can be divided into more than two
communities, we use the algorithm of the previous section first to divide the
network into two parts, then divide those parts, and so forth. However, after
first dividing a network in two, it is not correct to simply delete the edges
falling between the two parts and then apply the algorithm again to each
subgraph. This is because the degrees appearing in the definition, Eq. 14.3,
of the modularity will change if edges are deleted, and any subsequent
maximization of modularity would thus maximize the wrong quantity.
Instead, the correct approach is to write the additional contribution
$$\Delta Q$$ to the modularity upon further dividing a group g of
size $$|V|_{g}$$ in two as
$$\begin{aligned} \Delta Q&= \frac{1}{2|E|}\left[ \frac{1}{2}\sum \limits _{i,j\in g}B_{ij}(s_{i}s_{j}+1)-\sum \limits _{i,j\in g}B_{ij}\right] \nonumber \\&= \frac{1}{4|E|}\left[ \sum \limits _{i,j\in g}B_{ij}s_{i}s_{j}-\sum \limits _{i,j\in g}B_{ij}\right] \nonumber \\&= \frac{1}{4|E|}\sum \limits _{i,j\in g}\left[ B_{ij}-\delta _{ij}\sum \limits _{k\in g}B_{ik}\right] s_{i}s_{j} \nonumber \\&= \frac{1}{4|E|}s^{T}B^{(g)}s \end{aligned}$$ (14.7)
where $$\delta _{ij}$$ is the Kronecker $$\delta $$-symbol,
$$B^{(g)}$$ is the $$n_{g}\times n_{g}$$ matrix with
elements indexed by the labels i, j of vertices within group g and having
values
$$\begin{aligned} B_{ij}^{(g)} = B_{ij}-\delta _{ij}\sum \limits _{k\in g}B_{ik} \end{aligned}$$ (14.8)
Since Eq. 14.7 has the same form as Eq. 14.4, we can now apply the
spectral approach to this generalized modularity matrix, just as before, to
maximize $$\Delta Q$$. Note that the rows and columns of
$$B^{(g)}$$ still sum to zero and that $$\Delta Q$$ is correctly
zero if group g is undivided. Note also that for a complete network Eq. 14.8
reduces to the previous definition of the modularity matrix, Eq. 14.5,
because $$\sum _{k}B_{ik}$$ is zero in that case.
In repeatedly subdividing the network, an important question we need to
address is at what point to halt the subdivision process. A nice feature of
this method is that it provides a clear answer to this question: if there exists
no division of a subgraph that will increase the modularity of the network,
or equivalently that gives a positive value for $$\Delta Q$$, then there
is nothing to be gained by dividing the subgraph and it should be left alone;
it is indivisible in the sense of the previous section. This happens when
there are no positive eigenvalues to the matrix $$B^{(g)}$$, and thus
the leading eigenvalue provides a simple check for the termination of the
subdivision process: if the leading eigenvalue is zero, which is the smallest
value it can take, then the subgraph is indivisible.
Note, however, that although the absence of positive eigenvalues is a
sufficient condition for indivisibility, it is not a necessary one. In particular,
if there are only small positive eigenvalues and large negative ones, the
terms in Eq. 14.6 for negative $$\beta _{i}$$ may outweigh those for
positive. It is straightforward to guard against this possibility, however; we
simply calculate the modularity contribution $$\Delta Q$$ for each
proposed split directly and confirm that it is greater than zero.
Thus the algorithm is as follows. We construct the modularity matrix,
Eq. 14.5, for the network and find its leading (most positive) eigenvalue and
the corresponding eigenvector. We divide the network into two parts
according to the signs of the elements of this vector, and then repeat the
process for each of the parts, using the generalized modularity matrix,
Eq. 14.8. If at any stage we find that a proposed split makes a zero or
negative contribution to the total modularity, we leave the corresponding
subgraph undivided. When the entire network has been decomposed into
indivisible subgraphs in this way, the algorithm ends.
An alternate method for community detection is a technique that bears a
striking resemblance to the Kernighan-Lin algorithm. Suppose we are given
some initial division of our vertices into two groups. We then find among
the vertices the one that, when moved to the other group, will give the
biggest increase in the modularity of the complete network, or the smallest
decrease if no increase is possible. We make such moves repeatedly, with
the constraint that each vertex is moved only once. When all the vertices
have been moved, we search the set of intermediate states occupied by the
network during the operation of the algorithm to find the state that has the
greatest modularity. Starting again from this state, we repeat the entire
process iteratively until no further improvement in the modularity results.
Although this method by itself only gives reasonable modularity values,
the method really comes into its own when it is used in combination with
the spectral method introduced earlier. The spectral approach based on the
leading eigenvector of the modularity matrix gives an excellent guide to the
general form that the communities should take and this general form can
then be fine-tuned by the vertex moving method to reach the best possible
modularity value. The whole procedure is repeated to subdivide the network
until every remaining subgraph is indivisible, and no further improvement
in the modularity is possible.
The most time-consuming part of the algorithm is the evaluation of the
leading eigenvector of the modularity matrix. The fastest method for
finding this eigenvector is the simple power method, the repeated
multiplication of the matrix into a trial vector. Although it appears at first
glance that matrix multiplications will be slow, taking $$O(|V|^{2})$$
operations each because the modularity matrix is dense, we can in fact
perform them much faster by exploiting the particular structure of the
matrix. Writing $$B=A-kk^{T}/2|E|$$, where A is the adjacency
matrix and k is the vector whose elements are the degrees of the vertices,
the product of B and an arbitrary vector x can be written
$$\begin{aligned} Bx = Ax - \frac{k(k^{T}\cdot x)}{2|E|} \end{aligned}$$ (14.9)
The first term is a standard sparse matrix multiplication taking time
$$O(|V|+|E|)$$. The inner product $$k^{T}\cdot x$$ takes time
O(|V|) to evaluate and hence the second term can be evaluated in total time
O(|V|) also. Thus the complete multiplication can be performed in
$$O(|V|+|E|)$$ time. Typically O(|V|) such multiplications are needed
to converge to the leading eigenvector, for a running time of
$$O[(|V|+|E|)|V|]$$ overall. Often we are concerned with sparse
graphs with $$|E|\propto |V|$$, in which case the running time
becomes $$O(|V|^{2})$$. It is a simple matter to extend this
procedure to find the leading eigenvector of the generalized modularity
matrix, Eq. 14.8, also.
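A minimal sketch of the multiplication in Eq. 14.9 with scipy's sparse matrices, wrapped in a simple power iteration, is given below; the example 4-cycle is a placeholder. Note, as a practical caveat not discussed above, that the power method converges to the eigenvector whose eigenvalue has the largest magnitude, so a diagonal shift of the matrix may be needed when a large negative eigenvalue dominates.

import numpy as np
from scipy.sparse import csr_matrix

def modularity_matvec(A, k, two_m, x):
    """Bx = Ax - k (k^T x) / 2|E|, without ever forming the dense matrix B."""
    return A @ x - k * (k @ x) / two_m

def leading_eigenvector(A, iterations=200):
    """Power method on the modularity matrix using the sparse product above."""
    k = np.asarray(A.sum(axis=1)).ravel()
    two_m = k.sum()
    x = np.random.rand(A.shape[0])
    for _ in range(iterations):
        x = modularity_matvec(A, k, two_m, x)
        x /= np.linalg.norm(x)           # keep the trial vector bounded
    return x

# Illustrative 4-cycle stored as a sparse adjacency matrix.
rows = [0, 1, 1, 2, 2, 3, 3, 0]
cols = [1, 0, 2, 1, 3, 2, 0, 3]
A = csr_matrix((np.ones(8), (rows, cols)), shape=(4, 4))
print(leading_eigenvector(A))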
Although we will not go through the details here, it is straight-forward
to show that the fine-tuning stage of the algorithm can also be completed in
$$O[(|V|+|E|)|V|]$$ time, so that the combined running time for a
single split of a graph or subgraph scales as $$O[(|V|+|E|)|V|]$$, or
$$O(|V|^{2})$$ on a sparse graph.
We then repeat the division into two parts until the network is reduced
to its component indivisible subgraphs. The running time of the entire
process depends on the depth of the tree or “dendrogram” formed by these
repeated divisions. In the worst case the dendrogram has depth linear in |V|,
but only a small fraction of possible dendrograms realize this worst case. A
more realistic figure for running time is given by the average depth of the
dendrogram, which goes as $$\log |V|$$, giving an average running
time for the whole algorithm of $$O(|V|^{2}\log |V|)$$ in the sparse
case. This is considerably better than the $$O(|V|^{3})$$ running
time of the betweenness algorithm, and slightly better than the
$$O(|V|^{2}\log ^{2}|V|)$$ of the extremal optimization algorithm.
Theorem 34
For an undirected graph G(V, E), let S be a community of s w.r.t t.
Then
$$\begin{aligned} \sum \limits _{v\in S}w(u,v) > \sum \limits _{v\in \overline{S}}w(u,v),\forall u\in S-\{s\} \end{aligned}$$ (14.13)
Theorem 36
Let G(V, E) be an undirected graph, $$s\in V$$ a source, and
connect an artificial sink t with edges of capacity $$\alpha $$ to all
nodes. Let S be the community of s w.r.t t. For any non-empty P and Q,
such that $$P\cup Q=S$$ and $$P\cap Q=\phi $$, the
following bounds always hold:
$$\begin{aligned} \frac{c(S,V-S)}{|V-S|}\le \alpha \le \frac{c(P,Q)}{min(|P|,|Q|)} \end{aligned}$$ (14.14)
Theorem 37 Let $$s,t\in V$$ be two nodes of G and let S be
the community of s w.r.t t. Then, there exists a minimum cut tree
$$T_{G}$$ of G, and an edge $$(a,b)\in T_{G}$$, such that
the removal of (a, b) yields S and $$V-S$$.
Theorem 38
Let $$T_{G}$$ be a minimum cut tree of a graph G(V, E), and
let (u, w) be an edge of $$T_{G}$$. Edge (u, w) yields the cut
(U, W) in G, with $$u\in U$$, $$w\in W$$. Now, take any
cut $$(U_{1},U_{2})$$ of U, so that $$U_{1}$$ and
$$U_{2}$$ are non-empty, $$u\in U_{1}$$,
$$U_{1}\cup U_{2}=U$$, and $$U_{1}\cap U_{2}=\phi $$.
Then,
$$\begin{aligned} c(W,U_{2})\le c(U_{1},U_{2}) \end{aligned}$$ (14.15)
Theorem 39
Let $$G_{\alpha }$$ be the expanded graph of G, and let S be
the community of s w.r.t the artificial sink t. For any non-empty P and Q,
such that $$P\cup Q=S$$ and $$P\cap Q=\phi $$, the
following bound always holds:
$$\begin{aligned} \alpha \le \frac{c(P,Q)}{min(|P|,|Q|)} \end{aligned}$$ (14.16)
Theorem 40
Let $$G_{\alpha }$$ be the expanded graph of G(V, E) and let
S be the community of s w.r.t the artificial sink t. Then, the following
bound always holds:
$$\begin{aligned} \frac{c(S,V-S)}{|V-S|}\le \alpha \end{aligned}$$ (14.17)
The cut clustering algorithm is given in Algorithm 8.
../images/462433_1_En_14_Chapter/462433_1_En_14_Figa_HTML.gif
Theorem 43 Let
$$\alpha _{1}>\alpha _{2}>\cdots >\alpha _{max}$$ be a
sequence of parameter values that connect t to V in
$$G_{\alpha _{i}}$$. Let
$$\alpha _{max+1}\le \alpha _{max}$$ be small enough to yield a
single cluster in G and $$\alpha _{0}\ge \alpha _{1}$$ be large
enough to yield all singletons. Then all $$\alpha _{i+1}$$ values,
for $$0\le i\le max$$, yield clusters in G which are supersets of the
clusters produced by each $$\alpha _{i}$$, and all clusterings
together form a hierarchical tree over the clusterings of G.
../images/462433_1_En_14_Chapter/462433_1_En_14_Fig5_HTML.gif
Fig. 14.5 a Degree distribution. b Tie strength distribution. The blue lines in a and b correspond to
$$P(x) = a(x+x_{0})^{-\gamma _{x}}\exp (-x/x_{c})$$, where x corresponds to either k or w. The parameter
values for the fits in (A) are $$k_{0}=10.9$$, $$\gamma _{k}=8.4$$, $$k_{c}=\infty $$,
and for the fits in (B) are $$w_{0}=280, \gamma _{w}=1.9, w_{c}=3.45\times 10^{5}$$. c
Illustration of the overlap between two nodes, $$v_{i}$$ and $$v_{j}$$, its value being
shown for four local network configurations. d In the real network, the overlap
$$\langle O\rangle _{w}$$ (blue circles) increases as a function of cumulative tie strength
$$P_{cum}(w)$$, representing the fraction of links with tie strength smaller than w. The dyadic
hypothesis is tested by randomly permuting the weights, which removes the coupling between
$$\langle O\rangle _{w}$$ and w (red squares). The overlap $$\langle O\rangle _{b}$$ decreases as a
function of cumulative link betweenness centrality b (black diamonds)
../images/462433_1_En_14_Chapter/462433_1_En_14_Fig6_HTML.gif
Fig. 14.6 Each link represents mutual calls between the two users, and all nodes are shown that are
at distance less than six from the selected user, marked by a circle in the center. a The real tie
strengths, observed in the call logs, defined as the aggregate call duration in minutes. b The dyadic
hypothesis suggests that the tie strength depends only on the relationship between the two
individuals. To illustrate the tie strength distribution in this case, we randomly permuted tie strengths
for the sample in a. c The weight of the links assigned on the basis of their betweenness centrality
$$b_{ij}$$ values for the sample in A as suggested by the global efficiency principle. In this case,
the links connecting communities have high $$b_{ij}$$ values (red), whereas the links within the
communities have low $$b_{ij}$$ values (green)
Figures 14.5 and 14.6 suggest that instead of tie strength being
determined by the characteristics of the individuals it connects or by the
network topology, it is determined solely by the network structure in the
tie’s immediate vicinity.
To evaluate this suggestion, we explore the network’s ability to
withstand the removal of either strong or weak ties. For this, we measure
the relative size of the giant component $$R_{gc}(f)$$, providing the
fraction of nodes that can all reach each other through connected paths as a
function of the fraction of removed links, f. Figure 14.7a, b shows that
removing in rank order the weakest (or smallest overlap) to strongest (or
greatest overlap) ties leads to the network’s sudden disintegration at
$$f^{w}=0.8(f^{O}=0.6)$$. However, removing first the strongest
ties will shrink the network but will not rapidly break it apart. The precise
point at which the network disintegrates can be determined by monitoring
$$\tilde{S}=\sum _{s<s_{max}}n_{s}s^{2}/N$$, where
$$n_{s}$$ is the number of clusters containing s nodes. Figure 14.7c,
d shows that $$\tilde{S}$$ develops a peak if we start with the
weakest links. Finite size scaling, a well established technique for
identifying the phase transition, indicates that the values of the critical
points are $$f^{O}_{c}(\infty )=0.62\pm 0.05$$ and
$$f^{w}_{c}(\infty )=0.80\pm 0.04$$ for the removal of the weak
ties, but there is no phase transition when the strong ties are removed first.
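A short sketch of this robustness measurement is given below; the randomly weighted Erdős–Rényi graph is only a stand-in for the mobile call network, and the curve $$R_{gc}(f)$$ is sampled at a fixed number of points.

import random
import networkx as nx

def giant_component_curve(G, weakest_first=True, samples=20):
    """Relative giant-component size R_gc(f) as links are removed by weight."""
    edges = sorted(G.edges(data="weight"),
                   key=lambda e: e[2], reverse=not weakest_first)
    n0 = len(max(nx.connected_components(G), key=len))
    H = G.copy()
    step = max(1, len(edges) // samples)
    curve = []
    for i, (u, v, _) in enumerate(edges, start=1):
        H.remove_edge(u, v)
        if i % step == 0:
            giant = max(nx.connected_components(H), key=len)
            curve.append((i / len(edges), len(giant) / n0))
    return curve

# Illustrative stand-in for the weighted call graph.
G = nx.erdos_renyi_graph(200, 0.05, seed=1)
for u, v in G.edges():
    G[u][v]["weight"] = random.random()
print(giant_component_curve(G, weakest_first=True)[:5])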
../images/462433_1_En_14_Chapter/462433_1_En_14_Fig7_HTML.gif
Fig. 14.7 The control parameter f denotes the fraction of removed links. a and c These graphs
correspond to the case in which the links are removed on the basis of their strengths ( $$w_{ij}$$
removal). b and d These graphs correspond to the case in which the links were removed on the basis
of their overlap ( $$O_{ij}$$ removal). The black curves correspond to removing first the high-
strength (or high $$O_{ij}$$ ) links, moving toward the weaker ones, whereas the red curves
represent the opposite, starting with the low-strength (or low $$O_{ij}$$ ) ties and moving toward
the stronger ones. a and b The relative size of the largest component
$$R_{GC}(f)=N_{GC}(f)/N_{GC}(f=0)$$ indicates that the removal of the low $$w_{ij}$$
or $$O_{ij}$$ links leads to a breakdown of the network, whereas the removal of the high
$$w_{ij}$$ or $$O_{ij}$$ links leads only to the network’s gradual shrinkage. a Inset Shown
is the blowup of the high $$w_{ij}$$ region, indicating that when the low $$w_{ij}$$ ties are
removed first, the red curve goes to zero at a finite f value. c and d According to percolation theory,
$$\tilde{S}=\sum _{s<s_{max}}n_{s}s^{2}/N$$ diverges for $$N\rightarrow \infty $$ as
we approach the critical threshold $$f_{c}$$ , where the network falls apart. If we start link
removal from links with low $$w_{ij}$$ (c) or $$O_{ij}$$ (d) values, we observe a clear
signature of divergence. In contrast, if we start with high $$w_{ij}$$ (c) or $$O_{ij}$$ (d)
links, the divergence is absent
This finding gives us the following conclusion: Given that the strong
ties are predominantly within the communities, their removal will only
locally disintegrate a community but not affect the network’s overall
integrity. In contrast, the removal of the weak links will delete the bridges
that connect different communities, leading to a phase transition driven
network collapse.
To see whether this observation affects global information diffusion, at
time 0 a randomly selected individual is infected with some novel
information. It is assumed that at each time step, each infected individual,
$$v_{i}$$, can pass the information to his/her contact,
$$v_{j}$$, with effective probability $$P_{ij}=xw_{ij}$$,
where the parameter x controls the overall spreading rate. Therefore, the
more time two individuals spend on the phone, the higher the chance that
they will pass on the monitored information. The spreading mechanism is
similar to the susceptible-infected model of epidemiology in which
recovery is not possible, i.e., an infected individual will continue
transmitting information indefinitely. As a control, the authors considered
spreading on the same network, but replaced all tie strengths with their
average value, resulting in a constant transmission probability for all links.
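A minimal sketch of this susceptible-infected simulation on a weighted networkx graph follows; the spreading rate x and the toy graph are placeholders, and replacing every weight by the network-wide average reproduces the control simulation described in the text.

import random
import networkx as nx

def si_spread(G, x, steps=200, seed_node=None):
    """Susceptible-infected spreading: an infected node i infects a susceptible
    neighbour j in one time step with probability P_ij = x * w_ij."""
    if seed_node is None:
        seed_node = random.choice(list(G.nodes()))
    infected = {seed_node}
    history = [len(infected)]
    for _ in range(steps):
        newly = set()
        for i in infected:
            for j in G.neighbors(i):
                if j not in infected and random.random() < x * G[i][j]["weight"]:
                    newly.add(j)
        infected |= newly
        history.append(len(infected))
    return history

# Illustrative run on a small weighted ring.
G = nx.cycle_graph(50)
for u, v in G.edges():
    G[u][v]["weight"] = random.uniform(10, 1000)   # stand-in call durations
print(si_spread(G, x=2.59e-4)[-1], "nodes infected after 200 steps")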
Figure 14.8a shows the real diffusion simulation, where it was found
that information transfer is significantly faster on the network for which all
weights are equal, the difference being rooted in a dynamic trapping of
information in communities. Such trapping is clearly visible if the number
of infected individuals in the early stages of the diffusion process is
monitored (as shown in Fig. 14.8b). Indeed, rapid diffusion within a single
community was observed, corresponding to fast increases in the number of
infected users, followed by plateaus, corresponding to time intervals during
which no new nodes are infected before the news escapes the community.
When all link weights are replaced with an average value w (the control
diffusion simulation) the bridges between communities are strengthened,
and the spreading becomes a predominantly global process, rapidly
reaching all nodes through a hierarchy of hubs.
The dramatic difference between the real and the control spreading
process begs the following question: Where do individuals get their
information? Figure 14.8c shows that the distribution of the tie strengths
through which each individual was first infected has a prominent peak at
$$w \approx 10^{2}$$ s, indicating that, in the vast majority of cases,
an individual learns about the news through ties of intermediate strength.
The distribution changes dramatically in the control case, however, when all
tie strengths are taken to be equal during the spreading process. In this case,
the majority of infections take place along the ties that are otherwise weak
(as depicted in Fig. 14.8d). Therefore, both weak and strong ties have a
relatively insignificant role as conduits for information, the former because
the small amount of on-air time offers little chance of information transfer
and the latter because they are mostly confined within communities, with
little access to new information.
To illustrate the difference between the real and the control simulation,
Fig. 14.8e, f show the spread of information in a small neighbourhood.
First, the overall direction of information flow is systematically different in
the two cases, as indicated by the large shaded arrows. In the control runs,
the information mainly follows the shortest paths. When the weights are
taken into account, however, information flows along a strong tie backbone,
and large regions of the network, connected to the rest of the network by
weak ties, are only rarely infected.
../images/462433_1_En_14_Chapter/462433_1_En_14_Fig8_HTML.gif
Fig. 14.8 The dynamics of spreading on the weighted mobile call graph, assuming that the
probability for a node $$v_{i}$$ to pass on the information to its neighbour $$v_{j}$$ in one
time step is given by $$P_{ij}=xw_{ij}$$ , with $$x=2.59\times 10^{-4}$$ . a The fraction of
infected nodes as a function of time t. The blue curve (circles) corresponds to spreading on the
network with the real tie strengths, whereas the black curve (asterisks) represents the control
simulation, in which all tie strengths are considered equal. b Number of infected nodes as a function
of time for a single realization of the spreading process. Each steep part of the curve corresponds to
invading a small community. The flatter part indicates that the spreading becomes trapped within the
community. c and d Distribution of strengths of the links responsible for the first infection for a node
in the real network (c) and control simulation (d). e and f Spreading in a small neighbourhood in the
simulation using the real weights (E) or the control case, in which all weights are taken to be equal
(f). The infection in all cases was released from the node marked in red, and the empirically observed
tie strength is shown as the thickness of the arrows (right-hand scale). The simulation was repeated
1,000 times; the size of the arrowheads is proportional to the number of times that information was
passed in the given direction, and the colour indicates the total number of transmissions on that link
(the numbers in the colour scale refer to percentages of 1, 000). The contours are guides to the eye,
illustrating the difference in the information direction flow in the two simulations
Problems
The betweenness centrality of an edge $$(u,v)\in E$$ in G(V, E) is given
by Eq. 14.18.
$$\begin{aligned} B(u,v) = \sum \limits _{(s,t)\in V^{2}}\frac{\sigma _{st}(u,v)}{\sigma _{st}} \end{aligned}$$ (14.18)
where $$\sigma _{st}$$ is the number of shortest paths between s and t,
and $$\sigma _{st}(u,v)$$ is the number of shortest paths between s and t
that contain the edge (u, v). This is assuming that the graph G is connected.
The betweenness centrality can be computed in the following ways.
14.4 Exact Betweenness Centrality
For each vertex $$s\in V$$, perform a BFS from s, which gives the BFS
tree $$T_{s}$$. For each $$v\in V$$, let v’s parent set $$P_{s}(v)$$
be defined as the set of nodes $$u\in V$$ that immediately precede v on
some shortest path from s to v in G. During the BFS, compute, for every
$$v\in V$$, the number $$\sigma _{sv}$$ of shortest paths between s
and v, according to recurrence
$$\begin{aligned} \sigma _{sv} = {\left\{ \begin{array}{ll} 1 &{} \text {if } v = s \\ \sum \limits _{u\in P_{s}(v)}\sigma _{su} &{} \text {otherwise} \end{array}\right. } \end{aligned}$$ (14.19)
After the BFS is finished, compute the dependency of s on each edge
$$(u,v)\in E$$ using the recurrence
$$\begin{aligned} \delta _{s}(u,v) = {\left\{ \begin{array}{ll} \frac{\sigma _{su}}{\sigma _{sv}} &{} \text {if } v \text { is a leaf of } T_{s} \\ \frac{\sigma _{su}}{\sigma _{sv}}\left( 1 + \sum \limits _{x:v\in P_{s}(x)}\delta _{s}(v,x)\right)  &{} \text {otherwise} \end{array}\right. } \end{aligned}$$ (14.20)
Since the $$\delta $$ values are only really defined for edges that connect
two nodes, u and v, where u is further away from s than v, assume that
$$\delta $$ is 0 for cases that are undefined (i.e., where an edge connects
two nodes that are equidistant from s).
Do not iterate over all edges when computing the dependency values. It
makes more sense to just look at edges that are in some BFS tree starting
from s, since other edges will have zero dependency values (they are not on
any shortest paths). (Though iterating over all edges is not technically
wrong if you just set the dependency values to 0 when the nodes are
equidistant from s.)
The betweenness centrality of (u, v) or B(u, v) is
$$\begin{aligned} B(u,v) = \sum \limits _{s\in V}\delta _{s}(u,v) \end{aligned}$$ (14.21)
This algorithm has a time complexity of O(|V||E|). However, the algorithm
requires the computation of the betweenness centrality of all edges even if
only the values of some edges are required.
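A minimal sketch of this computation in Python for an undirected, unweighted graph given as an adjacency-list dictionary is shown below; it performs a BFS from every source, applies the two recurrences, and accumulates Eq. 14.21, processing vertices in order of decreasing distance so that every dependency value is available when it is needed.

from collections import deque, defaultdict

def edge_betweenness(adj):
    """Exact edge betweenness via Eqs. 14.19-14.21 (one BFS per source)."""
    B = defaultdict(float)
    for s in adj:
        # BFS from s: distances, shortest-path counts sigma, and parent sets.
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        parents = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:     # v immediately precedes w
                    sigma[w] += sigma[v]       # Eq. 14.19
                    parents[w].append(v)
        # Dependency accumulation in reverse BFS order (Eq. 14.20).
        delta = defaultdict(float)             # delta[v] = sum of delta_s(v, x)
        for w in reversed(order):
            for v in parents[w]:
                d = (sigma[v] / sigma[w]) * (1.0 + delta[w])
                B[frozenset((v, w))] += d      # accumulate Eq. 14.21 over sources
                delta[v] += d
    return dict(B)

# Illustrative path graph a-b-c: both edges lie on every shortest path.
print(edge_betweenness({"a": ["b"], "b": ["a", "c"], "c": ["b"]}))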
14.5 Approximate Betweenness Centrality
This is an approximation of the previous algorithm with the difference that
it does not start a BFS from every node but rather samples starting nodes
randomly with replacement. Also, it can approximate betweenness for any
edge $$e \in E$$ without necessarily having to compute the centrality of
all other edges as well.
Repeatedly sample a vertex $$v_{i} \in V$$ and perform a BFS
from $$v_{i}$$ and maintain a running sum $$\Delta _{e}$$ of
the dependency scores $$\delta _{v_{i}}(e)$$ (one
$$\Delta _{e}$$ for each edge e of interest). Sample until
$$\Delta _{e}$$ is greater than cn for some constant
$$c \ge 2$$. Let the total number of samples be k. The estimated
betweenness centrality score of e is given by
$$\frac{n}{k}\Delta _{e}$$.
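A minimal sketch of this sampling scheme follows; the single-source dependency computation is the same BFS used in the exact algorithm above, repeated here in condensed form so that the sketch is self-contained, and a max_samples guard (not part of the description above) is added so that edges with very small betweenness cannot loop forever.

import random
from collections import deque, defaultdict

def source_dependency(adj, s, e):
    """Dependency delta_s(e) of source s on a single edge e = frozenset({u, v})."""
    dist, sigma, parents, order = {s: 0}, defaultdict(float), defaultdict(list), []
    sigma[s] = 1.0
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                parents[w].append(v)
    delta, wanted = defaultdict(float), 0.0
    for w in reversed(order):
        for v in parents[w]:
            d = (sigma[v] / sigma[w]) * (1.0 + delta[w])
            delta[v] += d
            if frozenset((v, w)) == e:
                wanted = d
    return wanted

def approx_edge_betweenness(adj, e, c=2.0, max_samples=10000):
    """Sample sources until the running sum exceeds c*n, then rescale by n/k."""
    n = len(adj)
    running_sum, k = 0.0, 0
    while running_sum <= c * n and k < max_samples:
        k += 1
        running_sum += source_dependency(adj, random.choice(list(adj)), e)
    return (n / k) * running_sum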
In this exercise, do the following:
References
1. Flake, Gary William, Robert E. Tarjan, and Kostas Tsioutsiouliklis. 2002. Clustering methods
based on minimum-cut trees. Technical report, technical report 2002–06, NEC, Princeton, NJ.
2. Girvan, Michelle, and Mark E.J. Newman. 2002. Community structure in social and biological
networks. Proceedings of the National Academy of Sciences 99 (12): 7821–7826.
[MathSciNet][Crossref]
3. Gomory, Ralph E., and Tien Chung Hu. 1961. Multi-terminal network flows. Journal of the
Society for Industrial and Applied Mathematics 9 (4): 551–570.
4. Granovetter, Mark S. 1977. The strength of weak ties. In Social networks, 347–367. Elsevier.
5. Newman, Mark E.J. 2011. Scientific collaboration networks. II. Shortest paths, weighted
networks, and centrality. Physical Review E 64 (1): 016132.
6. Onnela, J.-P., Jari Saramäki, Jorkki Hyvönen, György Szabó, David Lazer, Kimmo Kaski, János
Kertész, and A.-L. Barabási. 2007. Structure and tie strengths in mobile communication networks.
Proceedings of the National Academy of Sciences 104 (18): 7332–7336.
© Springer Nature Switzerland AG 2018
Krishna Raj P.M., Ankith Mohan and K.G. Srinivasa, Practical Social Network Analysis with Python,
Computer Communications and Networks
https://doi.org/10.1007/978-3-319-96746-2_15
Krishna Raj P. M.
Email: krishnarajpm@gmail.com
../images/462433_1_En_15_Chapter/462433_1_En_15_Fig1_HTML.gif
Fig. 15.1 a Graph of the Zachary Karate Club network where nodes represent members and edges
indicate friendship between members. b Two-dimensional visualization of node embeddings
generated from this graph using the DeepWalk method. The distances between nodes in the
embedding space reflect proximity in the original graph, and the node embeddings are spatially
clustered according to the different colour-coded communities
../images/462433_1_En_15_Chapter/462433_1_En_15_Fig2_HTML.gif
Fig. 15.2 Graph of the Les Misérables novel where nodes represent characters and edges indicate
interaction at some point in the novel between corresponding characters. (Left) Global positioning of
the nodes. Same colour indicates that the nodes belong to the same community. (Right) Colour
denotes structural equivalence between nodes, i.e., they play the same roles in their local
neighbourhoods. Blue nodes are the articulation points. These equivalences were generated using the
node2vec algorithm
../images/462433_1_En_15_Chapter/462433_1_En_15_Fig3_HTML.gif
Fig. 15.3 Illustration of the neighbourhood aggregation methods. To generate the embedding for a
node, these methods first collect the node’s k-hop neighbourhood. In the next step, these methods
aggregate the attributes of the node’s neighbours, using neural network aggregators. This aggregated
neighbourhood information is used to generate an embedding, which is then fed to the decoder
References
1. Ahmed, Amr, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J.
Smola. 2013. Distributed large-scale natural graph factorization. In Proceedings of the 22nd
international conference on World Wide Web, 37–48. ACM.
2. Bronstein, Michael M., Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst.
2017. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing
Magazine 34 (4): 18–42.
3. Bruna, Joan, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and
locally connected networks on graphs. arXiv:1312.6203.
4. Cao, Shaosheng, Wei Lu, and Qiongkai Xu. 2015. Grarep: Learning graph representations with
global structural information. In Proceedings of the 24th ACM international on conference on
information and knowledge management, 891–900. ACM.
5. Cao, Shaosheng, Wei Lu, and Qiongkai Xu. 2016. Deep neural networks for learning graph
representations. In AAAI, 1145–1152.
6. Chang, Shiyu, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, and Thomas S. Huang.
2015. Heterogeneous network embedding via deep architectures. In Proceedings of the 21st
ACM SIGKDD international conference on knowledge discovery and data mining, 119–128.
ACM.
7. Chen, Haochen, Bryan Perozzi, Yifan Hu, and Steven Skiena. 2017. Harp: Hierarchical
representation learning for networks. arXiv:1706.07845.
8. Dai, Hanjun, Bo Dai, and Le Song. 2016. Discriminative embeddings of latent variable models
for structured data. In International conference on machine learning, 2702–2711.
9. Defferrard, Michaël, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural
networks on graphs with fast localized spectral filtering. In Advances in neural information
processing systems, 3844–3852.
10. Dong, Yuxiao, Nitesh V. Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable
representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD
international conference on knowledge discovery and data mining, 135–144. ACM.
11. Donnat, Claire, Marinka Zitnik, David Hallac, and Jure Leskovec. 2017. Spectral graph wavelets
for structural role similarity in networks. arXiv:1710.10321.
12. Duvenaud, David K., Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel,
Alán Aspuru-Guzik, and Ryan P. Adams. 2015. Convolutional networks on graphs for learning
molecular fingerprints. In Advances in neural information processing systems, 2224–2232.
13. Grover, Aditya, and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In
Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and
data mining, 855–864. ACM.
14. Hamilton, Will, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on
large graphs. In Advances in neural information processing systems, 1025–1035.
15. Hamilton, William L., Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs:
Methods and applications. arXiv:1709.05584.
16. Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. 2006. Reducing the dimensionality of data
with neural networks. Science 313 (5786): 504–507.
17. Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation
9 (8): 1735–1780.
18. Kipf, Thomas N., and Max Welling. 2016. Semi-supervised classification with graph
convolutional networks. arXiv:1609.02907.
19. Kipf, Thomas N., and Max Welling. 2016. Variational graph auto-encoders. arXiv:1611.07308.
20. Li, Yujia, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence
neural networks. arXiv:1511.05493.
21. Nickel, Maximilian, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2016. A review of
relational machine learning for knowledge graphs. Proceedings of the IEEE 104 (1): 11–33.
22. Ou, Mingdong, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 2016. Asymmetric
transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD
international conference on knowledge discovery and data mining, 1105–1114. ACM.
23. Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social
representations. In Proceedings of the 20th ACM SIGKDD international conference on
knowledge discovery and data mining, 701–710. ACM.
24. Pham, Trang, Truyen Tran, Dinh Q. Phung, and Svetha Venkatesh. 2017. Column networks for
collective classification. In AAAI, 2485–2491.
25. Ribeiro, Leonardo F.R., Pedro H.P. Saverese, and Daniel R. Figueiredo. 2017. struc2vec:
Learning node representations from structural identity. In Proceedings of the 23rd ACM
SIGKDD international conference on knowledge discovery and data mining, 385–394. ACM.
26. Scarselli, Franco, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini.
2009. The graph neural network model. IEEE Transactions on Neural Networks 20 (1): 61–80.
27. Schlichtkrull, Michael, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and
Max Welling. 2017. Modeling relational data with graph convolutional networks. arXiv:1703.06103.
28. Tang, Jian, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line:
Large-scale information network embedding. In Proceedings of the 24th international
conference on World Wide Web, 1067–1077. (International World Wide Web Conferences
Steering Committee).
29. van den Berg, Rianne, Thomas N. Kipf, and Max Welling. 2017. Graph convolutional matrix
completion. Statistics 1050: 7.
30. Wang, Daixin, Peng Cui, and Wenwu Zhu. 2016. Structural deep network embedding. In
Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and
data mining, 1225–1234. ACM.
31. Zitnik, Marinka, and Jure Leskovec. 2017. Predicting multicellular function through multi-layer
tissue networks. Bioinformatics 33 (14): i190–i198.
Index
A
Absolute spam mass
Absolute status
Acquaintance networks
Action
Active node
Adjacency-based proximity measure
Adjacency list
Adjacency matrix
Affiliation network
African-American
Age
Agglomerative graph partitioning methods
Aggregated neighbourhood vector
Aggregation function
Alfréd Rényi
Ally request
Amazon.com
Ambassador node
Anchor
Approximate balanced graph
Approximate betweenness centrality
Approximation algorithm
ARPANET
Articulation vertex
Artificial sink
Asynchronous IO
Atomic propagation
Authorities
Authority score
Authority Update rule
Autoencoder objective
Autonomous system
Average clustering coefficient
Average degeneracy
Average degree
Average distance
Average neighbourhood size
Average out-degree estimate
Average path length
B
Backtracking
Backward BFS traversal
Backward burning probability
Backward direction
Backward-forward step
Bacon number
Balance heuristic
Balance theorem
Balance-based reasoning
Balanced
Balanced dataset
Balanced graph
Balanced triad
BalanceDet
BalanceLrn
Ball of radius
Ballot-blind prediction
Barrel
B-ary tree
Base Set
Basic PageRank Update rule
Basic reproductive number
Basis set
Batch-mode crawler
Battle of the Water Sensor Networks
Bayesian inference method
Behaviour
Béla Bollobás
Benefit-cost greedy algorithm
Bernoulli distribution
Betweenness
BFS algorithm
BFS traversals
BigFiles
Bilingual
Bimodal
Binary action vector
Binary classification label
Binary evaluation vectors
Binary tree structure
Binomial distribution
Bipartite graph
Bipolar
BitTorrent
Blog
Blogspace
Blue chip stock holders
Bollobás configuration model
Boosting nodes
Boston
Boston random group
Bowtie
Bradford law
Branching process
Breadth first searches
Breadth-first search
Bridge edge
Brilliant-but-cruel hypothesis
B-tree index
Burst of activity
C
California
Cambridge
Cascade
Cascade capacity
Cascading failure
Caucasian
CELF optimization
CELF++
Center-surround convolutional kernel
Chain lengths
Chord
Chord software
Classification accuracy
Cloning
Cloning step
Cluster
Clustering
Clustering coefficient
Co-citation
Collaboration graph
Collaborative filtering systems
Collection analysis module
Collective action
Columbia small world study
Columbia University
Column networks
Comments
Common knowledge
Community
Community discovery
Community Guided Attachment
Complete cascade
Complete crawl
Complete graph
Completely connected
Complex systems
Component
Component size distribution
Computed product-average star rating
Conceptual distance
Concurrency
Conductance
Configuration
Conformity hypothesis
Connected
Connected component
Connected components
Connected undirected graph
Connectivity
Consistent hashing
Constrained triad dynamics
Constructive intermediate embedding
Contact network
Contagion
Contagion probability
Contagion threshold
Content Addressable Network
Content creation
Content similarity
Contextualized link
Continuous vector representation
Convolutional
Convolutional coarsening layer
Convolutional encoder
Convolutional molecular fingerprinting
Coordination game
Copying model
Cost-Effective Forward selection algorithm
Cost-Effective Lazy Forward
Cost-Effective Lazy Forward selection algorithm
CouchSurfing.com
Crawl and stop
Crawl and stop with threshold
Crawl control
Crawler
Crawler revisit frequency
Crawling
Crestline
Critical number
Cross-entropy loss
Cross-validation approach
Cultural fad
Cut
Cut clustering algorithm
Cycle
D
Dampened trust score
Dark-web
Datastore
Decentralized algorithm
Decentralized search
Decision-making
Decoder
Decoder mapping
Decoder proximity value
Deep learning
Deep Neural Graph Representations
DeepWalk
Degeneracy
Degree
Degree assortativity
Degree-based navigation
Degree centrality
Degree discount
Degree distribution
Degree of separation
DELICIOUS
Dendrogram
Dense neural network layer
Densification power-law
Dependency
Deterministic Kronecker graph
Diameter
Differential status
Diffusion
Diffusion of norm
Diffusion-based algorithm
Dimensionality reduction
Dip
Direct encoding
Direct propagation
Directed acyclic graph
Directed graph
Directed multigraph
Directed self-looped graph
Directed unweighted graph
Directed weighted graph
DISCONNECTED COMPONENT
Disconnected graph
Disconnected undirected graph
Dispersion
Distance
Distance centrality
Distance-Dependent Kronecker graphs
Distance-dependent Kronecker operator
Distance distribution
Distance matrix
Distributed greedy algorithm
Distrust propagations
Divisive graph partitioning methods
DNS cache
DNS lookup
DocID
DocIndex
Document stream
Dodds
Dose-response model
d-regular graph
DumpLexicon
Dynamic monopoly
Dynamic network
Dynamic programming
Dynamic routing table
Dynamo
E
EachMovie
Early adopters
Edge attributes
Edge destination selection process
Edge embedding
Edge initiation process
Edge list
Edge reciprocation
Edges
Edge sign prediction
EDonkey2000
Effective diameter
Effective number of vertices
Egalitarian
Ego
Ego-networks
Eigen exponent
Eigenvalue propagation
Element-wise max-pooling
Element-wise mean
Embeddedness
Embedding lookup
Embedding space
Encoder
Encoder-decoder
Encoder mapping
Enhanced-clustering cache replacement
EPANET simulator
Epinions
Equilibrium
Erdös number
Essembly.com
Evaluation function
Evaluation-similarity
Evolutionary model
Exact betweenness centrality
Expansion
Expectation Maximisation
Expected penalty
Expected penalty reduction
Expected profit lift
Expected value navigation
Exponent
Exponential attenuation
Exponential distribution
Exponential tail
Exposure
F
Facebook
Facebook friends
Facebook user
Faction membership
Factor
FastTrack
Fault-tolerance
Feature vector
Finger table
First-order graph proximity
Fixed deterministic distance measure
FLICKR
Foe
Folded graph
Forest Fire
Forest Fire model
Forward BFS traversal
Forward burning probability
Forward direction
Four degrees of separation
Freenet
Frequency
Frequency ratio
Frequency table
Freshness
Friend
Friend request
Friends-of-friends
Fully Bayesian approach
G
Gated recurrent unit
General threshold model
Generating functions
Generative baseline
Generative surprise
Geographic distance
Geographical proximity
Giant component
Girvan-Newman algorithm
Global cascade
Global information diffusion
Global inverted file
Global profit lift
Global rounding
Gnutella
Goodness
Goodness-of-fit
Google
"Go with the winners" algorithm
Graph
Graph coarsening layer
Graph coarsening procedure
Graph convolutional networks
Graph Factorization algorithm
Graph Fourier transform
Graph neural networks
Graph structure
GraphSAGE algorithm
GraphWave
GraRep
Greedy search
Grid
Group-induced model
Group structure
Growth power-law
H
Hadoop
Half-edges
Hand-engineered heuristic
HARP
Hash-based organization
Hash-bucket
Hash distribution policy
Hashing mapping
Hashtags
Heat kernel
Helpfulness
Helpfulness evaluation
Helpfulness ratio
Helpfulness vote
Hierarchical clustering
Hierarchical distance
Hierarchical model
Hierarchical softmax
High-speed streaming
Hill-climbing approach
Hill climbing search
Hit
Hit list
Hive
Hollywood
Homophily
Honey pot
Hop distribution
HOPE
Hop-plot exponent
Hops-to-live limit
Hops-to-live value
Hot pages
Hub
Hub-authority update
Hub score
Hub Update rule
Human wayfinding
Hybrid hashed-log organization
HyperANF algorithm
HyperLogLog counter
I
Ideal chain lengths frequency distribution
Identifier
Identifier circle
Ignorant trust function
IN
In
Inactive node
Incentive compatible mechanism
Income stratification
Incremental function
In-degree
In-degree distribution
In-degree heuristic
Independent cascade model
Indexer
Index sequential access mode
Individual-bias hypothesis
Indivisible
Infected
Infinite grid
Infinite-inventory
Infinite line
Infinite paths
Infinite-state automaton
Influence
Influence maximization problem
Influence weights
Influential
Information linkage graph
Initiator graph
Initiator matrix
In-links
Inner-product decoder
Innovation
In-place update
Instance matrix
Instant messenger
Interaction
Inter-cluster cut
Inter-cluster weight
Interest Driven
Intermediaries
Internet
Internet Protocol
Intra-cluster cut
Intrinsic value
Inverted index
Inverted list
Inverted PageRank
Irrelevant event
Isolated vertex
Iterative propagation
J
Joint positive endorsement
K
Kademlia
Kansas
Kansas study
KaZaA
k-core
Kernighan-Lin algorithm
Kevin Bacon
Key algorithm
k-hop neighbourhood
KL-divergence metric
Kleinberg model
k-regular graph
Kronecker graph
Kronecker graph product
Kronecker-like multiplication
Kronecker product
KRONEM algorithm
KRONFIT
L
Laplacian eigenmaps objective
Lattice distance
Lattice points
Lazy evaluation
Lazy replication
LDAG algorithm
Least recently used cache
Leave-one-out cross-validation
Leaves
Like
Likelihood ratio test
LINE method
Linear threshold model
LINKEDIN
Link prediction
Link probability
Link probability measure
Links
Links database
LiveJournal
LiveJournal population density
LiveJournal social network
Local bridge
Local contacts
Local inverted file
Local rounding
Local triad dynamics
Location Driven
Login correlation
Logistic regression
Logistic regression classifier
Log-log plot
Log-structured file
Log-structured organization
Long-range contacts
Long tail
LOOK AHEAD OPTIMIZATION
Lookup table
Los Angeles
Lotka distribution
Low-dimensional embedding
Low neighbour growth
LSTM
M
Machine learning
Macroscopic structure
Mailing lists
Majority rounding
Marginal gain
Marketing action
Marketing plan
Markov chain
Massachusetts
Mass-based spam detection algorithm
Matching algorithm
Matrix-factorization
Maximal subgraph
Maximum likelihood estimation
Maximum likelihood estimation technique
Maximum number of edges
Max-likelihood attrition rate
Max-pooling neural network
Mean-squared-error loss
Message-forwarding experiment
Message funneling
Message passing
Metropolis sampling algorithm
Metropolized Gibbs sampling approach
Microsoft Messenger instant-messaging system
Minimum cut
Minimum cut clustering
Minimum cut tree
Minimum description length approach
Missing past
Mobile call graph
Model A
Model B
Model C
Model D
Modularity
Modularity matrix
Modularity penalty
Molecular graph representation
Monitored node
Monotone threshold function
Monte Carlo switching steps
M-step trust function
Muhamad
Multicore breadth first search
Multifractal network generator
Multigraph
Multi-objective problem
Multiple ambassadors
Multiple edge
Multiplex propagation
N
Naive algorithm
Natural greedy hill-climbing strategy
Natural self-diminishing property
Natural-world graph
Navigable
Navigation agents
Navigation algorithm
Nebraska
Nebraska random group
Nebraska study
Negative attitude
Negative opinion
Negative sampling
Negative spam mass
Neighbourhood aggregation algorithm
Neighbourhood aggregation method
Neighbourhood function
Neighbourhood graph
Neighbourhood information
Neighbourhood overlap
Neighbours
Neighbour set
Nemesis request
Network
Network value
Neural network layer
New York
Newman-Zipf
Node arrival process
Node classification
Node embedding
Node embedding approach
Node-independent
Node-independent path
Nodes
Node2vec
Node2vec optimization algorithm
Noisy Stochastic Kronecker graph model
Non-dominated solution
Non-navigable graph
Non-searchable
Non-searchable graph
Non-simple path
Non-unique friends-of-friends
Normalizing factor
NP-hard
Null model
O
Occupation similarity
Ohio
OhmNet
Omaha
1-ball
One-hop neighbourhood
One-hot indicator vector
One-step distrust
Online contaminant monitoring system
Online social applications
Optimal marketing plan
Oracle function
Ordered degree sequence
Ordered trust property
Ordinary influencers
Organizational hierarchy
Orphan
OUT
Out
Outbreak detection
Outbreak detection problem
Out-degree
Out-degree distribution
Out-degree exponent
Out-degree heuristic
Out-links
Overloading
Overnet
P
Page addition/insertion
Page change frequency
PageRank
PageRank contribution
Pagerank threshold
PageRank with random jump distribution
Page repository
Pages
Pairwise decoder
Pairwise orderedness
Pairwise proximity measure
Paradise
Parameter matrix
Parent set
Pareto law
Pareto-optimal
Pareto-optimal solution
Partial cascade
Partial crawl
Participation rates
Pastry
Path
Path length
Pattern discovery
Paul Erdös
Payoff
Peer-To-Peer overlay networks
Penalty
Penalty reduction
Penalty reduction function
Permutation model
Persistence
Personal threshold
Personal threshold rule
Phantom edges
Phantom nodes
PHITS algorithm
Physical page organization
Pluralistic ignorance
Poisson distribution
Poisson's law
Polling game
Polylogarithmic
Polysemy
Popularity Driven
Positive attitude
Positive opinion
Positivity
Power law
Power law degree distribution
Power-law distribution
Power law random graph models
Predecessor pointer
Prediction error
Preferential attachment
Preferential attachment model
Pre-processing step
Principle of Repeated Improvement
Probability of persistence
Probit slope parameter
Product average
Professional ties
Propagated distrust
Proper social network
Proportional refresh policy
Protein-protein interaction graph
Proxy requests
Pseudo-unique random number
P-value
Q
Quality-only straw-man hypothesis
Quantitative collaborative filtering algorithm
Quasi-stationary dynamic state
Query engine
R
Random access
Random-failure hypothesis
Random graph
Random initial seeders
Random jump vector
Random page access
Random walk
Rank
Rank exponent
Ranking
Rating
Realization matrix
Real-world graph
Real-world systems
Receptive baseline
Receptive surprise
Recommendation network
Recommender systems
Register
Regular directed graph
Regular graph
Reinforcement learning
Relative spam mass
Relevant event
Removed
Representation learning
Representation learning techniques
Request hit ratio
Resolution
Resource-sharing
Reviews
Rich-get-richer phenomenon
R-MAT model
Root Set
Roster
Rounding methods
Routing path
Routing table
S
SALSA algorithm
Scalability
Scalable
Scalarization
Scaled PageRank Update rule
Scale-free
Scale-free model
Scaling
SCC algorithm
Searchable
Searchable graph
Search engine
Searcher
Searching
Search query
Second-order encoder-decoder objective
Second-order graph proximity
Self-edge
Self-loop
Self-looped edge
Self-looped graph
Sensor
Seven degree of separation
Shadowing
Sharon
Sibling page
Sigmoid function
Sign
Signed-difference analysis
Signed network
Similarity
Similarity-based navigation
Similarity function
SIMPATH
Simple path
Simple Summary Statistics
Single pass
Sink vertex
SIR epidemic model
SIR model
SIRS epidemic model
SIRS model
SIS epidemic model
Six Degrees of Kevin Bacon
Six degrees of separation
Slashdot
Small world
Small world phenomena
Small world phenomenon
Small world property
Small-world acquaintance graph
Small-world experiment
Small-world hypothesis
Small-world models
Small-world structure
SNAP
Social epidemics
Social intelligence
Social network
Social similarity
Social stratification
Sorter
Source vertex
Spam
Spam detection
Spam farms
Spam mass
Spectral analysis
Spectral graph wavelet
Spid
Stack Overflow
Staircase effects
Stanley Milgram
Star rating
Starting population bias
State transition
Static score distribution vector
Status heuristic
Status-based reasoning
StatusDet
StatusLrn
Steepest-ascent hill-climbing search
Stickiness
Stochastic averaging
Stochastic Kronecker graph model
Stochastic process
Storage
Storage nodes
Store Server
Strategic alliance
Straw-man quality-only hypothesis
Streaming access
Strongly Connected Component (SCC)
Strongly connected directed graph
Strong ties
Strong Triadic Closure property
Structural balance
Structural balance property
Structural Deep Network Embeddings
Structural role
Structured P2P networks
Structure index
Struc2vec
Subgraph classification
Submodular
Submodularity
Subscriptions
Successor
Super-spreaders
Supernodes
Superseeders
Supervised learning
Surprise
Susceptible
Susceptible-Infected-Removed cycle
Switching algorithm
Synonymy
Systolic approach
T
Tag-similarity
Tapestry
Targeted immunization
Target node
Target reachability
Technological graph
Temporal graph
TENDRILS
Text index
Theory of balance
Theory of status
Theory of structural balance
Theory of triad types
Threshold
Threshold rule
Threshold trust property
Tightly-knit community
Time-expanded network
Time graph
Time-outs
Time-To-Live
Topic
Topic drift
Tracer cards
Traditional information retrieval
Transient contact network
Transpose trust
Triadic closure property
Triggering model
Triggering set
True value
Trust
Trust coupling
Trust dampening
Trust function
Trust only
TrustRank algorithm
Trust splitting
Twitter idiom hashtag
Two-hop adjacency neighbourhood
U
Unbalanced
Unbalanced graph
Undirected component
Undirected graph
Undirected multigraph
Undirected self-looped graph
Undirected unweighted graph
Undirected weighted graph
Uniform distribution policy
Uniform immunization programmes
Uniform random d-regular graph
Uniform random jump distribution
Uniform refresh policy
Uniform sampling
Union-Find
unique
Unique friends-of-friends
Unit cost algorithm
Unit-cost greedy algorithm
Unstructured P2P network
Unsupervised learning
Unweighted graph
Urban myth
URL Resolver
URL Server
User evaluation
Utility index
V
Variance-to-mean ratio
VERTEX COVER OPTIMIZATION
Vertices
Viceroy
Viral marketing
Visualization
W
Walk
Water distribution system
Watts
Watts–Strogatz model
WCC algorithm
WeakBalDet
Weakly connected directed graph
Weak structural balance property
Weak ties
Web
Webbiness
Web crawls
Weighted auxiliary graph
Weighted cascade
Weighted graph
Weighted linear combinations
Weighted path count
Who-talks-to-whom graph
Who-transacts-with-whom graph
Wichita
Wikipedia
Wikipedia adminship election
Wikipedia adminship voting network
Wikispeedia
Word-level parallelism
Y
YAHOO ANSWERS
Z
Zachary Karate Club network
Zero-crossing
Zipf distribution
Zipf law