Eric Ma and Mridul Seth - Network Analysis Made Simple An Introduction To Network Analysis and Applied Graph Theory Using Python and NetworkX-leanpub - Com (2021)
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.
Contents

Preface
Learning Goals
    Technical Takeaways
    Intellectual Goals
Introduction to Graphs
    Introduction
    A formal definition of networks
    Examples of Networks
    Types of Graphs
    Edges define the interesting part of a graph
The NetworkX API
    Introduction
    Data Model
    Load Data
    Coding Patterns
    Further Reading
    Further Exercises
    Solution Answers
Graph Visualization
    Introduction
    Hairballs
    Matrix Plot
    Arc Plot
    Circos Plot
    Hive Plot
    Principles of Rational Graph Viz
Hubs
    Introduction
    Reflections
    Solutions
Paths
    Introduction
    Breadth-First Search
    Visualizing Paths
    Bottleneck nodes
    Recap
    Solutions
Structures
    Introduction
    Triangles
    Triadic Closure
    Cliques
    Connected Components
    Solutions
Graph I/O
    Introduction
    Graph Data as Tables
    Dataset
    Graph Model
    Pickling Graphs
    Other text formats
    Solutions
Testing
    Introduction
    Why test?
    What to test
    Continuous data testing
    Further reading
Bipartite Graphs
    Introduction
    What are bipartite graphs?
    Dataset
    Bipartite Graph Projections
    Weighted Projection
    Degree Centrality
    Solutions
Technical Takeaways
Firstly, we would like you to become familiar with the NetworkX application programming
interface (API). We chose NetworkX because it is extremely beginner-friendly and has an API
that matches graph theory concepts very closely.
Secondly, we would like to show you how you can visualize graph data in a fashion that doesn’t
involve showing mere hairballs. Throughout the book, you will see examples of what we call rational
graph visualizations. One of our authors, Eric Ma, has developed a companion package, nxviz, that
provides a declarative and convenient API (in other words an attempt at a “grammar”) for graph
visualization.
Thirdly, in this book, you will be introduced to basic graph algorithms, such as finding special graph
structures, or finding paths in a graph. Graph algorithms will show you how to “think on graphs”,
and knowing how to do so will broaden your ability to interact with graph data structures.
Fourthly, you will also be equipped with the connection between graph theory and other areas of
math and computing, such as statistical inference and linear algebra.
Intellectual Goals
Beyond the technical takeaways, we hope to broaden how you think about data.
The first idea we hope to give you is the ability to think about your data in terms of "relationships".
As you will learn, relationships are what make graphs interesting. That's where
relational insights can come to the fore.
The second idea we hope to give you is the ability to "think on graphs". This comes with practice.
Once you master it, though, you will find yourself becoming more and more comfortable with
algorithmic thinking, which is where you look at a problem in terms of the algorithm that solves it.
Introduction to Graphs
Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="k4KHoLC7TFE", width="100%")
A formal definition of networks

A graph G can be described by two sets: a node set n and an edge set e, in which each edge is a pair of nodes drawn from the node set. The node set might look like:

n = {a, b, c, d, ...}

and the edge set might look like:

e = {(a, b), (a, c), ...}
If you extracted every node from the edge set e, it should form at least a subset of the node set n. (It
is at least a subset because not every node in n might participate in an edge.)
If you draw out a network, the “nodes” are commonly represented as shapes, such as circles, while
the “edges” are the lines between the shapes.
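The node-set/edge-set relationship described above can be sketched in a few lines of Python (the node names here are hypothetical, mirroring the sets above):

```python
# A node set and an edge set, using hypothetical node names.
n = {"a", "b", "c", "d"}
e = {("a", "b"), ("b", "c")}

# Extract every node that participates in some edge.
nodes_in_edges = {node for edge in e for node in edge}

# They form a subset of the node set; "d" participates in no edge.
assert nodes_in_edges <= n
assert "d" not in nodes_in_edges
```

Node "d" sits in the node set but touches no edge, which is exactly why the nodes extracted from e form only a subset of n.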
Examples of Networks
Now that we have a proper definition of a graph, let’s move on to explore examples of graphs.
One example I (Eric Ma) am fond of, based on my background as a biologist, is a protein-protein
interaction network. Here, the nodes are individual proteins, and an edge between two proteins
indicates that they physically interact with one another.
A more colloquial example of networks is an air transportation network. Here, the nodes are
airports, and an edge between two airports indicates that a flight connects them.
And another even more relatable example would be our ever-prevalent social networks! With Twitter,
the nodes are user accounts, and a directed edge from one user to another indicates that the first
user follows the second.
Now that you’ve seen the framework for defining a graph, we’d like to invite you to answer the
following question: What examples of networks have you seen before in your profession?
Go ahead and list them out.
Types of Graphs
As you probably can see, graphs are a really flexible data model for modelling the world, as long as
the nodes and edges are strictly defined. (If the nodes and edges are sloppily defined, well, we run
into a lot of interpretability problems later on.)
If you are a member of both LinkedIn and Twitter, you might intuitively think that there’s a slight
difference in the structure of the two “social graphs”. You’d be absolutely correct on that count!
Twitter is an example of what we would intuitively call a directed graph. Why is this so? The key
here lies in how interactions are modelled. One user can follow another, but the other need not
necessarily follow back. As such, there is a directionality to the relationship.
LinkedIn is an example of what we would intuitively call an undirected graph. Why is this so? The
key here is that when two users are LinkedIn connections, we automatically assign a bi-directional
edge between them. As such, for convenience, we can collapse the bi-directional edge into an
undirected edge, thus yielding an undirected graph.
If we wanted to turn LinkedIn into a directed graph, we might keep information on who initiated
the invitation. In that way, the direction of each edge records who sent the invitation, and the
relationship is no longer automatically bi-directional.
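The Twitter/LinkedIn distinction can be sketched with NetworkX directly (the user names here are hypothetical):

```python
import networkx as nx

# Twitter-style follows: direction matters, so use a DiGraph.
follows = nx.DiGraph()
follows.add_edge("alice", "bob")  # alice follows bob...
assert follows.has_edge("alice", "bob")
assert not follows.has_edge("bob", "alice")  # ...but bob need not follow back

# LinkedIn-style connections: always mutual, so an undirected Graph suffices.
connections = nx.Graph()
connections.add_edge("alice", "bob")
assert connections.has_edge("alice", "bob")
assert connections.has_edge("bob", "alice")  # same edge, seen from either end
```

In the undirected case, the two has_edge queries hit the very same edge, which is the "collapsed" bi-directional relationship described above.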
Edges define the interesting part of a graph
The heart of a graph lies in its edges, not in its nodes. (John Quackenbush, Harvard School
of Public Health)
Indeed, this is a key point to remember! Without edges, the nodes are merely collections of entities.
In a data table, they would correspond to the rows. That alone can be interesting, but doesn’t yield
relational insights between the entities.
The NetworkX API
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'
Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id='sdF0uJo2KdU', width="100%")
Data Model
In NetworkX, graph data are stored in a dictionary-like fashion. They are placed under a Graph
object, canonically instantiated with the variable G as follows:
1 G = nx.Graph()
Of course, you are free to name the graph anything you want!
Nodes are part of the attribute G.nodes. There, the node data are housed in a dictionary-like container,
where the key is the node ID and the values are a dictionary of attributes. Node data are accessible
using syntax that looks like:
1 G.nodes[node1]
Edges are part of the attribute G.edges, which is also stored in a dictionary-like container. Edge data
are accessible using syntax that looks like:
1 G.edges[node1, node2]
Because of the dictionary-like implementation of the graph, any hashable object can be a node. This
means strings and tuples, but not lists and sets.
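A quick sketch of why: NetworkX keys its internal dictionaries by node, so a node must be hashable in exactly the same way a dictionary key must be. The values below are purely illustrative:

```python
# Hashable objects (usable as dict keys, and therefore as nodes):
hash("student_1")   # strings are hashable
hash((1, 2))        # tuples of hashables are hashable

# Unhashable objects (not usable as nodes):
for bad_node in ([1, 2], {1, 2}):
    try:
        hash(bad_node)
        raise AssertionError("should not be hashable")
    except TypeError:
        pass  # lists and sets raise TypeError, as expected
```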
Load Data
Let’s load some real network data to get a feel for the NetworkX API. This dataset² comes from a
study of 7th grade students.
This directed network contains proximity ratings among 29 seventh-grade students from
a school in Victoria. Among other questions, the students were asked to
nominate their preferred classmates for three different activities. A node represents a
student. An edge between two nodes shows that the left student picked the right student
as his or her answer. The edge weights are between 1 and 3 and show how often the left
student chose the right student as his/her favourite.
In the original dataset, students were from an all-boys school. However, I have modified the dataset
to instead be a mixed-gender school.
1 import networkx as nx
2 from datetime import datetime
3 import matplotlib.pyplot as plt
4 import numpy as np
5 import warnings
6 from nams import load_data as cf
7
8 warnings.filterwarnings('ignore')
1 G = cf.load_seventh_grader_network()
²http://konect.uni-koblenz.de/networks/moreno_seventh
1 type(G)
1 networkx.classes.digraph.DiGraph
The DiGraph type tells us that the graph is a directed one.
If it were undirected, the type would change:
1 H = nx.Graph()
2 type(H)
1 networkx.classes.graph.Graph
1 list(G.nodes())[0:5]
1 [1, 2, 3, 4, 5]
G.nodes() returns a “view” on the nodes. We can’t slice into the view to grab out a sub-selection,
but we can at least see what nodes are present. For brevity, we have passed G.nodes() into the
list() constructor and sliced the result, so that we don’t pollute the output. Because a NodeView
is iterable, though, we can query it for its length:
1 len(G.nodes())
1 29
If our nodes have metadata attached to them, we can view the metadata at the same time by passing
in data=True:
1 list(G.nodes(data=True))[0:5]
1 G.nodes[1]
1 {'gender': 'male'}
Now, because a NodeDataView is dictionary-like, looping over G.nodes(data=True) is very much like
looping over key-value pairs of a dictionary. As such, we can write things like:
1 for n, d in G.nodes(data=True):
2 # n is the node
3 # d is the metadata dictionary
4 ...
This is analogous to how we would loop over the key-value pairs of a dictionary:

1 for k, v in dictionary.items():
2     # do stuff in the loop
With this dictionary-like syntax, we can query back the metadata that’s associated with any node.
1 list(G.edges())[0:5]
1 [(1, 2), (1, 3), (1, 4), (1, 5), (1, 6)]
Similar to the NodeView, G.edges() returns an EdgeView that is also iterable. As with above, we have
abbreviated the output inside a sliced list to keep things readable. Because G.edges() is iterable, we
can get its length to see the number of edges that are present in a graph.
1 len(G.edges())
1 376
1 list(G.edges(data=True))[0:5]
Additionally, it is possible for us to select out individual edges, as long as they exist in the graph:
1 G.edges[15, 10]
1 {'count': 2}
However, if we ask for an edge that does not exist in the graph, such as (15, 16), we get a KeyError:

1 ---------------------------------------------------------------------------
2 KeyError Traceback (most recent call last)
3 <ipython-input-21-ce014cab875a> in <module>
4 ----> 1 G.edges[15, 16]
5
6 ~/anaconda/envs/nams/lib/python3.7/site-packages/networkx/classes/reportviews.py in \
7 __getitem__(self, e)
8 928 def __getitem__(self, e):
9 929 u, v = e
10 --> 930 return self._adjdict[u][v]
11 931
12 932 # EdgeDataView methods
13
14 KeyError: 16
As with the NodeDataView, the EdgeDataView is dictionary-like, with the difference being that the
keys are 2-tuple-like instead of being single hashable objects. Thus, we can write syntax like the
following to loop over the edgelist:

1 for n1, n2, d in G.edges(data=True):
2     # n1 and n2 are the nodes; d is the metadata dictionary
3     ...
Likewise, you can test your answer using the test function below:
1 def test_maxcount(maxcount):
2 assert maxcount == 3
3
4 test_maxcount(maxcount)
Now, let’s learn how to manipulate the graph. Specifically, we’ll learn how to add nodes and edges
to a graph.
Adding Nodes
The NetworkX graph API lets you add a node easily:

1 G.add_node(node, key1=value1, key2=value2)
Adding Edges
It also allows you to add an edge easily:

1 G.add_edge(node1, node2, key1=value1)
You can verify that the graph has been correctly created by executing the test function below.
1 def test_graph_integrity(G):
2 assert 30 in G.nodes()
3 assert 31 in G.nodes()
4 assert G.nodes[30]['gender'] == 'male'
5 assert G.nodes[31]['gender'] == 'female'
6 assert G.has_edge(30, 31)
7 assert G.has_edge(30, 7)
8 assert G.has_edge(31, 7)
9 assert G.edges[30, 7]['count'] == 3
10 assert G.edges[7, 30]['count'] == 3
11 assert G.edges[31, 7]['count'] == 3
Coding Patterns
These are some recommended coding patterns when doing network analysis using NetworkX, which
stem from my personal experience with the package.

When iterating over nodes and their metadata, unpack both into clearly-named variables:

1 for n, d in G.nodes(data=True):
2     ...

or, when iterating over edges:

1 for n1, n2, d in G.edges(data=True):
2     ...

If the graph you are constructing is a directed graph, with a “source” and “sink” available, then I
would recommend the following naming of variables instead:

1 for sc, sk, d in G.edges(data=True):
2     ...

or

1 for src, dst, d in G.edges(data=True):
2     ...
Further Reading
For a deeper look at the NetworkX API, be sure to check out the NetworkX docs³.
Further Exercises
Here are some further exercises that you can use to get some practice.
Hint: the goal here is to get a list of edges for which the reverse edge is not present.
Hint: You may need the class method G.has_edge(n1, n2). This returns whether a graph has an edge
between the nodes n1 and n2.
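As a sketch of how G.has_edge can be combined with a list comprehension for this kind of task (on a small hypothetical directed graph, not the student network itself):

```python
import networkx as nx

# A hypothetical directed graph: (1, 2) is reciprocated, the others are not.
D = nx.DiGraph()
D.add_edges_from([(1, 2), (2, 1), (1, 3), (3, 4)])

# Edges whose reverse edge is not present in the graph.
unreciprocated = [(u, v) for u, v in D.edges() if not D.has_edge(v, u)]
assert sorted(unreciprocated) == [(1, 3), (3, 4)]
```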
In a previous session at ODSC East 2018, a few other class participants provided the following
solutions, which you can take a look at by uncommenting the following cells.
This first one by @schwanne⁴ is the list comprehension version of the above solution:
³https://networkx.readthedocs.io
⁴https://github.com/schwanne
Solution Answers
Here are the answers to the exercises above.
1 """
2 Solutions to Intro Chapter.
3 """
4
5
6 def node_metadata(G):
7 """Counts of students of each gender."""
8 from collections import Counter
9
10 mf_counts = Counter([d["gender"] for n, d in G.nodes(data=True)])
11 return mf_counts
12
13
14 def edge_metadata(G):
15 """Maximum number of times that a student rated another student."""
16 counts = [d["count"] for n1, n2, d in G.edges(data=True)]
17 maxcount = max(counts)
18 return maxcount
19
20
21 def adding_students(G):
Graph Visualization
Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="v9HrR_AF5Zc", width="100%")
Hairballs
The node-link diagram is the canonical diagram we will see in publications. Nodes are commonly
drawn as circles, while edges are drawn as lines.
Node-link diagrams are common, and there’s a good reason for this: they’re convenient to draw! In
NetworkX, we can draw node-link diagrams using:
1 nx.draw(G)
[Figure: node-link diagram of G]
Nodes that are more tightly connected with one another are clustered together. Initial node
placement is typically random, so it’s tough to deterministically generate the same figure twice. If
the network is small enough to visualize, and the node labels are small enough to fit in a circle, then
you can use the with_labels=True argument to bring some degree of informativeness to the drawing:
1 G.is_directed()
1 True
1 nx.draw(G, with_labels=True)
[Figure: node-link diagram of G with node labels]
The downside to drawing graphs this way is that large graphs end up looking like hairballs. Can
you imagine a graph with more than the 28 nodes that we have? As you can imagine, the default
nx.draw(G) is probably not suitable for generating visual insights.
Matrix Plot
A different way that we can visualize a graph is by visualizing it in its matrix form. The nodes are
placed on the x- and y-axes, and a filled square represents an edge between two nodes.
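To make the matrix form concrete, here is a small pure-Python sketch (with a hypothetical four-node graph) of how an undirected edge list maps onto a symmetric matrix:

```python
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c")]

# Build an adjacency matrix: entry [i][j] is 1 if an edge joins nodes i and j.
index = {node: i for i, node in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    matrix[index[u]][index[v]] = 1
    matrix[index[v]][index[u]] = 1  # undirected: fill both off-diagonal cells

assert matrix[index["a"]][index["b"]] == 1  # a "filled square"
assert matrix[index["a"]][index["c"]] == 0  # no edge, no square
```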
We can draw a graph’s matrix form conveniently by using nxviz.MatrixPlot:
1 import nxviz as nv
2 from nxviz import annotate
3
4
5 nv.matrix(G, group_by="gender", node_color_by="gender")
6 annotate.matrix_group(G, group_by="gender")
1 /home/runner/work/Network-Analysis-Made-Simple/Network-Analysis-Made-Simple/nams_env\
2 /lib/python3.8/site-packages/nxviz/__init__.py:18: UserWarning:
3 nxviz has a new API! Version 0.7.0 onwards, the old class-based API is being
4 deprecated in favour of a new API focused on advancing a grammar of network
5 graphics. If your plotting code depends on the old API, please consider
6 pinning nxviz at version 0.6.3, as the new API will break your old code.
7
8 To check out the new API, please head over to the docs at
9 https://ericmjl.github.io/nxviz/ to learn more. We hope you enjoy using it!
10
11 (This deprecation message will go away in version 1.0.)
12
13 warnings.warn(
[Figure: matrix plot of G, grouped and coloured by gender]
What can you tell from the graph visualization? A few things are immediately obvious:
Arc Plot
The Arc Plot is another rational graph visualization. Here, we line up the nodes along a horizontal
axis, and draw arcs between nodes if they are connected by an edge. We can also optionally group
and colour them by some metadata. In the case of this student graph, we group and colour them by
“gender”.
[Figure: arc plot of G, grouped and coloured by gender]
The Arc Plot forms the basis of the next visualization, the highly popular Circos plot.
Circos Plot
The Circos Plot was developed by Martin Krzywinski⁶ at the BC Cancer Research Center. The
nxviz.CircosPlot takes inspiration from the original by joining the two ends of the Arc Plot into a circle.
⁶http://circos.ca/
[Figure: Circos plot of G, grouped and coloured by gender]
Generally speaking, you can think of a Circos Plot as being a more compact and aesthetically pleasing
version of Arc Plots.
Hive Plot
The final plot we’ll show is the Hive Plot.
[Figure: hive plot of G, grouped by gender]
As you can see, with Hive Plots, we first group nodes along two or three radial axes. In this case, we
have the boys along one radial axis and the girls along the other. We can also order the nodes along
each axis if we so choose to. In this case, no particular ordering is chosen.
Next, we draw edges. We start first with edges between groups. That is shown on the left side of
the figure, joining nodes in the “yellow” and “green” (boys/girls) groups. We then proceed to edges
within groups. This is done by cloning the node radial axis before drawing edges.
Principles of Rational Graph Viz

In some ways, this makes a ton of sense. The nodes are the “entities” in a graph, corresponding
to people, proteins, and ports. For “entities”, we have natural ways to group, order and summarize
(reduce). (An example of a “reduction” is counting the number of things.) Prioritizing node placement
allows us to appeal to our audience’s natural sense of grouping, ordering and reduction.
So the next time you see a hairball, I hope you’re able to critique it for what it doesn’t communicate,
and possibly use the same principle to design a better visualization!
Hubs
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'
Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="-oimHbVDdDA", width=560, height=315)
This network describes the face-to-face behavior of people during the exhibition INFEC-
TIOUS: STAY AWAY in 2009 at the Science Gallery in Dublin. Nodes represent exhibition
visitors; edges represent face-to-face contacts that were active for at least 20 seconds.
Multiple edges between two nodes are possible and denote multiple contacts. The network
contains the data from the day with the most interactions.
To simplify the network, we have represented only the last contact between individuals.
1 type(G)
1 networkx.classes.graph.Graph
As usual, before proceeding with any analysis, we should know basic graph statistics.
1 len(G.nodes()), len(G.edges())
1 (410, 2765)
1 G.neighbors(7)
1 <dict_keyiterator at 0x7fa2457aea40>
It returns a lazy iterator rather than a materialized list of neighbors, which means we cannot
directly ask for its length. If you tried to do:
1 len(G.neighbors(7))
1 ---------------------------------------------------------------------------
2 TypeError Traceback (most recent call last)
3 <ipython-input-13-72c56971d077> in <module>
4 ----> 1 len(G.neighbors(7))
5
6 TypeError: object of type 'dict_keyiterator' has no len()
Hence, we will need to cast it as a list in order to know both its length and its members:
1 list(G.neighbors(7))
In the event that some nodes have an extensive list of neighbors, then using the dict_keyiterator
is potentially a good memory-saving technique, as it lazily yields the neighbors.
Can you create a ranked list of the importance of each individual, based on the number
of neighbors they have?
• You could consider using a pandas Series. This would be a modern and idiomatic way of
approaching the problem.
• You could also consider using Python’s sorted function.
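As a sketch of the sorted approach (using a hypothetical dictionary of neighbor counts rather than the actual contact network):

```python
# Hypothetical neighbor counts per individual.
neighbor_counts = {"p1": 3, "p2": 7, "p3": 1, "p4": 7}

# Rank individuals from most to fewest neighbors.
ranked = sorted(neighbor_counts.items(), key=lambda kv: kv[1], reverse=True)
# ranked → [('p2', 7), ('p4', 7), ('p1', 3), ('p3', 1)]
```

Because Python's sort is stable, ties (here p2 and p4) keep their original relative order.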
Formally defined, the degree centrality of a node (let’s call it d) is the number of neighbors that
the node has (let’s call it n) divided by the number of neighbors it could possibly have (let’s call it N):

d = n / N
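The formula can be sketched on a tiny hypothetical graph, with adjacency stored as a dict of neighbor sets (N is the number of nodes minus one, since a node could at most neighbor every other node):

```python
# Adjacency for a hypothetical 4-node graph: a-b, a-c, c-d.
adjacency = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": {"c"},
}

# d = n / N, where N = (number of nodes) - 1 possible neighbors.
N = len(adjacency) - 1
degree_centrality = {node: len(nbrs) / N for node, nbrs in adjacency.items()}

assert degree_centrality["a"] == 2 / 3  # 2 neighbors out of 3 possible
assert degree_centrality["d"] == 1 / 3  # 1 neighbor out of 3 possible
```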
1 import networkx as nx
2 import pandas as pd
3 dcs = pd.Series(nx.degree_centrality(G))
4 dcs
1 100 0.070905
2 101 0.031785
3 102 0.039120
4 103 0.063570
5 104 0.041565
6 ...
7 89 0.009780
8 91 0.051345
9 96 0.036675
10 99 0.034230
11 98 0.002445
12 Length: 410, dtype: float64
nx.degree_centrality(G) returns to us a dictionary of key-value pairs, where the keys are node IDs
and values are the degree centrality score. To save on output length, I took the liberty of casting it
as a pandas Series to make it easier to display.
Incidentally, we can also sort the series to find the nodes with the highest degree centralities:
1 dcs.sort_values(ascending=False)
1 51 0.122249
2 272 0.114914
3 235 0.105134
4 195 0.105134
5 265 0.083130
6 ...
7 390 0.002445
8 135 0.002445
9 398 0.002445
10 186 0.002445
11 98 0.002445
12 Length: 410, dtype: float64
Does the list order look familiar? It should, since the numerator of the degree centrality metric is
identical to the number of neighbors, and the denominator is a constant.
1 x, y = ecdf(list_of_values)
[Figures: ECDF of the degree centrality scores, and ECDF of the number of neighbors]
The fact that they are identically-shaped should not surprise you!
Visualize the graph G, ordering and colouring the nodes by the ‘order’ node attribute.
[Figure: graph G with nodes ordered and coloured by ‘order’]
1 import nxviz as nv
2 nv.arc(G, sort_by="order", node_color_by="order")
1 <AxesSubplot:>
[Figures: arc plot of G ordered by ‘order’, and scatter plot of degree centrality versus maximum node-order difference]
The somewhat positive correlation between degree centrality and the maximum difference in node
order might tell us that this trend holds true. A further applied question would be to ask what
behaviour of these nodes would give rise to this pattern. Are these nodes actually exhibition staff?
Or is there some other reason why they are staying so long? This, of course, would require joining
in further information that we would overlay on top of the graph (by adding it as node or edge
attributes) before we can make further statements.
Reflections
In this chapter, we defined a metric of node importance: the degree centrality metric. In the example
we looked at, it could help us identify potential infectious agent superspreaders in a disease contact
network. In other settings, it might help us spot:
What other settings can you think of in which the number of neighbors that a node has can become
a metric of importance for the node?
Solutions
Here are the solutions to the exercises above.
76 def visual_insights():
77 """Visual insights from the Circos Plot."""
78 return """
79 We see that most edges are "local" with nodes
80 that are proximal in order.
81 The nodes that are weird are the ones that have connections
82 with individuals much later than itself,
83 crossing larger jumps in order/time.
84
85 Additionally, if you recall the ranked list of degree centralities,
86 it appears that these nodes that have the highest degree centrality scores
87 are also the ones that have edges that cross the circos plot.
88 """
89
90
91 def dc_node_order(G):
92 """Comparison of degree centrality by maximum difference in node order."""
93 import matplotlib.pyplot as plt
94 import pandas as pd
95 import networkx as nx
96
97 # Degree centralities
98 dcs = pd.Series(nx.degree_centrality(G))
99
100 # Maximum node order difference
101 maxdiffs = dict()
102 for n, d in G.nodes(data=True):
103 diffs = []
104 for nbr in G.neighbors(n):
105 diffs.append(abs(G.nodes[nbr]["order"] - d["order"]))
106 maxdiffs[n] = max(diffs)
107 maxdiffs = pd.Series(maxdiffs)
108
109 ax = pd.DataFrame(dict(degree_centrality=dcs, max_diff=maxdiffs)).plot(
110 x="degree_centrality", y="max_diff", kind="scatter"
111 )
Paths
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'
Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="JjpbztqP9_0", width="100%")
Breadth-First Search
The BFS algorithm is a staple of computer science curricula, and for good reason: it teaches learners
how to “think on” a graph, putting one in the position of “the dumb computer” that can’t use a visual
cortex to “just know” how to trace a path from one node to another. As a topic, learning how to do
BFS additionally imparts algorithmic thinking to the learner.
1. On a piece of paper, conjure up a graph that has 15-20 nodes. Connect them any way
you like.
2. Pick two nodes. Pretend that you’re standing on one of the nodes, but you can’t see
any further beyond one neighbor away.
3. Work out how you can find a path from the node you’re standing on to the other
node, given that you can only see nodes that are one neighbor away but have an
infinitely good memory.
If you are successful at designing the algorithm, you should get the answer below.
1 assert test_path_exists(10)
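For reference, the BFS idea from the exercise can be sketched in plain Python over a hypothetical adjacency-dict graph (NetworkX's own shortest_path is what the book actually uses):

```python
from collections import deque

def bfs_path(adjacency, start, goal):
    """Return one shortest path from start to goal, or None if unreachable."""
    queue = deque([start])
    parents = {start: None}  # the "infinitely good memory" of visited nodes
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk parents back to the start to recover the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nbr in adjacency[node]:
            if nbr not in parents:  # only ever look one neighbor away
                parents[nbr] = node
                queue.append(nbr)
    return None

# A small hypothetical graph: 1-2, 1-3, 2-4, 3-4, 4-5.
adjacency = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
assert bfs_path(adjacency, 1, 5) == [1, 2, 4, 5]
```

Exploring layer by layer while remembering each node's parent is exactly the "can only see one neighbor away, but with perfect memory" constraint from the exercise.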
Visualizing Paths
One of the objectives of that exercise before was to help you “think on graphs”. Now that you’ve
learned how to do so, you might be wondering, “How do I visualize that path through the graph?”
Well first off, if you inspect the test_path_exists function above, you’ll notice that NetworkX
provides a shortest_path() function that you can use. Here’s what using nx.shortest_path() looks
like:

1 path = nx.shortest_path(G, 7, 400)
2 path
As you can see, it returns the nodes along the shortest path, incidentally in the exact order that you
would traverse them.
One thing to note, though! If there are multiple shortest paths from one node to another, NetworkX
will only return one of them.
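If you need every shortest path rather than an arbitrary one, NetworkX also provides nx.all_shortest_paths, which yields all of them. A quick sketch on a 4-cycle, where two equally short paths exist between opposite corners:

```python
import networkx as nx

C = nx.cycle_graph(4)  # nodes 0-1-2-3 arranged in a ring

one_path = nx.shortest_path(C, 0, 2)                # one of the two shortest paths
all_paths = sorted(nx.all_shortest_paths(C, 0, 2))  # both of them
print(all_paths)  # [[0, 1, 2], [0, 3, 2]]
```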
So how do you draw those nodes only?
You can use G.subgraph(nodes) to return a new graph that contains only the nodes in nodes and only the
edges that exist between them. After that, you can use any plotting library you like. We will show
an example here that uses nxviz’s matrix plot.
Let’s see it in action:
1 import nxviz as nv
2 g = G.subgraph(path)
3 nv.matrix(g, sort_by="order")
1 <AxesSubplot:>
Voila! Now we have the subgraph (1) extracted and (2) drawn to screen! In this case, the matrix
plot is a suitable visualization for its compactness. The off-diagonals also show that each node is a
neighbor to the next one.
You’ll also notice that if you try to modify the graph g, say by adding a node:
1 g.add_node(2048)
1 ---------------------------------------------------------------------------
2 NetworkXError Traceback (most recent call last)
3 <ipython-input-10-ca6aa4c26819> in <module>
4 ----> 1 g.add_node(2048)
5
6 ~/anaconda/envs/nams/lib/python3.7/site-packages/networkx/classes/function.py in fro\
7 zen(*args, **kwargs)
8 156 def frozen(*args, **kwargs):
9 157 """Dummy method for raising errors when trying to modify frozen graphs"""
10 --> 158 raise nx.NetworkXError("Frozen graph can't be modified")
11 159
12 160
13
14 NetworkXError: Frozen graph can't be modified
From the perspective of semantics, this makes a ton of sense: the subgraph g is a perfect subset of
the larger graph G, and should not be allowed to be modified unless the larger container graph is
modified.
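If you do need a modifiable version of a subgraph, the usual pattern is to take a copy, which decouples it from the parent graph. A sketch:

```python
import networkx as nx

G = nx.path_graph(5)                    # nodes 0-1-2-3-4 in a line
g_view = G.subgraph([0, 1, 2])          # frozen view onto G
g_copy = G.subgraph([0, 1, 2]).copy()   # independent, mutable copy

g_copy.add_node(2048)                   # fine: only the copy changes
assert 2048 in g_copy and 2048 not in G

# The view, by contrast, raises NetworkXError on modification.
try:
    g_view.add_node(2048)
except nx.NetworkXError as e:
    print(e)  # Frozen graph can't be modified
```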
Extend graph drawing with the neighbors of each of those nodes. Use any of the nxviz
plots (nv.matrix, nv.arc, nv.circos); try to see which one helps you tell the best story.
1 plot_path_with_neighbors(G, 7, 400)
In this case, we opted for an Arc plot because we only have one grouping of nodes but have a logical
way to order them. Because the path follows the order, the edges being highlighted automatically
look like hops through the graph.
Bottleneck nodes
We’re now going to revisit the concept of an “important node”, this time now leveraging what we
know about paths.
In the “hubs” chapter, we saw how a node that is “important” could be so because it is connected to
many other nodes.
Paths give us an alternative definition. If we imagine that we have to pass a message on a graph
from one node to another, then there may be “bottleneck” nodes which, if removed, make it much
harder for messages to flow through the graph.
One metric that measures this form of importance is the “betweenness centrality” metric. On a
graph through which a generic “message” is flowing, a node with a high betweenness centrality is
one that has a high proportion of shortest paths flowing through it. In other words, it behaves like
a bottleneck.
1 import pandas as pd
2
3 pd.Series(nx.betweenness_centrality(G))
1 100 0.014809
2 101 0.001398
3 102 0.000748
4 103 0.006735
5 104 0.001198
6 ...
7 89 0.000004
8 91 0.006415
9 96 0.000323
10 99 0.000322
11 98 0.000000
12 Length: 410, dtype: float64
1 nx.draw(nx.barbell_graph(5, 1))
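The barbell graph above is a handy illustration of this idea: its two dense “bells” are joined through a single bridge node, which has only two neighbors (low degree) yet carries every cross-bell shortest path (high betweenness). A quick check:

```python
import networkx as nx

# Two 5-cliques (nodes 0-4 and 6-10) joined through one bridge node (5).
B = nx.barbell_graph(5, 1)
bc = nx.betweenness_centrality(B)

bridge = max(bc, key=bc.get)
print(bridge, B.degree(bridge))  # 5 2 -> highest betweenness, only 2 neighbors
```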
Recap
In this chapter, you learned the following things:
1. You figured out how to implement the breadth-first-search algorithm to find shortest paths.
2. You learned how to extract subgraphs from a larger graph.
3. You implemented visualizations of subgraphs, which should help you as you communicate
with colleagues.
4. You calculated betweenness centrality metrics for a graph, and visualized how they correlated
with degree centrality.
Solutions
Here are the solutions to the exercises above.
38
39 visited_nodes = set()
40 queue = [node1]
41
42 while len(queue) > 0:
43 node = queue.pop()
44 neighbors = list(G.neighbors(node))
45 if node2 in neighbors:
46 return True
47 else:
48 visited_nodes.add(node)
49 nbrs = [n for n in neighbors if n not in visited_nodes]
50 queue = nbrs + queue
51
52 return False
53
54
55 def path_exists_for_loop(node1, node2, G):
56 """
57 This function checks whether a path exists between two nodes (node1,
58 node2) in graph G.
59
60 Special thanks to @ghirlekar for suggesting that we keep track of the
61 "visited nodes" to prevent infinite loops from happening. This also
62 removes the need to remove nodes from queue.
63
64 Reference: https://github.com/ericmjl/Network-Analysis-Made-Simple/issues/3
65
66 With thanks to @joshporter1 for the second bug fix. Originally there was
67 an extraneous "if" statement that guaranteed that the "False" case would
68 never be returned - because queue never changes in shape. Discovered at
69 PyCon 2017.
70
71 With thanks to @chendaniely for pointing out the extraneous "break".
72
73 If you would like to see @dgerlanc's implementation, see
74 https://github.com/ericmjl/Network-Analysis-Made-Simple/issues/76
75 """
76 visited_nodes = set()
77 queue = [node1]
78
79 for node in queue:
80 neighbors = list(G.neighbors(node))
81 if node2 in neighbors:
82 return True
83 else:
84 visited_nodes.add(node)
85 queue.extend([n for n in neighbors if n not in visited_nodes])
86
87 return False
88
89
90 def path_exists_deque(node1, node2, G):
91 """An alternative implementation."""
92 from collections import deque
93
94 visited_nodes = set()
95 queue = deque([node1])
96
97 while len(queue) > 0:
98 node = queue.popleft()
99 neighbors = list(G.neighbors(node))
100 if node2 in neighbors:
101 return True
102 else:
103 visited_nodes.add(node)
104 queue.extend([n for n in neighbors if n not in visited_nodes])
105
106 return False
107
108
109 import nxviz as nv
110 from nxviz import annotate, highlights
111
112
113 def plot_path_with_neighbors(G, n1, n2):
114 """Plot a path with the neighbors of the nodes along that path."""
115 path = nx.shortest_path(G, n1, n2)
116 nodes = [*path]
117 for node in path:
118 nodes.extend(list(G.neighbors(node)))
119 nodes = list(set(nodes))
120
121 g = G.subgraph(nodes)
122 nv.arc(
123 g, sort_by="order", node_color_by="order", edge_aes_kwargs={"alpha_scale": 0\
124 .5}
125 )
126 for n in path:
127 highlights.arc_node(g, n, sort_by="order")
128 for n1, n2 in zip(path[:-1], path[1:]):
129 highlights.arc_edge(g, n1, n2, sort_by="order")
130
131
132 def plot_degree_betweenness(G):
133 """Plot scatterplot between degree and betweenness centrality."""
134 bc = pd.Series(nx.betweenness_centrality(G))
135 dc = pd.Series(nx.degree_centrality(G))
136
137 df = pd.DataFrame(dict(bc=bc, dc=dc))
138 ax = df.plot(x="dc", y="bc", kind="scatter")
139 ax.set_ylabel("Betweenness\nCentrality")
140 ax.set_xlabel("Degree Centrality")
141 sns.despine()
Structures
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'
Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="3DWSRCbPPJs", width="100%")
Triangles
The first structure that we are going to learn about is triangles. Triangles are super interesting!
They are what one might consider to be “the simplest complex structure” in a graph. Triangles can
also have semantically-rich meaning depending on the application. To borrow a bad example, love
triangles in social networks are generally frowned upon, while on the other hand, when we connect
two people that we know together, we instead complete a triangle.
Load Data
To learn about triangles, we are going to leverage a physician trust network. Here’s the data
description:
This directed network captures innovation spread among 246 physicians in four towns in
Illinois: Peoria, Bloomington, Quincy and Galesburg. The data was collected in 1966. A
node represents a physician and an edge between two physicians shows that the left
physician told that the right physician is his friend or that he turns to the right physician
if he needs advice or is interested in a discussion. There always only exists one edge
between two nodes even if more than one of the listed conditions are true.
Leveraging what you know, can you think of a few strategies to find triangles in a graph?
Every graph object G has a G.has_edge(n1, n2) method that you can use to identify
whether a graph has an edge between n1 and n2.
Now, test your implementation below! The code cell will not error out if your answer is correct.
As you can see from the test function above, NetworkX provides an nx.triangles(G, node) function.
It returns the number of triangles that a node is involved in. We convert it to boolean as a hack to
check whether or not a node is involved in a triangle relationship because 0 is equivalent to boolean
False, while any non-zero number is equivalent to boolean True.
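A minimal illustration of nx.triangles and the boolean conversion described above, on a toy graph of our own:

```python
import networkx as nx

# One triangle (0-1-2) with a pendant node (3) hanging off it.
G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])

print(nx.triangles(G, 0))  # 1: node 0 participates in one triangle
print(nx.triangles(G, 3))  # 0: node 3 participates in none

# Converting the count to boolean answers "is this node in any triangle?"
assert bool(nx.triangles(G, 0)) is True
assert bool(nx.triangles(G, 3)) is False
```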
Given a node, write a function that extracts out all of the neighbors that it is in a triangle
relationship with. Then, in a new function, implement code that plots only the subgraph
that contains those nodes.
Triadic Closure
In professional circles, making connections between two people is one of the most valuable things
you can do professionally. What you do in that moment is what we would call triadic closure.
Algorithmically, we can do the same thing if we maintain a graph of connections!
Essentially, what we are looking for are “open” or “unfinished” triangles.
In this section, we’ll try our hand at implementing a rudimentary triadic closure system.
Write a function that takes in a graph G and a node n, and returns all of the neighbors that
are potential triadic closures with n being the center node.
Cliques
Triangles are interesting in a graph theoretic setting because they are the simplest complex cliques
that exist.
But wait! What is the definition of a “clique”?
A “clique” is a set of nodes in a graph that are fully connected with one another by edges
between them.
k-Cliques
Cliques are identified by their size k, which is the number of nodes that are present in the clique.
A triangle is what we would consider to be a k-clique where k = 3.
A square with cross-diagonal connections is what we would consider to be a k-clique where k = 4.
By now, you should get the gist of the idea.
Maximal Cliques
Related to this idea of a k-clique is another idea called “maximal cliques”.
Maximal cliques are defined as follows: a maximal clique is a clique that cannot be extended by adding any other node in the graph.
1 [[1, 2], [1, 3], [1, 4, 5, 6], [1, 7], [1, 72]]
I’m requesting a generator as a matter of good practice; you never know when the list you return
might explode in memory consumption, so generators are a cheap and easy way to reduce memory
usage.
Clique Decomposition
One super neat property of cliques is that every clique of size k can be decomposed to the set of
cliques of size k − 1.
Does this make sense to you? If not, think about triangles (3-cliques). They can be decomposed to
three edges (2-cliques).
Think again about 4-cliques. Housed within 4-cliques are four 3-cliques. Draw it out if you’re still
not convinced!
If a k-clique can be decomposed to its k − 1 cliques, it follows that the k − 1 cliques can
be decomposed into k − 2 cliques, and so on until you hit 2-cliques. This implies that all
cliques of size k house cliques of size n < k, where n >= 2.
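The decomposition claim is easy to check with itertools: every size-(k-1) subset of a k-clique's nodes is itself a clique, because all pairwise edges are already present. For a 4-clique:

```python
from itertools import combinations

# The node set of a 4-clique; all pairs are connected by definition,
# so every 3-node subset is automatically a 3-clique.
four_clique_nodes = [1, 2, 3, 4]
three_cliques = list(combinations(four_clique_nodes, 3))

print(len(three_cliques))  # 4, matching the "four 3-cliques" claim above
```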
Connected Components
Now that we’ve explored a lot around cliques, we’re now going to explore this idea of “connected
components”. To do so, I am going to have you draw the graph that we are working with.
1 import nxviz as nv
2
3 nv.circos(G)
1 <AxesSubplot:>
1 ccsubgraph_nodes = list(nx.connected_components(G))
1 len(ccsubgraph_nodes)
1 4
⁹https://en.wikipedia.org/wiki/Connected_component_%28graph_theory%29
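Each connected component is just a set of nodes, so pairing nx.connected_components with G.subgraph gives you one subgraph per component. A sketch on a toy graph of our own:

```python
import networkx as nx

# Two disjoint pieces: a triangle and a lone edge.
G = nx.Graph([(0, 1), (1, 2), (2, 0), (10, 11)])

component_nodes = list(nx.connected_components(G))  # a list of node sets
subgraphs = [G.subgraph(nodes) for nodes in component_nodes]

print(len(subgraphs))  # 2
```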
Firstly, label each node with a unique identifier for the connected component subgraph that
it resides in. Use the subgraph key to store this piece of metadata.
1 def label_connected_component_subgraphs(G):
2 # Your answer here
3 return G
4
5
6 # COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
7 from nams.solutions.structures import label_connected_component_subgraphs
8 G_labelled = label_connected_component_subgraphs(G)
9
10 # UNCOMMENT TO SEE THE ANSWER
11 # label_connected_component_subgraphs??
Now, draw a CircosPlot with the node order and colouring dictated by the subgraph key.
1 def plot_cc_subgraph(G):
2 # Your answer here
3 pass
4
5
6 # COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
7 from nams.solutions.structures import plot_cc_subgraph
8 from nxviz import annotate
9
10 plot_cc_subgraph(G_labelled)
11 annotate.circos_group(G_labelled, group_by="subgraph")
Using an arc plot will also clearly illuminate for us that there are no inter-group connections.
Voila! It looks quite clear that there are indeed four disjoint groups of physicians.
Solutions
38 """
39 Return neighbors involved in triangle relationship with node.
40 """
41 neighbors1 = set(G.neighbors(node))
42 triangle_nodes = set()
43 for nbr1, nbr2 in combinations(neighbors1, 2):
44 if G.has_edge(nbr1, nbr2):
45 triangle_nodes.add(nbr1)
46 triangle_nodes.add(nbr2)
47 return triangle_nodes
48
49
50 def plot_triangle_relations(G, node):
51 """
52 Plot all triangle relationships for a given node.
53 """
54 triangle_nbrs = get_triangle_neighbors(G, node)
55 triangle_nbrs.add(node)
56 nx.draw(G.subgraph(triangle_nbrs), with_labels=True)
57
58
59 def triadic_closure_algorithm():
60 """
61 How to do triadic closure.
62 """
63 ans = """
64 I would suggest the following strategy:
65
66 1. Pick a node
67 1. For every pair of neighbors:
68 1. If neighbors are not connected,
69 then this is a potential triangle to close.
70
71 This strategy gives you potential triadic closures
72 given a "center" node `n`.
73
74 The other way is to trace out a path two degrees out
75 and ask whether the terminal node is a neighbor
76 of the starting node.
77 If not, then we have another triadic closure to make.
78 """
79 return render_html(ans)
80
81
82 def get_open_triangles_neighbors(G, node) -> set:
83 """
84 Return neighbors involved in open triangle relationships with a node.
85 """
86 open_triangle_nodes = set()
87 neighbors = list(G.neighbors(node))
88
89 for n1, n2 in combinations(neighbors, 2):
90 if not G.has_edge(n1, n2):
91 open_triangle_nodes.add(n1)
92 open_triangle_nodes.add(n2)
93
94 return open_triangle_nodes
95
96
97 def plot_open_triangle_relations(G, node):
98 """
99 Plot open triangle relationships for a given node.
100 """
101 open_triangle_nbrs = get_open_triangles_neighbors(G, node)
102 open_triangle_nbrs.add(node)
103 nx.draw(G.subgraph(open_triangle_nbrs), with_labels=True)
104
105
106 def simplest_clique():
107 """
108 Answer to "what is the simplest clique".
109 """
110 return render_html("The simplest clique is an edge.")
111
112
113 def size_k_maximal_cliques(G, k):
114 """
115 Return all size-k maximal cliques.
116 """
117 for clique in nx.find_cliques(G):
118 if len(clique) == k:
119 yield clique
120
121
122 def find_k_cliques(G, k):
123 """
Graph I/O
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'
Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="3sJnTpeFXZ4", width="100%")
• Node set
• Edge set
1 A
2 B
3 C
Suppose the nodes also had metadata. Then, we could tag on metadata as well:
1 A, circle, 5
2 B, circle, 7
3 C, square, 9
Does this look familiar to you? Yes, node sets can be stored in CSV format, with one of the columns
being node ID, and the rest of the columns being metadata.
1 A, C
2 B, C
3 A, B
4 C, A
And let’s say we also had other metadata, we can represent it in the same CSV format:
1 A, C, red
2 B, C, orange
3 A, B, yellow
4 C, A, green
If you’ve been in the data world for a while, this should not look foreign to you. Yes, edge sets can
be stored in CSV format too! Two of the columns represent the nodes involved in an edge, and the
rest of the columns represent the metadata.
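To make this concrete, here is a sketch that round-trips the toy edge table above from CSV text into a NetworkX graph (the column names source, target, and color are our own choices):

```python
import io

import networkx as nx
import pandas as pd

csv_text = "source,target,color\nA,C,red\nB,C,orange\nA,B,yellow\nC,A,green\n"
edges = pd.read_csv(io.StringIO(csv_text))

G = nx.from_pandas_edgelist(
    edges,
    source="source",
    target="target",
    edge_attr=["color"],       # carry the metadata column onto the edges
    create_using=nx.DiGraph,   # directed, since A->C and C->A both appear
)
print(G.edges["A", "C"])  # {'color': 'red'}
```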
Combined Representation
In fact, one might also choose to combine the node set and edge set tables together in a merged
format:
In this chapter, the datasets that we will be looking at are going to be formatted in both ways. Let’s
get going.
Dataset
We will be working with the Divvy bike sharing dataset.
Divvy is a bike sharing service in Chicago. Since 2013, Divvy has released their bike
sharing dataset to the public. The 2013 dataset consists of two files: Divvy_Stations_2013.csv,
containing the stations in the system, and Divvy_Trips_2013.csv, containing the trips.
1 import zipfile
2 import os
3 from nams.load_data import datasets
4
5 # This block of code checks to make sure that a particular directory is present.
6 if "divvy_2013" not in os.listdir(datasets):
7 print('Unzipping the divvy_2013.zip file in the datasets folder.')
8 with zipfile.ZipFile(datasets / "divvy_2013.zip","r") as zip_ref:
9 zip_ref.extractall(datasets)
1 import pandas as pd
2
3 stations = pd.read_csv(datasets / 'divvy_2013/Divvy_Stations_2013.csv', parse_dates=\
4 ['online date'], encoding='utf-8')
5 print(stations.head().to_markdown())
1 print(stations.describe().to_markdown())
1 /home/runner/work/Network-Analysis-Made-Simple/Network-Analysis-Made-Simple/nams_env\
2 /lib/python3.8/site-packages/IPython/core/interactiveshell.py:3165: DtypeWarning: Co\
3 lumns (10) have mixed types.Specify dtype option on import or set low_memory=False.
4 has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
| | trip_id | starttime | stoptime | bikeid | tripduration | from_station_id | from_station_name | to_station_id | to_station_name | usertype | gender | birthday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4118 | 2013-06-27 12:11:00 | 2013-06-27 12:16:00 | 480 | 316 | 85 | Michigan Ave & Oak St | 28 | Larrabee St & Menomonee St | Customer | nan | nan |
| 1 | 4275 | 2013-06-27 14:44:00 | 2013-06-27 14:45:00 | 77 | 64 | 32 | Racine Ave & Congress Pkwy | 32 | Racine Ave & Congress Pkwy | Customer | nan | nan |
| 2 | 4291 | 2013-06-27 14:58:00 | 2013-06-27 15:05:00 | 77 | 433 | 32 | Racine Ave & Congress Pkwy | 19 | Loomis St & Taylor St | Customer | nan | nan |
| 3 | 4316 | 2013-06-27 15:06:00 | 2013-06-27 15:09:00 | 77 | 123 | 19 | Loomis St & Taylor St | 19 | Loomis St & Taylor St | Customer | nan | nan |
| 4 | 4342 | 2013-06-27 15:13:00 | 2013-06-27 15:27:00 | 77 | 852 | 19 | Loomis St & Taylor St | 55 | Halsted St & James M Rochford St | Customer | nan | nan |
1 import janitor
2 trips_summary = (
3 trips
4 .groupby(["from_station_id", "to_station_id"])
5 .count()
6 .reset_index()
7 .select_columns(
8 [
9 "from_station_id",
10 "to_station_id",
11 "trip_id"
12 ]
13 )
14 .rename_column("trip_id", "num_trips")
15 )
1 print(trips_summary.head().to_markdown())
Graph Model
Given the data, if we wished to use a graph as a data model for the number of trips between stations,
then naturally, nodes would be the stations, and edges would be trips between them.
This graph would be directed, as one could have more trips from station A to B and less in the
reverse.
With this definition, we can begin graph construction!
1 import networkx as nx
2
3 G = nx.from_pandas_edgelist(
4 df=trips_summary,
5 source="from_station_id",
6 target="to_station_id",
7 edge_attr=["num_trips"],
8 create_using=nx.DiGraph
9 )
1 print(nx.info(G))
1 Name:
2 Type: DiGraph
3 Number of nodes: 300
4 Number of edges: 44422
5 Average in degree: 148.0733
6 Average out degree: 148.0733
You’ll notice that the edge metadata have been added correctly: we have recorded in there the
number of trips between stations.
1 list(G.edges(data=True))[0:5]
1 list(G.nodes(data=True))[0:5]
1 [(5, {}), (13, {}), (14, {}), (15, {}), (16, {})]
1 print(stations.head().to_markdown())
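The cell that copies the station metadata onto the graph is not reproduced above; one way to do it (a sketch, using toy stand-ins for the Divvy stations table and trips graph) is to walk the dataframe and update each matching node's attribute dictionary:

```python
import networkx as nx
import pandas as pd

# Toy stand-ins for the Divvy trips graph and stations table.
G = nx.DiGraph([(5, 13)])
stations = pd.DataFrame(
    {
        "id": [5, 13],
        "name": ["State St & Harrison St", "Wilton Ave & Diversey Pkwy"],
        "dpcapacity": [19, 19],
    }
)

# Copy each station's metadata onto the node with the matching id.
for _, row in stations.iterrows():
    if row["id"] in G:
        G.nodes[row["id"]].update(row.drop("id").to_dict())

print(G.nodes[5]["name"])  # State St & Harrison St
```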
1 list(G.nodes(data=True))[0:5]
1 [(5,
2 {'name': 'State St & Harrison St',
3 'latitude': 41.87395806,
4 'longitude': -87.62773949,
5 'dpcapacity': 19,
6 'landmark': 30,
7 'online date': Timestamp('2013-06-28 00:00:00')}),
8 (13,
9 {'name': 'Wilton Ave & Diversey Pkwy',
10 'latitude': 41.93250008,
11 'longitude': -87.65268082,
12 'dpcapacity': 19,
13 'landmark': 66,
14 'online date': Timestamp('2013-06-28 00:00:00')}),
15 (14,
16 {'name': 'Morgan St & 18th St',
17 'latitude': 41.858086,
18 'longitude': -87.651073,
19 'dpcapacity': 15,
20 'landmark': 163,
21 'online date': Timestamp('2013-06-28 00:00:00')}),
22 (15,
23 {'name': 'Racine Ave & 18th St',
24 'latitude': 41.85818061,
25 'longitude': -87.65648665,
26 'dpcapacity': 15,
27 'landmark': 164,
28 'online date': Timestamp('2013-06-28 00:00:00')}),
29 (16,
30 {'name': 'Wood St & North Ave',
31 'latitude': 41.910329,
32 'longitude': -87.672516,
33 'dpcapacity': 15,
34 'landmark': 223,
35 'online date': Timestamp('2013-08-12 00:00:00')})]
In nxviz, a GeoPlot object is available that allows you to quickly visualize a graph that has
geographic data. However, being matplotlib-based, it is going to be quickly overwhelmed by the
sheer number of edges.
As such, we are going to first filter the edges.
1 G_copy = G.copy()
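The cell that produces G_filtered is not shown above; one plausible approach (sketched here on a toy graph, with a hypothetical num_trips threshold) is to collect the low-traffic edges first and then remove them from the copy:

```python
import networkx as nx

# Toy stand-in for the Divvy trips graph.
G = nx.DiGraph()
G.add_edge("a", "b", num_trips=50)
G.add_edge("b", "c", num_trips=3)

G_copy = G.copy()

# Collect first, then remove: never mutate a graph while iterating its edges.
threshold = 10  # hypothetical cutoff; tune to your data
low_traffic = [
    (u, v) for u, v, d in G_copy.edges(data=True) if d["num_trips"] < threshold
]
G_copy.remove_edges_from(low_traffic)

G_filtered = G_copy
print(G_filtered.number_of_edges())  # 1
```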
As the creator of nxviz, I would recommend using proper geospatial packages, such as
pysal¹⁰, to build custom geospatial graph visualizations.
That said, nxviz can probably do what you need for a quick-and-dirty view of the data.
¹⁰http://pysal.org/
1 import nxviz as nv
2
3 c = nv.geo(G_filtered, node_color_by="dpcapacity")
Does that look familiar to you? Looks quite a bit like Chicago, I’d say :)
Jesting aside, this visualization does help illustrate that the majority of trips occur between stations
that are near the city center.
Pickling Graphs
Since NetworkX graphs are Python objects, the canonical way to save them is by pickling them. You
can do this using:
1 nx.write_gpickle(G, file_path)
1 nx.write_gpickle(G, "/tmp/divvy.pkl")
1 G_loaded = nx.read_gpickle("/tmp/divvy.pkl")
Write a function that tests that the graph has the correct number of nodes and edges inside
it.
1 def test_graph_integrity(G):
2 """Test integrity of raw Divvy graph."""
3 # Your solution here
4 pass
5
6 from nams.solutions.io import test_graph_integrity
7
8 test_graph_integrity(G)
Solutions
The solutions to this chapter’s exercises are below.
Why test?
What to test
When thinking about what part of the data to test, it can be confusing. After all, data are seemingly
generated from random processes (my Bayesian foxtail has been revealed), and it seems difficult to
test random processes.
That said, from my experience handling data, I can suggest a few principles.
Test invariants
Firstly, we test invariant properties of the data. Put in plain language, things we know ought to be
true.
Using the Divvy bike dataset example, we know that every node ought to have a station name. Thus,
the minimum that we can test is that the station_name attribute is present on every node. As an
example:
1 def test_divvy_nodes(G):
2 """Test node metadata on Divvy dataset."""
3 for n, d in G.nodes(data=True):
4 assert "station_name" in d.keys()
Test nullity
Secondly, we can test that values that ought not to be null should not be null.
Using the Divvy bike dataset example again, if we also know that the station name cannot be null
or an empty string, then we can bake that into the test.
1 def test_divvy_nodes(G):
2 """Test node metadata on Divvy dataset."""
3 for n, d in G.nodes(data=True):
4 assert "station_name" in d.keys()
5 assert bool(d["station_name"])
Test boundaries
We can also test boundary values. For example, within the city of Chicago, we know that latitude
and longitude values ought to be within the vicinity of 41.85003, -87.65005. If we get data values
that are, say, outside the range of [41, 42]; [-88, -87], then we know that we have data issues
as well.
Here’s an example:
1 def test_divvy_nodes(G):
2 """Test node metadata on Divvy dataset."""
3 for n, d in G.nodes(data=True):
4 # Test for station names.
5 assert "station_name" in d.keys()
6 assert bool(d["station_name"])
7
8 # Test for longitude/latitude
9 assert d["latitude"] >= 41 and d["latitude"] <= 42
10 assert d["longitude"] >= -88 and d["longitude"] <= -87
An apology to geospatial experts: I genuinely don’t know the bounding box lat/lon coordinates of
Chicago, so if you know those coordinates, please reach out so I can update the test.
1 # test_data.py
2 def test_divvy_nodes(G):
3 """Test node metadata on Divvy dataset."""
4 for n, d in G.nodes(data=True):
5 # Test for station names.
6 assert "station_name" in d.keys()
7 assert bool(d["station_name"])
8
9 # Test for longitude/latitude
10 assert d["latitude"] >= 41 and d["latitude"] <= 42
11 assert d["longitude"] >= -88 and d["longitude"] <= -87
At the command line, if you run pytest, it will automatically discover all functions prefixed with
test_ in all .py files underneath the current working directory.
Secondly, set up a continuous pipelining system to continuously run data tests. For example,
you can set up Jenkins¹¹, Travis¹², Azure Pipelines¹³, Prefect¹⁴, and more, depending on what your
organization has bought into.
Sometimes data tests take longer than software tests, especially if you are pulling dumps from a
database, so you might want to run this portion of tests in a separate pipeline instead.
Further reading
• In my essays collection, I wrote about testing data¹⁵.
• Itamar Turner-Trauring has written about keeping tests quick and speedy¹⁶, which is extremely
crucial to keeping yourself motivated to write tests.
¹¹https://www.jenkins.io/
¹²https://travis-ci.org/
¹³https://azure.microsoft.com/en-us/services/devops/pipelines/
¹⁴https://www.prefect.io/
¹⁵https://ericmjl.github.io/essays-on-data-science/software-skills/testing/#tests-for-data
¹⁶https://pythonspeed.com/articles/slow-tests-fast-feedback/
Bipartite Graphs
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'
Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="BYOK12I9vgI", width="100%")
We can model customer purchases of products using a bipartite graph. Here, the two
node sets are customer nodes and product nodes, and edges indicate that a customer C
purchased a product P .
On the basis of this graph, we can do interesting analyses, such as finding customers that are similar
to one another on the basis of their shared product purchases.
Can you think of other situations where a bipartite graph model can be useful?
Dataset
Here’s another application in crime analysis, which is relevant to the example that we will use in
this chapter:
This bipartite network contains persons who appeared in at least one crime case as either
a suspect, a victim, a witness or both a suspect and victim at the same time. A left node
represents a person and a right node represents a crime. An edge between two nodes
shows that the left node was involved in the crime represented by the right node.
If you inspect the nodes, you will see that they contain a special metadata keyword: bipartite. This
is a special keyword that NetworkX can use to identify nodes of a given partition.
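Here is a sketch of how that keyword gets attached when building a bipartite graph by hand (the partition labels person and crime are our own):

```python
import networkx as nx

B = nx.Graph()
# Tag each node with its partition via the special `bipartite` keyword.
B.add_nodes_from(["p1", "p2"], bipartite="person")
B.add_nodes_from(["c1", "c2"], bipartite="crime")
B.add_edges_from([("p1", "c1"), ("p2", "c1"), ("p2", "c2")])

print(nx.is_bipartite(B))          # True
print(B.nodes["p1"]["bipartite"])  # person
```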
1 import nxviz as nv
2 import matplotlib.pyplot as plt
3
4 fig, ax = plt.subplots(figsize=(7, 7))
5 nv.circos(G, sort_by="degree", group_by="bipartite", node_color_by="bipartite", node\
6 _aes_kwargs={"size_scale": 3})
1 <AxesSubplot:>
Write a function that extracts all of the nodes from a specified node partition. It should
also raise a plain Exception if no nodes exist in that specified partition (as a precaution
against users putting in invalid partition names).
1 import networkx as nx
2
3 def extract_partition_nodes(G: nx.Graph, partition: str):
4 nodeset = [_ for _, _ in _______ if ____________]
5 if _____________:
6 raise Exception(f"No nodes exist in the partition {partition}!")
7 return nodeset
8
9 from nams.solutions.bipartite import extract_partition_nodes
10 # Uncomment the next line to see the answer.
11 # extract_partition_nodes??
(Figure: a bipartite graph with “alphabet” and “numeric” node sets, and its projection onto the “alphabet” set.)
As shown in the figure above, we start first with a bipartite graph with two node sets, the “alphabet”
set and the “numeric” set. The projection of this bipartite graph onto the “alphabet” node set is
a graph that is constructed such that it only contains the “alphabet” nodes, and edges join the
“alphabet” nodes because they share a connection to a “numeric” node. The red edge on the right is
basically the red path traced on the left.
1 True
Now that we’ve confirmed that the graph is indeed bipartite, we can use the NetworkX bipartite
submodule functions to generate the bipartite projection onto one of the node partitions.
First off, we need to extract nodes from a particular partition.
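Putting those two steps together on a toy graph of our own (node names are stand-ins), extraction followed by bipartite.projected_graph looks like this:

```python
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
B.add_nodes_from(["p1", "p2", "p3"], bipartite="person")
B.add_nodes_from(["c1", "c2"], bipartite="crime")
B.add_edges_from([("p1", "c1"), ("p2", "c1"), ("p2", "c2"), ("p3", "c2")])

# Extract one partition's nodes, then project onto them.
people = [n for n, d in B.nodes(data=True) if d["bipartite"] == "person"]
toy_person_graph = bipartite.projected_graph(B, people)

# p1-p2 share crime c1; p2-p3 share crime c2; p1 and p3 share nothing.
print(sorted(map(sorted, toy_person_graph.edges())))
```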
1 list(person_graph.edges(data=True))[0:5]
1 list(crime_graph.edges(data=True))[0:5]
• For person_graph, we have found individuals who are linked by shared participation (whether
witness or suspect) in a crime.
• For crime_graph, we have found crimes that are linked by shared involvement by people.
Just from this graph, we can already find out pretty useful information. Let’s use an exercise that
leverages what you already know to extract useful information from the projected graph.
Exercise: find the crime(s) that have the most shared connections
with other crimes
Find crimes that are most similar to one another on the basis of the number of shared
connections to individuals.
1 import pandas as pd
2
3 def find_most_similar_crimes(cG: nx.Graph):
4 """
5 Find the crimes that are most similar to other crimes.
6 """
7 dcs = ______________
8 return ___________________
9
10
11 from nams.solutions.bipartite import find_most_similar_crimes
12 find_most_similar_crimes(crime_graph)
1 c110 0.136364
2 c47 0.070909
3 c23 0.070909
4 c95 0.063636
5 c14 0.061818
6 c352 0.060000
7 c432 0.060000
8 c160 0.058182
9 c417 0.058182
10 c525 0.058182
11 dtype: float64
1 p425 0.061594
2 p2 0.057971
3 p356 0.053140
4 p56 0.039855
5 p695 0.039855
6 p497 0.036232
7 p715 0.035024
8 p10 0.033816
9 p815 0.032609
10 p74 0.030193
11 dtype: float64
Weighted Projection
Though we were able to find out which nodes were connected with one another, we did not record
in the resulting projected graph the strength of the connection between the two nodes. To preserve
this information, we need another function:
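A minimal sketch of what the weighted projection records, again using a toy graph with invented node names:

```python
import networkx as nx

# Two people co-implicated in two separate crimes (toy data).
G = nx.Graph()
G.add_nodes_from(["p1", "p2"], bipartite="person")
G.add_nodes_from(["c1", "c2"], bipartite="crime")
G.add_edges_from([("p1", "c1"), ("p2", "c1"), ("p1", "c2"), ("p2", "c2")])

# weighted_projected_graph stores the number of shared neighbors on each edge.
person_graph = nx.bipartite.weighted_projected_graph(G, ["p1", "p2"])
print(person_graph.edges["p1", "p2"]["weight"])  # 2
```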
Exercise: Find the people that can help with investigating a crime’s
person.
Let's pretend that we are a detective trying to solve a crime, and that we now need to find other
individuals who were not implicated in the exact same crime as an individual was, but who might
be able to give us information about that individual because they were implicated in other crimes
with that individual.
Implement a function that takes in a bipartite graph G, a string person and a string
crime, and returns a list of other persons that were not implicated in the crime, but were
connected to the person via other crimes. It should return a ranked list, based on the
number of shared crimes (from highest to lowest) because the ranking will help with
triage.
1 list(G.neighbors('p1'))
node weight
27 p67 4
1 p361 2
29 p338 2
14 p356 2
25 p223 1
26 p608 1
28 p578 1
30 p304 1
31 p186 1
32 p661 1
33 p781 1
34 p820 1
0 p39 1
36 p401 1
37 p710 1
38 p300 1
39 p287 1
40 p309 1
41 p5 1
42 p587 1
43 p563 1
44 p806 1
45 p286 1
35 p320 1
23 p528 1
24 p360 1
22 p439 1
2 p499 1
3 p449 1
4 p4 1
5 p471 1
6 p48 1
7 p90 1
8 p475 1
9 p498 1
10 p690 1
11 p620 1
12 p603 1
13 p660 1
15 p768 1
16 p782 1
17 p495 1
18 p305 1
19 p665 1
20 p773 1
21 p211 1
46 p716 1
Degree Centrality
The degree centrality metric is something we can calculate for bipartite graphs. Recall that the degree
centrality metric is the number of neighbors of a node divided by the total number of possible
neighbors.
In a unipartite graph, the denominator can be the total number of nodes less one (if self-loops are
not allowed) or simply the total number of nodes (if self loops are allowed).
The total number of neighbors that a node can _possibly_ have is the number of nodes in the other
partition. This comes naturally from the definition of a bipartite graph, where nodes can _only_ be
connected to nodes in the other partition.
To do so, you will need to use nx.bipartite.degree_centrality, rather than the regular
nx.degree_centrality function.
nx.bipartite.degree_centrality requires that you pass in a node set from one of the partitions
so that it can correctly partition nodes on the other set. What is returned, though, is the degree
centrality for nodes in both sets. Here is an example to show you how the function is used:
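Here is a hedged, self-contained sketch of that usage on a toy graph (the node names are invented):

```python
import networkx as nx

G = nx.Graph()
G.add_nodes_from(["p1", "p2"], bipartite="person")
G.add_nodes_from(["c1", "c2", "c3"], bipartite="crime")
G.add_edges_from([("p1", "c1"), ("p1", "c2"), ("p2", "c3")])

# Pass in one partition's node set; centralities for BOTH partitions come back.
dc = nx.bipartite.degree_centrality(G, ["p1", "p2"])

# p1 is connected to 2 of the 3 possible crime nodes: centrality 2/3.
print(dc["p1"])
# c1 is connected to 1 of the 2 possible person nodes: centrality 1/2.
print(dc["c1"])
```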
1 'p815'
Solutions
Here are the solutions to the exercises above.
1 import networkx as nx
2 import pandas as pd
3 from nams.functions import render_html
4
5
6 def extract_partition_nodes(G: nx.Graph, partition: str):
7 nodeset = [n for n, d in G.nodes(data=True) if d["bipartite"] == partition]
8 if len(nodeset) == 0:
9 raise Exception(f"No nodes exist in the partition {partition}!")
10 return nodeset
11
12
13 def bipartite_example_graph():
14 bG = nx.Graph()
15 bG.add_nodes_from("abcd", bipartite="letters")
16 bG.add_nodes_from(range(1, 4), bipartite="numbers")
17 bG.add_edges_from([("a", 1), ("b", 1), ("b", 3), ("c", 2), ("c", 3), ("d", 1)])
18
19 return bG
20
21
22 def draw_bipartite_graph_example():
23 """Draw an example bipartite graph and its corresponding projection."""
24 import matplotlib.pyplot as plt
25 import nxviz as nv
26 from nxviz import annotate, plots, highlights
27
28 fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
29 plt.sca(ax[0])
30 bG = bipartite_example_graph()
31 nv.parallel(bG, group_by="bipartite", node_color_by="bipartite")
32 annotate.parallel_group(bG, group_by="bipartite", y_offset=-0.5)
33 highlights.parallel_edge(bG, "a", 1, group_by="bipartite")
34 highlights.parallel_edge(bG, "b", 1, group_by="bipartite")
35
36 pG = nx.bipartite.projected_graph(bG, nodes=list("abcd"))
37 plt.sca(ax[1])
38 nv.arc(pG)
39 highlights.arc_edge(pG, "a", "b")
40 return ax
41
42
43 def find_most_similar_crimes(cG: nx.Graph):
44 """
45 Find the crimes that are most similar to other crimes.
46 """
47 dcs = pd.Series(nx.degree_centrality(cG))
48 return dcs.sort_values(ascending=False).head(10)
49
50
51 def find_most_similar_people(pG: nx.Graph):
52 """
53 Find the persons that are most similar to other persons.
54 """
55 dcs = pd.Series(nx.degree_centrality(pG))
56 return dcs.sort_values(ascending=False).head(10)
57
58
59 def find_connected_persons(G, person, crime):
60 """Answer to exercise on people implicated in crimes"""
61 # Step 0: Check that the given "person" and "crime" are connected.
62 if not G.has_edge(person, crime):
63 raise ValueError(
64 f"Graph does not have a connection between {person} and {crime}!"
65 )
66
67 # Step 1: calculate weighted projection for person nodes.
68 person_nodes = extract_partition_nodes(G, "person")
69 person_graph = nx.bipartite.weighted_projected_graph(G, person_nodes)
70
71 # Step 2: Find neighbors of the given `person` node in projected graph.
72 candidate_neighbors = set(person_graph.neighbors(person))
73
74 # Step 3: Remove candidate neighbors from the set
75 # if they are implicated in the given crime.
76 for p in G.neighbors(crime):
77 if p in candidate_neighbors:
78 candidate_neighbors.remove(p)
79
80 # Step 4: Rank-order the candidate neighbors by number of shared connections.
81 data = []
82 for nbr in candidate_neighbors:
83 data.append(dict(node=nbr, weight=person_graph.edges[person, nbr]["weight"]))
84 return pd.DataFrame(data).sort_values("weight", ascending=False)
85
86
87 def bipartite_degree_centrality_denominator():
88 """Answer to bipartite graph denominator for degree centrality."""
89
90 ans = """
91 The total number of neighbors that a node can _possibly_ have
92 is the number of nodes in the other partition.
93 This comes naturally from the definition of a bipartite graph,
94 where nodes can _only_ be connected to nodes in the other partition.
95 """
96 return ans
97
98
99 def find_most_crime_person(G, person_nodes):
100 dcs = (
101 pd.Series(nx.bipartite.degree_centrality(G, person_nodes))
102 .sort_values(ascending=False)
103 .to_frame()
104 )
105 return dcs.reset_index().query("index.str.contains('p')").iloc[0]["index"]
Linear Algebra
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'
Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="uTHihJiRELc", width="100%")
In this chapter, we will look at graphs through the lens of linear algebra, covering three topics:
1. Path finding
2. Message passing
3. Bipartite projections
Preliminaries
Before we go deep into the linear algebra piece though, we have to first make sure some ideas are
clear.
The most important thing that we need when treating graphs in linear algebra form is the adjacency
matrix. For example, for four nodes joined in a chain:
1 import networkx as nx
2 nodes = list(range(4))
3 G1 = nx.Graph()
4 G1.add_nodes_from(nodes)
5 G1.add_edges_from(zip(nodes, nodes[1:]))
1 nx.draw(G1, with_labels=True)
1 import nxviz as nv
2
3 m = nv.matrix(G1)
1 A1 = nx.to_numpy_array(G1, nodelist=sorted(G1.nodes()))
2 A1
Symmetry
Remember that for an undirected graph, the adjacency matrix will be symmetric about the diagonal,
while for a directed graph, the adjacency matrix will be asymmetric.
Path finding
In the Paths chapter, we saw that we can use the breadth-first search algorithm to find a shortest
path between any two nodes.
As it turns out, using adjacency matrices, we can answer a related question, which is how many
paths exist of length K between two nodes.
To see how, we need to see the relationship between matrix powers and graph path lengths.
Let’s take the adjacency matrix above, raise it to the second power, and see what it tells us.
1 import numpy as np
2 np.linalg.matrix_power(A1, 2)
1. The diagonals equal the degree of each node.
2. The off-diagonals also contain values, which correspond to the number of paths that exist of
length 2 between the node on the row axis and the node on the column axis.

In fact, the diagonal also takes on the same meaning! For the terminal nodes, there is only 1 path
from itself back to itself, while for the middle nodes, there are 2 paths from itself back to itself!
1 np.linalg.matrix_power(A1, 3)
1. There’s no way to go from a node back to itself in 3 steps, thus explaining the diagonals, and
2. The off-diagonals take on the correct values when you think about them in terms of “ways to
go from one node to another”.
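To make the matrix-power idea concrete, here is a small self-contained check on the same four-node chain:

```python
import networkx as nx
import numpy as np

# The four-node chain from above: 0 - 1 - 2 - 3.
G1 = nx.Graph()
G1.add_edges_from(zip(range(4), range(1, 4)))
A1 = nx.to_numpy_array(G1, nodelist=list(range(4)))

# Entry (i, j) of A^k counts the paths of length k from node i to node j.
A_sq = np.linalg.matrix_power(A1, 2)
print(A_sq[1, 1])  # 2.0: a middle node has two length-2 paths back to itself
print(A_sq[0, 0])  # 1.0: a terminal node has only one

A_cube = np.linalg.matrix_power(A1, 3)
print(A_cube.diagonal())  # all zeros: no length-3 path returns to its start
```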
1 G2 = nx.DiGraph()
2 G2.add_nodes_from(nodes)
3 G2.add_edges_from(zip(nodes, nodes[1:]))
4 nx.draw(G2, with_labels=True)
png
1 A2 = nx.to_numpy_array(G2)
2 np.linalg.matrix_power(A2, 2)
1 np.linalg.matrix_power(A2, 3)
1 np.linalg.matrix_power(A2, 4)
Message Passing
Let’s now dive into the second topic here, that of message passing.
To show how message passing works on a graph, let’s start with the directed linear chain, as this
will make things easier to understand.
1 M = np.array([1, 0, 0, 0])
2 M
1 array([1, 0, 0, 0])
1 M @ A2
The message has been passed onto the next node! And if we pass the message one more time:
1 M @ A2 @ A2
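Putting the message-passing steps together in one self-contained sketch:

```python
import networkx as nx
import numpy as np

# Directed chain 0 -> 1 -> 2 -> 3, as above.
G2 = nx.DiGraph()
G2.add_edges_from(zip(range(4), range(1, 4)))
A2 = nx.to_numpy_array(G2, nodelist=list(range(4)))

# Place a "message" on node 0.
M = np.array([1, 0, 0, 0])

# Each right-multiplication by the adjacency matrix passes
# the message one hop along the chain.
print(M @ A2)       # message is now on node 1
print(M @ A2 @ A2)  # message is now on node 2
```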
# Initialize the message
msg = np.zeros(len(G2))
msg[0] = 1

# Animate the graph with message propagation.
# HTML(anim(G2, msg, n_frames=4).to_html5_video())
Bipartite graphs have a natural matrix representation, known as the biadjacency matrix. Nodes on
one partition are the rows, and nodes on the other partition are the columns.
NetworkX’s bipartite module provides a function for computing the biadjacency matrix of a
bipartite graph.
Let’s start by looking at a toy bipartite graph, a “customer-product” purchase record graph, with 4
products and 3 customers. The matrix representation might be as follows:
From this “bi-adjacency” matrix, one can compute the projection onto the customers by matrix-
multiplying the matrix with its transpose.
1 array([[1, 0, 1],
2 [0, 2, 2],
3 [1, 2, 4]])
What we get is the connectivity matrix of the customers, based on shared purchases. The diagonals
are the degree of the customers in the original graph, i.e. the number of purchases they originally
made, and the off-diagonals are the connectivity matrix, based on shared products.
To get the products matrix, we make the transposed matrix the left side of the matrix multiplication.
1 array([[2, 1, 2, 1],
2 [1, 2, 1, 1],
3 [2, 1, 2, 1],
4 [1, 1, 1, 1]])
You may now try to convince yourself that the diagonals are the number of customers who
purchased that product, and that the off-diagonals form the connectivity matrix of the products,
weighted by the number of shared customers.
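The two projections can be reproduced by hand. The biadjacency matrix below is one matrix consistent with the outputs shown above (an assumption, since the toy graph itself isn't printed in this excerpt):

```python
import numpy as np

# Rows are the 3 customers, columns are the 4 products (assumed layout).
B = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
])

# Customer projection: diagonals count purchases per customer,
# off-diagonals count shared purchases between two customers.
customer_mat = B @ B.T
print(customer_mat)

# Product projection: the transpose goes on the left instead.
product_mat = B.T @ B
print(product_mat)
```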
Exercises
In the following exercises, you will now play with a customer-product graph from Amazon. This
dataset was downloaded from Julian McAuley's website at UCSD¹⁷ and corresponds to the digital
music dataset.
This is a bipartite graph; the two partitions are customers and products. In the original dataset (see
the original JSON in the datasets/ directory), they are referred to as:
• customers: reviewerID
• products: asin
¹⁷http://jmcauley.ucsd.edu/data/amazon/
Remember that with bipartite graphs, it is useful to obtain nodes from one of the partitions.
You’ll notice that this matrix is extremely large! There are 5541 customers and 3568 products, for a
total matrix size of 5541 × 3568 = 19770288, but it is stored in a sparse format because only 64706
elements are filled in.
1 mat
Next, get the diagonals of the customer-customer matrix. Recall here that in customer_mat, the
diagonals correspond to the degree of the customer nodes in the bipartite matrix.
SciPy sparse matrices provide a .diagonal() method that returns the diagonal elements.
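For instance, on a toy sparse matrix standing in for the real customer matrix:

```python
import numpy as np
import scipy.sparse as sp

# Toy customer-customer projection in sparse form.
customer_mat = sp.csr_matrix(np.array([[1, 0, 1],
                                       [0, 2, 2],
                                       [1, 2, 4]]))

# .diagonal() returns the diagonal entries as a dense 1-D array.
degrees = customer_mat.diagonal()
print(degrees)  # [1 2 4]
```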
Finally, find the index of the customer that has the highest degree.
1 cust_idx = np.argmax(degrees)
2 cust_idx
1 294
1 import pandas as pd
2 import janitor
3
4 # There's some pandas-fu we need to use to get this correct.
5 deg = (
6 pd.Series(dict(nx.degree(G_amzn, customer_nodes)))
7 .to_frame()
8 .reset_index()
9 .rename_column("index", "customer")
10 .rename_column(0, "num_reviews")
11 .sort_values('num_reviews', ascending=False)
12 )
13 print(deg.head().to_markdown())
|     | customer       | num_reviews |
|----:|:---------------|------------:|
| 294 | A9Q28YTLYREO7  |         578 |
|  86 | A3HU0B9XUEVHIM |         375 |
|  77 | A3KJ6JAZPH382D |         301 |
| 307 | A3C6ZCBUNXUT7V |         261 |
| 218 | A8IFUOL8S9BZC  |         256 |
Indeed, customer 294 was the one who had the greatest number of reviews!
2. Subtract the diagonals from the customer matrix projection. This yields the customer-customer
similarity matrix, which should only consist of the off-diagonal elements of the customer
matrix projection.
3. Finally, get the indices where the weight (the number of shared purchases between the
customers) is highest. (This code is provided for you.)
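A hedged sketch of steps 2 and 3 on a toy matrix (the real code operates on the full Amazon customer matrix):

```python
import numpy as np
import scipy.sparse as sp

# Toy customer-customer projection matrix.
customer_mat = sp.csr_matrix(np.array([[1, 0, 1],
                                       [0, 2, 2],
                                       [1, 2, 4]]))

# Step 2: subtract the diagonal (self-similarity) to keep only
# the off-diagonal similarity entries.
off_diagonals = customer_mat - sp.diags(customer_mat.diagonal())

# Step 3: the argmax over what remains gives the most similar pair.
i, j = np.unravel_index(np.argmax(off_diagonals), customer_mat.shape)
print(i, j)
```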
1 import scipy.sparse as sp
1 (294, 86)
Objects
Let’s first use NetworkX’s built-in machinery to find customers that are most similar.
1 15.934 seconds
2 Most similar customers: ('A3HU0B9XUEVHIM', 'A9Q28YTLYREO7', {'weight': 154})
Matrices
Now, let’s implement the same thing in matrix form.
start = time()

# Compute the projection using matrices
mat = nx.bipartite.matrix.biadjacency_matrix(G_amzn, customer_nodes)
customer_mat = mat @ mat.T

# Identify the most similar customers
degrees = customer_mat.diagonal()
customer_diags = sp.diags(degrees)
off_diagonals = customer_mat - customer_diags
c1, c2 = np.unravel_index(np.argmax(off_diagonals), customer_mat.shape)

end = time()
print(f'{end - start:.3f} seconds')
print(f'Most similar customers: {customer_nodes[c1]}, {customer_nodes[c2]}, {customer_mat[c1, c2]}')
1 0.471 seconds
2 Most similar customers: A9Q28YTLYREO7, A3HU0B9XUEVHIM, 154
On a modern PC, the matrix computation should be about 10-50X faster using the matrix form
compared to the object-oriented form. (The web server that is used to build the book might not
necessarily have the software stack to do this though, so the time you see reported might not reflect
the expected speedups.) I’d encourage you to fire up a Binder session or clone the book locally to
test out the code yourself.
You may notice that it’s much easier to read the “objects” code, but the matrix code way outperforms
the object code. This tradeoff is common in computing, and shouldn’t surprise you. That said, the
speed gain alone is a great reason to use matrices!
Acceleration on a GPU
If your appetite has been whetted for even more acceleration and you have a GPU on your daily
compute, then you're very much in luck!
The RAPIDS.AI¹⁸ project has a package called cuGraph¹⁹, which provides GPU-accelerated graph
algorithms. As of release 0.16.0, all cuGraph algorithms are able to accept NetworkX graph
objects! This came about through online conversations on GitHub and Twitter, which, for us
personally, speaks volumes about the power of open source projects!
Because cuGraph does presume that you have access to a GPU, and because we assume most readers
of this book might not have access to one easily, we’ll delegate teaching how to install and use
cuGraph to the cuGraph devs and their documentation²⁰. Nonetheless, if you do have the ability to
install and use the RAPIDS stack, definitely check it out!
¹⁸https://rapids.ai
¹⁹https://github.com/rapidsai/cugraph
²⁰https://docs.rapids.ai/api/cugraph/stable/api.html
Statistical Inference
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'
Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="P-0CJpO3spg", width="100%")
Statistics refresher
Before we can proceed with statistical inference on graphs, we must first refresh ourselves on some
ideas from the world of statistics. Otherwise, the methods that we will end up using may seem a
tad weird, and hence difficult to follow.
To review statistical ideas, let’s set up a few statements and explore what they mean.
In an abstract fashion…
The supremely abstract way of thinking about a probability distribution is that it is the space of all
possibilities of “stuff” with different credibility points distributed amongst each possible “thing”.
Hypothesis Testing
A commonplace task in statistical inference is calculating the probability of observing a value, or
something more extreme, under an assumed "null" model of reality. This is what we commonly call
"hypothesis testing", and it is where the oft-misunderstood term "p-value" shows up.
• I observe that 8 out of 10 coin tosses give me heads, giving me a probability of heads p = 0.8 (a
summary statistic).
• Under a “null distribution” of a fair coin, I simulate the distribution of probability of heads (the
summary statistic) that I would get from 10 coin tosses.
• Finally, I use that distribution to calculate the probability of observing p = 0.8 or more extreme.
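These three steps can be sketched in a few lines (a toy simulation, not the book's code):

```python
import numpy as np

rng = np.random.default_rng(42)

# Null model: a fair coin, tossed 10 times, simulated many times over.
num_sims = 10_000
sim_p_heads = rng.binomial(n=10, p=0.5, size=num_sims) / 10

# Probability of observing p = 0.8 or more extreme under the null.
p_value = (sim_p_heads >= 0.8).mean()
print(p_value)  # close to the exact value, 56/1024 ≈ 0.0547
```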
Secondly, we propose a null graph model, and calculate our summary statistic under simulated
versions of that null graph model.
Thirdly, we look at the probability of observing the summary statistic value that we calculated in
step 1 or more extreme, under the assumed graph null model distribution.
1 import networkx as nx
2
3
4 G_er = nx.erdos_renyi_graph(n=30, p=0.2)
1 nx.draw(G_er)
1 len(G_er.edges())
1 90
1 len(G_er.edges()) / 435
1 0.20689655172413793
1 import pandas as pd
2 from nams.functions import ecdf
3 import matplotlib.pyplot as plt
4
5 x, y = ecdf(pd.Series(dict(nx.degree(G_er))))
6 plt.scatter(x, y)
1 <matplotlib.collections.PathCollection at 0x7f0d7c4ff880>
Barabasi-Albert Graph
The data generating story of this graph generator is essentially that nodes that have lots of edges
preferentially get new edges attached onto them. This is what we call a “preferential attachment”
process.
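The graph whose statistics are shown below can be generated along these lines (m=3 is an inference from the edge count, since the generating call isn't shown in this excerpt):

```python
import networkx as nx

# Each new node attaches m = 3 edges, preferentially to high-degree nodes.
G_ba = nx.barabasi_albert_graph(n=30, m=3, seed=42)

# A Barabasi-Albert graph on n nodes with parameter m has m * (n - m) edges.
print(len(G_ba.edges()))  # 81
```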
1 len(G_ba.edges())
1 81
1 x, y = ecdf(pd.Series(dict(nx.degree(G_ba))))
2 plt.scatter(x, y)
1 <matplotlib.collections.PathCollection at 0x7f0d792097f0>
You can see that even though the numbers of edges in the two graphs are similar, their degree
distributions are wildly different.
Load Data
For this notebook, we are going to look at a protein-protein interaction network, and test the
hypothesis that this network was not generated by the data generating process described by an
Erdos-Renyi graph.
Let’s load a protein-protein interaction network dataset²¹.
²¹http://konect.uni-koblenz.de/networks/moreno_propro
As is always the case, let’s make sure we know some basic stats of the graph.
1 len(G.nodes())
1 1870
1 len(G.edges())
1 2277
1 x, y = ecdf(pd.Series(dict(nx.degree(G))))
2 plt.scatter(x, y)
1 <matplotlib.collections.PathCollection at 0x7f0d78fc6f70>
1 import nxviz as nv
2 from nxviz import annotate
3
4 nv.circos(G, sort_by="degree", node_color_by="degree", node_aes_kwargs={"size_scale"\
5 : 10})
6 annotate.node_colormapping(G, color_by="degree")
One thing we might infer from this visualization is that the vast majority of nodes have a very small
degree, while a very small number of nodes have a high degree. That would prompt us to think:
what process could be responsible for generating this graph?
Given the degree distribution only, which model do you think better describes the generation of a
protein-protein interaction network?
²²https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html
1 deg = pd.Series(dict(nx.degree(G)))
2
3 er_deg = erdos_renyi_degdist(n=len(G.nodes()), p=0.001)
4 ba_deg = barabasi_albert_degdist(n=len(G.nodes()), m=1)
5 wasserstein_distance(deg, er_deg), wasserstein_distance(deg, ba_deg)
1 (0.792513368983957, 0.5272727272727269)
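The helper functions erdos_renyi_degdist and barabasi_albert_degdist used above aren't defined in this excerpt; a plausible self-contained sketch of them (an assumption about their implementation) is:

```python
import networkx as nx
from scipy.stats import wasserstein_distance

def erdos_renyi_degdist(n, p):
    """Degree sequence of one Erdos-Renyi graph draw."""
    G = nx.erdos_renyi_graph(n=n, p=p)
    return [d for _, d in nx.degree(G)]

def barabasi_albert_degdist(n, m):
    """Degree sequence of one Barabasi-Albert graph draw."""
    G = nx.barabasi_albert_graph(n=n, m=m)
    return [d for _, d in nx.degree(G)]

# The Wasserstein distance between two degree sequences is a
# non-negative "distribution distance"; smaller means more similar.
d = wasserstein_distance(erdos_renyi_degdist(200, 0.01),
                         barabasi_albert_degdist(200, 1))
print(d)
```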
Notice that because the graphs are instantiated in a non-deterministic fashion, re-running the cell
above will give you different values for each new graph generated.
Let's now plot the Wasserstein distance to our graph data for the two particular Erdos-Renyi and
Barabasi-Albert graph models shown above.
From this, we might conclude that the Barabasi-Albert graph with m = 1 has the better fit to the
protein-protein interaction network graph.
Interpretation
That statement, accurate as it might be, still does not connect the dots to biology.
Let’s think about the generative model for this graph. The Barabasi-Albert graph gives us a model
for “rich gets richer”. Given the current state of the graph, if we want to add a new edge, we first pick
a node with probability proportional to the number of edges it already has. Then, we pick another
node with probability proportional to the number of edges that it has too. Finally, we add an edge
there. This has the effect of “enriching” nodes that have a large number of edges with more edges.
How might this connect to biology?
We can’t necessarily provide a concrete answer, but this model might help raise new hypotheses.
For example, if protein-protein interactions of the “binding” kind are driven by subdomains, then
proteins that acquire a domain through recombination may end up being able to bind to everything
else that the domain was able to. In this fashion, proteins with that particular binding domain gain
new edges more readily.
Testing these hypotheses would be a totally different matter, and at this point, I submit the above
hypothesis with a large amount of salt thrown over my shoulder. In other words, the hypothesized
mechanism could be completely wrong. However, I hope that this example illustrated that the usage
of a “graph generative model” can help us narrow down hypotheses about the observed world.
Game of Thrones
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'
5 import pandas as pd
6 import networkx as nx
7 import community
8 import numpy as np
9 import matplotlib.pyplot as plt
Introduction
In this chapter, we will use Game of Thrones as a case study to practice our newly learnt skills of
network analysis.
Surprising, right? What is the relationship between a fantasy TV show/novel and network science
or Python (no, not dragons)?
If you haven’t heard of Game of Thrones, then you must be really good at hiding. Game of Thrones
is a hugely popular television series by HBO based on the (also) hugely popular book series A Song
of Ice and Fire by George R.R. Martin. In this notebook, we will analyze the co-occurrence network
of the characters in the Game of Thrones books. Here, two characters are considered to co-occur if
their names appear in the vicinity of 15 words from one another in the books.
The figure below is a precursor of what we will analyse in this chapter.
The resulting DataFrame books has 5 columns: Source, Target, Type, weight, and book. Source and
Target are the two nodes that are linked by an edge. As we know, a network can have directed or
undirected edges, and in this network all the edges are undirected. The weight attribute of every edge
tells us the number of interactions that the characters have had over the book, and the book column
tells us the book number.
Let’s have a look at the data.
1 print(books.head().to_markdown())
From the above data we can see that the characters Addam Marbrand and Tywin Lannister have
interacted 6 times in the first book.
We can investigate this data by using the pandas DataFrame. Let’s find all the interactions of Robb
Stark in the third book.
1 robbstark = (
2 books.query("book == 3")
3 .query("Source == 'Robb-Stark' or Target == 'Robb-Stark'")
4 )
1 print(robbstark.head().to_markdown())
As you can see this data easily translates to a network problem. Now it’s time to create a network.
We create a graph for each book. It's possible to create one MultiGraph (a graph with multiple edges
between nodes) instead of 5 graphs, but it is easier to analyse and manipulate individual Graph
objects rather than a MultiGraph.
1 <networkx.classes.graph.Graph at 0x7f6f1e63abe0>
1 relationships[0:3]
1 [('Addam-Marbrand',
2 'Jaime-Lannister',
3 {'weight': 3, 'weight_inv': 0.3333333333333333}),
4 ('Addam-Marbrand',
5 'Tywin-Lannister',
6 {'weight': 6, 'weight_inv': 0.16666666666666666}),
7 ('Jaime-Lannister', 'Aerys-II-Targaryen', {'weight': 5, 'weight_inv': 0.2})]
degree_centrality returns a dictionary and to access the results we can directly use the name of
the character.
1 deg_cen_book1['Daenerys-Targaryen']
1 0.11290322580645162
1 [('Eddard-Stark', 0.3548387096774194),
2 ('Robert-Baratheon', 0.2688172043010753),
3 ('Tyrion-Lannister', 0.24731182795698928),
4 ('Catelyn-Stark', 0.23118279569892475),
5 ('Jon-Snow', 0.19892473118279572)]
1 sorted(deg_cen_book5.items(),
2 key=lambda x:x[1],
3 reverse=True)[0:5]
1 [('Jon-Snow', 0.1962025316455696),
2 ('Daenerys-Targaryen', 0.18354430379746836),
3 ('Stannis-Baratheon', 0.14873417721518986),
4 ('Tyrion-Lannister', 0.10443037974683544),
5 ('Theon-Greyjoy', 0.10443037974683544)]
To visualize the distribution of degree centrality, let's plot it as a histogram.
1 plt.hist(deg_cen_book1.values(), bins=30)
2 plt.show()
The above plot shows something that is expected: a large portion of characters aren't connected
to many other characters, while some characters are highly connected throughout the network. A
close real-world example of this is a social network like Twitter, where a few people have millions
of connections (followers) but the majority of users aren't connected to that many other users. This
exponential-decay-like property resembles the power law seen in real-life networks.
Exercise
Create a new centrality measure, weighted_degree(Graph, weight), which takes in a Graph and
the name of a weight attribute and returns a weighted degree dictionary. Weighted degree is
calculated by summing the weights of all edges of a node. Then find the top five characters
according to this measure.
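One possible solution sketch (the book's own solution may differ):

```python
import networkx as nx

def weighted_degree(G, weight):
    """Sum the given edge attribute over each node's edges."""
    result = dict()
    for node in G.nodes():
        result[node] = sum(d[weight] for _, _, d in G.edges(node, data=True))
    return result

# Tiny check on a toy graph.
G = nx.Graph()
G.add_edge("a", "b", weight=3)
G.add_edge("a", "c", weight=2)
print(weighted_degree(G, "weight")["a"])  # 5
```

NetworkX can also compute the same quantity directly via G.degree(weight="weight").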
1 [('Eddard-Stark', 1284),
2 ('Robert-Baratheon', 941),
3 ('Jon-Snow', 784),
4 ('Tyrion-Lannister', 650),
5 ('Sansa-Stark', 545)]
Betweenness centrality
Let's do this for betweenness centrality and check if this makes any difference. As different
centrality methods use different measures underneath, they find different kinds of important nodes.
A method like betweenness centrality finds nodes which are structurally important to the network,
i.e. nodes which bind the network together.
1 [('Eddard-Stark', 0.2696038913836117),
2 ('Robert-Baratheon', 0.21403028397371796),
3 ('Tyrion-Lannister', 0.1902124972697492),
4 ('Jon-Snow', 0.17158135899829566),
5 ('Catelyn-Stark', 0.1513952715347627),
6 ('Daenerys-Targaryen', 0.08627015537511595),
7 ('Robb-Stark', 0.07298399629664767),
8 ('Drogo', 0.06481224290874964),
9 ('Bran-Stark', 0.05579958811784442),
10 ('Sansa-Stark', 0.03714483664326785)]
1 [('Eddard-Stark', 0.5926474861958733),
2 ('Catelyn-Stark', 0.36855565242662014),
3 ('Jon-Snow', 0.3514094739901191),
4 ('Robert-Baratheon', 0.3329991281604185),
5 ('Tyrion-Lannister', 0.27137460040685846),
6 ('Daenerys-Targaryen', 0.202615518744551),
7 ('Bran-Stark', 0.0945655332752107),
8 ('Robb-Stark', 0.09177564661435629),
9 ('Arya-Stark', 0.06939843068875327),
10 ('Sansa-Stark', 0.06870095902353966)]
We can see there are some differences between the unweighted and weighted centrality measures.
Another thing to note is that we are using the weight_inv attribute instead of weight (the number
of interactions between characters). This decision is based on the way we want to assign the notion
of "importance" of a character. The basic idea behind betweenness centrality is to find nodes which
are essential to the structure of the network. As betweenness centrality computes shortest paths
underneath, using weight directly would end up penalising characters with a high number of
interactions. By using weight_inv we instead prop up the characters with many interactions with
other characters.
PageRank
PageRank, the billion-dollar algorithm, works by counting the number and quality of links to a page
to determine a rough estimate of how important the website is. The underlying assumption is that
more important websites are likely to receive more links from other websites.
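A tiny illustration of the "more links means more important" idea (the page names here are invented):

```python
import networkx as nx

# A tiny directed "web": pages a, b, and c all link to hub.
G = nx.DiGraph()
G.add_edges_from([("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")])

# The page receiving the most (and best-sourced) links ranks highest.
pr = nx.pagerank(G)
print(max(pr, key=pr.get))  # hub
```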
NOTE: We don't need to worry about weight and weight_inv in PageRank, as the algorithm uses
weights in the opposite sense (larger weights are better). This may seem confusing, as different
centrality measures have different definitions of weights, so it is always better to have a look at
the documentation before using weights in a centrality measure.
1 [('Eddard-Stark', 0.04552079222830671),
2 ('Tyrion-Lannister', 0.03301362462493269),
3 ('Catelyn-Stark', 0.030193105286631914),
4 ('Robert-Baratheon', 0.02983474222773675),
5 ('Jon-Snow', 0.02683449952206619),
6 ('Robb-Stark', 0.021562941297247524),
7 ('Sansa-Stark', 0.020008034042864654),
8 ('Bran-Stark', 0.019945786786238318),
9 ('Jaime-Lannister', 0.017507847202846937),
10 ('Cersei-Lannister', 0.017082604584758087)]
1 sorted(nx.pagerank_numpy(
2 graphs[0], weight='weight').items(),
3 key=lambda x:x[1], reverse=True)[0:10]
1 [('Eddard-Stark', 0.0723940110049824),
2 ('Robert-Baratheon', 0.0485172757050994),
3 ('Jon-Snow', 0.04770689062474911),
4 ('Tyrion-Lannister', 0.04367437892706296),
5 ('Catelyn-Stark', 0.034667034701307414),
6 ('Bran-Stark', 0.029774200539800212),
7 ('Robb-Stark', 0.02921618364519686),
8 ('Daenerys-Targaryen', 0.027089622513021085),
9 ('Sansa-Stark', 0.026961778915683125),
10 ('Cersei-Lannister', 0.02163167939741897)]
Exercise
|   |        0 |        1 |        2 |        3 |
|---|---------:|---------:|---------:|---------:|
| 0 | 1        | 0.910352 | 0.992166 | 0.949307 |
| 1 | 0.910352 | 1        | 0.87924  | 0.790526 |
| 2 | 0.992166 | 0.87924  | 1        | 0.95506  |
| 3 | 0.949307 | 0.790526 | 0.95506  | 1        |
Let's look at the evolution of degree centrality of a couple of characters, like Eddard Stark, Jon Snow
and Tyrion, who showed up in the top 10 of degree centrality in the first book.
We create a dataframe with character columns and books as the index, where every entry is the degree
centrality of the character in that particular book, and plot the evolution of degree centrality of
Eddard Stark, Jon Snow and Tyrion. We can see that the importance of Eddard Stark in the network
dies off, and with Jon Snow there is a drop in the fourth book but a sudden rise in the fifth book.
1 evol = [nx.degree_centrality(graph)
2 for graph in graphs]
3 evol_df = pd.DataFrame.from_records(evol).fillna(0)
4 evol_df[['Eddard-Stark',
5 'Tyrion-Lannister',
6 'Jon-Snow']].plot()
7 plt.show()
set_of_char = set()
for i in range(5):
    set_of_char |= set(list(
        evol_df.T[i].sort_values(
            ascending=False)[0:5].index))
set_of_char
{'Arya-Stark',
 'Brienne-of-Tarth',
 'Catelyn-Stark',
 'Cersei-Lannister',
 'Daenerys-Targaryen',
 'Eddard-Stark',
 'Jaime-Lannister',
 'Joffrey-Baratheon',
 'Jon-Snow',
 'Margaery-Tyrell',
 'Robb-Stark',
 'Robert-Baratheon',
 'Sansa-Stark',
 'Stannis-Baratheon',
 'Theon-Greyjoy',
 'Tyrion-Lannister'}
Exercise
Plot the evolution of betweenness centrality of the above mentioned characters over the 5 books.
evol_betweenness(graphs)
[('Jon-Snow', 0.1962025316455696),
 ('Daenerys-Targaryen', 0.18354430379746836),
 ('Stannis-Baratheon', 0.14873417721518986),
 ('Tyrion-Lannister', 0.10443037974683544),
 ('Theon-Greyjoy', 0.10443037974683544)]

sorted(nx.betweenness_centrality(graphs[4]).items(),
    key=lambda x:x[1], reverse=True)[:5]
[('Stannis-Baratheon', 0.45283060689247934),
 ('Daenerys-Targaryen', 0.2959459062106149),
 ('Jon-Snow', 0.24484873673158666),
 ('Tyrion-Lannister', 0.20961613179551256),
 ('Robert-Baratheon', 0.17716906651536968)]
As we know, a higher betweenness centrality means that the node is crucial to the structure of
the network. In the case of the fifth book, Stannis Baratheon seems to have characteristics similar
to those of node 5 in the barbell graph example, as he appears to be holding the network together.
As evident from the betweenness centrality scores of the barbell graph, node 5 is the most
important node in that network.
nx.betweenness_centrality(nx.barbell_graph(5, 1))
{0: 0.0,
 1: 0.0,
 2: 0.0,
 3: 0.0,
 4: 0.5333333333333333,
 6: 0.5333333333333333,
 7: 0.0,
 8: 0.0,
 9: 0.0,
 10: 0.0,
 5: 0.5555555555555556}
import nxviz as nv
from nxviz import annotate
plt.figure(figsize=(8, 8))

partition = community.best_partition(graphs[0], randomize=False)

# Annotate nodes' partitions
for n in graphs[0].nodes():
    graphs[0].nodes[n]["partition"] = partition[n]
    graphs[0].nodes[n]["degree"] = graphs[0].degree(n)

nv.matrix(graphs[0], group_by="partition", sort_by="degree", node_color_by="partition")
annotate.matrix_block(graphs[0], group_by="partition", color_by="partition")
annotate.matrix_group(graphs[0], group_by="partition", offset=-8)
A common defining quality of a community is that the within-community edges are denser than
the between-community edges.
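As a quick sanity check of that claim, here is a toy sketch (not the book's dataset): the node-to-community mapping stands in for the output of community.best_partition, and the grouping step mirrors the partition_dict used below.

```python
import networkx as nx
from collections import defaultdict

# Toy graph with two obvious communities: triangle {0, 1, 2} and
# triangle {3, 4, 5}, joined by a single bridge edge (2, 3).
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

# Stand-in for the Louvain output: node -> community id.
partition = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

# Group nodes by community, mirroring the book's partition_dict.
partition_dict = defaultdict(list)
for node, comm in partition.items():
    partition_dict[comm].append(node)

# Each community's subgraph is denser than the graph as a whole.
for comm, members in partition_dict.items():
    print(comm, nx.density(nx.subgraph(G, members)), nx.density(G))
```

Here each triangle has density 1.0, while the whole graph (6 nodes, 7 edges) has density well below that.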
len(partition_dict)

8

partition_dict[2]
['Bran-Stark',
 'Rickon-Stark',
 'Robb-Stark',
 'Luwin',
 'Theon-Greyjoy',
 'Hali',
 'Hallis-Mollen',
 'Hodor',
 'Hullen',
 'Joseth',
 'Nan',
 'Osha',
 'Rickard-Karstark',
 'Rickard-Stark',
 'Stiv',
 'Jon-Umber-(Greatjon)',
 'Galbart-Glover',
 'Roose-Bolton',
 'Maege-Mormont']
If we plot these communities of the network, we see a denser network compared to the original
network, which contains all the characters.
nx.draw(nx.subgraph(graphs[0], partition_dict[3]))
nx.draw(nx.subgraph(graphs[0], partition_dict[1]))
We can test this by calculating the density of the community and of the whole network.
In the following example, the network between characters within a community is about 25 times
as dense as the original network.
nx.density(nx.subgraph(
    graphs[0], partition_dict[4])
)/nx.density(graphs[0])

25.42543859649123
Exercise
Find the most important node in the partitions according to degree centrality of the nodes using the
partition_dict we have already created.
most_important_node_in_partition(graphs[0], partition_dict)

{7: 'Tyrion-Lannister',
 1: 'Daenerys-Targaryen',
 6: 'Eddard-Stark',
 3: 'Jon-Snow',
 5: 'Sansa-Stark',
 2: 'Robb-Stark',
 0: 'Waymar-Royce',
 4: 'Danwell-Frey'}
Solutions
Here are the solutions to the exercises above.
import pandas as pd
import networkx as nx


def weighted_degree(G, weight):
    result = dict()
    for node in G.nodes():
        weight_degree = 0
        for n in G.edges([node], data=True):
            weight_degree += n[2]["weight"]
        result[node] = weight_degree
    return result


def correlation_centrality(G):
    cor = pd.DataFrame.from_records(
        [
            nx.pagerank_numpy(G, weight="weight"),
            nx.betweenness_centrality(G, weight="weight_inv"),
            weighted_degree(G, "weight"),
            nx.degree_centrality(G),
        ]
    )
    return cor.T.corr()


def evol_betweenness(graphs):
    evol = [nx.betweenness_centrality(graph, weight="weight_inv") for graph in graphs]
    evol_df = pd.DataFrame.from_records(evol).fillna(0)

    set_of_char = set()
    for i in range(5):
        set_of_char |= set(list(evol_df.T[i].sort_values(ascending=False)[0:5].index))

    evol_df[list(set_of_char)].plot(figsize=(19, 10))


def most_important_node_in_partition(graph, partition_dict):
    max_d = {}
    deg = nx.degree_centrality(graph)
    for group in partition_dict:
        temp = 0
        for character in partition_dict[group]:
            if deg[character] > temp:
                max_d[group] = character
                temp = deg[character]
    return max_d
Airport Network
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
Introduction
In this chapter, we will analyse the evolution of the US Airport Network between 1990 and 2015. This
dataset contains data on 25 years (1990-2015) of flights between various US airports and metadata
about these routes. It is taken from the Bureau of Transportation Statistics, United States Department of
Transportation.
Let’s see what we can make out of this!
In the pass_air_data dataframe we have the number of people that fly every year on a particular
route and the list of airlines that fly that route.
print(pass_air_data.head().to_markdown())
Every row in this dataset is a unique route between 2 airports in United States territory in a particular
year. Let’s see how many people flew from New York JFK to Austin in 2006.
NOTE: This will be a fun chapter if you are an aviation geek and like guessing airport IATA codes.
jfk_aus_2006 = (pass_air_data
                .query('YEAR == 2006')
                .query("ORIGIN == 'JFK' and DEST == 'AUS'"))

print(jfk_aus_2006.head().to_markdown())
From the above pandas query we see that, according to this dataset, 105290 passengers travelled from
JFK to AUS in the year 2006.
But how does this dataset translate to an applied network analysis problem? In the previous chapter
we created different graph objects for every book. Let’s create a graph object which encompasses all
the edges.
NetworkX provides us with Multi(Di)Graphs to model networks with multiple edges between two
nodes.
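As a quick illustration (a toy example, not from the dataset), a MultiDiGraph can hold several keyed edges between the same pair of nodes, just as the airport graph below will hold one edge per year:

```python
import networkx as nx

G = nx.MultiDiGraph()
# Two parallel directed edges from 'A' to 'B', distinguished by the
# edge key (here a year, as in the airport graph below).
G.add_edge('A', 'B', key=1990, PASSENGERS=100)
G.add_edge('A', 'B', key=1991, PASSENGERS=150)

print(G['A']['B'][1990])         # {'PASSENGERS': 100}
print(G.number_of_edges('A', 'B'))  # 2
```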
In this case every row in the dataframe represents a directed edge between two airports. Common
sense suggests that if there is a flight from airport A to airport B, there should definitely be a flight
from airport B to airport A, i.e. the direction of the edge shouldn’t matter. But in this dataset we have
data for individual directions (A -> B and B -> A), so we create a MultiDiGraph.
passenger_graph = nx.from_pandas_edgelist(
    pass_air_data, source='ORIGIN',
    target='DEST', edge_key='YEAR',
    edge_attr=['PASSENGERS', 'UNIQUE_CARRIER_NAME'],
    create_using=nx.MultiDiGraph())
We have created a MultiDiGraph object passenger_graph which contains all the information from
the dataframe pass_air_data. ORIGIN and DEST represent the column names in the dataframe
pass_air_data used to construct the edges. As this is a MultiDiGraph we can also give a name/key to the
multiple edges between two nodes; edge_key is used to represent that name, and in this graph
YEAR is used to distinguish between multiple edges between two nodes. PASSENGERS and
UNIQUE_CARRIER_NAME are added as edge attributes, which can be accessed using the nodes and the key
from the MultiDiGraph object.
Let’s check if we can access the same information (the 2006 route between JFK and AUS) using our
passenger_graph.
To check an edge between two nodes in a Graph we can use the syntax GraphObject[source][target]
and further specify the edge attribute using GraphObject[source][target][attribute].
passenger_graph['JFK']['AUS'][2006]

{'PASSENGERS': 105290.0,
 'UNIQUE_CARRIER_NAME': "{'Shuttle America Corp.', 'Ameristar Air Cargo', 'JetBlue Airways', 'United Parcel Service'}"}
Now let’s use our newly constructed passenger graph to look at the evolution of passenger load over
25 years.
We see some usual trends across the dataset, like the steep drops in 2001 (after 9/11) and 2008
(the recession).
To find the overall trend, we can use our pass_air_data dataframe to calculate total passengers
flown in a year.
pass_air_data.groupby(
    ['YEAR']).sum()['PASSENGERS'].plot()
plt.show()
Exercise
Find the busiest route in 1990 and in 2015 according to number of passengers, and plot the time series
of number of passengers on these routes.
You can use the DataFrame instead of working with the network. It will be faster :)
print(busiest_route(pass_air_data, 1990).head().to_markdown())
print(busiest_route(pass_air_data, 2015).head().to_markdown())
Before moving to the next part of the chapter, let’s create a method to extract edges from
passenger_graph for a particular year, so we can better analyse the data on a granular scale.
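The extraction method itself is not shown in the text. One plausible sketch (the helper name year_network is my assumption, relying on the year being stored as the edge key, as above) filters the MultiDiGraph down to a plain DiGraph for one year; the graph summary printed below comes from running such a method on the full dataset.

```python
import networkx as nx

def year_network(G, year):
    """Extract a DiGraph of the edges whose key matches `year`."""
    H = nx.DiGraph()
    for origin, dest, key, data in G.edges(keys=True, data=True):
        if key == year:
            H.add_edge(origin, dest, **data)
    return H

# Tiny demo on a toy MultiDiGraph (the real graph has 1258 airports).
G = nx.MultiDiGraph()
G.add_edge('JFK', 'AUS', key=2006, PASSENGERS=105290.0)
G.add_edge('JFK', 'AUS', key=2007, PASSENGERS=110000.0)
print(year_network(G, 2006).edges(data=True))  # only the 2006 edge survives
```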
Name:
Type: DiGraph
Number of nodes: 1258
Number of edges: 25354
Average in degree: 20.1542
Average out degree: 20.1542

wanted_nodes = list(pass_2015_network.nodes())
us_airports = lat_long.query("CODE3 in @wanted_nodes").drop_duplicates(subset=["CODE3"]).set_index("CODE3")
us_airports
Let’s first plot only the nodes, i.e. airports. Places like Guam and the US Virgin Islands are also
included, as they are treated as domestic airports in this dataset.
import nxviz as nv
from nxviz import nodes, plots, edges
plt.figure(figsize=(20, 9))
pos = nodes.geo(g, aesthetics_kwargs={"size_scale": 1})
plots.aspect_equal()
plots.despine()
import nxviz as nv
from nxviz import nodes, plots, edges, annotate
plt.figure(figsize=(20, 9))
pos = nodes.geo(g, color_by="degree", aesthetics_kwargs={"size_scale": 1})
edges.line(g, pos, aesthetics_kwargs={"alpha_scale": 0.1})
annotate.node_colormapping(g, color_by="degree")
plots.aspect_equal()
plots.despine()
Before we proceed further, let’s take a detour to briefly discuss directed networks and PageRank.
Source: Wikipedia
To better understand this let’s work through an example.
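The code that builds this toy graph is not included in the text. Here is a reconstruction (my sketch, not necessarily the authors' original) consistent with the edge data and PageRank scores shown below: every other node points at node 2, and the edge from 1 to 2 carries weight 4 (the weight makes no difference to PageRank here, since each source node has only one outgoing edge).

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge(1, 2, weight=4)
# Every remaining node also points at node 2.
for node in [3, 4, 5, 6, 7]:
    G.add_edge(node, 2)
```

With this construction, node 2 collects most of the PageRank mass (about 0.504) and the other six nodes share the rest equally.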
print(G.edges(data=True))

{'weight': 4}
G[2][1]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-137-d6b8db3142ef> in <module>
      1 # Access edge from 2 to 1
----> 2 G[2][1]

~/miniconda3/envs/nams/lib/python3.7/site-packages/networkx/classes/coreviews.py in __getitem__(self, key)
     52
     53     def __getitem__(self, key):
---> 54         return self._atlas[key]
     55
     56     def copy(self):

KeyError: 1
As expected, we get an error when we try to access the edge from 2 to 1, as this is a directed graph.
<AxesSubplot:>
Just by looking at the example above, we can conclude that node 2 should have the highest PageRank
score as all the nodes are pointing towards it.
This is confirmed by calculating the PageRank of this graph.
nx.pagerank(G)

{1: 0.0826448180198328,
 2: 0.5041310918810031,
 3: 0.0826448180198328,
 4: 0.0826448180198328,
 5: 0.0826448180198328,
 6: 0.0826448180198328,
 7: 0.0826448180198328}
G.add_edge(5, 6)
nv.circos(G, node_aes_kwargs={"size_scale": 0.3})
# nx.draw_spring(G, with_labels=True)

<AxesSubplot:>
nx.pagerank(G)

{1: 0.08024854052495894,
 2: 0.4844028780560986,
 3: 0.08024854052495894,
 4: 0.08024854052495894,
 5: 0.08024854052495894,
 6: 0.11435441931910648,
 7: 0.08024854052495894}
As expected, there was some change in the scores (an increase for 6), but the overall trend stays the
same, with node 2 leading the pack.
G.add_edge(2, 8)
nv.circos(G, node_aes_kwargs={"size_scale": 0.3})

<AxesSubplot:>
Now we have added an edge from 2 to a new node 8. As node 2 already has a high PageRank
score, this should be passed on to node 8. Let’s see how much difference this makes.
nx.pagerank(G)

{1: 0.05378612718073915,
 2: 0.3246687852772877,
 3: 0.05378612718073915,
 4: 0.05378612718073915,
 5: 0.05378612718073915,
 6: 0.0766454192258098,
 7: 0.05378612718073915,
 8: 0.3297551595932067}
In this example, node 8 is now even more “important” than node 2, even though node 8 has only
one incoming connection.
Let’s move back to Airports and use this knowledge to analyse the network.
nx.pagerank(passenger_graph)

---------------------------------------------------------------------------
NetworkXNotImplemented                    Traceback (most recent call last)
<ipython-input-144-15a6f513bf9b> in <module>
      1 # Let's try to calulate the PageRank measures of this graph.
----> 2 nx.pagerank(passenger_graph)

<decorator-gen-435> in pagerank(G, alpha, personalization, max_iter, tol, nstart, weight, dangling)

~/miniconda3/envs/nams/lib/python3.7/site-packages/networkx/utils/decorators.py in _not_implemented_for(not_implement_for_func, *args, **kwargs)
     78         if match:
     79             msg = 'not implemented for %s type' % ' '.join(graph_types)
---> 80             raise nx.NetworkXNotImplemented(msg)
     81         else:
     82             return not_implement_for_func(*args, **kwargs)

NetworkXNotImplemented: not implemented for multigraph type
As PageRank isn’t defined for a MultiGraph in NetworkX, we need to use our extracted yearly
sub-networks.
0.0036376572979606586
Before looking at the results, do think about what we just calculated, try to guess which airports
should come out at the top, and be ready to be surprised :D
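The code that builds the top_10_pr, top_10_bc and top_10_dc lists below is not shown in the text. A plausible reconstruction of the pattern (my assumption; demonstrated on a small random graph, since pass_2015_network is built from the full dataset) sorts each centrality dictionary by value:

```python
import networkx as nx

# Toy stand-in for pass_2015_network.
g = nx.gnp_random_graph(50, 0.1, seed=42, directed=True)

def top_n(measure_dict, n=10):
    """Top-n (node, score) pairs, highest score first."""
    return sorted(measure_dict.items(), key=lambda x: x[1], reverse=True)[:n]

top_10_pr = top_n(nx.pagerank(g))
top_10_bc = top_n(nx.betweenness_centrality(g))
top_10_dc = top_n(nx.degree_centrality(g))
```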
# PageRank
top_10_pr

[('ANC', 0.010425531156396332),
 ('HPN', 0.008715287139161587),
 ('FAI', 0.007865131822111036),
 ('DFW', 0.007168038232113773),
 ('DEN', 0.006557279519803018),
 ('ATL', 0.006367579588749718),
 ('ORD', 0.006178836107660135),
 ('YIP', 0.005821525504523931),
 ('ADQ', 0.005482597083474197),
 ('MSP', 0.005481962582230961)]
# Betweenness Centrality
top_10_bc
[('ANC', 0.28907458480586606),
 ('FAI', 0.08042857784594384),
 ('SEA', 0.06745549919241699),
 ('HPN', 0.06046810178534726),
 ('ORD', 0.045544143864829294),
 ('ADQ', 0.040170160000905696),
 ('DEN', 0.038543251364241436),
 ('BFI', 0.03811277548952854),
 ('MSP', 0.03774809342340624),
 ('TEB', 0.036229439542316354)]
# Degree Centrality
top_10_dc

[('ATL', 0.3643595863166269),
 ('ORD', 0.354813046937152),
 ('DFW', 0.3420843277645187),
 ('MSP', 0.3261734287987271),
 ('DEN', 0.31821797931583135),
 ('ANC', 0.3046937151949085),
 ('MEM', 0.29196499602227527),
 ('LAX', 0.2840095465393795),
 ('IAH', 0.28082736674622116),
 ('DTW', 0.27446300715990457)]
The Degree Centrality results do make sense at first glance: ATL is Atlanta and ORD is Chicago,
airports one would definitely expect at the top of a list that calculates the “importance” of
an airport. But when we look at PageRank and Betweenness Centrality we have an unexpected
airport, ‘ANC’. Do think about measures like PageRank and Betweenness Centrality and what they
calculate. Note that so far we have used only the core structure of the network, with no other metadata
like number of passengers; these are calculations on the unweighted network.
‘ANC’ is the airport code of Anchorage airport, in Alaska, and according to PageRank and
betweenness centrality it is the most important airport in this network. Isn’t that weird? Thoughts?
It looks like ‘ANC’ is essential to the core structure of the network, as it is the main airport connecting
Alaska with the other parts of the US. This explains the high Betweenness Centrality score, and there are
flights from other major airports to ‘ANC’, which explains the high PageRank score.
Related blog post: https://toreopsahl.com/2011/08/12/why-anchorage-is-not-that-important-binary-ties-and-sample-selection/
Let’s look at the weighted version, i.e. taking into account the number of people flying to these places.
[('SEA', 0.4192179843829966),
 ('ATL', 0.3589665389741017),
 ('ANC', 0.32425767084369994),
 ('LAX', 0.2668567170342895),
 ('ORD', 0.10008664852621497),
 ('DEN', 0.0964658422388763),
 ('MSP', 0.09300021788810685),
 ('DFW', 0.0926644126226465),
 ('FAI', 0.08824779747216016),
 ('BOS', 0.08259764427486331)]

sorted(nx.pagerank(
    pass_2015_network, weight='weight').items(),
    key=lambda x:x[1], reverse=True)[0:10]
[('ATL', 0.037535963029303135),
 ('ORD', 0.028329766122739346),
 ('SEA', 0.028274564067008245),
 ('ANC', 0.027127866647567035),
 ('DFW', 0.02570050418889442),
 ('DEN', 0.025260024346433315),
 ('LAX', 0.02394043498608451),
 ('PHX', 0.018373176636420224),
 ('CLT', 0.01780703930063076),
 ('LAS', 0.017649683141049966)]
When we adjust for the number of passengers we see a reshuffle in the “importance”
rankings, and they do make a bit more sense now. According to weighted PageRank, Atlanta,
Chicago and Seattle are the top 3 airports, while Anchorage has dropped to 4th.
To get an even better picture we should do the analysis with more metadata about the routes,
not just the number of passengers.
nx.average_shortest_path_length(pass_2015_network)

---------------------------------------------------------------------------
NetworkXError                             Traceback (most recent call last)
<ipython-input-157-acfe9bf3572a> in <module>
----> 1 nx.average_shortest_path_length(pass_2015_network)

~/miniconda3/envs/nams/lib/python3.7/site-packages/networkx/algorithms/shortest_paths/generic.py in average_shortest_path_length(G, weight, method)
    401     # Shortest path length is undefined if the graph is disconnected.
    402     if G.is_directed() and not nx.is_weakly_connected(G):
--> 403         raise nx.NetworkXError("Graph is not weakly connected.")
    404     if not G.is_directed() and not nx.is_connected(G):
    405         raise nx.NetworkXError("Graph is not connected.")

NetworkXError: Graph is not weakly connected.
Wait, what? This network is not “connected” (ignore the term “weakly” for the moment). That seems
weird. It means that there are nodes which aren’t reachable from the other nodes, which isn’t
good news, especially in a transportation network.
Let’s have a look at these far-flung airports which aren’t reachable.
components = list(
    nx.weakly_connected_components(
        pass_2015_network))
1255
2
1

{'SSB', 'SPB'}
{'AIK'}
The airports ‘SSB’ and ‘SPB’ are codes for seaplane bases, and they have flights to each other, so
it makes sense that they aren’t connected to the larger network of airports.
The airport ‘AIK’ is even stranger, as it is in a component by itself, i.e. there is a flight from AIK to AIK.
After investigating further, it just seems like an anomaly in this dataset.
AIK_DEST_2015 = pass_air_data[
    (pass_air_data['YEAR'] == 2015) &
    (pass_air_data['DEST'] == 'AIK')]
print(AIK_DEST_2015.head().to_markdown())

True

False
# NOTE: The notion of strongly and weakly exists only for directed graphs.
G = nx.DiGraph()

# Let's create a cycle directed graph, 1 -> 2 -> 3 -> 1
G.add_edge(1, 2)
G.add_edge(2, 3)
G.add_edge(3, 1)
nx.draw(G, with_labels=True)
In the above example we can reach any node irrespective of where we start traversing the network:
if we start from 2 we can reach 1 via 3. In this network every node is “reachable” from every other,
i.e. the network is strongly connected.
nx.is_strongly_connected(G)

True
It’s evident from the example above that we can’t traverse the whole network graph. If we start from
node 4 we are stuck at that node; we don’t have any way of leaving node 4. This assumes we strictly
follow the direction of the edges. In this case the network isn’t strongly connected, but if we look at the
structure and assume the directions of the edges don’t matter, then we can reach any other node in the
network even if we start from node 4.
If an undirected copy of a directed network is connected, we call the directed network weakly
connected.
nx.is_strongly_connected(G)

False

nx.is_weakly_connected(G)

True

nx.is_weakly_connected(pass_2015_network)

True

nx.is_strongly_connected(pass_2015_network)

False
But our network is still not strongly connected, which essentially means there are airports in the
network which you can fly into but not fly back from, which doesn’t really seem okay.
strongly_connected_components = list(
    nx.strongly_connected_components(pass_2015_network))

{'BCE'}
BCE_DEST_2015 = pass_air_data[
    (pass_air_data['YEAR'] == 2015) &
    (pass_air_data['DEST'] == 'BCE')]
print(BCE_DEST_2015.head().to_markdown())

BCE_ORI_2015 = pass_air_data[
    (pass_air_data['YEAR'] == 2015) &
    (pass_air_data['ORIGIN'] == 'BCE')]
print(BCE_ORI_2015.head().to_markdown())
nx.is_strongly_connected(pass_2015_strong)

True
After removing a number of airports we now have a strongly connected airport network. We can now
travel from any airport to any other airport in the network.
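The construction of pass_2015_strong is not shown in the text. A common sketch for this step (an assumption on my part, demonstrated on a toy graph) keeps only the subgraph induced by the largest strongly connected component:

```python
import networkx as nx

def largest_strongly_connected_subgraph(G):
    """Return the subgraph induced by the largest strongly connected component."""
    largest_scc = max(nx.strongly_connected_components(G), key=len)
    return G.subgraph(largest_scc).copy()

# Toy demo: a 3-cycle plus a dead-end node 4 that you can fly
# into but not out of.
G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (1, 4)])
H = largest_strongly_connected_subgraph(G)
print(sorted(H.nodes()))  # [1, 2, 3] -- node 4 is dropped
```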
1190

nx.average_shortest_path_length(pass_2015_strong)

3.174661992635574
The number 3.17 above is the average shortest path length between two airports in this network, which
means that it’s possible to go from one airport to another in this network in under 3 layovers, which
sounds nice. A more reachable network is better, not necessarily in terms of revenue for the airline, but
for the social health of the air transport network.
Exercise
How can we decrease the average shortest path length of this network?
Think of an effective way to add new edges to decrease the average shortest path length. Let’s see
if we can come up with a nice way to do this.
The rules are simple: you can’t add more than 2% of the current edges (~500 edges).
new_routes_network = add_opinated_edges(pass_2015_strong)

nx.average_shortest_path_length(new_routes_network)

3.0888508809747615
Using an opinionated heuristic we were able to reduce the average shortest path length of the
network. Check the solution below to understand the idea behind the heuristic, and do try to come
up with your own heuristics.
# We have access to the airlines that fly the route in the edge attribute airlines
pass_2015_network['JFK']['SFO']

{'weight': 1179941.0,
 'weight_inv': 8.4750000211875e-07,
 'airlines': "{'Delta Air Lines Inc.', 'Virgin America', 'American Airlines Inc.', 'Sun Country Airlines d/b/a MN Airlines', 'JetBlue Airways', 'Vision Airlines', 'United Air Lines Inc.'}"}
# A helper function to extract the airline names from the edge attribute
def str_to_list(a):
    return a[1:-1].split(', ')
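For example, applied to a stringified set like the one above (the definition is repeated here so the snippet runs on its own; note that the elements keep their surrounding quotes):

```python
def str_to_list(a):
    # Strip the outer braces and split on ", "
    return a[1:-1].split(', ')

airlines = str_to_list("{'Delta Air Lines Inc.', 'Virgin America'}")
print(airlines)  # ["'Delta Air Lines Inc.'", "'Virgin America'"]
```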
Let’s extract the network of United Airlines from our airport network.
united_network = nx.DiGraph()
for origin, dest in pass_2015_network.edges():
    if "'United Air Lines Inc.'" in pass_2015_network[origin][dest]['airlines_list']:
        united_network.add_edge(
            origin, dest,
            weight=pass_2015_network[origin][dest]['weight'])

Name:
Type: DiGraph
Number of nodes: 194
Number of edges: 1894
Average in degree: 9.7629
Average out degree: 9.7629
[('ORD', 0.08385772266571424),
 ('DEN', 0.06816244850418422),
 ('LAX', 0.053065234147240105),
 ('IAH', 0.044410609028379185),
 ('SFO', 0.04326197030283029)]

[('ORD', 1.0),
 ('IAH', 0.9274611398963731),
 ('DEN', 0.8756476683937824),
 ('EWR', 0.8134715025906736),
 ('SFO', 0.6839378238341969)]
Solutions
Here are the solutions to the exercises above.
import networkx as nx
import pandas as pd


def busiest_route(pass_air_data, year):
    return pass_air_data[
        pass_air_data.groupby(["YEAR"])["PASSENGERS"].transform(max)
        == pass_air_data["PASSENGERS"]
    ].query(f"YEAR == {year}")


def plot_time_series(pass_air_data, origin, dest):
    pass_air_data.query(f"ORIGIN == '{origin}' and DEST == '{dest}'").plot(
        "YEAR", "PASSENGERS"
    )


def add_opinated_edges(G):
    G = nx.DiGraph(G)
    sort_degree = sorted(
        nx.degree_centrality(G).items(), key=lambda x: x[1], reverse=True
    )
    top_count = 0
    for n, v in sort_degree:
        count = 0
        for node, val in sort_degree:
            if node != n:
                if node not in G._adj[n]:
                    G.add_edge(n, node)
                    count += 1
                    if count == 25:
                        break
        top_count += 1
        if top_count == 20:
            break
    return G