
Network Analysis Made Simple

An introduction to network analysis and applied graph theory using Python and NetworkX

Eric Ma and Mridul Seth


This book is for sale at http://leanpub.com/nams

This version was published on 2021-04-08

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.

© 2020 - 2021 Eric Ma and Mridul Seth


Contents

Preface

Learning Goals
  Technical Takeaways
  Intellectual Goals

Introduction to Graphs
  Introduction
  A formal definition of networks
  Examples of Networks
  Types of Graphs
  Edges define the interesting part of a graph

The NetworkX API
  Introduction
  Data Model
  Load Data
  Understanding a graph’s basic statistics
  Manipulating the graph
  Coding Patterns
  Further Reading
  Further Exercises
  Solution Answers

Graph Visualization
  Introduction
  Hairballs
  Matrix Plot
  Arc Plot
  Circos Plot
  Hive Plot
  Principles of Rational Graph Viz

Hubs
  Introduction
  A Measure of Importance: “Number of Neighbors”
  Generalizing “neighbors” to arbitrarily-sized graphs
  Distribution of graph metrics
  Reflections
  Solutions

Paths
  Introduction
  Breadth-First Search
  Visualizing Paths
  Bottleneck nodes
  Recap
  Solutions

Structures
  Introduction
  Triangles
  Triadic Closure
  Cliques
  Connected Components
  Solutions

Graph I/O
  Introduction
  Graph Data as Tables
  Dataset
  Graph Model
  Pickling Graphs
  Other text formats
  Solutions

Testing
  Introduction
  Why test?
  What to test
  Continuous data testing
  Further reading

Bipartite Graphs
  Introduction
  What are bipartite graphs?
  Dataset
  Bipartite Graph Projections
  Weighted Projection
  Degree Centrality
  Solutions

Linear Algebra
  Introduction
  Preliminaries
  Path finding
  Message Passing
  Bipartite Graphs & Matrices
  Performance: Object vs. Matrices
  Acceleration on a GPU

Statistical Inference
  Introduction
  Statistics refresher
  We are concerned with models of randomness
  Hypothesis Testing
  Stochastic graph creation models
  Load Data
  Inferring Graph Generating Model
  Quantitative Model Comparison
  Interpretation

Game of Thrones
  Introduction
  Finding the most important node, i.e. character, in these networks
  Betweenness centrality
  PageRank
  Evolution of importance of characters over the books
  So what’s up with Stannis Baratheon?
  Community detection in Networks
  Solutions

Airport Network
  Introduction
  Visualise the airports
  Directed Graphs and PageRank
  Important Hubs in the Airport Network
  How reachable is this network?
  Can we find airline-specific reachability?
  Solutions
Preface
Hey, thanks for picking up this e-Book. We had a ton of fun making the material, and we hope you
have a ton of fun learning new things from it too.
Applied network analysis and graph theory concepts are becoming more and more relevant in our
world. Graph problems abound. Once you pick up how to use graphs in an applied setting,
you’ll find your view of data problems changes tremendously. We hope this book can become part
of your learning journey.
The act of purchasing this book means you’ve chosen to support us, the authors. It means a ton to
us, as this book is the culmination of 5 years of learning and teaching applied network analysis at
conferences around the world. The reason we went with LeanPub to publish this book is this: For as
long as we issue updates to the book, you will also receive an updated copy of it. And because the
book is digital, it’s easy for us to get updates out to you.
Just so you know, the full text of the book is available online too, at the accompanying website,
https://ericmjl.github.io/Network-Analysis-Made-Simple. On there, you’ll find a link to Binder so
you can interact with the code, and through the act of playing around with the code and breaking it
yourself, learn new things. (Breaking code and fixing it is something you should be doing - it’s one
of the best ways to learn!)
If you have questions about the content, or find errata that you’d like to point out, please head over
to https://github.com/ericmjl/Network-Analysis-Made-Simple/, and post an issue up there. We’ll be
sure to address it and acknowledge it appropriately.
We hope that this book becomes a stepping stone in your learning journey. Enjoy!
Eric & Mridul
Learning Goals
Our learning goals for you with this book can be split into the technical and the intellectual.

Technical Takeaways
Firstly, we would like to equip you to be familiar with the NetworkX application programming
interface (API). The reason for choosing NetworkX is because it is extremely beginner-friendly, and
has an API that matches graph theory concepts very closely.
Secondly, we would like to show you how you can visualize graph data in a fashion that doesn’t
involve showing mere hairballs. Throughout the book, you will see examples of what we call rational
graph visualizations. One of our authors, Eric Ma, has developed a companion package, nxviz, that
provides a declarative and convenient API (in other words an attempt at a “grammar”) for graph
visualization.
Thirdly, in this book, you will be introduced to basic graph algorithms, such as finding special graph
structures, or finding paths in a graph. Graph algorithms will show you how to “think on graphs”,
and knowing how to do so will broaden your ability to interact with graph data structures.
Fourthly, you will also be equipped with the connection between graph theory and other areas of
math and computing, such as statistical inference and linear algebra.

Intellectual Goals
Beyond the technical takeaways, we hope to broaden how you think about data.
The first idea we hope to give you is the ability to think about your data in terms of “relationships”.
As you will learn, relationships are what give rise to the interestingness of graphs. That’s where
relational insights can come to the fore.
The second idea we hope to give you is the ability to “think on graphs”. This comes with practice.
Once you master it, though, you will find yourself becoming more and more familiar with
algorithmic thinking, which is where you look at a problem in terms of the algorithm that solves
it.
Introduction to Graphs
Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="k4KHoLC7TFE", width="100%")

<iframe width="100%" height="300" src="https://www.youtube.com/embed/k4KHoLC7TFE" frameborder="0" allowfullscreen></iframe>
In our world, networks are an immensely useful tool for modelling complex relational
problems. Building on top of a network-oriented data model, they have been put to great use in a
wide variety of settings.

A formal definition of networks


Before we explore examples of networks, we want to first give you a more formal definition of
what networks are. The reason is that knowing a formal definition helps us refine our application
of networks. So bear with me for a moment.
In the slightly more academic literature, networks are more formally referred to as graphs.
Graphs are comprised of two sets of objects:

• A node set: the “entities” in a graph.


• An edge set: the record of “relationships” between the entities in the graph.

For example, if a node set n is comprised of elements:

n = {a, b, c, d, ...}

Then, the edge set e would be represented as tuples of pairs of elements:

e = {(a, b), (a, c), (c, d), ...}

If you extracted every node from the edge set e, it should form at least a subset of the node set n. (It
is at least a subset because not every node in n might participate in an edge.)
If you draw out a network, the “nodes” are commonly represented as shapes, such as circles, while
the “edges” are the lines between the shapes.
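To make this concrete, here is a minimal sketch (using NetworkX, which the rest of this book relies on) that builds a graph out of the example node and edge sets above:

import networkx as nx

# The node set n and edge set e from the example above.
n = ["a", "b", "c", "d"]
e = [("a", "b"), ("a", "c"), ("c", "d")]

G = nx.Graph()
G.add_nodes_from(n)
G.add_edges_from(e)

print(sorted(G.nodes()))  # ['a', 'b', 'c', 'd']
print(sorted(G.edges()))  # [('a', 'b'), ('a', 'c'), ('c', 'd')]

Note that adding an edge implicitly adds its endpoint nodes, so every node that participates in e would appear in the node set even without the add_nodes_from call.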

Examples of Networks
Now that we have a proper definition of a graph, let’s move on to explore examples of graphs.
One example I (Eric Ma) am fond of, based on my background as a biologist, is a protein-protein
interaction network. Here, the graph can be defined in the following way:

• nodes/entities are the proteins,


• edges/relationships are defined as “one protein is known to bind with another”.

A more colloquial example of networks is an air transportation network. Here, the graph can be
defined in the following way:

• nodes/entities are airports


• edges/relationships are defined as “at least one flight carrier flies between the airports”.

And another even more relatable example would be our ever-prevalent social networks! With Twitter,
the graph can be defined in the following way:

• nodes/entities are individual users


• edges/relationships are defined as “one user has decided to follow another”.

Now that you’ve seen the framework for defining a graph, we’d like to invite you to answer the
following question: What examples of networks have you seen before in your profession?
Go ahead and list it out.

Types of Graphs
As you probably can see, graphs are a really flexible data model for modelling the world, as long as
the nodes and edges are strictly defined. (If the nodes and edges are sloppily defined, well, we run
into a lot of interpretability problems later on.)
If you are a member of both LinkedIn and Twitter, you might intuitively think that there’s a slight
difference in the structure of the two “social graphs”. You’d be absolutely correct on that count!
Twitter is an example of what we would intuitively call a directed graph. Why is this so? The key
here lies in how interactions are modelled. One user can follow another, but the other need not
necessarily follow back. As such, there is a directionality to the relationship.
LinkedIn is an example of what we would intuitively call an undirected graph. Why is this so? The
key here is that when two users are LinkedIn connections, we automatically assign a bi-directional
edge between them. As such, for convenience, we can collapse the bi-directional edge into an
undirected edge, thus yielding an undirected graph.
If we wanted to turn LinkedIn into a directed graph, we might want to keep information on who
initiated the invitation. In that way, the relationship would carry a direction: from the member who
sent the invitation to the member who accepted it.
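As a quick sketch (with hypothetical usernames), the distinction maps directly onto NetworkX’s two basic graph classes:

import networkx as nx

# Twitter-style follows: direction matters, so we use a DiGraph.
follows = nx.DiGraph()
follows.add_edge("alice", "bob")  # alice follows bob
print(follows.has_edge("bob", "alice"))  # False: bob need not follow back

# LinkedIn-style connections: one undirected edge suffices.
connections = nx.Graph()
connections.add_edge("alice", "bob")
print(connections.has_edge("bob", "alice"))  # True: edges have no direction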

Edges define the interesting part of a graph


While in graduate school, I (Eric Ma) once sat in a seminar organized by one of the professors on my
thesis committee. The speaker that day was John Quackenbush, a faculty member of the Harvard
School of Public Health. While the topic of the day remained fuzzy in my memory, one quote stood
out:

The heart of a graph lies in its edges, not in its nodes. (John Quackenbush, Harvard School
of Public Health)

Indeed, this is a key point to remember! Without edges, the nodes are merely collections of entities.
In a data table, they would correspond to the rows. That alone can be interesting, but doesn’t yield
relational insights between the entities.
The NetworkX API
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'

Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id='sdF0uJo2KdU', width="100%")

<iframe width="100%" height="300" src="https://www.youtube.com/embed/sdF0uJo2KdU" frameborder="0" allowfullscreen></iframe>
In this chapter, we will introduce you to the NetworkX API. This will allow you to create and
manipulate graphs in your computer memory, thus giving you a language to more concretely explore
graph theory ideas.
Throughout the book, we will be using different graph datasets to help us anchor ideas. In this
section, we will work with a social network of seventh graders. Here, nodes are individual students,
and edges represent their relationships. Edges between individuals show how often the seventh
graders indicated other seventh graders as their favourite.
The data are taken from the Konect¹ graph data repository.

Data Model
In NetworkX, graph data are stored in a dictionary-like fashion. They are placed under a Graph
object, canonically instantiated with the variable G as follows:

1 G = nx.Graph()

Of course, you are free to name the graph anything you want!
Nodes are part of the attribute G.nodes. There, the node data are housed in a dictionary-like container,
where the key is the node ID and the values are a dictionary of attributes. Node data are accessible
using syntax that looks like:
¹http://konect.uni-koblenz.de/networks/moreno_seventh

1 G.nodes[node1]

Edges are part of the attribute G.edges, which is also stored in a dictionary-like container. Edge data
are accessible using syntax that looks like:

1 G.edges[node1, node2]

Because of the dictionary-like implementation of the graph, any hashable object can be a node. This
means strings and tuples, but not lists and sets.
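For instance, here is a small sketch of what does and does not work as a node:

import networkx as nx

G = nx.Graph()
G.add_node("protein_A")  # strings are hashable: OK
G.add_node((1.5, 2.5))   # tuples are hashable: OK

try:
    G.add_node(["a", "list"])  # lists are mutable and unhashable
except TypeError as err:
    print(err)  # unhashable type: 'list'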

Load Data
Let’s load some real network data to get a feel for the NetworkX API. This dataset² comes from a
study of 7th grade students.

This directed network contains proximity ratings between 29 seventh grade
students from a school in Victoria. Among other questions the students were asked to
nominate their preferred classmates for three different activities. A node represents a
student. An edge between two nodes shows that the left student picked the right student
as his or her answer. The edge weights are between 1 and 3 and show how often the left
student chose the right student as his/her favourite.

In the original dataset, students were from an all-boys school. However, I have modified the dataset
to instead be a mixed-gender school.

1 import networkx as nx
2 from datetime import datetime
3 import matplotlib.pyplot as plt
4 import numpy as np
5 import warnings
6 from nams import load_data as cf
7
8 warnings.filterwarnings('ignore')

1 G = cf.load_seventh_grader_network()

²http://konect.uni-koblenz.de/networks/moreno_seventh

Understanding a graph’s basic statistics


When you get graph data, one of the first things you’ll want to do is to check its basic graph statistics:
the number of nodes and the number of edges that are represented in the graph. This is a basic sanity-
check on your data that you don’t want to skip out on.

Querying graph type


The first thing you need to know is the type of the graph:

1 type(G)

1 networkx.classes.digraph.DiGraph

Because the graph is a DiGraph, this tells us that the graph is a directed one.
If it were undirected, the type would change:

1 H = nx.Graph()
2 type(H)

1 networkx.classes.graph.Graph

Querying node information


Let’s now query for the nodeset:

1 list(G.nodes())[0:5]

1 [1, 2, 3, 4, 5]

G.nodes() returns a “view” on the nodes. We can’t actually slice into the view and grab out a sub-
selection, but we can at least see what nodes are present. For brevity, we have passed G.nodes()
into a list() constructor and sliced out the first few entries, so that we don’t pollute the output.
Because a NodeView is iterable, though, we can query it for its length:

1 len(G.nodes())

1 29

If our nodes have metadata attached to them, we can view the metadata at the same time by passing
in data=True:

1 list(G.nodes(data=True))[0:5]

1 [(1, {'gender': 'male'}),


2 (2, {'gender': 'male'}),
3 (3, {'gender': 'male'}),
4 (4, {'gender': 'male'}),
5 (5, {'gender': 'male'})]

G.nodes(data=True) returns a NodeDataView, which you can see is dictionary-like.


Additionally, we can select out individual nodes:

1 G.nodes[1]

1 {'gender': 'male'}

Now, because a NodeDataView is dictionary-like, looping over G.nodes(data=True) is very much like
looping over key-value pairs of a dictionary. As such, we can write things like:

1 for n, d in G.nodes(data=True):
2 # n is the node
3 # d is the metadata dictionary
4 ...

This is analogous to how we would loop over a dictionary:

1 for k, v in dictionary.items():
2 # do stuff in the loop

Naturally, this leads us to our first exercise.

Exercise: Summarizing node metadata


Can you count how many males and females are represented in the graph?

1 from nams.solutions.intro import node_metadata


2
3 #### REPLACE THE NEXT LINE WITH YOUR ANSWER
4 mf_counts = node_metadata(G)

Test your implementation by checking it against the test_answer function below.

1 from typing import Dict


2
3 def test_answer(mf_counts: Dict):
4 assert mf_counts['female'] == 17
5 assert mf_counts['male'] == 12
6
7 test_answer(mf_counts)

With this dictionary-like syntax, we can query back the metadata that’s associated with any node.

Querying edge information


Now that you’ve learned how to query for node information, let’s now see how to query for all of
the edges in the graph:

1 list(G.edges())[0:5]

1 [(1, 2), (1, 3), (1, 4), (1, 5), (1, 6)]

Similar to the NodeView, G.edges() returns an EdgeView that is also iterable. As with above, we have
abbreviated the output inside a sliced list to keep things readable. Because G.edges() is iterable, we
can get its length to see the number of edges that are present in a graph.

1 len(G.edges())

1 376

Likewise, we can also query for all of the edge’s metadata:

1 list(G.edges(data=True))[0:5]

1 [(1, 2, {'count': 1}),


2 (1, 3, {'count': 1}),
3 (1, 4, {'count': 2}),
4 (1, 5, {'count': 2}),
5 (1, 6, {'count': 3})]

Additionally, it is possible for us to select out individual edges, as long as they exist in the graph:

1 G.edges[15, 10]

1 {'count': 2}

This yields the metadata dictionary for that edge.


If the edge does not exist, then we get an error:

1 >>> G.edges[15, 16]

1 ---------------------------------------------------------------------------
2 KeyError Traceback (most recent call last)
3 <ipython-input-21-ce014cab875a> in <module>
4 ----> 1 G.edges[15, 16]
5
6 ~/anaconda/envs/nams/lib/python3.7/site-packages/networkx/classes/reportviews.py in \
7 __getitem__(self, e)
8 928 def __getitem__(self, e):
9 929 u, v = e
10 --> 930 return self._adjdict[u][v]
11 931
12 932 # EdgeDataView methods
13
14 KeyError: 16

As with the NodeDataView, the EdgeDataView is dictionary-like, with the difference being that the
keys are 2-tuple-like instead of being single hashable objects. Thus, we can write syntax like the
following to loop over the edgelist:

1 for n1, n2, d in G.edges(data=True):


2 # n1, n2 are the nodes
3 # d is the metadata dictionary
4 ...

Naturally, this leads us to our next exercise.

Exercise: Summarizing edge metadata


Can you write code to verify that the maximum number of times any student rated another student
as their favourite is 3?

1 from nams.solutions.intro import edge_metadata


2
3 #### REPLACE THE NEXT LINE WITH YOUR ANSWER
4 maxcount = edge_metadata(G)

Likewise, you can test your answer using the test function below:

1 def test_maxcount(maxcount):
2 assert maxcount == 3
3
4 test_maxcount(maxcount)

Manipulating the graph


Great stuff! You now know how to query a graph for:

• its node set, optionally including metadata


• individual node metadata
• its edge set, optionally including metadata, and
• individual edges’ metadata

Now, let’s learn how to manipulate the graph. Specifically, we’ll learn how to add nodes and edges
to a graph.

Adding Nodes
The NetworkX graph API lets you add a node easily:

1 G.add_node(node, node_data1=some_value, node_data2=some_value)

Adding Edges
It also allows you to add an edge easily:

1 G.add_edge(node1, node2, edge_data1=some_value, edge_data2=some_value)

Metadata by Keyword Arguments


In both cases, the keyword arguments that are passed into .add_node() and .add_edge() are
automatically collected into the metadata dictionary.
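Here is a minimal sketch (with made-up node names and attribute values) showing how that metadata round-trips:

import networkx as nx

H = nx.Graph()
H.add_node("a", gender="female")  # keyword arguments become node metadata
H.add_edge("a", "b", count=3)     # and likewise for edge metadata

print(H.nodes["a"])       # {'gender': 'female'}
print(H.edges["a", "b"])  # {'count': 3}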
Knowing this gives you enough knowledge to tackle the next exercise.

Exercise: adding students to the graph


We found out that there are two students that we left out of the network, student no. 30
and 31. They are one male (30) and one female (31), and they are a pair that just love
hanging out with one another and with individual 7 (i.e. count=3), in both directions per
pair. Add this information to the graph.

1 from nams.solutions.intro import adding_students


2
3 #### REPLACE THE NEXT LINE WITH YOUR ANSWER
4 G = adding_students(G)

You can verify that the graph has been correctly created by executing the test function below.

1 def test_graph_integrity(G):
2 assert 30 in G.nodes()
3 assert 31 in G.nodes()
4 assert G.nodes[30]['gender'] == 'male'
5 assert G.nodes[31]['gender'] == 'female'
6 assert G.has_edge(30, 31)
7 assert G.has_edge(30, 7)
8 assert G.has_edge(31, 7)
9 assert G.edges[30, 7]['count'] == 3
10 assert G.edges[7, 30]['count'] == 3
11 assert G.edges[31, 7]['count'] == 3

12 assert G.edges[7, 31]['count'] == 3


13 assert G.edges[30, 31]['count'] == 3
14 assert G.edges[31, 30]['count'] == 3
15 print('All tests passed.')
16
17 test_graph_integrity(G)

1 All tests passed.

Coding Patterns
These are some recommended coding patterns when doing network analysis using NetworkX, which
stem from my personal experience with the package.

Iterating using List Comprehensions


I would recommend that you use the following for compactness:

1 [d['attr'] for n, d in G.nodes(data=True)]

And if the node is unimportant, you can do:

1 [d['attr'] for _, d in G.nodes(data=True)]

Iterating over Edges using List Comprehensions


A similar pattern can be used for edges:

1 [n2 for n1, n2, d in G.edges(data=True)]

or

1 [n2 for _, n2, d in G.edges(data=True)]

If the graph you are constructing is a directed graph, with a “source” and “sink” available, then I
would recommend the following naming of variables instead:

1 [(sc, sk) for sc, sk, d in G.edges(data=True)]

or

1 [d['attr'] for sc, sk, d in G.edges(data=True)]

Further Reading
For a deeper look at the NetworkX API, be sure to check out the NetworkX docs³.

Further Exercises
Here are some further exercises that you can use to get some practice.

Exercise: Unrequited Friendships


Try figuring out which students have “unrequited” friendships, that is, they have rated
another student as their favourite at least once, but that other student has never rated them
as their favourite.

Hint: the goal here is to get a list of edges for which the reverse edge is not present.
Hint: You may need the method G.has_edge(n1, n2). This returns whether a graph has an edge
between the nodes n1 and n2.

1 from nams.solutions.intro import unrequitted_friendships_v1


2 #### REPLACE THE NEXT LINE WITH YOUR ANSWER
3 unrequitted_friendships = unrequitted_friendships_v1(G)
4 assert len(unrequitted_friendships) == 124

In a previous session at ODSC East 2018, a few other class participants provided the following
solutions, which you can take a look at by uncommenting the following cells.
This first one by @schwanne⁴ is the list comprehension version of the above solution:

³https://networkx.readthedocs.io
⁴https://github.com/schwanne

1 from nams.solutions.intro import unrequitted_friendships_v2


2 # unrequitted_friendships_v2??

This one by @end0⁵ is a unique one involving sets.

1 from nams.solutions.intro import unrequitted_friendships_v3


2 # unrequitted_friendships_v3??

Solution Answers
Here are the answers to the exercises above.

1 import nams.solutions.intro as solutions


2 import inspect
3
4 print(inspect.getsource(solutions))

1 """
2 Solutions to Intro Chapter.
3 """
4
5
6 def node_metadata(G):
7 """Counts of students of each gender."""
8 from collections import Counter
9
10 mf_counts = Counter([d["gender"] for n, d in G.nodes(data=True)])
11 return mf_counts
12
13
14 def edge_metadata(G):
15 """Maximum number of times that a student rated another student."""
16 counts = [d["count"] for n1, n2, d in G.edges(data=True)]
17 maxcount = max(counts)
18 return maxcount
19
20
21 def adding_students(G):
⁵https://github.com/end0

22 """How to nodes and edges to a graph."""


23 G = G.copy()
24 G.add_node(30, gender="male")
25 G.add_node(31, gender="female")
26 G.add_edge(30, 31, count=3)
27 G.add_edge(31, 30, count=3) # reverse is optional in undirected network
28 G.add_edge(30, 7, count=3) # but this network is directed
29 G.add_edge(7, 30, count=3)
30 G.add_edge(31, 7, count=3)
31 G.add_edge(7, 31, count=3)
32 return G
33
34
35 def unrequitted_friendships_v1(G):
36 """Answer to unrequitted friendships problem."""
37 unrequitted_friendships = []
38 for n1, n2 in G.edges():
39 if not G.has_edge(n2, n1):
40 unrequitted_friendships.append((n1, n2))
41 return unrequitted_friendships
42
43
44 def unrequitted_friendships_v2(G):
45 """Alternative answer to unrequitted friendships problem. By @schwanne."""
46 return len([(n1, n2) for n1, n2 in G.edges() if not G.has_edge(n2, n1)])
47
48
49 def unrequitted_friendships_v3(G):
50 """Alternative answer to unrequitted friendships problem. By @end0."""
51 links = ((n1, n2) for n1, n2, d in G.edges(data=True))
52 reverse_links = ((n2, n1) for n1, n2, d in G.edges(data=True))
53
54 return len(list(set(links) - set(reverse_links)))
Graph Visualization
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'

Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="v9HrR_AF5Zc", width="100%")

<iframe width="100%" height="300" src="https://www.youtube.com/embed/v9HrR_AF5Zc" frameborder="0" allowfullscreen></iframe>
In this chapter, we want to introduce you to the wonderful world of graph visualization.
You probably have seen graphs that are visualized as hairballs. Apart from communicating how
complex the graph is, hairballs don’t really communicate much else. As such, my goal by the end of
this chapter is to introduce you to what I call rational graph visualization.
But before we can do that, let’s first make sure we understand how to use NetworkX’s drawing
facilities to draw graphs to the screen. In a pinch, and for small graphs, it’s very handy to have.

Hairballs
The node-link diagram is the canonical diagram we will see in publications. Nodes are commonly
drawn as circles, while edges are drawn as lines.
Node-link diagrams are common, and there’s a good reason for this: it’s convenient to draw! In
NetworkX, we can draw node-link diagrams using:

1 from nams import load_data as cf


2 import networkx as nx
3 import matplotlib.pyplot as plt
4
5 G = cf.load_seventh_grader_network()

1 nx.draw(G)

[Figure: default nx.draw(G) node-link diagram of the network]

Nodes more tightly connected with one another are clustered together. Initial node placement is
done typically at random, so really it’s tough to deterministically generate the same figure. If the
network is small enough to visualize, and the node labels are small enough to fit in a circle, then you
can use the with_labels=True argument to bring some degree of informativeness to the drawing:

1 G.is_directed()

1 True

1 nx.draw(G, with_labels=True)

[Figure: node-link diagram drawn with node labels]

The downside to drawing graphs this way is that large graphs end up looking like hairballs. Can
you imagine a graph with many more nodes than the 29 that we have? As you probably can imagine,
the default nx.draw(G) is not suitable for generating visual insights.
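As an aside, if you need the same figure to come out every time, one common workaround (a sketch of the general technique, not something this chapter depends on) is to compute the node positions yourself with a fixed random seed and pass them to nx.draw:

import networkx as nx
import matplotlib.pyplot as plt

# A fixed seed makes the force-directed layout reproducible.
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos=pos)
plt.show()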

Matrix Plot
A different way that we can visualize a graph is by visualizing it in its matrix form. The nodes are
on the x- and y- axes, and a filled square represent an edge between the nodes.
We can draw a graph’s matrix form conveniently by using nxviz.MatrixPlot:

1 import nxviz as nv
2 from nxviz import annotate
3
4
5 nv.matrix(G, group_by="gender", node_color_by="gender")
6 annotate.matrix_group(G, group_by="gender")

1 /home/runner/work/Network-Analysis-Made-Simple/Network-Analysis-Made-Simple/nams_env\
2 /lib/python3.8/site-packages/nxviz/__init__.py:18: UserWarning:
3 nxviz has a new API! Version 0.7.0 onwards, the old class-based API is being
4 deprecated in favour of a new API focused on advancing a grammar of network
5 graphics. If your plotting code depends on the old API, please consider
6 pinning nxviz at version 0.6.3, as the new API will break your old code.
7
8 To check out the new API, please head over to the docs at
9 https://ericmjl.github.io/nxviz/ to learn more. We hope you enjoy using it!
10
11 (This deprecation message will go away in version 1.0.)
12
13 warnings.warn(

[Figure: MatrixPlot of the graph, grouped and coloured by gender]

What can you tell from the graph visualization? A few things are immediately obvious:

• The diagonal is empty: no student voted for themselves as their favourite.



• The matrix is asymmetric about the diagonal: this is a directed graph!

(An undirected graph would be symmetric about the diagonal.)


You might go on to suggest that there is some clustering happening, but without applying a proper
clustering algorithm on the adjacency matrix, we would be hard-pressed to know for sure. After all,
we can simply re-order the nodes along the axes to produce a seemingly-random matrix.
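To see for yourself how strongly the visual impression depends on node order, here is a sketch that extracts the adjacency matrix under the original and a shuffled node ordering, using plain networkx and matplotlib rather than nxviz:

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

nodes = list(G.nodes())
A_original = nx.to_numpy_array(G, nodelist=nodes)

# Shuffle the node order and rebuild the matrix.
rng = np.random.default_rng(42)
A_shuffled = nx.to_numpy_array(G, nodelist=list(rng.permutation(nodes)))

fig, axes = plt.subplots(1, 2)
axes[0].imshow(A_original)
axes[0].set_title("original order")
axes[1].imshow(A_shuffled)
axes[1].set_title("shuffled order")
plt.show()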

Arc Plot
The Arc Plot is another rational graph visualization. Here, we line up the nodes along a horizontal
axis, and draw arcs between nodes if they are connected by an edge. We can also optionally group
and colour them by some metadata. In the case of this student graph, we group and colour them by
“gender”.

1 # a = ArcPlot(G, node_color='gender', node_grouping='gender')


2 nv.arc(G, node_color_by="gender", group_by="gender")
3 annotate.arc_group(G, group_by="gender")

[Figure: Arc Plot of the graph, grouped and coloured by gender]

The Arc Plot forms the basis of the next visualization, the highly popular Circos plot.

Circos Plot
The Circos Plot was developed by Martin Krzywinski⁶ at the BC Cancer Research Center. The
nxviz.CircosPlot takes inspiration from the original by joining the two ends of the Arc Plot into a
circle. Likewise, we can colour and order nodes by node metadata:
⁶http://circos.ca/

1 nv.circos(G, group_by="gender", node_color_by="gender")


2 annotate.circos_group(G, group_by="gender")

[Figure: Circos Plot of the graph, grouped and coloured by gender]

Generally speaking, you can think of a Circos Plot as being a more compact and aesthetically pleasing
version of Arc Plots.

Hive Plot
The final plot we’ll show is the Hive Plot.

1 from nxviz import plots


2 import matplotlib.pyplot as plt
3
4 nv.hive(G, group_by="gender", node_color_by="gender")
5 annotate.hive_group(G, group_by="gender")

[Figure: Hive Plot of the graph, with nodes grouped by gender]

As you can see, with Hive Plots, we first group nodes along two or three radial axes. In this case, we
have the boys along one radial axis and the girls along the other. We can also order the nodes along
each axis if we so choose to. In this case, no particular ordering is chosen.
Next, we draw edges. We start first with edges between groups. That is shown on the left side of
the figure, joining nodes in the “yellow” and “green” (boys/girls) groups. We then proceed to edges
within groups. This is done by cloning the node radial axis before drawing edges.

Principles of Rational Graph Viz


While I was implementing these visualizations in nxviz, I learned an important lesson in
implementing graph visualizations in general:

To be most informative and communicative, a graph visualization should first prioritize
node placement in a fashion that makes sense.

In some ways, this makes a ton of sense. The nodes are the “entities” in a graph, corresponding
to people, proteins, and ports. For “entities”, we have natural ways to group, order and summarize
(reduce). (An example of a “reduction” is counting the number of things.) Prioritizing node placement
allows us to appeal to our audience’s natural sense of grouping, ordering and reduction.
So the next time you see a hairball, I hope you’re able to critique it for what it doesn’t communicate,
and possibly use the same principle to design a better visualization!
Hubs
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'

Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="-oimHbVDdDA", width=560, height=315)

<iframe width="560" height="315" src="https://www.youtube.com/embed/-oimHbVDdDA" frameborder="0" allowfullscreen></iframe>
Because of the relational structure in a graph, we can begin to think about “importance” of a node
that is induced because of its relationships to the rest of the nodes in the graph.
Before we go on, let’s think about a pertinent and contemporary example.

An example: contact tracing


At the time of writing (April 2020), finding important nodes in a graph has actually taken on
a measure of importance that we might not have appreciated before. With the COVID-19 virus
spreading, contact tracing has become quite important. In an infectious disease contact network,
where individuals are nodes and contact between individuals of some kind are the edges, an
“important” node in this contact network would be an individual who was infected who also was in
contact with many people during the time that they were infected.

Our dataset: “Sociopatterns”


The dataset that we will use in this chapter is the “sociopatterns network⁷” dataset. Incidentally, it’s
also about infectious diseases.
Note to readers: We originally obtained the dataset in 2014 from the Konect website. It is
unfortunately no longer available. The sociopatterns.org website hosts an edge list of a slightly
different format, so it will look different from what we have here.
From the original description on Konect, here is the description of the dataset:
⁷http://www.sociopatterns.org/datasets/infectious-sociopatterns-dynamic-contact-networks/

This network describes the face-to-face behavior of people during the exhibition INFEC-
TIOUS: STAY AWAY in 2009 at the Science Gallery in Dublin. Nodes represent exhibition
visitors; edges represent face-to-face contacts that were active for at least 20 seconds.
Multiple edges between two nodes are possible and denote multiple contacts. The network
contains the data from the day with the most interactions.

To simplify the network, we have represented only the last contact between individuals.

1 from nams import load_data as cf


2 G = cf.load_sociopatterns_network()

It is loaded as an undirected graph object:

1 type(G)

1 networkx.classes.graph.Graph

As usual, before proceeding with any analysis, we should know basic graph statistics.

1 len(G.nodes()), len(G.edges())

1 (410, 2765)

A Measure of Importance: “Number of Neighbors”


One measure of importance of a node is the number of neighbors that the node has. What is a
neighbor? We will work with the following definition:

The neighbor of a node is connected to that node by an edge.

Let’s explore this concept, using the NetworkX API.


Every NetworkX graph provides a G.neighbors(node) method, which lets us query a graph
for the neighbors of a given node:

1 G.neighbors(7)

1 <dict_keyiterator at 0x7fa2457aea40>

It returns a generator that doesn’t immediately give us the exact list of neighbors, which means
we cannot directly ask for its length. If you tried to do:

1 len(G.neighbors(7))

you would get the following error:

1 ---------------------------------------------------------------------------
2 TypeError Traceback (most recent call last)
3 <ipython-input-13-72c56971d077> in <module>
4 ----> 1 len(G.neighbors(7))
5
6 TypeError: object of type 'dict_keyiterator' has no len()

Hence, we will need to cast it as a list in order to know both its length and its members:

1 list(G.neighbors(7))

1 [5, 6, 21, 22, 37, 48, 51]

In the event that some nodes have an extensive list of neighbors, then using the dict_keyiterator
is potentially a good memory-saving technique, as it lazily yields the neighbors.
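For example, here is a small sketch that counts a node’s neighbors lazily, without ever materializing the list:

def num_neighbors(G, node):
    """Count neighbors without building an intermediate list."""
    return sum(1 for _ in G.neighbors(node))

print(num_neighbors(G, 7))  # 7, matching len(list(G.neighbors(7)))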

Exercise: Rank-ordering the number of neighbors a node has


Since we know how to get the list of nodes that are neighbors of a given node, try this following
exercise:

Can you create a ranked list of the importance of each individual, based on the number
of neighbors they have?

Here are a few hints to help:

• You could consider using a pandas Series. This would be a modern and idiomatic way of
approaching the problem.
• You could also consider using Python’s sorted function.

1 from nams.solutions.hubs import rank_ordered_neighbors


2
3 #### REPLACE THE NEXT FEW LINES WITH YOUR ANSWER
4 # answer = rank_ordered_neighbors(G)
5 # answer


The original implementation looked like the following:

1 from nams.solutions.hubs import rank_ordered_neighbors_original


2 # rank_ordered_neighbors_original??

And another implementation that uses generators:

1 from nams.solutions.hubs import rank_ordered_neighbors_generator


2 # rank_ordered_neighbors_generator??

Generalizing “neighbors” to arbitrarily-sized graphs


The concept of neighbors is simple and appealing, but it leaves us with a slight point of dissatisfaction:
it is difficult to compare graphs of different sizes. Is a node more important solely because it has more
neighbors? What if it were situated in an extremely large graph? Would we not expect it to have
more neighbors?
As such, we need a normalization factor. One reasonable one, in fact, is the number of nodes that a
given node could possibly be connected to. By taking the ratio of the number of neighbors a node
has to the number of neighbors it could possibly have, we get the degree centrality metric.

Formally defined, the degree centrality of a node (let’s call it d) is the number of neighbors that a
node has (let’s call it n) divided by the number of neighbors it could possibly have (let’s call it N):

{$$}d = \frac{n}{N}{/$$}

NetworkX provides a function for us to calculate degree centrality conveniently:

1 import networkx as nx
2 import pandas as pd
3 dcs = pd.Series(nx.degree_centrality(G))
4 dcs

1 100 0.070905
2 101 0.031785
3 102 0.039120
4 103 0.063570
5 104 0.041565
6 ...
7 89 0.009780
8 91 0.051345
9 96 0.036675
10 99 0.034230
11 98 0.002445
12 Length: 410, dtype: float64

nx.degree_centrality(G) returns to us a dictionary of key-value pairs, where the keys are node IDs
and values are the degree centrality score. To save on output length, I took the liberty of casting it
as a pandas Series to make it easier to display.
Incidentally, we can also sort the series to find the nodes with the highest degree centralities:

1 dcs.sort_values(ascending=False)

1 51 0.122249
2 272 0.114914
3 235 0.105134
4 195 0.105134
5 265 0.083130
6 ...
7 390 0.002445
8 135 0.002445
9 398 0.002445
10 186 0.002445
11 98 0.002445
12 Length: 410, dtype: float64

Does the list order look familiar? It should, since the numerator of the degree centrality metric is
identical to the number of neighbors, and the denominator is a constant.
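You can verify this relationship directly. For simple graphs, NetworkX uses the number of other nodes, len(G) - 1, as the denominator; here is a sketch of the check:

import networkx as nx
import numpy as np
import pandas as pd

dcs = pd.Series(nx.degree_centrality(G))
num_neighbors = pd.Series({n: len(list(G.neighbors(n))) for n in G.nodes()})

# Degree centrality should equal neighbor count / (number of nodes - 1).
recomputed = num_neighbors / (len(G) - 1)
print(np.allclose(dcs.sort_index(), recomputed.sort_index()))  # True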

Distribution of graph metrics


One important concept that you should come to know is that the distribution of node-centric values
can characterize classes of graphs.
What do we mean by “distribution of node-centric values”? One would be the degree distribution,
that is, the collection of node degree values in a graph.
Generally, you might be familiar with plotting a histogram to visualize distributions of values, but
in this book, we are going to avoid histograms like the plague. I detail a lot of reasons in a blog post⁸
I wrote in 2018, but the main points are that:

1. It’s easier to lie with histograms.
2. With ECDFs, you get informative statistical information (median, IQR, extremes/outliers) more easily.

Exercise: Degree distribution


In this next exercise, we are going to get practice visualizing these values using empirical cumulative
distribution function plots.
I have written for you an ECDF function that you can use already. Its API looks like the following:

1 x, y = ecdf(list_of_values)

giving you x and y values that you can directly plot.
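The ecdf function ships with the book’s nams package. If you are curious what it does under the hood, a minimal equivalent might look like this sketch:

import numpy as np

def ecdf(data):
    """Empirical CDF: sorted values against their cumulative fraction."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y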


The exercise prompt is this:
⁸https://ericmjl.github.io/blog/2018/7/14/ecdfs/

Plot the ECDF of the degree centrality and degree distributions.

First do it for degree centrality:

1 from nams.functions import ecdf


2 from nams.solutions.hubs import ecdf_degree_centrality
3
4 #### REPLACE THE FUNCTION CALL WITH YOUR ANSWER
5 ecdf_degree_centrality(G)

[Figure: ECDF of degree centrality]

Now do it for degree:

1 from nams.solutions.hubs import ecdf_degree


2
3 #### REPLACE THE FUNCTION CALL WITH YOUR ANSWER
4 ecdf_degree(G)

[Figure: ECDF of degree]

The fact that they are identically-shaped should not surprise you!

Exercise: What about that denominator?


The denominator N in the degree centrality definition is “the number of nodes that a node could
possibly be connected to”. Can you think of two ways that N could be defined?

1 from nams.solutions.hubs import num_possible_neighbors


2
3 #### UNCOMMENT TO SEE MY ANSWER
4 # print(num_possible_neighbors())

Exercise: Circos Plotting


Let’s get some practice with the nxviz API.

Visualize the graph G, while ordering and colouring the nodes by the ‘order’ node attribute.

1 from nams.solutions.hubs import circos_plot


2
3 #### REPLACE THE NEXT LINE WITH YOUR ANSWER
4 circos_plot(G)

[Figure: Circos Plot sorted and coloured by the ‘order’ node attribute]

And here’s an alternative view using an arc plot.

1 import nxviz as nv
2 nv.arc(G, sort_by="order", node_color_by="order")

1 <AxesSubplot:>

[Figure: Arc Plot sorted and coloured by the ‘order’ node attribute]

Exercise: Visual insights


Since we know that node colour and order are by the “order” in which the person entered into the
exhibit, what does this visualization tell you?

1 from nams.solutions.hubs import visual_insights


2
3 #### UNCOMMENT THE NEXT LINE TO SEE MY ANSWER
4 # print(visual_insights())

Exercise: Investigating degree centrality and node order


One of the insights that we might have gleaned from visualizing the graph is that the nodes that
have a high degree centrality might also be responsible for the edges that criss-cross the Circos plot.
To test this, plot the following:

• x-axis: node degree centrality


• y-axis: maximum difference between the neighbors’ orders (a node attribute) and the node’s
order.

1 from nams.solutions.hubs import dc_node_order


2
3 dc_node_order(G)

[Figure: scatter plot of degree centrality against maximum neighbor order difference]

The somewhat positive correlation between degree centrality and the maximum order difference
suggests that this trend holds true. A further applied question would be to ask what behaviour of these
nodes gives rise to this pattern. Are these nodes actually exhibit staff? Or is there some other reason
why they are staying so long? Answering that, of course, would require joining in further information
that we would overlay on top of the graph (by adding node or edge attributes) before we could make
further statements.

Reflections
In this chapter, we defined a metric of node importance: the degree centrality metric. In the example
we looked at, it could help us identify potential infectious agent superspreaders in a disease contact
network. In other settings, it might help us spot:

• message amplifiers/influencers in a social network, and


• potentially crowded airports that have lots of connections into and out of them (still relevant to
infectious disease spread!)
• and many more!

What other settings can you think of in which the number of neighbors that a node has can become
a metric of importance for the node?

Solutions
Here are the solutions to the exercises above.

1 from nams.solutions import hubs


2 import inspect
3
4 print(inspect.getsource(hubs))

1 """Solutions to Hubs chapter."""


2
3 import matplotlib.pyplot as plt
4 import networkx as nx
5 import pandas as pd
6 import nxviz as nv
7 from nxviz import annotate
8
9 from nams import ecdf
10
11
12 def rank_ordered_neighbors(G):
13 """
14 Uses a pandas Series to help with sorting.
15 """
16 s = pd.Series({n: len(list(G.neighbors(n))) for n in G.nodes()})
17 return s.sort_values(ascending=False)
18
19
20 def rank_ordered_neighbors_original(G):
21 """Original implementation of rank-ordered number of neighbors."""
22 return sorted(G.nodes(), key=lambda x: len(list(G.neighbors(x))), reverse=True)
23
24
25 def rank_ordered_neighbors_generator(G):
26 """
27 Rank-ordered generator of neighbors.
28
29 Contributed by @dgerlanc.
30
31 Ref: https://github.com/ericmjl/Network-Analysis-Made-Simple/issues/75
32 """

33 gen = ((len(list(G.neighbors(x))), x) for x in G.nodes())


34 return sorted(gen, reverse=True)
35
36
37 def ecdf_degree_centrality(G):
38 """ECDF of degree centrality."""
39 x, y = ecdf(list(nx.degree_centrality(G).values()))
40 plt.scatter(x, y)
41 plt.xlabel("degree centrality")
42 plt.ylabel("cumulative fraction")
43
44
45 def ecdf_degree(G):
46 """ECDF of degree."""
47 num_neighbors = [len(list(G.neighbors(n))) for n in G.nodes()]
48 x, y = ecdf(num_neighbors)
49 plt.scatter(x, y)
50 plt.xlabel("degree")
51 plt.ylabel("cumulative fraction")
52
53
54 def num_possible_neighbors():
55 """Answer to the number of possible neighbors for a node."""
56 return r"""
57 The number of possible neighbors can either be defined as:
58
59 1. All other nodes but myself
60 2. All other nodes and myself
61
62 If {$$}K{/$$} is the number of nodes in the graph,
63 then if defined as (1), {$$}N{/$$} (the denominator) is {$$}K - 1{/$$}.
64 If defined as (2), {$$}N{/$$} is equal to {$$}K{/$$}.
65 """
66
67
68 def circos_plot(G):
69 """Draw a Circos Plot of the graph."""
70 # c = CircosPlot(G, node_order="order", node_color="order")
71 # c.draw()
72 nv.circos(G, sort_by="order", node_color_by="order")
73 annotate.node_colormapping(G, color_by="order")
74
75

76 def visual_insights():
77 """Visual insights from the Circos Plot."""
78 return """
79 We see that most edges are "local" with nodes
80 that are proximal in order.
81 The nodes that are weird are the ones that have connections
82 with individuals much later than itself,
83 crossing larger jumps in order/time.
84
85 Additionally, if you recall the ranked list of degree centralities,
86 it appears that these nodes that have the highest degree centrality scores
87 are also the ones that have edges that cross the circos plot.
88 """
89
90
91 def dc_node_order(G):
92 """Comparison of degree centrality by maximum difference in node order."""
93 import matplotlib.pyplot as plt
94 import pandas as pd
95 import networkx as nx
96
97 # Degree centralities
98 dcs = pd.Series(nx.degree_centrality(G))
99
100 # Maximum node order difference
101 maxdiffs = dict()
102 for n, d in G.nodes(data=True):
103 diffs = []
104 for nbr in G.neighbors(n):
105 diffs.append(abs(G.nodes[nbr]["order"] - d["order"]))
106 maxdiffs[n] = max(diffs)
107 maxdiffs = pd.Series(maxdiffs)
108
109 ax = pd.DataFrame(dict(degree_centrality=dcs, max_diff=maxdiffs)).plot(
110 x="degree_centrality", y="max_diff", kind="scatter"
111 )
Paths
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'

Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="JjpbztqP9_0", width="100%")

Graph traversal is akin to walking along the graph, node by node, constrained by the edges that
connect the nodes. Graph traversal is particularly useful for understanding the local structure of
certain portions of the graph and for finding paths that connect two nodes in the network.
In this chapter, we are going to learn how to perform pathfinding in a graph, specifically by looking
for shortest paths via the breadth-first search algorithm.

Breadth-First Search
The BFS algorithm is a staple of computer science curricula, and for good reason: it teaches learners
how to “think on” a graph, putting one in the position of “the dumb computer” that can’t use a visual
cortex to “just know” how to trace a path from one node to another. As a topic, learning how to do
BFS additionally imparts algorithmic thinking to the learner.

Exercise: Design the algorithm


Try out this exercise to get some practice with algorithmic thinking.

1. On a piece of paper, conjure up a graph that has 15-20 nodes. Connect them any way
you like.

2. Pick two nodes. Pretend that you’re standing on one of the nodes, but you can’t see
any further beyond one neighbor away.
3. Work out how you can find a path from the node you’re standing on to the other
node, given that you can only see nodes that are one neighbor away but have an
infinitely good memory.

If you are successful at designing the algorithm, you should get the answer below.

1 from nams import load_data as cf


2 G = cf.load_sociopatterns_network()

1 from nams.solutions.paths import bfs_algorithm


2
3 # UNCOMMENT NEXT LINE TO GET THE ANSWER.
4 # bfs_algorithm()


Exercise: Implement the algorithm


Now that you’ve seen how the algorithm works, try implementing it!

1 # FILL IN THE BLANKS BELOW


2
3 def path_exists(node1, node2, G):
4 """
5 This function checks whether a path exists between two nodes (node1,
6 node2) in graph G.
7 """
8 visited_nodes = _____
9 queue = [_____]
10
11 while len(queue) > 0:
12 node = ___________
13 neighbors = list(_________________)
14 if _____ in _________:
15 # print('Path exists between nodes {0} and {1}'.format(node1, node2))
16 return True
17 else:
18 visited_nodes.___(____)
19 nbrs = [_ for _ in _________ if _ not in _____________]
20 queue = ____ + _____
21
22 # print('Path does not exist between nodes {0} and {1}'.format(node1, node2))
23 return False

1 # UNCOMMENT THE FOLLOWING TWO LINES TO SEE THE ANSWER


2 from nams.solutions.paths import path_exists
3 # path_exists??

1 # CHECK YOUR ANSWER AGAINST THE TEST FUNCTION BELOW


2 from random import sample
3 import networkx as nx
4
5
6 def test_path_exists(N):
7 """
8 N: The number of times to spot-check.
9 """
10 for i in range(N):
11 n1, n2 = sample(list(G.nodes()), 2)
12 assert path_exists(n1, n2, G) == bool(nx.shortest_path(G, n1, n2))
13 return True

14
15 assert test_path_exists(10)

Visualizing Paths
One of the objectives of the exercise before was to help you “think on graphs”. Now that you’ve
learned how to do so, you might be wondering, “How do I visualize that path through the graph?”
Well first off, if you inspect the test_path_exists function above, you’ll notice that NetworkX
provides a shortest_path() function that you can use. Here’s what using nx.shortest_path() looks
like.

1 path = nx.shortest_path(G, 7, 400)


2 path

1 [7, 51, 188, 230, 335, 400]

As you can see, it returns the nodes along the shortest path, incidentally in the exact order that you
would traverse them.
One thing to note, though! If there are multiple shortest paths from one node to another, NetworkX
will only return one of them.
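
If you need every shortest path rather than just one, NetworkX also provides a generator for that. Here’s a minimal sketch using the same pair of nodes:

# nx.all_shortest_paths returns a generator,
# so we wrap it in list() to materialize all the shortest paths.
list(nx.all_shortest_paths(G, 7, 400))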
So how do you draw those nodes only?
You can use the G.subgraph(nodes) method to return a new graph that only has the nodes in nodes and only the
edges that exist between them. After that, you can use any plotting library you like. We will show
an example here that uses nxviz’s matrix plot.
Let’s see it in action:

1 import nxviz as nv
2 g = G.subgraph(path)
3 nv.matrix(g, sort_by="order")

1 <AxesSubplot:>

(figure: matrix plot of the shortest-path subgraph)

Voila! Now we have the subgraph (1) extracted and (2) drawn to screen! In this case, the matrix
plot is a suitable visualization for its compactness. The off-diagonals also show that each node is a
neighbor to the next one.
You’ll also notice that if you try to modify the graph g, say by adding a node:

1 g.add_node(2048)

you will get an error:

1 ---------------------------------------------------------------------------
2 NetworkXError Traceback (most recent call last)
3 <ipython-input-10-ca6aa4c26819> in <module>
4 ----> 1 g.add_node(2048)
5
6 ~/anaconda/envs/nams/lib/python3.7/site-packages/networkx/classes/function.py in fro\
7 zen(*args, **kwargs)
8 156 def frozen(*args, **kwargs):
9 157 """Dummy method for raising errors when trying to modify frozen graphs"""
10 --> 158 raise nx.NetworkXError("Frozen graph can't be modified")
11 159
12 160
13
14 NetworkXError: Frozen graph can't be modified

From the perspective of semantics, this makes a ton of sense: the subgraph g is a perfect subset of
the larger graph G, and should not be allowed to be modified unless the larger container graph is
modified.
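
That said, if you genuinely need a modifiable version of the subgraph, one option is to make an independent copy of it. A minimal sketch:

# .copy() detaches the subgraph from G, so the result is no longer frozen.
g_mutable = G.subgraph(path).copy()
g_mutable.add_node(2048)  # succeeds, and G itself is unaffected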

Exercise: Draw path with neighbors one degree out


Try out this next exercise:

Extend graph drawing with the neighbors of each of those nodes. Use any of the nxviz
plots (nv.matrix, nv.arc, nv.circos); try to see which one helps you tell the best story.

1 from nams.solutions.paths import plot_path_with_neighbors


2
3 ### YOUR SOLUTION BELOW

1 plot_path_with_neighbors(G, 7, 400)

(figure: arc plot of the path with its neighbors)

In this case, we opted for an Arc plot because we only have one grouping of nodes but have a logical
way to order them. Because the path follows the order, the edges being highlighted automatically
look like hops through the graph.

Bottleneck nodes
We’re now going to revisit the concept of an “important node”, this time now leveraging what we
know about paths.

In the “hubs” chapter, we saw how a node that is “important” could be so because it is connected to
many other nodes.
Paths give us an alternative definition. If we imagine that we have to pass a message on a graph
from one node to another, then there may be “bottleneck” nodes which, if removed, make it much
harder for messages to flow through the graph.
One metric that measures this form of importance is the “betweenness centrality” metric. On a
graph through which a generic “message” is flowing, a node with a high betweenness centrality is
one that has a high proportion of shortest paths flowing through it. In other words, it behaves like
a bottleneck.
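
For reference, the standard definition of the betweenness centrality of a node {$$}v{/$$} is:

{$$}c_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}{/$$}

where {$$}\sigma_{st}{/$$} is the number of shortest paths between nodes {$$}s{/$$} and {$$}t{/$$}, and {$$}\sigma_{st}(v){/$$} is the number of those paths that pass through {$$}v{/$$}. (Note that NetworkX’s implementation additionally normalizes this sum by default.)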

Betweenness centrality in NetworkX


NetworkX provides a “betweenness centrality” function that behaves consistently with the “degree
centrality” function, in that it returns a mapping from node to metric:

1 import pandas as pd
2
3 pd.Series(nx.betweenness_centrality(G))

1 100 0.014809
2 101 0.001398
3 102 0.000748
4 103 0.006735
5 104 0.001198
6 ...
7 89 0.000004
8 91 0.006415
9 96 0.000323
10 99 0.000322
11 98 0.000000
12 Length: 410, dtype: float64

Exercise: compare degree and betweenness centrality


Make a scatterplot of degree centrality on the x-axis and betweenness centrality on the
y-axis. Do they correlate with one another?

1 import matplotlib.pyplot as plt


2 import seaborn as sns
3
4 # YOUR ANSWER HERE:

1 from nams.solutions.paths import plot_degree_betweenness


2 plot_degree_betweenness(G)

(figure: scatterplot of degree centrality vs. betweenness centrality)

Think about it…


…does it make sense that degree centrality and betweenness centrality are not well-correlated?
Can you think of a scenario where a node has a “high” betweenness centrality but a “low” degree
centrality? Before peeking at the graph below, think about your answer for a moment.

1 nx.draw(nx.barbell_graph(5, 1))

(figure: a barbell graph)

Recap
In this chapter, you learned the following things:

1. You figured out how to implement the breadth-first-search algorithm to find shortest paths.
2. You learned how to extract subgraphs from a larger graph.
3. You implemented visualizations of subgraphs, which should help you as you communicate
with colleagues.
4. You calculated betweenness centrality metrics for a graph, and visualized how they correlated
with degree centrality.

Solutions
Here are the solutions to the exercises above.

1 from nams.solutions import paths


2 import inspect
3
4 print(inspect.getsource(paths))

1 """Solutions to Paths chapter."""


2
3 import matplotlib.pyplot as plt
4 import networkx as nx
5 import pandas as pd
6 import seaborn as sns
7 from nams.functions import render_html
8
9
10 def bfs_algorithm():
11 """
12 How to design a BFS algorithm.
13 """
14 ans = """
15 How does the breadth-first search work?
16 It essentially is as follows:
17
18 1. Begin with a queue that has only one element in it: the starting node.
19 2. Add the neighbors of that node to the queue.
20 1. If destination node is present in the queue, end.
21 2. If destination node is not present, proceed.
22 3. For each node in the queue:
23 1. Remove node from the queue.
24 2. Add neighbors of the node to the queue. Check if destination node is present \
25 or not.
26 3. If destination node is present, end. <!--Credit: @cavaunpeu for finding bug i\
27 n pseudocode.-->
28 4. If destination node is not present, continue.
29 """
30 return render_html(ans)
31
32
33 def path_exists(node1, node2, G):
34 """
35 This function checks whether a path exists between two nodes (node1,
36 node2) in graph G.
37 """

38
39 visited_nodes = set()
40 queue = [node1]
41
42 while len(queue) > 0:
43 node = queue.pop()
44 neighbors = list(G.neighbors(node))
45 if node2 in neighbors:
46 return True
47 else:
48 visited_nodes.add(node)
49 nbrs = [n for n in neighbors if n not in visited_nodes]
50 queue = nbrs + queue
51
52 return False
53
54
55 def path_exists_for_loop(node1, node2, G):
56 """
57 This function checks whether a path exists between two nodes (node1,
58 node2) in graph G.
59
60 Special thanks to @ghirlekar for suggesting that we keep track of the
61 "visited nodes" to prevent infinite loops from happening. This also
62 removes the need to remove nodes from queue.
63
64 Reference: https://github.com/ericmjl/Network-Analysis-Made-Simple/issues/3
65
66 With thanks to @joshporter1 for the second bug fix. Originally there was
67 an extraneous "if" statement that guaranteed that the "False" case would
68 never be returned - because queue never changes in shape. Discovered at
69 PyCon 2017.
70
71 With thanks to @chendaniely for pointing out the extraneous "break".
72
73 If you would like to see @dgerlanc's implementation, see
74 https://github.com/ericmjl/Network-Analysis-Made-Simple/issues/76
75 """
76 visited_nodes = set()
77 queue = [node1]
78
79 for node in queue:
80 neighbors = list(G.neighbors(node))

81 if node2 in neighbors:
82 return True
83 else:
84 visited_nodes.add(node)
85 queue.extend([n for n in neighbors if n not in visited_nodes])
86
87 return False
88
89
90 def path_exists_deque(node1, node2, G):
91 """An alternative implementation."""
92 from collections import deque
93
94 visited_nodes = set()
95 queue = deque([node1])
96
97 while len(queue) > 0:
98 node = queue.popleft()
99 neighbors = list(G.neighbors(node))
100 if node2 in neighbors:
101 return True
102 else:
103 visited_nodes.add(node)
104 queue.extend([n for n in neighbors if n not in visited_nodes])
105
106 return False
107
108
109 import nxviz as nv
110 from nxviz import annotate, highlights
111
112
113 def plot_path_with_neighbors(G, n1, n2):
114 """Plot a path with the heighbors of of the nodes along that path."""
115 path = nx.shortest_path(G, n1, n2)
116 nodes = [*path]
117 for node in path:
118 nodes.extend(list(G.neighbors(node)))
119 nodes = list(set(nodes))
120
121 g = G.subgraph(nodes)
122 nv.arc(
123 g, sort_by="order", node_color_by="order", edge_aes_kwargs={"alpha_scale": 0\

124 .5}
125 )
126 for n in path:
127 highlights.arc_node(g, n, sort_by="order")
128 for n1, n2 in zip(path[:-1], path[1:]):
129 highlights.arc_edge(g, n1, n2, sort_by="order")
130
131
132 def plot_degree_betweenness(G):
133 """Plot scatterplot between degree and betweenness centrality."""
134 bc = pd.Series(nx.betweenness_centrality(G))
135 dc = pd.Series(nx.degree_centrality(G))
136
137 df = pd.DataFrame(dict(bc=bc, dc=dc))
138 ax = df.plot(x="dc", y="bc", kind="scatter")
139 ax.set_ylabel("Betweenness\nCentrality")
140 ax.set_xlabel("Degree Centrality")
141 sns.despine()
Structures
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'

Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="3DWSRCbPPJs", width="100%")

If you remember, at the beginning of this book, we saw a quote from John Quackenbush that
essentially said that the reason a graph is interesting is because of its edges. In this chapter, we’ll
see this in action once again, as we are going to figure out how to leverage the edges to find special
structures in a graph.

Triangles
The first structure that we are going to learn about is triangles. Triangles are super interesting!
They are what one might consider to be “the simplest complex structure” in a graph. Triangles can
also have semantically-rich meaning depending on the application. To borrow a bad example, love
triangles in social networks are generally frowned upon, while on the other hand, when we connect
two people that we know together, we instead complete a triangle.

Load Data
To learn about triangles, we are going to leverage a physician trust network. Here’s the data
description:

This directed network captures innovation spread among 246 physicians in four towns in
Illinois: Peoria, Bloomington, Quincy and Galesburg. The data was collected in 1966. A
node represents a physician and an edge between two physicians shows that the left
physician told that the right physician is his friend or that he turns to the right physician
if he needs advice or is interested in a discussion. There always only exists one edge
between two nodes even if more than one of the listed conditions are true.

1 from nams import load_data as cf


2 G = cf.load_physicians_network()

Exercise: Finding triangles in a graph


This exercise is going to flex your ability to “think on a graph”, just as you did in the previous
chapters.

Leveraging what you know, can you think of a few strategies to find triangles in a graph?

1 from nams.solutions.structures import triangle_finding_strategies


2
3 # triangle_finding_strategies()


Exercise: Identify whether a node is in a triangle relationship or not
Let’s now get down to implementing this next piece of code.

Write a function that identifies whether a node is or is not in a triangle relationship. It


should take in a graph G and a node n, and return a boolean True if the node n is in any
triangle relationship and boolean False if the node n is not in any triangle relationship.

A hint that may help you:



Every graph object G has a G.has_edge(n1, n2) method that you can use to identify
whether a graph has an edge between n1 and n2.

Also:

itertools.combinations lets you iterate over every K-combination of items in an iterable.

1 def in_triangle(G, node):


2 # Your answer here
3 pass
4
5 # COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
6 from nams.solutions.structures import in_triangle
7
8 # UNCOMMENT THE NEXT LINE TO SEE MY ANSWER
9 # in_triangle??

Now, test your implementation below! The code cell will not error out if your answer is correct.

1 from random import sample


2 import networkx as nx
3
4 def test_in_triangle():
5 nodes = sample(list(G.nodes()), 10)
6 for node in nodes:
7 assert in_triangle(G, node) == bool(nx.triangles(G, node))
8
9 test_in_triangle()

As you can see from the test function above, NetworkX provides an nx.triangles(G, node) function.
It returns the number of triangles that a node is involved in. We convert it to boolean as a hack to
check whether or not a node is involved in a triangle relationship because 0 is equivalent to boolean
False, while any non-zero number is equivalent to boolean True.
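
As a quick illustration of that hack (a sketch; the exact count you see will depend on the node you pick):

n_tri = nx.triangles(G, 3)  # the number of triangles node 3 participates in
print(n_tri, bool(n_tri))   # a count, and whether that count is non-zero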

Exercise: Extract triangles for plotting


We’re going to leverage another piece of knowledge that you already have: the ability to extract
subgraphs. We’ll be plotting all of the triangles that a node is involved in.

Given a node, write a function that extracts out all of the neighbors that it is in a triangle
relationship with. Then, in a new function, implement code that plots only the subgraph
that contains those nodes.

1 def get_triangle_neighbors(G, n):


2 # Your answer here
3 pass
4
5 # COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
6 from nams.solutions.structures import get_triangle_neighbors
7
8 # UNCOMMENT THE NEXT LINE TO SEE MY ANSWER
9 # get_triangle_neighbors??

1 def plot_triangle_relations(G, n):


2 # Your answer here
3 pass
4
5 # COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
6 from nams.solutions.structures import plot_triangle_relations
7
8 plot_triangle_relations(G, 3)

(figure: triangle relationships around node 3)

Triadic Closure
In professional circles, making connections between two people is one of the most valuable things
you can do professionally. What you do in that moment is what we would call triadic closure.
Algorithmically, we can do the same thing if we maintain a graph of connections!
Essentially, what we are looking for are “open” or “unfinished” triangles.
In this section, we’ll try our hand at implementing a rudimentary triadic closure system.

Exercise: Design the algorithm


What graph logic would you use to identify triadic closure opportunities? Try writing out
your general strategy, or discuss it with someone.

1 from nams.solutions.structures import triadic_closure_algorithm


2
3 # UNCOMMENT FOR MY ANSWER
4 # triadic_closure_algorithm()

Exercise: Implement triadic closure.


Now, try your hand at implementing triadic closure.

Write a function that takes in a graph G and a node n, and returns all of the neighbors that
are potential triadic closures with n being the center node.

1 def get_open_triangles_neighbors(G, n):


2 # Your answer here
3 pass
4
5
6 # COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
7 from nams.solutions.structures import get_open_triangles_neighbors
8
9 # UNCOMMENT THE NEXT LINE TO SEE MY ANSWER
10 # get_open_triangles_neighbors??

Exercise: Plot the open triangles


Now, write a function that takes in a graph G and a node n, and plots out that node n and
all of the neighbors that it could help close triangles with.

1 def plot_open_triangle_relations(G, n):


2 # Your answer here
3 pass
4
5 # COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
6 from nams.solutions.structures import plot_open_triangle_relations
7
8 plot_open_triangle_relations(G, 3)

(figure: open triangle relationships around node 3)

Cliques
Triangles are interesting in a graph theoretic setting because triangles are the simplest complex clique
that exists.
But wait! What is the definition of a “clique”?

A “clique” is a set of nodes in a graph that are fully connected with one another by edges
between them.

Exercise: Simplest cliques


Given this definition, what is the simplest “clique” possible?

1 from nams.solutions.structures import simplest_clique


2
3 # UNCOMMENT THE NEXT LINE TO SEE MY ANSWER
4 # simplest_clique()

k-Cliques

Cliques are identified by their size k, which is the number of nodes that are present in the clique.
A triangle is what we would consider to be a k-clique where k = 3.
A square with cross-diagonal connections is what we would consider to be a k-clique where k = 4.
By now, you should get the gist of the idea.
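
To make this concrete, here is a small sketch using NetworkX’s complete graph generator, which builds exactly these fully-connected structures:

import networkx as nx

k3 = nx.complete_graph(3)  # a triangle: a k-clique where k = 3
k4 = nx.complete_graph(4)  # a "square" with cross-diagonals: k = 4
print(list(nx.find_cliques(k4)))  # a single maximal clique on nodes 0 through 3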

Maximal Cliques
Related to this idea of a k-clique is another idea called “maximal cliques”.
Maximal cliques are defined as follows:

A maximal clique is a subgraph of nodes in a graph

1. to which no other node can be added, and
2. which still remains a clique.

NetworkX provides a way to find all maximal cliques:

1 # I have truncated the output to the first 5 maximal cliques.


2 list(nx.find_cliques(G))[0:5]

1 [[1, 2], [1, 3], [1, 4, 5, 6], [1, 7], [1, 72]]

Exercise: finding sized-k maximal cliques


Write a generator function that yields all maximal cliques of size k.

I’m requesting a generator as a matter of good practice; you never know when the list you return
might explode in memory consumption, so generators are a cheap and easy way to reduce memory
usage.

1 def size_k_maximal_cliques(G, k):


2 # Your answer here
3 pass
4
5
6 # COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
7 from nams.solutions.structures import size_k_maximal_cliques

Now, test your implementation against the test function below.

1 def test_size_k_maximal_cliques(G, k):


2 clique_generator = size_k_maximal_cliques(G, k)
3 for clique in clique_generator:
4 assert len(clique) == k
5
6 test_size_k_maximal_cliques(G, 5)

Clique Decomposition
One super neat property of cliques is that every clique of size k can be decomposed to the set of
cliques of size k − 1.
Does this make sense to you? If not, think about triangles (3-cliques). They can be decomposed to
three edges (2-cliques).
Think again about 4-cliques. Housed within 4-cliques are four 3-cliques. Draw it out if you’re still
not convinced!
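
If drawing it out is not your thing, a few lines of code can verify the claim; here’s a sketch:

from itertools import combinations

import networkx as nx

k4 = nx.complete_graph(4)

# Every 3-node subset of a 4-clique is itself a 3-clique:
triples = list(combinations(k4.nodes(), 3))
for triple in triples:
    assert all(k4.has_edge(u, v) for u, v in combinations(triple, 2))

print(len(triples))  # 4: four 3-cliques housed within one 4-clique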

Exercise: finding all k-cliques in a graph


Knowing this property of k-cliques, write a generator function that yields all k-cliques in
a graph, leveraging the nx.find_cliques(G) function.

Some hints to help you along:

If a k-clique can be decomposed to its k − 1 cliques, it follows that the k − 1 cliques can
be decomposed into k − 2 cliques, and so on until you hit 2-cliques. This implies that all
cliques of size k house cliques of size n < k, where n >= 2.

1 def find_k_cliques(G, k):


2 # your answer here
3 pass
4
5 # COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
6 from nams.solutions.structures import find_k_cliques
7
8 def test_find_k_cliques(G, k):
9 for clique in find_k_cliques(G, k):
10 assert len(clique) == k
11
12 test_find_k_cliques(G, 3)

Connected Components
Now that we’ve explored a lot around cliques, we’re now going to explore this idea of “connected
components”. To do so, I am going to have you draw the graph that we are working with.

1 import nxviz as nv
2
3 nv.circos(G)

1 <AxesSubplot:>

(figure: circos plot of the physician network)

Exercise: Visual insights


From this rendering of the CircosPlot, what visual insights do you have about the structure of the
graph?

1 from nams.solutions.structures import visual_insights


2
3 # UNCOMMENT TO SEE MY ANSWER
4 # visual_insights()

Defining connected components


From Wikipedia⁹:

In graph theory, a connected component (or just component) of an undirected graph is a


subgraph in which any two vertices are connected to each other by paths, and which is
connected to no additional vertices in the supergraph.

NetworkX provides a function to let us find all of the connected components:

1 ccsubgraph_nodes = list(nx.connected_components(G))

Let’s see how many connected component subgraphs are present:

1 len(ccsubgraph_nodes)

1 4

Exercise: visualizing connected component subgraphs


In this exercise, we’re going to draw a circos plot of the graph, but colour and order the nodes by
their connected component subgraph.
Recall Circos API:

⁹https://en.wikipedia.org/wiki/Connected_component_%28graph_theory%29

1 c = CircosPlot(G, node_order='node_attribute', node_color='node_attribute')


2 c.draw()
3 plt.show() # or plt.savefig(...)

Follow the steps along here to accomplish this.

Firstly, label each node with a unique identifier for the connected component subgraph that
it resides in. Use the key subgraph to store this piece of metadata.

1 def label_connected_component_subgraphs(G):
2 # Your answer here
3 return G
4
5
6 # COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
7 from nams.solutions.structures import label_connected_component_subgraphs
8 G_labelled = label_connected_component_subgraphs(G)
9
10 # UNCOMMENT TO SEE THE ANSWER
11 # label_connected_component_subgraphs??

Now, draw a CircosPlot with the node order and colouring dictated by the subgraph key.

1 def plot_cc_subgraph(G):
2 # Your answer here
3 pass
4
5
6 # COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
7 from nams.solutions.structures import plot_cc_subgraph
8 from nxviz import annotate
9
10 plot_cc_subgraph(G_labelled)
11 annotate.circos_group(G_labelled, group_by="subgraph")

(figure: circos plot coloured and ordered by connected component subgraph)

Using an arc plot will also clearly illuminate for us that there are no inter-group connections.

1 nv.arc(G_labelled, group_by="subgraph", node_color_by="subgraph")


2 annotate.arc_group(G_labelled, group_by="subgraph", rotation=0)

(figure: arc plot grouped by connected component subgraph)

Voila! It looks quite clear that there are indeed four disjoint groups of physicians.

Solutions

1 from nams.solutions import structures


2 import inspect
3
4 print(inspect.getsource(structures))

1 """Solutions to Structures chapter."""


2
3 from itertools import combinations
4
5 import networkx as nx
6 from nxviz import circos
7 from nams.functions import render_html
8
9
10 def triangle_finding_strategies():
11 """
12 How to find triangles.
13 """
14 ans = """
15 One way would be to take one node, and look at its neighbors.
16 If its neighbors are also connected to one another,
17 then we have found a triangle.
18
19 Another way would be to start at a given node,
20 and walk out two nodes.
21 If the starting node is the neighbor of the node two hops away,
22 then the path we traced traces out the nodes in a triangle.
23 """
24 return render_html(ans)
25
26
27 def in_triangle(G, node):
28 """
29 Return whether a given node is present in a triangle relationship.
30 """
31 for nbr1, nbr2 in combinations(G.neighbors(node), 2):
32 if G.has_edge(nbr1, nbr2):
33 return True
34 return False
35
36
37 def get_triangle_neighbors(G, node) -> set:

38 """
39 Return neighbors involved in triangle relationship with node.
40 """
41 neighbors1 = set(G.neighbors(node))
42 triangle_nodes = set()
43 for nbr1, nbr2 in combinations(neighbors1, 2):
44 if G.has_edge(nbr1, nbr2):
45 triangle_nodes.add(nbr1)
46 triangle_nodes.add(nbr2)
47 return triangle_nodes
48
49
50 def plot_triangle_relations(G, node):
51 """
52 Plot all triangle relationships for a given node.
53 """
54 triangle_nbrs = get_triangle_neighbors(G, node)
55 triangle_nbrs.add(node)
56 nx.draw(G.subgraph(triangle_nbrs), with_labels=True)
57
58
59 def triadic_closure_algorithm():
60 """
61 How to do triadic closure.
62 """
63 ans = """
64 I would suggest the following strategy:
65
66 1. Pick a node
67 1. For every pair of neighbors:
68 1. If neighbors are not connected,
69 then this is a potential triangle to close.
70
71 This strategy gives you potential triadic closures
72 given a "center" node `n`.
73
74 The other way is to trace out a path two degrees out
75 and ask whether the terminal node is a neighbor
76 of the starting node.
77 If not, then we have another triadic closure to make.
78 """
79 return render_html(ans)
80

81
82 def get_open_triangles_neighbors(G, node) -> set:
83 """
84 Return neighbors involved in open triangle relationships with a node.
85 """
86 open_triangle_nodes = set()
87 neighbors = list(G.neighbors(node))
88
89 for n1, n2 in combinations(neighbors, 2):
90 if not G.has_edge(n1, n2):
91 open_triangle_nodes.add(n1)
92 open_triangle_nodes.add(n2)
93
94 return open_triangle_nodes
95
96
97 def plot_open_triangle_relations(G, node):
98 """
99 Plot open triangle relationships for a given node.
100 """
101 open_triangle_nbrs = get_open_triangles_neighbors(G, node)
102 open_triangle_nbrs.add(node)
103 nx.draw(G.subgraph(open_triangle_nbrs), with_labels=True)
104
105
106 def simplest_clique():
107 """
108 Answer to "what is the simplest clique".
109 """
110 return render_html("The simplest clique is an edge.")
111
112
113 def size_k_maximal_cliques(G, k):
114 """
115 Return all size-k maximal cliques.
116 """
117 for clique in nx.find_cliques(G):
118 if len(clique) == k:
119 yield clique
120
121
122 def find_k_cliques(G, k):
123 """

124 Find all cliques of size k.


125 """
126 for clique in nx.find_cliques(G):
127 if len(clique) >= k:
128 for nodeset in combinations(clique, k):
129 yield nodeset
130
131
132 def visual_insights():
133 """
134 Answer to visual insights exercise.
135 """
136 ans = """
137 We might hypothesize that there are 3,
138 maybe 4 different "communities" of nodes
139 that are completely disjoint with one another,
140 i.e. there is no path between them.
141 """
142 print(ans)
143
144
145 def label_connected_component_subgraphs(G):
146 """Label all connected component subgraphs."""
147 G = G.copy()
148 for i, nodeset in enumerate(nx.connected_components(G)):
149 for n in nodeset:
150 G.nodes[n]["subgraph"] = i
151 return G
152
153
154 def plot_cc_subgraph(G):
155 """Plot all connected component subgraphs."""
156 c = circos(G, node_color_by="subgraph", group_by="subgraph")

Graph I/O
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'

Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="3sJnTpeFXZ4", width="100%")

In order to get you familiar with graph ideas, I have deliberately chosen to steer away from the
more pedantic matters of loading graph data to and from disk. That said, the following scenario will
eventually happen, where a graph dataset lands on your lap, and you’ll need to load it in memory
and start analyzing it.
Thus, we’re going to go through graph I/O, specifically the APIs on how to convert graph data that
comes to you into that magical NetworkX object G.
Let’s get going!

Graph Data as Tables


Let’s recall what we’ve learned in the introductory chapters. Graphs can be represented using two
sets:

• Node set
• Edge set

Node set as tables


Let’s say we had a graph with 3 nodes in it: A, B, C. We could represent it in plain text, computer-
readable format:

1 A
2 B
3 C

Suppose the nodes also had metadata. Then, we could tag on metadata as well:

1 A, circle, 5
2 B, circle, 7
3 C, square, 9

Does this look familiar to you? Yes, node sets can be stored in CSV format, with one of the columns
being node ID, and the rest of the columns being metadata.

Edge set as tables


If, between the nodes, we had 4 edges (this is a directed graph), we can also represent those edges
in plain text, computer-readable format:

1 A, C
2 B, C
3 A, B
4 C, A

And let’s say we also had other metadata, we can represent it in the same CSV format:

1 A, C, red
2 B, C, orange
3 A, B, yellow
4 C, A, green

If you’ve been in the data world for a while, this should not look foreign to you. Yes, edge sets can
be stored in CSV format too! Two of the columns represent the nodes involved in an edge, and the
rest of the columns represent the metadata.
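
Conveniently, this tabular shape is exactly what NetworkX’s pandas interface expects. Here is a minimal sketch using the toy table above (the column names n1, n2 and colour are made up for illustration):

import networkx as nx
import pandas as pd

# The toy edge table from the text above.
edges = pd.DataFrame(
    {
        "n1": ["A", "B", "A", "C"],
        "n2": ["C", "C", "B", "A"],
        "colour": ["red", "orange", "yellow", "green"],
    }
)

G_toy = nx.from_pandas_edgelist(
    edges, source="n1", target="n2", edge_attr="colour", create_using=nx.DiGraph
)

We will see this same function applied to real data later in this chapter.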

Combined Representation
In fact, one might also choose to combine the node set and edge set tables together in a merged
format:

1 n1, n2, colour, shape1, num1, shape2, num2


2 A, C, red, circle, 5, square, 9
3 B, C, orange, circle, 7, square, 9
4 A, B, yellow, circle, 5, circle, 7
5 C, A, green, square, 9, circle, 5

In this chapter, the datasets that we will be looking at are going to be formatted in both ways. Let’s
get going.

Dataset
We will be working with the Divvy bike sharing dataset.

Divvy is a bike sharing service in Chicago. Since 2013, Divvy has released their bike
sharing dataset to the public. The 2013 dataset is comprised of two files:

• Divvy_Stations_2013.csv, containing the stations in the system, and
• Divvy_Trips_2013.csv, containing the trips.

Let’s dig into the data!

1 from pyprojroot import here

Firstly, we need to unzip the dataset:

1 import zipfile
2 import os
3 from nams.load_data import datasets
4
5 # This block of code checks to make sure that a particular directory is present.
6 if "divvy_2013" not in os.listdir(datasets):
7 print('Unzipping the divvy_2013.zip file in the datasets folder.')
8 with zipfile.ZipFile(datasets / "divvy_2013.zip","r") as zip_ref:
9 zip_ref.extractall(datasets)

Now, let’s load in both tables.


First is the stations table:

1 import pandas as pd
2
3 stations = pd.read_csv(datasets / 'divvy_2013/Divvy_Stations_2013.csv', parse_dates=\
4 ['online date'], encoding='utf-8')
5 print(stations.head().to_markdown())

|    | id | name                       | latitude | longitude | dpcapacity | landmark | online date         |
|---:|---:|:---------------------------|---------:|----------:|-----------:|---------:|:--------------------|
|  0 |  5 | State St & Harrison St     |  41.874  |  -87.6277 |         19 |       30 | 2013-06-28 00:00:00 |
|  1 | 13 | Wilton Ave & Diversey Pkwy |  41.9325 |  -87.6527 |         19 |       66 | 2013-06-28 00:00:00 |
|  2 | 14 | Morgan St & 18th St        |  41.8581 |  -87.6511 |         15 |      163 | 2013-06-28 00:00:00 |
|  3 | 15 | Racine Ave & 18th St       |  41.8582 |  -87.6565 |         15 |      164 | 2013-06-28 00:00:00 |
|  4 | 16 | Wood St & North Ave        |  41.9103 |  -87.6725 |         15 |      223 | 2013-08-12 00:00:00 |

1 print(stations.describe().to_markdown())

|       |       id | latitude  | longitude | dpcapacity | landmark |
|:------|---------:|----------:|----------:|-----------:|---------:|
| count | 300      | 300       | 300       | 300        | 300      |
| mean  | 189.063  | 41.8963   | -87.6482  | 16.8       | 192.013  |
| std   | 99.4845  | 0.0409522 | 0.0230011 | 4.67399    | 120.535  |
| min   | 5        | 41.7887   | -87.7079  | 11         | 1        |
| 25%   | 108.75   | 41.8718   | -87.6658  | 15         | 83.75    |
| 50%   | 196.5    | 41.8946   | -87.6486  | 15         | 184.5    |
| 75%   | 276.25   | 41.9264   | -87.6318  | 19         | 288.25   |
| max   | 351      | 41.9784   | -87.5807  | 47         | 440      |

Now, let’s load in the trips table.

1 trips = pd.read_csv(datasets / 'divvy_2013/Divvy_Trips_2013.csv',


2 parse_dates=['starttime', 'stoptime'])
3 print(trips.head().to_markdown())

1 /home/runner/work/Network-Analysis-Made-Simple/Network-Analysis-Made-Simple/nams_env\
2 /lib/python3.8/site-packages/IPython/core/interactiveshell.py:3165: DtypeWarning: Co\
3 lumns (10) have mixed types.Specify dtype option on import or set low_memory=False.
4 has_raised = await self.run_ast_nodes(code_ast.body, cell_name,

|    | trip_id | starttime           | stoptime            | bikeid | tripduration | from_station_id | from_station_name          | to_station_id | to_station_name                  | usertype | gender | birthday |
|---:|--------:|:--------------------|:--------------------|-------:|-------------:|----------------:|:---------------------------|--------------:|:---------------------------------|:---------|:-------|:---------|
|  0 |    4118 | 2013-06-27 12:11:00 | 2013-06-27 12:16:00 |    480 |          316 |              85 | Michigan Ave & Oak St      |            28 | Larrabee St & Menomonee St       | Customer | nan    | nan      |
|  1 |    4275 | 2013-06-27 14:44:00 | 2013-06-27 14:45:00 |     77 |           64 |              32 | Racine Ave & Congress Pkwy |            32 | Racine Ave & Congress Pkwy       | Customer | nan    | nan      |
|  2 |    4291 | 2013-06-27 14:58:00 | 2013-06-27 15:05:00 |     77 |          433 |              32 | Racine Ave & Congress Pkwy |            19 | Loomis St & Taylor St            | Customer | nan    | nan      |
|  3 |    4316 | 2013-06-27 15:06:00 | 2013-06-27 15:09:00 |     77 |          123 |              19 | Loomis St & Taylor St      |            19 | Loomis St & Taylor St            | Customer | nan    | nan      |
|  4 |    4342 | 2013-06-27 15:13:00 | 2013-06-27 15:27:00 |     77 |          852 |              19 | Loomis St & Taylor St      |            55 | Halsted St & James M Rochford St | Customer | nan    | nan      |

1 import janitor
2 trips_summary = (
3 trips
4 .groupby(["from_station_id", "to_station_id"])
5 .count()
6 .reset_index()
7 .select_columns(
8 [
9 "from_station_id",
10 "to_station_id",
11 "trip_id"
12 ]
13 )
14 .rename_column("trip_id", "num_trips")
15 )

1 print(trips_summary.head().to_markdown())

|    | from_station_id | to_station_id | num_trips |
|---:|----------------:|--------------:|----------:|
|  0 |               5 |             5 |       232 |
|  1 |               5 |            13 |         1 |
|  2 |               5 |            14 |        15 |
|  3 |               5 |            15 |         9 |
|  4 |               5 |            16 |         4 |

Graph Model
Given the data, if we wished to use a graph as a data model for the number of trips between stations,
then naturally, nodes would be the stations, and edges would be trips between them.
This graph would be directed, as one could have more trips from station A to B and fewer in the
reverse direction.
With this definition, we can begin graph construction!

Create NetworkX graph from pandas edgelist


NetworkX provides an extremely convenient way to load data from a pandas DataFrame:

1 import networkx as nx
2
3 G = nx.from_pandas_edgelist(
4 df=trips_summary,
5 source="from_station_id",
6 target="to_station_id",
7 edge_attr=["num_trips"],
8 create_using=nx.DiGraph
9 )

Inspect the graph


Once the graph is in memory, we can inspect it to get out summary graph statistics.

1 print(nx.info(G))

1 Name:
2 Type: DiGraph
3 Number of nodes: 300
4 Number of edges: 44422
5 Average in degree: 148.0733
6 Average out degree: 148.0733

You’ll notice that the edge metadata have been added correctly: we have recorded in there the
number of trips between stations.

1 list(G.edges(data=True))[0:5]

1 [(5, 5, {'num_trips': 232}),


2 (5, 13, {'num_trips': 1}),
3 (5, 14, {'num_trips': 15}),
4 (5, 15, {'num_trips': 9}),
5 (5, 16, {'num_trips': 4})]

However, the node metadata is not present:

1 list(G.nodes(data=True))[0:5]

1 [(5, {}), (13, {}), (14, {}), (15, {}), (16, {})]

Annotate node metadata


We have rich station data on hand, such as the longitude and latitude of each station, and it would be
a pity to discard it, especially when we can potentially use it as part of the analysis or for visualization
purposes. Let’s see how we can add this information in.
Firstly, recall what the stations dataframe looked like:

1 print(stations.head().to_markdown())

|    | id | name                       | latitude | longitude | dpcapacity | landmark | online date         |
|---:|---:|:---------------------------|---------:|----------:|-----------:|---------:|:--------------------|
|  0 |  5 | State St & Harrison St     |  41.874  |  -87.6277 |         19 |       30 | 2013-06-28 00:00:00 |
|  1 | 13 | Wilton Ave & Diversey Pkwy |  41.9325 |  -87.6527 |         19 |       66 | 2013-06-28 00:00:00 |
|  2 | 14 | Morgan St & 18th St        |  41.8581 |  -87.6511 |         15 |      163 | 2013-06-28 00:00:00 |
|  3 | 15 | Racine Ave & 18th St       |  41.8582 |  -87.6565 |         15 |      164 | 2013-06-28 00:00:00 |
|  4 | 16 | Wood St & North Ave        |  41.9103 |  -87.6725 |         15 |      223 | 2013-08-12 00:00:00 |

The id column gives us the node ID in the graph. If we set id to be the index and then loop over
each row, we can treat the remaining columns as dictionary keys and their values as dictionary
values, and add that information into the graph.
Let’s see this in action.

1 for node, metadata in stations.set_index("id").iterrows():


2 for key, val in metadata.items():
3 G.nodes[node][key] = val
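
An equivalent, more compact route (a sketch) is NetworkX’s nx.set_node_attributes, which accepts a dict-of-dicts keyed by node ID:

# stations.set_index("id").to_dict(orient="index") produces
# {node_id: {column: value, ...}, ...}, which is exactly the shape
# that nx.set_node_attributes expects.
nx.set_node_attributes(G, stations.set_index("id").to_dict(orient="index"))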

Either way, our node metadata should now be populated.

1 list(G.nodes(data=True))[0:5]

1 [(5,
2 {'name': 'State St & Harrison St',
3 'latitude': 41.87395806,
4 'longitude': -87.62773949,
5 'dpcapacity': 19,
6 'landmark': 30,
7 'online date': Timestamp('2013-06-28 00:00:00')}),
8 (13,
9 {'name': 'Wilton Ave & Diversey Pkwy',
10 'latitude': 41.93250008,
11 'longitude': -87.65268082,
12 'dpcapacity': 19,
13 'landmark': 66,
14 'online date': Timestamp('2013-06-28 00:00:00')}),
15 (14,
16 {'name': 'Morgan St & 18th St',
17 'latitude': 41.858086,
18 'longitude': -87.651073,
19 'dpcapacity': 15,
20 'landmark': 163,
21 'online date': Timestamp('2013-06-28 00:00:00')}),
22 (15,
23 {'name': 'Racine Ave & 18th St',
24 'latitude': 41.85818061,
25 'longitude': -87.65648665,
26 'dpcapacity': 15,
27 'landmark': 164,
28 'online date': Timestamp('2013-06-28 00:00:00')}),
29 (16,
30 {'name': 'Wood St & North Ave',
31 'latitude': 41.910329,
32 'longitude': -87.672516,
33 'dpcapacity': 15,
34 'landmark': 223,
35 'online date': Timestamp('2013-08-12 00:00:00')})]

In nxviz, a GeoPlot object is available that allows you to quickly visualize a graph that has
geographic data. However, being matplotlib-based, it is going to be quickly overwhelmed by the
sheer number of edges.
As such, we are going to first filter the edges.

Exercise: Filter graph edges


Leveraging what you know about how to manipulate graphs, now try filtering edges.

Hint: NetworkX graph objects can be copied using G.copy():

1 G_copy = G.copy()

Hint: NetworkX graph objects also let you remove edges:

1 G.remove_edge(node1, node2) # does not return anything

1 def filter_graph(G, minimum_num_trips):


2 """
3 Filter the graph such that
4 only edges that have minimum_num_trips or more
5 are present.
6 """
7 G_filtered = G.____()
8 for _, _, _ in G._____(data=____):
9 if d[___________] < ___:
10 G_________.___________(_, _)
11 return G_filtered
12
13 from nams.solutions.io import filter_graph
14
15 G_filtered = filter_graph(G, 50)

Visualize using GeoPlot


nxviz provides a GeoPlot object that lets you quickly visualize geospatial graph data.

A note on geospatial visualizations:

As the creator of nxviz, I would recommend using proper geospatial packages to build
custom geospatial graph viz, such as pysal¹⁰.
That said, nxviz can probably do what you need for a quick-and-dirty view of the data.

¹⁰http://pysal.org/

1 import nxviz as nv
2
3 c = nv.geo(G_filtered, node_color_by="dpcapacity")


(figure: geospatial plot of the filtered Divvy graph)

Does that look familiar to you? Looks quite a bit like Chicago, I’d say :)
Jesting aside, this visualization does help illustrate that the majority of trips occur between stations
that are near the city center.

Pickling Graphs
Since NetworkX graphs are Python objects, the canonical way to save them is by pickling them. You
can do this using:

1 nx.write_gpickle(G, file_path)

Here’s an example in action:

1 nx.write_gpickle(G, "/tmp/divvy.pkl")

And just to show that it can be loaded back into memory:

1 G_loaded = nx.read_gpickle("/tmp/divvy.pkl")
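
One caveat worth noting: newer versions of NetworkX (3.0 onwards) removed read_gpickle and write_gpickle. If you are on such a version, the standard library’s pickle module does the same job:

import pickle

# Saving...
with open("/tmp/divvy.pkl", "wb") as f:
    pickle.dump(G, f)

# ...and loading back into memory.
with open("/tmp/divvy.pkl", "rb") as f:
    G_loaded = pickle.load(f)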

Exercise: checking graph integrity


If you get a graph dataset as a pickle, you should always check it against reference properties to
make sure of its data integrity.

Write a function that tests that the graph has the correct number of nodes and edges inside
it.

1 def test_graph_integrity(G):
2 """Test integrity of raw Divvy graph."""
3 # Your solution here
4 pass
5
6 from nams.solutions.io import test_graph_integrity
7
8 test_graph_integrity(G)

Other text formats


CSV files and pandas DataFrames give us a convenient way to store graph data, and if possible, do
insist with your data collaborators that they provide you with graph data that are in this format. If
they don’t, however, no sweat! After all, Python is super versatile.
In this ebook, we have loaded data in from non-CSV sources, sometimes by parsing text files raw,
sometimes by treating special characters as delimiters in a CSV-like file, and sometimes by resorting
to parsing JSON.
You can see other examples of how we load data by browsing through the source file of load_data.py
and studying how we construct graph objects.
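
As one illustration of the JSON case, here is a minimal sketch (the payload shape shown here is made up for the example):

import json

import networkx as nx

# Hypothetical payload with "nodes" and "edges" keys.
payload = json.loads('{"nodes": ["A", "B", "C"], "edges": [["A", "B"], ["B", "C"]]}')

G_json = nx.Graph()
G_json.add_nodes_from(payload["nodes"])
G_json.add_edges_from(payload["edges"])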

Solutions
The solutions to this chapter’s exercises are below.

1 from nams.solutions import io


2 import inspect
3
4 print(inspect.getsource(io))

1 """Solutions to I/O chapter"""


2
3
4 def filter_graph(G, minimum_num_trips):
5 """
6 Filter the graph such that
7 only edges that have minimum_num_trips or more
8 are present.
9 """
10 G_filtered = G.copy()
11 for u, v, d in G.edges(data=True):
12 if d["num_trips"] < minimum_num_trips:
13 G_filtered.remove_edge(u, v)
14 return G_filtered
15
16
17 def test_graph_integrity(G):
18 """Test integrity of raw Divvy graph."""
19 assert len(G.nodes()) == 300
20 assert len(G.edges()) == 44422
Testing
Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="SdbKs-crm-g", width="100%")

By this point in the book, you should have observed that we have written a number of tests for our
data.

Why test?

If you like it, put a ring on it…


…and if you rely on it, test it.
I am personally a proponent of writing tests for our data because as data scientists, the fields of
our data, and their correct values, form the “data programming interface” (DPI) much like function
signatures form the “application programming interface” (API). Since we test the APIs that we rely
on, we probably should test the DPIs that we rely on too.

What to test
When thinking about what part of the data to test, it can be confusing. After all, data are seemingly
generated from random processes (my Bayesian foxtail has been revealed), and it seems difficult to
test random processes.
That said, from my experience handling data, I can suggest a few principles.

Test invariants
Firstly, we test invariant properties of the data. Put in plain language, things we know ought to be
true.
Using the Divvy bike dataset example, we know that every node ought to have a station name. Thus,
the minimum that we can test is that the station_name attribute is present on every node. As an
example:

1 def test_divvy_nodes(G):
2 """Test node metadata on Divvy dataset."""
3 for n, d in G.nodes(data=True):
4 assert "station_name" in d.keys()

Test nullity
Secondly, we can test that values that ought not to be null should not be null.
Using the Divvy bike dataset example again, if we also know that the station name cannot be null
or an empty string, then we can bake that into the test.

1 def test_divvy_nodes(G):
2 """Test node metadata on Divvy dataset."""
3 for n, d in G.nodes(data=True):
4 assert "station_name" in d.keys()
5 assert bool(d["station_name"])

Test boundaries
We can also test boundary values. For example, within the city of Chicago, we know that latitude
and longitude values ought to be within the vicinity of 41.85003, -87.65005. If we get data values
that are, say, outside the range of [41, 42]; [-88, -87], then we know that we have data issues
as well.
Here’s an example:

1 def test_divvy_nodes(G):
2 """Test node metadata on Divvy dataset."""
3 for n, d in G.nodes(data=True):
4 # Test for station names.
5 assert "station_name" in d.keys()
6 assert bool(d["station_name"])
7
8 # Test for longitude/latitude
9 assert d["latitude"] >= 41 and d["latitude"] <= 42
10 assert d["longitude"] >= -88 and d["longitude"] <= -87

An apology to geospatial experts: I genuinely don’t know the bounding box lat/lon coordinates of
Chicago, so if you know those coordinates, please reach out so I can update the test.

Continuous data testing


The key idea with testing is to have tests that run continuously in the background
without you ever needing to intervene to kickstart them. It’s like having a bot
that is always running checks for you.
To do so, you should be equipped with a few tools. I won’t go into them in-depth here, as I will be
writing a “continuous data testing” essay in the near future. That said, here is the gist.
Firstly, use pytest to get set up with testing. You essentially write a test_something.py file in
which you write your test suite, and your test functions are all nothing more than simple functions.

1 # test_data.py
2 def test_divvy_nodes(G):
3 """Test node metadata on Divvy dataset."""
4 for n, d in G.nodes(data=True):
5 # Test for station names.
6 assert "station_name" in d.keys()
7 assert bool(d["station_name"])
8
9 # Test for longitude/latitude
10 assert d["latitude"] >= 41 and d["latitude"] <= 42
11 assert d["longitude"] >= -88 and d["longitude"] <= -87

At the command line, if you run pytest, it will automatically discover all functions prefixed with
test_ in all .py files underneath the current working directory.

Secondly, set up a continuous pipelining system to continuously run data tests. For example,
you can set up Jenkins¹¹, Travis¹², Azure Pipelines¹³, Prefect¹⁴, and more, depending on what your
organization has bought into.
Sometimes data tests take longer than software tests, especially if you are pulling dumps from a
database, so you might want to run this portion of tests in a separate pipeline instead.

Further reading
• In my essays collection, I wrote about testing data¹⁵.
• Itamar Turner-Trauring has written about keeping tests quick and speedy¹⁶, which is extremely
crucial to keeping yourself motivated to write tests.

¹¹https://www.jenkins.io/
¹²https://travis-ci.org/
¹³https://azure.microsoft.com/en-us/services/devops/pipelines/
¹⁴https://www.prefect.io/
¹⁵https://ericmjl.github.io/essays-on-data-science/software-skills/testing/#tests-for-data
¹⁶https://pythonspeed.com/articles/slow-tests-fast-feedback/
Bipartite Graphs
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'

Introduction
1 from IPython.display import YouTubeVideo
2
3 YouTubeVideo(id="BYOK12I9vgI", width="100%")

In this chapter, we will look at bipartite graphs and their applications.

What are bipartite graphs?


As the name suggests, bipartite graphs have two (bi) node partitions (partite). In other words, we can assign
nodes to one of the two partitions. (By contrast, all of the graphs that we have seen before are
unipartite: they only have a single partition.)

Rules for bipartite graphs


With unipartite graphs, you might remember a few rules that apply.
Firstly, nodes and edges belong to a set. This means the node set contains only unique members, i.e.
no node can be duplicated. The same applies for the edge set.
On top of those two basic rules, bipartite graphs add an additional rule: Edges can only occur between
nodes of different partitions. In other words, nodes within the same partition are not allowed to be
connected to one another.

Applications of bipartite graphs


Where do we see bipartite graphs being used? Here’s one that is very relevant to e-commerce, which
touches our daily lives:

We can model customer purchases of products using a bipartite graph. Here, the two
node sets are customer nodes and product nodes, and edges indicate that a customer C
purchased a product P .

On the basis of this graph, we can do interesting analyses, such as finding customers that are similar
to one another on the basis of their shared product purchases.
Can you think of other situations where a bipartite graph model can be useful?

Dataset
Here’s another application in crime analysis, which is relevant to the example that we will use in
this chapter:

This bipartite network contains persons who appeared in at least one crime case as either
a suspect, a victim, a witness or both a suspect and victim at the same time. A left node
represents a person and a right node represents a crime. An edge between two nodes
shows that the left node was involved in the crime represented by the right node.

This crime dataset was also sourced from Konect.

1 from nams import load_data as cf


2 G = cf.load_crime_network()
3 for n, d in G.nodes(data=True):
4 G.nodes[n]["degree"] = G.degree(n)

If you inspect the nodes, you will see that they contain a special metadata keyword: bipartite. This
is a keyword that NetworkX can use to identify nodes of a given partition.
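
To make this concrete, here is a minimal sketch of how that keyword can be attached when building a graph by hand (the node names here are made up):

import networkx as nx

bg = nx.Graph()
bg.add_nodes_from(["p1", "p2"], bipartite="person")
bg.add_nodes_from(["c1"], bipartite="crime")
bg.add_edges_from([("p1", "c1"), ("p2", "c1")])
print(list(bg.nodes(data=True)))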

Visualize the crime network


To help us get our bearings right, let’s visualize the crime network.

1 import nxviz as nv
2 import matplotlib.pyplot as plt
3
4 fig, ax = plt.subplots(figsize=(7, 7))
5 nv.circos(G, sort_by="degree", group_by="bipartite", node_color_by="bipartite", node\
6 _aes_kwargs={"size_scale": 3})

1 <AxesSubplot:>

(figure: circos plot of the crime network, grouped by partition)

Exercise: Extract each node set


A useful thing to be able to do is to extract each partition’s node set. This will become handy when
interacting with NetworkX’s bipartite algorithms later on.

Write a function that extracts all of the nodes from a specified node partition. It should
also raise a plain Exception if no nodes exist in that specified partition (as a precaution
against users putting in invalid partition names).

1 import networkx as nx
2
3 def extract_partition_nodes(G: nx.Graph, partition: str):
4 nodeset = [_ for _, _ in _______ if ____________]
5 if _____________:
6 raise Exception(f"No nodes exist in the partition {partition}!")
7 return nodeset
8
9 from nams.solutions.bipartite import extract_partition_nodes
10 # Uncomment the next line to see the answer.
11 # extract_partition_nodes??

Bipartite Graph Projections


In a bipartite graph, one task that can be useful to do is to calculate the projection of a graph onto
one of its nodes.
What do we mean by the “projection of a graph”? It is best visualized using this figure:

1 from nams.solutions.bipartite import draw_bipartite_graph_example, bipartite_example\


2 _graph
3 from nxviz import annotate
4 import matplotlib.pyplot as plt
5
6 bG = bipartite_example_graph()
7 pG = nx.bipartite.projection.projected_graph(bG, "abcd")
8 ax = draw_bipartite_graph_example()
9 plt.sca(ax[0])
10 annotate.parallel_labels(bG, group_by="bipartite")
11 plt.sca(ax[1])
12 annotate.arc_labels(pG)

(figure: a bipartite graph and its projection onto the “alphabet” node set)

As shown in the figure above, we start first with a bipartite graph with two node sets, the “alphabet”
set and the “numeric” set. The projection of this bipartite graph onto the “alphabet” node set is
a graph that is constructed such that it only contains the “alphabet” nodes, and edges join the
“alphabet” nodes because they share a connection to a “numeric” node. The red edge on the right is
basically the red path traced on the left.

Computing graph projections


How does one compute graph projections using NetworkX? Turns out, NetworkX has a bipartite
submodule, which gives us all of the facilities that we need to interact with bipartite algorithms.
First of all, we need to check that the graph is indeed a bipartite graph. NetworkX provides a function
for us to do so:

1 from networkx.algorithms import bipartite


2
3 bipartite.is_bipartite(G)

1 True

Now that we’ve confirmed that the graph is indeed bipartite, we can use the NetworkX bipartite
submodule functions to generate the bipartite projection onto one of the node partitions.
First off, we need to extract nodes from a particular partition.
person_nodes = extract_partition_nodes(G, "person")
crime_nodes = extract_partition_nodes(G, "crime")

Next, we can compute the projection:

person_graph = bipartite.projected_graph(G, person_nodes)
crime_graph = bipartite.projected_graph(G, crime_nodes)

And with that, we have our projected graphs! Go ahead and inspect them:

list(person_graph.edges(data=True))[0:5]

[('p1', 'p336', {}),
 ('p1', 'p756', {}),
 ('p1', 'p694', {}),
 ('p1', 'p93', {}),
 ('p2', 'p39', {})]

list(crime_graph.edges(data=True))[0:5]

[('c1', 'c4', {}),
 ('c1', 'c3', {}),
 ('c1', 'c2', {}),
 ('c2', 'c4', {}),
 ('c2', 'c3', {})]

Now, what is the interpretation of these projected graphs?

• For person_graph, we have found individuals who are linked by shared participation (whether
witness or suspect) in a crime.
• For crime_graph, we have found crimes that are linked by shared involvement by people.

Just from these projected graphs, we can already find out pretty useful information. Let's use an exercise that
leverages what you already know to extract useful information from the projected graph.

Exercise: find the crime(s) that have the most shared connections
with other crimes
Find crimes that are most similar to one another on the basis of the number of shared
connections to individuals.

Hint: This is a degree centrality problem!


import pandas as pd

def find_most_similar_crimes(cG: nx.Graph):
    """
    Find the crimes that are most similar to other crimes.
    """
    dcs = ______________
    return ___________________


from nams.solutions.bipartite import find_most_similar_crimes
find_most_similar_crimes(crime_graph)

c110    0.136364
c47     0.070909
c23     0.070909
c95     0.063636
c14     0.061818
c352    0.060000
c432    0.060000
c160    0.058182
c417    0.058182
c525    0.058182
dtype: float64

Exercise: find the individual(s) that have the most shared connections with other individuals

Now do the analogous thing for individuals!

def find_most_similar_people(pG: nx.Graph):
    """
    Find the persons that are most similar to other persons.
    """
    dcs = ______________
    return ___________________


from nams.solutions.bipartite import find_most_similar_people
find_most_similar_people(person_graph)
p425    0.061594
p2      0.057971
p356    0.053140
p56     0.039855
p695    0.039855
p497    0.036232
p715    0.035024
p10     0.033816
p815    0.032609
p74     0.030193
dtype: float64

Weighted Projection
Though we were able to find out which nodes were connected with one another, we did not record
in the resulting projected graph the strength by which the two nodes were connected. To preserve
this information, we need another function:

weighted_person_graph = bipartite.weighted_projected_graph(G, person_nodes)
list(weighted_person_graph.edges(data=True))[0:5]

[('p1', 'p336', {'weight': 1}),
 ('p1', 'p756', {'weight': 1}),
 ('p1', 'p694', {'weight': 1}),
 ('p1', 'p93', {'weight': 1}),
 ('p2', 'p39', {'weight': 1})]

Exercise: Find the people that can help with investigating a crime's person

Let's pretend that we are a detective trying to solve a crime, and that we now need to find
individuals who were not implicated in the exact same crime as a given individual, but who
might be able to give us information about that individual because they were implicated in other
crimes with that individual.

Implement a function that takes in a bipartite graph G, a string person and a string
crime, and returns a list of other persons that were not implicated in the crime, but were
connected to the person via other crimes. It should return a ranked list, based on the
number of shared crimes (from highest to lowest) because the ranking will help with
triage.
list(G.neighbors('p1'))

['c1', 'c2', 'c3', 'c4']

def find_connected_persons(G, person, crime):
    # Step 0: Check that the given "person" and "crime" are connected.
    if _____________________________:
        raise ValueError(f"Graph does not have a connection between {person} and {crime}!")

    # Step 1: calculate weighted projection for person nodes.
    person_nodes = ____________________________________
    person_graph = bipartite.________________________(_, ____________)

    # Step 2: Find neighbors of the given `person` node in projected graph.
    candidate_neighbors = ___________________________________

    # Step 3: Remove candidate neighbors from the set if they are implicated in the given crime.
    for p in G.neighbors(crime):
        if ________________________:
            _____________________________

    # Step 4: Rank-order the candidate neighbors by number of shared connections.
    _________ = []
    ## You might need a for-loop here
    return pd.DataFrame(__________).sort_values("________", ascending=False)


from nams.solutions.bipartite import find_connected_persons
print(find_connected_persons(G, 'p2', 'c10').to_markdown())
|    | node | weight |
|---:|:-----|-------:|
| 27 | p67  | 4 |
| 1  | p361 | 2 |
| 29 | p338 | 2 |
| 14 | p356 | 2 |
| 25 | p223 | 1 |
| 26 | p608 | 1 |
| 28 | p578 | 1 |
| 30 | p304 | 1 |
| 31 | p186 | 1 |
| 32 | p661 | 1 |
| 33 | p781 | 1 |
| 34 | p820 | 1 |
| 0  | p39  | 1 |
| 36 | p401 | 1 |
| 37 | p710 | 1 |
| 38 | p300 | 1 |
| 39 | p287 | 1 |
| 40 | p309 | 1 |
| 41 | p5   | 1 |
| 42 | p587 | 1 |
| 43 | p563 | 1 |
| 44 | p806 | 1 |
| 45 | p286 | 1 |
| 35 | p320 | 1 |
| 23 | p528 | 1 |
| 24 | p360 | 1 |
| 22 | p439 | 1 |
| 2  | p499 | 1 |
| 3  | p449 | 1 |
| 4  | p4   | 1 |
| 5  | p471 | 1 |
| 6  | p48  | 1 |
| 7  | p90  | 1 |
| 8  | p475 | 1 |
| 9  | p498 | 1 |
| 10 | p690 | 1 |
| 11 | p620 | 1 |
| 12 | p603 | 1 |
| 13 | p660 | 1 |
| 15 | p768 | 1 |
| 16 | p782 | 1 |
| 17 | p495 | 1 |
| 18 | p305 | 1 |
| 19 | p665 | 1 |
| 20 | p773 | 1 |
| 21 | p211 | 1 |
| 46 | p716 | 1 |

Degree Centrality
The degree centrality metric is something we can calculate for bipartite graphs. Recall that the degree
centrality metric is the number of neighbors of a node divided by the total number of possible
neighbors.
In a unipartite graph, the denominator can be the total number of nodes less one (if self-loops are
not allowed) or simply the total number of nodes (if self-loops are allowed).

Exercise: What is the denominator for bipartite graphs?


Think about it for a moment, then write down your answer.

from nams.solutions.bipartite import bipartite_degree_centrality_denominator
from nams.functions import render_html

bipartite_degree_centrality_denominator()

'The total number of neighbors that a node can _possibly_ have is the number of nodes in the other partition. This comes naturally from the definition of a bipartite graph, where nodes can _only_ be connected to nodes in the other partition.'
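To make that answer concrete, here is a minimal sketch that computes the bipartite degree centrality of a single node by hand, reusing the person_nodes and crime_nodes extracted earlier (the node 'p1' is just an example):

# For a "person" node, the number of *possible* neighbors is the
# number of "crime" nodes, so that is the denominator.
manual_dc = G.degree("p1") / len(crime_nodes)
manual_dc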

Exercise: Which persons are implicated in the most number of crimes?

Find the persons (singular or plural) who are connected to the most number of crimes.

To do so, you will need to use nx.bipartite.degree_centrality, rather than the regular nx.degree_centrality function.

nx.bipartite.degree_centrality requires that you pass in a node set from one of the partitions
so that it can correctly partition nodes on the other set. What is returned, though, is the degree
centrality for nodes in both sets. Here is an example to show you how the function is used:

dcs = nx.bipartite.degree_centrality(my_graph, nodes_from_one_partition)


def find_most_crime_person(G, person_nodes):
    dcs = __________________________
    return ___________________________

from nams.solutions.bipartite import find_most_crime_person
find_most_crime_person(G, person_nodes)

'p815'

Solutions
Here are the solutions to the exercises above.

from nams.solutions import bipartite
import inspect

print(inspect.getsource(bipartite))

import networkx as nx
import pandas as pd
from nams.functions import render_html


def extract_partition_nodes(G: nx.Graph, partition: str):
    nodeset = [n for n, d in G.nodes(data=True) if d["bipartite"] == partition]
    if len(nodeset) == 0:
        raise Exception(f"No nodes exist in the partition {partition}!")
    return nodeset


def bipartite_example_graph():
    bG = nx.Graph()
    bG.add_nodes_from("abcd", bipartite="letters")
    bG.add_nodes_from(range(1, 4), bipartite="numbers")
    bG.add_edges_from([("a", 1), ("b", 1), ("b", 3), ("c", 2), ("c", 3), ("d", 1)])

    return bG


def draw_bipartite_graph_example():
    """Draw an example bipartite graph and its corresponding projection."""
    import matplotlib.pyplot as plt
    import nxviz as nv
    from nxviz import annotate, plots, highlights

    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
    plt.sca(ax[0])
    bG = bipartite_example_graph()
    nv.parallel(bG, group_by="bipartite", node_color_by="bipartite")
    annotate.parallel_group(bG, group_by="bipartite", y_offset=-0.5)
    highlights.parallel_edge(bG, "a", 1, group_by="bipartite")
    highlights.parallel_edge(bG, "b", 1, group_by="bipartite")

    pG = nx.bipartite.projected_graph(bG, nodes=list("abcd"))
    plt.sca(ax[1])
    nv.arc(pG)
    highlights.arc_edge(pG, "a", "b")
    return ax


def find_most_similar_crimes(cG: nx.Graph):
    """
    Find the crimes that are most similar to other crimes.
    """
    dcs = pd.Series(nx.degree_centrality(cG))
    return dcs.sort_values(ascending=False).head(10)


def find_most_similar_people(pG: nx.Graph):
    """
    Find the persons that are most similar to other persons.
    """
    dcs = pd.Series(nx.degree_centrality(pG))
    return dcs.sort_values(ascending=False).head(10)


def find_connected_persons(G, person, crime):
    """Answer to exercise on people implicated in crimes"""
    # Step 0: Check that the given "person" and "crime" are connected.
    if not G.has_edge(person, crime):
        raise ValueError(
            f"Graph does not have a connection between {person} and {crime}!"
        )

    # Step 1: calculate weighted projection for person nodes.
    person_nodes = extract_partition_nodes(G, "person")
    person_graph = nx.bipartite.weighted_projected_graph(G, person_nodes)

    # Step 2: Find neighbors of the given `person` node in projected graph.
    candidate_neighbors = set(person_graph.neighbors(person))

    # Step 3: Remove candidate neighbors from the set if they are implicated in the given crime.
    for p in G.neighbors(crime):
        if p in candidate_neighbors:
            candidate_neighbors.remove(p)

    # Step 4: Rank-order the candidate neighbors by number of shared connections.
    data = []
    for nbr in candidate_neighbors:
        data.append(dict(node=nbr, weight=person_graph.edges[person, nbr]["weight"]))
    return pd.DataFrame(data).sort_values("weight", ascending=False)


def bipartite_degree_centrality_denominator():
    """Answer to bipartite graph denominator for degree centrality."""

    ans = """
The total number of neighbors that a node can _possibly_ have
is the number of nodes in the other partition.
This comes naturally from the definition of a bipartite graph,
where nodes can _only_ be connected to nodes in the other partition.
"""
    return ans


def find_most_crime_person(G, person_nodes):
    dcs = (
        pd.Series(nx.bipartite.degree_centrality(G, person_nodes))
        .sort_values(ascending=False)
        .to_frame()
    )
    return dcs.reset_index().query("index.str.contains('p')").iloc[0]["index"]

Linear Algebra
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Introduction
from IPython.display import YouTubeVideo

YouTubeVideo(id="uTHihJiRELc", width="100%")

(embedded video: https://www.youtube.com/embed/uTHihJiRELc)
In this chapter, we will look at the relationship between graphs and linear algebra.
The deep connection between these two topics is super interesting, and I’d like to show it to you
through an exploration of three topics:

1. Path finding
2. Message passing
3. Bipartite projections

Preliminaries
Before we go deep into the linear algebra piece though, we have to first make sure some ideas are
clear.
The most important thing that we need when treating graphs in linear algebra form is the adjacency
matrix. For example, for four nodes joined in a chain:
import networkx as nx

nodes = list(range(4))
G1 = nx.Graph()
G1.add_nodes_from(nodes)
G1.add_edges_from(zip(nodes, nodes[1:]))

we can visualize the graph:

nx.draw(G1, with_labels=True)

(figure: the four-node chain graph)

and we can visualize its adjacency matrix:

import nxviz as nv

m = nv.matrix(G1)

(figure: matrix plot of G1's adjacency structure)

and we can obtain the adjacency matrix as a NumPy array:

A1 = nx.to_numpy_array(G1, nodelist=sorted(G1.nodes()))
A1

array([[0., 1., 0., 0.],
       [1., 0., 1., 0.],
       [0., 1., 0., 1.],
       [0., 0., 1., 0.]])

Symmetry
Remember that for an undirected graph, the adjacency matrix will be symmetric about the diagonal,
while for a directed graph, the adjacency matrix will be asymmetric.
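A one-line check of this claim on the chain graph above (a minimal sketch):

import numpy as np

# For the undirected chain, the adjacency matrix equals its transpose.
np.allclose(A1, A1.T)  # True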

Path finding
In the Paths chapter, we can use the breadth-first search algorithm to find a shortest path between
any two nodes.
As it turns out, using adjacency matrices, we can answer a related question, which is how many
paths exist of length K between two nodes.
To see how, we need to see the relationship between matrix powers and graph path lengths.
Let’s take the adjacency matrix above, raise it to the second power, and see what it tells us.

import numpy as np

np.linalg.matrix_power(A1, 2)

array([[1., 0., 1., 0.],
       [0., 2., 0., 1.],
       [1., 0., 2., 0.],
       [0., 1., 0., 1.]])

Exercise: adjacency matrix power?


What do you think the values in the adjacency matrix are related to? If studying in a
group, discuss with your neighbors; if working on this alone, write down your thoughts.
from nams.solutions.linalg import adjacency_matrix_power
from nams.functions import render_html

adjacency_matrix_power()

1 '\n1. The diagonals equal to the degree of each node.\n1. The off-diagonals also con\
2 tain values,\nwhich correspond to the number of paths that exist of length 2\nbetwee\
3 n the node on the row axis and the node on the column axis.\n\nIn fact, the diagonal\
4 also takes on the same meaning!\n\nFor the terminal nodes, there is only 1 path\nfr\
5 om itself back to itself,\nwhile for the middle nodes, there are 2 paths\nfrom itsel\
6 f back to itself!\n'
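If you want to convince yourself numerically, here is a small brute-force sketch that counts the length-2 walks from node 0 to node 2 by enumerating intermediate nodes, and compares the count against the corresponding entry of the squared adjacency matrix:

# Every length-2 walk from 0 to 2 passes through exactly one
# intermediate node, so enumerate candidates and count them.
count = sum(
    1
    for mid in G1.nodes()
    if G1.has_edge(0, mid) and G1.has_edge(mid, 2)
)
count == np.linalg.matrix_power(A1, 2)[0, 2]  # True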

Higher matrix powers


The semantic meaning of adjacency matrix powers is preserved even if we go to higher powers. For
example, if we go to the 3rd matrix power:

np.linalg.matrix_power(A1, 3)

array([[0., 2., 0., 1.],
       [2., 0., 3., 0.],
       [0., 3., 0., 2.],
       [1., 0., 2., 0.]])

You should be able to convince yourself that:

1. There’s no way to go from a node back to itself in 3 steps, thus explaining the diagonals, and
2. The off-diagonals take on the correct values when you think about them in terms of “ways to
go from one node to another”.

With directed graphs?


Does the “number of steps” interpretation hold with directed graphs? Yes it does! Let’s see it in
action.
G2 = nx.DiGraph()
G2.add_nodes_from(nodes)
G2.add_edges_from(zip(nodes, nodes[1:]))
nx.draw(G2, with_labels=True)

(figure: the four-node directed chain graph)

Exercise: directed graph matrix power


Convince yourself that the resulting adjacency matrix power contains the same semantic
meaning as that for an undirected graph, that is, the number of ways to go from “row”
node to “column” node in K steps. (I have provided three different matrix powers for you.)

A2 = nx.to_numpy_array(G2)
np.linalg.matrix_power(A2, 2)

array([[0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

np.linalg.matrix_power(A2, 3)

array([[0., 0., 0., 1.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

np.linalg.matrix_power(A2, 4)

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
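One way to convince yourself of all those zeros (a small sketch): in this 4-node directed chain, the longest walk possible has length 3, so from the 4th power onwards there are simply no walks left to count.

# The adjacency matrix of a directed chain is nilpotent:
# by the 4th power, every entry is zero.
np.linalg.matrix_power(A2, 4).any()  # False: no walks of length 4 exist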

Message Passing
Let’s now dive into the second topic here, that of message passing.
To show how message passing works on a graph, let’s start with the directed linear chain, as this
will make things easier to understand.

“Message” representation in matrix form


Our graph adjacency matrix contains nodes ordered in a particular fashion along the rows and
columns. We can also create a “message” matrix M , using the same ordering of nodes along the
rows, with columns instead representing a “message” that is intended to be “passed” from one node
to another:

M = np.array([1, 0, 0, 0])
M

array([1, 0, 0, 0])

Notice where the position of the value 1 is: at the first node.
If we take M and matrix multiply it against A2, let's see what we get:

M @ A2

array([0., 1., 0., 0.])

The message has been passed onto the next node! And if we pass the message one more time:

M @ A2 @ A2

array([0., 0., 1., 0.])

Now, the message lies on the 3rd node!


We can make an animation to visualize this more clearly. There are comments in the code to explain
what’s going on!

def propagate(G, msg, n_frames):
    """
    Computes the node values based on propagation.

    Intended to be used before or when being passed into the
    anim() function (defined below).

    :param G: A NetworkX Graph.
    :param msg: The initial state of the message.
    :returns: A list of 1/0 representing message status at
        each node.
    """
    # Initialize a list to store message states at each timestep.
    msg_states = []

    # Set a variable `new_msg` to be the initial message state.
    new_msg = msg

    # Get the adjacency matrix of the graph G.
    A = nx.to_numpy_array(G)

    # Perform message passing at each time step
    for i in range(n_frames):
        msg_states.append(new_msg)
        new_msg = new_msg @ A

    # Return the message states.
    return msg_states

from IPython.display import HTML
import matplotlib.pyplot as plt
from matplotlib import animation

def update_func(step, nodes, colors):
    """
    The update function for each animation time step.

    :param step: Passed in from matplotlib's FuncAnimation. Must
        be present in the function signature.
    :param nodes: The PathCollection returned from
        nx.draw_networkx_nodes().
    :param colors: A list of pre-computed colors.
    """
    nodes.set_array(colors[step].ravel())
    return nodes

def anim(G, initial_state, n_frames=4):
    """
    Animation function!
    """
    # First, pre-compute the message passing states over all frames.
    colors = propagate(G, initial_state, n_frames)
    # Instantiate a figure
    fig = plt.figure()
    # Precompute node positions so that they stay fixed over the entire animation
    pos = nx.kamada_kawai_layout(G)
    # Draw nodes to screen
    nodes = nx.draw_networkx_nodes(G, pos=pos, node_color=colors[0].ravel(), node_size=20)
    # Draw edges to screen
    ax = nx.draw_networkx_edges(G, pos)
    # Finally, return the animation through matplotlib.
    return animation.FuncAnimation(fig, update_func, frames=range(n_frames), fargs=(nodes, colors))


# Initialize the message
msg = np.zeros(len(G2))
msg[0] = 1

# Animate the graph with message propagation.
# HTML(anim(G2, msg, n_frames=4).to_html5_video())

Bipartite Graphs & Matrices


The section on message passing above assumed unipartite graphs, or at least graphs for which
messages can be meaningfully passed between nodes.
In this section, we will look at bipartite graphs.
Recall from before the definition of a bipartite graph:

• Nodes are separated into two partitions (hence 'bi'-'partite').
• Edges can only occur between nodes of different partitions.

Bipartite graphs have a natural matrix representation, known as the biadjacency matrix. Nodes on
one partition are the rows, and nodes on the other partition are the columns.
NetworkX’s bipartite module provides a function for computing the biadjacency matrix of a
bipartite graph.
Let’s start by looking at a toy bipartite graph, a “customer-product” purchase record graph, with 4
products and 3 customers. The matrix representation might be as follows:

# Rows = customers, columns = products.
# 1 = customer purchased product, 0 = customer did not purchase product.
cp_mat = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [1, 1, 1, 1]])

From this "bi-adjacency" matrix, one can compute the projection onto the customers by matrix
multiplying the matrix with its transpose.
c_mat = cp_mat @ cp_mat.T  # c_mat means "customer matrix"
c_mat

array([[1, 0, 1],
       [0, 2, 2],
       [1, 2, 4]])

What we get is the connectivity matrix of the customers, based on shared purchases. The diagonals
are the degree of the customers in the original graph, i.e. the number of purchases they originally
made, and the off-diagonals are the connectivity matrix, based on shared products.
To get the products matrix, we make the transposed matrix the left side of the matrix multiplication.

p_mat = cp_mat.T @ cp_mat  # p_mat means "product matrix"
p_mat

array([[2, 1, 2, 1],
       [1, 2, 1, 1],
       [2, 1, 2, 1],
       [1, 1, 1, 1]])

You may now try to convince yourself that the diagonals are the number of customers who
purchased that product, and the off-diagonals form the connectivity matrix of the products, weighted
by the number of customers who purchased both products.
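A quick numerical check of that interpretation (a sketch): the diagonal of p_mat should equal the column sums of cp_mat, i.e. the number of customers who purchased each product.

# Diagonal of the product projection equals per-product purchase counts.
np.allclose(p_mat.diagonal(), cp_mat.sum(axis=0))  # True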

Exercises
In the following exercises, you will now play with a customer-product graph from Amazon. This
dataset was downloaded from UCSD’s Julian McAuley’s website¹⁷, and corresponds to the digital
music dataset.
This is a bipartite graph. The two partitions are:

• customers: The customers that were doing the reviews.
• products: The music that was being reviewed.

In the original dataset (see the original JSON in the datasets/ directory), they are referred to as:

• customers: reviewerID
• products: asin

¹⁷http://jmcauley.ucsd.edu/data/amazon/
from nams import load_data as cf

G_amzn = cf.load_amazon_reviews()


Remember that with bipartite graphs, it is useful to obtain nodes from one of the partitions.

from nams.solutions.bipartite import extract_partition_nodes

customer_nodes = extract_partition_nodes(G_amzn, "customer")
mat = nx.bipartite.biadjacency_matrix(G_amzn, row_order=customer_nodes)

You'll notice that this matrix is extremely large! There are 5541 customers and 3568 products, for a
total matrix size of 5541 × 3568 = 19770288, but it is stored in a sparse format because only 64706
elements are filled in.

mat

<5541x3568 sparse matrix of type '<class 'numpy.int64'>'
    with 64706 stored elements in Compressed Sparse Row format>

Example: finding customers who reviewed the most number of music items

Let's find out which customers reviewed the most number of music items. To do so, you can break
the problem into a few steps.
First off, we compute the customer projection using matrix operations.

customer_mat = mat @ mat.T

Next, get the diagonals of the customer-customer matrix. Recall here that in customer_mat, the
diagonals correspond to the degree of the customer nodes in the bipartite matrix.
SciPy sparse matrices provide a .diagonal() method that returns the diagonal elements.
# Get the diagonal.
degrees = customer_mat.diagonal()

Finally, find the index of the customer that has the highest degree.

cust_idx = np.argmax(degrees)
cust_idx

294

We can verify this independently by sorting the customer nodes by degree.

import pandas as pd
import janitor

# There's some pandas-fu we need to use to get this correct.
deg = (
    pd.Series(dict(nx.degree(G_amzn, customer_nodes)))
    .to_frame()
    .reset_index()
    .rename_column("index", "customer")
    .rename_column(0, "num_reviews")
    .sort_values('num_reviews', ascending=False)
)
print(deg.head().to_markdown())

|     | customer       | num_reviews |
|----:|:---------------|------------:|
| 294 | A9Q28YTLYREO7  | 578 |
| 86  | A3HU0B9XUEVHIM | 375 |
| 77  | A3KJ6JAZPH382D | 301 |
| 307 | A3C6ZCBUNXUT7V | 261 |
| 218 | A8IFUOL8S9BZC  | 256 |

Indeed, customer 294 was the one who had the most number of reviews!

Example: finding similar customers


Let's now also compute which two customers are similar, based on shared reviews. Doing so involves
the following steps:

1. We construct a sparse matrix consisting of only the diagonals. scipy.sparse.diags(elements) will construct a sparse diagonal matrix based on the elements inside elements.
2. Subtract the diagonals from the customer matrix projection. This yields the customer-customer similarity matrix, which should only consist of the off-diagonal elements of the customer matrix projection.
3. Finally, get the indices where the weight (the number of shared reviews between the customers) is highest. (This code is provided for you.)

import scipy.sparse as sp

# Construct diagonal elements.
customer_diags = sp.diags(degrees)
# Subtract off-diagonals.
off_diagonals = customer_mat - customer_diags
# Compute index of most similar individuals.
np.unravel_index(np.argmax(off_diagonals), customer_mat.shape)

(294, 86)

Performance: Object vs. Matrices


Finally, to motivate why you might want to use matrices rather than graph objects to compute some
of these statistics, let’s time the two ways of getting to the same answer.

Objects
Let’s first use NetworkX’s built-in machinery to find customers that are most similar.

from time import time

start = time()

# Compute the projection
G_cust = nx.bipartite.weighted_projected_graph(G_amzn, customer_nodes)

# Identify the most similar customers
most_similar_customers = sorted(G_cust.edges(data=True), key=lambda x: x[2]['weight'], reverse=True)[0]

end = time()
print(f'{end - start:.3f} seconds')
print(f'Most similar customers: {most_similar_customers}')
15.934 seconds
Most similar customers: ('A3HU0B9XUEVHIM', 'A9Q28YTLYREO7', {'weight': 154})

Matrices
Now, let’s implement the same thing in matrix form.

start = time()

# Compute the projection using matrices
mat = nx.bipartite.matrix.biadjacency_matrix(G_amzn, customer_nodes)
cust_mat = mat @ mat.T

# Identify the most similar customers
degrees = cust_mat.diagonal()
customer_diags = sp.diags(degrees)
off_diagonals = cust_mat - customer_diags
c1, c2 = np.unravel_index(np.argmax(off_diagonals), cust_mat.shape)

end = time()
print(f'{end - start:.3f} seconds')
print(f'Most similar customers: {customer_nodes[c1]}, {customer_nodes[c2]}, {cust_mat[c1, c2]}')

0.471 seconds
Most similar customers: A9Q28YTLYREO7, A3HU0B9XUEVHIM, 154

On a modern PC, the matrix form should be about 10-50X faster than the object-oriented form. (The
web server that is used to build the book might not necessarily have the software stack to show this,
so the times reported above might not reflect the expected speedup.) I'd encourage you to fire up a
Binder session or clone the book locally to test out the code yourself.
You may notice that it’s much easier to read the “objects” code, but the matrix code way outperforms
the object code. This tradeoff is common in computing, and shouldn’t surprise you. That said, the
speed gain alone is a great reason to use matrices!

Acceleration on a GPU
If your appetite has been whetted for even more acceleration and you have a GPU on your daily
compute, then you're very much in luck!

The RAPIDS.AI¹⁸ project has a package called cuGraph¹⁹, which provides GPU-accelerated graph
algorithms. As of release 0.16.0, all cuGraph algorithms are able to accept NetworkX graph
objects! This came about through online conversations on GitHub and Twitter, which, for us
personally, speaks volumes about the power of open source projects!
Because cuGraph does presume that you have access to a GPU, and because we assume most readers
of this book might not have access to one easily, we’ll delegate teaching how to install and use
cuGraph to the cuGraph devs and their documentation²⁰. Nonetheless, if you do have the ability to
install and use the RAPIDS stack, definitely check it out!
¹⁸https://rapids.ai
¹⁹https://github.com/rapidsai/cugraph
²⁰https://docs.rapids.ai/api/cugraph/stable/api.html
Statistical Inference
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Introduction
from IPython.display import YouTubeVideo

YouTubeVideo(id="P-0CJpO3spg", width="100%")

(embedded video: https://www.youtube.com/embed/P-0CJpO3spg)
In this chapter, we are going to take a look at how to perform statistical inference on graphs.

Statistics refresher
Before we can proceed with statistical inference on graphs, we must first refresh ourselves with
some ideas from the world of statistics. Otherwise, the methods that we will end up using may seem
a tad weird, and hence difficult to follow along.
To review statistical ideas, let’s set up a few statements and explore what they mean.

We are concerned with models of randomness


As with all things statistics, we are concerned with models of randomness. Here, probability
distributions give us a way to think about random events and how to assign credibility points to
them.

In an abstract fashion…
The supremely abstract way of thinking about a probability distribution is that it is the space of all
possibilities of “stuff” with different credibility points distributed amongst each possible “thing”.

More concretely: the coin flip


A more concrete example is to consider the coin flip. Here, the space of all possibilities of “stuff” is
the set of “heads” and “tails”. If we have a fair coin, then we have 0.5 credibility points distributed
to each of “heads” and “tails”.

Another example: dice rolls


Another concrete example is to consider the six-sided die. Here, the space of all possibilities of
"stuff" is the set of numbers in the range [1, 6]. If we have a fair die, then we have 1/6 credibility
points assigned to each of the numbers. (An unfair die will have an unequal distribution of credibility
points across its faces.)

A graph-based example: social networks


If we receive an undirected social network graph with 5 nodes and 6 edges, we have to keep in mind
that this graph with 6 edges was merely one of $\binom{15}{6}$ ways to construct 5-node, 6-edge graphs. (15
comes up because there are 15 edges that can be constructed in a 5-node undirected graph.)

Hypothesis Testing
A commonplace task in statistical inferences is calculating the probability of observing a value or
something more extreme under an assumed “null” model of reality. This is what we commonly call
“hypothesis testing”, and where the oft-misunderstood term “p-value” shows up.

Hypothesis testing in coin flips, by simulation


As an example, hypothesis testing in coin flips follows this logic:

• I observe that 8 out of 10 coin tosses give me heads, giving me a probability of heads p = 0.8 (a
summary statistic).
• Under a “null distribution” of a fair coin, I simulate the distribution of probability of heads (the
summary statistic) that I would get from 10 coin tosses.
• Finally, I use that distribution to calculate the probability of observing p = 0.8 or more extreme.
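That logic translates directly into a few lines of simulation. Here is a minimal sketch (the seed is arbitrary, and the exact p-value will wobble slightly from run to run):

import numpy as np

rng = np.random.default_rng(42)
# Null model: 10 tosses of a fair coin, simulated many times over.
null_p_heads = rng.binomial(n=10, p=0.5, size=10_000) / 10
# Probability of observing p = 0.8 or something more extreme under the null.
p_value = (null_p_heads >= 0.8).mean()
p_value  # roughly 0.05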

Hypothesis testing in graphs


The same protocol applies when we perform hypothesis testing on graphs.
Firstly, we calculate a summary statistic that describes our graph.

Secondly, we propose a null graph model, and calculate our summary statistic under simulated
versions of that null graph model.
Thirdly, we look at the probability of observing the summary statistic value that we calculated in
step 1 or more extreme, under the assumed graph null model distribution.

Stochastic graph creation models


Since we are going to be dealing with models of randomness in graphs, let’s take a look at some
examples.

Erdos-Renyi (a.k.a. “binomial”) graph


One easy one to study is the Erdos-Renyi graph, also known as the "binomial" graph.
The data generation story here is that we instantiate an undirected graph with n nodes, giving $\frac{n^2 - n}{2}$
possible edges. Each edge has a probability p of being created.

import networkx as nx

G_er = nx.erdos_renyi_graph(n=30, p=0.2)

nx.draw(G_er)

(figure: an Erdos-Renyi graph with n=30, p=0.2)

You can verify that the number of edges is approximately 20% of $\frac{30^2 - 30}{2} = 435$:

len(G_er.edges())

90

len(G_er.edges()) / 435

0.20689655172413793

We can also look at the degree distribution:



import pandas as pd
from nams.functions import ecdf
import matplotlib.pyplot as plt

x, y = ecdf(pd.Series(dict(nx.degree(G_er))))
plt.scatter(x, y)

<matplotlib.collections.PathCollection at 0x7f0d7c4ff880>

(figure: ECDF of the Erdos-Renyi graph's degree distribution)

Barabasi-Albert Graph
The data generating story of this graph generator is essentially that nodes that have lots of edges
preferentially get new edges attached onto them. This is what we call a “preferential attachment”
process.

G_ba = nx.barabasi_albert_graph(n=30, m=3)
nx.draw(G_ba)

(figure: a Barabasi-Albert graph with n=30, m=3)

len(G_ba.edges())

81

And the degree distribution:

x, y = ecdf(pd.Series(dict(nx.degree(G_ba))))
plt.scatter(x, y)

<matplotlib.collections.PathCollection at 0x7f0d792097f0>

(figure: ECDF of the Barabasi-Albert graph's degree distribution)

You can see that even though the number of edges in the two graphs is similar, their degree
distributions are wildly different.

Load Data
For this notebook, we are going to look at a protein-protein interaction network, and test the
hypothesis that this network was not generated by the data generating process described by an
Erdos-Renyi graph.
Let’s load a protein-protein interaction network dataset²¹.

This undirected network contains protein interactions contained in yeast. Research
showed that proteins with a high degree were more important for the survival of the yeast
than others. A node represents a protein and an edge represents a metabolic interaction
between two proteins. The network contains loops.

²¹http://konect.uni-koblenz.de/networks/moreno_propro
from nams import load_data as cf

G = cf.load_propro_network()
for n, d in G.nodes(data=True):
    G.nodes[n]["degree"] = G.degree(n)

As is always the case, let’s make sure we know some basic stats of the graph.

len(G.nodes())

1870

len(G.edges())

2277

Let’s also examine the degree distribution of the graph.

x, y = ecdf(pd.Series(dict(nx.degree(G))))
plt.scatter(x, y)

<matplotlib.collections.PathCollection at 0x7f0d78fc6f70>

(figure: ECDF of the protein-protein interaction network's degree distribution)

Finally, we should visualize the graph to get a feel for it.



import nxviz as nv
from nxviz import annotate

nv.circos(G, sort_by="degree", node_color_by="degree", node_aes_kwargs={"size_scale": 10})
annotate.node_colormapping(G, color_by="degree")

(figure: circos plot of the protein-protein interaction network, with nodes colored by degree)

One thing we might infer from this visualization is that the vast majority of nodes have a very small
degree, while a very small number of nodes have a high degree. That would prompt us to think:
what process could be responsible for generating this graph?
Inferring Graph Generating Model

Given a graph dataset, how do we identify which data generating model provides the best fit?
One way to do this is to compare characteristics of a graph generating model against the characteristics of the graph. The logic here is that if we have a good graph generating model for the data, we
should, in theory, observe the observed graph's characteristics in the graphs generated by the graph
generating model.

Comparison of degree distribution


Let’s compare the degree distribution between the data, a few Erdos-Renyi graphs, and a few
Barabasi-Albert graphs.

Comparison with Barabasi-Albert graphs


1 from ipywidgets import interact, IntSlider
2
3 m = IntSlider(value=2, min=1, max=10)
4
5 @interact(m=m)
6 def compare_barabasi_albert_graph(m):
7 fig, ax = plt.subplots()
8 G_ba = nx.barabasi_albert_graph(n=len(G.nodes()), m=m)
9 x, y = ecdf(pd.Series(dict(nx.degree(G_ba))))
10 ax.scatter(x, y, label="Barabasi-Albert Graph")
11
12 x, y = ecdf(pd.Series(dict(nx.degree(G))))
13 ax.scatter(x, y, label="Protein Interaction Network")
14 ax.legend()

(interactive widget: slider for m, comparing Barabasi-Albert degree ECDFs against the data)

Comparison with Erdos-Renyi graphs


from ipywidgets import FloatSlider

p = FloatSlider(value=0.1, min=0, max=0.1, step=0.001)

@interact(p=p)
def compare_erdos_renyi_graph(p):
    fig, ax = plt.subplots()
    G_er = nx.erdos_renyi_graph(n=len(G.nodes()), p=p)
    x, y = ecdf(pd.Series(dict(nx.degree(G_er))))
    ax.scatter(x, y, label="Erdos-Renyi Graph")

    x, y = ecdf(pd.Series(dict(nx.degree(G))))
    ax.scatter(x, y, label="Protein Interaction Network")
    ax.legend()
    ax.set_title(f"p={p}")

(interactive widget: slider for p, comparing Erdos-Renyi degree ECDFs against the data)

Given the degree distribution only, which model do you think better describes the generation of a
protein-protein interaction network?

Quantitative Model Comparison


Each time we plug in a value of m for the Barabasi-Albert graph model, we are using one of many
possible Barabasi-Albert graph models, each with a different m. Similarly, each time we choose a
different p for the Erdos-Renyi model, we are using one of many possible Erdos-Renyi graph models,
each with a different p.
To quantitatively compare degree distributions, we can use the Wasserstein distance²² between
them. Let's see how to implement this.

²²https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html
from scipy.stats import wasserstein_distance

def erdos_renyi_degdist(n, p):
    """Return a Pandas series of degree distribution of an Erdos-Renyi graph."""
    G = nx.erdos_renyi_graph(n=n, p=p)
    return pd.Series(dict(nx.degree(G)))

def barabasi_albert_degdist(n, m):
    """Return a Pandas series of degree distribution of a Barabasi-Albert graph."""
    G = nx.barabasi_albert_graph(n=n, m=m)
    return pd.Series(dict(nx.degree(G)))

deg = pd.Series(dict(nx.degree(G)))

er_deg = erdos_renyi_degdist(n=len(G.nodes()), p=0.001)
ba_deg = barabasi_albert_degdist(n=len(G.nodes()), m=1)
wasserstein_distance(deg, er_deg), wasserstein_distance(deg, ba_deg)

(0.792513368983957, 0.5272727272727269)

Notice that because the graphs are instantiated in a non-deterministic fashion, re-running the cell
above will give you different values for each new graph generated.
Let's now plot the Wasserstein distance to our graph data for the two particular Erdos-Renyi and
Barabasi-Albert graph models shown above.

import matplotlib.pyplot as plt
from tqdm.autonotebook import tqdm

er_dist = []
ba_dist = []
for _ in tqdm(range(100)):
    er_deg = erdos_renyi_degdist(n=len(G.nodes()), p=0.001)
    er_dist.append(wasserstein_distance(deg, er_deg))

    ba_deg = barabasi_albert_degdist(n=len(G.nodes()), m=1)
    ba_dist.append(wasserstein_distance(deg, ba_deg))

# er_degs = [erdos_renyi_degdist(n=len(G.nodes()), p=0.001) for _ in range(100)]

import seaborn as sns
import janitor


data = (
    pd.DataFrame(
        {
            "Erdos-Renyi": er_dist,
            "Barabasi-Albert": ba_dist,
        }
    )
    .melt(value_vars=["Erdos-Renyi", "Barabasi-Albert"])
    .rename_columns({"variable": "Graph Model", "value": "Wasserstein Distance"})
)
sns.swarmplot(data=data, x="Graph Model", y="Wasserstein Distance")

<AxesSubplot:xlabel='Graph Model', ylabel='Wasserstein Distance'>

(figure: swarmplot of Wasserstein distances for the two graph models)

From this, we might conclude that the Barabasi-Albert graph with m = 1 has the better fit to the
protein-protein interaction network graph.

Interpretation
That statement, accurate as it might be, still does not connect the dots to biology.
Let’s think about the generative model for this graph. The Barabasi-Albert graph gives us a model
for “rich gets richer”. Given the current state of the graph, if we want to add a new edge, we first pick
a node with probability proportional to the number of edges it already has. Then, we pick another
node with probability proportional to the number of edges that it has too. Finally, we add an edge
there. This has the effect of “enriching” nodes that have a large number of edges with more edges.
How might this connect to biology?
We can’t necessarily provide a concrete answer, but this model might help raise new hypotheses.
For example, if protein-protein interactions of the “binding” kind are driven by subdomains, then
proteins that acquire a domain through recombination may end up being able to bind to everything
else that the domain was able to. In this fashion, proteins with that particular binding domain gain
new edges more readily.
Testing these hypotheses would be a totally different matter, and at this point, I submit the above
hypothesis with a large amount of salt thrown over my shoulder. In other words, the hypothesized
mechanism could be completely wrong. However, I hope that this example illustrated that the usage
of a “graph generative model” can help us narrow down hypotheses about the observed world.
Game of Thrones
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import networkx as nx
import community
import numpy as np
import matplotlib.pyplot as plt

Introduction
In this chapter, we will use Game of Thrones as a case study to practice our newly learnt skills of
network analysis.
It is surprising, right? What is the relationship between a fantasy TV show/novel and network science
or Python (not dragons)?
If you haven't heard of Game of Thrones, then you must be really good at hiding. Game of Thrones
is a hugely popular television series by HBO based on the (also) hugely popular book series A Song
of Ice and Fire by George R.R. Martin. In this notebook, we will analyze the co-occurrence network
of the characters in the Game of Thrones books. Here, two characters are considered to co-occur if
their names appear in the vicinity of 15 words from one another in the books.
The figure below is a precursor of what we will analyse in this chapter.
(figure: the co-occurrence network of Game of Thrones characters)

The dataset is publicly available for the 5 books at https://github.com/mathbeveridge/asoiaf. These
are interaction networks, created by connecting two characters whenever their names
(or nicknames) appeared within 15 words of one another in one of the books. The edge weight
corresponds to the number of interactions.
Blog: https://networkofthrones.wordpress.com

from nams import load_data as cf

books = cf.load_game_of_thrones_data()

The resulting DataFrame books has 5 columns: Source, Target, Type, weight, and book. Source and
Target are the two nodes that are linked by an edge. As we know, a network can have directed or
undirected edges, and in this network all the edges are undirected. The weight attribute of every edge
tells us the number of interactions that the characters have had over the book, and the book column
tells us the book number.
Let's have a look at the data.
Let’s have a look at the data.

# We also add this weight_inv to our dataset.
# Why? We will discuss it in a later section.
books['weight_inv'] = 1/books.weight

print(books.head().to_markdown())

|   | Source | Target | Type | weight | book | weight_inv |
|--:|:-------|:-------|:-----|-------:|-----:|-----------:|
| 0 | Addam-Marbrand | Jaime-Lannister | Undirected | 3 | 1 | 0.333333 |
| 1 | Addam-Marbrand | Tywin-Lannister | Undirected | 6 | 1 | 0.166667 |
| 2 | Aegon-I-Targaryen | Daenerys-Targaryen | Undirected | 5 | 1 | 0.2 |
| 3 | Aegon-I-Targaryen | Eddard-Stark | Undirected | 4 | 1 | 0.25 |
| 4 | Aemon-Targaryen-(Maester-Aemon) | Alliser-Thorne | Undirected | 4 | 1 | 0.25 |

From the above data we can see that the characters Addam Marbrand and Tywin Lannister have
interacted 6 times in the first book.
We can investigate this data by using the pandas DataFrame. Let’s find all the interactions of Robb
Stark in the third book.

robbstark = (
    books.query("book == 3")
    .query("Source == 'Robb-Stark' or Target == 'Robb-Stark'")
)

print(robbstark.head().to_markdown())
|      | Source | Target | Type | weight | book | weight_inv |
|-----:|:-------|:-------|:-----|-------:|-----:|-----------:|
| 1468 | Aegon-Frey-(son-of-Stevron) | Robb-Stark | Undirected | 5 | 3 | 0.2 |
| 1582 | Arya-Stark | Robb-Stark | Undirected | 14 | 3 | 0.0714286 |
| 1604 | Balon-Greyjoy | Robb-Stark | Undirected | 6 | 3 | 0.166667 |
| 1677 | Bran-Stark | Robb-Stark | Undirected | 18 | 3 | 0.0555556 |
| 1683 | Brandon-Stark | Robb-Stark | Undirected | 3 | 3 | 0.333333 |

As you can see, this data easily translates to a network problem. Now it's time to create a network.
We create a graph for each book. It's possible to create one MultiGraph (a graph with multiple edges
between nodes) instead of 5 graphs, but it is easier to analyse and manipulate individual Graph
objects rather than a MultiGraph.

# example of creating a MultiGraph

# all_books_multigraph = nx.from_pandas_edgelist(
#            books, source='Source', target='Target',
#            edge_attr=['weight', 'book'],
#            create_using=nx.MultiGraph)

# We create a list of graph objects using
# nx.from_pandas_edgelist and specifying
# the edge attributes.

graphs = [nx.from_pandas_edgelist(
          books[books.book==i],
          source='Source', target='Target',
          edge_attr=['weight', 'weight_inv'])
          for i in range(1, 6)]

# The Graph object associated with the first book.
graphs[0]

<networkx.classes.graph.Graph at 0x7f6f1e63abe0>

# To access the relationship edges in the graph with
# the edge attribute weight data (data=True)
relationships = list(graphs[0].edges(data=True))

relationships[0:3]

[('Addam-Marbrand',
  'Jaime-Lannister',
  {'weight': 3, 'weight_inv': 0.3333333333333333}),
 ('Addam-Marbrand',
  'Tywin-Lannister',
  {'weight': 6, 'weight_inv': 0.16666666666666666}),
 ('Jaime-Lannister', 'Aerys-II-Targaryen', {'weight': 5, 'weight_inv': 0.2})]

Finding the most important node, i.e. character, in these networks

Let's use our network analysis knowledge to decrypt these Graphs that we have just created.
Is it Jon Snow, Tyrion, Daenerys, or someone else? Let's see! Network science offers us many
different metrics to measure the importance of a node in a network, as we saw in the first part of the
tutorial. Note that there is no "correct" way of calculating the most important node in a network;
every metric has a different meaning.
First, let's measure the importance of a node in a network by looking at the number of neighbors it
has, that is, the number of nodes it is connected to. For example, an influential account on Twitter,
where the follower-followee relationship forms the network, is an account which has a high number
of followers. This measure of importance is called degree centrality.
Using this measure, let's extract the top ten important characters from the first book (graphs[0])
and the fifth book (graphs[4]).
NOTE: We are using zero-indexing, and that's why the graph of the first book is accessed by
graphs[0].

# We use the in-built degree_centrality method
deg_cen_book1 = nx.degree_centrality(graphs[0])
deg_cen_book5 = nx.degree_centrality(graphs[4])

degree_centrality returns a dictionary and to access the results we can directly use the name of
the character.

deg_cen_book1['Daenerys-Targaryen']

0.11290322580645162

Top 5 important characters in the first book according to degree centrality.

# The following expression sorts the dictionary by
# degree centrality and returns the top 5 from a graph

sorted(deg_cen_book1.items(),
       key=lambda x:x[1],
       reverse=True)[0:5]

[('Eddard-Stark', 0.3548387096774194),
 ('Robert-Baratheon', 0.2688172043010753),
 ('Tyrion-Lannister', 0.24731182795698928),
 ('Catelyn-Stark', 0.23118279569892475),
 ('Jon-Snow', 0.19892473118279572)]

Top 5 important characters in the fifth book according to degree centrality.

sorted(deg_cen_book5.items(),
       key=lambda x:x[1],
       reverse=True)[0:5]

[('Jon-Snow', 0.1962025316455696),
 ('Daenerys-Targaryen', 0.18354430379746836),
 ('Stannis-Baratheon', 0.14873417721518986),
 ('Tyrion-Lannister', 0.10443037974683544),
 ('Theon-Greyjoy', 0.10443037974683544)]

To visualize the distribution of degree centrality, let's plot a histogram of its values.

plt.hist(deg_cen_book1.values(), bins=30)
plt.show()

(figure: histogram of degree centrality values in the first book)

The above plot shows something that is expected: a high portion of characters aren't connected
to a lot of other characters, while some characters are highly connected all through the network. A
close real-world example of this is a social network like Twitter, where a few people have millions
of connections (followers) but the majority of users aren't connected to that many other users. This
exponential-decay-like property resembles the power law seen in real-life networks.

# A log-log plot to show the "signature" of power law in graphs.
from collections import Counter

hist = Counter(deg_cen_book1.values())
plt.scatter(np.log2(list(hist.keys())),
            np.log2(list(hist.values())),
            alpha=0.9)
plt.show()

(figure: log-log plot of the degree centrality distribution)

Exercise
Create a new centrality measure, weighted_degree(Graph, weight), which takes in a Graph and
the weight attribute and returns a weighted degree dictionary. Weighted degree is calculated by
summing the weights of all the edges of a node. Then find the top five characters according to this
measure.
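As a sketch of what such a function might look like (the book's reference solution lives in nams.solutions.got, imported in the next cell, and may be implemented differently; the name weighted_degree_sketch is used here to avoid shadowing it):

def weighted_degree_sketch(G, weight):
    """Sum the `weight` edge attribute over each node's edges."""
    result = dict()
    for node in G.nodes():
        result[node] = sum(
            d[weight] for _, _, d in G.edges(node, data=True)
        )
    return result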

from nams.solutions.got import weighted_degree

plt.hist(list(weighted_degree(graphs[0], 'weight').values()), bins=30)
plt.show()

(figure: histogram of weighted degree values in the first book)

sorted(weighted_degree(graphs[0], 'weight').items(), key=lambda x:x[1], reverse=True)[0:5]

[('Eddard-Stark', 1284),
 ('Robert-Baratheon', 941),
 ('Jon-Snow', 784),
 ('Tyrion-Lannister', 650),
 ('Sansa-Stark', 545)]

Betweenness centrality
Let's do this for betweenness centrality and check if this makes any difference. Different centrality
methods use different measures underneath, so they find different nodes that are important in the network.
A method like betweenness centrality finds nodes which are structurally important to the
network: the ones which bind the network together.
# First check unweighted (just the structure)

sorted(nx.betweenness_centrality(graphs[0]).items(),
       key=lambda x:x[1], reverse=True)[0:10]

[('Eddard-Stark', 0.2696038913836117),
 ('Robert-Baratheon', 0.21403028397371796),
 ('Tyrion-Lannister', 0.1902124972697492),
 ('Jon-Snow', 0.17158135899829566),
 ('Catelyn-Stark', 0.1513952715347627),
 ('Daenerys-Targaryen', 0.08627015537511595),
 ('Robb-Stark', 0.07298399629664767),
 ('Drogo', 0.06481224290874964),
 ('Bran-Stark', 0.05579958811784442),
 ('Sansa-Stark', 0.03714483664326785)]

# Let's care about interactions now

sorted(nx.betweenness_centrality(graphs[0],
                                 weight='weight_inv').items(),
       key=lambda x:x[1], reverse=True)[0:10]

[('Eddard-Stark', 0.5926474861958733),
 ('Catelyn-Stark', 0.36855565242662014),
 ('Jon-Snow', 0.3514094739901191),
 ('Robert-Baratheon', 0.3329991281604185),
 ('Tyrion-Lannister', 0.27137460040685846),
 ('Daenerys-Targaryen', 0.202615518744551),
 ('Bran-Stark', 0.0945655332752107),
 ('Robb-Stark', 0.09177564661435629),
 ('Arya-Stark', 0.06939843068875327),
 ('Sansa-Stark', 0.06870095902353966)]

We can see there are some differences between the unweighted and weighted centrality measures.
Another thing to note is that we are using the weight_inv attribute instead of weight (the number
of interactions between characters). This decision is based on the way we want to assign the notion
of "importance" of a character. The basic idea behind betweenness centrality is to find nodes which
are essential to the structure of the network. As betweenness centrality computes shortest paths
underneath, in the case of weighted betweenness centrality it would end up penalising characters
with a high number of interactions. By using weight_inv we prop up the characters with high
interactions with other characters.
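To see the effect concretely, here is a toy sketch (the graph and the numbers are illustrative, not from the dataset): with weight as the path length, a heavily-interacting pair looks far apart, while with weight_inv it looks close together.

toy = nx.Graph()
toy.add_edge("A", "B", weight=50, weight_inv=1 / 50)
toy.add_edge("B", "C", weight=50, weight_inv=1 / 50)
toy.add_edge("A", "C", weight=1, weight_inv=1.0)

# Using `weight` as length, the single-interaction edge looks shortest;
# using `weight_inv`, the high-interaction route A-B-C wins (0.04 < 1.0).
print(nx.shortest_path(toy, "A", "C", weight="weight"))      # ['A', 'C']
print(nx.shortest_path(toy, "A", "C", weight="weight_inv"))  # ['A', 'B', 'C']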
PageRank
The billion-dollar algorithm, PageRank, works by counting the number and quality of links to a page
to determine a rough estimate of how important the website is. The underlying assumption is that
more important websites are likely to receive more links from other websites.
NOTE: We don't need to worry about weight and weight_inv in PageRank, as the algorithm uses
weights in the opposite sense (larger weights are better). This may seem confusing, as different
centrality measures have different definitions of weights. So it is always better to have a look at the
documentation before using weights in a centrality measure.

# by default, the weight attribute in PageRank is weight,
# so we use weight=None to find the unweighted results
sorted(nx.pagerank_numpy(graphs[0],
                         weight=None).items(),
       key=lambda x:x[1], reverse=True)[0:10]

[('Eddard-Stark', 0.04552079222830671),
 ('Tyrion-Lannister', 0.03301362462493269),
 ('Catelyn-Stark', 0.030193105286631914),
 ('Robert-Baratheon', 0.02983474222773675),
 ('Jon-Snow', 0.02683449952206619),
 ('Robb-Stark', 0.021562941297247524),
 ('Sansa-Stark', 0.020008034042864654),
 ('Bran-Stark', 0.019945786786238318),
 ('Jaime-Lannister', 0.017507847202846937),
 ('Cersei-Lannister', 0.017082604584758087)]

sorted(nx.pagerank_numpy(
       graphs[0], weight='weight').items(),
       key=lambda x:x[1], reverse=True)[0:10]

[('Eddard-Stark', 0.0723940110049824),
 ('Robert-Baratheon', 0.0485172757050994),
 ('Jon-Snow', 0.04770689062474911),
 ('Tyrion-Lannister', 0.04367437892706296),
 ('Catelyn-Stark', 0.034667034701307414),
 ('Bran-Stark', 0.029774200539800212),
 ('Robb-Stark', 0.02921618364519686),
 ('Daenerys-Targaryen', 0.027089622513021085),
 ('Sansa-Stark', 0.026961778915683125),
 ('Cersei-Lannister', 0.02163167939741897)]

Exercise

Is there a correlation between these techniques?

Find the correlation between these four techniques.

• pagerank (weight = 'weight')
• betweenness_centrality (weight = 'weight_inv')
• weighted_degree
• degree centrality

HINT: Use pandas correlation

from nams.solutions.got import correlation_centrality

print(correlation_centrality(graphs[0]).to_markdown())

|   | 0 | 1 | 2 | 3 |
|--:|--:|--:|--:|--:|
| 0 | 1 | 0.910352 | 0.992166 | 0.949307 |
| 1 | 0.910352 | 1 | 0.87924 | 0.790526 |
| 2 | 0.992166 | 0.87924 | 1 | 0.95506 |
| 3 | 0.949307 | 0.790526 | 0.95506 | 1 |

Evolution of importance of characters over the books


According to degree centrality, the most important character in the first book is Eddard Stark, but
he is not even in the top 10 of the fifth book. The importance changes over the course of the five books,
because you know, stuff happens ;)
Let's look at the evolution of degree centrality of a couple of characters like Eddard Stark, Jon Snow,
and Tyrion, who showed up in the top 10 of degree centrality in the first book.
We create a DataFrame with character columns and books as the index, where every entry is the degree
centrality of the character in that particular book, and we plot the evolution of degree centrality for Eddard
Stark, Jon Snow, and Tyrion. We can see that the importance of Eddard Stark in the network dies off,
and with Jon Snow there is a drop in the fourth book but a sudden rise in the fifth book.

1 evol = [nx.degree_centrality(graph)
2 for graph in graphs]
3 evol_df = pd.DataFrame.from_records(evol).fillna(0)
4 evol_df[['Eddard-Stark',
5 'Tyrion-Lannister',
6 'Jon-Snow']].plot()
7 plt.show()

png

set_of_char = set()
for i in range(5):
    set_of_char |= set(list(
        evol_df.T[i].sort_values(
            ascending=False)[0:5].index))
set_of_char

1 {'Arya-Stark',
2 'Brienne-of-Tarth',
3 'Catelyn-Stark',
4 'Cersei-Lannister',
5 'Daenerys-Targaryen',
6 'Eddard-Stark',
7 'Jaime-Lannister',
8 'Joffrey-Baratheon',
9 'Jon-Snow',
10 'Margaery-Tyrell',
11 'Robb-Stark',
12 'Robert-Baratheon',
13 'Sansa-Stark',
14 'Stannis-Baratheon',
15 'Theon-Greyjoy',
16 'Tyrion-Lannister'}

Exercise
Plot the evolution of betweenness centrality of the above mentioned characters over the 5 books.

1 from nams.solutions.got import evol_betweenness

1 evol_betweenness(graphs)

png

So what’s up with Stannis Baratheon?


1 sorted(nx.degree_centrality(graphs[4]).items(),
2 key=lambda x:x[1], reverse=True)[:5]

1 [('Jon-Snow', 0.1962025316455696),
2 ('Daenerys-Targaryen', 0.18354430379746836),
3 ('Stannis-Baratheon', 0.14873417721518986),
4 ('Tyrion-Lannister', 0.10443037974683544),
5 ('Theon-Greyjoy', 0.10443037974683544)]

1 sorted(nx.betweenness_centrality(graphs[4]).items(),
2 key=lambda x:x[1], reverse=True)[:5]

1 [('Stannis-Baratheon', 0.45283060689247934),
2 ('Daenerys-Targaryen', 0.2959459062106149),
3 ('Jon-Snow', 0.24484873673158666),
4 ('Tyrion-Lannister', 0.20961613179551256),
5 ('Robert-Baratheon', 0.17716906651536968)]

1 nx.draw(nx.barbell_graph(5, 1), with_labels=True)

png

As we know, a higher betweenness centrality means that the node is crucial to the structure of
the network. In the case of Stannis Baratheon in the fifth book, he seems to play a role similar to
that of node 5 in the barbell graph above: he appears to be holding the network together.
As evident from the betweenness centrality scores of the barbell graph example, node 5 is the most
important node in this network.

1 nx.betweenness_centrality(nx.barbell_graph(5, 1))

1 {0: 0.0,
2 1: 0.0,
3 2: 0.0,
4 3: 0.0,
5 4: 0.5333333333333333,
6 6: 0.5333333333333333,
7 7: 0.0,
8 8: 0.0,
9 9: 0.0,
10 10: 0.0,
11 5: 0.5555555555555556}

Community detection in Networks


A network is said to have community structure if the nodes of the network can be easily grouped into
(potentially overlapping) sets of nodes such that each set of nodes is densely connected internally.
There are multiple algorithms and definitions for calculating these communities in a network.
We will use the Louvain community detection algorithm to find the modules in our graph.

import nxviz as nv
from nxviz import annotate
import community  # the python-louvain package

plt.figure(figsize=(8, 8))

partition = community.best_partition(graphs[0], randomize=False)

# Annotate nodes' partitions
for n in graphs[0].nodes():
    graphs[0].nodes[n]["partition"] = partition[n]
    graphs[0].nodes[n]["degree"] = graphs[0].degree(n)

nv.matrix(graphs[0], group_by="partition", sort_by="degree", node_color_by="partition")
annotate.matrix_block(graphs[0], group_by="partition", color_by="partition")
annotate.matrix_group(graphs[0], group_by="partition", offset=-8)

/home/runner/work/Network-Analysis-Made-Simple/Network-Analysis-Made-Simple/nams_env/lib/python3.8/site-packages/nxviz/__init__.py:18: UserWarning:
nxviz has a new API! Version 0.7.0 onwards, the old class-based API is being
deprecated in favour of a new API focused on advancing a grammar of network
graphics. If your plotting code depends on the old API, please consider
pinning nxviz at version 0.6.3, as the new API will break your old code.

To check out the new API, please head over to the docs at
https://ericmjl.github.io/nxviz/ to learn more. We hope you enjoy using it!

(This deprecation message will go away in version 1.0.)

warnings.warn(

png

A common defining quality of a community is that the within-community edges are denser than
the between-community edges.
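One way to quantify this is modularity, which compares the density of within-community edges
against what we would expect if edges were placed at random. A quick check, assuming the
python-louvain package is available as community (as used above); the Louvain algorithm explicitly
tries to maximise this quantity:

# A value well above 0 indicates pronounced community structure
community.modularity(partition, graphs[0])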

# Louvain community detection finds us 8 different sets of communities
partition_dict = {}
for character, par in partition.items():
    if par in partition_dict:
        partition_dict[par].append(character)
    else:
        partition_dict[par] = [character]

1 len(partition_dict)

1 8

1 partition_dict[2]

1 ['Bran-Stark',
2 'Rickon-Stark',
3 'Robb-Stark',
4 'Luwin',
5 'Theon-Greyjoy',
6 'Hali',
7 'Hallis-Mollen',
8 'Hodor',
9 'Hullen',
10 'Joseth',
11 'Nan',
12 'Osha',
13 'Rickard-Karstark',
14 'Rickard-Stark',
15 'Stiv',
16 'Jon-Umber-(Greatjon)',
17 'Galbart-Glover',
18 'Roose-Bolton',
19 'Maege-Mormont']

If we plot one of these communities on its own, we see a denser network compared to the original
network, which contains all the characters.

1 nx.draw(nx.subgraph(graphs[0], partition_dict[3]))

png

1 nx.draw(nx.subgraph(graphs[0],partition_dict[1]))

png

We can test this by comparing the density of a community’s subgraph with the density of the whole
network. In the following example, the network between characters in a community is roughly 25
times as dense as the original network.

1 nx.density(nx.subgraph(
2 graphs[0], partition_dict[4])
3 )/nx.density(graphs[0])

1 25.42543859649123

Exercise
Find the most important node in the partitions according to degree centrality of the nodes using the
partition_dict we have already created.

1 from nams.solutions.got import most_important_node_in_partition



1 most_important_node_in_partition(graphs[0], partition_dict)

1 {7: 'Tyrion-Lannister',
2 1: 'Daenerys-Targaryen',
3 6: 'Eddard-Stark',
4 3: 'Jon-Snow',
5 5: 'Sansa-Stark',
6 2: 'Robb-Stark',
7 0: 'Waymar-Royce',
8 4: 'Danwell-Frey'}

Solutions
Here are the solutions to the exercises above.

1 from nams.solutions import got


2 import inspect
3
4 print(inspect.getsource(got))

import pandas as pd
import networkx as nx


def weighted_degree(G, weight):
    result = dict()
    for node in G.nodes():
        weight_degree = 0
        for n in G.edges([node], data=True):
            weight_degree += n[2]["weight"]
        result[node] = weight_degree
    return result


def correlation_centrality(G):
    cor = pd.DataFrame.from_records(
        [
            nx.pagerank_numpy(G, weight="weight"),
            nx.betweenness_centrality(G, weight="weight_inv"),
            weighted_degree(G, "weight"),
            nx.degree_centrality(G),
        ]
    )
    return cor.T.corr()


def evol_betweenness(graphs):
    evol = [nx.betweenness_centrality(graph, weight="weight_inv") for graph in graphs]
    evol_df = pd.DataFrame.from_records(evol).fillna(0)

    set_of_char = set()
    for i in range(5):
        set_of_char |= set(list(evol_df.T[i].sort_values(ascending=False)[0:5].index))

    evol_df[list(set_of_char)].plot(figsize=(19, 10))


def most_important_node_in_partition(graph, partition_dict):
    max_d = {}
    deg = nx.degree_centrality(graph)
    for group in partition_dict:
        temp = 0
        for character in partition_dict[group]:
            if deg[character] > temp:
                max_d[group] = character
                temp = deg[character]
    return max_d
Airport Network
1 %load_ext autoreload
2 %autoreload 2
3 %matplotlib inline
4 %config InlineBackend.figure_format = 'retina'
5 import networkx as nx
6 import pandas as pd
7 import matplotlib.pyplot as plt
8 import numpy as np

Introduction
In this chapter, we will analyse the evolution of US Airport Network between 1990 and 2015. This
dataset contains data for 25 years[1995-2015] of flights between various US airports and metadata
about these routes. Taken from Bureau of Transportation Statistics, United States Department of
Transportation.
Let’s see what can we make out of this!

1 from nams import load_data as cf


2 pass_air_data = cf.load_airports_data()

In the pass_air_data dataframe, every row tells us, for a particular year and route, the number of
passengers who flew that route and the set of airlines that operated it.

1 print(pass_air_data.head().to_markdown())

| id | YEAR | ORIGIN | DEST | UNIQUE_CARRIER_NAME | PASSENGERS |
|----|------|--------|------|---------------------|------------|
| 0 | 1990 | ABE | ACY | {'US Airways Inc.'} | 73 |
| 1 | 1990 | ABE | ATL | {'Eastern Air Lines Inc.'} | 73172 |
| 2 | 1990 | ABE | AVL | {'Westair Airlines Inc.'} | 0 |
| 3 | 1990 | ABE | AVP | {'Westair Airlines Inc.', 'US Airways Inc.', 'Eastern Air Lines Inc.'} | 8397 |
| 4 | 1990 | ABE | BHM | {'Eastern Air Lines Inc.'} | 59 |

Every row in this dataset is a unique route between two airports in United States territory in a
particular year. Let’s see how many people flew from New York JFK to Austin in 2006.
NOTE: This will be a fun chapter if you are an aviation geek and like guessing airport IATA codes.

1 jfk_aus_2006 = (pass_air_data
2 .query('YEAR == 2006')
3 .query("ORIGIN == 'JFK' and DEST == 'AUS'"))
4
5 print(jfk_aus_2006.head().to_markdown())

| id | YEAR | ORIGIN | DEST | UNIQUE_CARRIER_NAME | PASSENGERS |
|----|------|--------|------|---------------------|------------|
| 215634 | 2006 | JFK | AUS | {'Shuttle America Corp.', 'Ameristar Air Cargo', 'JetBlue Airways', 'United Parcel Service'} | 105290 |

From the above pandas query we see that, according to this dataset, 105290 passengers travelled from
JFK to AUS in the year 2006.
But how does this dataset translate to an applied network analysis problem? In the previous chapter
we created a different graph object for every book. Let’s create a graph object which encompasses all
the edges.
NetworkX provides us with Multi(Di)Graphs to model networks with multiple edges between two
nodes.
In this case, every row in the dataframe represents a directed edge between two airports. Common
sense suggests that if there is a flight from airport A to airport B there should also be a flight
from airport B to airport A, i.e. the direction of the edge shouldn’t matter. But this dataset records
the two directions (A -> B and B -> A) separately, so we create a MultiDiGraph.

1 passenger_graph = nx.from_pandas_edgelist(
2 pass_air_data, source='ORIGIN',
3 target='DEST', edge_key='YEAR',
4 edge_attr=['PASSENGERS', 'UNIQUE_CARRIER_NAME'],
5 create_using=nx.MultiDiGraph())

We have created a MultiDiGraph object passenger_graph which contains all the information from
the dataframe pass_air_data. ORIGIN and DEST represent the column names in the dataframe
pass_air_data used to construct the edges. As this is a MultiDiGraph, we can give a name/key to the
multiple edges between two nodes; edge_key is used to represent that name, and in this graph
YEAR is used to distinguish between multiple edges between two nodes. PASSENGERS and
UNIQUE_CARRIER_NAME are added as edge attributes, which can be accessed using the nodes and the
key from the MultiDiGraph object.

Let’s check if we can access the same information (the 2006 route between JFK and AUS) using our
passenger_graph.
To check an edge between two nodes in a graph we can use the syntax GraphObject[source][target],
and we can further specify the edge attribute using GraphObject[source][target][attribute].

1 passenger_graph['JFK']['AUS'][2006]

{'PASSENGERS': 105290.0,
 'UNIQUE_CARRIER_NAME': "{'Shuttle America Corp.', 'Ameristar Air Cargo', 'JetBlue Airways', 'United Parcel Service'}"}

Now let’s use our newly constructed passenger graph to look at the evolution of passenger load
between 1990 and 2015.

1 # Route betweeen New York-JFK and SFO


2
3 values = [(year, attr['PASSENGERS'])
4 for year, attr in
5 passenger_graph['JFK']['SFO'].items()]
6 x, y = zip(*values)
7 plt.plot(x, y)
8 plt.show()

png

We see some common trends across routes, like the steep drops in 2001 (after 9/11) and 2008 (the
recession).

1 # Route betweeen SFO and Chicago-ORD


2
3 values = [(year, attr['PASSENGERS'])
4 for year, attr in
5 passenger_graph['SFO']['ORD'].items()]
6 x, y = zip(*values)
7 plt.plot(x, y)
8 plt.show()

png

To find the overall trend, we can use our pass_air_data dataframe to calculate total passengers
flown in a year.

1 pass_air_data.groupby(
2 ['YEAR']).sum()['PASSENGERS'].plot()
3 plt.show()

png

Exercise
Find the busiest route in 1990 and in 2015 according to number of passengers, and plot the time series
of number of passengers on these routes.
You can use the DataFrame instead of working with the network. It will be faster :)

1 from nams.solutions.airport import busiest_route, plot_time_series

1 print(busiest_route(pass_air_data, 1990).head().to_markdown())

| id | YEAR | ORIGIN | DEST | UNIQUE_CARRIER_NAME | PASSENGERS |
|----|------|--------|------|---------------------|------------|
| 3917 | 1990 | LAX | HNL | {'Heavylift Cargo Airlines Lt', 'Hawaiian Airlines Inc.', 'Pan American World Airways (1)', 'Delta Air Lines Inc.', 'Trans World Airways LLC', 'World Airways Inc.', 'China Airlines Ltd.', 'Korean Air Lines Co. Ltd.', 'Qantas Airways Ltd.', 'P.T. Garuda Indonesian Arwy', 'Air New Zealand', 'Continental Air Lines Inc.', 'American Airlines Inc.', 'Northwest Airlines Inc.', 'Philippine Airlines Inc.', 'Malaysian Airline System', 'Singapore Airlines Ltd.', 'Flagship Express Services', 'United Air Lines Inc.', 'Eastern Air Lines Inc.'} | 1.82716e+06 |

1 plot_time_series(pass_air_data, 'LAX', 'HNL')

png

1 print(busiest_route(pass_air_data, 2015).head().to_markdown())

| id | YEAR | ORIGIN | DEST | UNIQUE_CARRIER_NAME | PASSENGERS |
|----|------|--------|------|---------------------|------------|
| 445978 | 2015 | LAX | SFO | {'Hawaiian Airlines Inc.', 'Delta Air Lines Inc.', 'SkyWest Airlines Inc.', 'Atlas Air Inc.', 'Asiana Airlines Inc.', 'Compass Airlines', 'Southwest Airlines Co.', 'American Airlines Inc.', 'Western Global', 'Vision Airlines', 'China Airlines Ltd.', 'Korean Air Lines Co. Ltd.', 'Mesa Airlines Inc.', 'Alaska Airlines Inc.', 'ABX Air Inc', 'Aeromexico', 'Kalitta Air LLC', 'Virgin America', 'Nippon Cargo Airlines', 'Swift Air, LLC', 'US Airways Inc.', 'United Air Lines Inc.', 'Tyrolean Jet Service'} | 1.86907e+06 |

1 plot_time_series(pass_air_data, 'LAX', 'SFO')



png

Before moving to the next part of the chapter, let’s create a method to extract edges from
passenger_graph for a particular year so we can analyse the data at a more granular scale.

def year_network(G, year):
    """ Extract edges for a particular year from
    a MultiGraph. The edge is also populated with
    two attributes, weight and weight_inv where
    weight is the number of passengers and
    weight_inv the inverse of it.
    """
    year_network = nx.DiGraph()
    for edge in G.edges:
        source, target, edge_year = edge
        if edge_year == year:
            attr = G[source][target][edge_year]
            year_network.add_edge(
                source, target,
                weight=attr['PASSENGERS'],
                weight_inv=1/(attr['PASSENGERS']
                              if attr['PASSENGERS'] != 0.0 else 1),
                airlines=attr['UNIQUE_CARRIER_NAME'])
    return year_network

1 pass_2015_network = year_network(passenger_graph, 2015)

1 # Extracted a Directed Graph from the Multi Directed Graph


2 # Number of nodes = airports
3 # Number of edges = routes
4
5 print(nx.info(pass_2015_network))

1 Name:
2 Type: DiGraph
3 Number of nodes: 1258
4 Number of edges: 25354
5 Average in degree: 20.1542
6 Average out degree: 20.1542

Visualise the airports


# Loading the GPS coordinates of all the airports
from nams import load_data as cf

lat_long = cf.load_airports_GPS_data()
lat_long.columns = [
    "CODE4",
    "CODE3",
    "CITY",
    "PROVINCE",
    "COUNTRY",
    "UNKNOWN1",
    "UNKNOWN2",
    "UNKNOWN3",
    "UNKNOWN4",
    "UNKNOWN5",
    "UNKNOWN6",
    "UNKNOWN7",
    "UNKNOWN8",
    "UNKNOWN9",
    "LATITUDE",
    "LONGITUDE",
]
wanted_nodes = list(pass_2015_network.nodes())
us_airports = (
    lat_long
    .query("CODE3 in @wanted_nodes")
    .drop_duplicates(subset=["CODE3"])
    .set_index("CODE3")
)
us_airports

| CODE3 | CODE4 | CITY | PROVINCE | COUNTRY | LATITUDE | LONGITUDE |
|-------|-------|------|----------|---------|----------|-----------|
| ABI | KABI | ABILENE RGNL | ABILENE | USA | 32.411 | -99.682 |
| ABQ | KABQ | NaN | ALBUQUERQUE | USA | 0.000 | 0.000 |
| ACK | KACK | NANTUCKET MEM | NANTUCKET | USA | 41.253 | -70.060 |
| ACT | KACT | WACO RGNL | WACO | USA | 31.611 | -97.230 |
| ACY | KACY | ATLANTIC CITY INTERNATIONAL | ATLANTIC CITY | USA | 39.458 | -74.577 |
| ... | ... | ... | ... | ... | ... | ... |
| BQN | TJBQ | RAFAEL HERNANDEZ | AGUADILLA | PUERTO RICO | 18.495 | -67.129 |
| SIG | TJIG | FERNANDO LUIS RIBAS DOMINICCI | SAN JUAN | PUERTO RICO | 18.457 | -66.098 |
| MAZ | TJMZ | EUGENIO MARIA DE HOSTOS | MAYAGUEZ | PUERTO RICO | 18.256 | -67.148 |
| PSE | TJPS | MERCEDITA | PONCE | PUERTO RICO | 18.008 | -66.563 |
| SJU | TJSJ | LUIS MUNOZ MARIN INTERNATIONAL | SAN JUAN | PUERTO RICO | 18.439 | -66.002 |

363 rows × 15 columns (the UNKNOWN1-UNKNOWN9 columns are omitted from this view)

# Annotate graph with latitude and longitude
no_gps = []
for n, d in pass_2015_network.nodes(data=True):
    try:
        pass_2015_network.nodes[n]["longitude"] = us_airports.loc[n, "LONGITUDE"]
        pass_2015_network.nodes[n]["latitude"] = us_airports.loc[n, "LATITUDE"]
        pass_2015_network.nodes[n]["degree"] = pass_2015_network.degree(n)

    # Some of the nodes are not represented
    except KeyError:
        no_gps.append(n)

# Get subgraph of nodes that do have GPS coords
has_gps = set(pass_2015_network.nodes()).difference(no_gps)
g = pass_2015_network.subgraph(has_gps)

Let’s first plot only the nodes, i.e. the airports. Places like Guam and the US Virgin Islands are also
included, as they are treated as domestic airports in this dataset.

1 import nxviz as nv
2 from nxviz import nodes, plots, edges
3 plt.figure(figsize=(20, 9))
4 pos = nodes.geo(g, aesthetics_kwargs={"size_scale": 1})
5 plots.aspect_equal()
6 plots.despine()

png

Let’s also plot the routes (edges).

1 import nxviz as nv
2 from nxviz import nodes, plots, edges, annotate
3 plt.figure(figsize=(20, 9))
4 pos = nodes.geo(g, color_by="degree", aesthetics_kwargs={"size_scale": 1})
5 edges.line(g, pos, aesthetics_kwargs={"alpha_scale": 0.1})
6 annotate.node_colormapping(g, color_by="degree")
7 plots.aspect_equal()
8 plots.despine()

png

Before we proceed further, let’s take a detour to briefly discuss directed networks and PageRank.

Directed Graphs and PageRank


The figure below explains the basic idea behind the PageRank algorithm. The “importance” of a
node depends on the incoming links to the node, i.e. if an “important” node A points towards a node
B, it will increase the PageRank score of node B, and this is run iteratively. In the given figure, even
though node C is only connected to one node, it is considered “important” as the connection is to
node B, which is an “important” node.

Source: Wikipedia
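Under the hood, PageRank can be computed by power iteration: every node repeatedly passes a
share of its current score along its outgoing edges. Here is a minimal, illustrative sketch of that loop
for a directed graph (unweighted, and ignoring the dangling-node corrections that nx.pagerank
handles properly):

def pagerank_sketch(G, d=0.85, n_iter=50):
    # Start with a uniform score for every node
    n = len(G)
    pr = {node: 1 / n for node in G}
    for _ in range(n_iter):
        new = {node: (1 - d) / n for node in G}
        for u in G:
            if G.out_degree(u) == 0:
                continue  # this sketch ignores dangling-node mass
            share = d * pr[u] / G.out_degree(u)
            for v in G.successors(u):
                new[v] += share
        pr = new
    return pr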
To better understand this let’s work through an example.

1 # Create an empty directed graph object


2 G = nx.DiGraph()
3 # Add an edge from 1 to 2 with weight 4
4 G.add_edge(1, 2, weight=4)

1 print(G.edges(data=True))

1 [(1, 2, {'weight': 4})]



1 # Access edge from 1 to 2


2 G[1][2]

1 {'weight': 4}

What happens when we try to access the edge from 2 to 1?

G[2][1]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-137-d6b8db3142ef> in <module>
      1 # Access edge from 2 to 1
----> 2 G[2][1]

~/miniconda3/envs/nams/lib/python3.7/site-packages/networkx/classes/coreviews.py in __getitem__(self, key)
     52
     53     def __getitem__(self, key):
---> 54         return self._atlas[key]
     55
     56     def copy(self):

KeyError: 1

As expected, we get an error when we try to access the edge from 2 to 1, as this is a directed graph.

1 G.add_edges_from([(1, 2), (3, 2),


2 (4, 2), (5, 2),
3 (6, 2), (7, 2)])
4 # nx.draw_spring(G, with_labels=True)
5 nv.circos(G, node_aes_kwargs={"size_scale": 0.3})

1 <AxesSubplot:>

png

Just by looking at the example above, we can conclude that node 2 should have the highest PageRank
score as all the nodes are pointing towards it.
This is confirmed by calculating the PageRank of this graph.

1 nx.pagerank(G)

1 {1: 0.0826448180198328,
2 2: 0.5041310918810031,
3 3: 0.0826448180198328,
4 4: 0.0826448180198328,
5 5: 0.0826448180198328,
6 6: 0.0826448180198328,
7 7: 0.0826448180198328}

What happens when we add an edge from node 5 to node 6?

1 G.add_edge(5, 6)
2 nv.circos(G, node_aes_kwargs={"size_scale": 0.3})
3 # nx.draw_spring(G, with_labels=True)

1 <AxesSubplot:>

png

1 nx.pagerank(G)

1 {1: 0.08024854052495894,
2 2: 0.4844028780560986,
3 3: 0.08024854052495894,
4 4: 0.08024854052495894,
5 5: 0.08024854052495894,
6 6: 0.11435441931910648,
7 7: 0.08024854052495894}

As expected there was some change in the scores (an increase for 6) but the overall trend stays the
same, with node 2 leading the pack.

1 G.add_edge(2, 8)
2 nv.circos(G, node_aes_kwargs={"size_scale": 0.3})

1 <AxesSubplot:>

png

Now we have added an edge from 2 to a new node, 8. As node 2 already has a high PageRank score,
some of that score should be passed on to node 8. Let’s see how much difference this makes.

1 nx.pagerank(G)

1 {1: 0.05378612718073915,
2 2: 0.3246687852772877,
3 3: 0.05378612718073915,
4 4: 0.05378612718073915,
5 5: 0.05378612718073915,
6 6: 0.0766454192258098,
7 7: 0.05378612718073915,
8 8: 0.3297551595932067}

In this example, node 8 is now even more “important” than node 2, even though node 8 has only one
incoming connection.
Let’s move back to airports and use this knowledge to analyse the network.

Important Hubs in the Airport Network


So let’s have a look at the important nodes in this network, i.e. the important airports. We’ll use
centrality measures like PageRank, betweenness centrality and degree centrality, which we have
gone through in this book.
Let’s try to calculate the PageRank of passenger_graph.

nx.pagerank(passenger_graph)

---------------------------------------------------------------------------
NetworkXNotImplemented                    Traceback (most recent call last)
<ipython-input-144-15a6f513bf9b> in <module>
      1 # Let's try to calculate the PageRank measures of this graph.
----> 2 nx.pagerank(passenger_graph)

<decorator-gen-435> in pagerank(G, alpha, personalization, max_iter, tol, nstart, weight, dangling)

~/miniconda3/envs/nams/lib/python3.7/site-packages/networkx/utils/decorators.py in _not_implemented_for(not_implement_for_func, *args, **kwargs)
     78     if match:
     79         msg = 'not implemented for %s type' % ' '.join(graph_types)
---> 80         raise nx.NetworkXNotImplemented(msg)
     81     else:
     82         return not_implement_for_func(*args, **kwargs)

NetworkXNotImplemented: not implemented for multigraph type

As PageRank isn’t implemented for a MultiGraph in NetworkX, we need to use our extracted yearly
subnetworks.

1 # As pagerank will take weighted measure


2 # by default we pass in None to make this
3 # calculation for unweighted network
4 PR_2015_scores = nx.pagerank(
5 pass_2015_network, weight=None)

1 # Let's check the PageRank score for JFK


2 PR_2015_scores['JFK']

1 0.0036376572979606586

1 # top 10 airports according to unweighted PageRank


2 top_10_pr = sorted(PR_2015_scores.items(),
3 key=lambda x:x[1],
4 reverse=True)[:10]

1 # top 10 airports according to unweighted betweenness centrality


2 top_10_bc = sorted(
3 nx.betweenness_centrality(pass_2015_network,
4 weight=None).items(), key=lambda x:x[1],
5 reverse=True)[0:10]

1 # top 10 airports according to degree centrality


2 top_10_dc = sorted(
3 nx.degree_centrality(pass_2015_network).items(),
4 key=lambda x:x[1], reverse=True)[0:10]

Before looking at the results, think about what we just calculated, try to guess which airports should
come out at the top, and be ready to be surprised :D

1 # PageRank
2 top_10_pr

1 [('ANC', 0.010425531156396332),
2 ('HPN', 0.008715287139161587),
3 ('FAI', 0.007865131822111036),
4 ('DFW', 0.007168038232113773),
5 ('DEN', 0.006557279519803018),
6 ('ATL', 0.006367579588749718),
7 ('ORD', 0.006178836107660135),
8 ('YIP', 0.005821525504523931),
9 ('ADQ', 0.005482597083474197),
10 ('MSP', 0.005481962582230961)]

1 # Betweenness Centrality
2 top_10_bc

1 [('ANC', 0.28907458480586606),
2 ('FAI', 0.08042857784594384),
3 ('SEA', 0.06745549919241699),
4 ('HPN', 0.06046810178534726),
5 ('ORD', 0.045544143864829294),
6 ('ADQ', 0.040170160000905696),
7 ('DEN', 0.038543251364241436),
8 ('BFI', 0.03811277548952854),
9 ('MSP', 0.03774809342340624),
10 ('TEB', 0.036229439542316354)]

1 # Degree Centrality
2 top_10_dc

1 [('ATL', 0.3643595863166269),
2 ('ORD', 0.354813046937152),
3 ('DFW', 0.3420843277645187),
4 ('MSP', 0.3261734287987271),
5 ('DEN', 0.31821797931583135),
6 ('ANC', 0.3046937151949085),
7 ('MEM', 0.29196499602227527),
8 ('LAX', 0.2840095465393795),
9 ('IAH', 0.28082736674622116),
10 ('DTW', 0.27446300715990457)]

The degree centrality results do make sense at first glance: ATL is Atlanta and ORD is Chicago
O’Hare, definitely airports one would expect at the top of a list which calculates the “importance” of
an airport. But when we look at PageRank and betweenness centrality we find an unexpected
airport, ‘ANC’. Do think about measures like PageRank and betweenness centrality and what they
calculate, and note that so far we have only used the core structure of the network, with no other
metadata like the number of passengers; these are calculations on the unweighted network.
‘ANC’ is the airport code of Anchorage airport, a place in Alaska, and according to PageRank and
betweenness centrality it is the most important airport in this network. Isn’t that weird? Thoughts?
Looks like ‘ANC’ is essential to the core structure of the network, as it is the main airport connecting
Alaska with other parts of the US. This explains the high betweenness centrality score, and there are
flights from other major airports to ‘ANC’, which explains the high PageRank score.
Related blog post: https://toreopsahl.com/2011/08/12/why-anchorage-is-not-that-important-binary-ties-and-sample-selection/
Let’s look at the weighted version, i.e. taking into account the number of people flying to these places.

1 # Recall from the last chapter we use weight_inv


2 # while calculating betweenness centrality
3 sorted(nx.betweenness_centrality(
4 pass_2015_network, weight='weight_inv').items(),
5 key=lambda x:x[1], reverse=True)[0:10]

1 [('SEA', 0.4192179843829966),
2 ('ATL', 0.3589665389741017),
3 ('ANC', 0.32425767084369994),
4 ('LAX', 0.2668567170342895),
5 ('ORD', 0.10008664852621497),
6 ('DEN', 0.0964658422388763),
7 ('MSP', 0.09300021788810685),
8 ('DFW', 0.0926644126226465),
9 ('FAI', 0.08824779747216016),
10 ('BOS', 0.08259764427486331)]

1 sorted(nx.pagerank(
2 pass_2015_network, weight='weight').items(),
3 key=lambda x:x[1], reverse=True)[0:10]

1 [('ATL', 0.037535963029303135),
2 ('ORD', 0.028329766122739346),
3 ('SEA', 0.028274564067008245),
4 ('ANC', 0.027127866647567035),
5 ('DFW', 0.02570050418889442),
6 ('DEN', 0.025260024346433315),
7 ('LAX', 0.02394043498608451),
8 ('PHX', 0.018373176636420224),
9 ('CLT', 0.01780703930063076),
10 ('LAS', 0.017649683141049966)]

When we adjust for the number of passengers, we see a reshuffle in the “importance” rankings, and
they do make a bit more sense now. According to weighted PageRank, Atlanta, Chicago and Seattle
are the top 3 airports, while Anchorage drops to 4th rank.
To get an even better picture we should do this analysis with more metadata about the routes, not
just the number of passengers.

How reachable is this network?


Let’s assume you are the Head of Data Science of an airline, and your job is to make your airline
network as “connected” as possible.
To translate this problem statement into network science, we calculate the average shortest path
length of this network; it gives us an idea of the number of jumps we need to make, on average, to go
from one airport to any other airport in the network.
We can use the built-in NetworkX method average_shortest_path_length to find the average
shortest path length of a network.
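For reference, this is essentially what that method computes: the mean of the shortest-path distances
over all ordered pairs of distinct nodes. A hand-rolled sketch (for a (strongly) connected graph;
avg_spl is just an illustrative helper):

def avg_spl(G):
    # Sum d(u, v) over all ordered pairs; the d(u, u) = 0 terms add nothing
    n = len(G)
    total = sum(
        dist
        for source in G
        for dist in nx.shortest_path_length(G, source).values()
    )
    return total / (n * (n - 1))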

nx.average_shortest_path_length(pass_2015_network)

---------------------------------------------------------------------------
NetworkXError                             Traceback (most recent call last)
<ipython-input-157-acfe9bf3572a> in <module>
----> 1 nx.average_shortest_path_length(pass_2015_network)

~/miniconda3/envs/nams/lib/python3.7/site-packages/networkx/algorithms/shortest_paths/generic.py in average_shortest_path_length(G, weight, method)
    401     # Shortest path length is undefined if the graph is disconnected.
    402     if G.is_directed() and not nx.is_weakly_connected(G):
--> 403         raise nx.NetworkXError("Graph is not weakly connected.")
    404     if not G.is_directed() and not nx.is_connected(G):
    405         raise nx.NetworkXError("Graph is not connected.")

NetworkXError: Graph is not weakly connected.

Wait, what? This network is not “connected” (ignore the term “weakly” for the moment). That seems
weird. It means that there are nodes which aren’t reachable from other sets of nodes, which isn’t
good news, especially in a transportation network.
Let’s have a look at these far-flung airports which aren’t reachable.

1 components = list(
2 nx.weakly_connected_components(
3 pass_2015_network))

# There are 3 weakly connected components in the network.
for c in components:
    print(len(c))

1 1255
2 2
3 1

1 # Let's look at the component with 2 and 1 airports respectively.


2 print(components[1])
3 print(components[2])

1 {'SSB', 'SPB'}
2 {'AIK'}

The airports ‘SSB’ and ‘SPB’ are codes for seaplane bases, and they only have flights to each other, so
it makes sense that they aren’t connected to the larger network of airports.
The airport ‘AIK’ is even stranger, as it is a component by itself, i.e. there is a flight from AIK to AIK.
After investigating further, it just seems to be an anomaly in this dataset.

1 AIK_DEST_2015 = pass_air_data[
2 (pass_air_data['YEAR'] == 2015) &
3 (pass_air_data['DEST'] == 'AIK')]
4 print(AIK_DEST_2015.head().to_markdown())

| id | YEAR | ORIGIN | DEST | UNIQUE_CARRIER_NAME | PASSENGERS |
|----|------|--------|------|---------------------|------------|
| 433338 | 2015 | AIK | AIK | {'Wright Air Service'} | 0 |

1 # Let's get rid of them, we don't like them


2 pass_2015_network.remove_nodes_from(
3 ['SPB', 'SSB', 'AIK'])

1 # Our network is now weakly connected


2 nx.is_weakly_connected(pass_2015_network)

1 True

1 # It's not strongly connected


2 nx.is_strongly_connected(pass_2015_network)

1 False

Strongly vs weakly connected graphs.


Let’s go through an example to understand weakly and strongly connected directed graphs.

1 # NOTE: The notion of strongly and weakly exists only for directed graphs.
2 G = nx.DiGraph()
3
4 # Let's create a cycle directed graph, 1 -> 2 -> 3 -> 1
5 G.add_edge(1, 2)
6 G.add_edge(2, 3)
7 G.add_edge(3, 1)
8 nx.draw(G, with_labels=True)

png

In the above example we can reach any node irrespective of where we start traversing the network:
if we start from 2 we can reach 1 via 3. In this network every node is “reachable” from every other
node, i.e. the network is strongly connected.

1 nx.is_strongly_connected(G)

1 True

1 # Let's add a new connection


2 G.add_edge(3, 4)
3 nx.draw(G, with_labels=True)

png

It’s evident from the example above that we can no longer traverse the whole network. If we start
from node 4 we are stuck there; we don’t have any way of leaving node 4, assuming we strictly
follow the direction of the edges. In this case the network isn’t strongly connected, but if we look at
the structure and assume the directions of the edges don’t matter, then we can reach any other node
in the network even if we start from node 4.
When the undirected copy of a directed network is connected, we call the directed network weakly
connected.
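We can check this equivalence directly in NetworkX:

# Weak connectivity of G is just connectivity of its undirected copy
assert nx.is_weakly_connected(G) == nx.is_connected(G.to_undirected())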

1 nx.is_strongly_connected(G)

1 False

1 nx.is_weakly_connected(G)

1 True

Let’s go back to our airport network of 2015.


After removing those 3 airports the network is weakly connected.

1 nx.is_weakly_connected(pass_2015_network)

1 True

1 nx.is_strongly_connected(pass_2015_network)

1 False

But our network is still not strongly connected, which essentially means there are airports in the
network which you can fly into but can’t fly back out of, which doesn’t really seem okay.

1 strongly_connected_components = list(
2 nx.strongly_connected_components(pass_2015_network))

# Let's look at an example of a small strongly connected component
strongly_connected_components[0]

1 {'BCE'}

1 BCE_DEST_2015 = pass_air_data[
2 (pass_air_data['YEAR'] == 2015) &
3 (pass_air_data['DEST'] == 'BCE')]
4 print(BCE_DEST_2015.head().to_markdown())

| id | YEAR | ORIGIN | DEST | UNIQUE_CARRIER_NAME | PASSENGERS |
|----|------|--------|------|---------------------|------------|
| 451074 | 2015 | PGA | BCE | {'Grand Canyon Airlines, Inc. d/b/a Grand Canyon Airlines d/b/a Scenic Airlines'} | 8 |

1 BCE_ORI_2015 = pass_air_data[
2 (pass_air_data['YEAR'] == 2015) &
3 (pass_air_data['ORIGIN'] == 'BCE')]
4 print(BCE_ORI_2015.head().to_markdown())

| id | YEAR | ORIGIN | DEST | UNIQUE_CARRIER_NAME | PASSENGERS |
|----|------|--------|------|---------------------|------------|

As we can see above, you can fly into ‘BCE’ but can’t fly out of it. Weird indeed. These are small
airports with one-off scheduled flights. For the purposes of our analyses we can ignore such airports.

1 # Let's find the biggest strongly connected component


2 pass_2015_strong_nodes = max(
3 strongly_connected_components, key=len)

1 # Create a subgraph with the nodes in the


2 # biggest strongly connected component
3 pass_2015_strong = pass_2015_network.subgraph(
4 nodes=pass_2015_strong_nodes)

1 nx.is_strongly_connected(pass_2015_strong)

1 True

After removing multiple airports we now have a strongly connected airport network. We can now
travel from one airport to any other airport in the network.

1 # We started with 1258 airports


2 len(pass_2015_strong)

1 1190

1 nx.average_shortest_path_length(pass_2015_strong)

1 3.174661992635574

The number 3.17 above is the average shortest path length between two airports in this network,
which means that on average it takes about three flights (roughly two layovers) to get from one
airport to another, which sounds nice. A more reachable network is better, not necessarily in terms
of revenue for the airline, but for the social health of the air transport network.
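As a concrete illustration (the airport codes here are just examples; any two airports in this strongly
connected component will do):

# Number of flights needed between two specific airports
nx.shortest_path_length(pass_2015_strong, source="FAI", target="MIA")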

Exercise
How can we decrease the average shortest path length of this network?
Think of an effective way to add new edges to decrease the average shortest path length. Let’s see
if we can come up with a nice way to do this.
The rules are simple: you can’t add more than 2% of the current edges (~500 edges).

1 from nams.solutions.airport import add_opinated_edges

1 new_routes_network = add_opinated_edges(pass_2015_strong)

1 nx.average_shortest_path_length(new_routes_network)

1 3.0888508809747615

Using an opinionated heuristic, we were able to reduce the average shortest path length of the
network. Check the solution below to understand the idea behind the heuristic, and do try to come
up with your own heuristics.
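For inspiration, here is one other possible heuristic, sketched as a hypothetical helper
(add_hub_shortcuts is not part of nams): wire the busiest hub directly to the nodes currently
farthest from it, staying well under the 500-edge budget.

def add_hub_shortcuts(G, n_new=100):
    # Hypothetical alternative heuristic: connect the top hub to the
    # n_new nodes farthest from it (adds at most 2 * n_new edges).
    G = nx.DiGraph(G)
    dc = nx.degree_centrality(G)
    hub = max(dc, key=dc.get)
    # Hop distance of every node from the hub, farthest first
    dist = nx.shortest_path_length(G, source=hub)
    farthest = sorted(dist, key=dist.get, reverse=True)[:n_new]
    for node in farthest:
        G.add_edge(hub, node)
        G.add_edge(node, hub)
    return G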

Can we find airline-specific reachability?


Let’s see how we can use the airline metadata to calculate the reachability of a specific airline.

1 # We have access to the airlines that fly the route in the edge attribute airlines
2 pass_2015_network['JFK']['SFO']

{'weight': 1179941.0,
 'weight_inv': 8.4750000211875e-07,
 'airlines': "{'Delta Air Lines Inc.', 'Virgin America', 'American Airlines Inc.', 'Sun Country Airlines d/b/a MN Airlines', 'JetBlue Airways', 'Vision Airlines', 'United Air Lines Inc.'}"}

# A helper function to extract the airline names from the edge attribute
def str_to_list(a):
    return a[1:-1].split(', ')
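Since the attribute is a stringified Python set, a more robust alternative would be to let Python’s
parser handle the quoting and commas; str_to_set below is just a suggested variant, not what the
book uses:

import ast

def str_to_set(a):
    # Safely evaluate the string back into an actual Python set
    return ast.literal_eval(a)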

for origin, dest in pass_2015_network.edges():
    pass_2015_network[origin][dest]['airlines_list'] = str_to_list(
        pass_2015_network[origin][dest]['airlines'])

Let’s extract the network of United Airlines from our airport network.

united_network = nx.DiGraph()
for origin, dest in pass_2015_network.edges():
    if "'United Air Lines Inc.'" in pass_2015_network[origin][dest]['airlines_list']:
        united_network.add_edge(
            origin, dest,
            weight=pass_2015_network[origin][dest]['weight'])

1 # number of nodes -> airports


2 # number of edges -> routes
3 print(nx.info(united_network))

1 Name:
2 Type: DiGraph
3 Number of nodes: 194
4 Number of edges: 1894
5 Average in degree: 9.7629
6 Average out degree: 9.7629

1 # Let's find United Hubs according to PageRank


2 sorted(nx.pagerank(
3 united_network, weight='weight').items(),
4 key=lambda x:x[1], reverse=True)[0:5]

1 [('ORD', 0.08385772266571424),
2 ('DEN', 0.06816244850418422),
3 ('LAX', 0.053065234147240105),
4 ('IAH', 0.044410609028379185),
5 ('SFO', 0.04326197030283029)]

1 # Let's find United Hubs according to Degree Centrality


2 sorted(nx.degree_centrality(
3 united_network).items(),
4 key=lambda x:x[1], reverse=True)[0:5]

1 [('ORD', 1.0),
2 ('IAH', 0.9274611398963731),
3 ('DEN', 0.8756476683937824),
4 ('EWR', 0.8134715025906736),
5 ('SFO', 0.6839378238341969)]

Solutions
Here are the solutions to the exercises above.

1 from nams.solutions import airport


2 import inspect
3
4 print(inspect.getsource(airport))

import networkx as nx
import pandas as pd


def busiest_route(pass_air_data, year):
    return pass_air_data[
        pass_air_data.groupby(["YEAR"])["PASSENGERS"].transform(max)
        == pass_air_data["PASSENGERS"]
    ].query(f"YEAR == {year}")


def plot_time_series(pass_air_data, origin, dest):
    pass_air_data.query(f"ORIGIN == '{origin}' and DEST == '{dest}'").plot(
        "YEAR", "PASSENGERS"
    )


def add_opinated_edges(G):
    G = nx.DiGraph(G)
    sort_degree = sorted(
        nx.degree_centrality(G).items(), key=lambda x: x[1], reverse=True
    )
    top_count = 0
    for n, v in sort_degree:
        count = 0
        for node, val in sort_degree:
            if node != n:
                if node not in G._adj[n]:
                    G.add_edge(n, node)
                    count += 1
                    if count == 25:
                        break
        top_count += 1
        if top_count == 20:
            break
    return G
