Introduction to Graph Neural Networks
Synthesis Lectures on Artificial Intelligence and Machine Learning
Editors
Ronald Brachman, Jacobs Technion-Cornell Institute at Cornell Tech
Francesca Rossi, IBM Research AI
Peter Stone, University of Texas at Austin
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00980ED1V01Y202001AIM045
Synthesis Lectures on Artificial Intelligence and Machine Learning, Lecture #45
Series ISSN: Print 1939-4608, Electronic 1939-4616
Introduction to Graph Neural Networks
Zhiyuan Liu and Jie Zhou
Morgan & Claypool Publishers
ABSTRACT
Graphs are useful data structures in complex real-life applications such as modeling physical sys-
tems, learning molecular fingerprints, controlling traffic networks, and recommending friends
in social networks. However, these tasks require dealing with non-Euclidean graph data that
contains rich relational information between elements and cannot be well handled by tradi-
tional deep learning models (e.g., convolutional neural networks (CNNs) or recurrent neural
networks (RNNs)). Nodes in graphs usually contain useful feature information that cannot be
well addressed in most unsupervised representation learning methods (e.g., network embedding
methods). Graph neural networks (GNNs) are proposed to combine the feature information
and the graph structure to learn better representations on graphs via feature propagation and
aggregation. Due to their convincing performance and high interpretability, GNNs have recently
become widely applied graph analysis tools.
This book provides a comprehensive introduction to the basic concepts, models, and ap-
plications of graph neural networks. It starts with the introduction of the vanilla GNN model.
Then several variants of the vanilla model are introduced such as graph convolutional networks,
graph recurrent networks, graph attention networks, graph residual networks, and several general
frameworks. Variants for different graph types and advanced training methods are also included.
As for the applications of GNNs, the book categorizes them into structural, non-structural, and
other scenarios, and then it introduces several typical models for solving these tasks. Finally, the
closing chapters provide GNN open resources and an outlook on several future directions.
KEYWORDS
deep graph learning, deep learning, graph neural network, graph analysis, graph
convolutional network, graph recurrent network, graph residual network
Contents
Preface
Acknowledgments
1 Introduction
1.1 Motivations
1.1.1 Convolutional Neural Networks
1.1.2 Network Embedding
1.2 Related Work
11 General Frameworks
11.1 Message Passing Neural Networks
11.2 Non-local Neural Networks
11.3 Graph Networks
15 Open Resources
15.1 Datasets
15.2 Implementations
16 Conclusion
Bibliography
Preface
Deep learning has achieved promising progress in many fields such as computer vision and natu-
ral language processing. The data in these tasks are usually represented in the Euclidean domain.
However, many learning tasks require dealing with non-Euclidean graph data that contains rich
relational information between elements, such as modeling physical systems, learning molecular
fingerprints, predicting protein interfaces, etc. Graph neural networks (GNNs) are deep learning-based
methods that operate on graph domains. Due to their convincing performance and high
interpretability, GNNs have recently become a widely applied graph analysis method.
The book provides a comprehensive introduction to the basic concepts, models, and appli-
cations of graph neural networks. It starts with the basics of mathematics and neural networks. In
the first chapters, it gives an introduction to the basic concepts of GNNs, which aims to provide
a general overview for readers. Then it introduces different variants of GNNs: graph convolu-
tional networks, graph recurrent networks, graph attention networks, graph residual networks,
and several general frameworks. These variants generalize different deep learning techniques
to graphs, such as convolutional neural networks, recurrent neural networks, the attention
mechanism, and skip connections. Further, the book introduces different applications of GNNs
in structural scenarios (physics, chemistry, knowledge graph), non-structural scenarios (image,
text) and other scenarios (generative models, combinatorial optimization). Finally, the book lists
relevant datasets, open source platforms, and implementations of GNNs.
This book is organized as follows. After an overview in Chapter 1, we introduce some
basic knowledge of math and graph theory in Chapter 2. We show the basics of neural networks
in Chapter 3 and then give a brief introduction to the vanilla GNN in Chapter 4. Four types of
models are introduced in Chapters 5, 6, 7, and 8, respectively. Other variants for different graph
types and advanced training methods are introduced in Chapters 9 and 10. Then we present
several general GNN frameworks in Chapter 11. Applications of GNNs in structural scenarios,
non-structural scenarios, and other scenarios are presented in Chapters 12, 13, and 14. Finally,
we provide some open resources in Chapter 15 and conclude the book in Chapter 16.
Acknowledgments
We would like to thank those who contributed and gave advice in individual chapters:
Chapter 1: Ganqu Cui, Zhengyan Zhang
Chapter 2: Yushi Bai
Chapter 3: Yushi Bai
Chapter 4: Zhengyan Zhang
Chapter 9: Zhengyan Zhang, Ganqu Cui, Shengding Hu
Chapter 10: Ganqu Cui
Chapter 12: Ganqu Cui
Chapter 13: Ganqu Cui, Zhengyan Zhang
Chapter 14: Ganqu Cui, Zhengyan Zhang
Chapter 15: Yushi Bai, Shengding Hu
We would also like to thank those who provided feedback on the content of the book: Cheng
Yang, Ruidong Wu, Chang Shu, Yufeng Du, and Jiayou Zhang.
Finally, we would like to thank all the editors, reviewers, and staff who helped with the
publication of the book. Without you, this book would not have been possible.
CHAPTER 1
Introduction
Graphs are a kind of data structure that models a set of objects (nodes) and their relationships
(edges). Recently, research on analyzing graphs with machine learning has received more and
more attention because of the great expressive power of graphs; i.e., graphs can be used to denote
a large number of systems across various areas, including social science (social networks) [Hamilton
et al., 2017b, Kipf and Welling, 2017], natural science (physical systems [Battaglia et al., 2016,
Sanchez et al., 2018] and protein-protein interaction networks [Fout et al., 2017]), knowledge
graphs [Hamaguchi et al., 2017], and many other research areas [Khalil et al., 2017]. As a unique
non-Euclidean data structure for machine learning, graphs draw attention for analyses that focus
on node classification, link prediction, and clustering. Graph neural networks (GNNs) are deep
learning-based methods that operate on the graph domain. Due to their convincing performance
and high interpretability, GNNs have recently become a widely applied graph analysis method.
In the following paragraphs, we will illustrate the fundamental motivations of GNNs.
1.1 MOTIVATIONS
1.1.1 CONVOLUTIONAL NEURAL NETWORKS
First, GNNs are motivated by convolutional neural networks (CNNs) [LeCun et al., 1998].
CNNs are capable of extracting and composing multi-scale localized spatial features to build
highly expressive representations, which has led to breakthroughs in almost all machine learning
areas and started the revolution of deep learning. As we go deeper into CNNs and graphs, we
find the keys of CNNs: local connection, shared weights, and the use of multiple layers [LeCun
et al., 2015]. These are also of great importance in solving problems in the graph domain, because
(1) graphs are the most typical locally connected structure, (2) shared weights reduce the
computational cost compared with traditional spectral graph theory [Chung and Graham, 1997],
and (3) the multi-layer structure is the key to dealing with hierarchical patterns, capturing features
of various sizes. However, CNNs can only operate on regular Euclidean data like images (2D grids)
and text (1D sequences), which can also be regarded as instances of graphs. Therefore, it is
straightforward to think of finding the generalization of CNNs to graphs. As shown in Figure 1.1,
it is hard to define localized convolutional filters and pooling operators, which hinders the
transformation of CNNs from the Euclidean to the non-Euclidean domain.
Figure 1.1: Left: image in Euclidean space. Right: graph in non-Euclidean space.
CHAPTER 2
Basics of Math and Graph
2.1 LINEAR ALGEBRA
The norm of a vector measures its length. The $L_p$ norm is defined as follows:

$\|\mathbf{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$  (2.2)
The $L_1$ norm, $L_2$ norm, and $L_\infty$ norm are often used in machine learning.
The $L_1$ norm can be simplified as

$\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|.$  (2.3)

In Euclidean space $\mathbb{R}^n$, the $L_2$ norm is used to measure the length of vectors, where

$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}.$  (2.4)
The $L_\infty$ norm is also called the max norm, as

$\|\mathbf{x}\|_\infty = \max_i |x_i|.$  (2.5)
With the $L_p$ norm, the distance between two vectors $\mathbf{x}_1, \mathbf{x}_2$ (where $\mathbf{x}_1$ and $\mathbf{x}_2$ are in the same linear space) can be defined as

$D_p(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_p.$  (2.6)

A set of vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$ is linearly independent if and only if there does not exist a set of scalars $\lambda_1, \lambda_2, \ldots, \lambda_m$, which are not all 0, such that

$\lambda_1 \mathbf{x}_1 + \lambda_2 \mathbf{x}_2 + \cdots + \lambda_m \mathbf{x}_m = \mathbf{0}.$  (2.7)
where $A \in \mathbb{R}^{m \times n}$.
Given two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, their matrix product $AB$ can be denoted as $C \in \mathbb{R}^{m \times p}$, where

$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}.$  (2.9)
It can be proved that matrix multiplication is associative but not necessarily commutative. In mathematical language,

$(AB)C = A(BC)$  (2.10)

holds for arbitrary matrices $A$, $B$, and $C$ (presuming that the multiplication is legitimate). Yet

$AB = BA$  (2.11)

is not always true.
For each $n \times n$ square matrix $A$, its determinant (also denoted as $|A|$) is defined as

$\det(A) = \sum_{k_1 k_2 \cdots k_n} (-1)^{\tau(k_1 k_2 \cdots k_n)} a_{1k_1} a_{2k_2} \cdots a_{nk_n},$  (2.12)

where the sum is taken over all permutations $k_1 k_2 \cdots k_n$ of $1, 2, \ldots, n$ and $\tau(k_1 k_2 \cdots k_n)$ denotes the number of inversions of the permutation.
2.1.2 EIGENDECOMPOSITION
Let $A$ be a matrix in $\mathbb{R}^{n \times n}$. A nonzero vector $\mathbf{v} \in \mathbb{C}^n$ is called an eigenvector of $A$ if there exists a scalar $\lambda \in \mathbb{C}$ such that

$A\mathbf{v} = \lambda \mathbf{v}.$  (2.16)

Here the scalar $\lambda$ is an eigenvalue of $A$ corresponding to the eigenvector $\mathbf{v}$. If matrix $A$ has $n$ eigenvectors $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n\}$ that are linearly independent, corresponding to the eigenvalues $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$, then it can be deduced that

$A \begin{bmatrix} \mathbf{v}_1 & \mathbf{v}_2 & \cdots & \mathbf{v}_n \end{bmatrix} = \begin{bmatrix} \mathbf{v}_1 & \mathbf{v}_2 & \cdots & \mathbf{v}_n \end{bmatrix} \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n).$  (2.17)

Let $V = \begin{bmatrix} \mathbf{v}_1 & \mathbf{v}_2 & \cdots & \mathbf{v}_n \end{bmatrix}$; then it is clear that $V$ is an invertible matrix. We have the eigendecomposition of $A$ (also called diagonalization)

$A = V \operatorname{diag}(\boldsymbol{\lambda}) V^{-1}.$  (2.18)

It can also be written in the following form:

$A = \sum_{i=1}^{n} \lambda_i \mathbf{v}_i \mathbf{v}_i^T.$  (2.19)

However, not all square matrices can be diagonalized in such a form because a matrix may not have $n$ linearly independent eigenvectors. Fortunately, it can be proved that every real symmetric matrix has an eigendecomposition.
2.1.3 SINGULAR VALUE DECOMPOSITION
As eigendecomposition can only be applied to certain matrices, we introduce the singular value decomposition, which is a generalization to all matrices.
First we need to introduce the concept of singular values. Let $r$ denote the rank of $A^T A$; then there exist $r$ positive scalars $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$ such that for $1 \leq i \leq r$, $\mathbf{v}_i$ is an eigenvector of $A^T A$ with corresponding eigenvalue $\sigma_i^2$. Note that $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_r$ are linearly independent. The $r$ positive scalars $\sigma_1, \sigma_2, \ldots, \sigma_r$ are called the singular values of $A$. Then we have the singular value decomposition

$A = U \Sigma V^T,$  (2.20)

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices and $\Sigma$ is an $m \times n$ matrix defined as follows:

$\Sigma_{ij} = \begin{cases} \sigma_i & \text{if } i = j \leq r, \\ 0 & \text{otherwise.} \end{cases}$

In fact, the column vectors of $U$ are eigenvectors of $AA^T$, and the eigenvectors of $A^T A$ make up the column vectors of $V$.
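As a quick numerical illustration (my own addition, not an example from the book), the relation between the singular values of $A$ and the eigenvalues of $A^T A$ can be checked with NumPy; the matrix below is arbitrary.

```python
import numpy as np

# A small arbitrary matrix (hypothetical example, m = 3, n = 2).
A = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 1.0]])

# Singular value decomposition: A = U @ Sigma @ V^T.
U, sigma, VT = np.linalg.svd(A)

# The eigenvalues of A^T A equal the squared singular values of A.
eigvals = np.linalg.eigvalsh(A.T @ A)            # ascending order
print(np.allclose(np.sort(sigma**2), eigvals))   # True

# Reconstruct A from its SVD to verify the factorization.
Sigma = np.zeros_like(A)
np.fill_diagonal(Sigma, sigma)
print(np.allclose(U @ Sigma @ VT, A))            # True
```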
2.2 PROBABILITY THEORY
Suppose $X$ is a random variable with two possible values $x_1$ and $x_2$; then the probabilities of the two values sum to one:

$P(X = x_1) + P(X = x_2) = 1.$  (2.21)

Suppose there is another random variable $Y$ that has $y_1$ as a possible value. The probability that $X = x_1$ and $Y = y_1$ is written as $P(X = x_1, Y = y_1)$, which is called the joint probability of $X = x_1$ and $Y = y_1$.
Sometimes we need to know the relationship between random variables, like the probability of $X = x_1$ on the condition that $Y = y_1$, which can be written as $P(X = x_1 | Y = y_1)$. We call this the conditional probability of $X = x_1$ given $Y = y_1$. With the concepts above, we can write the following two fundamental rules of probability theory:

$P(X = x) = \sum_{y} P(X = x, Y = y),$  (2.22)
$P(X = x, Y = y) = P(Y = y | X = x) P(X = x).$  (2.23)

The former is the sum rule while the latter is the product rule. Slightly modifying the form of the product rule, we get another useful formula:

$P(Y = y | X = x) = \dfrac{P(X = x, Y = y)}{P(X = x)} = \dfrac{P(X = x | Y = y) P(Y = y)}{P(X = x)},$  (2.24)

which is the famous Bayes formula. Note that it also holds for more than two variables:

$P(X_i = x_i | Y = y) = \dfrac{P(Y = y | X_i = x_i) P(X_i = x_i)}{\sum_{j=1}^{n} P(Y = y | X_j = x_j) P(X_j = x_j)}.$  (2.25)
• Bernoulli distribution: if the random variable $X$ takes the value 1 with probability $p$ and the value 0 with probability $1 - p$, then

$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}.$  (2.31)

• Binomial distribution: repeat the Bernoulli experiment $N$ times and denote by $Y$ the number of times that $X$ equals 1; then

$P(Y = k) = \binom{N}{k} p^k (1 - p)^{N - k}$  (2.32)

is the Binomial distribution, satisfying $E(Y) = Np$ and $\mathrm{Var}(Y) = Np(1 - p)$.
• Degree matrix: for a graph $G = (V, E)$ with $n$ vertices, its degree matrix $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix, where

$D_{ii} = d(v_i).$

• Laplacian matrix: for a simple graph $G = (V, E)$ with $n$ vertices, if we consider all edges in $G$ to be undirected, then its Laplacian matrix $L \in \mathbb{R}^{n \times n}$ can be defined as

$L = D - A.$
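To make the two definitions concrete, the short sketch below (my own illustration, not code from the book) builds the degree matrix and the Laplacian of a small, made-up undirected graph from its adjacency matrix.

```python
import numpy as np

# Adjacency matrix of a small undirected graph: a triangle plus one pendant node
# (a hypothetical example graph).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Degree matrix: D_ii = d(v_i), the number of edges incident to node i.
D = np.diag(A.sum(axis=1))

# Laplacian matrix: L = D - A.
L = D - A

# Basic sanity checks: L is symmetric and each row sums to zero.
assert np.allclose(L, L.T)
assert np.allclose(L.sum(axis=1), 0)
print(L)
```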
CHAPTER 3
Basics of Neural Networks
3.1 NEURON
The basic units of neural networks are neurons, which receive a series of inputs and return the corresponding output. A classic neuron is shown in Figure 3.1: the neuron receives $n$ inputs $x_1, x_2, \ldots, x_n$ with corresponding weights $w_1, w_2, \ldots, w_n$ and an offset $b$. The weighted summation $y = \sum_{i=1}^{n} w_i x_i + b$ then passes through an activation function $f$, and the neuron returns the output $z = f(y)$. Note that the output can be the input of the next neuron.
The activation function is a kind of function that maps a real number to a number between 0 and 1 (with rare exceptions), which represents the activation of the neuron, where 0 indicates deactivated and 1 indicates fully activated. Several useful activation functions are shown as follows.
• Sigmoid function (Figure 3.2):

$\sigma(x) = \dfrac{1}{1 + e^{-x}}.$  (3.1)

• Tanh function (Figure 3.3):

$\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}.$  (3.2)
Figure 3.1: A classic neuron with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$, activation function $f$, and output $z$.
Figure 3.2: The sigmoid function.
In fact, there are many other activation functions, each with its corresponding derivative. But do remember that a good activation function is always smooth (meaning it is continuously differentiable) and easily calculated (in order to minimize the computational complexity of the neural network). During the training of a neural network, the choice of the activation function is usually essential to the outcome.
By the chain rule, we can deduce the derivatives of $z$ with respect to $w_i$ and $b$:

$\dfrac{\partial z}{\partial w_i} = \dfrac{\partial z}{\partial y} \dfrac{\partial y}{\partial w_i} = \dfrac{\partial f(y)}{\partial y} x_i$  (3.4)

$\dfrac{\partial z}{\partial b} = \dfrac{\partial z}{\partial y} \dfrac{\partial y}{\partial b} = \dfrac{\partial f(y)}{\partial y}.$  (3.5)

With a learning rate of $\eta$ (and $z_0$ denoting the target output), the update for each parameter will be:

$\Delta w_i = \eta (z_0 - z) \dfrac{\partial z}{\partial w_i} = \eta (z_0 - z) x_i \dfrac{\partial f(y)}{\partial y}$  (3.6)
Figure 3.5: An example of a feedforward neural network with an input layer, two hidden layers, and an output layer.
$\Delta b = \eta (z_0 - z) \dfrac{\partial z}{\partial b} = \eta (z_0 - z) \dfrac{\partial f(y)}{\partial y}.$  (3.7)
In summary, the process of back propagation consists of the following two steps.
• Forward calculation: given a set of parameters and an input, the neural network computes the values at each neuron in a forward order.
• Backward propagation: compute the error at each variable to be optimized, and update the parameters with their corresponding partial derivatives in a backward order.
The above two steps are repeated until the optimization target is reached.
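The following sketch (my own illustration, not the book's code) runs these two steps for the single sigmoid neuron of Section 3.1: a forward pass computing $z = f(y)$ and a backward pass applying the updates of Eqs. (3.6) and (3.7). The inputs, the target output $z_0$, and the learning rate $\eta$ are all made up.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# Hypothetical data: n = 3 inputs, a target output z0, and a learning rate eta.
x = np.array([0.5, -1.0, 2.0])
w = np.zeros(3)
b = 0.0
z0 = 1.0
eta = 0.1

for step in range(100):
    # Forward calculation: weighted summation plus offset, then the activation.
    y = w @ x + b
    z = sigmoid(y)

    # Backward propagation: df(y)/dy for the sigmoid is z * (1 - z), so
    # Eq. (3.6) gives dw_i = eta * (z0 - z) * x_i * z * (1 - z)
    # and Eq. (3.7) gives db = eta * (z0 - z) * z * (1 - z).
    grad_f = z * (1.0 - z)
    w += eta * (z0 - z) * grad_f * x
    b += eta * (z0 - z) * grad_f

print(z)  # approaches the target z0 = 1.0 as the updates are repeated
```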
3.3 NEURAL NETWORKS
• Feedforward neural network: The feedforward neural network (FNN) (Figure 3.5) was the first and is the simplest architecture of artificial neural networks. An FNN usually contains an input layer, several hidden layers, and an output layer. The feedforward neural network has a clear hierarchical structure: it consists of multiple layers of neurons, and each layer is only connected to its neighboring layers. There are no loops in this network.
• Convolutional neural network: Convolutional neural networks (CNNs) are special versions of FNNs. FNNs are usually fully connected networks, while CNNs preserve the local connectivity. The CNN architecture usually contains convolutional layers, pooling layers, and several fully connected layers. There exist several classical CNN architectures such as LeNet5 [LeCun et al., 1998], AlexNet [Krizhevsky et al., 2012] (Figure 3.6), VGG [Simonyan and Zisserman, 2014], and GoogLeNet [Szegedy et al., 2015]. CNNs are widely used in the area of computer vision and have proven to be effective in many other research fields.
• Recurrent neural network: In comparison with FNNs, the neurons in a recurrent neural network (RNN) receive not only signals and inputs from other neurons but also their own historical information. The memory mechanism in RNNs helps the model process sequential data effectively. However, RNNs usually suffer from the problem of long-term dependencies [Bengio et al., 1994, Hochreiter et al., 2001]. Several variants are proposed to solve the problem by incorporating gate mechanisms, such as GRU [Cho et al., 2014] and LSTM [Hochreiter and Schmidhuber, 1997]. RNNs are widely used in the area of speech and natural language processing.
• Graph neural network: The GNN is designed specifically to handle graph-structured
data, such as social networks, molecular structures, knowledge graphs, etc. Detailed de-
scriptions of GNNs will be covered in the later chapters of this book.
CHAPTER 4
Vanilla Graph Neural Networks
4.1 INTRODUCTION
The concept of the GNN was first proposed in Gori et al. [2005] and Scarselli et al. [2004, 2009]. For simplicity, we will talk about the model proposed in Scarselli et al. [2009], which aims to extend existing neural networks to the processing of graph-structured data.
A node is naturally defined by its features and the related nodes in the graph. The target of the GNN is to learn a state embedding $\mathbf{h}_v \in \mathbb{R}^s$ for each node, which encodes the information of its neighborhood. The state embedding $\mathbf{h}_v$ is used to produce an output $\mathbf{o}_v$, such as the distribution of the predicted node label.
In Scarselli et al. [2009], a typical graph is illustrated in Figure 4.1. The vanilla GNN model deals with the undirected homogeneous graph where each node has its input features $\mathbf{x}_v$ and each edge may also have its features. The paper uses $co[v]$ and $ne[v]$ to denote the set of edges and the set of neighbors of node $v$. For processing more complicated graphs such as heterogeneous graphs, the corresponding variants of GNNs can be found in later chapters.
4.2 MODEL
Given the input features of nodes and edges, next we will talk about how the model obtains the node embedding $\mathbf{h}_v$ and the output embedding $\mathbf{o}_v$.
In order to update the node state according to the input neighborhood, there is a parametric function $f$, called the local transition function, shared among all nodes. In order to produce the output of the node, there is a parametric function $g$, called the local output function. Then, $\mathbf{h}_v$ and $\mathbf{o}_v$ are defined as follows:

$\mathbf{h}_v = f(\mathbf{x}_v, \mathbf{x}_{co[v]}, \mathbf{h}_{ne[v]}, \mathbf{x}_{ne[v]})$  (4.1)

$\mathbf{o}_v = g(\mathbf{h}_v, \mathbf{x}_v),$  (4.2)

where $\mathbf{x}$ denotes the input feature and $\mathbf{h}$ denotes the hidden state. $co[v]$ is the set of edges connected to node $v$ and $ne[v]$ is the set of neighbors of node $v$, so that $\mathbf{x}_v$, $\mathbf{x}_{co[v]}$, $\mathbf{h}_{ne[v]}$, and $\mathbf{x}_{ne[v]}$ are the features of $v$, the features of its edges, and the states and the features of the nodes in the neighborhood of $v$, respectively. In the example of node $l_1$ in Figure 4.1, $\mathbf{x}_{l_1}$ is the input feature of $l_1$, $co[l_1]$ contains the edges $l_{(1,4)}$, $l_{(1,6)}$, $l_{(1,2)}$, and $l_{(3,1)}$, and $ne[l_1]$ contains the nodes $l_2$, $l_3$, $l_4$, and $l_6$.
Let $\mathbf{H}$, $\mathbf{O}$, $\mathbf{X}$, and $\mathbf{X}_N$ be the matrices constructed by stacking all the states, all the outputs, all the features, and all the node features, respectively. Then we have a compact form:

$\mathbf{H} = F(\mathbf{H}, \mathbf{X})$  (4.3)

$\mathbf{O} = G(\mathbf{H}, \mathbf{X}_N),$  (4.4)

where $F$ is the global transition function and $G$ is the global output function. They are the stacked versions of the local transition function $f$ and the local output function $g$ for all nodes in a graph, respectively. The value of $\mathbf{H}$ is the fixed point of Eq. (4.3) and is uniquely defined under the assumption that $F$ is a contraction map.
Following Banach's fixed point theorem [Khamsi and Kirk, 2011], GNN uses the following classic iterative scheme to compute the state:

$\mathbf{H}^{t+1} = F(\mathbf{H}^t, \mathbf{X}),$  (4.5)

where $\mathbf{H}^t$ denotes the $t$-th iteration of $\mathbf{H}$. The dynamical system of Eq. (4.5) converges exponentially fast to the solution of Eq. (4.3) for any initial value $\mathbf{H}(0)$. Note that the computations described in $f$ and $g$ can be interpreted as FNNs.
After the introduction of the framework of GNN, the next question is how to learn the parameters of the local transition function $f$ and the local output function $g$. With the target information ($\mathbf{t}_v$ for a specific node) as the supervision, the loss can be written as:

$loss = \sum_{i=1}^{p} (\mathbf{t}_i - \mathbf{o}_i),$  (4.6)

where $p$ is the number of supervised nodes. The learning algorithm is based on a gradient-descent strategy and is composed of the following steps.
• The states $\mathbf{h}_v^t$ are iteratively updated by Eq. (4.1) until a time step $T$. Then we obtain an approximate fixed point solution of Eq. (4.3): $\mathbf{H}(T) \approx \mathbf{H}$.
• The gradient of the weights $\mathbf{W}$ is computed from the loss.
• The weights $\mathbf{W}$ are updated according to the gradient computed in the last step.
After running the algorithm, we can get a model trained for a specific supervised or semi-supervised task as well as the hidden states of the nodes in the graph. The vanilla GNN model provides an effective way to model graph data, and it is the first step toward incorporating neural networks into the graph domain.
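A minimal sketch of the fixed-point computation of Eqs. (4.3) and (4.5) is given below (my own illustration, not the authors' implementation). The transition function here is an arbitrary contraction-like map built from a row-normalized adjacency matrix and a small weight matrix; the graph, features, and all shapes are made up.

```python
import numpy as np

np.random.seed(0)

# Hypothetical graph with N = 5 nodes and s = 4 state dimensions.
N, s = 5, 4
A = (np.random.rand(N, N) < 0.4).astype(float)
np.fill_diagonal(A, 0)
A_hat = A / np.maximum(A.sum(axis=1, keepdims=True), 1)   # row-normalized

X = np.random.randn(N, s)                  # input node features
W = 0.3 * np.random.randn(s, s)            # small weights -> contraction-like map

def F(H, X):
    # Global transition function: each state is updated from neighbor states
    # and the node's own input features (a stand-in for Eq. (4.3)).
    return np.tanh(A_hat @ H @ W + X)

# Iterative scheme of Eq. (4.5): H^{t+1} = F(H^t, X), starting from any H^0.
H = np.zeros((N, s))
for t in range(100):
    H_next = F(H, X)
    if np.max(np.abs(H_next - H)) < 1e-6:  # converged to the fixed point
        break
    H = H_next

print(t, np.max(np.abs(F(H, X) - H)))      # small residual at the fixed point
```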
4.3 LIMITATIONS
Though experimental results showed that GNN is a powerful architecture for modeling structural data, there are still several limitations of the vanilla GNN.
• First, it is computationally inefficient to update the hidden states of nodes iteratively to get the fixed point. The model needs $T$ steps of computation to approximate the fixed point. If we relax the assumption of the fixed point, we can design a multi-layer GNN to get a stable representation of each node and its neighborhood.
• Second, the vanilla GNN uses the same parameters in every iteration, while most popular neural networks use different parameters in different layers, which serves as a hierarchical feature extraction method. Moreover, the update of node hidden states is a sequential process which can benefit from RNN kernels like GRU and LSTM.
• Third, there are also some informative features on the edges which cannot be effectively modeled in the vanilla GNN. For example, the edges in a knowledge graph carry relation types, and the message propagation through different edges should differ according to their types. Besides, how to learn the hidden states of edges is also an important problem.
• Last, if $T$ is rather large, it is unsuitable to use the fixed points if we focus on the representation of nodes instead of graphs, because the distribution of representations at the fixed point will be much smoother in value and less informative for distinguishing each node.
Beyond the vanilla GNN, several variants have been proposed to address these limitations. For example, the Gated Graph Neural Network (GGNN) [Li et al., 2016] is proposed to solve the first problem, and the Relational GCN (R-GCN) [Schlichtkrull et al., 2018] is proposed to deal with directed graphs. More details can be found in the following chapters.
CHAPTER 5
Graph Convolutional Networks
In this chapter, we will talk about graph convolutional networks (GCNs), which aim to generalize convolutions to the graph domain. As convolutional neural networks (CNNs) have achieved great success in the area of deep learning, it is intuitive to define the convolution operation on graphs. Advances in this direction are often categorized as spectral approaches and spatial approaches. As there are vast numbers of variants in each direction, we only list several classic models in this chapter.
5.1.2 CHEBNET
The spectral convolution can be approximated by a truncated expansion in terms of Chebyshev polynomials:

$\mathbf{g}_\theta \star \mathbf{x} \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\mathbf{L}}) \mathbf{x},$

with $\tilde{\mathbf{L}} = \frac{2}{\lambda_{max}} \mathbf{L} - \mathbf{I}_N$, where $\lambda_{max}$ denotes the largest eigenvalue of $\mathbf{L}$ and $\theta \in \mathbb{R}^K$ is now a vector of Chebyshev coefficients. The Chebyshev polynomials are defined recursively as $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$. It can be observed that the operation is $K$-localized since it is a $K$th-order polynomial in the Laplacian.
Defferrard et al. [2016] propose ChebNet. It uses this $K$-localized convolution to define a convolutional neural network, which removes the need to compute the eigenvectors of the Laplacian.
5.1.3 GCN
Kipf and Welling [2017] limit the layer-wise convolution operation to $K = 1$ to alleviate the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions. They further approximate $\lambda_{max} \approx 2$, and the equation simplifies to:

$\mathbf{g}_{\theta'} \star \mathbf{x} \approx \theta'_0 \mathbf{x} + \theta'_1 (\mathbf{L} - \mathbf{I}_N) \mathbf{x} = \theta'_0 \mathbf{x} - \theta'_1 \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}} \mathbf{x}$  (5.3)

with two free parameters $\theta'_0$ and $\theta'_1$. After constraining the number of parameters with $\theta = \theta'_0 = -\theta'_1$, we can obtain the following expression:

$\mathbf{g}_\theta \star \mathbf{x} \approx \theta \left( \mathbf{I}_N + \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}} \right) \mathbf{x}.$  (5.4)

Noting that stacking this operator could lead to numerical instabilities and exploding or vanishing gradients, Kipf and Welling [2017] introduce the renormalization trick: $\mathbf{I}_N + \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}} \rightarrow \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$, with $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}_N$ and $\tilde{\mathbf{D}}_{ii} = \sum_j \tilde{\mathbf{A}}_{ij}$. Finally, Kipf and Welling [2017] generalize the definition to a signal $\mathbf{X} \in \mathbb{R}^{N \times C}$ with $C$ input channels and $F$ filters for feature maps as follows:

$\mathbf{Z} = \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X} \boldsymbol{\Theta},$  (5.5)

where $\boldsymbol{\Theta} \in \mathbb{R}^{C \times F}$ is a matrix of filter parameters and $\mathbf{Z} \in \mathbb{R}^{N \times F}$ is the convolved signal matrix.
As a simplification of spectral methods, the GCN model could also be regarded as a spatial method, as listed in Section 5.2.
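Eq. (5.5) can be written in a few lines of NumPy. The sketch below (not the authors' implementation) applies the renormalization trick and a single graph convolution to a small, made-up graph with random features.

```python
import numpy as np

np.random.seed(0)

# Hypothetical graph: N = 4 nodes, C = 3 input channels, F = 2 output filters.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
X = np.random.randn(4, 3)
Theta = np.random.randn(3, 2)

# Renormalization trick: A_tilde = A + I_N, D_tilde_ii = sum_j A_tilde_ij.
A_tilde = A + np.eye(4)
d_tilde = A_tilde.sum(axis=1)
D_inv_sqrt = np.diag(d_tilde ** -0.5)

# Eq. (5.5): Z = D_tilde^{-1/2} A_tilde D_tilde^{-1/2} X Theta.
Z = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ Theta
print(Z.shape)  # (4, 2): N nodes, F filters
```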
5.1.4 AGCN
All of the above models use the original graph structure to denote relations between nodes. However, there may be implicit relations between different nodes, and the Adaptive Graph Convolution Network (AGCN) is proposed to learn the underlying relations [Li et al., 2018b]. AGCN learns a "residual" graph Laplacian $\mathbf{L}_{res}$ and adds it to the original Laplacian matrix:

$\hat{\mathbf{L}} = \mathbf{L} + \alpha \mathbf{L}_{res},$  (5.6)

where $\alpha$ is a parameter.
$\mathbf{L}_{res}$ is computed from a learned graph adjacency matrix $\hat{\mathbf{A}}$:

$\mathbf{L}_{res} = \mathbf{I} - \hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-\frac{1}{2}}, \qquad \hat{\mathbf{D}} = \operatorname{degree}(\hat{\mathbf{A}}),$  (5.7)

and $\hat{\mathbf{A}}$ is computed via a learned metric. The idea behind the adaptive metric is that the Euclidean distance is not suitable for graph-structured data and the metric should be adaptive to the task and input features. AGCN uses the generalized Mahalanobis distance:

$D(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T \mathbf{M} (\mathbf{x}_i - \mathbf{x}_j)},$  (5.8)

where $\mathbf{M}$ is a learned parameter that satisfies $\mathbf{M} = \mathbf{W}_d \mathbf{W}_d^T$, and $\mathbf{W}_d$ is the transform basis to the adaptive space. Then AGCN calculates the Gaussian kernel and normalizes $G$ to obtain the dense adjacency matrix $\hat{\mathbf{A}}$:

$G_{x_i, x_j} = \exp\left( -D(\mathbf{x}_i, \mathbf{x}_j) / (2\sigma^2) \right).$  (5.9)
5.2 SPATIAL METHODS
5.2.1 NEURAL FPS
Duvenaud et al. [2015] propose Neural FPs, which uses different weight matrices for nodes with different degrees:

$\mathbf{x} = \mathbf{h}_v^{t-1} + \sum_{i=1}^{|N_v|} \mathbf{h}_i^{t-1}$

$\mathbf{h}_v^t = \sigma\left( \mathbf{x} \mathbf{W}_t^{|N_v|} \right),$  (5.10)

where $\mathbf{W}_t^{|N_v|}$ is the weight matrix for nodes with degree $|N_v|$ at layer $t$, $N_v$ denotes the set of neighbors of node $v$, and $\mathbf{h}_v^t$ is the embedding of node $v$ at layer $t$. We can see from the equations that the model first adds the embeddings of the node itself and of its neighbors, and then uses $\mathbf{W}_t^{|N_v|}$ to transform the aggregated information.
5.2.2 PATCHY-SAN
The PATCHY-SAN model [Niepert et al., 2016] first selects and normalizes exactly $k$ neighbors for each node. The normalized neighborhood then serves as the receptive field and the convolutional operation is applied. In detail, the method has four steps.
Node Sequence Selection. This method does not process all nodes in the graph. Instead, it selects a sequence of nodes for processing. It first uses a graph labeling procedure to obtain an order of the nodes and thus the node sequence. Then the method uses a stride $s$ to select nodes from the sequence until a number of $w$ nodes are selected.
Neighborhood Assembly. In this step, the receptive fields of the nodes selected in the last step are constructed. The neighbors of each node are the candidates, and the model uses a simple breadth-first search to collect $k$ neighbors for each node. It first extracts the 1-hop neighbors of the node, and then considers higher-order neighbors until a total number of $k$ neighbors is extracted.
Graph Normalization. In this step, the algorithm aims to give an order to the nodes in the receptive field, so that this step maps from the unordered graph space to a vector space. This is the most important step, and the idea behind it is to assign nodes from two different graphs similar relative positions if they have similar structural roles. More details of this algorithm can be found in Niepert et al. [2016].
Convolutional Architecture. After the receptive fields are normalized in the last step, CNN architectures can be used. The normalized neighborhoods serve as receptive fields, and node and edge attributes are regarded as channels.
An illustration of this model can be found in Figure 5.1. This method tries to convert the graph learning problem into a traditional Euclidean data learning problem.
5.2.3 DCNN
Atwood and Towsley [2016] propose the diffusion-convolutional neural network (DCNN). Transition matrices are used to define the neighborhood for nodes in DCNN. For node classification, it has

$\mathbf{H} = f\left( \mathbf{W}^c \odot \mathbf{P}^* \mathbf{X} \right),$  (5.11)

where $\mathbf{X}$ is an $N \times F$ tensor of input features ($N$ is the number of nodes and $F$ is the number of features). $\mathbf{P}^*$ is an $N \times K \times N$ tensor which contains the power series $\{\mathbf{P}, \mathbf{P}^2, \ldots, \mathbf{P}^K\}$ of matrix $\mathbf{P}$, and $\mathbf{P}$ is the degree-normalized transition matrix obtained from the graph's adjacency matrix $\mathbf{A}$. Each entity is transformed to a diffusion-convolutional representation, which is a $K \times F$ matrix defined by $K$ hops of graph diffusion over $F$ features. It will then be combined with a $K \times F$ weight matrix and a nonlinear activation function $f$. Finally, $\mathbf{H}$ (which is $N \times K \times F$) denotes the diffusion representations of each node in the graph.
Figure 5.1: An illustration of the architecture of PATCHY-SAN, showing neighborhood assembly, graph normalization, feature extraction, and the CNN architecture. Note that the figure is only for illustration and it is not the true algorithm output.
For graph classification, DCNN simply takes the average of the nodes' representations,

$\mathbf{H} = f\left( \mathbf{W}^c \odot \mathbf{1}_N^T \mathbf{P}^* \mathbf{X} / N \right),$  (5.12)

where $\mathbf{1}_N$ is an $N \times 1$ vector of ones. DCNN can also be applied to edge classification tasks, which requires converting edges to nodes and augmenting the adjacency matrix.
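The power-series tensor $\mathbf{P}^*$ used in Eqs. (5.11) and (5.12) can be built directly from the adjacency matrix. The sketch below (my own illustration with made-up sizes, not the reference implementation) computes the diffusion representations for node classification.

```python
import numpy as np

np.random.seed(0)

N, F, K = 5, 3, 2                          # hypothetical sizes
A = (np.random.rand(N, N) < 0.5).astype(float)
np.fill_diagonal(A, 0)
X = np.random.randn(N, F)

# Degree-normalized transition matrix P from the adjacency matrix A.
P = A / np.maximum(A.sum(axis=1, keepdims=True), 1)

# Power series {P, P^2, ..., P^K}, stacked into an N x K x N tensor P_star.
P_star = np.stack([np.linalg.matrix_power(P, k) for k in range(1, K + 1)], axis=1)

# Diffusion-convolutional representations: H is N x K x F (Eq. (5.11)),
# with element-wise weights W_c and a tanh nonlinearity standing in for f.
W_c = np.random.randn(K, F)
H = np.tanh(W_c * np.einsum('nkm,mf->nkf', P_star, X))
print(H.shape)  # (5, 2, 3)
```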
5.2.4 DGCN
Zhuang and Ma [2018] propose the dual graph convolutional network (DGCN) to jointly consider the local consistency and global consistency on graphs. It uses two convolutional networks to capture the local and global consistency and adopts an unsupervised loss to ensemble them. The first convolutional network is the same as Eq. (5.5), and the second network replaces the adjacency matrix with the positive pointwise mutual information (PPMI) matrix:

$\mathbf{H}' = \sigma\left( \mathbf{D}_P^{-\frac{1}{2}} \mathbf{X}_P \mathbf{D}_P^{-\frac{1}{2}} \mathbf{H} \boldsymbol{\Theta} \right),$  (5.13)

where $\mathbf{X}_P$ is the PPMI matrix, $\mathbf{D}_P$ is the diagonal degree matrix of $\mathbf{X}_P$, and $\sigma$ is a nonlinear activation function.
The motivations for jointly using the two perspectives are: (1) Eq. (5.5) models the local consistency, which indicates that nearby nodes may have similar labels, and (2) Eq. (5.13) models the global consistency, which assumes that nodes with similar context may have similar labels. The local consistency convolution and the global consistency convolution are named $Conv_A$ and $Conv_P$, respectively.
Zhuang and Ma [2018] further ensemble these two convolutions via the final loss function, which can be written as:

$L = L_0(Conv_A) + \lambda(t) L_{reg}(Conv_A, Conv_P),$  (5.14)

where $\lambda(t)$ is the dynamic weight to balance the importance of these two loss terms.
$L_0(Conv_A)$ is the supervised loss function with given node labels. If we have $c$ different labels to predict, $\mathbf{Z}^A$ denotes the output matrix of $Conv_A$, and $\hat{\mathbf{Z}}^A$ denotes the output of $\mathbf{Z}^A$ after a softmax operation, then the loss $L_0(Conv_A)$, which is the cross-entropy error, can be written as:

$L_0(Conv_A) = -\frac{1}{|y_L|} \sum_{l \in y_L} \sum_{i=1}^{c} Y_{l,i} \ln \hat{Z}^A_{l,i},$  (5.15)

where $y_L$ is the set of training data indices and $\mathbf{Y}$ is the ground truth.
Figure 5.2: Architecture of the dual graph convolutional network (DGCN) model.
$L_{reg}(Conv_A, Conv_P) = \frac{1}{n} \sum_{i=1}^{n} \left\| \hat{\mathbf{Z}}^P_{i,:} - \hat{\mathbf{Z}}^A_{i,:} \right\|^2,$  (5.16)

where $\hat{\mathbf{Z}}^P$ denotes the output of $Conv_P$ after the softmax operation. Thus, $L_{reg}(Conv_A, Conv_P)$ is the unsupervised loss function measuring the difference between $\hat{\mathbf{Z}}^P$ and $\hat{\mathbf{Z}}^A$. The architecture of this model is shown in Figure 5.2.
5.2.5 LGCN
Gao et al. [2018] propose the learnable graph convolutional network (LGCN). The network is based on the learnable graph convolutional layer (LGCL) and a sub-graph training strategy. We will give the details of the learnable graph convolutional layer in this section.
LGCL leverages CNNs as aggregators. It performs max pooling on nodes' neighborhood matrices to get the top-$k$ feature elements and then applies a 1-D CNN to compute hidden representations. The propagation step of LGCL is formulated as:

$\hat{\mathbf{H}}_t = g(\mathbf{H}_t, \mathbf{A}, k)$

$\mathbf{H}_{t+1} = c(\hat{\mathbf{H}}_t),$  (5.17)

where $\mathbf{A}$ is the adjacency matrix, $g(\cdot)$ is the $k$-largest node selection operation, and $c(\cdot)$ denotes the regular 1-D CNN.
The model uses the $k$-largest node selection operation to gather information for each node. For a given node $x$, the features of its neighbors are first gathered; suppose it has $n$ neighbors and each node has $c$ features, then a matrix $\mathbf{M} \in \mathbb{R}^{n \times c}$ is obtained. If $n$ is less than $k$, then $\mathbf{M}$ is padded with rows of zeros. Then the $k$-largest node selection is conducted: we rank the values in each column and select the top-$k$ values. After that, the embedding of the node $x$ is inserted into the first row of the matrix, and finally we get a matrix $\hat{\mathbf{M}} \in \mathbb{R}^{(k+1) \times c}$.
Figure 5.3: An example of the learnable graph convolutional layer (LGCL). Each node has three features and this layer selects $k = 4$ neighbors. The node has five neighbors and four of them are selected. The $k$-largest node selection procedure is shown in the left part, where the four largest values are selected in each column. Then a 1-D CNN is performed to get the final output.
After the matrix $\hat{\mathbf{M}}$ is obtained, the model uses the regular 1-D CNN to aggregate the features. The function $c(\cdot)$ should take a matrix of size $N \times (k+1) \times C$ as input and output a matrix of dimension $N \times D$ or $N \times 1 \times D$. Figure 5.3 gives an example of the LGCL.
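The $k$-largest node selection can be sketched as follows (my own illustration, not the authors' code). It builds the $(k+1) \times c$ matrix $\hat{\mathbf{M}}$ for a single node from the feature rows of its neighbors, padding with zeros when there are fewer than $k$ neighbors.

```python
import numpy as np

def k_largest_selection(x_feat, neighbor_feats, k):
    """Build the (k+1) x c matrix used by LGCL for a single node.

    x_feat: (c,) features of the node itself.
    neighbor_feats: (n, c) features of its n neighbors.
    """
    n, c = neighbor_feats.shape
    M = neighbor_feats
    if n < k:  # pad with zeros when there are fewer than k neighbors
        M = np.vstack([M, np.zeros((k - n, c))])
    # Rank the values in each column independently and keep the top-k of each.
    topk = -np.sort(-M, axis=0)[:k]
    # Insert the node's own embedding as the first row.
    return np.vstack([x_feat[None, :], topk])

# Hypothetical node with c = 3 features and 5 neighbors, selecting k = 4.
np.random.seed(0)
x = np.ones(3)
neighbors = np.random.randn(5, 3)
M_hat = k_largest_selection(x, neighbors, k=4)
print(M_hat.shape)  # (5, 3), i.e., (k + 1) x c
```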
5.2.6 MONET
Monti et al. [2017] propose a spatial-domain model (MoNet) on non-Euclidean domains which could generalize several previous techniques. The Geodesic CNN (GCNN) [Masci et al., 2015] and Anisotropic CNN (ACNN) [Boscaini et al., 2016] on manifolds, or GCN [Kipf and Welling, 2017] and DCNN [Atwood and Towsley, 2016] on graphs, could be formulated as particular instances of MoNet.
We use $x$ to denote a node in the graph and $y \in N_x$ to denote a neighbor node of $x$. The MoNet model computes the pseudo-coordinates $\mathbf{u}(x, y)$ between the node and its neighbor and uses a weighting function over these coordinates:

$D_j(x) f = \sum_{y \in N_x} w_j(\mathbf{u}(x, y)) f(y),$  (5.18)
Table 5.1: Different settings for different methods in the MoNet framework
where the parameters are $\mathbf{w}_\Theta(\mathbf{u}) = (w_1(\mathbf{u}), \ldots, w_J(\mathbf{u}))$ and $J$ represents the size of the extracted patch. Then a spatial generalization of the convolution on non-Euclidean domains is defined as:

$(f \star g)(x) = \sum_{j=1}^{J} g_j D_j(x) f.$  (5.19)
Then other methods can be regarded as a special case with different coordinates u and weight
functions w.u/. As we only focus on deep learning on graphs, we list several settings in Table 5.1.
More details could be found in the original paper.
5.2.7 GRAPHSAGE
Hamilton et al. [2017b] propose GraphSAGE, a general inductive framework. The framework generates embeddings by sampling and aggregating features from a node's local neighborhood. The propagation step of GraphSAGE is:

$\mathbf{h}_{N_v}^t = \text{AGGREGATE}_t\left( \{ \mathbf{h}_u^{t-1}, \forall u \in N_v \} \right)$

$\mathbf{h}_v^t = \sigma\left( \mathbf{W}^t \cdot [\mathbf{h}_v^{t-1} \| \mathbf{h}_{N_v}^t] \right).$  (5.20)

• Mean aggregator. The mean aggregator is different from the other aggregators because it does not perform the concatenation operation which concatenates $\mathbf{h}_v^{t-1}$ and $\mathbf{h}_{N_v}^t$ in Eq. (5.20). It could be viewed as a form of "skip connection" [He et al., 2016b] and could achieve better performance.
• LSTM aggregator. Hamilton et al. [2017b] also use an LSTM-based aggregator which has a larger expressive capability. However, LSTMs process inputs in a sequential manner, so they are not permutation invariant. Hamilton et al. [2017b] adapt LSTMs to operate on an unordered set by permuting the node's neighbors.
• Pooling aggregator. In the pooling aggregator, each neighbor's hidden state is fed through a fully connected layer, and then a max-pooling operation is applied to the set of the node's neighbors:

$\mathbf{h}_{N_v}^t = \max\left( \{ \sigma(\mathbf{W}_{pool} \mathbf{h}_u^{t-1} + \mathbf{b}), \forall u \in N_v \} \right).$  (5.22)

Note that any symmetric function could be used in place of the max-pooling operation here.
To learn better representations, GraphSAGE further employs an unsupervised loss function which encourages nearby nodes to have similar representations while distant nodes have different representations:

$J_G(\mathbf{z}_u) = -\log\left( \sigma(\mathbf{z}_u^T \mathbf{z}_v) \right) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)} \log\left( \sigma(-\mathbf{z}_u^T \mathbf{z}_{v_n}) \right),$  (5.23)

where $v$ is a neighbor of node $u$, $P_n$ is a negative sampling distribution, and $Q$ is the number of negative samples.
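A minimal sketch of one propagation step of Eq. (5.20), using uniform neighbor sampling and an element-wise mean as the AGGREGATE function, is shown below (my own simplified illustration, not the reference implementation).

```python
import numpy as np

np.random.seed(0)

def sample_neighbors(adj_list, v, num_samples):
    """Uniformly sample a fixed-size neighborhood for node v."""
    neighbors = adj_list[v]
    idx = np.random.choice(len(neighbors), size=num_samples, replace=True)
    return [neighbors[i] for i in idx]

def graphsage_layer(H, adj_list, W, num_samples=3):
    """One propagation step: aggregate sampled neighbors, concatenate, project."""
    out = []
    for v in range(len(adj_list)):
        sampled = sample_neighbors(adj_list, v, num_samples)
        h_neigh = H[sampled].mean(axis=0)            # AGGREGATE (element-wise mean)
        h_cat = np.concatenate([H[v], h_neigh])      # [h_v || h_N(v)]
        out.append(np.maximum(W @ h_cat, 0))         # sigma = ReLU here
    return np.stack(out)

# Hypothetical graph with 4 nodes and 2-dimensional features.
adj_list = {0: [1, 3], 1: [0, 2], 2: [1], 3: [0]}
H0 = np.random.randn(4, 2)
W = np.random.randn(2, 4)                            # maps the 4-dim concat to 2 dims
H1 = graphsage_layer(H0, adj_list, W)
print(H1.shape)  # (4, 2)
```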
CHAPTER 6
Graph Recurrent Networks
The node $v$ first aggregates messages from its neighbors, where $\mathbf{A}_v$ is the sub-matrix of the graph adjacency matrix $\mathbf{A}$ that denotes the connections of node $v$ with its neighbors. The GRU-like update functions then use information from each node's neighbors and from the previous timestep to update the node's hidden state. The vector $\mathbf{a}$ gathers the neighborhood information of node $v$, $\mathbf{z}$ and $\mathbf{r}$ are the update and reset gates, and $\odot$ is the Hadamard product operation.
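A sketch of this GRU-like update, based only on the description above (it is not the authors' code, and all weight names and shapes are made up), is given below.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(H, A, params):
    """One GRU-like propagation step over all nodes.

    H: (N, d) node hidden states.  A: (N, N) adjacency matrix.
    params: dict of weight matrices, each of shape (d, d).
    """
    a = A @ H                                             # gather neighbor information
    z = sigmoid(a @ params['Wz'] + H @ params['Uz'])      # update gate
    r = sigmoid(a @ params['Wr'] + H @ params['Ur'])      # reset gate
    h_tilde = np.tanh(a @ params['W'] + (r * H) @ params['U'])
    return (1 - z) * H + z * h_tilde                      # Hadamard-product combination

np.random.seed(0)
N, d = 4, 3
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
H = np.random.randn(N, d)
params = {k: 0.1 * np.random.randn(d, d) for k in ['Wz', 'Uz', 'Wr', 'Ur', 'W', 'U']}
H = ggnn_step(H, A, params)
print(H.shape)  # (4, 3)
```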
The GGNN model is designed for problems defined on graphs which require outputting
sequences while existing models focus on producing single outputs such as node-level or graph-
level classifications.
Li et al. [2016] further propose Gated Graph Sequence Neural Networks (GGS-NNs), which use several GGNNs to produce an output sequence $\mathbf{o}^{(1)}, \ldots, \mathbf{o}^{(K)}$. As shown in Figure 6.1, for the $k$-th output step, the matrix of node annotations is denoted as $\mathbf{X}^{(k)}$. Two GGNNs are used in this architecture: (1) $F_o^{(k)}$ for predicting $\mathbf{o}^{(k)}$ from $\mathbf{X}^{(k)}$, and (2) $F_x^{(k)}$ for predicting $\mathbf{X}^{(k+1)}$ from $\mathbf{X}^{(k)}$. We use $\mathbf{H}^{(k,t)}$ to denote the $t$-th propagation step of the $k$-th output step. The value of $\mathbf{H}^{(k,1)}$ at each step $k$ is initialized by $\mathbf{X}^{(k)}$. $F_o^{(k)}$ and $F_x^{(k)}$ can be different models or share the same parameters.
The model has been used on the bAbI task as well as the program verification task and has demonstrated its effectiveness.
where $\mathbf{x}_v^t$ is the input vector at time $t$ in the standard LSTM setting and $\odot$ is the Hadamard product operation.
If the number of children of each node in a tree is at most $K$ and the children can be ordered from 1 to $K$, then the $N$-ary Tree-LSTM can be applied. For node $v$, $\mathbf{h}_{vk}^t$ and $\mathbf{c}_{vk}^t$ denote the hidden state and memory cell of its $k$-th child at time $t$, respectively. The transition equations are the following:

$\mathbf{i}_v^t = \sigma\left( \mathbf{W}^i \mathbf{x}_v^t + \sum_{l=1}^{K} \mathbf{U}_l^i \mathbf{h}_{vl}^{t-1} + \mathbf{b}^i \right)$

$\mathbf{f}_{vk}^t = \sigma\left( \mathbf{W}^f \mathbf{x}_v^t + \sum_{l=1}^{K} \mathbf{U}_{kl}^f \mathbf{h}_{vl}^{t-1} + \mathbf{b}^f \right)$

$\mathbf{o}_v^t = \sigma\left( \mathbf{W}^o \mathbf{x}_v^t + \sum_{l=1}^{K} \mathbf{U}_l^o \mathbf{h}_{vl}^{t-1} + \mathbf{b}^o \right)$  (6.3)

$\mathbf{u}_v^t = \tanh\left( \mathbf{W}^u \mathbf{x}_v^t + \sum_{l=1}^{K} \mathbf{U}_l^u \mathbf{h}_{vl}^{t-1} + \mathbf{b}^u \right)$

$\mathbf{c}_v^t = \mathbf{i}_v^t \odot \mathbf{u}_v^t + \sum_{l=1}^{K} \mathbf{f}_{vl}^t \odot \mathbf{c}_{vl}^{t-1}$

$\mathbf{h}_v^t = \mathbf{o}_v^t \odot \tanh(\mathbf{c}_v^t).$
The two types of Tree-LSTMs can be easily adapted to graphs. The graph-structured LSTM in Zayats and Ostendorf [2018] is an example of the $N$-ary Tree-LSTM applied to a graph. However, it is a simplified version, since each node in the graph has at most two incoming edges (from its parent and its sibling predecessor). Peng et al. [2017] propose another variant of the Graph LSTM based on the relation extraction task. The main difference between graphs and trees is that the edges of graphs have labels, and Peng et al. [2017] utilize different weight matrices
to represent different labels:
$\mathbf{i}_v^t = \sigma\left( \mathbf{W}^i \mathbf{x}_v^t + \sum_{k \in N_v} \mathbf{U}^i_{m(v,k)} \mathbf{h}_k^{t-1} + \mathbf{b}^i \right)$

$\mathbf{f}_{vk}^t = \sigma\left( \mathbf{W}^f \mathbf{x}_v^t + \mathbf{U}^f_{m(v,k)} \mathbf{h}_k^{t-1} + \mathbf{b}^f \right)$

$\mathbf{o}_v^t = \sigma\left( \mathbf{W}^o \mathbf{x}_v^t + \sum_{k \in N_v} \mathbf{U}^o_{m(v,k)} \mathbf{h}_k^{t-1} + \mathbf{b}^o \right)$  (6.4)

$\mathbf{u}_v^t = \tanh\left( \mathbf{W}^u \mathbf{x}_v^t + \sum_{k \in N_v} \mathbf{U}^u_{m(v,k)} \mathbf{h}_k^{t-1} + \mathbf{b}^u \right)$

$\mathbf{c}_v^t = \mathbf{i}_v^t \odot \mathbf{u}_v^t + \sum_{k \in N_v} \mathbf{f}_{vk}^t \odot \mathbf{c}_k^{t-1}$

$\mathbf{h}_v^t = \mathbf{o}_v^t \odot \tanh(\mathbf{c}_v^t),$

where $m(v, k)$ denotes the edge label between nodes $v$ and $k$, and $\odot$ is the Hadamard product operation.
Liang et al. [2016] propose a Graph LSTM network to address the semantic object parsing task. It uses a confidence-driven scheme to adaptively select the starting node and determine the node updating sequence. It follows the same idea of generalizing existing LSTMs to graph-structured data but has a specific updating sequence, while the methods mentioned above are agnostic to the order of nodes.
Figure 6.2: The propagation step of the S-LSTM model. The dashed lines connect the supernode $g$ with its neighbors from the last layer. The solid lines connect the word nodes with their neighbors from the last layer.
CHAPTER 7
Graph Attention Networks
7.1 GAT
Velickovic et al. [2018] propose the graph attention network (GAT), which incorporates the attention mechanism into the propagation step. It follows the self-attention strategy, and the hidden state of each node is computed by attending over its neighbors.
Velickovic et al. [2018] define a single graph attentional layer and construct arbitrary graph attention networks by stacking this layer. The layer computes the coefficients in the attention mechanism of the node pair $(i, j)$ by:

$\alpha_{ij} = \dfrac{\exp\left( \text{LeakyReLU}\left( \mathbf{a}^T [\mathbf{W}\mathbf{h}_i \| \mathbf{W}\mathbf{h}_j] \right) \right)}{\sum_{k \in N_i} \exp\left( \text{LeakyReLU}\left( \mathbf{a}^T [\mathbf{W}\mathbf{h}_i \| \mathbf{W}\mathbf{h}_k] \right) \right)},$  (7.1)

where $\alpha_{ij}$ is the attention coefficient of node $j$ to $i$, and $N_i$ represents the neighborhood of node $i$ in the graph. The input node features are denoted as $\mathbf{h} = \{\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_N\}$, $\mathbf{h}_i \in \mathbb{R}^F$, where $N$ is the number of nodes and $F$ is the feature dimension; the output node features (with cardinality $F'$) are denoted as $\mathbf{h}' = \{\mathbf{h}'_1, \mathbf{h}'_2, \ldots, \mathbf{h}'_N\}$, $\mathbf{h}'_i \in \mathbb{R}^{F'}$. $\mathbf{W} \in \mathbb{R}^{F' \times F}$ is the weight matrix of a shared linear transformation which is applied to every node, and $\mathbf{a} \in \mathbb{R}^{2F'}$ is the weight vector. The coefficients are normalized by a softmax function, and the LeakyReLU nonlinearity (with negative input slope $\alpha = 0.2$) is applied.
Then the final output features of each node can be obtained by (after applying a nonlinearity $\sigma$):

$\mathbf{h}'_i = \sigma\left( \sum_{j \in N_i} \alpha_{ij} \mathbf{W} \mathbf{h}_j \right).$  (7.2)
Figure 7.1: The illustration of the GAT model. Left: the attention mechanism employed in the model. Right: an illustration of multi-head attention (with three heads denoted by different colors) by node 1 on its neighborhood.
Moreover, the layer utilizes multi-head attention, similarly to Vaswani et al. [2017], to stabilize the learning process. It applies $K$ independent attention mechanisms to compute the hidden states and then concatenates their features (or computes the average), resulting in the following two output representations:

$\mathbf{h}'_i = \Big\Vert_{k=1}^{K} \sigma\left( \sum_{j \in N_i} \alpha_{ij}^k \mathbf{W}^k \mathbf{h}_j \right)$  (7.3)

$\mathbf{h}'_i = \sigma\left( \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N_i} \alpha_{ij}^k \mathbf{W}^k \mathbf{h}_j \right),$  (7.4)

where $\alpha_{ij}^k$ is the normalized attention coefficient computed by the $k$-th attention mechanism.
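The attention coefficients of Eq. (7.1) and the aggregation of Eq. (7.2) can be sketched as follows (a single-head illustration of my own with made-up dimensions, not the reference implementation).

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, A, W, a):
    """Single-head graph attention layer.

    H: (N, F) input features.  A: (N, N) adjacency matrix (with self-loops).
    W: (F, F_out) shared linear transform.  a: (2 * F_out,) attention vector.
    """
    N = H.shape[0]
    WH = H @ W
    H_out = np.zeros_like(WH)
    for i in range(N):
        neigh = np.nonzero(A[i])[0]
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for each neighbor j (Eq. 7.1).
        e = np.array([leaky_relu(a @ np.concatenate([WH[i], WH[j]])) for j in neigh])
        alpha = np.exp(e) / np.exp(e).sum()               # softmax over the neighborhood
        # h'_i = sigma(sum_j alpha_ij W h_j), with tanh as the nonlinearity (Eq. 7.2).
        H_out[i] = np.tanh((alpha[:, None] * WH[neigh]).sum(axis=0))
    return H_out

np.random.seed(0)
A = np.eye(4) + np.array([[0, 1, 0, 1],
                          [1, 0, 1, 0],
                          [0, 1, 0, 0],
                          [1, 0, 0, 0]])
H = np.random.randn(4, 3)
H_new = gat_layer(H, A, W=np.random.randn(3, 2), a=np.random.randn(4))
print(H_new.shape)  # (4, 2)
```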
7.2 GAAN
Besides GAT, the Gated Attention Network (GaAN) [Zhang et al., 2018b] also uses the multi-head attention mechanism. The difference between the attention aggregator in GaAN and the one in GAT is that GaAN uses the key-value attention mechanism and the dot-product attention, while GAT uses a fully connected layer to compute the attention coefficients.
Furthermore, GaAN assigns different weights to different heads by computing an additional soft gate. This aggregator is called the gated attention aggregator. In detail, GaAN uses a convolutional network that takes the features of the center node and its neighbors to generate gate values. As a result, it could outperform GAT as well as other GNN models with different aggregators on the inductive node classification problem.
CHAPTER 8
Graph Residual Networks
The aim of adding highway gates is to give the network the ability to select from new and old hidden states, so that an early hidden state can be propagated to the final state if it is needed. By adding the highway gates, the performance peaks at four layers and does not change much when more layers are added, in a specific problem discussed in Rahimi et al. [2018].
The Column Network (CLN) proposed in Pham et al. [2017] also utilizes the highway network, but it uses a different function to compute the gating weights, which is selected based on the specific task.
Figure 8.1: The illustration of the Jump Knowledge Network. N.A. stands for neighborhood aggregation.
As the number of layers increases, the receptive field of a node can grow exponentially; thus, much more noise may also be incorporated and the representation could become smoother. Meanwhile, for a node which is far from the core of the graph, the number of its neighbors could be relatively small even if we expand its receptive field. Thus, this kind of node lacks sufficient information to learn a good representation.
Xu et al. [2018] propose the Jump Knowledge Network, which can learn adaptive, structure-aware representations. The Jump Knowledge Network selects from all of the intermediate representations (which "jump" to the last layer) for each node at the last layer, which enables each node to select an effective neighborhood size as needed. Xu et al. [2018] use three approaches, concatenation, max-pooling, and LSTM-attention, in their experiments to aggregate the information. An illustration of the Jump Knowledge Network can be found in Figure 8.1.
The idea of the Jump Knowledge Network is straightforward and it performs well in experiments on social, bioinformatics, and citation networks. It can also be combined with models like GCN, GraphSAGE, and graph attention networks to improve their performance.
8.3 DEEPGCNS
Li et al. [2019] borrow ideas from CNNs to add skip connections to graph neural networks. There are two major challenges in stacking more layers of GNNs: vanishing gradients and over-smoothing. Li et al. [2019] use residual connections and dense connections from ResNet [He et al., 2016b] and DenseNet [Huang et al., 2017] to solve the vanishing gradient problem, and use dilated convolutions [Yu and Koltun, 2015] to solve the over-smoothing problem.
Li et al. [2019] denote the vanilla GCN as PlainGCN and further propose ResGCN and DenseGCN. In PlainGCN, the computation of hidden states is

$\mathbf{H}^{t+1} = F(\mathbf{H}^t, \mathbf{W}^t).$  (8.2)

For ResGCN, the computation is

$\mathbf{H}_{Res}^{t+1} = \mathbf{H}^{t+1} + \mathbf{H}^t = F(\mathbf{H}^t, \mathbf{W}^t) + \mathbf{H}^t,$  (8.3)

where the matrix of hidden states $\mathbf{H}^t$ is directly added to the matrix after the graph convolution. For DenseGCN, the computation is

$\mathbf{H}_{Dense}^{t+1} = T(\mathbf{H}^{t+1}, \mathbf{H}^t, \ldots, \mathbf{H}^0) = T\left( F(\mathbf{H}^t, \mathbf{W}^t), F(\mathbf{H}^{t-1}, \mathbf{W}^{t-1}), \ldots, \mathbf{H}^0 \right),$  (8.4)

where $T$ is a vertex-wise concatenation function. Thus, the dimension of the hidden states grows with the number of layers. Figure 8.2 gives a straightforward demonstration of these three architectures.
Li et al. [2019] further use dilated convolutions [Yu and Koltun, 2015] to solve the over-smoothing problem. The paper uses a Dilated k-NN method with dilation rate $d$. For each node, the method first computes the $k \times d$ nearest neighbors using a pre-defined metric and then selects neighbors by skipping every $d$ neighbors. For example, if $(u_1, u_2, \ldots, u_{k \times d})$ are the $k \times d$ nearest neighbors of a node $v$, then the dilated neighborhood of node $v$ is $(u_1, u_{1+d}, u_{1+2d}, \ldots, u_{1+(k-1)d})$. The dilated convolution leverages information from different contexts and enlarges the receptive field of node $v$, and it is proven to be effective. Figure 8.3 shows the dilated convolution.
The dilated k-NN is added to the ResGCN and DenseGCN models. Li et al. [2019] conduct experiments on the task of point cloud semantic segmentation and build a 56-layer GCN to achieve promising results.
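The dilated k-NN neighborhood selection can be sketched as follows (my own illustration, with the Euclidean distance standing in for the pre-defined metric).

```python
import numpy as np

def dilated_knn(points, v, k, d):
    """Return the dilated neighborhood of node v.

    Computes the k * d nearest neighbors of v under Euclidean distance and
    keeps every d-th of them: (u_1, u_{1+d}, u_{1+2d}, ..., u_{1+(k-1)d}).
    """
    dist = np.linalg.norm(points - points[v], axis=1)
    dist[v] = np.inf                               # exclude the node itself
    nearest = np.argsort(dist)[: k * d]            # the k * d nearest neighbors
    return nearest[::d][:k]                        # skip every d neighbors

np.random.seed(0)
points = np.random.randn(20, 3)                    # hypothetical point cloud
print(dilated_knn(points, v=0, k=4, d=1))          # ordinary 4-NN
print(dilated_knn(points, v=0, k=4, d=3))          # dilation rate 3
```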
Figure 8.2: DeepGCN blocks (PlainGCN, ResGCN, DenseGCN) proposed in Li et al. [2019].
Figure 8.3: An example of the dilated convolution. The dilation rate is 1, 2, 3 for figures from
left to right.
CHAPTER 9
Variants for Different Graph Types
where $\mathbf{A}_a^k$ denotes the part of the adjacency matrix that contains the $k$-hop edges for the ancestor propagation phase and $\mathbf{A}_d^k$ contains the $k$-hop edges for the descendant propagation phase. $\mathbf{D}_a^k$ and $\mathbf{D}_d^k$ are the corresponding degree matrices for $\mathbf{A}_a^k$ and $\mathbf{A}_d^k$.
9.2 HETEROGENEOUS GRAPHS
The second variant of graphs is the heterogeneous graph, where there are several kinds of nodes.
Definition 9.1 A heterogeneous graph can be represented as a directed graph $G = \{V, E\}$ with a node type mapping $\phi: V \rightarrow A$ and a relation type mapping $\psi: E \rightarrow R$. $V$ denotes the node set and $E$ denotes the edge set. $A$ denotes the node type set while $R$ denotes the edge type set, and $|A| > 1$ or $|R| > 1$ holds.
The simplest way to process a heterogeneous graph is to convert the type of each node to a one-hot feature vector which is concatenated with the original features. GraphInception [Zhang et al., 2018e] introduces the concept of the metapath into the propagation on heterogeneous graphs.
Definition 9.2 A meta-path $P$ of a heterogeneous graph $G = \{V, E\}$ is a path of the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} A_3 \rightarrow \cdots \xrightarrow{R_L} A_{L+1}$, and $L + 1$ is the length of $P$.
With metapaths, we can group the neighbors according to their node types and distances. In this way, the heterogeneous graph can be transformed into a set of homogeneous graphs, which is called a multi-channel network. Formally, given a set of meta-paths $S = \{P_1, \ldots, P_{|S|}\}$, the translated multi-channel network $G'$ is defined as

$G' = \left\{ G'_\ell \mid G'_\ell = (V_1, E_{1\ell}),\ \ell = 1, \ldots, |S| \right\},$  (9.3)

where $V_1, \ldots, V_m$ denote node sets with $m$ different types, $V_1$ is the target node type set, and $E_{1\ell} \subseteq V_1 \times V_1$ denotes the meta-path instances of $P_\ell$. For each neighbor group, GraphInception treats it as a sub-graph in a homogeneous graph to do propagation and concatenates the propagation results from different homogeneous graphs to get a collective node representation. Instead of the Laplacian matrix $\mathbf{L}$, GraphInception uses the transition probability matrix $\mathbf{P}$ as the Fourier basis.
Recently, Wang et al. [2019b] propose the heterogeneous graph attention network (HAN), which utilizes node-level and semantic-level attention. First, for each meta-path, HAN learns a specific node embedding through node-level attention aggregation. Then, with the meta-path-specific embeddings, HAN applies semantic-level attention to obtain a more comprehensive node embedding. Overall, the model has the ability to consider node importance and meta-path importance simultaneously.
For the event detection task, Peng et al. [2019] propose the Pairwise Popularity Graph Convolutional Network (PP-GCN) to detect events on social networks. The model first calculates a weighted average between events for different meta-paths on event graphs. It then builds a weighted adjacency matrix to annotate social event instances and performs GCN [Kipf and Welling, 2017] on it to learn event embeddings.
In order to reduce training cost, ActiveHNE [Chen et al., 2019] introduces active
learning into heterogeneous graph learning. Based on uncertainty and representativeness, ActiveHNE selects the most valuable nodes in the training set for label acquisition. This method significantly reduces the query cost and at the same time achieves state-of-the-art performance on real-world datasets.
9.3 GRAPHS WITH EDGE INFORMATION
Figure 9.1: An example of an AMR graph and its corresponding Levi graph. Left: the AMR graph of the sentence "The boy wants to ride the red horse." Right: the Levi transformation of the AMR graph.
Here each $\mathbf{W}_r$ is a linear combination of basis transformations, which can be viewed as a kind of weight sharing strategy:

$\mathbf{W}_r = \sum_{b=1}^{B} a_{rb} \mathbf{V}_b,$  (9.5)

with basis transformations $\mathbf{V}_b \in \mathbb{R}^{d_{in} \times d_{out}}$ and coefficients $a_{rb}$. In the block-diagonal decomposition, R-GCN defines each $\mathbf{W}_r$ through the direct sum over a set of low-dimensional matrices, which needs more parameters than the first one:

$\mathbf{W}_r = \bigoplus_{b=1}^{B} \mathbf{Q}_{br}.$  (9.6)

Thereby, $\mathbf{W}_r = \operatorname{diag}(\mathbf{Q}_{1r}, \ldots, \mathbf{Q}_{Br})$ is composed of $\mathbf{Q}_{br} \in \mathbb{R}^{(d^{(l+1)}/B) \times (d^{(l)}/B)}$. The block-diagonal decomposition constrains the sparsity of the weight matrices and encodes the hypothesis that latent vectors can be grouped into small parts.
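Both decompositions can be written in a few lines. The sketch below (my own illustration with made-up sizes, not the reference implementation) builds the per-relation weight matrices with the basis decomposition of Eq. (9.5) and with the block-diagonal decomposition of Eq. (9.6).

```python
import numpy as np
from scipy.linalg import block_diag

np.random.seed(0)
num_rel, B, d_in, d_out = 3, 2, 4, 4               # hypothetical sizes

# Basis decomposition: W_r = sum_b a_rb * V_b, with shared bases V_b.
V = np.random.randn(B, d_in, d_out)                # B basis transformations
a = np.random.randn(num_rel, B)                    # per-relation coefficients
W_basis = np.einsum('rb,bio->rio', a, V)           # (num_rel, d_in, d_out)

# Block-diagonal decomposition: W_r = diag(Q_1r, ..., Q_Br),
# with each Q_br of size (d_in / B) x (d_out / B).
Q = np.random.randn(num_rel, B, d_in // B, d_out // B)
W_block = np.stack([block_diag(*Q[r]) for r in range(num_rel)])

print(W_basis.shape, W_block.shape)                # (3, 4, 4) (3, 4, 4)
```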
Figure 9.2: An example of spatial temporal graph. Each G t indicates a frame of current graph
state at time t .
In contrast, Structural-RNN [Jain et al., 2016] and ST-GCN [Yan et al., 2018] collect spatial and temporal messages at the same time. They extend the static graph structure with temporal connections so that they can apply traditional GNNs on the extended graphs. Structural-RNN adds edges between the same node at time steps $t$ and $t + 1$ to construct a comprehensive representation of spatio-temporal graphs. The model then represents each node with a nodeRNN and each edge with an edgeRNN; the edgeRNNs and nodeRNNs form a bipartite graph, and the forward pass is performed for each node.
ST-GCN [Yan et al., 2018] stacks the graph frames of all time steps to construct spatial-temporal graphs. The model partitions the graph and assigns a weight vector for each node, and then performs graph convolution directly on the weighted spatial-temporal graph.
Graph WaveNet [Wu et al., 2019d] considers a more challenging setting where the adjacency matrix of the static graph does not reflect the genuine spatial dependencies, i.e., some dependencies are missing or some are deceptive, which is ubiquitous because the distance between nodes does not necessarily imply a causal relationship. They propose a self-adaptive adjacency matrix which is learned within the framework and use a Temporal Convolution Network (TCN) together with a GCN to address the problem.
Figure 9.3: An example of a multi-dimensional graph, with its single-dimensional graphs and its expansion.
For example, in a multi-dimensional social graph, the relation type among users can be "subscription," "sharing," or "comment" [Ma et al., 2019]. Because the relation types are naturally not independent of each other, directly applying models designed for single-dimensional graphs might not be an optimal solution.
Early works on multi-dimensional graphs mainly focus on community detection and clustering. Berlingerio et al. [2011] introduce the notion of a "multidimensional community" and provide two different measures to disambiguate the definition of density in multi-dimensional graphs. Papalexakis et al. [2013] provide concrete algorithms to find clusters across all the dimensions.
More recently, special types of GCNs suitable for multi-dimensional graphs have been designed. Ma et al. [2019] handle the problem by providing separate embeddings for a node in different dimensions, where these embeddings are viewed as projections from a general representation. They design an aggregation mechanism that takes into account both the interactions between different nodes in the same dimension and the interactions between different dimensions of a single node. Khan and Blumenstock [2019] reduce the multi-dimensional graph to a single-dimensional one in two steps: they first merge the multiple views via subspace analysis and then prune the graph through manifold learning. The resulting single-dimensional graph is passed into a normal GCN for learning. Sun et al. [2018] mainly focus on node embedding of the network and extend the node embedding algorithm SVNE to a multi-dimensional setting.
CHAPTER 10
Variants for Advanced Training Methods
10.1 SAMPLING
The original graph neural network has several drawbacks in training and optimization. For example, GCN requires the full-graph Laplacian, which is computationally expensive for large graphs. Furthermore, GCN is trained on a fixed graph, so it lacks the ability to perform inductive learning.
GraphSAGE [Hamilton et al., 2017b] is a comprehensive improvement over the original GCN. To solve the problems mentioned above, GraphSAGE replaces the full-graph Laplacian with learnable aggregation functions, which are key to performing message passing and generalizing to unseen nodes. As shown in Eq. (5.20), it first aggregates the neighborhood embeddings, concatenates them with the target node's embedding, and then propagates the result to the next layer. With learned aggregation and propagation functions, GraphSAGE can generate embeddings for unseen nodes. GraphSAGE also uses a random neighbor sampling method to alleviate receptive field expansion.
Compared to GCN [Kipf and Welling, 2017], GraphSAGE provides a way to train the model on mini-batches of nodes instead of the full-graph Laplacian. This enables training on large graphs, though it may be time-consuming.
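A minimal sketch of one mean-aggregation layer with random neighbor sampling, in the spirit of the description above; the sample size, nonlinearity, and normalization constant are our own choices, not the authors' exact implementation.

```python
import random
import numpy as np

def sample_neighbors(adj_list, node, k):
    """Uniformly sample k neighbors of `node` with replacement (fall back to self if isolated)."""
    neigh = adj_list[node]
    return [random.choice(neigh) for _ in range(k)] if neigh else [node]

def graphsage_mean_layer(h, adj_list, nodes, W, k=10):
    """One GraphSAGE-style layer: sampled mean aggregation, then concatenation and transform.

    h: {node: feature vector of size d}; W: weight matrix of shape (2 * d, d_out).
    Returns updated, l2-normalized embeddings for the given batch of nodes.
    """
    out = {}
    for v in nodes:
        neigh = sample_neighbors(adj_list, v, k)
        h_neigh = np.mean([h[u] for u in neigh], axis=0)   # aggregate neighborhood embeddings
        z = np.concatenate([h[v], h_neigh]) @ W            # concatenate with the target node
        z = np.maximum(z, 0.0)                             # ReLU nonlinearity
        out[v] = z / (np.linalg.norm(z) + 1e-8)            # l2 normalization
    return out
```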
PinSage [Ying et al., 2018a] is an extended version of GraphSAGE for large graphs. It uses an importance-based sampling method, since simple random sampling is suboptimal because of the increased variance. PinSage defines the importance-based neighborhood of node $u$ as the $T$ nodes that exert the most influence on node $u$. By simulating random walks starting from target nodes, this approach calculates the $L_1$-normalized visit counts of the nodes reached by the random walks. The top $T$ nodes with the highest normalized visit counts with respect to $u$ are then selected as the neighborhood of node $u$.
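The importance-based neighborhood can be approximated with short random walks as sketched below; the walk length and the number of walks are hypothetical hyperparameters, and the function is only an illustration of the idea.

```python
import random
from collections import Counter

def importance_based_neighborhood(adj_list, u, T=20, num_walks=200, walk_len=2):
    """Return the T nodes most visited by random walks from u, with L1-normalized counts."""
    counts = Counter()
    for _ in range(num_walks):
        cur = u
        for _ in range(walk_len):
            if not adj_list[cur]:
                break
            cur = random.choice(adj_list[cur])
            counts[cur] += 1
    total = sum(counts.values()) or 1
    ranked = sorted(((v, c / total) for v, c in counts.items()), key=lambda x: -x[1])
    return ranked[:T]   # list of (neighbor, normalized visit count) pairs
```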
Figure 10.1: An illustration of the sampled neighborhood on an example graph; $K$ denotes the number of hops of the neighborhood.
FastGCN [Chen et al., 2018a] goes one step further: instead of sampling neighbors for each node, it directly samples the receptive field for each layer according to an importance distribution
$$q(v) \propto \frac{1}{|\mathcal{N}_v|} \sum_{u \in \mathcal{N}_v} \frac{1}{|\mathcal{N}_u|}, \qquad (10.1)$$
where $\mathcal{N}_v$ is the neighborhood of node $v$. The sampling distribution is the same for each layer.
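A small sketch of how the layer-wise distribution of Eq. (10.1) could be computed and used to draw one layer of nodes; it assumes every node has at least one neighbor, and the batch size is a hypothetical parameter.

```python
import numpy as np

def layerwise_sampling_distribution(adj_list, nodes):
    """Compute q(v) of Eq. (10.1) for the given nodes and normalize it to sum to 1."""
    q = np.array([
        sum(1.0 / len(adj_list[u]) for u in adj_list[v]) / len(adj_list[v])
        for v in nodes
    ])
    return q / q.sum()

def sample_layer(adj_list, nodes, num_samples):
    """Draw a fixed-size node set for one layer; the same q is reused for every layer."""
    q = layerwise_sampling_distribution(adj_list, nodes)
    idx = np.random.choice(len(nodes), size=num_samples, replace=False, p=q)
    return [nodes[i] for i in idx]
```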
In contrast to the fixed sampling methods above, Huang et al. [2018] introduce a parameterized and trainable sampler to perform layer-wise sampling. The authors learn a self-dependent function $g(x(u_j))$ for each node to determine its importance for sampling based on the node feature $x(u_j)$. The sampling distribution is defined as
$$q(u_j) = \frac{\sum_{i=1}^{n} p(u_j \mid v_i)\,\big|g\big(x(u_j)\big)\big|}{\sum_{j=1}^{N} \sum_{i=1}^{n} p(u_j \mid v_i)\,\big|g\big(x(u_j)\big)\big|}. \qquad (10.2)$$
Furthermore, this adaptive sampler can find the optimal sampling importance and reduce variance simultaneously.
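Assuming the conditional probabilities $p(u_j \mid v_i)$ and the learned importance scores $g(x(u_j))$ are given, Eq. (10.2) reduces to the small computation sketched below.

```python
import numpy as np

def adaptive_sampling_distribution(p, g):
    """Layer-wise sampling weights in the spirit of Eq. (10.2).

    p: array of shape (n, N) with p(u_j | v_i) for the n parent nodes v_i and
    N candidate nodes u_j; g: array of shape (N,) with the learned importance
    g(x(u_j)). Returns the normalized distribution q over the candidates.
    """
    scores = p.sum(axis=0) * np.abs(g)   # sum_i p(u_j | v_i) * |g(x(u_j))|
    return scores / scores.sum()
```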
Many graph analytic problems are solved iteratively and finally reach steady states. Following the idea of reinforcement learning, SSE [Dai et al., 2018] proposes Stochastic Fixed-Point Gradient Descent for GNN training to learn such steady-state solutions automatically from examples. This method views the embedding update as the value function and the parameter update as the policy function. During training, the algorithm alternately samples nodes to update embeddings and samples labeled nodes to update parameters.
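The alternating training procedure can be summarized by the loop below; the two update callbacks stand in for whatever embedding and parameter updates the model defines, and the batch sizes are hypothetical.

```python
import random

def sse_style_training(graph, labels, update_embeddings, update_parameters,
                       n_steps=1000, nodes_per_step=256, labeled_per_step=64):
    """Alternate stochastic embedding updates and stochastic parameter updates."""
    all_nodes = list(graph)
    labeled_nodes = list(labels)
    for _ in range(n_steps):
        # "value" step: push sampled nodes toward their steady-state embeddings
        update_embeddings(graph, random.sample(all_nodes, min(nodes_per_step, len(all_nodes))))
        # "policy" step: gradient update on a sample of labeled nodes
        update_parameters(graph, random.sample(labeled_nodes, min(labeled_per_step, len(labeled_nodes))))
```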
Chen et al. [2018b] propose a control-variate-based stochastic approximation algorithm for GCN by utilizing the historical activations of nodes as a control variate. This method maintains the historical average activations $\bar{h}_v^{(l)}$ to approximate the true activations $h_v^{(l)}$. The advantage of this approach is that it limits the receptive field of nodes to the 1-hop neighborhood by using the historical hidden states as an affordable approximation, and the approximation is further proved to have zero variance.
where Xt is the matrix of node features and At is the coarsened adjacency matrix of layer t .
The graph auto-encoder (GAE) [Kipf and Welling, 2016] first uses a GCN encoder to obtain node embeddings $Z$ and then reconstructs the adjacency matrix with a simple inner-product decoder:
$$Z = \mathrm{GCN}(X, A), \qquad \tilde{A} = Z Z^{T}. \qquad (10.4)$$
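A minimal sketch of a graph auto-encoder following Eq. (10.4) is shown below; the two-layer GCN encoder and the sigmoid on the inner-product decoder (so the output reads as edge probabilities) are our own choices.

```python
import torch
import torch.nn as nn

class GAE(nn.Module):
    """Graph auto-encoder: GCN encoder plus inner-product decoder (cf. Eq. 10.4)."""
    def __init__(self, in_dim, hid_dim, z_dim):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w1 = nn.Linear(hid_dim, z_dim, bias=False)

    def encode(self, x, a_norm):
        # Two-layer GCN encoder: Z = A_norm * ReLU(A_norm * X * W0) * W1
        return a_norm @ self.w1(torch.relu(a_norm @ self.w0(x)))

    def forward(self, x, a_norm):
        z = self.encode(x, a_norm)
        return torch.sigmoid(z @ z.t())   # reconstructed adjacency

# Training would minimize a (possibly re-weighted) binary cross-entropy between
# the reconstruction and the observed adjacency, e.g.
#   loss = nn.functional.binary_cross_entropy(model(x, a_norm), adj_target)
```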
Kipf and Welling [2016] also train the GAE model in a variational manner, and the resulting model is named the variational graph auto-encoder (VGAE). Furthermore, van den Berg et al. [2017] use GAE in recommender systems and propose the graph convolutional matrix completion model (GC-MC), which outperforms other baseline models on the MovieLens dataset.
Adversarially Regularized Graph Auto-encoder (ARGA) [Pan et al., 2018] employs gen-
erative adversarial networks (GANs) to regularize a GCN-based graph auto-encoder to follow
a prior distribution.
Deep Graph Infomax (DGI) [Veličković et al., 2019] aims to maximize local-global mutual information to learn representations. The local information comes from each node's hidden state after the graph convolution function $\mathcal{F}$. The global information $\vec{s}$ of a graph is computed by the readout function $\mathcal{R}$, which aggregates all node representations and is set to an average function in the paper. The paper uses node shuffling to obtain negative examples (by changing the node features from $X$ to $\tilde{X}$ with a corruption function $\mathcal{C}$). It then uses a discriminator $\mathcal{D}$ to classify positive and negative samples. The architecture of DGI is shown in Figure 10.2.
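The objective can be sketched as follows, with the readout $\mathcal{R}$ as a mean, the corruption $\mathcal{C}$ as a row shuffle of $X$, and a bilinear discriminator $\mathcal{D}$; these concrete choices follow the description above, but the architectural details are simplified.

```python
import torch
import torch.nn as nn

class DGIObjective(nn.Module):
    """Simplified Deep Graph Infomax loss; `encoder(x, adj)` plays the role of F."""
    def __init__(self, encoder, dim):
        super().__init__()
        self.encoder = encoder
        self.bilinear = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.bilinear)

    def forward(self, x, adj):
        h_pos = self.encoder(x, adj)                               # states of the real graph
        h_neg = self.encoder(x[torch.randperm(x.size(0))], adj)    # corrupted (shuffled) graph
        s = torch.sigmoid(h_pos.mean(dim=0))                       # global summary vector
        d_pos = torch.sigmoid(h_pos @ self.bilinear @ s)           # discriminator scores
        d_neg = torch.sigmoid(h_neg @ self.bilinear @ s)
        eps = 1e-8
        return -(torch.log(d_pos + eps).mean() + torch.log(1.0 - d_neg + eps).mean())
```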
There are also several graph auto-encoders such as NetRA [Yu et al., 2018b], DNGR [Cao et al., 2016], SDNE [Wang et al., 2016], and DRNE [Tu et al., 2018]; however, they do not use GNNs in their frameworks.
Figure 10.2: The architecture of DGI.
CHAPTER 11
General Frameworks
Apart from the different variants of graph neural networks, several general frameworks have been proposed that aim to integrate different models into a single framework. Gilmer et al. [2017] propose the message passing neural network (MPNN), a unified framework that generalizes several graph neural network and graph convolutional network methods. Wang et al. [2018b] propose the non-local neural network (NLNN) for computer vision tasks, which generalizes several "self-attention"-style methods [Hoshen, 2017, Vaswani et al., 2017, Velickovic et al., 2018]. Battaglia et al. [2018] propose the graph network (GN), which unifies the MPNN and NLNN methods as well as many other variants such as Interaction Networks [Battaglia et al., 2016, Watters et al., 2017], the Neural Physics Engine [Chang et al., 2017], CommNet [Sukhbaatar et al., 2016], structure2vec [Dai et al., 2016, Khalil et al., 2017], GGNN [Li et al., 2016], Relation Networks [Raposo et al., 2017, Santoro et al., 2017], Deep Sets [Zaheer et al., 2017], and PointNet [Qi et al., 2017a].
Figure 11.1: A spacetime non-local operation in a network trained for video classification. The response at position $x_i$ is computed as a weighted sum over all positions $x_j$; in this figure, only the highest-weighted ones are shown.
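The MPNN framework abstracts a model into a message passing phase, which runs for $T$ time steps, and a readout phase. Following Gilmer et al. [2017], the two phases can be written as
$$m_v^{t+1} = \sum_{w \in \mathcal{N}_v} M_t\big(h_v^t, h_w^t, e_{vw}\big), \qquad h_v^{t+1} = U_t\big(h_v^t, m_v^{t+1}\big), \qquad \hat{y} = R\big(\{ h_v^{T} \mid v \in V \}\big),$$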
where $T$ denotes the total number of time steps. The message function $M_t$, vertex update function $U_t$, and readout function $R$ can have different settings; hence, the MPNN framework can generalize several different models via different function settings. Here we give an example of generalizing GGNN; the function settings of other models can be found in Gilmer et al. [2017]. The function settings for GGNN are:
$$M_t\big(h_v^t, h_w^t, e_{vw}\big) = A_{e_{vw}} h_w^t,$$
$$U_t = \mathrm{GRU}\big(h_v^t, m_v^{t+1}\big), \qquad (11.3)$$
$$R = \sum_{v \in V} i\big(h_v^T, h_v^0\big) \odot j\big(h_v^T\big),$$
where $A_{e_{vw}}$ is an adjacency matrix, one for each edge label $e$. GRU is the gated recurrent unit introduced in Cho et al. [2014], and $i$ and $j$ are neural networks in the readout function $R$.
2. $\rho^{e \rightarrow h}$ uses $E'_i$ to aggregate the corresponding edge updates for node $i$ and obtains the result $\bar{e}'_i$.
4. $\rho^{e \rightarrow u}$ uses $E'$ to aggregate all edge updates into $\bar{e}'$, which will be further used in the computation of the global state.
5. $\rho^{h \rightarrow u}$ uses $H'$ to aggregate all node updates into $\bar{h}'$, which will be used in the update of the global state.
6. $\phi^{u}$ is designed to compute an update for the global attribute $u'$ with the information from $\bar{e}'$, $\bar{h}'$, and $u$.
Note that the order of these updates is not strictly enforced. For example, it is possible to proceed from global, to per-node, to per-edge updates. Moreover, the $\rho$ and $\phi$ functions need not be neural networks, though in this book we only focus on neural network implementations.
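To make the update and aggregation steps concrete, the following sketch runs one pass of a full GN block with summation aggregators; assuming all attributes are vectors of the same dimension is our simplification, and the three update functions are passed in as callables (small MLPs in practice).

```python
import numpy as np

def gn_block(nodes, edges, u, phi_e, phi_h, phi_u):
    """One pass of a simplified full GN block.

    nodes: {id: vector}; edges: [(sender, receiver, vector)]; u: global vector.
    All rho aggregators are elementwise sums here, which is only one possible choice.
    """
    dim = u.shape[0]
    # phi^e: per-edge updates from the edge, its endpoints, and the global state
    new_edges = [(s, r, phi_e(e, nodes[s], nodes[r], u)) for s, r, e in edges]
    # rho^{e->h}: aggregate updated edges per receiving node
    agg = {i: np.zeros(dim) for i in nodes}
    for s, r, e in new_edges:
        agg[r] += e
    # phi^h: per-node updates
    new_nodes = {i: phi_h(agg[i], h, u) for i, h in nodes.items()}
    # rho^{e->u} and rho^{h->u}: pool all edge and node updates for the global update
    e_bar = sum((e for _, _, e in new_edges), np.zeros(dim))
    h_bar = sum(new_nodes.values(), np.zeros(dim))
    # phi^u: global update
    return new_nodes, new_edges, phi_u(e_bar, h_bar, u)
```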
Design Principles. The design of GN is based on three basic principles: flexible representations, configurable within-block structure, and composable multi-block architectures.
• Flexible representations. The GN framework supports flexible representations of the at-
tributes as well as different graph structures. The global, node, and edge attributes can
use different kinds of representations and researchers usually use real-valued vectors and
tensors. One can simply tailor the output of a GN block according to specific demands of
tasks. For example, Battaglia et al. [2018] list several edge-focused [Hamrick et al., 2018,
Kipf et al., 2018], node-focused [Battaglia et al., 2016, Chang et al., 2017, Sanchez et al.,
2018, Wang et al., 2018a], and graph-focused [Battaglia et al., 2016, Gilmer et al., 2017,
Santoro et al., 2017] GNs. In terms of graph structures, the framework can be applied to
both structural scenarios where the graph structure is explicit and non-structural scenarios
where the relational structure should be inferred or assumed.
• Configurable within-block structure. The functions and their inputs within a GN block
can have different settings so that the GN framework provides flexibility in within-block
structure configuration. For example, Hamrick et al. [2018] and Sanchez et al. [2018] use
the full GN blocks. Their implementations use neural networks and their functions use
the elementwise summation. Based on different structure and function settings, a variety
of models (such as MPNN, NLNN, and other variants) could be expressed by the GN
framework. Figure 11.2a gives an illustration of a full GN block and other models can be
regarded as special variants of the GN block. For example, the MPNN uses the features
of nodes and edges as input and outputs graph-level and node-level representations. The
MPNN model does not use the graph-level input features and omits the learning process
of edge embeddings.
• Composable multi-block architectures. GN blocks could be composed to construct com-
plex architectures. Arbitrary numbers of GN blocks could be composed in sequence with
shared or unshared parameters. Battaglia et al. [2018] utilize GN blocks to construct an
encode-process-decode architecture and a recurrent GN-based architecture. These architec-
tures are demonstrated in Figure 11.3. Other techniques for building GN-based architec-
tures could also be useful, such as skip connections, LSTM-, or GRU-style gating schemes
and so on.
In conclusion, GN is a general and flexible framework for deep learning on graphs. It can be used for various tasks, such as modeling physical systems and traffic networks. However, GN still has its limitations. For example, it cannot solve some classes of problems, such as discriminating between certain non-isomorphic graphs.
Figure 11.2: Different internal GN block configurations. (a) a full GN block [Battaglia et al.,
2018]; (b) an independent recurrent block [Sanchez et al., 2018]; (c) an MPNN [Gilmer et al.,
2017]; (d) a NLNN [Wang et al., 2018b]; (e) a relation network [Raposo et al., 2017]; and (f ) a
deep set [Zaheer et al., 2017].
Figure 11.3: Examples of architectures composed by GN blocks. (a) The sequential processing
architecture; (b) The encode-process-decode architecture; and (c) The recurrent architecture.
CHAPTER 12
Applications – Structural Scenarios
In the following sections, we introduce GNN applications in structural scenarios, where the data naturally have a graph structure. For example, GNNs are widely used in social network prediction [Hamilton et al., 2017b, Kipf and Welling, 2017], traffic prediction [Rahimi et al., 2018], recommender systems [van den Berg et al., 2017, Ying et al., 2018a], and graph representation [Ying et al., 2018b]. Specifically, we discuss how to model real-world physical systems with object-relationship graphs, how to predict chemical properties of molecules and biological interaction properties of proteins, and the methods for reasoning about out-of-knowledge-base (OOKB) entities in knowledge graphs.
12.1 PHYSICS
Modeling real-world physical systems is one of the most basic aspects of understanding human
intelligence. By representing objects as nodes and relations as edges, we can perform GNN-
based reasoning about objects, relations, and physics in a simplified but effective way.
Battaglia et al. [2016] propose Interaction Networks to make predictions and inferences about various physical systems. The objects and relations of the current state are fed into the GNN to model their interactions, and physical dynamics are then applied to predict future states. They separately model relation-centric and object-centric parts, making it easier to generalize across different systems.
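One reasoning step of this kind of model can be sketched as below; the relation-centric model f_rel and the object-centric model f_obj are passed in as callables (small MLPs in practice), and assuming effects share the dimensionality of the object states is our simplification.

```python
import numpy as np

def interaction_step(objects, relations, f_rel, f_obj):
    """One Interaction-Network-style step: compute pairwise effects, then update objects.

    objects: {id: state vector}; relations: [(sender, receiver, relation vector)].
    """
    dim = next(iter(objects.values())).shape[0]
    effects = {i: np.zeros(dim) for i in objects}
    for s, r, rel in relations:
        effects[r] += f_rel(objects[s], objects[r], rel)     # relation-centric pass
    return {i: f_obj(state, effects[i]) for i, state in objects.items()}  # object-centric pass
```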
In CommNet [Sukhbaatar et al., 2016], interactions are not modeled explicitly. Instead,
an interaction vector is obtained by averaging all other agents’ hidden vectors.
VAIN [Hoshen, 2017] further introduces attention into the agent interaction process, which preserves both the advantages of modeling complex interactions and computational efficiency.
Visual Interaction Networks [Watters et al., 2017] make predictions directly from pixels. They learn a state code from two consecutive input frames for each object. Then, after adding the interaction effects computed by an Interaction Net block, the state decoder converts the state codes into the next step's states.
Sanchez et al. [2018] propose a GN-based model which could either perform state predic-
tion or inductive inference. The inference model takes partially observed information as input
and constructs a hidden graph for implicit system classification. Kipf et al. [2018] also build
graphs from object trajectories and adopt an encoder-decoder architecture for neural relational inference. In detail, the encoder returns a factorized distribution over the interaction graph $Z$ through a GNN, while the decoder generates trajectory predictions conditioned on both the latent code of the encoder and the previous time step of the trajectory.
Figure 12.1: A physical system and its corresponding graph representation. Colored nodes denote different objects and edges denote the interactions between them.
On the problem of solving partial differential equations, inspired by finite element methods [Hughes, 2012], graph element networks [Alet et al., 2019] place nodes in continuous space. Each node represents the local state of the system, and the model establishes a connectivity graph among the nodes. A GNN then propagates state information over the graph to simulate the dynamic system.
Figure 12.2: A single CH3OH molecule and its graph representation. Nodes are atoms and edges are bonds.
where $e_{uv}$ is the edge feature of edge $(u, v)$. The node representation is then updated by
$$h_v^{t+1} = W_t^{\deg(v)} h_{\mathcal{N}_v}^t, \qquad (12.2)$$
where $\deg(v)$ is the degree of node $v$ and $W_t^{N}$ is a learned matrix for each time step $t$ and node degree $N$.
Kearnes et al. [2016] further explicitly model atoms and atom pairs independently to emphasize atom interactions. Their model introduces edge representations $e_{uv}^t$ instead of an aggregation function, i.e., $h_{\mathcal{N}_v}^t = \sum_{u \in \mathcal{N}_v} e_{uv}^t$. The node update function is
$$h_v^{t+1} = \mathrm{ReLU}\Big(W_1\big(\mathrm{ReLU}\big(W_0 h_v^t\big), h_{\mathcal{N}_v}^t\big)\Big). \qquad (12.3)$$
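As an illustration of Eq. (12.3), the node update can be sketched as below; reading $W_1(\cdot, \cdot)$ as acting on the concatenation of its two arguments is our interpretation of the formula.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def node_update(h_v, incident_edge_reprs, W0, W1):
    """Node update of Eq. (12.3): h_v^{t+1} = ReLU(W1(ReLU(W0 h_v^t), h_{N_v}^t)).

    incident_edge_reprs: the e_uv^t of the incident edges, summed to form h_{N_v}^t.
    """
    h_nv = np.sum(incident_edge_reprs, axis=0)
    return relu(W1 @ np.concatenate([relu(W0 @ h_v), h_nv]))
```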
Beyond atom-level molecular graphs, some works [Jin et al., 2018, 2019] represent molecules as junction trees. A junction tree is generated by contracting certain vertices of the corresponding molecular graph into single nodes; the nodes of a junction tree are molecular substructures such as rings and bonds. Jin et al. [2018] leverage a variational auto-encoder to generate molecular graphs. Their model follows a two-step process: it first generates a junction-tree scaffold over chemical substructures and then combines them into a molecule with a graph message passing network. Jin et al. [2019] focus on molecular optimization, a task that aims to map one molecule to another molecular graph with better properties. The proposed VJTNN uses graph attention to decode the junction tree and incorporates a GAN for adversarial training to ensure valid graph translation.
To better explain the function of each substructure in a molecule, Lee et al. [2019] propose a game-theoretic approach to provide transparency for structured data. The model is set up as a two-player cooperative game between a predictor and a witness: the predictor is trained to minimize the discrepancy, while the goal of the witness is to test how well the predictor conforms to the transparency criterion.
Figure 12.3: Example of a knowledge base fragment. The nodes are entities and the edges are relations. The dashed line represents missing edge information to be inferred.
effects prediction. Their work models the drug and protein interaction network and deals with edges of different types separately.
Figure 12.4: Users, items, and attributes are nodes on the graph, and the interactions between them are edges. In this way, we can convert the rating prediction task into a link prediction task.
CHAPTER 13
Applications – Non-Structural Scenarios
In this chapter, we discuss applications in non-structural scenarios such as images, text, programming source code [Allamanis et al., 2018, Li et al., 2016], and multi-agent systems [Hoshen, 2017, Kipf et al., 2018, Sukhbaatar et al., 2016]. We only give a detailed introduction to the first two scenarios due to space limits. Roughly, there are two ways to apply graph neural networks to non-structural scenarios: (1) incorporate structural information from other domains to improve performance, for example, using information from knowledge graphs to alleviate zero-shot problems in image tasks; and (2) infer or assume a relational structure in the scenario and then apply a GNN model to solve the problems defined on the resulting graphs, such as the method in Zhang et al. [2018c] that models text as graphs.
13.1 IMAGE
13.1.1 IMAGE CLASSIFICATION
Image classification is a basic and important task in the field of computer vision, which attracts much attention and has many famous datasets such as ImageNet [Russakovsky et al., 2015]. Recent progress in image classification benefits from big data and the power of GPU computation, which allows us to train a classifier without extracting structural information from images. However, zero-shot and few-shot learning are becoming more and more popular in the field of image classification, because most models can achieve similar performance given enough data. There are several works leveraging graph neural networks to incorporate structural information into image classification.
First, knowledge graphs can be used as extra information to guide zero-shot recognition [Kampffmeyer et al., 2019, Wang et al., 2018c]. Wang et al. [2018c] build a knowledge graph where each node corresponds to an object category and take the word embeddings of the nodes as input for predicting the classifiers of different categories. Since the over-smoothing effect occurs as the convolution architecture becomes deeper, the six-layer GCN used in Wang et al. [2018c] washes out much useful information in the representations. To alleviate the smoothing problem in GCN propagation, Kampffmeyer et al. [2019] use a single-layer GCN with a larger neighborhood that includes both one-hop and multi-hop nodes in the graph, which proves effective for building zero-shot classifiers from existing ones.
Figure 13.1: The black lines represent the propagation step of previous methods. The red and blue lines represent the propagation step in Kampffmeyer et al. [2019], where a node can aggregate information from ancestor and descendant nodes.
Figure 13.1 shows an example of the propagation step in Kampffmeyer et al. [2019] and Wang et al. [2018c].
Besides knowledge graphs, the similarity between images in the dataset is also helpful for few-shot learning [Garcia and Bruna, 2018]. Garcia and Bruna [2018] propose to build a weighted fully-connected image network based on the similarity and perform message passing on the graph for few-shot recognition.
As most knowledge graphs are too large for reasoning, Marino et al. [2017] select related entities to build a sub-graph based on the results of object detection and apply GGNN to the extracted graph for prediction. Besides, Lee et al. [2018a] propose to construct a new knowledge graph in which the entities are all the categories. They define three types of label relations (super-subordinate, positive correlation, and negative correlation) and propagate the confidence of labels on the graph directly.
Figure 13.2: The method in Teney et al. [2017] for visual question answering. The scene graph
from the picture and the syntactic graph from the question are first constructed and then com-
bined for question answering.
13.2 TEXT
Graph neural networks can be applied to several text-based tasks, ranging from sentence-level tasks (e.g., text classification) to word-level tasks (e.g., sequence labeling). We introduce several major applications on text in the following.
Figure 13.3: An example of Syntactic GCN. This figure shows the example with two Syntactic GCN layers.
GNNs can perform multi-hop relational reasoning on graphs but cannot be directly applied to text, so GP-GNN is proposed to solve the relational reasoning task on text.
Cross-sentence N-ary relation extraction detects relations among $n$ entities across multiple sentences. Peng et al. [2017] explore a general framework for cross-sentence n-ary relation extraction based on graph LSTMs. It splits the input graph into two DAGs, and useful information can be lost in this procedure. Song et al. [2018c] propose a graph-state LSTM model that keeps the original graph structure. Furthermore, the model allows more parallelization to speed up the computation.
13.2.5 EVENT EXTRACTION
Event extraction is an important information extraction task that aims to recognize instances of specified types of events in texts. Nguyen and Grishman [2018] investigate a convolutional network (which is exactly the Syntactic GCN) based on dependency trees to perform event detection. Liu et al. [2018] propose a Jointly Multiple Events Extraction (JMEE) framework that extracts event triggers and arguments jointly. It uses an attention-based GCN to model graph information and uses shortcut arcs from syntactic structures to enhance the information flow.
proposed to solve the relational reasoning task based on text. The works cited above are not an
exhaustive list, and we encourage our readers to find more works and application domains of
graph neural networks that they are interested in.
CHAPTER 14
Applications – Other Scenarios
Figure 14.1: A small example of traveling salesman problem (TSP). The nodes denote different
cities and edges denote paths between cities. The edge weights are path lengths. The red line
shows the shortest possible loop that connects every city.
Li et al. [2018c] propose a model that generates edges and nodes sequentially and utilizes a graph neural network to extract the hidden state of the current graph, which is used to decide the action in the next step of the sequential generative process.
Rather than small graphs like molecules, Graphite [Grover et al., 2019] is particularly suited for large graphs. The model learns a parameterized distribution over adjacency matrices. Graphite adopts an encoder-decoder architecture, where the encoder is a GNN. For the proposed decoder, the model constructs an intermediate graph and iteratively refines it by message passing.
Source code generation is an interesting structured prediction task that requires satisfying semantic and syntactic constraints simultaneously. Brockschmidt et al. [2019] propose to solve this problem by graph generation. They design a novel model that builds a graph from a partial AST by adding edges encoding attribute relationships; a graph neural network performing message passing on this graph helps better guide the generation procedure.
CHAPTER 15
Open Resources
15.1 DATASETS
Many graph-related tasks have been released to test the performance of various graph neural networks. These tasks are based on the following commonly used datasets.
A series of datasets based on citation networks are as follows:
• Pubmed [Yang et al., 2016]
• Cora [Yang et al., 2016]
• Citeseer [Yang et al., 2016]
• DBLP [Tang et al., 2008]
A series of datasets based on Biochemical graphs are as follows:
• MUTAG [Debnath et al., 1991]
• NCI-1 [Wale et al., 2008]
• PPI [Zitnik and Leskovec, 2017]
• D&D [Dobson and Doig, 2003]
• PROTEIN [Borgwardt et al., 2005]
• PTC [Toivonen et al., 2003]
A series of datasets based on Social Networks are as follows:
• Reddit [Hamilton et al., 2017c]
• BlogCatalog [Zafarani and Liu, 2009]
A series of datasets based on Knowledge Graphs are as follows:
• FB13 [Socher et al., 2013]
• FB15K [Bordes et al., 2013]
• FB15K237 [Toutanova et al., 2015]
• WN11 [Socher et al., 2013]
• WN18 [Bordes et al., 2013]
• WN18RR [Dettmers et al., 2018]
A broader range of open-source dataset repositories are as follows:
• Network Repository
A scientific network data repository with interactive visualization and mining tools.
http://networkrepository.com
15.2 IMPLEMENTATIONS
We first list several platforms that provide code for graph computing in Table 15.1.
Next, we list the hyperlinks of current open-source implementations of some famous GNN models in Table 15.2.
As the research field grows rapidly, we recommend to our readers the paper list published by our team, GNNPapers (https://github.com/thunlp/gnnpapers), for recent studies.
Model Link
GGNN (2015) https://github.com/yujiali/ggnn
Neural FPs (2015) https://github.com/HIPS/neural-fingerprint
ChebNet (2016) https://github.com/mdeff/cnn_graph
DNGR (2016) https://github.com/ShelsonCao/DNGR
SDNE (2016) https://github.com/suanrong/SDNE
GAE (2016) https://github.com/limaosen0/Variational-Graph-Auto-Encoders
DRNE (2016) https://github.com/tadpole/DRNE
Structural RNN (2016) https://github.com/asheshjain399/RNNexp
DCNN (2016) https://github.com/jcatw/dcnn
GCN (2017) https://github.com/tkipf/gcn
CayleyNet (2017) https://github.com/amoliu/CayleyNet
GraphSage (2017) https://github.com/williamleif/GraphSAGE
GAT (2017) https://github.com/PetarV-/GAT
CLN (2017) https://github.com/trangptm/Column_networks
ECC (2017) https://github.com/mys007/ecc
MPNNs (2017) https://github.com/brain-research/mpnn
MoNet (2017) https://github.com/pierrebaque/GeometricConvolutionsBench
JK-Net (2018) https://github.com/ShinKyuY/Representation_Learning_on_Graphs_with_Jumping_Knowledge_Networks
SSE (2018) https://github.com/Hanjun-Dai/steady_state_embedding
LGCN (2018) https://github.com/divelab/lgcn/
FastGCN (2018) https://github.com/matenure/FastGCN
DiffPool (2018) https://github.com/RexYing/diffpool
GraphRNN (2018) https://github.com/snap-stanford/GraphRNN
MolGAN (2018) https://github.com/nicola-decao/MolGAN
NetGAN (2018) https://github.com/danielzuegner/netgan
DCRNN (2018) https://github.com/liyaguang/DCRNN
ST-GCN (2018) https://github.com/yysijie/st-gcn
RGCN (2018) https://github.com/tkipf/relational-gcn
AS-GCN (2018) https://github.com/huangwb/AS-GCN
DGCN (2018) https://github.com/ZhuangCY/DGCN
GaAN (2018) https://github.com/jennyzhang0215/GaAN
DGI (2019) https://github.com/PetarV-/DGI
GraphWaveNet (2019) https://github.com/nnzhan/Graph-WaveNet
HAN (2019) https://github.com/Jhy1993/HAN
CHAPTER 16
Conclusion
Although GNNs have achieved great success in different fields, it is notable that GNN models are still not good enough to offer satisfying solutions for any graph under any condition. In this chapter, we state some open problems for further research.
Shallow Structure. Traditional DNNs can stack hundreds of layers to get better performance, because a deeper structure has more parameters, which significantly improves the expressive power. However, graph neural networks are usually shallow, most of them having no more than three layers. As the experiments in Li et al. [2018a] show, stacking multiple GCN layers results in over-smoothing, that is to say, all vertices converge to the same value. Although some researchers have managed to tackle this problem [Li et al., 2018a, 2016], it remains the biggest limitation of GNNs. Designing truly deep GNNs is an exciting challenge for future research, and it will be a considerable contribution to the understanding of GNNs.
Dynamic Graphs. Another challenging problem is how to deal with graphs with dynamic structures. Static graphs are stable and can be modeled feasibly, while dynamic graphs introduce changing structures: when edges and nodes appear or disappear, GNNs cannot adapt accordingly. Dynamic GNNs are being actively researched, and we believe they will be a big milestone for the stability and adaptability of GNNs in general.
Non-Structural Scenarios. Although we have discussed the applications of GNNs in non-structural scenarios, we found that there is no optimal method to generate graphs from raw data. In the image domain, some works utilize CNNs to obtain feature maps and then upsample them to form superpixels as nodes [Liang et al., 2016], while others directly leverage object detection algorithms to obtain object nodes. In the text domain [Chen et al., 2018c], some works employ syntactic trees as syntactic graphs, while others adopt fully connected graphs. Therefore, finding the best graph generation approach will offer a wider range of fields where GNNs can make a contribution.
Scalability. How to apply embedding methods in web-scale settings like social networks or recommender systems has been a critical problem for almost all graph embedding algorithms, and GNNs are no exception. Scaling up GNNs is difficult because many of the core steps are computationally expensive in big-data environments. There are several examples of this phenomenon. First, graph data are not regular Euclidean data: each node has its own neighborhood structure, so mini-batches cannot be applied straightforwardly. Second, calculating the graph Laplacian is infeasible when there are millions of nodes and edges. Moreover, we need to point out that scalability determines whether an algorithm can be applied in practice. Several works have proposed solutions to this problem [Ying et al., 2018a], and recent research is paying more attention to this direction.
In conclusion, graph neural networks have become powerful and practical tools for machine learning tasks in the graph domain. This progress owes to advances in expressive power, model flexibility, and training algorithms. In this book, we give a detailed introduction to graph neural networks. For GNN models, we introduce the variants categorized as graph convolutional networks, graph recurrent networks, graph attention networks, and graph residual networks. Moreover, we summarize several general frameworks that uniformly represent different variants. In terms of the application taxonomy, we divide GNN applications into structural scenarios, non-structural scenarios, and other scenarios, and then give a detailed review of applications in each scenario. Finally, we suggest four open problems indicating the major challenges and future research directions of graph neural networks, including model depth, scalability, and the ability to deal with dynamic graphs and non-structural scenarios.
Bibliography
F. Alet, A. K. Jeewajee, M. Bauza, A. Rodriguez, T. Lozano-Perez, and L. P. Kaelbling. 2019.
Graph element networks: Adaptive, structured computation and memory. In Proc. of ICML.
68
M. Allamanis, M. Brockschmidt, and M. Khademi. 2018. Learning to represent programs
with graphs. In Proc. of ICLR. 75
G. Angeli and C. D. Manning. 2014. Naturalli: Natural logic inference for common sense
reasoning. In Proc. of EMNLP, pages 534–545. DOI: 10.3115/v1/d14-1059 81
J. Atwood and D. Towsley. 2016. Diffusion-convolutional neural networks. In Proc. of NIPS,
pages 1993–2001. 2, 26, 30, 78
D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to
align and translate. In Proc. of ICLR. 39
J. Bastings, I. Titov, W. Aziz, D. Marcheggiani, and K. Simaan. 2017. Graph convolutional
encoders for syntax-aware neural machine translation. In Proc. of EMNLP, pages 1957–1967.
DOI: 10.18653/v1/d17-1209 79
P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. 2016. Interaction networks for learning
about objects, relations and physics. In Proc. of NIPS, pages 4502–4510. 1, 59, 63, 67, 81
P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski,
A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. 2018. Relational inductive biases,
deep learning, and graph networks. ArXiv Preprint ArXiv:1806.01261. 3, 59, 62, 63, 64
D. Beck, G. Haffari, and T. Cohn. 2018. Graph-to-sequence learning using gated graph neural
networks. In Proc. of ACL, pages 273–283. DOI: 10.18653/v1/p18-1026 49, 79, 81
I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. 2017. Neural combinatorial opti-
mization with reinforcement learning. In Proc. of ICLR. 84
Y. Bengio, P. Simard, P. Frasconi, et al. 1994. Learning long-term dependencies with gradient
descent is difficult. IEEE TNN, 5(2):157–166. DOI: 10.1109/72.279181 17
M. Berlingerio, M. Coscia, and F. Giannotti. 2011. Finding redundant and complementary
communities in multidimensional networks. In Proc. of CIKM, pages 2181–2184. ACM.
DOI: 10.1145/2063576.2063921 52
A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. 2013. Translating
embeddings for modeling multi-relational data. In Proc. of NIPS, pages 2787–2795. 87, 88
J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun. 2014. Spectral networks and locally connected
networks on graphs. In Proc. of ICLR. 23, 59
A. Buades, B. Coll, and J.-M. Morel. 2005. A non-local algorithm for image denoising. In
Proc. of CVPR, 2:60–65. IEEE. DOI: 10.1109/cvpr.2005.38 60, 61
H. Cai, V. W. Zheng, and K. C.-C. Chang. 2018. A comprehensive survey of graph em-
bedding: Problems, techniques, and applications. IEEE TKDE, 30(9):1616–1637. DOI:
10.1109/tkde.2018.2807452 2
S. Cao, W. Lu, and Q. Xu. 2016. Deep neural networks for learning graph representations. In
Proc. of AAAI. 56
J. Chen, T. Ma, and C. Xiao. 2018a. FastGCN: Fast learning with graph convolutional networks
via importance sampling. In Proc. of ICLR. 54
J. Chen, J. Zhu, and L. Song. 2018b. Stochastic training of graph convolutional networks with
variance reduction. In Proc. of ICML, pages 941–949. 55
X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta. 2018c. Iterative visual reasoning beyond convo-
lutions. In Proc. of CVPR, pages 7239–7248. DOI: 10.1109/cvpr.2018.00756 77, 91
X. Chen, G. Yu, J. Wang, C. Domeniconi, Z. Li, and X. Zhang. 2019. Activehne: Active
heterogeneous network embedding. In Proc. of IJCAI. DOI: 10.24963/ijcai.2019/294 48
J. Cheng, L. Dong, and M. Lapata. 2016. Long short-term memory-networks for machine
reading. In Proc. of EMNLP, pages 551–561. DOI: 10.18653/v1/d16-1053 39
K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y.
Bengio. 2014. Learning phrase representations using RNN encoder—decoder for statistical
machine translation. In Proc. of EMNLP, pages 1724–1734. DOI: 10.3115/v1/d14-1179 17,
33, 60
F. R. Chung and F. C. Graham. 1997. Spectral Graph Theory. American Mathematical Society.
DOI: 10.1090/cbms/092 1
P. Cui, X. Wang, J. Pei, and W. Zhu. 2018. A survey on network embedding. IEEE TKDE.
DOI: 10.1109/TKDE.2018.2849727 2
H. Dai, B. Dai, and L. Song. 2016. Discriminative embeddings of latent variable models for
structured data. In Proc. of ICML, pages 2702–2711. 59, 85
H. Dai, Z. Kozareva, B. Dai, A. Smola, and L. Song. 2018. Learning steady-states of iterative
algorithms over graphs. In Proc. of ICML, pages 1114–1122. 55
N. De Cao and T. Kipf. 2018. MolGAN: An implicit generative model for small molecular
graphs. ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models.
83
A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch.
1991. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro com-
pounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal
Chemistry, 34(2):786–797. DOI: 10.1021/jm00106a046 87
M. Defferrard, X. Bresson, and P. Vandergheynst. 2016. Convolutional neural networks on
graphs with fast localized spectral filtering. In Proc. of NIPS, pages 3844–3852. 24, 59, 78
T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. 2018. Convolutional 2D knowledge
graph embeddings. In Proc. of AAAI. 71, 88
K. Do, T. Tran, and S. Venkatesh. 2019. Graph transformation policy network
for chemical reaction prediction. In Proc. of SIGKDD, pages 750–760. ACM. DOI:
10.1145/3292500.3330958 70
P. D. Dobson and A. J. Doig. 2003. Distinguishing enzyme structures from non-enzymes
without alignments. Journal of Molecular Biology, 330(4):771–783. DOI: 10.1016/s0022-
2836(03)00628-4 87
D. K. Duvenaud, D. Maclaurin, J. Aguileraiparraguirre, R. Gomezbombarelli, T. D. Hirzel,
A. Aspuruguzik, and R. P. Adams. 2015. Convolutional networks on graphs for learning
molecular fingerprints. In Proc. of NIPS, pages 2224–2232. 25, 59, 68
W. Fan, Y. Ma, Q. Li, Y. He, E. Zhao, J. Tang, and D. Yin. 2019. Graph neural
networks for social recommendation. In Proc. of WWW, pages 417–426. ACM. DOI:
10.1145/3308558.3313488 74
M. Fey and J. E. Lenssen. 2019. Fast graph representation learning with PyTorch Geometric.
In ICLR Workshop on Representation Learning on Graphs and Manifolds. 89
A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur. 2017. Protein interface prediction using graph
convolutional networks. In Proc. of NIPS, pages 6530–6539. 1, 70
H. Gao, Z. Wang, and S. Ji. 2018. Large-scale learnable graph convolutional networks. In Proc.
of SIGKDD, pages 1416–1424. ACM. DOI: 10.1145/3219819.3219947 29
V. Garcia and J. Bruna. 2018. Few-shot learning with graph neural networks. In Proc. of ICLR.
76
M. Gori, G. Monfardini, and F. Scarselli. 2005. A new model for learning in graph domains.
In Proc. of IJCNN, pages 729–734. DOI: 10.1109/ijcnn.2005.1555942 19
P. Goyal and E. Ferrara. 2018. Graph embedding techniques, applications, and performance:
A survey. Knowledge-Based Systems, 151:78–94. DOI: 10.1016/j.knosys.2018.03.022 2
J. L. Gross and J. Yellen. 2004. Handbook of Graph Theory. CRC Press. DOI:
10.1201/9780203490204 49
A. Grover and J. Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proc. of
SIGKDD, pages 855–864. ACM. DOI: 10.1145/2939672.2939754 2
A. Grover, A. Zweig, and S. Ermon. 2019. Graphite: Iterative generative modeling of graphs.
In Proc. of ICML. 84
J. Gu, H. Hu, L. Wang, Y. Wei, and J. Dai. 2018. Learning region features for object detection.
In Proc. of ECCV, pages 381–395. DOI: 10.1007/978-3-030-01258-8_24 77
T. Hamaguchi, H. Oiwa, M. Shimbo, and Y. Matsumoto. 2017. Knowledge transfer for out-
of-knowledge-base entities: A graph neural network approach. In Proc. of IJCAI, pages 1802–
1808. DOI: 10.24963/ijcai.2017/250 1, 72
W. L. Hamilton, R. Ying, and J. Leskovec. 2017a. Representation learning on graphs: Methods
and applications. IEEE Data(base) Engineering Bulletin, 40:52–74. 2
W. L. Hamilton, Z. Ying, and J. Leskovec. 2017b. Inductive representation learning on large
graphs. In Proc. of NIPS, pages 1024–1034. 1, 31, 32, 53, 67, 74, 78
W. L. Hamilton, J. Zhang, C. Danescu-Niculescu-Mizil, D. Jurafsky, and J. Leskovec. 2017c.
Loyalty in online communities. In Proc. of ICWSM. 87
D. K. Hammond, P. Vandergheynst, and R. Gribonval. 2011. Wavelets on graphs via
spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150. DOI:
10.1016/j.acha.2010.04.005 24
J. B. Hamrick, K. Allen, V. Bapst, T. Zhu, K. R. Mckee, J. B. Tenenbaum, and P. Battaglia.
2018. Relational inductive bias for physical construction in humans and machines. Cognitive
Science. 63
K. He, X. Zhang, S. Ren, and J. Sun. 2016a. Deep residual learning for image recognition. In
Proc. of CVPR, pages 770–778. DOI: 10.1109/cvpr.2016.90 43, 61
K. He, X. Zhang, S. Ren, and J. Sun. 2016b. Identity mappings in deep residual networks. In
Proc. of ECCV, pages 630–645. Springer. DOI: 10.1007/978-3-319-46493-0_38 31, 45
M. Henaff, J. Bruna, and Y. Lecun. 2015. Deep convolutional networks on graph-structured
data. ArXiv: Preprint, ArXiv:1506.05163. 23, 78
S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation,
9(8):1735–1780. DOI: 10.1162/neco.1997.9.8.1735 17, 33
S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al., 2001. Gradient flow in recurrent
nets: The difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent
Neural Networks. IEEE Press. 17
Y. Hoshen. 2017. Vain: Attentional multi-agent predictive modeling. In Proc. of NIPS,
pages 2701–2711. 59, 60, 67, 75
H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. 2018. Relation networks for object detection. In
Proc. of CVPR, pages 3588–3597. DOI: 10.1109/cvpr.2018.00378 77
G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. 2017. Densely connected
convolutional networks. In Proc. of CVPR, pages 4700–4708. DOI: 10.1109/cvpr.2017.243
45
W. Huang, T. Zhang, Y. Rong, and J. Huang. 2018. Adaptive sampling towards fast graph
representation learning. In Proc. of NeurIPS, pages 4563–4572. 54
T. J. Hughes. 2012. The Finite Element Method: Linear Static and Dynamic Finite Element
Analysis. Courier Corporation. 68
W. Jin, R. Barzilay, and T. Jaakkola. 2018. Junction tree variational autoencoder for molecular
graph generation. In Proc. of ICML. 69
S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley. 2016. Molecular graph convo-
lutions: Moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–
608. DOI: 10.1007/s10822-016-9938-8 59, 69
E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song. 2017. Learning combinatorial opti-
mization algorithms over graphs. In Proc. of NIPS, pages 6348–6358. 1, 59, 85
M. A. Khamsi and W. A. Kirk. 2011. An Introduction to Metric Spaces and Fixed Point Theory,
volume 53. John Wiley & Sons. DOI: 10.1002/9781118033074 20
T. Kipf, E. Fetaya, K. Wang, M. Welling, and R. S. Zemel. 2018. Neural relational inference
for interacting systems. In Proc. of ICML, pages 2688–2697. 63, 67, 75
W. Kool, H. van Hoof, and M. Welling. 2019. Attention, learn to solve routing problems! In
Proc. of ICLR. https://openreview.net/forum?id=ByxBFsRqYm 85
A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. Imagenet classification with deep con-
volutional neural networks. In Proc. of NIPS, pages 1097–1105. DOI: 10.1145/3065386 17
L. Landrieu and M. Simonovsky. 2018. Large-scale point cloud semantic segmentation with
superpoint graphs. In Proc. of CVPR, pages 4558–4567. DOI: 10.1109/cvpr.2018.00479 78
Y. LeCun, Y. Bengio, and G. Hinton. 2015. Deep learning. Nature, 521(7553):436. DOI:
10.1038/nature14539 1
C. Lee, W. Fang, C. Yeh, and Y. F. Wang. 2018a. Multi-label zero-shot learning with structured
knowledge graphs. In Proc. of CVPR, pages 1576–1585. DOI: 10.1109/cvpr.2018.00170 76
G.-H. Lee, W. Jin, D. Alvarez-Melis, and T. S. Jaakkola. 2019. Functional transparency for
structured data: A game-theoretic approach. In Proc. of ICML. 69
J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh. 2018b. Attention models in graphs:
A survey. ArXiv Preprint ArXiv:1807.07984. DOI: 10.1145/3363574 3
F. W. Levi. 1942. Finite Geometrical Systems: Six Public Lectures Delivered in February, 1940, at
the University of Calcutta. The University of Calcutta. 49
G. Li, M. Muller, A. Thabet, and B. Ghanem. 2019. DeepGCNs: Can GCNs go as deep as
CNNs? In Proc. of ICCV. 45, 46
Q. Li, Z. Han, and X.-M. Wu. 2018a. Deeper insights into graph convolutional networks for
semi-supervised learning. In Proc. of AAAI. 55, 91
R. Li, S. Wang, F. Zhu, and J. Huang. 2018b. Adaptive graph convolutional neural networks.
In Proc. of AAAI. 25
Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel. 2016. Gated graph sequence neural
networks. In Proc. of ICLR. 22, 33, 59, 75, 91
Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia. 2018c. Learning deep generative
models of graphs. In Proc. of ICLR Workshop. 83
Y. Li, R. Yu, C. Shahabi, and Y. Liu. 2018d. Diffusion convolutional recurrent neu-
ral network: Data-driven traffic forecasting. In Proc. of ICLR. DOI: 10.1109/trust-
com/bigdatase.2019.00096 50
X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan. 2016. Semantic object parsing with graph
LSTM. In Proc. of ECCV, pages 125–143. DOI: 10.1007/978-3-319-46448-0_8 36, 77, 91
X. Liang, L. Lin, X. Shen, J. Feng, S. Yan, and E. P. Xing. 2017. Interpretable structure-evolving
LSTM. In Proc. of CVPR, pages 2175–2184. DOI: 10.1109/cvpr.2017.234 78
X. Liu, Z. Luo, and H. Huang. 2018. Jointly multiple events extraction via attention-based
graph information aggregation. In Proc. of EMNLP. DOI: 10.18653/v1/d18-1156 81
T. Ma, J. Chen, and C. Xiao. 2018. Constrained generation of semantically valid graphs via
regularizing variational autoencoders. In Proc. of NeurIPS, pages 7113–7124. 83
Y. Ma, S. Wang, C. C. Aggarwal, D. Yin, and J. Tang. 2019. Multi-dimensional graph con-
volutional networks. In Proc. of SDM, pages 657–665. DOI: 10.1137/1.9781611975673.74
52
D. Marcheggiani and I. Titov. 2017. Encoding sentences with graph convolutional networks for
semantic role labeling. In Proc. of EMNLP, pages 1506–1515. DOI: 10.18653/v1/d17-1159
79
D. Marcheggiani, J. Bastings, and I. Titov. 2018. Exploiting semantics in neural machine
translation with graph convolutional networks. In Proc. of NAACL. DOI: 10.18653/v1/n18-
2078 79
K. Marino, R. Salakhutdinov, and A. Gupta. 2017. The more you know: Using knowledge
graphs for image classification. In Proc. of CVPR, pages 20–28. DOI: 10.1109/cvpr.2017.10
76
J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst. 2015. Geodesic convolutional
neural networks on Riemannian manifolds. In Proc. of ICCV Workshops, pages 37–45. DOI:
10.1109/iccvw.2015.112 2, 30
T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word represen-
tations in vector space. In Proc. of ICLR. 2
M. Miwa and M. Bansal. 2016. End-to-end relation extraction using LSTMs on sequences
and tree structures. In Proc. of ACL, pages 1105–1116. DOI: 10.18653/v1/p16-1105 79
F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein. 2017. Geomet-
ric deep learning on graphs and manifolds using mixture model CNNs. In Proc. of CVPR,
pages 5425–5434. DOI: 10.1109/cvpr.2017.576 2, 30, 73, 78
M. Narasimhan, S. Lazebnik, and A. G. Schwing. 2018. Out of the box: Reasoning with graph
convolution nets for factual visual question answering. In Proc. of NeurIPS, pages 2654–2665.
77
D. Nathani, J. Chauhan, C. Sharma, and M. Kaul. 2019. Learning attention-based embeddings
for relation prediction in knowledge graphs. In Proc. of ACL. DOI: 10.18653/v1/p19-1466
72
T. H. Nguyen and R. Grishman. 2018. Graph convolutional networks with argument-aware
pooling for event detection. In Proc. of AAAI. 81
M. Niepert, M. Ahmed, and K. Kutzkov. 2016. Learning convolutional neural networks for
graphs. In Proc. of ICML, pages 2014–2023. 26, 78
W. Norcliffebrown, S. Vafeias, and S. Parisot. 2018. Learning conditioned graph structures for
interpretable visual question answering. In Proc. of NeurIPS, pages 8334–8343. 77
A. Nowak, S. Villar, A. S. Bandeira, and J. Bruna. 2018. Revised note on learning quadratic
assignment with graph neural networks. In Proc. of IEEE DSW, pages 1–5. IEEE. DOI:
10.1109/dsw.2018.8439919 85
R. Palm, U. Paquet, and O. Winther. 2018. Recurrent relational networks. In Proc. of NeurIPS,
pages 3368–3378. 81
S. Pan, R. Hu, G. Long, J. Jiang, L. Yao, and C. Zhang. 2018. Adversarially regularized graph
autoencoder for graph embedding. In Proc. of IJCAI. DOI: 10.24963/ijcai.2018/362 56
E. E. Papalexakis, L. Akoglu, and D. Ience. 2013. Do more views of a graph help? Community
detection and clustering in multi-graphs. In Proc. of FUSION, pages 899–905. IEEE. 52
H. Peng, J. Li, Y. He, Y. Liu, M. Bao, L. Wang, Y. Song, and Q. Yang. 2018. Large-scale hi-
erarchical text classification with recursively regularized deep graph-CNN. In Proc. of WWW,
pages 1063–1072. DOI: 10.1145/3178876.3186005 78
H. Peng, J. Li, Q. Gong, Y. Song, Y. Ning, K. Lai, and P. S. Yu. 2019. Fine-grained event
categorization with heterogeneous graph convolutional networks. In Proc. of IJCAI. DOI:
10.24963/ijcai.2019/449 48
N. Peng, H. Poon, C. Quirk, K. Toutanova, and W.-t. Yih. 2017. Cross-sentence N-ary relation
extraction with graph LSTMs. TACL, 5:101–115. DOI: 10.1162/tacl_a_00049 35, 80
B. Perozzi, R. Al-Rfou, and S. Skiena. 2014. Deepwalk: Online learning of social representa-
tions. In Proc. of SIGKDD, pages 701–710. ACM. DOI: 10.1145/2623330.2623732 2
T. Pham, T. Tran, D. Phung, and S. Venkatesh. 2017. Column networks for collective classi-
fication. In Proc. of AAAI. 43
M. Prates, P. H. Avelar, H. Lemos, L. C. Lamb, and M. Y. Vardi. 2019. Learning to solve
NP-complete problems: A graph neural network for decision TSP. In Proc. of AAAI, 33:4731–
4738. DOI: 10.1609/aaai.v33i01.33014731 85
C. R. Qi, H. Su, K. Mo, and L. J. Guibas. 2017a. PointNet: Deep learning on point sets for 3D
classification and segmentation. In Proc. of CVPR, 1(2):4. DOI: 10.1109/cvpr.2017.16 59
S. Qi, W. Wang, B. Jia, J. Shen, and S.-C. Zhu. 2018. Learning human-object interactions by
graph parsing neural networks. In Proc. of ECCV, pages 401–417. DOI: 10.1007/978-3-030-
01240-3_25 77
X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 2017b. 3D graph neural networks for RGBD
semantic segmentation. In Proc. of CVPR, pages 5199–5208. DOI: 10.1109/iccv.2017.556
78
A. Rahimi, T. Cohn, and T. Baldwin. 2018. Semi-supervised user geolocation via graph con-
volutional networks. In Proc. of ACL, 1:2009–2019. DOI: 10.18653/v1/p18-1187 43, 67
S. Rhee, S. Seo, and S. Kim. 2018. Hybrid approach of relation network and localized
graph convolutional filtering for breast cancer subtype classification. In Proc. of IJCAI. DOI:
10.24963/ijcai.2018/490 70
M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling. 2018.
Modeling relational data with graph convolutional networks. In Proc. of ESWC, pages 593–
607. Springer. DOI: 10.1007/978-3-319-93417-4_38 22, 50, 71
K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko. 2017. Quantum-
chemical insights from deep tensor neural networks. Nature Communications, 8:13890. DOI:
10.1038/ncomms13890 59
C. Shang, Y. Tang, J. Huang, J. Bi, X. He, and B. Zhou. 2019a. End-to-end structure-aware
convolutional networks for knowledge base completion. In Proc. of AAAI, 33:3060–3067.
DOI: 10.1609/aaai.v33i01.33013060 71
J. Shang, T. Ma, C. Xiao, and J. Sun. 2019b. Pre-training of graph augmented transformers for
medication recommendation. In Proc. of IJCAI. DOI: 10.24963/ijcai.2019/825 70
J. Shang, C. Xiao, T. Ma, H. Li, and J. Sun. 2019c. GameNet: Graph augmented memory
networks for recommending medication combination. In Proc. of AAAI, 33:1126–1133. DOI:
10.1609/aaai.v33i01.33011126 70
K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image
recognition. ArXiv Preprint ArXiv:1409.1556. 17
R. Socher, D. Chen, C. D. Manning, and A. Ng. 2013. Reasoning with neural tensor networks
for knowledge base completion. In Proc. of NIPS, pages 926–934. 87, 88
L. Song, Z. Wang, M. Yu, Y. Zhang, R. Florian, and D. Gildea. 2018a. Exploring graph-
structured passage representation for multi-hop reading comprehension with graph neural
networks. ArXiv Preprint ArXiv:1809.02040. 81
L. Song, Y. Zhang, Z. Wang, and D. Gildea. 2018b. A graph-to-sequence model for AMR-
to-text generation. In Proc. of ACL, pages 1616–1626. DOI: 10.18653/v1/p18-1150 81
L. Song, Y. Zhang, Z. Wang, and D. Gildea. 2018c. N-ary relation extraction using graph state
LSTM. In Proc. of EMNLP, pages 2226–2235. DOI: 10.18653/v1/d18-1246 80
Y. Sun, N. Bui, T.-Y. Hsieh, and V. Honavar. 2018. Multi-view network embedding via
graph factorization clustering and co-regularized multi-view agreement. In IEEE ICDMW,
pages 1006–1013. DOI: 10.1109/icdmw.2018.00145 52
R. S. Sutton and A. G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press. DOI:
10.1109/tnn.1998.712192 85
K. S. Tai, R. Socher, and C. D. Manning. 2015. Improved semantic representations from tree-
structured long short-term memory networks. In Proc. of IJCNLP, pages 1556–1566. DOI:
10.3115/v1/p15-1150 34, 78, 81
J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. 2008. Arnetminer: Extraction
and mining of academic social networks. In Proc. of SIGKDD, pages 990–998. DOI:
10.1145/1401890.1402008 87
J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. 2015. Line: Large-scale information
network embedding. In Proc. of WWW, pages 1067–1077. DOI: 10.1145/2736277.2741093
2
D. Teney, L. Liu, and A. V. Den Hengel. 2017. Graph-structured representations for visual
question answering. In Proc. of CVPR, pages 3233–3241. DOI: 10.1109/cvpr.2017.344 77
C. Tomasi and R. Manduchi. 1998. Bilateral filtering for gray and color images. In Computer
Vision, pages 839–846. IEEE. DOI: 10.1109/iccv.1998.710815 61
K. Tu, P. Cui, X. Wang, P. S. Yu, and W. Zhu. 2018. Deep recursive network embedding with
regular equivalence. In Proc. of SIGKDD. DOI: 10.1145/3219819.3220068 56
R. van den Berg, T. N. Kipf, and M. Welling. 2017. Graph convolutional matrix completion.
In Proc. of SIGKDD. 56, 67, 74
Q. Wu, H. Zhang, X. Gao, P. He, P. Weng, H. Gao, and G. Chen. 2019b. Dual graph attention
networks for deep latent representation of multifaceted social effects in recommender systems.
In Proc. of WWW, pages 2091–2102. ACM. DOI: 10.1145/3308558.3313442 74
Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. 2019c. A comprehensive survey on
graph neural networks. ArXiv Preprint ArXiv:1901.00596. 3
Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang. 2019d. Graph waveNet for deep spatial-
temporal graph modeling. ArXiv Preprint ArXiv:1906.00121. DOI: 10.24963/ijcai.2019/264
51
K. Xu, L. Wang, M. Yu, Y. Feng, Y. Song, Z. Wang, and D. Yu. 2019a. Cross-lingual knowledge
graph alignment via graph matching neural network. In Proc. of ACL. DOI: 10.18653/v1/p19-
1304 72
N. Xu, P. Wang, L. Chen, J. Tao, and J. Zhao. 2019b. Mr-GNN: Multi-resolution and dual
graph neural network for predicting structured entity interactions. In Proc. of IJCAI. DOI:
10.24963/ijcai.2019/551 70
S. Yan, Y. Xiong, and D. Lin. 2018. Spatial temporal graph convolutional networks for skeleton-
based action recognition. In Proc. of AAAI. DOI: 10.1186/s13640-019-0476-x 51
B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng. 2015a. Embedding entities and relations for
learning and inference in knowledge bases. In Proc. of ICLR. 71
C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang. 2015b. Network representation learning
with rich text information. In Proc. of IJCAI, pages 2111–2117. 2
L. Yao, C. Mao, and Y. Luo. 2019. Graph convolutional networks for text classification. In
Proc. of AAAI, 33:7370–7377. DOI: 10.1609/aaai.v33i01.33017370 78
Authors’ Biographies
ZHIYUAN LIU
Zhiyuan Liu is an associate professor in the Department of Computer Science and Technology,
Tsinghua University. He got his B.E. in 2006 and his Ph.D. in 2011 from the Department
of Computer Science and Technology, Tsinghua University. His research interests are natural
language processing and social computation. He has published over 60 papers in international
journals and conferences, including IJCAI, AAAI, ACL, and EMNLP.
JIE ZHOU
Jie Zhou is a second-year Master’s student of the Department of Computer Science and Tech-
nology, Tsinghua University. He got his B.E. from Tsinghua University in 2016. His research
interests include graph neural networks and natural language processing.