node labeling. GNNs are able to combine node features, connection patterns, and graph structure by using a neural network to embed the node information and pass it through edges in the graph. However, due to the complex data representation and non-linear transformations performed on the data, explaining decisions made by GNNs is a challenging problem. An example of understanding fraud detection by explaining a GNN node classification decision is shown in Figure 1. Therefore, we want to identify the patterns in the input data that were used by the GNN model to make a decision and examine whether the model works as we desire.

Although deep learning model visualization techniques have been developed for convolutional neural networks (CNNs), those methods are not directly applicable to explaining weighted graphs with node features for the classification task. A few works have been done on explaining GNNs (Pope et al. 2019; Baldassarre and Azizpour 2019; Ying et al. 2019; Yang et al. 2019). However, to the best of our knowledge, no work has been done on explaining comprehensive features (namely node features, edge features, and connecting patterns) in weighted graphs, especially for the node classification problem. Here we propose a few graph feature explanation methods for financial data formulated as graphs. We use three datasets (one synthetic and two real) to validate our methods. Our results demonstrate that the explanation approach can discover the data patterns used for node classification that correspond to human interpretation, and that those explanation methods can be used for understanding data, debugging the GNN model, examining model decisions, and other tasks.

Our contributions are summarized as follows:
1. We propose to transfer financial transaction data to a weighted graph representation for further analysis and understanding of the data.
2. We propose to use GNNs to analyze financial transaction data, including fraud detection and account matching.
3. We provide the tools to interpret informative interactions between entities for entity labeling in the structured-graph format.

Paper structure: In section 2, we introduce the graph representation. Then in section 3, we walk through the operations in the GNN. In section 4, the formulation of graph explanation is described, and the corresponding methods are introduced. In section 5, we propose the evaluation metrics and methods. The experiments and results are presented in section 6. We conclude the paper in section 7.

2 Data Representation – Weighted Graph

For financial data, which contains user and interaction information, we can model each entity as a node and build the underlying connections between them based on their interactions. In this section, we introduce the necessary notation and definitions. We denote a graph by $G = (V, E)$, where $V$ is the set of nodes, $E$ the set of edges linking the nodes, and $X$ the set of node features. For every pair of connected nodes $u, v \in V$, we denote by $e_{vu} \in \mathbb{R}$ the weight of the edge $(v, u) \in E$ linking them. We denote $E[v, u] = e_{vu}$, where $E \in \mathbb{R}^{|E|}$. For each node $u$, we associate a $d$-dimensional vector of features $X_u \in \mathbb{R}^d$ and denote the set of all features as $X = \{X_u : u \in V\} \in (\mathbb{R}^d)^{|V|}$.

Edge features contain important information about graphs. For instance, the graph $G$ may represent a banking system, where the nodes $V$ represent different banks and the edges $E$ are the transactions between them; or the graph $G$ may represent a social network, where the nodes $V$ represent different users and the edges $E$ are the contact frequencies between users. We consider a node classification task, where each node $u$ is assigned a label $y_u \in I_C = \{0, \dots, C-1\}$. In the financial application, the node classification problem can be fraud detection, new customer discovery, account matching, and so on.
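As a hedged illustration of this representation, the sketch below builds a weighted, directed graph from hypothetical transaction records; the entity names, fields, and degree-based node features are illustrative assumptions, not taken from the datasets used in this paper.

```python
from collections import defaultdict

import numpy as np

# Hypothetical transaction records: (sender, receiver, amount).
transactions = [
    ("bank_A", "bank_B", 120.0),
    ("bank_A", "bank_C", 30.0),
    ("bank_B", "bank_C", 75.0),
    ("bank_C", "bank_A", 10.0),
]

# V: one node per entity; E: directed edge (v, u) weighted by the total amount sent.
nodes = sorted({entity for t in transactions for entity in t[:2]})
index = {name: i for i, name in enumerate(nodes)}

edge_weight = defaultdict(float)            # E[v, u] = e_vu
for sender, receiver, amount in transactions:
    edge_weight[(index[sender], index[receiver])] += amount

# X: a d-dimensional feature vector per node (here d = 2: out-degree, in-degree).
X = np.zeros((len(nodes), 2))
for (v, u), w in edge_weight.items():
    X[v, 0] += 1.0   # out-degree of v
    X[u, 1] += 1.0   # in-degree of u

print(nodes)
print(dict(edge_weight))
print(X)
```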
3 GNN Utilizing Edge Weight

Different from the state-of-the-art GNN architectures, i.e., graph convolution networks (GCN) (Kipf and Welling 2016) and graph attention networks (GAT) (Veličković and others 2018), some GNNs can exploit the edge information on the graph (Gong and Cheng 2019; Shang et al. 2018; Yang et al. 2019). Here, we consider weighted and directed graphs, and develop a graph neural network that uses both node and edge weights, where edge weights affect message aggregation. Our approach not only handles directed and weighted graphs but also preserves edge information in the propagation of the GNN. Preserving and using edge information is important in many real-world graphs, such as banking payment networks, recommendation systems (that use social networks), and other systems that heavily rely on the topology of the connections, since, apart from node (atomic) features, attributes of edges (bonds) are also important for predicting local and global properties of graphs.

Generally speaking, GNNs inductively learn a node representation by recursively aggregating and transforming the feature vectors of its neighboring nodes. Following (Battaglia et al. 2018; Zhang, Cui, and Zhu 2018; Zhou et al. 2018), a per-layer update of the GNN in our setting involves three computations: message passing Eq. (1), message aggregation Eq. (2), and updating the node representation Eq. (3), which can be expressed as:

\[
m_{vu}^{(l)} = \mathrm{MSG}\big(h_u^{(l-1)}, h_v^{(l-1)}, e_{vu}\big), \tag{1}
\]
\[
M_u^{(l)} = \mathrm{AGG}\big(\{m_{vu}^{(l)}, e_{vu}\} \mid v \in N(u)\big), \tag{2}
\]
\[
h_u^{(l)} = \mathrm{UPDATE}\big(M_u^{(l)}, h_u^{(l-1)}\big), \tag{3}
\]

where $h_u^{(l)}$ is the embedded representation of node $u$ at layer $l$; $e_{vu}$ is the weighted edge pointing from $v$ to $u$; and $N(u)$ is $u$'s neighborhood, from which it collects information to update its aggregated message $M_u^{(l)}$. Specifically, $h_u^{(0)} = x_u$ initially, and $h_u^{(L)}$ is the final embedding for node $u$ of an $L$-layer GNN node classifier.

Here, following (Schlichtkrull et al. 2018), the GNN layer using edge weight for filtering can be formed as the following steps:

\[
m_{vu}^{(l)} = W_1^{(l-1)} h_v^{(l-1)} \quad \text{(message)}, \tag{4}
\]
\[
M_u^{(l)} = \sum_{v \in N(u)} g\big(m_{vu}^{(l)}, h_u^{(l-1)}, e_{vu}\big) \quad \text{(aggregate)}, \tag{5}
\]
\[
h_u^{(l)} = \sigma\big(W_0^{(l-1)} h_u^{(l-1)} + M_u^{(l)}\big) \quad \text{(update)}, \tag{6}
\]

where $N(u)$ denotes the set of neighbors of node $u$ and $e_{vu}$ denotes the directed edge from $v$ to $u$; $W$ denotes the model's parameters to be learned; and $g$ is any linear/nonlinear function that can be applied to the neighbor nodes' feature embeddings. We set $h^{(l)} \in \mathbb{R}^{d^{(l)}}$, where $d^{(l)}$ is the dimension of the $l$-th layer representation.

As in the graph convolution operations of (Gong and Cheng 2019), the edge feature matrices are used as filters to multiply the node feature matrix. To avoid increasing the scale of the output features by the multiplication, the edge features need to be normalized, as in GAT (Veličković and others 2018) and GCN (Kipf and Welling 2016). Due to the aggregation mechanism, we normalize the weights by in-degree, $\bar{e}_{vu} = e_{vu} / \sum_{v \in N(u)} e_{vu}$. Our method can deal with negative edge weights by re-normalizing them to a positive interval, for instance $[0, 1]$; therefore, in the following we use only positive edge weights, and the edge weights act as a message filtering ratio. Depending on the problem:

• $g$ can simply be defined as $g = \bar{e}_{vu} m_{vu}^{(l)}$; or
• $g$ can be a gate function, such as an RNN-type block of $m_{vu}^{(l)}$, i.e., $g = \mathrm{GRU}\big(\bar{e}_{vu} m_{vu}^{(l)}, h_u^{(l-1)}\big)$.
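To make the filtering behavior of Eqs. (4)–(6) concrete, here is a minimal PyTorch sketch of one such layer, assuming the simpler choice $g = \bar{e}_{vu} m_{vu}^{(l)}$, a dense matrix of edge weights, and a ReLU non-linearity; the class name, layer sizes, and toy data are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class EdgeWeightedGNNLayer(nn.Module):
    """One layer of Eqs. (4)-(6) with g = e_bar_vu * m_vu (edge weights as filters)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W1 = nn.Linear(in_dim, out_dim, bias=False)  # message weights, Eq. (4)
        self.W0 = nn.Linear(in_dim, out_dim, bias=False)  # self-update weights, Eq. (6)

    def forward(self, h: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
        # h: (N, in_dim) node embeddings; E: (N, N) weights with E[v, u] = e_vu.
        in_degree = E.sum(dim=0, keepdim=True).clamp(min=1e-12)
        E_bar = E / in_degree                     # e_bar_vu = e_vu / sum_v e_vu
        m = self.W1(h)                            # messages m_vu depend only on v, Eq. (4)
        M = E_bar.t() @ m                         # aggregate: M_u = sum_v e_bar_vu * m_vu, Eq. (5)
        return torch.relu(self.W0(h) + M)         # update, Eq. (6)


# Tiny usage example with random data.
N, d_in, d_out = 4, 3, 5
h0 = torch.randn(N, d_in)
E = torch.rand(N, N) * (torch.rand(N, N) > 0.5)   # sparse-ish positive edge weights
layer = EdgeWeightedGNNLayer(d_in, d_out)
print(layer(h0, E).shape)                         # torch.Size([4, 5])
```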
4 Explaining Informative Components of Graph-Structured Data

Relational structures in graphs often contain crucial information for node classification, such as the graph's topology and the information flow (i.e., direction and amplitude). Therefore, knowing which edges contribute the most to the information flow towards or from a node is essential to understand and interpret the node classification evidence.

We tackle the weighted graph feature explanation problem as a two-stage pipeline. First, we train a node classification function, in this case a GNN. The GNN inputs are a graph $G = (V, E)$, its associated node features $X$, and its true node labels $Y$. We represent this classifier as $\Phi : G \mapsto (u \mapsto y_u)$, where $y_u \in I_C$. One advantage of GNNs is the preservation of the information flow across nodes as well as of the data structure. Furthermore, it is invariant to permutations on the ordering; hence it keeps the relational inductive biases of the input data (see (Battaglia et al. 2018)). Second, given the node classification model and the node's true label, the explanation part provides a subgraph and a subset of features retrieved from the $k$-hop neighborhood of each node $u$, for $k \in \mathbb{N}$ and $u \in V$. The subgraph, along with the subset of features, is the minimal set of information and information flow across the neighbor nodes of $u$ that the GNN uses to compute the node's label. We define $G_S = (V_S, E_S)$ to be a subgraph of $G$, written $G_S \subseteq G$, if $V_S \subseteq V$ and $E_S \subseteq E$. Considering the classification $y_u \in I_C$ of node $u$, Informative Components Detection computes a subgraph $G_S$ containing $u$ that aims to explain the classification task by looking at the edge connectivity patterns $E_S$ and their connecting nodes $V_S$. This provides insights into the characteristics of the graph that contribute to the node's label.

4.1 Maximal Mutual Information (MMI) Mask

Due to the properties of the GNN, Eq. (5), we only need to consider the graph structure used in aggregation, i.e., the computational graph w.r.t. node $u$, defined as $G_c(u)$ and containing $N'$ nodes, where $N' \leq N$. The node feature set associated with $G_c(u)$ is $X_c(u) = \{x_v \mid v \in V_c(u)\}$. For node $u$, the label prediction of the GNN $\Phi$ is given by $\hat{y}_u = \Phi(G_c(u), X_c(u))$, which can be interpreted as a distribution $P_\Phi(Y \mid G_c, X_c)$ given by the GNN mapping. Our goal is to identify a subgraph $G_S \subseteq G_c(u)$ (and its associated features $X_S = \{x_w \mid w \in V_S\}$, or a subset of them) which the GNN uses to predict $u$'s label.
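For an $L$-layer GNN, $G_c(u)$ is the $L$-hop in-neighborhood of $u$. The plain-Python sketch below (a hedged illustration, assuming the dictionary-of-weighted-edges representation used in the earlier Section 2 sketch) extracts the node set of such a computational graph.

```python
def computational_graph_nodes(edge_weight, u, num_hops):
    """Nodes of G_c(u): all nodes that reach u within `num_hops` directed steps.

    edge_weight: dict mapping (v, w) -> e_vw for directed edges v -> w.
    """
    frontier = {u}
    reached = {u}
    for _ in range(num_hops):
        # Predecessors of the current frontier that have not been visited yet.
        frontier = {v for (v, w) in edge_weight if w in frontier and v not in reached}
        reached |= frontier
    return reached


# Usage with the toy banking graph from the Section 2 sketch (hypothetical indices).
edges = {(0, 1): 120.0, (0, 2): 30.0, (1, 2): 75.0, (2, 0): 10.0}
print(computational_graph_nodes(edges, u=2, num_hops=2))   # {0, 1, 2}
```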
Using ideas from information theory (Cover and Thomas 2012) and following GNNExplainer (Ying et al. 2019), the informative explainable subgraph and node feature subset are chosen to maximize the mutual information (MI):

\[
\max_{G_S} I(Y, (G_S, X_S)) = H(Y \mid G, X) - H(Y \mid G_S, X_S). \tag{7}
\]

Since the trained GNN node classifier $\Phi$ is fixed, the $H(Y \mid G, X)$ term of Eq. (7) is constant. As a result, maximizing Eq. (7) is equivalent to minimizing the conditional entropy $H(Y \mid G_S, X_S)$:

\[
-\mathbb{E}_{Y \mid G_S, X_S}\big[\log P_\Phi(Y \mid G_S, X_S)\big]. \tag{8}
\]

Therefore, the explanation of the graph components with prediction power w.r.t. node $u$'s prediction $\hat{y}_u$ is a subgraph $G_S$ and its associated feature set $X_S$ that minimize Eq. (8). Thus, the objective of the explanation is to pick the top informative edges and their connecting neighbors, which form a subgraph, for predicting $u$'s label, because some edges in $u$'s computational graph $G_c(u)$ form important message-passing (Eq. (5)) pathways, which allow useful node information to be propagated across $G_c(u)$ and aggregated at $u$ for prediction, while other edges in $G_c(u)$ are not informative for prediction. Instead of directly optimizing $G_S$ in Eq. (8), since there are exponentially many discrete structures $G_S \subseteq G_c(u)$ containing $N'$ nodes, GNNExplainer (Ying et al. 2019) optimizes a mask $M_{\mathrm{sym}} \in [0, 1]^{N' \times N'}$ on the binary adjacency matrix, which allows gradient descent to be performed on $G_S$.

If we use the edge weights only for node embedding, the connections can be treated as binary and fit into the original GNNExplainer. However, if we use the edge weights as filtering, the mask should affect both filtering and normalization. We extend the original GNNExplainer method by considering edge weights and improve the method by adding extra regularization. Unlike GNNExplainer, where there are no constraints on the mask value, we add constraints to the values learned by the mask,

\[
\sum_{w} M_{vw} e_{vw} = 1, \quad M_{vw} \geq 0, \quad \text{for } (v, w) \in E_c(u), \tag{9}
\]

to ensure that at least one edge connected to node $u$ is selected. After the mask $M$ is learned, we use a threshold to remove small entries of $E_c \odot M$ and isolated nodes. Our proposed optimization method to learn $M$, maximizing the mutual information (Eq. (7)) under the above constraints, is shown in Algorithm 1.

Algorithm 1 Optimize mask for weighted graph
Input: 1. $G_c(u)$, the computation graph of node $u$; 2. a pre-trained GNN model $\Phi$; 3. $y_u$, node $u$'s real label; 4. $M$, the learnable mask; 5. $K$, the number of optimization iterations; 6. $L$, the number of layers of the GNN.
1: $M \leftarrow$ randomized parameters ▷ initialize, $M \in [0,1]^Q$
2: $h_v^{(0)} \leftarrow x_v$, for $v \in G_c(u)$
3: for $k = 1$ to $K$ do
4:   $M_{vw} \leftarrow \exp(M_{vw} e_{vw}) / \sum_{v} \exp(M_{vw} e_{vw})$ ▷ renormalize mask
5:   for $l = 1$ to $L$ do
6:     $m_{vu}^{(l)} \leftarrow W_1^{(l-1)} h_v^{(l-1)}$ ▷ message
7:     $M_u^{(l)} \leftarrow \sum_v g(M_{vu} m_{vu}^{(l)}, h_u^{(l-1)})$ ▷ aggregate
8:     $h_u^{(l)} \leftarrow \sigma(W_0^{(l-1)} h_u^{(l-1)} + M_u^{(l)})$ ▷ update
9:   end for
10:  $\hat{y}_u \leftarrow \mathrm{softmax}(h_u^{(L)})$ ▷ predict on masked graph
11:  loss $\leftarrow$ crossentropy$(y_u, \hat{y}_u)$ + regularizations
12:  $M \leftarrow$ optimizer(loss, $M$) ▷ update mask
13: end for
Return: $M$
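As a hedged illustration of this optimization loop (a sketch, not the authors' exact implementation), the snippet below learns an edge mask for a one-layer version of the edge-weighted GNN defined in the earlier sketch, assuming the EdgeWeightedGNNLayer class is in scope; the softmax renormalization, cross-entropy loss, entropy regularizer, and threshold value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, d_in, n_classes = 4, 3, 2
X = torch.randn(N, d_in)
E = torch.rand(N, N) * (torch.rand(N, N) > 0.5)     # fixed edge weights e_vw
gnn = EdgeWeightedGNNLayer(d_in, n_classes)          # stand-in for a pre-trained classifier
u, y_u = 2, torch.tensor([1])                        # node to explain and its label

mask_logits = torch.zeros(N, N, requires_grad=True)  # learnable mask M
optimizer = torch.optim.Adam([mask_logits], lr=0.05)

for step in range(200):
    # Renormalize the mask over incoming edges so the masked weights stay bounded.
    M = torch.softmax(mask_logits * E, dim=0)
    logits = gnn(X, M * E)                            # forward pass on the masked graph
    y_hat = logits[u].unsqueeze(0)
    prob = torch.softmax(M.flatten(), dim=0)
    entropy_reg = -(prob * torch.log(prob + 1e-12)).sum()   # encourage a concentrated mask
    loss = F.cross_entropy(y_hat, y_u) + 0.01 * entropy_reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

important_edges = (M * E) > 0.1                       # threshold away small masked weights
print(important_edges)
```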
4.2 Guided Gradient (GGD) Salience

Guided gradient-based explanation (Simonyan, Vedaldi, and Zisserman 2013) is perhaps the most straightforward and easiest approach. By simply calculating the derivative of the output with respect to the model input and then applying a norm, a score can be obtained. The gradient-based score can be used to indicate the relative importance of an input feature, since it represents the change in the input space that corresponds to the maximal positive rate of change in the model output. Since the edge weights act as filtering in the GNN, we can obtain the edge mask as
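As a hedged sketch of this idea, assuming the edge score is taken as the positive part of the gradient of the predicted class logit with respect to each edge weight (an illustrative choice, not necessarily the paper's exact mask definition), the gradient-based edge saliency can be computed as follows, reusing X, E, u, and EdgeWeightedGNNLayer from the earlier sketches.

```python
import torch

E_input = E.clone().requires_grad_(True)      # treat the edge weights as a model input
gnn = EdgeWeightedGNNLayer(X.shape[1], 2)     # a trained model would be used in practice

logits = gnn(X, E_input)
predicted_class = logits[u].argmax()
logits[u, predicted_class].backward()          # d(class logit) / d(e_vw) for every edge

# Guided-gradient style edge saliency: keep only positive gradients.
edge_saliency = torch.relu(E_input.grad)
print(edge_saliency)
```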