
Classifying and Understanding Financial Data
Using Graph Neural Network

Xiaoxiao Li1*   Joao Saude2   Prashant Reddy2   Manuela Veloso2

1 Yale University
2 J.P.Morgan AI Research

Abstract

Real data collected from different applications usually do not have a pre-defined data model or are not organized in a pre-defined manner. Analyzing messy, unstructured data and extracting useful information from them is difficult. The data collected in financial institutions usually carry additional topological structure and are amenable to being represented as a graph; examples include social networks, communication networks, financial systems, and payment networks. The graph structure can be built from the connections between entities, such as financial institutions, customers, or computing centers. We consider how different entities influence each other's labels through a label prediction problem defined on the graph structure. Given such structured data, Graph Neural Networks (GNNs) are a powerful tool that can mimic experts' decisions on node labeling. GNNs combine node features through the graph structure by using a neural network to embed node information and pass it along the edges of the graph. We want to identify the informative interactions in the input data that the GNN model uses to classify nodes in the graph, and examine whether the model works as we desire. However, due to the complex data representation and non-linear transformations, explaining decisions made by GNNs is challenging. In this work, we propose graph representation methods for financial transaction data and new graph feature explanation methods to identify the informative graph topology. We use four datasets (one synthetic and three real) to validate our methods. Our results demonstrate that graph-structured representations help to analyze financial transaction data, and that our explanation approach can mimic patterns in human interpretation and disentangle different features in the graphs.

* This work was done at J.P.Morgan AI Research.
Copyright (c) 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

In recent years, with the rapid development of new technologies such as big data, cloud computing, and artificial intelligence, these technologies have become deeply integrated with financial services, releasing financial innovation vitality and application potential, which has greatly promoted the financial industry. In this development process, big data technology is the most mature and most widely used. However, faced with such a vast ocean of information, especially unstructured data, how to store, query, analyze, mine, and utilize these massive information resources is particularly critical. Traditional relational databases are mainly oriented to transaction processing and data analysis applications; they are good at solving structured data management problems, but have inherent shortcomings in managing unstructured data, especially massive amounts of it. In response to the challenges of unstructured data analysis, one strategy is to transform unstructured data into structured data, which helps with labeling, retrieving, searching, and clustering similar information, and in turn helps finance better serve the real economy and effectively promote the overall development of the financial industry.

Our contemporary society relies heavily on interpersonal/cultural relations (social networks); our economy is densely connected and structured (commercial relations, financial transfers, supply/distribution chains); geopolitical relations are also very structured (commercial and political unions); we rely on transportation networks (roads, railroads, maritime and flight connections); and our cyber systems are structurally connected as well (computer networks, the internet). Moreover, such complex network structures also appear in nature, in biological systems like the brain, vascular and nervous systems, and in chemical systems, for instance the bonds between atoms in molecules. Nowadays, with modern technologies, we collect data from all the above systems and their relations. Since this data is highly structured and depends heavily on the relations within the networks, it makes sense to represent the data as a graph, where nodes represent entities and edges the connections between them.

Figure 1: We can build a graph structure for a transaction system and try to understand what interaction patterns are informative for fraud detection using GNN and explanation methods.

Artificial intelligence is becoming a new direction for financial big data applications. Graph Neural Networks (GNNs) such as GCN (Kipf and Welling 2016) and GraphSage (Hamilton, Ying, and Leskovec 2017) are a kind of deep learning architecture that can handle graph-structured data while preserving the information structure of graphs. Our primary focus is the node labeling problem, such as fraud detection, credit issuing, customer targeting, and social network user classification, which can mimic experts' decisions on node labeling. GNNs are able to combine node features, connection patterns, and graph structure by using a neural network to embed the node information and pass it through the edges of the graph.
However, due to the complex data representation and the non-linear transformations performed on the data, explaining decisions made by GNNs is a challenging problem. An example of understanding fraud detection by explaining a GNN node classification decision is shown in Figure 1. We therefore want to identify the patterns in the input data that were used by the GNN model to make a decision and examine whether the model works as we desire.

Although deep learning model visualization techniques have been developed for convolutional neural networks (CNNs), those methods are not directly applicable to explaining weighted graphs with node features for the classification task. A few works have been done on explaining GNNs (Pope et al. 2019; Baldassarre and Azizpour 2019; Ying et al. 2019; Yang et al. 2019). However, to the best of our knowledge, no work has been done on explaining comprehensive features (namely node features, edge features, and connection patterns) in weighted graphs, especially for the node classification problem. Here we propose a few graph feature explanation methods and formulate financial data in a graph-based structure. We use three datasets (one synthetic and two real) to validate our methods. Our results demonstrate that, using the explanation approach, we can discover that the data patterns used for node classification correspond to human interpretation, and that those explanation methods can be used for understanding data, debugging GNN models, examining model decisions, and other tasks.

Our contribution is summarized as follows:
1. We propose to transfer financial transaction data to a weighted graph representation for further analysis and understanding of the data.
2. We propose to use GNNs to analyze financial transaction data, including fraud detection and account matching.
3. We provide the tools to interpret informative interactions between entities for entity labeling in the structured-graph format.

Paper structure: In section 2, we introduce the graph representation. Then in section 3, we walk through the operations in the GNN. In section 4, the formulation of graph explanation is described and the corresponding methods are introduced. In section 5, we propose the evaluation metrics and methods. The experiments and results are presented in section 6. We conclude the paper in section 7.

2 Data Representation – Weighted Graph

For financial data, which contains user and interaction information, we can model each entity as a node and build the underlying connections between them based on their interactions. In this section, we introduce the necessary notation and definitions. We denote a graph by G = (V, E), where V is the set of nodes, E the set of edges linking the nodes, and X the set of nodes' features. For every pair of connected nodes u, v ∈ V, we denote by e_vu ∈ R the weight of the edge (v, u) ∈ E linking them, and write E[v, u] = e_vu, where E ∈ R^|E|. With each node u we associate a d-dimensional vector of features, X_u ∈ R^d, and denote the set of all features as X = {X_u : u ∈ V} ∈ (R^d)^|V|.

Edge features contain important information about graphs. For instance, the graph G may represent a banking system, where the nodes V represent different banks and the edges E are the transactions between them; or G may represent a social network, where the nodes V represent different users and the edges E are the contact frequencies between the users. We consider a node classification task, where each node u is assigned a label y_u ∈ I_C = {0, . . . , C − 1}. In financial applications, the node classification problem can be fraud detection, new customer discovery, account matching, and so on.
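As a concrete illustration of this representation, the sketch below builds a weighted directed graph with node features from a list of transaction records. The record fields (sender, receiver, amount), the placeholder features, and the use of networkx are illustrative assumptions, not part of the datasets used later in the paper.

```python
# Minimal sketch: turning raw transaction records into the weighted,
# directed graph G = (V, E) with node features X described above.
# Field names and the networkx backend are illustrative assumptions.
import networkx as nx

transactions = [
    # (sender, receiver, amount) -- hypothetical records
    ("acct_A", "bank_1", 120.0),
    ("bank_1", "acct_B", 120.0),
    ("acct_A", "acct_B", 35.5),
]

G = nx.DiGraph()
for sender, receiver, amount in transactions:
    if G.has_edge(sender, receiver):
        # accumulate repeated interactions into a single edge weight e_vu
        G[sender][receiver]["weight"] += amount
    else:
        G.add_edge(sender, receiver, weight=amount)

# attach a d-dimensional feature vector X_u to every node
for node in G.nodes:
    G.nodes[node]["x"] = [1.0, 0.0]  # placeholder 2-D features
```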
3 GNN Utilizing Edge Weight

Different from state-of-the-art GNN architectures, i.e., graph convolution networks (GCN) (Kipf and Welling 2016) and graph attention networks (GAT) (Veličković and others 2018), some GNNs can exploit the edge information of the graph (Gong and Cheng 2019; Shang et al. 2018; Yang et al. 2019). Here, we consider weighted and directed graphs and develop a graph neural network that uses both node and edge weights, where edge weights affect message aggregation. Our approach not only handles directed and weighted graphs but also preserves edge information in the propagation of the GNN. Preserving and using edge information is important in many real-world graphs, such as banking payment networks, recommendation systems (that use social networks), and other systems that heavily rely on the topology of the connections, since, apart from node (atomic) features, the attributes of edges (bonds) are also important for predicting local and global properties of graphs. Generally speaking, GNNs inductively learn a node representation by recursively aggregating and transforming the feature vectors of its neighboring nodes. Following (Battaglia et al. 2018; Zhang, Cui, and Zhu 2018; Zhou et al. 2018), a per-layer update of the GNN in our setting involves three computations, message passing Eq. (1), message aggregation Eq. (2), and updating the node representation Eq. (3), which can be expressed as:

m_vu^(l) = MSG(h_u^(l-1), h_v^(l-1), e_vu)                         (1)
M_u^(l)  = AGG({m_vu^(l), e_vu | v ∈ N(u)})                        (2)
h_u^(l)  = UPDATE(M_u^(l), h_u^(l-1))                              (3)

where h_u^(l) is the embedded representation of node u at layer l; e_vu is the weighted edge pointing from v to u; and N(u) is u's neighborhood, from which it collects information to update its aggregated message M_u^(l). Specifically, h_u^(0) = x_u initially, and h_u^(L) is the final embedding of node u for an L-layer GNN node classifier.

Here, following (Schlichtkrull et al. 2018), the GNN layer using edge weights for filtering can be formed by the following steps:

m_vu^(l) = W_1^(l-1) h_v^(l-1)                           (message)    (4)
M_u^(l)  = Σ_{v∈N(u)} g(m_vu^(l), h_u^(l-1), e_vu)       (aggregate)  (5)
h_u^(l)  = σ(W_0^(l-1) h_u^(l-1) + M_u^(l))              (update)     (6)

where N(u) denotes the set of neighbors of node u, e_vu denotes the directed edge from v to u, W denotes the model's parameters to be learned, and g is any linear/nonlinear function that can be applied to the neighbor nodes' feature embeddings. We set h^(l) ∈ R^{d^(l)}, where d^(l) is the dimension of the l-th layer representation.

As in the graph convolution operations of (Gong and Cheng 2019), the edge feature matrices are used as filters to multiply the node feature matrix. To avoid increasing the scale of the output features through this multiplication, the edge features need to be normalized, as in GAT (Veličković and others 2018) and GCN (Kipf and Welling 2016). Due to the aggregation mechanism, we normalize the weights by in-degree, ē_vu = e_vu / Σ_{v∈N(u)} e_vu. Our method can deal with negatively weighted edges by re-normalizing them to a positive interval, for instance [0, 1]; therefore, in the following we use only positively weighted edges, and the edge weights act as message filtering ratios. Depending on the problem:

• g can simply be defined as g = ē_vu m_vu^(l); or
• g can be a gate function, such as an RNN-type block applied to m_vu^(l), i.e., g = GRU(ē_vu m_vu^(l), h_u^(l-1)).
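A minimal PyTorch sketch of one such layer, using the simple form g = ē_vu m_vu^(l) from the first bullet above, is given below. The dense adjacency layout and the class name are illustrative assumptions rather than the authors' released code.

```python
# Sketch of Eqs. (4)-(6) with g = e_bar_vu * m_vu: messages are linear maps of
# neighbor embeddings, filtered by in-degree-normalized edge weights.
# Dense adjacency layout and naming are illustrative assumptions.
import torch
import torch.nn as nn

class EdgeWeightedGNNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W1 = nn.Linear(in_dim, out_dim, bias=False)  # message weights, Eq. (4)
        self.W0 = nn.Linear(in_dim, out_dim, bias=False)  # self-loop weights, Eq. (6)

    def forward(self, h, edge_weight):
        # h: [N, in_dim] node embeddings h^{(l-1)}
        # edge_weight: [N, N] with edge_weight[v, u] = e_vu (edge v -> u), 0 if absent
        in_degree = edge_weight.sum(dim=0, keepdim=True).clamp(min=1e-12)
        e_bar = edge_weight / in_degree                    # in-degree normalization per target u
        messages = self.W1(h)                              # m_vu^{(l)} = W1 h_v^{(l-1)}
        aggregated = e_bar.t() @ messages                  # M_u^{(l)} = sum_v e_bar_vu m_vu, Eq. (5)
        return torch.relu(self.W0(h) + aggregated)         # h_u^{(l)}, Eq. (6) with sigma = ReLU

# usage on a toy graph with 4 nodes and 2-D features
h = torch.randn(4, 2)
w = torch.tensor([[0., 1., 0., 0.],
                  [0., 0., 2., 0.],
                  [0., 0., 0., 3.],
                  [0., 0., 0., 0.]])
layer = EdgeWeightedGNNLayer(2, 8)
print(layer(h, w).shape)  # torch.Size([4, 8])
```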
4 Explaining Informative Components of Graph-Structured Data

Relational structures in graphs often contain crucial information for node classification, such as the graph's topology and the information flow (i.e., its direction and amplitude). Therefore, knowing which edges contribute the most to the information flow towards or from a node is essential to understand and interpret the node classification evidence.

We tackle the weighted graph feature explanation problem as a two-stage pipeline. First, we train a node classification function, in this case a GNN. The GNN inputs are a graph G = (V, E), its associated node features X, and its true node labels Y. We represent this classifier as Φ : G → (u → y_u), where y_u ∈ I_C. One advantage of GNNs is the preservation of the information flow across nodes as well as of the data structure. Furthermore, a GNN is invariant to permutations of the node ordering; hence it keeps the relational inductive biases of the input data (see (Battaglia et al. 2018)). Second, given the node classification model and the node's true label, the explanation part provides a subgraph and a subset of features retrieved from the k-hop neighborhood of each node u, for k ∈ N and u ∈ V. The subgraph, along with the subset of features, is the minimal set of information and information flow across the neighbor nodes of u that the GNN uses to compute the node's label. We define G_S = (V_S, E_S) to be a subgraph of G, written G_S ⊆ G, if V_S ⊆ V and E_S ⊆ E. Considering the classification y_u ∈ I_C of node u, Informative Components Detection computes a subgraph G_S containing u that aims to explain the classification by looking at the edge connectivity patterns E_S and their connecting nodes V_S. This provides insights into the characteristics of the graph that contribute to the node's label.

4.1 Maximal Mutual Information (MMI) Mask

Due to the properties of the GNN aggregation (5), we only need to consider the graph structure used in aggregation, i.e., the computational graph w.r.t. node u, defined as G_c(u) and containing N′ nodes, where N′ ≤ N. The node feature set associated with G_c(u) is X_c(u) = {x_v | v ∈ V_c(u)}. For node u, the label prediction of the GNN Φ is given by ŷ_u = Φ(G_c(u), X_c(u)), which can be interpreted as a distribution P_Φ(Y | G_c, X_c) given by the GNN. Our goal is to identify a subgraph G_S ⊆ G_c(u) (and its associated features X_S = {x_w | w ∈ V_S}, or a subset of them) which the GNN uses to predict u's label.
Using ideas from information theory (Cover and Thomas 2012) and following GNNExplainer (Ying et al. 2019), the informative explainable subgraph and node feature subset are chosen to maximize the mutual information (MI):

max_{G_S} I(Y, (G_S, X_S)) = H(Y | G, X) − H(Y | G_S, X_S).   (7)

Since the trained GNN node classifier Φ is fixed, the H(Y | G, X) term of Eq. (7) is constant. As a result, maximizing (7) is equivalent to minimizing the conditional entropy H(Y | G_S, X_S):

H(Y | G_S, X_S) = −E_{Y | G_S, X_S}[log P_Φ(Y | G_S, X_S)].   (8)

Therefore, the explanation of the graph components with prediction power w.r.t. node u's prediction ŷ_u is a subgraph G_S and its associated feature set X_S that minimize (8). Thus, the objective of the explanation is to pick the top informative edges and their connecting neighbors, which form a subgraph, for predicting u's label: some edges in u's computational graph G_c(u) form important message-passing (5) pathways, which allow useful node information to be propagated across G_c(u) and aggregated at u for prediction, while other edges in G_c(u) are not informative for the prediction. Instead of directly optimizing G_S in (8), since there are exponentially many discrete structures G_S ⊆ G_c(u) containing N′ nodes, GNNExplainer (Ying et al. 2019) optimizes a mask M_sym ∈ [0, 1]^{N′×N′} on the binary adjacency matrix, which allows gradient descent to be performed on G_S.

If we only used the edge weights for node embedding, the connections could be treated as binary and fit into the original GNNExplainer. However, since we use the edge weights as filtering, the mask should affect both the filtering and the normalization. We extend the original GNNExplainer method by considering edge weights and improve it by adding extra regularization. Unlike GNNExplainer, where there are no constraints on the mask value, we add constraints on the values learned by the mask,

Σ_w M_vw e_vw = 1,  M_vw ≥ 0,  for (v, w) ∈ E_c(u),   (9)

and perform a projected gradient descent optimization. Therefore, rather than optimizing a relaxed adjacency matrix as in GNNExplainer, we optimize a mask M ∈ [0, 1]^Q on the weighted edges, supposing there are Q edges in G_c(u). Then E_c^M = E_c ⊙ M, where ⊙ is the element-wise multiplication of two matrices. The masked edges E_c^M are subject to the constraint E_c^M[v, w] ≤ E_c[v, w], ∀(v, w) ∈ E_c(u). The objective function can then be written as:

min_M − Σ_{c=1}^{C} I[y = c] log P_Φ(Y | G_c = (V_c, E_c ⊙ M), X_c).   (10)

In GNNExplainer, the top k edges may not form a connected component including the node (say, u) under prediction. Hence, we add the entropy of (E_c ⊙ M)_vu for every node v pointing to node u as a regularization term, to ensure that at least one edge connected to node u is selected. After the mask M is learned, we use a threshold to remove edges with small E_c ⊙ M values and isolated nodes. Our proposed optimization method for learning M, maximizing the mutual information (Eq. (7)) under the above constraints, is shown in Algorithm 1.

Algorithm 1: Optimize mask for weighted graph
Input: 1. G_c(u), the computation graph of node u; 2. a pre-trained GNN model Φ; 3. y_u, node u's true label; 4. M, the learnable mask; 5. K, the number of optimization iterations; 6. L, the number of GNN layers.
 1: M ← random initialization                         ▷ initialize, M ∈ [0, 1]^Q
 2: h_v^(0) ← x_v, for v ∈ G_c(u)
 3: for k = 1 to K do
 4:   M_vw ← exp(M_vw e_vw) / Σ_v exp(M_vw e_vw)      ▷ renormalize mask
 5:   for l = 1 to L do
 6:     m_vu^(l) ← W_1^(l−1) h_v^(l−1)                ▷ message
 7:     M_u^(l) ← Σ_v g(M_vu m_vu^(l), h_u^(l−1))     ▷ aggregate
 8:     h_u^(l) ← σ(W_0^(l−1) h_u^(l−1) + M_u^(l))    ▷ update
 9:   end for
10:   ŷ_u ← softmax(h_u^(L))                          ▷ predict on masked graph
11:   loss ← crossentropy(y_u, ŷ_u) + regularizations
12:   M ← optimizer(loss, M)                          ▷ update mask
13: end for
Return: M
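The following PyTorch sketch mirrors the loop of Algorithm 1 for a generic differentiable node classifier. The single global softmax renormalization, the entropy regularizer weight, and the model interface (a callable mapping masked edge weights to class logits for node u) are simplifying assumptions made for illustration.

```python
# Sketch of the MMI mask optimization (Algorithm 1): learn a soft mask over the
# Q edges of G_c(u) so that the masked graph still yields the GNN's prediction
# for node u. `model(edge_weight)` returning class logits for u is an assumed
# interface, not the paper's released code.
import torch

def optimize_edge_mask(model, edge_weight, y_u, num_iters=200, lr=0.01):
    # edge_weight: [Q] positive weights e_vw of the computation graph G_c(u)
    mask_logits = torch.randn_like(edge_weight, requires_grad=True)  # initialize M
    optimizer = torch.optim.Adam([mask_logits], lr=lr)
    for _ in range(num_iters):
        optimizer.zero_grad()
        # renormalize the mask (Alg. 1, line 4); a single global softmax is used
        # here for simplicity instead of a per-target-node normalization
        mask = torch.softmax(mask_logits * edge_weight, dim=0)
        logits = model(mask * edge_weight)                 # prediction on the masked graph
        loss = torch.nn.functional.cross_entropy(
            logits.unsqueeze(0), torch.tensor([y_u]))      # data term of Eq. (10)
        loss = loss + 0.01 * (-mask * torch.log(mask + 1e-12)).sum()  # entropy regularizer
        loss.backward()
        optimizer.step()
    return mask.detach()
```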
4.2 Guided Gradient (GGD) Salience

Guided gradient-based explanation (Simonyan, Vedaldi, and Zisserman 2013) is perhaps the most straightforward and easiest approach. By simply calculating the derivative of the output with respect to the model input and then applying a norm, a score can be obtained. The gradient-based score can be used to indicate the relative importance of an input feature, since it represents the change in the input space that corresponds to the maximal positive rate of change in the model output. Since the edge weights act as filtering in our GNN, we can obtain the edge saliency as

g^E_vu = ReLU(∂ŷ_u^c / ∂e_vu),   (11)

where c ∈ {1, . . . , C} is the correct class of node u and ŷ_u^c is the score for class c before the softmax layer. Here, we select the edges whose g^E values are among the top k largest, together with their connecting nodes. The advantage of the gradient salience method is that it is easy to compute. However, it has recently been argued that it generally performs worse than newer techniques (Zhang et al. 2018; Selvaraju et al. 2017).
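A minimal sketch of Eq. (11) in PyTorch is given below. It assumes the GNN exposes a differentiable forward pass taking the edge-weight tensor and returning node u's pre-softmax class scores, which is an interface assumption for illustration.

```python
# Sketch of Eq. (11): edge saliency as the rectified gradient of the pre-softmax
# class score of node u with respect to each edge weight. The `model(edge_weight)`
# interface returning per-class scores for node u is an assumption.
import torch

def edge_gradient_saliency(model, edge_weight, class_c):
    ew = edge_weight.clone().requires_grad_(True)   # e_vu as a leaf tensor
    score = model(ew)[class_c]                      # \hat{y}_u^c before softmax
    score.backward()
    return torch.relu(ew.grad)                      # g^E_vu = ReLU(d score / d e_vu)
```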
4.3 Edge-Weighted Graph Attention (E-GAT)

The graph attention layer takes a set of node features H^(l−1) = {h_1^(l−1), h_2^(l−1), . . . , h_N^(l−1)}, h_i^(l−1) ∈ R^{d^(l−1)}, as input and maps them to H^(l) = {h_1^(l), h_2^(l), . . . , h_N^(l)}, h_i^(l) ∈ R^{d^(l)}. The idea is to compute an embedded representation of each node v ∈ V by aggregating its 1-hop neighborhood nodes {h_w^(l−1), ∀w ∈ N(v)} following a self-attention mechanism Att : R^{d^(l)} × R^{d^(l)} → R (Veličković and others 2018). Different from the original GAT (Veličković and others 2018), we leverage the edge weights of the underlying graph. The modified attention coefficient α_vw ∈ R can be expressed as a single feed-forward layer over the node embeddings together with the edge weight e_vw:

α_vw = Att(W_a h_v^(l−1), W_a h_w^(l−1))                                   (12)
     = LeakyReLU(a^T [W_a h_v^(l−1) ‖ W_a h_w^(l−1)]) e_vw,                (13)

where α_vw is the attention weight on the edge from w to v and indicates the importance of node w's features to node v. It allows every node to attend to the other nodes on the graph based on their node features, weighted by the underlying connectivity. W_a ∈ R^{d^(l)×d^(l−1)} is a learnable linear transformation that maps each node's feature vector from dimension d^(l−1) to the embedded dimension d^(l). The attention mechanism Att is implemented by a learnable nodal attribute vector a ∈ R^{2d^(l)} and a LeakyReLU with negative input slope 0.2. For explanation purposes, in order to make the coefficients comparable across different edges, we normalize the weights over the source nodes:

α̃_vw = α_vw / Σ_{w∈N(v)} α_vw,   (14)

where N(v) here denotes the set of nodes that v attends to. There is then an attention embedding layer before the graph convolutional layer:

h_v^(l) = Σ_{w∈N(v)} α̃_vw W_a h_w^(l−1).   (15)

We then average the attention over the layers.
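The sketch below computes the edge-weighted attention coefficients of Eqs. (12)–(15) for a single layer. The dense [N, N] layout, the module name, and the use of a softmax as a numerically convenient stand-in for the plain normalization of Eq. (14) are illustrative assumptions.

```python
# Sketch of Eqs. (12)-(15): GAT-style attention logits scaled by the edge
# weights e_vw, then normalized over each node's neighborhood.
import torch
import torch.nn as nn

class EdgeWeightedAttention(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.Wa = nn.Linear(in_dim, out_dim, bias=False)   # W_a
        self.a = nn.Parameter(torch.randn(2 * out_dim))    # attention vector a
        self.leaky = nn.LeakyReLU(0.2)

    def forward(self, h, edge_weight):
        # h: [N, in_dim]; edge_weight[v, w] = e_vw > 0 if w in N(v), else 0
        z = self.Wa(h)                                      # [N, out_dim]
        n = z.size(0)
        pair = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                          z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        logits = self.leaky(pair @ self.a) * edge_weight    # Eq. (13)
        logits = logits.masked_fill(edge_weight == 0, float("-inf"))
        # softmax over w plays the role of the normalization in Eq. (14);
        # nodes with no outgoing attention edges would need special handling
        alpha = torch.softmax(logits, dim=1)
        return alpha @ z                                    # Eq. (15)
```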
4.4 Clustering Node Class Sensitivity

For each node u, we compute the sensitivity of labeling it as class i ∈ {0, . . . , C − 1} with respect to all nodes v, . . . , w ∈ V_c(u) \ u in its computational graph:

[ ReLU(‖∂ŷ_u^1/∂x_v‖)   . . .   ReLU(‖∂ŷ_u^C/∂x_v‖) ]
[         ⋮              ⋱              ⋮           ]   (16)
[ ReLU(‖∂ŷ_u^1/∂x_w‖)   . . .   ReLU(‖∂ŷ_u^C/∂x_w‖) ]

We then cluster the row vectors of this matrix to obtain the sets of neighbor nodes that have the same contribution pattern when classifying node u into each class i ∈ I_C. This method can be used to validate whether the nodes on the informative subgraph have the same node feature sensitivity. Also, it can show the similarity between the neighbors in G_c.
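A possible realization of Eq. (16) is sketched below: the Jacobian norms of node u's class scores with respect to each neighbor's features are stacked into a matrix whose rows are then clustered. The choice of k-means and the model interface (features in, class scores for u out) are assumptions made for illustration.

```python
# Sketch of Eq. (16): per-neighbor sensitivity of node u's class scores, with the
# rows clustered to group neighbors by their contribution pattern. KMeans and the
# `model(x)` interface are assumptions.
import torch
from sklearn.cluster import KMeans

def class_sensitivity_clusters(model, x, neighbor_ids, num_classes, n_clusters=4):
    # x: [N, d] node features of the computation graph G_c(u)
    rows = []
    for v in neighbor_ids:
        row = []
        for c in range(num_classes):
            xv = x.clone().requires_grad_(True)
            score = model(xv)[c]                             # \hat{y}_u^c
            grad_v = torch.autograd.grad(score, xv)[0][v]    # d score / d x_v
            row.append(torch.relu(grad_v.norm()).item())
        rows.append(row)
    sensitivity = torch.tensor(rows)                         # the matrix in Eq. (16)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(sensitivity.numpy())
    return sensitivity, labels
```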
5 Evaluation Metrics and Methods

For synthetic data, we can compare the explanation with the data generation rules. However, for real data, we do not have ground truth for the explanation. In order to evaluate the results, we propose evaluation metrics for quantitatively measuring the explanation results, and propose correlation methods to validate whether the edge connection pattern or the node features are the crucial factor for classification. We define the metrics consistency, contrastivity, and sparsity (here, the definitions of contrastivity and sparsity differ from the ones in (Pope et al. 2019)) to measure the informative component detection results. First, to measure the similarity between graphs, we introduce the graph edit distance (GED) (Abu-Aisheh et al. 2015), a graph similarity measure analogous to the Levenshtein distance for strings. It is defined as the minimum cost of an edit path (a sequence of node and edge edit operations) transforming graph G1 into a graph isomorphic to G2. To cover the case in which the structures are isomorphic but the edge weights differ, i.e., GED = 0, the Jensen–Shannon divergence (JSD) (Nielsen 2010) is added to the GED to further compare the two isomorphic subgraphs. Specifically, we define consistency as the GED between the informative subgraphs of nodes in the same class, measuring whether the informative components detected for nodes of the same class are consistent; and we define contrastivity as the GED across the informative subgraphs of nodes in different classes, measuring whether the informative components detected for nodes of different classes are contrastive. Sparsity is defined as the density of the mask, Σ_{e_vw ∈ G_c(u)} Υ_vw / Q, with Υ ∈ {M, g^E}, i.e., the density of the component edge importance weights.
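The sketch below computes these three quantities with networkx. Treating the GED as directly computable via networkx's graph_edit_distance, averaging pairwise distances, and omitting the JSD refinement used when GED = 0 are simplifying assumptions; exact GED is expensive on larger subgraphs.

```python
# Sketch of the consistency / contrastivity / sparsity metrics defined above.
import itertools
import networkx as nx

def mean_ged(graphs_a, graphs_b=None):
    if graphs_b is None:                      # consistency: pairs within one class
        pairs = list(itertools.combinations(graphs_a, 2))
    else:                                     # contrastivity: pairs across two classes
        pairs = [(g1, g2) for g1 in graphs_a for g2 in graphs_b]
    dists = [nx.graph_edit_distance(g1, g2) for g1, g2 in pairs]
    return sum(dists) / len(dists)

def sparsity(edge_importance, num_edges):
    # edge_importance: iterable of mask values M_vw or saliencies g^E_vw
    return sum(edge_importance) / num_edges
```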
6 Experiments

Note that the color codes for all the figures below follow the ones denoted in Figure 2. The red node is the node we try to classify and explain.

6.1 Synthetic Data

Data. Following (Ying et al. 2019), we generated a Barabási–Albert (BA) graph with 15 nodes and attached 10 five-node "house"-structure graph motifs to random nodes, ending up with 65 nodes, as shown in Figure 2. We created a small graph for visualization purposes; however, the experimental results hold for larger graphs. Several natural and human-made systems, including the Internet, citation networks, social networks, and banking payment systems, can be thought of as approximately BA graphs, which contain a few nodes (hubs) with unusually high degree and a large number of poorly connected nodes. The edges connecting different node pairs were assigned different weights, also denoted in Figure 2, where w is an edge weight we will discuss later. We then added noise to the synthetic data by uniformly randomly adding 0.1N edges, where N is the number of nodes in the graph. In order to constrain the node label to be determined by the motif only, all the node features x_i were designed as 2-D node attributes with the same constant value.

Figure 2: Synthetic BA-house graph data and corresponding edge weights; each BA node belongs to class "0," and each "house"-shape node is labeled "1"–"3" based on its motif. The node orders are denoted.
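A possible way to generate such a BA-plus-house-motif graph with networkx is sketched below. The exact attachment rule, the class assignment within each motif, and the weight values are illustrative assumptions rather than the authors' released generator.

```python
# Sketch of the synthetic data: a Barabasi-Albert base graph with five-node
# "house" motifs attached to random base nodes, plus random noise edges.
import random
import networkx as nx

def ba_house_graph(ba_nodes=15, num_houses=10, w=0.1, noise_ratio=0.1, seed=0):
    random.seed(seed)
    G = nx.barabasi_albert_graph(ba_nodes, m=2, seed=seed)
    labels = {v: 0 for v in G.nodes}                      # BA nodes -> class 0
    nx.set_edge_attributes(G, 1.0, "weight")
    for _ in range(num_houses):
        base = len(G)
        roof, m1, m2, b1, b2 = range(base, base + 5)      # five-node house motif
        edges = [(roof, m1), (roof, m2), (m1, m2), (m1, b1), (m2, b2), (b1, b2)]
        G.add_weighted_edges_from([(a, b, w) for a, b in edges])
        attach = random.randrange(ba_nodes)
        G.add_edge(attach, b1, weight=w)                  # hook the motif onto the BA graph
        labels.update({roof: 1, m1: 2, m2: 2, b1: 3, b2: 3})
    for _ in range(int(noise_ratio * len(G))):            # add 0.1 * N random noise edges
        u, v = random.sample(list(G.nodes), 2)
        G.add_edge(u, v, weight=w)
    return G, labels                                      # 15 + 10 * 5 = 65 nodes
```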
GNN Training. We use g = ē_vu m_vu^(l) in Eq. (5). The parameter settings are input_dim = 2, hidden_dim = 8, num_layers = 3, and 300 epochs. We randomly split 60% of the nodes for training and use the rest for testing.

Results. The GNN achieved 100% and 96.7% accuracy on the training and testing sets, respectively. We performed informative component detection (keeping the top 6 edges) and compared the detected components with the human interpretation – the "house" shape – which can be used as a reality check (Table 1). In Figure 3, we show the explanation results for nodes in the same position but with different topological structure (rows a & b) and compare how edge weights affect the results (rows a & e). We also show the results generated by the different methods (rows a & c & d).

Table 1: Saliency component compared with the "house" shape (measured on all the nodes in class 1 with w = 0.1)

Method   MMI mask   GGD     E-GAT
AUC      0.932      0.899   0.667

Figure 3: Informative components. Rows a)–c), w = 0.1. Row a) is for a node in class 1 not connecting to class 0 nodes, using the MMI mask. Row b) is for a node in class 1 connecting to class 0 nodes, using the MMI mask. Row c) is for a node in class 1 connecting to class 0 nodes, using GGD. Row d) is for a node in class 1 connecting to class 0 nodes, using E-GAT. Row e) is for a node in class 1 connecting to class 0 nodes, using the MMI mask, but with w = 2.

We show the clustering results for node 20 on SynComp in Figure 4. Nodes 21–24 were clustered together (visualized in 2-D space by t-SNE (Maaten and Hinton 2008) in Figure 4(a)), since they were most sensitive for predicting node 20 as class 1, which matched the informative components shown in Figure 4(b). The other clusters grouped the nodes in G_c(u) by their saliency sensitivity to a certain class in {0, 2, 3}.

Figure 4: Node class sensitivity clustering (a) and comparison with the informative subgraph (b). Node orders are denoted as numbers and node labels are denoted as colors.
6.2 Bitcoin OTC Data

Bitcoin is a cryptocurrency that is used for trading anonymously. There is counterparty risk due to this anonymity. We use the Bitcoin OTC dataset (Kumar et al. 2018), collected over one month, in which Bitcoin users rate their level of trust in the users they made transactions with. The rating scale runs from -10 to +10 (excluding 0). According to OTC's guidelines, the higher the rating, the more trustworthy the user. We labeled users who had received at least one negative rating as risky; users for whom more than half of the received ratings were greater than one as trustworthy; users who did not receive any ratings as the unknown group; and the rest of the users as the neutral group. We chose the rating network data at a single time point, which contained 1447 users and 5739 rating records. We renormalized the edge weights to [0, 1] by ẽ_ij = e_ij/20 + 1/2. We then trained a GNN on 90% of the unknown, neutral, and trustworthy nodes and 20% of the risky nodes (those nodes only), and performed classification on the remaining nodes. We chose g as a GRU gate, and the other settings are hidden_dim = 32, num_layers = 3, and 1000 epochs. The learning rate was initialized to 0.1 and halved every 100 epochs. We achieved an accuracy of 0.730 on the training dataset and 0.632 on the testing dataset. Finally, we show the explanation results obtained with the MMI mask, since it is more interpretable (see Figure 5), and compare them with plausible human reasoning. The pattern of the informative component of a risky node contains a negative rating; the majority of the ratings to a trustworthy node are greater than 1; and a neutral node receives many ratings with score 1. The informative components match the rules by which we labeled the nodes.

Figure 5: Informative subgraph detected by the MMI mask (showing the original rating scores on the edges).
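A small sketch of the rating renormalization and the labeling rules described above is given below; the (rater, ratee, rating) record format is an assumption made for illustration.

```python
# Sketch of the Bitcoin OTC preprocessing: rescale ratings in [-10, 10] to [0, 1]
# and label users from the ratings they received.
def renormalize(rating):
    return rating / 20 + 0.5            # e_ij / 20 + 1/2

def label_user(received_ratings):
    if not received_ratings:
        return "unknown"
    if any(r < 0 for r in received_ratings):
        return "risky"
    if sum(r > 1 for r in received_ratings) > len(received_ratings) / 2:
        return "trustworthy"
    return "neutral"

ratings = [("u1", "u2", -3), ("u3", "u2", 5), ("u3", "u4", 2), ("u1", "u4", 6)]
received = {}
for rater, ratee, r in ratings:
    received.setdefault(ratee, []).append(r)
print({u: label_user(rs) for u, rs in received.items()})   # {'u2': 'risky', 'u4': 'trustworthy'}
print([renormalize(r) for _, _, r in ratings])
```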
6.3 Bank Transaction Data – Account Matching

We use a database of 1000 international payment transaction records that involve four parties: the originating (ORG) account, the sending (SND) bank, the receiving (RCV) bank, and the beneficiary (BEN) account. Each party plays a role in the transaction, and the set of possible roles is I_4 = {ORG, SND, RCV, BEN}. The task is to classify each node as being either an account (ORG or BEN) or a bank (SND or RCV), since the input graph data is noisy. For this, we build a transaction graph using the payment data: the nodes are accounts or banks; the transactions between the nodes are the directed edges in the graph; and the transaction amounts are the edge features. Furthermore, each node is associated with categorical features ('party ID type' and 'country'). We used one-hot encoding to convert the node features into 10 × 1 vectors, and the edge features were normalized to [0, 1]. We labeled 10% of the data and trained a GNN to classify each account as a bank or a customer account. We use the same GNN architecture as described in the synthetic data experiments, use the Adam optimization method with a fixed learning rate of 0.01, and train the model for 100 epochs until convergence. The accuracy achieved on the node classification task is 100%. Using our explanation algorithm, we present a visualization of the informative components detected for each account type – customer account or bank account – in Figure 6.

Figure 6: Examples of informative components for account matching.
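The categorical node features can be one-hot encoded as sketched below; the specific category values for 'party ID type' and 'country' are hypothetical and only illustrate how a 10-dimensional vector could be formed.

```python
# Sketch of the node feature construction: one-hot encoding of the categorical
# attributes 'party ID type' and 'country'. The category values are hypothetical.
def one_hot(value, vocabulary):
    return [1.0 if value == v else 0.0 for v in vocabulary]

id_types = ["IBAN", "BIC", "ACCT"]                        # hypothetical 'party ID type' values
countries = ["US", "GB", "DE", "FR", "JP", "SG", "BR"]    # hypothetical 'country' values

node = {"party_id_type": "BIC", "country": "DE"}
feature = one_hot(node["party_id_type"], id_types) + one_hot(node["country"], countries)
print(len(feature), feature)   # a 10-dimensional vector, as used in the experiment
```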
For the above two real datasets, we measured consistency, contrastivity, and sparsity by selecting the top 4 edges. Since we could not achieve good classification results on the real datasets when the attention layer was added to the GNN, we only applied the MMI mask and GGD methods for informative component detection on the real data. The measurements for the MMI mask and GGD methods are listed in Table 2. The higher contrastivity values compared with the consistency values show that the GNN relied on the data topology information for node classification and that nodes in the same class have similar topology structure. The low sparsity values indicate that the informative components have high information entropy, showing the potential of the explanation methods to extract informative patterns from the financial data.

Table 2: Evaluation of informative components (averaged over all the correctly classified nodes)

Method   Dataset   Consistency   Contrastivity   Sparsity
MMI      BitCoin   1.79          3.23            0.126
MMI      Account   1.25          1.93            0.021
GGD      BitCoin   2.17          2.90            0.124
GGD      Account   1.44          1.89            0.095

7 Conclusion

In this work, we formulate transaction data in the financial system as a weighted directed graph. We apply explanation methods to weighted graphs in the GNN node classification task, which can provide subjective and comprehensive explanations of the data interaction patterns used by the GNN. We also propose evaluation metrics and methods to validate the explanation results. The explanations may benefit debugging, feature engineering, informing human decision-making, building trust, etc. Our future work will include extending the explanation to graphs with multi-dimensional edge features and explaining different graph learning tasks, such as link prediction and graph classification.

References

Abu-Aisheh, Z.; Raveaux, R.; Ramel, J.-Y.; and Martineau, P. 2015. An exact graph edit distance algorithm for solving pattern recognition problems.

Baldassarre, F., and Azizpour, H. 2019. Explainability techniques for graph convolutional networks. arXiv preprint arXiv:1905.13686.

Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.

Cover, T. M., and Thomas, J. A. 2012. Elements of information theory. John Wiley & Sons.

Gong, L., and Cheng, Q. 2019. Exploiting edge features for graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9211–9219.
Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive
representation learning on large graphs. In Advances in Neu-
ral Information Processing Systems, 1024–1034.
Kipf, T. N., and Welling, M. 2016. Semi-supervised classi-
fication with graph convolutional networks. arXiv preprint
arXiv:1609.02907.
Kumar, S.; Hooi, B.; Makhija, D.; Kumar, M.; Faloutsos, C.;
and Subrahmanian, V. 2018. Rev2: Fraudulent user predic-
tion in rating platforms. In Proceedings of the Eleventh ACM
International Conference on Web Search and Data Mining,
333–341. ACM.
Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using
t-sne. Journal of machine learning research 9(Nov):2579–
2605.
Nielsen, F. 2010. A family of statistical symmetric di-
vergences based on jensen’s inequality. arXiv preprint
arXiv:1009.4004.
Pope, P. E.; Kolouri, S.; Rostami, M.; Martin, C. E.; and
Hoffmann, H. 2019. Explainability methods for graph con-
volutional neural networks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
10772–10781.
Schlichtkrull, M.; Kipf, T. N.; Bloem, P.; Van Den Berg, R.;
Titov, I.; and Welling, M. 2018. Modeling relational data
with graph convolutional networks. In European Semantic
Web Conference, 593–607. Springer.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.;
Parikh, D.; and Batra, D. 2017. Grad-cam: Visual explana-
tions from deep networks via gradient-based localization. In
Proceedings of the IEEE International Conference on Com-
puter Vision, 618–626.
Shang, C.; Liu, Q.; Chen, K.-S.; Sun, J.; Lu, J.; Yi, J.; and
Bi, J. 2018. Edge attention-based multi-relational graph
convolutional networks. arXiv preprint arXiv:1802.04944.
Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2013. Deep
inside convolutional networks: Visualising image classifica-
tion models and saliency maps.
Veličković, P., et al. 2018. Graph attention networks. In
ICLR.
Yang, H.; Li, X.; Wu, Y.; Li, S.; Lu, S.; Duncan, J. S.; Gee,
J. C.; and Gu, S. 2019. Interpretable multimodality em-
bedding of cerebral cortex using attention graph network for
identifying bipolar disorder. MICCAI 671339.
Ying, R.; Bourgeois, D.; You, J.; Zitnik, M.; and Leskovec,
J. 2019. Gnn explainer: A tool for post-hoc explanation of
graph neural networks. arXiv preprint arXiv:1903.03894.
Zhang, J.; Bargal, S. A.; Lin, Z.; Brandt, J.; Shen, X.; and
Sclaroff, S. 2018. Top-down neural attention by excita-
tion backprop. International Journal of Computer Vision
126(10):1084–1102.
Zhang, Z.; Cui, P.; and Zhu, W. 2018. Deep learning on
graphs: A survey.
Zhou, J.; Cui, G.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li,
C.; and Sun, M. 2018. Graph neural networks: A review of
methods and applications.
