Graph Representation Learning
with Deep Embedding Approach
Hanjun Dai
Ph.D. student in School of Computational Science & Engineering
Georgia Institute of Technology
Graph applications
Drug discovery: Dai et al., structure2vec, ICML 2016
Recommendation: Dai et al., DeepCoevolve, Recsys DLRS 2016
Knowledge Graph: Dai et al., VRN, in submission; Trivedi et al., Know-Evolve, ICML 2017
TSP / Maxcut / Vertex Cover: Dai et al., S2V-DQN, NIPS 2017
Outline
• Review of traditional approaches
• Our architecture
• Experiments on RNA and molecules
• Extension to social network and recommendation
• Application in graph combinatorial optimization
Review: RNA / Molecule property prediction
Application: High-throughput virtual screening
[Figure: RNA sequences such as C U U C A G with their secondary structures,
mapped to handcrafted bag-of-words feature vectors, e.g. counts
0 2 0 0 … 0 2 over the 2-mers AA AC AU AG … GG GU]
Structured data → Target:
• Binding affinity: binary classification problem
• Power Conversion Efficiency (PCE): regression problem
Problem with handcrafted features
• Stage 1: build a kernel matrix 𝑘(𝜒𝑖, 𝜒𝑗), or a high-dimensional explicit
bag-of-words feature map (e.g. 2-mer counts 0 2 0 0 … 0 2 over AA AC AU AG … GG GU)
• Stage 2: train a classifier on top
Not scalable for millions of data points
Constructed features are not aware of the task
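As a concrete illustration (not code from the talk), the handcrafted bag-of-words feature is just a table of k-mer counts over the RNA alphabet:

```python
from itertools import product

# Count every length-k subsequence (k-mer) of an RNA sequence over A, C, G, U.
# This is the "handcrafted feature" from the slide: a fixed count vector that
# is built once and is not aware of the downstream task.
def kmer_features(seq, k=2):
    alphabet = "ACGU"
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = {km: 0 for km in kmers}          # one slot per possible k-mer
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts

feats = kmer_features("CUUCAG")   # the example sequence from the slide
# e.g. feats["UU"] == 1 and feats["CA"] == 1
```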
Review: Temporal Recommendation
who will do what and when?
[Figure: rating matrix 𝑅 with users (Christine, Alice, David, Jacob) as rows
and items (Towel, Shoe, Book) as columns]
Matrix factorization: 𝑅 ≈ 𝑈𝑉
Epoch division along time 𝑡
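The factorization 𝑅 ≈ 𝑈𝑉 can be sketched with a plain gradient-descent fit on observed entries; the rank, learning rate, and regularizer below are illustrative choices, not the talk's:

```python
import numpy as np

# Illustrative sketch (not the talk's model): fit R ≈ U @ V.T by gradient
# descent on the squared error over observed entries only.
def matrix_factorize(R, rank=2, iters=3000, lr=0.01, lam=0.1, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(R.shape[0], rank))
    V = rng.normal(scale=0.1, size=(R.shape[1], rank))
    mask = ~np.isnan(R)                   # observed entries
    R0 = np.nan_to_num(R)
    for _ in range(iters):
        E = mask * (R0 - U @ V.T)         # residual on observed entries
        U, V = U + lr * (E @ V - lam * U), V + lr * (E.T @ U - lam * V)
    return U, V

R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [1.0, 1.0, 5.0]])           # NaN = unobserved rating
U, V = matrix_factorize(R)
pred = U @ V.T                            # reconstructed / predicted ratings
```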
Minimum vertex/set cover
Advertisers: influence maximization
Review: Graph Combinatorial Optimization
2-approximation for minimum vertex cover:
repeat till all edges are covered:
• Select the uncovered edge with the largest total degree
Manually designed rule. Can we learn from data?
NP-hard problems
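The manually designed rule can be sketched as follows; adding both endpoints of the chosen edge (the step that yields the factor-2 guarantee, left implicit on the slide) is standard:

```python
# Sketch of the 2-approximation for minimum vertex cover described on the
# slide: repeatedly pick the uncovered edge whose endpoints have the largest
# total degree, and add both endpoints to the cover.
def vertex_cover_2approx(edges):
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    cover = set()
    uncovered = set(edges)
    while uncovered:
        # uncovered edge with the largest total endpoint degree
        u, v = max(uncovered, key=lambda e: degree[e[0]] + degree[e[1]])
        cover |= {u, v}
        uncovered = {e for e in uncovered if cover.isdisjoint(e)}
    return cover

# Star graph: optimal cover is {0}; the 2-approx returns {0, one leaf}.
cover = vertex_cover_2approx([(0, 1), (0, 2), (0, 3)])
```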
Outline
• Review of traditional approaches
• Our architecture
• Experiments on RNA and molecules
• Extension to social network and recommendation
• Application in graph combinatorial optimization
Intuitive understanding: local filters
• Image: filter applies to each local patch
• Graph: filter applies to each 1-hop neighborhood
[Figure: graph with input features 𝜒 = (𝑋1 … 𝑋4), latent variables 𝐻1 … 𝐻4,
and node embeddings 𝜇1 … 𝜇4 updated over iterations]
Iteration 1: 𝜇𝑣^(1) computed from the initial embeddings 𝜇𝑢^(0) of each node's
1-hop neighborhood and the input features
……
Iteration 𝑇: 𝜇𝑣^(𝑇)
End to end learning
Aggregate: 𝜇1^(𝑇) + 𝜇2^(𝑇) + ⋯ = 𝜇 = 𝑎(𝑊, 𝜒)
Label 𝑦: classification/regression with parameter 𝑉
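The iterative embedding computation can be sketched as follows; the tanh parameterization and weight shapes are illustrative assumptions, not the exact structure2vec update:

```python
import numpy as np

# Minimal sketch of the idea on the slide: each node embedding mu_v is
# updated for T rounds from its features x_v and its 1-hop neighbors'
# embeddings, then summed into a graph-level representation for
# classification/regression. Parameterization here is hypothetical.
def embed_graph(adj, X, W1, W2, T=3):
    mu = np.zeros((X.shape[0], W1.shape[1]))   # mu^(0) = 0
    for _ in range(T):
        neigh = adj @ mu                       # aggregate neighbor embeddings
        mu = np.tanh(X @ W1 + neigh @ W2)      # mu^(t+1)
    return mu.sum(axis=0)                      # aggregate: mu = sum_v mu_v^(T)

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)    # 4-node graph
X = rng.normal(size=(4, 5))                    # node features
W1, W2 = rng.normal(size=(5, 8)), rng.normal(size=(8, 8))
mu = embed_graph(adj, X, W1, W2)               # 8-dim graph embedding
```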
Outline
• Review of traditional approaches
• Our architecture
• Experiments on RNA and molecules
• Extension to social network and recommendation
• Application in graph combinatorial optimization
Experiment on RNA sequences
• MITOMI 2.0
• 28 TFs
• Regression problem
[Bar charts: test MAE (0–0.12) per transcription factor, Ace2 through Mcm1 and
Met31 through Yap7, for PWM, LM, SVR, DNN, CNN, FS, and S2V]

       S2V    Cmp
RMSE   0.035  0.039
PCC    0.62   0.45
SCC    0.29   0.26

Dai et al., Bioinformatics 2017
Experiment on Molecules
Dataset: Harvard Clean Energy Project
Size: 2.3 million; avg node #: 28; avg edge #: 33
Task: predict Power Conversion Efficiency (PCE, 0–12 %) of organic solar panel materials

                Test MAE  Test RMSE  # parameters
Mean predictor  1.986     2.406      1
WL level-3      0.143     0.204      1.6 m
WL level-6      0.096     0.137      1378 m
DE-MF           0.091     0.125      0.1 m
DE-LBP          0.085     0.117      0.1 m
Experiment on Molecules
[Plot: MAE (0.085–0.280) vs. parameter count (0.1M–1000M) for Embedded MF,
Embedded BP, Weisfeiler-Lehman level 6, and hashed WL level 6]
Embedding reduces model size by 10,000x!
[Dai, Dai & Song 2016]
Interpretable results
Effective (> 10) vs. ineffective (< 0.5)
• Sequence motifs
• Molecule fragments
Outline
• Review of traditional approaches
• Our architecture
• Experiments on RNA and molecules
• Extension to social network and recommendation
• Application in graph combinatorial optimization
Extension to network analysis
ℒ = Σ𝑣∈𝒱 loss(𝑦𝑣, 𝑦(𝜇𝑣))
• One loss term per node
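A sketch of this node-level objective, using an illustrative linear read-out for 𝑦(𝜇𝑣) (the read-out is an assumption, not the paper's parameterization):

```python
import numpy as np

# Node-level objective from the slide: instead of one label per graph,
# each node v contributes a term loss(y_v, y(mu_v)) — here squared error
# on a linear read-out of the node embeddings.
def node_level_loss(mu, y, w):
    preds = mu @ w                      # y(mu_v) for every node v
    return np.mean((preds - y) ** 2)    # one loss term per node, averaged

mu = np.array([[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]])  # node embeddings
y = np.array([0.5, 0.7, 1.0])                         # per-node labels
w = np.array([1.0, 2.0])                              # illustrative read-out
loss = node_level_loss(mu, y, w)
```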
Experiment on network analysis
[Bar chart: classification accuracy (40%–90%) on citation networks Citeseer,
Cora, and Pubmed for DeepWalk, node2vec, and structure2vec]

            Citeseer  Cora   Pubmed
# nodes     3,327     2,708  19,717
# edges     4,732     5,429  44,338
# classes   6         7      3
Label rate  0.036     0.052  0.003
Features    3,703     1,433  500

Document classification in citation networks
Experiment on network analysis
[Plots: score vs. fraction of training data (0.1–0.9) on Blogcatalog and
Wikipedia for deepwalk, node2vec, gcn, and s2v]

           Blogcatalog       Wikipedia
# nodes    10,312            4,777
# edges    333,983           184,812
# classes  39                40
Task       Group membership  POS tag
Dynamic Graphs for Recommendation [Dai, et al. 2016]
[Figure: user/item raw features 𝑋1 … 𝑋6 and latent variables 𝐻1 … 𝐻9 of an
LVM on a graph 𝐺 = (𝒱, ℇ); interactions with time/context unrolled along the
timeline 𝑡0, 𝑡1, 𝑡2, 𝑡3]
• Unroll the interactions along the timeline, representing:
  1. The bipartite interaction graph
  2. The temporal ordering of events
• Mini-batch training using truncated back-propagation through time (BPTT)
Experiment on recommendation
Favors 'recurrent' events: watching TV programs, discussing on forums, visiting restaurants
Outline
• Review of traditional approaches
• Our architecture
• Experiments on RNA and molecules
• Extension to social network and recommendation
• Application in graph combinatorial optimization
Learning graph opt: Motivation
Minimum vertex/set cover
• Branch and bound: exponential time complexity
• Constructive approximation: results too weak
Cannot learn from solved instances!
Learning graph opt: Motivation
• Data from same / similar distribution
Social network: Barabási–Albert
Road network: fixed graph with
evolving edge weights
Images taken from Wikipedia
• Supervised learning? No such supervision
• Reinforcement learning! Learning by trial and error
Learning graph opt: RL background
[Mnih, et al. Nature 2015]
Greedy policy:
𝑖∗ = 𝑎𝑟𝑔𝑚𝑎𝑥𝑖 𝑄(𝑆, 𝑖)
• State 𝑺: current screen
• Reward 𝑹(𝒕): score you earned at current step
• Action value function 𝑸(𝑺, 𝒊):
your predicted future total rewards
• Action 𝒊: move your paddle left / right
• Policy 𝝅(𝒔): How to choose your action
Learning graph opt: RL on graphs
[Dai, et al. NIPS 2017]
min𝑥𝑖∈{0,1} Σ𝑖∈𝓥 𝑥𝑖   s.t. 𝑥𝑖 + 𝑥𝑗 ≥ 1, ∀(𝑖, 𝑗) ∈ 𝓔
Reward: 𝑟(𝑡) = −1
• State 𝑺:
current partial solution
• Action value function 𝑸(𝑺, 𝒊):
Expected negative future loss
• Greedy Policy 𝝅(𝒔):
Add best node
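A toy rollout of this MDP; the hand-coded scoring function below merely stands in for the learned 𝑄(𝑆, 𝑖):

```python
# MDP for minimum vertex cover as on the slide: the state is the current
# partial solution, each action adds a node, and every step pays reward -1
# (so maximizing return minimizes cover size).
def run_greedy_policy(edges, q_fn):
    nodes = {u for e in edges for u in e}
    state = set()                                   # current partial solution
    total_reward = 0
    while any(state.isdisjoint(e) for e in edges):  # edges still uncovered
        candidates = nodes - state
        best = max(candidates, key=lambda i: q_fn(state, i))  # argmax_i Q(S,i)
        state.add(best)
        total_reward += -1                          # r(t) = -1 per node added
    return state, total_reward

# Hand-coded stand-in for Q: prefer nodes that cover more uncovered edges.
edges = [(0, 1), (0, 2), (0, 3), (2, 3)]
q = lambda S, i: sum(1 for e in edges if i in e and S.isdisjoint(e))
cover, ret = run_greedy_policy(edges, q)
```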
Learning graph opt: action-value function
• Parameterize Action value function 𝑄(𝑆, 𝑖) with structure2vec
• Train with Deep Q-Learning (DQN)
Bellman optimality equation
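The Bellman-optimality-based update behind (deep) Q-learning, sketched in tabular form; in S2V-DQN the function 𝑄 is parameterized by structure2vec, but a plain table on a toy two-state chain is enough to show the update rule:

```python
# One Q-learning step from the Bellman optimality equation:
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

actions = ["left", "right"]
Q = {(s, a): 0.0 for s in [0, 1] for a in actions}
# one observed transition: in state 0, "right" moves to state 1 with reward 1
q_learning_step(Q, s=0, a="right", r=1.0, s_next=1, actions=actions)
# Q[(0, "right")] moves halfway (alpha=0.5) toward the target 1.0
```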
Learning graph opt: quantitative comparison
approximation ratio ≈ 1
• A distribution of scale-free networks
• Optimal approximated by running CPLEX for 1 hour
[Bar chart: approximation ratios between 1.000 and 1.007, generalization to large instances]
Learning graph opt: quantitative comparison
• Train on small graphs with 50-100 nodes
• Generalize not only to graphs from the same distribution, but also to larger graphs
• Approximation ratio < 1.007
Learning graph opt: time-solution tradeoff
[Scatter plot: solution quality vs. running time for Embedded MF, CPLEX 1st
through 4th feasible solutions, 2-approx, 2-approx+, and an RNN baseline]
Embedding produces an algorithm with a good tradeoff!
• Generate 200 Barabási-Albert networks with 300 nodes
• Let CPLEX produce the 1st, 2nd, 3rd, and 4th feasible solutions
Learning graph opt: real-world data
http://snap.stanford.edu/netinf/#data
MemeTracker graph: 960 nodes and 5000 edges
Methods           Approximation ratio
Optimal           1.000 (473 nodes)
S2V-DQN           1.002 (474 nodes)
MVCApprox-Greedy  1.222 (578 nodes)
MVCApprox         1.408 (666 nodes)
• Learning from sampled
subgraphs
Learning graph opt: learned strategy
• Learned a greedy algorithm which is different from known ones
Learning graph opt: other problems
• Maximum cut
• Traveling Salesman Problem [Figure: solution found vs. optimal tour]
• Set cover: learn with the bipartite graph (set cover image taken from Wikipedia)
Thanks to my collaborators in this project
• Advisor:
Le Song
• Collaborators (alphabetical order):
Bo Dai, Bistra Dilkina, Elias Khalil, Rakshit Trivedi, Yichen Wang, Yuyu Zhang
Q&A

Hanjun Dai, PhD Student, School of Computational Science and Engineering, Georgia Tech at MLconf ATL 2017


Editor's Notes

  1. Now is the time to put them together. We start with the zero embeddings and then perform one step of local filtering. For example, to update 𝜇2 we use its neighborhood embeddings and its input features. Similarly, we can update all node embeddings. We iterate this process several times; intuitively, this allows each embedding to capture more and more neighborhood information. In the last step, we merge the embeddings into a vector representation of the entire structured input. The model can be trained in an end-to-end fashion.
  2. If we unroll the interactions between users and items, we get a dependency graph like this. The dependency comes from the assumption that a user's or item's latent representation changes over time, after each interaction event. We can then abstract it as a directed graph on the right-hand side. Different from the previous scenario, here we get a single gigantic graph, so we apply stochastic training with truncated backpropagation through time, as commonly used in recurrent neural networks.