Eclipse Hawk: model repository
querying as a service
Antonio García Domínguez // @antoniogado
October 15, 2018
School of Engineering and Applied Science, Aston University
Introduction
Who am I?
Current roles
• 2011: Eclipse Committer (Epsilon, now Hawk)
• 2016: Lecturer at Aston University (Birmingham, UK)
• 2018: Technical Director at Beautiful Canoe (part of Aston)
Research trajectory
• Started with automated theorem provers (ACL2, anyone?)...
• ... moved to software testing (e.g. mutation testing)
• ... applied it to service-oriented systems (SOA, WS-*)
• ... ended up in MDE land! (went from applications to infrastructure)
• Now interested in NoSQL technologies (graph DBs, indexing)
1
Two of the challenges in collaborative MDE
Larger, more distributed teams
• Models are created by multi-disciplinary teams
• Teams can be distributed across the world
• Different members work on different parts of the model
More detailed models of larger systems
• We are modelling increasingly larger systems:
• Entire buildings: BIM
• Cars: AUTOSAR
• We are increasing the level of detail of our models:
• Executable models are on the rise
• Statement-level reverse-engineering models
• If not careful, tools become slow and inefficient!
2
Two existing collaborative approaches
Eclipse Connected Data Objects (CDO)
+ Database storage on mature RDBMS (e.g. H2)
+ Per-element on-demand loading, can reuse SQL query planners
+ Referential integrity is well understood in RDBMS
− Best performance requires careful mapping (e.g. ELists)
− Hard to set up for first timers, requires specialist knowledge
File-based version control (Git/SVN)
+ Industry already trusts Git/SVN with their code
+ Can use close-fitting data model, less manual mapping involved
+ Inspecting, backing up and managing files is well understood
/ Need to fragment large models into multiple files
− Preserving inter-fragment links can be difficult
− Running a query may require large download + loading all files! 3
Solving some of the issues with the file-based approaches
Building the fragmentation into the editor
• This could be one more service of the editor generator
• EMF-Splitter (A. Garmendia et al.) does this for EMF
Preserving inter-fragment links
• Editor should reconcile refs when fragments are moved
• Editor should warn about broken refs after deletions
• Epsilon Concordance (Kolovos et al.) does this for EMF
Scalable querying as a service
• We could mirror each fragment into graph DB, then reconnect
• Industrial graph DBs already used for big data applications
• We can provide a remote API to run queries over the network
• Hawk does this for EMF/UML/Modelio... (this talk!)
4
Hawk basics
Indexing a library model with Hawk
We go from these model files... to these NoSQL graphs:
• Ecore packages → metamodel nodes
• Ecore classes → type nodes
• Physical files → file nodes
• Model elements → element nodes
• MM index: package URI → metamodel node
• File index: file path → file node
• Users can define custom indices by attribute (e.g. title) or
expression (number of books, “is singleton?”) [1] 5
Component-based architecture
• Hawk Core
• Model parsers: Ecore XMI, IFC2x3, IFC4, Modelio, BPMN, Eclipse .MF
• Version control systems: Subversion, Folder / Git, Eclipse Workspace, HTTP Locations
• Storage backends: Neo4j, OrientDB, Greycat
• Query engines: EOL, Time-aware EOL, OrientDB SQL
• Clients: Eclipse UI, Thrift API
6
Ways to deploy Hawk
As an Eclipse plug-in
• Aimed at proficient modellers, easy to iterate over
• Provides Eclipse-based GUI and EMF resource abstractions
• Can be integrated with other tools (e.g. EMF-Splitter)
As a plain Java library
• For solution developers: Hawk is hidden from the user
• Most of Hawk is Eclipse-independent
As a remote API
• Remote querying service over shared model repositories
• Much faster than downloading + loading the models
7
Hawk case studies
Search within collaboration server (SOFTEAM) [3]
Figures: Modelio screenshot (http://modelio.org), Constellation dashboard, and the query box added to
Constellation; charts of indexing time (s) and OrientDB disk space usage (MB) against project size
(model elements), and of code generation times for Modelio (MT) vs. Hawk (HT) over runs 1 and 2–5.
• Constellation: collaboration server for Modelio models
• Needed search, but Modelio is entirely file-based
• Integrated Hawk into the Constellation WAR file: SOFTEAM reported that
the initial indexing cost quickly pays off 8
Software modernization (SOFT-MAINT)
• Software is migrated from an old technology to a new one,
keeping its functionality
• SOFT-MAINT does a good part of the work as a model-to-model
transformation
• Large system models do not fit in the memory of a normal
workstation
Model   Hawk   XMI    SLOC      Ratio (Hawk/XMI)
set1    4 s    5 s    1,413     0.8
set2    4 s    5 s    4,664     0.8
set3    9 s    10 s   53,704    0.9
set4    12 s   71 s   700,545   0.17
Execution times for the Java2SMM transformation
9
Building Information Models (UNINOVA): concept
Uses for Building Information Models
• UNINOVA is a Portuguese construction company.
• Uses models to integrate everything they know about a building:
floor plan, wiring, piping, doors/windows, and so on.
• Models can be in the order of GB! 10
Building Information Modelling (UNINOVA): buying doors
             Small (100K)   Medium (500K)   Large (2.5M)
BimQL (G)    1.50 s         2.30 s          37.80 s
Hawk (P)     0.13 s         0.22 s          0.16 s
Hawk (G)     1.50 s         1.90 s          6.80 s
Query times over building doors, depending on # of elements: P returns only IDs,
G returns IDs + geometry.
• Hawk indexed BIM IFC models from the OpenBIM Server
• Used Hawk to query their buildings and export model views
• Model views = a specific floor, or just the electric wiring
• Model views required implementing paged fetch in the API
11
Building Information Modelling (UNINOVA): area and export
Query in Hawk to compute the area of a building
12
MEASURE software metrics platform [2]
• ITEA3 MEASURE: European industry-academia consortium
• https://itea3.org/project/measure.html
• MEASURE created a platform which brings together metrics from
various SDLC phases, providing analysis tools and dashboards
• Hawk has been integrated as a metric provider for the models
used in the analysis and design stage
13
Design of Hawk API
Wishlist for a good model querying API? (1/2)
Performance
• Should be efficient in size (good encoding + compression)
• Should be efficient in time (avoid CPU-intensive encoding)
• Should avoid overheads from complex protocols
• Should reduce total roundtrip time (# of interactions)
Flexibility
• Should work on most languages (Java, C#, Python, JavaScript...)
• Should allow for various communication styles:
• Fast, few results: request-response
• Slow, few results: submit + fetch, with optional cancelling
• Fast, many results: request-response, then paged fetch
• Slow, many results: submit + fetch, then paged fetch
• Streaming: server-to-client messages (push notifications)
• Should be friendly with firewalls 14
Wishlist for a good model querying API? (2/2)
Security
• Should implement authentication (API key or user/pass)
• Should implement authorization
• Authorization granularity should be as fine as possible
• Actual security mechanism should be configurable
Scalability
Should be able to handle a large number of users without failing
(bad) or producing a wrong number of results (much worse!)
15
Does the Hawk API meet all of them?
No. But we meet quite a few!
And some APIs fare worse...
15
Hawk API design
namespace java org.hawk.service.api
service Hawk {
/* ... */
QueryResult query(
1: required string name,
2: required string query,
3: required string language,
4: required HawkQueryOptions options,
)
throws (
1: HawkInstanceNotFound err1
2: HawkInstanceNotRunning err2
3: UnknownQueryLanguage err3
4: InvalidQuery err4
5: FailedQuery err5
)
/* ... */
}
General structure
• Based on Apache Thrift:
efficient, flexible RPC library
• Given IDL file, generates client
+ server stubs
• Hawk implements stubs and
exposes them through:
• TCP (multithreaded server)
• HTTP (Jetty servlet)
Flexibility
• Thrift supports 4 encodings
• JSON is the most flexible
• Tuple is the most compact
• Over 26 languages supported 16
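As a sketch of what calling this API looks like from Java, assuming the Thrift compiler has generated the org.hawk.service.api stubs and that a Hawk server is reachable at the URL below (the endpoint path, instance name and query language identifier are placeholders for illustration):

import org.apache.thrift.protocol.TJSONProtocol;
import org.apache.thrift.transport.THttpClient;
import org.hawk.service.api.Hawk;
import org.hawk.service.api.HawkQueryOptions;
import org.hawk.service.api.QueryResult;

public class HawkQueryExample {
  public static void main(String[] args) throws Exception {
    // Placeholder endpoint: adjust host/port/path to the actual deployment
    THttpClient transport = new THttpClient("http://localhost:8080/thrift/hawk/json");
    Hawk.Client client = new Hawk.Client(new TJSONProtocol(transport));
    transport.open();
    try {
      HawkQueryOptions options = new HawkQueryOptions();
      QueryResult result = client.query(
          "myInstance",                           // Hawk instance name (placeholder)
          "return Book.all.size();",              // the query text (EOL here)
          "org.hawk.epsilon.emc.EOLQueryEngine",  // assumed identifier of the EOL query engine
          options);
      System.out.println(result);
    } finally {
      transport.close();
    }
  }
}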
But I don’t like writing IDL files...
• Neither do we — that’s why we wrote Ecore2Thrift
• Transforms annotated Ecore metamodels to Thrift IDL files
• https://github.com/bluezio/ecore2thrift
17
Relative performance of Thrift encodings (1/2)
• Hawk provides operations to fetch models over the network
• Three modes: greedy sends the entire model, lazy attributes sends the
structure without attribute values, lazy children sends only the roots
• Thrift Tuple doesn’t do too badly against EMF binary, and it
doesn’t require knowing the metamodel in advance
Encoding         Greedy     Lazy attr.   Lazy children
XMI              27884.85   –            –
EMF binary       11353.19   –            –
Thrift JSON      56946.78   12143.58     0.75
Thrift binary    32471.56   6801.65      0.60
Thrift compact   22082.97   4020.68      0.41
Thrift tuple     19763.98   3414.46      0.39
Initial fetch messages for GraBaTs’09 set1 model, in KB (no compression).
18
Relative performance of Thrift encodings (2/2)
• If we apply gzip compression, things change!
• Even the JSON encoding is better than EMF binary
Encoding         Greedy    Lazy attr.   Lazy children
XMI              1844.54   –            –
EMF binary       1683.98   –            –
Thrift JSON      1569.01   1007.34      0.40
Thrift binary    1474.33   975.79       0.39
Thrift compact   1329.32   924.25       0.34
Thrift tuple     1285.28   901.36       0.32
Same messages, compressed with gzip (sizes in KB).
19
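A minimal sketch of how such size comparisons can be reproduced with the stock Thrift serializers. The payload below is just a placeholder struct from the generated API classes; a real measurement would serialize an actual model/result message fetched from Hawk:

import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;
import org.apache.thrift.TBase;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TCompactProtocol;
import org.apache.thrift.protocol.TJSONProtocol;
import org.apache.thrift.protocol.TProtocolFactory;
import org.apache.thrift.protocol.TTupleProtocol;
import org.hawk.service.api.HawkQueryOptions;

public class EncodingSizes {
  // Serialize the same Thrift struct with a given protocol and report raw vs gzipped sizes
  static void report(String label, TBase<?, ?> struct, TProtocolFactory factory) throws Exception {
    byte[] raw = new TSerializer(factory).serialize(struct);
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
      gz.write(raw);
    }
    System.out.printf("%s: %d bytes raw, %d bytes gzipped%n", label, raw.length, compressed.size());
  }

  public static void main(String[] args) throws Exception {
    // Placeholder payload: in practice, serialize a QueryResult or model contents fetched from Hawk
    TBase<?, ?> payload = new HawkQueryOptions();
    report("Thrift JSON", payload, new TJSONProtocol.Factory());
    report("Thrift compact", payload, new TCompactProtocol.Factory());
    report("Thrift tuple", payload, new TTupleProtocol.Factory());
  }
}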
So far...
Performance
✓ Should be efficient in size (good encoding + compression)
✓ Should be efficient in time (avoid CPU-intensive encoding)
✓ Should avoid overheads from complex protocols → TCP
? Should reduce total roundtrip time (# of interactions)
Flexibility
✓ Should work on most languages (Java, C#, Python, JavaScript...)
? Should allow for various communication styles:
? Fast, few results: request-response
? Slow, few results: submit + fetch, with optional cancelling
? Fast, many results: request-response, then paged fetch
? Slow, many results: submit + fetch, then paged fetch
? Streaming: server-to-client messages (push notifications)
✓ Should be friendly with firewalls → HTTP 20
Impact of communication style
Grace Hopper’s bundle of nanoseconds
• You don’t want to hit the network more than necessary
• Every message implies some latency
• However, you don’t want huge messages either
• Implement multiple styles: let the client choose!
21
Request-response style: nice and easy for small, fast queries
namespace java org.hawk.service.api
service Hawk {
/* ... */
QueryReport timedQuery(
1: required string name,
2: required string query,
3: required string language,
4: required HawkQueryOptions options,
)
throws (
1: HawkInstanceNotFound err1
2: HawkInstanceNotRunning err2
3: UnknownQueryLanguage err3
4: InvalidQuery err4
5: FailedQuery err5
)
/* ... */
}
Advantages
• Easy to implement: one call
• Easy to use: invoke and wait
• No need to correlate msgs
• Great for small, fast queries
• Reliable under heavy concurrency
Note
QueryReport has server query
time: allows for separating the
network latency from the query
time.
Disadvantages
• Huge response will use up all
the memory of the server
• Client cannot cancel a slow
query: no query ID to refer to
Recommendation
Set a max response size, bail out
when we exceed that.
22
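A sketch of that recommendation on the server side: count results as they are produced and give up once a configured limit is exceeded. The collector and exception below are made-up names for illustration; in the actual service the failure would surface as the FailedQuery error declared in the IDL:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class BoundedResultCollector {
  /** Hypothetical exception; a real server would map this to the FailedQuery IDL error. */
  public static class ResultLimitExceeded extends Exception {
    public ResultLimitExceeded(String message) { super(message); }
  }

  private final int maxResults;

  public BoundedResultCollector(int maxResults) {
    this.maxResults = maxResults;
  }

  // Drain the query engine's results, bailing out before exhausting server memory
  public <T> List<T> collect(Iterator<T> results) throws ResultLimitExceeded {
    List<T> collected = new ArrayList<>();
    while (results.hasNext()) {
      if (collected.size() >= maxResults) {
        throw new ResultLimitExceeded("More than " + maxResults
            + " results: use asyncQuery + paged fetching instead");
      }
      collected.add(results.next());
    }
    return collected;
  }
}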
Submit + wait style: allows for remote query cancelling
namespace java org.hawk.service.api
service Hawk {
string asyncQuery(...)
void cancelAsyncQuery(
1: required string queryID
)
QueryReport
fetchAsyncQueryResults(
1: required string queryID
)
}
Advantages
• Submit response is immediate
• Client receives UUID of query
• UUID is used to fetch results and
cancel if taking too long
Disadvantages
• Until fetched, results will take up
memory in the server
• Has more RTT than plain
request-response
Recommendation
There should be a maximum lifespan
for the results, so memory is freed
eventually. 23
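A sketch of how a client might use this style against the generated Java stubs. Only the three operations shown above are used; the assumption that fetchAsyncQueryResults fails while the query is still running (rather than blocking) and the 500 ms polling interval are illustrative, not documented behaviour:

import org.hawk.service.api.Hawk;
import org.hawk.service.api.QueryReport;

public class AsyncQueryHelper {
  // Fetches the results of an already-submitted query (queryID comes from asyncQuery),
  // cancelling it server-side if they are not available before the deadline.
  public static QueryReport fetchOrCancel(Hawk.Client client, String queryID, long timeoutMillis)
      throws Exception {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (System.currentTimeMillis() < deadline) {
      try {
        return client.fetchAsyncQueryResults(queryID);
      } catch (Exception stillRunning) {
        Thread.sleep(500); // poll every 500 ms rather than hammering the server
      }
    }
    client.cancelAsyncQuery(queryID); // free the server-side memory held for the pending results
    throw new Exception("Query " + queryID + " did not finish within " + timeoutMillis + " ms");
  }
}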
Paged fetches for large loads
namespace java org.hawk.service.api
service Hawk {
list<ModelElement> resolveProxies(
1: required string name,
2: required list<string> ids,
3: required HawkQueryOptions
options,
)
}
Advantages
• Can handle larger responses
• Initial query: only element IDs
• Next: fetch elements in batches
Disadvantages
• We spend one RTT for each page
• Eventually, the message with
element IDs may still be too large
Options (both with tradeoffs)
• Introduce stateful paging
• Re-run query on paging
24
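Client-side paging can then be a simple loop over the ID list, fetching a bounded batch per round trip. A sketch against the resolveProxies operation above; the batch size and the way the IDs were obtained are up to the client:

import java.util.ArrayList;
import java.util.List;
import org.hawk.service.api.Hawk;
import org.hawk.service.api.HawkQueryOptions;
import org.hawk.service.api.ModelElement;

public class PagedFetch {
  // Resolve element IDs in fixed-size batches so no single message grows too large
  public static List<ModelElement> fetchAll(Hawk.Client client, String instanceName,
      List<String> ids, int batchSize) throws Exception {
    List<ModelElement> elements = new ArrayList<>();
    HawkQueryOptions options = new HawkQueryOptions();
    for (int i = 0; i < ids.size(); i += batchSize) {
      List<String> batch = ids.subList(i, Math.min(i + batchSize, ids.size()));
      elements.addAll(client.resolveProxies(instanceName, batch, options));
    }
    return elements;
  }
}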
Streaming queries, server status: push notifications
Updates on index state change
• Users want to know what the index is doing from the UI
• We do not want to poll the server unnecessarily
• Our choice: Apache ActiveMQ (AMQP, WebStomp)
• Users are given ActiveMQ queue details over the Thrift API
Updates on model changes
• The Thrift API also allows watching indexed models for changes
• Same approach, but queues can survive client disconnects
Streaming queries: pending!
• Some graph DBs (e.g. Orient, Neo4j) allow you to leave a query
running and receive results as they appear
• We would need to tweak our query engine for this...
25
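A sketch of the client side of such an ActiveMQ subscription, using the standard JMS API. The broker URL and queue name are placeholders (in Hawk they would be obtained through the Thrift API), and real Hawk notifications carry Thrift-encoded payloads rather than the plain text printed here:

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public class IndexStateListener {
  public static void main(String[] args) throws Exception {
    // Placeholder broker URL and queue name: the real values come from the Hawk Thrift API
    ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
    Connection connection = factory.createConnection();
    connection.start();
    Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
    session.createConsumer(session.createQueue("hawk.state.myInstance"))
        .setMessageListener(new MessageListener() {
          @Override
          public void onMessage(Message message) {
            try {
              if (message instanceof TextMessage) {
                System.out.println("Index state change: " + ((TextMessage) message).getText());
              }
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        });
    Thread.sleep(Long.MAX_VALUE); // keep the JVM alive so notifications keep arriving
  }
}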
Done with the first two!
Performance
✓ Should be efficient in size (good encoding + compression)
✓ Should be efficient in time (avoid CPU-intensive encoding)
✓ Should avoid overheads from complex protocols → TCP
✓ Should reduce total roundtrip time (# of interactions)
Flexibility
✓ Should work on most languages (Java, C#, Python, JavaScript...)
/ Should allow for various communication styles:
✓ Fast, few results: request-response
✓ Slow, few results: submit + fetch, with optional cancelling
/ Fast, many results: request-response, then paged fetch
✗ Slow, many results: submit + fetch, then paged fetch
/ Streaming: server-to-client messages (push notifications)
✓ Should be friendly with firewalls → HTTP 26
API security
Access control over HTTP: Apache Shiro
• Simple to integrate: 2 .jar files
• Configurable through shiro.ini
• Filters all incoming requests to servlets
• We provide a security realm backed by a MapDB user database
• Security is coarse: all users have the same level at the moment
Credentials storage: Equinox secure storage
• Hawk needs to access password-protected VCS
• Stored credentials should be adequately protected
• In Eclipse-based environments, Hawk uses its secure storage
• Passwords are encrypted in a custom keyring, with a random
password (generated from /dev/random)
27
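For the credentials side, the Equinox secure storage API looks roughly like this. A sketch: the node path and key names are made up for illustration, not the ones Hawk actually uses:

import org.eclipse.equinox.security.storage.ISecurePreferences;
import org.eclipse.equinox.security.storage.SecurePreferencesFactory;
import org.eclipse.equinox.security.storage.StorageException;

public class VcsCredentialsStore {
  // Store a username/password pair under a node; the boolean asks Equinox to encrypt the value
  public static void store(String repositoryUrl, String username, String password)
      throws StorageException {
    ISecurePreferences root = SecurePreferencesFactory.getDefault();
    ISecurePreferences node = root.node("hawk/vcs/" + repositoryUrl.hashCode()); // illustrative path
    node.put("username", username, false);  // username is not sensitive
    node.put("password", password, true);   // encrypted with the keyring's random password
  }
}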
API scalability
SoSyM paper on stress testing APIs [4]
• Tried Hawk (Thrift), Mogwai (Thrift), CDO (Net4j)
• Used both HTTP and TCP variants
• Used 1 server and 2 client machines, emulated 1–64 clients
HTTP vs TCP: not that bad unless you do something wrong...
• CDO HTTP was 8x slower than CDO TCP!
• CDO TCP uses the bidirectional nature of TCP to stream results
• CDO HTTP attempted to emulate this with polling: don’t do this!
• Hawk/Mogwai HTTP was only 20% slower than raw TCP
Concurrency can ruin your day
• Under heavy thread contention, race conditions quickly appear
• This may stress your networking library and backend...
28
Failed GraBaTs’09/TB queries during API stress-testing
Query Tool Proto 1t 2t 4t 8t 16t 32t 64t
OQ CDO HTTP 1 1
CS CDO/H2 HTTP 1
CS Hawk/O/EOL HTTP 1
CS Hawk/O/EPL HTTP 2
PL CDO/H2 HTTP 1
RS Hawk/O/EOL HTTP 1
RS CDO/H2 TCP 1
SN Hawk/O/EPL HTTP 1
SS Hawk/O/EPL HTTP 1
SS CDO/H2 TCP 1
SS Hawk/O/EPL TCP 2 1
29
Wrong GraBaTs’09/TB queries during API stress-testing
Query Tool Proto 1t 2t 4t 8t 16t 32t 64t
OQ CDO HTTP 3 8 22 17 1
CS CDO HTTP 2 2 1 2 3 4
CS CDO TCP 1 3 3 6
RS CDO HTTP 2 3 2 6 11 12 9
RS CDO TCP 6 3 3 10 15 28 32
SS CDO TCP 1
30
Reviewing the last two blocks of requirements
Security
✓ Should implement authentication (API key or user/pass)
✓ Should implement authorization
✗ Authorization granularity should be as fine as possible
✓ Actual security mechanism should be configurable
Scalability
✓ Hawk over Neo4j worked reliably in all configurations
/ Hawk over OrientDB failed some queries (may need to revisit)
/ Greycat backend has not been stress-tested yet...
31
Conclusion and roadmap
Remote querying APIs are not simple!
• Many MDE researchers forget to pay attention to this
• Good remote API design rules still apply:
• Be efficient, but also firewall-friendly
• Be flexible in what you accept and strict in what you produce
• Avoid denial of service (excessive CPU/RAM use)
• Queries vary in nature: provide multiple communication styles!
• Allow use from more than just your host language:
• Having JavaScript will make it much easier to add a web UI
• Python? C++? C#? They could all consume model query results...
Things we want to add to Hawk
Fine-grained security, paging over async queries, result streaming,
horizontal scaling, smartly delegating work to clients...
Thank you!
Eclipse Hawk: model repository
querying as a service
Antonio García Domínguez // @antoniogado
October 15, 2018
School of Engineering and Applied Science, Aston University
References
[1] K. Barmpis, S. Shah, and D. S. Kolovos. Towards incremental updates in
large-scale model indexes. In Proceedings of the 11th European Conference on
Modelling Foundations and Applications, July 2015.
[2] A. García-Domínguez, A. Abherve, K. Barmpis, O. Al-Wadeai, and A. Bagnato.
Integration of Hawk for Model Metrics in the MEASURE Platform. In Proceedings
of the 6th International Conference on Model-Driven Engineering and Software
Development, Funchal, Madeira, Portugal, January 2018. SciTePress.
[3] A. García-Domínguez, K. Barmpis, D. S. Kolovos, M. A. A. da Silva, A. Abherve,
and A. Bagnato. Integration of a graph-based model indexer in commercial
modelling tools. In Proceedings of MoDELS’16, pages 340–350, Saint Malo,
France, 2016. ACM Press.
[4] A. García-Domínguez, K. Barmpis, D. S. Kolovos, R. Wei, and R. F. Paige.
Stress-testing remote model querying APIs for relational and graph-based stores.
Software & Systems Modeling, pages 1–29, June 2017.