GraphDB
Release 10.2.5
Ontotext
04 September 2023
CONTENTS
1 General 1
1.1 About GraphDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Comparison of GraphDB Free, GraphDB Standard, and GraphDB Enterprise . . . . . . 3
1.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 LDBC Semantic Publishing Benchmark 2.0 . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Berlin SPARQL Benchmark (BSBM) . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Getting Started 21
3.1 Running GraphDB as a Desktop Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 On Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 On MacOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.3 On Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.4 Configuring GraphDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.5 Configuring the JVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.6 Stopping GraphDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Running GraphDB as a Standalone Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.1 Running GraphDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.2 Configuring GraphDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Stopping the database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Set up Your License . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Interactive User Guides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.1 Available guides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.2 Run guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Create a Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.6 Load Your Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6.1 Load data through the GraphDB Workbench . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6.2 Load data through SPARQL or RDF4J API . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6.3 Load data through the ImportRDF tool . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.7 Explore Your Data and Class Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.7.1 Explore instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.7.2 Create your own visual graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.7.3 Class hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7.4 Domain-Range graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7.5 Class relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.8 Query Your Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.8.1 Query data through the Workbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.8.2 Query data programmatically . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.9 Additional Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Managing Repositories 47
4.1 Creating a Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1.1 Create a repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1.2 Manage repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Configuring a Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 Plan a repository configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2 Configure a repository through the GraphDB Workbench . . . . . . . . . . . . . . . . . 51
4.2.3 Edit a repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.4 Configure a repository programmatically . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.5 Configuration parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.6 Namespaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.7 Reconfigure a repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.8 Rename a repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Connecting to Remote GraphDB Instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.1 Connect to a remote location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.2 Change location settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.3 View or update location license . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Activate and Enable Plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.1 Activate/deactivate plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.2 Enable/disable plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5.1 Inference in GraphDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5.2 Proof plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5.3 Provenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.6 Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.6.1 GraphDB persistence strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.6.2 GraphDB indexing options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.7 Query Monitoring and Termination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.7.1 Query monitoring and termination using the Workbench . . . . . . . . . . . . . . . . . 79
4.7.2 Automatically prevent long running queries . . . . . . . . . . . . . . . . . . . . . . . . 80
4.8 Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.8.1 Overview and features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.8.2 Usage scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.8.3 Setup and configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.8.4 Mapping language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.8.5 SPARQL endpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.8.6 Query federation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.8.7 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.9 FedX Federation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.9.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.9.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.9.3 Usage scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.9.4 Configuration parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.9.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.5.7 Hybrid indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
6.5.8 Training cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
6.6 Geographic Data Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
6.6.1 Geospatial Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
6.6.2 GeoSPARQL Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
6.7 Data History and Versioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.7.1 What the plugin does . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.7.2 Index components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
6.7.3 Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
6.7.4 Query process and examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.8 SQL Access over JDBC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
6.8.1 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
6.8.2 Type mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
6.8.3 WHERE to FILTER conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.8.4 Table verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.8.5 Usage examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
6.8.6 How it works: Table description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.9 SPARQL Federation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.9.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.9.2 Internal SPARQL federation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
6.9.3 Federated query to a remote password-protected repository . . . . . . . . . . . . . . . . 231
6.10 Visualize and Explore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
6.10.1 Class hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
6.10.2 Class relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
6.10.3 Explore resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
6.10.4 View and edit resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
6.11 Exporting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
6.11.1 Exporting a repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
6.11.2 Exporting individual graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.11.3 Exporting query results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.11.4 Exporting resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
6.12 JavaScript Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
6.12.1 How to register a JS function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
6.12.2 How to remove a JS function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
6.13 SPARQL-MM support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.13.1 Usage examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
7.2.9 Caveats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
7.2.10 Upgrading from previous versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
7.3 Solr GraphDB Connector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
7.3.1 Overview and features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
7.3.2 Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
7.3.3 Setup and maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
7.3.4 Working with data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
7.3.5 List of creation parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
7.3.6 Datatype mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
7.3.7 Entity filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
7.3.8 Overview of connector predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
7.3.9 SolrCloud support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
7.3.10 Caveats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
7.3.11 Upgrading from previous versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
7.4 Kafka GraphDB Connector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
7.4.1 Overview and features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
7.4.2 Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
7.4.3 Setup and maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
7.4.4 Working with data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
7.4.5 List of creation parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
7.4.6 Datatype mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
7.4.7 Entity filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
7.4.8 Overview of connector predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
7.4.9 Caveats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
7.4.10 Upgrading from previous versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
7.5 MongoDB Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
7.5.1 Overview and features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
7.5.2 Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
7.5.3 Setup and maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
7.6 General Full-text Search with the Connectors . . . . . . . . . . . . . . . . . . . . . . . . . . 397
7.6.1 Useful connector features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
7.6.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
7.7 Kafka Sink Connector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
7.7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
7.7.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
7.7.3 Update types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
7.7.4 Configuration properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
7.8 Text Mining Plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
7.8.1 What the plugin does . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
7.8.2 Usage examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
7.8.3 Error handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
7.8.4 Manage text mining instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
7.8.5 Monitor annotation progress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
8.2.9 Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
8.2.10 Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
8.3 Using GraphDB with the RDF4J API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
8.3.1 RDF4J API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
8.3.2 SPARQL endpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
8.3.3 Graph Store HTTP Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
8.4 GraphDB Plugin API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
8.4.1 What is the GraphDB Plugin API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
8.4.2 Description of a GraphDB plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
8.4.3 The life cycle of a plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
8.4.4 Repository internals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
8.4.5 Query processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
8.4.6 Update processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
8.4.7 Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
8.4.8 Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
8.4.9 Accessing other plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
8.4.10 List of plugin interfaces and classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
8.4.11 Adding external plugins to GraphDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
8.4.12 Putting it all together: example plugins . . . . . . . . . . . . . . . . . . . . . . . . . . 464
8.5 Using Maven Artifacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
8.5.1 Public Maven repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
8.5.2 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
8.5.3 GraphDB runtime .jar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
8.5.4 GraphDB Client API .jar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
11.1.1 Directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
11.1.2 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
11.2 Setting up Licenses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
11.2.1 Setting up licenses through the Workbench . . . . . . . . . . . . . . . . . . . . . . . . 525
11.2.2 Setting up licenses through a file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
11.2.3 Order of preference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
11.3 Configuring GraphDB Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
11.3.1 Configure Java heap memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
11.3.2 Single global page cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
11.3.3 Configure Entity pool memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
11.3.4 Sample memory configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
11.3.5 Upper bounds for the memory consumed by the GraphDB process . . . . . . . . . . . . 528
11.4 Creating and Managing a Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
11.4.1 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
11.4.2 High availability deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
11.4.3 Create cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
11.4.4 Manage cluster membership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
11.4.5 Manage cluster configuration properties . . . . . . . . . . . . . . . . . . . . . . . . . . 536
11.4.6 Monitor cluster status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
11.4.7 Delete cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
11.4.8 Configure external cluster proxy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
11.4.9 Cluster security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
11.4.10 Truncate cluster transaction log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
12 Security 545
12.1 Enabling Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
12.1.1 Enable security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
12.1.2 Login and default credentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
12.1.3 Free access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
12.2 User Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
12.2.1 Create new user . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
12.2.2 Set password . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
12.3 Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
12.3.1 Authorization and user database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
12.3.2 Authentication methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
12.3.3 Example configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
12.4 Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
12.4.1 Encryption in transit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
12.4.2 Encryption at rest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
12.5 Security Auditing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
14 Monitoring and Troubleshooting 581
14.1 Request Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
14.2 Database health checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
14.2.1 Possible values for health checks and their meaning . . . . . . . . . . . . . . . . . . . . 582
14.2.2 Aggregated health checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
14.2.3 Running passive health checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
14.3 System monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
14.3.1 Workbench monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
14.3.2 Prometheus monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
14.3.3 JMX console monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
14.4 Diagnosing and reporting critical errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
14.4.1 Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
14.4.2 Logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
18 Tutorials 611
18.1 GraphDB Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
18.1.1 Module 1: RDF & RDFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
18.1.2 Module 2: SPARQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
18.1.3 Module 3: Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
18.1.4 Module 4: GraphDB Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
18.1.5 Module 5: GraphDB Workbench & REST API . . . . . . . . . . . . . . . . . . . . . . 612
18.1.6 Module 6: Loading Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
18.1.7 Module 7: Rulesets & Reasoning Strategies . . . . . . . . . . . . . . . . . . . . . . . . 612
18.1.8 Module 8: Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
18.1.9 Module 9: Plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
18.1.10 Module 10: Connectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
18.2 Programming with GraphDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
18.2.1 Installing Maven dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
18.2.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
18.3 Extending GraphDB Workbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
18.3.1 Clone, download, and run GraphDB Workbench . . . . . . . . . . . . . . . . . . . . . . 618
18.3.2 Add your own page and controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
18.3.3 Add repository checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619
18.3.4 Repository setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
18.3.5 Select departure and destination airport . . . . . . . . . . . . . . . . . . . . . . . . . . 620
18.3.6 Find the paths between the selected airports . . . . . . . . . . . . . . . . . . . . . . . . 622
18.3.7 Visualize results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
18.3.8 Add status message . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
18.4 Location and Repository Management with the GraphDB REST API . . . . . . . . . . . . . . . 629
18.4.1 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
18.4.2 Managing repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
18.4.3 Managing locations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
18.4.4 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631
18.5 GraphDB REST API cURL Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631
18.5.1 Cluster group management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
18.5.2 Cluster monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
18.5.3 Data import . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
18.5.4 Infrastructure monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
18.5.5 Location management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
18.5.6 Repository management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636
18.5.7 Repository monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
18.5.8 Saved queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
18.5.9 Security management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642
18.5.10 SPARQL template management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
18.5.11 SQL views management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
18.5.12 Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
18.5.13 Structures monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
18.6 Visualize GraphDB Data with Ogma JS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
18.6.1 People and organizations related to Google in factforge.net . . . . . . . . . . . . . . . . 647
18.6.2 Suspicious control chain through offshore companies in factforge.net . . . . . . . . . . 650
18.6.3 Shortest flight path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
18.6.4 Common function to visualize GraphDB data . . . . . . . . . . . . . . . . . . . . . . . 658
18.7 Create Custom Graph View over Your RDF Data . . . . . . . . . . . . . . . . . . . . . . . . . . 659
18.7.1 How it works? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
18.7.2 World airport, airline, and route data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
18.7.3 Springer Nature SciGraph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662
18.7.4 Additional sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 666
18.8 Notifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 666
18.8.1 What are GraphDB local notifications . . . . . . . . . . . . . . . . . . . . . . . . . . . 666
18.8.2 How to register for local notifications . . . . . . . . . . . . . . . . . . . . . . . . . . . 666
18.9 Graph Replacement Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
19 References 669
19.1 Introduction to the Semantic Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
19.1.1 Resource Description Framework (RDF) . . . . . . . . . . . . . . . . . . . . . . . . . . 669
19.1.2 RDF Schema (RDFS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
19.1.3 Ontologies and knowledge bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
19.1.4 Logic and inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
19.1.5 The Web Ontology Language (OWL) and its dialects . . . . . . . . . . . . . . . . . . . 680
19.1.6 Query languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
19.1.7 Reasoning strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684
19.1.8 Semantic repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
19.2 Data Modeling with RDF(S) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 686
19.2.1 What is RDF? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 686
19.2.2 What is RDFS? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
19.3 SPARQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
19.3.1 What is SPARQL? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
19.3.2 Using SPARQL in GraphDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 689
19.4 RDF Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
19.4.1 Turtle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
19.4.2 Turtle-star . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
ix
19.4.3 TriG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
19.4.4 TriG-star . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
19.4.5 N3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
19.4.6 N-Triples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
19.4.7 N-Quads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
19.4.8 JSON-LD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
19.4.9 NDJSON-LD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
19.4.10 RDF/JSON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
19.4.11 RDF/XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692
19.4.12 TriX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
19.4.13 Binary RDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
19.5 RDF-star and SPARQL-star . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
19.5.1 The modeling challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
19.5.2 How the different approaches compare? . . . . . . . . . . . . . . . . . . . . . . . . . . 695
19.5.3 Syntax and examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696
19.5.4 Convert standard reification to RDF-star . . . . . . . . . . . . . . . . . . . . . . . . 700
19.5.5 MIME types and file extensions for RDF-star in RDF4J . . . . . . . . . . . . . . . . . 700
19.6 Plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
19.7 Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702
19.7.1 What is an ontology? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702
19.7.2 What are the benefits of developing and using an ontology? . . . . . . . . . . . . . . . . 702
19.7.3 Using ontologies in GraphDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
19.8 Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
19.8.1 Logical formalism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704
19.8.2 Rule format and semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704
19.8.3 The ruleset file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704
19.8.4 Rulesets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 709
19.8.5 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 711
19.8.6 How To’s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713
19.8.7 Provenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
19.9 SPARQL Compliance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
19.9.1 SPARQL 1.1 Protocol for RDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
19.9.2 SPARQL 1.1 Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
19.9.3 SPARQL 1.1 Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
19.9.4 SPARQL 1.1 Federation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
19.9.5 SPARQL 1.1 Graph Store HTTP Protocol . . . . . . . . . . . . . . . . . . . . . . . . . 718
19.10 SPARQL Functions Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719
19.10.1 SPARQL functions vs magic predicates . . . . . . . . . . . . . . . . . . . . . . . . . . 719
19.10.2 SPARQL 1.1 functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 720
19.10.3 SPARQL 1.1 constructor functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
19.10.4 Mathematical function extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 725
19.10.5 Date and time function extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 728
19.10.6 SPARQL SPIN functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 730
19.10.7 RDF-star extension functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
19.10.8 RDF list function extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
19.10.9 Aggregation function extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 733
19.10.10 GeoSPARQL functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 733
19.10.11 Geospatial extension functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
19.10.12 Other function extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
19.11 Time Functions Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
19.11.1 Period extraction functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
19.11.2 Period transformation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 739
19.11.3 Durations expressed in certain units . . . . . . . . . . . . . . . . . . . . . . . . . . . . 740
19.11.4 Arithmetic operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 740
19.12 OWL Compliance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 740
19.13 GraphDB System Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 741
19.13.1 System graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 741
19.13.2 System predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 742
19.14 Repository Configuration Template - How It Works . . . . . . . . . . . . . . . . . . . . . . 743
19.15 Ontology Mapping with owl:sameAs Property . . . . . . . . . . . . . . . . . . . . . . . . . . . 744
19.16 Query Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
19.16.1 What are named graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
19.16.2 How to manage explicit and implicit statements . . . . . . . . . . . . . . . . . . . . . . 747
19.16.3 How to query explicit and implicit statements . . . . . . . . . . . . . . . . . . . . . . . 748
19.16.4 How to specify the dataset programmatically . . . . . . . . . . . . . . . . . . . . . . . 749
19.16.5 How to access internal identifiers for entities . . . . . . . . . . . . . . . . . . . . . . . 749
19.16.6 How to use RDF4J ‘direct hierarchy’ vocabulary . . . . . . . . . . . . . . . . . . . . . 751
19.16.7 Other special GraphDB query behavior . . . . . . . . . . . . . . . . . . . . . . . . . . 751
19.17 Retain BIND Position Special Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752
19.18 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753
21 FAQ 767
21.1 General . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767
21.1.1 What is OWLIM? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767
21.1.2 Why a solid-state drive and not a hard-disk one? . . . . . . . . . . . . . . . . . . . . 767
21.1.3 Is GraphDB Jena-compatible? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767
21.2 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767
21.2.1 How do I find out the exact version number of GraphDB? . . . . . . . . . . . . . . . . 767
21.2.2 What is a repository? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
21.2.3 How do I create a repository? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
21.2.4 How do I retrieve repository configurations? . . . . . . . . . . . . . . . . . . . . . . . . 768
21.2.5 What is a location? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
21.2.6 How do I attach a location? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
21.3 RDF & SPARQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
21.3.1 How is GraphDB related to RDF4J? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
21.3.2 What does it mean when an IRI starts with urn:rdf4j:triple:? . . . . . . . . . . . . . 769
21.3.3 What kind of SPARQL compliance is supported? . . . . . . . . . . . . . . . . . . . . . 769
21.4 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 769
21.4.1 Does GraphDB have any security vulnerabilities? . . . . . . . . . . . . . . . . . . . . . 769
21.4.2 Does the Log4Shell issue (CVE-2021-44228) affect GraphDB? . . . . . . . . . . . . . 769
21.5 Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 769
21.5.1 Why can't I use a custom rule file (.pie) - an exception occurred? . . . . . . . . . . . 769
21.5.2 Why can’t I open GraphDB in MacOS? . . . . . . . . . . . . . . . . . . . . . . . . . . 770
22 Support 771
CHAPTER
ONE
GENERAL
Hint: This documentation is written to be used by technical people. Whether you are a database engineer or system
designer evaluating how this database fits into your system, or a developer who has already integrated it and
actively employs its power, this is the complete reference. It is also useful for system administrators who need to
support and maintain a GraphDB-based system.
Note: The GraphDB documentation presumes that the reader is familiar with databases. The required minimum
of Semantic Web concepts and related information is provided in the Introduction to the Semantic Web section in
References.
Ontotext GraphDB is a highly efficient and robust graph database with RDF and SPARQL support. This documentation
is a comprehensive guide that explains every feature of GraphDB, as well as topics such as setting up a
repository, loading and working with data, tuning its performance, scaling, etc.
The GraphDB database supports a highly available replication cluster, which has been proven in a number of
enterprise use cases that required resilience in data loading and query answering. If you need a quick overview of
GraphDB or a download link to its latest releases, please visit the GraphDB product section.
GraphDB uses RDF4J as a library, utilizing its APIs for storage and querying, as well as the support for a wide
variety of query languages (e.g., SPARQL and SeRQL) and RDF syntaxes (e.g., RDF/XML, N3, Turtle).
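The following minimal sketch shows what this looks like in practice with the RDF4J Repository API: it connects to a GraphDB repository over HTTP and evaluates a SPARQL query. The endpoint URL and the repository ID myrepo are placeholders chosen for this example, not values taken from this documentation.

```java
import org.eclipse.rdf4j.query.BindingSet;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

public class GraphDBQueryExample {
    public static void main(String[] args) {
        // Placeholder endpoint and repository ID; adjust for your own deployment.
        Repository repo = new HTTPRepository("http://localhost:7200/repositories/myrepo");
        repo.init();
        try (RepositoryConnection conn = repo.getConnection()) {
            String sparql = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";
            // Evaluate the query and print each returned binding set.
            try (TupleQueryResult result = conn.prepareTupleQuery(sparql).evaluate()) {
                while (result.hasNext()) {
                    BindingSet bs = result.next();
                    System.out.println(bs.getValue("s") + " " + bs.getValue("p") + " " + bs.getValue("o"));
                }
            }
        } finally {
            repo.shutDown();
        }
    }
}
```

Because GraphDB implements the standard RDF4J interfaces, the same code works against any RDF4J-compatible repository; only the repository URL is GraphDB-specific.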
Full licensing information is available in the license files located in the doc folder of the distribution package.
Helpful hints
Throughout the documentation there are a number of helpful notes that can give you additional details, warn you,
or save you time and unnecessary effort. Here is what to pay attention to:
Hint: Hint badges give additional information you may find useful.
Note: Notes are comments or references that may save you time and unnecessary effort.
Warning: Warnings are pieces of advice that turn your attention to things you should be cautious about.
1.1 About GraphDB
GraphDB is a family of highly efficient, robust, and scalable RDF databases. It streamlines the loading and use
of linked data cloud datasets, as well as your own resources. For easy use and compatibility with the industry
standards, GraphDB implements the RDF4J framework interfaces and the W3C SPARQL Protocol specification, and
supports all RDF serialization formats. The database is the preferred choice of both small independent developers
and big enterprise organizations because of its community and commercial support, as well as excellent enterprise
features such as cluster support and integration with external high-performance search applications such as Lucene,
Solr, and Elasticsearch.
GraphDB is one of the few triplestores that can perform semantic inferencing at scale, allowing users to derive
new semantic facts from existing facts. It handles massive loads, queries, and inferencing in real time.
Reasoning and query evaluation are performed over a persistent storage layer. Loading, reasoning, and query
evaluation proceed extremely quickly even against huge ontologies and knowledge bases.
GraphDB can manage billions of explicit statements on desktop hardware and handle tens of billions of statements
on commodity server hardware. According to the LDBC Semantic Publishing Benchmark, it is one of the most
scalable OWL repositories currently available.
1.1.1 Comparison of GraphDB Free, GraphDB Standard, and GraphDB Enterprise
Ontotext offers licenses for three editions of GraphDB:
• GraphDB Free
• GraphDB Standard (SE)
• GraphDB Enterprise (EE)
GraphDB Free and GraphDB SE are identical in terms of usage and integration and share most features: they are
both designed as an enterprise-grade semantic repository system, are suitable for massive volumes of data, employ
file-based indexes that enable them to scale to billions of statements even on desktop machines, and ensure fast
query evaluation through inference and query optimizations.
GraphDB Free is commercial and free to use, is limited to two concurrent queries, and is suitable for low
query loads and smaller projects.
GraphDB SE is commercial, supports an unlimited number of concurrent queries, and is suitable for heavy query
loads.
Building on the above, the GraphDB EE edition is a high-performance, clustered semantic repository that scales
in production environments with simultaneous loading, querying, and inferencing of billions of RDF statements.
It supports a high-availability cluster based on the Raft consensus algorithm, with several features that are
crucial for achieving enterprise-grade highly available deployments. It also adds more connectors for full-text
search and faceting (the Solr and Elasticsearch connectors), as well as the Kafka connector for synchronizing
changes to the RDF model to any Kafka consumer.
To find out more about the differences between the editions, see the GraphDB Feature Comparison section.
1.2 Benchmarks
Our engineering team invests constant effort in measuring the database's data loading and query answering
performance. This section covers common database scenarios tested with popular public benchmarks and their
interpretation in the context of common RDF use cases.
1.2.1 LDBC Semantic Publishing Benchmark 2.0
LDBC is an industry association that aims to create TPC-like benchmarks for RDF and graph databases. The
association was founded by a consortium of database vendors such as Ontotext, OpenLink, Neo Technologies, Oracle,
IBM, and SAP, among others. The SPB (Semantic Publishing Benchmark) simulates the database load commonly
faced by media or publishing organizations. The synthetically generated dataset is based on BBC's Dynamic Semantic
Publishing use case. It contains a graph of linked entities such as creative works, persons, documents, products,
provenance, and content management system information. All benchmark operations follow a standard authoring
process: adding new metadata, updating the reference knowledge, and running search queries that hit various choke
points such as join performance, data access locality, expression calculation, parallelism, concurrency, and
correlated subqueries.
Data loading
This section illustrates how quickly GraphDB can do an initial data load. The SPB-256 dataset represents the size
of a mid-sized production database managing documents and metadata. The data loading test run measures how
the GraphDB edition and the selection of i3 instances affect the processing of 237M explicit statements, including
the materialization of the inferred triples generated by the reasoner.
Table 1: Loading time of the LDBC SPB-256 dataset with the default RDFS-Plus-optimized ruleset, in minutes

| Editions   | Ruleset             | Explicit statements | Total statements | AWS instance | Cores | Loading time (minutes) |
|------------|---------------------|---------------------|------------------|--------------|-------|------------------------|
| 10.0 Free  | RDFS-Plus-optimized | 237,802,643         | 385,168,491      | i3.xlarge    | 1*    | 399                    |
| 10.0 SE/EE | RDFS-Plus-optimized | 237,802,643         | 385,168,491      | i3.xlarge    | 2     | 315                    |
| 10.0 SE/EE | RDFS-Plus-optimized | 237,802,643         | 385,168,491      | i3.xlarge    | 4     | 312                    |
| 10.0 SE/EE | RDFS-Plus-optimized | 237,802,643         | 385,168,491      | i3.2xlarge   | 8     | 259                    |
| 10.0 SE/EE | RDFS-Plus-optimized | 237,802,643         | 385,168,491      | i3.4xlarge   | 16    | 253                    |

Table 2: Loading time of the LDBC SPB-256 dataset with the OWL2-RL ruleset, in minutes

| Editions   | Ruleset | Explicit statements | Total statements | AWS instance | Cores | Loading time (minutes) |
|------------|---------|---------------------|------------------|--------------|-------|------------------------|
| 10.0 SE/EE | OWL2-RL | 237,802,643         | 752,341,659      | i3.xlarge    | 2     | 889                    |
| 10.0 SE/EE | OWL2-RL | 237,802,643         | 752,341,659      | i3.xlarge    | 4     | 843                    |
| 10.0 SE/EE | OWL2-RL | 237,802,643         | 752,341,659      | i3.2xlarge   | 8     | 635                    |
| 10.0 SE/EE | OWL2-RL | 237,802,643         | 752,341,659      | i3.4xlarge   | 16    | 607                    |
The same dataset tested with the OWL2-RL ruleset produces nearly 515M implicit statements, an expansion of
1:3.2 over the imported explicit triples. The data loading performance scales much better with additional CPU
cores due to the much higher computational complexity. Once again, the I/O throughput becomes a major limiting
factor, but the conclusion is that datasets with a higher reasoning complexity benefit more from additional
CPU cores.
Production load
The test demonstrates the execution speed of small-sized transactions and read queries against the SPB-256 dataset
preloaded with the RDFS-Plus-optimized ruleset. The query mix includes transactions generating updates and
information searches with simple or complex aggregate queries. The different runs compare the database performance
according to the number of concurrent read and write clients.
Table 3: The number of executed query mixes per second (higher is better) vs. the number of concurrent clients.
Notes: All runs use the same configuration limited to 20 GB heap size on instances with 16 vCPUs. The AWS price
is based on the US East coast for an on-demand type of instance (Q1 2020), and does not include the EBS volume
charges, which are substantial only for provisioned-IOPS partitions.
The instances with local NVMe SSD devices substantially outperform any EBS drives due to the lower disk latency
and higher bandwidth. In the case of the standard and cheapest EBS gp2 volumes, the performance drops even further
once the AWS IOPS throttling starts to limit the disk operations. The c5d.4xlarge instances achieve the consistently
fastest results, with the main limitation being their small local disks. Next in the list are the i3.4xlarge
instances, which offer substantially bigger local disks. Our recommendation is to avoid using the slow EBS volumes,
except for cases where you plan to limit the database performance load.
1.2.2 Berlin SPARQL Benchmark (BSBM)
BSBM is a popular benchmark combining read queries with frequent updates. It covers a less demanding use
case without reasoning, generally defined as eCommerce, describing relations between products and producers,
products and offers, offers and vendors, and products and reviews.
The benchmark features two runs, where the "explore" run generates requests like "find products for a given set of
generic features", "retrieve basic information about a product for display purposes", "get recent reviews", etc. The
"explore and update" run mixes all read queries with information updates.
Table 4: BSBM 100M query mixes per hour on AWS instance c5ad.4xlarge, local NVMe SSD, with GraphDB
10.0 EE, ruleset RDFS-Plus-optimized, and excluded Query 5

| Threads | Explore (query mixes per hour) | Explore & update (query mixes per hour) |
|---------|--------------------------------|-----------------------------------------|
| 1       | 77,861                         | 25,441                                  |
| 2       | 145,719                        | 50,153                                  |
| 4       | 256,025                        | 60,972                                  |
| 8       | 411,748                        | 61,557                                  |
| 12      | 423,748                        | 60,371                                  |
| 16      | 470,485                        | 60,770                                  |
CHAPTER
TWO
2.1.1 Architecture
GraphDB is packaged as a SAIL (Storage And Inference Layer) for RDF4J and makes extensive use of the features
and infrastructure of RDF4J, especially the RDF model, RDF parsers, and query engines.
Inference is performed by the Reasoner (TRREE Engine), where the explicit and inferred statements are stored in
highly optimized data structures that are kept in-memory for query evaluation and further inference. The inferred
closure is updated through inference at the end of each transaction that modifies the repository.
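As a small illustration of that transaction boundary, the sketch below adds one explicit statement through the RDF4J Repository API; when the transaction commits, GraphDB recomputes the inferred closure. The endpoint, repository ID, and example IRIs are placeholders for this example only.

```java
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.util.Values;
import org.eclipse.rdf4j.model.vocabulary.RDF;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

public class GraphDBUpdateExample {
    public static void main(String[] args) {
        // Placeholder endpoint and repository ID.
        Repository repo = new HTTPRepository("http://localhost:7200/repositories/myrepo");
        repo.init();
        try (RepositoryConnection conn = repo.getConnection()) {
            IRI subject = Values.iri("http://example.org/resource/1");
            IRI type = Values.iri("http://example.org/ontology/Document");
            conn.begin();                      // start a transaction
            conn.add(subject, RDF.TYPE, type); // stage an explicit statement
            conn.commit();                     // inference is applied at the end of the transaction
        } finally {
            repo.shutDown();
        }
    }
}
```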
GraphDB implements the SAIL API interface so that it can be integrated with the rest of the RDF4J framework,
e.g., the query engines and the web UI. A user application can be designed to use GraphDB directly through the
RDF4J SAIL API or via the higher-level functional interfaces. When a GraphDB repository is exposed using the
RDF4J HTTP Server, users can manage the repository through the embedded Workbench, the RDF4J Workbench,
or other tools integrated with RDF4J.
RDF4J
RDF4J is a framework for storing, querying, and reasoning with RDF data. It is implemented in
Java by Aduna as an open-source project and includes various storage backends (memory, file, database), query
languages, reasoners, and client-server protocols.
There are essentially two ways to use RDF4J:
• as a standalone server;
• embedded in an application as a Java library.
RDF4J supports the W3C SPARQL query language, as well as the most popular RDF file formats and query result
formats.
RDF4J offers a JDBC-like user API, streamlined system APIs, and a RESTful HTTP interface. Various extensions
are available or are being developed by third parties.
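Because GraphDB exposes this RDF4J-compatible RESTful interface, a repository can also be queried with plain HTTP, without any RDF4J client classes. The sketch below uses the standard Java HTTP client against a placeholder repository ID (myrepo); the query parameter and the application/sparql-results+json media type follow the standard SPARQL Protocol.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class SparqlOverHttpExample {
    public static void main(String[] args) throws Exception {
        // Placeholder GraphDB repository endpoint.
        String endpoint = "http://localhost:7200/repositories/myrepo";
        String query = URLEncoder.encode("SELECT ?s WHERE { ?s ?p ?o } LIMIT 5", StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint + "?query=" + query))
                .header("Accept", "application/sparql-results+json") // ask for SPARQL JSON results
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```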
RDF4J Architecture
The following is a schematic representation of the RDF4J architecture and a brief overview of the main components.
The Sail API is a set of Java interfaces that support RDF storing, retrieving, deleting, and inferencing. It is used for
abstracting from the actual storage mechanism, e.g., an implementation can use relational databases, file systems,
in-memory storage, etc. One of its key characteristics is the option for SAIL stacking.
2.1.2 Components
Engine
Query optimizer
The query optimizer attempts to determine the most efficient way to execute a given query by considering the
possible query plans. Once queries are submitted and parsed, they are then passed to the query optimizer where
optimization occurs. GraphDB allows hints for guiding the query optimizer.
Reasoner
GraphDB is implemented on top of the TRREE engine. TRREE stands for ‘Triple Reasoning and Rule Entailment
Engine’. The TRREE performs reasoning based on forward-chaining of entailment rules over RDF triple patterns
with variables. TRREE’s reasoning strategy is total materialization, although various optimizations are used.
Further details about the rule language can be found in the Reasoning section.
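For intuition only, the toy sketch below mimics the forward-chaining, total-materialization idea with a single hard-coded rule (if x rdf:type A and A rdfs:subClassOf B, then x rdf:type B), applied until no new statements appear. It is not GraphDB's rule engine and uses made-up example data.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ForwardChainingSketch {
    record Triple(String s, String p, String o) {}

    public static void main(String[] args) {
        Set<Triple> graph = new HashSet<>(List.of(
                new Triple(":doc1", "rdf:type", ":Report"),
                new Triple(":Report", "rdfs:subClassOf", ":Document"),
                new Triple(":Document", "rdfs:subClassOf", ":Work")));

        // Apply the rule until a fixpoint is reached, i.e., total materialization.
        boolean changed = true;
        while (changed) {
            Set<Triple> inferred = new HashSet<>();
            for (Triple typing : graph) {
                if (!typing.p().equals("rdf:type")) continue;
                for (Triple sub : graph) {
                    if (sub.p().equals("rdfs:subClassOf") && sub.s().equals(typing.o())) {
                        inferred.add(new Triple(typing.s(), "rdf:type", sub.o()));
                    }
                }
            }
            changed = graph.addAll(inferred);
        }
        // Prints the explicit statements plus the materialized ones (:doc1 is also a :Document and a :Work).
        graph.forEach(System.out::println);
    }
}
```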
Storage
GraphDB stores all of its data in files in the configured storage directory, usually called storage. It consists of two
main indexes on statements, POS and PSO, a context index CPSO, and a literal index, with the latter two being optional.
Entity Pool
The Entity Pool is a key component of the GraphDB storage layer. It converts entities (URIs, blank nodes, literals,
and RDF-star [formerly RDF*] embedded triples) to internal IDs (32- or 40-bit integers). It supports transactional
behavior, which improves space usage and cluster behavior.
Page Cache
GraphDB's cache strategy employs the concept of one global cache shared between all internal structures of all
repositories, so that you no longer have to configure the cache-memory, tuple-index-memory, and predicate-memory
settings, or size every instance and calculate the amount of memory dedicated to it. If one of the repositories is
used more heavily at a given moment, it naturally gets more slots in the cache.
Connectors
The Connectors provide extremely fast keyword and faceted (aggregation) searches that are typically implemented
by an external component or service, but have the additional benefit of staying automatically up-to-date with the
GraphDB repository data. GraphDB comes with the following connector implementations:
• Lucene GraphDB Connector
• Solr GraphDB Connector (requires a GraphDB Enterprise license)
• Elasticsearch GraphDB Connector (requires a GraphDB Enterprise license)
Additionally, the Kafka GraphDB Connector provides a means to synchronize changes to the RDF model to any
Kafka consumer (requires a GraphDB Enterprise license).
Workbench
A system can be characterized as having high availability if it meets several key criteria such as: having a high
uptime, recovering smoothly, achieving zero data loss, and handling and adapting to unexpected situations
and scenarios.
The GraphDB cluster is designed for high availability and has several features that are crucial for achieving
enterprise-grade highly available deployments. It is based on coordination mechanisms known as consensus
algorithms. They allow a collection of machines to work as a coherent group that can survive the failures of some
of its members and provide lower latency. In essence, such protocols define the set of rules for messaging between
machines. Because of this, they play a key role in building reliable large-scale software systems.
Consensus algorithms aim to be fault-tolerant, where faults can be classified in two categories:
• Crash failure: The component abruptly stops functioning and does not resume. The other components can
detect the crash and adjust their local decisions in time.
• Byzantine failure: The component behaves arbitrarily, with no absolute conditions. It can send contradictory
messages to other components or simply remain silent. It may look normal from the outside.
The GraphDB cluster uses the Raft consensus algorithm for managing a replicated log on distributed state machines.
It implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for
managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells
servers when it is safe to apply log entries to their state machines. A cluster of n nodes can tolerate m node failures as long as n ≥ 2m + 1 (for example, a five-node cluster remains operational with up to two failed nodes).
Quorum-based replication
The GraphDB cluster relies on quorum-based replication, meaning that the cluster should have over 50% alive
nodes in order to be able to execute INSERT/DELETE operations. This ensures that there will always be a majority
of GraphDB nodes that have up-to-date data.
If there are unavailable nodes when an INSERT/DELETE operation is executed, but there are more than 50% alive
nodes, the request will be accepted, distributed among the reachable alive nodes, and saved if everything is OK.
Once the unavailable nodes come back online, the transactions will be distributed to them as well.
If there are fewer than 50% available nodes, any INSERT/DELETE operations will be rejected.
Internal proxy
In normal working conditions, the cluster nodes have two states – leader and follower. The follower nodes can
accept read requests, but cannot write any data. To make it easier for the user to communicate with the cluster, an
integrated proxy will redirect all requests (with some exceptions) to the leader node. This ensures that regardless
of which cluster node is reached, it can accept all user requests.
However, if a GraphDB cluster node is unavailable, you need to switch to another cluster node that will be on
another URL. This means that you need to know all cluster node addresses and make sure that the reached node is
healthy and online.
External proxy
For even better usability, the proxy can be deployed separately on its own URL. This way, you do not need to know
where all cluster nodes are. Instead, there is a single URL that will always point to the leader node.
The externally deployed proxy will behave like a regular GraphDB instance, including opening and using the
Workbench. It will always know which one the leader is and will always serve all requests to the current leader.
In order to achieve maximum efficiency, the GraphDB cluster distributes the incoming read queries to all nodes,
prioritizing the ones that have fewer running queries. This ensures the optimal hardware resource utilization of all
nodes.
Local consistency
GraphDB supports two types of local consistency: None and Last Committed.
• None is the default setting and is used when no local consistency is needed. In this mode, the query will
be sent to any readable node, without any guarantee of strong consistency. This is suitable for cases where
eventual consistency is sufficient or when enforcing strong consistency is too costly.
• Last Committed is used when strong consistency is required, ensuring that the results reflect the state of the
system after all transactions have been committed; however, it could lead to lower scalability, as the set of
nodes to which a query could be load-balanced is smaller. In this mode, the query will be sent to a readable
node that has advanced to the last transaction.
The choice between None and Last Committed depends on the specific requirements and constraints of the
application and use case. In general, if query results should always reflect the up-to-date state of the database, Last
Committed should be used. Otherwise, None is sufficient.
As mentioned above, the GraphDB cluster is made up of two basic node types: leaders and followers. Usually, it
comprises an odd number of nodes in order to tolerate failures. At any given time, each of the nodes is in one of
four states:
• Leader: Usually, there is one leader that handles all client requests, i.e., if a client contacts a follower, the
follower redirects it to the leader.
• Follower: A cluster is made up of one leader and all other servers are followers. They are passive, meaning
that they issue no requests on their own but simply respond to requests from leaders and candidates.
• Candidate: This state is used when electing a new leader.
• Restricted: In this state, the node cannot respond to requests from other nodes and cannot participate in
elections. A node goes into this state when there is a license issue, e.g., an invalid or expired license.
2.2.3 Fingerprints
Nodes use fingerprints – checksums used to determine whether two repositories are in the same condition and
contain the same data. Every transaction performed on a repository returns a fingerprint, which is then compared
against the fingerprint of the same repository on the leader node.
In case of mismatching fingerprints, GraphDB automatically resolves the issue by replicating the offending nodes.
2.2.4 Terms
The Raft algorithm divides time into terms of arbitrary length. Terms are numbered with consecutive integers.
Each term begins with an election, in which one or more candidates attempt to become leader. If a candidate wins
the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote.
In this case the term will end with no leader; a new term (with a new election) will commence. Raft ensures that
there is at most one leader in a given term.
Different servers may observe the transitions between terms at different times. Raft terms act as a logical clock
in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current
term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate;
if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a
candidate or leader discovers that its term is out of date, it immediately reverts to the follower state. If a server
receives a request with a stale term number, it rejects the request.
The GraphDB cluster nodes communicate using remote procedure calls (RPCs), and the basic consensus algorithm
requires only two types of RPCs:
• RequestVote: RPCs that are initiated by candidates during elections
• AppendEntries: RPCs that are initiated by leaders to replicate log entries and to provide a form of heartbeat
Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best
performance.
The log replication resembles a two-phase commit where:
1. The user sends a commit transaction request.
2. The transaction is replicated in the local transaction log.
3. The transaction is replicated to the other followers in parallel.
4. The leader waits until enough members (total/2 + 1) have replicated the entry.
5. The leader starts applying the entry to GraphDB.
6. The leader sends heartbeats until the commit in GraphDB succeeds.
7. The leader sends a second RPC informing the followers to apply the log entry to GraphDB.
8. The leader informs the client that the transaction is successful.
Note: If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs
indefinitely in parallel (even after it has responded to the client) until all followers eventually store all log entries.
Only updates relating to repositories, data manipulation, and access are replicated between logs. This includes
adding/deleting repositories, user right changes, SQL views, smart updates, and standard repository updates.
Raft uses a heartbeat mechanism to trigger leader election. When nodes start up, they begin as followers. A node
remains in the follower state as long as it is receiving valid RPCs from a leader or candidate. Leaders send periodic
heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a
follower receives no communication over a period of time called the election timeout, then it assumes there is no
viable leader and begins an election to choose a new leader. A candidate wins an election if it receives votes from
a majority of the servers in the full cluster for that term. Each node can vote for at most one candidate in a given
term, on a first-come-first-served basis.
If the cluster gets into a situation where only one node is left, that node will switch to read-only mode. The state
shown in its cluster status will switch to candidate, as it cannot achieve a quorum with other cluster nodes when
new data is written.
The leader election process goes as follows:
1. After the initial configuration request has been sent, one of the nodes will be set as leader at random.
2. If the current leader node stops working for some reason, a new election is held in which follower nodes are
promoted to candidate status, and one of those candidates will become the new leader.
3. The leader node sends a constant heartbeat (a form of node status check to see if the node is present and able
to perform its tasks).
4. If only one node is left active for some reason, its status will change to candidate and it will switch to
read-only mode to prevent further modifications, until more nodes appear in the cluster group.
2.3 Connectors
The GraphDB Connectors enable the connection to an external component or service, providing full-text search
and aggregation (Lucene, Solr, Elasticsearch), or querying a database using SPARQL and executing heterogeneous
joins (MongoDB). They also offer the additional benefit of staying automatically up-to-date with the GraphDB
repository data.
The Lucene, Solr, and Elasticsearch Connectors provide synchronization at entity level, where an entity is defined
as having a unique identifier (URI) and a set of properties and property values. In RDF context, this corresponds
to a set of triples that have the same subject. In addition to simple properties (defined by a single triple), the
Connectors support property chains. A property chain is a sequence of triples where each triple’s object is the
subject of the subsequent triple.
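As a rough illustration, a Lucene connector instance that indexes a property chain could be created with a SPARQL update along these lines; the prefix names, the index name my_index, and the example IRIs are assumptions, and the full set of options is described in the Lucene GraphDB Connector documentation:

PREFIX luc: <http://www.ontotext.com/connectors/lucene#>
PREFIX luc-index: <http://www.ontotext.com/connectors/lucene/instance#>

INSERT DATA {
    luc-index:my_index luc:createConnector '''
{
  "types": ["http://example.org/Person"],
  "fields": [
    {
      "fieldName": "employerName",
      "propertyChain": [
        "http://example.org/worksFor",
        "http://example.org/hasName"
      ]
    }
  ]
}
''' .
}

Here the employerName field is populated by following two triples: from a person to the organization it works for, and from that organization to its name.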
GraphDB comes with the following FTS connector implementations:
• Lucene GraphDB Connector
• Solr GraphDB Connector (requires a GraphDB Enterprise license)
• Elasticsearch GraphDB Connector (requires a GraphDB Enterprise license)
Of these, only the Elasticsearch GraphDB Connector supports sub-aggregations.
The MongoDB Integration allows you to query MongoDB databases using SPARQL and to execute heterogeneous
joins. A document-based database with the biggest developer/user community, MongoDB is part of the MEAN
technology stack and guarantees scalability and performance well beyond the throughput supported in GraphDB.
The integration between GraphDB and MongoDB is done by a plugin that sends a request to MongoDB and then
transforms the result to the RDF model.
The Kafka GraphDB Connector provides a means to synchronize changes to the RDF model to any downstream
system via the Kafka framework. This enables easy processing of RDF updates in any external system and covers
a variety of use cases where a reliable synchronization mechanism is needed.
This functionality requires a GraphDB Enterprise license.
Note: Despite having a similar name, the Kafka Sink connector is not a GraphDB connector.
2.4 Workbench
Workbench is the GraphDB web-based administration tool. The user interface is similar to the RDF4J Workbench
Web Application, but with more functionality.
2.5 Requirements
The minimum requirements allow loading datasets of only up to 50 million RDF triples.
• 3GB of memory
• 8GB of storage space
• Java SE Development Kit 11 to 16 (not required for GraphDB Free desktop installation)
Warning: All GraphDB indexes are optimized for hard disks with very low seek time. Our team highly
recommends using only SSD partitions for persisting repository images.
The best approach for correctly sizing the hardware resources is to estimate the number of explicit statements.
Statistically, an average dataset has 3:1 statements to unique RDF resources. The total number of statements
determines the expected repository image size, and the number of unique resources affects the memory footprint
required to initialize the repository.
The table below summarizes the recommended parameters for planning RAM and disk sizing:
• Statements are the planned number of explicit statements.
• Java heap (minimal) is the minimal recommended JVM heap required to operate the database, controlled by
the -Xmx parameter.
• Java heap (optimal) is the recommended JVM heap required to operate the database, controlled by the -Xmx
parameter.
• OS is the recommended minimal RAM reserved for the operating system.
• Total is the RAM required for the hardware configuration.
• Repository image is the expected size on disk. For repositories with inference, use the total number of
explicit + implicit statements.
Statements   Java heap (min)   Java heap (opt)   OS     Total    Repository image
100M         5GB               6GB               2GB    8GB      17GB
200M         8GB               12GB              3GB    15GB     34GB
500M         12GB              16GB              4GB    20GB     72GB
1B           32GB              32GB              4GB    36GB     150GB
2B           50GB              58GB              4GB    62GB     350GB
5B           64GB              68GB              4GB    72GB     720GB
10B          80GB              88GB              4GB    92GB     1450GB
20B          128GB             128GB             6GB    134GB    2900GB
Warning: Running a repository in a cluster doubles the requirements for the repository image storage.
The table above provides example sizes for a single repository and does not take restoring backups or snapshot
replication into consideration.
The optimal approach towards memory management of GraphDB is based on a balance of performance and
resource availability per repository. In heavy use cases, such as parallel importing into a number of repositories,
GraphDB may take up more memory than usual.
There are several configuration properties with which the amount of memory used by GraphDB can be controlled:
• Reduce the global cache: by default, it can take up to 40% (or up to 40GB in case of heap sizes above
100GB) of the available memory allocated to GraphDB, which during periods of stress can be critical. By
reducing the size of the cache, more memory can be freed up for the actual operations. This can be beneficial
during periods of prolonged imports as that data is not likely to be queried right away.
graphdb.page.cache.size=2g
• Reduce the buffer size: this property is used to control the number of statements that can be stored in buffers
by GraphDB. By default, it is sized at 200,000 statements, which can impact memory usage if many repositories
are actively reading/writing data at once. The optimal buffer size depends on the hardware used, as
reducing it would cause more write/read operations to the actual storage.
pool.buffer.size=50000
• Disable parallel import: during periods of prolonged imports to a large number of repositories, parallel imports
can take up more than 800 megabytes of retained heap per repository. In such cases, parallel importing
can be disabled, which would force data to be imported serially to each repository. However, serial import
reduces performance.
graphdb.engine.parallel-import=false
This table shows an example of retained heap usage by repository, using different configuration parameters:
* Depends on the number of CPU cores available to GraphDB. For the statistics, the default buffer size was reduced
from 200,000 (default) to 50,000 statements. The inference pool size was reduced from eight to three. Keep in
mind that this reduces performance.
** Without reducing buffer and inference pool sizes. Disables parallel import, which impacts performance.
2.5.4 Licensing
GraphDB is available in three different editions: Free, Standard Edition (SE), and Enterprise Edition (EE).
The Free edition is free to use and does not require a license. This is the default mode in which GraphDB will
start. However, it is not open-source.
SE and EE are RDBMS-like commercial licenses on a per-server-CPU basis. They are neither free nor open-source.
To purchase a license or obtain a copy for evaluation, please contact graphdb-info@ontotext.com.
When installing GraphDB, the SE/EE license file can be set through the GraphDB Workbench or programmatically.
THREE
GETTING STARTED
The easiest way to set up and run GraphDB is to use the native installations provided for the GraphDB Desktop
distribution. This kind of installation is the best option for your laptop/desktop computer, and does not require the
use of a console, as it works in a graphical user interface (GUI). For this distribution, you do not need to download
Java, as it comes bundled together with GraphDB.
Go to the GraphDB download page and request your GraphDB copy. You will receive an email with the download
link. According to your OS, proceed as follows:
Important: GraphDB Desktop is a new application that is similar to but different from the previous application
GraphDB Free.
If you are upgrading from the old GraphDB Free application, you need to stop GraphDB Free and uninstall it
before or after installing GraphDB Desktop. Once you run GraphDB Desktop for the first time, it will convert
some of the data files and GraphDB Free will no longer work correctly.
3.1.1 On Windows
3.1.2 On MacOS
3.1.3 On Linux
Once GraphDB Desktop is running, a small icon appears in the status bar/menu/tray area (varying depending on
OS). It allows you to check whether the database is running, as well as to stop it or change the configuration
settings. Additionally, an application window is also opened, where you can go to the GraphDB documentation,
configure settings (such as the port on which the instance runs), and see all log files. You can hide the window from
the Hide window button and reopen it by choosing Show GraphDB window from the menu of the aforementioned
icon.
You can add and edit the JVM options (such as Java system properties or parameters to set memory usage) of the
GraphDB native app from the GraphDB Desktop config file. It is located at:
• On Mac: /Applications/GraphDB Desktop.app/Contents/app/GraphDB Desktop.cfg
• On Windows: \Users\<username>\AppData\Local\GraphDB Desktop\app\GraphDB Desktop.cfg
• On Linux: /opt/graphdb-desktop/lib/app/graphdb-desktop.cfg
The JVM options are defined at the end of the file and will look very similar to this:
[JavaOptions]
java-options=-Djpackage.app-version=10.0.0
java-options=-cp
java-options=$APPDIR/graphdb-native-app.jar:$APPDIR/lib/*
java-options=-Xms1g
java-options=-Dgraphdb.dist=$APPDIR
java-options=-Dfile.encoding=UTF-8
java-options=--add-exports
java-options=jdk.management.agent/jdk.internal.agent=ALL-UNNAMED
java-options=--add-opens
java-options=java.base/java.lang=ALL-UNNAMED
Each java-options= line provides a single argument passed to the JVM when it starts. To be on the safe side, it
is recommended not to remove or change any of the existing options provided with the installation. You can add
your own options at the end. For example, if you want to run GraphDB Desktop with 8 gigabytes of maximum
heap memory, you can set the following option:
java-options=-Xmx8g
To stop the database, simply quit it from the status bar/menu/tray area icon, or close the GraphDB Desktop
application window.
Hint: On some Linux systems, there is no support for status bar/menu/tray area. If you have hidden the GraphDB
window, you can quit it by killing the process.
The default way of running GraphDB is as a standalone server. The server is platform-independent, and includes
all recommended JVM (Java virtual machine) parameters for immediate use.
Note: Before downloading and running GraphDB, please make sure to have JDK (Java Development Kit,
recommended) or JRE (Java Runtime Environment) installed. GraphDB requires Java 11 or greater.
The configuration of all GraphDB directory paths and network settings is read from the conf/graphdb.properties
file. It controls where to store the database data, log files, and internal data. To assign a new value, modify the file
or override the setting by adding -D<property>=<new-value> as a parameter to the startup script. For example, to
change the database port number:
graphdb -Dgraphdb.connector.port=<your-port>
The configuration properties can also be set in the environment variable GDB_JAVA_OPTS, using the same
-D<property>=<new-value> syntax.
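For example, on Linux/macOS the port could equivalently be changed by exporting the variable before starting the server (7300 is only an illustrative value):

export GDB_JAVA_OPTS="-Dgraphdb.connector.port=7300"
graphdb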
Note: The order of precedence for GraphDB configuration properties is as follows: command line supplied
arguments > GDB_JAVA_OPTS > config file.
The GraphDB home defines the root directory where GraphDB stores all of its data. The home can be set through
the system or config file property graphdb.home.
The default value for the GraphDB home directory depends on how you run GraphDB:
• Running as a standalone server: the default is the same as the distribution directory.
• All other types of installations: OS-dependent directory.
– On Mac: ~/Library/Application Support/GraphDB.
– On Windows: \Users\<username>\AppData\Roaming\GraphDB.
– On Linux and other Unixes: ~/.graphdb.
GraphDB does not store any files directly in the home directory, but uses several subdirectories for data or
configuration.
We strongly recommend setting explicit values for the Java heap space. You can control the heap size by supplying
an explicit value to the startup script such as graphdb -Xms10g -Xmx10g or setting one of the following environment
variables:
• GDB_HEAP_SIZE: environment variable to set both the minimum and the maximum heap size (recommended);
• GDB_MIN_MEM: environment variable to set only the minimum heap size;
• GDB_MAX_MEM: environment variable to set only the maximum heap size.
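For example, to start the standalone server with a fixed 10 GB heap (matching the graphdb -Xms10g -Xmx10g example above), one could set:

export GDB_HEAP_SIZE=10g
graphdb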
For more information on how to change the default Java settings, check the instructions in the bin/graphdb file.
Note: The order of precedence for JVM options is as follows: command line supplied arguments > GDB_JAVA_OPTS
> GDB_HEAP_SIZE > GDB_MIN_MEM/GDB_MAX_MEM.
Tip: Every JDK package contains a default garbage collector (GC) that can potentially affect performance.
We benchmarked GraphDB’s performance against the LDBC SPB and BSBM benchmarks with JDK 8 and 11.
With JDK 8, the recommended GC is the Parallel Garbage Collector (ParallelGC). With JDK 11, optimal
performance can be achieved with either G1 GC or ParallelGC.
To stop the database, find the GraphDB process identifier and send kill <process-id>. This sends a shutdown
signal and the database stops. If the database is run in non-daemon mode, you can also send a Ctrl+C interrupt to
stop it.
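On Linux/macOS, one way to do this, assuming a single GraphDB process is running, is:

kill $(pgrep -f graphdb)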
GraphDB is available in three different editions: Free, Standard Edition (SE), and Enterprise Edition (EE).
The Free edition is free to use and does not require a license. This is the default mode in which GraphDB will
start. However, it is not open-source.
SE and EE are RDBMS-like commercial licenses on a per-server-CPU basis. They are neither free nor open-source.
To purchase a license or obtain a copy for evaluation, please contact graphdb-info@ontotext.com.
When installing GraphDB, the SE/EE license file can be set through the GraphDB Workbench or programmatically.
To do that, follow the steps:
1. Add, view, or update your license from Setup → Licenses → Set new license.
From here, you can also Revert to Free license. If you do so, GraphDB will ask you to confirm.
4. After completing these steps, you will be able to view your license details.
GraphDB 10.1 introduces a set of interactive tutorials that will walk you through key GraphDB functionalities
using the Workbench user interface. They can be accessed from Help → Interactive guides, as well as via the Take
me to the guides button in the center panel of the GraphDB Workbench startup screen.
Each guide has a name, a description, a level (Beginner, Intermediate, or Advanced), and a Run button, which
starts the tutorial. Currently, GraphDB 10.1 offers two such tutorials:
• The Star Wars guide: Designed for beginners and using the Star Wars dataset, which you can download
within the guide, this tutorial will walk you through some basic GraphDB functionalities such as creating a
repository, importing RDF data from a file into it, and exploring the data through the Visual graph.
• The Movies database guide: Also designed for beginners and using a dataset with movie information, this
tutorial will show you some additional functionalities like exploring your data from the class hierarchy
perspective, some SPARQL queries, as well as exploring RDF through the tabular view.
To start a guide, click Run. This will activate a series of dialogs that will guide you through the steps of the tutorial.
While the guide is running, the rest of the Workbench is dimmed and inactive.
Each window explains what is going to happen next or asks you to perform a certain action. The window title
shows the name of the current action, the number of steps it comprises, and the progress of the action, e.g., step
1 of the Create repository action that consists of 7 steps. Before each major action, you are provided with an
overview of what the particular view of the Workbench is used for.
The little icon left of the title of each step provides additional information about it, for example:
To proceed to the next step, either click/type in the highlighted active area in the Workbench (Setup in the above
example), or press the Next button. We recommend the former, as it is essentially what you would be doing
in the user interface if you were not in the guide, thus familiarizing yourself with it more easily.
If you attempt to close the dialog window, GraphDB will ask you to confirm the action before closing it.
Hint: When started, GraphDB creates the GraphDB-HOME/data directory to store local repository data. To
change the directory, see Configuring GraphDB Data directory.
1. Go to Setup → Repositories.
2. Click Create new repository.
3. Select GraphDB repository.
4. For Repository ID, enter my_repo and leave all other optional configuration settings at their default values.
Tip: For repositories with over several tens of millions of statements, see Configuring a Repository.
5. Click the Connect button to set the newly created repository as the repository for this location.
Tip: You can also use a cURL command to perform basic location and repository management through the
GraphDB REST API.
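For example, the repositories in the current location can be listed with a request along these lines (assuming a local instance on the default port):

curl http://localhost:7200/rest/repositories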
All examples given below are based on the News sample dataset provided in the distribution folder.
Tip: You can also use public datasets such as the w3.org Wine ontology by pasting its data URL
https://www.w3.org/TR/owl-guide/wine.rdf in Import → User data → Get RDF data from a URL.
1. Go to Import.
2. Open the User data tab and click Upload RDF files to upload the files from the News sample dataset
provided in the examples/data/news directory of the GraphDB distribution.
3. Click Import.
4. Enter the Import settings in the popup window.
Import Settings
• Base IRI: the default prefix for all local names in the file;
• Target graphs: imports the data into one or more graphs.
Note: You can also import data from files on the server where the Workbench is located, from a remote URL
(with a format extension or by specifying the data format), by typing or pasting the RDF data in a text area, or by
executing a SPARQL INSERT.
Import execution
• Imports are executed in the background while you continue working on other things.
• Interrupt is supported only when the location is local.
• Parser config options are not available for remote locations.
The GraphDB database also supports a powerful API with a standard SPARQL or RDF4J endpoint, to which data
can be posted with cURL, a local Java client API, or an RDF4J console. It is compliant with all standards, and
allows every database operation to be executed via an HTTP client request.
1. Locate the correct GraphDB URL endpoint:
• Go to Setup → Repositories.
• Click the link icon next to the repository name.
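2. Post the data file to the repository endpoint with cURL, along these lines (a sketch; adjust the Content-Type header to match the RDF format of your file):

curl -X POST -H "Content-Type: text/turtle" \
     --data-binary @local_file_name.ttl \
     http://localhost:7200/repositories/repository-id/statements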
where local_file_name.ttl is the data file you want to import, and
http://localhost:7200/repositories/repository-id/statements is the GraphDB URL endpoint of your repository.
ImportRDF is a low-level bulk load tool that writes directly in the database index structures. It is ultra-fast and
supports parallel inference. For more information, see Loading Data Using the ImportRDF Tool.
Note: Loading data through the GraphDB ImportRDF tool can be performed only if the repository is empty, e.g.,
for the initial loading after the database has been inactive. If you use it on a non-empty repository, it will overwrite
all of the data in it.
To explore instances and their relationships, first enable the Autocomplete index from Setup → Autocomplete,
which makes the lookup of IRIs easier. Then navigate to Explore → Visual graph, and find an instance of interest
through the Easy graph search box. You can also do it from the View resource search field on GraphDB’s home
page: search for the name of your resource, and press the Visual button.
The graph of the instance and its relationships are shown. The example here is from the w3.org wine ontology that
we mentioned earlier.
• Expand a node to show its relationships, or collapse it to hide them if already expanded. You can also expand
the node by double-clicking on it.
• Copy a node’s IRI to the clipboard.
• Focus on a node to restart the graph with this instance as the central one. Note that you will lose the current
state of your graph.
• Delete a node to hide its relationships and hide it from the graph.
Click on a node to see more info about it: a side panel opens on the right, including a short description
(rdfs:comment), labels (rdfs:label), RDF rank, image (foaf:depiction) if present, and all DataType properties.
You can also search by DataType property if you are interested in its value. Click on the node again if you want to
hide the side panel.
You can switch between nodes without closing the side panel. Just click on the new node about which
you want to see more, and the side panel will automatically show the information about it.
Click on the settings icon on the top right for advanced graph settings. Control the number of links, types, and
predicates to hide and show.
Control the SPARQL queries behind the visual graph by creating your own visual graph configuration. To make
one, go to Explore → Visual graph → Advanced graph configurations → Create graph config. Use the sample
queries to guide you in the configuration.
– Search box: start with a search box to choose a different start resource each time;
– Fixed node: you may want to start exploration with the same resource each time;
– Query results: the initial config state may be the visual representation of a Graph SPARQL query result.
• Graph expansion: determines how new nodes and links are added to the visual graph when the user expands
an existing node. The ?node variable is required and will be replaced with the IRI of the expanded node.
• Node basics: this SELECT query controls how the type, label, comment and rank are obtained for the
nodes in the graph. Node types correspond to different colors. Node rank is a number between 0 and 1 and
determines the size of a node. The label is the text over each node, and if empty, IRI local name is used.
Again, ?node binding is replaced with node IRI.
• Predicate label: defines what text to show for each edge IRI. The query should have ?edge variable to
replace it with the edge IRI.
• Node extra: Click on the info icon to see additional node properties. Control what to see in the side panel.
?node variable is replaced with node IRI.
• Save your config and reload it to explore your data the way you wish to visualize it.
To explore your data, navigate to Explore → Class hierarchy. You can see a diagram depicting the hierarchy of the
imported RDF classes by number of instances. The biggest circles are the parent classes and the nested ones are
their children.
Note: If your data has no ontology (hierarchy), the RDF classes will be visualized as separate circles instead of
nested ones.
• To see what classes each parent has, hover over the nested circles.
• To explore a given class, click its circle. The selected class is highlighted with a dashed line, and a side panel
with its instances opens for further exploration. For each RDF class you can see its local name, IRI, and a
list of its first 1,000 class instances. The class instances are represented by their IRIs, which, when clicked,
lead to another view where you can further explore their metadata.
• To go to the Domain-Range Graph diagram, double-click a class circle or click the Domain-Range Graph button
from the side panel.
• To explore an instance, click its IRI from the side panel.
• To adjust the number of classes displayed, drag the slider on the left-hand side of the screen. Classes are
sorted by the maximum instance count, and the diagram displays only the current slider value.
• To administrate your data view, use the toolbar options on the right-hand side of the screen.
– To see only the class labels, click Hide/Show Prefixes. You can still view the prefixes when you hover
over the class that interests you.
– To zoom out of a particular class, click the Focus diagram home icon.
– To reload the data on the diagram, click the Reload diagram icon. This is recommended when you have
updated the data in your repository, or when you are experiencing some strange behavior, for example
you cannot see a given class.
– To export the diagram as an .svg image, click the Export Diagram download icon.
• You can also filter the hierarchy by graph when there is more than one named graph in your repository. Just
expand the All graphs dropdown menu next to the toolbar options and select the graph you want to explore.
To explore the connectedness of a given class, double-click the class circle or the Domain-Range Graph button
from the side panel. You can see a diagram that shows this class and its properties with their domain and range,
where domain refers to all subject resources and range to all object resources. For example, if you start from class
pub:Company, you see something like: <pub-old:Mention pub-old:hasInstance pub:Company> <pub:Company
pub:description xsd:string>.
To see all predicate labels contained in a collapsed edge, click the collapsed edge count label, which is always in
the format <count> predicates. A side panel opens with the target node label, a list of the collapsed predicate
labels and the type of the property (explicit or implicit). You can click these labels to see the resource in the View
resource page.
To administrate your diagram view, use the toolbar options on the right-hand side of the screen.
• To go back to your class in the Class hierarchy, click the Back to Class hierarchy diagram button.
• To collapse edges with common source/target nodes, in order to see the diagram more clearly, click the Show
all predicates/Show collapsed predicates button. The default is collapsed.
• To export the diagram as an .svg image, click the Export Diagram download icon.
To explore the relationships between the classes, navigate to Explore → Class relationships. You can see a
complicated diagram showing only the top relationships, where each of them is a bundle of links between the individual
instances of two classes. Each link is an RDF statement, where the subject is an instance of one class, the object is
an instance of another class, and the link is the predicate. Depending on the number of links between the instances
of two classes, the bundle can be thicker or thinner and gets the color of the class with more incoming links. These
links can be in both directions.
In the example below, you can see the relationships between the classes of the News sample dataset provided in
the distribution folder. You can observe that the class with the biggest number of links (the thickest bundle) is
pub-old:Document.
To control which classes to display in the diagram, use the add/remove icon next to each class.
To see how many annotations (mentions) there are in the documents, click on the blue bundle representing the
relationship between the classes pub-old:Document and pub-old:TextMention. The tooltip shows that there are
6,197 annotations linked by the pub-old:containsMention predicate.
To see how many of these annotations are about people, click on the light purple bundle representing the relationship
between the classes pub-old:TextMention and pub:Person. The tooltip shows that 274 of the annotations are about people.
Just like in the Class hierarchy view, you can also filter the class relationships by graph when there is more than
one named graph in the repository. Expand the All graphs dropdown menu next to the toolbar options and select
the graph you want to explore.
Hint: SPARQL is a SQL-like query language for RDF graph databases with the following query types:
• SELECT: returns tabular results;
• CONSTRUCT: creates a new RDF graph based on query results;
• ASK: returns “YES” if the query has a solution, otherwise “NO”;
• DESCRIBE: returns RDF data about a resource; useful when you do not know the RDF data structure in the
data source;
• INSERT: inserts triples into a graph;
• DELETE: deletes triples from a graph.
For more information, see the Additional resources section.
Now it is time to delve into your data. The following is one possible scenario for querying it.
1. Select the repository you want to work with, in this example News, and click the SPARQL menu tab.
2. Let’s say you are interested in people. Paste the query below into the query field, and click Run to find all
people mentioned in the documents from this news articles dataset.
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
select distinct ?x ?Person where {
(continues on next page)
3. Run a query to calculate the RDF rank of the instances based on their interconnectedness.
4. Find all people mentioned in the documents, ordered by popularity in the repository.
5. Find all people who are mentioned together with their political parties.
6. Did you know that Marlon Brando was from the Democratic Party? Find what other mentions occur together
with Marlon Brando in the given news article.
8. Find all documents that mention members of the Democratic Party and the names of these people.
Tip: You can play with more example queries from the Example_queries.rtf file provided in the examples/
data/news directory of the GraphDB distribution.
SPARQL is not only a standard query language, but also a protocol for communicating with RDF databases.
GraphDB stays compliant with the protocol specification, and allows querying data with standard HTTP requests.
curl -G -H "Accept:application/x-trig" \
-d query=CONSTRUCT+%7B%3Fs+%3Fp+%3Fo%7D+WHERE+%7B%3Fs+%3Fp+%3Fo%7D+LIMIT+10 \
http://localhost:7200/repositories/yourrepository
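Decoded, the query parameter corresponds to the following SPARQL query:

CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10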
Tip: For more information on how to interact with GraphDB APIs, refer to the RDF4J and SPARQL protocols
or the Linked Data Platform specifications.
FOUR
MANAGING REPOSITORIES
Data in GraphDB is organized within repositories. Each repository is an independent RDF database that can be
active independently from other repositories. Operations involving data updates or queries are always directed to
a single repository.
Repositories are typically native GraphDB repositories but there are other repository types that are used with
Virtualization and FedX Federation.
The following chapters cover repository management and how repositories work in general.
There are two ways of creating and managing repositories: either through the Workbench interface, or by using
the RDF4J console.
To manage your repositories, go to Setup → Repositories. This opens a list of available repositories and their
locations.
1. Click the Create new repository button or create it from a file by using the configuration template that can
be found under /configs/templates in the GraphDB distribution.
3. If you have attached remote locations to your GraphDB instance, there will be an additional field for
specifying the Location in which you want to create the repository. The default is Local. See more about creating
a repository in a remote location a little further below.
4. Enter the Repository ID (e.g., my_repo) and leave all other optional configuration settings with their default
values.
Tip: For repositories with over several tens of millions of statements, see the configuration parameters.
5. Click the Create button. Your newly created repository appears in the repository list.
You can easily create a repository in any remote location attached to your GraphDB instance.
1. First, connect to the location. For this example, let’s connect http://localhost:7202 (substitute localhost
and the 7200 port number as appropriate).
2. Then, just like with local repositories, go to Setup → Repositories → Create new repository.
3. In the Location field, the two locations are now visible. Select http://localhost:7202.
6. If you go to the http://localhost:7202 location, you will see remote_repo created there.
Note: Use the create command to add new repositories to the location to which the console is connected. This
command expects the name of the template that describes the repository’s configuration.
1. Run the RDF4J console application, which resides in the /bin folder:
• Windows: console.cmd
• Unix/Linux: ./console
2. Connect to the GraphDB server instance using the command:
connect http://localhost:7200.
create graphdb.
We can also create a repository with enabled SHACL validation through the RDF4J console. To
do that, execute:
create graphdb-shacl.
Connect a repository
Alternatively, use the dropdown menu in the top right corner. This allows you to easily switch between repositories
while running queries or importing and exporting data in other views. Hovering over the respective repository will
also display some basic information about it.
Note that when you are connected to a remote repository, its label in the top right corner indicates that:
Edit a repository
To copy the repository URL, edit it, download the repository configuration as a Turtle file, restart it, or delete it,
use the respective icons next to its name.
You can restart a repository without having to restart the entire GraphDB instance. There are two ways to do that:
• Click the restart icon as shown above. A warning will prompt you to confirm the action.
• Click the edit icon, which will open the repository configuration. At its bottom, tick the restart box, save,
and confirm.
Warning: Restarting the repository will shut it down immediately, and all running queries and updates will
be cancelled.
Before you start adding or changing the parameter values, we recommend planning your repository configuration
and familiarizing yourself with what each of the parameters does, what the configuration template is and how it
works, what data structures GraphDB supports, what configuration values are optimal for your setup, etc.
Note: If you need a repository with enabled SHACL validation, you must enable this option at configuration time.
SHACL validation cannot be enabled after the repository has been created.
Some of the parameters you specify at repository creation time can be changed at any point.
1. Click the Edit icon next to a repository to edit it.
2. Restart the repository for the changes to take effect.
Tip: GraphDB uses an RDF4J configuration template for configuring its repositories. RDF4J keeps the repository
configurations with their parameters modeled in RDF. Therefore, in order to create a new repository, RDF4J
needs such an RDF file. For more information on how the configuration template works, see Repository
configuration template - how it works.
<#wines> a rep:Repository;
    rep:repositoryID "wines";
    rep:repositoryImpl [
        rep:repositoryType "graphdb:SailRepository";
        <http://www.openrdf.org/config/repository/sail#sailImpl> [
            <http://www.ontotext.com/config/graphdb#base-URL> "http://example.org/owlim#";
            <http://www.ontotext.com/config/graphdb#check-for-inconsistencies> "false";
            <http://www.ontotext.com/config/graphdb#defaultNS> "";
            <http://www.ontotext.com/config/graphdb#disable-sameAs> "true";
            <http://www.ontotext.com/config/graphdb#enable-context-index> "false";
            <http://www.ontotext.com/config/graphdb#enable-fts-index> "false";
            <http://www.ontotext.com/config/graphdb#enable-literal-index> "true";
            <http://www.ontotext.com/config/graphdb#enablePredicateList> "true";
            <http://www.ontotext.com/config/graphdb#entity-id-size> "32";
            <http://www.ontotext.com/config/graphdb#entity-index-size> "10000000";
            <http://www.ontotext.com/config/graphdb#fts-indexes> ("default" "iri");
            <http://www.ontotext.com/config/graphdb#fts-iris-index> "none";
            <http://www.ontotext.com/config/graphdb#fts-string-literals-index> "default";
            <http://www.ontotext.com/config/graphdb#imports> "";
            <http://www.ontotext.com/config/graphdb#in-memory-literal-properties> "true";
            <http://www.ontotext.com/config/graphdb#query-limit-results> "0";
            <http://www.ontotext.com/config/graphdb#query-timeout> "0";
            <http://www.ontotext.com/config/graphdb#read-only> "false";
            <http://www.ontotext.com/config/graphdb#repository-type> "file-repository";
            <http://www.ontotext.com/config/graphdb#ruleset> "rdfsplus-optimized";
            <http://www.ontotext.com/config/graphdb#storage-folder> "storage";
            <http://www.ontotext.com/config/graphdb#throw-QueryEvaluationException-on-timeout> "false";
            sail:sailType "graphdb:Sail"
        ]
    ];
    rdfs:label "" .
<#wines-shacl> a rep:Repository;
    rep:repositoryID "wines-shacl";
    rep:repositoryImpl [
        rep:repositoryType "graphdb:SailRepository";
        <http://www.openrdf.org/config/repository/sail#sailImpl> [
            # ... additional SHACL sail parameters not shown here ...
            sail-shacl:transactionalValidationLimit "500000"^^xsd:long;
            sail-shacl:validationEnabled true;
            sail-shacl:validationResultsLimitPerConstraint "1000"^^xsd:long;
            sail-shacl:validationResultsLimitTotal "1000000"^^xsd:long;
            sail:delegate [
                <http://www.ontotext.com/config/graphdb#base-URL> "http://example.org/owlim#";
                <http://www.ontotext.com/config/graphdb#check-for-inconsistencies> "false";
                <http://www.ontotext.com/config/graphdb#defaultNS> "";
                <http://www.ontotext.com/config/graphdb#disable-sameAs> "true";
                <http://www.ontotext.com/config/graphdb#enable-context-index> "false";
                <http://www.ontotext.com/config/graphdb#enable-fts-index> "false";
                <http://www.ontotext.com/config/graphdb#enable-literal-index> "true";
                <http://www.ontotext.com/config/graphdb#enablePredicateList> "true";
                <http://www.ontotext.com/config/graphdb#entity-id-size> "32";
                <http://www.ontotext.com/config/graphdb#entity-index-size> "10000000";
                <http://www.ontotext.com/config/graphdb#fts-indexes> ("default" "iri");
                <http://www.ontotext.com/config/graphdb#fts-iris-index> "none";
                <http://www.ontotext.com/config/graphdb#fts-string-literals-index> "default";
                <http://www.ontotext.com/config/graphdb#imports> "";
                <http://www.ontotext.com/config/graphdb#in-memory-literal-properties> "true";
                <http://www.ontotext.com/config/graphdb#query-limit-results> "0";
                <http://www.ontotext.com/config/graphdb#query-timeout> "0";
                <http://www.ontotext.com/config/graphdb#read-only> "false";
                <http://www.ontotext.com/config/graphdb#repository-type> "file-repository";
                <http://www.ontotext.com/config/graphdb#ruleset> "rdfsplus-optimized";
                <http://www.ontotext.com/config/graphdb#storage-folder> "storage";
                <http://www.ontotext.com/config/graphdb#throw-QueryEvaluationException-on-timeout> "false";
                sail:sailType "graphdb:Sail"
            ];
            sail:sailType "rdf4j:ShaclSail"
        ]
    ];
    rdfs:label "" .
2. Rename it to config.ttl.
3. In the directory where the config.ttl is, run the below cURL request. If the file is in a different directory,
provide the path to it at config=@./.
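A sketch of such a request, assuming a local GraphDB instance on the default port (see the GraphDB REST API documentation for the exact endpoint and parameters):

curl -X POST http://localhost:7200/rest/repositories \
     -H 'Content-Type: multipart/form-data' \
     -F "config=@./config.ttl"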
4. The newly created repository will appear in the repository list under Setup → Repositories in the Workbench.
This is a list of all repository configuration parameters. Some of the parameters can be changed (effective after a
restart), some cannot be changed (the change has no effect) and others need special attention once a repository
has been created, as changing them will likely lead to inconsistent data (e.g., unsupported inferred statements,
missing inferred statements, or inferred statements that cannot be deleted).
4.2.6 Namespaces
Under Setup → Namespaces in the GraphDB Workbench, you can view and manipulate the RDF namespaces and
prefixes for the active repository. Each GraphDB repository contains a number of predefined prefixes.
Once a repository is created, it is possible to change some parameters, either by editing it in the Workbench or by
setting a global override for a given property.
Note: When you change a repository parameter, you need to restart GraphDB for the changes to take effect.
To edit a repository parameter in the GraphDB Workbench, go to Setup → Repositories and click the Edit icon for
the repository whose parameters you want to edit.
Global overrides
It is also possible to override a repository parameter for all repositories by setting a configuration or system property.
See Engine properties for more details on how to do it.
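For example, to apply the query-timeout parameter to every repository, a line like the following could be added to conf/graphdb.properties; the graphdb.engine. prefix follows the same pattern as the graphdb.engine.parallel-import property shown earlier, and 60 seconds is only an illustrative value:

graphdb.engine.query-timeout=60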
Use the Workbench to change the repository ID. This will update all locations in the Workbench where the
repository name is used.
Connecting to remote GraphDB instances is done by attaching remote locations to GraphDB. Locations represent
individual GraphDB servers where the repository data is stored. They can be local (a directory on the disk) or
remote (an endpoint URL), and can be attached, edited, and detached.
Remote locations are mainly used for:
• Accessing remote GraphDB repositories from the same Workbench;
• Accessing secured remote repositories via SPARQL federation;
• As a key component of cluster management.
4. In terms of authentication methods to the remote location, GraphDB offers three options:
a. None: The security of the remote location is disabled, and no authentication is needed.
b. Basic authentication: The security of the remote location has basic authentication enabled (default
setting). Requires a username and a password.
c. Signature: Uses the token secret, which must be the same on both GraphDB instances. For more
information on configuring the token secret, see the GDB authentication section of the Access Control
documentation.
5. After the location has been created, it will appear right below the local one.
The location setting for sending anonymous statistics to Ontotext depends on the GraphDB license that you are
using. With GraphDB Free, it is enabled by default, and with GraphDB Standard and Enterprise, it is disabled by
default.
To enable or disable it manually, click Edit common settings for these repositories.
Click the key icon to check the details of your current license.
Hint: Signature authentication is the recommended method for a cluster environment, as both require the same
authentication settings.
Note: You can connect to a remote location over HTTPS as well. To do so:
1. Enable HTTPS on the remote host.
2. Set the correct Location URL, for example https://localhost:8083.
3. In case the certificate of the remote host is self-signed, you should add it to your JVM’s SSL TrustStore.
GraphDB plugins can be in active or inactive state. This means attaching and detaching them to/from GraphDB
on a fundamental level.
For most of the plugins, this can be done from the Workbench in Setup → Plugins. By default, all plugins available
there are activated.
Note: The Provenance plugin needs to be registered first in order to be activated. Once registered, it will appear
in the list.
If you deactivate a plugin, you will not be able to use it. For example:
1. In Setup → Plugins, deactivate Autocomplete.
2. If you go to Setup → Autocomplete, you will get the following error message:
To deactivate it:
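(A sketch of the update, assuming GraphDB's plugin-control predicate in the http://www.ontotext.com/owlim/system# namespace; check the Plugins documentation for the exact predicate name.)

INSERT DATA {
    [] <http://www.ontotext.com/owlim/system#stopplugin> 'autocomplete'
}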
Note: Spell out the plugin names the way they are displayed in the Workbench page shown above.
To get a list of all plugins and their current state (active/inactive), run:
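(A sketch of such a query, again assuming the plugin-control predicates in the http://www.ontotext.com/owlim/system# namespace.)

SELECT ?plugin ?state
WHERE {
    ?plugin <http://www.ontotext.com/owlim/system#listplugins> ?state
}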
Some of the plugins also have an enabled and disabled state, provided that they have been activated before that.
These include:
Autocomplete index
The index can be enabled both from Setup → Autocomplete in the Workbench and with a SPARQL query.
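(A sketch of the SPARQL variant, assuming the Autocomplete plugin's control predicate; see the Autocomplete index documentation for the exact IRI.)

INSERT DATA {
    [] <http://www.ontotext.com/plugins/autocomplete#enabled> true
}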
Change tracking
You can enable change tracking for a certain transaction ID. See how to do it here.
GeoSPARQL support
RDF Rank
This refers to the RDF Rank filtered mode when you want to calculate the rank only of certain entities. See how
to do it here.
4.5 Inference
Inference is the derivation of new knowledge from existing knowledge and axioms. In an RDF database, such
as GraphDB, inference is used for deducing further knowledge based on existing RDF data and a formal set of
inference rules.
GraphDB supports inference out of the box and provides updates to inferred facts automatically. Facts change all
the time and the amount of resources it would take to manually manage updates or rerun the inferencing process
would be overwhelming without this capability. This results in improved query speed, data availability and accurate
analysis.
Inference uncovers the full power of data modeled with RDF(S) and ontologies. GraphDB will use the data and
the rules to infer more facts and thus produce a richer dataset than the one you started with.
GraphDB can be configured via “rulesets” – sets of axiomatic triples and entailment rules – that determine the
applied semantics. The implementation of GraphDB relies on a compile stage, during which the rules are compiled
into Java source code that is then further compiled into Java bytecode and merged together with the inference
engine.
Standard rulesets
The GraphDB inference engine provides full standard-compliant reasoning for RDFS, OWL-Horst, OWL2-RL,
and OWL2-QL.
To apply a ruleset, simply choose from the options in the pull-down list when configuring your repository, as shown
below through the GraphDB Workbench:
Custom rulesets
GraphDB also comes with support for custom rulesets that allow for custom reasoning through the same
performance-optimized inference engine. The rulesets are defined via .pie files.
To load custom rulesets, simply point to the location of your .pie file as shown below:
See how to configure the default inference value setting from the Workbench here.
The GraphDB proof plugin enables you to find out how a particular statement has been derived by the inferencer,
e.g., which rule fired and which premises have been matched to produce that statement.
The plugin is available as open source.
When creating your repository, make sure to select the OWL-Horst ruleset, as the examples below cover inferences
related to the owl:inverseOf and owl:intersectionOf predicates, for which OWL-Horst contains rules.
This example will investigate the relevant rule from a ruleset supporting the owl:inverseOf predicate, which looks
like this in the source .pie file:
Id: owl_invOf
a b c
b <owl:inverseOf> d
------------------------------------
c d a
Add to the repository the following data, which declares that urn:childOf is the inverse property of urn:hasChild, and
places a statement relating urn:John urn:childOf urn:Mary in a context named <urn:family>:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
INSERT DATA {
<urn:childOf> owl:inverseOf <urn:hasChild> .
graph <urn:family> {
<urn:John> <urn:childOf> <urn:Mary>
}
}
The following query explains which rule has been triggered to derive the (<urn:Mary> <urn:hasChild>
<urn:John>) statement. The arguments to the proof:explain predicate from the plugin are supplied by a VALUES
expression for brevity:
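(The query below is a sketch based on the Proof plugin's predicates; treat the exact predicate names, such as proof:rule and proof:context, as assumptions to verify against the Proof plugin documentation.)

PREFIX proof: <http://www.ontotext.com/proof/>
SELECT ?rule ?s ?p ?o ?context
WHERE {
    VALUES (?subject ?predicate ?object) {
        (<urn:Mary> <urn:hasChild> <urn:John>)
    }
    ?pf proof:explain (?subject ?predicate ?object) .
    ?pf proof:rule ?rule ;
        proof:subject ?s ;
        proof:predicate ?p ;
        proof:object ?o ;
        proof:context ?context .
}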
Running the query returns the rule that fired and the premises that were matched to produce the statement.
As you can see, (owl:inverseOf, owl:inverseOf, owl:inverseOf) is implicit, and we can investigate further
by altering the VALUES to:
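That is, in the query above the VALUES expression becomes:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

VALUES (?subject ?predicate ?object) {
    (owl:inverseOf owl:inverseOf owl:inverseOf)
}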
Id: owl_invOfBySymProp
a <rdf:type> <owl:SymmetricProperty>
------------------------------------
a <owl:inverseOf> a
If we track down the last premise, we will see that another rule supports it. (Note that both rules and the premises
are axioms. Currently, the plugin does not check whether something is an axiom.)
Id: owl_SymPropByInverse
a <owl:inverseOf> a
------------------------------------
a <rdf:type> <owl:SymmetricProperty>
This more elaborate example demonstrates how to combine the bindings from regular SPARQL statement patterns
and explore multiple statements.
We can define a helper JavaScript function that will return the local name of an IRI using the JavaScript functions
plugin:
PREFIX jsfn:<http://www.ontotext.com/js#>
INSERT DATA {
[] jsfn:register '''
function lname(value) {
if(value instanceof org.eclipse.rdf4j.model.IRI)
return value.getLocalName();
else
return ""+value;
}
'''
}
Next, the query will look for statements with ?subject bound to <urn:Mary>, and explain all of them. Note the
use of the newly defined function lname() in the projection expression with concat():
SELECT (concat(jsfn:lname(?s), ' ', jsfn:lname(?p), ' ', jsfn:lname(?o)) as ?premise) ?rule ?s ?p ?o ?context
WHERE {
    bind(<urn:Mary> as ?subject) .
    {?subject ?predicate ?object}
    ?ctx proof:explain (?subject ?predicate ?object) ;
         proof:rule ?rule ; proof:subject ?s ; proof:predicate ?p ; proof:object ?o ; proof:context ?context .
}
The first result for (Mary, type, Resource) is derived from the rdf1_rdfs4a_4b_2 rule from the OWL-Horst
ruleset which looks like:
Id: rdf1_rdfs4a_4b
x a y
-------------------------------
x <rdf:type> <rdfs:Resource>
a <rdf:type> <rdfs:Resource>
y <rdf:type> <rdfs:Resource>
Let’s further explore the inference engine by adding the data below into the same repository:
<urn:MerloGrape> a <urn:Grape> .
<urn:CabernetGrape> a <urn:Grape> .
<urn:MavrudGrape> a <urn:Grape> .
It is a simple beverage ontology that uses owl:hasValue, owl:someValuesFrom, and owl:intersectionOf to classify
instances based on the values of some of the ontology properties.
It contains:
• two colors: Red and White;
• classes WhiteThings and RedThings for the items related via the has:color property to the White and Red colors;
• classes Wine and Beer for the items related via has:component to instances of the Grape and Malt classes;
• several instances of Grape (MerloGrape, CabernetGrape, etc.) and Malt (PilsenerMalt, WheatMalt, etc.);
• classes RedWine and WhiteWine, declared as intersections of Wine with RedThings and of Wine with WhiteThings,
respectively;
• finally, an instance Merlo related via has:component to MerloGrape, and whose value for has:color is Red.
The expected inference is that Merlo is classified as RedWine because it is a member of both RedThings (since
has:color is related to Red) and Wine (since has:component points to an object that is a member of the class
Grape).
If we evaluate:
DESCRIBE <urn:Merlo>
As you can see, the inferencer correctly derived that Merlo is a member of RedWine.
Now, let’s see how it derived this.
First, we will add some helper JavaScript functions to combine the results in a more compact form as literals formed
by the local names of the IRI components in the statements. We already introduced the jsfn:lname() function in
the previous examples, so we can reuse it to create a stmt() function that concatenates several more items into a
convenient literal:
PREFIX jsfn:<http://www.ontotext.com/js#>
INSERT DATA {
[] jsfn:register '''
function stmt(s, p, o, c) {
return '('+lname(s)+', '+lname(p)+', '+lname(o)+(c?', '+lname(c):'')+')';
}
'''
}
We also need a way to refer to a BNode using its label because SPARQL always generates unique BNodes during
query evaluation:
PREFIX jsfn:<http://www.ontotext.com/js#>
INSERT DATA {
[] jsfn:register '''
function _bnode(value) {
return org.eclipse.rdf4j.model.impl.SimpleValueFactory.getInstance().createBNode(value);
}
'''
}
Now, let’s see how the (urn:Merlo rdf:type urn:RedWine) statement has been derived (note the use of the jsfn:stmt()
function in the projection of the query). The query uses a SUBSELECT to provide bindings for the ?subject, ?predicate,
and ?object variables as a convenient way to add more statements to be explained by the plugin later.
PREFIX jsfn:<http://www.ontotext.com/js#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
prefix proof: <http://www.ontotext.com/proof/>
SELECT(jsfn:stmt(?subject,?predicate,?object) as ?stmt) ?rule (jsfn:stmt(?s,?p,?o,?context) as ?premise)
WHERE {
{
SELECT ?subject ?predicate ?object {
VALUES (?subject ?predicate ?object) {
(<urn:Merlo> rdf:type <urn:RedWine>)
}
}
}
?ctx proof:explain (?subject ?predicate ?object) .
?ctx proof:rule ?rule .
?ctx proof:subject ?s .
?ctx proof:predicate ?p .
?ctx proof:object ?o .
?ctx proof:context ?context .
}
The first premise is explicit, and comes from the definition of the RedWine class, which is an owl:intersectionOf
of an RDF list (_:node2) that holds the classes forming the intersection. The second premise relates Merlo, via a
predicate called _allTypes, to the intersection's list node. The inference is derived after applying the
following rule:
Id: owl_typeByIntersect_1
a <onto:_allTypes> b
c <owl:intersectionOf> b
------------------------------------
a <rdf:type> c
The _allTypes premise is in turn produced by the owl_typeByIntersect_3 rule:
Id: owl_typeByIntersect_3
a <rdf:first> b
d <rdf:type> b
a <rdf:rest> c
d <onto:_allTypes> c
------------------------------------
d <onto:_allTypes> a
where we have two explicit and two inferred statements matching the premises (Merlo, _allTypes, _:node2)
and (Merlo, type, RedThing).
When we add to the list (Merlo, type, RedThing) first, the SUBSELECT is changed to:
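The extended SUBSELECT is not reproduced here; based on the query above, its VALUES block would presumably look like this sketch:

SELECT ?subject ?predicate ?object {
    VALUES (?subject ?predicate ?object) {
        (<urn:Merlo> rdf:type <urn:RedThing>)
        (<urn:Merlo> rdf:type <urn:RedWine>)
    }
}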
We see that (Merlo, type, RedThing) is derived by matching rule owl_typeByHasVal with all explicit premises:
Id: owl_typeByHasVal
a <owl:onProperty> b
a <owl:hasValue> c
d b c
------------------------------------
d <rdf:type> a
The statement (Merlo, _allTypes, _:node2) was derived by owl_typeByIntersect_2 and the only implicit
statement matching as premise is (Merlo, type, Wine).
The owl_typeByIntersect_2 rule looks like this:
Id: owl_typeByIntersect_2
a <rdf:first> b
a <rdf:rest> <rdf:nil>
c <rdf:type> b
------------------------------------
c <onto:_allTypes> a
The (Merlo, type, Wine) statement comes from rule owl_typeBySomeVal, where all premises matching it were explicit. The rule looks like:
Id: owl_typeBySomeVal
a <rdf:type> b
c <owl:onProperty> d
c <owl:someValuesFrom> b
e d a
------------------------------------
e <rdf:type> c
• (Merlo, type, Wine) was derived by rule owl_typeBySomeVal from explicit statements only.
• (Merlo, type, RedThing) was derived by rule owl_typeByHasVal from explicit statements.
• (Merlo, _allTypes, _:node2) was derived by rule owl_typeByIntersect_2 from a combination of explicit
statements and the inferred (Merlo, type, Wine).
• (Merlo, _allTypes, _:node1) was derived by rule owl_typeByIntersect_3 from a combination of explicit
statements and the inferred (Merlo, type, RedThing) and (Merlo, _allTypes, _:node2).
• And finally, (Merlo, type, RedWine) was derived by owl_typeByIntersect_1 from the explicit (RedWine,
intersectionOf, _:node1) and the inferred (Merlo, _allTypes, _:node1).
4.5.3 Provenance
The provenance plugin enables the generation of the inference closure of a specific named graph at query time.
This is useful when you want to trace which implicit statements are generated from a specific graph together
with the axiomatic triples that are part of the configured ruleset, i.e., the ones inserted with the special predicate
sys:schemaTransaction. For more information, check Reasoning.
By default, GraphDB's forward-chaining inferencer materializes all implicit statements in the default graph. Therefore,
it is impossible to trace which graphs these implicit statements are coming from. The provenance plugin
provides the opposite approach: with the configured ruleset, the reasoner does forward-chaining over a specific
named graph and generates all its implicit statements at query time.
Predicates
The plugin's predicates give you easy access to the graph whose implicit statements you want to generate.
The process is similar to RDF reification. All plugin predicates start with <http://www.ontotext.com/
provenance/>:
1. Check the startup log to validate that the plugin has started correctly.
[INFO ] 2016-11-18 19:47:19,134 [http-nio-7200-exec-2 c.o.t.s.i.PluginManager] Initializing plugin 'provenance'
2. In the Workbench SPARQL editor, add the following data as a schema transaction:
PREFIX ex: <http://example.com/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT data {
[] <http://www.ontotext.com/owlim/system#schemaTransaction> [] .
ex:BusStop a rdfs:Class .
ex:SkiResort a rdfs:Class .
ex:WebCam a rdfs:Class .
ex:OutdoorWebCam a rdfs:Class .
ex:Place a rdfs:Class .
    GRAPH ex:g3 {
        ex:busstop a ex:BusStop .
        ex:busstop ex:nearBy ex:skiresort .
    }
}
3. If we run the following query without using the plugin, it will return solutions over all statements that were
inferred during the loading of the data:
This will return one solution since only the data from ex:g1, ex:g2, and ex:g3 will be used, so
that the solution dependent on the data within ex:g1a will not be part of the result.
Note: During evaluation, the inferences and the data are kept inmemory, so the plugin should be used with
relatively small sets of statements placed in contexts.
4.6 Storage
GraphDB stores all of its data (statements, indexes, entity pool, etc.) in files in the configured storage directory,
usually called storage. The content and names of these files are not defined and are subject to change between
versions.
There are several types of indexes available, all of which apply to all triples, whether explicit or implicit. These
indexes are maintained automatically.
In general, the index structures used in GraphDB are chosen and optimized to allow for efficient:
• handling of billions of statements under reasonable RAM constraints;
• query optimization;
• transaction management.
GraphDB maintains two main indexes on statements for use in inference and query evaluation: the predicate-object-subject
(POS) index and the predicate-subject-object (PSO) index. There are many other additional data
structures that are used to enable the efficient manipulation of RDF data, but these are not listed, since these internal
mechanisms cannot be configured.
There are indexing options that offer considerable advantages for specific datasets, retrieval patterns and query
loads. Most of them are disabled by default, so you need to enable them as necessary.
Note: Unless stated otherwise, GraphDB allows you to switch indexes on and off against an already populated
repository. The repository must be shut down before the configuration change is made. The next time
the repository is started, GraphDB will create or remove the corresponding index. If the repository is already
loaded with a large volume of data, switching on a new index can lead to considerable delays during initialization
– this is the time required for building the new index.
Transaction control
Transaction support is exposed via RDF4J’s RepositoryConnection interface. The three methods of this interface
that give you control when updates are committed to the repository are as follows:
Method Effect
void begin() Begins a transaction. Subsequent changes effected through update operations will only
become permanent after commit() is called.
void commit() Commits all updates that have been performed through this connection since the last call
to begin().
void rollback() Rolls back all updates that have been performed through this connection since the last
call to begin().
GraphDB supports the so-called ‘read committed’ transaction isolation level, which is well known to relational
database management systems, i.e., pending updates are not visible to other connected users until the complete
update transaction has been committed. It guarantees that changes will not impact query evaluation before the
entire transaction they are part of is successfully committed. It does not guarantee that the execution of a single
transaction is performed against a single state of the data in the repository. Regarding concurrency:
• Update transactions are processed internally in sequence, i.e., GraphDB processes the commits one after
another.
• Update transactions do not block read requests in any way, i.e., hundreds of SPARQL queries can be evaluated
in parallel (the processing is properly multithreaded) while update transactions are being handled on
separate threads.
• Multiple update/modification/write transactions cannot be initiated and stay open simultaneously, i.e., when
a transaction has been initiated and has started to modify the underlying indexes, no other transaction is allowed
to change anything until the first one is either committed or rolled back.
Note: GraphDB performs materialization, ensuring that all statements that can be inferred from the current state
of the repository are indexed and persisted (except for those compressed due to the Optimization of owl:sameAs).
When the commit method is completed, all reasoning activities related to the changes in the data introduced by the
corresponding transaction will have already been performed.
Note: In GraphDB SE, the result of leading update operations in a transaction is visible to trailing ones. Due to
a limitation of the cluster protocol, this feature is not supported in the GraphDB cluster, i.e., an uncommitted
transaction will not affect the ‘view’ of the repository through any connection, including the connection used to
do the modification.
Predicate lists
Certain datasets and certain kinds of query activities, for example queries that use wildcard patterns for predicates,
benefit from another type of index called a ‘predicate list’, i.e.:
• subject-predicate (SP)
• object-predicate (OP)
This index maps from entities (subject or object) to their predicates. It is not switched on by default (see the
enablePredicateList configuration parameter), because it is not always necessary. Indeed, for most datasets and
query loads, the performance of GraphDB without such an index is good enough even with wildcard-predicate
queries, and the overhead of maintaining this index is not justified. You should consider using this index for
datasets that contain a very large number (greater than around 1,000) of different predicates.
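A typical wildcard-predicate pattern that benefits from the predicate list looks like the following sketch (the entity IRI is hypothetical):

# All properties of a single, known entity; with the predicate list enabled,
# the SP index maps the subject directly to its predicates.
SELECT ?p ?o
WHERE {
    <http://example.com/entity1> ?p ?o .
}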
Context index
The Context index can be used to speed up query evaluation when searching statements via their context identifier.
To switch ON or OFF the CPSO index, use the enable-context-index configuration parameter. The default value
is false.
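Queries that restrict statements to a particular named graph, as in the sketch below (the graph IRI is hypothetical), are the ones that can benefit from the CPSO index:

SELECT ?s ?p ?o
WHERE {
    GRAPH <http://example.com/graph1> {
        ?s ?p ?o .
    }
}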
Literal index
GraphDB automatically builds a literal index allowing faster lookups of numeric and date/time object values. The
index is used during query evaluation only if a query or a subquery (e.g., UNION) has a filter that is comprised of
a conjunction of literal constraints using comparisons and equality (not negation or inequality), e.g., FILTER(?x =
100 && ?y <= 5 && ?start > "2001-01-01"^^xsd:date).
Other patterns will not use the index, i.e., filters will not be rewritten into usable patterns.
For example, the following FILTER patterns will all make use of the literal index:
FILTER( ?x = 7 )
FILTER( 3 < ?x )
FILTER( ?x >= 3 && ?y <= 5 )
FILTER( ?x > "2001-01-01"^^xsd:date )
whereas the following FILTER patterns will not:
FILTER( ?x > (1 + 2) )
FILTER( ?x < 3 || ?x > 5 )
FILTER( (?x + 1) < 7 )
FILTER( ! (?x < 3) )
The decision of the query optimizer whether to make use of this index is statistics-based. If the estimated number
of matches for a filter constraint is large relative to the rest of the query, e.g., a constraint with a large or one-sided
range, then the index might not be used at all.
To disable this index during query evaluation, use the enable-literal-index configuration parameter. The default
value is true.
Note: Because of the way the literals are stored, the index will not work properly with dates far in the future or far
in the past (approximately 200,000,000 years), or with numbers beyond the range of 64-bit floating-point
representation (i.e., above approximately 1e309 and below approximately -1e309).
As already described, GraphDB applies the inference rules at load time in order to compute the full closure. Therefore,
a repository will contain some statements that are explicitly asserted and other statements that exist through
implication. In most cases, clients will not be concerned with the difference, however there are some scenarios
when it is useful to work with only explicit or only implicit statements. These two groups of statements can be
isolated during programmatic statement retrieval using the RDF4J API and during (SPARQL) query evaluation.
The usual technique for retrieving statements is to use the RepositoryConnection method:
RepositoryResult<Statement> getStatements(
Resource subj,
IRI pred,
Value obj,
boolean includeInferred,
Resource... contexts)
The method retrieves statements by ‘triple pattern’, where any or all of the subject, predicate and object parameters
can be null to indicate wildcards.
To retrieve explicit and implicit statements, the includeInferred parameter must be set to true. To retrieve only
explicit statements, the includeInferred parameter must be set to false.
However, the RDF4J API does not provide the means to enable only the retrieval of implicit statements. In order
to allow clients to do this, GraphDB allows the use of the special ‘implicit’ pseudograph with this API, which can
be passed as the context parameter.
The following example shows how to retrieve only implicit statements:
RepositoryResult<Statement> statements =
repositoryConnection.getStatements(
null, null, null, true,
SimpleValueFactory.getInstance().createIRI("http://www.ontotext.com/implicit"));
while (statements.hasNext()) {
Statement statement = statements.next();
// Process statement
}
statements.close();
The above example uses wildcards for subject, predicate and object and will therefore return all implicit statements
in the repository.
GraphDB also provides mechanisms to differentiate between explicit and implicit statements during query evalu
ation. This is achieved by associating statements with two pseudographs (explicit and implicit) and using special
system URIs to identify these graphs.
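For example, a query can be restricted to one of the pseudo-graphs with a FROM clause. A minimal sketch using the implicit pseudo-graph IRI shown in the example above (an analogous pseudo-graph for explicit statements is available as well):

# Count only the inferred statements by restricting the query to the implicit pseudo-graph
SELECT (COUNT(*) AS ?implicitCount)
FROM <http://www.ontotext.com/implicit>
WHERE {
    ?s ?p ?o .
}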
Query monitoring and termination can be done manually from the Workbench and automatically by configuring
GraphDB to abort queries after a certain query timeout is reached.
When there are running queries, their number is shown next to the Repositories dropdown menu.
To track and interrupt long running queries:
1. Go to Monitoring ‣ Queries or click the Running queries status next to the Repositories dropdown menu.
2. Press the Abort query button to stop a query.
To pause the current state of the running queries, use the Pause button. Note that this will not stop their execution
on the server.
Attribute Description
id The ID of the query
node Local or remote node repository ID
type The operation type QUERY or UPDATE
query The first 500 characters of the query string
lifetime The time in seconds since the iterator was created
state The internal state of the operation
You can also interrupt a query directly from the SPARQL Editor:
You can set a global query timeout period by adding a query-timeout configuration parameter. All queries will
stop after the number of seconds you have set in it, where a default value of 0 indicates no limit.
4.8 Virtualization
The data virtualization in GraphDB enables direct access to relational databases with SPARQL queries, which
eliminates the need to replicate data. The implementation exposes a virtual SPARQL endpoint, which translates
the queries to SQL using a declarative mapping. To achieve this functionality, GraphDB integrates with the
open-source Ontop project and extends it with multiple GraphDB-specific features.
The following SPARQL features are supported:
• SELECT and CONSTRUCT queries
• Default and named graph triple patterns
• Triple pattern combining: OPTIONAL, UNION, blank node path
• Result filtering and value bindings: FILTER, BIND, VALUES
• Projection modifiers: DISTINCT, LIMIT, ORDER BY
• Aggregates (GROUP BY, SUM, COUNT, AVG, MIN, MAX, GROUP_CONCAT)
• SPARQL functions (STR, IRI, LANG, REGEX)
• SPARQL data type support and their mapping to SQL types
• SUBQUERY
The most common scenario for using data virtualization is when the integrated data is highly dynamic or too big to
be replicated. For practical reasons, it is easier not to copy it and to accept the limitations of the underlying
information source, such as data quality, integrity, and the types of supported queries.
A second common scenario is to maintain a declarative mapping between the relational model and RDF, where the
user periodically dumps all statements and writes them to a native RDF database so it can support property paths
and faster data joins.
SELECT * WHERE {
?s ?p ?o .
}
3. A properties file with the JDBC configuration parameters, here with example values:
jdbc.url=<database-jdbc-driver-connection-string>
jdbc.driver=<database-jdbc-driver-class>
jdbc.user=<your-database-username>
jdbc.password=<your-database-password>
4. A repository config file of the following type, here again with example values (optional):
<#university-virtual> a rep:Repository;
rep:repositoryID "university-virtual";
rep:repositoryImpl [
<http://inf.unibz.it/krdb/obda/quest#obdaFile> "university.obda";
rep:repositoryType "graphdb:OntopRepository"
];
rdfs:label "Ontop virtual store with OBDA" .
that references the aforementioned OBDA (or R2RML), ontology, and properties files. This file
is automatically generated when creating a virtual repository through the Workbench, and is used
when creating such a repository via cURL command as described further below.
5. A database metadata file that provides information about uniqueness and non-null constraints for database
tables and views (optional).
These files are used to create a virtual repository in GraphDB, in which you can then query the relational database.
Let’s consider the following relational database containing university data.
It has tables describing students, academic staff, and courses, and two relation schemas (uni1 and uni2) with
many-to-many links between academic staff and courses and between students and courses. The descriptions below
are for the uni1 tables.
uni1.student
This table contains the local ID, first and last names of the students. The column s_id is a primary key.
uni1.academic
Similarly, this table contains the local ID, first and last names of the academic staff, but also information about
their position. The column a_id is a primary key.
The column position is populated with magic numbers:
• 1 -> Full Professor
• 2 -> Associate Professor
• 3 -> Assistant Professor
• 8 -> External Teacher
• 9 -> PostDoc
uni1.course
c_id title
1234 Linear Algebra
This table contains the local ID and the title of the courses. The column c_id is a primary key.
uni1.teaching
c_id a_id
1234 1
1234 2
This table contains the n:n (many-to-many) relation between courses and teachers. There is no primary key, but there
are two foreign keys to the tables uni1.course and uni1.academic.
uni1.courseregistration
c_id s_id
1234 1
1234 2
This table contains the n:n (many-to-many) relation between courses and students. There is no primary key, but there
are two foreign keys to the tables uni1.course and uni1.student.
JDBC driver
As mentioned above, in order to create a virtual repository in GraphDB, you need to first install a JDBC driver for
your respective relational database.
In the lib directory of the GraphDB distribution, create a subdirectory called jdbc and place the driver .jar file
there. In case you are using GraphDB from a native installation, the driver file name should be jdbc-driver.jar.
Note: The driver can also be placed in the lib directory; however, this requires a restart of GraphDB.
If you want to set a JDBC driver directory different from the lib/jdbc location, you can define it via the
graphdb.ontop.jdbc.path property in the conf/graphdb.properties file of the GraphDB distribution.
Configuration files
Before creating a virtual repository, you will need the following files (available for download below):
• a properties file with the JDBC configuration parameters
• an OBDA mapping file describing the mapping of SPARQL queries to SQL data
• an OWL ontology file describing the ontology of your data (optional)
• a DB metadata file providing information about uniqueness and non-null constraints for database tables
and views (optional)
1. When creating a repository from the Workbench, select the Ontop option.
2. GraphDB supports several database JDBC drivers. When creating an Ontop repository, the default setting
is Generic JDBC Driver. This means that you need to configure and upload your own JDBC properties file
(available as a template for download above).
3. In the fields for JDBC properties file and OBDA or R2RML file, upload the corresponding files. The Ontology,
Constraint, and DB metadata files are optional.
4. You can also test the connection to your SQL database with the button on the right.
5. Click Create.
Note: Once you have created an Ontop repository, its type cannot be changed.
For ease of use, GraphDB also supports drivers for five other commonly used databases integrated into the Ontop
framework: MySQL, PostgreSQL, Oracle, MS SQL Server, and DB2. Selecting one of them offers the advantage
of not having to configure the JDBC properties file yourself, as its Driver class and URL property values are
generated by GraphDB.
To use one of these database drivers:
1. Select the type of SQL database you want to use from the dropdown menu.
2. Download the corresponding driver by clicking the Download JDBC driver link on the right of the Driver
class field, place it in the lib directory of the GraphDB distribution, and restart GraphDB if it is running.
3. Fill in the required fields for each driver (Hostname, Database name, etc.).
4. Upload the OBDA/R2RML file. (The Ontology, Constraint, and DB metadata files are optional, just as with
the generic JDBC driver)
5. You can also test the connection to your SQL database with the button on the right.
6. Click Create.
To create a virtual repository via the API, you need the following files described above, all placed in the same
directory (here, we are using the universities examples again):
• repo-config.ttl: the config file for the repository
• university.properties: the JDBC properties file
• university.obda: the OBDA/R2RML file
As mentioned earlier, the OWL ontology, constraint, and DB metadata files are optional.
Execute the following cURL command (here including the DB metadata file):
You will see the newly created repository under Setup ‣ Repositories in the GraphDB Workbench.
The underlying Ontop engine supports two mapping languages. The first one is the official W3C RDB2RDF
mapping language known as R2RML, which provides excellent interoperability between the various tools. The
second one is the native Ontop mapping known as OBDA, which is much shorter and easier to learn, and supports
an automatic bidirectional transformation to R2RML.
Mappings represent OWL assertions: one set of OWL assertions is produced for each result row returned by the SQL
query in the mapping. The assertions are obtained by replacing the placeholders with the values from the
relational database.
Mappings consist of:
• source: a SQL query that retrieves some data from the database
• target: a form of template that indicates how to generate OWL assertions in a Turtle-like syntax.
All examples in this documentation use the internal OBDA mapping language.
Let’s map the uni1.student table using an OBDA template.
The information source is the following:
SELECT *
FROM "uni1"."student"
and the target is:
ex:uni1/student/{s_id} a :Student ;
    foaf:firstName {first_name}^^xsd:string ;
    foaf:lastName {last_name}^^xsd:string .
The target part is described using a Turtle-like syntax, while the source part is a regular SQL query.
We used the primary key s_id to create the URI. This practice enables Ontop to remove self-joins, which is very
important for optimizing the query performance.
This entry could be split into three mapping assertions:
ex:uni1/student/{s_id} a :Student .
ex:uni1/student/{s_id} foaf:firstName {first_name}^^xsd:string .
ex:uni1/student/{s_id} foaf:lastName {last_name}^^xsd:string .
The uni1.course table can be mapped in the same way. The source is:
SELECT *
FROM "uni1"."course"
and the target is:
ex:uni1/course/{c_id} a :Course ;
    :title {title} ;
    :isGivenAt ex:uni1/university .
Below are some examples of the SPARQL queries that are supported in a GraphDB virtual repository.
1. Return the IDs of all persons that are faculty members:
SELECT ?p
WHERE {
?p a voc:FacultyMember .
}
2. Return the IDs of all full Professors together with their first and last names (see the sketch after this list):
3. Return all Associate Professors, Assistant Professors, and Full Professors with their last names and first
name if available, and the title of the course they are teaching:
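A hedged sketch of query 2, assuming the ontology declares a voc:FullProfessor class (by analogy with voc:FacultyMember) and reuses the foaf name properties from the mappings shown earlier; prefix declarations for voc: are omitted, as in query 1:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?p ?firstName ?lastName
WHERE {
    ?p a voc:FullProfessor ;          # assumed class name
       foaf:firstName ?firstName ;
       foaf:lastName ?lastName .
}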
GraphDB also supports querying the virtual read-only repositories using the highly efficient Internal SPARQL
federation.
Its usage is the same as with the internal federation of regular repositories. Instead of providing a URL to a remote
repository, you need to provide a special URL of the form repository:NNN, where NNN is the ID of the virtual
repository you want to access.
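For example, a query executed against a regular repository can pull data from the virtual repository through a SERVICE clause. A minimal sketch, assuming a virtual repository named ontop_repo and the foaf:firstName property from the mappings above:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?teacher ?firstName
WHERE {
    # repository:ontop_repo refers to the virtual repository by its ID
    SERVICE <repository:ontop_repo> {
        ?teacher foaf:firstName ?firstName .
    }
}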
Let’s see how this works with our university database example.
1. Create a new, empty RDF repository called university-rdf.
2. From the ontop_repo virtual repository with university data, insert some data into the new, empty university-
rdf repository: teachers with their first and last names who give courses that are not held at university2:
3. To observe the results, again in the university-rdf repository, execute the following query that will return
the teachers that were inserted with their first and last name:
Result:
4. Then:
• get the teachers from the virtual repository that teach courses in an institution that is not university2
• merge the result of that with the RDF repository by getting the firstName and lastName of those teachers
• the IDs of the teachers are the common property for both repositories which makes the selection possible.
For the purposes of our demonstration, this query filters them by firstName that contains the letter “a”.
Result:
4.8.7 Limitations
Data virtualization also comes with certain limitations due to the distributed nature of the data. In this sense, it
works best for information that requires little or no integration. For instance, if in databases X and Y, we have two
instances of the person John Smith, which do not share a unique key or other exact match attributes like “John
Smith” and “John E. Smith”, it will be quite inefficient to match the records at runtime.
One potential drawback is also the type of supported queries. If the underlying storage has no indexes, it will be
slow to answer queries such as “tell me how resource X connects to resource Y”.
The number of stacked data sources also significantly affects the efficiency of data retrieval.
Lastly, it is not possible to efficiently support autosuggest/autocomplete indexes, graph traversals, or inferencing.
4.9.1 Overview
In addition to the standard SPARQL 1.1 Federation to other SPARQL endpoints and the internal SPARQL federation
to other repositories in the same database instance, GraphDB also supports FedX – the federation engine of
the RDF4J framework, a data partitioning technology that provides transparent federation of multiple SPARQL
endpoints under a single virtual endpoint.
In the context of the growing need for scalability of RDF technology and sophisticated optimization techniques for
querying linked data, it is a useful framework that allows efficient SPARQL query processing on heterogeneous,
virtually integrated linked data sources. With it, explicitly addressing specific endpoints using SERVICE clauses is
no longer necessary – instead, FedX offers novel join processing and grouping techniques to minimize the number
of remote requests by automatically selecting the relevant sources, sending statement patterns to these sources
for evaluation, and joining the individual results. It extends the RDF4J framework (formerly Sesame) with a federation
layer and is incorporated in it as a SAIL (Storage and Inference Layer).
Note: Please keep in mind that the GraphDB FedX federation is currently an experimental feature.
4.9.2 Features
In the following sections, we will demonstrate how semantic technology in the context of FedX federation lowers
the cost of searching and analyzing data sources by implementing a two-step integration process: (1) mapping any
dataset to an ontology and (2) using GraphDB to access the data. With this integration methodology, the cost of
extending the number of supported sources remains linear, unlike with classic data warehousing approaches.
The first type of use case that we will look at is creating a unifying repository where we can query data from
multiple linked data sources regardless of their location, such as DBpedia and Wikidata. In such cases, there is
often a significant overlap between the schemas, i.e., predicates or types are frequently repeated across the different
sources.
Note: Keep in mind that bnodes are not supported between FedX members.
Before we start exploring, let’s first create a federation between the DBpedia and Wikidata endpoints.
1. Create a FedX repository via Setup ‣ Repositories ‣ Create new repository ‣ FedX Virtual SPARQL.
2. In the configuration screen that you are taken to, click Add remote repository.
3. From the endpoint options in the dialog, select Generic SPARQL endpoint.
4. For the DBpedia Endpoint URL, enter https://dbpedia.org/sparql.
5. Unselect the Supports ASK queries option, as this differs from endpoint to endpoint.
6. Repeat the same steps for the Wikidata endpoint URL – https://query.wikidata.org/sparql.
Once the federation is created, we can query it. The following query, for example, returns house cats with their English labels:
SELECT * WHERE
{
?item wdt:P31 wd:Q146.
?item rdfs:label ?label.
FILTER (LANG(?label) = 'en')
}
Here, we have used two Wikidata predicates: wdt:P31 that stands for “instance of” and wd:Q146 that stands for
“house cat”.
These will be the first 15 house cats returned:
The following query retrieves all statements in which a specific Wikidata entity appears as either subject or object, letting us explore a single resource through the federation:
CONSTRUCT {
?s ?p ?o
} WHERE
{
{
BIND(<http://www.wikidata.org/entity/Q378619> as ?s)
?s ?p ?o.
} UNION {
BIND(<http://www.wikidata.org/entity/Q378619> as ?o)
?s ?p ?o.
}
}
The next query gathers the products of a company from both datasets: all products of the company from DBpedia, and, via the owl:sameAs link, all products related to it in Wikidata through wdt:P176:
CONSTRUCT {
?dbpCompany ?p ?o .
?wdCompany ?p1 ?o .
?dbpCompany owl:sameAs ?wdCompany .
} WHERE {
BIND( dbr:Amazon_\(company\) as ?dbpCompany)
{
# Get all products from DBpedia
?dbpCompany dbo:product ?o .
?dbpCompany ?p ?o .
} UNION {
# Get all products from Wikidata
?dbpCompany owl:sameAs ?wdCompany .
?o wdt:P176 ?wdCompany .
?o ?p1 ?wdCompany .
}
}
As we saw in the previous example, we can explore a specific resource from both endpoints. Now, let’s see how to
create an advanced graph configuration for a query, with which we will then be able to explore any resource that
we input.
With the following steps, create a graph config query for all companies and all products in both datasets:
1. Go to Explore ‣ Visual graph.
2. From Advanced graph configurations, select Create graph config.
3. Enter a name for your graph.
4. The default Starting point is Start with a search box. In it, select the Graph expansion tab and enter the
following query:
CONSTRUCT{
?node ?p1 ?o .
?s ?p ?o .
?node owl:sameAs ?s .
} WHERE {
{
?node dbo:product ?o .
?node ?p1 ?o .
} UNION {
?node owl:sameAs ?s .
?o wdt:P176 ?s .
?o ?p ?s .
}
}
The two databases are connected through the owl:sameAs predicate. The DBpedia property
dbo:product corresponds to the wdt:P176 property in Wikidata.
5. Since Wikidata shows information in a less readable way, we can clear it up a little by pulling the node labels.
To do so, go to the Node basics tab and enter the query:
SELECT ?label {
{
?node rdfs:label | skos:prefLabel ?label.
FILTER (lang(?label) = 'en')
}
}
Note: The ?node variable is required and will be replaced with the IRI of the node that we are
exploring.
6. Click Save. You will be taken back to the Visual graph screen where the newly created graph configuration
is now visible.
7. Now let’s explore the information about the nodes as visual graphs mapped in both data sources. Click on
the name of the graph and in the search field that opens, enter the DBpedia resource http://dbpedia.org/
resource/Amazon_(company) and click Show.
On the left are the DBpedia resources related to Amazon, and on the right the Wikidata ones.
Note: Some SPARQL endpoints with implementation other than GraphDB may enforce limitations that could
result in some features of the GraphDB FedX repository not working as expected. One such example is the class
hierarchy that may send big queries and not work with https://dbpedia.org/sparql, which has a query length
limit.
The second type of scenario demonstrates how to create a federated repository over two local repositories – a local
native and an Ontop one. We will divide a dataset between them and then explore the relationships.
We will be using segments of two public datasets:
• The native GraphDB repository uses data from the acquisitions.csv, ipos.csv, and objects.csv files
of the Startup investments dataset, a Crunchbase snapshot data source that contains metadata about companies
and investors. It tracks the evolution of startups into multi-billion corporations. The data has been
RDFized using Ontotext Refine.
– The acquisitions file contains M&A deals between companies, listing all buyers and acquired companies
and the date of the deal.
– The objects file contains details about the companies, such as ID, name, country, etc.
– The ipo file contains data about companies' IPOs.
• The Ontop repository uses the prices.csv file of the NYSE dataset, a data source listing the opening/closing
stock price and traded volumes on the New York Stock Exchange. The file lists stock symbols
and opening/closing stock market prices for particular dates. Most data span from 2010 to the end of 2016, and
for companies new on the stock market, the date range is shorter.
1. To set up the native GraphDB repository:
a. Create a new repository.
b. Download the ipo.nq, acquisitions.ttl and objects.ttl files.
c. Load them into the repository via Import ‣ User data ‣ Upload RDF files.
2. To set up the Ontop repository:
The first two triples represent the acquiring and the acquired company. The “USA” literal specifies that the buyer
company is based there. The target company has to be European. The country of each company is represented by
a country code. To get only the European companies that have been acquired, a filter is used that checks if a given
country’s code is among the listed ones.
The first 15 returned results look like this:
Scenario 2: List European companies acquired by US companies where the stock market price of the buyer
company has increased on the date of the M&A deal.
This query is run against the Crunchbase and the NYSE datasets and is similar to the one above, but with one
additional condition – that on the day of the deal, the stock price of the buying company has increased. This
means that when the stock market closed, that price was higher than when the market opened. Since the M&A
deals data are in the Crunchbase dataset and the stock prices data in the NYSE dataset, we will join them on the
stockSymbol field, which is present in both datasets, and the IPO of the buyer company.
We also make sure that the date of the M&A deal (from Crunchbase) is the same as the date for which we retrieve
the opening and closing stock prices (from NYSE). In the SELECT clause, we include only the names of the buyer
and seller companies. The opening and closing prices are chosen for a particular date and stock symbol.
When creating a FedX repository with local members, we can specify whether the FedX repo should respect the
security rights of the member repositories.
1. First, we will create two repositories, “Sofia” and “London”, in which we will insert some statements from
factforge.net:
a. Create a repository called Sofia.
b. Go to the FactForge SPARQL editor and execute:
CONSTRUCT WHERE {
?s ?p <http://dbpedia.org/resource/Sofia> .
} LIMIT 20
SELECT * WHERE {
?s ?p ?o .
}
4. We can observe that all statements for both Sofia and London are returned as results (here ordered by subject
in alphabetical order so as to show results for both):
5. Now, to see how this works with GraphDB security enabled, go to Setup ‣ Users and Access, and set Security
to ON.
6. From the same page, create a new user called “sofia” with read rights for the “Sofia” and the FedX
repositories:
7. From Setup ‣ Repositories, click the edit icon of the FedX repository to enter its configuration.
8. Click the edit icon of either of the “Sofia” or “London” member repositories. This will open a security
setting dialog where you can see that the default setting of each member is to respect the repository’s access
rights, meaning that if a user has no rights to this repository, they will see a federated view that does not
include results from it.
SELECT * WHERE {
?s ?p ?o .
}
We can see that only results for the Sofia repository are shown, because the current user has no
access to the London repository and the FedX repository is instructed to respect the rights for it.
11. Log out from the “sofia” user and log back in as admin.
12. Open the edit screen of the FedX repository and set the security of both its members to ignore the reposi
tory’s access rights. This means that in the federation, users will see results from the respective repository
regardless of their access rights for it.
13. After editing the Sofia and London repositories this way, Save the changes in the FedX repository.
14. Log out as admin and log in as user “sofia”.
15. In the SPARQL editor, execute:
SELECT * WHERE {
?s ?p ?o .
}
16. We will see that the returned results include statements from both the “Sofia” and the “London” members
of the federated repository.
GraphDB supports configuration of basic authentication when attaching a remote endpoint. Let’s see how this
works with the following example:
1. Run a second GraphDB instance on localhost:7201. The easiest way to do this is to:
• Make a copy of your GraphDB distribution.
• Run it with graphdb -Dgraphdb.connector.port=7201.
2. In it, create a repository called “remote-repo-paris” with enabled security and default admin user, i.e.,
username: “admin”, password: “root”.
3. Go to the FactForge SPARQL editor and execute:
CONSTRUCT WHERE {
?s ?p <http://dbpedia.org/resource/Paris> .
} LIMIT 20
4. Download the results as a Turtle file and import them into “remote-repo-paris”.
5. Go to the first GraphDB instance on port 7200 and open the “fedx-sofia-london” repository that we created
earlier. It already has two members, “Sofia” and “London”.
6. In it, include as member the “remote-repo-paris” repository we just created:
a. Select the GraphDB/RDF4J server option.
b. As Server URL, enter the URL of the remote repository http://localhost:7201/.
c. Repository ID is the name of the remote repo remote-repo-paris.
d. Authentication credentials are the user and password for the remote repo.
e. Add.
SELECT * WHERE {
?s ?p ?o .
}
We see that all the Paris data from the remote endpoint are available in our FedX repository.
The context is the same as in the previous scenario – two running GraphDB instances, with the second one secured.
The difference is that when the remote repository is a known location, we can configure its security credentials
when adding it as a location instead of when adding it as a remote FedX member. Let’s see how to do it.
1. Start the same way as in the example above:
• Run a second GraphDB instance on localhost:7201.
• In it, create a repository called “remote-repo-paris” with enabled security and default admin user, i.e.,
username: “admin”, password: “root”.
• Import the Paris data in it.
2. In the first GraphDB instance on port 7200, attach “remote-repo-paris” as a remote location following these
steps. For Authentication type, select Basic auth, and input the credentials.
3. Again in the 7200 GraphDB instance, open the edit view of the “fedx-sofia-london” repository.
4. In it, include as member the “remote-repo-paris” repository from the 7201 port. Note that this time, we are not inputting
the security credentials.
SELECT * WHERE {
?s ?p ?o .
}
Again, we see that all the Paris data from the remote location are available in the FedX repository.
Hint: You can configure signature authentication for remote endpoints in the same way.
When configuring a FedX repository, several configuration options (described in detail below) can be set:
• Left join worker threads: The (maximum) number of left join worker threads used in the
ControlledWorkerScheduler for left join operations. Sets the number of threads that can work in parallel
evaluating a query with OPTIONAL. Default is 10.
• Union worker threads: The (maximum) number of union worker threads used in the
ControlledWorkerScheduler for union operations. Sets the number of threads that can work in parallel
evaluating a query with UNION. Default is 20.
• Source selection cache spec: Parameters should be passed as key1=value1,key2=value2,...
in order to be parsed correctly.
Parameters that can be passed:
– recordStats (boolean)
– initialCapacity (int)
– maximumSize (long)
– maximumWeight (long)
– concurrencyLevel (int)
– refreshDuration (long)
– expireAfterWrite (TimeUnit/long)
– expireAfterAccess (TimeUnit/long)
– refreshAfterWrite (TimeUnit/long)
4.9.5 Limitations
Some limitations of the current implementation of the GraphDB FedX federation are:
• DESCRIBE queries are not supported.
• FedX is not stable with queries of the type {?s ?p ?o} UNION {?s ?p1 ?o} FILTER (xxx).
• Currently, the federation only works with remote repositories, i.e., everything goes through HTTP, which is
slower compared to direct access to local repositories.
• Queries with a Cartesian product or cyclic connections are not stable due to connections that are still open
and to blocked threads.
• There is a small possibility of threads being blocked on complex queries due to implementation flaws in the
parallelization.
Note: Currently, only one import task of a type is executed at a time, while the others wait in the queue as pending.
Note: For local repositories, we support interruption and additional settings, since the parsing is done by the
Workbench. When the location is a remote one, you just send the data to the remote endpoint, and the parsing and
loading are performed there.
If you have many files, a file name filter is available to narrow the list down.
The settings for each import are saved so that you can use them, in case you want to reimport a file. You can see
them in the dialog that opens after you have uploaded a document and press Import:
• Base IRI: specifies the base IRI against which to resolve any relative IRIs found in the uploaded data. When
the data does not contain relative IRIs, this field may be left empty.
• Target graphs: when specified, imports the data into one or more graphs. Some RDF formats may specify
graphs, while others do not support that. The latter are treated as if they specify the default graph.
– From data: imports data into the graph(s) specified by the data source.
– The default graph: imports all data into the default graph.
– Named graph: imports everything into a user-specified named graph.
• Enable replacement of existing data: enable this to replace the data in one or more graphs with the imported
data. When enabled:
– Replaced graph(s): all specified graphs will be cleared before the import is run. If a graph ends in *,
it will be treated as a prefix matching all named graphs starting with that prefix excluding the *. This
option provides the most flexibility when the target graphs are determined from data.
– I understand that data in the replaced graphs will be cleared before importing new data: this option
must be checked when the data replacement is enabled.
Advanced settings:
• Preserve BNode IDs: assigns its own internal blank node identifiers or uses the blank node IDs found in
the file.
• Fail parsing if datatypes are not recognized: determines whether to fail parsing if datatypes are unknown.
• Verify recognized datatypes: verifies that the values of the datatype properties in the file are valid.
• Normalize recognized datatypes values: indicates whether recognized datatypes need to have their values
normalized.
• Fail parsing if languages are not recognized: determines whether to fail parsing if languages are unknown.
• Verify language based on a given set of definitions for valid languages: determines whether language tags
are to be verified.
• Normalize recognized language tags: indicates whether languages need to be normalized, and to which
format they should be normalized.
• Should stop on error: determines whether to ignore non-fatal errors.
• Force serial pipeline: enforces the use of the serial pipeline when importing data.
Note: Import without changing settings will import selected files or folders using their saved settings or default
ones.
Note: The limitation of this method is that it supports files of a limited size. The default is 200 megabytes, and
is controlled by the graphdb.workbench.maxUploadSize property. The value is in bytes
(-Dgraphdb.workbench.maxUploadSize=20971520).
Loading data from your local machine directly streams the file to the RDF4J’s statements endpoint:
1. Click the button to browse files for uploading.
2. When the files appear in the table, either import a file by clicking Import on its line, or select multiple files
and click Import from the header.
3. The import settings modal appears, just in case you want to add additional settings.
If the URL has an extension, it is used to detect the correct data type (e.g., http://linkedlifedata.com/resource/
umlsconcept/C0024117.rdf). Otherwise, you have to provide the Data Format parameter, which is sent as Accept
header to the endpoint and then to the import loader.
You can also insert triples into a graph with an INSERT query in the SPARQL editor.
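A minimal sketch of such an INSERT, using hypothetical IRIs:

PREFIX ex: <http://example.com/>

INSERT DATA {
    GRAPH ex:myGraph {
        ex:subject1 ex:predicate1 "object value" .
    }
}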
ImportRDF is a tool designed for offline loading of datasets. It cannot be used against a running server. The rationale
for an offline tool is to achieve optimal performance for loading large amounts of RDF data by directly serializing
them into GraphDB’s internal indexes and producing a ready-to-use repository.
The ImportRDF tool resides in the bin folder of the GraphDB distribution. It loads data into a new repository
created from the Workbench or from the standard configuration Turtle file found in configs/templates, or into an
existing repository. In the latter case, the repository data is automatically overwritten.
Note: Before using the below methods, make sure you have set up a valid GraphDB license.
Important: The ImportRDF tool cannot be used in a cluster setup as it would break the cluster consistency.
The ImportRDF tool supports two subcommands, Load and Preload (supported as separate commands in
GraphDB versions 9.x and older).
Despite the many similarities between Load and Preload, such as the fact that both commands do parallel offline
transformation of RDF files into a GraphDB image, there are also substantial differences in their implementation.
Load uses an algorithm very similar to online data loading. As the data variety grows, the loading speed starts to
drop because of page splits and tree rebalancing. After a continuous data load, the disk image becomes
fragmented in the same way as it would if the RDF files were imported into the engine.
Preload eliminates the performance drop by implementing a two-phase load. In the first phase, all RDF statements
are processed in memory in chunks, which are later flushed to disk as many GraphDB images. Then, all sorted
chunks are merged into a single non-fragmented repository image with a merge join algorithm. Thus, the Preload
subcommand requires almost twice as much disk space to complete the import.
Preload does not perform inference on the data.
Warning: During the bulk load, the GraphDB plugins are ignored in order to speed up the process. Afterwards,
when the server is started, the plugin data can be rebuilt.
Note: The ImportRDF Tool supports various RDF formats, .zip and .gz files, and directories.
There are two ways for loading data with the ImportRDF tool:
1. Configure the ImportRDF repository location by setting the property graphdb.home.data in
conf/graphdb.properties. If no property is set, the default repository location will be the data directory of
the GraphDB distribution.
2. Start GraphDB.
3. In a browser, open the Workbench web application at http://localhost:7200. If necessary, substitute local-
host and the 7200 port number as appropriate.
4. Go to Setup ‣ Repositories.
5. Create and configure a repository.
6. Stop GraphDB.
7. Start the bulk load with the following command:
8. Start GraphDB.
1. Stop GraphDB.
2. Configure the ImportRDF repository location by setting the property graphdb.home.data in
conf/graphdb.properties. If no property is set, the default repository location will be the data directory of
the GraphDB distribution.
3. Create a configuration file.
4. Start the bulk load with the following command:
5. Start GraphDB.
This is an example configuration template using a minimal parameters set. You can add more optional parameters
from the configs/templates example:
[] a rep:Repository ;
    rep:repositoryID "repo-test-1" ;
    rdfs:label "My first test repo" ;
    rep:repositoryImpl [
        rep:repositoryType "graphdb:SailRepository" ;
        sr:sailImpl [
            sail:sailType "graphdb:Sail" ;
            # ruleset to use
            graphdb:ruleset "empty" ;
        ]
    ] .
The ImportRDF tool accepts Java command line options using -D. Supply them before the subcommand as follows:
$ <graphdb-dist>/bin/importrdf -Dgraphdb.inference.concurrency=6 load -c <repo-config.ttl> -m parallel <RDF data file(s)>
The following options are used to finetune the behavior of the Load subcommand:
• -Dgraphdb.inference.buffer: the buffer size (the number of statements) for each stage. Defaults to
200,000 statements. You can use this parameter to tune the memory usage and the overhead of inserting
data:
– a smaller buffer size reduces the memory required;
– a bigger buffer size reduces the overhead, as the operations performed by threads have a lower probability
of waiting for the operations on which they rely, and the CPU is intensively used most of the time.
• -Dgraphdb.inference.concurrency: the number of inference threads in parallel mode. The default value
is the number of cores of the machine processor. A bigger pool theoretically means faster load if there are
enough unoccupied cores and the inference does not wait for the other load stages to complete.
The Preload subcommand accepts the following options to finetune its operation:
• --chunk: the size of the inmemory buffer to sort RDF statements before flushing it to the disk. A bigger
chunk consumes additional RAM and reduces the number of chunks to merge. We recommend the default
value of 20 million for datasets of up to 20 billion RDF statements.
• --iterator-cache: the number of triples to cache from each chunk during the merge phase. A bigger value
is likely to eliminate the I/O wait time at the cost of more RAM. We recommend the default value of 64,000
for datasets of up to 20 billion RDF statements.
• --parsing-tasks: the number of parsing tasks controls how many parallel threads parse the input files.
• --queue-folder: the parameter controls the file system location, where all temporary chunks are stored.
The loading of a huge dataset is a long batch process, and every run may take many hours. Preload supports
resuming the process if something goes wrong (insufficient disk space, out of memory, etc.) and the loading is
terminated abnormally. In this case, the data processing will restart from an intermediate restore point instead of from the
beginning. The data collected for the restore points is sufficient to initialize all internal components correctly and
to continue the load normally from that moment, thus saving time. The following options can be used to configure
data resuming:
• --interval: sets the recovery point interval in seconds. The default is 3,600s (60min).
• --restart: if set to true, the loading will start from the beginning, ignoring an existing recovery point. The
default is false.
Updating data in GraphDB is done via smart updates using serverside SPARQL templates.
5.3.1 Overview
Updating the content of RDF documents can generally be tricky due to the nature of RDF – there is no fixed schema or
standard notion for the management of multi-document graphs. There are two widely employed strategies when it
comes to managing RDF documents – storing each RDF document in a single named graph vs. storing each RDF
document as a collection of triples where multiple RDF documents exist in the same graph.
A single RDF document per named graph is easy to update – you can simply replace the content of the named
graph with the updated document, and GraphDB provides an optimization to do that efficiently. However, when
there are multiple documents in a graph and a single document needs to be updated, the old content of the document
must be removed first. This is typically done using a hand-crafted SPARQL update that deletes only the triples that
define the document. This update needs to be the same on every client that updates data in order to get consistent
behavior across the system.
GraphDB solves this by enabling smart updates using server-side SPARQL templates. Each template corresponds
to a single document type, and defines the SPARQL update that needs to be executed in order to remove the
previous content of the document.
To initiate a smart update, the user provides the IRI identifying the template (i.e., the document type) and the IRI
identifying the document. The new content of the document is then simply added to the database in any of the
supported ways – replace graph, SPARQL INSERT, add statements, etc.
Replace graph
A document (the smallest update unit) is defined as the contents of a named graph. Thus, to perform an update,
you need to provide the following information:
• The IRI of the named graph – the document ID
• The new RDF contents of the named graph – the document contents
DELETE/INSERT template
A document is defined as all triples for a given document identifier according to a predefined schema. The schema
is described as a SPARQL DELETE/INSERT template that can be filled from the provided data at update time.
The following must be present at update time:
• The SPARQL template update (must be predefined, not provided at update time)
– It can be a DELETE WHERE update that only deletes the previous version of the document; the new data is then inserted as is.
– It can be a DELETE INSERT WHERE update that deletes the previous version of the document and adds additional triples, e.g. timestamp information.
• The IRI of the updated document
• The new RDF contents of the updated document
The transport mechanism defines how users send RDF update data to GraphDB. Two mechanisms are supported
– direct access and indirect access via the Kafka Sink connector.
Direct access
Direct access is a direct connection to GraphDB using the RDF4J API as well as any GraphDB extensions to that
API, e.g. using SPARQL, deleting/adding individual triples, etc.
Replace graph
When a replace graph smart update is sent directly to GraphDB, the user does not need to do anything special – e.g., a simple CLEAR GRAPH followed by an INSERT into the same graph is sufficient, as sketched below.
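A minimal sketch of such a direct replace graph update, assuming a hypothetical document graph <http://example.com/documents/doc1>:
# Remove the old content of the document graph, then add the new content
CLEAR GRAPH <http://example.com/documents/doc1>;
INSERT DATA {
    GRAPH <http://example.com/documents/doc1> {
        <http://example.com/documents/doc1> <http://purl.org/dc/terms/title> "Updated document title" .
    }
}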
DELETE/INSERT template
Unlike replace graph, this update mechanism needs a predefined SPARQL template that can be referenced at update
time. Once a template has been defined, the user can request its use by inserting a system triple.
Let’s see how such a template can be used.
1. Create a repository.
2. In the SPARQL editor, add the following data about two employees in a factory and their salaries:
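The data itself is not reproduced above; a minimal sketch consistent with the factory vocabulary used by the template in the next steps (the specific IRIs and salary values are illustrative assumptions):
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX factory: <http://factory/>
INSERT DATA {
    # A factory and two employees with salaries (illustrative values)
    factory:myFactory rdf:type factory:Factory .
    factory:John factory:worksIn factory:myFactory ;
                 factory:hasSalary 10000 .
    factory:Mary factory:worksIn factory:myFactory ;
                 factory:hasSalary 12000 .
}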
3. Check the inserted data by running:
SELECT * WHERE {
    <http://factory/John> ?p ?o .
}
4. Again in the SPARQL editor, create and execute the following template:
INSERT DATA {
<http://example.com/my-template> <http://www.ontotext.com/sparql/template> '''
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX factory: <http://factory/>
DELETE {
?worker factory:hasSalary ?oldSalary .
} INSERT {
?id factory:updatedOn ?now
} WHERE {
?id rdf:type factory:Factory .
?worker factory:worksIn ?id .
?worker factory:hasSalary ?oldSalary .
BIND(now() as ?now)
}
'''
}
5. Next, we execute a smart update to the RDF data, changing the employees’ salaries:
6. Now let’s see how the data has changed. Run again:
SELECT * WHERE {
<http://factory/John> ?p ?o .
}
In this mode, the user pushes update messages to Kafka, and the Kafka Sink connector applies the updates to GraphDB. Users and consumers must agree on the following:
• A given Kafka topic is configured to accept RDF updates in a predefined update type and format.
• The types of updates that can be performed are: replace graph, DELETE/INSERT template, or simple add.
• The format of the data must be one of the supported RDF formats.
For more details, see Kafka Sink connector.
Updates are performed as follows:
Replace graph
DELETE/INSERT template
Simple add
The built-in SPARQL template plugin enables you to create predefined SPARQL templates that can be used for smart updates to the repository data. These operations behave exactly like operations on any other RDF data.
The plugin is defined with the special predicate <http://www.ontotext.com/sparql/template>.
You can create and execute SPARQL templates in the Workbench from both the SPARQL editor and from the
SPARQL Templates editor.
Create template
We will use the template from the example above. Execute:
INSERT DATA {
<http://example.com/my-template> <http://www.ontotext.com/sparql/template> '''
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX factory: <http://factory/>
DELETE {
?worker factory:hasSalary ?oldSalary .
} INSERT {
?id factory:updatedOn ?now
} WHERE {
?id rdf:type factory:Factory .
?worker factory:worksIn ?id .
?worker factory:hasSalary ?oldSalary .
bind(now() as ?now)
}
'''
}
You can retrieve the available templates and their content by querying the template predicate:
SELECT ?template {
<http://example.com/my-template> <http://www.ontotext.com/sparql/template> ?template
}
"
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX factory: <http://factory/>
DELETE {
?worker factory:hasSalary ?oldSalary .
} INSERT {
?id factory:updatedOn ?now
} WHERE {
?id rdf:type factory:Factory .
?worker factory:worksIn ?id .
?worker factory:hasSalary ?oldSalary .
bind(now() as ?now)
}
"
This will list the IDs of the available templates, in our case http://example.com/my-template, and their content.
Update template
We can also update the content of the template with the same update operation from earlier:
Delete template
DELETE WHERE {
<http://example.com/my-template> <http://www.ontotext.com/sparql/template> ?template
}
For ease of use, the GraphDB Workbench also offers a separate menu tab where you can define your templates.
1. Go to Setup ‣ SPARQL Templates ‣ Create new SPARQL template. A default example template will open.
2. The template ID is required and must be an IRI. We will use the example from earlier: http://example.com/my-template.
If you enter an invalid IRI, the SPARQL template editor will warn you of it.
3. The template body contains a default template. Replace it with:
This template can be used for smart updates to the RDF data as shown above.
4. Save the template. It will now be visible in the list with created templates where you can also edit or delete
it.
In some cases, you may want to execute arbitrary SPARQL updates where the concrete values are supplied at execution time – storing not the variables themselves but rather the relationship between those variables and the database. An easy way to do that is through the GraphDB REST API SPARQL template endpoint. Let's see how this is done.
1. First, we need to import some data with which we will be working.
Go to Import ‣ User data ‣ Import RDF text snippet and import the following sample data describing five fictitious wines:
wine:Merlo
rdf:type wine:Grape ;
rdfs:label "Merlo" .
wine:CabernetSauvignon
rdf:type wine:Grape ;
rdfs:label "Cabernet Sauvignon" .
wine:CabernetFranc
rdf:type wine:Grape ;
rdfs:label "Cabernet Franc" .
wine:PinotNoir
rdf:type wine:Grape ;
rdfs:label "Pinot Noir" .
wine:Chardonnay
rdf:type wine:Grape ;
rdfs:label "Chardonnay" .
wine:Yoyowine
rdf:type wine:RedWine ;
wine:madeFromGrape wine:CabernetSauvignon ;
wine:hasSugar "dry" ;
wine:hasYear "2013"^^xsd:integer .
wine:Franvino
rdf:type wine:RedWine ;
wine:madeFromGrape wine:Merlo ;
wine:madeFromGrape wine:CabernetFranc ;
wine:hasSugar "dry" ;
wine:hasYear "2012"^^xsd:integer .
wine:Noirette
wine:Blanquito
rdf:type wine:WhiteWine ;
wine:madeFromGrape wine:Chardonnay ;
wine:hasSugar "dry" ;
wine:hasYear "2012"^^xsd:integer .
wine:Rozova
rdf:type wine:RoseWine ;
wine:madeFromGrape wine:PinotNoir ;
wine:hasSugar "medium" ;
wine:hasYear "2013"^^xsd:integer .
2. Create a SPARQL template for updating a wine's sugar content and year (e.g., via Setup ‣ SPARQL Templates, as described above) with the following body:
DELETE {
?s wine:hasSugar ?oldValue .
?s wine:hasYear ?oldYear
} INSERT {
?s wine:hasSugar ?sugar .
?s wine:hasYear ?year .
} WHERE {
?s ?p ?oldValue .
?s ?p1 ?oldYear .
}
3. Let’s run a SPARQL query against the data. In the SPARQL editor, execute:
SELECT ?s ?p ?o
WHERE {
BIND(wine:Blanquito as ?s ) .
?s ?p ?o .
}
4. Example 1:
To change the values of the variables for sugar content and year, we will update the data through
the REST API endpoint.
a. Go to Help ‣ REST API ‣ GraphDB Workbench API ‣ SPARQL Template Controller ‣ POST /rest/repositories/{repositoryID}/sparql-templates/execute.
b. For the repositoryID parameter, enter the name of your repository, e.g. “my_repo”.
c. In the document field, enter the JSON document:
{
"sugar" : "none" ,
"year" : 2020 ,
"s" : "http://www.ontotext.com/example/wine#Blanquito"
}
e. To see how the data have been updated, let’s execute the SPARQL query from step 3 again:
We can see that the objects for the sugar content and year predicates have been
updated to “none” and “2020”, respectively.
Here, we executed a template and added specific values for its variables. Even if we
had not specified the type for 2020, we would get a typed result: "2020"^^xsd:int.
This is because standard IRIs, numbers, and boolean values are recognized and
parsed this way.
5. Example 2:
We can also create typed values explicitly by using JSON-LD-like values.
a. We will be using the same SPARQL template as in example 1.
b. Again in Help ‣ REST API ‣ GraphDB Workbench API ‣ SPARQL Template Controller ‣ POST /rest/repositories/{repositoryID}/sparql-templates/execute, send:
{
"sugar" : { "@id" : "custom:iri" } ,
"s" : "http://www.ontotext.com/example/wine#Blanquito"
}
Most IRIs will be recognized, but some custom ones will not. Here, we are using a
special label @id so that the value for sugar can be parsed as an IRI, since the value
custom:iri will not be considered an IRI by default.
c. To see how the data have been updated, execute the query from example 1 in the SPARQL
editor. The returned results will be:
As shown in the first example, the values will get a type if recognized. If we have a value that is not in its default type, we can use JSON-LD-like values containing both the @value and the @type. Here, this is demonstrated with the year variable – the result is "2020"^^<http://test.type>.
GraphDB supports SHACL validation, ensuring efficient data consistency checking.
The W3C standard Shapes Constraint Language (SHACL) is a valuable tool for efficient data consistency checking, and is supported by GraphDB via RDF4J's ShaclSail. It is useful in efforts towards data integration, as well as in examining data compliance, e.g., every GeoName URI must start with http://geonames.com/, or age must be above 18 years.
The language validates RDF graphs against a set of conditions. These conditions are provided as shapes and other
constructs expressed in the form of an RDF graph. In SHACL, RDF graphs that are used in this manner are called
shapes graphs, and the RDF graphs that are validated against a shapes graph are called data graphs.
A shape is an IRI or a blank node s that fulfills at least one of the following conditions in the shapes graph:
• s is a SHACL instance of sh:NodeShape or sh:PropertyShape.
• s is subject of a triple that has sh:targetClass, sh:targetNode, sh:targetObjectsOf, or
sh:targetSubjectsOf as predicate.
5.4.2 Usage
A repository with SHACL validation must be created from scratch, i.e., via Create new. You cannot enable the validation on an already existing repository afterwards.
Create a repository and enable the Support SHACL validation option. Several additional checkboxes appear:
• Cache select nodes: The ShaclSail retrieves a lot of its relevant data by running SPARQL SELECT queries against the underlying Sail and against the changes in the transaction. Caching these results is usually good for performance, but it is recommended to disable the cache while validating large amounts of data, as this consumes less memory. Default value is true.
• Log the executed validation plans: Logs (INFO) the executed validation plans as GraphViz DOT.
It is recommended to disable Run parallel validation. Default value is false.
• Run parallel validation: Runs validation in parallel. May cause deadlock, especially when using
NativeStore. Default value is true.
• Log the execution time per shape: Logs (INFO) the execution time per shape. It is recommended
to disable Run parallel validation and Cache select nodes. Default value is false.
• DASH data shapes extensions: Activates the DASH Data Shapes extensions. The DASH Data Shapes Vocabulary is a collection of reusable extensions to SHACL for a wide range of use cases. Currently, this enables support for dash:hasValueIn, dash:AllObjectsTarget, and dash:AllSubjectsTarget.
• Log validation violations: Logs (INFO) a list of violations and the triples that caused the violations
(BETA). It is recommended to disable Run parallel validation. Default value is false.
• Log every execution step of the SHACL validation: Logs (INFO) every execution step of the
SHACL validation. This is fairly costly and should not be used in production. It is recommended
to disable Run parallel validation. Default value is false.
• RDF4J SHACL extensions: Activates RDF4J's SHACL extensions (RSX) that provide additional functionality. RSX currently contains rsx:targetShape, which allows a shape to be the target for your constraints. For more information about the RSX features, see the RSX section of the RDF4J documentation.
• Named graphs for SHACL shapes: Sets the named graphs where SHACL shapes can be stored. Comma-delimited list.
Some of these options are used for logging and validation; you can find more about them further down on this page.
You can load shapes using all three key methods for loading data into GraphDB: through the Workbench, with an
INSERT query in the SPARQL editor, and through the REST API.
ex:PersonShape
a sh:NodeShape ;
sh:targetClass ex:Person ;
sh:property [
sh:path ex:age ;
sh:datatype xsd:integer ;
] .
It indicates that entities of the class Person have a property “age” of the type xsd:integer.
Click Import. In the dialog that opens, select Target graphs ‣ Named graph. Insert the ShaclSail reserved graph http://rdf4j.org/schema/rdf4j#SHACLShapeGraph (or a custom named graph specified with the sh:shapesGraph property) as shown below:
2. After the shape has been imported, let’s test it with some data:
a. Again from Import ‣ User data ‣ Import RDF text snippet, insert correct data (i.e., age is an integer):
ex:Alice
rdf:type ex:Person ;
ex:age 12 ;
.
Leave the Import settings as they are, and click Import. You will see that the data has been
imported successfully, as it is compliant with the shape you just inserted.
b. Now import incorrect data (i.e., age is a double):
ex:Alice
rdf:type ex:Person ;
ex:age 12.1 ;
.
The import will fail, returning a detailed error message with all validation violations in both
the Workbench and the command line.
There are two ways to delete a SHACL shape: from the GraphDB Workbench and with the RDF4J API.
Note: Keep in mind that the Clean Repository option in the Explore ‣ Graphs overview tab would not delete the shape graph, as it removes all data from the repository, but not SHACL shapes.
3. Load the updated shape graph following the instructions in Loading shapes and data graphs.
Note: As shape graphs are stored separately from the data, importing a new shape graph with the Enable replacement of existing data option enabled in the Import settings dialog would not work. This is why the above steps must be followed.
Currently, shape graphs cannot be accessed with SPARQL inside GraphDB, as they are not part of the data. You
can view the graph by using the RDF4J client to connect to the GraphDB repository. The following code snippet
will return all statements inside the shape graph:
ShaclSail validates the data changes on commit(). In case of a violation, it will throw an exception that contains a
validation report where you can find details about the noncompliance of your data. The exception will be shown
in the Workbench if it was caused by an update executed in the same Workbench window.
In addition to that, you may also enable ShaclSail logging to get additional validation information in the log files.
To enable logging, check one of the three logging options when creating the SHACL repository:
• Log the executed validation plans
• Log validation violations
• Log every execution step of the SHACL validation
All three will log as INFO and appear in the main-[yyyy-mm-dd].log file in the logs directory of your GraphDB
installation.
The following SHACL features are supported:
• sh:targetClass: Specifies a target class. Each value of sh:targetClass in a shape is an IRI.
• sh:targetNode: Specifies a node target. Each value of sh:targetNode in a shape is either an IRI or a literal.
• sh:targetSubjectsOf: Specifies a subjects-of target in a shape. The values are IRIs.
• sh:targetObjectsOf: Specifies an objects-of target in a shape. The values are IRIs.
• sh:path: Points at the IRI of the property that is being restricted. Alternatively, it may point at a path expression (see the warning on supported paths below).
• sh:inversePath: An inverse path is a blank node that is the subject of exactly one triple in a graph. This triple has sh:inversePath as its predicate.
• sh:property: Specifies that each value node has a given property shape.
• sh:or: Specifies the condition that each value node conforms to at least one of the provided shapes.
• sh:and: Specifies the condition that each value node conforms to all provided shapes. This is comparable to conjunction (logical AND).
• sh:not: Specifies the condition that each value node cannot conform to a given shape. This is comparable to negation (logical NOT).
• sh:minCount: Specifies the minimum number of value nodes that satisfy the condition. If the minimum cardinality value is 0, the constraint is always satisfied.
• sh:maxCount: Specifies the maximum number of value nodes that satisfy the condition.
• sh:minLength: Specifies the minimum string length of each value node that satisfies the condition. This can be applied to IRIs and literals, but not to blank nodes.
• sh:maxLength: Specifies the maximum string length of each value node that satisfies the condition. This can be applied to IRIs and literals, but not to blank nodes.
• sh:pattern: Specifies a regular expression that each value node matches to satisfy the condition.
• sh:flags: An optional string of flags, interpreted as in SPARQL 1.1 REGEX. The values of sh:flags in a shape are literals with datatype xsd:string.
• sh:nodeKind: Specifies a condition to be satisfied by the RDF node kind of each value node.
• sh:languageIn: Specifies that the allowed language tags for each value node are limited by a given list of language tags.
• sh:datatype: Specifies a condition to be satisfied with regards to the datatype of each value node.
• sh:class: Specifies that each value node is a SHACL instance of a given type.
• sh:in: Specifies the condition that each value node is a member of a provided SHACL list.
• sh:uniqueLang: Can be set to true to specify that no pair of value nodes may use the same language tag.
• sh:minInclusive: Specifies the minimum inclusive value. The values of sh:minInclusive in a shape are literals. A shape can have at most one value for sh:minInclusive.
• sh:maxInclusive: Specifies the maximum inclusive value. The values of sh:maxInclusive in a shape are literals. A shape can have at most one value for sh:maxInclusive.
• sh:minExclusive: Specifies the minimum exclusive value. The values of sh:minExclusive in a shape are literals. A shape can have at most one value for sh:minExclusive.
• sh:maxExclusive: Specifies the maximum exclusive value. The values of sh:maxExclusive in a shape are literals. A shape can have at most one value for sh:maxExclusive.
• sh:deactivated: A shape that has the value true for the property sh:deactivated is called deactivated. The values of sh:deactivated in a shape must be either true or false.
• sh:hasValue: Specifies the condition that at least one value node is equal to the given RDF term.
• sh:shapesGraph: Sets the named graphs where SHACL shapes can be stored. Comma-delimited list.
• dash:hasValueIn: Can be used to state that at least one value node must be a member of a provided SHACL list. This constraint is part of the DASH Data Shapes extensions.
• sh:target: For use with DASH targets.
• rsx:targetShape: Part of RDF4J's SHACL extensions (RSX); allows a shape to be the target for your constraints. For more information, see the RSX section of the RDF4J documentation.
Implicit sh:targetClass is supported for nodes that are rdfs:Class and either of sh:PropertyShape or sh:NodeShape. Validation for all nodes that are equivalent to owl:Thing in an environment with a reasoner can be enabled via the ShaclSail configuration.
Warning: The above description of sh:path will be fully accurate once all kinds of sh:path are supported, which will be implemented in a later version.
Currently, sh:path is limited to single predicate paths or a single inverse path. Sequence paths, alternative paths, and the like are not supported.
The GraphDB change tracking plugin allows you to track changes within the context of a transaction identified by
a unique ID.
GraphDB allows the tracking of changes that you have made in your data. Two tools offer this capability: the
change tracking plugin, and the data history and versioning plugin.
The change tracking plugin is useful for tracking changes within the context of a transaction identified by a unique
ID. Different IDs allow tracking of multiple independent changes, e.g., user A tracks his updates and user B tracks
her updates without interfering with each other. The tracked data is stored only in-memory and is not available
after a restart.
As part of the GraphDB Plugin API, the change tracking plugin provides the ability to track the effects of SPARQL
updates. These can be:
• Tracking what triples have been inserted or deleted;
• Distinguishing explicit from implicit triples;
• Running SPARQL using these triples.
5.5.2 Usage
INSERT DATA {
[] <http://www.ontotext.com/track-changes> "xxx"
}
INSERT DATA {
[] <http://www.ontotext.com/track-changes> "xxx"
};
INSERT DATA {<urn:a> <urn:b> <urn:c>};
7. CONSTRUCT query using data that has just been added (advanced example):
BASE <http://ontotext.com/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX test: <http://ontotext.com/vocabulary/test/>
CONSTRUCT {
?person test:knows ?knows ;
foaf:givenName ?givenName
} FROM <http://www.ontotext.com/added/xxx> WHERE {
?person foaf:givenName ?givenName ;
foaf:knows ?knows
}
INSERT DATA {
    <http://www.ontotext.com/track-changes> <http://www.ontotext.com/delete-changes> "xxx"
}
Note: You must explicitly delete the tracked changes when you no longer need to query them. Otherwise, they will stay in memory until the same ID is used again, or until GraphDB is restarted.
Tip: A good way to ensure unique tracking IDs is to use UUIDs. A random UUID can be generated in Java by
calling UUID.randomUUID().toString().
The GraphDB sequences plugin provides transactional sequences for GraphDB. A sequence is a long counter that
can be atomically incremented in a transaction to provide incremental IDs.
To deploy it, please follow the GitHub instructions.
5.6.2 Usage
The plugin supports multiple concurrent sequences where each sequence is identified by an IRI chosen by the user.
Creating a sequence
Choose an IRI for your sequence, for example http://example.com/my/seq1. Insert the following triple to create
a sequence whose next value will be 1:
INSERT DATA {
my:seq1 seq:create []
}
You can also create a sequence by providing the starting value, for example to create a sequence whose next value
will be 10:
INSERT DATA {
my:seq1 seq:create 10
}
When using the GraphDB cluster, you might get the following exception if the repository existed before registering
the plugin: Update would affect a disabled plugin: sequences. You can activate the plugin with:
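The activation update is not shown above; a sketch using GraphDB's plugin control predicate, assuming the plugin is registered under the name "sequences":
# Start (activate) the sequences plugin
INSERT DATA {
    [] <http://www.ontotext.com/owlim/system#startplugin> "sequences"
}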
Using a sequence
In this scenario, new and current sequence values can be retrieved on the client where they can be used to generate
new data that can be added to GraphDB in the same transaction. For a workaround in the cluster, see here.
Note: The examples below will not work inside the GraphDB Workbench, as they need to be executed in a single transaction; if run one by one, they would be performed in separate transactions. See here how to execute them in one transaction.
To use any sequence, you must first start a transaction and then prepare the sequences for use by executing the
following update:
INSERT DATA {
[] seq:prepare []
}
Then you can request new values from any sequence by running a query like this (for the sequence http://
example.com/my/seq1):
SELECT ?next {
my:seq1 seq:nextValue ?next
}
To query the last new value without incrementing the counter, you can use a query like this:
SELECT ?current {
my:seq1 seq:currentValue ?current
}
Use the obtained values to construct IRIs, assign IDs, or any other use case.
In this scenario, new and current sequence values are available only within the execution context of a SPARQL IN
SERT update. New data using the sequence values can be generated by the same INSERT and added to GraphDB.
The following example prepares the sequences for use and inserts some new data using the sequence http://
example.com/my/seq1 where the subject of the newly inserted data is created from a value obtained from the
sequence.
The example will work both in:
• the GraphDB cluster – as new sequence values do not need to be exposed to the client.
• the GraphDB Workbench – as it performs everything in a single transaction by separating the individual operations using a semicolon.
# Obtains a new value from the sequence and creates an IRI based on it,
# then inserts new triples using that IRI
INSERT {
?subject rdfs:label "This is my new document" ;
a my:Type1
} WHERE {
my:seq1 seq:nextValue ?next
BIND(IRI(CONCAT("http://example.com/my-data/test/", STR(?next))) as ?subject)
};
Dropping a sequence
Dropping a sequence is similar to creating it. For example, to drop the sequence http://example.com/my/seq1,
execute this:
INSERT DATA {
my:seq1 seq:drop []
}
Resetting a sequence
In some cases, you might want to reset an existing sequence such that its next value will be a different number.
Resetting is equivalent to dropping and recreating the sequence.
To reset a sequence such that its next value will be 1, execute this update:
INSERT DATA {
my:seq1 seq:reset []
}
You can also reset a sequence by providing the starting value. For example, to reset a sequence such that its next
value will be 10, execute:
INSERT DATA {
my:seq1 seq:reset 10
}
Workaround for using sequence values on the client with the cluster
If you need to process your sequence values on the client in a GraphDB 9.x cluster environment, you can create a single-node (i.e., not part of a cluster) worker repository to provide the sequences. It is most convenient to have that repository on the same GraphDB instance as your primary master repository.
Let’s call the master repository where you will store your data master1 and the second worker repository where
you will create and use your sequences seqrepo1.
Managing sequences
1. First, you need to obtain one or more new sequence values from the repository seqrepo1:
a. Start a transaction in seqrepo1.
b. Prepare the sequences for use by executing this in the same transaction:
INSERT DATA {
[] seq:prepare []
}
c. Obtain one or more new sequence values from the sequence http://example.com/my/seq1:
SELECT ?next {
my:seq1 seq:nextValue ?next
}
Handling backups
Note that this example assumes that sequence values were used to generate IRIs, and IRIs with higher values were
used for the first time after IRIs with lower values were used.
The ability to query and explore the data is essential to any database. The following chapters cover the topics of
using SPARQL queries, ranking results, various specialized searches and indexing, visualizations, and more:
To manage and query your data, go to the SPARQL menu tab. The SPARQL view integrates the YASGUI query
editor plus some additional features, which are described below.
Hint: SPARQL is a SQL-like query language for RDF graph databases with the following types:
• SELECT returns tabular results;
• CONSTRUCT creates a new RDF graph based on query results;
• ASK returns “YES” if the query has a solution, otherwise “NO”;
• DESCRIBE returns RDF data about a resource; useful when you do not know the RDF data structure in the
data source;
• INSERT inserts triples into a graph;
• DELETE deletes triples from a graph.
The SPARQL editor offers two viewing/editing modes: horizontal and vertical.
Use the vertical mode switch to show the editor and the results next to each other, which is particularly useful on a wide screen. Click the switch again to return to horizontal mode.
Both in horizontal and vertical mode, you can also hide the editor or the results to focus on query editing or result
viewing. Click the buttons Editor only, Editor and results, or Results only to switch between the different modes.
1. Manage your data by writing queries in the text area. It offers syntax highlighting and namespace autocompletion for easy reading and writing.
2. Include or exclude inferred statements in the results by clicking the >> icon. When inferred statements are included, both elements of the arrow icon are solid lines (ON); otherwise, the left element is a solid line and the right one is a dotted line (OFF).
3. Enable or disable the expansion of results over owl:sameAs by clicking the last icon above the Run button. Similarly to the one above it, the setting is ON when all three of its circles are solid lines, and OFF when two of them are dotted.
4. Execute the query by clicking the Run button or use Ctrl/Cmd + Enter.
Tip: You can find other useful shortcuts in the keyboard shortcuts link in the lower right corner of the
SPARQL editor.
5. The results can be viewed in different formats corresponding to the type of the query. By default, they are
displayed as a table. Other options are Raw response, Pivot table and Google Charts. You can order the
results by column values and filter them by table values. The total number of results and the query execution
time are displayed in the query results header.
Note: The total number of results is obtained by an async request with a default-graph-uri parameter
and the value http://www.ontotext.com/count.
6. Navigate through all results by using pagination (SPARQL view can only show a limited number of results at
a time). Each page executes the query again with query limit and offset for SELECT queries. For graph queries
(CONSTRUCT and DESCRIBE), all results are fetched by the server and only the page of interest is gathered from
the results iterator and sent to the client.
7. The query results shown in the browser are limited to 1,000, since your browser cannot handle an unlimited number of results. Obtain all results by using Download As and selecting the required format for the data (JSON, XML, CSV, TSV, and Binary RDF for SELECT queries, and all supported RDF formats for CONSTRUCT and DESCRIBE query results).
Use the editor’s tabs to keep several queries opened while working with GraphDB. Save a query on the server with
the Create saved query icon.
When security is ON in the Setup ‣ Users and Access menu, the system distinguishes between different users.
The user can choose whether to share a query with others, and shared queries are editable by the owner only.
Access existing queries (default, yours, and shared) from the Show saved queries icon.
Copy your query as a URL by clicking the Get URL to current query icon.
When Free access is ON, the Free Access user will see shared queries only and will not be able to save new queries.
You can use the Abort query button in the SPARQL editor to manually interrupt any query.
RDF Rank is an algorithm that identifies the more important or more popular entities in the repository by examining their interconnectedness. The popularity of entities can then be used to order query results, similar to the way internet search engines order results, e.g., how Google orders search results using PageRank.
The RDF Rank component computes a numerical weighting for all nodes in the entire RDF graph stored in the repository, including URIs, blank nodes, literals, and RDF-star (formerly RDF*) embedded triples. The weights are floating point numbers with values between 0 and 1 that can be interpreted as a measure of a node's relevance/popularity.
Since the values range from 0 to 1, the weights can be used for sorting a result set (the lexicographical order works
fine even if the rank literals are interpreted as plain strings).
Here is an example SPARQL query that uses the RDF rank for sorting results by their popularity:
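The query itself is not reproduced above; a minimal sketch of such a query, using the rank:hasRDFRank predicate described next (the triple pattern selecting the nodes is illustrative):
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
SELECT ?node ?rank
WHERE {
    # Select the nodes of interest in any way you like (illustrative pattern)
    ?node ?p ?o .
    # Bind each node's RDF Rank weight via the system predicate
    ?node rank:hasRDFRank ?rank .
}
ORDER BY DESC(?rank)
LIMIT 100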
As seen in the example query, RDF Rank weights are made available via a special system predicate. GraphDB
handles triple patterns with the predicate http://www.ontotext.com/owlim/RDFRank#hasRDFRank in a special
way, where the object of the statement pattern is bound to a literal containing the RDF Rank of the subject.
rank#hasRDFRank returns the rank with a precision of 0.01. You can also retrieve the rank with a precision of 0.001, 0.0001, or 0.00001 using rank#hasRDFRank3, rank#hasRDFRank4, and rank#hasRDFRank5, respectively.
In order to use this mechanism, the RDF ranks for the whole repository must be computed in advance. This is done
by committing a series of SPARQL updates that use special vocabulary to parameterize the weighting algorithm,
followed by an update that triggers the computation itself.
Parameters
Parameter: Epsilon
Predicate: http://www.ontotext.com/owlim/RDFRank#epsilon
Description: Terminates the weighting algorithm early when the total change of all RDF Rank scores has fallen below this value.
Default: 0.01
Example:
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
INSERT DATA { rank:epsilon rank:setParam "0.05" . }
Full computation
To trigger the computation of the RDF Rank values for all resources, use the following update:
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
INSERT DATA { _:b1 rank:compute _:b2. }
You can also compute the RDF Rank values in the background. This operation is asynchronous which means that
the plugin manager will not be blocked during it and you can work with other plugins as the RDF Rank is being
computed.
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
INSERT DATA { _:b1 rank:computeAsync _:b2. }
Warning: Using a SPARQL query to perform an asynchronous computation while in a cluster will set your cluster out of sync. RDF Rank computations in a cluster should be performed synchronously.
Or, in the Workbench, go to Setup ‣ RDF Rank and click Compute Full.
Note: When using the Workbench button on a standalone repository (not in a cluster), the RDF rank is computed
asynchronously. When the button is used on a master repository (in a cluster), the rank is computed synchronously.
Incremental updates
The full computation of RDF Rank values for all resources can be relatively expensive. When new resources have
been added to the repository after a previous full computation of the RDF Rank values, you can either have a
full recomputation for all resources (see above) or compute only the RDF Rank values for the new resources (an
incremental update).
The following control update computes RDF Rank values only for the resources that do not have an associated value, i.e., the ones that have been added to the repository since the last full RDF Rank computation. Just like full computations, incremental updates can also be performed asynchronously:
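The control updates themselves are not reproduced above. Sketches, under the assumption that the incremental control predicates mirror the naming of the full-computation ones (i.e., rank:computeIncremental and rank:computeIncrementalAsync):
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
# Synchronous incremental computation (assumed predicate name)
INSERT DATA { _:b1 rank:computeIncremental "true" . }

PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
# Asynchronous incremental computation (assumed predicate name)
INSERT DATA { _:b1 rank:computeIncrementalAsync "true" . }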
Warning: Using a SPARQL query to perform an asynchronous computation while in a cluster will set your cluster out of sync. RDF Rank computations in a cluster should be performed synchronously.
Note: The incremental computation uses a different algorithm, which is lightweight (in order to be fast), but
is not as accurate as the proper ranking algorithm. As a result, ranks assigned by the proper and the lightweight
algorithms will be slightly different.
The computed weights can be exported to an external file using an update of this form:
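A sketch of such an export update, assuming a control predicate named rank:export whose object is the target file path on the server:
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
# Export the computed RDF Rank weights to a file on the server (assumed predicate and argument form)
INSERT DATA { _:b1 rank:export "/tmp/rdf-rank-weights.txt" . }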
If the export fails, the update throws an exception and an error message is recorded in the log file.
The RDF Rank plugin reports its status as one of the following values:
/**
* The ranks computation has been canceled
*/
CANCELED,
/**
* The ranks are computed and up-to-date
*/
COMPUTED,
/**
* A computing task is currently in progress
*/
COMPUTING,
/**
* Exception has been thrown during computation
*/
ERROR,
/**
* The ranks are outdated and need computing
*/
OUTDATED,
/**
* The filtering is enabled and its configuration has been changed since the last full computation
*/
CONFIG_CHANGED
You can get the current status of the plugin by running the following query:
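A sketch of such a status query, assuming the plugin exposes its status through a rank:status predicate:
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
# Retrieve the current status of the RDF Rank plugin (assumed predicate name)
SELECT ?status {
    ?s rank:status ?status
}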
Rank filtering
By default, the RDF Rank is calculated over the whole repository. This is useful when you want to find the most
interconnected and important entities in general.
However, there are times when you are interested only in entities in certain graphs or entities related to a particular predicate. This is why the RDF Rank has a filtered mode – it filters the statements in the repository that are taken into account when calculating the rank.
You can enable the filtered mode with the following query:
The filtering of the statements can be performed based on predicate, graph, or type – explicit or implicit (inferred).
You can make both inclusion and exclusion rules.
In order to include only statements having a particular predicate or being in a particular named graph, you should
include the predicate / graph IRI in one of the following lists: includedPredicates / includedGraphs. Empty
lists are treated as wildcards. See below how to control the lists with SPARQL queries:
Get the content of a list:
The filtering can be done not only by including statements of interest but also by excluding ones. For this purpose, there are two additional lists: excludedPredicates and excludedGraphs. These lists take precedence over their inclusion alternatives, so if, for instance, you have the same predicate in both the inclusion and the exclusion list, it will be treated as excluded. These lists can be controlled in exactly the same way as the inclusion ones.
There is a convenient way to include/exclude all explicit/implicit statements. This is done with two parameters
– includeExplicit and includeImplicit, which are set to true by default. When set to true, they are just
disregarded, i.e., do not take part in the filtering. However, if you set them to false, they start acting as exclusion
rules – this means they take precedence over the inclusion lists.
You can get the status of these parameters using:
6.2.2 Prominence
In GraphDB’s Prominence functionality, the prominence for a resource is defined as the sum of the number of
outgoing connections (where the resource is the subject of a triple) and the number of incoming connections (where
the resource is the object of a triple). The numbers are automatically maintained by GraphDB.
Examples
SELECT ?prominence {
    <http://example.com/Book1> <http://www.ontotext.com/owlim/entity#hasProminence> ?prominence
}
SELECT ?book {
?book a <http://example.com/Book> ;
<http://www.ontotext.com/owlim/entity#hasProminence> 5
}
SELECT ?node {
?node <http://www.ontotext.com/owlim/entity#hasProminence> 10
}
6.3.1 Overview
The GraphDB Graph path search functionality allows you to not only find complex relationships between resources
but also explore them and use them as filters to identify graph patterns. This is a key factor in a variety of use
cases and fields, such as data fabric analysis of supply chains, clinical trials in drug research, or social media
management. Discovering connections between resources must come hand in hand with the ability to explain
them to key stakeholders.
It includes algorithms for Shortest path and All paths search, which enable you to explore the connecting edges
(RDF statements) between resources for the shortest property paths and subsequently for all connecting paths.
Other supported algorithms include finding the shortest distance between resources and discovering cyclical dependencies in a graph.
It also supports wildcard property search and more targeted graph pattern search. A graph pattern is an edge abstraction that can be used to define more complex relationships between resources in a graph. It targets specific types of relationships in order to filter and limit the number of paths returned. For example, it can define indirect relationships such as N-ary relations that rely on another resource and that cannot be expressed using a standard subject-predicate-object directional relationship.
The graph path search extension is compatible with the GraphDB service plugin syntax, which allows for easy
integration into queries.
Hint: Graph path search is similar to the SPARQL 1.1 property paths feature as both enable graph traversal,
allowing you to discover relationships between resources through arbitrary length patterns. However, property
paths uncover the start and end nodes of a path, but not the intermediate ones, meaning that traceability remains a
challenge.
For the examples included further down in this page, we have used a dataset containing Marvel Studios-related data combined with some information from DBpedia. To try them out yourself, download it and load it into a GraphDB repository via Import ‣ User data ‣ Upload RDF files.
6.3.2 Usage
Four graph path search algorithms are supported: Shortest path, All paths, Shortest distance, and Cyclic path.
For Shortest path and All paths, the following is valid:
• All of the paths with the shortest length are returned. If, when searching for the shortest path between two nodes, there are several different paths that meet this requirement, all of them will be returned as results.
• Bindings for at least the source and/or destination (preferably both) must be provided.
• The startNode and endNode properties are unbound prior to path evaluation and are bound by the path search
for each edge returned by the query. If a graph pattern is used, they show the relation between the two nodes,
and are bound by the path search dynamically and recursively.
• Edges can be returned as RDF-star statements.
• Each binding can also be returned separately.
• When using a wildcard predicate pattern, the edge label (predicate) can be accessed as well.
All of the graph path search algorithms support using a literal as a destination. Both the source and the destination can be literals (e.g., N-ary relations).
path:findPath is a required property that defines the type of search function.
A graph path search is defined by three types of properties described in detail below.
The property reference distinguishes the path search algorithms, the modifier and variable bindings, and the filtering parameters. For example, path:resultBindingIndex is an optional variable binding that returns the index of each edge inside a path in incremental order, following the Java array indexing notation (it applies to Shortest path and All paths), while path:maxPathLength is a filtering parameter.
Required properties include a binding for source and/or destination, as well as the type of the search.
Optional properties include min/max path length, edge bindings, or path indexing. Setting a maximum path length
can be useful, for instance, when you are querying a large repository of over several hundred million statements
and want to limit the results so as to not strain the database.
Search algorithms
Shortest path
The algorithm finds the shortest path between two input nodes or between one bound and one unbound node. It
recursively evaluates the graph pattern in the query and replaces the start variable with the binding of the end
variable in the previous execution. If we have specified a start node in the query, its value is used for the first
evaluation of the graph pattern. If we have specified an end node, the query execution will stop when that end
node is reached.
The shortest path algorithm can be used with a wildcard predicate as well as a graph pattern that is used as an edge
abstraction. With it, we can impose filtering through property negation or selection, define indirect relationships,
specify named graphs, etc.
Note: Inside the graph pattern, we cannot define other subqueries or use federated queries for performance
reasons. The variables bound as objects to the path:startNode and path:endNode properties are required to be
present at least once inside the graph pattern.
All paths
This algorithm finds all paths between two nodes or between all nodes and the starting/destination node. It can be
used with a wildcard predicate, as well as with more complex graph patterns and relationships. With it, we can
also impose filtering with min/max number of edges, and can include or exclude inferred edges.
See examples of how All paths search is used here.
Shortest distance
The algorithm finds the distance of the shortest path between two resources, which is the number of edges that
connect the resources. This is done through the path:distanceBinding property. The nodes themselves will not
be returned as results, only the distance.
See an example of how Shortest distance search is used here.
Cyclic path
With the cyclic path search, we can explore self-referring relationships between resources. Similarly to the All paths search, this one can also be limited with min/max values.
See an example of how Cyclic path search is used here.
Search modifiers
This mode enables parallel path search query evaluation and allows you to specify the size of the thread pool used to evaluate the input path search query in parallel. It is limited by the total number of cores available per license, i.e., the more licensed cores, the larger the pool size and the faster the queries. It is very effective when used with complex graph patterns.
To perform parallel path search, use the path:poolSize global modifier property. The number of parallel threads
used by all parallel path searches simultaneously cannot exceed the number of licensed cores.
See an example of how Parallel path search is used here.
Export bindings allow you to project any number of bindings from the graph pattern query service. The power of
SPARQL graph pattern-matching property paths is combined with GraphDB's path search algorithm, enabling the
user to restrict the start and the end nodes of the path search to those pairs that match a particular graph pattern
defined as SPARQL property path. You can “export” bindings from such graph patterns and this way get additional
details about the found paths.
The export bindings as parameters have to be defined inside the main service of the path search query with the
magic predicate <http://www.ontotext.com/path#exportBinding> (or simply path:exportBinding). Keep in
mind that the binding names defined in the parameters of the search query have to be present in the nested graph
pattern service.
See an example of how Export bindings are used here.
Bidirectional search
The bidirectional search functionality can be used to traverse paths as if the graph is undirected, i.e., as if the edges
between the nodes have no direction. Technically, bidirectional search traverses adjacent nodes both in SPO and
OPS order, where the subject and object are the recursively evaluated start and end nodes. It can be used with
all functions and can be combined with wildcard and graph pattern search as well as with exportable graph pattern
bindings.
In order to do bidirectional search, you can use the magic predicate <http://www.ontotext.com/
path#bidirectional> (or simply path:bidirectional) followed by value true of type xsd:boolean.
Shortest path
Let’s try out the shortest path search with queries that we will run against the Marvel Studios dataset that we loaded
into GraphDB earlier.
Suppose we want to find the shortest path between source node the movie “The Black Panther (1977)”, and
destination node Marvel Comics’ creative leader Stan Lee.
In the Workbench SPARQL editor, run the following query:
PREFIX path: <http://www.ontotext.com/path#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbr: <http://dbpedia.org/resource/>
Here, the path traversal is done by using a wildcard predicate. This is because we want to explore the predicates
connecting the resources inside the path, and we do not know the relationships within the data.
The path:resultBinding property returns path edges as RDFstar statements. Each edge is indexed with the
path:resultBindingIndex property, and each of the shortest paths is indexed with the path:pathIndex property.
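Putting these pieces together, a sketch of such a shortest path query (the path:shortestPath function name and the exact resource IRIs are assumptions – the IRIs depend on the loaded dataset):
PREFIX path: <http://www.ontotext.com/path#>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?pathIndex ?edgeIndex ?edge
WHERE {
    # Source and destination resources (illustrative IRIs – adjust to the dataset)
    VALUES (?src ?dst) {
        ( <http://dbpedia.org/resource/The_Black_Panther_(1977_film)> dbr:Stan_Lee )
    }
    SERVICE <http://www.ontotext.com/path#search> {
        <urn:path> path:findPath path:shortestPath ;      # assumed function name
                   path:sourceNode ?src ;
                   path:destinationNode ?dst ;
                   path:pathIndex ?pathIndex ;            # index of each shortest path
                   path:resultBindingIndex ?edgeIndex ;   # index of each edge within a path
                   path:resultBinding ?edge ;             # each edge as an RDF-star statement
    }
}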
The results show that there are ten shortest paths between Stan Lee and the 1977 "Black Panther" movie (paths 0-9), each consisting of four edges. The first one, for example, reveals the following relationship:
“The Black Panther (1977)” is a different movie from “Black Panther”. The studio that made “Black Panther” is
Marvel Studios, founded by Marvel Entertainment, where Stan Lee is a key person.
We can also trace the path in the Visual graph of the Workbench.
1. Go to Setup ‣ Autocomplete to enable it.
2. From Explore ‣ Visual graph ‣ Easy graph, search for the resource The Black Panther (1977) (the resource
view will autocomplete the IRI).
3. Trace the identified path.
Note: Due to the large number of connections in the dataset and for better readability, in this and the following
examples, the relationships in the Visual graph are filtered to display only the resources connected by preferred
predicates. (In our case here: differentFrom, studio, founder, and keyPerson)
In this query, we will again be searching for the shortest path between source node “The Black Panther (1977)”
and destination node Stan Lee, but this time excluding any properties of the type http://dbpedia.org/property/
keyPerson. The path traversal will be executed using a graph pattern specifying the exclusion of this property type
through property negation with the SPARQL 1.1 property paths syntax.
The paths are “served” by the nested SERVICE <urn:path> subclause where the service IRI coincides with
the subject node invoking path:findPath. The paths connect the nodes specified by the path:startNode and
path:endNode bindings.
As we are using a graph pattern to specify the relation, we cannot view the predicates connecting the resources,
i.e., path:resultBinding is not applicable, but we can still view the nodes.
As in the previous example, we can index the edge bindings with the path:resultBindingIndex property, and
index each of the shortest paths with the path:pathIndex property.
After excluding the DBpedia keyPerson property from the search, two shortest paths between these resources are returned as results:
• the first path is: "The Black Panther (1977)" → "Black Panther" → Marvel Studios → Marvel Entertainment → Stan Lee
• the second path is: "The Black Panther (1977)" → "Black Panther" → Marvel Studios → Marvel Productions → Stan Lee
All paths
The next query will find all resources and their respective paths that can reach resource Stan Lee with a minimum
of five edges using a wildcard predicate pattern.
As with Shortest path, path edges are returned as RDF-star statements through the path:resultBinding property.
Each edge is indexed with the path:resultBindingIndex property.
The first returned path will be:
Visualizing path search results is possible through the CONSTRUCT query where you can propagate bindings from
each edge through the path:startNode, path:endNode, path:exportBinding (for more complex traversals), and
path:propertyBinding (when not specifying graph patterns) to the CONSTRUCT query projection.
CONSTRUCT {
?start ?edgeLabel ?end
} WHERE {
VALUES (?dst) {
( dbr:Stan_Lee )
}
SERVICE <http://www.ontotext.com/path#search> {
<urn:path> path:findPath path:allPaths ;
path:sourceNode ?src ;
path:destinationNode ?dst ;
path:minPathLength 5 ;
path:startNode ?start ;
path:propertyBinding ?edgeLabel ;
path:endNode ?end ;
}
}
With the Visual button now visible at the bottom right of the SPARQL editor, you can see the results in the visual
graph:
Warning: The graph visualization tool is not fully compatible with the graph path search functionality and in
most cases would not display every path returned by the path search query.
Now, let’s find all resources and their respective paths that can be reached by the resource “Guardians of the
Galaxy (TV series)” with a minimum of four and a maximum of five edges using a wildcard predicate pattern.
All edge nodes as well as predicates connecting them are viewed through the path:startNode,
path:propertyBinding, and path:endNode properties.
Tip: There is more than one way to return results – for example, path edges can be returned as RDF-star statements through the path:resultBinding property.
All paths search with graph pattern - bound source & destination
Similarly to the example for shortest path search with graph pattern from earlier, we will be searching for all
paths between source node “The Black Panther (1977)” and destination node Stan Lee, but this time excluding
any properties of the type http://dbpedia.org/property/keyPerson. The path traversal will be executed using
a graph pattern specifying the exclusion of this property type through property negation with the SPARQL 1.1
property paths syntax.
PREFIX path: <http://www.ontotext.com/path#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
Path edges are returned as RDF-star statements through the path:resultBinding property, and each edge is indexed with the path:resultBindingIndex property.
We can see that the first identified path excluding the DBpedia keyPerson property traverses the following nodes: the movie "The Black Panther (1977)" → Marvel Studios → Marvel Entertainment → Stan Lee.
Note: Keep in mind that when using graph patterns, we cannot view the predicates connecting the nodes. Thus, when exploring the path edges as RDF-star statements, the predicate http://www.ontotext.com/path#connectedTo is generated.
You might be familiar with the Six Degrees of Kevin Bacon parlor game where players arbitrarily choose an actor
and then connect them to another actor via a film that both actors have starred in, repeating this process to try and
find the shortest path that ultimately leads to famous US actor Kevin Bacon. The game is a reference to the six
degrees of separation concept based on the assumption that any two people on Earth are six or fewer acquaintance
links apart.
In this context, let's find all paths between source node Chris Evans and destination node Chris Hemsworth where the relationship between nodes is defined through an N-ary graph pattern based on actors co-starring in movies. The path search is limited with a minimum of two edges.
PREFIX path: <http://www.ontotext.com/path#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
Shortest distance
The next query finds the shortest distance between source node Marvel Studios and a date literal which represents
Marvel Studios President Kevin Feige’s birthday.
SELECT ?dist
WHERE {
VALUES (?src ?dst) {
( dbr:Marvel_Studios "1973-06-02"^^xsd:date )
}
SERVICE <http://www.ontotext.com/path#search> {
<urn:path> path:findPath path:distance ;
path:sourceNode ?src ;
path:destinationNode ?dst ;
path:distanceBinding ?dist;
}
}
We can see that the shortest path connecting them consists of two edges.
Cyclic path
The following query finds all paths that begin and end with source node Marvel Studios.
To demonstrate this functionality, let’s use the Shortest path search with wildcard predicate example from earlier.
To perform parallel path search, you need to set the path:poolSize property:
The query will return the same results but execute faster.
This query finds all paths between source node Chris Evans and destination node Chris Hemsworth where the relationship between nodes is defined through an N-ary graph pattern based on actors co-starring in movies. The path search is limited to a minimum of two edges. We also want to see the movies and their labels as part of the returned path.
Bidirectional search
This query finds the shortest bidirectional path between source node The Black Panther movie from 1977 and
destination node Marvel Studios.
Full-text search (FTS) indexing enables very fast queries over textual data. Typically, FTS is used to retrieve data that represents text written in a human language such as English, Spanish, or French.
GraphDB supports various mechanisms for performing full-text search depending on the use case and the needs of a given project.
The GraphDB connectors index, search, and retrieve entire documents composed of a set of RDF statements:
• They need a predefined data model that describes how every indexed document is constructed from a template of RDF statements.
• Queries search in one or more document fields.
• Results return the document ID.
See more about the full-text search with the GraphDB connectors, as well as the Lucene connector, the Solr connector, and the Elasticsearch connector.
GraphDB 10.1 introduced a simple FTS index that covers some basic FTS use cases. This index contains literals
and IRIs:
• There is no data model, so it is easy to set up.
• Queries search in literals and IRIs.
• Results return the matching literals and IRIs.
Note: When an index is supplied as a parameter, the language tag of the query string will be ignored.
When only the query is provided (the only required argument), it is possible and recommended to provide it directly
without constructing an RDF list. Thus, the pattern can be simplified to:
?value onto:fts query
Some examples:
• ("query" "en" 10): Search for “query” in the “en” index and limit results to 10.
• ("query" 15): Search for “query” in the index configured via ftsstringliteralsindex and limit results to
15.
• ("query"@de 20): Search for “query” in the “de” index and limit results to 20.
• ("query"@de-CH 20): Search for “query” in the “de” index and limit results to 20. Note that only the
language part of the tag deCH determines the index.
• ("query" "fr"): Search for “query” in the “fr” index and do not apply a limit.
• "query"@fr: Search for “query” in the “fr” index and do not apply a limit – when a sole argument is provided,
it does not need to be inside an RDF list.
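For instance, a query that supplies the full argument list might look like the following sketch (the search term, index, and limit are illustrative):
PREFIX onto: <http://www.ontotext.com/>
SELECT ?value {
    # Search for "query" in the "en" index and limit the results to 10
    ?value onto:fts ("query" "en" 10)
}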
Query syntax
Note: Some of the specialized query types are not text-analyzed. Lexical analysis is only run on complete terms, i.e., a term/phrase query. Query types containing incomplete terms (e.g., prefix/wildcard/regex/fuzzy query) skip the analysis stage and are directly added to the query tree. The only transformation applied to partial query terms is lowercasing.
This may lead to surprising results if you expect stemming or lemmatization. For example, searching for “resti*”
and expecting to find “resting” will not work when using the English analyzer since the word “resting” was analyzed
and indexed as “rest”.
Basic clauses
A query must contain one or more clauses. A clause can be a literal term, a phrase, a wildcard expression, or any
supported expression.
The following are some examples of simple one-clause queries:
test: Selects documents containing the word “test” (term clause).
"test equipment": Phrase search; selects documents containing the phrase “test equipment” (phrase clause).
"test failure"~4: Proximity search; selects documents containing the words “test” and “failure” within 4 words (positions) from each other. The provided “proximity” is technically translated into “edit distance” (maximum number of atomic word-moving operations required to transform the document’s phrase into the query phrase).
tes*: Prefix wildcard matching; selects documents containing words starting with “tes”, such as “test”, “testing” or “testable”.
/(p|n).st/: Documents containing word roots matching the provided regular expression, such as “post” or “nest”.
nest~2: Fuzzy term matching; documents containing words within 2-edits distance (2 additions, removals, or replacements of a letter) from “nest”, such as “test”, “net”, or “rests”.
You can combine clauses using Boolean AND, OR, and NOT operators to form more complex expressions, for example:
test AND results: Selects documents containing both the word “test” and the word “results”.
test OR suite OR results: Selects documents with at least one of “test”, “suite”, or “results”.
test AND NOT complete: Selects documents containing “test” and not containing “complete”.
test AND (pass* OR fail*): Grouping; use parentheses to specify the precedence of terms in a Boolean clause. The query will match documents containing “test” and a word starting with “pass” or “fail”.
(pass fail skip): Shorthand notation; documents containing at least one of “pass”, “fail”, or “skip”.
Note: The Boolean operators must be written in all caps, otherwise they are parsed as regular terms.
Range operators
To search for ranges of textual or numeric values, use square or curly brackets, for example:
[Jones TO Smith]: Inclusive range; selects documents that contain any value between “Jones” and “Smith”, including boundaries.
{Jones TO Smith}: Exclusive range; selects documents that contain any value between “Jones” and “Smith”, excluding boundaries.
{Jones TO *]: One-sided range; selects documents that contain any value larger than (i.e., sorted after) “Jones”.
Note: These will work intuitively only with the “iri” index, e.g., "[http://www.w3.org/2000/01/rdf-schema#comment TO http://www.w3.org/2000/01/rdf-schema#range]" will retrieve all IRIs that are alphabetically ordered between http://www.w3.org/2000/01/rdf-schema#comment and http://www.w3.org/2000/01/rdf-schema#range inclusive. If used with any of the other indexes, they will return matches but it will not be intuitive what they match.
Term boosting
Terms, quoted terms, term range expressions, and grouped clauses can have a floating-point weight boost applied to them to increase their score relative to other clauses. For example:
jones^2 OR smith^0.5: Prioritize documents with the “jones” term over matches on the “smith” term.
(a OR b NOT c)^2.5 OR d: Apply the boost to a subquery.
Most search terms can be put in double quotes, making special character escaping unnecessary. If the search term contains the quote character (or cannot be quoted for some reason), any character can be quoted with a backslash. For example:
\:\(quoted\+term\)\:: A single search term (quoted+term): with escape sequences. An alternative quoted form would be simpler: ":(quoted+term):".
A minimum-should-match operator can be applied to a disjunction Boolean query (a query with only “OR” subclauses) and forces the query to match documents with at least the provided number of these subclauses. For example:
(blue crab fish)@2: Matches all documents with at least two terms from the set [blue, crab, fish] (in any order).
((yellow OR blue) crab fish)@2: Subclauses of a Boolean query can themselves be complex queries; here the min-should-match selects documents that match at least two of the provided three subclauses.
Interval functions are a powerful tool for expressing search needs in terms of one or more contiguous fragments of text and their relationship to one another. All interval clauses start with the fn: prefix. For example:
fn:ordered(quick brown fox): Matches all documents with at least one ordered sequence of “quick”, “brown”, and “fox” terms.
fn:maxwidth(5 fn:atLeast(2 quick brown fox)): Matches all documents where at least two of the three terms “quick”, “brown”, and “fox” occur within five positions of each other.
The first thing we need to do in order to perform full-text search is to enable the FTS index. This can be done at repository creation by setting Enable full-text search (FTS) index to true, or at a later stage by editing the repository configuration.
Single language
Let’s say that our data is in a single supported language and we want to perform full-text search in order to find literals that match. Literals may or may not have a language tag, for example:
• “This is a literal in English without a language tag”
• “This is another literal in English with a language tag for the language only”@en
• “This is yet another literal tagged for English in Canada”@en-CA
To configure the search:
1. Create a repository.
2. In its configuration menu, enable the "en" index by setting FTS indexes to build to “en”.
3. The literals without a language tag need to go into the "en" index too, so we will set FTS index for xsd:string
literals to “en”.
Important: After each change applied to any of the FTS parameters, you need to restart the repository.
In the Workbench SPARQL editor, let’s insert the following sample data:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT DATA {
    <urn:d1> rdfs:label "This is a literal in English without a language tag",
        "This is another literal in English with a language tag for the language only"@en,
        "This is yet another literal tagged for English in Canada"@en-CA,
        "Let's pretend this literal isn't in English by tagging it as German"@de
}
The inserted literals can now be searched in several equivalent ways.
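A minimal sketch, assuming a search for the word “literal” (any word from the inserted data works the same way); the variants below differ only in how the “en” index is selected:
PREFIX onto: <http://www.ontotext.com/>
SELECT * {
    # Uses the index configured for xsd:string literals ("en" in this setup)
    ?value onto:fts "literal"
}
Or this one:
PREFIX onto: <http://www.ontotext.com/>
SELECT * {
    # The language tag of the query literal selects the "en" index
    ?value onto:fts "literal"@en
}
Or this one:
PREFIX onto: <http://www.ontotext.com/>
SELECT * {
    # The index is supplied explicitly as a second argument
    ?value onto:fts ("literal" "en")
}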
They will all return the first three literals (i.e., without the one tagged as German).
Multiple languages
Here, our data is in several supported languages (e.g., English and German) and we want to perform full-text search in order to find literals that match. Literals without a language tag are in one of the desired languages (e.g., English). The data may look like this:
• “This is a literal in English without a language tag”
• “This is another literal in English with a language tag for the language only”@en
• “This is yet another literal tagged for English in Canada”@en-CA
• “Das ist ein schönes deutsches Literal”@de
• “Dies hier ist ebenso ein hübsches deutsches Literal, aber aus der Schweiz”@de-CH
To configure the search:
1. Create a repository.
2. In its configuration menu, enable the "en" and "de" indexes by setting FTS indexes to build to “en, de”.
This can be extended with additional languages by adding them to the list.
3. The literals without a language tag need to go into the "en" index too, so we will set FTS index for xsd:string
literals to “en”.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT DATA {
    <urn:d2> rdfs:label "This is a literal in English without a language tag",
        "This is another literal in English with a language tag for the language only"@en,
        "This is yet another literal tagged for English in Canada"@en-CA,
        "Das ist ein schönes deutsches Literal"@de,
        "Dies hier ist ebenso ein hübsches deutsches Literal, aber aus der Schweiz"@de-CH
}
Searching in English is exactly the same as in the first use case. To search the additional German index, we must always specify it explicitly.
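A minimal sketch, searching for the word “deutsches” from the sample data; the two variants are equivalent:
PREFIX onto: <http://www.ontotext.com/>
SELECT * {
    # The language tag selects the "de" index
    ?value onto:fts "deutsches"@de
}
Or this:
PREFIX onto: <http://www.ontotext.com/>
SELECT * {
    # The index is supplied explicitly
    ?value onto:fts ("deutsches" "de")
}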
Note: Keep in mind that if you have other data in the repository, it may affect the results.
In this case, our data is in one or more supported languages (e.g., English and German) and we want to perform full-text search in order to find literals that match. Literals without a language tag should not be treated as any of those languages and need not be searched. Data may look like this:
• “This is another literal in English with a language tag for the language only”@en
• “This is yet another literal tagged for English in Canada”@en-CA
• “Das ist ein schönes deutsches Literal”@de
• “Dies hier ist ebenso ein hübsches deutsches Literal, aber aus der Schweiz”@de-CH
• “This is a literal in English without a language tag” (this must not be indexed)
To configure the search:
1. Create a repository.
2. In its configuration menu, enable the "en" and "de" indexes by setting FTS indexes to build to “en, de”.
This can be extended with additional languages by adding them to the list.
3. The literals without a language tag need to not be indexed, so we will set FTS index for xsd:string literals
to “none”.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT DATA {
    <urn:d3> rdfs:label "This is another literal in English with a language tag for the language only"@en,
        "This is yet another literal tagged for English in Canada"@en-CA,
        "Das ist ein schönes deutsches Literal"@de,
        "Dies hier ist ebenso ein hübsches deutsches Literal, aber aus der Schweiz"@de-CH
}
Searching in any of the languages requires specifying the index to search, because there is no default search index (FTS index for xsd:string literals is set to “none”).
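A minimal sketch, again assuming a search for the word “literal”:
PREFIX onto: <http://www.ontotext.com/>
SELECT * {
    # The language tag selects the "en" index
    ?value onto:fts "literal"@en
}
Or this:
PREFIX onto: <http://www.ontotext.com/>
SELECT * {
    # The index is supplied explicitly
    ?value onto:fts ("literal" "en")
}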
Both queries will return the two literals that are tagged for English but not the untagged one.
Here, our data is in one or more supported languages (e.g., English and German) and we want to perform full-text search in order to find literals that match.
Literals without a language tag should not be treated as any of those languages but should provide language-agnostic full-text search. These literals may be data like UUIDs or anything else that has a textual representation that we may want to search. Data may look like this:
• “This is another literal in English with a language tag for the language only”@en
• “This is yet another literal tagged for English in Canada”@en-CA
• “Das ist ein schönes deutsches Literal”@de
• “Dies hier ist ebenso ein hübsches deutsches Literal, aber aus der Schweiz”@de-CH
• “96ac1c60-7997-45a3-8dfe-b57b24c1cb62” (this will be indexed separately)
Important: The values of FTS indexes to build must contain the values for FTS index for xsd:string literals and FTS index for full-text indexing of IRIs, unless those are set to “none”.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT DATA {
    <urn:d4> rdfs:label "This is another literal in English with a language tag for the language only"@en,
        "This is yet another literal tagged for English in Canada"@en-CA,
        "Das ist ein schönes deutsches Literal"@de,
        "Dies hier ist ebenso ein hübsches deutsches Literal, aber aus der Schweiz"@de-CH,
        "96ac1c60-7997-45a3-8dfe-b57b24c1cb62"
}
Searching in any of the languages is like in the third example related to ignoring untagged literals, i.e., you need
to provide the index to search.
Searching in the untagged literals can be done like this:
PREFIX onto: <http://www.ontotext.com/>
select * {
?value onto:fts "b57*"
}
Or like this:
PREFIX onto: <http://www.ontotext.com/>
select * {
# The language tag of the query literal supplies the index to query
?value onto:fts "b57*"@default
}
Or like this:
PREFIX onto: <http://www.ontotext.com/>
select * {
    # The query string and the index to query are supplied as two separate values
    ?value onto:fts ("b57*" "default")
}
All of these queries will return the single untagged literal where "b57*" was matched to one of the hyphenated
components.
In this case, regardless of our need to search literals, we also want to search within IRIs, treating them as keywords (the entire IRI is considered a single searchable token). These can be any IRIs, such as:
• <http://www.w3.org/2000/01/rdf-schema#domain>
• <http://example.com/data/john>
• <http://example.com/data/mary>
• <http://example.com/data/william>
To configure the search:
1. Create a repository.
2. In its configuration menu, enable a special index called "iri" by adding it to the FTS indexes to build
property. For example, if we also want English literals to be indexed, we will set FTS indexes to build to
“en, iri”.
3. Set FTS index for xsd:string literals to “en” so that the literals without a language tag will go to the “en”
index.
INSERT DATA {
<http://example.com/data/john> rdfs:label "John" .
<http://example.com/data/mary> rdfs:label "Mary" .
<http://example.com/data/william> rdfs:label "William" .
}
To search the IRIs, you need to query the “iri” index by supplying “iri” as the index argument; such queries return matching http://example.com/... IRIs from the sample data.
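One possibility is to search for an exact IRI as a quoted keyword; a minimal sketch (quoting the term avoids escaping the IRI’s special characters, as described in the query syntax section):
PREFIX onto: <http://www.ontotext.com/>
SELECT * {
    # The whole IRI is a single keyword in the "iri" index,
    # so we search for it as a quoted term
    ?value onto:fts ('"http://example.com/data/john"' "iri")
}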
When the entire search string is a single keyword, which is the case for the “iri” index, you can also use range searches to find IRIs that sort between two IRIs.
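A minimal sketch using the sample IRIs; the bracketed range follows the Range operators syntax described above:
PREFIX onto: <http://www.ontotext.com/>
SELECT * {
    # Inclusive range: all IRIs in the "iri" index that sort between the two bounds
    ?value onto:fts ('[http://example.com/data/john TO http://example.com/data/william]' "iri")
}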
Indexing
In this scenario, regardless of our need to search literals, we also need to search within IRIs, treating them as regular text (the IRI is split into multiple searchable tokens). These are typically IRIs that are readable and are composed of words:
• <http://example.com/data/john>
• <http://example.com/data/mary>
• <http://example.com/data/william>
To configure the search:
1. Create a repository.
2. In its configuration menu, enable the index for the language we want by adding it to FTS indexes to build –
for English, we will set FTS indexes to build to “en”.
3. The value of FTS index for xsd:string literals must also be set to “en”.
4. We also need IRIs to be indexed for full-text search in the language we enabled, so we will set FTS index for full-text indexing of IRIs to “en”.
INSERT DATA {
<http://example.com/data/john> rdfs:label "John" .
<http://example.com/data/mary> rdfs:label "Mary" .
<http://example.com/data/william> rdfs:label "William" .
}
IRIs are then searchable in the "en" index just like literals.
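A minimal sketch, searching for “john” in two equivalent ways:
PREFIX onto: <http://www.ontotext.com/>
SELECT * {
    # Uses the index configured for xsd:string literals ("en")
    ?value onto:fts "john"
}
Or like this:
PREFIX onto: <http://www.ontotext.com/>
SELECT * {
    # The index is supplied explicitly
    ?value onto:fts ("john" "en")
}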
Both of these queries will return the IRI http://example.com/data/john, as well as the literal "John".
All literals where “luke” and “vader” are near each other
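Such a search can be expressed with a proximity query; a minimal sketch (the proximity value of 5 is an arbitrary choice):
PREFIX onto: <http://www.ontotext.com/>
SELECT * {
    # "luke" and "vader" within 5 positions of each other
    ?value onto:fts '"luke vader"~5'
}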
Note that the above searches in the "en" index since the default index is disabled and we requested xsd:string
literals to go to the "en" index.
Note that we use single quotes for the query literal to avoid escaping the double quotes that are part of the full-text search query.
It returns several results, some of which are Luke’s grandmother Shmi Skywalker and Luke’s father Anakin Skywalker (before he became Darth Vader).
It returns many results, some of which are “The Empire Strikes Back” and “Return of the Jedi”. This illustrates how full-text search tuned to a specific language (in this case English) is able to match “striking” to “strikes” and “jedis” to “jedi”.
Note that a query written like that does not need all tokens to be present in the matched result; in other words, the query is equivalent to “striking OR jedis”.
It returns matches like “Ahmed Best”, “Oscar für den besten Film” and “Oscar für die beste Regie”, again illustrating the ability of FTS to match different word forms in German.
It returns matches like “Oscar de la meilleure actrice” and “Oscar du meilleur acteur”, again illustrating the ability of FTS to match different word forms in French.
It returns matches like “Oscar al miglior film”, “Oscar ai migliori costumi” and “Oscar alla migliore scenografia”, again illustrating the ability of FTS to match different word forms in Italian.
It returns matches like “Película del 2005” and “personaje de ficción el las películas de Star Wars”, again illustrating the ability of FTS to match different word forms in Spanish but also the ability to ignore diacritics when searching.
The similarity plugin allows exploring and searching semantic similarity in RDF resources.
As a user, you may want to solve cases where statistical semantics queries will be highly valuable, for example:
For this text (encoded as a literal in the database), return the closest texts based on a vector space model.
Another type of use case is the clustering of news (from a news feed) into groups by discussing events.
Humans determine the similarity between texts based on the similarity of the composing words and their abstract meaning. Documents containing similar words are semantically related, and words frequently co-occurring are also considered close. The plugin supports document and term searches. A document is a literal or an aggregation of multiple literals, and a term is a word from a document.
There are four types of similarity searches:
• Term to term returns the closest semantically related terms
• Term to document returns the most representative documents for a specific searched term
• Document to term returns the most representative terms for a specific document
• Document to document returns the closest related texts
The similarity plugin integrates the semantic vectors library and the underlying Random Indexing algorithm. The algorithm uses a tokenizer to translate documents into sequences of words (terms) and to represent them in a vector space model capturing their abstract meaning. A distinctive feature of the algorithm is the dimensionality reduction approach based on Random Projection, where the initial vector state is generated randomly. With the indexing of each document, the term vectors are adjusted based on the contextual words. This approach makes the algorithm highly scalable for very large text corpora, and research papers have shown that its efficiency is comparable to more sound dimensionality reduction algorithms such as singular value decomposition.
The example shows terms similar to “novichok” in the search index allNews that we will look at in more detail
below. The term “novichok” is used in the search field. The selected option for both Search type and Result type
is Term. Sample results of terms similar to “novichok”, listed by their score, are given below.
The term “novichok” is used as an example again. The selected option for Search type is Term, and for Result
type is Document. Sample results of the most representative documents for a specific searched term, listed by their
score, are given below.
The result with the highest score from the previous search is used in the new search. The selected option for Search
type is Document, and for Result type is Term. Sample results of the most representative terms, listed by their score,
are given below.
A search for the texts closest to the selected document is also possible. The same document is used in the search
field. Sample results of the documents with the closest texts to the selected document listed by their score are
given below. The titles of the documents prove that their content is similar, even though the sources are different.
To obtain the sample results listed above, you need to download data and create an index.
The following examples use data from factforge.net. News from January to April 2018, together with their content,
creationDate, and mentionsEntity triples, are downloaded.
1. Run the following CONSTRUCT query against the FactForge SPARQL endpoint:
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT {
    ?document ff-map:mentionsEntity ?entity .
    ?document pubo:content ?content .
    ?document pubo:creationDate ?date .
} WHERE {
    ?document a pubo:Document .
    ?document ff-map:mentionsEntity ?entity .
    ?document pubo:content ?content .
    ?document pubo:creationDate ?date .
    FILTER ( (?date > "2018-01-01"^^xsd:dateTime) && (?date < "2018-04-30"^^xsd:dateTime) )
}
2. Download the data via the Download As button, choosing the Turtle option. It will take some time to export
the data to the query-result.ttl file.
3. Open your GraphDB instance and create a new repository called “news”.
4. Move the downloaded file to the <HOME>/graphdb-import folder so that it is visible in Import ‣ Server files (see how to import server files).
5. Import the query-result.ttl file into the “news” repository.
6. Go to Setup and enable the Autocomplete index for the “news” repository. It is used for autocompletion of
URLs in the SPARQL editor and the View resource page.
1. Go to Explore ‣ Similarity ‣ Create similarity index and create a text similarity index, leaving the default Data query. This will index the content, where the ID of a document is the news piece’s IRI, and the text is the content.
2. Name the index allNews, save it, and wait until it is ready.
3. Once the index has been created, you can see the following options on the right:
• With the {…} button, you can review or copy the SPARQL query that this index was created
with;
• The Edit icon allows you to modify the search query without having to build an index;
• You can also create a new index from an existing one;
• Rebuild the index;
• As well as delete it.
A list of creation parameters under More options ‣ Semantic Vectors create index parameters can be used to further
configure the similarity index.
• seedlength: Number of nonzero entries in a sparse random vector; the default value is 10, except when vectortype is binary, in which case a default of dimension / 2 is enforced. For real and complex vectors the default value is 10, but it is a good idea to use a higher value when the vector dimension is higher than 200. The simplest thing to do is to preserve this ratio, i.e., to divide the dimension by 20. It is worth mentioning that in the original implementation of random indexing, the ratio of nonzero elements was 1/3.
• trainingcycles: Number of training cycles used for Reflective Random Indexing.
• termweight: Term weighting used when constructing document vectors. Values can be none, idf, logentropy, sqrt. It is a good idea to use term weighting when building indexes, so we add -termweight idf as a default when creating an index. It uses inverse document frequency when building the vectors. See LuceneUtils for more details.
• minfrequency: Minimum number of times that a term has to occur in order to be indexed. The default value is 0, but it would be a bad idea to use it, as that would add a lot of big numbers, weird terms, and misspelled words to the list of word vectors. The best approach is to set it as a fraction of the total word count in the corpus, for example 40 per million as a frequency threshold. Another approach is to start with an intuitive value, a single-digit number like 3 or 4, and fine-tune from there.
• maxfrequency: Maximum number of times that a term can occur before getting removed from indexes.
Default value is Integer.MAX_VALUE. Again, a better approach is to calculate it as a percentage of the total
word count. Otherwise, you can use the default value and add most common English words to the stop list.
• maxnonalphabetchars: Maximum number of non-alphabetic characters a term can contain in order to be indexed. The default value is Integer.MAX_VALUE. Recommended values depend on the dataset and the type of terms it contains, but setting it to 0 works pretty well for most basic cases, as it takes care of punctuation (if data has not been preprocessed), malformed terms, and weird codes and abbreviations.
• filternumbers: true/false, index numbers or not.
• mintermlength: Minimum number of characters in a term.
• indexfileformat: Format used for serializing/deserializing vectors from disk; the default is lucene. Another option is text, which may be used for debugging to see the actual vectors, but it is too slow on real data.
Disabled parameters
• luceneindexpath: Currently, you are not allowed to build your own Lucene index and create vectors from
it since index + vectors creation is all done in one step.
• stoplistfile: Replaced by the <http://www.ontotext.com/graphdb/similarity/stopList> predicate.
Stop words are passed as a string literal as opposed to a file.
• elementalmethod
• docindexing
In the Stop words field, add a custom list of stop words to be passed to the Semantic Vector plugin. If left empty,
the default Lucene stop words list will be used.
In the Analyzer class field, set a Lucene analyzer to be used during Semantic Vector indexing and query time
tokenization. The default is org.apache.lucene.analysis.en.EnglishAnalyzer, but it can be any from the
supported list as well.
Additionally, the Lucene connector also supports custom Analyzer implementations. This way you can create your
own analyzer and add it to a classpath. The value of the Analyzer Class parameter must be a fully qualified name
of a class that extends org.apache.lucene.analysis.Analyzer.
Go to the list of indexes and click on allNews. For search options, select Search type to be either Term or Document.
The Result type can also be either Term or Document.
Search parameters
Expand the Search options to configure more parameters for your search.
• searchtype: Different types of searches can be performed. Most involve processing combinations of vec
tors in different ways, in building a query expression, scoring candidates against these query expressions, or
both. Default is sum that builds a query by adding together (weighted) vectors for each of the query terms,
and search using cosine similarity. See more about SearchType here.
• matchcase: If true, matching of query terms is case-sensitive; otherwise it is case-insensitive. The default value is false.
PREFIX similarity-index:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX similarity:<http://www.ontotext.com/graphdb/similarity/>
INSERT DATA {
similarity-index:my_index similarity:deleteIndex "" .
}
To rebuild an index, simply create it again following the steps shown above.
GraphDB enables you to use the similarity index with no downtime while the database is being modified. While
rebuilding the index, its last successfully built version is preserved until the new index is ready. This way, when
you search in it during rebuild, the retrieved results will be from this last version. The following message will
notify you of this:
Locality-sensitive hashing
Note: As locality-sensitive hashing does not guarantee the retrieval of the most similar results, this hashing is not the most suitable option if precision is essential. Hashing with the same configuration over the same data does not guarantee the same search results.
Locality-sensitive hashing is introduced in order to reduce search times. Without a hashing algorithm, a search consists of the following steps:
1. A search vector is generated.
2. All vectors in store are compared to this search vector, and the most similar ones are returned as matches.
While this approach is complete and accurate, it is also time-consuming. In order to speed up the process, hashing can be used to reduce the number of candidates for most similar vectors. This is where locality-sensitive hashing can be very useful.
The locality-sensitive hashing algorithm has two parameters that can be passed either during index creation, or as a search option:
• lsh_hashes_num: The number of n random vectors used for hashing, default value is 0.
• lsh_max_bits_diff: The m number of bits by which two hashes can differ and still be considered similar,
default value is 0.
The hashing workflow is as follows:
1. An n number of random orthogonal vectors are generated.
2. Each vector in store is compared to each of those vectors (checking whether their scalar product is positive
or not).
3. Given this data, a hash is generated for each of the vectors in store.
During a search, the workflow is as follows:
1. A search vector is generated.
2. A hash is generated for this search vector by comparing it to the n number of random vectors used during
the initial hashing.
3. All hashes similar to that of the search vector are found (a hash is considered similar when it differs by up to m bits from the original one).
4. All vectors with such hash are collected and compared to the generated vector in order to get the closest
ones, based on the assumption that the vectors with similar hashes will be close to each other.
Note: If both parameters have the same value, then all possible hashes are considered similar and therefore no
filtering is done. For optimization purposes in this scenario, the entire hashing logic has been bypassed.
If one of the parameters is specified during the index creation, then its value will be used as the default one for
searching.
Note: If lsh_max_bits_diff is too close to lsh_hashes_num, the performance can be poorer compared to the
default one because of the computational overhead.
d. On top of the returned results, click the View SPARQL Query option. It will contain the following
query:
PREFIX :<http://www.ontotext.com/graphdb/similarity/>
PREFIX inst:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
SELECT ?documentID ?score {
    ?search a inst:allNews ;
        :searchDocumentID <http://www.uawire.org/merkel-and-putin-discuss-syria-and-nord-stream-2> ;
        :searchParameters "";
        :documentResult ?result .
    ?result :value ?documentID ;
        :score ?score .
}
PREFIX :<http://www.ontotext.com/graphdb/similarity/>
PREFIX inst:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?documentID ?score ?matchDate ?searchDate {
BIND (<http://www.uawire.org/merkel-and-putin-discuss-syria-and-nord-stream-2> as ?searchDocumentID )
?search a inst:allNews ;
:searchDocumentID ?searchDocumentID;
:searchParameters "";
:documentResult ?result .
?result :value ?documentID ;
:score ?score.
?documentID pubo:creationDate ?matchDate .
?searchDocumentID pubo:creationDate ?searchDate .
FILTER (?matchDate > ?searchDate - "P2D"^^xsd:duration && ?matchDate < ?searchDate + "P2D"^^xsd:duration)
}
Search for similar news, get their creationDate and filter only the news within the time period
of two days.
Do the same for February, March, and April by changing the date range. For each month, go to the corresponding index and select Term for both Search type and Result type. Type “korea” in the search field. See how the results change over time.
It is possible to boost the weight of a given term in the text-based similarity index for term-based searches (Term to term or Term to document). Boosting a term’s weight can be done by using the caret symbol ^ followed by a boosting factor (a positive decimal number): term^factor.
For example, UK Brexit^3 EU will perform a search in which the term “Brexit” will have 3 times more weight than “UK” and “EU”, and the results will be expected to be mainly related to “Brexit”.
The default boosting factor is 1. Setting a boosting factor of 0 will completely ignore the given term. Escaping the caret symbol ^ is done with a double backslash: \\^.
Note: The boosting will not work in document-based searches (Document to term or Document to document), meaning that a caret followed by a number will not be treated as a weight boosting symbol.
Predication-based Semantic Indexing, or PSI, is an application of distributional semantic techniques for reasoning and inference. PSI starts with a collection of known facts or observations, and combines them into a single semantic vector model, in which both concepts and relationships are represented. This way, the usual ways for constructing query vectors and searching for results in Semantic Vectors can be used to suggest similar concepts based on the knowledge graph.
The predication-based semantic search examples are based on Person data from the DBpedia dataset. The sample dataset contains over 730,000 triples for over 101,000 persons born between 1960 and 1970.
1. Download the provided persons-1960-1970 dataset.
2. Unzip it and import the .ttl file into a repository.
3. Enable the Autocomplete index for the repository from Setup � Autocomplete.
For ease of use, you may add the following namespaces for the example dataset (done from Setup � Namespaces):
• dbo: http://dbpedia.org/ontology/
• dbr: http://dbpedia.org/resource/
• foaf: http://xmlns.com/foaf/0.1/
1. From Explore ‣ Similarity ‣ Create similarity index, select Create predication index.
2. Fill in the index name, and add the desired Semantic Vectors create index parameters. For example, it is a good idea to use term weighting when building indexes, so we will add -termweight idf. Also, for better results, set -dimension higher than 200, which is the default.
3. Configure the Data query. This SPARQL SELECT query determines the data that will be indexed. The
query must SELECT the following bindings:
• ?subject
• ?predicate
• ?object
The Data query is executed during index creation to obtain the actual data for the index. When
data in your repo changes, you need to also rebuild the index. It is a subquery of a more compli
cated query that you can see with the View Index Query button.
For the given example, leave the default Data query. This will create an index with all triples in
the repo:
4. Set the Search query. This SELECT query determines the data that will be fetched on search. The Search
query is executed during search. Add more bindings by modifying this query to see more data in the results
table.
For this example, set the Search query to:
PREFIX similarity:<http://www.ontotext.com/graphdb/similarity/>
PREFIX similarity-index:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX psi:<http://www.ontotext.com/graphdb/similarity/psi/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
In the list of Existing indexes, select the people_60s index that you will search in.
In our example, we will be looking for individuals similar to Hristo Stoichkov – the most famous Bulgarian football
player.
In the results, you can see Bulgarian football players born in the same town, other Bulgarian athletes born in the
same place, as well as other people with the same birth date.
Analogical searches
Along with searching explicit relations and similarities, PSI can also be used for analogical search.
Suppose you have a dataset with currencies and countries, and want to know the following: “If I use dollars in
the USA, what do I use in Mexico?” By using the predicate index, you do not need to know the predicate (“has
currency”).
1. Import the Nations.ttl sample dataset into a repository.
2. Build an Autocomplete index for the repository.
3. Build a predication index following the steps above.
4. Once the index is built, you can use the Analogical search option of your index. In logical terms, your query
will translate to “If USA implies dollars, what does Mexico imply?”
As you can see, the first result is peso, the Mexican currency. The rest of the results are not relevant in this situation
since they are part of a very small dataset.
PSI supplements traditional tools for artificial inference by giving “nearby” results. In cases where there is a single
clear winner, this is essentially the behavior of giving “one right answer”. But in cases where there are several
possible plausible answers, having robust approximate answers can be greatly beneficial.
When building a Predication index, it creates a random vector for each entity in the database, and uses these random
vectors to generate the similarity vectors to be used later on for similarity searches. This approach does not take
into consideration the similarity between the literals themselves. Let’s examine the following example, using the
FactForge data from the previous parts of the page:
Naturally, we would expect the first news article to be more similar to the second one than to the third one, not only based on their topics (Poland’s relationship with the EU) but also because of their dates. However, the normal Predication index would not take into account the similarity of the dates, and all news would have fairly close scores. In order to handle this type of scenario, we can first create a Text similarity index. It will find that the dates of the three articles are similar, and will then use this information when building the Predication index.
In order to do so, you need to:
Dates, as presented in FactForge, are not literals that the similarity plugin can handle easily. This is why you need
to format them to something easier to parse.
Replacing dateTime with a simple string will enable you to create a Literal index.
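A minimal sketch of such a reformatting update; it assumes we simply overwrite each xsd:dateTime value of pubo:creationDate with its plain string date part (the exact predicate and format used in the walkthrough may differ):
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
DELETE { ?document pubo:creationDate ?date }
INSERT { ?document pubo:creationDate ?dateString }
WHERE {
    ?document pubo:creationDate ?date .
    FILTER (datatype(?date) = xsd:dateTime)
    # Keep only the date part as a plain string, e.g., "2018-03-25"
    BIND (STRBEFORE(STR(?date), "T") AS ?dateString)
}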
At this stage, you should enable Autocomplete if you have not done so already, so as to make testing easier. Go to Setup and enable the Autocomplete index for the repository.
The Literal index is a subtype of the Text index. To build it, create a normal Text index and tick the Literal index checkbox in the More options menu. This type of index can only be used as an input index for predication indexes, which is indicated on the Similarity page; it cannot be used for similarity searching. The index will include all literals returned by the ?documentText variable from the Data query.
Make sure to filter out the mentions, so the data in the Literal index only contains the news. When creating the
index, use the following Data query:
SELECT ?documentID ?documentText {
?documentID ?p ?documentText .
filter(isLiteral(?documentText))
filter (?p != <http://factforge.net/ff2016-mapping/mentionsEntity>)
}
When creating the predication index from the More options menu, select Input Literal Index > the index created
in the previous step.
Since you do not want to look at mentions, and in this sense the default data format is useless, you need to filter
them out from the data used in the predication index. Add the following Data query:
SELECT ?subject ?predicate ?object
WHERE {
?subject ?predicate ?object .
filter (?predicate != <http://factforge.net/ff2016-mapping/mentionsEntity>)
filter (?predicate != <http://ontology.ontotext.com/publishing#creationDate>)
}
For the purposes of the test, we want to also display the new formatted date when retrieving data. Go to the search
query tab and add the following query:
PREFIX similarity:<http://www.ontotext.com/graphdb/similarity/>
PREFIX similarity-index:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX psi:<http://www.ontotext.com/graphdb/similarity/psi/>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
With those two queries in place, the data returned from the index should be more useful. Create your hybrid
predication index and wait for the process to be completed. Then, open it and run a query for “donald tusk”,
selecting the express article about “Polexit” from the Autocomplete suggest box. You will see that the first results
are related to the Polexit and dated the same.
Indexing behavior
When building the Literal index, it is a good idea to index all literals that will be indexed in the Predication index,
or at least all literals of the same type. Continuing with the example above, let’s say that the Literal index you have
created only returns these three news pieces. Add the following triple about a hypothetical Guardian article, and
create a Predication index to index all news:
Based on the triples, it would be expected that the first article will be equally similar to the third one and to the new one, as their contents and dates have little in common. However, depending on the binding method used when creating the Predication index, you can get a higher score for the third article compared to the new one only because the third article has been indexed by the Literal index. There are two ways to easily avoid this: index either all literals, or at least all dates.
Manual creation
If you are not using the Similarity page, you could pass the following options when creating the indexes:
• -literal_index true: passed to a Text index, creates a Literal index;
• -input_index <literalIndex> (replace <literalIndex> with the name of an existing Literal index): passed to a Predication index, creates a hybrid index based on that Literal index.
When building Text and Predication indexes, training cycles can be used to increase the accuracy of the index. The number of training cycles can be set by passing the option:
• -trainingcycles <numOfCycles>: The default number of training cycles is 0.
Text and Predication indexes have quite different implementations of the training cycles.
Text indexes just repeat the same algorithm multiple times, which leads to algorithm convergence.
Predication indexes initially start the training with a random vector for each entity in the database. On each cycle,
the initially random elemental vectors are replaced with the product of the previous cycle, and the algorithm is run
again. In addition to the entity vectors, the predicate vectors get trained as well. This leads to higher computational
time for a cycle compared to the initial run (with trainingcycles = 0).
Note: Each training cycle is computationally expensive and time-consuming, and a higher number of cycles will greatly increase the building time.
GraphDB offers two independent extensions that provide indexing and accelerated querying of geographic data:
GraphDB provides support for 2-dimensional geospatial data that uses the WGS84 Geo Positioning RDF vocabulary (World Geodetic System 1984). Specialized indexes can be used for this type of data, which allow efficient evaluation of query forms and extension functions for finding locations:
• within a certain distance of a point, i.e., within a specified circle on the surface of a sphere (Earth), using the
nearby(…) construction;
• within rectangles and polygons, where the vertices are defined by spherical polar coordinates, using the
within(…) construction.
SpatialThing: A class for representing anything with a spatial extent, i.e., size, shape, or position.
Point: A class for representing a point (relative to Earth) defined by latitude, longitude (and altitude). subClassOf http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing
location: The relation between a thing and where it is. Range SpatialThing; subPropertyOf http://xmlns.com/foaf/0.1/based_near
lat: The WGS84 latitude of a SpatialThing (decimal degrees). domain http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing
long: The WGS84 longitude of a SpatialThing (decimal degrees). domain http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing
lat_long: A comma-separated representation of a latitude, longitude coordinate.
alt: The WGS84 altitude of a SpatialThing (decimal meters above the local reference ellipsoid). domain http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing
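The geospatial index itself is built with a control update; a minimal sketch, assuming the createIndex control predicate in the omgeo namespace:
PREFIX omgeo: <http://www.ontotext.com/owlim/geo#>
INSERT DATA {
    # Build the geospatial index over the WGS84 point data in the repository
    # (createIndex is assumed to be the control predicate; check the geospatial index documentation)
    _:b1 omgeo:createIndex _:b2 .
}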
If all geospatial data is indexed successfully, the above update query will succeed. If there is an error, you will get
a notification about a failed transaction and an error will be registered in the GraphDB log files.
Note: If there is no geospatial data in the repository, i.e., no statements describing resources with latitude and
longitude properties, this update query will fail.
The geospatial query syntax is the SPARQL RDF Collections syntax. It uses round brackets as a shorthand for the statements that connect a list of values using the rdf:first and rdf:rest predicates, terminated by rdf:nil. Statement patterns that use the custom geospatial predicates supported by GraphDB are treated differently by the query engine.
The following special syntax is supported when evaluating SPARQL queries. All descriptions use the namespace:
omgeo: <http://www.ontotext.com/owlim/geo#>
At present, there is just one SPARQL extension function. The prefix omgeo: stands for the namespace <http://
www.ontotext.com/owlim/geo#>.
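As an illustration, a nearby search might look like the following sketch (the coordinates, distance, and variable name are arbitrary, and the argument order latitude, longitude, distance is assumed):
PREFIX omgeo: <http://www.ontotext.com/owlim/geo#>
SELECT ?place
WHERE {
    # Resources with WGS84 coordinates within 50 miles of the given point
    ?place omgeo:nearby(37.78 -122.42 "50mi") .
}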
Implementation details
Knowing the implementation’s algorithms and assumptions allows you to make the best use of the GraphDB geospatial extensions.
The following aspects are significant and can affect the expected behavior during query answering:
• Spherical Earth: the current implementation treats the Earth as a perfect sphere with a 6371.009 km radius;
• Only 2-dimensional points are supported, i.e., there is no special handling of geo:alt (metres above the reference surface of the Earth);
• All latitude and longitude values must be specified using decimal degrees, where East and North are positive, -90 <= latitude <= +90 and -180 <= longitude <= +180;
• Distances must be in units of kilometers (suffix ‘km’) or statute miles (suffix ‘mi’). If the suffix is omitted, kilometers are assumed;
• The omgeo:within( rectangle ) construct uses a ‘rectangle’ whose edges are lines of latitude and longitude, so the north-south distance is constant, and the rectangle described forms a band around the Earth, which starts and stops at the given longitudes;
• omgeo:within( polygon ) joins vertices with straight lines on a cylindrical projection of the Earth tangential to the equator. A straight line starting at the point under test and continuing East out of the polygon is examined to see how many polygon edges it intersects. If the number of intersections is even, then the point is outside the polygon. If the number of intersections is odd, the point is inside the polygon. With the current algorithm, the order of vertices is not relevant (clockwise or anticlockwise);
• omgeo:within() may not work correctly when the region (polygon or rectangle) spans the +/-180 meridian;
What is GeoSPARQL
GeoSPARQL is a standard for representing and querying geospatial linked data for the Semantic Web from the
Open Geospatial Consortium (OGC). The standard provides:
• a small topological ontology in RDFS/OWL for representation using Geography Markup Language (GML) and Well-Known Text (WKT) literals;
• Simple Features, RCC8, and Egenhofer topological relationship vocabularies and ontologies for qualitative reasoning;
• a SPARQL query interface using a set of topological SPARQL extension functions for quantitative reasoning.
The GraphDB GeoSPARQL plugin allows the conversion of Well-Known Text from different coordinate reference systems (CRS) into the CRS84 format, which is the default CRS according to the Open Geospatial Consortium (OGC). You can input data of all known CRS types; it will be properly indexed by the plugin, and you will also be able to query it in both the default CRS84 format and in the format in which it was imported.
The following is a simplified diagram of the GeoSPARQL classes Feature and Geometry, as well as some of their
properties:
Usage
Configuration parameters
Parameter enabled
Predicate <http://www.ontotext.com/plugins/geosparql#enabled>
Description Enables and disables plugin
Default false
Example
PREFIX geoSparql: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA { [] geoSparql:enabled "true" . }
Parameter prefixTree
Predicate <http://www.ontotext.com/plugins/geosparql#prefixTree>
Description Implementation of the tree used while building the index; stores value before rebuilding.
Default prefixTree.QUAD
Example
PREFIX geoSparql: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA { [] geoSparql:prefixTree "geohash" . }
Parameter precision
Predicate <http://www.ontotext.com/plugins/geosparql#precision>
Description Specifies the desired precision; stores the value before rebuilding
Default 11; min value 1; max value depends on the prefixTree used (24 for geohash and 50 for QUAD)
Example
Example
PREFIX geoSparql: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA { [] geoSparql:precision "11" . }
Parameter currentPrefixTree
Predicate <http://www.ontotext.com/plugins/geosparql#currentPrefixTree>
Description Value of last built index
Default PrefixTree.QUAD
Example
PREFIX geoSparql: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA { [] geoSparql:currentPrefixTree "geohash" . }
Parameter currentPrecision
Predicate <http://www.ontotext.com/plugins/geosparql#currentPrecision>
Description Value of last built index
Default 11
Example
PREFIX geoSparql: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA { [] geoSparql:currentPrecision "11" . }
Parameter maxBufferedDocs
Predicate <http://www.ontotext.com/plugins/geosparql#maxBufferedDocs>
Description Speeds up building and rebuilding of index
Default 1,000 (max. allowed 5,000)
Example
PREFIX geoSparql: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA { [] geoSparql:maxBufferedDocs "3000" . }
Parameter ramBufferSizeMB
Predicate <http://www.ontotext.com/plugins/geosparql#ramBufferSizeMB>
Description Speeds up building and rebuilding of index
Default 32.0 (max. allowed 512.0)
Example
PREFIX geoSparql: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA { [] geoSparql:ramBufferSizeMB "256.0" . }
Parameter ignoreErrors
Predicate <http://www.ontotext.com/plugins/geosparql#ignoreErrors>
Description Ensures building of the index even in case of erroneous data
Default false
Example
PREFIX geoSparql: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA { [] geoSparql:ignoreErrors "true" . }
The plugin allows you to configure it through SPARQL UPDATE queries with embedded control predicates.
Enable plugin
When the plugin is enabled, it indexes all existing GeoSPARQL data in the repository and automatically reindexes
any updates.
PREFIX geosparql: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA {
    [] geosparql:enabled "true" .
}
Note: All functions require as input WKT or GML literals while the predicates expect resources of type
geo:Feature or geo:Geometry. The GraphDB implementation has a nonstandard extension that allows you to
use literals with the predicates too. See Example 2 (using predicates) for an example of that usage.
Warning: All GeoSPARQL functions starting with geof:, like geof:sfOverlaps, do not use any indexes and are always enabled! That is why it is recommended to use the indexed operations like geo:sfOverlaps whenever possible.
Disable plugin
When the plugin is disabled, it does not index any data or process updates. It does not handle any of the
GeoSPARQL predicates either.
PREFIX geosparql: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA {
    [] geosparql:enabled "false" .
}
The plugin supports two indexing algorithms: quad prefix tree and geohash prefix tree. Both algorithms support approximate matching controlled with the precision parameter. The default precision value of 11 for the quad prefix tree corresponds to about ±2.5 km on the equator. When increased to 20, the accuracy improves to about ±6 m. Respectively, the geohash prefix tree with precision 11 results in about ±1 m accuracy.
PREFIX geosparql: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA {
    [] geosparql:prefixTree "quad"; #geohash
       geosparql:precision "25".
}
To speed up the building and rebuilding of your GeoSPARQL index, we recommend setting higher values for the
ramBufferSizeMB and maxBufferedDocs parameters. This disables the Lucene IndexWriter autocommit, and starts
flushing disk changes if one of these values is reached.
Default and maximum values are as follows:
• ramBufferSizeMB default 32.0, maximum 512.0.
• maxBufferedDocs default 1,000, maximum 5,000.
Depending on your dataset and machine parameters, you can experiment with the values to find the ones most
suitable for your use case.
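For example, both parameters can be raised with a single update (the values shown are illustrative):
PREFIX geosparql: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA {
    # Larger buffers reduce the number of intermediate flushes while indexing
    [] geosparql:ramBufferSizeMB "256.0" ;
       geosparql:maxBufferedDocs "3000" .
}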
Note: However, do not set these values too high, otherwise you may hit an IndexWriter overmerging issue.
This configuration option is usually used after a configuration change or when index files are either corrupted or
have been mistakenly deleted.
# The onto-geo prefix is assumed to map to the GeoSPARQL plugin namespace
PREFIX onto-geo: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA {
    [] onto-geo:forceReindex []
}
PREFIX geosparql: <http://www.ontotext.com/plugins/geosparql#>
INSERT DATA {
    [] geosparql:ignoreErrors "true"
}
The ignoreErrors predicate determines whether the GeoSPARQL index will continue building if an error has occurred. If the value is set to false, the whole index will fail if there is a problem with a document. If the value is set to true, the index will continue building and a warning will be logged. By default, the value of ignoreErrors is false.
GeoSPARQL extensions
On top of the standard GeoSPARQL functions, GraphDB adds a few useful extensions based on the USeekM
library. The prefix geoext: stands for the namespace <http://rdf.useekm.com/ext#>.
The types geo:Geometry, geo:Point, etc. refer to GeoSPARQL types in the http://www.opengis.net/ont/
geosparql# namespace.
xsd:double geoext:area(geomLiteral g): Calculates the area of the surface of the geometry.
geomLiteral geoext:closestPoint(geomLiteral g1, geomLiteral g2): For two given geometries, computes the point on the first geometry that is closest to the second geometry.
xsd:boolean geoext:containsProperly(geomLiteral g1, geomLiteral g2): Tests if the first geometry properly contains the second geometry. Geom1 contains properly geom2 if geom1 contains geom2 and the boundaries of the two geometries do not intersect.
xsd:boolean geoext:coveredBy(geomLiteral g1, geomLiteral g2): Tests if the first geometry is covered by the second geometry. Geom1 is covered by geom2 if every point of geom1 is a point of geom2.
xsd:boolean geoext:covers(geomLiteral g1, geomLiteral g2): Tests if the first geometry covers the second geometry. Geom1 covers geom2 if every point of geom2 is a point of geom1.
xsd:double geoext:hausdorffDistance(geomLiteral g1, geomLiteral g2): Measures the degree of similarity between two geometries. The measure is normalized to lie in the range [0, 1]. Higher measures indicate a greater degree of similarity.
geo:Line geoext:shortestLine(geomLiteral g1, geomLiteral g2): Computes the shortest line between two geometries. Returns it as a LineString object.
geomLiteral geoext:simplify(geomLiteral g, double d): Given a maximum deviation from the curve, computes a simplified version of the given geometry using the Douglas-Peucker algorithm.
geomLiteral geoext:simplifyPreserveTopology(geomLiteral g, double d): Given a maximum deviation from the curve, computes a simplified version of the given geometry using the Douglas-Peucker algorithm. Will avoid creating derived geometries (polygons in particular) that are invalid.
xsd:boolean geoext:isValid(geomLiteral g): Checks whether the input geometry is a valid geometry.
GeoSPARQL examples
Example 1
Find all features that feature my:A contains, where spatial calculations are based on my:hasExactGeometry.
Using a function
SELECT ?f
WHERE {
my:A my:hasExactGeometry ?aGeom .
?aGeom geo:asWKT ?aWKT .
?f my:hasExactGeometry ?fGeom .
?fGeom geo:asWKT ?fWKT .
FILTER (geof:sfContains(?aWKT, ?fWKT) && !sameTerm(?aGeom, ?fGeom))
}
Using a predicate
SELECT ?f
WHERE {
my:A my:hasExactGeometry ?aGeom .
?f my:hasExactGeometry ?fGeom .
?aGeom geo:sfContains ?fGeom .
FILTER (!sameTerm(?aGeom, ?fGeom))
}
Example 1 result
?f
my:B
my:F
Example 2
Find all features that are within a transient bounding box geometry, where spatial calculations are based on
my:hasPointGeometry.
Using a function
SELECT ?f
WHERE {
?f my:hasPointGeometry ?fGeom .
?fGeom geo:asWKT ?fWKT .
FILTER (geof:sfWithin(?fWKT, '''
<http://www.opengis.net/def/crs/OGC/1.3/CRS84>
Polygon ((-83.4 34.0, -83.1 34.0,
-83.1 34.2, -83.4 34.2,
-83.4 34.0))
'''^^geo:wktLiteral))
}
Using a predicate
Note: Using geometry literals in the object position is a GraphDB extension and not part of the GeoSPARQL
specification.
SELECT ?f
WHERE {
?f my:hasPointGeometry ?fGeom .
?fGeom geo:sfWithin '''
<http://www.opengis.net/def/crs/OGC/1.3/CRS84>
Polygon ((-83.4 34.0, -83.1 34.0,
-83.1 34.2, -83.4 34.2,
-83.4 34.0))
'''^^geo:wktLiteral
}
Example 2 result
?f
my:D
Example 3
Find all features that touch the union of feature my:A and feature my:D, where computations are based on
my:hasExactGeometry.
Using a function
SELECT ?f
WHERE {
?f my:hasExactGeometry ?fGeom .
?fGeom geo:asWKT ?fWKT .
my:A my:hasExactGeometry ?aGeom .
?aGeom geo:asWKT ?aWKT .
my:D my:hasExactGeometry ?dGeom .
?dGeom geo:asWKT ?dWKT .
FILTER (geof:sfTouches(?fWKT, geof:union(?aWKT, ?dWKT)))
}
Using a predicate
SELECT ?f
WHERE {
?f my:hasExactGeometry ?fGeom .
?fGeom geo:asWKT ?fWKT .
my:A my:hasExactGeometry ?aGeom .
?aGeom geo:asWKT ?aWKT .
my:D my:hasExactGeometry ?dGeom .
?dGeom geo:asWKT ?dWKT .
BIND(geof:union(?aWKT, ?dWKT) AS ?union) .
?fGeom geo:sfTouches ?union
}
Example 3 result
?f
my:C
Example 4
Find the 3 closest features to feature my:C, where computations are based on my:hasExactGeometry.
SELECT ?f
WHERE {
my:C my:hasExactGeometry ?cGeom .
?cGeom geo:asWKT ?cWKT .
?f my:hasExactGeometry ?fGeom .
?fGeom geo:asWKT ?fWKT .
FILTER (?fGeom != ?cGeom)
}
ORDER BY ASC(geof:distance(?cWKT, ?fWKT, uom:metre))
LIMIT 3
Example 4 result
?f
my:A
my:E
my:D
Note: The example in the GeoSPARQL specification has a different order in the result: my:A, my:D, my:E. In fact, feature my:E is closer than feature my:D even if that does not seem obvious from the drawing of the objects. my:E's closest point is 0.1° to the west of my:C, while my:D's closest point is 0.1° to the south. At that latitude, 0.1° of longitude corresponds to a shorter distance than 0.1° of latitude, hence my:E is closer.
Example 5
SELECT ?f
WHERE {
?f geo:sfOverlaps my:AExactGeom
}
Example 5 result
?f
my:D
my:DExactGeom
Note: The example in the GeoSPARQL specification has additional results my:E and my:EExactGeom. In fact,
my:E and my:EExactGeom do not overlap my:AExactGeom because they are of different dimensions (my:AExactGeom
is a Polygon and my:EExactGeom is a LineString) and the overlaps relation is defined only for objects of the same
dimension.
Tip: For more information on GeoSPARQL predicates and functions, see the current official spec.
The Data history and versioning plugin enables you to access past states of your database through versioning at the RDF data model level. Collecting and querying the history of a database is beneficial for users and organizations that want to preserve all of their historical data and are often faced with the common use case: "I want to know when a value in the database changed, and what the previous system state was at that time."
The plugin remembers changes from multiple transactions and provides the means to track historical data. Changes
in the repository are tracked globally for all users and all updates can be queried and processed at once. The tracked
data is persisted to disk and is available after a restart.
It can be useful in several main types of cases, such as:
• Generating a “diff” between generations while data updates are loaded into the system on a regular basis,
either through ETL or a change data stream;
• Answering the question of what has changed between moment A and moment B, for example: “After an
application change was implemented over the weekend, I need to compare the deployment footprint or
configuration of the before/after situation”;
• Maintaining history only for specific classes or properties, i.e., no need for keeping history for everything.
This is a significant advantage when working with very large databases, the querying of which would require
substantial amounts of time and system resources;
• Searching for the members of a specific team at point X.
Warning: Note that querying the history log may be slow for big history logs. This is why we recommend
using filters to reduce the number of history entries if you have a big repository.
The plugin index is of the type DSPOCI, meaning that it consists of the following components:
• Datetime: a 64-bit long value that represents the exact time an operation occurred, with millisecond precision. All operations in the same transaction have the same datetime value.
• Subject: the statement subject, 32 or 40 bits long.
• Predicate: the statement predicate, 32 or 40 bits long.
• Object: the statement object, 32 or 40 bits long.
• Context: the statement context, 32 or 40 bits long. Special values are used for explicit statements in the default graph and for implicit statements. By including the implicit statements, we get transparent support for transactions.
• Insert: a boolean value stored with as few bits as possible. True represents an INSERT, and false represents a DELETE.
The index is ordered by each component going from left to right, where the datetime component is ordered in descending order (most recent updates come first), and all other components are ordered in ascending order.
Tip: Due to the order of the index components, the most time-efficient way to query your data is first by datetime and then by subject. This is particularly relevant when using the predicate parameters described in the examples below.
6.7.3 Usage
Enable/disable plugin
Enabling and disabling the plugin refers only to collecting history, which is disabled by default. Querying the already collected history is possible at any moment.
To enable the plugin, execute the following query:
INSERT DATA {
[] <http://www.ontotext.com/at/enabled> true
}
To disable it, execute:
INSERT DATA {
    [] <http://www.ontotext.com/at/enabled> false
}
To check whether collecting history is currently enabled:
SELECT ?enabled {
    [] <http://www.ontotext.com/at/enabled> ?enabled
}
If you want to clear all data in your repository, you should first disable collecting history, as there is no way to
have usable history after this operation has been executed. For example:
• You try to execute CLEAR ALL, but get an error: The reason is that clearing all statements in the repository is
incompatible with collecting history. Disable collecting history if you really want to clear all data.
• You disable collecting history and retry CLEAR ALL: All data in the repository is deleted. All history data is
deleted as well, since whatever is there is no longer usable.
Clear history
You can also delete only the history, without deleting the data in the repository or having to disable collecting history, by executing the plugin's corresponding control update.
Trim history
You can trim the history log in several ways:
• Trim the history up to a given date: The provided literal must be interpretable as xsd:date or xsd:dateTime. If only the date is specified, the time is assumed to be midnight (00:00:00). The timezone is by default the system timezone. For more precise trimming, a full datetime should be specified.
• Trim the history to a given size: Size here means the number of statements in the history log to be preserved.
• Trim the history to a given period from the current date and time: The provided literal must be interpretable as xsd:duration. P3D here means 3 days, so only the history from the last 3 days would remain after executing the update. We can also specify minutes, hours, etc.
History filtering
As keeping history for everything is, most of the time, unnecessary, as well as quite time- and resource-consuming, this plugin provides the capability to keep history only for certain classes or properties. When configuring the index, you need to specify 4 mandatory positions: subject, predicate, object, and context. Each position can have one of the following values:
• *: Everything is allowed.
• !(IRI, Bnode, or Literal): Anything apart from the selected type is allowed.
• IRI, BNode or Literal: The type of the entity on this position must be the specified one, case insensitive.
• an IRI: Only this IRI is allowed.
• an IRI prefix (http://myIRI*): All IRIs that start with the given prefix are allowed.
Filter examples
• * * literal *: Match statements that contain any literal in the object position.
• * * !literal *: Match statements that do not contain any literal in the object position.
• * http://example.com/name * *: Match statements whose predicate is http://example.com/name.
• http://example.com/person/* * * *: Match statements whose subject is an IRI starting with http://
example.com/person/.
A statement is kept in the history if it matches at least one of the provided statement templates.
Manage filters
• Add filter
INSERT DATA {
[] <http://www.ontotext.com/at/addFilters> "* * LITERAL *"
}
• Remove filter
INSERT DATA {
[] <http://www.ontotext.com/at/removeFilters> "* * LITERAL *"
}
• List filters
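# A sketch, assuming a listFilters control predicate analogous to
# addFilters/removeFilters above:
SELECT ?filter {
    [] <http://www.ontotext.com/at/listFilters> ?filter
}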
3. Change the name of a particular Starfleet officer, so that you can then see how this change is tracked:
delete data { <urn:Kirk> <urn:name> "James T. Kirk" };
insert data { <urn:Kirk> <urn:name> "James Tiberius Kirk" }
The retrieved results are in descending order, i.e., the most recent change comes first.
ii. Now, let’s add a second date of birth for the Commander:
iii. If we go back to the query from 4.a and execute it, we will see that the data has
not been added since it is a literal.
c. You can also find out what changes were made for a subject and a predicate within a specific time period, between moment A and moment B. This is done with the hist:parameters predicate, used in the following way:
While the predicate is not mandatory, passing parameters when querying history is much more efficient than fetching all history entries and then filtering them. Note that their order is important: when present, the predicate will only return history entries that match the list. Only bound variables will be taken into account, and there may also be unbound parameters. Not all bindings are required, but since the object list is an ordered list, if you want to filter by subject, for example, you must add at least ?fromDateTime ?toDateTime ?subject as bindings. ?fromDateTime and ?toDateTime may be left unbound.
The following query returns all changes made within a given time period:
PREFIX hist: <http://www.ontotext.com/at/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT * WHERE {
    # sketch: the hist: prefix binding and the exact shape of the parameters list
    # are assumptions based on the description above; adjust the datetimes as needed
    ?entry hist:parameters ("2023-01-01T00:00:00"^^xsd:dateTime "2023-02-01T00:00:00"^^xsd:dateTime) ;
        hist:timestamp ?time ;
        hist:graph ?g ;
        hist:subject ?s ;
        hist:predicate ?p ;
        hist:object ?o ;
        hist:insert ?i
}
You can also find out all changes for a particular subject and predicate. Note
that the ?fromDateTime ?toDateTime parameters are left unbound.
d. You can query the data at a specific point in time by including FROM
<http://www.ontotext.com/at/xxx>, where xxx is a datetime in the format:
yyyy[[[[[MM]dd]HH]mm]ss]. For example:
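# A sketch using the urn:Kirk data from step 3: the state of the name
# as of 1 January 2023, 12:00:00 (format yyyyMMddHHmmss)
SELECT ?name
FROM <http://www.ontotext.com/at/20230101120000>
WHERE {
    <urn:Kirk> <urn:name> ?name
}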
The same query will return a valid graph with only the date specified:
To retrieve all data for that particular Starfleet officer at a specific point in time, you
can also use a DESCRIBE query:
The result from our example at that point in time would be:
Note: Statements that have history will use the history data according to the requested point
in time. Statements that do not have history will be returned directly, assuming they were never
modified and existed at the requested point as well.
As a data scientist or an engineer with experience in specific SQL-based tools, you might want to consume RDF
data from your knowledge graph or other RDF databases by accessing GraphDB via a BI tool of your choice (e.g.,
Tableau or Microsoft Power BI). This capability is provided by GraphDB’s JDBC driver, which enables you to
create SQL views using SPARQL SELECT queries, and to access all GraphDB features including plugins and
SPARQL federation. The functionality is based on the Apache Calcite protocol and on performing optimizations
and mappings.
The JDBC driver works with preconfigured SQL views (tables) that are saved under each repository whose data
we want to access. For simplicity of the table creation process, we have integrated the SQL View Manager in
the GraphDB Workbench. It allows you to configure, store, update, preview, and delete SQL views that can be
used with the JDBC driver, where each SQL view is based on a SPARQL SELECT query and requires additional
metadata in order to configure the SQL columns.
Important: With this functionality, you can only read data from the repository. Write operations are not enabled.
6.8.1 Configuration
Prerequisites
You need to download the GraphDB JDBC driver (graphdb-jdbc-remote-10.2.5.jar), a self-contained .jar file.
The driver needs to be installed according to the requirements of the software that supports JDBC. See below for
specific instructions.
For the purposes of this guide, we will be using the Netherlands restaurants RDF dataset. Upload it into a
GraphDB repository, name it nl_restaurants, and set it as the active repository.
Now, let’s access its data over the JDBC driver.
1. Go to Setup » JDBC. Initially, the list of SQL table configurations will be empty as none are configured.
2. Click Create new SQL table configuration.
In the view that opens, there are two tabs:
• Data query: The editor where you input the SPARQL SELECT query that is abstracted as a SQL view for the JDBC driver. By default, it opens with a simple SPARQL query that defines two columns, id and label, based on rdfs:label.
Note: The query contains a special comment in the query body that specifies the position of the filter clause that will be generated on the SQL side. Make sure that it is spelled out in lowercase, as otherwise the query parser will not recognize it.
• Column types: Here, you can configure the SQL column types and other metadata of the
SQL table. Hover over a field or a checkbox to see more information about it in a tooltip.
Note that in order to create a table, it must contain at least one column.
3. Fill in a Table name for your table, e.g., restaurant_data. This field is mandatory and cannot be changed
once the table has been created.
4. Now, let’s edit the SPARQL SELECT query in the Data query body.
Enter the following query in the editor:
PREFIX ex:<http://example.com/ex>
PREFIX base:<http://example/base/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
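# A minimal sketch of the expected shape of the query; ex:name and ex:city
# are illustrative placeholders for the dataset's actual properties
SELECT ?id ?label ?name ?city {
    ?id rdfs:label ?label .
    OPTIONAL { ?id ex:name ?name . }
    OPTIONAL { ?id ex:city ?city . }
    # (the special lowercase filter comment described above goes here)
}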
5. After adding the SPARQL SELECT query, go to the Column types tab and click the Suggest button. This
will generate all possible columns based on the bindings inside the SELECT query. Additionally, SQL types
will be suggested based on the xsd types from the first 100 results of the execution of the input query:
Note: If you click Cancel before saving, a warning will notify you that you have unsaved
changes.
9. After successfully configuring the SQL view, we can Save it. It will appear in the list of configured tables
that can be used with the JDBC driver.
For the purposes of the BI tool examples further below, let’s also create another SQL view with the following
query:
PREFIX ex:<http://example.com/ex>
PREFIX base:<http://example/base/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
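# A minimal sketch for a location-oriented view; the properties are
# illustrative placeholders for the dataset's actual properties
SELECT ?id ?city ?zipcode ?country {
    ?id ex:city ?city .
    OPTIONAL { ?id ex:zipcode ?zipcode . }
    OPTIONAL { ?id ex:country ?country . }
}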
To edit and update a SQL view, select it from the list of available SQL views that are configured for the selected
repository. The configuration is identical to that used for creation, with the only difference that here you cannot
update the name of the SQL view. You can edit and update the query and SQL column metadata.
After updating the configuration, you can Save and see that all changes have been reflected.
To delete a SQL view, click the delete icon next to its name in the available SQL views list.
This table shows all RDF data types, their type equivalent in SQL, and the conversion (or mapping) of RDF to
SQL values.
Metadata type | SQL type  | Default precision and scale | RDF to SQL conversion  | Default RDF type in FILTER()
string        | VARCHAR   | 1,000                       | Literal.stringValue()  | plain literal or literal with language tag
IRI           | VARCHAR   | 500                         | IRI.stringValue()      | IRI
boolean       | BOOLEAN   |                             | Literal.booleanValue() | literal with xsd:boolean
byte          | BYTE      |                             | Literal.byteValue()    | literal with xsd:byte
short         | SHORT     |                             | Literal.shortValue()   | literal with xsd:short
int           | INT       |                             | Literal.intValue()     | literal with xsd:int
long          | LONG      |                             | Literal.longValue()    | literal with xsd:long
float         | FLOAT     |                             | Literal.floatValue()   | literal with xsd:float
double        | DOUBLE    |                             | Literal.doubleValue()  | literal with xsd:double
decimal       | DECIMAL   | 19, 0                       | Literal.decimalValue() | literal with xsd:decimal
date          | DATE      |                             | See below              | literal with xsd:date, no timezone
time          | TIME      |                             | See below              | literal with xsd:time, no timezone
timestamp     | TIMESTAMP |                             | See below              | literal with xsd:dateTime, no timezone
Each metadata type may be followed by an optional precision and scale in parentheses, e.g., decimal(15,2) or string(100), and an optional nullability specification that consists of the literal null or not null. By default, all columns are nullable.
RDF values are converted to SQL values on a best-effort basis. For example, if something was specified as "long" in SQL, it will be converted to a long value if the corresponding literal looks like a long number, regardless of its datatype. If the conversion fails (e.g., "foo" cannot be parsed as a long value), the SQL value will become null.
The default RDF type is used only to construct values when a condition from SQL WHERE is pushed to a SPARQL
FILTER().
Dates, times, and timestamps are tricky, as there is no timezone support in those types in SQL. There are SQL types with timezone support, but they are not implemented fully in Calcite. In order to support the most common use case, we proceed as follows:
• Ignore the timezone on date and time literals.
Dates such as 2020-07-01, 2020-07-01Z, 2020-07-01+03:00, and 2020-07-01-03:00 will all be converted to 2020-07-01.
Times such as 12:00:01, 12:00:01Z, 12:00:01+03:00, and 12:00:01-03:00 will all be converted to 12:00:01.
No timezone will be added when constructing a value for filtering.
• On datetime values, we consider "no timezone" to be equivalent to "Z" (i.e., +00:00); all other timezones will be converted by adjusting the datetime value by the respective offset.
No timezone will be added when constructing a value for filtering.
The following SQL operators are converted to FILTER and pushed to SPARQL, if possible:
• Comparison: =, <>, <, <=, >=
• Nullability: IS NULL, IS NOT NULL
• Text search: LIKE, SIMILAR TO
The conversion happens only if one of the operands is a column and the other one is a constant.
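For illustration, when a SQL condition such as label = 'Foo' is pushed down, it conceptually appears as a FILTER at the position of the special filter comment in the data query, roughly like this sketch:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?id ?label {
    ?id rdfs:label ?label .
    # the pushed-down SQL condition becomes a FILTER constructed from the
    # column's default RDF type (here, a plain literal)
    FILTER (?label = "Foo")
}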
We can also use an external tool such as SQuirrel Universal SQL Client to verify that the SQL table that we created
through the Workbench is functioning properly.
After installing it, execute the following steps:
1. Download the GraphDB JDBC driver (graphdb-jdbc-remote-10.2.5.jar), a self-contained .jar file.
2. Open SQuirrel and add the JDBC driver: go to the Drivers tab on the left, and click the + icon to create a
new driver.
3. In the dialog window, select Extra Class Path and click Add.
4. Go to the driver’s location on your computer, select it, and click Choose.
5. In the Name field, choose a name for the driver, e.g., GraphDB.
6. For Example URL, enter the string jdbc:graphdb:url=http://localhost:7200 (or the respective endpoint
URL if your repository is in a remote location).
7. For Class Name, enter com.ontotext.graphdb.jdbc.remote.Driver. Click OK.
8. Now go to the Aliases tab on the left, and again click the + icon to create a new one.
9. You will see the newly created driver and its URL visible in the dialog window. Choose a name for the alias,
e.g., GraphDB localhost. Username “admin” and password “root” are only necessary if GraphDB security
is enabled.
10. You can now see your repository with the two tables that it contains:
11. In the SQL tab, you can see information about the tables, such as their content. Write your SQL query in the
empty field and hit Ctrl+Enter (or the Run SQL icon above):
Tableau
5. On the next screen, under Databases you will see GraphDB. Select it.
6. On the dropdown Schema menu, you should see the name of the GraphDB repository, in our case
NL_Restaurants. Select it.
7. Tableau now shows the SQL tables that we created earlier: restaurant_data and restaurant_location.
8. Drag the Restaurant_Location table into the field in the centre of the screen and click Update Now.
9. Go to Sheet 1 where we will visualize the restaurants in the dataset based on:
a. their location:
i. On the left side of the screen, select the parameters: Country, City, Restaurant_Name,
Zipcode.
ii. On the right side of the screen, select the symbol maps option.
iii. Drag the Restaurant_Name parameter, which is now in the Rows field, into Marks » Colors.
The resulting map should look like this:
When working with BI tools that do not support JDBC, as is the case with Microsoft Power BI, you need to use an ODBC-JDBC bridge, e.g., Easysoft's ODBC-JDBC Gateway.
After downloading and installing the gateway in your Windows operating system, connect it to GraphDB the
following way:
1. Download the GraphDB JDBC driver (graphdb-jdbc-remote-10.2.5.jar).
2. From the main menu, go to ODBC Data Sources (64-bit).
3. In the dialog window, go to System DSN and click Add.
4. In the next window, select Easysoft ODBC-JDBC Gateway and click Finish.
5. In the next window, we will configure the connection to GraphDB:
• In the DSN field, enter the name of the new driver, for example "GraphDBTest". The Description field is optional.
• For User Name, enter "admin", and for Password, "root". These are not mandatory, except when GraphDB security is enabled.
• For Driver Class, enter com.ontotext.graphdb.jdbc.remote.Driver.
• For Class Path, click Add and go to the location of the driver's .jar file on your computer. Select it and click Open.
• For URL, enter the same string as in the Tableau example above: jdbc:graphdb:url=http://localhost:7200/ (or the respective endpoint URL if your repository is in a remote location).
6. Click Test to make sure that the connection is working, then click OK.
7. In the previous dialog window, you should now see the GraphDBTest connection.
This concludes the gateway configuration, and we are now ready to use it with Microsoft Power BI.
Let’s use the Netherlands Restaurants example again:
1. Start Power BI Desktop and go to Get Data.
2. From the popup Get Data window, go to Other > ODBC. Click Connect.
3. From the dropdown menu in the next dialog, select GraphDBTest.
4. In the next dialog window, enter username “admin” and password “root” (the password is only mandatory
if GraphDB security is enabled).
5. In the Navigator window that appears, you can now see the GraphDB directory and the tables it contains: Restaurant_Data and Restaurant_Location. Select the tables and click Load.
6. To visualize the data as a geographic map (similar to the Tableau example above), select the Report option
on the left, and then the Map icon from the Visualizations options on the right.
7. You can experiment with the Fields that you want visualized, for example: selecting City will display all the
locations in the dataset.
8. You can also view the data in table format, as well as see the way the two tables are connected, by using the
Data and Model views on the left.
As mentioned above, each SQL table is described by a SPARQL query that also includes some metadata defining
the SQL columns, their types, and the expected RDF type. For the restaurant_data example, it will look like this:
PREFIX ex:<http://example.com/ex>
PREFIX base:<http://example/base/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
It is generated as an .rq file upon creation of a SQL table from the Workbench, and is automatically saved in a newly created sql subdirectory in the respective repository folder. In our case, this would be:
data/repositories/nl_restaurants/sql/restaurant_data.rq
You can download and have a look at the two SPARQL queries that we used for the above examples:
• restaurant_data.rq
• restaurant_location.rq
6.9.1 Overview
SPARQL 1.1 Federation provides extensions to the query syntax for executing distributed queries over any number
of SPARQL endpoints. This feature is very powerful, and allows integration of RDF data from different sources
using a single query.
For example, to discover DBpedia resources about people who have the same names as those stored in a local
repository, use the following query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?dbpedia_id
WHERE {
?person a foaf:Person ;
foaf:name ?name .
SERVICE <http://dbpedia.org/sparql> {
?dbpedia_id a dbpedia-owl:Person ;
foaf:name ?name .
}
}
It matches the first part against the local repository and for each person it finds, it checks the DBpedia SPARQL
endpoint to see if a person with the same name exists and, if so, returns the ID.
Note: Federation must be used with caution. First of all, to avoid doing excessive querying of remote (public)
SPARQL endpoints, but also because it can lead to inefficient query patterns.
The following example finds resources in the second SPARQL endpoint that have a similar rdfs:label to the
rdfs:label of <http://dbpedia.org/resource/Vaccination> in the first SPARQL endpoint:
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
SELECT ?endpoint2_id {
VALUES ?endpoint1_id {
<http://dbpedia.org/resource/Vaccination>
}
SERVICE <http://faraway_endpoint.org/sparql> {
?endpoint1_id rdfs:label ?l1 .
FILTER( langMatches(lang(?l1), "en") )
}
SERVICE <http://remote_endpoint.com/sparql> {
?endpoint2_id rdfs:label ?l2 .
}
# assumption: the two labels are compared locally, outside the second SERVICE
FILTER( CONTAINS(LCASE(STR(?l2)), LCASE(STR(?l1))) )
}
However, such a query is very inefficient, because no intermediate bindings are passed between endpoints. Instead,
both subqueries execute independently, requiring the second subquery to return all X rdfs:label Y statements
that it stores. These are then joined locally to the (likely much smaller) results of the first subquery.
Query execution can be optimized by batching multiple values, where the following applies:
• The default batch size is 15, which is suitable for most cases.
• You can change the default via the graphdb.federation.block.join.size global property.
• By using a system graph, you can set a value only for a particular query evaluation.
Since RDF4J repositories are also SPARQL endpoints, it is possible to use the federation mechanism to do distributed querying over several repositories on a local server. You can do it by referring to them as a standard SERVICE with their full path, or, if they are running on the same GraphDB instance, you can use the optimized local repository prefix. The prefix triggers the internal federation mechanism. The internal SPARQL federation is used in almost the same way as the standard SPARQL federation over HTTP, and has several advantages:
Speed: The HTTP transport layer is bypassed and iterators are accessed directly. The speed is comparable to accessing data in the same repository.
Security: When security is ON, you can access every repository that is readable by the currently authenticated user. Standard SPARQL 1.1 federation does not support authentication.
Flexibility: Inline parameters provide control over inference and statement expansion over owl:sameAs.
Usage
Instead of providing a URL to a remote repository, you need to provide a special URL of the form repository:NNN,
where NNN is the ID of the repository you want to access. For example, to access the repository authors via internal
federation, use a query like this:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?authorName {
    SERVICE <repository:authors> {
        ?author rdfs:label ?authorName
    }
}
Parameters
There are four parameters that control how the federated part of the query is executed:
Parameter | Definition
infer (boolean) | Controls if inferred statements are included. True by default. When set to false, it is equivalent to adding FROM <http://www.ontotext.com/explicit> to the federated query.
sameAs (boolean) | Controls if statements are expanded over owl:sameAs. True by default. When set to false, it is equivalent to adding FROM <http://www.ontotext.com/disable-sameAs> to the federated query.
from (string) | Can be repeated multiple times; translates to FROM <...>. No default value.
fromNamed (string) | Can be repeated multiple times; translates to FROM NAMED <...>. No default value.
To set a parameter, put a comma after the special URL referring to the internal repository, then the parameter
name, an equals sign, and finally the value of the parameter. If you need to set more than one parameter, put
another comma, parameter name, equals sign, and value.
Some examples:
repository:NNN,infer=false Turns off inference; inferred statements are not included in the results.
repository:NNN,sameAs=false Turns off the expansion of statements over owl:sameAs; they are not included in the results.
repository:NNN,infer=false,sameAs=false Turns off both inference and owl:sameAs expansion; neither inferred statements nor sameAs-expanded statements are included in the results.
service <repository:repo1> No FROM and FROM NAMED.
service <repository:repo1,from=http://test.com> Adds FROM <http://test.com>.
service <repository:repo1,fromNamed=http://test.com/named> Adds FROM NAMED <http://test.com/named>.
service <repository:repo1,from=http://test.com,fromNamed=http://test.com/named,sameAs=false> Adds FROM <http://test.com>, adds FROM NAMED <http://test.com/named>, does not expand over owl:sameAs.
Note: This needs to be a valid URL and thus there cannot be spaces/blanks.
The example SPARQL query from above will look like this if you want to skip the inferred statements and disable
the expansion over owl:sameAs:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?authorName {
    SERVICE <repository:authors,infer=false,sameAs=false> {
        ?author rdfs:label ?authorName
    }
}
GraphDB repositories
You can also use federation to query a remote password-protected GraphDB repository by adding the other GraphDB instance as a remote location and specifying the credentials for it.
For example, if the remote location is on http://localhost:7201, this will enable you to query the remote repos
itory as follows:
SPARQL endpoints
For non-GraphDB repositories, i.e., SPARQL endpoints, there are two ways to perform a federated query to a password-protected SPARQL endpoint:
• By editing the repository configuration as follows:
1. Download the configuration file.
2. In it, edit the repositoryURL (<http://user:password@db.example.com/sparql>) by placing your
login details and the SPARQL endpoint name.
3. Stop GraphDB if it is running.
4. Create a new directory in $GDB_HOME/data/repositories/ with the same name as repositoryID from
the config file.
5. Place the edited config file in the newly created folder. Make sure that it is named config.ttl, as
otherwise GraphDB will not recognize it and the repository will not be created.
6. Start GraphDB again.
• By importing the repository configuration file in the Workbench (does not require stopping
GraphDB):
1. Download the mentioned configuration file.
2. In it, change rep:repositoryID "<RepoName>" to the name of your repository.
3. Edit the repositoryURL (<http://user:password@db.example.com/sparql>) by placing your login
details and the SPARQL endpoint name.
4. Open GraphDB Workbench and go to Repositories » Create new repository » Create from file.
5. Upload the file. The newly created repository will have the same name used for <RepoName>.
This will enable you to query the SPARQL endpoint:
For the following guide, we will be using a variation of the Star Wars dataset, which you can download in order to execute the examples yourself.
To explore your data, navigate to Explore » Class hierarchy. You can see a diagram depicting the hierarchy of the
imported RDF classes by the number of instances. The biggest circles are the parent classes, and the nested ones
are their children.
Note: If your data has no ontology (hierarchy), the RDF classes are visualized as separate circles instead of nested
ones.
• To see what classes each parent has, hover over the nested circles.
• To explore a given class, click its circle. The selected class is highlighted with a dashed line, and a side panel with its instances opens for further exploration. For each RDF class, you can see its local name, IRI, and a list of its first 1,000 class instances. The class instances are represented by their IRIs, which, when clicked, lead to another view where you can further explore their metadata.
• To go to the Domain-Range graph diagram, double-click a class circle or click the Domain-Range Graph button in the side panel.
• To explore an instance, click its IRI from the side panel.
• To adjust the number of classes displayed, drag the slider on the left-hand side of the screen. Classes are sorted by the maximum instance count, and the diagram displays only as many classes as the current slider value.
• To administer your data view, use the toolbar options on the right-hand side of the screen.
– To see only the class labels, click Hide/Show Prefixes. You can still view the prefixes when you hover over the class that interests you.
Domain-range graph
To see all properties of a given class as well as their domain and range, double-click its class circle or click the Domain-Range Graph button in the side panel. The RDF Domain-Range graph view opens, enabling you to further explore the class connectedness by clicking the green nodes (object property classes).
• To administer your graph view, use the toolbar options on the right-hand side of the screen.
– To go back to your class in the RDF Class hierarchy, click the Back to Class hierarchy diagram button.
– To export the diagram as an .svg image, click the Export Diagram download icon.
To explore the relationships between the classes, navigate to Explore » Class relationships. You can see a complex diagram, which by default shows only the top relationships. Each of them is a bundle of links between the individual instances of two classes. Each link is an RDF statement where the subject is an instance of one class, the object is an instance of another class, and the link is the predicate. Depending on the number of links between the instances of two classes, the bundle can be thicker or thinner, and it has the color of the class with more incoming links. These links can go in both directions. Note that contrary to the Class hierarchy, the Class relationships diagram is based on the actual statements between classes and not on the ontology schema.
In the example below, we can see that “Character” is the class with the biggest number of links. It is very strongly
connected to “Film” and “Species”, and most of the links are to “Character”.
Left of the diagram, you can see a list of all classes ordered by the number of links they have, as well as an indicator
of the direction of the links. Click on it to see the actual classes this class is linked to, again ordered by the number
of links with the actual number shown. The direction of the links is also displayed.
Use the list of classes to control which classes to see in the diagram with the add/remove icons next to each class.
Remove all classes with the X icon on the top right of the diagram. The green background of a class indicates
that the class is present in the diagram. We see that “Planet” has many more connections to “Character” than to
“Species”.
For each two classes in the diagram, you can find the top predicates that connect them by clicking on the connection,
again ordered and with the number of statements of this predicate and instances of the classes.
Just like in the Class hierarchy view, you can also filter the class relationships by graph when there is more than
one named graph in the repository. Expand the All graphs dropdown menu next to the toolbar options and select
the graph you want to explore.
Note: All of these statistics are built on top of the whole repository, so when you have a lot of data, the building
of the diagram may be fairly slow.
You can also explore the class relationships of your data programmatically. To do so, go to the SPARQL tab of the
Workbench menu and execute the following query:
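# A sketch that computes class-to-class link counts similar to the diagram;
# the exact query used by the Workbench may differ
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?class1 ?predicate ?class2 (COUNT(*) AS ?links)
WHERE {
    ?s ?predicate ?o .
    ?s a ?class1 .
    ?o a ?class2 .
    FILTER (?predicate != rdf:type)
}
GROUP BY ?class1 ?predicate ?class2
ORDER BY DESC(?links)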
Note: Before you start exploring resources from this view, make sure to have enabled the Autocomplete index for this repository from Setup » Autocomplete.
Navigate to Explore » Visual graph. Easy graph enables you to explore the graph of your data without using SPARQL. You see a search input field where you can choose a resource as a starting point for the graph exploration. Click on the chosen resource.
A graph of the resource links is shown. Nodes that have the same type have the same color. All types for a node
are listed when you hover over it. By default, what you see are the first 20 links to other resources ordered by
RDF rank if present. See the settings below to modify this limit and the types and predicates to hide or see with
preference.
The size of the nodes reflects the importance of the node by RDF rank. Hover over a node of interest to open a
menu with four options. Click the expand icon to see the links for the chosen node. Another way to expand it is to
doubleclick on it.
Click the info icon of a node you want to see more about, and the side panel will automatically show the information about it.
Once a node is expanded, you have the option to collapse it. This will remove all its links and their nodes, except those that are also connected to other nodes, as in the example below. Collapsing "The Force Awakens" removes all nodes connected to it except "R2-D2" and "BB-8", because they are also linked to "Droid", which is expanded.
If you are not interested in a node anymore, you can hide it by using the remove icon.
The focus icon is used to restart the graph with the node of interest. Use carefully, as it resets the state of the graph.
More global actions are available in the menu in the upper right corner.
For example, add voc:film as preferred predicate and tick the option to see only preferred predicates.
Create your own custom visual graph by modifying the queries that fetch the graph data. To do this, navigate to
Explore » Visual Graph. In the Advanced graph section, click Create graph config.
The configuration consists of five queries separated in different tabs. A list of sample queries is provided to guide
you in the process. Note that some bindings are required.
• Starting point: This is the initial state of your graph.
– Search box: Start with a search box to choose a different start resource each time. This is similar to the initial state of the Easy graph.
– Fixed resource: You may want to start the exploration with the same resource each time, e.g., select http://dbpedia.org/resource/Sofia from the autocomplete input as a start resource, so that every time you open the graph, you will see Sofia and its connections.
– Graph query results: The visual graph can render the results of an arbitrary SPARQL graph query. Each result is a triple that is transformed into a link where the subject and object are shown as nodes, and the predicate is a link between them.
• Graph expansion: This is a CONSTRUCT query that determines which nodes and edges are added to the graph
when the user expands an existing node. The ?node variable is required and will be replaced with the IRI of
the expanded node. If empty, the Unfiltered object properties sample query will be used. Each triple from
the result is visualized as an edge where subject and object are nodes, and each predicate is the link between
them. If new nodes appear in the results, they are added to the graph.
• Node basics: This SELECT query determines the basic information about a node; some of that information affects the color and size of the node. It is executed each time a node is added to the graph in order to present it correctly. The ?node variable is required and will be replaced with the IRI of the expanded node. The following bindings are expected in the results (see the example sketch after this list).
– ?type determines the color. If missing, all nodes will have the same color.
– ?label determines the label of the node. If missing, the IRI’s local name will be used.
– ?comment determines the description of the node. If missing, no description will be provided.
– ?rank determines the size of the node, and must be a real number between 0 and 1. If missing, all
nodes will have the same size.
• Edge basics: This SELECT query returns the ?label binding that determines the text of the edge. If empty, the edge IRI's local name is used.
• Node extra: This SELECT query determines the extra properties shown for a node when the info icon is
clicked. It should return two bindings ?property and ?value. Results are then shown as a list in the
sidebar.
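For instance, a minimal Node basics query might look like the following sketch, where rdfs:label, rdfs:comment, and the RDF Rank predicate are just one possible choice:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
SELECT ?type ?label ?comment ?rank {
    # ?node is replaced with the IRI of the node being added to the graph
    OPTIONAL { ?node a ?type . }
    OPTIONAL { ?node rdfs:label ?label . }
    OPTIONAL { ?node rdfs:comment ?comment . }
    # the rank binding here assumes the RDF Rank plugin has been computed
    OPTIONAL { ?node rank:hasRDFRank ?rank . }
}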
If you leave a query empty, the first sample will be taken as a default. You can execute a query to see some of the results it will produce. In addition to the samples, you will also see the queries from the other configurations, in case you want to reuse some of them. Explore your data with your custom visual graph.
During graph exploration, you can save a snapshot of the graph state with the Save icon in the top right to load it
later. The graph config you are currently using is also saved, so when you load a saved graph, you can continue
exploring with the same config.
GraphDB also allows you to share your saved graphs with other users. When security is ON in the Setup » Users and Access menu, the system distinguishes between different users. The graphs that you choose to share are only editable by you.
The graphs are located in Visual graph » Saved graphs. Other users will be able to view them and copy their URL by clicking the Get URL to graph icon.
When Users and Access » Free Access is ON, the free access user will only see shared graphs and will not be able to save new graphs.
GraphDB also enables you to embed your visual graph by adding the &embedded HTTP parameter that hides the
Workbench menus (side panel, dropdown, and footer).
The following embedding options are available (substitute localhost and the 7200 port number as appropriate):
• Start with a specific resource: http://localhost:7200/graphs-visualizations?uri=<encoded-
iri>&embedded
Note: When using embeddings, it is recommended to run the Workbench in free access mode.
Important: Before using the View resource functionality, make sure you have enabled the Autocomplete index from Setup » Autocomplete.
To view a resource in the repository, go to the GraphDB home page and start typing in the Explore » View resource field.
You can also use the Search RDF resource icon in the top right, which is visible in all Workbench screens.
Viewing resources provides an easy way to see triples where a given IRI is the subject, predicate, or object.
Even when the resource is not in the database, you can still add it from the resource view. Type in the resource IRI
and hit Enter.
Here, you can create as many triples as you need for it, using the resource edit. To add a triple, fill in the necessary
fields and click on the orange tick on the right. The created triple appears, and the Predicate, Object, and Context
fields are empty again for you to insert another triple if you want to do so. You can also edit or delete already
created triples.
To view the new statements in .TriG format, click the View TriG button.
Edit a resource
Once you open a resource in View resource, you can also edit it. Click the edit icon next to the resource namespace
and add, change, or delete the properties of this resource.
The SPARQL query results can also be exported from the SPARQL view by clicking Download As.
After finding a resource from the View resource on GraphDB’s home page, you can download its RDF triples in a
format of your choice:
In addition to internal functions, such as NOW(), RAND(), UUID(), and STRUUID(), GraphDB allows users to define and execute JavaScript code, further enhancing data manipulation with SPARQL. JavaScript functions are implemented within the special namespace <http://www.ontotext.com/js#>.
JS functions are initialized by an INSERT DATA request where the subject is a blank node [], <http://www.ontotext.com/js#register> is a reserved predicate, and an object of type literal defines your JavaScript code. It is possible to add multiple function definitions at once.
The following example registers two JavaScript functions isPalindrome(str) and reverse(str):
prefix extfn:<http://www.ontotext.com/js#>
INSERT DATA {
[] <http://www.ontotext.com/js#register> '''
function isPalindrome(str) {
if (!(str instanceof java.lang.String)) return false;
rev = reverse(str);
return str.equals(rev);
}
function reverse(str) {
return str.split("").reverse().join("");
}
'''
}
The registered functions can then be called in SPARQL through the same namespace, for example:
PREFIX jsfn:<http://www.ontotext.com/js#>
SELECT ?s ?o {
    # call the isPalindrome function registered above (illustrative values)
    BIND("racecar" AS ?s)
    BIND(jsfn:isPalindrome(?s) AS ?o)
}
As another example, the following registers a function that returns yesterday's date:
PREFIX jsfn:<http://www.ontotext.com/js#>
INSERT DATA {
[] jsfn:register '''
function getDateYesterday() {
var date = new Date();
date.setDate(date.getDate() - 1);
return date.toJSON().slice(0,10);
}
'''
}
We can then use this function in a regular SPARQL query, e.g., to retrieve data created yesterday:
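# A sketch: ex:created is a hypothetical property used only for illustration
PREFIX jsfn: <http://www.ontotext.com/js#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ex: <http://example.com/>
SELECT ?subject ?date {
    ?subject ex:created ?date .
    # keep only xsd:date values whose string form equals the function's output
    FILTER (datatype(?date) = xsd:date && STR(?date) = jsfn:getDateYesterday())
}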
Note: The projected ?date is filtered by type and by a dynamically assigned value: xsd:date and the output of the JS function, respectively.
Deregistering a JavaScript function is handled in the same fashion as registering one, with the only difference
being the predicate used in the INSERT statement http://www.ontotext.com/js#remove.
PREFIX jsfn:<http://www.ontotext.com/js#>
INSERT DATA {
[] jsfn:remove "getDateYesterday"
}
Note: If multiple function definitions have been registered by a single INSERT, removing one of these functions
will remove the rest of the functions added by that insert request.
SPARQL-MM is a multimedia extension for SPARQL 1.1. The implementation is based on code developed by Thomas Kurz and is implemented as a GraphDB plugin. The supported functions fall into the following categories:
• Temporal relations
• Temporal aggregation
• Temporal accessors
• Spatial relations
• Spatial aggregation
• General relation
• General aggregation
• General accessor
The GraphDB Connectors provide extremely fast normal and faceted (aggregation) searches, typically implemented by an external component or a service such as Elasticsearch, but with the additional benefit of staying automatically up-to-date with the GraphDB repository data.
The Connectors provide synchronization at the entity level, where an entity is defined as having a unique identifier
(an IRI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the
same subject. In addition to simple properties (defined by a single triple), the Connectors support property chains.
A property chain is defined as a sequence of triples where each triple’s object is the subject of the following triple.
The main features of the GraphDB Connectors are:
• maintaining an index that is always in sync with the data stored in GraphDB;
• multiple independent instances per repository;
• the entities for synchronization are defined by:
– a list of fields (on the Elasticsearch side) and property chains (on the GraphDB side) whose values will
be synchronized;
– a list of rdf:type’s of the entities for synchronization;
– a list of languages for synchronization (the default is all languages);
– additional filtering by property and value.
• full-text search using native Elasticsearch queries;
• snippet extraction: highlighting of search terms in the search result;
• faceted search;
• sorting by any preconfigured field;
• paging of results using OFFSET and LIMIT;
• custom mapping of RDF types to Elasticsearch types;
Each feature is described in detail below.
7.1.2 Usage
All interactions with the Elasticsearch GraphDB Connector are done through SPARQL queries.
There are three types of SPARQL queries:
• INSERT for creating, updating, and deleting connector instances;
• SELECT for listing connector instances and querying their configuration parameters;
• INSERT/SELECT for storing and querying data as part of the normal GraphDB data workflow.
In general, this corresponds to INSERT that adds or modifies data, and to SELECT that queries existing data.
Each connector implementation defines its own IRI prefix to distinguish it from other connectors. For the Elasticsearch GraphDB Connector, this is http://www.ontotext.com/connectors/elasticsearch#. Each command or predicate executed by the connector uses this prefix, e.g., http://www.ontotext.com/connectors/elasticsearch#createConnector to create a connector instance for Elasticsearch.
Individual instances of a connector are distinguished by unique names that are also IRIs. They have their own prefix to avoid clashing with any of the command predicates. For Elasticsearch, the instance prefix is http://www.ontotext.com/connectors/elasticsearch/instance#.
Sample data: All examples use the following sample data that describes five fictitious wines: Yoyowine, Franvino, Noirette, Blanquito, and Rozova, as well as the grape varieties required to make these wines. The minimum required ruleset level in GraphDB is RDFS.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix wine: <http://www.ontotext.com/example/wine#> .
wine:Merlo
rdf:type wine:Grape ;
rdfs:label "Merlo" .
wine:CabernetSauvignon
rdf:type wine:Grape ;
rdfs:label "Cabernet Sauvignon" .
wine:CabernetFranc
rdf:type wine:Grape ;
rdfs:label "Cabernet Franc" .
wine:PinotNoir
rdf:type wine:Grape ;
rdfs:label "Pinot Noir" .
wine:Chardonnay
rdf:type wine:Grape ;
rdfs:label "Chardonnay" .
wine:Yoyowine
rdf:type wine:RedWine ;
wine:madeFromGrape wine:CabernetSauvignon ;
wine:hasSugar "dry" ;
wine:hasYear "2013"^^xsd:integer .
wine:Franvino
rdf:type wine:RedWine ;
wine:madeFromGrape wine:Merlo ;
wine:madeFromGrape wine:CabernetFranc ;
wine:hasSugar "dry" ;
wine:hasYear "2012"^^xsd:integer .
wine:Noirette
rdf:type wine:RedWine ;
wine:madeFromGrape wine:PinotNoir ;
wine:hasSugar "medium" ;
wine:hasYear "2012"^^xsd:integer .
wine:Blanquito
rdf:type wine:WhiteWine ;
wine:madeFromGrape wine:Chardonnay ;
wine:hasSugar "dry" ;
wine:hasYear "2012"^^xsd:integer .
wine:Rozova
rdf:type wine:RoseWine ;
wine:madeFromGrape wine:PinotNoir ;
wine:hasSugar "medium" ;
wine:hasYear "2013"^^xsd:integer .
Prerequisites
Third-party component versions: This version of the Elasticsearch GraphDB Connector uses Elasticsearch version 7.17.7.
Tip: Since version 2.0, by default Elasticsearch commits the translog at the end of every index, delete, update, or bulk request. The new configuration may cause a massive slowdown of the Elasticsearch connector, so we highly recommend changing the index.translog.durability value to async. For more information, see Elasticsearch's transaction log settings.
Tip: In Elasticsearch 7.x.x, the default value for the wait_for_active_shards parameter of the open index command has been changed from 0 to 1. This means that the command will now by default wait for all primary shards of the opened index to be allocated. You can find more information about it here. Depending on your specific case, you can experiment with different values to find the optimal ones for you, for example: "indexCreateSettings": {"number_of_shards" : 5, "number_of_replicas" : 1, "write.wait_for_active_shards" : 0}.
Creating a connector instance is done by sending a SPARQL query with the following configuration data:
• the name of the connector instance (e.g., my_index);
• an Elasticsearch instance to synchronize to;
• classes to synchronize;
• properties to synchronize.
The configuration data has to be provided as a JSON string representation and passed together with the create
command.
You can create connectors via a Workbench dialog or by using a SPARQL update query (create command).
If you create the connector via the Workbench, no matter which way you use, you will be presented with a popup
screen showing you the connector creation progress.
1. Go to Setup » Connectors.
2. Click New Connector in the tab of the respective Connector type you want to create.
3. Fill out the configuration form.
4. Execute the CREATE statement from the form by clicking OK. Alternatively, you can view its SPARQL query
by clicking View SPARQL Query, and then copy it to execute it manually or integrate it in automation scripts.
The create command is triggered by a SPARQL INSERT with the createConnector predicate. For example, the following creates a connector instance called my_index, which synchronizes the wines from the sample data above.
To be able to use newlines and quotes without the need for escaping, here we use SPARQL’s multiline string
delimiter consisting of 3 apostrophes: '''...'''. You can also use 3 quotes instead: """...""".
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"elasticsearchNode": "localhost:9200",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{
"fieldName": "sugar",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasSugar"
],
"analyzed": false
},
{
"fieldName": "year",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasYear"
],
"analyzed": false
}
]
}
''' .
}
The above command creates a new Elasticsearch connector instance that connects to the Elasticsearch instance
accessible at port 9200 on the localhost as specified by the elasticsearchNode key.
The "types" key defines the RDF type of the entities to synchronize and, in the example, it is only entities of the type http://www.ontotext.com/example/wine#Wine (and its subtypes if RDFS or higher-level reasoning is enabled). The "fields" key defines the mapping from RDF to Elasticsearch. The basic building block is the
property chain, i.e., a sequence of RDF properties where the object of each property is the subject of the following
property. In the example, three bits of information are mapped: the grape the wines are made of, the sugar content, and the year. Each chain is assigned a short and convenient field name: "grape", "sugar", and "year". The field names are later used in the queries.
The field grape is an example of a property chain composed of more than one property. First, we take the wine’s
madeFromGrape property, the object of which is an instance of the type Grape, and then we take the rdfs:label
of this instance. The fields sugar and year are both composed of a single property that links the value directly to
the wine.
The fields sugar and year contain discrete values, such as medium, dry, 2012, 2013, and thus it is best to specify
the option analyzed: false as well. See analyzed in Defining fields for more information.
By default, GraphDB manages (creates, deletes, or updates if needed) the Elasticsearch index and the Elasticsearch
mapping. This makes it easier to use Elasticsearch as everything is done automatically. This behavior can be
changed by the following options:
• manageIndex: if true, GraphDB manages the index. True by default.
• manageMapping: if true, GraphDB manages the mapping. True by default.
Note: If either of the options is set to false, you have to create, update or remove the index/mapping and, in
case Elasticsearch is misconfigured, the connector instance will not function correctly.
The present version provides no support for changing some advanced options, such as stop words, on a perfield
basis. The recommended way to do this for now is to manage the mapping yourself and tell the connector to just
sync the object values in the appropriate fields. Here is an example:
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"elasticsearchNode": "localhost:9200",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{ "fieldName": "sugar", "propertyChain": ["http://www.ontotext.com/example/wine#hasSugar"], "analyzed": false },
{ "fieldName": "year", "propertyChain": ["http://www.ontotext.com/example/wine#hasYear"], "analyzed": false }
],
"manageMapping": false
}
''' .
}
This creates the same connector instance as above but it expects fields with the specified field names to be already
present in the index mapping, as well as some internal GraphDB fields. For the example, you must have the
following fields:
GraphDB allows access to a secured Elasticsearch instance by passing the elasticsearchBasicAuthUser and elasticsearchBasicAuthPassword parameters.
Instead of supplying the username and password as part of the connector instance configuration, you can also
implement a custom authenticator class and set it via the authenticationConfiguratorClass option. See these
connector authenticator examples for more information and example projects that implement such a custom class.
See the List of creation parameters for more information.
Dropping a connector instance removes all references to its external store from GraphDB, as well as the Elasticsearch index associated with it.
The drop command is triggered by a SPARQL INSERT with the dropConnector predicate where the name of the
connector instance has to be in the subject position, e.g., this removes the connector my_index:
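# A sketch of the drop command; the object value is not significant here,
# so a blank node is used
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
    elastic-index:my_index elastic:dropConnector [] .
}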
You can also force drop a connector in case a normal delete does not work. The force delete will remove the connector even if part of the operation fails. Go to Setup » Connectors, where you will see the already existing connectors that you have created. Click the delete icon, and check Force delete in the dialog box.
You can view the options string that was used to create a particular connector instance with the following query:
SELECT ?createString {
elastic-index:my_index elastic:listOptionValues ?createString .
}
Existing Connector instances are shown below the New Connector button. Click the name of an instance to view
its configuration and SPARQL query, or click the repair / delete icons to perform these operations. Click the copy
icon to copy the connector definition query to your clipboard.
Listing connector instances returns all previously created instances. It is a SELECT query with the listConnectors
predicate:
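# A sketch following the description below: each result row is one connector instance
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
SELECT ?cntUri ?cntStr {
    ?cntUri elastic:listConnectors ?cntStr .
}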
?cntUri is bound to the prefixed IRI of the connector instance that was used during creation, e.g., http://www.ontotext.com/connectors/elasticsearch/instance#my_index, while ?cntStr is bound to a string representing the part after the prefix, e.g., "my_index".
The internal state of each connector instance can be queried using a SELECT query and the connectorStatus predicate:
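# A sketch following the description below
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
SELECT ?cntUri ?cntStatus {
    ?cntUri elastic:connectorStatus ?cntStatus .
}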
?cntUri is bound to the prefixed IRI of the connector instance, while ?cntStatus is bound to a string representation of the status of the connector represented by this IRI. The status is key-value based.
From the user point of view, all synchronization happens transparently without using any additional predicates or
naming a specific store explicitly, i.e., you must simply execute standard SPARQL INSERT/DELETE queries. This
is achieved by intercepting all changes in the plugin and determining which Elasticsearch documents need to be
updated.
Simple queries
Once a connector instance has been created, it is possible to query data from it through SPARQL. For each matching
Elasticsearch document, the connector instance returns the document subject. In its simplest form, querying is
achieved by using a SELECT and providing the Elasticsearch query as the object of the elastic:query predicate:
SELECT ?entity {
?search a elastic-index:my_index ;
elastic:query "grape:cabernet" ;
elastic:entities ?entity .
}
The result binds ?entity to the two wines made from grapes that have “cabernet” in their name, namely :Yoyowine
and :Franvino.
Note: You must use the field names you chose when you created the connector instance. They can be identical
to the property IRIs but you must escape any special characters according to what Elasticsearch expects.
1. Get a query instance of the requested connector instance by using the RDF notation "X a Y" (= X rdf:type
Y), where X is a variable and Y is a connector instance IRI. X is bound to a query instance of the connector
instance.
2. Assign a query to the query instance by using the system predicate elastic:query.
3. Request the matching entities through the elastic:entities predicate.
It is also possible to provide per-query search options by using one or more option predicates. The option predicates
are described in detail below.
Raw queries
To access an Elasticsearch query parameter that is not exposed through a special predicate, use a raw query. Instead
of providing a full-text query in the :query part, specify raw Elasticsearch parameters. For example, to boost some
parts of your full-text query as described in the Elasticsearch documentation, execute the following query:
SELECT ?entity {
?search a elastic-index:my_index ;
elastic:query '''
{
"query" : {
"bool" : {
"should" : [ {
"query_string" : {
...
The bound ?entity can be used in other SPARQL triples in order to build complex queries that join to or fetch
additional data from GraphDB, for example, to see the actual grapes in the matching wines as well as the year they
were made:
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>
PREFIX wine: <http://www.ontotext.com/example/wine#>
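# The query body below is a hedged sketch of how such a join might look;
# the wine: property IRIs follow the sample data used throughout this section.
SELECT ?entity ?grape ?year {
    ?search a elastic-index:my_index ;
        elastic:query "grape:cabernet" ;
        elastic:entities ?entity .
    ?entity wine:madeFromGrape ?grape .
    ?entity wine:hasYear ?year .
}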
Note: :Franvino is returned twice because it is made from two different grapes, both of which are returned.
It is possible to access the match score returned by Elasticsearch with the score predicate. As each entity has its
own score, the predicate should come at the entity level. For example:
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>
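# A hedged sketch of the missing query body, showing the elastic:score predicate
# attached at the entity level:
SELECT ?entity ?score {
    ?search a elastic-index:my_index ;
        elastic:query "grape:cabernet" ;
        elastic:entities ?entity .
    ?entity elastic:score ?score .
}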
The result looks like this but the actual score might be different as it depends on the specific Elasticsearch version:
Consider the sample wine data and the my_index connector instance described previously. You can also query
facets using the same instance:
It is important to specify the facet fields by using the facetFields predicate. Its value is a simple comma-delimited
list of field names. In order to get the faceted results, use the elastic:facets predicate. As each facet has three
components (name, value, and count), the elastic:facets predicate returns multiple nodes that can be used to
access the individual values for each component through the predicates facetName, facetValue, and facetCount.
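A hedged sketch of such a faceted query, assuming the year and sugar fields from the sample connector (the exact field list may differ):

SELECT ?facetName ?facetValue ?facetCount {
    ?search a elastic-index:my_index ;
        elastic:facetFields "year,sugar" ;
        elastic:facets ?facet .
    ?facet elastic:facetName ?facetName ;
        elastic:facetValue ?facetValue ;
        elastic:facetCount ?facetCount .
}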
The resulting bindings will look like this:
You can easily see that there are three wines produced in 2012 and two in 2013. You also see that three of the
wines are dry, while two are medium. However, it is not necessarily true that the three wines produced in 2012 are
the same as the three dry wines as each facet is computed independently.
Tip: Faceting by an analyzed textual field works but might produce unexpected results. Analyzed textual fields are
composed of tokens and faceting uses each token to create a faceting bucket. For example, “North America” and
“Europe” produce three buckets: “north”, “america”, and “europe”, corresponding to each token in the two values.
If you need to facet by a textual field and still do full-text search on it, it is best to create a copy of the field with
the setting "analyzed": false. For more information, see Copy fields.
While basic faceting allows for simple counting of documents based on the discrete values of a particular field,
there are more complex faceted or aggregation searches in Elasticsearch. The Elasticsearch GraphDB Connector
provides a mapping from Elasticsearch results to RDF results but no mechanism for specifying the queries other
than executing Raw queries.
The Elasticsearch GraphDB Connector supports mapping of the following facets and aggregations:
• Facets: terms, histogram, date histogram;
• Aggregations: terms, histogram, date histogram, range, min, max, sum, avg, stats, extended stats, value
count.
For aggregations, the connector also supports sub-aggregations.
Tip: For more information on each supported facet or aggregation type, refer to the Elasticsearch documentation.
The results are accessed through the predicate aggregations (much like the basic facets are accessed through
facets). The predicate binds multiple blank nodes that each contains a single aggregation bucket. The individual
bucket items can be accessed through these predicates:
Sorting
It is possible to sort the entities returned by a connector query according to one or more fields. Sorting is achieved
by the orderBy predicate, the value of which is a comma-delimited list of fields. Each field can be prefixed with a
minus to indicate sorting in descending order. For example:
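# A hedged sketch of such a query, sorting the 2013 wines by sugar in descending order:
SELECT ?entity {
    ?search a elastic-index:my_index ;
        elastic:query "year:2013" ;
        elastic:orderBy "-sugar" ;
        elastic:entities ?entity .
}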
The result contains wines produced in 2013 sorted according to their sugar content in descending order:
By default, entities are sorted according to their matching score in descending order.
Note: If you join the entity from the connector query to other triples stored in GraphDB, GraphDB might scramble
the order. To remedy this, use ORDER BY from SPARQL.
Tip: Sorting by an analyzed textual field works but might produce unexpected results. Analyzed textual fields are
composed of tokens and sorting uses the least (in the lexicographical sense) token. For example, “North America”
will be sorted before “Europe” because the token “america” is lexicographically smaller than the token “europe”.
If you need to sort by a textual field and still do full-text search on it, it is best to create a copy of the field with the
setting "analyzed": false. For more information, see Copy fields.
Limit and offset are supported on the Elasticsearch side of the query. This is achieved through the predicates limit
and offset. Consider this example in which an offset of 1 and a limit of 1 are specified:
SELECT ?entity {
?search a elastic-index:my_index ;
elastic:query "sugar:dry" ;
elastic:offset "1" ;
elastic:limit "1" ;
elastic:entities ?entity .
}
offset is counted from 0. The result contains a single wine, Franvino. If you execute the query without the limit
and offset, Franvino will be second in the list:
Note: The specific order in which GraphDB returns the results depends on how Elasticsearch returns the matches,
unless sorting is specified.
Snippet extraction
Snippet extraction is used for extracting highlighted snippets of text that match the query. The snippets are accessed
through the dedicated predicate elastic:snippets. It binds a blank node that in turn provides the actual snippets
via the predicates elastic:snippetField and elastic:snippetText. The predicate snippets must be attached to
the entity, as each entity has a different set of snippets. For example, in a search for Cabernet:
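# A hedged sketch of such a query; the snippet predicates are attached to the entity
# as described above:
SELECT ?entity ?snippetField ?snippetText {
    ?search a elastic-index:my_index ;
        elastic:query "grape:cabernet" ;
        elastic:entities ?entity .
    ?entity elastic:snippets ?snippet .
    ?snippet elastic:snippetField ?snippetField ;
        elastic:snippetText ?snippetText .
}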
the query returns the two wines made from Cabernet Sauvignon or Cabernet Franc grapes as well as the respective
matching fields and snippets:
Note: The actual snippets might be different as this depends on the specific Elasticsearch implementation.
It is possible to tweak how the snippets are collected/composed by using the following option predicates:
• elastic:snippetSize sets the maximum size of the extracted text fragment, 250 by default;
• elastic:snippetSpanOpen the text to insert before the highlighted text, <em> by default;
• elastic:snippetSpanClose the text to insert after the highlighted text, </em> by default.
The option predicates are set on the query instance, much like the elastic:query predicate.
Snippets extracted from nested documents (when a nested query is used) will be available through the same
mechanism as snippets from non-nested fields. In addition, nested snippet results provide the nested search path via the
snippetInnerField predicate. For example, in a nested search on the field “grandChildren” (specified by “path”)
and a match query for “tylor” on the nested field “grandChildren.name”:
the query returns all people who have a grandchild whose name matches “tylor”, as well as the highlighted snippets:
Note that the matching field whose matching values are highlighted is provided via the snippetField predicate,
just like extracting snippets with non-nested searches, while the predicate snippetInnerField provides the field
on which the nested search was executed.
Total hits
You can get the total number of matching Elasticsearch documents (hits) by using the elastic:totalHits
predicate, e.g., for the connector instance my_index and a query that retrieves all wines made in 2012:
SELECT ?totalHits {
?r a elastic-index:my_index ;
elastic:query "year:2012" ;
elastic:totalHits ?totalHits .
}
As there are three wines made in 2012, the value 3 (of type xsd:long) binds to ?totalHits.
As you see above, you can omit returning any of the matching entities. This can be useful if there are many hits
and you want to calculate pagination parameters.
The creation parameters define how a connector instance is created by the elastic:createConnector predicate.
Some are required and some are optional. All parameters are provided together in a JSON object, where the
parameter names are the object keys. Parameter values may be simple JSON values such as a string or a boolean,
or they can be lists or objects.
All of the creation parameters can also be set conveniently from the Create Connector user interface without any
knowledge of JSON.
readonly (boolean), optional, read-only mode A read-only connector will index all existing data in the
repository at creation time, but, unlike connectors that are not read-only, it will:
• Not react to updates. Changes will not be synced to the connector.
• Not keep any extra structures (such as the internal Lucene index for tracking updates to chains)
The only way to index changes in data after the connector has been created is to repair (or drop/recreate) the
connector.
importGraph (boolean), optional, specifies that the RDF data from which to create the connector is in a special virtual graph
Used to make an Elasticsearch index from temporary RDF data inserted in the same transaction. It requires
readonly mode and creates a connector whose data will come from statements inserted into a special virtual
graph instead of data contained in the repository. The virtual graph is elastic:graph, where the prefix
elastic: is as defined before. The data have to be inserted into this graph before the connector create
statement is executed.
Both the insertion into the special graph and the create statement must be in the same transaction. In the
GraphDB Workbench, this can be done by pasting them one after another in the SPARQL editor and putting a
semicolon at the end of the first INSERT. This functionality requires readonly mode.
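A hedged sketch of such a transaction, using a sample wine statement and eliding the remaining connector options:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
    GRAPH <http://www.ontotext.com/connectors/elasticsearch#graph> {
        <http://www.ontotext.com/example/wine#Yoyowine> a <http://www.ontotext.com/example/wine#Wine> .
    }
};
INSERT DATA {
    elastic-index:my_index elastic:createConnector '''
{
    "readonly": true,
    "importGraph": true,
    "elasticsearchNode": "localhost:9200",
    "types": ["http://www.ontotext.com/example/wine#Wine"],
    "fields": [
        ...
    ]
}
''' .
}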
importFile (string), optional, an RDF file with data from which to create the connector Creates a connector
whose data will come from an RDF file on the file system instead of data contained in the repository. The
value must be the full path to the RDF file. This functionality requires readonly mode.
detectFields (boolean), optional, detects fields This mode introduces automatic field detection when creating
a connector. You can omit specifying fields in JSON. Instead, you will get automatic fields: each corresponds
to a single predicate, and its field name is the same as the predicate (so you need to use escaping
when issuing Elasticsearch queries).
In this mode, specifying types is optional too. If types are not provided, then all types will be indexed. This
mode requires importGraph or importFile.
Once the connector is created, you can inspect the detected fields in the Connector management section of
the Workbench.
elasticsearchNode (string), required, the Elasticsearch instance to sync to As Elasticsearch is a third-party
service, you have to specify the node where it is running. The node value has the form
http://hostname.domain:port; https:// is allowed too. No default value. Can be updated at runtime without
having to rebuild the index.
Note: Elasticsearch exposes two protocols – the native transport protocol over port 9300 and the RESTful
API over port 9200. The Elasticsearch GraphDB Connector uses the RESTful API over port 9200.
indexCreateSettings (json), optional, the settings for creating the Elasticsearch index This option is passed
directly to Elasticsearch when creating the index.
elasticsearchBasicAuthUser (string), optional, the settings for supplying the authentication user No
default value. Can be updated at runtime without having to rebuild the index.
elasticsearchBasicAuthPassword (string), optional, the settings for supplying the authentication password
A password is a string with a single value that is not logged or printed. No default value. Can be updated at
runtime without having to rebuild the index.
elasticsearchClusterSniff (boolean), controls whether to build the server address list by sniffing on the Elasticsearch cluster
Corresponds to the Elasticsearch client.transport.sniff option. True by default. Can be updated at
runtime without having to rebuild the index.
bulkUpdateBatchSize (integer), controls the maximum number of documents sent per bulk request
Default value is 5,000. Can be updated at runtime without having to rebuild the index.
bulkUpdateRequestSize (integer), controls the maximum size in bytes per bulk request Defaults to
5,242,880 bytes (5 million bytes). Can be updated at runtime without having to rebuild the index.
The limits of bulkUpdateBatchSize and bulkUpdateRequestSize are combined, and a bulk request is sent once
either limit is hit.
authenticationConfiguratorClass optional, provides custom authentication behavior
types (list of IRIs), required, specifies the types of entities to sync The RDF types of entities to sync are
specified as a list of IRIs. At least one type IRI is required.
Use the pseudo-IRI $any to sync entities that have at least one RDF type.
Use the pseudo-IRI $untyped to sync entities regardless of whether they have any RDF type, see also the
examples in General full-text search with the connectors.
languages (list of strings), optional, valid languages for literals RDF data is often multilingual, but only some
of the languages represented in the literal values can be mapped. This can be done by specifying a list of
language ranges to be matched to the language tags of literals according to RFC 4647, Section 3.3.1. Basic
Filtering. In addition, an empty range can be used to include literals that have no language tag. The list of
language ranges maps all existing literals that have matching language tags.
fields (list of field objects), required, defines the mapping from RDF to Elasticsearch The fields specify
exactly which parts of each entity will be synchronized as well as the specific details on the connector side.
The field is the smallest synchronization unit and it maps a property chain from GraphDB to a field in
Elasticsearch. The fields are specified as a list of field objects. At least one field object is required. Each
field object has further keys that specify details.
• fieldName (string), required, the name of the field in Elasticsearch The name of the field defines
the mapping on the connector side. It is specified by the key fieldName with a string value. The
field name is used at query time to refer to the field. There are few restrictions on the allowed
characters in a field name but to avoid unnecessary escaping (which depends on how Elasticsearch
parses its queries), we recommend keeping the field names simple.
• fieldNameTransform (one of none, predicate or predicate.localName), optional, none by default
Defines an optional transformation of the field name. Although fieldName is always required, it
is ignored if fieldNameTransform is predicate or predicate.localName.
• indexed (boolean), optional, default true If true, this option corresponds to "index" = true. If false,
it corresponds to "index" = false.
• stored (boolean), optional, default true Fields can be stored in Elasticsearch, and this is controlled
by the Boolean option stored. Stored fields are required for retrieving snippets. true by default.
This option corresponds to the property "store" in the Elasticsearch mapping.
• analyzed (boolean), optional, default true When literal fields are indexed in Elasticsearch, they will
be analyzed according to the analyzer settings. Should you require that a given field is not analyzed,
set "analyzed": false. This option has no effect for IRIs (they are never analyzed).
true by default.
If true, this option will use automatic or manual (datatype option) type for the Elasticsearch
mapping. If false, it corresponds to "type" = "keyword" (i.e., the default type will be changed
to keyword).
• multivalued (boolean), optional, default true RDF properties and synchronized fields may have
more than one value. If multivalued is set to true, all values will be synchronized to Elasticsearch.
If set to false, only a single value will be synchronized. true by default.
• ignoreInvalidValues (boolean), optional, default false Per-field option that controls what happens
when a value cannot be converted to the requested (or previously detected) type. False
by default.
Example use: when an invalid date literal like "2021-02-29"^^xsd:date (2021 is not a leap year)
needs to be indexed as a date, or when an IRI needs to be indexed as a number.
Note that some conversions are always valid: any literal to an FTS field, any non-literal (IRI,
blank node, embedded triple) to a non-analyzed field. When true, such values will be skipped
with a note in the logs. When false, such values will break the transaction.
• array (boolean), optional, default false Normally, Elasticsearch creates an array only if more than
one value is present for a given field. If array is set to true, Elasticsearch will always create an array
even for single values. If set to false, Elasticsearch will create arrays for multiple values only.
False by default.
• fielddata (boolean), optional, default false Allows fielddata to be built in memory for text fields.
Fielddata can consume a lot of heap space, especially when loading high cardinality text fields.
False by default.
• datatype (string), optional, the manual datatype override By default, the Elasticsearch GraphDB
Connector uses the datatype of literal values to determine how they should be mapped to Elasticsearch
types. For more information on the supported datatypes, see Datatype mapping.
The mapping can be overridden through the property "datatype", which can be specified per
field. The value of datatype can be any of the xsd: types supported by the automatic mapping
or a native Elasticsearch type prefixed by native:, e.g., both xsd:long and native:long map to
the long type in Elasticsearch.
• nativeSettings (json), optional, custom field settings The setting for the Elasticsearch mapping
parameters of the respective field, for example the format of the datatype. Native field settings
require an explicit native datatype.
nativeSettings are not allowed for the following parameters so as to avoid conflicts with the
existing way to specify them: type, index, store, analyzer, fielddata.
• objectFields (objects array), optional, nested object mapping When native:object,
native:nested, or native:geo_point is used as a datatype value, provide a mapping for the nested
object's fields. If datatype is not provided, then native:object will be assumed.
For the difference between object and nested, refer to the Elastic nested field type. The
geo_point type must have exactly two fields named lat and long (required by Elastic, see geo
point field type).
Nested objects support further nested objects with a limit of five levels of nesting. See Nested
objects for an example.
• startFromParent (integer), optional, default 0 Start processing the property chain from the Nth
parent instead of the root of the current nested object. 0 is the root of the current nested object, 1
is the parent of the nested object, 2 is the parent of the parent and so on.
• analyzer (string), optional, per-field analyzer The Elasticsearch analyzer that is used for indexing
the field can be specified with the parameter analyzer. It will be passed directly to Elasticsearch's
property analyzer when creating the mapping (see Custom Analyzers in the Elasticsearch
documentation). For example:
{
...
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
],
...
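The truncated snippet presumably ends with the analyzer setting itself; a minimal hedged sketch of that field entry, using Elasticsearch's built-in whitespace analyzer as an illustrative value:

{
    "fieldName": "grape",
    "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
    ],
    "analyzer": "whitespace"
}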
valueFilter (string), optional, specifies the top-level value filter for the document See also Entity filtering.
documentFilter (string), optional, specifies the top-level document filter for the document See also Entity
filtering.
As mentioned above, the following connector parameters can be updated at runtime without having to rebuild the
index:
• elasticsearchNode
• elasticsearchClusterSniff
• elasticsearchBasicAuthUser
• elasticsearchBasicAuthPassword
• bulkUpdateBatchSize
• bulkUpdateRequestSize
This can be done by executing the following SPARQL update, here with examples for changing the user and
password:
PREFIX conn:<http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst:<http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
inst:proper_index conn:updateConnector '''
{
"elasticsearchBasicAuthUser": "foo",
"elasticsearchBasicAuthPassword": "bar"
}
''' .
}
Nested objects
Nested objects are Elasticsearch documents that are used as values in the main document or other nested objects
(up to five levels of nesting is possible). They are defined with the objectFields option.
Consider the following data consisting of children and grandchildren relations:
<urn:John>
a <urn:Person> ;
<urn:name> "John" ;
<urn:gender> <urn:Male> ;
<urn:age> 60 ;
<urn:hasSpouse> <urn:Mary> ;
<urn:hasChild> <urn:Billy> ;
<urn:hasChild> <urn:Annie> .
<urn:Mary>
# ...
<urn:Eva>
a <urn:Person> ;
<urn:name> "Eva" ;
<urn:gender> <urn:Female> ;
<urn:age> 45 ;
<urn:hasChild> <urn:Annie> .
<urn:Billy>
a <urn:Person> ;
<urn:name> "Billy" ;
<urn:gender> <urn:Male> ;
<urn:age> 35 ;
<urn:hasChild> <urn:Tylor> ;
<urn:hasChild> <urn:Melody> .
<urn:Annie>
a <urn:Person> ;
<urn:name> "Annie" ;
<urn:gender> <urn:Female> ;
<urn:age> 28 ;
<urn:hasChild> <urn:Sammy> .
<urn:Tylor>
a <urn:Person> ;
<urn:name> "Tylor" ;
<urn:gender> <urn:Male> ;
<urn:age> 5 .
<urn:Melody>
a <urn:Person> ;
<urn:name> "Melody" ;
<urn:gender> <urn:Female> ;
<urn:age> 2 .
<urn:Sammy>
a <urn:Person> ;
<urn:name> "Sammy" ;
<urn:gender> <urn:Male> ;
<urn:age> 10 .
We can create a nested objects index that consists of children and grandchildren with their corresponding fields
defining their gender and age. We use the native:nested type as we want to query the nested objects independently
of each other:
INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
...
To find male grandchildren older than 5 years, we will use the following query:
SELECT ?entity {
?search a elastic-index:my_index ;
elastic:query '''
{
"query" : {
"nested" : {
"path" : "grandChildren",
"query" : {
"bool" : {
"must" : [
{
"match" : {
"grandChildren.gender" : "male"
}
},
{
"range" : {
"grandChildren.age" : {
"gt" : 5
}
}
}
]
}
}
}
}
}
''' ;
elastic:entities ?entity .
}
ORDER BY ?entity
?entity
urn:Eva
urn:John
Copy fields
Often, it is convenient to synchronize one and the same data multiple times with different settings to accommodate
different use cases, e.g., faceting or sorting vs. full-text search. The Elasticsearch GraphDB Connector has
explicit support for fields that copy their value from another field. This is achieved by specifying a single element
in the property chain of the form @otherFieldName, where otherFieldName is another non-copy field. Take the
following example:
...
"fields": [
{
"fieldName": "grape",
"facet": false,
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
],
"analyzed": true
},
{
"fieldName": "grapeFacet",
"propertyChain": [
"@grape"
],
"analyzed": false
}
]
...
The snippet creates an analyzed field “grape” and a non-analyzed field “grapeFacet”. Both fields are populated
with the same values, and “grapeFacet” is defined as a copy field that refers to the field “grape”.
Note: The connector handles copy fields in a more optimal way than specifying a field with exactly the same
property chain as another field.
Sometimes, you have to work with data models that define the same concept (in terms of what you want to index
in Elasticsearch) with more than one property chain, e.g., the concept of “name” could be defined as a single
canonical name, multiple historical names and some unofficial names. If you want to index these together as a
single field in Elasticsearch, you can define this as a multiple property chains field.
Fields with multiple property chains are defined as a set of separate virtual fields that will be merged into a single
physical field when indexed. Virtual fields are distinguished by the suffix $xyz, where xyz is any alphanumeric
sequence of convenience. For example, we can define the fields name$1 and name$2 like this:
...
"fields": [
    {
        "fieldName": "name$1",
        "propertyChain": [
            "http://www.ontotext.com/example#canonicalName"
        ]
    },
    {
        "fieldName": "name$2",
        "propertyChain": [
            "http://www.ontotext.com/example#historicalName"
        ]
    }
]
...
The values of the fields name$1 and name$2 will be merged and synchronized to the field name in Elasticsearch.
Note: You cannot mix suffixed and unsuffixed fields with the same name, e.g., if you defined myField$new and
myField$old, you cannot have a field called just myField.
Filters can be used with fields defined with multiple property chains. Both the physical field values and the
individual virtual field values are available:
• Physical fields are specified without the suffix, e.g., ?myField
• Virtual fields are specified with the suffix, e.g., ?myField$2 or ?myField$alt.
Note: Physical fields cannot be combined with parent() as their values come from different property chains. If
you really need to filter the same parent level, you can rewrite parent(?myField) in (<urn:x>, <urn:y>) as
parent(?myField$1) in (<urn:x>, <urn:y>) || parent(?myField$2) in (<urn:x>, <urn:y>) || parent(?
myField$3) ... and surround it with parentheses if it is a part of a bigger expression.
The language tag of an RDF literal can be indexed by specifying a property chain, where the last element is the
pseudo-IRI lang(). The property preceding lang() must lead to a literal value. For example:
INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"elasticsearchNode": "localhost:9200",
"types": ["http://www.ontotext.com/example#gadget"],
"fields": [
{
"fieldName": "name",
"propertyChain": [
"http://www.ontotext.com/example#name"
]
},
{
"fieldName": "nameLanguage",
"propertyChain": [
"http://www.ontotext.com/example#name",
"lang()"
]
}
]
}
''' .
}
The above connector will index the language tag of each literal value of the property http://www.ontotext.com/
example#name into the field nameLanguage.
The named graph of a given value can be indexed by ending a property chain with the special pseudo-IRI graph().
Indexing the named graph of the value instead of the value itself allows searching by named graph.
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"elasticsearchNode": "localhost:9200",
"types": ["http://www.ontotext.com/example#gadget"],
"fields": [
{
"fieldName": "name",
"propertyChain": [
"http://www.ontotext.com/example#name"
]
},
{
"fieldName": "nameGraph",
"propertyChain": [
"http://www.ontotext.com/example#name",
"graph()"
]
}
]
}
''' .
}
The above connector will index the named graph of each value of the property http://www.ontotext.com/
example#name into the field nameGraph.
In this mode, the last element of a property chain is a wildcard that will match any predicate that leads to a literal
value. Use the special pseudo-IRI $literal as the last element of the property chain to activate it.
Note: Currently, it really means any literal, including literals with data types.
For example:
{
"fields" : [ {
"propertyChain" : [ "$literal" ],
"fieldName" : "name"
}, {
"propertyChain" : [ "http://example.com/description", "$literal" ],
"fieldName" : "description"
}
...
}
Sometimes you may need the IRI of each entity (e.g., http://www.ontotext.com/example/wine#Franvino from
our small example dataset) indexed as a regular field. This can be achieved by specifying a property chain with a
single property referring to the pseudo-IRI $self. For example:
INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"elasticsearchNode": "localhost:9200",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "entityId",
"propertyChain": [
"$self"
]
},
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
}
]
}
''' .
}
The above connector will index the IRI of each wine into the field entityId.
Note: Note that GraphDB will also use the IRI of each entity as the ID of each document in Elasticsearch, which
is represented by the field id.
The Elasticsearch GraphDB Connector maps different types of RDF values to different types of Elasticsearch
values according to the basic type of the RDF value (IRI or literal) and the datatype of literals. The autodetection
uses the following mapping:
Note: For any given field, the automatic mapping uses the first value it sees. This works fine for clean datasets
but might lead to problems if your dataset has non-normalized data, e.g., the first value has no datatype but other
values do.
It is therefore recommended to set datatype to a fixed value, e.g., xsd:date.
Please note that the commonly used xsd:integer and xsd:decimal datatypes are not indexed as numbers because
they represent infinite precision numbers. You can override that by using the datatype option to cast to xsd:long,
xsd:double, or xsd:float as appropriate.
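For example, a minimal hedged sketch of a field definition that forces the sample data's year values to be indexed as a number:

{
    "fieldName": "year",
    "propertyChain": ["http://www.ontotext.com/example/wine#hasYear"],
    "datatype": "xsd:long"
}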
RDF and Elasticsearch use slightly different models for representing dates and times, even though the values might
look very similar.
Years in RDF values use the XSD format and are era years, where positive values denote the common era and
negative values denote years before the common era. There is no year zero.
Years in Elasticsearch use the ISO format and are proleptic years, i.e., positive values denote years from the common
era with any previous eras just going down by one mathematically, so there is a year zero.
In short:
• year 2020 CE = year 2020 in XSD = year 2020 in ISO.
• …
• year 1 CE = year 1 in XSD = year 1 in ISO.
• year 1 BCE = year 1 in XSD = year 0 in ISO.
• year 2 BCE = year 2 in XSD = year 1 in ISO.
• …
All years coming from RDF literals will be converted to ISO before indexing in Elasticsearch.
Both XSD and ISO date and time values support timezones. In addition to that, XSD defines the lack of a time
zone as undetermined. Since we do not want to have any undetermined state in the indexing system, we define
the undetermined time zone as UTC, i.e., "2020-02-14T12:00:00"^^xsd:dateTime is equivalent to "2020-02-
14T12:00:00Z"^^xsd:dateTime (Z is the UTC timezone, also known as +00:00).
Also note that XSD dates and partial dates, e.g., xsd:gYear values, may have a timezone, which leads to additional
complications. E.g., "2020+02:00"^^xsd:gYear (the year 2020 in the +02:00 timezone) will be normalized to
2019-12-31T22:00:00Z (the previous year!) if strict timezone adherence is followed. We have chosen to ignore
the timezone on any values that do not have an associated time value, e.g.:
• "2020-02-15+02:00"^^xsd:date
• "2020-02+02:00"^^xsd:gYearMonth
• "2020+02:00"^^xsd:gYear
All of the above will be treated as if they specified UTC as their timezone.
The Elasticsearch connector supports four kinds of entity filters used to fine-tune the set of entities and/or individual
values for the configured fields, based on the field value. Entities and field values are synchronized to Elasticsearch
if, and only if, they pass the filter. The filters are similar to a FILTER() inside a SPARQL query but not exactly the
same. In them, each configured field can be referred to by prefixing it with a ?, much like referring to a variable
in SPARQL.
Types of filters
Top-level value filter The top-level value filter is specified via valueFilter. It is evaluated prior to anything
else when only the document ID is known and it may not refer to any field names but only to the special
field $this that contains the current document ID. Failing to pass this filter removes the entire document
early in the indexing process and it can be used to introduce more restrictions similar to the built-in filtering
by type via the types property.
Top-level document filter The top-level document filter is specified via documentFilter. This filter is evaluated
last when all of the document has been collected and it decides whether to include the document in the index.
It can be used to enforce global document restrictions, e.g., certain fields are required or a document needs
to be indexed only if a certain field value meets specific conditions.
Per-field value filter The per-field value filter is specified via valueFilter inside the field definition of the field
whose values are to be filtered. The filter is evaluated while collecting the data for the field when each field
value becomes available.
The variable that contains the field value is $this. Other field names can be used to filter the current field's
value based on the value of another field, e.g., $this > ?age will compare the current field value to the
value of the field age (see also Two-variable filtering). Failing to pass the filter will remove the current field
value.
On nested documents, the per-field value filter can be used to remove the entire nested document early in
the indexing process, e.g., by checking the type of the nested document via next hop with rdf:type.
Nested document filter The nested document filter is specified via documentFilter inside the field definition
of the field that defines the root of a nested document. The filter is evaluated after the entire nested document
has been collected. Failing to pass this filter removes the entire nested document.
Inside a nested document filter, the field names are within the context of the nested document and not within
the context of the top-level document. For example, if we have a field children that defines a nested
document, and we use a filter like ?age < "10"^^xsd:int, we will be referring to the field children.age.
We can use the prefix $outer. one or more times to refer to field values from the outer document (from the
viewpoint of the nested document). For example, $outer.age > "25"^^xsd:int will refer to the age field
that is a sibling of the children field.
Other than the above differences, the nested document filter is equivalent to the top-level document filter
from the viewpoint of the nested document.
See also Migrating from GraphDB 9.x.
Filter operators
The filter operators are used to test if the value of a given field satisfies a certain condition.
Field comparisons are done on original RDF values before they are converted to Elasticsearch values using datatype
mapping.
Operator Meaning
?var in (value1, value2, ...) Tests if the field var's value is one of the specified values. Values
are compared strictly, unlike the similar SPARQL operator, i.e., for
literals to match, their datatype must be exactly the same (similar
to how SPARQL sameTerm works). Values that do not match are
treated as if they were not present in the repository.
Example:
?status in ("active", "new")
?var not in (value1, value2, ...) The negated version of the in operator.
Example:
?status not in ("archived")
bound(?var) Tests if the field var has a valid value. This can be used to make
the field compulsory.
Example:
bound(?name)
isExplicit(?var) Tests if the field var's value came from an explicit statement.
This will use the last element of the property chain. If you need
to assert the explicit status of a previous property chain, use
parent(?var) as many times as needed.
Example:
isExplicit(?name)
?var = value (equal to)
?var != value (not equal to)
?var > value (greater than)
?var >= value (greater than or equal to)
?var < value (less than)
?var <= value (less than or equal to)
RDF value comparison operators that compare RDF values
similarly to the equivalent SPARQL operators. The field var's
value will be compared to the specified RDF value. When
comparing RDF values that are literals, their datatypes must be
compatible, e.g., xsd:integer and xsd:long but not
xsd:string and xsd:date. Values that do not match are treated
as if they were not present in the repository.
Examples:
Given that height's value is "150"^^xsd:int and
dateOfBirth's value is "1989-12-31"^^xsd:date, then:
regex(?var, "pattern") or regex(?var, "pattern", "i")
Tests if the field var's value matches the given regular
expression pattern.
If the “i” flag option is present, this indicates that the match
operates in case-insensitive mode.
Values that do not match are treated as if they were not present
in the repository.
Example:
regex(?name, "^mrs?", "i")
!expr Negates the given expression expr, e.g., to require that a field
has no value.
Example:
!bound(?company)
expr1 && expr2, expr1 || expr2, ( expr ) Filter expressions can be combined with logical AND, logical
OR, and parentheses for grouping.
Example:
(bound(?name) or bound(?company)) && bound(?address)
Filter modifiers
In addition to the operators, there are some constructions that can be used to write filters based not on the values
of a field but on values related to them:
Accessing the previous element in the chain The construction parent(?var) is used for going to a
previous level in a property chain. It can be applied recursively as many times as needed, e.g.,
parent(parent(parent(?var))) goes back in the chain three times. The effective value of parent(?var)
can be used with the in or not in operator like this: parent(?company) in (<urn:a>, <urn:b>), or in the
bound operator like this: bound(parent(?var)).
Accessing an element beyond the chain The construction ?var -> uri (alternatively, ?var o uri or just ?
var uri) is used to access additional values that are accessible through the property uri. In essence, this
construction corresponds to the triple pattern ?value uri ?effectiveValue, where ?value is a value bound
by the field var. The effective value of ?var -> uri can be used with the in or not in operator like this:
?company -> rdf:type in (<urn:c>, <urn:d>). It can be combined with parent() like this: parent(?
company) -> rdf:type in (<urn:c>, <urn:d>). The same construction can be applied to the bound
operator like this: bound(?company -> <urn:hasBranch>), or even combined with parent() like this:
bound(parent(?company) -> <urn:hasGroup>).
The IRI parameter can be a full IRI within < > or the special string rdf:type (alternatively, just type), which
will be expanded to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
Filtering by RDF graph The construction graph(?var) is used for accessing the RDF graph of a field’s value.
A typical use case is to sync only explicit values: graph(?a) not in (<http://www.ontotext.com/
implicit>) but using isExplicit(?a) is the recommended way.
The construction can be combined with parent() like this: graph(parent(?a)) in (<urn:a>).
Filtering by language tags The construction lang(?var) is used for accessing the language tag of a field's value
(only RDF literals can have a language tag). The typical use case is to sync only values written in a given
language: lang(?a) in ("de", "it", "no"). The construction can be combined with parent() and an element
beyond the chain like this: lang(parent(?a) -> <http://www.w3.org/2000/01/rdf-schema#label>) in
("en", "bg"). Literal values without language tags can be filtered by using an empty tag: "".
Current context variable $this The special field variable $this (and not ?this, ?$this, $?this) is used to refer
to the current context. In the top-level value filter and the top-level document filter, it refers to the document.
In the per-field value filter, it refers to the currently filtered field value. In the nested document filter, it refers
to the nested document.
ALL() quantifier In the context of document-level filtering, a match is true if at least one of potentially many field
values match, e.g., ?location = <urn:Europe> would return true if the document contains { "location":
["<urn:Asia>", "<urn:Europe>"] }.
In addition to this, you can also use the ALL() quantifier when you need all values to match, e.g., ALL(?
location) = <urn:Europe> would not match with the above document because <urn:Asia> does not match.
Entity filters and default values Entity filters can be combined with default values in order to get more flexible
behavior.
If a field has no values in the RDF database, the defaultValue is used. But if a field has some values,
defaultValue is NOT used, even if all values are filtered out. See an example in Basic entity filter.
A typical use case for an entity filter is having soft deletes, i.e., instead of deleting an entity, it is marked as
deleted by the presence of a specific value for a given property.
Two-variable filtering
Besides comparing a field value to one or more constants or running an existential check on the field value, some
use cases also require comparing the field value to the value of another field in order to produce the desired result.
GraphDB solves this by supporting two-variable filtering in the per-field value filter, the top-level document filter,
and the nested document filter.
Note: This type of filtering is not possible in the top-level value filter because the only variable that is available
there is $this.
In the top-level document filter and the nested document filter, there are no restrictions as all values are available
at the time of evaluation.
In the per-field value filter, two-variable filtering will reorder the defined fields such that values for other fields
are already available when the current field's filter is evaluated. For example, let's say we defined a filter $this
> ?salary for the field price. This will force the connector to process the field salary first, apply its per-field
value filter if any, and only then start collecting and filtering the values for the field price.
Cyclic dependencies will be detected and reported as an invalid filter. For example, if in addition to the above
we define a per-field value filter ?price > "1000"^^xsd:int for the field salary, a cyclic dependency will be
detected as both price and salary would require the other field to be indexed first.
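A hedged sketch of the field definitions for the price/salary example above. All IRIs and the type used here are illustrative only:

# Illustrative only: example#offer, example#price, and example#salary are hypothetical.
INSERT DATA {
    elastic-index:my_index elastic:createConnector '''
{
    "elasticsearchNode": "localhost:9200",
    "types": ["http://www.ontotext.com/example#offer"],
    "fields": [
        {
            "fieldName": "salary",
            "propertyChain": ["http://www.ontotext.com/example#salary"]
        },
        {
            "fieldName": "price",
            "propertyChain": ["http://www.ontotext.com/example#price"],
            "valueFilter": "$this > ?salary"
        }
    ]
}
''' .
}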
# the entity below will be synchronized because it has a matching value for city: ?city in ("London")
example:alpha
rdf:type example:gadget ;
example:name "John Synced" ;
example:city "London" .
# the entity below will not be synchronized because it lacks the property completely: bound(?city)
example:beta
rdf:type example:gadget ;
example:name "Peter Syncfree" .
# the entity below will not be synchronized because it has a different city value:
# ?city in ("London") will remove the value "Liverpool" so bound(?city) will be false
example:gamma
rdf:type example:gadget ;
example:name "Mary Syncless" ;
example:city "Liverpool" .
INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"elasticsearchNode": "localhost:9200",
"types": ["http://www.ontotext.com/example#gadget"],
"fields": [
{
...
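A hedged sketch of how the field definitions likely continue, with a per-field value filter on city (the exact details may differ from the original definition):

{
    "fieldName": "name",
    "propertyChain": ["http://www.ontotext.com/example#name"]
},
{
    "fieldName": "city",
    "propertyChain": ["http://www.ontotext.com/example#city"],
    "valueFilter": "$this in (\"London\")"
}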
If the city field additionally specifies "defaultValue": "London", the default value is used for the entity :beta as it
has no value for city in the repository. As the value is "London", the entity is synchronized.
Sometimes, data represented in RDF is not well suited to map directly to non-RDF. For example, if you have news
articles and they can be tagged with different concepts (locations, persons, events, etc.), one possible way to model
this is a single property :taggedWith. Consider the following RDF data:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix example2: <http://www.ontotext.com/example2#> .
example2:Berlin
rdf:type example2:Location ;
rdfs:label "Berlin" .
example2:Mozart
rdf:type example2:Person ;
rdfs:label "Wolfgang Amadeus Mozart" .
example2:Einstein
rdf:type example2:Person ;
rdfs:label "Albert Einstein" .
example2:Cannes-FF
rdf:type example2:Event ;
rdfs:label "Cannes Film Festival" .
example2:Article1
rdf:type example2:Article ;
rdfs:comment "An article about a film about Einstein's life while he was a professor in Berlin." ;
example2:taggedWith example2:Berlin ;
example2:taggedWith example2:Einstein .
example2:Article2
rdf:type example2:Article ;
rdfs:comment "An article about Berlin." ;
example2:taggedWith example2:Berlin .
example2:Article3
rdf:type example2:Article ;
rdfs:comment "An article about Mozart's life." ;
example2:taggedWith example2:Mozart .
example2:Article4
rdf:type example2:Article ;
rdfs:comment "An article about classical music in Berlin." ;
example2:taggedWith example2:Berlin ;
example2:taggedWith example2:Mozart .
example2:Article5
rdf:type example2:Article ;
rdfs:comment "A boring article that has no tags." .
example2:Article6
rdf:type example2:Article ;
rdfs:comment "An article about the Cannes Film Festival in 2013." ;
example2:taggedWith example2:Cannes-FF .
Assume you want to map this data to Elasticsearch, so that the property example2:taggedWith x is mapped
to separate fields taggedWithPerson and taggedWithLocation, according to the type of x (whereas we are not
interested in Events). You can map taggedWith twice to different fields and then use an entity filter to get the
desired values:
INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"elasticsearchNode": "localhost:9200",
"types": ["http://www.ontotext.com/example2#Article"],
"fields": [
{
"fieldName": "comment",
"propertyChain": ["http://www.w3.org/2000/01/rdf-schema#comment"]
},
{
"fieldName": "taggedWithPerson",
"propertyChain": ["http://www.ontotext.com/example2#taggedWith"],
"valueFilter": "$this -> type = <http://www.ontotext.com/example2#Person>"
},
{
"fieldName": "taggedWithLocation",
"propertyChain": ["http://www.ontotext.com/example2#taggedWith"],
"valueFilter": "$this -> type = <http://www.ontotext.com/example2#Location>"
}
]
}
''' .
}
The six articles in the RDF data above will be mapped as such:
This can be checked by issuing a faceted search for taggedWithLocation and taggedWithPerson:
If the filter was applied, you should get only :Berlin for taggedWithLocation and only :Einstein and :Mozart
for taggedWithPerson:
If you used entity filters in the connectors in GraphDB 9.x (or older) with the entityFilter option, you need to
rewrite them using one of the current filter types.
In general, most older connector filters can be easily rewritten using the per-field value filter and top-level document
filter.
Rule of thumb:
• If you want to remove individual values, i.e., if the operand is not BOUND() → rewrite with per-field value
filter.
• If you want to remove entire documents, i.e., if the operand is BOUND() → rewrite with top-level document
filter.
So if we take the example:
The following diagram shows a summary of all predicates that can administer (create, drop, check status)
connector instances or issue queries and retrieve results. It can be used as a quick reference of what a particular predicate
needs to be attached to. For example, to retrieve entities, you need to use :entities on a search instance and to
retrieve snippets, you need to use :snippets on an entity. Variables that are bound as a result of a query are shown
in green, blank helper nodes are shown in blue, literals in red, and IRIs in orange. The predicates are represented
by labeled arrows.
7.1.9 Caveats
Order of control
Even though SPARQL per se is not sensitive to the order of triple patterns, the Elasticsearch GraphDB Connector
expects to receive certain predicates before others so that queries can be executed properly. In particular, predicates
that specify the query or query options need to come before any predicates that fetch results.
The diagram in Overview of connector predicates provides a quick overview of the predicates.
GraphDB 10.0 introduces major changes to the filtering mechanism of the connectors. Existing connector instances
will not be usable and attempting to use them for queries or updates will throw an error.
If your GraphDB 9.x (or older) connector definitions do not include an entity filter, you can simply repair them.
If your GraphDB 9.x (or older) connector definitions do include an entity filter with the entityFilter option, you
need to rewrite the filter with one of the current filter types:
1. Save your existing connector definition.
2. Drop the connector instance.
3. In general, most older connector filters can be easily rewritten using the per-field value filter and top-level
document filter. Rewrite the filters as follows:
Rule of thumb:
• If you want to remove individual values, i.e., if the operand is not BOUND() → rewrite with
per-field value filter.
• If you want to remove entire documents, i.e., if the operand is BOUND() → rewrite with
top-level document filter.
So if we take the example:
The GraphDB Connectors provide extremely fast normal and faceted (aggregation) searches, typically implemented
by an external component or a service such as Lucene, but have the additional benefit of staying automatically
up-to-date with the GraphDB repository data.
The Connectors provide synchronization at the entity level, where an entity is defined as having a unique identifier
(an IRI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the
same subject. In addition to simple properties (defined by a single triple), the Connectors support property chains.
A property chain is defined as a sequence of triples where each triple’s object is the subject of the following triple.
The main features of the GraphDB Connectors are:
• maintaining an index that is always in sync with the data stored in GraphDB;
• multiple independent instances per repository;
• the entities for synchronization are defined by:
– a list of fields (on the Lucene side) and property chains (on the GraphDB side) whose values will be
synchronized;
– a list of rdf:type’s of the entities for synchronization;
– a list of languages for synchronization (the default is all languages);
– additional filtering by property and value.
• full-text search using native Lucene queries;
• snippet extraction: highlighting of search terms in the search result;
• faceted search;
• sorting by any preconfigured field;
• paging of results using offset and limit;
• custom mapping of RDF types to Lucene types;
• specifying which Lucene analyzer to use (the default is Lucene’s StandardAnalyzer);
• stripping HTML/XML tags in literals (the default is not to strip markup);
• boosting an entity by the numeric value of one or more predicates;
• custom scoring expressions at query time to evaluate a total score based on Lucene score and entity boost.
Each feature is described in detail below.
7.2.2 Usage
All interactions with the Lucene GraphDB Connector shall be done through SPARQL queries.
There are three types of SPARQL queries:
• INSERT for creating, updating, and deleting connector instances;
• SELECT for listing connector instances and querying their configuration parameters;
• INSERT/SELECT for storing and querying data as part of the normal GraphDB data workflow.
In general, this corresponds to INSERT that adds or modifies data, and to SELECT that queries existing data.
Each connector implementation defines its own IRI prefix to distinguish it from other connectors. For the Lucene
GraphDB Connector, this is http://www.ontotext.com/connectors/lucene#. Each command or predicate
executed by the connector uses this prefix, e.g., http://www.ontotext.com/connectors/lucene#createConnector
to create a connector instance for Lucene.
Individual instances of a connector are distinguished by unique names that are also IRIs. They have their own
prefix to avoid clashing with any of the command predicates. For Lucene, the instance prefix is http://www.
ontotext.com/connectors/lucene/instance#.
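The examples that follow therefore assume these two prefix declarations:

PREFIX luc: <http://www.ontotext.com/connectors/lucene#>
PREFIX luc-index: <http://www.ontotext.com/connectors/lucene/instance#>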
Sample data All examples use the following sample data that describes five fictitious wines: Yoyowine,
Franvino, Noirette, Blanquito, and Rozova, as well as the grape varieties required to make these wines. The
minimum required ruleset level in GraphDB is RDFS.
wine:Merlo
rdf:type wine:Grape ;
rdfs:label "Merlo" .
wine:CabernetSauvignon
rdf:type wine:Grape ;
rdfs:label "Cabernet Sauvignon" .
wine:CabernetFranc
rdf:type wine:Grape ;
rdfs:label "Cabernet Franc" .
wine:PinotNoir
rdf:type wine:Grape ;
rdfs:label "Pinot Noir" .
wine:Chardonnay
rdf:type wine:Grape ;
rdfs:label "Chardonnay" .
wine:Yoyowine
rdf:type wine:RedWine ;
wine:madeFromGrape wine:CabernetSauvignon ;
wine:hasSugar "dry" ;
wine:hasYear "2013"^^xsd:integer .
wine:Franvino
rdf:type wine:RedWine ;
wine:madeFromGrape wine:Merlo ;
wine:madeFromGrape wine:CabernetFranc ;
wine:hasSugar "dry" ;
wine:hasYear "2012"^^xsd:integer .
wine:Noirette
rdf:type wine:RedWine ;
wine:madeFromGrape wine:PinotNoir ;
wine:hasSugar "medium" ;
wine:hasYear "2012"^^xsd:integer .
wine:Blanquito
rdf:type wine:WhiteWine ;
wine:madeFromGrape wine:Chardonnay ;
wine:hasSugar "dry" ;
wine:hasYear "2012"^^xsd:integer .
wine:Rozova
rdf:type wine:RoseWine ;
wine:madeFromGrape wine:PinotNoir ;
wine:hasSugar "medium" ;
wine:hasYear "2013"^^xsd:integer .
Third-party component versions This version of the Lucene GraphDB Connector uses Lucene version 8.11.2.
Creating a connector instance is done by sending a SPARQL query with the following configuration data:
• the name of the connector instance (e.g., my_index);
• classes to synchronize;
• properties to synchronize.
The configuration data has to be provided as a JSON string representation and passed together with the create
command.
You can create connectors via a Workbench dialog or by using a SPARQL update query (create command).
If you create the connector via the Workbench, no matter which way you use, you will be presented with a popup
screen showing you the connector creation progress.
1. Go to Setup ‣ Connectors.
2. Click New Connector in the tab of the respective Connector type you want to create.
3. Fill out the configuration form.
4. Execute the CREATE statement from the form by clicking OK. Alternatively, you can view its SPARQL query
by clicking View SPARQL Query, and then copy it to execute it manually or integrate it in automation scripts.
The create command is triggered by a SPARQL INSERT with the luc:createConnector predicate, e.g., this creates
a connector instance called my_index, which synchronizes the wines from the sample data above.
To be able to use newlines and quotes without the need for escaping, here we use SPARQL’s multiline string
delimiter consisting of 3 apostrophes: '''...'''. You can also use 3 quotes instead: """...""".
INSERT DATA {
luc-index:my_index luc:createConnector '''
{
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{
"fieldName": "sugar",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasSugar"
],
"analyzed": false,
"multivalued": false
},
{
"fieldName": "year",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasYear"
...
Dropping (deleting) a connector instance removes all references to its external store from GraphDB, as well as all
Lucene files associated with it.
The drop command is triggered by a SPARQL INSERT with the dropConnector predicate where the name of the
connector instance has to be in the subject position, e.g., this removes the connector my_index:
INSERT DATA {
luc-index:my_index luc:dropConnector [] .
}
You can also force drop a connector in case a normal delete does not work. The force delete will remove the
connector even if part of the operation fails. Go to Setup ‣ Connectors where you will see the already existing
connectors that you have created. Click the delete icon, and check Force delete in the dialog box.
You can view the options string that was used to create a particular connector instance with the following query:
SELECT ?createString {
luc-index:my_index luc:listOptionValues ?createString .
}
Existing Connector instances are shown below the New Connector button. Click the name of an instance to view
its configuration and SPARQL query, or click the repair / delete icons to perform these operations. Click the copy
icon to copy the connector definition query to your clipboard.
Listing connector instances returns all previously created instances. It is a SELECT query with the listConnectors
predicate:
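A minimal form of such a query, assuming the luc: prefix defined earlier, is:
SELECT ?cntUri ?cntStr {
    ?cntUri luc:listConnectors ?cntStr .
}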
?cntUri is bound to the prefixed IRI of the connector instance that was used during creation, e.g., http://www.ontotext.com/connectors/lucene/instance#my_index, while ?cntStr is bound to a string representing the part after the prefix, e.g., "my_index".
The internal state of each connector instance can be queried using a SELECT query and the connectorStatus predicate:
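For example, a query along these lines returns the status of each connector instance:
SELECT ?cntUri ?cntStatus {
    ?cntUri luc:connectorStatus ?cntStatus .
}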
?cntUri is bound to the prefixed IRI of the connector instance, while ?cntStatus is bound to a string representation
of the status of the connector represented by this IRI. The status is key-value based.
From the user point of view, all synchronization happens transparently without using any additional predicates or
naming a specific store explicitly, i.e., you must simply execute standard SPARQL INSERT/DELETE queries. This is
achieved by intercepting all changes in the plugin and determining which Lucene documents need to be updated.
Simple queries
Once a connector instance has been created, it is possible to query data from it through SPARQL. For each matching
Lucene document, the connector instance returns the document subject. In its simplest form, querying is achieved
by using a SELECT and providing the Lucene query as the object of the luc:query predicate:
SELECT ?entity {
?search a luc-index:my_index ;
luc:query "grape:cabernet" ;
luc:entities ?entity .
}
The result binds ?entity to the two wines made from grapes that have “cabernet” in their name, namely :Yoyowine
and :Franvino.
Note: You must use the field names you chose when you created the connector instance. They can be identical
to the property IRIs but you must escape any special characters according to what Lucene expects.
1. Get a query instance of the requested connector instance by using the RDF notation "X a Y" (= X rdf:type
Y), where X is a variable and Y is a connector instance IRI. X is bound to a query instance of the connector
instance.
2. Assign a query to the query instance by using the system predicate luc:query.
3. Request the matching entities through the luc:entities predicate.
It is also possible to provide per query search options by using one or more option predicates. The option predicates
are described in detail below.
The bound ?entity can be used in other SPARQL triples in order to build complex queries that join to or fetch
additional data from GraphDB, for example, to see the actual grapes in the matching wines as well as the year they
were made:
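A sketch of such a join, assuming the wine: prefix from the sample data:
PREFIX wine: <http://www.ontotext.com/example/wine#>
SELECT ?entity ?grape ?year {
    ?search a luc-index:my_index ;
        luc:query "grape:cabernet" ;
        luc:entities ?entity .
    ?entity wine:madeFromGrape ?grape ;
        wine:hasYear ?year .
}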
Note: :Franvino is returned twice because it is made from two different grapes, both of which are returned.
It is possible to access the match score returned by Lucene with the score predicate. As each entity has its own
score, the predicate should come at the entity level. For example:
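A sketch of such a query:
SELECT ?entity ?score {
    ?search a luc-index:my_index ;
        luc:query "grape:cabernet" ;
        luc:entities ?entity .
    ?entity luc:score ?score .
}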
The result looks like this but the actual score might be different as it depends on the specific Lucene version:
Consider the sample wine data and the my_index connector instance described previously. You can also query
facets using the same instance:
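A query along these lines can be used (a sketch; "*:*" is the Lucene syntax for matching all documents, and the facet predicates are described below):
SELECT ?facetName ?facetValue ?facetCount {
    ?r a luc-index:my_index ;
        luc:query "*:*" ;
        luc:facetFields "year,sugar" ;
        luc:facets ?facet .
    ?facet luc:facetName ?facetName ;
        luc:facetValue ?facetValue ;
        luc:facetCount ?facetCount .
}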
It is important to specify the facet fields by using the facetFields predicate. Its value is a simple comma-delimited
list of field names. In order to get the faceted results, use the luc:facets predicate. As each facet has three
components (name, value and count), the luc:facets predicate returns multiple nodes that can be used to access
the individual values for each component through the predicates facetName, facetValue, and facetCount.
The resulting bindings look like the following:
You can easily see that there are three wines produced in 2012 and two in 2013. You also see that three of the
wines are dry, while two are medium. However, it is not necessarily true that the three wines produced in 2012 are
the same as the three dry wines as each facet is computed independently.
Sorting
It is possible to sort the entities returned by a connector query according to one or more fields. Sorting is achieved
by the orderBy predicate, the value of which is a comma-delimited list of fields. Each field can be prefixed with a
minus to indicate sorting in descending order. For example:
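For instance, a sketch of a query that matches the wines from 2013 and sorts them by sugar content in descending order:
SELECT ?entity {
    ?search a luc-index:my_index ;
        luc:query "year:2013" ;
        luc:orderBy "-sugar" ;
        luc:entities ?entity .
}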
The result contains wines produced in 2013 sorted according to their sugar content in descending order:
By default, entities are sorted according to their matching score in descending order.
Note: If you join the entity from the connector query to other triples stored in GraphDB, GraphDB might scramble
the order. To remedy this, use ORDER BY from SPARQL.
Tip: Sorting by an analyzed textual field works but might produce unexpected results. Analyzed textual fields are
composed of tokens and sorting uses the least (in the lexicographical sense) token. For example, “North America”
will be sorted before “Europe” because the token “america” is lexicographically smaller than the token “europe”.
If you need to sort by a textual field and still do full-text search on it, it is best to create a copy of the field with the
setting "analyzed": false. For more information, see Copy fields.
Note: Unlike Lucene 4, which was used in GraphDB 6.x, Lucene 5 imposes an additional requirement on fields
used for sorting. They must be defined with multivalued = false.
Limit and offset are supported on the Lucene side of the query. This is achieved through the predicates limit and
offset. Consider this example in which an offset of 1 and a limit of 1 are specified:
SELECT ?entity {
?search a luc-index:my_index ;
luc:query "sugar:dry" ;
luc:offset "1" ;
luc:limit "1" ;
luc:entities ?entity .
}
offset is counted from 0. The result contains a single wine, Franvino. If you execute the query without the limit
and offset, Franvino will be second in the list:
Note: The specific order in which GraphDB returns the results depends on how Lucene returns the matches,
unless sorting is specified.
Snippet extraction
Snippet extraction is used for extracting highlighted snippets of text that match the query. The snippets are accessed
through the dedicated predicate luc:snippets. It binds a blank node that in turn provides the actual snippets via
the predicates luc:snippetField and luc:snippetText. The predicate snippets must be attached to the entity, as
each entity has a different set of snippets. For example, in a search for Cabernet:
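A sketch of such a query, reusing the my_index instance and the prefixes shown earlier:
SELECT ?entity ?snippetField ?snippetText {
    ?search a luc-index:my_index ;
        luc:query "grape:cabernet" ;
        luc:entities ?entity .
    ?entity luc:snippets ?snippet .
    ?snippet luc:snippetField ?snippetField ;
        luc:snippetText ?snippetText .
}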
the query returns the two wines made from Cabernet Sauvignon or Cabernet Franc grapes as well as the respective
matching fields and snippets:
Note: The actual snippets might be different as this depends on the specific Lucene implementation.
It is possible to tweak how the snippets are collected/composed by using the following option predicates:
• luc:snippetSize sets the maximum size of the extracted text fragment, 250 by default;
• luc:snippetSpanOpen text to insert before the highlighted text, <em> by default;
• luc:snippetSpanClose text to insert after the highlighted text, </em> by default.
The option predicates are set on the query instance, much like the luc:query predicate.
Total hits
You can get the total number of matching Lucene documents (hits) by using the luc:totalHits predicate, e.g., for
the connector instance my_index and a query that retrieves all wines made in 2012:
SELECT ?totalHits {
?r a luc-index:my_index ;
luc:query "year:2012" ;
luc:totalHits ?totalHits .
}
As there are three wines made in 2012, the value 3 (of type xsd:long) binds to ?totalHits.
As you see above, you can omit returning any of the matching entities. This can be useful if there are many hits
and you want to calculate pagination parameters.
The creation parameters define how a connector instance is created by the luc:createConnector predicate. Some
are required and some are optional. All parameters are provided together in a JSON object, where the parameter
names are the object keys. Parameter values may be simple JSON values such as a string or a boolean, or they can
be lists or objects.
All of the creation parameters can also be set conveniently from the Create Connector user interface in the GraphDB
Workbench without any knowledge of JSON.
readonly (boolean), optional, read-only mode A read-only connector will index all existing data in the repository at creation time but, unlike non-read-only connectors, it will:
• Not react to updates. Changes will not be synced to the connector.
• Not keep any extra structures (such as the internal Lucene index for tracking updates to chains)
The only way to index changes in data after the connector has been created is to repair (or drop/recreate) the
connector.
importGraph (boolean), optional, specifies that the RDF data from which to create the connector is in a special virtual graph
Used to make a Lucene index from temporary RDF data inserted in the same transaction. It requires read-only mode and creates a connector whose data will come from statements inserted into a special virtual graph instead of data contained in the repository. The virtual graph is luc:graph, where the prefix luc: is as defined before. Data needs to be inserted into this graph before the connector create statement is executed.
Both the insertion into the special graph and the create statement must be in the same transaction. In the GraphDB Workbench, this can be done by pasting them one after another in the SPARQL editor and putting a semicolon at the end of the first INSERT. This functionality requires read-only mode.
PREFIX luc: <http://www.ontotext.com/connectors/lucene#>
INSERT {
GRAPH luc:graph {
...
}
} WHERE {
...
};
importFile (string), optional, an RDF file with data from which to create the connector Creates a connector
whose data will come from an RDF file on the file system instead of data contained in the repository. The
value must be the full path to the RDF file. This functionality requires read-only mode.
detectFields (boolean), optional, detects fields This mode introduces automatic field detection when creating
a connector. You can omit specifying fields in JSON. Instead, you will get automatic fields: each corresponds to a single predicate, and its field name is the same as the predicate (so you need to use escaping
when issuing Lucene queries).
In this mode, specifying types is optional too. If types are not provided, then all types will be indexed. This
mode requires importGraph or importFile.
Once the connector is created, you can inspect the detected fields in the Connector management section of
the Workbench.
analyzer (string), optional, specifies Lucene analyzer The Lucene Connector supports custom Analyzer implementations. They may be specified via the analyzer parameter whose value must be a fully qualified
name of a class that extends org.apache.lucene.analysis.Analyzer. The class requires either a default
constructor or a constructor with exactly one parameter of type org.apache.lucene.util.Version. For
example, these two classes are valid implementations:
package com.ontotext.example;
import org.apache.lucene.analysis.Analyzer;
package com.ontotext.example;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.util.Version;
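For illustration only, a minimal Analyzer implementation with a default constructor might look like the following sketch (the whitespace tokenizer is an arbitrary choice, not part of the original example):

package com.ontotext.example;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class FancyAnalyzer extends Analyzer {
    // A default constructor is sufficient for the connector to instantiate the class.
    public FancyAnalyzer() {
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenize on whitespace; a real analyzer would typically add filters (lowercasing, stemming, etc.).
        return new TokenStreamComponents(new WhitespaceTokenizer());
    }
}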
FancyAnalyzer and SmartAnalyzer can then be used by specifying their fully qualified names, for example:
...
"analyzer": "com.ontotext.example.SmartAnalyzer",
...
types (list of IRIs), required, specifies the types of entities to sync The RDF types of entities to sync are specified as a list of IRIs. At least one type IRI is required.
Use the pseudo-IRI $any to sync entities that have at least one RDF type.
Use the pseudo-IRI $untyped to sync entities regardless of whether they have any RDF type; see also the examples in General full-text search with the connectors.
languages (list of strings), optional, valid languages for literals RDF data is often multilingual but you can
map only some of the languages represented in the literal values. This can be done by specifying a list
of language ranges to be matched to the language tags of literals according to RFC 4647, Section 3.3.1.
Basic Filtering. In addition, an empty range can be used to include literals that have no language tag. The
list of language ranges maps all existing literals that have matching language tags.
fields (list of field objects), required, defines the mapping from RDF to Lucene The fields define exactly
what parts of each entity will be synchronized as well as the specific details on the connector side. The
field is the smallest synchronization unit and it maps a property chain from GraphDB to a field in Lucene.
The fields are specified as a list of field objects. At least one field object is required. Each field object has
further keys that specify details.
• fieldName (string), required, the name of the field in Lucene The name of the field defines the
mapping on the connector side. It is specified by the key fieldName with a string value. The
field name is used at query time to refer to the field. There are few restrictions on the allowed
characters in a field name, but to avoid unnecessary escaping (which depends on how Lucene parses its queries), we recommend keeping the field names simple.
Example use: when an invalid date literal like "2021-02-29"^^xsd:date (2021 is not a leap year)
needs to be indexed as a date, or when an IRI needs to be indexed as a number.
Note that some conversions are always valid: any literal to an FTS field, any nonliteral (IRI,
blank node, embedded triple) to a nonanalyzed field. When true, such values will be skipped
with a note in the logs. When false, such values will break the transaction.
• facet (boolean), optional, default true Lucene needs to index data in a special way, if it will be used
for faceted search. This is controlled by the Boolean option “facet”. True by default. Fields that
are not synchronized for faceting are also not available for faceted search.
• datatype (string), optional, the manual datatype override By default, the Lucene GraphDB Connector uses the datatype of literal values to determine how they must be mapped to Lucene types. For
more information on the supported datatypes, see Datatype mapping.
The datatype mapping can be overridden through the parameter "datatype", which can be speci
fied per field. The value of "datatype" can be any of the xsd: types supported by the automatic
mapping.
valueFilter (string), optional, specifies the top-level value filter for the document See also Entity filtering.
documentFilter (string), optional, specifies the top-level document filter for the document See also Entity filtering.
Copy fields
Often, it is convenient to synchronize one and the same data multiple times with different settings to accommodate different use cases, e.g., faceting or sorting vs. full-text search. The Lucene GraphDB Connector has explicit support for fields that copy their value from another field. This is achieved by specifying a single element in the property chain of the form @otherFieldName, where otherFieldName is another non-copy field. Take the following example:
...
"fields": [
{
"fieldName": "grape",
"facet": false,
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
],
"analyzed": true
},
{
"fieldName": "grapeFacet",
"propertyChain": [
"@grape"
],
"analyzed": false
}
]
...
The snippet creates an analyzed field “grape” and a non-analyzed field “grapeFacet”. Both fields are populated with the same values, and “grapeFacet” is defined as a copy field that refers to the field “grape”.
Note: The connector handles copy fields in a more optimal way than specifying a field with exactly the same
property chain as another field.
Sometimes, you have to work with data models that define the same concept (in terms of what you want to index
in Lucene) with more than one property chain, e.g., the concept of “name” could be defined as a single canonical
name, multiple historical names and some unofficial names. If you want to index these together as a single field
in Lucene you can define this as a multiple property chains field.
Fields with multiple property chains are defined as a set of separate virtual fields that will be merged into a single
physical field when indexed. Virtual fields are distinguished by the suffix $xyz, where xyz is any alphanumeric
sequence of convenience. For example, we can define the fields name$1 and name$2 like this:
...
"fields": [
    {
        "fieldName": "name$1",
        "propertyChain": [
            "http://www.ontotext.com/example#canonicalName"
        ]
    },
    {
        "fieldName": "name$2",
        "propertyChain": [
            "http://www.ontotext.com/example#historicalName"
        ]
    },
    ...
],
...
The values of the fields name$1 and name$2 will be merged and synchronized to the field name in Lucene.
Note: You cannot mix suffixed and unsuffixed fields with the same name, e.g., if you defined myField$new and
myField$old you cannot have a field called just myField.
Filters can be used with fields defined with multiple property chains. Both the physical field values and the individual virtual field values are available:
• Physical fields are specified without the suffix, e.g., ?myField
• Virtual fields are specified with the suffix, e.g., ?myField$2 or ?myField$alt.
Note: Physical fields cannot be combined with parent() as their values come from different property chains. If
you really need to filter the same parent level, you can rewrite parent(?myField) in (<urn:x>, <urn:y>) as
parent(?myField$1) in (<urn:x>, <urn:y>) || parent(?myField$2) in (<urn:x>, <urn:y>) || parent(?
myField$3) ... and surround it with parentheses if it is a part of a bigger expression.
The language tag of an RDF literal can be indexed by specifying a property chain, where the last element is the
pseudo-IRI lang(). The property preceding lang() must lead to a literal value. For example:
INSERT DATA {
luc-index:my_index luc:createConnector '''
    {
      "types": ["http://www.ontotext.com/example#gadget"],
      "fields": [
        {
          "fieldName": "name",
          "propertyChain": [
            "http://www.ontotext.com/example#name"
          ]
        },
        {
          "fieldName": "nameLanguage",
          "propertyChain": [
            "http://www.ontotext.com/example#name",
            "lang()"
          ]
        }
      ]
    }
''' .
}
The above connector will index the language tag of each literal value of the property http://www.ontotext.com/
example#name into the field nameLanguage.
The named graph of a given value can be indexed by ending a property chain with the special pseudo-IRI graph().
Indexing the named graph of the value instead of the value itself allows searching by named graph.
INSERT DATA {
luc-index:my_index luc:createConnector '''
{
"types": ["http://www.ontotext.com/example#gadget"],
"fields": [
{
"fieldName": "name",
"propertyChain": [
"http://www.ontotext.com/example#name"
]
},
{
"fieldName": "nameGraph",
"propertyChain": [
"http://www.ontotext.com/example#name",
"graph()"
]
}
]
}
''' .
}
The above connector will index the named graph of each value of the property http://www.ontotext.com/
example#name into the field nameGraph.
In this mode, the last element of a property chain is a wildcard that will match any predicate that leads to a literal
value. Use the special pseudo-IRI $literal as the last element of the property chain to activate it.
Note: Currently, it really means any literal, including literals with data types.
For example:
{
"fields" : [ {
"propertyChain" : [ "$literal" ],
"fieldName" : "name"
}, {
"propertyChain" : [ "http://example.com/description", "$literal" ],
"fieldName" : "description"
}
...
}
Sometimes you may need the IRI of each entity (e.g., http://www.ontotext.com/example/wine#Franvino from
our small example dataset) indexed as a regular field. This can be achieved by specifying a property chain with a
single property referring to the pseudo-IRI $self. For example:
INSERT DATA {
luc-index:my_index luc:createConnector '''
{
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "entityId",
"propertyChain": [
"$self"
]
},
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
}
]
}
''' .
}
The above connector will index the IRI of each wine into the field entityId.
The Lucene GraphDB Connector maps different types of RDF values to different types of Lucene values according
to the basic type of the RDF value (IRI or literal) and the datatype of literals. The autodetection uses the following
mapping:
The datatype mapping can be affected by the synchronization options too, e.g., a non-analyzed field that has xsd:long values is indexed as a non-tokenized field.
Note: For any given field the automatic mapping uses the first value it sees. This works fine for clean datasets
but might lead to problems if your dataset has non-normalized data, e.g., the first value has no datatype but other values do.
It is therefore recommended to set datatype to a fixed value, e.g. xsd:date.
Please note that the commonly used xsd:integer and xsd:decimal datatypes are not indexed as numbers because
they represent infinite precision numbers. You can override that by using the datatype option to cast to xsd:long,
xsd:double, xsd:float as appropriate.
RDF and Lucene use different models to represent dates and times. Lucene stores values as offsets in seconds for
sorting, or as padded ISO strings for range search, e.g., "2020-03-23T12:34:56"^^xsd:dateTime will be stored
as the string 20200323123456.
Years in RDF values use the XSD format and are era years, where positive values denote the common era and
negative values denote years before the common era. There is no year zero.
Years in padded string date and time Lucene values use the ISO format and are proleptic years, i.e., positive values
denote years from the common era, with any previous eras just going down by one mathematically, so there is a year zero.
In short:
• year 2020 CE = year 2020 in XSD = year 2020 in ISO.
• …
• year 1 CE = year 1 in XSD = year 1 in ISO.
• year 1 BCE = year 1 in XSD = year 0 in ISO.
• year 2 BCE = year 2 in XSD = year 1 in ISO.
• …
All years coming from RDF literals will be converted to ISO before indexing in Lucene.
Note: Range search will not work as expected with negative years. This is a limitation of storing the date and
time as strings.
XSD date and time values support timezones. In order to have a unified view over values with different timezones,
all xsd:dateTime values will be normalized to the UTC time zone before indexing.
In addition to that, XSD defines the lack of a timezone as undetermined. Since we do not want to have any
undetermined state in the indexing system, we define the undetermined time zone as UTC, i.e., "2020-02-
14T12:00:00"^^xsd:dateTime is equivalent to "2020-02-14T12:00:00Z"^^xsd:dateTime (Z is the UTC time
zone, also known as +00:00).
Also note that XSD dates may have a timezone, which leads to additional complications. E.g., "2020-01-
01+02:00"^^xsd:date (the date 1 January 2020 in the +02:00 timezone) will be normalized to 2019-12-
31T22:00:00Z (a different day!) if strict timezone adherence is followed. We have chosen to ignore the timezone
on any values that do not have an associated time value, e.g.:
• "2020-02-15+02:00"^^xsd:date
• "2020-05-08-05:00"^^xsd:date
All of the above will be treated as if they specified UTC as their timezone.
The Lucene connector supports three kinds of entity filters used to fine-tune the set of entities and/or individual
values for the configured fields, based on the field value. Entities and field values are synchronized to Lucene if,
and only if, they pass the filter. The filters are similar to a FILTER() inside a SPARQL query but not exactly the
same. In them, each configured field can be referred to by prefixing it with a ?, much like referring to a variable
in SPARQL.
Types of filters
Top-level value filter The top-level value filter is specified via valueFilter. It is evaluated prior to anything else, when only the document ID is known, and it may not refer to any field names but only to the special field $this that contains the current document ID. Failing to pass this filter removes the entire document early in the indexing process, and it can be used to introduce more restrictions similar to the built-in filtering by type via the types property.
Top-level document filter The top-level document filter is specified via documentFilter. This filter is evaluated
last when all of the document has been collected and it decides whether to include the document in the index.
It can be used to enforce global document restrictions, e.g., certain fields are required or a document needs
to be indexed only if a certain field value meets specific conditions.
Per-field value filter The per-field value filter is specified via valueFilter inside the field definition of the field
whose values are to be filtered. The filter is evaluated while collecting the data for the field when each field
value becomes available.
The variable that contains the field value is $this. Other field names can be used to filter the current field’s
value based on the value of another field, e.g., $this > ?age will compare the current field value to the
value of the field age (see also Two-variable filtering). Failing to pass the filter will remove the current field
value.
See also Migrating from GraphDB 9.x.
Filter operators
The filter operators are used to test if the value of a given field satisfies a certain condition.
Field comparisons are done on original RDF values before they are converted to Lucene values using datatype
mapping.
?var in (value1, value2, ...)
    Tests if the field var's value is one of the specified values. Values are compared strictly, unlike the
    similar SPARQL operator, i.e., for literals to match, their datatype must be exactly the same (similar to
    how SPARQL sameTerm works). Values that do not match are treated as if they were not present in the
    repository.
    Example: ?status in ("active", "new")
?var not in (value1, value2, ...)
    The negated version of the in operator.
    Example: ?status not in ("archived")
bound(?var)
    Tests if the field var has a valid value. This can be used to make the field compulsory.
    Example: bound(?name)
isExplicit(?var)
    Tests if the field var's value came from an explicit statement. This will use the last element of the
    property chain. If you need to assert the explicit status of a previous element in the property chain, use
    parent(?var) as many times as needed.
    Example: isExplicit(?name)
?var = value (equal to), ?var != value (not equal to), ?var > value (greater than),
?var >= value (greater than or equal to), ?var < value (less than), ?var <= value (less than or equal to)
    RDF value comparison operators that compare RDF values similarly to the equivalent SPARQL operators.
    The field var's value will be compared to the specified RDF value. When comparing RDF values that are
    literals, their datatypes must be compatible, e.g., xsd:integer and xsd:long but not xsd:string and
    xsd:date. Values that do not match are treated as if they were not present in the repository.
    Example: given that height's value is "150"^^xsd:int and dateOfBirth's value is
    "1989-12-31"^^xsd:date, comparisons such as ?height > "100"^^xsd:int or
    ?dateOfBirth < "2000-01-01"^^xsd:date would match.
regex(?var, "pattern") or regex(?var, "pattern", "i")
    Tests if the field var's value matches the given regular expression pattern. If the "i" flag option is
    present, the match operates in case-insensitive mode. Values that do not match are treated as if they
    were not present in the repository.
    Example: regex(?name, "^mrs?", "i")
!expr
    Logical negation of the expression expr.
    Example: !bound(?company)
expr1 && expr2, expr1 || expr2
    Logical conjunction (&&) and disjunction (|| or the keyword or) of two expressions. Expressions can be
    grouped with parentheses.
    Example: (bound(?name) or bound(?company)) && bound(?address)
Filter modifiers
In addition to the operators, there are some constructions that can be used to write filters based not on the values
of a field but on values related to them:
Accessing the previous element in the chain The construction parent(?var) is used for going to a previous level in a property chain. It can be applied recursively as many times as needed, e.g.,
parent(parent(parent(?var))) goes back in the chain three times. The effective value of parent(?var)
can be used with the in or not in operator like this: parent(?company) in (<urn:a>, <urn:b>), or in the
bound operator like this: bound(parent(?var)).
Accessing an element beyond the chain The construction ?var -> uri (alternatively, ?var o uri or just ?
var uri) is used to access additional values that are accessible through the property uri. In essence, this
construction corresponds to the triple pattern value uri ?effectiveValue, where ?value is a value bound
by the field var. The effective value of ?var -> uri can be used with the in or not in operator like this:
?company -> rdf:type in (<urn:c>, <urn:d>). It can be combined with parent() like this: parent(?
company) -> rdf:type in (<urn:c>, <urn:d>). The same construction can be applied to the bound
operator like this: bound(?company -> <urn:hasBranch>), or even combined with parent() like this:
bound(parent(?company) -> <urn:hasGroup>).
The IRI parameter can be a full IRI within < > or the special string rdf:type (alternatively, just type), which
will be expanded to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
Filtering by RDF graph The construction graph(?var) is used for accessing the RDF graph of a field’s value.
A typical use case is to sync only explicit values: graph(?a) not in (<http://www.ontotext.com/
implicit>) but using isExplicit(?a) is the recommended way.
The construction can be combined with parent() like this: graph(parent(?a)) in (<urn:a>).
Filtering by language tags The construction lang(?var) is used for accessing the language tag of a field's value
(only RDF literals can have a language tag). The typical use case is to sync only values written in a given lan
guage: lang(?a) in ("de", "it", "no"). The construction can be combined with parent() and an element
beyond the chain like this: lang(parent(?a) -> <http://www.w3.org/2000/01/rdf-schema#label>) in
("en", "bg"). Literal values without language tags can be filtered by using an empty tag: "".
Current context variable $this The special field variable $this (and not ?this, ?$this, $?this) is used to refer
to the current context. In the top-level value filter and the top-level document filter, it refers to the document. In the per-field value filter, it refers to the currently filtered field value. In the nested document filter, it refers
to the nested document.
ALL() quantifier In the context of documentlevel filtering, a match is true if at least one of potentially many field
values match, e.g., ?location = <urn:Europe> would return true if the document contains { "location":
["<urn:Asia>", "<urn:Europe>"] }.
In addition to this, you can also use the ALL() quantifier when you need all values to match, e.g., ALL(?
location) = <urn:Europe> would not match with the above document because <urn:Asia> does not match.
Entity filters and default values Entity filters can be combined with default values in order to get more flexible
behavior.
If a field has no values in the RDF database, the defaultValue is used. But if a field has some values,
defaultValue is NOT used, even if all values are filtered out. See an example in Basic entity filter.
A typical use case for an entity filter is having soft deletes, i.e., instead of deleting an entity, it is marked as
deleted by the presence of a specific value for a given property.
Two-variable filtering
Besides comparing a field value to one or more constants or running an existential check on the field value, some
use cases also require comparing the field value to the value of another field in order to produce the desired result.
GraphDB solves this by supporting two-variable filtering in the per-field value filter and the top-level document
filter.
Note: This type of filtering is not possible in the top-level value filter because the only variable that is available
there is $this.
In the top-level document filter, there are no restrictions as all values are available at the time of evaluation.
In the per-field value filter, two-variable filtering will reorder the defined fields such that values for other fields are already available when the current field's filter is evaluated. For example, let's say we defined a filter $this > ?salary for the field price. This will force the connector to process the field salary first, apply its per-field
value filter if any, and only then start collecting and filtering the values for the field price.
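As an illustration, using hypothetical property IRIs, such a pair of fields could be defined like this:

"fields": [
    {
        "fieldName": "salary",
        "propertyChain": ["http://www.ontotext.com/example#salary"]
    },
    {
        "fieldName": "price",
        "propertyChain": ["http://www.ontotext.com/example#price"],
        "valueFilter": "$this > ?salary"
    }
]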
Cyclic dependencies will be detected and reported as an invalid filter. For example, if in addition to the above
we define a perfield value filter ?price > "1000"^^xsd:int for the field salary, a cyclic dependency will be
detected as both price and salary will require the other field being indexed first.
# the entity below will be synchronised because it has a matching value for city: ?city in ("London")
example:alpha
rdf:type example:gadget ;
example:name "John Synced" ;
example:city "London" .
# the entity below will not be synchronised because it lacks the property completely: bound(?city)
example:beta
rdf:type example:gadget ;
example:name "Peter Syncfree" .
# the entity below will not be synchronized because it has a different city value:
# ?city in ("London") will remove the value "Liverpool" so bound(?city) will be false
example:gamma
rdf:type example:gadget ;
example:name "Mary Syncless" ;
example:city "Liverpool" .
INSERT DATA {
luc-index:my_index luc:createConnector '''
{
"types": ["http://www.ontotext.com/example#gadget"],
"fields": [
{
"fieldName": "name",
"propertyChain": ["http://www.ontotext.com/example#name"]
            },
            {
                "fieldName": "city",
                "propertyChain": ["http://www.ontotext.com/example#city"],
                "valueFilter": "$this in (\"London\")"
            }
        ],
        "documentFilter": "bound(?city)"
    }
''' .
}
If the city field additionally specified "defaultValue": "London", the default value would be used for the entity :beta, as it has no value for city in the repository. Since that value is "London", the entity would then be synchronized.
Sometimes, data represented in RDF is not well suited to map directly to non-RDF. For example, if you have news
articles and they can be tagged with different concepts (locations, persons, events, etc.), one possible way to model
this is a single property :taggedWith. Consider the following RDF data:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix example2: <http://www.ontotext.com/example2#> .
example2:Berlin
rdf:type example2:Location ;
rdfs:label "Berlin" .
example2:Mozart
rdf:type example2:Person ;
rdfs:label "Wolfgang Amadeus Mozart" .
example2:Einstein
rdf:type example2:Person ;
rdfs:label "Albert Einstein" .
example2:Cannes-FF
rdf:type example2:Event ;
rdfs:label "Cannes Film Festival" .
example2:Article1
rdf:type example2:Article ;
rdfs:comment "An article about a film about Einstein's life while he was a professor in Berlin." ;
example2:taggedWith example2:Berlin ;
example2:taggedWith example2:Einstein ;
example2:taggedWith example2:Cannes-FF .
example2:Article2
rdf:type example2:Article ;
rdfs:comment "An article about Berlin." ;
example2:taggedWith example2:Berlin .
example2:Article3
rdf:type example2:Article ;
rdfs:comment "An article about Mozart's life." ;
example2:taggedWith example2:Mozart .
example2:Article4
rdf:type example2:Article ;
rdfs:comment "An article about classical music in Berlin." ;
example2:taggedWith example2:Berlin ;
example2:taggedWith example2:Mozart .
example2:Article5
rdf:type example2:Article ;
rdfs:comment "A boring article that has no tags." .
example2:Article6
rdf:type example2:Article ;
rdfs:comment "An article about the Cannes Film Festival in 2013." ;
example2:taggedWith example2:Cannes-FF .
Assume you want to map this data to Lucene, so that the property example2:taggedWith x is mapped to separate
fields taggedWithPerson and taggedWithLocation, according to the type of x (we are not interested in events here). You can map taggedWith twice to different fields and then use an entity filter to get the desired values:
INSERT DATA {
luc-index:my_index luc:createConnector '''
{
"types": ["http://www.ontotext.com/example2#Article"],
"fields": [
{
"fieldName": "comment",
"propertyChain": ["http://www.w3.org/2000/01/rdf-schema#comment"]
},
{
"fieldName": "taggedWithPerson",
"propertyChain": ["http://www.ontotext.com/example2#taggedWith"],
"valueFilter": "$this -> type = <http://www.ontotext.com/example2#Person>"
},
{
"fieldName": "taggedWithLocation",
"propertyChain": ["http://www.ontotext.com/example2#taggedWith"],
"valueFilter": "$this -> type = <http://www.ontotext.com/example2#Location>"
}
]
}
''' .
}
The six articles in the RDF data above will be mapped to these fields according to the types of the concepts they are tagged with.
This can be checked by issuing a faceted search for taggedWithLocation and taggedWithPerson:
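For example, with a faceted query along these lines (a sketch; "*:*" matches all documents):
SELECT ?facetName ?facetValue ?facetCount {
    ?r a luc-index:my_index ;
        luc:query "*:*" ;
        luc:facetFields "taggedWithLocation,taggedWithPerson" ;
        luc:facets ?facet .
    ?facet luc:facetName ?facetName ;
        luc:facetValue ?facetValue ;
        luc:facetCount ?facetCount .
}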
With the filter applied, you should get only :Berlin for taggedWithLocation, and only :Einstein and :Mozart for taggedWithPerson.
The following diagram shows a summary of all predicates that can administer (create, drop, check status) connector instances or issue queries and retrieve results. It can be used as a quick reference of what a particular predicate
needs to be attached to. For example, to retrieve entities, you need to use :entities on a search instance and to
retrieve snippets, you need to use :snippets on an entity. Variables that are bound as a result of a query are shown
in green, blank helper nodes are shown in blue, literals in red, and IRIs in orange. The predicates are represented
by labeled arrows.
7.2.9 Caveats
Order of control
Even though SPARQL per se is not sensitive to the order of triple patterns, the Lucene GraphDB Connector expects
to receive certain predicates before others so that queries can be executed properly. In particular, predicates that
specify the query or query options need to come before any predicates that fetch results.
The diagram in Overview of connector predicates provides a quick overview of the predicates.
Migrating from GraphDB 9.x
GraphDB 10.0 introduces major changes to the filtering mechanism of the connectors. Existing connector instances will not be usable, and attempting to use them for queries or updates will throw an error.
If your GraphDB 9.x (or older) connector definitions do not include an entity filter, you can simply repair them.
If your GraphDB 9.x (or older) connector definitions do include an entity filter with the entityFilter option, you
need to rewrite the filter with one of the current filter types:
1. Save your existing connector definition.
2. Drop the connector instance.
3. In general, most older connector filters can be easily rewritten using the per-field value filter and top-level document filter. Rewrite the filters as follows:
Rule of thumb:
• If you want to remove individual values, i.e., if the operand is not BOUND() –> rewrite with
perfield value filter.
• If you want to remove entire documents, i.e., if the operand is BOUND() –> rewrite with top
level document filter.
So, taking a hypothetical GraphDB 9.x entityFilter purely as an illustration:
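    "entityFilter": "bound(?city) && ?city in (\"London\")"
Following the rule of thumb above, the ?city in (...) part, which removes individual values, would be rewritten as a per-field value filter on the city field ("valueFilter": "$this in (\"London\")"), while the bound(?city) part, which removes entire documents, would become a top-level "documentFilter": "bound(?city)".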
7.3 Solr GraphDB Connector
The GraphDB Connectors provide extremely fast normal and faceted (aggregation) searches, typically implemented by an external component or a service such as Solr, but with the additional benefit of staying automatically up-to-date with the GraphDB repository data.
The Connectors provide synchronization at the entity level, where an entity is defined as having a unique identifier
(an IRI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the
same subject. In addition to simple properties (defined by a single triple), the Connectors support property chains.
A property chain is defined as a sequence of triples where each triple’s object is the subject of the following triple.
The main features of the GraphDB Connectors are:
• maintaining an index that is always in sync with the data stored in GraphDB;
• multiple independent instances per repository;
• the entities for synchronization are defined by:
– a list of fields (on the Solr side) and property chains (on the GraphDB side) whose values will be
synchronized;
– a list of rdf:type’s of the entities for synchronization;
– a list of languages for synchronization (the default is all languages);
– additional filtering by property and value.
• full-text search using native Solr queries;
• snippet extraction: highlighting of search terms in the search result;
• faceted search;
• sorting by any preconfigured field;
• paging of results using offset and limit;
• custom mapping of RDF types to Solr types;
Each feature is described in detail below.
7.3.2 Usage
All interactions with the Solr GraphDB Connector shall be done through SPARQL queries.
There are three types of SPARQL queries:
• INSERT for creating, updating, and deleting connector instances;
• SELECT for listing connector instances and querying their configuration parameters;
• INSERT/SELECT for storing and querying data as part of the normal GraphDB data workflow.
In general, this corresponds to INSERT that adds or modifies data, and to SELECT that queries existing data.
Each connector implementation defines its own IRI prefix to distinguish it from other connectors. For the Solr
GraphDB Connector, this is http://www.ontotext.com/connectors/solr#. Each command or predicate executed by the connector uses this prefix, e.g., http://www.ontotext.com/connectors/solr#createConnector to
create a connector instance for Solr.
Individual instances of a connector are distinguished by unique names that are also IRIs. They have their own prefix
to avoid clashing with any of the command predicates. For Solr, the instance prefix is http://www.ontotext.com/
connectors/solr/instance#.
Sample data All examples use the following sample data that describes five fictitious wines: Yoyowine, Franvino, Noirette, Blanquito, and Rozova, as well as the grape varieties required to make these wines. The
minimum required ruleset level in GraphDB is RDFS.
wine:Merlo
rdf:type wine:Grape ;
rdfs:label "Merlo" .
wine:CabernetSauvignon
rdf:type wine:Grape ;
rdfs:label "Cabernet Sauvignon" .
wine:CabernetFranc
rdf:type wine:Grape ;
rdfs:label "Cabernet Franc" .
wine:PinotNoir
rdf:type wine:Grape ;
rdfs:label "Pinot Noir" .
wine:Chardonnay
rdf:type wine:Grape ;
rdfs:label "Chardonnay" .
wine:Yoyowine
rdf:type wine:RedWine ;
wine:madeFromGrape wine:CabernetSauvignon ;
wine:hasSugar "dry" ;
wine:hasYear "2013"^^xsd:integer .
wine:Franvino
rdf:type wine:RedWine ;
wine:madeFromGrape wine:Merlo ;
wine:madeFromGrape wine:CabernetFranc ;
wine:hasSugar "dry" ;
wine:hasYear "2012"^^xsd:integer .
wine:Noirette
rdf:type wine:RedWine ;
wine:madeFromGrape wine:PinotNoir ;
wine:hasSugar "medium" ;
wine:hasYear "2012"^^xsd:integer .
wine:Blanquito
rdf:type wine:WhiteWine ;
wine:madeFromGrape wine:Chardonnay ;
wine:hasSugar "dry" ;
wine:hasYear "2012"^^xsd:integer .
wine:Rozova
rdf:type wine:RoseWine ;
wine:madeFromGrape wine:PinotNoir ;
wine:hasSugar "medium" ;
wine:hasYear "2013"^^xsd:integer .
Prerequisites
Solr core creation To create new Solr cores on the fly, you have to use the custom admin handler provided with
the Solr Connector.
1. Copy the solr-core-admin-handler.jar file from the /tools directory to the /configs/solr-home/ directory of the GraphDB distribution.
2. To start Solr, execute:
Solr schema setup To use the connector, the core’s schema from which the configuration will be copied (most
of the time named collection1) must be configured to allow schema modifications. See “Managed Schema
Definition in SolrConfig” on page 409 of the Apache Solr Reference Guide.
A good starting point is the configuration from example-schemaless in the Solr distribution.
Third-party component versions This version of the Solr GraphDB Connector uses Solr version 8.11.2.
Creating a connector instance is done by sending a SPARQL query with the following configuration data:
• the name of the connector instance (e.g., my_index);
• a Solr instance to synchronize to;
• classes to synchronize;
• properties to synchronize.
The configuration data has to be provided as a JSON string representation and passed together with the create
command.
You can create connectors via a Workbench dialog or by using a SPARQL update query (create command).
If you create the connector via the Workbench, you will be presented with a popup screen showing the connector creation progress, regardless of which method you use.
1. Go to Setup » Connectors.
2. Click New Connector in the tab of the respective Connector type you want to create.
3. Fill in the configuration form.
4. Execute the CREATE statement from the form by clicking OK. Alternatively, you can view its SPARQL query
by clicking View SPARQL Query, and then copy it to execute it manually or integrate it in automation scripts.
The create command is triggered by a SPARQL INSERT with the createConnector predicate, e.g., the following creates a connector instance called my_index, which synchronizes the wines from the sample data above.
To be able to use newlines and quotes without the need for escaping, here we use SPARQL’s multiline string
delimiter consisting of 3 apostrophes: '''...'''. You can also use 3 quotes instead: """...""".
INSERT DATA {
solr-index:my_index solr:createConnector '''
{
"solrUrl": "http://localhost:8983/solr",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{
"fieldName": "sugar",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasSugar"
],
"analyzed": false,
"multivalued": false
},
{
"fieldName": "year",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasYear"
],
"analyzed": false
}
]
}
''' .
}
Note: One of the fields has "multivalued": false. This is explained further under Sorting.
The above command creates a new Solr connector instance that connects to the Solr instance accessible at port
8983 on the localhost as specified by the "solrUrl" key.
The "types" key defines the RDF type of the entities to synchronize and, in the example, it is only entities of
the type http://www.ontotext.com/example/wine#Wine (and its subtypes if RDFS or higherlevel reasoning is
enabled). The "fields" key defines the mapping from RDF to Solr. The basic building block is the property chain,
i.e., a sequence of RDF properties where the object of each property is the subject of the following property. In
the example, three bits of information are mapped: the grape the wines are made of, the sugar content, and the year. Each
chain is assigned a short and convenient field name: “grape”, “sugar”, and “year”. The field names are later used
in the queries.
The field grape is an example of a property chain composed of more than one property. First, we take the wine’s
madeFromGrape property, the object of which is an instance of the type Grape, and then we take the rdfs:label
of this instance. The fields sugar and year are both composed of a single property that links the value directly to
the wine.
The fields sugar and year contain discrete values, such as medium, dry, 2012, 2013, and thus it is best to specify
the option analyzed: false as well. See analyzed in Defining fields for more information.
By default, GraphDB manages the Solr core and the Solr schema (it creates, deletes, or updates them as needed). This makes
it easier to use Solr as everything is done automatically. This behavior can be changed by the following options:
• manageCore: if true, GraphDB manages the core. true by default.
• manageSchema: if true, GraphDB manages the schema. true by default.
The automatic core management requires the custom Solr admin handler provided with the GraphDB distribution.
For more information, see Solr core creation.
Note: If either of the options is set to false, you have to create, update or remove the core/schema manually
and, in case Solr is misconfigured, the connector instance will not function correctly.
The present version provides no support for changing some advanced options, such as stop words, on a per-field
basis. The recommended way to do this for now is to manage the schema yourself and tell the connector to just
sync the object values in the appropriate fields. Here is an example:
INSERT DATA {
solr-index:my_index solr:createConnector '''
{
"solrUrl": "http://localhost:8983/solr",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{
"fieldName": "sugar",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasSugar"
],
"analyzed": false,
"multivalued": false
},
{
"fieldName": "year",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasYear"
],
        "analyzed": false
      }
    ],
    "manageSchema": false
  }
''' .
}
This creates the same connector instance as above, but it expects the fields with the specified field names, as well as some internal GraphDB fields, to already be present in the core. For this example, the core must already define fields for grape, sugar, and year in addition to the internal GraphDB fields.
GraphDB can access a secured Solr instance by passing arbitrary additional parameters. To set up a basic authentication configuration in the GraphDB Solr Connector, you need to configure the solrBasicAuthUser and solrBasicAuthPassword parameters.
...
solr-index:my_index solr:createConnector '''
{
    "solrUrl": "http://localhost:9090/solr",
"solrBasicAuthUser": "solr",
"solrBasicAuthPassword": "SolrRocks",
"fields": [
...
When you create a new Solr Connector in GraphDB Workbench, you need to add values for the solrBasicAuthUser
and solrBasicAuthPassword options.
Instead of supplying the username and password as part of the connector instance configuration, you can also
implement a custom authenticator class and set it via the authenticationConfiguratorClass option. See these
connector authenticator examples for more information and example projects that implement such a custom class.
For more information about securing Solr, see the documentation for Solr: Enable Basic Authentication.
Dropping a connector instance removes all references to its external store from GraphDB as well as the Solr core
associated with it.
The drop command is triggered by a SPARQL INSERT with the dropConnector predicate where the name of the
connector instance has to be in the subject position, e.g., this removes the connector my_index:
INSERT DATA {
solr-index:my_index solr:dropConnector [] .
}
You can also force drop a connector in case a normal delete does not work. The force delete will remove the
connector even if part of the operation fails. Go to Setup » Connectors where you will see the already existing
connectors that you have created. Click the delete icon, and check Force delete in the dialog box.
You can view the options string that was used to create a particular connector instance with the following query:
SELECT ?createString {
solr-index:my_index solr:listOptionValues ?createString .
}
Existing Connector instances are shown below the New Connector button. Click the name of an instance to view
its configuration and SPARQL query, or click the repair / delete icons to perform these operations. Click the copy
icon to copy the connector definition query to your clipboard.
Listing connector instances returns all previously created instances. It is a SELECT query with the listConnectors
predicate:
?cntUri is bound to the prefixed IRI of the connector instance that was used during creation, e.g., http://www.ontotext.com/connectors/solr/instance#my_index, while ?cntStr is bound to a string representing the part after the prefix, e.g., "my_index".
The internal state of each connector instance can be queried using a SELECT query and the connectorStatus predicate:
?cntUri is bound to the prefixed IRI of the connector instance, while ?cntStatus is bound to a string representation
of the status of the connector represented by this IRI. The status is key-value based.
From the user point of view, all synchronization happens transparently without using any additional predicates or
naming a specific store explicitly, i.e., you must simply execute standard SPARQL INSERT/DELETE queries. This
is achieved by intercepting all changes in the plugin and determining which Solr documents need to be updated.
Simple queries
Once a connector instance has been created, it is possible to query data from it through SPARQL. For each matching
Solr document, the connector instance returns the document subject. In its simplest form, querying is achieved by
using a SELECT and providing the Solr query as the object of the solr:query predicate:
SELECT ?entity {
?search a solr-index:my_index ;
solr:query "grape:cabernet" ;
solr:entities ?entity .
}
The result binds ?entity to the two wines made from grapes that have “cabernet” in their name, namely :Yoyowine
and :Franvino.
Note: You must use the field names you chose when you created the connector instance. They can be identical
to the property IRIs but you must escape any special characters according to what Solr expects.
1. Get a query instance of the requested connector instance by using the RDF notation "X a Y" (= X rdf:type
Y), where X is a variable and Y is a connector instance IRI. X is bound to a query instance of the connector
instance.
2. Assign a query to the query instance by using the system predicate solr:query.
3. Request the matching entities through the solr:entities predicate.
It is also possible to provide per query search options by using one or more option predicates. The option predicates
are described in detail below.
Raw queries
To access a Solr query parameter that is not exposed through a special predicate, use a raw query. Instead of
providing a full-text query in the :query part, specify raw Solr parameters. For example, to sort the facets in a
different order than described in facet.sort, execute the following query:
SELECT ?entity {
?search a solr-index:my_index ;
solr:query '''
{
"facet":"true",
"indent":"true",
"facet.sort":"index",
"q":"*:*",
"wt":"json"
}
''' ;
solr:entities ?entity .
}
You can get these parameters when you run your query from the Solr admin interface, or from the response payload (where they are included). The query parameters accepted by the select endpoint are also supported. Here is an example:
SELECT ?entity {
?search a solr-index:my_index ;
solr:query '''q=*%3A*&wt=json&indent=true&facet=true&facet.sort=index''' ;
solr:entities ?entity .
}
Note: You have to specify q= as the first parameter as it is used for detecting the raw query.
The bound ?entity can be used in other SPARQL triples in order to build complex queries that join to or fetch
additional data from GraphDB, for example, to see the actual grapes in the matching wines as well as the year they
were made:
Note: :Franvino is returned twice because it is made from two different grapes, both of which are returned.
It is possible to access the match score returned by Solr with the score predicate. As each entity has its own score,
the predicate should come at the entity level. For example:
The result looks like this but the actual score might be different as it depends on the specific Solr version:
Consider the sample wine data and the my_index connector instance described previously. You can also query
facets using the same instance:
It is important to specify the facet fields by using the facetFields predicate. Its value is a simple comma-delimited
list of field names. In order to get the faceted results, use the solr:facets predicate. As each facet has three
components (name, value and count), the solr:facets predicate returns multiple nodes that can be used to access
the individual values for each component through the predicates facetName, facetValue, and facetCount.
The resulting bindings look like the following:
You can easily see that there are three wines produced in 2012 and two in 2013. You also see that three of the
wines are dry, while two are medium. However, it is not necessarily true that the three wines produced in 2012 are
the same as the three dry wines as each facet is computed independently.
Tip: Faceting by an analyzed textual field works but might produce unexpected results. Analyzed textual fields are
composed of tokens and faceting uses each token to create a faceting bucket. For example, “North America” and
“Europe” produce three buckets: “north”, “america” and “europe”, corresponding to each token in the two values.
If you need to facet by a textual field and still do full-text search on it, it is best to create a copy of the field with
the setting "analyzed": false. For more information, see Copy fields.
While basic faceting allows for simple counting of documents based on the discrete values of a particular field,
there are more complex faceted or aggregation searches in Solr. The Solr GraphDB Connector provides a mapping
from Solr results to RDF results but no mechanism for specifying the queries other than executing a raw query (see Raw queries).
The Solr GraphDB Connector supports mapping of range, interval, and pivot facets.
The results are accessed through the predicate aggregations (much like the basic facets are accessed through facets). The predicate binds multiple blank nodes, each of which contains a single aggregation bucket whose individual items can be accessed through dedicated predicates on the bucket node.
Sorting
It is possible to sort the entities returned by a connector query according to one or more fields. Sorting is achieved with the orderBy predicate, whose value is a comma-delimited list of fields. Each field can be prefixed with a minus to indicate sorting in descending order. For example:
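A sketch of a sorted query (the query string and field names are illustrative and assume the my_index instance):
PREFIX solr: <http://www.ontotext.com/connectors/solr#>
PREFIX solr-index: <http://www.ontotext.com/connectors/solr/instance#>
SELECT ?entity {
?search a solr-index:my_index ;
solr:query "year:2013" ;
solr:orderBy "-sugar" ;
solr:entities ?entity .
}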
The result contains wines produced in 2013 sorted according to their sugar content in descending order:
By default, entities are sorted according to their matching score in descending order.
Note: If you join the entity from the connector query to other triples stored in GraphDB, GraphDB might scramble
the order. To remedy this, use ORDER BY from SPARQL.
Tip: Sorting by an analysed textual field works but might produce unexpected results. Analysed textual fields are
composed of tokens and sorting uses the least (in the lexicographical sense) token. For example, “North America”
will be sorted before “Europe” because the token “america” is lexicographically smaller than the token “europe”.
If you need to sort by a textual field and still do full-text search on it, it is best to create a copy of the field with the
setting "analyzed": false. For more information, see Copy fields.
Note: Solr imposes an additional requirement on fields used for sorting: they must be defined with multivalued = false.
Limit and offset
Limit and offset are supported on the Solr side of the query. This is achieved through the predicates limit and offset. Consider this example in which an offset of 1 and a limit of 1 are specified:
SELECT ?entity {
?search a solr-index:my_index ;
solr:query "sugar:dry" ;
solr:offset "1" ;
solr:limit "1" ;
solr:entities ?entity .
}
offset is counted from 0. The result contains a single wine, Franvino. If you execute the query without the limit
and offset, Franvino will be second in the list:
Note: The specific order in which GraphDB returns the results depends on how Solr returns the matches, unless
sorting is specified.
Snippet extraction
Snippet extraction is used for extracting highlighted snippets of text that match the query. The snippets are accessed
through the dedicated predicate solr:snippets. It binds a blank node that in turn provides the actual snippets via
the predicates solr:snippetField and solr:snippetText. The predicate snippets must be attached to the entity,
as each entity has a different set of snippets. For example, in a search for Cabernet:
PREFIX solr: <http://www.ontotext.com/connectors/solr#>
PREFIX solr-index: <http://www.ontotext.com/connectors/solr/instance#>
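# Sketch of the query body; the shape below is an assumption based on the
# snippet predicates described above, with "grape" as the searched field.
SELECT ?entity ?snippetField ?snippetText {
?search a solr-index:my_index ;
solr:query "grape:cabernet" ;
solr:entities ?entity .
?entity solr:snippets ?snippet .
?snippet solr:snippetField ?snippetField ;
solr:snippetText ?snippetText .
}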
the query returns the two wines made from Cabernet Sauvignon or Cabernet Franc grapes as well as the respective
matching fields and snippets:
Note: The actual snippets might be different as this depends on the specific Solr implementation.
It is possible to tweak how the snippets are collected/composed by using the following option predicates:
• solr:snippetSize sets the maximum size of the extracted text fragment, 250 by default;
• solr:snippetSpanOpen text to insert before the highlighted text, <em> by default;
• solr:snippetSpanClose text to insert after the highlighted text, </em> by default.
The option predicates are set on the query instance, much like the solr:query predicate.
Total hits
You can get the total number of matching Solr documents (hits) by using the solr:totalHits predicate, e.g., for
the connector instance my_index and a query that retrieves all wines made in 2012:
PREFIX solr: <http://www.ontotext.com/connectors/solr#>
PREFIX solr-index: <http://www.ontotext.com/connectors/solr/instance#>
SELECT ?totalHits {
?r a solr-index:my_index ;
solr:query "year:2012" ;
solr:totalHits ?totalHits .
}
As there are three wines made in 2012, the value 3 (of type xsd:long) binds to ?totalHits.
As you see above, you can omit returning any of the matching entities. This can be useful if there are many hits
and you want to calculate pagination parameters.
The creation parameters define how a connector instance is created by the solr:createConnector predicate. Some
are required and some are optional. All parameters are provided together in a JSON object, where the parameter
names are the object keys. Parameter values may be simple JSON values such as a string or a boolean, or they can
be lists or objects.
All of the creation parameters can also be set conveniently from the Create Connector user interface in the GraphDB
Workbench without any knowledge of JSON.
readonly (boolean), optional, read-only mode A read-only connector will index all existing data in the repository at creation time, but, unlike non-read-only connectors, it will:
• Not react to updates. Changes will not be synced to the connector.
• Not keep any extra structures (such as the internal Lucene index for tracking updates to chains)
The only way to index changes in data after the connector has been created is to repair (or drop/recreate) the
connector.
importGraph (boolean), optional, specifies that the RDF data from which to create the connector is in a special virtual graph
Used to make a Solr index from temporary RDF data inserted in the same transaction. It requires read-only mode and creates a connector whose data will come from statements inserted into a special virtual graph instead of data contained in the repository. The virtual graph is solr:graph, where the prefix solr: is as defined before. Data needs to be inserted into this graph before the connector create statement is executed.
Both the insertion into the special graph and the create statement must be in the same transaction. In the GraphDB Workbench, this can be done by pasting them one after another in the SPARQL editor and putting a semicolon at the end of the first INSERT.
importFile (string), optional, an RDF file with data from which to create the connector Creates a connector
whose data will come from an RDF file on the file system instead of data contained in the repository. The
value must be the full path to the RDF file. This functionality requires readonly mode.
detectFields (boolean), optional, detect fields This mode introduces automatic field detection when creating a connector. You can omit specifying fields in JSON. Instead, you will get automatic fields: each corresponds to a single predicate, and its field name is the same as the predicate (so you need to use escaping when issuing Solr queries).
In this mode, specifying types is optional too. If types are not provided, then all types will be indexed. This
mode requires importGraph or importFile.
Once the connector is created, you can inspect the detected fields in the Connector management section of
the Workbench.
solrUrl (URL), required, Solr instance to sync to As Solr is a third-party service, you have to specify the URL on which it is running. The format of the URL is of the form http://hostname.domain:port/. There is no default value. Can be updated at runtime without having to rebuild the index.
solrBasicAuthUser (string), optional, the settings for supplying the authentication user No default value.
Can be updated at runtime without having to rebuild the index.
solrBasicAuthPassword (string), optional, the settings for supplying the authentication password A password is a string with a single value that is not logged or printed. No default value. Can be updated at runtime without having to rebuild the index.
bulkUpdateBatchSize (integer), controls the maximum number of documents sent per bulk request.
Default value is 1,000. Can be updated at runtime without having to rebuild the index.
types (list of IRIs), required, specifies the types of entities to sync The RDF types of entities to sync are specified as a list of IRIs. At least one type IRI is required.
Use the pseudo-IRI $any to sync entities that have at least one RDF type.
Use the pseudo-IRI $untyped to sync entities regardless of whether they have any RDF type, see also the examples in General full-text search with the connectors.
languages (list of strings), optional, valid languages for literals RDF data is often multilingual but you can
map only some of the languages represented in the literal values. This can be done by specifying a list
of language ranges to be matched to the language tags of literals according to RFC 4647, Section 3.3.1.
Basic Filtering. In addition, an empty range can be used to include literals that have no language tag. The
list of language ranges maps all existing literals that have matching language tags.
fields (list of field objects), required, defines the mapping from RDF to Solr The fields define exactly what
parts of each entity will be synchronized as well as the specific details on the connector side. The field is
the smallest synchronization unit and it maps a property chain from GraphDB to a field in Solr. The fields
are specified as a list of field objects. At least one field object is required. Each field object has further keys
that specify details.
• fieldName (string), required, the name of the field in Solr The name of the field defines the mapping on the connector side. It is specified by the key fieldName with a string value. The field name is used at query time to refer to the field. There are few restrictions on the allowed characters in a field name but to avoid unnecessary escaping (which depends on how Solr parses its queries), we recommend keeping the field names simple.
• fieldNameTransform (one of none, predicate or predicate.localName), optional, none by default
Defines an optional transformation of the field name. Although fieldName is always required, it
is ignored if fieldNameTransform is predicate or predicate.localName.
– none: The field name is supplied via the fieldName option.
– predicate: The field name is equal to the full IRI of the last predicate of the chain, e.g., if
the last predicate was http://www.w3.org/2000/01/rdf-schema#label, then the field name
will be http://www.w3.org/2000/01/rdf-schema#label too.
– predicate.localName: The field name is derived from the local name of the IRI of the last predicate of the chain, e.g., if the last predicate was http://www.w3.org/2000/01/rdf-schema#comment, then the field name will be comment.
• ignoreInvalidValues (boolean), optional, default false Per-field option that controls what happens when a value cannot be converted to the requested (or previously detected) type. Note that some conversions are always valid: any literal to an FTS field, any non-literal (IRI, blank node, embedded triple) to a non-analyzed field. When true, such values will be skipped with a note in the logs. When false, such values will break the transaction.
• datatype (string), optional, the manual datatype override By default, the Solr GraphDB Connector uses the datatype of literal values to determine how they must be mapped to Solr types. For more information on the supported datatypes, see Datatype mapping.
The mapping can be overridden through the property “datatype”, which can be specified per field.
The value of “datatype” can be any of the xsd: types supported by the automatic mapping or a
native Solr type prefixed by native:, e.g., both xsd:long and native:tlongs map to the tlongs
type in Solr.
valueFilter (string), optional, specifies the top-level value filter for the document See also Entity filtering.
documentFilter (string), optional, specifies the top-level document filter for the document See also Entity filtering.
As mentioned above, the following connector parameters can be updated at runtime without having to rebuild the
index:
• solrUrl
• bulkUpdateBatchSize
• solrBasicAuthUser
• solrBasicAuthPassword
This can be done by executing the following SPARQL update, here with examples for changing the user and
password:
PREFIX conn:<http://www.ontotext.com/connectors/solr#>
PREFIX inst:<http://www.ontotext.com/connectors/solr/instance#>
INSERT DATA {
inst:properIndex conn:updateConnector '''
{
"solrBasicAuthUser": "foo",
"solrBasicAuthPassword": "bar"
}
'''.
}
Copy fields
Often, it is convenient to synchronize one and the same data multiple times with different settings to accommodate different use cases, e.g., faceting or sorting vs. full-text search. The Solr GraphDB Connector has explicit support for fields that copy their value from another field. This is achieved by specifying a single element in the property chain of the form @otherFieldName, where otherFieldName is another non-copy field. Take the following example:
...
"fields": [
{
"fieldName": "grape",
"facet": false,
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{
"fieldName": "grapeFacet",
"analyzed": false,
"propertyChain": [
"@grape"
]
}
],
...
The snippet creates an analysed field "grape" and a non-analysed field "grapeFacet". Both fields are populated with the same values, and "grapeFacet" is defined as a copy field that refers to the field "grape".
Note: The connector handles copy fields in a more optimal way than specifying a field with exactly the same
property chain as another field.
Multiple property chains
Sometimes, you have to work with data models that define the same concept (in terms of what you want to index in
Solr) with more than one property chain, e.g., the concept of “name” could be defined as a single canonical name,
multiple historical names and some unofficial names. If you want to index these together as a single field in Solr
you can define this as a multiple property chains field.
Fields with multiple property chains are defined as a set of separate virtual fields that will be merged into a single
physical field when indexed. Virtual fields are distinguished by the suffix $xyz, where xyz is any alphanumeric
sequence of convenience. For example, we can define the fields name$1 and name$2 like this:
...
"fields": [
{
"fieldName": "name$1",
"propertyChain": [
"http://www.ontotext.com/example#canonicalName"
]
},
{
"fieldName": "name$2",
"propertyChain": [
"http://www.ontotext.com/example#historicalName"
]
},
...
The values of the fields name$1 and name$2 will be merged and synchronized to the field name in Solr.
Note: You cannot mix suffixed and unsuffixed fields with the same name, e.g., if you defined myField$new and myField$old, you cannot have a field called just myField.
Filters can be used with fields defined with multiple property chains. Both the physical field values and the indi
vidual virtual field values are available:
• Physical fields are specified without the suffix, e.g., ?myField
• Virtual fields are specified with the suffix, e.g., ?myField$2 or ?myField$alt.
Note: Physical fields cannot be combined with parent() as their values come from different property chains. If you really need to filter the same parent level, you can rewrite parent(?myField) in (<urn:x>, <urn:y>) as parent(?myField$1) in (<urn:x>, <urn:y>) || parent(?myField$2) in (<urn:x>, <urn:y>) || parent(?myField$3) ... and surround it with parentheses if it is a part of a bigger expression.
Indexing language tags
The language tag of an RDF literal can be indexed by specifying a property chain, where the last element is the pseudo-IRI lang(). The property preceding lang() must lead to a literal value. For example:
INSERT DATA {
solr-index:my_index solr:createConnector '''
{
"solrUrl": "http://localhost:8984/solr",
"types": ["http://www.ontotext.com/example#gadget"],
"fields": [
{
"fieldName": "name",
"propertyChain": [
"http://www.ontotext.com/example#name"
]
},
{
"fieldName": "nameLanguage",
"propertyChain": [
"http://www.ontotext.com/example#name",
"lang()"
]
}
]
}
''' .
}
The above connector will index the language tag of each literal value of the property http://www.ontotext.com/
example#name into the field nameLanguage.
Indexing the named graph
The named graph of a given value can be indexed by ending a property chain with the special pseudo-IRI graph(). Indexing the named graph of the value instead of the value itself allows searching by named graph.
INSERT DATA {
solr-index:my_index solr:createConnector '''
{
"solrUrl": "http://localhost:8983/solr",
"types": ["http://www.ontotext.com/example#gadget"],
"fields": [
{
"fieldName": "name",
"propertyChain": [
"http://www.ontotext.com/example#name"
]
},
{
"fieldName": "nameGraph",
"propertyChain": [
"http://www.ontotext.com/example#name",
"graph()"
]
}
]
}
''' .
}
The above connector will index the named graph of each value of the property http://www.ontotext.com/
example#name into the field nameGraph.
Wildcard literal indexing
In this mode, the last element of a property chain is a wildcard that will match any predicate that leads to a literal value. Use the special pseudo-IRI $literal as the last element of the property chain to activate it.
Note: Currently, it really means any literal, including literals with data types.
For example:
{
"fields" : [ {
"propertyChain" : [ "$literal" ],
"fieldName" : "name"
}, {
"propertyChain" : [ "http://example.com/description", "$literal" ],
"fieldName" : "description"
}
...
}
Sometimes you may need the IRI of each entity (e.g., http://www.ontotext.com/example/wine#Franvino from our small example dataset) indexed as a regular field. This can be achieved by specifying a property chain with a single property referring to the pseudo-IRI $self. For example:
INSERT DATA {
solr-index:my_index solr:createConnector '''
{
"solrUrl": "http://localhost:8983/solr",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "entityId",
"propertyChain": [
"$self"
]
},
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
}
]
}
''' .
}
The above connector will index the IRI of each wine into the field entityId.
Note: GraphDB will also use the IRI of each entity as the ID of each document in Solr, which is represented by the field id.
Datatype mapping
The Solr GraphDB Connector maps different types of RDF values to different types of Solr values according to the basic type of the RDF value (IRI or literal) and the datatype of literals. The automatic detection assigns each supported RDF datatype to a corresponding native Solr field type.
The datatype mapping can be affected by the synchronization options, too. For example, a non-analysed field that has xsd:long values does not use plong or plongs but string instead.
Note: For any given field, the automatic mapping uses the first value it sees. This works fine for clean datasets but might lead to problems if your dataset has non-normalised data, e.g., the first value has no datatype but other values do.
It is therefore recommended to set datatype to a fixed value, e.g. xsd:date.
Please note that the commonly used xsd:integer and xsd:decimal datatypes are not indexed as numbers because
they represent infinite precision numbers. You can override that by using the datatype option to cast to xsd:long,
xsd:double, xsd:float as appropriate.
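For example, a field definition along these lines (a sketch based on the sample wine data, not the original example) would index the year values as numbers:
...
{
"fieldName": "year",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasYear"
],
"datatype": "xsd:long"
}
...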
RDF and Solr use slightly different models to represent dates and times, even though the values might look very
similar.
Years in RDF values use the XSD format and are era years, where positive values denote the common era and
negative values denote years before the common era. There is no year zero.
Years in Solr use the ISO format and are proleptic years, i.e., positive values denote years from the common era, and earlier years simply continue downwards mathematically, so there is a year zero.
In short:
• year 2020 CE = year 2020 in XSD = year 2020 in ISO.
• …
• year 1 CE = year 1 in XSD = year 1 in ISO.
• year 1 BCE = year 1 in XSD = year 0 in ISO.
• year 2 BCE = year 2 in XSD = year 1 in ISO.
• …
All years coming from RDF literals will be converted to ISO before indexing in Solr.
Both XSD and ISO date and time values support timezones. Solr requires all date and time values to be normalized
to the UTC timezone, so the Solr connector will convert the values accordingly before sending them to Solr for
indexing.
In addition to that, XSD defines the lack of a timezone as undetermined. Since we do not want to have any undetermined state in the indexing system, we define the undetermined time zone as UTC, i.e., "2020-02-14T12:00:00"^^xsd:dateTime is equivalent to "2020-02-14T12:00:00Z"^^xsd:dateTime (Z is the UTC timezone, also known as +00:00).
Also note that XSD dates and partial dates, e.g., xsd:gYear values, may have a timezone, which leads to additional
complications. E.g., "2020+02:00"^^xsd:gYear (the year 2020 in the +02:00 timezone) will be normalized to
2019-12-31T22:00:00Z (the previous year!) if strict timezone adherence is followed. We have chosen to ignore
the timezone on any values that do not have an associated time value, e.g.:
• "2020-02-15+02:00"^^xsd:date
• "2020-02+02:00"^^xsd:gYearMonth
• "2020+02:00"^^xsd:gYear
All of the above will be treated as if they specified UTC as their timezone.
Entity filtering
The Solr connector supports three kinds of entity filters used to fine-tune the set of entities and/or individual values for the configured fields, based on the field value. Entities and field values are synchronized to Solr if, and only if, they pass the filter. The filters are similar to a FILTER() inside a SPARQL query but not exactly the same. In them, each configured field can be referred to by prefixing it with a ?, much like referring to a variable in SPARQL.
Types of filters
Top-level value filter The top-level value filter is specified via valueFilter. It is evaluated prior to anything else when only the document ID is known and it may not refer to any field names but only to the special field $this that contains the current document ID. Failing to pass this filter removes the entire document early in the indexing process, and it can be used to introduce more restrictions similar to the built-in filtering by type via the types property.
Top-level document filter The top-level document filter is specified via documentFilter. This filter is evaluated last, when the whole document has been collected, and it decides whether to include the document in the index. It can be used to enforce global document restrictions, e.g., certain fields are required or a document needs to be indexed only if a certain field value meets specific conditions.
Per-field value filter The per-field value filter is specified via valueFilter inside the field definition of the field whose values are to be filtered. The filter is evaluated while collecting the data for the field, when each field value becomes available.
The variable that contains the field value is $this. Other field names can be used to filter the current field's value based on the value of another field, e.g., $this > ?age will compare the current field value to the value of the field age (see also Two-variable filtering). Failing to pass the filter will remove the current field value.
See also Migrating from GraphDB 9.x.
Filter operators
The filter operators are used to test if the value of a given field satisfies a certain condition.
Field comparisons are done on original RDF values before they are converted to Solr values using datatype mapping.
?var in (value1, value2, ...) Tests if the field var's value is one of the specified values. Values are compared strictly, unlike the similar SPARQL operator, i.e., for literals to match, their datatype must be exactly the same (similar to how SPARQL sameTerm works). Values that do not match are treated as if they were not present in the repository.
Example: ?status in ("active", "new")
?var not in (value1, value2, ...) The negated version of the in operator.
Example: ?status not in ("archived")
bound(?var) Tests if the field var has a valid value. This can be used to make the field compulsory.
Example: bound(?name)
isExplicit(?var) Tests if the field var's value came from an explicit statement. This will use the last element of the property chain. If you need to assert the explicit status of a previous element of the property chain, use parent(?var) as many times as needed.
Example: isExplicit(?name)
?var = value (equal to), ?var != value (not equal to), ?var > value (greater than), ?var >= value (greater than or equal to), ?var < value (less than), ?var <= value (less than or equal to) RDF value comparison operators that compare RDF values similarly to the equivalent SPARQL operators. The field var's value will be compared to the specified RDF value. When comparing RDF values that are literals, their datatypes must be compatible, e.g., xsd:integer and xsd:long but not xsd:string and xsd:date. Values that do not match are treated as if they were not present in the repository.
Example: Given that height's value is "150"^^xsd:int and dateOfBirth's value is "1989-12-31"^^xsd:date, the filters ?height >= "100"^^xsd:int and ?dateOfBirth < "1990-01-01"^^xsd:date are both true.
regex(?var, "pattern") or regex(?var, "pattern", "flags") Tests if the field var's value matches the given regular expression; the optional third argument supplies matching flags such as "i" for case-insensitive matching.
Example: regex(?name, "^mrs?", "i")
!filter Logical negation of the enclosed filter.
Example: !bound(?company)
filter1 && filter2, filter1 || filter2 Logical conjunction and disjunction of filters. The keyword or can be used instead of ||, and parentheses can be used for grouping.
Example: (bound(?name) or bound(?company)) && bound(?address)
Filter modifiers
In addition to the operators, there are some constructions that can be used to write filters based not on the values
of a field but on values related to them:
Accessing the previous element in the chain The construction parent(?var) is used for going to a previous level in a property chain. It can be applied recursively as many times as needed, e.g., parent(parent(parent(?var))) goes back in the chain three times. The effective value of parent(?var) can be used with the in or not in operator like this: parent(?company) in (<urn:a>, <urn:b>), or in the bound operator like this: parent(bound(?var)).
Accessing an element beyond the chain The construction ?var -> uri (alternatively, ?var o uri or just ?var uri) is used to access additional values that are accessible through the property uri. In essence, this construction corresponds to the triple pattern value uri ?effectiveValue, where ?value is a value bound by the field var. The effective value of ?var -> uri can be used with the in or not in operator like this: ?company -> rdf:type in (<urn:c>, <urn:d>). It can be combined with parent() like this: parent(?company) -> rdf:type in (<urn:c>, <urn:d>). The same construction can be applied to the bound operator like this: bound(?company -> <urn:hasBranch>), or even combined with parent() like this: bound(parent(?company) -> <urn:hasGroup>).
The IRI parameter can be a full IRI within < > or the special string rdf:type (alternatively, just type), which will be expanded to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
Filtering by named graph The construction graph(?var) is used for accessing the named graph of a field's value. The construction can be combined with parent() like this: graph(parent(?a)) in (<urn:a>).
Filtering by language tags The construction lang(?var) is used for accessing the language tag of a field's value (only RDF literals can have a language tag). The typical use case is to sync only values written in a given language: lang(?a) in ("de", "it", "no"). The construction can be combined with parent() and an element beyond the chain like this: lang(parent(?a) -> <http://www.w3.org/2000/01/rdf-schema#label>) in ("en", "bg"). Literal values without language tags can be filtered by using an empty tag: "".
Current context variable $this The special field variable $this (and not ?this, ?$this, $?this) is used to refer to the current context. In the top-level value filter and the top-level document filter, it refers to the document. In the per-field value filter, it refers to the currently filtered field value. In the nested document filter, it refers to the nested document.
ALL() quantifier In the context of document-level filtering, a match is true if at least one of potentially many field values matches, e.g., ?location = <urn:Europe> would return true if the document contains { "location": ["<urn:Asia>", "<urn:Europe>"] }.
In addition to this, you can also use the ALL() quantifier when you need all values to match, e.g., ALL(?location) = <urn:Europe> would not match with the above document because <urn:Asia> does not match.
Entity filters and default values Entity filters can be combined with default values in order to get more flexible
behavior.
If a field has no values in the RDF database, the defaultValue is used. But if a field has some values,
defaultValue is NOT used, even if all values are filtered out. See an example in Basic entity filter.
A typical use case for an entity filter is having soft deletes, i.e., instead of deleting an entity, it is marked as deleted by the presence of a specific value for a given property.
Two-variable filtering
Besides comparing a field value to one or more constants or running an existential check on the field value, some
use cases also require comparing the field value to the value of another field in order to produce the desired result.
GraphDB solves this by supporting two-variable filtering in the per-field value filter and the top-level document filter.
Note: This type of filtering is not possible in the top-level value filter because the only variable that is available there is $this.
In the top-level document filter, there are no restrictions, as all values are available at the time of evaluation.
In the per-field value filter, two-variable filtering will reorder the defined fields such that values for other fields are already available when the current field's filter is evaluated. For example, let's say we defined a filter $this > ?salary for the field price. This will force the connector to process the field salary first, apply its per-field value filter if any, and only then start collecting and filtering the values for the field price.
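A sketch of such a definition (the price and salary properties are hypothetical and only illustrate the shape of the configuration):
...
"fields": [
{
"fieldName": "salary",
"propertyChain": ["http://www.ontotext.com/example#salary"]
},
{
"fieldName": "price",
"propertyChain": ["http://www.ontotext.com/example#price"],
"valueFilter": "$this > ?salary"
}
],
...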
Cyclic dependencies will be detected and reported as an invalid filter. For example, if in addition to the above we define a per-field value filter ?price > "1000"^^xsd:int for the field salary, a cyclic dependency will be detected, as both price and salary will require the other field to be indexed first.
Basic entity filter example
The following sample data and connector definition demonstrate a basic entity filter that combines a per-field value filter with a top-level document filter:
# the entity below will be synchronised because it has a matching value for city: ?city in ("London")
example:alpha
rdf:type example:gadget ;
example:name "John Synced" ;
example:city "London" .
# the entity below will not be synchronised because it lacks the property completely: bound(?city)
example:beta
rdf:type example:gadget ;
example:name "Peter Syncfree" .
# the entity below will not be synchronized because it has a different city value:
# ?city in ("London") will remove the value "Liverpool" so bound(?city) will be false
example:gamma
rdf:type example:gadget ;
example:name "Mary Syncless" ;
example:city "Liverpool" .
INSERT DATA {
solr-index:my_index solr:createConnector '''
{
"solrUrl": "http://localhost:8983/solr",
"types": ["http://www.ontotext.com/example#gadget"],
"fields": [
{
"fieldName": "name",
"propertyChain": ["http://www.ontotext.com/example#name"]
},
{
"fieldName": "city",
"propertyChain": ["http://www.ontotext.com/example#city"],
"valueFilter": "$this = \\"London\\""
}
],
"documentFilter": "bound(?city)"
}
''' .
}
If the city field definition is instead given a default value:
...
{
"fieldName": "city",
"propertyChain": ["http://www.ontotext.com/example#city"],
"defaultValue": "London"
}
...
}
The default value is used for the entity :beta as it has no value for city in the repository. As the value is “London”,
the entity is synchronized.
Sometimes, data represented in RDF is not well suited to map directly to non-RDF. For example, if you have news articles and they can be tagged with different concepts (locations, persons, events, etc.), one possible way to model this is a single property :taggedWith. Consider the following RDF data:
example2:Berlin
rdf:type example2:Location ;
rdfs:label "Berlin" .
example2:Mozart
rdf:type example2:Person ;
rdfs:label "Wolfgang Amadeus Mozart" .
example2:Einstein
rdf:type example2:Person ;
rdfs:label "Albert Einstein" .
example2:Cannes-FF
rdf:type example2:Event ;
rdfs:label "Cannes Film Festival" .
example2:Article1
rdf:type example2:Article ;
rdfs:comment "An article about a film about Einstein's life while he was a professor in Berlin." ;
example2:taggedWith example2:Berlin ;
example2:taggedWith example2:Einstein ;
example2:taggedWith example2:Cannes-FF .
example2:Article2
rdf:type example2:Article ;
rdfs:comment "An article about Berlin." ;
example2:taggedWith example2:Berlin .
example2:Article3
rdf:type example2:Article ;
rdfs:comment "An article about Mozart's life." ;
example2:taggedWith example2:Mozart .
example2:Article4
rdf:type example2:Article ;
rdfs:comment "An article about classical music in Berlin." ;
example2:taggedWith example2:Berlin ;
example2:taggedWith example2:Mozart .
example2:Article5
rdf:type example2:Article ;
rdfs:comment "A boring article that has no tags." .
example2:Article6
rdf:type example2:Article ;
rdfs:comment "An article about the Cannes Film Festival in 2013." ;
example2:taggedWith example2:Cannes-FF .
Assume you want to map this data to Solr, so that the property example2:taggedWith x is mapped to separate
fields taggedWithPerson and taggedWithLocation, according to the type of x (whereas we are not interested in
Events). You can map taggedWith twice to different fields and then use an entity filter to get the desired values:
INSERT DATA {
solr-index:my_index solr:createConnector '''
{
"solrUrl": "http://localhost:8983/solr",
"types": ["http://www.ontotext.com/example2#Article"],
"fields": [
{
"fieldName": "comment",
"propertyChain": ["http://www.w3.org/2000/01/rdf-schema#comment"]
},
{
"fieldName": "taggedWithPerson",
"propertyChain": ["http://www.ontotext.com/example2#taggedWith"],
"valueFilter": "$this -> type = <http://www.ontotext.com/example2#Person>"
},
{
"fieldName": "taggedWithLocation",
"propertyChain": ["http://www.ontotext.com/example2#taggedWith"],
"valueFilter": "$this -> type = <http://www.ontotext.com/example2#Location>"
}
]
}
''' .
}
With this mapping, each of the six articles above will get taggedWithPerson and taggedWithLocation values only for the tags of the respective type.
This can be checked by issuing a faceted search for taggedWithLocation and taggedWithPerson:
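A sketch of such a facet query (assuming the my_index instance defined above and that both fields are enabled for faceting):
PREFIX solr: <http://www.ontotext.com/connectors/solr#>
PREFIX solr-index: <http://www.ontotext.com/connectors/solr/instance#>
SELECT ?facetName ?facetValue ?facetCount {
?r a solr-index:my_index ;
solr:query "*:*" ;
solr:facetFields "taggedWithLocation,taggedWithPerson" ;
solr:facets ?facet .
?facet solr:facetName ?facetName ;
solr:facetValue ?facetValue ;
solr:facetCount ?facetCount .
}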
If the filter was applied correctly, you should get only :Berlin for taggedWithLocation, and only :Einstein and :Mozart for taggedWithPerson.
Overview of connector predicates
The following diagram shows a summary of all predicates that can administrate (create, drop, check status) connector instances or issue queries and retrieve results. It can be used as a quick reference of what a particular predicate
needs to be attached to. For example, to retrieve entities, you need to use :entities on a search instance and to
retrieve snippets, you need to use :snippets on an entity. Variables that are bound as a result of a query are shown
in green, blank helper nodes are shown in blue, literals in red, and IRIs in orange. The predicates are represented
by labeled arrows.
From GraphDB 8.0/Connectors 6.0, the Solr connector has SolrCloud support. SolrCloud is the distributed version
of Solr, which offers index sharding, better scaling, fault tolerance, etc. It uses Apache Zookeeper for distributed
synchronization and central configuration of the Solr nodes. The Solr indexes are called collections, which is the
sharded version of cores.
Zookeeper instances
Creating a SolrCloud connector is the same as creating a Solr connector with the only difference in the syntax of
the solrUrl parameter:
"solrUrl":"zk://localhost:2181|numShards=2|replicationFactor=2|maxShardsPerNode=3"
zk://localhost:2181 is the host and port of the started Zookeeper instance and the rest are the parameters for
creating the SolrCloud collection, delimited with pipes. The supported cluster parameters are:
• numShards
• replicationFactor
• maxShardsPerNode
• autoAddReplicas
• router.name
• router.field
• shards
Note: numShards and replicationFactor are mandatory parameters. maxShardsPerNode is set to the numShards value when absent.
For more information on how to use these options, check the SolrCloud Collections API documentation.
You can also have multiple Zookeeper instances orchestrating the Solr nodes. They have to be mentioned in the
connection string.
"solrUrl":"zk://localhost:2181,zk://localhost:2182|numShards=2|replicationFactor=2|maxShardsPerNode=3"
Note: The Zookeeper instances must be running on the hosts specified in the solrUrl parameter.
More information on how to set up a SolrCloud cluster can be found in the Solr documentation.
Unlike the standard Solr cores, where each core has a /conf directory containing all of its configurations, SolrCloud collections decouple the configuration from the data. The configurations are called configsets and they reside in the Zookeeper instances. Before you create a new collection, you have to upload all your default or custom configurations to Zookeeper under specific names.
Note: Check Command Line Utilities and ConfigSets API from SolrCloud documentation on how to upload
configsets.
When creating a SolrCloud connector, you have to specify the configset name in the copyConfigsFrom parameter. If you do not specify it, the connector will look for a configset with the default name collection1. As a good practice, it is recommended to upload your default configuration under the name collection1; then, when you create a new connector with the default index configuration, you will not have to specify this parameter again. For other, custom configsets, set the parameter to the name of the custom configset, e.g., customConfigset.
Example: Create SolrCloud connector query using a custom configset
INSERT DATA {
solr-index:my_collection solr:createConnector '''
{
"solrUrl": "zk://localhost:2181|numShards=2|replicationFactor=2|maxShardsPerNode=3",
"copyConfigsFrom": "customConfigset"
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{
"fieldName": "sugar",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasSugar"
],
"multivalued": false
},
{
"fieldName": "year",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasYear"
]
}
]
}
''' .
}
7.3.10 Caveats
Order of control
Even though SPARQL per se is not sensitive to the order of triple patterns, the Solr GraphDB Connector expects
to receive certain predicates before others so that queries can be executed properly. In particular, predicates that
specify the query or query options need to come before any predicates that fetch results.
The diagram in Overview of connector predicates provides a quick overview of the predicates.
Migrating from GraphDB 9.x
GraphDB 10.0 introduces major changes to the filtering mechanism of the connectors. Existing connector instances
will not be usable and attempting to use them for queries or updates will throw an error.
If your GraphDB 9.x (or older) connector definitions do not include an entity filter, you can simply repair them.
If your GraphDB 9.x (or older) connector definitions do include an entity filter with the entityFilter option, you
need to rewrite the filter with one of the current filter types:
1. Save your existing connector definition.
2. Drop the connector instance.
3. In general, most older connector filters can be easily rewritten using the per-field value filter and the top-level document filter. Rewrite the filters as follows:
Rule of thumb:
• If you want to remove individual values, i.e., if the operand is not BOUND() –> rewrite with a per-field value filter.
• If you want to remove entire documents, i.e., if the operand is BOUND() –> rewrite with a top-level document filter.
7.4 Kafka GraphDB Connector
The Kafka connector provides a means to synchronize changes to the RDF model to any Kafka consumer, staying automatically up to date with the GraphDB repository data.
The Connectors provide synchronization at the entity level, where an entity is defined as having a unique identifier
(an IRI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the
same subject. In addition to simple properties (defined by a single triple), the Connectors support property chains.
A property chain is defined as a sequence of triples where each triple’s object is the subject of the following triple.
On the Kafka side, the RDF entities are translated to JSON documents.
The main features of the Kafka Connector are:
• maintaining a Kafka topic that is always in sync with the data stored in GraphDB;
• multiple independent instances per repository;
• the entities for synchronization are defined by:
– a list of fields (on the Kafka side) and property chains (on the GraphDB side) whose values will be
synchronized;
– a list of rdf:type’s of the entities for synchronization;
– a list of languages for synchronization (the default is all languages);
– additional filtering by property and value.
Unlike the Elasticsearch, Solr, and Lucene connectors, the Kafka connector does not have a query interface since
Kafka is a simple message queue and does not provide search functionality.
Each feature is described in detail below.
In terms of Kafka terminology and behavior:
• Each connector instance must be assigned to a fixed Kafka topic.
• The connector is a Kafka producer, and does not have any information about the Kafka consumers.
• The partitions are assigned by the Kafka framework and not the connector.
7.4.2 Usage
All interactions with the Kafka GraphDB Connector are done through SPARQL queries.
There are three types of SPARQL queries:
• INSERT for creating, updating, and deleting connector instances;
• SELECT for listing connector instances and querying their configuration parameters;
• INSERT/SELECT for storing and querying data as part of the normal GraphDB data workflow.
In general, this corresponds to INSERT that adds or modifies data, and to SELECT that queries existing data.
Each connector implementation defines its own IRI prefix to distinguish it from other connectors. For the Kafka GraphDB Connector, this is http://www.ontotext.com/connectors/kafka#. Each command or predicate executed by the connector uses this prefix, e.g., http://www.ontotext.com/connectors/kafka#createConnector to create a connector instance for Kafka.
Individual instances of a connector are distinguished by unique names that are also IRIs. They have their own prefix
to avoid clashing with any of the command predicates. For Kafka, the instance prefix is http://www.ontotext.
com/connectors/kafka/instance#.
Sample data All examples use the following sample data that describes five fictitious wines: Yoyowine, Franvino, Noirette, Blanquito, and Rozova, as well as the grape varieties required to make these wines. The minimum required ruleset level in GraphDB is RDFS.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix wine: <http://www.ontotext.com/example/wine#> .
wine:Merlo
rdf:type wine:Grape ;
rdfs:label "Merlo" .
wine:CabernetFranc
rdf:type wine:Grape ;
rdfs:label "Cabernet Franc" .
wine:PinotNoir
rdf:type wine:Grape ;
rdfs:label "Pinot Noir" .
wine:Chardonnay
rdf:type wine:Grape ;
rdfs:label "Chardonnay" .
wine:Yoyowine
rdf:type wine:RedWine ;
wine:madeFromGrape wine:CabernetSauvignon ;
wine:hasSugar "dry" ;
wine:hasYear "2013"^^xsd:integer .
wine:Franvino
rdf:type wine:RedWine ;
wine:madeFromGrape wine:Merlo ;
wine:madeFromGrape wine:CabernetFranc ;
wine:hasSugar "dry" ;
wine:hasYear "2012"^^xsd:integer .
wine:Noirette
rdf:type wine:RedWine ;
wine:madeFromGrape wine:PinotNoir ;
wine:hasSugar "medium" ;
wine:hasYear "2012"^^xsd:integer .
wine:Blanquito
rdf:type wine:WhiteWine ;
wine:madeFromGrape wine:Chardonnay ;
wine:hasSugar "dry" ;
wine:hasYear "2012"^^xsd:integer .
wine:Rozova
rdf:type wine:RoseWine ;
wine:madeFromGrape wine:PinotNoir ;
wine:hasSugar "medium" ;
wine:hasYear "2013"^^xsd:integer .
Prerequisites
Third-party component versions This version of the Kafka GraphDB Connector uses Kafka version 3.3.1.
Creating a connector instance is done by sending a SPARQL query with the following configuration data:
• the name of the connector instance (e.g., my_index);
• a Kafka node and topic to synchronize to;
• classes to synchronize;
• properties to synchronize.
The configuration data has to be provided as a JSON string representation and passed together with the create
command.
You can create connectors via a Workbench dialog or by using a SPARQL update query (create command).
If you create the connector via the Workbench, no matter which of the two ways you use, you will be presented with a popup screen showing the connector creation progress.
1. Go to Setup ‣ Connectors.
2. Click New Connector in the tab of the respective Connector type you want to create.
3. Fill out the configuration form.
4. Execute the CREATE statement from the form by clicking OK. Alternatively, you can view its SPARQL query
by clicking View SPARQL Query, and then copy it to execute it manually or integrate it in automation scripts.
The create command is triggered by a SPARQL INSERT with the kafka:createConnector predicate, e.g., the following command creates a connector instance called my_index, which synchronizes the wines from the sample data above.
To be able to use newlines and quotes without the need for escaping, here we use SPARQL’s multiline string
delimiter consisting of 3 apostrophes: '''...'''. You can also use 3 quotes instead: """...""".
PREFIX kafka: <http://www.ontotext.com/connectors/kafka#>
PREFIX kafka-inst: <http://www.ontotext.com/connectors/kafka/instance#>
INSERT DATA {
kafka-inst:my_index kafka:createConnector '''
{
"kafkaNode": "localhost:9092",
"kafkaTopic": "my_index",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape"
]
},
{
"fieldName": "sugar",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasSugar"
]
},
{
"fieldName": "year",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasYear"
]
}
]
}
''' .
}
The above command creates a new Kafka connector instance that connects to the Kafka instance accessible at port 9092 on the localhost, as specified by the kafkaNode key.
The "types" key defines the RDF type of the entities to synchronize and, in the example, it is only entities of
the type http://www.ontotext.com/example/wine#Wine (and its subtypes if RDFS or higherlevel reasoning is
enabled). The "fields" key defines the mapping from RDF to Kafka. The basic building block is the property
chain, i.e., a sequence of RDF properties where the object of each property is the subject of the following property.
In the example, three bits of information are mapped the grape the wines are made of, sugar content, and year.
Each chain is assigned a short and convenient field name: “grape”, “sugar”, and “year”. The field names are later
used in the queries.
The field grape is an example of a property chain composed of more than one property. First, we take the wine’s
madeFromGrape property, the object of which is an instance of the type Grape, and then we take the rdfs:label
of this instance. The fields sugar and year are both composed of a single property that links the value directly to
the wine.
GraphDB can connect to a secured Kafka broker using the SASL/PLAIN authentication mechanism. To configure it, set the kafkaPlainAuthUsername and kafkaPlainAuthPassword parameters. Since the password will be transmitted in clear text, it is recommended to enable SSL on the Kafka broker, and accordingly set the kafkaSSL parameter to true.
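For example, the relevant part of the creation parameters might look like this (a sketch; the broker address and credentials are purely illustrative):
...
{
"kafkaNode": "broker.example.com:9093",
"kafkaTopic": "my_index",
"kafkaSSL": true,
"kafkaPlainAuthUsername": "kafka-user",
"kafkaPlainAuthPassword": "kafka-password",
...
}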
Instead of supplying the username and password as part of the connector instance configuration, you can also
implement a custom authenticator class and set it via the authenticationConfiguratorClass option. See these
connector authenticator examples for more information and example projects that implement such a custom class.
There is no explicitly configurable support for the other authentication mechanisms supported by Kafka. It should be possible to configure most of them by supplying the relevant Kafka producer properties via the kafkaProducerConfig parameter.
Dropping a connector instance removes all references to its external store from GraphDB as well as the Kafka
index associated with it.
The drop command is triggered by a SPARQL INSERT with the dropConnector predicate where the name of the
connector instance has to be in the subject position, e.g., this removes the connector my_index:
INSERT DATA {
kafka-inst:my_index kafka:dropConnector [] .
}
You can also force drop a connector in case a normal delete does not work. The force delete will remove the
connector even if part of the operation fails. Go to Setup ‣ Connectors where you will see the already existing
connectors that you have created. Click the delete icon, and check Force delete in the dialog box.
You can view the options string that was used to create a particular connector instance with the following query:
SELECT ?createString {
kafka-inst:my_index kafka:listOptionValues ?createString .
}
Existing connector instances are shown below the New Connector button. Click the name of an instance to view its
configuration and SPARQL query, or click the repair / delete icons to perform these operations. Click the copy
icon to copy the connector definition query to your clipboard.
Listing connector instances returns all previously created instances. It is a SELECT query with the listConnectors
predicate:
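A minimal sketch of such a query (the kafka: prefix is as defined above):
PREFIX kafka: <http://www.ontotext.com/connectors/kafka#>
SELECT ?cntUri ?cntStr {
?cntUri kafka:listConnectors ?cntStr .
}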
?cntUri is bound to the prefixed IRI of the connector instance that was used during creation, e.g., http://www.
ontotext.com/connectors/kafka/instance#my_index, while ?cntStr is bound to a string, representing the part
after the prefix, e.g., "my_index".
The internal state of each connector instance can be queried using a SELECT query and the connectorStatus predicate:
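Again as a sketch, using the kafka: prefix defined above:
PREFIX kafka: <http://www.ontotext.com/connectors/kafka#>
SELECT ?cntUri ?cntStatus {
?cntUri kafka:connectorStatus ?cntStatus .
}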
?cntUri is bound to the prefixed IRI of the connector instance, while ?cntStatus is bound to a string representation of the status of the connector represented by this IRI. The status is key-value based.
From the user point of view, all synchronization happens transparently without using any additional predicates or naming a specific store explicitly, i.e., you simply execute standard SPARQL INSERT/DELETE queries. This is achieved by intercepting all changes in the plugin and determining which Kafka documents need to be updated.
The creation parameters define how a connector instance is created by the kafka:createConnector predicate.
Some are required and some are optional. All parameters are provided together in a JSON object, where the
parameter names are the object keys. Parameter values may be simple JSON values such as a string or a boolean,
or they can be lists or objects.
All of the creation parameters can also be set conveniently from the Create Connector user interface without any
knowledge of JSON.
readonly (boolean), optional, read-only mode A read-only connector will index all existing data in the repository at creation time, but, unlike non-read-only connectors, it will:
• Not react to updates. Changes will not be synced to the connector.
• Not keep any extra structures (such as the internal Lucene index for tracking updates to chains)
The only way to index changes in data after the connector has been created is to repair (or drop/recreate) the
connector.
importGraph (boolean), optional, specifies that the RDF data from which to create the connector is in a special virtual graph
Used to make a Kafka index from temporary RDF data inserted in the same transaction. It requires read-only mode and creates a connector whose data will come from statements inserted into a special virtual graph instead of data contained in the repository. The virtual graph is kafka:graph, where the prefix kafka: is as defined before. The data have to be inserted into this graph before the connector create statement is executed.
Both the insertion into the special graph and the create statement must be in the same transaction. In the GraphDB Workbench, this can be done by pasting them one after another in the SPARQL editor and putting a semicolon at the end of the first INSERT.
importFile (string), optional, an RDF file with data from which to create the connector Creates a connector
whose data will come from an RDF file on the file system instead of data contained in the repository. The
value must be the full path to the RDF file. This functionality requires readonly mode.
detectFields (boolean), optional, detects fields This mode introduces automatic field detection when creating a connector. You can omit specifying fields in JSON. Instead, you will get automatic fields: each corresponds to a single predicate, and its field name is the same as the predicate (so you need to use escaping when issuing Kafka queries).
In this mode, specifying types is optional too. If types are not provided, then all types will be indexed. This
mode requires importGraph or importFile.
Once the connector is created, you can inspect the detected fields in the Connector management section of
the Workbench.
kafkaNode (string), required, the Kafka instance to sync to As Kafka is a third-party service, you have to specify the node where it is running. The format of the node value is of the form http://hostname.domain:port, https:// is allowed too. No default value. Can be updated at runtime without having to rebuild the index.
kafkaTopic (string), required, the Kafka topic to send documents to. No default value.
kafkaSSL (boolean), optional, controls whether to use an SSL connection to the Kafka broker. False by de
fault. Can be updated at runtime without having to rebuild the index.
kafkaPlainAuthUsername (string), optional, supplies the username for Kafka SASL PLAIN authentication.
No default value. Can be updated at runtime without having to rebuild the index.
kafkaPlainAuthPassword (string), optional, supplies the password for Kafka SASL PLAIN authentication.
No default value. Can be updated at runtime without having to rebuild the index.
bulkUpdateBatchSize (integer), controls the maximum batch size in bytes and corresponds to the Kafka producer config property batch.size. Default value is 1,048,576 (1 megabyte). Can be updated at runtime without having to rebuild the index.
bulkUpdateRequestSize (integer), controls the maximum request size (and consequently the maximum size per document) in bytes and corresponds to the Kafka producer config property max.request.size. Default value is 1,048,576 (1 megabyte). Can be updated at runtime without having to rebuild the index.
authenticationConfiguratorClass optional, provides custom authentication behavior
kafkaCompressionType (string), sets the compression to use when sending documents to Kafka. One of none, gzip, lz4, or snappy; the default is snappy. This corresponds to the Kafka producer config property compression.type. Can be updated at runtime without having to rebuild the index.
kafkaProducerId (string), an optional identifier that allows for separate Kafka producers with different options to the same Kafka broker. No default – all instances to the same Kafka broker will use a shared Kafka producer and thus must have the same options. See also Producer sharing and Conflict resolution.
kafkaProducerConfig (JSON), optional, the settings for creating the Kafka producer. This option is passed
directly to the Kafka producer when it is instantiated. Each key is a Kafka producer configuration prop
erty. Some config keys, e.g., transactional.id, are not allowed here. No default. Can be updated at
runtime without having to rebuild the index.
kafkaIgnoreDeleteAll (boolean), optional, a flag that, when selected, will not notify Kafka when all repository statements are removed. GraphDB handles the removal of all statements as a special operation that is manifested as sending a Kafka record with a NULL key and NULL value. If this flag is true, no such record will be sent. False by default.
kafkaPropagateConfig (boolean), optional, a non-persisted flag that, when selected, will force propagating the supplied Kafka producer config to the shared Kafka producer. False by default. See also Producer sharing and Conflict resolution. Can be updated at runtime without having to rebuild the index.
types (list of IRIs), required, specifies the types of entities to sync The RDF types of entities to sync are spec
ified as a list of IRIs. At least one type IRI is required.
Use the pseudo-IRI $any to sync entities that have at least one RDF type.
Use the pseudo-IRI $untyped to sync entities regardless of whether they have any RDF type.
languages (list of strings), optional, valid languages for literals RDF data is often multilingual, but only some
of the languages represented in the literal values can be mapped. This can be done by specifying a list of
language ranges to be matched to the language tags of literals according to RFC 4647, Section 3.3.1. Basic
Filtering. In addition, an empty range can be used to include literals that have no language tag. The list of
language ranges maps all existing literals that have matching language tags.
fields (list of field objects), required, defines the mapping from RDF to Kafka The fields specify exactly
which parts of each entity will be synchronized as well as the specific details on the connector side. The
field is the smallest synchronization unit and it maps a property chain from GraphDB to a field in Kafka.
The fields are specified as a list of field objects. At least one field object is required. Each field object has
further keys that specify details.
• fieldName (string), required, the name of the field in Kafka The name of the field defines the mapping on the connector side. It is specified by the key fieldName with a string value. The field name is used as the key in the JSON document that will be sent to Kafka.
• fieldNameTransform (one of none, predicate, or predicate.localName), optional, none by default
Defines an optional transformation of the field name. Although fieldName is always required, it
is ignored if fieldNameTransform is predicate or predicate.localName.
– none: The field name is supplied via the fieldName option.
– predicate: The field name is equal to the full IRI of the last predicate of the chain, e.g., if
the last predicate was http://www.w3.org/2000/01/rdf-schema#label, then the field name
will be http://www.w3.org/2000/01/rdf-schema#label too.
– predicate.localName: The field name is the derived from the local name of the IRI of the
last predicate of the chain, e.g., if the last predicate was http://www.w3.org/2000/01/rdf-
schema#comment, then the field name will be comment.
See Wildcard literal indexing for defining a field whose values are populated with literals regard
less of their predicate.
• valueFilter (string), optional, specifies the value filter for the field. See also Entity filtering.
• documentFilter (string), optional, specifies the nested document filter for the field (only for
fields that define nested documents). See also Entity filtering.
• defaultValue (string), optional, specifies a default value for the field. The default value
(defaultValue) provides means for specifying a default value for the field when the property
chain has no matching values in GraphDB. The default value can be a plain literal, a literal
with a datatype (xsd: prefix supported), a literal with language, or an IRI. It has no default value.
• indexed (boolean), optional, default true. If indexed, a field will be included in the JSON document
sent to Kafka. True by default.
If true, this option corresponds to "index" = true. If false, it corresponds to "index" = false.
• multivalued (boolean), optional, default true. RDF properties and synchronized fields may have
more than one value. If multivalued is set to true, all values will be synchronized to Kafka.
If set to false, only a single value will be synchronized. True by default.
• ignoreInvalidValues (boolean), optional, default false. Per-field option that controls what happens
when a value cannot be converted to the requested (or previously detected) type. False by default.
Example use: when an invalid date literal like "2021-02-29"^^xsd:date (2021 is not a leap year)
needs to be indexed as a date, or when an IRI needs to be indexed as a number.
Note that some conversions are always valid, for example a literal or an IRI to a string field. When
true, such values will be skipped with a note in the logs. When false, such values will break the
transaction.
• array (boolean), optional, default false. Normally, Kafka creates an array only if more than one value is
present for a given field. If array is set to true, Kafka will always create an array, even for single
values. If set to false, Kafka will create arrays for multiple values only. False by default.
• datatype (string), optional, the manual datatype override. By default, the Kafka GraphDB Connector
uses the datatype of literal values to determine how they should be mapped to Kafka types.
For more information on the supported datatypes, see Datatype mapping.
The mapping can be overridden through the property "datatype", which can be specified per
field. The value of datatype can be any of the xsd: types supported by the automatic mapping or
a native Kafka type prefixed by native:, e.g., both xsd:long and native:long map to the long
type in Kafka.
• objectFields (objects array), optional, nested object mapping. When native:object is used as a
datatype value, provide a mapping for the nested object’s fields. If datatype is not provided,
native:object will be assumed.
Nested objects support further nested objects, with a limit of five levels of nesting.
• startFromParent (integer), optional, default 0. Start processing the property chain from the Nth
parent instead of the root of the current nested object. 0 is the root of the current nested object, 1
is the parent of the nested object, 2 is the parent of the parent, and so on.
valueFilter (string), optional, specifies the top-level value filter for the document. See also Entity filtering.
documentFilter (string), optional, specifies the top-level document filter for the document. See also Entity
filtering.
As mentioned above, the following connector parameters can be updated at runtime without having to rebuild the
index:
• kafkaNode
• kafkaSSL
• kafkaProducerConfig
• kafkaCompressionType
• kafkaPlainAuthUsername
• kafkaPlainAuthPassword
• bulkUpdateBatchSize
• bulkUpdateRequestSize
• kafkaPropagateConfig
This can be done by executing the following SPARQL update, here with examples for changing the user and
password:
PREFIX kafka: <http://www.ontotext.com/connectors/kafka#>
PREFIX kafka-inst: <http://www.ontotext.com/connectors/kafka/instance#>
INSERT DATA {
kafka-inst:my_index kafka:updateConnector '''
{
"kafkaPlainAuthUsername": "foo"
"kafkaPlainAuthPassword": "bar"
}
''' .
}
Nested objects
Nested objects are JSON objects that are used as values in the main document or other nested objects (up to five
levels of nesting is possible). They are defined with the objectFields option.
Consider the following data describing children and grandchildren relations:
<urn:John>
a <urn:Person> ;
<urn:name> "John" ;
<urn:gender> <urn:Male> ;
<urn:age> 60 ;
<urn:hasSpouse> <urn:Mary> ;
<urn:hasChild> <urn:Billy> ;
<urn:hasChild> <urn:Annie> .
<urn:Mary>
a <urn:Person> ;
<urn:name> "Mary" ;
<urn:gender> <urn:Female> ;
<urn:age> 58 ;
<urn:hasSpouse> <urn:John> ;
<urn:hasChild> <urn:Billy> .
<urn:Billy>
a <urn:Person> ;
<urn:name> "Billy" ;
<urn:gender> <urn:Male> ;
<urn:age> 35 ;
<urn:hasChild> <urn:Tylor> ;
<urn:hasChild> <urn:Melody> .
<urn:Annie>
a <urn:Person> ;
<urn:name> "Annie" ;
<urn:gender> <urn:Female> ;
<urn:age> 28 ;
<urn:hasChild> <urn:Sammy> .
<urn:Tylor>
a <urn:Person> ;
<urn:name> "Tylor" ;
<urn:gender> <urn:Male> ;
<urn:age> 5 .
<urn:Melody>
a <urn:Person> ;
<urn:name> "Melody" ;
<urn:gender> <urn:Female> ;
<urn:age> 2 .
<urn:Sammy>
a <urn:Person> ;
<urn:name> "Sammy" ;
<urn:gender> <urn:Male> ;
<urn:age> 10 .
We can create a nested objects index that consists of children and grandchildren with their corresponding fields
defining their gender and age:
{
"fields": [
{
"fieldName": "name",
"propertyChain": [
"urn:name"
]
},
{
"fieldName": "age",
"propertyChain": [
"urn:age"
],
"datatype": "xsd:long"
},
...
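The definition continues in the same pattern; a minimal sketch of how the children field could be declared with objectFields, reusing the urn:hasChild, urn:gender, and urn:age properties from the data above (the field names children and grandChildren are illustrative):
{
    "fieldName": "children",
    "propertyChain": [
        "urn:hasChild"
    ],
    "datatype": "native:object",
    "objectFields": [
        {
            "fieldName": "gender",
            "propertyChain": [
                "urn:gender"
            ]
        },
        {
            "fieldName": "age",
            "propertyChain": [
                "urn:age"
            ],
            "datatype": "xsd:long"
        },
        {
            "fieldName": "grandChildren",
            "propertyChain": [
                "urn:hasChild"
            ],
            "datatype": "native:object",
            "objectFields": [
                {
                    "fieldName": "age",
                    "propertyChain": [
                        "urn:age"
                    ],
                    "datatype": "xsd:long"
                }
            ]
        }
    ]
}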
Copy fields
Often, it is convenient to synchronize one and the same data multiple times with different settings to accommodate
different use cases. The Kafka GraphDB Connector has explicit support for fields that copy their value from
another field. This is achieved by specifying a single element in the property chain of the form @otherFieldName,
where otherFieldName is another non-copy field. Take the following example:
...
"fields": [
{
"fieldName": "grape",
"facet": false,
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{
"fieldName": "whiteGrape",
"propertyChain": [
"@grape"
]
}
],
"entityFilter": "?whiteGrape -> type = <wine:WhiteGrape>"
...
The snippet creates a field “grape” containing all grapes, and another field “whiteGrape”. Both fields are populated
with the same values initially and “whiteGrape” is defined as a copy field that refers to the field “grape”. The field
“whiteGrape” is additionally filtered so that only certain grape varieties will be synchronized.
Note: The connector handles copy fields in a more optimal way than specifying a field with exactly the same
property chain as another field.
Sometimes, you have to work with data models that define the same concept (in terms of what you want to index
in Kafka) with more than one property chain, e.g., the concept of “name” could be defined as a single canonical
name, multiple historical names and some unofficial names. If you want to index these together as a single field
in Kafka, you can define this as a multiple property chains field.
Fields with multiple property chains are defined as a set of separate virtual fields that will be merged into a single
physical field when indexed. Virtual fields are distinguished by the suffix $xyz, where xyz is any alphanumeric
sequence of convenience. For example, we can define the fields name$1 and name$2 like this:
...
"fields": [
    {
        "fieldName": "name$1",
        "propertyChain": [
            "http://www.ontotext.com/example#canonicalName"
        ]
    },
    {
        "fieldName": "name$2",
        "propertyChain": [
            "http://www.ontotext.com/example#historicalName"
        ]
    }
...
The values of the fields name$1 and name$2 will be merged and synchronized to the field name in Kafka.
Note: You cannot mix suffixed and unsuffixed fields with the same name, e.g., if you defined myField$new and
myField$old, you cannot have a field called just myField.
Filters can be used with fields defined with multiple property chains. Both the physical field values and the individual
virtual field values are available:
• Physical fields are specified without the suffix, e.g., ?myField
• Virtual fields are specified with the suffix, e.g., ?myField$2 or ?myField$alt.
Note: Physical fields cannot be combined with parent() as their values come from different property chains. If
you really need to filter the same parent level, you can rewrite parent(?myField) in (<urn:x>, <urn:y>) as
parent(?myField$1) in (<urn:x>, <urn:y>) || parent(?myField$2) in (<urn:x>, <urn:y>) || parent(?
myField$3) ... and surround it with parentheses if it is a part of a bigger expression.
The language tag of an RDF literal can be indexed by specifying a property chain where the last element is the
pseudo-IRI lang(). The property preceding lang() must lead to a literal value. For example:
PREFIX kafka: <http://www.ontotext.com/connectors/kafka#>
PREFIX kafka-inst: <http://www.ontotext.com/connectors/kafka/instance#>
INSERT DATA {
kafka-inst:my_index kafka:createConnector '''
{
"kafkaNode": "localhost:9092",
"kafkaTopic": "my_index",
"types": ["http://www.ontotext.com/example#gadget"],
"fields": [
{
"fieldName": "name",
"propertyChain": [
"http://www.ontotext.com/example#name"
]
},
{
"fieldName": "nameLanguage",
"propertyChain": [
"http://www.ontotext.com/example#name",
"lang()"
]
}
]
}
''' .
}
The above connector will index the language tag of each literal value of the property http://www.ontotext.com/
example#name into the field nameLanguage.
The named graph of a given value can be indexed by ending a property chain with the special pseudo-IRI graph().
Indexing the named graph of the value instead of the value itself allows searching by named graph.
PREFIX kafka: <http://www.ontotext.com/connectors/kafka#>
PREFIX kafka-inst: <http://www.ontotext.com/connectors/kafka/instance#>
INSERT DATA {
kafka-inst:my_index kafka:createConnector '''
{
"kafkaNode": "localhost:9092",
"kafkaTopic": "my_index",
"types": ["http://www.ontotext.com/example#gadget"],
"fields": [
{
"fieldName": "name",
"propertyChain": [
"http://www.ontotext.com/example#name"
]
},
{
"fieldName": "nameGraph",
"propertyChain": [
"http://www.ontotext.com/example#name",
"graph()"
]
}
]
}
''' .
}
The above connector will index the named graph of each value of the property http://www.ontotext.com/
example#name into the field nameGraph.
Wildcard literal indexing
In this mode, the last element of a property chain is a wildcard that will match any predicate that leads to a literal
value. Use the special pseudo-IRI $literal as the last element of the property chain to activate it.
Note: Currently, it really means any literal, including literals with data types.
For example:
{
"fields" : [ {
"propertyChain" : [ "$literal" ],
"fieldName" : "name"
}, {
"propertyChain" : [ "http://example.com/description", "$literal" ],
"fieldName" : "description"
}
...
}
Sometimes you may need the IRI of each entity (e.g., http://www.ontotext.com/example/wine#Franvino from
our small example dataset) indexed as a regular field. This can be achieved by specifying a property chain with a
single property referring to the pseudo-IRI $self. For example:
PREFIX kafka: <http://www.ontotext.com/connectors/kafka#>
PREFIX kafka-inst: <http://www.ontotext.com/connectors/kafka/instance#>
INSERT DATA {
kafka-inst:my_index kafka:createConnector '''
{
"kafkaNode": "localhost:9092",
"kafkaTopic": "my_index",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "entityId",
"propertyChain": [
"$self"
]
},
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
}
]
}
''' .
}
The above connector will index the IRI of each wine into the field entityId.
Note: GraphDB will also use the IRI of each entity as the ID of each document in Kafka, which is
represented by the field id.
Datatype mapping
The Kafka GraphDB Connector maps different types of RDF values to different types of Kafka values according
to the basic type of the RDF value (IRI or literal) and the datatype of literals. The autodetection uses the following
mapping:
Note: For any given field, the automatic mapping uses the first value it sees. This works fine for clean datasets
but might lead to problems if your dataset has non-normalized data, e.g., the first value has no datatype but other
values do.
It is therefore recommended to set datatype to a fixed value, e.g., xsd:date.
Please note that the commonly used xsd:integer and xsd:decimal datatypes are not indexed as numbers because
they represent infinite precision numbers. You can override that by using the datatype option to cast to xsd:long,
xsd:double, xsd:float as appropriate.
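For example, a hedged sketch of a field definition that forces numeric indexing by overriding the datatype (the field and property names are illustrative):
{
    "fieldName": "price",
    "propertyChain": [
        "http://www.ontotext.com/example#price"
    ],
    "datatype": "xsd:double"
}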
RDF and ISO use slightly different models for representing dates and times, even though the values might look
very similar.
Years in RDF values use the XSD format and are era years, where positive values denote the common era and
negative values denote years before the common era. There is no year zero.
Years in the ISO format are proleptic years, i.e., positive values denote years from the common era, with any
previous era years just going down by one mathematically, so there is a year zero.
In short:
• year 2020 CE = year 2020 in XSD = year 2020 in ISO.
• …
• year 1 CE = year 1 in XSD = year 1 in ISO.
• year 1 BCE = year 1 in XSD = year 0 in ISO.
• year 2 BCE = year 2 in XSD = year 1 in ISO.
• …
All years coming from RDF literals will be converted to ISO before sending to Kafka.
Both XSD and ISO date and time values support timezones. In addition to that, XSD defines the lack of a timezone
as undetermined. Since we do not want to have any undetermined state in the indexing system, we define
the undetermined timezone as UTC, i.e., "2020-02-14T12:00:00"^^xsd:dateTime is equivalent to
"2020-02-14T12:00:00Z"^^xsd:dateTime (Z is the UTC timezone, also known as +00:00).
Also note that XSD dates and partial dates, e.g., xsd:gYear values, may have a timezone, which leads to additional
complications. E.g., "2020+02:00"^^xsd:gYear (the year 2020 in the +02:00 timezone) will be normalized to
2019-12-31T22:00:00Z (the previous year!) if strict timezone adherence is followed. We have chosen to ignore
the timezone on any values that do not have an associated time value, e.g.:
• "2020-02-15+02:00"^^xsd:date
• "2020-02+02:00"^^xsd:gYearMonth
• "2020+02:00"^^xsd:gYear
All of the above will be treated as if they specified UTC as their timezone.
The Kafka connector supports four kinds of entity filters used to fine-tune the set of entities and/or individual
values for the configured fields, based on the field value. Entities and field values are synchronized to Kafka if,
and only if, they pass the filter. The filters are similar to a FILTER() inside a SPARQL query but not exactly the
same. In them, each configured field can be referred to by prefixing it with a ?, much like referring to a variable
in SPARQL.
Types of filters
Top-level value filter The top-level value filter is specified via valueFilter. It is evaluated prior to anything
else, when only the document ID is known, and it may not refer to any field names but only to the special
field $this that contains the current document ID. Failing to pass this filter removes the entire document
early in the indexing process, and it can be used to introduce more restrictions similar to the built-in filtering
by type via the types property.
Top-level document filter The top-level document filter is specified via documentFilter. This filter is evaluated
last, when all of the document has been collected, and it decides whether to include the document in the index.
It can be used to enforce global document restrictions, e.g., certain fields are required or a document needs
to be indexed only if a certain field value meets specific conditions.
Per-field value filter The per-field value filter is specified via valueFilter inside the field definition of the field
whose values are to be filtered. The filter is evaluated while collecting the data for the field, when each field
value becomes available.
The variable that contains the field value is $this. Other field names can be used to filter the current field’s
value based on the value of another field, e.g., $this > ?age will compare the current field value to the
value of the field age (see also Two-variable filtering). Failing to pass the filter will remove the current field
value.
On nested documents, the per-field value filter can be used to remove the entire nested document early in
the indexing process, e.g., by checking the type of the nested document via next hop with rdf:type.
Nested document filter The nested document filter is specified via documentFilter inside the field definition
of the field that defines the root of a nested document. The filter is evaluated after the entire nested document
has been collected. Failing to pass this filter removes the entire nested document.
Inside a nested document filter, the field names are within the context of the nested document and not within
the context of the top-level document. For example, if we have a field children that defines a nested
document, and we use a filter like ?age < "10"^^xsd:int, we will be referring to the field children.age.
We can use the prefix $outer. one or more times to refer to field values from the outer document (from the
viewpoint of the nested document). For example, $outer.age > "25"^^xsd:int will refer to the age field
that is a sibling of the children field.
Other than the above differences, the nested document filter is equivalent to the top-level document filter
from the viewpoint of the nested document.
See also Migrating from GraphDB 9.x.
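To make the placement of the filter types concrete, here is a hedged sketch of a connector definition combining a per-field value filter and a top-level document filter (the topic, field names, and IRIs are illustrative):
PREFIX kafka: <http://www.ontotext.com/connectors/kafka#>
PREFIX kafka-inst: <http://www.ontotext.com/connectors/kafka/instance#>
INSERT DATA {
    kafka-inst:my_filtered_index kafka:createConnector '''
{
    "kafkaNode": "localhost:9092",
    "kafkaTopic": "my_filtered_index",
    "types": ["http://www.ontotext.com/example#gadget"],
    "fields": [
        {
            "fieldName": "name",
            "propertyChain": ["http://www.ontotext.com/example#name"]
        },
        {
            "fieldName": "category",
            "propertyChain": ["http://www.ontotext.com/example#category"],
            "valueFilter": "$this in (<http://www.ontotext.com/example#Core>)"
        }
    ],
    "documentFilter": "bound(?name) && bound(?category)"
}
''' .
}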
Filter operators
The filter operators are used to test if the value of a given field satisfies a certain condition.
Field comparisons are done on original RDF values before they are converted to Kafka values using datatype
mapping.
?var in (value1, value2, ...) Tests if the field var’s value is one of the specified values. Values
are compared strictly, unlike the similar SPARQL operator, i.e., for literals to match, their datatype
must be exactly the same (similar to how SPARQL sameTerm works). Values that do not match are
treated as if they were not present in the repository.
Example:
?status in ("active", "new")
?var not in (value1, value2, ...) The negated version of the in operator.
Example:
?status not in ("archived")
bound(?var) Tests if the field var has a valid value. This can be used to make the field compulsory.
Example:
bound(?name)
isExplicit(?var) Tests if the field var’s value came from an explicit statement. This will use the last
element of the property chain. If you need to assert the explicit status of a previous element of the
property chain, use parent(?var) as many times as needed.
Example:
isExplicit(?name)
?var = value (equal to), ?var != value (not equal to), ?var > value (greater than), ?var >= value
(greater than or equal to), ?var < value (less than), ?var <= value (less than or equal to)
RDF value comparison operators that compare RDF values similarly to the equivalent SPARQL
operators. The field var’s value will be compared to the specified RDF value. When comparing RDF
values that are literals, their datatypes must be compatible, e.g., xsd:integer and xsd:long but not
xsd:string and xsd:date. Values that do not match are treated as if they were not present in the
repository.
Example:
Given that height’s value is "150"^^xsd:int and dateOfBirth’s value is "1989-12-31"^^xsd:date,
then ?height >= "150"^^xsd:int and ?dateOfBirth < "1990-01-01"^^xsd:date will both match.
regex(?var, "pattern") or regex(?var, "pattern", "i") Tests if the field var’s value matches the given
regular expression pattern. If the “i” flag option is present, the match operates in case-insensitive mode.
Values that do not match are treated as if they were not present in the repository.
Example:
regex(?name, "^mrs?", "i")
!expr Logical negation of the expression expr.
Example:
!bound(?company)
expr1 and expr2, expr1 && expr2, expr1 or expr2, expr1 || expr2 Logical conjunction and disjunction
of expressions; parentheses can be used for grouping.
Example:
(bound(?name) or bound(?company)) && bound(?address)
Filter modifiers
In addition to the operators, there are some constructions that can be used to write filters based not on the values
of a field but on values related to them:
Accessing the previous element in the chain The construction parent(?var) is used for going to a previous
level in a property chain. It can be applied recursively as many times as needed, e.g.,
parent(parent(parent(?var))) goes back in the chain three times. The effective value of parent(?var)
can be used with the in or not in operator like this: parent(?company) in (<urn:a>, <urn:b>), or with the
bound operator like this: bound(parent(?var)).
Accessing an element beyond the chain The construction ?var -> uri (alternatively, ?var o uri or just
?var uri) is used to access additional values that are accessible through the property uri. In essence, this
construction corresponds to the triple pattern ?value uri ?effectiveValue, where ?value is a value bound
by the field var. The effective value of ?var -> uri can be used with the in or not in operator like this:
?company -> rdf:type in (<urn:c>, <urn:d>). It can be combined with parent() like this:
parent(?company) -> rdf:type in (<urn:c>, <urn:d>). The same construction can be applied to the bound
operator like this: bound(?company -> <urn:hasBranch>), or even combined with parent() like this:
bound(parent(?company) -> <urn:hasGroup>).
The IRI parameter can be a full IRI within < > or the special string rdf:type (alternatively, just type), which
will be expanded to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
Filtering by RDF graph The construction graph(?var) is used for accessing the RDF graph of a field’s value.
A typical use case is to sync only explicit values: graph(?a) not in (<http://www.ontotext.com/implicit>),
but using isExplicit(?a) is the recommended way.
The construction can be combined with parent() like this: graph(parent(?a)) in (<urn:a>).
Filtering by language tags The construction lang(?var) is used for accessing the language tag of a field’s value
(only RDF literals can have a language tag). The typical use case is to sync only values written in a given
language: lang(?a) in ("de", "it", "no"). The construction can be combined with parent() and an element
beyond the chain like this: lang(parent(?a) -> <http://www.w3.org/2000/01/rdf-schema#label>) in
("en", "bg"). Literal values without language tags can be filtered by using an empty tag: "".
Current context variable $this The special field variable $this (and not ?this, ?$this, or $?this) is used to refer
to the current context. In the top-level value filter and the top-level document filter, it refers to the document.
In the per-field value filter, it refers to the currently filtered field value. In the nested document filter, it refers
to the nested document.
ALL() quantifier In the context of document-level filtering, a match is true if at least one of potentially many field
values matches, e.g., ?location = <urn:Europe> would return true if the document contains { "location":
["<urn:Asia>", "<urn:Europe>"] }.
In addition to this, you can also use the ALL() quantifier when you need all values to match, e.g.,
ALL(?location) = <urn:Europe> would not match the above document because <urn:Asia> does not match.
Entity filters and default values Entity filters can be combined with default values in order to get more flexible
behavior.
If a field has no values in the RDF database, the defaultValue is used. But if a field has some values,
defaultValue is NOT used, even if all values are filtered out. See an example in Basic entity filter.
A typical use case for an entity filter is having soft deletes, i.e., instead of deleting an entity, it is marked as
deleted by the presence of a specific value for a given property.
Two-variable filtering
Besides comparing a field value to one or more constants or running an existential check on the field value, some
use cases also require comparing the field value to the value of another field in order to produce the desired result.
GraphDB solves this by supporting two-variable filtering in the per-field value filter, the top-level document filter,
and the nested document filter.
Note: This type of filtering is not possible in the top-level value filter because the only variable that is available
there is $this.
In the top-level document filter and the nested document filter, there are no restrictions, as all values are available
at the time of evaluation.
In the per-field value filter, two-variable filtering will reorder the defined fields such that values for other fields
are already available when the current field’s filter is evaluated. For example, let’s say we defined a filter $this
> ?salary for the field price. This will force the connector to process the field salary first, apply its per-field
value filter (if any), and only then start collecting and filtering the values for the field price.
Cyclic dependencies will be detected and reported as an invalid filter. For example, if in addition to the above
we define a per-field value filter ?price > "1000"^^xsd:int for the field salary, a cyclic dependency will be
detected, as both price and salary will require the other field being indexed first.
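A hedged sketch of the price/salary example as field definitions (the property IRIs are illustrative):
...
"fields": [
    {
        "fieldName": "salary",
        "propertyChain": ["http://www.ontotext.com/example#salary"],
        "datatype": "xsd:long"
    },
    {
        "fieldName": "price",
        "propertyChain": ["http://www.ontotext.com/example#price"],
        "datatype": "xsd:long",
        "valueFilter": "$this > ?salary"
    }
]
...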
# the entity below will be synchronized because it has a matching value for city: ?city in ("London")
example:alpha
rdf:type example:gadget ;
example:name "John Synced" ;
example:city "London" .
# the entity below will not be synchronized because it lacks the property completely: bound(?city)
example:beta
rdf:type example:gadget ;
example:name "Peter Syncfree" .
# the entity below will not be synchronized because it has a different city value:
# ?city in ("London") will remove the value "Liverpool" so bound(?city) will be false
example:gamma
rdf:type example:gadget ;
example:name "Mary Syncless" ;
example:city "Liverpool" .
INSERT DATA {
kafka-inst:my_index kafka:createConnector '''
{
"kafkaNode": "localhost:9092",
"kafkaTopic": "my_index",
"types": ["http://www.ontotext.com/example#gadget"],
"fields": [
...
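A hedged sketch of a city field with a default value, matching the behavior described next (the property IRI is illustrative):
{
    "fieldName": "city",
    "propertyChain": ["http://www.ontotext.com/example#city"],
    "defaultValue": "London"
}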
The default value is used for the entity :beta as it has no value for city in the repository. As the value is “London”,
the entity is synchronized.
Sometimes, data represented in RDF is not well suited to map directly to non-RDF. For example, if you have news
articles and they can be tagged with different concepts (locations, persons, events, etc.), one possible way to model
this is a single property :taggedWith. Consider the following RDF data:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix example2: <http://www.ontotext.com/example2#> .
example2:Berlin
rdf:type example2:Location ;
rdfs:label "Berlin" .
example2:Mozart
rdf:type example2:Person ;
rdfs:label "Wolfgang Amadeus Mozart" .
example2:Einstein
rdf:type example2:Person ;
rdfs:label "Albert Einstein" .
example2:Cannes-FF
rdf:type example2:Event ;
rdfs:label "Cannes Film Festival" .
example2:Article1
rdf:type example2:Article ;
rdfs:comment "An article about a film about Einstein's life while he was a professor in Berlin." ;
(continues on next page)
example2:Article2
rdf:type example2:Article ;
rdfs:comment "An article about Berlin." ;
example2:taggedWith example2:Berlin .
example2:Article3
rdf:type example2:Article ;
rdfs:comment "An article about Mozart's life." ;
example2:taggedWith example2:Mozart .
example2:Article4
rdf:type example2:Article ;
rdfs:comment "An article about classical music in Berlin." ;
example2:taggedWith example2:Berlin ;
example2:taggedWith example2:Mozart .
example2:Article5
rdf:type example2:Article ;
rdfs:comment "A boring article that has no tags." .
example2:Article6
rdf:type example2:Article ;
rdfs:comment "An article about the Cannes Film Festival in 2013." ;
example2:taggedWith example2:Cannes-FF .
Assume you want to map this data to Kafka so that the property example2:taggedWith x is mapped to separate
fields taggedWithPerson and taggedWithLocation, according to the type of x (we are not interested in
Events here). You can map taggedWith twice to different fields and then use an entity filter to get the desired values:
PREFIX kafka: <http://www.ontotext.com/connectors/kafka#>
PREFIX kafka-inst: <http://www.ontotext.com/connectors/kafka/instance#>
INSERT DATA {
kafka-inst:my_index kafka:createConnector '''
{
"kafkaNode": "localhost:9092",
"kafkaTopic": "my_index",
"types": ["http://www.ontotext.com/example2#Article"],
"fields": [
{
"fieldName": "comment",
"propertyChain": ["http://www.w3.org/2000/01/rdf-schema#comment"]
},
{
"fieldName": "taggedWithPerson",
"propertyChain": ["http://www.ontotext.com/example2#taggedWith"],
"valueFilter": "$this -> type = <http://www.ontotext.com/example2#Person>"
},
{
"fieldName": "taggedWithLocation",
"propertyChain": ["http://www.ontotext.com/example2#taggedWith"],
"valueFilter": "$this -> type = <http://www.ontotext.com/example2#Location>"
}
]
}
''' .
}
The six articles in the RDF data above will be mapped as follows:
• Article1: taggedWithPerson = Einstein, taggedWithLocation = Berlin
• Article2: taggedWithLocation = Berlin
• Article3: taggedWithPerson = Mozart
• Article4: taggedWithPerson = Mozart, taggedWithLocation = Berlin
• Article5: no taggedWith fields (it has no tags)
• Article6: no taggedWith fields (Cannes-FF is an Event, which neither filter matches)
The following diagram shows a summary of all predicates that can administer (create, drop, check status) connector
instances or issue queries and retrieve results. It can be used as a quick reference of what a particular predicate
needs to be attached to. Variables that are bound as a result of a query are shown in green, blank helper nodes are
shown in blue, literals in red, and IRIs in orange. The predicates are represented by labeled arrows.
7.4.9 Caveats
Producer sharing
The Kafka connector aims to minimize resource usage and provide smooth transactional operation. This is achieved
by sharing a single Kafka producer object among all connector instances that are connected to the same Kafka broker
node. This has the following benefits:
• Memory consumption is reduced, as each Kafka producer requires a certain amount of buffer memory.
• A failed transaction in one Kafka connector instance will be reverted in all other Kafka connector instances
together with the GraphDB transaction.
Due to the nature of Kafka producers, this also imposes a restriction:
• All connector instances must use the same Kafka options, e.g., they must have the same values for the
bulkUpdateBatchSize and kafkaCompressionType options.
Once you have created at least one Kafka connector instance, attempting to create another instance results in one
of the following scenarios:
Different Kafka broker
• The new connector instance specifies a different Kafka broker.
• The connector instance will be created and a new Kafka producer will be instantiated.
Same Kafka broker + same Kafka options
• The new connector instance specifies the same Kafka broker as one of the existing connectors and the
SAME options as the existing connector.
• The connector instance will be created and the existing Kafka producer will be reused.
Same Kafka broker + different Kafka options
• The new connector instance specifies the same Kafka broker as one of the existing connectors and
DIFFERENT options than the existing connector.
• The connector instance will NOT be created and an error explaining the reason will be thrown.
• See Conflict resolution for possible workarounds.
Note: The Kafka broker for two connector instances is considered to be the same if at least one of the host/port
pairs supplied via the kafkaNode option is the same.
Conflict resolution
When the attempt to create a new Kafka connector instance is denied because another instance was already
created with different options, there are several possible ways to resolve the conflict:
Manual resolution
• Examine the options of the new connector instance you want to create.
• Make the options the same as those of the existing connector instance.
Propagate the new options to the existing instances
• Set the option kafkaPropagateConfig of the new instance to true.
• The new options will be propagated to all existing instances that share the same Kafka broker node.
Force the allocation of a new producer
• Set the option kafkaProducerId of the new instance to some nonempty identifier.
• This will override the producer sharing mechanism and allocate a new producer associating it with the
supplied producer ID.
• The new connector will use the new options.
• All existing instances will continue using their previous options.
Migrating from GraphDB 9.x
GraphDB 10.0 introduces major changes to the filtering mechanism of the connectors. Existing connector instances
will not be usable, and attempting to use them for queries or updates will throw an error.
If your GraphDB 9.x (or older) connector definitions do not include an entity filter, you can simply repair them.
If your GraphDB 9.x (or older) connector definitions do include an entity filter with the entityFilter option, you
need to rewrite the filter with one of the current filter types:
1. Save your existing connector definition.
2. Drop the connector instance.
3. In general, most older connector filters can be easily rewritten using the per-field value filter and top-level
document filter. Rewrite the filters as follows:
Rule of thumb:
• If you want to remove individual values, i.e., if the operand is not BOUND() –> rewrite with a
per-field value filter.
• If you want to remove entire documents, i.e., if the operand is BOUND() –> rewrite with a
top-level document filter.
So if we take the example:
?location = <urn:Europe> AND BOUND(?location) AND ?type IN (<urn:Foo>, <urn:Bar>)
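Following the rule of thumb, a hedged sketch of how this filter could be rewritten: the equality and IN checks become per-field value filters, while the BOUND() check becomes a top-level document filter (the property chains are illustrative):
...
"fields": [
    {
        "fieldName": "location",
        "propertyChain": ["http://www.ontotext.com/example#location"],
        "valueFilter": "$this = <urn:Europe>"
    },
    {
        "fieldName": "type",
        "propertyChain": ["http://www.ontotext.com/example#type"],
        "valueFilter": "$this in (<urn:Foo>, <urn:Bar>)"
    }
],
"documentFilter": "bound(?location)"
...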
The MongoDB integration feature is a GraphDB plugin allowing users to query MongoDB databases using
SPARQL and to execute heterogeneous joins. This section describes how to configure GraphDB and MongoDB
to work together.
MongoDB is a document-based database with the biggest developer/user community. It is part of the MEAN
technology stack and guarantees scalability and performance well beyond the throughput supported in GraphDB.
Often, we see use cases with extreme scalability requirements and a simple data model (i.e., a tree representation of a
document and its metadata).
MongoDB is a NoSQL JSON document store and does not natively support joins, SPARQL, or RDF-enabled linked
data. The integration between GraphDB and MongoDB is done by a plugin that sends a request to MongoDB and
then transforms the result to an RDF model.
7.5.2 Usage
{
"_id": { "$oid": "5c0fb7f329298f15dc37bb81"},
"@graph":
[{
"@id": "http://www.bbc.co.uk/things/1#id",
"@type": "cwork:NewsItem",
"bbc:primaryContentOf":
[{
"@id": "bbcd:3#id",
"bbc:webDocumentType": {
"@id": "bbc:HighWeb"
}
},
{
"@id": "bbcd:4#id",
"bbc:webDocumentType": {
"@id": "bbc:Mobile"
}
}],
"cwork:about":
[{
"@id": "dbpedia:AccessAir"
},
{
"@id": "dbpedia:Battle_of_Bristoe_Station"
},
{
"@id": "dbpedia:Nicolas_Bricaire_de_la_Dixmerie"
},
{
"@id": "dbpedia:Bernard_Roberts"
},
{
"@id": "dbpedia:Bartolomé_de_Medina"
},
{
"@id": "dbpedia:Don_Bonker"
},
{
"@id": "dbpedia:Cornel_Nistorescu"
...
Note: The keys in MongoDB cannot contain “.”, nor start with “$”. Although the JSON-LD standard allows it,
MongoDB does not. Therefore, either use namespaces (see the sample above) or encode the . and $, respectively.
Only the JSON keys are subject to decoding.
Installing MongoDB
Setting up and maintaining a MongoDB database is a separate task and must be accomplished outside of GraphDB.
See the MongoDB website for details.
Note: Throughout the rest of this document, we assume that you have the MongoDB server installed and running
on a computer you can access.
Note: The GraphDB integration plugin uses MongoDB Java driver version 3.8. More information about the
compatibility between MongoDB Java driver and MongoDB version is available on the MongoDB website.
Creating an index
Supported predicates:
• :service MongoDB connection string;
• :database MongoDB database;
• :collection MongoDB collection;
• :user (optional) MongoDB user for the connection;
• :password (optional) the user’s password;
• :authDb (optional) the database where the user is authenticated.
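For example, a hedged sketch of creating an index named spb1000 over the “ldbc” database and “creativeWorks” collection used later in this section (the connection string assumes a local MongoDB):
PREFIX mongodb: <http://www.ontotext.com/connectors/mongodb#>
PREFIX mongodb-index: <http://www.ontotext.com/connectors/mongodb/instance#>
INSERT DATA {
    mongodb-index:spb1000 mongodb:service "mongodb://localhost:27017" ;
        mongodb:database "ldbc" ;
        mongodb:collection "creativeWorks" .
}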
Upgrading an index
When upgrading to a newer GraphDB version, it might happen that it contains plugins that are not present in the
older version. In this case, the PluginManager disables the newly detected plugin, so you need to enable it by
executing the following SPARQL query:
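A hedged sketch of such a query, assuming the plugin is registered under the name mongodb and using GraphDB’s plugin-control predicate:
INSERT DATA {
    [] <http://www.ontotext.com/owlim/system#startplugin> "mongodb" .
}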
Then create the plugin index in question by executing the SPARQL query provided above, and also make sure
not to delete the database that the plugin is using.
Deleting an index
Import this cwork1000.json file with 1,000 CreativeWork documents into the MongoDB database “ldbc” and the
“creativeWorks” collection.
Querying MongoDB
This is a sample query that returns the dateModified for docs with the specific audience:
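A hedged sketch of such a query, assuming the spb1000 index created above and that the cwork: prefix resolves to the creative work namespace used in the sample documents:
PREFIX mongodb: <http://www.ontotext.com/connectors/mongodb#>
PREFIX mongodb-index: <http://www.ontotext.com/connectors/mongodb/instance#>
PREFIX cwork: <http://www.bbc.co.uk/ontologies/creativework/>
SELECT ?creativeWork ?modified {
    ?search a mongodb-index:spb1000 ;
            mongodb:find '{"@graph.cwork:audience.@id": "cwork:NationalAudience"}' ;
            mongodb:entity ?creativeWork .
    GRAPH mongodb-index:spb1000 {
        ?creativeWork cwork:dateModified ?modified .
    }
}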
In a query, use the exact values as in the docs. For example, if the full URIs are used instead of
“cwork:NationalAudience” or “@graph.cwork:audience.@id”, there would not be any matching results.
Note: The results are returned in a named graph to indicate when the plugin should bind the variables. This is
an API plugin limitation. The variables to be bound by the plugin are in a named graph. This allows GraphDB to
determine whether to bind the specific variable using MongoDB or not.
Supported predicates:
• mongodb:find: Accepts a single JSON and sets the query string. The value is used to call db.collection.
find().
• mongodb:project: Accepts a single JSON. The value is used to select the projection for the results returned
by mongodb:find. Find more info at MongoDB: Project Fields to Return from Query.
• mongodb:aggregate: Accepts an array of JSONs. Calls db.collection.aggregate(). This is the most
flexible way to make a MongoDB query as the find() method is just a single phase of the aggregation
pipeline. The mongodb:aggregate predicate takes precedence over mongodb:find and mongodb:project.
This means that if both mongodb:aggregate and mongodb:find are used, mongodb:find will be ignored.
• mongodb:graph: Accepts an IRI. Specifies the IRI of the named graph in which the bound variables should
be. Its default value is the name of the index itself.
• mongodb:entity (required): Returns the IRI of the MongoDB document. If the JSONLD has context, the
value of @graph.@id is used. In case of multiple values, the first one is chosen and a warning is logged. If
the JSONLD has no context, the value of @id node is used. Even if the value from this predicate is not used,
it is required to have it in the query in order to inform the plugin that the graph part of the current iteration
is completed.
• mongodb:hint: Specifies the index to be used when executing the query (calls cursor.hint()).
• mongodb:collation (optional): Accepts JSON. Specifies language-specific rules for string comparison,
such as rules for letter case and accent marks. It is applied to a mongodb:find or a mongodb:aggregate
query.
Multiple MongoDB calls are supported in the same query. There are two approaches:
• Each index call is placed in a separate SUBSELECT (Example 1);
• Each index call uses a different named graph. If querying different indexes, this comes out of the box. If
not, use the :graph predicate (Example 2).
Example 1:
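A hedged sketch of the SUBSELECT approach against the same index (the second document IRI is hypothetical; in practice each SUBSELECT would typically also include a GRAPH block over the index to bind further variables, as in the aggregation example below):
PREFIX mongodb: <http://www.ontotext.com/connectors/mongodb#>
PREFIX mongodb-index: <http://www.ontotext.com/connectors/mongodb/instance#>
SELECT ?entity1 ?entity2 {
    {
        SELECT ?entity1 {
            ?search1 a mongodb-index:spb1000 ;
                     mongodb:find '{"@graph.@id": "http://www.bbc.co.uk/things/1#id"}' ;
                     mongodb:entity ?entity1 .
        }
    }
    {
        SELECT ?entity2 {
            ?search2 a mongodb-index:spb1000 ;
                     mongodb:find '{"@graph.@id": "http://www.bbc.co.uk/things/2#id"}' ;
                     mongodb:entity ?entity2 .
        }
    }
}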
Example 2:
MongoDB has a number of aggregation functions, such as min, max, size, etc. These functions are called using
the :aggregate predicate. The data of the retrieved results has to be converted to an RDF model. The example below
shows how to retrieve the RDF context of a MongoDB document.
SELECT ?s ?o {
?search a mongodb-index:spb1000 ;
mongodb:aggregate '''[{"$match": {"@graph.@id": "http://www.bbc.co.uk/things/1#id"}},
{'$addFields': {'@graph.cwork:graph.@id' : '$@id'}}]''' ;
mongodb:entity ?entity .
GRAPH mongodb-index:spb1000 {
?s cwork:graph ?o .
}
}
The $addFields stage adds a new nested document to the JSON-LD stored in MongoDB. The newly added
document is then parsed into the following RDF statement:
It looks really similar to the first one, except that instead of @graph.cwork:graph.@id we are writing the value to
@graph.inst:graph.@id, and as a result ?g1 will not get bound. This happens because in the JSON-LD stored in
MongoDB we are aware of the cwork context but not of the inst context. So ?g2 will get bound instead.
Custom fields
Example:
The values are projected as child elements of a custom node. After the JSON-LD is retrieved from MongoDB, a
preprocessing step retrieves all child elements of custom and creates statements with predicates in the
<http://www.ontotext.com/connectors/mongodb/instance#> namespace.
Authentication
All types of authentication can be achieved by setting the credentials in the connection string. However, as it is not
a good practice to store the passwords in plain text, the :user, :password, and :authDb predicates are introduced.
If one of those predicates is used, it is mandatory to set the other two as well. These predicates set credentials for
SCRAM and LDAP authentication and the password is stored encrypted with a symmetrical algorithm on the disk.
For x.509 and Kerberos authentication the connection string should be used as no passwords are being stored.
The GraphDB Connectors offer an excellent solution for indexing data with a well-known schema, e.g., index
documents that have type A, where each document has a field F1 that can be reached by following the property
chain composed of the IRIs P1 and P2.
The features described below add a more general full-text search (FTS) functionality to the connectors, and can be
used individually or combined as desired to meet the specific needs of the use case.
The following connector features are useful when defining a connector for general full-text search:
Wildcard literal
This feature allows for indexing of literals without specifying the IRI of the predicate that leads to the literal. Use
$literal as the last element of the property chain.
See more about wildcard literals in the Lucene connector, the Solr connector, and the Elasticsearch connector.
Field name transformation
This feature allows for having dynamic field names derived from the IRI of the last predicate in the property chain.
See more about field name transformations in the Lucene connector, the Solr connector, and the Elasticsearch
connector.
Indexing entities regardless of type
Specify $any or $untyped as the sole type to index all entities that have at least one RDF type, or all entities
regardless of whether they have any RDF type, respectively.
See more about types in the Lucene connector, the Solr connector, and the Elasticsearch connector.
7.6.2 Examples
All examples use the Star Wars RDF dataset. Download starwars-data.ttl and import it into a fresh repository
before proceeding further.
The example connector definitions use the Lucene connector but can be easily adapted to Solr and Elasticsearch
by changing lucene in the prefix definitions to solr or elasticsearch, and adding any additional parameters
required by the respective connector, e.g., elasticsearchNode.
To index all literals in the repository regardless of where they are attached in the graph, you can combine wildcard
literal and untyped indexing. Create a connector such as:
PREFIX con: <http://www.ontotext.com/connectors/lucene#>
PREFIX con-inst: <http://www.ontotext.com/connectors/lucene/instance#>
INSERT DATA {
con-inst:starwars_fts con:createConnector '''
{
"fields": [
{
"fieldName": "fts",
"propertyChain": [
"$literal"
],
"facet": false
}
],
"languages": [
""
],
"types": [
""
],
"types": [
"$untyped"
]
}
''' .
}
The connector defines a single field, fts, that will index all literals regardless of their predicate: $literal as the
last element of the property chain. The connector has no type expectations on the entities that lead to those literals
and will index any entity regardless of whether it has an RDF type: $untyped in the types parameter.
Since the Star Wars dataset contains literals in many different languages, we restrict the index definition further
by specifying "" (the empty language = any literal without a language tag) using the languages option.
We can now search in this connector as usual, for example for the FTS query “luke skywalker”:
# Full-text search for "skywalker"
PREFIX con: <http://www.ontotext.com/connectors/lucene#>
PREFIX con-inst: <http://www.ontotext.com/connectors/lucene/instance#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
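# (the query body below is a sketch using the standard Lucene connector
#  predicates con:query and con:entities; the label binding is optional)
SELECT ?entity ?label {
    ?search a con-inst:starwars_fts ;
            con:query "luke skywalker" ;
            con:entities ?entity .
    OPTIONAL { ?entity rdfs:label ?label }
}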
We get many different results belonging to different types (showing only the first ten results):
The above example indexes all literals into a single field, which is convenient for very rough full-text search. It can
be fine-tuned by using field names derived from the predicate. In this example, we added "fieldNameTransform":
"predicate.localName", so we will get a field for every predicate whose object literal is indexed, and the field
name will be derived from the local name of the predicate:
PREFIX con: <http://www.ontotext.com/connectors/lucene#>
PREFIX con-inst: <http://www.ontotext.com/connectors/lucene/instance#>
INSERT DATA {
con-inst:starwars_fts2 con:createConnector '''
{
"fields": [
...
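The rest of the definition mirrors the previous connector; a hedged sketch of the fields section with the predicate.localName transform (the fieldName value here is arbitrary, since it is ignored when fieldNameTransform is set):
{
    "fieldName": "fts",
    "fieldNameTransform": "predicate.localName",
    "propertyChain": [
        "$literal"
    ],
    "facet": false
}
],
"languages": [
    ""
],
"types": [
    "$untyped"
]
}
''' .
}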
We can use this connector to do general full-text searches, but also more precise ones, such as a query only in
the label of entities (the field label is the result of taking the local name of
<http://www.w3.org/2000/01/rdf-schema#label> at indexing time):
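A hedged sketch of such a query against the label field (using Lucene field:term syntax in con:query):
PREFIX con: <http://www.ontotext.com/connectors/lucene#>
PREFIX con-inst: <http://www.ontotext.com/connectors/lucene/instance#>
SELECT ?entity {
    ?search a con-inst:starwars_fts2 ;
            con:query "label:skywalker" ;
            con:entities ?entity .
}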
We get only three results back, namely the people that have “Skywalker” in their name:
Note: Despite having a similar name, the Kafka Sink connector is not a GraphDB connector.
7.7.1 Overview
Modern business has an ever increasing need of integrating data coming from multiple and diverse systems.
Automating the update and continuous build of knowledge graphs with the incoming streams of data can be
cumbersome due to a number of reasons, such as verbose functional code writing, numerous transactions per update,
suboptimal usability of GraphDB’s RDF mapping language, and the lack of a direct way to stream updates to
knowledge graphs.
GraphDB’s open-source Kafka Sink connector, which supports smart updates with SPARQL templates, solves
this issue by reducing the amount of code needed for raw event data transformation and thus contributing to the
automation of knowledge graph updates. It is a separately running process, which helps avoid database sizing.
The connector allows for customization according to the user’s specific business logic, and requires no GraphDB
downtime during configuration.
The connector allows for customization according to the user’s specific business logic, and requires no GraphDB
downtime during configuration.
With it, users can push update messages to Kafka, after which a Kafka consumer processes them and applies the
updates in GraphDB.
7.7.2 Setup
Important: Before setting up the connector, make sure to have JDK 11 installed.
The Kafka Sink connector supports three types of updates: simple add, replace graph, and smart update with
a DELETE/INSERT template. A given Kafka topic is configured to accept updates in a predefined mode and
format. The format must be one of the supported RDF formats.
Simple add
This is a simple INSERT operation where no document identifiers are needed, and new data is always added as is.
All you need to provide is the new RDF data that is to be added. The following is valid:
• The Kafka topic is configured to only add data.
• The Kafka key is irrelevant but it is recommended to use a unique ID, e.g. a random UUID.
• The Kafka value is the new RDF data to add.
Let’s see how it works.
1. Start GraphDB on the same or a different machine.
2. In GraphDB, create a repository called “kafka-test”.
3. To deploy the connector, execute in the project’s docker-compose directory:
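The command is presumably a docker-compose invocation that scales the graphdb service down to zero instances, along these lines:
docker-compose up --scale graphdb=0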
where graphdb=0 denotes that GraphDB must be started outside of the Docker container.
4. Next, we will configure the Kafka sink connector that will add data into the repository. In the directory of
the Kafka sink connector, execute:
curl http://localhost:8083/connectors \
-H 'Content-Type: application/json' \
--data '{"name":"kafka-sink-graphdb-add",
"config":{
"graphdb.server.url":"http://graphdb.example.com:7200",
"connector.class":"com.ontotext.kafka.GraphDBSinkConnector",
"key.converter":"com.ontotext.kafka.convert.DirectRDFConverter",
"value.converter":"com.ontotext.kafka.convert.DirectRDFConverter",
"value.converter.schemas.enable":"false",
"topics":"gdb-add",
"tasks.max":1,
"offset.storage.file.filename":"/tmp/storage-add",
"graphdb.server.repository":"kafka-test",
"graphdb.batch.size":64,
"graphdb.batch.commit.limit.ms":1000,
"graphdb.auth.type":"NONE",
"graphdb.update.type":"ADD",
"graphdb.update.rdf.format":"nq"}}'
Important: Since GraphDB is running outside the Kafka Sink Docker container,
using localhost in graphdb.server.url will not work. Use a hostname or IP that is
visible from within the container.
Note: One connector can work with only one configuration. If multiple configurations
are added, Kafka Sink will pick a single config and run it. If we need more than one
connector, we have to create and configure them correspondingly.
5. For the purposes of the example, we will also create a test Kafka producer that will write in the respective
Kafka topic. In the Kafka installation directory, execute:
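A hedged sketch using the standard Kafka console producer against the gdb-add topic (the broker address is an assumption):
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic gdb-add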
6. To add some RDF data in the producer, paste this into the same window, and press Enter.
7. In the Workbench SPARQL editor of the “kafka-test” repository, run the query:
SELECT * WHERE {
GRAPH ?g {
?s ?p ?o
}
}
8. The RDF data that we just added via the producer should be returned as result.
Replace graph
In this update type, a document (the smallest update unit) is defined as the contents of a named graph. Thus, to
perform an update, the following information must be provided:
• The IRI of the named graph – the document ID
• The new RDF contents of the named graph – the document contents
The update is performed as follows:
• The Kafka topic is configured for replace graph.
• The Kafka key defines the named graph to update.
• The Kafka value defines the contents of the named graph.
Let’s try it out.
1. We already have the Docker container with the Kafka sink connector running, and have created the “kafka-test”
repository.
2. Now, let’s configure the Kafka sink connector that will replace data in a named graph. In the directory of
the Kafka sink connector, execute:
curl http://localhost:8084/connectors \
-H 'Content-Type: application/json' \
--data '{"name":"kafka-sink-graphdb-replace",
"config":{
"graphdb.server.url":"http://graphdb.example.com:7200",
"connector.class":"com.ontotext.kafka.GraphDBSinkConnector",
"key.converter":"com.ontotext.kafka.convert.DirectRDFConverter",
"value.converter":"com.ontotext.kafka.convert.DirectRDFConverter",
"value.converter.schemas.enable":"false",
"topics":"gdb-replace",
"tasks.max":1,
"offset.storage.file.filename":"/tmp/storage-replace",
"graphdb.server.repository":"kafka-test",
"graphdb.batch.size":64,
"graphdb.batch.commit.limit.ms":1000,
"graphdb.auth.type":"NONE",
"graphdb.update.type":"REPLACE_GRAPH",
"graphdb.update.rdf.format":"nq"}}'
with the same important parameters as in the add data example above.
This will configure the replace graph connector, which will read data from the gdb-replace topic
and send them to the kafka-test repository on the respective GraphDB server.
Note: Here, we have created the connector on a different URL from the previous one
http://localhost:8084/connectors. If you want to create it on the same URL (http://
localhost:8083/connectors), you need to first delete the existing connector:
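Deleting a connector goes through the standard Kafka Connect REST API, for example:
curl -X DELETE http://localhost:8083/connectors/kafka-sink-graphdb-add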
4. To replace the data in the graph, paste this into the same window, and press Enter.
The key of the Kafka message must be the IRI of the named graph to be replaced.
5. To see the replaced data, run the query from above in the Workbench SPARQL editor of the “kafka-test”
repository:
SELECT * WHERE {
GRAPH ?g {
?s ?p ?o
}
}
DELETE/INSERT template
In this update type, a document is defined as all triples for a given document identifier according to a predefined
schema. The schema is described as a SPARQL DELETE/INSERT template that can be filled from the provided
data at update time. The following must be present at update time:
• The SPARQL template update must be predefined, not provided at update time
– Can be a DELETE WHERE update that only deletes the previous version of the document and the new
data is inserted as is.
– Can be a DELETE INSERT WHERE update that deletes the previous version of the document and
adds additional triples, e.g. timestamp information.
• The IRI of the updated document
• The new RDF contents of the updated document
The update is performed as follows:
• The Kafka topic is configured for a specific template.
• The Kafka key of the message holds the value to be used for the ?id parameter in the template’s body (the
template binding).
• The Kafka value defines the new data to be added with the update.
Important: One SPARQL template typically corresponds to a single document type and is used by a single Kafka
sink.
DELETE {
graph ?g { ?id ?p ?oldValue . }
} INSERT {
graph ?g { ?id ?p "Successfully updated example" . }
} WHERE {
graph ?g { ?id ?p ?oldValue . }
}
This simple template will look in all graphs for a given subject ?id, which we will need to
supply later when executing the update. The template will then update the object in all triples
containing this subject to the new value "Successfully updated example".
d. Save it, after which it will appear in the templates list.
Now we need to configure the Kafka sink connector that will update some data in a named graph. In the directory
of the Kafka sink connector, execute:
curl http://localhost:8085/connectors \
-H 'Content-Type: application/json' \
--data '{"name": "kafka-sink-graphdb-update",
"config": {
"graphdb.server.url":"http://graphdb.example.com:7200",
"connector.class": "com.ontotext.kafka.GraphDBSinkConnector",
"key.converter": "com.ontotext.kafka.convert.DirectRDFConverter",
"value.converter": "com.ontotext.kafka.convert.DirectRDFConverter",
"value.converter.schemas.enable": "false",
"topics": "gdb-update",
"tasks.max": 1,
"offset.storage.file.filename": "/tmp/storage-update",
"graphdb.server.repository": "kafka-test",
"graphdb.batch.size": 64,
"graphdb.batch.commit.limit.ms": 1000,
"graphdb.auth.type": "NONE",
"graphdb.update.type": "SMART_UPDATE",
"graphdb.update.rdf.format": "nq",
"graphdb.template.id":
"http://example.com/my-template"}}'
Note: As in the previous example, we have created the connector on a different URL
http://localhost:8085/connectors. If you want to create it on a URL that is already
used, you first need to clean the connector that is on it as shown above.
1. To execute a SPARQL update, we need to provide the binding for the template parameter ?id. It is passed as the key
of the Kafka message, and the data to be added is passed as the value.
In the Kafka installation directory, execute:
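A hedged sketch of a console producer configured to send keyed messages (the key separator is an arbitrary choice):
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic gdb-update \
  --property "parse.key=true" --property "key.separator=;"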
2. To execute an update, paste this into the same window, and press Enter.
3. To see the updated data, run the query from above in the Workbench SPARQL editor of the “kafka-test”
repository:
SELECT * WHERE {
GRAPH ?g {
?s ?p ?o
}
}
The following properties are used to configure the Kafka Sink connector:
The GraphDB text mining plugin allows you to consume the output of text mining APIs as SPARQL binding
variables. Depending on the annotations returned by the concrete API, the plugin enables multiple use cases like:
• Generate semantic annotations by linking fragments from texts to knowledge graph entities (entity linking)
• Transform and filter the text annotations to a concrete RDF data model using SPARQL
• Enrich the knowledge graph with additional information suggested by the information extraction or invalidate
their input
• Evaluate and control the quality of the text annotations by comparing different versions
• Implement complex text mining use cases in a combination with the Kafka GraphDB connector
The plugin readily supports the protocols of these services:
• spaCy server
• GATE Cloud
• Ontotext’s Tag API
In addition, any text mining service that provides its response as JSON can be used when you provide a JSLT
transformation to remodel the service output into an output understandable by the plugin. See the examples below
for querying the Google Cloud Natural Language API and the Refinitiv API using the generic client.
A typical use case would be having a piece of text (for example, news content) in which we want to recognize
people, organization, and location fragments. Ideally, we will link them to entity IRIs that are already known in
the knowledge graph, e.g., Wikidata or PermID IRIs, providing rich possibilities for graph enrichment.
Let’s say we have the following text that mentions Dyson as the company “Dyson Ltd.”, the person “James Dyson”,
and also only as “Dyson”.
“Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This
comes short before the founder James Dyson announced he is moving back to the UK after moving residency to
Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers
for relocating his company.”
Let’s find out what annotations the different services will find in the text.
Note: Please keep in mind that some of the query results provided below may vary as they are dependent on the
respective services.
spaCy server
The spaCy server is a containerized HTTP API that provides industrial-strength natural language processing, whose
named entity recognition (NER) component is used by the plugin.
Currently, the NER pipeline is the only spaCy component supported by the text mining plugin.
1. Run the spaCy server through its Docker image with the following commands:
• docker pull neelkamath/spacy-server:2-en_core_web_sm-sense2vec
where http://localhost:8000 is the location of the spaCy server set up using the above Docker
image.
Note that the sense2vec similarity feature is enabled by default. If your Docker image does not support it or you
want to disable it when creating the client, set it to false in the SPARQL query:
PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
txtm-inst:localSpacy txtm:connect txtm:Spacy;
txtm:service "http://localhost:8000";
txtm:sense2vec "false" .
}
The simplest query will return all annotations with their types and offsets. Since spaCy also provides sentence
grouping, for each annotation, we can get the text it is found in.
PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
SELECT ?annotationText ?sentence ?annotationType ?annotationStart ?annotationEnd
WHERE {
?searchDocument a txtm-inst:localSpacy;
txtm:text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore. The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company''' .
graph txtm-inst:localSpacy {
?annotatedDocument txtm:annotations ?annotation .
?annotation txtm:annotationText ?annotationText ;
txtm:annotationKey ?annotationKey;
txtm:annotationType ?annotationType ;
txtm:annotationStart ?annotationStart ;
txtm:annotationEnd ?annotationEnd .
}
}
We see that spaCy succeeds in assigning the correct types to each “Dyson” found in the text.
Each of the mentioned services attaches to the annotations its own metadata, which can be obtained through the
feature predicate. In spaCy’s case, we can reach the sense2vec similarity using the following query:
The sense2vec similarity feature provides us with the additional knowledge that Dyson is somehow related to
“vacuums” and “Miele”.
GATE Cloud
GATE Cloud is a text analytics as a service that provides various pipelines. Its ANNIE named entity recognizer
used by the plugin identifies basic entity types, such as Person, Location, Organization, Money amounts, Time and
Date expressions.
...annotations=:Person&annotations=:Money&annotations=:Percent&annotations=:Sentence" .
}
Obviously, you can provide the annotation types you are interested in using the query parameters.
In GATE, sentences are returned as annotations, so they will appear as annotations in the response.
Tag
Ontotext’s Tag API provides the ability to semantically enrich content of your choice with annotations by discovering
mentions of both known and novel concepts.
Based on data from DBpedia and Wikidata, and processed with smart machine learning algorithms, it recognizes
mentions of entities such as Person, Organisation, and Location, various relationships between them, as well as
general topics and key phrases mentioned. Visit the NOW demonstrator to explore such entities found in news.
For some annotations, an exact match to one or more IRIs in the knowledge graph is found and is accessible through
annotation features along with other annotation metadata.
Tag also succeeds in assigning the proper type “Person” for “Dyson”.
Here are some details about the features that Tag provides for each annotation:
• txtm:inst: The id of the concept from the knowledge graph which was assigned to this annotation, or an id
of a generated concept in case it is not trusted (see txtm:isTrusted below).
For example, http://ontology.ontotext.com/resource/9cafep – you can find a short description and
news that mention this entity in the NOW web application at
http://now.ontotext.com/#/concept&uri=http://ontology.ontotext.com/resource/9cafep, using the IRI value as the uri parameter.
• txtm:class: The class of the concept from the knowledge graph which was assigned to this annotation.
• txtm:isTrusted: Has value true when the entity is mapped to an existing entity in the database.
• txtm:isGenerated: Has value true when the annotation has been generated by the pipeline itself, i.e., from
NER taggers for which there is no suitable concept in the knowledge graph. Note that generated does not
mean that the annotation is not trusted.
• txtm:relevanceScore: A float number that represents the level of relevancy of the annotation to the target
document.
• txtm:confidence: A float number that represents the confidence score for the annotation to be produced.
The Tag service provides a way to serve entities and their features as RDF. The model is based on the Web Annotation
data model. The following headers should be passed when creating the Tag client:
The common model applied for all services is no longer used, because you get the Tag response in RDF as it is formed
by the service.
The following request type (Content-type) and response type (Accept) combinations are supported:
• Content-type: text/plain Accept: application/vnd.ontotext.ces+json (this is the default if nothing is specified)
• Content-type: application/vnd.ontotext.ces+json+ld Accept: application/vnd.ontotext.ces+json
Not supported:
• Content-type: text/plain Accept: application/vnd.ontotext.ces+json+ld
• Content-type: application/vnd.ontotext.ces+json
Note: This means that JSON-LD as response type requires that the request is JSON-LD and nothing else. The
default text/plain will not work, so when creating the client, you need to pass the Content-type explicitly.
When the request type is JSON-LD, the response type can be JSON or JSON-LD.
When using JSON-LD, the following document features are required. Note that they should be passed using
the txtm:features predicate on ?annotatedDocument and in this order:
txtm:text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore. The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company. ''' ;
graph txtm-inst:tagInstJSONLD {
?subject ?predicate ?object
}
}
You can also use the txtm:rawInput predicate to provide your own raw JSON-LD document. The query above
will look as follows, and will return the same results:
To register a service in the text mining plugin, the service must provide a REST interface with a POST endpoint.
The response Content-Type must be application/json. The headers of the POST request are passed using the
predicate http://www.ontotext.com/textmining#header. The request body is passed with the predicate
http://www.ontotext.com/textmining#text.
curl -X POST --header "HEADER1: VALUE1" --header "HEADER2: VALUE2" -d 'body' 'https://endpoint.com?queryParam1=param1'
PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
inst:myService :connect :Provider;
:service "https://endpoint.com?queryParam1=param1";
:header "HEADER1: VALUE1";
:header "HEADER2: VALUE2";
:transformation '''
...
''' .
}
PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
SELECT ?annotationText ?annotationType ?annotationStart ?annotationEnd ?feature ?value
WHERE {
?searchDocument a inst:myService;
:text '''body''' .
graph inst:myService {
?annotatedDocument :annotations ?annotation .
?annotation :annotationText ?annotationText ;
:annotationType ?annotationType ;
:annotationStart ?annotationStart ;
:annotationEnd ?annotationEnd .
{
?annotation :features ?item .
?item ?feature ?value
}
}
}
If we want to extract annotations using another named entity recognition provider, we can do so by creating a client
for the service and providing a JSLT transformation. The transformation converts the JSON returned by the
target service to a JSON model understandable by the text mining plugin. The target JSON should look like this:
{
"content":"",
"sentences":[ ],
"features":{ },
"annotations":[
{
"text":"Google",
"type":"Company",
"startOffset":78,
"endOffset":84,
"confidence":0.0,
"features":{ }
}
]
}
"annotations":[
{
"text":"Google",
"type":"Company",
"startOffset":78,
"endOffset":84,
}
]
}
Google Cloud Natural Language’s API associates information, such as salience and mentions, with annotations,
where an annotation represents a phrase in the text that is a known entity, such as a person, an organization, or a
location. It also requires a token to access the API.
Once created, you can list annotations using a model similar to the other services. Note that you need to provide
the input in the way the service expects it. No transformation is applied to the request content.
WHERE {
?searchDocument a txtm-inst:myGoogleService;
txtm:text '''
{
"document":{
"type":"PLAIN_TEXT",
"content":"Net income was $9.4 million compared to the prior year of $2.7 million. Google is a�
,→big company.
Revenue exceeded twelve billion dollars, with a loss of $1b"
}, "features": {'extractEntities': 'true', 'extractSyntax': 'true'},
'encodingType':'UTF8',
}
''' .
graph txtm-inst:myGoogleService {
?annotatedDocument txtm:annotations ?annotation .
Refinitiv API
Refinitiv’s PermIDs are open, permanent, and universal identifiers where underlying attributes capture the context
of the identity they each represent.
The tricky part of the integration of an arbitrary NER provider is to write the JSLT transformation, but once you
get used to the language, you can enrich your text document with any entity provider of your choice, and extend
your knowledge graph solely with the power of SPARQL and GraphDB.
PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
SELECT ?annotationText ?annotationType ?annotationStart ?annotationEnd ?feature ?value
WHERE {
?searchDocument a inst:razor;
:text '''
{"text":"Prosecutors want NFL's Peterson arrested on alleged bond violation | Reuters
Prosecutors want NFL's Peterson arrested on alleged bond violation
By Eric Kelsey
(Reuters) - Suspended Minnesota Vikings star Adrian Peterson faced new legal trouble on Thursday after Texas prosecutors in his child abuse case asked a court to order his arrest on a possible drug-related bond violation.
Peterson, 29, who has been accused of injuring his 4-year-old son while disciplining him with the thin end of a tree branch, allegedly told a drug-testing administrator on Wednesday he had smoked marijuana before submitting to a urinalysis test, court papers said.
\\"During this process the defendant admitted ... that he smoked a little weed,\\" according to the motion filed by Montgomery County District Attorney Brett Ligon.
Since the text enclosed within the ''' marks represents a literal string, SPARQL will store it as is and keep new
lines and paragraphs. The only special characters that need to be escaped with a double backslash are the quotation
marks: \\". This will form the values of the valid JSON that the plugin will send to the service.
The text mining plugin generates meaningful IRIs for the ?annotatedDocument and ?annotation variables. It
provides the additional txtm:annotationKey predicate that binds to the ?annotationKey variable an IRI for the
annotation based on the text and offsets, meaning that regardless of the service that generated the annotation, the
same pieces of text will have the same ?annotationKey IRIs. This can be used to compare annotations over the
same piece of text provided by different services.
The following query compares annotation types obtained from spaCy and Tag for annotations that have the same
key and text, meaning that they refer to the same piece of text.
WHERE {
BIND ('''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
?searchDocument2 a txtm-inst:tagService;
txtm:text ?text .
graph txtm-inst:tagService {
?tagDocument txtm:annotations ?tagAnnotation .
?tagAnnotation txtm:annotationText ?annotationText ;
txtm:annotationKey ?annotationKey;
txtm:annotationType ?tagType .
}
}
The IRIs generated by the text mining plugin have the following meaning:
• ?annotatedDocument (?tagDocument or ?spacyDocument in the above query): <http://www.ontotext.com/textmining/document/<md5-content>> where md5-content is the MD5 code of the document content.
Note that document IRIs will be the same for the same pieces of text, regardless of the service.
• ?annotation: <http://www.ontotext.com/textmining/document/<md5-content>/annotation/<start>/<end>/<service-name>/<index>>
For example: <http://www.ontotext.com/textmining/document/ffa3feed18dacea1c195492cc1c06847/annotation/102/111>
• ?annotationKey: <http://www.ontotext.com/textmining/document/<md5-content>/annotation/<start>/<end>>: The annotation key IRI marks only a piece of text in the document and can be used to
find annotations over the same piece of text, but provided by different services.
For example: <http://www.ontotext.com/textmining/document/ffa3feed18dacea1c195492cc1c06847/annotation/102/111>
Using the Tag txtm:exactMatch feature and our own mentions predicate, we can generate the following triples
and enrich our dataset with entities from DBpedia.
Of course, the power of RDF allows you to construct any graph you want based on the response from the named
entity recognition service.
Let’s say you have multiple documents with content that you want to send for annotation, for example documents
from your own knowledge graph. For the example to work, insert the following documents in your repository:
INSERT DATA
{
GRAPH <http://my.knowledge.graph.com> {
my-kg:doc1 my-kg:content "SOFIA, March 14 (Reuters) - Bulgaria expects Azeri state energy company SOCAR to start investing in the Balkan country's retail gas distribution network this year, Prime Minister said on Thursday".
my-kg:doc2 my-kg:content "Bulgaria is looking to secure gas supplies for its planned gas hub at the Black Sea port of Varna and Borissov said he had discussed the possibility of additional Azeri gas shipments for the plan.".
my-kg:doc3 my-kg:content "In the Sunny Beach resort, this one-bedroom apartment is 150m from the sea. It is in the Yassen complex, which has a communal pool and gardens. On the third floor, the 66sq m (718sq ft) apartment has a livingroom, with kitchen, that opens to a balcony overlooking the pool. There are also a bedroom and bathroom. The property is being sold with furniture. The service charge is €8 a square metre, making it about €528. Burgas Airport is about 12km away. Varna is 40km away.".
}
}
You can send all of them for annotation with a single query. By default, if the service fails for one document, the
whole query will fail. As a result, you will miss the results for the documents that were successfully annotated. To
prevent this from happening, you can use the txtm:serviceErrors predicate that defines a maximum number of
errors allowed before the query fails, where -1 means that an infinite number of errors is allowed. As a result of
the following query, you will either get an error for the document, or its annotations.
The following results will be returned if the spaCy service successfully annotates the first document, but is then
stopped. We can simulate this by stopping the spaCy Docker container during the query execution (Ctrl+C in the
terminal where the container is running). The error message is returned as a document feature.
Use the queries below to explore the instances of text mining clients you have in the repository with their
configurations, as well as to remove them.
Drop an instance
If you are annotating multiple documents in one transaction, you may want to get feedback on the progress. This
is done by setting the log level of the text mining plugin to DEBUG in the conf/logback.xml file of the GraphDB
distribution:
<logger name="com.ontotext.graphdb.plugins.textmining" level="DEBUG"/>
You will see a message for each document sent for annotation in the GraphDB main log file in the logs directory.
[DEBUG] 2021-05-19 08:39:40,893 [repositories/ff-news | c.o.g.p.t.c.ClientBase] Annotating document content starting with: "Australia's Cardinal Pell sentenced to six years jail for sexually... MELBOURNE (Reuters) - Former ..." with length: 911
Once created, a cluster can be used almost like a single GraphDB configuration. However, all write operations
need to be performed on the current leader node. Read operations are allowed on any node.
When using the Workbench, make sure you have opened the leader node (go to Setup » Cluster to check). If you
are connected to a follower and try to perform a write operation, you will get an error message:
4. Import some data in it from Import » User data » Upload RDF files. For this example, let’s use the W3.org
wine ontology.
5. If we open the SPARQL editor and run a basic SELECT query against the imported data, we will see that it
behaves just like a regular GraphDB instance.
The GraphDB client API for Java is an extension of RDF4J’s HTTPRepository that adds support for automatic
leader discovery.
You can create an instance of GraphDBHTTPRepository like this:
package com.ontotext.example;

import com.ontotext.graphdb.repository.http.GraphDBHTTPRepository;
import com.ontotext.graphdb.repository.http.GraphDBHTTPRepositoryBuilder;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.RepositoryConnection;

public class ClusterClientExample {
    public static void main(String[] args) {
        // Build a cluster-aware repository client; the builder is also where the
        // target repository is configured (omitted here for brevity)
        GraphDBHTTPRepository repository = new GraphDBHTTPRepositoryBuilder()
                .withServerUrl("http://graphdb1.example.com:7200")
                .withServerUrl("http://graphdb2.example.com:7200")
                .withCluster()
                .build();

        try (RepositoryConnection connection = repository.getConnection()) {
            String query = "select ?fact ?contents { ?fact a <urn:Fact> ; <urn:contents> ?contents }";
            try (TupleQueryResult tqr = connection.prepareTupleQuery(query).evaluate()) {
                while (tqr.hasNext()) {
                    System.out.println(tqr.next());
                }
            }
        }
    }
}
Tip: The client needs to be configured with at least one server URL that is part of the cluster. The remaining
server URLs will be discovered automatically. The server URLs that are provided when the client is created will
always be tried, so it is recommended to specify at least two of them in case one of them is down.
GraphDB 10 includes an additional mechanism that allows using any of the cluster nodes with any standard client,
e.g., RDF4J’s HTTPRepository or your own software that already works with GraphDB.
The GraphDBHTTPRepository class is part of the graphdb-client-api module. Use the following Maven configuration
to include it in your project:
<dependency>
<groupId>com.ontotext.graphdb</groupId>
<artifactId>graphdb-client-api</artifactId>
<version>${graphdb.version}</version>
</dependency>
Note: Do not forget to set the graphdb.version property to the actual GraphDB version you want to use, or
replace the ${graphdb.version} string with the version.
The cluster can also be used through an external proxy. To do this, instead of providing the GraphDB HTTP
address, you need to provide that of the proxy. For example, if for the repository “myrepo” GraphDB is on
http://graphdb.example.com:7200/repositories/myrepo, the external proxy will be on
http://graphdb.example.com:7204/repositories/myrepo.
Local consistency determines the freshness of the query results. At the lowest level (using the REST API), it is
controlled by setting the X-GraphDB-Local-Consistency header to one of the following values:
last-committed Sets Last Committed local consistency. The queries will always return results that include the
last completed transaction.
none Sets no local consistency. The queries may return results from a node that has not yet seen the last completed
transaction. This is the default setting.
You can set the header just like any other header in your HTTP client library. For example, with curl:
curl 'http://graphdb1.example.com:7200/repositories/myrepo'\
-H 'X-GraphDB-Local-Consistency: last-committed'\
-H 'Content-Type: application/sparql-query'\
-d 'select * { ?s ?p ?o } limit 5'
The GraphDB client API for Java has builtin support for setting the local consistency via the RequestHeaderAware
interface:
/**
* An interface for adding and setting HTTP request headers.
*/
public interface RequestHeaderAware {
...
/**
* Convenience method for setting the X-GraphDB-Local-Consistency header.
*
* @param localConsistency the desired local consistency level
*/
default void setLocalConsistencyHeader(LocalConsistency localConsistency) {
setHeader(GraphDBHTTPProtocol.LOCAL_CONSISTENCY_HEADER_NAME, localConsistency.toString());
}
}
package com.ontotext.example;
import com.ontotext.graphdb.replicationcluster.LocalConsistency;
import com.ontotext.graphdb.repository.http.GraphDBHTTPRepository;
import com.ontotext.graphdb.repository.http.GraphDBHTTPRepositoryBuilder;
import com.ontotext.graphdb.repository.http.RequestHeaderAware;
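A minimal sketch of such an example follows. It assumes that the repository object produced by the builder exposes RequestHeaderAware and that LocalConsistency defines a LAST_COMMITTED constant matching the last-committed header value — both are assumptions to verify against the client API Javadoc.

public class LocalConsistencyExample {
    public static void main(String[] args) {
        GraphDBHTTPRepository repository = new GraphDBHTTPRepositoryBuilder()
                .withServerUrl("http://graphdb1.example.com:7200")
                .withCluster()
                .build();

        // Assumption: the repository implements RequestHeaderAware, so the header
        // can be set once and applies to subsequent requests made through it.
        if (repository instanceof RequestHeaderAware) {
            ((RequestHeaderAware) repository).setLocalConsistencyHeader(LocalConsistency.LAST_COMMITTED);
        }
    }
}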
The Workbench REST API can be used to automate various tasks without having to open the Workbench in a
browser and doing them manually.
You can find more information about each REST API functionality group and its operations under Help » REST
API Documentation in the Workbench, as well as execute them directly from there and see the results.
Click on a functionality group to expand it and see the operations it includes. Click on an operation to see details
about it.
The REST API calls fall into the below major categories.
Use the cluster group controller API to create a cluster, view its configuration, monitor the status of both the cluster
group and each of its nodes, as well as to delete the cluster.
See these cURL examples for cluster group management.
Use the data import API to import data in GraphDB. You can choose between server files and a remote URL.
See these cURL examples for data import.
Use the repository management API to add, edit, or remove a repository to/from any attached location. Unlike the
RDF4J API, you can work with multiple remote locations from a single access point. When combined with the
location management, it can be used to automate the creation of multiple repositories across your network.
See these cURL examples for repository management.
Use the saved queries API to create, edit or remove saved queries. It is a convenient way to automate the creation
of saved queries that are important to your project.
See these cURL examples for saved queries.
Use the security management API to enable or disable security and free access, as well as add, edit, or remove
users, thus integrating the Workbench security into an existing system.
See these cURL examples for security management.
Use the SPARQL template management API to create, edit, delete, and execute SPARQL templates, as well as to
view all templates and their configuration.
See these cURL examples for SPARQL template management.
Use the SQL views management API to access, create, and edit SQL views (tables), as well as to delete existing
saved queries and view all SQL views for the active repository.
See these cURL examples for SQL views management.
8.2.9 Authentication
Use this login REST API endpoint to obtain a GDB token in exchange for username and password.
See this cURL example for authentication.
8.2.10 Monitoring
The GraphDB REST API currently exposes four endpoints suitable for scraping by Prometheus. See here the
metrics that can be monitored, as well as how to configure the Prometheus scrapers.
Cluster monitoring
Use the cluster statistics monitoring API to diagnose problems and cluster slowdowns more easily.
See this cURL example for cluster monitoring.
Infrastructure monitoring
Use the infrastructure statistics monitoring API to monitor GraphDB’s infrastructure so as to have better visibility
of the hardware resources usage.
See this cURL example for infrastructure statistics monitoring.
Repository monitoring
Use the repository monitoring API to monitor query and transaction statistics in order to obtain a better
understanding of the slow queries, suboptimal queries, active transactions, and open connections.
See this cURL example for repository monitoring.
Structures monitoring
Use the GraphDB structures monitoring API to monitor GraphDB structures – the global page cache and the entity
pool, in order to get a better understanding of whether the current GraphDB configuration is optimal for your
specific use case.
See this cURL example for structures statistics monitoring.
This section describes how to use the RDF4J API to create and access GraphDB repositories, both on the local file
system and remotely via the RDF4J HTTP server.
RDF4J comprises a large collection of libraries, utilities and APIs. The important components for this section are:
• the RDF4J classes and interfaces (API), which provide a uniform access to the SAIL components from
multiple vendors/publishers;
• the RDF4J server application.
Programmatically, GraphDB can be used via the RDF4J Java framework of classes and interfaces. Documentation
for these interfaces (including Javadoc) is available on the RDF4J website. Code snippets in the sections below are
taken from, or are variations of, the developer-getting-started examples that come with the GraphDB distribution.
With RDF4J 2, repository configurations are represented as RDF graphs. A particular repository configuration is
described as a resource, possibly a blank node, of type:
http://www.openrdf.org/config/repository#Repository.
This resource has an ID, a label, and an implementation, which in turn has a type, SAIL type, etc. A short
repository configuration is taken from the developer-getting-started template file repo-defaults.ttl.
[] a rep:Repository ;
rep:repositoryID "graphdb-repo" ;
rdfs:label "GraphDB Getting Started" ;
rep:repositoryImpl [
rep:repositoryType "graphdb:SailRepository" ;
sr:sailImpl [
sail:sailType "graphdb:Sail" ;
graphdb:ruleset "owl-horst-optimized" ;
graphdb:storage-folder "storage" ;
graphdb:base-URL "http://example.org/" ;
graphdb:repository-type "file-repository" ;
graphdb:imports "./ontology/owl.rdfs" ;
graphdb:defaultNS "http://example.org/"
]
].
The Java code that uses the configuration to instantiate a repository and get a connection to it is as follows:
// Get the repository from repository manager, note the repository id set in configuration .ttl file
Repository repository = repositoryManager.getRepository("graphdb-repo");
Note: The example above assumes that GraphDB Free edition is used. If you are using the Standard or Enterprise edition, a
valid license file should be set via the system property graphdb.license.file.
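The snippet above only shows the repository lookup. A fuller sketch using RDF4J’s LocalRepositoryManager to manage repositories under a local data directory (the directory path and the surrounding class are illustrative, not taken from the original example) could look like this:

import java.io.File;

import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.manager.LocalRepositoryManager;
import org.eclipse.rdf4j.repository.manager.RepositoryManager;

public class LocalRepositoryExample {
    public static void main(String[] args) {
        // Manage repositories under a local data directory (illustrative path)
        RepositoryManager repositoryManager = new LocalRepositoryManager(new File("/path/to/graphdb-data"));
        repositoryManager.initialize();

        // Get the repository from the repository manager; note the repository ID
        // set in the configuration .ttl file
        Repository repository = repositoryManager.getRepository("graphdb-repo");

        try (RepositoryConnection connection = repository.getConnection()) {
            // read and write RDF data through the connection
        }

        repositoryManager.shutDown();
    }
}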
The RDF4J server is a Web application that allows interaction with repositories using the HTTP protocol. It runs in
a JEE compliant servlet container, e.g., Tomcat, and allows client applications to interact with repositories located
on remote machines. In order to connect to and use a remote repository, you have to replace the local repository
manager with a remote one. The URL of the RDF4J server must be provided, but no repository configuration is
needed if the repository already exists on the server. The following lines can be added to the
developer-getting-started example program, although a correct URL must be specified:
RepositoryManager repositoryManager =
new RemoteRepositoryManager( "http://192.168.1.25:7200" );
repositoryManager.initialize();
The rest of the example program should work as expected, although the following library files must be added to
the classpath:
• commons-httpclient-3.1.jar
• commons-codec-1.10.jar
The RDF4J HTTP server is a fully fledged SPARQL endpoint – the RDF4J HTTP protocol is a superset of the
SPARQL 1.1 protocol. It provides an interface for transmitting SPARQL queries and updates to a SPARQL
processing service and returning the results via HTTP to the entity that requested them.
Any tools or utilities designed to interoperate with the SPARQL protocol will function with GraphDB because it
exposes a SPARQL-compliant endpoint.
The Graph Store HTTP Protocol is fully supported for direct and indirect graph names. The SPARQL 1.1 Graph
Store HTTP Protocol has the most details, although further information can be found in the RDF4J Server REST
API.
This protocol supports the management of RDF statements in named graphs in the REST style by providing the
ability to get, delete, add to, or overwrite statements in named graphs using the basic HTTP methods.
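As an illustration, the following sketch adds a few Turtle statements to a named graph over HTTP using Java’s built-in HTTP client. The /repositories/{id}/rdf-graphs/service path follows the RDF4J Server REST API convention for indirect graph references; the server URL, repository ID, and graph IRI are assumptions for the example.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class GraphStoreExample {
    public static void main(String[] args) throws Exception {
        // Indirect graph reference: the named graph IRI is passed as the "graph" query parameter
        String endpoint = "http://graphdb.example.com:7200/repositories/myrepo/rdf-graphs/service"
                + "?graph=" + URLEncoder.encode("http://example.com/graph", StandardCharsets.UTF_8);

        String turtle = "<http://example.com/s> <http://example.com/p> \"o\" .";

        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "text/turtle")
                .POST(HttpRequest.BodyPublishers.ofString(turtle)) // POST adds to the graph; PUT would overwrite it
                .build();

        HttpResponse<Void> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.discarding());
        System.out.println("Status: " + response.statusCode());
    }
}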
The GraphDB Plugin API is a framework and a set of public classes and interfaces that allow developers to extend
GraphDB in many useful ways. These extensions are bundled into plugins, which GraphDB discovers during
its initialization phase and then uses to delegate parts of its query or update processing tasks. The plugins are
given low-level access to the GraphDB repository data, which enables them to do their job efficiently. They are
discovered via the Java service discovery mechanism, which enables dynamic addition/removal of plugins from
the system without having to recompile GraphDB or change any configuration files.
A GraphDB plugin is a Java class that implements the com.ontotext.trree.sdk.Plugin interface. All public
classes and interfaces of the plugin API are located in this Java package, i.e., com.ontotext.trree.sdk. Here is
what the plugin interface looks like in an abbreviated form:
/**
 * The base interface for a GraphDB plugin. As a minimum a plugin must implement this interface.
 * <p>
 * Plugins also need to be listed in META-INF/services/com.ontotext.trree.sdk.Plugin so that Java's services
 * discovery mechanism can find them.
 */
public interface Plugin extends Service {
    /**
     * A method used by the plugin framework to provide plugins with a {@link Logger} object
     *
     * @param logger {@link Logger} object to be used for logging
     */
    void setLogger(Logger logger);

    /**
     * Plugin initialization method called once when the repository is being initialized, after the plugin has been
     * configured and before it is actually used. It enables plugins to execute whatever
     * initialization routines they consider appropriate, load resources, open connections, etc., based on the
     * plugin configuration.
     */
    void initialize(InitReason reason, PluginConnection pluginConnection);

    /**
     * Sets a new plugin fingerprint.
     * Every plugin should maintain a fingerprint of its data that could be used by GraphDB to determine if the
     * data has changed or not. Initially, on system initialization the plugins are injected their
     * fingerprints as they reported them before the last system shutdown
     *
     * @param fingerprint the last known plugin fingerprint
     */
    void setFingerprint(long fingerprint);

    /**
     * Returns the fingerprint of the plugin.
     * <p>
     * Every plugin should maintain a fingerprint of its data that could be used by GraphDB to determine if the
     * data has changed or not. The plugin fingerprint will become part of the repository fingerprint.
     *
     * @return the current plugin fingerprint based on its data
     */
    long getFingerprint();

    /**
     * Plugin shutdown method that is called when the repository is being shutdown. It enables plugins to execute whatever
     * finalization routines they consider appropriate, free resources, buffered streams, etc., based on the
     * plugin configuration.
     */
    void shutdown(ShutdownReason reason);
}
As it derives from the Service interface, the plugin is automatically discovered at runtime, provided that the
following conditions also hold:
• The plugin class is located in the classpath.
• It is mentioned in a META-INF/services/com.ontotext.trree.sdk.Plugin file in the classpath or in a .jar
that is in the classpath. The full class signature has to be written on a separate line in such a file.
The only method introduced by the Service interface is getName(), which provides the plugin’s (service’s) name.
This name must be unique within a particular GraphDB repository, and serves as a plugin identifier that can be
used at any time to retrieve a reference to the plugin instance.
/**
* Interface implemented by all run-time discoverable services (e.g. {@link Plugin} instances). Classes
* implementing this interface should furthermore be declared in the respective
* META-INF/services/<class.signature> file and will then be discoverable at run-time.
* <p>
* Plugins need not implement this interface directly but rather implement {@link Plugin}.
*/
public interface Service {
/**
 * Returns the name of the service (plugin). The name must be unique within a repository.
 *
 * @return the service name
 */
String getName();
}
There are many more functions (interfaces) that a plugin could implement, but these are all optional and are declared
in separate interfaces. Implementing any such complementary interface is the means to announce to the system
what this particular plugin can do in addition to its mandatory plugin responsibilities. It is then automatically used
as appropriate. See List of plugin interfaces and classes.
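As a minimal sketch of the mandatory responsibilities described above, a do-nothing plugin might look like the class below. It implements only the members quoted in this section (getName, setDataDir, setLogger, initialize, setFingerprint, getFingerprint, shutdown); the exact Logger type and any additional interface members depend on your GraphDB version, so treat the imports as assumptions and check them against the Plugin API Javadoc.

package com.example.plugin;

import java.io.File;

import org.slf4j.Logger; // assumption: the Logger type injected by the framework

import com.ontotext.trree.sdk.InitReason;
import com.ontotext.trree.sdk.Plugin;
import com.ontotext.trree.sdk.PluginConnection;
import com.ontotext.trree.sdk.ShutdownReason;

// Must also be listed in META-INF/services/com.ontotext.trree.sdk.Plugin
public class ExamplePlugin implements Plugin {
    private Logger logger;
    private long fingerprint;

    @Override
    public String getName() {
        return "example-plugin"; // must be unique within the repository
    }

    @Override
    public void setDataDir(File dataDir) {
        // this plugin stores nothing on disk, so the data directory is ignored
    }

    @Override
    public void setLogger(Logger logger) {
        this.logger = logger;
    }

    @Override
    public void initialize(InitReason reason, PluginConnection pluginConnection) {
        logger.info("Example plugin initialized");
    }

    @Override
    public void setFingerprint(long fingerprint) {
        this.fingerprint = fingerprint;
    }

    @Override
    public long getFingerprint() {
        return fingerprint;
    }

    @Override
    public void shutdown(ShutdownReason reason) {
        logger.info("Example plugin shutting down");
    }
}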
Discovery
This phase is executed at repository initialization. GraphDB searches for all plugin services in the classpath
registered in the META-INF/services/com.ontotext.trree.sdk.Plugin service registry files, and constructs a single
instance of each plugin found.
Configuration
Every plugin instance discovered and constructed during the previous phase is then configured. During this phase,
plugins are injected with a Logger object, which they use for logging (setLogger(Logger logger)), and the path
to their own data directory (setDataDir(File dataDir)), which they create, if needed, and then use to store their
data. If a plugin does not need to store anything to the disk, it can skip the creation of its data directory. However,
if it needs to use it, it is guaranteed that this directory will be unique and available only to the particular plugin that
it was assigned to.
This phase is also called when a plugin is enabled after repository initialization.
Initialization
After a plugin has been configured, the framework calls its initialize(InitReason reason, PluginConnection
pluginConnection) method so it gets the chance to do whatever initialization work it needs to do. The passed
instance of PluginConnection provides access to various other structures and interfaces, such as Statements and
Entities instances (Repository internals), and a SystemProperties instance, which gives the plugins access to
the system-wide configuration options and settings. Plugins typically use this phase to create IRIs that will be used
to communicate with the plugin.
This phase is also called when a plugin is enabled after repository initialization.
Request processing
The plugin participates in the request processing. The request phase applies to the evaluation of SPARQL queries,
getStatements calls, the transaction stages and the execution of SPARQL updates. Various event notifications
can also be part of this phase.
This phase is optional for the plugins but no plugin is useful without implementing at least one of its interfaces.
Request processing can be divided roughly into query processing and update processing.
Query processing
Query processing includes several subphases that can be used on their own or combined together:
Preprocessing Plugins are given the chance to modify the request before it is processed. In this phase, they could
also initialize a context object, which will be visible till the end of the request processing (Preprocessing).
Pattern interpretation Plugins can choose to provide results for requested statement patterns (Pattern interpre
tation). This subphase applies only to queries.
Postprocessing Before the request results are returned to the client, plugins are given a chance to modify them,
filter them out, or even insert new results (Postprocessing);
Update processing
Shutdown
During repository shutdown, each plugin is prompted to execute its own shutdown routines, free resources, flush
data to disk, etc. This must be done in the shutdown(ShutdownReason reason) method.
This phase is also called when a plugin is disabled after repository initialization.
/**
* The {@link PluginConnection} interface provides access to various objects that can be used to query data
 * or get the properties of the current transaction. An instance of {@link PluginConnection} will be passed to almost
 * all methods that a plugin may implement.
*/
public interface PluginConnection {
/**
* Returns an instance of {@link Entities} that can be used to retrieve or create RDF entities.
*
* @return an {@link Entities} instance
*/
Entities getEntities();
/**
* Returns an instance of {@link Statements} that can be used to retrieve RDF statements.
*
* @return a {@link Statements} instance
*/
Statements getStatements();
/**
 * Returns an instance of {@link Repository} that can be used for higher level access to the repository.
*
* @return a {@link Repository} instance
*/
Repository getRepository();
/**
* Returns the transaction ID of the current transaction or 0 if no explicit transaction is available.
*
* @return the transaction ID
*/
long getTransactionId();
/**
 * Returns the update testing status. In a multi-node GraphDB configuration (currently only GraphDB EE) an update
 * will be sent to multiple nodes. The first node that receives the update will be used to test if the update is
 * successful and only if so, it will be sent to other nodes. Plugins may use the update test status to perform
 * certain operations only when the update is tested (e.g. indexing data via an external service). The method will
 * return true if this is a GraphDB EE worker node testing the update or this is GraphDB Free or SE. The method will
 * return false only if this is a GraphDB EE worker node that is receiving a copy of the original update.
 *
 * @return true if the update is being tested, false otherwise
 */
boolean isTesting();
/**
 * Returns an instance of {@link SystemProperties} that can be used to retrieve various properties that identify
* the current GraphDB installation and repository.
*
* @return an instance of {@link SystemProperties}
*/
SystemProperties getProperties();
/**
 * Returns the repository fingerprint. Note that during an active transaction the fingerprint will be updated
 * only at the very end of the transaction; use {@link PluginTransactionListener#transactionCompleted}
 * if you want to get the updated fingerprint for the just-completed transaction.
*
* @return the repository fingerprint
*/
String getFingerprint();
/**
 * Returns whether the current GraphDB instance is part of a cluster. This is useful in cases where a plugin may modify
 * the fingerprint via a query. To protect cluster integrity the fingerprint may be changed only via an update.
*
* @return true if the current instance is in cluster group, false otherwise
*/
boolean isInCluster();
/**
 * Creates a thread-safe instance of this {@link PluginConnection} that can be used by other threads.
 * Note that every {@link ThreadsafePluginConnection} must be explicitly closed when no longer needed.
 *
 * @return an instance of {@link ThreadsafePluginConnection}
 */
ThreadsafePluginConnection getThreadsafeConnection();
/**
 * Returns an instance of {@link SecurityContext} that can be used to check if the user that initiated a plugin
 * request has the required access level.
* request has the required access level.
*
* @return an instance of {@link SecurityContext}
*/
SecurityContext getSecurityContext();
}
PluginConnection instances passed to the plugin are not thread-safe and not guaranteed to operate normally once
the called method returns. If the plugin needs to process data asynchronously in another thread, it must get an
instance of ThreadsafePluginConnection via PluginConnection.getThreadsafeConnection(). Once the allocated
thread-safe connection is no longer needed, it should be closed.
PluginConnection provides access to various other interfaces that access the repository’s data (Statements and
Entities), the current transaction’s properties, the repository fingerprint, various system and repository properties
(SystemProperties), and the security context of plugin requests (SecurityContext).
PluginConnection also provides higher level access to the repository via the Repository interface, with the ability
for simple data updates.
In order to enable efficient request processing, plugins are given low-level access to the repository data and internals.
This is done through the Statements and Entities interfaces.
The Entities interface represents a set of RDF objects (IRIs, blank nodes, literals, and RDF-star embedded triples).
All such objects are termed entities and are given unique long identifiers. The Entities instance is responsible
for resolving these objects from their identifiers and inversely for looking up the identifier of a given entity. Most
plugins process entities using their identifiers, because dealing with integer identifiers is a lot more efficient than
working with the actual RDF entities they represent. The Entities interface is the single entry point available
to plugins for entity management. It supports the addition of new entities, lookup of entity type and properties,
resolving entities, etc.
It is possible to declare two RDF objects to be equivalent in a GraphDB repository, e.g., by using owl:sameAs
optimization. In order to provide a way to use such declarations, the Entities interface assigns a class identifier
to each entity. For newly created entities, this class identifier is the same as the entity identifier. When two entities
are declared equivalent, one of them adopts the class identifier of the other, and thus they become members of the
same equivalence class. The Entities interface exposes the entity class identifier for plugins to determine which
entities are equivalent.
Entities within an Entities instance have a certain scope. There are three entity scopes:
• Default – entities are persisted on the disk and can be used in statements that are also physically stored on
disk. They have positive (nonzero) identifiers, and are often referred to as physical or data entities.
• System – system entities have negative identifiers and are not persisted on the disk. They can be used, for
example, for system (or magic) predicates that can provide configuration to a plugin or request something to
be handled by a plugin. They are available throughout the whole repository lifetime, but after restart, they
have to be recreated again.
• Request – entities are not persisted on disk and have negative identifiers. They only live in the scope of
a particular request, and are not visible to other concurrent requests. These entities disappear immediately
after the request processing finishes. The request scope is useful for temporary entities such as those entities
that are returned by a plugin as a response to a particular query.
The Statements interface represents a set of RDF statements, where ‘statement’ means a quadruple of subject,
predicate, object, and context RDF entity identifiers. Statements can be searched for but not modified.
An important abstract class, which is related to GraphDB internals, is StatementIterator. It has a boolean
next() method, which attempts to scroll the iterator onto the next available statement and returns true only if it
succeeds. In case of success, its subject, predicate, object, and context fields are initialized with the respective
components of the next statement. Furthermore, some properties of each statement are available via the following
methods:
• boolean isReadOnly() – returns true if the statement is in the Axioms part of the rulefile or is imported
at initialization;
• boolean isExplicit() – returns true if the statement is explicitly asserted;
• boolean isImplicit() – returns true if the statement is produced by the inferencer (raw statements can be
both explicit and implicit).
Here is a brief example that puts Statements, Entities, and StatementIterator together in order to output all
literals that are related to a given URI:
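A rough sketch of such an example follows; the accessor names on Statements (get) and Entities (resolve/get) are assumptions made for illustration, so check the Plugin API Javadoc for the exact signatures.

import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Literal;
import org.eclipse.rdf4j.model.Value;

import com.ontotext.trree.sdk.Entities;
import com.ontotext.trree.sdk.PluginConnection;
import com.ontotext.trree.sdk.StatementIterator;
import com.ontotext.trree.sdk.Statements;

class LiteralPrinter {
    // Print every literal directly related to the given IRI (method names are placeholders)
    static void printLiterals(PluginConnection pluginConnection, IRI subjectIri) {
        Entities entities = pluginConnection.getEntities();
        Statements statements = pluginConnection.getStatements();

        long subjectId = entities.resolve(subjectIri);          // placeholder: look up the entity ID of the IRI
        StatementIterator it = statements.get(subjectId, 0, 0); // placeholder: 0 = unbound predicate/object

        while (it.next()) {
            Value object = entities.get(it.object);             // placeholder: resolve the object entity ID
            if (object instanceof Literal) {
                System.out.println(object.stringValue());
            }
        }
    }
}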
StatementIterator is also used to return statements via one of the pattern interpretation interfaces.
Each GraphDB transaction has several properties accessible via PluginConnection:
Transaction ID (PluginConnection.getTransactionId()) An integer value. Bigger values indicate newer
transactions.
Testing (PluginConnection.isTesting()) A boolean value indicating the testing status of transaction. In
GraphDB EE the testing transaction is the first execution of a given transaction that determines if the transaction
can be executed successfully before being propagated to the entire cluster. Despite the “testing”
name, it is a full-featured transaction that will modify the data. In GraphDB Free and SE the transaction is
always executed only once, so it is always testing there.
Repository access
/**
 * Provides higher-level access to the repository with the ability for simple data updates.
 *
 * @since 10
 */
public interface Repository {
/**
 * Returns true if this instance is allowed to add statements to the repository.
 * Adding statements is disallowed during plugin initialization, without an active transaction, and in thread-safe
 * instances obtained via {@link PluginConnection#getThreadsafeConnection()}.
 *
 * @return true if adding is allowed, false otherwise.
 */
boolean isAddAllowed();
/**
* Returns true if this instance is allowed to remove statements from the repository.
 * Removing statements is disallowed during plugin initialization, without an active transaction, during a parallel
 * load, and in thread-safe instances obtained via {@link PluginConnection#getThreadsafeConnection()}.
 *
 * @return true if removing is allowed, false otherwise.
*/
boolean isRemoveAllowed();
/**
* Add a statement to the repository.
*
* @param subject subject of the statement to add
* @param predicate predicate of the statement to add
* @param object object of the statement to add
* @param contexts context(s) to add the statement to, if no contexts are specified, the statement
* will be added to the default graph.
* @throws IllegalStateException if this instance isn't allowed to add statements
*/
void addStatement(Resource subject, IRI predicate, Value object, Resource... contexts)
throws IllegalStateException;
/**
* Removes all statements matching the specified subject, predicate and object from the repository.
* All three parameters may be null to indicate wildcards.
*
* @param subject subject of the statement to remove
* @param predicate predicate of the statement to remove
* @param object object of the statement to remove
 * @param contexts context(s) to remove the statement from, if no contexts are specified, the statement
 * will be removed from all graphs. Use null to remove from the default graph only.
 * @throws IllegalStateException if this instance isn't allowed to remove statements
 */
void removeStatements(Resource subject, IRI predicate, Value object, Resource... contexts)
    throws IllegalStateException;
}
System properties
PluginConnection provides access to various static repository and system properties via getProperties(). Most
of the values of these properties are set at repository initialization time and will not change while the repository
is operating. The values for the product type and capabilities may change after repository initialization if the
GraphDB license is updated.
The getProperties() method returns an instance of SystemProperties:
/**
 * This interface represents various properties for the running GraphDB instance and the repository as seen by the Plugin API.
*/
public interface SystemProperties {
/**
* Returns the read-only status of the current repository.
*
* @return true if read-only, false otherwise
*/
boolean isReadOnly();
/**
* Returns the number of bits needed to represent an entity id
*
* @return the number of bits as an integer
*/
int getEntityIdSize();
/**
* Returns the product type of the current GraphDB license.
*
* @return one of {@link ProductType#FREE}, {@link ProductType#SE} or {@link ProductType#EE}
*/
ProductType getProductType();
/**
* Checks whether the current license has the provided product capability.
*
* @param productCapability a product capability
* @return true if the capability is supported by the license, false otherwise.
*/
boolean hasProductCapability(String productCapability);
/**
* Returns the full GraphDB version string.
*
* @return a string describing the GraphDB version
*/
String getVersion();
/**
* Returns the GraphDB major version component.
*
* @return the major version as an integer
*/
int getVersionMajor();
/**
* Returns the GraphDB patch version component.
*
* @return the patch version as an integer
*/
int getVersionPatch();
/**
/**
 * Returns the number of cores in the currently set license up to the physical number of cores on the machine.
 *
 * @return the number of cores as an integer
 */
int getNumberOfLicensedCores();
/**
* Retrieve string repository configuration identified by the given IRI.
*
* @param settingName the configuration identifier
* @param defaultValue the default value to return if not configured
* @return the configuration value or default value
*/
String getRepositorySetting(IRI settingName, String defaultValue);
/**
* Retrieve boolean repository configuration identified by the given IRI.
*
* @param settingName the configuration identifier
* @param defaultValue the default value to return if not configured
* @return the configuration value or default value
*/
boolean getRepositorySetting(IRI settingName, boolean defaultValue);
/**
* Retrieve integer repository configuration identified by the given IRI.
*
* @param settingName the configuration identifier
* @param defaultValue the default value to return if not configured
* @return the configuration value or default value
*/
int getRepositorySetting(IRI settingName, int defaultValue);
/**
* Retrieve multi-valued string based repository configuration identified by the given IRI.
*
* @param settingName the configuration identifier
* @return the configuration value or empty array
*/
String[] getRepositorySetting(IRI settingName);
/**
* The possible product types of the installed GraphDB license.
*/
enum ProductType {
    FREE, SE, EE
}
}
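For example, a plugin could read an optional repository-level setting with a fallback default using the getRepositorySetting overloads shown above; the setting IRI below is purely illustrative.

import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;

import com.ontotext.trree.sdk.PluginConnection;
import com.ontotext.trree.sdk.SystemProperties;

class SettingsExample {
    // Read a plugin-specific batch size, falling back to 64 when it is not configured
    static int readBatchSize(PluginConnection pluginConnection) {
        SystemProperties properties = pluginConnection.getProperties();
        IRI setting = SimpleValueFactory.getInstance()
                .createIRI("http://example.com/my-plugin#batchSize"); // illustrative setting IRI
        return properties.getRepositorySetting(setting, 64);
    }
}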
Repository properties
There are some dynamic repository properties that may change once a repository has been initialized. These
properties are:
Repository fingerprint (PluginConnection.getFingerprint()) The repository fingerprint. Note that the fingerprint
will be updated at the very end of a transaction so the updated fingerprint after a transaction should
be accessed within PluginTransactionListener.transactionCompleted().
Whether the repository is attached to a cluster (PluginConnection.isAttached()) GraphDB EE worker
repositories are typically attached to a master repository and not accessed directly. When this is the case
this method will return true and the plugin may use it to refuse to perform actions that may cause the
fingerprint to change outside of a transaction. In GraphDB Free and SE the method always returns false.
Security context
PluginConnection provides access to the security context of plugin requests via getSecurityContext().
The security context can be used to check if the user that initiated a plugin request has the required access level
based on simple criteria such as having write access to the repository, checking if the user has a specific role, or a
username matching an access-control list maintained by the plugin.
The getSecurityContext() method returns an instance of SecurityContext:
/**
* Plugin interface that provides access to the security context.
*/
public interface SecurityContext {
/**
* Returns the username of the user that initiated the plugin request.
*
* @return a username
*/
String getUsername();
/**
* Returns true if the user that initiated the plugin request has write access to the repository.
*
* @return true if write granted, false otherwise
*/
boolean hasWriteAccess();
/**
 * Returns the roles of the user that initiated the plugin request.
 *
 * @return the roles of the user
 */
/**
* Returns true if the user that initiated the plugin request has the supplied role.
*
* @return true if the user has the role, false otherwise
*/
boolean hasRole(String role);
}
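A typical use of this interface inside a plugin is to refuse an operation triggered through a query unless the requesting user has the necessary rights, for example:

import com.ontotext.trree.sdk.PluginConnection;
import com.ontotext.trree.sdk.SecurityContext;

class SecurityCheckExample {
    // Throw if the user that initiated the plugin request cannot write to the repository
    static void requireWriteAccess(PluginConnection pluginConnection) {
        SecurityContext security = pluginConnection.getSecurityContext();
        if (!security.hasWriteAccess()) {
            throw new IllegalStateException(
                    "User " + security.getUsername() + " is not allowed to modify plugin data");
        }
    }
}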
As already mentioned, a plugin’s interaction with each of the request-processing phases is optional. The plugin
declares if it plans to participate in any phase by implementing the appropriate interface.
Pre-processing
A plugin that will be participating in request preprocessing must implement the Preprocessor interface. It looks
like this:
/**
* Interface that should be implemented by all plugins that need to maintain per-query context.
*/
public interface Preprocessor {
/**
* Pre-processing method called once for every SPARQL query or getStatements() request before it is
* processed.
*
* @param request request object
* @return context object that will be passed to all other plugin methods in the future stages of the
* request processing
*/
RequestContext preprocess(Request request);
}
The preprocess(Request request) method receives the request object and returns a RequestContext instance.
The passed request parameter is an instance of one the interfaces extending Request, depending on the type of
the request (QueryRequest for a SPARQL query or StatementRequest for “get statements”). The plugin changes
the request object accordingly, initializes, and returns its context object, which is passed back to it in every other
method during the request processing phase. The returned request context may be null, but regardless of whether it is, it is
only visible to the plugin that initialized it. It can be used to store data visible for (and only for) this whole request,
e.g., to pass data related to two different statement patterns recognized by the plugin. The request context gives
further request processing phases access to the Request object reference. Plugins that opt to skip this phase do not
have a request context, and are not able to get access to the original Request object.
Plugins may create their own RequestContext implementation or use the default one, RequestContextImpl.
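As an illustration, a pre-processing implementation that simply remembers the original request could look like the sketch below. The setRequest call on RequestContextImpl is an assumption about the default implementation’s API, and the class must of course also implement the mandatory Plugin interface (omitted here).

import com.ontotext.trree.sdk.Preprocessor;
import com.ontotext.trree.sdk.QueryRequest;
import com.ontotext.trree.sdk.Request;
import com.ontotext.trree.sdk.RequestContext;
import com.ontotext.trree.sdk.RequestContextImpl;

class MyPreprocessor implements Preprocessor {
    @Override
    public RequestContext preprocess(Request request) {
        RequestContextImpl context = new RequestContextImpl();
        context.setRequest(request); // assumption: the default implementation stores the Request
        if (request instanceof QueryRequest) {
            // QueryRequest exposes SPARQL-query-specific details that later phases may inspect
        }
        return context;
    }
}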
Pattern interpretation
This is one of the most important phases in the life cycle of a plugin. In fact, most plugins need to participate in
exactly this phase. This is the point where request statement patterns need to get evaluated and statement results
are returned.
For example, consider the following SPARQL query:
SELECT * WHERE {
?s <http://example.com/predicate> ?o
}
There is just one statement pattern inside this query: ?s <http://example.com/predicate> ?o. All plugins that
have implemented the PatternInterpreter interface (thus declaring that they intend to participate in the pattern
interpretation phase) are asked if they can interpret this pattern. The first one to accept it and return results will
be used. If no plugin interprets the pattern, GraphDB will use the repository's physical statements, i.e., the ones
persisted on the disk.
Here is the PatternInterpreter interface:
/**
* Interface implemented by plugins that want to interpret basic triple patterns
*/
public interface PatternInterpreter {
/**
* Estimate the number of results that could be returned by the plugin for the given parameters
*
 * @param subject subject ID (alternatively {@link Entities#BOUND} or {@link Entities#UNBOUND})
RequestContext requestContext);
/**
* Interpret basic triple pattern and return {@link StatementIterator} with results
*
* @param subject subject ID (alternatively {@link Entities#BOUND} or {@link Entities#UNBOUND})
 * @param predicate predicate ID (alternatively {@link Entities#BOUND} or {@link Entities#UNBOUND})
The estimate() and interpret() methods take the same arguments and are used in the following way:
• Given a statement pattern (e.g., the one in the SPARQL query above), all plugins that implement PatternIn-
terpreter are asked to interpret() the pattern. The subject, predicate, object and context values are
either the identifiers of the values in the pattern or 0, if any of them is an unbound variable. The statements
and entities objects represent respectively the statements and entities that are available for this particular
request. For instance, if the query contains any FROM <http://some/graph> clauses, the statements object
will only provide access to the statements in the defined named graphs. Similarly, the entities object
contains entities that might be valid only for this particular request. The plugin’s interpret() method must
return a StatementIterator if it intends to interpret this pattern, or null if it refuses.
• In case the plugin signals that it will interpret the given pattern (returns a non-null value), GraphDB’s query
optimizer will call the plugin’s estimate() method in order to get an estimate of how many results will be
returned by the StatementIterator returned by interpret(). This estimate does not need to be precise, but
the more precise it is, the more likely the optimizer is to produce an efficient query plan. There is a slight
difference in the values that will be passed to estimate(): the statement components (e.g., subject) might
not only be entity identifiers, but they can also be set to two special values:
– Entities.BOUND – the pattern component is said to be bound, but its particular binding is not yet
known;
– Entities.UNBOUND – the pattern component will not be bound. These values must be treated as hints
to the estimate() method to provide a better approximation of the result set size, although its precise
value cannot be determined before the query is actually run.
• After the query has been optimized, the interpret() method of the plugin might be called again should any
variable become bound due to the pattern reordering applied by the optimizer. Plugins must be prepared to
expect different combinations of bound and unbound statement pattern components, and return appropriate
iterators.
The requestContext parameter is the value returned by the preprocess() method if one exists, or null otherwise.
Results are returned as statements.
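To make the contract more concrete, here is a minimal, hypothetical sketch of a pattern-interpreting plugin. The class
name and entity ID fields are illustrative only, and the StatementIterator.EMPTY and StatementIterator.create()
convenience members are assumptions about the API; imports and the rest of the plugin class are omitted:
public class ExamplePatternPlugin implements PatternInterpreter {
    // Hypothetical entity IDs resolved by the plugin during initialization
    private long mySubjectId;
    private long myPredicateId;
    private long myObjectId;

    @Override
    public StatementIterator interpret(long subject, long predicate, long object, long context,
                                       PluginConnection pluginConnection, RequestContext requestContext) {
        if (predicate != myPredicateId) {
            // Refuse the pattern: other plugins or the persisted statements will handle it
            return null;
        }
        // 0 means an unbound variable; a bound value must match what the plugin can provide
        if ((subject != 0 && subject != mySubjectId) || (object != 0 && object != myObjectId)) {
            return StatementIterator.EMPTY;
        }
        // Accept the pattern and return a single statement
        return StatementIterator.create(mySubjectId, myPredicateId, myObjectId, 0);
    }

    @Override
    public double estimate(long subject, long predicate, long object, long context,
                           PluginConnection pluginConnection, RequestContext requestContext) {
        // At most one statement is produced; note that subject/object may also be
        // Entities.BOUND or Entities.UNBOUND here
        return 1;
    }
}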
The plugin framework also supports the interpretation of an extended type of a list pattern.
Consider the following SPARQL queries:
SELECT * WHERE {
?s <http://example.com/predicate> (?o1 ?o2)
}
SELECT * WHERE {
(?s1 ?s2) <http://example.com/predicate> ?o
}
Internally the object or subject list will be converted to a series of triples conforming to rdf:List. These triples can
be handled with PatternInterpreter but the whole list semantics will have to be implemented by the plugin.
In order to make this task easier the Plugin API defines two additional interfaces very similar to the PatternIn-
terpreter interface – ListPatternInterpreter and SubjectListPatternInterpreter.
ListPatternInterpreter handles lists in the object position:
/**
 * Interface implemented by plugins that want to interpret list-like triple patterns
 */
public interface ListPatternInterpreter {
    double estimate(long subject, long predicate, long[] objects, long context,
                    PluginConnection pluginConnection, RequestContext requestContext);

    StatementIterator interpret(long subject, long predicate, long[] objects, long context,
                                PluginConnection pluginConnection, RequestContext requestContext);
}
It differs from PatternInterpreter by having multiple objects passed as an array of long, instead of a single long
object. The semantics of both methods is equivalent to the one in the basic pattern interpretation case.
SubjectListPatternInterpreter handles lists in the subject position:
/**
 * Interface implemented by plugins that want to interpret list-like triple patterns
 */
public interface SubjectListPatternInterpreter {
    /**
     * Estimate the number of results that could be returned by the plugin for the given parameters
     *
     * @param subjects subject IDs (alternatively {@link Entities#BOUND} or {@link Entities#UNBOUND})
     * @param predicate predicate ID (alternatively {@link Entities#BOUND} or {@link Entities#UNBOUND})
     * @param object object ID (alternatively {@link Entities#BOUND} or {@link Entities#UNBOUND})
     * @param context context ID (alternatively {@link Entities#BOUND} or {@link Entities#UNBOUND})
     * @param pluginConnection an instance of {@link PluginConnection}
     * @param requestContext the request context as returned by {@link Preprocessor#preprocess(Request)}
     */
    double estimate(long[] subjects, long predicate, long object, long context,
                    PluginConnection pluginConnection, RequestContext requestContext);

    /**
     * Interpret list-like triple pattern and return {@link StatementIterator} with results
     */
    StatementIterator interpret(long[] subjects, long predicate, long object, long context,
                                PluginConnection pluginConnection, RequestContext requestContext);
}
It differs from PatternInterpreter by having multiple subjects passed as an array of long, instead of a single
long subject. The semantics of both methods is equivalent to the one in the basic pattern interpretation case.
Post-processing
There are cases when a plugin would like to modify or otherwise filter the final results of a request. This is where
the Postprocessor interface comes into play:
/**
* Interface that should be implemented by plugins that need to post-process results from queries.
*/
public interface Postprocessor {
/**
* A query method that is used by the framework to determine if a {@link Postprocessor} plugin really wants
* to post-process the request results.
*
* @param requestContext the request context reference
* @return boolean value
*/
boolean shouldPostprocess(RequestContext requestContext);
/**
* Method called for each {@link BindingSet} in the query result set. Each binding set is processed in
* sequence by all plugins that implement the {@link Postprocessor} interface, piping the result returned
* by each plugin into the next one. If any of the post-processing plugins returns null the result is
* deleted from the result set.
*
* @param bindingSet binding set object to be post-processed
* @param requestContext context object as returned by {@link Preprocessor#preprocess(Request)} (in case
*                       this plugin implemented this interface)
* @return binding set object that should be post-processed further by next post-processing plugins or
* null if the current binding set should be deleted from the result set
*/
BindingSet postprocess(BindingSet bindingSet, RequestContext requestContext);
/**
* Method called after all post-processing has been finished for each plugin. This is the point where
* every plugin could introduce its results even if the original result set was empty
*
* @param requestContext context object as returned by {@link Preprocessor#preprocess(Request)} (in case
*                       this plugin implemented this interface)
* @return iterator for resulting binding sets that need to be added to the final result set
*/
Iterator<BindingSet> flush(RequestContext requestContext);
}
The postprocess() method is called for each binding set that is to be returned to the repository client. This method
may modify the binding set and return it, or alternatively, return null, in which case the binding set is removed
from the result set. After a binding set is processed by a plugin, the possibly modified binding set is passed to the
next plugin having postprocessing functionality enabled. After the binding set is processed by all plugins (in the
case where no plugin deletes it), it is returned to the client. Finally, after all results are processed and returned,
each plugin’s flush() method is called to introduce new binding set results in the result set. These in turn are
finally returned to the client.
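As a minimal illustration (a hypothetical plugin class, imports omitted), a Postprocessor that simply drops solutions
missing a ?label binding and adds nothing of its own could look like this:
public class LabelFilteringPlugin implements Postprocessor {
    @Override
    public boolean shouldPostprocess(RequestContext requestContext) {
        // Post-process every request; a real plugin would typically inspect its request context here
        return true;
    }

    @Override
    public BindingSet postprocess(BindingSet bindingSet, RequestContext requestContext) {
        // Returning null removes the binding set from the result set; otherwise pass it on unchanged
        return bindingSet.hasBinding("label") ? bindingSet : null;
    }

    @Override
    public Iterator<BindingSet> flush(RequestContext requestContext) {
        // Nothing extra to add to the final result set
        return Collections.emptyIterator();
    }
}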
As well as query/read processing, plugins are able to process update operations for statement patterns containing
specific predicates. In order to intercept updates, a plugin must implement the UpdateInterpreter interface.
During initialization, the getPredicatesToListenFor() is called once by the framework, so that the plugin can
indicate which predicates it is interested in.
From then onwards, the plugin framework filters updates for statements using these predicates and notifies the
plugin. The plugin may do whatever processing is required and must return a boolean value indicating whether
the statement should be skipped. Skipped statements are not processed further by GraphDB, so the insert or delete
will have no effect on actual data in the repository.
/**
* An interface that should be implemented by the plugins that want to be notified for particular update
* events. The getPredicatesToListenFor() method should return the predicates of interest to the plugin. This
* method will be called once only immediately after the plugin has been initialized. After that point the
* plugin's interpretUpdate() method will be called for each inserted or deleted statement sharing one of the
* predicates of interest.
*/
public interface UpdateInterpreter {
/**
* Returns the predicates for which the plugin needs to be notified on update.
*
* @return an array of predicate IDs
*/
long[] getPredicatesToListenFor();
/**
* Hook that is called whenever a statement containing one of the registered predicates
* (see {@link #getPredicatesToListenFor()}) is added or removed.
*
* @param subject subject value of the updated statement
* @param predicate predicate value of the updated statement
* @param object object value of the updated statement
* @param context context value of the updated statement
* @param isAddition true if the statement was added, false if it was removed
* @param isExplicit true if the updated statement was an explicit one
* @param pluginConnection an instance of {@link PluginConnection}
* @return true - when the statement was handled by the plugin only and should <i>NOT</i> be added to/
,→removed from the repository,
* false - when the statement should be added to/removed from the repository
*/
boolean interpretUpdate(long subject, long predicate, long object, long context, boolean isAddition,
boolean isExplicit, PluginConnection pluginConnection);
}
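A minimal sketch of an UpdateInterpreter (hypothetical plugin class and predicate field, imports omitted) might
look like this:
public class ControlPredicatePlugin implements UpdateInterpreter {
    // Hypothetical ID of a "control" predicate registered by the plugin during initialization
    private long controlPredicateId;

    @Override
    public long[] getPredicatesToListenFor() {
        // Only statements with this predicate will be passed to interpretUpdate()
        return new long[] { controlPredicateId };
    }

    @Override
    public boolean interpretUpdate(long subject, long predicate, long object, long context,
                                   boolean isAddition, boolean isExplicit,
                                   PluginConnection pluginConnection) {
        if (isAddition) {
            // React to the control statement here, e.g., toggle some plugin state
        }
        // true: the statement is handled by the plugin only and is not stored in the repository
        return true;
    }
}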
Statement deletion in GraphDB is specified as a quadruple (subject, predicate, object, context), where each position
can be an explicit value or null. Null in this case means all subjects, predicates, objects, or contexts, depending on
the position where null was specified.
When at least one of the positions is non-null, the plugin framework will fire individual events for each matching
and removed statement.
When all positions are null (i.e., delete everything in the repository) the operation will be optimized internally
and individual events will not be fired. This means that UpdateInterpreter and StatementListener will not be
called.
ClearInterpreter is an interface that allows plugins to detect the removal of entire contexts or removal of all data
in the repository:
/**
* This interface can be implemented by plugins that want to be notified on clear()
* or remove() (all statements in any context).
*/
public interface ClearInterpreter {
/**
* Notification called before the statements are removed from the given context.
*
* @param context the ID of the context or 0 if all contexts
* @param pluginConnection an instance of {@link PluginConnection}
*/
void beforeClear(long context, PluginConnection pluginConnection);
/**
* Notification called after the statements have been removed from the given context.
*
* @param context the ID of the context or 0 if all contexts
* @param pluginConnection an instance of {@link PluginConnection}
*/
void afterClear(long context, PluginConnection pluginConnection);
}
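For illustration, a hypothetical plugin that keeps a per-context cache could react to clear operations as follows
(imports omitted):
public class ContextCachingPlugin implements ClearInterpreter {
    // Hypothetical per-context data maintained by the plugin
    private final Map<Long, Object> perContextCache = new HashMap<>();

    @Override
    public void beforeClear(long context, PluginConnection pluginConnection) {
        // Nothing to prepare in this sketch; the statements are still present at this point
    }

    @Override
    public void afterClear(long context, PluginConnection pluginConnection) {
        if (context == 0) {
            // 0 means all contexts were cleared
            perContextCache.clear();
        } else {
            perContextCache.remove(context);
        }
    }
}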
The Plugin API provides a way to intercept data inserted into or removed from a particular predefined context.
The ContextUpdateHandler interface:
/**
* This interface provides a mechanism for plugins to handle updates to certain contexts.
* When a plugin requests handling of a context, all data for that context will be forwarded to the plugin
* and not inserted into any GraphDB collections.
* <p>
* Note that unlike other plugin interfaces, {@link ContextUpdateHandler} does not use entity IDs but works
* directly with the RDF values. Data handled by this interface does not reach the entity pool and so no
* entity IDs are created.
*/
public interface ContextUpdateHandler {
/**
* Returns the contexts for which the plugin will handle the updates.
*
* @return array of {@link Resource}
*/
Resource[] getUpdateContexts();
PluginConnection pluginConnection);
}
This is similar to Updates involving specific predicates with some important differences:
• ContextUpdateHandler
– Configured via a list of contexts specified as IRI objects.
– Statements with these contexts are passed to the plugin as Value objects and never enter any of the
database collections.
– The plugin is assumed to always handle the update.
• UpdateInterpreter
– Configured via a list of predicates specified as integer IDs.
– Statements with these predicates are passed to the plugin as integer IDs after their RDF values are
converted to integer IDs in the entity pool.
– The plugin decides whether to handle the statement or pass it on to other plugins and eventually to the
database.
This mechanism is especially useful for the creation of virtual contexts (graphs) whose data is stored within a
plugin and never pollutes any of the database collections with unnecessary values.
Unlike the rest of the Plugin API this interface uses RDF values as objects bypassing the use of integer IDs.
8.4.7 Transactions
A plugin may need to participate in the transaction workflow, e.g., because it needs to update certain data
structures so that they reflect the actual data in the repository. Without being part of the transaction, the plugin
would not know when to persist or discard a given state.
Transactions can be easily tracked by implementing the PluginTransactionListener interface:
/**
* The {@link PluginTransactionListener} allows plugins to be notified about transactions (start,
* commit+completed or abort)
*/
public interface PluginTransactionListener {
/**
* Notifies the listener about the start of a transaction.
*
* @param pluginConnection an instance of {@link PluginConnection}
*/
void transactionStarted(PluginConnection pluginConnection);
/**
* Notifies the listener about the commit phase of a transaction. The plugin should do the bulk of its
* transaction work here and may abort the transaction by throwing an exception.
*
* @param pluginConnection an instance of {@link PluginConnection}
*/
void transactionCommit(PluginConnection pluginConnection);
/**
* Notifies the listener about the completion of a transaction. This will be the last event in a successful
* transaction. The plugin is not allowed to throw any exceptions here and if so they will be ignored. If a
* plugin needs to abort a transaction it should be done in {@link #transactionCommit(PluginConnection)}.
*
* @param pluginConnection an instance of {@link PluginConnection}
*/
void transactionCompleted(PluginConnection pluginConnection);
/**
* Notifies the listener about the abortion of a transaction. This will be the last event in an aborted
* transaction.
* <p>
* Plugins should revert any modifications caused by this transaction, including the fingerprint.
*
* @param pluginConnection an instance of {@link PluginConnection}
*/
void transactionAborted(PluginConnection pluginConnection);
/**
* Notifies the listener about a user abort request. A user abort request is a request by an end-user to
* abort the transaction. Unlike the other events this will be called asynchronously whenever the request
* is received.
* <p>
* Plugins may react and terminate any long-running computation or ignore the request. This is just a
* handy way to speed up abortion when a user requests it. For example, this event may be received
* asynchronously while the plugin is indexing data (in {@link #transactionCommit(PluginConnection)}
* running in the main thread). The plugin may notify itself that the indexing should stop. Regardless of
* the actions taken by the plugin the transaction may still be aborted and
* {@link #transactionAborted(PluginConnection)} will be called.
*
* @param pluginConnection an instance of {@link PluginConnection}
*/
default void transactionAbortedByUser(PluginConnection pluginConnection) {
}
}
Each transaction has a beginning signalled by a call to transactionStarted(). Then the transaction can proceed
in several ways:
• Commit and completion:
– transactionCommit() is called;
– transactionCompleted() is called.
• Commit followed by abortion (typically because another plugin aborted the transaction in its own transac-
tionCommit()):
– transactionCommit() is called;
– transactionAborted() is called.
• Abortion before entering commit:
– transactionAborted() is called.
Plugins should strive to do all heavy transaction work in transactionCommit(), in such a way that a call to
transactionAborted() can revert the changes. Plugins may throw exceptions in transactionCommit() in order to
abort the transaction, e.g., if some constraint was violated.
Plugins should do no heavy processing in transactionCompleted() and are not allowed to throw exceptions there.
Such exceptions will be logged and ignored, and the transaction will still go through normally.
The transactionAbortedByUser() method will be called asynchronously (e.g., while the plugin is executing
transactionCommit() in the main update thread) when a user requests the transaction to be aborted. The plugin
may use this to signal its other thread to abort processing at the earliest convenience, or simply ignore the request.
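The following hypothetical sketch (imports omitted; the asynchronous transactionAbortedByUser() notification is
not overridden) illustrates the recommended division of work between the transaction events: a snapshot taken at
the start allows transactionAborted() to revert whatever transactionCommit() did:
public class TransactionalIndexPlugin implements PluginTransactionListener {
    // Hypothetical in-memory index maintained by the plugin, plus a snapshot used for rollback
    private final List<String> index = new ArrayList<>();
    private List<String> snapshot;

    @Override
    public void transactionStarted(PluginConnection pluginConnection) {
        // Remember the state before this transaction
        snapshot = new ArrayList<>(index);
    }

    @Override
    public void transactionCommit(PluginConnection pluginConnection) {
        // Do the heavy work here; throwing an exception at this point aborts the transaction
        index.add("data derived from this transaction");
    }

    @Override
    public void transactionCompleted(PluginConnection pluginConnection) {
        // Keep this cheap and never throw: the transaction has already succeeded
        snapshot = null;
    }

    @Override
    public void transactionAborted(PluginConnection pluginConnection) {
        // Revert everything done in transactionCommit() for this transaction
        index.clear();
        index.addAll(snapshot);
        snapshot = null;
    }
}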
8.4.8 Exceptions
Plugins may throw exceptions on invalid input, constraint violations or unexpected events (e.g. out of
disk space). It is possible to throw such exceptions almost everywhere with the notable exception of
PluginTransactionListener.transactionCompleted().
Plugins can make use of the functionality of other plugins. For example, the Lucene-based full-text search plugin
can use the rank values provided by the RDF Rank plugin to facilitate query result scoring and ordering. This is
not a matter of reusing program code (e.g., in a .jar with common classes), but rather of reusing data. The
mechanism to do this allows plugins to obtain references to other plugin objects by knowing their names.
To achieve this, they only need to implement the PluginDependency interface:
/**
* Interface that should be implemented by plugins that depend on other plugins and want to be able to
* retrieve references to them at runtime.
*/
public interface PluginDependency {
/**
* Method used by the plugin framework to inject a {@link PluginLocator} instance in the plugin.
*
* @param locator a {@link PluginLocator} instance
*/
void setLocator(PluginLocator locator);
}
An instance of the PluginLocator interface is then injected into them (during the configuration phase), and it
does the actual plugin discovery for them:
/**
* Interface that supports obtaining of a plugin instance by plugin name. An object implementing this
* interface is injected into plugins that implement the {@link PluginDependency} interface.
*/
public interface PluginLocator {
/**
* Retrieves a {@link RDFRankProvider} instance.
*
* @return a {@link RDFRankProvider} instance or null if no {@link RDFRankProvider} is available
*/
RDFRankProvider locateRDFRankProvider();
}
Having a reference to another plugin is all that is needed to call its methods directly and make use of its services.
An important interface related to accessing other plugins is the RDFRankProvider interface. The sole
implementation is the RDF Rank plugin, but it can easily be replaced by another implementation. By having a
dedicated interface, it is easy for plugins to get access to RDF ranks without relying on a specific implementation.
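For example, a hypothetical plugin that wants to use RDF ranks could store the injected locator and look up the
provider lazily (imports omitted; only the documented setLocator() and locateRDFRankProvider() calls are used):
public class RankAwarePlugin implements PluginDependency {
    private PluginLocator locator;

    @Override
    public void setLocator(PluginLocator locator) {
        // Called by the plugin framework during the configuration phase
        this.locator = locator;
    }

    // Hypothetical helper used from the plugin's own query evaluation code
    private RDFRankProvider rankProvider() {
        // May return null, e.g., if the RDF Rank plugin is not available
        return locator == null ? null : locator.locateRDFRankProvider();
    }
}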
Basics
ParallelPlugin
Marks a plugin as aware of parallel processing. The plugin will be injected with an instance of
PluginExecutorService via setExecutorService(PluginExecutorService executorService).
PluginExecutorService is a simplified version of Java’s ExecutorService and provides an easy
mechanism for plugins to schedule parallel tasks safely.
No open-source plugins use ParallelPlugin.
StatelessPlugin
Marks a plugin as stateless. Stateless plugins do not contribute to the repository fingerprint and their
fingerprint will not be queried.
It is suitable for plugins whose state is not important for query results or update execution, e.g., plugins
that are not typically used in the normal data flow.
Open-source plugins using StatelessPlugin:
• Autocomplete
• Notifications logger
On initialize() and shutdown() plugins receive an enum value, InitReason and ShutdownReason respectively,
describing the reason why the plugin is being initialized or shut down.
InitReason
• DEFAULT: initialized as part of the repository initialization, or the plugin was enabled;
• CREATED_BACKUP: initialized after a shutdown for backup.
ShutdownReason
• DEFAULT: shut down as part of the repository shutdown, or the plugin was disabled;
• CREATE_BACKUP: shut down before a backup;
• RESTORE_FROM_BACKUP: shut down before a restore.
Plugins may use the reason to handle their own backup scenarios. In most cases it is unnecessary since the plugin’s
files will be backed up or restored together with the rest of the repository data.
Data structures
The pattern interpretation handlers handle the evaluation of triple patterns. Each triple pattern will be sent to
plugins that implement the respective interface.
For more information, see Pattern interpretation.
PatternInterpreter
Interprets a simple triple pattern, where the subject, predicate, object and context are single values.
This interface handles all triple patterns: subject predicate object context.
Open-source plugins using PatternInterpreter:
• Autocomplete
• GeoSPARQL
• Geospatial
• Lucene FTS
• MongoDB
• RDF Rank
ListPatternInterpreter
Interprets a triple pattern, where the subject, predicate and context are single values while the object is a
list of values.
This interface handles triple patterns of this form: subject predicate (object1 object2 ...) context.
Open-source plugins using ListPatternInterpreter:
• Geospatial
SubjectListPatternInterpreter
Interprets a triple pattern, where the predicate, object and context are single values while the subject is a
list of values.
This interface handles triple patterns of this form: (subject1 subject2 ...) predicate object
context.
Request A basic read request. Passed to Preprocessor.preprocess(). Provides access to the isIncludeInferred
property.
QueryRequest An extension of Request for SPARQL queries. It provides access to the various constituents of the
query such as the FROM clauses and the parsed query.
StatementsRequest An extension of Request for RepositoryConnection.getStatements(). It provides access
to each of the individual constituents of the request quadruple (subject, predicate, object, and context).
RequestContext Plugins may create an instance of this interface in Preprocessor.preprocess() to keep track
of request-global data. The instance will be passed to PatternInterpreter, ListPatternInterpreter,
SubjectListPatternInterpreter, and Postprocessor.
RequestContextImpl A default implementation of RequestContext that provides a way to keep arbitrary values
by key.
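A hypothetical Preprocessor could use the default implementation to stash request-global data (imports omitted; a
no-argument RequestContextImpl constructor and a setAttribute() counterpart of the getAttribute() call used by
the ExamplePlugin below are assumed):
public class ContextCreatingPlugin implements Preprocessor {
    @Override
    public RequestContext preprocess(Request request) {
        if (request instanceof QueryRequest) {
            RequestContextImpl context = new RequestContextImpl();
            // Any value stored here is visible to this plugin for the whole request
            context.setAttribute("startedAt", System.currentTimeMillis());
            return context;
        }
        // Not interested in other request types
        return null;
    }
}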
The update request handlers are responsible for processing updates. Unlike the query request handlers, the update
handlers will be called only for statements that match a predefined pattern.
For more information, see Update processing.
UpdateInterpreter
Handles the addition or removal of statements. Only statements that have one of a set of predefined
predicates will be passed to the handler.
The return value determines if the statement will be added or deleted as real data (in the repository) or
processed only by the plugin.
Note that this handler will not be called for each individual statement when removing all statements from
all contexts.
Open-source plugins using UpdateInterpreter:
• Autocomplete
• GeoSPARQL
• Geospatial
• Lucene FTS
• MongoDB
• Notifications logger
• RDF Rank
ClearInterpreter
Notified before and after the removal of all statements from a specific context or from all contexts.
Notification listeners
In general the listeners are used as simple notifications about a certain event, such as the beginning of a new
transaction or the creation of a new entity.
EntityListener Notified about the creation of a new data entity (IRI, blank node, or literal).
Open-source plugins using EntityListener:
• Autocomplete
StatementListener Notified about the addition or removal of statements.
Open-source plugins using StatementListener:
• GeoSPARQL
• Notifications logger
PluginTransactionListener and ParallelTransactionListener
Notifications about the different stages of a transaction (started, followed by either commit + completed or
aborted).
Plugins should do the bulk of their transaction work within the commit stage.
Plugin dependencies
Health checks
The health check classes can be used to include a plugin in the repository health check.
HealthCheckable Marks a component (a plugin or part of a plugin) as able to provide health checks. If a plugin
implements this interface it will be included in the repository health check.
HealthResult The result from a health check. In general health results can be green (everything ok), yellow
(needs attention) or red (something broken).
CompositeHealthResult A composite health result that aggregates several HealthResult instances into a single
HealthResult.
Exceptions
With the graphdb.extra.plugins property, you can attach a directory with external plugins when starting
GraphDB. It is set in the following way:
graphdb -Dgraphdb.extra.plugins=path/to/directory/with/external/plugins
If the property is omitted when starting GraphDB, then you need to load external plugins by placing them in the
dist/lib/plugins directory and then restarting GraphDB.
Tip: This property is useful in situations when, for example, GraphDB is used in an environment such as
Kubernetes, where the database cannot be restarted and the dist folder cannot be persisted.
A project containing two example plugins, ExampleBasicPlugin and ExamplePlugin, can be found here.
ExampleBasicPlugin
In this basic implementation, the plugin name is defined, and during initialization a single system-scope predicate
is registered.
The next step is to implement the first of the plugin’s requirements – the pattern interpretation part:
// Create the date/time literal. Here it is important to create the literal in the entities instance of the
// request and NOT in getEntities(). If you create it in the entities instance returned by getEntities() it
// will not be visible in the current request.
long literalId = createDateTimeLiteral(pluginConnection.getEntities());
@Override
public double estimate(long subject, long predicate, long object, long context,
                       PluginConnection pluginConnection, RequestContext requestContext) {
    // We always return a single statement so we return a constant 1. This value will be used by the
    // query optimizer when planning the query execution.
    return 1;
}
The interpret() method only processes patterns with a predicate matching the desired predicate identifier. Further
on, it simply creates a new date/time literal (in the request scope) and places its identifier in the object position of
the returned single result. The estimate() method always returns 1, because this is the exact size of the result set.
ExamplePlugin
// Put the entities in the entity pool using the SYSTEM scope
In this implementation, the plugin name is defined, and during initialization three system-scope predicates are
registered.
To implement the first functional requirement, the plugin must inspect the query and detect the FROM clause in the
preprocessing phase. Then, the plugin must hook into the postprocessing phase where, if the preprocessing phase
detected the desired FROM clause, it deletes all query results (in postprocess()) and returns a single result (in
flush()) containing the binding set specified by the requirements. Since this happens as part of pre- and
postprocessing, we can pass the literals without going through the entity pool and without using integer IDs.
To do this the plugin must implement Preprocessor and Postprocessor:
// Check if the predicate is included in the default graph. This means that we have a "FROM <our_predicate>"
// Prepare a binding set with all projected variables set to the date/time literal value
MapBindingSet result = new MapBindingSet();
for (String bindingName : queryRequest.getTupleExpr().getBindingNames()) {
result.addBinding(bindingName, literal);
}
// Create a Context object which will be available during the other phases of the request processing
return context;
}
}
// If we are not interested in the request there is no need to create a Context.
return null;
}
@Override
public BindingSet postprocess(BindingSet bindingSet, RequestContext requestContext) {
// Filter all results. Returning null will remove the binding set from the returned query result.
// We will add the result we want in the flush() phase.
return null;
}
@Override
public Iterator<BindingSet> flush(RequestContext requestContext) {
// Get the BindingSet we created in the Preprocess phase and return it.
// This will be returned as the query result.
BindingSet result = (BindingSet) ((RequestContextImpl) requestContext).getAttribute("bindings");
return new SingletonIterator<>(result);
}
return SimpleValueFactory.getInstance().createLiteral(calendar.getTime());
}
}
The plugin creates an instance of RequestContext using the default implementation RequestContextImpl. It
can hold attributes of any type referenced by a name. Then the plugin creates a BindingSet with the date/time
literal, bound to every variable name in the query projection, and sets it as an attribute with the name “bindings”.
The postprocess() method filters out all results if the requestContext is non-null (i.e., if the FROM clause was
detected by preprocess()). Finally, flush() returns a singleton iterator containing the desired binding set when
required, or returns nothing otherwise.
To implement the second functional requirement that allows setting an offset in the future or the past, the plugin
must react to specific update statements. This is achieved via implementing UpdateInterpreter:
@Override
public boolean interpretUpdate(long subject, long predicate, long object, long context, boolean isAddition,
                               boolean isExplicit, PluginConnection pluginConnection) {
if (predicate == goFutureID) {
timeOffsetHrs += step;
} else if (predicate == goPastID) {
timeOffsetHrs -= step;
}
// Tell the PluginManager that we can not interpret the tuple so further processing can continue.
return false;
}
}
UpdateInterpreter must specify the predicates the plugin is interested in via getPredicatesToListenFor().
Then, whenever a statement with one of those predicates is inserted or removed, the plugin framework calls
interpretUpdate(). The plugin then checks if the subject value is http://example.com/time and, if so, handles the
update and returns true to the plugin framework to signal that the plugin has processed the update and it need not
be inserted as regular data.
Part of GraphDB’s Maven repository is open and allows downloading GraphDB Maven artifacts without
credentials.
Note: You still need to obtain a license from our Sales team, as the artifacts do not provide one.
To browse and search GraphDB’s public Maven repository, use our Nexus.
For the Gradle build script:
repositories {
maven {
url "https://maven.ontotext.com/repository/owlim-releases"
}
}
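And for the Maven pom.xml: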
<repositories>
<repository>
<id>ontotext-public</id>
<url>https://maven.ontotext.com/repository/owlim-releases</url>
</repository>
</repositories>
8.5.2 Distribution
To use the distribution for some automation or to run integration tests in embedded Tomcat, get the .zip artifacts
with the following snippet:
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<version>3.3.0</version>
<executions>
<execution>
<id>copy</id>
<phase>package</phase>
<goals>
<goal>copy</goal>
</goals>
<configuration>
<artifactItems>
<artifactItem>
<groupId>com.ontotext.graphdb</groupId>
<artifactId>graphdb</artifactId>
<version>${graphdb.version}</version>
<type>zip</type>
<classifier>dist</classifier>
<outputDirectory>target/graphdb-dist</outputDirectory>
</artifactItem>
</artifactItems>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
To embed the database in your application or develop a plugin, you need the GraphDB runtime .jar. Here are the
details for the runtime .jar artifact:
<dependency>
<groupId>com.ontotext.graphdb</groupId>
<artifactId>graphdb-runtime</artifactId>
<version>${graphdb.version}</version>
<!-- Temporary workaround for missing Ontop dependencies for Ontotext build of Ontop -->
<exclusions>
<exclusion>
<groupId>it.unibz.inf.ontop</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
</dependency>
The com.ontotext.graphdb:graphdb-runtime artifact is also available from the Maven Central Repository.
The GraphDB Client API is an extension of RDF4J’s HTTP repository and provides some GraphDB extensions
and smart GraphDB cluster support. Here are the details for the .jar artifact:
<dependency>
<groupId>com.ontotext.graphdb</groupId>
<artifactId>graphdb-client-api</artifactId>
<version>${graphdb.version}</version>
</dependency>
The com.ontotext.graphdb:graphdb-client-api artifact is also available from the Maven Central Repository.
NINE
PERFORMANCE OPTIMIZATIONS
The best performance is typically measured by the shortest load time and the fastest query answering. The main
factors that affect GraphDB performance are:
• Configuring GraphDB Memory
• Data Loading & Query Optimizations
– Dataset loading
– GraphDB’s optional indexes
– Cache/index monitoring and optimizations
– Query optimizations
• Explain Plan
• Inference Optimizations
– Delete optimizations
– Rules optimizations
– Optimization of owl:sameAs
– RDFS and OWL support optimizations
The life cycle of a repository instance typically starts with the initial loading of datasets, followed by the processing
of queries and updates. The loading of a large dataset can take a long time: up to 12 hours for one billion statements
with inference. Therefore, during loading, it is often helpful to use a different configuration than the one used for
normal operation.
Furthermore, if you frequently reload a certain dataset because it gradually changes over time, the loading
configuration can evolve as you become more familiar with how GraphDB behaves with this dataset. Many dataset
properties only become apparent after the initial load (such as the number of unique entities), and this information
can be used to optimize the loading step for the next round or to improve the configuration for normal operation.
Normal operation
The size of the data structures used to index entities is directly related to the number of unique entities in the loaded
dataset. These data structures are always kept in memory. In order to get an upper bound on the number of unique
entities loaded and to find the actual amount of RAM used to index them, it is useful to know the contents of the
storage folder.
The total amount of memory needed to index entities is equal to the sum of the sizes of the files entities.index
and entities.hash. This value can be used to determine how much memory is used and therefore how to divide
the remaining memory between the cache memory, etc.
An upper bound on the number of unique entities is given by the size of entities.hash divided by 12 (memory
is allocated in pages and therefore the last page will likely not be full). For example, an entities.hash file of
1.2 GB corresponds to at most about 100 million unique entities.
The entities.index file is used to look up entries in the file entities.hash, and its size is equal to the value
of the entity-index-size parameter multiplied by 4. Therefore, the entity-index-size parameter has less to
do with efficient use of memory and more with the performance of entity indexing and lookup. The larger this
value, the fewer collisions occur in the entities.hash table. A reasonable size for this parameter is at least half the
number of unique entities. However, the size of this data structure is never changed once the repository is created,
so this knowledge can only be used to adjust this value for the next clean load of the dataset with a new (empty)
repository.
The following parameters can be adjusted:
Parameter Description
entity-index-size Set to a large enough value (see more).
Furthermore, the inference semantics can be adjusted by choosing a different ruleset. However, this will require a
reload of the whole repository; otherwise, the existing inferred statements may not match the new ruleset.
Note: The optional indexes can be built at a later point when the repository is used for query answering. You
need to experiment using typical query patterns from the user environment.
Predicate lists
Predicate lists are two indexes (SP and OP) that can improve performance in the following situations:
• When loading/querying datasets that have a large number of predicates;
• When executing queries or retrieving statements that use a wildcard in the predicate position, e.g., the
statement pattern: dbpedia:Human ?predicate dbpedia:Land.
As a rough guideline, a dataset with more than about 1,000 predicates will benefit from using these indexes for
both loading and query answering. Predicate list indexes are not enabled by default, but can be switched on using
the enablePredicateList configuration parameter.
Context index
To provide better performance when executing queries that use contexts, you can use the context index CPSO. It is
enabled by using the enable-context-index configuration parameter.
Statistics are kept for the main index data structures, and include information such as cache hits/misses, file
reads/writes, etc. This information can be used to fine-tune the GraphDB memory configuration, and can be useful
for ‘debugging’ certain situations, such as understanding why load performance changes over time or with
particular datasets.
For each index, there will be a CollectionStatistics MBean published, which shows the cache and file I/O
values updated in real time:
Package com.ontotext
MBean name CollectionStatistics
Attribute Description
CacheHits The number of operations completed without accessing the storage system.
CacheMisses The number of operations completed, which needed to access the storage system.
FlushInvocations
FlushReadItems
FlushReadTimeAverage
FlushReadTimeTotal
FlushWriteItems
FlushWriteTimeAverage
FlushWriteTimeTotal
PageDiscards The number of times a non-dirty page’s memory was reused to read in another
page.
PageSwaps The number of times a page was written to the disk, so its memory could be used
to load another page.
Reads The total number of times an index was searched for a statement or a range of
statements.
Writes The total number of times a statement was added to a collection.
Operation Description
resetCounters Resets all the counters for this index.
Ideally, the system should be configured to keep the number of cache misses to a minimum. If the ratio of hits to
misses is low, consider increasing the memory available to the index (if other factors permit this).
Page swaps tend to occur much more often during large scale data loading. Page discards occur more frequently
during query evaluation.
GraphDB uses a number of query optimization techniques by default. They can be disabled by setting the
enable-optimization configuration parameter to false; however, there is rarely any need to do this. See GraphDB’s
Explain Plan for a way to view query plans and applied optimizations.
This optimization applies when the repository contains a large number of literals with language tags, and it is
necessary to execute queries that filter based on language, e.g., using the following SPARQL query construct:
FILTER ( langMatches(lang(?name), "es") )
In this situation, the in-memory-literal-properties configuration parameter can be set to true, causing the
data values with language tags to be cached.
During query answering, all URIs from each equivalence class produced by the sameAs optimization are
enumerated. You can use the onto:disable-sameAs pseudo-graph (see Other special query behavior) to significantly
reduce these duplicate results by returning a single representative from each equivalence class.
Consider these example queries executed against the FactForge combined dataset. Here, the default is to enumerate:
dbpedia:Air_strip
http://sw.cyc.com/concept/Mx4ruQS1AL_QQdeZXf-MIWWdng
umbel-sc:CommercialAirport
opencyc:Mx4ruQS1AL_QQdeZXf-MIWWdng
dbpedia:Jetport
dbpedia:Airstrips
dbpedia:Airport
fb:guid.9202a8c04000641f800000000004ae12
opencyc-en:CommercialAirport
With the onto:disable-sameAs pseudo-graph, only a single representative from each equivalence class is returned,
e.g.:
dbpedia:Air_strip
opencyc-en:CommercialAirport
The Expand results over equivalent URIs checkbox in the GraphDB Workbench SPARQL editor plays a similar
role, but the meaning is reversed.
Warning: If the query uses a filter over the textual representation of a URI, e.g., filter(strstarts(str(?x),
"http://dbpedia.org/ontology")), this may omit some valid solutions, as not all URIs within an equivalence
class are matched against the filter.
In some cases, database indexes get fragmented over time and with the accumulation of updates. This may lead to
a slowdown in data import.
Index compacting is a useful method to tackle this. To enable it, run:
INSERT DATA {
[] <http://www.ontotext.com/compactIndexes> [] .
}
This will:
1. Shut down the repository internally.
2. Scan the indexes.
3. Rebuild them.
4. Reinitialize the repository.
GraphDB’s Explain Plan is a feature that explains how GraphDB executes a SPARQL query. It also includes
information about unique subject, predicate and object collection sizes. It can help you improve your query, leading
to better execution performance.
For the simplest query explain plan possible (?s ?p ?o), execute the following query:
select *
from onto:explain
{
    ?s ?p ?o
}
Depending on the number of triples that you have in the database, the results will vary, but you will get something
like the following:
This is the same query, but with some estimations next to the statement pattern (1 in this case).
Note: The explained query might not be the same as the original one. The triple patterns below are shown in the
order in which they are executed internally.
• ----- Begin optimization group 1 -----: indicates starting a group of statements, which most probably
are part of a subquery (in the case of property paths, the group will be the whole path);
• Collection size: an estimation of the number of statements that match the pattern;
• Predicate collection size: the number of statements in the database for this particular predicate (in this
case, for all predicates);
• Unique subjects: the number of subjects that match the statement pattern;
• Unique objects: the number of objects that match the statement pattern;
• Current complexity: the complexity (the number of atomic lookups in the index) the database will need to
make so far in the optimization group (most of the time a subquery). When you have multiple triple patterns,
these numbers grow fast.
• ----- End optimization group 1 -----: the end of the optimization group;
• ESTIMATED NUMBER OF ITERATIONS: the approximate number of iterations that will be executed for this
group.
Note: The result of the explain plan is given in the exact order in which the engine will execute the query.
The following is an example where the engine reorders the triple patterns based on their complexity. The query is
a simple join:
select *
from onto:explain
{
?o rdf:type ?o1 .
?o rdfs:subPropertyOf ?o2
}
All of the following examples are based on this simple dataset describing five fictitious wines. The file is quite
small and contains the following data:
• There are different types of wine (Red, White, Rose).
• Each wine has a label.
• Wines are made from different types of grapes.
• Wines contain different levels of sugar.
• Wines are produced in a specific year.
A typical aggregation query contains a group with some aggregation function. Here, we have added an explain
graph.
This query retrieves the number of wines produced in each year along with the year.
When you execute the query in GraphDB, you get the following as an output (instead of the real results):
This aggregation query applies a filter to the result set after grouping via the HAVING clause. It retrieves red wines
made from more than one type of grape along with their grapes count.
This is a typical SPARQL query with filter function. It retrieves the wines that are made from Pinot Noir grape.
GraphDB’s inference policy is based on materialization, where implicit statements are inferred from explicit
statements as soon as they are inserted into the repository, using the specified semantics (ruleset). This approach
has the advantage of very fast query answering, since no inference needs to be done at query time.
However, no justification information is stored for inferred statements, therefore deleting a statement normally
requires a full recomputation of all inferred statements. This can take a very long time for large datasets.
GraphDB uses a special technique for handling the deletion of explicit statements and their inferences, called
smooth delete. It allows fast delete operations as well as ensures that schemas can be changed when necessary.
The algorithm
The algorithm for identifying and removing the inferred statements that can no longer be derived by the explicit
statements that have been deleted, is as follows:
1. Use forward chaining to determine what statements can be inferred from the statements marked for deletion.
2. Use backward chaining to see if these statements are still supported by other means.
3. Delete explicit statements and the no longer supported inferred statements.
Note: We recommend that you mark schema statements as read-only (axioms). Otherwise, as almost all delete
operations follow inference paths that touch schema statements, which then lead to almost all other statements in
the repository, the smooth delete can take a very long time. Since a read-only statement cannot be deleted, there is
no reason to find what statements are inferred from it (such inferred statements might still get deleted, but they will
be found by following other inference paths).
Statements are marked as read-only if they occur in the Axioms section of the ruleset files (standard or custom)
or are loaded at initialization time via the imports configuration parameter.
Note: When using smooth delete, we recommend that you load all ontology/schema/vocabulary statements using
the imports configuration parameter.
Example
Schema:
<foaf:name> <rdfs:domain> <owl:Thing> .
<MyClass> <rdfs:subClassOf> <owl:Thing> .
Data:
<wayne_rooney> <foaf:name> "Wayne Rooney" .
<Reviewer40476> <rdf:type> <MyClass> .
<Reviewer40478> <rdf:type> <MyClass> .
<Reviewer40480> <rdf:type> <MyClass> .
<Reviewer40481> <rdf:type> <MyClass> .
rdfs2:
x a y - (x=<wayne_rooney>, a=foaf:name, y="Wayne Rooney")
a rdfs:domain z (a=foaf:name, z=owl:Thing)
-----------------------
x rdf:type z - The inferred statement [<wayne_rooney> rdf:type owl:Thing] is to be removed.
rdfs3:
x a u - (x=<wayne_rooney>, a=rdf:type, u=owl:Thing)
a rdfs:range z (a=rdf:type, z=rdfs:Class)
-----------------------
u rdf:type z - The inferred statement [owl:Thing rdf:type rdfs:Class] is to be removed.
rdfs8_10:
x rdf:type rdfs:Class - (x=owl:Thing)
-----------------------
x rdfs:subClassOf x - The inferred statement [owl:Thing rdfs:subClassOf owl:Thing] is to be removed.
proton_TransitiveOver:
y q z - (y=owl:Thing, q=rdfs:subClassOf, z=owl:Thing)
p protons:transitiveOver q - (p=rdf:type, q=rdfs:subClassOf)
x p y - (x=[<Reviewer40476>, <Reviewer40478>, <Reviewer40480>, <Reviewer40481>], p=rdf:type, y=owl:Thing)
-----------------------
x p z - The inferred statements [<Reviewer40476> rdf:type owl:Thing], etc., are to be removed.
Statements such as [<Reviewer40476> rdf:type owl:Thing] exist because of the statements [<Reviewer40476>
rdf:type <MyClass>] and [<MyClass> rdfs:subClassOf owl:Thing].
In large datasets, there are typically millions of statements [X rdf:type owl:Thing], and they are all visited by
the algorithm.
The [X rdf:type owl:Thing] statements are not the only problematic statements considered for removal. Every
class that has millions of instances leads to similar behavior.
One check to see if a statement is still supported requires about 30 query evaluations with OWL-Horst, hence
the slow removal.
If [owl:Thing rdf:type owl:Class] is marked as an axiom (because it is derived from statements in the schema,
which must be axioms), then the process stops when reaching this statement. So, the schema (the system
statements) must necessarily be imported through the imports configuration parameter in order to mark the
schema statements as axioms.
Schema transactions
As mentioned above, ontologies and schemas imported at initialization time using the imports configuration
parameter are flagged as read-only. However, there are times when it is necessary to change a schema. This can
be done inside a ‘system transaction’.
The user instructs GraphDB that the transaction is a system transaction by including a dummy statement with the
special schemaTransaction predicate, i.e.:
_:b1 <http://www.ontotext.com/owlim/system#schemaTransaction> _:b2
This statement is not inserted into the database, but rather serves as a flag telling GraphDB that the statements
from this transaction are going to be inserted as read-only; all statements derived from them are also marked as
read-only. When you delete statements in a system transaction, you can remove statements marked as read-only,
as well as statements derived from them. Axiom statements and all statements derived from them stay untouched.
GraphDB includes a useful rule profiling feature that allows you to profile and debug rule performance.
Warning: Rule profiling slows down the rule execution (the leading premise checking part) by 10-30%, so
do not use it in production.
Log file
ptop:PropRestr is a conjunction of two properties. It is declared with the axiomatic (ABox) triples involving t.
Whenever the premise p and the restriction r hold between two resources, the rule infers the conclusion q between
the same resources, i.e., p & r => q.
The corresponding log for variant 4 of this rule may look like the following:
RULE ptop_PropRestr_4 invoked 163,475,763 times.
ptop_PropRestr_4:
e b f
a ptop_premise b
a rdf_type ptop_PropRestr
e c f
a ptop_restriction c
a ptop_conclusion d
------------------------------------
e d f
Note: Variable names are renamed due to the compilation to Java bytecode.
• Invoked is the number of times the rule variant or specific premise was checked successfully. Tracing
through the rule:
– ptop_PropRestr_4 checked successfully 163 million times: for each incoming triple, since the lead
premise (e b f = x p y) is a free pattern.
– a ptop_premise b checked successfully 10 million times: for each b=p that has an axiomatic triple
involving ptop_premise.
This premise was selected because it has only 1 unbound variable a and it is first in the rule text.
– a rdf_type ptop_PropRestr checked successfully 7 million times: for each ptop_premise that has
type ptop_PropRestr.
This premise was selected because it has 0 unbound variables (after the previous premise binds a).
• The time to check each premise is printed in ns.
• Fired is the number of times all premises matched, so the rule variant was fired.
• Inferred is the number of inferred triples.
• Time overall is the total time that this rule variant took.
Excel format
The log records detailed information about each rule and premise, which is very useful when you are trying to
understand which of the rules is too timeconsuming. However, it can still be overwhelming because of this level
of detail.
Therefore, we have developed the rule-stats.pl script that outputs a TSV file such as the following:
rule ver tried time patts checks time fired time triples speed
ptop_PropChain 4 163475763 776.3 5 117177482 185.3 15547176 590.9 9707142 12505
Parameters:
Parameter Description
rule the rule ID (name)
ver the rule version (variant) or “T” for overall rule totals
tried, time the number of times the rule/variant was tried, the overall time in sec
patts the number of triple patterns (premises) in the rule, not counting the leading premise
checks, time the number of times premises were checked, time in sec
fired the number of times all premises matched, so the rule was fired
triples the number of inferred triples
speed inference speed, triples/sec
Investigating performance
The following is an example of using the Excel format to investigate where time is spent during rule execution.
Download the time-spent-during-rule.xlsx example file, and use it as a template.
Note: These formulas are dynamic, and they are updated every time you change the filters.
7. Focus on a variant to investigate the reasons for its poorer time and speed performance.
In this example, the first variant you would want to investigate will be ptop_PropRestr_5, as it is spending 30%
of the time of this rule, and has very low speed. The reason is that it fired 1.4 million times but produced only 238
triples, so most of the inferred triples were duplicates.
You can find the definition of this variant in the log file. It is very similar to the productive variant
ptop_PropRestr_4 (see Log file above):
• one checks e b f. a ptop_premise b first,
• the other checks e c f. a ptop_restriction c first.
Still, the function of these premises in the rule is the same and therefore the variant ptop_PropRestr_5 (which is
checked after 4) is unproductive.
The most likely way to improve performance would be if you make the two premises use the same axiomatic triple
ptop:premise (emphasizing they have the same role), and introduce a Cut:
Id: ptop_PropRestr_SYM
t <ptop:premise> p
t <ptop:premise> r
t <ptop:conclusion> q
t <rdf:type> <ptop:PropRestr>
x p y
x r y [Cut]
----------------
x q y
The Cut eliminates the rule variant with x r y as leading premise. It is legitimate to do this, since the two variants
are the same, up to substitution p<->r.
Note: Introducing a Cut in the original version of the rule would not be legitimate:
Id: ptop_PropRestr_CUT
t <ptop:premise> p
t <ptop:restriction> r
t <ptop:conclusion> q
t <rdf:type> <ptop:PropRestr>
x p y
x r y [Cut]
----------------
x q y
since it would omit some potential inferences (in the case above, 238 triples), changing the semantics of the rule
(see the example below).
:t_CUT a ptop:PropRestr; ptop:premise :p; ptop:restriction :r; ptop:conclusion :q. # for ptop_PropRestr_CUT
:t_SYM a ptop:PropRestr; ptop:premise :p; ptop:premise :r; ptop:conclusion :q. # for ptop_PropRestr_SYM
• ptop_PropRestr_SYM will infer :x :q :y, since the second incoming triple :x :r :y will match x p y and
t ptop:premise :r, then the previously inserted :x :p :y will match t ptop:premise :p and the rule will
fire.
Tip: Rule execution is often non-intuitive, so we recommend that you keep a detailed history of loading speed and
compare the performance after each change.
The complexity of the ruleset has a significant effect on the loading performance, the number of inferred statements,
and the overall size of the repository after inferencing. The complexity of the standard rulesets increases as follows:
• no inference (lowest complexity, best performance)
• RDFS-Optimized
• RDFS
• RDFS-Plus-Optimized
• RDFS-Plus
• OWL-Horst-Optimized
• OWL-Horst
• OWL-Max-Optimized
• OWL-Max
• OWL2-QL-Optimized
• OWL2-QL
• OWL2-RL-Optimized
• OWL2-RL (highest complexity, worst performance)
It needs to be mentioned that OWL2-RL and OWL2-QL do a lot of heavy work that is often not required by
applications. For more details, see OWL Compliance.
Check the expansion ratio (total/explicit statements) for your dataset in order to get an idea of whether this is the
result that you are expecting. For example, if your ruleset infers 4 times more statements over a large number of
explicit statements, this will take time regardless of the ways in which you try to optimize the rules.
The number of rules and their complexity affects inferencing performance, even for rules that never infer any new
statements. The reason for this is that every incoming statement is passed through every variant of every rule to
check whether something can be inferred. This often results in many checks and joins, even if the rule never fires.
So, start with a minimal ruleset and only add the rules that you need. The default ruleset (RDFS-Plus-Optimized)
works for many users, but you might even consider starting from RDFS. For example, if you need
owl:SymmetricProperty and owl:inverseOf on top of RDFS, you can copy only these rules from OWL-Horst to
RDFS and leave out the rest.
Conversely, you can start with a bigger standard ruleset and remove the rules that you do not need.
Note: To deploy a custom ruleset, set the ruleset configuration parameter to the full pathname of your custom
.pie file.
• Be careful with the recursive rules as they can lead to an explosion in the number of inferred statements.
• Always check your spelling:
– A misspelled variable in a premise leads to a Cartesian explosion of the number of triple joins to be
considered by the rule (the numbers quickly grow to an intractable level).
– A misspelled variable in a conclusion (or the use of an unbound variable) leads to the creation of new
blank nodes. This is almost never what you really want.
• Order premises by specificity. GraphDB first checks premises with the least number of unbound variables.
But if there is a tie, it follows the order given by you. Since you may know the cardinalities of triples in your
data, you may be in a better position to determine which premise has better specificity (selectivity).
• Use Cut for premises that have the same role (for an example, see Investigating performance), but be careful
not to remove any necessary inferences by mistake.
Avoid inserting explicit statements in a named graph if the same statements are inferable. GraphDB always stores
inferred statements in the default graph, so this will lead to duplicating statements. This will increase the repository
size and slow down query answering.
You can eliminate duplicates from query results using DISTINCT or FROM onto:skip-redundant-implicit (see
Other special GraphDB query behavior). However, these are slow operations, so it is better not to produce dupli
cate statements in the first place.
Because the DC ontology declares dcterms:created as a subproperty of more general date properties, every
dcterms:created statement will expand to 3 statements. So, do not load the DC ontology unless you really need
these inferred dc:date statements.
Inverse properties (e.g., :p owl:inverseOf :q) offer some convenience in querying, but are never necessary:
• SPARQL natively has bidirectional data access: instead of ?x :q ?y, you can always query for ?y :p ?x.
• You can even invert the direction in a property path: instead of ?x :p1/:q ?y, use ?x :p1/(^:p) ?y.
If an ontology defines inverses but you skip inverse reasoning, you have to check which of the two properties is
used in a particular dataset, and write your queries carefully.
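For example, assume a hypothetical ontology that declares :q owl:inverseOf :p but whose data contains only :p triples. Both queries below return the same bindings without any inverse reasoning (the prefix is illustrative):

PREFIX : <http://example.com/>
SELECT ?x ?y WHERE { ?y :p ?x }

PREFIX : <http://example.com/>
SELECT ?x ?y WHERE { ?x ^:p ?y }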
The Provenance Ontology (PROV-O) has considered this dilemma thoroughly, and has abstained from defining inverses to “avoid the need for OWL reasoning, additional code, and larger queries” (see http://www.w3.org/TR/prov-o/#inverse-names).
A chain of n transitive relations (e.g., rdfs:subClassOf) causes GraphDB to infer and store a further (n² − n)/2 statements; for example, a chain of 10 such relations yields 45 additional statements. If the relationship is also symmetric (e.g., in a family ontology with a predicate such as relatedTo), then there will be n² − n inferred statements.
Consider removing the transitivity and/or symmetry of relations that form long chains. Or, if you must have them, consider implementing the transitive property through a step property (see TransitiveUsingStep below), which can be faster than the standard implementation of owl:TransitiveProperty.
While OWL2 has very powerful class constructs, its property constructs are quite weak. Some widely used OWL2 property constructs can be implemented faster with custom rules.
See this draft for some ideas and clear illustrations. Below, we describe three of these ideas.
Tip: To learn more, see a detailed account of applying some of these ideas in a real-world setting. Here are the respective rule implementations.
PropChain
Id: ptop_PropChain
t <ptop:premise1> p1
t <ptop:premise2> p2
t <ptop:conclusion> q
t <rdf:type> <ptop:PropChain>
x p1 y
y p2 z
----------------
x q z
transitiveOver
psys:transitiveOver has been part of Ontotext’s PROTON ontology since 2008. It is defined as follows:
Id: psys_transitiveOver
p <psys:transitiveOver> q
x p y
y q z
---------------
x p z
It is a specialized PropChain, where premise1 and conclusion coincide. It allows you to chain p with q on the
right, yielding p. For example, the inferencing of types along the class hierarchy can be expressed as:
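rdf:type psys:transitiveOver rdfs:subClassOf

owl:TransitiveProperty is in turn a specialized psys:transitiveOver in which p and q coincide: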
Id: owl_TransitiveProperty
p <rdf:type> <owl:TransitiveProperty>
x p y
y p z
----------
x p z
Most transitive properties comprise transitive closure over a basic ‘step’ property. For example,
skos:broaderTransitive is based on skos:broader and is implemented as:
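skos:broaderTransitive a owl:TransitiveProperty .
skos:broader rdfs:subPropertyOf skos:broaderTransitive .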
Now consider a chain of N skos:broader between two nodes. The owl_TransitiveProperty rule has to consider
every split of the chain, thus inferring the same closure between the two nodes N times, leading to quadratic
inference complexity.
This can be optimized by looking for the step property s and extending the chain only at the right end:
Id: TransitiveUsingStep
p <rdf:type> <owl:TransitiveProperty>
s <rdfs:subPropertyOf> p
x p y
y s z
----------
x p z
However, this would not make the same inferences as owl_TransitiveProperty if someone inserts statements with the transitive property directly, bypassing the step property (which is a bad practice).
A more robust approach is to declare the step and transitive properties together using psys:transitiveOver, for
instance:
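skos:broaderTransitive psys:transitiveOver skos:broader .
skos:broader rdfs:subPropertyOf skos:broaderTransitive .

Two-element owl:propertyChainAxiom declarations can likewise be translated into ptop:PropChain instances with the following rule: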
Id: ptop_PropChain_from_propertyChainAxiom
q <owl:propertyChainAxiom> l1
l1 <rdf:first> p1
l1 <rdf:rest> l2
l2 <rdf:first> p2
l2 <rdf:rest> <rdf:nil>
----------------------
t <ptop:premise1> p1
t <ptop:premise2> p2
t <ptop:conclusion> q
t <rdf:type> <ptop:PropChain>
GraphDB applies special processing to the following rules so that inferred statements such as <P a rdf:Property>,
<P rdfs:subPropertyOf P>, and <X a rdfs:Resource> can appear in the repository without slowing down inference:
/*partialRDFS*/
Id: rdf1_rdfs4a_4b
x a y
-------------------------------
a <rdf:type> <rdf:Property>
x <rdf:type> <rdfs:Resource>
a <rdf:type> <rdfs:Resource>
y <rdf:type> <rdfs:Resource>
/*partialRDFS*/
Id: rdfs6
a <rdf:type> <rdf:Property>
-------------------------------
a <rdfs:subPropertyOf> a
According to these rules, whatever statement comes into the repository, its subject, predicate, and object are resources, and its predicate is an rdf:Property, which then becomes a subproperty of itself via the second rule (the reflexivity of subPropertyOf). These rules, however, if executed every time, present a similar challenge to the one posed by owl:sameAs. To avoid the performance drop, GraphDB asserts these statements through code, so that <P a rdf:Property> and <X a rdfs:Resource> are asserted only once – when a property or a resource is encountered for the first time (except in the ‘optimized’ rulesets, where rdfs:Resource is omitted because of the very limited use of such inference).
If we start with the empty ruleset, <P a rdf:Property>, <P rdfs:subPropertyOf P> and <X a rdfs:Resource>
statements will not be inferred until we switch the ruleset. Then the inference will take place for the new properties
and resources only.
Inversely, if we start with a non-empty ruleset and switch to the empty one, then the statements <P a rdf:Property>, <P rdfs:subPropertyOf P>, and <X a rdfs:Resource> inferred so far will remain. This is true even if we delete statements or recompute the inferred closure.
The owl:sameAs optimization uses the OWL owl:sameAs property to create an equivalence class between nodes of an RDF graph. An equivalence class has the following properties:
• Reflexivity, i.e., A = A
• Symmetry, i.e., if A = B, then B = A
• Transitivity, i.e., if A = B and B = C, then A = C
Instead of using simple rules and axioms for owl:sameAs (actually two axioms stating that it is symmetric and transitive), GraphDB offers an effective non-rule implementation, i.e., the owl:sameAs support is hardcoded. The rules are commented out in the .pie files and are left only as a reference.
In GraphDB, the equivalence class is represented with a single node, thus avoiding the explosion of all N^2
owl:sameAs statements, and instead storing the members of the equivalence class in a separate structure. In this
way, the ID of the equivalence class can be used as an ordinary node, which eliminates the need to copy statements
by subject, predicate and object. So, all these copies are replaced by a single statement.
There is no restriction on how to choose this single node that will represent the class as a whole, so we pick the
first node that enters the class. After creating such a class, all statements with nodes from this class are altered to
use the class representative. These statements also participate in the inference.
The equivalence classes may grow when more owl:sameAs statements containing nodes from the class are added
to the repository. Every time you add a new owl:sameAs statement linking two classes, they merge into a single
class.
During query evaluation, GraphDB uses a kind of backward chaining by enumerating equivalent URIs, thus guaranteeing the completeness of the inference and query results. It takes special care to ensure that this optimization
does not hinder the ability to distinguish between explicit and implicit statements.
When removing owl:sameAs statements from the repository, some nodes may remain detached from the class they
belong to, the class may split into two or more classes, or may disappear altogether. To determine the behavior of
the classes in each particular case, you should track what the original owl:sameAs statements were and which of
them remain in the repository. All statements coming from the user (either through a SPARQL query or through the
RDF4J API) are marked as explicit, and every statement derived from them during inference is marked as inferred.
So, by knowing which explicit owl:sameAs statements remain, you can rebuild the equivalence classes.
Note: It is not necessary to rebuild all the classes but only the ones that were referred to by the removed
owl:sameAs statements.
When nodes are removed from classes, or when classes split or disappear, the new classes (or the removal of
classes) yield new representatives. So, statements using the old representatives should be replaced with statements
using the new ones. This is also achieved by knowing which statements are explicit. The representative statements
(i.e., statements that use representative nodes) are flagged as a special type of statement that may cease to exist
after making changes to the equivalence classes. In order to make new representative statements, you should use
the explicit statements and the new state of the equivalence classes (e.g., it is not necessary to process all statements
when only a single equivalence class has been changed). The representative statements, although being volatile, are
visible to SPARQL queries and the inferencer, whereas the explicit statements that use nodes from the equivalence
classes remain invisible and are only used for rebuilding the representative statements.
By default, the owl:sameAs support is enabled in all rulesets except for Empty (without inference), RDFS, and RDFS-Plus. However, disabling the owl:sameAs behavior may be beneficial in some cases. For example, it can save time, or you may want to query or visualize your data without the statements generated by owl:sameAs and the statements inferred from them.
To disable owl:sameAs, use:
• (for individual queries) the FROM onto:disable-sameAs system graph;
• (for the whole repository) the disable-sameAs configuration parameter (Boolean, defaults to false). This disables all inference based on owl:sameAs.
Disabling owl:sameAs by query does not remove the inferences that have taken place because of owl:sameAs.
Consider the following example:
INSERT DATA {
<urn:A> owl:sameAs <urn:B> .
<urn:A> a <urn:Class1> .
<urn:B> a <urn:Class2> .
}
This leads to <urn:A> and <urn:B> being instances of the intersection of the two classes. If you define the intersection and query for its instances:
INSERT DATA {
test:Intersection owl:intersectionOf (<urn:Class1> <urn:Class2>) .
}
SELECT * {
?s a test:Intersection .
}
the response will be both <urn:A> and <urn:B>. Using FROM onto:disable-sameAs returns only the equivalence class representative (e.g., <urn:A>), but it does not disable the inference as a whole.
In contrast, when you set up a repository with the disable-sameAs repository parameter set to true, the inference <urn:A> a test:Intersection will not take place. Then, if you query which instances the intersection has, it will return neither <urn:A> nor <urn:B>.
Apart from this difference in scope, disabling owl:sameAs via the repository parameter and via the FROM clause in a query behaves in the same way.
See how to configure the Expand results over owl:sameAs setting from the Workbench here.
• disable-sameAs: true + inference – disables the owl:sameAs expansion but still shows the other implicit
statements. However, these results will be different from the ones retrieved by owl:sameAs + inference or
when there is no inference.
• FROM onto:disable-sameAs – including this clause in a query produces different results with different
rulesets.
• FROM onto:explicit – using only this clause (or with FROM onto:disable-sameAs) produces the same
results as when the inferencer is disabled (as with the empty ruleset). This means that the ruleset and the
disable-sameAs parameter do not affect the results.
• FROM onto:explicit + FROM onto:implicit – produces the same results as if both clauses are omitted.
• FROM onto:implicit – using this clause returns only the statements derived by the inferencer. Therefore,
with the empty ruleset, it is expected to produce no results.
• FROM onto:implicit + FROM onto:disable-sameAs – shows all inferred statements (except for the ones
generated by owl:sameAs).
The following examples illustrate this behavior:
Example 1
INSERT DATA {
test:a test:b test:c .
test:a owl:sameAs test:d .
test:d owl:sameAs test:e .
}
The result of querying all statements is the same as if you query for explicit statements when there is no inference, or if you add FROM onto:explicit.
However, if you enable inference, you will see a completely different picture. For example, if you use owl-horst-optimized with disable-sameAs=false, you will receive the following results:
Example 2
INSERT DATA {
_:b sys:addRuleset "owl-horst-optimized" .
_:b sys:defaultRuleset "owl-horst-optimized" .
}
INSERT DATA {
_:b sys:reinfer _:b .
}
:a :b :c .
:a owl:sameAs :a .
:a owl:sameAs :d .
:a owl:sameAs :e .
:d owl:sameAs :a .
:d owl:sameAs :d .
:d owl:sameAs :e .
:e owl:sameAs :a .
:e owl:sameAs :d .
:e owl:sameAs :e .
:d :b :c .
:e :b :c .
i.e., without the <P a rdf:Property> and <P rdfs:subPropertyOf P> statements.
Example 3
If you start with owl-horst-optimized and set the disable-sameAs parameter to true or use FROM onto:disable-sameAs, you will receive:
:a :b :c .
:a owl:sameAs :d .
:b a rdf:Property .
:b rdfs:subPropertyOf :b .
:d owl:sameAs :e .
Example 4
Querying only the implicit statements while owl:sameAs is disabled (e.g., combining FROM onto:implicit with FROM onto:disable-sameAs) yields:
test:b a rdf:Property .
test:b rdfs:subPropertyOf test:b .
because all owl:sameAs statements and the statements generated from them (<:d :b :c>, <:e :b :c>) will not be
shown.
Note: The same is achieved with the disable-sameAs repository parameter set to true. However, if you start with the empty ruleset and then switch to a non-empty ruleset, the latter query will not return any results. If you start with owl-horst-optimized and then switch to empty, the <P a rdf:Property> statements will persist, i.e., the latter query will return some results.
Example 5
INSERT DATA {
GRAPH test:graph {
test:a test:b test:c .
test:a owl:sameAs test:d .
test:d owl:sameAs test:e .
}
}
SELECT DISTINCT *
{
GRAPH ?g {
?s ?p ?o
FILTER (
?s IN (test:a, test:b, test:c, test:d, test:e, test:graph) ||
?p IN (test:a, test:b, test:c, test:d, test:e, test:graph) ||
?o IN (test:a, test:b, test:c, test:d, test:e, test:graph) ||
?g IN (test:a, test:b, test:c, test:d, test:e, test:graph)
)
}
}
you will receive only:

:graph {
:a :b :c .
:a owl:sameAs :d .
:d owl:sameAs :e .
}
because the statements from the default graph are not automatically included. This is the same as in the DESCRIBE
query, where using both FROM onto:explicit and FROM onto:implicit nullifies them.
So, if you want to see all the statements, you should write:
SELECT DISTINCT *
FROM NAMED onto:explicit
FROM NAMED onto:implicit
{
GRAPH ?g {
?s ?p ?o
FILTER (
?s IN (test:a, test:b, test:c, test:d, test:e, test:graph) ||
?p IN (test:a, test:b, test:c, test:d, test:e, test:graph) ||
?o IN (test:a, test:b, test:c, test:d, test:e, test:graph) ||
?g IN (test:a, test:b, test:c, test:d, test:e, test:graph)
)
}
}
ORDER BY ?g ?s
Note that when querying quads, you should use the FROM NAMED clause, and when querying triples, FROM. Using FROM NAMED with triples and FROM with quads has no effect, and the query will return the following:
:graph {
:a :b :c .
:a owl:sameAs :d .
:d owl:sameAs :e .
}
onto:implicit {
:b a rdf:Property .
:b rdfs:subPropertyOf :b .
}
onto:implicit {
:a owl:sameAs :a .
:a owl:sameAs :d .
:a owl:sameAs :e .
:d owl:sameAs :a .
:d owl:sameAs :d .
:d owl:sameAs :e .
:e owl:sameAs :a .
:e owl:sameAs :d .
:e owl:sameAs :e .
}
onto:implicit {
:d :b :c .
:e :b :c .
}
In this case, the explicit statements <:a owl:sameAs :d> and <:d owl:sameAs :e> appear also as implicit. They
do not appear twice when dealing with triples because the iterators return unique triples. When dealing with quads,
however, you can see all statements.
Here, you get the same effects with FROM NAMED onto:explicit, FROM NAMED onto:implicit, and FROM NAMED onto:disable-sameAs.
There are several features in the RDFS and OWL specifications that lead to inefficient entailment rules and axioms,
which can have a significant impact on the performance of the inferencer. For example:
• The consequence X rdf:type rdfs:Resource for each URI node in the RDF graph;
• The system should be able to infer that URIs are classes and properties if they appear in schema-defining statements such as X rdfs:subClassOf Y and X rdfs:subPropertyOf Y;
• The individual equality property in OWL is reflexive, i.e., the statement X owl:sameAs X holds for every
OWL individual;
• All OWL classes are subclasses of owl:Thing and for all individuals X rdf:type owl:Thing should hold;
• C is inferred as being rdfs:Class whenever an instance of the class is defined: I rdf:type C.
Although the above inferences are important for formal semantics completeness, users rarely execute queries that
seek such statements. Moreover, these inferences generate so many inferred statements that performance and
scalability can be significantly degraded.
For this reason, optimized versions of the standard rulesets are provided. These have -optimized appended to the
ruleset name, e.g., owl-horst-optimized.
The following optimizations are enacted in GraphDB:
TEN
The GraphDB platform-independent distribution packaged in version 7.0.0 and newer contains the following files:
Path Description
adapters/ Support for SAIL graphs with the Blueprints API
benchmark/ Semantic publishing benchmark scripts
bin/ Scripts for running various utilities, such as ImportRDF and the Storage Tool
conf/ GraphDB properties and logback.xml
configs/ Standard reasoning rulesets and a repository template
doc/ License agreements
examples/ Getting started and Maven installer examples, sample dataset, and queries
lib/ Database binary files
plugins/ GeoSPARQL and SPARQLmm plugins
README The readme file
tools/ Custom admin handler for the Solr Connectors
After the first successful database run, the remaining directories (such as data, logs, and work) will be generated, unless their default values are explicitly changed in conf/graphdb.properties.
The easiest way to set up and run GraphDB is to use the native installations provided for the GraphDB Desktop
distribution. This kind of installation is the best option for your laptop or desktop computer and does not require the use of a console, as it provides a graphical user interface (GUI). For this distribution, you do not need to download Java, as it comes bundled with GraphDB.
Go to the GraphDB download page and request your GraphDB copy. You will receive an email with the download
link. According to your OS, proceed as follows:
Important: GraphDB Desktop is a new application that is similar to but different from the previous application
GraphDB Free.
If you are upgrading from the old GraphDB Free application, you need to stop GraphDB Free and uninstall it
before or after installing GraphDB Desktop. Once you run GraphDB Desktop for the first time, it will convert
some of the data files and GraphDB Free will no longer work correctly.
On Windows
On MacOS
On Linux
Configuring GraphDB
Once GraphDB Desktop is running, a small icon appears in the status bar/menu/tray area (varying depending on
OS). It allows you to check whether the database is running, as well as to stop it or change the configuration
settings. Additionally, an application window is also opened, where you can go to the GraphDB documentation,
configure settings (such as the port on which the instance runs), and see all log files. You can hide the window from
the Hide window button and reopen it by choosing Show GraphDB window from the menu of the aforementioned
icon.
You can add and edit the JVM options (such as Java system properties or parameters to set memory usage) of the
GraphDB native app from the GraphDB Desktop config file. It is located at:
• On Mac: /Applications/GraphDB Desktop.app/Contents/app/GraphDB Desktop.cfg
• On Windows: \Users\<username>\AppData\Local\GraphDB Desktop\app\GraphDB Desktop.cfg
• On Linux: /opt/graphdb-desktop/lib/app/graphdb-desktop.cfg
The JVM options are defined at the end of the file and will look very similar to this:
[JavaOptions]
java-options=-Djpackage.app-version=10.0.0
java-options=-cp
java-options=$APPDIR/graphdb-native-app.jar:$APPDIR/lib/*
java-options=-Xms1g
java-options=-Dgraphdb.dist=$APPDIR
java-options=-Dfile.encoding=UTF-8
java-options=--add-exports
java-options=jdk.management.agent/jdk.internal.agent=ALL-UNNAMED
java-options=--add-opens
java-options=java.base/java.lang=ALL-UNNAMED
Each java-options= line provides a single argument passed to the JVM when it starts. To be on the safe side, it
is recommended not to remove or change any of the existing options provided with the installation. You can add
your own options at the end. For example, if you want to run GraphDB Desktop with 8 gigabytes of maximum
heap memory, you can set the following option:
java-options=-Xmx8g
Stopping GraphDB
To stop the database, simply quit it from the status bar/menu/tray area icon, or close the GraphDB Desktop appli
cation window.
Hint: On some Linux systems, there is no support for status bar/menu/tray area. If you have hidden the GraphDB
window, you can quit it by killing the process.
The default way of running GraphDB is as a standalone server. The server is platform-independent and includes all recommended JVM (Java virtual machine) parameters for immediate use.
Note: Before downloading and running GraphDB, please make sure you have a JDK (Java Development Kit, recommended) or a JRE (Java Runtime Environment) installed. GraphDB requires Java 11 or greater.
Running GraphDB
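A minimal sketch, assuming you run the bin/graphdb startup script from the distribution directory (bin\graphdb.cmd on Windows); the -d option runs the server as a daemon, as referenced in the stopping instructions below:

bin/graphdb        # start the server in the foreground
bin/graphdb -d     # start the server as a background daemon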
Configuring GraphDB
The configuration of all GraphDB directory paths and network settings is read from the conf/graphdb.properties
file. It controls where to store the database data, log files, and internal data. To assign a new value, modify the file
or override the setting by adding -D<property>=<new-value> as a parameter to the startup script. For example, to
change the database port number:
graphdb -Dgraphdb.connector.port=<your-port>
The configuration properties can also be set in the environment variable GDB_JAVA_OPTS, using the same -D<property>=<new-value> syntax.
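For example, to override the HTTP port and enable Workbench CORS (both properties are described elsewhere in this documentation; the values are illustrative):

export GDB_JAVA_OPTS="-Dgraphdb.connector.port=7201 -Dgraphdb.workbench.cors.enable=true"
bin/graphdb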
Note: The order of precedence for GraphDB configuration properties is as follows: command line supplied
arguments > GDB_JAVA_OPTS > config file.
The GraphDB home defines the root directory where GraphDB stores all of its data. The home can be set through
the system or config file property graphdb.home.
The default value for the GraphDB home directory depends on how you run GraphDB:
• Running as a standalone server: the default is the same as the distribution directory.
• All other types of installations: OS-dependent directory.
– On Mac: ~/Library/Application Support/GraphDB.
– On Windows: \Users\<username>\AppData\Roaming\GraphDB.
– On Linux and other Unixes: ~/.graphdb.
GraphDB does not store any files directly in the home directory, but uses several subdirectories for data or configuration.
We strongly recommend setting explicit values for the Java heap space. You can control the heap size by supplying
an explicit value to the startup script such as graphdb -Xms10g -Xmx10g or setting one of the following environment
variables:
• GDB_HEAP_SIZE: environment variable to set both the minimum and the maximum heap size (recommended);
• GDB_MIN_MEM: environment variable to set only the minimum heap size;
• GDB_MAX_MEM: environment variable to set only the maximum heap size.
For more information on how to change the default Java settings, check the instructions in the bin/graphdb file.
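For example, a minimal sketch using the recommended environment variable (the value uses the same format as the -Xms/-Xmx flags and is only illustrative):

export GDB_HEAP_SIZE=10g
bin/graphdb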
Note: The order of precedence for JVM options is as follows: command line supplied arguments > GDB_JAVA_OPTS
> GDB_HEAP_SIZE > GDB_MIN_MEM/GDB_MAX_MEM.
Tip: Every JDK package contains a default garbage collector (GC) that can potentially affect performance.
We benchmarked GraphDB’s performance against the LDBC SPB and BSBM benchmarks with JDK 8 and 11.
With JDK 8, the recommended GC is the Parallel Garbage Collector (ParallelGC). With JDK 11, the best performance can be achieved with either G1 GC or ParallelGC.
To stop the database, find the GraphDB process identifier and send kill <process-id>. This sends a shutdown signal, and the database stops. If the database is run in non-daemon mode, you can also send a Ctrl+C interrupt to stop it.
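For example, assuming a single GraphDB process on the machine (illustrative only):

kill $(pgrep -f graphdb)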
GraphDB can be operated as a desktop or a server application. The server application is recommended if you plan
to migrate your setup to a production environment. Choose the one that best suits your needs, and follow the steps
below:
Run GraphDB as a desktop installation For desktop users, we recommend the quick installation, which comes
with a preconfigured Java. This is the easiest and fastest way to start using the GraphDB database.
• Running GraphDB as a desktop installation
Run GraphDB as a standalone server For production use, we recommend installing the standalone server. The
installation comes with a preconfigured web server. This is the standard way to use GraphDB if you plan to use
the database for longer periods with preconfigured log files.
To migrate from one GraphDB version to another, follow the instructions in the last column of the table below,
and then the steps described further down in this page.
To migrate your GraphDB configuration and data, follow the steps below.
Warning: Keep in mind that after the migration, you cannot automatically revert to GraphDB 9.x.
Hint: You can also copy the conf, data, and work directories from the GraphDB 9.x home direc
tory to a new directory to use as the GraphDB 10.0 home directory. In this case, your GraphDB
9.x home directory is also the backup so you may skip the backup steps.
The cluster in GraphDB 10 is based on an entirely new approach and is not directly comparable or compatible with
the cluster in GraphDB 9.x. See the High Availability Cluster Basics for more details on how the new GraphDB
cluster operates.
The described procedures refer to the three recommended cluster topologies in the 9.x cluster: a single master with three or more workers; two masters sharing workers, where one of the masters is read-only; and multiple masters with dedicated workers. See more about 9.x cluster topologies.
Understand
You will need an existing GraphDB 9.x cluster in good condition before you start the migration. Data and configuration will be copied from two of the nodes:
• A worker node that is in sync with the master. This node will provide:
– The data for each repository that is part of the GraphDB 9.x cluster.
– Any repositories that are not part of the cluster, e.g., an Ontop repository created on the same instance
as the worker repository. Typically, these are used via internal SPARQL federation in the cluster.
• A master node that will provide:
– The user database containing users, credentials, and user settings.
– Any repositories that are not part of the cluster, e.g., an Ontop repository created on the same instance
as the master repository. Typically, these are used by connecting to the repository via HTTP – directly
or via standard SPARQL federation.
– The graphdb.properties file that contains all GraphDB configuration properties.
The instructions below assume your GraphDB 9.x setup has a single home directory that contains the conf, data,
and work directories. If your setup uses explicitly configured separate directories for any of these, you need to
adjust the instructions accordingly. The various directories are described in detail here.
Important: The cluster in GraphDB 10 is configured at the instance level, while the cluster in GraphDB 9.x is
defined per repository. This means that every repository you migrate following the steps below will automatically
become part of the cluster.
Once a cluster is created, it is not possible to have a repository that is not part of the cluster in GraphDB 10.
Prepare
In order to minimize downtime during the migration, you may want to keep the GraphDB 9.x cluster running in read-only mode while performing the migration.
To make a master read-only, go to Setup ‣ Cluster, click on the master node, and enable the read-only setting:
Alternatively, you can reconfigure your application such that it does not do any writes during the migration.
Procedure
To migrate a cluster configuration from GraphDB version 9.x to the 10.0 cluster, please follow the steps outlined
below.
Warning: The instructions are written in such a way that your existing GraphDB 9.x setup is preserved so
you can abort the migration at any point and revert to your previous setup. Note that once you decide to go live
with the migrated GraphDB 10 setup, there is no automatic way to revert that configuration to GraphDB 9.x.
1. First, choose a temporary GraphDB 10 home directory that will be used to copy files and directories and
bootstrap all the nodes.
Hint: All instructions below mean this directory when “temporary GraphDB 10 home directory”
is mentioned.
2. Select one of the worker nodes that is in sync with the master.
3. Stop the GraphDB 9.x instance where the worker node is located – the rest of the GraphDB 9.x cluster will
remain operational.
4. Locate the data directory within the GraphDB 9.x home directory of the worker node and copy it to the
temporary GraphDB 10 home directory.
• The data/repositories directory contains all repositories and their data.
• If any repository is a master repository, delete it from the copy.
5. Select one of the master nodes.
6. Stop the GraphDB 9.x instance where the master node is located – you may want to point your application
to another master or a worker repository so that read operations will continue to work during the migration.
7. Locate the data directory within the GraphDB 9.x home directory of the master node and copy it to the
temporary GraphDB home directory.
• The data/repositories directory contains all repositories and their data.
• If any repository is a master repository, do not copy it.
• If you have only master repositories on the master node you can skip this step.
8. Locate the work directory within the GraphDB 9.x home directory of the master node and copy it to the
temporary GraphDB home directory.
• On GraphDB 9.x, the work directory contains the user database.
Note: After copying the work directory from the master to the new nodes, the old locations
of the GraphDB 9.x cluster workers will be visible in the Workbench of the new nodes. We
recommend deleting the old locations.
9. Locate the conf directory within the GraphDB 9.x home directory of the master node and copy it to the
temporary GraphDB home directory.
• The conf directory contains the graphdb.properties file.
10. Choose the number of nodes for the new cluster. Due to the nature of the Raft consensus algorithm on which
the GraphDB 10 cluster is based, an odd number of nodes is recommended, e.g., three, five, or seven.
As a rule of thumb, use as many nodes as the number of workers you had but add or remove a node to make
the number odd. For example:
• If you had three workers, use three nodes.
• If you had six workers, use five or seven nodes.
11. Copy the temporary GraphDB 10 home directory to each node to serve as the GraphDB 10 on that node.
12. Edit the graphdb.properties file on each node to reflect any settings specific to that node, e.g., graphdb.external-url or SSL certificate properties, but keep general properties, especially graphdb.auth.token.secret and any security-related properties, identical on all nodes.
• If necessary, consult the graphdb.properties file on that node from your GraphDB 9.x setup.
• If the nodes are hosted on the same machine, edit the graphdb.connector.port property so that it is
different for each node.
• See also the notes on configuring networking properties related to the GraphDB 10 cluster.
13. Start GraphDB 10 on each node.
• Make sure each node is up and has a valid EE license. If no license is applied, you will be able to create the cluster with all nodes in the Follower state, but no leader will be elected. However, if you attempt to run a query on any of them, their state will change to Restricted.
14. On any of the instances that you just created, go to Setup ‣ Cluster in the Workbench and create the cluster group.
See more information about the new Workbench user interface for creating, configuring, and accessing a
cluster.
You can revert to your old setup by restarting the worker and master nodes that you stopped while performing the
migration.
If you set your master to read-only, do not forget to set it back to write mode using the same Workbench interface that you used to make it read-only.
Example migration
Given the following GraphDB 9.x cluster setup consisting of two masters and three workers for each master, or a
total of eight GraphDB instances:
graphdb1.example.com
• Master repository master1, the primary master repository
• Worker repository mydata, which is not part of any cluster
graphdb2.example.com
• Master repository master2, the secondary master repository
graphdb3.example.com
• Worker repository worker1 connected to master1
• Ontop repository sql1
graphdb4.example.com
• Worker repository worker2 connected to master1
• Ontop repository sql1
graphdb5.example.com
• Worker repository worker3 connected to master1
• Ontop repository sql1
graphdb6.example.com
• Worker repository worker4 connected to master2
• Ontop repository sql1
graphdb7.example.com
• Worker repository worker5 connected to master2
• Ontop repository sql1
graphdb8.example.com
• Worker repository worker6 connected to master2
• Ontop repository sql1
You choose the worker worker1 and the master master1 to perform the migration.
After completing the steps that copy files from the worker and the master, you should have a directory structure in
the temporary GraphDB 10 home that looks like this:
Directory Description
data/repositories/worker1/ The worker repository copied from the worker node
data/repositories/sql1/ The Ontop repository copied from the worker node
data/repositories/mydata/ The non-clustered worker repository copied from the master node
conf/graphdb.properties The GraphDB configuration file copied from the master node
work/workbench/settings.js The GraphDB 9.x Workbench settings and user database copied from the master node
There may be other files in the data, conf, and work directories, e.g., conf/logback.xml, that are safe to have in
the copy in order to preserve as much of the same configuration as possible.
Note, however, that you should NOT have the following directories:
Directory Description
data/repositories/master1/ The master repository from the master node should NOT be copied
Since you have six workers in the GraphDB 9.x cluster, it makes sense to choose five (the number of workers
minus one to make the number odd) nodes for the GraphDB 10.0 cluster.
If you proceed with the migration, your cluster will contain three repositories that are part of the same cluster:
Repository ID Description
worker1 Migrated GraphDB repository – note it uses the repository ID from the worker node you
used to copy the files from
sql1 Migrated Ontop repository
mydata Migrated GraphDB repository that previously was not part of any cluster
See how to configure the external GraphDB 10.0 cluster proxy here.
GraphDB 10.0 introduces major changes to the filtering mechanism of the connectors. Existing connector instances
will not be usable and attempting to use them for queries or updates will throw an error.
If your connector definitions do not include an entity filter, you can simply repair them.
If your connector definitions do include an entity filter, you need to rewrite the filter using the new filter options.
See the migration steps from GraphDB 9.x for Lucene, Solr, Elasticsearch, and Kafka.
When upgrading to a newer GraphDB version, it might contain plugins that are not present in the older version. In
this case, and when using a cluster, the Plugin Manager disables the newly detected plugins, so you need to enable
them by executing the following SPARQL query:
insert data {
[] <http://www.ontotext.com/owlim/system#startplugin> "plugin-name"
}
Then create your plugin following the steps described in the corresponding documentation, and also make sure to
not delete the database in the plugin you are using.
You can also stop a plugin before the migration in case you deem it necessary:
insert data {
[] <http://www.ontotext.com/owlim/system#stopplugin> "plugin-name"
}
From version 9.8 onwards, GraphDB Enterprise Edition can be deployed with open-source Helm charts. See how to migrate them to GraphDB 10.0 here.
ELEVEN
MANAGING SERVERS
GraphDB relies on several main directories for configuration, logging, and data.
11.1.1 Directories
GraphDB Home
The GraphDB home defines the root directory where GraphDB stores all of its data. The home can be set through
the system or config file property graphdb.home.
The default value for the GraphDB home directory depends on how you run GraphDB:
• Running as a standalone server: the default is the same as the distribution directory.
• All other types of installations: OS-dependent directory.
– On Mac: ~/Library/Application Support/GraphDB.
– On Windows: \Users\<username>\AppData\Roaming\GraphDB.
– On Linux and other Unixes: ~/.graphdb.
Note: In the unlikely case of running GraphDB on an ancient Windows XP, the default directory is \Documents
and Settings\<username>\Application Data\GraphDB.
GraphDB does not store any files directly in the home directory, but uses the following subdirectories for data or
configuration:
Data directory
The GraphDB data directory defines where GraphDB stores repository data. The data directory can be set through
the system or config property graphdb.home.data. The default value is the data subdirectory relative to the
GraphDB home directory.
Configuration directory
The GraphDB configuration directory defines where GraphDB looks for user-definable configuration. It can be set through the system property graphdb.home.conf.
Note: It is not possible to set the config directory through a config property as the value needs to be set before
the config properties are loaded.
The default value is the conf subdirectory relative to the GraphDB home directory.
Work directory
The GraphDB work directory defines where GraphDB stores non-user-definable configuration. The work directory
can be set through the system or config property graphdb.home.work. The default value is the work subdirectory
relative to the GraphDB home directory.
Logs directory
The GraphDB logs directory defines where GraphDB stores log files. The logs directory can be set through the
system or config property graphdb.home.logs. The default value is the logs subdirectory relative to the GraphDB
home directory.
Note: When running GraphDB as deployed .war files, the logs directory will be a subdirectory graphdb within
the Tomcat’s logs directory.
Important: Even though GraphDB provides the means to specify separate custom directories for data, configuration, and so on, it is recommended to specify the home directory only. This ensures that every piece of data, configuration, or logging is within the specified location.
Stepbystep guide:
1. Choose a directory for GraphDB home, e.g., /opt/graphdb-instance.
2. Create the directory /opt/graphdb-instance.
3. (Optional) Copy the subdirectory conf from the distribution into /opt/graphdb-instance.
4. Start GraphDB with graphdb -Dgraphdb.home=/opt/graphdb-instance.
GraphDB creates the missing subdirectories data, conf (if you skipped that step), logs, and work.
When GraphDB starts, it logs the actual value used for each of the above directories.
11.1.2 Configuration
There is a single graphdb.properties config file for GraphDB. It is provided in the distribution under conf/graphdb.properties, from where GraphDB loads it.
This file contains a list of config properties defined in the following format:
propertyName = propertyValue, i.e., using the standard Java properties file syntax.
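For example, an illustrative fragment of conf/graphdb.properties (both properties are documented in this chapter; the values are examples):

graphdb.home.data = /mnt/data/graphdb
graphdb.connector.port = 7201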
Each config property can be overridden through a Java system property with the same name, provided in the
environment variable GDB_JAVA_OPTS, or in the command line.
Configuration properties
General properties
The general properties define some basic configuration values that are shared with all GraphDB components and
types of installation:
graphdb.inference.buffer Buffer size (the number of statements) for each load stage in parallel import. Defaults to 200,000 statements. See also graphdb.inference.concurrency.
graphdb.inference.concurrency Number of inference threads in parallel import. The default value is the number of cores of the machine processor. See also graphdb.inference.buffer.
Workbench properties
In addition to the standard GraphDB command line parameters, the GraphDB Workbench can be controlled with
the following parameters (they should be of the form -Dparam=value):
Example:
graphdb.workbench.cors.enable=true
graphdb.workbench.cors.origin=*
graphdb.workbench.cors.expose-headers="content,location"
URL properties
Hint: Jump ahead to Typical use cases for a list of examples that cover URL properties usage.
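• The external URL as seen via the proxy uses / as its root, for example, http://example.com/.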
– GraphDB will map the external / to its own / automatically, no need to add or change any configuration.
– GraphDB will still not know how to construct external URLs, so setting graphdb.external-url is
recommended even though it might appear to work without setting it.
• The external URL as seen via the proxy uses /something as its root (i.e., something in addition to the /), for
example, http://example.com/rdf.
– GraphDB cannot map this automatically and needs to be configured using the property graphdb.vhosts or graphdb.external-url (see below).
– This will instruct GraphDB that URLs beginning with http://example.com/rdf/ map to the root path
/ of the GraphDB server.
The URL properties determine how GraphDB constructs URLs that refer to itself, as well as what URLs are
recognized as URLs to access the GraphDB installation. GraphDB will try to autodetect those values based on
URLs used to access it, and the network configuration of the machine running GraphDB. In certain setups involving
virtualization or a reverse proxy, it may be necessary to set one or more of the following properties:
Property Description
graphdb.vhosts A comma-delimited list of virtual host URLs that can be used to access GraphDB. Setting this property is necessary when GraphDB needs to be accessed behind a reverse proxy and the path of the external URL is different from /, for example http://example.com/rdf.
graphdb.external-url Sets the canonical external URL. This property implies graphdb.vhosts. If you have provided an explicit value for both graphdb.vhosts and graphdb.external-url, then the URL specified for graphdb.external-url must be one of the URLs in the value for graphdb.vhosts. When a reverse proxy is in use and most users will access GraphDB through the proxy, it is recommended to set this property instead of, or in addition to, graphdb.vhosts, as it will let GraphDB know that the canonical external URL is the one as seen through the proxy. Tip: Prior to GraphDB 9.8, only the graphdb.external-url property existed. You can keep using it as is.
graphdb.external-url.enforce.transactions Determines whether it is necessary to rewrite the Location header when no proxy is configured. Setting this property to true will use graphdb.external-url when building the transaction URLs. Set it to true when the returned URLs are incorrect due to missing or invalid proxy configurations. Set it to false when the server can be called on multiple addresses, as it would otherwise override the returned address with the one defined by graphdb.external-url. Boolean, default is false.
graphdb.hostname Overrides the hostname reported by the machine.
Enabling graphdb.external-url.enforce.transactions will use graphdb.external-url when building the transaction URLs. It should be used when the returned URLs are not correct due to missing or invalid proxy configurations. It should not be used when the server can be called on multiple addresses, as it will override the returned address with the single one defined by graphdb.external-url.
Note: For remote locations, the URLs are always constructed using the base URL of the remote location as
specified when the location was attached.
1. GraphDB is behind a reverse proxy whose URL path is / and most clients will use the proxy URL.
This setup will appear to work out of the box without setting any of the URL properties, but it is recommended to set graphdb.external-url. Example URLs:
• Internal URL: http://graphdb.example.com:7200/
• External URL used by most clients: http://rdf.example.com/
The corresponding configuration is:
# Recommended even though it may appear to work without setting this property
graphdb.external-url = http://rdf.example.com/
2. GraphDB is behind a reverse proxy whose URL path is /something and most clients will use the proxy
URL.
This configuration requires setting graphdb.external-url (recommended) or graphdb.vhosts
to the correct URLs as seen externally through the proxy. Example URLs:
• Internal URL: http://graphdb.example.com:7200/
• External URL used by most clients: http://example.com/rdf/
The corresponding configuration is:
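# Required so that GraphDB can map http://example.com/rdf/ to its own root path /
graphdb.external-url = http://example.com/rdf/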
Network properties
The network properties control how the standalone application listens on a network. These properties correspond
to the attributes of the embedded Tomcat Connector. For more information, see Tomcat’s documentation.
Each property is composed of the prefix graphdb.connector. + the relevant Tomcat Connector attribute. The
most important property is graphdb.connector.port, which defines the port to be used. The default is 7200.
In addition, the sample config file provides an example for setting up SSL.
Note: The graphdb.connector.<xxx> properties are only relevant when running GraphDB as a standalone application.
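For example, an illustrative conf/graphdb.properties fragment (graphdb.connector.maxThreads combines the prefix with Tomcat's maxThreads connector attribute; the values are examples):

graphdb.connector.port = 8200
graphdb.connector.maxThreads = 200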
Engine properties
You can configure the GraphDB Engine through a set of properties composed of the prefix graphdb.engine. + the
relevant engine property. These properties correspond to the properties that can be set when creating a repository
through the Workbench or through a .ttl file.
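For example, assuming you want to override the engine parameter entity-index-size globally for all repositories (the parameter name here only illustrates the prefix convention; the value is an example):

graphdb.engine.entity-index-size = 10000000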
Note: The properties defined in the config override the properties for each repository, regardless of whether you
created the repository before or after setting the global value of an engine property. As such, the global override
should be used only in specific cases. For normal everyday needs, set the corresponding properties when you
create a repository.
Note: In such a case, IRI validation makes the import of broken data more problematic: you would have to change a config property and restart your GraphDB instance instead of changing the setting per import.
Configuring logging
GraphDB uses logback to configure logging. The default configuration is provided as logback.xml in the GraphDB conf directory.
GraphDB is available in three different editions: Free, Standard Edition (SE), and Enterprise Edition (EE).
The Free edition is free to use and does not require a license. This is the default mode in which GraphDB will start. However, it is not open-source.
SE and EE are RDBMS-like commercial licenses on a per-server-CPU basis. They are neither free nor open-source. To purchase a license or obtain a copy for evaluation, please contact graphdb-info@ontotext.com.
When installing GraphDB, the SE/EE license file can be set through the GraphDB Workbench or programmatically.
From here, you can also Revert to Free license. If you do so, GraphDB will ask you to confirm.
After completing these steps, you will be able to view your license details.
GraphDB will look for a graphdb.license file in the GraphDB work directory (where non-user-definable configuration is stored) under the GraphDB home directory. To install a license file there, copy the license file as graphdb.license.
You can use the configuration property graphdb.license.file to provide a custom path for the license file, for
example:
graphdb.license.file = /opt/graphdb/my-graphdb-dev.license
Note: If you set the license through a file in the work directory or a custom path, you will not be able to change
the license through the GraphDB Workbench.
When looking for a license, GraphDB will use the first license it finds in this order:
• The custom license file property graphdb.license.file;
• The graphdb.license file in the work directory;
• A license set through the GraphDB Workbench.
The following diagram offers a view of the memory used by the GraphDB structures and processes.
To specify the maximum amount of heap space used by a JVM, use the -Xmx virtual machine parameter.
GraphDB’s cache strategy, the single global page cache, employs the concept of one global cache shared between
all internal structures of all repositories. This way, you no longer have to configure the cache-memory, tuple-index-memory, and predicate-memory, or size every repository and calculate the amount of memory dedicated to it. If at a given moment one of the repositories is being used more, it will naturally get more slots in the cache.
The global page cache size is dynamic and is determined by the given -Xmx value. It can also be set manually by specifying, for example, -Dgraphdb.page.cache.size=3G.
You can disable the current global page cache implementation by setting -Dgraphdb.global.page.cache=false.
If you do not specify graphdb.page.cache.size, it will be determined automatically from the configured heap size.
Note: You do not have to change/edit your repository configurations. The new cache will be used when you
upgrade to the new version.
By default, all entity pool structures reside on-heap, i.e., inside the regular JVM heap. The graphdb.engine.onheap.allocation property is used to configure memory allocation not only for the entity pool but also for the other structures. It also specifies the entity pool on-heap allocation regardless of whether the deprecated property graphdb.epool.onheap is set to true.
Note: To activate the old behavior, i.e., the entity pool residing off-heap, you can enable off-heap allocation with -Dgraphdb.epool.onheap=false.
If you are concerned that the process will eat up an unlimited amount of memory, you can specify a maximum size
with -XX:MaxDirectMemorySize, which defaults to the -Xmx parameter (at least in OpenJDK and Oracle JDK).
This is a sample configuration demonstrating how to correctly size a GraphDB server with a single repository. The loaded dataset is estimated at 500 million RDF statements and 150 million unique entities. As a rule of thumb, the ratio of unique entities to total statements in a standard dataset is about 1:3.
11.3.5 Upper bounds for the memory consumed by the GraphDB process
In order to make sure that no OutOfMemoryExceptions are thrown while working with an active GraphDB repository, you need to set an upper bound for the memory consumed by all instances of the tuple set/distinct collections. This is done with the -Ddefault.min.distinct.threshold parameter, whose default value is 250m and can be changed. If this value is exceeded, a QueryEvaluationException is thrown so as to avoid running out of memory due to a memory-hungry DISTINCT/GROUP BY operation.
The below instructions will walk you through the steps for creating and monitoring a cluster group.
11.4.1 Prerequisites
You will need at least three GraphDB installations to create a fully functional cluster. Remember that the Raft
algorithm recommends an odd number of nodes, so a cluster of five nodes is a good choice too.
All of the nodes must have the same security settings, and in particular the same shared token secret even when
security is disabled.
For all GraphDB instances, set the following configuration property in the graphdb.properties file and change
<my-shared-secret-key> to the desired secret:
graphdb.auth.token.secret = <my-shared-secret-key>
All of the nodes must have their networking configured correctly – the hostname reported by the OS must be
resolvable to the correct IP address on each node that will participate in the cluster. In case your networking is not
configured correctly or you are not sure, you can set the hostname for each node by putting graphdb.hostname
= hostname.example.com into each graphdb.properties file, where hostname.example.com is the hostname for
that GraphDB to use in the cluster. If you do not have resolvable hostnames, you can supply an IP address instead.
The examples below assume that there are five nodes reachable at the hostnames graphdb1.example.com,
graphdb2.example.com, graphdb3.example.com, graphdb4.example.com, and graphdb5.example.com.
A typical deployment scenario would be a deployment in a cloud infrastructure with the ability to deploy GraphDB instances in different regions or zones, so that if a region/zone fails, the GraphDB cluster will continue functioning without any issues for the end user.
To achieve high availability, it is recommended to deploy GraphDB instances in different zones/regions while
considering the need for a majority quorum in order to be able to accept INSERT/DELETE requests. This means
that the deployment should always have over 50% of the instances running.
Another recommendation is to distribute the GraphDB instances so that you do not have exactly half of the
GraphDB instances in one zone and the other half in another zone, as this way it would be easy to lose the majority
quorum. In such cases, it is better to use three zones.
In a cluster with three nodes, we need at least two in order to be able to write data successfully. In this case, the
best deployment strategy is to have three GraphDB instances distributed in three zones in the same region. This
way, if one zone fails, the other two instances will still form a quorum and the cluster will accept all requests.
In a cluster with five nodes, we need three nodes for a quorum. If we have three available regions/zones, we can
deploy:
• two instances in zone 1,
• two instances in zone 2,
• one instance in zone 3.
If any of the zones fail, we would still have at least three more GraphDB instances that will form a quorum.
A cluster can be created interactively from the Workbench or programmatically via the REST API.
1. Open any of the GraphDB instances that you want to be part of the cluster, for example http://graphdb1.
example.com:7200, and go to Setup ‣ Cluster.
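2. Add the other GraphDB instances that will participate in the cluster by attaching their HTTP addresses as remote locations.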
This is essentially the same operation as when connecting to a remote GraphDB instance.
Clicking on Advanced settings opens an additional panel with settings that affect the entire cluster group but
the defaults should be good for a start:
3. Once you have added all nodes (in this case graphdb2.example.com:7200, graphdb3.example.com:7200,
graphdb4.example.com:7200, and graphdb5.example.com:7200, since graphdb1.example.com:7200 was
discovered automatically and always has to be part of the cluster), click on each of them to include them in
the cluster group:
4. Click OK.
5. At first, all nodes become followers (colored in blue). Then one of the nodes initiates election, after which
for a brief moment, one node becomes a candidate (you may see it briefly flash in green), and finally a leader
(colored in orange).
In this example, graphdb1.example.com became the leader but it could have been any of the
other four nodes. The fact that graphdb1.example.com was used to create the cluster does not
affect the leader election process in any way.
All possible node and connection states are listed in the legend on the bottom left that you can
toggle by clicking the question mark icon.
6. You can also add or remove nodes from the cluster group, as well as delete it.
You can also create a cluster using the respective REST API – see Help ‣ REST API ‣ GraphDB Workbench API ‣ cluster-group-controller for the interactive REST API documentation.
The examples below use cURL.
To create the cluster group, simply POST the desired cluster configuration to the /rest/cluster/config endpoint
of any of the nodes (in this case http://graphdb1.example.com:7200):
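An illustrative cURL sketch; the exact JSON schema of the request body is an assumption here, so check the interactive REST API documentation for the authoritative format (nodes are listed by their RPC addresses):

curl -X POST http://graphdb1.example.com:7200/rest/cluster/config \
  -H 'Content-Type: application/json' \
  -d '{"nodes": ["graphdb1.example.com:7300", "graphdb2.example.com:7300", "graphdb3.example.com:7300", "graphdb4.example.com:7300", "graphdb5.example.com:7300"]}'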
Each node uses the default HTTP port of 7200 and the default RPC port of 7300.
Tip: The default RPC port is the HTTP port + 100. Thus, when the HTTP port is 7200, the RPC port will be
7300. You can set a custom RPC port using graphdb.rpc.port = NNNN, where NNNN is the chosen port.
Just like in the Workbench, you do not need to specify the advanced settings if you want to use the defaults. On successful creation, each node in the new cluster group is reported with the status CREATED:
{
"graphdb1.example.com:7300": "CREATED",
"graphdb2.example.com:7300": "CREATED",
"graphdb3.example.com:7300": "CREATED"
}
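If one of the nodes cannot be reached, the cluster group is not created: the unreachable node is reported as NO_CONNECTION and the remaining nodes as NOT_CREATED, for example: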
{
"graphdb1.example.com:7301": "NOT_CREATED",
"graphdb2.example.com:7302": "NO_CONNECTION",
"graphdb3.example.com:7303": "NOT_CREATED"
}
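Similarly, if one of the nodes is already part of a different cluster group, it is reported as ALREADY_EXISTS and the cluster group is not created: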
{
"graphdb1.example.com:7301": "NOT_CREATED",
"graphdb2.example.com:7302": "ALREADY_EXISTS",
"graphdb3.example.com:7303": "NOT_CREATED"
}
Creation parameters
The cluster group configuration has several properties that have sane default values:
We can add and remove cluster nodes at runtime without having to stop the entire cluster group. This is achieved
through total consensus between the nodes in the new configuration when making a change to the cluster mem
bership.
When adding nodes, a total consensus means that all nodes, both the new and the old ones, have successfully
appended the configuration.
If there is no majority of nodes responding to heartbeats, we can remove the nonresponsive ones all at once.
In this situation, a total consensus on the new configuration would be enough for this operation to be executed
successfully.
It is recommended to remove fewer than 1/2 of the nodes from the current configuration.
Add nodes
New nodes can be added to the cluster group only from the leader. From Setup ‣ Cluster ‣ Add nodes, just like with cluster creation, attach the node's HTTP address as a remote location and click OK.
From Help ‣ REST API ‣ GraphDB Workbench API ‣ cluster-group-controller, send a POST request to the /rest/cluster/config/node endpoint:
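A minimal sketch of such a request, assuming the new node's RPC address is passed in a nodes list; the hostname graphdb6.example.com is a hypothetical new node used only for illustration:
curl -X POST -H 'Content-Type: application/json' \
  'http://graphdb1.example.com:7200/rest/cluster/config/node' \
  -d '{"nodes": ["graphdb6.example.com:7300"]}'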
If one of the nodes, whether already in the group or newly added, has no connection to any of the other nodes, an error message will be returned. This is because a total consensus between the nodes in the new group is needed to accept the configuration, which means that all of them should see each other.
If the added node is part of a different cluster, an error message will be returned.
Only the leader can make cluster membership changes, so if a follower tries to add a node to the cluster group,
again an error message will be returned.
The added node should be either empty or in the same state as the cluster, which means that it should have the
same repositories and namespaces as the nodes in the cluster. If one of these conditions is not met, you will not be
able to add the node.
Remove nodes
Nodes can be removed from the cluster group only from the leader. From Setup ‣ Cluster ‣ Remove nodes, click on the nodes that you want to remove and click OK.
From Help ‣ REST API ‣ GraphDB Workbench API ‣ cluster-group-controller, send a DELETE request to the /rest/cluster/config/node endpoint:
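Again a hedged sketch, assuming the nodes to remove are passed as a nodes list in the request body:
curl -X DELETE -H 'Content-Type: application/json' \
  'http://graphdb1.example.com:7200/rest/cluster/config/node' \
  -d '{"nodes": ["graphdb5.example.com:7300"]}'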
If one of the nodes remaining in the new cluster configuration is down or not visible to the others, the operation
will not be successful. This is because a total consensus between all nodes in the new configuration is needed, so
all of them should see each other.
If a node is down, it can still be removed, as it will not be part of the new configuration. If started again, the removed node will “think” that it is still part of the cluster and will be stuck in candidate state, and the rest of the nodes will not accept any communication coming from it. In such a case, the cluster configuration can be manually deleted on that node only, from Setup ‣ Cluster ‣ Delete cluster.
You can view and manage the cluster configuration properties both from the Workbench and the REST API.
To view the properties, go to Setup ‣ Cluster and click the cog icon on the top right.
It will open a panel showing the cluster group config properties and a list of its nodes.
To view the cluster configuration properties, go to Help ‣ REST API ‣ GraphDB Workbench API ‣ cluster-group-controller and perform a GET request to the /rest/cluster/config endpoint on any of the nodes:
curl http://graphdb1.example.com:7200/rest/cluster/config
To check the cluster configuration in the interactive documentation, go to GET /rest/cluster/config and click Try it out.
200: Returns the cluster configuration
If the request is successful, the response code will be 200 and the cluster configuration will be returned.
404: Cluster not found
If no cluster group is found, the returned response code will be 404. One such case could be when you have attempted to create a cluster group with just one GraphDB node.
To update the config properties, perform a PATCH request containing the parameters of the new config to the
/rest/cluster/config endpoint:
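For example, a hedged sketch of changing a single setting; the parameter name heartbeatInterval is an assumption used for illustration, so check the interactive REST API documentation for the exact names:
curl -X PATCH -H 'Content-Type: application/json' \
  'http://graphdb1.example.com:7200/rest/cluster/config' \
  -d '{"heartbeatInterval": 3000}'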
If one of the cluster nodes is down or was not able to accept the new configuration, the operation will not be
successful. This is because we need a total consensus between the nodes, so if one of them cannot append the new
config, all of them will reject it.
To check the current status of the cluster, including the current leader, open the Workbench and go to Setup ‣ Cluster.
The solid green lines indicate that the leader is IN_SYNC with all followers.
Clicking on a node will display some basic information about it, such as its state (leader or follower) and RPC address. Clicking on its URL will open the node in a new browser tab.
You can also use the REST API to get more detailed information or to automate monitoring.
Cluster group
To check the status of the entire cluster group, send a GET request to the /rest/cluster/group/status endpoint
of any of the nodes, for example:
curl http://graphdb1.example.com:7200/rest/cluster/group/status
If there are no issues with the cluster group, the returned response code will be 200 with the following result:
[
{
"address" : "graphdb1.example.com:7300",
"endpoint" : "http://graphdb1.example.com:7200",
"lastLogIndex" : 0,
"lastLogTerm" : 0,
"nodeState" : "LEADER",
"syncStatus" : {
"graphdb2.example.com:7300" : "IN_SYNC",
"graphdb3.example.com:7300" : "IN_SYNC"
},
"term" : 2
},
{
"address" : "graphdb2.example.com:7300",
"endpoint" : "http://graphdb2.example.com:7200",
"lastLogIndex" : 0,
"lastLogTerm" : 0,
"nodeState" : "FOLLOWER",
"syncStatus" : {},
"term" : 2
},
{
"address" : "graphdb3.example.com:7300",
"endpoint" : "http://graphdb3.example.com:7200",
"lastLogIndex" : 0,
"lastLogTerm" : 0,
"nodeState" : "FOLLOWER",
"syncStatus" : {},
"term" : 2
}
]
Note: Any node, regardless of whether it is a leader or a follower, will return the status for all nodes in the cluster
group.
Cluster node
To check the status of a single cluster node, send a GET request to the /rest/cluster/node/status endpoint of
the node, for example:
curl http://graphdb1.example.com:7200/rest/cluster/node/status
If there are no issues with the node, the returned response code will be 200 with the following information (for a
leader):
{
  "address" : "graphdb1.example.com:7300",
  "endpoint" : "http://graphdb1.example.com:7200",
  "lastLogIndex" : 0,
  "lastLogTerm" : 0,
  "nodeState" : "LEADER",
  "syncStatus" : {
    "graphdb2.example.com:7300" : "IN_SYNC",
    "graphdb3.example.com:7300" : "IN_SYNC"
  },
  "term" : 2
}
And for a follower:
{
  "address" : "graphdb2.example.com:7300",
  "endpoint" : "http://graphdb2.example.com:7200",
  "lastLogIndex" : 0,
  "lastLogTerm" : 0,
  "nodeState" : "FOLLOWER",
  "syncStatus" : {},
  "term" : 2
}
To delete a cluster, open the Workbench and go to Setup ‣ Cluster. Click Delete cluster and confirm the operation.
Warning: This operation deletes the cluster group on all nodes, and can be executed from any node regardless
of whether it is a leader or a follower. Proceed with caution.
You can also use the REST API to automate the delete operation. Send a DELETE request to the /rest/cluster/
config endpoint of any of the nodes, for example:
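With cURL, such a request could look like this:
curl -X DELETE 'http://graphdb1.example.com:7200/rest/cluster/config'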
By default, the cluster group cannot be deleted if one or more nodes are unreachable. Reachable here means that the node is not in status NO_CONNECTION, i.e., there is an RPC connection to it.
200: Cluster deleted
If the deletion is successful, the response code will be 200 and the returned response body:
{
"graphdb1.example.com:7300": "DELETED",
"graphdb2.example.com:7300": "DELETED",
"graphdb3.example.com:7300": "DELETED"
}
If one or more nodes are unreachable, the cluster group is not deleted and the response shows which node could not be reached:
{
  "graphdb1.example.com:7300": "NOT_DELETED",
  "graphdb2.example.com:7300": "NO_CONNECTION",
  "graphdb3.example.com:7300": "NOT_DELETED"
}
Force parameter
The optional force parameter (false by default) enables you to bypass this restriction and delete the cluster group
on the nodes that are reachable:
• When set to false, the cluster configuration will not be deleted on any node if at least one of the nodes is
unreachable.
• When set to true, the cluster configuration will be deleted only on the reachable nodes, and not be deleted
on the unreachable ones.
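A hedged sketch of such a request, assuming that force is passed as a query parameter to the delete endpoint:
curl -X DELETE 'http://graphdb1.example.com:7200/rest/cluster/config?force=true'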
In such a case, the returned response will be 200:
{
"graphdb1.example.com:7300": "DELETED",
"graphdb2.example.com:7300": "io.grpc.StatusRuntimeException: UNAVAILABLE: io exception",
"graphdb3.example.com:7300": "DELETED"
}
The external cluster proxy can be deployed separately on its own URL. This way, you do not need to know where all cluster nodes are; instead, there is a single URL that always points to the leader node.
The externally deployed proxy behaves like a regular GraphDB instance, including opening and using the Workbench. It always knows which node the leader is and serves all requests to the current leader.
Note: The external proxy does not require a GraphDB SE/EE license.
./bin/cluster-proxy -g http://graphdb1.example.com:7200,http://graphdb2.example.com:7200
A console message will inform you that GraphDB has been started in proxy mode.
Cluster proxy options
The cluster-proxy script supports the following options:
Option                            Description
-d, --daemon                      Daemonize (run in background)
-r, --follower-retries <num>      Number of times to retry a request to a different node in the cluster
-g, --graphdb-hosts <address>     List of GraphDB nodes' HTTP or RPC addresses that are part of the same cluster
-h, --help                        Print command line options
-p, --pid-file <file>             Write PID to <file>
-Dprop                            Set Java system property
-Xprop                            Set non-standard Java system property
By default, the proxy will start on port 7200. To change it, use, for example, -Dgraphdb.connector.port=7201.
As mentioned above, the default RPC port of the proxy is the HTTP proxy port + 100, which will be 7300 if you have
not used a custom HTTP port. You can change the RPC port by setting, for example, -Dgraphdb.rpc.port=7301
or -Dgraphdb.rpc.address=graphdb-proxy.example.com:7301, e.g.:
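For example, building on the command shown above:
./bin/cluster-proxy -Dgraphdb.connector.port=7201 -Dgraphdb.rpc.port=7301 \
  -g http://graphdb1.example.com:7200,http://graphdb2.example.com:7200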
Important: Remember to set -Dgraphdb.auth.token.secret=<cluster-secret> to the same secret with which you have set up the cluster. If the secrets do not match, some proxy functions may appear to work correctly while the proxy is in fact misconfigured, and you may experience unexpected behavior at any time.
The external proxy works with two types of cluster node lists: static and dynamic.
• The static list is provided to the proxy through the -g/--graphdb-hosts option of the script. This is a comma-separated list of HTTP or RPC addresses of cluster nodes. At least one address of an active node should be provided. Once the proxy is started, it tries to connect to each of the nodes provided in this list. If it succeeds with one of them, it then builds the dynamic cluster node list.
• The dynamic cluster node list is built by requesting the cluster's current status from one of the nodes in the static list. The proxy then subscribes to any changes in the cluster status (leader changes, nodes being added or removed, nodes out of reach, etc.). The external proxy always sends all requests to the current cluster leader. If there is no leader at the moment, or the leader is unreachable, requests will go to a random node.
Note: The dynamic cluster node list is reset every time the external proxy is restarted. After each restart, the
proxy knows only about the nodes listed in the static node list provided by the -g/--graphdb-hosts option of the
script.
To set up the external proxy to connect to a cluster over SSL, the same options used to set up GraphDB with
security can be provided to the cluster-proxy script. The most common ones are:
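A hedged sketch of what this could look like, reusing the standard Tomcat connector properties described in the Encryption section below (file paths and passwords are placeholders):
./bin/cluster-proxy -g https://graphdb1.example.com:7200,https://graphdb2.example.com:7200 \
  -Dgraphdb.connector.SSLEnabled=true \
  -Dgraphdb.connector.scheme=https \
  -Dgraphdb.connector.secure=true \
  -Dgraphdb.connector.keystoreFile=<path-to-the-keystore-file> \
  -Dgraphdb.connector.keystorePass=<secret> \
  -Dgraphdb.connector.keyPass=<secret>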
For more information on the cluster security options, please see below.
Encryption
As there is a lot of traffic between the cluster nodes, it is important that it is encrypted. In order to do so, the
following requirements need to be met:
• SSL/TLS should be enabled on all cluster nodes.
• The nodes’ certificates should be trusted by the other nodes in the cluster.
The method of enabling SSL/TLS is already described in Configuring GraphDB instance with SSL. There are no differences when setting up a node that will be used as part of a cluster.
See how to set up certificate trust between the nodes here.
Access control
Authorization and authentication methods in the cluster do not differ from those for a regular GraphDB instance.
The rule of thumb is that all nodes in the cluster group must have the same security configuration.
For example, if SSL is enabled on one of the nodes, you must enable it on the other nodes as well; or if you have
configured OpenID on one of the nodes, it needs to be configured on the rest of them as well.
The truncate log operation is used to free up storage space on all cluster nodes by clearing the current transaction log and removing cached recovery snapshots. It can be triggered with a POST request to the /rest/cluster/truncate-log endpoint.
Note: The operation requires a healthy cluster, i.e., one where a leader node is present and all follower nodes are IN_SYNC. The reason for this is that the truncate log operation is propagated to each node in the cluster and subsequently truncates the log on each node through the Raft quorum mechanism.
You can truncate the cluster log with the following cURL request:
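For example:
curl -X POST 'http://graphdb1.example.com:7200/rest/cluster/truncate-log'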
TWELVE
SECURITY
Database security refers to the collective measures used to protect and secure a database from illegitimate use and malicious threats and attacks. It covers and enforces security in several aspects, described in this chapter.
Security configurations in the GraphDB Workbench are located under Setup ‣ Users and Access.
The Users and Access page allows you to create new users, edit their profiles, change their passwords and read/write permissions for each repository, as well as delete them.
Note: As a security precaution, you cannot delete or rename the admin user.
By default, the security for the entire Workbench instance is disabled. This means that everyone has full access to
the repositories and the admin functionality.
To enable security, click the Security slider on the top right. You are immediately taken to the login screen.
Log in with the default credentials:
username: admin
password: root
Note: We recommend changing the default credentials for the admin account as soon as possible. Using the
default password in production is not secure.
Once you have enabled security, you can turn on free access mode. If you click the slider associated with it, you
will be shown this popup box:
This gives you the ability to allow unrestricted access to a number of resources without the need for any authentication.
In the example above, all users will be able to read and write in the repository called “my_repo”, and read the “remote_repo” repository. They will also be able to create or delete connectors and toggle plugins for the “my_repo” repository.
The Workbench user settings allow you to configure the default behavior for the GraphDB Workbench. Here, you
can enable or disable the following:
• Default sameAs value This is the default value for the Expand results over owl:sameAs option in the
SPARQL editor. It is taken each time a new tab is created. Note that once you toggle the value in the
editor, the changed value is saved in your browser, so the default is used only for new tabs. The setting is
also reflected in the Graph settings panel of the Visual graph.
• Default Inference Same as above, but for the Include inferred data in results option in the SPARQL editor.
The setting is also reflected in the Graph settings panel of the Visual graph.
• Count all SPARQL results For each query without a limit sent through the SPARQL editor, an additional query is sent to determine the total number of results. This value is needed both for your information and for results pagination. In some cases, you may not want this additional query to be executed, for example because the evaluation may be too slow for your dataset; set this option to false in that case.
The edit icon under Actions next to each user in the list will take you to the following screen:
The only difference between the Edit user and Create new user screens is that in Edit user, you cannot change the
username.
Authorization is the process of mapping a known user to a set of specific permissions. GraphDB implements
Spring Security, where permissions are defined based on a combination of a URL pattern and an HTTP method.
When an HTTP request is received, Spring Security intercepts it, verifies the permissions, and either grants or
denies access.
GraphDB's access control is implemented using a hierarchical Role-Based Access Control (RBAC) model. This corresponds to the hierarchical level of the NIST/ANSI/INCITS RBAC standard and is also known as RBAC1 in older publications.
The model defines three entities:
Users Users are members of roles and acquire the permissions associated with the roles.
Roles Roles group a set of permissions and are organized in hierarchies, i.e., a role includes its directly associated
permissions as well as the permissions it inherits from any parent roles.
Permissions Permissions grant access rights to execute a specific operation.
RBAC in GraphDB does not define sessions, as the security implementation is stateless. An authorized user always
receives the full set of roles associated with it. Within a single API request call there is always an associated user
and hence roles and permissions.
The core roles defined in the GraphDB security model follow a hierarchy:
ROLE_REPO_MANAGER: Can create, edit, and delete repositories, with read and write permissions to all repositories
ROLE_MONITORING: Allows monitoring operations (queries, updates, abort query/update, resource monitoring)
ROLE_USER: Can save SPARQL queries, graph visualizations, or user-specific settings
ROLE_CLUSTER: Can perform internal cluster operations
Note: When providing the WRITE_REPO_xxx role for a given repository, the READ_REPO_xxx role must be
provided for it as well.
The GraphDB user management interface uses a simplified high-level model, where each created user falls into one of three categories: a regular user, a repository manager, or an administrator. The three categories correspond directly to one of the core roles. In addition to that, regular users may be granted individual read/write rights (the READ_REPO_* and WRITE_REPO_* roles) to one or more repositories.
GraphDB has two special internal users that are required for the functioning of the database. These users cannot be seen or modified via user management.
GraphDB supports three types of user databases used for authentication and authorization, explained in detail
below: Local, LDAP, and OAuth. Each of them contains the information about who the user is, where they come
from, and what type of rights and roles they have. The database may also store and validate the user’s credentials,
if that is required.
Only one database is active at a time. When one is selected, all available users are provided from that database.
The default database is Local.
As mentioned above, this is the default security access provider. The local database stores usernames, hashed passwords, assigned roles, and user settings. Passwords are hashed using the bcrypt algorithm.
The local database is located in the settings.js file under the GraphDB data directory. If you are worried about
the security of this file, we recommend encrypting it (see Encryption at rest).
The local user database does not need to be configured but can be explicitly specified with the following property:
graphdb.auth.database = local
A fresh installation of GraphDB comes with a single default user whose username is admin and default password is root. This user cannot be deleted or demoted to any of the non-administrator levels. It is recommended to change the default password at the earliest convenience in order to avoid undesired access by a third party.
If you wish to disable the default admin user, you can unset its password from Setup ‣ My Settings in the GraphDB Workbench.
Warning: If you unset the password for any user and then enable security, that user will not be able to log
into GraphDB. The only way to log in would be through OpenID or Kerberos authentication.
Tip: See also the configuration examples for Basic/GDB + LDAP, OpenID + LDAP, and Kerberos + LDAP.
Lightweight Directory Access Protocol (LDAP) is a lightweight client-server protocol for accessing directory services implementing X.500 standards. All its records are organized around the LDAP Data Interchange Format (LDIF), which is represented in a standard plain-text file format.
When LDAP is enabled and configured, it replaces the local database and GraphDB security will use the LDAP
server to provide authentication and authorization. An internal user settings database is still used for storing user
settings. This means that you can use the Workbench or the GraphDB API to change them. All other administration
operations need to be performed on the LDAP server side.
Note: As of GraphDB version 9.5 and newer, local users are no longer accessible when using LDAP.
Enable the LDAP database with the following property:
graphdb.auth.database = ldap
When LDAP is turned on, the following security settings can be used to configure it:
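As a hedged illustration, drawn from the configuration examples later in this chapter, a minimal LDAP setup in graphdb.properties could look like this (the organization and credentials are placeholders):
# Search the "people" unit of the directory for users by common name.
graphdb.auth.ldap.user.search.base = ou=people
graphdb.auth.ldap.user.search.filter = (cn={0})
# Credentials for an LDAP server that does not allow anonymous binds and anonymous access.
graphdb.auth.ldap.bind.userDn = uid=userId,ou=people,dc=example,dc=org
graphdb.auth.ldap.bind.userDn.password = 123456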
GraphDB has three standard user roles: Administrator, Repository manager, and User. Every user authenticated over LDAP will be assigned one of these roles.
Set the following property to the LDAP group that must receive the Administrator role:
graphdb.auth.ldap.role.map.administrator = gdbadmin
Set the following property to the LDAP group that must receive the Repository manager role:
graphdb.auth.ldap.role.map.repositoryManager = gdbrepomanager
Unless a user has been assigned the Administrator or Repository manager role, they will receive the User role automatically.
OAuth is an open-standard authorization protocol for providing secure delegated access, a way for users to grant websites/applications access to their information on other websites/applications without sharing their initial login credentials. OAuth is centralized, which means only the authorization server owns user credentials.
Note: OAuth requires OpenID for authentication, and the authorization comes from an OAuth claim. Direct
password authentication with GraphDB (e.g., basic or using the Workbench login form) is not possible.
When OAuth is enabled and configured, it replaces the local database and GraphDB security will use only the
OAuth claims to provide authorization. An internal user settings database is still used for storing user settings. This
means that you can use the Workbench or the GraphDB API to change them. All other administration operations
need to be performed in the OpenID/OAuth provider.
Enable OAuth authorization with the following property:
graphdb.auth.database = oauth
When OAuth authorization is enabled, the following property settings can be used to configure it:
Note: GraphDB enables case-insensitive validation for user accounts so that users can log in regardless of the case used at login time. For example, if the user database contains a user “john.smith”, they can log in using any of these:
• john.smith
• John.Smith
• JOHN.SMITH
• JOHN.smitH
This is controlled via the boolean config property graphdb.auth.database.case_insensitive. It is optional and
false by default.
When using the local database, it is enough to just set graphdb.auth.database.case_insensitive = true.
When using an external user database (LDAP, OpenID), the external database must support case-insensitive login as well.
Whenever a client connects to GraphDB, a security context is created. Each security context is always associated
with a single authenticated user or a default anonymous user when no credentials have been provided.
Authentication is the process of mapping this security context to a specific user. Once the security context is
mapped to a user, a set of permissions can be associated with it, using authorization.
When GraphDB security is ON, the following authentication methods are available:
• Basic authentication: The username and password are sent in a header as plain text (usually used with the RDF4J client, from Java, or with cURL). Enabled by default (can be optionally disabled).
• GDB: Token-based authentication used by the Workbench for username/password login. This login method is also available through the REST API. Enabled by default (can be optionally disabled).
• Kerberos: Highly secure single sign-on protocol that uses tickets for authentication. Disabled by default (must be configured to be enabled).
• X.509 certificate authentication: When a certificate is signed by a trusted authority, or is otherwise validated,
the device holding the certificate can validate documents. Disabled by default (must be configured to be
enabled).
• OpenID: Single sign-on method that allows accessing GraphDB without the need for creating a new password. Its biggest advantage is the delegation of the security outside the database. Disabled by default (must be configured to be enabled).
All five authentication providers (Basic, GDB, OpenID, X.509, and Kerberos) can be combined with both a local and an LDAP database. The only provider that can be combined with OAuth is OpenID, as OAuth is an extension of OpenID.
There is also an additional authentication provider, the GDB Signature. It is for internal use only, works with a detached internal cluster user, and is always enabled. This is the built-in cluster security that uses tokens similar to those used for logging in from the Workbench.
The following combinations of authentication provider and user database are possible:
• Basic: Local DB or LDAP
• GDB: Local DB or LDAP
• Kerberos: Local DB or LDAP
• X.509 certificate: Local DB or LDAP
• OpenID: Local DB, LDAP, or OAuth
We will look at each of the above in greater detail in the following sections.
Basic authentication
Basic authentication is a method for an HTTP client to provide a username and password when making a request.
The request contains a header in the form of Authorization: Basic <credentials>, where <credentials> is
the Base64 encoding of the username and password joined by a single colon, e.g.:
Authorization: Basic YWRtaW46cm9vdA==
Warning: Basic authentication is the least secure authentication method. Anyone who intercepts your requests will be able to reuse your credentials indefinitely until you change them. Since the credentials are merely Base64-encoded, they will also obtain your username and password. This is why it is very important to always use encryption in transit.
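For illustration, cURL can send this header for you via its -u flag; the sketch below assumes the default admin credentials and lists the repositories through the REST API:
curl -u admin:root '<base_url>/rest/repositories'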
GDB authentication
GDB authentication is a method for an HTTP client to obtain a token in advance by supplying a username and
password, and then send the token with every HTTP request that requires authentication. The token must be sent
as an HTTP header in the form of Authorization: GDB <token>, where <token> is the actual token.
This authentication method is used by the GraphDB Workbench when a user logs in by typing their username and
password in the Workbench.
Note: Anyone who intercepts a GDB token can reuse it until it expires. To prevent this, we recommend always enabling encryption in transit.
It is also possible to obtain a token via the REST API and use the token in your own HTTP client to authenticate
with GraphDB, e.g. with cURL:
1. Log in and obtain a token:
The token will be returned in the Authorization header. It can be copied as is and used to authenticate other
requests.
2. Use the returned token to authenticate with GraphDB:
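A hedged sketch of both steps; it assumes the Workbench login REST endpoint /rest/login accepts the username and password as a JSON body, so check the interactive REST API documentation for the exact request format (the token value is a placeholder):
# 1. Log in and obtain a token (returned in the Authorization response header):
curl -i -X POST -H 'Content-Type: application/json' \
  -d '{"username": "admin", "password": "root"}' \
  '<base_url>/rest/login'
# 2. Use the returned token to authenticate subsequent requests:
curl -H 'Authorization: GDB <token>' '<base_url>/rest/repositories'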
GDB tokens are signed with a private key and the signature is valid for a limited period of time. If the private key
changes or the signature expires, the token is no longer valid and the user must obtain a new token. The default
validity period is 30 days. It can be configured via the graphdb.auth.token.validity property that takes a single number, optionally suffixed by the letter d (days), h (hours), or m (minutes) to specify the unit. If no letter is provided, then days are assumed. For example, graphdb.auth.token.validity = 2d and graphdb.auth.token.validity = 2 will both set the validity to two days.
Note: During the token validity period, if the password is changed the user will still have access to the server.
However, if the user is removed, the token will stop working.
The private key used to sign the GDB tokens is generated randomly when GraphDB starts. This means that after a restart, all previously issued tokens will expire immediately and users will be forced to log in again. To avoid that, you can set a secret to derive a static private key by setting the following property:
graphdb.auth.token.secret = <my-secret>
Treat the secret as you would any password: it must be sufficiently long and not easily guessable.
Note: The token secret is used to sign the internal cluster communication and needs to be the same on all cluster
nodes.
OpenID authentication
Tip: See also the configuration examples for OpenID + Local users, OpenID + LDAP, and OpenID + OAuth.
Single sign-on over the OpenID protocol enables you to log in just once and access all internal services. From a security standpoint, it provides a more secure environment, because it minimizes the number of places where a password is processed.
When OpenID is used for authentication, the authorization may come from the local user database, LDAP, or OAuth. Direct password authentication with GraphDB is possible only with the local database or LDAP, and can be optionally disabled.
OpenID needs to be configured from the graphdb.properties file. Enable it with the following property:
graphdb.auth.methods = basic, gdb, openid
Provide only openid if password-based login methods (Basic and GDB) are not needed, or if you combine OpenID with the OAuth user database.
When OpenID authentication is enabled, the following property settings can be used to configure it:
Note: Logging out in this mode when using the GraphDB Workbench only deletes the GraphDB session without logging you out from your provider account.
The OpenID provider needs to be configured as well, as the GraphDB Workbench will use its own root browser URL, e.g., https://graphdb.example.com:7200/ (note the terminating slash), as the redirect_uri parameter when it redirects the browser to the authorization endpoint. Once the login is completed at the remote end, OIDC mandates that the identity provider redirect back to the supplied redirect_uri.
Typically, the allowed values for redirect_uri must be registered with the OpenID provider.
Kerberos authentication
Tip: See also the example configurations for Kerberos + Local users and Kerberos + LDAP.
Kerberos is a highly secure single sign-on protocol that uses tickets for authentication, and avoids storing passwords locally or sending them over the Internet. The authentication mechanism involves a trusted third party and communication encrypted with symmetric-key cryptography. Although considered a legacy technology, Kerberos is still the default single sign-on mechanism in big Windows-based enterprises, and is an alternative to OpenID authentication.
The basic support for authentication via Kerberos in GraphDB involves:
• Validation of SPNEGO HTTP Authorization tokens, i.e., headers of the form Authorization: Negotiate <SPNEGO token>.
• Extraction of the username from the SPNEGO token and matching the username against a user from the local database or a user from LDAP.
SPNEGO is the mechanism that integrates Kerberos with HTTP authentication.
After the token is validated and matched to an existing user, the process continues with authorization (assigning
user roles) via the existing mechanism.
Using Kerberos this way is equivalent to authenticating via Basic, GDB, or OpenID.
In order to validate incoming SPNEGO tokens, the Spring Security Kerberos module needs a Kerberos keytab (a
set of keys associated with a particular Kerberos account) and a service principal (the username of the associated
Kerberos account). This account is used only to validate and decrypt the incoming SPNEGO tokens and is not
associated with any user in GraphDB. See more on how to create a keytab file here.
Enable Kerberos with the following property:
graphdb.auth.methods = basic, gdb, kerberos
In addition, you might want to specify a custom krb5.conf file via the java.security.krb5.conf property, but Java should be able to pick up the default system file automatically.
User matching
Kerberos principals (usernames) need to be matched to GraphDB usernames. A Kerberos principal consists of a username, followed by @, followed by a realm. The realm looks like a domain name and is usually written out in capital letters. The principals are converted by simply dropping the @ sign and the realm, so, for example, the principal john@EXAMPLE.ORG is matched to the GraphDB user john. However, the realm from incoming SPNEGO tokens must match the realm of the service principal.
There are various ways to use SPNEGO when talking to GraphDB as a client. All methods add the Kerberos/SPNEGO authentication in the HTTP client used by the RDF4J libraries.
Native method
The native method does not require any third-party libraries and relies on the built-in Kerberos capabilities of Java and Apache's HttpClient. However, it is a bit cumbersome to use since it requires wrapping calls into an authentication context. This method supports only non-preemptive authentication, i.e., the GraphDB server must explicitly say it needs Kerberos/SPNEGO by sending a WWW-Authenticate: Negotiate header to the client.
There is a third-party library called kerb4j, which makes some things easier. It does not require wrapping the execution into an authentication context and supports preemptive authentication, i.e., sending the necessary headers without asking the server if it needs authentication.
Both methods are illustrated in this example project.
X.509 is a digital certificate based on the widely accepted International Telecommunications Union (ITU) X.509
standard, which defines the format of public key infrastructure (PKI) certificates. Some of its advantages include:
• Increased security compared to traditional username and password combinations.
• Streamlined authentication, as certificates eliminate the need to remember username and password combinations.
• Ease of deployment, as certificates are stored locally and are implemented without needing any extra hardware.
This authentication method can be used with the local users and the LDAP authorization databases. Direct password authentication with GraphDB is possible only with local users or LDAP, and can be optionally disabled.
1. Enable X.509 certificate authentication with the following graphdb.properties file property. The default value is basic, gdb. Provide only x509 if password-based login methods (basic and gdb) will not be used.
graphdb.auth.methods = basic, gdb, x509
2. Enable local or LDAP authorization. The default value of the property is local, corresponding to the local
user database. If LDAP is the chosen authorization database, enable it via the property below and then
configure it.
graphdb.auth.database = ldap
3. Provide a regular expression to extract the username from the certificate. The default is CN=(.*?)(?:,|$).
If you want to provide a custom expression, uncomment the following and edit it.
graphdb.auth.methods.x509.subject.dn.pattern = CN=(.*?)(?:,|$)
There is also a third option – setting a Certificate Revocation List (CRL) to Tomcat, which allows revocation checks for certificates that do not provide an Authority Information Access (AIA) extension, or can serve as an alternative in the event of OCSP or CRLDP responder downtime. Uncomment and set the property:
graphdb.auth.methods.x509.crlFile = <path-to-certificate-revocation-list>
Note: If all three methods are provided, the order of precedence in which GraphDB will look
for them is:
1. Online Certificate Status Protocol (OCSP) check
2. Certificate Revocation List Distribution Point (CRLDP) check
3. Certificate Revocation List (CRL) check
To bypass certificate validation, pass the -k or --insecure flag to cURL. This will tell cURL to ignore certificate
errors and accept insecure certificates without complaining about them.
curl -k --cert cerfile.pem --key cerfile.key 'https://<base_url>/<graphdb_endpoint>'
This is a list of example configurations for some of the possible combinations of authentication methods (Basic,
GDB, OpenID, X.509 certificate, and Kerberos) with the three supported user databases for authorization (Local,
LDAP, and OAuth).
Hint: The OpenID, Kerberos, and LDAP parts of the examples are identical in all cases but are repeated for convenience.
Basic/GDB + LDAP
# The methods basic and gdb are active by default but may be provided explicitly as such:
graphdb.auth.methods = basic, gdb
# Use the LDAP database for authentication and authorization:
graphdb.auth.database = ldap
# Permit access for all users that are part of the “people” unit of the fictional “example.org” organization.
graphdb.auth.ldap.user.search.base = ou=people
graphdb.auth.ldap.user.search.filter = (cn={0})
# Make all users in the Management group GraphDB Repository Managers as well.
graphdb.auth.ldap.role.map.repositoryManager = Management
# Enable all users in the Readers group to read the my_repo repository.
graphdb.auth.ldap.role.map.repository.read.my_repo = Readers
# Enable all users in the Writers group to write and read the my_repo repository.
graphdb.auth.ldap.role.map.repository.write.my_repo = Writers
# Required for accessing an LDAP server that does not allow anonymous binds and anonymous access.
graphdb.auth.ldap.bind.userDn = uid=userId,ou=people,dc=example,dc=org
graphdb.auth.ldap.bind.userDn.password = 123456
OpenID + Local users
# OpenID issuer URL, used to derive keys endpoints and token validation.
graphdb.auth.openid.issuer = https://accounts.example.com
# The local database is the default setting but it may be set explicitly as such:
graphdb.auth.database = local
OpenID + LDAP
# OpenID issuer URL, used to derive keys endpoints and token validation.
graphdb.auth.openid.issuer = https://accounts.example.com
# Permit access for all users that are part of the “people” unit of the fictional “example.org” organization.
graphdb.auth.ldap.user.search.base = ou=people
graphdb.auth.ldap.user.search.filter = (cn={0})
# Make all users in the Management group GraphDB Repository Managers as well.
graphdb.auth.ldap.role.map.repositoryManager = Management
# Enable all users in the Readers group to read the my_repo repository.
graphdb.auth.ldap.role.map.repository.read.my_repo = Readers
# Enable all users in the Writers group to write and read the my_repo repository.
graphdb.auth.ldap.role.map.repository.write.my_repo = Writers
# Required for accessing an LDAP server that does not allow anonymous binds and anonymous access.
graphdb.auth.ldap.bind.userDn = uid=userId,ou=people,dc=example,dc=org
graphdb.auth.ldap.bind.userDn.password = 123456
OpenID + OAuth
# OpenID issuer URL, used to derive keys endpoints and token validation.
graphdb.auth.openid.issuer = https://accounts.example.com
# OAuth roles claim. The field from the JWT token that will provide the GraphDB roles.
graphdb.auth.oauth.roles_claim = roles
# OAuth roles prefix to strip. The roles claim may provide the GraphDB roles with some prefix, e.g., GDB_ROLE_USER.
# OAuth default roles to assign. It may be convenient to always assign certain roles without listing them in the roles claim.
graphdb.auth.oauth.default_roles = ROLE_USER
# Use the OAuth claims for authorization:
graphdb.auth.database = oauth
Example configuration for X.509 certificate authentication + local user database authorization:
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ X.509 AUTHENTICATION ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Provide a regular expression to extract the username from the certificate. The default is CN=(.*?)(?:,|$).
# If you want to provide a custom expression, uncomment the below and edit it.
graphdb.auth.methods.x509.subject.dn.pattern = CN=(.*?)(?:,|$)
# GraphDB uses the Java implementation of SSL, which requires a configured key in the Java keystore.
# To set up the keystore, uncomment the following properties and set 'keystorePass' and 'keyPass' to their actual values.
graphdb.connector.keystoreFile = <path-to-the-keystore-file>
graphdb.connector.keystorePass = <secret>
graphdb.connector.keyAlias = graphdb
graphdb.connector.keyPass = <secret>
# Configure the X.509 certificate revocation status check. Only one of OCSP and CRLDP can be enabled at a time.
# To enable the check you want, set it to true and set the other one to false.
# graphdb.auth.methods.x509.ocsp = true
graphdb.auth.methods.x509.crldp = false
# In the event of OCSP or CRLDP responder downtime, or for certificates that do not provide an Authority Information Access (AIA) extension,
# you can set a Certificate Revocation List (CRL) to Tomcat.
graphdb.auth.methods.x509.crlFile = <path-to-certificate-revocation-list>
# The local database is the default setting but it may be set explicitly as such:
graphdb.auth.database = local
X.509 certificate + LDAP
# Provide a regular expression to extract the username from the certificate. The default is CN=(.*?)(?:,|$).
# If you want to provide a custom expression, uncomment the below and edit it.
graphdb.auth.methods.x509.subject.dn.pattern = CN=(.*?)(?:,|$)
# GraphDB uses the Java implementation of SSL, which requires a configured key in the Java keystore.
# To set up the keystore, uncomment the following properties and set 'keystorePass' and 'keyPass' to their actual values.
# The default is the .keystore file in the operating system home directory of the user that is running GraphDB.
graphdb.connector.keystoreFile = <path-to-the-keystore-file>
graphdb.connector.keystorePass = <secret>
graphdb.connector.keyAlias = graphdb
graphdb.connector.keyPass = <secret>
# Configure the X.509 certificate revocation status check. Only one of OCSP and CRLDP can be enabled at a time.
# To enable the check you want, set it to true and set the other one to false.
# graphdb.auth.methods.x509.ocsp = true
graphdb.auth.methods.x509.crldp = false
# In the event of OCSP or CRLDP responder downtime, or for certificates that do not provide an Authority Information Access (AIA) extension,
# you can set a Certificate Revocation List (CRL) to Tomcat.
graphdb.auth.methods.x509.crlFile = <path-to-certificate-revocation-list>
# Permit access for all users that are part of the “people” unit of the fictional “example.org” organization.
graphdb.auth.ldap.user.search.base = ou=people
graphdb.auth.ldap.user.search.filter = (cn={0})
# Make all users in the Management group GraphDB Repository Managers as well.
graphdb.auth.ldap.role.map.repositoryManager = Management
# Enable all users in the Readers group to read the my_repo repository.
graphdb.auth.ldap.role.map.repository.read.my_repo = Readers
# Enable all users in the Writers group to write and read the my_repo repository.
graphdb.auth.ldap.role.map.repository.write.my_repo = Writers
# Required for accessing an LDAP server that does not allow anonymous binds and anonymous access.
graphdb.auth.ldap.bind.userDn = uid=userId,ou=people,dc=example,dc=org
graphdb.auth.ldap.bind.userDn.password = 123456
Kerberos + Local users
# Enable Kerberos authentication and keep Basic and GDB authentication enabled.
graphdb.auth.methods = basic, gdb, kerberos
# Provides the Kerberos keytab file relative to the GraphDB config directory.
graphdb.auth.kerberos.keytab = graphdb-http.keytab
# Provides the Kerberos principal for GraphDB running at data.example.org and Kerberos users from
# the realm EXAMPLE.ORG.
graphdb.auth.kerberos.principal = HTTP/data.example.org@EXAMPLE.ORG
# Enable Kerberos debug messages (recommended when you first set up Kerberos, can be disabled later).
graphdb.auth.kerberos.debug = true
# The local database is the default setting but it may be set explicitly as such:
graphdb.auth.database = local
Kerberos + LDAP
# Enable Kerberos authentication and keep Basic and GDB authentication enabled.
graphdb.auth.methods = basic, gdb, kerberos
# Provides the Kerberos keytab file relative to the GraphDB config directory.
graphdb.auth.kerberos.keytab = graphdb-http.keytab
# Provides the Kerberos principal for GraphDB running at data.example.org and Kerberos users from
# the realm EXAMPLE.ORG.
graphdb.auth.kerberos.principal = HTTP/data.example.org@EXAMPLE.ORG
# Enable Kerberos debug messages (recommended when you first set up Kerberos, can be disabled later).
graphdb.auth.kerberos.debug = true
# Permit access for all users that are part of the “people” unit of the fictional “example.org” organization.
graphdb.auth.ldap.user.search.base = ou=people
graphdb.auth.ldap.user.search.filter = (cn={0})
# Make all users in the Management group GraphDB Repository Managers as well.
graphdb.auth.ldap.role.map.repositoryManager = Management
# Enable all users in the Readers group to read the my_repo repository.
graphdb.auth.ldap.role.map.repository.read.my_repo = Readers
# Enable all users in the Writers group to write and read the my_repo repository.
graphdb.auth.ldap.role.map.repository.write.my_repo = Writers
# Required for accessing an LDAP server that does not allow anonymous binds and anonymous access.
graphdb.auth.ldap.bind.userDn = uid=userId,ou=people,dc=example,dc=org
graphdb.auth.ldap.bind.userDn.password = 123456
12.4 Encryption
All network traffic between the clients and GraphDB, and between the different GraphDB nodes (in case of a cluster topology), can be performed over either the HTTP or the HTTPS protocol. It is highly advisable to encrypt the traffic with SSL/TLS because of its numerous security benefits.
As GraphDB runs on an embedded Tomcat server, the security configuration is standard with a few exceptions. See more in the official Tomcat documentation on how to enable SSL/TLS.
SSL can be enabled by configuring the following three parameters:
graphdb.connector.SSLEnabled = true
graphdb.connector.scheme = https
graphdb.connector.secure = true
GraphDB uses the Java implementation of SSL, which requires a configured key in the Java keystore.
If you have no Java keystore, you can generate one by using one of the following methods:
Option one: generate a self-signed key. You would have to trust the certificate in all clients, including all nodes that run in a different JVM.
Option two: convert a third-party trusted OpenSSL certificate to a PKCS12 key and then import it into the Java keystore.
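A hedged sketch of both options using the standard keytool and openssl tools; file names, alias, and validity are placeholders, with the alias matching the graphdb.connector.keyAlias used elsewhere in this chapter:
# Option one: generate a self-signed key pair directly in a Java keystore.
keytool -genkeypair -alias graphdb -keyalg RSA -keysize 2048 -validity 365 -keystore graphdb.jks
# Option two: convert an existing certificate and private key to PKCS12, then import it into a Java keystore.
openssl pkcs12 -export -in server.crt -inkey server.key -name graphdb -out server.p12
keytool -importkeystore -srckeystore server.p12 -srcstoretype PKCS12 -destkeystore graphdb.jks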
For any additional encryption information, please refer to the Encryption section or, since GraphDB runs in an embedded Tomcat, to the Tomcat SSL documentation.
In addition to the above settings, you can set any Tomcat Connector attribute through a property:
graphdb.connector.<attribute> = xxx
Currently, GraphDB does not support configuration of the SSLHostConfig part of the Tomcat configuration. So
when configuring SSL, please refer to the Connector attributes and not the SSLHostConfig ones. See the Tomcat
attributes documentation for more information.
Certificate trust
After configuring the GraphDB instance with SSL, certificate trust should be set up between the GraphDB node
and all client nodes communicating with it. Certificate trust can be provided in one of two ways:
Using a certificate signed by a trusted certificate authority (CA): this way, you will not need any additional configuration and the clients will not get a security warning when connecting to the server. The drawback is that these certificates are usually not free and you need to work with a third-party CA. We will not look at this option in more detail, as creating such a certificate is highly dependent on the CA.
Using a self-signed certificate: the benefit is that you generate these certificates yourself and they do not need to be signed by anyone else. However, the drawback is that by default, the nodes will not trust each other's certificates.
If you generate a separate self-signed certificate for each node in the communication, this certificate would have to be present in the Java Truststores of all other nodes. You can do this by either adding the certificate to the default Java Truststore or specifying an additional Truststore when running GraphDB. Information on how to generate a certificate, add it to a Truststore, and make the JVM use this Truststore can be found in the official Java documentation.
However, this method introduces a lot of configuration overhead. Therefore, we recommend that instead of separate certificates for each node, you generate a single self-signed certificate and use it on all nodes. GraphDB extends the standard Java TrustManager, so it will automatically trust its own certificate. This means that if all nodes involved in the communication are using a shared certificate, there is no need to add it to the Truststore.
Another difference from the standard Java TrustManager is that GraphDB has the option to disregard the hostname
when validating the certificates. If this option is disabled, it is recommended to add all possible IPs and DNS
names of all nodes that will be using the certificate as Subject Alternative Names when generating the certificate
(wildcards can be used as well).
Both options, trusting your own certificate and skipping the hostname validation, are configurable from the graphdb.properties file.
GraphDB does not provide encryption for its data. All indexes and entities are stored in binary format on the hard drive, and the data in them can be easily extracted if somebody gains access to the data directory.
This is why it is recommended to implement some kind of disk encryption on your GraphDB server. There are multiple third-party solutions that can be used.
GraphDB has been tested on a LUKS-encrypted hard drive, and no noticeable performance impact has been observed. However, please keep in mind that such an impact may be present, as it is highly dependent on your specific use case.
Audit trail enables accountability for actions. The common use cases are to detect unauthorized access to the
system, trace changes to the configuration, and prevent inappropriate actions through accountability.
You can enable the detailed audit trail log by using the graphdb.audit.role configuration parameter. Here is an example:
graphdb.audit.role=USER
graphdb.audit.repository=WRITE
This will lead to all write operations being logged. Note that auditing read operations also covers write operations.
The detail of the audit trail increases depending on the role that is configured. For example, configuring the audit role to REPO_MANAGER means that access to the repository management resources will be logged, as well as access to the administration resources and the logging form. Configuring the audit role to ADMIN will only log access to the administration resources and the logging form.
The ANY role logs all requests towards resources that require authentication.
The following fields are logged for every successful security check:
• Username
• Source IP address
• Response status code
• Type of request method
• Request endpoint
• X-GraphDB-Repository header value or, if missing, which repository is being accessed
• Serialization of the request headers specified in the graphdb.audit.headers parameter
• Serialization of all input HTTP parameters and the message body, limited by the graphdb.audit.request.max.length parameter
By default, no headers are logged. The graphdb.audit.headers parameter configuring this can take multiple values. For instance, if you want to log two headers, simply list them separated by commas:
graphdb.audit.headers = Referer,User-Agent
The number of bytes from the message body that get logged defaults to 1,024 if the graphdb.audit.request.max.length parameter is not set.
Note: Logs can be space-intensive, especially if you toggle them to level 1 or 2 as described above.
You can configure GraphDB security settings and user profiles and rights from the Workbench under Setup ‣ Users and Access.
For access control, GraphDB implements Spring Security. When an HTTP request is received, Spring Security
intercepts it, verifies the permissions, and either grants or denies access to the requested database resource or API.
GraphDB supports three types of user databases used for authentication and authorization: Local, LDAP, and OAuth. Each of them contains and manages the user information. GraphDB supports five authentication methods: Basic, GDB, OpenID, Kerberos, and X.509. Each authentication method is responsible for a specific type of credentials or tokens.
GraphDB supports encryption in transit with SSL/TLS certificates for encrypting the network traffic between
the clients and GraphDB, and between the different GraphDB nodes (when in a cluster).
GraphDB's detailed security audit trail provides accountability for actions, and is hierarchically structured in audit roles. The level of detail of the audit depends on the role that is configured.
THIRTEEN
GraphDB supports the backup and restore of both a single GraphDB instance and a cluster through its recovery REST API. Both partial (per-repository) and full recovery procedures are available, with optional inclusion of user account data.
Important: As with all operations that involve a REST API, in order to perform a backup or a restore procedure:
• The respective GraphDB instance must be online.
• The cluster must be writable, i.e., the majority of its nodes must be active.
Whether you want to be able to quickly recover your data in case of failure or perform routine admin operations
such as upgrading a GraphDB instance, it is important to prepare an optimal backup & restore procedure.
There are various factors to take into consideration when designing a backup strategy, such as:
• Optimal timing and downtime tolerance for applying a backup
• Read-only tolerance on a single-node setup for creating a backup
• Load-balanced backup creation (the backup is created by one of the followers, so if a quorum exists, updates will still be processed)
• Scope of the backed-up data (e.g., full or per-repository backup, or whether user accounts and settings are included)
• Available system resources, and specifically ensuring enough disk space for the backup
• Frequency of backup creation
As mentioned, backups can either cover all repositories (full data backup) or only selected existing repositories (partial data backup), and they may also include the user accounts and settings.
Cluster backup creation is lock-free: by leveraging the multiple instances and the quorum mechanism, the cluster can create a backup while simultaneously processing updates, provided the deployment has more than two nodes.
A GraphDB instance can be backed up using the /rest/recovery/backup endpoint. To create a backup, simply
POST an HTTP request as shown below.
Option              Description
repositories        List of repositories to be backed up. Specified as JSON in the request body.
                    • If the parameter is missing, all repositories will be included in the backup.
                    • If it is an empty list ([]), no repositories will be included in the backup.
                    • Otherwise, the repositories from the list will be included in the backup.
backupSystemData    Determines whether user account data such as user accounts, saved queries, visual graphs, etc. should be included in the backup. Specified as JSON in the request body. Boolean, the default value is false.
Here is an example cURL request for full data backup creation without system data (i.e., user accounts and
settings):
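A minimal sketch (the -OJ flags save the returned archive under the file name suggested by the server):
curl -X POST -H 'Content-Type: application/json' -OJ '<base_url>/rest/recovery/backup'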
Note: This is an archive file that you do not need to extract – it is to be used as is.
To set the name of the backup yourself, replace -OJ with --output <backup-name>, i.e.:
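For example (graphdb-backup.tar is a placeholder file name):
curl -X POST -H 'Content-Type: application/json' --output graphdb-backup.tar '<base_url>/rest/recovery/backup'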
Here is an example cURL request for partial data backup creation without system data:
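A sketch backing up only the my_repo repository via the repositories parameter described above:
curl -X POST -H 'Content-Type: application/json' -OJ \
  -d '{"repositories": ["my_repo"]}' '<base_url>/rest/recovery/backup'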
Note: If a POST request does not include a list of repositories for backup, it will automatically create a full data
backup.
Here is an example cURL request for full data backup creation including system data:
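A sketch using the backupSystemData parameter described above:
curl -X POST -H 'Content-Type: application/json' -OJ \
  -d '{"backupSystemData": true}' '<base_url>/rest/recovery/backup'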
Here is an example cURL request for creating a backup of system data only:
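A sketch combining an empty repositories list with backupSystemData, so that only the system data is included:
curl -X POST -H 'Content-Type: application/json' -OJ \
  -d '{"repositories": [], "backupSystemData": true}' '<base_url>/rest/recovery/backup'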
Note: If this parameter is not provided, all repositories will be included in the backup.
To create a backup saved in the cloud, the GraphDB instance uses a different endpoint – /rest/recovery/cloud-backup.
Cloud backup has the same options as regular GraphDB backup, with an additional bucketUri parameter that
contains all the information about the cloud bucket. For Amazon’s S3, it uses the following format:
s3://[<endpoint-hostname>:<endpoint-port>]/<bucket-name>/<backup-name>?
region=<AWSRegion>&AWS_ACCESS_KEY_ID=<key-id>&AWS_SECRET_ACCESS_KEY=<access-key>
The endpoint-hostname and endpoint-port values are only used for local S3 clones. To use Amazon S3, these
values should be left blank and the URL should start with three / before the bucket, as below:
s3:///my-bucket/graphdb-backups/<backup-name>?region=eu-west-1&AWS_ACCESS_KEY_ID=secretKey&AWS_SECRET_ACCESS_K
Here is an example cURL request for full data backup creation with system data (the bucketUri value is a placeholder following the format shown above):
curl -X POST -H 'Content-Type: application/json' -d '{
  "bucketUri": "s3:///my-bucket/graphdb-backups/<backup-name>?region=eu-west-1&AWS_ACCESS_KEY_ID=secretKey&AWS_SECRET_ACCESS_KEY=secret",
  "backupOptions": { "backupSystemData": true }
}' '<base_url>/rest/recovery/cloud-backup'
The backupOptions parameter is optional. If nothing is passed for it, the default values of the options will be used.
The backup examples from above are also valid for the cloud backup. As long as the cloud backup is provided
with the same backupOptions and the bucketUri is valid, the resulting backup .tar file should be the same.
A GraphDB instance or cluster can be restored to a backed-up state through the /rest/recovery/restore endpoint.
The recovery procedure in the cluster is treated as a simple update, as it leverages the Raft protocol that allows a set of distributed nodes to act as one.
Important: It is recommended to perform a cluster transaction log truncate operation after a successful data restore, as the transaction log will use more storage space after a backup/restore procedure.
Option                    Description
repositories              List of repositories to recover from the backup. Specified as JSON in the request body.
                          • If the parameter is missing, all repositories that are in the backup will be restored.
                          • If it is an empty list ([]), no repositories from the backup will be restored.
                          • Otherwise, the repositories from the list will be restored.
restoreSystemData         Determines whether GraphDB should restore user account data such as user accounts, saved queries, visual graphs, etc. from the backup or continue with their current state. Specified as JSON in the request body. If no system data is found in the backup, an error will be returned. Boolean, the default is false.
removeStaleRepositories   Cleans up other existing repositories on the GraphDB instance where the restore is done. The default is false, meaning that no repositories will be cleaned.
If we have successfully created a backup and want to completely revert to the backed-up state while preserving the existing repositories on the instance where we are restoring, we can use the below cURL request example. No additional parameters are provided, meaning that defaults are applied.
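A sketch of such a request – it assumes the backup archive is uploaded as a multipart form field named file, which is not confirmed above:

curl -X POST -F 'file=@full-data-backup-name.tar' '<base_url>/rest/recovery/restore'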
Note: The full-data-backup-name.tar file must be a full data backup created as shown here.
We can also apply a backup and remove repositories that are not restored from it.
Here, we need to provide the names of the repositories that we want to restore as values for the repositories
parameter.
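A sketch under the same multipart assumption; repo1 is a placeholder repository name, and the restoreOptions field name is also an assumption:

curl -X POST \
     -F 'file=@full-data-backup-name.tar' \
     -F 'restoreOptions={"repositories": ["repo1"], "removeStaleRepositories": true};type=application/json' \
     '<base_url>/rest/recovery/restore'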
To restore only the system data from a backup, we can use the following cURL request:
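A sketch under the same assumptions as above – an empty repositories list combined with restoreSystemData:

curl -X POST \
     -F 'file=@full-data-system-backup-name.tar' \
     -F 'restoreOptions={"repositories": [], "restoreSystemData": true};type=application/json' \
     '<base_url>/rest/recovery/restore'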
Note: The full-data-system-backup-name.tar file must contain system data, i.e., the backup
must be created with backupSystemData = true as shown here.
To restore from a backed-up state saved on cloud storage, the GraphDB instance uses a different endpoint – /rest/recovery/cloud-restore.
Cloud restore has the same options as regular GraphDB restore, with an additional bucketUri parameter that
contains all the information about the cloud bucket. For Amazon’s S3, it uses the following format:
s3://[<endpoint-hostname>:<endpoint-port>]/<bucket-name>/<backup-name>?
region=<AWSRegion>&AWS_ACCESS_KEY_ID=<key-id>&AWS_SECRET_ACCESS_KEY=<access-key>
The endpoint-hostname and endpoint-port values are only used for local S3 clones. To use Amazon S3, these
values should be left blank and the URL should start with three / before the bucket, as below:
s3:///my-bucket/graphdb-backups/<backup-name>?region=eu-west-1&AWS_ACCESS_KEY_ID=secretKey&AWS_SECRET_ACCESS_K
Here is an example cURL request for applying a backup and removing all repositories that are not restored from
it:
curl -X POST -H 'Content-Type: application/json' -d '{
    "bucketUri": "s3:///my-bucket/graphdb-backups/<backup-name>?region=<AWSRegion>&AWS_ACCESS_KEY_ID=<key-id>&AWS_SECRET_ACCESS_KEY=<access-key>",
    "restoreOptions": {"removeStaleRepositories": true}
}' '<base_url>/rest/recovery/cloud-restore'
The restoreOptions parameter is optional. If nothing is passed for it, the default values of the options will be
used.
The restore examples from above are also valid for the cloud restore. As long as the cloud restore endpoint is provided with the same restoreOptions and the bucketUri points to a valid GraphDB backup file, the resulting restore should be the same.
FOURTEEN
Tracking a single request through a distributed system is an issue due to the scattered nature of the logs. Therefore, GraphDB offers the capability to track particular request ID headers, or to generate them itself if need be. This allows for easier auditing and system monitoring. Headers will be intercepted when a request comes into the database and passed onwards together with the response. Request tracking is turned off by default, and can be enabled by adding graphdb.append.request.id.headers=true to the graphdb.properties file. The property is already present in the default configuration file, but needs to be uncommented to work.
By default, GraphDB scans all incoming requests for an X-Request-ID header. If no such header exists, it assigns
to the incoming request a random ID in the UUID type 5 format.
Some clients and systems assign alternative names to their request identifiers. Those can be listed in the following
format:
graphdb.request.id.alternatives=my-request-header-1, outside-app-request-header
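For instance, once graphdb.append.request.id.headers=true is enabled, a client can pass its own identifier and find it returned with the response (the header value and the endpoint chosen here are just illustrative):

# -i prints the response headers so the returned X-Request-ID can be inspected
curl -i -H 'X-Request-ID: my-trace-0001' 'http://localhost:7200/rest/repositories'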
Value Description
repository-state   Checks the state of the repository. Returns message RUNNING, STARTING, or INACTIVE. RUNNING and INACTIVE result in green health, and all other states result in yellow health.
read-availability   Checks whether the repository is readable.
storage-folder   Checks if there are at least 20 megabytes writable left for the storage folder. The amount can be controlled with the system parameter health.minimal.free.storage.
long-running-queries   Checks if there are queries running longer than 20 seconds. The time can be controlled with the system parameter health.max.query.time.seconds. If a query is running for more than 20 seconds, it is either a slow one, or there is a problem with the database.
predicates-statistics   Checks if the predicate statistics contain correct values.
plugins   Provides aggregated health checks for the individual plugins.
The aggregated GraphDB health checks include checks for dependent services and components such as plugins and connectors.
Each connector plugin is reported independently as part of the composite “plugins” check in the repository health
check. Each connector’s check is also a composite where each component is an individual connector instance.
The output may look like this:
{
  "name": "wine",
  "status": "green",
  "components": [
    {
      "name": "read-availability",
      "status": "green"
    },
    {
      "name": "storage-folder",
      "status": "green"
    },
    {
      "name": "long-running-queries",
      "status": "green"
    },
    {
      "name": "predicates-statistics",
      "status": "green"
    },
    {
      "name": "plugins",
      "status": "yellow",
      "components": [
        {
          "name": "elasticsearch-connector",
          "status": "green",
          "components": [
          ]
        }
      ]
    }
  ]
}
An individual check run involves sending a query for all documents to the connector instance, and the result is:
• green – more than zero hits
• yellow – zero hits or failing shards (the shards check applies only to Elasticsearch)
• red – unable to execute the query
In all of these cases, including the green status, there is also a message providing details, e.g., “query took 15 ms, 5 hits, 0 failing shards”.
To run the health checks for a particular repository, in the example myrepo, execute the following cURL command:
curl 'http://localhost:7200/repositories/myrepo/health?'
In passive check mode, the repository state is checked first to determine whether it is safe to do an active check.
• Immediate passive: Activated by passing ?passive to the health endpoint.
– If the state is RUNNING, do an active check.
– If the state is something else (e.g., INACTIVE or STARTING), return immediately with a simple check
that only lists the state.
• Delayed passive (if needed): Tries to get the repository for up to N seconds. Activated by passing ?passive=N to the endpoint, where N is a timeout in seconds (see the example below).
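For example, a delayed passive check with a 30-second timeout against the myrepo repository used above could look like this; the two responses below illustrate what such a simple check returns for an INACTIVE and a STARTING repository:

curl 'http://localhost:7200/repositories/myrepo/health?passive=30'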
{
"status" : "green",
"components" : [
{
"status" : "green",
"name" : "repository-state",
"message" : "INACTIVE"
}
],
"name" : "test"
}
{
"status" : "yellow",
"components" : [
{
"status" : "yellow",
"name" : "repository-state",
"message" : "STARTING"
}
],
"name" : "test"
}
Note: From GraphDB 9.7.x onwards, legacy health checks are no longer supported.
GraphDB offers several options for system monitoring described in detail below.
In the respective tabs under Monitor → Resources in the GraphDB Workbench, you can monitor the most important hardware information as well as other application-related metrics:
• Resource monitoring: system CPU load, file descriptors, heap memory usage, off-heap memory usage, and disk storage.
• Performance (per repository): queries, global page cache, entity pool, and transactions and connections.
• Cluster health (in case a cluster exists).
The GraphDB REST API exposes several monitoring endpoints suitable for scraping by Prometheus. They return
a suitable data format when the request has an Accept header of the type text/plain, which is the default type for
Prometheus scrapers.
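For example, the infrastructure metrics can be fetched in the Prometheus text format like this (assuming a local instance on port 7200):

curl -H 'Accept: text/plain' 'http://localhost:7200/rest/monitor/infrastructure'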
The /rest/monitor/structures endpoint enables you to monitor GraphDB structures – the global page cache
and the entity pool. This provides a better understanding of whether the current GraphDB configuration is optimal
for your specific use case (e.g., repository size, query complexity, etc.)
The current state of the global page cache and the entity pool are returned via the following metrics:
Parameter Description
graphdb_cache_hit GraphDB’s global page cache hits count. Along with the global page cache miss
count, this metric can be used to diagnose a small or oversized global page cache.
• In ideal conditions, the percentage of hits should be over 96%.
• If it is below 96%, it might be a good idea to increase its size.
• If it is over 99%, it might be worth experimenting with a smaller global page
cache size.
The /rest/monitor/infrastructure endpoint enables you to monitor the hardware and JVM resources of the GraphDB instance via the following metrics:
Parameter Description
graphdb_open_file_descriptors   Count of currently open file descriptors. This helps diagnose slowdowns of the system or a slow storage if the number remains high for a longer period of time.
graphdb_cpu_load   Shows the current CPU load for the entire system in %.
graphdb_heap_max_mem   Maximum available memory for the GraphDB instance. Returns -1 if the maximum memory size is undefined.
graphdb_heap_init_mem   Initial amount of memory (controlled by -Xms) in bytes.
graphdb_heap_committed_mem   Current committed memory in bytes.
graphdb_heap_used_mem   Current used memory in bytes. Along with the rest of the memory-related properties, this can be used to detect memory issues.
graphdb_mem_garbage_collections_count   Count of full garbage collections from the start of the GraphDB instance. This metric is useful for detecting memory usage issues and system “freezes”.
graphdb_nonheap_init_mem   Off-heap initial memory in bytes.
graphdb_nonheap_max_mem   Maximum direct memory. Returns -1 if undefined.
graphdb_nonheap_committed_mem   Current off-heap committed memory in bytes.
graphdb_nonheap_used_mem   Current off-heap used memory in bytes.
graphdb_data_dir_used   Used storage space on the partition where the data directory sits, in bytes. This is useful for detecting a soon-out-of-hard-disk-space issue along with the free storage metric.
graphdb_data_dir_free   Free storage space on the partition where the data directory sits, in bytes.
graphdb_logs_dir_used   Used storage space on the partition where the logs directory sits, in bytes. This is useful for detecting a soon-out-of-hard-disk-space issue along with the free storage metric.
graphdb_logs_dir_free   Free storage space on the partition where the logs directory sits, in bytes.
graphdb_work_dir_used   Used storage space on the partition where the work directory sits, in bytes. This is useful for detecting a soon-out-of-hard-disk-space issue along with the free storage metric.
graphdb_work_dir_free   Free storage space on the partition where the work directory sits, in bytes.
graphdb_threads_count   Current used threads count.
Via the /rest/monitor/cluster endpoint, you can monitor GraphDB’s cluster statistics in order to diagnose problems and cluster slowdowns more easily. The endpoint returns several cluster-related metrics, and will not return anything if a cluster is not created.
Parameter Description
graphdb_leader_elections_count   Count of leader elections from cluster creation. If there are a lot of leader elections, this might mean an unstable cluster setup with nodes that are not always properly operating.
graphdb_failure_recoveries_count   Count of total failure recoveries in the cluster from cluster creation. Includes failed and successful recoveries. If there are a lot of recoveries, this indicates issues with the cluster stability.
graphdb_failed_transactions_count   Count of failed transactions in the cluster.
graphdb_nodes_in_cluster   Total nodes count in the cluster.
graphdb_nodes_in_sync   Count of nodes that are currently in sync. If a lower number than the total nodes count is reported, this means that there are nodes that are either out of sync, disconnected, or syncing.
graphdb_nodes_out_of_sync   Count of nodes that are out of sync. If there are such nodes for a longer period of time, this might indicate a failure in one or more nodes.
graphdb_nodes_disconnected   Count of nodes that are disconnected. If there are such nodes for a longer period of time, this might indicate a failure in one or more nodes.
graphdb_nodes_syncing   Count of nodes that are currently syncing. If there are such nodes for a longer period of time, this might indicate a failure in one or more nodes.
Via the /rest/monitor/repository/{repositoryID} endpoint, you can monitor GraphDB’s query and transaction statistics in order to obtain a better understanding of the slow queries, suboptimal queries, active transactions, and open connections. This information helps in identifying possible issues more easily.
The endpoint exists for each repository, and a scrape configuration must be created for each repository that you want
to monitor. Normally, repositories are not created or deleted frequently, so the Prometheus scrape configurations
should not be changed often either.
Important: In order for GraphDB to be able to return these metrics, the repository must be initialized.
Parameter Description
graphdb_slow_queries_count   Count of slow queries executed on the repository. The counter is reset when the GraphDB instance is restarted. If the count of slow queries is high, this might indicate a setup issue, unoptimized queries, or insufficient hardware.
graphdb_suboptimal_queries_count   Count of queries that the GraphDB engine was not able to evaluate and that were sent for evaluation to the RDF4J engine. A too high number might indicate that the queries typically used on the repository are not optimal.
graphdb_active_transactions   Count of currently active transactions.
graphdb_open_connections   Count of currently open connections. If this number stays high for a longer period of time, it might indicate an issue with connections not being closed once their job is done.
graphdb_entity_pool_reads   GraphDB’s entity pool reads count. Along with the entity pool writes count, this metric can be used to diagnose a small or oversized entity pool.
graphdb_entity_pool_writes   GraphDB’s entity pool writes count.
graphdb_epool_size   Current entity pool size, i.e., entity count in the entity pool.
Prometheus setup
To scrape the mentioned endpoints in Prometheus, we need to add scraper configurations. Below is an example
configuration for three of the endpoints, assuming we have a repository called “wines”.
- job_name: graphdb_queries_monitor
metrics_path: /rest/monitor/repository/wines
scrape_interval: 5s
static_configs:
- targets: [ 'my-graphdb-hostname:7200' ]
- job_name: graphdb_hw_monitor
metrics_path: /rest/monitor/infrastructure
scrape_interval: 5s
static_configs:
- targets: [ 'my-graphdb-hostname:7200' ]
- job_name: graphdb_structures_monitor
metrics_path: /rest/monitor/structures
scrape_interval: 5s
static_configs:
- targets: [ 'my-graphdb-hostname:7200' ]
Cluster monitoring
When configuring Prometheus to monitor a GraphDB cluster, the setup is similar with a few differences.
In order to get the information for each cluster node, each node’s address must be included in the targets list.
The other difference is that another scraper must be configured to monitor the cluster status. This scraper can be
configured in several ways:
• Scrape only the external proxy (which will always point to the current cluster leader) if it exists in the
current cluster configuration.
The downside of this method is that if for some reason, there is a connectivity problem between
the external proxy and the nodes, it will not report any metrics.
• Scrape the external proxy and all cluster nodes.
This method will enable you to receive metrics from all cluster nodes, including the external proxy. This way, you can see the cluster status even if the external proxy has issues connecting to the nodes. The downside is that, most of the time, each cluster metric will be duplicated for each cluster node.
• Scrape all cluster nodes (if there is no external proxy).
If there is no external proxy in the cluster setup, the only option is to monitor all nodes in order
to determine the status of the entire cluster. If you choose only one node and it is down for some
reason, you would not receive any cluster-related metrics.
The scraper configuration is similar to the previous ones, with the only difference that the targets array might
contain one or more cluster nodes (and/or external proxies). For example, if you have a cluster with two external
proxies and five cluster nodes, the scraper might be configured to scrape only the two proxies like so:
- job_name: graphdb_cluster_monitor
metrics_path: /rest/monitor/cluster
scrape_interval: 5s
static_configs:
- targets: [ 'graphdb-proxy-0:7200', 'graphdb-proxy-1:7200' ]
As mentioned, you can also include some or all of the cluster nodes if you want.
The database employs a number of metrics that help tune the memory parameters and performance. They can be
found in the JMX console under the com.ontotext.metrics package. The global metrics that are shared between
the repositories are under the top level package, and those specific to repositories under com.ontotext.metrics.
<repository-id>.
The global page cache provides metrics that help tune the amount of memory given for the page cache. It contains
the following elements:
Parameter Description
cache.flush   Counter for the pages that are evicted out of the page cache and the amount of time it takes for them to be flushed on the disc.
cache.hit   Number of hits in the cache. This can be viewed as the number of pages that do not need to be read from the disc but can be taken from the cache.
cache.load   Counter for the pages that have to be read from the disc. The smaller the number of pages is, the better.
cache.miss   Number of cache misses. The smaller this number is, the better. If you see that the number of hits is smaller than the misses, then it is probably a good idea to increase the page cache memory.
You can monitor the number of reads and writes in the entity pool of each repository with the following parameters:
Parameter Description
epool.read Counter for the number of reads in the entity pool.
epool.write Counter for the number of writes in the entity pool.
It is essential to gather as many details as possible about an issue once it appears. For this purpose, we provide
utilities that generate such issue reports by collecting data from various log files, JVM, etc. Using these issue
reports helps us to investigate the problem and provide an appropriate solution as quickly as possible.
14.4.1 Report
GraphDB provides an easy way to gather all important system information and package it as an archive that can be sent to graphdb-support@ontotext.com. Run the report using the GraphDB Workbench, or from the generate-report script in the bin directory of your distribution. The report is saved in the GraphDBWork/report directory. There is always one report – the one that has been generated most recently.
Report content
• GraphDB version
• recursive directory list of the files in GraphDBHome as home.txt
• recursive directory list of the files in GraphDBWork as work.txt
• recursive directory list of the files in GraphDBData as data.txt
• the 30 most recent log files from GraphDBLogs ordered by time of creation
• full copy of the content of GraphDBConf
• the output from jcmd GC.class_histogram as jcmd_histogram.txt
• the output from jcmd Thread.print as thread_dump.txt
• the System Properties for the GraphDB instance
• the repository configurations info as system.ttl. All repositories can be found in this file.
• the owlim.properties file for each repository if found. It exists only when the repository has been initialized
at least once.
In a cluster, the report can be run from each node in the group. It adds the following to the standard report:
• Report data for each node in the cluster
Go to Help → System information. Click on New report in the Application info tab to obtain a new one, wait until
it is ready, and download it. It is downloaded in .zip format as graphdb-server-report-<timestamp>.
The generate-report script can be found in the bin folder in the GraphDB distribution. It needs the graphdb-pid of the GraphDB instance for which you want a report. An optional argument is output-file, the default for which is graphdb-server-report.zip.
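For example (my-graphdb-report.zip is a placeholder output name, and graphdb.pid is assumed here to be the PID file written when GraphDB was started with -p):

# read the process ID from the PID file and write the report to a custom file
bin/generate-report $(cat graphdb.pid) my-graphdb-report.zip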
14.4.2 Logs
GraphDB uses slf4j for logging through the Logback implementation (the RDF4J facilities for log configuration
discovery are no longer used). Instead, the whole distribution has a central place for the logback.xml configuration
file in GraphDBHOME/conf/logback.xml. If you use the .war file setup, you can provide the log file location
through a system parameter, or we will pick it up from the generated .war file.
Note: Check the Logback configuration location rules for more information.
The default root logger is set to info. You can change it in several ways:
• Edit the logback.xml configuration file.
Note: You do not have to restart the database as it will check the file for changes every 30 seconds, and
will reconfigure the logger.
• Change the log level through the logback JMX configurator. For more information, see the Logback manual
chapter 10.
• Start each component with graphdb.logger.root.level set to your desired root logging level. For example:
bin/graphdb -Dgraphdb.logger.root.level=WARN
Logs location
By default, all database components and tools log in GraphDBHOME/logs when run from the bin folder. If you
set up GraphDB by deploying .war files into a standalone servlet container, the following rules apply:
1. To log in a specified directory, set the logDestinationDirectory system property.
2. If GraphDB is run in Tomcat, the logs can be found in $catalina.base/logs/graphdb.
3. If GraphDB is run in Jetty, the logs can be found in $jetty.base/logs/graphdb.
4. Otherwise, all logs are in the logs subdirectory of the current working directory for the process.
Log files
Different information is logged in different files. This makes it easier to follow what goes on in different parts of
the system.
FIFTEEN
Run GraphDB in a Docker container: If you are into Docker and containers, we provide ready-to-use Docker images.
Run GraphDB with Helm charts: From version 9.8 onwards, GraphDB can be deployed with open-source Helm charts. See how to set up complex GraphDB deployments on Kubernetes.
SIXTEEN
GRAPHDB WORKBENCH
The Workbench is the web-based administration interface to GraphDB. It lets you administrate GraphDB, as well as load, transform, explore, manage, query, and export data.
The Workbench layout consists of two main areas. The navigation area is on the left-hand side of the screen and contains drop-down menus to all functionalities: Import, Explore, SPARQL, Monitor, Setup, and Help. The work area shows the tasks associated with the selected functionality. The home page provides easy access to some of the actions in the Workbench such as creating a repository, attaching a location, finding a resource, querying your data, etc. At the bottom of the page, you can see the license details, and in the footer the versions of the various GraphDB components.
16.1 Functionalities
Explore
• Graphs overview –> See a list of the default graph and all named graphs
in GraphDB. Use it to inspect the statements in each graph, export the
graph, or clear its data.
• Class hierarchy –> Explore the hierarchy of RDF classes by number of
instances. The biggest circles are the parent classes and the nested ones
are their children. Hover over a given class to see its subclasses or zoom
in a nested circle (RDF class) for further exploration.
• Class relationships –> Explore the relationships between RDF classes, where a relationship is represented by links between the individual instances of two classes. Each link is an RDF statement where the subject is an instance of one class, the object is an instance of another class, and the link is the predicate. Depending on the number of links between the instances of two classes, the bundle can be thicker or thinner and it gets the color of the class with more incoming links. The links can be in both directions.
• Visual graph –> Explore your data graph in a visual way. Start from a
single resource and the resources connected to it, or from a graph query
result. Click on a resource to expand its connections as well.
• Similarity –> Look up semantically similar entities and text.
SPARQL
• SPARQL –> Query and update your data. Use any type of SPARQL
query and click Run to execute it.
Monitor
• Queries and Updates –> Monitor all running queries or updates in GraphDB. Any query or update can be killed by pressing the Abort button.
• Resources –> Monitor:
   – The usage of various system resources: system CPU load, file descriptors, heap memory usage, off-heap memory usage, and disk storage.
   – The performance of: queries, global page cache, entity pool, and transactions and connections.
   – Cluster health (in case a cluster exists).
Setup
• Namespaces –> View and manipulate the RDF namespaces for the active repository. You need a write permission to add or delete namespaces.
• Autocomplete –> Enable/disable the autocomplete index and check its
status. It is used for automatic completion of URIs in the SPARQL editor
and the View Resource page.
• RDF Rank –> Identify the more important or popular entities in your
repository by examining their interconnectedness determined by the
RDF Rank algorithm. Their popularity can then be used to order query
results.
• JDBC –> Configure the JDBC driver to allow SQL access to repository
data.
• SPARQL Templates –> Create and store predefined SPARQL templates for future updates of repository data.
• License –> View the details of your current GraphDB license and set or
revert to a different one.
Help
• Interactive guides –> A set of interactive guides that will lead you through various GraphDB functionalities using the Workbench user interface.
• REST API –> REST API documentation of all available public RESTful endpoints together with an interactive interface for executing requests.
• Documentation –> Link to the GraphDB public documentation.
• Developer Hub –> Link to the GraphDB dev hub – a hands-on compendium to the GraphDB documentation that gives practical advice and tips on accomplishing real-world tasks.
• Support –> Link to the GraphDB support page.
• System information –> See the configuration values of the JVM running the GraphDB Workbench: Application Info, JVM Arguments, and Workbench Configuration properties. You can also generate a detailed server report file that you can use to hunt down issues.
These settings help you to configure the default behavior of the GraphDB Workbench.
The Workbench interface has some useful options that change only the way you query the database, without changing the rest of GraphDB's behavior:
• Expand results over owl:sameAs This is the default value for the Expand results over owl:sameAs option
in the SPARQL editor. It is taken each time a new tab is created. Note that once you toggle the value in the
editor, the changed value is saved in your browser, so the default is used only for new tabs. The setting is
also reflected in the Graph settings panel of the Visual graph.
• Default Inference value Same as above, but for the Include inferred data in results option in the SPARQL
editor. The setting is also reflected in the Graph settings panel of the Visual graph.
• Show schema by default in visual graph This includes or excludes predicates from owl:, rdf:, rdfs:,
sesame:, dul:, prov:, fibo:, wd:.
• Count total results in SPARQL editor For each query without a limit sent through the SPARQL editor, an additional query is sent to determine the total number of results. This value is needed both for your information and for results pagination. In some cases, you do not want this additional query to be executed, because, for example, the evaluation may be too slow for your data set. Set this option to false in this case.
• Ignore shared saved queries in SPARQL editor In the SPARQL editor, saved queries can be shared, and
you can choose not to see them.
Application settings are user-based. When security is ON, each user can access their own settings through the Setup → My Settings menu. The admin user can also change other users’ settings through Setup → User and access → Edit user.
When security is OFF, the settings are global for the application and available through Setup → My Settings.
When free access is ON, only the admin can edit the Free Access configuration, which applies to the anonymous
user.
The Autocomplete index offers suggestions for the IRIs’ local names in the SPARQL editor, the View Resource page, and in the Search RDF resources box. It is an open-source GraphDB plugin that builds an index over all IRIs in the repository plus some additional well-known IRIs from RDF4J vocabularies.
The index is disabled by default. In the Workbench, you can enable it from Setup → Autocomplete.
In case you are getting peculiar results and you think the index might be broken, use the Build Now button.
If you try to use autocompletion before it is enabled, a tooltip will warn you that the index is off and provide a link
for building it.
You can also enable it with a SPARQL query from the Workbench SPARQL editor.
All IRIs and their labels are split into words (tokens). During search, the whole words or their beginnings are
matched.
For each IRI, the index includes the following:
• The text of the IRI local name is tokenized;
• If the IRI is part of a triple <IRI rdfs:label ?label>, the text of the label literal is tokenized and indexed;
• If the IRI is part of a triple <IRI ?p ?label>, and ?p is added to the index config as label predicate, then the
text of the ?label is tokenized and indexed for this IRI. You can add a new label via the righthand button
in the Autocomplete screen, which will open this dialog box:
Local names are split by special characters (e.g., _, -), or in cases when they contain camelCase and/or numbers.
For example:
Search strings
You can search for one or more words. When searching for multiple words, they can be separated with space, or
with - and _ symbols, in which case these will be required to be present in the matched text as well. You can also
use camelCase notation to split the search string into multiple words.
Once the search string has been split into words, search is caseinsensitive. When typing multiple words, each of
them is treated as full match search and must be fully typed except for the last one, which is treated as startsWith.
The order of the search string words is irrelevant, e.g., whiteWin would return the same results as wineWhit.
Some examples of search strings and the IRIs they match:
“ukwal” matches https://www.bbc.com/news/ukwales44849196
“63” matches http://purl.org/dc/terms/ISO6393
For the examples below, we will be using the W3C wine ontology dataset that you can import in your repository.
To start autocompletion in the SPARQL editor, use the shortcuts Alt+Enter / Ctrl+Space / Cmd+Space depending
on your OS and the way you have set up your shortcuts.
You can use autocompletion to:
• Search for a single word in all IRIs:
• Indexed text is split where digits or digit sequences are found, so you can also search by number:
To use the autocompletion feature to find a resource, go to the GraphDB home page, and start typing in the View
resource field.
You can also autocomplete resources in the Search RDF resource box, which is visible in all GraphDB screens in
the top right corner and works the same way as the View resource field in the home page. Clicking the icon will
open a search field where you can explore the resources in the repository.
You can also work with the autocomplete index via SPARQL queries in the Workbench SPARQL editor. Some
important examples:
• Check if the index is enabled
ASK WHERE {
_:s <http://www.ontotext.com/plugins/autocomplete#enabled> ?o .
}
• Enable the index

INSERT DATA {
    _:s <http://www.ontotext.com/plugins/autocomplete#enabled> true .
}
• Find IRIs whose tokens match a search string (here, “win”)

SELECT ?s WHERE {
    ?s <http://www.ontotext.com/plugins/autocomplete#query> "win"
}
SEVENTEEN
The GraphDB distribution includes a number of command line tools located in the bin directory. Their file extensions are .sh or empty for Linux/Unix, and .cmd for Windows.
17.1 console
Option Description
-c,--cautious Always answer no to (suppressed) confirmation prompts
-d,--dataDir <arg> Sesame data directory to ‘connect’ to
-e,--echo Echoes input back to stdout, useful for logging script sessions
-f,--force Always answer yes to (suppressed) confirmation prompts
-h,--help Print this help
-q,--quiet Suppresses prompts, useful for scripting
-s,--serverURL <arg> URL of Sesame server to connect to, e.g., http://localhost/openrdf-sesame/
-v,--version Print version information
-x,--exitOnError Immediately exit the console on the first error
17.2 generate-report
This tool is used to generate a .zip file with a report about a GraphDB server. On startup, graphdb -p specifies a PID file to which to write the process ID, which is needed by this tool.
Usage: generate-report <graphdb-pid> [<output-file>].
The available options are:
Option Description
<graphdb-pid>   (Required) The process ID of a running GraphDB instance.
<output-file>   (Optional) The path of the file where the report should be saved. If this option is missing, the report will be saved in a file called graphdb-server-report.zip in the current directory.
17.3 graphdb
The graphdb command line tool starts the database. It supports the following options:
Option Description
-d   daemonize (run in background), not available on Windows
-s   run in server-only mode (no Workbench UI)
-p pidfile   write PID to pidfile
-h, --help   print command line options
-v   print GraphDB version, then exit
-Dprop   set Java system property
-Xprop   set non-standard Java system property
Note: Run graphdb -s to start GraphDB in server-only mode without the web interface (no Workbench). A remote Workbench can still be attached to the instance.
17.4 importrdf
The importrdf tool is used for offline loading of datasets. It supports two subcommands: Load and Preload.
See more about loading data with ImportRDF here.
Note: The --partial-load option will load data up to the first corrupt line of the file.
The mode specifies the way the data is loaded in the repository:
• serial: parsing is followed by entity resolution, which is then followed by load, followed by inference, all
done in a single thread.
• parallel: using multithreaded parse, entity resolution, load, and inference. This gives a significant boost
when loading large datasets with enabled inference.
Tip: For loading datasets larger than several billion RDF statements, consider using the Preload subcommand.
17.5 rdfvalidator
17.6 reification-convert
This tool converts standard RDF reification to RDF-star. The output file must be in an RDF-star format.
Usage: reification-convert [--relaxed] <input-file1> [<input-file2> ...] <output-file>.
Available options:
Option Description
--relaxed Enables relaxed mode where x a rdf:Statement is not required.
17.7 rule-compiler
Usage: rule-compiler <rules.pie> <java-class-name> <output-class-file> [<partial>].
Available options:
Option Description
<rules.pie> The name of the rule .pie file
<java-class-name> The name of the Java class
<output-class-file> The output file name
[<partial>] (Optional)
17.8 storage-tool
bin/storage-tool --help
Note: The tool works only on repository images that are not in use (i.e., when the database is down).
Command Description
scan   Scans repository index(es) and prints statistics for the number of statements and repo consistency.
rebuild   Uses the source index (src-index) to rebuild the destination index (dest-index). If src-index = dest-index, compacts dest-index. If src-index is missing and dest-index = predicates, then it just rebuilds dest-index.
replace   Replaces an existing entity origin-uri with a non-existing one repl-uri.
repair   Repairs the repository indexes and restores data; a better variant of the merge index.
export   Uses the source index (src-index) to export repository data to the destination file (dest-file). Supported destination file extension formats: .trig, .ttl, .nq.
epool   Scans the entity pool for inconsistencies and checks for invalid IRIs. IRIs are validated against the RFC 3987 standard. Invalid IRIs will be listed in an entities.invalid.log file for review. If -fix is specified, instead of listing the invalid IRIs, they will be fixed in the entity pool.
--help   Prints command-specific help messages.
17.8.2 Options
17.8.3 Examples
• scan the repository, print statement statistics and repository consistency status:
– when everything is OK
_______________________scan results_______________________
mask | pso | pos | diff | flags
0001 | 29,937,266 | 29,937,266 | OK | INF
0002 | 61,251,058 | 61,251,058 | OK | EXP
0005 | 145 | 145 | OK | INF RO
0006 | 8,134 | 8,134 | OK | EXP RO
0009 | 1,661,585 | 1,661,585 | OK | INF HID
000a | 2,834,694 | 2,834,694 | OK | EXP HID
0011 | 1,601,875 | 1,601,875 | OK | INF EQ
0012 | 1,934,013 | 1,934,013 | OK | EXP EQ
0020 | 309 | 221 | OK | DEL
0021 | 15 | 23 | OK | INF DEL
0022 | 34 | 30 | OK | EXP DEL
_______________________additional checks_______________________
| pso | pos | stat | check-type
| 59b30d4d | 59b30d4d | OK | checksum
| 0 | 0 | OK | not existing ids
| 0 | 0 | OK | literals as subjects
– when there is a problem (in this case, with the literals index)
_______________________scan results_______________________
mask | pso | pos | diff | flags
0001 | 29,284,580 | 29,284,580 | OK | INF
0002 | 63,559,252 | 63,559,252 | OK | EXP
0004 | 8,134 | 8,134 | OK | RO
0005 | 1,140 | 1,140 | OK | INF RO
0009 | 1,617,004 | 1,617,004 | OK | INF HID
000a | 3,068,289 | 3,068,289 | OK | EXP HID
0011 | 1,599,375 | 1,599,375 | OK | INF EQ
0012 | 2,167,536 | 2,167,536 | OK | EXP EQ
0020 | 327 | 254 | OK | DEL
0021 | 11 | 12 | OK | INF DEL
0022 | 31 | 24 | OK | EXP DEL
004a | 17 | 17 | OK | EXP HID MRK
_______________________additional checks_______________________
| pso | pos | stat | check-type
| ffffffff93e6a372 | ffffffff93e6a372 | OK | checksum
| 0 | 0 | OK | not existing ids
| 0 | 0 | OK | literals as subjects
| 0 | 0 | OK | literals as predicates
| 0 | 0 | OK | literals as contexts
| 0 | 0 | OK | blanks as predicates
| true | true | OK | page consistency
| bf55ab00 | bf55ab00 | OK | cpso crc
| - | - | OK | epool duplicate ids
| - | - | OK | epool consistency
| - | - | ERR | literal index consistency
The literals index contains more statements than the literals in epool, and you have to rebuild it:
• scan the PSO index and print a status message every 60 seconds:
• rebuild the POS index from the PSO index and compact POS:
• dump the repository data using the POS index into a f.trig file:
• scan the entity pool and create a report with invalid IRIs, if such exist:
EIGHTEEN
TUTORIALS
GraphDB Fundamentals builds the basis for working with graph databases that implement the W3C standards, and particularly GraphDB. It is a training class delivered in a series of ten videos that will accompany you in your first
steps of using triplestore graph databases.
RDF is a standardized format for graph data representation. This module introduces RDF, what RDFS adds to it, and how to use it by easy-to-follow examples from “The Flintstones” cartoon.
SPARQL is a SQL-like query language for RDF data. It is recognized as one of the key tools of the semantic technology and was made a standard by W3C. This module covers the basics of SPARQL, sufficient to create your first RDF graph and run your first SPARQL queries.
This module looks at ontologies: what is an ontology, what kind of resources does it describe, and what are the
benefits of using ontologies. Ontologies are the core of how we model knowledge semantically. They are part of
all Linked Data sets.
This video guides you through the steps of setting up your GraphDB: from downloading and deploying it as a
native desktop application, a standalone server, or a Docker image, through launching the Workbench, to creating
a repository and executing SPARQL queries against the data in it. Our favorite example from The Flintstones is
available here as data for you to start with.
GraphDB Workbench is a web-based administration tool that allows you to manage GraphDB repositories, load and
export data, monitor query execution, develop and execute queries, manage connectors and users. The GraphDB
REST API can be used to automate various tasks without having to open the Workbench in a browser and doing
them manually. This makes it easy to script cURL calls in your applications. In this video, we provide a brief
overview of their main functionalities that you will be using most of the time.
Data is the most valuable asset and GraphDB is designed to store and enhance it. This module shows you several
ways of loading individual files and bulk data, as well as how to RDFize your tabular data and map it against an
existing ontology.
This module outlines the reasoning strategies (how to get new information from your data) as well as the rulesets
that are used by GraphDB. The three different reasoning strategies that are discussed are: forward chaining, backward chaining, and hybrid chaining. They support various GraphDB reasoning optimizations, e.g., using owl:sameAs.
This module walks you through GraphDB’s data virtualization functionality, which enables direct access to relational databases with SPARQL queries, eliminating the need to replicate data. To achieve this, GraphDB integrates the open-source Ontop project and extends it with multiple GraphDB-specific features.
This video covers the GraphDB plugins – externally provided libraries allowing developers to extend the engine.
They can synchronize their internal state over the public GraphDB Plugin API and handle the execution of registered triple patterns. Plugin examples include RDF Rank, Geospatial extensions, and more.
The Lucene, Solr, and Elasticsearch GraphDB connectors enable the connection to an external component or
service, providing full-text search and aggregation. The MongoDB integration allows querying a database using
SPARQL and executing heterogeneous joins, and the Kafka GraphDB connector provides a means to synchronize
changes to the RDF model to any downstream system via the Kafka framework. This module explains how to
create, list, and drop connector instances in GraphDB.
GraphDB is built on top of RDF4J, a powerful Java framework for processing and handling RDF data. This
includes creating, parsing, storing, inferencing, and querying over such data. It offers an easy-to-use API. GraphDB
comes with a set of example programs and utilities that illustrate the basics of accessing GraphDB through the
RDF4J API.
All GraphDB programming examples are provided as a single Maven project. GraphDB is available from Maven
Central (the public Maven repository). You can find the most recent version here.
18.2.2 Examples
The two examples below can be found under examples/developer-getting-started of the GraphDB distribu
tion.
The following program opens a connection to a repository, evaluates a SPARQL query and prints the result. The
example uses the GraphDBHTTPRepository class, which is an extension of RDF4J’s HTTPRepository that adds
support for GraphDB features such as the GraphDB cluster.
In order to run the example program, you need to build from the appropriate .pom file:
mvn install
package com.ontotext.graphdb.example.app.hello;
import com.ontotext.graphdb.repository.http.GraphDBHTTPRepository;
import com.ontotext.graphdb.repository.http.GraphDBHTTPRepositoryBuilder;
import org.eclipse.rdf4j.model.Value;
import org.eclipse.rdf4j.query.*;
import org.eclipse.rdf4j.repository.RepositoryConnection;
/**
* Hello World app for GraphDB
*/
public class HelloWorld {
public void hello() throws Exception {
// Connect to a remote repository using the GraphDB client API
// (ruleset is irrelevant for this example)
GraphDBHTTPRepository repository = new GraphDBHTTPRepositoryBuilder()
.withServerUrl("http://localhost:7200")
.withRepositoryId("myrepo")
//.withCluster() // uncomment this line to enable cluster mode
.build();
// Open a connection to the repository
RepositoryConnection connection = repository.getConnection();
try {
    // Preparing a SELECT query for later evaluation
    TupleQuery tupleQuery = connection.prepareTupleQuery(QueryLanguage.SPARQL,
        "SELECT ?x WHERE {" +
        "BIND('Hello world!' as ?x)" +
        "}");
    // Evaluating the query and printing the value bound to ?x
    TupleQueryResult result = tupleQuery.evaluate();
    try {
        while (result.hasNext()) {
            BindingSet bindingSet = result.next();
            Value x = bindingSet.getValue("x");
            System.out.println(x.stringValue());
        }
    } finally {
        result.close();
    }
} finally {
    connection.close();
}
}
}
This example illustrates loading of ontologies and data from files, querying data through SPARQL SELECT, deleting
data through the RDF4J API and inserting data through SPARQL INSERT.
In order to run the example program, you first need to locate the appropriate .pom file. In this file, there will be a
commented line pointing towards the FamilyRelationsApp class. Remove the comment markers from this line,
making it active, and comment out the line pointing towards the HelloWorld class instead. Then build the app
from the .pom file:
mvn install
package com.ontotext.graphdb.example.app.family;
import com.ontotext.graphdb.example.util.QueryUtil;
import com.ontotext.graphdb.example.util.UpdateUtil;
import com.ontotext.graphdb.repository.http.GraphDBHTTPRepository;
import com.ontotext.graphdb.repository.http.GraphDBHTTPRepositoryBuilder;
import org.eclipse.rdf4j.model.IRI;
import java.io.IOException;
/**
* An example that illustrates loading of ontologies, data, querying and modifying data.
*/
public class FamilyRelationsApp {
private RepositoryConnection connection;
/**
* Loads the ontology and the sample data into the repository.
*
* @throws RepositoryException
* @throws IOException
* @throws RDFParseException
*/
public void loadData() throws RepositoryException, IOException, RDFParseException {
System.out.println("# Loading ontology and data");
/**
* Lists family relations for a given person. The output will be printed to stdout.
*
* @param person a person (the local part of a URI)
* @throws RepositoryException
* @throws MalformedQueryException
* @throws QueryEvaluationException
*/
public void listRelationsForPerson(String person) throws RepositoryException,�
,→MalformedQueryException, QueryEvaluationException {
System.out.println("# Listing family relations for " + person);
// A simple query that will return the family relations for the provided person parameter
TupleQueryResult result = QueryUtil.evaluateSelectQuery(connection,
"PREFIX family: <http://examples.ontotext.com/family#>" +
"SELECT ?p1 ?r ?p2 WHERE {" +
while (result.hasNext()) {
BindingSet bindingSet = result.next();
IRI p1 = (IRI) bindingSet.getBinding("p1").getValue();
IRI r = (IRI) bindingSet.getBinding("r").getValue();
IRI p2 = (IRI) bindingSet.getBinding("p2").getValue();
/**
* Deletes all triples that refer to a person (i.e. where the person is the subject or the object).
*
* @param person the local part of a URI referring to a person
* @throws RepositoryException
*/
public void deletePerson(String person) throws RepositoryException {
System.out.println("# Deleting " + person);
// Removing a person means deleting all triples where the person is the subject or the object.
// Alternatively, this can be done with SPARQL.
connection.remove(uriForPerson(person), null, null);
connection.remove((IRI) null, null, uriForPerson(person));
/**
* Adds a child relation to a person, i.e. inserts the triple :person :hasChild :child.
*
* @param child the local part of a URI referring to a person (the child)
* @param person the local part of a URI referring to a person
* @throws MalformedQueryException
* @throws RepositoryException
* @throws UpdateExecutionException
*/
public void addChildToPerson(String child, String person) throws MalformedQueryException,�
,→RepositoryException, UpdateExecutionException {
System.out.println("# Adding " + child + " as a child to " + person);
// We interpolate the URIs inside the string as INSERT DATA may not contain variables (bindings)
UpdateUtil.executeUpdate(connection,
try {
familyRelations.loadData();
// Once we've loaded the data we should see all explicit and implicit relations for John
familyRelations.listRelationsForPerson("John");
// Deleting Mary also removes Kate from John's list of relatives as Kate is his relative�
,→through Mary
familyRelations.listRelationsForPerson("John");
We also recommend the online book Programming with RDF4J provided by the RDF4J project. It provides detailed
explanations on the RDF4J API and its core concepts.
GraphDB Workbench is now a separate open-source project, enabling the fast development of knowledge graph prototypes or rich UI applications. This provides you with the ability to add your custom colors to the graph views, as well as to easily start a FactForge-like interface.
This tutorial will show you how to extend and customize GraphDB Workbench by adding your own page and Angular controller. We will create a simple paths application that allows you to import RDF data, find paths between two nodes in the graph, and visualize them using D3.
All pages are located under src/pages/, so you need to add your new page paths.html there with a {title}
placeholder. The page content will be served by an Angular controller, which is placed under src/js/angular/
graphexplore/controllers/paths.controller.js. Path exploration is a functionality related to graph exploration, so you need to register your new page and controller there.
In src/js/angular/graphexplore/app.js:
1. Import the controller:
'angular/graphexplore/controllers/paths.controller',
2. Register a route for the new page:

.when('/paths', {
templateUrl: 'pages/paths.html',
controller: 'GraphPathsCtrl',
title: 'Graph Paths',
helpInfo: 'Find all paths in a graph.',
});
3. Add a menu item for it:

$menuItemsProvider.addItem({
label: 'Paths',
href: 'paths',
order: 5,
parent: 'Explore',
});
}
}
);
Now your controller and page are ready to be filled with content.
In your page, you need a repository with data in it. Like most views in GraphDB, you need to have a repository
set. The template that most of the pages use is similar to this, where the repository-is-set div is where you put
your html. Error handling related to repository errors is added for you.
<div class="container-fluid">
<h1>
{{title}}
<span class="btn btn-link"
popover-template="'js/angular/templates/titlePopoverTemplate.html'"
popover-trigger="mouseenter"
popover-placement="bottom-right"
popover-append-to-body="true"><span class="icon-info"></span></span>
</h1>
<div core-errors></div>
<div system-repo-warning></div>
<div class="alert alert-danger" ng-show="repositoryError">
<p>The currently selected repository cannot be used for queries due to an error:</p>
You need to define the functions on which this snippet depends in your paths.controller.js. They use the
repository service that you imported in the controller definition.
$scope.getActiveRepository = function () {
return $repositories.getActiveRepository();
};
$scope.isLoadingLocation = function () {
return $repositories.isLoadingLocation();
};
$scope.hasActiveLocation = function () {
return $repositories.hasActiveLocation();
};
1. Create a repository.
2. Import the airports.ttl dataset.
3. Enable the Autocomplete index for your repository.
4. Execute the following SPARQL insert to add direct links for flights:
Now we will search for paths between airports based on the hasFlightTo predicate.
Now let’s add inputs using Autocomplete to select the departure and destination airports. Inside the repository-is-set div, add the two fields. Note the visual-callback="findPath(startNode, uri)" snippet that defines the callback to be executed once a value is selected through the Autocomplete. uri is the value from the Autocomplete. The following code sets the startNode variable in Angular and calls the findPath function when the destination is given. You can find out how to define this function in the scope a little further down in this tutorial.
The Autocomplete fields need the getNamespacesPromise and getAutocompletePromise to fetch the Autocomplete data. These promises should
be initialized once the repository has been set in the controller.
function initForRepository() {
if (!$repositories.getActiveRepository()) {
return;
}
$scope.getNamespacesPromise = ClassInstanceDetailsService.getNamespaces($scope.
,→getActiveRepository());
$scope.getAutocompletePromise = AutocompleteService.checkAutocompleteStatus();
}
Note that both of these functions need to be called when the repository is changed, because you need to make sure
that Autocomplete is enabled for this repository, and fetch the namespaces for it. Now you can autocomplete in
your page.
Now let’s implement the findPath function in the scope. It finds all paths between nodes by using a simple
depth-first search algorithm (recursive algorithm based on the idea of backtracking).
For each node, you can obtain its siblings with a call to the rest/explore-graph/links endpoint. This is the same
endpoint the Visual graph is using to expand node links. Note that it is not part of the GraphDB API, but we will
reuse it for simplicity.
As an alternative, you can also obtain the direct links of a node by sending a SPARQL query to GraphDB.
Note: This is a demo implementation. For a repository containing a lot of links, the proposed approach is not appropriate, as it will send a request to the server for each node. This will quickly result in a huge number of requests, which will very soon flood the browser.
var maxPathLength = 3;
The findPath recursive function returns all the promises that will or will not resolve to paths. Each path is a
collection of links.
When all promises are resolved, you can flatten the array to obtain all links from all paths and draw one single graph
with these links. Graph drawing is done with D3 in the renderGraph function. It needs a graph-visualization
element to draw the graph inside. Add it inside the repository-is-set element below the autocomplete divs.
Additionally, import graphs-visualizations.css to reuse some styles.
function renderGraph(linksFound) {
var graph = new Graph();
// For each node in the graph find its label with a rest call
_.forEach(nodesFromLinks, function (newNode, index) {
promises.push($http({
url: 'rest/explore-graph/node',
method: 'GET',
params: {
iri: newNode,
config: 'default',
includeInferred: true,
sameAsState: true
}
}).then(function (response) {
// Save the data for later
nodesData[index] = response.data;
}));
});
graph.addLinks(linksFound);
draw(graph);
});
}
function Graph() {
this.nodes = [];
this.links = [];
force.nodes(graph.nodes).charge(-3000);
force.links(graph.links).linkDistance(function (link) {
// link distance depends on length of text with an added bonus for strongly connected nodes,
// i.e. they will be pushed further from each other so that their common nodes can cluster up
return getPredicateTextLength(link) + 30;
});
function getPredicateTextLength(link) {
var textLength = link.source.size * 2 + link.target.size * 2 + 50;
return textLength * 0.75;
}
// arrow markers
container.append("defs").selectAll("marker")
.data(force.links())
.enter().append("marker")
.attr("class", "arrow-marker")
.attr("id", function (d) {
return d.target.size;
})
.attr("viewBox", "0 -5 10 10")
.attr("refX", function (d) {
return d.target.size + 11;
})
.attr("refY", 0)
.attr("markerWidth", 10)
.attr("markerHeight", 10)
.attr("orient", "auto")
.append("path")
.attr("d", "M0,-5L10,0L0,5 L10,0 L0, -5");
updateNodeLabels(nodeLabels);
function updateNodeLabels(nodeLabels) {
nodeLabels.each(function (d) {
d.fontSize = D3.Text.calcFontSizeRaw(d.labels[0].label, d.size, 16, true);
// TODO: get language and set it on the label html tag
})
.attr("height", function (d) {
return d.fontSize * 3;
})
// if this was kosher we would use xhtml:body here but if we do that angular (or the browser)
// goes crazy and resizes/messes up other unrelated elements. div seems to work too.
.append("xhtml:div")
.attr("class", "node-label-body")
.style("font-size", function (d) {
return d.fontSize + 'px';
})
.append('xhtml:div')
});
force.start();
}
It obtains the URIs for the nodes from all links, and finds their labels through calls to the rest/explore-graph/
node endpoint. A graph object is defined to represent the visual abstraction, which is simply a collection of nodes
and links. The draw(graph) function does the D3 drawing itself using the D3 force layout.
Now let’s find all paths between Sofia and La Palma with maximum 2 nodes in between (maximum path length
3):
Note: The airports graph is highly connected. Increasing the maximum path length will send too many requests to
the server. The purpose of this tutorial is to introduce you to the Workbench extension with a naive paths prototype.
Noticing that path finding can take some time, we may want to add a message for the user.
The source code for this example can be found in the workbenchpathsexample GitHub project.
18.4 Location and Repository Management with the GraphDB REST API
The GraphDB REST API can be used for managing locations and repositories programmatically. It includes con
necting to remote GraphDB instances (locations), activating a location, and different ways for creating a repository.
This tutorial shows how to use the cURL command line tool to perform basic location and repository management through the
GraphDB REST API.
18.4.1 Prerequisites
Tip: For more information on deploying GraphDB, please see Installing and Upgrading.
• Another GraphDB instance (optional, needed for the Attaching a remote location example):
– Start GraphDB on the second machine.
• The cURL command line tool for sending requests to the API.
Hint: Throughout the tutorial, the two instances will be referred to with the following URLs:
• http://192.0.2.1:7200/ for the first instance;
• http://192.0.2.2:7200/ for the second instance.
Please adjust the URLs according to the IPs or hostnames of your own machines.
Create a repository
Repositories can be created by providing a .ttl file with all the configuration parameters.
1. Download the sample repository config file repo-config.ttl.
2. Send the file with a POST request using the following cURL command:
curl -X POST \
  http://192.0.2.1:7200/rest/repositories \
  -H 'Content-Type: multipart/form-data' \
  -F "config=@repo-config.ttl"
Note: You can provide a parameter location to create a repository in another location, see Managing locations
below.
List repositories
Use the following cURL command to list all repositories by sending a GET request to the API:
curl -G http://192.0.2.1:7200/rest/repositories \
  -H 'Accept: application/json'
The output shows the repository repo1 that was created in the previous step.
[
{
"id":"repo1",
"title":"my repository number one",
"uri":"http://192.0.2.1:7200/repositories/repo1",
"type":"free",
"sesameType":"graphdb:SailRepository",
"location":"",
"readable":true,
"writable":true,
"local":true
}
]
Attach a location
Use the following cURL command to attach a remote location by sending a PUT request to the API:
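A sketch of such a request, assuming the remote location is described with the uri attribute documented under the location management attributes later in this chapter (username and password are only needed for secured instances):
curl -X PUT http://192.0.2.1:7200/rest/locations \
  -H 'Content-Type: application/json' \
  -d '{"uri": "http://192.0.2.2:7200/"}'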
List locations
Use the following cURL command to list all locations that are attached to a machine by sending a GET request to
the API:
curl http://192.0.2.1:7200/rest/locations \
  -H 'Accept: application/json'
[
{
"system" : true,
"errorMsg" : null,
"active" : false,
"defaultRepository" : null,
"local" : true,
    ...
  }
]
Note: If you skipped the “Attaching a remote location” step or if you already had other locations attached, the
output will look different.
Detach a location
Use the following cURL command to detach a location from a machine by sending a DELETE request to the API:
• To detach the remote location http://192.0.2.2:7200/:
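As a sketch, assuming the location URL is passed as a URL-encoded uri query parameter:
curl -X DELETE 'http://192.0.2.1:7200/rest/locations?uri=http%3A%2F%2F192.0.2.2%3A7200%2F'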
Important: Detaching a location simply removes it from the Workbench and will not delete any data. A detached
location can be reattached at any point.
For a full list of request parameters and more information regarding sending requests, check the REST API documentation within the GraphDB Workbench, accessible from the Help ‣ REST API menu.
This page displays GraphDB REST API calls as cURL commands, which enables developers to script these calls
in their applications.
See also the Help ‣ REST API view of the GraphDB Workbench, where you will find a complete reference of all REST APIs and will be able to run API calls directly from the browser.
In addition to this, the RDF4J API is also available.
GET /rest/cluster/config
Example:
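A typical call, using the <base_url> placeholder convention of this section:
curl <base_url>/rest/cluster/config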
GET /rest/cluster/group/status
Example:
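Following the same pattern:
curl <base_url>/rest/cluster/group/status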
GET /rest/cluster/node/status
Example:
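Following the same pattern:
curl <base_url>/rest/cluster/node/status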
POST /rest/cluster/config
Example:
POST /rest/cluster/config/node
Example:
PATCH /rest/cluster/config
Example:
DELETE /rest/cluster/config
Example:
DELETE /rest/cluster/config/node
Example:
GET /rest/monitor/cluster
Example:
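Following the same pattern:
curl <base_url>/rest/monitor/cluster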
Most data import queries can either take the following set of attributes as an argument or return them as a response.
• fileNames (string list): A list of filenames that are to be imported.
• importSettings (JSON object): Import settings.
– baseURI (string): Base URI for the files to be imported.
– context (string): Context for the files to be imported.
– data (string): Inline data.
– forceSerial (boolean): Force use of the serial statements pipeline.
– name (string): Filename.
– status (string): Status of an import: pending, importing, done, error, none, or interrupting.
– timestamp (integer): When the import was started.
– type (string): The type of the import.
– replaceGraphs (string list): A list of graphs that you want to be completely replaced by the import.
* preserveBNodeIds (boolean): Use blank node IDs found in the file instead of assigning them.
* stopOnError (boolean): Stop on error. If false, the error will be logged and parsing will continue.
* verifyLanguageTags (boolean): Verify language based on a given set of definitions for valid
languages.
Cancel server file import operation
DELETE /rest/repositories/<repo_id>/import/server
Example:
GET /rest/repositories/<repo_id>/import/server
Example:
curl <base_url>/rest/repositories/<repo_id>/import/server
POST /rest/repositories/<repo_id>/import/server
Example:
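A sketch of a server file import request, assuming a JSON body with the fileNames attribute described above (the file must already be present in the server import directory):
curl -X POST <base_url>/rest/repositories/<repo_id>/import/server \
  -H 'Content-Type: application/json' \
  -d '{"fileNames": ["<file_name>"]}'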
GET /rest/monitor/infrastructure
Example:
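Following the same pattern:
curl <base_url>/rest/monitor/infrastructure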
Most location management queries can either take the following set of attributes as an argument or return them as
a response.
• active (boolean): True if the location is the currently active one – the local location is the only location that
can be active at any given point.
• defaultRepository (string): Default repository for the location.
• errorMsg (string): Error message, if there was an error connecting to this location.
• label (string): Human-readable label of the location.
• local (boolean): True if the location is local (on the same machine as the Workbench).
• password (string): Password for the new location if any. This parameter only makes sense for remote
locations.
• system (boolean): True if the location is the system location.
• uri (string): The GraphDB location URL.
• username (string): Username for the new location if any. This parameter only makes sense for remote
locations.
Get all connected GraphDB locations
GET /rest/locations
Example:
curl <base_url>/rest/locations
POST /rest/locations
Example:
PUT /rest/locations
Example:
DELETE /rest/locations
Example:
POST /rest/locations/active/default-repository
Example:
{
"repository": "<repo_id>"
}'
Most repository management queries can either take the following set of attributes as an argument or return them
as a response.
• externalUrl (string): The URL that the repository can be accessed at by an external service.
• id (string): The repository id.
• local (boolean): True if the repository is local (on the same machine as the Workbench).
• location (string): If remote, the repository’s location.
• title (string): The repository title.
• type (string): Repository type: worker, master, or system.
• unsupported (boolean): True if the repository is unsupported.
• writable (boolean): True if the repository is writable.
• readable (boolean): True if the repository is readable.
• uri (string): The GraphDB location URL.
Get all repositories in the current or another location
GET /rest/repositories
Example:
curl <base_url>/rest/repositories
GET /rest/repositories/<repo_id>
Example:
curl <base_url>/rest/repositories/<repo_id>?location=<encoded_location_uri>
GET /rest/repositories/<repo_id>/size
Example:
curl <base_url>/rest/repositories/<repo_id>/size?location=<encoded_location_uri>
POST /rest/repositories
Example:
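A sketch that mirrors the repository creation example earlier in this chapter (a .ttl config file sent as multipart form data):
curl -X POST <base_url>/rest/repositories \
  -H 'Content-Type: multipart/form-data' \
  -F "config=@repo-config.ttl"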
Restart a repository
POST /rest/repositories/<repo_id>/restart
Example:
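A sketch, assuming no request body is required:
curl -X POST <base_url>/rest/repositories/<repo_id>/restart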
PUT /rest/repositories/<repo_id>
Example:
Hint: Adjust the parameters with new values (except for <repo_id>) in order to edit the current repository configuration.
DELETE /rest/repositories/<repo_id>
Example:
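A sketch, optionally passing the location as in the GET examples above:
curl -X DELETE <base_url>/rest/repositories/<repo_id>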
GET /rest/monitor/repository/{repositoryID}
Example:
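Following the same pattern:
curl <base_url>/rest/monitor/repository/<repo_id>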
GET /rest/sparql/saved-queries
Example:
curl <base_url>/rest/sparql/saved-queries?name=<query_name>
POST /rest/sparql/saved-queries
Example:
PUT /rest/sparql/saved-queries
Example:
DELETE /rest/sparql/saved-queries
Example:
GET /rest/security
Example:
curl <base_url>/rest/security
Enable security
POST /rest/security
Example:
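A sketch, assuming the endpoint takes a plain JSON boolean body that switches security on or off:
curl -X POST --header 'Content-Type: application/json' -d true <base_url>/rest/security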
GET /rest/security/free-access
curl <base_url>/rest/security/free-access
POST /rest/security/free-access
Example:
GET /rest/security/users
Example:
curl <base_url>/rest/security/users
Get a user
GET /rest/security/users/<username>
Example:
curl <base_url>/rest/security/users/<username>
Delete a user
DELETE /rest/security/users/<username>
Example:
PATCH /rest/security/users/<username>
Example:
Create a user
POST /rest/security/users/<username>
Example:
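A sketch only; the payload fields shown here are assumptions. A minimal body would presumably carry the password and the granted authorities (roles and repository permissions):
curl -X POST --header 'Content-Type: application/json' -d '{
  "password": "<password>",
  "grantedAuthorities": ["ROLE_USER", "READ_REPO_<repo_id>"]
}' <base_url>/rest/security/users/<username>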
Edit a user
PUT /rest/security/users/<username>
Example:
Create, edit, delete, and execute SPARQL templates, as well as view all templates and their configuration.
Get IDs of all configured SPARQL templates per current repository
GET /rest/repositories/<repo_id>/sparql-templates
Example:
curl '<base_url>/rest/repositories/<repo_id>/sparql-templates'
GET /rest/repositories/<repo_id>/sparql-templates/configuration
Example:
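A sketch, assuming the template is selected with the same templateID parameter used by the PUT example below:
curl '<base_url>/rest/repositories/<repo_id>/sparql-templates/configuration?templateID=<template_id>'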
POST /rest/repositories/<repo_id>/sparql-templates
Example:
DELETE /rest/repositories/<repo_id>/sparql-templates
Example:
PUT /rest/repositories/<repo_id>/sparql-templates
Example:
curl -X PUT --header 'Content-Type: text/plain' --header 'Accept: */*' -d '<update_query_string>' '<base_url>/rest/repositories/<repo_id>/sparql-templates?templateID=<template_id>'
POST /rest/repositories/<repo_id>/sparql-templates/execute
Access, create, and edit SQL views (tables), as well as delete existing saved queries and see all SQL views for the
active repository.
Get all SQL view names for current repository
GET /rest/sql-views/tables
Example:
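A sketch, assuming the repository is passed in the X-GraphDB-Repository header as in the POST and PUT examples below:
curl --header 'X-GraphDB-Repository: <repoID>' <base_url>/rest/sql-views/tables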
GET /rest/sql-views/tables/<name>
Example:
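Following the same pattern:
curl --header 'X-GraphDB-Repository: <repoID>' <base_url>/rest/sql-views/tables/<name>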
POST /rest/sql-views/tables/
Example:
curl -X POST --header 'Content-Type: application/json' --header 'Accept: */*' --header 'X-GraphDB-Repository: <repoID>' -d '{
  "name": "string",
  "query": "string",
  "columns": ["string"]
}' <base_url>/rest/sql-views/tables/
PUT /rest/sql-views/tables/<name>
Example:
curl -X PUT --header 'Content-Type: application/json' --header 'Accept: */*' --header 'X-GraphDB-Repository: <repoID>' -d '{
  "name": "string",
  "query": "string",
  "columns": ["string"]
}' <base_url>/rest/sql-views/tables/<name>
DELETE /rest/sql-views/tables/<name>
Example:
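Following the same pattern:
curl -X DELETE --header 'X-GraphDB-Repository: <repoID>' <base_url>/rest/sql-views/tables/<name>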
18.5.12 Authentication
POST /rest/login/**
Example:
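A sketch, assuming the username goes in the path and the password is sent in an X-GraphDB-Password header (both details are assumptions here):
curl -X POST --header 'X-GraphDB-Password: <password>' <base_url>/rest/login/<username>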
This command will return the user's roles and GraphDB application settings. It will also generate a GDB token, which is returned in an Authorization header and is used for every subsequent authenticated request.
GET /rest/monitor/structures
Example:
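Following the same pattern:
curl <base_url>/rest/monitor/structures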
Ogma is a powerful JavaScript library for graph visualization. In the following examples, data is fetched from
a GraphDB repository, converted into an Ogma graph object, and visualized using different graph layouts. All
samples reuse functions from a commons.js file.
You need a version of Ogma JS to run the samples.
The following example fetches people and organizations related to Google. One of the sample queries in factforge.net is used, rewritten into a CONSTRUCT query. The type is used to differentiate entities of different kinds.
<html>
<body>
<!-- Include the library -->
<script src="../lib/ogma.min.js"></script>
<script src="../lib/jquery-3.2.0.min.js"></script>
<script src="commons.js"></script>
<script src="../lib/lodash.js"></script>
<!-- This div is the DOM element containing the graph. The style ensures that it takes the whole screen. -->
<script>
// Which namespace to choose types from
var dboNamespace = "http://dbpedia.org/ontology"
// One of factforge saved queries enriched with types and rdf rank
var peopleAndOrganizationsRelatedToGoogle = `
# F03: People and organizations related to Google
# - picks up people related through any type of relationships
# - picks up parent and child organizations
# - benefits from inference over transitive dbo:parent
# - RDFRank makes it easy to see the “top suspects” in a list of 94 entities
# Change Google with any organization, e.g. type dbr:Hew and Ctrl-Space to auto-complete
var postData = {
query: peopleAndOrganizationsRelatedToGoogle,
infer: true,
sameAs: true,
limit: 1000,
offset: 0
}
$.ajax({
url: graphDBRepoLocation,
type: 'POST',
data: postData,
headers: {
'Accept': 'application/rdf+json'
},
success: function (data) {
ogma.locate.center();
ogma.layouts.start('forceLink', {}, {
// sync parameters
onEnd: endLayout
});
function endLayout() {
ogma.locate.center({
easing: 'linear',
duration: 300
});
}
}
})
</script>
</body>
</html>
The following example fetches a suspicious control chain through offshore companies, which is another saved query in factforge.net rewritten as a graph query. The entities, their RDF rank, and their type are fetched. Node size is based on the RDF rank and node color on the type. All examples use a commons.js file with some common functions, e.g., data model conversion.
<html>
<body>
<!-- Include the library -->
<script src="../lib/ogma.min.js"></script>
<script src="../lib/jquery-3.2.0.min.js"></script>
<script src="commons.js"></script>
<script src="../lib/lodash.js"></script>
<!-- This div is the DOM element containing the graph. The style ensures that it takes the whole screen. -->
<script>
CONSTRUCT {
?c1 fibo-fnd-rel-rel:controls ?c2 .
?c2 fibo-fnd-rel-rel:controls ?c3 .
?c1 ff-map:primaryCountry ?c1_country .
?c2 ff-map:primaryCountry ?c2_country .
?c3 ff-map:primaryCountry ?c3_country .
?c1 sesame:directType ?t1 .
?c2 sesame:directType ?t2 .
?c3 sesame:directType ?t3 .
?c1_country sesame:directType dbo:Country .
?c3_country sesame:directType dbo:Country .
?c3_country sesame:directType dbo:Country .
} FROM onto:disable-sameAs
WHERE {
?c1 fibo-fnd-rel-rel:controls ?c2 .
?c2 fibo-fnd-rel-rel:controls ?c3 .
?c1 sesame:directType ?t1 .
?c2 sesame:directType ?t2 .
?c3 sesame:directType ?t3 .
?c1 ff-map:primaryCountry ?c1_country .
?c2 ff-map:primaryCountry ?c2_country .
?c3 ff-map:primaryCountry ?c1_country .
FILTER (?c1_country != ?c2_country)
var postData = {
query: suspiciousOffshore,
infer: true,
sameAs: true,
limit: 1000,
offset: 0
}
$.ajax({
url: graphDBRepoLocation,
type: 'POST',
data: postData,
headers: {
'Accept': 'application/rdf+json'
},
success: function (data) {
ogma.locate.center();
ogma.layouts.start('forceLink', {}, {
onEnd: endLayout
});
function endLayout() {
<style>
#graph-container { top: 0; bottom: 0; left: 0; right: 0; position: absolute; margin: 0; overflow: hidden; }
.info {
position: absolute;
color: #fff;
background: #141229;
font-size: 12px;
font-family: monospace;
padding: 5px;
}
.info.n { top: 0; left: 0; }
</style>
<!-- This div is the DOM element containing the graph. The style ensures that it takes the whole screen. -->
<div id="graph-container"></div>
<div id="n" class="info n">loading a large graph, it can take a few seconds...</div>
<script>
var postData = {
query: airportsQuery,
infer: true,
sameAs: true,
$.ajax({
url: 'http://localhost:8082/repositories/airroutes',
type: 'POST',
data: postData,
headers: {
'Accept': 'application/rdf+json'
},
success: function (data) {
ogma.geo.enable();
ogma.topology.getAdjacentEdges(pathNodes[i]).forEach(function (edge) {
</script>
</body>
</html>
// Get type for a node to color nodes of the same type with the same color
var typePredicate = "http://www.openrdf.org/schema/sesame#directType";
RDF is the most popular format for exchanging semantic data. Unlike logical database models, ontologies are optimized to correctly represent the knowledge in a particular business domain. This means that their structure is often verbose, includes abstract entities to express OWL axioms, and contains implicit statements and complex n-ary relationships with provenance information. Graph View is a user interface optimized for mapping knowledge base models to simpler edge and vertex models configured by a list of SPARQL queries.
The Graph View interface accepts four different SPARQL queries to retrieve data from the knowledge base:
• Node expansion determines how new nodes and links are added to the visual graph when the user expands
an existing node.
• Node type, size, and label control the node appearance. Types correspond to different colors. Each binding
is optional.
• Edge (i.e., predicate) label determines where to read the name of the relationship.
• Node info controls all data visible for the resource displayed in tabular format. If an ?image binding is found
in the results, the value is used as an image source.
By using these four queries, you may override the default configuration and adapt the knowledge base visualization
to:
• Integrate custom ontology schema and the preferred label;
• Hide provenance or another metadata related information;
• Combine nodes, so you can skip relation objects and show them as a direct link;
• Filter instances with all sorts of tools offered by the SPARQL language;
• Generate RDF resources on the fly from existing literals.
The OpenFlights Airports Database contains over 10,000 airports, train stations, and ferry terminals spanning the globe. Airport base data was generated from DAFIF (October 2006 cycle) and OurAirports, plus time zone information from EarthTools. All DST information is added manually. Significant revisions and additions have been made by the users of OpenFlights. Airline data was extracted directly from Wikipedia's gargantuan List of airlines. The dataset can easily link to DBpedia and be integrated with the rest of the linked open data cloud.
Data model
All OpenFlights CSV files are converted using Ontotext Refine. To start exploring, first import the airports.ttl dataset, which contains the data in RDF.
Configured queries
Let’s find out how the airports are connected by skipping the route relation and model a new relation hasFlightTo.
Using the Visual button in the SPARQL editor, we can see the results of this query as a visual graph.
We can also save the graph and expand to more airports. To do this, navigate to Explore ‣ Visual graph and click Create graph config.
First you are asked to select the initial state of your graph. For simplicity, we choose to start with a query and enter the one from above. Now let's make this graph expandable by configuring the Graph expansion query:
You can also select a different airport to start from every time by making the starting point a search box.
The power of the visual graph is that we can create multiple Graph views on top of the same data. Let’s create a
new one using the following query:
And let’s create a visual graph with the following expand query:
# Note that ?node is the node you clicked and must be used in the query
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
PREFIX onto: <http://www.ontotext.com/>
CONSTRUCT {
# The triples that will be added to the visual graph when you expand airports
?node onto:hasFlightFromWithAirline ?airline1 .
?node onto:hasFlightToWithAirline ?airline2 .
} UNION {
# Outgoing flights for airport
?route <http://openflights.org/resource/route/destinationId> ?node .
?route <http://openflights.org/resource/route/airlineId> ?airline2 .
} UNION
{
# Incoming flights for airline
?route <http://openflights.org/resource/route/sourceId> ?airport1 .
?route <http://openflights.org/resource/route/airlineId> ?node .
} UNION {
# Outgoing flights for airline
?route <http://openflights.org/resource/route/destinationId> ?airport2 .
?route <http://openflights.org/resource/route/airlineId> ?node .
}
}
SciGraph is a Linked Open Data platform for the scholarly domain. The dataset aggregates data sources from
Springer Nature and key partners from the domain. It collates information from across the research landscape,
such as funders, research projects, conferences, affiliations, and publications.
Data model
but let’s say we are only interested in articles, contributions, and subjects.
From this we can say that a researcher contributes to a subject, and create a virtual URI for the researcher since it
is a Literal.
We do not have a URI for a researcher. How can we search for researchers?
Navigate to Setup ‣ Autocomplete and add the sg:publishedName predicate. The retrieved results will be contributions matching the names entered in the search box.
Now let’s create the graph config. We need to configure an expansion for contribution since this is our starting
point for both subjects and researchers.
{
BIND(IRI(CONCAT("http://www.ontotext.com/", REPLACE(STR(?researcherName), " ", "_"))) as ?researcherNameUri1)
?contribution a sg:Contribution .
?contribution sg:publishedName ?researcherName .
?article sg:hasContribution ?contribution .
?article sg:hasSubject ?node .
}
}
However, not all researchers have contributions to articles with subjects. Let’s use an initial query that will fetch
some researchers that have such relations. This is just a simplified version of the query above fetching some
researchers and subjects.
{
BIND(IRI(CONCAT("http://www.ontotext.com/", REPLACE(STR(?researcherName), " ", "_"))) as ?researcherNameUri2)
?contribution a sg:Contribution .
?contribution sg:publishedName ?researcherName .
?article sg:hasContribution ?contribution .
?article sg:hasSubject ?node .
}
} limit 100
But the nodes in our graph are all the same since they do not have RDF types. Now let’s configure the way the
types of the nodes are obtained.
} ORDER BY ?type
But what if we want to see additional data for each node, i.e., which university has a researcher contribution for:
To learn more about the SPARQL editing and data visualization capabilities of the GraphDB Workbench, as well as features that can be added with little programming, and about SPARQL writing aids and visualization tools that can be integrated with GraphDB, please have a look at this How-to Guide.
18.8 Notifications
Notifications are a publish/subscribe mechanism for registering and receiving events from a GraphDB repository
whenever triples matching a certain graph pattern are inserted or removed.
The RDF4J API provides such a mechanism, where a RepositoryConnectionListener can be notified of changes to a NotifyingRepositoryConnection. However, the GraphDB notifications API works at a lower level and uses the internal raw entity IDs for subject, predicate, and object instead of Java objects. The benefit of this is that much higher performance is possible. The downside is that the client must do a separate lookup to get the actual entity values, and because of this, the notification mechanism works only when the client is running inside the same JVM as the repository instance.
Note: Local notifications only work in an embedded GraphDB instance, which is usually used only in test environments.
For remote notifications, we recommend using the Kafka GraphDB Connector.
Note: The SPARQL query is interpreted as a plain graph pattern by ignoring all more complicated SPARQL
constructs such as FILTER, OPTIONAL, DISTINCT, LIMIT, ORDER BY, etc. Therefore, the SPARQL query is interpreted
as a complex graph pattern involving triple patterns combined by means of joins and unions at any level. The order
of the triple patterns is not significant.
Here is an example of how to register for notifications based on a given SPARQL query:
AbstractRepository rep =
((OwlimSchemaRepository)owlimSail).getRepository();
EntityPool ent = ((OwlimSchemaRepository)owlimSail).getEntities();
String query = "SELECT * WHERE { ?s rdf:type ?o }";
SPARQLQueryListener listener =
new SPARQLQueryListener(query, rep, ent) {
public void notifyMatch(int subj, int pred, int obj, int context) {
System.out.println("Notification on subject: " + subj);
}
};
rep.addListener(listener); // start receiving notifications
...
rep.removeListener(listener); // stop receiving notifications
In the example code, the caller will be asynchronously notified about incoming statements matching the pattern
?s rdf:type ?o.
Note: In general, notifications are sent for all incoming triples that contribute to a solution of the query. The integer parameters in the notifyMatch method can be mapped to values using the EntityPool object. Furthermore, any statements inferred from newly inserted statements are also subject to handling by the notification mechanism, i.e., clients are also notified of new implicit statements when the requested triple pattern matches.
Note: The subscriber should not rely on any particular order or distinctness of the statement notifications. Duplicate statements might be delivered in response to a graph pattern subscription, in an order not even bound to the chronological order of the statements' insertion in the underlying triplestore.
Tip: The purpose of the notification services is to enable the efficient and timely discovery of newly added RDF
data. Therefore, it should be treated as a mechanism for giving the client a hint that certain new data is available
and not as an asynchronous SPARQL evaluation engine.
Clearing an old graph and then importing the new information can often be inefficient. Since the two operations are handled separately, it is impossible to determine whether a statement will also be present in the new graph and therefore keep it there. The same applies to preserving connectors or inferred statements. Therefore, GraphDB offers an optimized graph replacement algorithm, making graph updates faster in situations where the new graph partially overlaps with the data in the old one.
The graph replacement optimization is in effect when the replacement is done in a single transaction and when the
transaction is bigger than a certain threshold. By default, this threshold is set to 1,000, but it can be controlled by
using the graphdb.engine.min-replace-graph-tx-size configuration parameter.
The algorithm has the following steps:
1. Check transaction contents. If the transaction includes a graph replacement and is of sufficient size, proceed.
2. Check if any of the graphs to be replaced are valid and if any of them have data. If so, store their identifiers
in a list.
3. While processing transaction statements for insertion, if their context (graph) matches an identifier from the
list, store them inside a tracker.
4. While clearing the graph to be replaced, if it is not mentioned in the tracker, directly delete all its contents.
5. If a graph is mentioned in the tracker, iterate over its triples.
6. Triples in the replacement graph that are also in the tracker are preserved. Otherwise, they are deleted.
Deletions may trigger re-inference and are a more costly process than the check described in the algorithm. Therefore, in some test cases, users can observe a speedup of up to 200% due to the optimization.
Here is an example of an update that will use the replacement optimization algorithm:
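As a sketch, one way to replace a named graph in a single transaction is a PUT against the RDF4J Graph Store endpoint, which clears the graph and loads the new data in one step (the repository ID, graph URI, and data file below are placeholders):
curl -X PUT -H 'Content-Type: text/turtle' \
  --data-binary @new-data.ttl \
  '<base_url>/repositories/<repo_id>/rdf-graphs/service?graph=http%3A%2F%2Fexample.org%2Fgraph'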
By contrast, the following approach will not use the optimization since it performs the replacement in two separate
steps:
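A sketch of the two-step variant under the same assumptions, first clearing the graph and then loading the new data in a separate transaction:
curl -X DELETE '<base_url>/repositories/<repo_id>/rdf-graphs/service?graph=http%3A%2F%2Fexample.org%2Fgraph'
curl -X POST -H 'Content-Type: text/turtle' \
  --data-binary @new-data.ttl \
  '<base_url>/repositories/<repo_id>/rdf-graphs/service?graph=http%3A%2F%2Fexample.org%2Fgraph'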
Note: The replacement optimization described here applies to all forms of transactions, i.e., it will be triggered by standard PUT requests, such as the ones in the example, but also by SPARQL INSERT queries containing the http://www.ontotext.com/replaceGraph predicate, such as <http://any/subject> <http://www.ontotext.com/replaceGraph> <http://example.org/graph>.
The GraphDB Tutorials hub is meant as the central point for the GraphDB Developer Community. It serves as a hands-on compendium to the GraphDB documentation that gives practical advice and tips on accomplishing real-world tasks.
If you want an in-depth introduction to everything GraphDB, we suggest the following video tutorials:
• GraphDB Fundamentals
If you are already familiar with RDF or are eager to start programming, please refer to:
• Programming with GraphDB
• Extending GraphDB Workbench
• Location and Repository Management with the GraphDB REST API
• GraphDB REST API cURL Commands
• Visualize GraphDB Data with Ogma JS
• Create Custom Graph View over Your RDF Data
• Notifications
• Graph Replacement Optimization
19 References
The Semantic Web represents a broad range of ideas and technologies that attempt to bring meaning to the vast
amount of information available via the Web. The intention is to provide information in a structured form so that it
can be processed automatically by machines. The combination of structured data and inferencing can yield much
information not explicitly stated.
The aim of the Semantic Web is to solve the most problematic issues that come with the growth of the non-semantic (HTML-based or similar) Web that results in a high level of human effort for finding, retrieving and exploiting
information. For example, contemporary search engines are extremely fast, but tend to be very poor at producing
relevant results. Of the thousands of matches typically returned, only a few point to truly relevant content and
some of this content may be buried deep within the identified pages. Such issues dramatically reduce the value of
the information discovered as well as the ability to automate the consumption of such data. Other problems related
to classification and generalization of identifiers further confuse the landscape.
The Semantic Web solves such issues by adopting unique identifiers for concepts and the relationships between them. These identifiers, called Uniform Resource Identifiers (URIs) (a "resource" is any 'thing' or 'concept'), are similar to Web page URLs, but do not necessarily identify documents from the Web. Their sole purpose is to
uniquely identify objects or concepts and the relationships between them.
The use of URIs removes much of the ambiguity from information, but the Semantic Web goes further by allowing
concepts to be associated with hierarchies of classifications, thus making it possible to infer new information based
on an individual’s classification and relationship to other concepts. This is achieved by making use of ontologies
– hierarchical structures of concepts – to classify individual concepts.
The World Wide Web has grown rapidly and contains huge amounts of information that cannot be interpreted by
machines. Machines cannot understand meaning, therefore they cannot understand Web content. For this reason,
most attempts to retrieve some useful pieces of information from the Web require a high degree of user involvement
– manually retrieving information from multiple sources (different Web pages), ‘digging’ through multiple search
engine results (where useful pieces of data are often buried many pages deep), comparing differently structured
result sets (most of them incomplete), and so on.
For the machine interpretation of semantic content to become possible, there are two prerequisites:
1. Every concept should be uniquely identified. (For example, if a particular person owns a website, authors articles on other sites, gives an interview on another site, and has profiles in a couple of social media sites such as Facebook and LinkedIn, then the occurrences of his name/identifier in all these places should be related to the exact same identifier.)
2. There must be a unified system for conveying and interpreting meaning that all automated search agents and
data storage applications should use.
One approach for attaching semantic information to Web content is to embed the necessary machine-processable information through the use of special meta-descriptors (meta-tagging) in addition to the existing meta-tags that mainly concern the layout.
Within these meta tags, the resources (the pieces of useful information) can be uniquely identified in the same
manner in which Web pages are uniquely identified, i.e., by extending the existing URL system into something
more universal – a URI (Uniform Resource Identifier). In addition, conventions can be devised, so that resources
can be described in terms of properties and values (resources can have properties and properties have values).
The concrete implementations of these conventions (or vocabularies) can be embedded into Web pages (through
metadescriptors again) thus effectively ‘telling’ the processing machines things like:
[resource] John Doe has a [property] web site which is [value] www.johndoesite.com
The Resource Description Framework (RDF) developed by the World Wide Web Consortium (W3C) makes possible the automated semantic processing of information by structuring information using individual statements that consist of: Subject, Predicate, Object. Although frequently referred to as a 'language', RDF is mainly a data model. It is based on the idea that the things being described have properties, which have values, and that resources can be described by making statements. RDF prescribes how to make statements about resources, in particular Web resources, in the form of subject-predicate-object expressions. The 'John Doe' example above is precisely this kind of statement. The statements are also referred to as triples, because they always have the subject-predicate-object structure.
The basic RDF components include statements, Uniform Resource Identifiers, properties, blank nodes, and literals.
RDF-star (formerly RDF*) extends RDF with support for embedded triples. They are discussed in the topics that follow.
A unique Uniform Resource Identifier (URI) is assigned to any resource or thing that needs to be described. Resources can be authors, books, publishers, places, people, hotels, goods, articles, search queries, and so on. In the Semantic Web, every resource has a URI. A URI can be a URL or some other kind of unique identifier. Unlike URLs, URIs do not necessarily enable access to the resource they describe, i.e., in most cases they do not represent actual web pages. For example, the string http://www.johndoesite.com/aboutme.htm, if used as a URL (Web link), is expected to take us to a Web page of the site providing information about the site owner, the person John Doe. The same string can, however, be used simply to identify that person on the Web (URI), irrespective of whether such a page exists or not.
Thus URI schemes can be used not only for Web locations, but also for such diverse objects as telephone numbers,
ISBN numbers, and geographic locations. In general, we assume that a URI is the identifier of a resource and can
be used as either the subject or the object of a statement. Once the subject is assigned a URI, it can be treated as a
resource and further statements can be made about it.
This idea of using URIs to identify ‘things’ and the relations between them is important. This approach goes some
way towards a global, unique naming scheme. The use of such a scheme greatly reduces the homonym problem
that has plagued distributed data representation in the past.
There are several conventions for writing abbreviated RDF statements, as used in the RDF specifications themselves. This shorthand employs an XML qualified name (or QName) without angle brackets as an abbreviation for a full URI reference. A QName contains a prefix that has been assigned to a namespace URI, followed by a colon, and then a local name. The full URI reference is formed from the QName by appending the local name to the namespace URI assigned to the prefix. So, for example, if the QName prefix foo is assigned to the namespace URI http://example.com/somewhere/, then the QName "foo:bar" is a shorthand for the URI http://example.com/somewhere/bar.
In our example, we can define the namespace jds for http://www.johndoesite.com and use the Dublin Core
Metadata namespace dc for http://purl.org/dc/elements/1.1/.
So, the shorthand form for the example statement is simply:
Objects of RDF statements can (and very often do) form the subjects of other statements, leading to a graph-like representation of knowledge. Using this notation, a statement is represented by:
• a node for the subject;
• a node for the object;
• an arc for the predicate, directed from the subject node to the object node.
So the RDF statement above could be represented by the following graph:
This kind of graph is known in the artificial intelligence community as a ‘semantic net’.
In order to represent RDF statements in a machine-processable way, RDF uses markup languages, namely (and almost exclusively) the Extensible Markup Language (XML). Because an abstract data model needs a concrete syntax in order to be represented and transmitted, RDF has been given a syntax in XML. As a result, it inherits the benefits associated with XML. However, it is important to understand that other syntactic representations of RDF, not based on XML, are also possible. XML-based syntax is not a necessary component of the RDF model. XML was designed to allow anyone to design their own document format and then write a document in that format. RDF defines a specific XML markup language, referred to as RDF/XML, for use in representing RDF information and for exchanging it between machines. Written in RDF/XML, our example will look as follows:
Note: RDF/XML uses the namespace mechanism of XML, but in an expanded way. In XML, namespaces are
only used for disambiguation purposes. In RDF/XML, external namespaces are expected to be RDF documents
defining resources, which are then used in the importing RDF document. This mechanism allows the reuse of
resources by other people who may decide to insert additional features into these resources. The result is the
emergence of large, distributed collections of knowledge.
Also observe that the rdf:about attribute of the element rdf:Description is equivalent in meaning to that of
an ID attribute, but it is often used to suggest that the object about which a statement is made has already been
‘defined’ elsewhere. Strictly speaking, a set of RDF statements together simply forms a large graph, relating things
to other things through properties, and there is no such concept as ‘defining’ an object in one place and referring to
it elsewhere. Nevertheless, in the serialized XML syntax, it is sometimes useful (if only for human readability) to
suggest that one location in the XML serialization is the ‘defining’ location, while other locations state ‘additional’
properties about an object that has been ‘defined’ elsewhere.
Properties
Properties are a special kind of resource: they describe relationships between resources, e.g., written by, age, title, and so on. Properties in RDF are also identified by URIs (in most cases, these are actual URLs). Therefore, properties themselves can be used as the subject in other statements, which allows for expressive ways to describe properties, e.g., by defining property hierarchies.
Named graphs
A named graph (NG) is a set of triples named by a URI. This URI can then be used outside or within the graph to
refer to it. The ability to name a graph allows separate graphs to be identified out of a large collection of statements
and further allows statements to be made about graphs.
Named graphs represent an extension of the RDF data model, where quadruples <s,p,o,ng> are used to define
statements in an RDF multigraph. This mechanism allows, e.g., the handling of provenance when multiple RDF
graphs are integrated into a single repository.
From the perspective of GraphDB, named graphs are important, because comprehensive support for SPARQL
requires NG support.
While being a universal model that lets users describe resources using their own vocabularies, RDF does not make
assumptions about any particular application domain, nor does it define the semantics of any domain. It is up to
the user to do so using an RDF Schema (RDFS) vocabulary.
RDF Schema is a vocabulary description language for describing properties and classes of RDF resources, with a semantics for generalization hierarchies of such properties and classes. Be aware of the fact that the RDF Schema is conceptually different from the XML Schema, even though the common term schema suggests similarity. The XML Schema constrains the structure of XML documents, whereas the RDF Schema defines the vocabulary used in RDF data models. Thus, RDFS makes semantic information machine-accessible, in accordance with the Semantic Web vision. RDF Schema is a primitive ontology language. It offers certain modelling primitives with fixed meaning.
RDF Schema does not provide a vocabulary of application-specific classes. Instead, it provides the facilities needed to describe such classes and properties, and to indicate which classes and properties are expected to be used together (for example, to say that the property JobTitle will be used in describing a class "Person"). In other words, RDF Schema provides a type system for RDF.
The RDF Schema type system is similar in some respects to the type systems of object-oriented programming languages such as Java. For example, RDFS allows resources to be defined as instances of one or more classes.
In addition, it allows classes to be organized in a hierarchical fashion. For example, a class Dog might be defined
as a subclass of Mammal, which itself is a subclass of Animal, meaning that any resource that is in class Dog is also
implicitly in class Animal as well.
RDF classes and properties, however, are in some respects very different from programming language types. RDF
class and property descriptions do not create a straightjacket into which information must be forced, but instead
provide additional information about the RDF resources they describe.
The RDFS facilities are themselves provided in the form of an RDF vocabulary, i.e., as a specialized set of predefined RDF resources with their own special meanings. The resources in the RDFS vocabulary have URIs with the prefix http://www.w3.org/2000/01/rdf-schema# (conventionally associated with the namespace prefix rdfs).
Vocabulary descriptions (schemas) written in the RDFS language are legal RDF graphs. Hence, systems processing RDF information that do not understand the additional RDFS vocabulary can still interpret a schema as a legal RDF graph consisting of various resources and properties. However, such a system will be oblivious to the additional built-in meaning of the RDFS terms. To understand these additional meanings, the software that processes RDF information has to be extended to include these language features and to interpret their meanings in the defined way.
Describing classes
A class can be thought of as a set of elements. Individual objects that belong to a class are referred to as instances of that class. A class in RDFS corresponds to the generic concept of a type or category, similar to the notion of a class in object-oriented programming languages such as Java. RDF classes can be used to represent any category of objects such as web pages, people, document types, databases, or abstract concepts. Classes are described using the RDF Schema resources rdfs:Class and rdfs:Resource, and the properties rdf:type and rdfs:subClassOf. The relationship between instances and classes in RDF is defined using rdf:type.
An important use of classes is to impose restrictions on what can be stated in an RDF document using the schema. In programming languages, typing is used to prevent incorrect use of objects (resources), and the same is true in RDF: typing imposes a restriction on the objects to which a property can be applied. In logical terms, this is a restriction on the domain of the property.
Describing properties
In addition to describing the specific classes of things they want to describe, user communities also need to be able to describe specific properties that characterize those classes of things (such as numberOfBedrooms to describe an apartment). In RDFS, properties are described using the RDF class rdf:Property, and the RDFS properties rdfs:domain, rdfs:range, and rdfs:subPropertyOf.
All properties in RDF are described as instances of class rdf:Property. So, a new property, such as exterms:weightInKg, is defined with the following RDF statement:
RDFS also provides vocabulary for describing how properties and classes are intended to be used together. The most important information of this kind is supplied by using the RDFS properties rdfs:range and rdfs:domain to further describe application-specific properties.
The rdfs:range property is used to indicate that the values of a particular property are members of a designated
class. For example, to indicate that the property ex:author has values that are instances of class ex:Person, the
following RDF statements are used:
These statements indicate that ex:Person is a class, ex:author is a property, and that RDF statements using the
ex:author property have instances of ex:Person as objects.
The rdfs:domain property is used to indicate that a particular property is used to describe a specific class of
objects. For example, to indicate that the property ex:author applies to instances of class ex:Book, the following
RDF statements are used:
These statements indicate that ex:Book is a class, ex:author is a property, and that RDF statements using the
ex:author property have instances of ex:Book as subjects.
Sharing vocabularies
RDFS provides the means to create custom vocabularies. However, it is generally easier and better practice to use an existing vocabulary created by someone else who has already been describing a similar conceptual domain. Such publicly available vocabularies, called 'shared vocabularies', are not only cost-efficient to use, but they also promote the shared understanding of the described domains.
Considering the earlier example, in the statement:
the predicate dc:creator, when fully expanded into a URI, is an unambiguous reference to the creator attribute
in the Dublin Core metadata attribute set, a widely used set of attributes (properties) for describing information of
this kind. So this triple is effectively saying that the relationship between the website (identified by http://www.
johndoesite.com/) and the creator of the site (a distinct person, identified by http://www.johndoesite.com/
aboutme) is exactly the property identified by http://purl.org/dc/elements/1.1/creator. This way, anyone
familiar with the Dublin Core vocabulary or those who find out what dc:creator means (say, by looking up its
definition on the Web) will know what is meant by this relationship. In addition, this shared understanding based
upon using unique URIs for identifying concepts is exactly the requirement for creating computer systems that can
automatically process structured information.
However, the use of URIs does not solve all identification problems, because different URIs can be created for
referring to the same thing. For this reason, it is a good idea to have a preference towards using terms from existing
vocabularies (such as the Dublin Core) where possible, rather than making up new terms that might overlap with
those of some other vocabulary. Appropriate vocabularies for use in specific application areas are being developed
all the time, but even so, the sharing of these vocabularies in a common ‘Web space’ provides the opportunity to
identify and deal with any equivalent terminology.
An example of a shared vocabulary that is readily available for reuse is the Dublin Core, which is a set of elements (properties) for describing documents (and hence, for recording metadata). The element set was originally developed at the March 1995 Metadata Workshop in Dublin, Ohio, USA. Dublin Core has subsequently been modified on the basis of later Dublin Core Metadata workshops and is currently maintained by the Dublin Core Metadata Initiative.
The goal of Dublin Core is to provide a minimal set of descriptive elements that facilitate the description and the automated indexing of document-like networked objects, in a manner similar to a library card catalogue. The Dublin Core metadata set is suitable for use by resource discovery tools on the Internet, such as the Web crawlers employed by search engines. In addition, Dublin Core is meant to be sufficiently simple to be understood and used by the wide range of authors and casual publishers of information on the Internet.
Dublin Core elements have become widely used in documenting Internet resources (the Dublin Core creator
element was used in the earlier examples). The current elements of Dublin Core contain definitions for properties
such as title (a name given to a resource), creator (an entity primarily responsible for creating the content of
the resource), date (a date associated with an event in the lifecycle of the resource) and type (the nature or genre
of the content of the resource).
Information using Dublin Core elements may be represented in any suitable language (e.g., in HTML meta elements). However, RDF is an ideal representation for Dublin Core information. The following example uses Dublin Core by itself to describe an audio recording of a guide to growing rose bushes:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://media.example.com/audio/guide.ra">
<dc:creator>Mr. Dan D. Lion</dc:creator>
<dc:title>A Guide to Growing Roses</dc:title>
<dc:description>Describes planting and nurturing rose bushes.
    ...
  </rdf:Description>
</rdf:RDF>
In general, an ontology formally describes a (usually finite) domain of related concepts (classes of objects) and
their relationships. For example, in a company setting, staff members, managers, company products, offices, and
departments might be some important concepts. The relationships typically include hierarchies of classes. A
hierarchy specifies a class C to be a subclass of another class C' if every object in C is also included in C'. For
example, all managers are staff members.
Apart from subclass relationships, ontologies may include information such as:
• properties (X is subordinated Y);
• value restrictions (only managers may head departments);
• disjointness statements (managers and general employees are disjoint);
• specifications of logical relationships between objects (every department must have at least three staff members).
Ontologies are important because semantic repositories use ontologies as semantic schemata. This makes automated reasoning about the data possible (and easy to implement), since the most essential relationships between the concepts are built into the ontology.
Formal knowledge representation (KR) is about building models. The typical modeling paradigm is mathematical logic, but there are also other approaches, rooted in information and library science. KR is a very broad term; here we only refer to its mainstream meaning: formal models of the world (of a particular state of affairs, situation, domain, or problem) that allow for automated reasoning and interpretation. Such models consist of ontologies defined in a formal language. Ontologies can be used to provide formal semantics (i.e., machine-interpretable meaning) to any sort of information: databases, catalogues, documents, Web pages, etc. Ontologies can be used as semantic frameworks: the association of information with ontologies makes such information much more amenable to machine processing and interpretation. This is because ontologies are described using logical formalisms, such as OWL, which allow automatic inferencing over these ontologies and datasets that use them, i.e., as a vocabulary.
An important role of ontologies is to serve as schemata or 'intelligent' views over information resources. This is also the role of ontologies in the Semantic Web. Thus, they can be used for indexing, querying, and reference purposes over non-ontological datasets and systems, such as databases, document and catalogue management systems. Because ontological languages have formal semantics, ontologies allow a wider interpretation of data, i.e., inference of facts that are not explicitly stated. In this way, they can improve the interoperability and the efficiency of using arbitrary datasets.
An ontology O can be defined as comprising the 4-tuple:
O = <C, R, I, A>
where
• C is a set of classes representing concepts from the domain we wish to describe (e.g., invoices, payments,
products, prices, etc);
• R is a set of relations (also referred to as properties or predicates) holding between (instances of) these classes
(e.g., Product hasPrice Price);
• I is a set of instances, where each instance can be a member of one or more classes and can be linked
to other instances or to literal values (strings, numbers and other datatypes) by relations (e.g., product23
compatibleWith product348 or product23 hasPrice €170);
• A is a set of axioms (e.g., if a product has a price greater than €200, then shipping is free).
Classification of ontologies
Ontologies can be classified as lightweight or heavyweight according to the complexity of the KR language and the extent to which it is used. Lightweight ontologies allow for more efficient and scalable reasoning, but do not possess the highly predictive (or restrictive) power of more powerful KR languages. Ontologies can be further differentiated according to the sort of conceptualization that they formalize: upper-level ontologies model general knowledge, while domain and application ontologies represent knowledge about a specific domain (e.g., medicine or sport) or a type of application, e.g., knowledge management systems.
Finally, ontologies can be distinguished according to the sort of semantics being modeled and their intended usage.
The major categories from this perspective are:
• Schema-ontologies: ontologies that are close in purpose and nature to database and object-oriented schemata. They define classes of objects, their properties, and relationships to objects of other classes. A typical use of such an ontology involves using it as a vocabulary for defining large sets of instances. In basic terms, a class in a schema-ontology corresponds to a table in a relational database; a relation – to a column; an instance – to a row in the table for the corresponding class;
• Topic-ontologies: taxonomies that define hierarchies of topics, subjects, categories, or designators. These have a wide range of applications related to the classification of different things (entities, information resources, files, Web pages, etc.). The most popular examples are library classification systems and taxonomies, which are widely used in the knowledge management field. Yahoo and DMOZ are popular large-scale incarnations of this approach. A number of the most popular taxonomies are listed as encoding schemata in Dublin Core;
• Lexical ontologies: lexicons with formal semantics that define lexical concepts. We use 'lexical concept' here as some kind of formal representation of the meaning of a word or a phrase. In WordNet, for example, lexical concepts are modeled as synsets (synonym sets), while word-sense is the relation between a word and a synset. These can be considered as semantic thesauri or dictionaries. The concepts defined in such ontologies are not instantiated; rather, they are directly used for reference, e.g., for annotation of the corresponding terms in text. WordNet is the most popular general-purpose (i.e., upper-level) lexical ontology.
Knowledge bases
Knowledge base (KB) is a broader term than ontology. Similar to an ontology, a KB is represented in a KR formalism, which allows automatic inference. It could include multiple axioms, definitions, rules, facts, statements, and any other primitives. In contrast to ontologies, however, KBs are not intended to represent a shared or consensual conceptualization. Thus, ontologies are a specific sort of KB. Many KBs can be split into ontology and instance data parts, in a way analogous to the splitting of schemata and concrete data in databases.
Proton
PROTON is a lightweight upper-level schema-ontology developed in the scope of the SEKT project, which we will use for ontology-related examples in this section. PROTON is encoded in OWL Lite and defines about 542 entity classes and 183 properties, providing good coverage of named entity types and concrete domains, i.e., modeling of concepts such as people, organizations, locations, numbers, dates, addresses, etc. A snapshot of the PROTON class hierarchy is shown below.
The topics that follow take a closer look at the logic that underlies the retrieval and manipulation of semantic data
and the kind of programming that supports it.
Logic programming
Logic programming involves the use of logic for computer programming, where the programmer uses a declarative language to assert statements and a reasoner or theorem-prover is used to solve problems. A reasoner can interpret sentences, such as IF A THEN B, as a means to prove B from A. In other words, given a collection of logical sentences, a reasoner will explore the solution space in order to find a path to justify the requested theory. For example, to determine the truth value of C given the following logical sentences:
IF A AND B THEN C
B
IF D THEN A
D
a reasoner will interpret the IF..THEN statements as rules and determine that C is indeed inferred from the KB. This use of rules in logic programming has led to 'rule-based reasoning' and 'logic programming' becoming synonymous, although this is not strictly the case.
In LP, there are rules of logical inference that allow new (implicit) statements to be inferred from other (explicit)
statements, with the guarantee that if the explicit statements are true, so are the implicit statements.
Because these rules of inference can be expressed in purely symbolic terms, applying them is the kind of symbol
manipulation that can be carried out by a computer. This is what happens when a computer executes a logical
program: it uses the rules of inference to derive new statements from the ones given in the program, until it finds
one that expresses the solution to the problem that has been formulated. If the statements in the program are true,
then so are the statements that the machine derives from them, and the answers it gives will be correct.
The program can give correct answers only if the following two conditions are met:
Predicate logic
From a more abstract viewpoint, the subject of the previous topic is related to the foundation upon which logic programming rests, which is logic, particularly in the form of predicate logic (also known as ‘first-order logic’).
Some of the specific features of predicate logic render it very suitable for making inferences over the Semantic
Web, namely:
• It provides a high-level language in which knowledge can be expressed in a transparent way and with high
expressive power;
• It has a well-understood formal semantics, which assigns unambiguous meaning to logical statements;
• There are proof systems that can automatically derive statements syntactically from a set of premises. These
proof systems are both sound (meaning that all derived statements follow semantically from the premises)
and complete (all logical consequences of the premises can be derived in the proof system);
• It is possible to trace the proof that leads to a logical consequence. (This is because the proof system is sound
and complete.) In this sense, the logic can provide explanations for answers.
The languages of RDF and OWL (Lite and DL) can be viewed as specializations of predicate logic. One reason
for such specialized languages to exist is that they provide a syntax that fits well with the intended use (in our
case, Web languages based on tags). The other major reason is that they define reasonable subsets of logic. This is important because there is a trade-off between the expressive power and the computational complexity of a given logic: the more expressive the language, the less efficient (in the worst case) the corresponding proof systems. As
previously stated, OWL Lite and OWL DL correspond roughly to description logic, a subset of predicate logic for
which efficient proof systems exist.
Another subset of predicate logic with efficient proof systems comprises the so-called rule systems (also known
as Horn logic or definite logic programs).
A rule has the form:
A1, ..., An → B
where Ai and B are atomic formulas. In fact, there are two intuitive ways of reading such a rule:
• If A1, ... , An are known to be true, then B is also true. Rules with this interpretation are referred to as
‘deductive rules’.
• If the conditions A1, ... , An are true, then carry out the action B. Rules with this interpretation are referred
to as ‘reactive rules’.
Both approaches have important applications. The deductive approach, however, is more relevant for the purpose
of retrieving and managing structured data. This is because it relates better to the possible queries that one can ask,
as well as to the appropriate answers and their proofs.
Description logic
Description Logic (DL) has historically evolved from a combination of frame-based systems and predicate logic. Its main purpose is to overcome some of the problems with frame-based systems and to provide a clean and
efficient formalism to represent knowledge. The main idea of DL is to describe the world in terms of ‘properties’
or ‘constraints’ that specific ‘individuals’ must satisfy. DL is based on the following basic entities:
• Objects: Correspond to single ‘objects’ of the real world such as a specific person, a table or a telephone.
The main properties of an object are that it can be distinguished from other objects and that it can be referred
to by a name. DL objects correspond to the individual constants in predicate logic;
• Concepts: Can be seen as ‘classes of objects’. Concepts have two functions: on one hand, they describe
a set of objects and on the other, they determine properties of objects. For example, the class “table” is
supposed to describe the set of all table objects in the universe. On the other hand, it also determines some
properties of a table such as having legs and a flat horizontal surface that one can lay something on. DL
concepts correspond to unary predicates in first-order logic and to classes in frame-based systems;
• Roles: Represent relationships between objects. For example, the role ‘lays on’ might define the relationship
between a book and a table, where the book lays upon the table. Roles can also be applied to concepts.
However, they do not describe the relationship between the classes (concepts); rather, they describe the properties of the objects that are members of those classes;
• Rules: In DL, rules take the form of “if condition x (left side), then property y (right side)” and form statements that read as “if an object satisfies the condition on the left side, then it has the properties of the right
side”. So, for example, a rule can state something like ‘all objects that are male and have at least one child
are fathers’.
The family of DL systems consists of many members that differ mainly with respect to the constructs they provide.
Not all of the constructs can be found in a single DL system.
In order to achieve the goal of a broad range of shared ontologies using vocabularies with expressiveness appropriate for each domain, the Semantic Web requires a scalable, high-performance storage and reasoning infrastructure.
The major challenge towards building such an infrastructure is the expressivity of the underlying standards: RDF,
RDFS, OWL, and OWL 2. Even though RDFS can be considered a simple KR language, it is already a challenging
task to implement a repository for it, which provides performance and scalability comparable to those of relational
database management systems (RDBMS). Even the simplest dialect of OWL (OWL Lite) is a description logic
(DL) that does not scale due to reasoning complexity. Furthermore, the semantics of OWL Lite are incompatible
with that of RDF(S).
Figure 1 OWL Layering Map
OWL DLP
OWL DLP is a non-standard dialect, offering a promising compromise between expressive power, efficient reasoning, and compatibility. It is defined as the intersection of the expressivity of OWL DL and logic programming.
In fact, OWL DLP is defined as the most expressive sublanguage of OWL DL, which can be mapped to Datalog.
OWL DLP is simpler than OWL Lite. The alignment of its semantics to RDFS is easier, as compared to OWL
Lite and OWL DL dialects. Still, this can only be achieved through the enforcement of some additional modeling
constraints and transformations.
Horn logic and description logic are orthogonal (in the sense that neither of them is a subset of the other). OWL
DLP is the ‘intersection’ of Horn logic and OWL; it is the Horn-definable part of OWL, or stated another way, the OWL-definable part of Horn logic.
DLP has certain advantages:
• From a modeler’s perspective, there is freedom to use either OWL or rules (and associated tools and methodologies) for modeling purposes, depending on the modeler’s experience and preferences.
• From an implementation perspective, either description logic reasoners or deductive rule systems can be
used. This feature provides extra flexibility and ensures interoperability with a variety of tools.
Experience with using OWL has shown that existing ontologies frequently use very few constructs outside the
DLP language.
OWL-Horst
In “Combining RDF and Part of OWL with Rules: Semantics, Decidability, Complexity”, ter Horst defines RDFS extensions towards rule support and describes a fragment of OWL that is more expressive than DLP. He introduces the notion of R-entailment of one (target) RDF graph from another (source) RDF graph on the basis of a set of entailment rules R. R-entailment is more general than the D-entailment used by Hayes in defining the standard
RDFS semantics. Each rule has a set of premises, which conjunctively define the body of the rule. The premises
are ‘extended’ RDF statements, where variables can take any of the three positions.
The head of the rule comprises one or more consequences, each of which is, again, an extended RDF statement.
The consequences may not contain free variables, i.e., which are not used in the body of the rule. The consequences
may contain blank nodes.
The extension of R-entailment (as compared to D-entailment) is that it ‘operates’ on top of so-called generalized RDF graphs, where blank nodes can appear as predicates. R-entailment rules without premises are used to declare
axiomatic statements. Rules without consequences are used to detect inconsistencies.
In this document, we refer to this extension of RDFS as “OWL-Horst”. This language has a number of important
characteristics:
• It is a proper (backward-compatible) extension of RDFS. In contrast to OWL DLP, it puts no constraints on the RDFS semantics. The widely discussed metaclasses (classes as instances of other classes) are not disallowed in OWL-Horst. It also does not enforce the unique name assumption;
• Unlike DL-based rule languages such as SWRL, R-entailment provides a formalism for rule extensions without DL-related constraints;
• Its complexity is lower than that of SWRL and other approaches combining DL ontologies with rules.
In Figure 1, the pink box represents the range of expressivity of GraphDB, i.e., including OWL DLP, OWL-Horst, OWL2-RL, and most of OWL Lite. However, none of the rulesets include support for the entailment of typed literals (D-entailment).
OWL-Horst is close to what SWAD-Europe has intuitively described as OWL Tiny. The major difference is that OWL Tiny (like the fragment supported by GraphDB) does not support entailment over data types.
OWL2-RL
OWL 2 is a rework of the OWL language family by the OWL working group. This work includes identifying
fragments of the OWL 2 language that have desirable behavior for specific applications/environments.
The OWL 2 RL profile is aimed at applications that require scalable reasoning without sacrificing too much expressive power. It is designed to accommodate both OWL 2 applications that can trade the full expressivity of the language for efficiency, and RDF(S) applications that need some added expressivity from OWL 2. This is achieved by defining a syntactic subset of OWL 2, which is amenable to implementation using rule-based technologies, and presenting a partial axiomatization of the OWL 2 RDF-Based Semantics in the form of first-order implications that can be used as the basis for such an implementation. The design of OWL 2 RL was inspired by Description Logic Programs and pD*.
OWL Lite
The original OWL specification, now known as OWL 1, provides two specific subsets of OWL Full designed to
be of use to implementers and language users. The OWL Lite subset was designed for easy implementation and
to offer users a functional subset that provides an easy way to start using OWL.
OWL Lite is a sublanguage of OWL DL that supports only a subset of the OWL language constructs. OWL Lite
is particularly targeted at tool builders, who want to support OWL, but who want to start with a relatively simple
basic set of language features. OWL Lite abides by the same semantic restrictions as OWL DL, allowing reasoning
engines to guarantee certain desirable properties.
OWL DL
The OWL DL (where DL stands for Description Logic) subset was designed to support the existing Description
Logic business segment and to provide a language subset that has desirable computational properties for reasoning
systems.
OWL Full and OWL DL support the same set of OWL language constructs. Their difference lies in the restrictions
on the use of some of these features and on the use of RDF features. OWL Full allows free mixing of OWL
with RDF Schema and, like RDF Schema, does not enforce a strict separation of classes, properties, individuals
and data values. OWL DL puts constraints on mixing with RDF and requires disjointness of classes, properties,
individuals and data values. The main reason for having the OWL DL sublanguage is that tool builders have
developed powerful reasoning systems that support ontologies constrained by the restrictions required for OWL
DL.
In this section, we introduce some query languages for RDF. This may raise the question of why we need RDF-specific query languages at all instead of using an XML query language. The answer is that XML is located at a lower level of abstraction than RDF. This fact would lead to complications if we were querying RDF documents with an XML-based language. The RDF query languages explicitly capture the RDF semantics in the language itself.
All the query languages discussed below have a SQL-like syntax, but there are also a few non-SQL-like languages such as Versa and Adenine.
The query languages supported by RDF4J (which is the Java framework within which GraphDB operates) and
therefore by GraphDB, are SPARQL and SeRQL.
RQL, RDQL
RQL (RDF Query Language) was initially developed by the Institute of Computer Science at Heraklion, Greece, in the context of the European IST project MESMUSES. RQL adopts the syntax of OQL (a query language standard for object-oriented databases), and, like OQL, is defined by means of a set of core queries, a set of basic filters, and a way to build new queries through functional composition and iterators.
The core queries are the basic building blocks of RQL, which give access to the RDFS-specific contents of an
RDF triplestore. RQL allows queries such as Class (retrieving all classes), Property (retrieving all properties) or
Employee (returning all instances of the class with name Employee). This last query, of course, also returns all
instances of subclasses of Employee, as these are also instances of the class Employee by virtue of the semantics
of RDFS.
RDQL (RDF Data Query Language) is a query language for RDF first developed for Jena models. RDQL is an
implementation of the SquishQL RDF query language, which itself is derived from rdfDB. This class of query
languages regards RDF as triple data, without schema or ontology information unless explicitly included in the
RDF source.
Apart from RDF4J, the following systems currently provide RDQL (all these implementations are known to derive
from the original grammar): Jena, RDFStore, PHP XML Classes, 3Store, and RAP (RDF API for PHP).
SPARQL
SPARQL (pronounced “sparkle”) is currently the most popular RDF query language; its name is a recursive
acronym that stands for “SPARQL Protocol and RDF Query Language”. It was standardized by the RDF Data
Access Working Group (DAWG) of the World Wide Web Consortium, and is now considered a key Semantic Web
technology. On 15 January 2008, SPARQL became an official W3C Recommendation.
SPARQL allows for a query to consist of triple patterns, conjunctions, disjunctions, and optional patterns. Several
SPARQL implementations for multiple programming languages exist at present.
SeRQL
SeRQL (Sesame RDF Query Language, pronounced “circle”) is an RDF/RDFS query language developed by
Sesame’s developer Aduna as part of Sesame (now RDF4J). It selectively combines the best features (considered
by its creators) of other query languages (RQL, RDQL, N-Triples, N3) and adds some features of its own. As of
this writing, SeRQL provides advanced features not yet available in SPARQL. Some of SeRQL’s most important
features are:
• Graph transformation;
• RDF Schema support;
• XML Schema datatype support;
• Expressive path expression syntax;
• Optional path matching.
There are two principal strategies for rule-based inference: forward-chaining and backward-chaining.
Forward-chaining starts from the known facts (the explicit statements) and performs inference in a deductive fashion. Forward-chaining involves applying the inference rules to the known facts (explicit statements)
to generate new facts. The rules can then be reapplied to the combination of original facts and inferred
facts to produce more new facts. The process is iterative and continues until no new facts can be generated.
The goals of such reasoning can have diverse objectives, e.g., to compute the inferred closure, to answer a
particular query, to infer a particular sort of knowledge (e.g., the class taxonomy), etc.
Advantages: When all inferences have been computed, query answering can proceed extremely quickly.
Disadvantages: Initialization costs (inference computed at load time) and space/memory usage (especially
when the number of inferred facts is very large).
Backward-chaining involves starting with a fact to be proved or a query to be answered. Typically, the reasoner
examines the knowledge base to see if the fact to be proved is present and if not it examines the ruleset to see
which rules could be used to prove it. For the latter case, a check is made to see what other ‘supporting’ facts
would need to be present to ‘fire’ these rules. The reasoner searches for proofs of each of these ‘supporting’
facts in the same way and iteratively maps out a search tree. The process terminates when either all of the
leaves of the tree have proofs or no new candidate solutions can be found. Query processing is similar, but
only stops when all search paths have been explored. The purpose in query answering is to find not just one
but all possible substitutions in the query expression.
Advantages: There are no inferencing costs at startup and minimal space requirements.
Disadvantages: Inference must be done each and every time a query is answered and for complex search
graphs this can be computationally expensive and slow.
As both strategies have advantages and disadvantages, attempts to overcome their weak points have led to the
development of various hybrid strategies (involving partial forward- and backward-chaining), which have proven
efficient in many contexts.
Total materialization
Imagine a repository that performs total forward-chaining, i.e., it tries to make sure that, after each update to the KB, the inferred closure is computed and made available for query evaluation or retrieval. This strategy is generally known as materialization. When new explicit facts (statements) are added to a KB (repository), new implicit facts will likely be inferred. Under a monotonic logic, adding new explicit statements will never cause previously inferred statements to be retracted; in other words, the addition of new facts can only monotonically extend the inferred closure. In order to avoid ambiguity with various partial materialization approaches, let us call such an inference strategy, taken together with the monotonic entailment assumption, total materialization.
Advantages and disadvantages of the total materialization:
• Upload/store/addition of new facts is relatively slow, because the repository is extending the inferred closure
after each transaction. In fact, all the reasoning is performed during the upload;
• Deletion of facts is also slow, because the repository should remove from the inferred closure all the facts
that can no longer be proved;
• The maintenance of the inferred closure usually requires considerable additional space (RAM, disk, or both,
depending on the implementation);
• Query and retrieval are fast, because no deduction, satisfiability checking, or other sorts of reasoning are required. The evaluation of queries becomes computationally comparable to the same task for relational database management systems (RDBMS).
Probably the most important advantage of systems based on total materialization is that they can easily benefit from RDBMS-like query optimization techniques, as long as all the data is available at query time.
The latter makes it possible for the query evaluation engine to use statistics and other means in order to make
‘educated’ guesses about the ‘cost’ and the ‘selectivity’ of a particular constraint. These optimizations are much
more complex in the case of deductive query evaluation.
Total materialization is adopted as the reasoning strategy in a number of popular Semantic Web repositories, including some of the standard configurations of RDF4J and Jena. Based on publicly available evaluation data, it is also the only strategy that allows scalable reasoning in the range of a billion triples; such results are published by BBN (for DAML DB) and ORACLE (for RDF support in ORACLE 11g).
Over the last decade, the Semantic Web has emerged as an area where semantic repositories became as important
as HTTP servers are today. This perspective boosted the development, under W3C-driven community processes, of a number of robust metadata and ontology standards. These standards play the role that SQL had for the development and spread of the relational DBMS. Although designed for the Semantic Web, these standards face
increasing acceptance in areas such as Enterprise Application Integration and Life Sciences.
In this document, the term ‘semantic repository’ is used to refer to a system for storage, querying, and management of structured data with respect to ontologies. At present, there is no single well-established term for such engines. Weak synonyms are: reasoner, ontology server, metastore, semantic/triple/RDF store, database, repository, knowledge base. The different wording usually reflects a somewhat different approach to implementation,
performance, intended application, etc. Introducing the term ‘semantic repository’ is an attempt to convey the
core functionality offered by most of these tools. Semantic repositories can be used as a replacement for database
management systems (DBMS), offering easier integration of diverse data and more analytical power. In a nutshell,
a semantic repository can dynamically interpret metadata schemata and ontologies, which define the structure and
the semantics related to the data and the queries. Compared to the approach taken in a relational DBMS, this allows
for easier changing and combining of data schemata and automated interpretation of the data.
The Resource Description Framework, more commonly known as RDF, is a graph data model that formally describes the semantics, or meaning, of information. It also represents metadata, that is, data about data.
RDF consists of triples. These triples are based on an Entity Attribute Value (EAV) model, in which the subject
is the entity, the predicate is the attribute, and the object is the value. Each resource in a triple is identified by a unique identifier known as a Uniform Resource Identifier, or URI. URIs look like web page addresses. The parts of a triple, the subject, predicate, and object, represent links in a graph.
Example triples:
In the first triple, “Fred hasSpouse Wilma”, Fred is the subject, hasSpouse is the predicate, and Wilma is the object.
Also, in the next triple, “Fred hasAge 25”, Fred is the subject, hasAge is the predicate and 25 is the object, or value.
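In Turtle syntax, these two example triples could be written roughly as follows (the bedrock prefix is a hypothetical example namespace, in line with the queries later in this section):

@prefix bedrock: <http://bedrock.example.com/> .

# "Fred hasSpouse Wilma" and "Fred hasAge 25" as RDF triples
bedrock:Fred bedrock:hasSpouse bedrock:Wilma .
bedrock:Fred bedrock:hasAge    25 .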
Multiple triples link together to form an RDF model. The graph below describes the characters and relationships
from the Flintstones television cartoon series. We can easily identify triples such as “WilmaFlintstone livesIn
Bedrock” or “FredFlintstone livesIn Bedrock”. We now know that the Flintstones live in Bedrock, which is part
of Cobblestone County in Prehistoric America.
The rest of the triples in the Flintstones graph describe the characters’ relations, such as hasSpouse or hasChild,
as well as their occupational association (worksFor).
Fred Flintstone is married to Wilma and they have a child Pebbles. Fred works for the Rock Quarry company and
Wilma’s mother is Pearl Slaghoople. Pebbles Flintstone is married to BammBamm Rubble who is the child of
Barney and Betty Rubble. Thus, as you can see, many triples form an RDF model.
RDF Schema, more commonly known as RDFS, adds schema to the RDF. It defines a metamodel of concepts like
Resource, Literal, Class, and Datatype and relationships such as subClassOf, subPropertyOf, domain, and range.
RDFS provides a means for defining the classes, properties, and relationships in an RDF model and organizing
these concepts and relationships into hierarchies.
RDFS specifies entailment rules or axioms for the concepts and relationships. These rules can be used to infer new
triples, as we show in the following diagram.
Looking at this example, we see how new triples can be inferred by applying RDFS rules to a small RDF/RDFS
model. In this model, we use RDFS to define that the hasSpouse relationship is restricted to humans. And as you
can see, human is a subclass of mammal.
If we assert that Wilma is Fred’s spouse using the hasSpouse relationship, then we can infer that Fred and Wilma
are human because, in RDFS, the hasSpouse relationship is defined to be between humans. Because we also know
humans are mammals, we can further infer that Fred and Wilma are mammals.
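As a minimal sketch (again with a hypothetical bedrock namespace), the schema and data behind this example could look like the following; the rdfs:domain/rdfs:range axioms justify inferring that Fred and Wilma are humans, and the rdfs:subClassOf axiom that they are mammals:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix bedrock: <http://bedrock.example.com/> .

# Schema: hasSpouse holds between humans; every human is a mammal
bedrock:hasSpouse rdfs:domain bedrock:Human ;
                  rdfs:range  bedrock:Human .
bedrock:Human     rdfs:subClassOf bedrock:Mammal .

# Data: from this single assertion, the RDFS rules infer that
# Fred and Wilma are of type Human and, therefore, Mammal
bedrock:Fred bedrock:hasSpouse bedrock:Wilma .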
19.3 SPARQL
SPARQL is a SQL-like query language for RDF data. SPARQL queries can produce result sets that are tabular or
RDF graphs depending on the kind of query used.
• SELECT is similar to the SQL SELECT in that it produces tabular result sets.
• CONSTRUCT creates a new RDF graph based on query results.
• ASK returns Yes or No depending on whether the query has a solution.
• DESCRIBE returns the RDF graph data about a resource. This is, of course, useful when the query client does
not know the structure of the RDF data in the data source.
As you can see in the example shown in the gray box, we wrote a query which included PREFIX, INSERT DATA, and several subject-predicate-object statements, which are:
Fred has spouse Wilma, Fred has child Pebbles, Wilma has child Pebbles, Pebbles has spouse BammBamm, and
Pebbles has children Roxy and Chip.
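The update in the gray box is not reproduced here, but a minimal sketch of it could look like this (bedrock is used as the default namespace, as described below):

PREFIX : <http://bedrock.example.com/>
INSERT DATA {
    :Fred    :hasSpouse :Wilma .
    :Fred    :hasChild  :Pebbles .
    :Wilma   :hasChild  :Pebbles .
    :Pebbles :hasSpouse :BammBamm .
    :Pebbles :hasChild  :Roxy, :Chip .
}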
Now, let’s write a SPARQL query to access the RDF graph you just created.
First, define prefixes to URIs with the PREFIX keyword. As in the earlier example, we set bedrock as the default
namespace for the query.
Next, use SELECT to signify that you want to select certain information, and WHERE to signify your conditions, restrictions, and filters.
Finally, execute this query:
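A sketch of such a query (returning all stored triples, which here are the family relationships):

PREFIX : <http://bedrock.example.com/>
SELECT ?subject ?predicate ?object
WHERE {
    ?subject ?predicate ?object .
}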
As you can see in this example shown in the gray box, we wrote a SPARQL query which included PREFIX, SELECT,
and WHERE. The red box displays the information which is returned in response to the written query. We can see
the familial relationships between Fred, Pebbles, Wilma, Roxy, and Chip.
SPARQL is quite similar to SQL; however, unlike SQL, which requires a schema and data stored in tables, SPARQL can be used on graphs and does not need a schema to be defined initially.
In the following example, we will use SPARQL to find out if Fred has any grandchildren.
First, define prefixes to URIs with the PREFIX keyword.
Next, we use ASK to discover whether Fred has a grandchild, and WHERE to signify the conditions.
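A sketch of this ASK query, matching Fred's children and then their children:

PREFIX : <http://bedrock.example.com/>
ASK WHERE {
    :Fred  :hasChild ?child .
    ?child :hasChild ?grandChild .
}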
As you can see in the query in the green box, Fred’s children’s children are his grandchildren. Thus the query is
easily written in SPARQL by matching Fred’s children and then matching his children’s children. The ASK query
returns “Yes” so we know Fred has grandchildren.
If instead we want a list of Fred’s grandchildren we can change the ASK query to a SELECT one:
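For example, along the lines of the following sketch:

PREFIX : <http://bedrock.example.com/>
SELECT ?grandChild
WHERE {
    :Fred  :hasChild ?child .
    ?child :hasChild ?grandChild .
}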
The query results, reflected in the red box, tell us that Fred’s grandchildren are Roxy and Chip.
The easiest way to execute SPARQL queries in GraphDB is by using the GraphDB Workbench. Just choose
SPARQL from the navigation bar, enter your query and hit Run, as shown in this example:
GraphDB supports multiple RDF formats for importing or exporting data. All RDF formats have at least one file
extension and MIME type that identify the format. Where multiple file extensions or MIME types are available,
the preferred file extension or MIME type is listed first.
The various formats differ when it comes to supporting named graphs, namespaces, and RDF-star. The following formats support everything and may be used to dump an entire repository preserving all of the information:
• TriG-star (text, human-readable, standards-based)
• BinaryRDF (binary, compact representation, RDF4J-specific)
19.4.1 Turtle
Named graphs No
Namespaces Yes
RDF-star No
MIME types text/turtle
application/x-turtle
File extensions .ttl
RDF4J Java API constant RDFFormat.TURTLE
Standard definition http://www.w3.org/ns/formats/Turtle
19.4.2 Turtle-star
Named graphs No
Namespaces Yes
RDF-star Yes
MIME types text/x-turtlestar
application/x-turtlestar
File extensions .ttls
RDF4J Java API constant RDFFormat.TURTLESTAR
Standard definition
19.4.3 TriG
19.4.4 TriG-star
19.4.5 N3
Named graphs No
Namespaces Yes
RDF-star No
MIME types text/n3
text/rdf+n3
File extensions .n3
RDF4J Java API constant RDFFormat.N3
Standard definition http://www.w3.org/ns/formats/N3
19.4.6 N-Triples
Named graphs No
Namespaces No
RDF-star No
MIME types application/n-triples
text/plain
File extensions .nt
RDF4J Java API constant RDFFormat.NTRIPLES
Standard definition http://www.w3.org/ns/formats/N-Triples
19.4.7 N-Quads
19.4.8 JSON-LD
19.4.9 NDJSON-LD
19.4.10 RDF/JSON
19.4.11 RDF/XML
Named graphs No
Namespaces Yes
RDF-star No
MIME types application/rdf+xml
application/xml
text/xml
File extensions .rdf
.rdfs
.owl
.xml
RDF4J Java API constant RDFFormat.RDFXML
Standard definition http://www.w3.org/ns/formats/RDF_XML
19.4.12 TriX
19.4.13 BinaryRDF
RDF is an abstract knowledge representation model that does not differentiate data from metadata. This prevents the extension of an existing model with statement-level metadata annotations like certainty scores, weights, temporal restrictions, and provenance information (e.g., whether an annotation was manually modified). Several approaches discussed on this page mitigate the inherent lack of native support for such annotations in RDF. However, they all have certain advantages and disadvantages, which we will look at below.
Standard reification
Reification means expressing an abstract construct with the existing concrete methods supported by the language.
The RDF specification sets a standard vocabulary for representing references to statements like:
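For instance, a statement such as “Fred is 25 years old” could be reified along these lines (the ex: namespace and the ex:certainty annotation are illustrative, not part of the RDF vocabulary):

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.com/> .

ex:ref1 rdf:type      rdf:Statement ;   # the reference node
        rdf:subject   ex:Fred ;
        rdf:predicate ex:hasAge ;
        rdf:object    25 ;
        ex:certainty  0.9 .             # metadata about the referenced statement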
Standard reification requires stating four additional triples to refer to the triple for which we want to provide
metadata. The subject of these four additional triples has to be a new identifier (IRI or blank node), which later
on may be used for providing the metadata. The existence of a reference to a triple does not automatically assert
it. The main advantage of this method is the standard support by every RDF store. Its disadvantages are the
inefficiency related to exchanging or persisting the RDF data and the cumbersome syntax to access and match the
corresponding four reification triples.
N-ary relations
The approach for representing N-ary relations in RDF is to model the relation via a new relationship concept that connects all of its arguments like:
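A sketch of the pattern, with illustrative names: the age assertion itself becomes a resource that the metadata can attach to.

@prefix ex: <http://example.com/> .

# The relation is promoted to a first-class resource with its own properties
ex:Fred          ex:ageStatement ex:ageStatement1 .
ex:ageStatement1 ex:ageValue     25 ;
                 ex:certainty    0.9 .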
The approach is similar to standard reification, but it adopts a schema specific to the domain model that is presumably understood by its consumers. The disadvantage here is that this approach increases the ontology model complexity, and it has proven difficult to evolve such models in a backward-compatible way.
Singleton properties
Singleton properties are a hacky way to introduce statement identifiers as a part of the predicate like:
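A sketch of the pattern with illustrative IRIs; the data and its metadata take only two statements:

# The unique statement identifier is encoded in the predicate itself
<http://example.com/Fred> <http://example.com/hasSpouse#statement1> <http://example.com/Wilma> .
<http://example.com/hasSpouse#statement1> <http://example.com/certainty> 0.9 .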
The local name of the predicate after the # encodes a unique identifier. The approach is compact for exchanging
data since it uses only two statements, but is highly inefficient for querying data. A query to return all :hasSpouse
links must parse all predicate values with a regular expression.
Warning: GraphDB supports singleton properties, but in a rather inefficient way. The database expects the number of unique predicates to be much smaller than the total number of statements. Our recommendation is to avoid this modeling approach for models of significant size.
Named graphs
The named graph approach is a variation of the singleton properties, where a unique value on the named graph
position identifies the statement like:
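A sketch in TriG, where a per-statement graph name (illustrative here) plays the role of the statement identifier:

@prefix ex: <http://example.com/> .

# The statement lives in its own named graph...
ex:statement1 {
    ex:Fred ex:hasSpouse ex:Wilma .
}

# ...and the graph name is then used to attach the metadata
ex:statement1 ex:certainty 0.9 .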
The approach has multiple advantages over singleton properties and eliminates the need for regular expression parsing. A significant drawback is that it overloads the named graph position with a statement identifier instead of the file or source that produced the triple. Updates based on the triple source therefore become more complicated and cumbersome to maintain.
Tip: If a repository stores a large number of named graphs, make sure to enable the context indexes.
RDF-star (formerly RDF*) is an extension of the RDF 1.1 standard that proposes a more efficient reification serialization syntax. The main advantages of this representation include a reduced document size that increases the efficiency of data exchange, as well as shorter SPARQL queries for improved comprehensibility.
The RDF-star extension captures the notion of an embedded triple by enclosing the referenced triple using the strings << and >>. The embedded triples, like blank nodes, may take a subject and object position only, and their meaning is aligned with the semantics of standard reification, but uses a much more efficient serialization syntax. To simplify the querying of embedded triples, the paper extends the query syntax with SPARQL-star (formerly SPARQL*), enabling queries like:
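For example, a query retrieving the certainty attached to a specific statement might be sketched as:

PREFIX ex: <http://example.com/>

SELECT ?certainty WHERE {
    <<ex:Fred ex:hasSpouse ex:Wilma>> ex:certainty ?certainty .
}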
Embedded triples in SPARQL-star also support free variables for retrieving a list of referenced statements:
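For example (again a sketch with illustrative names):

PREFIX ex: <http://example.com/>

SELECT ?s ?o ?certainty WHERE {
    <<?s ex:hasSpouse ?o>> ex:certainty ?certainty .
}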
To test the different approaches, we benchmark a subset of Wikidata, whose data model heavily uses statement-level metadata. The authors of the paper Reifying RDF: What works well with Wikidata? have done an excellent job of remodeling the dataset in various formats, and kindly shared the output datasets with our team. According to their modeling approach, the dataset includes:
Modeling approach Total statements Loading time (min) Repository image size (MB)
Standard reification 391,652,270 52.4 36,768
N-ary relations 334,571,877 50.6 34,519
Named graphs 277,478,521 56 35,146
RDF-star 220,375,702 34 22,465
We did not test the singleton properties approach due to the high number of unique predicates.
This section provides more in-depth details on how GraphDB implements the RDF-star/SPARQL-star syntax. Let’s say we have a statement like the one above, together with the metadata fact that we are 90% certain about this statement. The RDF-star syntax allows us to represent both the data and the metadata by using an embedded triple as follows:
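A sketch of such an embedded triple in Turtle-star, using the same illustrative foaf:age and ex:certainty properties as the queries below:

@prefix ex:   <http://example.com/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# The referenced statement and its certainty, stored as a single triple
<<ex:Fred foaf:age 25>> ex:certainty 0.9 .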
According to the formal semantics of RDF-star, each embedded triple also asserts the referenced statement, and its retraction deletes it. Unfortunately, this requirement breaks compatibility with standard reification and causes non-transparent behavior when dealing with triples stored in multiple named graphs. GraphDB implements embedded triples by introducing a new additional RDF value type next to IRI, blank node, and literal. So in the previous example, the engine will store only a single triple.
Warning: GraphDB will not explicitly assert the statement referenced by an embedded triple! Every embedded triple acts as a new RDF value type, i.e., only as a reference to a statement.
Below are a few more examples of how this syntax can be utilized.
• Object relation qualifiers:
voc:hasSpouse is a symmetric relation so that it can be inferred in the opposite direction. How
ever, the metadata in the opposite direction is not asserted automatically, so it needs to be added:
• Statement sources/references:
Carried over into the syntax of the extended query language SPARQL-star, triple patterns can be embedded as well. This provides a query syntax in which accessing specific metadata about a triple is just a matter of mentioning the triple in the subject or object position of a metadata-related triple pattern. For example, by adopting the aforementioned syntax for nesting, we can query for all age statements and their respective certainty as follows:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://example.com/>
SELECT ?p ?a ?c WHERE {
    <<?p foaf:age ?a>> ex:certainty ?c .
}
Additionally, SPARQL-star modifies the BIND clause to select a group of embedded triples by using free variables:
PREFIX ex: <http://example.com/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?p ?a ?c WHERE {
    BIND (<<?p foaf:age ?a>> AS ?t)
    ?t ex:certainty ?c .
}
The semantics of BIND for embedded triples deviates from that for the other RDF types. When binding an embedded triple, it creates an iterator over the triples that match its components and binds these to the target variable. As a result, BIND, when used with three constants, works like a FILTER. The same does not apply to VALUES, which will return any value.
PREFIX ex: <http://example.com/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE {
{
# Binds the value to ?literal variable
BIND ("new value for the store" as ?literal)
}
UNION
{
# Returns empty value and acts like a FILTER
BIND (<<ex:subject foaf:name "new value for the store">> AS ?triple)
}
UNION
{
# Values generates new values
VALUES ?newTriple { <<ex:subject foaf:name "new value for the store">> }
}
}
To avoid any parsing of the embedded triple, GraphDB introduces multiple new SPARQL functions:
PREFIX voc: <http://example.com/voc#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
This also showcases the fact that in SPARQL-star, variables in query results may be bound not only to IRIs, literals, or blank nodes, but also to full RDF-star triples.
6. When the embedded triple contains just one link, click on the predicate label to explore it:
The edge will be highlighted, and in the side panel that opens you can view more details about
the predicate. You can also click on it to open it in the resource view.
7. When the embedded triple we want to explore contains more than one link, click on its predicate label to see
a list with all of the embedded predicates in the side panel. Click on an embedded predicate to view more
details about it.
The RDF-star support in GraphDB does not exclude any of the other modeling approaches. It is possible to independently maintain RDF-star and standard reification statements in the same repository, like:
Still, this is likely to cause confusion, so GraphDB provides a tool for converting standard reification to RDF-star outside of the database: the reification-convert command line tool. If the data is already imported, use this SPARQL for a conversion:
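A minimal sketch of such a conversion update is shown below; the query shipped with the GraphDB documentation may differ in details (e.g., in how it handles named graphs or reference nodes without metadata):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

DELETE {
    ?reified rdf:type      rdf:Statement ;
             rdf:subject   ?s ;
             rdf:predicate ?p ;
             rdf:object    ?o ;
             ?metaProperty ?metaValue .
}
INSERT {
    # Re-attach the metadata to an embedded triple instead
    <<?s ?p ?o>> ?metaProperty ?metaValue .
}
WHERE {
    ?reified rdf:type      rdf:Statement ;
             rdf:subject   ?s ;
             rdf:predicate ?p ;
             rdf:object    ?o ;
             ?metaProperty ?metaValue .
    FILTER (?metaProperty NOT IN (rdf:type, rdf:subject, rdf:predicate, rdf:object))
}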
GraphDB extends the existing RDF and query result formats with dedicated formats that encode embedded triples natively (for example, <<:subject :predicate :object>> in Turtle-star). Each new format has its own MIME type and file extension:
GraphDB uses all RDF-star formats in the way they are defined in RDF4J.
The RDF-star extensions of the SPARQL 1.1 query result formats are in the process of ongoing W3C standardization activities and for this reason may be subject to change. See SPARQL 1.1 Query Results JSON format and SPARQL Query Results XML format for more details.
For the benefit of older clients, in all other formats the embedded triples are serialized as special IRIs in the format urn:rdf4j:triple:xxx, where xxx stands for the Base64 URL-safe encoding of the N-Triples representation of the embedded triple. This is controlled by a boolean writer setting, and is ON by default. The setting is ignored by writers that support RDF-star natively.
Such special IRIs are converted back to triples on parsing. This is controlled by a boolean parser setting, and is ON by default. It is respected by all parsers, including those with native RDF-star support.
19.6 Plugins
Multiple GraphDB features are implemented as plugins based on the GraphDB Plugin API. As they vary in functionality, you can find them under the respective sections in the GraphDB documentation.
Plugin Description
Semantic Similarity Searches Exploring and searching semantic similarity in RDF resources.
RDF Rank An algorithm that identifies the more important or more popular entities in the
repository by examining their interconnectedness.
JavaScript Functions Defining and executing JavaScript code, further enhancing data manipulation
with SPARQL.
Change Tracking Tracking changes within the context of a transaction identified by a unique ID.
Provenance Generation of inference closure from a specific named graph at query time.
Proof plugin Finding out how a given statement has been derived by the inferencer.
Graph Path Search Exploring complex relationships between resources.
Several of the plugins enable you to create and access user-defined indexes. They are created with SPARQL, and differ from the system indexes in that they can be configured dynamically at runtime. Any user with write access to a given repository can define such an index.
These are:
Plugin Description
Autocomplete Index Suggestions for the IRIs’ local names in the SPARQL editor and the View Resource page.
GeoSPARQL Support GeoSPARQL is a standard for representing and querying geospatial linked data for the Semantic Web from the Open Geospatial Consortium (OGC). The plugin allows the conversion of Well-Known Text from different coordinate reference systems (CRS) into the CRS84 format, which is the default CRS according to the OGC.
Geospatial Extensions Support for two-dimensional geospatial data that uses the WGS84 Geo Positioning RDF vocabulary (World Geodetic System 1984).
Data History and Versioning Accessing past states of your database through versioning at the RDF data model level.
Text Mining Plugin Calling of text mining algorithms and generation of new relationships between
entities.
Sequences Plugin Providing transactional sequences for GraphDB. A sequence is a long counter
that can be atomically incremented in a transaction to provide incremental IDs.
19.7 Ontologies
An ontology is a formal specification that provides sharable and reusable knowledge representation. Examples of
ontologies include:
• Taxonomies
• Vocabularies
• Thesauri
• Topic maps
• Logical models
An ontology specification includes descriptions of concepts and properties in a domain, relationships between concepts, constraints on how the relationships can be used, and individuals as members of concepts.
In the example below, we can classify the two individuals, Fred and Wilma, in a class of type Person, and we also
know that a Person is a Mammal. Fred works for the Slate Rock Company and the Slate Rock Company is of type
Company, so we also know that Person worksFor Company.
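A sketch of this small model in Turtle (the bedrock namespace is again a hypothetical example):

@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix bedrock: <http://bedrock.example.com/> .

# Schema
bedrock:Person   rdfs:subClassOf bedrock:Mammal .
bedrock:worksFor rdfs:domain     bedrock:Person ;
                 rdfs:range      bedrock:Company .

# Individuals
bedrock:Fred             rdf:type bedrock:Person .
bedrock:Wilma            rdf:type bedrock:Person .
bedrock:SlateRockCompany rdf:type bedrock:Company .
bedrock:Fred             bedrock:worksFor bedrock:SlateRockCompany .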
First, ontologies are very useful in gaining a common understanding of information and making assumptions
explicit in ways that can be used to support a number of activities.
These provisions, a common understanding of information and explicit domain assumptions, are valuable because
ontologies support data integration for analytics, apply domain knowledge to data, support application interop
erability, enable model driven applications, reduce time and cost of application development, and improve data
quality by improving metadata and provenance.
The Web Ontology Language, or OWL, adds more powerful ontology modeling means to RDF and RDFS. Thus, when used with OWL reasoners, such as the one in GraphDB, it provides consistency checks (are there any logical inconsistencies?), satisfiability checks (are there classes that cannot have instances?), and classification (determining the type of an instance).
OWL also adds identity equivalence and identity difference, such as sameAs, differentFrom, equivalentClass, and
equivalentProperty.
In addition, OWL offers more expressive class definitions, such as class intersection, union, complement, disjointness, and cardinality restrictions.
OWL also offers more expressive property definitions, such as object and datatype properties, transitive, functional,
symmetric, inverse properties, and value restrictions.
Finally, ontologies are important because semantic repositories use them as semantic schemata. This makes automated reasoning about the data possible (and easy to implement), since the most essential relationships between the concepts are built into the ontology.
To load your ontology in GraphDB, simply use the import function in the GraphDB Workbench. The example
below shows loading an ontology through the Import view:
19.8 Reasoning
Hint: To get the full benefit from this section, you need some basic knowledge of the two principal reasoning strategies for rule-based inference: forward-chaining and backward-chaining.
GraphDB performs reasoning based on forward chaining of entailment rules defined using RDF triple patterns
with variables. GraphDB’s reasoning strategy is one of Total materialization, where the inference rules are applied
repeatedly to the asserted (explicit) statements until no further inferred (implicit) statements are produced.
The GraphDB repository uses configured rulesets to compute all inferred statements at load time. To some extent,
this process increases the processing cost and time taken to load a repository with a large amount of data. However,
it has the desirable advantage that subsequent query evaluation can proceed extremely quickly.
GraphDB uses a notation almost identical to R-entailment as defined by ter Horst. RDFS inference is achieved via a set
of axiomatic triples and entailment rules. These rules allow the full set of valid inferences using RDFS semantics
to be determined.
Herman ter Horst defines RDFS extensions for more general rule support and a fragment of OWL, which is
more expressive than DLP and fully compatible with RDFS. First, he defines R-entailment, which extends RDFS entailment in the following way:
• It can operate on the basis of any set of rules R (i.e., allows for extension or replacement of the standard set,
defining the semantics of RDFS);
• It operates over so-called generalized RDF graphs, where blank nodes can appear as predicates (a possibility
disallowed in RDF);
• Rules without premises are used to declare axiomatic statements;
• Rules without consequences are used to detect inconsistencies (integrity constraints).
The rule format and the semantics enforced in GraphDB are analogous to R-entailment, with the following differences:
• Free variables in the head (without binding in the body) are treated as blank nodes. This feature must be
used with extreme caution because custom rulesets can easily be created, which recursively infer an infinite
number of statements making the semantics intractable;
• Variable inequality constraints can be specified in addition to the triple patterns (they can be placed after any premise or consequence). This leads to less complexity compared to R-entailment;
• The Cut operator can be associated with rule premises. This is an optimization that tells the rule compiler not to generate a variant of the rule with the identified rule premise as the first triple pattern;
• Context can be used for both rule premises and rule consequences allowing more expressive constructions
that utilize ‘intermediate’ statements contained within the given context URI;
• Consistency checking rules do not have consequences and will indicate an inconsistency when the premises
are satisfied;
• Axiomatic triples can be provided as a set of statements, although these are not modeled as rules with empty
bodies.
GraphDB can be configured via rulesets: sets of axiomatic triples, consistency checks, and entailment rules, which determine the applied semantics.
A ruleset file has three sections named Prefices, Axioms, and Rules. All sections are mandatory and must appear sequentially in this order. Comments are allowed anywhere and follow the Java convention, i.e., "/* ... */" for block comments and "//" for end-of-line comments.
For historic reasons, the way in which terms (variables, URLs and literals) are written differs from Turtle and
SPARQL:
• URLs in the Prefices section are written without angle brackets;
• Variables are written without ? or $ and can include multiple alphanumeric characters;
• URLs elsewhere are written in angle brackets, no matter whether they use a prefix or are spelled out in full, e.g.:
a <owl:maxQualifiedCardinality> "1"^^xsd:nonNegativeInteger
Prefixes
This section defines the abbreviations for the namespaces used in the rest of the file. The syntax is:
shortname : URI
The following is an example of what a typical prefixes section might look like:
Prefices
{
rdf : http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs : http://www.w3.org/2000/01/rdf-schema#
owl : http://www.w3.org/2002/07/owl#
xsd : http://www.w3.org/2001/XMLSchema#
}
Axioms
This section asserts axiomatic triples, which usually describe the meta-level primitives used for defining the schema, such as rdf:type, rdfs:Class, etc. It contains a list of the (variable-free) triples, one per line.
For example, the RDF axiomatic triples are defined in the following way:
Axioms
{
// RDF axiomatic triples
<rdf:type> <rdf:type> <rdf:Property>
<rdf:subject> <rdf:type> <rdf:Property>
<rdf:predicate> <rdf:type> <rdf:Property>
<rdf:object> <rdf:type> <rdf:Property>
<rdf:first> <rdf:type> <rdf:Property>
<rdf:rest> <rdf:type> <rdf:Property>
<rdf:value> <rdf:type> <rdf:Property>
<rdf:nil> <rdf:type> <rdf:List>
}
Note: Axiomatic statements are considered to be inferred for the purpose of query answering because they are a
result of semantic interpretation defined by the chosen ruleset.
Rules
This section is used to define entailment rules and consistency checks, which share a similar format. Each definition
consists of premises and corollaries that are RDF statements defined with subject, predicate, object and optional
context components. The subject and object can each be a variable, blank node, literal, a full URI, or the short
name for a URI. The predicate can be a variable, a full URI, or a short name for a URI. If given, the context must
be a full URI or a short name for a URI. Variables are alphanumeric and must begin with a letter.
If the context is provided, the statements produced as rule consequences are not ‘visible’ during normal query
answering. Instead, they can only be used as input to this or other rules and only when the rule premise explicitly
uses the given context (see the example below).
Furthermore, inequality constraints can be used to state that the values of the variables in a statement must not be
equal to a specific full URI (or its short name), a blank node, or to the value of another variable within the same
rule. The behavior of an inequality constraint depends on whether it is placed in the body or the head of a rule. If
it is placed in the body of a rule, then the whole rule will not ‘fire’ if the constraint fails, i.e., the constraint can
be next to any statement pattern in the body of a rule with the same behavior (the constraint does not have to be
placed next to the variables it references). If the constraint is in the head, then its location is significant because a
constraint that does not hold will prevent only the statement it is adjacent to from being inferred.
Entailment rules
Id: <rule_name>
<premises> <optional_constraints>
-------------------------------
<consequences> <optional_constraints>
Rules
{
Id: rdf1_rdfs4a_4b
x a y
-------------------------------
x <rdf:type> <rdfs:Resource>
a <rdf:type> <rdfs:Resource>
y <rdf:type> <rdfs:Resource>
Id: rdfs2
x a y [Constraint a != <rdf:type>]
a <rdfs:domain> z [Constraint z != <rdfs:Resource>]
-------------------------------
x <rdf:type> z
Id: owl_FunctProp
p <rdf:type> <owl:FunctionalProperty>
x p y [Constraint y != z, p != <rdf:type>]
x p z [Constraint z != y] [Cut]
-------------------------------
y <owl:sameAs> z
}
The symbols p, x, y, z and a are variables. The second rule contains two constraints that reduce the number of
bindings for each premise, i.e., they ‘filter out’ those statements where the constraint does not hold.
In a forward chaining inference step, a rule is interpreted as meaning that for all possible ways of satisfying the
premises, the bindings for the variables are used to populate the consequences of the rule. This generates new
statements that will manifest themselves in the repository, e.g., by being returned as query results.
The last rule contains an example of using the Cut operator, which is an optimization hint for the rule compiler.
When rules are compiled, a different variant of the rule is created for each premise, so that each premise occurs as
the first triple pattern in one of the variants. This is done so that incoming statements can be efficiently matched
to the appropriate inference rules. However, when a rule contains two or more premises that match identical triple patterns but use different variable names, the extra variant(s) are redundant and better efficiency can be achieved by simply not creating the extra rule variant(s).
In the above example, the rule owl_FunctProp would by default be compiled in three variants:
p <rdf:type> <owl:FunctionalProperty>
x p y
x p z
-------------------------------
y <owl:sameAs> z
x p y
p <rdf:type> <owl:FunctionalProperty>
x p z
-------------------------------
y <owl:sameAs> z
x p z
p <rdf:type> <owl:FunctionalProperty>
x p y
-------------------------------
y <owl:sameAs> z
Here, the last two variants are identical apart from the rotation of variables y and z, so one of these variants is
not needed. The use of the Cut operator above tells the rule compiler to eliminate this last variant, i.e., the one
beginning with the premise x p z.
The use of context in rule bodies and rule heads is also best explained by an example. The following three rules
implement the OWL2-RL property chain rule prp-spo2, and are inspired by the Rule Interchange Format (RIF)
implementation:
Id: prp-spo2_1
p <owl:propertyChainAxiom> pc
start pc last [Context <onto:_checkChain>]
----------------------------
start p last
Id: prp-spo2_2
pc <rdf:first> p
pc <rdf:rest> t [Constraint t != <rdf:nil>]
start p next
next t last [Context <onto:_checkChain>]
----------------------------
start pc last [Context <onto:_checkChain>]
Id: prp-spo2_3
pc <rdf:first> p
pc <rdf:rest> <rdf:nil>
start p last
----------------------------
start pc last [Context <onto:_checkChain>]
The RIF rules that implement prp-spo2 use a relation (unrelated to the input or generated triples) called
_checkChain. The GraphDB implementation maps this relation to the ‘invisible’ context of the same name with the
addition of [Context <onto:_checkChain>] to certain statement patterns. Generated statements with this context
can only be used for bindings to rule premises when the exact same context is specified in the rule premise. The
generated statements with this context will not be used for any other rules.
Inequality constraints in rules can also check whether a variable is bound to a blank node. If it is not, then the inference rule will fire:
Id: prp_dom
a <rdfs:domain> b
c a d
-------------------------------
c <rdf:type> b
Same as optimization
The built-in OWL property owl:sameAs indicates that two URI references actually refer to the same thing. The following rules express its transitive and symmetric semantics:
/**
Id: owl_sameAsCopySubj
// Copy of statement over owl:sameAs on the subject. The support for owl:sameAs
// is implemented through replication of the statements where the equivalent
// resources appear as subject, predicate, or object. See also the couple of
// rules below
//
x <owl:sameAs> y [Constraint x != y]
x p z [Constraint p != <owl:sameAs>]
-------------------------------
y p z
Id: owl_sameAsCopyPred
// Copy of statement over owl:sameAs on the predicate
//
p <owl:sameAs> q [Constraint p != q]
x p y
-------------------------------
x q y
Id: owl_sameAsCopyObj
// Copy of statement over owl:sameAs on the object
//
x <owl:sameAs> y [Constraint x != y]
z p x [Constraint p != <owl:sameAs>]
-------------------------------
z p y
**/
So, all nodes in the transitive and symmetric chain become related to all other nodes, i.e., the relation coincides with the Cartesian product N × N, hence the full closure contains N² statements. GraphDB avoids generating these excessive links by nominating an equivalence class representative that stands for all resources in the symmetric and transitive chain. By default, the owl:sameAs optimization is enabled in all rulesets except when the ruleset is empty, rdfs, or rdfsplus. For additional information, check Optimization of owl:sameAs.
Consistency checks
Consistency checks are used to ensure that the data model is in a consistent state, and are applied whenever an update transaction is committed. GraphDB supports consistency violation checks using standard OWL2-RL semantics. You can define rulesets that contain consistency rules. When creating a new repository, set the check-for-inconsistencies configuration parameter to true. It is false by default.
The syntax is similar to that of rules, except that Consistency replaces the Id tag that introduces normal rules.
Also, consistency checks do not have any consequences and indicate an inconsistency whenever their premises
can be satisfied, e.g.:
Consistency: something_can_not_be_nothing
x rdf:type owl:Nothing
-------------------------------
Consistency: both_sameAs_and_differentFrom_is_forbidden
x owl:sameAs y
x owl:differentFrom y
-------------------------------
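For reference, the check is enabled per repository through the check-for-inconsistencies parameter. Below is a minimal sketch of the relevant part of a repository configuration in Turtle, assuming the graphdb: configuration namespace described later in this chapter; the ruleset value is only illustrative:

@prefix graphdb: <http://www.ontotext.com/trree/graphdb#> .

# Fragment of a repository configuration: enable consistency checking
# against a ruleset that contains consistency rules (value is illustrative).
[] graphdb:check-for-inconsistencies "true" ;
   graphdb:ruleset "owl2-rl" .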
19.8.4 Rulesets
GraphDB offers several predefined semantics by way of standard rulesets (files), but can also be configured to
use custom rulesets with semantics better tuned to the particular domain. The required semantics can be specified
through the ruleset for each specific repository instance. Applications that do not need the complexity of the most
expressive supported semantics can choose one of the less complex, which will result in faster inference.
Note: Each ruleset defines both rules and some schema statements, otherwise known as axiomatic triples. These (read-only) triples are inserted into the repository at initialization time and count towards the total number of reported ‘explicit’ triples. The variation may be up to the order of hundreds depending upon the ruleset.
Predefined rulesets
The predefined rulesets provided with GraphDB cover various well-known knowledge representation formalisms, and are layered in such a way that each extends the preceding one.

Ruleset Description

empty
    No reasoning, i.e., GraphDB operates as a plain RDF store.

rdfs
    Supports the standard model-theoretic RDFS semantics. This includes support for subClassOf and related type inference, as well as subPropertyOf.

rdfsplus
    Extended version of RDFS with support also for symmetric, inverse, and transitive properties, via the OWL vocabulary: owl:SymmetricProperty, owl:inverseOf, and owl:TransitiveProperty.

owl-horst
    OWL dialect close to OWL-Horst; essentially pD*.

owl-max
    RDFS and that part of OWL Lite that can be captured in rules (deriving functional and inverse functional properties, all-different, subclass by union/enumeration, min/max cardinality constraints, etc.).

owl2-ql
    The OWL2 QL profile, a fragment of OWL2 Full designed so that sound and complete query answering is LOGSPACE with respect to the size of the data. This OWL2 profile is based on DL-LiteR, a variant of DL-Lite that does not require the unique name assumption.

owl2-rl
    The OWL2 RL profile, an expressive fragment of OWL2 Full that is amenable for implementation on rule engines.

Note: Not all rulesets support datatype reasoning, which is the main reason why owl-horst is not the same as pD*. The ruleset you need to use for a specific repository is defined through the ruleset parameter. There are optimized versions of all rulesets that avoid some little-used inferences.
OWL2-QL non-conformance
The implementation of OWL2 QL is non-conformant with the W3C OWL2 profiles recommendation, as shown in the following table:
Custom rulesets
GraphDB has an internal rule compiler that can be configured with a custom set of inference rules and axioms.
You may define a custom ruleset in a .pie file (e.g., MySemantics.pie). The easiest way to create a custom ruleset is to start by modifying one of the .pie files that were used to build the precompiled rulesets.
Note: All predefined .pie files are included in the configs/rules folder of the GraphDB distribution.
If the code generation or compilation cannot be completed successfully, a Java exception is thrown indicating the
problem. It will state either the Id of the rule, or the complete line from the source file where the problem is
located. Line information is not preserved during the parsing of the rule file.
You must specify the custom ruleset via the ruleset configuration parameter. There are optimized versions of all rulesets. The value of the ruleset parameter is interpreted as a filename, and .pie is appended when not present. This file is processed to create Java source code that is compiled using the compiler from the Java Development Kit (JDK). The compiler is invoked using the mechanism provided by JDK version 1.6 (or later).
Therefore, a prerequisite for using custom rulesets is that you run the application with a Java Virtual Machine (JVM) from a JDK version 1.6 (or later). If all goes well, the class is loaded dynamically and instantiated for further use by GraphDB during inference. The intermediate files are created in the folder pointed to by the java.io.tmpdir system property. The JVM should have sufficient rights to read and write to this directory.
Note: Changing the ruleset of an existing GraphDB repository is more involved. It is necessary to export/back up all explicit statements and create a new repository with the required ruleset. Once created, the explicit statements exported from the old repository can be imported into the new one.
19.8.5 Inference
Reasoner
The GraphDB reasoner requires the .pie file of each ruleset to be compiled in order to instantiate it. The process includes several steps:
1. Generate Java code out of the .pie file contents using the built-in GraphDB rule compiler.
2. Compile the Java code (this requires a JDK instead of a JRE, so that the Java compiler is available through the standard Java instrumentation infrastructure).
3. Instantiate the Java code using a custom bytecode class loader.
Note: GraphDB supports dynamic extension of the reasoner with new rulesets.
Rulesets execution
• For each rule and each premise (triple pattern in the rule head), a rule variant is generated. We call this
the ‘leading premise’ of the variant. If a premise has the Cut annotation, no variant is generated for it.
• Every incoming triple (inserted or inferred) is checked against the leading premise of every rule variant.
Since rules are compiled to Java bytecode on startup, this checking is very fast.
• If the leading premise matches, the rest of the premises are checked. This checking needs to access the
repository, so it can be much slower.
– GraphDB first checks premises with the least number of unbound variables.
– For premises that have the same number of unbound variables, GraphDB follows the textual order in
the rule.
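As an illustration of the leading-premise mechanism, consider the following sketch of a custom rule; the rule name and premises are hypothetical, and the [Cut] annotation is assumed to use the syntax of .pie rule files. Because the subClassOf premise is annotated with [Cut], no rule variant uses it as its leading premise, so the rule is only re-evaluated when a type statement arrives, not when the (rarely changing) schema statement does:

Id: typeBySubclass_example
  a <rdf:type> b
  b <rdfs:subClassOf> c [Cut]
  -------------------------------
  a <rdf:type> c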
Retraction of assertions
GraphDB stores explicit and implicit statements, i.e., the statements inferred (materialized) from the explicit statements. So, when explicit statements are removed from the repository, any implicit statements that rely on the removed statements must also be removed.
In previous versions of GraphDB, this was achieved by recomputing the full closure (minimal model), i.e., applying the entailment rules to all explicit statements and computing the inferences. This approach guarantees correctness, but does not scale: the computation becomes increasingly slow and expensive in proportion to the number of explicit statements and the complexity of the entailment ruleset.
Removal of explicit statements is now achieved in a more efficient manner, by invalidating only the inferred
statements that can no longer be derived in any way.
One approach is to maintain track information for every statement, typically the list of statements that can be inferred from it. The list is built up during inference as the rules are applied, and the statements inferred by the rules are added to the lists of all statements that triggered the inferences. The drawback of this technique is that the track information inflates more rapidly than the inferred closure; for large datasets, up to 90% of the storage is required just to store the track information.
Another approach is to perform backward chaining. Backward chaining does not require track information, since
it essentially recomputes the tracks as required. Instead, a flag for each statement is used so that the algorithm
can detect when a statement has been previously visited and thus avoid an infinite recursion.
The algorithm used in GraphDB works as follows:
1. Apply a ‘visited’ flag to all statements (false by default).
2. Store the statements to be deleted in the list L.
3. For each statement in L that is not visited yet, mark it as visited and apply the forward chaining rules.
Statements marked as visited become invisible, which is why the statement must be first marked and then
used for forward chaining.
4. If there are no more unvisited statements in L, then END.
5. Store all inferred statements in the list L1.
6. For each element in L1 check the following:
• If the statement is a purely implicit statement (a statement can be both explicit and implicit and if so,
then it is not considered purely implicit), mark it as deleted (prevent it from being returned by the
iterators) and check whether it is supported by other statements. The isSupported() method uses
queries that contain the premises of the rules and the variables of the rules are preliminarily bound
using the statement in question. That is to say, the isSupported() method starts from the projection of
the query and then checks whether the query will return results (at least one), i.e., this method performs
backward chaining.
• If a result is returned by any query (every rule is represented by a query) in isSupported(), then this
statement can be still derived from other statements in the repository, so it must not be deleted (its
status is returned to ‘inferred’).
• If all queries return no results, then this statement can no longer be derived from any other statements,
so its status remains ‘deleted’ and the number of statements counter is updated.
7. L := L1 and GOTO 3.
Special care is taken when retracting owl:sameAs statements, so that the algorithm still works correctly when
modifying equivalence classes.
Note: One consequence of this algorithm is that deletion can still have poor performance when deleting schema
statements, due to the (probably) large number of implicit statements inferred from them.
Note: The forward chaining part of the algorithm terminates as soon as it detects that a statement is read-only, because if it cannot be deleted, there is no need to look for statements derived from it. For this reason, performance can be greatly improved when all schema statements are made read-only by importing ontologies (and OWL/RDFS vocabularies) using the imports repository parameter.
When fast statement retraction is required, but it is also necessary to update schemas, you can use a special statement pattern. By including an insert for a statement of the following form in the update:
[] <http://www.ontotext.com/owlim/system#schemaTransaction> []
GraphDB will use the smooth-delete algorithm, but will also traverse read-only statements and allow them to be deleted/inserted. Such transactions are likely to be much more computationally expensive, but they are intended for the occasional, offline update to otherwise read-only schemas. The advantage is that fast delete can still be used, and no repository export and import is required when making a modification to a schema.
For any transaction that includes an insert of the above special predicate/statement:
• Read-only (explicit or inferred) statements can be deleted;
• New explicit statements are marked as read-only;
• New inferred statements are marked:
– Read-only if all the premises that fired the rule are read-only;
– Normal otherwise.
Schema statements can be inserted or deleted using SPARQL UPDATE as follows:
DELETE {
# [[schema statements to delete]]
}
INSERT {
[] <http://www.ontotext.com/owlim/system#schemaTransaction> [] .
# [[schema statements to insert]]
}
WHERE { }
Operations on rulesets
The predicate sys:addRuleset adds a custom ruleset from the specified .pie file. The ruleset is named after the
filename, without the .pie extension.
Example 1 This creates a new ruleset ‘test’. If the file resides at an absolute path, for example /opt/rules/test.pie, it can be specified as <file:/opt/rules/test.pie>, <file://opt/rules/test.pie>, or <file:///opt/rules/test.pie>, i.e., with 1, 2, or 3 slashes. Relative paths are specified without the slashes or with a dot between the slashes: <file:opt/rules/test.pie>, <file:/./opt/rules/test.pie>, <file://./opt/rules/test.pie>, or even <file:./opt/rules/test.pie> (with a dot in front of the path). Relative paths can be used if you know the working directory of the Java process in which GraphDB runs.
INSERT DATA {
_:b sys:addRuleset <file:c:/graphdb/test-data/test.pie>
}
Example 2 Same as above but creates a ruleset called ‘custom’ out of the test.pie file found in the given absolute
path.
INSERT DATA {
<_:custom> sys:addRuleset <file:c:/graphdb/test-data/test.pie>
}
Example 3 Retrieves the .pie file from the given URL. Again, you can use <_:custom> to change the name of
the ruleset to “custom” or as necessary.
INSERT DATA {
_:b sys:addRuleset <http://example.com/test-data/test.pie>
}
The predicate sys:addRuleset can also add a built-in ruleset (one of the rulesets that GraphDB supports natively).
Example This adds the "owl-max" ruleset to the list of rulesets in the repository.
INSERT DATA {
_:b sys:addRuleset "owl-max"
}
The predicate sys:addRuleset can also add a custom ruleset whose contents are given inline in the object literal; the ruleset takes its name from the subject (here, ‘custom’).
Example This creates a new ruleset "custom".
INSERT DATA {
<_:custom> sys:addRuleset
'''Prefices { a : http://a/ }
Axioms {}
Rules
{
Id: custom
a b c
a <a:custom1> c
-----------------------
b <a:custom1> a
}
'''
}
Explore a ruleset
SELECT * {
?content sys:exploreRuleset "test"
}
The predicate sys:defaultRuleset switches the default ruleset to the one specified in the object literal.
Example This sets the default ruleset to “test”. All transactions use this ruleset, unless they specify another ruleset
as a first operation in the transaction.
INSERT DATA {
_:b sys:defaultRuleset "test"
}
Rename a ruleset
The predicate sys:renameRuleset renames the ruleset from “custom” to “test”. Note that “custom” is specified
as the subject URI in the default namespace.
Example This renames the ruleset “custom” to “test”.
INSERT DATA {
<_:custom> sys:renameRuleset "test"
}
Delete a ruleset
The predicate sys:removeRuleset deletes the ruleset "test" specified in the object literal.
Example
INSERT DATA {
_:b sys:removeRuleset "test"
}
Consistency check
The predicate sys:consistencyCheckAgainstRuleset checks if the repository is consistent with the specified
ruleset.
Example
INSERT DATA {
_:b sys:consistencyCheckAgainstRuleset "test"
}
Reinferring
Statements are inferred only when you insert new statements. So, if you reconnect to a repository with a different ruleset, the new ruleset does not take effect on the existing data immediately. However, you can force reinference with an update statement such as the following:
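INSERT DATA {
    [] <http://www.ontotext.com/owlim/system#reinfer> []
}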
This removes all inferred statements and reinfers from scratch using the current ruleset. If a statement is both
explicitly inserted and inferred, it is not removed. Statements of the type <P rdf:type rdf:Property>, <P
rdfs:subPropertyOf P>, <X rdf:type rdfs:Resource>, and the axioms from all rulesets will stay untouched.
Tip: To learn more, see How to manage explicit and implicit statements.
19.8.7 Provenance
GraphDB’s Provenance plugin enables the generation of inference closure from a specific named graph at query
time. This is useful in situations where you want to trace what the implicit statements generated from a specific
graph are and the axiomatic triples part of the configured ruleset, i.e., the ones inserted with a special predicate
sys:schemaTransaction. Find more about it in the plugin’s documentation.
SPARQL 1.1 Protocol for RDF defines the means for transmitting SPARQL queries to a SPARQL query processing
service, and returning the query results to the entity that requested them.
SPARQL 1.1 Query provides more powerful query constructions compared to SPARQL 1.0. It adds:
• Aggregates;
• Subqueries;
• Negation;
• Expressions in the SELECT clause;
• Property Paths;
• Assignment;
• An expanded set of functions and operators.
SPARQL 1.1 Update provides a means to change the state of the database using a query-like syntax. SPARQL Update has similarities to the SQL INSERT INTO, UPDATE WHERE, and DELETE FROM behavior. For full details, see the W3C SPARQL Update working group page.
SPARQL 1.1 Federation provides extensions to the query syntax for executing distributed queries over any number
of SPARQL endpoints. This feature is very powerful, and allows integration of RDF data from different sources
using a single query. See more about it here.
In addition to the standard SPARQL 1.1 Federation to other SPARQL endpoints, GraphDB supports internal federation to other repositories in the same GraphDB instance. The internal SPARQL federation is used in almost the same way as the standard SPARQL federation over HTTP, but since this approach skips all HTTP communication overheads, it is more efficient. See more about it here.
You can also use federation to query a remote password-protected GraphDB repository and a SPARQL endpoint. See how to do it here.
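As a brief illustration of internal federation, the sketch below joins data in the current repository with a second repository in the same GraphDB instance. The repository name "authors" and the predicates are hypothetical, and the repository: service IRI scheme is assumed to be the internal federation addressing scheme:

SELECT ?book ?name WHERE {
    ?book <http://example.com/hasAuthor> ?author .
    # Internal federation: query another repository in the same GraphDB instance
    SERVICE <repository:authors> {
        ?author <http://example.com/name> ?name .
    }
}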
SPARQL 1.1 Graph Store HTTP Protocol provides a means for updating and fetching RDF graph content from a
Graph Store over HTTP in the REST style.
• GET: Fetches statements in the named graph from the repository in the requested format.
• PUT: Updates data in the named graph in the repository, replacing any existing data in the named graph with
the supplied data. The data supplied with this request is expected to contain an RDF document in one of the
supported RDF formats.
• DELETE: Deletes all data in the specified named graph in the repository.
• POST: Updates data in the named graph in the repository by adding the supplied data to any existing data in
the named graph. The data supplied with this request is expected to contain an RDF document in one of the
supported RDF formats.
Request headers
• Accept: Relevant values for GET requests are the MIME types of the supported RDF formats.
• Content-Type: Must specify the encoding of any request data sent to a server. Relevant values are the MIME
types of the supported RDF formats.
Note: Each request on an indirectly referenced graph needs to specify precisely one of the above parameters.
This section lists all supported SPARQL functions in GraphDB. The function specifications include the types of
the arguments and the output. Types from XML Schema should be readily recognizable as they start with the xsd:
prefix. In addition, the following more generic types are used:
rdfTerm Any RDF value: a literal, a blank node or an IRI.
iri An IRI.
bnode A blank node.
literal A literal regardless of its datatype or the presence of a language tag.
string A plain literal or a literal with a language tag. Note that plain literals have the implicit datatype
xsd:string.
numeric A literal with a numeric XSD datatype, e.g. xsd:double and xsd:long.
variable A SPARQL variable.
expression A SPARQL expression that may use any constants and variables to compute a value.
Functions and magic predicates are denoted and used differently. Magic predicates are similar to the way GraphDB plugins can interpret certain triple patterns, and unlike functions, they can return multiple values per call.
• Functions are denoted like this: ex:function(arg1, arg2, ...), where all arguments must be bound. They are used in BIND, in SELECT expressions, in the ORDER BY clause, etc.
• Magic predicates are denoted like this: subject ex:magicPredicate (arg1 arg2 ...), where in some cases the arguments are allowed to be unbound (and are then calculated from the subject). They are used as triple patterns. The object is an RDF list of the arguments (indicated by the parentheses on the right-hand side).
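For example, the following sketch contrasts the two styles using SPIN functions and magic predicates documented later in this section; the literal values are arbitrary:

PREFIX spif: <http://spinrdf.org/spif#>
SELECT ?pos ?part WHERE {
    # Function: all arguments bound, result assigned with BIND
    BIND(spif:indexOf("GraphDB", "DB") AS ?pos)
    # Magic predicate: used as a triple pattern; the subject receives the results,
    # and the object is an RDF list of the arguments
    ?part spif:split ("one,two,three" ",") .
}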
Function Description

xsd:boolean BOUND(variable var)
    Returns true if the variable var is bound to a value. Returns false otherwise. Variables with the value NaN or INF are considered bound. More

rdfTerm IF(expression e1, expression e2, expression e3)
    The IF function form evaluates the first argument, interprets it as an effective boolean value (EBV), then returns the value of e2 if the EBV is true, otherwise it returns the value of e3. Only one of e2 and e3 is evaluated. If evaluating the first argument raises an error, then an error is raised for the evaluation of the IF expression. More

rdfTerm COALESCE(expression e1, …)
    The COALESCE function form returns the RDF term value of the first expression that evaluates without error. In SPARQL, evaluating an unbound variable raises an error. If none of the arguments evaluates to an RDF term, an error is raised. If no expressions are evaluated without error, an error is raised. More

xsd:boolean EXISTS { pattern }
xsd:boolean NOT EXISTS { pattern }
    There is a filter operator EXISTS that takes a graph pattern. EXISTS returns true or false depending on whether the pattern matches the dataset given the bindings in the current group graph pattern, the dataset and the active graph at this point in the query evaluation. No additional binding of variables occurs. The NOT EXISTS form translates into fn:not(EXISTS{...}). More

xsd:boolean xsd:boolean left || xsd:boolean right
    Returns a logical OR of left and right. Note that logical-or operates on the effective boolean value of its arguments. More

xsd:boolean xsd:boolean left && xsd:boolean right
    Returns a logical AND of left and right. Note that logical-and operates on the effective boolean value of its arguments. More

xsd:boolean rdfTerm term1 = rdfTerm term2
    Returns true if term1 and term2 are equal. Returns false otherwise. IRIs and blank nodes are equal if they are the same RDF term as defined in RDF Concepts. Literals are equal if they have an XSD datatype, the same language tag (if any), and their values produced by applying the lexical-to-value mapping of their datatypes are also equal. If the arguments are both literals but their datatype is not an XSD datatype, an error will be produced. More

xsd:boolean sameTerm(rdfTerm term1, rdfTerm term2)
    Returns true if term1 and term2 are the same RDF term as defined in RDF Concepts; returns false otherwise. More

xsd:boolean rdfTerm term IN (expression e1, …)
    The IN operator tests whether the RDF term on the left-hand side is found in the values of the list of expressions on the right-hand side. The test is done with the = operator, which compares the RDF term to each expression for equality. More

xsd:boolean rdfTerm term NOT IN (expression e1, …)
    The NOT IN operator tests whether the RDF term on the left-hand side is not found in the values of the list of expressions on the right-hand side. The test is done with the != operator, which compares the RDF term to each expression for inequality. More

xsd:boolean isIRI(rdfTerm term)
xsd:boolean isURI(rdfTerm term)
    Returns true if term is an IRI. Returns false otherwise. More

xsd:boolean isBlank(rdfTerm term)
    Returns true if term is a blank node. Returns false otherwise. More

xsd:boolean isLiteral(rdfTerm term)
    Returns true if term is a literal. Returns false otherwise. More
xsd:string LANG(literal ltrl)
    Returns the language tag of the literal ltrl, if it has one. It returns “” if ltrl has no language tag. Note that the RDF data model does not include literals with an empty language tag. More

iri DATATYPE(literal ltrl)
    Returns the datatype IRI of the literal ltrl. More

iri IRI(string str)
iri IRI(iri rsrc)
    The IRI function constructs an IRI by resolving the string argument str. The IRI is resolved against the base IRI of the query and must result in an absolute IRI. If the function is passed an IRI rsrc, it returns the IRI unchanged. More

bnode BNODE()
bnode BNODE(string str)
    The BNODE function constructs a blank node that is distinct from all blank nodes in the dataset being queried and distinct from all blank nodes created by calls to this constructor for other query solutions. If the no-argument form is used, every call results in a distinct blank node. If the form with the string str is used, every call results in distinct blank nodes for different strings, and the same blank node for calls with the same string within expressions for one solution mapping. More

iri UUID()
    Return a fresh IRI from the UUID URN scheme. Each call of UUID() returns a different UUID. More

xsd:string STRUUID()
    Return a string that is the scheme-specific part of a UUID. That is, as a string literal, the result of generating a UUID, converting to a string literal and removing the initial urn:uuid:. More

xsd:integer STRLEN(string str)
    The STRLEN function corresponds to the XPath fn:string-length function and returns an xsd:integer equal to the length in characters of the lexical form of the string str. More

string SUBSTR(string source, xsd:integer startingLoc)
string SUBSTR(string source, xsd:integer startingLoc, xsd:integer length)
    The SUBSTR function corresponds to the XPath fn:substring function and returns a literal of the same kind (string literal or literal with language tag) as the source input parameter, but with a lexical form formed from the substring of the lexical form of the source. More

string UCASE(string str)
    The UCASE function corresponds to the XPath fn:upper-case function. It returns a string literal whose lexical form is the upper case of the lexical form of the argument. More
xsd:boolean REGEX(string text, xsd:string pattern, xsd:string flags)
    Returns true if text matches the regular expression pattern; corresponds to the XPath fn:matches function.

string REPLACE(string arg, xsd:string pattern, xsd:string replacement, xsd:string flags)
    Replaces each occurrence of the regular expression pattern in arg with replacement; corresponds to the XPath fn:replace function.

numeric ABS(numeric num)
    Returns the absolute value of num. An error is raised if the argument is not a numeric value. This function is the same as fn:numeric-abs for terms with a datatype from XDM. More

numeric ROUND(numeric num)
    Returns the number with no fractional part that is closest to num. If there are two such numbers, then the one that is closest to positive infinity is returned. An error is raised if the argument is not a numeric value. This function is the same as fn:numeric-round for terms with a datatype from XDM. More

numeric CEIL(numeric num)
    Returns the smallest (closest to negative infinity) number with no fractional part that is not less than the value of num. An error is raised if the argument is not a numeric value. This function is the same as fn:numeric-ceil for terms with a datatype from XDM. More

numeric FLOOR(numeric num)
    Returns the largest (closest to positive infinity) number with no fractional part that is not greater than the value of num. An error is raised if the argument is not a numeric value. This function is the same as fn:numeric-floor for terms with a datatype from XDM. More

xsd:double RAND()
    Returns a pseudorandom number between 0 (inclusive) and 1.0 (exclusive). Different numbers can be produced every time this function is invoked. Numbers should be produced with approximately equal probability. More

xsd:dateTime NOW()
    Returns an XSD dateTime value for the current query execution. All calls to this function in any one query execution will return the same value. The exact moment returned is not specified. More

xsd:integer YEAR(xsd:dateTime arg)
    Returns the year part of arg as an integer. This function corresponds to fn:year-from-dateTime. More

xsd:integer HOURS(xsd:dateTime arg)
    Returns the hours part of arg as an integer. The value is as given in the lexical form of the XSD dateTime. This function corresponds to fn:hours-from-dateTime. More

xsd:decimal SECONDS(xsd:dateTime arg)
    Returns the seconds part of the lexical form of arg. This function corresponds to fn:seconds-from-dateTime. More

xsd:dayTimeDuration TIMEZONE(xsd:dateTime arg)
    Returns the timezone part of arg as an xsd:dayTimeDuration. Raises an error if there is no timezone. This function corresponds to fn:timezone-from-dateTime, except for the treatment of literals with no timezone. More

xsd:string TZ(xsd:dateTime arg)
    Returns the timezone part of arg as a simple literal. Returns the empty string if there is no timezone. More

xsd:string MD5(xsd:string arg)
    Returns the MD5 checksum, as a hex digit string, calculated on the UTF-8 representation of the lexical form of the argument. More

xsd:string SHA1(xsd:string arg)
    Returns the SHA1 checksum, as a hex digit string, calculated on the UTF-8 representation of the lexical form of the argument. More

xsd:string SHA256(xsd:string arg)
    Returns the SHA256 checksum, as a hex digit string, calculated on the UTF-8 representation of the lexical form of the argument. More

xsd:string SHA512(xsd:string arg)
    Returns the SHA512 checksum, as a hex digit string, calculated on the UTF-8 representation of the lexical form of the argument. More
Casting in SPARQL 1.1 is performed by calling a constructor function for the target type on an operand of the
source type. The standard includes the following constructor functions:
Note: SPARQL 1.1 does not have an xsd:date constructor. Instead, use STRDT(value, xsd:date) to attach the xsd:date datatype to the value.
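For instance, a minimal sketch that attaches the xsd:date datatype to a plain value:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?d WHERE {
    # STRDT builds a typed literal from a lexical form and a datatype IRI
    BIND(STRDT("2023-09-04", xsd:date) AS ?d)
}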
Beside the standard SPARQL functions operating on numbers, GraphDB offers several additional functions, allowing users to perform more mathematical operations. These are implemented using Java’s Math class.
The prefix ofn: stands for the namespace <http://www.ontotext.com/sparql/functions/>.
Function Description

xsd:double ofn:acos(numeric a)
    The arccosine function. The input is in the range [−1, +1]. The output is in the range [0, π] radians. See Math.acos(double).
    Example: ofn:acos(0.5) = 1.0471975511965979

xsd:double ofn:asin(numeric a)
    The arcsine function. The input is in the range [−1, +1]. The output is in the range [−π/2, π/2] radians. See Math.asin(double).
    Example: ofn:asin(0.5) = 0.5235987755982989

xsd:double ofn:atan2(numeric y, numeric x)
    The double-argument arctangent function (the angle component of the conversion from rectangular coordinates to polar coordinates). The output is in the range [−π, π] radians. See Math.atan2(double,double).
    Example: ofn:atan2(1, 0) = 1.5707963267948966

xsd:double ofn:copySign(numeric magnitude, numeric sign)
    Returns the first floating-point argument with the sign of the second floating-point argument. See Math.copySign(double,double).
    Example: ofn:copySign(2, -7.5) = -2.0

xsd:double ofn:cos(numeric a)
    The cosine function. The argument is in radians. See Math.cos(double).
    Example: ofn:cos(1) = 0.5403023058681398

xsd:double ofn:e()
    Returns the double value that is closer than any other to e, the base of the natural logarithms. See Math.E.
    Example: ofn:e() = 2.718281828459045

xsd:double ofn:floorDiv(numeric x, numeric y)
    Returns the largest (closest to positive infinity) int value (as a double number) that is less than or equal to the algebraic quotient. The arguments are implicitly cast to long. See Math.floorDiv(long,long).
    Example: ofn:floorDiv(5, 2) = 2.0

xsd:double ofn:floorMod(numeric x, numeric y)
    Returns the floor modulus (as a double number) of the arguments. The arguments are implicitly cast to long. See Math.floorMod(long,long).
    Example: ofn:floorMod(10, 3) = 1.0

xsd:double ofn:getExponent(numeric d)
    Returns the unbiased exponent used in the representation of a double. This means that we take n from the binary representation of x: x = 1 × 2^n + {1|0} × 2^(n−1) + ... + {1|0} × 2^0, i.e., the power of the highest non-zero bit of the binary form of x. See Math.getExponent(double).
    Example: ofn:getExponent(10) = 3.0

xsd:double ofn:log1p(numeric x)
    Returns the natural logarithm of the sum of the argument and 1. See Math.log1p(double).
    Example: ofn:log1p(4) = 1.6094379124341003

xsd:double ofn:nextAfter(numeric start, numeric direction)
    Returns the floating-point number adjacent to the first argument in the direction of the second argument. See Math.nextAfter(double,double).
    Example: ofn:nextAfter(2, -7) = 1.9999999999999998

xsd:double ofn:nextUp(numeric d)
    Returns the floating-point value adjacent to d in the direction of positive infinity. See Math.nextUp(double).
    Example: ofn:nextUp(2) = 2.0000000000000004

xsd:double ofn:pi()
    Returns the double value that is closer than any other to π, the ratio of the circumference of a circle to its diameter. See Math.PI.
    Example: ofn:pi() = 3.141592653589793

xsd:double ofn:rint(numeric a)
    Returns the double value that is closest in value to the argument and is equal to a mathematical integer. See Math.rint(double).
    Example: ofn:rint(2.51) = 3.0

xsd:double ofn:signum(numeric d)
    Returns the signum function of the argument: zero if the argument is zero, 1.0 if the argument is greater than zero, -1.0 if the argument is less than zero. See Math.signum(double).
    Example: ofn:signum(-5) = -1.0

xsd:double ofn:sin(numeric a)
    The sine function. The argument is in radians. See Math.sin(double).
    Example: ofn:sin(2) = 0.9092974268256817

xsd:double ofn:tan(numeric a)
    The tangent function. The argument is in radians. See Math.tan(double).
    Example: ofn:tan(1) = 1.5574077246549023

xsd:double ofn:ulp(numeric d)
    Returns the size of an ulp of the argument. An ulp, unit in the last place, of a double value is the positive distance between this floating-point value and the double value next larger in magnitude. Note that for non-NaN x, ulp(-x) == ulp(x). See Math.ulp(double).
    Example: ofn:ulp(1) = 2.220446049250313E-16
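These functions are used like any other SPARQL function; a small sketch combining two of the values listed above:

PREFIX ofn: <http://www.ontotext.com/sparql/functions/>
SELECT ?angle ?rounded WHERE {
    # The angle of the point (x=0, y=1) in polar coordinates
    BIND(ofn:atan2(1, 0) AS ?angle)
    # Round to the nearest mathematical integer
    BIND(ofn:rint(2.51) AS ?rounded)
}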
GraphDB also supports several Jena ARQ simple function analogs. The prefix afn: stands for the namespace
<http://jena.apache.org/ARQ/function#>.
Function Description
afn:min(num1, num2) Return the minimum of two numbers.
afn:max(num1, num2) Return the maximum of two numbers.
afn:pi() The value of pi as an XSD double.
afn:e() The value of e as an XSD double.
afn:sqrt(num) The square root of num.
Beside the standard SPARQL functions related to date and time, GraphDB offers several additional functions, allowing users to do more with their temporal data.
The prefix ofn: stands for the namespace <http://www.ontotext.com/sparql/functions/>. For more information, refer to Time Functions Extensions.
Function Description

xsd:long ofn:years-from-duration(xsd:duration dur)
    Returns the “years” part of the duration literal

xsd:long ofn:months-from-duration(xsd:duration dur)
    Returns the “months” part of the duration literal

xsd:long ofn:days-from-duration(xsd:duration dur)
    Returns the “days” part of the duration literal

xsd:long ofn:hours-from-duration(xsd:duration dur)
    Returns the “hours” part of the duration literal

xsd:long ofn:minutes-from-duration(xsd:duration dur)
    Returns the “minutes” part of the duration literal

xsd:long ofn:seconds-from-duration(xsd:duration dur)
    Returns the “seconds” part of the duration literal

xsd:long ofn:millis-from-duration(xsd:duration dur)
    Returns the “milliseconds” part of the duration literal

xsd:long ofn:asWeeks(xsd:duration dur)
    Returns the duration of the period as weeks

xsd:long ofn:asDays(xsd:duration dur)
    Returns the duration of the period as days

xsd:long ofn:asHours(xsd:duration dur)
    Returns the duration of the period as hours

xsd:long ofn:asMinutes(xsd:duration dur)
    Returns the duration of the period as minutes

xsd:long ofn:asSeconds(xsd:duration dur)
    Returns the duration of the period as seconds

xsd:long ofn:asMillis(xsd:duration dur)
    Returns the duration of the period as milliseconds

xsd:long ofn:weeksBetween(xsd:dateTime d1, xsd:dateTime d2)
    Returns the duration between the two dates as weeks

xsd:long ofn:daysBetween(xsd:dateTime d1, xsd:dateTime d2)
    Returns the duration between the two dates as days

xsd:long ofn:hoursBetween(xsd:dateTime d1, xsd:dateTime d2)
    Returns the duration between the two dates as hours

xsd:long ofn:minutesBetween(xsd:dateTime d1, xsd:dateTime d2)
    Returns the duration between the two dates as minutes

xsd:long ofn:secondsBetween(xsd:dateTime d1, xsd:dateTime d2)
    Returns the duration between the two dates as seconds

xsd:long ofn:millisBetween(xsd:dateTime d1, xsd:dateTime d2)
    Returns the duration between the two dates as milliseconds
The following SPIN SPARQL functions and magic predicates are available in GraphDB. The prefix spif: stands
for the namespace <http://spinrdf.org/spif#>.
SPIN functions that work on text use 0-based indexes, unlike SPARQL’s functions, which use 1-based indexes.
Function Description

Dates
    spif:parseDate(xsd:string date, xsd:string format)
        Parses date using format with Java’s SimpleDateFormat
    spif:dateFormat(xsd:dateTime date, xsd:string format)
        Formats date using format with Java’s SimpleDateFormat
    xsd:long spif:currentTimeMillis()
        The current time in milliseconds since the epoch
    xsd:long spif:timeMillis(xsd:dateTime date)
        The time in milliseconds since the epoch for the provided argument

Numbers
    numeric spif:mod(xsd:numeric dividend, xsd:numeric divisor)
        Remainder from integer division
    xsd:string spif:decimalFormat(numeric number, xsd:string format)
        Formats number using format with Java’s DecimalFormat
    xsd:double spif:random()
        Calls Java’s Math.random()

Strings
    xsd:string spif:trim(string str)
        Calls String.trim()
    spif:generateUUID()
        UUID generation as a literal. Same as SPARQL’s STRUUID().
    spif:cast(literal value, iri type)
        Same as SPARQL’s STRDT(STR(value), type). Does not do validation either.
    xsd:int spif:indexOf(string str, string substr)
        Position of first occurrence of a substring.
    xsd:int spif:lastIndexOf(string str, string substr)
        Position of last occurrence of a substring.
Predicate Description

?result spif:split (?string ?regex)
    Takes two arguments: a string to split and a regex to split on. The current implementation uses Java’s String.split().

?result spif:for (?start ?end)
    Generates bindings from a given start integer value to another given end integer value.

?result spif:foreach (?arg1 ?arg2 …)
    Generates bindings for the given arguments arg1, arg2, and so on.
To avoid any parsing of an embedded triple, GraphDB introduces the following SPARQL functions:
Function Description

xsd:boolean rdf:isTriple(variable var)
    Checks if the variable var is bound to an embedded triple

iri rdf:subject(variable var)
iri rdf:predicate(variable var)
rdfTerm rdf:object(variable var)
    Extracts the subject, predicate, or object from a variable bound to an embedded triple

rdf:Statement(iri subj, iri pred, rdfTerm obj)
    Creates a new embedded statement with the provided values
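A minimal sketch that builds an embedded triple with rdf:Statement and then decomposes it; the example IRIs are arbitrary, and rdf: is assumed to be the standard RDF namespace:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?s ?p ?o WHERE {
    # Construct an embedded (quoted) statement from three RDF terms
    BIND(rdf:Statement(<http://example.com/a>, <http://example.com/b>, <http://example.com/c>) AS ?t)
    # Extract its components
    BIND(rdf:subject(?t) AS ?s)
    BIND(rdf:predicate(?t) AS ?p)
    BIND(rdf:object(?t) AS ?o)
}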
Function Description

list list:member member
    Membership of an RDF List (RDF Collection). Currently in GraphDB, if list is not bound or a constant, an error will be thrown; else evaluate for the list in the variable list. If member is a variable, generate solutions with member bound to each element of list. If member is bound or a constant expression, test to see if it is a member of list.

list list:index (index member)
    Index of an RDF List (RDF Collection). Currently in GraphDB, if list is not bound or a constant, an error will be thrown; else evaluate for one particular list. The object is a list pair; either element can be bound, unbound, or a fixed node. Unbound variables in the object list are bound by the property function.

list list:length length
    Length of an RDF List (RDF Collection). Currently in GraphDB, if list is not bound or a constant, an error will be thrown; else evaluate for one particular list. The object is tested against or bound to the length of the list.
Note: The Jena behavior is that if list is not bound or a constant, the function finds and iterates all lists in the graph (which can be slow). As mentioned above, GraphDB currently does not support an unbound list. Support for it will be added in upcoming releases.
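A brief sketch of list:member with a bound list; the data property is hypothetical, and the list: prefix is assumed to resolve to the Jena ARQ list namespace:

PREFIX list: <http://jena.apache.org/ARQ/list#>
SELECT ?member WHERE {
    # Bind the list first through a normal triple pattern (unbound lists are not supported)
    <http://example.com/reading-list> <http://example.com/items> ?list .
    # Then enumerate its members with the list:member property function
    ?list list:member ?member .
}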
GraphDB supports the below Jena ARQ aggregate function analogs, which are modeled after the corresponding
SQL aggregate functions.
The prefix agg: stands for the namespace <http://jena.apache.org/ARQ/function/aggregate#>.
• agg:stdev
• agg:stdev_samp
• agg:stdev_pop
• agg:variance
• agg:var_samp
• agg:var_pop
The stdev_pop() and stdev_samp() functions compute the population standard deviation and sample standard deviation, respectively, of the input values. (stdev() is an alias for stdev_samp().) Both functions evaluate all input rows matched by the query. The difference is that stdev_samp() is scaled by 1/(N-1) while stdev_pop() is scaled by 1/N.
The var_samp() and var_pop() functions compute the sample variance and population variance, respectively, of the input values. (variance() is an alias for var_samp().) Both functions evaluate all input rows matched by the query. The difference is that var_samp() is scaled by 1/(N-1) while var_pop() is scaled by 1/N.
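A short sketch of using these aggregates alongside the standard ones; the data property is hypothetical, and it is assumed that the functions can be invoked by IRI in an aggregate position:

PREFIX agg: <http://jena.apache.org/ARQ/function/aggregate#>
SELECT (AVG(?price) AS ?mean) (agg:stdev(?price) AS ?sd) WHERE {
    ?product <http://example.com/price> ?price .
}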
The following functions are defined by the GeoSPARQL standard. For more information, refer to OGC GeoSPARQL - A Geographic Query Language for RDF Data. The prefix geof: stands for the namespace <http://www.opengis.net/def/function/geosparql/>.
The type geomLiteral serves as a placeholder for any GeoSPARQL literal that describes a geometry. GraphDB
supports WKT (datatype geo:wktLiteral) and GML (datatype geo:gmlLiteral).
Function Description

xsd:double geof:distance(geomLiteral geom1, geomLiteral geom2, iri units)
    Returns the shortest distance in units between any two Points in the two geometric objects, as calculated in the spatial reference system of geom1.

geomLiteral geof:buffer(geomLiteral geom, xsd:double radius, iri units)
    Returns a geometric object that represents all Points whose distance from geom is less than or equal to the radius measured in units. Calculations are in the spatial reference system of geom.

geomLiteral geof:convexHull(geomLiteral geom1)
    Returns a geometric object that represents all Points in the convex hull of geom1. Calculations are in the spatial reference system of geom1.
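A sketch of geof:distance over hypothetical data; geo: is the GeoSPARQL ontology namespace mentioned below, and the units IRI is assumed to be the OGC unit-of-measure IRI for metres:

PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
SELECT ?a ?b ?dist WHERE {
    ?a geo:hasGeometry/geo:asWKT ?wktA .
    ?b geo:hasGeometry/geo:asWKT ?wktB .
    # Distance in metres, computed in the spatial reference system of ?wktA
    BIND(geof:distance(?wktA, ?wktB, <http://www.opengis.net/def/uom/OGC/1.0/metre>) AS ?dist)
    FILTER(?a != ?b)
}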
On top of the standard GeoSPARQL functions, GraphDB adds a few useful extensions based on the USeekM
library. The prefix geoext: stands for the namespace <http://rdf.useekm.com/ext#>.
The types geo:Geometry, geo:Point, etc. refer to GeoSPARQL types in the http://www.opengis.net/ont/
geosparql# namespace.
Function Description

xsd:double geoext:area(geomLiteral g)
    Calculates the area of the surface of the geometry.

geomLiteral geoext:closestPoint(geomLiteral g1, geomLiteral g2)
    For two given geometries, computes the point on the first geometry that is closest to the second geometry.

xsd:boolean geoext:containsProperly(geomLiteral g1, geomLiteral g2)
    Tests if the first geometry properly contains the second geometry. Geom1 properly contains geom2 if geom1 contains geom2 and the boundaries of the two geometries do not intersect.

xsd:boolean geoext:coveredBy(geomLiteral g1, geomLiteral g2)
    Tests if the first geometry is covered by the second geometry. Geom1 is covered by geom2 if every point of geom1 is a point of geom2.

xsd:boolean geoext:covers(geomLiteral g1, geomLiteral g2)
    Tests if the first geometry covers the second geometry. Geom1 covers geom2 if every point of geom2 is a point of geom1.

xsd:double geoext:hausdorffDistance(geomLiteral g1, geomLiteral g2)
    Measures the degree of similarity between two geometries. The measure is normalized to lie in the range [0, 1]. Higher measures indicate a greater degree of similarity.

geo:Line geoext:shortestLine(geomLiteral g1, geomLiteral g2)
    Computes the shortest line between two geometries. Returns it as a LineString object.

geomLiteral geoext:simplify(geomLiteral g, double d)
    Given a maximum deviation from the curve, computes a simplified version of the given geometry using the Douglas-Peucker algorithm.

geomLiteral geoext:simplifyPreserveTopology(geomLiteral g, double d)
    Given a maximum deviation from the curve, computes a simplified version of the given geometry using the Douglas-Peucker algorithm. Will avoid creating derived geometries (polygons in particular) that are invalid.

xsd:boolean geoext:isValid(geomLiteral g)
    Checks whether the input geometry is a valid geometry.
At present, there is just one SPARQL extension function. The prefix omgeo: stands for the namespace <http://
www.ontotext.com/owlim/geo#>.
Function Description

xsd:double omgeo:distance(numeric lat1, numeric long1, numeric lat2, numeric long2)
    Computes the distance between two points in kilometers and can be used in FILTER and ORDER BY clauses. Latitude is limited to the range -90 (South) to +90 (North). Longitude is limited to the range -180 (West) to +180 (East).
Function Description

iri apf:splitIRI (namespace localname)
iri apf:splitURI (namespace localname)
    Split the IRI or URI into namespace (an IRI) and local name (a string). Compare if given values or bound variables, otherwise set the variable. The object is a list with 2 elements. splitURI is a synonym.

var apf:concat (arg arg …)
    Concatenate the arguments in the object list as strings, and assign to var.

var apf:strSplit (arg arg)
    Split a string and return a binding for each result. The subject variable should be unbound. The first argument to the object list is the string to be split. The second argument to the object list is a regular expression by which to split the string. The subject var is bound for each result of the split, and each result has the whitespace trimmed from it.

afn:bnode(?x)
    Return the blank node label if ?x is a blank node.

afn:localname(?x)
    The local name of ?x.

afn:namespace(?x)
    The namespace of ?x.

afn:sprintf(format, v1, v2, …)
    Make a string from the format string and the RDF terms.

afn:substr(string, startIndex [, endIndex])
    Substring, Java style, using startIndex and endIndex.

afn:substring
    Synonym for afn:substr.

afn:strjoin(sep, string …)
    Concatenate given strings, using sep as a separator.

afn:sha1sum(resource)
    Calculate the SHA1 checksum of a literal or URI.

afn:now()
    Current time. (Actually, a fixed moment of the current query execution – see the standard function NOW() for details.)
Beside the standard SPARQL functions related to time, GraphDB offers several additional functions, allowing users to do more with their time data. These are implemented within the same namespace as the standard math functions, <http://www.ontotext.com/sparql/functions/>. The default prefix for the functions is ofn.
The first group of functions is related to accessing particular parts of standard duration literals. For example, the expression "2019-03-24T22:12:29.183+02:00" - "2019-04-19T02:42:28.182+02:00" will produce the following duration literal: -P0Y0M25DT4H29M58.999S. It is possible to parse the result and obtain its proper parts, for example "25 days", "4 hours", or more discrete time units. However, instead of having to do this manually, GraphDB offers functions that perform the computations at the engine level. The functions take a period as input and output xsd:long.
Note: The functions described here perform simple extractions, rather than computing the periods. For example, if you have 40 days in the duration literal, but no months, i.e., P0Y0M40DT4H29M58.999S, a months-from-duration extraction will not return 1 month.
The following table describes the functions that are implemented and gives example results, assuming the literal
-P0Y0M25DT4H29M58.999S is passed to them:
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX ofn:<http://www.ontotext.com/sparql/functions/>
SELECT ?result {
bind (ofn:millis-from-duration("-P0Y0M25DT4H29M58.999S"^^xsd:dayTimeDuration) as ?result)
}
The second group of functions is related to transforming a standard duration literal. This reduces the need for performing mathematical transformations on the input date. The functions take a period as input and output xsd:long.
Note: The transformation is performed with no fractional components. For example, if transformed, the duration literal we used previously, -P0Y0M25DT4H29M58.999S, will yield 25 days, rather than 25.19 days.
The following table describes the functions that are implemented and gives example results, assuming the literal -P0Y0M25DT4H29M58.999S is passed to them. Note that the return values are negative since the period is negative:
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX ofn:<http://www.ontotext.com/sparql/functions/>
SELECT ?result {
bind (ofn:asMillis("-P0Y0M25DT4H29M58.999S"^^xsd:dayTimeDuration) as ?result)
}
The third group of functions eliminates the need for computing a difference between two dates when a transformation will be necessary, essentially combining the mathematical operation of subtracting two dates with a transformation. It is more efficient than performing an explicit mathematical operation between two date literals, for example "2019-03-24T22:12:29.183+02:00" - "2019-04-19T02:42:28.182+02:00", and then using a transformation function. The functions take two dates as input and output integer literals.
Note: Regular SPARQL subtraction can return negative values, as evidenced by the negative duration literal used in the example. However, the *Between functions return only positive values, so they are not an exact match for a subtraction followed by a transformation. If one of the timestamps has a timezone but the other does not, the result is ill-defined.
The following table describes the functions that are implemented and gives example results, assuming the date literals "2019-03-24T22:12:29.183+02:00" and "2019-04-19T02:42:28.182+02:00" are passed to them. Note that the return values are positive:
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX ofn:<http://www.ontotext.com/sparql/functions/>
SELECT ?result {
bind (ofn:millisBetween("2019-03-24T22:12:29.183+02:00"^^xsd:dateTime, "2019-04-19T02:42:28.182+02:00"^^xsd:dateTime) as ?result)
}
The fourth group of functions includes operations such as: adding a duration to a date; adding a dayTimeDuration to a dateTime; adding a time duration to a time; and comparing durations. This is done via the SPARQL operator extensibility.
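A minimal sketch, assuming the operator extensibility accepts adding an xsd:dayTimeDuration to an xsd:dateTime:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?later WHERE {
    # Add one day and two hours to a fixed timestamp
    BIND("2019-03-24T22:12:29.183+02:00"^^xsd:dateTime + "P1DT2H"^^xsd:dayTimeDuration AS ?later)
}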
GraphDB supports several OWL-like dialects: OWL-Horst (owl-horst); OWL-Max (owl-max), which covers most of OWL Lite and RDFS; OWL2 QL (owl2-ql); and OWL2 RL (owl2-rl).
With the owl-max ruleset, GraphDB supports the following semantics:
• full RDFS semantics without constraints or limitations, apart from the entailment related to typed literals (known as D-entailment). For instance, metaclasses (and any arbitrary mixture of class, property, and individual) can be combined with the supported OWL semantics;
• most of OWL Lite;
• all of OWL DLP.
The differences between OWL-Horst and the OWL dialects supported by GraphDB (owl-horst and owl-max) can be summarised as follows:
• GraphDB does not provide the extended support for typed literals introduced with the D-entailment extension of the RDFS semantics. Although such support is conceptually clear and easy to implement, it is our understanding that the performance penalty is too high for most applications. You can easily implement the rules defined for this purpose by ter Horst and add them to a custom ruleset;
• There are no inconsistency rules by default;
• A few more OWL primitives are supported by GraphDB (ruleset owl-max);
• There is extended support for schema-level (TBox) reasoning in GraphDB.
Even though the concrete rules predefined in GraphDB differ from those defined in OWL-Horst, the complexity and decidability results reported for R-entailment are relevant for TRREE and GraphDB. To be more precise, the rules in the owl-horst ruleset do not introduce new BNodes, which means that R-entailment with respect to them takes polynomial time. In KR terms, this means that the owl-horst inference within GraphDB is tractable.
Inference using owl-horst is of a lesser complexity compared to other formalisms that combine DL formalisms
with rules. In addition, it puts no constraints with respect to metamodeling.
The correctness of the support for OWL semantics (for those primitives that are supported) is checked against the normative Positive- and Negative-entailment OWL test cases.
System statements are used as SPARQL pragmas specific to GraphDB. They are ways to alter the behavior of
SPARQL queries in specific ways. The IDs of system statements are not present in the repository in any way.
GraphDB system statements can be recognized by their identifiers, which begin with either the onto or the sys prefix. Those stand for <https://www.ontotext.com/> and <http://www.ontotext.com/owlim/system#>, respectively.
System graphs modify the result or change the dataset on which the query operates. The semantics are identical to those of standard graphs used with the FROM keyword. An example of graph usage would be:
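# A sketch using one of GraphDB's system graphs; the <http://www.ontotext.com/explicit>
# graph is assumed here to restrict results to explicit (non-inferred) statements.
SELECT * FROM <http://www.ontotext.com/explicit> {
    ?s ?p ?o
}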
System predicates are used to change the way in which the repository behaves. An example of system predicate
usage would be:
INSERT DATA {
[] sys:addRuleset "owl-horst-optimized" .
[] sys:defaultRuleset "owl-horst-optimized" .
[] sys:reinfer [] .
}
The diagram below provides an illustration of an RDF graph that describes a repository configuration:
Often, it is helpful to ensure that a repository starts with a predefined set of RDF statements, usually one or more schema graphs. This is possible by using the graphdb:imports property. After startup, these files are parsed and their contents are permanently added to the repository.
In short, the configuration is an RDF graph, where the root node is of rdf:type rep:Repository, and it must be connected through the rep:repositoryID property to a literal that contains the human-readable name of the repository. The root node must be connected via the rep:repositoryImpl property to a node that describes the configuration.
GraphDB repository The type of the repository is defined via the rep:repositoryType property, and its value must be graphdb:SailRepository to let RDF4J know what the desired Sail repository implementation is. Then, a node that specifies the Sail implementation to be instantiated must be connected through the sr:sailImpl property. To instantiate GraphDB, this last node must have a property sail:sailType with the value graphdb:Sail; the RDF4J framework will then locate the correct SailFactory within the application classpath, which will be used to instantiate the Java implementation class.
The namespaces corresponding to the prefixes used in the above paragraph are as follows:
rep: <http://www.openrdf.org/config/repository#>
sr: <http://www.openrdf.org/config/repository/sail#>
sail: <http://www.openrdf.org/config/sail#>
graphdb: <http://www.ontotext.com/trree/graphdb#>
All properties used to specify the GraphDB configuration parameters use the graphdb: prefix and the local names
match up with the configuration parameters, e.g., the value of the ruleset parameter can be specified using the
graphdb:ruleset property.
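Putting these pieces together, a minimal repository configuration in Turtle might look like the following sketch; the repository ID, label, and parameter values are illustrative:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rep: <http://www.openrdf.org/config/repository#> .
@prefix sr: <http://www.openrdf.org/config/repository/sail#> .
@prefix sail: <http://www.openrdf.org/config/sail#> .
@prefix graphdb: <http://www.ontotext.com/trree/graphdb#> .

[] a rep:Repository ;
   rep:repositoryID "my-repo" ;
   rdfs:label "Example repository" ;
   rep:repositoryImpl [
       rep:repositoryType "graphdb:SailRepository" ;
       sr:sailImpl [
           sail:sailType "graphdb:Sail" ;
           # Configuration parameters are given via graphdb: properties,
           # e.g., the ruleset parameter:
           graphdb:ruleset "owl-horst-optimized"
       ]
   ] .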
GraphDB’s owl:sameAs optimization is used for mapping the same concepts from two or more datasets, where
each of these concepts can have different features and relations to other concepts. In this way, making a union
between such datasets provides more complete data. In RDF, concepts are represented with a unique resource
name by using a namespace, which is different for every dataset. Therefore, it is more useful to unify all names of
a single concept, so that when querying data, you are able to work with concepts rather than names (i.e., IRIs).
For example, when merging four different datasets, you can use the following query on DBpedia to select every
thing about Sofia:
SELECT * {
{
<http://dbpedia.org/resource/Sofia> ?p ?o .
}
UNION
{
<http://data.nytimes.com/nytimes:N82091399958465550531> ?p ?o .
}
UNION
{
<http://sws.geonames.org/727011/> ?p ?o .
}
UNION
{
<http://rdf.freebase.com/ns/m/0ftjx> ?p ?o .
}
}
As you can see, here Sofia appears with four different URIs, although they denote the same concept. Of course, this is a very simple query. Sofia also has relations to other entities in these datasets, such as Plovdiv, i.e., <http://dbpedia.org/resource/Plovdiv>, <http://sws.geonames.org/653987/>, and <http://rdf.freebase.com/ns/m/1aihge>.
What’s more, not only do the different instances of one concept have multiple names, but their properties also appear with many names. Some of them are specific to a given dataset (e.g., GeoNames has longitude and latitude, while DBpedia provides wikilinks), but there are class hierarchies, labels, and other common properties used by most of the datasets.
This means that even for the simplest query, you may have to write the following:
SELECT * {
?s ?p1 ?x .
?x ?p2 ?o .
FILTER (?s IN (
<http://dbpedia.org/resource/Sofia>,
<http://data.nytimes.com/nytimes:N82091399958465550531>,
<http://sws.geonames.org/727011/>,
<http://rdf.freebase.com/ns/m/0ftjx>))
FILTER (?p1 IN (
<http://dbpedia.org/property/wikilink>,
<http://sws.geonames.org/p/relatesTo>))
FILTER (?p2 IN (
<http://dbpedia.org/property/wikilink>,
<http://sws.geonames.org/p/relatesTo>))
FILTER (?o IN (<http://dbpedia.org/resource/Plovdiv>,
<http://sws.geonames.org/653987/>,
<http://rdf.freebase.com/ns/m/1aihge>))
}
But if you can say through rules and assertions that given URIs are the same, then you can simply write:
SELECT * {
<http://dbpedia.org/resource/Sofia> <http://sws.geonames.org/p/relatesTo> ?x .
?x <http://sws.geonames.org/p/relatesTo> <http://dbpedia.org/resource/Plovdiv> .
}
If you link two nodes with owl:sameAs, the statements in which the first node appears as subject, predicate, or object are copied, with the second node substituted in the corresponding position, and vice versa.
For example, given that <http://dbpedia.org/resource/Sofia> owl:sameAs <http://data.nytimes.com/N82091399958465550531>, and also that:
<http://dbpedia.org/resource/Sofia> a <http://dbpedia.org/resource/Populated_place> .
<http://data.nytimes.com/N82091399958465550531> a <http://www.opengis.net/gml/_Feature> .
<http://dbpedia.org/resource/Plovdiv> <http://dbpedia.org/property/wikilink> <http://dbpedia.org/resource/Sofia> .
the following statements will be inferred:
<http://dbpedia.org/resource/Sofia> a <http://www.opengis.net/gml/_Feature> .
<http://data.nytimes.com/N82091399958465550531> a <http://dbpedia.org/resource/Populated_place> .
<http://dbpedia.org/resource/Plovdiv> <http://dbpedia.org/property/wikilink> <http://data.nytimes.com/
,→N82091399958465550531> .
The challenge with owl:sameAs is that when there are many ‘mappings’ of nodes between datasets, and especially when long chains of owl:sameAs appear, it becomes inefficient. owl:sameAs is defined as symmetric and transitive, so given that A sameAs B sameAs C, it also follows that A sameAs A, A sameAs C, B sameAs A, B sameAs B, C sameAs A, C sameAs B, and C sameAs C. If you have such a chain with N nodes, then N^2 owl:sameAs statements will be produced (including the N-1 explicit owl:sameAs statements that form the chain). The owl:sameAs rules will also copy the statements containing these nodes N times, assuming that each statement contains only one node from the chain and the other nodes are not sameAs anything. But you can also have a statement <S P O> where S sameAs Sx, P sameAs Py, and O sameAs Oz; if there are K owl:sameAs statements for S, L for P, and M for O, this yields K*L*M statement copies overall.
Therefore, instead of using these simple rules and axioms for owl:sameAs (actually two axioms stating that it is symmetric and transitive), GraphDB offers an efficient non-rule implementation, i.e., the owl:sameAs support is hardcoded. The corresponding rules are commented out in the .pie files and are left only as a reference.
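As a small, hypothetical illustration of the effect (the example.com IRIs and values are made up), assume the following update on a repository whose ruleset supports owl:sameAs (e.g., the default rdfsplus-optimized ruleset):

INSERT DATA {
    <http://example.com/ds1/Sofia> <http://example.com/population> "1300000" .
    <http://example.com/ds2/Sofia> <http://example.com/locatedIn> <http://example.com/Bulgaria> .
    <http://example.com/ds1/Sofia> <http://www.w3.org/2002/07/owl#sameAs> <http://example.com/ds2/Sofia> .
}

A query against either of the two IRIs then returns the properties asserted for both of them (together with the owl:sameAs statements themselves):

SELECT ?p ?o {
    <http://example.com/ds2/Sofia> ?p ?o .
}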
An RDF database can store collections of RDF statements (triples) in separate graphs identified (named) by a URI.
A group of statements with a unique name is called a ‘named graph’. An RDF database has one more graph, which
does not have a name, and it is called the ‘default graph’.
The SPARQL query syntax provides a means to execute queries across default and named graphs using FROM
and FROM NAMED clauses. These clauses are used to build an RDF dataset, which identifies what statements
the SPARQL query processor will use to answer a query. The dataset contains a default graph and named graphs
and is constructed as follows:
• FROM <uri> brings statements from the database graph, identified by URI, to the dataset’s default graph,
i.e., the statements ‘lose’ their graph name.
• FROM NAMED <uri> brings the statements from the database graph, identified by URI, to the dataset, i.e.,
the statements keep their graph name.
If either FROM or FROM NAMED are used, the database’s default graph is no longer used as input for processing this
query. In effect, the combination of FROM and FROM NAMED clauses exactly defines the dataset. This is
somewhat bothersome, as it precludes the possibility, for instance, of executing a query over just one named graph
and the default graph. However, there is a programmatic way to get around this limitation as described below.
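For example, with the hypothetical graph names below, the first branch of the UNION matches statements brought in from ex:graphA (now in the dataset’s default graph, without a graph name), while the second branch matches statements from ex:graphB together with their graph name:

PREFIX ex: <http://example.com/>

SELECT *
FROM ex:graphA
FROM NAMED ex:graphB
WHERE {
    { ?s ?p ?o }                # statements from ex:graphA; the graph name is 'lost'
    UNION
    { GRAPH ?g { ?s ?p ?o } }   # statements from ex:graphB; ?g is bound to ex:graphB
}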
Note: The SPARQL specification does not define what happens when no FROM or FROM NAMED clauses are present
in a query, i.e., it does not define how a SPARQL processor should behave when no dataset is defined. In this
situation, implementations are free to construct the default dataset as necessary.
When no dataset is defined, GraphDB evaluates the query against all statements in the repository, both in the default graph and in the named graphs. For example, if the repository contains the single statement ex:x ex:y ex:z stored in the named graph ex:g, the following queries return these bindings:

Query                                      Bindings
SELECT * { ?s ?p ?o }                      ?s=ex:x ?p=ex:y ?o=ex:z
SELECT * { GRAPH ?g { ?s ?p ?o } }         ?s=ex:x ?p=ex:y ?o=ex:z ?g=ex:g
In other words, the triple ex:x ex:y ex:z will appear to be in both the default graph and the named graph ex:g.
There are two reasons for this behavior:
1. It provides an easy way to execute a triple pattern query over all stored RDF statements.
2. It allows all named graph names to be discovered, i.e., with this query: SELECT ?g { GRAPH ?g { ?s ?p
?o } }.
• infer <s _ I>, infer <s _ I>, retract <_ _ _> (if the two inferences are from the same premises).
This does not show all possible sequences, but it shows the principles:
• No duplicate statement can exist in the default graph;
• Delete/retract clears the appropriate flag;
• The statement is deleted only after both flags are cleared;
• Deleting an inferred statement has no effect (except to clear the I flag, if any);
• Retracting an inserted statement has no effect (except to clear the E flag, if any);
• Inserting the same statement twice has no effect: insert is idempotent;
• Inferring the same statement twice has no effect: infer is idempotent, and I is a flag, not a counter, but the
Retraction algorithm ensures I is cleared only after all premises of s are retracted.
Now, let’s consider operations on statement s in the named graph G, and inferred statement s in the default graph:
• insert <s G E>, infer <s _ I> <s G E>, delete <s _ I>, retract <_ _ _>;
• insert <s G E>, infer <s _ I> <s G E>, retract <s G E>, delete <_ _ _>;
• infer <s _ I>, insert <s G E> <s _ I>, delete <s _ I>, retract <_ _ _>;
• infer <s _ I>, insert <s G E> <s _ I>, retract <s G E>, delete <_ _ _>;
• insert <s G E>, insert <s G E>, delete <_ _ _>;
• infer <s _ I>, infer <s _ I>, retract <_ _ _> (if the two inferences are from the same premises).
The additional principles here are:
• The same statement can exist in several graphs as explicit in graph G and implicit in the default graph;
• Delete/retract works on the appropriate graph.
Note: In order to avoid a proliferation of duplicate statements, it is recommended not to insert inferable statements
in named graphs.
The database’s default graph can contain a mixture of explicit and implicit statements. The RDF4J API provides
a flag called ‘includeInferred’, which is passed to several API methods and when set to false causes only explicit
statements to be iterated or returned. When this flag is set to true, both explicit and implicit statements are iterated
or returned.
GraphDB provides extensions for more control over the processing of explicit and implicit statements. These extensions allow the selection of explicit statements, implicit statements, or both for query answering, and also provide a mechanism for identifying which statements are explicit and which are implicit. This is achieved by using special ‘pseudo-graph’ names in FROM and FROM NAMED clauses, which cause certain flags to be set.
The details are as follows:
FROM <http://www.ontotext.com/explicit>
The dataset’s default graph includes only explicit statements from the database’s default graph.
FROM <http://www.ontotext.com/implicit>
The dataset’s default graph includes only inferred statements from the database’s default graph.
FROM NAMED <http://www.ontotext.com/explicit>
The dataset contains a named graph http://www.ontotext.com/explicit that includes only explicit
statements from the database’s default graph, i.e., quad patterns such as GRAPH ?g {?s ?p ?o} rebind
explicit statements from the database’s default graph to a graph named http://www.ontotext.com/
explicit.
FROM NAMED <http://www.ontotext.com/implicit>
The dataset contains a named graph http://www.ontotext.com/implicit that includes only implicit
statements from the database’s default graph.
Note: These clauses do not affect the construction of the default dataset in the sense that using any combination
of the above will still result in a dataset containing all named graphs from the database. All it changes is which
statements appear in the dataset’s default graph and whether any extra named graphs (explicit or implicit) appear.
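For example, to retrieve only the inferred statements from the database’s default graph:

SELECT *
FROM <http://www.ontotext.com/implicit>
WHERE {
    ?s ?p ?o .
}

And the following sketch rebinds the statements of the database’s default graph into two extra named graphs, so that for those statements the binding of ?g indicates whether they are explicit or implicit (statements stored in ordinary named graphs still appear under their own graph names):

SELECT ?s ?p ?o ?g
FROM NAMED <http://www.ontotext.com/explicit>
FROM NAMED <http://www.ontotext.com/implicit>
WHERE {
    GRAPH ?g { ?s ?p ?o }
}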
The RDF4J API provides an interface Dataset and an implementation class DatasetImpl for defining the dataset
for a query by providing the URIs of named graphs and adding them to the default graphs and named graphs
members. This permits null to be used to identify the default database graph (or null context to use RDF4J
terminology).
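A minimal RDF4J client sketch illustrating both the includeInferred flag mentioned earlier and programmatic dataset construction might look as follows; the repository URL, graph IRI, and query are placeholders:

import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.query.TupleQuery;
import org.eclipse.rdf4j.query.impl.DatasetImpl;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

public class DatasetExample {
    public static void main(String[] args) {
        // Placeholder repository URL and graph IRI
        Repository repo = new HTTPRepository("http://localhost:7200/repositories/my-repo");
        ValueFactory vf = SimpleValueFactory.getInstance();
        IRI graph = vf.createIRI("http://example.com/graphA");

        try (RepositoryConnection conn = repo.getConnection()) {
            // includeInferred = false: iterate only explicit statements
            try (var statements = conn.getStatements(null, null, null, false)) {
                statements.forEach(System.out::println);
            }

            // Build a dataset programmatically: the database's default graph
            // (null context, as described above) plus one named graph,
            // and attach it to a query
            DatasetImpl dataset = new DatasetImpl();
            dataset.addDefaultGraph(null);
            dataset.addNamedGraph(graph);

            TupleQuery query = conn.prepareTupleQuery("SELECT * WHERE { ?s ?p ?o }");
            query.setDataset(dataset);
            try (var result = query.evaluate()) {
                while (result.hasNext()) {
                    System.out.println(result.next());
                }
            }
        } finally {
            repo.shutDown();
        }
    }
}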
Internally, GraphDB uses integer identifiers (IDs) to index all entities (URIs, blank nodes, literals, and RDF-star
(formerly RDF*) embedded triples). Statement indexes are made up of these IDs, and a large data structure is used to
map from ID to entity value and back. There are occasions (e.g., when interfacing to an application infrastructure)
when having access to these internal IDs can improve the efficiency of data structures external to GraphDB by
allowing them to be indexed by an integer value rather than a full URI.
Here, we introduce a special GraphDB predicate and function that provide access to the internal IDs. The datatype
of the internal IDs is <http://www.w3.org/2001/XMLSchema#long>.
Predicate <http://www.ontotext.com/owlim/entity#id>
Description A map between an entity and an internal ID
Example Select all entities and their IDs:
PREFIX ent: <http://www.ontotext.com/owlim/entity#>
SELECT * WHERE {
?s ent:id ?id
} ORDER BY ?id
Function <http://www.ontotext.com/owlim/entity#id>
Description Return an entity’s internal ID
Example Select all statements and order them by the internal ID of the object values:
PREFIX ent: <http://www.ontotext.com/owlim/entity#>
SELECT * WHERE {
?s ?p ?o .
} order by ent:id(?o)
Examples
• Enumerate all entities and bind the nodes to ?s and their IDs to ?id, order by ?id:
select * where {
?s <http://www.ontotext.com/owlim/entity#id> ?id
} order by ?id
• Enumerate all nonliterals and bind the nodes to ?s and their IDs to ?id, order by ?id:
SELECT * WHERE {
?s <http://www.ontotext.com/owlim/entity#id> ?id .
FILTER (!isLiteral(?s)) .
} ORDER BY ?id
• Find the internal IDs of subjects of statements with specific predicate and object values:
SELECT * WHERE {
?s <http://test.org#Pred1> "A literal".
?s <http://www.ontotext.com/owlim/entity#id> ?id .
} ORDER BY ?id
• Find all statements where the object has the given internal ID by using an explicit, untyped value as the ID
(the "115" is used as object in the second statement pattern):
SELECT * WHERE {
?s ?p ?o.
?o <http://www.ontotext.com/owlim/entity#id> "115" .
}
• As above, but using an xsd:long datatype for the constant within a FILTER condition:
SELECT * WHERE {
?s ?p ?o.
?o <http://www.ontotext.com/owlim/entity#id> ?id .
FILTER (?id="115"^^<http://www.w3.org/2001/XMLSchema#long>) .
} ORDER BY ?o
• Find the internal IDs of subject and object entities for all statements:
SELECT * WHERE {
?s ?p ?o.
?s <http://www.ontotext.com/owlim/entity#id> ?ids.
?o <http://www.ontotext.com/owlim/entity#id> ?ido.
}
• Retrieve all statements where the ID of the subject is equal to "115"^^xsd:long, by providing an internal
ID value within a filter expression:
SELECT * WHERE {
?s ?p ?o.
FILTER ((<http://www.ontotext.com/owlim/entity#id>(?s))
= "115"^^<http://www.w3.org/2001/XMLSchema#long>).
}
• Retrieve all statements where the stringized ID of the subject is equal to "115", by providing an internal ID
value within a filter expression:
SELECT * WHERE {
?s ?p ?o.
FILTER (str( <http://www.ontotext.com/owlim/entity#id>(?s) ) = "115").
}
GraphDB supports the RDF4J-specific vocabulary for determining ‘direct’ subclass, subproperty, and type relationships. The three predicates (sesame:directSubClassOf, sesame:directSubPropertyOf, and sesame:directType) are all defined in the namespace <http://www.openrdf.org/schema/sesame#>. The definition of the first of them is shown below; the other two are defined analogously for subproperty and type relationships.

Predicate: A sesame:directSubClassOf B
Definition: Class A is a direct subclass of B if:
1. A is a subclass of B and;
2. A and B are not equal and;
3. there is no class C (not equal to A or B) such that A is a subclass of C and C is a subclass of B.
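For example, to list the direct subclasses of a given class (the class IRI below is a placeholder):

PREFIX sesame: <http://www.openrdf.org/schema/sesame#>

SELECT ?subClass WHERE {
    ?subClass sesame:directSubClassOf <http://example.com/MyClass> .
}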
There are several more special graph URIs in GraphDB, which are used for controlling query evaluation.
FROM / FROM NAMED <http://www.ontotext.com/disable-sameAs>
Switches off the enumeration of equivalence classes produced by the optimization of owl:sameAs. By
default, all owl:sameAs URIs are returned by triple pattern matching. This clause reduces the number
of results to a single representative from each owl:sameAs class (see the example after this list). For
more details, see Not enumerating sameAs.
FROM / FROM NAMED <http://www.ontotext.com/count>
Used for triggering the evaluation of the query, so that it gives a single result in which all variable
bindings in the projection are replaced with a plain literal, holding the value of the total number of
solutions of the query. In the case of a CONSTRUCT query in which the projection contains three
variables (?subject, ?predicate, ?object), the subject and the predicate are bound to <http://www.
ontotext.com/> and the object holds the literal value. This is because there cannot exist a statement
with a literal in the place of the subject or predicate. This clause is deprecated in favor of using the
COUNT aggregate of SPARQL 1.1.
FROM / FROM NAMED <http://www.ontotext.com/skip-redundant-implicit>
Used for triggering the exclusion of implicit statements when there is an explicit one within a specific
context (even the default one). Initially implemented to allow filtering of redundant rows where the
context part is not taken into account, which leads to ‘duplicate’ results.
FROM <http://www.ontotext.com/distinct>
Using this special graph name in DESCRIBE and CONSTRUCT queries will cause only distinct triples to
be returned. This is useful when several resources are being described, where the same triple can be
returned more than once, i.e., when describing its subject and its object. This clause is deprecated in
favor of using the DISTINCT clause of SPARQL 1.1.
FROM <http://www.ontotext.com/owlim/cluster/control-query>
Identifies the query to a GraphDB EE cluster master node as needing to be routed to all worker nodes.
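As an example of the disable-sameAs clause above (the data IRIs are illustrative), the following query returns a single representative IRI per owl:sameAs equivalence class, instead of repeating each solution once for every equivalent IRI:

SELECT ?city
FROM <http://www.ontotext.com/disable-sameAs>
WHERE {
    ?city a <http://dbpedia.org/ontology/City> .
}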
The default behavior of the GraphDB query optimizer is to try to reposition BIND clauses so that all the variables
used within the EXPR part (on the left side of ‘AS’) have valid bindings at the point where the BIND is evaluated.
If you look at the following data:
INSERT DATA {
<urn:q> <urn:pp1> 1 .
<urn:q> <urn:pp2> 2 .
<urn:q> <urn:pp3> 3 .
}
and try to evaluate a SPARQL query such as the one below (without any rearrangement of the statement patterns):
SELECT ?r {
?q <urn:pp1> ?x .
?q <urn:pp2> ?y .
BIND (?x + ?y + ?z AS ?r) .
?q <urn:pp3> ?z .
}
you would get a single result in which ?r is unbound, because the expression that sums several variables cannot
produce a valid binding for ?r while ?z is not yet bound at the point where the BIND is evaluated.
But if you rearrange the statement patterns in the same query so that you have bindings for all of the variables used
within the sum expression of the BIND clause:
SELECT ?r {
?q <urn:pp1> ?x .
?q <urn:pp2> ?y .
?q <urn:pp3> ?z .
BIND (?x + ?y + ?z AS ?r) .
}
the query would return a single result and now the value bound to ?r will be 6:
1 result: r=6
By default, the GraphDB query optimizer tries to move the BIND after the last statement pattern, so that all the
variables referred internally have a binding. However, that behavior can be modified by using a special ‘system’
graph within the dataset section of the query (e.g., as FROM clause) that has the following URI:
<http://www.ontotext.com/retain-bind-position>.
In this case, the optimizer retains the relative position of the BIND operator within the group in which it appears,
so that if you evaluate the following query against the GraphDB repository:
SELECT ?r
FROM <http://www.ontotext.com/retain-bind-position> {
?q <urn:pp1> ?x .
?q <urn:pp2> ?y .
BIND (?x + ?y + ?z AS ?r) .
?q <urn:pp3> ?z .
}
the result will be:
1 result: r=UNDEF
Still, the default evaluation without the special ‘system’ graph provides a more useful result:
1 result: r="6"
19.18 Glossary
Datalog A query and rule language for deductive databases that syntactically is a subset of Prolog.
D-entailment A vocabulary entailment of an RDF graph that respects the ‘meaning’ of datatypes.
Description Logic A family of formal knowledge representation languages that are subsets of first-order logic,
but have more efficient decision problems.
Horn Logic Broadly means a system of logic whose semantics can be captured by Horn clauses. A Horn clause
has at most one positive literal and allows for an IF…THEN interpretation, hence the common term ‘Horn
Rule’.
Knowledge Base (In the Semantic Web sense) is a database of both assertions (ground statements) and an
inference system for deducing further knowledge based on the structure of the data and a formal vocabulary.
Knowledge Representation An area in artificial intelligence that is concerned with representing knowledge in a
formal way such that it permits automated processing (reasoning).
Load Average The load average represents the average system load over a period of time.
Materialization The process of inferring and storing (for later retrieval or use in query answering) every piece of
information that can be deduced from a knowledge base’s asserted facts and vocabulary.
Named Graph A group of statements identified by a URI. It allows a subset of statements in a repository to be
manipulated or processed separately.
Ontology A shared conceptualisation of a domain, described using a formal (knowledge) representation language.
OWL A family of W3C knowledge representation languages that can be used to create ontologies. See Web
Ontology Language.
OWL-Horst An entailment system built upon RDF Schema; see R-entailment.
Predicate Logic Generic term for symbolic formal systems like first-order logic, second-order logic, etc. Its
formulas may contain variables which can be quantified.
RDF Graph Model The interpretation of a collection of RDF triples as a graph, where resources are nodes in the
graph and predicates form the arcs between nodes. Therefore one statement leads to one arc between two
nodes (subject and object).
RDF Schema A vocabulary description language for RDF with formal semantics.
Resource An element of the RDF model, which represents a thing that can be described, i.e., a unique name to
identify an object or a concept.
R-entailment A more general semantics layered on RDFS, where any set of rules (i.e., rules that extend or even
modify RDFS) is permitted. Rules are of the form IF…THEN… and use RDF statement patterns in their
premises and consequences, with variables allowed in any position.
Resource Description Framework (RDF) A family of W3C specifications for modeling knowledge with a
variety of syntaxes.
Semantic Repository A semantic repository is a software component for storing and manipulating RDF data. It
is made up of three distinct components:
• An RDF database for storing, retrieving, updating and deleting RDF statements (triples);
• An inference engine that uses rules to infer ‘new’ knowledge from explicit statements;
• A powerful query engine for accessing the explicit and implicit knowledge.
Semantic Web The concept of attaching machine understandable metadata to all information published on the
internet, so that intelligent agents can consume, combine and process information in an automated fashion.
SPARQL The most popular RDF query language.
Statement or Triple A basic unit of information expression in RDF. A triple consists of a subject, a predicate, and an object.
Universal Resource Identifier (URI) A string of characters used to (uniquely) identify a resource.
TWENTY
RELEASE NOTES
GraphDB release notes provide information about the features and improvements in each release, as well as various
bug fixes. GraphDB’s versioning scheme is based on semantic versioning. The full version is composed of three
components:
major.minor.patch
e.g., 9.11.2 where the major version is 9, the minor version is 11 and the patch version is 2.
Occasional versions may include a modifier after a hyphen, e.g., 10.0.0-RC1 to signal additional information, e.g.,
a test release (TR1, TR2 and so on), a release candidate (RC1, RC2 and so on), a milestone release (M1, M2 and
so on), or other relevant information.
Note: Releases with the same major and minor versions do not contain any new features. Releases with different
patch versions contain fixes for bugs discovered since the previous minor. New or significantly changed features
are released with a higher major or minor version.
Important: GraphDB 10.2.5 improves cluster stability. We recommend everyone using the cluster to upgrade to
this version.
Bug fixing
Important: GraphDB 10.2.4 includes multiple bug fixes and improvements across different components. We
recommend everyone to upgrade to this version.
Important bug fixes include several issues that improve the cluster stability.
Various third-party libraries were updated to address vulnerabilities and incorporate fixes.
Bug fixing
• GDB-8584 Two cluster nodes go out of sync at the same time and cannot get in sync
• GDB-8543 Out-of-sync node does not proxy the requests to the leader node
• GDB-8536 GraphDB does not delete temporary files created by the inferencer
• GDB-8534 Extend Prometheus metrics to include description and type of each metric
• GDB-8518 Cluster node goes out of sync due to leader shutdown when leader tries to rollback the last received entry
• GDB-8511 The work directory of a temporary inferencer instance is the current directory of the Java process, which may lead to failing validation of custom rulesets on repository creation
• GDB-8498 Performing a backup while another backup is still running will output the error message as a file over HTTP
• GDB-8496 Cluster management operations should be disabled while a backup restore operation is running
• GDB-8494 Cluster follower node cannot return to in-sync state if the leader node goes out of sync while the follower node is catching up
• GDB-8478 Reduced log level verbosity related to message “Signature already used” when a cluster node changes IP
• GDB-8462 Cluster node cannot provide snapshot because of blocked verification of entry
• GDB-8381 Nodes cannot catch up if added to the cluster right after a backup is restored
• GDB-8358 Two running cluster nodes cannot elect leader in three-node cluster
• GDB-8326 Cluster node cannot return to in-sync after another node is stopped
• GDB-8088 “FAILED_PRECONDITION: Unable to process entry during snapshot recovery” error in cluster
• GDB-7892 A cluster created during a single instance backup restore procedure results in a NullPointerException after the restore
Bug fixing
• GDB-8369 Cluster View broken toast error about lack of user permissions
Important: GraphDB 10.2.3 includes multiple bug fixes and improvements across different components.
Important bug fixes include issues with parallel import as well as multiple cluster stability issues.
Various third-party libraries were updated to address vulnerabilities and incorporate fixes.
We recommend everyone to upgrade.
Bug fixing
• GDB-8491 Requesting two backups simultaneously will result in “DEADLINE_EXCEEDED” error on the leader node and will always trigger new leader election after both backups are completed
• GDB-8475 IllegalStateException thrown after a local backup in cluster mode on the node that was delegated to perform the backup
• GDB-8454 gRPC communication to a GraphDB cluster may hang due to a deadlock
• GDB-8450 External proxy will fail health check until at least one request is proxied
• GDB-8440 Cluster node is not readable after force shutdown of all nodes
• GDB-8431 Cluster cannot accept writes while a new node is joining the cluster
• GDB-8407 Cluster cannot recover when the leader receives a snapshot request during transaction replication
• GDB-8388 Cluster node does not process previously interrupted update after restart
• GDB-8380 Mapping exception when trying to connect to GraphDB via JDBC
• GDB-8374 Cluster deadlock when a plugin fails the transaction
• GDB-8360 Node state is NO_CLUSTER after a node is restarted even though the node is part of a cluster
• GDB-8331 Cluster proxy cannot be configured with the gRPC addresses of the cluster nodes
• GDB-8305 Node may end up with infinite communication retries before building a snapshot
• GDB-8118 Using parallel import may introduce storage inconsistencies or corrupted predicate list index
• GDB-7143 GraphDB may throw an exception in commit during data import
• GDB-8312 Decrease verbosity of some cluster log messages
Bug fixing
• GDB-8433 Helm: External proxy should not be restarted when a node is added/deleted from the cluster
Important: GraphDB 10.2.2 includes multiple bug fixes and improvements across different components.
Important bug fixes include an integer overflow when attempting to flush changes to the journal file on a very
large repository, entity pool initialization issues after abnormal shutdown, and various issues with cluster recovery
scenarios.
Various third-party libraries were updated to address vulnerabilities and incorporate fixes.
We recommend everyone to upgrade.
Bug fixing
• GDB-8361 Integer overflow when attempting to flush changes to journal file on a very large repository
• GDB-8322 Cluster node remains locked when going out of sync and rejecting a streaming update
• GDB-8306 External cluster proxy returns error 500 after request to abort query
• GDB-8233 Server report missing information about a single node
• GDB-8210 Creating a snapshot during a large update in a cluster leads to a misleading error message
• GDB-8173 Unable to add new node to cluster immediately after transaction log truncate
• GDB-8146 After abnormal shutdown, transactions fail with EntityPoolConnectionException: Could not read entity X
• GDB-8090 Cluster node cannot recover after failing to send a snapshot to another node
• GDB-7754 Inconsistent fingerprint in cluster
Bug fixing
Important: GraphDB 10.2.1 includes multiple bug fixes and improvements across different components.
Notable bug fixes include unnecessary rebuilding of the entity pool during initialization of large datasets and
various cluster stability issues. Additionally, there are bug fixes addressing issues such as slow SPARQL MINUS
operation on large datasets, and a filtering issue in the Connectors, among others.
Various third-party libraries were updated to address vulnerabilities and incorporate fixes.
We recommend everyone to upgrade.
GraphDB 10.2 offers improved cluster backup with support for cloud backup, lower memory requirements, and a
more transparent memory model. Monitoring system health and diagnosing problems is now easier thanks to new
monitoring facilities via the industry-standard toolkit Prometheus, as well as directly in the GraphDB Workbench.
In addition, accessing GraphDB now offers more flexibility with support for X.509 client certificate authentication.
Important:
Improved cluster backup and support for cloud backup GraphDB 10.2 introduces a redesigned backup and
restore API that makes creating and restoring backups a breeze both in a cluster and in a single instance
environment. Backups are now streamed to the caller so there is more flexibility in where and how they can
be stored.
In addition, backups can be stored directly in Amazon S3 storage to make sure your latest data is securely
protected against inadvertent changes or hardware failures in your local infrastructure.
Lower memory requirements and a more transparent memory model The global page cache is one of the
components that takes up a significant amount of the configured GraphDB memory. While the value can
be configured, people typically stick to the default value. Until GraphDB 10.2, the default value was fixed
to 50% of the configured maximum Java heap. With the release of GraphDB 10.2, the default value varies
between 25% and 40% of the heap according to the maximum size of the heap. This results in lower memory
usage without sacrificing the performance benefit of a large page cache size.
Historically, GraphDB used two independent chunks of memory, the Java heap and the off-heap memory.
This made it difficult to determine and set the required memory for a given GraphDB instance, as off-heap
memory was not intuitive and often forgotten about. The problem was more evident in virtualized
environments with memory allocated to match only the Java maximum heap size, leading to unexpected
failures when the off-heap memory usage grows.
To address this, we redesigned some internal structures and moved memory usage from off-heap to the
Java heap. This results in a more straightforward memory configuration, where a single number (the Java
maximum heap size) controls the maximum memory available to GraphDB.
We also optimized the memory used during RDF Rank computation. As a result, it is now possible to
compute the rank of larger repositories with less memory.
Better monitoring and support for Prometheus GraphDB 10.2 adds support for monitoring via Prometheus, an
open-source systems monitoring and alerting toolkit adopted by many companies and organizations. The
exposed metrics include memory usage, cluster health, storage space, cache statistics, slow/suboptimal queries,
and others. They allow a DevOps team to assemble a dashboard of vital GraphDB statistics that can be used
to monitor system health and diagnose problems.
In addition to Prometheus, we exposed the most important metrics as part of the GraphDB Workbench so
that everyone can benefit from the additional information regardless of whether they use Prometheus.
More flexible authentication options with X.509 certificates Security is an important aspect of any system. In
addition to the existing authentication options, we added support for X.509 client certificates. Once a
certificate is issued, the client can simply connect to GraphDB without requiring any other means of authentication.
The identity of the user is extracted from the certificate and used to look up the respective user authorization
(roles and access rights) in the configured authorization database (local or LDAP).
Stay up-to-date with the latest versions of third-party libraries As a general strategy to offer a secure and
reliable product, we strive to provide up-to-date versions of third-party libraries. This includes both features
and bug fixes provided by the libraries and also addresses newly identified public vulnerabilities.
The RDF4J library in GraphDB is now upgraded to 4.2.2 and also brings SHACL improvements and general
bug fixes.
Bug fixing
• GDB-7917 Removing the leader node in cluster can result in an unrecoverable situation
• GDB-7846 External URL is always enforced for transaction URLs
• GDB-7773 Issues with writes in cluster when using the ImportRDF tool
• GDB-7727 GraphDB becomes unresponsive when FedX repositories are used
• GDB-7683 Typo in GraphDB log message
• GDB-7671 Cannot start cluster with correct config after an attempt to start it with wrong config
• GDB-7499 ARQ aggregate: var_samp() for single value returns different result from Jena
• GDB-7498 ARQ aggregate functions: “Error: null” when using a nonexistent function
• GDB-7360 Invalid remote location replication in cluster
• GDB-7080 Cluster breaks re-added node after autocomplete/RDF rank has been built/computed
• GDB-6666 Providing an OWL file whose name matches the config.ttl file when creating an Ontop repository leads to “Error loading location”
• GDB-7737 As a DB administrator, I want to be able to view in the Workbench various system resources and metrics
Bug fixing
• GDB-7739 As a user of the history plugin, I need to define rules with negation
• GDB-7663 As a developer, I need the security context in GraphDB plugins
• GDB-5952 Improve RDF Rank memory footprint
Bug fixing
• GDB-7979 Update artifact name and URLs in graphdb-runtime deployed to Maven Central
• GDB-7139 Clean up the runtime .jar and its dependencies
Bug fixing
• GDB-7578 GraphDB console should not be able to connect to the data directory of a running instance
• GDB-7464 Cannot set isolation level when using RDF4J client 3.x
TWENTY-ONE
FAQ
21.1 General
OWLIM is the former name of GraphDB, which originally came from the term “OWL In Memory” and
was fitting for what later became OWLIM-Lite. However, OWLIM-SE used a transactional, index-based
file-storage layer where “In Memory” was no longer appropriate. Nevertheless, the name stuck, and people
rarely asked where it came from.
We recommend using enterprise-grade SSDs whenever possible, as they provide significantly faster
database performance compared to hard disk drives.
Unlike relational databases, a semantic database needs to compute the inferred closure for inserted
and deleted statements. This involves making highly unpredictable joins using statements anywhere
in its indexes. Despite utilizing paging structures as best as possible, a large number of disk seeks can
be expected and SSDs perform far better than HDDs in such a task.
Yes, GraphDB provides a standard SPARQL 1.1 endpoint so it is fully interoperable with any SPARQL
1.1 client, including Jena.
21.2 Configuration
The major/minor version and patch number are part of the GraphDB distribution .zip file name. They
can also be seen at the bottom of the GraphDB Workbench home page, together with the RDF4J,
Connectors, and Plugin API’s versions.
A second option is to run the graphdb -v startup script command if you are running GraphDB as a
standalone server (without Workbench). It will also return the build number of the distribution.
Another option is to run the following DESCRIBE query in the Workbench SPARQL editor:
DESCRIBE <http://www.ontotext.com/SYSINFO> FROM <http://www.ontotext.com/SYSINFO>
It returns pseudo-triples providing information on various GraphDB states, including the number of
triples (total and explicit), storage space (used and free), commits (total and whether there are any
active ones), the repository signature, and the build number of the software.
A repository is essentially a single GraphDB database. Multiple repositories can be active at the same
time and they are isolated from each other.
To see the full configuration of an existing repository, export it from the Workbench; the resulting file, named
repositoryname-config.ttl, contains this information.
A location is either a local (to the Workbench installation) directory where your repositories will be
stored or a remote instance of GraphDB. You can have multiple attached locations but only a single
location can be active at a given time.
Go to Setup ‣ Repositories. Click Attach remote location. For a location on the same machine, provide
the absolute path name to a directory, and for a remote location, provide a URL through which the
server running the Workbench can see the remote GraphDB instance.
GraphDB is a semantic repository, packaged as a Storage and Inference Layer (Sail) for the RDF4J
framework and it makes extensive use of the features and infrastructure of RDF4J, especially the
RDF model, RDF parsers, and query engines.
For more details, see the GraphDB RDF4J documentation.
When RDF-star (formerly RDF*) embedded triples are serialized in formats (both RDF and query results)
that do not support RDF-star, they are serialized as special IRIs starting with urn:rdf4j:triple:
followed by a Base64 URL-safe encoding of the N-Triples serialization of the triple. This is controlled
by a boolean writer setting, and is ON by default. The setting is ignored by writers that support RDF-star
natively.
Such special IRIs are converted back to triples on parsing. This is controlled by a boolean parser
setting, and is ON by default. It is respected by all parsers, including those with native RDF-star
support.
See RDF-star and SPARQL-star.
21.4 Security
Every software product potentially exposes security vulnerabilities, mainly when it depends on several third-party
libraries like Spring, Apache Tomcat, JavaScript frameworks, etc. The GraphDB team does everything possible to
constantly discover and fix vulnerabilities using the OWASP dependency check, Trivy, and Snyk packages. In
addition, every GraphDB release is checked for any publicly known vulnerabilities, and all suspected issues with
score High are investigated.
No, it is not affected. All GraphDB editions and plugins between 6.x and 9.x use Logback, but not Apache Log4j
2; thus, our users are safe in terms of CVE-2021-44228 (aka Log4Shell).
21.5 Troubleshooting
21.5.1 Why can’t I use a custom rule file (.pie) - an exception occurred?
To use custom rule files, GraphDB must be running in a JVM that has access to the Java compiler.
The easiest way to do this is to use the Java runtime from a Java Development Kit (JDK).
If you receive an error message saying that MacOS cannot open GraphDB since it cannot be checked for malicious
software, this is because the security settings of your Mac are configured to only allow apps from the App Store.
GraphDB is developer-signed software, so in order to install it, you need to modify these settings to allow apps
from both the App Store and identified developers.
You can find detailed assistance on how to configure them in the Apple support pages.
TWENTY-TWO
SUPPORT
• email: graphdb-support@ontotext.com
• Twitter: @OntotextGraphDB
• GraphDB tag on Stack Overflow at http://stackoverflow.com/questions/tagged/graphdb