Jena TDB2 Performance / When to scale horizontally #2099

uahic · 2023-11-20T09:53:53Z

Hi there (again),

I'm trying to answer the following question: "how many users (or a mix of simple and complex queries) can be handled by Jena/TDB2 simultaneously before performance starts to become an issue".

Now, TDB2 seems to be limited to a single virtual/physical machine and the question above is rather vague. Let's assume RAM isn't limited and the graph is nowhere near the limits of what TDB2 can store (maybe up to 10 million triples, to be extremely conservative for my application).

Now the question for me is: can I employ Jena for simultaneous queries of (O-Notation):
O(100) users
O(1000) users
O(10000) users?

I did search for benchmarks, but so far they have not really answered my question (maybe I just found the wrong ones?). The people who work on our project are a little concerned that we can't scale horizontally later on. I would be grateful for hints/suggestions/answers :)

It seems there is also a project called the 'SANSA' stack (https://sansa-stack.net/), and it kind of builds on top of Jena (at least Jena modules appear quite often in their source code), but so far I have not had a reply from the devs about how far along the package is and what functionality it covers.

rvesse · 2023-11-21T09:52:19Z

Collaborator

> Hi there (again),
>
> I'm trying to answer the following question: "how many users (or a mix of simple and complex queries) can be handled by Jena/TDB2 simultaneously before performance starts to become an issue".

It depends on a whole range of factors:

- The complexity of your data model
- The complexity of your queries
  - Are they arbitrary user-supplied queries?
  - Hand-crafted by data scientists who are experts in your data model?
  - Auto-generated by your system somewhere (hopefully with some input from the data scientists)?
  - Something else entirely?
- The hardware environment your instance is running in (SSDs vs HDDs makes a huge difference, for example)
- Whether the database is serving a pure read workload or there are writes, i.e. updates, happening as well

> Now, TDB2 seems to be limited to a single virtual/physical machine and the question above is rather vague. Let's assume RAM isn't limited and the graph is nowhere near the limits of what TDB2 can store (maybe up to 10 million triples, to be extremely conservative for my application).

One gotcha re: TDB2 and RAM usage. TDB2 relies on memory-mapping the database files into the OS file cache to maximise performance. A common mistake is to set the JVM heap size really high, which leaves the JVM competing with the OS for memory and tanks performance as the memory-mapped files get swapped in and out of memory.

Rule of thumb is to set a low JVM heap (2-4GB typically) and leave the rest of the memory free for the OS to cache the memory mapped database files.

If running in a containerised environment, make sure the memory allocated to the container accounts for both of these sources of memory usage, and that you explicitly set the JVM heap size so it doesn't just grab its default (which is usually 1/4 of the total memory reported by the OS).
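
As a rough illustration of that arithmetic, here is a minimal Python sketch. The 32GB container, the 4GB heap and the 0.5GB allowance for non-heap JVM memory are made-up numbers, and `memory_budget` is a hypothetical helper, not part of Jena or Fuseki. `-Xmx` is the standard JVM maximum-heap flag; how you pass it to your instance depends on how you launch it.

```python
def memory_budget(container_gb: float, heap_gb: float, jvm_overhead_gb: float = 0.5) -> dict:
    """Split a container's memory between the JVM and the OS file cache.

    jvm_overhead_gb is an assumed allowance for non-heap JVM memory
    (metaspace, thread stacks); it is a guess, not a measured figure.
    """
    if heap_gb + jvm_overhead_gb >= container_gb:
        raise ValueError("heap plus JVM overhead leaves nothing for the OS file cache")
    return {
        # Explicit heap cap, expressed in megabytes for the -Xmx flag.
        "jvm_args": f"-Xmx{int(heap_gb * 1024)}m",
        # What remains for the OS to cache the memory-mapped TDB2 files.
        "file_cache_gb": container_gb - heap_gb - jvm_overhead_gb,
        # What the JVM would typically take if no heap size were set (1/4 of total).
        "default_heap_gb": container_gb / 4,
    }


budget = memory_budget(container_gb=32, heap_gb=4)
print(budget)
```

With these numbers the explicit 4GB heap leaves 27.5GB for the file cache, whereas the usual default would claim 8GB of heap.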

> Now the question for me is: can I employ Jena for simultaneous queries of (O-Notation): O(100) users, O(1000) users, O(10000) users?

Maybe. The only real way to know is to benchmark on representative data and queries for your use case. Some tools for this exist, e.g. my own sparql-query-bm, and lots of "standard" benchmarks exist (LUBM, SP2B etc.), but they often aren't representative of real-world use cases.
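
If you just want a first feel before reaching for a proper tool, a concurrency test can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not a substitute for a real benchmark harness: the endpoint URL in the usage comment is a hypothetical placeholder, `sparql_select`, `timed` and `benchmark` are names made up for this example, and the query is sent as a plain HTTP GET with a `query` parameter as described by the SPARQL 1.1 Protocol.

```python
import json
import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def sparql_select(endpoint: str, query: str, timeout: float = 30.0) -> dict:
    """Run one SELECT query via HTTP GET and return the parsed JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def timed(run_query, query) -> float:
    """Return the wall-clock seconds one call to run_query(query) takes."""
    start = time.perf_counter()
    run_query(query)
    return time.perf_counter() - start


def benchmark(run_query, queries, concurrency: int) -> dict:
    """Run all queries on `concurrency` worker threads and summarise latencies."""
    if not queries:
        raise ValueError("need at least one query")
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda q: timed(run_query, q), queries))
    wall = time.perf_counter() - started
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "count": len(latencies),
        "median_s": latencies[len(latencies) // 2],
        "p95_s": latencies[p95_index],
        "throughput_qps": len(latencies) / wall,
    }


# Usage against a real server (hypothetical URL and query list):
# ENDPOINT = "http://localhost:3030/ds/sparql"
# stats = benchmark(lambda q: sparql_select(ENDPOINT, q), my_queries, concurrency=100)
```

Passing the query runner in as a function keeps the timing logic separate from the HTTP call, so the same harness can be pointed at a stub, a single instance, or a load balancer. Repeat the run at 100, 1000 and so on, with your own query mix, and watch where the median and p95 start to climb.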

> I did search for benchmarks, but so far they have not really answered my question (maybe I just found the wrong ones?). The people who work on our project are a little concerned that we can't scale horizontally later on. I would be grateful for hints/suggestions/answers :)
>
> It seems there is also a project called the 'SANSA' stack (https://sansa-stack.net/), and it kind of builds on top of Jena (at least Jena modules appear quite often in their source code), but so far I have not had a reply from the devs about how far along the package is and what functionality it covers.

There are several options that have been used/put together by folks associated with the Jena project in the past:

- Just use a load balancer. If your setup generates a dataset that is read-only from that point onwards, you can simply take a copy of the database on disk (when TDB2 is not running!), replicate it, and then run separate Fuseki+TDB2 instances against those copies of the same data. Layer a load balancer over the top and you have horizontal scalability for read-only workloads.
- rdf-delta - A way to replicate changes between different Fuseki+TDB2 instances. This extends the above by providing a way for updates to any of the instances to replicate to the other instances. Again, you'd typically layer a load balancer over these instances for horizontal scalability.
- jena-fuseki-kafka - A Fuseki module that allows a Fuseki instance to read new data from an Apache Kafka topic. You assign each instance a unique consumer group ID so each instance can independently read the Kafka topic to recreate your dataset. You can then scale up by adding new instances, which will read the Kafka topic to bring themselves up to date. Broadly this is the same underlying approach as rdf-delta, but utilising the robustness of Kafka as the coordination layer.
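
The idea shared by the last two options, every instance replaying the same ordered log of changes at its own pace, can be shown with a toy model. This is only an illustration in plain Python: it does not use the Kafka or rdf-delta APIs, and the `Replica` class, the `"add"`/`"delete"` operation names and the example triples are all made up.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One read replica that tracks its own position in the shared log."""
    name: str
    offset: int = 0
    triples: set = field(default_factory=set)

    def catch_up(self, log: list) -> None:
        """Apply every log entry this replica has not seen yet, in order."""
        for op, triple in log[self.offset:]:
            if op == "add":
                self.triples.add(triple)
            elif op == "delete":
                self.triples.discard(triple)
        self.offset = len(log)


# The shared, append-only change log (standing in for the topic or patch log).
log = [
    ("add", ("ex:alice", "ex:knows", "ex:bob")),
    ("add", ("ex:bob", "ex:knows", "ex:carol")),
    ("delete", ("ex:alice", "ex:knows", "ex:bob")),
]

first = Replica("fuseki-1")
first.catch_up(log)

# A later update arrives on the log.
log.append(("add", ("ex:carol", "ex:knows", "ex:dave")))

# A new instance is added; it starts from the beginning of the log.
second = Replica("fuseki-2")
second.catch_up(log)
first.catch_up(log)

print(first.triples == second.triples)  # prints True
```

Because each replica keeps its own offset, a freshly added instance rebuilds the full dataset on its own while existing instances only apply what is new, and both end up in the same state.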

2 replies

uahic Nov 21, 2023
Author

@rvesse wohow! I was totally not aware of rdf-delta and fuseki-kafka, great to see that people put effort into pushing Jena/Fuseki further towards enterprise land.

I will try to set up rdf-delta in a docker-compose setup to test it. For jena-fuseki-kafka, I don't understand yet who is actually sending the updates via Kafka. In my understanding, each server would also have to be a potential publisher on that Kafka topic.

rvesse Nov 22, 2023
Collaborator

So for jena-fuseki-kafka I use this in my $dayjob. Fuseki is one of the services sitting atop our data integration platform through which data flows from various customer data sources and gets converted into RDF by various ETL steps before arriving on a Kafka topic. So all the updates originate outside of Fuseki and are simply consumed by it when they arrive.

@afs has also been experimenting with an alternative update service for Fuseki that, instead of writing directly to the database, calculates the effective changes from an update and publishes an RDF Patch to the Kafka topic. All the Fuseki instances then consume that newly generated patch and are automatically synchronised. However, that work isn't public yet (and again, I don't know if and when it will be).
