Building a database that can beat industry benchmarks is hard work, and we had to use every trick in the book to stay as close to the hardware as possible. In doing so, we initially decided QuestDB would scale only vertically, on a single instance.
A few years later, data replication, both for horizontally scaling reads and for high availability, became one of the most requested features, especially in enterprise and cloud environments. So we rolled up our sleeves and made it happen.
Today, QuestDB supports an unbounded number of geographically distributed read-replicas without slowing down reads on the primary node, which can ingest data at over 4 million rows per second.
In this talk, I will tell you about the technical decisions we made and their trade-offs. You'll learn how we had to revamp the whole ingestion layer, and how adding multi-threaded Write-Ahead Logs for replication actually made the primary faster than before. I'll also discuss how we leverage object storage as a central part of the process. And of course, I'll show you a live demo of high-performance multi-region replication in action.
How We Added Replication to QuestDB - JonTheBeach
1. How We Added Replication to QuestDB, a time-series database
Javier Ramírez
@supercoco9
Database Advocate
2. Agenda. If you dislike technical details, this is the wrong presentation
● Intro to Fast & Streaming Data
● Overview of QuestDB Storage
● About Replication
● Common solutions
● The QuestDB implementation
● Parallel Write-Ahead Log
● Physical layout
● Object Storage
● Dealing with upgrades
● What’s next
4. We have 400k smart meters, each sending a record every 5 minutes.
~120 million rows per day
Real request from potential user
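Quick check of the arithmetic: a record every 5 minutes is 24 × 60 / 5 = 288 records per meter per day, so 400,000 × 288 = 115,200,000, roughly 120 million rows per day.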
5. How to be a (data) billionaire
● a factory floor with 500 machines, or
● a fleet with 500 vehicles, or
● 50 trains, with 10 cars each, or
● 500 users with a mobile phone
Each sending data every second:
43,200,000 rows a day
302,400,000 rows a week
1,314,144,000 rows a month
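Those figures check out. A minimal Rust sketch reproducing them (the constants simply restate the slide's assumptions):

fn main() {
    let sources: u64 = 500;  // machines, vehicles, train cars, or phones
    let per_second: u64 = 1; // one row per source per second

    let per_day = sources * per_second * 60 * 60 * 24;
    let per_week = per_day * 7;
    let per_month = per_day * 3042 / 100; // 30.42-day average month

    println!("{per_day} rows a day");     // 43200000
    println!("{per_week} rows a week");   // 302400000
    println!("{per_month} rows a month"); // 1314144000
}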
6. Time-series database basics
● Optimised for fast ingestion
● Data lifecycle policies
● Analytics over chunks of time
● Time-based aggregations
● Often power real-time dashboards
8. QuestDB would like to be known for:
● Performance
○ Also with smaller machines
● Developer Experience
○ Multiple protocols and client libraries. Sensible SQL extensions
● Open Source
○ (Apache 2.0 license)*
* Enterprise and Cloud versions add non-OSS features like Single Sign-On, RBAC, managed snapshots, or multi-primary replication
11. Production is a scary place
● Application errors
● Connectivity issues
● Network timeout/server busy
● Component temporarily offline/restarting/updating
● Hardware failure
● Full disk
● Just how protocols work
12. The path to implementing replication
● Reflecting on our current (at the time) storage layer
● Deciding the flavour of replication we want
● Decoupling ingestion from storage
● Making it robust (upgrades, fault-tolerance…)
13. QuestDB at a glance
[Architecture diagram: on the data-ingress side, the Network API accepts ILP (over TCP socket or HTTP), REST, and PG Wire; the Compute API (Bulk Loader, SQL Engine) hands data to the Storage API, where a Writer API feeds the Storage Engine. Data egress is served through Reader APIs.]
14. QuestDB ingestion and storage layer
● Data always stored by incremental timestamp.
● No indexes needed*. Data is immediately available after writing.
● Data partitioned by time units and stored in tabular columnar format.
● Predictable ingestion rate, even under demanding workloads (millions/second).
● Row updates and upserts supported.
https://questdb.io/docs/concept/storage-model/
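To make the model concrete, this is roughly the on-disk shape it implies (a simplified sketch; the table, partition, and column names are invented):

trades/                  <- one directory per table
  2024-06-01/            <- one directory per time partition (here, daily)
    timestamp.d          <- one file per column, appended in timestamp order
    price.d
    symbol.d
  2024-06-02/
    ...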
16. Storage Engine - var-size
[Diagram: a var-size column is stored as two files: an index file (.i) holding offsets O1, O2, O3, O4 and a data file (.d) holding the entries.]
Var-size data
● Index contains 64-bit offsets into data file
● Data entries are length-prefixed
● Index contains N+1 elements
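As a hedged illustration of how the two files work together, here is a sketch of resolving row i of a var-size column (not QuestDB's actual code; the 32-bit length prefix and the in-memory byte slices standing in for the memory-mapped files are assumptions):

fn read_var_entry(index: &[u8], data: &[u8], i: usize) -> &[u8] {
    // .i file: 64-bit offsets into the .d file. Holding N+1 offsets means
    // any entry's extent, and the column's total size, can be computed
    // from two consecutive offsets without scanning the data file.
    let off_bytes: [u8; 8] = index[i * 8..(i + 1) * 8].try_into().unwrap();
    let off = u64::from_le_bytes(off_bytes) as usize;
    // .d file: each entry starts with its length prefix (assumed 32-bit).
    let len_bytes: [u8; 4] = data[off..off + 4].try_into().unwrap();
    let len = u32::from_le_bytes(len_bytes) as usize;
    &data[off + 4..off + 4 + len]
}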
18. Storage Engine - snapshots
● Partitions are versioned
● Columns are versioned within partition
● Merge operation will create a new partition with a new transaction index
● Queries will switch over to the new snapshot when they are ready
[Diagram: partition version 2022-04-11T18.9901 holding column file ticker.d.10031, superseded by partition version 2022-04-11T18.9945 holding ticker.d.10049.]
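The switch-over can be pictured as a single atomic publish (a toy sketch with invented names, not QuestDB's implementation):

use std::sync::atomic::{AtomicU64, Ordering};

struct TableSnapshot {
    // Latest published transaction index, e.g. the 10049 in ticker.d.10049.
    txn: AtomicU64,
}

impl TableSnapshot {
    // Readers pin whatever version is current when they start; older
    // files such as ticker.d.10031 stay on disk until no reader needs them.
    fn begin_read(&self) -> u64 {
        self.txn.load(Ordering::Acquire)
    }

    // A merge writes the new partition/column versions to disk first,
    // then publishes them, so in-flight queries never see a torn state.
    fn publish(&self, new_txn: u64) {
        self.txn.store(new_txn, Ordering::Release);
    }
}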
21. Architectural Considerations
● Synchronous vs Asynchronous replication
● Multi primary vs Single primary with Read-Only Replicas
● External coordinator vs Peer-to-Peer
● Replicate everything vs Replicate Shards
● Write Ahead Log vs non-sorted (for example, hinted handoffs)
31. Types of WAL records
● Data Record
● SQL Record (DDL Schema Changes)
● Symbol Entry and Symbol Map Records
● Bind Variable and Named Bind Variable Records
● Commit Record
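Expressed as a Rust-style sketch, the record kinds might look like this (only the variant names come from the slide; the payload shapes are illustrative guesses, not QuestDB's actual WAL format):

enum WalRecord {
    // Appended row data for a table.
    Data { row_count: u64, min_timestamp: i64, max_timestamp: i64 },
    // DDL schema change, carried as SQL text.
    Sql(String),
    // A single new symbol (interned string) value for a column.
    SymbolEntry { column_index: u32, key: i32, value: String },
    // A batch of symbol mappings for a column.
    SymbolMap { column_index: u32, entries: Vec<(i32, String)> },
    // Positional and named bind variables for an SQL record.
    BindVariable { index: u32, value: Vec<u8> },
    NamedBindVariable { name: String, value: Vec<u8> },
    // Marks the transaction as committed.
    Commit,
}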
37. Dealing with upgrades: index.msgpack
pub struct TableMetadata {
    /// The number of transactions in each sequencer part.
    pub sequencer_part_txn_count: u32,
    /// The first transaction with data in the object store.
    /// Note: `TxnId::zero()` represents a newly created table.
    pub first_txn: TxnId,
    /// Timestamp of the `first_txn`.
    /// If `first_txn > 0` (i.e. a non-new table), then this represents
    /// the lowest bound for a minimum required full-database snapshot.
    pub first_at: EpochMicros,
    /// The last transaction (inclusive) with data in the object store.
    pub last_txn: TxnId,
    /// The timestamp of when the table was created.
    pub created_at: EpochMicros,
    /// The timestamp when the table was dropped.
    pub deleted_at: Option<EpochMicros>,
}

pub struct Index {
    /// Format version.
    pub version: u64,
    pub sync_id: IndexSyncId,
    /// Map of tables to their creation and deletion times.
    pub tables: HashMap<TableDirName, TableMetadata>,
}
38. Multi-primary ingestion (Enterprise only right now)
Same concept as the local sequencer and transaction IDs, but with a sequencer backed by FoundationDB, which stores metadata and information about cluster members.
Client libraries transparently get the addresses of available primaries and replicas to send data and queries.
Optimistic locking for conflict resolution.
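Reduced to its essence, optimistic locking here is a compare-and-swap on the last transaction id (a toy in-process sketch; an AtomicU64 stands in for the FoundationDB-backed sequencer, and all names are invented):

use std::sync::atomic::{AtomicU64, Ordering};

struct Sequencer {
    last_txn: AtomicU64,
}

impl Sequencer {
    // A primary may only publish txn N+1 if the sequencer is still at N.
    // On Err, another primary won the race; the caller re-reads the new
    // state (returned as the Err value) and retries.
    fn try_claim(&self, expected_last: u64) -> Result<u64, u64> {
        let next = expected_last + 1;
        self.last_txn
            .compare_exchange(expected_last, next, Ordering::AcqRel, Ordering::Acquire)
            .map(|_| next)
    }
}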
39. What’s next: Parquet (also coming to Open Source)
● Separation of storage and computation
● Allows using datasets larger than a single drive
● Allows for data lakehouse architecture
● “First-mile” time-series queries are served from local storage (also on Parquet), and older data is served from the shared file system.
● Tight integration with our query engine to leverage compression as much as possible
● Arrow Database Connectivity compatibility to read data out quickly
40. What we discussed. If you dislike technical details, it is probably too late now
● Intro to Fast & Streaming Data
● Overview of QuestDB Storage
● About Replication
● Common solutions
● The QuestDB implementation
● Parallel Write-Ahead Log
● Physical layout
● Object Storage
● Dealing with upgrades
● What’s next
41. QuestDB OSS
Open Source. Self-managed. Suitable for production workloads.
https://github.com/questdb/questdb
QuestDB Enterprise
Licensed. Self-managed. Enterprise features like RBAC, compression, replication, TLS on all protocols, cold storage, K8s operator…
https://questdb.io/enterprise/
QuestDB Cloud
Fully managed, pay-per-usage environment, with enterprise-grade features.
https://questdb.io/cloud/
42.
● github.com/questdb/questdb
● https://questdb.io
● https://demo.questdb.io
● https://slack.questdb.io/
● https://github.com/questdb/time-series-streaming-analytics-template
We 💕 contributions
and GitHub ⭐ stars
Javier Ramírez
@supercoco9
Database Advocate