DRUID
Master server:
➢ The main considerations for the Master server are available CPUs and RAM for
the Coordinator and Overlord heaps.
➢ Sum up the allocated heap sizes for your Coordinator and Overlord from the
single-server deployment, and choose Master server hardware with enough
RAM for the combined heaps, with some extra RAM for other processes on the
machine.
➢ For CPU cores, you can choose hardware with approximately 1/4th of the cores
of the single-server deployment.
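As an illustration, here is a minimal sizing sketch in Python; the single-server heap
sizes and core count are hypothetical placeholders, so substitute your own figures:

    # Hypothetical values taken from a single-server deployment.
    coordinator_heap_gib = 15
    overlord_heap_gib = 15
    single_server_cores = 16

    # RAM for both heaps, plus headroom for other processes on the machine
    # (the 4 GiB headroom figure is an assumption, not a Druid recommendation).
    master_ram_gib = coordinator_heap_gib + overlord_heap_gib + 4

    # Roughly 1/4 of the single-server core count.
    master_cores = max(1, single_server_cores // 4)

    print(f"Master server: >= {master_ram_gib} GiB RAM, ~{master_cores} cores")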
Data server:
➢ When choosing Data server hardware for the cluster, the main considerations
are available CPUs and RAM, and using SSD storage if feasible.
➢ In a clustered deployment, having multiple Data servers is a good idea for
fault-tolerance purposes.
➢ When choosing the Data server hardware, you can choose a split factor N,
divide the original CPU/RAM of the single-server deployment by N, and deploy
N Data servers of reduced size in the new cluster, as sketched after this list.
➢ Instructions for adjusting the Historical/MiddleManager configs for the split are
described in a later section in this guide.
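For example, a short sketch of the split-factor arithmetic, assuming a hypothetical
single-server Data profile of 64 cores / 256 GiB RAM and a split factor N = 2:

    # Hypothetical single-server Data profile.
    single_server_cores = 64
    single_server_ram_gib = 256
    n = 2  # split factor (an assumption; pick what suits your fault-tolerance needs)

    # Each of the N Data servers gets 1/N of the original CPU and RAM.
    per_server_cores = single_server_cores // n
    per_server_ram_gib = single_server_ram_gib // n

    print(f"Deploy {n} Data servers, each with "
          f"{per_server_cores} cores and {per_server_ram_gib} GiB RAM")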
Query server:
➢ The main considerations for the Query server are available CPUs and RAM for
the Broker heap + direct memory, and Router heap.
➢ Sum up the allocated memory sizes for your Broker and Router from the
single-server deployment, and choose Query server hardware with enough
RAM to cover the Broker/Router, with some extra RAM for other processes on
the machine.
➢ For CPU cores, you can choose hardware with approximately 1/4th of the cores
of the single-server deployment.
➢ The basic cluster tuning guide has information on how to calculate
Broker/Router memory usage.
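As a sketch of that calculation, using the Broker direct-memory formula from the
basic cluster tuning guide, (druid.processing.numThreads +
druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes, with
hypothetical heap and buffer values:

    # Hypothetical values; take the real ones from your single-server configs.
    broker_heap_gib = 12
    router_heap_gib = 1
    num_threads = 8        # druid.processing.numThreads
    num_merge_buffers = 2  # druid.processing.numMergeBuffers
    buffer_size_gib = 0.5  # druid.processing.buffer.sizeBytes, in GiB

    # Broker direct memory per the basic cluster tuning guide.
    broker_direct_gib = (num_threads + num_merge_buffers + 1) * buffer_size_gib

    # Total RAM, with ~4 GiB headroom for other processes (an assumption).
    query_ram_gib = broker_heap_gib + broker_direct_gib + router_heap_gib + 4

    print(f"Query server: >= {query_ram_gib:.1f} GiB RAM")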
Formatting data:
The following data formats are natively supported in Druid:
JSON
CSV
TSV (Delimited)
➢ Note that the CSV and TSV data do not contain column headers. This becomes
important when you specify the columns for ingestion, as shown in the
inputFormat sketch below.
➢ Besides text formats, Druid also supports binary formats such as ORC and
Parquet.
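Because headerless CSV/TSV data carries no column names, the ingestion spec must
list them explicitly. A minimal inputFormat sketch, written as a Python dict with
hypothetical column names:

    import json

    # Columns for headerless CSV data; Druid cannot infer them from the file.
    csv_input_format = {
        "type": "csv",
        "findColumnsFromHeader": False,  # the data has no header row
        "columns": ["timestamp", "srcIP", "dstIP", "packets", "bytes"],
    }

    print(json.dumps(csv_input_format, indent=2))

A TSV file is handled the same way, with "type": "tsv" and an explicit "delimiter".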
Druid schema model:
➢ Druid stores data in data sources, which are similar to tables in a traditional
relational database management system (RDBMS). Druid's data model shares
similarities with both relational and time-series data models.
Primary timestamp:
➢ Druid schemas must always include a primary timestamp. Druid uses the
primary timestamp to partition and sort your data. Druid uses the primary
timestamp to rapidly identify and retrieve data within the time range of
queries. Druid also uses the primary timestamp column for time-based data
management operations such as dropping time chunks, overwriting time
chunks, and time-based retention rules.
➢ Druid parses the primary timestamp based on the timestampSpec
configuration at ingestion time. Regardless of the source field for the primary
timestamp, Druid always stores the timestamp in the __time column in your
Druid datasource.
➢ You can control other important operations that are based on the primary
timestamp in the granularitySpec. If you have more than one timestamp
column, you can store the others as secondary timestamps.
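For illustration, a minimal timestampSpec and granularitySpec, written as Python
dicts; the source column name and the granularity choices are assumptions:

    # Parse the primary timestamp from the "timestamp" field (ISO 8601 here);
    # whatever the source column, Druid stores the result in __time.
    timestamp_spec = {"column": "timestamp", "format": "iso"}

    # Time-based behavior such as segment granularity and rollup is set here.
    granularity_spec = {
        "segmentGranularity": "day",   # one time chunk per day
        "queryGranularity": "minute",  # timestamps truncated to the minute
        "rollup": True,
    }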
Dimensions:
➢ Dimensions are columns that Druid stores "as-is". You can use dimensions for
any purpose. For example, you can group, filter, or apply aggregators to
dimensions at query time when necessary.
➢ If you disable rollup, then Druid treats the set of dimensions like a set of
columns to ingest. The dimensions behave exactly as you would expect from
any database that does not support a rollup feature.
➢ At ingestion time, you configure dimensions in the dimensionsSpec.
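A minimal dimensionsSpec sketch as a Python dict, using hypothetical column names;
plain strings default to string-typed dimensions, while other types are declared
explicitly:

    dimensions_spec = {
        "dimensions": [
            "srcIP",   # string dimension (the default type)
            "dstIP",
            {"type": "long", "name": "srcPort"},  # explicitly typed dimension
        ]
    }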
Metrics:
➢ Metrics are columns that Druid stores in an aggregated form. Metrics are most
useful when you enable rollup: for each metric you specify, Druid applies the
chosen aggregation function to incoming rows at ingestion time, which shrinks
the stored data and speeds up aggregation queries.
➢ Rollup is a form of aggregation that collapses dimensions while aggregating
the values in the metrics; that is, it collapses rows but retains their
summary information.
➢ Rollup is a form of aggregation that combines multiple rows with the same
timestamp value and dimension values. For example, the rollup tutorial
demonstrates using rollup to collapse NetFlow data to a single row per
(minute, srcIP, dstIP) tuple, while retaining aggregate information about total
packet and byte counts.
➢ Druid can compute some aggregators, especially approximate ones, more
quickly at query time if they are partially computed at ingestion time, even
for data that has not been rolled up.
➢ At ingestion time, you configure metrics in the metricsSpec.
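For example, a metricsSpec matching the NetFlow rollup scenario above, in which
packet and byte counts are summed for each (minute, srcIP, dstIP) row; the source
field names are assumptions:

    metrics_spec = [
        # How many raw rows were rolled up into each stored row.
        {"type": "count", "name": "count"},
        {"type": "longSum", "name": "packets", "fieldName": "packets"},
        {"type": "longSum", "name": "bytes", "fieldName": "bytes"},
    ]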
Druid architecture:
Druid services:
Druid has several service types: Coordinator, Overlord, Broker, Router,
Historical, and MiddleManager.
Druid servers:
Druid services can be deployed any way you like, but for ease of deployment, we
suggest organizing them into three server types: Master, Query, and Data.
➢ Master: Runs Coordinator and Overlord processes, and manages data
availability and ingestion.
➢ Query: Runs Broker and optional Router processes, and handles queries from
external clients.
➢ Data: Runs Historical and MiddleManager processes, executes ingestion
workloads, and stores all queryable data.
Segments:
Apache Druid stores its data and indexes in segment files partitioned by time. Druid
creates a segment for each segment interval that contains data. If an interval is
empty (that is, it contains no rows), no segment exists for that time interval. Druid may
create multiple segments for the same interval if you ingest data for that period via
different ingestion jobs.
Within a segment, Druid uses three data structures to store string-typed columns:
➢ Dictionary: Maps values (which are always treated as strings) to integer IDs,
allowing compact representation of the list and bitmap values.
➢ List: A list of the column's values, encoded using the dictionary. The list is
required for GroupBy and TopN queries, but queries that only aggregate
metrics based on filters can run without accessing it.
➢ Bitmap: One bitmap for each distinct value in the column, to indicate which
rows contain that value. Bitmaps allow for quick filtering operations
because they are convenient for quickly applying AND and OR operators.
Also known as inverted indexes.
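To make the three structures concrete, here is a toy Python sketch that builds a
dictionary, an encoded value list, and per-value bitmaps for a small string column.
It mirrors the idea, not Druid's actual on-disk layout:

    # A toy string column, one value per row.
    column = ["justin", "ke$ha", "justin", "justin"]

    # Dictionary: value -> integer ID.
    dictionary = {}
    for value in column:
        dictionary.setdefault(value, len(dictionary))

    # List: the column's values encoded as dictionary IDs.
    encoded = [dictionary[v] for v in column]  # [0, 1, 0, 0]

    # Bitmaps: one per distinct value, marking the rows that contain it.
    bitmaps = {v: [1 if x == v else 0 for x in column] for v in dictionary}
    # {"justin": [1, 0, 1, 1], "ke$ha": [0, 1, 0, 0]}

    # Filtering "justin OR ke$ha" is a cheap bitwise OR over the bitmaps.
    matches = [a | b for a, b in zip(bitmaps["justin"], bitmaps["ke$ha"])]
    print(matches)  # [1, 1, 1, 1]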
Deep storage:
➢ Deep storage is shared file storage accessible by every Druid server, such as
local disk, HDFS, or Amazon S3, where Druid stores all ingested segment files.
Metadata storage:
➢ Metadata stores: Druid supports Derby, MySQL, and PostgreSQL for storing
metadata such as segment records and task information. Derby is suitable only
for single-machine experimentation.
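As an illustrative sketch of the corresponding common runtime properties for
PostgreSQL, with placeholder host, database, and credentials (all assumptions); the
matching metadata-storage extension must also be loaded:

    druid.extensions.loadList=["postgresql-metadata-storage"]

    druid.metadata.storage.type=postgresql
    druid.metadata.storage.connector.connectURI=jdbc:postgresql://metadata.example.com:5432/druid
    druid.metadata.storage.connector.user=druid
    druid.metadata.storage.connector.password=changeme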