
DRUID

➢ Apache Druid is a real-time analytics database designed for fast slice-and-dice
analytics ("OLAP" queries) on large data sets. Most often, Druid powers use
cases where real-time ingestion, fast query performance, and high uptime are
important.
➢ Druid is commonly used as the database backend for GUIs of analytical
applications, or for highly-concurrent APIs that need fast aggregations. Druid
works best with event-oriented data.
➢ Common application areas for Druid include:
➢ Clickstream analytics including web and mobile analytics
➢ Network telemetry analytics including network performance monitoring
➢ Server metrics storage
➢ Supply chain analytics including manufacturing metrics
➢ Application performance metrics
➢ Digital marketing/advertising analytics
➢ Business intelligence/OLAP

Key features of Druid:

➢ Druid's core architecture combines ideas from data warehouses, timeseries
databases, and logsearch systems. Some of Druid's key features are:
➢ Columnar storage format: Druid uses column-oriented storage. This means
it only loads the exact columns needed for a particular query.
➢ Scalable distributed system: Typical Druid deployments span clusters
ranging from tens to hundreds of servers.
➢ Massively parallel processing: Druid can process each query in parallel
across the entire cluster.
➢ Realtime or batch ingestion: Druid can ingest data either in real-time or in
batches. Ingested data is immediately available for querying.
➢ Self-healing, self-balancing, easy to operate: As an operator, you add
servers to scale out or remove servers to scale down. The Druid cluster
re-balances itself automatically in the background without any downtime. If
a Druid server fails, the system automatically routes data around the
damage until the server can be replaced. Druid is designed to run
continuously without planned downtime for any reason. This is true for
configuration changes and software updates.
➢ Cloud-native, fault-tolerant architecture that won't lose data: After
ingestion, Druid safely stores a copy of your data in deep storage. You can
recover your data from deep storage even in the unlikely case that all Druid
servers fail.
➢ Indexes for quick filtering: Druid uses Roaring or CONCISE compressed
bitmap indexes to enable fast filtering and searching across multiple
columns.
➢ Time-based partitioning: Druid first partitions data by time (by default). You
can optionally implement additional partitioning based on other fields.
Time-based queries only access the partitions that match the time range of
the query which leads to significant performance improvements.
➢ Approximate algorithms: Druid includes algorithms for approximate
count-distinct, approximate ranking, and computation of approximate
histograms and quantiles.
➢ Automatic summarization at ingest time: Druid optionally supports data
summarization at ingestion time. This summarization partially
pre-aggregates your data, potentially leading to significant cost savings
and performance boosts.
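
A minimal sketch of how this ingestion-time summarization is switched on,
assuming the granularitySpec used in Druid ingestion specs (the values here
are illustrative):

    {
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "minute",
        "rollup": true
      }
    }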

When to use Druid:

Druid is likely a good choice if your use case matches a few of the following:
➢ Insert rates are very high, but updates are rare.
➢ Most of your queries are aggregation and reporting queries.
➢ You are targeting query latencies of 100ms to a few seconds.
➢ Your data has a time component. Druid includes optimizations and design
choices specifically related to time.
➢ You may have multiple tables, but each query hits just one big distributed
table. Queries may potentially hit more than one smaller "lookup" table.
➢ You have high cardinality data columns, e.g. URLs and user IDs, and need
fast counting and ranking over them.
➢ You want to load data from Kafka, HDFS, flat files, or object storage like
Amazon S3.
Situations where you would likely not want to use Druid include:
➢ You need low-latency updates of existing records using a primary key. Druid
supports streaming inserts, but not streaming updates. You can perform
updates using background batch jobs.
➢ You are building an offline reporting system where query latency is not very
important.
➢ You want to do "big" joins, meaning joining one big fact table to another big
fact table, and you are okay with these queries taking a long time to
complete.

There are two types of Druid deployments:

1. Single Server Deployment
2. Cluster Server Deployment

Cluster Server Deployment:

A clustered deployment uses three server types:

Master server:
➢ The main considerations for the Master server are available CPUs and RAM for
the Coordinator and Overlord heaps.
➢ Sum up the allocated heap sizes for your Coordinator and Overlord from the
single-server deployment, and choose Master server hardware with enough
RAM for the combined heaps, with some extra RAM for other processes on the
machine.
➢ For CPU cores, you can choose hardware with approximately 1/4th of the cores
of the single-server deployment.

Data server:
➢ When choosing Data server hardware for the cluster, the main considerations
are available CPUs and RAM, and using SSD storage if feasible.
➢ In a clustered deployment, having multiple Data servers is a good idea for
fault-tolerance purposes.
➢ When choosing the Data server hardware, you can choose a split factor N,
divide the original CPU/RAM of the single-server deployment by N, and deploy
N Data servers of reduced size in the new cluster.
➢ Instructions for adjusting the Historical/MiddleManager configs for the split are
described in a later section in this guide.

Query server:
➢ The main considerations for the Query server are available CPUs and RAM for
the Broker heap + direct memory, and Router heap.
➢ Sum up the allocated memory sizes for your Broker and Router from the
single-server deployment, and choose Query server hardware with enough
RAM to cover the Broker/Router, with some extra RAM for other processes on
the machine.
➢ For CPU cores, you can choose hardware with approximately 1/4th of the cores
of the single-server deployment.
➢ The basic cluster tuning guide has information on how to calculate
Broker/Router memory usage.
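
As a worked illustration of the sizing arithmetic above (the numbers are
illustrative assumptions, not recommendations): if the single-server
deployment used 16 CPU cores and 128 GiB of RAM, a split factor of N = 2
gives two Data servers of 8 cores / 64 GiB each, while the Master and Query
servers each get roughly 16 / 4 = 4 cores, with RAM sized to the summed
heaps plus headroom for other processes.
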
Formatting data:
The following samples show data formats that are natively supported in Druid:

JSON

➢ {"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language":


"en", "user": "nuclear", "unpatrolled": "true", "NewPage": "true", "robot":
"false", "anonymous": "false", "namespace": "article", "continent": "North
America", "country": "United States", "region": "Bay Area", "city": "San
Francisco", "added": 57, "deleted": 200, "delta": -143}

CSV

➢ 2013-08-31T01:02:33Z, "Gypsy Danger", "en", "nuclear", "true", "true",
"false", "false", "article", "North America", "United States", "Bay Area",
"San Francisco", 57, 200, -143

TSV (Delimited)

➢ 2013-08-31T01:02:33Z "Gypsy Danger" "en" "nuclear" "true" "true" "false"


"false" "article" "North America" "United States" "Bay Area" "San Francisco"
57 200 -143

➢ Note that the CSV and TSV data do not contain column headers. This becomes
important when you specify the data for ingestion (see the inputFormat
sketch below).
➢ Besides text formats, Druid also supports binary formats such as ORC and
Parquet.
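
A sketch of how the missing column headers come into play when ingesting the
CSV sample above: the ingestion spec's inputFormat can declare the columns
explicitly (the column names are assumptions inferred from the JSON sample):

    {
      "type": "csv",
      "findColumnsFromHeader": false,
      "columns": ["timestamp", "page", "language", "user", "unpatrolled",
                  "newPage", "robot", "anonymous", "namespace", "continent",
                  "country", "region", "city", "added", "deleted", "delta"]
    }
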
Druid schema model:

➢ Druid stores data in data sources, which are similar to tables in a traditional
relational database management system (RDBMS). Druid's data model shares
similarities with both relational and time-series data models.

Primary timestamp:
➢ Druid schemas must always include a primary timestamp. Druid uses the
primary timestamp to partition and sort your data. Druid uses the primary
timestamp to rapidly identify and retrieve data within the time range of
queries. Druid also uses the primary timestamp column for time-based data
management operations such as dropping time chunks, overwriting time
chunks, and time-based retention rules.
➢ Druid parses the primary timestamp based on the timestampSpec
configuration at ingestion time. Regardless of the source field for the primary
timestamp, Druid always stores the timestamp in the __time column in your
Druid datasource.
➢ You can control other important operations that are based on the primary
timestamp in the granularitySpec. If you have more than one timestamp
column, you can store the others as secondary timestamps.
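
A minimal timestampSpec sketch for the JSON sample shown earlier (the column
name and format value are illustrative):

    {
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      }
    }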

Dimensions:
➢ Dimensions are columns that Druid stores "as-is". You can use dimensions for
any purpose. For example, you can group, filter, or apply aggregators to
dimensions at query time when necessary.
➢ If you disable rollup, then Druid treats the set of dimensions like a set of
columns to ingest. The dimensions behave exactly as you would expect from
any database that does not support a rollup feature.
➢ At ingestion time, you configure dimensions in the dimensionsSpec.
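
A dimensionsSpec sketch matching the sample data above (the dimension list is
an illustrative subset):

    {
      "dimensionsSpec": {
        "dimensions": ["page", "language", "user", "namespace", "country", "city"]
      }
    }
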
Metrics:
➢ Metrics are columns that Druid stores in an aggregated form. Metrics are most
useful when you enable rollup. If you specify a metric, you can apply an
aggregation function to each row during ingestion. Rollup has the following
characteristics:
➢ Rollup is a form of aggregation that collapses dimensions while aggregating
the values in the metrics; that is, it collapses rows but retains summary
information.
➢ Rollup is a form of aggregation that combines multiple rows with the same
timestamp value and dimension values. For example, the rollup tutorial
demonstrates using rollup to collapse NetFlow data to a single row per
(minute, srcIP, dstIP) tuple, while retaining aggregate information about total
packet and byte counts.
➢ Druid can compute some aggregators, especially approximate ones, more
quickly at query time if they are partially computed at ingestion time, even
for data that has not yet been rolled up.
➢ At ingestion time, you configure metrics in the metricsSpec.
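
A metricsSpec sketch for the sample data above, assuming rollup is enabled
(the aggregator choices are illustrative):

    {
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "longSum", "name": "added", "fieldName": "added" },
        { "type": "longSum", "name": "deleted", "fieldName": "deleted" }
      ]
    }
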
Druid architecture:

Druid has a distributed architecture that is designed to be cloud-friendly and easy to
operate. You can configure and scale services independently so you have maximum
flexibility over cluster operations. This design includes enhanced fault tolerance: an
outage of one component does not immediately affect other components.

Druid services:

Druid has several types of services:

➢ The Coordinator service manages data availability on the cluster.
➢ The Overlord service controls the assignment of data ingestion workloads.
➢ The Broker service handles queries from external clients.
➢ Router services are optional; they route requests to Brokers, Coordinators,
and Overlords.
➢ Historical services store queryable data.
➢ MiddleManager services ingest data.

Druid servers:
Druid services can be deployed any way you like, but for ease of deployment, we
suggest organizing them into three server types: Master, Query, and Data.
➢ Master: Runs Coordinator and Overlord processes, and manages data
availability and ingestion.
➢ Query: Runs Broker and optional Router processes, and handles queries from
external clients.
➢ Data: Runs Historical and MiddleManager processes, executes ingestion
workloads, and stores all queryable data.
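
As an illustration of the query path, external clients typically send SQL to
the Router or Broker over HTTP; a sketch, assuming Druid's SQL endpoint and a
hypothetical "wikipedia" datasource:

    POST /druid/v2/sql
    {
      "query": "SELECT page, SUM(added) AS total_added FROM wikipedia GROUP BY page ORDER BY total_added DESC LIMIT 5"
    }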

Segments:

Apache Druid stores its data and indexes in segment files partitioned by time. Druid
creates a segment for each segment interval that contains data. If an interval is
empty, that is, it contains no rows, no segment exists for that time interval. Druid may
create multiple segments for the same interval if you ingest data for that period via
different ingestion jobs.

➢ Segment file structure:

Within a segment, timestamp and metric columns are stored as simple compressed
arrays. Dimension columns are different because they support filter and group-by
operations, so each dimension requires the following three data structures:

➢ Dictionary: Maps values (which are always treated as strings) to integer IDs,
allowing compact representation of the list and bitmap values.
➢ List: The column's values, encoded using the dictionary. Required for
GroupBy and TopN queries. Queries that solely aggregate metrics based on
filters can run without accessing the list of values.
➢ Bitmap: One bitmap for each distinct value in the column, to indicate which
rows contain that value. Bitmaps allow for quick filtering operations
because they are convenient for quickly applying AND and OR operators.
Also known as inverted indexes.
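
A worked sketch of the three structures for a hypothetical dimension with four
rows and two distinct values, "A" and "B" (shown as JSON purely for
illustration):

    {
      "dictionary": { "A": 0, "B": 1 },
      "list": [0, 0, 1, 1],
      "bitmaps": {
        "A": [1, 1, 0, 0],
        "B": [0, 0, 1, 1]
      }
    }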

Deep storage:

➢ Deep storage is where segments are stored. It is a storage mechanism that
Apache Druid does not provide. This deep storage infrastructure defines the
level of durability of your data. As long as Druid processes can see this storage
infrastructure and get at the segments stored on it, you will not lose data no
matter how many Druid nodes you lose. If segments disappear from this
storage layer, then you will lose whatever data those segments represent.
➢ In addition to being the backing store for segments, deep storage supports
queries from deep storage: you can run queries against segments stored
primarily in deep storage. The load rules you configure determine whether
segments exist primarily in deep storage or in a combination of deep storage
and Historical processes.
➢ Querying from deep storage: Although not as performant as querying
segments stored on disk for Historical processes, querying from deep storage
lets you access segments that you may not need frequently or with the
extremely low latency that Druid queries traditionally provide. You trade
some performance for a lower total storage cost because you can access more
of your data without the need to increase the number or capacity of your
Historical processes.
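
Queries against segments that live only in deep storage run asynchronously; a
sketch, assuming Druid's async SQL statements endpoint and the hypothetical
"wikipedia" datasource from earlier:

    POST /druid/v2/sql/statements
    {
      "query": "SELECT COUNT(*) FROM wikipedia WHERE __time < TIMESTAMP '2014-01-01'"
    }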

Metadata storage:

➢ Apache Druid relies on an external dependency for metadata storage. Druid
uses the metadata store to house various metadata about the system, but not
to store the actual data. The metadata store retains all metadata essential for
a Druid cluster to work.
➢ The metadata store includes the following:
➢ Segment records
➢ Rule records
➢ Configuration records
➢ Task-related tables
➢ Audit records
➢ Derby is the default metadata store for Druid; however, it is not suitable for
production. MySQL and PostgreSQL are more production-suitable metadata
stores.

➢ Supported metadata stores: Druid supports Derby, MySQL, and PostgreSQL for
storing metadata.
