Apache Calcite Tutorial
Steps
1. Clone the GitHub repository:
   git clone --branch boss21 https://github.com/zabetak/calcite-tutorial.git
2. Load it into an IDE (IntelliJ preferred):
   a. Click Open
   b. Navigate to calcite-tutorial
   c. Select the pom.xml file
   d. Choose “Open as Project”
3. Compile the project:
   java -version
   cd calcite-tutorial
   ./mvnw package -DskipTests
What, where, how are data stored?
Two entities, stored on the filesystem as XML, CSV, JSON, or binary files, or in one of 360+ DBMS:
AUTHOR (id int, fname string, lname string, birth date)
BOOK (id int, title string, price decimal, year int, author → AUTHOR.id, 0..1)
Apache Lucene
★ Open-source search engine
★ Java library
★ Powerful indexing & search features
★ Spell checking, hit highlighting
★ Advanced analysis/tokenization capabilities
★ ACID transactions
★ Ultra compact memory/disk format
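To make the “Java library” point concrete, here is a tiny, hedged sketch of a Lucene search (the index path and field name are illustrative, not the tutorial's):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

// Open an existing index and run a term query against the "title" field.
try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("bookstore-index")))) {
  IndexSearcher searcher = new IndexSearcher(reader);
  TopDocs hits = searcher.search(new TermQuery(new Term("title", "travels")), 10);
  System.out.println(hits.totalHits);
}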
How to query the data?
1. Retrieve books and authors
2. Display the image, title, and price of the book, along with the first name & last name of the author
3. Sort the books based on their id (price, or something else)
4. Show results in groups of five
SELECT b.id, b.title, b.year, a.fname, a.lname
FROM Book b
LEFT OUTER JOIN Author a ON b.author=a.id
ORDER BY b.id
LIMIT 5
Query processor architecture
Query → Parser (reads the Schema via a CatalogReader) → Relational Algebra → Query planner (applies Rules, consults Metadata: cost, statistics) → Relational Algebra → Execution engine → Results
Apache Calcite
SQL query / API calls (+ Schema) → SqlParser → SqlNode → SqlToRelConverter (with CatalogReader) → RelNode → RelOptPlanner (with RelRule and RelMetadataProvider) → RelNode → RelRunner → Results
Dev & extensions: the same pipeline, highlighting the pluggable pieces (SqlParser, CatalogReader, RelRule, RelMetadataProvider, RelRunner) as the points where developers customize Calcite.
2. CSV Adapter Demo
Adapter
● Implement the SchemaFactory interface
● Connect to a data source using parameters
● Extract the schema - return a list of tables
● Push down processing to the data source:
  ● A set of planner rules
  ● Calling convention (optional)
  ● Query model & query generator (optional)

JSON model:
{
  "schemas": [
    {
      "name": "BOOKSTORE",
      "type": "custom",
      "factory": "org.apache.calcite.adapter.file.FileSchemaFactory",
      "operand": {
        "directory": "bookstore"
      }
    }
  ]
}
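With such a model, Calcite's JDBC driver can query the files directly. A minimal sketch, assuming the model above is saved as bookstore.json and that the files in the bookstore directory yield a table named AUTHOR (the table name is an assumption for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class CsvAdapterDemo {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.setProperty("model", "bookstore.json"); // path to the JSON model shown above

    try (Connection conn = DriverManager.getConnection("jdbc:calcite:", props);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT * FROM \"BOOKSTORE\".\"AUTHOR\"")) {
      while (rs.next()) {
        System.out.println(rs.getString(1)); // print the first column of each row
      }
    }
  }
}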
3. Coding module I: Main components
Setup schema & type factory
<<Interface>> Schema (has * tables)
  +getTable(name: String): Table
<<Interface>> Table
  +getRowType(typeFactory: RelDataTypeFactory): RelDataType
<<Interface>> RelDataTypeFactory
  +createJavaType(clazz: Class): RelDataType
  +createSqlType(typeName: SqlTypeName): RelDataType
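As a concrete illustration of these interfaces, here is a minimal sketch of a Table that reports the BOOK row type through the type factory (the class and column names follow the earlier schema diagram and are illustrative, not the tutorial's actual code):

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

public class BookTable extends AbstractTable {
  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    // Row type: (ID INTEGER, TITLE VARCHAR, PRICE DECIMAL, YEAR INTEGER, AUTHOR INTEGER)
    return typeFactory.builder()
        .add("ID", SqlTypeName.INTEGER)
        .add("TITLE", SqlTypeName.VARCHAR)
        .add("PRICE", SqlTypeName.DECIMAL)
        .add("YEAR", SqlTypeName.INTEGER)
        .add("AUTHOR", SqlTypeName.INTEGER)
        .build();
  }
}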
Query to Abstract Syntax Tree (AST)
SQL query → SqlParser → SqlNode

SELECT b.id, b.title, b.year, a.fname, a.lname
FROM Book AS b
LEFT OUTER JOIN Author a ON b.author = a.id
WHERE b.year > 1830
ORDER BY b.id
LIMIT 5

AST nodes: OrderBy (b.id, FETCH 5), Select (a NodeList of Identifiers), Join, BasicCall (b.year > 1830), NumericLiteral (5).
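A minimal sketch of this step using Calcite's parser directly; parsing needs no schema, and the query here is a trimmed-down variant of the one above:

import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;

SqlParser parser = SqlParser.create(
    "SELECT b.id, b.title FROM Book AS b ORDER BY b.id");
SqlNode ast = parser.parseQuery(); // throws SqlParseException on syntax errors
System.out.println(ast.getKind()); // ORDER_BY - the root node of this AST
System.out.println(ast);           // the query re-rendered from the AST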
AST to logical plan (SqlToRelConverter):
OrderBy (b.id FETCH 5) → LogicalSort [sort0=$0,dir=ASC,fetch=5]
Select (b.id, b.title, b.year, a.fname, a.lname) → LogicalProject [ID=$0,TITLE=$1,YEAR=$2,FNAME=$5,LNAME=$6]
BasicCall (b.year > 1830) → LogicalFilter [$2>1830]
Join → LogicalJoin [$3==$4,type=left]
Logical to physical plan (RelRule):
LogicalSort [sort0=$0,dir=ASC,fetch=5] → (EnumerableSortRule) → EnumerableSort [sort0=$0,dir=ASC,fetch=5]
LogicalProject [ID=$0,TITLE=$1,YEAR=$2,FNAME=$5,LNAME=$6] → (EnumerableProjectRule) → EnumerableProject [ID=$0,TITLE=$1,YEAR=$2,FNAME=$5,LNAME=$6]
LogicalFilter [$2>1830] → (EnumerableFilterRule) → EnumerableFilter [$2>1830]
LogicalJoin [$3==$4,type=left] → (EnumerableJoinRule) → EnumerableJoin [$3==$4,type=left]
LogicalTableScan [Book], [Author] → (EnumerableTableScanRule) → EnumerableTableScan [Book], [Author]
Physical to Executable plan
RelNode → EnumerableInterpretable → Java code

EnumerableSort [sort0=$0,dir=ASC,fetch=5]
  EnumerableProject [ID=$0,TITLE=$1,YEAR=$2,FNAME=$5,LNAME=$6]
    EnumerableFilter [$2>1830]
      EnumerableJoin [$3==$4,type=left]
        EnumerableTableScan [Book]
        EnumerableTableScan [Author]
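Putting the whole pipeline together with Calcite's Frameworks/Planner facade - a rough sketch, assuming rootSchema has the bookstore schema registered as in the coding module (schema and table names are illustrative):

import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

SchemaPlus rootSchema = Frameworks.createRootSchema(true);
// ... register the bookstore schema on rootSchema here (elided) ...

FrameworkConfig config = Frameworks.newConfigBuilder()
    .defaultSchema(rootSchema)
    .build();
Planner planner = Frameworks.getPlanner(config);

SqlNode parsed = planner.parse("SELECT b.id, b.title FROM Book AS b ORDER BY b.id");
SqlNode validated = planner.validate(parsed);          // resolves names & types against the schema
RelNode logicalPlan = planner.rel(validated).project();
System.out.println(RelOptUtil.toString(logicalPlan));  // prints the LogicalSort / LogicalProject / ... tree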
4. Coding module I: Exercises (Homework)
Exercise I: Execute more SQL queries
Include GROUP BY and other types of clauses.

Applying a planner rule (FilterIntoJoinRule pushes the filter below the join, onto the input it references):

Before:
LogicalSort [sort0=$0,dir0=ASC,sort1=$1,dir1=ASC]
  LogicalProject [C_NAME=$1,O_ORDERKEY=$8,O_ORDERDATE=$12]
    LogicalFilter [$0<3]
      LogicalJoin [$0=$9,type=inner]

After (FilterIntoJoinRule):
LogicalSort [sort0=$0,dir0=ASC,sort1=$1,dir1=ASC]
  LogicalProject [C_NAME=$1,O_ORDERKEY=$8,O_ORDERDATE=$12]
    LogicalJoin [$0=$9,type=inner]
      LogicalFilter [$0<3]
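To apply a rule like this programmatically outside the full Volcano flow, Calcite's heuristic HepPlanner works well - a minimal sketch, assuming logicalPlan is a RelNode shaped like the “Before” tree (FilterIntoJoinRule is exposed as CoreRules.FILTER_INTO_JOIN in recent Calcite versions):

import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.CoreRules;

HepProgram program = new HepProgramBuilder()
    .addRuleInstance(CoreRules.FILTER_INTO_JOIN)   // push the LogicalFilter below the LogicalJoin
    .build();
HepPlanner hepPlanner = new HepPlanner(program);
hepPlanner.setRoot(logicalPlan);                   // the "Before" plan above
RelNode optimized = hepPlanner.findBestExp();      // the "After" plan above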
Constructing a plan programmatically with RelBuilder:

final RelNode node = builder
    .scan("orders")
    .filter(
        builder.call(
            SqlStdOperatorTable.GREATER_THAN,
            builder.field("o_totalprice"),
            builder.literal(220388.06)))
    .aggregate(
        builder.groupKey("o_custkey"),
        builder.count())
    .build();
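The builder above is a RelBuilder. One way to obtain one, sketched under the assumption that rootSchema contains the "orders" table used in the exercises:

import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.RelBuilder;

FrameworkConfig config = Frameworks.newConfigBuilder()
    .defaultSchema(rootSchema)   // schema that contains the "orders" table (assumed)
    .build();
RelBuilder builder = RelBuilder.create(config);
// The RelNode returned by build() can then be optimized and executed just like
// a plan produced by SqlToRelConverter.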
5. Hybrid planning
Calling convention
Initially all nodes belong to the “logical” calling convention. The logical calling convention cannot be implemented, so it has infinite cost.

Example plan:
Join
  Filter
    Join
      Scan
      Scan
  Scan
Calling convention
Tables can’t be moved, so there is only one choice of calling convention for each table. Examples:
● Enumerable
● Druid
● Drill
● HBase
● JDBC
Calling convention
Rules fire to convert nodes to particular calling conventions. The calling convention propagates through the tree. Because this is Volcano, each node can have multiple conventions.
Converters
To keep things honest, we need to insert a converter at each point where the convention changes (e.g. “Green to Logical”, “Blue to Logical”).

(Recall: Volcano has an enforcer for each trait. Convention is a physical property, and the converter is its enforcer.)

BlueFilterRule:
  LogicalFilter(BlueToLogical(Blue b))
  →
  BlueToLogical(BlueFilter(b))
Generating programs to implement hybrid plans
Hybrid plans are glued together using an engine - a convention that does not have a storage format. (Example engines: Drill, Spark, Presto.) To implement, we generate a program that calls out to query1 and query2, with converters (e.g. “Blue to Orange”, “Green to Orange”) at the boundaries.
What do we need?

Two calling conventions:
1. Enumerable
2. Lucene

Three custom operators:
1. LuceneTableScan (STEP 1)
2. LuceneToEnumerableConverter (STEP 3)
3. LuceneFilter (STEP 5)

Three custom conversion rules:
1. LogicalTableScan → LuceneTableScan (STEP 2)
2. LogicalFilter → LuceneFilter (STEP 6)
3. LuceneANY → LuceneToEnumerableConverter (STEP 4)

Target plan:
EnumerableSort [sort0=$0,dir0=ASC,sort1=$1,dir1=ASC]
  EnumerableCalc [C_NAME=$1,O_ORDERKEY=$8,O_ORDERDATE=$12]
    EnumerableJoin [$0=$9,type=inner]
      LuceneToEnumerableConverter
        LuceneFilter [$0<3]
          LuceneTableScan [CUSTOMER]
      LuceneToEnumerableConverter
        LuceneTableScan [ORDERS]
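To give a flavour of what such a conversion rule looks like, here is a rough sketch of STEP 6 (LogicalFilter → LuceneFilter). It assumes a LUCENE convention and the LuceneFilter operator from STEP 5; it is illustrative, not the module's actual code, and newer Calcite versions prefer the Config-based rule constructors:

import org.apache.calcite.plan.Convention;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.convert.ConverterRule;
import org.apache.calcite.rel.logical.LogicalFilter;

// Marker interface + convention for operators that run inside Lucene (hypothetical).
interface LuceneRel extends RelNode {
  Convention LUCENE = new Convention.Impl("LUCENE", LuceneRel.class);
}

class LuceneFilterRule extends ConverterRule {
  LuceneFilterRule() {
    // Match a LogicalFilter in the NONE convention and convert it to LUCENE.
    super(LogicalFilter.class, Convention.NONE, LuceneRel.LUCENE, "LuceneFilterRule");
  }

  @Override public RelNode convert(RelNode rel) {
    LogicalFilter filter = (LogicalFilter) rel;
    return new LuceneFilter(                          // the custom operator from STEP 5 (not shown here)
        filter.getCluster(),
        filter.getTraitSet().replace(LuceneRel.LUCENE),
        convert(filter.getInput(), LuceneRel.LUCENE), // ask for the input in the LUCENE convention
        filter.getCondition());
  }
}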
7. Volcano Planner Internals
Volcano planning algorithm
● Based on two papers by Goetz Graefe in the 1990s (Volcano, Cascades), now the industry standard for cost-based optimization.
● We keep equivalence sets of expressions (class RelSet).
● Each relational expression has a memo (digest), so we will recognize it if we generate it again.
● If an expression transforms to an expression in another equivalence set, we can merge those equivalence sets.
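A rough sketch of driving this algorithm programmatically, assuming logicalPlan is a RelNode whose cluster was created over a VolcanoPlanner (as in the coding modules); the Enumerable rules shown are just the ones the earlier example needs:

import org.apache.calcite.adapter.enumerable.EnumerableConvention;
import org.apache.calcite.adapter.enumerable.EnumerableRules;
import org.apache.calcite.plan.RelOptPlanner;
import org.apache.calcite.plan.RelTraitSet;
import org.apache.calcite.rel.RelNode;

RelOptPlanner planner = logicalPlan.getCluster().getPlanner();

// Register the physical (Enumerable) implementation rules.
planner.addRule(EnumerableRules.ENUMERABLE_SORT_RULE);
planner.addRule(EnumerableRules.ENUMERABLE_PROJECT_RULE);
planner.addRule(EnumerableRules.ENUMERABLE_FILTER_RULE);
planner.addRule(EnumerableRules.ENUMERABLE_JOIN_RULE);
planner.addRule(EnumerableRules.ENUMERABLE_TABLE_SCAN_RULE);

// Ask for the same plan in the Enumerable calling convention and let Volcano search.
RelTraitSet desired = logicalPlan.getTraitSet().replace(EnumerableConvention.INSTANCE);
planner.setRoot(planner.changeTraits(logicalPlan, desired));
RelNode physicalPlan = planner.findBestExp();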
Matches and queues
Suppose we have these rule matches:
● Filter-on-Project
● Project-on-Project
● Project-on-Join

Firing the first match would generate new matches… which would generate new matches… and we’d never get to match #2. Instead, we put the matched rules on a queue: pop the top rule match, fire the rule, register each RelNode generated by the rule (merging sets if equivalences are found), and repeat.
It takes a lot of time and effort to write a high-quality rule. Rules can be reused, tested, improved, and they compose with other rules. Calcite's library of rules is valuable.
8. Dialects
Calcite architecture
At what points in the Calcite stack do ‘languages’ exist?
● Incoming SQL
● Validating SQL against built-in operators
● Type system (e.g. max size of the INTEGER type)
● The JDBC adapter generates SQL
● Other adapters generate other languages

Stack: SQL parser & validator → relational algebra → query planner (with pluggable rewrite rules) → adapters: Enumerable, MongoDB, JDBC, File (CSV, JSON, Http), Apache Kafka, Apache Spark.
Parsing & validating SQL

SELECT deptno AS d,
       SUM(sal) AS [sumSal]
FROM [HR].[Emp]
WHERE ename NOT ILIKE "A%"
GROUP BY d
ORDER BY 1, 2 DESC

Settings needed to parse and validate this dialect:
● PARSER_FACTORY (interface SqlParserImplFactory)
● Lex.unquotedCasing = Casing.TO_UPPER
● Lex.quoting = Quoting.BRACKET
● Lex.quotedCasing = Casing.UNCHANGED
● Lex.charLiteralStyle = CharLiteralStyle.BQ_DOUBLE
● FUN = "postgres" (ILIKE is not standard SQL)
● SqlConformance.isGroupByAlias() = true
● SqlConformance.isSortByOrdinal() = true
● SqlValidator.Config.defaultNullCollation = HIGH
● CalciteConnectionProperty.CONFORMANCE (interface SqlConformance)
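These knobs are typically supplied as Calcite JDBC connection properties. A small sketch - lex, fun, and conformance are real property names, but the exact values accepted can vary by Calcite version:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

Properties props = new Properties();
props.setProperty("lex", "SQL_SERVER");       // bracket quoting, unquoted identifiers to upper case
props.setProperty("fun", "postgresql");       // operator library that provides ILIKE (the slide's FUN = "postgres")
props.setProperty("conformance", "LENIENT");  // allows GROUP BY alias and ORDER BY ordinal
Connection conn = DriverManager.getConnection("jdbc:calcite:", props);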
Backwards planning
Until now, we have seen forward planning. Forward planning transforms an expression (R0) to many equivalent forms and picks the one with the lowest cost (Ropt). Backwards planning transforms an expression to an equivalent form (RN) that contains a target expression (T).
Applications of backwards planning
Indexes (e.g. b-tree indexes). An index is a derived data structure whose contents
can be described as a relational expression (generally project-sort). When we are
planning a query, it already exists (i.e. the cost has already been paid).
Summary tables. A summary table is a derived data structure (generally
filter-project-join-aggregate).
Replicas with different physical properties (e.g. copy the table from New York to
Tokyo, or copy the table and partition by month(orderDate), sort by productId).
Incremental view maintenance. Materialized view V is populated from base table
T. Yesterday, we populated V with V0 = Q(T0). Today we want to make its contents
equal to V1 = Q(T1). Can we find and apply a delta query, dQ = Q(T1 - T0)?
Materialized views in Calcite

JSON model:
{
  "schemas": [ {
    "name": "HR",
    "tables": [ {
      "name": "emp"
    } ],
    "materializations": [ {
      "table": "i_emp_job",
      "sql": "SELECT job, empno
              FROM emp
              ORDER BY job, empno"
    }, {
      "table": "add_emp_deptno",
      "sql": "SELECT deptno,
                SUM(sal) AS ss, COUNT(*) AS c
              FROM emp
              GROUP BY deptno"
    } ]
  } ]
}

Planner API:

/** Transforms a relational expression into a
 * semantically equivalent relational expression,
 * according to a given set of rules and a cost
 * model. */
public interface RelOptPlanner {
  /** Defines an equivalence between a table and
   * a query. */
  void addMaterialization(
      RelOptMaterialization materialization);

  /** Finds the most efficient expression to
   * implement this query. */
  RelNode findBestExp();
}

/** Records that a particular query is materialized
 * by a particular table. */
public class RelOptMaterialization {
  public final RelNode tableRel;
  public final List<String> qualifiedTableName;
  public final RelNode queryRel;
}
You can define materializations in a JSON model, via the planner API, or via
CREATE MATERIALIZED VIEW DDL (not shown).
More about materialized views
● There are several algorithms to rewrite queries to match materialized views
● A lattice is a data structure to model a star schema
● Calcite has algorithms to recommend an optimal set of summary tables for
a lattice (given expected queries, and statistics about column cardinality)
● Data profiling algorithms estimate the cardinality of all combinations of
columns
10. Working with spatial data
Spatial query
Find all restaurants within 1.5 distance units of my current location:

SELECT *
FROM Restaurants AS r
WHERE ST_Distance(
    ST_MakePoint(r.x, r.y),
    ST_MakePoint(6, 7)) < 1.5

restaurant       x  y
Zachary’s pizza  3  1
King Yen         7  7
Filippo’s        7  4
Station burger   5  6

We cannot use a B-tree index (it can sort points by x or y coordinates, but not both), and specialized spatial indexes (such as R*-trees) are not generally available.
Hilbert space-filling curve

SELECT *
FROM Restaurants AS r
WHERE (r.h BETWEEN 35 AND 42
    OR r.h BETWEEN 46 AND 46)

restaurant       x  y  h
Zachary’s pizza  3  1  5
Station burger   5  6  36
Filippo’s        7  4  52

There are similar techniques for other spatial patterns.
11. Research using Apache Calcite
Yes, VLDB 2021! Go to the talk: 09:00 Wednesday.
@julianhyde @szampetak
https://calcite.apache.org
Thank you!
Resources
● Calcite project https://calcite.apache.org
● Materialized view algorithms
https://calcite.apache.org/docs/materialized_views.html
● JSON model https://calcite.apache.org/docs/model.html
● Lazy beats smart and fast (DataEng 2018) - MVs, spatial, profiling
https://www.slideshare.net/julianhyde/lazy-beats-smart-and-fast
● Efficient spatial queries on vanilla databases (ApacheCon 2018)
https://www.slideshare.net/julianhyde/spatial-query-on-vanilla-databases
● Graefe, McKenna. The Volcano Optimizer Generator, 1991
● Graefe. The Cascades Framework for Query Optimization, 1995
● Slideshare (past presentations by Julian Hyde, including several about
Apache Calcite) https://www.slideshare.net/julianhyde