Apache Calcite Tutorial

The document provides instructions for setting up an environment to run a Calcite tutorial. It explains how to clone a GitHub repository containing sample code, check the Java version, and compile the project. It then outlines the topics that will be covered in the tutorial, including introducing Calcite, demonstrating a CSV adapter, explaining key components like the schema and type factory, and exploring query planning and custom operators. Optional advanced topics like dialects, materialized views, and spatial data are also listed.

Tutorial @BOSS’21 Copenhagen

Stamatis Zampetakis, Julian Hyde • August 16, 2021

# Follow these steps to set up your environment.
# (The first time, it may take ~3 minutes to download dependencies.)

git clone --branch boss21 https://github.com/zabetak/calcite-tutorial.git
java -version   # need Java 8 or higher
cd calcite-tutorial
./mvnw package -DskipTests
Setup Environment

Requirements
1. Git
2. JDK version ≥ 1.8

Steps
1. Clone the GitHub repository:
   git clone --branch boss21 https://github.com/zabetak/calcite-tutorial.git
2. Load into an IDE (preferably IntelliJ):
   a. Click Open
   b. Navigate to calcite-tutorial
   c. Select the pom.xml file
   d. Choose “Open as Project”
3. Compile the project:
   java -version
   cd calcite-tutorial
   ./mvnw package -DskipTests
About us

Julian Hyde @julianhyde


Senior Staff Engineer @ Google / Looker
Creator of Apache Calcite
PMC member of Apache Arrow, Drill, Eagle, Incubator and Kylin

Stamatis Zampetakis @szampetak


Senior Software Engineer @ Cloudera, Hive query optimizer team
PMC member of Apache Calcite; Hive committer
PhD in Data Management, INRIA & Paris-Sud University
Outline
1. Introduction
2. CSV Adapter Demo
3. Coding module I: Main components
4. Coding module I: Exercises (Homework)
5. Hybrid planning
6. Coding module II: Custom operators/rules (Homework)
7. Volcano Planner internals (optional)
8. Dialects (optional)
9. Materialized views (optional)
10. Working with spatial data (optional)
11. Research using Apache Calcite


1. Calcite introduction

Motivation: Data views

1. Retrieve books and authors
2. Display image, title, price of the book along with firstname & lastname of the author
3. Sort the books based on their id (price or something else)
4. Show results in groups of five
What, where, how is the data stored?

AUTHOR                          BOOK
  id     int         author       id     int
  fname  string      0..1         title  string
  lname  string                   price  decimal
  birth  date                     year   int

The same data may live in a filesystem (XML, CSV, JSON or binary files
scattered over many directories) or in any of 360+ DBMS.
Apache Lucene
★ Open-source search engine
★ Java library
★ Powerful indexing & search features
★ Spell checking, hit highlighting
★ Advanced analysis/tokenization capabilities
★ ACID transactions
★ Ultra compact memory/disk format
How to query the data?
1. Retrieve books and authors
2. Display image, title, price of the book along with firstname & lastname of the author
3. Sort the books based on their id (price or something else)
4. Show results in groups of five

SELECT b.id, b.title, b.year, a.fname, a.lname
FROM Book b
LEFT OUTER JOIN Author a ON b.author = a.id
ORDER BY b.id
LIMIT 5
Query processor architecture

Query + Schema
  → Parser (with CatalogReader)
  → Relational Algebra
  → Query planner (driven by Rules and Metadata: cost, statistics)
  → Relational Algebra
  → Execution engine
  → Results
Apache Calcite

SQL query → SqlParser → SqlNode
SqlNode → SqlValidator (using CatalogReader over the Schema) → SqlNode
SqlNode → SqlToRelConverter → RelNode
API calls → RelBuilder → RelNode
RelNode → RelOptPlanner (driven by RelRule and RelMetadataProvider) → RelNode
RelNode → RelRunner → Results

Most of these components (parser, validator, rules, metadata providers)
are pluggable points for development & extensions.
2. CSV Adapter Demo

Adapter
● Implement the SchemaFactory interface
● Connect to a data source using parameters
● Extract the schema - return a list of tables
● Push down processing to the data source:
  ● A set of planner rules
  ● Calling convention (optional)
  ● Query model & query generator (optional)

{
  "schemas": [
    {
      "name": "BOOKSTORE",
      "type": "custom",
      "factory": "org.apache.calcite.adapter.file.FileSchemaFactory",
      "operand": {
        "directory": "bookstore"
      }
    }
  ]
}
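
A model file like the one above can be passed straight to Calcite's JDBC
driver. A minimal sketch, assuming the model is saved as bookstore.json
and that the file adapter exposes a BOOK table from the bookstore
directory (the table name is illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BookstoreDemo {
  public static void main(String[] args) throws Exception {
    // The "model" property points Calcite at the JSON model above.
    try (Connection c = DriverManager.getConnection(
             "jdbc:calcite:model=bookstore.json");
         Statement s = c.createStatement();
         // BOOK is a hypothetical table derived from the directory.
         ResultSet r = s.executeQuery(
             "SELECT * FROM \"BOOKSTORE\".\"BOOK\"")) {
      while (r.next()) {
        System.out.println(r.getString(1));
      }
    }
  }
}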
3. Coding module I: Main components

Setup schema & type factory

<<Interface>> Schema
  +getTable(name: String): Table        (a schema has many tables)

<<Interface>> Table
  +getRowType(typeFactory: RelDataTypeFactory): RelDataType

<<Interface>> RelDataTypeFactory
  +createJavaType(clazz: Class): RelDataType
  +createSqlType(typeName: SqlTypeName): RelDataType
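
As a concrete illustration, a minimal sketch of a Table that reports its
row type; the class and column names are made up for this example:

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

/** A table whose row type is (ID INTEGER, TITLE VARCHAR). */
public class BookTable extends AbstractTable {
  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    return typeFactory.builder()
        .add("ID", SqlTypeName.INTEGER)
        .add("TITLE", SqlTypeName.VARCHAR)
        .build();
  }
}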
Query to Abstract Syntax Tree (AST)

SQL query → SqlParser → SqlNode

SELECT b.id, b.title, b.year, a.fname, a.lname
FROM Book AS b
LEFT OUTER JOIN Author a ON b.author = a.id
WHERE b.year > 1830
ORDER BY b.id
LIMIT 5

[Figure: the resulting SqlNode tree. An OrderBy node (with the b.id
Identifier and the NumericLiteral 5 for FETCH) sits on top of a Select
node. The select list is a NodeList of Identifiers (b.id, b.title,
b.year, a.fname, a.lname); the FROM is a Join of "book AS b" and
"author AS a" with join type LEFT and condition b.author = a.id (a
BasicCall on the = BinaryOperator); the WHERE is a BasicCall on the >
BinaryOperator over b.year and the literal 1830.]
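
A minimal sketch of this parsing step, using SqlParser with its default
configuration on the query above:

import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;

public class ParseDemo {
  public static void main(String[] args) throws Exception {
    String sql = "SELECT b.id, b.title, b.year, a.fname, a.lname\n"
        + "FROM Book AS b\n"
        + "LEFT OUTER JOIN Author a ON b.author = a.id\n"
        + "WHERE b.year > 1830\n"
        + "ORDER BY b.id\n"
        + "LIMIT 5";
    SqlNode ast = SqlParser.create(sql).parseQuery();
    System.out.println(ast.getKind()); // ORDER_BY, the root of the tree
  }
}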
AST to logical plan

SqlNode → SqlValidator → SqlNode → SqlToRelConverter → RelNode

The AST (OrderBy over Select over Join, as on the previous slide)
becomes the logical plan:

LogicalSort [sort0=$0,dir=ASC,fetch=5]
  LogicalProject [ID=$0,TITLE=$1,YEAR=$2,FNAME=$5,LNAME=$6]
    LogicalFilter [$2>1830]
      LogicalJoin [$3==$4,type=left]
        LogicalTableScan [Book]
        LogicalTableScan [Author]
Logical to physical plan

RelNode → RelOptPlanner → RelNode

RelRules (EnumerableSortRule, EnumerableProjectRule,
EnumerableFilterRule, EnumerableJoinRule, EnumerableTableScanRule)
rewrite each logical operator into its Enumerable counterpart:

EnumerableSort [sort0=$0,dir=ASC,fetch=5]
  EnumerableProject [ID=$0,TITLE=$1,YEAR=$2,FNAME=$5,LNAME=$6]
    EnumerableFilter [$2>1830]
      EnumerableJoin [$3==$4,type=left]
        EnumerableTableScan [Book]
        EnumerableTableScan [Author]
Physical to executable plan

RelNode → EnumerableInterpretable → Java code

EnumerableSort [sort0=$0,dir=ASC,fetch=5]
  EnumerableProject [ID=$0,TITLE=$1,YEAR=$2,FNAME=$5,LNAME=$6]
    EnumerableFilter [$2>1830]
      EnumerableJoin [$3==$4,type=left]
        EnumerableTableScan [Book]
        EnumerableTableScan [Author]
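
To execute such a plan, Calcite provides RelRunner. A minimal sketch,
assuming a Calcite JDBC connection and a physical plan `rel` already in
the Enumerable convention (RelRunner.prepare(RelNode) is the method as
of Calcite 1.27):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.calcite.tools.RelRunner;

// `connection` is a Calcite JDBC connection; `rel` is the plan above.
RelRunner runner = connection.unwrap(RelRunner.class);
try (PreparedStatement ps = runner.prepare(rel);
     ResultSet rs = ps.executeQuery()) {
  while (rs.next()) {
    System.out.println(rs.getObject(1));
  }
}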
4. Coding module I: Exercises (Homework)

Exercise I: Execute more SQL queries

Include GROUP BY and other types of clauses:

SELECT o.o_custkey, COUNT(*)
FROM orders AS o
GROUP BY o.o_custkey

● Missing rule to convert LogicalAggregate to EnumerableAggregate
● Add EnumerableRules.ENUMERABLE_AGGREGATE_RULE to the planner
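
A one-line sketch of the fix, assuming `planner` is the RelOptPlanner
instance used by the tutorial's runner class:

// org.apache.calcite.adapter.enumerable.EnumerableRules
planner.addRule(EnumerableRules.ENUMERABLE_AGGREGATE_RULE);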
Exercise II: Improve performance by applying more optimization rules

Push the filter below the join:

SELECT c.c_name, o.o_orderkey, o.o_orderdate
FROM customer AS c
INNER JOIN orders AS o ON c.c_custkey = o.o_custkey
WHERE c.c_custkey < 3
ORDER BY c.c_name, o.o_orderkey

1. Add rule CoreRules.FILTER_INTO_JOIN to the planner
2. Compare plans before and after (or logical and physical)
3. Check cost estimates by using SqlExplainLevel.ALL_ATTRIBUTES
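
A sketch of steps 1 and 3, again assuming `planner` and an optimized
plan `best` from the tutorial's runner class:

// org.apache.calcite.rel.rules.CoreRules
planner.addRule(CoreRules.FILTER_INTO_JOIN);

// Print the chosen plan with cost estimates and all attributes.
System.out.println(
    RelOptUtil.toString(best, SqlExplainLevel.ALL_ATTRIBUTES));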
Exercise II: Improve performance by applying more optimization rules

FilterIntoJoinRule (a RelRule) pushes the filter below the join.

Before:
LogicalSort [sort0=$0,dir0=ASC,sort1=$1,dir1=ASC]
  LogicalProject [C_NAME=$1,O_ORDERKEY=$8,O_ORDERDATE=$12]
    LogicalFilter [$0<3]
      LogicalJoin [$0=$9,type=inner]
        LogicalTableScan [CUSTOMER]
        LogicalTableScan [ORDERS]

After:
LogicalSort [sort0=$0,dir0=ASC,sort1=$1,dir1=ASC]
  LogicalProject [C_NAME=$1,O_ORDERKEY=$8,O_ORDERDATE=$12]
    LogicalJoin [$0=$9,type=inner]
      LogicalFilter [$0<3]
        LogicalTableScan [CUSTOMER]
      LogicalTableScan [ORDERS]
Exercise III: Use the RelBuilder API to construct the logical plan

Open LuceneBuilderProcessor.java and complete the TODOs

Q1: SELECT o.o_custkey, COUNT(*)
    FROM orders AS o
    GROUP BY o.o_custkey

Q2: SELECT o.o_custkey, COUNT(*)
    FROM orders AS o
    WHERE o.o_totalprice > 220388.06
    GROUP BY o.o_custkey
Exercise III: Use the RelBuilder API to construct the logical plan

RelNode q2 = builder
    .scan("orders")
    .filter(
        builder.call(
            SqlStdOperatorTable.GREATER_THAN,
            builder.field("o_totalprice"),
            builder.literal(220388.06)))
    .aggregate(
        builder.groupKey("o_custkey"),
        builder.count())
    .build();   // for Q1, simply omit the .filter(...) step
5. Hybrid planning

Calling convention

Initially all nodes belong to the “logical” calling convention.
The logical calling convention cannot be implemented, so it has
infinite cost.

[Figure: a plan tree - a Join over a Filter(Scan) and a Join(Scan,
Scan) - all in the logical convention.]

Calling convention

Tables can’t be moved, so there is only one choice of calling
convention for each table. Examples:
● Enumerable
● Druid
● Drill
● HBase
● JDBC
Calling convention

Rules fire to convert nodes to particular calling conventions.

The calling convention propagates through the tree.

Because this is Volcano, each node can have multiple conventions.

[Figure: over three slides, the conventions spread from the scans up
through the joins and the filter.]
Converters

To keep things honest, we need to insert a converter at each point
where the convention changes.

(Recall: Volcano has an enforcer for each trait. Convention is a
physical property, and converter is the enforcer.)

BlueFilterRule:
  LogicalFilter(BlueToLogical(Blue b))
  →
  BlueToLogical(BlueFilter(b))

[Figure: the plan tree with “Green to Logical” and “Blue to Logical”
converters inserted above the green and blue subtrees; after the rule
fires, the filter moves below its converter.]
Generating programs to implement hybrid plans

Hybrid plans are glued together using an engine - a convention that
does not have a storage format. (Example engines: Drill, Spark,
Presto.)

To implement, we generate a program that calls out to query1 and
query2.

The "Blue-to-Orange" converter is typically a function in the Orange
language that embeds a Blue query. Similarly "Green-to-Orange".

[Figure: the same plan with “Blue to Orange” and “Green to Orange”
converters feeding the top Join.]
6. Coding module II: Custom operators/rules (Homework)

What do we want to achieve?

Before (pure Enumerable):
EnumerableSort [sort0=$0,dir0=ASC,sort1=$1,dir1=ASC]
  EnumerableCalc [C_NAME=$1,O_ORDERKEY=$8,O_ORDERDATE=$12]
    EnumerableJoin [$0=$9,type=inner]
      EnumerableCalc [$0<3]
        EnumerableTableScan [CUSTOMER]
      EnumerableTableScan [ORDERS]

After (pushing work into Lucene):
EnumerableSort [sort0=$0,dir0=ASC,sort1=$1,dir1=ASC]
  EnumerableCalc [C_NAME=$1,O_ORDERKEY=$8,O_ORDERDATE=$12]
    EnumerableJoin [$0=$9,type=inner]
      LuceneToEnumerableConverter
        LuceneFilter [$0<3]
          LuceneTableScan [CUSTOMER]
      LuceneToEnumerableConverter
        LuceneTableScan [ORDERS]
What do we need?

Two calling conventions:
1. Enumerable
2. Lucene

Three custom operators:
1. LuceneTableScan (STEP 1)
2. LuceneToEnumerableConverter (STEP 3)
3. LuceneFilter (STEP 5)

Three custom conversion rules:
1. LogicalTableScan → LuceneTableScan (STEP 2)
2. LogicalFilter → LuceneFilter (STEP 6)
3. LuceneANY → LuceneToEnumerableConverter (STEP 4)

Target plan:
EnumerableSort [sort0=$0,dir0=ASC,sort1=$1,dir1=ASC]
  EnumerableCalc [C_NAME=$1,O_ORDERKEY=$8,O_ORDERDATE=$12]
    EnumerableJoin [$0=$9,type=inner]
      LuceneToEnumerableConverter
        LuceneFilter [$0<3]
          LuceneTableScan [CUSTOMER]
      LuceneToEnumerableConverter
        LuceneTableScan [ORDERS]

A sketch of conversion rule 2 follows below.
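
A minimal sketch of conversion rule 2 (LogicalFilter → LuceneFilter) as
a Calcite ConverterRule. LuceneRel.LUCENE and LuceneFilter stand for
the tutorial's custom convention and operator; their exact shapes here
are assumptions, not the tutorial's actual signatures:

import org.apache.calcite.plan.Convention;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.convert.ConverterRule;
import org.apache.calcite.rel.logical.LogicalFilter;

public class LuceneFilterRule extends ConverterRule {
  public static final LuceneFilterRule INSTANCE = new LuceneFilterRule();

  private LuceneFilterRule() {
    // Match a LogicalFilter in no particular convention and promise to
    // produce an equivalent node in the Lucene convention.
    super(LogicalFilter.class, Convention.NONE, LuceneRel.LUCENE,
        "LuceneFilterRule");
  }

  @Override public RelNode convert(RelNode rel) {
    final LogicalFilter filter = (LogicalFilter) rel;
    // Request the input in the Lucene convention, then wrap it in a
    // LuceneFilter (hypothetical constructor).
    final RelNode input = convert(filter.getInput(),
        filter.getInput().getTraitSet().replace(LuceneRel.LUCENE));
    return new LuceneFilter(filter.getCluster(),
        filter.getTraitSet().replace(LuceneRel.LUCENE),
        input, filter.getCondition());
  }
}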
7. Volcano Planner internals

Volcano planning algorithm

Based on two papers by Goetz Graefe in the 1990s (Volcano, Cascades),
now the industry standard for cost-based optimization.

Dynamic programming: to optimize a relational expression R0, convert
it into equivalent expressions {R1, R2, …}, and pick the one with the
lowest cost.

Much of the cost of R is the cost of its input(s). So we apply dynamic
programming to its inputs, too.

[Figure: over four slides, a graph of equivalent expressions grows -
R0, R1, R2 over inputs S0, T0, T1 and U0.]
Volcano planning algorithm

We keep equivalence sets of expressions (class RelSet).

Each input of a relational expression is an equivalence set + required
physical properties (class RelSubset).
Volcano planning algorithm

Each relational expression has a memo (digest), so we will recognize
it if we generate it again.
Volcano planning algorithm

If an expression transforms to an expression in another equivalence
set, we can merge those equivalence sets.
Matches and queues

We register a new RelNode by adding it to a RelSet.

Each rule instance declares a pattern of RelNode types (and other
properties) that it will match. Suppose we have:
● Filter-on-Project
● Project-on-Project
● Project-on-Join

On register, we detect rules that are newly matched. (4 matches.)

[Figure: over three slides, new Project nodes are registered in an
expression graph - a Project and a Filter over a Union, which sits
over a Join, a Scan and a Project.]
Matches and queues

Should we fire these matched rules immediately?

No! Because rule match #1 would generate new matches… which would
generate new matches… and we'd never get to match #2.

Instead, we put the matched rules on a queue. The queue allows us to:
● Search breadth-first (rather than depth-first)
● Prioritize (fire more "important" rules first)
● Potentially terminate when we have a "good enough" plan

The algorithm:
0. Register each RelNode in the initial tree.
1. Each time a RelNode is registered, find rule matches, and put
   RuleMatch objects on the queue.
2. If the queue is empty, or the cost is good enough, we're done.
3. Pop the top rule match. Fire the rule. Register each RelNode
   generated by the rule, merging sets if equivalences are found.
   Goto 1.
Other planner engines, same great rules

Three planner engines:
● Volcano
● Volcano top-down (Cascades style)
● Hep - applies rules in a strict "program"

The same rules are used by all engines.

It takes a lot of time and effort to write a high-quality rule. Rules
can be reused, tested, improved, and they compose with other rules.
Calcite's library of rules is valuable.
8. Dialects

Calcite architecture

At what points in the Calcite stack do ‘languages’ exist?
● Incoming SQL
● Validating SQL against built-in operators
● Type system (e.g. max size of INTEGER type)
● JDBC adapter generates SQL
● Other adapters generate other languages

[Figure: the Apache Calcite stack - SQL parser & validator on top of
relational algebra, query planner and pluggable rewrite rules, with
Enumerable, JDBC, MongoDB, File (CSV, JSON, Http), Apache Kafka and
Apache Spark adapters below.]
Parsing & validating SQL - what knobs can I turn?

SELECT deptno AS d,
       SUM(sal) AS [sumSal]
FROM [HR].[Emp]
WHERE ename NOT ILIKE "A%"
GROUP BY d
ORDER BY 1, 2 DESC

PARSER_FACTORY = "org.apache.calcite.sql.parser.impl.SqlParserImpl.FACTORY"
Lex.unquotedCasing = Casing.TO_UPPER
Lex.quotedCasing = Casing.UNCHANGED
Lex.quoting = Quoting.BRACKET
Lex.charLiteralStyle = CharLiteralStyle.BQ_DOUBLE
FUN = "postgres" (ILIKE is not standard SQL)
SqlConformance.isGroupByAlias() = true
SqlConformance.isSortByOrdinal() = true
SqlValidator.Config.defaultNullCollation = HIGH
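
A sketch of setting some of these knobs programmatically, assuming
Calcite's fluent SqlParser.Config API:

import org.apache.calcite.avatica.util.Casing;
import org.apache.calcite.avatica.util.Quoting;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;

SqlParser.Config config = SqlParser.config()
    .withUnquotedCasing(Casing.TO_UPPER)
    .withQuotedCasing(Casing.UNCHANGED)
    .withQuoting(Quoting.BRACKET);
SqlNode ast = SqlParser.create(sql, config).parseQuery();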
SQL dialect - APIs and properties

CalciteConnectionProperty.LEX
  enum Lex
  enum Quoting
  enum Casing
  enum CharLiteralStyle

CalciteConnectionProperty.CONFORMANCE
  interface SqlConformance

CalciteConnectionProperty.FUN
  interface SqlOperatorTable
  class SqlStdOperatorTable
  class SqlLibraryOperators
  class SqlOperator
  class SqlFunction extends SqlOperator
  class SqlAggFunction extends SqlFunction

Pluggable parser, lexical, conformance, operators (SQL parser & validator):
  interface SqlParserImplFactory

Pluggable rewrite rules (query planner):
  class RelRule

Pluggable SQL dialect (JDBC adapter):
  class SqlDialect
  interface SqlDialectFactory
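
For example, the LEX and FUN properties can be set directly on a
Calcite JDBC connection (a sketch; the chosen values are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

Properties props = new Properties();
props.setProperty("lex", "MYSQL");    // lexical conventions
props.setProperty("fun", "postgres"); // operator library (enables ILIKE)
Connection c = DriverManager.getConnection("jdbc:calcite:", props);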
Contributing a dialect (or anything!) to Calcite

● For your first code contribution, pick a small bug or feature.
● Introduce yourself! Email dev@, saying what you plan to do.
● Create a JIRA case describing the problem.
● To understand the code, find similar features. Run their tests in a
  debugger.
● Write 1 or 2 tests for your feature.
● Submit a pull request (PR).
Other front-end languages

Calcite is an excellent platform for implementing your own data
language (SQL, Pig, Datalog, Morel, …).

Write a parser for your language, use RelBuilder to translate to
relational algebra, and you can use any of Calcite's back-end
implementations.

[Figure: the Calcite stack with the SQL parser & validator replaced by
a custom parser feeding RelBuilder, over relational algebra, the query
planner, adapters, physical operators and storage.]
9. Materialized views

Backwards planning

Until now, we have seen forward planning. Forward planning transforms
an expression (R0) to many equivalent forms and picks the one with
lowest cost (Ropt). Backwards planning transforms an expression to an
equivalent form (RN) that contains a target expression (T).

[Figure: forwards planning grows an equivalence set {R0, R1, R2, …}
around R0 and picks Ropt; backwards planning searches the equivalence
set for an RN built on top of the target T.]
Applications of backwards planning
Indexes (e.g. b-tree indexes). An index is a derived data structure whose contents
can be described as a relational expression (generally project-sort). When we are
planning a query, it already exists (i.e. the cost has already been paid).
Summary tables. A summary table is a derived data structure (generally
filter-project-join-aggregate).
Replicas with different physical properties (e.g. copy the table from New York to
Tokyo, or copy the table and partition by month(orderDate), sort by productId).
Incremental view maintenance. Materialized view V is populated from base table
T. Yesterday, we populated V with V0 = Q(T0). Today we want to make its contents
equal to V1 = Q(T1). Can we find and apply a delta query, dQ = Q(T1 - T0)?
Materialized views in Calcite

{
  "schemas": [ {
    "name": "HR",
    "tables": [ {
      "name": "emp"
    } ],
    "materializations": [ {
      "table": "i_emp_job",
      "sql": "SELECT job, empno FROM emp ORDER BY job, empno"
    }, {
      "table": "add_emp_deptno",
      "sql": "SELECT deptno, SUM(sal) AS ss, COUNT(*) AS c FROM emp GROUP BY deptno"
    } ]
  } ]
}

/** Transforms a relational expression into a
 * semantically equivalent relational expression,
 * according to a given set of rules and a cost model. */
public interface RelOptPlanner {
  /** Defines an equivalence between a table and a query. */
  void addMaterialization(RelOptMaterialization materialization);

  /** Finds the most efficient expression to implement this query. */
  RelNode findBestExp();
}

/** Records that a particular query is materialized
 * by a particular table. */
public class RelOptMaterialization {
  public final RelNode tableRel;
  public final List<String> qualifiedTableName;
  public final RelNode queryRel;
}

You can define materializations in a JSON model, via the planner API, or via
CREATE MATERIALIZED VIEW DDL (not shown).
More about materialized views
● There are several algorithms to rewrite queries to match materialized views
● A lattice is a data structure to model a star schema
● Calcite has algorithms to recommend an optimal set of summary tables for
a lattice (given expected queries, and statistics about column cardinality)
● Data profiling algorithms estimate the cardinality of all combinations of
columns
10. Working with spatial data

Spatial query

Find all restaurants within 1.5 distance units of my current location:

SELECT *
FROM Restaurants AS r
WHERE ST_Distance(
    ST_MakePoint(r.x, r.y),
    ST_MakePoint(6, 7)) < 1.5

restaurant      | x | y
----------------|---|---
Zachary’s pizza | 3 | 1
King Yen        | 7 | 7
Filippo’s       | 7 | 4
Station burger  | 5 | 6

We cannot use a B-tree index (it can sort points by x or y
coordinates, but not both), and specialized spatial indexes (such as
R*-trees) are not generally available.

[Figure: the restaurants plotted on a grid, with the query point (6, 7)
and its radius-1.5 circle.]
Hilbert space-filling curve

● A space-filling curve invented by mathematician David Hilbert
● Every (x, y) point has a unique position on the curve
● Points near to each other typically have Hilbert indexes close together
Using the Hilbert index

Add a restriction based on h, a restaurant’s distance along the
Hilbert curve. We must keep the original restriction, due to false
positives:

SELECT *
FROM Restaurants AS r
WHERE (r.h BETWEEN 35 AND 42
    OR r.h BETWEEN 46 AND 46)
AND ST_Distance(
    ST_MakePoint(r.x, r.y),
    ST_MakePoint(6, 7)) < 1.5

restaurant      | x | y | h
----------------|---|---|----
Zachary’s pizza | 3 | 1 | 5
King Yen        | 7 | 7 | 41
Filippo’s       | 7 | 4 | 52
Station burger  | 5 | 6 | 36
Telling the optimizer

1. Declare h as a generated column
2. Sort the table by h

The planner can now convert spatial range queries into a range scan.
This does not require a specialized spatial index such as an R*-tree,
and is very efficient on a sorted table such as HBase. There are
similar techniques for other spatial patterns (e.g. region-to-region
join).

CREATE TABLE Restaurants (
  restaurant VARCHAR(20),
  x DOUBLE,
  y DOUBLE,
  h DOUBLE GENERATED ALWAYS AS
      ST_Hilbert(x, y) STORED)
SORT KEY (h);

restaurant      | x | y | h
----------------|---|---|----
Zachary’s pizza | 3 | 1 | 5
Station burger  | 5 | 6 | 36
King Yen        | 7 | 7 | 41
Filippo’s       | 7 | 4 | 52
11. Research using Apache Calcite

Yes, VLDB 2021! Go to the talk! 0900 Wednesday.

Thank you!

@julianhyde @szampetak
https://calcite.apache.org
Resources
● Calcite project https://calcite.apache.org
● Materialized view algorithms
https://calcite.apache.org/docs/materialized_views.html
● JSON model https://calcite.apache.org/docs/model.html
● Lazy beats smart and fast (DataEng 2018) - MVs, spatial, profiling
https://www.slideshare.net/julianhyde/lazy-beats-smart-and-fast
● Efficient spatial queries on vanilla databases (ApacheCon 2018)
https://www.slideshare.net/julianhyde/spatial-query-on-vanilla-databases
● Graefe, McKenna. The Volcano Optimizer Generator, 1991
● Graefe. The Cascades Framework for Query Optimization, 1995
● Slideshare (past presentations by Julian Hyde, including several about
Apache Calcite) https://www.slideshare.net/julianhyde
