Apache Calcite Tutorial
Steps
1. Clone the GitHub repository:
   git clone --branch boss21 https://github.com/zabetak/calcite-tutorial.git
2. Load it into an IDE (IntelliJ preferred):
   a. Click Open
   b. Navigate to calcite-tutorial
   c. Select the pom.xml file
   d. Choose “Open as Project”
3. Compile the project:
   java -version
   cd calcite-tutorial
   ./mvnw package -DskipTests
What, where, how are data stored?
Two entities, stored on the filesystem as XML, CSV, JSON, or binary files, or in one of 360+ DBMS:
AUTHOR (id int, fname string, lname string, birth date)
BOOK (id int, title string, price decimal, year int, author → AUTHOR.id, 0..1)
Apache Lucene
★ Open-source search engine
★ Java library
★ Powerful indexing & search features
★ Spell checking, hit highlighting
★ Advanced analysis/tokenization capabilities
★ ACID transactions
★ Ultra compact memory/disk format
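To make the “Java library” point concrete, here is a tiny, hedged sketch of a Lucene search (the index path and field name are illustrative, not the tutorial's):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

// Open an existing index and run a term query against the "title" field.
try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("bookstore-index")))) {
  IndexSearcher searcher = new IndexSearcher(reader);
  TopDocs hits = searcher.search(new TermQuery(new Term("title", "travels")), 10);
  System.out.println(hits.totalHits);
}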
How to query the data?
1. Retrieve books and authors
2. Display the image, title, and price of the book, along with the first name & last name of the author
3. Sort the books based on their id (price, or something else)
4. Show results in groups of five
SELECT b.id, b.title, b.year, a.fname, a.lname
FROM Book b
LEFT OUTER JOIN Author a ON b.author=a.id
ORDER BY b.id
LIMIT 5
Query processor architecture
Query → Parser (reads the Schema via a CatalogReader) → Relational Algebra → Query planner (applies Rules, consults Metadata: cost, statistics) → Relational Algebra → Execution engine → Results
Apache Calcite
SQL query / API calls (+ Schema) → SqlParser → SqlNode → SqlToRelConverter (with CatalogReader) → RelNode → RelOptPlanner (with RelRule and RelMetadataProvider) → RelNode → RelRunner → Results
Dev & extensions: the same pipeline, highlighting the pluggable pieces (SqlParser, CatalogReader, RelRule, RelMetadataProvider, RelRunner) as the points where developers customize Calcite.
2. CSV Adapter Demo
Adapter
● Implement the SchemaFactory interface
● Connect to a data source using parameters
● Extract the schema - return a list of tables
● Push down processing to the data source:
  ● A set of planner rules
  ● Calling convention (optional)
  ● Query model & query generator (optional)

JSON model:
{
  "schemas": [
    {
      "name": "BOOKSTORE",
      "type": "custom",
      "factory": "org.apache.calcite.adapter.file.FileSchemaFactory",
      "operand": {
        "directory": "bookstore"
      }
    }
  ]
}
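With such a model, Calcite's JDBC driver can query the files directly. A minimal sketch, assuming the model above is saved as bookstore.json and that the files in the bookstore directory yield a table named AUTHOR (the table name is an assumption for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class CsvAdapterDemo {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.setProperty("model", "bookstore.json"); // path to the JSON model shown above

    try (Connection conn = DriverManager.getConnection("jdbc:calcite:", props);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT * FROM \"BOOKSTORE\".\"AUTHOR\"")) {
      while (rs.next()) {
        System.out.println(rs.getString(1)); // print the first column of each row
      }
    }
  }
}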
3. Coding module I: Main components
Setup schema & type factory
<<Interface>> Schema (has * tables)
  +getTable(name: String): Table
<<Interface>> Table
  +getRowType(typeFactory: RelDataTypeFactory): RelDataType
<<Interface>> RelDataTypeFactory
  +createJavaType(clazz: Class): RelDataType
  +createSqlType(typeName: SqlTypeName): RelDataType
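As a concrete illustration of these interfaces, here is a minimal sketch of a Table that reports the BOOK row type through the type factory (the class and column names follow the earlier schema diagram and are illustrative, not the tutorial's actual code):

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

public class BookTable extends AbstractTable {
  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    // Row type: (ID INTEGER, TITLE VARCHAR, PRICE DECIMAL, YEAR INTEGER, AUTHOR INTEGER)
    return typeFactory.builder()
        .add("ID", SqlTypeName.INTEGER)
        .add("TITLE", SqlTypeName.VARCHAR)
        .add("PRICE", SqlTypeName.DECIMAL)
        .add("YEAR", SqlTypeName.INTEGER)
        .add("AUTHOR", SqlTypeName.INTEGER)
        .build();
  }
}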
Query to Abstract Syntax Tree (AST)
SQL query → SqlParser → SqlNode

SELECT b.id, b.title, b.year, a.fname, a.lname
FROM Book AS b
LEFT OUTER JOIN Author a ON b.author = a.id
WHERE b.year > 1830
ORDER BY b.id
LIMIT 5

AST nodes: OrderBy (b.id, FETCH 5), Select (a NodeList of Identifiers), Join, BasicCall (b.year > 1830), NumericLiteral (5).
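A minimal sketch of this step using Calcite's parser directly; parsing needs no schema, and the query here is a trimmed-down variant of the one above:

import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;

SqlParser parser = SqlParser.create(
    "SELECT b.id, b.title FROM Book AS b ORDER BY b.id");
SqlNode ast = parser.parseQuery(); // throws SqlParseException on syntax errors
System.out.println(ast.getKind()); // ORDER_BY - the root node of this AST
System.out.println(ast);           // the query re-rendered from the AST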
AST to logical plan (SqlToRelConverter):
OrderBy (b.id FETCH 5) → LogicalSort [sort0=$0,dir=ASC,fetch=5]
Select (b.id, b.title, b.year, a.fname, a.lname) → LogicalProject [ID=$0,TITLE=$1,YEAR=$2,FNAME=$5,LNAME=$6]
BasicCall (b.year > 1830) → LogicalFilter [$2>1830]
Join → LogicalJoin [$3==$4,type=left]
Logical to physical plan (RelRule):
LogicalSort [sort0=$0,dir=ASC,fetch=5] → (EnumerableSortRule) → EnumerableSort [sort0=$0,dir=ASC,fetch=5]
LogicalProject [ID=$0,TITLE=$1,YEAR=$2,FNAME=$5,LNAME=$6] → (EnumerableProjectRule) → EnumerableProject [ID=$0,TITLE=$1,YEAR=$2,FNAME=$5,LNAME=$6]
LogicalFilter [$2>1830] → (EnumerableFilterRule) → EnumerableFilter [$2>1830]
LogicalJoin [$3==$4,type=left] → (EnumerableJoinRule) → EnumerableJoin [$3==$4,type=left]
LogicalTableScan [Book], [Author] → (EnumerableTableScanRule) → EnumerableTableScan [Book], [Author]
Physical to Executable plan
RelNode → EnumerableInterpretable → Java code

EnumerableSort [sort0=$0,dir=ASC,fetch=5]
  EnumerableProject [ID=$0,TITLE=$1,YEAR=$2,FNAME=$5,LNAME=$6]
    EnumerableFilter [$2>1830]
      EnumerableJoin [$3==$4,type=left]
        EnumerableTableScan [Book]
        EnumerableTableScan [Author]
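Putting the whole pipeline together with Calcite's Frameworks/Planner facade - a rough sketch, assuming rootSchema has the bookstore schema registered as in the coding module (schema and table names are illustrative):

import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

SchemaPlus rootSchema = Frameworks.createRootSchema(true);
// ... register the bookstore schema on rootSchema here (elided) ...

FrameworkConfig config = Frameworks.newConfigBuilder()
    .defaultSchema(rootSchema)
    .build();
Planner planner = Frameworks.getPlanner(config);

SqlNode parsed = planner.parse("SELECT b.id, b.title FROM Book AS b ORDER BY b.id");
SqlNode validated = planner.validate(parsed);          // resolves names & types against the schema
RelNode logicalPlan = planner.rel(validated).project();
System.out.println(RelOptUtil.toString(logicalPlan));  // prints the LogicalSort / LogicalProject / ... tree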
4. Coding module I: Exercises (Homework)
Exercise I: Execute more SQL queries
Include GROUP BY and other types of clauses.

Applying a planner rule (FilterIntoJoinRule pushes the filter below the join, onto the input it references):

Before:
LogicalSort [sort0=$0,dir0=ASC,sort1=$1,dir1=ASC]
  LogicalProject [C_NAME=$1,O_ORDERKEY=$8,O_ORDERDATE=$12]
    LogicalFilter [$0<3]
      LogicalJoin [$0=$9,type=inner]

After (FilterIntoJoinRule):
LogicalSort [sort0=$0,dir0=ASC,sort1=$1,dir1=ASC]
  LogicalProject [C_NAME=$1,O_ORDERKEY=$8,O_ORDERDATE=$12]
    LogicalJoin [$0=$9,type=inner]
      LogicalFilter [$0<3]
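To apply a rule like this programmatically outside the full Volcano flow, Calcite's heuristic HepPlanner works well - a minimal sketch, assuming logicalPlan is a RelNode shaped like the “Before” tree (FilterIntoJoinRule is exposed as CoreRules.FILTER_INTO_JOIN in recent Calcite versions):

import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.CoreRules;

HepProgram program = new HepProgramBuilder()
    .addRuleInstance(CoreRules.FILTER_INTO_JOIN)   // push the LogicalFilter below the LogicalJoin
    .build();
HepPlanner hepPlanner = new HepPlanner(program);
hepPlanner.setRoot(logicalPlan);                   // the "Before" plan above
RelNode optimized = hepPlanner.findBestExp();      // the "After" plan above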
Constructing a plan programmatically with RelBuilder:

final RelNode node = builder
    .scan("orders")
    .filter(
        builder.call(
            SqlStdOperatorTable.GREATER_THAN,
            builder.field("o_totalprice"),
            builder.literal(220388.06)))
    .aggregate(
        builder.groupKey("o_custkey"),
        builder.count())
    .build();
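The builder above is a RelBuilder. One way to obtain one, sketched under the assumption that rootSchema contains the "orders" table used in the exercises:

import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.RelBuilder;

FrameworkConfig config = Frameworks.newConfigBuilder()
    .defaultSchema(rootSchema)   // schema that contains the "orders" table (assumed)
    .build();
RelBuilder builder = RelBuilder.create(config);
// The RelNode returned by build() can then be optimized and executed just like
// a plan produced by SqlToRelConverter.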
5. Hybrid planning
Calling convention
Initially all nodes belong to the “logical” calling convention. The logical calling convention cannot be implemented, so it has infinite cost.

Example plan:
Join
  Filter
    Join
      Scan
      Scan
  Scan
Calling convention
Tables can’t be moved, so there is only one choice of calling convention for each table. Examples:
● Enumerable
● Druid
● Drill
● HBase
● JDBC
Calling convention
Rules fire to convert nodes to particular calling conventions. The calling convention propagates through the tree. Because this is Volcano, each node can have multiple conventions.
Converters
To keep things honest, we need to insert a converter at each point where the convention changes (e.g. “Green to Logical”, “Blue to Logical”).

(Recall: Volcano has an enforcer for each trait. Convention is a physical property, and the converter is its enforcer.)

BlueFilterRule:
  LogicalFilter(BlueToLogical(Blue b))
  →
  BlueToLogical(BlueFilter(b))
Generating programs to implement hybrid plans
Hybrid plans are glued together using an engine - a convention that does not have a storage format. (Example engines: Drill, Spark, Presto.) To implement, we generate a program that calls out to query1 and query2, with converters (e.g. “Blue to Orange”, “Green to Orange”) at the boundaries.
What do we need?

Two calling conventions:
1. Enumerable
2. Lucene

Three custom operators:
1. LuceneTableScan (STEP 1)
2. LuceneToEnumerableConverter (STEP 3)
3. LuceneFilter (STEP 5)

Three custom conversion rules:
1. LogicalTableScan → LuceneTableScan (STEP 2)
2. LogicalFilter → LuceneFilter (STEP 6)
3. LuceneANY → LuceneToEnumerableConverter (STEP 4)

Target plan:
EnumerableSort [sort0=$0,dir0=ASC,sort1=$1,dir1=ASC]
  EnumerableCalc [C_NAME=$1,O_ORDERKEY=$8,O_ORDERDATE=$12]
    EnumerableJoin [$0=$9,type=inner]
      LuceneToEnumerableConverter
        LuceneFilter [$0<3]
          LuceneTableScan [CUSTOMER]
      LuceneToEnumerableConverter
        LuceneTableScan [ORDERS]
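To give a flavour of what such a conversion rule looks like, here is a rough sketch of STEP 6 (LogicalFilter → LuceneFilter). It assumes a LUCENE convention and the LuceneFilter operator from STEP 5; it is illustrative, not the module's actual code, and newer Calcite versions prefer the Config-based rule constructors:

import org.apache.calcite.plan.Convention;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.convert.ConverterRule;
import org.apache.calcite.rel.logical.LogicalFilter;

// Marker interface + convention for operators that run inside Lucene (hypothetical).
interface LuceneRel extends RelNode {
  Convention LUCENE = new Convention.Impl("LUCENE", LuceneRel.class);
}

class LuceneFilterRule extends ConverterRule {
  LuceneFilterRule() {
    // Match a LogicalFilter in the NONE convention and convert it to LUCENE.
    super(LogicalFilter.class, Convention.NONE, LuceneRel.LUCENE, "LuceneFilterRule");
  }

  @Override public RelNode convert(RelNode rel) {
    LogicalFilter filter = (LogicalFilter) rel;
    return new LuceneFilter(                          // the custom operator from STEP 5 (not shown here)
        filter.getCluster(),
        filter.getTraitSet().replace(LuceneRel.LUCENE),
        convert(filter.getInput(), LuceneRel.LUCENE), // ask for the input in the LUCENE convention
        filter.getCondition());
  }
}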
7. Volcano Planner Internals
Volcano planning algorithm
● Based on two papers by Goetz Graefe in the 1990s (Volcano, Cascades), now the industry standard for cost-based optimization.
● We keep equivalence sets of expressions (class RelSet).
● Each relational expression has a memo (digest), so we will recognize it if we generate it again.
● If an expression transforms to an expression in another equivalence set, we can merge those equivalence sets.
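A rough sketch of driving this algorithm programmatically, assuming logicalPlan is a RelNode whose cluster was created over a VolcanoPlanner (as in the coding modules); the Enumerable rules shown are just the ones the earlier example needs:

import org.apache.calcite.adapter.enumerable.EnumerableConvention;
import org.apache.calcite.adapter.enumerable.EnumerableRules;
import org.apache.calcite.plan.RelOptPlanner;
import org.apache.calcite.plan.RelTraitSet;
import org.apache.calcite.rel.RelNode;

RelOptPlanner planner = logicalPlan.getCluster().getPlanner();

// Register the physical (Enumerable) implementation rules.
planner.addRule(EnumerableRules.ENUMERABLE_SORT_RULE);
planner.addRule(EnumerableRules.ENUMERABLE_PROJECT_RULE);
planner.addRule(EnumerableRules.ENUMERABLE_FILTER_RULE);
planner.addRule(EnumerableRules.ENUMERABLE_JOIN_RULE);
planner.addRule(EnumerableRules.ENUMERABLE_TABLE_SCAN_RULE);

// Ask for the same plan in the Enumerable calling convention and let Volcano search.
RelTraitSet desired = logicalPlan.getTraitSet().replace(EnumerableConvention.INSTANCE);
planner.setRoot(planner.changeTraits(logicalPlan, desired));
RelNode physicalPlan = planner.findBestExp();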
Matches and queues
Suppose we have these rule matches:
● Filter-on-Project
● Project-on-Project
● Project-on-Join

Firing the first match would generate new matches… which would generate new matches… and we’d never get to match #2. Instead, we put the matched rules on a queue: pop the top rule match, fire the rule, register each RelNode generated by the rule (merging sets if equivalences are found), and repeat.
It takes a lot of time and effort to write a high-quality rule. Rules can be reused, tested, improved, and they compose with other rules. Calcite's library of rules is valuable.
8. Dialects
Calcite architecture
At what points in the Calcite stack do ‘languages’ exist?
● Incoming SQL
● Validating SQL against built-in operators
● Type system (e.g. max size of the INTEGER type)
● The JDBC adapter generates SQL
● Other adapters generate other languages

Stack: SQL parser & validator → relational algebra → query planner (with pluggable rewrite rules) → adapters: Enumerable, MongoDB, JDBC, File (CSV, JSON, Http), Apache Kafka, Apache Spark.
Parsing & validating SQL

SELECT deptno AS d,
       SUM(sal) AS [sumSal]
FROM [HR].[Emp]
WHERE ename NOT ILIKE "A%"
GROUP BY d
ORDER BY 1, 2 DESC

Settings needed to parse and validate this dialect:
● PARSER_FACTORY (interface SqlParserImplFactory)
● Lex.unquotedCasing = Casing.TO_UPPER
● Lex.quoting = Quoting.BRACKET
● Lex.quotedCasing = Casing.UNCHANGED
● Lex.charLiteralStyle = CharLiteralStyle.BQ_DOUBLE
● FUN = "postgres" (ILIKE is not standard SQL)
● SqlConformance.isGroupByAlias() = true
● SqlConformance.isSortByOrdinal() = true
● SqlValidator.Config.defaultNullCollation = HIGH
● CalciteConnectionProperty.CONFORMANCE (interface SqlConformance)
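These knobs are typically supplied as Calcite JDBC connection properties. A small sketch - lex, fun, and conformance are real property names, but the exact values accepted can vary by Calcite version:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

Properties props = new Properties();
props.setProperty("lex", "SQL_SERVER");       // bracket quoting, unquoted identifiers to upper case
props.setProperty("fun", "postgresql");       // operator library that provides ILIKE (the slide's FUN = "postgres")
props.setProperty("conformance", "LENIENT");  // allows GROUP BY alias and ORDER BY ordinal
Connection conn = DriverManager.getConnection("jdbc:calcite:", props);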
Backwards planning
Until now, we have seen forward planning. Forward planning transforms an expression (R0) to many equivalent forms and picks the one with the lowest cost (Ropt). Backwards planning transforms an expression to an equivalent form (RN) that contains a target expression (T).
Applications of backwards planning
Indexes (e.g. b-tree indexes). An index is a derived data structure whose contents
can be described as a relational expression (generally project-sort). When we are
planning a query, it already exists (i.e. the cost has already been paid).
Summary tables. A summary table is a derived data structure (generally
filter-project-join-aggregate).
Replicas with different physical properties (e.g. copy the table from New York to
Tokyo, or copy the table and partition by month(orderDate), sort by productId).
Incremental view maintenance. Materialized view V is populated from base table
T. Yesterday, we populated V with V0 = Q(T0). Today we want to make its contents
equal to V1 = Q(T1). Can we find and apply a delta query, dQ = Q(T1 - T0)?
Materialized views in Calcite

JSON model:
{
  "schemas": [ {
    "name": "HR",
    "tables": [ {
      "name": "emp"
    } ],
    "materializations": [ {
      "table": "i_emp_job",
      "sql": "SELECT job, empno
              FROM emp
              ORDER BY job, empno"
    }, {
      "table": "add_emp_deptno",
      "sql": "SELECT deptno,
                SUM(sal) AS ss, COUNT(*) AS c
              FROM emp
              GROUP BY deptno"
    } ]
  } ]
}

Planner API:

/** Transforms a relational expression into a
 * semantically equivalent relational expression,
 * according to a given set of rules and a cost
 * model. */
public interface RelOptPlanner {
  /** Defines an equivalence between a table and
   * a query. */
  void addMaterialization(
      RelOptMaterialization materialization);

  /** Finds the most efficient expression to
   * implement this query. */
  RelNode findBestExp();
}

/** Records that a particular query is materialized
 * by a particular table. */
public class RelOptMaterialization {
  public final RelNode tableRel;
  public final List<String> qualifiedTableName;
  public final RelNode queryRel;
}
You can define materializations in a JSON model, via the planner API, or via
CREATE MATERIALIZED VIEW DDL (not shown).
More about materialized views
● There are several algorithms to rewrite queries to match materialized views
● A lattice is a data structure to model a star schema
● Calcite has algorithms to recommend an optimal set of summary tables for
a lattice (given expected queries, and statistics about column cardinality)
● Data profiling algorithms estimate the cardinality of all combinations of
columns
10. Working with spatial data
Spatial query
Find all restaurants within 1.5 distance units of my current location:

SELECT *
FROM Restaurants AS r
WHERE ST_Distance(
    ST_MakePoint(r.x, r.y),
    ST_MakePoint(6, 7)) < 1.5

restaurant       x  y
Zachary’s pizza  3  1
King Yen         7  7
Filippo’s        7  4
Station burger   5  6

We cannot use a B-tree index (it can sort points by x or y coordinates, but not both), and specialized spatial indexes (such as R*-trees) are not generally available.
Hilbert space-filling curve

SELECT *
FROM Restaurants AS r
WHERE (r.h BETWEEN 35 AND 42
    OR r.h BETWEEN 46 AND 46)

restaurant       x  y  h
Zachary’s pizza  3  1  5
Station burger   5  6  36
Filippo’s        7  4  52

There are similar techniques for other spatial patterns.
11. Research using Apache Calcite
Yes, VLDB 2021! Go to the talk: 09:00 Wednesday.
@julianhyde @szampetak
https://calcite.apache.org
Thank you!
Resources
● Calcite project https://calcite.apache.org
● Materialized view algorithms
https://calcite.apache.org/docs/materialized_views.html
● JSON model https://calcite.apache.org/docs/model.html
● Lazy beats smart and fast (DataEng 2018) - MVs, spatial, profiling
https://www.slideshare.net/julianhyde/lazy-beats-smart-and-fast
● Efficient spatial queries on vanilla databases (ApacheCon 2018)
https://www.slideshare.net/julianhyde/spatial-query-on-vanilla-databases
● Graefe, McKenna. The Volcano Optimizer Generator, 1991
● Graefe. The Cascades Framework for Query Optimization, 1995
● Slideshare (past presentations by Julian Hyde, including several about
Apache Calcite) https://www.slideshare.net/julianhyde