In Search of Database Nirvana
The Challenges of Delivering Hybrid Transaction/Analytical Processing
Rohit Jain
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. In Search of Database Nirvana, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-95903-9
They did not handle semistructured and unstructured data very
well. (Yes, you could stick that data into an XML, BLOB, or
CLOB column, but very little was offered to process it easily
without using complex syntax. Add-on capabilities had vendor
tie-ins and minimal flexibility.)
They had not evolved User-Defined Functions (UDFs) beyond scalar functions, which limited the parallel processing of user code that MapReduce later facilitated.
They took a long time addressing reliability issues; the cost of achieving a high Mean Time Between Failures (MTBF) in certain cases grew so high that it became cheaper to run Hadoop on large numbers of low-cost servers on Amazon Web Services (AWS). By 2008, this cost difference had become substantial.
Most of all, these systems were too elaborate and complex to deploy and manage for the modest needs of these web-scale companies. Transactional support, joins, metadata support for predefined columns and data types, optimized access paths, and a number of other capabilities that RDBMSs offered were not necessary for these companies' big data use cases. Much of the volume of data was transitory in nature, perhaps accessed at most a few times, and a traditional EDW approach to storing that data would have been cost-prohibitive. So these companies began to turn to NoSQL databases to overcome the limitations of RDBMSs and avoid the high price tag of proprietary systems.
The pendulum swung to polyglot programming and persistence, as people believed that these practices made it possible for them to use the best tool for the task. Hadoop and NoSQL solutions experienced incredible growth. For simplicity and performance, NoSQL solutions supported data models that avoided transactions and joins, instead storing related structured data as a JSON document. The volume and velocity of data had increased dramatically due to the Internet of Things (IoT), machine-generated log data, and the like. NoSQL technologies accommodated the data streaming in at very high ingest rates.
As the popularity of NoSQL and Hadoop grew, more applications began to move to these environments, with increasingly varied use cases. And as web-scale startups matured, their operational workload needs increased, and classic RDBMS capabilities became more relevant. Additionally, large enterprises that had not faced the same

Before we discuss these points, though, let's first understand the differences between operational and analytical workloads and also review the distinctions between a query engine and a storage engine. With that background, we can begin to see why building an HTAP database is such a feat.
Statistics
Statistics are necessary when query engines are trying to generate query plans or understand whether a workload is operational or analytical. In the single-row-access scenario described earlier, if the predicate(s) used in the query only cover some of the columns in the key, the engine must figure out whether the predicate(s) cover the leading columns of the key, or any of the key columns. Let us assume that the leading columns of the key have equality predicates specified on them. Then, the query engine needs to know how many rows would qualify, and how the data that it needs to access is spread across the nodes. Based on the partitioning scheme, that is,
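To make the leading-column check concrete, here is a minimal sketch in Python of how a query engine might determine which leading key columns are covered by equality predicates. The function and column names are illustrative assumptions, not taken from any particular engine:

    def leading_key_prefix(key_columns, equality_predicates):
        """Return the leading key columns covered by equality predicates.

        key_columns: ordered list of key column names
        equality_predicates: set of columns with col = literal predicates
        """
        prefix = []
        for col in key_columns:
            if col not in equality_predicates:
                break  # a gap in the key ends the usable prefix
            prefix.append(col)
        return prefix

    # With a key of (cust_id, txn_date), a predicate on cust_id alone still
    # allows a clustered range scan; a predicate on txn_date alone does not.
    print(leading_key_prefix(["cust_id", "txn_date"], {"cust_id"}))   # ['cust_id']
    print(leading_key_prefix(["cust_id", "txn_date"], {"txn_date"}))  # []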
Degree of Parallelism
All right, so now we know how we are going to scan a particular table, we have an estimate of the rows that will be returned by the storage engine from these scans, and we understand how the data is spread across partitions. We can now consider both serial and parallel execution strategies, and balance the potentially faster response time of parallel strategies against the overhead of parallelism.
Yes, parallelism does not come for free. You need to involve more processes across multiple nodes; each process consumes memory, competes for resources on its node, and is exposed to the failure of that node. You also must provide each process with the execution plan, and each must then do some setup before executing. Finally, each process must forward its results to a single node, which then has to collate all the data. All of this results in potentially more messaging between processes, increases the potential for skew, and so on.
The optimizer needs to weigh the cost of processing those rows by
using a number of potential serial and parallel plans and assess
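To illustrate the trade-off, here is a toy cost model in Python; the constants and functions are illustrative assumptions, not an actual optimizer's formulas:

    def plan_cost(rows, dop, per_row=1.0, startup=5000.0, merge_per_row=0.1):
        scan = (rows / dop) * per_row                     # work per process
        setup = dop * startup                             # plan distribution, process setup
        merge = rows * merge_per_row if dop > 1 else 0.0  # collating at one node
        return scan + setup + merge

    def pick_degree_of_parallelism(rows, candidates=(1, 2, 4, 8, 16)):
        return min(candidates, key=lambda dop: plan_cost(rows, dop))

    print(pick_degree_of_parallelism(1_000))        # 1: serial wins for small queries
    print(pick_degree_of_parallelism(100_000_000))  # 16: a large scan justifies the overhead

Even in this crude model, the per-process startup and merge overheads dominate for small row counts, which is exactly why a serial plan can beat a parallel one.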
But for OLTP and operational queries, this data flow architecture (Figure 1-4) can be a huge overhead. If you are accessing a single row, or just a few rows, you don't need the queues and complex data flows. In such a case, you can have optimizations that reduce the path length and quickly access and return just the relevant row(s).
While you are optimizing for OLTP queries with fast paths, for BI
and analytics queries you need to consider prefetching blocks of
data, provided the storage engine supports this, while the query
engine is busy processing the previous block of data. So the nature
Figure 1-5. Serial plan for reads and writes of single rows or a set of rows clustered on key columns, residing in a single partition. An example of this is when a single row is being inserted, deleted, or updated for a customer, or all the data being accessed for a customer, for a specific transaction date, resides in the same partition.
Mixed Workload
One of the biggest challenges for HTAP is the ability to handle mixed workloads; that is, both OLTP queries and BI and analytics queries running concurrently on the same cluster, nodes, disks, and tables. Workload management capabilities in the query engine can categorize queries by data source, user, role, and so on, and allow users to prioritize workloads and allocate a higher percentage of CPU, memory, and I/O resources to certain workloads over others. Or, short OLTP workloads can be prioritized over BI and analytics workloads. Various levels of sophistication can be used to manage this at the query level.
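As a rough illustration of query-level workload management, the sketch below classifies each query into a service class and gives classes resource shares. The classes, rules, and shares are hypothetical:

    SERVICE_CLASSES = {
        "oltp":      {"priority": 1, "cpu_share": 0.6},  # short operational queries
        "bi":        {"priority": 2, "cpu_share": 0.3},  # reports, dashboards
        "analytics": {"priority": 3, "cpu_share": 0.1},  # long-running scans
    }

    def classify(query):
        """Map a query to a service class by role and estimated cost."""
        if query["role"] == "app_server" and query["estimated_rows"] < 100:
            return "oltp"
        if query["role"] == "analyst":
            return "bi"
        return "analytics"

    q = {"role": "app_server", "estimated_rows": 3}
    cls = classify(q)
    print(cls, SERVICE_CLASSES[cls])  # oltp {'priority': 1, 'cpu_share': 0.6}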
However, storage engine optimization is required as well. The storage engine should automatically reduce the priority of longer-running queries and suspend execution of a query when a higher-priority query needs to be serviced, and then go back to running the
Streaming
More and more applications need incoming streams of data processed in real time, necessitating the application of functions, aggregations, and trigger actions across a stream of data, often time-series data, over row-count or time-based windows. This is very different from processing statistical or user-defined functions, sophisticated algorithms, aggregates, and even Online Analytical Processing (OLAP) window functions over data persisted in a table on disk or in memory. Even though Jennifer Widom proposed SQL syntax to handle streams (the Continuous Query Language, CQL) in the mid-2000s, there is no standard SQL syntax to process streaming data. Query engines have to be able to deal with this new data processing paradigm.
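As a simple illustration of the paradigm, the following sketch computes a time-based tumbling window aggregate over an in-order event stream. The event shape and window logic are illustrative assumptions, not any engine's actual streaming API:

    from collections import defaultdict

    def tumbling_window_sum(events, window_seconds):
        """events: iterable of (timestamp_seconds, key, value) in arrival order.
        Yields (window_start, key, sum) as each window closes."""
        current_window, sums = None, defaultdict(float)
        for ts, key, value in events:
            window = ts - (ts % window_seconds)
            if current_window is not None and window != current_window:
                for k, s in sums.items():
                    yield current_window, k, s
                sums.clear()
            current_window = window
            sums[key] += value
        for k, s in sums.items():  # flush the last open window
            yield current_window, k, s

    stream = [(0, "sensor1", 1.0), (5, "sensor1", 2.0), (12, "sensor1", 4.0)]
    print(list(tumbling_window_sum(stream, 10)))
    # [(0, 'sensor1', 3.0), (10, 'sensor1', 4.0)]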
Feature Support
Last but not least is the list of features you need to support for operational and analytical workloads. These features range from referential integrity, stored procedures, triggers, and various levels of transactional isolation and consistency for operational workloads, to materialized views; fast/bulk Extract, Transform, Load (ETL) capabilities; and OLAP, time-series, statistical, data mining, and other functions for BI and analytics workloads.
Features common to both types of workloads are many. Some of the capabilities a query engine needs to support are scalar and table-mapping UDFs; inner, left, right, and full outer joins; un-nesting of subqueries; converting correlated subqueries to joins; predicate pushdown; sort avoidance strategies; constant folding; recursive unions; and so on.
This is nowhere close to an exhaustive list, but supporting all these capabilities for these different workloads takes a huge investment of resources.
Statistics
As was discussed earlier, statistics are needed to support any query workload in order to generate a good query plan, or even to understand whether the workload is operational or analytical. So, when a
Key Structure
For operational workloads, keyed access is important for subsecond response times. Single-row access requires access via a key. Ideally, multicolumn keys are supported, because transactional or fact tables invariably have multicolumn keys. If multicolumn keys are not supported, the query engine needs to map the multiple primary key columns to the single storage engine key. Also, short operational queries often require clustered access to retrieve a small number of rows. Range access over clustered keys helps operational workloads meet Service Level Agreements (SLAs). The query engine needs to understand what keyed access options are available for the storage engine in order to exploit them appropriately for the most efficient access. It needs to optimize this access for each storage engine it supports.
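As an illustration of that mapping, the sketch below encodes a hypothetical two-column key (cust_id, txn_date) into a single byte-string storage key in a way that preserves sort order, so clustered range access still works. The encoding is a simplified assumption, not any particular engine's format:

    import struct

    def encode_key(cust_id: int, txn_date: str) -> bytes:
        # A big-endian unsigned int sorts bytewise in numeric order; the
        # fixed-width date string sorts lexically. Fixed widths keep the
        # concatenation ordered as a whole.
        return struct.pack(">I", cust_id) + txn_date.encode("ascii")

    # A range scan for one customer becomes a prefix scan on the encoded key:
    start = encode_key(42, "2016-01-01")
    end   = encode_key(42, "2016-12-31")
    assert start < end  # ordering is preserved
    assert encode_key(42, "2016-06-15").startswith(struct.pack(">I", 42))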
Partitioning
How the storage engine partitions data across disks and nodes is also very important for the query engine to understand. Does it support hash and/or range partitioning, or a combination of these? How is this partitioning determined? Does the query engine need to salt data so that the load is balanced across partitions in order to avoid bottlenecks? If it does, how can it add a salt key, say as the leftmost column of the table key, and still avoid table scans? Does the storage engine handle repartitioning or rebalancing of partitions as the cluster is expanded or contracted, or does the query engine need to worry about doing that? This can get very complex, because users might need full read and write access while this repartitioning of data is occurring. How the data is spread across partitions is very important if the query engine is going to have parallel processes working on data from these partitions. It needs to try to localize
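The salting technique mentioned above can be sketched as follows; the bucket count and key format are illustrative assumptions:

    import hashlib

    NUM_SALT_BUCKETS = 4

    def salted_key(cust_id: str) -> str:
        """Prepend a deterministic salt derived from the key value itself."""
        salt = int(hashlib.md5(cust_id.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
        return f"{salt:02d}|{cust_id}"

    # For an equality predicate, the engine can recompute the salt, so a
    # point lookup remains a single keyed probe:
    print(salted_key("customer-123"))

    # For a range scan over cust_ids, the salt is unknown, so the engine
    # issues one range scan per salt bucket instead of a full table scan:
    def range_scan_prefixes():
        return [f"{s:02d}|" for s in range(NUM_SALT_BUCKETS)]

    print(range_scan_prefixes())  # ['00|', '01|', '02|', '03|']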
Extensibility
Some storage engines can run user-defined code, such as coprocessors in HBase, or before and after triggers, in order to implement more functionality on the storage engine/server side to reduce the amount of message traffic and improve efficiency.
For example, the optimizer might be smart enough to do eager
aggregation if it can push that aggregation to the storage engine.
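A rough sketch of that idea follows: each partition pre-aggregates locally, as a coprocessor might on the server side, and the query engine merges only the small partial results. The function names and data shapes are hypothetical:

    from collections import defaultdict

    def partition_partial_sums(rows):
        """Runs on the storage/server side: aggregate before shipping."""
        partial = defaultdict(float)
        for group, value in rows:
            partial[group] += value
        return dict(partial)

    def merge_partials(partials):
        """Runs in the query engine: merge one small dict per partition."""
        total = defaultdict(float)
        for partial in partials:
            for group, value in partial.items():
                total[group] += value
        return dict(total)

    p1 = partition_partial_sums([("east", 10.0), ("east", 5.0)])
    p2 = partition_partial_sums([("east", 1.0), ("west", 2.0)])
    print(merge_partials([p1, p2]))  # {'east': 16.0, 'west': 2.0}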
Security Enforcement
Handling security is another point of contact between the query
engine and the storage engine. The storage engine might have its
own underlying security implementation. If it is well integrated with
the SQL model, the query engine can use it. Otherwise, the query
engine must administer privileges to schemas, tables, columns, and
even rows in the case of fine-grained access control. It might need to
integrate with, and even map, its authorization framework to that of
the storage engine. If there are other considerations related to security, for example, Hadoop security solutions such as Sentry or Ranger managing these objects, the query engine needs to integrate with that security framework. Depending on support for security logging and other Security Information and Event Management (SIEM) capabilities, the query engine must be able to integrate with those storage engine and platform capabilities as well, or provide its own.
With HTAP, this is all made more complicated by the fact that there
might be multiple storage engines that need to be supported.
Transaction Management
We can assume that a storage engine will provide some level of replication for high availability, backup and restore, and some level of multi-datacenter support. But transactional support can be a different thing altogether. There are the Atomic, Consistent, Isolated, Durable (ACID) and Basic Availability, Soft-state, Eventual consistency (BASE) models of transaction management. Depending on the support provided by the storage engine, the query engine might need to provide full multirow, multitable, and multistatement transaction support. It also needs to provide online backup; transactionally consistent recovery, often to a point in time; and multi-datacenter support with active-active synchronous updates as well as eventual consistency. It needs to be able to integrate this implementation with write-ahead logging and other mechanisms of the storage engine.
Metadata Support
The query engine needs to add metadata support for the storage engine tables being supported. There are potential mappings (such as catalog, name, and location, as well as data type mappings); extended options specific to the storage engine (such as compression options); support for multiple column families; mapping of data partition options to underlying storage engine structures (e.g., HBase regions); and so on. When someone changes the storage engine table externally, how is this metadata updated, or how is it marked as invalid until someone fixes the discrepancy?
In fact, if there is access to the storage engine external to the query engine, there is a whole can of worms to deal with. How can transactional consistency and data integrity be guaranteed? What should the query engine do if the data in a column is not consistent with the data type it is supposed to have, or contains invalid data, perhaps because it has been updated by some other means? Also, do you allow secondary indexes, views, constraints, materialized views, and so on to be created on such a table? Not only does metadata support need to be provided for these capabilities, but the issue of inconsistency potentially caused by external access also has to be addressed.
There can be differences in the metadata support needed for operational workloads versus BI and analytics workloads. For example, operational features such as referential integrity, triggers, and stored procedures need metadata support that is not needed for BI workloads, whereas metadata support for materialized views is often not needed for operational workloads.
Error Handling
Error handling, too, is different for each storage engine that a query engine might need to support for HTAP. What are the error handling mechanisms available in the storage engine, and how are these errors logged? Does the storage engine log errors, or does the query engine need to interpret each type of error and log it, as well as provide a meaningful error message and remediation guidance to the user?
Security
Security implementations for operational and analytical workloads can be very different. For operational workloads, security is generally managed at the application level. The application interfaces with the user and manages all access to the database. On the other hand, BI and analytics workloads could have end users working
Manageability
One of the most important aspects of a database is the ability to manage it and its workloads. As you can see in Figure 1-9, manageability entails a long list of things, and perhaps can only be partially implemented.
What are the capabilities of the query engine that would meet your workload needs?

What are the capabilities of the storage engines that would meet your workload needs? How well does the query engine integrate with those storage engines?

What data models are important for your applications? Which storage engines support those models? Does a single query engine support those storage engines?

What are the enterprise-caliber capabilities that are important to you? How do the query and storage engines meet those requirements?
Statistics
Does the query engine maintain statistics for the data?
Can the query engine gather cardinality for multiple key or join
columns, besides that for each column?
Do these statistics provide the query engine information about
data skew?
How long does it take to update statistics for a very large table?
Can the query engine incrementally update these statistics when
new data is added or old data is aged?
Degree of parallelism
How does the query engine access data that is partitioned across
nodes and disks on nodes?
Does the query engine rely on the storage engine for that, or
does it provide a parallel infrastructure to access these partitions
in parallel?
If the query engine considers serial and parallel plans, how does
it determine the degree of parallelism needed?
Does the query engine use only the number of nodes needed for
a query based on that degree of parallelism?
Join type
What are the types of joins supported?
How are joins used for different workloads?
What is the impact of using the wrong join type and how is that
impact avoided?
Mixed workload
Can you prioritize workloads for execution?
What criteria can you use for such prioritization?
Can these workloads at different service levels be allocated different percentages of resources?

Does the priority of queries decrease as they use more resources?
Are there antistarvation mechanisms, or a way to switch to a higher-priority query before resuming a lower-priority one?
Streaming
Can the query engine handle streaming data directly?
What functionality is supported against this streaming data, such as row- and/or time-based windowing capabilities?
Feature support
What capabilities and features are provided by the database for
operational, analytical, and all other workloads?
Statistics
What statistics on the data does the storage engine maintain?
Can the query engine use these statistics for faster histogram
generation?
Does the storage engine support sampling to avoid full-table
scans to compute statistics?
Does the storage engine provide a way to access data changes
since the last collection of statistics, for incremental updates of
statistics?
Does the storage engine maintain update counters for the query
engine to schedule a refresh of the statistics?
Key structure
Does the storage engine support key access?
If the storage engine supports only a single-column key, does the query engine map multicolumn keys onto it?
Partitioning
How does the storage engine partition data across disks and
nodes? Does it support hash and/or range partitioning, or a
combination of these?
Does the query engine need to salt data so that the load is balanced across partitions to avoid bottlenecks?

If it does, how can it add a salt key as the leftmost column of the table key and still avoid table scans?

Does the storage engine handle repartitioning of partitions as the cluster is expanded or contracted, or does the query engine do that?

Is there full read/write access to the data as it is rebalanced?

How does the query engine localize data access and avoid shuffling data between nodes?
Extensibility
Does the storage engine support server-side pushdown of operations, such as coprocessors in HBase, or before and after triggers in Cassandra?
How does the query engine use these?
Security enforcement
What are the security frameworks for the query and storage
engines and how do they map relative to ANSI SQL security
enforcement?
Does the query engine integrate with the underlying Hadoop
Kerberos security model?
Does the query engine integrate with security frameworks like
Sentry or Ranger?
How does the query engine integrate with security logging, and
SIEM capabilities of the underlying storage engine and platform
security?
Transaction management
Are replication for high availability, backup and restore, and multi-datacenter support provided completely by the storage engine, or is the query engine involved in ensuring consistency and integrity across all operations?
Metadata support
How does the storage engine metadata (e.g., table names, location, partitioning, columns, data types) get mapped to the query engine metadata?

How are storage-engine-specific options (e.g., compression, encryption, column families) managed by the query engine?
Does the query engine provide transactional support, secondary
indexes, views, constraints, materialized views, and so on for an
external table?
If changes to external tables can be made outside of the query
engine, how does the query engine deal with those changes and
the discrepancies that could result from them?
Error handling
How are storage and query engine errors logged?
How does the query engine map errors from the storage engine
to meaningful error messages and resolution options?
High availability
What percentage of uptime is provided (99.99% to 99.999%)?
Can you upgrade the underlying OS online (with data available
for reads and writes)?
Can you upgrade the underlying file system online (e.g.,
Hadoop Distributed File System)?
Can you upgrade the underlying storage engine online?
Can you upgrade the query engine online?
Can you redistribute data to accommodate node and/or disk
expansions and contractions online?
Can the table definition be changed online; for example, column data type changes, and adding, dropping, or renaming columns?
Can secondary indexes be created and dropped online?
Are online backups supported, both full and incremental?
Manageability
What required management capabilities are supported (see
Figure 1-9 for a list)?
Is operational performance reported in transactions per second
and analytical performance by query?
What is the overhead of gathering metrics on operational workloads as opposed to analytical workloads?

Is the interval of statistics collection configurable to reduce this overhead?

Can workloads be managed to Service Level Objectives, based on priority and/or resource allocation, especially high-priority operational workloads against lower-priority analytical workloads?
Conclusion
This report has attempted to do a modest job of highlighting at least some of the challenges of having a single query engine service both operational and analytical needs. That said, no query engine necessarily has to deliver on all the requirements of HTAP, and one certainly could meet the mixed workload requirements of many customers without doing so. The report also attempted to explain what you should look for and where you might need to compromise as you try to achieve the "nirvana" of a single database to handle all of your workloads, from operational to analytical.
About the Author
Rohit Jain is cofounder and CTO at Esgyn, an open source database company driving the vision of a Converged Big Data Platform. Rohit provided the vision behind Apache Trafodion, an enterprise-class MPP SQL database for big data, donated to the Apache Software Foundation by HP in 2015. EsgynDB, Powered by Apache Trafodion, is delivering the promise of a Converged Big Data Platform with a vision of any data, any size, and any workload. A veteran database technologist with 28 years of experience, Rohit has worked for Tandem, Compaq, and Hewlett-Packard in application and database development. His experience spans online transaction processing, operational data stores, data marts, enterprise data warehouses, business intelligence, and advanced analytics on distributed massively parallel systems.