SQL to NoSQL:
Architecture Differences
and Considerations
for Migration
CONTENTS
SQL VERSUS NOSQL BASICS
ECONOMIES OF SCALE
REPLICAS OF DATA
APPLICATION-DRIVEN USE CASES
CONSISTENCY VERSUS AVAILABILITY
ACID VERSUS BASE CONSISTENCY
LIGHTWEIGHT TRANSACTIONS
SCALING CHARACTERISTICS
QUERY PATTERNS
MATERIALIZED VIEWS
SECONDARY INDEXES
REFERENTIAL INTEGRITY
MIGRATION TO NOSQL
HYBRID CONVERSION TO NOSQL
DATA FORKLIFTING
DATA VALIDATION
SQL VERSUS NOSQL BASICS

Since their invention in 1970 by Edgar Codd, relational databases have served as the default data store for almost every IT organization, large or small. Today, the most iconic and familiar relational databases include IBM DB2, Oracle Database, Microsoft SQL Server, PostgreSQL, and MySQL.

Structured Query Language, or SQL, was invented at IBM soon after the introduction of the relational database. Since its introduction, SQL has become the most widely used database language, used for querying data, data manipulation (insert, update and delete), data definition (schema creation and modification), and data access control. Though the terms refer to different technologies, ‘SQL’ and ‘RDBMS’ have become virtually interchangeable. Though some non-relational databases support SQL, the term “SQL database” generally means a relational database.

During the decades in which relational databases proliferated, data entry was largely a manual process. Times have changed. The advent of smartphones, the ‘app economy,’ and cloud computing in the late 2000s caused a sea change in the workloads, query types, and traffic patterns needed to support a global user base.

Fast forward to 2020, when people, smart devices, sensors, and machines emit continuous streams of data, such as user activity, IoT and machine-generated data, and metadata that encompasses geolocation and telemetry. As early as 2013, one researcher noted that 90% of the world’s data had been generated over the previous two years. This trend has only accelerated. In response to this torrent of data, many organizations have been reevaluating their use of traditional relational databases.

Cloud computing exposed many limitations of relational databases. RDBMSs proliferated in an age when databases were isolated islands with relatively stable user bases, running in a traditional client-server configuration. RDBMSs support arbitrary reshaping and joining of data, but performance can be variable and unpredictable. In restricted environments, such variable performance characteristics could be managed. The shift to a mobile, globally dispersed user base caught many organizations off guard.

During the same time period, consumer demands shifted radically. Today, users expect low-latency applications that deliver an extremely responsive experience, regardless of the user’s location. Apps that are slow and unresponsive contribute significantly to customer churn. Predictable performance became more important than the semantic flexibility afforded by RDBMSs.

Latency issues can be addressed by shifting data closer to the customer. To meet this need, data must be replicated across different geographic locations. Such geographical replication turned out to be a struggle for RDBMSs. While traditional RDBMSs were not designed for distributed deployments, non-relational databases are designed specifically to support such topologies.

A number of alternative non-relational database systems have been proposed, including Google’s Bigtable (2006) and Amazon’s Dynamo (2007). The papers for these projects paved the way for Cassandra (2008) and MongoDB (2009).
[Figure: SQL vs NoSQL. SQL: Relational Database Management Systems (RDBMS). NoSQL: Key-Value, Graph.]
Today, a range of mature NoSQL databases are available to help organizations scale big data applications.

Yet, despite their origins in a long-forgotten technology cycle, relational SQL databases are by no means ‘legacy’ technology. Some SQL databases, notably PostgreSQL and MySQL, have experienced a recent resurgence in popularity. A new generation of NewSQL databases, notably Google Spanner and CockroachDB, leverage SQL as a query language and offer a distributed architecture similar to that of NoSQL databases yet provide full transactional support.

ARCHITECTURAL DIFFERENCES BETWEEN SQL AND NOSQL

ECONOMIES OF SCALE

Database administrators add capacity to RDBMS and NoSQL databases in very different ways. Typically, the only way to add capacity in a relational system is to add more expensive hardware: faster CPUs, more RAM, and more advanced networking components. This is often referred to as ‘vertical’ scale, or ‘scaling up.’

In contrast, NoSQL databases are designed for low latency and high resilience, being built from the ground up to run across clusters of distributed nodes. This architecture is often referred to as ‘horizontal scale,’ or ‘scaling out.’ To add capacity to a NoSQL database, administrators add more nodes, a straightforward process in modern cloud environments.

In a NoSQL cluster, nodes are easy to add and remove according to demand, providing ‘elastic’ capacity. This feature enables organizations to align their data footprint with the needs of the business while maintaining availability even in the face of seasonal demand spikes, node failures, and network outages.

Horizontal scaling brings tradeoffs of its own. Adding commodity hardware to a cluster can be cheap in terms of software licenses and subscriptions. However, as more and more nodes are added in the pursuit of higher throughput and lower latency, operational overhead and administrative costs spike. Big clusters of small instances demand more attention and generate more alerts than small clusters of large instances.

(Notably, some next-generation NoSQL databases like Scylla are able to overcome this tradeoff, scaling out in a way that can take
advantage of the powerful hardware in modern servers and ultimately running in smaller, though still distributed, clusters of fewer nodes.)

REPLICAS OF DATA

Replicating data across multiple nodes allows databases to achieve higher levels of resilience. In the RDBMS world, it’s not trivial to replicate data across multiple instances. Traditional relational deployments often lack built-in multi-node replication and rely instead on external tools to extract and update copies of datasets. These tools typically run batch processes that can take hours to complete. As a result, real-time synchronization among the copies of data is difficult to guarantee.

While non-relational databases provide native support for data replication, they follow three basic models: multi-master databases, such as DynamoDB; master-slave architectures, such as MongoDB; and masterless architectures, such as Scylla. Given their reliance on master nodes, both multi-master and master-slave architectures introduce a point of failure. When a master goes down, the process of electing a new master introduces a brief downtime. Even though the delay may be minimal, measured in milliseconds, that delay can still cause SLA violations.

A masterless architecture addresses this limitation. In these databases, data is replicated across multiple nodes, all of which are equal. In a masterless architecture, no single node can bring down an entire cluster. A typical masterless topology involves three or more replicas for each dataset. Adopting a NoSQL database that implements a masterless architecture provides yet another layer of resilience for high-volume, low-latency applications.

APPLICATION-DRIVEN USE CASES

The rise in popularity of NoSQL databases paralleled the adoption of agile development and DevOps practices. Unlike RDBMSs, NoSQL databases encourage ‘application-first’ or API-first development patterns. Following these models, developers first consider queries that support the functionality specific to an application, rather than considering the data models and entities. This developer-friendly architecture paved the path to the success of the first generation of NoSQL databases.

In contrast, relational databases impose fairly rigid, schema-based structures on data models: tables consisting of columns and rows, which can be joined to enable ‘relations’ among entities. Each table typically defines an entity. Each row in a table holds one entry, and each column contains a specific piece of information for that record. The relationships among tables are clearly defined and usually enforced by schemas and database rules.
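As a rough illustration of how a relational schema defines entities, relates them through joins, and enforces relationships by rule, the following sketch uses Python's built-in sqlite3 module as a stand-in for an RDBMS. The table and column names are hypothetical and chosen only for this example.

```python
import sqlite3

# In-memory SQLite database standing in for any RDBMS (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Each table defines one entity; columns are fixed by the schema.
conn.execute("CREATE TABLE department (dept_no TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE employee (
           emp_no  TEXT PRIMARY KEY,
           name    TEXT NOT NULL,
           dept_no TEXT NOT NULL REFERENCES department(dept_no)
       )"""
)

conn.execute("INSERT INTO department VALUES ('D03', 'Database')")
conn.execute("INSERT INTO employee VALUES ('E1', 'Mohan', 'D03')")

# A join relates the two entities at query time.
joined = conn.execute(
    "SELECT e.name, d.name FROM employee e JOIN department d ON e.dept_no = d.dept_no"
).fetchall()

# The schema rejects a row that refers to a department that does not exist.
try:
    conn.execute("INSERT INTO employee VALUES ('E2', 'Vipul', 'D99')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Here `joined` holds the employee name paired with the department name, and `rejected` records that the database, not the application, refused the inconsistent row.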
[Figure: a cluster made up of five nodes]
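The masterless replication model described in this section can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular database implements it: the `Node` and `Cluster` classes, the SHA-256 placement rule, and the replication factor of three are all illustrative choices, and real systems add consistency levels, hinted handoff, and repair on top.

```python
import hashlib

class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}

class Cluster:
    """Toy masterless cluster: every node is equal and each key lives on `rf` replicas."""

    def __init__(self, names, replication_factor=3):
        self.nodes = [Node(n) for n in names]
        self.rf = replication_factor

    def replicas(self, key):
        # Hash the key to pick a starting node, then take the next rf nodes on the ring.
        start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.rf)]

    def write(self, key, value):
        # Write to every replica that is currently reachable.
        live = [n for n in self.replicas(key) if n.up]
        for node in live:
            node.data[key] = value
        return len(live)  # number of replicas that accepted the write

    def read(self, key):
        # Any live replica can answer; there is no master to consult.
        for node in self.replicas(key):
            if node.up and key in node.data:
                return node.data[key]
        raise KeyError(key)

cluster = Cluster(["n1", "n2", "n3", "n4", "n5"])
acks_all_up = cluster.write("user:42", "alice")
cluster.replicas("user:42")[0].up = False   # simulate a single node failure
value_after_failure = cluster.read("user:42")
acks_one_down = cluster.write("user:42", "alice-v2")
```

With one replica down, reads and writes still succeed against the remaining two. The failed node now holds stale data, which is why such clusters also need a repair procedure.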
Relational data models enforce uniformity, whereas non-relational models do not. NoSQL databases permit multiple ‘shapes’ of data [...]

[Figure: diagram label “A” (Availability)]
In simple terms, consistency is a guarantee that a read should return the result of the latest successful write. This seems simple, but such a guarantee is incredibly difficult to deliver without impacting the performance of the system as a whole. In a relational database, a single data item is actually split across independent registers that must agree with one another. Thus, a single database write is actually decomposed into several small writes to these registers, which must be completed and visible when the read is executed. With concurrent operations running against the database, the semblance of order between the group of sub-operations needs to be maintained; the concurrent operations must be atomic. ACID consistency means the rules of relations must be satisfied. In a globally distributed database topology, which involves multiple clusters each containing many nodes, the problem of consistency becomes far more complex.

In general, relational databases that support ‘strong consistency’ provide ‘ACID guarantees.’ ACID is an acronym designed to capture the essential elements of a strongly consistent database. The components of ACID are as follows:

• Atomicity: Guarantees that each transaction is treated as a single “unit”, which either succeeds completely or fails completely.
• Consistency: Guarantees that each transaction only changes affected data in permitted ways.
• Isolation: Guarantees that the concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially.
• Durability: The transaction’s results are permanent, even in the event of system failure.

ACID compliance is a complex and often contested topic. In fact, one popular system of analysis, the Jepsen test, is dedicated to verifying vendor consistency claims.

By their nature, ACID-compliant databases are generally slow, difficult to scale, and expensive to run and maintain. It should be noted that some RDBMSs enable performance to be improved by relaxing ACID guarantees. Still, all SQL databases are ACID compliant to varying degrees, and as such, they all share this downside. The practical effect of ACID compliance is to make it extraordinarily difficult and expensive to achieve resilient, distributed SQL database deployments.

In contrast to the ACID guarantees of an RDBMS, NoSQL databases provide so-called ‘BASE guarantees.’ BASE favors availability and relaxes strict consistency. The acronym BASE designates:

• Basically Available: Data is available most of the time, even during a partial system failure.
• Soft state: Individual data items are independent and do not have to be consistent with each other.
• Eventual consistency: Data will become consistent at some unspecified point in the future.

As such, NoSQL databases sacrifice a degree of consistency in order to increase availability. Rather than providing strong consistency, NoSQL databases generally provide eventual consistency. A data store that provides BASE guarantees can occasionally fail to return the result of the latest write, providing different answers to applications making requests. Developers building applications against eventually consistent data stores often implement consistency checks in their application code.

Lightweight transactions
In a traditional SQL RDBMS, a “transaction” is a logical unit of work: a group of tasks that provides the ACID guarantees discussed above. To compensate for relaxed consistency, some NoSQL databases offer ‘lightweight transactions’ (LWTs).

Lightweight transactions are limited to a single conditional statement, which enables an atomic “compare and set” operation. Such an operation checks whether a condition is true before it conducts the transaction. If the condition is not
met, the transaction is not executed. (For this reason, LWTs are sometimes called ‘conditional statements’.) LWTs do not truly lock the database for the duration of the transaction; they only ‘lock’ a single cell or row. LWTs leverage a consensus protocol such as Paxos to ensure that all nodes in the cluster agree the change is committed. In this way, LWTs can provide sufficient consistency for applications that require the availability and resilience of a distributed database.

QUERY LANGUAGES: SQL VERSUS CQL

As we’ve noted, relational databases are defined in part by their use of the Structured Query Language (SQL). In contrast, NoSQL databases employ a host of alternative query languages that have been designed to support diverse application use cases. A partial list includes MongoDB Query Language (MQL), Couchbase’s N1QL, Elasticsearch’s Query DSL, Microsoft Azure’s Cosmos DB query language, and Cassandra Query Language (CQL).

In this paper, we will focus on one of the most widely used NoSQL query languages, CQL. While CQL is the primary language for communicating with Apache Cassandra, it is also supported by a range of familiar NoSQL databases. Common CQL-compliant databases include Scylla, DataStax Enterprise, Microsoft’s cloud-native Azure Cosmos DB, and Amazon Keyspaces.

CQL’s similarity to SQL enables developers to move between the languages with relative ease. A few distinctions between SQL and CQL include:

Joins
SQL and CQL share similar statements to define and modify data structures, such as CREATE, ALTER, DROP, and TRUNCATE, but unlike SQL, CQL is not designed to support joins between tables. In CQL, relations are implemented within the application, rather than within the database query.

Values versus objects
Query results are also returned differently. SQL natively returns data-typed values, usually to be read into an object one field at a time. In contrast, many NoSQL query interfaces return complete objects, often serialized in Extensible Markup Language (XML) or JavaScript Object Notation (JSON); CQL, for example, can return rows as JSON. This makes applications responsible for parsing these objects to obtain the desired result of a query.

Scaling characteristics
In NoSQL, data is distributed across the nodes in a cluster by token, which is a hashed value of the partition key (the leading part of the primary key); each node is responsible for a range of tokens. By using token ranges, NoSQL databases enable objects to be stored on different nodes. CQL queries are inherently more scalable than SQL queries, having been specifically designed to query across a horizontally distributed cluster of servers, rather than a single database at a time.

CONSIDERATIONS FOR SQL TO NOSQL MIGRATIONS

Data models
SQL data models follow a normalized design; different but related pieces of information are stored in ‘relations,’ which are separate logical tables connected by joins. NoSQL databases use denormalized data models, in which redundant copies of data are added as needed by the consuming applications. The point of denormalization is to increase performance and lower latency since the joins involved in normalized data models can introduce significant performance overhead, especially in distributed topologies.

When migrating from SQL to NoSQL, the primary key in the relational table becomes the partition key in the NoSQL table. If the RDBMS table must be joined to additional tables to retrieve the business object, those closely related tables should be combined into a single NoSQL table. The NoSQL clustering key determines the physical order of records within a partition, so it should be a unique value (often a composite value) that would be useful for searching.
Project  Project             Project     Project  Employee  Employee  Department  Department  Hourly
Code     Name                Manager     Budget   No.       Name      No.         Name        Rate
-------  ------------------  ----------  -------  --------  --------  ----------  ----------  ------
PC010    Reservation System  Mr. Ajay    120500   S100      Mohan     D03         Database    21.00
PC010    Reservation System  Mr. Ajay    120500   S101      Vipul     D02         Testing     16.50
PC011    HR System           Mrs. Charu  500500   S103      Pavan     D03         Database    18.50
PC011    HR System           Mrs. Charu  500500   S104      Jitendra  D02         Testing     17.00
PC012    Attendance System   Mr. Rajesh  710700   S137      Rahul     D03         Database    21.50
PC012    Attendance System   Mr. Rajesh  710700   S218      Avneesh   D02         Testing     15.50

Denormalized data
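To make the denormalized table above concrete, the following sketch shows one way the joins could be resolved once, at write time, so that each wide row carries everything a query needs. The normalized structures and the `denormalize` function are hypothetical; the values are a subset of the table above, with employee numbers written as S100, S101 and so on.

```python
# Normalized relations, as they might exist in an RDBMS (structure is illustrative).
projects = {
    "PC010": {"project_name": "Reservation System", "manager": "Mr. Ajay", "budget": 120500},
    "PC011": {"project_name": "HR System", "manager": "Mrs. Charu", "budget": 500500},
}
departments = {"D02": "Testing", "D03": "Database"}
assignments = [
    # (project_code, employee_no, employee_name, department_no, hourly_rate)
    ("PC010", "S100", "Mohan", "D03", 21.00),
    ("PC010", "S101", "Vipul", "D02", 16.50),
    ("PC011", "S103", "Pavan", "D03", 18.50),
]

def denormalize(projects, departments, assignments):
    """Resolve the joins up front and emit one wide row per assignment."""
    rows = []
    for project_code, emp_no, emp_name, dept_no, rate in assignments:
        project = projects[project_code]
        rows.append({
            "project_code": project_code,      # candidate partition key
            "employee_no": emp_no,             # candidate clustering key
            "project_name": project["project_name"],
            "project_manager": project["manager"],
            "project_budget": project["budget"],
            "employee_name": emp_name,
            "department_no": dept_no,
            "department_name": departments[dept_no],
            "hourly_rate": rate,
        })
    return rows

rows = denormalize(projects, departments, assignments)
```

Project and department details are repeated in every row, which trades storage and update effort for reads that need no join.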
[Figure: Example of Partition. Rows within each partition are sorted by time; sample rows: 2011 02 03 04:05:10, heart_rate 89; 2011 12 17 09:21:00, heart_rate 84.]
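The partition layout pictured above can be approximated in plain Python. This is a minimal sketch under assumptions: the node names, the `sensor-1` style partition keys, the SHA-256 hash, and the modulo placement are all illustrative stand-ins (real clusters assign token ranges to nodes and use their own partitioners).

```python
import bisect
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster

def token(partition_key: str) -> int:
    """Hash the partition key to a token; SHA-256 is an arbitrary stand-in."""
    return int(hashlib.sha256(partition_key.encode()).hexdigest(), 16)

def owner(partition_key: str) -> str:
    """Map a token to a node. Modulo placement keeps the sketch short."""
    return NODES[token(partition_key) % len(NODES)]

# partition key -> rows kept sorted by clustering key (here, the timestamp)
partitions = defaultdict(list)

def insert(partition_key, timestamp, heart_rate):
    bisect.insort(partitions[partition_key], (timestamp, heart_rate))

def time_slice(partition_key, start, end):
    """Return rows with start <= timestamp < end from a single partition."""
    rows = partitions[partition_key]
    lo = bisect.bisect_left(rows, (start,))
    hi = bisect.bisect_left(rows, (end,))
    return rows[lo:hi]

# Rows arrive out of order but are stored sorted by time within the partition.
insert("sensor-1", "2011-12-17 09:21:00", 84)
insert("sensor-1", "2011-02-03 04:05:10", 89)
insert("sensor-2", "2011-06-01 12:00:00", 77)

first_half = time_slice("sensor-1", "2011-01-01", "2011-07-01")
```

Because all rows for one partition key live together and are already ordered, a time-range query touches one partition and reads a contiguous slice.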
Secondary keys and indexes can be added later. A UNIQUE constraint in a SQL database becomes a good candidate for a clustering key in NoSQL.

Materialized views
Common, frequent queries against a database can become expensive. When the same query is run again and again, it makes sense to ‘virtualize’ the query. Materialized views address this need by enabling common queries to be represented by a database object that is continuously updated as data changes.

[...] tables and storage for streaming data, especially in the context of event-driven architecture (EDA), are good candidates to migrate to NoSQL.

Data forklifting
Tools like Apache Kafka can facilitate the process of migrating existing data from an RDBMS to NoSQL. Depending on the complexity of the conversion, more comprehensive operations may be needed. Tools such as Apache Spark, a unified analytics engine for big data and machine learning, can be used to enable such data conversions.
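The basic shape of a forklift job, reading rows in batches from the source and writing them to a key-addressed target, can be sketched without any particular tool. The example below uses neither Kafka nor Spark: sqlite3 stands in for the source RDBMS, a plain dictionary stands in for the NoSQL table, and the `forklift` function and all table and column names are hypothetical.

```python
import sqlite3

def forklift(source, query, sink, key_column, batch_size=2):
    """Copy rows from a SQL source into a key-addressed sink in fixed-size batches."""
    cursor = source.execute(query)
    columns = [d[0] for d in cursor.description]
    copied = 0
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            record = dict(zip(columns, row))
            sink[record[key_column]] = record  # upsert by key
            copied += 1
    return copied

# Source: an in-memory SQLite table standing in for the RDBMS.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE readings (sensor_id TEXT PRIMARY KEY, heart_rate INTEGER)")
source.executemany(
    "INSERT INTO readings VALUES (?, ?)", [("s1", 89), ("s2", 84), ("s3", 77)]
)

target = {}  # stand-in for the NoSQL table
copied = forklift(source, "SELECT sensor_id, heart_rate FROM readings", target, "sensor_id")

# A simple validation step: compare row counts between source and target.
source_count = source.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Comparing `copied`, `len(target)` and `source_count` is only the crudest check; a real migration would also compare the content of the rows.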
For these reasons, the choice of a database must take into account the expertise of the organization and the need or desire to build up internal expertise around a given technology.

The ongoing maintenance of a database requires close monitoring and frequent performance tuning. As datasets grow and application traffic increases, administrators need to keep a close eye on disk space, CPU consumption, memory allocation, and index fragmentation. Performance adjustments are proprietary to each database and often require significant dedicated expertise.

A database administrator never wants to see database utilization spike to 100%. Therefore, administrators must provide a buffer against traffic spikes by ‘overprovisioning’ hardware. The degree to which hardware must be overprovisioned depends on the scaling characteristics of the database. In general, NoSQL databases have a flatter and more predictable performance curve. Therefore, NoSQL databases tend to allow administrators to minimize overprovisioning without compromising safety.

Performance tuning can be used to minimize overprovisioning, but it can only go so far in preventing full utilization. When performance tuning hits a wall, the database must be scaled; the RDBMS administrator has two choices. First, the dataset can be ‘sharded,’ such that a subset of the data is stored on each node. Second, the administrator can scale up, increasing the capacity of the hardware by adding more powerful CPUs, more storage, and faster networking components.

Often, teams do both, sharding and scaling up, which adds both complexity and cost. Vertical scaling adds significant cost at each step and eventually runs up against the physical limits of the network.

NoSQL databases make it easier for administrators to monitor and manage database deployments. First, they tend to be capable of running at higher levels of utilization than most RDBMSs. Second, capacity can be increased by adding new nodes running on inexpensive commodity servers. But within the family of NoSQL databases, these two capabilities vary considerably.

Some NoSQL databases also require expert administrators with detailed knowledge of proprietary tuning settings. Others adopt a more automated approach that minimizes tricky manual tuning parameters, enabling non-specialists to administer and operate the database.

Likewise, some NoSQL databases take horizontal scale to an extreme, often requiring huge clusters to achieve the required performance targets and maintain SLAs. Sometimes these clusters run into the tens of thousands of nodes. While providing a frictionless path to scale, this approach also increases operational overhead. The ideal non-relational database can efficiently use powerful modern hardware, while also enabling clusters to grow and shrink elastically with minimal administrator intervention.

Backup and Recovery
In both RDBMS and NoSQL worlds, data can become corrupt due to hardware issues, software bugs, and user errors. The resilient architecture of NoSQL databases typically provides a buffer against data loss. Still, administrators need to be able to restore the data to a known ‘good state.’ A backup and recovery plan is essential and is built around two core targets: Recovery Point Objective (RPO) and Recovery Time Objective (RTO).

• RPO is the maximum age of the data in backup storage that can be used to resume normal operations after a failure; in other words, how much recent data the organization can afford to lose.
• RTO is the maximum acceptable time needed to restore the system to a normal state.

A classic database restore plan might include a single daily backup along with differential backups every hour to support a one-hour RPO. For a large database, the recovery time for a full restore can take hours to days, and every backup takes additional storage space.
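The relationship between a backup schedule and these two targets comes down to simple arithmetic, sketched below. The function names, the 10 TB dataset size, and the 500 GB per hour restore throughput are hypothetical figures chosen for illustration, not measurements of any system.

```python
from datetime import timedelta

def worst_case_rpo(backup_interval: timedelta) -> timedelta:
    """Data written just after a backup stays unprotected until the next backup runs,
    so the worst-case data loss window equals the backup interval."""
    return backup_interval

def estimated_restore_time(dataset_gb: float, restore_gb_per_hour: float) -> timedelta:
    """Rough restore time, assuming throughput is the only limiting factor."""
    return timedelta(hours=dataset_gb / restore_gb_per_hour)

# Hourly differential backups, as in the plan described above.
rpo = worst_case_rpo(timedelta(hours=1))

# Hypothetical large database: 10,000 GB restored at 500 GB per hour.
restore_time = estimated_restore_time(dataset_gb=10_000, restore_gb_per_hour=500)
meets_four_hour_rto = restore_time <= timedelta(hours=4)
```

Under these assumed numbers the restore takes 20 hours, far beyond a four-hour RTO, which illustrates why full restores of large databases are measured in hours to days.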
Node Repair and Replacement
Given the distributed nature of NoSQL clusters, nodes occasionally fall out-of-sync. To address this issue, NoSQL databases provide tools to bring out-of-sync nodes up-to-date using a repair procedure. Repairs populate the node to match the data on the other replicas. Sometimes a node can fall so far out-of-sync with the cluster that it needs to be replaced. As they are bootstrapped into the cluster, fresh nodes must stream a copy of the whole dataset; for large datasets, such a refresh can take an inordinate amount of time. NoSQL databases perform such operations using a variety of algorithms, some of which are more efficient than others. Thus, some NoSQL databases recover more quickly and predictably than others.

SCYLLA NOSQL: SCALE-UP OF RDBMS AND HIGH AVAILABILITY OF NON-RELATIONAL

In this document, we have discussed a set of tradeoffs between SQL and NoSQL databases. If your use case requires ACID guarantees, then NoSQL might not be an option. But many modern, cloud-native applications are better suited to databases that support high availability and a developer-centric data model. The decision is based on business considerations: how important is each transaction? Where the aggregate scale and speed of all transactions outweigh the specific correctness of any single query, NoSQL is the best fit.

With this fundamental tradeoff in mind, one database, Scylla, has been designed from the ground up to overcome one of the key limitations of the first generation of NoSQL databases. Using a unique, close-to-the-hardware design, Scylla combines the scale-up capabilities of traditional RDBMSs with the high availability and resilience of non-relational databases. The result is a database that extracts maximum performance from modern hardware to deliver predictable low latency, while also minimizing operational overhead and significantly reducing TCO.

Many IT organizations have followed the principles in this paper and have migrated successfully from RDBMS to the Scylla NoSQL database.
Orientation
  SQL: Relational
  NoSQL: Generally non-relational

Schema
  SQL: Strict and rigid schema design and data normalization
  NoSQL: Loose and more varied designs for unstructured and semi-structured data; data is generally denormalized

Language
  SQL: Structured Query Language (SQL) for defining, reading and manipulating data. Supports JOIN statements to relate data across tables.
  NoSQL: There are different languages for querying, some quite similar to SQL, such as Cassandra Query Language (CQL) for wide column databases, or others radically different, such as using object-oriented JSON for document databases.

Scalability
  SQL: Vertically scalable. Loads on a single server can be increased with CPU, RAM or SSD.
  NoSQL: Generally designed for horizontal scalability. Increased traffic can be handled by adding more servers in the database. This is useful for large and frequently changing datasets.

Structure
  SQL: Table-based, which is efficient for applications using multi-row transactions or systems that were built with a relational structure.
  NoSQL: NoSQL database structure is variable, and can be based on documents, key-value pairs, graph structures or wide-column stores.
ABOUT SCYLLADB
Scylla is the real-time big data database. API-compatible
with Apache Cassandra and Amazon DynamoDB, Scylla
embraces a shared-nothing approach that increases
throughput and storage capacity as much as 10X.
Comcast, Discord, Disney+ Hotstar, Grab, Medium,
Starbucks, Ola Cabs, Samsung, IBM, Investing.com and
many more leading companies have adopted Scylla to
realize order-of-magnitude performance improvements
and reduce hardware costs. Scylla’s database is available
as an open source project, an enterprise edition and a
fully managed database as a service. ScyllaDB was
founded by the team responsible for the KVM hypervisor.
For more information: ScyllaDB.com