Para Distr Nosql Notes
So far, we have assumed centralised database management systems where the DBMS software runs on
a single processor and the data resides at a single site.
Three important developments in databases have been:
(i) the use of parallel processing techniques for achieving faster DBMS performance and handling
larger volumes of data than is possible with a single-processor system,
(ii) the prevalence of distributed data, and
(iii) the development of so-called NoSQL systems (which often involve both distributed data and
parallel processing).
1 Parallel Architectures
Parallel database systems use parallel processing techniques to achieve faster DBMS performance and
handle larger volumes of data than is possible with single-processor systems.
There are three major architectures for parallel database systems:
• shared-memory
• shared-disk
• shared-nothing
In all three architectures, all processors have their own cache, and cache accesses are much faster than
accesses to main memory.
The two primary measures of DBMS performance are throughput — the number of tasks that can be
performed within a given time, and response time — the time from a request being issued to a response
being returned. Response time includes the time it takes to complete a single task (processing time)
as well as any (communication) delay, known as latency, which is particularly relevant in distributed
databases.
Processing of a large number of small tasks can be speeded up by processing many tasks in parallel.
Processing of individual large tasks can be speeded up by processing sub-tasks in parallel.
Speed-up refers to the performance of tasks faster due to more processors being added. Linear speed-up
means that a system with n times as many processors can perform the same task n times faster. See
Figure 1.
Scale-up refers to the performance of larger tasks due to more processors being added. Linear scale-up
means that a system with n times as many processors can perform a task that is n times larger in the
same time. See Figure 2.
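Stated slightly more formally (a standard formulation; the notation is not from the original notes):

    speed-up(n) = (elapsed time for a given task on 1 processor) / (elapsed time for the same task on n processors); linear speed-up means speed-up(n) = n.
    scale-up(n) = (elapsed time for a task of size s on 1 processor) / (elapsed time for a task of size n x s on n processors); linear scale-up means scale-up(n) = 1.

For example, if one processor completes a workload in 60 seconds and four processors complete the same workload in 20 seconds, the speed-up is 3, which is sub-linear.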
A number of factors can affect speed-up and scale-up in parallel database architectures:
[Figure 1: transactions per second (Trx/Sec) against number of processors, contrasting linear and sub-linear speed-up.]
[Figure 2: Trx/Sec against number of processors (with proportionally larger tasks), contrasting linear and sub-linear scale-up.]
• assembly costs associated with combining the results of parallel evaluation;
• interference between multiple processes for shared system resources;
• communication costs between processes.
We will not consider the shared-memory and shared-disk architectures further, other than to note
that both suffer from interference between processors contending for the shared resource (memory or disks, respectively), which limits their speed-up and scale-up.
The shared-nothing architecture does not suffer from problems of contention for memory accesses or
disk accesses, though it has higher communication costs than the other two architectures. Due to the
lack of contention problems, it has the potential to achieve linear speed-up and scale-up. It also has
higher availability as both memory and disk faults can be isolated from other processors.
However, load balancing may be hard to achieve as the data needs to be partitioned effectively (see
Section 4) between the disks accessed by the different processors.
2 Distributed databases
A distributed database system (DDB system) consists of several databases stored at different sites
of a computer network (i.e., a shared-nothing architecture).
The data at each site is managed by a database server running some DBMS software.
These servers can cooperate in executing global queries and global transactions, i.e. queries and
transactions whose processing may require access to databases stored at multiple sites.
A significant cost factor in global query and transaction processing is the communications costs incurred
by transmitting data between servers over the network.
An extra level of coordination is needed in order to guarantee the ACID properties of global transactions.
• Homogeneous DDB systems consist of local databases that are all managed by the same DBMS
software.
• Heterogeneous DDB systems consist of local databases each of which may be managed by a different
DBMS.
Thus, the data model, query language and the transaction management protocol may be different
for different local databases.
• Integrated DDB systems provide one integrated view of the data to users.
• A single database administration authority decides how the data is organised and distributed across the sites.
[Figure: two distributed DBMSs, A and B, each with a global catalog and local catalogs.]
• Multi-database DDB systems consist of a set of fully autonomous ‘local’ database systems.
• Additional middleware — the Mediator — manages interaction with these local database systems,
including providing global query processing and global transaction management capabilities.
• The Mediator interacts with each local database system through an appropriate Wrapper, which
provides information about the data and the query processing capabilities of the local database
system.
3 NoSQL systems
NoSQL database systems were developed to provide reduced functionality compared to traditional
Relational DBMSs, with the aim of achieving higher performance and scalability for specific types of
applications.
The functionality reductions may include:
• no support for full ACID transactions;
• no fixed database schema;
• no declarative (SQL-style) query language.
The query processing and data storage capabilities of NoSQL systems tend to be oriented towards
supporting specific types of applications.
The archetypal examples are settings where there are very large volumes of relatively unstructured data
supporting web-scale applications that require quick response times and high availability for users, or
that require real-time or near real-time data analytics: This is so-called “Big Data”, examples being web
log data, social media data, data collected by mobile and ubiquitous devices on the Internet of Things,
and large-scale scientific data from experiments and simulations.
A key aim of NoSQL database systems is elasticity, i.e. undisrupted service in the face of changes to the
computing resources of a running system, with adaptive load-balancing.
Two other key aims are scalability and fault-tolerance:
NoSQL database systems partition and replicate their data (see Section 4) so as to achieve scalability
by adding more servers as needed, and also so as to achieve fault-tolerance.
BASE: Rather than ACID, many NoSQL systems provide BASE: Basically Available, Soft state, Eventually consistent.
We will discuss these in more detail when we cover distributed transaction processing.
Examples
• graph DBMSs, e.g. Neo4J, Sparksee, Trinity:
– although these are classified as NoSQL systems by some commentators, they predate the
NoSQL movement and they generally do support full ACID transactions;
– graph DBMSs focus on managing large volumes of graph-structured data;
– graph-structured data differs from other “big” data in its greater focus on the relationships
between entities, regarding these relationships to be as important as the entities;
– graph DBMSs typically include features such as
∗ special-purpose graph-oriented query languages
∗ graph-specific storage structures, for fast edge and path traversal
∗ in-database support for graph algorithms such as subgraph matching, breadth-first/depth-first search, path finding and shortest path (a small illustrative sketch follows below).
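As a flavour of the graph algorithms just mentioned, here is a minimal breadth-first search that finds a shortest (fewest-edges) path over an in-memory adjacency list in Python. It is only an illustration of the idea, not how any particular graph DBMS implements it, and the node names are invented.

    from collections import deque

    def shortest_path(adj, source, target):
        """Return a shortest (fewest-edges) path from source to target, or None."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if node == target:
                # Reconstruct the path by walking parent pointers back to the source.
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return list(reversed(path))
            for neighbour in adj.get(node, []):
                if neighbour not in parent:      # not yet visited
                    parent[neighbour] = node
                    queue.append(neighbour)
        return None

    # Example: finds a two-hop path from alice to dave.
    adj = {"alice": ["bob", "carol"], "bob": ["dave"], "carol": ["dave"], "dave": []}
    print(shortest_path(adj, "alice", "dave"))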
3.1 Examples in more detail
Amazon DynamoDB and Riak
DynamoDB grew out of the earlier Dynamo system which was created to provide a highly scalable,
available and durable key-value store for shopping cart data. It supports hundreds of thousands of
customer applications, as well as Alexa, the Amazon.com sites and all Amazon fulfillment centres.
“In 2021, during the 66-hour Amazon Prime shopping event, Amazon systems made trillions of API calls
to DynamoDB, peaking at 89.2 million requests per second.”2
It provides the following properties:
• Flexibility. Dynamo tables don’t have a fixed schema, and can use either a key-value or document
data model. Developers can choose either strong or eventual consistency (see later) when reading
items.
• Fully managed cloud service. Developers can create tables and read and write data without regard
for where tables are stored or how they are managed.
• Boundless scale for tables. There is no predefined limit on table size. Data is spread across servers
as required.
• Predictable performance. The simple API (see below) allows for responses to requests with consistent low latency (in the single-digit millisecond range for a 1 KB item).
• High availability. Data is replicated across multiple data centres, potentially in different geographical regions.
The API includes the following operations:
1. GetItem: returns a set of attributes for the item with the given key.
2. PutItem: inserts a new item or replaces an old item.
3. UpdateItem: updates an existing item or adds a new item.
4. DeleteItem: deletes a single item specified by the given key.
Each of the above can include a condition which must be satisfied for the operation to succeed.
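A minimal sketch of how these four operations look from application code, using the Python boto3 library. The table name ("Cart"), its key attribute ("id") and the item attributes are hypothetical, and the table is assumed to already exist.

    import boto3

    dynamodb = boto3.resource("dynamodb")
    cart = dynamodb.Table("Cart")          # hypothetical table with partition key 'id'

    # PutItem: insert a new item or replace an existing one with the same key.
    cart.put_item(Item={"id": "u42", "contents": ["book"], "total": 20})

    # GetItem: return the attributes of the item with the given key.
    response = cart.get_item(Key={"id": "u42"})
    print(response.get("Item"))

    # UpdateItem: modify attributes of an existing item (or add a new item if absent).
    cart.update_item(
        Key={"id": "u42"},
        UpdateExpression="SET total = :t",
        ExpressionAttributeValues={":t": 25},
    )

    # DeleteItem with a condition: only delete if the total is still 25.
    cart.delete_item(
        Key={"id": "u42"},
        ConditionExpression="total = :t",
        ExpressionAttributeValues={":t": 25},
    )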
Riak is an open-source key-value store based on Dynamo.
Google Spanner
Google’s Spanner3 has evolved from earlier NoSQL systems to become a relational database system,
providing a strongly-typed schema system, an SQL query processor and ACID transactions.
This evolution was motivated by Google developers’ experience trying to build applications on key-value
stores such as Bigtable. The response was first to provide transaction processing on top of Bigtable,
leading to Megastore. However, the lack of a declarative query language meant that developers had to
write complex code in their applications to process and aggregate data.
As of 2017, over 5000 Spanner databases were being used by Google teams. Applications using Spanner
include AdWords and the Google Play platform. The overall system was processing tens of millions of
queries per second, managing hundreds of petabytes of data4 . Replicas of data are served from data
centres around the world, providing low latency and high availability.
2 See Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service, Mostafa Elhemali et al., Proc. USENIX ATC 2022.
Summary
We are seeing a growing convergence between SQL and NoSQL technologies: NoSQL stores are gradually
moving towards supporting full ACID transactions, database schemas and declarative querying facilities.
For example, FoundationDB5, used by Apple among others, is a key-value store which provides ACID transactions.
Conversely, relational DBMSs are extending their capabilities to support NoSQL functionality, e.g.
Oracle is now a “multi-model” DBMS supporting storage of XML, JSON, graphs and RDF.
4 NoSQL/Distributed Storage
4.1.1 SSTables
In order to speed up the merging of segments, as well as read operations when the key is not found in memory, the records in each disk segment can be sorted by their key values. This format has been called a Sorted String Table, or SSTable, and is used in Google's Bigtable, for example.
Now merging can be done using mergesort. Also, in order to speed up read operations, an index to entries in the SSTable on disk can be held in memory. This index can be sparse, i.e. only having entries for some key values, because the keys are sorted. For example, the index could contain an entry only for the smallest key in each block/page on disk.
5 https://www.foundationdb.org
6 Inspired by examples in Martin Kleppmann's book.
How is the SSTable sorted in the first place? The in-memory log is in fact stored as a balanced tree
structure which maintains keys in sorted order (e.g., an AVL tree). So when it is written out as an
SSTable, the keys can be accessed in sorted order. This in-memory tree is sometimes called a memtable.
Overall, this storage engine works as follows:
• when a write comes in, add it to the memtable;
• when the memtable exceeds some size threshold, write it out to disk as a new SSTable segment, and start a fresh memtable for subsequent writes;
• to serve a read, look first in the memtable, then in the most recent SSTable segment, then in the next-most-recent one, and so on;
• from time to time, merge and compact SSTable segments in the background, discarding overwritten or deleted values.
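The following is a small, self-contained Python sketch of these ideas (an illustration only, not how Bigtable or any production engine implements them): writes go to an in-memory memtable (here just a dict standing in for a balanced tree), flush() writes it out as a sorted SSTable file while building a sparse index, and get() checks the memtable first and then uses the sparse index to scan only part of the SSTable.

    import bisect, os, tempfile

    SPARSE_EVERY = 3                  # keep an index entry for every 3rd key (small, for illustration)
    memtable = {}                     # stands in for a balanced in-memory tree (e.g. an AVL tree)
    sstable_path = os.path.join(tempfile.mkdtemp(), "segment-1.sst")
    sparse_index = []                 # (key, byte offset) pairs, sorted by key

    def put(key, value):
        memtable[key] = value         # writes go to the memtable only

    def flush():
        """Write the memtable out, sorted by key, as an SSTable, building a sparse index as we go."""
        global memtable
        sparse_index.clear()
        with open(sstable_path, "wb") as f:
            for i, key in enumerate(sorted(memtable)):
                if i % SPARSE_EVERY == 0:
                    sparse_index.append((key, f.tell()))
                f.write(f"{key}\t{memtable[key]}\n".encode())
        memtable = {}

    def get(key):
        if key in memtable:           # the most recent writes are found in the memtable
            return memtable[key]
        if not sparse_index:
            return None
        # Find the last indexed key <= the requested key, then scan forward from its offset.
        i = bisect.bisect_right([k for k, _ in sparse_index], key) - 1
        if i < 0:
            return None               # key is smaller than anything in the SSTable
        with open(sstable_path, "rb") as f:
            f.seek(sparse_index[i][1])
            for line in f:
                k, v = line.decode().rstrip("\n").split("\t")
                if k == key:
                    return v
                if k > key:           # keys are sorted, so we can stop early
                    return None
        return None

    put("cherry", "3"); put("apple", "1"); put("banana", "2"); put("damson", "4")
    flush()
    put("elder", "5")                 # a newer write, still only in the memtable
    print(get("banana"), get("elder"), get("fig"))     # -> 2 5 None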
Distributing data across the sites of a distributed database brings several benefits:
• fragments of the data can be stored at the sites where they are most frequently accessed;
• intra-query parallelism can be supported, i.e. a single (global) query can be translated into
multiple (local) subqueries that can be processed in parallel.
Data Fragmentation:
In the case of relational data, fragmentation might be horizontal or vertical:
• Horizontal fragmentation splits a relation R into n selections R1 = σcond1 R, R2 = σcond2 R, . . . , Rn = σcondn R, such that
R = R1 ∪ R2 ∪ . . . ∪ Rn
• Vertical fragmentation splits a relation R into n projections πatts1 R, πatts2 R, . . . , πattsn R, such that
R = πatts1 R ⋈ πatts2 R ⋈ . . . ⋈ πattsn R
This is known as a loss-less join decomposition of R, and requires each set of attributes attsi
to include a key (or superkey) of R.
Hybrid fragmentation is also possible, i.e. applying both vertical and horizontal fragmentation.
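A small Python illustration of horizontal and vertical fragmentation of a relation, with rows represented as dictionaries; the relation and attribute names are invented for the example.

    # Relation R(empId, name, dept, salary), with empId as the key.
    R = [
        {"empId": 1, "name": "Ann", "dept": "Sales", "salary": 40000},
        {"empId": 2, "name": "Bob", "dept": "HR",    "salary": 35000},
        {"empId": 3, "name": "Cat", "dept": "Sales", "salary": 45000},
    ]

    # Horizontal fragmentation: each fragment is a subset of the rows (here, selected by dept).
    R1 = [r for r in R if r["dept"] == "Sales"]
    R2 = [r for r in R if r["dept"] != "Sales"]
    assert sorted(R1 + R2, key=lambda r: r["empId"]) == R          # R = R1 ∪ R2

    # Vertical fragmentation: each fragment is a projection that includes the key empId.
    V1 = [{"empId": r["empId"], "name": r["name"]} for r in R]
    V2 = [{"empId": r["empId"], "dept": r["dept"], "salary": r["salary"]} for r in R]

    # Loss-less reconstruction: natural join of V1 and V2 on the key empId.
    joined = [{**a, **b} for a in V1 for b in V2 if a["empId"] == b["empId"]]
    assert sorted(joined, key=lambda r: r["empId"]) == R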
Data Replication:
Data can also be replicated on more than one site, with the aim of:
• faster processing of queries (using a local copy rather than a remote one)
• increased availability and reliability (if a site failure makes one copy unavailable, another copy may
be available on a different site)
A disadvantage of replication is the added overhead in maintaining consistency of replicas after an update
occurs to one of them.
Thus, decisions regarding whether or not to replicate data, and how many replicas to create, involve a
trade-off between lower query costs and increased availability/reliability on the one hand, and increased
update and replica synchronisation costs on the other.
Data Partitioning in Shared-Nothing Architectures:
A common way of utilising the multiple processors and disks available in a shared-nothing, relational
architecture is by horizontally partitioning relations across the disks in the system in order to allow
parallel access to, and processing of, the data. The aim is to achieve even load-balancing between
processors and linear scale-up/speed-up performance characteristics.
A query accessing a horizontally partitioned relation can be composed of multiple subqueries, one for
each fragment of the relation. These subqueries can be processed in parallel in a shorter time than the
original query.
As mentioned above, this is known as intra-query parallelism i.e. parallelising the execution of one
query. We will look more closely at this later in this module.
( Inter-query parallelism is also present in parallel databases, i.e. executing several different queries
concurrently. )
A key issue with intra-query parallelism in shared-nothing architectures is what is the best way to
partition the tuples of each relation across the available disks. Three major approaches are used:
(i) round-robin partitioning: tuples are placed on the disks in a circular fashion;
(ii) hash partitioning: tuples are placed on the disks according to some hash function applied to one
or more of their attributes;
(iii) range partitioning: tuples are placed on the disks according to the sub-range in which the value of
an attribute, or set of attributes, falls.
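To make the three options concrete, here is a minimal Python sketch; the partitioning attribute, its value ranges and the number of disks are invented for the example. Each function returns the disk number a tuple would be placed on.

    NUM_DISKS = 4

    def round_robin_disk(row_number):
        # (i) tuples go to disks in a circular fashion, in insertion order
        return row_number % NUM_DISKS

    def hash_disk(value):
        # (ii) tuples go to the disk given by a hash of the partitioning attribute
        return hash(value) % NUM_DISKS

    RANGE_BOUNDS = [100, 200, 300]      # value < 100 -> disk 0, < 200 -> disk 1, < 300 -> disk 2

    def range_disk(value):
        # (iii) tuples go to the disk whose sub-range contains the partitioning value
        for disk, bound in enumerate(RANGE_BOUNDS):
            if value < bound:
                return disk
        return len(RANGE_BOUNDS)        # everything >= 300 goes to the last disk

    for row_number, amount in enumerate([20, 150, 150, 320, 80, 250]):
        print(row_number, round_robin_disk(row_number), hash_disk(amount), range_disk(amount))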
The advantage of (i) is an even distribution of data on each disk. This is good for load balancing a scan
of the entire relation. However, it is not good for exact-match queries since all disks will need to be
accessed whereas in fact only a subset of them is likely to contain the relevant tuples.
The advantage of (ii) is that exact-match queries on the partitioning attribute(s) can be directed to the
relevant disk. However, (ii) is not good for supporting range queries.
Approach (iii) is good for range queries on the partitioning attribute(s) because only the disks that are
known to overlap with the required range of values need be accessed.
However, one potential problem with (iii) is that data may not be evenly allocated across the disks —
this is known as data skew.
One solution to this problem is to maintain a histogram for the partitioning attribute(s), i.e. divide
the domain into a number of ranges and keep a count of the number of rows that fall within each range
as the relation is updated. This histogram can be used to determine the subrange of values allocated to
each disk (rather than have fixed ranges of equal length).
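A minimal sketch of deriving sub-range boundaries from such a histogram so that each disk receives roughly the same number of rows (an equi-depth split); the bucket ranges and counts are invented for the example.

    # Histogram over the partitioning attribute: (upper bound of bucket, number of rows in bucket).
    histogram = [(50, 100), (100, 150), (150, 250), (200, 100),
                 (250, 150), (300, 100), (350, 100), (400, 50)]
    num_disks = 4
    total_rows = sum(count for _, count in histogram)        # 1000
    share = total_rows / num_disks                           # ideal share: 250 rows per disk

    boundaries = []       # upper bounds of the sub-ranges for disks 0, 1, 2 (the last disk takes the rest)
    cumulative = 0
    for upper, count in histogram:
        cumulative += count
        # Close off a sub-range once roughly another disk's share of rows has accumulated.
        if len(boundaries) < num_disks - 1 and cumulative >= share * (len(boundaries) + 1):
            boundaries.append(upper)

    print(boundaries)     # -> [100, 150, 250]: each disk is allocated 250 of the 1000 rows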
[Figure 4: a consistent-hashing ring with servers s0, s1, s2 and keys x0, x1, x2.]
Consistent hashing, originally proposed for distributing requests among web caches, was subsequently used in peer-to-peer networks in the form of distributed hash tables, and in NoSQL key-value systems such as Amazon's Dynamo(DB) and Riak.
The problem it solves is to map keys to servers so that (1) keys are distributed uniformly across the servers, (2) given a key, we can determine which server holds it, and (3) as few keys as possible move when servers are added or removed.
One could use a hash function such as h(k) = k mod n, where k is the key and n the number of servers.
This would return the number of the server, so meeting requirement 2. But it would not provide a uniform
distribution (requirement 1) unless the keys themselves were uniformly distributed.
Instead, one could use a function h(k) = hash(k) mod n, where hash is a hashing function which
distributes keys uniformly. This would satisfy requirements 1 and 2. But what about requirement 3?
If we add an extra server, we have to change n to n + 1, and almost all the keys/content would have to
move.
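A quick way to see the problem with requirement 3 is to count how many keys change server when n goes from 5 to 6 under hash(k) mod n. A small Python check (the key set is arbitrary):

    import hashlib

    def server(key, n):
        # hash(k) mod n, using a hash that distributes keys uniformly
        return int(hashlib.sha1(key.encode()).hexdigest(), 16) % n

    keys = [f"key{i}" for i in range(10000)]
    moved = sum(1 for k in keys if server(k, 5) != server(k, 6))
    print(f"{moved / len(keys):.0%} of keys move when a 6th server is added")   # roughly 83%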
The solution is to:
1. Use a hash function that maps into a very large, fixed address space (e.g. 2^160 values), independent of the number of servers.
2. Use the hash function to map both servers and keys onto this space.
3. View the address space as a circle (a ring).
4. Assign each key to the server that “follows” it in the address space.
In Figures 4 and 5, when the new server s3 is added, only the keys that now hash to s3 (those lying between s3 and the preceding server on the ring) need to move; all other keys stay on the servers they were already assigned to.
Typically, each server is allocated multiple (blocks of) hash values (since there are fewer servers than 2^160, say!).
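A minimal Python sketch of the ring itself (server names and keys are invented): both servers and keys are hashed onto the same space, each key is assigned to the first server that follows it (wrapping around), and adding a server only claims the keys between it and its predecessor.

    import bisect, hashlib

    SPACE = 2 ** 160                     # e.g. a 160-bit hash space, as used by Riak

    def h(value):
        return int(hashlib.sha1(value.encode()).hexdigest(), 16) % SPACE

    class Ring:
        def __init__(self, servers):
            # Position every server on the ring by hashing its name.
            self.points = sorted((h(s), s) for s in servers)

        def server_for(self, key):
            # The key goes to the first server clockwise from the key's position (wrapping around).
            hashes = [p for p, _ in self.points]
            i = bisect.bisect_right(hashes, h(key)) % len(self.points)
            return self.points[i][1]

        def add_server(self, server):
            bisect.insort(self.points, (h(server), server))

    ring = Ring(["s0", "s1", "s2"])
    keys = [f"x{i}" for i in range(10)]
    before = {k: ring.server_for(k) for k in keys}
    ring.add_server("s3")
    after = {k: ring.server_for(k) for k in keys}
    print([k for k in keys if before[k] != after[k]])    # only the keys now owned by s3 have moved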
Riak, e.g., divides the hash address space by default into 64 partitions, i.e., sets of contiguous hash values
(see Figure 6).
It recommends the data be distributed over at least 5 servers (nodes in Riak terminology).
7 Riak uses a hash function which returns a 160-bit number (i.e., the space consists of 2^160 numbers).
[Figure 5: the ring of Figure 4 after an additional server, s3, has been added.]
[Figure 6: the Riak hash ring: the address space from 0 to 2^160 - 1 divided into 64 numbered partitions; the key “favorite” hashes to one of the partitions.]
Instead of the hash function being used to allocate servers/nodes to partitions, Riak does the allocation
in a round-robin way (see Figure 7).
When data is written to a node, it is replicated to a number of other nodes. For 5 nodes, Riak’s default
number of replicas is 3.
In Figure 7, the 5 servers are denoted A, B, C, D and E. Partitions 1 to 5 are allocated to servers A to
E, respectively. Then partition 6 starts with A again.
If a record hashes to partition i, then it is written to partitions i, i + 1 and i + 2. In Figure 7, “favorite”
hashes to partition 3, so is replicated on partitions 3, 4 and 5, allocated to servers/nodes C, D and E,
respectively.
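The placement rule illustrated by Figure 7 can be sketched as follows (a simplification of what Riak actually does; the partition and node names are illustrative):

    NUM_PARTITIONS = 64
    NODES = ["A", "B", "C", "D", "E"]
    N_REPLICAS = 3

    # Partitions 1..64 are allocated to nodes round-robin: 1->A, 2->B, ..., 5->E, 6->A, ...
    def node_for(partition):
        return NODES[(partition - 1) % len(NODES)]

    def replica_partitions(primary):
        # A record whose key hashes to partition i is written to partitions i, i+1, i+2 (wrapping around).
        return [((primary - 1 + j) % NUM_PARTITIONS) + 1 for j in range(N_REPLICAS)]

    # "favorite" hashes to partition 3 in Figure 7:
    parts = replica_partitions(3)
    print(parts, [node_for(p) for p in parts])    # -> [3, 4, 5] ['C', 'D', 'E']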
Later on, we will discuss the issue of consistency of replicas.
[Figure 7: the Riak ring, with the 64 partitions (vnodes) allocated round-robin to the nodes A, B, C, D and E; the key “favorite” hashes to a partition on node C and is replicated to the following vnodes on nodes D and E.]
Reading
Read the following for more information:
1. Appendix A on NoSQL Overview, in Graph Databases, Ian Robinson, Jim Webber, and Emil Eifrem, O'Reilly Media.
3. Megastore: Providing Scalable, Highly Available Storage for Interactive Services, Jason Baker et
al., Proc. CIDR 2011, pp. 223–234.
4. Spanner: Becoming a SQL System, David F. Bacon et al., Proc. SIGMOD 2017, pp. 331–343.
5. The first part (pp. 69–85) of Chapter 3 (on Storage and Retrieval) of Designing Data-Intensive
Applications. Martin Kleppmann. O’Reilly Media, Inc.
6. A Little Riak Book, Eric Redmond and John Daly, see https://github.com/basho-labs/little_riak_book/blob/master/rendered/riaklil-print-en.pdf