
Advances in Data Management:

Parallel, Distributed and NoSQL Databases

So far, we have assumed centralised database management systems where the DBMS software runs on
a single processor and the data resides at a single site.
Three important developments in databases have been:

(i) the use of parallel processing techniques for achieving faster DBMS performance and handling
larger volumes of data than is possible with a single-processor system,
(ii) the prevalence of distributed data, and
(iii) the development of so-called NoSQL systems (which often involve both distributed data and
parallel processing).

1 Parallel Architectures
Parallel database systems use parallel processing techniques to achieve faster DBMS performance and
handle larger volumes of data than is possible with single-processor systems.
There are three major architectures for parallel database systems:

• shared-memory
• shared-disk
• shared-nothing

In all three architectures, all processors have their own cache, and cache accesses are much faster than
accesses to main memory.
The two primary measures of DBMS performance are throughput — the number of tasks that can be
performed within a given time, and response time — the time from a request being issued to a response
being returned. Response time includes the time it takes to complete a single task (processing time)
as well as any (communication) delay, known as latency, which is particularly relevant in distributed
databases.
Processing of a large number of small tasks can be speeded up by processing many tasks in parallel.
Processing of individual large tasks can be speeded up by processing sub-tasks in parallel.
Speed-up refers to performing a given task faster as more processors are added. Linear speed-up
means that a system with n times as many processors can perform the same task n times faster. See
Figure 1.
Scale-up refers to performing larger tasks as more processors are added. Linear scale-up
means that a system with n times as many processors can perform a task that is n times larger in the
same time. See Figure 2.
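
As a minimal worked illustration (the timings below are invented for illustration only), speed-up and scale-up can be computed directly from measured elapsed times:

    # Hypothetical measurements, for illustration only.

    # Speed-up: same task, n times as many processors.
    t_task_1_proc = 120.0    # seconds on 1 processor
    t_task_8_proc = 20.0     # seconds for the same task on 8 processors
    speed_up = t_task_1_proc / t_task_8_proc      # 6.0; linear speed-up would be 8.0

    # Scale-up: task n times larger, n times as many processors.
    t_small_1_proc = 120.0   # original task on 1 processor
    t_large_8_proc = 150.0   # 8x larger task on 8 processors
    scale_up = t_small_1_proc / t_large_8_proc    # 0.8; linear scale-up would be 1.0

    print(speed_up, scale_up)
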
Figure 1: Linear and sub-linear speed-up (transactions per second vs. increasing number of processors)

Figure 2: Linear and sub-linear scale-up (transactions per second vs. increasing number of processors and database size)

A number of factors can affect speed-up and scale-up in parallel database architectures:

• start-up costs for initiating multiple parallel processes;
• assembly costs associated with combining the results of parallel evaluation;
• interference between multiple processes for shared system resources;
• communication costs between processes.

We will not consider the shared-memory and shared-disk architectures further, other than to note
that

• memory faults in shared-memory systems will affect all processors


• contention for memory or disk, respectively, will decrease performance

The shared-nothing architecture does not suffer from problems of contention for memory accesses or
disk accesses, though it has higher communication costs than the other two architectures. Due to the
lack of contention problems, it has the potential to achieve linear speed-up and scale-up. It also has
higher availability as both memory and disk faults can be isolated from other processors.
However, load balancing may be hard to achieve as the data needs to be partitioned effectively (see
Section 4) between the disks accessed by the different processors.

2 Distributed databases
A distributed database system (DDB system) consists of several databases stored at different sites
of a computer network (i.e., a shared-nothing architecture).
The data at each site is managed by a database server running some DBMS software.
These servers can cooperate in executing global queries and global transactions, i.e. queries and
transactions whose processing may require access to databases stored at multiple sites.
A significant cost factor in global query and transaction processing is the communications costs incurred
by transmitting data between servers over the network.
An extra level of coordination is needed in order to guarantee the ACID properties of global transactions.

2.1 Autonomy and heterogeneity of DDB systems


DDB systems may be homogeneous or heterogeneous:

• Homogeneous DDB systems consist of local databases that are all managed by the same DBMS
software.
• Heterogeneous DDB systems consist of local databases each of which may be managed by a different
DBMS.
Thus, the data model, query language and the transaction management protocol may be different
for different local databases.

DDB systems may be integrated or multi-database:

• Integrated DDB systems provide one integrated view of the data to users.
• A single database administration authority decides

– what information is stored in each site of the DDB system,


– how this information is stored and accessed, and
– who is able to access it.

The general architecture of an integrated DDB is illustrated in Figure 3.

Figure 3: Integrated DDB Architecture: each site runs a Global Query Processor and Global Trx Manager (with a global catalog) on top of a Local Query Processor, Local Trx Manager and Local Storage Manager, which manage the local database, local catalog and local log file

• Multi-database DDB systems consist of a set of fully autonomous ‘local’ database systems.
• Additional middleware — the Mediator — manages interaction with these local database systems,
including providing global query processing and global transaction management capabilities.
• The Mediator interacts with each local database system through an appropriate Wrapper, which
provides information about the data and the query processing capabilities of the local database
system.

An example of an integrated, heterogeneous DDB (or polystore) is BigDAWG (https://bigdawg.mit.edu).


For simplicity, we will use the term homogeneous DDB to mean ‘integrated, homogeneous DDB’.
We will not delve into the topic of heterogeneous database integration further.

2.2 Parallel DBs vs DDBs


A key difference between parallel database systems and distributed database (DDB) systems is that
the processors of a parallel database system are linked by high-speed interconnections within a single
computer or at a single geographical site, and queries/transactions are submitted to a single database
server.
The aim of a parallel database system is to utilise multiple processors in order to speed up query and
transaction processing; whereas the aim of a DDB system is to manage multiple databases stored at
different sites of a computer network.
In DDB systems, a significant cost is therefore the communications cost incurred when transmitting
data between the DDB servers over the network, whereas communications costs are much lower within
a parallel database system. It is possible of course for multiple parallel database instances to be
connected into an overall distributed database system.

3 NoSQL systems
NoSQL database systems were developed to provide reduced functionality compared to traditional Relational DBMSs, with the aim of achieving higher performance and scalability for specific types of applications.
The functionality reductions may include:

• not offering full ACID guarantees for transactions,


• not supporting a high-level query language such as SQL, but instead a low-level programmer
interface, and/or
• not requiring data to conform to a schema.

The query processing and data storage capabilities of NoSQL systems tend to be oriented towards
supporting specific types of applications.
The archetypal examples are settings where there are very large volumes of relatively unstructured data
supporting web-scale applications that require quick response times and high availability for users, or
that require real-time or near real-time data analytics: This is so-called “Big Data”, examples being web
log data, social media data, data collected by mobile and ubiquitous devices on the Internet of Things,
and large-scale scientific data from experiments and simulations.
A key aim of NoSQL database systems is elasticity, i.e. undisrupted service in the face of changes to the
computing resources of a running system, with adaptive load-balancing.
Two other key aims are scalability and fault-tolerance:
NoSQL database systems partition and replicate their data (see Section 4) so as to achieve scalability
by adding more servers as needed, and also so as to achieve fault-tolerance.
BASE: Rather than ACID, many NoSQL systems provide BASE:

• Basic Availability: almost always available.


• Soft-state: different replicas don’t have to be mutually consistent all the time.

• Eventual consistency: replicas will be consistent at some point in the future.

We will discuss these in more detail when we cover distributed transaction processing.

Examples

Some examples of NoSQL systems include:

• key-value stores, e.g. Dynamo(DB) (Amazon), Redis and Riak


– store data values (with no predefined structure) that are indexed by a key
– support insert, delete and look-up operations

• document stores, e.g. MongoDB and Couchbase:


– store more complex, nested-structure, data (usually JSON)
– support both primary and secondary indexes to the data

• wide-column stores, e.g. BigTable (Google), HBase (Apache), Cassandra (Facebook):


– store records that can be extended with additional attributes;
• All of the above (key-value, document and wide-column stores) allow partitioning/distribution and replication of data.

• graph DBMSs, e.g. Neo4J, Sparksee, Trinity:
– although these are classified as NoSQL systems by some commentators, they predate the
NoSQL movement and they generally do support full ACID transactions;
– graph DBMSs focus on managing large volumes of graph-structured data;
– graph-structured data differs from other “big” data in its greater focus on the relationships
between entities, regarding these relationships to be as important as the entities;
– graph DBMSs typically include features such as
∗ special-purpose graph-oriented query languages
∗ graph-specific storage structures, for fast edge and path traversal
∗ in-database support for graph algorithms such as subgraph matching, breadth-first/depth-first search, path finding and shortest path.

The following “ranking” of systems is from https://db-engines.com/en/ranking (October 2023):

1. Oracle (Relational, Multi-model)


2. MySQL (Relational, Multi-model)
3. Microsoft SQL Server (Relational, Multi-model)
4. PostgreSQL (Relational, Multi-model)
5. MongoDB (Document, Multi-model)
6. Redis (Key-value, Multi-model)
7. Elasticsearch (Search engine, Multi-model)
8. IBM Db2 (Relational, Multi-model)
9. SQLite (Relational)
10. Microsoft Access (Relational)
11. Snowflake (Relational)
12. Cassandra (Wide column)
13. MariaDB (Relational, Multi-model)
14. Splunk (Search engine)
15. Microsoft Azure SQL Database (Relational, Multi-model)
16. Amazon DynamoDB (Multi-model)
17. Databricks (Multi-model)
18. Hive (Relational)
19. Teradata (Relational, Multi-model)
20. Google BigQuery (Relational)
21. FileMaker (Relational)
22. SAP HANA (Relational, Multi-model)
23. Neo4j (Graph)

The full ranking contains almost 400 systems.

3.1 Examples in more detail
Amazon DynamoDB and Riak

DynamoDB grew out of the earlier Dynamo system which was created to provide a highly scalable,
available and durable key-value store for shopping cart data. It supports hundreds of thousands of
customer applications, as well as Alexa, the Amazon.com sites and all Amazon fulfillment centres.
“In 2021, during the 66-hour Amazon Prime shopping event, Amazon systems made trillions of API calls to DynamoDB, peaking at 89.2 million requests per second.” (Elhemali et al., Proc. USENIX 2022)
It provides the following properties:

• Flexibility. Dynamo tables don’t have a fixed schema, and can use either a key-value or document
data model. Developers can choose either strong or eventual consistency (see later) when reading
items.
• Fully managed cloud service. Developers can create tables and read and write data without regard
for where tables are stored or how they are managed.
• Boundless scale for tables. There is no predefined limit on table size. Data is spread across servers
as required.
• Predictable performance. The simple API (see below) allows for responses to requests with consistent low latency (in the single-digit millisecond range for a 1 KB item).
• High availability. Data is replicated across multiple data centres, potentially in different geographical regions.

The CRUD APIs for items are:

1. GetItem: returns a set of attributes for the item with the given key.
2. PutItem: inserts a new item or replaces an existing item.
3. UpdateItem: updates an existing item or adds a new item.
4. DeleteItem: deletes a single item specified by the given key.

Each of the above can include a condition which must be satisfied for the operation to succeed.
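
As an illustrative sketch only (not from the notes), these operations might be invoked from Python via the AWS boto3 client; the table name, key and attribute names below are hypothetical, and error handling is omitted:

    import boto3

    # Assumes AWS credentials and a region are already configured.
    dynamodb = boto3.client("dynamodb")

    # PutItem: insert a new item; the condition makes the call fail if an
    # item with this key already exists (rather than silently replacing it).
    dynamodb.put_item(
        TableName="ShoppingCart",                      # hypothetical table
        Item={"CartId": {"S": "cart#123"},
              "Product": {"S": "book-9780132350884"},
              "Qty": {"N": "1"}},
        ConditionExpression="attribute_not_exists(CartId)",
    )

    # GetItem: return the attributes of the item with the given key.
    resp = dynamodb.get_item(TableName="ShoppingCart",
                             Key={"CartId": {"S": "cart#123"}})
    print(resp.get("Item"))

    # UpdateItem: modify an attribute of an existing item.
    dynamodb.update_item(
        TableName="ShoppingCart",
        Key={"CartId": {"S": "cart#123"}},
        UpdateExpression="SET #q = :q",
        ExpressionAttributeNames={"#q": "Qty"},        # avoids reserved-word clashes
        ExpressionAttributeValues={":q": {"N": "2"}},
    )

    # DeleteItem: delete the item with the given key.
    dynamodb.delete_item(TableName="ShoppingCart",
                         Key={"CartId": {"S": "cart#123"}})
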
Riak is an open-source key-value store based on Dynamo.

Google Spanner

Google’s Spanner (https://cloud.google.com/spanner) has evolved from earlier NoSQL systems to become a relational database system, providing a strongly-typed schema system, an SQL query processor and ACID transactions.
This evolution was motivated by Google developers’ experience trying to build applications on key-value
stores such as Bigtable. The response was first to provide transaction processing on top of Bigtable,
leading to Megastore. However, the lack of a declarative query language meant that developers had to
write complex code in their applications to process and aggregate data.
As of 2017, over 5000 Spanner databases were being used by Google teams. Applications using Spanner include AdWords and the Google Play platform. The overall system was processing tens of millions of queries per second, managing hundreds of petabytes of data (a petabyte is 1000 terabytes, or a million gigabytes). Replicas of data are served from data centres around the world, providing low latency and high availability.

Summary

We are seeing a growing convergence between SQL and NoSQL technologies: NoSQL stores are gradually
moving towards supporting full ACID transactions, database schemas and declarative querying facilities.
For example, FoundationDB (https://www.foundationdb.org), used by Apple among others, is a key-value store which provides ACID transactions.
Conversely, relational DBMSs are extending their capabilities to support NoSQL functionality, e.g. Oracle is now a “multi-model” DBMS supporting storage of XML, JSON, graphs and RDF.

4 NoSQL/Distributed Storage

4.1 NoSQL storage


In a number of NoSQL systems, there is a need to support very high volumes of write operations while keeping processing times low. This has implications for the methods they use for concurrency control (as we will see later), and also for how they store data.
One way to speed up write operations is not to update the original records, which may be scattered over
random locations on disk, but instead to append updates to files. This is very similar to what is done
with a database write-ahead log which we studied earlier, and such approaches are usually referred to as
log-structured.
Say we are recording the number of times YouTube videos are being watched (an example inspired by examples in Martin Kleppmann’s book). Each video has a unique id, for which we will use strings like ‘cat123’ and ‘dog456’. Assume that the current values for each video are as follows:
cat123: 100 dog456: 50
If ‘cat123’ is viewed once more, then we get the following:
cat123: 100 dog456: 50 cat123: 101
(if we assume that we append to the right).
When reading a value for a particular key k, we need to read backwards from the end of the log to find
the most recent entry for k.
The log is initially held in memory. When memory runs out, or the log has been deemed to have grown
too large, it is written to disk as a data file segment. At this time (or at some subsequent time), the
segment can be compacted to remove out-of-date entries (i.e., keeping only the most recent for each key).
This would yield
dog456: 50 cat123: 101
in our example. The new, compacted segment is written as a new file.
If many segments have been written to disk, they can be merged and compacted in one operation (done
as a background task).
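
A minimal in-memory sketch of this append-and-compact behaviour (ignoring persistence and segment files), using the video-view example above:

    # Append-only log of (key, value) entries.
    log = []

    def put(key, value):
        # writes simply append; earlier entries for the key are left in place
        log.append((key, value))

    def get(key):
        # read backwards from the end of the log to find the most recent entry
        for k, v in reversed(log):
            if k == key:
                return v
        return None

    def compact(entries):
        # keep only the most recent entry for each key
        latest = {}
        for k, v in entries:        # later entries overwrite earlier ones
            latest[k] = v
        return list(latest.items())

    put("cat123", 100)
    put("dog456", 50)
    put("cat123", 101)              # 'cat123' is viewed once more
    print(get("cat123"))            # -> 101
    print(compact(log))             # -> [('cat123', 101), ('dog456', 50)]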

4.1.1 SSTables

In order to speed up the merging process as well as read operations if the key is not in memory, the
records in each disk segment can be sorted by their key values. This format has been called a Sorted
String Table, or SSTable, and is used in Google’s BigTable, for example.
Now merging can be done using mergesort. Also, in order to speed up read operations, an index to
entries in the SSTable on disk can be held in memory. This index can be sparse, i.e. only having entries

for some key values, because the keys are sorted. For example, the index could contain an entry only for
the smallest key in each block/page on disk.
How is the SSTable sorted in the first place? The in-memory log is in fact stored as a balanced tree
structure which maintains keys in sorted order (e.g., an AVL tree). So when it is written out as an
SSTable, the keys can be accessed in sorted order. This in-memory tree is sometimes called a memtable.
Overall, this storage engine works as follows:

• When a write operation occurs, add it to the memtable.


• When the memtable gets bigger than some threshold, write it to disk as an SSTable, which becomes
the most recent. While this takes place, writes can happen to the (new) memtable.
• When a read operation occurs, try to find the key in the memtable. If it doesn’t appear there, try
the most recent SSTable, followed by the next, . . .
• From time to time, run a merge and compaction process in the background to combine segment
files and remove overwritten and deleted keys.

Similar storage engines are used in Cassandra and HBase.


In general, these structures are often referred to as log-structured merge trees (LSM-trees) after a 1996
paper by Patrick O’Neil et al.
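
The following sketch mimics the read and write paths listed above (a toy illustration, not a real storage engine): an ordinary Python dict plays the role of the memtable (a balanced tree in a real engine), and sorted in-memory lists stand in for on-disk SSTable segments, with binary search standing in for the sparse index:

    import bisect

    MEMTABLE_LIMIT = 4        # flush threshold (deliberately tiny)
    memtable = {}             # plays the role of the balanced-tree memtable
    sstables = []             # list of segments, oldest first; each segment is
                              # a sorted list of (key, value) pairs

    def write(key, value):
        memtable[key] = value
        if len(memtable) >= MEMTABLE_LIMIT:
            # flush the memtable to a new SSTable segment, keys in sorted order
            sstables.append(sorted(memtable.items()))
            memtable.clear()  # a real engine switches to a fresh memtable instead

    def read(key):
        if key in memtable:                        # 1. try the memtable first
            return memtable[key]
        for segment in reversed(sstables):         # 2. then SSTables, newest first
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)      # keys are sorted: binary search
            if i < len(keys) and keys[i] == key:
                return segment[i][1]
        return None                                # key not present

    def merge_and_compact():
        # background task: merge all segments into one, keeping only the most
        # recent value for each key (newer segments overwrite older ones)
        merged = {}
        for segment in sstables:
            for k, v in segment:
                merged[k] = v
        sstables[:] = [sorted(merged.items())]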

4.2 Parallel and distributed storage


In parallel and distributed databases, the data may be fragmented (or partitioned) across multiple disks/sites of the system, for better query performance:

• fragments of the data can be stored at the sites where they are most frequently accessed;
• intra-query parallelism can be supported, i.e. a single (global) query can be translated into
multiple (local) subqueries that can be processed in parallel;

Data Fragmentation:
In the case of relational data, fragmentation might be horizontal or vertical:

• Horizontal fragmentation splits a relation R into n disjoint subsets R1, R2, . . . , Rn such that

R = R1 ∪ R2 ∪ . . . ∪ Rn

• Vertical fragmentation splits a relation R into n projections πatts1 R, πatts2 R, . . . , πattsn R, such that

R = πatts1 R ⋈ πatts2 R ⋈ . . . ⋈ πattsn R

This is known as a loss-less join decomposition of R, and requires each set of attributes attsi to include a key (or superkey) of R.
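
To make the two kinds of fragmentation concrete, here is a toy sketch over an invented Employee relation, with tuples represented as Python dictionaries (the relation, attributes and fragmentation predicate are made up for illustration):

    # Employee(empId, name, dept, salary); empId is the key
    employees = [
        {"empId": 1, "name": "Ann",  "dept": "Sales", "salary": 40000},
        {"empId": 2, "name": "Bob",  "dept": "IT",    "salary": 50000},
        {"empId": 3, "name": "Cara", "dept": "Sales", "salary": 45000},
    ]

    # Horizontal fragmentation: disjoint subsets of the tuples (here, by dept),
    # reconstructed by union.
    r1 = [t for t in employees if t["dept"] == "Sales"]
    r2 = [t for t in employees if t["dept"] != "Sales"]
    assert sorted(r1 + r2, key=lambda t: t["empId"]) == employees

    # Vertical fragmentation: projections that each retain the key empId,
    # reconstructed by a (natural) join on empId.
    v1 = [{"empId": t["empId"], "name": t["name"]} for t in employees]
    v2 = [{"empId": t["empId"], "dept": t["dept"], "salary": t["salary"]}
          for t in employees]

    def join(left, right, on):
        # toy natural join on a single common attribute
        index = {t[on]: t for t in right}
        return [{**l, **index[l[on]]} for l in left if l[on] in index]

    assert join(v1, v2, "empId") == employees    # loss-less decomposition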

Hybrid fragmentation is also possible, i.e. applying both vertical and horizontal fragmentation.
Data Replication:
Data can also be replicated on more than one site, with the aim of:

• faster processing of queries (using a local copy rather than a remote one)
• increased availability and reliability (if a site failure makes one copy unavailable, another copy may
be available on a different site)

A disadvantage of replication is the added overhead in maintaining consistency of replicas after an update
occurs to one of them.
Thus, decisions regarding whether or not to replicate data, and how many replicas to create, involve a trade-off between lower query costs and increased availability/reliability on the one hand, and increased update and replica synchronisation costs on the other.
Data Partitioning in Shared-Nothing Architectures:
A common way of utilising the multiple processors and disks available in a shared-nothing, relational
architecture is by horizontally partitioning relations across the disks in the system in order to allow
parallel access to, and processing of, the data. The aim is to achieve even load-balancing between
processors and linear scale-up/speed-up performance characteristics.
A query accessing a horizontally partitioned relation can be composed of multiple subqueries, one for
each fragment of the relation. These subqueries can be processed in parallel in a shorter time than the
original query.
As mentioned above, this is known as intra-query parallelism, i.e. parallelising the execution of one query. We will look more closely at this later in this module.
(Inter-query parallelism is also present in parallel databases, i.e. executing several different queries concurrently.)
A key issue with intra-query parallelism in shared-nothing architectures is how best to partition the tuples of each relation across the available disks. Three major approaches are used:

(i) round-robin partitioning: tuples are placed on the disks in a circular fashion;
(ii) hash partitioning: tuples are placed on the disks according to some hash function applied to one
or more of their attributes;

(iii) range partitioning: tuples are placed on the disks according to the sub-range in which the value of
an attribute, or set of attributes, falls.

The advantage of (i) is an even distribution of data on each disk. This is good for load balancing a scan
of the entire relation. However, it is not good for exact-match queries since all disks will need to be
accessed whereas in fact only a subset of them is likely to contain the relevant tuples.
The advantage of (ii) is that exact-match queries on the partitioning attribute(s) can be directed to the
relevant disk. However, (ii) is not good for supporting range queries.
Approach (iii) is good for range queries on the partitioning attribute(s) because only the disks that are
known to overlap with the required range of values need be accessed.
However, one potential problem with (iii) is that data may not be evenly allocated across the disks —
this is known as data skew.
One solution to this problem is to maintain a histogram for the partitioning attribute(s), i.e. divide
the domain into a number of ranges and keep a count of the number of rows that fall within each range
as the relation is updated. This histogram can be used to determine the subrange of values allocated to
each disk (rather than have fixed ranges of equal length).
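
The three partitioning strategies can be sketched as simple placement functions (the number of disks, the partitioning attribute and the range boundaries are invented; a real system would use a stable hash function rather than Python's built-in hash):

    N_DISKS = 4

    def round_robin_disk(tuple_position):
        # (i) the i-th tuple loaded goes to disk i mod N
        return tuple_position % N_DISKS

    def hash_disk(partitioning_value):
        # (ii) place a tuple according to a hash of its partitioning attribute(s)
        return hash(partitioning_value) % N_DISKS

    # (iii) range partitioning on, say, a salary attribute; the sub-range
    # boundaries could be chosen from a histogram of the attribute to avoid skew
    RANGE_BOUNDARIES = [20000, 40000, 60000]      # defines 4 sub-ranges

    def range_disk(salary):
        for disk, upper in enumerate(RANGE_BOUNDARIES):
            if salary < upper:
                return disk
        return len(RANGE_BOUNDARIES)              # values in the last sub-range

    # An exact-match query on the partitioning attribute needs only one disk
    # under (ii) or (iii); a range query needs only the overlapping disks under (iii).
    print(round_robin_disk(7), hash_disk("Sales"), range_disk(45000))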

4.3 Key/value storage


Consistent hashing was proposed (in 1997) to solve issues related to web caching as used in what became
known as content delivery networks (CDNs). It led to the formation of Akamai, now a multi-billion
dollar company.
The requirements were

1. to distribute contents uniformly over a number of servers,

2. allow clients to know where to find content without communication, and
3. allow the network of servers to grow or shrink without too much content having to move.

Figure 4: Consistent hashing (servers s0, s1 and s2, and keys/content x0, x1 and x2, mapped onto the hash circle)

It was subsequently used in peer-to-peer networks in the form of distributed hash tables, and in NoSQL
key-value systems such as Amazon’s Dynamo(DB) and Riak.
One could use a hash function such as h(k) = k mod n, where k is the key and n the number of servers.
This would return the number of the server, so meeting requirement 2. But it would not provide uniform
distribution (requirement 1) unless the keys themselves were uniformly distributed.
Instead, one could use a function h(k) = hash(k) mod n, where hash is a hashing function which
distributes keys uniformly. This would satisfy requirements 1 and 2. But what about requirement 3?
If we add an extra server, we have to change n to n + 1, and almost all the keys/content would have to move.
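
A quick sketch of why mod-n hashing fails requirement 3 (the keys below are invented for illustration):

    def server_for(key, n_servers):
        return hash(key) % n_servers

    keys = ["cat123", "dog456", "fish789", "bird012", "frog345"]

    before = {k: server_for(k, 3) for k in keys}   # 3 servers
    after = {k: server_for(k, 4) for k in keys}    # one server added

    moved = [k for k in keys if before[k] != after[k]]
    print(moved)   # typically most of the keys change server
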
The solution is:

1. to choose a hash function which maps to a large space (Riak, for example, uses one which returns a 160-bit number, i.e. the space consists of 2^160 values);
2. to use the hash function to map both servers and keys onto the space;
3. to view the address space as a circle;
4. to assign each key to the server that “follows” it in the address space.

In Figures 4 and 5,

1. the hash function returns a 32-bit number.


2. There also happen to be 32 “slots” in the circle.
3. In Figure 4, servers s0, s1 and s2 are mapped onto the space, as are keys/content x0, x1 and x2.
4. In Figure 5, a new server s3 is added. Only x2 has to be moved from s0 to s3.

Typically, each server is allocated multiple (blocks of) hash values (since there are fewer servers than 2^160, say!).
Riak, e.g., divides the hash address space by default into 64 partitions, i.e., sets of contiguous hash values
(see Figure 6).
It recommends the data be distributed over at least 5 servers (nodes in Riak terminology).
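
A minimal sketch of such a ring (using SHA-1 purely as a convenient 160-bit hash, and invented server/key names): both servers and keys are hashed onto the same space, and each key is assigned to the first server at or after its position, wrapping around the circle:

    import bisect
    import hashlib

    def h(name):
        # map a string onto the 160-bit hash space (as in Riak)
        return int(hashlib.sha1(name.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, servers):
            # ring positions of the servers, kept sorted for binary search
            self.ring = sorted((h(s), s) for s in servers)

        def server_for(self, key):
            # the first server whose position follows the key's position,
            # wrapping around to the start of the ring if necessary
            positions = [p for p, _ in self.ring]
            i = bisect.bisect_right(positions, h(key))
            return self.ring[i % len(self.ring)][1]

        def add_server(self, server):
            # only the keys between the new server and its predecessor move
            bisect.insort(self.ring, (h(server), server))

    ring = ConsistentHashRing(["s0", "s1", "s2"])
    print(ring.server_for("x2"))      # some server, depending on the hash values
    ring.add_server("s3")
    print(ring.server_for("x2"))      # may or may not have moved to s3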

Figure 5: Consistent hashing, adding server s3 (only x2 has to move from s0 to s3)

Figure 6: Riak ring (from Little Riak Book): the 160-bit hash space, from 0 to 2^160 - 1, divided into 64 partitions

Instead of the hash function being used to allocate servers/nodes to partitions, Riak does the allocation
in a round-robin way (see Figure 7).
When data is written to a node, it is replicated to a number of other nodes. For 5 nodes, Riak’s default
number of replicas is 3.
In Figure 7, the 5 servers are denoted A, B, C, D and E. Partitions 1 to 5 are allocated to servers A to
E, respectively. Then partition 6 starts with A again.
If a record hashes to partition i, then it is written to partitions i, i + 1 and i + 2. In Figure 7, “favorite” hashes to partition 3, so is replicated on partitions 3, 4 and 5, allocated to servers/nodes C, D and E, respectively.

Figure 7: Riak ring replication (from Little Riak Book): partitions are allocated to nodes A to E in round-robin fashion, and each record is replicated to the next two partitions
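
The partition/replica placement just described can be sketched as follows (64 partitions, 5 nodes and 3 replicas, as in the example; partitions are numbered from 0 here, and which partition a given key actually lands on depends on the hash function used):

    import hashlib

    N_PARTITIONS = 64
    NODES = ["A", "B", "C", "D", "E"]
    N_REPLICAS = 3

    # round-robin allocation of partitions to nodes: 0->A, 1->B, ..., 5->A, ...
    partition_owner = {p: NODES[p % len(NODES)] for p in range(N_PARTITIONS)}

    def partition_of(key):
        # each partition covers a contiguous block of the 160-bit hash space
        position = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        block_size = 2 ** 160 // N_PARTITIONS
        return position // block_size

    def replica_nodes(key):
        # a record hashing to partition i is stored on partitions i, i+1, i+2
        p = partition_of(key)
        return [partition_owner[(p + j) % N_PARTITIONS] for j in range(N_REPLICAS)]

    print(partition_of("favorite"), replica_nodes("favorite"))
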
Later on, we will discuss the issue of consistency of replicas.
Reading
Read the following for more information:

1. Appendix A on NoSQL Overview, in Graph Databases, Ian Robinson, Jim Webber, and Emil Eifrem, O’Reilly Media, 2015 (available through the library).


2. Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service, Mostafa Elhemali et al., Proc. USENIX 2022.

3. Megastore: Providing Scalable, Highly Available Storage for Interactive Services, Jason Baker et al., Proc. CIDR 2011, pp. 223–234.
4. Spanner: Becoming a SQL System, David F. Bacon et al., Proc. SIGMOD 2017, pp. 331–343.
5. The first part (pp. 69–85) of Chapter 3 (on Storage and Retrieval) of Designing Data-Intensive
Applications. Martin Kleppmann. O’Reilly Media, Inc.
6. A Little Riak Book, Eric Redmond and John Daly, see https://github.com/basho-labs/little_riak_book/blob/master/rendered/riaklil-print-en.pdf
