Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

6 Documentdatabases

Download as pdf or txt
Download as pdf or txt
You are on page 1of 27

Large-Scale and Multi-Structured

Databases
Document Databases
Data Modelling and Partitioning
Prof. Pietro Ducange
Some considerations
A document database could theoretically implement a third
normal form schema.

Tables, as in relational databases, may be “simulated”


considering collections with JSON documents with an identical
pre-defined structure.

Image extracted from “Guy Harrison, Next Generation Databases, Apress, 2015”
JSON Databases: An example
Document databases usually
adopts a reduced number of
collections for modeling data.

Nested documents are used for


representing relationships among
the different entities.

Document databases do not


generally provide join operations.

Programmers like to have the


JSON structure map closely to the
object structure of their code!!!
Image extracted from “Guy Harrison, Next Generation Databases, Apress, 2015”
Data Modeling: Document Embedding

The solution above allows the user to retrieve a film and all its actors in a single
operation.

However, “actors” result to be duplicated across multiple documents.

In a complex design this could lead to issues and possibly inconsistencies if any of the
“actor” attributes need to be changed.

Moreover, some JSON databases have some limitations of the maximum dimension
of a single document.
Data Modeling: Document Linking

In the solution above, an array of actor IDs has been embedded into the
film document.

The IDs can be used to retrieve the documents of the actors (in on other
collection) who appear in a film.

We are rolling back to a relational model!!! Now, at least two collections of


document must be defined.
Data Modeling: Document Linking

We are rolling back to a the third normal form!!!!!


Data Modeling: Some Discussions
Document linking approaches are usually somewhat an
unnatural style for a document database.

However, for some workloads it may provide the best balance


between performance and maintainability.

When modeling data for document databases, there is no


equivalent of third normal form that defines a “correct” model.

In this context, the nature of the queries to be executed drives


the approach to model data.
Data Modeling: One to Many
Relationship
Let consider the customer entity which may have associated a list of address
entities.

The basic pattern is that the one entity in a one-to-many relation is the primary
document, and the many entities are represented as an array of embedded
documents.
Data Modelling:
Many to Many Relationships
Let consider an example of application in which:
• A student can be enrolled in many courses
• A course can have many students enrolled to it
We can model this situation considering the following two collections:

We have to take care when updating data in this kind of relationship. Indeed, the
DBMS will not control the referential integrity as in relational DBMSs.
Modeling Hierarchies (I)

Parent reference solution:


This solution is useful if we
have frequently to show a
specific instance of an object
and then show the more
general type of that category.
Modeling Hierarchies (II)

Child reference solution:


This solution is useful if we
have frequently to retrieve
the children (or sub parts)
of a specific instance of an
object.
Modeling Hierarchies (III)

List of ancestor solutions: This solution allows to retrieve the full


path of the ancestors with one read
operation.

A change to the hierarchy may require


many write operations, depending on
the level at which the change occurred.
Data Modeling: An Example (I)
Scenario: Trucks in a company fleet have to transmit location, fuel
consumption and other metrics every three minutes to a fleet management
data base (one-to-many relationship between a truck and the transmitted
details)

We may consider to generate a new document to add to the DB for each data
transmission.

Let consider a document as follows: Repeated


information in each
new document

At the end of the day, the DB will include 200 new documents for each truck (we
consider 20 transmissions per hour, 10 working hours)
Data Modeling: An Example (II)
An alternative solution may be to use embedded documents as follows:

Pay attention: we can have a potential performance problem!


Planning for Mutable Documents (I)
When a document is created, the DBMS allocates a certain amount of space for
the document.

If the document grows more than


the allocated space, the DBMS has
to relocate it to another location.

Moreover, the DBMS must free


the previously allocated space.

The previous steps can adversely


affect the system performance.
Planning for Mutable Documents (II)
A solution for avoiding to move oversized document is to allocate sufficient
space at the moment in which the document is created.

Regarding the previous problem, the following solution may be adopted:

In conclusion, we have to consider the life cycle of a document and planning, if


possible, the strategies for handling its growing.
Indexing Document Database
In order to avoid the entire scan of the overall database, DBMSs for document
databases (for example MongoDB) allow the definition of indexes.

Indexes, like in book indexes, are a structured set of information that maps
from one attribute to related information.

In general, indexes are special data structures that store a small portion of the
collection’s data set in an easy to traverse form.

The index stores the value of a specific field or set of fields, ordered by the
value of the field.

The ordering of the index entries supports efficient equality matches and
range-based query operations.
An Example of Index
Read-Heavy Applications
In the figure we show the classical scheme of a read-heavy application (business
intelligence and analytics applications):

In this kind of applications, the use of several indexes allows the user to quickly
access to the database. For example, indexes can be defined for easily retrieve
documents describing objects related to a specific geographic region or to a
specific type.
Write-Heavy Applications
The example of the truck information transmission (each three minutes) is a
typical write-heavy application.

The higher the number of indexes adopted the higher the amount of time
required for closing a write operation.

Indeed, all the indexes must be updated (and created at the beginning).

Reducing the number of indexes, allow us to obtain systems with fast write
operation responses. On the other hand, we have to accept to deal with slow
read operations.

In conclusion, the number and the type of indexes to adopt must be identified as
a trade-off solution.
Transactions Processing Systems
These systems are designed for fast write operation and targeted reads, as
shown in the figure below:

OLTP: online OLAP: On-Line


transaction Analytical Processing,
processing
Vertical Partitioning
Vertical partitioning is a technique for improving (relational) database
performance by separating columns of a relational table into multiple
separate table.
Sharding
• Sharding or horizontal partitioning is the process of dividing
data into blocks or chunks.
• Each block, labeled as shard, is deployed on a specific node
(server) of a cluster.
• Each node can contain only one shard.
• In case of data replication, a shard can be hosted by more than
one node.

Image extracted from: “Dan Sullivan, NoSQL For Mere Mortals, Addison-Wesley, 2015”
Advantages of Sharding

• Allows handling heavy loads and the increase of system users.

• Data may be easily distributed on a variable number of servers


that may be added or removed by request.

• Cheaper than vertical scaling (adding ram and disks, upgrading


CPUs to a single server).

• Combined with replications, ensures a high availability of the


system and fast responses.
Shard Keys
To implement sharding, document database designers have to
select a shard key and a partitioning method.

A shard key is one or more fields, that exist in all documents in a


collection, that is used to separate documents.

Examples of shard keys may be: Unique document ID, Name, Date,
such as creation date, Category or type, Geographical region.

Actually, any atomic field in a document may be chosen as a shard


key.
Partition Algorithms
There are three main categories of partition algorithms, based on:

• Range: for example, if all documents in a collection had a


creation date field, it could be used to partition documents into
monthly shards.

• Hashing: a hash function can be used to determine where to


place a document. Consistent hashing may be also used.

• List: for example, let imagine a product database with several


types (electronics, appliances, household goods, books, and
clothes). These product types could be used as a shard key to
allocate documents across five different servers.
Suggested Readings
Chapter 4 of the book “Guy Harrison, Next
Generation Databases, Apress, 2015”.

Chapters 7,8 of the book “Dan Sullivan, NoSQL For


Mere Mortals, Addison-Wesley, 2015”

You might also like