
NoSQL Database Systems

M.Tech. (IInd, Sem CE/CN)


Datasets
Collections or groups of related data are generally referred
to as datasets. Each group or dataset member (datum)
shares the same set of attributes or properties as others in
the same dataset. Some examples of datasets are:
• tweets stored in a flat file
• a collection of image files in a directory
• an extract of rows from a database table stored in a CSV
formatted file
• historical weather observations that are stored as XML files
Data Analysis
Data analysis is the process of examining data to find facts,
relationships, patterns, insights and/or trends. The overall goal of
data analysis is to support better decision making. A simple data
analysis example is:

1. The analysis of ice cream sales data in order to determine how the
number of ice cream cones sold is related to the daily
temperature. The results of such an analysis would support
decisions related to
– how much ice cream a store should order in relation to weather forecast
information.

Carrying out data analysis helps establish patterns and relationships
among the data being analyzed.
Data Analytics
• Data analytics is a broader term that encompasses data
analysis. Data analytics is a discipline that includes the
management of the complete data lifecycle, which
encompasses
– collecting, cleansing, organizing, storing, analyzing and governing data.

• The term includes the development of analysis methods,
scientific techniques and automated tools.
• In Big Data environments, data analytics has developed
methods that allow data analysis to occur through the use of
highly scalable distributed technologies and frameworks that
are capable of analyzing large volumes of data from different
sources.
Types of Data Analytics
There are four general categories of analytics
that are distinguished by the results they
produce:
• descriptive analytics
• diagnostic analytics
• predictive analytics
• prescriptive analytics
Descriptive Analytics
• Descriptive analytics are carried out to answer questions about
events that have already occurred. This form of analytics
contextualizes data to generate information.
Sample questions can include:
• What was the sales volume over the past 12 months?
• What is the number of support calls received as categorized by
severity and geographic location?
• What is the monthly commission earned by each sales agent?
It is estimated that 80% of generated analytics results are
descriptive in nature. In terms of value, descriptive analytics provide
the least worth and require a relatively basic skill set.
Descriptive Analytics
• Descriptive analytics are often carried out via ad-
hoc reporting or dashboards.
• The reports are generally static in nature and
display historical data that is presented in the form
of data grids or charts.
• Queries are executed on operational data stores
from within an enterprise, for example a Customer
Relationship Management system (CRM) or
Enterprise Resource Planning (ERP) system.
Diagnostic Analytics
Diagnostic analytics aim to determine the cause of a
phenomenon that occurred in the past using questions that
focus on the reason behind the event. The goal of this type of
analytics is to determine what information is related to the
phenomenon in order to enable answering questions that
seek to determine why something has occurred.
Such questions include:
• Why were Q2 sales less than Q1 sales?
• Why have there been more support calls originating from the
Eastern region than from the Western region?
• Why was there an increase in patient re-admission rates over
the past three months?
Diagnostic Analytics
• Diagnostic analytics provide more value than descriptive
analytics but require a more advanced skill set.
• Diagnostic analytics usually require collecting data from
multiple sources and storing it in a structure that lends
itself to performing drill-down and roll-up analysis.
• Diagnostic analytics results are viewed via interactive
visualization tools that enable users to identify trends
and patterns.
• The executed queries are more complex compared to
those of descriptive analytics and are performed on
multidimensional data held in analytic processing
systems.
Predictive Analytics
• Predictive analytics are carried out in an attempt to determine the
outcome of an event that might occur in the future. With
predictive analytics, information is enhanced with meaning to
generate knowledge that conveys how that information is related.
• The strength and magnitude of the associations form the basis of
models that are used to generate future predictions based upon
past events.
• It is important to understand that the models used for predictive
analytics have implicit dependencies on the conditions under
which the past events occurred. If these underlying conditions
change, then the models that make predictions need to be
updated.
Predictive Analytics
Questions are usually formulated using a what-if
rationale, such as the following:
• What are the chances that a customer will default
on a loan if they have missed a monthly payment?
• What will be the patient survival rate if Drug B is
administered instead of Drug A?
• If a customer has purchased Products A and B,
what are the chances that they will also purchase
Product C?
Prescriptive Analytics
• Prescriptive analytics build upon the results of
predictive analytics by prescribing actions that
should be taken.
• The focus is not only on which prescribed option is
best to follow, but why. In other words, prescriptive
analytics provide results that can be reasoned about
because they embed elements of situational
understanding.
• Thus, this kind of analytics can be used to gain an
advantage or mitigate a risk.
Prescriptive Analytics
• Sample questions may include:
• Among three drugs, which one provides the
best results?
• When is the best time to trade a particular
stock?
Types of Data
• Structured
• Unstructured
• Semi-structured
• Metadata
Structured Data
• Structured data conforms to a data model or schema
and is often stored in tabular form.
• It is used to capture relationships between different
entities and is therefore most often stored in a
relational database.
• Structured data is frequently generated by
– Enterprise applications and information systems like ERP
and CRM systems.
Examples of this type of data include banking
transactions, invoices, and customer records.
Unstructured Data
• Data that does not conform to a data model or data schema is known
as unstructured data.
• It is estimated that unstructured data makes up 80% of the data under
the umbrella of Big Data.
• Unstructured data has a faster growth rate than structured data.
• This form of data is either textual or binary and often conveyed via
files that are self-contained and non-relational.
– For example, a text file may contain the contents of various tweets or blog
postings.
– Binary files are often media files that contain image, audio or video data.
Technically, both text and binary files have a structure defined by the
file format itself, but this aspect is disregarded; the notion of being
unstructured relates to the format of the data contained within the
file itself.
Unstructured Data
• Video, image and audio files are all types of unstructured data.
• Special purpose logic is usually required to process and store
unstructured data. For example, to play a video file, it is
essential that the correct codec (coder-decoder) is available.
• Unstructured data cannot be directly processed or queried
using SQL.
• If it is required to be stored within a relational database, it is
stored in a table as a Binary Large Object (BLOB)
• Alternatively, a Not-only SQL (NoSQL) database is a non-
relational database that can be used to store unstructured
data alongside structured data.
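A minimal sketch of the BLOB approach, using Python's built-in sqlite3 module; the table layout and file name are illustrative, not from the original slides:

import sqlite3

# Store an arbitrary binary file (e.g. an image) as a BLOB in a relational table.
conn = sqlite3.connect("media.db")
conn.execute("CREATE TABLE IF NOT EXISTS media (id INTEGER PRIMARY KEY, name TEXT, content BLOB)")

with open("photo.jpg", "rb") as f:        # hypothetical unstructured file
    data = f.read()

# The database treats the content as an opaque byte array; it cannot be queried with SQL.
conn.execute("INSERT INTO media (name, content) VALUES (?, ?)", ("photo.jpg", sqlite3.Binary(data)))
conn.commit()
conn.close()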
Semi-structured data
• Semi-structured data has a defined level of structure and
consistency, but is not relational in nature.
• Instead, semi-structured data is hierarchical or graph-
based.
• This kind of data is commonly stored in files that contain
text.
– For example, XML and JSON files are common forms of semi-
structured data.
• Due to the textual nature of this data and its conformance
to some level of structure, it is more easily processed than
unstructured data.
Semi-structured data
• XML, JSON and sensor data are semi-structured.
• Examples of common sources of semi-structured data
include electronic data interchange (EDI) files,
spreadsheets, RSS feeds and sensor data.
• Semi-structured data often has special pre-processing
and storage requirements, especially if the underlying
format is not text-based.
• An example of pre-processing of semi-structured data
would be the validation of an XML file to ensure that it
conforms to its schema definition, as in the sketch below.
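A minimal sketch of such pre-processing, assuming the third-party lxml library; the schema and document file names are hypothetical:

from lxml import etree

# Validate an XML document against its XSD schema definition.
schema = etree.XMLSchema(etree.parse("order.xsd"))   # hypothetical schema file
doc = etree.parse("order.xml")                       # hypothetical XML document

if schema.validate(doc):
    print("order.xml conforms to its schema definition")
else:
    print(schema.error_log)                          # lists each violation found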
Metadata
• Metadata provides information about a dataset’s characteristics
and structure. This type of data is mostly machine-generated and
can be appended to data.

• The tracking of metadata is crucial to Big Data processing, storage
and analysis because it provides information about the pedigree of
the data and its provenance during processing. Examples of
metadata include:
– XML tags providing the author and creation date of a document
– attributes providing the file size and resolution of a digital photograph
• Big Data solutions rely on metadata, particularly when processing
semi-structured and unstructured data.
Introduction to NoSQL
• The relational database model has long been
considered capable of solving almost any problem
of data management and storage, making it the
default choice of enterprises. (Persistence,
Concurrency, Integration, Standard Model)
Challenges with RDBMSs
• Impedance Mismatch
– Difference between the relational model (relations
& tuples) and in-memory data structures
• Unstructured data storage (text, video,
tweets, mail etc.)
• The shift from integration databases to
application databases
• The attack of the clusters (the need to scale
out on clusters of commodity hardware)
Impedance mismatch
Intro: What NoSQL Means
• NoSQL was popularized to refer to databases that do not support the relational
model and do not use the structured query language of relational databases.
• The term NoSQL groups all those technologies, like graph, document and text
stores, that are by definition non-relational.
• A more precise definition of NoSQL corresponds to all those new-generation
databases that are non-relational, distributed, open source, schema-less and
horizontally scalable.
• Summarizing, NoSQL ("not only SQL") is a general category of database
management systems that differ from relational database management systems
in several ways: they do not have a schema, they do not allow joins, they do not
try to guarantee the ACID transaction properties, and they scale horizontally.
Diff. Between NoSQL & RDBMSs
• Data models: A NoSQL database lets you build an application without
having to define the schema first, unlike relational databases.
• Data structure: NoSQL databases are designed to handle unstructured
data, such as text, social media posts, video or email, which makes up
much of the data that exists today.
• Development model: NoSQL databases are open source, whereas
relational databases typically are closed source, with licensing fees baked
into the use of their software. With NoSQL, you can get started on a
project without any heavy investment in software fees.
Cassandra, Ubuntu One, CouchDB
• The NoSQL movement also got the attention of several companies
that have developed NoSQL technologies, like Facebook, which
developed the Cassandra system, later used by Twitter.
• In addition, LinkedIn developed Project Voldemort, and Canonical
built Ubuntu One, a synchronized cloud storage system, on CouchDB.
• Beyond web applications, NoSQL databases also support diverse
activities, including predictive analysis and non-critical transactional
systems, those that do not require the ACID properties.
Benefits of NoSQL
• As data is stored schema-less, read and write speeds are much higher in
NoSQL databases than in RDBMSs, which yields high performance.
– For example, Hypertable's technology is capable of storing one trillion data
items per day.
– Another example is Google's BigTable software, which can process 20 petabytes
of data in a day.
• Another benefit is that if a database experiences growth, it is possible
to add nodes to a distributed system in such a way that it provides
processing and storage economically, also known as horizontal scalability.
• NoSQL databases have the necessary mechanisms for horizontal
scalability. In terms of storage, the NoSQL storage model is much
simpler than the table schema of the relational model.
More Benefits ...
• Furthermore, NoSQL storage is more flexible than the rigid schema of tables. In reality, NoSQL
databases lack a fixed schema. When the data to be stored cannot be translated into tables of
the relational model, it is convenient to look for a NoSQL solution.
• NoSQL databases are a good option for those information systems that do not require the
ACID properties. Without strict transaction management, concurrency control is not done by
locking; other mechanisms are used instead.
• Furthermore, the principal relational database management systems have absorbed
alternative proposals such as XML document warehouses and object-oriented databases.
• Far from replacing the relational database management systems, NoSQL databases
complement and extend their functional characteristics.
• Since durability is not a critical element in some web applications, it is not necessary to write
logs for each transaction. Examples of data not sensitive to durability include copies of user
session data and text messages in forums or social networks.
Challenges with NoSQL
• NoSQL has no standardized query language. In the absence of a
standard query language for NoSQL databases, it is necessary to learn
the query language of each database management system. This is true
even within each NoSQL storage category.
• NoSQL has problems with transactions. As a result of not following
the ACID properties, there is weaker control over the consistency,
durability and isolation of transactions.
• NoSQL has integrity problems. Ensuring the integrity of the data
requires extra manual programming and is not a function of the
DBMS. Integrity here refers to constraints: domain, referential,
null-value, and so on.
RDBMS or NoSQL
The decision on what type of DBMS should be used depends on
a set of factors including but not limited to:
• The volume of data to store. 
• The estimated concurrence. 
• The number of operations that are done on the database per
unit of time. 
• The desired scalability of the database. 
• The degree of integrity and consistency that is desired. 
• The nature of the data to be stored. 
• The most frequent types of operations that you want to do with
the data.
CAP and BASE
There are two main characteristics regarding NoSQL databases: the CAP theorem
and the BASE properties. On the one hand, the CAP theorem has been adopted by
several companies on the web and by the NoSQL community. The acronym CAP
refers to the following:
• C as consistency. It refers to whether a system is in a consistent state after the
execution of an operation or not. A distributed system is considered consistent if,
after an update operation by a node, the rest of the nodes see the update in the
shared data resource.
• A as availability. It means that a system is designed and implemented in such a
way that it can continue its operation if there are software or hardware problems
or a node fails.
• P as partition tolerance. It is the ability of a system to continue its operation in the
presence of network partitions, which happen when a set of nodes in a network
loses connectivity with other nodes. Partition tolerance can also be considered the
ability of a system to dynamically add and remove nodes.
BASE
So basically the CAP theorem says that you can have at most two of
the three characteristics in a shared data system.
• On the other hand, a NoSQL database follows the paradigm of the BASE
properties (a basically available system) instead of ACID:
• Basically available.
• Soft state
(the information will expire unless it is refreshed).
• Eventual consistency.

The BASE properties can be summarized as follows:
an application works basically all the time, does not have to be
consistent all the time, but will eventually reach a known state.
Aggregate Data Models
• A data model is the mechanism through which
we perceive and manipulate our data.
• An aggregate data model is a collection of related
objects which can be treated together as a unit.
• These data models require aggregates to be
updated in atomic operations and to be
communicated to storage as aggregates.
• Key-Value/Document/Column family databases
have aggregate orientation
RDBMS-Oriented Data Model
Data Representation in RDBMS
Aggregate Data Model
JSON representation
// in customers
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [ { "city": "Chicago" } ]
}
// in orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [ {
    "productId": 27,
    "price": 32.45,
    "productName": "NoSQL Distilled"
  } ],
  "shippingAddress": [ { "city": "Chicago" } ],
  "orderPayment": [ {
    "ccinfo": "1000-1000-1000-1000",
    "txnId": "abelif879rft",
    "billingAddress": { "city": "Chicago" }
  } ]
}
Customer aggregate containing orders and
their aggregates
JSON representation
// in customers
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [ { "city": "Chicago" } ],
    "orders": [ {
      "id": 99,
      "customerId": 1,
      "orderItems": [ {
        "productId": 27,
        "price": 32.45,
        "productName": "NoSQL Distilled"
      } ],
      "shippingAddress": [ { "city": "Chicago" } ],
      "orderPayment": [ {
        "ccinfo": "1000-1000-1000-1000",
        "txnId": "abelif879rft",
        "billingAddress": { "city": "Chicago" }
      } ]
    } ]
  }
}
Key-Value database
It is a database system that stores values indexed by keys, and it can
store structured and unstructured data. (The possibility of storing any
type of value is called schema-less.)
– The data values are stored as arrays of bytes.
– The content is not important for the database.
– These databases offer high performance, high scalability, great flexibility
and low complexity.
• The disadvantage of key-value databases is that they do not support
complex queries, because they only look up a key, as the sketch below
illustrates.
• Examples of key-value databases are Amazon DynamoDB, Riak,
Voldemort, RAMCloud, and Flare.
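A minimal sketch of key-value access, assuming a local Redis server and the redis-py client; the key name and stored value are illustrative:

import redis

r = redis.Redis(host="localhost", port=6379)

# The value is an opaque byte array indexed by the key; the database does not
# interpret its content, so only lookups by key are possible.
r.set("customer:1", '{"name": "Martin", "city": "Chicago"}')
print(r.get("customer:1"))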
Key-Value Database
Key-Value Database
Domain Specific Key
Document Databases
• Document databases are commonly key-value stores where the
value is stored as a binary field in a format that the DBMS can
understand.
• This format is often a JSON document (JavaScript Object
Notation) or BSON (Binary JSON), but it can be XML or any other.
– These databases allow very advanced queries on the data.
– They allow relations between data.
– They do not allow join operations, due to performance issues.
• Examples of document-oriented DBMSs include CouchDB,
MongoDB, CloudKit, and XML databases such as DB2 pureXML.
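A minimal sketch using the pymongo client against a local MongoDB instance; database, collection and field names are illustrative:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
people = client["demo"]["people"]

people.insert_one({"firstname": "Martin",
                   "likes": ["Biking", "Photography"],
                   "lastcity": "Boston"})

# Unlike a pure key-value store, fields inside the document can be queried directly.
doc = people.find_one({"lastcity": "Boston"})
print(doc["firstname"])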
Document Databases
{
  "firstname": "Martin",
  "likes": [ "Biking", "Photography" ],
  "lastcity": "Boston",
  "lastVisited": null
}

The above document could be represented as a row in an RDBMS.

{
  "firstname": "Pramod",
  "citiesvisited": [ "Chicago", "London", "Pune", "Bangalore" ],
  "addresses": [
    { "state": "AK", "city": "DILLINGHAM", "type": "R" },
    { "state": "MH", "city": "PUNE", "type": "R" }
  ],
  "lastcity": "Chicago"
}
Column Family Database
• Regarding column-oriented databases, as the
name implies, they store the data in columns
instead of rows.
• That is, all the values of a single column are
stored together so that they can be accessed
as a unit.
• Examples of column-oriented DBMSs include
Sybase IQ, Cassandra, Hypertable, and
HBase.
Column Family Database
Column Family Databases
• Column-oriented databases are not good for queries that
require returning the whole record of an entity. In that
case, a row store is better.
• Column-oriented storage is especially efficient when reads
are massive and writes touch only a few columns. This is
because a query obtains only the data of the columns of
interest, not all the columns of a record, which increases
efficiency.
• Writes are the problem when you want to write all the
columns of a record, since the columns work as individual
units and are not necessarily contiguous, as they are in
row storage.
Cassandra Datamodel with Column Families
Cassandra Model
Example of a column
{
  name: "fullName",
  value: "Martin Fowler",
  timestamp: 12345667890
}
• The column has a name (key) of fullName,
a value of Martin Fowler, and a
timestamp attached to it.
• A row is a collection of columns
attached or linked to a key;
• a collection of similar rows makes a
column family.
• When the columns in a column family
are simple columns, the column family
is known as standard column family.
Standard Column Family
//column family
{
  //row
  "pramod-sadalage" : {
    firstName: "Pramod",
    lastName: "Sadalage",
    lastVisit: "2012/12/12"
  },
  //row
  "martin-fowler" : {
    firstName: "Martin",
    lastName: "Fowler",
    location: "Boston"
  }
}

• Each column family can be compared to a container of rows in an RDBMS table, where the key identifies
the row and the row consists of multiple columns.
• The difference is that various rows do not have to have the same columns, and columns can be added to
any row at any time without having to add them to other rows.
• We have the pramod-sadalage row and the martin-fowler row with different columns; both rows are part
of the same column family.
Super Column
• When a column consists of a map of columns,
then we have a super column. A super column
consists of a name and a value which is a map
of columns. Think of a super column as a
container of columns.
Example of Super Column
{
name: "book:978-0767905923",
value: {
author: "Mitch Albom",
title: "Tuesdays with Morrie",
isbn: "978-0767905923"
}
}
Super Column Family (When we use super columns to create a column family,
we get a super column family.)
//super column family
{
  //row
  name: "billing:martin-fowler",
  value: {
    address: {
      name: "address:default",
      value: {
        fullName: "Martin Fowler",
        street: "100 N. Main Street",
        zip: "20145"
      }
    },
    billing: {
      name: "billing:default",
      value: {
        creditcard: "8888-8888-8888-8888",
        expDate: "12/2016"
      }
    }
  }
  //row
  name: "billing:pramod-sadalage",
  value: {
    address: {
      name: "address:default",
      value: {
        fullName: "Pramod Sadalage",
        street: "100 E. State Parkway",
        zip: "54130"
      }
    },
    billing: {
      name: "billing:default",
      value: {
        creditcard: "9999-8888-7777-4444",
        expDate: "01/2016"
      }
    }
  }
}
Super Column Family
• Super column families are good to keep related data
together, but when some of the columns are not
needed most of the time, the columns are still fetched
and deserialized by Cassandra, which may not be
optimal.
• Cassandra puts the standard and super column families
into keyspaces. A keyspace is similar to a database in
RDBMS where all column families related to the
application are stored. Keyspaces have to be created so
that column families can be assigned to them:
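A minimal sketch of keyspace creation, using the DataStax Python driver; the keyspace name and replication settings are illustrative:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Create the keyspace that will hold the application's column families.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS ecommerce
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("USE ecommerce")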
Graph Databases
• In graph-oriented databases, a graph is represented as a set of nodes, or entities, interconnected
by edges, or relationships.
• Graphs give importance not only to the data but to the relations between them too.
• Relationships can also have attributes, and direct queries can be made against relationships,
rather than against the nodes.
• Being stored in this way, it is much more efficient to navigate between relationships than in a
relational model, as the sketch below illustrates.
• Obviously, these types of databases are only useful when the information can be easily
represented as a network.
• Among the most used implementations are Neo4j, HyperGraphDB, and InfoGrid.
• Graph-oriented databases offer variable performance and scalability; they are highly
flexible but highly complex to implement.
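A minimal sketch of a graph query, assuming a local Neo4j instance and its official Python driver; node labels, the relationship type and credentials are illustrative:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Two nodes connected by a relationship that carries its own attribute.
    session.run("CREATE (:Person {name: 'Martin'})-[:FRIEND {since: 2010}]->"
                "(:Person {name: 'Pramod'})")
    # The query starts from the relationship, not from the nodes.
    result = session.run("MATCH (:Person)-[f:FRIEND]->(p:Person) "
                         "WHERE f.since < 2015 RETURN p.name")
    for record in result:
        print(record["p.name"])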
Graph Database
When to use - which
Schema-less Databases
• An RDBMS restricts the kind of data that can
be stored, based on the data types of its
columns.
• NoSQL offers freedom: any document or value
can have its own fields.
• The challenge moves to the applications.
• The implicit schema moves into application
code, as the sketch below shows.
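A minimal sketch of the implicit schema living in application code; the documents and field names are illustrative:

# Two records in the same collection with different fields; the application,
# not the database, must tolerate both shapes.
docs = [
    {"firstname": "Martin", "lastcity": "Boston"},
    {"firstname": "Pramod", "citiesvisited": ["Chicago", "Pune"]},
]

for d in docs:
    # The application decides how to handle a field that a record lacks.
    print(d["firstname"], "last seen in", d.get("lastcity", "unknown"))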
Materialized Views
• Aggregates – Advantage/Disadvantage
• Views in RDBMS
• Materialised View
– Eager View
– Computational View
– Summary View
Modelling for Data Access
• Key-Value store
Document store with references
• Reference
Document Database with
reference
# Customer object
{
  "customerId": 1,
  "customer": {
    "name": "Martin",
    "billingAddress": [ { "city": "Chicago" } ],
    "payment": [ { "type": "debit", "ccinfo": "1000-1000-1000-1000" } ],
    "orders": [ { "orderId": 99 } ]
  }
}

# Order object
{
  "customerId": 1,
  "orderId": 99,
  "order": {
    "orderDate": "Nov-20-2011",
    "orderItems": [ { "productId": 27, "price": 32.45 } ],
    "orderPayment": [ { "ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft" } ],
    "shippingAddress": { "city": "Chicago" }
  }
}
Business Analytics with Aggregates
• Materialised view (which orders have a given product in
them)
{ "itemid": 27,
  "orders": [99, 545, 897, 678]
}
{ "itemid": 29,
  "orders": [199, 545, 704, 819]
}
Removing References from Document
Storage
# Customer object
{
  "customerId": 1,
  "name": "Martin",
  "billingAddress": [ { "city": "Chicago" } ],
  "payment": [
    { "type": "debit", "ccinfo": "1000-1000-1000-1000" }
  ] //No need to update the customer record with an order id.
}

# Order object
{
  "orderId": 99,
  "customerId": 1,
  "orderDate": "Nov-20-2011",
  "orderItems": [ { "productId": 27, "price": 32.45 } ],
  "orderPayment": [ { "ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft" } ],
  "shippingAddress": { "city": "Chicago" }
}
Column Family Database
Graph Database
Distribution Models
Background
• NoSQL primarily for running on Clusters
• Can scale better
• Mainly two paths for data distribution
– Sharding
– Replication
• Master-Slave
• Peer-to-Peer
Sharding
• Horizontal partitioning of large datasets into a collection of
smaller, more manageable datasets called shards.
• Shards are distributed across multiple nodes (machines); they share
the same schema and collectively represent the complete dataset.
• Scalability is achieved by distributing the processing load across
multiple nodes, and can easily be enhanced by adding more
resources to the existing infrastructure.
• As each node is responsible only for its own data, reads and
writes are greatly improved.
Sharding
Sharding
• Each shard can independently service reads and writes for
the specific subset of data that it is responsible for.
• Shards provide some resilience against failures, as only the
data of the failed node will be inaccessible.
• Query patterns need to be taken into account.
• Depending on the query, data may need to be fetched
from multiple shards.
• A query requiring data from multiple shards may hurt
performance.
• Data locality keeps commonly accessed data co-located on a
single shard to improve performance. A minimal routing
sketch is shown below.
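A minimal sketch of hash-based shard routing; the shard names and key format are illustrative:

import hashlib

SHARDS = ["node_a", "node_b", "node_c"]

def shard_for(key: str) -> str:
    # Hash the key so the same key always routes to the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer:1"))
# Note: adding a node remaps many keys under plain modulo routing;
# real systems often use consistent hashing to limit that movement.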
Sharding
Replication
• Replication stores multiple copies of a dataset, known
as replicas, on multiple nodes.
• Scalability and availability are the result.
• Fault tolerance

There are two types of replication strategies:
• Master-Slave
• Peer-to-Peer
Replication
Master-Slave
• During master-slave replication, nodes are arranged in a master-slave
configuration, and all data is written to a master node. Once saved, the
data is replicated over to multiple slave nodes.
• All external write requests, including insert, update and delete, occur
on the master node, whereas read requests can be fulfilled by any
slave node.
• Master-slave replication is ideal for read intensive loads rather than
write intensive loads since growing read demands can be managed by
horizontal scaling to add more slave nodes.
• Writes are consistent, as all writes are coordinated by the master node.
• The implication is that write performance will suffer as the amount of
writes increases. If the master node fails, reads are still possible via any
of the slave nodes.
Master-Slave - Writes are managed by the master node and data can be read
from either Slave A or Slave B.
Master-Slave
• A slave node can be configured as a backup node for the master node. In the
event that the master node fails, writes are not supported until a master node is
re-established. The master node is either resurrected from a backup of the
master node, or a new master node is chosen from the slave nodes.
• One concern with master-slave replication is read inconsistency, which can be an
issue if a slave node is read prior to an update to the master being copied to it.

A scenario where read inconsistency occurs:
1. User A updates data.
2. The data is copied over to Slave A by the Master.
3. Before the data is copied over to Slave B, User B tries to read
the data from Slave B, which results in an inconsistent read.
4. The data will eventually become consistent when Slave B is
updated by the Master.
Read Inconsistency
Read Inconsistency
• To ensure read consistency, a voting system
can be implemented where a read is declared
consistent if the majority of the slaves contain
the same version of the record.
Implementing such a voting system requires a reliable and
fast communication mechanism between the slaves. A
minimal sketch of majority voting is shown below.
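A minimal sketch of such majority voting; the replica versions are illustrative:

from collections import Counter

def quorum_read(replica_values):
    # Declare the read consistent only if a majority of slaves agree
    # on the same version of the record.
    value, count = Counter(replica_values).most_common(1)[0]
    if count > len(replica_values) // 2:
        return value
    raise RuntimeError("no majority; read is inconsistent, retry or repair")

print(quorum_read(["v2", "v2", "v1"]))   # majority holds v2 -> consistent read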
Peer-to-Peer
• With peer-to-peer replication, all nodes
operate at the same level. In other words,
there is not a master-slave relationship
between the nodes. Each node, known as a
peer, is equally capable of handling reads and
writes. Each write is copied to all peers.
Peer-to-Peer
Write-Inconsistency
• Peer-to-peer replication is prone to write
inconsistencies that occur as a result of a
simultaneous update of the same data across multiple
peers. This can be addressed by implementing either a
pessimistic or optimistic concurrency strategy.

• Pessimistic concurrency is a proactive strategy that


prevents inconsistency. It uses locking to ensure that
only one update to a record can occur at a time.
Write-Inconsistency
However, this is detrimental to availability, since the
database record being updated remains unavailable until
all locks are released.
• Optimistic concurrency is a reactive strategy that does not
use locking. Instead, it allows inconsistency to occur with the
knowledge that consistency will eventually be achieved
after all updates have propagated. With optimistic
concurrency, peers may remain inconsistent for some
period of time before attaining consistency. However, the
database remains available as no locking is involved. A
minimal sketch of optimistic concurrency is given below.
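A minimal sketch of optimistic concurrency via a version check (compare-and-set); the record layout is illustrative:

# Each record carries a version; a write succeeds only if the version
# has not changed since the writer read the record.
record = {"value": 100, "version": 7}

def optimistic_update(rec, expected_version, new_value):
    if rec["version"] != expected_version:
        return False            # another peer wrote first; caller must retry
    rec["value"] = new_value
    rec["version"] += 1
    return True

print(optimistic_update(record, expected_version=7, new_value=150))  # True
print(optimistic_update(record, expected_version=7, new_value=200))  # False: stale read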
Read-Inconsistency
• Like master-slave replication, reads can be
inconsistent during the time period when
some of the peers have completed their
updates while others perform their updates.
• However, reads eventually become consistent
when the updates have been executed on all
peers.
Read-Inconsistency
• A scenario where an inconsistent read occurs.
1. User A updates data.
2. a. The data is copied over to Peer A.
   b. The data is copied over to Peer B.
3. Before the data is copied over to Peer C, User B
tries to read the data from Peer C, resulting in an
inconsistent read.
4. The data will eventually be updated on Peer C, and
the database will once again become consistent.
Read-Inconsistency
Sharding and Replication
• To improve on the limited fault tolerance
offered by sharding, while additionally
benefiting from the increased availability and
scalability of replication, both sharding and
replication can be combined
– Sharding and Master-Slave replication
– Sharding and Peer-to-Peer replication
Sharding and Replication
Sharding and Master-Slave
• When sharding is combined with master-slave
replication, multiple shards become slaves of a single
master, and the master itself is a shard.
• This results in multiple masters, but a single slave-
shard can only be managed by a single master-shard.
• Write consistency is maintained by the master-shard.
However, if the master-shard becomes non-
operational or a network outage occurs, fault
tolerance with regards to write operations is
impacted.
• Replicas of shards are kept on multiple slave nodes to
provide scalability and fault tolerance for read
operations.
Sharding and Master-Slave
• Each node acts both as a master and a slave
for different shards.
• Writes (id = 2) to Shard A are regulated by
Node A, as it is the master for Shard A.
• Node A replicates data (id = 2) to Node B,
which is a slave for Shard A.
• Reads (id = 4) can be served directly by either
Node B or Node C as they each contain Shard
B.
Sharding & Master-Slave
Sharding and Peer-to-Peer
• When combining sharding with peer-to-peer
replication, each shard is replicated to multiple
peers, and each peer is only responsible for a
subset of the overall dataset.
• Collectively, this helps achieve increased
scalability and fault tolerance. As there is no
master involved, there is no single point of
failure and fault-tolerance for both read and
write operations is supported.
Sharding and Peer-to-Peer
• Each node contains replicas of two different
shards.
• Writes (id = 3) are replicated to both Node A
and Node C (Peers) as they are responsible for
Shard C.
• Reads (id = 6) can be served by either Node B
or Node C as they each contain Shard B.
Peer-to-Peer
CAP Theorem
• The Consistency, Availability, and Partition
tolerance (CAP) theorem, also known as
Brewer’s theorem, expresses a triple
constraint related to distributed database
systems.
• It states that a distributed database system,
running on a cluster, can only provide two of
the following three properties:
CAP Theorem
• Consistency – A read from any node results in
the same data across multiple nodes.

• E.g. in the following figure, all three users get
the same value for the amount column, even
though three different nodes are serving the
record.
Consistency
Availability and Partitioning
• Availability – A read/write request to a non-
failing node will always be acknowledged in
the form of a success or a failure. In other
words, a request to a failed node does not
imply a lack of availability.
• Partition tolerance – The database system can
tolerate communication outages that split the
cluster into multiple silos and can still service
read/write requests
Availability and Partitioning
Availability and Partitioning
• In the event of a communication failure,
requests from both users are still serviced (1,
2).
• However, with User B, the update fails as the
record with id = 3 has not been copied over to
Peer C because of network failure. The user is
duly notified (3) that the update has failed.
CAP
• Venn Diagram
CAP Theorem
• In a distributed database, scalability and fault tolerance
can be improved through additional nodes, although this
challenges consistency (C). The addition of nodes can
also cause availability (A) to suffer due to the latency
caused by increased communication between nodes.
• Although communication outages are rare and
temporary, partition tolerance (P) must always be
supported by a distributed database. So a cluster has to
be partition tolerant, leaving a choice between C and A.
• Therefore, CAP is generally a choice between choosing
either C+P or A+P. The requirements of the system will
dictate which is chosen.
ACID property
• The transaction management property of RDBMSs.
• A pessimistic style of maintaining consistency,
achieved by locking records.
• Stands for
– Atomicity
– Consistency
– Isolation
– Durability
Atomicity
• Atomicity ensures that all operations will always succeed or
fail completely. In other words, there are no partial
transactions.
• The following steps are illustrated in the figure:
1. A user attempts to update three records as part of a
transaction.
2. Two records are successfully updated before the
occurrence of an error.
3. As a result, the database rolls back any partial effects of
the transaction and puts the system back in its prior state.
Atomicity
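A minimal sketch of atomic rollback, using Python's built-in sqlite3 module; the table contents and the simulated failure are illustrative:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)])
conn.commit()

try:
    conn.execute("UPDATE accounts SET amount = amount + 5 WHERE id = 1")
    conn.execute("UPDATE accounts SET amount = amount + 5 WHERE id = 2")
    conn.execute("INSERT INTO accounts VALUES (1, 0)")   # fails: duplicate primary key
except sqlite3.IntegrityError:
    conn.rollback()   # the two partial updates are undone

# The table is back in its prior state: no partial transaction survives.
print(conn.execute("SELECT id, amount FROM accounts").fetchall())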
Consistency
• Consistency ensures that the database will always remain in a
consistent state by ensuring that only data that conforms to the
constraints of the database schema can be written to the
database. Thus a database that is in a consistent state will
remain in a consistent state following a successful transaction.

1. A user attempts to update the amount column of the table,
which is of type float, with a varchar value.
2. The database applies its validation check and rejects this
update because the value violates the constraint checks for
the amount column.
Consistency
Isolation
• Isolation ensures that the results of a transaction are not
visible to other operations until it is complete.

1. User A attempts to update two records as part of a
transaction.
2. The database successfully updates the first record.
3. However, before it can update the second record, User B
attempts to update the same record. The database does not
permit User B’s update until User A’s update succeeds or
fails in full. This occurs because the record with id = 3 is locked
by the database until the transaction is complete.
Isolation
Durability
• Durability ensures that the results of an operation are
permanent. In other words, once a transaction has been
committed, it cannot be rolled back. This is irrespective of any
system failure.
1. A user updates a record as part of a transaction.
2. The database successfully updates the record.
3. Right after this update, a power failure occurs. The database
maintains its state while there is no power.
4. The power is resumed.
5. The database serves the record as per last update when
requested by the user.
Durability
ACID principle
• ACID principle results in consistent database
behaviour
1. User A attempts to update a record as part of a
transaction.
2. The database validates the value and the update is
successfully applied.
3. After the successful completion of the transaction,
when Users B and C request the same record, the
database provides the updated value to both the
users.
ACID principle
BASE
• BASE is a database design principle based on the CAP theorem
and leveraged by database systems that use distributed
technology. BASE stands for:
• basically available
• soft state
• eventual consistency
• When a database supports BASE, it favors availability over
consistency. In other words, the database is A+P from a CAP
perspective.
• In essence, BASE leverages optimistic concurrency by relaxing the
strong consistency constraints mandated by the ACID properties.
BASE – basically available
• If a database is “basically available,” that
database will always acknowledge a client’s
request, either in the form of the requested
data or a success/failure notification.
• The database is basically available, even
though it has been partitioned as a result of a
network failure.
User A and User B receive data despite the database being partitioned by a
network failure.
BASE – soft state
• Soft state means that a database may be in an inconsistent
state when data is read; thus, the results may change if the
same data is requested again.
• This is because the data could be updated for consistency, even
though no user has written to the database between the two
reads. This property is closely related to eventual consistency.
1. User A updates a record on Peer A.
2. Before the other peers are updated, User B requests the same
record from Peer C.
3. The database is now in a soft state, and stale data is returned
to User B.
An example of the soft state property of BASE is
shown here.
BASE – Eventual consistency
• Eventual consistency is the state in which reads by different clients,
immediately following a write to the database, may not return
consistent results. The database only attains consistency once the
changes have been propagated to all nodes. While the database is in
the process of attaining the state of eventual consistency, it will be in a
soft state.

1. User A updates a record.
2. The record only gets updated at Peer A, but before the other peers can
be updated, User B requests the same record.
3. The database is now in a soft state. Stale data is returned to User B
from Peer C.
4. However, the consistency is eventually attained, and User C gets the
correct value.
An example of the eventual consistency property of BASE.
BASE
• BASE emphasizes availability over immediate
consistency, in contrast to ACID, which ensures
immediate consistency at the expense of availability
due to record locking.
• This soft approach toward consistency allows BASE-
compliant databases to serve multiple clients without
any latency, albeit possibly serving inconsistent results.
• However, BASE-compliant databases are not useful for
transactional systems where lack of consistency is a
concern.
